Powering a Graph Data System with Scylla + JanusGraph
Ryan Stauffer, Founder & CEO
Presenter
Ryan Stauffer, Founder & CEO
Ryan founded Enharmonic to change the way we interact with data. He has experience building modern data solutions for fast-moving companies, both as a consultant and as the leader of Data Strategy and Analytics at Private Equity-backed Driven Brands. He received his MBA from Washington University in St. Louis, and has additional experience in Investment Banking and as a U.S. Army Infantry Officer. In his free time, he makes music and tries to set PRs running up Potrero Hill.
Powering a Graph Data System with Scylla + JanusGraph
Graph Data System?
What?
Graph Data System
We can break down the concept of a “Graph Data System” into 2 pieces:
■ Graph - we’re modelling our data as a property graph
● Vertices model logical entities (Customer, Product, Order)
● Edges model logical relationships between entities (PURCHASED, IN_ORDER)
● Properties model attributes of entities/relationships (name, purchaseDate)
■ Data System - we use several components in a single system to store
and retrieve our data
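The property graph model above can be sketched in a few lines of plain Python. This is only an illustration of the vocabulary (vertices, edges, properties), not how JanusGraph stores data; the class names, ids, and sample values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    label: str                      # logical entity type, e.g. "Customer" or "Product"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str                      # logical relationship, e.g. "PURCHASED"
    out_v: int                      # id of the vertex the edge leaves
    in_v: int                       # id of the vertex the edge points to
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: one customer who purchased one product.
customer = Vertex(1, "Customer", {"name": "Ada"})
product = Vertex(2, "Product", {"name": "Widget"})
purchased = Edge("PURCHASED", customer.id, product.id, {"purchaseDate": "2019-11-05"})

vertices = {v.id: v for v in (customer, product)}
edges = [purchased]


def purchased_products(customer_id):
    """Follow outgoing PURCHASED edges from one customer and return product names."""
    return [
        vertices[e.in_v].properties["name"]
        for e in edges
        if e.label == "PURCHASED" and e.out_v == customer_id
    ]
```

Note that properties live on both vertices (name) and edges (purchaseDate), which is what distinguishes a property graph from a plain adjacency list.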
JanusGraph & Scylla Overview
Why?
3 Core Benefits
■ Flexibility
■ Schema support
■ OLTP & OLAP support (Distinct from Scylla Workload Prioritization)
Flexibility
The “killer feature” of a graph data model is flexibility
■ Changing database schemas to support new business logic and data
sources is tough!
■ The nature of a graph’s data model makes it easier to evolve the data
model over time
■ Iterate on our model to match our understanding as we learn,
without having to start from scratch
■ In practice
● Incorporate fresh data sources without breaking existing workloads
● Write query results directly to the graph as new vertices & edges
● Share production-quality data between teams
Schema Support
By supporting a defined schema, our data system can enforce business
logic, and minimize duplicative application code
■ Flexible schema support out-of-the-box
■ We can pre-define the properties and datatypes that are possible for
a given vertex or edge, without requiring that each vertex/edge
contain every property
■ We can pre-define which edge types are allowed to connect a pair of
vertices, without requiring every pair of vertices to have this edge
■ Simplifies testing on new use cases
■ Separates data integrity maintenance from business logic
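To make the "possible, but not required" idea concrete, here is a minimal plain-Python sketch of that style of constraint. It does not use JanusGraph; the dictionaries and function names are hypothetical and only mimic the behaviour described above.

```python
# Hypothetical schema: which property keys (and datatypes) each vertex label MAY carry,
# and which (out label, edge label, in label) connections are allowed.
ALLOWED_PROPERTIES = {
    "Customer": {"name": str, "age": int},
    "Product": {"name": str, "productId": int},
}
ALLOWED_CONNECTIONS = {("Customer", "PURCHASED", "Product")}


def validate_vertex(label, properties):
    """Reject unknown labels, unknown property keys, and wrong datatypes.

    Missing properties are accepted: the schema describes what is possible,
    not what every vertex must contain.
    """
    if label not in ALLOWED_PROPERTIES:
        raise ValueError(f"unknown vertex label: {label}")
    allowed = ALLOWED_PROPERTIES[label]
    for key, value in properties.items():
        if key not in allowed:
            raise ValueError(f"property {key!r} is not allowed on {label}")
        if not isinstance(value, allowed[key]):
            raise ValueError(f"property {key!r} must be of type {allowed[key].__name__}")
    return True


def validate_edge(out_label, edge_label, in_label):
    """Accept an edge only if its label may connect this pair of vertex labels."""
    if (out_label, edge_label, in_label) not in ALLOWED_CONNECTIONS:
        raise ValueError(f"{edge_label} may not connect {out_label} -> {in_label}")
    return True
```

A Product with only a name passes validation, while a Product carrying an age property is rejected. Keeping these checks in the data layer is what lets application code stay free of duplicated integrity logic.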
OLTP + OLAP
■ Transactional (graph-local) workloads
● Begin with a small number of vertices (found with the help of an index)
● Traverse across a reasonably small number of edges and vertices
● Goal is to minimize latency
● With Scylla, we can achieve scalable, single-digit millisecond response
■ Analytical (graph-global) workloads
● Travel to all (or a substantial portion) of the vertices and edges
● Includes many classic graph algorithms
● Goal is to maximize throughput (might leverage Spark)
■ The same traversal language (Gremlin) can be used to write both
types of workloads
■ At the graph level -> distinct from Scylla workload prioritization
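The difference between the two workload shapes can be illustrated on a toy in-memory edge list. This is a plain-Python sketch with made-up data, not Gremlin: the dictionary stands in for an index, and the full scan stands in for the kind of job that might be handed to Spark.

```python
from collections import Counter

# Toy edge list: (out vertex, edge label, in vertex). All names are hypothetical.
EDGES = [
    ("alice", "PURCHASED", "widget"),
    ("alice", "PURCHASED", "gadget"),
    ("bob", "PURCHASED", "widget"),
    ("carol", "PURCHASED", "widget"),
]

# Stand-in for an index: find a start vertex's edges without scanning the whole graph.
OUT_INDEX = {}
for out_v, label, in_v in EDGES:
    OUT_INDEX.setdefault(out_v, []).append((label, in_v))


def products_purchased_by(customer):
    """Graph-local (transactional): start at one indexed vertex, follow a few edges."""
    return sorted(
        in_v for label, in_v in OUT_INDEX.get(customer, []) if label == "PURCHASED"
    )


def purchase_counts():
    """Graph-global (analytical): visit every edge to compute an aggregate."""
    return Counter(in_v for _, label, in_v in EDGES if label == "PURCHASED")
```

The first function touches only the edges of one vertex, so its cost is governed by latency per lookup. The second touches every edge, so its cost is governed by how fast the whole graph can be scanned.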
Deployment
Where to Deploy?
VMs
Bare Metal
Kubernetes
■ Open-source system for managing containerized applications
■ Groups application containers into logical units
■ Builds abstractions on top of the basic resources
● Compute
● Memory
● Disk
● Network
Deployment Overview
[Diagram: Client -> Load Balancer -> Deployment (JanusGraph); Stateful Set, Headless Service and Storage Class (Elasticsearch)]
■ The “stateful” components of our system are Scylla & Elasticsearch
■ JanusGraph is deployed as a stateless server that stores and
retrieves data to and from the stateful systems
Scylla
■ Use your existing deployment == Zero lift!
■ New keyspace for JanusGraph data
Elasticsearch
[Diagram: Stateful Set, Headless Service, Storage Class]
Elasticsearch - Manifest Summary
Stateful Set
kind: StatefulSet
metadata: ...
spec:
  serviceName: es
  replicas: 3
  selector: { matchLabels: { app: es }}
  template:
    metadata: { labels: { app: es }}
    spec:
      containers:
      - name: elasticsearch
        image: .../elasticsearch-oss:6.6.0
        env:
        - name: discovery.zen.ping.unicast.hosts
          value: "es-0.es.default.svc.cluster.local,..."
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata: { name: data }
    spec:
      accessModes: [ ReadWriteOnce ]
      storageClassName: elasticsearch-ssd
Headless Service
kind: Service
metadata:
  name: es
  labels: { app: es }
spec:
  clusterIP: None
  ports:
  - port: 9200
  - port: 9300
  selector:
    app: es
Storage Class
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: elasticsearch-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
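The discovery host list in the StatefulSet relies on the stable DNS names that Kubernetes gives to StatefulSet pods behind a headless Service: each pod is named after the StatefulSet plus an ordinal, and resolves as pod.service.namespace.svc.cluster-domain. The helper below is a small sketch of how that value could be generated; the function names are hypothetical, and the StatefulSet name "es" is an assumption chosen to match the value shown in the manifest summary (the StatefulSet's metadata is elided there, so the real value depends on the name you give it).

```python
def statefulset_pod_hosts(statefulset, service, replicas,
                          namespace="default", cluster_domain="cluster.local"):
    """Return the stable DNS name of every pod in a StatefulSet.

    Pod names follow <statefulset>-<ordinal>; the headless Service adds
    <service>.<namespace>.svc.<cluster_domain>.
    """
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


def unicast_hosts_value(statefulset, service, replicas, **kwargs):
    """Join the pod hosts into the comma-separated form used by the env var."""
    return ",".join(statefulset_pod_hosts(statefulset, service, replicas, **kwargs))


# Assumed names: StatefulSet "es" behind headless Service "es", 3 replicas.
print(unicast_hosts_value("es", "es", 3))
```

Generating the list from the replica count avoids hand-editing the manifest when the cluster is resized.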
Elasticsearch - Deploy
$ kubectl apply -f elasticsearch.yaml
storageclass.storage.k8s.io/elasticsearch-ssd created
service/es created
statefulset.apps/elasticsearch created
$ kubectl get all -l app=elasticsearch
NAME READY AGE
statefulset.apps/elasticsearch 3/3 2m10s
NAME READY STATUS RESTARTS AGE
pod/elasticsearch-0 1/1 Running 0 2m9s
pod/elasticsearch-1 1/1 Running 0 87s
pod/elasticsearch-2 1/1 Running 0 44s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/es ClusterIP None <none> 9200/TCP,9300/TCP 2m9s
JanusGraph
JanusGraph Image
■ There are already official JanusGraph images on Docker Hub
$ docker pull janusgraph/janusgraph:0.4.0
■ You can also build your own using the JanusGraph project build scripts and push it to a private image repository (ex: GCP)
$ git clone https://github.com/JanusGraph/janusgraph-docker.git
$ cd janusgraph-docker
$ sudo ./build-images.sh 0.4
# Push the image to your private project repository
$ docker tag janusgraph/janusgraph:0.4.0 gcr.io/$PROJECT/janusgraph:0.4.0
$ gcloud auth configure-docker
$ docker push gcr.io/$PROJECT/janusgraph:0.4.0
JanusGraph Console
(Just a Pod…)
JanusGraph Console - Manifest Summary
■ Run JanusGraph in a Pod, and connect to it directly
● Graph is only accessible through this console connection, but actions are persisted
in Scylla and Elasticsearch
kind: Pod
spec:
  containers:
  - name: janusgraph
    image: .../janusgraph:0.4.0
    env:
    - name: JANUS_PROPS_TEMPLATE
      value: cql-es
    - name: janusgraph.storage.hostname
      value: 10.138.0.3
    - name: janusgraph.storage.cql.keyspace
      value: graphdev
    - name: janusgraph.index.search.hostname
      value: "es-0.es.default.svc.cluster.local,..."
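The env entries in this manifest follow a naming convention: as we understand the JanusGraph Docker image, JANUS_PROPS_TEMPLATE selects a base properties template (here the CQL + Elasticsearch one), and every variable prefixed with "janusgraph." overrides the matching key in the generated janusgraph.properties. The sketch below only mimics that mapping in Python to show which properties the Pod ends up with; it is not the image's actual startup script, and the function names are hypothetical.

```python
def janusgraph_overrides(env, prefix="janusgraph."):
    """Pick out env vars with the given prefix and strip it to get property keys."""
    return {
        name[len(prefix):]: value
        for name, value in env.items()
        if name.startswith(prefix)
    }


def render_properties(overrides):
    """Render key=value lines in the style of a Java properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(overrides.items()))


# The env section of the Pod manifest, written as a dictionary.
pod_env = {
    "JANUS_PROPS_TEMPLATE": "cql-es",
    "janusgraph.storage.hostname": "10.138.0.3",
    "janusgraph.storage.cql.keyspace": "graphdev",
}

print(render_properties(janusgraph_overrides(pod_env)))
```

Here storage.hostname points JanusGraph at the existing Scylla deployment and storage.cql.keyspace names the new keyspace for graph data, so no JanusGraph-specific configuration has to be baked into the image.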
JanusGraph Console - Deploy & Define Schema
$ kubectl create -f janusgraph-gremlin-console.yaml
$ kubectl exec -it janusgraph-gremlin-console -- bin/gremlin.sh
         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
...
gremlin>
graph = JanusGraphFactory.open('/etc/opt/janusgraph/janusgraph.properties')
mgmt = graph.openManagement()
// Define Schema for a Product Vertex and Properties
Product = mgmt.makeVertexLabel("Product").make()
name = mgmt.makePropertyKey("name").
    dataType(String.class).cardinality(Cardinality.SINGLE).make()
productId = mgmt.makePropertyKey("productId").
    dataType(Integer.class).cardinality(Cardinality.SINGLE).make()
mgmt.addProperties(Product, name, productId)
mgmt.commit()
JanusGraph Server
[Diagram: Deployment, Load Balancer]
JanusGraph Server - Manifest Summary
■ Deploy JanusGraph as a standalone server
● Uses TinkerPop Gremlin Server
● Graph will be accessible to a wide range of client languages (Python, Java, JS, etc.)
Deployment
kind: Deployment
labels:
  app: janusgraph
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: janusgraph
        image: .../janusgraph:0.4.0
        env:
        - name: JANUS_PROPS_TEMPLATE
          value: cql-es
        - name: janusgraph.storage.hostname
          value: 10.138.0.3
        - name: janusgraph.storage.cql.keyspace
          value: graphdev
        - name: janusgraph.index.search.hostname
          value: "es-0.es.default.svc.cluster.local,..."
Service
kind: Service
metadata:
  name: janusgraph-service-lb
spec:
  type: LoadBalancer
  selector:
    app: janusgraph
  ports:
  - name: gremlin-server-websocket
    protocol: TCP
    port: 8182
    targetPort: 8182
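Since the server is a TinkerPop Gremlin Server listening for WebSocket connections on port 8182, a Python client could reach it roughly as sketched below. This assumes the gremlinpython package in a 3.4.x release (matching the TinkerPop line that JanusGraph 0.4 builds on), the default /gremlin path, and the default traversal source name "g"; the helper names are hypothetical, and newer gremlinpython releases spell withRemote as with_remote.

```python
def gremlin_endpoint(host, port=8182):
    """Build the WebSocket URL for a Gremlin Server (default path assumed)."""
    return f"ws://{host}:{port}/gremlin"


def connect(host, port=8182):
    """Open a remote traversal source against the JanusGraph server.

    gremlinpython is imported lazily so the URL helper above can be used
    (and tested) without the driver installed.
    """
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    connection = DriverRemoteConnection(gremlin_endpoint(host, port), "g")
    return traversal().withRemote(connection), connection


# Example usage against the load balancer's external IP (not executed here):
#   g, connection = connect("35.121.171.101")
#   product_count = g.V().hasLabel("Product").count().next()
#   connection.close()
```

The caller is responsible for closing the connection once it is done issuing traversals.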
JanusGraph Server - Deploy
$ kubectl apply -f janusgraph.yaml
service/janusgraph-service-lb created
deployment.apps/janusgraph-server created
$ kubectl get all -l app=janusgraph
NAME READY STATUS RESTARTS AGE
pod/janusgraph-server-5d77dd9ddf-nc87p 1/1 Running 0 1m2s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/janusgraph-service-lb LoadBalancer 10.0.12.109 35.121.171.101 8182/TCP 1m3s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/janusgraph-server 1/1 1 1 1m3s
NAME DESIRED CURRENT READY AGE
replicaset.apps/janusgraph-server-5d77dd9ddf 1 1 1 1m2s
A Better Way - Helm Charts
■ Nobody has time to manage all of these individual manifest files!
■ Use Helm (https://helm.sh) - the “package manager” for k8s
■ Makes it easy to define, deploy & upgrade Kubernetes applications
■ You can find our opinionated take on deploying JanusGraph with Helm at https://github.com/EnharmonicAI/janusgraph-helm
With Kubernetes, it’s easy to deploy JanusGraph on top of Scylla
Flexible, scalable graph data system for building applications
Thank you
Stay in touch
Any questions?
Ryan Stauffer
ryan@enharmonic.ai
@RyantheStauffer



Editor's Notes

  • #2: Let's give another round of applause to Brian.  Everything he said applies here – now we'll just dig into the technical pieces a bit more.
  • #3: I'm Ryan Stauffer, I'm the founder and CEO of a Bay Area startup called Enharmonic.  I first got excited about graph databases several years back when I was leading data analytics and strategy for a large automotive aftermarket company.  We were trying to build a unified model of data for the automotive aftermarket that combined data from across our different verticals.  Using the source data in its existing form – hundreds of tables, and hundreds of millions of rows & columns - was leading us down a really bad path.  It became clear that insights would be much easier if we used a graph data model, where we can explicitly model our data as real-world business concepts.  Ever since then, I’ve viewed graph data systems as a core part of the solution for how to ask and answer better questions about our businesses.
  • #4: For a little backdrop about what we'll be talking about – what do we do at Enharmonic?  Well, we're working to solve the problem of how companies interact with their data. We provide a clean, visual interface that lets business decision makers directly access their data with free-text search and point-click-and-drag actions.  Data is modeled and retrieved as logical business concepts like Customers, Products, and Orders.  Our system recommends analyses that make sense based on the data, and then goes ahead and executes those with just a few clicks.  To make this possible, we use lots of automation on the backend – and sitting behind everything, we use a graph data system.
  • #5: Brian discussed graphs in the last session, so I'm not going to rehash everything, but I do want to do a brief level-set.  So what do I mean when I say "Graph Data System"?
  • #6: We can break that into 2 parts: "Graph" & "Data System". By "graph" we mean that we're modelling our data as a property graph, using Vertices, Edges & Properties. Vertices model entities like Customers or Products. Edges model relationships between entities, like how one Customer KNOWS another Customer, or a Customer HAS PURCHASED a Product. Properties model attributes of entities and relationships, like the name and age of a Customer. By "Data System" we mean that several distinct components combine to form a single, logical system.
  • #7: There are several options for graph databases out there on the market, but when we need a combination of scalability, flexibility, and performance, we can look to a system built of JanusGraph, Scylla, and Elasticsearch. This single logical data system is structured into 3 parts: - In the center we have JanusGraph, a Java application that clients communicate with directly. It serves as the abstraction layer that lets us interact with our data as a graph. - JanusGraph writes to and reads from Scylla, where our data is ultimately persisted. - We can optionally add Elasticsearch to help us with advanced indexing and text search capabilities.
  • #8: So that sounds interesting, but why do we want to do this at all?
  • #9: I think there are 3 core benefits of this graph data system. - Flexibility - Schema support - Support for both transactional & analytical workloads
  • #10: The killer feature of using a graph is its flexibility - Business logic changes, application requirements change, and it can often be a real problem trying to support that with traditional databases - Using a graph means our data model isn't set in stone. - We can iterate and evolve the data model by adding additional vertices and edges to meet our new needs, without throwing out everything that already works. - We can also write analytics results directly back to the graph, explicitly connecting to our primary data. - This simplifies the ways that teams can collaborate and share insights, while allowing for powerful data provenance capabilities.
  • #11: Schema support is a real "nice-to-have" when it comes to separating business logic from lower-level database integrity issues. JanusGraph, unlike some other graph databases out there, supports defining a schema for data, but doesn't require that we do this. Basically, we can apply useful constraints to what is allowed and disallowed on our graph. For example, we can ensure that name and age properties are only allowed to be written to a Customer vertex, but we don't required that every Customer vertex have all of these properties (minimizes the need for pointless null field values!) We can also specify that a Product and Customer vertex are allowed to be related with a HAS_PURCHASED edge, bu we don't required that each Product vertex must have that edge. This sort of clear schema flexibility is difficult to replicate outside of a graph environment. Separates data integrity mantenance from our business logic – letting our DB take over DB tasks, without offloading them onto the application layer.
  • #12: - Finally, with this graph data system, we can execute both transactional and analytical workloads with the same data systtem and same query language – Gremlin. - We access data by “traversing” our graph, travelling from vertex to vertex by means of connecting edges. - We can think of a transactional workload to be one where we travel to a small number of vertices and edge, and where our goal is to minimize latency. - An analytical workload, on the other hand, is one where we travel to all, or a substantial portion, of our vertices and edges.  Our goal here is to maximize throughput. - Backed by the high-IO performance of Scylla, we can achieve scalable, single-digit millisecond response for transactional workloads.  We can also leverage Spark to handle large scale analytical workloads
  • #13: It's easy to talk about all of this in theory, but how do we go about actually deploying it?
  • #14: First of all, WHERE are we going to deploy this? In a production environment, it makes sense to deploy Scylla on either VMs or bare metal. For JanusGraph & ES, there are many advantages to deploying on Kubernetes. Q – Quick show of hands, who is using Kubernetes today? Q – Who has tried deploying Scylla on top of Kubernetes? (Yannis Zarkadas gave a great talk earlier today on using the Scylla Operator to manage Scylla on K8s – if you missed it I highly recommend checking out the talk online.)
  • #15: Kubernetes is an open source system for managing containerized applications. It allows you to group and manage application containers as logical units. Fundamentally, it's about building and interacting with abstractions on top of basic resources (compute, memory, disk, network). I'm not going to touch every last detail of the k8s manifests, but I want to really dive into the low-level fundamentals of the k8s resources you'll be using. Now even when setting up our pieces on k8s seems pedantic, remember that this greatly simplifies the process of installing and managing a complex application.  As many of you probably know, it's significantly easier to do it this way versus installing and upgrading each app and its dependencies manually at the VM level.
  • #16: Let's walk through the details of deploying the whole system. Big picture, we have 2 types of components – stateful and stateless. Stateful components are Scylla and Elasticsearch, where we'll actually persist our data.  Everything else is stateless and ephemeral.  Our actual JanusGraph app pods, for instance, are stateless, and if one dies, we simply spin up a new one in its place. So what does this look like? A client (maybe an app, maybe our little Scylla monster up here) issues queries to JanusGraph.  - Those queries hit a load balancer and are passed to 1 or more pods managed as part of a JanusGraph deployment. The JanusGraph app is what presents the "graph" view of data, and it does it by intermediating between the client and stateful apps. Most data is put in Scylla, over here on the left. For more advanced indexing, we use Elasticsearch, which we deploy as a Stateful Set and Headless Service.
  • #17: Diving into more detail, we start with Scylla. We can actually use your existing Scylla cluster, meaning there's 0 lift! The one thing we'll do is create a new keyspace to hold graph data.
  • #18: To give us more advanced indexing capabilities, we'll deploy Elasticsearch as well. We deploy it on Kubernetes in 3 parts: - Headless Service - Stateful Set - Storage Class. ES is stateful, so it needs to persist data, which we'll accomplish by means of a stateful set. Now, a stateful set is just used to manage 1 or more replica pods, which are the nodes in our ES cluster.  But it does this in a unique way.  It assigns numbers to each pod and the disks that are mounted to it.  This way, we consistently mount the same disk to the same pod #. This gives us a reliably stateful system, where even if individual pods fail, they're safely recreated automatically by Kubernetes.
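The numbering scheme is easy to see if we spell out the names Kubernetes derives for each replica. The sketch below follows the standard StatefulSet naming conventions and assumes the default `cluster.local` cluster domain; the set, claim template, service, and namespace names passed in are placeholders, not the ones from the slides.

```python
def stateful_set_identities(set_name, claim_template, service, namespace, replicas):
    """List the stable names Kubernetes derives for each replica of a StatefulSet."""
    identities = []
    for ordinal in range(replicas):
        # Pods are named <statefulset name>-<ordinal>, starting at 0
        pod = f"{set_name}-{ordinal}"
        identities.append({
            "pod": pod,
            # PersistentVolumeClaim created from the volume claim template
            "claim": f"{claim_template}-{pod}",
            # Per-pod DNS record provided through the headless service
            "dns": f"{pod}.{service}.{namespace}.svc.cluster.local",
        })
    return identities


# Placeholder names: a 3-node set called "elasticsearch" in the "default" namespace
for identity in stateful_set_identities("elasticsearch", "data", "elasticsearch", "default", 3):
    print(identity)
```

Because the claim name is derived from the pod name, a recreated `elasticsearch-1` binds to the same claim, and therefore the same disk, as the pod it replaces.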
  • #19: We define a storage class – what type of disks do we want to mount to our Elasticsearch nodes?  In this case, we'll choose SSDs. We'll define a headless service.  We set clusterIP to None, specify our standard ES ports, and provide a selector to target our stateful set pods. The last step is to define our stateful set.  This references the Storage Class and Headless Service we just defined, so I color-coded the important bits. For storage, shown in blue, our goal is to define a disk from our elasticsearch-ssd storage class for each ES node, and mount it to that node.  To do this, we'll define a Volume Claim Template, and define a volume mount that mounts the disk at our ES data path. For networking, shown in red, we specify the Headless Service name.  We'll also define 1 environment variable that allows for ES node discovery. Q – I THINK THERE'S A TYPO HERE ON THE SELECTOR FOR THE HEADLESS SERVICE.
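Here is a rough sketch of how those three resources reference each other, not the manifest from the slides. Since kubectl accepts JSON as well as YAML, it builds the resources as Python dicts and prints them as one JSON list. The `elasticsearch-ssd` storage class name and the 3 replicas come from the talk. The GCE SSD provisioner, the `app: elasticsearch` label, the image tag, the 10Gi disk size, and the ES 6.x discovery setting are all assumptions.

```python
import json

APP_LABEL = {"app": "elasticsearch"}         # assumed label shared by the selector and the pods
DATA_PATH = "/usr/share/elasticsearch/data"  # data path used by the official ES images

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "elasticsearch-ssd"},
    "provisioner": "kubernetes.io/gce-pd",   # assumption: GCE persistent disks
    "parameters": {"type": "pd-ssd"},
}

headless_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "elasticsearch"},
    "spec": {
        "clusterIP": "None",                 # this is what makes the Service headless
        "selector": APP_LABEL,               # must match the pod labels below
        "ports": [{"name": "http", "port": 9200}, {"name": "transport", "port": 9300}],
    },
}

stateful_set = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "elasticsearch"},
    "spec": {
        "serviceName": headless_service["metadata"]["name"],   # networking link
        "replicas": 3,
        "selector": {"matchLabels": APP_LABEL},
        "template": {
            "metadata": {"labels": APP_LABEL},
            "spec": {"containers": [{
                "name": "elasticsearch",
                # Assumed image and tag
                "image": "docker.elastic.co/elasticsearch/elasticsearch-oss:6.6.0",
                # Assumed ES 6.x discovery setting, pointed at the headless service
                "env": [{"name": "discovery.zen.ping.unicast.hosts",
                         "value": headless_service["metadata"]["name"]}],
                "volumeMounts": [{"name": "data", "mountPath": DATA_PATH}],
            }]},
        },
        "volumeClaimTemplates": [{                              # storage link
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "storageClassName": storage_class["metadata"]["name"],
                "resources": {"requests": {"storage": "10Gi"}},  # assumed size
            },
        }],
    },
}

manifest = {"apiVersion": "v1", "kind": "List",
            "items": [storage_class, headless_service, stateful_set]}
print(json.dumps(manifest, indent=2))
```

The sketch leaves out settings a real Elasticsearch cluster needs, such as the cluster name, heap size, and kernel tuning. It does show the two links the color coding pointed at: `serviceName` for networking, and the volume claim template plus volume mount for storage. Deriving the selector and the pod labels from a single constant also guards against the kind of selector typo flagged above.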
  • #20: Assuming we put all of this into a single manifest file, we can deploy Elasticsearch to our Kubernetes cluster with a single "apply" command. After a little bit of initialization, we can see the Ready status of our stateful set, the 3 pods it controls, and the services that route network traffic to these pods.
  • #21: Now, for the last and most important piece of the puzzle – JanusGraph. We'll deploy this on Kubernetes as well.
  • #22: There are already official JanusGraph images available on Docker Hub, and for these examples we'll be using version 0.4.0. You could also build your own using the JanusGraph project build scripts, and push that image to a private image repository (for example, on Google Cloud Platform).
  • #23: Now how do we use JanusGraph?  Let's start with a minimal example.  Not for production use - but illustrates how this all works. We'll deploy a single pod to get console access to our system.
  • #24: We'll run JanusGraph in a single pod, and connect to it directly. That means that the graph is only accessible through the console connection, but all of our actions are still persisted in Scylla and Elasticsearch. Now, the standard JanusGraph Docker image includes some great templating and presets, which allow us to configure our connection to our storage and indexing backends with just a few environment variables. We're using Scylla + Elasticsearch, so we set cql-es as our JanusGraph properties template. We set the hostname as 1 or more of the Scylla cluster hostnames. We set the keyspace as a new, clean Scylla keyspace where we'll store all of our graph data. Finally, we supply the K8s cluster hostnames for our Elasticsearch nodes.
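As a sketch of what that pod definition might look like (again as Python dicts printed to JSON, which kubectl accepts), the block below sets the four values just described. The `cql-es` template and the 0.4.0 image version come from the talk. The pod name, Scylla hostnames, keyspace name, and Elasticsearch hostnames are placeholders, and the variable names follow the JanusGraph Docker image's convention as I understand it (`JANUS_PROPS_TEMPLATE` plus `janusgraph.`-prefixed options), so check them against the image documentation for your version.

```python
import json

# Placeholder values: substitute your own Scylla hosts, keyspace, and ES pod hostnames
janusgraph_env = {
    "JANUS_PROPS_TEMPLATE": "cql-es",                       # Scylla (CQL) + Elasticsearch
    "janusgraph.storage.hostname": "scylla-node-1,scylla-node-2",
    "janusgraph.storage.cql.keyspace": "graph",
    "janusgraph.index.search.hostname":
        "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch",
}

console_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "janusgraph-console"},             # hypothetical pod name
    "spec": {"containers": [{
        "name": "janusgraph",
        "image": "janusgraph/janusgraph:0.4.0",
        # Kubernetes expects env as a list of {name, value} pairs
        "env": [{"name": name, "value": value} for name, value in janusgraph_env.items()],
    }]},
}

print(json.dumps(console_pod, indent=2))
```

The Elasticsearch hostnames use the per-pod DNS names that a headless service provides, which assumes the JanusGraph pod runs in the same namespace as the Elasticsearch stateful set.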
  • #25: With that manifest file, we can create a pod, then connect to it with an interactive terminal. This will bring up a Gremlin Console. The JG Docker image will prepopulate a standard janusgraph.properties file that will reflect the env var configuration we just set up. We use a factory to create a graph instance, and then we can do whatever we'd like to! For example, we can start by defining a schema for a Product vertex with name and productId properties.
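To show how environment variables end up in janusgraph.properties, here is a small imitation of the idea in Python. It is not the image's actual startup script: `render_properties` is a hypothetical helper, and the template defaults are my assumption of what a CQL + Elasticsearch template would contain.

```python
def render_properties(template, env):
    """Overlay janusgraph.*-prefixed environment variables onto template defaults."""
    props = dict(template)
    prefix = "janusgraph."
    for name, value in env.items():
        if name.startswith(prefix):
            # Strip the prefix: janusgraph.storage.hostname -> storage.hostname
            props[name[len(prefix):]] = value
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Assumed defaults for a CQL + Elasticsearch template
cql_es_template = {
    "gremlin.graph": "org.janusgraph.core.JanusGraphFactory",
    "storage.backend": "cql",
    "index.search.backend": "elasticsearch",
}

# Placeholder values, as they would be set on the pod
env = {
    "JANUS_PROPS_TEMPLATE": "cql-es",   # selects the template, so it is not a property itself
    "janusgraph.storage.hostname": "scylla-node-1",
    "janusgraph.storage.cql.keyspace": "graph",
    "janusgraph.index.search.hostname": "elasticsearch-0.elasticsearch",
}

print(render_properties(cql_es_template, env))
```

The resulting file is what the factory reads in the console when we open the graph, so the connection details never have to be typed in by hand.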
  • #26: If we want to actually move to a real environment, we need to support multiple users and applications, probably written in different languages. To handle this we deploy JanusGraph server. On Kubernetes, we'll do this as a Deployment, which manages 1 or more stateless replica pods. We put a load balancer in front of it, exposed on an external or internal IP depending on the use case.
  • #27: When we deploy JanusGraph as a standalone server, we're actually using the Apache TinkerPop Gremlin Server under the hood, which will accept Gremlin language queries issued from applications written in multiple languages (Python, Java, JS, etc.). The Service is pretty simple: just a LoadBalancer that will route network requests to our pods.  We're using port 8182 because that's the standard Gremlin websocket port. We manage those pods as a single deployment.  We specify the number of replicas, the image, and set up the environment variables just like we did before.
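A sketch of that Service and Deployment pair, again as Python dicts printed to JSON for kubectl, might look like the following. Port 8182 and the 0.4.0 image come from the talk. The resource names, the `app: janusgraph` label, the replica count, and the connection values are placeholders.

```python
import json

LABELS = {"app": "janusgraph"}   # assumed label tying the Service to the Deployment's pods

# Same style of configuration as the single-pod example; values are placeholders
env = {
    "JANUS_PROPS_TEMPLATE": "cql-es",
    "janusgraph.storage.hostname": "scylla-node-1,scylla-node-2",
    "janusgraph.storage.cql.keyspace": "graph",
    "janusgraph.index.search.hostname": "elasticsearch-0.elasticsearch",
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "janusgraph"},
    "spec": {
        "type": "LoadBalancer",
        "selector": LABELS,
        "ports": [{"port": 8182, "targetPort": 8182}],   # Gremlin Server websocket port
    },
}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "janusgraph"},
    "spec": {
        "replicas": 3,                                   # assumed replica count
        "selector": {"matchLabels": LABELS},
        "template": {
            "metadata": {"labels": LABELS},
            "spec": {"containers": [{
                "name": "janusgraph",
                "image": "janusgraph/janusgraph:0.4.0",
                "ports": [{"containerPort": 8182}],
                "env": [{"name": k, "value": v} for k, v in env.items()],
            }]},
        },
    },
}

manifest = {"apiVersion": "v1", "kind": "List", "items": [service, deployment]}
print(json.dumps(manifest, indent=2))
```

Since the pods hold no state of their own, scaling is just a matter of changing `replicas`. Once the load balancer has an address, clients would typically connect over websocket at a URL of the form `ws://<load balancer IP>:8182/gremlin`.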
  • #28: We apply our manifest, and check that everything is running.  The key parts are the Load Balancer and Deployment. Once our LB has its IP assigned, we're able to connect to our JG pods with a client application.  Now we can issue queries, store data – do whatever we want! Now, some of that description of K8s manifest got pretty pedantic.  There's got to be a better way, right?
  • #29: There is – Helm Charts! Q – With a show of hands, who uses Helm Charts? Awesome.  We can think of Helm as a package manager for k8s.  It lets us template out and group related manifest files into logical packages called Charts. This makes it easy to define, deploy and upgrade Kubernetes applications with single commands. We just released our own opinionated take on how to deploy JanusGraph as a Helm Chart on GitHub.  If you like saving time and energy, please check it out and use it.
  • #30: Kubernetes gives us tremendous power, and makes it easy to deploy JanusGraph on top of Scylla.
  • #31: With our deployment up and running, we have a flexible, scalable graph data system that we can use as the bedrock for an exciting new generation of applications.
  • #32: Thank you for your time. If you'd like to stay in touch, you can follow me on Twitter or connect with me on LinkedIn.  You can also contact me directly via email. I think we have a few more minutes, so what questions do you have?