GRAPH 101: GETTING STARTED
WITH TITAN AND CASSANDRA
Shaunak Das
PURPOSE OF THIS SESSION
Get users comfortable with DSE Graph
- DSE Graph is currently under heavy development
- Titan DB serves as a prototype
DSE Graph is built ‘on top’ of DSE, i.e. it will use many of the other features
provided in DSE (e.g. Spark, Hadoop, Solr)
- Today, we use Titan with Cassandra for persistent storage of graphs
WHAT IS A GRAPH DATABASE?
You all know what a graph is. Abstractly, it consists of vertices and edges
connecting pairs of vertices. Edges may be directed, pointing from one vertex to
another.
A graph database stores a graph as a data structure in which each vertex and
edge instance can hold multiple [Key:Value] pairs.
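To make the definition concrete, here is a minimal plain-Python sketch of a property graph. It is not Titan or TinkerPop code: the class names `Vertex`, `Edge`, and `PropertyGraph` are illustrative assumptions, and the sample values are borrowed from the speaker notes.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)  # key:value pairs on the vertex


@dataclass
class Edge:
    label: str
    out_v: Vertex  # the vertex the edge leaves
    in_v: Vertex   # the vertex the edge points to
    properties: dict = field(default_factory=dict)  # key:value pairs on the edge


class PropertyGraph:
    """Hypothetical in-memory property graph, for illustration only."""

    def __init__(self):
        self.vertices = []
        self.edges = []
        self._ids = count(1)  # simple auto-incrementing vertex ids

    def add_vertex(self, **properties):
        vertex = Vertex(next(self._ids), dict(properties))
        self.vertices.append(vertex)
        return vertex

    def add_edge(self, label, out_v, in_v, **properties):
        edge = Edge(label, out_v, in_v, dict(properties))
        self.edges.append(edge)
        return edge


graph = PropertyGraph()
person = graph.add_vertex(name="Shaunak")
company = graph.add_vertex(company="DataStax")
works_for = graph.add_edge("works for", person, company, team="DSE Test")
```

The edge carries its own properties (`team`), which is what separates a property graph from a bare list of vertex pairs.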
EXAMPLE: AMAZON DATA SET
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
reviews: total: 2 downloaded: 2 avg rating: 5
2000-7-28 customer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
2003-12-14 customer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5
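A record of this shape is regular enough to parse line by line. The following Python sketch shows one way to do it, under the assumption that every record follows the layout shown on the slide. The function name `parse_record` and the choice of fields to keep are illustrative, and the `similar` and `categories` lines are left out for brevity.

```python
import re

# A shortened copy of the record from the slide.
RECORD = """\
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
reviews: total: 2 downloaded: 2 avg rating: 5
2000-7-28 customer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
2003-12-14 customer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5
"""

# One review line: date, customer id, then three integer fields.
REVIEW = re.compile(
    r"(?P<date>\d{4}-\d{1,2}-\d{1,2})\s+customer:\s+(?P<customer>\S+)\s+"
    r"rating:\s+(?P<rating>\d+)\s+votes:\s+(?P<votes>\d+)\s+helpful:\s+(?P<helpful>\d+)"
)
SCALAR_KEYS = ("Id", "ASIN", "title", "group", "salesrank")


def parse_record(text):
    """Parse one product record into a dict (hypothetical helper)."""
    record = {"reviews": []}
    for line in text.splitlines():
        line = line.strip()
        review = REVIEW.match(line)
        if review:
            fields = review.groupdict()
            for key in ("rating", "votes", "helpful"):
                fields[key] = int(fields[key])
            record["reviews"].append(fields)
            continue
        # Split on the first colon only, so titles containing ':' stay intact.
        key, sep, value = line.partition(":")
        if sep and key in SCALAR_KEYS:
            record[key] = value.strip()
    return record


product = parse_record(RECORD)
```

Each review becomes a small dict, which maps naturally onto an edge between a customer and the product once the data is loaded into a graph.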
AMAZON GRAPH SCHEMA
image courtesy of Pierre LaPorte
IN THE MEANTIME...
As mentioned, we currently have Titan DB as a stand-in for DSE Graph
Titan is what we will be using today
There are several names you may have heard associated with Graph:
Titan
TinkerPop
Gremlin
Let me briefly (and perhaps incorrectly) distinguish what each is for.
TITAN? TINKERPOP?
With data, we care about (simplifying):
how to ‘effectively’ store it
● serialization, compaction strategies
● this is Titan
how to ‘effectively’ retrieve/query it
● query algorithms, OLAP vs. OLTP
● this is TinkerPop
DSE Graph will encompass both parts of the above: graph storage and
graph querying/traversing
GETTING STARTED WITH TITAN
One can download a pre-built version of Titan 1.0, bundled with TinkerPop:
http://s3.thinkaurelius.com/downloads/titan/titan-1.0.0-hadoop1.zip
We will download and unpack it in a moment.
GREMLIN?
With this Titan distribution comes the Gremlin query language
The Gremlin query language is a graph traversal language, used to navigate and
query graph instances.
“Gremlin is to Titan what CQL is to Cassandra”
(This analogy is not perfect, but it is good enough for our purposes.)
Just like Cassandra comes with a CQL shell, Titan comes with a Gremlin shell.
This will be how the user primarily interfaces with graphs. Let’s use it.
root@perf-lab-03b:~/titan-1.0.0-hadoop1# bin/gremlin.sh
         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: aurelius.titan
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/titan-1.0.0-hadoop1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/titan-1.0.0-hadoop1/lib/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
13:19:16 INFO org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph - HADOOP_GREMLIN_LIBS is set to: /root/titan-1.0.0-hadoop1/lib
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.tinkergraph
gremlin>
USING A TINKERGRAPH
gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.V().count()
==>0
gremlin> g.E().count()
==>0
TITAN WITH CASSANDRA
You should be asking yourself: where did that graph go?
TinkerGraph = in-memory graph. Once we closed it, the data in that TinkerGraph
instance is gone.
TitanGraph = persistent graph. This is where Cassandra comes into play.
How do we get Titan to play with Cassandra, in order to store a persistent
graph, which we can ‘microwave’ up for future querying and modifications?
TITAN WITH CASSANDRA
PREREQUISITE: Cassandra is running on your machine/cluster.
Re-enter the Gremlin REPL. Specify the type of graph, the host machine
Cassandra is running on, and the keyspace in which we want to store this graph:
conf = new BaseConfiguration()
conf.setProperty('gremlin.graph', 'com.thinkaurelius.titan.core.TitanFactory')
conf.setProperty('storage.backend', 'cassandra')
conf.setProperty('storage.hostname', 'localhost')
conf.setProperty('storage.cassandra.keyspace', 'graph')
Instantiate your graph with the configuration specified above:
graph = GraphFactory.open(conf)
AUTOMATING DATA LOADING
So that was quite a bit of work to just get two vertices into Cassandra.
What if we are dealing with a large data set that needs to get into a TitanGraph?
The Gremlin shell accepts parser scripts for automating the loading of data.
We have the following parser script for this data set:
https://github.com/riptano/automaton/blob/master/resources/tests/graph/scripts/AmazonTitan.groovy
Let’s take a high-level glance at what is involved here.
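The session's loader is a Groovy script and is not reproduced here. As a rough illustration of what any such loader has to do, the Python sketch below turns already-parsed product records into vertices and 'rated' edges. The function name `load_products`, the vertex keys, and the `customerId` property are assumptions, and the second sample item is made up.

```python
def load_products(products):
    """Turn parsed product records into vertices and 'rated' edges (illustrative)."""
    vertices = {}  # vertex key -> property dict; one vertex per product or customer
    edges = []     # (label, out_vertex_key, in_vertex_key, properties)
    for product in products:
        product_key = ("product", product["ASIN"])
        vertices[product_key] = {"ASIN": product["ASIN"], "title": product["title"]}
        for review in product["reviews"]:
            customer_key = ("customer", review["customer"])
            # setdefault keeps a customer who reviewed several items as one vertex.
            vertices.setdefault(customer_key, {"customerId": review["customer"]})
            # The edge points from the customer to the product they rated.
            edges.append(("rated", customer_key, product_key, {"rating": review["rating"]}))
    return vertices, edges


sample = [
    {"ASIN": "0827229534", "title": "Patterns of Preaching: A Sermon Sampler",
     "reviews": [{"customer": "A2JW67OY8U6HHK", "rating": 5},
                 {"customer": "A2VE83MZF98ITY", "rating": 5}]},
    # Hypothetical second item, made up to show customer de-duplication.
    {"ASIN": "B000000001", "title": "Hypothetical second item",
     "reviews": [{"customer": "A2JW67OY8U6HHK", "rating": 4}]},
]
vertices, edges = load_products(sample)
```

De-duplicating customers matters: if the same customer became a new vertex for every review, traversals that hop from one item to another through a shared customer would find nothing.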
QUESTIONS WE CAN ‘ANSWER’ WITH GRAPH
Let’s return to our Amazon data set example. Suppose we want to determine
all users who liked a particular item with ASIN X.
g.V() ← get all vertices
g.V().has('ASIN', 'X') ← ...with ASIN value X
g.V().has('ASIN', 'X').inE('rated') ← grab its incoming 'rated' edges
g.V().has('ASIN', 'X').inE('rated').has('rating', 5) ← ...with rating value 5
g.V().has('ASIN', 'X').inE('rated').has('rating', 5).outV() ← the customers who gave those ratings
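For readers new to Gremlin, the same chain of filters can be written as a plain Python loop. This is only a sketch of the logic, not how Titan executes the traversal; the edge tuples and the customer id `HYPOTHETICAL_CUSTOMER` are invented for the example.

```python
# Each edge is (label, out_vertex, in_vertex, properties); customers rate products.
EDGES = [
    ("rated", "A2JW67OY8U6HHK", "0827229534", {"rating": 5}),
    ("rated", "A2VE83MZF98ITY", "0827229534", {"rating": 5}),
    ("rated", "HYPOTHETICAL_CUSTOMER", "0827229534", {"rating": 2}),  # made-up row
]


def customers_who_liked(edges, asin, liked_rating=5):
    """Mirror has('ASIN', X).inE('rated').has('rating', 5).outV() in plain Python."""
    customers = []
    for label, out_v, in_v, props in edges:
        if in_v != asin:                          # only edges arriving at item X
            continue
        if label != "rated":                      # inE('rated')
            continue
        if props.get("rating") != liked_rating:   # has('rating', 5)
            continue
        customers.append(out_v)                   # outV(): the vertex the edge leaves
    return customers


fans = customers_who_liked(EDGES, "0827229534")
```

Here "liked" is taken to mean a rating of exactly 5, as on the slide; a threshold such as 4 or above would be an equally reasonable definition.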
RECOMMENDATION SYSTEM?
Now suppose we want the top ten items that were liked by customers who
liked the item with ASIN value X.
What kind of traversal query should we make now?
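One way to reason about the answer before writing any Gremlin is to spell out the steps in plain Python: find the customers who liked X, collect the other items they liked, count them, and keep the most frequent. The sketch below does that on made-up sample rows; the function name `recommend` and all ids are assumptions.

```python
from collections import Counter

# (customer, asin, rating); every row is made-up sample data.
RATED = [
    ("c1", "X", 5), ("c2", "X", 5), ("c3", "X", 3),
    ("c1", "A", 5), ("c2", "A", 5), ("c2", "B", 5),
    ("c1", "C", 2), ("c3", "D", 5),
]


def recommend(rated, asin, liked_rating=5, top_n=10):
    """Return up to top_n (item, count) pairs co-liked with the given item."""
    # Step 1: customers who liked the starting item.
    fans = {c for c, item, rating in rated if item == asin and rating == liked_rating}
    # Step 2: count the other items those same customers liked.
    counts = Counter(
        item for c, item, rating in rated
        if c in fans and item != asin and rating == liked_rating
    )
    # Step 3: most frequently co-liked items first.
    return counts.most_common(top_n)


top_items = recommend(RATED, "X")
```

The sketch excludes the starting item from its own recommendations and sorts by descending count; both are design choices that a Gremlin version would need to make explicitly as well.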
POTENTIAL FUTURE SESSIONS AND TOPICS
Defining Graph Schema: indexing
Using Hadoop and Spark for ‘OLAP-Querying’ a Graph
Using Hadoop and Spark for Bulk Loading Graph Data into Cassandra
Your suggestions!



Editor's Notes

  • #4: This slide is obligatory, so I apologize if this is nothing new.
  • #5: Consider a data set of the following form
  • #9: wget http://s3.thinkaurelius.com/downloads/titan/titan-1.0.0-hadoop1.zip
    unzip titan-1.0.0-hadoop1
  • #10: Do the TinkerGraph example:
    graph = TinkerGraph.open()
    graph.addVertex('name', 'Shaunak')
    graph.addVertex('company', 'DataStax')
    g = graph.traversal()
    from = g.V().has('name', 'Shaunak').next()
    to = g.V().has('company', 'DataStax').next()
    edge = from.addEdge('works for', to)
    edge.property('team', 'DSE Test')
  • #14: Do the TinkerGraph example:
    graph = TinkerGraph.open()
    graph.addVertex('name', 'Shaunak')
    graph.addVertex('company', 'DataStax')
    g = graph.traversal()
    from = g.V().has('name', 'Shaunak').next()
    to = g.V().has('company', 'DataStax').next()
    edge = from.addEdge('works for', to)
    edge.property('team', 'DSE Test')
    // to display values
    from.properties()
    edge.properties()
    Show keyspace in CQL shell
  • #16: conf = new BaseConfiguration()
    conf.setProperty('gremlin.graph', 'com.thinkaurelius.titan.core.TitanFactory')
    conf.setProperty('storage.backend', 'cassandra')
    conf.setProperty('storage.hostname', 'localhost')
    conf.setProperty('storage.cassandra.keyspace', 'enron')
    graph = GraphFactory.open(conf)
    :load /var/lib/automaton/tests/graph/scripts/EnronTitan.groovy
    EnronTitan.load_data('/var/lib/automaton/tests/data/graph/email-Enron.txt', graph)
  • #18: g.V().has('ASIN', '0827229534').inE('rated').has('rating', 5).outV().outE('rated').has('rating', 5).inV().has('title').valueMap('title').groupCount().order().by(mapValues(), incr).unfold().limit(10)