SlideShare a Scribd company logo
Overview of Cassandra and
The Doradus OSS Project
Randy Guck
Principal Engineer, Dell Software
Overview
•  What is No SQL?
– Common RDB roadblocks
– NoSQL database types
•  Overview of Cassandra
– What's unique
– Limitations
•  Doradus
– Architecture
– Features
– The OLAP and Spider storage managers
– What each is good for
– Where to get Doradus
Why RDB Apps Look for Something Else
•  Performance
– B-trees
– Locking
– One writable copy of each record
•  Scaling costs
– RDBs scale "up"
– Big boxes, SANs, fiber channel, etc.
•  What if you want...
– Distributed access
– No single points of failure
– Instant failover
– Sharding
– Replication
NoSQL Data Models
Data Model Examples Elastic? Queries? Relationships?
Key–Value
LevelDB, Kyoto Cabinet,
Redis
No No No
Distributed Key–
Value
Dynamo, MemcacheDB,
Riak, Voldemort
Yes No No
Column-Oriented
Accumulo, Cassandra,
HBase
Yes Some No
Document-
Oriented
Couchbase,
Elasticsearch, MongoDB
Yes Yes Some
Graph Neo4J, OrientDB, Titan No Yes Yes
Sharding + replication
AND/OR/ranges/etc.
Built-in support
NoSQL Data Models
Data Model Examples Elastic? Queries? Relationships?
Key–Value
LevelDB, Kyoto Cabinet,
Redis
No No No
Distributed Key–
Value
Dynamo, MemcacheDB,
Riak, Voldemort
Yes No No
Column-Oriented
Accumulo, Cassandra,
HBase
Yes Some No
Document-
Oriented
Couchbase,
Elasticsearch, MongoDB
Yes Yes Some
Graph Neo4J, OrientDB, Titan No Yes Yes
Sharding + replication
AND/OR/ranges/etc.
Built-in support
Doradus goals
NoSQL Common Traits
•  Distributed cluster of nodes
– Commodity, shared-nothing servers
– Scales horizontally
– Expands elastically
•  Replication
– Performant local access
– Automatic failover
•  De-normalized data model
•  Schemaless/dynamic columns
•  Eventual consistency
N=5, RF=3
Is NoSQL Catching On?
Source: db-engines.com
Overview of Cassandra
•  Wide column NoSQL database
•  Open sourced by Facebook
•  Apache Project with active community
•  Commercially support by DataStax,
Acunu, others
•  Used by 1,500+ companies
•  "Pure peer" architecture
•  Largest known Cassandra cluster:
300+ TB data and 400+ machines.
What is Cassandra best for?
•  Continuous data streams
– Logs, events, audit records, measurements, ...
– Fast data ingestion
– Predictable read performance
•  Partitionable data
– "1,000's of little databases in one"
•  Elastic scalability
– Expand/upgrade/repair without downtime
•  Not good for:
– Blob store
– Persistent queue
– OLTP transactions
CQL Static Table
CREATE	
  TABLE	
  songs	
  (	
  
	
  	
  	
  id	
  	
  	
  	
  	
  	
  uuid	
  PRIMARY	
  KEY,	
  
	
  	
  	
  title	
  	
  	
  text,	
  
	
  	
  	
  album	
  	
  	
  text,	
  
	
  	
  	
  artist	
  	
  text,	
  
	
  	
  	
  data	
  	
  	
  	
  blob	
  
);	
  
CREATE	
  INDEX	
  ON	
  songs	
  (artist);	
  
Row Key Columns: "<column	
  name>"="<column	
  value>"	
  
62c36...	
   "album"="90125"	
   "artist"="Yes"	
   "data"=<audio>	
   "title"="Changes"	
  
837a2...	
   "album"="Crystal	
  Ball"	
   "artist"="Styx"	
   "data"=<audio>	
   "title"="Put	
  Me	
  On"	
  
2de83...	
   "album"="Nevermind"	
   "artist"="Nirvana"	
   "data"=<audio>	
   "title"="Breed"	
  
...	
  
CQL Clustered Table
CREATE	
  TABLE	
  playlists	
  (	
  
	
  	
  	
  id	
  	
  	
  	
  	
  	
  	
  	
  	
  uuid,	
  
	
  	
  	
  song_order	
  int,	
  
	
  	
  	
  song_id	
  	
  	
  	
  uuid,	
  	
  	
  //	
  copied	
  from	
  songs.id	
  
	
  	
  	
  title	
  	
  	
  	
  	
  	
  text,	
  	
  	
  //	
  copied	
  from	
  songs.title	
  
	
  	
  	
  album	
  	
  	
  	
  	
  	
  text,	
  	
  	
  //	
  copied	
  from	
  songs.album	
  
	
  	
  	
  artist	
  	
  	
  	
  	
  text,	
  	
  	
  //	
  copied	
  from	
  songs.artist	
  
	
  	
  	
  PRIMARY	
  KEY	
  (id,	
  song_order)	
  	
  	
  //	
  compound	
  key	
  
);	
  
Row Key Columns: "<song_order>:<column	
  name>"="<column	
  value>"	
  
28d23...	
  
"1:"=""	
   "1:album"="90125"	
   "1:artist"="Yes"	
   "1:song_id"="62c36..."	
  
"1:title"="Changes"	
   "2:"=""	
   "2:album"="Nevermind"	
   "2:artist"="Nirvana"	
  
"2:song_id"="2de83..."	
   "2:title"="Breed"	
   "3:"=""	
   ...	
  
2ed91...	
  
"1:"=""	
   "1:album"="Crystal	
  Ball"	
   "1:artist"="Styx"	
   "1:song_id"="837a2..."	
  
"1:title"="Put	
  Me	
  On"	
   "2:"=""	
   ...	
  
...	
  
Row Key Columns: "<song_order>:<column	
  name>"="<column	
  value>"	
  
28d23...	
  
"1:"=""	
   "1:album"="90125"	
   "1:artist"="Yes"	
   "1:song_id"="62c36..."	
  
"1:title"="Changes"	
   "2:"=""	
   "2:album"="Nevermind"	
   "2:artist"="Nirvana"	
  
"2:song_id"="2de83..."	
   "2:title"="Breed"	
   "3:"=""	
   ...	
  
2ed91...	
  
"1:"=""	
   "1:album"="Crystal	
  Ball"	
   "1:artist"="Styx"	
   "1:song_id"="837a2..."	
  
"1:title"="Put	
  Me	
  On"	
   "2:"=""	
   ...	
  
...	
  
CQL Clustered Table (cont.)
CQL "Rows"
CREATE	
  TABLE	
  playlists	
  (	
  
	
  	
  	
  id	
  	
  	
  	
  	
  	
  	
  	
  	
  uuid,	
  
	
  	
  	
  song_order	
  int,	
  
	
  	
  	
  song_id	
  	
  	
  	
  uuid,	
  	
  	
  //	
  copied	
  from	
  songs.id	
  
	
  	
  	
  title	
  	
  	
  	
  	
  	
  text,	
  	
  	
  //	
  copied	
  from	
  songs.title	
  
	
  	
  	
  album	
  	
  	
  	
  	
  	
  text,	
  	
  	
  //	
  copied	
  from	
  songs.album	
  
	
  	
  	
  artist	
  	
  	
  	
  	
  text,	
  	
  	
  //	
  copied	
  from	
  songs.artist	
  
	
  	
  	
  PRIMARY	
  KEY	
  (id,	
  song_order)	
  	
  	
  //	
  compound	
  key	
  
);	
  
Can we make Cassandra more appealing?
•  Data Model
– No direct support for relationships
•  Indexing
– Secondary indexes: single column only
– Hash table only: no range searching
•  Searching
– No joins, embedded queries
– No aggregate queries
– Limited equalities (e.g., SELECT * WHERE <key> IN (<list>))
– No full text search
– No OR clauses
– ...
What is Doradus?
•  Java service that enhances Cassandra
•  Adds features:
– REST API (JSON and XML)
– Multi-tenancy
– Graph model
– Multi-field/full text query language
– Automatic data aging
– OLAP and Spider storage services
•  Compatible with NoSQL tenets such as idempotent
updates
•  Under development for ~3 years
•  Open source: Apache 2.0 License
Doradus Graph Model
•  A cluster hosts one of more applications
•  An application own tables which store objects
•  An object consists of single- and multi-valued fields
•  A pair of link fields form a bi-directional relationship
Message
{Size, SendDate}
Participant
{ReceiptDate}
Address
{Name}
Person
{Name, Department}
Attachment
{Size, Extension} Managerè
çEmployees
êPerson
Address é
êAttachments
Messageé
Recipientsè
çMessageAsRecipient
Addressè
çParticipants
Senderè
çMessageAsSender
Example Object and Aggregate Queries
•  Lucene full text query
GET	
  /Email/Person/_query?q=FirstName:j*	
  AND	
  NOT	
  Office:[q	
  TO	
  z]	
  
•  Link path with filtering
GET	
  /Email/Message/_query?q=	
  
Sender.WHERE(ReceiptDate>'2010-­‐06-­‐01').Address.Name="*.com"	
  
•  Quantifiers
GET	
  /Email/Message/_aggregate?m=COUNT(*)	
  	
  
&q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales	
  
&f=Tags,TOP(3,TRUNCATE(SendDate,DAY))	
  
	
  
•  Transitive links
GET	
  /Email/Person/_query?q=DirectReports^(3).LastName=wilson	
  
&f=DirectReports(Name,DirectReports(Name))	
  
	
  
Doradus: Architecture
Application
Doradus
Cassandra
REST API
Thrift or CQL
Data and
Log files
Doradus: Multi-Data Center Clusters
Cassandra
Doradus
Cassandra Cassandra
Doradus
Cassandra
Doradus
Cassandra Cassandra
Doradus
Node 1 Node 2 Node 3 Node 4 Node 5 Node 6
Rack 1, Data Center 1 Rack 1, Data Center 2
Applications Applications
DC=2, N=6, RF=3
Doradus: Internal Architecture
App App App
Monitor
App
Spider
Storage Service
OLAP
Storage Service
Cassandra Cluster
JMX
REST: Embedded Jetty Server
Cassandra Interface
doradus.yaml
REST
Doradus OLAP Service
•  Borrows from online analytical processing
– Sharding as data "cubes"
– Columnar storage
•  Very dense storage
– No indexes!
– Value arrays are compressed
•  Fast load time
– Up to 500,000 objects/second/node
– Small "data lag" time
•  Very fast queries
– Searches millions of objects/second
– Full DQL object and aggregate query support
OLAP Data Loading
EventsEventsEvents
EventsEventsPeople
EventsEventsComputers
EventsEventsDomains
Sources
OLAP Data Loading
T1
EventsEventsEvents
EventsEventsPeople
EventsEventsComputers
EventsEventsDomains
T2
T3
T4
T4
Sources Segments
…
Changes in
last n minutes
OLAP Data Loading
T1
EventsEventsEvents
EventsEventsPeople
EventsEventsComputers
EventsEventsDomains
T2
T3
T4
T4
2013-03-01
2013-02-28
2013-02-27
Sources Segments Shards
…
…
Changes in
last n minutes
Date-based shards
OLAP Data Loading
T1
EventsEventsEvents
EventsEventsPeople
EventsEventsComputers
EventsEventsDomains
T2
T3
T4
T4
2013-03-01
2013-02-28
2013-02-27
Sources Segments Shards OLAP Store
…
…
Changes in
last n minutes
Date-based shards
OLAP Use Case
•  Data: Windows Events
– 115M events
•  Test parameters
– Server: Quad Xeon CPUs, 32GB memory, 3 disks
– Cassandra memory: 1GB
– Load app/embedded Doradus memory: 4GB
– Load threads: 5
– Batch size: 5,000 events
– Shard size: 1 day (860 shards total)
•  Test results
– Total objects loaded: ~1 billion
– Total time: 32 minutes, 56 seconds
– Load rate: 502,991 objects/second
– Final database size: ~2GB
Doradus Spider Service
•  Analogous to Lucene + NoSQL
•  Fully inverted field indexing
– Configurable analyzers
– Stored-only (non-indexed) fields
•  Unique features:
– Automatic table-level sharding
– Statistics
– Pre-computed aggregate queries
– Refreshed in background
– Object-level data aging
•  Use case example:
– Indexing a massive number of documents
OLAP and Spider: When to Use
•  Spider is best for:
– Unstructured/variable-
structure data
– Configurable indexing
– Fine-grained updates with
immediate indexing
– Document storage and
searching
– Emphasis on full-text/multi-
field searching
•  OLAP is best for:
– High-volume data streams
– High performance analytic
queries
– Dense data storage
– Immutable/semi-mutable
data
– Data that can be loaded in
batches
– Data that can be partitioned
(e.g., time-sharded)
Summary
•  What's cool about Doradus?
– Bi-directional links with referential integrity
– Link paths: simpler than joins
– Idempotent updates
– Partial object updates
– Simple transitive searching
– OLAP: dense storage and fast queries
– It's free!
Thank you !
Doradus is available at:
https://github.com/dell-oss/Doradus
Contact me:
randy.guck@dell.software.com

More Related Content

PPTX
Extending Cassandra with Doradus OLAP for High Performance Analytics
PDF
Spark Dataframe - Mr. Jyotiska
PDF
SparkSQL and Dataframe
PDF
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
PDF
Fast track to getting started with DSE Max @ ING
PDF
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
PDF
Spark cassandra integration 2016
PDF
Amazon DynamoDB Lessen's Learned by Beginner
Extending Cassandra with Doradus OLAP for High Performance Analytics
Spark Dataframe - Mr. Jyotiska
SparkSQL and Dataframe
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Fast track to getting started with DSE Max @ ING
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Spark cassandra integration 2016
Amazon DynamoDB Lessen's Learned by Beginner

What's hot (19)

PDF
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
PDF
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
PDF
Cassandra introduction 2016
PDF
managing big data
PDF
Spark Cassandra 2016
PPTX
Using Spark to Load Oracle Data into Cassandra
PDF
Apache cassandra in 2016
PDF
Sasi, cassandra on full text search ride
PPTX
MongoDB 3.0
PPTX
02 data warehouse applications with hive
PDF
Om nom nom nom
PDF
Munich March 2015 - Cassandra + Spark Overview
PDF
Spark and shark
PDF
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
PDF
Datastax day 2016 : Cassandra data modeling basics
PDF
Big data 101 for beginners riga dev days
PDF
A Deep Dive Into Spark
PDF
Intro To Cascading
PPT
BDAS Shark study report 03 v1.1
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra introduction 2016
managing big data
Spark Cassandra 2016
Using Spark to Load Oracle Data into Cassandra
Apache cassandra in 2016
Sasi, cassandra on full text search ride
MongoDB 3.0
02 data warehouse applications with hive
Om nom nom nom
Munich March 2015 - Cassandra + Spark Overview
Spark and shark
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Datastax day 2016 : Cassandra data modeling basics
Big data 101 for beginners riga dev days
A Deep Dive Into Spark
Intro To Cascading
BDAS Shark study report 03 v1.1
Ad

Similar to Overiew of Cassandra and Doradus (20)

PPTX
Presentation
PPTX
Introduction to SQL Server Graph DB
PPTX
GreenDao Introduction
PDF
Manchester Hadoop Meetup: Spark Cassandra Integration
PDF
Big data analytics with Spark & Cassandra
PDF
Reading Cassandra Meetup Feb 2015: Apache Spark
PDF
C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...
PDF
3 Dundee-Spark Overview for C* developers
PDF
2012 mongo db_bangalore_roadmap_new
PDF
Postgres vs Mongo / Олег Бартунов (Postgres Professional)
PPTX
When to no sql and when to know sql javaone
PDF
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
PPTX
Cassandra Java APIs Old and New – A Comparison
PPTX
Scala meetup - Intro to spark
PDF
Big Data Processing using Apache Spark and Clojure
PDF
Hands On Spring Data
PDF
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
PDF
Databases for Data Science
KEY
Mongodb intro
PDF
Hybrid Databases - PHP UK Conference 22 February 2019
Presentation
Introduction to SQL Server Graph DB
GreenDao Introduction
Manchester Hadoop Meetup: Spark Cassandra Integration
Big data analytics with Spark & Cassandra
Reading Cassandra Meetup Feb 2015: Apache Spark
C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...
3 Dundee-Spark Overview for C* developers
2012 mongo db_bangalore_roadmap_new
Postgres vs Mongo / Олег Бартунов (Postgres Professional)
When to no sql and when to know sql javaone
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Cassandra Java APIs Old and New – A Comparison
Scala meetup - Intro to spark
Big Data Processing using Apache Spark and Clojure
Hands On Spring Data
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
Databases for Data Science
Mongodb intro
Hybrid Databases - PHP UK Conference 22 February 2019
Ad

Recently uploaded (20)

PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
Digital Strategies for Manufacturing Companies
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PPTX
Operating system designcfffgfgggggggvggggggggg
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PPTX
Odoo POS Development Services by CandidRoot Solutions
PPTX
ai tools demonstartion for schools and inter college
PDF
How Creative Agencies Leverage Project Management Software.pdf
PDF
Understanding Forklifts - TECH EHS Solution
PDF
Design an Analysis of Algorithms II-SECS-1021-03
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PPT
Introduction Database Management System for Course Database
PDF
System and Network Administration Chapter 2
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PDF
How to Migrate SBCGlobal Email to Yahoo Easily
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PPTX
CHAPTER 12 - CYBER SECURITY AND FUTURE SKILLS (1) (1).pptx
Adobe Illustrator 28.6 Crack My Vision of Vector Design
Digital Strategies for Manufacturing Companies
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Operating system designcfffgfgggggggvggggggggg
Design an Analysis of Algorithms I-SECS-1021-03
2025 Textile ERP Trends: SAP, Odoo & Oracle
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
Odoo POS Development Services by CandidRoot Solutions
ai tools demonstartion for schools and inter college
How Creative Agencies Leverage Project Management Software.pdf
Understanding Forklifts - TECH EHS Solution
Design an Analysis of Algorithms II-SECS-1021-03
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
Introduction Database Management System for Course Database
System and Network Administration Chapter 2
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
How to Migrate SBCGlobal Email to Yahoo Easily
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
CHAPTER 12 - CYBER SECURITY AND FUTURE SKILLS (1) (1).pptx

Overiew of Cassandra and Doradus

  • 1. Overview of Cassandra and The Doradus OSS Project Randy Guck Principal Engineer, Dell Software
  • 2. Overview •  What is No SQL? – Common RDB roadblocks – NoSQL database types •  Overview of Cassandra – What's unique – Limitations •  Doradus – Architecture – Features – The OLAP and Spider storage managers – What each is good for – Where to get Doradus
  • 3. Why RDB Apps Look for Something Else •  Performance – B-trees – Locking – One writable copy of each record •  Scaling costs – RDBs scale "up" – Big boxes, SANs, fiber channel, etc. •  What if you want... – Distributed access – No single points of failure – Instant failover – Sharding – Replication
  • 4. NoSQL Data Models Data Model Examples Elastic? Queries? Relationships? Key–Value LevelDB, Kyoto Cabinet, Redis No No No Distributed Key– Value Dynamo, MemcacheDB, Riak, Voldemort Yes No No Column-Oriented Accumulo, Cassandra, HBase Yes Some No Document- Oriented Couchbase, Elasticsearch, MongoDB Yes Yes Some Graph Neo4J, OrientDB, Titan No Yes Yes Sharding + replication AND/OR/ranges/etc. Built-in support
  • 5. NoSQL Data Models Data Model Examples Elastic? Queries? Relationships? Key–Value LevelDB, Kyoto Cabinet, Redis No No No Distributed Key– Value Dynamo, MemcacheDB, Riak, Voldemort Yes No No Column-Oriented Accumulo, Cassandra, HBase Yes Some No Document- Oriented Couchbase, Elasticsearch, MongoDB Yes Yes Some Graph Neo4J, OrientDB, Titan No Yes Yes Sharding + replication AND/OR/ranges/etc. Built-in support Doradus goals
  • 6. NoSQL Common Traits •  Distributed cluster of nodes – Commodity, shared-nothing servers – Scales horizontally – Expands elastically •  Replication – Performant local access – Automatic failover •  De-normalized data model •  Schemaless/dynamic columns •  Eventual consistency N=5, RF=3
  • 7. Is NoSQL Catching On? Source: db-engines.com
  • 8. Overview of Cassandra •  Wide column NoSQL database •  Open sourced by Facebook •  Apache Project with active community •  Commercially support by DataStax, Acunu, others •  Used by 1,500+ companies •  "Pure peer" architecture •  Largest known Cassandra cluster: 300+ TB data and 400+ machines.
  • 9. What is Cassandra best for? •  Continuous data streams – Logs, events, audit records, measurements, ... – Fast data ingestion – Predictable read performance •  Partitionable data – "1,000's of little databases in one" •  Elastic scalability – Expand/upgrade/repair without downtime •  Not good for: – Blob store – Persistent queue – OLTP transactions
  • 10. CQL Static Table CREATE  TABLE  songs  (        id            uuid  PRIMARY  KEY,        title      text,        album      text,        artist    text,        data        blob   );   CREATE  INDEX  ON  songs  (artist);   Row Key Columns: "<column  name>"="<column  value>"   62c36...   "album"="90125"   "artist"="Yes"   "data"=<audio>   "title"="Changes"   837a2...   "album"="Crystal  Ball"   "artist"="Styx"   "data"=<audio>   "title"="Put  Me  On"   2de83...   "album"="Nevermind"   "artist"="Nirvana"   "data"=<audio>   "title"="Breed"   ...  
  • 11. CQL Clustered Table CREATE  TABLE  playlists  (        id                  uuid,        song_order  int,        song_id        uuid,      //  copied  from  songs.id        title            text,      //  copied  from  songs.title        album            text,      //  copied  from  songs.album        artist          text,      //  copied  from  songs.artist        PRIMARY  KEY  (id,  song_order)      //  compound  key   );   Row Key Columns: "<song_order>:<column  name>"="<column  value>"   28d23...   "1:"=""   "1:album"="90125"   "1:artist"="Yes"   "1:song_id"="62c36..."   "1:title"="Changes"   "2:"=""   "2:album"="Nevermind"   "2:artist"="Nirvana"   "2:song_id"="2de83..."   "2:title"="Breed"   "3:"=""   ...   2ed91...   "1:"=""   "1:album"="Crystal  Ball"   "1:artist"="Styx"   "1:song_id"="837a2..."   "1:title"="Put  Me  On"   "2:"=""   ...   ...  
  • 12. Row Key Columns: "<song_order>:<column  name>"="<column  value>"   28d23...   "1:"=""   "1:album"="90125"   "1:artist"="Yes"   "1:song_id"="62c36..."   "1:title"="Changes"   "2:"=""   "2:album"="Nevermind"   "2:artist"="Nirvana"   "2:song_id"="2de83..."   "2:title"="Breed"   "3:"=""   ...   2ed91...   "1:"=""   "1:album"="Crystal  Ball"   "1:artist"="Styx"   "1:song_id"="837a2..."   "1:title"="Put  Me  On"   "2:"=""   ...   ...   CQL Clustered Table (cont.) CQL "Rows" CREATE  TABLE  playlists  (        id                  uuid,        song_order  int,        song_id        uuid,      //  copied  from  songs.id        title            text,      //  copied  from  songs.title        album            text,      //  copied  from  songs.album        artist          text,      //  copied  from  songs.artist        PRIMARY  KEY  (id,  song_order)      //  compound  key   );  
  • 13. Can we make Cassandra more appealing? •  Data Model – No direct support for relationships •  Indexing – Secondary indexes: single column only – Hash table only: no range searching •  Searching – No joins, embedded queries – No aggregate queries – Limited equalities (e.g., SELECT * WHERE <key> IN (<list>)) – No full text search – No OR clauses – ...
  • 14. What is Doradus? •  Java service that enhances Cassandra •  Adds features: – REST API (JSON and XML) – Multi-tenancy – Graph model – Multi-field/full text query language – Automatic data aging – OLAP and Spider storage services •  Compatible with NoSQL tenets such as idempotent updates •  Under development for ~3 years •  Open source: Apache 2.0 License
  • 15. Doradus Graph Model •  A cluster hosts one of more applications •  An application own tables which store objects •  An object consists of single- and multi-valued fields •  A pair of link fields form a bi-directional relationship Message {Size, SendDate} Participant {ReceiptDate} Address {Name} Person {Name, Department} Attachment {Size, Extension} Managerè çEmployees êPerson Address é êAttachments Messageé Recipientsè çMessageAsRecipient Addressè çParticipants Senderè çMessageAsSender
  • 16. Example Object and Aggregate Queries •  Lucene full text query GET  /Email/Person/_query?q=FirstName:j*  AND  NOT  Office:[q  TO  z]   •  Link path with filtering GET  /Email/Message/_query?q=   Sender.WHERE(ReceiptDate>'2010-­‐06-­‐01').Address.Name="*.com"   •  Quantifiers GET  /Email/Message/_aggregate?m=COUNT(*)     &q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales   &f=Tags,TOP(3,TRUNCATE(SendDate,DAY))     •  Transitive links GET  /Email/Person/_query?q=DirectReports^(3).LastName=wilson   &f=DirectReports(Name,DirectReports(Name))    
  • 18. Doradus: Multi-Data Center Clusters Cassandra Doradus Cassandra Cassandra Doradus Cassandra Doradus Cassandra Cassandra Doradus Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Rack 1, Data Center 1 Rack 1, Data Center 2 Applications Applications DC=2, N=6, RF=3
  • 19. Doradus: Internal Architecture App App App Monitor App Spider Storage Service OLAP Storage Service Cassandra Cluster JMX REST: Embedded Jetty Server Cassandra Interface doradus.yaml REST
  • 20. Doradus OLAP Service •  Borrows from online analytical processing – Sharding as data "cubes" – Columnar storage •  Very dense storage – No indexes! – Value arrays are compressed •  Fast load time – Up to 500,000 objects/second/node – Small "data lag" time •  Very fast queries – Searches millions of objects/second – Full DQL object and aggregate query support
  • 25. OLAP Use Case •  Data: Windows Events – 115M events •  Test parameters – Server: Quad Xeon CPUs, 32GB memory, 3 disks – Cassandra memory: 1GB – Load app/embedded Doradus memory: 4GB – Load threads: 5 – Batch size: 5,000 events – Shard size: 1 day (860 shards total) •  Test results – Total objects loaded: ~1 billion – Total time: 32 minutes, 56 seconds – Load rate: 502,991 objects/second – Final database size: ~2GB
  • 26. Doradus Spider Service •  Analogous to Lucene + NoSQL •  Fully inverted field indexing – Configurable analyzers – Stored-only (non-indexed) fields •  Unique features: – Automatic table-level sharding – Statistics – Pre-computed aggregate queries – Refreshed in background – Object-level data aging •  Use case example: – Indexing a massive number of documents
  • 27. OLAP and Spider: When to Use •  Spider is best for: – Unstructured/variable- structure data – Configurable indexing – Fine-grained updates with immediate indexing – Document storage and searching – Emphasis on full-text/multi- field searching •  OLAP is best for: – High-volume data streams – High performance analytic queries – Dense data storage – Immutable/semi-mutable data – Data that can be loaded in batches – Data that can be partitioned (e.g., time-sharded)
  • 28. Summary •  What's cool about Doradus? – Bi-directional links with referential integrity – Link paths: simpler than joins – Idempotent updates – Partial object updates – Simple transitive searching – OLAP: dense storage and fast queries – It's free!
  • 29. Thank you ! Doradus is available at: https://github.com/dell-oss/Doradus Contact me: randy.guck@dell.software.com