Overview of Cassandra and Doradus

Overview of Cassandra and the Doradus OSS Project
Randy Guck
Principal Engineer, Dell Software

Overview
•  What is NoSQL?
– Common RDB roadblocks
– NoSQL database types
•  Overview of Cassandra
– What's unique
– Limitations
•  Doradus
– Architecture
– Features
– The OLAP and Spider storage managers
– What each is good for
– Where to get Doradus

Why RDB Apps Look for Something Else
•  Performance
– B-trees
– Locking
– One writable copy of each record
•  Scaling costs
– RDBs scale "up"
– Big boxes, SANs, Fibre Channel, etc.
•  What if you want...
– Distributed access
– No single points of failure
– Instant failover
– Sharding
– Replication

NoSQL Data Models
Data Model             Examples                             Elastic?  Queries?  Relationships?
Key–Value              LevelDB, Kyoto Cabinet, Redis        No        No        No
Distributed Key–Value  Dynamo, MemcacheDB, Riak, Voldemort  Yes       No        No
Column-Oriented        Accumulo, Cassandra, HBase           Yes       Some      No
Document-Oriented      Couchbase, Elasticsearch, MongoDB    Yes       Yes       Some
Graph                  Neo4J, OrientDB, Titan               No        Yes       Yes
Elastic? Sharding + replication
Queries? AND/OR/ranges/etc.
Relationships? Built-in support

Doradus goals

NoSQL Common Traits
•  Distributed cluster of nodes
– Commodity, shared-nothing servers
– Scales horizontally
– Expands elastically
•  Replication
– Performant local access
– Automatic failover
•  De-normalized data model
•  Schemaless/dynamic columns
•  Eventual consistency
[Diagram: example cluster with N=5 nodes and RF=3 (replication factor)]
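To make the N=5, RF=3 picture concrete, here is a minimal sketch of how a key could be mapped to three of five nodes on a hash ring. It is an illustration only: the MD5-based `token` function, the node names, and the "next RF nodes clockwise" rule are assumptions for this example and are not Cassandra's actual partitioner or replication strategy.

```python
import hashlib
from bisect import bisect_right


def token(value):
    """Map a string to a ring position (illustrative hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas(key, nodes, rf):
    """Return the rf distinct nodes that hold copies of key, walking the ring clockwise.

    Assumes rf <= len(nodes).
    """
    ring = sorted((token(name), name) for name in nodes)
    tokens = [t for t, _ in ring]
    # First node whose token is greater than the key's token, wrapping around.
    start = bisect_right(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


nodes = ["node1", "node2", "node3", "node4", "node5"]  # N=5 (hypothetical names)
owners = replicas("62c36", nodes, rf=3)                # RF=3
print(owners)
```

Because every key lives on three nodes, any single node can fail and two copies of each key remain reachable, which is the basis for the automatic failover bullet above.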

Is NoSQL Catching On?
Source: db-engines.com

Overview of Cassandra
•  Wide column NoSQL database
•  Open sourced by Facebook
•  Apache Project with active community
•  Commercially supported by DataStax, Acunu, and others
•  Used by 1,500+ companies
•  "Pure peer" architecture
•  Largest known Cassandra cluster:
300+ TB data and 400+ machines.

What is Cassandra best for?
•  Continuous data streams
– Logs, events, audit records, measurements, ...
– Fast data ingestion
– Predictable read performance
•  Partitionable data
– "1,000's of little databases in one"
•  Elastic scalability
– Expand/upgrade/repair without downtime
•  Not good for:
– Blob store
– Persistent queue
– OLTP transactions

CQL Static Table
CREATE TABLE songs (
    id     uuid PRIMARY KEY,
    title  text,
    album  text,
    artist text,
    data   blob
);

CREATE INDEX ON songs (artist);

Row Key    Columns: "<column name>"="<column value>"
62c36...   "album"="90125"         "artist"="Yes"      "data"=<audio>  "title"="Changes"
837a2...   "album"="Crystal Ball"  "artist"="Styx"     "data"=<audio>  "title"="Put Me On"
2de83...   "album"="Nevermind"     "artist"="Nirvana"  "data"=<audio>  "title"="Breed"
...

CQL Clustered Table
CREATE TABLE playlists (
    id         uuid,
    song_order int,
    song_id    uuid,  // copied from songs.id
    title      text,  // copied from songs.title
    album      text,  // copied from songs.album
    artist     text,  // copied from songs.artist
    PRIMARY KEY (id, song_order)  // compound key
);

Row Key    Columns: "<song_order>:<column name>"="<column value>"
28d23...   "1:"=""  "1:album"="90125"         "1:artist"="Yes"      "1:song_id"="62c36..."  "1:title"="Changes"
           "2:"=""  "2:album"="Nevermind"     "2:artist"="Nirvana"  "2:song_id"="2de83..."  "2:title"="Breed"
           "3:"=""  ...
2ed91...   "1:"=""  "1:album"="Crystal Ball"  "1:artist"="Styx"     "1:song_id"="837a2..."  "1:title"="Put Me On"
           "2:"=""  ...
...
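The mapping between CQL rows and the "<song_order>:<column name>" layout above can be sketched in a few lines. This is a simplified model, not Cassandra's storage code: the function names are hypothetical, values are plain strings, and columns are sorted as strings, whereas a real cluster orders them by the typed clustering key.

```python
def to_storage_row(cql_rows):
    """Flatten CQL rows that share one partition key into prefixed columns."""
    columns = {}
    for row in cql_rows:
        prefix = f"{row['song_order']}:"
        columns[prefix] = ""  # marker column, like the "1:"="" entries above
        for name, value in row.items():
            if name not in ("id", "song_order"):
                columns[prefix + name] = value
    # String sort is a simplification; it only matches numeric order for 1-9.
    return dict(sorted(columns.items()))


def to_cql_rows(partition_id, columns):
    """Regroup prefixed columns into one CQL row per song_order value."""
    rows = {}
    for name, value in columns.items():
        order, _, column = name.partition(":")
        row = rows.setdefault(int(order), {"id": partition_id, "song_order": int(order)})
        if column:  # skip the empty marker column
            row[column] = value
    return [rows[order] for order in sorted(rows)]


playlist = [
    {"id": "28d23...", "song_order": 1, "song_id": "62c36...",
     "title": "Changes", "album": "90125", "artist": "Yes"},
    {"id": "28d23...", "song_order": 2, "song_id": "2de83...",
     "title": "Breed", "album": "Nevermind", "artist": "Nirvana"},
]
stored = to_storage_row(playlist)
print(list(stored))
```

Calling `to_cql_rows("28d23...", stored)` returns the two original rows, which shows that one wide storage row holds an entire playlist.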

CQL Clustered Table (cont.)
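CQL "Rows": within each row key, the columns that share a <song_order> prefix ("1:", "2:", ...) together form one CQL row.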

Can we make Cassandra more appealing?
•  Data Model
– No direct support for relationships
•  Indexing
– Secondary indexes: single column only
– Hash table only: no range searching
•  Searching
– No joins or embedded queries
– No aggregate queries
– Limited equalities (e.g., SELECT * WHERE <key> IN (<list>))
– No full text search
– No OR clauses
– ...

What is Doradus?
•  Java service that enhances Cassandra
•  Adds features:
– REST API (JSON and XML)
– Multi-tenancy
– Graph model
– Multi-field/full-text query language
– Automatic data aging
– OLAP and Spider storage services
•  Compatible with NoSQL tenets such as idempotent updates
•  Under development for ~3 years
•  Open source: Apache 2.0 License

Doradus Graph Model
•  A cluster hosts one or more applications
•  An application owns tables, which store objects
•  An object consists of single- and multi-valued fields
•  A pair of link fields forms a bi-directional relationship
[Diagram: example schema]
Tables {fields}:
Message {Size, SendDate}
Participant {ReceiptDate}
Address {Name}
Person {Name, Department}
Attachment {Size, Extension}
Link pairs (each link and its inverse):
Manager → / ← Employees
↓ Person / Address ↑
↓ Attachments / Message ↑
Recipients → / ← MessageAsRecipient
Address → / ← Participants
Sender → / ← MessageAsSender

Example Object and Aggregate Queries
•  Lucene full text query
GET /Email/Person/_query?q=FirstName:j* AND NOT Office:[q TO z]

•  Link path with filtering
GET /Email/Message/_query?q=
    Sender.WHERE(ReceiptDate>'2010-06-01').Address.Name="*.com"

•  Quantifiers
GET /Email/Message/_aggregate?m=COUNT(*)
    &q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales
    &f=Tags,TOP(3,TRUNCATE(SendDate,DAY))

•  Transitive links
GET /Email/Person/_query?q=DirectReports^(3).LastName=wilson
    &f=DirectReports(Name,DirectReports(Name))
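The queries above contain spaces, brackets, and quotes, so a client has to percent-encode them before sending the request. The sketch below only builds the URL and does not contact a server. The `/<application>/<table>/_query` path and the `q` and `f` parameters come from the examples above; the host, port, and the `query_url` helper are assumptions for illustration.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "http://localhost:1123"  # hypothetical host and port


def query_url(application, table, query, fields=None, base=BASE_URL):
    """Build an object-query URL with the q (and optional f) parameters encoded."""
    params = {"q": query}
    if fields:
        params["f"] = fields
    path = f"/{quote(application)}/{quote(table)}/_query"
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{base}{path}?{urlencode(params, quote_via=quote)}"


url = query_url("Email", "Person", "FirstName:j* AND NOT Office:[q TO z]")
print(url)
# Decoding the query string recovers the original expression.
print(parse_qs(urlsplit(url).query)["q"][0])
```

The same helper covers the transitive-link example by passing `fields="DirectReports(Name,DirectReports(Name))"`.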

Doradus: Architecture
Application
   |  REST API
Doradus
   |  Thrift or CQL
Cassandra
   |
Data and log files

Doradus: Multi-Data Center Clusters
[Diagram: two data centers, each with its own applications, Doradus instances, and Cassandra nodes]
Rack 1, Data Center 1: Node 1, Node 2, Node 3
Rack 1, Data Center 2: Node 4, Node 5, Node 6
DC=2, N=6, RF=3

Doradus: Internal Architecture
[Diagram: Doradus internal components]
Clients: App, App, App (REST); Monitor App (JMX)
REST: Embedded Jetty Server
Spider Storage Service
OLAP Storage Service
Cassandra Interface
Cassandra Cluster
Configuration: doradus.yaml

Doradus OLAP Service
•  Borrows from online analytical processing
– Sharding as data "cubes"
– Columnar storage
•  Very dense storage
– No indexes!
– Value arrays are compressed
•  Fast load time
– Up to 500,000 objects/second/node
– Small "data lag" time
•  Very fast queries
– Searches millions of objects/second
– Full DQL object and aggregate query support

OLAP Data Loading
[Diagram, built up over four slides: Sources → Segments → Shards → OLAP Store]
Sources: Events, People, Computers, Domains
Segments: T1, T2, T3, T4, … (changes in last n minutes)
Shards: 2013-03-01, 2013-02-28, 2013-02-27, … (date-based shards)
OLAP Store
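The routing of a loaded batch into date-based shards can be sketched as follows. This is a simplified illustration rather than Doradus's loader: the event dictionaries, the `timestamp` field, and the function names are hypothetical, and the shard name format simply follows the dates shown in the diagram.

```python
from collections import defaultdict
from datetime import datetime


def shard_name(timestamp):
    """Name a one-day shard after the event's calendar date, e.g. 2013-03-01."""
    return timestamp.strftime("%Y-%m-%d")


def assign_to_shards(batch):
    """Group one batch of events by the date-based shard each belongs to."""
    shards = defaultdict(list)
    for event in batch:
        shards[shard_name(event["timestamp"])].append(event)
    return dict(shards)


batch = [
    {"id": 1, "timestamp": datetime(2013, 3, 1, 8, 15)},
    {"id": 2, "timestamp": datetime(2013, 2, 28, 23, 59)},
    {"id": 3, "timestamp": datetime(2013, 3, 1, 9, 0)},
]
shards = assign_to_shards(batch)
print(sorted(shards))  # ['2013-02-28', '2013-03-01']
```

Partitioning by date means a query over a time range only needs to touch the shards for those days.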

OLAP Use Case
•  Data: Windows Events
– 115M events
•  Test parameters
– Server: Quad Xeon CPUs, 32GB memory, 3 disks
– Cassandra memory: 1GB
– Load app/embedded Doradus memory: 4GB
– Load threads: 5
– Batch size: 5,000 events
– Shard size: 1 day (860 shards total)
•  Test results
– Total objects loaded: ~1 billion
– Total time: 32 minutes, 56 seconds
– Load rate: 502,991 objects/second
– Final database size: ~2GB

Doradus Spider Service
•  Analogous to Lucene + NoSQL
•  Fully inverted field indexing
– Configurable analyzers
– Stored-only (non-indexed) fields
•  Unique features:
– Automatic table-level sharding
– Statistics
– Pre-computed aggregate queries
– Refreshed in background
– Object-level data aging
•  Use case example:
– Indexing a massive number of documents

OLAP and Spider: When to Use
•  Spider is best for:
– Unstructured/variable-structure data
– Configurable indexing
– Fine-grained updates with immediate indexing
– Document storage and searching
– Emphasis on full-text/multi-field searching
•  OLAP is best for:
– High-volume data streams
– High-performance analytic queries
– Dense data storage
– Immutable/semi-mutable data
– Data that can be loaded in batches
– Data that can be partitioned (e.g., time-sharded)

Summary
•  What's cool about Doradus?
– Bi-directional links with referential integrity
– Link paths: simpler than joins
– Idempotent updates
– Partial object updates
– Simple transitive searching
– OLAP: dense storage and fast queries
– It's free!

Thank you!
Doradus is available at:
https://github.com/dell-oss/Doradus
Contact me:
randy.guck@dell.software.com
