MONGODB TO CASSANDRA

ARCHITECTURAL LESSONS

Jon Haddad & Blake Eggleston
Overview
• Differences in DB Architectures
• SHIFT Platform
• SHIFT Media Manager
• Intro to cqlengine
MongoDB Architecture
Important Concepts
• replica set (master / slave)
• shard (replica set within a cluster)
• config server (topology)
• mongos (router)
• Shard key is an indexed field that determines the shard a particular document belongs to

sources: http://docs.mongodb.org/manual/core/sharded-cluster-architectures-production/, http://docs.mongodb.org/manual/core/sharding-shard-key/
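A minimal sketch of how a router might use a shard key, assuming range-based chunks. The `Router` class, the `user_id` shard key and the shard names are hypothetical and are not MongoDB's API. A query that includes the shard key goes to one shard; a query without it has to go to every shard.

```python
import bisect


class Router:
    """Toy stand-in for a mongos-style router over range-partitioned shards."""

    def __init__(self, split_points, shards):
        # shards[i] holds shard-key values below split_points[i];
        # the last shard holds everything from the final split point up.
        if len(shards) != len(split_points) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.split_points = split_points
        self.shards = shards

    def targets(self, query, shard_key="user_id"):
        """Return the shards that must be contacted for this query."""
        if shard_key in query:
            # Targeted query: the shard key value picks exactly one range.
            index = bisect.bisect_right(self.split_points, query[shard_key])
            return [self.shards[index]]
        # No shard key in the query: scatter-gather across all shards.
        return list(self.shards)


router = Router([100, 200], ["shardA", "shardB", "shardC"])
print(router.targets({"user_id": 150}))
print(router.targets({"email": "jon@shift.com"}))
```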
Cassandra Architecture
• Only 1 type of server (Cassandra)
• Ring Based Replication (no master
or slave)
• No single point of failure
• Key hashes to a location in the ring
• Replication Factor (RF=3)
• Limited query flexibility (always
select by key)
• Each query has a consistency level
source: http://developer.rackspace.com/images/2013-03-27-rackspace-service-registry-status-update/vnodes.png
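The ring idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Cassandra's implementation: each node owns a single token (no vnodes), MD5 stands in for the partitioner hash, and the `Ring` class and node names are hypothetical. A key hashes to a position on the ring and is stored on the next RF distinct nodes clockwise.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: one token per node, replicas are the next `rf` nodes."""

    def __init__(self, nodes, rf=3):
        self.rf = rf
        # Sort nodes by token so the list can be walked clockwise.
        self.tokens = sorted((self._hash(node), node) for node in nodes)

    @staticmethod
    def _hash(value):
        # MD5 is used only as a stable, evenly distributed hash.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key):
        """Return the nodes responsible for `key`, primary replica first."""
        start = bisect.bisect_right(self.tokens, (self._hash(key), ""))
        count = min(self.rf, len(self.tokens))
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(count)
        ]


ring = Ring(["node1", "node2", "node3", "node4", "node5"], rf=3)
print(ring.replicas("campaign:42"))
```

Because every node can compute the same answer from the key alone, any node can act as coordinator and there is no master to fail.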
Cassandra Storage
• SSTables are immutable
• Each column includes a timestamp of when it was written
• The same column can exist for a given key in multiple
SSTables
• Deletes are written as tombstones
• SSTables are periodically merged (compaction)
• Compaction keeps the column with the latest timestamp
on conflicts

source: http://developer.rackspace.com/images/2013-03-27-rackspace-service-registry-status-update/vnodes.png
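The merge rule above can be sketched as follows. This is a simplified model, not Cassandra's code: each SSTable is a plain dict mapping `(row_key, column)` to `(timestamp, value)`, and `TOMBSTONE`, `compact` and `read` are hypothetical names. The sketch keeps tombstones in the merged output; Cassandra itself only purges them after a grace period.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


def compact(*sstables):
    """Merge SSTables, keeping the newest cell for each (row_key, column)."""
    merged = {}
    for table in sstables:
        for cell, (timestamp, value) in table.items():
            # On conflict, the cell with the latest timestamp wins.
            if cell not in merged or timestamp > merged[cell][0]:
                merged[cell] = (timestamp, value)
    return merged


def read(sstable, row_key, column):
    """Return the live value, or None if the cell is missing or deleted."""
    cell = sstable.get((row_key, column))
    if cell is None or cell[1] is TOMBSTONE:
        return None
    return cell[1]


older = {("jon", "email"): (1, "jon@old.example"), ("jon", "team"): (1, "ops")}
newer = {("jon", "email"): (2, "jon@shift.com"), ("jon", "team"): (3, TOMBSTONE)}
merged = compact(older, newer)
print(read(merged, "jon", "email"))
print(read(merged, "jon", "team"))
```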
Cassandra Writes
• Writes can be sent to any node in the cluster (the coordinator), which determines which replicas should receive them
• Writes are saved in memory to a “memtable” and appended to a commit log
• Memtables are flushed to disk periodically as SSTables

source: http://www.datastax.com/docs/_images/write_access.png
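A rough sketch of the write path on a single node, assuming a size-based flush trigger. The `Node` class and its `flush_threshold` parameter are hypothetical, and clearing the whole commit log on flush is a simplification of how log segments are actually recycled.

```python
class Node:
    """Toy write path: commit log for durability, memtable for speed."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []  # append-only record of unflushed writes
        self.memtable = {}    # (row_key, column) -> (timestamp, value)
        self.sstables = []    # flushed tables, never modified afterwards
        self.flush_threshold = flush_threshold

    def write(self, row_key, column, value, timestamp):
        # Append to the log first so the write survives a crash.
        self.commit_log.append((row_key, column, value, timestamp))
        current = self.memtable.get((row_key, column))
        if current is None or timestamp >= current[0]:
            self.memtable[(row_key, column)] = (timestamp, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memtable out as a sorted, immutable SSTable."""
        if not self.memtable:
            return
        sstable = {cell: self.memtable[cell] for cell in sorted(self.memtable)}
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log
```

No existing data is read or rewritten on the write path, which is one reason writes are cheap compared with reads.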
Cassandra Reads
• Any server may be queried; it acts as the coordinator
• The coordinator contacts the nodes holding the requested key
• On each node, data is pulled from SSTables and merged
• Performs read repair if necessary
• Reads are a more time-consuming operation than writes.

source: http://www.datastax.com/docs/_images/write_access.png
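A simplified sketch of a coordinator read with read repair. It assumes every replica is contacted and that each replica is a dict of `key -> (timestamp, value)`; the function name `coordinator_read` is hypothetical, and real reads contact only as many replicas as the consistency level requires.

```python
def coordinator_read(replicas, key):
    """Return the newest value for `key` and repair any stale replicas."""
    responses = [(replica.get(key), replica) for replica in replicas]
    found = [cell for cell, _ in responses if cell is not None]
    if not found:
        return None
    # The cell with the latest timestamp is the winner.
    newest = max(found, key=lambda cell: cell[0])
    # Read repair: push the winning cell to replicas that lag behind.
    for cell, replica in responses:
        if cell is None or cell[0] < newest[0]:
            replica[key] = newest
    return newest[1]


a = {"msg1": (1, "draft")}
b = {"msg1": (2, "sent")}
c = {}
print(coordinator_read([a, b, c], "msg1"))
print(a, c)
```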
MongoDB Advantages
• Very Flexible Documents

• Very Flexible Queries

• Full text search (2.4)

• Aggregation Framework

• Geospatial Indexes / Queries

• Really good documentation
MongoDB Pitfalls
• Many queries will route to entire cluster
• Overwriting documents / changing doc sizes causes memory fragmentation problems (db repair)
• Query language is awkward for humans
• Queries that go to disk pay an enormous penalty
• Max size of 256GB per collection

source: https://blog.serverdensity.com/map-reduce-and-mongodb/
Cassandra Advantages
• Multi data center aware & reliable
• Fewer moving parts
• No DB / table locking
• Unbelievable with time series data (stats)
• Performance scales linearly as you add servers
• Optimized compaction options for traditional spinning
disks and SSDs
• Lots of control over how your data is stored on disk.
Cassandra Pitfalls
• Secondary Indexes have hidden costs
• Individual reads (single rows) are not as fast as other DBs
• JVM can be intimidating (GC)
• Data modeling requires more planning
• Generally need to construct a table per query you intend to run
• Ad hoc queries or queries with lots of permutations can be
very difficult to model
• We complement Cassandra with Elasticsearch for these types
of queries (also Solr & DS Enterprise are good choices)
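To illustrate the "table per query" point, here is a small sketch in plain Python, with dicts standing in for tables. The `AdTables` class, its field names and the two queries are hypothetical examples; the point is that the same row is written twice, once under each key it will later be read by.

```python
from collections import defaultdict


class AdTables:
    """Two denormalized 'tables', each keyed for exactly one query."""

    def __init__(self):
        self.ads_by_campaign = defaultdict(dict)  # campaign -> {ad: row}
        self.ads_by_type = defaultdict(dict)      # type -> {ad: row}

    def save(self, campaign, ad, ad_type):
        row = {"campaign": campaign, "ad": ad, "type": ad_type}
        # Every write goes to both tables; reads never need a join or scan.
        self.ads_by_campaign[campaign][ad] = row
        self.ads_by_type[ad_type][ad] = dict(row)

    def in_campaign(self, campaign):
        return list(self.ads_by_campaign.get(campaign, {}).values())

    def of_type(self, ad_type):
        return list(self.ads_by_type.get(ad_type, {}).values())
```

A third access pattern (say, ads by creation date) would need a third table, which is why queries with many permutations are hard to model this way.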
Media Manager
Social Analytics
What is Media Manager?
• Ad buying and management tool for Facebook, Twitter

• We sync ~2 billion ad stats a month

• We roll up stats at multiple levels in real time

• 10 node C* cluster, AWS high I/O

• Peaked at 150K queries / second

• Approx 150GB of data, growing 10% / week
Real time Rollups
• A single row per parent object type & date
• For any object (teams, folders, campaign) we can perform a rollup for a given date by accessing only a single row. This limits our I/O and is extremely efficient.
• New ad stats are propagated up immediately in rollups with very few reads.

[Figure: a "campaign+date" row with columns ad1, ad2, ad3, each holding stats, rolls up into a "folder+date" row with columns campaign1, campaign2, campaign3, each holding stats]
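The rollup layout can be sketched in plain Python. This is an illustration rather than the Media Manager schema: the `Rollups` class, the `ancestry` argument and the metric names are assumptions. Each incoming stat is added blindly to one row per ancestor, so a rollup for any parent and date reads exactly one row.

```python
from collections import defaultdict


class Rollups:
    """One row per (parent_type, parent_id, day); one column per child."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def record(self, ancestry, day, stats):
        """Propagate new stats up the hierarchy.

        ancestry lists (type, id) pairs from the top parent down to the ad,
        e.g. [("folder", "f1"), ("campaign", "c1"), ("ad", "ad1")].
        """
        for (ptype, pid), (_, child_id) in zip(ancestry, ancestry[1:]):
            row = self.rows[(ptype, pid, day)]
            for metric, amount in stats.items():
                # Increment only; no read of the previous value is needed.
                row[child_id][metric] += amount

    def rollup(self, parent_type, parent_id, day):
        """Sum every child column in a single row."""
        totals = defaultdict(int)
        row = self.rows.get((parent_type, parent_id, day), {})
        for child_stats in row.values():
            for metric, amount in child_stats.items():
                totals[metric] += amount
        return dict(totals)


r = Rollups()
path = [("folder", "f1"), ("campaign", "c1"), ("ad", "ad1")]
r.record(path, "2013-11-01", {"clicks": 3})
print(r.rollup("folder", "f1", "2013-11-01"))
```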
Why Cassandra?
• Almost our entire DB is in our working set.

• We have rows on disk that are inconsistently
sized, so heuristics on doc size for
preallocation are not useful.


• We could not tolerate unpredictable query
behavior due to disk access.
SHIFT.com

Collaboration Platform
Real time Collaboration
• Built for Marketers

• Allows communication across departments and organizations

• 3rd Party Applications
Messaging
• Messages are fanned out to an entire team

• Teams may have hundreds of members

• Each member has a perspectival view of their messages and
their own metadata on those messages (tags & unread)
Message Inbox
• When a message is sent or replied to, we insert a record with a timeuuid into a person's stream which points to the message.
• Timeuuids are stored on disk in reverse order of the embedded timestamp
• We can easily query the row for the first N items in the user's inbox
• We store multiple views as tags for each user to quickly surface messages in different contexts.

user    timeuuid1   timeuuid2   timeuuid3
jon     msg1        msg2        msg3
blake   msg3        msg1        msg2
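A small sketch of the inbox pattern using Python's standard `uuid` module. The `Inbox` class is hypothetical and sorts in memory at read time, whereas in Cassandra the reverse ordering would come from the table's clustering order on disk. Version 1 UUIDs embed a timestamp, exposed in Python as the `.time` attribute.

```python
import uuid
from itertools import islice


class Inbox:
    """One row per user; columns are timeuuids pointing at message ids."""

    def __init__(self):
        self.rows = {}  # user -> {timeuuid: message_id}

    def deliver(self, team, message_id, entry=None):
        """Fan a message out to every member of the team."""
        entry = entry or uuid.uuid1()  # time-based (version 1) UUID
        for user in team:
            self.rows.setdefault(user, {})[entry] = message_id
        return entry

    def first(self, user, n):
        """Return the n most recent message ids for a user."""
        row = self.rows.get(user, {})
        newest_first = sorted(
            row.items(), key=lambda item: item[0].time, reverse=True
        )
        return [message_id for _, message_id in islice(newest_first, n)]


inbox = Inbox()
inbox.deliver(["jon", "blake"], "msg1")
inbox.deliver(["jon"], "msg2")
print(inbox.first("jon", 10))
```

A reply can be surfaced at the top of a stream by delivering the same message id again with a fresh timeuuid, which is how two users can see the same messages in different orders.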
CQLENGINE
python CQL3 mapper
cqlengine features
• CQL3 Object Mapper for Python
• Supports Cassandra 1.2
• Builds queries supporting the following:
  • TTLs
  • Per Query Consistency
  • Blind Table Updates
  • Batch Queries
  • Counters
  • Maps, sets, lists
• Schema management
• Per table compaction settings
• Table Polymorphism
Table Polymorphism
• In a single table we can have heterogeneous objects
• We use this on Media Manager for Ad types

campaign   ad   type
1          1    page_post
1          2    mobile_ad
1          3    application_ad
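The idea behind table polymorphism can be shown without a database. The sketch below is plain Python and does not use cqlengine's API; the `Ad` base class, its subclasses and `from_row` are hypothetical. A discriminator column (`type`) decides which class each row of the shared table is loaded as.

```python
class Ad:
    """Base class; subclasses register themselves under their type value."""

    _types = {}
    ad_type = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.ad_type is not None:
            Ad._types[cls.ad_type] = cls

    def __init__(self, campaign, ad):
        self.campaign = campaign
        self.ad = ad

    @classmethod
    def from_row(cls, row):
        # The 'type' column selects the concrete class for this row.
        klass = cls._types[row["type"]]
        return klass(row["campaign"], row["ad"])


class PagePost(Ad):
    ad_type = "page_post"


class MobileAd(Ad):
    ad_type = "mobile_ad"


class ApplicationAd(Ad):
    ad_type = "application_ad"


rows = [
    {"campaign": 1, "ad": 1, "type": "page_post"},
    {"campaign": 1, "ad": 2, "type": "mobile_ad"},
    {"campaign": 1, "ad": 3, "type": "application_ad"},
]
ads = [Ad.from_row(row) for row in rows]
print([type(ad).__name__ for ad in ads])
```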
Upcoming Features
• Work seamlessly with multiple clusters

• Native driver integration

• Key cache / row cache configuration

• Cassandra 2.0 features

• Third party plugins
  • session
  • flask
  • identity map
THANK YOU
Jon: jon@shift.com, @rustyrazorblade
Blake: blake@shift.com, @beggleston

www.shift.com
SANTA MONICA 310.310.8315
PALO ALTO 650.804.8319
NEW YORK 646.649.2972
CHICAGO 312.465.2152


Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
