Demi Ben-Ari
Cassandra Meetup – 10/11/2015
Israel
About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” unit - IAF
Agenda
 Data flow with MongoDB
 The Problem
 Solution
 Lessons learned from a Newbie
 Conclusion
Environment Description
[Diagram labels: Cluster; Dev Testing; Live; Staging; Production Env]
Data pipeline flow – Use Case
 Batch Apache Spark applications running every 10-60 minutes
 Request Rate:
◩ Bursts of ~9 million requests per batch job
◩ Beginning – Reads
◩ End - Writes
Workflow with MongoDB
[Diagram labels: Spark Cluster (Worker 1, Worker 2, ..., Worker N); MongoDB Replica Set; Master; arrows: Write, Read]
Spark Slave - Server Specs
 Instance Type: r3.xlarge
 CPUs: 4
 RAM: 30.5GB
 Storage: ephemeral
 Number of instances: 10+
MongoDB - Server Specs
 MongoDB version: 2.6.1
 Instance Type: m3.xlarge (AWS)
 CPUs: 4
 RAM: 15GB
 Storage: EBS
 DB Size: ~500GB
 Collection Indexes: 5 (4 compound)
The Problem
 Batch jobs
◩ Should run for 5-10 minutes in total
◩ Actual - runs for ~40 minutes
 Why?
◩ ~20 minutes to write with the Java MongoDB driver, asynchronously (unacknowledged writes)
◩ ~20 minutes to sync the journal
◩ Total: ~40 minutes during which the DB is unavailable
◩ No batch process responses and no UI serving
Alternative Solutions
 Sharded MongoDB (with replica sets)
◩ Pros:
 Throughput increases with the number of shards
 Increases the availability of the DB
◩ Cons:
 Very hard to manage DevOps-wise (for a small team of developers)
 High server cost, because each shard needs 3 replicas
Workflow with MongoDB
[Diagram labels: Spark Cluster (Worker 1, Worker 2, ..., Worker N); Master; Master; arrows: Write, Read]
Our DevOps – After that solution
We had no DevOps guy at all at that time
Alternative Solutions
 DynamoDB (We’re hosted on Amazon)
◩ Pros:
 No need to manage DevOps
◩ Cons:
 “Catholic wedding” with Amazon’s service (vendor lock-in)
 Not enough documented use cases
 The service might become costly
Alternative Solutions
 Apache Cassandra
◩ Pros:
 Very large developer community
 Linearly scalable Database
 No single master architecture
 Proven to work with distributed engines like Apache Spark
◩ Cons:
 We had no experience at all with the database
 No geospatial index: we needed to implement one ourselves
The Solution
 Migration to Apache Cassandra (Steps)
◩ Writing to Mongo and Cassandra simultaneously
◩ Easily creating a Cassandra cluster using the DataStax Community AMI on AWS
◩ First easy step: using the spark-cassandra-connector
 (Easy bootstrap move to Spark ↔ Cassandra)
◩ Creating a monitoring dashboard for Cassandra
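The first migration step, writing to both databases at once, can be sketched roughly as below. This is only an illustration under assumptions: the `RecordStore` interface, the in-memory stores, and the `failedKeys` list are hypothetical names standing in for real MongoDB and Cassandra clients, and the slides do not say how failed writes were handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DualWriteSketch {

    /** Minimal store abstraction (hypothetical; not from the talk). */
    interface RecordStore {
        void write(String key, String value);
        String read(String key);
    }

    /** In-memory stand-in for a MongoDB- or Cassandra-backed store. */
    static final class InMemoryStore implements RecordStore {
        private final Map<String, String> data = new HashMap<>();
        @Override public void write(String key, String value) { data.put(key, value); }
        @Override public String read(String key) { return data.get(key); }
    }

    /** Writes go to both stores; reads stay on the old store during the migration. */
    static final class DualWriteStore implements RecordStore {
        private final RecordStore oldStore; // e.g. MongoDB
        private final RecordStore newStore; // e.g. Cassandra
        private final List<String> failedKeys = new ArrayList<>();

        DualWriteStore(RecordStore oldStore, RecordStore newStore) {
            this.oldStore = oldStore;
            this.newStore = newStore;
        }

        @Override public void write(String key, String value) {
            oldStore.write(key, value); // a failure here still propagates to the caller
            try {
                newStore.write(key, value);
            } catch (RuntimeException e) {
                failedKeys.add(key); // remember the key so it can be backfilled later
            }
        }

        @Override public String read(String key) { return oldStore.read(key); }

        List<String> failedKeys() { return failedKeys; }
    }

    public static void main(String[] args) {
        RecordStore oldStore = new InMemoryStore();
        RecordStore newStore = new InMemoryStore();
        DualWriteStore store = new DualWriteStore(oldStore, newStore);
        store.write("k1", "v1");
        System.out.println("old=" + oldStore.read("k1") + ", new=" + newStore.read("k1"));
    }
}
```

Keeping reads on the old store means the new cluster can be validated and tuned under real write load before anything depends on it.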
Workflow with Cassandra
[Diagram labels: Spark Cluster (Worker 1, Worker 2, ..., Worker N); Cassandra Cluster; arrows: Write, Read]
Result
 Performance improvement
◩ The batch write part of the job runs in 3 minutes instead of ~40 minutes with MongoDB
 It took 2 weeks to go from “Zero to Hero” and ramp up a running solution that works without glitches
Lessons learned from a Newbie
 Use TokenAwarePolicy when connecting to
the cluster – Spreads the load on the
coordinators
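import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// socketOptions is a SocketOptions instance configured elsewhere.
// Contact points (addContactPoint) are not shown on the slide and must be set before build().
Cluster.Builder builder = Cluster.builder()
        .withSocketOptions(socketOptions)
        // Token-aware routing wraps a child policy and prefers replicas of the queried partition
        .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()));
Cluster cluster = builder.build();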
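To give some intuition for why token awareness spreads load across coordinators, here is a toy model of a token ring. It is a deliberately simplified sketch, not the driver's implementation: the class `TokenRingSketch` and its methods are hypothetical, `String.hashCode` stands in for the real partitioner (Cassandra's default is Murmur3), and virtual nodes and replication are ignored.

```java
import java.util.Map;
import java.util.TreeMap;

final class TokenRingSketch {
    // token -> node that owns the range ending at that token
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(int token, String node) {
        ring.put(token, node);
    }

    /** Stand-in hash; a real cluster would use its configured partitioner. */
    static int tokenFor(String partitionKey) {
        return partitionKey.hashCode();
    }

    /** Owner is the first node whose token is >= the given token, wrapping around the ring. */
    String ownerOfToken(int token) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in ring");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    String ownerOf(String partitionKey) {
        return ownerOfToken(tokenFor(partitionKey));
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch();
        ring.addNode(Integer.MIN_VALUE / 2, "node-a");
        ring.addNode(0, "node-b");
        ring.addNode(Integer.MAX_VALUE / 2, "node-c");
        for (String key : new String[] {"k1", "k2", "k3"}) {
            // A token-aware client would send the request for this key straight to its owner
            System.out.println(key + " -> " + ring.ownerOf(key));
        }
    }
}
```

Because the client can compute the owner itself, requests go directly to a node that holds the data instead of funnelling through an arbitrary coordinator that then has to forward them.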
Lessons learned from a Newbie
 Monitor everything!!! All of the metrics:
◩ Cassandra
◩ JVM
◩ OS
 Feature-flag every connection parameter; you’ll need it for tuning later
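One way to make every connection parameter tunable without a redeploy is to read them all from configuration with defaults. The sketch below assumes plain `java.util.Properties`; the class name, the property names, and the default values are all hypothetical placeholders, not settings taken from the talk.

```java
import java.util.Properties;

final class ConnectionSettings {
    final int connectTimeoutMillis;
    final int readTimeoutMillis;
    final boolean tokenAware;
    final String consistencyLevel;

    ConnectionSettings(Properties props) {
        // Property names and defaults are illustrative placeholders
        connectTimeoutMillis = Integer.parseInt(props.getProperty("cassandra.connectTimeoutMillis", "5000"));
        readTimeoutMillis = Integer.parseInt(props.getProperty("cassandra.readTimeoutMillis", "12000"));
        tokenAware = Boolean.parseBoolean(props.getProperty("cassandra.tokenAware", "true"));
        consistencyLevel = props.getProperty("cassandra.consistencyLevel", "ONE");
    }

    /** Lets every value be overridden with -Dcassandra.xxx=... on the command line. */
    static ConnectionSettings fromSystemProperties() {
        return new ConnectionSettings(System.getProperties());
    }

    public static void main(String[] args) {
        ConnectionSettings settings = fromSystemProperties();
        System.out.println("connectTimeoutMillis=" + settings.connectTimeoutMillis
                + ", readTimeoutMillis=" + settings.readTimeoutMillis
                + ", tokenAware=" + settings.tokenAware
                + ", consistencyLevel=" + settings.consistencyLevel);
    }
}
```

The values would then be passed to the driver's builder when the cluster object is created, so a tuning experiment becomes a configuration change rather than a code change.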
Monitor Everything!!!
 DataStax – OpsCenter
◩ Comes bundled with the DataStax
Community AMI on AWS
Monitor Everything!!!
 Graphite + Grafana
◩ Pluggable metrics – Since Cassandra 2.0.x
 Cassandra internal metrics
 JVM metrics
◩ OS metrics
 CollectD / StatsD, reporting to Graphite
◩ Should be combined with application-level metrics in the same graphs
 Better visibility into correlations between the metrics
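For the application-level side, Graphite's plaintext protocol accepts one metric per line in the form `<path> <value> <epoch-seconds>`. The sketch below only formats such lines; the class name and the metric path are hypothetical, and how the talk's team actually reported application metrics is not stated on the slides.

```java
import java.time.Instant;

final class GraphiteLine {

    /** Formats one metric as a Graphite plaintext line: "<path> <value> <epoch-seconds>\n". */
    static String format(String path, double value, long epochSeconds) {
        if (path.isEmpty() || path.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("metric path must be non-empty and contain no whitespace");
        }
        return path + " " + value + " " + epochSeconds + "\n";
    }

    public static void main(String[] args) {
        long now = Instant.now().getEpochSecond();
        // Hypothetical metric name; put it under the same prefix as the Cassandra and OS metrics
        System.out.print(format("prod.batchjob.write_seconds", 180.0, now));
    }
}
```

Lines like this are typically written to the Carbon plaintext listener (TCP port 2003 by default); once application timings share a namespace with Cassandra, JVM, and OS metrics, they can be overlaid in a single Grafana graph.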
Monitor Everything!!!
 Graphite + Grafana
Lessons learned from a Newbie
 “nodetool” is your friend
◩ tpstats, cfhistograms, cfstats

 Data Modeling
◩ Time series data
◩ Evenly distributed partitions
◩ Everything becomes more rigid
 Know your queries before you model
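A common way to keep time-series partitions evenly sized is to add a time bucket to the partition key. The sketch below shows one such scheme with a per-day bucket in UTC; the class name, the `sourceId` parameter, and the choice of daily buckets are assumptions for illustration, since the slides do not describe the actual schema.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

final class TimeBucket {

    /** Builds a partition key of the form "<sourceId>:<yyyy-MM-dd>" using the UTC day of the event. */
    static String partitionKey(String sourceId, Instant eventTime) {
        LocalDate day = eventTime.atZone(ZoneOffset.UTC).toLocalDate();
        return sourceId + ":" + day; // LocalDate prints in ISO-8601 form
    }

    public static void main(String[] args) {
        Instant lateEvening = Instant.parse("2015-11-10T23:59:59Z");
        Instant midnight = Instant.parse("2015-11-11T00:00:00Z");
        // Events one second apart land in different partitions once the day rolls over
        System.out.println(partitionKey("source-1", lateEvening));
        System.out.println(partitionKey("source-1", midnight));
    }
}
```

With a key like this, a query for one source over a known day reads a single bounded partition, which is why the queries need to be known before the model is fixed.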
Lessons learned from a Newbie
 CQL Queries
◩ Once we got to know our data model better, it became more efficient, performance-wise, to use CQL statements instead of the “spark-cassandra-connector”
◩ Prepared statements, delete queries (of full partitions), range queries

Useful Cassandra GUI Clients
 DevCenter – By DataStax - Free
 DBeaver – Free & Open Source
◩ Supports a wide variety of databases
Conclusion
 Cassandra is a great, linearly scaling distributed database
 Monitor as much as you can
◩ Get visibility into what’s going on in the cluster
 Correct data modeling is the key to success
 Be ready for your next war
◩ Cassandra performance tuning: you’ll get to that for sure
Questions?
Thanks,
Resources and Contact
 Demi Ben-Ari
◩ LinkedIn
◩ Twitter: @demibenari
◩ Blog: http://progexc.blogspot.com/
◩ Email: demi.benari@gmail.com
◩ “Big Things” Community
 Meetup, YouTube, Facebook, Twitter

Migrating Data Pipeline from MongoDB to Cassandra
