Stavros Kontopoulos
Software Engineer, MSc
Cassandra at Pollfish.com
Cassandra at Pollfish
What do we do at Pollfish?
• Target mobile users with surveys through our Android/iOS SDK, which is installed in thousands
of mobile apps. Developers benefit from completed surveys; companies can also run a survey
campaign in real time. Now: analytics/ML pipeline…
Why Apache Cassandra?
• Store time series about system events, user activities, survey results and much more.
• Amazing write throughput; take advantage of idempotent writes with proper resolution.
• Decent read throughput and low latency.
• Integrates with Spark to implement our analytics and insights pipeline.
• A new business model.
A flashback… why I am here…
Our Tech Team- Discipline oriented
• Front-end development, UI design
• Back-end, data engineer(s)
• Data scientist
• DevOps
Pollfish High Level Architecture
Mobile Users (~600K active per day)
APP SERVER 1
APP SERVER N
Other systems
(Postgres, Geo, Redis…)
DSE
Cassandra/Spark Cluster
Survey Customers
Why DataStax
• Startup program and support when you go to production.
• Many tools for development and maintenance: Pig, Hive, Shark, CFS, OpsCenter.
• Clear setup and support for real-time data storage and analytics in the same cluster.
• Can be extended for other workloads like search (Solr).
• Supports multiple DCs easily for other purposes: staging, backup, etc.
• Product versions we used: DSE 4.5.1 (Spark 0.9.1), 4.5.3, 4.6.0 (Spark 1.1.0)
Our Cassandra Cluster- Setup
2 data centers (Cassandra, Analytics):
• 1 for real-time data storage (read/write path).
• 1 for analytics nodes (Spark and Hadoop enabled): read/write path for ETL and machine learning.
• Use the DSE setup with DseSimpleSnitch for mixed-workload, multi-DC clusters.
3 nodes per DC (details next), planning for more.
Data is written to the Cassandra DC and replicated to the Analytics DC for all keyspaces that need it.
2 seed nodes, one per DC.
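Replication from the Cassandra DC to the Analytics DC is configured per keyspace. A minimal sketch in shell that only prints the CQL statement: the keyspace name `events` and the replication factors of 3 are hypothetical, and the data center names assume the defaults that DseSimpleSnitch assigns (Cassandra, Analytics).

```shell
#!/bin/sh
# keyspace_cql NAME RF_CASSANDRA RF_ANALYTICS
# Prints a CREATE KEYSPACE statement that keeps replicas in both data centers.
# The DC names 'Cassandra' and 'Analytics' are assumed; verify them with
# `nodetool status` on your own cluster before using the output.
keyspace_cql() {
    printf "CREATE KEYSPACE IF NOT EXISTS %s WITH replication = {'class': 'NetworkTopologyStrategy', 'Cassandra': %d, 'Analytics': %d};\n" "$1" "$2" "$3"
}
```

For example, `keyspace_cql events 3 3 > events.cql` writes the statement to a file that can then be applied with `cqlsh -f events.cql`.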
Our Cassandra Cluster- Infrastructure Details
Cassandra DC:
• Nodes 0, 1, 2: Standard_A4 (8 cores, 14 GB memory), 15 disks in RAID 0
Analytics DC:
• Nodes 0, 1, 2: Standard_D13 (8 cores, 56 GB memory), 15 disks in RAID 0
A bit about disks (ideally JBODs, but we can only have network disks)…
	 FS type: ext4
	 Analytics disks per node:
Azure OS disk (/dev/sda1, mount point: /): 29GB
Azure VM temp disk (/dev/sdb1, mount point: /mnt/resource): 400GB - for Spark temp files…
data_file_directories: /mnt/dsedata/lib/cassandra/data (2.9T)
commitlog_directory: /var/lib/cassandra/commitlog (197GB disk)
A few tips towards production…
Follow as a start:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
http://www.datastax.com/wp-content/uploads/2014/04/WP-DataStax-Enterprise-Best-Practices.pdf
• NTP is a must on all servers: Cassandra needs synchronized clocks for data synchronization
between nodes, and so do your analytics jobs.
• Proper limits:
		 cassandra - memlock unlimited
		 cassandra - nofile 100000
		 cassandra - nproc 32768
		 cassandra - as unlimited
• Optimum blockdev --setra settings for RAID (OK): all values must be set to 128KB for read-ahead (RA).
• No swap
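On the read-ahead bullet above: `blockdev --setra` takes its value in 512-byte sectors, not kilobytes, so a size has to be converted first. A small sketch of that arithmetic, taking the 128KB figure from the slide literally; the device name `/dev/md0` in the usage note is a hypothetical RAID device.

```shell
#!/bin/sh
# kb_to_sectors KB: convert kilobytes to 512-byte sectors (1 KB = 2 sectors).
kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# set_readahead_cmd DEVICE KB: print (do not run) the matching blockdev command.
set_readahead_cmd() {
    echo "blockdev --setra $(kb_to_sectors "$2") $1"
}
```

`set_readahead_cmd /dev/md0 128` prints `blockdev --setra 256 /dev/md0`; the current value can be read back with `blockdev --getra`.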
A few Cassandra optimization tips
For Cassandra DC:
	 • concurrent_reads: 64
	 • concurrent_writes: 32
	 • Adjust according to the load and node IOPS/throughput..
Heap size adjustments: DataStax does a good job of handling this automatically. In general,
an 8GB heap is OK for production; it avoids the long GC pauses found in bigger JVMs. We use 4GB on
the smaller Cassandra nodes and 8GB on the Analytics nodes.
	 • If your load creates many short-lived objects, adjust the ParNew (young generation) size accordingly.
	 Your tools: jconsole, jvisualvm, hprof, jstack; enable GC reporting in the logs…
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
…
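As an illustration of the automatic heap sizing mentioned above, here is a sketch of the rule documented in the comments of the stock cassandra-env.sh, max(min(1/2 RAM, 1024MB), min(1/4 RAM, 8GB)). The function name is hypothetical, the RAM figures are the nominal VM sizes from the infrastructure slide, and the rule should be checked against the script shipped with your version.

```shell
#!/bin/sh
# default_heap_mb RAM_MB
# Prints max(min(RAM/2, 1024), min(RAM/4, 8192)) in megabytes.
default_heap_mb() {
    ram_mb="$1"
    half=$(( ram_mb / 2 ))
    quarter=$(( ram_mb / 4 ))
    # Cap each candidate at its limit.
    if [ "$half" -gt 1024 ]; then half=1024; fi
    if [ "$quarter" -gt 8192 ]; then quarter=8192; fi
    # The larger of the two capped values wins.
    if [ "$half" -gt "$quarter" ]; then
        echo "$half"
    else
        echo "$quarter"
    fi
}
```

`default_heap_mb 14336` (14 GB) prints 3584 and `default_heap_mb 57344` (56 GB) prints 8192, which is in line with the roughly 4GB and 8GB heaps quoted above. To pin the values instead, cassandra-env.sh reads `MAX_HEAP_SIZE` and `HEAP_NEWSIZE`.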
Managing nodes
How do you change a tire on a moving car?
Adding a new clean node to the cluster fast:
• Add the following line to /etc/dse/cassandra/cassandra-env.sh:
• JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
• Start the service
• Revert the change and execute nodetool join.
• After joining finishes, execute a node rebalance from OpsCenter.
• Fallback: remove the data and commit log directories and start again.
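The "add the line" and "revert change" steps above can be scripted. A minimal sketch, assuming the env file is passed as an argument (on the cluster described here it would be /etc/dse/cassandra/cassandra-env.sh); the function names are hypothetical.

```shell
#!/bin/sh
# The exact line the slide adds to cassandra-env.sh (kept literal, not expanded).
JOIN_OPT='JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"'

# disable_auto_join FILE: append the option unless it is already present.
disable_auto_join() {
    grep -qxF "$JOIN_OPT" "$1" || printf '%s\n' "$JOIN_OPT" >> "$1"
}

# enable_auto_join FILE: remove the option again (the "revert change" step).
enable_auto_join() {
    tmp=$(mktemp) || return 1
    grep -vxF "$JOIN_OPT" "$1" > "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$1"
    rm -f "$tmp"
}
```

A run would then look like: `disable_auto_join` on the env file, `sudo service dse start`, `enable_auto_join` on the same file, and finally `nodetool join`.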
Stopping a node:
nodetool -h <hostname> drain
sudo service dse stop
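The two stop commands can be wrapped so that the service is only stopped if the drain succeeded. A sketch with a dry-run switch for rehearsing the sequence; the `run` helper, the `DRY_RUN` variable and the host name `node0` are hypothetical.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# stop_node HOST: drain the node (flush memtables, stop accepting writes),
# then stop the DSE service. The stop is skipped if the drain fails.
stop_node() {
    run nodetool -h "$1" drain && run sudo service dse stop
}
```

With `DRY_RUN=1`, `stop_node node0` prints the two commands instead of running them.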
Integration with Spark- Use cases
Common ETL case:
Mobile User profile (ML):
HDFS Compatible (CFS)
Cassandra Raw Data
Spark/Cassandra
Spark/Cassandra
HDFS Compatible (CFS)
Cassandra User profile data
Customers Amazon S3
Integration with Spark- Our job framework
• We have built an analytics framework to run Spark jobs on top of Cassandra.
It executes jobs remotely through Maven.
• The SparkContext is created on a non-DSE node. This needed some tricks to work; we are now moving
to the more official DSE approach with spark-submit. Note: we started with the DSE version that ships
Spark 0.9.1, where spark-submit was not available.
• The framework provides an API to run jobs in production and shares common functionality with the
data science side for code reuse.
DataStax Cassandra OpsCenter
• Facilitates monitoring, health checks and daily operations at the cluster level.
• We use it for node rebalancing, repairs, snapshots, and latency/throughput checks, along with
low-level tools like iotop, ioping, etc.
DataStax Cassandra OpsCenter (Pending cluster operations)
DataStax Cassandra OpsCenter (Keyspace Details)
DataStax Cassandra OpsCenter (Last 10 days)
The Good and the Bad with Cassandra and DSE
Good
• Easy to start with, enough documentation.
• Reliable performance so far.
• Easily scalable, easy to add/remove nodes.
• Support.
Bad
• Bugs at the Cassandra level and the Spark level. You need to actively follow the mailing lists for Spark, Cassandra and the DSE tools. An upgrade may be the only solution, and it is not a piece of cake.
• Gets tricky to optimize depending on your load, especially when you don't have time to measure everything upfront and you are already in production.
The Good and the Bad with Cassandra and DSE (Some issue examples)
Apply backpressure gently when overloaded with writes:
https://issues.apache.org/jira/browse/CASSANDRA-7937
Reproduce by writing a large volume of data with large column values, e.g. 300K (a crawling scenario)...
Spark Cassandra Connector issues (e.g. the Java driver not closing its connection to the cluster)...
As a developer, some cool stuff I mess with
Currently I am extending the framework I have built to deliver Spark jobs on top of Cassandra.
• Integration tests (IT) with the embedded Cassandra connector.
• Cassandra schema design.
• ETL use cases.
Organizing the dev environment:
Optimizing the development environment with Vagrant: automatic installation to avoid upgrade
headaches, setup, and IT-level testing. Introduced CI with Jenkins, Artifactory, etc…
thank you
