BUILDING REAL-TIME ANALYTICS
With DataStax Enterprise (DSE)
jKoolCloud.com
1
Objectives
• Store everything, analyze everything…
• Combined real-time & historical analytics
• Fast response, flexible query capabilities
• Target audience: business users
• Insulate us from underlying software
• Hide complexity
• Scale for ingesting data-in-motion
• Scale for storing data-at-rest
• Elasticity & Operational efficiency
• Ease of monitoring & management
2
Technologies we considered?
• SQL (Oracle, MySQL, etc.)
• Does not scale. We have seen many of our customers’ issues with this at our parent company, Nastel…
• RAM was “the” bottleneck. Commits take too long, and everything else stops while they are in progress
• NoSQL
• Cassandra/Solr (DSE)
• Hadoop/MapReduce
• MongoDB
• Clustered Computing Platforms
• STORM
• MapReduce
• Spark (we learned about this while building jKool)
3
Why we chose Cassandra/Solr?
• Pros:
• Simple to set up & scale for clustered deployments
• Scalable, resilient, fault-tolerant (easy replication)
• Ability to have data automatically expire (TTL – necessary for our pricing model)
• Configurable replication strategy
• Great for heavy write workloads
• Write performance was better than Hadoop.
• Insert rate was of paramount importance for us: our goal was to get data in as fast as possible
• The Java driver balances the load among the nodes in a cluster for us (a master-slave design would never have worked for us)
• Solr provides a way to index all incoming data - essential
• DSE provides a nice integration between Cassandra and Solr
• Cons:
• Susceptible to GC pauses (memory management)
• The more memory, the more GC pauses
• Less memory per node and more nodes seems a better approach than one big “honking” server (6-8 GB looks optimal so far)
• Data compaction tasks may hang
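The TTL point above can be illustrated with a small sketch. It only builds the text of a CQL `INSERT ... USING TTL` statement from a retention period. The table name `analytics.events`, the column names, and the 30-day retention are hypothetical placeholders, not jKool's actual schema or pricing tiers.

```java
import java.time.Duration;

/** Sketch: build a CQL INSERT whose rows expire after a retention period. */
final class TtlInsert {

    // Cassandra caps TTL values at 20 years, expressed in seconds.
    private static final long MAX_TTL_SECONDS = 20L * 365 * 24 * 60 * 60;

    /**
     * Returns a parameterized CQL INSERT with a USING TTL clause.
     * Column names are hypothetical placeholders for illustration.
     */
    static String insertWithTtl(String table, Duration retention) {
        long ttlSeconds = retention.getSeconds();
        if (ttlSeconds <= 0 || ttlSeconds > MAX_TTL_SECONDS) {
            throw new IllegalArgumentException("TTL out of range: " + ttlSeconds);
        }
        return "INSERT INTO " + table
                + " (event_id, event_time, payload) VALUES (?, ?, ?)"
                + " USING TTL " + ttlSeconds;
    }

    public static void main(String[] args) {
        // A hypothetical 30-day retention plan.
        System.out.println(insertWithTtl("analytics.events", Duration.ofDays(30)));
    }
}
```

With a per-plan retention period, each write carries its own expiry, so no separate cleanup job is needed to enforce the retention limit.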
4
Why not Hadoop MapReduce?
• MapReduce too slow for real-time workloads
• OK for batch, not so great for real-time
• Needs to be paired with other technologies for query (Hive/Pig)
• Complex to set up, run and operate
• Our goals were simplicity first…
• Opted for STORM/Spark wrapped with our own micro-services platform, FatPipes, instead of the MapReduce functionality
5
Why we chose Cassandra/Solr vs. Mongo?
• Why not Mongo?
• Global write-lock performance concerns…
• Cassandra/Solr
• Java based (our project was in Java)
• Easy to scale and replicate data
• Flexible read & write consistency levels (ALL, QUORUM, ANY, etc.)
• Did we say Java? Yes. (We like Java…)
• Flexible choice of platform coverage
• Great for time-series data streams (market focus for jKool)
• Inherent query limitations in Cassandra solved via Solr
integration (provided with DSE – as mentioned earlier)
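The consistency levels listed above come down to simple replica arithmetic, sketched below. A QUORUM operation waits for floor(RF / 2) + 1 replicas, and a read is guaranteed to overlap the latest acknowledged write when read replicas plus write replicas exceed the replication factor. ANY applies to writes only. The class and method names are illustrative, not part of any driver API.

```java
/** Sketch: replica-count arithmetic behind Cassandra consistency levels. */
final class ConsistencyMath {

    /** Replicas that must respond at QUORUM: floor(RF / 2) + 1. */
    static int quorum(int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        return replicationFactor / 2 + 1;
    }

    /** True when every read set overlaps every write set: R + W > RF. */
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("QUORUM at RF=3 waits for " + q + " replicas");
        System.out.println("QUORUM read + QUORUM write overlap: " + readsSeeLatestWrite(q, q, rf));
        System.out.println("ONE read + ONE write overlap: " + readsSeeLatestWrite(1, 1, rf));
    }
}
```

For an insert-heavy workload, a lower write level trades this overlap guarantee for faster acknowledgements, which is the flexibility the slide refers to.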
6
How we achieved near real-time analytics?
• Created our own micro-services architecture (FatPipes)
which runs on top of:
• STORM/JMS/Kafka
• FatPipes can be embedded or distributed
• Real-time Grid
• Feeds tracking data and real-time queries to CEP and back
• Users interact with the Real-time Grid via JKQL (jKool Query Language)
• English-like query language for analyzing data in motion and at rest
• “Subscribe” verb for real-time updates
[Figure: Real-time.png]
7
Why clustered computing platforms?
• STORM paired with Kafka/JMS and CEP
• Clustered way to process incoming real-time streams
• STORM handles clustering/distribution
• Kafka/JMS for messaging between grids
• Split streaming workload across the cluster
• Achieve linear scalability for incoming real-time streams
• Apache Spark (alternative to MapReduce)
• For distributing queries and trend analysis
• Micro batching for historical analytics
• Loading large data-sets into memory (across different nodes)
• Running queries against large data-sets
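One common way to split a streaming workload across a cluster is to hash a stream key to a worker index. The sketch below shows that idea in plain Java. It is not how STORM or FatPipes route tuples internally, and the stream names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: route stream events to workers by hashing a stream key. */
final class StreamPartitioner {
    private final int workers;

    StreamPartitioner(int workers) {
        if (workers < 1) {
            throw new IllegalArgumentException("need at least one worker");
        }
        this.workers = workers;
    }

    /** The same key always maps to the same worker, so one stream stays together. */
    int workerFor(String streamKey) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(streamKey.hashCode(), workers);
    }

    /** Groups keys into one bucket per worker. */
    List<List<String>> split(List<String> streamKeys) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : streamKeys) {
            buckets.get(workerFor(key)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        StreamPartitioner partitioner = new StreamPartitioner(4);
        System.out.println(partitioner.split(List.of("orders", "payments", "logins", "orders")));
    }
}
```

Plain modulo hashing remaps most keys when the worker count changes, so a production system would likely need a more stable scheme. The sketch only shows why independent keys let throughput grow with the number of workers.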
8
Key to Real-time Analytics
• Process streams as they arrive while avoiding I/O
• Streams are split into a real-time queue and a persistence queue with eventual consistency (eventually… both real-time and historical must reconcile)
• Both have to be processed in parallel
• Writing to the persistence layer first and then analyzing will not achieve near real-time processing
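The split into a real-time queue and a persistence queue can be sketched with two in-memory queues and two consumer threads. This is a minimal single-process illustration under assumed names. In the architecture described here the queues would be JMS or Kafka destinations and the consumers would be CEP and Cassandra writers.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Sketch: fan each event out to a real-time path and a persistence path. */
final class DualPathIngest {
    private static final String STOP = "\u0000STOP"; // end-of-stream marker

    private final BlockingQueue<String> realTimeQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> persistenceQueue = new LinkedBlockingQueue<>();

    final AtomicLong analyzed = new AtomicLong();             // stands in for CEP results
    final List<String> stored = new CopyOnWriteArrayList<>(); // stands in for the data store

    /** Enqueues the event on both paths; neither path waits for the other. */
    void ingest(String event) {
        realTimeQueue.add(event);
        persistenceQueue.add(event);
    }

    /** Processes all events with both consumers running in parallel. */
    void run(List<String> events) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> drain(realTimeQueue, e -> analyzed.incrementAndGet()));
        pool.submit(() -> drain(persistenceQueue, stored::add));
        events.forEach(this::ingest);
        ingest(STOP);
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("consumers did not finish");
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", ex);
        }
    }

    private static void drain(BlockingQueue<String> queue, Consumer<String> handler) {
        try {
            for (String e = queue.take(); !e.equals(STOP); e = queue.take()) {
                handler.accept(e);
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        DualPathIngest ingest = new DualPathIngest();
        ingest.run(List.of("e1", "e2", "e3"));
        // Reconciliation check: both paths eventually saw the same events.
        System.out.println(ingest.analyzed.get() == ingest.stored.size());
    }
}
```

The analysis path never waits on a write to storage, and the final comparison of the two counts is the reconciliation step the slide mentions.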
9
High Level Architecture
10
Deeper View
jKool Web Grid: Web Application Server (×3)
Storage Grid: Cassandra (×4)
Search Grid: Solr (×4)
Digest, Index
Real-time Grid
JKQL
FatPipes Micro Services (INGEST)
Compute Grid
FatPipes Micro Services (REAL-TIME)
(STORM/CEP)
Distributed Messaging (JMS or Kafka)
11
Challenges we ran into?
• So many technology options (…so little time…)
• Deciding on the right combination is key early on
• Cassandra/Solr deployment (it was a learning experience for us)
• Lots of configuration, memory management, replication options
• Monitoring, managing clusters
• Cassandra/Solr, STORM, Zookeeper, Messaging
• Leveraged our parent company’s AutoPilot technology
• Achieving near real-time analytics proved extremely
challenging – but we did it!
• Keeping track of latencies across cluster
• Estimating computational capacity required to crunch incoming
streams
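A first-cut capacity estimate for the last point can be done with back-of-the-envelope arithmetic: divide the incoming event rate by what one worker can sustain at a target utilization, then round up. The sketch below assumes that simple model. The rates and the 70% utilization target are made-up inputs, not measurements from jKool.

```java
/** Sketch: rough worker count needed to keep up with an incoming stream. */
final class CapacityEstimate {

    /**
     * workers = ceil(eventsPerSecond / (perWorkerEventsPerSecond * targetUtilization)).
     * targetUtilization below 1.0 leaves headroom for bursts and GC pauses.
     */
    static int workersNeeded(double eventsPerSecond,
                             double perWorkerEventsPerSecond,
                             double targetUtilization) {
        if (eventsPerSecond < 0 || perWorkerEventsPerSecond <= 0
                || targetUtilization <= 0 || targetUtilization > 1) {
            throw new IllegalArgumentException("invalid capacity inputs");
        }
        return (int) Math.ceil(eventsPerSecond / (perWorkerEventsPerSecond * targetUtilization));
    }

    public static void main(String[] args) {
        // Hypothetical inputs: 50,000 events/s, 4,000 events/s per worker, 70% utilization.
        System.out.println(workersNeeded(50_000, 4_000, 0.7));
    }
}
```

Measured per-worker throughput changes with event size and query load, so an estimate like this is only a starting point to be checked against observed latencies.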
12
Business Analyst User Interface
It's easy to “visualize your data”
13
jKOOL IN REAL-TIME
Real-time Demonstration of jKool’s usage of DSE

How We Used Cassandra/Solr to Build Real-Time Analytics Platform
