Real Time Big Data With Storm,
Cassandra, and In-Memory Computing
DeWayne Filppi
@dfilppi
Big Data Predictions
“Over the next few years we'll see the adoption of scalable
frameworks and platforms for handling
streaming, or near real-time, analysis and processing. In the
same way that Hadoop has been borne out of large-scale web
applications, these platforms will be driven by the needs of large-
scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity Volume
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
The Flavors of Big Data Analytics
Counting Correlating Research
Analytics @ Twitter – Counting
• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
  – Countries and cities
  – Gender
  – Age groups
  – Device types
  – …
Analytics @ Twitter – Correlating
• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?
Analytics @ Twitter – Research
• Sentiment analysis
  – “Obama is popular”
• Trends
  – “People like to tweet after watching American Idol”
• Spam patterns
  – How can you tell when a user spams?
It’s All about Timing
• “Real time” (< a few seconds)
• Reasonably quick (seconds to minutes)
• Batch (hours/days)
It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss.
VELOCITY + VAST VOLUME =
IN MEMORY + BIG DATA
In Memory Data Grid Review
• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data
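The partitioning and collocation points above can be illustrated with a minimal sketch. The `Grid` class, its method names, and the CRC32-based routing are assumptions made for illustration only; they are not the XAP API, and a real grid adds replication and transactions.

```python
import zlib


class Grid:
    """Toy in-memory grid: keys are hash-partitioned across N partitions (hypothetical)."""

    def __init__(self, partitions):
        # Each dict stands in for the RAM of one cluster node.
        self.partitions = [dict() for _ in range(partitions)]

    def partition_for(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def write(self, key, value):
        self.partitions[self.partition_for(key)][key] = value

    def read(self, key):
        return self.partitions[self.partition_for(key)].get(key)

    def execute(self, key, func):
        # "Code collocated with data": run func against the partition that owns
        # the key instead of shipping the data to the caller.
        return func(self.partitions[self.partition_for(key)])


grid = Grid(partitions=4)
grid.write("user:1", {"tweets": 3})
tweets = grid.execute("user:1", lambda part: part["user:1"]["tweets"])
```

Together the partitions behave like one large "virtual" memory space, while each key lives on exactly one of them.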
Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster and is written asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
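A rough sketch of this flow, assuming a hypothetical `WriteBehindGrid` class in which a plain dict stands in for Cassandra and `flush()` stands in for the background writer. It only illustrates the write-behind pattern and is not the XAP persistence interface.

```python
from collections import deque


class WriteBehindGrid:
    """Writes land in memory first and reach the backing store later (hypothetical)."""

    def __init__(self, store):
        self.memory = {}         # in-memory cluster, results instantly available
        self.pending = deque()   # writes not yet persisted
        self.store = store       # dict standing in for Cassandra
        self.listeners = []      # event listeners notified on every write

    def write(self, key, value):
        self.memory[key] = value
        self.pending.append((key, value))
        for listener in self.listeners:
            listener(key, value)

    def read(self, key):
        return self.memory.get(key)

    def flush(self):
        # A real system would do this on a background thread; here it is explicit.
        flushed = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.store[key] = value
            flushed += 1
        return flushed
```

Filtering or enrichment would slot into `write()` before the value is queued, which is why the slide lists them as options rather than separate tiers.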
Simplified Event Flow
Grid – Cassandra Interface
• Hector and CQL based interface
• In-memory data must be mapped to column families
  – Configurable class to column family mapping
• Must serialize individual fields
  – Fixed fields can use defined types
  – Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
  – By default, nested fields are flattened
  – Can be overridden by a custom serializer
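Object model flattening can be pictured with a small sketch. The `flatten` helper and the dotted column-name convention are assumptions for illustration; the actual naming scheme used by the grid's Cassandra interface is not stated in the slides.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        # Hypothetical convention: join nested field names with a dot.
        column = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            columns.update(flatten(value, column))
        else:
            columns[column] = value
    return columns


row = flatten({"id": 7, "user": {"name": "ann", "geo": {"city": "SF"}}})
```

A custom serializer would replace this default by storing the nested object as a single column value instead.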
Virtues and Limitations
• Could be faster: high availability has a cost
• Complex flows not easy to assemble or understand with simple event handlers
• Complete stack, not just two tools of many
• Fast
  – Microsecond latencies for in-memory operations
  – Fast enough for almost anybody
• Highly available/self-healing
• Elastic
Storm Background
• Popular open source, real time, in-memory, streaming computation platform
• Includes distributed runtime and intuitive API for defining distributed processing flows
• Scalable and fault tolerant
• Developed at BackType, and open sourced by Twitter
Storm Abstractions
• Streams
  – Unbounded sequence of tuples
• Spouts
  – Source of streams (Queues)
• Bolts
  – Functions, Filters, Joins, Aggregations
• Topologies
[Diagram labels: Spout, Bolt, Topologies]
Streaming word count with Storm
• Storm has a simple builder interface for creating stream processing topologies
• Storm delegates persistence to external providers
• Cassandra, because of its write performance, is commonly used
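The shape of a word count topology can be sketched without Storm itself. The generator functions below are hypothetical stand-ins for a spout and two bolts, wired together in plain Python; they are not Storm's builder API, and the `state` counter stands in for the external persistence provider.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one single-field tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, state):
    """Bolt: updates the running count and emits (word, count) tuples."""
    for (word,) in stream:
        state[word] += 1
        yield (word, state[word])


def run_topology(sentences, state):
    # The "topology" is just the wiring: spout -> split -> count.
    return list(count_bolt(split_bolt(sentence_spout(sentences)), state))


counts = Counter()
emitted = run_topology(["the cow", "the moon"], counts)
```

In a real deployment each stage runs in parallel across the cluster and the counts are written to a store such as Cassandra rather than a local counter.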
Storm: Optimistic Processing
• Storm (quite rationally) assumes success is normal
• Storm uses batching and pipelining for performance
• Therefore the spout must be able to replay tuples on demand in case of error
• Any kind of queue-like data source can be fashioned into a spout
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing
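The replay requirement can be sketched as follows. `ReplayingSpout` is a hypothetical class whose method names loosely echo Storm's spout callbacks (next tuple, ack, fail); it is a simplified single-process model, not an implementation of Storm's interfaces.

```python
from collections import deque


class ReplayingSpout:
    """Queue-like source that keeps emitted tuples until they are acknowledged."""

    def __init__(self, items):
        self.queue = deque(enumerate(items))  # (msg_id, payload) pairs
        self.in_flight = {}                   # emitted but not yet acked

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.in_flight[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Success is the normal case: simply forget the tuple.
        self.in_flight.pop(msg_id, None)

    def fail(self, msg_id):
        # On error, put the tuple back so it is emitted again later.
        payload = self.in_flight.pop(msg_id)
        self.queue.append((msg_id, payload))
```

Because failures are handled by replay, the happy path pays no persistence cost, which is the sense in which the processing is optimistic.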
Fast. Want to go faster?
• Eliminate non-memory components
  – Replace the disk-based queue with a reliable in-memory queue
  – Replace disk-based state persistence with in-memory persistence
  – Asynchronously update disk-based state (C*)
Sample Architecture
References
• Try the Cloudify recipe
  – Download Cloudify: http://www.cloudifysource.org/
  – Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details:
  – http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm-friendly streaming implementation on GitHub:
  – https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  – Part 3 coming soon.
Twitter Storm With Cassandra
Storm Overview
Storm Concepts
• Streams
  – Unbounded sequence of tuples
• Spouts
  – Source of streams (Queues)
• Bolts
  – Functions, Filters, Joins, Aggregations
• Topologies
[Diagram labels: Spouts, Bolt, Topologies]
Challenge – Word Count
• Hottest topics
• URL mentions
• etc.
[Diagram labels: Tweets, Count, Word:Count]
Streaming word count with Storm
Supercharging Storm
• Storm doesn’t supply persistence, but provides for it
• Storm optimizes IO to slow persistence (e.g. databases) using batching
• Storm processes streams. The stream provider itself needs to support persistence, batching, and reliability.
[Diagram label: Tweets, events, whatever…]
XAP Real Time Analytics
Two Layer Approach
• Advantage: minimal “impedance mismatch” between layers
  – Both are NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in-memory cache for interactive requests
• Grid layer serves as a real-time computation fabric for CEP, and offers real-time distributed query capability (limited to allocated memory)
[Diagram labels: In Memory Compute Cluster, NoSQL Cluster, RawEventStream, RealTimeEvents, Raw And Derived Events, ReportingEngine, SCALE]
Simplified Architecture
Key Concepts
• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
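The idea of flushing raw events while retaining aggregations can be sketched as below. `EventAggregator`, the `type` field on events, and the list standing in for the NoSQL store are all assumptions for illustration.

```python
from collections import Counter


class EventAggregator:
    """Keeps derived counts in memory and moves raw events out to an archive."""

    def __init__(self, archive):
        self.raw = []             # raw events held in memory only briefly
        self.counts = Counter()   # aggregation retained in memory
        self.archive = archive    # list standing in for the NoSQL store

    def on_event(self, event):
        # Side effect of the event flowing through memory: update the aggregate.
        self.raw.append(event)
        self.counts[event["type"]] += 1

    def flush_raw(self):
        # Raw events leave memory; the aggregate stays available for queries.
        self.archive.extend(self.raw)
        self.raw = []


archive = []
agg = EventAggregator(archive)
for kind in ("tweet", "tweet", "signup"):
    agg.on_event({"type": kind})
agg.flush_raw()
```

After the flush, memory holds only the small aggregate, while the full event history remains in the store for batch analysis.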
Keep Things In Memory
• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
  – Disk: 5-10 ms
  – RAM: ~0.001 ms
Take Aways
A data grid can serve different needs for big data analytics:
• Supercharge a dedicated stream processing cluster like Storm
  – Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
  – Roll your own
• Simplify overall architecture while enhancing scalability
  – Ultra high performance/low latency
  – Dynamically scalable processing and in-memory storage
  – Eliminate messaging tier
  – Eliminate or minimize need for RDBMS
References
• Realtime Analytics with Storm and Hadoop
  – http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/


Editor's Notes

  • #5: ActiveInsight