Distributed and Fault-Tolerant Realtime Computation
www.folio3.com @folio_3
Folio3 – Overview
Who We Are
 We are a Development Partner for our customers
 Design software solutions, not just implement them
 Focus on the solution – Platform and technology agnostic
 Expertise in building applications that are:
Mobile Social Cloud-based Gamified
What We Do
 Areas of Focus
 Enterprise
 Custom enterprise applications
 Product development targeting the enterprise
 Mobile
 Custom mobile apps for iOS, Android, Windows Phone, BB OS
 Mobile platform (server-to-server) development
 Social Media
 CMS based websites for consumers and enterprise (corporate, consumer,
community & social networking)
 Social media platform development (enterprise & consumer)
Folio3 At a Glance
 Founded in 2005
 Over 200 full time employees
 Offices in the US, Canada, Bulgaria & Pakistan
 Palo Alto, CA.
 Sofia, Bulgaria
 Karachi, Pakistan
 Toronto, Canada
Areas of Focus: Enterprise
 Automating workflows
 Cloud based solutions
 Application integration
 Platform development
 Healthcare
 Mobile Enterprise
 Digital Media
 Supply Chain
Some of Our Enterprise Clients
Areas of Focus: Mobile
 Serious enterprise applications for banks and businesses
 Fun consumer apps for app discovery,
interaction, exercise gamification and play
 Educational apps
 Augmented Reality apps
 Mobile Platforms
Some of Our Mobile Clients
Areas of Focus: Web & Social Media
 Community Sites based on
Content Management Systems
 Enterprise Social Networking
 Social Games for Facebook &
Mobile
 Companion Apps for games
Some of Our Web Clients
Agenda
 Big Data
 Hadoop Vs Storm
 Lambda Architecture
 Storm Architecture And Concepts
Big Data
“Big Data” is commonly described along four dimensions:
 Volume: scale of data (terabytes, petabytes, exabytes)
 Velocity: data must be analyzed quickly (responses within milliseconds to seconds)
 Variety: different forms of data (and data sources)
 Veracity: uncertainty of data (due to inconsistency, ambiguity, latency, and incompleteness)
Example Query
Total Number of Page Views To A Website
URL over a range of time
Example Query
def page_views_over_time(big_data, url, start_time, end_time):
    # Naive approach: scan every record and count the matching page views.
    count = 0
    for data in big_data:
        # Each record is assumed to expose .url and .timestamp attributes.
        if data.url == url and start_time <= data.timestamp <= end_time:
            count += 1
    return count
Example Query
Too slow: the scan touches every record, and Big Data is measured in petabytes (Volume).
Hadoop Data Processing Architecture
[Diagram: Data Store (HDFS) → Hadoop (MapReduce) → Batch View (processed data) ← Query; arrows labeled “Data Flow” and “Batch Run”]
 Views generated in batch may be out of date
 Batch workflow is too slow
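One way to picture the batch view: rather than scanning raw records for every query, a batch job pre-aggregates page views into (url, hour) buckets, and a query only sums the buckets it needs. The sketch below is a minimal in-memory illustration, not Hadoop code; the record layout, the hourly bucket size, and the function names are assumptions made for the example.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600  # assumed bucket size for this example


def build_batch_view(records):
    """Batch step: aggregate raw (url, timestamp) records into hourly counts."""
    view = Counter()
    for url, timestamp in records:
        view[(url, timestamp // SECONDS_PER_HOUR)] += 1
    return view


def page_views(view, url, start_hour, end_hour):
    """Query step: sum the precomputed buckets for hours in [start_hour, end_hour]."""
    return sum(view[(url, hour)] for hour in range(start_hour, end_hour + 1))


# Hypothetical records: (url, timestamp in seconds).
records = [
    ("foo.com/blog", 10),     # hour 0
    ("foo.com/blog", 3700),   # hour 1
    ("foo.com/blog", 7300),   # hour 2
    ("foo.com/about", 3800),  # hour 1
]
batch_view = build_batch_view(records)
print(page_views(batch_view, "foo.com/blog", 0, 1))  # 2
```

The query is now cheap, but the view is only as fresh as the last batch run, which is the staleness problem noted on this slide.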
Lambda Architecture
Immutable Master Dataset (stored in HDFS)
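In the Lambda Architecture, a query is typically answered by merging a batch view (recomputed from the immutable master dataset) with a realtime view that covers only the data that arrived since the last batch run. A minimal sketch of that merge follows; the function name and all counts are made up for illustration.

```python
def merged_page_views(batch_view, realtime_view, url):
    """Answer a query by combining the batch view with the realtime view."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)


# Counts produced by the last batch run over the master dataset (made-up numbers).
batch_counts = {"foo.com/blog": 1000, "foo.com/about": 250}
# Counts for events that arrived after that batch run started (made-up numbers).
realtime_counts = {"foo.com/blog": 7}

print(merged_page_views(batch_counts, realtime_counts, "foo.com/blog"))   # 1007
print(merged_page_views(batch_counts, realtime_counts, "foo.com/about"))  # 250
```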
What is Apache Storm?
 Storm is a real-time distributed computing framework for reliably processing large volumes of high-velocity, unbounded data streams.
 It was created by Nathan Marz and his team at BackType, and released as open source in 2011 (after BackType was acquired by Twitter).
Five characteristics make Storm well suited to real-time data processing workloads:
 Fast – benchmarked at processing over one million 100-byte messages per second per node
 Scalable – parallel computations run across a cluster of machines
 Fault-tolerant – when workers die, Storm automatically restarts them. If a node dies, the work is restarted on another node.
 Reliable – Storm guarantees that each unit of data (tuple) is processed at least once or exactly once. Messages are replayed only when there are failures.
 Easy to operate – standard configurations are suitable for production on day one.
Tweet from Nathan Marz (31 May 2012)
Storm Topology
 The input stream of a Storm cluster is handled by a component called a Spout.
 The spout passes the data to a component called a Bolt, which transforms it in some way.
 A Bolt either persists the data in storage, or passes it to some other bolt.
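The spout and bolt roles described above can be sketched with plain Python generators. This is only an illustration of the data flow, not the Storm API; the function names and sample sentences are assumptions.

```python
def sentence_spout():
    """Spout: the source of the stream (a fixed list here; unbounded in practice)."""
    for sentence in ["storm is fast", "storm is scalable"]:
        yield sentence


def uppercase_bolt(stream):
    """Bolt that transforms each tuple and passes it on to the next bolt."""
    for sentence in stream:
        yield sentence.upper()


def store_bolt(stream, storage):
    """Bolt that persists each tuple instead of passing it on."""
    for sentence in stream:
        storage.append(sentence)


storage = []
store_bolt(uppercase_bolt(sentence_spout()), storage)
print(storage)  # ['STORM IS FAST', 'STORM IS SCALABLE']
```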
Functional Programming
h(g(f(data)))
λ-calculus
Sample Problem
File: Bible.txt
… Thus the heavens and the earth were finished, and all the host of them.
And on the seventh day God ended his work which he had made
and he rested on the seventh day from all his work which he had made…
f → {“Thus the heavens and the earth were finished, and all the host of them.”}
{“And on the seventh day God ended his work which he had made”}
g → (“thus”, “the”, “heavens”, “and”, “the”, “earth”, “were”, “finished”, “and”, “all”, “the”, “host”, “of”, “them”)
h → ( (“testaments”, 10), (“holy”, 12), (“faith”, 34) )
Relationship of Storm Topology with Functional
Programming
Data → Spout (f: Line-reader) → Bolt (g: Word-Splitter) → Bolt (h: Word-Counter)
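The mapping of line-reader, word-splitter and word-counter onto f, g and h can be written directly as a function composition. The sketch below runs on two lines of the sample text; the punctuation handling in g is an assumption made to keep the example short.

```python
from collections import Counter

TEXT = (
    "Thus the heavens and the earth were finished, and all the host of them.\n"
    "And on the seventh day God ended his work which he had made"
)


def f(data):
    """Line-reader: split the raw text into lines."""
    return data.splitlines()


def g(lines):
    """Word-splitter: lower-case each line and split it into words."""
    words = []
    for line in lines:
        words.extend(word.strip(".,").lower() for word in line.split())
    return words


def h(words):
    """Word-counter: count how often each word occurs."""
    return Counter(words)


counts = h(g(f(TEXT)))
print(counts["the"], counts["and"], counts["seventh"])  # 4 3 1
```

In a Storm topology each of these steps would run as a separate, parallel component over an unbounded stream rather than as one nested call.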
Data Source Reliability
 A data source is considered “unreliable”, if there is no means to replay a
message.
 A data source is considered “reliable” if it can somehow replay a
message if processing fails at any point.
 A data source is considered “durable” if it can replay any message or set
of messages given the necessary selection criteria.
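To make the “reliable” case concrete, the sketch below keeps every emitted message pending until it is acknowledged and re-queues it on failure, which yields at-least-once processing. The class and its ack/fail methods are hypothetical and only loosely modeled on the idea of a spout being told whether a tuple succeeded; this is not Storm code.

```python
from collections import deque


class ReplayableSource:
    """A 'reliable' source: emitted messages stay pending until acknowledged."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (message id, payload)
        self.pending = {}

    def next_message(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        """Processing succeeded: forget the message."""
        del self.pending[msg_id]

    def fail(self, msg_id):
        """Processing failed: put the message back so it is replayed."""
        self.queue.append((msg_id, self.pending.pop(msg_id)))


source = ReplayableSource(["a", "b", "c"])
processed, already_failed = [], set()
while (item := source.next_message()) is not None:
    msg_id, payload = item
    if payload == "b" and msg_id not in already_failed:
        already_failed.add(msg_id)  # simulate one processing failure for "b"
        source.fail(msg_id)
    else:
        processed.append(payload)
        source.ack(msg_id)

print(processed)  # ['a', 'c', 'b']
```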
Reliability Limitations: Integrating Kafka with Apache Storm
 Exactly once processing requires a “durable” data source.
 At least once processing requires a “reliable” data source.
 An “unreliable” data source can be wrapped to provide additional
guarantees.
 For the Apache Storm demo, I have backed the unreliable data source with Apache Kafka (a minor latency overhead in exchange for durability).
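Wrapping an unreliable source usually means persisting every message to an append-only log before processing it, so that a consumer can re-read from any offset. The sketch below uses an in-memory list as a stand-in for a Kafka topic; the class and function names are hypothetical and no Kafka client is involved.

```python
class DurableLog:
    """Append-only log; any offset can be re-read (a stand-in for a Kafka topic)."""

    def __init__(self):
        self._entries = []

    def append(self, message):
        self._entries.append(message)
        return len(self._entries) - 1  # offset of the stored message

    def read_from(self, offset):
        return list(self._entries[offset:])


def unreliable_source():
    """Fire-and-forget source: each message is seen once and cannot be replayed."""
    yield from ["line 1", "line 2", "line 3"]


log = DurableLog()
for message in unreliable_source():
    log.append(message)  # persist first, process later

# A consumer that crashed after handling offset 0 can resume from offset 1.
print(log.read_from(1))  # ['line 2', 'line 3']
```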
Relationship of Storm Topology with Functional Programming
Data → Kafka topic “bible” (… 5 | 4 | 3 | 2 | 1) → Spout (f: Line-reader) → Bolt (g: Word-Splitter) → Bolt (h: Word-Counter)
The Storm spout is subscribed to the topic “bible” of the Kafka messaging queue.
Scenarios / Use cases where Storm can be effectively used
 Predictive Analysis
 Social Graph Analysis
 Network Monitoring
 Recommendation Engine
 Realtime Analytics
 Online Machine Learning
 Continuous Computation
 Distributed Remote Procedure Call
 Website Activity Tracking
 Log Aggregation
Storm Components
A Storm cluster has 3 sets of nodes
Nimbus Nodes
Zookeeper Nodes
Supervisor Nodes
Storm Components: Nimbus Nodes
 Master Node Daemon
 Distributes code across the
cluster
 Launches workers across the
cluster
 Monitors computation and
reallocates workers as needed
Storm Components: ZooKeeper Nodes
 Manages all the coordination
between Nimbus and the
supervisors.
Storm Components: Supervisor Nodes
 Executes a subset of a topology (spouts and/or bolts).
 Listens for jobs assigned to the
machine and starts and stops
worker processes as necessary.
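The division of labor between the three node types can be pictured as follows: a master computes task assignments, writes them to shared coordination state, and each supervisor reads only its own share. The round-robin policy, the dictionary standing in for ZooKeeper, and all names are assumptions for illustration; Storm's actual scheduler is more involved.

```python
from itertools import cycle


def assign_tasks(tasks, supervisors):
    """Nimbus-like step: spread tasks over supervisors round-robin."""
    assignments = {name: [] for name in supervisors}
    for task, name in zip(tasks, cycle(supervisors)):
        assignments[name].append(task)
    return assignments


# Shared coordination state, standing in for ZooKeeper.
coordination_state = {}
coordination_state["assignments"] = assign_tasks(
    ["spout-1", "split-1", "split-2", "count-1", "count-2"],
    ["supervisor-a", "supervisor-b"],
)

# Each supervisor reads its own assignments and would start workers for them.
for name, tasks in coordination_state["assignments"].items():
    print(name, tasks)
```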
Known Limitations:
 Nimbus is a single point of failure
 When Nimbus is down:
 Topologies continue to work
 Tasks (spouts/bolts) on failed nodes are not reassigned to other nodes
 New topologies cannot be uploaded and existing ones cannot be rebalanced
 It is recommended to run Nimbus under a supervision tool such as daemontools or monit so that it is restarted automatically when it goes down.
(By contrast, in Hadoop, if the JobTracker dies, all running jobs are lost.)
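The restart recommendation above boils down to a loop that starts the process again whenever it exits. The sketch below shows that idea with a bounded number of restarts; it is not a replacement for daemontools or monit, and the command is a stand-in that exits immediately rather than a real Nimbus start command.

```python
import subprocess
import sys


def run_supervised(command, max_restarts):
    """Run command; start it again each time it exits, up to max_restarts times."""
    starts = 0
    while starts <= max_restarts:
        starts += 1
        subprocess.run(command)  # blocks until the process exits
    return starts


# Stand-in for the Nimbus start command: a process that exits immediately.
fake_nimbus = [sys.executable, "-c", "raise SystemExit(1)"]
print(run_supervised(fake_nimbus, max_restarts=2))  # 3 (one start plus two restarts)
```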
Contact
 For more details about our services, please get in touch
with us.
contact@folio3.com
US Office: (408) 365-4638
www.folio3.com

Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache Kafka and Apache Zookeeper
