Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache Kafka and Apache Zookeeper

Distributed and Fault-Tolerant Realtime Computation
www.folio3.com @folio_3

Folio3 – Overview

Who We Are
 We are a Development Partner for our customers
 Design software solutions, not just implement them
 Focus on the solution – Platform and technology agnostic
 Expertise in building applications that are:
Mobile, Social, Cloud-based, Gamified

What We Do
 Areas of Focus
 Enterprise
 Custom enterprise applications
 Product development targeting the enterprise
 Mobile
 Custom mobile apps for iOS, Android, Windows Phone, BlackBerry OS
 Mobile platform (server-to-server) development
 Social Media
 CMS-based websites for consumers and enterprises (corporate, consumer,
community & social networking)
 Social media platform development (enterprise & consumer)

Folio3 At a Glance
 Founded in 2005
 Over 200 full-time employees
 Offices in the US, Canada, Bulgaria & Pakistan
 Palo Alto, CA.
 Sofia, Bulgaria
 Karachi, Pakistan
 Toronto, Canada

Areas of Focus: Enterprise
 Automating workflows
 Cloud-based solutions
 Application integration
 Platform development
 Healthcare
 Mobile Enterprise
 Digital Media
 Supply Chain

Some of Our Enterprise Clients

Areas of Focus: Mobile
 Serious enterprise applications for banks and
businesses
 Fun consumer apps for app discovery,
interaction, exercise gamification and play
 Educational apps
 Augmented Reality apps
 Mobile Platforms

Areas of Focus: Web & Social Media
 Community Sites based on
Content Management Systems
 Enterprise Social Networking
 Social Games for Facebook &
Mobile
 Companion Apps for games

Distributed and Fault-Tolerant Realtime Computation

Agenda
 Big Data
 Hadoop vs. Storm
 Lambda Architecture
 Storm Architecture and Concepts

Big Data
“Big Data” is commonly described along four dimensions:
 Volume: scale of data (terabytes, petabytes, exabytes)
 Velocity: data needs to be analyzed quickly (responses within
milliseconds to seconds)
 Variety: different forms of data (and data sources)
 Veracity: uncertainty of data (due to inconsistency,
ambiguity, latency, incompleteness)

Example Query
Total number of page views for a website
URL over a range of time

Example Query
def page_views_over_time(big_data, url, start_time, end_time):
    # Scan every record and count those matching the URL and time range.
    count = 0
    for data in big_data:
        if data.url == url and start_time <= data.timestamp <= end_time:
            count += 1
    return count

Example Query
Too slow: the data to scan is measured in petabytes
(Volume)

Hadoop Data Processing Architecture
[Diagram: Data Flow → Data Store (HDFS) → Batch Run on Hadoop (MapReduce) → Batch View (processed data) → Query]
 Views generated in batch may be out of date
 Batch workflow is too slow
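
The batch idea above can be sketched in a few lines: a batch run precomputes a view keyed by URL and time bucket, and a query reads only that view instead of scanning the whole dataset. This is an illustration in plain Python, not Hadoop code; the hourly bucket size and the names `build_batch_view` and `page_views` are assumptions made for the example.

```python
from collections import Counter

BUCKET_SECONDS = 3600  # assumed granularity: one bucket per hour


def build_batch_view(records):
    """Batch run: count page views per (url, hour bucket).

    `records` is an iterable of (url, unix_timestamp) pairs standing in for
    the master dataset.
    """
    view = Counter()
    for url, timestamp in records:
        view[(url, timestamp // BUCKET_SECONDS)] += 1
    return view


def page_views(view, url, start_hour, end_hour):
    """Query: sum precomputed buckets for hours start_hour..end_hour inclusive."""
    return sum(view[(url, hour)] for hour in range(start_hour, end_hour + 1))


records = [("a.com", 10), ("a.com", 3700), ("b.com", 20), ("a.com", 7300)]
view = build_batch_view(records)
print(page_views(view, "a.com", 0, 1))
```

Records that arrive after `build_batch_view` has run are not in the view until the next batch run, which is the staleness problem noted on the slide.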

Immutable Master Dataset (stored in HDFS)

What is Apache Storm?
 Storm is a real-time distributed computing framework for
reliably processing large volumes of high-velocity, unbounded
data streams.
 It was created by Nathan Marz and his team at BackType, and
released as open source in 2011 (after BackType was acquired by
Twitter).

Five characteristics make Storm ideal for real-time data processing
workloads:
 Fast – benchmarked at processing more than one million 100-byte messages per second
per node
 Scalable – with parallel calculations that run across a cluster of machines
 Fault-tolerant – when workers die, Storm will automatically restart them. If a
node dies, the work will be restarted on another node.
 Reliable – Storm guarantees that each unit of data (tuple) will be processed at
least once or exactly once. Messages are only replayed when there are failures.
 Easy to operate – standard configurations are suitable for production on day
one. Once deployed, Storm is easy to operate.

Tweet from Nathan Marz (31 May 2012)

Storm Topology
 The input stream of a Storm cluster is handled by a component called a Spout.
 The spout passes the data to a component called a Bolt, which transforms it in some
way.
 A Bolt either persists the data in storage or passes it to another bolt.
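
The wiring described above can be sketched with two small classes. This is not Storm's API: `Spout`, `Bolt`, `stream` and `process` are hypothetical names for this single-process illustration, and a Python list stands in for real storage.

```python
class Spout:
    """Source of the stream: yields tuples from some input (here, a list)."""

    def __init__(self, items):
        self.items = items

    def stream(self):
        yield from self.items


class Bolt:
    """Applies `transform` to each tuple, then forwards the result to the
    next bolt, or persists it when there is no next bolt."""

    def __init__(self, transform, next_bolt=None):
        self.transform = transform
        self.next_bolt = next_bolt
        self.storage = []  # stand-in for a database or file

    def process(self, item):
        result = self.transform(item)
        if self.next_bolt is not None:
            self.next_bolt.process(result)
        else:
            self.storage.append(result)


sink = Bolt(lambda n: n + 1)                     # last bolt: persists results
doubler = Bolt(lambda n: n * 2, next_bolt=sink)  # first bolt: passes results on
for item in Spout([1, 2, 3]).stream():
    doubler.process(item)
print(sink.storage)
```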

Functional Programming
h(g(f(data)))
λ-calculus

Sample Problem
File: Bible.txt
… Thus the heavens and the earth were finished, and all the host of them.
And on the seventh day God ended his work which he had made
and he rested on the seventh day from all his work which he had made…
f: {“Thus the heavens and the earth were finished, and all the host of
them.”}
{“And on the seventh day God ended his work which he had made”}
g: (“thus”, “the”, “heavens”, “and”, “the”, “earth”, “were”,
“finished”, “and”, “all”, “the”, “host”, “of”, “them”)
h: ( (“testaments”, 10), (“holy”, 12), (“faith”, 34) )

Relationship of Storm Topology with Functional
Programming
Data → Spout (f: Line-reader) → Bolt (g: Word-Splitter) → Bolt (h: Word-Counter)
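
The mapping of f, g and h onto the line reader, word splitter and word counter can be written as ordinary function composition. This is a sketch in plain Python rather than a Storm topology; the function names and the regular expression used to split words are assumptions for the example.

```python
from collections import Counter
import re


def read_lines(text):
    """f (Line-reader): split raw text into non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_words(lines):
    """g (Word-Splitter): lowercase each line and split it into words."""
    return [word for line in lines for word in re.findall(r"[a-z']+", line.lower())]


def count_words(words):
    """h (Word-Counter): count occurrences of each word."""
    return Counter(words)


text = """Thus the heavens and the earth were finished, and all the host of them.
And on the seventh day God ended his work which he had made"""

counts = count_words(split_words(read_lines(text)))  # h(g(f(data)))
print(counts["the"], counts["and"])
```

In a topology each function would run as its own component, possibly on different machines, with tuples flowing between them instead of whole lists.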

Data Source Reliability
 A data source is considered “unreliable” if there is no means to replay a
message.
 A data source is considered “reliable” if it can somehow replay a
message if processing fails at any point.
 A data source is considered “durable” if it can replay any message or set
of messages given the necessary selection criteria.
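
A “reliable” source in the sense above can be sketched as one that remembers every emitted message until it is acknowledged. The class and its `next_message`, `ack` and `fail` methods are hypothetical names for this illustration, not a real client API; the failure is simulated.

```python
from collections import deque


class ReplayableSource:
    """Keeps every emitted message until it is acknowledged, so a failed
    message can be emitted again (at-least-once delivery)."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (message id, payload)
        self.pending = {}  # emitted but not yet acknowledged

    def next_message(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        self.pending.pop(msg_id)  # fully processed: forget it

    def fail(self, msg_id):
        self.queue.append((msg_id, self.pending.pop(msg_id)))  # replay later


source = ReplayableSource(["a", "b"])
processed = []
failed_once = False
while (message := source.next_message()) is not None:
    msg_id, payload = message
    if payload == "a" and not failed_once:
        failed_once = True
        source.fail(msg_id)  # simulate a processing failure
        continue
    processed.append(payload)
    source.ack(msg_id)
print(processed)
```

The failed message is processed after the others, so replay preserves delivery but not necessarily order.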

Reliability Limitations: Integrating Kafka with Apache Storm
 Exactly once processing requires a “durable” data source.
 At least once processing requires a “reliable” data source.
 An “unreliable” data source can be wrapped to provide additional
guarantees.
 For the Apache Storm demo, I backed the unreliable data source with
Apache Kafka (a minor latency overhead in exchange for durable, replayable messages).
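
Wrapping an unreliable source can be sketched as writing every message to an append-only log before processing it, so that any range can be read again by offset. The in-memory list below only stands in for a durable log such as a Kafka topic; `DurableLog`, `append`, `read` and `unreliable_source` are hypothetical names, not Kafka's client API.

```python
class DurableLog:
    """Append-only log: every message gets an offset and is never removed,
    so any range of messages can be read again later."""

    def __init__(self):
        self._entries = []

    def append(self, message):
        self._entries.append(message)
        return len(self._entries) - 1  # offset of the stored message

    def read(self, start_offset, end_offset=None):
        """Replay messages with start_offset <= offset < end_offset."""
        return self._entries[start_offset:end_offset]


def unreliable_source():
    """Stand-in for a source that delivers each message only once."""
    yield from ["line 1", "line 2", "line 3"]


log = DurableLog()
for message in unreliable_source():
    log.append(message)  # persist before any processing happens

print(log.read(1))     # replay everything from offset 1
print(log.read(0, 2))  # replay a selected range
```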

Relationship of Storm Topology with Functional Programming
Data → Kafka topic “bible” (messages …5|4|3|2|1) → Spout (f: Line-reader, subscribed to topic “bible” of the Kafka messaging queue) → Bolt (g: Word-Splitter) → Bolt (h: Word-Counter)

Scenarios / Use cases where Storm can be effectively used
 Predictive Analysis
 Social Graph Analysis
 Network Monitoring
 Recommendation Engine
 Realtime Analytics
 Online Machine Learning
 Continuous Computation
 Distributed Remote Procedure Call
 Website Activity Tracking
 Log Aggregation

Storm Components
A Storm cluster has three sets of nodes:
Nimbus Nodes
Zookeeper Nodes
Supervisor Nodes

Storm Components
Nimbus Nodes
 Master Node Daemon
 Distributes code across the
cluster
 Launches workers across the
cluster
 Monitors computation and
reallocates workers as needed

Storm Components
Zookeeper Nodes
 Manages all the coordination
between Nimbus and the
supervisors.

Storm Components
Supervisor Nodes
 Executes a subset of a topology
(spouts and/or bolts).
 Listens for jobs assigned to the
machine and starts and stops
worker processes as necessary.
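
The division of labour between the master and the supervisors can be illustrated with a toy assignment function. This is not Storm's scheduler: the round-robin policy, the function names `assign` and `reallocate`, and the node and task names are all assumptions made for the sketch.

```python
def assign(tasks, supervisors):
    """Spread tasks over supervisors round-robin; returns {supervisor: [tasks]}."""
    assignment = {name: [] for name in supervisors}
    for index, task in enumerate(tasks):
        assignment[supervisors[index % len(supervisors)]].append(task)
    return assignment


def reallocate(assignment, failed):
    """Move the tasks of a failed supervisor onto the surviving ones."""
    orphaned = assignment[failed]
    survivors = {name: list(tasks) for name, tasks in assignment.items() if name != failed}
    # Hand orphaned tasks to the least-loaded survivors first.
    names = sorted(survivors, key=lambda name: len(survivors[name]))
    for index, task in enumerate(orphaned):
        survivors[names[index % len(names)]].append(task)
    return survivors


tasks = ["spout-1", "split-1", "split-2", "count-1"]
assignment = assign(tasks, ["node-a", "node-b"])
print(assignment)
print(reallocate(assignment, "node-b"))
```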

Known Limitations:
 Nimbus: a single point of failure
 When Nimbus is down:
 Topologies continue to work
 Tasks from failing nodes (Spouts/Bolts) aren’t reassigned to other nodes
 Can’t upload a new topology or rebalance an old one
 It is recommended to run Nimbus under a supervision tool such as daemontools or monit so that
it is restarted automatically when it goes down.
(In contrast, in Hadoop, if the JobTracker dies, all the running jobs are lost.)

Contact
 For more details about our services, please get in touch
with us.
contact@folio3.com
US Office: (408) 365-4638
www.folio3.com
