TUGA IT 2017
LISBON, PORTUGAL
THANK YOU TO OUR SPONSORS
PLATINUM
GOLD SILVER
PARTICIPATING COMMUNITIES
[community logos]
Event processing with
Apache Storm
Nuno Caneco - Tuga IT - 20/May/2017
Nuno Caneco
Senior Software Engineer @ Talkdesk
/nunocaneco
nuno.caneco@gmail.com
/@nuno.caneco
Who am I
Stream Processing - Why?
WHY
● Data is crucial for business
● New data is always being generated
● Companies want to extract value from
data in “real-time”
USE CASES
● Fraud detection
● Sensor data aggregation
● Live monitoring
What is Storm?
Apache Storm is a free and open source distributed realtime computation system.
Storm makes it easy to reliably process unbounded streams of data, doing for realtime
processing what Hadoop did for batch processing. Storm is simple, can be used with
any programming language, and is a lot of fun to use!
Storm has many use cases: real time analytics, online machine learning, continuous
computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at
over a million tuples processed per second per node. It is scalable, fault-tolerant,
guarantees your data will be processed, and is easy to set up and operate.
http://storm.apache.org/
Under the hood
(A bit of)
Architecture
[Architecture diagram]
Nimbus: master node
Zookeeper: cluster coordination
Supervisors: run on the cluster nodes and launch the Worker processes (JVM instances)
Concepts
Topology
Topologies combine individual work units to be applied to input data
[Topology diagram] data enters through a Spout and flows as [Tuple]s through a graph of Bolts (A, B, C, D) until it leaves the topology as [Data out]
Spout
● First node of every topology
○ Collects data from the outside world
○ Injects the data into the topology for processing
● Must implement the ISpout interface
○ BaseRichSpout is a more convenient abstract class (see the sketch below)
Spouts
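A minimal Spout sketch follows; the class, field, and sentence data here are made-up examples (Storm 1.x packages assumed), not part of the talk's demo:
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"hello storm", "stream all the things"};
    private int index = 0;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        // A real Spout would poll a queue or an external API here.
        String sentence = sentences[index];
        index = (index + 1) % sentences.length;
        // The second argument is the message id used for ack()/fail() tracking.
        collector.emit(new Values(sentence), sentence);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}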
Bolt
● Middle or terminating nodes of a topology
● Implements a Unit of Data Processing
● Each Bolt has an output stream
● Can emit one or more Tuples to other Bolts subscribing to the output stream
● Must implement the IBolt interface
○ BaseRichBolt is a more convenient abstract class
Bolt
Tuple
Hash-like data structure containing the data that flows between Spouts and Bolts
Data can be accessed by:
● Field index: [{0, “foo”}, {1, “bar”}]
● Key name: [{“foo_key”, “foo”}, {“bar_key”, “bar”}]
Values can be:
● Java primitive types
● String
● byte[]
● Any Serializable object
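For illustration, this is how those values could be read inside a Bolt's execute(); the field names are just the examples above:
public void execute(Tuple tuple) {
    String byIndex = tuple.getString(0);                // access by field index
    String byName = tuple.getStringByField("foo_key");  // access by declared field name
    Object anything = tuple.getValueByField("bar_key"); // any Serializable value
    // ... process the values ...
}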
Example: Alert on monitored words
[Topology diagram for the demo]
Message queue → [Message] → Collector Spout → {Message} → Split Sentence Bolt → {Word, MessageId} → Monitored Words Bolt → {MonitoredWord, MessageId} → Notify User Bolt and Store event on DB Bolt
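A sketch of how such a topology could be wired with TopologyBuilder; the component ids and classes below are assumptions, not the actual demo code:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("collector-spout", new CollectorSpout());
builder.setBolt("split-sentence", new SplitSentenceBolt())
       .shuffleGrouping("collector-spout");
builder.setBolt("monitored-words", new MonitoredWordsBolt())
       .shuffleGrouping("split-sentence");
builder.setBolt("notify-user", new NotifyUserBolt())
       .shuffleGrouping("monitored-words");
builder.setBolt("store-event", new StoreEventBolt())
       .shuffleGrouping("monitored-words");
StormTopology topology = builder.createTopology();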
Demo
Message Processing Guarantees
[Diagram: an error while a Bolt is processing means the message is lost]
Acknowledging Tuples
● ack(): Tuple was processed successfully
● fail(): Tuple failed to process
Tuples that are neither ack()ed nor fail()ed are replayed automatically after a timeout
Calling fail() fails the whole tuple tree: the Spout that emitted the original tuple is notified (sketch below)
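A minimal sketch of explicit acknowledgement inside a Bolt; process() is a placeholder for the Bolt's real work:
public void execute(Tuple tuple) {
    try {
        process(tuple);        // placeholder for the actual processing logic
        collector.ack(tuple);  // success: the tuple leaves the pending set
    } catch (Exception e) {
        collector.fail(tuple); // failure: the Spout is notified and can replay the message
    }
}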
Acknowledgement done right
Ack: Anchoring
public class SplitSentence extends BaseRichBolt {
  private OutputCollector _collector;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    _collector = collector;
  }

  public void execute(Tuple tuple) {
    String sentence = tuple.getString(0);
    for (String word : sentence.split(" ")) {
      _collector.emit(tuple, new Values(word)); // anchors the output tuple to the input tuple
    }
    _collector.ack(tuple); // acknowledges the input tuple
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}
Dealing with fail()
[Diagrams: two tuple trees rooted at a Spout]
Left: every Bolt in the tree calls ack(), so the whole tuple tree succeeds
Right: one Bolt calls fail(), so the failure propagates up the tree and the Spout's fail() is called for the original tuple
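On the Spout side the two outcomes arrive as ack()/fail() callbacks; a sketch, where the pending map and replayQueue are assumptions about how a Spout might track in-flight messages:
public void ack(Object msgId) {
    pending.remove(msgId);                  // fully processed: safe to forget the message
}

public void fail(Object msgId) {
    // A Bolt in the tuple tree failed (or the tuple timed out):
    // put the message back so a later nextTuple() can emit it again.
    replayQueue.add(pending.remove(msgId));
}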
Beware!
Storm is designed to scale to process millions of messages per second.
Its design deliberately assumes that some Tuples might be lost.
If your application needs exactly-once semantics, you should consider using
Trident (more on that in a moment).
Storm does not ensure exactly-once processing.
Demo
Parallelism
[Diagram: a cluster node running two Worker Processes, each with Threads executing Tasks]
Cluster Node → 1+ Worker Processes (JVM instances)
JVM Instance → 1+ Threads
Thread → 1+ Tasks
Each instance of a Bolt or Spout is a Task
Parallelism Example
Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // 2 executors (threads)

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)   // 2 executors...
    .setNumTasks(4)                                         // ...running 4 tasks
    .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6) // 6 executors
    .shuffleGrouping("green-bolt");
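To complete the picture, a sketch of how such a topology is typically submitted; runLocally is a made-up flag and exception handling is omitted:
if (runLocally) {
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("parallelism-example", conf, topologyBuilder.createTopology());
} else {
    StormSubmitter.submitTopology("parallelism-example", conf, topologyBuilder.createTopology());
}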
Stream grouping
● Shuffle grouping: randomly distributed across all downstream Bolts
● Fields grouping: GROUP BY values - Same values of the grouped fields will be
delivered to the same Bolt
● All grouping: The stream is replicated across all the bolt's tasks. Use this grouping
with care
● Direct grouping: The producer of the Tuple must indicate which consumer will
receive the Tuple
● Custom Grouping: implement your own when none of the built-in groupings fit (a few grouping calls are sketched below)
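The groupings translate into calls on the BoltDeclarer; a sketch with assumed component names:
builder.setBolt("count-bolt", new WordCountBolt(), 4)
       .fieldsGrouping("split-bolt", new Fields("word")); // the same word always goes to the same task
builder.setBolt("metrics-bolt", new MetricsBolt(), 1)
       .allGrouping("split-bolt");                        // every task receives every tuple
builder.setBolt("router-bolt", new RouterBolt(), 2)
       .directGrouping("split-bolt");                     // the producer picks the target task via emitDirect()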
Storm UI - Cluster [screenshots]
Storm UI - Topology [screenshots]
Other features: Trident
Trident is a higher-level abstraction layer on top of Storm for managing state across the topology
The state can be kept:
● Internally in the topology - in memory or backed by HDFS
● Externally on a Database - such as Memcached or Cassandra
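As an illustration only, a Trident word-count pipeline in the spirit of the Storm docs; sentenceSpout is some spout and Split a user-defined BaseFunction, both assumptions here:
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology
    .newStream("sentences", sentenceSpout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));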
Other features: Storm SQL
The Storm SQL integration allows users to run SQL queries over streaming data
in Storm.
Cool feature, but still experimental
Q&A
Questions ?
Thank you
/nunocaneco
nuno.caneco@gmail.com
/@nuno.caneco
PLEASE FILL IN EVALUATION FORMS
FRIDAY, MAY 19th SATURDAY, MAY 20th
https://survs.com/survey/cprwce7pi8 https://survs.com/survey/l9kksmlzd8
YOUR OPINION IS IMPORTANT!
THANK YOU TO OUR SPONSORS
PLATINUM
GOLD SILVER
Tuga it 2017 - Event processing with Apache Storm
Trident: How it works
1. Tuples are processed as small batches
2. Each batch of tuples is given a unique id called the "transaction id" (txid).
a. If the batch is replayed, it is given the exact same txid.
3. State updates are ordered among batches. That is, the state updates for
batch 3 won't be applied until the state updates for batch 2 have succeeded.
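The update rule that falls out of these guarantees can be sketched like this; the store API and StoredValue type are hypothetical:
void applyDelta(String key, long delta, long batchTxid) {
    StoredValue current = store.get(key);  // e.g. kotlin => [count=8, txid=2]
    if (current == null) {
        store.put(key, new StoredValue(delta, batchTxid));
    } else if (current.txid < batchTxid) {
        store.put(key, new StoredValue(current.count + delta, batchTxid));
    }
    // else: this key was already updated by this batch (a replay), so skip the update
}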
Trident: Transactional Spout
Trident Store (before): java => [count=5, txid=1]  kotlin => [count=8, txid=2]  csharp => [count=10, txid=3]
Batch txid=3 arrives containing ["kotlin"], ["kotlin"], ["csharp"]
Applying the batch: "kotlin" += 2 (stored txid 2 < 3, so the update is applied); "csharp" += 0 (stored txid is already 3, so this key was already updated by this batch and the update is skipped)
Trident Store (after): java => [count=5, txid=1]  kotlin => [count=10, txid=3]  csharp => [count=10, txid=3]