Sankt Augustin
24-25.08.2013
Introduction to Twitter Storm
uweseiler
About me
Big Data Nerd · Travelpirate · Photography Enthusiast · Hadoop Trainer · MongoDB Author
About us
is a bunch of…
Big Data Nerds Agile Ninjas Continuous Delivery Gurus
Enterprise Java Specialists Performance Geeks
Join us!
Agenda
• Why Twitter Storm?
• What is Twitter Storm?
• What to do with Twitter Storm?
The 3 V’s of Big Data
Volume · Velocity · Variety
Velocity
Why Twitter Storm?
Batch vs. Real-Time processing
• Batch processing
– Gathering of data and processing as a
group at one time.
• Real-time processing
– Processing of data that takes place as the
information is being entered.
Lambda architecture
Bridging the gap…
• A batch workflow is too slow
• Views are out of date
(Timeline: only the data of the last few hours is not yet absorbed into the batch views.)
Storm vs. Hadoop
• Storm: real-time processing, topologies run forever, no SPOF, stateless nodes
• Hadoop: batch processing, jobs run to completion, NameNode is SPOF, stateful nodes
• Both: scalable, guarantees no data loss, open source
Stream Processing
Stream processing is a technical paradigm for processing big volumes of unbounded sequences of tuples in real-time.
Typical use cases:
• Algorithmic trading
• Sensor data monitoring
• Continuous analytics
Example: Stream of tweets
https://github.com/colinsurprenant/tweitgeist
Agenda
• Why Twitter Storm?
• What is Twitter Storm?
• What to do with Twitter Storm?
Welcome, Twitter Storm!
• Created by Nathan Marz @ BackType
– Analyze tweets, links, users on Twitter
• Open sourced on 19th September, 2011
– Eclipse Public License 1.0
– Storm v0.5.2
• Latest Updates
– Current stable release v0.8.2 released on 11th January,
2013
– Major core improvements planned for v0.9.0
– Storm will be an Apache Project [soon..]
Storm under the hood
• Java & Clojure
• Apache Thrift
– Cross language bridge, RPC, Framework to build
services
• ZeroMQ
– Asynchronous message transport layer
• Kryo
– Serialization framework
• Jetty
– Embedded web server
Conceptual view
• Spout: Source of streams
• Bolt: Consumer of streams, processes tuples, possibly emits new tuples
• Tuple: List of name-value pairs
• Stream: Unbounded sequence of tuples
• Topology: Network of spouts & bolts as the nodes and streams as the edges
Physical view
• Nimbus: Master daemon process, responsible for distributing code, assigning tasks and monitoring failures
• ZooKeeper: Stores the operational cluster state
• Supervisor: Worker daemon process, listens for work assigned to its worker node
• Worker process: Java process executing a subset of a topology
• Executor: Java thread spawned by a worker process, runs one or more tasks of the same component
• Task: Component (spout/bolt) instance, performs the actual data processing
A worker node hosts one supervisor and several worker processes (worker slots).
A simple example: WordCount
shakespeare.txt → FileReaderSpout (emits "line") → WordSplitBolt (emits "word") → WordCountBolt (sorted list of counts):
of: 18126
to: 18763
i: 19540
and: 26099
the: 27730
FileReaderSpout I
package de.codecentric.storm.wordcount.spouts;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
public class FileReaderSpout extends BaseRichSpout {
private SpoutOutputCollector collector;
private FileReader fileReader;
private boolean completed = false;
public void ack(Object msgId) {
System.out.println("OK:" + msgId);
}
public void fail(Object msgId) {
System.out.println("FAIL:" + msgId);
}
FileReaderSpout II
/**
* Declare the output field "line"
*/
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("line"));
}
/**
* We will read the file and get the collector object
*/
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
try {
this.fileReader = new FileReader(conf.get("wordsFile").toString());
} catch (FileNotFoundException e) {
throw new RuntimeException("Error reading file ["
+ conf.get("wordsFile") + "]");
}
this.collector = collector;
}
public void close() {
}
FileReaderSpout III
/**
* The only thing this method does is emit each line of the file
*/
public void nextTuple() {
/**
* nextTuple() is called forever, so once the file has been read we
* just wait and return
*/
if (completed) {
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
// Ignore the interruption and return
}
return;
}
String str;
// Open the reader
BufferedReader reader = new BufferedReader(fileReader);
try {
// Read all lines
while ((str = reader.readLine()) != null) {
/**
* Emit each line as a value
*/
this.collector.emit(new Values(str), str);
}
} catch (Exception e) {
throw new RuntimeException("Error reading tuple", e);
} finally {
completed = true;
}
}
}
WordSplitBolt I
package de.codecentric.storm.wordcount.bolts;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
public class WordSplitBolt extends BaseBasicBolt {
public void cleanup() {}
/**
* The bolt will only emit the field "word"
*/
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
WordSplitBolt II
/**
* The bolt will receive the line from the
* words file and process it to split it into words
*/
public void execute(Tuple input, BasicOutputCollector collector) {
String sentence = input.getString(0);
String[] words = sentence.split(" ");
for(String word : words){
word = word.trim();
if(!word.isEmpty()){
word = word.toLowerCase();
collector.emit(new Values(word));
}
}
}
}
WordCountBolt I
package de.codecentric.storm.wordcount.bolts;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
public class WordCountBolt extends BaseBasicBolt {
private static final long serialVersionUID = 1L;
Integer id;
String name;
Map<String, Integer> counters;
WordCountBolt II
/**
* On create
*/
@Override
public void prepare(Map stormConf, TopologyContext context) {
this.counters = new HashMap<String, Integer>();
this.name = context.getThisComponentId();
this.id = context.getThisTaskId();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
String str = input.getString(0);
/**
* If the word doesn't exist in the map, create an entry with count 1;
* otherwise increment the count
*/
if (!counters.containsKey(str)) {
counters.put(str, 1);
} else {
Integer c = counters.get(str) + 1;
counters.put(str, c);
}
}
WordCountBolt III
/**
* At the end of the topology (when the cluster is shut down) we will
* show the word counters
*/
@Override
public void cleanup() {
// Sort map
SortedSet<Map.Entry<String, Integer>> sortedCounts = entriesSortedByValues(counters);
System.out.println("-- Word Counter [" + name + "-" + id + "] --");
for (Map.Entry<String, Integer> entry : sortedCounts) {
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
…
}
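The entriesSortedByValues helper is elided on the slide ("…"). A minimal sketch of what such a helper might look like, ordering entries by ascending count with ties broken by key (the generics and tie-breaking are assumptions, not the speaker's original code):

```java
import java.util.Comparator;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class WordCountUtil {
    // Returns the map entries ordered by ascending value; ties are broken
    // by key so that distinct words with equal counts are all kept
    // (a TreeSet would otherwise drop entries that compare as equal).
    public static <K extends Comparable<? super K>, V extends Comparable<? super V>>
            SortedSet<Map.Entry<K, V>> entriesSortedByValues(Map<K, V> map) {
        SortedSet<Map.Entry<K, V>> sorted = new TreeSet<Map.Entry<K, V>>(
                new Comparator<Map.Entry<K, V>>() {
                    public int compare(Map.Entry<K, V> e1, Map.Entry<K, V> e2) {
                        int byValue = e1.getValue().compareTo(e2.getValue());
                        return byValue != 0 ? byValue : e1.getKey().compareTo(e2.getKey());
                    }
                });
        sorted.addAll(map.entrySet());
        return sorted;
    }
}
```

With the Shakespeare counts from the example, iterating the returned set prints the least frequent word first and "the" last, matching the slide's sorted list.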
WordCountTopology
public class WordCountTopology {
public static void main(String[] args) throws InterruptedException {
// Topology definition
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-reader",new FileReaderSpout());
builder.setBolt("word-normalizer", new WordSplitBolt())
.shuffleGrouping("word-reader");
builder.setBolt("word-counter", new WordCountBolt(),1)
.fieldsGrouping("word-normalizer", new Fields("word"));
// Configuration
Config conf = new Config();
conf.put("wordsFile", args[0]);
conf.setDebug(false);
// Run Topology
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count-topology", conf, builder.createTopology());
// You wouldn't do this in a production topology
Utils.sleep(10000);
cluster.killTopology("word-count-topology");
cluster.shutdown();
}
}
Stream Grouping
• Each spout or bolt might be running n instances in parallel
• Groupings are used to decide to which task in the subscribing bolt a tuple is sent
• Possible Groupings:
Grouping: Feature
Shuffle: Random grouping
Fields: Grouped by value, such that equal values result in the same task
All: Replicates to all tasks
Global: Makes all tuples go to one task
None: Makes the bolt run in the same thread as the bolt/spout it subscribes to
Direct: The producer (the task that emits) controls which consumer will receive the tuple
Local: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks
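Conceptually, a fields grouping picks the target task from a hash of the grouped field's value, so equal values always reach the same task. A simplified sketch of that idea (illustrative only, not Storm's actual implementation):

```java
public class FieldsGroupingSketch {
    // Chooses the target task index for a tuple based on the grouped
    // field's value: equal values always map to the same task, which is
    // what makes per-word counting in WordCountBolt correct.
    public static int chooseTask(String fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }
}
```

A shuffle grouping, by contrast, would pick a random task per tuple, so counts for one word could end up scattered across tasks.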
Key features of Twitter Storm
Storm is
• Fast & scalable
• Fault-tolerant
• Guaranteeing message processing
• Easy to setup & operate
• Free & Open Source
Key feature: Fast & scalable
Extremely performant
Parallelism
• Number of worker nodes = 2
• Number of worker slots per node = 4
• Number of topology workers = 4
• FileReaderSpout: parallelism_hint = 2, number of tasks not specified = same as parallelism hint = 2
• WordSplitBolt: parallelism_hint = 4, number of tasks = 8
• WordCountBolt: parallelism_hint = 6, number of tasks not specified = 6
• Number of component instances = 2 + 8 + 6 = 16
• Number of executor threads = 2 + 4 + 6 = 12
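The arithmetic above can be sketched in a few lines (helper names are illustrative; the rule is that the number of tasks defaults to the parallelism hint when not specified, component instances are the sum of tasks, and executor threads are the sum of hints):

```java
public class ParallelismMath {
    // Number of tasks defaults to the parallelism hint when not specified.
    public static int tasks(int parallelismHint, Integer explicitTasks) {
        return explicitTasks != null ? explicitTasks : parallelismHint;
    }

    // Component instances = sum of tasks over all components.
    public static int componentInstances(int[] hints, Integer[] explicitTasks) {
        int sum = 0;
        for (int i = 0; i < hints.length; i++) {
            sum += tasks(hints[i], explicitTasks[i]);
        }
        return sum;
    }

    // Executor threads = sum of parallelism hints over all components.
    public static int executorThreads(int[] hints) {
        int sum = 0;
        for (int h : hints) {
            sum += h;
        }
        return sum;
    }
}
```

With the example's hints {2, 4, 6} and explicit task counts {null, 8, null} this yields 16 instances and 12 executors, matching the slide.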
Message passing
Each worker process has a receive thread that moves incoming tuples from other workers into the executors' receive queues, and a transfer thread that moves outgoing tuples from the internal transfer queue to the transfer queue and on to other workers.
• Interprocess communication is mediated by ZeroMQ; transfer between workers uses Kryo serialization
• Local communication is mediated by the LMAX Disruptor; transfer inside a worker needs no serialization
Key feature: Fault-tolerant
Fault tolerance
Cluster works normally: Nimbus monitors the cluster state via ZooKeeper. Each supervisor synchronizes its assignments via ZooKeeper, sends heartbeats, and reads worker heartbeats from the local file system. Each worker sends executor heartbeats to ZooKeeper.
Fault tolerance
Nimbus goes down: Processing will still continue, but topology lifecycle operations and the reassignment facility are lost.
Fault tolerance
Worker node goes down: Nimbus will reassign its tasks to other machines and the processing will continue.
Fault tolerance
Supervisor goes down: Processing will still continue, but assignments are no longer synchronized.
Fault tolerance
Worker process goes down: The supervisor will restart the worker process and the processing will continue.
Key feature: Guaranteeing message processing
Reliability API
public class FileReaderSpout extends BaseRichSpout {
public void nextTuple() {
…
// Emitting the tuple with a message ID
UUID messageID = getMsgID();
collector.emit(new Values(line), messageID);
}
public void ack(Object msgId) {
// Do something with the acked message id
}
public void fail(Object msgId) {
// Do something with the failed message id
}
}
public class WordSplitBolt extends BaseRichBolt {
private OutputCollector collector;
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
public void execute(Tuple input) {
for (String s : input.getString(0).split(" ")) {
// Anchoring the incoming tuple to the outgoing tuple
collector.emit(input, new Values(s));
}
// Sending the ack
collector.ack(input);
}
}
Tuple tree: the line tuple "This is a line" is the root; each word tuple emitted by the bolt is anchored to it as a child.
ACKing Framework
The spout tuple tree (tuples A, B, C flowing through FileReaderSpout → WordSplitBolt → WordCountBolt) is tracked by an implicit ACKer bolt via the streams ACKer init, ACKer ack and ACKer fail. Per spout tuple the ACKer stores: Spout Tuple ID, Spout Task ID, ACK val (64 bit).
• Emitted tuple A: XOR tuple A id with ack val
• Emitted tuple B: XOR tuple B id with ack val
• Emitted tuple C: XOR tuple C id with ack val
• Acked tuple A: XOR tuple A id with ack val
• Acked tuple B: XOR tuple B id with ack val
• Acked tuple C: XOR tuple C id with ack val
When the ACK val has become 0, the ACKer implicit bolt knows the tuple tree has been completed.
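The XOR bookkeeping can be modeled in a few lines: each tuple ID is XORed into the ack val once when the tuple is emitted and once when it is acked, so the value returns to 0 exactly when every emitted tuple has been acked (a simplified model of the ACKer, not Storm's implementation):

```java
public class AckerSketch {
    // 64-bit ack val; starts at 0 and returns to 0 when the tree completes,
    // because x ^ x == 0 and XOR is commutative and associative.
    private long ackVal = 0L;

    // Called when a tuple is emitted into the tree.
    public void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    // Called when a tuple is acked.
    public void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    // The tuple tree has been fully processed when the ack val is 0.
    public boolean treeCompleted() {
        return ackVal == 0L;
    }
}
```

This is why the ACKer needs only constant memory per spout tuple, no matter how large the tuple tree grows.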
Key feature: Easy to setup & operate
Cluster Setup
• Set up a ZooKeeper cluster
• Install dependencies on Nimbus and worker machines
– ZeroMQ 2.1.7 and JZMQ
– Java 6 and Python 2.6.6
– unzip
• Download and extract a Storm release to Nimbus and
worker machines
• Fill in mandatory configuration into storm.yaml
• Launch daemons under supervision using storm scripts
• Start a topology:
– storm jar <path_topology_jar> <main_class> <arg1>…<argN>
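A minimal storm.yaml for such a setup might look like this (host names, paths and ports are placeholders):

```yaml
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
storm.local.dir: "/mnt/storm"
nimbus.host: "nimbus.example.com"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```

The number of supervisor.slots.ports entries determines how many worker slots (worker processes) each node offers.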
Cluster Summary
Topology Summary
Component Summary
Key feature: Free & Open Source
Basic resources
• Storm is available at
– http://storm-project.net/
– https://github.com/nathanmarz/storm
under the Eclipse Public License 1.0
• Get help on
– http://groups.google.com/group/storm-user
– the #storm-user channel on freenode
• Follow
– @stormprocessor and @nathanmarz
Many contributions
• Community repository for modules to use Storm at
– https://github.com/nathanmarz/storm-contrib
– including integration with Redis, Kafka, MongoDB, HBase, JMS, Amazon SQS, …
• Good articles for understanding Storm internals
– http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
– http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Good slides for understanding real-life examples
– http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-you-can-get-in-30-minutes
– http://www.slideshare.net/KrishnaGade2/storm-at-twitter
Coming next…
• Current release: 0.8.2
• Work in progress (newest): 0.9.0-wip21
– SLF4J and Logback
– Pluggable tuple serialization and blowfish
encryption
– Pluggable interprocess messaging and Netty
implementation
– Some bug fixes
– And more
• Storm on YARN
Agenda
• Why Twitter Storm?
• What is Twitter Storm?
• What to do with Twitter Storm?
One example: Webshop
• Webtracking component
• No defined page impression
• Identifying page impressions using
Varnish logs of the click stream data
• Page consists of different fragments
– Body
– Article description
– Recommendation box, …
• Session data also of interest
One example: Webshop
• Custom solution using J2EE and
MongoDB
• Export into Comscore DAx and
Enterprise DWH
• Solution is currently working but not
scalable
• What about performance?
Topology Architecture

More Related Content

PDF
Real time and reliable processing with Apache Storm
PDF
Hadoop Summit Europe 2014: Apache Storm Architecture
PDF
Storm Anatomy
PDF
Storm - As deep into real-time data processing as you can get in 30 minutes.
PPTX
Cassandra and Storm at Health Market Sceince
PPTX
Apache Storm 0.9 basic training - Verisign
PDF
Realtime processing with storm presentation
PDF
Streams processing with Storm
Real time and reliable processing with Apache Storm
Hadoop Summit Europe 2014: Apache Storm Architecture
Storm Anatomy
Storm - As deep into real-time data processing as you can get in 30 minutes.
Cassandra and Storm at Health Market Sceince
Apache Storm 0.9 basic training - Verisign
Realtime processing with storm presentation
Streams processing with Storm

What's hot (19)

PPTX
PDF
Introduction to Apache Storm
PDF
Storm and Cassandra
PDF
Distributed Realtime Computation using Apache Storm
PDF
Storm Real Time Computation
PDF
Learning Stream Processing with Apache Storm
PPTX
Multi-Tenant Storm Service on Hadoop Grid
PDF
PHP Backends for Real-Time User Interaction using Apache Storm.
PPTX
Introduction to Storm
PPTX
Scaling Apache Storm (Hadoop Summit 2015)
PDF
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
PPTX
Apache Storm Internals
PDF
Realtime Analytics with Storm and Hadoop
PPTX
Slide #1:Introduction to Apache Storm
PPTX
Improved Reliable Streaming Processing: Apache Storm as example
PDF
Scaling Apache Storm - Strata + Hadoop World 2014
PDF
Apache Storm Tutorial
PDF
Distributed real time stream processing- why and how
PDF
Storm: The Real-Time Layer - GlueCon 2012
Introduction to Apache Storm
Storm and Cassandra
Distributed Realtime Computation using Apache Storm
Storm Real Time Computation
Learning Stream Processing with Apache Storm
Multi-Tenant Storm Service on Hadoop Grid
PHP Backends for Real-Time User Interaction using Apache Storm.
Introduction to Storm
Scaling Apache Storm (Hadoop Summit 2015)
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Apache Storm Internals
Realtime Analytics with Storm and Hadoop
Slide #1:Introduction to Apache Storm
Improved Reliable Streaming Processing: Apache Storm as example
Scaling Apache Storm - Strata + Hadoop World 2014
Apache Storm Tutorial
Distributed real time stream processing- why and how
Storm: The Real-Time Layer - GlueCon 2012
Ad

Similar to Introduction to Twitter Storm (20)

PDF
Storm @ Fifth Elephant 2013
PDF
PDF
Storm introduction
PPTX
storm-170531123446.pptx
PDF
Storm - The Real-Time Layer Your Big Data's Been Missing
PPT
Real-Time Streaming with Apache Spark Streaming and Apache Storm
PDF
storm at twitter
PDF
Tuga it 2017 - Event processing with Apache Storm
PPTX
Introduction to Storm
PDF
Developing Java Streaming Applications with Apache Storm
PPTX
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
PDF
Real time stream processing presentation at General Assemb.ly
PDF
Storm
PPT
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
PPTX
Apache Storm and twitter Streaming API integration
PPS
Storm presentation
PPTX
실시간 인벤트 처리
PDF
Twitter Stream Processing
PDF
Jan 2012 HUG: Storm
Storm @ Fifth Elephant 2013
Storm introduction
storm-170531123446.pptx
Storm - The Real-Time Layer Your Big Data's Been Missing
Real-Time Streaming with Apache Spark Streaming and Apache Storm
storm at twitter
Tuga it 2017 - Event processing with Apache Storm
Introduction to Storm
Developing Java Streaming Applications with Apache Storm
Sinfonier: How I turned my grandmother into a data analyst - Fran J. Gomez - ...
Real time stream processing presentation at General Assemb.ly
Storm
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Apache Storm and twitter Streaming API integration
Storm presentation
실시간 인벤트 처리
Twitter Stream Processing
Jan 2012 HUG: Storm
Ad

More from Uwe Printz (20)

PDF
Hadoop 3.0 - Revolution or evolution?
PDF
Hadoop 3.0 - Revolution or evolution?
PDF
Hadoop meets Agile! - An Agile Big Data Model
PDF
Hadoop & Security - Past, Present, Future
PDF
Hadoop Operations - Best practices from the field
PDF
Apache Spark
PDF
Lightning Talk: Agility & Databases
PDF
Hadoop 2 - More than MapReduce
PDF
Welcome to Hadoop2Land!
PDF
Hadoop 2 - Beyond MapReduce
PDF
MongoDB für Java Programmierer (JUGKA, 11.12.13)
PDF
Hadoop 2 - Going beyond MapReduce
PDF
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
PDF
MongoDB for Coder Training (Coding Serbia 2013)
PDF
MongoDB für Java-Programmierer
PDF
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
PDF
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
PDF
Introduction to the Hadoop Ecosystem (SEACON Edition)
PDF
Introduction to the Hadoop Ecosystem (codemotion Edition)
PDF
Map/Confused? A practical approach to Map/Reduce with MongoDB
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Hadoop meets Agile! - An Agile Big Data Model
Hadoop & Security - Past, Present, Future
Hadoop Operations - Best practices from the field
Apache Spark
Lightning Talk: Agility & Databases
Hadoop 2 - More than MapReduce
Welcome to Hadoop2Land!
Hadoop 2 - Beyond MapReduce
MongoDB für Java Programmierer (JUGKA, 11.12.13)
Hadoop 2 - Going beyond MapReduce
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
MongoDB for Coder Training (Coding Serbia 2013)
MongoDB für Java-Programmierer
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
Map/Confused? A practical approach to Map/Reduce with MongoDB

Recently uploaded (20)

PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PPTX
Big Data Technologies - Introduction.pptx
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Empathic Computing: Creating Shared Understanding
PDF
Approach and Philosophy of On baking technology
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
Programs and apps: productivity, graphics, security and other tools
Building Integrated photovoltaic BIPV_UPV.pdf
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
The Rise and Fall of 3GPP – Time for a Sabbatical?
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Big Data Technologies - Introduction.pptx
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Advanced methodologies resolving dimensionality complications for autism neur...
Diabetes mellitus diagnosis method based random forest with bat algorithm
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Empathic Computing: Creating Shared Understanding
Approach and Philosophy of On baking technology
“AI and Expert System Decision Support & Business Intelligence Systems”
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Understanding_Digital_Forensics_Presentation.pptx

Introduction to Twitter Storm

  • 2. Sankt Augustin 24-25.08.2013 About me Big Data Nerd TravelpiratePhotography Enthusiast Hadoop Trainer MongoDB Author
  • 3. Sankt Augustin 24-25.08.2013 About us is a bunch of… Big Data Nerds Agile Ninjas Continuous Delivery Gurus Enterprise Java Specialists Performance Geeks Join us!
  • 4. Sankt Augustin 24-25.08.2013 Agenda • Why Twitter Storm? • What is Twitter Storm? • What to do with Twitter Storm?
  • 5. Sankt Augustin 24-25.08.2013 The 3 V’s of Big Data VarietyVolume Velocity
  • 8. Sankt Augustin 24-25.08.2013 Batch vs. Real-Time processing • Batch processing – Gathering of data and processing as a group at one time. • Real-time processing – Processing of data that takes place as the information is being entered.
  • 10. Sankt Augustin 24-25.08.2013 Bridging the gap… • A batch workflow is too slow • Views are out of date Absorbed into batch views Time Not Absorbed Now Just a few hours of data
  • 11. Sankt Augustin 24-25.08.2013 Storm vs. Hadoop • Real-time processing • Topologies run forever • No SPOF • Stateless nodes • Batch processing • Jobs run to completion • NameNode is SPOF • Stateful nodes • Scalable • Gurantees no dataloss • Open Source
  • 12. Sankt Augustin 24-25.08.2013 Stream Processing Stream processing is a technical paradigm to process big volumes of unbound sequence of tuples in real-time Source Stream Processing • Algorithmic trading • Sensor data monitoring • Continuous analytics
  • 13. Sankt Augustin 24-25.08.2013 Example: Stream of tweets https://github.com/colinsurprenant/tweitgeist
  • 14. Sankt Augustin 24-25.08.2013 Agenda • Why Twitter Storm? • What is Twitter Storm? • What to do with Twitter Storm?
  • 15. Sankt Augustin 24-25.08.2013 Welcome, Twitter Storm! • Created by Nathan Marz @ BackType – Analyze tweets, links, users on Twitter • Open sourced on 19th September, 2011 – Eclipse Public License 1.0 – Storm v0.5.2 • Latest Updates – Current stable release v0.8.2 released on 11th January, 2013 – Major core improvements planned for v0.9.0 – Storm will be an Apache Project [soon..]
  • 16. Sankt Augustin 24-25.08.2013 Storm under the hood • Java & Clojure • Apache Thrift – Cross language bridge, RPC, Framework to build services • ZeroMQ – Asynchronous message transport layer • Kryo – Serialization framework • Jetty – Embedded web server
  • 17. Sankt Augustin 24-25.08.2013 Conceptual view Spout Spout Spout: Source of streams Bolt Bolt Bolt Bolt Bolt Bolt: Consumer of streams, Processing of tuples, Possibly emits new tuples Tuple Tuple Tuple Tuple: List of name-value pairs Stream: Unbound sequence of tuples Topology: Network of Spouts & Bolts as the nodes and stream as the edge
  • 18. Sankt Augustin 24-25.08.2013 Physical view Java thread spawned by worker, runs one or more tasks of the same component Nimbus ZooKeeper WorkerSupervisor Executor Task ZooKeeper ZooKeeper Supervisor Supervisor Supervisor Supervisor Worker Worker Worker Node Worker Process Java process executing a subset of topology Component (Spout/ Bolt) instance, performs the actual data processing Master daemon process Responsible for • distributing code • assigning tasks • monitoring failures Storing operational cluster state Worker daemon process listening for work assigned to its node
  • 19. Sankt Augustin 24-25.08.2013 A simple example: WordCount FileReader Spout WordSplit Bolt WordCount Bolt line shakespeare.txt word of: 18126 to: 18763 i: 19540 and: 26099 the: 27730 Sorted list
  • 20. Sankt Augustin 24-25.08.2013 FileReaderSpout I package de.codecentric.storm.wordcount.spouts; import java.io.BufferedReader; import java.io.FileNotFoundException; import java.io.FileReader; import java.util.Map; import backtype.storm.spout.SpoutOutputCollector; import backtype.storm.task.TopologyContext; import backtype.storm.topology.OutputFieldsDeclarer; import backtype.storm.topology.base.BaseRichSpout; import backtype.storm.tuple.Fields; import backtype.storm.tuple.Values; public class FileReaderSpout extends BaseRichSpout { private SpoutOutputCollector collector; private FileReader fileReader; private boolean completed = false; public void ack(Object msgId) { System.out.println("OK:" + msgId); } public void fail(Object msgId) { System.out.println("FAIL:" + msgId); }
  • 21. Sankt Augustin 24-25.08.2013 FileReaderSpout II /** * Declare the output field "line" */ public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("line")); } /** * We will read the file and get the collector object */ public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { try { this.fileReader = new FileReader(conf.get("wordsFile").toString()); } catch (FileNotFoundException e) { throw new RuntimeException("Error reading file [" + conf.get("wordFile") + "]"); } this.collector = collector; } public void close() { }
  • 22. Sankt Augustin 24-25.08.2013 FileReaderSpout III /** * The only thing that the methods will do is emit each file line */ public void nextTuple() { /** * The nextuple it is called forever, so if we have read the file we * will wait and then return */ String str; // Open the reader BufferedReader reader = new BufferedReader(fileReader); try { // Read all lines while ((str = reader.readLine()) != null) { /** * Emit each line as a value */ this.collector.emit(new Values(str), str); } } catch (Exception e) { throw new RuntimeException("Error reading tuple", e); } finally { completed = true; } } }
  • 23. Sankt Augustin 24-25.08.2013 WordSplitBolt I package de.codecentric.storm.wordcount.bolts; import backtype.storm.topology.BasicOutputCollector; import backtype.storm.topology.OutputFieldsDeclarer; import backtype.storm.topology.base.BaseBasicBolt; import backtype.storm.tuple.Fields; import backtype.storm.tuple.Tuple; import backtype.storm.tuple.Values; public class WordSplitBolt extends BaseBasicBolt { public void cleanup() {} /** * The bolt will only emit the field "word" */ public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); }
  • 24. Sankt Augustin 24-25.08.2013 WordSplitBolt II /** * The bolt will receive the line from the * words file and process it to split it into words */ public void execute(Tuple input, BasicOutputCollector collector) { String sentence = input.getString(0); String[] words = sentence.split(" "); for(String word : words){ word = word.trim(); if(!word.isEmpty()){ word = word.toLowerCase(); collector.emit(new Values(word)); } } }
• 25. Sankt Augustin 24-25.08.2013 WordCountBolt I
package de.codecentric.storm.wordcount.bolts;

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class WordCountBolt extends BaseBasicBolt {

    private static final long serialVersionUID = 1L;

    Integer id;
    String name;
    Map<String, Integer> counters;
• 26. Sankt Augustin 24-25.08.2013 WordCountBolt II
/**
 * On create
 */
@Override
public void prepare(Map stormConf, TopologyContext context) {
    this.counters = new HashMap<String, Integer>();
    this.name = context.getThisComponentId();
    this.id = context.getThisTaskId();
}

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) { }

@Override
public void execute(Tuple input, BasicOutputCollector collector) {
    String str = input.getString(0);
    /**
     * If the word doesn't exist in the map, create the entry;
     * otherwise increment its count by 1
     */
    if (!counters.containsKey(str)) {
        counters.put(str, 1);
    } else {
        Integer c = counters.get(str) + 1;
        counters.put(str, c);
    }
}
• 27. Sankt Augustin 24-25.08.2013 WordCountBolt III
/**
 * At the end of the bolt's lifecycle (when the cluster is shut down)
 * we print the word counters
 */
@Override
public void cleanup() {
    // Sort map
    SortedSet<Map.Entry<String, Integer>> sortedCounts = entriesSortedByValues(counters);
    System.out.println("-- Word Counter [" + name + "-" + id + "] --");
    for (Map.Entry<String, Integer> entry : sortedCounts) {
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}
…
}
• 28. Sankt Augustin 24-25.08.2013 WordCountTopology
public class WordCountTopology {

    public static void main(String[] args) throws InterruptedException {
        // Topology definition
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-reader", new FileReaderSpout());
        builder.setBolt("word-normalizer", new WordSplitBolt())
               .shuffleGrouping("word-reader");
        builder.setBolt("word-counter", new WordCountBolt(), 1)
               .fieldsGrouping("word-normalizer", new Fields("word"));

        // Configuration
        Config conf = new Config();
        conf.put("wordsFile", args[0]);
        conf.setDebug(false);

        // Run topology
        conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count-topology", conf, builder.createTopology());

        // You wouldn't do this in a regular topology
        Utils.sleep(10000);
        cluster.killTopology("word-count-topology");
        cluster.shutdown();
    }
}
• 29. Sankt Augustin 24-25.08.2013 Stream Grouping
• Each spout or bolt may run n instances in parallel
• Groupings are used to decide to which task in the subscribing bolt (group) a tuple is sent
• Possible groupings:
  – Shuffle: Random grouping
  – Fields: Grouped by field value, so equal values always go to the same task
  – All: Replicates to all tasks
  – Global: Makes all tuples go to one task
  – None: Makes the bolt run in the same thread as the bolt/spout it subscribes to
  – Direct: The producer (the task that emits) controls which consumer will receive the tuple
  – Local: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks
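To make the fields grouping concrete, here is a minimal sketch of the underlying idea: the field value is hashed and taken modulo the number of tasks, so equal values always land on the same task. The class and method names are illustrative; this mirrors the concept, not Storm's actual routing code.

```java
import java.util.List;

// Simplified illustration of a fields grouping: tuples with equal
// field values are always routed to the same consumer task.
public class FieldsGroupingSketch {

    // Pick the target task index for a given field value.
    public static int targetTask(Object fieldValue, int numTasks) {
        // Math.floorMod keeps the result non-negative even for
        // negative hash codes.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        List<String> words = List.of("storm", "hadoop", "storm", "kafka");
        for (String word : words) {
            System.out.println(word + " -> task " + targetTask(word, numTasks));
        }
        // Both occurrences of "storm" print the same task index,
        // which is exactly what the word-count topology relies on.
    }
}
```

This determinism is why the word-count topology above can keep per-word counters in task-local maps: every occurrence of a word reaches the same WordCountBolt task.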
  • 30. Sankt Augustin 24-25.08.2013 Key features of Twitter Storm Storm is • Fast & scalable • Fault-tolerant • Guaranteeing message processing • Easy to setup & operate • Free & Open Source
  • 31. Sankt Augustin 24-25.08.2013 Key features of Twitter Storm Storm is • Fast & scalable • Fault-tolerant • Guaranteeing message processing • Easy to setup & operate • Free & Open Source
• 33. Sankt Augustin 24-25.08.2013 Parallelism
Number of worker nodes = 2
Number of worker slots per node = 4
Number of topology workers = 4
FileReaderSpout: parallelism_hint = 2, number of tasks not specified = same as parallelism hint = 2
WordSplitBolt: parallelism_hint = 4, number of tasks = 8
WordCountBolt: parallelism_hint = 6, number of tasks not specified = 6
Number of component instances = 2 + 8 + 6 = 16
Number of executor threads = 2 + 4 + 6 = 12
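The slide's arithmetic can be reproduced in a few lines: component instances are the sum of the task counts, executor threads are the sum of the parallelism hints, and an unspecified task count defaults to the hint. The class name is made up for illustration.

```java
// Reproduces the parallelism arithmetic from the slide:
// instances = sum of task counts, executors = sum of parallelism hints.
public class ParallelismMath {

    public static int instances(int[] taskCounts) {
        int sum = 0;
        for (int t : taskCounts) sum += t;
        return sum;
    }

    public static int executors(int[] hints) {
        int sum = 0;
        for (int h : hints) sum += h;
        return sum;
    }

    public static void main(String[] args) {
        // Spout: hint 2, tasks default to 2; WordSplitBolt: hint 4,
        // tasks 8; WordCountBolt: hint 6, tasks default to 6.
        int[] tasks = {2, 8, 6};
        int[] hints = {2, 4, 6};
        System.out.println("instances = " + instances(tasks)); // 16
        System.out.println("executors = " + executors(hints)); // 12
    }
}
```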
• 34. Sankt Augustin 24-25.08.2013 Message passing
(Diagram: receive thread → receiver queue → executors → internal transfer queue → transfer queue → transfer thread, exchanging tuples with other workers)
• Interprocess communication is mediated by ZeroMQ
• Outside transfer is done with Kryo serialization
• Local communication is mediated by LMAX Disruptor
• Inside transfer is done with no serialization
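A toy model of that worker-internal flow, assuming plain standard-library blocking queues in place of Storm's LMAX Disruptor ring buffers: a receive thread feeds an executor's incoming queue, the executor processes each tuple and hands it to the transfer queue, from which a transfer thread would send it downstream. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy sketch of a worker's internal message flow. Storm uses LMAX
// Disruptor queues here; plain BlockingQueues just show the handoff.
public class WorkerQueuesSketch {

    // Push tuples through: receive queue -> executor -> transfer queue.
    public static List<String> runPipeline(List<String> incoming)
            throws InterruptedException {
        BlockingQueue<String> receiveQueue = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> transferQueue = new ArrayBlockingQueue<>(16);
        int n = incoming.size();

        // Executor thread: take from its incoming queue, "process" the
        // tuple (here: uppercase it), hand off to the transfer queue.
        Thread executor = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) {
                    transferQueue.put(receiveQueue.take().toUpperCase());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        executor.start();

        // Receive-thread role: enqueue incoming tuples.
        for (String t : incoming) {
            receiveQueue.put(t);
        }

        // Transfer-thread role: drain the outgoing queue.
        List<String> sent = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            sent.add(transferQueue.take());
        }
        executor.join();
        return sent;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPipeline(List.of("a", "b", "c"))); // [A, B, C]
    }
}
```

Because the handoff stays inside one JVM, no serialization is needed; only tuples leaving through the real transfer thread to other workers are Kryo-serialized.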
  • 35. Sankt Augustin 24-25.08.2013 Key features of Twitter Storm Storm is • Fast & scalable • Fault-tolerant • Guaranteeing message processing • Easy to setup & operate • Free & Open Source
• 36. Sankt Augustin 24-25.08.2013 Fault tolerance
Cluster works normally:
• Nimbus monitors the cluster state via ZooKeeper
• Supervisors synchronize assignments via ZooKeeper and send heartbeats
• Workers send executor heartbeats to ZooKeeper; supervisors read worker heartbeats from the local file system
• 37. Sankt Augustin 24-25.08.2013 Fault tolerance
Nimbus goes down:
Processing will still continue, but topology lifecycle operations and the reassignment facility are lost.
• 38. Sankt Augustin 24-25.08.2013 Fault tolerance
Worker node goes down:
Nimbus will reassign the tasks to other machines and the processing will continue.
• 39. Sankt Augustin 24-25.08.2013 Fault tolerance
Supervisor goes down:
Processing will still continue, but assignments are never synchronized.
• 40. Sankt Augustin 24-25.08.2013 Fault tolerance
Worker process goes down:
The supervisor will restart the worker process and the processing will continue.
  • 41. Sankt Augustin 24-25.08.2013 Key features of Twitter Storm Storm is • Fast & scalable • Fault-tolerant • Guaranteeing message processing • Easy to setup & operate • Free & Open Source
• 42. Sankt Augustin 24-25.08.2013 Reliability API
public class FileReaderSpout extends BaseRichSpout {

    public void nextTuple() {
        …;
        // Emitting a tuple with a message ID
        UUID msgId = getMsgID();
        collector.emit(new Values(line), msgId);
    }

    public void ack(Object msgId) {
        // Do something with the acked message id
    }

    public void fail(Object msgId) {
        // Do something with the failed message id
    }
}

public class WordSplitBolt extends BaseBasicBolt {

    public void execute(Tuple input, BasicOutputCollector collector) {
        // Each emitted word is anchored to the incoming tuple
        // (BasicOutputCollector anchors automatically)
        for (String s : input.getString(0).split(" ")) {
            collector.emit(new Values(s));
        }
        // BaseBasicBolt acks the incoming tuple automatically
    }
}

Tuple tree: "This is a line" → "This", "is", "a", "line"
• 43. Sankt Augustin 24-25.08.2013 ACKing Framework
FileReaderSpout → WordSplitBolt → WordCountBolt, with ACKer init, ACKer ack and ACKer fail messages flowing to the ACKer implicit bolt.
The ACKer implicit bolt tracks per spout tuple: spout tuple ID, spout task ID, ACK val (64 bit).
• Emitted tuple A: XOR tuple A id with ack val
• Emitted tuple B: XOR tuple B id with ack val
• Emitted tuple C: XOR tuple C id with ack val
• Acked tuple A: XOR tuple A id with ack val
• Acked tuple B: XOR tuple B id with ack val
• Acked tuple C: XOR tuple C id with ack val
When the ACK val becomes 0, the ACKer implicit bolt knows the tuple tree has been completed.
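The XOR trick above can be sketched in a few lines: since x ^ x = 0, XORing every tuple id into the ack val once on emit and once on ack drives the value back to 0 exactly when the whole tree has been acked. The class is illustrative; Storm's real ACKer also uses random 64-bit ids so a spurious zero is astronomically unlikely.

```java
// Demonstrates the XOR bookkeeping behind Storm's ACKer bolt.
public class AckValSketch {

    private long ackVal = 0L;

    // Called once when a tuple is emitted into the tree.
    public void emitted(long tupleId) { ackVal ^= tupleId; }

    // Called once when that tuple is acked.
    public void acked(long tupleId) { ackVal ^= tupleId; }

    // x ^ x = 0, so the val is 0 iff every emit has a matching ack.
    public boolean treeComplete() { return ackVal == 0L; }

    public static void main(String[] args) {
        AckValSketch acker = new AckValSketch();
        long a = 0x1234L, b = 0x5678L, c = 0x9abcL; // random ids in Storm
        acker.emitted(a); acker.emitted(b); acker.emitted(c);
        acker.acked(a);   acker.acked(b);
        System.out.println(acker.treeComplete()); // false: c still pending
        acker.acked(c);
        System.out.println(acker.treeComplete()); // true: tree completed
    }
}
```

This constant-memory scheme is what lets a single ACKer task track millions of in-flight tuple trees with just 20 bytes per spout tuple.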
  • 44. Sankt Augustin 24-25.08.2013 Key features of Twitter Storm Storm is • Fast & scalable • Fault-tolerant • Guaranteeing message processing • Easy to setup & operate • Free & Open Source
  • 45. Sankt Augustin 24-25.08.2013 Cluster Setup • Setup ZooKeeper cluster • Install dependencies on Nimbus and worker machines – ZeroMQ 2.1.7 and JZMQ – Java 6 and Python 2.6.6 – unzip • Download and extract a Storm release to Nimbus and worker machines • Fill in mandatory configuration into storm.yaml • Launch daemons under supervision using storm scripts • Start a topology: – storm jar <path_topology_jar> <main_class> <arg1>…<argN>
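The mandatory configuration mentioned above boils down to a handful of keys. A minimal storm.yaml sketch for a 0.8.x cluster might look like this; the hostnames, paths, and ports are placeholders for your environment.

```yaml
# Minimal storm.yaml sketch (Storm 0.8.x); values are placeholders.
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
nimbus.host: "nimbus.example.com"
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```

Each port in supervisor.slots.ports corresponds to one worker slot on the node, matching the "worker slots per node" figure from the parallelism slide.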
  • 49. Sankt Augustin 24-25.08.2013 Key features of Twitter Storm Storm is • Fast & scalable • Fault-tolerant • Guaranteeing message processing • Easy to setup & operate • Free & Open Source
• 50. Sankt Augustin 24-25.08.2013 Basic resources
• Storm is available at
  – http://storm-project.net/
  – https://github.com/nathanmarz/storm under Eclipse Public License 1.0
• Get help on
  – http://groups.google.com/group/storm-user
  – #storm-user freenode room
• Follow @stormprocessor and @nathanmarz
• 51. Sankt Augustin 24-25.08.2013 Many contributions
• Community repository for modules to use Storm at
  – https://github.com/nathanmarz/storm-contrib
  – including integration with Redis, Kafka, MongoDB, HBase, JMS, Amazon SQS, …
• Good articles for understanding Storm internals
  – http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
  – http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Good slides for understanding real-life examples
  – http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-you-can-get-in-30-minutes
  – http://www.slideshare.net/KrishnaGade2/storm-at-twitter
  • 52. Sankt Augustin 24-25.08.2013 Coming next… • Current release: 0.8.2 • Work in progress (newest): 0.9.0-wip21 – SLF4J and Logback – Pluggable tuple serialization and blowfish encryption – Pluggable interprocess messaging and Netty implementation – Some bug fixes – And more • Storm on YARN
  • 53. Sankt Augustin 24-25.08.2013 Agenda • Why Twitter Storm? • What is Twitter Storm? • What to do with Twitter Storm?
  • 54. Sankt Augustin 24-25.08.2013 One example: Webshop • Webtracking component • No defined page impression • Identifying page impressions using Varnish logs of the click stream data • Page consists of different fragments – Body – Article description – Recommendation box, … • Session data also of interest
  • 55. Sankt Augustin 24-25.08.2013 One example: Webshop • Custom solution using J2EE and MongoDB • Export into Comscore DAx and Enterprise DWH • Solution is currently working but not scalable • What about performance?