©MapR Technologies
Hadoop and Storm
AJUG 5/21/2013
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• banderson@maprtech.com
Hadoop: A Paradigm Shift
 Distributed computing platform
– Large clusters
– Commodity hardware
 Pioneered at Google
– Google File System, MapReduce and BigTable
 Commercially available as Hadoop
Ship the Function to the Data
[Diagram: Traditional Architecture. Data blocks sit on SAN/NAS storage; a single function sits with the RDBMS.]
[Diagram: Distributed Computing. Twelve nodes, each pairing its own data with the function.]
MapReduce Flow
Input → Map → Combine → Shuffle and sort → Reduce → Output
[Diagram shows more than one Reduce box.]
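The flow above can be sketched as a small in-memory word count. This is a hedged illustration only: it is plain Java, not the Hadoop API, and the class name `MapReduceFlowSketch` and its method names are hypothetical. Each method stands in for one box of the diagram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of the MapReduce flow; this is NOT the Hadoop API.
class MapReduceFlowSketch {

    // Map: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Combine: pre-aggregate one mapper's output locally, before the shuffle.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return partial;
    }

    // Shuffle and sort: group all partial counts by key, with keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map<String, Integer>> partials) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        return grouped;
    }

    // Reduce: sum the grouped values for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the whole flow over a list of input splits.
    static Map<String, Integer> run(List<String> splits) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String split : splits) {
            partials.add(combine(map(split)));
        }
        Map<String, Integer> output = new TreeMap<>();
        shuffle(partials).forEach((k, vs) -> output.put(k, reduce(vs)));
        return output;
    }

    public static void main(String[] args) {
        // prints {data=1, function=1, ship=1, the=2, to=1}
        System.out.println(run(List.of("ship the function", "to the data")));
    }
}
```

In a real cluster the map and combine steps run on the nodes that hold each split, and only the combined pairs cross the network during the shuffle.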
Variation: No Reduce Necessary
Example: Batch File Transformation
Input → Map → Output
MPG → M4V
Variation: Multiple MapReduces
Example: Fraud Detection in User Transactions
[Diagram: chain of MapReduce jobs. Labels: Transaction data; LDA training; LDA scoring; HBase / MapR M7 Edition; G2 score; 95th-percentile LDA anomaly; Candidate events for analyst review.]
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Pig
MR Equivalent to Pig Script
Hive
MapR Distribution for Apache Hadoop
Complete Hadoop distribution
Comprehensive management suite
Industry-standard interfaces
Enterprise-grade dependability
Enterprise-grade security (US Intelligence Agency)
Patents - IP
Higher performance
Hadoop Use Cases
ETL/EDW Offload
Sensor / Telemetry Data
Recommendation Engine
Search
•ML algorithms
•eDiscovery
Fleet Management
Fraud Detection / Risk Management
Traffic Decongestion
One Platform for Big Data
[Diagram: one platform running MapReduce, File-Based Applications, SQL, Database, Search, Stream Processing, … across Batch, Interactive and Realtime workloads, on a base of 99.999% HA, Data Protection, Disaster Recovery, Scalability & Performance, Enterprise Integration and Multi-tenancy.]
Batch
Log file Analysis
Data Warehouse Offload
Fraud Detection
Clickstream Analytics
Realtime
Sensor Analysis
“Twitterscraping”
Telematics
Process Optimization
Interactive
Forensic Analysis
Analytic Modeling
BI User Focus
Storm
“Hadoop for Realtime”
Before Storm
Queues Workers
Example
(simplified)
Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher-level abstraction than message passing
“Just works”
Unbounded sequence of tuples
[Diagram: a stream drawn as a row of tuples]
Streams
Source of streams
Spouts
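public interface ISpout extends Serializable {
    // Called once when the spout task starts; keep the collector for emitting.
    void open(Map conf,
              TopologyContext context,
              SpoutOutputCollector collector);
    // Called when the spout is shutting down.
    void close();
    // Called repeatedly; emit the next tuple(s) here, or just return if there are none.
    void nextTuple();
    // The tuple emitted with this msgId was fully processed.
    void ack(Object msgId);
    // The tuple emitted with this msgId failed or timed out.
    void fail(Object msgId);
}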
Spouts
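One way a spout can use `ack` and `fail` is to remember each in-flight message by id and re-queue it on failure. The sketch below is a stand-alone model of that contract under stated assumptions: it does not use Storm classes, and `ReplayingSpoutSketch`, its `long` message ids and its `emitted` list (standing in for the output collector) are hypothetical names for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-alone model of the spout ack/fail contract; it does not use Storm classes.
class ReplayingSpoutSketch {
    private final Deque<String> pending = new ArrayDeque<>();    // not yet emitted
    private final Map<Long, String> inFlight = new HashMap<>();  // emitted, awaiting ack or fail
    private final List<String> emitted = new ArrayList<>();      // stands in for the output collector
    private long nextId = 0;

    ReplayingSpoutSketch(List<String> source) {
        pending.addAll(source);
    }

    // nextTuple: emit at most one message, tagged with a message id.
    // Returns the id used, or -1 when there was nothing to emit.
    long nextTuple() {
        String msg = pending.poll();
        if (msg == null) {
            return -1;
        }
        long id = nextId++;
        inFlight.put(id, msg);
        emitted.add(msg);
        return id;
    }

    // ack: the message was fully processed, so it can be forgotten.
    void ack(long msgId) {
        inFlight.remove(msgId);
    }

    // fail: processing failed or timed out, so put the message back for replay.
    void fail(long msgId) {
        String msg = inFlight.remove(msgId);
        if (msg != null) {
            pending.addFirst(msg);
        }
    }

    List<String> emitted() {
        return emitted;
    }

    int inFlightCount() {
        return inFlight.size();
    }

    public static void main(String[] args) {
        ReplayingSpoutSketch spout = new ReplayingSpoutSketch(List.of("a", "b"));
        long first = spout.nextTuple();  // emits "a"
        spout.fail(first);               // "a" goes back to the front of the queue
        long retry = spout.nextTuple();  // emits "a" again under a new id
        spout.ack(retry);
        spout.ack(spout.nextTuple());    // emits and acks "b"
        // prints [a, a, b] in flight: 0
        System.out.println(spout.emitted() + " in flight: " + spout.inFlightCount());
    }
}
```

Replay on `fail` is the spout's own responsibility; a spout that ignores `fail` simply drops failed messages.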
Processes input streams and produces new streams
[Diagram: tuples flowing into and out of a bolt]
Bolts
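import java.util.Map;
import backtype.storm.task.OutputCollector;   // packages are org.apache.storm.* in Storm 1.0+
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DoubleAndTripleBolt extends BaseRichBolt {
    private OutputCollector _collector;

    public void prepare(Map conf,
                        TopologyContext context,
                        OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple input) {
        int val = input.getInteger(0);
        // Anchor the new tuple to the input, then ack the input.
        _collector.emit(input, new Values(val*2, val*3));
        _collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("double", "triple"));
    }
}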
Bolts
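The anchored `emit(input, ...)` followed by `ack(input)` is what lets Storm track a whole tuple tree. As documented for Storm, the tracking keeps one 64-bit value per spout tuple and XORs each tuple id into it once on emit and once on ack; the value returns to zero when every tuple in the tree has been acked. The sketch below models only that idea: `AckTrackerSketch` and the fixed ids are hypothetical, and this is not Storm's actual acker code.

```java
// Model of XOR-based ack tracking for one spout tuple's tree; not Storm's acker code.
class AckTrackerSketch {
    private long ackVal = 0;

    // Called when a tuple is emitted anchored to this tree.
    void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    // Called when that tuple is acked by the bolt that consumed it.
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    // Every id has been XORed in twice (emit and ack) once this is zero.
    boolean fullyProcessed() {
        return ackVal == 0;
    }

    public static void main(String[] args) {
        // Fixed 64-bit ids for a repeatable run; real ids would be random.
        long root = 0x123456789ABCDEF0L;
        long child = 0x0FEDCBA987654321L;

        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.emitted(root);                  // the spout emits the root tuple
        boolean afterSpout = tracker.fullyProcessed();

        tracker.emitted(child);                 // a bolt emits a child anchored to the root
        tracker.acked(root);                    // ... and then acks the root
        boolean afterBolt = tracker.fullyProcessed();

        tracker.acked(child);                   // the downstream bolt acks the child
        boolean afterLeaf = tracker.fullyProcessed();

        // prints false false true
        System.out.println(afterSpout + " " + afterBolt + " " + afterLeaf);
    }
}
```

The memory cost is constant per spout tuple no matter how large the tree grows, which is why ids need to be random 64-bit values: small sequential ids could cancel each other out by accident.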
Network of spouts and bolts
Topologies
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"),
              new Split(),
              new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(),
                             new Count(),
                             new Fields("count"))
        .parallelismHint(6);
Trident
Cascading for Storm
Storm
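What the Trident pipeline above computes can be modeled in a few lines of plain Java. This is a hedged sketch with no Storm or Trident dependency: `WordCountStateSketch` and `processBatch` are hypothetical names, the sorted map stands in for the persistent state, and each call stands in for one of the small batches Trident processes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the split -> groupBy -> count pipeline; it does not use Trident classes.
class WordCountStateSketch {
    // Stands in for the persistent state that keeps counts across batches.
    private final Map<String, Long> counts = new TreeMap<>();

    // Process one batch of sentences: split into words, group by word,
    // and add each occurrence to the running count.
    void processBatch(List<String> sentences) {
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
    }

    Map<String, Long> counts() {
        return counts;
    }

    public static void main(String[] args) {
        WordCountStateSketch state = new WordCountStateSketch();
        state.processBatch(List.of("the cow jumped over the moon"));
        state.processBatch(List.of("the man went to the store"));
        // prints {cow=1, jumped=1, man=1, moon=1, over=1, store=1, the=4, to=1, went=1}
        System.out.println(state.counts());
    }
}
```

The model leaves out what Trident adds on top: partitioning the grouped stream across parallel tasks and updating the state consistently when a batch is replayed.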
[Diagram labels: Raw Data; Queue (Kafka); Hadoop batch processes; realtime processes; Apps; Business Value]
Parallel Cluster Ingest
[Diagram labels: Raw Data; Hadoop batch processes; realtime processes; Storm; TailSpout; Franz; Queue (Kafka); StormKafka; Apps; Business Value]
[Diagram labels, separate-cluster version: Twitter; Twitter API; TweetLogger; three Kafka clusters; Kafka API; Storm; Web Service; NAS; Web Data; Hadoop; Flume; HDFS Data]
[Diagram labels, MapR version: Twitter; Twitter API; Catcher; Storm; Topic Queue; Web-server (http); Web Data; MapR; TweetLogger]
Scaling Estimates
Twitter Firehose
 Old School – 8+ separate clusters, 20-25 nodes
• >3 Kafka nodes
• >2 TweetLoggers
• 5-10 Hadoop
• >2 Catcher nodes
• >3 Storm
• 3 ZooKeeper nodes
• NAS for web storage
• >2 web servers
 MapR – One Platform
• 5-10 nodes total
• Any node does any job
• Full HA included
• Backups included
github
• Watch TailSpout & Franz development
• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout
• And our example Twitter implementation
• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout-test
Demo
