Storm: Distributed Real-Time
Computation
(better than Hadoop)
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
Talk Breakdown
(pie chart: 29% / 20% / 31% / 20%)
Topics
(1) Motivation
(2) Polyglot Persistence
(3) Analytics
(4) Lambda Architecture
Health Market Science - Then
What we were.
Health Market Science - Now
Intersecting Big Data
w/ Healthcare
We’re fixing healthcare!
Data Pipelines
I/O
The Input
From government, state boards, etc.
From the internet,
social data,
networks / graphs
From third-parties,
medical claims
From customers,
expenses,
sales data,
beneficiary information,
quality scores
Data
Pipeline
The Output
Script
Claims
Expense
Sanction
Address
Contact
(phone, fax, etc.)
Drug
Representative / Division
Expense Manager™
Provider Verification™
MarketView™
Customer
Feed(s)
Customer
Master
Provider MasterFile™
Credentials
“Agile MDM”
1 billion claims
per year
Organization
Practitioner
Referrals
Sounds easy
Except...
Incomplete Capture
No foreign keys
Differing schemas
Changing schemas
Conflicting information
Ad-hoc Analysis (is hard)
Point-In-Time Retrieval
Golden Record
Master Data Management
Harvested
Government
Private
f_address ∈ F @ t0
f_license ∈ F @ t5
f_sanction ∈ F @ t1    f_sanction ∈ F @ t4
Schema Change!
Why?
?’s
Our MDM Pipeline
- Data Stewardship
- Data Scientists
- Business Analysts
Ingestion
- Semantic Tagging
- Standardization
- Data Mapping
Incorporation
- Consolidation
- Enumeration
- Association
Insight
- Search
- Reports
- Analytics
Feeds
(multiple formats,
changing over time)
API / FTP Web Interface
Dimensions / Logic / Rules
Our first “Pipeline”
+
Sweet!
Dirt Simple
Lightning Fast
Highly Available
Scalable
Multi-Datacenter (DR)
Not Sweet.
How do we query the data?
NoSQL Indexes?
Do such things exist?
Rev. 1 – Wide Rows!
AOP
Triggers!
Data model to support your queries.
Row key:  ONC : PA : 19460
Columns:  9      7      32     74     99     12     42
Values:   $3.50  $7.00  $8.75  $1.00  $4.20  $3.17  $8.88
D’Oh! What about ad hoc?
Transformation
Rev 2 – Elastic Search!
AOP
Triggers!
D’Oh!
What if ES fails?
What about schema / type information?
Rev 3 - Apache Storm!
Anatomy of a Storm Cluster
• Nimbus
– Master Node
• Zookeeper
– Cluster Coordination
• Supervisors
– Worker Nodes
Storm Primitives
• Streams
– Unbounded sequence of tuples
• Spouts
– Stream Sources
• Bolts
– Unit of Computation
• Topologies
– Combination of n Spouts and m Bolts
– Defines the overall “Computation”
Storm Spouts
• Represents a source (stream) of data
– Queues (JMS, Kafka, Kestrel, etc.)
– Twitter Firehose
– Sensor Data
• Emits “Tuples” (Events) based on source
– Primary Storm data structure
– Set of Key-Value pairs
Storm Bolts
• Receive Tuples from Spouts or other Bolts
• Operate on, or React to Data
– Functions/Filters/Joins/Aggregations
– Database writes/lookups
• Optionally emit additional Tuples
Storm Topologies
• Data flow between spouts and bolts
• Routing of Tuples between spouts/bolts
– Stream “Groupings”
• Parallelism of Components
• Long-Lived
Storm Topologies
Persistent Word Count
http://github.com/hmsonline/storm-cassandra
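For readers without the repo handy, here is a minimal in-memory sketch of a word-count topology (not the Cassandra-backed version from the link above; it uses the pre-Apache backtype.storm package names and Storm's bundled TestWordSpout as the source):

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class WordCountTopology {

    // Bolt: keeps a running count per word and emits (word, count) downstream.
    public static class WordCountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<String, Long>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            long count = counts.containsKey(word) ? counts.get(word) + 1 : 1L;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout emits a single "word" field; the fields grouping routes every
        // occurrence of a word to the same counter task.
        builder.setSpout("words", new TestWordSpout(), 2);
        builder.setBolt("counter", new WordCountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}

The fieldsGrouping on "word" is the important part: it is what keeps each task's in-memory count consistent. The persistent variant in storm-cassandra replaces the HashMap with writes to Cassandra.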
CODE INTERLUDE
NEXT LEVEL : TRIDENT
Trident
• Part of Storm
• Provides a higher-level abstraction for stream
processing
– Constructs for state management and batching
• Adds additional primitives that abstract away
common topological patterns
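As a hedged illustration of that higher-level API (essentially the canonical Trident word-count example, not HMS production code):

import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class TridentWordCount {

    // Function: splits a sentence tuple into one tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology buildTopology() {
        // In-memory test spout that replays fixed batches of sentences.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("the man went to the store and bought some candy"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(),
                                     new Count(),
                                     new Fields("count"));
        return topology.build();
    }
}

persistentAggregate is where the state management below comes in: Trident sequences the writes by batch, and the spout/state combination (transactional vs. opaque) determines how replays are handled.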
Trident State
Sequences writes by batch
• Spouts
– Transactional
• Batch contents never change
– Opaque
• Batch contents can change
• State
– Transactional
• Store batch number with counts to maintain sequencing of
writes
– Opaque
• Store previous value in order to overwrite the current value
when contents of a batch change
State Management

Transactional
Last Batch   Value
15           1000
      (+59)
16           1059
replay == incorporated already?
(because batch composition is the same)

Opaque
Last Batch   Previous   Current
15           980        1000
      (+59)
16           1000       1059

replay == re-incorporate
(batch composition can change; it is not guaranteed to be the same)
Last Batch   Previous   Current
15           980        1000
      (+72)
16           1000       1072
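A minimal sketch of the opaque bookkeeping behind that table (illustrative field names, not Storm's own opaque-value API): keeping the previous value next to the current one means a replayed batch, whose contents may have changed, can be re-applied from a known-good baseline instead of being added on top of itself.

public class OpaqueCount {
    long lastBatch;   // id of the last batch folded into this value
    long previous;    // value *before* that batch was applied
    long current;     // value *after* that batch was applied

    public OpaqueCount(long lastBatch, long previous, long current) {
        this.lastBatch = lastBatch;
        this.previous = previous;
        this.current = current;
    }

    public void apply(long batchId, long batchDelta) {
        if (batchId == lastBatch) {
            // Replay: the batch contents may differ, so rebuild from "previous".
            current = previous + batchDelta;
        } else {
            // New batch: roll the window forward.
            previous = current;
            current = current + batchDelta;
            lastBatch = batchId;
        }
    }
}

Starting from (lastBatch=15, previous=980, current=1000), apply(16, 59) gives 1059; a replay of batch 16 that now sums to 72 gives 1072, exactly the progression shown above.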
BACK TO OUR REGULARLY
SCHEDULED TALK
Polyglot Persistence
“The Right Tool for the Job”
Oracle is a registered trademark
of Oracle Corporation and/or its
affiliates. Other names may be
trademarks of their respective
owners.
Back to the Pipeline
Kafka / DW
Storm
C* ES Titan SQL
MDM Topology*
*Notional
Design Principles
• What we got:
– At-least-once processing
– Simple data flows
• What we needed to account for:
– Replays
Idempotent Operations!
Immutable Data!
Cassandra State (v0.4.0)
git@github.com:hmsonline/storm-cassandra.git
{tuple} → <mapper> → (ks, cf, row, k:v[])
Storm Cassandra
Trident Elastic Search (v0.3.1)
git@github.com:hmsonline/trident-elasticsearch.git
{tuple} → <mapper> → (idx, docid, k:v[])
Storm Elastic Search
Storm Graph (v0.1.2)
Coming soon to...
git@github.com:hmsonline/storm-graph.git
for (tuple : batch)
<processor> (graph, tuple)
Storm JDBI (v0.1.14)
INTERNAL ONLY (so far)
Worth releasing?
{tuple} → <mapper> → (JDBC Statement)
All good!
But...
What was the average amount for a
medical claim associated with procedure
X by zip code over the last five years?
Hadoop (<2)? Batch?
Yuck. ‘Nuff Said.
http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186
Let’s Pre-Compute It!
stream
    .groupBy(new Fields("ICD9", "zip"))
    .aggregate(new Fields("amount"),
               new Average(),
               new Fields("averageAmount"))
D’Oh!
GroupBy’s.
They set data in motion!
Lesson Learned
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
If possible, avoid
re-partitioning
operations!
(e.g. LOG.error!)
Why so hard?
D’Oh!
19 != 9
What we don’t want:
LOCKS!
What’s the alternative?
CONSENSUS!
Cassandra 2.0!
http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
http://www.cs.cornell.edu/courses/CS6452/2012sp/papers/paxos-complex.pdf
Conditional Updates
“The alert reader will notice here that
Paxos gives us the ability to agree on
exactly one proposal. After one has been
accepted, it will be returned to future
leaders in the promise, and the new leader
will have to re-propose it again.”
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
UPDATE value=9 WHERE word='fox' IF value=6
Love CQL
Conditional Updates
+
Batch Statements
+
Collections
=
BADASS DATA MODELS
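Concretely, the conditional update above can be issued through the DataStax Java driver. This is a hedged sketch against a hypothetical words table (word text PRIMARY KEY, value int); the compare-and-set outcome comes back in the "[applied]" column:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class ConditionalUpdateExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("wordcount");

        // Only applied if the current value is still 6 (a Paxos round under the hood).
        ResultSet rs = session.execute(
                "UPDATE words SET value = 9 WHERE word = 'fox' IF value = 6");

        // Cassandra reports whether the compare-and-set won.
        boolean applied = rs.one().getBool("[applied]");
        System.out.println(applied ? "update applied" : "lost the race; re-read and retry");

        cluster.close();
    }
}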
Announcing : Storm Cassandra CQL!
git@github.com:hmsonline/storm-cassandra-cql.git
{tuple} → <mapper> → (CQL Statement)
Trident Batching =? CQL Batching
CassandraCqlState
// Called once per Trident batch: flush everything accumulated for the batch as one logged batch.
public void commit(Long txid) {
    BatchStatement batch = new BatchStatement(Type.LOGGED);
    batch.addAll(this.statements);
    clientFactory.getSession().execute(batch);
}

// Statements are buffered here by the updater until commit() runs.
public void addStatement(Statement statement) {
    this.statements.add(statement);
}

// Pass-through for reads (and any statement that should not wait for the batch).
public ResultSet execute(Statement statement) {
    return clientFactory.getSession().execute(statement);
}
CassandraCqlStateUpdater
// Maps each tuple in the Trident batch to a CQL statement and buffers it on the state;
// the statements execute together when the state's commit(txid) fires.
public void updateState(CassandraCqlState state,
                        List<TridentTuple> tuples,
                        TridentCollector collector) {
    for (TridentTuple tuple : tuples) {
        Statement statement = this.mapper.map(tuple);
        state.addStatement(statement);
    }
}
ExampleMapper
public Statement map(List<String> keys, Number value) {
Insert statement =
QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
statement.value(KEY_NAME, keys.get(0));
statement.value(VALUE_NAME, value);
return statement;
}
public Statement retrieve(List<String> keys) {
Select statement = QueryBuilder.select()
.column(KEY_NAME).column(VALUE_NAME)
.from(KEYSPACE_NAME, TABLE_NAME)
.where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
return statement;
}
Incremental State!
• Collapse aggregation into the state object.
– This allows the state object to aggregate with current state
in a loop until success.
• Uses Trident Batching to perform in-memory
aggregation for the batch.
for (tuple : batch)
    state.aggregate(tuple)              // in-memory aggregation for the whole batch

do {
    persisted_state = read(key)
    merged_state = aggregate(in_memory_state, persisted_state)
    success = conditionally_update(merged_state, persisted_state)
} while (!success)                      // lost the race: re-read, re-merge, retry
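A hedged Java rendering of that loop, against a hypothetical counts table (key text PRIMARY KEY, value bigint); this is a sketch of the idea, not the storm-cassandra-cql implementation itself:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class IncrementalAggregator {
    private final Session session;

    public IncrementalAggregator(Session session) {
        this.session = session;
    }

    // Adds "delta" to the persisted value for "key", retrying until the
    // conditional update wins.
    public long increment(String key, long delta) {
        while (true) {
            Row row = session.execute("SELECT value FROM counts WHERE key = ?", key).one();
            ResultSet rs;
            long next;
            if (row == null) {
                // No row yet: the insert itself is the conditional write.
                next = delta;
                rs = session.execute(
                        "INSERT INTO counts (key, value) VALUES (?, ?) IF NOT EXISTS", key, next);
            } else {
                long current = row.getLong("value");
                next = current + delta;
                // Apply only if nobody has updated the row since we read it.
                rs = session.execute(
                        "UPDATE counts SET value = ? WHERE key = ? IF value = ?", next, key, current);
            }
            if (rs.one().getBool("[applied]")) {
                return next;
            }
            // Lost the race: loop, re-read the persisted state, and merge again.
        }
    }
}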
In-Memory Aggregation by Key!

Partition 1              Partition 2
Key     Value            Key     Value
fox     6                fox     3
brown   3                lazy    72
                                          → C*
No More GroupBy!
To protect against replays
Use partition + batch identifier(s) in
your conditional update!
“BatchId + partitionIndex consistently represents the
same data as long as:
1. Any repartitioning you do is deterministic (so
partitionBy is, but shuffle is not)
2. You're using a spout that replays the exact same
batch each time (which is true of transactional spouts
but not of opaque transactional spouts)”
- Nathan Marz
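One way to act on that advice, sketched against a hypothetical table counts(key text, partition_idx int, value bigint, last_txid bigint, PRIMARY KEY (key, partition_idx)) and reusing the driver Session/Row types from the sketch above: keep one sub-count per key and partition, tag it with the last txid folded in, and skip the write when a replayed transactional batch arrives with the same txid.

public void applyPartition(Session session, String key, int partitionIndex, long txid, long delta) {
    Row row = session.execute(
            "SELECT value, last_txid FROM counts WHERE key = ? AND partition_idx = ?",
            key, partitionIndex).one();
    if (row == null) {
        session.execute(
                "INSERT INTO counts (key, partition_idx, value, last_txid) VALUES (?, ?, ?, ?) IF NOT EXISTS",
                key, partitionIndex, delta, txid);
    } else if (row.getLong("last_txid") != txid) {
        session.execute(
                "UPDATE counts SET value = ?, last_txid = ? WHERE key = ? AND partition_idx = ? IF last_txid = ?",
                row.getLong("value") + delta, txid, key, partitionIndex, row.getLong("last_txid"));
    }
    // else: this exact batch/partition was already incorporated (a replay), so do nothing.
    // A failed conditional write should be retried, as in the loop above.
}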
The Lambda Architecture
http://architects.dzone.com/articles/nathan-marzs-lamda
Let’s Challenge This a Bit
because “additional tools and techniques” cost
money and time.
• Questions:
– Can we solve the problem with a single tool and a
single approach?
– Can we re-use logic across layers?
– Or better yet, can we collapse layers?
A Traditional Interpretation
Speed Layer
(Storm)
Batch Layer
(Hadoop)
Data
Stream
Serving Layer
HBase
Impala
D’Oh! Two pipelines!
Integrating Web Services
• We need a web service that receives an event
and provides:
– an immediate acknowledgement
– a high likelihood that the data is integrated very soon
– a guarantee that the data will be integrated eventually
• We need an architecture that provides for:
– Code / Logic and approach re-use
– Fault-Tolerance
Grand Finale
The Idea : Embedding State!
Kafka
DropWizard
C*
IncrementalCqlState
aggregate(tuple)
“Batch” Layer
(Storm)
Client
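A hedged sketch of the DropWizard side of that diagram (class names and the aggregate signature are illustrative; the real state object is the storm-cassandra-cql work above): the resource folds the event into the shared state immediately, queues it on Kafka so Storm can replay it, and only then acknowledges the client.

import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;

@Path("/events")
public class EventResource {

    // Assumed minimal surface of the state object named on the slide.
    public interface IncrementalCqlState {
        void aggregate(String eventJson);
    }

    private final Producer<String, String> producer;    // Kafka 0.8-style producer
    private final IncrementalCqlState state;             // same aggregation code Storm uses

    public EventResource(Producer<String, String> producer, IncrementalCqlState state) {
        this.producer = producer;
        this.state = state;
    }

    @POST
    public Response ingest(String eventJson) {
        state.aggregate(eventJson);                                            // fold into in-memory state now
        producer.send(new KeyedMessage<String, String>("events", eventJson));  // durable replay path for Storm
        return Response.ok().build();                                          // immediate ACK to the client
    }
}

If this host dies after the ACK, the event is still on Kafka, so the Storm topology fills the gap; that is the "safety net" described in the wins below.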
The Sequence of Events
The Wins
• Reuse Aggregations and State Code!
• To re-compute (or backfill) a dimension,
simply re-queue!
• Storm is the “safety” net
– If a DW host fails during aggregation, Storm will fill
in the gaps for all ACK’d events.
• Is there an opportunity to reuse more?
– BatchingStrategy & PartitionStrategy?
In the end, all good. =)
Plug
The Book
Shout out:
Taylor Goetz
Thanks
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
Editor's Notes
  • #20: Tuple: set of key-value pairs (values can be serialized objects)
  • #45: (sequence diagram source)
        title Distributed Counting
        participant A
        participant B
        participant Storage
        note over Storage: {"fox" : 6}
        note over A: count("fox", batch) = 3
        A->Storage: read("fox")
        note over B: count("fox", batch) = 10
        Storage->A: 6
        B->Storage: read("fox")
        Storage->B: 6
        note over A: add(6, 3) = 9
        note over B: add(6, 10) = 16
        B->Storage: write(16)
        A->Storage: write(9)
        note over Storage: {"fox" : 9}   (the correct total would be 19)
  • #62: (sequence diagram source)
        title Distributed Counting
        participant Client
        participant DropWizard
        participant Kafka
        participant State(1)
        participant C*
        participant Storm
        participant State(2)
        Client->DropWizard: POST(event)
        DropWizard->State(1): aggregate(new Tuple(event))
        DropWizard->Kafka: queue(event)
        DropWizard->Client: 200 (ACK)
        note over State(1): duration (30 sec.)
        State(1)->C*: state, events = read(key)
        note over State(1): state = aggregate(state, in_memory_state); events = join(events, in_memory_events)
        State(1)->C*: write(state, events)
        Kafka->Storm: dequeue(event)
        Storm->State(2): persisted_state, events = read(key)
        note over State(2): if (!contains?(event)) ...
        State(2)->C*: if !contains(ids) write(state)