Data Pipelines :
Improving on the Lambda
Architecture
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
Talk Breakdown
29%
20%
31%
20%
Topics
(1) Motivation
(2) Polyglot Persistence
(3) Analytics
(4) Lambda Architecture
Health Market Science - Then
What we were.
Health Market Science - Now
Intersecting Big Data
w/ Healthcare
We’re fixing healthcare!
Data Pipelines
I/O
The Input
From government,
state boards, etc.
From the internet,
social data,
networks / graphs
From third-parties,
medical claims
From customers,
expenses,
sales data,
beneficiary information,
quality scores
Data
Pipeline
The Output
Script
Claims
Expense
Sanction
Address
Contact
(phone, fax, etc.)
Drug
Representative
Division
Expense Manager™
Provider Verification™
MarketView™
Customer
Feed(s)
Customer
Master
Provider MasterFile™
Credentials
“Agile MDM”
1 billion claims
per year
Organization
Practitioner
Referrals
Sounds easy
Except...
Incomplete Capture
No foreign keys
Differing schemas
Changing schemas
Conflicting information
Ad-hoc Analysis (is hard)
Point-In-Time Retrieval
Golden Record
Master Data Management
Harvested
Government
Private
f_address ∈ F@t0
f_license ∈ F@t5
f_sanction ∈ F@t1, f_sanction ∈ F@t4
Schema Change!
Why?
?’s
Our MDM Pipeline
- Data Stewardship
- Data Scientists
- Business Analysts
Ingestion
- Semantic Tagging
- Standardization
- Data Mapping
Incorporation
- Consolidation
- Enumeration
- Association
Insight
- Search
- Reports
- Analytics
Feeds
(multiple
formats, changing
over time)
API / FTP Web Interface
Dimensions
Logic
Rules
Our first “Pipeline”
+
Sweet!
Dirt Simple
Lightning Fast
Highly Available
Scalable
Multi-Datacenter (DR)
Not Sweet.
How do we query the data?
NoSQL Indexes?
Do such things exist?
Rev. 1 – Wide Rows!
AOP
Triggers!
Data model to
support your
queries.
9 7 32 74 99 12 42
$3.50 $7.00 $8.75 $1.00 $4.20 $3.17 $8.88
ONC : PA : 19460
D’Oh! What about ad hoc?
Transformation
Rev 2 – Elastic Search!
AOP
Triggers!
D’Oh!
What if ES fails?
What about schema / type information?
Rev 3 - Apache Storm!
Polyglot Persistence
“The Right Tool for the Job”
Oracle is a registered trademark
of Oracle Corporation and/or its
affiliates. Other names may be
trademarks of their respective
owners.
Back to the Pipeline
Kafka
DW
Storm
C* ES Titan SQL
Design Principles
• What we got:
– At-least-once processing
– Simple data flows
• What we needed to account for:
– Replays
Idempotent Operations!
Immutable Data!
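Why replays push toward idempotent operations and immutable data can be shown with a minimal sketch. It is plain Java rather than Storm code, and a HashMap stands in for the data store. The class and method names (ReplayDemo, incrementTotal, recordEvent) are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a HashMap stands in for the data store.
final class ReplayDemo {

    // Non-idempotent: each delivery of the same event adds to the total again.
    static int incrementTotal(Map<String, Integer> store, String key, int amount) {
        return store.merge(key, amount, Integer::sum);
    }

    // Idempotent: the write is keyed by the event's own id, so a replay
    // overwrites the same cell with the same value.
    static void recordEvent(Map<String, Integer> store, String eventId, int amount) {
        store.put(eventId, amount);
    }

    // Totals are derived from the immutable per-event records at read time.
    static int total(Map<String, Integer> store) {
        int sum = 0;
        for (int amount : store.values()) {
            sum += amount;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> counters = new HashMap<>();
        incrementTotal(counters, "claims", 350);
        incrementTotal(counters, "claims", 350); // replay of the same event
        System.out.println(counters.get("claims")); // prints 700: double counted

        Map<String, Integer> events = new HashMap<>();
        recordEvent(events, "event-1", 350);
        recordEvent(events, "event-1", 350); // replay of the same event
        System.out.println(total(events)); // prints 350: the replay is harmless
    }
}
```

With at-least-once processing the second call in each pair is what a replay looks like. Only the write keyed by event id survives it unchanged.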
Cassandra State (v0.4.0)
git@github.com:hmsonline/storm-cassandra.git
{tuple} → <mapper> → (ks, cf, row, k:v[])
Storm Cassandra
Trident Elastic Search (v0.3.1)
git@github.com:hmsonline/trident-elasticsearch.git
{tuple} → <mapper> → (idx, docid, k:v[])
Storm Elastic Search
Storm Graph (v0.1.2)
Coming soon to...
git@github.com:hmsonline/storm-graph.git
for (tuple : batch)
<processor> (graph, tuple)
Storm JDBI (v0.1.14)
INTERNAL ONLY (so far)
Worth releasing?
{tuple} → <mapper> → (JDBC Statement)
All good!
But...
What was the average amount for a
medical claim associated with procedure
X by zip code over the last five years?
Hadoop (<2)? Batch?
Yuck. ‘Nuff Said.
http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186
Let’s Pre-Compute It!
stream
    .groupBy(new Fields("ICD9"))
    .groupBy(new Fields("zip"))
    .aggregate(new Fields("amount"),
               new Average())
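What "pre-computing" an average per procedure and zip code involves can be sketched in plain Java. This is not Trident code. It assumes the average is kept as a running sum and count per composite key, because two averages cannot be merged directly while two (sum, count) pairs can. RunningAverageSketch and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: running average per (procedure, zip) key.
final class RunningAverageSketch {

    // An average is stored as sum and count so later amounts can be folded in.
    private static final class SumCount {
        double sum;
        long count;
    }

    private final Map<String, SumCount> byKey = new HashMap<>();

    // Fold one claim amount into the aggregate for its procedure and zip code.
    void add(String procedure, String zip, double amount) {
        SumCount sc = byKey.computeIfAbsent(procedure + "|" + zip, k -> new SumCount());
        sc.sum += amount;
        sc.count++;
    }

    // Returns NaN when no claim has been seen for the key.
    double average(String procedure, String zip) {
        SumCount sc = byKey.get(procedure + "|" + zip);
        return sc == null ? Double.NaN : sc.sum / sc.count;
    }

    public static void main(String[] args) {
        RunningAverageSketch averages = new RunningAverageSketch();
        averages.add("X", "19460", 3.50);
        averages.add("X", "19460", 7.00);
        System.out.println(averages.average("X", "19460")); // prints 5.25
    }
}
```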
D’Oh!
GroupBy’s.
They set data in motion!
Lesson Learned
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
If possible, avoid
re-partitioning
operations!
(e.g. LOG.error!)
Why so hard?
D’Oh!
19 != 9
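The mismatch is a lost update. A minimal sketch replays the interleaving in a single thread so the outcome is deterministic. It assumes a stored count of 6 for "fox" and two writers, A and B, holding batch counts of 3 and 10. The class name LostUpdateDemo is hypothetical and a HashMap stands in for storage.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of two uncoordinated read-modify-write cycles.
final class LostUpdateDemo {

    static int interleavedReadModifyWrite() {
        Map<String, Integer> storage = new HashMap<>();
        storage.put("fox", 6);

        int countA = 3;   // A counted 3 in its batch
        int countB = 10;  // B counted 10 in its batch

        int readByA = storage.get("fox"); // A reads 6
        int readByB = storage.get("fox"); // B also reads 6, before A has written

        storage.put("fox", readByB + countB); // B writes 16
        storage.put("fox", readByA + countA); // A writes 9 and overwrites B's update

        return storage.get("fox");
    }

    public static void main(String[] args) {
        int expected = 6 + 3 + 10;
        int actual = interleavedReadModifyWrite();
        System.out.println(expected + " != " + actual); // prints 19 != 9
    }
}
```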
What we don’t want:
LOCKS!
What’s the alternative?
CONSENSUS!
Cassandra 2.0!
http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
http://www.cs.cornell.edu/courses/CS6452/2012sp/papers/paxos-complex.pdf
Conditional Updates
“The alert reader will notice here that Paxos gives us the
ability to agree on exactly one proposal. After one has been
accepted, it will be returned to future leaders in the
promise, and the new leader will have to re-propose it
again.”
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
UPDATE <table> SET value = 9 WHERE word = 'fox' IF value = 6
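The semantics of a conditional update can be approximated in-process with a compare-and-set. The sketch below is an analogy, not Cassandra's API: ConcurrentMap.replace(key, expected, newValue) from the JDK plays the role of the IF clause, and ConditionalUpdateDemo and updateIf are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-process analogue of a conditional (compare-and-set) update.
final class ConditionalUpdateDemo {

    // Writes newValue only if the current value still equals expected.
    // Returns whether the update was applied.
    static boolean updateIf(ConcurrentMap<String, Integer> table,
                            String key, int expected, int newValue) {
        return table.replace(key, expected, newValue);
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Integer> table = new ConcurrentHashMap<>();
        table.put("fox", 6);

        System.out.println(updateIf(table, "fox", 6, 9));  // prints true: value was 6
        System.out.println(updateIf(table, "fox", 6, 16)); // prints false: value is now 9
        System.out.println(table.get("fox"));              // prints 9
    }
}
```

The second writer learns that its read was stale instead of silently overwriting the first writer. It can then re-read and try again.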
Love CQL
Conditional Updates
+
Batch Statements
+
Collections
=
BADASS DATA MODELS
Announcing : Storm Cassandra CQL!
git@github.com:hmsonline/storm-cassandra-cql.git
{tuple} → <mapper> → (CQL Statement)
Trident Batching =? CQL Batching
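CassandraCqlState
// Flush everything queued for this Trident batch as one logged CQL batch.
public void commit(Long txid) {
    BatchStatement batch = new BatchStatement(Type.LOGGED);
    batch.addAll(this.statements);
    clientFactory.getSession().execute(batch);
}

// Queue a statement; nothing is sent until commit().
public void addStatement(Statement statement) {
    this.statements.add(statement);
}

// Execute a single statement immediately, outside the batch.
public ResultSet execute(Statement statement) {
    return clientFactory.getSession().execute(statement);
}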
CassandraCqlStateUpdater
// Map each tuple in the Trident batch to a CQL statement and queue it on the state.
public void updateState(CassandraCqlState state,
                        List<TridentTuple> tuples,
                        TridentCollector collector) {
    for (TridentTuple tuple : tuples) {
        Statement statement = this.mapper.map(tuple);
        state.addStatement(statement);
    }
}
ExampleMapper
public Statement map(List<String> keys, Number value) {
    Insert statement =
        QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
    statement.value(KEY_NAME, keys.get(0));
    statement.value(VALUE_NAME, value);
    return statement;
}

public Statement retrieve(List<String> keys) {
    Select statement = QueryBuilder.select()
        .column(KEY_NAME).column(VALUE_NAME)
        .from(KEYSPACE_NAME, TABLE_NAME)
        .where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
    return statement;
}
Incremental State!
• Collapse aggregation into the state object.
– This allows the state object to aggregate with current state
in a loop until success.
• Uses Trident Batching to perform in-memory
aggregation for the batch.
for (tuple : batch)
    state.aggregate(tuple)    // in-memory aggregation for the batch
do {
    persisted_state = read(state)
    new_state = aggregate(in_memory_state, persisted_state)
    failed = !conditionally_update(persisted_state, new_state)
} while (failed)
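The loop above can be sketched in runnable form. This is a minimal illustration under stated assumptions, not the storm-cassandra-cql implementation: a ConcurrentHashMap stands in for C*, its replace and putIfAbsent calls stand in for conditional updates, and IncrementalStateSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: aggregate in memory per batch, then merge with the
// persisted value in a read / aggregate / conditional-update loop.
final class IncrementalStateSketch {

    private final ConcurrentMap<String, Integer> store; // stand-in for C*
    private final Map<String, Integer> inMemory = new HashMap<>();

    IncrementalStateSketch(ConcurrentMap<String, Integer> store) {
        this.store = store;
    }

    // Called once per tuple in the batch; touches only local memory.
    void aggregate(String key, int amount) {
        inMemory.merge(key, amount, Integer::sum);
    }

    // Called once per batch; retries each key until its conditional update succeeds.
    void commit() {
        for (Map.Entry<String, Integer> entry : inMemory.entrySet()) {
            boolean applied = false;
            while (!applied) {
                Integer persisted = store.get(entry.getKey());
                if (persisted == null) {
                    // No row yet: insert only if nobody else has inserted first.
                    applied = store.putIfAbsent(entry.getKey(), entry.getValue()) == null;
                } else {
                    // Update only if the row still holds the value we read.
                    applied = store.replace(entry.getKey(), persisted,
                                            persisted + entry.getValue());
                }
            }
        }
        inMemory.clear();
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Integer> store = new ConcurrentHashMap<>();
        store.put("fox", 6);

        IncrementalStateSketch a = new IncrementalStateSketch(store);
        IncrementalStateSketch b = new IncrementalStateSketch(store);
        a.aggregate("fox", 1);
        a.aggregate("fox", 2);
        b.aggregate("fox", 10);
        b.commit();
        a.commit();

        System.out.println(store.get("fox")); // prints 19
    }
}
```

Because each writer only ever adds its own in-memory total to whatever it last read, a failed conditional update costs a re-read, never a lost count.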
In-Memory Aggregation by Key!
Partition 1
Key    Value
fox    6
brown  3
Partition 2
Key    Value
fox    3
lazy   72
C*
No More GroupBy!
To protect against replays
Use partition + batch identifier(s) in
your conditional update!
“BatchId + partitionIndex consistently represents the
same data as long as:
1. Any repartitioning you do is deterministic (so
partitionBy is, but shuffle is not)
2. You're using a spout that replays the exact same
batch each time (which is true of transactional spouts
but not of opaque transactional spouts)”
- Nathan Marz
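One way to use the batch and partition identifiers is sketched below. It assumes the identifiers already applied are stored next to the aggregate and checked in the same conditional update. An AtomicReference stands in for a row updated with compare-and-set. The class ReplaySafeCounter and the "batchId:partitionIndex" id format are hypothetical, and a real table would also need some way to prune old identifiers.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a counter that ignores replays of a (batch, partition) pair.
final class ReplaySafeCounter {

    // Immutable snapshot: the total plus the ids already folded into it.
    private static final class Snapshot {
        final long total;
        final Set<String> appliedIds;

        Snapshot(long total, Set<String> appliedIds) {
            this.total = total;
            this.appliedIds = appliedIds;
        }
    }

    private final AtomicReference<Snapshot> row =
            new AtomicReference<>(new Snapshot(0L, Collections.<String>emptySet()));

    // Returns true if delta was applied, false if this batch/partition was seen before.
    boolean apply(long batchId, int partitionIndex, long delta) {
        String id = batchId + ":" + partitionIndex;
        while (true) {
            Snapshot current = row.get();
            if (current.appliedIds.contains(id)) {
                return false; // replay: already part of the total
            }
            Set<String> ids = new HashSet<>(current.appliedIds);
            ids.add(id);
            Snapshot next = new Snapshot(current.total + delta, ids);
            if (row.compareAndSet(current, next)) {
                return true;
            }
            // Another writer got in first: re-read and retry.
        }
    }

    long total() {
        return row.get().total;
    }

    public static void main(String[] args) {
        ReplaySafeCounter fox = new ReplaySafeCounter();
        fox.apply(7L, 0, 3);
        fox.apply(7L, 1, 10);
        fox.apply(7L, 0, 3); // replayed batch 7, partition 0
        System.out.println(fox.total()); // prints 13
    }
}
```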
The Lambda Architecture
http://architects.dzone.com/articles/nathan-marzs-lamda
Let’s Challenge This a Bit
because “additional tools and techniques” cost
money and time.
• Questions:
– Can we solve the problem with a single tool and a
single approach?
– Can we re-use logic across layers?
– Or better yet, can we collapse layers?
A Traditional Interpretation
Speed Layer
(Storm)
Batch Layer
(Hadoop)
Data
Stream
Serving Layer
HBase
Impala
D’Oh! Two pipelines!
Integrating Web Services
• We need a web service that receives an event
and provides,
– an immediate acknowledgement
– a high likelihood that the data is integrated very soon
– a guarantee that the data will be integrated eventually
• We need an architecture that provides for,
– Code / Logic and approach re-use
– Fault-Tolerance
Grand Finale
The Idea : Embedding State!
Kafka
DropWizard
C*
IncrementalCqlState
aggregate(tuple)
“Batch” Layer
(Storm)
Client
The Sequence of Events
The Wins
• Reuse Aggregations and State Code!
• To re-compute (or backfill) a
dimension, simply re-queue!
• Storm is the “safety” net
– If a DW host fails during aggregation, Storm will fill
in the gaps for all ACK’d events.
• Is there an opportunity to reuse more?
– BatchingStrategy & PartitionStrategy?
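The "safety net" idea can be sketched as one piece of state logic shared by both paths, with event ids deciding whether an event still needs to be integrated. This single-threaded sketch is an assumption-laden illustration and not the talk's code: an ArrayDeque stands in for Kafka, a HashSet stands in for the event ids persisted with the aggregate, and SafetyNetSketch and integrate are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: the web service and Storm reuse the same state logic.
final class SafetyNetSketch {

    private long total = 0;
    private final Set<String> integratedIds = new HashSet<>();

    // Shared by both paths: apply the event only if its id is not yet recorded.
    boolean integrate(String eventId, long amount) {
        if (!integratedIds.add(eventId)) {
            return false; // already integrated by the other path
        }
        total += amount;
        return true;
    }

    long getTotal() {
        return total;
    }

    public static void main(String[] args) {
        SafetyNetSketch state = new SafetyNetSketch();
        Queue<String[]> queue = new ArrayDeque<>(); // stand-in for Kafka

        // Every acknowledged event is queued, whatever happens afterwards.
        queue.add(new String[] {"e1", "5"});
        queue.add(new String[] {"e2", "7"});
        queue.add(new String[] {"e3", "11"});

        // Web service path: integrates e1 and e2, then the host fails before e3.
        state.integrate("e1", 5);
        state.integrate("e2", 7);

        // Storm path: dequeues everything and fills in only what is missing.
        for (String[] event : queue) {
            state.integrate(event[0], Long.parseLong(event[1]));
        }
        System.out.println(state.getTotal()); // prints 23
    }
}
```

Re-queuing events to backfill a dimension follows the same path. In that case the recorded ids would belong to the new dimension, so every event is applied once.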
In the end, all good. =)
Plug
The Book
Shout out:
Taylor Goetz
Thanks
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42

Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Editor's Notes

  • #31:
    title Distributed Counting
    participant A
    participant B
    participant Storage
    note over Storage
    {"fox" : 6}
    end note
    note over A
    count("fox", batch)=3
    end note
    A->Storage: read("fox")
    note over B
    count("fox", batch)=10
    end note
    Storage->A: 6
    B->Storage: read("fox")
    Storage->B: 6
    note over A
    add(6, 3) = 9
    end note
    note over B
    add(6, 10) = 16
    end note
    B->Storage: write(16)
    A->Storage: write(9)
    note over Storage
    {"fox":9}
    end note
  • #48:
    title Distributed Counting
    participant Client
    participant DropWizard
    participant Kafka
    participant State(1)
    participant C*
    participant Storm
    participant State(2)
    Client->DropWizard: POST(event)
    DropWizard->State(1): aggregate(new Tuple(event))
    DropWizard->Kafka: queue(event)
    DropWizard->Client: 200(ACK)
    note over State(1)
    duration (30 sec.)
    end note
    State(1)->C*: state, events = read(key)
    note over State(1)
    state = aggregate (state, in_memory_state)
    events = join (events, in_memory_events)
    end note
    State(1)->C*: write(state, events)
    Kafka->Storm: dequeue(event)
    Storm->State(2): persisted_state, events = read(key)
    note over State(2)
    if (!contains?(event)) ...
    end note
    State(2)->C*: if !contains(ids) write(state)