Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Data Pipelines :
Improving on the Lambda
Architecture
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42

Talk Breakdown
29%
20%
31%
20%
Topics
(1) Motivation
(2) Polyglot Persistence
(3) Analytics
(4) Lambda Architecture

Health Market Science - Then
What we were.

Health Market Science - Now
Intersecting Big Data
w/ Healthcare
We’re fixing healthcare!

The Input
From government,
state boards, etc.
From the internet,
social data,
networks / graphs
From third-parties,
medical claims
From customers,
expenses,
sales data,
beneficiary information,
quality scores
Data
Pipeline

The Output
Script
Claims
Expense
Sanction
Address
Contact
(phone, fax, etc.)
Drug
Representative
Division
Expense Manager™
Provider Verification™
MarketView™
Customer
Feed(s)
Customer
Master
Provider MasterFile™
Credentials
“Agile MDM”
1 billion claims
per year
Organization
Practitioner
Referrals

Sounds easy
Except...
Incomplete Capture
No foreign keys
Differing schemas
Changing schemas
Conflicting information
Ad-hoc Analysis (is hard)
Point-In-Time Retrieval

Golden Record
Master Data Management
Harvested
Government
Private
f_address ∈ F@t0
f_license ∈ F@t5
f_sanction ∈ F@t1
f_sanction ∈ F@t4
Schema Change!

Our MDM Pipeline
- Data Stewardship
- Data Scientists
- Business Analysts
Ingestion
- Semantic Tagging
- Standardization
- Data Mapping
Incorporation
- Consolidation
- Enumeration
- Association
Insight
- Search
- Reports
- Analytics
Feeds
(multiple
formats, changing
over time)
API / FTP Web Interface
Dimensions
Logic
Rules

Sweet!
Dirt Simple
Lightning Fast
Highly Available
Scalable
Multi-Datacenter (DR)

Not Sweet.
How do we query the data?
NoSQL Indexes?
Do such things exist?

Rev. 1 – Wide Rows!
AOP
Triggers!
Data model to support your queries.
9 7 32 74 99 12 42
$3.50 $7.00 $8.75 $1.00 $4.20 $3.17 $8.88
ONC : PA : 19460
D’Oh! What about ad hoc?

Transformation
Rev 2 – Elastic Search!
AOP
Triggers!
D’Oh!
What if ES fails?
What about schema / type information?

Polyglot Persistence
“The Right Tool for the Job”
Oracle is a registered trademark
of Oracle Corporation and/or its
affiliates. Other names may be
trademarks of their respective
owners.

Back to the Pipeline
Kafka
DW
Storm
C* ES Titan SQL

Design Principles
• What we got:
– At-least-once processing
– Simple data flows
• What we needed to account for:
– Replays
Idempotent Operations!
Immutable Data!
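A minimal sketch of why replays force idempotent operations, using plain Java maps in place of a real data store. The class and method names here (`ReplayDemo`, `increment`, `recordEvent`) are hypothetical illustrations, not part of Storm or Cassandra. With at-least-once processing the same event can arrive twice, so an increment double-counts while a write keyed by an immutable event id does not.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; maps stand in for a real store.
class ReplayDemo {
    // Non-idempotent: every delivery of the same event changes the result.
    static void increment(Map<String, Integer> counts, String key, int amount) {
        counts.merge(key, amount, Integer::sum);
    }

    // Idempotent: the write is keyed by an immutable event id, so a replay
    // overwrites the same entry with the same value.
    static void recordEvent(Map<String, Integer> events, String eventId, int amount) {
        events.put(eventId, amount);
    }

    // Totals are derived from the immutable events when needed.
    static int total(Map<String, Integer> events) {
        int sum = 0;
        for (int amount : events.values()) {
            sum += amount;
        }
        return sum;
    }
}
```

Delivering the same event twice through `increment` doubles its contribution; delivering it twice through `recordEvent` leaves the total unchanged.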

Cassandra State (v0.4.0)
git@github.com:hmsonline/storm-cassandra.git
{tuple} → <mapper> → (ks, cf, row, k:v[])
Storm Cassandra

Trident Elastic Search (v0.3.1)
git@github.com:hmsonline/trident-elasticsearch.git
{tuple} → <mapper> → (idx, docid, k:v[])
Storm Elastic Search

Storm Graph (v0.1.2)
Coming soon to...
git@github.com:hmsonline/storm-graph.git
for (tuple : batch)
<processor> (graph, tuple)

Storm JDBI (v0.1.14)
INTERNAL ONLY (so far)
Worth releasing?
{tuple} → <mapper> → (JDBC Statement)

But...
What was the average amount for a
medical claim associated with procedure
X by zip code over the last five years?

Hadoop (<2)? Batch?
Yuck. ‘Nuff Said.
http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186

Let’s Pre-Compute It!
stream
    .groupBy(new Fields("ICD9"))
    .groupBy(new Fields("zip"))
    .aggregate(new Fields("amount"),
               new Average())
D’Oh!
GroupBy’s.
They set data in motion!
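The pre-computation itself is simple when it runs in one place. The sketch below, in plain Java with hypothetical names (`ClaimAverages`, `add`, `average`), keeps a running sum and count per (ICD9, zip) pair so the average can be read at any time. It assumes a single process; the cost the slide points at comes from moving tuples between machines so that each key lands on one of them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-process sketch of the pre-computed average.
class ClaimAverages {
    // Sum and count are stored rather than the average itself, so new
    // claims can be folded in without revisiting old ones.
    private final Map<String, double[]> sumAndCount = new HashMap<>();

    void add(String icd9, String zip, double amount) {
        double[] cell = sumAndCount.computeIfAbsent(icd9 + "|" + zip, k -> new double[2]);
        cell[0] += amount; // running sum
        cell[1] += 1;      // running count
    }

    // Returns NaN when no claim has been seen for the pair.
    double average(String icd9, String zip) {
        double[] cell = sumAndCount.get(icd9 + "|" + zip);
        return cell == null ? Double.NaN : cell[0] / cell[1];
    }
}
```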

Lesson Learned
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
If possible, avoid
re-partitioning
operations!
(e.g. LOG.error!)

Why so hard?
D’Oh!
19 != 9
What we don’t want:
LOCKS!
What’s the alternative?
CONSENSUS!

Cassandra 2.0!
http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
http://www.cs.cornell.edu/courses/CS6452/2012sp/papers/paxos-complex.pdf

Conditional Updates
“The alert reader will notice here that Paxos gives us the
ability to agree on exactly one proposal. After one has been
accepted, it will be returned to future leaders in the
promise, and the new leader will have to re-propose it
again.”
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
UPDATE ... SET value = 9 WHERE word = 'fox' IF value = 6
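The semantics of a conditional update can be sketched as a compare-and-set. The example below uses `ConcurrentHashMap.replace(key, oldValue, newValue)` from the JDK as a stand-in for Cassandra; the class name `ConditionalUpdateDemo` and its methods are hypothetical. The write applies only if the stored value still matches what the writer last read, and the caller learns whether it applied.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a table with conditional updates.
class ConditionalUpdateDemo {
    private final ConcurrentHashMap<String, Integer> table = new ConcurrentHashMap<>();

    void put(String word, int value) {
        table.put(word, value);
    }

    Integer get(String word) {
        return table.get(word);
    }

    // Applies the write only if the stored value still equals `expected`,
    // mirroring "UPDATE ... IF value = expected". Returns whether it applied.
    boolean updateIf(String word, int expected, int newValue) {
        return table.replace(word, expected, newValue);
    }
}
```

A writer that read 6 and tries to write 9 succeeds once; a second writer still holding the stale 6 is rejected and has to re-read before trying again.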

Love CQL
Conditional Updates
+
Batch Statements
+
Collections
=
BADASS DATA MODELS

Announcing : Storm Cassandra CQL!
git@github.com:hmsonline/storm-cassandra-cql.git
{tuple} → <mapper> → (CQL Statement)
Trident Batching =? CQL Batching

CassandraCqlState
public void commit(Long txid) {
    BatchStatement batch = new BatchStatement(Type.LOGGED);
    batch.addAll(this.statements);
    clientFactory.getSession().execute(batch);
}

public void addStatement(Statement statement) {
    this.statements.add(statement);
}

public ResultSet execute(Statement statement) {
    return clientFactory.getSession().execute(statement);
}

CassandraCqlStateUpdater
public void updateState(CassandraCqlState state,
                        List<TridentTuple> tuples,
                        TridentCollector collector) {
    for (TridentTuple tuple : tuples) {
        Statement statement = this.mapper.map(tuple);
        state.addStatement(statement);
    }
}

ExampleMapper
public Statement map(List<String> keys, Number value) {
    Insert statement =
        QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
    statement.value(KEY_NAME, keys.get(0));
    statement.value(VALUE_NAME, value);
    return statement;
}

public Statement retrieve(List<String> keys) {
    Select statement = QueryBuilder.select()
        .column(KEY_NAME).column(VALUE_NAME)
        .from(KEYSPACE_NAME, TABLE_NAME)
        .where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
    return statement;
}

Incremental State!
• Collapse aggregation into the state object.
– This allows the state object to aggregate with current state
in a loop until success.
• Uses Trident Batching to perform in-memory
aggregation for the batch.
for (tuple : batch)
    state.aggregate(tuple);

while (failed?) {
    persisted_state = read(state)
    aggregate(in_memory_state, persisted_state)
    failed? = conditionally_update(state)
}
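The loop above can be sketched in runnable form. This is an illustration under stated assumptions, not the storm-cassandra-cql implementation: a `ConcurrentHashMap` stands in for Cassandra, `putIfAbsent` and `replace` stand in for conditional updates, and the class name `IncrementalCountState` is hypothetical. Each partition aggregates its batch in memory, then merges into the shared store with read, combine, conditional write, and retry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one instance per partition, one shared store.
class IncrementalCountState {
    private final ConcurrentHashMap<String, Integer> store; // stands in for C*
    private final Map<String, Integer> inMemory = new HashMap<>();

    IncrementalCountState(ConcurrentHashMap<String, Integer> store) {
        this.store = store;
    }

    // Called per tuple: aggregate in memory only, no I/O.
    void aggregate(String key, int value) {
        inMemory.merge(key, value, Integer::sum);
    }

    // Called once per batch: fold the in-memory totals into the store.
    void commit() {
        for (Map.Entry<String, Integer> entry : inMemory.entrySet()) {
            boolean applied = false;
            while (!applied) {
                Integer persisted = store.get(entry.getKey());
                if (persisted == null) {
                    // Conditional insert: succeeds only if still absent.
                    applied = store.putIfAbsent(entry.getKey(), entry.getValue()) == null;
                } else {
                    // Conditional update: succeeds only if unchanged since the read.
                    applied = store.replace(entry.getKey(), persisted,
                                            persisted + entry.getValue());
                }
            }
        }
        inMemory.clear();
    }
}
```

Two partitions that both saw "fox" can commit in any order, and the stored total is the sum of both without any tuple having been re-partitioned by key.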

In-Memory Aggregation by Key!
Partition 1
Key Value
fox 6
brown 3
Partition 2
Key Value
fox 3
lazy 72
C*
No More GroupBy!

To protect against replays
Use partition + batch identifier(s) in
your conditional update!
“BatchId + partitionIndex consistently represents the
same data as long as:
1. Any repartitioning you do is deterministic (so
partitionBy is, but shuffle is not)
2. You're using a spout that replays the exact same
batch each time (which is true of transactional spouts
but not of opaque transactional spouts)”
- Nathan Marz
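One way to read this advice, sketched in plain Java: store the identifiers of the batches already folded into a value next to the value itself, and make the conditional update a no-op when the incoming batch + partition is already recorded. Everything here is an assumption for illustration (the `ReplaySafeStore` class, the `batchId:partitionIndex` token format, the in-memory map standing in for Cassandra). A real store would also need to bound how many identifiers it keeps.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a replay-safe conditional update.
class ReplaySafeStore {
    // Immutable cell: the running total plus the batches already applied.
    static final class Cell {
        final int total;
        final Set<String> applied;

        Cell(int total, Set<String> applied) {
            this.total = total;
            this.applied = Collections.unmodifiableSet(applied);
        }
    }

    private final ConcurrentHashMap<String, Cell> cells = new ConcurrentHashMap<>();

    // Returns true if the delta was applied, false if this batch + partition
    // had already been folded in (a replay).
    boolean apply(String key, long batchId, int partitionIndex, int delta) {
        String token = batchId + ":" + partitionIndex;
        while (true) {
            Cell current = cells.get(key);
            if (current != null && current.applied.contains(token)) {
                return false;
            }
            Set<String> applied = new HashSet<>();
            if (current != null) {
                applied.addAll(current.applied);
            }
            applied.add(token);
            Cell next = new Cell((current == null ? 0 : current.total) + delta, applied);
            // Conditional write: only succeeds if nobody changed the cell
            // since it was read; otherwise loop and re-read.
            boolean ok = (current == null)
                    ? cells.putIfAbsent(key, next) == null
                    : cells.replace(key, current, next);
            if (ok) {
                return true;
            }
        }
    }

    int total(String key) {
        Cell cell = cells.get(key);
        return cell == null ? 0 : cell.total;
    }
}
```

Replaying batch 1 from partition 0 after it has been applied returns `false` and leaves the total untouched, which is only safe under the two conditions quoted above.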

The Lambda Architecture
http://architects.dzone.com/articles/nathan-marzs-lamda

Let’s Challenge This a Bit
because “additional tools and techniques” cost
money and time.
• Questions:
– Can we solve the problem with a single tool and a
single approach?
– Can we re-use logic across layers?
– Or better yet, can we collapse layers?

A Traditional Interpretation
Speed Layer
(Storm)
Batch Layer
(Hadoop)
Data
Stream
Serving Layer
HBase
Impala
D’Oh! Two pipelines!

Integrating Web Services
• We need a web service that receives an event
and provides,
– an immediate acknowledgement
– a high likelihood that the data is integrated very soon
– a guarantee that the data will be integrated eventually
• We need an architecture that provides for,
– Code / Logic and approach re-use
– Fault-Tolerance

The Idea : Embedding State!
Kafka
DropWizard
C*
IncrementalCqlState
aggregate(tuple)
“Batch” Layer
(Storm)
Client

The Wins
• Reuse Aggregations and State Code!
• To re-compute (or backfill) a
dimension, simply re-queue!
• Storm is the “safety” net
– If a DW host fails during aggregation, Storm will fill
in the gaps for all ACK’d events.
• Is there an opportunity to reuse more?
– BatchingStrategy & PartitionStrategy?

Plug
The Book
Shout out:
Taylor Goetz

Thanks
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
