Web-scale data processing: practical approaches for low-latency and batch
$>whoami
Edward Capriolo
● Developer @ Dstillery (the company formerly known as m6d, aka Media6Degrees)
● Hive: Project Management Committee
● Hadoop'in it since 0.17.2
● Cassandra'in it since 0.6.x
● Hive'in it since 0.3.x
● Incredibly skilled with PowerPoint
Agenda for this talk
● Batch processing via Hadoop
● Stream processing
● Relational databases and NoSQL
● Life lessons, quips, and other perspectives
Before we talk tech...
● Let's talk math!
● Yay! Math fun! (as people start leaving the room)
● Don't worry. It is only a couple of slides.
● I wanted to talk about relational algebra since it is the foundation of relational databases.
● Even in the NoSQL age, relational algebra is alive and well.
Relational algebra...
A big slide with many words
● Relational algebra received little attention outside of pure mathematics until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages.
● In computer science, relational algebra is an offshoot of first-order logic and of the algebra of sets concerned with operations over finitary relations, usually made more convenient to work with by identifying the components of a tuple by a name (called an attribute) rather than by a numeric column index, which is called a relation in database terminology.
http://en.wikipedia.org/wiki/Relational_algebra
Operators of relational algebra: Projection
● SELECT Age, Weight ...
Extended projections
● SELECT Age+Weight AS X ...
● SELECT ROUND(Weight), Age+1 AS X ...
Selection
● SELECT * FROM Person
● SELECT * FROM Person WHERE Age >= 34
● SELECT * FROM Person WHERE Age = Weight
Joins
● SELECT * FROM Car JOIN Boat ON (CarPrice >= BoatPrice)
● SELECT * FROM Car JOIN Boat ON (CarPrice = BoatPrice)
Aggregate
● SELECT sum(C) FROM r
● SELECT A, sum(C) FROM r GROUP BY A
http://www.cbcb.umd.edu/confcour/CMSC424/Relational_algebra.pdf
Other Operators
● Set operations
  – Intersection
  – Union
  – Cartesian product
● Outer joins
  – LEFT
  – RIGHT
  – FULL
● Semi join / EXISTS (see the sketch below)
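
A quick sketch of the semi join, since it is the only operator above without an example (the Person/Car schema here is hypothetical):

-- Return each Person with at least one matching Car, without
-- duplicating rows when several cars match.
SELECT * FROM Person p
WHERE EXISTS (SELECT 1 FROM Car c WHERE c.OwnerId = p.Id)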
Batch Processing and Big Data
● When Hadoop came on the scene it was a game changer because it:
  – Was a viable implementation of Google's MapReduce white paper
  – Worked with commodity hardware
  – Had no exorbitant software fees
  – Scaled processing and storage with growing companies, typically without needing processes to be redesigned
Archetype Hadoop deployment (circa Facebook 2009)
[Diagram: web servers feed Scribe writers and a Scribe mid-tier into a realtime Hadoop cluster and the Hadoop Hive warehouse, with Oracle RAC and MySQL as surrounding data stores]
http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
The Hadoop archetype
● Component generating events (web servers)
● Component collecting logs into Hadoop (Scribe)
● Translation of raw data using Hadoop and Hive
● Output of rollups to Oracle and other data systems
  – Feedback loops (MySQL <-> Hive)
Use case: Book store
● Our book store will be named (say it with me!):
  – Web Scale,
  – Big Data,
  – No SQL,
  – Real Time Analytics,
  – Books!
● One more time!
  – Web Scale, Big Data, No SQL, Real Time Analytics, Books
● (A buzzword bingo company)
Domain model
{
  "id": "00001",
  "refer": "http://affiliate1.superbooks.com",
  "ip": "209.191.139.200",
  "status": "ACCEPTED",
  "eventTimeInMillis": 1383011801439,
  "credit_hash": "ab45de21",
  "email": "bob@compuserv.com",
  "purchases": [
    { "name": "Programming Hive", "cost": 30.0 },
    { "name": "frAgile Software Development", "cost": 0.2 }
  ]
}
Complex serialized payloads
● The “process web logs” in Facebook's case were NOT always tab-delimited text files
● In many cases Scribe was logging complex structures in Thrift format
● Hadoop (and Hive) can work with complex records not typical in an RDBMS (see the sketch below)
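
As a sketch of how Hive models the nested domain record above (the table name matches the query used later in the deck, but the exact DDL is an assumption, not from the slides):

-- Hypothetical Hive DDL for the JSON domain model shown earlier.
-- The nested purchases field maps to an array of structs.
CREATE TABLE store_transaction (
  id STRING,
  refer STRING,
  ip STRING,
  status STRING,
  eventTimeInMillis BIGINT,
  credit_hash STRING,
  email STRING,
  purchases ARRAY<STRUCT<name:STRING, cost:DOUBLE>>
);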
Log collection/ingestion
http://flume.apache.org/FlumeUserGuide.html
Several ingestion approaches
● Scribe never took off
● Chukwa (hangs around, not sexy)
● Log servers logging directly with the HDFS API
● A duct-taped-up set of shell scripts
● Flume seems to be the most widely used, feature-rich, and supported system
Left up to the user...
● What format do you want the raw data in?
● How should the data be staged in HDFS?
  – Hourly directories
  – By host
● How to monitor?
  – Semantics of what the pipeline should do if files stop appearing
  – Application-specific sanity checks
Unleash the hounds!
Hive and relational algebra

SELECT refer,                                       <- Projection
       sum(purchase.cost)                           <- Aggregation
FROM store_transaction
LATERAL VIEW explode(purchases) plist AS purchase   <- Hive sexyness
WHERE refer = 'y'                                   <- Selection
GROUP BY refer                                      <- Aggregation
Hadoop/Hive's parallel implementation
Drawbacks of the batch approach
● Not efficient/possible on small time windows
  – Jobs have start-up time and overhead
● Late data can be troublesome
  – Resulting in a full rerun
  – Re-runs of dependent jobs
● Failures can set processing back hours (or maybe days)
● Scheduling of dependent tasks
  – Not a huge consensus around the proper tool
    ● Oozie
    ● Azkaban
    ● Cron ... pause not
More drawbacks of batch data
● Interactive analysis of results
● Detecting sanity of input
● Result data typically moved into other systems for interactive analysis (post-processing)
● Most computational steps spill/persist to disk
  – Components of a job can be pipelined, but between two jobs sits persistent storage that must be re-read for the next batch.
Stream Processing
Stream processing
● My first job doing “stream processing” was reading in Associated Press data:
  – Connecting to a terminal server connected to a serial modem
  – Writing this information to a database
● My definition: processing data across one or more coordinated data channels
● Like “Big Data”, stream processing is:
  – Whatever you say it is
Common components of stream processing
● Message queue – a system that delivers a never-ending stream of data
● Processing engine – manages streams and connects data to processing
● External/internal persistence – some data may live outside the stream
  – It could be transient or persistent
Message Queues
Why most Message Queue software does not 'scale'
● MQ 'guarantees':
  – In-order delivery
  – Acknowledgments
● MQs typically optimize by keeping all data in memory
  – Semantics around what happens when memory is full:
    ● Block
    ● Persist to disk
    ● Throw away
● Not trashing message queues here. Many of their guarantees are hard to deliver at scale, and not always needed.
Kafka – A high-throughput distributed messaging system
● Publish-subscribe messaging re-thought as a distributed commit log
Distributed
● Data streams are partitioned and spread over a cluster of machines
Durable and fast
● Messages are always persisted to disk!
● Consumers track their own position in the log files (see the consumer sketch below)
● Kafka uses the sendfile system call for performance
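
To make “consumers track their position” concrete, here is a minimal consumer sketch. The deck's Kafka predates the modern client, so the org.apache.kafka.clients.consumer API, broker address, topic, and group name below are all assumptions; the offset is the consumer-side position in the log:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PositionTrackingConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
    props.put("group.id", "bookstore-stats");         // hypothetical group
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("enable.auto.commit", "true"); // periodically commit our offset

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("purchases"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> r : records) {
          // The offset is this consumer's position in the partition's log.
          System.out.printf("partition=%d offset=%d value=%s%n",
              r.partition(), r.offset(), r.value());
        }
      }
    }
  }
}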
Consumer Groups
● Multiple groups can subscribe to an event stream
● Producers can determine event partitioning (see the producer sketch below)
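
A minimal sketch of keyed partitioning from the producer side, again using the modern org.apache.kafka.clients.producer API as an assumption (broker address and topic are hypothetical):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Records with the same key hash to the same partition, so one
      // consumer in a group sees every event for a given user id.
      producer.send(new ProducerRecord<>("purchases", "user-1", "cart|1:saw:2.00"));
      producer.send(new ProducerRecord<>("purchases", "user-1", "cart|1:hammer:3.00"));
    }
  }
}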
Great! You have streaming data. How do you process it?
● Storm - https://github.com/nathanmarz/storm
● Samza - samza.incubator.apache.org
● S4 - http://incubator.apache.org/s4/
● InfoSphere Streams - http://www03.ibm.com/software/products/us/en/infospherestreams/
Heck, even I wrote one!
● IronCount - https://github.com/edwardcapriolo/IronCount
Before you have a holy war over this software decision...
Storm
● Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC.
Storm (Trident) API
● Data comes from spouts
● Spouts/streams produce tuples

FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 1,
    new Values("line one"),
    new Values("line two"));

https://github.com/nathanmarz/storm/wiki/Trident-tutorial
(extended) Projection
● A stream can be processed into another stream
● Here a line is split into words

Stream words = stream.each(new Fields("sentence"),
    new Split(), new Fields("word"));

● (Similar to Hive's LATERAL VIEW)
Grouping and Aggregation

GroupedStream groupByWord = words.groupBy(new Fields("word"));

TridentState groupByState = groupByWord.persistentAggregate(
    new MemoryMapState.Factory(), new Count(), new Fields("count"));
Great! We just did distributed stream processing!
● But where are the results?
● groupByWord.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
● In memory... aka nowhere :)
● We can change that...
● But first some math/science/dribble I stole from Wikipedia in an attempt to sound smart!
Temporal database
● A temporal database is a database with built-in support for handling data involving time, for example a temporal data model and a temporal version of Structured Query Language (SQL).
● Temporal databases are in contrast to current databases, which store only facts believed to be true at the current time.
Batch/Hadoop was easy (temporally speaking)
● Input data is typically in write-once HDFS files*
● Output data typically goes to write-once output files*
● The reduce phase does not start until the map/shuffle is done
● Output data is typically not available until the entire job is done*
● Idempotent computation

*Going to qualify everything with "typically", because of computational idempotency
The real “real time”
● “Real time” is often misused
● Anecdotally, people usually mean:
  – Low latency
  – Small windows of time (sub-minute & sub-second)
● Our bookstore wants “real time” stats
  – Aggregations and data stores updated incrementally as data is processed
● One way to implement this is discrete columns bucketed by time
Tempor-alizing data
● In an earlier example we aggregated revenue by referrer like this:
  SELECT refer, sum(purchase.cost) ... GROUP BY refer
● Now we include the time:
  SELECT day(eventtime), hour(eventtime), minute(eventtime), refer, sum(purchase.cost) ...
  GROUP BY day(eventtime), hour(eventtime), minute(eventtime), refer
Storing data in Cassandra
● Horizontally scalable (hundreds of nodes)
● No single point of failure
● Integrated replication
● Writes like lightning (structured log storage)
● Reads like thunder (LevelDB- & BigTable-inspired storage)
Scalable time series made easy with Cassandra
● Create a table with one row per day per refer, sorted by time:

CREATE TABLE purchase_by_refer (
  refer text,
  dt date,
  event_time timestamp,
  tot counter,
  PRIMARY KEY ((refer, dt), event_time));

UPDATE purchase_by_refer SET tot = tot + 1
WHERE refer = 'store1' AND dt = '2013-01-12'
  AND event_time = '2013-01-12 07:03:00';
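
Reading the series back is then a single-partition slice, returned in clustering (time) order; a minimal sketch:

-- Minute-by-minute totals for one refer on one day.
SELECT event_time, tot
FROM purchase_by_refer
WHERE refer = 'store1' AND dt = '2013-01-12';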
If you want C* and Storm
● https://github.com/hmsonline/storm-cassandra
● Uses Cassandra as a persistence model for Storm
● Good documentation
The home stretch: Joining streams and caching data
● Some use cases of distributed streaming involve keeping local caches
● Streaming algorithms require memory of recent events and do not want to query a datastore each time an event is received
● Kafka is useful in this case because the user can dictate the partition the data is sent to
Streaming Recommendation System
https://github.com/edwardcapriolo/IronCount
Input Streams

Stream 1: users    Stream 2: items
user|1:edward      cart|1:saw:2.00
user|2:nate        cart|1:hammer:3.00
user|3:stacey      cart|3:puppy:1.00

● Both streams merged (union)
● The field after the pipe is the user id (projection)
● The user id should be the partition key when sent on (aggregation)
Handle message and route by id

public void handleMessage(MessageAndMetadata<Message> m) {
  String line = getMessage(m.message());
  // "|" is a regex metacharacter, so it must be escaped for split().
  String[] parts = line.split("\\|");
  String table = parts[0];           // "user" or "cart"
  String row = parts[1];             // e.g. "1:edward"
  String[] columns = row.split(":");
  // Re-emit keyed by user id (columns[0]) so both streams for the same
  // user land on the same partition of the "reduce" topic.
  producer.send(new ProducerData<String, String>(
      "reduce", columns[0], Arrays.asList(table + "|" + row)));
}
Update in memory copy

public class ReduceHandler implements MessageHandler {
  // Bounded, recency-evicting cache of each user's recent items
  // (EvictingHashMap is specific to this example).
  HashMap<User, ArrayList<Item>> data =
      new EvictingHashMap<User, ArrayList<Item>>();
  ...
  public void handleMessage(MessageAndMetadata<Message> m) {
    ...
    if (table.equals("cart")) {
      Item i = new Item();
      i.parse(columns);
      incrementItemCounter(u);
      incrementDollarByUser(u, i);
    }
    suggestNewItemsForUser(u);
  }
}
Challenges of streaming
● Replay of data could double-count or miss counts
● New, evolving APIs
  – You may have to build support for your stack
● Distributed computation is harder to log/debug
● Monitoring consumption on topics to avoid falling behind
● Monitoring topics to notice if data stops
El fin
