Web-scale data processing: practical approaches for low-latency and batch
$>whoami
Edward Capriolo
● Developer @ Dstillery (the company formerly known as m6d, aka Media6Degrees)
● Hive: Project Management Committee
● Hadoop'in it since 0.17.2
● Cassandra'in it since 0.6.x
● Hive'in it since 0.3.x
● Incredibly skilled with PowerPoint
Agenda for this talk
● Batch processing via Hadoop
● Stream processing
● Relational databases and NoSQL
● Life lessons, quips, and other perspectives
Before we talk tech...
● Let's talk math!
● Yay! Math fun! (as people start leaving the room)
● Don't worry. It is only a couple of slides.
● I wanted to talk about relational algebra since it is the foundation of relational databases.
● Even in the NoSQL age, relational algebra is alive and well.
Relational algebra...
A big slide with many words
● Relational algebra received little attention outside of pure mathematics until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages.
● In computer science, relational algebra is an offshoot of first-order logic and of the algebra of sets concerned with operations over finitary relations, usually made more convenient to work with by identifying the components of a tuple by a name (called an attribute) rather than by a numeric column index, which is called a relation in database terminology.
http://en.wikipedia.org/wiki/Relational_algebra
Operators of relational algebra: Projection
● SELECT Age, Weight ...
Extended projections
● SELECT Age+Weight AS X ...
● SELECT ROUND(Weight), Age+1 AS X ...
Selection
● SELECT * FROM Person
● SELECT * FROM Person WHERE Age >= 34
● SELECT * FROM Person WHERE Age = Weight
Joins
● SELECT * FROM Car JOIN Boat ON (CarPrice >= BoatPrice)
● SELECT * FROM Car JOIN Boat ON (CarPrice = BoatPrice)
Aggregate
● SELECT sum(C) FROM r
● SELECT A, sum(C) FROM r GROUP BY A
http://www.cbcb.umd.edu/confcour/CMSC424/Relational_algebra.pdf
Other Operators
● Set operations
  – Intersection
  – Union
  – Cartesian product
● Outer joins
  – LEFT
  – RIGHT
  – FULL
● Semi join / EXISTS (see the sketch below)
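
A quick sketch of the semi join, since it is the only operator above without an example (the Person/Car schema here is hypothetical):

-- Return each Person with at least one matching Car, without
-- duplicating rows when several cars match.
SELECT * FROM Person p
WHERE EXISTS (SELECT 1 FROM Car c WHERE c.OwnerId = p.Id)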
Batch Processing and Big Data
● When Hadoop came on the scene it was a game changer because it:
  – Was a viable implementation of Google's MapReduce white paper
  – Worked with commodity hardware
  – Had no exorbitant software fees
  – Scaled processing and storage with growing companies, typically without needing processes to be redesigned
Archetype Hadoop deployment (circa Facebook 2009)
[Diagram: web servers feed Scribe writers and a Scribe mid-tier into a realtime Hadoop cluster and the Hadoop Hive warehouse, with Oracle RAC and MySQL as surrounding data stores]
http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
The Hadoop archetype
● Component generating events (web servers)
● Component collecting logs into Hadoop (Scribe)
● Translation of raw data using Hadoop and Hive
● Output of rollups to Oracle and other data systems
  – Feedback loops (MySQL <-> Hive)
Use case: Book store
● Our book store will be named (say it with me!):
  – Web Scale,
  – Big Data,
  – No SQL,
  – Real Time Analytics,
  – Books!
● One more time!
  – Web Scale, Big Data, No SQL, Real Time Analytics, Books
● (A buzzword bingo company)
Domain model
{
  "id": "00001",
  "refer": "http://affiliate1.superbooks.com",
  "ip": "209.191.139.200",
  "status": "ACCEPTED",
  "eventTimeInMillis": 1383011801439,
  "credit_hash": "ab45de21",
  "email": "bob@compuserv.com",
  "purchases": [
    { "name": "Programming Hive", "cost": 30.0 },
    { "name": "frAgile Software Development", "cost": 0.2 }
  ]
}
Complex serialized payloads
● The “process web logs” in Facebook's case were NOT always tab-delimited text files
● In many cases Scribe was logging complex structures in Thrift format
● Hadoop (and Hive) can work with complex records not typical in an RDBMS (see the sketch below)
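
As a sketch of how Hive models the nested domain record above (the table name matches the query used later in the deck, but the exact DDL is an assumption, not from the slides):

-- Hypothetical Hive DDL for the JSON domain model shown earlier.
-- The nested purchases field maps to an array of structs.
CREATE TABLE store_transaction (
  id STRING,
  refer STRING,
  ip STRING,
  status STRING,
  eventTimeInMillis BIGINT,
  credit_hash STRING,
  email STRING,
  purchases ARRAY<STRUCT<name:STRING, cost:DOUBLE>>
);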
Log collection/ingestion
http://flume.apache.org/FlumeUserGuide.html
Several ingestion approaches
● Scribe never took off
● Chukwa (hangs around, not sexy)
● Log servers logging directly with the HDFS API
● A duct-taped-up set of shell scripts
● Flume seems to be the most widely used, feature-rich, and supported system
Left up to the user...
● What format do you want the raw data in?
● How should the data be staged in HDFS?
  – Hourly directories
  – By host
● How to monitor?
  – Semantics of what the pipeline should do if files stop appearing
  – Application-specific sanity checks
Unleash the hounds!
Hive and relational algebra

SELECT refer,                                       <- Projection
       sum(purchase.cost)                           <- Aggregation
FROM store_transaction
LATERAL VIEW explode(purchases) plist AS purchase   <- Hive sexyness
WHERE refer = 'y'                                   <- Selection
GROUP BY refer                                      <- Aggregation
Hadoop/Hive's parallel implementation
Drawbacks of the batch approach
● Not efficient/possible on small time windows
  – Jobs have start-up time and overhead
● Late data can be troublesome
  – Resulting in a full rerun
  – Re-runs of dependent jobs
● Failures can set processing back hours (or maybe days)
● Scheduling of dependent tasks
  – Not a huge consensus around the proper tool
    ● Oozie
    ● Azkaban
    ● Cron ... pause not
More drawbacks of batch data
● Interactive analysis of results
● Detecting sanity of input
● Result data typically moved into other systems for interactive analysis (post-processing)
● Most computational steps spill/persist to disk
  – Components of a job can be pipelined, but between two jobs sits persistent storage that must be re-read for the next batch.
Stream Processing
Stream processing
● My first job doing “stream processing” was reading in Associated Press data:
  – Connecting to a terminal server connected to a serial modem
  – Writing this information to a database
● My definition: processing data across one or more coordinated data channels
● Like “Big Data”, stream processing is:
  – Whatever you say it is
Common components of stream processing
● Message queue – a system that delivers a never-ending stream of data
● Processing engine – manages streams and connects data to processing
● External/internal persistence – some data may live outside the stream
  – It could be transient or persistent
Message Queues
Why most Message Queue software does not 'scale'
● MQ 'guarantees':
  – In-order delivery
  – Acknowledgments
● MQs typically optimize by keeping all data in memory
  – Semantics around what happens when memory is full:
    ● Block
    ● Persist to disk
    ● Throw away
● Not trashing message queues here. Many of their guarantees are hard to deliver at scale, and not always needed.
Kafka – A high-throughput distributed messaging system
● Publish-subscribe messaging re-thought as a distributed commit log
Distributed
● Data streams are partitioned and spread over a cluster of machines
Durable and fast
● Messages are always persisted to disk!
● Consumers track their own position in the log files (see the consumer sketch below)
● Kafka uses the sendfile system call for performance
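
To make “consumers track their position” concrete, here is a minimal consumer sketch. The deck's Kafka predates the modern client, so the org.apache.kafka.clients.consumer API, broker address, topic, and group name below are all assumptions; the offset is the consumer-side position in the log:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PositionTrackingConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
    props.put("group.id", "bookstore-stats");         // hypothetical group
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("enable.auto.commit", "true"); // periodically commit our offset

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("purchases"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> r : records) {
          // The offset is this consumer's position in the partition's log.
          System.out.printf("partition=%d offset=%d value=%s%n",
              r.partition(), r.offset(), r.value());
        }
      }
    }
  }
}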
Consumer Groups
● Multiple groups can subscribe to an event stream
● Producers can determine event partitioning (see the producer sketch below)
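
A minimal sketch of keyed partitioning from the producer side, again using the modern org.apache.kafka.clients.producer API as an assumption (broker address and topic are hypothetical):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Records with the same key hash to the same partition, so one
      // consumer in a group sees every event for a given user id.
      producer.send(new ProducerRecord<>("purchases", "user-1", "cart|1:saw:2.00"));
      producer.send(new ProducerRecord<>("purchases", "user-1", "cart|1:hammer:3.00"));
    }
  }
}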
Great! You have streaming data. How do you process it?
● Storm - https://github.com/nathanmarz/storm
● Samza - samza.incubator.apache.org
● S4 - http://incubator.apache.org/s4/
● InfoSphere Streams - http://www03.ibm.com/software/products/us/en/infospherestreams/
Heck, even I wrote one!
● IronCount - https://github.com/edwardcapriolo/IronCount
Before you have a holy war over this software decision...
Storm
● Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC.
Storm (Trident) API
● Data comes from spouts
● Spouts/streams produce tuples

FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 1,
    new Values("line one"),
    new Values("line two"));

https://github.com/nathanmarz/storm/wiki/Trident-tutorial
(extended) Projection
● A stream can be processed into another stream
● Here a line is split into words

Stream words = stream.each(new Fields("sentence"),
    new Split(), new Fields("word"));

● (Similar to Hive's LATERAL VIEW)
Grouping and Aggregation

GroupedStream groupByWord = words.groupBy(new Fields("word"));

TridentState groupByState = groupByWord.persistentAggregate(
    new MemoryMapState.Factory(), new Count(), new Fields("count"));
Great! We just did distributed stream processing!
● But where are the results?
● groupByWord.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
● In memory... aka nowhere :)
● We can change that...
● But first some math/science/dribble I stole from Wikipedia in an attempt to sound smart!
Temporal database
● A temporal database is a database with built-in support for handling data involving time, for example a temporal data model and a temporal version of Structured Query Language (SQL).
● Temporal databases are in contrast to current databases, which store only facts believed to be true at the current time.
Batch/Hadoop was easy (temporally speaking)
● Input data is typically in write-once HDFS files*
● Output data typically goes to write-once output files*
● The reduce phase does not start until the map/shuffle is done
● Output data is typically not available until the entire job is done*
● Idempotent computation

*Going to qualify everything with "typically", because of computational idempotency
The real “real time”
● “Real time” is often misused
● Anecdotally, people usually mean:
  – Low latency
  – Small windows of time (sub-minute & sub-second)
● Our bookstore wants “real time” stats
  – Aggregations and data stores updated incrementally as data is processed
● One way to implement this is discrete columns bucketed by time
Tempor-alizing data
● In an earlier example we aggregated revenue by referrer like this:
  SELECT refer, sum(purchase.cost) ... GROUP BY refer
● Now we include the time:
  SELECT day(eventtime), hour(eventtime), minute(eventtime), refer, sum(purchase.cost) ...
  GROUP BY day(eventtime), hour(eventtime), minute(eventtime), refer
Storing data in Cassandra
● Horizontally scalable (hundreds of nodes)
● No single point of failure
● Integrated replication
● Writes like lightning (structured log storage)
● Reads like thunder (LevelDB- & BigTable-inspired storage)
Scalable time series made easy with Cassandra
● Create a table with one row per day per refer, sorted by time:

CREATE TABLE purchase_by_refer (
  refer text,
  dt date,
  event_time timestamp,
  tot counter,
  PRIMARY KEY ((refer, dt), event_time));

UPDATE purchase_by_refer SET tot = tot + 1
WHERE refer = 'store1' AND dt = '2013-01-12'
  AND event_time = '2013-01-12 07:03:00';
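
Reading the series back is then a single-partition slice, returned in clustering (time) order; a minimal sketch:

-- Minute-by-minute totals for one refer on one day.
SELECT event_time, tot
FROM purchase_by_refer
WHERE refer = 'store1' AND dt = '2013-01-12';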
If you want C* and Storm
● https://github.com/hmsonline/storm-cassandra
● Uses Cassandra as a persistence model for Storm
● Good documentation
The home stretch: Joining streams and caching data
● Some use cases of distributed streaming involve keeping local caches
● Streaming algorithms require memory of recent events and do not want to query a datastore each time an event is received
● Kafka is useful in this case because the user can dictate the partition the data is sent to
Streaming Recommendation System
https://github.com/edwardcapriolo/IronCount
Input Streams

Stream 1: users    Stream 2: items
user|1:edward      cart|1:saw:2.00
user|2:nate        cart|1:hammer:3.00
user|3:stacey      cart|3:puppy:1.00

● Both streams merged (union)
● The field after the pipe is the user id (projection)
● The user id should be the partition key when sent on (aggregation)
Handle message and route by id

public void handleMessage(MessageAndMetadata<Message> m) {
  String line = getMessage(m.message());
  // "|" is a regex metacharacter, so it must be escaped for split().
  String[] parts = line.split("\\|");
  String table = parts[0];           // "user" or "cart"
  String row = parts[1];             // e.g. "1:edward"
  String[] columns = row.split(":");
  // Re-emit keyed by user id (columns[0]) so both streams for the same
  // user land on the same partition of the "reduce" topic.
  producer.send(new ProducerData<String, String>(
      "reduce", columns[0], Arrays.asList(table + "|" + row)));
}
Update in memory copy

public class ReduceHandler implements MessageHandler {
  // Bounded, recency-evicting cache of each user's recent items
  // (EvictingHashMap is specific to this example).
  HashMap<User, ArrayList<Item>> data =
      new EvictingHashMap<User, ArrayList<Item>>();
  ...
  public void handleMessage(MessageAndMetadata<Message> m) {
    ...
    if (table.equals("cart")) {
      Item i = new Item();
      i.parse(columns);
      incrementItemCounter(u);
      incrementDollarByUser(u, i);
    }
    suggestNewItemsForUser(u);
  }
}
Challenges of streaming
● Replay of data could double-count or miss counts
● New, evolving APIs
  – You may have to build support for your stack
● Distributed computation is harder to log/debug
● Monitoring consumption on topics to avoid falling behind
● Monitoring topics to notice if data stops
El fin
