Large-Scale Data Science on Hadoop (Intel Big Data Day)

1© Cloudera, Inc. All rights reserved.
Large-Scale Data Science
on Hadoop
Uri Laserson | Data Scientist | @laserson

About the speaker
• Data Scientist at Cloudera
• PhD in BME at MIT/Harvard
• Committer on ADAM, impyla
• Co-author on Advanced Analytics with Spark
• laserson@cloudera.com

What is a data scientist?

What is a data science?

What is a data science?
Phase 1. Collect Data Phase 2. Data Science? Phase 3. Profit!

Some things you might do as a data scientist
• Data quality issues
• Data formats/versions
• Data source integration
• Exploration/visualization
• Building/deploy models

Plumbing
Exploratory Operational

1. Data science is data plumbing

Example:
• Sells deep analysis of huge satellite images
• Easy: C++ to analyze images
• Hard: continuously reliably
ingesting, transforming
• Expensive:
storing, computing
• Hadoop as the
data science plumber

2. Data science is investigative analytics

Example: large UK retailer
• Customer Churn
• SAS, Hive
• Path Analysis
• Giraph, MapReduce
• Customer Segmentation
• SAS, Spotfire, Impala
• Hadoop as one hub for investigative tools
• Avoid buying, training for N new tools

3. Data science is operational analytics

Example:
• Real-time Search, ML over Patient Data
• MapReduce for indexing, learning
• HBase for storage and fast access
• Storm for incremental update
• RDBMS for recent
derived data
• API façade for input and
querying learning
Engineering
Machine Learning

Plumbing
Exploratory Operational

Factors to consider when choosing your tools
• Single-node performance
• Scalability
• Language and tooling familiarity
• Integration with Hadoop
• Libraries / functions / richness of ecosystem
• Integration with data prep / ETL workflows
Pattern
JPMML

Plumbing in a nutshell
Plumbing Apache Kafka
Apache Pig
Apache Crunch

Serialization/RPC frameworks
• Specify schemas/services in user-friendly
IDLs
• Code-generation to multiple languages (wire-
compatible/portable)
• Compact, binary formats
• Natural support for schema evolution
• Multiple implementations:
• Apache Thrift, Apache Avro, Google’s
Protocol Buffers
service Twitter {
void ping();
bool postTweet(1:Tweet tweet);
TweetSearchResult searchTweets(1:string query);
}
struct Tweet {
1: required i32 userId;
2: required string userName;
3: required string text;
4: optional Location loc;
16: optional string language = "english"
}

Log and service oriented architecture
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Factory (operational) vs. Laboratory (exploratory)
Programming languages
Systems languages
Latency, throughput
Huge data
Online problems
Automated
Developers, Engineers
Statistical environments, BI tools
High-level languages
Accuracy
Medium-sized data
Offline work
Ad-hoc
Statisticians, Analysts
vs.

Exploratory analytics
• Offline
• Statistical Environment
• Discovery-phase
• Model Building and Tuning
• Accuracy Important
• Medium-scale
• Visualizations
Exploratory

Exploratory: BI/visualization
• Nothing Hadoop-specific
• Take your pick of any 3rd party tool
• Typically connects to Hadoop via SQL
interface with Impala

Exploratory: SAS
• Connects to Hadoop data stores
• Can push down some computation
to cluster, but requires data
movement
• Mature and widely used; large algo
library
• Ongoing collaborative engineering
effort with Cloudera

Exploratory: Python
• Python and JVM don’t play nice
• Hadoop Streaming / mrjob / scikit-
learn
• Impyla: Python UDFs on Impala
• PySpark: Spark API in Python

Operational analytics
• Online
• Real-Time
• Cluster Environment
• Model Serving, Update
• QPS, Latency Important
• Large Scale
Operational
Pattern
JPMML

Operational: MLlib (Spark)
• Model building on Spark
• Fast (distributed in-memory)
• Basic algorithms only
• LR, SVM, decision tree
• PCA, SVD
• K-means
• ALS
• Easy integration with Spark-as-ETL

GROUPBY integration with Hadoop
Read Hadoop data Requires data movement

GROUPBY integration with Hadoop
YARN-managed Outside

GROUPBY open source
Open source Closed source

GROUPBY active community
Active community Not

Languages
Java Python R Scala

• Next-generation general processing engine for Hadoop
• APIs in Python, Java, Scala (and early R)
• DAG execution / in-memory
• Interactive REPL
• Batch or streaming
• MLlib, GraphX
• Active community
• Scala-like API

Large scale or real-time?
Large-Scale
Offline
Batch
Real-Time
Online
Streaming
vs

Large scale or real-time?
Large-Scale
Offline
Batch
Real-Time
Online
Streaming
vs
Why Don’t We Have Both?
λ!

Lambda architecture
• Tackle in 3 Layers
• Batch Layer:
offline, big model build
• Speed Layer:
near-real-time, approximate update
• Serving Layer:
real-time model
query / scoring

PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group
(www.dmg.org)
• Wide tool support
<PMML xmlns="http://www.dmg.org/PMML-4_1"
version="4.1">
<Header copyright="www.dmg.org"/>
<DataDictionary numberOfFields="5">
<DataField name="temperature"
optype="continuous"
dataType="double"/>
…
</DataDictionary>
<TreeModel modelName="golfing"
functionName="classification">
<MiningSchema>
<MiningField name="temperature"/>
…
</MiningSchema>
<Node score="will play">
<Node score="will play">
<SimplePredicate field="outlook"
operator="equal"
value="sunny"/>
…
</Node>
</Node>
</TreeModel>
</PMML>

Lambda implementation: Oryx 2.x
• Generic lambda-architecture platform
• With ML specializations
• hyperparam selection
• Built on Spark Streaming, Kafka
• With Intel
• 2.x: pre-alpha
github.com/OryxProject/oryx

Lambda implementation: Oryx 2.x
github.com/OryxProject/oryx

HTTP REST API
• Convention for RPC-like request /
response
• HTTP verbs, transport
• GET : query
• POST : add input
• Easy from browser, CLI, Java,
Python, Scala, etc.
GET /recommend/jwills
HTTP/1.1 200 OK
Content-Type: text/plain
"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017

Thank you
laserson@cloudera.com

Large-Scale Data Science on Hadoop (Intel Big Data Day)

More Related Content

What's hot (20)

Viewers also liked (9)

Similar to Large-Scale Data Science on Hadoop (Intel Big Data Day) (20)

More from Uri Laserson (7)

Recently uploaded (20)

Large-Scale Data Science on Hadoop (Intel Big Data Day)

Editor's Notes