1© Cloudera, Inc. All rights reserved.
Large-Scale Data Science
on Hadoop
Uri Laserson | Data Scientist | @laserson
About the speaker
• Data Scientist at Cloudera
• PhD in BME at MIT/Harvard
• Committer on ADAM, impyla
• Co-author on Advanced Analytics with Spark
• laserson@cloudera.com
What is a data scientist?
What is data science?
• Phase 1. Collect Data
• Phase 2. Data Science?
• Phase 3. Profit!
Some things you might do as a data scientist
• Data quality issues
• Data formats/versions
• Data source integration
• Exploration/visualization
• Building/deploying models
Plumbing
Exploratory Operational
1. Data science is data plumbing
Example:
• Sells deep analysis of huge satellite images
• Easy: C++ to analyze images
• Hard: continuously and reliably ingesting and transforming the data
• Expensive: storing, computing
• Hadoop as the data science plumber
2. Data science is investigative analytics
Example: large UK retailer
• Customer Churn
  • SAS, Hive
• Path Analysis
  • Giraph, MapReduce
• Customer Segmentation
  • SAS, Spotfire, Impala
• Hadoop as one hub for investigative tools
  • Avoid buying and training for N new tools
3. Data science is operational analytics
Example:
• Real-time Search, ML over Patient Data
• MapReduce for indexing, learning
• HBase for storage and fast access
• Storm for incremental update
• RDBMS for recent derived data
• API façade for input and querying learning
Engineering
Machine Learning
Factors to consider when choosing your tools
• Single-node performance
• Scalability
• Language and tooling familiarity
• Integration with Hadoop
• Libraries / functions / richness of ecosystem
• Integration with data prep / ETL workflows
Pattern
JPMML
Plumbing in a nutshell
• Apache Kafka
• Apache Pig
• Apache Crunch
Serialization/RPC frameworks
• Specify schemas/services in user-friendly IDLs
• Code generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Natural support for schema evolution
• Multiple implementations:
  • Apache Thrift, Apache Avro, Google’s Protocol Buffers
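// Thrift IDL. Location and TweetSearchResult are assumed to be defined elsewhere.
struct Tweet {
  1: required i32 userId;
  2: required string userName;
  3: required string text;
  4: optional Location loc;
  16: optional string language = "english";  // default used when the field is absent
}

// Declared after Tweet so the type is known when the service refers to it.
service Twitter {
  void ping();
  bool postTweet(1: Tweet tweet);
  TweetSearchResult searchTweets(1: string query);
}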
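The "natural support for schema evolution" point can be illustrated with a minimal sketch. It assumes a toy tag-length-value encoding, which is not the actual wire format of Thrift, Avro, or Protocol Buffers; the helper names `encode_fields` and `decode_fields` and both schemas are hypothetical. The idea it shows is that numbered fields let an old reader skip fields it does not know, and let a new reader fall back to a default when an old writer omits a field.

```python
import struct

def encode_fields(fields):
    """Encode {field_id: bytes} as repeated (id, length, payload) records."""
    out = bytearray()
    for field_id, payload in sorted(fields.items()):
        out += struct.pack(">HI", field_id, len(payload))  # 2-byte id, 4-byte length
        out += payload
    return bytes(out)

def decode_fields(data, schema):
    """Decode records, skipping ids that the reader's schema does not know.

    schema maps field_id -> (name, default); absent fields keep the default.
    """
    record = {name: default for name, default in schema.values()}
    pos = 0
    while pos < len(data):
        field_id, length = struct.unpack_from(">HI", data, pos)
        pos += 6  # size of the (id, length) header
        payload = data[pos:pos + length]
        pos += length
        if field_id in schema:  # unknown ids are skipped, not treated as errors
            name, _ = schema[field_id]
            record[name] = payload.decode("utf-8")
    return record

# Hypothetical schemas: the old reader knows fields 2 and 3, the new one adds 16.
OLD_SCHEMA = {2: ("userName", None), 3: ("text", None)}
NEW_SCHEMA = {2: ("userName", None), 3: ("text", None), 16: ("language", "english")}

new_message = encode_fields({2: b"laserson", 3: b"hello", 16: b"hebrew"})
old_message = encode_fields({2: b"laserson", 3: b"hello"})

old_reader_view = decode_fields(new_message, OLD_SCHEMA)  # ignores field 16
new_reader_view = decode_fields(old_message, NEW_SCHEMA)  # language falls back to default
```

Real frameworks add typed fields, nesting, and generated classes on top of this, but the compatibility argument rests on the same stable field numbers.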
Log and service oriented architecture
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Factory (operational) vs. Laboratory (exploratory)
Factory (operational)        vs.   Laboratory (exploratory)
Programming languages              Statistical environments, BI tools
Systems languages                  High-level languages
Latency, throughput                Accuracy
Huge data                          Medium-sized data
Online problems                    Offline work
Automated                          Ad-hoc
Developers, Engineers              Statisticians, Analysts
Exploratory analytics
• Offline
• Statistical Environment
• Discovery-phase
• Model Building and Tuning
• Accuracy Important
• Medium-scale
• Visualizations
Exploratory
Exploratory: BI/visualization
• Nothing Hadoop-specific
• Take your pick of any third-party tool
• Typically connects to Hadoop via Impala's SQL interface
Exploratory: SAS
• Connects to Hadoop data stores
• Can push some computation down to the cluster, but still requires data movement
• Mature and widely used; large algorithm library
• Ongoing collaborative engineering effort with Cloudera
Exploratory: Python
• Python and JVM don’t play nice
• Hadoop Streaming / mrjob / scikit-learn
• Impyla: Python UDFs on Impala
• PySpark: Spark API in Python
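As a rough illustration of the Hadoop Streaming option, here is a minimal word-count sketch. Hadoop Streaming runs arbitrary executables that read lines on stdin and write tab-separated key/value lines to stdout; the functions `mapper`, `reducer`, and `run_local` below are hypothetical names, and `run_local` only simulates the map, sort, and reduce steps on one machine.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum counts per key; input must arrive sorted by key, as after the shuffle."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_local(lines):
    """Simulate map -> sort -> reduce locally, for testing without a cluster."""
    return list(reducer(sorted(mapper(lines))))

result = run_local(["Hadoop and Spark", "spark on hadoop"])
```

On a cluster the same logic would live in two small scripts that read `sys.stdin`, passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options. Every record crosses a process boundary as text, which is one concrete form of the friction between Python and the JVM.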
Operational analytics
• Online
• Real-Time
• Cluster Environment
• Model Serving, Update
• QPS, Latency Important
• Large Scale
Operational
Pattern
JPMML
Operational: MLlib (Spark)
• Model building on Spark
• Fast (distributed in-memory)
• Basic algorithms only
  • LR, SVM, decision tree
  • PCA, SVD
  • K-means
  • ALS
• Easy integration with Spark-as-ETL
GROUPBY integration with Hadoop
Read Hadoop data Requires data movement
GROUPBY integration with Hadoop
YARN-managed Outside
GROUPBY open source
Open source Closed source
GROUPBY active community
Active community Not
Languages
Java Python R Scala
• Next-generation general processing engine for Hadoop
• APIs in Python, Java, Scala (and early R)
• DAG execution / in-memory
• Interactive REPL
• Batch or streaming
• MLlib, GraphX
• Active community
• Scala-like API
Large scale or real-time?
Large-Scale, Offline, Batch
vs.
Real-Time, Online, Streaming
Why Don’t We Have Both?
λ!
Lambda architecture
• Tackle in 3 layers
  • Batch layer: offline, big model build
  • Speed layer: near-real-time, approximate update
  • Serving layer: real-time model query / scoring
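The three layers can be sketched in a few lines, assuming the "model" is nothing more than a per-key event count. The class `LambdaCounter` and its method names are hypothetical, and this toy is single-process and exact, whereas a real speed layer is distributed and typically approximate.

```python
from collections import Counter

class LambdaCounter:
    """Toy lambda architecture that counts events per key."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # rebuilt from scratch by the batch layer
        self.speed_view = Counter()  # incremental updates since the last batch run

    def ingest(self, key):
        """New data is appended to the log and applied to the speed view at once."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the full view, then discard covered speed state."""
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        """Serving layer: merge the batch view with recent speed-layer updates."""
        return self.batch_view[key] + self.speed_view[key]

store = LambdaCounter()
for event in ["a", "b", "a"]:
    store.ingest(event)
before_batch = store.query("a")  # answered from the speed view alone
store.run_batch()
store.ingest("a")
after_batch = store.query("a")   # batch view plus one fresh update
```

The design choice worth noting is that the batch layer always recomputes from the full log, so any error that accumulates in the speed layer is thrown away at the next batch run.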
PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by the Data Mining Group (www.dmg.org)
• Wide tool support
<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    …
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      …
    </MiningSchema>
    <Node score="will play">
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
        …
      </Node>
    </Node>
  </TreeModel>
</PMML>
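To show how a consumer might read such a model, here is a minimal sketch that parses a small PMML tree with Python's standard library and scores a record. The document `PMML_DOC` is a hypothetical, hand-written model loosely modeled on the golfing example, not the elided tree from the slide, and the evaluator handles only `<True/>` and equality predicates. A production system would use a full evaluator such as JPMML.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_1"}

# Hypothetical model for illustration only.
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header/>
  <DataDictionary numberOfFields="1">
    <DataField name="outlook" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification"
             noTrueChildStrategy="returnLastPrediction">
    <MiningSchema>
      <MiningField name="outlook"/>
    </MiningSchema>
    <Node score="will play">
      <True/>
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
      </Node>
      <Node score="no play">
        <SimplePredicate field="outlook" operator="equal" value="rain"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

def node_matches(node, row):
    """Evaluate a node's predicate; only <True/> and equality are supported."""
    if node.find("p:True", NS) is not None:
        return True
    pred = node.find("p:SimplePredicate", NS)
    if pred is None or pred.get("operator") != "equal":
        return False
    return row.get(pred.get("field")) == pred.get("value")

def score(node, row):
    """Descend into the first matching child; if none matches, keep this
    node's score (the returnLastPrediction behaviour set on the model)."""
    for child in node.findall("p:Node", NS):
        if node_matches(child, row):
            return score(child, row)
    return node.get("score")

root_node = ET.fromstring(PMML_DOC.strip()).find("p:TreeModel/p:Node", NS)
sunny = score(root_node, {"outlook": "sunny"})
rain = score(root_node, {"outlook": "rain"})
overcast = score(root_node, {"outlook": "overcast"})
```

Because the model is plain XML, the tool that trains it and the tool that serves it need not share a language or runtime.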
Lambda implementation: Oryx 2.x
• Generic lambda-architecture platform
• With ML specializations
  • Hyperparameter selection
• Built on Spark Streaming, Kafka
• With Intel
• 2.x: pre-alpha
github.com/OryxProject/oryx
HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
  • GET: query
  • POST: add input
• Easy from browser, CLI, Java, Python, Scala, etc.
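The GET side of this convention can be sketched with Python's standard library. This is an illustrative toy, not the Oryx serving layer: the lookup table `RECOMMENDATIONS`, the handler class, and the helper names are hypothetical, and the scores simply reuse the values from the slide's example exchange.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

# Hypothetical in-memory model output.
RECOMMENDATIONS = {
    "jwills": [("Ray LaMontagne", 0.951), ("Fleet Foxes", 0.7905),
               ("The National", 0.688), ("Shearwater", 0.3017)],
}

def render_recommendations(user):
    """Return the text/plain body for a user, or None if the user is unknown."""
    items = RECOMMENDATIONS.get(user)
    if items is None:
        return None
    return "\n".join(f'"{name}",{score}' for name, score in items)

class RecommendHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        prefix = "/recommend/"
        body = None
        if self.path.startswith(prefix):
            body = render_recommendations(self.path[len(prefix):])
        if body is None:
            self.send_error(404)
            return
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the example quiet

def fetch_once(user):
    """Start a throwaway server on a free local port and issue one GET."""
    server = HTTPServer(("127.0.0.1", 0), RecommendHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/recommend/{user}"
        with urllib.request.urlopen(url) as resp:
            return resp.status, resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch_once("jwills")` performs a real round trip over the loopback interface. A `do_POST` method accepting new input would complete the picture and is omitted to keep the sketch short.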
GET /recommend/jwills

HTTP/1.1 200 OK
Content-Type: text/plain

"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017
Thank you
laserson@cloudera.com

Large-Scale Data Science on Hadoop (Intel Big Data Day)

Editor's Notes

  • #2: What makes data science special on Hadoop.
  • #3: Background as a scientist. Do genomics/life sciences especially. Shameless plug for our new book.
  • #7: Or instead, what is data science?
  • #8: SCARES ME the most when I show up at clients.
  • #9: Difficult to define, but…
  • #10: One way to organize these things.
  • #11: TF-IDF model: from simple theory to complicated practical implementation.
  • #12: Any given operation on an image is not difficult. Reliably integrating satellite data with complex/custom pipelines is difficult. Must coordinate many tasks.
  • #13: Most similar to research/science/statistics. You don’t really know what you’re doing. Exploratory. Lots of tools to do this: Python, R, SAS, etc. BI tools (Tableau).
  • #14: Doing it at scale is more difficult. Hadoop centralizes. No need to copy data for each application. Bioinformatics spends lots of time mucking with different file formats in different systems. Many orgs are very siloed.
  • #15: Most unique to Hadoop/big data. Don’t want to train a model once. Given model, want to deploy it. Update it.
  • #17: If this is landscape of what data science is, what are some tools/recs? ~10 min mark
  • #19: ETL tools. Traditional Hadoop. Don’t want to say much except….
  • #20: Most common thing: instrumentation and schemas. Need culture of data/telemetry. Best stuff when you join data sets. Requires de-siloization. Requires centralized schemas.
  • #22: Also Kafka
  • #24: Ad hoc. Focus on accuracy, visualization. Traditional stats tools like R, Python, SAS.
  • #25: Ad hoc. Focus on accuracy, visualization. Traditional stats tools like R, Python, SAS.
  • #27: Thunder as framework on Spark.
  • #29: Mahout is deprecated.
  • #30: Another way to think about the tools is based on different features…
  • #34: Probably detect a theme here.
  • #36: Lots of tools in Hadoop have a dichotomy between online and offline. Do we have to choose?