Apache® Spark™ MLlib:
From Quick Start to Scikit-Learn
Joseph K. Bradley
February 24th, 2016
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and Apache Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
We are Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• 75% share of Spark code contributed by Databricks in 2014
• Created Databricks on top of Spark to make big data simple.
[Figure: Data → Value]
Apache Spark Engine
[Diagram: Spark Core, with Spark Streaming, Spark SQL, MLlib, GraphX, … as standard libraries on top]
• Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, and R APIs
• Standard libraries
Notable users that presented at Spark Summit 2015 San Francisco
Source: Slide 5 of Spark Community Update
Machine Learning: What and Why?
What: ML uses data to identify patterns and make decisions.
Why: The core value of ML is automated decision making.
• Especially important when dealing with TB or PB of data
Many use cases, including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations
Why Spark MLlib
Provide general-purpose ML algorithms on top of Spark
• Hide complexity of distributing data & queries, and scaling
• Leverage Spark improvements (DataFrames, Tungsten, Datasets)
Advantages of MLlib’s design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
Spark scales well
• Largest cluster: 8000 Nodes (Tencent)
• Largest single job: 1 PB (Alibaba, Databricks)
• Top Streaming Intake: 1 TB/hour (HHMI Janelia Farm)
• 2014 On-Disk Sort Record: Fastest Open Source Engine for sorting a PB
Machine Learning highlights
Source: Why you should use Spark for Machine Learning
Source: Toyota Customer 360 Insights on Apache Spark and MLlib
Performance
• Original batch job: 160 hours
• Same Job re-written using Apache Spark: 4 hours
ML task
• Prioritize incoming social media in real-time using Spark MLlib (differentiate campaign, feedback, product feedback, and noise)
• ML life cycle: Extract features and train:
• V1: 56% Accuracy -> V9: 82% Accuracy
• Remove False Positives and Semantic Analysis (similarity between concepts)
Example analysis:
Population vs. housing price
Links
Simplifying Machine Learning with Databricks Blog Post
Population vs. Price Multi-chart SparkSQL Notebook
Population vs. Price Linear Regression Python Notebook
Scatterplot
import numpy as np
import matplotlib.pyplot as plt

x = data.map(lambda p: p.features[0]).collect()
y = data.map(lambda p: p.label).collect()

from pandas import *
from ggplot import *

pydf = DataFrame({'pop': x, 'price': y})
p = ggplot(pydf, aes('pop', 'price')) + geom_point(color='blue')
display(p)
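The data variable above comes from earlier cells of the linked notebook. A minimal sketch of the shape it is assumed to have for the following slides: a DataFrame with a "features" vector column (population) and a "label" column (price). The column names follow the pyspark.ml defaults; the sample values are made up for illustration.

# Hypothetical sketch of the `data` DataFrame assumed by the slides that follow.
from pyspark.mllib.linalg import Vectors   # Spark 1.6-era vector type used by pyspark.ml
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                # sc: the notebook's SparkContext
pop_price = [(38000.0, 165000.0), (120000.0, 310000.0), (9500.0, 98000.0)]  # illustrative
data = sqlContext.createDataFrame(
    [(Vectors.dense([pop]), price) for (pop, price) in pop_price],
    ["features", "label"])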
Linear Regression with SGD
Define and Build Models
# Import LinearRegression class
from pyspark.ml.regression import LinearRegression
# Define LinearRegression model
lr = LinearRegression()
# Build two models
modelA = lr.fit(data, {lr.regParam: 0.0})
modelB = lr.fit(data, {lr.regParam: 100.0})
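To see what the two fits learned, the fitted models expose their parameters; a quick check, using attribute names from the Spark 1.6 pyspark.ml API:

# Inspect the fitted intercepts and coefficients of the two models.
print("ModelA: intercept = %s, coefficients = %s" % (modelA.intercept, modelA.coefficients))
print("ModelB: intercept = %s, coefficients = %s" % (modelB.intercept, modelB.coefficients))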
Linear Regression with SGD
Make Predictions
# Make predictions
predictionsA = modelA.transform(data)
display(predictionsA)
Linear Regression with SGD
Evaluate the Models
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="mse")
MSE = evaluator.evaluate(predictionsA)
print("ModelA: Mean Squared Error = " + str(MSE))
ModelA: Mean Squared Error = 16538.4813081
ModelB: Mean Squared Error = 16769.2917636
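The ModelB figure above comes from applying the same evaluator to modelB's predictions; a minimal sketch of that step, which the slide omits:

# Evaluate modelB the same way as modelA (not shown on the slide).
predictionsB = modelB.transform(data)
print("ModelB: Mean Squared Error = " + str(evaluator.evaluate(predictionsB)))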
Scatterplot with plotted Regression Models
p = ggplot(pydf, aes('pop', 'price')) + \
    geom_point(color='blue') + \
    geom_line(pydf, aes('pop', 'predA'), color='red') + \
    geom_line(pydf, aes('pop', 'predB'), color='green') + \
    scale_x_log10() + scale_y_log10()
display(p)
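The predA and predB columns referenced above are not built on the earlier slides. A hedged sketch of how they might be added to the local pandas DataFrame by collecting each model's predictions; "prediction" is the pyspark.ml default output column, and the collects are assumed to preserve row order:

# Pull each model's predictions back to the driver and attach them to pydf.
pydf['predA'] = [row.prediction for row in predictionsA.select("prediction").collect()]
pydf['predB'] = [row.prediction for row in predictionsB.select("prediction").collect()]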
Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.
References
• Apache Spark MLlib User Guide
• The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015.
http://arxiv.org/abs/1505.06807 (academic paper)
Combining the Strengths of MLlib, scikit-learn, & R
Great libraries → Business investment
• Education
• Tooling & workflows
Big Data
[Charts] Scaling (trees); topic model on 4.5 million Wikipedia articles; recommendation with 50 million users, 5 million songs, 50 billion ratings
Big Data & MLlib
• More data → higher accuracy
• Scale with business (# users, available data)
• Integrate with production systems
Bridging the gap
How do you get from a single-machine workload to a distributed one?
At school: Machine Learning with R on my laptop
The Goal: Machine Learning on a huge computing cluster
Wish list
• Run original code on a production environment
• Use distributed data sources
• Distribute ML workload piece by piece
• Use familiar algorithms & APIs
Our task
Sentiment analysis
Given a review (text), predict the user’s rating.
Data from https://snap.stanford.edu/data/web-Amazon.html
Our ML workflow
[Pipeline diagram] Text (“This scarf I bought is very strange. When I ...”, Label: Rating = 3.0) → Tokenizer → Words ([This, scarf, I, bought, ...]) → Hashing Term-Freq → Features ([2.0, 0.0, 3.0, ...]) → Linear Regression → Prediction (Rating = 2.7)
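A minimal pyspark.ml sketch of the pipeline in this diagram. The column names "review" and "rating" are illustrative; the demo notebook linked later contains the real version.

# Sketch of the workflow above as a pyspark.ml Pipeline (Spark 1.6-era API).
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.regression import LinearRegression

tokenizer = Tokenizer(inputCol="review", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LinearRegression(labelCol="rating")
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# model = pipeline.fit(trainingDF)        # trainingDF: DataFrame with "review", "rating"
# predictions = model.transform(testDF)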
Our ML workflow
[Diagram] Feature Extraction → Linear Regression, wrapped in Cross Validation over the regularization parameter: {0.0, 0.1, ...}
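A sketch of that cross-validation step with pyspark.ml, reusing the pipeline and lr from the previous sketch (an assumption, not the slide’s own code):

# Grid over the regularization parameter and pick the best model by cross-validation.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1, 1.0]).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(metricName="rmse"),
                    numFolds=3)
# cvModel = cv.fit(trainingDF)            # returns the pipeline fit with the best regParam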
Cross validation
[Diagram] Feature Extraction feeds Linear Regression #1, #2, #3, ...; Cross Validation selects the Best Linear Regression
Distribute cross validation
[Diagram] The same cross-validation workflow, with the Linear Regression #1, #2, #3, ... fits distributed across the cluster; Cross Validation selects the Best Linear Regression
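One way to distribute this step from Python is the spark-sklearn package used in the demo: its GridSearchCV mirrors scikit-learn’s API but spreads the parameter-grid evaluation across the cluster. A minimal sketch; the estimator and grid are illustrative, and sc is the notebook’s SparkContext:

# Distribute scikit-learn grid search / cross-validation with spark-sklearn.
from sklearn.linear_model import SGDRegressor
from spark_sklearn import GridSearchCV

param_grid = {'alpha': [0.0001, 0.001, 0.01]}
gs = GridSearchCV(sc, SGDRegressor(), param_grid=param_grid)
# gs.fit(X_local, y_local)                # X_local, y_local: local numpy arrays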
Repeating this at home
This demo used:
• Spark 1.6
• spark-sklearn (on Spark Packages) (on PyPi)
The notebook from the demo is available here:
• sklearn integration
• MLlib + sklearn: Distribute Everything!
The Amazon Reviews data20K and test4K datasets were created, and can be used within databricks-datasets, with permission from Professor Julian McAuley @ UCSD.
Source: Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.
Integrations we mentioned
Data sources
• Spark DataFrames: Conversions between pandas (local data) & Spark (distributed data) (see the sketch below)
• MLlib: Conversions between scipy & MLlib data types
Model selection / tuning
• spark-sklearn: Automatically distribute cross-validation
Python API
• MLlib: Distributed learning algorithms with familiar APIs
• spark-sklearn: Conversions between scikit-learn & MLlib models
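A minimal sketch of the pandas-to-Spark conversions in the first bullet (Spark 1.6-era API; the sample DataFrame is illustrative):

# Move a local pandas DataFrame to the cluster and back.
import pandas as pd
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                       # sc: the notebook's SparkContext
local_pdf = pd.DataFrame({'pop': [38000.0, 120000.0], 'price': [165000.0, 310000.0]})
spark_df = sqlContext.createDataFrame(local_pdf)  # pandas (local) -> Spark (distributed)
back_to_pandas = spark_df.toPandas()              # Spark (distributed) -> pandas (local)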
Integrations with R
DataFrames
• Conversions between R (local) & Spark (distributed)
• SQL queries from R
model <- glm(Sepal_Length ~ Sepal_Width + Species,
data = df, family = "gaussian")
head(filter(df, df$waiting < 50))
## eruptions waiting
##1 1.750 47
##2 1.750 47
##3 1.867 48
API for calling MLlib algorithms from R
• Linear & logistic regression supported in Spark 1.6
• More algorithms in development
Learning more about integrations
Python, pandas & scikit-learn
• spark-sklearn documentation and blog post
• Spark DataFrame Python API & pandas conversions
• Databricks Guide on using scikit-learn and other libraries with Spark
R
• Spark R API User Guide (DataFrames & ML)
• Databricks Guide: Spark R overview + docs & examples for each function
TensorFlow on Apache Spark (Deep Learning in Python)
• Blog post explaining how to run TensorFlow on top of Spark, with example code
MLlib roadmap highlights
Workflow
• Simplify building and customizing ML Pipelines.
Key models
• Improve inspection for generalized linear models (linear & logistic
regression).
Language APIs
• Support Pipeline persistence (saving & loading Pipelines and Models)
in the Python API.
Spark 2.0 Roadmap JIRA: https://issues.apache.org/jira/browse/SPARK-12626
More resources
• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: Free hosted Apache Spark.
Join the waitlist for the beta release!
Thanks!
