Practical Machine Learning
Pipelines with MLlib
Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in UC Berkeley AMPLab
• Shipped with Spark 0.8
Currently (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Good coverage of algorithms:
  classification
  regression
  clustering
  recommendation
  feature extraction, selection
  frequent itemsets
  statistics
  linear algebra
MLlib's Mission
How can we move beyond this list of algorithms and help users develop real ML workflows?
MLlib's mission is to make practical machine learning easy and scalable.
• Capable of learning from large-scale datasets
• Easy to build machine learning applications
Outline
ML workflows
Pipelines
Roadmap
Example: Text Classification
Goal: Given a text document, predict its topic.
Features (text, image, vector, ...):
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something...
Label (CTR, inches of rainfall, ...):
1: about science
0: not about science
Dataset: “20 Newsgroups”, from the UCI KDD Archive
Training & Testing
Training
Given labeled data: RDD of (features, label)
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic polish. It will help...
Label 0
Subject: RIPEM FAQ
RIPEM is a program which performs Privacy Enhanced...
Label 1
...
Learn a model.
Testing/Production
Given new unlabeled data: RDD of features
Subject: Apollo Training
The Apollo astronauts also trained at (in) Meteor...
Label 1
Subject: A demo of Nonsense
How can you lie about something that no one...
Label 0
Use model to make predictions.
Example ML Workflow
Training
Load data (labels + plain text)
Extract features (labels + feature vectors)
Train model (labels + predictions)
Evaluate
Pain point: Create many RDDs
val labels: RDD[Double] = data.map(_.label)
val features: RDD[Vector]
val predictions: RDD[Double]
Pain point: Explicitly unzip & zip RDDs
labels.zip(predictions).map { case (label, prediction) =>
  if (label == prediction) ...
}
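The bookkeeping this slide describes can be mimicked in plain Python. This is only a toy sketch: lists stand in for RDDs, and the records and the threshold "model" are made up for illustration.

```python
# Toy stand-in for the RDD workflow: plain lists instead of RDDs.
# The records and the threshold "model" are made-up illustrations.
data = [
    {"label": 1.0, "score": 0.9},
    {"label": 0.0, "score": 0.2},
    {"label": 1.0, "score": 0.4},
]

# "Unzip": every intermediate result lives in its own collection.
labels = [row["label"] for row in data]
scores = [row["score"] for row in data]
predictions = [1.0 if s > 0.5 else 0.0 for s in scores]

# "Zip": re-pair labels with predictions by position to evaluate.
correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
accuracy = correct / len(labels)
```

Nothing ties `labels`, `scores`, and `predictions` together except their ordering, which is the fragility the slide points at.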
Example ML Workflow
Training
Load data (labels + plain text)
Extract features (labels + feature vectors)
Train model (labels + predictions)
Evaluate
Pain point: Write as a script
• Not modular
• Difficult to re-use workflow
Example ML Workflow
Training
Load data (labels + plain text)
Extract features (labels + feature vectors)
Train model (labels + predictions)
Evaluate
Testing/Production
Load new data (plain text)
Extract features (feature vectors)
Predict using model (predictions)
Act on predictions
Almost identical workflow
Example ML Workflow
Training
Load data (labels + plain text)
Extract features (labels + feature vectors)
Train model (labels + predictions)
Evaluate
Pain point: Parameter tuning
• Key part of ML
• Involves training many models
  • For different splits of the data
  • For different sets of parameters
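To see why tuning multiplies the work, here is a rough sketch in plain Python of how data splits and parameter sets combine. The helper `k_fold_splits`, the row count, and the fold count are assumptions chosen for illustration; the `regParam` values are the ones that appear later in the deck.

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size


param_sets = [{"regParam": r} for r in (0.01, 0.1, 0.5)]
splits = list(k_fold_splits(n_rows=10, k=3))

# One model is trained per (split, parameter set) pair.
models_to_train = len(splits) * len(param_sets)
```

Even this small setup trains nine models, and each of them needs the full load, extract, train, and evaluate sequence.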
Pain Points
Create & handle many RDDs and data types
Write as a script
Tune parameters
Enter...
Pipelines! in Spark 1.2 & 1.3
Key Concepts
DataFrame: The ML Dataset
Abstractions: Transformers, Estimators, & Evaluators
Parameters: API & tuning
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double

label  text           words               features
0      This is ...    [“This”, “is”, …]   [0.5, 1.2, …]
0      When we ...    [“When”, ...]       [1.9, -0.8, …]
1      Knuth was ...  [“Knuth”, …]        [0.0, 8.7, …]
0      Or you ...     [“Or”, “you”, …]    [0.1, -0.6, …]
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
Domain-Specific Language
# Select science articles
sciDocs = data.filter(data["label"] == 1)
# Scale labels
data["label"] * 0.5
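One way to picture how a column DSL like this can work is operator overloading: comparing or scaling a column object builds an expression instead of computing a value. The sketch below is a toy in plain Python, not Spark's implementation; `Column` and `ToyFrame` are hypothetical names, and rows are plain dictionaries.

```python
class Column:
    """Toy column reference; comparisons and arithmetic build row functions."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Returns a predicate over rows rather than a plain bool.
        return lambda row: row[self.name] == value

    def __mul__(self, factor):
        # Returns a function that computes the scaled value for a row.
        return lambda row: row[self.name] * factor


class ToyFrame:
    """Toy dataset: a list of dict rows with named columns."""

    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, name):
        return Column(name)

    def filter(self, predicate):
        return ToyFrame([r for r in self.rows if predicate(r)])

    def select(self, expr):
        return [expr(r) for r in self.rows]


data = ToyFrame([
    {"label": 1, "text": "Knuth was ..."},
    {"label": 0, "text": "This is ..."},
])
sci_docs = data.filter(data["label"] == 1)   # select science articles
scaled = data.select(data["label"] * 0.5)    # scale labels
```

The design point is that the expression must start from a column object. A bare string comparison such as `"label" == 1` is evaluated by Python immediately and yields `False`, so it never reaches the filter as a condition.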
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
• Shipped with Spark 1.3
• APIs for Python, Java & Scala (+R in dev)
• Integration with Spark SQL
• Data import/export
• Internal optimizations
Named columns with types
Domain-Specific Language
Pain point: Create & handle many RDDs and data types
BIG data
Abstractions
Training
Load data
Extract features
Train model
Evaluate
Abstraction: Transformer
Training
Extract features
Train model
Evaluate
def transform(DataFrame): DataFrame
Input schema: label: Double, text: String
Output schema: label: Double, text: String, features: Vector
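The Transformer contract, data in and data out with a column added, can be sketched in a few lines of plain Python. This is a toy under stated assumptions: rows are dictionaries rather than a DataFrame, and `ToyTokenizer` is a hypothetical class, not Spark's `Tokenizer`.

```python
class ToyTokenizer:
    """Toy Transformer: rows in, rows out, with one appended column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Existing columns are kept; the new column is appended.
        return [
            {**row, self.output_col: row[self.input_col].lower().split()}
            for row in rows
        ]


rows = [{"label": 0.0, "text": "This is a test"}]
out = ToyTokenizer("text", "words").transform(rows)
```

The input rows are left untouched, which mirrors the append-only behaviour described in the editor's notes for this slide.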
Abstraction: Estimator
Training
Extract features
Train model
Evaluate
def fit(DataFrame): Model
Input schema: label: Double, text: String, features: Vector
LogisticRegression
Model
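As a rough illustration of the Estimator contract, the sketch below fits a one-dimensional threshold classifier in plain Python. It is not logistic regression and not Spark code; `ToyThresholdEstimator`, `ToyThresholdModel`, and the single `feature` column are assumptions made to keep the example short.

```python
class ToyThresholdModel:
    """Fitted model: a Transformer that appends a 'prediction' column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            {**row, "prediction": 1.0 if row["feature"] > self.threshold else 0.0}
            for row in rows
        ]


class ToyThresholdEstimator:
    """Toy Estimator: fit() learns a threshold and returns a model."""

    def fit(self, rows):
        pos = [r["feature"] for r in rows if r["label"] == 1.0]
        neg = [r["feature"] for r in rows if r["label"] == 0.0]
        # Midpoint between the two class means (assumes both classes occur).
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return ToyThresholdModel(threshold)


train = [
    {"label": 1.0, "feature": 4.0},
    {"label": 1.0, "feature": 6.0},
    {"label": 0.0, "feature": 1.0},
    {"label": 0.0, "feature": 3.0},
]
model = ToyThresholdEstimator().fit(train)
predicted = model.transform([{"feature": 5.0}, {"feature": 2.0}])
```

The estimator holds no learned state itself; everything learned lives in the returned model, which can then be applied to unlabeled rows.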
Abstraction: Evaluator
Training
Extract features
Train model
Evaluate
def evaluate(DataFrame): Double
Input schema: label: Double, text: String, features: Vector, prediction: Double
Metric: accuracy, AUC, MSE, ...
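Two of the metrics named on this slide are simple enough to sketch directly. The functions below are a plain-Python illustration over dictionary rows with `label` and `prediction` columns; the function names and sample rows are assumptions, not Spark's evaluator classes.

```python
def evaluate_accuracy(rows):
    """Fraction of rows whose prediction equals the label."""
    return sum(1 for r in rows if r["prediction"] == r["label"]) / len(rows)


def evaluate_mse(rows):
    """Mean squared error between prediction and label."""
    return sum((r["prediction"] - r["label"]) ** 2 for r in rows) / len(rows)


scored = [
    {"label": 1.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 0.0},
    {"label": 1.0, "prediction": 1.0},
]
accuracy = evaluate_accuracy(scored)
mse = evaluate_mse(scored)
```

Each evaluator reduces a scored dataset to a single number, which is what lets a tuning loop compare candidate models.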
Abstraction: Model
Model is a type of Transformer
def transform(DataFrame): DataFrame
Testing/Production
Extract features
Predict using model
Act on predictions
Input schema: text: String, features: Vector
Output schema: text: String, features: Vector, prediction: Double
(Recall) Abstraction: Estimator
Training
Load data
Extract features
Train model
Evaluate
def fit(DataFrame): Model
Input schema: label: Double, text: String, features: Vector
LogisticRegression
Model
Abstraction: Pipeline
Pipeline is a type of Estimator
def fit(DataFrame): Model
Training
Load data
Extract features
Train model
Evaluate
Input schema: label: Double, text: String
PipelineModel
Abstraction: PipelineModel
PipelineModel is a type of Transformer
def transform(DataFrame): DataFrame
Testing/Production
Load data
Extract features
Predict using model
Act on predictions
Input schema: text: String
Output schema: text: String, features: Vector, prediction: Double
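The relationship between Pipeline and PipelineModel can be sketched as follows. This is a toy in plain Python under stated assumptions: rows are dictionaries, a stage counts as an Estimator if it has a `fit` method, and `WordCount`, `MeanThreshold`, and `ThresholdModel` are hypothetical stages. The threshold stage ignores labels for brevity.

```python
class ToyPipelineModel:
    """Fitted pipeline: a Transformer that applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Toy Estimator over an ordered list of stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced so far;
            # plain Transformers are used as they are.
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


class WordCount:
    """Transformer: appends 'feature' = number of words in 'text'."""

    def transform(self, rows):
        return [{**r, "feature": float(len(r["text"].split()))} for r in rows]


class ThresholdModel:
    """Transformer: appends 'prediction' by comparing 'feature' to a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            {**r, "prediction": 1.0 if r["feature"] > self.threshold else 0.0}
            for r in rows
        ]


class MeanThreshold:
    """Estimator: learns the mean feature value as the threshold."""

    def fit(self, rows):
        return ThresholdModel(sum(r["feature"] for r in rows) / len(rows))


train = [{"label": 0.0, "text": "a b"}, {"label": 1.0, "text": "a b c d"}]
model = ToyPipeline([WordCount(), MeanThreshold()]).fit(train)
out = model.transform([{"text": "one two three four five"}])
```

Because the fitted pipeline carries its feature extraction with it, production code only calls `transform` on raw text and cannot drift from the training-time workflow.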
Abstractions: Summary
Legend: DataFrame, Transformer, Estimator, Evaluator
Training
Load data
Extract features
Train model
Evaluate
Testing
Load data
Extract features
Predict using model
Evaluate
Demo
Legend: DataFrame, Transformer, Estimator, Evaluator
Training
Load data
Tokenizer
HashingTF
LogisticRegression
BinaryClassificationEvaluator
Current data schema
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double
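The Tokenizer and HashingTF stages of the demo turn text into a fixed-length vector. A minimal sketch of that idea in plain Python follows. It is not Spark's implementation: `tokenize` and `hashing_tf` are hypothetical helpers, and `zlib.crc32` is used only because it is deterministic across runs, not because Spark hashes this way.

```python
import zlib


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features):
    """Map words to a fixed-length term-frequency vector via hashing."""
    vector = [0.0] * num_features
    for word in words:
        # Each word lands in a bucket chosen by its hash; different
        # words may collide in the same bucket.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


words = tokenize("Knuth was here and Knuth wrote")
features = hashing_tf(words, num_features=16)
```

The vector length is fixed in advance by `num_features`, so no vocabulary has to be built or stored, at the cost of occasional collisions.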
Demo
Legend: DataFrame, Transformer, Estimator, Evaluator
Training
Load data
Tokenizer
HashingTF
LogisticRegression
BinaryClassificationEvaluator
Pain point: Write as a script
Parameters
Standard API
• Typed
• Defaults
• Built-in doc
• Autocomplete
> hashingTF.numFeatures
org.apache.spark.ml.param.IntParam = numFeatures: number of features (default: 262144)
> hashingTF.setNumFeatures(1000)
> hashingTF.getNumFeatures
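The properties listed on this slide (typed, with defaults and built-in documentation) can be imitated with a small parameter object. The sketch below is plain Python and only loosely modelled on the shell session above; `Param`, `ToyHashingTF`, and the snake_case method names are assumptions, not the `org.apache.spark.ml.param` API.

```python
class Param:
    """Toy typed parameter with a default value and a doc string."""

    def __init__(self, name, type_, default, doc):
        self.name = name
        self.type_ = type_
        self.default = default
        self.doc = doc

    def __repr__(self):
        return f"{self.name}: {self.doc} (default: {self.default})"


class ToyHashingTF:
    # Declared once on the class, so the doc and default are discoverable.
    num_features = Param("numFeatures", int, 262144, "number of features")

    def __init__(self):
        self._values = {}

    def set_num_features(self, value):
        if not isinstance(value, self.num_features.type_):
            raise TypeError("numFeatures expects an int")
        self._values["numFeatures"] = value
        return self  # returning self allows chained setters

    def get_num_features(self):
        return self._values.get("numFeatures", self.num_features.default)


hashing_tf = ToyHashingTF()
default_value = hashing_tf.get_num_features()
updated_value = hashing_tf.set_num_features(1000).get_num_features()
```

Keeping parameters as first-class objects, rather than bare constructor arguments, is what lets a tuning tool enumerate and override them generically.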
Parameter Tuning
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
lr.regParam: {0.01, 0.1, 0.5}
hashingTF.numFeatures: {100, 1000, 10000}
CrossValidator
Tokenizer
HashingTF
LogisticRegression
BinaryClassificationEvaluator
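The grid on this slide expands to every combination of the two parameters. A short plain-Python sketch of that expansion and of picking a winner is below. The parameter names and values come from the slide; `expand_grid`, `select_best`, and especially `placeholder_score` are hypothetical, with the score standing in for "fit on the training folds, then evaluate on the held-out fold".

```python
from itertools import product

param_grid = {
    "lr.regParam": [0.01, 0.1, 0.5],
    "hashingTF.numFeatures": [100, 1000, 10000],
}


def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


def select_best(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


def placeholder_score(params):
    # Made-up scoring rule, used only so the example runs end to end.
    return (
        -abs(params["lr.regParam"] - 0.1)
        - abs(params["hashingTF.numFeatures"] - 1000) / 1e4
    )


candidates = expand_grid(param_grid)
best = select_best(candidates, placeholder_score)
```

With three values per parameter the grid holds nine candidates, and a cross-validating search fits each of them once per fold.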
Parameter Tuning
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
CrossValidator
Tokenizer
HashingTF
LogisticRegression
BinaryClassificationEvaluator
Pain point: Tune parameters
Pipelines: Recap
Inspirations
scikit-learn
+ Spark DataFrame, Param API
MLBase (Berkeley AMPLab)
Ongoing collaborations
Create & handle many RDDs and data types → DataFrame
Write as a script → Abstractions
Tune parameters → Parameter API
Also
• Python, Scala, Java APIs
• Schema validation
• User-Defined Types*
• Feature metadata*
• Multi-model training optimizations*
* Groundwork done; full support WIP.
Roadmap
spark.mllib: Primary ML package
spark.ml: High-level Pipelines API for algorithms in spark.mllib (experimental in Spark 1.2-1.3)
Near future
• Feature attributes
• Feature transformers
• More algorithms under Pipeline API
Farther ahead
• Ideas from AMPLab MLBase (auto-tuning models)
• SparkR integration
Thank you!
Outline
• ML workflows
• Pipelines
• DataFrame
• Abstractions
• Parameter tuning
• Roadmap
Spark documentation
http://spark.apache.org/
Pipelines blog post
https://databricks.com/blog/2015/01/07

Machine Learning Pipelines - Joseph Bradley - Databricks


Editor's Notes

  • #3: Contributions estimated from GitHub commit logs, with some effort to de-duplicate entities.
  • #7: Dataset source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html *Data from UCI KDD Archive, originally donated to archive by Tom Mitchell (CMU).
  • #9: Handling multiple RDDs and data types: split loaded data into featuresRDD and labelsRDD; transform featuresRDD → words RDD → feature vector RDD; zip labels with feature vectors to create the final RDD.
  • #10: It is possible for programmers to abstract workflows by putting their workflows into methods or callable scripts. However, that makes it hard to do exploratory work or do rapid iterative tweaking-and-testing of workflows.
  • #11: Feature extraction in particular can be long and complicated, and using the same workflow is vital.
  • #12: Parameter tuning is doable in essentially any ML library, but it is often done by hand and involves a lot of repetitive but difficult-to-abstract code.
  • #13: This API shipped with Spark 1.2 and 1.3 but is still experimental.
  • #16: SchemaRDD + DSL (SchemaRDD is now called DataFrame, mentioned in Michael's talk earlier in the day). Introduced in Spark 1.3. Integrates with pandas dataframes. Catalyst optimizer handles column materialization. Other: built-in data sources & 3rd-party extensions; optimizations & codegen; APIs for Python, Java & Scala (+R in dev).
  • #20: Columns which are not needed do not need to be materialized, so there is almost no penalty for keeping the columns around for later use. Default transformer behavior is to append columns.
  • #33: Easy for users to create their own Transformers and Estimators to plug into Pipelines.