Machine Learning Pipelines - Joseph Bradley - Databricks

Practical Machine Learning
Pipelines with MLlib
Joseph K. Bradley
March 18, 2015
Spark Summit East 2015

About Spark MLlib
Started in UC Berkeley AMPLab
• Shipped with Spark 0.8
Currently (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Good coverage of algorithms
classification
regression
clustering
recommendation
feature extraction, selection
frequent itemsets
statistics
linear algebra

MLlib’s Mission
How can we move beyond this list of algorithms and help users develop real ML workflows?
MLlib’s mission is to make practical machine learning easy and scalable.
• Capable of learning from large-scale datasets
• Easy to build machine learning applications

Outline
ML workflows
Pipelines
Roadmap

Example: Text Classification
Goal: Given a text document, predict its
topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
Features: text, image, vector, ...
Label: CTR, inches of rainfall, ...
In this example, label 1: about science, label 0: not about science
Dataset: “20 Newsgroups”
From UCI KDD Archive

Training & Testing
Training
Given labeled data: RDD of (features, label)
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help...
Label 0
Subject: RIPEM FAQ
RIPEM is a program which
performs Privacy Enhanced...
Label 1
...
Learn a model.
Testing/Production
Given new unlabeled data: RDD of features
Subject: Apollo Training
The Apollo astronauts also
trained at (in) Meteor...
Label 1
Subject: A demo of Nonsense
How can you lie about
something that no one...
Label 0
Use model to make predictions.

Example ML Workflow
Training
Load data: labels + plain text
Extract features: labels + feature vectors
Train model: labels + predictions
Evaluate
Pain point
Create many RDDs
val labels: RDD[Double] =
  data.map(_.label)
val features: RDD[Vector]
val predictions: RDD[Double]
Explicitly unzip & zip RDDs
labels.zip(predictions).map { case (label, prediction) =>
  if (label == prediction) ...
}

Example ML Workflow
Pain point: Write as a script
• Not modular
• Difficult to re-use workflow
Training
Load data: labels + plain text
Extract features
Train model
Evaluate

Example ML Workflow
Training
Load data: labels + plain text
Extract features
Train model
Evaluate
Testing/Production
Load new data: plain text
Extract features: feature vectors
Predict using model: predictions
Act on predictions
Almost identical workflow

Example ML Workflow
Training
Load data: labels + plain text
Extract features
Train model
Evaluate
Pain point: Parameter tuning
• Key part of ML
• Involves training many models
  • For different splits of the data
  • For different sets of parameters

Pain Points
Create & handle many RDDs and data types
Write as a script
Tune parameters
Enter...
Pipelines! in Spark 1.2 & 1.3

Key Concepts
DataFrame: The ML Dataset
Abstractions: Transformers, Estimators, &
Evaluators
Parameters: API & tuning

DataFrame: RDD + schema + DSL
Named columns with types
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double
label | text          | words              | features
0     | This is ...   | [“This”, “is”, …]  | [0.5, 1.2, …]
0     | When we ...   | [“When”, ...]      | [1.9, -0.8, …]
1     | Knuth was ... | [“Knuth”, …]       | [0.0, 8.7, …]
0     | Or you ...    | [“Or”, “you”, …]   | [0.1, -0.6, …]

Domain-Specific Language
# Select science articles
sciDocs = data.filter(data["label"] == 1)
# Scale labels
data["label"] * 0.5
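For readers without a Spark session at hand, the two DSL operations above can be mimicked on a plain Python list of dictionaries. This is only a stand-in to show the semantics (row filtering and a column-wise expression); the `rows` data and all variable names are invented for illustration, and nothing here is Spark API.

```python
# Toy stand-in for a DataFrame: a list of rows with named columns.
# The rows are made up for illustration.
rows = [
    {"label": 0.0, "text": "This is ..."},
    {"label": 1.0, "text": "Knuth was ..."},
    {"label": 0.0, "text": "Or you ..."},
]

# Select science articles: keep the rows whose label equals 1.
sci_docs = [row for row in rows if row["label"] == 1.0]

# Scale labels: a column-wise expression evaluated for every row.
scaled_labels = [row["label"] * 0.5 for row in rows]

print(len(sci_docs))
print(scaled_labels)
```

In a real DataFrame the same expressions are declarative, so the engine is free to optimize and distribute them instead of looping row by row.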

• Shipped with Spark 1.3
• APIs for Python, Java & Scala (+R in dev)
• Integration with Spark SQL
• Data import/export
• Internal optimizations
Pain point: Create & handle many RDDs and data types
BIG data

Abstractions
Training
Train model
Evaluate
Load data
Extract features

Abstraction: Transformer
Training
Extract features
Train model
Evaluate
def transform(DataFrame): DataFrame
Input columns:
label: Double
text: String
Output columns:
label: Double
text: String
features: Vector
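The Transformer contract (dataset in, dataset with extra columns out) can be sketched in plain Python. This is a toy illustration, not the spark.ml classes: the list-of-dicts dataset, the class body, and the column-name arguments are all assumptions made for the example.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class Tokenizer:
    """Toy Transformer: adds a `words` column derived from `text`."""

    def __init__(self, input_col: str = "text", output_col: str = "words"):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows: List[Row]) -> List[Row]:
        # Return new rows; the input rows are left unmodified.
        return [
            {**row, self.output_col: row[self.input_col].lower().split()}
            for row in rows
        ]


data = [{"label": 1.0, "text": "Knuth was here"}]
out = Tokenizer().transform(data)
print(out[0]["words"])
```

Existing columns pass through untouched, which is what lets later stages still find `label` and `text`.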

Abstraction: Estimator
Training
Extract features
Train model
Evaluate
def fit(DataFrame): Model
Input columns:
label: Double
text: String
features: Vector
Output: LogisticRegression Model
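A minimal sketch of the Estimator contract, assuming a list-of-dicts dataset. To keep it short, logistic regression is swapped for a trivial majority-label learner; both class names are hypothetical and exist only to show that `fit` consumes data and returns a Model that can then transform new data.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


class MajorityLabelModel:
    """Toy Model: predicts the same label for every row."""

    def __init__(self, label: float):
        self.label = label

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**row, "prediction": self.label} for row in rows]


class MajorityLabelEstimator:
    """Toy Estimator: `fit` learns the most common label in the data."""

    def fit(self, rows: List[Row]) -> MajorityLabelModel:
        counts = Counter(row["label"] for row in rows)
        most_common_label, _ = counts.most_common(1)[0]
        return MajorityLabelModel(most_common_label)


train = [{"label": 0.0}, {"label": 0.0}, {"label": 1.0}]
model = MajorityLabelEstimator().fit(train)
predictions = model.transform([{"text": "new document"}])
print(predictions)
```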

Abstraction: Evaluator
Training
Extract features
Train model
Evaluate
def evaluate(DataFrame): Double
Input columns:
label: Double
text: String
features: Vector
prediction: Double
Metric:
accuracy
AUC
MSE
...
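Two of the metrics listed above, accuracy and MSE, can be written as functions from a scored dataset to a single number. This is a sketch under the assumption that each row is a dict with `label` and `prediction` entries; the function names are made up and are not the spark.ml evaluators.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def evaluate_accuracy(rows: List[Row]) -> float:
    """Fraction of rows whose prediction equals the label."""
    if not rows:
        raise ValueError("cannot evaluate an empty dataset")
    correct = sum(1 for row in rows if row["label"] == row["prediction"])
    return correct / len(rows)


def evaluate_mse(rows: List[Row]) -> float:
    """Mean squared difference between label and prediction."""
    if not rows:
        raise ValueError("cannot evaluate an empty dataset")
    return sum((row["label"] - row["prediction"]) ** 2 for row in rows) / len(rows)


scored = [
    {"label": 1.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 1.0},
    {"label": 0.0, "prediction": 0.0},
    {"label": 1.0, "prediction": 1.0},
]
accuracy = evaluate_accuracy(scored)
mse = evaluate_mse(scored)
print(accuracy, mse)
```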

Abstraction: Model
Model is a type of Transformer
Testing/Production
Extract features
Predict using model
Act on predictions
Input columns:
text: String
features: Vector
Output columns:
text: String
features: Vector
prediction: Double

(Recall) Abstraction: Estimator
Training
Train model
Evaluate
Load data
Extract features
label: Double
text: String
features: Vector
LogisticRegression
Model

Abstraction: Pipeline
Training
Load data
Extract features
Train model
Evaluate
Input columns:
label: Double
text: String
Output: PipelineModel
Pipeline is a type of Estimator
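How a Pipeline can act as an Estimator is easiest to see in a small sketch: fitting walks the stages in order, fits any stage that has a `fit` method, and collects the resulting Transformers into a PipelineModel. Everything below is a plain-Python toy under assumed names (`KeywordEstimator` and `KeywordModel` are hypothetical), not the spark.ml implementation.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class Tokenizer:
    """Toy Transformer: text -> words."""

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class KeywordModel:
    """Toy Model: predicts 1.0 if any learned keyword occurs in `words`."""

    def __init__(self, keywords: set):
        self.keywords = keywords

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**r, "prediction": 1.0 if self.keywords & set(r["words"]) else 0.0}
            for r in rows
        ]


class KeywordEstimator:
    """Toy Estimator: keeps words seen only in label-1 documents."""

    def fit(self, rows: List[Row]) -> KeywordModel:
        positive, negative = set(), set()
        for r in rows:
            (positive if r["label"] == 1.0 else negative).update(r["words"])
        return KeywordModel(positive - negative)


class PipelineModel:
    """A fitted pipeline: applies every stage's transform in order."""

    def __init__(self, stages: list):
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """An Estimator made of stages; `fit` returns a PipelineModel."""

    def __init__(self, stages: list):
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator becomes a Model
            rows = stage.transform(rows)  # feed the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"label": 1.0, "text": "Apollo astronauts trained"},
    {"label": 0.0, "text": "polish plastic scratches"},
]
pipeline_model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(train)
result = pipeline_model.transform(
    [{"text": "The Apollo program"}, {"text": "plastic polish"}]
)
print([r["prediction"] for r in result])
```

Because the fitted object bundles feature extraction with the model, the same steps run at training and at prediction time without being rewritten.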

Abstraction: PipelineModel
PipelineModel is a type of Transformer
Testing/Production
Load data
Extract features
Predict using model
Act on predictions
Input columns:
text: String
Output columns:
text: String
features: Vector
prediction: Double

Abstractions: Summary
Legend: DataFrame, Transformer, Estimator, Evaluator
Training
Load data
Extract features
Train model
Evaluate
Testing
Load data
Extract features
Predict using model
Evaluate

Demo
Legend: DataFrame, Transformer, Estimator, Evaluator
Training
Load data
Tokenizer
HashingTF
LogisticRegression
BinaryClassificationEvaluator
Current data schema
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double

Demo
Legend: DataFrame, Transformer, Estimator, Evaluator
Training
Load data
Tokenizer
HashingTF
LogisticRegression
Evaluator
Pain point: Write as a script

Parameters
Standard API
• Typed
• Defaults
• Built-in doc
• Autocomplete
> hashingTF.numFeatures
org.apache.spark.ml.param.IntParam =
numFeatures: number of features
(default: 262144)
> hashingTF.setNumFeatures(1000)
> hashingTF.getNumFeatures
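The ideas behind the Param API (a typed parameter object carrying its own default and documentation, plus setter and getter methods) can be imitated in a few lines of Python. This is a hedged sketch: `IntParam` and `ToyHashingTF` below are illustrative classes written for this example and only borrow the names and the default value shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntParam:
    """Toy typed parameter with built-in documentation and a default."""

    name: str
    doc: str
    default: int

    def __str__(self) -> str:
        return f"{self.name}: {self.doc} (default: {self.default})"


class ToyHashingTF:
    """Illustrative stage exposing one parameter; not the Spark class."""

    numFeatures = IntParam("numFeatures", "number of features", 262144)

    def __init__(self) -> None:
        self._values = {}  # explicitly set values, keyed by parameter name

    def setNumFeatures(self, value: int) -> "ToyHashingTF":
        # Typed: reject anything that is not a plain integer.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("numFeatures must be an int")
        self._values[self.numFeatures.name] = value
        return self  # returning self allows chained setters

    def getNumFeatures(self) -> int:
        # Fall back to the documented default when nothing was set.
        return self._values.get(self.numFeatures.name, self.numFeatures.default)


hashing_tf = ToyHashingTF()
default_value = hashing_tf.getNumFeatures()
hashing_tf.setNumFeatures(1000)
doc = str(ToyHashingTF.numFeatures)
print(doc)
```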

Parameter Tuning
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
Parameter grid
hashingTF.numFeatures: {100, 1000, 10000}
lr.regParam: {0.01, 0.1, 0.5}
CrossValidator
Tokenizer
HashingTF
LogisticRegression
Evaluator
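The search described above (an estimator, a parameter grid, and an evaluator, combined by cross-validation) can be sketched without Spark. The code below is an assumption-laden toy: `MidpointEstimator`, its `shift` parameter, and the tiny one-dimensional dataset are invented, and the fold assignment is a simple round-robin rather than whatever split CrossValidator uses internally.

```python
from itertools import product
from statistics import mean


def param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def k_fold_splits(rows, k):
    """Yield (train, validation); fold i validates on rows i, i+k, i+2k, ..."""
    for i in range(k):
        train = [row for j, row in enumerate(rows) if j % k != i]
        yield train, rows[i::k]


class MidpointEstimator:
    """Toy 1-D classifier with one tunable parameter, `shift` (hypothetical)."""

    def __init__(self, shift=0.0):
        self.shift = shift

    def fit(self, rows):
        mean0 = mean(r["x"] for r in rows if r["label"] == 0.0)
        mean1 = mean(r["x"] for r in rows if r["label"] == 1.0)
        threshold = (mean0 + mean1) / 2 + self.shift
        # The fitted "model" is a function that adds a prediction column.
        return lambda data: [
            {**r, "prediction": 1.0 if r["x"] > threshold else 0.0} for r in data
        ]


def accuracy(rows):
    return mean(1.0 if r["label"] == r["prediction"] else 0.0 for r in rows)


def cross_validate(make_estimator, grid, evaluate, rows, k):
    """Return (best_params, best_score), maximizing the mean metric over k folds."""
    results = []
    for params in param_grid(grid):
        scores = [
            evaluate(make_estimator(**params).fit(train)(validation))
            for train, validation in k_fold_splits(rows, k)
        ]
        results.append((mean(scores), params))
    best_score, best_params = max(results, key=lambda pair: pair[0])
    return best_params, best_score


xs = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
rows = [{"x": x, "label": 1.0 if x >= 10 else 0.0} for x in xs]
best_params, best_score = cross_validate(
    MidpointEstimator, {"shift": [-20.0, 0.0, 20.0]}, accuracy, rows, k=2
)
print(best_params, best_score)
```

With the two three-value grids from the slide, `param_grid` would produce nine combinations, each trained once per fold, which is why tuning dominates the cost of a workflow.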

Parameter Tuning
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
CrossValidator
Tokenizer
HashingTF
LogisticRegression
Evaluator
Pain point: Tune parameters

Pipelines: Recap
Pain points and what addresses them
Create & handle many RDDs and data types: DataFrame
Write as a script: Abstractions
Tune parameters: Parameter API
Also
• Python, Scala, Java APIs
• Schema validation
• User-Defined Types*
• Feature metadata*
• Multi-model training optimizations*
* Groundwork done; full support WIP.
Inspirations
scikit-learn
+ Spark DataFrame, Param API
MLBase (Berkeley AMPLab)
Ongoing collaborations

Roadmap
spark.mllib: Primary ML package
spark.ml: High-level Pipelines API for algorithms in spark.mllib
(experimental in Spark 1.2-1.3)
Near future
• Feature attributes
• Feature transformers
• More algorithms under Pipeline API
Farther ahead
• Ideas from AMPLab MLBase (auto-tuning models)
• SparkR integration

Thank you!
Outline
• ML workflows
• Pipelines
• DataFrame
• Abstractions
• Parameter tuning
• Roadmap
Spark documentation
http://spark.apache.org/
Pipelines blog post
https://databricks.com/blog/2015/01/07
