Machine Learning School in Seville
2nd edition
March 2020
ML Automation
jao@bigml.com
Idealized Machine Learning Workflows
Dr. Natalia Konstantinova (http://nkonst.com/machine-learning-explained-simple-words/)
Example workflow
Example workflow: Web UI
(Non) automation via Web UI
Strengths of Web UI
Simple: Just clicking around
Discoverable: Exploration and experimenting
Abstract: Transparent error handling and scalability
Problems of Web UI
Only simple: Simple tasks are simple; hard tasks quickly get hard
No automation or batch operations: Clicking humans don’t scale well
Abstracting over raw HTTP: bindings
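To see what the bindings abstract away, here is a rough sketch of a single dataset-creation call made against the raw REST API with plain HTTP, using Python's requests. The endpoint shape follows BigML's documented pattern, but treat the details (credentials, polling) as illustrative rather than authoritative:

from bigml.api import BigML  # not needed here; shown only for contrast with the raw call
import requests

# Sketch only: one raw HTTP call of the many a workflow needs.
# Authentication, polling, retries and error handling are all on you.
BIGML_AUTH = "username=alfred;api_key=XXXXXXXXXXXXXXXXXXXXXXXX"  # placeholder credentials

resp = requests.post("https://bigml.io/dataset?" + BIGML_AUTH,
                     json={"source": "source/5643d345f43a234ff2310a3e"})
resp.raise_for_status()
dataset_id = resp.json()["resource"]   # e.g. "dataset/..."
# ...and now poll GET https://bigml.io/<dataset_id>?<BIGML_AUTH>
# until the dataset's status says it is finished.

The bindings wrap exactly this kind of boilerplate behind create_dataset and api.ok.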
Example workflow
Example workflow: Python bindings
from bigml.api import BigML
api = BigML()
source = 'source/5643d345f43a234ff2310a3e'
dataset = api.create_dataset(source)
api.ok(dataset)
r, s = 0.8, "seed"
train_dataset = api.create_dataset(dataset, {"rate": r, "seed": s})
test_dataset = api.create_dataset(dataset, {"rate": r, "seed": s,
"out_of_bag": True})
api.ok(train_dataset)
model = api.create_model(train_dataset)
api.ok(model)
api.ok(test_dataset)
evaluation = api.create_evaluation(model, test_dataset)
api.ok(evaluation)
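Even this small workflow glosses over failure: every api.ok call is assumed to succeed. A minimal sketch of the kind of check production code would need, assuming (as in the Python bindings) that api.ok returns a falsy value when the resource ends up in an error state:

# Sketch: fail fast instead of silently feeding a faulty model downstream.
if not api.ok(model):
    raise RuntimeError("Model creation failed: %s" % model.get("resource"))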
Automation via bindings
Is this production code?
How do we generalize to, say, 100 datasets?
Example workflow: Python bindings
# Now do it 100 times, serially
model, evaluation = [], []
for i in range(0, 100):
    r, s = 0.8, i
    train = api.create_dataset(dataset, {"rate": r, "seed": s})
    test = api.create_dataset(dataset, {"rate": r, "seed": s,
                                        "out_of_bag": True})
    api.ok(train)
    model.append(api.create_model(train))
    api.ok(model[i])
    api.ok(test)
    evaluation.append(api.create_evaluation(model[i], test))
    api.ok(evaluation[i])
Example workflow: Python bindings
# More efficient if we parallelize, but at what level?
train, test, model, evaluation = [], [], [], []
for i in range(0, 100):
    r, s = 0.8, i
    train.append(api.create_dataset(dataset, {"rate": r, "seed": s}))
    test.append(api.create_dataset(dataset, {"rate": r, "seed": s,
                                             "out_of_bag": True}))
    # Do we wait here?
    api.ok(train[i])
    api.ok(test[i])
for i in range(0, 100):
    model.append(api.create_model(train[i]))
    api.ok(model[i])
for i in range(0, 100):
    evaluation.append(api.create_evaluation(model[i], test[i]))
    api.ok(evaluation[i])
Example workflow: Python bindings
# More efficient if we parallelize, but at what level?
train, test, model, evaluation = [], [], [], []
for i in range(0, 100):
    r, s = 0.8, i
    train.append(api.create_dataset(dataset, {"rate": r, "seed": s}))
    test.append(api.create_dataset(dataset, {"rate": r, "seed": s,
                                             "out_of_bag": True}))
for i in range(0, 100):
    # Or do we wait here?
    api.ok(train[i])
    model.append(api.create_model(train[i]))
for i in range(0, 100):
    # and here?
    api.ok(model[i])
    api.ok(test[i])
    evaluation.append(api.create_evaluation(model[i], test[i]))
    api.ok(evaluation[i])
Example workflow: Python bindings
# More efficient if we parallelize, but how do we handle errors??
train, test, model, evaluation = [], [], [], []
for i in range(0, 100):
    r, s = 0.8, i
    train.append(api.create_dataset(dataset, {"rate": r, "seed": s}))
    test.append(api.create_dataset(dataset, {"rate": r, "seed": s,
                                             "out_of_bag": True}))
for i in range(0, 100):
    api.ok(train[i])
    model.append(api.create_model(train[i]))
for i in range(0, 100):
    try:
        api.ok(model[i])
        api.ok(test[i])
        evaluation.append(api.create_evaluation(model[i], test[i]))
        api.ok(evaluation[i])
    except Exception:
        # How to recover if test[i] has failed? New datasets? Abort?
        pass
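There is no single right answer here, and whatever policy we pick has to live in client code. A sketch of one possible recovery policy (re-create a failed split once, then give up), just to show how quickly the bookkeeping grows; the helper below is hypothetical, not part of the bindings:

# Hypothetical helper: retry the creation of a resource once before aborting.
def ok_or_recreate(api, resource, create_fn, retries=1):
    for _ in range(retries + 1):
        if api.ok(resource):        # wait; False means the resource faulted
            return resource
        resource = create_fn()      # try to build a replacement
    raise RuntimeError("Gave up on %s" % resource.get("resource"))

# Usage inside the evaluation loop:
# test[i] = ok_or_recreate(api, test[i],
#                          lambda: api.create_dataset(dataset,
#                                                     {"rate": r, "seed": i,
#                                                      "out_of_bag": True}))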
Client-side Machine Learning Automation
Problems of bindings-based, client solutions
Complexity: Lots of details outside the problem domain
Reuse: No inter-language compatibility
Scalability: Client-side workflows are hard to optimize
Reproducibility: A noisy, complex, and hard-to-audit development environment
Not enough abstraction
Machine Learning Workflows: the iceberg’s tip
A partial solution: CLI declarative tools
# "1-click" ensemble
bigmler --train data/iris.csv \
        --number-of-models 500 \
        --sample-rate 0.85 \
        --output-dir output/iris-ensemble \
        --project "ML Workshop"
# "1-click" dataset with parameterized fields
bigmler --train data/diabetes.csv \
        --no-model \
        --name "4-featured diabetes" \
        --dataset-fields \
          "plasma glucose,insulin,diabetes pedigree,diabetes" \
        --output-dir output/diabetes \
        --project "ML Workshop"
Not-so-easy: cross-validation
But not that bad
# 3-fold cross-validation over a parameterized input dataset
bigmler analyze --cross-validation \
                --dataset $(cat output/diabetes/dataset) \
                --k-folds 3 \
                --output-dir output/diabetes-validation
Machine Learning Workflows: the iceberg’s tip
Jeannine Takaki, Microsoft Azure Team
Client-side Machine Learning Automation
Problems of client-side solutions
Complex: Too fine-grained, leaky abstractions
Cumbersome: Error handling, network issues
Hard to reuse: Tied to a single programming language
Hard to scale: Parallelization is again a problem
Hard to generalize: Declarative client tools hide complexity at the cost of flexibility
Hard to combine: Black-box tools cannot easily be integrated as parts of bigger client-side workflows
Hard to audit: Client-side development environments are complex and very hard to sandbox
Not enough automation, not enough abstraction: the algorithmic complexity and computing-resource management problems that had mostly been washed away are back!
Workflows galore
Machine Learning Automation
Solution (scalability, reuse): Back to the server
Solution (complexity, reuse): Domain-specific languages
venturebeat.com
In a Nutshell
1. Workflows reified as server-side, RESTful resources
2. Domain-specific language for ML workflow automation
Back to the server
Workflows as RESTful Resources
Library: A reusable building block: a collection of WhizzML definitions that can be imported by other libraries or scripts.
Script: Executable code that describes an actual workflow.
• Imports: List of libraries with code used by the script.
• Inputs: List of input values that parameterize the workflow.
• Outputs: List of values computed by the script and returned to the user.
Execution: Given a script and a complete set of inputs, the workflow can be executed and its outputs generated.
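Since scripts are ordinary REST resources, they can be created from the same Python bindings. A minimal sketch; the one-line WhizzML source and the input declaration are illustrative, so check the WhizzML reference for the exact metadata fields:

from bigml.api import BigML

api = BigML()
# A one-expression WhizzML script: take a dataset id, return a model id.
whizzml_source = "(create-model input-dataset)"
script = api.create_script(whizzml_source,
                           {"inputs": [{"name": "input-dataset",
                                        "type": "dataset-id"}]})
api.ok(script)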
Server-side Workflows: the bazaar
Metaprogramming in reflective DSLs: Scriptify
Resources that create
resources that create
resources that create
resources that create
resources that create
resources that create
. . .
Example workflow: Python bindings
from bigml.api import BigML
api = BigML()
source = 'source/5643d345f43a234ff2310a3e'
dataset = api.create_dataset(source)
api.ok(dataset)
r, s = 0.8, "seed"
train_dataset = api.create_dataset(dataset, {"rate": r, "seed": s})
test_dataset = api.create_dataset(dataset, {"rate": r, "seed": s,
"out_of_bag": True})
api.ok(train_dataset)
model = api.create_model(train_dataset)
api.ok(model)
api.ok(test_dataset)
evaluation = api.create_evaluation(model, test_dataset)
api.ok(evaluation)
Syntactic Abstraction: Simple workflow
;; ML artifacts are first-class citizens,
;; we only need to talk about our domain
(let ([train-id test-id] (create-dataset-split id 0.8)
model-id (create-model train-id))
(create-evaluation test-id
model-id
{"name" "Evaluation 80/20"
"missing_strategy" 0}))
Ready for production!
Scalability: Trivial parallelization
;; Workflow for 1 resource
(let ([train-id test-id] (create-dataset-split id 0.8)
model-id (create-model train-id))
(create-evaluation test-id model-id))
Scalability: Trivial parallelization
;; Workflow for arbitrary number of resources
(let (splits (for (id input-datasets)
(create-dataset-split id 0.8)))
(for (s splits)
(create-evaluation (s 1) (create-model (s 0)))))
Ready for production!
Scalability: Trivial parallelization
from bigml.api import BigML
api = BigML()
# choose workflow
script = 'script/567b4b5be3f2a123a690ff56'
# define parameters
inputs = {'input-dataset': 'dataset/5643d345f43a234ff2310a30'}
# execute
api.ok(api.create_execution(script, inputs))
Scalability: Trivial parallelization
from bigml.api import BigML
api = BigML()
# choose workflow
script = 'script/567b4b5be3f2a123a690de1228'
# define parameters
inputs = {'input-datasets': ['dataset/5643d345f43a234ff2310a30',
'dataset/5643d345f43a234ff2310a31',
'dataset/5643d345f43a234ff2310a32',
...]}
# execute
api.ok(api.create_execution(script, inputs))
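Once the execution finishes, its outputs come back as part of the execution resource itself, so the client only ever handles one object. A sketch of reading them back; the exact location of the outputs inside the resource document follows BigML's execution schema, so treat the key names as an assumption to verify against the API docs:

execution = api.create_execution(script, inputs)
api.ok(execution)                         # wait for the whole workflow to finish
execution = api.get_execution(execution)  # refresh the resource document
# Assumed layout: outputs live under object -> execution -> outputs,
# as a list of [name, value, ...] entries.
outputs = execution["object"]["execution"].get("outputs", [])
for out in outputs:
    print(out[0], "=", out[1])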
Example: Stacked Generalization
Objective: Improve predictions by modeling the output scores
of multiple trained models.
• Create a training and a holdout set
• Create n different models on the training set (with some
difference among them; e.g., single-tree vs. ensemble vs.
logistic regression)
• Make predictions from those models on the holdout set
• Train a model to predict the class based on the other
models’ predictions
Example: Stacked Generalization
(define [train-id hold-id]
(create-random-dataset-split dataset-id 0.5))
(define models
(create* ["model" "ensemble" "logisticregression"]
{"dataset" train-id}
{"dataset" train-id "number_of_models" 20}
{"dataset" train-id}))
Example: Stacked Generalization
(define (add-prediction-column dataset model)
  (let (bp (create-and-wait-batch-prediction dataset model))
    ((fetch bp) "output_dataset_resource")))
(define pred-dataset
(reduce add-prediction-column hold-id models))
Example: Stacked Generalization
(define meta-model
(create-model pred-dataset {"excluded_fields"
(input-fields dataset-id)}))
Example: Stacked Generalization
(define [train-id hold-id]
(create-random-dataset-split input-id 0.5))
(define models
(create* ["model" "ensemble" "logisticregression"]
{"dataset" train-id}
{"dataset" train-id "number_of_models" 20}
{"dataset" train-id}))
(define (add-prediction-column dataset model)
  (let (bp (create-and-wait-batch-prediction dataset model))
    ((fetch bp) "output_dataset_resource")))
(define ds (reduce add-prediction-column hold-id models))
(define meta-model
  (create-model ds {"excluded_fields" (input-fields input-id)}))
Example: Stacked Generalization
(define [models meta-model] (read-result execution-id))
(define predictions
(for (model models)
(create-prediction model input-data)))
(define prediction-values
(for (p predictions) (prediction-value p)))
(create-prediction {"model" meta-model
"input_data" prediction-values})
Are we there yet?
Instead of coding up “do this, then this, then this, then ...” you can say, “try to get a good score on these data.” In other words, “here’s what I like, let me know when one of your monkeys on a typewriter gets there.”
Cassie Kozyrkov
• Automatic model selection – OptiML
• More declarative DSLs – Working on it!
• Automatic feature engineering – 80% of an ML project