PRETZEL: Opening the Black Box
of Machine Learning Prediction Serving Systems
Yunseong Lee, Alberto Scolari, Matteo Interlandi, Markus Weimer, Marco D. Santambrogio, Byung-Gon Chun
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it
UC Berkeley, 05/23/2018
ML-as-a-Service
• ML models are deployed on cloud platforms as black boxes
• Users deploy multiple models per machine (10-100s)
• Deployed models are often similar
  - Similar structure
  - Similar state
• Two key requirements:
  1. Performance: latency or throughput
  2. Model density: number of models per machine
(Example: the input "This is a good product" is scored against labels {good: 0, bad: 1, …})
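To make the black-box setting concrete, here is a minimal sketch (hypothetical code, not ML.NET or PRETZEL) of what one such deployed sentiment pipeline looks like; every name here (`tokenize`, `featurize`, the vocabulary) is illustrative. Note that each deployed copy carries its own featurizer state and weights, even when models are similar:

```python
# Minimal sketch of a black-box sentiment pipeline. Each deployed model
# holds its own copy of the vocabulary and weights, duplicated per model.

def tokenize(text):
    return text.lower().split()

def featurize(tokens, vocabulary):
    # Bag-of-words vector over a per-model vocabulary.
    vec = [0] * len(vocabulary)
    for t in tokens:
        if t in vocabulary:
            vec[vocabulary[t]] += 1
    return vec

def predict(text, vocabulary, weights, bias=0.0):
    vec = featurize(tokenize(text), vocabulary)
    score = sum(v * w for v, w in zip(vec, weights)) + bias
    return "good" if score > 0 else "bad"

vocab = {"good": 0, "bad": 1, "product": 2}
weights = [2.0, -2.0, 0.1]
print(predict("This is a good product", vocab, weights))  # -> good
```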
Limitations of black-box
300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]:
250 Sentiment Analysis + 50 Regression Task
• Limited performance:
  - Overheads: JIT, GC, virtual calls, …
  - Operators are separate calls: no code fusion, multiple data accesses
  - Operators have different buffers and break locality
  - (Figure 4: CDF of prediction latency over the 250 sentiment analysis DAGs; hot predictions, with memory already allocated and JIT-compiled code, are more than an order of magnitude faster than cold predictions for the same pipelines)
• Limited density:
  - No state sharing: state is duplicated -> memory is wasted
  - (Figure 3: probability of each operator appearing across the 250 pipelines, identified by their parameters; many pipelines contain the same operators, e.g. N-gram operators in versions of different length, so pipelines could save memory by sharing)
[1] https://www.microsoft.com/net/learn/apps/machine-learning-and-ai/ml-dotnet
[2] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16
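A small sketch (illustrative only, not ML.NET internals) of why separate operator calls hurt: two operators applied one after the other materialize an intermediate buffer and scan the data twice, while a fused loop computes the same result in a single pass with better locality:

```python
# Two operators as separate calls: two passes, one intermediate buffer.
def scale(xs, factor):
    return [x * factor for x in xs]

def shift(xs, offset):
    return [x + offset for x in xs]

def unfused(xs):
    return shift(scale(xs, 2.0), 1.0)

# Fused version: one pass over the data, no intermediate allocation.
def fused(xs):
    return [x * 2.0 + 1.0 for x in xs]

data = [1.0, 2.0, 3.0]
assert unfused(data) == fused(data) == [3.0, 5.0, 7.0]
```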
White-Box Design principles
White-box: open the black box of models so that they can co-exist better and be scheduled better
1. End-to-end Optimizations: merge operators into computational units (logical stages) to decrease overheads
2. Multi-model Optimizations: create once, use everywhere, for both data and stages
Off-line phase [1] - Flour

Example pipeline: sentiment analysis ("This is a nice product" -> Positive vs. Negative):
Tokenizer -> Char Ngram / Word Ngram -> Concat -> Linear Regression
(Figure 1: a sentiment analysis pipeline consisting of operators for featurization (ellipses), followed by a ML model (diamond). Tokenizer extracts tokens (e.g., words) from the input string; Char and Word Ngrams featurize input tokens by extracting n-grams; Concat generates a single feature vector for the classifier.)

Model pipeline deployment and serving in PRETZEL is a two-phase process. During the off-line phase, pre-trained ML.NET pipelines are translated into Flour, a transformation-based API. The Oven optimizer rewrites and fuses transformations into model plans composed of parameterized logical units called stages; each stage is then AOT-compiled into physical computational units where memory resources and threads are pooled at runtime. Model plans are registered for prediction serving in the Runtime, where physical stages and parameters are shared between pipelines with similar model plans. In the on-line phase, when an inference request for a registered model plan is received, physical stages are parameterized dynamically with the proper values maintained in the Object Store, and the Scheduler is in charge of binding physical stages to shared execution units. Only the on-line phase is executed at inference time; model plans are generated completely off-line.

(Figure 6: model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) The Oven Optimizer generates a DAG of logical stages from the program; additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.)

Flour API

The goal of Flour is to provide an intermediate representation between ML frameworks (currently only ML.NET) and PRETZEL that is both easy to target and amenable to optimizations.

Listing 1: Flour program for the sentiment analysis pipeline. Transformations' parameters are extracted from the original ML.NET pipeline.

var fContext = new FlourContext(objectStore, ...);
var tTokenizer = fContext.CSV.
                 FromText(fields, fieldsType, sep).
                 Tokenize();

var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
var fPrgrm = tCNgram.
             Concat(tWNgram).
             ClassifierBinaryLinear(cParams);

return fPrgrm.Plan();

The arrays passed to FromText indicate the number and type of input fields. The successive call to Tokenize splits the input fields into tokens. The two branches define the char-level and word-level n-gram transformations, which are then merged with the Concat transform before the linear binary classifier. Both n-gram transformations are parametrized by the number of n-grams and by maps translating n-grams into numerical format (not shown). Additionally, each Flour transformation accepts as input an optional set of statistics gathered from training; these are used by the compiler to generate physical plans more efficiently tailored to the model characteristics. Example statistics are max vector size (to define the minimum size of vectors to fetch from the pool at prediction time), dense or sparse representations, etc.

Oven Optimizer: transformations with one-to-one semantics (such as most featurizers) are pipelined together in a single pass over the data; this strategy achieves best data locality because records are likely to reside in CPU registers [33, 38]. In the example, the optimizer recognizes that the Linear Regression can be pushed into Char and WordNgram, thereby bypassing the execution of Concat. Additionally, Tokenizer can be reused between Char and WordNgram, so it is pipelined with CharNgram (in one stage), and a dependency between CharNgram and WordNGram (in another stage) is created. The final plan is therefore composed of 2 stages, versus the initial 4 operators (and vectors) of ML.NET.

Model Plan Compiler: model plans are composed of two DAGs: a DAG of logical stages and a DAG of physical stages. Logical stages are an abstraction of the stages output by the Oven Optimizer, with related parameters; physical stages contain the actual code that will be executed by the PRETZEL runtime. For each given DAG, there is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations; the physical implementation is selected based on the parameters characterizing a logical stage.
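The 1-to-n logical-to-physical mapping can be sketched as follows (hypothetical code, not PRETZEL's compiler; `compile_stage` and the `"sparse"` statistic are invented for illustration): one logical stage offers several physical implementations, and one is selected from the stage's statistics:

```python
# One logical stage (a dot product) with two physical implementations,
# selected based on a statistic gathered at training time.

def dot_dense(vec, weights):
    return sum(v * w for v, w in zip(vec, weights))

def dot_sparse(vec, weights):
    # vec as {index: value}; efficient when most entries are zero.
    return sum(v * weights[i] for i, v in vec.items())

def compile_stage(stats):
    # 1-to-n mapping: same logical stage, different physical code.
    return dot_sparse if stats.get("sparse") else dot_dense

weights = [0.5, -1.0, 2.0]
dense_stage = compile_stage({"sparse": False})
sparse_stage = compile_stage({"sparse": True})
assert dense_stage([1.0, 0.0, 1.0], weights) == 2.5
assert sparse_stage({0: 1.0, 2: 1.0}, weights) == 2.5
```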
On-line phase [1] - Runtime
• Two main components:
  - Runtime, with an Object Store
  - Scheduler
• Runtime handles physical resources: threads and buffers
• Object Store caches objects of all models
  - models register and retrieve state objects via a key
• Scheduler is event-based, each stage being an event
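Key-based state sharing can be sketched like this (a hypothetical Object Store, not PRETZEL's actual API): two similar pipelines resolve the same key, so the state object is created once and stored once instead of per model:

```python
# "Create once, use everywhere": models register/retrieve state via a key,
# so identical state (e.g., an n-gram vocabulary) lives in memory once.

class ObjectStore:
    def __init__(self):
        self._objects = {}

    def get_or_create(self, key, factory):
        if key not in self._objects:
            self._objects[key] = factory()
        return self._objects[key]

def make_vocab():
    return {"good": 0, "bad": 1}

store = ObjectStore()
# Two pipelines resolve the same key to the same shared object.
vocab_a = store.get_or_create("ngram-vocab-v1", make_vocab)
vocab_b = store.get_or_create("ngram-vocab-v1", make_vocab)
assert vocab_a is vocab_b  # one copy in memory, not two
```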
Memory
• 7x less memory for SA models, 40% for RT models
• Higher model density, higher efficiency and profitability
(Figure 8: cumulative memory usage of the model pipelines with and without Object Store)
Latency
• Hot vs Cold scenarios
• Without and with Caching of partial results
• PRETZEL vs ML.NET: cold is 15x faster, hot is 2.6x faster
(Figure 9: latency comparison between ML.NET and PRETZEL)
Throughput
• Batch size is 1000 inputs, multiple runs of same batch
• ML.NET suffers from missed data sharing, i.e. higher memory traffic: model objects are not shared, so even when data values are the same they are mapped to different memory areas, increasing pressure on the memory subsystem
• PRETZEL scales linearly with the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled
(Figure 11: average throughput vs. number of cores)
• Delay batching: as in Clipper, let requests tolerate a given maximum delay
• Batch while latency is smaller than the tolerated delay
• For each model category, even a small delay helps reach the optimal batch size; latency gracefully increases linearly with the load
(Figure 12: throughput and latency of SA and RT models as the tolerated delay increases)
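The delay-batching policy above can be sketched as follows (a simplified illustration, not Clipper's or PRETZEL's actual scheduler; `batch_requests` and its parameters are invented for the example): requests accumulate until the batch is full or the oldest request has waited the tolerated delay:

```python
# Delay batching sketch: close a batch when it is full, or when adding the
# next request would make the oldest one wait longer than max_delay.

def batch_requests(arrival_times, max_delay, max_batch):
    """Group requests (sorted arrival times) into batches."""
    batches, current = [], []
    for t in arrival_times:
        if current and (len(current) == max_batch or t - current[0] > max_delay):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Requests arriving at 0..5 ms, 2 ms tolerated delay, batches of at most 3.
print(batch_requests([0, 1, 2, 3, 4, 5], max_delay=2, max_batch=3))
# -> [[0, 1, 2], [3, 4, 5]]
```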
Conclusions and current work
• We addressed performance/density bottlenecks in ML inference for MaaS by advocating a white-box approach
• Future work:
  - Code generation
  - Support more model formats than ML.NET
  - Distributed version
  - NUMA awareness
Future work with FPGA
• Physical stages can be offloaded to FPGA
• How to deploy it?
  - Fixed subset of common operators
  - Multiple operators in kernels
  - Whole model, deployed via partial reconfiguration
QUESTIONS ?
Speaker: Alberto Scolari, PhD student @ Politecnico di Milano, Italy - alberto.scolari@polimi.it


  • 8. Limitations of black-box — Limited performance • Overheads: JIT, GC, virtual calls, … • Operators are separate calls: no code fusion, multiple data accesses • Operators have different buffers and break locality. [Figure 3: probability of each operator appearing within the 250 pipelines — operators are identified by their parameters; many pipelines contain the same operator, so they could save memory by sharing its state. Figure 4: CDF of prediction-request latency over 250 DAGs — hot predictions (averaged over 100 requests after a warm-up of 10) are more than an order of magnitude faster than the first, cold prediction of the same pipelines.] 300 models in ML.NET [1], C#, from Amazon Reviews dataset [2]: 250 Sentiment Analysis + 50 Regression Task
  • 9. Limitations of black-box — Limited density • No state sharing: state is duplicated -> memory is wasted. [Figure 3, repeated: pipelines containing the same operator duplicate its state instead of sharing it.]
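The "no code fusion, multiple data accesses" point on slide 8 can be illustrated with a toy example (Python for illustration only; ML.NET operators are C# objects, and the function names below are invented):

```python
# Two featurization "operators" as separate calls: each pass
# materializes its own intermediate buffer, hurting locality.
def lowercase_op(tokens):
    return [t.lower() for t in tokens]       # first pass, first buffer

def length_op(tokens):
    return [len(t) for t in tokens]          # second pass, second buffer

def pipeline_unfused(tokens):
    return length_op(lowercase_op(tokens))   # two loops over the data

def pipeline_fused(tokens):
    # One loop, no intermediate buffer: the pattern PRETZEL's stages
    # aim for by pipelining one-to-one operators in a single pass.
    return [len(t.lower()) for t in tokens]

# Both compute the same result; only the memory-access pattern differs.
assert pipeline_unfused(["This", "IS"]) == pipeline_fused(["This", "IS"])
```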
  • 10. White-Box design principles • White-box: open the black box of models so that they can coexist better and be scheduled better 1. End-to-end Optimizations: merge operators into computational units (logical stages) to decrease overheads 2. Multi-model Optimizations: create once, use everywhere, for both data and stages
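The "create once, use everywhere" principle can be sketched as a keyed object store that deduplicates identical operator state across pipelines (a minimal Python sketch; PRETZEL's actual Object Store is a C# runtime component, and `ObjectStore` here is a hypothetical name):

```python
# Sketch: 250 similar models register the same n-gram dictionary under
# one key, so only one copy lives in memory -> higher model density.
class ObjectStore:
    def __init__(self):
        self._objects = {}

    def get_or_create(self, key, factory):
        # The first pipeline to register `key` pays the creation cost;
        # later pipelines get back the very same object.
        if key not in self._objects:
            self._objects[key] = factory()
        return self._objects[key]

store = ObjectStore()
vocab_a = store.get_or_create("unigram-dict-v1", lambda: {"good": 0, "bad": 1})
vocab_b = store.get_or_create("unigram-dict-v1", lambda: {"good": 0, "bad": 1})
assert vocab_a is vocab_b  # one shared copy, not two duplicates
```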
  • 11–16. Off-line phase [1] — Flour. [Figure 6: Model optimization and compilation in PRETZEL. In (1), a model is translated into a Flour program. (2) The Oven Optimizer generates a DAG of logical stages from the program; additionally, parameters and statistics are extracted. (3) A DAG of physical stages is generated by the Oven Compiler using logical stages, parameters, and statistics. A model plan is the union of all the elements and is fed to the runtime.]
    Flour API — Listing 1: Flour program for the sentiment analysis pipeline; transformations' parameters are extracted from the original ML2 pipeline:
    var fContext = new FlourContext(objectStore, ...);
    var tTokenizer = fContext.CSV.
        FromText(fields, fieldsType, sep).
        Tokenize();
    var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
    var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
    var fPrgrm = tCNgram.
        Concat(tWNgram).
        ClassifierBinaryLinear(cParams);
    return fPrgrm.Plan();
    Each Flour transformation also accepts an optional set of statistics gathered from training (e.g., max vector size, dense vs. sparse representation), which the compiler uses to generate physical plans tailored to the model's characteristics.
    Oven Optimizer: one-to-one transformations (such as most featurizers) are pipelined together in a single pass over the data, achieving the best data locality because records are likely to reside in CPU registers. For the sentiment-analysis pipeline, Oven recognizes that the Linear Regression can be pushed into CharNgram and WordNgram, bypassing the execution of Concat; Tokenizer can be reused between the two, so it is pipelined with CharNgram in one stage, and a dependency between CharNgram and WordNgram forms another stage. The final plan is therefore composed of 2 stages, versus the initial 4 operators (and vectors) of ML2.
    Model Plan Compiler: model plans are composed of two DAGs: a DAG of logical stages and a DAG of physical stages. Logical stages are an abstraction of the stages output by the Oven Optimizer, with related parameters; physical stages contain the actual code that will be executed by the PRETZEL runtime. For each given DAG there is a 1-to-n mapping between logical and physical stages, so that a logical stage can represent the execution code of different physical implementations; the physical implementation is selected based on the parameters characterizing a logical stage.
The final plan will therefore be composed by stages, versus the initial 4 operators (and vectors) of ML Model Plan Compiler: Model plans are composed two DAGs: a DAG composed of logical stages, and DAG of physical stages. Logical stages are an abstra tion of the stages output of the Oven Optimizer, with lated parameters; physical stages contains the actual co that will be executed by the PRETZEL runtime. For.ea given DAG, there is a 1-to-n mapping between logical physical stages so that a logical stage can represent t execution code of different physical implementations. physical implementation is selected based on the param Flour API
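The stage-fusion idea above can be sketched in a few lines. This is a hypothetical illustration, not PRETZEL's actual optimizer: memory-bound operators are pipelined into a shared stage, while a compute-intensive operator closes the current stage and runs in its own. Operator names and the `is_compute_intensive` flag are assumptions for the example.

```python
def fuse_into_stages(operators):
    """operators: list of (name, is_compute_intensive) in pipeline order.
    Returns a list of stages, each stage being a list of operator names."""
    stages, current = [], []
    for name, heavy in operators:
        if heavy:
            if current:                # close the pipelined stage so far
                stages.append(current)
                current = []
            stages.append([name])      # heavy op gets its own stage
        else:
            current.append(name)       # pipeline memory-bound ops together
    if current:
        stages.append(current)
    return stages

pipeline = [("Tokenizer", False), ("CharNgram", False),
            ("WordNgram", True), ("LinearRegression", False)]
print(fuse_into_stages(pipeline))
# → [['Tokenizer', 'CharNgram'], ['WordNgram'], ['LinearRegression']]
```

A real optimizer would also consider operator parameters and statistics when deciding stage boundaries; this sketch only shows the fusion shape.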
  • 17. • Two main components: – Runtime, with an Object Store – Scheduler • Runtime handles physical resources: threads and buffers • Object Store caches objects of all models – models register and retrieve state objects via a key • Scheduler is event-based, each stage being an event 6 On-line phase [1] - Runtime
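The key-based state sharing described above can be sketched as follows. This is an illustrative model of the Object Store idea, not PRETZEL's actual API: each model asks for a state object by key, and models that share a key get the same cached in-memory object, so identical weights or vocabularies are stored once per machine. The class and method names are assumptions.

```python
class ObjectStore:
    """Toy object store: caches immutable state objects shared across models."""
    def __init__(self):
        self._objects = {}

    def get_or_register(self, key, factory):
        # Return the cached object if some model already registered this key;
        # otherwise build it once via `factory` and cache it for later models.
        if key not in self._objects:
            self._objects[key] = factory()
        return self._objects[key]

store = ObjectStore()
vocab_a = store.get_or_register("ngram-vocab-v1", lambda: {"good": 0, "bad": 1})
vocab_b = store.get_or_register("ngram-vocab-v1", lambda: {"good": 0, "bad": 1})
assert vocab_a is vocab_b   # two models share one in-memory object
```

Deduplicating state this way is what drives model density up: n similar models pay for one copy of each shared object instead of n.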
  • 18. • 7x less memory for SA models, 40% less for RT models • Higher model density, higher efficiency and profitability 7 Memory Figure 8: Cumulative memory usage of the model pipelines with and without Object Store, normalized by …
  • 19. • Hot vs cold scenarios • Without and with caching of partial results • PRETZEL vs ML.NET: cold is 15x faster, hot is 2.6x faster 8 Latency Figure 9: Latency comparison between ML.NET and PRETZEL.
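The "caching of partial results" mentioned above can be sketched with a memoized featurizer. This is a hedged illustration under assumed names (`featurize`, `predict` are made up): the output of the shared, expensive featurization stages is cached keyed by the raw input, so a popular ("hot") request only pays for the cheap model-specific final stage.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def featurize(text):
    # stand-in for the shared, expensive featurization stages
    return tuple(text.lower().split())

def predict(text, weights):
    feats = featurize(text)           # served from cache when the input is hot
    return sum(weights.get(f, 0.0) for f in feats)

w = {"good": 1.0, "bad": -1.0}
print(predict("This is a good product", w))   # cold: runs featurization
print(predict("This is a good product", w))   # hot: cache hit
```

Because many co-deployed models share the same featurization prefix, one cached partial result can serve several models at once.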
  • 20. • Batch size is 1000 inputs, multiple runs of the same batch • Delay batching: as in Clipper, let requests tolerate a given delay • Batch while latency is smaller than the tolerated delay • ML.NET suffers from missed data sharing, i.e. higher memory traffic 9 Throughput Figure 11: average throughput of the models; latency, instead, gracefully increases linearly with the load.
  • 21. • Batch size is 1000 inputs, multiple runs of the same batch • Delay batching: as in Clipper, let requests tolerate a given delay • Batch while latency is smaller than the tolerated delay • ML.NET suffers from missed data sharing, i.e. higher memory traffic 9 Throughput PRETZEL scales linearly with the number of CPU cores, close to the expected maximum throughput with Hyper-Threading enabled. In ML.NET, model objects are not shared, thus increasing the pressure on the memory subsystem: even if the data values are the same, the model objects are mapped to different memory areas. Delay batching: as in Clipper, the PRETZEL FrontEnd allows users to specify a maximum delay they can wait to maximize throughput. For each model category, Figure 12 depicts the trade-off between throughput and latency as the delay increases. Interestingly, for such models even a small delay helps reach the optimal batch size. Figure 12: Throughput and latency of SA and RT models.
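The delay-batching rule above ("batch while latency is smaller than the tolerated delay") can be sketched as a simple collection loop. This is a minimal illustration, not PRETZEL's scheduler; the function and parameter names are assumptions, and `max_batch=1000` mirrors the batch size used in the experiments.

```python
import time

def collect_batch(queue_pop, max_delay_s, max_batch=1000):
    """Accumulate requests until the batch is full or the tolerated delay expires."""
    batch, deadline = [], time.monotonic() + max_delay_s
    while len(batch) < max_batch and time.monotonic() < deadline:
        req = queue_pop()             # non-blocking: returns None if queue empty
        if req is None:
            time.sleep(0.0001)        # brief wait for more requests to arrive
            continue
        batch.append(req)
    return batch

# usage with a toy request queue
pending = list(range(5))
pop = lambda: pending.pop(0) if pending else None
print(collect_batch(pop, max_delay_s=0.01))   # → [0, 1, 2, 3, 4]
```

The trade-off is exactly the one in Figure 12: a larger tolerated delay yields bigger batches and higher throughput, at the cost of per-request latency.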
  • 22. • We addressed performance/density bottlenecks in ML inference for MaaS by advocating a white-box approach • Future work - Code generation - Support more model formats than ML.NET - Distributed version - NUMA awareness 10 Conclusions and current work
  • 23. • Physical stages can be offloaded to FPGA • How to deploy it? – Fixed subset of common operators – Multiple operators in kernels – Whole model, deployed via partial reconfiguration 11 Future work with FPGA
  • 24. • Physical stages can be offloaded to FPGA • How to deploy it? – Fixed subset of common operators – Multiple operators in kernels – Whole model, deployed via partial reconfiguration 11 Future work with FPGA QUESTIONS ?