Composition in ML:
in Models, Tools, and Teams
ODSC West - Nov 16, 2021
Dr. Bryan Bischof
– Head of Data Science @ Weights and Biases –
1
In collaboration with Dr. Eric Bunch
Email: bryan.bischof@gmail.com
What is composition?
2
Definition
Compositionality, also known as Fregeʼs principle, states that the
meaning of a complex expression is determined by
1. the meanings of its constituent parts,
and
2. the rules for how those parts are combined.
3
c.f. Fong, Spivak, 2018
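As a toy illustration of my own (not from the reference): in code, the ʻconstituent partsʼ can be plain functions, and ʻthe rules for how those parts are combinedʼ can be function composition itself.

```python
from functools import reduce

def compose(*fs):
    """Compose right-to-left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), fs)

double = lambda x: 2 * x
increment = lambda x: x + 1

# The meaning of the whole is fixed by the parts *and* the combination rule:
print(compose(double, increment)(3))  # 2 * (3 + 1) == 8
print(compose(increment, double)(3))  # (2 * 3) + 1 == 7
```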
Model Composition
4
Examples
Matrix Factorization–or more specifically Singular Value Decomposition–is an
extremely popular latent factor model for recommendation systems. Recall that we are
given a user-item matrix R with rating elements r_ui. We wish to approximate this
matrix via training; our approximation technique is to factorize the matrix into three
parts, R ≈ U𝚺Vᵀ:
- U: representing the relationship between users and latent factors
- 𝚺: describing the strength of each latent factor
- V: indicating the similarity between items and latent factors
5
c.f. Koren, Bell, Volinsky, 2009
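A minimal numpy sketch of this composition (a toy ratings matrix; the numbers are purely illustrative):

```python
import numpy as np

# Toy user-item rating matrix R (4 users x 5 items); zeros stand in for
# unobserved ratings here -- real systems treat missing entries explicitly.
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

# Full SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the top-k latent factors: the composed approximation R ~= U_k S_k V_k^T
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_hat, 2))  # low-rank scores for user-item affinity
```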
Definition
So sometimes,
ʻconstituent partsʼ are metric embeddings
and
ʻhow they are combinedʼ is linear-algebraic.
6
Examples
Seasonal Average Pooling–like other composite forecasting methods–is an extremely
simple forecasting method utilizing repeated model fitting on residuals-of-residuals.
For example, letʼs build a univariate forecasting model for a series using only seasonal
components: during the training sequence f(t), consider Month-of-year, Week-of-month,
and Day-of-week as categorical features on each day; pool (average) the series within
the first seasonal grouping, then fit each subsequent pool on the residuals of the one
before it.
7
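A minimal pandas sketch of the idea, assuming each seasonal ʻmodelʼ is just a group average fit on the residuals of the previous one (synthetic data, illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series.
idx = pd.date_range("2020-01-01", "2021-10-31", freq="D")
df = pd.DataFrame(
    {"y": np.random.default_rng(0).normal(100, 10, len(idx))}, index=idx
)
df["month"] = df.index.month          # Month-of-year
df["wom"] = (df.index.day - 1) // 7   # Week-of-month
df["dow"] = df.index.dayofweek        # Day-of-week

# Fit one pooled (group-average) model per seasonal component,
# each on the residuals of the composition so far.
resid, fitted = df["y"], 0.0
for col in ["month", "wom", "dow"]:
    pool = resid.groupby(df[col]).transform("mean")
    fitted = fitted + pool
    resid = resid - pool

print(resid.abs().mean())  # what remains after the composed seasonal pools
```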
Definition
So sometimes,
ʻconstituent partsʼ are pooling layers
and
ʻhow they are combinedʼ is a recursive residual additive process.
8
Examples
Boosted Trees are an ensemble fit by sequentially training decision trees on the
residuals of the iteratively composed model. In particular, the model at each iteration
is the previous model plus the weighted iʼth tree, fit on the residuals of the (i-1)ʼth
model:
F_i(x) = F_{i-1}(x) + γ_i · h_i(x)
with learnable weighting parameters γ_i we get a powerful learner!
9
c.f. Friedman, 2009
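A sketch of that recursive addition, hand-rolled with scikit-learn trees rather than a production GBM (a constant learning rate stands in for the learnable weights):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)

# F_i = F_{i-1} + lr * h_i, with h_i fit on the residuals of the composed model.
lr, F = 0.1, np.zeros_like(y)
for _ in range(100):
    h = DecisionTreeRegressor(max_depth=2).fit(X, y - F)
    F = F + lr * h.predict(X)

print(np.mean((y - F) ** 2))  # training MSE shrinks as the trees compose
```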
Definition
So sometimes,
ʻconstituent partsʼ are weighted learners
and
ʻhow they are combinedʼ is recursive addition.
10
Examples
Foundational models–pretrained models combined with downstream task-specific
training–are becoming ubiquitous in deep learning research and applications.
11
c.f. Standley et al., 2020; Li, Hoiem, 2017
There are numerous architectures for model transfer. Some of the most exciting, in my
opinion, are those which jointly train multiple downstream tasks, e.g. optimizing
overall network performance via minimization of the aggregate loss over all tasks:
L_total = Σ_t w_t · L_t, for per-task losses L_t and weights w_t.
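A minimal PyTorch sketch of that aggregate loss, assuming a shared trunk, two hypothetical task heads, and hand-picked task weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

trunk = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # stands in for a pretrained body
head_a = nn.Linear(64, 10)                            # e.g. a classification task
head_b = nn.Linear(64, 1)                             # e.g. a regression task
w = {"a": 1.0, "b": 0.5}                              # per-task loss weights

x = torch.randn(32, 128)
y_a = torch.randint(0, 10, (32,))
y_b = torch.randn(32, 1)

z = trunk(x)  # shared representation feeding every task head
loss = w["a"] * Fn.cross_entropy(head_a(z), y_a) + w["b"] * Fn.mse_loss(head_b(z), y_b)
loss.backward()  # gradients from both tasks flow through the shared trunk
```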
Definition
So sometimes,
ʻconstituent partsʼ are parameters trained via learning
and
ʻhow they are combinedʼ is a layer composition and loss sharing.
12
Examples
Equivariant Globally Natural DL–or Graph DL with invariance up to graph
isomorphism–pushes the emerging domain of graph learning to accommodate not only
global isomorphisms, but also those built from local mappings.
13
c.f. Haan, Cohen, Welling, 2021
In this example the compositional structure is more
obvious, but nonetheless essential to the
formulation.
Node features may be embedded onto edge features, and passed into convolution as in
normal GNNs.
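To make the node-to-edge step concrete, here is a toy numpy message-passing pass; it is emphatically not the paperʼs equivariant construction, just the plain compositional skeleton it refines:

```python
import numpy as np

# Toy directed graph on 4 nodes; all weights are random and illustrative.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
H = np.random.default_rng(0).normal(size=(4, 8))   # node features
W = np.random.default_rng(1).normal(size=(16, 8))  # stand-in learned projection

# 1. Embed node features onto edges (concatenate endpoint features)...
E = np.stack([np.concatenate([H[u], H[v]]) for u, v in edges])

# 2. ...then convolve: project each edge message and aggregate onto targets.
H_new = np.zeros_like(H)
for (u, v), msg in zip(edges, E @ W):
    H_new[v] += msg
H_new = np.tanh(H_new)  # composed: embed -> project -> aggregate -> nonlinearity
```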
Definition
So sometimes,
ʻconstituent partsʼ are action mappings on the data structure
and
ʻhow they are combinedʼ is function composition and kernel
convolution.
14
Ok, ok. So composition is very much a part of the structural modeling we do as
Machine Learning practitioners.
But Iʼm more on the applied side...
15
Compositional tools
16
YAFPT? YAMST?
I want to sell you an ML pipeline:
- Itʼs composed of pure components, i.e. they return the same output every time
from the same input, and have no side effects
- It is higher order–each component provides APIs for a function
- It is composable, i.e. components are easily combined via knowledge only of the
types of their inputs and outputs
- It is curryable–providing a fixed set of parameters and inputs allows you to execute
the entire pipeline.
Are you buying? These happen to align with the core principles of Functional
Programming, but also of Micro-services. Why does MLOps care about these?
17
c.f. fklearn
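Here is a minimal sketch of those properties in the fklearn spirit (the learner names are hypothetical, not fklearnʼs actual API): each learner is pure and higher order, returning a predict function whose output type matches the next learnerʼs input.

```python
import pandas as pd

def mean_imputer_learner(df: pd.DataFrame, col: str):
    fill = df[col].mean()  # learned from training data, then frozen
    def predict(new_df: pd.DataFrame) -> pd.DataFrame:
        return new_df.assign(**{col: new_df[col].fillna(fill)})
    return predict

def scaler_learner(df: pd.DataFrame, col: str):
    denom = df[col].max()
    def predict(new_df: pd.DataFrame) -> pd.DataFrame:
        return new_df.assign(**{col: new_df[col] / denom})
    return predict

train = pd.DataFrame({"age": [20.0, None, 60.0]})

# Composable because the types line up: DataFrame in, DataFrame out.
p1 = mean_imputer_learner(train, "age")
p2 = scaler_learner(p1(train), "age")
pipeline = lambda df: p2(p1(df))  # pure: same input, same output, no side effects

print(pipeline(pd.DataFrame({"age": [None, 30.0]})))
```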
Models? Data? No!
Andrew Ng has recently been proselytizing the gains of a data-centric approach to AI.
He rightly recognizes both the effectiveness of data improvement and preparation, and
the value of systematic attention to the data that your product is built on.
In particular he identifies, correctly, that one formulation of the data pipeline is,
roughly: scope the project → collect data → train the model → deploy in production,
with backward arrows feeding what you learn in later stages back into earlier ones.
And he rightly identifies the importance of those backwards arrows in this flow. But...
18
c.f. From Model-centric to Data-centric AI
Right answer; wrong test.
Dr. Ngʼs recommendation:
Donʼt: hold the data fixed and iteratively improve the model.
Do: hold the code fixed and iteratively improve the data.
While I deeply appreciate this suggestion to be modular and flexible, it aims too low!
The recommendation from compositional thinking:
Hold the (composition) fixed and iteratively improve (one component).
i.e. Pipeline-centric AI!
19
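A toy sketch of pipeline-centric iteration with hypothetical components: the composition rule never changes while one slot is swapped out.

```python
from functools import partial, reduce
import pandas as pd

def impute_mean(df, col):   return df.assign(**{col: df[col].fillna(df[col].mean())})
def impute_median(df, col): return df.assign(**{col: df[col].fillna(df[col].median())})
def scale(df, col, by):     return df.assign(**{col: df[col] / by})

def run(steps, df):
    """The fixed composition: thread the DataFrame through each step."""
    return reduce(lambda d, step: step(d), steps, df)

steps = [partial(impute_mean, col="age"), partial(scale, col="age", by=100.0)]
df = pd.DataFrame({"age": [10.0, None, 20.0, 90.0]})

print(run(steps, df))                          # mean-imputed variant
steps[0] = partial(impute_median, col="age")   # iterate on one component only
print(run(steps, df))                          # the composition never changed
```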
Itʼs about the process
Data changes, but so do the other components!
The needs of the data change, the expectations of the model change, the objective
functions change, the sources change, etc. If you focus only on the data, youʼre
focusing too closely on the short-term goals, and over-constraining your solution.
By instead making primary the data transformations, data assumptions, and
compositions (input and output types), you can rapidly iterate at multiple locations
across the stack, wherever you see the most opportunity.
20
YAAICP
Letʼs bring in yet another AI catch-phrase:
the data flywheel.
What makes sense about this analogy is the implication
that the inertia of the spinning wheel ramps up.
In the data flywheel strategy, data products provide
personalization and insight to drive more customer
interactions which may be converted back into
learnable structures.
Notice here the focus on composition!
21
c.f. Matt Turck, Building an AI Startup
Letʼs look at a real ML system architecture
Consider this incredible
overview of just about
every RecSys out there.
This diagram is
data-structure,
infrastructure, and model
architecture agnostic!
And yet, via only the
composition rules, we
have a full system design.
22
c.f. Higley, Oldridge, 2021, Yan, 2021
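Assuming the staging in that diagram is the familiar retrieval → filtering → scoring → ordering decomposition, here is a toy sketch of a system specified purely by its composition rules:

```python
# Hypothetical stand-ins: a static catalog of item scores and seen-item sets.
CATALOG = {1: 0.9, 2: 0.4, 3: 0.7, 4: 0.1}
SEEN = {"u1": {2}}

def retrieve(user):          return list(CATALOG)                 # candidate ids
def filter_seen(user, ids):  return [i for i in ids if i not in SEEN.get(user, set())]
def score(user, ids):        return {i: CATALOG[i] for i in ids}  # stand-in model
def order(scores):           return sorted(scores, key=scores.get, reverse=True)

def recommend(user):
    return order(score(user, filter_seen(user, retrieve(user))))

print(recommend("u1"))  # [1, 3, 4]
```

Swap in a real retrieval index or scoring model and nothing else moves, because only the input/output types are load-bearing.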
Is there anything that can help?
MLOps is a somewhat nascent field focused on the overall structure of ML products
and pipelines.
Technology is beginning to be developed around these needs, both to manage the
components of a pipeline-centric system and to execute the type alignment.
People are starting to align on explicit composition coherence.
23
c.f. Shreya Shankar, 2021
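One way to make that coherence explicit in Python, as a sketch (a structural Protocol plus a runtime contract check, not any particular vendorʼs API):

```python
from typing import Protocol
import pandas as pd

class Transform(Protocol):
    """The contract every pipeline component agrees to honor."""
    def __call__(self, df: pd.DataFrame) -> pd.DataFrame: ...

def compose_checked(*steps: Transform):
    """Wire components together and enforce the contract at every seam."""
    def pipeline(df: pd.DataFrame) -> pd.DataFrame:
        for step in steps:
            df = step(df)
            assert isinstance(df, pd.DataFrame), f"{step} broke the contract"
        return df
    return pipeline

drop_nulls = lambda df: df.dropna()
add_flag = lambda df: df.assign(flagged=True)

print(compose_checked(drop_nulls, add_flag)(pd.DataFrame({"x": [1.0, None]})))
```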
And some of us are building the platform
Like these compositional pipelines, our platform is built of components, and our
platform handles the coherence.
24
c.f. Weights and Biases
In practice
Machine learning engineers can avoid writing glue code, and assert statements, and
drift monitors, and hard-coded url-slugs, and reading local data into dataloaders, and
training loops, and ensemble DAGs, and can get back to focusing on the data, the
models, and the tasks.
25
c.f. W&B Launch
Donʼt start from scratch
In their landmark Dota 2 paper, the OpenAI team described an essential component in
their mission to train better and better models:
“In order to train without restarting from the beginning after each change, we
developed a collection of tools to resume training with minimal loss in performance
which we call surgery.... we performed approximately one surgery per two weeks.”
If your dog hasnʼt learned to catch a frisbee by the time theyʼre six weeks old, donʼt get a
new dog–get a new training methodology. 🐕
Composable tools allow you to swap in and out your strategies wherever necessary.
26
c.f. OpenAI, Dota 2, 2019, OpenAI & Weights and Biases
My ML products donʼt look like this!
Well, Andrej Karpathyʼs do!
27
c.f. Karpathy, ICML 2019
Composable teams
28
Thereʼs more?
An even bigger challenge than building effective ML systems is building effective
team structures to support the people who can build those systems.
- What is the right team architecture to enable people to do their best work, and
yet provide opportunities for growth?
- How do you create robustness to team departures, vacations, or burnout?
29
Take from engineering
In much of the above, we took engineeringʼs learnings as a foundation and built on
top of them. Here too, we can take away important lessons:
- Atomic tasks, clearly specced
- Assignee agnostic tasks
- PR processes
- Component expertise
While being a full-stack data scientist creates plenty of opportunity for innovation,
over time that stack owns you, and buckles you into a full-time maintenance role.
30
c.f. Eric Colson, 2019
Focus on the relationships
Like our components and interfaces throughout this talk, we as ML practitioners
should–at any given time–focus on executing one task.
We should be given clear inputs and expectations for our outputs.
And we should understand how to communicate and exchange with others.
When it comes time for someone else to work on this task, it should be frictionless
and context rich, with clear documentation of whatʼs been done and a system of
record for how to reproduce it.
31
c.f. Collaborative Reports
One more Karpathy reference
Karpathyʼs team on self-driving was distributed
over many components of a
massively-multitask problem. In addition to
adversarial collaboration, he generally found it difficult to optimize how to compose
their efforts.
Maybe he should try back-propagation to learn
a better weighting.
32
c.f. Karpathy, ICML 2019
Thanks!
Check out W&Bʼs composable tools at:
Wandb.ai
Totally free for individuals & academics.
Come chat with us at our booth today and tomorrow, or email contact@wandb.ai.
33
