Interactive machine learning
Daniel Hsu
Columbia University
Non-interactive machine learning

“Cartoon” of (non-interactive) machine learning:

1. Get labeled data {(input_i, output_i)}_{i=1}^n.
2. Learn prediction function ˆf (e.g., classifier, regressor, policy) such that
   ˆf(input_i) ≈ output_i
   for most i = 1, 2, . . . , n.

Goal: ˆf(input) ≈ output for future (input, output) pairs.

Some applications: document classification, face detection, speech recognition, machine translation, credit rating, . . .

There is a lot of technology for doing this, and a lot of mathematical theory for understanding it.
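To ground the cartoon, here is a minimal sketch of the non-interactive recipe; the scikit-learn logistic-regression learner and the synthetic data are illustrative assumptions, not part of the talk.

```python
# Minimal sketch of the non-interactive "cartoon": fit on labeled pairs, predict on new inputs.
# Any supervised learner could stand in for LogisticRegression here.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

# 1. Get labeled data {(input_i, output_i)}_{i=1}^n (synthetic, for illustration only).
n = 1000
inputs = rng.normal(size=(n, 5))
outputs = (inputs @ true_w > 0).astype(int)

# 2. Learn a prediction function f_hat such that f_hat(input_i) ≈ output_i for most i.
f_hat = LogisticRegression().fit(inputs, outputs)

# Goal: f_hat(input) ≈ output on future pairs (here, a held-out sample from the same source).
new_inputs = rng.normal(size=(200, 5))
new_outputs = (new_inputs @ true_w > 0).astype(int)
print("held-out accuracy:", f_hat.score(new_inputs, new_outputs))
```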
Interactive machine learning: example #1

Practicing physician

Loop:
1. Patient arrives with symptoms, medical history, genome, . . .
2. Prescribe treatment.
3. Observe impact on patient’s health (e.g., improves, worsens).

Goal: prescribe treatments that yield good health outcomes.
Interactive machine learning: example #2

Website operator

Loop:
1. User visits website with profile, browsing history, . . .
2. Choose content to display on website.
3. Observe user reaction to content (e.g., click, “like”).

Goal: choose content that yields desired user behavior.
Interactive machine learning: example #3

E-mail service provider

Loop:
1. Receive e-mail messages for users (spam or not).
2. Ask users to provide labels for some borderline cases.
3. Improve spam filter using newly labeled messages.

Goal: maximize accuracy of the spam filter while minimizing the number of queries to users.
Characteristics of interactive machine learning problems

1. Learning agent (a.k.a. “learner”) interacts with the world (e.g., patients, users) to gather data.
2. Learner’s performance is based on the learner’s decisions.
3. Data available to the learner depends on the learner’s decisions.
4. State of the world depends on the learner’s decisions.

This talk: two interactive machine learning problems,
1. active learning, and
2. contextual bandit learning.
(Our models for these problems have all but the last of the above characteristics.)
Active learning
Motivation

Recall non-interactive (a.k.a. passive) machine learning:

1. Get labeled data {(input_i, label_i)}_{i=1}^n.
2. Learn function ˆf (e.g., classifier, regressor, decision rule, policy) such that
   ˆf(input_i) ≈ label_i
   for most i = 1, 2, . . . , n.

Problem: labels often much more expensive to get than inputs.
E.g., can’t ask for spam/not-spam label of every e-mail.
Active learning

Basic structure of active learning:

Start with pool of unlabeled inputs (and maybe some (input, label) pairs).
Repeat:
1. Train prediction function ˆf using current labeled pairs.
2. Pick some inputs from the pool, and query their labels.

Goal:
Final prediction function ˆf satisfies ˆf(input) ≈ label for most future (input, label) pairs.
Number of label queries is small.
(Can be selective/adaptive about which labels to query . . . )
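Below is a hedged sketch of this loop; the uncertainty-based selection rule standing in for step 2 (the same heuristic examined on the next slide), the scikit-learn learner, the seed-set size, and all names are illustrative assumptions.

```python
# Sketch of the basic pool-based active learning loop (a template, not a specific algorithm).
# Assumes the random seed labels contain both classes, so the classifier can be fit.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(pool_inputs, query_label, n_seed=10, n_rounds=50, batch_size=1):
    """pool_inputs: (n, d) array of unlabeled inputs.
    query_label(i): returns the label of pool_inputs[i] (the expensive oracle)."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(pool_inputs), size=n_seed, replace=False))
    labels = {i: query_label(i) for i in labeled}
    f_hat = None
    for _ in range(n_rounds):
        # 1. Train prediction function f_hat using the current labeled pairs.
        X = pool_inputs[labeled]
        y = np.array([labels[i] for i in labeled])
        f_hat = LogisticRegression().fit(X, y)
        # 2. Pick some inputs from the pool and query their labels.
        #    Here: the unlabeled points the current model is least sure about
        #    (the heuristic critiqued on the next slide).
        unlabeled = [i for i in range(len(pool_inputs)) if i not in labels]
        margins = np.abs(f_hat.predict_proba(pool_inputs[unlabeled])[:, 1] - 0.5)
        for j in np.argsort(margins)[:batch_size]:
            i = int(unlabeled[j])
            labels[i] = query_label(i)
            labeled.append(i)
    return f_hat
```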
Sampling bias

Main difficulty of active learning: sampling bias.

At any point in active learning process, labeled data set typically not representative of overall data.

Problem can derail learning and go unnoticed!

Example: input ∈ R, label ∈ {0, 1}, classifier type = threshold functions.
[Figure: data distribution on the real line, divided into regions carrying 45%, 45%, 5%, and 5% of the probability mass.]
Typical active learning heuristic:

Start with pool of unlabeled inputs.
Pick a handful of inputs, and query their labels.
Repeat:
1. Train classifier ˆf using current labeled pairs.
2. Pick input from pool closest to decision boundary of ˆf, and query its label.

In the threshold example above, the learner converges to a classifier with 5% error rate, even though there’s a classifier with 2.5% error rate. Not statistically consistent!
Framework for statistically consistent active learning

Importance weighted active learning
(Beygelzimer, Dasgupta, and Langford, ICML 2009)

Start with unlabeled pool {input_i}_{i=1}^n.
For each i = 1, 2, . . . , n (in random order):
1. Get input_i.
2. Pick probability p_i ∈ [0, 1], toss a coin with P(heads) = p_i.
3. If heads, query label_i, and include (input_i, label_i) with importance weight 1/p_i in the data set.

Importance weight = inverse of the “propensity” with which label_i is queried.
Used to account for sampling bias in error rate estimates.
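Below is a hedged sketch of the importance-weighting scheme; the clipping constant p_min, the choose_p heuristic, and the use of scikit-learn's sample_weight are illustrative assumptions; the published algorithms choose p_i more carefully (next slide).

```python
# Sketch of importance weighted active learning: query label_i with probability p_i, and if
# queried, keep (input_i, label_i) with weight 1/p_i so weighted error estimates stay unbiased.
import numpy as np
from sklearn.linear_model import LogisticRegression

def iwal(pool_inputs, query_label, choose_p, p_min=0.1, seed=0):
    """choose_p(f_hat, x) -> query probability for input x; clipped below at p_min so that
    no importance weight 1/p_i blows up (the 'never too small' condition)."""
    rng = np.random.default_rng(seed)
    X, y, w = [], [], []                               # labeled inputs, labels, importance weights
    f_hat, n_queries = None, 0
    for i in rng.permutation(len(pool_inputs)):        # for each i, in random order
        x = pool_inputs[i]
        p = 1.0 if f_hat is None else max(p_min, min(1.0, choose_p(f_hat, x)))
        if rng.random() < p:                           # toss coin with P(heads) = p
            X.append(x)
            y.append(query_label(i))
            w.append(1.0 / p)                          # inverse-propensity importance weight
            n_queries += 1
            if len(set(y)) > 1:                        # need both classes before fitting
                f_hat = LogisticRegression().fit(
                    np.array(X), np.array(y), sample_weight=np.array(w))
    return f_hat, n_queries
```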
Importance weighted active learning algorithms

How to choose p_i?

1. Use your favorite heuristic (as long as p_i is never too small).
2. Simple rule based on estimated error rate differences
   (Beygelzimer, Hsu, Langford, and Zhang, NIPS 2010).
   In many cases, can prove the following:
   Final classifier is as accurate as learner that queries all n labels.
   Expected number of queries ∑_{i=1}^n p_i grows sub-linearly with n.
3. Solve a large convex optimization problem
   (Huang, Agarwal, Hsu, Langford, and Schapire, NIPS 2015).
   Even stronger theoretical guarantees.

Latter two methods can be made to work with any classifier type, provided a good (non-interactive) learning algorithm.
Contextual bandit learning
Contextual bandit learning

Protocol for contextual bandit learning:

For round t = 1, 2, . . . , T:
1. Observe context_t. [e.g., user profile]
2. Choose action_t ∈ Actions. [e.g., display ad]
3. Collect reward_t(action_t) ∈ [0, 1]. [e.g., click or no-click]

Goal: choose actions that yield high reward.

Note: in round t, only observe the reward for the chosen action_t, and not the reward that you would’ve received if you’d chosen a different action.
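As a sketch of how the protocol looks in code, here is a bare interaction loop plus a toy epsilon-greedy learner as a stand-in; the environment interface, method names, and the learner itself are illustrative assumptions, not anything from the talk.

```python
# Sketch of the contextual bandit protocol: observe context, choose action, and see the
# reward only for the chosen action (bandit feedback).
import numpy as np

def run_contextual_bandit(env, learner, T):
    """env.observe_context(t) -> context; env.reward(t, action) -> reward in [0, 1],
    revealed only for the chosen action. learner has .act(context) and .update(...)."""
    total_reward = 0.0
    for t in range(T):
        context = env.observe_context(t)          # 1. observe context_t
        action = learner.act(context)             # 2. choose action_t in Actions
        reward = env.reward(t, action)            # 3. collect reward_t(action_t)
        learner.update(context, action, reward)   # rewards of other actions stay hidden
        total_reward += reward
    return total_reward

class EpsilonGreedy:
    """Toy stand-in learner: per-(context, action) average rewards, explore with prob. eps."""
    def __init__(self, actions, eps=0.1, seed=0):
        self.actions, self.eps = list(actions), eps
        self.rng = np.random.default_rng(seed)
        self.sums, self.counts = {}, {}
    def act(self, context):
        if self.rng.random() < self.eps:                        # explore
            return self.actions[self.rng.integers(len(self.actions))]
        means = [self.sums.get((context, a), 0.0) / max(1, self.counts.get((context, a), 0))
                 for a in self.actions]
        return self.actions[int(np.argmax(means))]              # exploit
    def update(self, context, action, reward):
        self.sums[(context, action)] = self.sums.get((context, action), 0.0) + reward
        self.counts[(context, action)] = self.counts.get((context, action), 0) + 1
```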
Challenges in contextual bandit learning

1. Explore vs. exploit:
   Should use what you’ve already learned about actions.
   Must also learn about new actions that could be good.
2. Must use context effectively:
   Which action is good is likely to depend on the current context.
   Want to do as well as the best policy (i.e., decision rule)
   π: context → action
   from some rich class of policies Π.
3. Selection bias, especially when exploiting.
Why is exploration necessary?

No-exploration approach:

1. Using historical data, learn a “reward predictor” for each action based on context:
   reward(action | context).
2. Then deploy policy ˆπ, given by
   ˆπ(context) := arg max_{action ∈ Actions} reward(action | context),
   and collect more data.
Using no-exploration

Example: two actions {a1, a2}, two contexts {x1, x2}.
Suppose the initial policy says ˆπ(x1) = a1 and ˆπ(x2) = a2.

Observed rewards (— = not observed):
        a1    a2
  x1   0.7    —
  x2    —    0.1

Reward estimates (unobserved entries default to 0.5):
        a1    a2
  x1   0.7   0.5
  x2   0.5   0.1

New policy: ˆπ′(x1) = ˆπ′(x2) = a1.

Observed rewards after deploying ˆπ′:
        a1    a2
  x1   0.7    —
  x2   0.3   0.1

Updated reward estimates:
        a1    a2
  x1   0.7   0.5
  x2   0.3   0.1

True rewards:
        a1    a2
  x1   0.7   1.0
  x2   0.3   0.1

Never try a2 in context x1. Highly sub-optimal.
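The feedback loop in this example is easy to simulate. The sketch below reconstructs it, filling unobserved reward estimates with the same 0.5 default as the tables above; all names in it are illustrative.

```python
# Sketch of the no-exploration feedback loop from the example:
# greedy updates get stuck and never try a2 in context x1.
true_rewards = {("x1", "a1"): 0.7, ("x1", "a2"): 1.0,
                ("x2", "a1"): 0.3, ("x2", "a2"): 0.1}
observed = {("x1", "a1"): 0.7, ("x2", "a2"): 0.1}    # data collected under the initial policy

def greedy_policy(observed, default=0.5):
    """Estimate each (context, action) reward from observations (default if unobserved),
    then act greedily with respect to the estimates."""
    estimate = lambda x, a: observed.get((x, a), default)
    return {x: max(("a1", "a2"), key=lambda a: estimate(x, a)) for x in ("x1", "x2")}

for round_ in range(3):
    policy = greedy_policy(observed)
    print(round_, policy)                             # stays at {'x1': 'a1', 'x2': 'a1'}
    for x in ("x1", "x2"):                            # deploy policy; observe only chosen actions
        a = policy[x]
        observed[(x, a)] = true_rewards[(x, a)]

print(("x1", "a2") in observed)                       # False: a2 is never tried in context x1
```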
Framework for simultaneous exploration and exploitation

Initialize distribution W over some policies.
For t = 1, 2, . . . , T:
1. Observe context_t.
2. Randomly pick policy ˆπ ∼ W, and choose action_t = ˆπ(context_t) ∈ Actions.
3. Collect reward_t(action_t) ∈ [0, 1], and update distribution W.

Can use “inverse propensity” importance weights to account for sampling bias when estimating expected rewards of policies.
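The inverse-propensity trick here is the same one used for active learning above. The sketch below gives an unbiased estimate of a policy's expected reward from logged (context, action, reward, propensity) tuples; the log format and function names are illustrative assumptions.

```python
# Sketch of inverse-propensity scoring (IPS): estimate a policy's expected reward from
# bandit data logged by a randomized policy that recorded its action probabilities.
def ips_value(policy, log):
    """policy(context) -> action.
    log: iterable of (context, action, reward, prob), where prob is the probability
    with which the logging policy chose that action."""
    total, n = 0.0, 0
    for context, action, reward, prob in log:
        n += 1
        if policy(context) == action:     # only rounds where the policies agree contribute
            total += reward / prob        # reweight by the inverse propensity
    return total / n if n else 0.0

# Example: evaluating the "always a1" policy on a toy uniformly-logged data set.
log = [("x1", "a1", 0.7, 0.5), ("x1", "a2", 1.0, 0.5),
       ("x2", "a1", 0.3, 0.5), ("x2", "a2", 0.1, 0.5)]
print(ips_value(lambda x: "a1", log))     # (0.7/0.5 + 0.3/0.5) / 4 = 0.5
```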
Algorithms for contextual bandit learning

How to choose the distribution W over policies?

Exp4 (Auer, Cesa-Bianchi, Freund, and Schapire, FOCS 1995)
   Maintains exponential weights over all policies in a policy class Π.

Monster (Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin, and Zhang, UAI 2011)
   Solves optimization problem to come up with poly(T, log |Π|)-sparse weights over policies.

Mini-monster (Agarwal, Hsu, Kale, Langford, Li, and Schapire, ICML 2014)
   Simpler and faster algorithm to solve optimization problem; gets much sparser weights over policies.

Latter two methods rely on a good (non-interactive) learning algorithm for policy class Π.
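For intuition about the exponential-weights approach, here is a hedged sketch in the spirit of Exp4 for a small, explicitly enumerated policy class; the uniform-exploration mixing, the constants, and all names follow one common presentation rather than the talk itself, and it is only practical when Π can be enumerated.

```python
# Sketch in the spirit of Exp4: exponential weights over an enumerated policy class,
# with inverse-propensity reward estimates driving the update.
import numpy as np

def exp4_sketch(policies, actions, env, T, gamma=0.1, seed=0):
    """policies: list of functions context -> action (an element of actions).
    env(t) -> (context, reward_fn), where reward_fn(action) gives that round's reward."""
    rng = np.random.default_rng(seed)
    K = len(actions)
    log_w = np.zeros(len(policies))                     # log-weights over policies
    for t in range(T):
        context, reward_fn = env(t)
        W = np.exp(log_w - log_w.max())
        W /= W.sum()                                    # current distribution over policies
        advice = np.array([actions.index(pi(context)) for pi in policies])
        p = np.full(K, gamma / K)                       # mix in uniform exploration
        for j, a in enumerate(advice):
            p[a] += (1 - gamma) * W[j]
        a_t = int(rng.choice(K, p=p))
        r = reward_fn(actions[a_t])                     # bandit feedback: chosen action only
        r_hat = r / p[a_t]                              # inverse-propensity reward estimate
        # Each policy is credited with the estimated reward of the action it recommended.
        log_w += (gamma / K) * r_hat * (advice == a_t)
    W = np.exp(log_w - log_w.max())
    return W / W.sum()                                  # final distribution over policies
```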
Wrap-up

Interactive machine learning models capture how machine learning is used in many applications better than non-interactive frameworks do.

Sampling bias is a pervasive issue.

Research question: how to use non-interactive machine learning technology in these new interactive settings.

Thanks!
