Active Learning
COMS 6998-4: Learning and Empirical Inference

Irina Rish
IBM T.J. Watson Research Center
Outline
   Motivation
   Active learning approaches
      Membership queries
     Uncertainty Sampling
     Information-based loss functions
     Uncertainty-Region Sampling
     Query by committee

   Applications
     Active Collaborative Prediction
     Active Bayes net learning


Standard supervised learning model
Given m labeled points, we want to learn a classifier with
misclassification rate < ε, chosen from a hypothesis class H with
VC dimension d < ∞.

VC theory: we need m to be roughly d/ε, in the realizable case.
Active learning
In many situations – like speech recognition and document
retrieval – unlabeled data is easy to come by, but there is a
charge for each label.




What is the minimum number of labels needed to achieve the
target error rate?
What is Active Learning?


   Unlabeled data are readily available; labels are
    expensive

   Want to use adaptive decisions to choose which
    labels to acquire for a given dataset

   Goal is accurate classifier with minimal cost


Active learning warning

   Choice of data is only as good as the model itself
   If we assume a linear model, then two data points suffice
   What happens when the data are not linear?
Active Learning Flavors

   Selective Sampling
     Pool
       Myopic
       Batch
     Sequential
   Membership Queries
Active Learning Approaches


 Membership queries
 Uncertainty Sampling
 Information-based loss functions
 Uncertainty-Region Sampling
 Query by committee




Problem

Many results in this framework, even for complicated
hypothesis classes.

[Baum and Lang, 1991] tried fitting a neural net to handwritten
characters.
Synthetic instances created were incomprehensible to humans!

[Lewis and Gale, 1992] tried training text classifiers.
“an artificial text created by a learning algorithm is unlikely to
be a legitimate natural language expression, and probably would
be uninterpretable by a human teacher.”


Uncertainty Sampling
[Lewis & Gale, 1994]
   Query the event that the current classifier is most
    uncertain about

    If uncertainty is measured as Euclidean distance to the decision
    boundary, we query the unlabeled point closest to it:

               [figure: unlabeled points x along a line]

   Used trivially in SVMs, graphical models, etc.
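To make the slide concrete, here is a minimal sketch (mine, not from the lecture) of distance-based uncertainty sampling with a linear classifier; the labeled points and pool below are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pool-based setup: a few labeled points, an unlabeled pool.
X_labeled = np.array([[0.0], [1.0], [4.0], [5.0]])
y_labeled = np.array([0, 0, 1, 1])
X_pool = np.array([[1.5], [2.5], [3.0], [4.5]])

clf = LogisticRegression().fit(X_labeled, y_labeled)

# Euclidean distance to the decision boundary: |w.x + b| / ||w||.
# The smallest distance marks the most uncertain pool point.
dist = np.abs(X_pool @ clf.coef_.ravel() + clf.intercept_) / np.linalg.norm(clf.coef_)
query_idx = int(np.argmin(dist))
print("query pool point", query_idx, "->", X_pool[query_idx])
```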
Information-based Loss Function
[MacKay, 1992]


   Maximize KL-divergence between posterior and prior


   Maximize reduction in model entropy between posterior
    and prior


   Minimize cross-entropy between posterior and prior


   All of these are notions of information gain

Query by Committee
[Seung et al. 1992, Freund et al. 1997]



   Prior distribution over hypotheses
   Samples a set of classifiers from distribution
   Queries an example based on the degree of
    disagreement between committee of classifiers

                [figure: unlabeled points on a line; committee members
                 A, B, C disagree about where the boundary lies]
Infogain-based Active Learning
Notation

We Have:

• Dataset, D

• Model parameter space, W

• Query algorithm, q




Dataset (D) Example
  t   Sex   Age     Test A   Test B   Test C   Disease
  0   M     40-50     0        1        1         ?
  1   F     50-60     0        1        0         ?
  2   F     30-40     0        0        0         ?
  3   F     60+       1        1        1         ?
  4   M     10-20     0        1        0         ?
  5   M     40-50     0        0        1         ?
  6   F     0-10      0        0        0         ?
  7   M     30-40     1        1        0         ?
  8   M     20-30     0        0        1         ?
Notation

We Have:

• Dataset, D

• Model parameter space, W

• Query algorithm, q




Model Example


                [figure: the model St → Ot — a probabilistic classifier]
Notation

T : Number of examples

Ot : Vector of features of example t

St : Class of example t
Model Example

 Patient state (St)
 St : DiseaseState




 Patient Observations (Ot)
 Ot1 : Gender
 Ot2 : Age
 Ot3 : TestA
 Ot4 : TestB
 Ot5 : TestC


Possible Model Structures


   [figure: two candidate model structures relating S to Gender, Age,
    TestA, TestB, TestC — different dependency patterns among the
    observations]
Model Space


             Model:  St → Ot

 Model parameters:  P(St),  P(Ot | St)

 Generative model:
 must be able to compute P(St = i, Ot = ot | w)
Model Parameter Space (W)
• W = space of possible
  parameter values

• Prior on parameters:      P(W )

• Posterior over models:  P(W | D) ∝ P(D | W) P(W)
                                    ∝ [ ∏t P(St, Ot | W) ] P(W)
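As a toy illustration of this update (my sketch, not from the slides), take a one-dimensional W — a single parameter w = P(St = 1) — discretized on a grid, with made-up labels:

```python
import numpy as np

grid = np.linspace(0.01, 0.99, 99)        # parameter space W, discretized
prior = np.ones_like(grid) / len(grid)    # uniform prior P(W)
labels = [1, 0, 1, 1, 0]                  # observed classes S_t (made up)

likelihood = np.ones_like(grid)           # P(D | W) = prod_t P(S_t | w)
for s in labels:
    likelihood *= grid if s == 1 else (1.0 - grid)

posterior = likelihood * prior            # P(W | D), up to normalization
posterior /= posterior.sum()
print("posterior mean of P(S=1):", (grid * posterior).sum())
```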
Notation

We Have:

• Dataset, D

• Model parameter space, W

• Query algorithm, q



                 q(W,D) returns t*, the next sample to label

Game

while NotDone
  •   Learn P(W | D)
  •   q chooses next example to label
  •   Expert adds label to D




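A Python skeleton of this game (a sketch: learn_posterior, q, and ask_expert are hypothetical placeholders for the three steps, and a label budget stands in for "NotDone"):

```python
def active_learning_loop(D, unlabeled, budget, learn_posterior, q, ask_expert):
    """One round per label: learn P(W|D), let q pick, ask the expert."""
    for _ in range(budget):
        posterior = learn_posterior(D)     # learn P(W | D)
        t_star = q(posterior, unlabeled)   # q chooses the next example to label
        D[t_star] = ask_expert(t_star)     # expert adds the label to D
        unlabeled.remove(t_star)
    return D
```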
Simulation

O1   O2   O3   O4   O5   O6   O7
S1   S2=false   S3   S4   S5=true   S6   S7=false

[figure: the query algorithm q puzzles ("hmm…") over which of the
 still-unlabeled states S1, S3, S4, S6 to ask about]
Active Learning Flavors
• Pool
  (“random access” to patients)

• Sequential
  (must decide as patients walk in the door)




q?
• Recall: q(W,D) returns the “most
  interesting” unlabelled example.

• Well, what makes a doctor curious about a
  patient?




1994




Score Function


score_uncert(St) = uncertainty(P(St | Ot))
                 = H(St)
                 = − Σi P(St = i) log P(St = i)
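In code, the score is just the entropy of the predicted class distribution; a minimal sketch (the pool probabilities below are made up):

```python
import numpy as np

def score_uncert(p_st):
    """H(S_t) = -sum_i P(S_t = i) log P(S_t = i)."""
    p = np.asarray(p_st, dtype=float)
    p = p[p > 0]                      # treat 0 log 0 as 0
    return float(-(p * np.log(p)).sum())

# Query the example whose class distribution has the highest entropy.
pool_probs = [[0.02, 0.98], [0.12, 0.88], [0.50, 0.50]]
query_idx = max(range(len(pool_probs)), key=lambda t: score_uncert(pool_probs[t]))
print(query_idx)                      # -> 2, the most uncertain example
```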
Uncertainty Sampling Example
  t   Sex   Age     Test A   Test B   Test C   St           P(St)   H(St)
  1   M     20-30     0        1        1      ?            0.02    0.043
  2   F     20-30     0        1        0      ?            0.01    0.024
  3   F     30-40     1        0        0      ?            0.05    0.086
  4   F     60+       1        1        0      ? → FALSE    0.12    0.159
  5   M     10-20     0        1        0      ?            0.01    0.024
  6   M     20-30     1        1        1      ?            0.96    0.073
Uncertainty Sampling Example
  t   Sex   Age     Test A   Test B   Test C   St           P(St)   H(St)
  1   M     20-30     0        1        1      ?            0.01    0.024
  2   F     20-30     0        1        0      ?            0.02    0.043
  3   F     30-40     1        0        0      ?            0.04    0.073
  4   F     60+       1        1        0      FALSE        0.00    0.00
  5   M     10-20     0        1        0      ? → TRUE     0.06    0.112
  6   M     20-30     1        1        1      ?            0.97    0.059
Uncertainty Sampling

GOOD: couldn’t be easier
GOOD: often performs pretty well


BAD: H(St) measures uncertainty about the samples,
 not information gain about the model

        Sensitive to noisy samples
Can we do better than
uncertainty sampling?




1992




       Informative with
       respect to what?




Model Entropy
   [figure: three posteriors P(W|D) over W — flat (H(W) = high),
    more peaked (…better…), and a point mass (H(W) = 0)]
Information-Gain
• Choose the example that is expected to
  most reduce H(W)

• I.e., maximize H(W) − H(W | St)

             (current model-space entropy minus the expected
              model-space entropy if we learn St)
Score Function


score_IG(St) = MI(St; W)
             = H(W) − H(W | St)
We usually can't integrate over all models to get H(W) or H(W | St):

                   H(W) = − ∫w P(w) log P(w) dw

…but we can sample a committee C of models from P(W | D):

                   H(W) ∝ H(C) = − Σc∈C P(c) log P(c)
Conditional Model Entropy


        H(W) = − ∫w P(w) log P(w) dw

 H(W | St = i) = − ∫w P(w | St = i) log P(w | St = i) dw

 H(W | St) = − Σi P(St = i) ∫w P(w | St = i) log P(w | St = i) dw
Score Function



score_IG(St) = H(C) − H(C | St)
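A sketch of estimating this score with a sampled committee C, assuming each member c reports P(St = i | c) and carries a weight P(c); the names and numbers are illustrative only:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def score_ig(committee_probs, committee_prior):
    """Estimate H(C) - H(C | St): committee_probs[c][i] = P(St=i | model c)."""
    probs = np.asarray(committee_probs)      # shape (n_models, n_classes)
    prior = np.asarray(committee_prior)      # P(c) for each committee member
    p_st = prior @ probs                     # P(St=i) = sum_c P(c) P(St=i | c)
    h_c_given_st = 0.0
    for i, p_i in enumerate(p_st):
        post_c = prior * probs[:, i] / p_i   # P(c | St=i), by Bayes' rule
        h_c_given_st += p_i * entropy(post_c)
    return entropy(prior) - h_c_given_st

# Two equally weighted models that disagree sharply -> high score.
print(score_ig([[0.9, 0.1], [0.1, 0.9]], [0.5, 0.5]))
```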
  t   Sex   Age     Test A   Test B   Test C   St   P(St)   Score = H(C) − H(C|St)
  1   M     20-30     0        1        1      ?     0.02         0.53
  2   F     20-30     0        1        0      ?     0.01         0.58
  3   F     30-40     1        0        0      ?     0.05         0.40
  4   F     60+       1        1        1      ?     0.12         0.49
  5   M     10-20     0        1        0      ?     0.01         0.57
  6   M     20-30     0        0        1      ?     0.02         0.52
Score Function


score_IG(St) = H(C) − H(C | St)
             = H(St) − H(St | C)

                        Familiar?
Uncertainty Sampling &
   Information Gain


score_Uncertain(St) = H(St)

score_InfoGain(St) = H(St) − H(St | C)
But there is a problem…




If our objective is to reduce
the prediction error, then


“the expected information gain
of an unlabeled sample is NOT
a sufficient criterion for
constructing good queries”




Strategy #2:
Query by Committee



Temporary assumptions:
  Pool          → Sequential
  P(W | D)      → Version space
  Probabilistic → Noiseless

   QBC attacks the size of the "version space"
O1   O2   O3   O4   O5   O6   O7
S1   S2   S3   S4   S5   S6   S7

[figure: Model #1 and Model #2, drawn from the version space,
 both predict FALSE for the candidate example]
O1   O2   O3   O4   O5   O6   O7
S1   S2   S3   S4   S5   S6   S7

[figure: this time Model #1 and Model #2 both predict TRUE]
O1   O2   O3   O4   O5   O6   O7
S1   S2   S3   S4   S5   S6   S7

[figure: Model #1 predicts FALSE, Model #2 predicts TRUE]

   Ooh, now we're going to learn something for sure!
   One of them is definitely wrong.
The Original QBC
Algorithm


As each example arrives…

•   Choose a committee, C, (usually of size 2)
    randomly from Version Space

•   Have each member of C classify it

•   If the committee disagrees, select it.
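A toy rendering of this loop (the 1-D threshold version space below is made up for illustration):

```python
import random

def qbc_select(version_space, x, committee_size=2):
    """Sample a committee from the version space; query x iff it disagrees."""
    committee = random.sample(version_space, committee_size)
    predictions = {h(x) for h in committee}
    return len(predictions) > 1            # True -> ask for x's label

# Toy version space: 1-D threshold classifiers consistent with data so far.
version_space = [lambda x, t=t: x > t for t in (1.0, 2.0, 3.0)]
print(qbc_select(version_space, 2.5))      # committee may disagree here
```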
1992




Infogain vs Query by Committee
[Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]

First idea: Try to rapidly reduce volume of version space?
Problem: doesn’t take data distribution into account.

        [figure: hypothesis class H with several hypotheses]

Which pair of hypotheses is closest? It depends on the data distribution P.
Distance measure on H: d(h, h') = P(h(x) ≠ h'(x))
Query-by-committee
First idea: Try to rapidly reduce volume of version space?
Problem: doesn’t take data distribution into account.
To keep things simple, say d(h,h’) = Euclidean distance



  [figure: the version space shrinks in volume, but the surviving
   hypotheses still disagree on much of the data — the error is
   likely to remain large!]
Query-by-committee
Elegant scheme which decreases volume in a manner which is
sensitive to the data distribution.

Bayesian setting: given a prior π on H

H1 = H
For t = 1, 2, …
        receive an unlabeled point xt drawn from P
        [informally: is there a lot of disagreement about xt in Ht?]
        choose two hypotheses h,h’ randomly from (π, Ht)
        if h(xt) ≠ h’(xt): ask for xt’s label
        set Ht+1
Problem: how to implement it efficiently?
Query-by-committee
For t = 1, 2, …
         receive an unlabeled point xt drawn from P
         choose two hypotheses h,h’ randomly from (π, Ht)
         if h(xt) ≠ h’(xt): ask for xt’s label
         set Ht+1

Observation: the probability of getting pair (h,h’) in the inner
loop (when a query is made) is proportional to π(h) π(h’) d(h,h’).



        [figure: within Ht, a far-apart pair (h, h') vs. a nearby pair —
         the far pair is much more likely to trigger a query]
Query-by-committee
Label bound:
For H = {linear separators in Rd}, P = uniform distribution,
just d log 1/ε labels to reach a hypothesis with error < ε.

Implementation: need to randomly pick h according to (π, Ht).

e.g. H = {linear separators in Rd}, π = uniform distribution:

   [figure: version space Ht drawn as a convex body]

   How do you pick a random point from a convex body?
Sampling from convex bodies


 By random walk! Two standard walks:
  1. Ball walk
  2. Hit-and-run

[Gilad-Bachrach, Navot, Tishby 2005] studies random walks and
also ways to kernelize QBC.
Some challenges

[1] For linear separators, analyze the label complexity for
some distribution other than uniform!

[2] How to handle nonseparable data?
Need a robust base learner



   [figure: + and − points around the true boundary; some points fall
    on the wrong side (nonseparable data)]
Active Collaborative Prediction
Approach: Collaborative Prediction (CP)

    QoS measure (e.g. bandwidth)               Movie ratings
             Server1   Server2   Server3                Matrix   Geisha   Shrek
   Client1    3674                  18          Alina      ?       1        3
   Client2     187       567                    Gerry      4       2
   Client3                        1688          Irina      9               10
   Client4    3009       703        ?           Raja               4

   Given previously observed ratings R(x,y), where X is a
   "user" and Y is a "product", predict unobserved ratings

              - will Alina like "The Matrix"? (unlikely)

              - will Client 86 have fast download from Server 39?

              - will member X of the funding committee approve our project Y?
Collaborative Prediction = Matrix Approximation


   [figure: a 100 clients × 100 servers matrix with sparsely observed entries]

 • Important assumption:
   matrix entries are NOT independent,
   e.g. similar users have similar tastes

 • Approaches:
   mainly factorized models assuming
   hidden 'factors' that affect ratings
   (pLSA, MCVQ, SVD, NMF, MMMF, …)
   [figure: a ratings row (2 4 5 1 4 2) expressed as the user's 'weights'
    associated with 'factors' times the factors themselves]

Assumptions:
      - there are a number of (hidden) factors behind the user
        preferences that relate to (hidden) movie properties
         - movies have intrinsic values associated with such factors
         - users have intrinsic weights on such factors;
           user ratings are weighted (linear) combinations of the movies' values
rank k

   [figure: a sparsely observed ratings matrix Y approximated by a
    fully filled-in rank-k matrix X]

Objective: find a factorizable X = UV' that approximates Y

                      X = arg min_{X'} Loss(X', Y)

and satisfies some "regularization" constraints (e.g. rank(X) < k)

Loss function: depends on the nature of your problem
Matrix Factorization Approaches

   Singular value decomposition (SVD) – low-rank approximation
        Assumes fully observed Y and sum-squared loss

   In collaborative prediction, Y is only partially observed;
     low-rank approximation becomes a non-convex problem w/ many local minima

   Furthermore, we may not want sum-squared loss, but instead
      accurate predictions (0/1 loss, approximated by hinge loss)
      cost-sensitive predictions (missing a good server vs suggesting a bad one)
      ranking cost (e.g., suggest the k 'best' movies for a user)

      NON-CONVEX PROBLEMS!

   Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]
      replaces the bounded-rank constraint by bounded norms of the U, V' vectors
      convex optimization problem! – can be solved exactly by semi-definite programming
      strongly relates to learning max-margin classifiers (SVMs)

   Exploit MMMF's properties to augment it with active sampling!
Key Idea of MMMF
Rows – feature vectors, Columns – linear classifiers

   [figure: each column gives a linear classifier (a weight vector), each
    row a feature vector; the "margin" here = Dist(sample, line)]

   Xij = signij × marginij
   Predictorij = signij

   If signij > 0, classify as +1,
   otherwise classify as −1
MMMF: Simultaneous Search for Low-norm
Feature Vectors and Max-margin Classifiers




Active Learning with MMMF



- We extend MMMF to Active-MMMF using margin-based active sampling

- We investigate exploitation vs exploration trade-offs imposed by different heuristics

   [figure: the matrix of predicted margins; candidate entries to query]

   Margin-based heuristics:
     min-margin (most-uncertain)
     min-margin positive ("good" uncertain)
     max-margin ('safe' choice, but no info)
Active Max-Margin Matrix Factorization


       A-MMMF(M,s)
          1. Given a sparse matrix Y, learn approximation X = MMMF(Y)
          2. Using current predictions, actively select the "best s" samples and
          request their labels (e.g., test a client/server pair via an 'enforced' download)
          3. Add the new samples to Y
          4. Repeat 1-3
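A sketch of this loop, with `factorize` standing in for MMMF and `request_label` for the labeling oracle (e.g., an enforced download test); the min-margin (most-uncertain) heuristic from the previous slide picks the s entries:

```python
import numpy as np

def active_mmmf(Y, mask, s, factorize, request_label, n_rounds=10):
    """Alternate: fit X, then query the s smallest-margin unknown entries."""
    for _ in range(n_rounds):
        X = factorize(Y, mask)                    # 1. learn X ~ MMMF(Y)
        margins = np.abs(X)                       # |X_ij| = predicted margin
        margins[mask == 1] = np.inf               # never re-query known entries
        for idx in np.argsort(margins, axis=None)[:s]:
            i, j = np.unravel_index(idx, Y.shape)
            Y[i, j] = request_label(i, j)         # 2-3. label and add to Y
            mask[i, j] = 1
    return factorize(Y, mask)                     # 4. final refit
```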



Issues:
      Beyond simple greedy margin-based heuristics?
      Theoretical guarantees? not so easy with non-trivial learning
       methods and non-trivial data distributions
       (any suggestions???  )
Empirical Results


   Network latency prediction
   Bandwidth prediction (peer-to-peer)
   Movie Ranking Prediction
   Sensor net connectivity prediction



Empirical Results: Latency Prediction

             P2Psim data                       NLANR-AMP data




Active sampling with the most-uncertain (and most-uncertain-positive)
heuristics provides a consistent improvement over random and
least-uncertain-next sampling
Movie Rating Prediction (MovieLens)




Sensor Network Connectivity




Introducing Cost: Exploration vs Exploitation

      DownloadGrid:                                      PlanetLab:
   bandwidth prediction                              latency prediction




 Active sampling → lower prediction errors at lower cost (saves 100s of samples)
 (better prediction → better server-assignment decisions → faster downloads)

Active sampling achieves a good exploration vs exploitation trade-off:
reduced decision cost AND information gain
Conclusions

   Common challenge in many applications:
    need for cost-efficient sampling
   This talk: linear hidden factor models with active sampling
   Active sampling improves predictive accuracy while keeping
    sampling complexity low in a wide variety of applications
   Future work:
       Better active sampling heuristics?
       Theoretical analysis of active sampling performance?
       Dynamic Matrix Factorizations: tracking time-varying matrices
       Incremental MMMF? (solving from scratch every time is too costly)



References
Some of the most influential papers

•   Simon Tong, Daphne Koller.
    Support Vector Machine Active Learning with Applications to Text Classification.
    Journal of Machine Learning Research, Volume 2, pages 45-66, 2001.

•   Y. Freund, H. S. Seung, E. Shamir, N. Tishby.
    Selective sampling using the query by committee algorithm. Machine
    Learning, 28:133-168, 1997.

•   David Cohn, Zoubin Ghahramani, and Michael Jordan.
    Active learning with statistical models. Journal of Artificial Intelligence
    Research, (4):129-145, 1996.

•   David Cohn, Les Atlas and Richard Ladner.
    Improving generalization with active learning. Machine Learning,
    15(2):201-221, 1994.

•   D. J. C. MacKay.
    Information-Based Objective Functions for Active Data Selection. Neural
    Computation, vol. 4, no. 4, pp. 590-604, 1992.
NIPS papers
•   Francis Bach. Active learning for misspecified generalized linear models. NIPS-06
•   Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05
•   Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman.
    Active Learning For Identifying Function Threshold Boundaries . NIPS-05
•   Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05
•   Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. NIPS-05
•   Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05
•   Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05
•   Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04
•   Sanjoy Dasgupta. Analysis of a greedy active learning strategy. NIPS-04
•   T. Jaakkola and H. Siegelmann. Active Information Retrieval. NIPS-01
•   M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01
•   Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00
•   Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00
•   Thomas Hofmann and Joachim M. Buhmann. Active Data Clustering. NIPS-97
•   K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95
•   Anders Krogh, Jesper Vedelsby.
    Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94
•   Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94
•   David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models.
    NIPS-94
•   Sebastian B. Thrun and Knut Moller. Active Exploration in Dynamic Environments. NIPS-91
ICML papers
•   Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06
•   Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu.
    Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06
•   Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani.
    Active Sampling for Detecting Irrelevant Features. ICML-06
•   Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06
•   Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for
    Sampling in Time-Series Experiments With Application to Gene Expression Analysis. ICML-05
•   Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04
•   Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04
•   Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04
•   Greg Schohn and David Cohn. Less is More: Active Learning with Support Vector Machines, ICML-00
•   Simon Tong, Daphne Koller.
    Support Vector Machine Active Learning with Applications to Text Classification. ICML-00.

COLT papers
•   S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. COLT-05.
•   H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. COLT-92, pages 287-294, 1992.
Journal Papers
•   Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou.
    Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research
    (JMLR), vol. 6, pp. 1579-1619, 2005.
•   Simon Tong, Daphne Koller.
    Support Vector Machine Active Learning with Applications to Text Classification. Journal of
    Machine Learning Research. Volume 2, pages 45-66. 2001.
•   Y. Freund, H. S. Seung, E. Shamir, N. Tishby. 1997.
    Selective sampling using the query by committee algorithm. Machine Learning, 28:133--168
•   David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models,
    Journal of Artificial Intelligence Research, (4): 129-145, 1996.
•   David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning,
    Machine Learning 15(2):201-221, 1994.
•   D. J. C. Mackay. Information-Based Objective Functions for Active Data Selection. Neural
    Comput., vol. 4, no. 4, pp. 590--604, 1992.
•   Haussler, D., Kearns, M., and Schapire, R. E. (1994).
    Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension
    . Machine Learning, 14, 83--113
•   Fedorov, V. V. 1972. Theory of optimal experiment. Academic Press.
•   Saar-Tsechansky, M. and F. Provost.
    Active Sampling for Class Probability Estimation and Ranking. Machine Learning 54:2 2004,
    153-178.
Workshops
• http://domino.research.ibm.com/comm/researc




Appendix




Active Learning of Bayesian Networks
Entropy Function
• A measure of information in
  random event X with possible
  outcomes {x1,…,xn}
   H(x) = - Σi p(xi) log2 p(xi)

• Comments on entropy function:
   – Entropy of an event is zero when
     the outcome is known
   – Entropy is maximal when all
     outcomes are equally likely
• The entropy is the average minimum number of yes/no
  questions needed to determine the outcome (connection
  to binary search)                        [Shannon, 1948]
Kullback-Leibler divergence
• P is the true distribution; Q distribution is used to encode
  data instead of P
• KL divergence is the expected extra message length per
  datum that must be transmitted using Q

            DKL(P || Q) = Σi P(xi) log (P(xi)/Q(xi))
                        = Σi P(xi) log P(xi) – Σi P(xi) log Q(xi)
                        = H(P,Q) − H(P)
                        = cross-entropy − entropy


• Measure of how "wrong" Q is with respect to the true
  distribution P
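A direct transcription of the formula (natural log; assumes Q > 0 wherever P > 0):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) log(P(x_i) / Q(x_i))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0                         # 0 log 0 contributes nothing
    return float((p[nz] * np.log(p[nz] / q[nz])).sum())

p = [0.7, 0.2, 0.1]                    # "true" distribution P
q = [1/3, 1/3, 1/3]                    # coding distribution Q
print(kl_divergence(p, q))             # extra nats per datum for using Q
print(kl_divergence(p, p))             # 0.0: no penalty when Q matches P
```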
Learning Bayesian Networks

      Data
        +        →   Learner   →   [figure: a network over E, B, R, A, C
 Prior Knowledge                    with a conditional probability table for A]

                                     E    B     P(A | E,B)   P(¬A | E,B)
                                     e    b        .9           .1
                                     e    ¬b       .7           .3
                                     ¬e   b        .8           .2
                                     ¬e   ¬b       .99          .01

     • Model building
         • Parameter estimation
         • Causal structure discovery
     • Passive learning vs active learning
Active Learning

• Selective Active Learning
• Interventional Active Learning


• Obtain measure of quality of current model
• Choose query that most improves quality
• Update model


Active Learning: Parameter Estimation
[Tong & Koller, NIPS-2000]

• Given a BN structure G
• A prior distribution p(θ)
• The learner requests a particular instantiation q (query)

   Initial network G, p(θ)   -- query (q) -->    Active Learner
   + training data           <-- response (x) --
                             ==>  updated distribution p'(θ)

   [figure: the network E → A ← B before and after the update]

   How to update the parameter density?
   How to select the next query based on p?
Updating parameter density

• Do not update A, since we are fixing it
• If we select A, then do not update B
   • sampling from P(B | A = a) ≠ P(B)
• If we force A, then we can update B
   • sampling from P(B | A := a) = P(B)*
• Update all other nodes as usual
• Obtain the new density

   p(θ | A = a, X = x)

   [figure: network over B, A, J, M]
                                                  *Pearl 2000
Bayesian point estimation
• Goal: a single estimate θ̃
  – instead of a distribution p over θ

   [figure: density p(θ) with the estimate θ̃ and the true model θ']

• If we choose θ̃ and the true model is θ',
  then we incur some loss, L(θ' || θ̃)
Bayesian point estimation
• We do not know the true θ'
• Density p represents optimal beliefs over θ'
• Choose θ̃ that minimizes the expected loss

      θ̃ = argminθ ∫ p(θ') L(θ' || θ) dθ'

• Call θ̃ the Bayesian point estimate
• Use the expected loss of the Bayesian point
  estimate as a measure of quality of p(θ):
  – Risk(p) = ∫ p(θ') L(θ' || θ̃) dθ'
The Querying component
• Set the controllable variables so as to
  minimize the expected posterior risk:

    ExPRisk(p | Q=q) = Σx P(X = x | Q = q) ∫ p(θ | x) KL(θ || θ̃) dθ

• KL divergence is used as the loss
  (conditional KL-divergence):

  – KL(θ || θ') = Σi KL(Pθ(Xi|Ui) || Pθ'(Xi|Ui))
Algorithm Summary
• For each potential query q
• Compute ∆Risk(X|q)
• Choose q for which ∆Risk(X|q) is greatest
  – Cost of computing ∆Risk(X|q):
    • cost of Bayesian network inference
• Complexity: O(|Q| × cost of inference)
Uncertainty sampling
Maintain a single hypothesis, based on labels seen so far.
Query the point about which this hypothesis is most “uncertain”.

Problem: confidence of a single hypothesis may not accurately
represent the true diversity of opinion in the hypothesis class.

   [figure: a single hypothesis separating + and − points; the query X
    it is most uncertain about need not reflect the remaining
    disagreement in H]
Region of uncertainty
Current version space: portion of H consistent with labels so far.
“Region of uncertainty” = part of data space about which there is
still some uncertainty (i.e. disagreement within version space)

Suppose the data lie on a circle in R²; hypotheses are linear
separators (the spaces X and H are superimposed).

   [figure: the current version space and the region of uncertainty
    in data space]
Region of uncertainty
Algorithm [CAL92]:
of the unlabeled points which lie in the region of uncertainty,
pick one at random to query.


Data and hypothesis spaces, superimposed (both are the surface
of the unit sphere in Rd):

   [figure: the current version space and the region of uncertainty
    in data space]
Region of uncertainty
Number of labels needed depends on H and also on P.

Special case: H = {linear separators in Rd}, P = uniform
distribution over unit sphere.

Then: just d log 1/ε labels are needed to reach a hypothesis
with error rate < ε.

(Compare: [1] supervised learning needs d/ε labels;
 [2] this is the best we can hope for.)
Region of uncertainty
Algorithm [CAL92]:
of the unlabeled points which lie in the region of uncertainty,
pick one at random to query.

For more general distributions: suboptimal…



                                         Need to measure
                                         quality of a query – or
                                         alternatively, size of
                                         version space.

Expected infogain of a sample → uncertainty sampling!

More Related Content

PPT
activelearning.ppt
PPTX
Active learning
PPTX
Naive bayes
PPTX
Deep Reinforcement Learning
PDF
Robustness in deep learning
PPTX
Hyperparameter Tuning
PDF
Generative adversarial networks
PPT
Semi-supervised Learning
activelearning.ppt
Active learning
Naive bayes
Deep Reinforcement Learning
Robustness in deep learning
Hyperparameter Tuning
Generative adversarial networks
Semi-supervised Learning

What's hot (20)

PPTX
Recommender system introduction
PPTX
LinkedIn talk at Netflix ML Platform meetup Sep 2019
PPTX
Active learning: Scenarios and techniques
ODP
NAIVE BAYES CLASSIFIER
PDF
Relational knowledge distillation
PDF
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
PDF
Convolutional Neural Networks (CNN)
PDF
GANs and Applications
PDF
An introduction to reinforcement learning
PPTX
Deep neural networks
PPT
Recommendation system
PPTX
Introduction to Machine Learning
PPTX
Diffusion models beat gans on image synthesis
PDF
Intro to LLMs
PPTX
ONNX and MLflow
PDF
Link prediction
PDF
Sequential Decision Making in Recommendations
PPTX
Causal inference in practice
PPTX
Naive Bayes Presentation
PPTX
Supervised learning
Recommender system introduction
LinkedIn talk at Netflix ML Platform meetup Sep 2019
Active learning: Scenarios and techniques
NAIVE BAYES CLASSIFIER
Relational knowledge distillation
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Convolutional Neural Networks (CNN)
GANs and Applications
An introduction to reinforcement learning
Deep neural networks
Recommendation system
Introduction to Machine Learning
Diffusion models beat gans on image synthesis
Intro to LLMs
ONNX and MLflow
Link prediction
Sequential Decision Making in Recommendations
Causal inference in practice
Naive Bayes Presentation
Supervised learning
Ad

Similar to Active learning lecture (20)

PPT
Machine learning
PPT
AL slides.ppt
PPT
ppt
PPT
ppt
PDF
Introduction to Machine Learning
PPT
S10
PPT
S10
PPT
Machine Learning and Inductive Inference
PPT
AI_Lecture_34.ppt
PPT
M18 learning
PDF
An introduction to Machine Learning
DOC
Problem 1 – First-Order Predicate Calculus (15 points)
DOC
Problem 1 – First-Order Predicate Calculus (15 points)
PDF
Whitcher Ismrm 2009
PPT
Submodularity slides
PPT
Machine Learning
DOC
Spring 2003
PPT
Download presentation source
PDF
Logic-Based Representation, Reasoning and Machine Learning for Event Recognition
Machine learning
AL slides.ppt
ppt
ppt
Introduction to Machine Learning
S10
S10
Machine Learning and Inductive Inference
AI_Lecture_34.ppt
M18 learning
An introduction to Machine Learning
Problem 1 – First-Order Predicate Calculus (15 points)
Problem 1 – First-Order Predicate Calculus (15 points)
Whitcher Ismrm 2009
Submodularity slides
Machine Learning
Spring 2003
Download presentation source
Logic-Based Representation, Reasoning and Machine Learning for Event Recognition
Ad

Recently uploaded (20)

PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
Cloud computing and distributed systems.
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Electronic commerce courselecture one. Pdf
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
Approach and Philosophy of On baking technology
PDF
cuic standard and advanced reporting.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
A Presentation on Artificial Intelligence
Reach Out and Touch Someone: Haptics and Empathic Computing
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Cloud computing and distributed systems.
Network Security Unit 5.pdf for BCA BBA.
Electronic commerce courselecture one. Pdf
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Mobile App Security Testing_ A Comprehensive Guide.pdf
Machine learning based COVID-19 study performance prediction
Approach and Philosophy of On baking technology
cuic standard and advanced reporting.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Chapter 3 Spatial Domain Image Processing.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
A Presentation on Artificial Intelligence

Active learning lecture

  • 1. Active Learning COMS 6998-4: Learning and Empirical Inference Irina Rish IBM T.J. Watson Research Center
  • 2. Outline  Motivation  Active learning approaches  Membership queries  Uncertainty Sampling  Information-based loss functions  Uncertainty-Region Sampling  Query by committee  Applications  Active Collaborative Prediction  Active Bayes net learning 2
  • 3. Standard supervised learning model Given m labeled points, want to learn a classifier with misclassification rate <ε, chosen from a hypothesis class H with VC dimension d < 1. VC theory: need m to be roughly d/ε, in the realizable case. 3
  • 4. Active learning In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate? 4
  • 5. 5
  • 6. What is Active Learning?  Unlabeled data are readily available; labels are expensive  Want to use adaptive decisions to choose which labels to acquire for a given dataset  Goal is accurate classifier with minimal cost 6
  • 7. Active learning warning  Choice of data is only as good as the model itself  Assume a linear model, then two data points are sufficient  What happens when data are not linear? 7
  • 8. Active Learning Flavors Selective Sampling Membership Queries Pool Sequential Myopic Batch 8
  • 9. Active Learning Approaches  Membership queries  Uncertainty Sampling  Information-based loss functions  Uncertainty-Region Sampling  Query by committee 9
  • 10. 10
  • 11. 11
  • 12. Problem Many results in this framework, even for complicated hypothesis classes. [Baum and Lang, 1991] tried fitting a neural net to handwritten characters. Synthetic instances created were incomprehensible to humans! [Lewis and Gale, 1992] tried training text classifiers. “an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher.” 12
  • 13. 13
  • 14. 14
  • 15. 15
  • 16. Uncertainty Sampling [Lewis & Gale, 1994]  Query the event that the current classifier is most uncertain about If uncertainty is measured in Euclidean distance: x x x x x x x x x x  Used trivially in SVMs, graphical models, etc. 16
  • 17. Information-based Loss Function [MacKay, 1992]  Maximize KL-divergence between posterior and prior  Maximize reduction in model entropy between posterior and prior  Minimize cross-entropy between posterior and prior  All of these are notions of information gain 17
  • 18. Query by Committee [Seung et al. 1992, Freund et al. 1997]  Prior distribution over hypotheses  Samples a set of classifiers from distribution  Queries an example based on the degree of disagreement between committee of classifiers x x x x x x x x x x A B C 18
  • 20. Notation We Have: • Dataset, D • Model parameter space, W • Query algorithm, q 20
  • 21. Dataset (D) Example t Sex Age Test A Test B Test C Disease 0 M 40-50 0 1 1 ? 1 F 50-60 0 1 0 ? 2 F 30-40 0 0 0 ? 3 F 60+ 1 1 1 ? 4 M 10-20 0 1 0 ? 5 M 40-50 0 0 1 ? 6 F 0-10 0 0 0 ? 7 M 30-40 1 1 0 ? 8 M 20-30 0 0 1 ? 21
  • 22. Notation We Have: • Dataset, D • Model parameter space, W • Query algorithm, q 22
  • 23. Model Example St Ot Probabilistic Classifier Notation T : Number of examples Ot : Vector of features of example t St : Class of example t 23
  • 24. Model Example Patient state (St) St : DiseaseState Patient Observations (Ot) Ot1 : Gender Ot2 : Age Ot3 : TestA Ot4 : TestB Ot5 : TestC 24
  • 25. Possible Model Structures Gender TestB Age Gender S TestA S TestA Age TestB TestC TestC 25
  • 26. Model Space Model: St Ot Model Parameters: P(St) P(Ot|St) Generative Model: Must be able to compute P(St=i, Ot=ot | w) 26
  • 27. Model Parameter Space (W) • W = space of possible parameter values • Prior on parameters: P(W ) • Posterior over models: P (W | D ) ∝ P ( D | W ) P (W ) T ∝ ∏ P ( St , Ot | W ) P(W ) t 27
  • 28. Notation We Have: • Dataset, D • Model parameter space, W • Query algorithm, q q(W,D) returns t*, the next sample to label 28
  • 29. Game while NotDone • Learn P(W | D) • q chooses next example to label • Expert adds label to D 29
  • 30. Simulation O1 O2 O3 O4 O5 O6 O7 e S2 fals e S1 = S3 S4 S5 true = S6 S7 fals = S2 S5 S7 ? ? hmm… ? q 30
  • 31. Active Learning Flavors • Pool (“random access” to patients) • Sequential (must decide as patients walk in the door) 31
  • 32. q? • Recall: q(W,D) returns the “most interesting” unlabelled example. • Well, what makes a doctor curious about a patient? 32
  • 33. 1994 33
  • 34. Score Function scoreuncert ( S t ) = uncertainty(P ( St | Ot )) = H ( St ) = ∑ P ( St = i ) log P( St = i ) i 34
  • 35. Uncertainty Sampling Example t Sex Age Test Test Test St P(St) H(St) A B C 1 M 20- 30 0 1 1 ? 0.02 0.043 2 F 20- 30 0 1 0 ? 0.01 0.024 3 F 30- 40 1 0 0 ? 0.05 0.086 4 F 60+ 1 1 0 ? FALSE 0.12 0.159 5 M 10- 20 0 1 0 ? 0.01 0.024 6 M 20- 30 1 1 1 ? 0.96 0.073 35
  • 36. Uncertainty Sampling Example t Sex Age Test Test Test St P(St) H(St) A B C 1 M 20- 30 0 1 1 ? 0.01 0.024 2 F 20- 30 0 1 0 ? 0.02 0.043 3 F 30- 40 1 0 0 ? 0.04 0.073 4 F 60+ 1 1 0 ? FALSE 0.00 0.00 5 M 10- 20 0 1 0 ? TRUE 0.06 0.112 6 M 20- 30 1 1 1 ? 0.97 0.059 36
  • 37. Uncertainty Sampling GOOD: couldn’t be easier GOOD: often performs pretty well BAD: H(St) measures information gain about the samples, not the model Sensitive to noisy samples 37
  • 38. Can we do better than uncertainty sampling? 38
  • 39. 1992 Informative with respect to what? 39
  • 40. Model Entropy P(W|D) P(W|D) P(W|D) W W W H(W) = high …better… H(W) = 0 40
  • 41. Information-Gain • Choose the example that is expected to most reduce H(W) • I.e., Maximize H(W) – H(W | St) Current model Expected model space space entropy entropy if we learn St 41
  • 42. Score Function score IG ( St ) = MI ( St ;W ) = H (W ) − H (W | St ) 42
  • 43. We usually can’t just sum over all models to get H(St|W) H (W ) = − ∫ P ( w) log P ( w) dw w …but we can sample from P(W | D) H (W ) ∝ H (C ) = −∑ P (c) log P (c) c∈C 43
  • 44. Conditional Model Entropy H (W ) = ∫ P ( w) log P ( w) dw w H (W | St = i ) = ∫ P ( w | St = i ) log P ( w | St = i ) dw w H (W | St ) = ∑ P ( St = i ) ∫ P( w | S t = i ) log P ( w | St = i ) dw w i 44
  • 45. Score Function score IG ( St ) = H (C ) − H (C | St ) 45
  • 46. t Sex Age Test Test Test St P(St) Score = A B C H(C) - H(C|St) 1 M 20-3 0 1 1 0 ? 0.02 0.53 2 F 20-3 0 1 0 0 ? 0.01 0.58 3 F 30-4 1 0 0 0 ? 0.05 0.40 4 F 60+ 1 1 1 ? 0.12 0.49 5 M 10-2 0 1 0 0 ? 0.01 0.57 20-3 6 M 0 0 0 1 ? 0.02 0.52 46
  • 47. Score Function score IG ( St ) = H (C ) − H (C | St ) = H ( St ) − H ( St | C ) Familiar? 47
  • 48. Uncertainty Sampling & Information Gain scoreUncertain ( St ) = H ( St ) score InfoGain ( St ) = H ( St ) − H ( St | C ) 48
  • 49. But there is a problem… 49
  • 50. If our objective is to reduce the prediction error, then “the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries” 50
  • 51. Strategy #2: Query by Committee Temporary Assumptions: Pool  Sequential P(W | D)  Version Space Probabilistic  Noiseless  QBC attacks the size of the “Version space” 51
  • 52. O1 O2 O3 O4 O5 O6 O7 S1 S2 S3 S4 S5 S6 S7 FALSE! FALSE! Model #1 Model #2 52
  • 53. O1 O2 O3 O4 O5 O6 O7 S1 S2 S3 S4 S5 S6 S7 TRUE! TRUE! Model #1 Model #2 53
  • 54. O1 O2 O3 O4 O5 O6 O7 S1 S2 S3 S4 S5 S6 S7 FALSE! TRUE! Ooh, now we’re going to learn something for sure! One of them is definitely wrong. Model #1 Model #2 54
  • 55. The Original QBC Algorithm As each example arrives… • Choose a committee, C, (usually of size 2) randomly from Version Space • Have each member of C classify it • If the committee disagrees, select it. 55
  • 56. 1992 56
  • 57. Infogain vs Query by Committee [Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997] First idea: Try to rapidly reduce volume of version space? Problem: doesn’t take data distribution into account. H: Which pair of hypotheses is closest? Depends on data distribution P. Distance measure on H: d(h,h’) = P(h(x) ≠ h’(x)) 57
  • 58. Query-by-committee First idea: Try to rapidly reduce volume of version space? Problem: doesn’t take data distribution into account. To keep things simple, say d(h,h’) = Euclidean distance H: Error is likely to remain large! 58
  • 59. Query-by-committee Elegant scheme which decreases volume in a manner which is sensitive to the data distribution. Bayesian setting: given a prior π on H H1 = H For t = 1, 2, … receive an unlabeled point xt drawn from P [informally: is there a lot of disagreement about xt in Ht?] choose two hypotheses h,h’ randomly from (π, Ht) if h(xt) ≠ h’(xt): ask for xt’s label set Ht+1 59 Problem: how to implement it efficiently?
  • 60. Query-by-committee For t = 1, 2, … receive an unlabeled point xt drawn from P choose two hypotheses h,h’ randomly from (π, Ht) if h(xt) ≠ h’(xt): ask for xt’s label set Ht+1 Observation: the probability of getting pair (h,h’) in the inner loop (when a query is made) is proportional to π(h) π(h’) d(h,h’). vs. 60 Ht
  • 61. 61
  • 62. Query-by-committee Label bound: For H = {linear separators in Rd}, P = uniform distribution, just d log 1/ε labels to reach a hypothesis with error < ε. Implementation: need to randomly pick h according to (π, Ht). e.g. H = {linear separators in Rd}, π = uniform distribution: How do you pick a Ht random point from a convex body? 62
  • 63. Sampling from convex bodies By random walk! 2. Ball walk 3. Hit-and-run [Gilad-Bachrach, Navot, Tishby 2005] Studies random walks and also ways to kernelize QBC. 63
  • 64. 64
  • 65. Some challenges [1] For linear separators, analyze the label complexity for some distribution other than uniform! [2] How to handle nonseparable data? Need a robust base learner + true boundary - 65
  • 67. Approach: Collaborative Prediction (CP) QoS measure (e.g. bandwidth) Movie Ratings Server1 Server2 Server3 Matrix Geisha Shrek Client1 3674 18 Alina ? 1 3 Client2 187 567 Gerry 4 2 Client3 1688 Irina 9 10 Client4 3009 703 ? Raja 4 Given previously observed ratings R(x,y), where X is a “user” and Y is a “product”, predict unobserved ratings - will Alina like “The Matrix”? (unlikely ) - will Client 86 have fast download from Server 39? - will member X of funding committee approve our project Y?  67
  • 68. Collaborative Prediction = Matrix Approximation 100 servers • Important assumption: 100 clients matrix entries are NOT independent, e.g. similar users have similar tastes • Approaches: mainly factorized models assuming hidden ‘factors’ that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …) 68
  • 69. 2 4 5 1 4 2 User’s ‘weights’ Factors associated with ‘factors’’ Assumptions: - there is a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties - movies have intrinsic values associated with such factors - users have intrinsic weights with such factors; user ratings a weighted (linear) combinations of movie’s values 69
  • 70. 2 4 5 1 4 2 70
  • 71. 2 4 5 1 4 2 71
  • 72. rank k 2 4 5 1 4 2 7 2 5 4 5 3 1 4 2 3 1 2 2 5 3 1 2 4 2 2 7 5 6 4 2 4 1 3 1 4 3 2 2 4 1 4 3 1 = 3 3 4 2 3 1 2 3 4 3 2 4 5 2 3 1 4 3 2 2 3 2 1 3 4 3 5 2 2 2 1 4 8 2 2 9 1 8 3 4 5 1 3 1 1 4 1 2 3 5 1 1 5 6 4 Y X Objective: find a factorizable X=UV’ that approximates Y X = arg min Loss ( X ' , Y ) X' and satisfies some “regularization” constraints (e.g. rank(X) < k) Loss functions: depends on the nature of your problem 72
  • 73. Matrix Factorization Approaches  Singular value decomposition (SVD) – low-rank approximation  Assumes fully observed Y and sum-squared loss  In collaborative prediction, Y is only partially observed Low-rank approximation becomes non-convex problem w/ many local minima  Furthermore, we may not want sum-squared loss, but instead  accurate predictions (0/1 loss, approximated by hinge loss)  cost-sensitive predictions (missing a good server vs suggesting a bad one)  ranking cost (e.g., suggest k ‘best’ movies for a user) NON-CONVEX PROBLEMS!  Use instead the state-of-art Max-Margin Matrix Factorization [Srebro 05]  replaces bounded rank constraint by bounded norm of U, V’ vectors  convex optimization problem! – can be solved exactly by semi-definite programming  strongly relates to learning max-margin classifiers (SVMs)  Exploit MMMF’s properties to augment it with active sampling! 73
• 74. Key Idea of MMMF. Rows = feature vectors, columns = linear classifiers (weight vectors); the “margin” here = Dist(sample, line). Xij = signij × marginij; Predictorij = signij: if signij > 0, classify as +1, otherwise classify as −1. 74
  • 75. MMMF: Simultaneous Search for Low-norm Feature Vectors and Max-margin Classifiers 75
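The exact MMMF of [Srebro 05] is a semi-definite program; as a rough illustration only, here is a local-search sketch of the same objective: hinge loss on observed ±1 entries plus Frobenius-norm penalties on U and V (the bounded-norm surrogate for bounded rank). All parameter values here are made up:

```python
import numpy as np

def mmmf_sketch(Y, mask, k=10, lam=0.1, lr=0.01, iters=500, seed=0):
    # Y: n x m matrix with +/-1 observed entries; mask: boolean, True
    # where Y is observed. Fits X = U V' by gradient descent on
    # hinge loss + (lam/2)(||U||^2 + ||V||^2). A sketch, not the SDP.
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(iters):
        X = U @ V.T
        # hinge-loss gradient on observed entries: active where Y*X < 1
        active = mask & (Y * X < 1)
        G = -Y * active                      # dLoss/dX
        U -= lr * (G @ V + lam * U)
        V -= lr * (G.T @ U + lam * V)
    return U @ V.T                           # margins; predict sign(X)
```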
• 76. Active Learning with MMMF. We extend MMMF to Active-MMMF using margin-based active sampling, and investigate the exploitation vs. exploration trade-offs imposed by different heuristics. (Figure: a matrix of predicted margins in [−1, 1].) Margin-based heuristics: min-margin (most uncertain); min-margin positive (“good” uncertain); max-margin (‘safe’ choice, but no info). 76
• 77. Active Max-Margin Matrix Factorization. A-MMMF(M, s): 1. Given a sparse matrix Y, learn approximation X = MMMF(Y). 2. Using the current predictions, actively select the “best” s samples and request their labels (e.g., test a client/server pair via an ‘enforced’ download). 3. Add the new samples to Y. 4. Repeat 1–3. Issues:  Beyond simple greedy margin-based heuristics?  Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions (any suggestions???  ) 77
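A sketch of this loop using the min-margin (most-uncertain) heuristic, built on the `mmmf_sketch` above; `oracle(i, j)` is a hypothetical stand-in for the actual measurement, e.g. an enforced download:

```python
import numpy as np

def active_mmmf(Y, mask, oracle, s=50, rounds=10):
    # One A-MMMF run: fit, query the s least-certain unobserved
    # entries, fold the answers back in, and repeat.
    Y, mask = Y.copy(), mask.copy()
    for _ in range(rounds):
        X = mmmf_sketch(Y, mask)                 # 1. learn approximation
        cand = np.where(~mask)                   # unobserved entries
        order = np.argsort(np.abs(X[cand]))[:s]  # 2. smallest |margin|
        for i, j in zip(cand[0][order], cand[1][order]):
            Y[i, j] = oracle(i, j)               # 3. add new samples to Y
            mask[i, j] = True
    return mmmf_sketch(Y, mask)                  # final predictions
```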
  • 78. Empirical Results  Network latency prediction  Bandwidth prediction (peer-to-peer)  Movie Ranking Prediction  Sensor net connectivity prediction 78
• 79. Empirical Results: Latency Prediction. (Figures: P2Psim data; NLANR-AMP data.) Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides consistent improvement over random and least-uncertain sampling. 79
  • 80. Movie Rating Prediction (MovieLens) 80
• 82. Introducing Cost: Exploration vs. Exploitation. (Figures: DownloadGrid, bandwidth prediction; PlanetLab, latency prediction.) Active sampling → lower prediction errors at lower costs (saves 100s of samples); better prediction → better server assignment decisions → faster downloads. Active sampling achieves a good exploration vs. exploitation trade-off: reduced decision cost AND information gain. 82
  • 83. Conclusions  Common challenge in many applications: need for cost-efficient sampling  This talk: linear hidden factor models with active sampling  Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications  Future work:  Better active sampling heuristics?  Theoretical analysis of active sampling performance?  Dynamic Matrix Factorizations: tracking time-varying matrices  Incremental MMMF? (solving from scratch every time is too costly) 83
• 84. References: some of the most influential papers • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001. • Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997. • David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996. • David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994. • D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604, 1992. 84
• 85. NIPS papers • Francis Bach. Active Learning for Misspecified Generalized Linear Models. NIPS-06 • Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05 • Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman. Active Learning for Identifying Function Threshold Boundaries. NIPS-05 • Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05 • Sanjoy Dasgupta. Coarse Sample Complexity Bounds for Active Learning. NIPS-05 • Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05 • Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05 • Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04 • Sanjoy Dasgupta. Analysis of a Greedy Active Learning Strategy. NIPS-04 • T. Jaakkola, H. Siegelmann. Active Information Retrieval. NIPS-01 • M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01 • Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00 • Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00 • Thomas Hofmann, Joachim M. Buhmann. Active Data Clustering. NIPS-97 • K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95 • Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94 • Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94 • David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94 • Sebastian B. Thrun, Knut Moller. Active Exploration in Dynamic Environments. NIPS-91 85
• 86. ICML papers • Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06 • Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06 • Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06 • Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06 • Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments with Application to Gene Expression Analysis. ICML-05 • Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04 • Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04 • Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04 • Greg Schohn, David Cohn. Less is More: Active Learning with Support Vector Machines. ICML-00 • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00 • COLT papers • S. Dasgupta, A. Kalai, C. Monteleoni. Analysis of Perceptron-Based Active Learning. COLT-05 • H. S. Seung, M. Opper, H. Sompolinsky. Query by Committee. COLT-92, pages 287–294 86
• 87. Journal papers • Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research (JMLR), vol. 6, pp. 1579–1619, 2005. • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001. • Y. Freund, H. S. Seung, E. Shamir, N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997. • David Cohn, Zoubin Ghahramani, Michael Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996. • David Cohn, Les Atlas, Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994. • D. J. C. MacKay. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604, 1992. • D. Haussler, M. Kearns, R. E. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83–113, 1994. • V. V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972. • M. Saar-Tsechansky, F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning, 54(2):153–178, 2004. 87
  • 89. Appendix 89
  • 90. Active Learning of Bayesian Networks
• 91. Entropy Function • A measure of information in a random event X with possible outcomes {x1,…,xn}: H(X) = − Σi p(xi) log2 p(xi) • Comments on the entropy function: – entropy of an event is zero when the outcome is known – entropy is maximal when all outcomes are equally likely • The average minimum number of yes/no questions needed to determine the outcome (connection to binary search) [Shannon, 1948] 91
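A small numeric check of these properties (in bits, i.e. log base 2):

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability terms are dropped,
    # since lim p->0 of p*log(p) = 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0]))   # 0.0 -- outcome known
print(entropy([0.5, 0.5]))   # 1.0 -- maximal for two outcomes
```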
• 92. Kullback-Leibler divergence • P is the true distribution; Q is the distribution used to encode data instead of P • KL divergence is the expected extra message length per datum that must be transmitted when using Q: DKL(P || Q) = Σi P(xi) log (P(xi)/Q(xi)) = Σi P(xi) log P(xi) − Σi P(xi) log Q(xi) = H(P,Q) − H(P) = cross-entropy − entropy • A measure of how “wrong” Q is with respect to the true distribution P 92
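And the corresponding check for KL divergence, again in bits:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = H(P, Q) - H(P); assumes q > 0 wherever p > 0
    # (absolute continuity), so the sum is taken over the support of P.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0  -- Q matches P
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.53 extra bits per datum
```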
• 93. Learning Bayesian Networks. (Figure: Data + Prior Knowledge → Learner → a Bayesian network over nodes E, B, R, A, C, with a CPT P(A | E, B): (e, b): .9 / .1; (e, ¬b): .7 / .3; (¬e, b): .8 / .2; (¬e, ¬b): .99 / .01.) • Model building • Parameter estimation • Causal structure discovery • Passive learning vs. active learning 93
  • 94. Active Learning • Selective Active Learning • Interventional Active Learning • Obtain measure of quality of current model • Choose query that most improves quality • Update model 94
• 95. Active Learning: Parameter Estimation [Tong & Koller, NIPS-2000] • Given a BN structure G and a prior distribution p(θ) • The learner requests a particular instantiation q (query) • (Figure: the active learner sends query q to the network, receives response x as training data, and updates the initial distribution p(θ) to p′(θ).) • Two questions: how to update the parameter density, and how to select the next query based on p 95
• 96. Updating the parameter density. (Figure: a network with nodes B, A, J, M, where B is a parent of A.) • Do not update A, since we are fixing it • If we select A = a, then do not update B: sampling is from P(B | A = a) ≠ P(B) • If we force A := a, then we can update B: sampling is from P(B | A := a) = P(B)* • Update all other nodes as usual • Obtain the new density p(θ | A = a, X = x) 96 *Pearl 2000
• 97. Bayesian point estimation • Goal: a single estimate θ̃ – instead of a distribution p over θ • If we choose θ̃ and the true model is θ′, then we incur some loss L(θ′ || θ̃) 97
• 98. Bayesian point estimation • We do not know the true θ′ • The density p represents our beliefs over θ′ • Choose the θ that minimizes the expected loss: θ̃ = argminθ ∫ p(θ′) L(θ′ || θ) dθ′ • Call θ̃ the Bayesian point estimate • Use the expected loss of the Bayesian point estimate as a measure of quality of p(θ): Risk(p) = ∫ p(θ′) L(θ′ || θ̃) dθ′ 98
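A Monte Carlo sketch of Risk(p) for a one-parameter example, assuming a Beta posterior over a Bernoulli parameter and KL loss; for this direction of the KL loss the minimizer works out to the posterior mean, which is used as θ̃ below:

```python
import numpy as np

def bernoulli_kl(a, b):
    # KL between Bernoulli(a) and Bernoulli(b), in nats
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def risk(posterior_samples, theta_hat, loss):
    # Monte Carlo estimate of Risk(p) = E_{theta' ~ p}[ L(theta' || theta_hat) ]
    return np.mean([loss(t, theta_hat) for t in posterior_samples])

rng = np.random.default_rng(0)
samples = rng.beta(3, 2, size=10_000)   # p(theta'): a Beta(3, 2) posterior
theta_hat = samples.mean()              # Bayesian point estimate under KL loss
print(risk(samples, theta_hat, bernoulli_kl))
```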
• 99. The querying component • Set the controllable variables so as to minimize the expected posterior risk: ExPRisk(p | Q = q) = Σx P(X = x | Q = q) ∫ p(θ | x) KL(θ || θ̃) dθ, where θ̃ is the Bayesian point estimate of the updated density • KL divergence is used as the loss, with the conditional KL-divergence KL(θ || θ′) = Σi KL(Pθ(Xi | Ui) || Pθ′(Xi | Ui)) 99
• 100. Algorithm Summary • For each potential query q, compute ∆Risk(X | q) • Choose the q for which ∆Risk(X | q) is greatest – Cost of computing ∆Risk(X | q): the cost of Bayesian network inference – Complexity: O(|Q| · cost of inference) 100
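The selection step itself is then a one-liner; the O(|Q| · cost of inference) complexity comes from `delta_risk` invoking BN inference once per candidate query (the names here are hypothetical):

```python
def choose_query(queries, delta_risk):
    # Greedy selection: evaluate delta_risk(q) -- one Bayesian-network
    # inference each -- for every candidate query and take the best.
    return max(queries, key=delta_risk)
```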
• 101. Uncertainty sampling. Maintain a single hypothesis, based on the labels seen so far. Query the point about which this hypothesis is most “uncertain”. Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class. (Figure: + and − points with a single separator and its most uncertain point X.) 101
• 103. Region of uncertainty. Current version space: the portion of H consistent with the labels so far. “Region of uncertainty” = the part of data space about which there is still some uncertainty (i.e. disagreement within the version space). Example: suppose the data lie on a circle in R2 and hypotheses are linear separators (spaces X, H superimposed); the figure shows the current version space and the region of uncertainty in data space. 103
• 104. Region of uncertainty. Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query. (Figure: data and hypothesis spaces superimposed, both the surface of the unit sphere in Rd, with the region of uncertainty in data space.) 104
• 105. Region of uncertainty. The number of labels needed depends on H and also on P. Special case: H = {linear separators in Rd}, P = uniform distribution over the unit sphere. Then just d log 1/ε labels are needed to reach a hypothesis with error rate < ε [1][2]. [1] Supervised learning: d/ε labels. [2] The best we can hope for. 105
  • 106. Region of uncertainty Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query. For more general distributions: suboptimal… Need to measure quality of a query – or alternatively, size of version space. 106
• 107. Expected infogain of a sample: reduces to uncertainty sampling! 107

Editor's Notes

• #58–#61, #63, #64: Explain why this isn’t pathological: the issue of margins
• #68: New properties of our application vs. CF: dynamically changing ratings; possibilities for active sampling; constraints on resources vs. CF; the absence of a rating may mean that the server was NOT selected before on purpose (?). Future improvements: use client and server ‘features’ (IP etc.)
• #69: Red box on top left: which data were actually used in the experiments? Color version. Histograms.
• #78: Examples: movies and dGrid. Motivate active learning in CF scenarios: minimize the # of questions/downloads but maximize their ‘informativeness’. Incremental algorithm: future work.
• #80, #81, #83, #84: Error bars; sparsity for those matrices; % of samples added (X axis). For the future: the effect of the initial training set on active learning.
  • #102: When you don’t have enough data to even fit a hypothesis, should you trust its confidence judgements?
  • #107: Explain why this isn’t pathological: issue of margins