Nov. 23rd, 2009




{ On SSL, and beyond }
- Theories, Methods, and a Possible Suggestion on Semi-Supervised Learning -




                                      Lab Seminar Presentation
                                               Eunjeong Park
Agenda




1. Background


2. Semi-Supervised Learning Methods


3. Assumptions on SSL


4. Future Work
Agenda




1. Background


2. Semi-Supervised Learning Methods


3. Assumptions on SSL


4. Future Work
Background    Examples (1/2)

• Spam E-mail Classification

   [Figure: an unlabeled e-mail (?) to be classified as inbox or spam]
Background     Examples (2/2)

• Response Modeling

   [Figure: an unlabeled customer (?) to be classified as respondent or non-respondent]
Background     the Question (1/2)

• Statistical learning methods require LOTS of training data

   – But since we only have a limited amount of labeled data,
   – Can we figure out a way for our learning algorithms to take
     advantage of all the unlabeled data?




      [Figure: a few labeled examples alongside a much larger pool of unlabeled examples …]
Background           the Question (2/2)


                                  f: x→y
                         <xi, yi>                                   <xi> …?

•   Text/Web Mining
     –   Document classification
           • f: Doc → Class
           • Spam filtering, web page classification
     –   Information extraction
           • f: Sentence → Fact, f: Doc → Fact
     –   Translation
           • f: EnglishDoc → FrenchDoc

•   Marketing
     –   Response modeling
           • f: Demo+RFM → Response
     –   Fraud detection
           • f: Demo+PaymentHistory → Fraud
     –   Customer segmentation
           • f: Demo+RFM → Customer Seg.
Agenda




1. Background


2. Semi-Supervised Learning Methods


3. Assumptions on SSL


4. Future Work
Semi-Supervised Learning    Methodology [1]

•   Generative models
     – Unlabeled data is used to either modify or reprioritize the hypotheses obtained from
       labeled data alone
     – Given Bayes' rule:

                   P(y|x) = p(x|y) P(y) / p(x)

       we can easily see that p(x) influences P(y|x)
     – Mixture models with EM fall into this category, and to some extent self-training, too
       (the likelihood they maximize is sketched after this slide)

•   Discriminative models
     – Standard discriminative training cannot be used for SSL, since p(y|x) is estimated
       while ignoring p(x)
     – To solve this, p(x)-dependent terms are often brought into the objective
       function, which amounts to assuming that p(y|x) and p(x) share parameters
     – Transductive SVMs, Gaussian processes, information regularization, and graph-based
       methods are in this category


                                                      ※ For more on GM, DM refer to Appendix 1.
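
To make the role of p(x) concrete, here is a sketch of the semi-supervised log-likelihood a generative mixture maximizes (standard form following Nigam et al. [3]; the notation below is mine, not from the slides):

```latex
% Log-likelihood over labeled set L and unlabeled set U for a mixture
% with parameters \theta; unlabeled points enter only through the
% marginal p(x;\theta) = \sum_y p(x \mid y;\theta)\, P(y;\theta).
\ell(\theta) = \sum_{i \in L} \log\big( p(x_i \mid y_i;\theta)\, P(y_i;\theta) \big)
             + \sum_{j \in U} \log \sum_{y} p(x_j \mid y;\theta)\, P(y;\theta)
```

Maximizing the second sum is what lets unlabeled data move the parameter estimates, and EM is the usual tool for doing so.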
Semi-Supervised Learning    Previous methods

 SSL = Semi-Supervised Learning

  • EM w/ Generative Mixture Models (Nigam et al., 2000; Miller & Uyar, 1997)
  • Self-Training
  • Co-Training and Multiview Learning (Blum & Mitchell, 1998; Goldman & Zhou, 2000)
  • TSVMs (Bennett et al., 1999; Joachims, 1999)
  • Gaussian Processes
  • Information Regularization
  • Entropy Minimization
  • Graph-based methods (Blum & Chawla, 2001)




                                                                            Ref [1], [2] reorganized

                                    ※ For more on the use of above methods, refer to Appendix 2.
Semi-Supervised Learning    Previous methods: EM w/ Generative Models (1/3)

   [Figure: the basic EM algorithm incorporating unlabeled data, from Nigam et al. [3]]
Semi-Supervised Learning    Previous methods: EM w/ Generative Models (2/3)

•   In a binary classification problem, if we assume each class has a
    Gaussian distribution, then we can use unlabeled data to help
    parameter estimation [1] (a code sketch follows below)
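
As a concrete illustration, a minimal sketch (mine, not from the slides) of EM for a two-class 1-D Gaussian mixture that folds unlabeled points in as soft-labeled data, in the spirit of Nigam et al. [3]; all names are illustrative:

```python
import numpy as np

def ssl_gmm_em(x_l, y_l, x_u, n_iter=50):
    """x_l, y_l: labeled 1-D points with {0,1} labels; x_u: unlabeled 1-D points."""
    # Initialize parameters from the labeled data alone.
    mu = np.array([x_l[y_l == k].mean() for k in (0, 1)])
    var = np.array([x_l[y_l == k].var() + 1e-6 for k in (0, 1)])
    prior = np.array([(y_l == k).mean() for k in (0, 1)])

    def gauss(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    resp_l = np.stack([(y_l == k).astype(float) for k in (0, 1)])  # hard labels
    x_all = np.concatenate([x_l, x_u])
    for _ in range(n_iter):
        # E-step: posterior class responsibilities for the unlabeled points.
        lik = np.stack([prior[k] * gauss(x_u, mu[k], var[k]) for k in (0, 1)])
        resp_u = lik / lik.sum(axis=0)
        # M-step: re-estimate parameters from labeled + soft-labeled points.
        for k in (0, 1):
            w = np.concatenate([resp_l[k], resp_u[k]])
            prior[k] = w.mean()
            mu[k] = (w * x_all).sum() / w.sum()
            var[k] = (w * (x_all - mu[k]) ** 2).sum() / w.sum() + 1e-6
    return mu, var, prior
```

A new point x is then classified by arg max over k of prior[k]·N(x; mu[k], var[k]).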
Semi-Supervised Learning    Previous methods: EM w/ Generative Models (3/3)
Semi-Supervised Learning    Previous methods: Co-Training (1/4)

   [Figure: the same faculty page seen through two views — the page text
   ("Professor Cho") and the anchor text of links pointing to it ("My Advisor")]
Semi-Supervised Learning    Previous methods: Co-Training (2/4)

• Key Idea: Classifier1 and Classifier2 must…
    – correctly classify the labeled examples, and
    – agree on the classification of the unlabeled examples

   [Figure: Classifier 1 (hyperlinks only) and Classifier 2 (page only)
   both examining the "Professor Cho" / "My Advisor" page]
Semi-Supervised Learning    Previous methods: Co-Training (3/4) [4]

•   Given: labeled data L, unlabeled data U
•   Loop:
     – Train g1 (hyperlink classifier) using L
     – Train g2 (page classifier) using L
     – Allow g1 to label p positive, n negative examples from U
     – Allow g2 to label p positive, n negative examples from U
     – Add these self-labeled examples to L

   [Figure: Classifier1 and Classifier2 each return an answer for the
   "Professor Cho" / "My Advisor" page]
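
A minimal runnable sketch of the loop above (my rendering of Blum & Mitchell's procedure [4]; the Naive Bayes base learners and all names are assumptions, not prescribed by the slides):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=3, rounds=30):
    """View 1 = hyperlink features, view 2 = page features of the same examples."""
    L1, L2, y = X1_l, X2_l, y_l
    U = np.arange(len(X1_u))                    # indices of still-unlabeled examples
    for _ in range(rounds):
        g1 = MultinomialNB().fit(L1, y)         # train g1 (hyperlink classifier) on L
        g2 = MultinomialNB().fit(L2, y)         # train g2 (page classifier) on L
        if len(U) == 0:
            break
        for g, Xu in ((g1, X1_u), (g2, X2_u)):
            proba = g.predict_proba(Xu[U])      # columns follow classes_ = [0, 1]
            pos = U[np.argsort(-proba[:, 1])[:p]]   # p most confident positives
            neg = U[np.argsort(-proba[:, 0])[:n]]   # n most confident negatives
            picked = np.concatenate([pos, neg])
            labels = np.r_[np.ones(len(pos), int), np.zeros(len(neg), int)]
            L1 = np.vstack([L1, X1_u[picked]])  # add self-labeled examples to L
            L2 = np.vstack([L2, X2_u[picked]])
            y = np.concatenate([y, labels])
            U = np.setdiff1d(U, picked)
            if len(U) == 0:
                break
    return g1, g2
```

Each classifier teaches the other through the examples it is most confident about, which is exactly where the two independent views pay off.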
Semi-Supervised Learning    Previous methods: Co-Training (4/4)

•   Experimental settings [4]:
     –   begin with 12 labeled web pages (academic course pages)
     –   provide 1,000 additional unlabeled web pages
     –   average error, learning from labeled data only: 11.1%
     –   average error, co-training: 5.0%
Semi-Supervised Learning    Previous methods: TSVMs

   [Figure: scatter of labeled + and − points; the transductive decision boundary
   separates both labeled and unlabeled points with maximum margin, passing through
   a low-density region]
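
Since the slide itself carries only the picture, a brief note: the TSVM of Joachims (1999) treats the labels of unlabeled points as extra variables to optimize. A sketch of the standard objective (my transcription, not from the slides):

```latex
% Besides slacks \xi_i on labeled points, the labels y^*_j of the
% unlabeled points are themselves optimization variables.
\min_{w,\; b,\; y^*_1,\dots,y^*_u}\;
  \tfrac{1}{2}\lVert w \rVert^2
  + C \sum_{i \in L} \xi_i
  + C^{*} \sum_{j \in U} \xi^{*}_j
\quad \text{s.t.}\quad
  y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\qquad
  y^{*}_j\,(w \cdot x_j + b) \ge 1 - \xi^{*}_j
```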
Semi-Supervised Learning    Previous methods: Graph-based methods

• Key idea: Define a graph where…
    – nodes are the labeled and unlabeled examples in the dataset, and
    – edges (possibly weighted) reflect the similarity of examples

    – Then nodes connected by a large-weight edge tend to have the
      same label, and labels can propagate throughout the graph
      (see the sketch after this slide)


• Note: Graph-based methods enjoy nice properties from spectral
  graph theory
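
A minimal sketch of the propagation idea (mine; the RBF edge weights and the iterative clamp-and-propagate scheme follow the common recipe in Zhu's survey [1]):

```python
import numpy as np

def label_propagation(X, y, n_iter=100, sigma=1.0):
    """X: (n, d) examples; y: 0/1 for labeled points, -1 for unlabeled."""
    # Edge weights: RBF similarity between every pair of examples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    f = np.where(y == 1, 1.0, 0.0)         # initial label scores
    labeled = y != -1
    for _ in range(n_iter):
        f = P @ f                          # push scores along weighted edges
        f[labeled] = (y[labeled] == 1)     # clamp the labeled nodes
    return (f > 0.5).astype(int)           # harden scores into 0/1 predictions
```

Large-weight edges dominate each row of P, which is why strongly connected nodes end up with the same label.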
Agenda




1. Background


2. Semi-Supervised Learning Methods


3. Assumptions on SSL


4. Future Work
Assumptions on SSL     The Utility of Unlabeled Data

• Many SSL papers start with an introduction like…
      “labeled data…is often very difficult and expensive to obtain, and
      thus…unlabeled data holds significant promise in terms of vastly
      expanding the applicability of learning methods [5]”
   …but is this necessarily true?
   – No! Do not take it for granted!
    – Even though you don't have to spend as much time labeling
      training data, you still need to spend considerable effort designing good
      models / features / kernels / similarity functions for SSL!


• A good match between problem structure and model assumptions is
  necessary to use unlabeled data effectively
   – A bad match can lead to degradation in classifier performance
Assumptions on SSL     An Example (1/2)

• Unlabeled Data Can Degrade Classification Performance of
  Generative Classifiers [6]




    [Figure: Naive Bayes classifiers trained on data generated from a Naive Bayes model (left)
    and a TAN model (right). Each point summarizes 10 runs of each classifier on test data;
    bars cover the 30th to 70th percentiles.]
Assumptions on SSL     An Example (2/2)

    [Figure: two class-conditional distributions, Spam=0 and Spam=1, over the
    # of the word ‘Loan’ in an e-mail]

  Q1: Is this e-mail spam?
  Q2: Was this e-mail written on a Sunday?
Agenda




1. Background


2. Semi-Supervised Learning Methods


3. Assumptions on SSL


4. Future Work
Future Work     Multi-Edge Graph-Based SSL

• Aside from semi-supervised classification, there is more…
   – Semi-Supervised Clustering
   – Semi-Supervised Regression


• There are also closely related approaches, such as…
   – Active learning


• Based on the theories noted above, here's my question:


                        f: x→y
                <x1i>, <x2i>, <x3i>, <x4i>

  Can we learn f when each example carries several feature views?
  (One possible rendering is sketched below.)
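
A heavily hedged sketch of one possible reading of the question: build one similarity graph per feature view <x1i>…<x4i>, combine the per-view edge weights, and propagate labels on the combined multi-edge graph. The combination rule and the weights `alphas` are my assumptions, not something the slides specify:

```python
import numpy as np

def multi_edge_propagation(views, y, alphas=None, n_iter=100, sigma=1.0):
    """views: list of (n, d_v) matrices, one per feature view; y: 0/1 or -1."""
    alphas = alphas or [1.0 / len(views)] * len(views)
    n = views[0].shape[0]
    W = np.zeros((n, n))
    for a, X in zip(alphas, views):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W += a * np.exp(-d2 / (2 * sigma ** 2))   # sum the per-view edge weights
    P = W / W.sum(axis=1, keepdims=True)          # combined transition matrix
    f, labeled = np.where(y == 1, 1.0, 0.0), y != -1
    for _ in range(n_iter):
        f = P @ f                                 # propagate on the multi-edge graph
        f[labeled] = (y[labeled] == 1)            # clamp labeled nodes
    return (f > 0.5).astype(int)
```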
Future Work   Multi-Edge Graph-Based SSL

   [Figures: Ex1 and Ex2 — two illustrative examples]
Any Questions?




Appendix 1          GM vs. DM

•   Discriminative models
     –   Methodology: introduce a decision boundary
     –   From the 1950s, when pattern recognition (PR) was first applied to interpreting radar
         signals, until the mid-1990s, this was effectively the dominant approach in PR
     –   Rosenblatt's Perceptron (1958) and the PDP school's MLP (1986) were likewise
         proposed from this viewpoint


•   Generative models
     –   First introduced in 1996 by Geoffrey Hinton, a core member of the PDP school (Hinton, G.,
         Using Generative Models for Handwritten Digit Recognition, tPAMI, 1996.)
     –   As a result, unsupervised learning, which had been regarded as little more than clustering,
         received renewed attention, and soon gained an ally in subspace analysis (e.g., PCA),
         developing rapidly
     –   The view here is that classes need not lie apart from one another, so rather than a
         boundary one should describe the data well with component distributions, i.e., a mixture
         of bases (e.g., a Fourier series)
Appendix 2           The Use of SSL Methods [1]

•   Do the classes produce well clustered data?
     –   EM w/ generative mixture models


•   Is the existing supervised classifier complicated and hard to modify?
     –   Self-training


•   Do the features naturally split into two sets?
     –   Co-training


•   Already using SVM?
     –   TSVMs


•   Is it true that two points with similar features tend to be in the same class?
     –   Graph-based methods
References


[1] Zhu, X., (2005). Semi-Supervised Learning Literature Survey, Computer Sciences,
    University of Wisconsin-Madison.
[2] Seeger, M., (2001). Learning with labeled and unlabeled data (Technical Survey).
[3] Nigam, K., McCallum, A. K., Mitchell, T. M., (2000). Text Classification from
    Labeled and Unlabeled Documents using EM, Machine Learning 39, 103-134.
[4] Mitchell, T. M., (1999). The Role of Unlabeled Data in Supervised Learning, Sixth
    International Colloquium on Cognitive Science.
[5] Raina, R., Battle, A., Packer, B., Ng, A. Y., (2007). Self-taught Learning: Transfer
    Learning from Unlabeled Data, 24th International Conference on Machine Learning.
[6] Cozman, F. G., Cohen, I., Cirelo, M., (2002). Unlabeled data can degrade
    classification performance of generative classifiers, FLAIRS-02.
[7] Balcan, M., Blum, A., Choi, P. P., Lafferty, J., Pantano, B., Rwebangira, M. R.,
    Zhu, X., (2005). Person Identification in Webcam Images: An Application of Semi-
    Supervised Learning, Proc. of the 22nd ICML Workshop on Learning with Partially
    Classified Training Data, Bonn, Germany.
