Vector space classification
 Each document is a vector, with one component for each term (a vectorization sketch follows the figure below)
 A training set is a set of documents, each labelled with its class
 In vector space classification, this set corresponds to a labelled set of points
 Documents in the same class form a contiguous region of space
 Documents from different classes don’t overlap
 We define surfaces to delineate classes in the space
[Figure: documents in a 2-D vector space grouped into class regions labelled Government, Science, and Arts]
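As a concrete illustration of the vector representation, here is a minimal sketch using scikit-learn's TfidfVectorizer; the corpus and class comments are made up for illustration:

```python
# Sketch: mapping documents to tf-idf vectors (illustrative corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the parliament passed the budget bill",    # Government
    "the telescope observed a distant galaxy",  # Science
    "the gallery opened a new exhibition",      # Arts
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # one row per document, one column per term
print(X.shape)                      # (3, number_of_distinct_terms)
```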
Vector space classification
 Rocchio classification
 Divides the vector space into regions centered on centroids / center of mass
 Simple and efficient
 Classes should be approximately spherical with similar radii
 kNN classification
 No explicit training is required
 Less efficient
 Can handle non-spherical and complex classes better than Rocchio
 Two-class classifiers
 One-of task – a document may be assigned to exactly one of several mutually exclusive classes
 Any-of task – a document can be assigned to any number of classes
 The relevance feedback method is adapted: it is a two-class classification, relevant vs. non-relevant
 Use a standard tf-idf weighted vector to represent each text document
 For the training documents in each category, compute a prototype vector by summing the vectors of all the training documents in that category
 Prototype = centroid of the members of the class
 Assign test documents to the category with the closest prototype vector, based on cosine similarity
Vector space classification
 The centroid (prototype) of class c is

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d
 Properties:
 Forms a simple generalization of the examples in each class
 Classification is based on similarity to the class prototype
 Does not guarantee that the classification is consistent with the given training data
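A minimal sketch of Rocchio training and classification under these definitions, assuming documents are already tf-idf vectors in a NumPy array (all data and names here are illustrative):

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one prototype per class: mu(c) = mean of v(d) over d in Dc."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(x, centroids):
    """Assign x to the class whose centroid has the highest cosine similarity."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(x, centroids[c]))

# X: (n_docs, n_terms) tf-idf matrix; y: class labels (toy data)
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 0.9]])
y = np.array(["gov", "gov", "sci", "sci"])
centroids = train_rocchio(X, y)
print(classify_rocchio(np.array([0.8, 0.2]), centroids))  # -> "gov"
```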
 Prototype models face problems with polymorphic categories
 Forms a simple representation for each class: the centroid/prototype
 Classification is based on the distance from the centroid
 Cheap to train and to test documents
 Not widely used outside text classification
 Used quite effectively for text classification
 Generally worse than Naïve Bayes
 To classify a document d into class c (sketched in code below):
 Define the k-neighborhood N as the k nearest neighbors of d
 Count the number i of documents in N that belong to c
 Estimate P(c|d) as i/k
 Choose as the class argmax_c P(c|d) [= the majority class]
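A sketch of this estimate, assuming cosine similarity over dense tf-idf vectors (data and names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Estimate P(c|d) = i/k from the k nearest neighbors; return the majority class."""
    # Cosine similarity between x and every stored training example.
    sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    top_k = np.argsort(sims)[-k:]               # indices of the k most similar documents
    votes = Counter(y_train[i] for i in top_k)  # i documents per class among the k
    return votes.most_common(1)[0][0]           # argmax_c P(c|d)

X_train = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]])
y_train = np.array(["gov", "gov", "sci", "sci"])
print(knn_classify(np.array([0.8, 0.3]), X_train, y_train))  # -> "gov"
```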
 Learning is just storing the representations of the training examples in D
 Testing instance x (1NN):
 Compute the similarity between x and all examples in D
 Assign x the category of the most similar example in D
 Does not explicitly compute a category representation
 Also called:
 Case-based learning
 Memory-based learning
 Lazy learning
 Rationale of kNN: the contiguity hypothesis (documents in the same class form a contiguous region, and regions of different classes do not overlap)
 Using only the closest example to determine the class is subject to errors due to:
 A single atypical example
 Noise in the category label of a single training example
 A strong high-bias assumption is linear separability:
 In 2D, classes can be separated by a line
 In higher dimensions, a hyperplane is needed
 A separating hyperplane can be found by linear programming (or by a perceptron)
 In 2D it can be expressed as ax + by = c
Find a, b, c such that
ax + by > c for red points
ax + by < c for blue points
In general, there are lots of possible solutions for a, b, c.
 Some methods find a separating hyperplane, but not an optimal one, e.g. the perceptron (sketched below)
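For example, a basic perceptron finds some separating hyperplane ax + by = c when one exists, though not necessarily a good one; a minimal sketch on illustrative 2-D data:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Find some hyperplane w.x = c with w.x > c for y=+1 and w.x < c for y=-1."""
    w = np.zeros(X.shape[1])
    c = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi - c) <= 0:  # misclassified: nudge the hyperplane
                w += yi * xi
                c -= yi
                errors += 1
        if errors == 0:                 # converged: all points separated
            break
    return w, c

# "Red" points (y=+1) end up with w.x > c, "blue" points (y=-1) with w.x < c.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))  # e.g. (array([2., 2.]), -1.0)
```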
 Which points should influence optimality?
 All points
 Linear/logistic regression
 Naïve Bayes
 Only difficult points close to boundary
 Support vector machines
 Many common text classifiers are linear classifiers:
 Naïve Bayes
 Perceptron
 Rocchio
 Logistic regression
 Support Vector Machines
 Despite this similarity, their performance differs noticeably
 For separable problems there is an infinite number of separating hyperplanes. Which one do you choose?
 What to do for non-separable problems?
 Different training methods pick different hyperplanes
Vector space classification
[Figure: a class distribution with non-linear structure. A linear classifier like Naïve Bayes does badly on this task; kNN does well (assuming enough training data)]
 Pictures like the one above are misleading for text:
 Documents are zero along almost all axes
 Most document pairs are very far apart (i.e. not strictly orthogonal, but sharing only very few common words)
 In classification terms: document sets are often separable
 This is part of why linear classifiers are quite successful in this domain
 Any-of (or multivalue) classification
 Classes are independent of each other
 A document can belong to any number of classes
 Decomposes into n binary problems
 Quite common for documents
 One-of (or multinomial, or polytomous) classification
 Classes are mutually exclusive
 Each document belongs to exactly one class
 Any-of (see the code sketch after this list):
 Build a separator between each class and its complementary set
 Given a test document, evaluate it for membership in each class
 Apply the decision criterion of each classifier independently
 One-of:
 Build a separator between each class and its complementary set
 Given a test document, evaluate it for membership in each class
 Assign the document to the class with
 Maximum score
 Maximum confidence
 Maximum probability
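A sketch contrasting the two decision rules, assuming we already have a per-class score for the test document (all numbers and thresholds here are illustrative):

```python
def any_of(scores, thresholds):
    """Any-of: apply each class's decision criterion independently."""
    return [c for c, s in scores.items() if s > thresholds[c]]

def one_of(scores):
    """One-of: assign the single class with the maximum score/confidence/probability."""
    return max(scores, key=scores.get)

scores = {"gov": 0.7, "sci": 0.4, "arts": 0.1}      # per-class classifier outputs
thresholds = {"gov": 0.5, "sci": 0.3, "arts": 0.5}  # per-class decision criteria
print(any_of(scores, thresholds))  # -> ['gov', 'sci']
print(one_of(scores))              # -> 'gov'
```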