Vector space classification
 Each document is a vector, with one component for each term (a vectorization sketch follows the figure below)
 A training set is a set of documents, each labelled with its class
 In vector space classification, this set corresponds to a labelled set of points
 Documents in the same class form a contiguous region of space
 Documents from different classes don’t overlap
 We define surfaces to delineate classes in the space
[Figure: documents in a 2-D vector space grouped into class regions labelled Government, Science, and Arts]
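As a concrete illustration of the vector representation, here is a minimal sketch using scikit-learn's TfidfVectorizer; the corpus and class comments are made up for illustration:

```python
# Sketch: mapping documents to tf-idf vectors (illustrative corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the parliament passed the budget bill",    # Government
    "the telescope observed a distant galaxy",  # Science
    "the gallery opened a new exhibition",      # Arts
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # one row per document, one column per term
print(X.shape)                      # (3, number_of_distinct_terms)
```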
Vector space classification
 Rocchio classification
 Divides the vector space into regions centered on centroids / center of mass
 Simple and efficient
 Classes should be approximately spherical with similar radii
 kNN classification
 No explicit training is required
 Less efficient
 Can handle non-spherical and complex classes better than Rocchio
 Two-class classifiers
 One-of task – a document may be assigned to exactly one of several mutually exclusive classes
 Any-of task – a document can be assigned to any number of classes
 The relevance feedback method is adapted: it is a two-class classification, relevant vs. non-relevant
 Use a standard tf-idf weighted vector to represent each text document
 For the training documents in each category, compute a prototype vector by summing the vectors of all the training documents in that category
 Prototype = centroid of the members of the class
 Assign test documents to the category with the closest prototype vector, based on cosine similarity
Vector space classification
 The centroid (prototype) of class c is

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d
 Properties:
 Forms a simple generalization of the examples in each class
 Classification is based on similarity to the class prototype
 Does not guarantee that the classification is consistent with the given training data
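A minimal sketch of Rocchio training and classification under these definitions, assuming documents are already tf-idf vectors in a NumPy array (all data and names here are illustrative):

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one prototype per class: mu(c) = mean of v(d) over d in Dc."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(x, centroids):
    """Assign x to the class whose centroid has the highest cosine similarity."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(x, centroids[c]))

# X: (n_docs, n_terms) tf-idf matrix; y: class labels (toy data)
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 0.9]])
y = np.array(["gov", "gov", "sci", "sci"])
centroids = train_rocchio(X, y)
print(classify_rocchio(np.array([0.8, 0.2]), centroids))  # -> "gov"
```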
 Prototype models face problems with polymorphic categories
 Forms a simple representation for each class: the centroid/prototype
 Classification is based on the distance from the centroid
 Cheap to train and to test documents
 Not widely used outside text classification
 Used quite effectively for text classification
 Generally worse than Naïve Bayes
 To classify a document d into class c (sketched in code below):
 Define the k-neighborhood N as the k nearest neighbors of d
 Count the number i of documents in N that belong to c
 Estimate P(c|d) as i/k
 Choose as the class argmax_c P(c|d) [= the majority class]
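A sketch of this estimate, assuming cosine similarity over dense tf-idf vectors (data and names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Estimate P(c|d) = i/k from the k nearest neighbors; return the majority class."""
    # Cosine similarity between x and every stored training example.
    sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    top_k = np.argsort(sims)[-k:]               # indices of the k most similar documents
    votes = Counter(y_train[i] for i in top_k)  # i documents per class among the k
    return votes.most_common(1)[0][0]           # argmax_c P(c|d)

X_train = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]])
y_train = np.array(["gov", "gov", "sci", "sci"])
print(knn_classify(np.array([0.8, 0.3]), X_train, y_train))  # -> "gov"
```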
 Learning is just storing the representations of the training examples in D
 Testing instance x (1NN):
 Compute the similarity between x and all examples in D
 Assign x the category of the most similar example in D
 Does not explicitly compute a category representation
 Also called:
 Case-based learning
 Memory-based learning
 Lazy learning
 Rationale of kNN: the contiguity hypothesis (documents in the same class form a contiguous region, and regions of different classes do not overlap)
 Using only the closest example to determine the class is subject to errors due to:
 A single atypical example
 Noise in the category label of a single training example
 A strong high-bias assumption is linear separability:
 In 2D, classes can be separated by a line
 In higher dimensions, a hyperplane is needed
 A separating hyperplane can be found by linear programming (or by a perceptron)
 In 2D it can be expressed as ax + by = c
Find a, b, c such that
ax + by > c for red points
ax + by < c for blue points
In general, there are lots of possible solutions for a, b, c.
 Some methods find a separating hyperplane, but not an optimal one, e.g. the perceptron (sketched below)
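For example, a basic perceptron finds some separating hyperplane ax + by = c when one exists, though not necessarily a good one; a minimal sketch on illustrative 2-D data:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Find some hyperplane w.x = c with w.x > c for y=+1 and w.x < c for y=-1."""
    w = np.zeros(X.shape[1])
    c = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi - c) <= 0:  # misclassified: nudge the hyperplane
                w += yi * xi
                c -= yi
                errors += 1
        if errors == 0:                 # converged: all points separated
            break
    return w, c

# "Red" points (y=+1) end up with w.x > c, "blue" points (y=-1) with w.x < c.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))  # e.g. (array([2., 2.]), -1.0)
```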
 Which points should influence optimality?
 All points
 Linear/logistic regression
 Naïve Bayes
 Only difficult points close to boundary
 Support vector machines
 Many common text classifiers are linear classifiers:
 Naïve Bayes
 Perceptron
 Rocchio
 Logistic regression
 Support Vector Machines
 Despite this similarity, their performance differs noticeably
 For separable problems there is an infinite number of separating hyperplanes. Which one do you choose?
 What to do for non-separable problems?
 Different training methods pick different hyperplanes
Vector space classification
[Figure: a class distribution with non-linear structure. A linear classifier like Naïve Bayes does badly on this task; kNN does well (assuming enough training data)]
 Pictures like the one above are misleading for text:
 Documents are zero along almost all axes
 Most document pairs are very far apart (i.e. not strictly orthogonal, but sharing only very few common words)
 In classification terms: document sets are often separable
 This is part of why linear classifiers are quite successful in this domain
 Any-of (or multivalue) classification
 Classes are independent of each other
 A document can belong to any number of classes
 Decomposes into n binary problems
 Quite common for documents
 One-of (or multinomial, or polytomous) classification
 Classes are mutually exclusive
 Each document belongs to exactly one class
 Any-of (see the code sketch after this list):
 Build a separator between each class and its complementary set
 Given a test document, evaluate it for membership in each class
 Apply the decision criterion of each classifier independently
 One-of:
 Build a separator between each class and its complementary set
 Given a test document, evaluate it for membership in each class
 Assign the document to the class with
 Maximum score
 Maximum confidence
 Maximum probability
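A sketch contrasting the two decision rules, assuming we already have a per-class score for the test document (all numbers and thresholds here are illustrative):

```python
def any_of(scores, thresholds):
    """Any-of: apply each class's decision criterion independently."""
    return [c for c, s in scores.items() if s > thresholds[c]]

def one_of(scores):
    """One-of: assign the single class with the maximum score/confidence/probability."""
    return max(scores, key=scores.get)

scores = {"gov": 0.7, "sci": 0.4, "arts": 0.1}      # per-class classifier outputs
thresholds = {"gov": 0.5, "sci": 0.3, "arts": 0.5}  # per-class decision criteria
print(any_of(scores, thresholds))  # -> ['gov', 'sci']
print(one_of(scores))              # -> 'gov'
```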