Text Categorization Chapter 16 Foundations of Statistical Natural Language Processing
Outline: Preparation; Decision Tree; Maximum Entropy Modeling; Perceptrons; K Nearest Neighbor Classification
Part I Preparation
Classification / Categorization: the task of assigning objects from a universe to two or more classes (categories). Examples (problem: object → categories):
Tagging: context of a word → the word's POS tags
Disambiguation: context of a word → the word's senses
PP attachment: sentence → parse trees
Author identification: document → authors
Language identification: document → languages
Text categorization: document → topics
Task Description. Goal: given a classification scheme, the system decides which class(es) a document belongs to, i.e. a mapping from document space to the classification scheme (1 to 1 or 1 to many). To build the mapping: observe the known samples classified in the scheme, summarize their features and create rules/formulas, then decide the classes of new documents according to those rules.
Task Formulation. Training set: pairs (text document, category); for TC the document is represented as a vector of (possibly weighted) word counts (the data representation model). Model class: a parameterized family of classifiers. Training procedure: selects one classifier from this family. E.g. (figure): a linear classifier in two dimensions, g(x) = w·x + b with w = (1,1) and b = -1; the boundary g(x) = 0 is the line through (0,1) and (1,0), a point x1 with w·x1 + b > 0 lies on one side and a point x2 with w·x2 + b < 0 on the other.
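As a minimal sketch of this toy example (assuming the decision rule simply compares g(x) = w·x + b against 0):

```python
# Minimal sketch of the 2-D linear classifier from the figure:
# g(x) = w.x + b with w = (1, 1), b = -1; the line g(x) = 0 separates the classes.

def classify(x, w=(1.0, 1.0), b=-1.0):
    """Return True if x falls on the positive side of the hyperplane w.x + b = 0."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + b
    return g > 0

print(classify((1, 0)))   # g = 0  -> False (on the boundary)
print(classify((1, 1)))   # g = 1  -> True
print(classify((0, 0)))   # g = -1 -> False
```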
Evaluation (1). Evaluate on a test set. For binary classification, accuracy is the proportion of correctly classified objects. Contingency table:
                    Yes is correct    No is correct
  Yes was assigned        a                 b
  No was assigned         c                 d
Precision = a/(a+b), recall = a/(a+c), accuracy = (a+d)/(a+b+c+d).
Evaluation (2). More than two categories: Macro-averaging: for each category create a contingency table, compute precision/recall separately, then average the evaluation measure over the categories. Micro-averaging: make a single contingency table for all the data by summing the scores in each cell over all categories. Macro-averaging gives equal weight to each class; micro-averaging gives equal weight to each object.
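A small sketch contrasting the two averages; the per-category counts below are made up for illustration (a = both yes, b = assigned yes but correct no, c = assigned no but correct yes, d = both no):

```python
# Per-category contingency cells; counts are hypothetical.
tables = {
    "earnings": {"a": 90, "b": 10, "c": 20, "d": 880},
    "grain":    {"a":  5, "b":  5, "c": 10, "d": 980},
}

# Macro-averaging: precision per category, then average -> equal weight per class.
macro_precision = sum(t["a"] / (t["a"] + t["b"]) for t in tables.values()) / len(tables)

# Micro-averaging: sum the cells over categories first -> equal weight per object.
A = sum(t["a"] for t in tables.values())
B = sum(t["b"] for t in tables.values())
micro_precision = A / (A + B)

print(round(macro_precision, 3))  # (0.9 + 0.5) / 2 = 0.7
print(round(micro_precision, 3))  # 95 / 110 ≈ 0.864
```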
Part II Decision Tree
E.g. a trained decision tree for the category "earnings"; classify Doc = {cts=1, net=3}:
Node 1 (7681 articles, P(c|n1) = 0.300), split on cts at value 2:
  cts < 2  → Node 2 (5977 articles, P(c|n2) = 0.116), split on net at value 1:
    net < 1  → Node 3 (5436 articles, P(c|n3) = 0.050)
    net >= 1 → Node 4 (541 articles, P(c|n4) = 0.649)
  cts >= 2 → Node 5 (1704 articles, P(c|n5) = 0.943), split on vs at value 2:
    vs < 2   → Node 6 (301 articles, P(c|n6) = 0.694)
    vs >= 2  → Node 7 (1403 articles, P(c|n7) = 0.996)
The example document follows cts < 2, then net >= 1, ending in Node 4 with P(c|n4) = 0.649.
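As a minimal sketch, the tree above can be hard-coded directly; assigning the category when P(c|node) > 0.5 is an assumption made here for illustration:

```python
# The trained tree for "earnings", hard-coded from the figure above.
# Each leaf stores P(c | node); we assign the category when that probability exceeds 0.5.

def p_earnings(doc):
    """doc is a dict of term weights, e.g. {'cts': 1, 'net': 3, 'vs': 0}."""
    if doc.get("cts", 0) < 2:                 # Node 2
        if doc.get("net", 0) < 1:
            return 0.050                      # Node 3
        return 0.649                          # Node 4
    else:                                     # Node 5
        if doc.get("vs", 0) < 2:
            return 0.694                      # Node 6
        return 0.996                          # Node 7

doc = {"cts": 1, "net": 3}
p = p_earnings(doc)
print(p, "earnings" if p > 0.5 else "not earnings")   # 0.649 -> earnings
```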
A Closer Look at the Example. Doc = {cts=1, net=3}: the data representation model. The decision tree: the model class, whose structure and parameters are decided by the training procedure.
Data Representation Model (1). An art in itself; it usually depends on the particular categorization method used. In the book's example, each document is represented as a weighted word vector. The words are chosen by the χ² (chi-square) method from the training corpus; 20 words are chosen, e.g. vs, mln, 1000, loss, profit… (ref. Chap. 5).
Data Representation Model (2). Each document j is then represented as a vector of K = 20 integers, x_j = (s_1j, …, s_Kj), with s_ij = round(10 · (1 + log tf_ij) / (1 + log l_j)) when tf_ij ≥ 1 and s_ij = 0 otherwise, where tf_ij is the number of occurrences of term i in document j and l_j is the length of document j. E.g. the weight of 'profit' in a document (see the sketch below).
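A small sketch of this weighting, assuming the log-scaled form given above; the example call (6 occurrences of 'profit' in an 89-word document) is made up for illustration:

```python
import math

def term_weight(tf_ij, l_j):
    """Integer weight of term i in document j: 0 if absent, otherwise a log-scaled
    count normalised by the (log) document length and rounded to an integer."""
    if tf_ij == 0:
        return 0
    return round(10 * (1 + math.log(tf_ij)) / (1 + math.log(l_j)))

# e.g. 'profit' occurs 6 times in an 89-word document
print(term_weight(6, 89))   # -> 5
```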
Training Procedure: Growing (1). Growing a tree. Splitting criterion: find the feature a and value y to split on, using information gain G(a, y) = H(t) − (p_L · H(t_L) + p_R · H(t_R)), where H(t) is the entropy of the parent node and p_L, p_R are the proportions of elements passed on to the left and right child nodes. Stopping criterion: determines when to stop splitting, e.g. when all elements at a node have an identical representation or the same category. (Ref. Machine Learning, Mitchell.)
Training Procedure: Growing (2). E.g. the value of G('cts', 2) at the root (Node 1, 7681 articles, P(c|n1) = 0.300; children Node 2 with P(c|n2) = 0.116 and Node 5 with P(c|n5) = 0.943): H(t) = -0.3·ln(0.3) - 0.7·ln(0.7) = 0.611; p_L = 5977/7681 ≈ 0.778, p_R ≈ 0.222; G('cts', 2) = 0.611 − (0.778·H(t_L) + 0.222·H(t_R)) = 0.611 − 0.328 = 0.283.
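The same computation in code (natural logarithms, which reproduce the 0.611 and 0.283 above):

```python
import math

def entropy(p):
    """Binary entropy in nats; 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Node 1 splits on cts at value 2: left = Node 2, right = Node 5.
H_parent = entropy(0.300)          # ≈ 0.611
p_left   = 5977 / 7681             # proportion of articles going left
p_right  = 1 - p_left
H_left, H_right = entropy(0.116), entropy(0.943)

gain = H_parent - (p_left * H_left + p_right * H_right)
print(round(gain, 3))              # ≈ 0.283
```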
Training Procedure: Pruning (1). Overfitting: introduced by errors or noise in the training set, or by an insufficient training set. Solution: pruning. Grow a detailed decision tree, then prune it back to an appropriate size. Approaches: Quinlan 1987, Quinlan 1993, Magerman 1994. (Ref. chap. 3 (3.7.1) of Machine Learning.)
Training Procedure: Pruning (2). Validation: hold out a validation set to decide how far to prune, or use cross-validation.
Discussion. Learning curve: a large training set is needed to reach optimal performance. Decision trees can be interpreted easily, which is their greatest advantage. The model is more complicated than classifiers like Naive Bayes or linear regression. Each split divides the training set into smaller and smaller subsets, which makes correct generalization harder (not enough data for reliable prediction at the leaves); pruning addresses this problem to some extent.
Part III Maximum Entropy Modeling: data representation model, model class, training procedure
Basic Idea. Given a set of training documents and their categories: select features that represent the empirical data; select a probability density function to generate the empirical data; find the probability distribution (i.e. decide the parameters of the probability function) that has the maximum entropy H(p) of all possible p, satisfies the constraints given by the features, and maximizes the likelihood of the data. A new document is then classified under this probability distribution.
Data Representation Model. Recall the data representation model used by the decision tree: each document is represented as a vector of K = 20 integers, where s_ij is an integer giving the weight of feature word i in document j. The features f_i are defined to characterize any property of a pair (x, c).
Model Class. Loglinear models: p(x, c) = (1/Z) · ∏_{i=1..K} α_i^{f_i(x,c)}, where K is the number of features, α_i is the weight of feature f_i, and Z is a normalizing constant that ensures a probability distribution results. To classify a new document, compute p(x, 0) and p(x, 1) and choose the class label with the greater probability.
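A minimal sketch of classifying with such a loglinear model; the feature functions and α values below are made up for illustration (binary features firing on word/class combinations):

```python
# Loglinear classification sketch: p(x, c) is proportional to prod_i alpha_i ** f_i(x, c).

def score(x, c, features, alphas):
    """Unnormalised p(x, c); Z cancels when we only compare the two classes."""
    s = 1.0
    for f, a in zip(features, alphas):
        s *= a ** f(x, c)
    return s

# Hypothetical features for the "earnings" task.
features = [
    lambda x, c: 1.0 if ("profit" in x and c == 1) else 0.0,
    lambda x, c: 1.0 if ("cts" in x and c == 1) else 0.0,
    lambda x, c: 1.0 if c == 0 else 0.0,          # class-prior style feature
]
alphas = [2.5, 1.8, 1.2]

x = {"profit", "cts", "loss"}
label = max((0, 1), key=lambda c: score(x, c, features, alphas))
print(label)   # class 1 scores 2.5 * 1.8 = 4.5 > 1.2 -> 1
```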
Training Procedure: Generalized Iterative Scaling. Find the maximum entropy distribution p* of the loglinear form under a set of constraints: the expected value of each f_i under p* must equal its expected value under the empirical distribution. There is a unique maximum entropy distribution, and there is a computable procedure (GIS) that converges to it (see section 16.2.1).
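A rough sketch of one GIS update, written here in the conditional form for brevity (the book works with the joint p(x, c)); it assumes every pair (x, c) has feature values summing to the same constant C, which real implementations enforce with a "correction feature". Each α_i is scaled by the ratio of empirical to model feature expectations until the two match:

```python
def gis_step(alphas, data, features, classes, C):
    """One generalized iterative scaling update (sketch).
    data: list of (x, c) training pairs; C: constant feature sum per (x, c)."""
    N = len(data)

    # Empirical feature expectations over the training pairs.
    empirical = [sum(f(x, c) for x, c in data) / N for f in features]

    def cond_prob(x):
        # p(c | x) proportional to prod_i alpha_i ** f_i(x, c)
        scores = {}
        for c in classes:
            s = 1.0
            for f, a in zip(features, alphas):
                s *= a ** f(x, c)
            scores[c] = s
        Z = sum(scores.values())
        return {c: s / Z for c, s in scores.items()}

    # Feature expectations under the current model.
    model = [0.0] * len(features)
    for x, _ in data:
        p = cond_prob(x)
        for i, f in enumerate(features):
            model[i] += sum(p[c] * f(x, c) for c in classes) / N

    # Multiplicative update: alpha_i <- alpha_i * (empirical_i / model_i) ** (1 / C).
    return [a * (e / m) ** (1.0 / C) for a, e, m in zip(alphas, empirical, model)]
```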
The Principle of Maximum Entropy. Given by E. T. Jaynes in 1957. The distribution with maximum entropy is more likely to occur than other distributions (cf. the entropy of a closed system continuously increasing?). Information entropy is a measure of 'uninformativeness': if we chose a model with less entropy, we would be adding 'information' constraints to the model that are not justified by the empirical evidence available to us.
Application to Text Categorization. Feature selection: in maximum entropy modeling, feature selection and training are usually integrated. Test for convergence: compare the log difference between empirical and estimated feature expectations. Generalized iterative scaling is computationally expensive due to slow convergence. Vs. Naive Bayes: both use the prior probability, but NB assumes no dependency between variables, while MEM does not. Strengths: arbitrarily complex features can be defined if the experimenter believes they may contribute useful information for the classification decision; a unified framework for feature selection and classification.
Part IV Perceptrons
Models. Data representation model: a text document is represented as a term vector. Model class: binary classification; for any input text document x, class(x) = c iff f(x) > 0, else class(x) ≠ c. Algorithm: the perceptron learning algorithm, a simple example of a gradient descent algorithm; the goal is to learn the weight vector w and a threshold θ.
Perceptron Learning Procedure: Gradient Descent. Gradient descent is an optimization algorithm: to find a local minimum of a function, take steps proportional to the negative of the function's gradient at the current point.
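A tiny sketch of gradient descent on a one-dimensional function, just to make the idea concrete; the quadratic objective is made up for illustration:

```python
# Gradient descent on f(w) = (w - 3)**2 ; the gradient is 2 * (w - 3).
w, alpha = 0.0, 0.1          # initial guess and learning rate
for _ in range(100):
    grad = 2 * (w - 3)
    w -= alpha * grad        # step against the gradient
print(round(w, 4))           # converges to the minimum at w = 3
```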
Perceptron Learning Procedure: Basic Idea. Goal: find a linear division of the training set. Procedure: estimate w and θ; whenever they make a mistake, move them in the direction of greatest change for the optimality criterion. For each (x, y) pair, pass (x_i, y_i, w_i) to the update rule w(j)' = w(j) + α(δ − y)x(j), where w(j) is the j-th item of the weight vector, x(j) is the j-th item of the input vector, and δ and y are the expected and actual outputs. Perceptron convergence theorem: Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data is linearly separable.
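A compact sketch of this procedure, folding θ into the weight vector by appending a constant −1 input (as on the next slide); the toy data set is made up and linearly separable, so the loop converges:

```python
# Perceptron training sketch.  theta is folded into w by appending a constant -1 input.
# Toy data: class 1 iff x1 + x2 > 1 (linearly separable).

data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1), ((2, 1), 1), ((1, 2), 1)]
w = [0.0, 0.0, 0.0]          # w1, w2, theta
alpha = 0.5

def predict(w, x):
    x_ext = list(x) + [-1.0]                     # append -1 for the threshold
    return 1 if sum(wi * xi for wi, xi in zip(w, x_ext)) > 0 else 0

for _ in range(20):                              # a few passes over the data
    for x, target in data:
        y = predict(w, x)
        x_ext = list(x) + [-1.0]
        # Update rule: w(j)' = w(j) + alpha * (delta - y) * x(j)
        w = [wi + alpha * (target - y) * xi for wi, xi in zip(w, x_ext)]

print(w, [predict(w, x) for x, _ in data])       # all six points end up classified correctly
```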
Why w(j)' = w(j) + α(δ − y)x(j)? Ref: www.cs.ualberta.ca/~sutton/book/8/node3.thml, sections 8.1 and 8.2 of Reinforcement Learning (formula 8.2): the update moves along the greatest gradient, where w' = {w1, w2, …, wk, θ} and x' = {x1, x2, …, xk, −1}.
E.g. (figure): geometric illustration of the update, adding a misclassified input vector x to the weight vector w to obtain w + x, which moves the separating surface from s to s' (the "Yes" and "No" regions).
Discussion. The data set must be linearly separable. By 1969 researchers had realized this limitation, and interest in perceptrons remained low afterwards. As a gradient descent algorithm, the perceptron does not suffer from the local optimum problem. In the 1980s came the back-propagation algorithm, multi-layer perceptrons, neural networks, and connectionist models; they overcome the shortcoming of perceptrons and can ideally learn any classification function (e.g. XOR), but they converge more slowly and can get caught in local optima.
Part V K Nearest Neighbor Classification
Nearest Neighbor (figure): the new point is assigned the category of its single nearest neighbour, here category = purple.
K Nearest Neighbor (figure): with k = 4 neighbours, the majority category among them wins, here category = blue.
Discussion. Similarity metric: the complexity of kNN lies in finding a good measure of similarity, and its performance depends heavily on choosing the right similarity metric. Efficiency: naive search over all training documents is expensive; however, there are ways of implementing kNN search efficiently, and often there is an obvious choice for the similarity metric.
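A minimal kNN sketch using cosine similarity over term vectors; the training documents below are made up for illustration:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse term-count vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc, training, k=4):
    """training: list of (term-vector, category) pairs; majority vote among the k nearest."""
    neighbours = sorted(training, key=lambda tc: cosine(doc, tc[0]), reverse=True)[:k]
    return Counter(cat for _, cat in neighbours).most_common(1)[0][0]

# Made-up training documents.
training = [
    ({"profit": 3, "cts": 2},   "earnings"),
    ({"loss": 1, "cts": 4},     "earnings"),
    ({"wheat": 5, "tonnes": 2}, "grain"),
    ({"crop": 2, "wheat": 1},   "grain"),
    ({"profit": 1, "mln": 2},   "earnings"),
]
print(knn_classify({"profit": 2, "cts": 1, "mln": 1}, training, k=4))   # -> "earnings"
```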
Thanks!
