Text Categorization Chapter 16 Foundations of Statistical Natural Language Processing
Outline: Preparation; Decision Tree; Maximum Entropy Modeling; Perceptrons; K Nearest Neighbor Classification
Part I Preparation
Classification / Categorization: the task of assigning objects from a universe to two or more classes (categories). Examples (problem: object → categories):
Tagging: context of a word → the word's POS tags
Disambiguation: context of a word → the word's senses
PP attachment: sentence → parse trees
Author identification: document → authors
Language identification: document → languages
Text categorization: document → topics
Task Description. Goal: given a classification scheme, the system decides which class(es) a document belongs to, i.e. a mapping from document space to the classification scheme (1 to 1 or 1 to many). To build the mapping: observe the known samples classified in the scheme, summarize their features and create rules/formulas, then decide the classes of new documents according to those rules.
Task Formulation. Training set: pairs (text document, category); for TC the document is represented as a vector of (possibly weighted) word counts (the data representation model). Model class: a parameterized family of classifiers. Training procedure: selects one classifier from this family. E.g. (figure): a linear classifier in two dimensions, g(x) = w·x + b with w = (1,1) and b = -1; the boundary g(x) = 0 is the line through (0,1) and (1,0), a point x1 with w·x1 + b > 0 lies on one side and a point x2 with w·x2 + b < 0 on the other.
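As a minimal sketch of this toy example (assuming the decision rule simply compares g(x) = w·x + b against 0):

```python
# Minimal sketch of the 2-D linear classifier from the figure:
# g(x) = w.x + b with w = (1, 1), b = -1; the line g(x) = 0 separates the classes.

def classify(x, w=(1.0, 1.0), b=-1.0):
    """Return True if x falls on the positive side of the hyperplane w.x + b = 0."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + b
    return g > 0

print(classify((1, 0)))   # g = 0  -> False (on the boundary)
print(classify((1, 1)))   # g = 1  -> True
print(classify((0, 0)))   # g = -1 -> False
```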
Evaluation (1). Evaluate on a test set. For binary classification, accuracy is the proportion of correctly classified objects. Contingency table:
                    Yes is correct    No is correct
  Yes was assigned        a                 b
  No was assigned         c                 d
Precision = a/(a+b), recall = a/(a+c), accuracy = (a+d)/(a+b+c+d).
Evaluation (2). More than two categories: Macro-averaging: for each category create a contingency table, compute precision/recall separately, then average the evaluation measure over the categories. Micro-averaging: make a single contingency table for all the data by summing the scores in each cell over all categories. Macro-averaging gives equal weight to each class; micro-averaging gives equal weight to each object.
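A small sketch contrasting the two averages; the per-category counts below are made up for illustration (a = both yes, b = assigned yes but correct no, c = assigned no but correct yes, d = both no):

```python
# Per-category contingency cells; counts are hypothetical.
tables = {
    "earnings": {"a": 90, "b": 10, "c": 20, "d": 880},
    "grain":    {"a":  5, "b":  5, "c": 10, "d": 980},
}

# Macro-averaging: precision per category, then average -> equal weight per class.
macro_precision = sum(t["a"] / (t["a"] + t["b"]) for t in tables.values()) / len(tables)

# Micro-averaging: sum the cells over categories first -> equal weight per object.
A = sum(t["a"] for t in tables.values())
B = sum(t["b"] for t in tables.values())
micro_precision = A / (A + B)

print(round(macro_precision, 3))  # (0.9 + 0.5) / 2 = 0.7
print(round(micro_precision, 3))  # 95 / 110 ≈ 0.864
```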
Part II Decision Tree
E.g. a trained decision tree for the category "earnings"; classify Doc = {cts=1, net=3}:
Node 1 (7681 articles, P(c|n1) = 0.300), split on cts at value 2:
  cts < 2  → Node 2 (5977 articles, P(c|n2) = 0.116), split on net at value 1:
    net < 1  → Node 3 (5436 articles, P(c|n3) = 0.050)
    net >= 1 → Node 4 (541 articles, P(c|n4) = 0.649)
  cts >= 2 → Node 5 (1704 articles, P(c|n5) = 0.943), split on vs at value 2:
    vs < 2   → Node 6 (301 articles, P(c|n6) = 0.694)
    vs >= 2  → Node 7 (1403 articles, P(c|n7) = 0.996)
The example document follows cts < 2, then net >= 1, ending in Node 4 with P(c|n4) = 0.649.
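As a minimal sketch, the tree above can be hard-coded directly; assigning the category when P(c|node) > 0.5 is an assumption made here for illustration:

```python
# The trained tree for "earnings", hard-coded from the figure above.
# Each leaf stores P(c | node); we assign the category when that probability exceeds 0.5.

def p_earnings(doc):
    """doc is a dict of term weights, e.g. {'cts': 1, 'net': 3, 'vs': 0}."""
    if doc.get("cts", 0) < 2:                 # Node 2
        if doc.get("net", 0) < 1:
            return 0.050                      # Node 3
        return 0.649                          # Node 4
    else:                                     # Node 5
        if doc.get("vs", 0) < 2:
            return 0.694                      # Node 6
        return 0.996                          # Node 7

doc = {"cts": 1, "net": 3}
p = p_earnings(doc)
print(p, "earnings" if p > 0.5 else "not earnings")   # 0.649 -> earnings
```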
A Closer Look at the Example. Doc = {cts=1, net=3}: the data representation model. The decision tree: the model class, whose structure and parameters are decided by the training procedure.
Data Representation Model (1). An art in itself; it usually depends on the particular categorization method used. In the book's example, each document is represented as a weighted word vector. The words are chosen by the χ² (chi-square) method from the training corpus; 20 words are chosen, e.g. vs, mln, 1000, loss, profit… (ref. Chap. 5).
Data Representation Model (2). Each document j is then represented as a vector of K = 20 integers, x_j = (s_1j, …, s_Kj), with s_ij = round(10 · (1 + log tf_ij) / (1 + log l_j)) when tf_ij ≥ 1 and s_ij = 0 otherwise, where tf_ij is the number of occurrences of term i in document j and l_j is the length of document j. E.g. the weight of 'profit' in a document (see the sketch below).
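A small sketch of this weighting, assuming the log-scaled form given above; the example call (6 occurrences of 'profit' in an 89-word document) is made up for illustration:

```python
import math

def term_weight(tf_ij, l_j):
    """Integer weight of term i in document j: 0 if absent, otherwise a log-scaled
    count normalised by the (log) document length and rounded to an integer."""
    if tf_ij == 0:
        return 0
    return round(10 * (1 + math.log(tf_ij)) / (1 + math.log(l_j)))

# e.g. 'profit' occurs 6 times in an 89-word document
print(term_weight(6, 89))   # -> 5
```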
Training Procedure: Growing (1). Growing a tree. Splitting criterion: find the feature a and value y to split on, using information gain G(a, y) = H(t) − (p_L · H(t_L) + p_R · H(t_R)), where H(t) is the entropy of the parent node and p_L, p_R are the proportions of elements passed on to the left and right child nodes. Stopping criterion: determines when to stop splitting, e.g. when all elements at a node have an identical representation or the same category. (Ref. Machine Learning, Mitchell.)
Training Procedure: Growing (2). E.g. the value of G('cts', 2) at the root (Node 1, 7681 articles, P(c|n1) = 0.300; children Node 2 with P(c|n2) = 0.116 and Node 5 with P(c|n5) = 0.943): H(t) = -0.3·ln(0.3) - 0.7·ln(0.7) = 0.611; p_L = 5977/7681 ≈ 0.778, p_R ≈ 0.222; G('cts', 2) = 0.611 − (0.778·H(t_L) + 0.222·H(t_R)) = 0.611 − 0.328 = 0.283.
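The same computation in code (natural logarithms, which reproduce the 0.611 and 0.283 above):

```python
import math

def entropy(p):
    """Binary entropy in nats; 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Node 1 splits on cts at value 2: left = Node 2, right = Node 5.
H_parent = entropy(0.300)          # ≈ 0.611
p_left   = 5977 / 7681             # proportion of articles going left
p_right  = 1 - p_left
H_left, H_right = entropy(0.116), entropy(0.943)

gain = H_parent - (p_left * H_left + p_right * H_right)
print(round(gain, 3))              # ≈ 0.283
```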
Training Procedure: Pruning (1). Overfitting: introduced by errors or noise in the training set, or by an insufficient training set. Solution: pruning. Grow a detailed decision tree, then prune it back to an appropriate size. Approaches: Quinlan 1987, Quinlan 1993, Magerman 1994. (Ref. chap. 3 (3.7.1) of Machine Learning.)
Training Procedure: Pruning (2). Validation: hold out a validation set to decide how far to prune, or use cross-validation.
Discussion. Learning curve: a large training set is needed to reach optimal performance. Decision trees can be interpreted easily, which is their greatest advantage. The model is more complicated than classifiers like Naive Bayes or linear regression. Each split divides the training set into smaller and smaller subsets, which makes correct generalization harder (not enough data for reliable prediction at the leaves); pruning addresses this problem to some extent.
Part III Maximum Entropy Modeling: data representation model, model class, training procedure
Basic Idea. Given a set of training documents and their categories: select features that represent the empirical data; select a probability density function to generate the empirical data; find the probability distribution (i.e. decide the parameters of the probability function) that has the maximum entropy H(p) of all possible p, satisfies the constraints given by the features, and maximizes the likelihood of the data. A new document is then classified under this probability distribution.
Data Representation Model. Recall the data representation model used by the decision tree: each document is represented as a vector of K = 20 integers, where s_ij is an integer giving the weight of feature word i in document j. The features f_i are defined to characterize any property of a pair (x, c).
Model Class. Loglinear models: p(x, c) = (1/Z) · ∏_{i=1..K} α_i^{f_i(x,c)}, where K is the number of features, α_i is the weight of feature f_i, and Z is a normalizing constant that ensures a probability distribution results. To classify a new document, compute p(x, 0) and p(x, 1) and choose the class label with the greater probability.
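A minimal sketch of classifying with such a loglinear model; the feature functions and α values below are made up for illustration (binary features firing on word/class combinations):

```python
# Loglinear classification sketch: p(x, c) is proportional to prod_i alpha_i ** f_i(x, c).

def score(x, c, features, alphas):
    """Unnormalised p(x, c); Z cancels when we only compare the two classes."""
    s = 1.0
    for f, a in zip(features, alphas):
        s *= a ** f(x, c)
    return s

# Hypothetical features for the "earnings" task.
features = [
    lambda x, c: 1.0 if ("profit" in x and c == 1) else 0.0,
    lambda x, c: 1.0 if ("cts" in x and c == 1) else 0.0,
    lambda x, c: 1.0 if c == 0 else 0.0,          # class-prior style feature
]
alphas = [2.5, 1.8, 1.2]

x = {"profit", "cts", "loss"}
label = max((0, 1), key=lambda c: score(x, c, features, alphas))
print(label)   # class 1 scores 2.5 * 1.8 = 4.5 > 1.2 -> 1
```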
Training Procedure: Generalized Iterative Scaling. Find the maximum entropy distribution p* of the loglinear form under a set of constraints: the expected value of each f_i under p* must equal its expected value under the empirical distribution. There is a unique maximum entropy distribution, and there is a computable procedure (GIS) that converges to it (see section 16.2.1).
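A rough sketch of one GIS update, written here in the conditional form for brevity (the book works with the joint p(x, c)); it assumes every pair (x, c) has feature values summing to the same constant C, which real implementations enforce with a "correction feature". Each α_i is scaled by the ratio of empirical to model feature expectations until the two match:

```python
def gis_step(alphas, data, features, classes, C):
    """One generalized iterative scaling update (sketch).
    data: list of (x, c) training pairs; C: constant feature sum per (x, c)."""
    N = len(data)

    # Empirical feature expectations over the training pairs.
    empirical = [sum(f(x, c) for x, c in data) / N for f in features]

    def cond_prob(x):
        # p(c | x) proportional to prod_i alpha_i ** f_i(x, c)
        scores = {}
        for c in classes:
            s = 1.0
            for f, a in zip(features, alphas):
                s *= a ** f(x, c)
            scores[c] = s
        Z = sum(scores.values())
        return {c: s / Z for c, s in scores.items()}

    # Feature expectations under the current model.
    model = [0.0] * len(features)
    for x, _ in data:
        p = cond_prob(x)
        for i, f in enumerate(features):
            model[i] += sum(p[c] * f(x, c) for c in classes) / N

    # Multiplicative update: alpha_i <- alpha_i * (empirical_i / model_i) ** (1 / C).
    return [a * (e / m) ** (1.0 / C) for a, e, m in zip(alphas, empirical, model)]
```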
The Principle of Maximum Entropy. Given by E. T. Jaynes in 1957. The distribution with maximum entropy is more likely to occur than other distributions (cf. the entropy of a closed system continuously increasing?). Information entropy is a measure of 'uninformativeness': if we chose a model with less entropy, we would be adding 'information' constraints to the model that are not justified by the empirical evidence available to us.
Application to Text Categorization. Feature selection: in maximum entropy modeling, feature selection and training are usually integrated. Test for convergence: compare the log difference between empirical and estimated feature expectations. Generalized iterative scaling is computationally expensive due to slow convergence. Vs. Naive Bayes: both use the prior probability, but NB assumes no dependency between variables, while MEM does not. Strengths: arbitrarily complex features can be defined if the experimenter believes they may contribute useful information for the classification decision; a unified framework for feature selection and classification.
Part IV Perceptrons
Models. Data representation model: a text document is represented as a term vector. Model class: binary classification; for any input text document x, class(x) = c iff f(x) > 0, else class(x) ≠ c. Algorithm: the perceptron learning algorithm, a simple example of a gradient descent algorithm; the goal is to learn the weight vector w and a threshold θ.
Perceptron Learning Procedure: Gradient Descent. Gradient descent is an optimization algorithm: to find a local minimum of a function, take steps proportional to the negative of the function's gradient at the current point.
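A tiny sketch of gradient descent on a one-dimensional function, just to make the idea concrete; the quadratic objective is made up for illustration:

```python
# Gradient descent on f(w) = (w - 3)**2 ; the gradient is 2 * (w - 3).
w, alpha = 0.0, 0.1          # initial guess and learning rate
for _ in range(100):
    grad = 2 * (w - 3)
    w -= alpha * grad        # step against the gradient
print(round(w, 4))           # converges to the minimum at w = 3
```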
Perceptron Learning Procedure: Basic Idea. Goal: find a linear division of the training set. Procedure: estimate w and θ; whenever they make a mistake, move them in the direction of greatest change for the optimality criterion. For each (x, y) pair, pass (x_i, y_i, w_i) to the update rule w(j)' = w(j) + α(δ − y)x(j), where w(j) is the j-th item of the weight vector, x(j) is the j-th item of the input vector, and δ and y are the expected and actual outputs. Perceptron convergence theorem: Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data is linearly separable.
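A compact sketch of this procedure, folding θ into the weight vector by appending a constant −1 input (as on the next slide); the toy data set is made up and linearly separable, so the loop converges:

```python
# Perceptron training sketch.  theta is folded into w by appending a constant -1 input.
# Toy data: class 1 iff x1 + x2 > 1 (linearly separable).

data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1), ((2, 1), 1), ((1, 2), 1)]
w = [0.0, 0.0, 0.0]          # w1, w2, theta
alpha = 0.5

def predict(w, x):
    x_ext = list(x) + [-1.0]                     # append -1 for the threshold
    return 1 if sum(wi * xi for wi, xi in zip(w, x_ext)) > 0 else 0

for _ in range(20):                              # a few passes over the data
    for x, target in data:
        y = predict(w, x)
        x_ext = list(x) + [-1.0]
        # Update rule: w(j)' = w(j) + alpha * (delta - y) * x(j)
        w = [wi + alpha * (target - y) * xi for wi, xi in zip(w, x_ext)]

print(w, [predict(w, x) for x, _ in data])       # all six points end up classified correctly
```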
Why w(j)' = w(j) + α(δ − y)x(j)? Ref: www.cs.ualberta.ca/~sutton/book/8/node3.thml, sections 8.1 and 8.2 of Reinforcement Learning (formula 8.2): the update moves along the greatest gradient, where w' = {w1, w2, …, wk, θ} and x' = {x1, x2, …, xk, −1}.
E.g. (figure): geometric illustration of the update, adding a misclassified input vector x to the weight vector w to obtain w + x, which moves the separating surface from s to s' (the "Yes" and "No" regions).
Discussion. The data set must be linearly separable. By 1969 researchers had realized this limitation, and interest in perceptrons remained low afterwards. As a gradient descent algorithm, the perceptron does not suffer from the local optimum problem. In the 1980s came the back-propagation algorithm, multi-layer perceptrons, neural networks, and connectionist models; they overcome the shortcoming of perceptrons and can ideally learn any classification function (e.g. XOR), but they converge more slowly and can get caught in local optima.
Part V K Nearest Neighbor Classification
Nearest Neighbor (figure): the new point is assigned the category of its single nearest neighbour, here category = purple.
K Nearest Neighbor (figure): with k = 4 neighbours, the majority category among them wins, here category = blue.
Discussion. Similarity metric: the complexity of kNN lies in finding a good measure of similarity, and its performance depends heavily on choosing the right similarity metric. Efficiency: naive search over all training documents is expensive; however, there are ways of implementing kNN search efficiently, and often there is an obvious choice for the similarity metric.
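A minimal kNN sketch using cosine similarity over term vectors; the training documents below are made up for illustration:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse term-count vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc, training, k=4):
    """training: list of (term-vector, category) pairs; majority vote among the k nearest."""
    neighbours = sorted(training, key=lambda tc: cosine(doc, tc[0]), reverse=True)[:k]
    return Counter(cat for _, cat in neighbours).most_common(1)[0][0]

# Made-up training documents.
training = [
    ({"profit": 3, "cts": 2},   "earnings"),
    ({"loss": 1, "cts": 4},     "earnings"),
    ({"wheat": 5, "tonnes": 2}, "grain"),
    ({"crop": 2, "wheat": 1},   "grain"),
    ({"profit": 1, "mln": 2},   "earnings"),
]
print(knn_classify({"profit": 2, "cts": 1, "mln": 1}, training, k=4))   # -> "earnings"
```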
Thanks!
