pmuthoju_presentation.ppt

Automatic Document Categorization using Support Vector Machines Prashanth Kumar Muthoju [email_address] Advisor: Dr. Zubair

Overview Introduction Problem Proposed Solution Improvements Results Future Work Conclusion References

Introduction What is Categorization? Sorting a set of documents into categories from a predefined set. Assigning a document to a category based on its contents.

Introduction .. Cont.d Types of Categorization: Manual; Automatic (Machine Learning): Probabilistic (e.g., Naïve Bayesian), Decision Structures (e.g., Decision Trees), Support Vector Machines (SVM)

Introduction .. Cont.d Why ‘Automation’? Manual categorization: needs a large number of human resources; is expensive; is time-consuming.

Introduction .. Cont.d Applications of Automatic Categorization: Indexing of scientific articles Spam filtering of e-mails Authorship attribution

Problem The DTIC document base has to be categorized into 25 fields (broad) and 251 groups (narrow). Fields/Groups are listed at http://www.dtic.mil/trail/fieldgrp.html

Towards the solution .. Strategy: Exploit an existing collection of categorized documents. One portion is used as the training set; the other portion is used as the testing set. Allow tuning of the classifier to yield maximum effectiveness.

Towards the solution .. What is a Support Vector Machine? A binary classifier. Finds the line (hyperplane) with the largest margin separating two classes. Subsequently classifies items based on which side of the line they fall.
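The "which side of the line" idea can be sketched as follows. This is a minimal illustration of a linear decision function only, not LibSVM's implementation: the weight vector `w` and offset `b` are hypothetical values standing in for what training would produce, and the 1/0 labels follow the positive/negative convention used later in the deck.

```python
def decision_value(w, b, x):
    """Signed score of x relative to the separating line w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return 1 (positive class) if x lies on the positive side, else 0."""
    return 1 if decision_value(w, b, x) >= 0 else 0


# Hypothetical separator: the line x1 - x2 = 0 in two dimensions.
w, b = [1.0, -1.0], 0.0
print(classify(w, b, [2.0, 1.0]))  # point below the diagonal
print(classify(w, b, [1.0, 2.0]))  # point above the diagonal
```

Training an SVM amounts to choosing `w` and `b` so that the margin between the two classes is as large as possible; the sketch only covers the classification step that follows.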

Towards the solution .. Why is SVM chosen for Automatic Categorization? Prior studies have suggested good results with SVM. Relatively immune to overfitting (fitting to coincidental relations encountered during training).

Towards the solution .. SVM Library (LibSVM 2.85) Java

Solution Before we can train the SVM using LibSVM for a Field/Group, we have to prepare a dataset for that Field/Group. Each file is represented by <label> <feature1>:<value1> <feature2>:<value2> ... (sparse vector representation). <label> is 1 for a positive file and 0 for a negative file. Each <feature>:<value> pair is represented by <word>:<tfidf>. (Common words are eliminated before preparing the dataset.)
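A minimal sketch of how such a dataset might be produced is shown below. Several details are assumptions rather than the deck's method: the stop-word list is a tiny illustrative one, TF is taken as the relative frequency of a word in a file, IDF as log(N / document frequency), and each word is mapped to an integer index because LibSVM's input format expects integer feature indices in ascending order.

```python
import math
from collections import Counter

# Illustrative stop-word list only; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_lines(docs, labels):
    """Return one '<label> <index>:<tfidf> ...' line per document."""
    tokens = [tokenize(d) for d in docs]
    # Map each word to a 1-based integer index (assumed convention).
    vocab = {w: i + 1 for i, w in enumerate(sorted({w for t in tokens for w in t}))}
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for t in tokens for w in set(t))
    lines = []
    for toks, label in zip(tokens, labels):
        feats = {}
        for word, count in Counter(toks).items():
            value = (count / len(toks)) * math.log(n_docs / df[word])
            if value > 0:  # sparse format: zero values are simply omitted
                feats[vocab[word]] = value
        body = " ".join(f"{i}:{v:.4f}" for i, v in sorted(feats.items()))
        lines.append(f"{label} {body}".strip())
    return lines


docs = ["radar signal processing", "signal noise", "soil chemistry"]
for line in tfidf_lines(docs, [1, 1, 0]):
    print(line)
```

The three sample strings stand in for the text extracted from PDF files; the first two are treated as positive files and the third as a negative file.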

Solution For each Field/Group, the following procedure is repeated (training phase): Collection (Model by Dr. Zeil) -> Download documents (PDF) for Field/Group K -> Convert PDF to text -> Model documents using TF and IDF -> Positive training set and negative training set for Field/Group K -> SVM for Field/Group K

Solution (Testing Phase) Input test document (PDF) -> Convert PDF to text -> Model documents using TF and IDF -> Trained SVMs for Field/Group 1, ..., Field/Group K, ..., Field/Group N. The trained SVM for Field/Group K gives an estimate in the range 0 to 1 indicating how likely Field/Group K maps to the test document.
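Once every trained SVM has produced its estimate, the field/groups can be ranked for the test document. The sketch below covers only that final step; the estimate values are invented for illustration and are not results from the experiment, and picking the single highest estimate is an assumption about how one answer per document would be chosen.

```python
def rank_field_groups(estimates):
    """Sort (field/group, estimate) pairs from most to least likely."""
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)


def best_field_group(estimates):
    """Return the field/group whose SVM gave the highest estimate."""
    return max(estimates, key=estimates.get)


# Hypothetical estimates, one per trained SVM, for a single test document.
estimates = {"140200": 0.12, "120200": 0.81, "201300": 0.33}
print(rank_field_groups(estimates))
print(best_field_group(estimates))
```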

Improving the results Scaling the vectors in the datasets, so that each <value> in the <feature>:<value> pairs lies between 0 and 1.
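One common way to bring values into the range 0 to 1 is per-feature min-max scaling, sketched below. The deck does not say which scaling method was used, so this is an assumption; a feature missing from a sparse vector is treated here as 0.

```python
def scale_to_unit_range(vectors):
    """Min-max scale each feature of a list of sparse {feature: value} dicts."""
    features = {f for vec in vectors for f in vec}
    # Absent features count as 0.0 when computing the per-feature range.
    lo = {f: min(vec.get(f, 0.0) for vec in vectors) for f in features}
    hi = {f: max(vec.get(f, 0.0) for vec in vectors) for f in features}
    scaled = []
    for vec in vectors:
        out = {}
        for f, v in vec.items():
            span = hi[f] - lo[f]
            # A constant feature carries no information; map it to 0.0.
            out[f] = (v - lo[f]) / span if span > 0 else 0.0
        scaled.append(out)
    return scaled


print(scale_to_unit_range([{1: 2.0, 2: 4.0}, {1: 4.0}]))
```

In practice the minimum and maximum found on the training set would normally be reused when scaling test vectors, so that both sets are scaled consistently.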

Experiment Randomly selected 5 Field/Groups: 140200, 120200, 201300, 220200, 250400. For each field/group, 70 PDF files were downloaded: 50 files were used as positive files for training and 20 files were used for testing. An additional 50 files were taken randomly from all other field/groups as negative files for training.
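The 50/20 split of the 70 files for a field/group could be made along these lines. The file names and the fixed random seed are hypothetical, and the deck does not state how the split was actually drawn.

```python
import random


def split_train_test(files, n_train, seed=0):
    """Shuffle a copy of the file list and cut it into training and testing parts."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical names for the 70 files downloaded for one field/group.
files = [f"140200_doc{i:02d}.pdf" for i in range(70)]
train, test = split_train_test(files, n_train=50)
print(len(train), len(test))
```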

Experiment Metrics: Recall = #Correct Answers / #Total Possible Answers; Precision = #Correct Answers / #Answers Produced

Results (rows: actual field/group of the 20 test files; columns: assigned field/group)

          140200  120200  201300  220200  250400
140200        13       2       1       2       2
120200         1      16       0       3       0
201300         0       5      13       2       0
220200         1       0       2      17       0
250400         0       0       1       0      19

Results ..Cont.d

Category  Precision  Recall
140200    0.87       0.65
120200    0.70       0.80
201300    0.76       0.65
220200    0.71       0.85
250400    0.90       0.95
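The precision and recall figures can be recomputed from the 5x5 results matrix. The sketch assumes that rows are the actual field/group of the test files and columns are the assigned field/group; this reading is inferred from each row summing to the 20 test files per field/group.

```python
CODES = ["140200", "120200", "201300", "220200", "250400"]

# Results matrix from the deck: rows = actual, columns = assigned (assumed).
MATRIX = [
    [13, 2, 1, 2, 2],
    [1, 16, 0, 3, 0],
    [0, 5, 13, 2, 0],
    [1, 0, 2, 17, 0],
    [0, 0, 1, 0, 19],
]


def precision_recall(matrix):
    """Return a list of (precision, recall) pairs, one per category."""
    scores = []
    for k in range(len(matrix)):
        correct = matrix[k][k]
        produced = sum(row[k] for row in matrix)  # answers produced (column sum)
        possible = sum(matrix[k])                 # total possible answers (row sum)
        scores.append((correct / produced, correct / possible))
    return scores


for code, (p, r) in zip(CODES, precision_recall(MATRIX)):
    print(f"{code}  precision={p:.2f}  recall={r:.2f}")
```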

Future Work Hierarchical Model: In the flat model, we consider each field/group independently. In the hierarchical model, we consider all files under a branch as positive files for training. Example branch: 150000, with children 150300 (150301, 150302) and 150600 (150601, 150602).
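Collecting the positive files for a branch might look like the sketch below. The parent/child layout of the example codes is inferred from their numbering on the slide, and the file names are hypothetical.

```python
# Assumed tree for the example branch: parent code -> child codes.
TREE = {
    "150000": ["150300", "150600"],
    "150300": ["150301", "150302"],
    "150600": ["150601", "150602"],
}


def branch_codes(tree, root):
    """Return root followed by every code beneath it, depth first."""
    codes = [root]
    for child in tree.get(root, []):
        codes.extend(branch_codes(tree, child))
    return codes


def positive_files(tree, root, files_by_code):
    """Gather the files of every field/group under the branch as positives."""
    return [f for code in branch_codes(tree, root) for f in files_by_code.get(code, [])]


# Hypothetical file lists per field/group.
files_by_code = {"150301": ["a.pdf"], "150302": ["b.pdf"], "150601": ["c.pdf"]}
print(positive_files(TREE, "150300", files_by_code))
print(positive_files(TREE, "150000", files_by_code))
```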

Future Work Multi-Label classification: In practice, each document may belong to multiple field/groups.
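One simple way to allow several field/groups per document is to keep every field/group whose estimate reaches a threshold. The threshold of 0.5 and the estimate values below are assumptions for illustration; the deck does not propose a specific multi-label method.

```python
def assign_labels(estimates, threshold=0.5):
    """Return every field/group whose estimate is at or above the threshold."""
    return sorted(code for code, p in estimates.items() if p >= threshold)


# Hypothetical per-SVM estimates for one document.
estimates = {"140200": 0.72, "120200": 0.64, "201300": 0.08}
print(assign_labels(estimates))
```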

Conclusion The classification results for DTIC documents by Field/Group were encouraging: across the five field/groups tested, precision ranged from 0.70 to 0.90 and recall from 0.65 to 0.95. Ways to improve the results have been identified. A couple of suggestions were given for future work in this area.

References
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34(1), pp. 1-47.
Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. (http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf)
Kwok, J.T. (1998). Automated text categorization using support vector machine. In Proceedings of the International Conference on Neural Information Processing, Kitakyushu, Japan, Oct. 1998, pp. 347-351.
