Automatic Document Categorization using Support Vector Machines. Prashanth Kumar Muthoju. Advisor: Dr. Zubair
Overview: Introduction, Problem, Proposed Solution, Improvements, Results, Future Work, Conclusion, References.
Introduction: What is categorization? Sorting a set of documents into categories from a predefined set, that is, assigning each document to a category based on its contents.
Introduction (cont'd): Types of categorization: (1) manual; (2) automatic (machine learning), which includes probabilistic methods (e.g., Naïve Bayes), decision structures (e.g., decision trees), and support vector machines (SVM).
Introduction (cont'd): Why automate? Manual categorization needs a large number of people, is expensive, and is time-consuming.
Introduction (cont'd): Applications of automatic categorization: indexing of scientific articles, spam filtering of e-mail, and authorship attribution.
Problem: The DTIC document base has to be categorized into 25 fields (broad) and 251 groups (narrow). The fields and groups are listed at http://www.dtic.mil/trail/fieldgrp.html
Towards the solution: Strategy: exploit an existing collection of already categorized documents. One portion is used as the training set and the other portion as the testing set. This allows the classifier to be tuned for maximum effectiveness.
Towards the solution: What is a Support Vector Machine? A binary classifier. It finds the boundary with the largest margin separating the two classes, and subsequently classifies items based on which side of that boundary they fall.
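To make "which side of the line" concrete, here is a minimal sketch of the decision step for a linear boundary. The weights and bias are hypothetical values chosen for illustration; in practice they come out of training, and the slides do not say which kernel was used.

```python
def decision_value(weights, bias, features):
    """Signed score w.x + b of a sparse feature vector against a linear boundary."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items()) + bias


def classify(weights, bias, features):
    """Return 1 (positive side of the boundary) or 0 (negative side)."""
    return 1 if decision_value(weights, bias, features) >= 0 else 0


# Hypothetical weights for illustration only; a trained SVM would supply these.
example_weights = {"radar": 1.0, "soil": -1.0}
example_bias = -0.2
```

Training chooses the weights and bias so that the gap between the boundary and the nearest training items of either class is as wide as possible.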
Towards the solution: Why was SVM chosen for automatic categorization? Prior studies have reported good results with SVM, and it is relatively immune to overfitting (fitting to coincidental relations encountered during training).
Towards the solution: Implementation: the SVM library LibSVM 2.85, in Java.
Solution: Before we can train the SVM with LibSVM for a Field/Group, we have to prepare a dataset for that Field/Group. Each file is represented by one line in sparse vector representation:
    <label> <feature1>:<value1> <feature2>:<value2> ...
<label> is 1 for a positive file and 0 for a negative file. Each <feature>:<value> pair is a <word>:<tfidf> pair. Common words are eliminated before the dataset is prepared.
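A minimal sketch of writing one such line. LibSVM's data format expects integer feature indices in ascending order, so this sketch assumes each word is first mapped to an index; the function names and the four-decimal formatting are illustrative choices, not taken from the slides.

```python
def build_vocabulary(documents):
    """Assign each distinct word a 1-based integer index, in sorted word order."""
    words = sorted({word for doc in documents for word in doc})
    return {word: index for index, word in enumerate(words, start=1)}


def to_sparse_line(label, weights, vocabulary):
    """Format one document as '<label> <index>:<value> ...' with ascending indices."""
    pairs = sorted(
        (vocabulary[word], value)
        for word, value in weights.items()
        if word in vocabulary and value != 0
    )
    return " ".join([str(label)] + [f"{index}:{value:.4f}" for index, value in pairs])


# Two toy documents, each a word -> tf-idf mapping (values are made up).
docs = [{"radar": 0.5, "antenna": 0.25}, {"soil": 1.0}]
vocab = build_vocabulary(docs)
line = to_sparse_line(1, docs[0], vocab)
```

Zero-valued features are omitted, which is what keeps the representation sparse.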
Solution (training phase): For each Field/Group K, the following procedure is repeated (model by Dr. Zeil): Collection -> download documents (PDF) -> convert PDF to text -> model documents using TF and IDF -> positive training set and negative training set for Field/Group K -> SVM for Field/Group K.
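The "model documents using TF and IDF" step can be sketched as follows. The slides do not state which TF-IDF variant was used, so this assumes relative term frequency multiplied by log(N / document frequency), whitespace tokenization, and a tiny illustrative stop-word list.

```python
import math
from collections import Counter

# Illustrative stop-word list; the actual list of "common words" is not given.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def tokenize(text):
    """Lower-case, split on whitespace, keep alphabetic non-stop words."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]


def tfidf_vectors(texts):
    """Return one {word: tf-idf} dictionary per input text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in docs for word in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append(
            {w: (count / total) * math.log(n_docs / doc_freq[w]) for w, count in doc.items()}
        )
    return vectors
```

Under this variant a word that occurs in every document gets weight 0, so it contributes nothing to the classifier.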
Solution (testing phase): Input test document (PDF) -> convert PDF to text -> model the document using TF and IDF -> apply the trained SVM for each Field/Group 1, ..., K, ..., N. The trained SVM for Field/Group K outputs an estimate in the range 0 to 1 indicating how likely it is that Field/Group K maps to the test document.
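Once every trained SVM has produced its estimate, the per-Field/Group scores have to be combined into an answer. The slides do not spell out this step; the sketch below assumes the scores are simply ranked and the highest one wins. The scores shown are made-up values.

```python
def rank_field_groups(scores, top_n=3):
    """Return the top_n (field_group, score) pairs, highest score first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


def best_field_group(scores):
    """Return the single Field/Group with the highest estimate."""
    return max(scores, key=scores.get)


# Hypothetical estimates from three trained SVMs for one test document.
example_scores = {"140200": 0.2, "120200": 0.9, "201300": 0.5}
```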
Improving the results: Scale the vectors in the datasets so that the <value> in every <feature>:<value> pair lies between 0 and 1.
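One way to do this scaling, as a sketch: divide each feature by its largest value in the training set. This assumes values are non-negative (true for TF-IDF) and that an absent feature counts as 0; the slides do not say which scaling method or tool was actually used.

```python
def fit_feature_maxima(vectors):
    """Record the largest value seen for each feature across the training vectors."""
    maxima = {}
    for vec in vectors:
        for feature, value in vec.items():
            maxima[feature] = max(maxima.get(feature, 0.0), value)
    return maxima


def scale_vector(vec, maxima):
    """Scale values into [0, 1]; drop features unseen in training, clip at 1.0."""
    return {
        feature: min(value / maxima[feature], 1.0)
        for feature, value in vec.items()
        if maxima.get(feature, 0.0) > 0
    }


train = [{1: 2.0, 2: 0.5}, {1: 4.0}]
maxima = fit_feature_maxima(train)
```

The maxima are fitted on the training set only and then reused for test documents, so both are scaled the same way.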
Experiment: Five Field/Groups were selected at random: 140200, 120200, 201300, 220200, 250400. For each Field/Group, 70 PDF files were downloaded: 50 were used as positive files for training and 20 for testing. An additional 50 files, taken at random from all other Field/Groups, were used as negative files for training.
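The split described here can be sketched as below. The file names, the function name, and the fixed random seed are assumptions for illustration; only the counts (50 positive, 20 test, 50 negative) come from the slide.

```python
import random


def build_datasets(own_files, other_files, n_train=50, n_test=20, n_negative=50, seed=0):
    """Split one Field/Group's files into labeled training data and a test set."""
    rng = random.Random(seed)
    own = rng.sample(own_files, n_train + n_test)
    positives, test = own[:n_train], own[n_train:]
    negatives = rng.sample(other_files, n_negative)
    # Label 1 marks a positive file, label 0 a negative file.
    training = [(1, f) for f in positives] + [(0, f) for f in negatives]
    return training, test


# Hypothetical file names standing in for the downloaded PDFs.
own = [f"own_{i}.pdf" for i in range(70)]
other = [f"other_{i}.pdf" for i in range(200)]
training, test = build_datasets(own, other)
```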
Experiment: Metrics:
    Recall    = #Correct Answers / #Total Possible Answers
    Precision = #Correct Answers / #Answers Produced
Results (rows: actual Field/Group of the 20 test files; columns: assigned Field/Group):
              140200  120200  201300  220200  250400
    140200      13       2       1       2       2
    120200       1      16       0       3       0
    201300       0       5      13       2       0
    220200       1       0       2      17       0
    250400       0       0       1       0      19
Results (cont'd):
    Category  Precision  Recall
    140200      0.87      0.65
    120200      0.70      0.80
    201300      0.76      0.65
    220200      0.71      0.85
    250400      0.90      0.95
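The precision and recall figures can be recomputed from the confusion matrix. The sketch below assumes rows are the actual Field/Group and columns the assigned one; that reading is not stated on the slide, but it is the one under which every row sums to the 20 test files and the computed values agree with the reported ones to within rounding.

```python
LABELS = ["140200", "120200", "201300", "220200", "250400"]
CONFUSION = [
    [13, 2, 1, 2, 2],
    [1, 16, 0, 3, 0],
    [0, 5, 13, 2, 0],
    [1, 0, 2, 17, 0],
    [0, 0, 1, 0, 19],
]


def precision_recall(matrix):
    """Return one (precision, recall) pair per class, in row order."""
    results = []
    for k in range(len(matrix)):
        correct = matrix[k][k]                   # correct answers for class k
        produced = sum(row[k] for row in matrix)  # answers produced: column sum
        possible = sum(matrix[k])                 # total possible answers: row sum
        results.append((correct / produced, correct / possible))
    return results


scores = dict(zip(LABELS, precision_recall(CONFUSION)))
```

For example, 140200 was assigned 15 times, 13 of them correctly, giving a precision of 13/15; its recall is 13/20.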
Future Work: Hierarchical model. In the flat model, we consider each Field/Group independently. In the hierarchical model, we consider all files under a branch as positive files for training. Example branch:
    150000
        150300
            150301
            150302
        150600
            150601
            150602
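A sketch of how positive files could be gathered for a branch in the hierarchical model. The parent/child layout is one reading of the example codes on the slide (150300 and 150600 under 150000, each with two children), and the file names are hypothetical.

```python
# Assumed parent -> children layout of the example branch.
CHILDREN = {
    "150000": ["150300", "150600"],
    "150300": ["150301", "150302"],
    "150600": ["150601", "150602"],
}


def branch_codes(root, children=CHILDREN):
    """Return the root code followed by every code beneath it, depth first."""
    codes = [root]
    for child in children.get(root, []):
        codes.extend(branch_codes(child, children))
    return codes


def positive_files(root, files_by_code, children=CHILDREN):
    """Collect the files of every code in the branch as positives for the root."""
    return [f for code in branch_codes(root, children) for f in files_by_code.get(code, [])]


# Hypothetical files, keyed by the code they were categorized under.
files = {"150301": ["a.pdf"], "150302": ["b.pdf"], "150601": ["c.pdf"]}
```

In the flat model, by contrast, only files categorized under the root code itself would count as positives.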
Future Work: Multi-label classification. In practice, a document may belong to multiple Field/Groups.
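One simple way to approach multi-label assignment, as a sketch: keep every Field/Group whose estimate reaches a cut-off instead of only the highest one. The 0.5 threshold and the scores are hypothetical; the slides do not propose a specific method.

```python
def assign_labels(scores, threshold=0.5):
    """Return every Field/Group whose estimate is at or above the threshold."""
    return sorted(code for code, score in scores.items() if score >= threshold)


# Hypothetical estimates for one test document.
multi_scores = {"140200": 0.8, "120200": 0.55, "201300": 0.1}
```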
Conclusion: The classification results for DTIC documents by Field/Group were impressive (precision 0.70 to 0.90 and recall 0.65 to 0.95 on the five Field/Groups tested). Ways to improve the results have been identified, and a couple of suggestions were given for future work in this area.
References:
    Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34(1), pp. 1-47.
    Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. ( http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf )
    Kwok, J.T. (1998). Automated text categorization using support vector machine. In Proceedings of the International Conference on Neural Information Processing, Kitakyushu, Japan, Oct. 1998, pp. 347-351.
