A novel approach based on prototypes and rough sets for document and feature reductions in text categorization
Shing-Hua Ho and Jung-Hsien Chiang
Reporter: CHE-MIN LIAO
2007/8/27
Outline
- Introduction
- Document reduction based on prototype concept
- Feature reduction based on rough sets
- Performance evaluation and comparison
- Experimental design and result analysis
- Conclusion
Introduction Text categorization is the task of automatically assigning predefined category labels to new texts. Because document representations are highly dimensional, many statistical methods have recently been applied in text categorization to reduce the number of terms describing the content of documents.
Introduction Feature selection offers a means of choosing a smaller subset of the original features to represent the original dataset. Rough set theory can discover hidden patterns and dependency relationships among the large number of feature terms in text datasets. Rough set theory has been applied to many domains, including text categorization, and has been shown to be effective.
Introduction This paper presents a new approach, based on the prototype concept, for document selection; it greatly reduces the size of the dataset while preserving classification accuracy. The number of feature terms is reduced via the proposed rough-based algorithm, which uses the properties of rough set theory to identify which feature terms should be deleted and which should be selected.
Document reduction based on prototype concept The idea of document reduction is to form a number of groups, each of which contains several documents of the same label, and each group mean is used as a prototype for the group. At the beginning, the documents of each label form a group and their mean is calculated as the initial prototype.
Document reduction based on prototype concept The algorithm handles the following four situations. For all documents within a group: 1) If the closest prototype is the group's own prototype, then no modification is performed for this group. 2) If the closest prototype is one of an incorrect label, then the group is split into several subgroups according to their label types.
Document reduction based on prototype concept 3) If the closest prototype belongs to a different group but has the same label, these documents are shifted to the group of that closest prototype. 4) If the closest prototype belongs to a different group and has an incorrect label, these documents are removed to form a new group, and its mean is computed as a new prototype.
Document reduction based on prototype concept
Input: V document-label pairs {dv, s(dv)}, v=1,...,V, where s(dv)∈{1,...,U} is the label of document dv.
Output: Prototype set {Pz}, z=1,...,Z, and their corresponding labels.
Procedure:
Step 01: Set Gz = {dv | s(dv)=z}, z=1,...,U.
Step 02: For z=1 to U: calculate the initial prototypes Pz = mean(Gz) and set their labels s(Pz)=z, z=1,...,U. End For
Step 03: Set z=1, Z=U.
Step 04: For t=1 to Z: calculate Dvt = ||dv − Pt||², ∀dv∈Gz. End For
Document reduction based on prototype concept
Step 05: Determine the index of the closest prototype to each document dv as Iv = arg min_t (Dvt).
Step 06: If Iv = z, ∀dv∈Gz, then go to Step 11. End If
Step 07: If s(PIv) ≠ s(Pz), ∀dv∈Gz, then set Z = Z+1 and split Gz into two subgroups Ga and Gb; update their means Pa = mean(Ga) and Pb = mean(Gb); if s(Pa) = s(Pb), then go to Step 04. End If
Document reduction based on prototype concept
Step 08: If s(PIv) = s(Pz) and PIv ≠ Pz for some dv∈Gz, then remove these documents from Gz and include them in group GIv; update the means PIv = mean(GIv) and Pz = mean(Gz). End If
Step 09: If s(PIv) ≠ s(Pz) for some dv∈Gz, then set Z = Z+1, remove these documents from Gz, and create a new group Gn containing them; update the means Pz = mean(Gz) and Pn = mean(Gn). End If
Step 10: If z ≠ Z, then set z = z+1 and go to Step 04. End If
Step 11: If z = Z and there is no change in groups or prototypes, then STOP. End If
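The procedure above can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes dense document vectors, and it simplifies cases 2 and 4 by splitting mislabeled-neighbor documents into a new singleton group rather than splitting by label type.

```python
import numpy as np

def reduce_documents(docs, labels, max_iter=50):
    """Simplified sketch of the prototype-based document reduction."""
    docs = np.asarray(docs, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    groups = [np.where(labels == u)[0].tolist() for u in uniq]   # Step 01
    glabels = [int(u) for u in uniq]

    for _ in range(max_iter):
        protos = np.array([docs[g].mean(axis=0) for g in groups])  # Step 02
        new_groups = [[] for _ in groups]
        new_labels = list(glabels)
        changed = False
        for z, g in enumerate(groups):
            for v in g:
                d = ((protos - docs[v]) ** 2).sum(axis=1)          # Step 04
                i = int(np.argmin(d))                              # Step 05
                if i == z:
                    new_groups[z].append(v)            # case 1: no change
                elif glabels[i] == glabels[z]:
                    new_groups[i].append(v)            # case 3: shift group
                    changed = True
                else:
                    new_groups.append([v])             # cases 2/4, simplified:
                    new_labels.append(glabels[z])      # start a new group
                    changed = True
        keep = [k for k, g in enumerate(new_groups) if g]
        groups = [new_groups[k] for k in keep]
        glabels = [new_labels[k] for k in keep]
        if not changed:
            break

    protos = np.array([docs[g].mean(axis=0) for g in groups])
    return protos, glabels
```

The returned prototypes and their labels then stand in for the full document set in later steps.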
Feature reduction based on rough sets The proposed algorithm is based on the following three properties: (1) An object can be a member of at most one lower bound. (2) An object that is a member of the lower bound of a cluster is also a member of the upper bound of the same cluster. (3) An object that does not belong to any lower bound is a member of at least two upper bounds.
Feature reduction based on rough sets Using the prototype document space model, every original feature term Xn can be represented as Xn = (x1, ..., xZ)ᵀ with respect to the Z prototype documents. The distance between an object Xn and a cluster mean mk is defined by an equation that appeared as an image on the original slide.
Feature reduction based on rough sets The rough-based feature selection algorithm produces exclusive clusters and requires the desired number of clusters to be determined in advance. Theoretically, the suitable maximum number of clusters is estimated by the formula on the original slide (shown there as an image), where N is the number of features.
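The three lower/upper-bound properties above match the rough k-means style of clustering. The following is a minimal sketch under that assumption (not the paper's exact algorithm): a point whose nearest cluster is clearly closest joins that cluster's lower bound; a point with several comparably near clusters joins only their upper bounds. The threshold `eps`, the weights `w_low`/`w_up`, and the explicit `init` indices are illustrative parameters.

```python
import numpy as np

def rough_kmeans(X, k, init, w_low=0.7, w_up=0.3, eps=1.3, iters=20):
    """Sketch of rough k-means clustering with lower/upper approximations.

    X: (N, D) feature-term vectors; k: number of clusters;
    init: row indices used as initial centers (explicit, for determinism).
    """
    X = np.asarray(X, dtype=float)
    centers = X[list(init)].copy()
    lower = upper = None
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        lower = [[] for _ in range(k)]
        upper = [[] for _ in range(k)]
        for n in range(len(X)):
            j = int(np.argmin(d[n]))
            # other clusters almost as close as the nearest one
            near = [t for t in range(k) if t != j and d[n][t] <= eps * d[n][j]]
            if near:
                for t in [j] + near:       # property 3: boundary object is a
                    upper[t].append(n)     # member of >= 2 upper bounds
            else:
                lower[j].append(n)         # properties 1-2: certain object
                upper[j].append(n)
        for t in range(k):
            core = X[lower[t]].mean(axis=0) if lower[t] else None
            bd_idx = [n for n in upper[t] if n not in lower[t]]
            bd = X[bd_idx].mean(axis=0) if bd_idx else None
            if core is not None and bd is not None:
                centers[t] = w_low * core + w_up * bd   # weighted center update
            elif core is not None:
                centers[t] = core
            elif bd is not None:
                centers[t] = bd
    return lower, upper, centers
```

Feature terms falling only in upper bounds are ambiguous between clusters, which is the kind of information the paper uses to decide which terms to delete or keep.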
Performance evaluation and comparison Four feature selection methods:
- Document frequency (DF)
- Information gain (IG)
- Mutual information (MI)
- χ² statistic method
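As a reference point for these baselines, the χ² score of a term with respect to a category can be computed from a standard 2x2 contingency table of term occurrence versus category membership; a small sketch (not tied to the paper's implementation):

```python
def chi2_score(term_in_doc, labels, category):
    """χ² statistic of one term with respect to one category.

    term_in_doc: per-document boolean flags for the term;
    labels: per-document category labels.
    A/B: docs containing the term, inside/outside the category;
    C/D: docs lacking the term, inside/outside the category.
    (Document frequency, for comparison, is simply A + B.)
    """
    N = len(labels)
    A = sum(1 for t, l in zip(term_in_doc, labels) if t and l == category)
    B = sum(1 for t, l in zip(term_in_doc, labels) if t and l != category)
    C = sum(1 for t, l in zip(term_in_doc, labels) if not t and l == category)
    D = N - A - B - C
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom
```

Terms are ranked by this score per category and the top-scoring terms are retained.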
Performance evaluation and comparison Four classifiers:
- k-nearest neighbor (KNN)
- Naïve Bayes (NB)
- Rocchio method
- Support vector machine (SVM)
Experimental design and result analysis The Reuters-21578 dataset is a collection of newswire stories from 1987, compiled by David Lewis. The experimental results were obtained with 10-fold cross-validation for all classifiers.
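For concreteness, 10-fold cross-validation partitions the documents into ten folds, training on nine and testing on the held-out one, rotating through all folds; a minimal index-splitting sketch:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each document appears in exactly one test fold, so the reported accuracy averages over all documents.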
Experimental design and result analysis The 20_newsgroups dataset was assembled by Ken Lang at Carnegie Mellon University. It contains 20,000 documents collected from 20 different newsgroups, with 1,000 documents per group, and 111,446 features in all.
Conclusion The best classification accuracy is achieved by using a subset of features chosen by the information gain method together with the LSVM classifier and the proposed method. Another point worth noting is that not only classification accuracy but also computational efficiency is improved through feature reduction and document reduction.