A novel approach based on prototypes and rough sets for document and feature reductions in text categorization
Shing-Hua Ho and Jung-Hsien Chiang
Reporter: CHE-MIN LIAO
2007/8/27
Outline
- Introduction
- Document reduction based on prototype concept
- Feature reduction based on rough sets
- Performance evaluation and comparison
- Experimental design and result analysis
- Conclusion
Introduction Text categorization is the task of automatically assigning predefined category labels to new texts. Because document representations are highly dimensional, many statistical methods have recently been applied in text categorization to reduce the number of terms describing the content of documents.
Introduction Feature selection offers a means of choosing a smaller subset of the original features to represent the original dataset. Rough set theory can discover hidden patterns and dependency relationships among the large number of feature terms in text datasets. Rough set theory has been applied to many domains, including text categorization, and has been shown to be effective.
Introduction This paper presents a new approach, based on the prototype concept, for document selection; it greatly reduces the size of the dataset while preserving classification accuracy. The number of feature terms is reduced via the proposed rough-based algorithm, which uses the properties of rough set theory to identify which feature terms should be deleted and which should be selected.
Document reduction based on prototype concept The idea of document reduction is to form a number of groups, each of which contains several documents of the same label, and each group mean is used as a prototype for the group. At the beginning, the documents of each label form a group and their mean is calculated as the initial prototype.
Document reduction based on prototype concept The algorithm handles the following four situations. For all documents within a group: 1) If the closest prototype is the group's own prototype, then no modification is performed for this group. 2) If the closest prototype is one of an incorrect label, then the group is split into several subgroups according to their label types.
Document reduction based on prototype concept 3) If the closest prototype belongs to a different group but has the same label, these documents are shifted to the group of that closest prototype. 4) If the closest prototype belongs to a different group and has an incorrect label, these documents are removed to form a new group, and its mean is computed as a new prototype.
Document reduction based on prototype concept
Input: V document-label pairs {dv, s(dv)}, v=1,...,V, where s(dv)∈{1,...,U} is the label of document dv.
Output: Prototype set {Pz}, z=1,...,Z, and their corresponding labels.
Procedure:
Step 01: Set Gz = {dv | s(dv)=z}, z=1,...,U.
Step 02: For z=1 to U: calculate the initial prototypes Pz = mean(Gz) and set their labels s(Pz)=z, z=1,...,U. End For
Step 03: Set z=1, Z=U.
Step 04: For t=1 to Z: calculate Dvt = ||dv − Pt||², ∀dv∈Gz. End For
Document reduction based on prototype concept
Step 05: Determine the index of the closest prototype to each document dv as Iv = arg min_t (Dvt).
Step 06: If Iv = z, ∀dv∈Gz, then go to Step 11. End If
Step 07: If s(PIv) ≠ s(Pz), ∀dv∈Gz, then set Z = Z+1 and split Gz into two subgroups Ga and Gb; update their means Pa = mean(Ga) and Pb = mean(Gb); if s(Pa) = s(Pb), then go to Step 04. End If
Document reduction based on prototype concept
Step 08: If s(PIv) = s(Pz) and PIv ≠ Pz for some dv∈Gz, then remove these documents from Gz and include them in group GIv; update the means PIv = mean(GIv) and Pz = mean(Gz). End If
Step 09: If s(PIv) ≠ s(Pz) for some dv∈Gz, then set Z = Z+1, remove these documents from Gz, and create a new group Gn containing them; update the means Pz = mean(Gz) and Pn = mean(Gn). End If
Step 10: If z ≠ Z, then set z = z+1 and go to Step 04. End If
Step 11: If z = Z and there is no change in groups or prototypes, then STOP. End If
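The procedure above can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes dense document vectors, and it simplifies cases 2 and 4 by splitting mislabeled-neighbor documents into a new singleton group rather than splitting by label type.

```python
import numpy as np

def reduce_documents(docs, labels, max_iter=50):
    """Simplified sketch of the prototype-based document reduction."""
    docs = np.asarray(docs, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    groups = [np.where(labels == u)[0].tolist() for u in uniq]   # Step 01
    glabels = [int(u) for u in uniq]

    for _ in range(max_iter):
        protos = np.array([docs[g].mean(axis=0) for g in groups])  # Step 02
        new_groups = [[] for _ in groups]
        new_labels = list(glabels)
        changed = False
        for z, g in enumerate(groups):
            for v in g:
                d = ((protos - docs[v]) ** 2).sum(axis=1)          # Step 04
                i = int(np.argmin(d))                              # Step 05
                if i == z:
                    new_groups[z].append(v)            # case 1: no change
                elif glabels[i] == glabels[z]:
                    new_groups[i].append(v)            # case 3: shift group
                    changed = True
                else:
                    new_groups.append([v])             # cases 2/4, simplified:
                    new_labels.append(glabels[z])      # start a new group
                    changed = True
        keep = [k for k, g in enumerate(new_groups) if g]
        groups = [new_groups[k] for k in keep]
        glabels = [new_labels[k] for k in keep]
        if not changed:
            break

    protos = np.array([docs[g].mean(axis=0) for g in groups])
    return protos, glabels
```

The returned prototypes and their labels then stand in for the full document set in later steps.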
Feature reduction based on rough sets The proposed algorithm is based on the following three properties: (1) An object can be a member of at most one lower bound. (2) An object that is a member of the lower bound of a cluster is also a member of the upper bound of the same cluster. (3) An object that does not belong to any lower bound is a member of at least two upper bounds.
Feature reduction based on rough sets Using the prototype document space model, every original feature term Xn can be represented as Xn = (x1, ..., xZ)ᵀ with respect to the Z prototype documents. The distance between an object Xn and a cluster mean mk is defined by an equation that appeared as an image on the original slide.
Feature reduction based on rough sets The rough-based feature selection algorithm produces exclusive clusters and requires the desired number of clusters to be determined in advance. Theoretically, the suitable maximum number of clusters is estimated by the formula on the original slide (shown there as an image), where N is the number of features.
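The three lower/upper-bound properties above match the rough k-means style of clustering. The following is a minimal sketch under that assumption (not the paper's exact algorithm): a point whose nearest cluster is clearly closest joins that cluster's lower bound; a point with several comparably near clusters joins only their upper bounds. The threshold `eps`, the weights `w_low`/`w_up`, and the explicit `init` indices are illustrative parameters.

```python
import numpy as np

def rough_kmeans(X, k, init, w_low=0.7, w_up=0.3, eps=1.3, iters=20):
    """Sketch of rough k-means clustering with lower/upper approximations.

    X: (N, D) feature-term vectors; k: number of clusters;
    init: row indices used as initial centers (explicit, for determinism).
    """
    X = np.asarray(X, dtype=float)
    centers = X[list(init)].copy()
    lower = upper = None
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        lower = [[] for _ in range(k)]
        upper = [[] for _ in range(k)]
        for n in range(len(X)):
            j = int(np.argmin(d[n]))
            # other clusters almost as close as the nearest one
            near = [t for t in range(k) if t != j and d[n][t] <= eps * d[n][j]]
            if near:
                for t in [j] + near:       # property 3: boundary object is a
                    upper[t].append(n)     # member of >= 2 upper bounds
            else:
                lower[j].append(n)         # properties 1-2: certain object
                upper[j].append(n)
        for t in range(k):
            core = X[lower[t]].mean(axis=0) if lower[t] else None
            bd_idx = [n for n in upper[t] if n not in lower[t]]
            bd = X[bd_idx].mean(axis=0) if bd_idx else None
            if core is not None and bd is not None:
                centers[t] = w_low * core + w_up * bd   # weighted center update
            elif core is not None:
                centers[t] = core
            elif bd is not None:
                centers[t] = bd
    return lower, upper, centers
```

Feature terms falling only in upper bounds are ambiguous between clusters, which is the kind of information the paper uses to decide which terms to delete or keep.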
Performance evaluation and comparison Four feature selection methods:
- Document frequency (DF)
- Information gain (IG)
- Mutual information (MI)
- χ² statistic method
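As a reference point for these baselines, the χ² score of a term with respect to a category can be computed from a standard 2x2 contingency table of term occurrence versus category membership; a small sketch (not tied to the paper's implementation):

```python
def chi2_score(term_in_doc, labels, category):
    """χ² statistic of one term with respect to one category.

    term_in_doc: per-document boolean flags for the term;
    labels: per-document category labels.
    A/B: docs containing the term, inside/outside the category;
    C/D: docs lacking the term, inside/outside the category.
    (Document frequency, for comparison, is simply A + B.)
    """
    N = len(labels)
    A = sum(1 for t, l in zip(term_in_doc, labels) if t and l == category)
    B = sum(1 for t, l in zip(term_in_doc, labels) if t and l != category)
    C = sum(1 for t, l in zip(term_in_doc, labels) if not t and l == category)
    D = N - A - B - C
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom
```

Terms are ranked by this score per category and the top-scoring terms are retained.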
Performance evaluation and comparison Four classifiers:
- k-nearest neighbor (KNN)
- Naïve Bayes (NB)
- Rocchio method
- Support vector machine (SVM)
Experimental design and result analysis The Reuters-21578 dataset is a collection of newswire stories from 1987, compiled by David Lewis. The experimental results were obtained with 10-fold cross-validation for all classifiers.
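For concreteness, 10-fold cross-validation partitions the documents into ten folds, training on nine and testing on the held-out one, rotating through all folds; a minimal index-splitting sketch:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each document appears in exactly one test fold, so the reported accuracy averages over all documents.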
Experimental design and result analysis The 20_newsgroups dataset was assembled by Ken Lang at Carnegie Mellon University. It contains 20,000 documents collected from 20 different newsgroups, with 1,000 documents per group, and 111,446 features in all.
Conclusion The best classification accuracy is achieved by using a subset of features chosen by the information gain method together with the LSVM classifier and the proposed method. Another point worth noting is that not only classification accuracy but also computational efficiency is improved through feature reduction and document reduction.