International Journal of Engineering Research and Development
e-ISSN: 2278-067X, p-ISSN: 2278-800X, www.ijerd.com
Volume 9, Issue 3 (December 2013), PP. 28-33

Concept and Term Based Similarity Measure for Text
Classification and Clustering
B. Sindhiya¹, Dr. N. Tajunisha²
¹M.Phil Scholar, Department of Computer Science, Sri Ramakrishna College of Arts and Science for Women, Coimbatore, India.
²M.Phil, Ph.D, Associate Professor, Department of Computer Science, Sri Ramakrishna College of Arts and Science for Women, Coimbatore, India.

Abstract:- The exploitation of syntactic structures and semantic background knowledge has always been an
appealing subject in the context of data mining, text retrieval and information management. The usefulness of
this kind of information has been shown most prominently in highly specialized tasks, such as text
categorization scenarios. So far, however, additional syntactic or semantic information has been used only
individually. In this paper, a new approach is proposed: the concept and term based similarity measure, which
incorporates linguistic and semantic structures using syntactic dependencies and semantic background
knowledge. This method represents the meaning of texts in a high-dimensional space of
concepts derived from WordNet. A number of case studies are included to demonstrate
the various aspects of this framework.
Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers,
clustering algorithms.

I. INTRODUCTION

Clustering maps data items into clusters, where clusters are natural groupings of data items based on
similarity or probability density methods. Unlike classification and prediction, which analyze class-labeled data
objects, clustering analyzes data objects without class labels and tries to generate such labels. A similarity
measure or similarity function is a real-valued function that quantifies the similarity between two objects.
The similarity measure reflects the degree of closeness or separation of the target objects and should
correspond to the characteristics that are believed to distinguish the clusters embedded in the data. Before
clustering, a similarity/distance measure must be determined [1]. Choosing an appropriate similarity measure is
also crucial for cluster analysis, especially for particular types of clustering algorithms.
Text Categorization (TC) is the classification of documents with respect to a set of one or more preexisting
categories [2]. The classification phase consists of generating a weighted vector for all categories, then using a
similarity measure to find the closest category. The similarity measure is used to determine the degree of
resemblance between two vectors. To achieve reasonable classification results, a similarity measure should
generally respond with larger values to documents that belong to the same class and with smaller values
otherwise. During the last decades, a large number of methods proposed for text categorization were typically
based on the classical Bag-of-Words model where each term or term stem is an independent feature.
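The classification phase described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the tokenizer, the use of cosine similarity, and the two toy categories are all assumptions made for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map a text to term counts; every lower-cased token is an independent feature."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def closest_category(doc_vector, category_vectors):
    """Return the category whose weighted vector is most similar to the document."""
    return max(category_vectors, key=lambda c: cosine(doc_vector, category_vectors[c]))

# Hypothetical toy categories, each summarised by one weighted vector.
categories = {
    "sports": bag_of_words("match team goal team player"),
    "finance": bag_of_words("market stock price bank stock"),
}
doc = bag_of_words("the team scored a late goal")
print(closest_category(doc, categories))  # prints: sports
```

Because each token is its own feature, "goal" and "score" share nothing in this representation, which is the limitation the concept-based measure is meant to address.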
Existing similarity measures have mostly been used to assess the similarity between
words. Although the results of the information-theoretic similarity measure are statistically significant, it does not
reduce the dimension of the vector model [3]. Metric distances such as Euclidean distance are not appropriate for
high-dimensional and sparse domains. Because relations between words are ignored, learning
algorithms are restricted to detecting patterns in the terminology used, while conceptual patterns remain
ignored.
Existing approaches require performing an optimization over an entire collection of documents. Most
of these techniques are computationally expensive.

II. RELATED WORKS

Similarity measures have been extensively used in text classification and clustering algorithms. The
spherical k-means algorithm [4] adopted the cosine similarity measure for document clustering. The method
is motivated by unlabeled document collections becoming increasingly common and available. Using words as features,
text documents are often represented as high-dimensional and sparse vectors. The algorithm outputs k disjoint
clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm.
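The concept vector mentioned above is simple to compute. The following is a small sketch under the assumption that documents are dense lists of term weights already scaled to unit length; the function names and the three toy documents are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def concept_vector(cluster):
    """Centroid of the cluster's document vectors, rescaled to unit Euclidean norm."""
    dim = len(cluster[0])
    centroid = [sum(doc[i] for doc in cluster) / len(cluster) for i in range(dim)]
    return normalize(centroid)

def cosine_unit(u, v):
    """For unit-length vectors the cosine similarity reduces to the dot product."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical term-weight vectors over a three-word vocabulary.
docs = [normalize(d) for d in ([2, 0, 1], [3, 0, 0], [0, 4, 1])]
c = concept_vector(docs[:2])        # concept vector of a cluster holding the first two documents
print(cosine_unit(docs[0], c) > cosine_unit(docs[2], c))  # prints: True
```

Normalizing both documents and centroids is what lets the clustering step use a plain dot product as its similarity.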
Derrick Higgins [5] adopted a cosine-based pairwise-adaptive similarity measure for document
clustering. For large high-dimensional document datasets, the pairwise-adaptive similarity measure improves
unsupervised clustering quality and speed compared to the original cosine similarity measure. Zoulficar Younes
[6] reported results of clustering experiments with clustering algorithms and 12 different text data sets, and
concluded that the objective function based on cosine similarity “leads to the best solutions irrespective of the
number of clusters for most of the data sets.”
Daphne Koller [7] proposed a divisive information-theoretic feature clustering algorithm for text
classification using the Kullback-Leibler divergence. The high dimensionality of text can be a deterrent to applying
complex learners such as Support Vector Machines to the task of text classification.
Kullback [8] combined squared Euclidean distance with relative entropy in a k-means-like clustering
algorithm. A recently introduced k-means algorithm is specifically designed to handle unit-length document
vectors. The authors of [6] conclude that the objective function based on cosine similarity leads to the best solutions
irrespective of the number of clusters for most of the data sets.
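For readers unfamiliar with the Kullback-Leibler divergence (relative entropy) used by these methods, a minimal sketch follows. It assumes two discrete distributions over the same outcomes, with q non-zero wherever p is non-zero; the example distributions are invented.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing by convention."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical class distributions of two words over two categories.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))                          # prints: 0.0
print(kl_divergence(p, q) > kl_divergence(q, p))    # prints: True
```

The divergence is zero only for identical distributions and is not symmetric, which is one reason it is treated as a divergence rather than a metric distance.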
Craven [9] performed document clustering based on a proposed phrase-based similarity measure. The
phrase-based document similarity computes the pairwise similarities of documents based on the Suffix Tree
Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in
the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme when computing document similarity with phrases.

III. PROPOSED METHODOLOGY

In this section, the proposed concept and term based similarity between documents is described.
The proposed system also measures the semantic similarity between terms and concepts using the WordNet
and TreeTagger tools.
3.1 CSMTP (CONCEPT BASED SIMILARITY MEASURE FOR TEXT PROCESSING) ALGORITHM
The CSMTP algorithm generates the terms from the testing documents, selects the appropriate
features, and calculates the similarity measure based on the terms and their
respective concepts.
CSMTP ALGORITHM
1. Let D1 and D2 be the testing documents.
2. Let T1 and T2 be the terms from the documents D1 and D2.
3. Remove the stopwords ST1 and ST2 from the documents D1 and D2.
4. Let C1 and C2 be two concepts from T1 and T2 respectively (where T1 denotes the first thesaurus and T2
the second).
5. Compute the similarity measure between two concepts c_i and c_k with
   $$SIM(c_i, c_k) = 1 - \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))}$$
   where A and B are the sets of all articles that link to concepts c_i and c_k respectively, and W is the set of
   all articles. max(|A|, |B|) and min(|A|, |B|) are the sizes of the larger and the smaller of the two sets,
   and |A ∩ B| is the number of articles that link to both concepts.
6. Compute the semantic relatedness between a term and its candidate concepts in a given document according
to the context information:
   $$Rel(t, c_i \mid d_j) = \frac{1}{|T| - 1} \sum_{t_l \in T,\ t_l \neq t} \frac{1}{|cs_l|} \sum_{c_k \in cs_l} SIM(c_i, c_k)$$
   where T is the term set of the jth document d_j, t_l is a term in d_j other than t, and cs_l is the candidate concept
   set related to term t_l.
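Steps 5 and 6 can be sketched as follows. This is an illustrative reading of the two formulas, not the authors' code: the in-link sets, concept names, and candidate lists are hypothetical toy data, and returning 0 when two concepts share no in-links is an assumption made here because log 0 is undefined.

```python
import math

def sim(links_a, links_b, total_articles):
    """Step 5: similarity of two concepts computed from their sets of in-linking articles."""
    common = len(links_a & links_b)
    if common == 0:
        return 0.0  # assumption: no shared in-links is treated as zero similarity
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    denom = math.log(total_articles) - math.log(smaller)
    if denom == 0:
        return 1.0  # both sets cover every article, so they are identical
    return 1.0 - (math.log(larger) - math.log(common)) / denom

def rel(term, concept, doc_terms, candidates, inlinks, total_articles):
    """Step 6: average similarity of `concept` to the candidate concepts of the other terms."""
    others = [t for t in doc_terms if t != term]
    if not others:
        return 0.0
    total = 0.0
    for t_l in others:
        cs_l = candidates[t_l]  # candidate concept set of term t_l (assumed non-empty)
        total += sum(sim(inlinks[concept], inlinks[c_k], total_articles) for c_k in cs_l) / len(cs_l)
    return total / len(others)  # len(others) equals |T| - 1 for a set of distinct terms

# Hypothetical data: 100 articles in total, in-link sets given as article ids.
inlinks = {
    "bank#finance": {1, 2, 3, 4},
    "bank#river": {10, 11, 12},
    "money": {1, 2, 3, 5},
    "loan": {2, 3, 4, 6},
}
candidates = {"bank": ["bank#finance", "bank#river"], "money": ["money"], "loan": ["loan"]}
doc_terms = ["bank", "money", "loan"]

for c in candidates["bank"]:
    print(c, round(rel("bank", c, doc_terms, candidates, inlinks, 100), 3))
```

In this toy document the financial sense of "bank" scores higher than the river sense, because its in-links overlap with those of the context terms "money" and "loan".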
3.2 SYNTACTIC REPRESENTATION
The tf-idf weighting scheme is used at the syntactic level to record syntactic information. Tf–idf, term
frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a
document in a collection or corpus. It is often used as a weighting factor in information retrieval and text
mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is
offset by the frequency of the word in the corpus, which helps to control for the fact that some words are
generally more common than others. Tf–idf is the product of two statistics, term frequency and inverse
document frequency. Various ways for determining the exact values of both statistics exist. In the case of the
term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number
of times that term t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple tf
scheme is tf(t,d) = f(t,d).
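A short sketch of the weighting follows. The raw-frequency tf matches the definition above; the idf variant log(N / n_t) is one common choice and an assumption here, since the text does not fix a particular formula. The three-document corpus is invented.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw term frequency f(t, d): how many times `term` occurs in the document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """Inverse document frequency log(N / n_t); 0.0 if the term occurs in no document."""
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t) if n_t else 0.0

def tf_idf(term, doc_tokens, corpus):
    """Product of the two statistics."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical tokenised corpus.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

The word "the" occurs twice in the first document but in every document of the corpus, so its weight is zero, while the rarer "mat" receives the highest weight.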

3.3 SEMANTIC SIMILARITY
The semantic level consists of concepts related to the terms in the syntactic level. These two levels are
connected via the semantic correlation between terms and their relevant concepts.
WordNet is used to ascertain connections among four types of parts of speech (POS): noun, verb, adjective, and adverb. The minimum unit in WordNet is the synset, which represents an exact meaning
of a word. It includes the word, its explanation, and its synonyms.

IV. EXPERIMENTAL RESULTS

In this section, the effectiveness of the proposed similarity measure CSMTP is investigated. The
investigation applies the CSMTP measure in several text applications, including k-NN based single-label
classification (SL-kNN), k-NN based multi-label classification (ML-kNN), k-means clustering (k-means),
and hierarchical agglomerative clustering (HAC). Two data sets, WebKB and Reuters-8, are
used in the experiments presented below.
1) WebKB
The documents in the WebKB data set are webpages collected by the World Wide Knowledge Base
(Web→KB) project of the CMU text learning group. The documents were manually classified into several
different classes. The documents of this data set were not predesignated as training or testing patterns, so the
data set can be randomly divided into training and testing subsets.
2) Reuters-8
The Reuters-21578 ModApté Split Text Categorization Test Collection contains thousands of documents
collected from the Reuters newswire in 1987. The most widely used version is Reuters-21578 ModApté, which
contains 90 categories and 12902 documents.
4.1 CLASSIFICATION PERFORMANCE
For the WebKB dataset, the randomly selected training documents are used for training/validation and the
testing documents are used for testing. For the Reuters-8 dataset, the predesignated training data are used for
training/validation and the predesignated testing data are used for testing. Note that the data for
training/validation are separate from the data for testing in each case.
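The k-NN classifiers used in these experiments accept any pairwise similarity measure. The sketch below shows that plug-in structure only; the Jaccard overlap is a stand-in for SMTP or CSMTP, and the four training documents and their labels are hypothetical.

```python
from collections import Counter

def knn_predict(query, training, similarity, k=3):
    """Label `query` by majority vote among its k most similar training documents."""
    ranked = sorted(training, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def overlap(u, v):
    """Stand-in similarity: Jaccard overlap of the two term sets."""
    u, v = set(u), set(v)
    return len(u & v) / len(u | v) if u | v else 0.0

# Hypothetical labelled training documents (token list, class label).
training = [
    ("course syllabus lecture exam".split(), "course"),
    ("lecture notes homework exam".split(), "course"),
    ("professor research publications".split(), "faculty"),
    ("research lab professor students".split(), "faculty"),
]
print(knn_predict("exam schedule lecture".split(), training, overlap, k=3))  # prints: course
```

Swapping `overlap` for another function with the same two-argument signature changes the measure without touching the classifier, which is how two measures can be compared under identical k settings.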
4.1.1 Single-Label Document Classification
In this experiment, we compare the performance of our measure and the others in single-label document
classification. The performance is evaluated by the classification accuracy, AC, which compares the predicted
label of each document with that provided by the document corpus:
$$AC = \frac{\sum_{i=1}^{n} E(c_i, c_i')}{n}$$
where n is the number of testing documents, and c_i and c_i' are the target label and the predicted label,
respectively, of the ith document. E(c_i, c_i') = 1 if c_i = c_i', and E(c_i, c_i') = 0 otherwise.
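The accuracy formula translates directly into code. The function name and the four example labels below are illustrative assumptions.

```python
def classification_accuracy(targets, predictions):
    """AC: fraction of documents whose predicted label equals the target label."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one document is required")
    # E(c_i, c_i') is 1 when the labels agree and 0 otherwise.
    matches = sum(1 for c, c_pred in zip(targets, predictions) if c == c_pred)
    return matches / len(targets)

targets = ["course", "faculty", "student", "course"]
predictions = ["course", "student", "student", "course"]
print(classification_accuracy(targets, predictions))  # prints: 0.75
```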
Figure 4.1 shows the classification accuracy obtained by SL-kNN with the SMTP and CSMTP similarity measures
using different k settings, i.e., k = 4, 8, 12, 16, 20, on the training/validation data of
WebKB. The figure shows that the single-label document classification accuracy obtained with the
proposed CSMTP measure is higher than that obtained with the SMTP measure.

FIGURE 4.1 Classification Accuracy by SL-kNN on Training/Validation Data with SMTP and CSMTP Using
Different k Settings.

Table 4.1 shows the classification accuracy (AC) obtained by single-label classification on the testing data of
WebKB.
TABLE 4.1
Classification Accuracies by SL-kNN with Different Measures on Testing Data of WebKB
         K=1     K=3     K=5     K=7     K=9
SMTP     0.9013  0.9191  0.9242  0.9223  0.9233
CSMTP    0.9338  0.9411  0.9420  0.9447  0.9461
4.1.2 MULTI-LABEL CLASSIFICATION
Figure 4.2 shows the classification accuracy obtained by ML-kNN with the SMTP and CSMTP similarity
measures using different k settings, i.e., k = 4, 8, 12, 16, 20, on the training/validation data
of WebKB. Figure 4.2 shows that the multi-label document classification accuracy obtained with
the proposed CSMTP measure is higher than that obtained with the SMTP measure.

FIGURE 4.2 Classification Accuracy by ML-kNN on Training/Validation Data with SMTP and CSMTP Using Different k
Settings.
Table 4.2 shows the classification accuracy (AC) obtained by multi-label classification on the testing
data of WebKB.
TABLE 4.2
Classification Accuracies by ML-kNN with Different Measures on Testing Data of WebKB
         K=1     K=3     K=5     K=7     K=9
SMTP     0.6910  0.6932  0.6965  0.6990  0.7009
CSMTP    0.7130  0.7111  0.7114  0.7092  0.7083
4.2 CLUSTERING PERFORMANCE
For a document corpus with p classes and n documents, the class labels are removed and one-third of the
documents are randomly selected for training/validation, with the remainder used for testing. Note that the data for
training/validation are separate from the data for testing.
4.2.1 K-MEANS CLUSTERING
In this experiment, the performance of the CSMTP measure in clustering is compared with that of the
SMTP measure. The performance is evaluated by the clustering accuracy, AC, which compares the
predicted label of each document with that provided by the document corpus:
$$AC = \frac{\sum_{i=1}^{K} most_i}{n}$$
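The text does not define most_i. The sketch below assumes it is the number of documents in cluster i that carry the most frequent class label of that cluster, with K the number of clusters and n the number of documents; under that reading AC is the measure often called purity. The cluster assignments and labels are invented for the example.

```python
from collections import Counter

def clustering_accuracy(cluster_ids, class_labels):
    """AC under the assumption that most_i is the majority-class count of cluster i."""
    if len(cluster_ids) != len(class_labels) or not class_labels:
        raise ValueError("need two non-empty sequences of equal length")
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # Sum the size of the largest class inside each cluster.
    majority_total = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority_total / len(class_labels)

cluster_ids = [0, 0, 0, 1, 1, 2]
class_labels = ["a", "a", "b", "b", "b", "c"]
print(round(clustering_accuracy(cluster_ids, class_labels), 4))  # prints: 0.8333
```

Note that this quantity tends to rise as the number of clusters grows, since smaller clusters are more easily dominated by a single class.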
Figure 4.3 shows the clustering accuracy obtained by k-means with the SMTP and CSMTP similarity
measures using different k settings, i.e., k = 4, 8, 12, 16, 20, on the training/validation data of WebKB. The
figure shows that the k-means clustering accuracy obtained with the proposed CSMTP measure is
higher than that obtained with the SMTP measure.
FIGURE 4.3 Clustering Accuracy by K-Means on Training/Validation Data with SMTP and CSMTP Using
Different k Settings.


Table 4.3 shows the clustering accuracy (AC) obtained by k-means clustering on the testing data of WebKB.
TABLE 4.3
AC Values by K-Means with Different Measures on Testing Data of WebKB
         K=8     K=16    K=24    K=32
SMTP     0.7906  0.8584  0.8692  0.8728
CSMTP    0.8450  0.8702  0.8796  0.8964
4.2.2 HIERARCHICAL AGGLOMERATIVE DOCUMENT CLUSTERING PERFORMANCE
Figure 4.4 shows the clustering accuracy obtained by hierarchical clustering with the SMTP and CSMTP
similarity measures using different k settings, i.e., k = 4, 8, 12, 16, 20, on the training/validation data of
WebKB. The figure shows that the hierarchical agglomerative clustering accuracy obtained with the
proposed CSMTP measure is higher than that obtained with the SMTP measure.

FIGURE 4.4 Clustering Accuracy by HAC on Training/Validation Data with SMTP and CSMTP Using Different k
Settings.
Table 4.4 shows the clustering accuracy (AC) obtained by hierarchical agglomerative clustering on the testing data of
WebKB.
TABLE 4.4
Clustering Accuracies by HAC with Different Measures on Testing Data of WebKB
         K=1     K=3     K=5     K=7     K=9
SMTP     0.5960  0.5483  0.5367  0.5091  0.5000
CSMTP    0.6770  0.6343  0.5552  0.5244  0.5200

V. CONCLUSION

The concept and term based model represents a document as a two-way model with the aid of WordNet.
In the two-way representation model, the term information is represented first and the concept information
second, and the two levels are connected by the semantic relatedness between terms and concepts.
Experimental results on real data sets show that the proposed model and classification framework
significantly improve classification and clustering performance compared with the existing
SMTP (similarity measure for text processing) model. The experiments show that CSMTP (concept and term based
similarity measure for text processing) takes less time when running in parallel and less space when running in
series, and that its categorization accuracy is high.

VI. FUTURE WORK
In future, work can focus on concept mapping and weighting technology to find a better
concept vector space for documents, because a better concept-based representation can help to further improve
the performance of the text classification and clustering framework. A new semantic-based vector space model
utilizing category information can also be exploited. Afterwards, the two-way representation model can be
extended to a three-way model containing term, concept and category information. CSMTP will
also be improved to fit the three-way model and achieve better text classification and clustering
performance.
REFERENCES
[1]. Anna Huang. Similarity Measures for Text Document Clustering. 2008.
[2]. F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys,
34(1):1–47, 2002.
[3]. S. Clinchant and E. Gaussier. Information-Based Models for Ad Hoc IR. Proceedings of the 33rd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, pages
234–241, 2010.
[4]. Derrick Higgins and Jill Burstein. Sentence Similarity Measures for Essay Coherence. 2007.
[5]. Donald Metzler, Susan Dumais, and Christopher Meek. Similarity Measures for Short Segments of Text. 2007.
[6]. Zoulficar Younes, Fahed Abdallah, and Thierry Denoeux. Multi-Label Classification Algorithm Derived from
K-Nearest Neighbor Rule with Label Dependencies. 2003.
[7]. Daphne Koller and Mehran Sahami. Hierarchically Classifying Documents Using Very Few Words.
Proceedings of the 14th International Conference on Machine Learning (ML), Nashville, Tennessee,
July 1997, pages 170–178.
[8]. S. Kullback and R. A. Leibler. On Information and Sufficiency. Annals of Mathematical Statistics,
22(1):79–86, 1951.
[9]. H. Chim and X. Deng. Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions
on Knowledge and Data Engineering, 20(9):1217–1229, 2008.
[10]. http://web.ist.utl.pt/~acardoso/datasets/
[11]. http://www.cs.technion.ac.il/~ronb/thesis.html
[12]. http://www.daviddlewis.com/resources/testcollections/reuters21578/
[13]. http://www.dmoz.org/

33

More Related Content

PDF
L0261075078
PPTX
Tdm probabilistic models (part 2)
PDF
Bl24409420
DOC
Discovering Novel Information with sentence Level clustering From Multi-docu...
PDF
E1062530
PDF
G04124041046
PDF
Seeds Affinity Propagation Based on Text Clustering
PDF
Blei ngjordan2003
L0261075078
Tdm probabilistic models (part 2)
Bl24409420
Discovering Novel Information with sentence Level clustering From Multi-docu...
E1062530
G04124041046
Seeds Affinity Propagation Based on Text Clustering
Blei ngjordan2003

What's hot (17)

PDF
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
PPTX
PDF
Topicmodels
PDF
O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE
PPTX
Search Engines
PDF
A rough set based hybrid method to text categorization
PDF
Ju3517011704
PDF
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
PDF
An Improved Similarity Matching based Clustering Framework for Short and Sent...
PDF
Topic models
PDF
Canini09a
PPTX
Adversarial and reinforcement learning-based approaches to information retrieval
PDF
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
PPTX
Probabilistic retrieval model
PDF
TopicModels_BleiPaper_Summary.pptx
ODP
Topic Modeling
PDF
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...
Topicmodels
O NTOLOGY B ASED D OCUMENT C LUSTERING U SING M AP R EDUCE
Search Engines
A rough set based hybrid method to text categorization
Ju3517011704
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
An Improved Similarity Matching based Clustering Framework for Short and Sent...
Topic models
Canini09a
Adversarial and reinforcement learning-based approaches to information retrieval
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
Probabilistic retrieval model
TopicModels_BleiPaper_Summary.pptx
Topic Modeling
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Ad

Viewers also liked (7)

PDF
International Journal of Engineering Research and Development (IJERD)
PDF
International Journal of Engineering Research and Development (IJERD)
DOCX
Kịch bản sư phạm
PDF
International Journal of Engineering Research and Development (IJERD)
PDF
Stepping Up To The Challenge - CMO Insignt From The Global C-Suite Study
PPTX
PDF
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
Kịch bản sư phạm
International Journal of Engineering Research and Development (IJERD)
Stepping Up To The Challenge - CMO Insignt From The Global C-Suite Study
International Journal of Engineering Research and Development (IJERD)
Ad

Similar to International Journal of Engineering Research and Development (IJERD) (20)

PDF
L0261075078
PDF
International Journal of Engineering and Science Invention (IJESI)
PDF
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
PDF
F017243241
PPTX
Chat bot using text similarity approach
PPTX
IR.pptx
PDF
J017145559
PDF
Challenging Issues and Similarity Measures for Web Document Clustering
PDF
20433-39028-3-PB.pdf
PPTX
Text similarity measures
PDF
Context Sensitive Relatedness Measure of Word Pairs
PDF
IRJET - Document Comparison based on TF-IDF Metric
PDF
Volume 2-issue-6-1969-1973
PDF
Volume 2-issue-6-1969-1973
PDF
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
PDF
A Novel Clustering Method for Similarity Measuring in Text Documents
PDF
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING
PDF
P13 corley
PDF
Context Driven Technique for Document Classification
PDF
Machine learning for text document classification-efficient classification ap...
L0261075078
International Journal of Engineering and Science Invention (IJESI)
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
F017243241
Chat bot using text similarity approach
IR.pptx
J017145559
Challenging Issues and Similarity Measures for Web Document Clustering
20433-39028-3-PB.pdf
Text similarity measures
Context Sensitive Relatedness Measure of Word Pairs
IRJET - Document Comparison based on TF-IDF Metric
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Clustering Method for Similarity Measuring in Text Documents
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING
P13 corley
Context Driven Technique for Document Classification
Machine learning for text document classification-efficient classification ap...

More from IJERD Editor (20)

PDF
A Novel Method for Prevention of Bandwidth Distributed Denial of Service Attacks
PDF
MEMS MICROPHONE INTERFACE
PDF
Influence of tensile behaviour of slab on the structural Behaviour of shear c...
PDF
Gold prospecting using Remote Sensing ‘A case study of Sudan’
PDF
Reducing Corrosion Rate by Welding Design
PDF
Router 1X3 – RTL Design and Verification
PDF
Active Power Exchange in Distributed Power-Flow Controller (DPFC) At Third Ha...
PDF
Mitigation of Voltage Sag/Swell with Fuzzy Control Reduced Rating DVR
PDF
Study on the Fused Deposition Modelling In Additive Manufacturing
PDF
Spyware triggering system by particular string value
PDF
A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...
PDF
Secure Image Transmission for Cloud Storage System Using Hybrid Scheme
PDF
Application of Buckley-Leverett Equation in Modeling the Radius of Invasion i...
PDF
Gesture Gaming on the World Wide Web Using an Ordinary Web Camera
PDF
Hardware Analysis of Resonant Frequency Converter Using Isolated Circuits And...
PDF
Simulated Analysis of Resonant Frequency Converter Using Different Tank Circu...
PDF
Moon-bounce: A Boon for VHF Dxing
PDF
“MS-Extractor: An Innovative Approach to Extract Microsatellites on „Y‟ Chrom...
PDF
Importance of Measurements in Smart Grid
PDF
Study of Macro level Properties of SCC using GGBS and Lime stone powder
A Novel Method for Prevention of Bandwidth Distributed Denial of Service Attacks
MEMS MICROPHONE INTERFACE
Influence of tensile behaviour of slab on the structural Behaviour of shear c...
Gold prospecting using Remote Sensing ‘A case study of Sudan’
Reducing Corrosion Rate by Welding Design
Router 1X3 – RTL Design and Verification
Active Power Exchange in Distributed Power-Flow Controller (DPFC) At Third Ha...
Mitigation of Voltage Sag/Swell with Fuzzy Control Reduced Rating DVR
Study on the Fused Deposition Modelling In Additive Manufacturing
Spyware triggering system by particular string value
A Blind Steganalysis on JPEG Gray Level Image Based on Statistical Features a...
Secure Image Transmission for Cloud Storage System Using Hybrid Scheme
Application of Buckley-Leverett Equation in Modeling the Radius of Invasion i...
Gesture Gaming on the World Wide Web Using an Ordinary Web Camera
Hardware Analysis of Resonant Frequency Converter Using Isolated Circuits And...
Simulated Analysis of Resonant Frequency Converter Using Different Tank Circu...
Moon-bounce: A Boon for VHF Dxing
“MS-Extractor: An Innovative Approach to Extract Microsatellites on „Y‟ Chrom...
Importance of Measurements in Smart Grid
Study of Macro level Properties of SCC using GGBS and Lime stone powder

Recently uploaded (20)

PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Approach and Philosophy of On baking technology
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Empathic Computing: Creating Shared Understanding
PPT
Teaching material agriculture food technology
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
cuic standard and advanced reporting.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Machine learning based COVID-19 study performance prediction
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Spectral efficient network and resource selection model in 5G networks
Chapter 3 Spatial Domain Image Processing.pdf
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Advanced methodologies resolving dimensionality complications for autism neur...
Approach and Philosophy of On baking technology
Mobile App Security Testing_ A Comprehensive Guide.pdf
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Reach Out and Touch Someone: Haptics and Empathic Computing
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Review of recent advances in non-invasive hemoglobin estimation
Empathic Computing: Creating Shared Understanding
Teaching material agriculture food technology
Diabetes mellitus diagnosis method based random forest with bat algorithm
Agricultural_Statistics_at_a_Glance_2022_0.pdf
cuic standard and advanced reporting.pdf
NewMind AI Monthly Chronicles - July 2025
The AUB Centre for AI in Media Proposal.docx
Machine learning based COVID-19 study performance prediction

International Journal of Engineering Research and Development (IJERD)

  • 1. International Journal of Engineering Research and Development e-ISSN: 2278-067X, p-ISSN: 2278-800X, www.ijerd.com Volume 9, Issue 3 (December 2013), PP. 28-33 Concept and Term Based Similarity Measure for Text Classification and Clustering B.Sindhiya1, Dr.N.Tajunisha2 1 M.Phil Scholar Department Of Computer Science Sri Ramakrishna College Of Arts And Science For Women Coimbatore, India. 2 M.Phil, Ph.D Associate Professor Department Of Computer Science Sri Ramakrishna College Of Arts And Science For Women Coimbatore, India. Abstract:- The exploitation of syntactic structures and semantic background knowledge has always been an appealing subject in the context of data mining, text retrieval and information management. The usefulness of this kind of information has been shown most prominently in highly specialized tasks, such as text categorization scenarios. So far, however, additional syntactic or semantic information has been used only individually. In this paper, a new principle approach , the concept and term based similarity measure, which incorporates linguistic and semantic structures, using syntactic dependencies, and semantic background knowledge is proposed. This novel method represents the meaning of texts in a high-dimensional space of concepts derived from WordNet. A number of case studies have been included in the research to demonstrate the various aspects of this framework. Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms. I. INTRODUCTION Clustering maps the data items into clusters, where clusters are natural grouping of data items based on similarity or probability density methods. Unlike classification and prediction which analyzes class-label data objects, clustering analyzes data objects without class-labels and tries to generate such labels. A similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. 
The similarity measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics that are believed to distinguish the clusters embedded in the data. Before Clustering, a similarity/distance measure must be determined. [1]. Choosing an appropriate similarity measure is also crucial for cluster analysis, especially for a particular type of clustering algorithms. Text Categorization (TC) is the classification of documents with respect to a set of one or more preexisting categories [2]. The classification phase consists of generating a weighted vector for all categories, then using a similarity measure to find the closest category. The similarity measure is used to determine the degree of resemblance between two vectors. To achieve reasonable classification results, a similarity measure should generally respond with larger values to documents that belong to the same class and with smaller values otherwise. During the last decades, a large number of methods proposed for text categorization were typically based on the classical Bag-of-Words model where each term or term stem is an independent feature. The existing similarity measure was more frequently used to assess the similarity between words.Although the information theoretic similarity measure results are statistically significant it does not reduce the dimension of the vector model [3].Metric distances such as Euclidean distance are not appropriate for high dimension and sparse domains. Due to the ignorance of any relation between words, the learning algorithms are restricted to detect patterns in the used terminology only, while conceptual patterns remain ignored. Existing approaches requires performing an optimization over an entire collection of documents. Most of these techniques are computationally expensive. II. RELATED WORKS Similarity measures have been extensively used in text classification and clustering algorithms. 
The spherical k-means algorithm [4] adopted the cosine similarity measure for document clustering. This method is motivated by the fact that unlabeled document collections are becoming increasingly common and available. Using words as features, text documents are often represented as high-dimensional and sparse vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. Derrick Higgins [5] adopted a cosine-based pairwise-adaptive similarity measure for document clustering. The pairwise-adaptive similarity measure for large high-dimensional document datasets improves the unsupervised clustering quality and speed compared to the original cosine similarity measure.
  • 2. Zoulficar Younes [6] reported results of clustering experiments with clustering algorithms and 12 different text data sets, and concluded that the objective function based on cosine similarity “leads to the best solutions irrespective of the number of clusters for most of the data sets.” Daphne Koller [7] proposed a divisive information-theoretic feature clustering algorithm for text classification using the Kullback-Leibler divergence. High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Kullback [8] combined squared Euclidean distance with relative entropy in a k-means-like clustering algorithm. A recently introduced k-means algorithm is specifically designed to handle unit-length document vectors. Craven [9] performed document clustering based on a proposed phrase-based similarity measure. The phrase-based document similarity is used to compute the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model into a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme in computing the document similarity with phrases.

III. PROPOSED METHODOLOGY
In this section the proposed concept and term based similarity between documents is illustrated. The proposed system also measures the semantic similarity between terms and concepts using the WordNet and TreeTagger tools.
3.1 CSMTP (CONCEPT BASED SIMILARITY MEASURE FOR TEXT PROCESSING) ALGORITHM
The CSMTP algorithm takes the testing documents, generates the terms from each document, selects the appropriate features, and calculates the similarity measure based on each term and its respective concepts.

CSMTP ALGORITHM
1. Let D1 and D2 be the testing documents.
2. Let T1 and T2 be the terms from the documents D1 and D2.
3. Remove the stopwords ST1 and ST2 from the documents D1 and D2.
4. Let C1 and C2 be two concepts from T1 and T2 respectively (where T1 denotes the first thesaurus and T2 the second).
5. Compute the similarity measure between two concepts with
\[ SIM(c_i, c_k) = 1 - \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))} \]
where A and B are the sets of all articles that link to concepts c_i and c_k respectively, and W is the set of all articles. Here max(|A|, |B|) and min(|A|, |B|) are the larger and the smaller of the two set sizes, and |A ∩ B| is the number of articles that link to both concepts.
6. Compute the semantic relatedness between a term and its candidate concepts in a given document according to the context information:
\[ Rel(t, c_i \mid d_j) = \frac{1}{|T| - 1} \sum_{t_l \in T,\ t_l \neq t} \frac{1}{|cs_l|} \sum_{c_k \in cs_l} SIM(c_i, c_k) \]
where T is the term set of the jth document d_j, t_l is a term in d_j other than t, and cs_l is the candidate concept set related to term t_l.

3.2 SYNTACTIC REPRESENTATION
The tf-idf weighting scheme is used at the syntactic level to record the syntactic information. Tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
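Steps 5 and 6 of the CSMTP algorithm in Section 3.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the representation of link sets as Python sets of article ids, and the handling of the degenerate cases (empty intersection, zero denominator, empty candidate set) are all assumptions introduced here, since the paper does not specify them.

```python
import math


def concept_similarity(links_a, links_b, total_articles):
    """SIM(c_i, c_k) from step 5.

    links_a, links_b: sets of ids of the articles linking to each concept.
    total_articles: |W|, the number of all articles.
    """
    common = links_a & links_b
    if not common:
        # Assumption: log(0) is undefined, so concepts with no shared
        # in-links are treated as having zero similarity.
        return 0.0
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        # Assumption: every article links to both concepts.
        return 1.0
    numerator = math.log(larger) - math.log(len(common))
    return 1.0 - numerator / denominator


def relatedness(term, concept, doc_terms, candidates, links, total_articles):
    """Rel(t, c_i | d_j) from step 6.

    doc_terms: the term set T of the document.
    candidates: maps each term t_l to its candidate concept set cs_l.
    links: maps each concept to the set of article ids linking to it.
    """
    others = [t for t in set(doc_terms) if t != term]
    if not others:
        return 0.0
    total = 0.0
    for other in others:
        concept_set = candidates.get(other, [])
        if not concept_set:
            continue  # Assumption: a term without concepts contributes 0.
        inner = sum(
            concept_similarity(links[concept], links[c_k], total_articles)
            for c_k in concept_set
        )
        total += inner / len(concept_set)
    return total / len(others)
```

With this sketch, a candidate concept that shares many in-links with the concepts of the surrounding terms receives a higher relatedness score than one that shares none, which is the disambiguation behaviour step 6 describes.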
Tf–idf is the product of two statistics, term frequency and inverse document frequency. Various ways for determining the exact values of both statistics exist. In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple tf scheme is tf(t,d) = f(t,d).
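The raw-frequency tf scheme described above can be sketched as follows. The paper does not state which idf variant it uses, so the logarithmic form idf(t) = log(N / df(t)) below is an assumption, as are the function names and the representation of documents as lists of tokens.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Raw term frequency f(t, d): how often the term occurs in the document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """Inverse document frequency, assumed here to be log(N / df(t))."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        return 0.0  # Assumption: unseen terms get zero weight.
    return math.log(len(corpus) / doc_freq)


def tfidf(term, doc_tokens, corpus):
    """Tf-idf weight as the product of the two statistics."""
    return tf(term, doc_tokens) * idf(term, corpus)


corpus = [
    ["text", "mining", "text"],
    ["text", "clustering"],
    ["image", "clustering"],
]
weight = tfidf("text", corpus[0], corpus)
```

In this toy corpus "text" occurs twice in the first document but also appears in two of the three documents, so its weight is damped relative to a rarer term such as "mining".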
  • 3. 3.3 SEMANTIC SIMILARITY
The semantic level consists of concepts related to the terms in the syntactic level. These two levels are connected via the semantic correlation between terms and their relevant concepts. WordNet is used to ascertain connections among four types of parts of speech (POS): noun, verb, adjective, and adverb. The minimum unit in WordNet is the synset, which represents an exact meaning of a word. It includes the word, its explanation, and its synonyms.

IV. EXPERIMENTAL RESULTS
In this section, the effectiveness of the proposed similarity measure CSMTP is investigated. The investigation is done by applying the CSMTP measure in several text applications, including k-NN based single-label classification (SL-kNN), k-NN based multi-label classification (ML-kNN), k-means clustering (k-means), and hierarchical agglomerative clustering (HAC). Two data sets, WebKB and Reuters-8, are used in the experiments presented below.
1) WebKB. The documents in the WebKB data set are webpages collected by the World Wide Knowledge Base (Web→KB) project of the CMU text learning group. The documents were manually classified into several different classes. The documents of this data set were not predesignated as training or testing patterns, so the data set can be randomly divided into training and testing subsets.
2) Reuters-8. The Reuters-21578 ModApté Split Text Categorization Test Collection contains thousands of documents collected from the Reuters newswire in 1987. The most widely used version is Reuters-21578 ModApté, which contains 90 categories and 12902 documents.

4.1 CLASSIFICATION PERFORMANCE
For the WebKB dataset, the randomly selected training documents are used for training/validation and the testing documents are used for testing.
For the Reuters-8 dataset, the predesignated training data are used for training/validation and the predesignated testing data are used for testing. Note that the data for training/validation are separate from the data for testing in each case.

4.1.1 Single-Label Document Classification
In this experiment, we compare the performance of our measure and the others in single-label document classification. The performance is evaluated by the classification accuracy, AC, which compares the predicted label of each document with that provided by the document corpus:
\[ AC = \frac{\sum_{i=1}^{n} E(c_i, c_i')}{n} \]
where n is the number of testing documents, and c_i and c_i' are the target label and the predicted label, respectively, of the ith document. E(c_i, c_i') = 1 if c_i = c_i', and E(c_i, c_i') = 0 otherwise.
Figure 4.1 shows the classification accuracy obtained by SL-kNN with the SMTP and CSMTP similarity measures using different k settings, i.e., k = 4, 8, 12, 16, 20, on the training/validation data of WebKB. The figure shows that the single-label classification accuracy obtained using the proposed CSMTP measure is higher than that obtained using the SMTP measure.
Figure 4.1 Classification accuracy by SL-kNN on training/validation data with SMTP and CSMTP using different k settings.
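The classification accuracy AC defined in Section 4.1.1 can be computed with a few lines of Python. This is a small sketch; the function name and the example labels are hypothetical and are not taken from the experiments.

```python
def classification_accuracy(target_labels, predicted_labels):
    """AC: fraction of documents whose predicted label equals the target label."""
    if len(target_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not target_labels:
        raise ValueError("at least one document is required")
    # E(c_i, c_i') is 1 when the labels match and 0 otherwise.
    matches = sum(
        1 for target, predicted in zip(target_labels, predicted_labels)
        if target == predicted
    )
    return matches / len(target_labels)


# Hypothetical labels for four test documents.
targets = ["course", "faculty", "student", "project"]
predictions = ["course", "student", "student", "project"]
accuracy = classification_accuracy(targets, predictions)
```

Three of the four hypothetical predictions match their targets, so the accuracy here is 0.75.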
  • 4. Table 4.1 shows the classification accuracy (AC) obtained by single-label classification on the testing data of WebKB.

Table 4.1 Classification accuracies by SL-kNN with different measures on testing data of WebKB
         K=1      K=3      K=5      K=7      K=9
SMTP     0.9013   0.9191   0.9242   0.9223   0.9233
CSMTP    0.9338   0.9411   0.9420   0.9447   0.9461

4.1.2 MULTI LABEL CLASSIFICATION
Figure 4.2 shows the classification accuracy obtained by ML-kNN with the SMTP and CSMTP similarity measures using different k settings, i.e., k = 4, 8, 12, 16, 20, on the training/validation data of WebKB. Figure 4.2 shows that the multi-label classification accuracy obtained using the proposed CSMTP measure is higher than that obtained using the SMTP measure.
Figure 4.2 Classification accuracy by ML-kNN on training/validation data with SMTP and CSMTP using different k settings.
Table 4.2 shows the classification accuracy (AC) obtained by multi-label classification on the testing data of WebKB.

Table 4.2 Classification accuracies by ML-kNN with different measures on testing data of WebKB
         K=1      K=3      K=5      K=7      K=9
SMTP     0.6910   0.6932   0.6965   0.6990   0.7009
CSMTP    0.7130   0.7111   0.7114   0.7092   0.7083

4.2 CLUSTERING PERFORMANCE
For a document corpus with p classes and n documents, the class labels are removed and one-third of the documents are randomly selected for training/validation, with the remainder used for testing. Note that the data for training/validation are separate from the data for testing.

4.2.1 KMEANS CLUSTERING
In this experiment, the performance of the CSMTP measure in clustering is compared with that of the SMTP measure.
The performance is evaluated by the clustering accuracy, AC, which compares the predicted label of each document with that provided by the document corpus:
\[ AC = \frac{\sum_{i=1}^{K} most_i}{n} \]
Figure 4.3 shows the clustering accuracy obtained by k-means with the SMTP and CSMTP similarity measures using different k (cluster) settings, i.e., k = 4, 8, 12, 16, 20, on the training/validation data of WebKB. The figure shows that the k-means clustering accuracy obtained using the proposed CSMTP measure is higher than that obtained using the SMTP measure.
Figure 4.3 Clustering accuracy by k-means on training/validation data with SMTP and CSMTP using different k settings.
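The clustering accuracy formula in Section 4.2.1 does not define most_i in the text. The sketch below assumes that most_i is the number of documents in cluster i that carry the most frequent class label of that cluster, which is a common reading of this kind of measure; the function name and the toy data are likewise hypothetical.

```python
from collections import Counter, defaultdict


def clustering_accuracy(cluster_ids, class_labels):
    """AC = (sum over clusters of most_i) / n.

    Assumption: most_i is the count of the majority class label in cluster i.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")
    if not class_labels:
        raise ValueError("at least one document is required")
    # Group the true class labels by the cluster each document was assigned to.
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, class_labels):
        members[cluster_id].append(label)
    # For each cluster, count how many documents carry its majority label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(class_labels)


# Six hypothetical documents assigned to three clusters.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["a", "a", "b", "b", "b", "c"]
ac = clustering_accuracy(clusters, labels)
```

In the toy data the majority counts are 2, 2 and 1, so five of the six documents agree with the majority label of their cluster.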
  • 5. Table 4.3 shows the clustering accuracy (AC) obtained by k-means clustering on the testing data of WebKB.

Table 4.3 AC values by k-means with different measures on testing data of WebKB
         K=8      K=16     K=24     K=32
SMTP     0.7906   0.8584   0.8692   0.8728
CSMTP    0.8450   0.8702   0.8796   0.8964

4.2.2 HIERARCHICAL AGGLOMERATIVE DOCUMENT CLUSTERING PERFORMANCE
Figure 4.4 shows the clustering accuracy obtained by hierarchical clustering with the SMTP and CSMTP similarity measures using different k (cluster) settings, i.e., k = 4, 8, 12, 16, 20, on the training/validation data of WebKB. The figure shows that the hierarchical agglomerative clustering accuracy obtained using the proposed CSMTP measure is higher than that obtained using the SMTP measure.
Figure 4.4 Clustering accuracy by HAC on training/validation data with SMTP and CSMTP using different k settings.
Table 4.4 shows the clustering accuracy (AC) obtained by HAC on the testing data of WebKB.

Table 4.4 Clustering accuracies by HAC with different measures on testing data of WebKB
         K=1      K=3      K=5      K=7      K=9
SMTP     0.5960   0.5483   0.5367   0.5091   0.5000
CSMTP    0.6770   0.6343   0.5552   0.5244   0.5200

V. CONCLUSION
The concept and term based model represents a document as a two-way model with the aid of WordNet. In the two-way representation model, the term information is represented first and the concept information second, and these levels are connected by the semantic relatedness between terms and concepts.
Experimental results on real data sets have shown that the proposed model and classification framework significantly improved the classification and clustering performance compared with the existing SMTP (similarity measure for text processing) model. The experiments show that CSMTP (concept and term based similarity measure for text processing) takes less time when running in parallel and less space when running in series, and that its categorization accuracy is high.
  • 6. VI. FUTURE WORK
In future, the work can focus on concept mapping and weighting technology to find a better concept vector space for documents, because a better concept-based representation can help to further improve the performance of the text classification and clustering framework. A new semantic-based vector space model utilizing the category information can also be exploited. Afterwards, the two-way representation model can be extended to a three-way model containing term, concept and category information respectively. CTSMTP will also be improved to fit the three-way model and achieve better text classification and clustering performance.

REFERENCES
[1]. Anna Huang. Similarity Measures for Text Document Clustering. 2008.
[2]. F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[3]. S. Clinchant and E. Gaussier. Information-Based Models for Ad Hoc IR. Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 234–241, 2010.
[4]. Derrick Higgins and Jill Burstein. Sentence Similarity Measures for Essay Coherence. 2007.
[5]. Donald Metzler, Susan Dumais, and Christopher Meek. Similarity Measures for Short Segments of Text. 2007.
[6]. Zoulficar Younes, Fahed Abdallah, and Thierry Denoeux. Multi-Label Classification Algorithm Derived from K-Nearest Neighbor Rule with Label Dependencies. 2003.
[7]. Daphne Koller and Mehran Sahami. Hierarchically Classifying Documents Using Very Few Words. Proceedings of the 14th International Conference on Machine Learning (ML), Nashville, Tennessee, July 1997, pages 170–178.
[8]. S. Kullback and R. A. Leibler. On Information and Sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
[9]. H. Chim and X. Deng. Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data Engineering, 20(9):1217–1229, 2008.
[10]. http://web.ist.utl.pt/~acardoso/datasets/
[11]. http://www.cs.technion.ac.il/~ronb/thesis.html
[12]. http://www.daviddlewis.com/resources/testcollections/reuters21578/
[13]. http://www.dmoz.org/