GLOBALSOFT TECHNOLOGIES 
IEEE PROJECTS & SOFTWARE DEVELOPMENTS 
IEEE FINAL YEAR PROJECTS|IEEE ENGINEERING PROJECTS|IEEE STUDENTS PROJECTS|IEEE 
BULK PROJECTS|BE/BTECH/ME/MTECH/MS/MCA PROJECTS|CSE/IT/ECE/EEE PROJECTS 
CELL: +91 98495 39085, +91 99662 35788, +91 98495 57908, +91 97014 40401 
Visit: www.finalyearprojects.org Mail to:ieeefinalsemprojects@gmail.com 
A Similarity Measure for Text Classification and 
Clustering 
Abstract: 
Measuring the similarity between documents is an important operation in the text 
processing field. In this paper, a new similarity measure is proposed. To compute 
the similarity between two documents with respect to a feature, the proposed 
measure takes the following three cases into account: a) the feature appears in 
both documents, b) the feature appears in only one document, and c) the feature 
appears in neither document. For the first case, the similarity increases as the 
difference between the two involved feature values decreases. Furthermore, the 
contribution of the difference is normally scaled. For the second case, a fixed value 
is contributed to the similarity. For the last case, the feature has no contribution to 
the similarity. The proposed measure is extended to gauge the similarity between 
two sets of documents. The effectiveness of our measure is evaluated on several 
real-world data sets for text classification and clustering problems. The results 
show that the performance obtained by the proposed measure is better than that 
achieved by other measures. 
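The three cases in the abstract can be sketched in Java as follows. This is a minimal illustration, not the paper's exact formula, which the abstract does not state. The Gaussian form of the "normally scaled" contribution, the per-feature scale `sigma`, the penalty `lambda` used as the fixed value for case b, the final normalization to [0, 1], and the names `SimilaritySketch` and `similarity` are all assumptions made for this example.

```java
class SimilaritySketch {
    /**
     * Similarity between two documents given as feature vectors of equal length
     * (for example term weights, where 0 means the feature is absent).
     * sigma[j] is an assumed positive scale for feature j (for example its
     * standard deviation over the corpus); lambda is an assumed fixed penalty.
     */
    static double similarity(double[] d1, double[] d2, double[] sigma, double lambda) {
        double score = 0.0;
        int present = 0; // features that appear in at least one document
        for (int j = 0; j < d1.length; j++) {
            boolean in1 = d1[j] > 0;
            boolean in2 = d2[j] > 0;
            if (in1 && in2) {
                // Case a: contribution grows as the two values get closer,
                // scaled by a Gaussian of the difference (assumed form).
                double z = (d1[j] - d2[j]) / sigma[j];
                score += 0.5 * (1.0 + Math.exp(-z * z));
                present++;
            } else if (in1 || in2) {
                // Case b: a fixed value is contributed (here a penalty).
                score -= lambda;
                present++;
            }
            // Case c: absent from both documents, no contribution.
        }
        if (present == 0) {
            return 0.0; // convention chosen here for two empty documents
        }
        double f = score / present;           // lies in [-lambda, 1]
        return (f + lambda) / (1.0 + lambda); // rescaled to [0, 1]
    }

    public static void main(String[] args) {
        double[] sigma = {1.0, 1.0, 1.0};
        double[] a = {1.0, 0.0, 2.0};
        double[] b = {1.5, 0.0, 2.0};
        double[] c = {0.0, 3.0, 0.0};
        System.out.println("a vs a: " + similarity(a, a, sigma, 1.0));
        System.out.println("a vs b: " + similarity(a, b, sigma, 1.0));
        System.out.println("a vs c: " + similarity(a, c, sigma, 1.0));
    }
}
```

Under these assumptions, identical documents score 1, documents that share no feature score 0, and a feature absent from both documents leaves the score unchanged.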
Existing System: 
• Clustering is one of the most interesting and important topics in data mining. 
The aim of clustering is to find intrinsic structures in data, and organize 
them into meaningful subgroups for further study and analysis.
• Existing systems greedily pick the next frequent item set, which represents 
the next cluster, so as to minimize the overlap between the documents that 
contain both that item set and some of the remaining item sets. 
• In other words, the clustering result depends on the order in which the 
item sets are picked, which in turn depends on the greedy heuristic. This method does 
not follow a sequential order of selecting clusters. 
DISADVANTAGES: 
• It does not yield the same result on each run, since 
the resulting clusters depend on the initial random assignments. 
• It minimizes intra-cluster variance, but does not ensure that the result is a 
global minimum of variance. 
• It has the same problems as k-means: the minimum found is a local minimum, 
and the results depend on the initial choice of weights. 
• The expectation-maximization algorithm is a more statistically formalized 
method which includes some of these ideas: partial membership in classes. 
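The dependence on initialization noted above can be seen in a small Java sketch of Lloyd's k-means on one-dimensional data. The data points, the starting centroids, and the names `KMeansInitDemo`, `lloyd`, `nearest`, and `sse` are hypothetical and chosen only for illustration; they are not taken from the project.

```java
class KMeansInitDemo {
    /** Runs Lloyd's algorithm on 1-D points from the given initial centroids. */
    static double[] lloyd(double[] points, double[] init, int maxIter) {
        double[] c = init.clone();
        for (int it = 0; it < maxIter; it++) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            // Assignment step: each point joins its nearest centroid.
            for (double p : points) {
                int k = nearest(c, p);
                sum[k] += p;
                count[k]++;
            }
            // Update step: each centroid moves to the mean of its points.
            boolean changed = false;
            for (int k = 0; k < c.length; k++) {
                if (count[k] == 0) continue; // an empty cluster keeps its centroid
                double updated = sum[k] / count[k];
                if (updated != c[k]) {
                    c[k] = updated;
                    changed = true;
                }
            }
            if (!changed) break; // converged
        }
        return c;
    }

    /** Index of the centroid closest to p (the first one wins a tie). */
    static int nearest(double[] c, double p) {
        int best = 0;
        for (int k = 1; k < c.length; k++) {
            if (Math.abs(p - c[k]) < Math.abs(p - c[best])) best = k;
        }
        return best;
    }

    /** Sum of squared distances from each point to its nearest centroid. */
    static double sse(double[] points, double[] c) {
        double total = 0.0;
        for (double p : points) {
            double d = p - c[nearest(c, p)];
            total += d * d;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] points = {0, 1, 10, 11, 20, 21};
        double[] spread = lloyd(points, new double[]{0, 10, 20}, 100);
        double[] crowded = lloyd(points, new double[]{0, 1, 10}, 100);
        System.out.println("spread start, SSE = " + sse(points, spread));
        System.out.println("crowded start, SSE = " + sse(points, crowded));
    }
}
```

With the starting centroids spread over the three groups, the run ends with a sum of squared errors of 1.5. With two of the three starting centroids placed in the first group, it converges to a worse local minimum with a sum of squared errors of 101.0.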
Proposed System: 
• The main work is to develop a novel hierarchical algorithm for document 
clustering which provides maximum efficiency and performance. It proposes a 
novel way to evaluate similarity between documents, and consequently 
formulates new criterion functions for document clustering. 
• Assume that the majority. The purpose of this test is to check how closely a 
similarity measure coincides with the true class labels. 
• It focuses in particular on studying and making use of the cluster-overlapping 
phenomenon to design cluster-merging criteria.
• Experiments on both public data and document clustering data show that this 
approach can improve the efficiency of clustering and save computing time. 
System Requirements: 
Software Requirements: 
• Windows XP/Windows 2000 
• Java Runtime Environment 1.5 or higher 
• NetBeans 
• MySQL Server 
Hardware Requirements: 
• Pentium IV processor, 2.80 GHz or higher 
• 512 MB RAM 
• 2 GB HDD 
• 15” Monitor


IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classification and clustering
