GLOBALSOFT TECHNOLOGIES 
IEEE PROJECTS & SOFTWARE DEVELOPMENTS 
IEEE FINAL YEAR PROJECTS|IEEE ENGINEERING PROJECTS|IEEE STUDENTS PROJECTS|IEEE 
BULK PROJECTS|BE/BTECH/ME/MTECH/MS/MCA PROJECTS|CSE/IT/ECE/EEE PROJECTS 
CELL: +91 98495 39085, +91 99662 35788, +91 98495 57908, +91 97014 40401 
Visit: www.finalyearprojects.org Mail to:ieeefinalsemprojects@gmail.com 
A Similarity Measure for Text Classification and 
Clustering 
Abstract: 
Measuring the similarity between documents is an important operation in the text 
processing field. In this paper, a new similarity measure is proposed. To compute 
the similarity between two documents with respect to a feature, the proposed 
measure takes the following three cases into account: a) The feature appears in 
both documents, b) the feature appears in only one document, and c) the feature 
appears in none of the documents. For the first case, the similarity increases as the 
difference between the two involved feature values decreases. Furthermore, the 
contribution of the difference is normally scaled. For the second case, a fixed value 
is contributed to the similarity. For the last case, the feature has no contribution to 
the similarity. The proposed measure is extended to gauge the similarity between 
two sets of documents. The effectiveness of our measure is evaluated on several 
real-world data sets for text classification and clustering problems. The results 
show that the performance obtained by the proposed measure is better than that 
achieved by other measures. 
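The three cases described in the abstract can be sketched in Java as follows. This synopsis does not give the formula, so the details here are assumptions made for illustration: the Gaussian form used for the "normally scaled" difference, the per-feature scale `sigma`, the penalty `lambda` used as the fixed value for case (b), and the final rescaling to the range [0, 1]. The class and method names are hypothetical.

```java
final class SimilaritySketch {

    // Contribution of one feature, following the three cases in the abstract.
    // sigma (assumed > 0) and lambda (assumed >= 0) are illustrative parameters.
    static double featureScore(double x, double y, double sigma, double lambda) {
        if (x != 0.0 && y != 0.0) {
            // Case (a): present in both documents. The score grows as the
            // difference shrinks; the Gaussian scaling is an assumption.
            double z = (x - y) / sigma;
            return 0.5 * (1.0 + Math.exp(-z * z));
        }
        if (x == 0.0 && y == 0.0) {
            // Case (c): absent from both documents, no contribution.
            return 0.0;
        }
        // Case (b): present in only one document, a fixed value is contributed.
        return -lambda;
    }

    // Averages the per-feature scores over features present in at least one
    // document, then rescales so the result lies in [0, 1] (assumed design).
    static double similarity(double[] d1, double[] d2, double[] sigma, double lambda) {
        double total = 0.0;
        int present = 0;
        for (int j = 0; j < d1.length; j++) {
            if (d1[j] == 0.0 && d2[j] == 0.0) {
                continue; // case (c) is skipped entirely
            }
            total += featureScore(d1[j], d2[j], sigma[j], lambda);
            present++;
        }
        if (present == 0) {
            return 0.0; // assumption: two empty documents get similarity 0
        }
        double f = total / present;
        return (f + lambda) / (1.0 + lambda);
    }

    public static void main(String[] args) {
        double[] sigma = {1.0, 1.0, 1.0};
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {1.0, 0.0, 0.0};
        double[] c = {1.0, 2.0, 0.0};
        System.out.println(similarity(a, a, sigma, 1.0)); // prints 1.0
        System.out.println(similarity(b, c, sigma, 1.0)); // prints 0.5
    }
}
```

In this sketch, identical documents score 1, and documents that share no features score 0, because every feature falls under case (b).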
Existing System: 
• Clustering is one of the most interesting and important topics in data mining. 
The aim of clustering is to find intrinsic structures in data, and organize 
them into meaningful subgroups for further study and analysis.
• Existing systems greedily pick the next frequent item set, which represents 
the next cluster, so as to minimize the overlap between the documents that 
contain both that item set and some of the remaining item sets. 
• In other words, the clustering result depends on the order in which the 
item sets are picked, which in turn depends on the greedy heuristic. This 
method does not follow a sequential order of selecting clusters. 
DISADVANTAGES: 
• The result is not the same from run to run, since the resulting clusters 
depend on the initial random assignments. 
• It minimizes intra-cluster variance, but does not ensure that the result has a 
global minimum of variance. 
• It also has the same problems as k-means: the minimum is a local minimum, 
and the results depend on the initial choice of weights. 
• The expectation-maximization algorithm is a more statistically formalized 
method which includes some of these ideas, such as partial membership in classes. 
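The dependence on the initial assignment and the local-minimum problem listed above can be illustrated with a small Java sketch of the standard k-means (Lloyd) iteration. The data set, the starting centroids, and the class and method names are hypothetical and chosen only so that two different starts settle on two different results.

```java
final class KMeansInitSketch {

    // Runs a fixed number of Lloyd iterations from the given starting centroids.
    static double[][] lloyd(double[][] points, double[][] init, int iterations) {
        int k = init.length;
        int dim = init[0].length;
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) {
            centroids[i] = init[i].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            // Assignment step: each point goes to its nearest centroid.
            for (double[] p : points) {
                int best = nearest(p, centroids);
                count[best]++;
                for (int d = 0; d < dim; d++) {
                    sum[best][d] += p[d];
                }
            }
            // Update step: each centroid moves to the mean of its points.
            for (int i = 0; i < k; i++) {
                if (count[i] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[i][d] = sum[i][d] / count[i];
                }
            }
        }
        return centroids;
    }

    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (sqDist(p, centroids[i]) < sqDist(p, centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) {
            s += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return s;
    }

    // Sum of squared distances from each point to its nearest centroid.
    static double sse(double[][] points, double[][] centroids) {
        double total = 0.0;
        for (double[] p : points) {
            total += sqDist(p, centroids[nearest(p, centroids)]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Four corners of a wide rectangle.
        double[][] points = {{0, 0}, {0, 1}, {4, 0}, {4, 1}};
        double[][] startA = {{2, 0}, {2, 1}}; // splits top from bottom
        double[][] startB = {{0, 0}, {4, 0}}; // splits left from right
        System.out.println(sse(points, lloyd(points, startA, 10))); // prints 16.0
        System.out.println(sse(points, lloyd(points, startB, 10))); // prints 1.0
    }
}
```

Both runs converge, but the first is stuck in a local minimum with a much larger within-cluster variance than the second.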
Proposed System: 
• The main work is to develop a novel hierarchical algorithm for document 
clustering which provides maximum efficiency and performance. It proposes a 
novel way to evaluate similarity between documents, and consequently 
formulates new criterion functions for document clustering. 
• Assume that the majority. The purpose of this test is to check how much a 
similarity measure coincides with the true class labels. 
• It focuses in particular on studying and making use of the cluster 
overlapping phenomenon to design cluster merging criteria.
• Experiments on both public data and document clustering data show that this 
approach can improve the efficiency of clustering and save computing time. 
System Requirements: 
Software Requirements: 
• Windows XP/Windows 2000 
• Java Runtime Environment 1.5 or higher 
• NetBeans 
• MySQL Server 
Hardware requirements: 
• Pentium IV processor, 2.80 GHz or higher 
• 512 MB RAM 
• 2 GB HDD 
• 15” Monitor
