Retrieval and clustering of documents
Measuring similarity for retrieval Given a set of documents and a query, a similarity measure determines, for retrieval, how relevant each document is to the query, so that the most relevant documents can be returned.
Cosine similarity for retrieval Cosine similarity is a measure of similarity between two vectors of n dimensions obtained by taking the cosine of the angle between them, often used to compare documents in text mining. Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed using a dot product and magnitudes as Similarity = cos(θ) = (A · B) / (||A|| ||B||)
Cosine similarity for retrieval For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating orthogonality (no terms in common) and in-between values indicating intermediate similarity or dissimilarity.
Cosine similarity for retrieval In the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies (and tf-idf weights) cannot be negative; the angle between two term frequency vectors therefore cannot be greater than 90°.
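As a concrete illustration, the formula above can be sketched in Python over raw term-frequency vectors. The whitespace tokenization and the function name are assumptions of this sketch, not part of the slides:

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two documents."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    # Dot product only over the terms the two documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is not similar to anything
    return dot / (norm_a * norm_b)

print(cosine_similarity("data mining of text", "text mining tools"))
```

Because the counts are non-negative, the result falls in [0, 1], as the slide notes; identical documents score 1, documents with no shared terms score 0.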
Web-based document search and link analysis Link analysis has been used successfully for deciding which web pages to add to the collection of documents and how to order the documents matching a user query (i.e., how to rank pages). It has also been used to categorize web pages, to find pages related to given pages, to find duplicated web sites, and for various other problems in web information retrieval.
Link Analysis A link from page A to page B is a recommendation of page B by the author of page A. If page A and page B are connected by a link, the probability that they are on the same topic is higher than if they are not connected.
Applications Ranking query results (PageRank), crawling, finding related pages, computing web page reputations, determining geographic scope, prediction, categorizing web pages, and computing statistics of web pages and of search engines.
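Since PageRank is only named above, here is a minimal sketch of the power-iteration idea behind it on a toy link graph. The function name, damping factor, and dangling-page handling are illustrative assumptions, not a production ranker:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if outs:
                # Each page shares its rank equally among its out-links.
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

Here page C, which is linked from both A and B, ends up with a higher rank than B, which is linked only from A; the ranks always sum to 1.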
Document matching Document matching is the matching of a stated user query against a set of free-text records. These records could be any type of mostly unstructured text, such as newspaper articles, real estate records, or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.
Steps involved in document matching A document matching system has two main tasks: find documents relevant to the user query, then evaluate the matching results and sort them according to relevance, using algorithms such as PageRank.
k-means clustering Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k < n), S = {S1, S2, …, Sk}, so as to minimize the within-cluster sum of squares: arg min over S of Σ_{i=1..k} Σ_{x ∈ Si} ||x − μi||², where μi is the mean of the points in Si.
K-Means algorithm
0. Input: D := {d1, d2, …, dn}; k := the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d among the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that cluster's centroid
6. Until the centroids don't change
7. Output: k clusters of documents
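The loop above can be sketched in Python. This is a batch variant (assign every document, then recompute all centroids) using squared Euclidean distance on dense vectors rather than the similarities of the slide; the function and helper names are assumptions of the sketch:

```python
import random

def kmeans(docs, k, iterations=20, seed=0):
    """Batch k-means: repeatedly assign points to clusters, then recompute centroids."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(docs, k)]
    labels = [0] * len(docs)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iterations):
        # Assignment step: put each document in the closest cluster.
        for i, d in enumerate(docs):
            labels[i] = min(range(k), key=lambda c: sqdist(d, centroids[c]))
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [docs[i] for i in range(len(docs)) if labels[i] == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, _ = kmeans(docs, k=2)
```

On this toy input, the two documents near (1, 0) end up in one cluster and the two near (0, 1) in the other, regardless of which vectors were drawn as initial centroids.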
Pros and Cons
Advantages:
- linear time complexity
- works relatively well in low-dimensional spaces
Drawbacks:
- distance computation is expensive in high-dimensional spaces
- the centroid vector may not summarize the cluster's documents well
- the initial k clusters affect the quality of the final clusters
Hierarchical clustering
0. Input: D := {d1, d2, …, dn}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: a dendrogram of clusters
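A minimal version of this bottom-up loop, on 1-D points for readability; the single-link distance (closest pair between two clusters) and the function name are assumptions of the sketch, since the slide does not fix a linkage criterion:

```python
def agglomerative(points, num_clusters):
    """Bottom-up single-link clustering of 1-D points until num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two most similar clusters into one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 2)
```

Recording the order of merges (instead of stopping at a target count) yields the dendrogram the slide mentions; rescanning all pairwise distances each round is what gives the algorithm its quadratic-and-worse cost.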
Pros and Cons
Advantages:
- produces better-quality clusters
- works relatively well in low-dimensional spaces
Drawbacks:
- distance computation is expensive in high-dimensional spaces
- quadratic time complexity
The EM algorithm for clustering Let the analyzed object be described by two random variables, an observed variable X and a hidden variable Y, which are assumed to have a joint probability distribution function f(x, y; θ).
The EM algorithm for clustering The distribution is known up to its parameter(s) θ. It is assumed that we are given a set of samples x1, …, xn independently drawn from the distribution.
The EM algorithm for clustering The Expectation-Maximization (EM) algorithm is an optimization procedure which computes the Maximum-Likelihood (ML) estimate of the unknown parameter θ when only incomplete data are presented (the hidden variable is unknown). In other words, the EM algorithm maximizes the likelihood function of the observed data over θ.
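For concreteness, here is a small EM loop for a two-component 1-D Gaussian mixture, a common clustering instance of the general scheme above in which the hidden variable is the component that generated each point. The sorted-split initialization, component count, and variance floor are assumptions of this sketch:

```python
import math

def em_gmm_1d(data, iterations=100):
    """EM for a 2-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    data = sorted(data)
    half = len(data) // 2
    # Crude initialization: one component per half of the sorted data.
    mu = [sum(data[:half]) / half, sum(data[half:]) / (len(data) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # keep the variance away from zero
    return w, mu, var

weights, means, variances = em_gmm_1d([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
```

On this toy data the estimated means settle near 0 and 5, and the responsibilities give a soft clustering of the points into the two components.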
Evaluation of clustering What is a good clustering? Internal criterion: a good clustering will produce high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. The measured quality of a clustering depends on both the document representation and the similarity measure used.
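The internal criterion can be made concrete by comparing average within-cluster and between-cluster distances (distance being the inverse notion of similarity). The function below is an illustrative sketch on 1-D points, assuming at least two clusters:

```python
def intra_inter_distance(clusters):
    """Average within-cluster and between-cluster distances for clusters of 1-D points."""
    intra, inter = [], []
    for ci, cluster in enumerate(clusters):
        for i, a in enumerate(cluster):
            # Pairs inside the same cluster.
            for b in cluster[i + 1:]:
                intra.append(abs(a - b))
            # Pairs across different clusters (each pair counted once).
            for cj in range(ci + 1, len(clusters)):
                for b in clusters[cj]:
                    inter.append(abs(a - b))
    return sum(intra) / len(intra), sum(inter) / len(inter)

good = [[1.0, 1.1], [9.0, 9.2]]
intra, inter = intra_inter_distance(good)
```

For a good clustering the first number is much smaller than the second, i.e. intra-cluster similarity is high and inter-cluster similarity is low.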
Conclusion In this presentation we learned about: measuring similarity for retrieval, web-based document search and link analysis, document matching, clustering by similarity (k-means), hierarchical clustering, the EM algorithm for clustering, and evaluation of clustering.
Visit more self-help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free and self-guiding and does not involve any additional support. Visit us at www.dataminingtools.net

