UNIT 3-ALGORITHM FOR MINING CLUSTER
AND ASSOCIATION PATTERNS
3.1 Hierarchical clustering
3.2 K-means Clustering and density-based
Clustering
3.3 Self-Organizing Map
3.4 Probability Distributions of Univariate Data
3.5. Association Rules
3.6. Bayesian Network
A cluster is a grouping or gathering of similar items or
entities. This implies a degree of proximity or
closeness among the elements within the group.
"Association Patterns" generally refers to the
discovery of relationships or dependencies between
items or variables within a dataset.
The clustering methods can be classified into the
following categories:
1. Partitioning Method
2. Hierarchical Method
3. Density-based Method
4. Grid-Based Method
5. Model-Based Method
6. Constraint-based Method
Partitioning Method: It is used to make partitions on the data in
order to form clusters. If “n” partitions are made on “p” objects of
the database, then each partition is represented by a cluster, with n
≤ p. The two conditions that must be satisfied by this
Partitioning Clustering Method are:
1. Each object must belong to exactly one group.
2. Each group must contain at least one object.
In the partitioning method, there is one technique called iterative
relocation, which means an object is moved from one group
to another to improve the partitioning.
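To make iterative relocation concrete, here is a minimal Python sketch of one relocation pass (the toy data, the two-cluster setup, and all names are illustrative assumptions, not part of the slides):

import numpy as np

def relocation_pass(X, centroids):
    # One iterative-relocation pass: reassign every object to its nearest
    # centroid, then recompute each centroid as the mean of its group.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # every object belongs to exactly one group
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))  # keep the old centroid if a group empties
    ])
    return labels, new_centroids

# Toy data: six 2-D objects, two initial partitions seeded from the data
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = relocation_pass(X, X[[0, 3]])
print(labels, centroids)

Repeating the pass until the assignments stop changing gives the usual iterative-relocation (k-means style) partitioning.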
Video Clips to review on Partitioning Method
https://www.youtube.com/watch?v=ktqRiYLEbg8
Hierarchical clustering
Hierarchical clustering is a powerful unsupervised machine learning
algorithm that groups data points into a hierarchy of clusters. Hierarchical
clustering creates a tree-like structure, known as a dendrogram, that
represents the nested relationships between clusters.
It is a method of cluster analysis in data mining that creates a
hierarchical representation of the clusters in a dataset. The method starts
by treating each data point as a separate cluster and then iteratively
combines the closest clusters until a stopping criterion is reached. A
dendrogram illustrates the hierarchical relationships among the clusters.
Core Concepts
● Hierarchy:
○ The fundamental characteristic of hierarchical clustering is that it builds a
hierarchy of clusters. Clusters can contain sub-clusters, and so on.
● Dendrogram:
○ A dendrogram is a tree diagram that visually represents the hierarchy of
clusters. The vertical axis of a dendrogram represents the distance or
dissimilarity between clusters.
There are two main approaches to hierarchical clustering:
1. Agglomerative (Bottom-up):
■ Starts with each data point as its own cluster.
■ Repeatedly merges the closest pairs of clusters until all data points
belong to a single cluster.
■ Each data point therefore starts as its own cluster, and at every
step the nearest pair of clusters is merged.
2. Divisive (Top-down):
■ Starts with all data points in a single cluster.
■ Repeatedly splits clusters into smaller clusters until
each data point is in its own cluster.
The algorithm for Agglomerative Hierarchical Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters
(compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the merged clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.
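As an illustration of these steps, here is a minimal sketch using SciPy's agglomerative hierarchical clustering routines (the six 2-D points and their coordinates are assumed for the example, not taken from the slides):

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist
import numpy as np

# Six illustrative 2-D points labelled A..F
points = np.array([[1, 1], [1.5, 1.2], [1.4, 1.0], [5, 5], [5.2, 4.8], [9, 9]])
labels = list("ABCDEF")

proximity = pdist(points)                # Steps 1-2: pairwise proximity matrix
Z = linkage(proximity, method="single")  # Steps 3-5: repeatedly merge the closest clusters

# Cut the hierarchy to obtain, say, three flat clusters
flat = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(labels, flat)))
# dendrogram(Z, labels=labels)  # draws the dendrogram if matplotlib is available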
Example: There are six data points A, B, C, D, E, and F.
Agglomerative Hierarchical clustering
● Step-1: Treat each point as a single cluster and calculate the distance of each cluster from all the other clusters.
● Step-2: Comparable clusters are merged to form a single cluster. Say cluster (B) and cluster (C) are very similar, so we merge them; similarly for clusters (D) and (E). We are left with the clusters [(A), (BC), (DE), (F)].
● Step-3: We recalculate the proximity matrix according to the algorithm and merge the two nearest clusters, (DE) and (F), to form [(A), (BC), (DEF)].
● Step-4: Repeating the same process, the clusters (DEF) and (BC) are now the closest and are merged. We are left with [(A), (BCDEF)].
● Step-5: Finally, the two remaining clusters are merged into a single cluster [(ABCDEF)].
Divisive Hierarchical clustering is the opposite of Agglomerative Hierarchical
clustering. In Divisive Hierarchical clustering, we start with all of the data points
as a single cluster and, in every iteration, split off the data points that are least
similar to the rest. In the end, each of the N data points forms its own cluster.
Linkage Methods:
○ Linkage methods determine how the distance between clusters is calculated.
Common linkage methods include:
■ Single Linkage:
■ The distance between two clusters is the shortest distance between
any two points in the clusters.
■ Complete Linkage:
■ The distance between two clusters is the longest distance between
any two points in the clusters.
■ Average Linkage:
■ The distance between two clusters is the average distance between
all pairs of points in the clusters.
■ Ward's Linkage:
■ Minimizes the variance within clusters.
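To see how the linkage choice affects the result, here is a short comparison sketch using scikit-learn's AgglomerativeClustering (the toy data points are assumptions made for illustration):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 1], [1.5, 1.2], [1.4, 1.0], [5, 5], [5.2, 4.8], [9, 9]])

# The same data can be grouped differently depending on the linkage criterion.
for method in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    print(method, model.fit_predict(X))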
Key Advantages
● No Predefined Number of Clusters:
○ Hierarchical clustering doesn't require you to specify the
number of clusters beforehand. You can determine the
number of clusters by cutting the dendrogram at an
appropriate level.
● Hierarchical Relationships:
○ It reveals the hierarchical relationships between data points
and clusters, providing a more detailed understanding of the
data.
● Visual Representation:
○ Dendrograms provide a clear visual representation of the
clustering process.
Key Disadvantages:
● Computational Complexity:
○ Hierarchical clustering can be computationally expensive,
especially for large datasets.
● Sensitivity to Noise and Outliers:
○ Distance-based methods are sensitive to noise and outliers.
● Difficulty Handling Large Datasets:
○ Due to the computational complexity, it can be difficult to use with
very large datasets.
Video Clips on Hierarchical Clustering
https://www.youtube.com/watch?v=SAzGwacrje0
Density-Based Method: The density-based method focuses on density. A given cluster keeps growing
as long as the density in the neighbourhood exceeds some threshold, i.e., for each data point within a
given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells
that form a grid structure. A major advantage of the grid-based method is its fast processing time, which
depends only on the number of cells in each dimension of the quantized space, not on the number of data
objects.
Model-Based Method: In the model-based method, a model is hypothesized for each of the clusters and
the data that best fit that model are located. Clusters are found by means of a density function that
reflects the spatial distribution of the data points. This approach also provides a way to automatically
determine the number of clusters based on standard statistics, taking outliers or noise into account, and
therefore yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by the incorporation of
application or user-oriented constraints. A constraint refers to the user expectation or the properties of the
desired clustering results. Constraints provide an interactive way of communicating with the clustering
process; they can be specified by the user or by the application requirements.
Video Clips for Review
1. Grid-based Clustering:
https://www.youtube.com/watch?v=iCg9e9cECm4
2. Model-based Clustering:
https://www.youtube.com/watch?v=-VjtbwfAvh4
3. Constraint-Based Clustering:
https://www.youtube.com/watch?v=bFdXmVPE0aI
K-means Clustering and Density-based Clustering
K-means and density-based clustering represent two distinct approaches with different strengths and
weaknesses.
K-means Clustering:
● Centroid-based:
○ K-means is a centroid-based algorithm. It aims to partition data into k clusters, where each cluster is represented by
its centroid (the mean of the data points in the cluster).
● Spherical clusters:
○ K-means tends to produce spherical clusters of roughly equal size. It works well when the clusters are
well-separated and have a globular shape.
● Requires pre-defined k:
○ A significant limitation of K-means is that you must specify the number of clusters (k) beforehand. Determining the
optimal value of k can be challenging.
● Sensitive to outliers:
○ Outliers can significantly affect the position of centroids, leading to poor clustering results.
● Computational efficiency:
○ K-means is generally computationally efficient, making it suitable for large datasets.
● How it works:
○ It iteratively assigns data points to the nearest centroid and updates the centroids until convergence.
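For example, a minimal scikit-learn sketch of K-means on assumed toy data (the two Gaussian blobs and k = 2 are illustrative choices, not from the slides):

from sklearn.cluster import KMeans
import numpy as np

# Assumed toy data: two globular, well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + [6, 6]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # k must be chosen beforehand
labels = kmeans.fit_predict(X)        # iterative assign-to-nearest-centroid / update loop
print(kmeans.cluster_centers_)        # final centroids (the mean of each cluster)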
Density-based Clustering (e.g., DBSCAN):
● Density-based:
○ Density-based clustering algorithms group together data points that are close to each
other based on a density criterion. They can identify clusters of arbitrary shapes.
● Handles arbitrary shapes:
○ Unlike K-means, density-based algorithms can discover clusters of irregular shapes and
varying sizes.
● Does not require pre-defined number of clusters:
○ A key advantage of density-based algorithms like DBSCAN is that they do not require you
to specify the number of clusters in advance.
● Robust to outliers:
○ Density-based algorithms can effectively identify and handle outliers, labeling them as
noise.
● Parameters:
○ DBSCAN relies on two main parameters: epsilon (the radius of the neighborhood) and
minPts (the minimum number of points required to form a dense region).
● How it works:
○ It identifies core points (points with a minimum number of neighboring points within a
specified radius) and expands clusters around them.
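A minimal scikit-learn sketch of DBSCAN on assumed toy data (the two noisy rings, the outlier, and the eps/min_samples values are illustrative assumptions):

from sklearn.cluster import DBSCAN
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
ring = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(200, 2))
X = np.vstack([ring, ring + [3, 0], [[10.0, 10.0]]])   # two irregular groups plus one outlier

db = DBSCAN(eps=0.3, min_samples=5)    # epsilon radius and minPts
labels = db.fit_predict(X)
print(set(labels))                     # the label -1 marks points treated as noise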
Key Differences
● Shape of clusters:
○ K-means: Spherical.
○ Density-based: Arbitrary.
● Number of clusters:
○ K-means: Must be specified.
○ Density-based: Automatically determined.
● Handling outliers:
○ K-means: Sensitive.
○ Density-based: Robust.
When to use which:
● Use K-means when:
○ You expect the clusters to be spherical.
○ You have a good estimate of the number of clusters.
○ Computational efficiency is a priority.
● Use density-based clustering when:
○ The clusters have irregular shapes.
○ You don't know the number of clusters.
○ Your data contains outliers.
Video Clips on Probabilistic and Density-based Clustering
https://www.youtube.com/watch?v=u_u7L219d1w
Self-Organizing Map (SOM)
A Self-Organizing Map (SOM), also known as a Kohonen map, is a type of artificial neural
network that uses unsupervised learning to produce a low-dimensional (typically
two-dimensional) representation of a higher-dimensional data space.
Core Concepts:
● Unsupervised Learning:
○ SOMs learn patterns in data without the need for labeled examples. This makes
them valuable for exploratory data analysis.
● Dimensionality Reduction:
○ They excel at reducing the complexity of high-dimensional data by projecting it onto
a lower-dimensional grid. This simplifies visualization and analysis.
● Topological Preservation:
○ A crucial characteristic of SOMs is their ability to preserve the topological
relationships within the data. This means that data points that are close to each
other in the high-dimensional space will also be close to each other on the
lower-dimensional grid.
● Competitive Learning:
○ SOMs use a competitive learning process, where neurons on the grid compete to
respond to input data. The "winning" neuron, known as the Best Matching Unit
(BMU), and its neighboring neurons, have their weights adjusted to more closely
resemble the input.
● Grid Structure:
○ The output of an SOM is a grid of neurons, typically arranged in a two-dimensional
lattice. This grid represents the low-dimensional "map" onto which the data is
projected.
Architecture of SOM
Algorithm
Training:
Step 1: Initialize the weights wij to small random values. Initialize the learning
rate α.
Step 2: For a training vector x, calculate the squared Euclidean distance to each unit:
D(j) = Σ (wij – xi)^2, summed over i = 1 to n, for j = 1 to m
Step 3: Find the winning index J for which D(j) is minimum.
Step 4: For each unit j within a specified neighborhood of J, and for all i, calculate the new
weight:
wij(new) = wij(old) + α[xi – wij(old)]
Step 5: Update the learning rate, for example:
α(t+1) = 0.5 α(t)
Step 6: Test the stopping condition.
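A minimal NumPy sketch of these training steps, using a one-dimensional map and an immediate-neighbour neighborhood (the map size, toy data, and decay schedule are illustrative assumptions):

import numpy as np

def train_som(X, m, n_epochs=20, alpha=0.5):
    # Step 1: initialize the weight matrix with random values
    rng = np.random.default_rng(0)
    W = rng.random((m, X.shape[1]))
    for _ in range(n_epochs):                        # Step 6: stop after a fixed number of epochs
        for x in X:
            D = ((W - x) ** 2).sum(axis=1)           # Step 2: squared Euclidean distance D(j)
            J = int(D.argmin())                      # Step 3: winning (best matching) unit J
            for j in (J - 1, J, J + 1):              # Step 4: J and its immediate neighbours
                if 0 <= j < m:
                    W[j] += alpha * (x - W[j])       # wij(new) = wij(old) + α[xi – wij(old)]
        alpha *= 0.5                                 # Step 5: decay the learning rate
    return W

# Toy usage with assumed data: map 3-D inputs onto 4 map units
X = np.random.rand(20, 3)
print(train_som(X, m=4))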
How it Works:
1. Initialization:
○ The weights of the neurons in the grid are initialized with random values.
2. Competitive Process:
○ For each input data point, the distances between the input and the weights of all neurons are calculated.
○ The neuron with the closest weight vector is declared the BMU.
3. Weight Adjustment:
○ The weights of the BMU and its neighboring neurons are adjusted to move them closer to the input data point.
○ The magnitude of the adjustment decreases with distance from the BMU.
4. Iteration:
○ Steps 2 and 3 are repeated for many iterations, allowing the grid to self-organize and reflect the underlying structure of
the data.
Applications:
● Data Visualization:
○ SOMs provide a powerful way to visualize high-dimensional data, making it easier to identify clusters and patterns.
● Clustering:
○ They can be used for clustering data by identifying groups of neurons that respond similarly to input patterns.
● Feature Extraction:
○ SOMs can extract relevant features from data by mapping it onto a lower-dimensional representation.
● Image Processing:
○ They have applications in image segmentation and recognition.
● Financial Analysis:
○ SOMs can be used to analyze financial data and identify market trends.
Key Advantages:
● Effective for visualizing high-dimensional data.
● Preserves topological relationships.
● Unsupervised learning.
Key Considerations:
● The size and shape of the grid can influence the results.
● Parameter tuning is often required.
In essence, Self-Organizing Maps are a valuable tool for exploring and
understanding complex data, particularly when visualization and dimensionality
reduction are important.
Video Clip on Self-Organizing Maps (SOM)
https://www.youtube.com/watch?v=H9H6s-x-0YE
END OF UNIT 3-PART 1