Algorithm for mining cluster and association patterns
1. UNIT 3-ALGORITHM FOR MINING CLUSTER
AND ASSOCIATION PATTERNS
3.1 Hierarchical clustering
3.2 K-means Clustering and density-based
Clustering
3.3. Self-Organizing Map
3.4. Probability Distributions of Univariate Data
3.5. Association Rules
3.6. Bayesian Network
2. A cluster is a grouping or gathering of similar items or
entities. This implies a degree of proximity or
closeness among the elements within the group.
"Association Patterns" generally refers to the
discovery of relationships or dependencies between
items or variables within a dataset.
3. The clustering methods can be classified into the
following categories:
1. Partitioning Method
2. Hierarchical Method
3. Density-based Method
4. Grid-Based Method
5. Model-Based Method
6. Constraint-based Method
4. Partitioning Method: It is used to partition the data in order to form clusters. If "n" partitions are made on "p" objects of
the database, then each partition is represented by a cluster and n ≤ p. The two conditions which need to be satisfied by this
Partitioning Clustering Method are:
1. Each object must belong to exactly one group.
2. There should be no group without at least one object.
In the partitioning method, there is one technique called iterative
relocation, which means an object may be moved from one group
to another to improve the partitioning.
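Below is a minimal Python sketch (assuming NumPy) of iterative relocation; the data array X, the number of groups, and the stopping rule are illustrative assumptions, not part of the method's definition.

```python
import numpy as np

def iterative_relocation(X, k, n_iter=100, seed=0):
    """Toy partitioning by iterative relocation: every object belongs to
    exactly one group, and every group keeps at least one object."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen objects as group representatives.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Relocation step: move each object to the group whose representative is closest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each representative to the mean of its group.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # partitioning no longer improves
        centers = new_centers
    return labels, centers

# Hypothetical usage with p = 6 objects partitioned into n = 2 groups (n < p):
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centers = iterative_relocation(X, k=2)
print(labels)   # group index for each object
```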
5. Video Clips to review on Partitioning Method
https://www.youtube.com/watch?v=ktqRiYLEbg8
6. Hierarchical clustering
Hierarchical clustering is a powerful unsupervised machine learning
algorithm that groups data points into a hierarchy of clusters. Hierarchical
clustering creates a tree-like structure, known as a dendrogram, that
represents the nested relationships between clusters.
It is a method of cluster analysis in data mining that creates a
hierarchical representation of the clusters in a dataset. The method starts
by treating each data point as a separate cluster and then iteratively
combines the closest clusters until a stopping criterion is reached. A
dendrogram illustrates the hierarchical relationships among the clusters.
7. Core Concepts
● Hierarchy:
○ The fundamental characteristic of hierarchical clustering is that it builds a
hierarchy of clusters. Clusters can contain sub-clusters, and so on.
● Dendrogram:
○ A dendrogram is a tree diagram that visually represents the hierarchy of
clusters. The vertical axis of a dendrogram represents the distance or
dissimilarity between clusters.
There are two main approaches to hierarchical clustering:
1. Agglomerative (Bottom-up):
■ Starts with each data point as its own cluster.
■ At every step, merges the nearest pair of clusters.
■ Repeats until all data points belong to a single cluster.
8. 2. Divisive (Top-down):
■ Starts with all data points in a single cluster.
■ Repeatedly splits clusters into smaller clusters until
each data point is in its own cluster.
9. The algorithm for Agglomerative Hierarchical Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters
(compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the new clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.
10. Example: There are six data points A, B, C, D, E, and F.
Agglomerative Hierarchical clustering
● Step-1: Consider each point as a single cluster and calculate the distance of each cluster from all the other clusters.
● Step-2: Comparable clusters are merged to form a single cluster. Say cluster (B) and cluster (C) are very similar to each other, so we
merge them; similarly, clusters (D) and (E) are merged. We are left with the clusters [(A), (BC), (DE), (F)].
● Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters ([(DE), (F)]) to
form the new clusters [(A), (BC), (DEF)].
● Step-4: Repeating the same process, the clusters (DEF) and (BC) are comparable and are merged to form a new cluster.
We are now left with the clusters [(A), (BCDEF)].
● Step-5: Finally, the two remaining clusters are merged to form the single cluster [(ABCDEF)].
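The same merge sequence can be reproduced with SciPy's hierarchical clustering routines. In the sketch below the 2-D coordinates for A–F are made up for illustration (chosen so that B/C and D/E are the closest pairs), and average linkage is assumed:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

labels = ['A', 'B', 'C', 'D', 'E', 'F']
# Hypothetical coordinates: B~C and D~E are very close, F is near D/E, A is far away.
X = np.array([[0.0, 0.0],
              [4.0, 4.0], [4.1, 4.1],    # B, C
              [8.0, 0.0], [8.1, 0.1],    # D, E
              [9.0, 0.5]])               # F

Z = linkage(X, method='average')  # each row: [cluster_i, cluster_j, distance, size]
# Replay the merges; indices >= len(X) refer to clusters created by earlier merges.
names = labels[:]
for i, j, dist, size in Z:
    merged = names[int(i)] + names[int(j)]
    names.append(merged)
    print(f"merge {names[int(i)]} + {names[int(j)]} -> {merged} (distance {dist:.2f})")
```

With these coordinates the printed merge order follows the steps above: BC, DE, DEF, BCDEF, and finally ABCDEF.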
11. Divisive Hierarchical clustering is the opposite of Agglomerative Hierarchical
clustering. In Divisive Hierarchical clustering, we start by treating all of the data points
as a single cluster, and in every iteration we split off the data points that are not similar
to the rest of their cluster. In the end, we are left with N clusters, one per data point.
12. Linkage Methods:
○ Linkage methods determine how the distance between clusters is calculated.
Common linkage methods include:
■ Single Linkage:
■ The distance between two clusters is the shortest distance between
any two points in the clusters.
■ Complete Linkage:
■ The distance between two clusters is the longest distance between
any two points in the clusters.
■ Average Linkage:
■ The distance between two clusters is the average distance between
all pairs of points in the clusters.
■ Ward's Linkage:
■ Minimizes the variance within clusters.
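As a brief sketch, assuming SciPy and the array X from the previous example, the linkage method is simply a parameter of the linkage() call; changing it changes the merge distances and, for less clean data, the resulting clusters:

```python
from scipy.cluster.hierarchy import linkage, fcluster

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)                   # inter-cluster distance rule
    flat = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
    print(method, flat)
```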
13. Key Advantages
● No Predefined Number of Clusters:
○ Hierarchical clustering doesn't require you to specify the
number of clusters beforehand. You can determine the
number of clusters by cutting the dendrogram at an
appropriate level.
● Hierarchical Relationships:
○ It reveals the hierarchical relationships between data points
and clusters, providing a more detailed understanding of the
data.
● Visual Representation:
○ Dendrograms provide a clear visual representation of the
clustering process.
14. Key Disadvantages:
● Computational Complexity:
○ Hierarchical clustering can be computationally expensive,
especially for large datasets.
● Sensitivity to Noise and Outliers:
○ Distance-based methods are sensitive to noise and outliers.
● Difficulty Handling Large Datasets:
○ Due to the computational complexity, it can be difficult to use with
very large datasets.
15. Video Clips on Hierarchical Clustering
https://www.youtube.com/watch?v=SAzGwacrje0
16. Density-Based Method: The density-based method mainly focuses on density. In this method, a cluster keeps
growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point within a
given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells that
form a grid structure. One of the major advantages of the grid-based method is its fast processing time, which
depends only on the number of cells in each dimension of the quantized space rather than on the number of
data objects.
Model-Based Method: In the model-based method, a model is hypothesized for each cluster in order to find the
data that best fits the model. The clusters are located by clustering the density function. This reflects the
spatial distribution of the data points and also provides a way to automatically determine the number of
clusters based on standard statistics, taking outliers or noise into account, and therefore yields robust
clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by incorporating application-
or user-oriented constraints. A constraint refers to the user's expectations or the properties of the desired
clustering results. Constraints provide an interactive way of communicating with the clustering process; they
can be specified by the user or by the application requirements.
17. Video Clips for Review
1. Grid-based Clustering:
https://www.youtube.com/watch?v=iCg9e9cECm4
2. Model-based Clustering
https://www.youtube.com/watch?v=-VjtbwfAvh4
3. Constraint-Based Clustering
https://www.youtube.com/watch?v=bFdXmVPE0aI
18. K-means Clustering and Density-based Clustering
K-means and density-based clustering represent two distinct approaches with different strengths and
weaknesses.
K-means Clustering:
● Centroid-based:
○ K-means is a centroid-based algorithm. It aims to partition data into k clusters, where each cluster is represented by
its centroid (the mean of the data points in the cluster).
● Spherical clusters:
○ K-means tends to produce spherical clusters of roughly equal size. It works well when the clusters are
well-separated and have a globular shape.
● Requires pre-defined k:
○ A significant limitation of K-means is that you must specify the number of clusters (k) beforehand. Determining the
optimal value of k can be challenging.
● Sensitive to outliers:
○ Outliers can significantly affect the position of centroids, leading to poor clustering results.
● Computational efficiency:
○ K-means is generally computationally efficient, making it suitable for large datasets.
● How it works:
○ It iteratively assigns data points to the nearest centroid and updates the centroids until convergence.
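A minimal K-means sketch, assuming scikit-learn is available; the synthetic blob data and the choice k = 3 are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated globular clusters (the case where K-means works best).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)   # k must be specified in advance
labels = km.fit_predict(X)                              # assign each point to its nearest centroid

print(km.cluster_centers_)   # final centroid of each cluster
print(km.inertia_)           # sum of squared distances to the nearest centroid
```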
19. Density-based Clustering (e.g., DBSCAN):
● Density-based:
○ Density-based clustering algorithms group together data points that are close to each
other based on a density criterion. They can identify clusters of arbitrary shapes.
● Handles arbitrary shapes:
○ Unlike K-means, density-based algorithms can discover clusters of irregular shapes and
varying sizes.
● Does not require pre-defined number of clusters:
○ A key advantage of density-based algorithms like DBSCAN is that they do not require you
to specify the number of clusters in advance.
● Robust to outliers:
○ Density-based algorithms can effectively identify and handle outliers, labeling them as
noise.
● Parameters:
○ DBSCAN relies on two main parameters: epsilon (the radius of the neighborhood) and
minPts (the minimum number of points required to form a dense region).
● How it works:
○ It identifies core points (points with a minimum number of neighboring points within a
specified radius) and expands clusters around them.
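A comparable DBSCAN sketch, again assuming scikit-learn; the moon-shaped data is a case K-means handles poorly, and the eps and min_samples values are illustrative and usually need tuning:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters: non-spherical, so density-based clustering fits well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = neighborhood radius, min_samples = minPts
labels = db.fit_predict(X)

print(set(labels))                     # cluster ids; -1 marks points labeled as noise
print(sum(labels == -1), "noise points")
```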
20. Key Differences
● Shape of clusters:
○ K-means: Spherical.
○ Density-based: Arbitrary.
● Number of clusters:
○ K-means: Must be specified.
○ Density-based: Automatically determined.
● Handling outliers:
○ K-means: Sensitive.
○ Density-based: Robust.
21. When to use which:
● Use K-means when:
○ You expect the clusters to be spherical.
○ You have a good estimate of the number of clusters.
○ Computational efficiency is a priority.
● Use density-based clustering when:
○ The clusters have irregular shapes.
○ You don't know the number of clusters.
○ Your data contains outliers.
22. Video Clips on Probabilistic and Density-based Clustering
https://www.youtube.com/watch?v=u_u7L219d1w
23. Self-Organizing Map (SOM)
A Self-Organizing Map (SOM), also known as a Kohonen map, is a type of artificial neural
network that uses unsupervised learning to produce a low-dimensional (typically
two-dimensional) representation of a higher-dimensional data space.
Core Concepts:
● Unsupervised Learning:
○ SOMs learn patterns in data without the need for labeled examples. This makes
them valuable for exploratory data analysis.
● Dimensionality Reduction:
○ They excel at reducing the complexity of high-dimensional data by projecting it onto
a lower-dimensional grid. This simplifies visualization and analysis.
24. ● Topological Preservation:
○ A crucial characteristic of SOMs is their ability to preserve the topological
relationships within the data. This means that data points that are close to each
other in the high-dimensional space will also be close to each other on the
lower-dimensional grid.
● Competitive Learning:
○ SOMs use a competitive learning process, where neurons on the grid compete to
respond to input data. The "winning" neuron, known as the Best Matching Unit
(BMU), and its neighboring neurons, have their weights adjusted to more closely
resemble the input.
● Grid Structure:
○ The output of an SOM is a grid of neurons, typically arranged in a two-dimensional
lattice. This grid represents the low-dimensional "map" onto which the data is
projected.
26. Algorithm
Training:
Step 1: Initialize the weights wij with random values. Initialize the learning
rate α.
Step 2: For each input vector x, calculate the squared Euclidean distance for every output unit j:
D(j) = Σ (wij – xi)^2, where i = 1 to n and j = 1 to m
Step 3: Find the index J for which D(J) is minimum; this is the winning unit.
Step 4: For each unit j within a specified neighborhood of J, and for all i, calculate the new
weight:
wij(new) = wij(old) + α[xi – wij(old)]
Step 5: Update the learning rate, for example:
α(t+1) = 0.5 α(t)
Step 6: Test the stopping condition.
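A compact NumPy sketch of these training steps; the grid size, neighborhood rule, decay schedule, and input data are illustrative assumptions rather than part of the algorithm above:

```python
import numpy as np

def train_som(X, rows=5, cols=5, alpha=0.5, radius=2, n_epochs=20, seed=0):
    """Minimal SOM training loop following Steps 1-6 above."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Step 1: random initial weights w_ij and an initial learning rate alpha.
    W = rng.random((rows, cols, n_features))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij'), axis=-1)

    for t in range(n_epochs):
        for x in X:
            # Step 2: squared Euclidean distance D(j) between x and every unit's weights.
            D = ((W - x) ** 2).sum(axis=2)
            # Step 3: winning unit J (the Best Matching Unit).
            J = np.unravel_index(D.argmin(), D.shape)
            # Step 4: update all units within a fixed grid radius of J
            # (a simplification of the usual shrinking neighborhood).
            dist_to_J = np.abs(grid - np.array(J)).sum(axis=2)
            mask = dist_to_J <= radius
            W[mask] += alpha * (x - W[mask])
        # Step 5: decay the learning rate, e.g. alpha(t+1) = 0.5 * alpha(t).
        alpha *= 0.5
        # Step 6: stopping condition -- here a fixed epoch count or a tiny learning rate.
        if alpha < 1e-4:
            break
    return W

# Hypothetical usage: map 3-dimensional points onto a 5x5 grid.
X = np.random.default_rng(1).random((100, 3))
W = train_som(X)
print(W.shape)   # (5, 5, 3)
```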
27. How it Works:
1. Initialization:
○ The weights of the neurons in the grid are initialized with random values.
2. Competitive Process:
○ For each input data point, the distances between the input and the weights of all neurons are calculated.
○ The neuron with the closest weight vector is declared the BMU.
3. Weight Adjustment:
○ The weights of the BMU and its neighboring neurons are adjusted to move them closer to the input data point.
○ The magnitude of the adjustment decreases with distance from the BMU.
4. Iteration:
○ Steps 2 and 3 are repeated for many iterations, allowing the grid to self-organize and reflect the underlying structure of
the data.
Applications:
● Data Visualization:
○ SOMs provide a powerful way to visualize high-dimensional data, making it easier to identify clusters and patterns.
● Clustering:
○ They can be used for clustering data by identifying groups of neurons that respond similarly to input patterns.
● Feature Extraction:
○ SOMs can extract relevant features from data by mapping it onto a lower-dimensional representation.
● Image Processing:
○ They have applications in image segmentation and recognition.
● Financial Analysis:
○ SOMs can be used to analyze financial data and identify market trends.
28. Key Advantages:
● Effective for visualizing high-dimensional data.
● Preserves topological relationships.
● Unsupervised learning.
Key Considerations:
● The size and shape of the grid can influence the results.
● Parameter tuning is often required.
In essence, Self-Organizing Maps are a valuable tool for exploring and
understanding complex data, particularly when visualization and dimensionality
reduction are important.
29. Video Clip on Self-Organizing Maps (SOM)
https://www.youtube.com/watch?v=H9H6s-x-0YE