HIERARCHICAL AND NON-HIERARCHICAL CLUSTERING
Course In-charge,
Dr. Kiran Prakash
Professor and Head
Department of Statistics and
Computer applications
Presented by,
Ranjith. C
M. Sc. (Ag) Statistics
BAM-2022-77
Agricultural college, Bapatla
Acharya N. G. Ranga Agricultural University
STAT 591 – Master’s Seminar (0+1)
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
Introduction
Definitions
• Cluster: a collection of objects that are
• similar to one another within the same cluster
• dissimilar to the objects in other clusters
• Cluster analysis is a statistical technique used to group a set of
objects in such a way that objects in the same group (cluster) are more
similar to each other than to those in other groups.
• Applications – in short
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
Other clustering methods
• Model based clustering
• The data points within each cluster follow a particular probability distribution
• Density based clustering
• Groups data points together based on their density within a defined radius or
distance threshold
• Grid based clustering
• Grid-based methods quantize the object space into a finite number of cells that
form a grid structure
• Fuzzy clustering
• assigns each data point a membership score for each cluster, rather than a binary
membership value
• Speciation of plants
• Clustering the characteristics of certain plants to identify the species, or to
decide whether there is sufficient evidence to declare it a new species
• Study of Natural disasters
• To find the areas affected by earthquake, forest fire etc. and to take measures
• City planning
• To find the areas where more people are residing and build transportation
facilities and roads
• Survey planning
• To determine the optimum sample size so as to conduct effective surveys
Applications
Applications
• Marketing
• Classify the products based on customer preferences
• Medical Diagnosis
• Group patients with similar symptoms or medical histories, aiding in disease
classification and personalized treatment plans.
• Crime Pattern Analysis
• Analyze crime data to identify clusters of similar criminal activities, assisting law
enforcement in targeted interventions.
• Image Segmentation
• Analyze and categorize images into clusters, assisting in image recognition and
computer vision applications.
Hierarchical clustering
• Hierarchical clustering is a method of cluster analysis that builds a
hierarchy of clusters. It starts with individual data points and
recursively merges or divides them to form a tree-like structure,
known as a dendrogram. The dendrogram represents the relationships
and similarities between different clusters and can be visually
interpreted to understand the organization of the data.
• Two types:
• Agglomerative hierarchical method
• Divisive hierarchical method
Hierarchical clustering
• Start with Individual Data Points
Begin by considering each data point as a separate cluster.
• Compute Pairwise Similarities
Calculate the similarity or dissimilarity between each pair of clusters or data
points. Common distance metrics include Euclidean distance, Manhattan
distance, or correlation coefficients.
• Merge Similar Clusters
Identify the pair of clusters with the highest similarity and merge them into
a single cluster. This creates a new cluster that replaces the two merged
clusters.
Hierarchical clustering
• Update Similarity Matrix
Recalculate the similarity or dissimilarity between the new cluster and the
remaining clusters.
• Repeat Steps 3-4
Repeat the process of merging the most similar clusters and updating the similarity
matrix until all data points are in a single cluster or until a predetermined number
of clusters is reached.
• Dendrogram Construction
Represent the clustering process using a dendrogram. The vertical lines in the
dendrogram indicate the merging of clusters, and the height at which they merge
reflects the dissimilarity at which the merging occurred.
Nonhierarchical clustering
• Non-hierarchical clustering, also known as partitioning clustering, is a
method of cluster analysis that divides a dataset into a predetermined
number of clusters. Unlike hierarchical clustering, which creates a
tree-like structure of nested clusters, non-hierarchical clustering
directly assigns data points to clusters without forming a hierarchy.
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
Review of Literature
• Rainfall pattern
• Germplasm evaluation
• Chemical classification
• Groundwater contamination
• Image detection
Rainfall pattern
• A multivariate approach based on hierarchical cluster analysis has
been proposed to study the pattern of rainfall in different mandals of
Visakhapatnam district of Andhra Pradesh
• Rainfall patterns of 42 mandals, based on 25 years of rainfall data (1986-2010) of
Visakhapatnam district, were analysed
Category Rainfall (mm)
High rainfall >1162
Medium rainfall 862 – 1162
Low rainfall <862
Results
• The mandals were categorized into 8 clusters based on mean rainfall
• The application of these approaches identified that medium rainfall (862 mm - 1162 mm)
was the most frequent representative pattern of rainfall in the majority of mandals of
Visakhapatnam district
Germplasm Evaluation
• A set of 100 rice germplasm lines with four checks viz., BPT-5204,
PSB-68, Siri1253 and MGD-101 were evaluated in augmented block
design during Kharif 2020.
• Test entries along with checks were sown at a spacing of 20 × 10 cm in an
augmented block design with four blocks, wherein each block comprised 25
genotypes and the four checks were repeated in each block
• Data related to days to 50% flowering, panicle length, panicles per
square metre, 1000-grain weight and grain yield was collected and
analysed
Cluster analysis
Sl No  Cluster    No. of individuals  Character
1      Cluster 1   5                  Early maturity, high grain yield, long panicle length and medium 1000-grain weight
2      Cluster 2  27                  Early maturing types with medium panicle length and low 1000-grain weight
3      Cluster 3  23                  Very early flowering and medium 1000-grain weight
4      Cluster 4  45                  Early flowering, short panicle length and more panicles per square metre
The average intra-cluster and inter-cluster Euclidean distances were estimated
using Ward's minimum variance method
Results
• It was discovered that none of the clusters included a genotype that had all of
the desirable traits, ruling out the idea of selecting one genotype for immediate
use. Therefore, to judiciously incorporate all of the desirable features,
hybridization between selected genotypes from divergent clusters is required.
• From the cluster analysis, the maximum inter-cluster distance was observed
between cluster 2 and cluster 3, followed by cluster 1 and cluster 2. So the
genotypes selected from these clusters can be used as genetically diverse parents.
Chemical classification
• Five Piper nigrum essential oils were analyzed by GC-MS (gas
chromatography-mass spectrometry)
• 78 compounds were identified, accounting for more than 99% of the compositions
• Based on the P. nigrum essential oil compositions, a hierarchical cluster
analysis of the oils was carried out
• The analysis was done using agglomerative hierarchical cluster (AHC) analysis
in XLSTAT Premium
• Dissimilarity was determined using Euclidean distance, and clustering was
defined using Ward's method
Results
• The oils were dominated by monoterpene hydrocarbons. Black pepper
oils from various geographical locations have shown qualitative
similarities with differences in the concentrations of their major
components.
• β-Caryophyllene, limonene, β-pinene, α-pinene, δ-3-carene, sabinene,
and myrcene were the main components of P. nigrum oil
Groundwater contamination
• Groundwater samples from 30 locations
• The ionic balance error between the total concentration of cations (Ca2+, Mg2+,
Na+ and K+) and the total concentration of anions (HCO3-, Cl-, SO42- and NO3-),
expressed in milliequivalents per liter (meq/L), was observed for each
groundwater sample
Pollution index calculation
• In the first step, a relative weight (Rw) from 1 to 5 was assigned to each
chemical parameter, depending upon its relative impact on human beings. The
minimum weight (1) was given to K+ and the maximum weight (5) to pH, TDS,
SO42- and NO3-
• In the second step, the weight parameter (Wp) was computed for each chemical
parameter to assess its relative share of the overall chemical groundwater quality
• In the third step, the status of concentration (Sc) was determined by dividing
the concentration (C) of each chemical parameter of each groundwater sample by
its respective drinking water quality standard limit (Ds)
• In the last step, the pollution index of groundwater (PIG) was calculated by
adding all values of Ow (ΣOw)
STATISTICA version 6.1 was used. In HCA, complete linkage was used to
determine the distance between the clusters or groups.
Results
• Group I represents low mineralized groundwater quality, Group II shows
moderately mineralized groundwater quality and Group III has highly mineralized
groundwater quality, depending upon the availability of sources
• Half of the samples fall under the moderately mineralized category, seven under
the low mineralized and eight under the highly mineralized category
Image detection
• The objective of this research was the detection and classification of cotton
and tomato leaf diseases
• The K-means clustering algorithm was used to separate the stained (diseased)
part from the healthy leaf region
• The proposed image processing method was implemented in MATLAB 2016b
• L* represents the lightness, a* and b* represents the chromaticity
layers. All of the color information is in the a* and b* layers
• The derived features are Contrast, Correlation, Energy, Homogeneity,
Mean, Standard Deviation and Variance
K-means clustering algorithm (processing pipeline)
1. Load the image
2. Convert the RGB color space into the L*a*b* color space
3. Cluster the variant colors
4. Measure distances using the Euclidean distance matrix
5. Create a blank cell array to store the clusters
Output: clusters
NN Classification

Leaf Disease          Bacterial Leaf Spot   Target Spot   Septoria Leaf Spot   Leaf Mold   Accuracy
Bacterial Leaf Spot           9                  1                 0               0         90%
Target Spot                   2                  8                 0               0         80%
Septoria Leaf Spot            0                  0                10               0        100%
Leaf Mold                     0                  0                 0              10        100%
Average Accuracy                                                                            92.5%
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
Methodology
Outline
1. Methodology
2. Further discussions – Hierarchical clustering
3. Non-Hierarchical clustering
4. Further discussions – Non-Hierarchical clustering
Methodology
Agglomerative Hierarchical method
• A series of successive mergers
• There are as many initial clusters as objects
• The most similar objects are grouped first
• Then these initial groups are merged according to their similarities
• Eventually, as the similarity decreases, all subgroups are fused into a single cluster
Divisive Hierarchical methods
• Works in the opposite direction to the agglomerative method
• An initial single group of objects is divided into two subgroups such that the
objects in one subgroup are “far from” the objects in the other
• These are further divided into dissimilar subgroups
• The process is continued until there are as many subgroups as objects – that is,
until each object becomes a cluster
• Both agglomerative and divisive methods can be displayed as a two-dimensional
diagram called a dendrogram
Algorithm for Agglomerative clustering
1. Start with N clusters, each containing a single entity and an N x N symmetric
matrix of distances (or similarities) D = {dik}
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let
distance between “most similar” clusters U and V be dUV
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the
entries in the distance matrix by (a) deleting the rows and columns
corresponding to clusters U and V and (b) adding a row and column giving the
distances between cluster (UV) and the remaining clusters.
4. Repeat steps 2 and 3 a total of N – 1 times. (All objects will be in single cluster
after the algorithm terminates). Record the identity of clusters that are merged
and the levels (distances or similarities) at which the mergers take place.
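A minimal R sketch of steps 1-4 (my own illustration, not from the textbook): it repeatedly finds the closest pair of clusters, records the merger and its level, and updates the cluster list. The only assumptions are the pluggable linkage argument (min for single linkage, max for complete linkage, mean for average linkage) and the 5 x 5 example distance matrix used in the worked examples that follow.

# Generic agglomerative clustering - a sketch of steps 1-4 above
# D: a symmetric distance matrix with object labels as dimnames
# linkage: between-cluster distance rule (min, max, or mean)
agglomerate <- function(D, linkage = min) {
  clusters <- as.list(rownames(D))                 # step 1: N singleton clusters
  history <- data.frame(merged = character(0), level = numeric(0))
  while (length(clusters) > 1) {
    k <- length(clusters)
    best <- c(1, 2); best_d <- Inf
    for (i in 1:(k - 1)) {                         # step 2: search for the nearest pair
      for (j in (i + 1):k) {
        d_ij <- linkage(D[clusters[[i]], clusters[[j]]])
        if (d_ij < best_d) { best_d <- d_ij; best <- c(i, j) }
      }
    }
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    history <- rbind(history,                      # step 4: record merger and level
                     data.frame(merged = paste(merged, collapse = ""),
                                level = best_d))
    clusters <- c(clusters[-best], list(merged))   # step 3: update the cluster list
  }
  history
}

# The 5-object distance matrix used in the worked examples below
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5,
            dimnames = list(1:5, 1:5))
agglomerate(D, linkage = min)   # single linkage: mergers at levels 2, 3, 5, 6
agglomerate(D, linkage = max)   # complete linkage: mergers at levels 2, 5, 9, 11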
Linkage methods
• These methods are suitable for clustering items, as well as variables
• Three main types are there;
• Single linkage
• Complete linkage
• Average linkage
Cluster distance
Figure 3.1 Intercluster distance (dissimilarity) for (a) single linkage (the smallest distance between an object in one cluster and an object in the other), (b) complete linkage (the largest such distance) and (c) average linkage (the mean of all between-cluster pairwise distances, e.g. (d13 + d14 + d15 + d23 + d24 + d25)/6 for clusters {1, 2} and {3, 4, 5})
Single linkage
• The inputs to a single linkage algorithm can be distances or
similarities between pairs of objects. Groups are formed from nearest
neighbors, where the term nearest neighbor connotes the smallest
distance or largest similarity
• Initially we must find the smallest distance in D = {dik} and merge the
corresponding objects, say U and V, to get the cluster (UV). In the next
step of the general algorithm, the distance between (UV) and any other cluster W
is computed by
d(UV)W = min {dUW, dVW}
Clustering using single linkage
D = {dik} =

         1    2    3    4    5
    1    0
    2    9    0
    3    3    7    0
    4    6    5    9    0
    5   11   10    2    8    0

min (dik) = d53 = 2 (minimum over all pairs i, k)
Objects 5 and 3 are merged to form cluster (35). To implement the next level of clustering, we need the distances between
the cluster (35) and the remaining objects 1, 2, and 4. The distances are
d(35)1 = min {d31, d51} = min {3, 11} = 3
d(35)2 = min {d32, d52} = min {7, 10} = 7
d(35)4 = min {d34, d54} = min {9, 8} = 8
Deleting the rows and columns of D corresponding to objects 3 and 5 we obtain a new distance matrix
          (35)    1    2    4
   (35)     0
    1       3     0
    2       7     9    0
    4       8     6    5    0
The smallest distance between pairs of
clusters is now d(35)1 = 3, and we merge
cluster (1) with cluster (35) to get the next
cluster, (135). Calculating,
d(135)2 = min {d(35)2, d12} = min {7, 9} = 7
d(135)4 = min {d(35)4, d14} = min {8, 6} = 6
          (135)    2    4
   (135)     0
     2       7     0
     4       6     5    0
The minimum nearest neighbor distance between pairs of clusters is now d42 = 5, and we merge objects 4 and 2 to get the
cluster (24). At this point, we have two distinct clusters, (135) and (24). Their nearest neighbor distance is
d(135)(24) = min {d(135)2, d(135)4} = min {7, 6} = 6
The final distance matrix becomes

          (135)  (24)
   (135)    0
   (24)     6     0

Consequently, clusters (135) and (24) are merged to form a single cluster of all five objects, (12345), when the
nearest neighbor distance reaches 6.
Figure 3.2 Single linkage dendrogram for distances between five objects (leaf order 1, 3, 5, 2, 4; mergers at distances 2, 3, 5 and 6)
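The merges above can be reproduced directly with R's hclust(); the short sketch below (an illustration, assuming the same 5 x 5 distance matrix) gives the merge heights 2, 3, 5, 6 and a dendrogram matching Figure 3.2.

# Single linkage with hclust() on the 5-object distance matrix
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5,
            dimnames = list(1:5, 1:5))
hc_single <- hclust(as.dist(D), method = "single")
hc_single$height          # 2 3 5 6 - the nearest neighbor merge levels found above
plot(hc_single)           # dendrogram corresponding to Figure 3.2
# method = "complete" or "average" applies the other linkage rules to the same data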
Consider the following array of distances between pairs of 11 languages
(E = English, N = Norwegian, Da = Danish, Du = Dutch, G = German, Fr = French,
Sp = Spanish, I = Italian, P = Polish, H = Hungarian, Fi = Finnish).

      E   N   Da  Du  G   Fr  Sp  I   P   H   Fi
E     0
N     2   0
Da    2   1   0
Du    7   5   6   0
G     6   4   5   5   0
Fr    6   6   6   9   7   0
Sp    6   6   5   9   7   2   0
I     6   6   5   9   7   1   1   0
P     7   7   6  10   8   5   3   4   0
H     9   8   8   8   9  10  10  10  10   0
Fi    9   9   9   9   9   9   9   9   9   8   0
We first search minimum distance between pairs of
languages (clusters). The minimum distance of 1
occurs between
Danish – Norwegian
Italian – French
Italian – Spanish
Numbering the languages in the order of appearance
gives,
d32 = 1, d86 = 1, d87 = 1
Since d76 = 2, we can merge only clusters 8 and 6 or
clusters 8 and 7. We cannot merge clusters 6, 7, and
8 at level 1. We choose first to merge 6 and 8, and
then to update the distance matrix and merge 2 and
3 to obtain the clusters (68) and (23).
Fig 3.3 Single linkage dendrogram for distances between numbers in 11 languages (leaf order E, N, Da, Fr, I, Sp, P, Du, G, H, Fi)
Since single linkage joins clusters by the shortest link between them, the technique cannot discern poorly separated
clusters. On the other hand, single linkage is one of the few clustering methods that can delineate non-ellipsoidal
clusters. The tendency of single linkage to pick out long, string-like clusters is known as chaining.
Fig 3.4 Single linkage clusters: (a) single linkage confused by near overlap; (b) the chaining effect
Complete linkage
• Complete linkage clustering proceeds in much the same manner as single linkage
clustering, with one important exception: at each stage, the distance (similarity)
between clusters is determined by the distance (similarity) between the two
elements, one from each cluster, that are most distant.
• Thus complete linkage ensures that all items in a cluster are within some
maximum distance (or minimum similarity) of each other
• The general agglomerative algorithm again starts by finding the minimum entry
in D = {dik} and merging the corresponding objects, say U and V, to get the
cluster (UV). For step 3 of the general algorithm, the distance between (UV) and
any other cluster W is computed by
d(UV)W = max {dUW, dVW}
The example uses the same distance matrix as before:

D = {dik} =

         1    2    3    4    5
    1    0
    2    9    0
    3    3    7    0
    4    6    5    9    0
    5   11   10    2    8    0

min (dik) = d53 = 2 (minimum over all pairs i, k)
At the first stage, objects 3 and 5 are merged, since they are most similar. This gives the cluster (35). At stage 2, we compute
d(35)1 = max {d31, d51} = max {3, 11} = 11
d(35)2 = max {d32, d52} = max {7, 10} = 10
d(35)4 = max {d34, d54} = max {9, 8} = 9
and the modified distance matrix becomes

          (35)    1    2    4
   (35)     0
    1      11     0
    2      10     9    0
    4       9     6    5    0

The next merger occurs between the most similar groups, 2 and 4, to give the cluster (24). At stage 3, we have
d(24)(35) = max {d2(35), d4(35)} = max {10, 9} = 10
d(24)1 = max {d21, d41} = 9
The distance matrix after this merger is

          (35)  (24)    1
   (35)     0
   (24)    10     0
    1      11     9     0

The next merger produces the cluster (124) at level 9. At the final stage, the groups (35) and (124) are merged into the single cluster (12345) at level
d(124)(35) = max {d1(35), d(24)(35)} = max {11, 10} = 11
The dendrogram is given below.
Figure 3.5 Complete linkage dendrogram for distances between five objects (leaf order 1, 2, 4, 3, 5; mergers at distances 2, 5, 9 and 11)
Fig 3.6 Complete linkage dendrogram for distances between numbers in 11 languages (leaf order E, N, Da, G, Fr, I, Sp, P, Du, H, Fi)
Average linkage
• Average linkage treats the distance between two clusters as the average distance
between all pairs of items, where one member of a pair belongs to each cluster
• The first step is the same: we begin by searching the distance matrix D = {dik}
to find the nearest objects. These are merged to form the cluster (UV)
• For step 3, the distance between (UV) and the other cluster W is determined by
d(UV)W = (Σi Σk dik) / (N(UV) NW)
• where dik is the distance between object i in the cluster (UV) and object k in
the cluster W, and N(UV) and NW are the number of items in clusters (UV) and W,
respectively
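As a small numerical illustration of this formula (my own, reusing the 5-object distance matrix from the single linkage example), the average linkage distance between clusters {1, 2} and {3, 4, 5} is simply the mean of the six cross-cluster distances, as in Figure 3.1(c):

# Average linkage distance between clusters {1, 2} and {3, 4, 5}
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5,
            dimnames = list(1:5, 1:5))
mean(D[c("1", "2"), c("3", "4", "5")])   # (3 + 6 + 11 + 7 + 5 + 10)/6 = 7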
Fig 3.7 Average linkage dendrogram for distances between numbers in 11 languages (leaf order E, N, Da, G, Du, Fr, I, Sp, P, H, Fi)
A comparison of the dendrograms in Fig 3.7 and Fig 3.6 indicates that average
linkage yields a configuration very much like the complete linkage configuration.
However, because distance is defined differently in each case, it is not
surprising that mergers take place at different levels.
Ward’s Hierarchical Clustering Method
• Ward proposed a hierarchical clustering procedure based on minimizing the loss
of information from joining two groups. This method is usually implemented with
the loss of information taken to be an increase in an error sum of squares
criterion, ESS.
• First, for a given cluster k, let ESSk be the sum of squared deviations of
every item in the cluster from the cluster mean (centroid). If there are
currently K clusters, define ESS as the sum of the ESSk, that is,
ESS = ESS1 + ESS2 + … + ESSK
• At each step in the analysis, the union of every possible pair of
clusters is considered and the two clusters whose combination results
in the smallest increase in ESS (Minimum loss of information) are
joined.
• Initially, each cluster consists of a single item, so if there are N items,
ESSk = 0 for k = 1, 2, …, N, and hence ESS = 0
• At the other extreme, when all the items are combined in a single group of N
items, the value of ESS is given by
ESS = Σj (xj - x̄)′(xj - x̄),  j = 1, 2, …, N
where xj is the multivariate measurement associated with the jth item and x̄ is
the mean of all the items
The results of Ward's method can be displayed as a dendrogram.
The vertical axis gives the values of ESS at which the
mergers occur.
Ward’s method is based on the notion that the clusters
of multivariate observations are expected to be
roughly elliptically shaped.
It is a hierarchical precursor to nonhierarchical
clustering methods that optimize some criterion for
dividing data into a given number of elliptical groups.
Fig 3.8 Ward’s linkage method
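A minimal R sketch of Ward's method (an illustration using the built-in USArrests data, not the seminar's wine data): with ordinary Euclidean distances, hclust() expects method = "ward.D2", while method = "ward.D" assumes squared distances.

# Ward's minimum variance clustering - illustrative sketch
d_ward <- dist(scale(USArrests), method = "euclidean")   # standardize, then distances
hc_ward <- hclust(d_ward, method = "ward.D2")
plot(hc_ward, main = "Ward's minimum variance method")
# Merge heights increase monotonically; large jumps suggest natural cluster boundaries
groups <- cutree(hc_ward, k = 4)
table(groups)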
Further discussions –
Hierarchical clustering
Further discussions – Hierarchical clustering
• There are many agglomerative hierarchical clustering procedures
besides single linkage, complete linkage and average linkage.
However all of them follow the basic algorithm
• Sources of error and variation are not formally considered in hierarchical
procedures. This means that a clustering method will be sensitive to outliers,
or “noise”
• In Hierarchical clustering, there is no provision for a reallocation of
objects that may have been incorrectly grouped at an early stage.
Consequently, the final configuration of clusters should always be
carefully examined to see which are sensible
• For a particular problem, it is a good idea to try several clustering
methods and, within a given method, a couple of different ways of
assigning distances (similarities). If the outcomes from the several
methods are (roughly) consistent with one another, perhaps a case of
“natural” grouping can be advanced
• The stability of a hierarchical solution can be checked by applying the
clustering algorithm before and after small errors have been added to
the data units. If the groups are fairly well distinguished, the clustering
before and after perturbation should agree
• Common values (ties) in the similarity or distance matrix can produce
multiple solutions to a hierarchical clustering problem. That is, the
dendrograms corresponding to different treatments of the tied
similarities can be different, particularly at the lower levels. This is not
an inherent problem; sometimes multiple solutions occur for certain
kinds of data. The user needs to know of their existence so that
dendrograms can be properly interpreted.
The Inversion problem
In the following example, the clustering method joins A and B at distance 20. At the next
step, C is added to the group (AB) at distance 32. Next the clustering algorithm adds D to
the group (ABC) at a distance 30, a smaller distance than where C was added.
Inversions can occur when there is no clear cluster structure and are generally associated
with two hierarchical clustering algorithms known as centroid method and median method.
(In a dendrogram, the inversion shows up either as a crossover of branches or as a nonmonotonic height scale: here A and B join at 20, C is added at 32, and D at 30.)
Non-Hierarchical clustering
Nonhierarchical clustering methods
• It is a clustering technique that groups items, rather than variables, into a
collection of K clusters.
• K may either be specified in advance or determined as part of the clustering
procedure.
• Because a distance matrix does not have to be determined and the basic data do
not have to be stored, nonhierarchical methods can be applied to much larger
data sets than hierarchical techniques
The K-Means method
• MacQueen suggested the term K-means for describing an algorithm of
his that assigns each item to the cluster having the nearest centroid
(mean). In its simplest version, the process consists of three steps:
1. Partition the items into K initial clusters
2. Proceed through the list of items, assigning an item to the cluster
whose centroid (mean) is nearest. (Distance is usually computed
using Euclidean distance with either standardized or unstandardized
observations.) Recalculate the centroid for the cluster receiving the
new item and for the cluster losing the item.
3. Repeat step 2 until no more reassignments take place
Suppose we measure two variables X1 and X2 for each of four items A, B, C, and D. The data are given in the following table:

Item    X1    X2
A        5     3
B       -1     1
C        1    -2
D       -3    -2

The objective is to divide these items into K = 2 clusters such that the items within a cluster are closer to one another
than they are to the items in different clusters. To implement the K = 2-means method, we arbitrarily partition the items
into two clusters, such as (AB) and (CD), and compute the coordinates (x̄1, x̄2) of each cluster centroid (mean).
Thus, at Step 1, we have:

Cluster    Coordinates of centroid (x̄1, x̄2)
(AB)       ((5 + (-1))/2, (3 + 1)/2) = (2, 2)
(CD)       ((1 + (-3))/2, (-2 + (-2))/2) = (-1, -2)
At Step 2, we compute the Euclidean distance of each item from the group centroids and reassign each item to the nearest
group. If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding.
We compute the squared distances

d²(A, (AB)) = (5 - 2)² + (3 - 2)² = 10        d²(A, (CD)) = (5 + 1)² + (3 + 2)² = 61

Since A is closer to cluster (AB) than to cluster (CD), it is not reassigned. Continuing, we get

d²(B, (AB)) = (-1 - 2)² + (1 - 2)² = 10        d²(B, (CD)) = (-1 + 1)² + (1 + 2)² = 9

and consequently B is reassigned to cluster (CD), giving the cluster (BCD) and the following updated coordinates of the centroids:

Cluster    Coordinates of centroid (x̄1, x̄2)
A          (5, 3)
(BCD)      (-1, -1)

Again, each item is checked for reassignment. Computing the squared distances gives the following:

                      Item
Cluster        A      B      C      D
A              0     40     41     89
(BCD)         52      4      5      5
We see that each item is currently assigned to the cluster with the nearest
centroid (mean), and the process stops. The final K = 2 clusters are A and (BCD)
To check the stability of the clustering, it is desirable to rerun the algorithm with
a new initial partition.
A table of the cluster centroids (means) and within cluster variances also helps
to delineate group differences.
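A quick check of this worked example with R's kmeans() (my own sketch, not the textbook's computation): with several random starts it settles on the same partition, A versus (BCD), with centroids (5, 3) and (-1, -1).

# K-means check of the four-item example
x <- rbind(A = c(5, 3), B = c(-1, 1), C = c(1, -2), D = c(-3, -2))
colnames(x) <- c("X1", "X2")
set.seed(1)
km <- kmeans(x, centers = 2, nstart = 10)
km$cluster    # A in one cluster; B, C and D together in the other
km$centers    # (5, 3) and (-1, -1), matching the final centroid table above
km$withinss   # within-cluster sums of squares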
Further discussions – Non-hierarchical
clustering
• If two or more seed points inadvertently lie within a single cluster,
their resulting clusters will be poorly differentiated
• The existence of an outlier might produce at least one group with very
disperse items
• Even if the population is known to consist of K groups, the sampling
method may be such that data from the rarest group do not appear in
the sample. Forcing the data into K groups would lead to nonsensical
clusters
• It is always a good idea to rerun the algorithm for several choices of K
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
Data analysis using
statistical packages - SPSS
1. Input the variables
2. Input the data into SPSS
3. Go to Analyze -> Classify -> Hierarchical cluster
4. Add the necessary variables for classification (the name variable can be
placed here for labelling in the output)
5. Define statistics (the distance/dissimilarity matrix output is optional)
6. Define plots (this sets the orientation of the cases, not of the dendrogram)
7. Define the method (choose the distance criterion and the clustering criterion)
8. Continue -> OK
Wine Quality data
Name FA VA CA RS CL FSO2 TSO2 D PH SO4 A Q
AA01 7 0.27 0.36 20.7 0.045 45 170 1.001 3 0.45 8.8 6
AA02 6.3 0.3 0.34 1.6 0.049 14 132 0.994 3.3 0.49 9.5 6
AA03 8.1 0.28 0.4 6.9 0.05 30 97 0.9951 3.26 0.44 10.1 6
AA04 7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.4 9.9 6
AA05 7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.4 9.9 6
AA06 8.1 0.28 0.4 6.9 0.05 30 97 0.9951 3.26 0.44 10.1 6
AA07 6.2 0.32 0.16 7 0.045 30 136 0.9949 3.18 0.47 9.6 6
AA08 7 0.27 0.36 20.7 0.045 45 170 1.001 3 0.45 8.8 6
AA09 6.3 0.3 0.34 1.6 0.049 14 132 0.994 3.3 0.49 9.5 6
AA10 8.1 0.22 0.43 1.5 0.044 28 129 0.9938 3.22 0.45 11 6
AA11 8.1 0.27 0.41 1.45 0.033 11 63 0.9908 2.99 0.56 12 5
AA12 8.6 0.23 0.4 4.2 0.035 17 109 0.9947 3.14 0.53 9.7 5
AA13 7.9 0.18 0.37 1.2 0.04 16 75 0.992 3.18 0.63 10.8 5
AA14 6.6 0.16 0.4 1.5 0.044 48 143 0.9912 3.54 0.52 12.4 7
AA15 8.3 0.42 0.62 19.25 0.04 41 172 1.0002 2.98 0.67 9.7 5
AA16 6.6 0.17 0.38 1.5 0.032 28 112 0.9914 3.25 0.55 11.4 7
AA17 6.3 0.48 0.04 1.1 0.046 30 99 0.9928 3.24 0.36 9.6 6
AA18 6.2 0.66 0.48 1.2 0.029 29 75 0.9892 3.33 0.39 12.8 8
AA19 7.4 0.34 0.42 1.1 0.033 17 171 0.9917 3.12 0.53 11.3 6
AA20 6.5 0.31 0.14 7.5 0.044 34 133 0.9955 3.22 0.5 9.5 5
AA21 6.2 0.66 0.48 1.2 0.029 29 75 0.9892 3.33 0.39 12.8 8
AA22 6.4 0.31 0.38 2.9 0.038 19 102 0.9912 3.17 0.35 11 7
AA23 6.8 0.26 0.42 1.7 0.049 41 122 0.993 3.47 0.48 10.5 8
AA24 7.6 0.67 0.14 1.5 0.074 25 168 0.9937 3.05 0.51 9.3 5
AA25 6.6 0.27 0.41 1.3 0.052 16 142 0.9951 3.42 0.47 10 6
Data analysis using statistical
packages - RStudio
library(readxl)
# Attach wine data (choose the Excel file interactively)
winedata <- read_excel(file.choose())
winedata
# Keep the first 25 samples as a plain data frame
wine1 <- as.data.frame(winedata[1:25, ])
tail(wine1)
# Use the sample ID column for row labels, then drop it so that only
# numeric variables enter the distance calculation
# (adjust "Sample" to the ID column name in the sheet, e.g. "Name")
rownames(wine1) <- wine1$Sample
wine1$Sample <- NULL
# Finding distance matrix
distance_mat <- dist(wine1, method = 'euclidean')
distance_mat
any(is.na(distance_mat))   # check for missing distances
# Fitting Hierarchical clustering Model
# to training dataset
set.seed(240) # Setting seed (hclust itself is deterministic)
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl
# Plotting dendrogram
plot(Hierar_cl)
# Choosing no. of clusters
# Cutting tree by height
abline(h = 25, col = "green")
# Cutting tree by no. of clusters
fit <- cutree(Hierar_cl, k = 6)
fit
table(fit)
rect.hclust(Hierar_cl, k = 6, border = "green")
Hierarchical Clustering – R Script
Linkage method codes for clustering <hclust(d, method = '')>
• "ward.D": Ward’s minimum variance method
• "ward.D2": Ward’s minimum variance method (using the square of Euclidean distances)
• "single": Single linkage method (nearest neighbor)
• "complete": Complete linkage method (farthest neighbor)
• "average": UPGMA method (Unweighted Pair Group Method with Arithmetic Mean)
• "mcquitty": WPGMA method (Weighted Pair Group Method with Arithmetic Mean)
• "median": WPGMC method (Weighted Pair Group Method with Centroid Mean)
• "centroid": UPGMC method (Unweighted Pair Group Method with Centroid Mean)
Distance codes for the distance matrix <dist(data, method = '')>
• "euclidean": Euclidean distance
• "maximum": Maximum distance
• "manhattan": Manhattan distance
• "canberra": Canberra distance
• "binary": Binary distance
• "minkowski": Minkowski distance
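The R script above covers only the hierarchical analysis. A non-hierarchical companion sketch is given below (an illustration, assuming the same wine sheet has been read into winedata as above, that the first column holds the sample names, and an illustrative choice of K = 3).

# K-means (non-hierarchical) clustering of the same wine data - a sketch
wine_num <- as.data.frame(winedata[1:25, -1])        # drop the ID column, keep numeric variables
wine_std <- scale(wine_num)                          # standardize before computing distances
set.seed(240)
km_cl <- kmeans(wine_std, centers = 3, nstart = 25)  # K = 3 chosen only for illustration
km_cl$cluster                                        # cluster membership of each sample
km_cl$centers                                        # standardized cluster centroids
km_cl$tot.withinss                                   # total within-cluster sum of squares
table(km_cl$cluster)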
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
References
Books
• Rencher AC. 2012. Methods of Multivariate Analysis. 3rd Ed. John
Wiley
• Srivastava MS and Khatri CG. 1979. An Introduction to Multivariate
Statistics. North Holland
• Johnson RA and Wichern DW. 1988. Applied Multivariate Statistical
Analysis. Prentice Hall
Articles
• Dosoky, N.S., Satyal, P., Barata, L.M., da Silva, J.K.R. and Setzer,
W.N., 2019. Volatiles of black pepper fruits (Piper nigrum L.).
Molecules, 24(23), p.4244.
• Talekar, S.C., Praveena, M.V. and Satish, R.G., 2022. Genetic diversity
using principal component analysis and hierarchical cluster analysis
in rice. International Journal of Plant Sciences, 17(2), pp. 191-196.
• Siva, G.S., Rao, V.S. and Babu, D.R., 2014. Cluster Analysis Approach
to Study the Rainfall Pattern in Visakhapatnam District. Weekly
Science Research Journal, 1, p.31.
Articles
• Rao, N.S. and Chaudhary, M., 2019. Hydrogeochemical processes
regulating the spatial distribution of groundwater contamination,
using pollution index of groundwater (PIG) and hierarchical cluster
analysis (HCA): a case study. Groundwater for Sustainable
Development, 9, p.100238.
• Kumari, C.U., Prasad, S.J. and Mounika, G., 2019, March. Leaf
disease detection: feature extraction with K-means clustering and
classification with ANN. In 2019 3rd International Conference on
Computing Methodologies and Communication (ICCMC) (pp. 1095-
1098). IEEE.
Data source for wine data
• Analysis of Wine Quality Data | STAT 508 (psu.edu)
Codes
• Hierarchical Clustering in R Programming – GeeksforGeeks
• Hierarchical Clustering in R: Step-by-Step Example - Statology
More Related Content

PDF
12. Clustering.pdf for the students of aktu.
PDF
Mastering Hierarchical Clustering: A Comprehensive Guide
PDF
Hierarchical clustering for Petroleum.pdf
PPTX
Unsupervised Learning-Clustering Algorithms.pptx
PDF
Defining Homogenous Climate zones of Bangladesh using Cluster Analysis
PPT
My8clst
PPTX
Hierarchical methods navdeep kaur newww.pptx
DOCX
Perceuptal mapping, Factor analysis, cluster analysis, conjoint analysis
12. Clustering.pdf for the students of aktu.
Mastering Hierarchical Clustering: A Comprehensive Guide
Hierarchical clustering for Petroleum.pdf
Unsupervised Learning-Clustering Algorithms.pptx
Defining Homogenous Climate zones of Bangladesh using Cluster Analysis
My8clst
Hierarchical methods navdeep kaur newww.pptx
Perceuptal mapping, Factor analysis, cluster analysis, conjoint analysis

Similar to Hierarchical and Non Hierarchical Clustering.pptx (20)

PDF
Survey on traditional and evolutionary clustering
PDF
Survey on traditional and evolutionary clustering approaches
PPT
Cluster spss week7
PDF
Everitt, landou cluster analysis
PDF
Classification_and_Ordination_Methods_as_a_Tool.pdf
PDF
Data Science - Part VII - Cluster Analysis
PPTX
Cluster analysis
DOCX
Curse of Dimensionality in Paradoxical High Dimensional Clinical Datasets � A...
PDF
Ch 4 Cluster Analysis.pdf
PPTX
Cluster analysis
PDF
Cancer data partitioning with data structure and difficulty independent clust...
PDF
Image Mining for Flower Classification by Genetic Association Rule Mining Usi...
PDF
K means clustering in the cloud - a mahout test
PPTX
Cluster Analysis
PPTX
Cluster
PPTX
Clusters techniques
PDF
Clustering Algorithm Based On Correlation Preserving Indexing
PDF
Clustering Approach Recommendation System using Agglomerative Algorithm
PPTX
Hierarchical clustering_12.pptxffefeeefe
PPT
Cluster
Survey on traditional and evolutionary clustering
Survey on traditional and evolutionary clustering approaches
Cluster spss week7
Everitt, landou cluster analysis
Classification_and_Ordination_Methods_as_a_Tool.pdf
Data Science - Part VII - Cluster Analysis
Cluster analysis
Curse of Dimensionality in Paradoxical High Dimensional Clinical Datasets � A...
Ch 4 Cluster Analysis.pdf
Cluster analysis
Cancer data partitioning with data structure and difficulty independent clust...
Image Mining for Flower Classification by Genetic Association Rule Mining Usi...
K means clustering in the cloud - a mahout test
Cluster Analysis
Cluster
Clusters techniques
Clustering Algorithm Based On Correlation Preserving Indexing
Clustering Approach Recommendation System using Agglomerative Algorithm
Hierarchical clustering_12.pptxffefeeefe
Cluster
Ad

More from Ranjith C (9)

PPTX
Missing Observations and how to deal with them.pptx
PPTX
Application of Advanced Machine Learning Methods for Crop Image Classificatio...
PPTX
Statistical applications on agricultural field experimental trials.pptx
PPTX
Analyzing Crop Shift in Belgaum using Markov Chain Analysis..pptx
PPTX
Turmeric price forecasting using Time series approach.pptx
PPTX
Black Pepper Price forecasting using time series approach.pptx
PPTX
Time series modelling for price forecasting in plantation crops.pptx
PPTX
Crop Image Classification using Machine Learning and Deep Learning Techniques...
PPTX
Role of Hybrid Time Series Models (ARIMA-ANN) in Forecasting Scenario of Agri...
Missing Observations and how to deal with them.pptx
Application of Advanced Machine Learning Methods for Crop Image Classificatio...
Statistical applications on agricultural field experimental trials.pptx
Analyzing Crop Shift in Belgaum using Markov Chain Analysis..pptx
Turmeric price forecasting using Time series approach.pptx
Black Pepper Price forecasting using time series approach.pptx
Time series modelling for price forecasting in plantation crops.pptx
Crop Image Classification using Machine Learning and Deep Learning Techniques...
Role of Hybrid Time Series Models (ARIMA-ANN) in Forecasting Scenario of Agri...
Ad

Recently uploaded (20)

PDF
Tetra Pak Index 2023 - The future of health and nutrition - Full report.pdf
PDF
Data Engineering Interview Questions & Answers Data Modeling (3NF, Star, Vaul...
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
PDF
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
PPTX
A Complete Guide to Streamlining Business Processes
PPTX
Topic 5 Presentation 5 Lesson 5 Corporate Fin
PDF
Navigating the Thai Supplements Landscape.pdf
PDF
Votre score augmente si vous choisissez une catégorie et que vous rédigez une...
PDF
Microsoft Core Cloud Services powerpoint
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PDF
Global Data and Analytics Market Outlook Report
PPTX
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
PDF
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
PPTX
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
PPTX
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
PPTX
CYBER SECURITY the Next Warefare Tactics
PDF
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
PPT
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
PDF
Introduction to Data Science and Data Analysis
PPTX
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
Tetra Pak Index 2023 - The future of health and nutrition - Full report.pdf
Data Engineering Interview Questions & Answers Data Modeling (3NF, Star, Vaul...
retention in jsjsksksksnbsndjddjdnFPD.pptx
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
A Complete Guide to Streamlining Business Processes
Topic 5 Presentation 5 Lesson 5 Corporate Fin
Navigating the Thai Supplements Landscape.pdf
Votre score augmente si vous choisissez une catégorie et que vous rédigez une...
Microsoft Core Cloud Services powerpoint
STERILIZATION AND DISINFECTION-1.ppthhhbx
Global Data and Analytics Market Outlook Report
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
CYBER SECURITY the Next Warefare Tactics
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
Introduction to Data Science and Data Analysis
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx

Hierarchical and Non Hierarchical Clustering.pptx

  • 1. 1 HIERARCHICALAND NON- HIERARCHICAL CLUSTERING Course In-charge, Dr. Kiran Prakash Professor and Head Department of Statistics and Computer applications Presented by, Ranjith. C M. Sc. (Ag) Statistics BAM-2022-77 Agricultural college, Bapatla Acharya N. G. Ranga Agricultural University STAT 591 – Master’s Seminar (0+1)
  • 2. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 4. Definitions • Cluster • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis is a statistical technique used to group a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups. • Applications – in short • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms
  • 5. Other clustering methods • Model based clustering • The data points within each cluster follow a particular probability distribution • Density based clustering • Groups data points together based on their density within a defined radius or distance threshold • Grid based clustering • Grid-based methods quantize the object space into a finite number of cells that form a grid structure • Fuzzy clustering • assigns each data point a membership score for each cluster, rather than a binary membership value
  • 6. • Speciation of plants • Clustering the characteristics of certain plants to identify the species, or to decide there is sufficient evidence to decide it as a new species • Study of Natural disasters • To find the areas affected by earthquake, forest fire etc. and to take measures • City planning • To find the areas where more people are residing and build transportation facilities and roads • Planning survey • To create the optimum sample size so as to conduct effective surveys Applications
  • 7. Applications • Marketing • Classify the products based on customer preferences • Medical Diagnosis • Group patients with similar symptoms or medical histories, aiding in disease classification and personalized treatment plans. • Crime Pattern Analysis • Analyze crime data to identify clusters of similar criminal activities, assisting law enforcement in targeted interventions. • Image Segmentation • Analyze and categorize images into clusters, assisting in image recognition and computer vision applications.
  • 8. Hierarchical clustering • Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. It starts with individual data points and recursively merges or divides them to form a tree-like structure, known as a dendrogram. The dendrogram represents the relationships and similarities between different clusters and can be visually interpreted to understand the organization of the data. • Two types: • Agglomerative hierarchical method • Divisive hierarchical method
  • 9. Hierarchical clustering • Start with Individual Data Points Begin by considering each data point as a separate cluster. • Compute Pairwise Similarities Calculate the similarity or dissimilarity between each pair of clusters or data points. Common distance metrics include Euclidean distance, Manhattan distance, or correlation coefficients. • Merge Similar Clusters Identify the pair of clusters with the highest similarity and merge them into a single cluster. This creates a new cluster that replaces the two merged clusters.
  • 10. Hierarchical clustering • Update Similarity Matrix Recalculate the similarity or dissimilarity between the new cluster and the remaining clusters. • Repeat Steps 3-4 Repeat the process of merging the most similar clusters and updating the similarity matrix until all data points are in a single cluster or until a predetermined number of clusters is reached. • Dendrogram Construction Represent the clustering process using a dendrogram. The vertical lines in the dendrogram indicate the merging of clusters, and the height at which they merge reflects the dissimilarity at which the merging occurred.
  • 11. Nonhierarchical clustering • Non-hierarchical clustering, also known as partitioning clustering, is a method of cluster analysis that divides a dataset into a predetermined number of clusters. Unlike hierarchical clustering, which creates a tree-like structure of nested clusters, non-hierarchical clustering directly assigns data points to clusters without forming a hierarchy.
  • 12. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 13. Review of Literature • Rainfall pattern • Germplasm evaluation • Chemical classification • Groundwater contamination • Image detection
  • 14. Rainfall pattern • A multivariate approach based on hierarchical cluster analysis has been proposed to study the pattern of rainfall in different mandals of Visakhapatnam district of Andhra Pradesh • Rainfall patterns of 42 mandals based on 25 years of rainfall data of Visakhapatnam district from 1986-2010 was employed
  • 16. Category Rainfall (mm) High rainfall >1162 Medium rainfall 862 – 1162 Low rainfall <862
  • 17. Results • The mandals were categorized into 8 clusters based on mean rainfall • The application of these approaches identified medium rainfall (862 mm-1162 mm) was the most frequent representative pattern of rainfall in majority mandals of Visakhapatnam district
  • 18. Germplasm Evaluation • A set of 100 rice germplasm lines with four checks viz., BPT-5204, PSB-68, Siri1253 and MGD-101 were evaluated in augmented block design during Kharif 2020. • Test entries along with checks were sown at a spacing of 20×10 cm in augmented Block Design with four blocks, wherein each block comprised of 25 genotypes and four checks were repeated in each block • Data related to days to 50% flowering, panicle length, panicles per square metre, 1000-grain weight and grain yield was collected and analysed
  • 19. Cluster analysis Sl No Cluster No of Individuals Character 1 Cluster 1 5 Early maturity, high grain yield, long panicle length and medium 1000- grain weight 2 Cluster 2 27 Early maturing types with medium panicle length and low 1000-grain weight 3 Cluster 3 23 Very early flowering and medium 1000-grain weight. 4 Cluster 4 45 Early flowering, short panicle length and more panicles per square meter
  • 20. The average intra-cluster and inter-cluster Euclidean distances were estimated using ward’s minimum variance
  • 21. Results • It was discovered that none of the clusters included at least one genotype that had all of the desirable traits, ruling out the idea of selecting one genotype for immediate usage. Therefore to judiciously incorporate all of the desirable features, hybridization between selected genotypes from divergent clusters is required. • From cluster analysis maximum inter-cluster distance was observed between clusters 2 and cluster 3 followed by cluster 1 and cluster 2. So the genotypes selected from these clusters can be used for selecting genetically diverse parents.
  • 22. Chemical classification • Five Piper nigrum essential oils were analyzed by GC-MS (gas chromatography-mass spectrometry) were analyzed • 78 compounds were identified accounting for more than 99% of the compositions • Based on P. nigrum essential oil compositions, a hierarchical cluster analysis of the oils were done • Analysis done using agglomerative hierarchical cluster (AHC) analysis using XLSTAT Premium • Dissimilarity was determined using Euclidean distance, and clustering was defined using Ward’s method
  • 24. Results • The oils were dominated by monoterpene hydrocarbons. Black pepper oils from various geographical locations have shown qualitative similarities with differences in the concentrations of their major components. • β-Caryophyllene, limonene, β-pinene, α-pinene, δ-3-carene, sabinene, and myrcene were the main components of P. nigrum oil
  • 25. Groundwater contamination • Groundwater samples from 30 locations • Ionic balance error between the total concentrations of cations (Ca2+, Mg2+ , Na+ and K+ ) and the total concentrations of anions (HCO3 - , Cl- , SO4 2- and NO3 - ) expressed in milliequivalents per liter (meq/L) were observed for each groundwater sample
  • 26. Pollution index calculation • In first step, the relative weight (Rw) from 1 to 5 was assigned for each chemical parameter, depending upon its relative impact on human beings. Minimum weight (1) was given to K+ and maximum weight (5) to pH, TDS, SO4 2- and NO3 - • In second step, the weight parameter (Wp) was computed for each chemical parameter to assess its relative share on overall chemical groundwater quality • In third step, the status of concentration (Sc) was determined by dividing the concentration (C) of each chemical parameter of each groundwater sample by its respective drinking water quality standard limit (Ds)
  • 27. • In last step, pollution index of groundwater (PIG) was calculated by adding all values of Ow (ΣOw) STATISTICA version 6.1 was used. In HCA, a Complete Linkage is used to determine the distance between the clusters or groups.
  • 30. Results • Group I represents low mineralized groundwater quality, Group II shows moderately mineralized groundwater quality and Group III has highly mineralized groundwater quality, depending upon the availability sources. • Half of the samples are falling under moderately mineralized category, seven in low mineralized and eight in highly mineralized category.
  • 31. Image detection • The objective of this research is detection and classification of cotton and tomato leaf diseases • K-means clustering algorithm is used to separate the stained part and healthy leaf region • This proposed method of image processing is done in MATLAB 2016b software • L* represents the lightness, a* and b* represents the chromaticity layers. All of the color information is in the a* and b* layers • The derived features are Contrast, Correlation, Energy, Homogeneity, Mean, Standard Deviation and Variance
  • 32. K-means clustering algorithm LOAD Image Convert RGB color space into L*a*b* color space Clustering the variant colors Measure the distance by using Euclidean Distance Matrix Create a blank cell array to store clusters CLUSTERS
  • 33. NN Classification Leaf Disease Bacterial Leaf Spot Target Spot Septoria Leaf Spot Leaf Mold Accuracy Bacterial Leaf Spot 9 1 0 0 90% Target Spot 2 8 0 0 80% Septoria Leaf Spot 0 0 10 0 100% Leaf Mold 0 0 0 10 100% Average Accuracy 92.5%
  • 34. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 36. Outline 1. Methodology 2. Further discussions – Hierarchical clustering 3. Non-Hierarchical clustering 4. Further discussions – Non-Hierarchical clustering
  • 38. Agglomerative Hierarchical method • A series of successive mergers • There are many initial clusters as objects • The most similar groups are first grouped • Then these initial groups are merged according to their similarities • Eventually as the similarity decreases, all subgroups are fused into a single cluster
  • 39. Divisive Hierarchical methods • Work opposite to Agglomerative method • And initial single group of objects is divided into two subgroups such that the objects in one subgroup are “far from” the objects in the other • These are further divided into dissimilar subgroups • The process is continued until there are as many subgroups are objects – that is each object become a cluster • Both agglomerative and divisive methods can be displayed as the two- dimensional diagram called a Dendrogram
  • 40. Algorithm for Agglomerative clustering 1. Start with N clusters, each containing a single entity and an N x N symmetric matrix of distances (or similarities) D = {dik} 2. Search the distance matrix for the nearest (most similar) pair of clusters. Let distance between “most similar” clusters U and V be dUV 3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by (a) deleting the rows and columns corresponding to clusters U and V and (b) adding a row and column giving the distances between cluster (UV) and the remaining clusters. 4. Repeat steps 2 and 3 a total of N – 1 times. (All objects will be in single cluster after the algorithm terminates). Record the identity of clusters that are merged and the levels (distances or similarities) at which the mergers take place.
  • 41. Linkage methods • These methods are suitable for clustering items, as well as variables • Three main types are there; • Single linkage • Complete linkage • Average linkage
  • 42. Cluster distance 𝑑13+𝑑14+𝑑15+𝑑 23+𝑑24+𝑑25 6 𝑑24 𝑑15 Figure 3.1 Intercluster distance (dissimiliarity) for (a) Single linkage (b) Complete linkage and (c) Average linkage
  • 43. Single linkage • The inputs to a single linkage algorithm can be distances or similarities between pairs of objects. Groups are formed from nearest neighbors, where the term nearest neighbor connotes the smallest distance or largest similarity • Initially we must find the smallest distance in D = {dik} and merge the corresponding objects, say U and V, to get the cluster (UV). In the next step of general algorithm, the distance between UV and any other cluster W are computed by d(UV)W = min {dUW, dVW}
  • 44. Clustering using single linkage 0 9 0 3 7 0 6 5 9 0 11 10 2 8 0 D = {dik} = 1 2 3 4 5 1 2 3 4 5 min (dik) = d53 = 2 i,k 0 9 0 3 7 0 6 5 9 0 11 10 2 8 0 1 2 3 4 5 1 2 3 4 5
  • 45. Objects 5 and 3 are merged to form cluster (35). To implement next level of clustering we need the distance between the cluster (35) and remaining objects 1, 2, and 4. The distances are d(35)1 = min {d31, d51} = min {3, 11} = 3 d(35)2 = min {d32, d52} = min {7, 10} = 7 d(35)4 = min {d34, d54} = min {9, 8} = 8 Deleting the rows and columns of D corresponding to objects 3 and 5 we obtain a new distance matrix 0 3 0 7 9 0 8 6 5 0 (35) 1 2 4 (35) 1 2 4 The smallest distance between pairs of clusters is now d(35)1 = 3, and we merge cluster (1) with cluster (35) to get the next cluster, (135). Calculating, d(135)2 = min {d(35)2, d12} = min {7, 9} = 7 d(135)4 = min {d(35)4, d14} = min {8, 6} = 6 0 7 0 6 5 0 (135) 2 4 (135) 2 4 Minimum nearest neighbor distance between pairs of clusters is d(42) = 5, and we merge objects 4 and 2 to get the cluster (24). At this point, we have two distinct clusters, (135) and (24). Their nearest neighbor distance is; d(135)2,d(135)4 = ,min {d(135)2, d(135)4 = min {7, 6} = 6}
  • 46. The final distance matrix becomes, Consequently, clusters (135) and (24) are merged to form a single cluster of all five objects, (12345), when the nearest neighbor distance reaches 6. 0 6 0 (135) (24) (135) (24) 6 4 2 0 1 3 5 2 4 Figure 3.2 Single linkage dendrogram for distances between five objects
  • 47. 0 2 0 2 1 0 7 5 6 0 6 4 5 5 0 6 6 6 9 7 0 6 6 5 9 7 2 0 6 6 5 9 7 1 1 0 7 7 6 10 8 5 3 4 0 9 8 8 8 9 10 10 10 10 0 9 9 9 9 9 9 9 9 9 8 0 E N Da Du G Fr Sp I P H Fi E N Da Du G Fr Sp I P H Fi Consider the array of closeness between 10 languages. We first search minimum distance between pairs of languages (clusters). The minimum distance of 1 occurs between Danish – Norwegian Italian – French Italian – Spanish Numbering the languages in the order of appearance gives, d32 = 1, d86 = 1, d87 = 1 Since d76 = 2, we can merge only clusters 8 and 6 or clusters 8 and 7. We cannot merge clusters 6, 7, and 8 at level 1. We choose first to merge 6 and 8, and then to update the distance matrix and merge 2 and 3 to obtain the clusters (68) and (23).
  • 48. 6 4 2 0 8 10 E N Da Fr I Sp P Du G H Fi Fig 3.3 Single linkage dendrograms for distances between numbers in 11 languages
  • 49. Since single linkage joins clusters by shortest link between them, the technique cannot discern poorly separated clusters. On the other hand, single linkage is one of the few clustering methods that can delineate non-ellipsoidal clusters. The tendency of single linkage to pick out long string-like clusters is known as chaining. Fig 3.4 Single linkage clusters Single linkage confused by near overlap Chaining effect
  • 50. Complete linkage • Complete linkage clustering proceeds in much the same manner as single linkage clusters, with one important exception; at each stage, the distance (similarity) between clusters is determined by the distance (similarity) between the two elements, one from each cluster that are most distant. • Thus complete linkage ensures that all items in a cluster are within some maximum distance (or minimum similarity)
  • 51. • The general agglomerative algorithm again starts by finding the minimum entry inn D = {dik} and merging the corresponding objects, such as U and V, to get cluster (UV). For step 3 of the general algorithm in (12-12), the distances between (UV) and any other cluster W are computed by d(UV)W = max {dUW, dVW} 0 9 0 3 7 0 6 5 9 0 11 10 2 8 0 D = {dik} = 1 2 3 4 5 1 2 3 4 5 min (dik) = d53 = 2 i,k 0 9 0 3 7 0 6 5 9 0 11 10 2 8 0 1 2 3 4 5 1 2 3 4 5
  • 52. At the first stage, objects 3 and 5 are merged, since they are most similar. This gives the cluster (35). At stage 2, we compute, d(35)1 = max {d31, d51} = max {3, 11} = 11 d(35)2 = max {d32, d52} = max {7, 10} = 10 d(35)4 = max {d34, d54} = max {9, 8} = 9 And the modified distance matrix becomes The next merger occurs between the most similar groups, 2 and 4, to give the cluster (24). At stage 3, we have d(24)(35) = max{d2(35), d4(35)} = max {10, 9} = 10 d(24)1 = max {d21, d41} = 9 0 11 0 10 9 0 9 6 5 0 (35) 1 2 4 (35) 1 2 4
  • 53. (35) (24) 1 (35) (24) 1 0 10 0 11 9 0 The next merger produces the cluster (124). At the final stage, the groups (35) and (124) are merged as the single cluster (12345) at level d(124)(35) = max{d1(35),d(24)(35)} = max{11,10} = 11 The dendrogram is given below 6 4 2 0 1 2 4 3 5 8 10 12 Figure 3.5 Complete linkage dendrogram for distances between five objects
  • 54. 6 4 2 0 8 10 E N Da G Fr I Sp P Du H Fi Fig 3.6 Complete linkage dendrograms for distances between numbers in 11 languages
  • 55. Average linkage • Average distance between all pairs of items where one member or a pair belongs to each cluster • The first step is same, we begin by searching the distance matrix D = {dik} to find the nearest objects. These are merged to form the cluster (UV) • For step three, the distances between (UV) and the other cluster W are determined by • Where dik is the distance between object i in the cluster (UV) and object k in the cluster W, N(UV) and NW are the number of items in cluster (UV) and W respectively
  • 56. 6 4 2 0 8 10 E N Da G Du Fr I Sp P H Fi Fig 3.7 Average linkage dendrograms for distances between numbers in 11 languages
  • 57. A comparison of the dendrogram in Fig 3.7 and Fig 3.6 indicates that the average linkage yields to configuration very much like the complete linkage configuration. However because distance is defined differently in each case, it is not surprising that mergers take place at different levels
• 58. Ward’s Hierarchical Clustering Method • Ward considered a hierarchical clustering procedure based on minimizing the loss of information from joining two groups. The method is usually implemented with the loss of information taken to be an increase in an error sum of squares criterion, ESS. • First, for a given cluster k, let ESSk be the sum of squared deviations of every item in the cluster from the cluster mean (centroid). If there are K clusters, define ESS as the sum of the ESSk, or ESS = ESS1 + ESS2 + … + ESSK
• 59. • At each step in the analysis, the union of every possible pair of clusters is considered, and the two clusters whose combination results in the smallest increase in ESS (minimum loss of information) are joined. • Initially, each cluster consists of a single item, so if there are N items, ESSk = 0 for k = 1, 2, …, N, and hence ESS = 0. • At the other extreme, when all the items are combined in a single group of N items, the value of ESS is given by ESS = Σj (xj − x̄)′(xj − x̄), where xj is the multivariate measurement associated with the jth item and x̄ is the mean of all the items.
• 60. The results of Ward’s method can be displayed as a dendrogram. The vertical axis gives the values of ESS at which the mergers occur. Ward’s method is based on the notion that the clusters of multivariate observations are expected to be roughly elliptically shaped. It is a hierarchical precursor to nonhierarchical clustering methods that optimize some criterion for dividing data into a given number of elliptical groups. Fig 3.8 Ward’s linkage method
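A minimal R sketch of Ward's method and the ESS criterion, using a small made-up data set (the data, the ess helper function and the choice of k = 3 are assumptions made here for illustration only):
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)                  # 10 items measured on 2 variables
ward_cl <- hclust(dist(X), method = "ward.D2")    # Ward's minimum-variance method
plot(ward_cl)
# ESS of one cluster: sum of squared deviations of its items from the cluster centroid
ess <- function(x) sum(scale(x, scale = FALSE)^2)
groups <- cutree(ward_cl, k = 3)
sum(tapply(seq_len(nrow(X)), groups,
           function(idx) ess(X[idx, , drop = FALSE])))   # total ESS for the 3-cluster solution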
• 62. Further discussions – Hierarchical clustering • There are many agglomerative hierarchical clustering procedures besides single linkage, complete linkage and average linkage; however, all of them follow the same basic algorithm. • In most hierarchical procedures, sources of error and variation are not formally considered. This means that a clustering method will be sensitive to outliers, or "noise". • In hierarchical clustering there is no provision for reallocating objects that may have been incorrectly grouped at an early stage. Consequently, the final configuration of clusters should always be carefully examined to see whether it is sensible.
• 63. • For a particular problem, it is a good idea to try several clustering methods and, within a given method, a couple of different ways of assigning distances (similarities). If the outcomes from the several methods are (roughly) consistent with one another, perhaps a case for a "natural" grouping can be advanced. • The stability of a hierarchical solution can be checked by applying the clustering algorithm before and after small errors (perturbations) have been added to the data units, as in the sketch below. If the groups are fairly well distinguished, the clusterings before and after perturbation should agree.
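A sketch of this perturbation check in R (the data set, the noise level sd = 0.05 and the choice of k = 3 are assumptions made here, not part of the original slides):
set.seed(2)
X <- matrix(rnorm(40), ncol = 2)                                   # 20 illustrative items
cl_before <- cutree(hclust(dist(X), method = "average"), k = 3)
X_noisy <- X + matrix(rnorm(length(X), sd = 0.05), ncol = 2)       # small added errors
cl_after <- cutree(hclust(dist(X_noisy), method = "average"), k = 3)
table(cl_before, cl_after)    # well-separated groups should map almost one-to-one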
• 64. • Common values (ties) in the similarity or distance matrix can produce multiple solutions to a hierarchical clustering problem; that is, the dendrograms corresponding to different treatments of the tied similarities can be different, particularly at the lower levels. This is not an inherent problem of any particular method; multiple solutions simply occur for certain kinds of data. The user needs to be aware of their existence so that the dendrograms can be properly interpreted.
• 65. The Inversion problem
In the following example, the clustering method joins A and B at distance 20. At the next step, C is added to the group (AB) at distance 32. Next, the clustering algorithm adds D to the group (ABC) at distance 30, a smaller distance than that at which C was added. Inversions can occur when there is no clear cluster structure and are generally associated with two hierarchical clustering procedures known as the centroid method and the median method. The inversion is indicated either by a dendrogram with a crossover or by a dendrogram with a nonmonotonic scale.
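An inversion of this kind can be produced in R with the centroid method. The three coordinates below are made up so that the third point lies closer to the centroid of the first two than those two are to each other:
X <- rbind(A = c(0, 0), B = c(1, 0), C = c(0.5, 0.9))   # nearly equilateral triangle
cen_cl <- hclust(dist(X)^2, method = "centroid")        # centroid linkage on squared distances
cen_cl$height     # the second merger occurs at a lower level than the first: an inversion
plot(cen_cl)      # the dendrogram shows the crossover / nonmonotonic scale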
• 67. Nonhierarchical clustering methods • Nonhierarchical clustering techniques group items, rather than variables, into a collection of K clusters. • K may either be specified in advance or determined as part of the clustering procedure. • Because a distance matrix does not have to be determined and the basic data do not have to be stored, nonhierarchical methods can be applied to much larger data sets than hierarchical techniques.
• 68. The K-means method • MacQueen suggested the term K-means to describe an algorithm of his that assigns each item to the cluster having the nearest centroid (mean). In its simplest version, the process consists of three steps: 1. Partition the items into K initial clusters. 2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. (Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.) Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item. 3. Repeat step 2 until no more reassignments take place.
• 69. Suppose we measure two variables, X1 and X2, for each of four items A, B, C and D. The data are given in the following table:
Item   X1   X2
A       5    3
B      -1    1
C       1   -2
D      -3   -2
The objective is to divide these items into K = 2 clusters such that the items within a cluster are closer to one another than they are to the items in different clusters. To implement the K = 2-means method, we arbitrarily partition the items into two clusters, such as (AB) and (CD), and compute the coordinates (x̄1, x̄2) of the cluster centroids (means). Thus, at Step 1, we have:
Cluster   Coordinates of centroid (x̄1, x̄2)
(AB)      (2, 2)
(CD)      (-1, -2)
• 70. At Step 2, we compute the Euclidean distance of each item from the group centroids and reassign each item to the nearest group. If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding. For item A we compute the squared distances
d²(A, (AB)) = (5 - 2)² + (3 - 2)² = 10
d²(A, (CD)) = (5 + 1)² + (3 + 2)² = 61
Since A is closer to cluster (AB) than to cluster (CD), it is not reassigned. Continuing, for item B we get
d²(B, (AB)) = (-1 - 2)² + (1 - 2)² = 10
d²(B, (CD)) = (-1 + 1)² + (1 + 2)² = 9
and consequently B is reassigned to cluster (CD), giving cluster (BCD) and the following updated coordinates of the centroids:
Cluster   Coordinates of centroid (x̄1, x̄2)
A         (5, 3)
(BCD)     (-1, -1)
Again each item is checked for reassignment. Computing the squared distances gives the following:
                 Item
Cluster       A     B     C     D
A             0    40    41    89
(BCD)        52     4     5     5
• 71. We see that each item is currently assigned to the cluster with the nearest centroid (mean), and the process stops. The final K = 2 clusters are A and (BCD). To check the stability of the clustering, it is desirable to rerun the algorithm with a new initial partition. A table of the cluster centroids (means) and within-cluster variances also helps to delineate group differences.
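The same four-item calculation can be reproduced with R's kmeans function. In this sketch the initial centers are supplied explicitly so that the run mirrors the arbitrary starting partition (AB), (CD); the object names are assumed here for illustration.
X <- rbind(A = c(5, 3), B = c(-1, 1), C = c(1, -2), D = c(-3, -2))
init <- rbind(c(2, 2), c(-1, -2))      # centroids of the starting clusters (AB) and (CD)
km <- kmeans(X, centers = init)
km$cluster     # A ends up alone; B, C and D form the other cluster
km$centers     # final centroids (5, 3) and (-1, -1), as in the hand calculation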
• 72. Further discussions – Non-hierarchical clustering • If two or more seed points inadvertently lie within a single cluster, the resulting clusters will be poorly differentiated. • The existence of an outlier might produce at least one group with very disperse items. • Even if the population is known to consist of K groups, the sampling method may be such that data from the rarest group do not appear in the sample; forcing the data into K groups would then lead to nonsensical clusters. • It is always a good idea to rerun the algorithm for several choices of K and of the initial partition, as in the sketch below.
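A sketch of such reruns in R, using a made-up data set: several values of K are tried, each with 25 random starts, and the total within-cluster sum of squares is compared (the data, the range of K and nstart = 25 are assumptions made here for illustration).
set.seed(3)
X <- matrix(rnorm(100), ncol = 2)                 # 50 illustrative items
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
wss                                               # total within-cluster SS for K = 1, ..., 6
plot(1:6, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")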
  • 73. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 75. 1. Input the variables
  • 76. 2. Input the data into SPSS
  • 77. 3. Go to Analyze -> Classify -> Hierarchical cluster
• 78. 4. Add the necessary variables for classification. The Name variable can be placed here (as a labelling variable) so that the cases are labelled in the output.
• 80. 5. Define plots (the orientation setting applies to the icicle plot of the cases, not to the dendrogram).
• 81. 5. Define the method: choose the clustering criterion and the distance criterion. 6. Continue -> Ok
• 82. Wine Quality data (FA = fixed acidity, VA = volatile acidity, CA = citric acid, RS = residual sugar, CL = chlorides, FSO2 = free sulfur dioxide, TSO2 = total sulfur dioxide, D = density, PH = pH, SO4 = sulphates, A = alcohol, Q = quality)
Name  FA   VA    CA    RS     CL     FSO2  TSO2  D       PH    SO4   A     Q
AA01  7    0.27  0.36  20.7   0.045  45    170   1.001   3     0.45  8.8   6
AA02  6.3  0.3   0.34  1.6    0.049  14    132   0.994   3.3   0.49  9.5   6
AA03  8.1  0.28  0.4   6.9    0.05   30    97    0.9951  3.26  0.44  10.1  6
AA04  7.2  0.23  0.32  8.5    0.058  47    186   0.9956  3.19  0.4   9.9   6
AA05  7.2  0.23  0.32  8.5    0.058  47    186   0.9956  3.19  0.4   9.9   6
AA06  8.1  0.28  0.4   6.9    0.05   30    97    0.9951  3.26  0.44  10.1  6
AA07  6.2  0.32  0.16  7      0.045  30    136   0.9949  3.18  0.47  9.6   6
AA08  7    0.27  0.36  20.7   0.045  45    170   1.001   3     0.45  8.8   6
AA09  6.3  0.3   0.34  1.6    0.049  14    132   0.994   3.3   0.49  9.5   6
AA10  8.1  0.22  0.43  1.5    0.044  28    129   0.9938  3.22  0.45  11    6
AA11  8.1  0.27  0.41  1.45   0.033  11    63    0.9908  2.99  0.56  12    5
AA12  8.6  0.23  0.4   4.2    0.035  17    109   0.9947  3.14  0.53  9.7   5
AA13  7.9  0.18  0.37  1.2    0.04   16    75    0.992   3.18  0.63  10.8  5
AA14  6.6  0.16  0.4   1.5    0.044  48    143   0.9912  3.54  0.52  12.4  7
AA15  8.3  0.42  0.62  19.25  0.04   41    172   1.0002  2.98  0.67  9.7   5
AA16  6.6  0.17  0.38  1.5    0.032  28    112   0.9914  3.25  0.55  11.4  7
AA17  6.3  0.48  0.04  1.1    0.046  30    99    0.9928  3.24  0.36  9.6   6
AA18  6.2  0.66  0.48  1.2    0.029  29    75    0.9892  3.33  0.39  12.8  8
AA19  7.4  0.34  0.42  1.1    0.033  17    171   0.9917  3.12  0.53  11.3  6
AA20  6.5  0.31  0.14  7.5    0.044  34    133   0.9955  3.22  0.5   9.5   5
AA21  6.2  0.66  0.48  1.2    0.029  29    75    0.9892  3.33  0.39  12.8  8
AA22  6.4  0.31  0.38  2.9    0.038  19    102   0.9912  3.17  0.35  11    7
AA23  6.8  0.26  0.42  1.7    0.049  41    122   0.993   3.47  0.48  10.5  8
AA24  7.6  0.67  0.14  1.5    0.074  25    168   0.9937  3.05  0.51  9.3   5
AA25  6.6  0.27  0.41  1.3    0.052  16    142   0.9951  3.42  0.47  10    6
  • 85. Data analysis using statistical packages - RStudio
• 86. Hierarchical Clustering – R Script
library(readxl)
# attach the wine data
winedata <- read_excel(file.choose())
winedata
wine1 <- as.data.frame(winedata[1:25, ])   # first 25 samples, as a data frame so row names can be set
tail(wine1)
rownames(wine1) <- wine1$Name              # sample labels (column shown as "Name" on slide 82; adjust if the sheet uses another column name)
wine1 <- wine1[, -1]                       # drop the label column: dist() needs numeric variables only
# Finding the distance matrix
distance_mat <- dist(wine1, method = "euclidean")
distance_mat
any(is.na(distance_mat))                   # check for missing distances
# Fitting the hierarchical clustering model
set.seed(240)                              # setting seed (hclust itself is deterministic)
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl
# Plotting the dendrogram
plot(Hierar_cl)
# Choosing the number of clusters
abline(h = 25, col = "green")              # cutting the tree by height
fit <- cutree(Hierar_cl, k = 6)            # cutting the tree into k = 6 clusters
fit
table(fit)
rect.hclust(Hierar_cl, k = 6, border = "green")
• 87. Method codes for clustering <hclust(data, method = "")>
• "ward.D": Ward's minimum variance method
• "ward.D2": Ward's minimum variance method (using squared Euclidean distances)
• "single": single linkage method (nearest neighbour)
• "complete": complete linkage method (farthest neighbour)
• "average": UPGMA method (Unweighted Pair Group Method with Arithmetic Mean)
• "mcquitty": WPGMA method (Weighted Pair Group Method with Arithmetic Mean)
• "median": WPGMC method (Weighted Pair Group Method with Centroid)
• "centroid": UPGMC method (Unweighted Pair Group Method with Centroid)
Distance codes for the distance matrix <dist(data, method = "")>
• "euclidean": Euclidean distance
• "maximum": maximum distance
• "manhattan": Manhattan distance
• "canberra": Canberra distance
• "binary": binary distance
• "minkowski": Minkowski distance
  • 89. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 91. Books • Rencher AC. 2012. Methods of Multivariate Analysis. 3rd Ed. John Wiley • Srivastava MS and Khatri CG. 1979. An Introduction to Multivariate Statistics. North Holland • Johnson RA and Wichern DW. 1988. Applied Multivariate Statistical Analysis. Prentice Hall
• 92. Articles • Dosoky, N.S., Satyal, P., Barata, L.M., da Silva, J.K.R. and Setzer, W.N., 2019. Volatiles of black pepper fruits (Piper nigrum L.). Molecules, 24(23), p.4244. • Talekar, S.C., Praveena, M.V. and Satish, R.G., 2022. Genetic diversity using principal component analysis and hierarchical cluster analysis in rice. International Journal of Plant Sciences, 17(2), pp.191-196. • Siva, G.S., Rao, V.S. and Babu, D.R., 2014. Cluster Analysis Approach to Study the Rainfall Pattern in Visakhapatnam District. Weekly Science Research Journal, 1, p.31.
• 93. Articles • Rao, N.S. and Chaudhary, M., 2019. Hydrogeochemical processes regulating the spatial distribution of groundwater contamination, using pollution index of groundwater (PIG) and hierarchical cluster analysis (HCA): a case study. Groundwater for Sustainable Development, 9, p.100238. • Kumari, C.U., Prasad, S.J. and Mounika, G., 2019, March. Leaf disease detection: feature extraction with K-means clustering and classification with ANN. In 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC) (pp. 1095-1098). IEEE.
  • 94. Data source for wine data • Analysis of Wine Quality Data | STAT 508 (psu.edu)
  • 95. Codes • Hierarchical Clustering in R Programming – GeeksforGeeks • Hierarchical Clustering in R: Step-by-Step Example - Statology