HIERARCHICAL AND NON-HIERARCHICAL CLUSTERING
Course In-charge,
Dr. Kiran Prakash
Professor and Head
Department of Statistics and
Computer applications
Presented by,
Ranjith. C
M. Sc. (Ag) Statistics
BAM-2022-77
Agricultural college, Bapatla
Acharya N. G. Ranga Agricultural University
STAT 591 – Master’s Seminar (0+1)
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
Introduction
Definitions
• Cluster: a collection of objects that are
• similar to one another within the same cluster
• dissimilar to the objects in other clusters
• Cluster analysis is a statistical technique used to group a set of
objects in such a way that objects in the same group (cluster) are more
similar to each other than to those in other groups.
• Applications – in short
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
Other clustering methods
• Model based clustering
• The data points within each cluster follow a particular probability distribution
• Density based clustering
• Groups data points together based on their density within a defined radius or
distance threshold
• Grid based clustering
• Grid-based methods quantize the object space into a finite number of cells that
form a grid structure
• Fuzzy clustering
• assigns each data point a membership score for each cluster, rather than a binary
membership value
• Speciation of plants
• Clustering the characteristics of certain plants to identify the species, or to
decide whether there is sufficient evidence to declare it a new species
• Study of Natural disasters
• To find the areas affected by earthquake, forest fire etc. and to take measures
• City planning
• To find the areas where more people are residing and build transportation
facilities and roads
• Survey planning
• To determine the optimum sample size so as to conduct effective surveys
Applications
Applications
• Marketing
• Classify the products based on customer preferences
• Medical Diagnosis
• Group patients with similar symptoms or medical histories, aiding in disease
classification and personalized treatment plans.
• Crime Pattern Analysis
• Analyze crime data to identify clusters of similar criminal activities, assisting law
enforcement in targeted interventions.
• Image Segmentation
• Analyze and categorize images into clusters, assisting in image recognition and
computer vision applications.
Hierarchical clustering
• Hierarchical clustering is a method of cluster analysis that builds a
hierarchy of clusters. It starts with individual data points and
recursively merges or divides them to form a tree-like structure,
known as a dendrogram. The dendrogram represents the relationships
and similarities between different clusters and can be visually
interpreted to understand the organization of the data.
• Two types:
• Agglomerative hierarchical method
• Divisive hierarchical method
Hierarchical clustering
• Start with Individual Data Points
Begin by considering each data point as a separate cluster.
• Compute Pairwise Similarities
Calculate the similarity or dissimilarity between each pair of clusters or data
points. Common distance metrics include Euclidean distance, Manhattan
distance, or correlation coefficients.
• Merge Similar Clusters
Identify the pair of clusters with the highest similarity and merge them into
a single cluster. This creates a new cluster that replaces the two merged
clusters.
Hierarchical clustering
• Update Similarity Matrix
Recalculate the similarity or dissimilarity between the new cluster and the
remaining clusters.
• Repeat Steps 3-4
Repeat the process of merging the most similar clusters and updating the similarity
matrix until all data points are in a single cluster or until a predetermined number
of clusters is reached.
• Dendrogram Construction
Represent the clustering process using a dendrogram. The vertical lines in the
dendrogram indicate the merging of clusters, and the height at which they merge
reflects the dissimilarity at which the merging occurred.
Nonhierarchical clustering
• Non-hierarchical clustering, also known as partitioning clustering, is a
method of cluster analysis that divides a dataset into a predetermined
number of clusters. Unlike hierarchical clustering, which creates a
tree-like structure of nested clusters, non-hierarchical clustering
directly assigns data points to clusters without forming a hierarchy.
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
Review of Literature
• Rainfall pattern
• Germplasm evaluation
• Chemical classification
• Groundwater contamination
• Image detection
Rainfall pattern
• A multivariate approach based on hierarchical cluster analysis has
been proposed to study the pattern of rainfall in different mandals of
Visakhapatnam district of Andhra Pradesh
• Rainfall patterns of 42 mandals, based on 25 years of rainfall data (1986-2010) of
Visakhapatnam district, were analysed
Category Rainfall (mm)
High rainfall >1162
Medium rainfall 862 – 1162
Low rainfall <862
Results
• The mandals were categorized into 8 clusters based on mean rainfall
• The application of these approaches identified that medium rainfall (862 mm - 1162 mm)
was the most frequent representative pattern of rainfall in the majority of mandals of
Visakhapatnam district
Germplasm Evaluation
• A set of 100 rice germplasm lines with four checks viz., BPT-5204,
PSB-68, Siri1253 and MGD-101 were evaluated in augmented block
design during Kharif 2020.
• Test entries along with checks were sown at a spacing of 20 × 10 cm in an
augmented block design with four blocks, wherein each block comprised 25
genotypes and the four checks were repeated in each block
• Data related to days to 50% flowering, panicle length, panicles per
square metre, 1000-grain weight and grain yield was collected and
analysed
Cluster analysis
Sl No  Cluster    No. of individuals  Character
1      Cluster 1   5                  Early maturity, high grain yield, long panicle length and medium 1000-grain weight
2      Cluster 2  27                  Early maturing types with medium panicle length and low 1000-grain weight
3      Cluster 3  23                  Very early flowering and medium 1000-grain weight
4      Cluster 4  45                  Early flowering, short panicle length and more panicles per square metre
The average intra-cluster and inter-cluster Euclidean distances were estimated
using Ward's minimum variance method
Results
• It was discovered that none of the clusters included a genotype that had all of
the desirable traits, ruling out the idea of selecting one genotype for immediate
use. Therefore, to judiciously incorporate all of the desirable features,
hybridization between selected genotypes from divergent clusters is required.
• From the cluster analysis, the maximum inter-cluster distance was observed
between cluster 2 and cluster 3, followed by cluster 1 and cluster 2. So the
genotypes selected from these clusters can be used as genetically diverse parents.
Chemical classification
• Five Piper nigrum essential oils were analyzed by GC-MS (gas
chromatography-mass spectrometry)
• 78 compounds were identified, accounting for more than 99% of the compositions
• Based on the P. nigrum essential oil compositions, a hierarchical cluster
analysis of the oils was carried out
• The analysis was done using agglomerative hierarchical cluster (AHC) analysis
in XLSTAT Premium
• Dissimilarity was determined using Euclidean distance, and clustering was
defined using Ward's method
Results
• The oils were dominated by monoterpene hydrocarbons. Black pepper
oils from various geographical locations have shown qualitative
similarities with differences in the concentrations of their major
components.
• β-Caryophyllene, limonene, β-pinene, α-pinene, δ-3-carene, sabinene,
and myrcene were the main components of P. nigrum oil
Groundwater contamination
• Groundwater samples from 30 locations
• The ionic balance error between the total concentration of cations (Ca2+, Mg2+,
Na+ and K+) and the total concentration of anions (HCO3-, Cl-, SO42- and NO3-),
expressed in milliequivalents per liter (meq/L), was observed for each
groundwater sample
Pollution index calculation
• In the first step, a relative weight (Rw) from 1 to 5 was assigned to each
chemical parameter, depending upon its relative impact on human beings. The
minimum weight (1) was given to K+ and the maximum weight (5) to pH, TDS,
SO42- and NO3-
• In the second step, the weight parameter (Wp) was computed for each chemical
parameter to assess its relative share of the overall chemical groundwater quality
• In the third step, the status of concentration (Sc) was determined by dividing
the concentration (C) of each chemical parameter of each groundwater sample by
its respective drinking water quality standard limit (Ds)
• In the last step, the pollution index of groundwater (PIG) was calculated by
adding all values of Ow (ΣOw)
STATISTICA version 6.1 was used. In HCA, complete linkage was used to
determine the distance between the clusters or groups.
Results
• Group I represents low mineralized groundwater quality, Group II shows
moderately mineralized groundwater quality and Group III has highly mineralized
groundwater quality, depending upon the availability of sources
• Half of the samples fall under the moderately mineralized category, seven under
the low mineralized and eight under the highly mineralized category
Image detection
• The objective of this research was the detection and classification of cotton
and tomato leaf diseases
• The K-means clustering algorithm was used to separate the stained (diseased)
part from the healthy leaf region
• The proposed image processing method was implemented in MATLAB 2016b
• L* represents the lightness, a* and b* represents the chromaticity
layers. All of the color information is in the a* and b* layers
• The derived features are Contrast, Correlation, Energy, Homogeneity,
Mean, Standard Deviation and Variance
K-means clustering algorithm (processing pipeline)
1. Load the image
2. Convert the RGB color space into the L*a*b* color space
3. Cluster the variant colors
4. Measure distances using the Euclidean distance matrix
5. Create a blank cell array to store the clusters
Output: clusters
NN Classification

Leaf Disease          Bacterial Leaf Spot   Target Spot   Septoria Leaf Spot   Leaf Mold   Accuracy
Bacterial Leaf Spot           9                  1                 0               0         90%
Target Spot                   2                  8                 0               0         80%
Septoria Leaf Spot            0                  0                10               0        100%
Leaf Mold                     0                  0                 0              10        100%
Average Accuracy                                                                            92.5%
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
Methodology
Outline
1. Methodology
2. Further discussions – Hierarchical clustering
3. Non-Hierarchical clustering
4. Further discussions – Non-Hierarchical clustering
Methodology
Agglomerative Hierarchical method
• A series of successive mergers
• There are as many initial clusters as objects
• The most similar objects are grouped first
• Then these initial groups are merged according to their similarities
• Eventually, as the similarity decreases, all subgroups are fused into a single cluster
Divisive Hierarchical methods
• Works in the opposite direction to the agglomerative method
• An initial single group of objects is divided into two subgroups such that the
objects in one subgroup are “far from” the objects in the other
• These are further divided into dissimilar subgroups
• The process is continued until there are as many subgroups as objects – that is,
until each object becomes a cluster
• Both agglomerative and divisive methods can be displayed as a two-dimensional
diagram called a dendrogram
Algorithm for Agglomerative clustering
1. Start with N clusters, each containing a single entity and an N x N symmetric
matrix of distances (or similarities) D = {dik}
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let
distance between “most similar” clusters U and V be dUV
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the
entries in the distance matrix by (a) deleting the rows and columns
corresponding to clusters U and V and (b) adding a row and column giving the
distances between cluster (UV) and the remaining clusters.
4. Repeat steps 2 and 3 a total of N – 1 times. (All objects will be in single cluster
after the algorithm terminates). Record the identity of clusters that are merged
and the levels (distances or similarities) at which the mergers take place.
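A minimal R sketch of steps 1-4 (my own illustration, not from the textbook): it repeatedly finds the closest pair of clusters, records the merger and its level, and updates the cluster list. The only assumptions are the pluggable linkage argument (min for single linkage, max for complete linkage, mean for average linkage) and the 5 x 5 example distance matrix used in the worked examples that follow.

# Generic agglomerative clustering - a sketch of steps 1-4 above
# D: a symmetric distance matrix with object labels as dimnames
# linkage: between-cluster distance rule (min, max, or mean)
agglomerate <- function(D, linkage = min) {
  clusters <- as.list(rownames(D))                 # step 1: N singleton clusters
  history <- data.frame(merged = character(0), level = numeric(0))
  while (length(clusters) > 1) {
    k <- length(clusters)
    best <- c(1, 2); best_d <- Inf
    for (i in 1:(k - 1)) {                         # step 2: search for the nearest pair
      for (j in (i + 1):k) {
        d_ij <- linkage(D[clusters[[i]], clusters[[j]]])
        if (d_ij < best_d) { best_d <- d_ij; best <- c(i, j) }
      }
    }
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    history <- rbind(history,                      # step 4: record merger and level
                     data.frame(merged = paste(merged, collapse = ""),
                                level = best_d))
    clusters <- c(clusters[-best], list(merged))   # step 3: update the cluster list
  }
  history
}

# The 5-object distance matrix used in the worked examples below
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5,
            dimnames = list(1:5, 1:5))
agglomerate(D, linkage = min)   # single linkage: mergers at levels 2, 3, 5, 6
agglomerate(D, linkage = max)   # complete linkage: mergers at levels 2, 5, 9, 11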
Linkage methods
• These methods are suitable for clustering items, as well as variables
• Three main types are there;
• Single linkage
• Complete linkage
• Average linkage
Cluster distance
Figure 3.1 Intercluster distance (dissimilarity) for (a) single linkage (the smallest distance between an object in one cluster and an object in the other), (b) complete linkage (the largest such distance) and (c) average linkage (the mean of all between-cluster pairwise distances, e.g. (d13 + d14 + d15 + d23 + d24 + d25)/6 for clusters {1, 2} and {3, 4, 5})
Single linkage
• The inputs to a single linkage algorithm can be distances or
similarities between pairs of objects. Groups are formed from nearest
neighbors, where the term nearest neighbor connotes the smallest
distance or largest similarity
• Initially we must find the smallest distance in D = {dik} and merge the
corresponding objects, say U and V, to get the cluster (UV). In the next
step of the general algorithm, the distance between (UV) and any other cluster W
is computed by
d(UV)W = min {dUW, dVW}
Clustering using single linkage
D = {dik} =

         1    2    3    4    5
    1    0
    2    9    0
    3    3    7    0
    4    6    5    9    0
    5   11   10    2    8    0

min (dik) = d53 = 2 (minimum over all pairs i, k)
Objects 5 and 3 are merged to form cluster (35). To implement the next level of clustering, we need the distances between
the cluster (35) and the remaining objects 1, 2, and 4. The distances are
d(35)1 = min {d31, d51} = min {3, 11} = 3
d(35)2 = min {d32, d52} = min {7, 10} = 7
d(35)4 = min {d34, d54} = min {9, 8} = 8
Deleting the rows and columns of D corresponding to objects 3 and 5 we obtain a new distance matrix
          (35)    1    2    4
   (35)     0
    1       3     0
    2       7     9    0
    4       8     6    5    0
The smallest distance between pairs of
clusters is now d(35)1 = 3, and we merge
cluster (1) with cluster (35) to get the next
cluster, (135). Calculating,
d(135)2 = min {d(35)2, d12} = min {7, 9} = 7
d(135)4 = min {d(35)4, d14} = min {8, 6} = 6
          (135)    2    4
   (135)     0
     2       7     0
     4       6     5    0
The minimum nearest neighbor distance between pairs of clusters is now d42 = 5, and we merge objects 4 and 2 to get the
cluster (24). At this point, we have two distinct clusters, (135) and (24). Their nearest neighbor distance is
d(135)(24) = min {d(135)2, d(135)4} = min {7, 6} = 6
The final distance matrix becomes

          (135)  (24)
   (135)    0
   (24)     6     0

Consequently, clusters (135) and (24) are merged to form a single cluster of all five objects, (12345), when the
nearest neighbor distance reaches 6.
Figure 3.2 Single linkage dendrogram for distances between five objects (leaf order 1, 3, 5, 2, 4; mergers at distances 2, 3, 5 and 6)
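The merges above can be reproduced directly with R's hclust(); the short sketch below (an illustration, assuming the same 5 x 5 distance matrix) gives the merge heights 2, 3, 5, 6 and a dendrogram matching Figure 3.2.

# Single linkage with hclust() on the 5-object distance matrix
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5,
            dimnames = list(1:5, 1:5))
hc_single <- hclust(as.dist(D), method = "single")
hc_single$height          # 2 3 5 6 - the nearest neighbor merge levels found above
plot(hc_single)           # dendrogram corresponding to Figure 3.2
# method = "complete" or "average" applies the other linkage rules to the same data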
Consider the following array of distances between pairs of 11 languages
(E = English, N = Norwegian, Da = Danish, Du = Dutch, G = German, Fr = French,
Sp = Spanish, I = Italian, P = Polish, H = Hungarian, Fi = Finnish).

      E   N   Da  Du  G   Fr  Sp  I   P   H   Fi
E     0
N     2   0
Da    2   1   0
Du    7   5   6   0
G     6   4   5   5   0
Fr    6   6   6   9   7   0
Sp    6   6   5   9   7   2   0
I     6   6   5   9   7   1   1   0
P     7   7   6  10   8   5   3   4   0
H     9   8   8   8   9  10  10  10  10   0
Fi    9   9   9   9   9   9   9   9   9   8   0
We first search minimum distance between pairs of
languages (clusters). The minimum distance of 1
occurs between
Danish – Norwegian
Italian – French
Italian – Spanish
Numbering the languages in the order of appearance
gives,
d32 = 1, d86 = 1, d87 = 1
Since d76 = 2, we can merge only clusters 8 and 6 or
clusters 8 and 7. We cannot merge clusters 6, 7, and
8 at level 1. We choose first to merge 6 and 8, and
then to update the distance matrix and merge 2 and
3 to obtain the clusters (68) and (23).
Fig 3.3 Single linkage dendrogram for distances between numbers in 11 languages (leaf order E, N, Da, Fr, I, Sp, P, Du, G, H, Fi)
Since single linkage joins clusters by the shortest link between them, the technique cannot discern poorly separated
clusters. On the other hand, single linkage is one of the few clustering methods that can delineate non-ellipsoidal
clusters. The tendency of single linkage to pick out long, string-like clusters is known as chaining.
Fig 3.4 Single linkage clusters: (a) single linkage confused by near overlap; (b) the chaining effect
Complete linkage
• Complete linkage clustering proceeds in much the same manner as single linkage
clustering, with one important exception: at each stage, the distance (similarity)
between clusters is determined by the distance (similarity) between the two
elements, one from each cluster, that are most distant.
• Thus complete linkage ensures that all items in a cluster are within some
maximum distance (or minimum similarity) of each other
• The general agglomerative algorithm again starts by finding the minimum entry
in D = {dik} and merging the corresponding objects, say U and V, to get the
cluster (UV). For step 3 of the general algorithm, the distance between (UV) and
any other cluster W is computed by
d(UV)W = max {dUW, dVW}
The example uses the same distance matrix as before:

D = {dik} =

         1    2    3    4    5
    1    0
    2    9    0
    3    3    7    0
    4    6    5    9    0
    5   11   10    2    8    0

min (dik) = d53 = 2 (minimum over all pairs i, k)
At the first stage, objects 3 and 5 are merged, since they are most similar. This gives the cluster (35). At stage 2, we compute
d(35)1 = max {d31, d51} = max {3, 11} = 11
d(35)2 = max {d32, d52} = max {7, 10} = 10
d(35)4 = max {d34, d54} = max {9, 8} = 9
and the modified distance matrix becomes

          (35)    1    2    4
   (35)     0
    1      11     0
    2      10     9    0
    4       9     6    5    0

The next merger occurs between the most similar groups, 2 and 4, to give the cluster (24). At stage 3, we have
d(24)(35) = max {d2(35), d4(35)} = max {10, 9} = 10
d(24)1 = max {d21, d41} = 9
The distance matrix after this merger is

          (35)  (24)    1
   (35)     0
   (24)    10     0
    1      11     9     0

The next merger produces the cluster (124) at level 9. At the final stage, the groups (35) and (124) are merged into the single cluster (12345) at level
d(124)(35) = max {d1(35), d(24)(35)} = max {11, 10} = 11
The dendrogram is given below.
Figure 3.5 Complete linkage dendrogram for distances between five objects (leaf order 1, 2, 4, 3, 5; mergers at distances 2, 5, 9 and 11)
Fig 3.6 Complete linkage dendrogram for distances between numbers in 11 languages (leaf order E, N, Da, G, Fr, I, Sp, P, Du, H, Fi)
Average linkage
• Average linkage treats the distance between two clusters as the average distance
between all pairs of items, where one member of a pair belongs to each cluster
• The first step is the same: we begin by searching the distance matrix D = {dik}
to find the nearest objects. These are merged to form the cluster (UV)
• For step 3, the distance between (UV) and the other cluster W is determined by
d(UV)W = (Σi Σk dik) / (N(UV) NW)
• where dik is the distance between object i in the cluster (UV) and object k in
the cluster W, and N(UV) and NW are the number of items in clusters (UV) and W,
respectively
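As a small numerical illustration of this formula (my own, reusing the 5-object distance matrix from the single linkage example), the average linkage distance between clusters {1, 2} and {3, 4, 5} is simply the mean of the six cross-cluster distances, as in Figure 3.1(c):

# Average linkage distance between clusters {1, 2} and {3, 4, 5}
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5,
            dimnames = list(1:5, 1:5))
mean(D[c("1", "2"), c("3", "4", "5")])   # (3 + 6 + 11 + 7 + 5 + 10)/6 = 7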
Fig 3.7 Average linkage dendrogram for distances between numbers in 11 languages (leaf order E, N, Da, G, Du, Fr, I, Sp, P, H, Fi)
A comparison of the dendrograms in Fig 3.7 and Fig 3.6 indicates that average
linkage yields a configuration very much like the complete linkage configuration.
However, because distance is defined differently in each case, it is not
surprising that mergers take place at different levels.
Ward’s Hierarchical Clustering Method
• Ward proposed a hierarchical clustering procedure based on minimizing the loss
of information from joining two groups. This method is usually implemented with
the loss of information taken to be an increase in an error sum of squares
criterion, ESS.
• First, for a given cluster k, let ESSk be the sum of squared deviations of
every item in the cluster from the cluster mean (centroid). If there are
currently K clusters, define ESS as the sum of the ESSk, that is,
ESS = ESS1 + ESS2 + … + ESSK
• At each step in the analysis, the union of every possible pair of
clusters is considered and the two clusters whose combination results
in the smallest increase in ESS (Minimum loss of information) are
joined.
• Initially, each cluster consists of a single item, so if there are N items,
ESSk = 0 for k = 1, 2, …, N, and hence ESS = 0
• At the other extreme, when all the items are combined in a single group of N
items, the value of ESS is given by
ESS = Σj (xj - x̄)′(xj - x̄),  j = 1, 2, …, N
where xj is the multivariate measurement associated with the jth item and x̄ is
the mean of all the items
The results of Ward's method can be displayed as a dendrogram.
The vertical axis gives the values of ESS at which the
mergers occur.
Ward’s method is based on the notion that the clusters
of multivariate observations are expected to be
roughly elliptically shaped.
It is a hierarchical precursor to nonhierarchical
clustering methods that optimize some criterion for
dividing data into a given number of elliptical groups.
Fig 3.8 Ward’s linkage method
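A minimal R sketch of Ward's method (an illustration using the built-in USArrests data, not the seminar's wine data): with ordinary Euclidean distances, hclust() expects method = "ward.D2", while method = "ward.D" assumes squared distances.

# Ward's minimum variance clustering - illustrative sketch
d_ward <- dist(scale(USArrests), method = "euclidean")   # standardize, then distances
hc_ward <- hclust(d_ward, method = "ward.D2")
plot(hc_ward, main = "Ward's minimum variance method")
# Merge heights increase monotonically; large jumps suggest natural cluster boundaries
groups <- cutree(hc_ward, k = 4)
table(groups)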
Further discussions –
Hierarchical clustering
Further discussions – Hierarchical clustering
• There are many agglomerative hierarchical clustering procedures
besides single linkage, complete linkage and average linkage.
However all of them follow the basic algorithm
• Sources of error and variation are not formally considered in hierarchical
procedures. This means that a clustering method will be sensitive to outliers,
or “noise”
• In Hierarchical clustering, there is no provision for a reallocation of
objects that may have been incorrectly grouped at an early stage.
Consequently, the final configuration of clusters should always be
carefully examined to see which are sensible
• For a particular problem, it is a good idea to try several clustering
methods and, within a given method, a couple of different ways of
assigning distances (similarities). If the outcomes from the several
methods are (roughly) consistent with one another, perhaps a case of
“natural” grouping can be advanced
• The stability of a hierarchical solution can be checked by applying the
clustering algorithm before and after small errors have been added to
the data units. If the groups are fairly well distinguished, the clustering
before and after perturbation should agree
• Common values (ties) in the similarity or distance matrix can produce
multiple solutions to a hierarchical clustering problem. That is, the
dendrograms corresponding to different treatments of the tied
similarities can be different, particularly at the lower levels. This is not
an inherent problem; sometimes multiple solutions occur for certain
kinds of data. The user needs to know of their existence so that
dendrograms can be properly interpreted.
The Inversion problem
In the following example, the clustering method joins A and B at distance 20. At the next
step, C is added to the group (AB) at distance 32. Next the clustering algorithm adds D to
the group (ABC) at a distance 30, a smaller distance than where C was added.
Inversions can occur when there is no clear cluster structure and are generally associated
with two hierarchical clustering algorithms known as centroid method and median method.
(In a dendrogram, the inversion shows up either as a crossover of branches or as a nonmonotonic height scale: here A and B join at 20, C is added at 32, and D at 30.)
Non-Hierarchical clustering
Nonhierarchical clustering methods
• It is a clustering technique that groups items, rather than variables, into a
collection of K clusters.
• K may either be specified in advance or determined as part of the clustering
procedure.
• Because a distance matrix does not have to be determined and the basic data do
not have to be stored, nonhierarchical methods can be applied to much larger
data sets than hierarchical techniques
The K-Means method
• MacQueen suggested the term K-means for describing an algorithm of
his that assigns each item to the cluster having the nearest centroid
(mean). In its simplest version, the process consists of three steps:
1. Partition the items into K initial clusters
2. Proceed through the list of items, assigning an item to the cluster
whose centroid (mean) is nearest. (Distance is usually computed
using Euclidean distance with either standardized or unstandardized
observations.) Recalculate the centroid for the cluster receiving the
new item and for the cluster losing the item.
3. Repeat step 2 until no more reassignments take place
Suppose we measure two variables X1 and X2 for each of four items A, B, C, and D. The data are given in the following table:

Item    X1    X2
A        5     3
B       -1     1
C        1    -2
D       -3    -2

The objective is to divide these items into K = 2 clusters such that the items within a cluster are closer to one another
than they are to the items in different clusters. To implement the K = 2-means method, we arbitrarily partition the items
into two clusters, such as (AB) and (CD), and compute the coordinates (x̄1, x̄2) of each cluster centroid (mean).
Thus, at Step 1, we have:

Cluster    Coordinates of centroid (x̄1, x̄2)
(AB)       ((5 + (-1))/2, (3 + 1)/2) = (2, 2)
(CD)       ((1 + (-3))/2, (-2 + (-2))/2) = (-1, -2)
At Step 2, we compute the Euclidean distance of each item from the group centroids and reassign each item to the nearest
group. If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding.
We compute the squared distances

d²(A, (AB)) = (5 - 2)² + (3 - 2)² = 10        d²(A, (CD)) = (5 + 1)² + (3 + 2)² = 61

Since A is closer to cluster (AB) than to cluster (CD), it is not reassigned. Continuing, we get

d²(B, (AB)) = (-1 - 2)² + (1 - 2)² = 10        d²(B, (CD)) = (-1 + 1)² + (1 + 2)² = 9

and consequently B is reassigned to cluster (CD), giving the cluster (BCD) and the following updated coordinates of the centroids:

Cluster    Coordinates of centroid (x̄1, x̄2)
A          (5, 3)
(BCD)      (-1, -1)

Again, each item is checked for reassignment. Computing the squared distances gives the following:

                      Item
Cluster        A      B      C      D
A              0     40     41     89
(BCD)         52      4      5      5
We see that each item is currently assigned to the cluster with the nearest
centroid (mean), and the process stops. The final K = 2 clusters are A and (BCD)
To check the stability of the clustering, it is desirable to rerun the algorithm with
a new initial partition.
A table of the cluster centroids (means) and within cluster variances also helps
to delineate group differences.
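A quick check of this worked example with R's kmeans() (my own sketch, not the textbook's computation): with several random starts it settles on the same partition, A versus (BCD), with centroids (5, 3) and (-1, -1).

# K-means check of the four-item example
x <- rbind(A = c(5, 3), B = c(-1, 1), C = c(1, -2), D = c(-3, -2))
colnames(x) <- c("X1", "X2")
set.seed(1)
km <- kmeans(x, centers = 2, nstart = 10)
km$cluster    # A in one cluster; B, C and D together in the other
km$centers    # (5, 3) and (-1, -1), matching the final centroid table above
km$withinss   # within-cluster sums of squares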
Further discussions – Non-hierarchical
clustering
• If two or more seed points inadvertently lie within a single cluster,
their resulting clusters will be poorly differentiated
• The existence of an outlier might produce at least one group with very
disperse items
• Even if the population is known to consist of K groups, the sampling
method may be such that data from the rarest group do not appear in
the sample. Forcing the data into K groups would lead to nonsensical
clusters
• It is always a good idea to rerun the algorithm for several choices of K
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
Data analysis using
statistical packages - SPSS
1. Input the variables
2. Input the data into SPSS
3. Go to Analyze -> Classify -> Hierarchical cluster
4. Add the necessary variables for classification (the name variable can be
placed here for labelling in the output)
5. Define statistics (the distance/dissimilarity matrix output is optional)
6. Define plots (this sets the orientation of the cases, not of the dendrogram)
7. Define the method (choose the distance criterion and the clustering criterion)
8. Continue -> OK
Wine Quality data
Name FA VA CA RS CL FSO2 TSO2 D PH SO4 A Q
AA01 7 0.27 0.36 20.7 0.045 45 170 1.001 3 0.45 8.8 6
AA02 6.3 0.3 0.34 1.6 0.049 14 132 0.994 3.3 0.49 9.5 6
AA03 8.1 0.28 0.4 6.9 0.05 30 97 0.9951 3.26 0.44 10.1 6
AA04 7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.4 9.9 6
AA05 7.2 0.23 0.32 8.5 0.058 47 186 0.9956 3.19 0.4 9.9 6
AA06 8.1 0.28 0.4 6.9 0.05 30 97 0.9951 3.26 0.44 10.1 6
AA07 6.2 0.32 0.16 7 0.045 30 136 0.9949 3.18 0.47 9.6 6
AA08 7 0.27 0.36 20.7 0.045 45 170 1.001 3 0.45 8.8 6
AA09 6.3 0.3 0.34 1.6 0.049 14 132 0.994 3.3 0.49 9.5 6
AA10 8.1 0.22 0.43 1.5 0.044 28 129 0.9938 3.22 0.45 11 6
AA11 8.1 0.27 0.41 1.45 0.033 11 63 0.9908 2.99 0.56 12 5
AA12 8.6 0.23 0.4 4.2 0.035 17 109 0.9947 3.14 0.53 9.7 5
AA13 7.9 0.18 0.37 1.2 0.04 16 75 0.992 3.18 0.63 10.8 5
AA14 6.6 0.16 0.4 1.5 0.044 48 143 0.9912 3.54 0.52 12.4 7
AA15 8.3 0.42 0.62 19.25 0.04 41 172 1.0002 2.98 0.67 9.7 5
AA16 6.6 0.17 0.38 1.5 0.032 28 112 0.9914 3.25 0.55 11.4 7
AA17 6.3 0.48 0.04 1.1 0.046 30 99 0.9928 3.24 0.36 9.6 6
AA18 6.2 0.66 0.48 1.2 0.029 29 75 0.9892 3.33 0.39 12.8 8
AA19 7.4 0.34 0.42 1.1 0.033 17 171 0.9917 3.12 0.53 11.3 6
AA20 6.5 0.31 0.14 7.5 0.044 34 133 0.9955 3.22 0.5 9.5 5
AA21 6.2 0.66 0.48 1.2 0.029 29 75 0.9892 3.33 0.39 12.8 8
AA22 6.4 0.31 0.38 2.9 0.038 19 102 0.9912 3.17 0.35 11 7
AA23 6.8 0.26 0.42 1.7 0.049 41 122 0.993 3.47 0.48 10.5 8
AA24 7.6 0.67 0.14 1.5 0.074 25 168 0.9937 3.05 0.51 9.3 5
AA25 6.6 0.27 0.41 1.3 0.052 16 142 0.9951 3.42 0.47 10 6
Data analysis using statistical
packages - RStudio
library(readxl)
# Attach wine data (choose the Excel file interactively)
winedata <- read_excel(file.choose())
winedata
# Keep the first 25 samples as a plain data frame
wine1 <- as.data.frame(winedata[1:25, ])
tail(wine1)
# Use the sample ID column for row labels, then drop it so that only
# numeric variables enter the distance calculation
# (adjust "Sample" to the ID column name in the sheet, e.g. "Name")
rownames(wine1) <- wine1$Sample
wine1$Sample <- NULL
# Finding distance matrix
distance_mat <- dist(wine1, method = 'euclidean')
distance_mat
any(is.na(distance_mat))   # check for missing distances
# Fitting Hierarchical clustering Model
# to training dataset
set.seed(240) # Setting seed (hclust itself is deterministic)
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl
# Plotting dendrogram
plot(Hierar_cl)
# Choosing no. of clusters
# Cutting tree by height
abline(h = 25, col = "green")
# Cutting tree by no. of clusters
fit <- cutree(Hierar_cl, k = 6)
fit
table(fit)
rect.hclust(Hierar_cl, k = 6, border = "green")
Hierarchical Clustering – R Script
Linkage method codes for clustering <hclust(d, method = '')>
• "ward.D": Ward’s minimum variance method
• "ward.D2": Ward’s minimum variance method (using the square of Euclidean distances)
• "single": Single linkage method (nearest neighbor)
• "complete": Complete linkage method (farthest neighbor)
• "average": UPGMA method (Unweighted Pair Group Method with Arithmetic Mean)
• "mcquitty": WPGMA method (Weighted Pair Group Method with Arithmetic Mean)
• "median": WPGMC method (Weighted Pair Group Method with Centroid Mean)
• "centroid": UPGMC method (Unweighted Pair Group Method with Centroid Mean)
Distance codes for the distance matrix <dist(data, method = '')>
• "euclidean": Euclidean distance
• "maximum": Maximum distance
• "manhattan": Manhattan distance
• "canberra": Canberra distance
• "binary": Binary distance
• "minkowski": Minkowski distance
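The R script above covers only the hierarchical analysis. A non-hierarchical companion sketch is given below (an illustration, assuming the same wine sheet has been read into winedata as above, that the first column holds the sample names, and an illustrative choice of K = 3).

# K-means (non-hierarchical) clustering of the same wine data - a sketch
wine_num <- as.data.frame(winedata[1:25, -1])        # drop the ID column, keep numeric variables
wine_std <- scale(wine_num)                          # standardize before computing distances
set.seed(240)
km_cl <- kmeans(wine_std, centers = 3, nstart = 25)  # K = 3 chosen only for illustration
km_cl$cluster                                        # cluster membership of each sample
km_cl$centers                                        # standardized cluster centroids
km_cl$tot.withinss                                   # total within-cluster sum of squares
table(km_cl$cluster)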
1. Introduction
2. Review of Literature
3. Methodology
4. Data analysis using statistical packages
5. References
Contents
References
Books
• Rencher AC. 2012. Methods of Multivariate Analysis. 3rd Ed. John
Wiley
• Srivastava MS and Khatri CG. 1979. An Introduction to Multivariate
Statistics. North Holland
• Johnson RA and Wichern DW. 1988. Applied Multivariate Statistical
Analysis. Prentice Hall
Articles
• Dosoky, N.S., Satyal, P., Barata, L.M., da Silva, J.K.R. and Setzer,
W.N., 2019. Volatiles of black pepper fruits (Piper nigrum L.).
Molecules, 24(23), p.4244.
• Talekar, S.C., Praveena, M.V. and Satish, R.G., 2022. Genetic diversity
using principal component analysis and hierarchical cluster analysis
in rice. International Journal of Plant Sciences, 17(2), pp. 191-196.
• Siva, G.S., Rao, V.S. and Babu, D.R., 2014. Cluster Analysis Approach
to Study the Rainfall Pattern in Visakhapatnam District. Weekly
Science Research Journal, 1, p.31.
Articles
• Rao, N.S. and Chaudhary, M., 2019. Hydrogeochemical processes
regulating the spatial distribution of groundwater contamination,
using pollution index of groundwater (PIG) and hierarchical cluster
analysis (HCA): a case study. Groundwater for Sustainable
Development, 9, p.100238.
• Kumari, C.U., Prasad, S.J. and Mounika, G., 2019, March. Leaf
disease detection: feature extraction with K-means clustering and
classification with ANN. In 2019 3rd International Conference on
Computing Methodologies and Communication (ICCMC) (pp. 1095-
1098). IEEE.
Data source for wine data
• Analysis of Wine Quality Data | STAT 508 (psu.edu)
Codes
• Hierarchical Clustering in R Programming – GeeksforGeeks
• Hierarchical Clustering in R: Step-by-Step Example - Statology
More Related Content

PDF
12. Clustering.pdf for the students of aktu.
PDF
Mastering Hierarchical Clustering: A Comprehensive Guide
PDF
Hierarchical clustering for Petroleum.pdf
PPTX
Unsupervised Learning-Clustering Algorithms.pptx
PDF
Defining Homogenous Climate zones of Bangladesh using Cluster Analysis
PPT
My8clst
PPTX
Hierarchical methods navdeep kaur newww.pptx
DOCX
Perceuptal mapping, Factor analysis, cluster analysis, conjoint analysis
12. Clustering.pdf for the students of aktu.
Mastering Hierarchical Clustering: A Comprehensive Guide
Hierarchical clustering for Petroleum.pdf
Unsupervised Learning-Clustering Algorithms.pptx
Defining Homogenous Climate zones of Bangladesh using Cluster Analysis
My8clst
Hierarchical methods navdeep kaur newww.pptx
Perceuptal mapping, Factor analysis, cluster analysis, conjoint analysis

Similar to Hierarchical and Non Hierarchical Clustering.pptx (20)

PDF
Survey on traditional and evolutionary clustering
PDF
Survey on traditional and evolutionary clustering approaches
PPT
Cluster spss week7
PDF
Everitt, landou cluster analysis
PDF
Classification_and_Ordination_Methods_as_a_Tool.pdf
PDF
Data Science - Part VII - Cluster Analysis
PPTX
Cluster analysis
DOCX
Curse of Dimensionality in Paradoxical High Dimensional Clinical Datasets � A...
PDF
Ch 4 Cluster Analysis.pdf
PPTX
Cluster analysis
PDF
Cancer data partitioning with data structure and difficulty independent clust...
PDF
Image Mining for Flower Classification by Genetic Association Rule Mining Usi...
PDF
K means clustering in the cloud - a mahout test
PPTX
Cluster Analysis
PPTX
Cluster
PPTX
Clusters techniques
PDF
Clustering Algorithm Based On Correlation Preserving Indexing
PDF
Clustering Approach Recommendation System using Agglomerative Algorithm
PPTX
Hierarchical clustering_12.pptxffefeeefe
PPT
Cluster
Survey on traditional and evolutionary clustering
Survey on traditional and evolutionary clustering approaches
Cluster spss week7
Everitt, landou cluster analysis
Classification_and_Ordination_Methods_as_a_Tool.pdf
Data Science - Part VII - Cluster Analysis
Cluster analysis
Curse of Dimensionality in Paradoxical High Dimensional Clinical Datasets � A...
Ch 4 Cluster Analysis.pdf
Cluster analysis
Cancer data partitioning with data structure and difficulty independent clust...
Image Mining for Flower Classification by Genetic Association Rule Mining Usi...
K means clustering in the cloud - a mahout test
Cluster Analysis
Cluster
Clusters techniques
Clustering Algorithm Based On Correlation Preserving Indexing
Clustering Approach Recommendation System using Agglomerative Algorithm
Hierarchical clustering_12.pptxffefeeefe
Cluster
Ad

More from Ranjith C (9)

PPTX
Missing Observations and how to deal with them.pptx
PPTX
Application of Advanced Machine Learning Methods for Crop Image Classificatio...
PPTX
Statistical applications on agricultural field experimental trials.pptx
PPTX
Analyzing Crop Shift in Belgaum using Markov Chain Analysis..pptx
PPTX
Turmeric price forecasting using Time series approach.pptx
PPTX
Black Pepper Price forecasting using time series approach.pptx
PPTX
Time series modelling for price forecasting in plantation crops.pptx
PPTX
Crop Image Classification using Machine Learning and Deep Learning Techniques...
PPTX
Role of Hybrid Time Series Models (ARIMA-ANN) in Forecasting Scenario of Agri...
Missing Observations and how to deal with them.pptx
Application of Advanced Machine Learning Methods for Crop Image Classificatio...
Statistical applications on agricultural field experimental trials.pptx
Analyzing Crop Shift in Belgaum using Markov Chain Analysis..pptx
Turmeric price forecasting using Time series approach.pptx
Black Pepper Price forecasting using time series approach.pptx
Time series modelling for price forecasting in plantation crops.pptx
Crop Image Classification using Machine Learning and Deep Learning Techniques...
Role of Hybrid Time Series Models (ARIMA-ANN) in Forecasting Scenario of Agri...
Ad

Recently uploaded (20)

PDF
Tetra Pak Index 2023 - The future of health and nutrition - Full report.pdf
PDF
Data Engineering Interview Questions & Answers Data Modeling (3NF, Star, Vaul...
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
PDF
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
PPTX
A Complete Guide to Streamlining Business Processes
PPTX
Topic 5 Presentation 5 Lesson 5 Corporate Fin
PDF
Navigating the Thai Supplements Landscape.pdf
PDF
Votre score augmente si vous choisissez une catégorie et que vous rédigez une...
PDF
Microsoft Core Cloud Services powerpoint
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PDF
Global Data and Analytics Market Outlook Report
PPTX
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
PDF
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
PPTX
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
PPTX
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
PPTX
CYBER SECURITY the Next Warefare Tactics
PDF
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
PPT
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
PDF
Introduction to Data Science and Data Analysis
PPTX
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
Tetra Pak Index 2023 - The future of health and nutrition - Full report.pdf
Data Engineering Interview Questions & Answers Data Modeling (3NF, Star, Vaul...
retention in jsjsksksksnbsndjddjdnFPD.pptx
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
A Complete Guide to Streamlining Business Processes
Topic 5 Presentation 5 Lesson 5 Corporate Fin
Navigating the Thai Supplements Landscape.pdf
Votre score augmente si vous choisissez une catégorie et que vous rédigez une...
Microsoft Core Cloud Services powerpoint
STERILIZATION AND DISINFECTION-1.ppthhhbx
Global Data and Analytics Market Outlook Report
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
CYBER SECURITY the Next Warefare Tactics
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
Introduction to Data Science and Data Analysis
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx

Hierarchical and Non Hierarchical Clustering.pptx

  • 1. 1 HIERARCHICALAND NON- HIERARCHICAL CLUSTERING Course In-charge, Dr. Kiran Prakash Professor and Head Department of Statistics and Computer applications Presented by, Ranjith. C M. Sc. (Ag) Statistics BAM-2022-77 Agricultural college, Bapatla Acharya N. G. Ranga Agricultural University STAT 591 – Master’s Seminar (0+1)
  • 2. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 4. Definitions • Cluster • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis is a statistical technique used to group a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups. • Applications – in short • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms
  • 5. Other clustering methods • Model based clustering • The data points within each cluster follow a particular probability distribution • Density based clustering • Groups data points together based on their density within a defined radius or distance threshold • Grid based clustering • Grid-based methods quantize the object space into a finite number of cells that form a grid structure • Fuzzy clustering • assigns each data point a membership score for each cluster, rather than a binary membership value
  • 6. • Speciation of plants • Clustering the characteristics of certain plants to identify the species, or to decide there is sufficient evidence to decide it as a new species • Study of Natural disasters • To find the areas affected by earthquake, forest fire etc. and to take measures • City planning • To find the areas where more people are residing and build transportation facilities and roads • Planning survey • To create the optimum sample size so as to conduct effective surveys Applications
  • 7. Applications • Marketing • Classify the products based on customer preferences • Medical Diagnosis • Group patients with similar symptoms or medical histories, aiding in disease classification and personalized treatment plans. • Crime Pattern Analysis • Analyze crime data to identify clusters of similar criminal activities, assisting law enforcement in targeted interventions. • Image Segmentation • Analyze and categorize images into clusters, assisting in image recognition and computer vision applications.
  • 8. Hierarchical clustering • Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. It starts with individual data points and recursively merges or divides them to form a tree-like structure, known as a dendrogram. The dendrogram represents the relationships and similarities between different clusters and can be visually interpreted to understand the organization of the data. • Two types: • Agglomerative hierarchical method • Divisive hierarchical method
  • 9. Hierarchical clustering • Start with Individual Data Points Begin by considering each data point as a separate cluster. • Compute Pairwise Similarities Calculate the similarity or dissimilarity between each pair of clusters or data points. Common distance metrics include Euclidean distance, Manhattan distance, or correlation coefficients. • Merge Similar Clusters Identify the pair of clusters with the highest similarity and merge them into a single cluster. This creates a new cluster that replaces the two merged clusters.
  • 10. Hierarchical clustering • Update Similarity Matrix Recalculate the similarity or dissimilarity between the new cluster and the remaining clusters. • Repeat Steps 3-4 Repeat the process of merging the most similar clusters and updating the similarity matrix until all data points are in a single cluster or until a predetermined number of clusters is reached. • Dendrogram Construction Represent the clustering process using a dendrogram. The vertical lines in the dendrogram indicate the merging of clusters, and the height at which they merge reflects the dissimilarity at which the merging occurred.
  • 11. Nonhierarchical clustering • Non-hierarchical clustering, also known as partitioning clustering, is a method of cluster analysis that divides a dataset into a predetermined number of clusters. Unlike hierarchical clustering, which creates a tree-like structure of nested clusters, non-hierarchical clustering directly assigns data points to clusters without forming a hierarchy.
  • 12. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 13. Review of Literature • Rainfall pattern • Germplasm evaluation • Chemical classification • Groundwater contamination • Image detection
  • 14. Rainfall pattern • A multivariate approach based on hierarchical cluster analysis has been proposed to study the pattern of rainfall in different mandals of Visakhapatnam district of Andhra Pradesh • Rainfall patterns of 42 mandals based on 25 years of rainfall data of Visakhapatnam district from 1986-2010 was employed
  • 16. Category Rainfall (mm) High rainfall >1162 Medium rainfall 862 – 1162 Low rainfall <862
  • 17. Results • The mandals were categorized into 8 clusters based on mean rainfall • The application of these approaches identified medium rainfall (862 mm-1162 mm) was the most frequent representative pattern of rainfall in majority mandals of Visakhapatnam district
  • 18. Germplasm Evaluation • A set of 100 rice germplasm lines with four checks viz., BPT-5204, PSB-68, Siri1253 and MGD-101 were evaluated in augmented block design during Kharif 2020. • Test entries along with checks were sown at a spacing of 20×10 cm in augmented Block Design with four blocks, wherein each block comprised of 25 genotypes and four checks were repeated in each block • Data related to days to 50% flowering, panicle length, panicles per square metre, 1000-grain weight and grain yield was collected and analysed
  • 19. Cluster analysis Sl No Cluster No of Individuals Character 1 Cluster 1 5 Early maturity, high grain yield, long panicle length and medium 1000- grain weight 2 Cluster 2 27 Early maturing types with medium panicle length and low 1000-grain weight 3 Cluster 3 23 Very early flowering and medium 1000-grain weight. 4 Cluster 4 45 Early flowering, short panicle length and more panicles per square meter
  • 20. The average intra-cluster and inter-cluster Euclidean distances were estimated using ward’s minimum variance
  • 21. Results • It was discovered that none of the clusters included at least one genotype that had all of the desirable traits, ruling out the idea of selecting one genotype for immediate usage. Therefore to judiciously incorporate all of the desirable features, hybridization between selected genotypes from divergent clusters is required. • From cluster analysis maximum inter-cluster distance was observed between clusters 2 and cluster 3 followed by cluster 1 and cluster 2. So the genotypes selected from these clusters can be used for selecting genetically diverse parents.
  • 22. Chemical classification • Five Piper nigrum essential oils were analyzed by GC-MS (gas chromatography-mass spectrometry) were analyzed • 78 compounds were identified accounting for more than 99% of the compositions • Based on P. nigrum essential oil compositions, a hierarchical cluster analysis of the oils were done • Analysis done using agglomerative hierarchical cluster (AHC) analysis using XLSTAT Premium • Dissimilarity was determined using Euclidean distance, and clustering was defined using Ward’s method
  • 24. Results • The oils were dominated by monoterpene hydrocarbons. Black pepper oils from various geographical locations have shown qualitative similarities with differences in the concentrations of their major components. • β-Caryophyllene, limonene, β-pinene, α-pinene, δ-3-carene, sabinene, and myrcene were the main components of P. nigrum oil
  • 25. Groundwater contamination • Groundwater samples from 30 locations • Ionic balance error between the total concentrations of cations (Ca2+, Mg2+ , Na+ and K+ ) and the total concentrations of anions (HCO3 - , Cl- , SO4 2- and NO3 - ) expressed in milliequivalents per liter (meq/L) were observed for each groundwater sample
  • 26. Pollution index calculation • In first step, the relative weight (Rw) from 1 to 5 was assigned for each chemical parameter, depending upon its relative impact on human beings. Minimum weight (1) was given to K+ and maximum weight (5) to pH, TDS, SO4 2- and NO3 - • In second step, the weight parameter (Wp) was computed for each chemical parameter to assess its relative share on overall chemical groundwater quality • In third step, the status of concentration (Sc) was determined by dividing the concentration (C) of each chemical parameter of each groundwater sample by its respective drinking water quality standard limit (Ds)
  • 27. • In last step, pollution index of groundwater (PIG) was calculated by adding all values of Ow (ΣOw) STATISTICA version 6.1 was used. In HCA, a Complete Linkage is used to determine the distance between the clusters or groups.
  • 30. Results • Group I represents low mineralized groundwater quality, Group II shows moderately mineralized groundwater quality and Group III has highly mineralized groundwater quality, depending upon the availability sources. • Half of the samples are falling under moderately mineralized category, seven in low mineralized and eight in highly mineralized category.
  • 31. Image detection • The objective of this research is detection and classification of cotton and tomato leaf diseases • K-means clustering algorithm is used to separate the stained part and healthy leaf region • This proposed method of image processing is done in MATLAB 2016b software • L* represents the lightness, a* and b* represents the chromaticity layers. All of the color information is in the a* and b* layers • The derived features are Contrast, Correlation, Energy, Homogeneity, Mean, Standard Deviation and Variance
  • 32. K-means clustering algorithm LOAD Image Convert RGB color space into L*a*b* color space Clustering the variant colors Measure the distance by using Euclidean Distance Matrix Create a blank cell array to store clusters CLUSTERS
  • 33. NN Classification Leaf Disease Bacterial Leaf Spot Target Spot Septoria Leaf Spot Leaf Mold Accuracy Bacterial Leaf Spot 9 1 0 0 90% Target Spot 2 8 0 0 80% Septoria Leaf Spot 0 0 10 0 100% Leaf Mold 0 0 0 10 100% Average Accuracy 92.5%
  • 34. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 36. Outline 1. Methodology 2. Further discussions – Hierarchical clustering 3. Non-Hierarchical clustering 4. Further discussions – Non-Hierarchical clustering
  • 38. Agglomerative Hierarchical method • A series of successive mergers • There are many initial clusters as objects • The most similar groups are first grouped • Then these initial groups are merged according to their similarities • Eventually as the similarity decreases, all subgroups are fused into a single cluster
  • 39. Divisive Hierarchical methods • Work opposite to Agglomerative method • And initial single group of objects is divided into two subgroups such that the objects in one subgroup are “far from” the objects in the other • These are further divided into dissimilar subgroups • The process is continued until there are as many subgroups are objects – that is each object become a cluster • Both agglomerative and divisive methods can be displayed as the two- dimensional diagram called a Dendrogram
  • 40. Algorithm for Agglomerative clustering 1. Start with N clusters, each containing a single entity and an N x N symmetric matrix of distances (or similarities) D = {dik} 2. Search the distance matrix for the nearest (most similar) pair of clusters. Let distance between “most similar” clusters U and V be dUV 3. Merge clusters U and V. Label the newly formed cluster (UV). Update the entries in the distance matrix by (a) deleting the rows and columns corresponding to clusters U and V and (b) adding a row and column giving the distances between cluster (UV) and the remaining clusters. 4. Repeat steps 2 and 3 a total of N – 1 times. (All objects will be in single cluster after the algorithm terminates). Record the identity of clusters that are merged and the levels (distances or similarities) at which the mergers take place.
  • 41. Linkage methods • These methods are suitable for clustering items, as well as variables • Three main types are there; • Single linkage • Complete linkage • Average linkage
  • 42. Cluster distance 𝑑13+𝑑14+𝑑15+𝑑 23+𝑑24+𝑑25 6 𝑑24 𝑑15 Figure 3.1 Intercluster distance (dissimiliarity) for (a) Single linkage (b) Complete linkage and (c) Average linkage
  • 43. Single linkage • The inputs to a single linkage algorithm can be distances or similarities between pairs of objects. Groups are formed from nearest neighbors, where the term nearest neighbor connotes the smallest distance or largest similarity • Initially we must find the smallest distance in D = {dik} and merge the corresponding objects, say U and V, to get the cluster (UV). In the next step of general algorithm, the distance between UV and any other cluster W are computed by d(UV)W = min {dUW, dVW}
  • 44. Clustering using single linkage 0 9 0 3 7 0 6 5 9 0 11 10 2 8 0 D = {dik} = 1 2 3 4 5 1 2 3 4 5 min (dik) = d53 = 2 i,k 0 9 0 3 7 0 6 5 9 0 11 10 2 8 0 1 2 3 4 5 1 2 3 4 5
  • 45. Objects 5 and 3 are merged to form cluster (35). To implement next level of clustering we need the distance between the cluster (35) and remaining objects 1, 2, and 4. The distances are d(35)1 = min {d31, d51} = min {3, 11} = 3 d(35)2 = min {d32, d52} = min {7, 10} = 7 d(35)4 = min {d34, d54} = min {9, 8} = 8 Deleting the rows and columns of D corresponding to objects 3 and 5 we obtain a new distance matrix 0 3 0 7 9 0 8 6 5 0 (35) 1 2 4 (35) 1 2 4 The smallest distance between pairs of clusters is now d(35)1 = 3, and we merge cluster (1) with cluster (35) to get the next cluster, (135). Calculating, d(135)2 = min {d(35)2, d12} = min {7, 9} = 7 d(135)4 = min {d(35)4, d14} = min {8, 6} = 6 0 7 0 6 5 0 (135) 2 4 (135) 2 4 Minimum nearest neighbor distance between pairs of clusters is d(42) = 5, and we merge objects 4 and 2 to get the cluster (24). At this point, we have two distinct clusters, (135) and (24). Their nearest neighbor distance is; d(135)2,d(135)4 = ,min {d(135)2, d(135)4 = min {7, 6} = 6}
  • 46. The final distance matrix becomes, Consequently, clusters (135) and (24) are merged to form a single cluster of all five objects, (12345), when the nearest neighbor distance reaches 6. 0 6 0 (135) (24) (135) (24) 6 4 2 0 1 3 5 2 4 Figure 3.2 Single linkage dendrogram for distances between five objects
  • 47. 0 2 0 2 1 0 7 5 6 0 6 4 5 5 0 6 6 6 9 7 0 6 6 5 9 7 2 0 6 6 5 9 7 1 1 0 7 7 6 10 8 5 3 4 0 9 8 8 8 9 10 10 10 10 0 9 9 9 9 9 9 9 9 9 8 0 E N Da Du G Fr Sp I P H Fi E N Da Du G Fr Sp I P H Fi Consider the array of closeness between 10 languages. We first search minimum distance between pairs of languages (clusters). The minimum distance of 1 occurs between Danish – Norwegian Italian – French Italian – Spanish Numbering the languages in the order of appearance gives, d32 = 1, d86 = 1, d87 = 1 Since d76 = 2, we can merge only clusters 8 and 6 or clusters 8 and 7. We cannot merge clusters 6, 7, and 8 at level 1. We choose first to merge 6 and 8, and then to update the distance matrix and merge 2 and 3 to obtain the clusters (68) and (23).
  • 48. 6 4 2 0 8 10 E N Da Fr I Sp P Du G H Fi Fig 3.3 Single linkage dendrograms for distances between numbers in 11 languages
  • 49. Since single linkage joins clusters by shortest link between them, the technique cannot discern poorly separated clusters. On the other hand, single linkage is one of the few clustering methods that can delineate non-ellipsoidal clusters. The tendency of single linkage to pick out long string-like clusters is known as chaining. Fig 3.4 Single linkage clusters Single linkage confused by near overlap Chaining effect
  • 50. Complete linkage • Complete linkage clustering proceeds in much the same manner as single linkage clusters, with one important exception; at each stage, the distance (similarity) between clusters is determined by the distance (similarity) between the two elements, one from each cluster that are most distant. • Thus complete linkage ensures that all items in a cluster are within some maximum distance (or minimum similarity)
  • 51. • The general agglomerative algorithm again starts by finding the minimum entry inn D = {dik} and merging the corresponding objects, such as U and V, to get cluster (UV). For step 3 of the general algorithm in (12-12), the distances between (UV) and any other cluster W are computed by d(UV)W = max {dUW, dVW} 0 9 0 3 7 0 6 5 9 0 11 10 2 8 0 D = {dik} = 1 2 3 4 5 1 2 3 4 5 min (dik) = d53 = 2 i,k 0 9 0 3 7 0 6 5 9 0 11 10 2 8 0 1 2 3 4 5 1 2 3 4 5
  • 52. At the first stage, objects 3 and 5 are merged, since they are most similar. This gives the cluster (35). At stage 2, we compute, d(35)1 = max {d31, d51} = max {3, 11} = 11 d(35)2 = max {d32, d52} = max {7, 10} = 10 d(35)4 = max {d34, d54} = max {9, 8} = 9 And the modified distance matrix becomes The next merger occurs between the most similar groups, 2 and 4, to give the cluster (24). At stage 3, we have d(24)(35) = max{d2(35), d4(35)} = max {10, 9} = 10 d(24)1 = max {d21, d41} = 9 0 11 0 10 9 0 9 6 5 0 (35) 1 2 4 (35) 1 2 4
  • 53. (35) (24) 1 (35) (24) 1 0 10 0 11 9 0 The next merger produces the cluster (124). At the final stage, the groups (35) and (124) are merged as the single cluster (12345) at level d(124)(35) = max{d1(35),d(24)(35)} = max{11,10} = 11 The dendrogram is given below 6 4 2 0 1 2 4 3 5 8 10 12 Figure 3.5 Complete linkage dendrogram for distances between five objects
  • 54. 6 4 2 0 8 10 E N Da G Fr I Sp P Du H Fi Fig 3.6 Complete linkage dendrograms for distances between numbers in 11 languages
  • 55. Average linkage • Average distance between all pairs of items where one member or a pair belongs to each cluster • The first step is same, we begin by searching the distance matrix D = {dik} to find the nearest objects. These are merged to form the cluster (UV) • For step three, the distances between (UV) and the other cluster W are determined by • Where dik is the distance between object i in the cluster (UV) and object k in the cluster W, N(UV) and NW are the number of items in cluster (UV) and W respectively
  • 56. 6 4 2 0 8 10 E N Da G Du Fr I Sp P H Fi Fig 3.7 Average linkage dendrograms for distances between numbers in 11 languages
  • 57. A comparison of the dendrogram in Fig 3.7 and Fig 3.6 indicates that the average linkage yields to configuration very much like the complete linkage configuration. However because distance is defined differently in each case, it is not surprising that mergers take place at different levels
• 58. Ward’s Hierarchical Clustering Method • Ward considered a hierarchical clustering procedure based on minimizing the loss of information from joining two groups. The method is usually implemented with the loss of information taken to be an increase in an error sum of squares criterion, ESS. • First, for a given cluster k, let ESSk be the sum of squared deviations of every item in the cluster from the cluster mean (centroid). If there are K clusters, define ESS as the sum of the ESSk, or ESS = ESS1 + ESS2 + … + ESSK
• 59. • At each step in the analysis, the union of every possible pair of clusters is considered, and the two clusters whose combination results in the smallest increase in ESS (minimum loss of information) are joined. • Initially, each cluster consists of a single item, so if there are N items, ESSk = 0 for k = 1, 2, …, N, and hence ESS = 0. • At the other extreme, when all the items are combined in a single group of N items, the value of ESS is given by ESS = Σj (xj − x̄)′(xj − x̄), where xj is the multivariate measurement associated with the jth item and x̄ is the mean of all the items.
• 60. The results of Ward’s method can be displayed as a dendrogram. The vertical axis gives the values of ESS at which the mergers occur. Ward’s method is based on the notion that the clusters of multivariate observations are expected to be roughly elliptically shaped. It is a hierarchical precursor to nonhierarchical clustering methods that optimize some criterion for dividing data into a given number of elliptical groups. Fig 3.8 Ward’s linkage method
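A minimal R sketch of Ward's method and the ESS criterion, using a small made-up data set (the data, the ess helper function and the choice of k = 3 are assumptions made here for illustration only):
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)                  # 10 items measured on 2 variables
ward_cl <- hclust(dist(X), method = "ward.D2")    # Ward's minimum-variance method
plot(ward_cl)
# ESS of one cluster: sum of squared deviations of its items from the cluster centroid
ess <- function(x) sum(scale(x, scale = FALSE)^2)
groups <- cutree(ward_cl, k = 3)
sum(tapply(seq_len(nrow(X)), groups,
           function(idx) ess(X[idx, , drop = FALSE])))   # total ESS for the 3-cluster solution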
• 62. Further discussions – Hierarchical clustering • There are many agglomerative hierarchical clustering procedures besides single linkage, complete linkage and average linkage; however, all of them follow the same basic algorithm. • In most hierarchical procedures, sources of error and variation are not formally considered. This means that a clustering method will be sensitive to outliers, or "noise". • In hierarchical clustering there is no provision for reallocating objects that may have been incorrectly grouped at an early stage. Consequently, the final configuration of clusters should always be carefully examined to see whether it is sensible.
• 63. • For a particular problem, it is a good idea to try several clustering methods and, within a given method, a couple of different ways of assigning distances (similarities). If the outcomes from the several methods are (roughly) consistent with one another, perhaps a case for a "natural" grouping can be advanced. • The stability of a hierarchical solution can be checked by applying the clustering algorithm before and after small errors (perturbations) have been added to the data units, as in the sketch below. If the groups are fairly well distinguished, the clusterings before and after perturbation should agree.
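A sketch of this perturbation check in R (the data set, the noise level sd = 0.05 and the choice of k = 3 are assumptions made here, not part of the original slides):
set.seed(2)
X <- matrix(rnorm(40), ncol = 2)                                   # 20 illustrative items
cl_before <- cutree(hclust(dist(X), method = "average"), k = 3)
X_noisy <- X + matrix(rnorm(length(X), sd = 0.05), ncol = 2)       # small added errors
cl_after <- cutree(hclust(dist(X_noisy), method = "average"), k = 3)
table(cl_before, cl_after)    # well-separated groups should map almost one-to-one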
• 64. • Common values (ties) in the similarity or distance matrix can produce multiple solutions to a hierarchical clustering problem; that is, the dendrograms corresponding to different treatments of the tied similarities can be different, particularly at the lower levels. This is not an inherent problem of any particular method; multiple solutions simply occur for certain kinds of data. The user needs to be aware of their existence so that the dendrograms can be properly interpreted.
• 65. The Inversion problem
In the following example, the clustering method joins A and B at distance 20. At the next step, C is added to the group (AB) at distance 32. Next, the clustering algorithm adds D to the group (ABC) at distance 30, a smaller distance than that at which C was added. Inversions can occur when there is no clear cluster structure and are generally associated with two hierarchical clustering procedures known as the centroid method and the median method. The inversion is indicated either by a dendrogram with a crossover or by a dendrogram with a nonmonotonic scale.
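An inversion of this kind can be produced in R with the centroid method. The three coordinates below are made up so that the third point lies closer to the centroid of the first two than those two are to each other:
X <- rbind(A = c(0, 0), B = c(1, 0), C = c(0.5, 0.9))   # nearly equilateral triangle
cen_cl <- hclust(dist(X)^2, method = "centroid")        # centroid linkage on squared distances
cen_cl$height     # the second merger occurs at a lower level than the first: an inversion
plot(cen_cl)      # the dendrogram shows the crossover / nonmonotonic scale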
• 67. Nonhierarchical clustering methods • Nonhierarchical clustering techniques group items, rather than variables, into a collection of K clusters. • K may either be specified in advance or determined as part of the clustering procedure. • Because a distance matrix does not have to be determined and the basic data do not have to be stored, nonhierarchical methods can be applied to much larger data sets than hierarchical techniques.
• 68. The K-means method • MacQueen suggested the term K-means to describe an algorithm of his that assigns each item to the cluster having the nearest centroid (mean). In its simplest version, the process consists of three steps: 1. Partition the items into K initial clusters. 2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. (Distance is usually computed using Euclidean distance with either standardized or unstandardized observations.) Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item. 3. Repeat step 2 until no more reassignments take place.
• 69. Suppose we measure two variables, X1 and X2, for each of four items A, B, C and D. The data are given in the following table:
Item   X1   X2
A       5    3
B      -1    1
C       1   -2
D      -3   -2
The objective is to divide these items into K = 2 clusters such that the items within a cluster are closer to one another than they are to the items in different clusters. To implement the K = 2-means method, we arbitrarily partition the items into two clusters, such as (AB) and (CD), and compute the coordinates (x̄1, x̄2) of the cluster centroids (means). Thus, at Step 1, we have:
Cluster   Coordinates of centroid (x̄1, x̄2)
(AB)      (2, 2)
(CD)      (-1, -2)
• 70. At Step 2, we compute the Euclidean distance of each item from the group centroids and reassign each item to the nearest group. If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding. For item A we compute the squared distances
d²(A, (AB)) = (5 - 2)² + (3 - 2)² = 10
d²(A, (CD)) = (5 + 1)² + (3 + 2)² = 61
Since A is closer to cluster (AB) than to cluster (CD), it is not reassigned. Continuing, for item B we get
d²(B, (AB)) = (-1 - 2)² + (1 - 2)² = 10
d²(B, (CD)) = (-1 + 1)² + (1 + 2)² = 9
and consequently B is reassigned to cluster (CD), giving cluster (BCD) and the following updated coordinates of the centroids:
Cluster   Coordinates of centroid (x̄1, x̄2)
A         (5, 3)
(BCD)     (-1, -1)
Again each item is checked for reassignment. Computing the squared distances gives the following:
                 Item
Cluster       A     B     C     D
A             0    40    41    89
(BCD)        52     4     5     5
• 71. We see that each item is currently assigned to the cluster with the nearest centroid (mean), and the process stops. The final K = 2 clusters are A and (BCD). To check the stability of the clustering, it is desirable to rerun the algorithm with a new initial partition. A table of the cluster centroids (means) and within-cluster variances also helps to delineate group differences.
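The same four-item calculation can be reproduced with R's kmeans function. In this sketch the initial centers are supplied explicitly so that the run mirrors the arbitrary starting partition (AB), (CD); the object names are assumed here for illustration.
X <- rbind(A = c(5, 3), B = c(-1, 1), C = c(1, -2), D = c(-3, -2))
init <- rbind(c(2, 2), c(-1, -2))      # centroids of the starting clusters (AB) and (CD)
km <- kmeans(X, centers = init)
km$cluster     # A ends up alone; B, C and D form the other cluster
km$centers     # final centroids (5, 3) and (-1, -1), as in the hand calculation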
• 72. Further discussions – Non-hierarchical clustering • If two or more seed points inadvertently lie within a single cluster, the resulting clusters will be poorly differentiated. • The existence of an outlier might produce at least one group with very disperse items. • Even if the population is known to consist of K groups, the sampling method may be such that data from the rarest group do not appear in the sample; forcing the data into K groups would then lead to nonsensical clusters. • It is always a good idea to rerun the algorithm for several choices of K and of the initial partition, as in the sketch below.
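A sketch of such reruns in R, using a made-up data set: several values of K are tried, each with 25 random starts, and the total within-cluster sum of squares is compared (the data, the range of K and nstart = 25 are assumptions made here for illustration).
set.seed(3)
X <- matrix(rnorm(100), ncol = 2)                 # 50 illustrative items
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
wss                                               # total within-cluster SS for K = 1, ..., 6
plot(1:6, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")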
  • 73. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 75. 1. Input the variables
  • 76. 2. Input the data into SPSS
  • 77. 3. Go to Analyze -> Classify -> Hierarchical cluster
• 78. 4. Add the necessary variables for classification. The Name variable can be placed here (as a labelling variable) so that the cases are labelled in the output.
• 80. 5. Define plots (the orientation setting applies to the icicle plot of the cases, not to the dendrogram).
• 81. 5. Define the method: choose the clustering criterion and the distance criterion. 6. Continue -> Ok
• 82. Wine Quality data (FA = fixed acidity, VA = volatile acidity, CA = citric acid, RS = residual sugar, CL = chlorides, FSO2 = free sulfur dioxide, TSO2 = total sulfur dioxide, D = density, PH = pH, SO4 = sulphates, A = alcohol, Q = quality)
Name  FA   VA    CA    RS     CL     FSO2  TSO2  D       PH    SO4   A     Q
AA01  7    0.27  0.36  20.7   0.045  45    170   1.001   3     0.45  8.8   6
AA02  6.3  0.3   0.34  1.6    0.049  14    132   0.994   3.3   0.49  9.5   6
AA03  8.1  0.28  0.4   6.9    0.05   30    97    0.9951  3.26  0.44  10.1  6
AA04  7.2  0.23  0.32  8.5    0.058  47    186   0.9956  3.19  0.4   9.9   6
AA05  7.2  0.23  0.32  8.5    0.058  47    186   0.9956  3.19  0.4   9.9   6
AA06  8.1  0.28  0.4   6.9    0.05   30    97    0.9951  3.26  0.44  10.1  6
AA07  6.2  0.32  0.16  7      0.045  30    136   0.9949  3.18  0.47  9.6   6
AA08  7    0.27  0.36  20.7   0.045  45    170   1.001   3     0.45  8.8   6
AA09  6.3  0.3   0.34  1.6    0.049  14    132   0.994   3.3   0.49  9.5   6
AA10  8.1  0.22  0.43  1.5    0.044  28    129   0.9938  3.22  0.45  11    6
AA11  8.1  0.27  0.41  1.45   0.033  11    63    0.9908  2.99  0.56  12    5
AA12  8.6  0.23  0.4   4.2    0.035  17    109   0.9947  3.14  0.53  9.7   5
AA13  7.9  0.18  0.37  1.2    0.04   16    75    0.992   3.18  0.63  10.8  5
AA14  6.6  0.16  0.4   1.5    0.044  48    143   0.9912  3.54  0.52  12.4  7
AA15  8.3  0.42  0.62  19.25  0.04   41    172   1.0002  2.98  0.67  9.7   5
AA16  6.6  0.17  0.38  1.5    0.032  28    112   0.9914  3.25  0.55  11.4  7
AA17  6.3  0.48  0.04  1.1    0.046  30    99    0.9928  3.24  0.36  9.6   6
AA18  6.2  0.66  0.48  1.2    0.029  29    75    0.9892  3.33  0.39  12.8  8
AA19  7.4  0.34  0.42  1.1    0.033  17    171   0.9917  3.12  0.53  11.3  6
AA20  6.5  0.31  0.14  7.5    0.044  34    133   0.9955  3.22  0.5   9.5   5
AA21  6.2  0.66  0.48  1.2    0.029  29    75    0.9892  3.33  0.39  12.8  8
AA22  6.4  0.31  0.38  2.9    0.038  19    102   0.9912  3.17  0.35  11    7
AA23  6.8  0.26  0.42  1.7    0.049  41    122   0.993   3.47  0.48  10.5  8
AA24  7.6  0.67  0.14  1.5    0.074  25    168   0.9937  3.05  0.51  9.3   5
AA25  6.6  0.27  0.41  1.3    0.052  16    142   0.9951  3.42  0.47  10    6
  • 85. Data analysis using statistical packages - RStudio
• 86. Hierarchical Clustering – R Script
library(readxl)
# attach the wine data
winedata <- read_excel(file.choose())
winedata
wine1 <- as.data.frame(winedata[1:25, ])   # first 25 samples, as a data frame so row names can be set
tail(wine1)
rownames(wine1) <- wine1$Name              # sample labels (column shown as "Name" on slide 82; adjust if the sheet uses another column name)
wine1 <- wine1[, -1]                       # drop the label column: dist() needs numeric variables only
# Finding the distance matrix
distance_mat <- dist(wine1, method = "euclidean")
distance_mat
any(is.na(distance_mat))                   # check for missing distances
# Fitting the hierarchical clustering model
set.seed(240)                              # setting seed (hclust itself is deterministic)
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl
# Plotting the dendrogram
plot(Hierar_cl)
# Choosing the number of clusters
abline(h = 25, col = "green")              # cutting the tree by height
fit <- cutree(Hierar_cl, k = 6)            # cutting the tree into k = 6 clusters
fit
table(fit)
rect.hclust(Hierar_cl, k = 6, border = "green")
• 87. Method codes for clustering <hclust(data, method = "")>
• "ward.D": Ward's minimum variance method
• "ward.D2": Ward's minimum variance method (using squared Euclidean distances)
• "single": single linkage method (nearest neighbour)
• "complete": complete linkage method (farthest neighbour)
• "average": UPGMA method (Unweighted Pair Group Method with Arithmetic Mean)
• "mcquitty": WPGMA method (Weighted Pair Group Method with Arithmetic Mean)
• "median": WPGMC method (Weighted Pair Group Method with Centroid)
• "centroid": UPGMC method (Unweighted Pair Group Method with Centroid)
Distance codes for the distance matrix <dist(data, method = "")>
• "euclidean": Euclidean distance
• "maximum": maximum distance
• "manhattan": Manhattan distance
• "canberra": Canberra distance
• "binary": binary distance
• "minkowski": Minkowski distance
  • 89. 1. Introduction 2. Review of Literature 3. Methodology 4. Data analysis using statistical packages 5. References Contents
  • 91. Books • Rencher AC. 2012. Methods of Multivariate Analysis. 3rd Ed. John Wiley • Srivastava MS and Khatri CG. 1979. An Introduction to Multivariate Statistics. North Holland • Johnson RA and Wichern DW. 1988. Applied Multivariate Statistical Analysis. Prentice Hall
• 92. Articles • Dosoky, N.S., Satyal, P., Barata, L.M., da Silva, J.K.R. and Setzer, W.N., 2019. Volatiles of black pepper fruits (Piper nigrum L.). Molecules, 24(23), p.4244. • Talekar, S.C., Praveena, M.V. and Satish, R.G., 2022. Genetic diversity using principal component analysis and hierarchical cluster analysis in rice. International Journal of Plant Sciences, 17(2), pp.191-196. • Siva, G.S., Rao, V.S. and Babu, D.R., 2014. Cluster Analysis Approach to Study the Rainfall Pattern in Visakhapatnam District. Weekly Science Research Journal, 1, p.31.
• 93. Articles • Rao, N.S. and Chaudhary, M., 2019. Hydrogeochemical processes regulating the spatial distribution of groundwater contamination, using pollution index of groundwater (PIG) and hierarchical cluster analysis (HCA): a case study. Groundwater for Sustainable Development, 9, p.100238. • Kumari, C.U., Prasad, S.J. and Mounika, G., 2019, March. Leaf disease detection: feature extraction with K-means clustering and classification with ANN. In 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC) (pp. 1095-1098). IEEE.
  • 94. Data source for wine data • Analysis of Wine Quality Data | STAT 508 (psu.edu)
  • 95. Codes • Hierarchical Clustering in R Programming – GeeksforGeeks • Hierarchical Clustering in R: Step-by-Step Example - Statology