Clustering
Clustering in R Programming Language is an
unsupervised learning technique in which the
data set is partitioned into several groups, called
clusters, based on the similarity of the observations.
Segmenting the data in this way produces several
clusters, and all the objects within a cluster share
common characteristics.
During data mining and analysis, clustering is
used to find groups of similar data points.
Applications of Clustering in R Programming Language
• Marketing: Clustering is widely used in marketing. It helps
uncover market patterns and thereby identify likely buyers.
Grouping customers by their interests and showing them
products that match those interests can increase the chance
of a purchase.
• Medical Science: New medicines and treatments are
developed continuously, and researchers sometimes
discover new species. Their category can be determined
with a clustering algorithm based on their similarity to
known groups.
• Games: A clustering algorithm can be used to recommend
games to a user based on their interests.
• Internet: A user browses many websites according to their
interests. The browsing history can be aggregated and
clustered, and the clustering results are used to build a
profile of the user.
Methods of Clustering
• There are 2 types of clustering in R programming:
• Hard clustering: Each data point either belongs to a
cluster completely or not at all, and every data point
is assigned to exactly one cluster.
• The typical algorithm for hard clustering is k-means
clustering.
• Soft clustering: Instead of placing each data point in a
single cluster, a probability or likelihood of membership
in each cluster is assigned to every data point.
• Each data point therefore belongs to every cluster with
some probability.
• Typical algorithms for soft clustering are fuzzy clustering
methods such as fuzzy c-means (soft k-means); a minimal
sketch follows below.
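The slides show code only for hard clustering. As an illustrative sketch of soft clustering (not part of the original code, and assuming the e1071 package is installed), the cmeans() function implements fuzzy c-means; the choice of 3 clusters and fuzzifier m = 2 here is arbitrary:

# Illustrative fuzzy c-means sketch using e1071::cmeans()
install.packages("e1071")
library(e1071)

x <- scale(mtcars)                   # standardise the built-in mtcars data
set.seed(123)                        # reproducible random initialisation
fc <- cmeans(x, centers = 3, m = 2)  # 3 fuzzy clusters, fuzzifier m = 2

head(fc$membership)  # membership probability of each car in each cluster
fc$cluster           # hard assignment: the cluster with highest membership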
Find The Distance
The dist() function computes pairwise distances between the rows of a matrix or data frame; the method argument selects the distance metric. The result prints as a lower-triangular distance matrix.

# A 4 x 4 numeric matrix (each row is one observation)
m <- matrix(1:16, nrow = 4)
m

# Pairwise distances between the rows under different metrics
dist(m, method = "euclidean")
dist(m, method = "manhattan")
dist(m, method = "maximum")
dist(m, method = "canberra")
dist(m, method = "minkowski")

# Distance between two specific cars in the built-in mtcars data
x <- mtcars["Honda Civic", ]
x
y <- mtcars["Camaro Z28", ]
y
dist(rbind(x, y))

# Full distance matrix over all 32 cars
dist(as.matrix(mtcars))
K-Means Clustering in R Programming Language
• K-Means is an iterative hard clustering technique
based on an unsupervised learning algorithm.
• The total number of clusters, K, is pre-defined by
the user, and the data points are then grouped into
clusters according to their similarity.
• The algorithm also computes the centroid of each
cluster.
• Specify the number of clusters (K): as an example, take k = 2
and 5 data points.
• Randomly assign each data point to a cluster: in the original
illustration, red and green mark the 2 clusters with their
randomly assigned data points.
• Calculate the cluster centroids: the cross marks represent the
centroid of each cluster.
• Re-allocate each data point to its nearest cluster centroid: a
green data point is reassigned to the red cluster when it is
closer to the red centroid.
• Re-compute the cluster centroids and repeat until the
assignments stop changing. A minimal sketch of these steps
on a toy example follows this list.
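A minimal sketch of the procedure above on five illustrative 2-D points with k = 2 (the data values are invented for this sketch; kmeans() carries out the iteration internally):

# Five illustrative 2-D data points (values made up for this sketch)
pts <- data.frame(x = c(1, 1.5, 5, 6, 5.5),
                  y = c(1, 2,   7, 8, 7.5))

set.seed(42)                     # reproducible random initial assignment
km2 <- kmeans(pts, centers = 2)  # K = 2 clusters

km2$cluster                      # cluster assigned to each point
km2$centers                      # final cluster centroids
plot(pts, col = km2$cluster, pch = 19)
points(km2$centers, pch = 4, cex = 2)  # cross marks at the centroids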
• Syntax: kmeans(x, centers, nstart)
• where,
• x is a numeric matrix or data frame,
• centers is the number of clusters K (or a set of initial cluster centers),
• nstart is the number of random starting configurations to try; the best solution is kept.
install.packages("factoextra")
library(factoextra)

# Loading the dataset
df <- mtcars
df

# Omitting any NA values
df <- na.omit(df)
df

# Scaling the dataset so all variables are on a comparable scale
df <- scale(df)
df

# Write the output to a PNG file
png(file = "KMeansExample.png")
km <- kmeans(df, centers = 4, nstart = 25)
km

# Visualize the clusters
fviz_cluster(km, data = df)

# Close the device and save the file
dev.off()

# Repeat with 5 clusters, written to a second PNG file
png(file = "KMeansExample2.png")
km <- kmeans(df, centers = 5, nstart = 25)
km

# Visualize the clusters
fviz_cluster(km, data = df)

# Close the device and save the file
dev.off()
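To help choose between different values of K, the factoextra package also provides fviz_nbclust(); a brief sketch, added here and not part of the original code:

# Elbow plot: total within-cluster sum of squares for a range of K
fviz_nbclust(df, kmeans, method = "wss")

# Average silhouette width for each K; higher is better
fviz_nbclust(df, kmeans, method = "silhouette")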
# Load the built-in iris data and inspect it
data("iris")
head(iris)
nrow(iris)

# Copy the data and drop the Species label so only numeric features remain
i1 <- iris
i1
i1$Species <- NULL
head(i1)

# K-means with 3 clusters (one per species, for comparison)
res <- kmeans(i1, 3)
res

# Plot petal measurements coloured by cluster, then by true species
plot(iris[c("Petal.Length", "Petal.Width")], col = res$cluster)
plot(iris[c("Petal.Length", "Petal.Width")], col = iris$Species)

# Compare cluster assignments against the true species
table(iris$Species, res$cluster)

# The same comparison on the sepal measurements
plot(iris[c("Sepal.Length", "Sepal.Width")], col = res$cluster)
plot(iris[c("Sepal.Length", "Sepal.Width")], col = iris$Species)
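For consistency with the scaled mtcars example above, the iris features can also be standardised and clustered with multiple random starts; a small sketch, added here as a variant of the original code:

# Standardise the features, then cluster with 25 random starts
set.seed(240)
res_scaled <- kmeans(scale(i1), centers = 3, nstart = 25)
table(iris$Species, res_scaled$cluster)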
Hierarchical Clustering in R Programming
• Hierarchical clustering in R Programming Language is an
unsupervised, non-linear algorithm in which clusters are
created so that they form a hierarchy (a nested ordering).
• For example, consider a family of up to three generations.
A grandfather and grandmother have children, who in turn
become the fathers and mothers of their own children.
• They are all grouped into the same family, i.e. they form a
hierarchy.
• Hierarchical clustering is of two types:
• Agglomerative Hierarchical clustering: it starts from the
individual leaves and successively merges clusters
together. It is a bottom-up approach.
• Divisive Hierarchical clustering: it starts at the root and
recursively splits the clusters. It is a top-down approach.
(A short divisive-clustering sketch follows below; the
agglomerative approach is worked through with hclust() later.)
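The worked example later in this deck uses the agglomerative approach. As an illustrative sketch of the divisive approach (not covered by the original code, and assuming the cluster package is available, as it ships with standard R installations), diana() can be used:

# Divisive hierarchical clustering with diana() from the cluster package
library(cluster)

dv <- diana(mtcars)   # top-down clustering of the mtcars data
pltree(dv, main = "Divisive clustering of mtcars")  # plot the dendrogram

# Cut the divisive tree into 3 clusters
cutree(as.hclust(dv), k = 3)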
• In hierarchical clustering, objects are organised into
a hierarchy with a tree-shaped structure (a dendrogram),
which is used to interpret the model. The agglomerative
algorithm is as follows:
• Make each data point a single-point cluster, giving
N clusters.
• Take the two closest data points and merge them into
one cluster, giving N-1 clusters.
• Take the two closest clusters and merge them into
one cluster, giving N-2 clusters.
• Repeat the previous step until only one cluster remains.
• A dendrogram is a tree diagram of the cluster hierarchy
in which merge distances are shown as heights.
• It groups n units (objects), each described by p features,
into smaller clusters.
• Units in the same cluster are joined by a horizontal line,
and the leaves at the bottom represent the individual units.
• It provides a visual representation of the clusters.
Rule of thumb: the largest vertical distance that can be
crossed without cutting any horizontal line suggests the
optimal number of clusters.
• The Dataset
• mtcars (Motor Trend Car Road Tests) comprises fuel
consumption, performance, and 10 aspects of automobile
design for 32 automobiles. It is one of R's built-in
datasets, so it is available without installing any package;
the dplyr install shown below is not strictly required.

# Installing the package
install.packages("dplyr")

# Loading the package
library(dplyr)

# First rows of the dataset
head(mtcars)
• Performing Hierarchical clustering on the Dataset
• The hierarchical clustering algorithm is applied to the
dataset using hclust(), which comes with the stats package
included in every R installation.

# Finding the distance matrix
distance_mat <- dist(mtcars, method = 'euclidean')
distance_mat

• The values shown are the pairwise Euclidean distances
between the cars.
• Model Hierar_cl:

# Fitting the Hierarchical clustering model
# to the dataset
set.seed(240)  # Setting seed
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl

• In the fitted model, the cluster method is average linkage,
the distance is Euclidean, and the number of objects is 32.
Other linkage methods are sketched below.
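hclust() supports several other linkage methods besides "average", including "single", "complete", and "ward.D2". A brief illustrative sketch, added here and not part of the original code:

# The same distance matrix clustered with other linkage criteria
Hierar_complete <- hclust(distance_mat, method = "complete")
Hierar_ward     <- hclust(distance_mat, method = "ward.D2")

# Compare the resulting dendrograms
plot(Hierar_complete)
plot(Hierar_ward)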
• Plot the dendrogram:

# Plotting the dendrogram
plot(Hierar_cl)

# Choosing the number of clusters
# by cutting the tree at a given height
abline(h = 110, col = "green")

• The dendrogram shows the observations (car names) along the x-axis and the
merge height on the y-axis; the green line marks the cut height.
• Cutting the tree:

# Cutting the tree into a given number of clusters
fit <- cutree(Hierar_cl, k = 3)
fit

• The tree is cut at k = 3, and each observation is assigned a cluster label
from 1 to 3.
• Plotting the dendrogram after cutting:

table(fit)
rect.hclust(Hierar_cl, k = 3, border = "green")

• The plot shows the dendrogram after the cut: the green rectangles drawn by
rect.hclust() outline the 3 clusters, consistent with the rule of thumb above.
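As a short follow-up sketch (an addition, not part of the original slides), the cluster labels from cutree() can be attached back to the data and summarised:

# Attach the cluster label to each car
mtcars_clustered <- cbind(mtcars, cluster = fit)
head(mtcars_clustered)

# Mean of each variable within each cluster
aggregate(mtcars, by = list(cluster = fit), FUN = mean)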
