Clustering
Clustering in R Programming Language is an
unsupervised learning technique in which the
data set is partitioned into several groups, called
clusters, based on the similarity of the observations.
Segmenting the data in this way produces several
clusters, and all the objects within a cluster share
common characteristics.
During data mining and analysis, clustering is
used to find groups of similar data points.
Applications of Clustering in R Programming Language
• Marketing: Clustering is widely used in marketing. It helps
uncover market patterns and thereby identify likely buyers.
Grouping customers by their interests and showing them
products that match those interests can increase the chance
of a purchase.
• Medical Science: New medicines and treatments are
developed continuously, and researchers sometimes
discover new species. Their category can be determined
with a clustering algorithm based on their similarity to
known groups.
• Games: A clustering algorithm can be used to recommend
games to a user based on their interests.
• Internet: A user browses many websites according to their
interests. The browsing history can be aggregated and
clustered, and the clustering results are used to build a
profile of the user.
Methods of Clustering
• There are 2 types of clustering in R programming:
• Hard clustering: Each data point either belongs to a
cluster completely or not at all, and every data point
is assigned to exactly one cluster.
• The typical algorithm for hard clustering is k-means
clustering.
• Soft clustering: Instead of placing each data point in a
single cluster, a probability or likelihood of membership
in each cluster is assigned to every data point.
• Each data point therefore belongs to every cluster with
some probability.
• Typical algorithms for soft clustering are fuzzy clustering
methods such as fuzzy c-means (soft k-means); a minimal
sketch follows below.
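The slides show code only for hard clustering. As an illustrative sketch of soft clustering (not part of the original code, and assuming the e1071 package is installed), the cmeans() function implements fuzzy c-means; the choice of 3 clusters and fuzzifier m = 2 here is arbitrary:

# Illustrative fuzzy c-means sketch using e1071::cmeans()
install.packages("e1071")
library(e1071)

x <- scale(mtcars)                   # standardise the built-in mtcars data
set.seed(123)                        # reproducible random initialisation
fc <- cmeans(x, centers = 3, m = 2)  # 3 fuzzy clusters, fuzzifier m = 2

head(fc$membership)  # membership probability of each car in each cluster
fc$cluster           # hard assignment: the cluster with highest membership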
Find The Distance
The dist() function computes pairwise distances between the rows of a matrix or data frame; the method argument selects the distance metric. The result prints as a lower-triangular distance matrix.

# A 4 x 4 numeric matrix (each row is one observation)
m <- matrix(1:16, nrow = 4)
m

# Pairwise distances between the rows under different metrics
dist(m, method = "euclidean")
dist(m, method = "manhattan")
dist(m, method = "maximum")
dist(m, method = "canberra")
dist(m, method = "minkowski")

# Distance between two specific cars in the built-in mtcars data
x <- mtcars["Honda Civic", ]
x
y <- mtcars["Camaro Z28", ]
y
dist(rbind(x, y))

# Full distance matrix over all 32 cars
dist(as.matrix(mtcars))
K-Means Clustering in R Programming Language
• K-Means is an iterative hard clustering technique
based on an unsupervised learning algorithm.
• The total number of clusters, K, is pre-defined by
the user, and the data points are then grouped into
clusters according to their similarity.
• The algorithm also computes the centroid of each
cluster.
• Specify the number of clusters (K): as an example, take k = 2
and 5 data points.
• Randomly assign each data point to a cluster: in the original
illustration, red and green mark the 2 clusters with their
randomly assigned data points.
• Calculate the cluster centroids: the cross marks represent the
centroid of each cluster.
• Re-allocate each data point to its nearest cluster centroid: a
green data point is reassigned to the red cluster when it is
closer to the red centroid.
• Re-compute the cluster centroids and repeat until the
assignments stop changing. A minimal sketch of these steps
on a toy example follows this list.
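A minimal sketch of the procedure above on five illustrative 2-D points with k = 2 (the data values are invented for this sketch; kmeans() carries out the iteration internally):

# Five illustrative 2-D data points (values made up for this sketch)
pts <- data.frame(x = c(1, 1.5, 5, 6, 5.5),
                  y = c(1, 2,   7, 8, 7.5))

set.seed(42)                     # reproducible random initial assignment
km2 <- kmeans(pts, centers = 2)  # K = 2 clusters

km2$cluster                      # cluster assigned to each point
km2$centers                      # final cluster centroids
plot(pts, col = km2$cluster, pch = 19)
points(km2$centers, pch = 4, cex = 2)  # cross marks at the centroids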
• Syntax: kmeans(x, centers, nstart)
• where,
• x is a numeric matrix or data frame,
• centers is the number of clusters K (or a set of initial cluster centers),
• nstart is the number of random starting configurations to try; the best solution is kept.
install.packages("factoextra")
library(factoextra)

# Loading the dataset
df <- mtcars
df

# Omitting any NA values
df <- na.omit(df)
df

# Scaling the dataset so all variables are on a comparable scale
df <- scale(df)
df

# Write the output to a PNG file
png(file = "KMeansExample.png")
km <- kmeans(df, centers = 4, nstart = 25)
km

# Visualize the clusters
fviz_cluster(km, data = df)

# Close the device and save the file
dev.off()

# Repeat with 5 clusters, written to a second PNG file
png(file = "KMeansExample2.png")
km <- kmeans(df, centers = 5, nstart = 25)
km

# Visualize the clusters
fviz_cluster(km, data = df)

# Close the device and save the file
dev.off()
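To help choose between different values of K, the factoextra package also provides fviz_nbclust(); a brief sketch, added here and not part of the original code:

# Elbow plot: total within-cluster sum of squares for a range of K
fviz_nbclust(df, kmeans, method = "wss")

# Average silhouette width for each K; higher is better
fviz_nbclust(df, kmeans, method = "silhouette")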
# Load the built-in iris data and inspect it
data("iris")
head(iris)
nrow(iris)

# Copy the data and drop the Species label so only numeric features remain
i1 <- iris
i1
i1$Species <- NULL
head(i1)

# K-means with 3 clusters (one per species, for comparison)
res <- kmeans(i1, 3)
res

# Plot petal measurements coloured by cluster, then by true species
plot(iris[c("Petal.Length", "Petal.Width")], col = res$cluster)
plot(iris[c("Petal.Length", "Petal.Width")], col = iris$Species)

# Compare cluster assignments against the true species
table(iris$Species, res$cluster)

# The same comparison on the sepal measurements
plot(iris[c("Sepal.Length", "Sepal.Width")], col = res$cluster)
plot(iris[c("Sepal.Length", "Sepal.Width")], col = iris$Species)
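For consistency with the scaled mtcars example above, the iris features can also be standardised and clustered with multiple random starts; a small sketch, added here as a variant of the original code:

# Standardise the features, then cluster with 25 random starts
set.seed(240)
res_scaled <- kmeans(scale(i1), centers = 3, nstart = 25)
table(iris$Species, res_scaled$cluster)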
Hierarchical Clustering in R Programming
• Hierarchical clustering in R Programming Language is an
unsupervised, non-linear algorithm in which clusters are
created so that they form a hierarchy (a nested ordering).
• For example, consider a family of up to three generations.
A grandfather and grandmother have children, who in turn
become the fathers and mothers of their own children.
• They are all grouped into the same family, i.e. they form a
hierarchy.
• Hierarchical clustering is of two types:
• Agglomerative Hierarchical clustering: it starts from the
individual leaves and successively merges clusters
together. It is a bottom-up approach.
• Divisive Hierarchical clustering: it starts at the root and
recursively splits the clusters. It is a top-down approach.
(A short divisive-clustering sketch follows below; the
agglomerative approach is worked through with hclust() later.)
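The worked example later in this deck uses the agglomerative approach. As an illustrative sketch of the divisive approach (not covered by the original code, and assuming the cluster package is available, as it ships with standard R installations), diana() can be used:

# Divisive hierarchical clustering with diana() from the cluster package
library(cluster)

dv <- diana(mtcars)   # top-down clustering of the mtcars data
pltree(dv, main = "Divisive clustering of mtcars")  # plot the dendrogram

# Cut the divisive tree into 3 clusters
cutree(as.hclust(dv), k = 3)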
• In hierarchical clustering, objects are organised into
a hierarchy with a tree-shaped structure (a dendrogram),
which is used to interpret the model. The agglomerative
algorithm is as follows:
• Make each data point a single-point cluster, giving
N clusters.
• Take the two closest data points and merge them into
one cluster, giving N-1 clusters.
• Take the two closest clusters and merge them into
one cluster, giving N-2 clusters.
• Repeat the previous step until only one cluster remains.
• A dendrogram is a tree diagram of the cluster hierarchy
in which merge distances are shown as heights.
• It groups n units (objects), each described by p features,
into smaller clusters.
• Units in the same cluster are joined by a horizontal line,
and the leaves at the bottom represent the individual units.
• It provides a visual representation of the clusters.
Rule of thumb: the largest vertical distance that can be
crossed without cutting any horizontal line suggests the
optimal number of clusters.
• The Dataset
• mtcars (Motor Trend Car Road Tests) comprises fuel
consumption, performance, and 10 aspects of automobile
design for 32 automobiles. It is one of R's built-in
datasets, so it is available without installing any package;
the dplyr install shown below is not strictly required.

# Installing the package
install.packages("dplyr")

# Loading the package
library(dplyr)

# First rows of the dataset
head(mtcars)
• Performing Hierarchical clustering on the Dataset
• The hierarchical clustering algorithm is applied to the
dataset using hclust(), which comes with the stats package
included in every R installation.

# Finding the distance matrix
distance_mat <- dist(mtcars, method = 'euclidean')
distance_mat

• The values shown are the pairwise Euclidean distances
between the cars.
• Model Hierar_cl:

# Fitting the Hierarchical clustering model
# to the dataset
set.seed(240)  # Setting seed
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl

• In the fitted model, the cluster method is average linkage,
the distance is Euclidean, and the number of objects is 32.
Other linkage methods are sketched below.
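hclust() supports several other linkage methods besides "average", including "single", "complete", and "ward.D2". A brief illustrative sketch, added here and not part of the original code:

# The same distance matrix clustered with other linkage criteria
Hierar_complete <- hclust(distance_mat, method = "complete")
Hierar_ward     <- hclust(distance_mat, method = "ward.D2")

# Compare the resulting dendrograms
plot(Hierar_complete)
plot(Hierar_ward)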
• Plot the dendrogram:

# Plotting the dendrogram
plot(Hierar_cl)

# Choosing the number of clusters
# by cutting the tree at a given height
abline(h = 110, col = "green")

• The dendrogram shows the observations (car names) along the x-axis and the
merge height on the y-axis; the green line marks the cut height.
• Cutting the tree:

# Cutting the tree into a given number of clusters
fit <- cutree(Hierar_cl, k = 3)
fit

• The tree is cut at k = 3, and each observation is assigned a cluster label
from 1 to 3.
• Plotting the dendrogram after cutting:

table(fit)
rect.hclust(Hierar_cl, k = 3, border = "green")

• The plot shows the dendrogram after the cut: the green rectangles drawn by
rect.hclust() outline the 3 clusters, consistent with the rule of thumb above.
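As a short follow-up sketch (an addition, not part of the original slides), the cluster labels from cutree() can be attached back to the data and summarised:

# Attach the cluster label to each car
mtcars_clustered <- cbind(mtcars, cluster = fit)
head(mtcars_clustered)

# Mean of each variable within each cluster
aggregate(mtcars, by = list(cluster = fit), FUN = mean)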
