Clustering:
K-means Clustering
• With the abundance of raw data and the need for analysis, the concept
of unsupervised learning became popular over time.
• The main goal of unsupervised learning is to discover hidden and
interesting patterns in unlabelled data.
• The most common unsupervised learning technique is clustering.
• Typical applications: grouping documents by topic, market segmentation,
statistical data analysis, social network analysis, image segmentation,
anomaly detection, etc.
• Amazon uses clustering in its recommendation system to suggest products
based on a user's past searches.
• Netflix uses the same technique to recommend movies and web series to its
users based on their watch history.
K-means Clustering:
• K-Means Clustering is an unsupervised learning algorithm that groups an
unlabeled dataset into different clusters.
• K defines the number of clusters to be created in the process: if K=2,
there will be two clusters; for K=3, there will be three clusters, and so on.
• It is a centroid-based algorithm, where each cluster is associated with
a centroid.
• The main aim of the algorithm is to minimize the sum of distances between
the data points and the centroids of their assigned clusters.
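A standard way to write this objective (the notation here is assumed, not taken from the slides) is the within-cluster sum of squared distances:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

where C_k is the set of points assigned to cluster k and mu_k is its centroid; the assignment and centroid-update steps below each reduce J.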
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids (they need not come
from the input dataset).
Step-3: Assign each data point to its closest centroid; this forms the K
clusters.
Step-4: Compute a new centroid for each cluster (the mean of the points
assigned to it).
Step-5: Repeat Step-3, i.e. reassign each data point to the new closest
centroid.
Step-6: If any reassignment occurred, go back to Step-4; otherwise FINISH.
Step-7: The model is ready.
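A minimal NumPy sketch of the steps above; the names (kmeans, max_iters, seed) and the synthetic data are illustrative, not taken from the slides.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for it in range(max_iters):
        # Step-3 / Step-5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step-6: stop when no reassignment occurs
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Tiny usage example on two synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)
```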
Elbow Method:
• Used to find the optimal number of clusters K.
• The method is based on the WCSS value.
• WCSS stands for Within-Cluster Sum of Squares, the total variation within
the clusters.
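A sketch of the elbow method using scikit-learn: fit K-means for a range of K and record the WCSS, which sklearn exposes as the inertia_ attribute. The data is synthetic; the "elbow" is the K where the curve stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1, (50, 2)) for c in (0, 5, 10)])

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}: WCSS={km.inertia_:.1f}")
# Plotting WCSS against K and picking the bend of the curve gives the elbow.
```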
Hierarchical Clustering, Choosing the Number of Clusters
• Hierarchical clustering is another unsupervised machine learning algorithm
used to group unlabeled datasets into clusters; it is also known as
hierarchical cluster analysis (HCA).
• It develops a hierarchy of clusters in the form of a tree; this tree-shaped
structure is known as a dendrogram.
• We do not need to know the number of clusters in advance.
• To group the data into clusters, it follows a bottom-up (agglomerative)
approach.
• The algorithm treats each data point as a single cluster at the beginning.
Step-1: Treat each data point as a single cluster. If there are N data
points, there are N clusters at the start.
Step-2: Take the two closest data points or clusters and merge them into one
cluster, leaving N-1 clusters.
Step-3: Again take the two closest clusters and merge them, leaving N-2
clusters.
Step-4: Repeat Step-3 until only one cluster is left.
Step-5: Once all the clusters are combined into one big cluster, use the
dendrogram to split them back into the number of clusters the problem requires.
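A sketch of the bottom-up merging above using SciPy's agglomerative tools: linkage() records the N-1 merges and dendrogram() draws the resulting tree. The data and the choice of 'ward' linkage are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# Each row of Z is one merge: (cluster_a, cluster_b, distance, size)
Z = linkage(X, method="ward")

dendrogram(Z)
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()
```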
To decide the number of clusters from the dendrogram, a simple strategy is to
look for the longest vertical line that is not crossed by any (extended)
horizontal line. Draw a horizontal cut through that stretch and count how many
vertical lines it intersects; that count is the number of clusters. The idea is
to cut where the distance between successive merges is largest, meaning the
clusters being separated are the most dissimilar and should remain separate.
However, the decision also depends on the context and the specific problem you
are trying to solve; domain knowledge can also help in choosing the number of
clusters.
In short: locate the largest vertical gap between merges in the dendrogram,
pass a horizontal line through its middle, and count the vertical lines it
intersects; that is the chosen number of clusters.
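Turning the chosen cut into flat cluster labels can be done with SciPy's fcluster(); the cut height t=10.0 below is an illustrative value, in practice it is read off the dendrogram at the largest vertical gap between merges.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

Z = linkage(X, method="ward")
# Cut the tree at height t; every merge above the cut stays separate.
labels = fcluster(Z, t=10.0, criterion="distance")
print("number of clusters:", len(np.unique(labels)))
```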
Measuring the distance between two clusters:
• How the distance between two clusters is measured is crucial for
hierarchical clustering.
• There are various ways to calculate this distance, and the choice decides
the rule for merging clusters.
• These measures are called linkage methods.
Single Linkage: the shortest distance between the closest points of the two
clusters.
Complete Linkage: the farthest distance between two points in the two
different clusters.
Average Linkage: the distances between every pair of points (one from each
cluster) are summed and divided by the number of pairs, giving the average
distance between the two clusters.
Centroid Linkage: the distance between the centroids of the two clusters.
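The four linkage methods above map directly onto SciPy's `method` argument. A small sketch comparing them on the same synthetic data; only the merge distances change, the bottom-up procedure is identical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(4, 1, (15, 2))])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)
    # Last row of the linkage matrix is the final merge of the two big clusters
    print(f"{method:>8} linkage: height of final merge = {Z[-1, 2]:.2f}")
```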
