Clustering

Clustering Algorithms: An Introduction

Classification: a method of supervised learning. It learns a method for predicting the instance class from pre-labeled (classified) instances.

Clustering: a method of unsupervised learning. It finds a “natural” grouping of instances given unlabeled data.

Clustering Methods: there are many different methods and algorithms. They differ in whether they handle numeric and/or symbolic data, and in whether they are deterministic vs. probabilistic, exclusive vs. overlapping, hierarchical vs. flat, and top-down vs. bottom-up.

Clusters: exclusive vs. overlapping [Figure: instances a to k grouped into clusters]

Example of Outlier [Figure: scatter of points marked x, with one point labeled Outlier]

Methods of Clustering. Hierarchical (agglomerative): initially, each point is in a cluster by itself; repeatedly combine the two “nearest” clusters into one. Point assignment: maintain a set of clusters; place each point into its “nearest” cluster.

Hierarchical clustering. Bottom-up: start with single-instance clusters; at each step, join the two closest clusters. Design decision: how to measure the distance between clusters, e.g. the distance between the two closest instances in the clusters vs. the distance between the cluster means. Top-down: start with one universal cluster, split it into two clusters, and proceed recursively on each subset; this can be very fast. Both methods produce a dendrogram.
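
The bottom-up procedure can be sketched in a few lines of Python. This is a minimal illustration, not an efficient or authoritative implementation: the function names (`single_link`, `mean_link`, `agglomerate`), the stopping rule (stop at `k` clusters) and the sample points are assumptions made for the example. The `cluster_distance` parameter is where the design decision about the distance between clusters is made.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Distance between the two closest instances of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def mean_link(a, b):
    """Distance between the means of clusters a and b."""
    def mean(c):
        return tuple(sum(xs) / len(c) for xs in zip(*c))
    return dist(mean(a), mean(b))


def agglomerate(points, k, cluster_distance=single_link):
    """Join the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # start with single-instance clusters
    merges = []                       # merge history: the dendrogram, bottom-up
    while len(clusters) > k:
        # Find the pair of clusters with the smallest distance between them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical sample data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerate(points, k=2)
```

Passing `cluster_distance=mean_link` switches from the closest-instances rule to the distance between means without changing the rest of the loop.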

Incremental clustering. A heuristic approach (COBWEB/CLASSIT) that forms a hierarchy of clusters incrementally. Start: the tree consists of an empty root node. Then: add instances one by one, updating the tree appropriately at each stage. To update, find the right leaf for the instance; this may involve restructuring the tree. Update decisions are based on category utility.

And in the Non-Euclidean Case? The only “locations” we can talk about are the points themselves, i.e., there is no “average” of two points. Approach 1: clustroid = the point “closest” to the other points. Treat the clustroid as if it were the centroid when computing intercluster distances.

“Closest” Point? Possible meanings: smallest maximum distance to the other points; smallest average distance to the other points; smallest sum of squares of distances to the other points; etc.
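
As a rough sketch, the three meanings of “closest” can be written as interchangeable scoring rules. The function names, the criterion labels (`"max"`, `"avg"`, `"sum_sq"`), the choice of Hamming distance and the sample words are all assumptions made for illustration; any distance function could be passed in.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def clustroid(points, distance, criterion="sum_sq"):
    """Return the member of points that is 'closest' to the other members."""
    score = {
        "max": lambda ds: max(ds),                    # smallest maximum distance
        "avg": lambda ds: sum(ds) / len(ds),          # smallest average distance
        "sum_sq": lambda ds: sum(d * d for d in ds),  # smallest sum of squares
    }[criterion]

    def cost(i):
        # Distances from point i to every other point in the cluster.
        return score([distance(points[i], q)
                      for j, q in enumerate(points) if j != i])

    return points[min(range(len(points)), key=cost)]


# Hypothetical cluster of strings, where no "average" point exists.
words = ["karolin", "kathrin", "kerstin", "karolyn"]
best = clustroid(words, hamming, criterion="max")
```

The three criteria can disagree on other data, which is why the choice has to be made explicitly.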

k-Means Algorithm(s). Assumes a Euclidean space. Start by picking k, the number of clusters. Initialize the clusters by picking one point per cluster. Example: pick one point at random, then k-1 other points, each as far away as possible from the previously chosen points.
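
One way to read “as far away as possible from the previous points” is to pick, each time, the point whose nearest already-chosen point is farthest away. The sketch below assumes that reading; the function name `farthest_first`, the `rng` parameter and the sample points are hypothetical.

```python
import random
from math import dist


def farthest_first(points, k, rng=random):
    """Pick k initial points: one at random, the rest far from those chosen."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Choose the point whose distance to its nearest chosen center is largest.
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


# Hypothetical data: three well-separated groups.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0), (0.0, 10.0)]
centers = farthest_first(points, k=3, rng=random.Random(0))
```

An already-chosen point has distance zero to itself, so it is never picked twice as long as the data contains at least k distinct points.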

Populating Clusters. For each point, place it in the cluster whose current centroid is nearest to it. After all points are assigned, fix the centroids of the k clusters. Optional: reassign all points to their closest centroid; this sometimes moves points between clusters.

Simple Clustering: K-means (works with numeric data only)
1. Pick a number (K) of cluster centers (at random).
2. Assign every item to its nearest cluster center (e.g. using Euclidean distance).
3. Move each cluster center to the mean of its assigned items.
4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold).
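
The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: initial centers are sampled from the data, convergence means that no assignment changed (a threshold of zero), and the names `k_means`, `mean`, `max_iter` and `rng` are chosen for the example.

```python
import random
from math import dist


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def k_means(points, k, max_iter=100, rng=random):
    centers = rng.sample(points, k)  # pick K cluster centers at random
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Move each cluster center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean(members)
    return centers, assignment


# Hypothetical data: two groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = k_means(points, k=2, rng=random.Random(0))
```

The result depends on the random starting centers, so running the function with several seeds and comparing the outcomes is a common precaution.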

K-means clustering summary. Advantages: simple and understandable; items are automatically assigned to clusters. Disadvantages: the number of clusters must be picked beforehand; all items are forced into a cluster; sensitive to outliers.

K-means variations. K-medoids: instead of the mean, use the medoid of each cluster, a central member of the cluster that plays the role the median plays in one dimension. The mean of 1, 3, 5, 7, 9 is 5. The mean of 1, 3, 5, 7, 1009 is 205. The median of 1, 3, 5, 7, 1009 is 5. Median advantage: not affected by extreme values. For large databases, use sampling.
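
The arithmetic can be checked with Python's standard `statistics` module. The `medoid` helper is a hypothetical one-dimensional illustration (the member with the smallest total absolute distance to the others), not a full K-medoids implementation.

```python
from statistics import mean, median


def medoid(values):
    """Member of values with the smallest total distance to the others."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))


clean = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 1009]

print(mean(clean))           # 5
print(mean(with_outlier))    # 205
print(median(with_outlier))  # 5
print(medoid(with_outlier))  # 5
```

A single extreme value moves the mean from 5 to 205, while the median and the medoid both stay at 5.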

Examples of Clustering Applications. Marketing: discover customer groups and use them for targeted marketing and re-organization. Astronomy: find groups of similar stars and galaxies. Earthquake studies: observed earthquake epicenters should be clustered along continental faults. Genomics: find groups of genes with similar expression. And many more.

Clustering Summary. Unsupervised. Many approaches. K-means: simple, sometimes useful. K-medoids is less sensitive to outliers. Hierarchical clustering works for symbolic attributes.

References. This PPT is compiled from: Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor, Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6

