K means and dbscan

K-MEANS CLUSTERING
BY CHENG ZHAN
HOUSTON MACHINE LEARNING MEETUP
1/7/2017

INTRODUCTION
• K-means (MacQueen, 1967) is one of the simplest
unsupervised learning algorithms for solving the well-known
clustering problem.
• The main idea is to define k centroids, one for each
cluster.

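• Input
• M (set of points)
• k (number of clusters)
• Output
• μ_1, …, μ_k (cluster centroids)
• k-means partitions the points in M into k clusters by minimizing the
squared error function
$J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$
where C_j is the set of points assigned to centroid μ_j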

K-MEANS ALGORITHM
• 0. Initialize cluster centers
• 1. Assign observations to closest
cluster center
• 2. Revise cluster centers as mean of
assigned observations
• 3. Repeat steps 1 and 2 until convergence (the assignments stop changing)
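The four steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the `seed` and `max_iter` parameters, and the toy data are assumptions made for the example, and initial centers are drawn from the data points.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 0: initialize centers by picking k data points at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each observation to its closest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 3: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: revise each center as the mean of its assigned observations.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = mean_point(members)
    return centers, labels

# Toy data (assumed for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

Comparing assignments between iterations is one simple convergence test; checking that the centers move less than a small tolerance is a common alternative.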

K-MEANS IN PRACTICE
• How to choose initial centroids
• select them randomly among the data points
• generate them completely at random
• How to choose k
• study the data
• run k-means for different k (measure the squared error for each k)
• Run k-means many times!
• Each run tries a different choice of initial points; keep the run with the lowest squared error
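The advice above (restart from several random initializations and compare the squared error across values of k) might look like the following sketch. The helper names `k_means_once` and `best_of_restarts`, the number of restarts, and the toy data are all assumptions for illustration.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means_once(points, k, rng, max_iter=100):
    """One k-means run started from k random data points.

    Returns (centers, squared error of the final assignment).
    """
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    error = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, error

def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means several times and keep the run with the lowest squared error."""
    rng = random.Random(seed)
    runs = (k_means_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[1])

# Toy data (assumed for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Squared error of the best run for each candidate k.
errors = {k: best_of_restarts(points, k)[1] for k in range(1, 5)}
for k, err in errors.items():
    print(k, round(err, 3))
```

The squared error keeps shrinking as k grows (it reaches zero when every point is its own cluster), so the useful signal is where the improvement levels off, not the minimum itself.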

QUESTIONS
• Euclidean distance results in spherical clusters
• What cluster shape does the Manhattan distance give?
• Think of other distance measures. What cluster shapes
will those yield?

DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISE (DBSCAN)
• DBSCAN is a density-based clustering algorithm
• In density-based clustering we partition points into dense regions separated
by not-so-dense regions.
• Important questions:
• How do we measure density, and what is a dense region?
• DBSCAN:
• Density at point p: number of points within a circle of radius Eps centered at p
• Dense region: a circle of radius Eps that contains at least MinPts points
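The two definitions above are enough to sketch a small DBSCAN in plain Python. The slides do not spell out how dense regions are grown into clusters, so the expansion loop below follows the standard DBSCAN procedure as an assumption. The names `region_query` and `dbscan`, the choice to count a point as its own neighbor, and the toy data are also assumptions.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # points[i] sits in a dense region: start a new cluster and grow it.
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reachable from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is dense too, so keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels

# Toy data (assumed): two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
          (5.0, 5.0), (5.1, 5.2), (4.9, 4.8), (5.2, 4.9),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.6, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not an input here; it falls out of Eps and MinPts, and isolated points are reported as noise instead of being forced into a cluster.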

DETERMINING EPS & MINPTS
• The idea is that for points in a cluster, their kth nearest neighbors
are at roughly the same distance
• Noise points have their kth nearest neighbor at a larger distance
• So, plot the sorted distance of every point to its kth nearest
neighbor
• Find the distance d where there is a “knee” in the curve
• Eps = d, MinPts = k
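A sketch of the k-distance curve described above, in plain Python. The function name `k_distances` and the toy data are assumptions. The "largest jump" rule at the end is only a crude stand-in for reading the knee off a plot, not part of the method on the slide.

```python
import math

def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest neighbor.

    The point itself is not counted as one of its neighbors.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Toy data (assumed): two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
          (5.0, 5.0), (5.1, 5.2), (4.9, 4.8), (5.2, 4.9),
          (9.0, 1.0)]

k = 3
curve = k_distances(points, k)
print([round(d, 3) for d in curve])

# Crude knee estimate (an assumption): the value just before the largest jump.
gaps = [b - a for a, b in zip(curve, curve[1:])]
eps_candidate = curve[gaps.index(max(gaps))]
print(round(eps_candidate, 3))
```

On real data the curve is usually inspected visually, since the largest jump can be dominated by a single extreme outlier.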

DISTANCE METRIC FOR DOCUMENTS
• Motivations
• Identical – easy
• Modified or related (Ex: DNA, Plagiarism, Authorship)
• Did Francis Bacon write Shakespeare’s plays?

CHALLENGES
• How do we measure similarity?
• How do we search over articles?

DOCUMENT REPRESENTATION
• Word count document representation
• Bag of words model
• Ignore order of words
• Count # of instances of each word in vocabulary

EXAMPLE
• Word: Sequence of alphanumeric characters. For example, the phrase “6.006
is fun” has 4 words.
• Word Frequencies: Word frequency D(w) of a given word w is the number of
times it occurs in a document D.
• For example, the words and word frequencies for the above phrase are as
below:
Word  | 6 | The | Is | 006 | Easy | Fun
Count | 1 | 0   | 1  | 1   | 0    | 1
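The word-count representation for this example can be reproduced with a few lines of Python. The function name `word_frequencies` is an assumption, and so is the decision to lowercase the text before counting (the slide does not say how case is handled).

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count maximal runs of alphanumeric characters, ignoring case and order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)

freq = word_frequencies("6.006 is fun")

# Look up each vocabulary word; a Counter returns 0 for words that never occur.
vocabulary = ["6", "the", "is", "006", "easy", "fun"]
counts = [freq[w] for w in vocabulary]
print(dict(zip(vocabulary, counts)))
```

The period splits "6.006" into the two words "6" and "006", which is why the phrase has 4 words under this definition.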

METRIC
• d(x,x) = 0
• d(x,y) = d(y,x)
• d(x,y) + d(y,z) >= d(x,z)

METRIC
• Inner product of the vectors D1 and D2 containing the word frequencies
for all words in the 2 documents. Geometrically, this is the length of the
projection of D1 onto D2 multiplied by |D2| (or vice versa). Mathematically this is expressed as:
$D1 \cdot D2 = \sum_{w} D1(w) \, D2(w)$
• Angle Metric: The angle between the vectors D1 and D2 gives an
indication of overlap between the 2 documents. Mathematically this
angle is expressed as:
$\theta(D1, D2) = \arccos\left( \frac{D1 \cdot D2}{|D1| \, |D2|} \right)$
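The inner product and angle can be computed directly from word-count dictionaries. This is a minimal sketch, not the linked MIT code: the function names and the two sample sentences are assumptions, and both documents are assumed to be non-empty.

```python
import math
from collections import Counter

def inner_product(d1, d2):
    """Sum of D1(w) * D2(w) over all words w; words missing from d2 count as 0."""
    return sum(count * d2.get(word, 0) for word, count in d1.items())

def angle(d1, d2):
    """Angle in radians between two non-empty word-frequency vectors."""
    norm_product = math.sqrt(inner_product(d1, d1) * inner_product(d2, d2))
    cosine = inner_product(d1, d2) / norm_product
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))

doc_a = Counter("the quick brown fox".split())
doc_b = Counter("the lazy dog".split())
doc_c = Counter("completely different words".split())

print(angle(doc_a, doc_a))  # identical documents
print(angle(doc_a, doc_b))  # one shared word
print(angle(doc_a, doc_c))  # no shared words
```

Because word counts are never negative, the angle always lies between 0 (same word proportions) and π/2 (no words in common).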

PYTHON EXAMPLE
• https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/docdist2.py

REFERENCE
• http://www.cs.haifa.ac.il/~rita/uml_course/lectures/kmeans.pdf
• https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
• http://www.it.uu.se/edu/course/homepage/infoutv/ht09/a2t.pdf
• http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
• Machine Learning Specialization by University of Washington on Coursera
