Clustering and Factorization in SystemML (part 1)
Alexandre Evfimievski
K-means Clustering
• INPUT: n records x1, x2, …, xn as the rows of matrix X
  – Each xi is m-dimensional: xi = (xi1, xi2, …, xim)
  – Matrix X is (n × m)-dimensional
• INPUT: k, an integer in {1, 2, …, n}
• OUTPUT: Partition the records into k clusters S1, S2, …, Sk
  – May use n labels y1, y2, …, yn in {1, 2, …, k}
  – NOTE: The same clustering can be labeled in k! ways – important when checking correctness (don't just compare "predicted" and "true" labels)
• METRIC: Minimize the within-cluster sum of squares (WCSS), shown in the sketch below:

    WCSS := ∑i=1…n ‖ xi − mean(Sj : xi ∈ Sj) ‖²

• Cluster "means" are k vectors that capture as much variance in the data as possible
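A minimal DML sketch of the metric (the names X, y, k are illustrative, not from the shipped scripts); it assumes every label in {1, …, k} occurs at least once:

P = table (seq (1, nrow (X)), y);   # n x k 0/1 indicator: P[i, c] = 1 iff y[i] = c
M = (t(P) %*% X) / t(colSums (P));  # k x m matrix of cluster means
wcss = sum ((X - P %*% M) ^ 2);     # each row compared to its own cluster's mean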
K-means Clustering
• K-means is loosely similar to linear regression:
  – Linear regression error = ∑i≤n (yi − xi·β)²
  – BUT: Clustering describes the xi's themselves, not the yi's given the xi's
• K-means can work in a "linearization space" (like kernel SVM)
• How to pick k?
  – Try k = 1, 2, …, up to some limit; check for overfitting
  – Pick the best k in the context of the whole task
• Caveats for k-means:
  – It does NOT estimate a mixture of Gaussians (the EM algorithm does that)
  – The k clusters tend to be of similar size: do NOT use k-means for imbalanced clusters!
The K-means Algorithm
• Pick k "centroids" c1, c2, …, ck from the records {x1, x2, …, xn}
  – Try to pick centroids far from each other
• Assign each record to the nearest centroid:
  – For each xi compute di = min {dist(xi, cj) over all cj}
  – Cluster Sj ← { xi : dist(xi, cj) = di }
• Reset each centroid to its cluster's mean:
  – Centroid cj ← mean(Sj) = ∑i≤n (xi in Sj?) · xi / |Sj|
• Repeat the "assign" and "reset" steps until convergence (one round is sketched below)
• The loss decreases monotonically: WCSSold ≥ C-WCSSnew ≥ WCSSnew
  – Converges to a local optimum (often not the global one)

    C-WCSS := ∑i=1…n ‖ xi − centroid(Sj : xi ∈ Sj) ‖²
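One "assign" + "reset" round in DML, distilled from the full implementation two slides ahead (X is the n × m data, C holds the k × m centroids):

D = rowSums (X ^ 2) - 2 * (X %*% t(C)) + t(rowSums (C ^ 2));  # all n x k squared distances
P = ppred (D, rowMins (D), "<=");                             # assign: 0/1 cluster membership
P = P / rowSums (P);                                          # split ties among equidistant centroids
C = t(P / colSums (P)) %*% X;                                 # reset: each centroid <- its cluster's mean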
The K-means Algorithm
• Runaway centroid: a centroid that is closest to no record at the "assign" step
  – Occasionally happens, e.g. with k = 3 centroids and 2 data clusters
  – Options: (a) terminate, (b) reduce k by 1
• Centroids vs. means at early termination:
  – After the "assign" step, cluster centroids ≠ their means
    • Centroids: (a) define the clusters, (b) are already computed
    • Means: (a) define the WCSS metric, (b) are not yet computed
  – We report the centroids and the centroid-WCSS (C-WCSS)
• Multiple runs:
  – Required to guard against a bad local optimum
  – Use a "parfor" loop with random initial centroids (driver sketched below)
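A hedged sketch of the driver around the implementation on the next slide (num_runs is an assumed name; All_C and final_wcss match the next slide): each run reads and writes only its own k-row slice of All_C, so SystemML can execute the parfor iterations in parallel:

parfor (run in 1 : num_runs) {
    C = All_C [(k * (run - 1) + 1) : (k * run), ];  # this run's initial centroids
    # ... the "assign"/"reset" loop of the next slide updates C, wcss, term_code ...
    All_C [(k * (run - 1) + 1) : (k * run), ] = C;  # write back this run's centroids
    final_wcss [run, 1] = wcss;                     # and its final loss
}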
K-means: DML Implementation

# One run's body; All_C stacks every run's k centroids in one 2-D matrix
# (the slide's "tensor avoidance maneuver" and "ParFor I/O" notes refer to
# this stacking and to the per-run slice reads/writes)
C = All_C [(k * (run - 1) + 1) : (k * run), ];
iter = 0; term_code = 0; wcss = 0;
while (term_code == 0) {
    # squared distances to all centroids, minus the row-constant rowSums(X^2)
    D = -2 * (X %*% t(C)) + t(rowSums (C ^ 2));
    minD = rowMins (D); wcss_old = wcss;
    wcss = sumXsq + sum (minD);   # sumXsq = sum(X^2), precomputed once
    if (wcss_old - wcss < eps * wcss & iter > 0) {
        term_code = 1;   # convergence is reached
    } else {
        if (iter >= max_iter) { term_code = 2;
        } else { iter = iter + 1;
            # hard 0/1 assignment; slide note: "want smooth assign? edit here"
            P = ppred (D, minD, "<=");
            P = P / rowSums (P);   # split ties among equidistant centroids
            if (sum (ppred (colSums (P), 0.0, "<=")) > 0) {
                term_code = 3;   # "runaway" centroid: some cluster got no records
            } else {
                C = t(P / colSums (P)) %*% X;   # all k means in one matrix product
} } } }
All_C [(k * (run - 1) + 1) : (k * run), ] = C;
final_wcss [run, 1] = wcss; t_code [run, 1] = term_code;
K-means++ Initialization Heuristic
• Picks centroids from X at random, pushing them far apart
• Gets WCSS down to O(log k) × optimal, in expectation
• How to pick the centroids (sketched below):
  – Centroid c1: pick uniformly at random from the rows of X
  – Centroid c2: Prob[c2 ← xi] = (1/Σ) · dist(xi, c1)²
  – Centroid cj: Prob[cj ← xi] = (1/Σ) · min{dist(xi, c1)², …, dist(xi, cj−1)²}
  – The probability of picking a row is proportional to its squared minimum distance from the earlier centroids
• If X is huge, we use a sample of X, different across runs
  – Otherwise picking k centroids requires k passes over X

David Arthur, Sergei Vassilvitskii: "k-means++: The Advantages of Careful Seeding", SODA 2007
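A minimal DML sketch of the seeding loop (illustrative, not the shipped script; it reuses the distance rewrite from the implementation above, and samples rows by inverse-CDF lookup):

n = nrow (X);
u = as.scalar (rand (rows = 1, cols = 1));
i = as.integer (floor (u * n)) + 1;                  # c1: uniform random row index
C = X [i, ];
sq_norms = rowSums (X ^ 2);
for (j in 2 : k) {
    # squared distance of every row to its nearest centroid chosen so far
    D2 = sq_norms + rowMins (-2 * (X %*% t(C)) + t(rowSums (C ^ 2)));
    cdf = cumsum (D2 / sum (D2));                    # sampling CDF, proportional to D2
    u = as.scalar (rand (rows = 1, cols = 1));
    i = min (as.integer (sum (ppred (cdf, u, "<"))) + 1, n);  # inverse-CDF sample
    C = rbind (C, X [i, ]);                          # append centroid cj
}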
K-means Predict Script
• Predictor and evaluator in one:
  – Given X (data) and C (centroids), assigns cluster labels prY
  – Compares 2 clusterings: "predicted" prY and "specified" spY
• Computes WCSS, as well as the Between-Cluster Sum of Squares (BCSS) and the Total Sum of Squares (TSS)
  – The dataset X must be available
  – If centroids C are given, also computes C-WCSS and C-BCSS
• Two ways to compare prY and spY (pair counting is sketched below):
  – Same-cluster and different-cluster PAIRS from prY and spY
  – For each prY-cluster, find the best-matching spY-cluster, and vice versa
  – All reported as counts as well as percentages of the full count
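The pair counts have a closed form via a contingency table; a sketch with assumed names (not the shipped predict script):

n  = nrow (prY);
CT = table (prY, spY);                                   # CT[a, b] = #records with prY = a, spY = b
same_both = sum (CT * (CT - 1)) / 2;                     # pairs placed together by both clusterings
same_pr   = sum (rowSums (CT) * (rowSums (CT) - 1)) / 2; # pairs together under prY
same_sp   = sum (colSums (CT) * (colSums (CT) - 1)) / 2; # pairs together under spY
n_pairs   = n * (n - 1) / 2;                             # total pairs, the base for % reports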
Weighted Non-Negative Matrix Factorization (WNMF)
• INPUT: X is a non-negative (n × m)-matrix
  – Example: Xij = 1 if person #i clicked ad #j, else Xij = 0
• INPUT (OPTIONAL): W is a penalty (n × m)-matrix
  – Example: Wij = 1 if person #i saw ad #j, else Wij = 0
• OUTPUT: an (n × k)-matrix U and an (m × k)-matrix V such that:
  – k topics: Uic = affinity(person #i, topic #c), Vjc = affinity(ad #j, topic #c)
  – Approximation: Xij ≈ Ui1·Vj1 + Ui2·Vj2 + … + Uik·Vjk
  – Predict a "click" if for some #c both Uic and Vjc are high

    min over U, V:  ∑i=1…n ∑j=1…m Wij (Xij − (U Vᵀ)ij)²   s.t. U ≥ 0, V ≥ 0
Weighted Non-Negative Matrix Factorization (WNMF)
• NOTE: Non-negativity is critical for this "bipartite clustering" interpretation of U and V
  – Matrix U of size n × k = cluster affinities for people
  – Matrix V of size m × k = cluster affinities for ads
• Negative entries would violate the "disjunction of conjunctions" sense:
  – Approximation: Xij ≈ Ui1·Vj1 + Ui2·Vj2 + … + Uik·Vjk
  – Predict a "click" if for some #c both Uic and Vjc are high
WNMF: Multiplicative Update

    Uij ← Uij ∗ [(W ∗ X) V]ij / ([(W ∗ (U Vᵀ)) V]ij + ε)
    Vij ← Vij ∗ [(W ∗ X)ᵀ U]ij / ([(W ∗ (U Vᵀ))ᵀ U]ij + ε)
    (∗ denotes elementwise multiplication)

§ Easy to parallelize using SystemML
§ Multiple runs help avoid bad local optima
§ Must specify k: run for k = 1, 2, 3, … (as in k-means)

Daniel D. Lee, H. Sebastian Seung: "Algorithms for Non-negative Matrix Factorization", NIPS 2000
Inside A Run of (W)NMF
• Assume that W is a sparse matrix

Plain NMF:

U = RND_U [, (r-1)*k + 1 : r*k];
V = RND_V [, (r-1)*k + 1 : r*k];
f_old = 0; i = 0;
f_new = sum ((X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * (X %*% V) / (U %*% (t(V) %*% V) + eps);
    V = V * t(t(U) %*% X) / (V %*% (t(U) %*% U) + eps);
    f_new = sum ((X - U %*% t(V)) ^ 2);
    i = i + 1;
}

Weighted NMF (same skeleton, with the penalty matrix W):

U = RND_U [, (r-1)*k + 1 : r*k];
V = RND_V [, (r-1)*k + 1 : r*k];
f_old = 0; i = 0;
f_new = sum (W * (X - U %*% t(V)) ^ 2);
while (abs (f_new - f_old) > tol * f_new & i < max_iter) {
    f_old = f_new;
    U = U * ((W * X) %*% V) / ((W * (U %*% t(V))) %*% V + eps);
    V = V * (t(W * X) %*% U) / (t(W * (U %*% t(V))) %*% U + eps);
    f_new = sum (W * (X - U %*% t(V)) ^ 2);
    i = i + 1;
}
Sum-Product Rewrites
• Matrix chain product optimization
  – Example: (U %*% t(V)) %*% V = U %*% (t(V) %*% V)
• Moving operators from big matrices to smaller ones
  – Example: t(X) %*% U = t(t(U) %*% X)
• Opening brackets in expressions (ongoing research)
  – Example: sum ((X - U %*% t(V))^2) = sum (X^2) - 2 * sum (X * (U %*% t(V))) + sum ((U %*% t(V))^2)
  – K-means: D = rowSums (X ^ 2) - 2 * (X %*% t(C)) + t(rowSums (C ^ 2))
• Indexed sum rearrangements (numerically checked below):
  – sum ((U %*% t(V))^2) = sum ((t(U) %*% U) * (t(V) %*% V))
  – sum (U %*% t(V)) = sum (colSums (U) * colSums (V))
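The rearrangement identities are easy to sanity-check in DML (a throwaway sketch; the matrix sizes are arbitrary):

U = rand (rows = 1000, cols = 10);
V = rand (rows = 500, cols = 10);
d1 = sum ((U %*% t(V)) ^ 2) - sum ((t(U) %*% U) * (t(V) %*% V));
d2 = sum (U %*% t(V)) - sum (colSums (U) * colSums (V));
print ("both differences should be ~0: " + d1 + ", " + d2);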
Operator Fusion: Weighted Squared Loss
• Weighted squared loss: sum (W * (X - U %*% t(V))^2)
  – A common pattern in factorization algorithms
  – W and X are usually very sparse (density < 0.001)
  – Problem: the "outer" product U %*% t(V) creates three dense intermediates of the size of X
• ⇒ Fused w.sq.loss operator:
  – Key observations: the sparse "W *" allows selective computation, and the "sum" aggregate significantly reduces the memory requirements

[Diagram: one fused operator evaluates sum (W * (X - U %*% t(V))^2) directly, never materializing U %*% t(V) or the difference matrix.]