MULTI-CORE K-MEANS
BÖHM C.; PERDACHER M.; PLANT C.
SPEAKER: MARTIN PERDACHER

MULTI-CORE K-MEANS
INTRODUCTION
• K-means is a highly relevant use case for knowledge discovery on big data
• We maximise the performance of K-means by applying two types of parallelism:
  - MIMD (Multiple Instruction Multiple Data)
  - SIMD (Single Instruction Multiple Data)
• Avoid branching operations such as if-then:
  - Encode cluster IDs and distances in joint variables
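For contrast, here is a minimal scalar sketch of the k-means assignment step written with the if-then branch that the slides aim to avoid. The function name `assignBranchy` and the row-major data layout are assumptions for illustration; this is not the authors' code.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Baseline assignment step of Lloyd's k-means with a data-dependent branch.
// Assumed layout: row-major, point i at data[i*d .. i*d + d-1],
// centroid j at centroids[j*d .. j*d + d-1].
std::vector<int> assignBranchy(const std::vector<double>& data,
                               const std::vector<double>& centroids,
                               std::size_t n, std::size_t k, std::size_t d) {
    std::vector<int> labels(n, -1);
    for (std::size_t i = 0; i < n; ++i) {
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t j = 0; j < k; ++j) {
            double dist = 0.0;  // squared Euclidean distance
            for (std::size_t t = 0; t < d; ++t) {
                const double diff = data[i * d + t] - centroids[j * d + t];
                dist += diff * diff;
            }
            // The branch to eliminate: it updates both the minimum and the ID.
            if (dist < best) {
                best = dist;
                labels[i] = static_cast<int>(j);
            }
        }
    }
    return labels;
}
```

The branch outcome depends on the data, which is what makes it hard to map onto SIMD lanes.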

MIMD VS SIMD
IN A SHARED ENVIRONMENT
INTRODUCTION
• Coarse-grained parallelism
  - OpenMP
• Fine-grained parallelism
  - Advanced Vector eXtensions (AVX2)
• Compiler auto-vectorization exists, but is far from efficient
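A minimal sketch of the coarse-grained (MIMD) level, assuming OpenMP: the loop over points is split among threads, and since every point is assigned independently no locking is needed. The name `assignParallel` and the row-major layout are illustrative assumptions. Compiled without `-fopenmp`, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Coarse-grained parallelism: OpenMP distributes iterations over points
// among the threads (static schedule = contiguous chunks per thread).
std::vector<int> assignParallel(const std::vector<double>& data,
                                const std::vector<double>& centroids,
                                long n, long k, long d) {
    std::vector<int> labels(static_cast<std::size_t>(n), 0);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        double best = std::numeric_limits<double>::infinity();
        int bestId = 0;
        for (long j = 0; j < k; ++j) {
            double dist = 0.0;  // squared Euclidean distance
            for (long t = 0; t < d; ++t) {
                const double diff = data[i * d + t] - centroids[j * d + t];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; bestId = static_cast<int>(j); }
        }
        labels[i] = bestId;  // each thread writes only its own entries
    }
    return labels;
}
```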

AVX REGISTERS
INTRODUCTION
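• 16 registers YMM0, YMM1, …, YMM15, each 256 bit wide
• IEEE-754 double: 64 bit, split into sign (1 bit), exponent (11 bit) and fraction (52 bit)
• Value: ±2^exponent · fraction; for normalized numbers, precisely (-1)^sign · 2^(exponent - 1023) · 1.fraction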
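The field layout above can be checked with a small helper that splits a double into its three bit fields. The struct and function names are hypothetical; `std::memcpy` is used because it is the well-defined way to read an object's bit pattern in C++.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The three IEEE-754 binary64 fields of a double.
struct DoubleFields {
    std::uint64_t sign;      // 1 bit
    std::uint64_t exponent;  // 11 bits, biased by 1023
    std::uint64_t fraction;  // 52 bits, without the implicit leading 1
};

DoubleFields decompose(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // copy the raw bit pattern
    return DoubleFields{bits >> 63,
                        (bits >> 52) & 0x7FFu,
                        bits & ((std::uint64_t{1} << 52) - 1)};
}
```

For example, -2.5 = -1.01 (binary) · 2^1 has sign 1, biased exponent 1024 and a fraction field of 2^50.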
• One YMM register holds 4 doubles, and one instruction processes all 4 lanes at once, e.g. YMM2 := YMM0 + YMM1 (lane by lane)
AVX OPERATIONS
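• _mm256_add_pd (lane-wise addition)
• _mm256_sub_pd (lane-wise subtraction)
• _mm256_mul_pd (lane-wise multiplication)
• _mm256_min_pd (lane-wise minimum)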
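To show what these four operations do without requiring an AVX-capable compiler setup, here is a portable model of a YMM register as 4 double lanes. The helper names (`add_pd`, `sub_pd`, `mul_pd`, `min_pd`, `accumulateSquaredDiff`) are hypothetical stand-ins; real code would call the `_mm256_*` intrinsics from `<immintrin.h>`. The `min_pd` model follows the intrinsic's behaviour for non-NaN inputs only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Ymm = std::array<double, 4>;  // models one 256-bit register

Ymm add_pd(const Ymm& a, const Ymm& b) {  // models _mm256_add_pd
    Ymm r{};
    for (std::size_t l = 0; l < 4; ++l) r[l] = a[l] + b[l];
    return r;
}

Ymm sub_pd(const Ymm& a, const Ymm& b) {  // models _mm256_sub_pd
    Ymm r{};
    for (std::size_t l = 0; l < 4; ++l) r[l] = a[l] - b[l];
    return r;
}

Ymm mul_pd(const Ymm& a, const Ymm& b) {  // models _mm256_mul_pd
    Ymm r{};
    for (std::size_t l = 0; l < 4; ++l) r[l] = a[l] * b[l];
    return r;
}

Ymm min_pd(const Ymm& a, const Ymm& b) {  // models _mm256_min_pd (non-NaN)
    Ymm r{};
    for (std::size_t l = 0; l < 4; ++l) r[l] = a[l] < b[l] ? a[l] : b[l];
    return r;
}

// One step of a squared-distance accumulation on 4 lanes:
// acc + (x - m) * (x - m).
Ymm accumulateSquaredDiff(const Ymm& acc, const Ymm& x, const Ymm& m) {
    const Ymm diff = sub_pd(x, m);
    return add_pd(acc, mul_pd(diff, diff));
}
```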

LOOP TRAVERSAL
MULTI-CORE K-MEANS
[Figure: loop traversal over the n × d data and the k centroids; labels: Thread1, Thread2, SIMD, sequential loops]

AVX
INTELLIGENT REUSE OF REGISTERS
Allocation of the 16 registers YMM0 to YMM15:
• 16 distance calculations between 4 data points and 4 centroids
• 4 dimensions of the data points
• 4 dimensions of the centroids
• reserved for intermediate results
• minimum distance for the assignment of the 4 points
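The idea of computing 16 distances between 4 points and 4 centroids in one block can be sketched in scalar C++ as a 4×4 tile of accumulators. For every dimension, 4 point values and 4 centroid values are loaded once and each is reused 4 times, so 8 loads feed 16 multiply-adds. The function name and row-major layout are assumptions; an AVX version would keep the tile in YMM registers rather than a plain array, and this sketch does not reproduce the exact register assignment on the slide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Squared distances between points firstPoint..firstPoint+3 and
// centroids firstCentroid..firstCentroid+3, returned as acc[point][centroid].
std::array<std::array<double, 4>, 4>
distanceTile4x4(const std::vector<double>& data, std::size_t firstPoint,
                const std::vector<double>& centroids, std::size_t firstCentroid,
                std::size_t d) {
    std::array<std::array<double, 4>, 4> acc{};  // 16 accumulators, zeroed
    for (std::size_t t = 0; t < d; ++t) {
        double x[4], m[4];
        // 4 + 4 loads per dimension ...
        for (std::size_t p = 0; p < 4; ++p) x[p] = data[(firstPoint + p) * d + t];
        for (std::size_t c = 0; c < 4; ++c) m[c] = centroids[(firstCentroid + c) * d + t];
        // ... reused for 16 squared differences.
        for (std::size_t p = 0; p < 4; ++p)
            for (std::size_t c = 0; c < 4; ++c) {
                const double diff = x[p] - m[c];
                acc[p][c] += diff * diff;
            }
    }
    return acc;
}
```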

AVOID BRANCHING
BACKPACKED CLUSTER ID CODING
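• How to determine the nearest centroid, argmin_j ||x_i - µ_j||, efficiently?
• AVX has primitives for min but not for argmin
• Idea: store the current cluster ID j in the 8 least significant bits of the current distance
• Bit layout of the double: sign | exponent | fraction (52 bit), with the cluster ID in the lowest 8 fraction bits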
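A minimal sketch of the backpacking idea, assuming cluster IDs fit in 8 bits (0 to 255): the ID overwrites the 8 least significant fraction bits of the distance and can be read back with a mask. `packClusterId` and `unpackClusterId` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Overwrite the 8 least significant bits of dist with clusterId.
double packClusterId(double dist, std::uint8_t clusterId) {
    std::uint64_t bits;
    std::memcpy(&bits, &dist, sizeof bits);
    bits = (bits & ~std::uint64_t{0xFF}) | clusterId;
    std::memcpy(&dist, &bits, sizeof bits);
    return dist;
}

// Read the cluster ID back from the low 8 bits.
std::uint8_t unpackClusterId(double packed) {
    std::uint64_t bits;
    std::memcpy(&bits, &packed, sizeof bits);
    return static_cast<std::uint8_t>(bits & 0xFF);
}
```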

AVOID BRANCHING
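• Taking the minimum of two backpacked distances automatically copies the matching cluster ID
• This works even with SIMD primitives:
YMM15 := _mm256_min_pd(YMM14, YMM15)
Example (each entry: cluster ID | distance):
  Lane          1          2          3          4
  YMM14         4 | 10.9   4 | 18.7   4 | 16.5   4 | 12.3
  YMM15 (old)   2 | 9.5    3 | 16.3   2 | 12.8   1 | 15.0
  YMM15 (new)   2 | 9.5    3 | 16.3   2 | 12.8   4 | 12.3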
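The min-based update can be reproduced with the four lanes of the example (old minima 9.5, 16.3, 12.8, 15.0 with IDs 2, 3, 2, 1 against distances to cluster 4). This is a portable sketch, not AVX code: `minLanes` models `_mm256_min_pd` for non-NaN inputs, and `pack`/`idOf` are hypothetical helpers. Overwriting the low 8 bits changes a value by fewer than 256 units in the last place, so the order of two distances is preserved unless they are nearly equal; near-ties may resolve to either ID.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Lanes = std::array<double, 4>;  // models one YMM register

// Store id in the 8 least significant bits of dist.
double pack(double dist, std::uint8_t id) {
    std::uint64_t bits;
    std::memcpy(&bits, &dist, sizeof bits);
    bits = (bits & ~std::uint64_t{0xFF}) | id;
    std::memcpy(&dist, &bits, sizeof bits);
    return dist;
}

// Extract the cluster ID from the low 8 bits.
std::uint8_t idOf(double packed) {
    std::uint64_t bits;
    std::memcpy(&bits, &packed, sizeof bits);
    return static_cast<std::uint8_t>(bits & 0xFF);
}

// Lane-wise minimum, modelling YMM15 := _mm256_min_pd(YMM14, YMM15).
// No separate argmin is needed: the winning lane keeps its own ID bits.
Lanes minLanes(const Lanes& a, const Lanes& b) {
    Lanes r{};
    for (std::size_t l = 0; l < 4; ++l) r[l] = a[l] < b[l] ? a[l] : b[l];
    return r;
}
```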

INFLUENCE ON THE DISTANCE?
• How much does a backpacked cluster ID change the distance?
• Not much: if the true distance is 1.0 and the cluster ID is 255, the stored value is 1.000000000000057 (13 zeros after the decimal point)
• Not significantly: the Euclidean distance involves a square root, which means that half of the bits are numerically insignificant anyway
  - numerically significant in ||x_i - µ_j||: 26 bit; cluster ID: lowest 8 bit
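The number on the slide can be reproduced directly. For 1.0, one unit in the last place is 2^-52, so writing ID 255 into the low 8 bits yields exactly 1 + 255 · 2^-52 ≈ 1 + 5.7 · 10^-14. More generally, for any positive normal distance the relative change is below 256 · 2^-52 = 2^-44. The helper name `withClusterId` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Overwrite the 8 lowest fraction bits of dist with id.
double withClusterId(double dist, std::uint8_t id) {
    std::uint64_t bits;
    std::memcpy(&bits, &dist, sizeof bits);
    bits = (bits & ~std::uint64_t{0xFF}) | id;
    std::memcpy(&dist, &bits, sizeof bits);
    return dist;
}

// Relative-change bound for positive normal doubles:
// at most 255 units in the last place, i.e. below 2^-44.
const double kRelativeBound = std::ldexp(1.0, -44);
```

Printing `withClusterId(1.0, 255)` with `%.15f` gives `1.000000000000057`.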

SETTING
PERFORMANCE EVALUATION
• 2 quad-core CPUs at 2.4 GHz
  - Intel Xeon E5-2609
  - Sandy Bridge micro-architecture
  - AVX1
• Cache
  - 4 × 32 kB L1 data cache
  - 4 × 256 kB L2 cache
  - 10 MB shared L3 cache
• Software: C++ (GNU g++)
• 5 iterations
• Synthetic data
  - n up to 64 million
  - k up to 20
  - d up to 100
• Real data from UCI
  - Forest Covertype (n = 580,000, d = 54)
  - Household data (n = 2 million, d = 7)

REAL DATA
RUN UNTIL CONVERGENCE
[Figure: run until convergence on Synthetic (12D), CoverType (54D) and Household (7D); bars for No Vect., Autovect. and MKM, each on 1 core and 8 cores; y-axis from 0 to 12; annotated values 51.2, 39.1 and 55.3]

SYNTHETIC DATA
DASHED LINE SHOWS IDEAL CURVE
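New experiments for SDM, final version
n = 32 million; k = 40; d = 20
Runtime for 5 iterations (s); the "Ideal" columns are the 1-thread runtime divided by the number of threads (dashed lines)

# Threads | Autovect. | BLAS-KM | no ID coding | MKM    | Ideal Autovect. | Ideal BLAS-KM | Ideal no ID coding | Ideal MKM
1         | 134.313   | 43.873  | 60.915       | 31.18  | 134.313         | 43.873        | 60.915             | 31.18
2         | 68.03     | 28.856  | 25.569       | 18.896 | 67.1565         | 21.9365       | 30.4575            | 15.59
3         | 46.871    | 19.408  | 18.228       | 12.501 | 44.771          | 14.6243333    | 20.305             | 10.3933333
4         | 36.031    | 15.39   | 13.843       | 9.155  | 33.57825        | 10.96825      | 15.22875           | 7.795
5         | 29.411    | 12.296  | 13.888       | 7.64   | 26.8626         | 8.7746        | 12.183             | 6.236
6         | 25.081    | 13.858  | 10.583       | 6.554  | 22.3855         | 7.31216667    | 10.1525            | 5.19666667
7         | 21.914    | 11.896  | 10.923       | 5.533  | 19.1875714      | 6.26757143    | 8.70214286         | 4.45428571
8         | 19.758    | 10.392  | 8.519        | 5.017  | 16.789125       | 5.484125      | 7.614375           | 3.8975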
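The ideal curve is T(1)/p. From the measured MKM runtimes in the table, speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p follow directly; for 8 threads this gives a speedup of about 6.2, i.e. an efficiency of roughly 78%. The array and function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Measured MKM runtimes (s) for n = 32 million, k = 40, d = 20;
// index 0 = 1 thread, ..., index 7 = 8 threads.
constexpr double kMkmSeconds[8] = {31.18, 18.896, 12.501, 9.155,
                                   7.64,  6.554,  5.533,  5.017};

// Speedup relative to the single-thread run.
double speedup(std::size_t threads) {
    return kMkmSeconds[0] / kMkmSeconds[threads - 1];
}

// Fraction of the ideal (linear) speedup that is achieved.
double efficiency(std::size_t threads) {
    return speedup(threads) / static_cast<double>(threads);
}
```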
[Figure: runtime for 5 iterations (s) vs. number of threads (1 to 8), four panels; legend: Autovect., BLAS-KM, no ID coding, MKM; dashed lines show the ideal curve]

SCALABILITY
IN N, D AND K
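d = 20; k = 40; c = 8 (cores); 5 iterations; runtimes in s

# Objects (millions) | Autovect. (8 cores) | MKM (8 cores) | Factor Autovect./MKM | No vect. (1 core)
1                    | 0.887               | 0.147         | 6.03401361           | 6.113
16                   | 13.748              | 2.534         | 5.42541436           | 95.532
32                   | 26.865              | 5.036         | 5.33459095           | –
48                   | 43.191              | 8.274         | 5.22008702           | –
64                   | 59.179              | 9.306         | 6.3592306            | 2408

Factor No vect. (1 core) / MKM (8 cores) at 64 million objects: 258.757791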
[Figure: runtime for 5 iterations (s) of Autovect. and MKM vs. # objects (millions), dimensionality and # clusters (three panels)]

Source code available at:
informatik.univie.ac.at/dm/downloads/
PaperId: 031_115
