MULTI-CORE K-MEANS
BÖHM C.; PERDACHER M.; PLANT C.
SPEAKER: MARTIN PERDACHER
MULTI-CORE K-MEANS
INTRODUCTION
• K-means is a highly relevant use case for knowledge discovery on big data
• We maximise the performance of K-means by applying two types of parallelism:
  • MIMD (Multiple Instruction, Multiple Data)
  • SIMD (Single Instruction, Multiple Data)
• Avoid branching operations such as if-then:
  • Encode cluster IDs and distances in joint variables
MIMD VS SIMD
IN A SHARED ENVIRONMENT
INTRODUCTION
• Coarse-grained parallelism
  • OpenMP
• Fine-grained parallelism
  • Advanced Vector eXtensions (AVX2)
  • Auto-vectorization exists, but is far from efficient.
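The two levels of parallelism can be illustrated on the assignment step of K-means. The following is only a minimal sketch, not the authors' implementation: `assign_points` and its row-major data layout are assumptions made for illustration. The outer loop over points is distributed across threads with OpenMP, while the innermost loop over dimensions is the part that a compiler may auto-vectorize or that can be written by hand with AVX intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not from the slides): assigns each point to its
// nearest centroid. Points and centroids are stored row-major, d doubles each.
std::vector<int> assign_points(const std::vector<double>& points,
                               const std::vector<double>& centroids,
                               std::size_t d) {
    assert(d > 0 && points.size() % d == 0 && centroids.size() % d == 0);
    const long n = static_cast<long>(points.size() / d);
    const std::size_t k = centroids.size() / d;
    std::vector<int> assignment(static_cast<std::size_t>(n), -1);

    // Coarse-grained (MIMD): each thread handles a disjoint range of points.
    // Without -fopenmp the pragma is ignored and the loop runs sequentially.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        const std::size_t p = static_cast<std::size_t>(i) * d;
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t j = 0; j < k; ++j) {
            // Fine-grained (SIMD): branch-free reduction over the dimensions.
            // The squared distance has the same argmin as the distance itself.
            double dist = 0.0;
            for (std::size_t t = 0; t < d; ++t) {
                const double diff = points[p + t] - centroids[j * d + t];
                dist += diff * diff;
            }
            if (dist < best) {  // the if-then that the slides aim to avoid
                best = dist;
                assignment[static_cast<std::size_t>(i)] = static_cast<int>(j);
            }
        }
    }
    return assignment;
}
```

Compiled with `g++ -O2 -fopenmp`, the outer loop runs in parallel; the `if (dist < best)` branch is the kind of operation that the cluster-ID coding described later in the talk is meant to remove.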
AVX REGISTERS
INTRODUCTION
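• 16 registers, YMM0 to YMM15, of 256 bit each
• IEEE-754 double: 64 bit (sign, exponent, fraction), value = ±2^exponent · fraction
• One register holds 4 doubles; an operation such as YMM0 + YMM1 = YMM2 adds all 4 pairs at once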
AVX OPERATIONS
_mm256_add_pd
_mm256_sub_pd
_mm256_mul_pd
_mm256_min_pd
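As a rough illustration of how these intrinsics combine, the sketch below accumulates squared differences for four lanes at once using `_mm256_sub_pd`, `_mm256_mul_pd` and `_mm256_add_pd`. The helper name `accumulate_squared_diff` and the `Lanes` alias are assumptions introduced here, and a scalar fallback is included so that the code also builds when AVX is not enabled.

```cpp
#include <array>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

using Lanes = std::array<double, 4>;  // one 256-bit register holds 4 doubles

// Hypothetical helper: returns acc + (x - c) * (x - c), lane by lane.
Lanes accumulate_squared_diff(const Lanes& acc, const Lanes& x, const Lanes& c) {
    Lanes out{};
#if defined(__AVX__)
    // Unaligned loads and stores keep the sketch free of alignment assumptions.
    const __m256d vacc = _mm256_loadu_pd(acc.data());
    const __m256d diff = _mm256_sub_pd(_mm256_loadu_pd(x.data()),
                                       _mm256_loadu_pd(c.data()));
    const __m256d sum = _mm256_add_pd(vacc, _mm256_mul_pd(diff, diff));
    _mm256_storeu_pd(out.data(), sum);
#else
    // Scalar fallback with the same lane-wise semantics.
    for (std::size_t i = 0; i < out.size(); ++i) {
        const double diff = x[i] - c[i];
        out[i] = acc[i] + diff * diff;
    }
#endif
    return out;
}
```

Calling this once per dimension, with `x` holding one coordinate of four points and `c` the matching centroid coordinate broadcast to all lanes, builds up four squared distances in parallel. The AVX path is taken when the code is compiled with `-mavx` or a similar flag.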
CLASSICAL VARIANT
K-MEANS
LOOP TRAVERSAL
MULTI-CORE K-MEANS
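[Figure: traversal order of the loops over n (objects), d (dimensions) and k (clusters). The work is split between Thread1 and Thread2; labels in the figure: SIMD, sequential loops.]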
AVX
INTELLIGENT REUSE OF REGISTERS
[Figure: the 16 registers YMM0 to YMM15, grouped by role]
• 16 distance calculations between 4 data points and 4 centroids
• 4 dimensions of the data points
• 4 dimensions of the centroids
• reserved for intermediate results
• minimum distance for the assignment of the 4 points
AVOID BRANCHING
BACKPACKED CLUSTER ID CODING
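• How can argmin_j ||x_i - µ_j|| be determined efficiently?
• AVX has primitives for min but not for argmin
• Idea: store the current clusterId j in the 8 least significant bits of the current distance
[Figure: bit layout of a double (sign | exponent | fraction, 52 bit) with the cluster-ID in the lowest bits of the fraction]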
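The bit manipulation behind this idea can be sketched in portable C++. This is an illustration only: the function names `backpack` and `cluster_id_of` are assumptions, and the code presumes that `double` is an IEEE-754 binary64 value with the same byte order as `std::uint64_t`, as on common x86-64 platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: overwrites the 8 least significant fraction bits of a
// non-negative distance with the cluster ID.
double backpack(double distance, std::uint8_t cluster_id) {
    assert(distance >= 0.0);
    std::uint64_t bits;
    std::memcpy(&bits, &distance, sizeof bits);  // well-defined type punning
    bits = (bits & ~std::uint64_t{0xFF}) | cluster_id;
    std::memcpy(&distance, &bits, sizeof bits);
    return distance;
}

// Hypothetical helper: reads the cluster ID back from the low 8 bits.
std::uint8_t cluster_id_of(double backpacked) {
    std::uint64_t bits;
    std::memcpy(&bits, &backpacked, sizeof bits);
    return static_cast<std::uint8_t>(bits & 0xFF);
}
```

For example, `cluster_id_of(backpack(9.5, 2))` yields 2, while the stored value still compares like 9.5 against any distance that differs from it by more than the few lowest fraction bits.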
AVOID BRANCHING
BACKPACKED CLUSTER ID CODING
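• Our technique copies the clusterId automatically together with the minimum distance
• This works even with SIMD primitives:
YMM15 := _mm256_min_pd (YMM14, YMM15)

Lane   YMM15 (ID | distance)   YMM14 (ID | distance)   new YMM15 (ID | distance)
1      2 | 9.5                 4 | 10.9                2 | 9.5
2      3 | 16.3                4 | 18.7                3 | 16.3
3      2 | 12.8                4 | 16.5                2 | 12.8
4      1 | 15.0                4 | 12.3                4 | 12.3

[Figure: bit layout of a double (sign | exponent | fraction, 52 bit) with the cluster-ID in the lowest bits of the fraction]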
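The slide's example can be replayed with a scalar stand-in for `_mm256_min_pd`. The sketch below is not the authors' code: `with_id`, `id_of`, `lanewise_min` and `slide_example` are hypothetical names, and the loop over four lanes only imitates what the AVX instruction does in one step. Each lane pairs a distance with the ID of the cluster it belongs to; taking the minimum carries the matching ID along without any comparison on the IDs.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Lanes = std::array<double, 4>;

// Hypothetical helper: stores id in the 8 least significant bits of dist.
double with_id(double dist, std::uint8_t id) {
    std::uint64_t bits;
    std::memcpy(&bits, &dist, sizeof bits);
    bits = (bits & ~std::uint64_t{0xFF}) | id;
    std::memcpy(&dist, &bits, sizeof bits);
    return dist;
}

// Hypothetical helper: extracts the ID stored by with_id.
std::uint8_t id_of(double packed) {
    std::uint64_t bits;
    std::memcpy(&bits, &packed, sizeof bits);
    return static_cast<std::uint8_t>(bits & 0xFF);
}

// Scalar stand-in for YMM15 := _mm256_min_pd(YMM14, YMM15).
Lanes lanewise_min(const Lanes& a, const Lanes& b) {
    Lanes out{};
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = std::min(a[i], b[i]);
    return out;
}

// Replays the slide: running minima (IDs 2, 3, 2, 1) against the distances
// to cluster 4.
Lanes slide_example() {
    const Lanes ymm15 = {with_id(9.5, 2), with_id(16.3, 3),
                         with_id(12.8, 2), with_id(15.0, 1)};
    const Lanes ymm14 = {with_id(10.9, 4), with_id(18.7, 4),
                         with_id(16.5, 4), with_id(12.3, 4)};
    return lanewise_min(ymm14, ymm15);
}
```

Applying `id_of` to the four lanes of `slide_example()` gives the IDs 2, 3, 2 and 4: only the last point switches to cluster 4, because only there is the new distance smaller than the running minimum.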
INFLUENCE ON THE DISTANCE?
BACKPACKED CLUSTER ID CODING
• How much does a backpacked clusterId change the distance?
• Not much: if the true distance is 1.0 and the clusterId is 255, the backpacked value is 1.000000000000057 (13 zeros after the decimal point)
• Not significantly: Euclidean distance involves a square root, which means that half of the bits are numerically insignificant anyway
[Figure: bit layout of a double (sign | exponent | fraction, 52 bit); the fraction is split into the part that is numerically significant in ||x_i - µ_j|| and the part that holds the cluster-ID; label: 26 bit]
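The quoted value can be checked with a short calculation, assuming a normalised IEEE-754 double whose 8 lowest fraction bits are overwritten by the clusterId. The symbols $\tilde{d}$ (backpacked distance) and $e$ (binary exponent of $d$) are introduced here for illustration.

```latex
% One unit in the last place of d, for 2^{e} \le d < 2^{e+1}, is 2^{e-52}.
% Overwriting the 8 lowest fraction bits changes d by at most 2^{8}-1 = 255 such units:
\[
  |\tilde{d} - d| \le 255 \cdot 2^{e-52},
  \qquad
  \frac{|\tilde{d} - d|}{d} \le \frac{255 \cdot 2^{e-52}}{2^{e}}
  = 255 \cdot 2^{-52} \approx 5.66 \times 10^{-14}.
\]
% For d = 1.0 (e = 0, all fraction bits zero) and clusterId = 255:
\[
  \tilde{d} = 1 + 255 \cdot 2^{-52} \approx 1.000000000000057 .
\]
```

The relative change is therefore bounded by roughly $5.7 \times 10^{-14}$ regardless of the magnitude of the distance.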
SETTING
PERFORMANCE EVALUATION
• 2 quad-core CPUs, 2.4 GHz
  - Intel Xeon E5-2609
  - Sandy Bridge micro-architecture
  - AVX1
• Cache
  - 4x32 kB L1 data cache
  - 4x256 kB L2 cache
  - 10 MB (shared) L3 cache
• Software
  - C++ (GNU g++)
• 5 iterations
• Synthetic data
  - n up to 64 million
  - k up to 20
  - d up to 100
• Real data from UCI
  - Forest Covertype (n=580000, d=54)
  - Household data (n=2 million, d=7)
REAL DATA
RUN UNTIL CONVERGENCE
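[Figure: bar chart (axis 0 to 12) for three data sets: Synthetic (12D), CoverType (54D) and Household (7D). Bars: No Vect. (1-core), Autovect. (1-core), MKM (1-core), No Vect. (8-core), Autovect. (8-core), MKM (8-core). Three bars exceed the axis and are labelled 51.2, 39.1 and 55.3.]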
SYNTHETIC DATA
DASHED LINE SHOWS IDEAL CURVE
New experiments for the SDM final version
n=32 Million; k=40; d=20

Measured runtime for 5 iterations (s):

# Threads   Autovect.   BLAS-KM   no ID coding   MKM
1           134.313     43.873    60.915         31.18
2           68.03       28.856    25.569         18.896
3           46.871      19.408    18.228         12.501
4           36.031      15.39     13.843         9.155
5           29.411      12.296    13.888         7.64
6           25.081      13.858    10.583         6.554
7           21.914      11.896    10.923         5.533
8           19.758      10.392    8.519          5.017

Ideal runtime (single-thread runtime divided by the number of threads) (s):

# Threads   Autovect.    BLAS-KM      no ID coding   MKM
1           134.313      43.873       60.915         31.18
2           67.1565      21.9365      30.4575        15.59
3           44.771       14.6243333   20.305         10.3933333
4           33.57825     10.96825     15.22875       7.795
5           26.8626      8.7746       12.183         6.236
6           22.3855      7.31216667   10.1525        5.19666667
7           19.1875714   6.26757143   8.70214286     4.45428571
8           16.789125    5.484125     7.614375       3.8975

[Figure: four line charts of runtime for 5 iterations (s) against the number of threads (1 to 8), for Autovect., BLAS-KM, no ID coding and MKM.]
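The ideal curve is the single-thread runtime divided by the number of threads. As a small worked example, the sketch below derives the ideal runtime, the speedup and the parallel efficiency from two measured runtimes; the `Scaling` struct and the `scaling` function are names introduced here for illustration and do not appear in the slides.

```cpp
#include <cassert>

// Hypothetical container for the three derived quantities.
struct Scaling {
    double ideal_runtime;  // t1 / threads
    double speedup;        // t1 / tp
    double efficiency;     // speedup / threads
};

// t1: runtime with 1 thread, tp: runtime with `threads` threads (seconds).
Scaling scaling(double t1, double tp, int threads) {
    assert(threads >= 1 && t1 > 0.0 && tp > 0.0);
    const double speedup = t1 / tp;
    return {t1 / threads, speedup, speedup / threads};
}
```

With the MKM runtimes from the slide, `scaling(31.18, 5.017, 8)` gives an ideal runtime of 3.8975 s, a speedup of about 6.2 and an efficiency of about 0.78.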
SCALABILITY
IN N, D AND K
d = 20; k = 40; c = 8; iter = 5

Runtime for 5 iterations (s):

n (millions)   Autovect. (8 cores)   MKM (8 cores)   factor (Autovect. / MKM)   no vect. (1 core)
1              0.887                 0.147           6.03401361                 6.113
16             13.748                2.534           5.42541436                 95.532
32             26.865                5.036           5.33459095
48             43.191                8.274           5.22008702
64             59.179                9.306           6.3592306                  2408

Further unlabelled cells: 258.757791, 28.3733365

[Figure: three line charts of runtime for 5 iterations (s), axis 0 to 70, against # Objects (Millions), Dimensionality and # Clusters, for Autovect. and MKM.]
MULTI-CORE K-MEANS
BÖHM C.; PERDACHER M.; PLANT C.
SPEAKER: MARTIN PERDACHER
Source code available at:
https://informatik.univie.ac.at/dm/downloads/
PaperId: 031_115

