A Lock-Free Algorithm of Tree-Based
Reduction for Large Scale Clustering on
GPGPU
National Institute of Informatics, Japan
Ruo Ando
2019 2nd International Conference on Artificial
Intelligence and Pattern Recognition
2019 年第二届人工智能和模式识别国际会议
North China University of Technology (NCUT) / 北方工业大学
August 17th, 2019, 11:35-12:15
Slideshare version rev.2019.08.19
Abstract
• Recently, the art of concurrency and parallelism
has advanced rapidly. However, conventional
techniques still suffer from the drawback of lock
contention.
• This talk reports the current situation of massively
parallel computing.
• Based on this situation, a lock-free technique of
tree-based reduction for large-scale clustering on
GPGPU is illustrated.
• In the experiments, the performance of a native GPU
kernel with atomic instructions, the CUDA Thrust
template library, and the proposed method is
compared and evaluated.
Bottlenecks for massive parallelism
• Lock contention: threads should spend as little
time inside a critical section as possible, to reduce
the amount of time other threads sit idle waiting to
acquire the lock, a state known as "lock
contention".
• Using a multitude of small, separate critical
sections instead introduces the system overhead of
acquiring and releasing each separate lock.
In many cases, contention for locks reduces parallel
efficiency and hurts scalability; the sketch below
contrasts a heavily contended counter with a privatized one.
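As a minimal illustration (ours, not from the original slides; the function names and constants are hypothetical), the following C++ sketch contrasts the two regimes: every thread fighting over one lock per increment, versus each thread accumulating privately and taking the lock once.

#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kThreads = 8;
constexpr long kIters = 1000000;

long counter = 0;
std::mutex m;

// High contention: every single increment acquires the shared lock.
void contended() {
  for (long i = 0; i < kIters; ++i) {
    std::lock_guard<std::mutex> g(m);
    ++counter;
  }
}

// Low contention: accumulate into a private local, take the lock once.
void privatized() {
  long local = 0;
  for (long i = 0; i < kIters; ++i) ++local;
  std::lock_guard<std::mutex> g(m);
  counter += local;
}

int main() {
  std::vector<std::thread> pool;
  for (int t = 0; t < kThreads; ++t) pool.emplace_back(privatized);  // swap in contended to feel the difference
  for (auto& th : pool) th.join();
  std::printf("counter = %ld\n", counter);
}

Both versions compute the same total; only the time threads spend waiting on the lock differs.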
❑道生一、一生二、二生三、三生萬物 - 老子 ("The Tao begets One; One begets Two; Two begets Three; Three begets all things." - Laozi)
❑ Unreasonable Effectiveness of Data
If a machine learning program cannot work with a training set of
a million examples, the intuitive conclusion is that it cannot work at all.
However, it has become clear that machine learning over a
huge dataset with a trillion items can be highly effective in
tasks for which machine learning over a sanitized (clean)
dataset with only a million items is NOT useful.
Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta, "Revisiting
Unreasonable Effectiveness of Data in Deep Learning Era", ICCV 2017.
https://arxiv.org/abs/1707.02968
[Figure: Scalability]
Reduction pattern
A reduction combines every element
in a collection into a single element
using an associative combiner
function.
Given the associativity of the
combiner function, many different
orderings are possible, but with
different spans.
If the combiner function is also
commutative, additional orderings
are possible.
The tree structure depends on a
reordering of the combiner
operations permitted by associativity; a serial fold and a
tree ordering are contrasted in the sketch after the sample records below.
"2019/07/02 00:00:00.867","841","25846”
"2019/07/02 03:03:00.511","784","52326”
"2019/07/02 00:00:00.867",“700",“40000”
"2019/07/02 11:11:37.872","336","50346”
"2019/07/02 00:00:00.867",“1541",“65846”
Proposed method (2) - large-scale clustering
• Fine reduction
- New cluster assignment
- Calculating the sums of each cluster
const int fine_shared_memory = 3 * threads * sizeof(float);
fine_reduce<<<blocks, threads, fine_shared_memory>>>(/* arguments elided in the slide */);
• Coarse reduction
- Calculating the centroids (new means)
const int coarse_shared_memory = 2 * k * blocks * sizeof(float);
coarse_reduce<<<1, k * blocks, coarse_shared_memory>>>(/* arguments elided in the slide */);
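The slides give only the launch configurations, so the following is a minimal CUDA sketch of what such a two-stage, lock-free tree reduction for 2-D k-means clustering can look like. The kernel signatures, the shared-memory layout (one row each for x, y, and count), and the power-of-two assumptions on threads and blocks are ours, not necessarily the exact code behind the talk.

#include <cfloat>

__device__ float squared_l2(float x1, float y1, float x2, float y2) {
  return (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2);
}

// Stage 1 (fine): one thread per point. Assign the point to its nearest
// centroid, then tree-reduce per-cluster (x, y, count) within the block.
// Dynamic shared memory: 3 * blockDim.x floats.
__global__ void fine_reduce(const float* xs, const float* ys, int n,
                            const float* mean_x, const float* mean_y, int k,
                            float* sum_x, float* sum_y, float* counts) {
  extern __shared__ float s[];
  const int tid = threadIdx.x;
  const int gid = blockIdx.x * blockDim.x + tid;
  const float x = (gid < n) ? xs[gid] : 0.0f;
  const float y = (gid < n) ? ys[gid] : 0.0f;

  // Assignment step: purely thread-local, no locks needed.
  int best = -1;
  float best_d = FLT_MAX;
  if (gid < n) {
    for (int c = 0; c < k; ++c) {
      const float d = squared_l2(x, y, mean_x[c], mean_y[c]);
      if (d < best_d) { best_d = d; best = c; }
    }
  }

  float* sx = s;                   // row of x contributions
  float* sy = s + blockDim.x;      // row of y contributions
  float* sc = s + 2 * blockDim.x;  // row of counts

  for (int c = 0; c < k; ++c) {
    sx[tid] = (best == c) ? x : 0.0f;
    sy[tid] = (best == c) ? y : 0.0f;
    sc[tid] = (best == c) ? 1.0f : 0.0f;
    __syncthreads();
    // Lock-free tree reduction (blockDim.x assumed a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
      if (tid < stride) {
        sx[tid] += sx[tid + stride];
        sy[tid] += sy[tid + stride];
        sc[tid] += sc[tid + stride];
      }
      __syncthreads();
    }
    if (tid == 0) {  // one partial result per (block, cluster) pair
      sum_x[blockIdx.x * k + c] = sx[0];
      sum_y[blockIdx.x * k + c] = sy[0];
      counts[blockIdx.x * k + c] = sc[0];
    }
    __syncthreads();
  }
}

// Stage 2 (coarse): a single block of k * blocks threads folds the
// per-block partials down to k totals and divides to get the new means.
// The block * k + c indexing keeps indices congruent mod k in the same
// cluster, so strides that are multiples of k (blocks a power of two)
// always combine matching clusters.
__global__ void coarse_reduce(float* mean_x, float* mean_y, int k,
                              float* sum_x, float* sum_y, float* counts) {
  extern __shared__ float s[];
  const int tid = threadIdx.x;
  float* sx = s;
  float* sy = s + blockDim.x;
  sx[tid] = sum_x[tid];
  sy[tid] = sum_y[tid];
  __syncthreads();
  for (int stride = blockDim.x / 2; stride >= k; stride /= 2) {
    if (tid < stride) {
      sx[tid] += sx[tid + stride];
      sy[tid] += sy[tid + stride];
      counts[tid] += counts[tid + stride];
    }
    __syncthreads();
  }
  if (tid < k) {
    const float cnt = fmaxf(counts[tid], 1.0f);  // guard empty clusters
    mean_x[tid] = sx[tid] / cnt;
    mean_y[tid] = sy[tid] / cnt;
  }
}

The launches then match the slide: fine_reduce over (blocks, threads) with 3 * threads * sizeof(float) bytes of shared memory, and coarse_reduce over a single block of k * blocks threads with 2 * k * blocks * sizeof(float).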
Overview and Grid layout
[Figure: grid layout of the two stages; the fine stage spans all blocks (X = blocks, Y = threads) and the coarse stage runs in a single block]
fine_reduce<<<blocks, threads, fine_shared_memory>>>
coarse_reduce<<<1, k * blocks, coarse_shared_memory>>>
Input and output: fine reduction
fine_reduce<<<blocks, threads, fine_shared_memory>>>
Shared memory layout - Fine reduction
fine_reduce<<<blocks, threads, fine_shared_memory>>>
[Figure: fine-stage shared memory for one block; X = blocks, Y = threads]
const int fine_shared_memory = 3 * threads * sizeof(float); // presumably one row of threads floats each for x, y, and the count
Input and output: coarse reduction
coarse_reduce<<<1, k * blocks, coarse_shared_memory>>>
[Figure: per-block, per-cluster partial sums and counts from the fine stage (blockID = 1..4) being folded into the totals for the k = 5 clusters]
Shared memory layout - Coarse reduction
coarse_reduce<<<1, k * blocks, coarse_shared_memory>>>
[Figure: coarse-stage shared memory; X = k, Y = blocks]
const int coarse_shared_memory = 2 * k * blocks * sizeof(float); // presumably one row of k * blocks floats each for the x and y partial sums
Experimental results
① Using the atomicAdd function, a programmer can rewrite the incr kernel.
This instruction atomically adds a value to the value stored at a memory
location, with no explicit lock:
__global__ void incr(int *ptr) { int temp = atomicAdd(ptr, 1); } // temp receives the old value
② Thrust provides two vector containers, host_vector and device_vector. A
host_vector is stored in host memory, while a device_vector lives in GPU device
memory. A minimal usage sketch follows.
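As a minimal, self-contained sketch of this container style (ours, not code from the slides):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
  thrust::host_vector<int> h(1 << 20, 1);  // lives in host memory
  thrust::device_vector<int> d = h;        // copied into GPU device memory
  // Parallel reduction on the device, no explicit locks or kernels.
  const int sum = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());
  std::printf("sum = %d\n", sum);          // expect 1048576
  return 0;
}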
③ Reductions in serial execution, like the averaging performed during the update step,
scale linearly. Parallel reductions, however, can be implemented efficiently with a
two-stage tree reduction: fine and coarse. The key point of the fine-coarse
reduction is that averaging is not performed over all of the data at once; instead,
for each cluster, only the points assigned to that cluster are averaged.
thrust::device_vector<float> d_mean_x(h_x.begin(), h_x.begin() + k);
thrust::device_vector<float> d_mean_y(h_y.begin(), h_y.begin() + k);
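For reference, the same per-cluster averaging can be expressed in pure Thrust; the sketch below uses our own naming and assumes every cluster 0..k-1 has at least one assigned point (it is not necessarily the formulation used in the experiments).

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/constant_iterator.h>

// labels: cluster assignment per point; xs/ys: coordinates per point.
void update_means(thrust::device_vector<int>& labels,
                  thrust::device_vector<float>& xs,
                  thrust::device_vector<float>& ys,
                  thrust::device_vector<float>& mean_x,
                  thrust::device_vector<float>& mean_y,
                  int k) {
  // Group points of the same cluster together.
  thrust::sort_by_key(labels.begin(), labels.end(),
                      thrust::make_zip_iterator(
                          thrust::make_tuple(xs.begin(), ys.begin())));
  thrust::device_vector<int> keys(k);
  thrust::device_vector<float> sum_x(k), sum_y(k);
  thrust::device_vector<int> counts(k);
  // Segmented (per-cluster) sums of x, y, and point counts.
  thrust::reduce_by_key(labels.begin(), labels.end(), xs.begin(),
                        keys.begin(), sum_x.begin());
  thrust::reduce_by_key(labels.begin(), labels.end(), ys.begin(),
                        keys.begin(), sum_y.begin());
  thrust::reduce_by_key(labels.begin(), labels.end(),
                        thrust::make_constant_iterator(1),
                        keys.begin(), counts.begin());
  // New centroid = per-cluster sum / per-cluster count
  // (host-side loop kept for clarity; k is small).
  for (int c = 0; c < k; ++c) {
    const int n = counts[c];
    mean_x[c] = static_cast<float>(sum_x[c]) / n;
    mean_y[c] = static_cast<float>(sum_y[c]) / n;
  }
}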
Experimental results 1
[Figure: performance comparison chart]
Conclusion
• Recently, the art of concurrency and parallelism
has advanced rapidly. However, conventional
techniques still suffer from the drawback of lock
contention.
• This talk reports the current situation of massively
parallel computing.
• Based on this situation, a lock-free technique of
tree-based reduction for large-scale clustering on
GPGPU is illustrated.
• In the experiments, the performance of a native GPU
kernel with atomic instructions, the CUDA Thrust
template library, and the proposed method is
compared and evaluated.
  • 14. Conclusion • Recently, the art of concurrency and parallelism has been advanced rapidly. However, conventional techniques still suffer of the drawback of lock contention. • This talk reports the current situation of massively parallel computing. • Based on this situation, a Lock-free technique of tree-based reduction for large scale clustering on GPGPU is illustrated. • In experiment, the performance of native GPU kernel with atomic instruction, CUDA Thrust template libraries and proposal method is compared and evaluated.