CUDA and Caffe for 
deep learning 
Amgad Muhammad 
Mohamed Ghoneim
Outline 
• GPU Computing 
• What is CUDA? 
• Why use CUDA? 
• When to use CUDA? 
• CUDA - Machine Specs 
• CUDA - Matrix Multiplication 
• CUDA - Closest Pair in 2D 
• Convolutional Neural Networks 
• Auto Encoder
GPU Computing 
• Moore’s law has slowed down. 
• Performance gains now come from parallelism rather than from faster individual processing units. 
• A CPU has a small number of processing units, each with very high processing power. 
• A GPU has a large number of processing units, each with moderate processing power.
What is CUDA? 
• Compute Unified Device Architecture 
• Introduced by NVIDIA in 2006. 
• Refers to two different concepts: 
1. CUDA Architecture: the massively parallel architecture of modern GPUs, with hundreds of cores. 
2. CUDA Programming Model: the model used to program these GPUs.
[Bryan Catanzaro]
Why use CUDA? 
• Efficiently processes thousands of small, repeated tasks in parallel. 
• Provides a way for these tasks to communicate and cooperate efficiently. 
• Offers a scalable and intuitive mechanism for expressing parallelism.
When to use CUDA? 
• Lots of computations and lots of data. 
• Parallel algorithms. 
• Neural Networks. 
• Physical Simulations 
• Distributed Computing 
• Accelerated Encryption, Decryption and Compression
CUDA – Machine Specs 
Machine specs for this experiment: 
- Processor: Dual-core AMD Opteron(™) processor 2216, 2.4 GHz (2 processors) 
- RAM: 32.0 GB 
- OS: 64-bit Windows 7 
- Graphics Card: Quadro FX 4600 
- CUDA Driver: 5.5 
- CUDA Compute Capability: 1.0 
- # of Cores: 96 
- Core Clock: 500MHz 
- Memory: 768MB 
- Memory Clock: 1400MHz
CUDA - Matrix Multiplication 
Comparing different implementations (all times in milliseconds). 
[Chart: Matrix Multiplication. X-axis: matrix side, 100 to 1000. Y-axis: time in ms, 0 to 25,000. Series: CPU, GPU.]
CUDA - Closest Pair in 2D 
This is a well-known problem: given a set of points, find the two points that are closest to each other. There are several ways to solve it: 
1. Brute force, with complexity O(n^2) 
2. Divide and conquer, with complexity O(n log n) 
For completeness, there is another implementation using k-d trees, with complexity similar to divide and conquer.
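As a point of reference for the brute-force approach, the following is a minimal CUDA sketch that assigns one thread per point and scans every other point directly from global memory. It is an illustration under assumptions, not the authors' "BF GPU" code: the names bruteForceClosest and closestPairDistance, the block size of 256, and the final minimum taken on the host are all hypothetical choices.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <vector>

// Hypothetical kernel: thread idx computes the squared distance from point idx
// to its nearest neighbour by scanning all other points in global memory.
__global__ void bruteForceClosest(const float2* points, float* vals, int count)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= count) return;

    float2 p = points[idx];
    float best = FLT_MAX;
    for (int j = 0; j < count; ++j) {
        if (j == idx) continue;              // skip the point itself
        float dx = p.x - points[j].x;
        float dy = p.y - points[j].y;
        best = fminf(best, dx * dx + dy * dy);
    }
    vals[idx] = best;
}

// Hypothetical host wrapper: returns the distance between the closest pair,
// or -1.0f if fewer than two points are given or a CUDA call fails.
float closestPairDistance(const std::vector<float2>& pts)
{
    int n = static_cast<int>(pts.size());
    if (n < 2) return -1.0f;

    float2* dPts = nullptr;
    float* dVals = nullptr;
    if (cudaMalloc(&dPts, n * sizeof(float2)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dVals, n * sizeof(float)) != cudaSuccess) {
        cudaFree(dPts);
        return -1.0f;
    }
    cudaMemcpy(dPts, pts.data(), n * sizeof(float2), cudaMemcpyHostToDevice);

    const int threads = 256;                 // assumed block size
    int blocks = (n + threads - 1) / threads;
    bruteForceClosest<<<blocks, threads>>>(dPts, dVals, n);

    std::vector<float> vals(n);
    cudaError_t err = cudaMemcpy(vals.data(), dVals, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dPts);
    cudaFree(dVals);
    if (err != cudaSuccess) return -1.0f;

    // Per-point results are squared distances; take the minimum, then the root.
    return std::sqrt(*std::min_element(vals.begin(), vals.end()));
}
```

Each thread still does O(n) work, so the total work remains O(n^2); the GPU only spreads it across many threads.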
CUDA - Closest Pair in 2D (cont.) 
Comparing different implementations (all times in milliseconds). 
[Chart: Closest Pair in 2D. X-axis: number of points, 100 to 100,000. Y-axis: time in ms, 0 to 250,000. Series: Brute Force CPU, BF GPU, BF GPU Optimized.]
CUDA - Closest Pair in 2D (cont.) 
Comparing different implementations (all times in milliseconds). 
[Chart: Closest Pair in 2D. X-axis: number of points, 100 to 40,000. Y-axis: time in ms, 0 to 1,000. Series: BF GPU Optimized, Divide and Conquer CPU.]
CUDA - Closest Pair in 2D (cont.) 
To explain how the optimized GPU version works, we need to review how the thread hierarchy in the GPU works:
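In CUDA, threads are grouped into blocks and blocks into a grid; each thread derives a unique global index from the built-in variables threadIdx, blockIdx, and blockDim. The small sketch below records which block and which thread handle each element. The kernel name recordIndices, the helper launchRecordIndices, and the launch configuration are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Hypothetical kernel: each thread writes its own block and thread number
// into the slot given by its global index.
__global__ void recordIndices(int* blockOf, int* threadOf, int count)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;   // global index
    if (idx < count) {
        blockOf[idx] = blockIdx.x;
        threadOf[idx] = threadIdx.x;
    }
}

// Hypothetical host helper: launches enough blocks to cover `count` elements
// and returns (block index, thread index) per element. Returns empty vectors
// if a CUDA call fails.
std::pair<std::vector<int>, std::vector<int>>
launchRecordIndices(int count, int threadsPerBlock)
{
    std::vector<int> blockOf(count), threadOf(count);
    int* dBlock = nullptr;
    int* dThread = nullptr;
    if (cudaMalloc(&dBlock, count * sizeof(int)) != cudaSuccess) return {};
    if (cudaMalloc(&dThread, count * sizeof(int)) != cudaSuccess) {
        cudaFree(dBlock);
        return {};
    }

    // Round up so that the last, partially filled block is also launched.
    int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;
    recordIndices<<<blocks, threadsPerBlock>>>(dBlock, dThread, count);

    cudaError_t e1 = cudaMemcpy(blockOf.data(), dBlock, count * sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(threadOf.data(), dThread, count * sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaFree(dBlock);
    cudaFree(dThread);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return {};
    return { blockOf, threadOf };
}
```

With 10 elements and 4 threads per block, three blocks are launched; the bounds check keeps the two surplus threads of the last block from writing out of range.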
CUDA - Closest Pair in 2D (cont.) 
To explain how the optimized GPU version works, we need to review how the memory hierarchy in the GPU works:
CUDA – back to Matrix Multiplication 
Explaining the matrix multiplication optimization on board
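The slides do not record what was written on the board. One widely used optimization for GPU matrix multiplication is shared-memory tiling, and the sketch below assumes that this is the kind of optimization meant. TILE, matMulTiled, and multiply are hypothetical names; matrices are assumed square, row-major, with side n.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;   // assumed tile side; each block has TILE x TILE threads

// Hypothetical kernel: each block computes one TILE x TILE tile of C = A * B,
// staging the needed tiles of A and B in shared memory.
__global__ void matMulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros.
        tileA[threadIdx.y][threadIdx.x] =
            (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();                       // tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                       // done before tiles are overwritten
    }
    if (row < n && col < n) C[row * n + col] = sum;
}

// Hypothetical host wrapper: returns C = A * B, or an empty vector on failure.
std::vector<float> multiply(const std::vector<float>& A,
                            const std::vector<float>& B, int n)
{
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) {
        cudaFree(dA); cudaFree(dB); cudaFree(dC);   // cudaFree(nullptr) is a no-op
        return {};
    }
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(TILE, TILE);
    dim3 blocks((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matMulTiled<<<blocks, threads>>>(dA, dB, dC, n);

    std::vector<float> C(static_cast<size_t>(n) * n);
    cudaError_t err = cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return {};
    return C;
}
```

The design point is data reuse: each element of A and B is read from global memory once per tile and then reused TILE times from shared memory, instead of being fetched from global memory for every multiply-add.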
CUDA - Closest Pair in 2D (cont.) 
Explaining the optimized code on board 
// blockSize and reasonableINF are compile-time constants defined elsewhere in the
// original program; the indexing below assumes blockDim.x == blockSize.
__global__ void FindClosestGPU2(float2* points, float* vals, int count) 
{ 
    // Tile of points shared by all threads of the block.
    __shared__ float2 sharedPoints[blockSize]; 
    if(count <= 1) return; 
    int idx = threadIdx.x + blockIdx.x * blockDim.x; 
    float2 thisPoint; 
    float distanceToClosest = FLT_MAX;   // squared distance to the nearest point so far
    if(idx < count) thisPoint = points[idx]; 
    for(int currentBlockOfPoints = 0; currentBlockOfPoints < gridDim.x; currentBlockOfPoints++) { 
        // Each thread loads one point of the current tile into shared memory.
        if(threadIdx.x + currentBlockOfPoints * blockSize < count) 
            sharedPoints[threadIdx.x] = points[threadIdx.x + currentBlockOfPoints * blockSize]; 
        else 
            sharedPoints[threadIdx.x].x = reasonableINF, sharedPoints[threadIdx.x].y = reasonableINF; 
        __syncthreads();   // wait until the whole tile is loaded
        if(idx < count) { 
            float *ptr = &sharedPoints[0].x;   // walk the tile as consecutive (x, y) floats
            for(int i = 0; i < blockSize; i++) { 
                float dist = (thisPoint.x - ptr[0]) * (thisPoint.x - ptr[0]) + 
                             (thisPoint.y - ptr[1]) * (thisPoint.y - ptr[1]); 
                ptr += 2; 
                // Ignore padding entries and the point itself.
                if(dist < distanceToClosest && (i + currentBlockOfPoints * blockSize < count) 
                   && (i + currentBlockOfPoints * blockSize != idx)) 
                    distanceToClosest = dist; 
            } 
        } 
        __syncthreads();   // wait before the tile is overwritten in the next iteration
    } 
    if(idx < count) 
        vals[idx] = distanceToClosest; 
}
CNN
Convolution, the first operation to optimize
Pooling, the second operation to optimize
Results
LeNet Results 
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. We used OpenBLAS for parallelization on the CPU. 
Because the data set is small, the speedup did not compensate for the overhead. 
[Chart: CNN. X-axis: 1 to 4 CPU cores. Y-axis: time in seconds, 0 to 800. Series: with GPU, without GPU.]
AutoEncoder
AutoEncoder Results 
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The main operation here is the inner product. 
[Chart: Auto Encoder. X-axis: 1 to 3 CPU cores. Y-axis: time in seconds, 0 to 800. Series: with GPU, without GPU.]
Thank You! 
Questions?
