EE-5351: Course project presentation
Accelerated Logistic Regression on GPU(s)
Rahul Bhojwani, Swaraj Khadanga, Anand Saharan
12/16/2018
Outline
• Problem description
• Key concepts
• Problem Understanding
• Datasets used
• Our Solutions
• Results
• References
• Model training and selection is the most costly
and repetitive step.
Our Focus
The Logistic Regression Model
● y = f(X), where y ∈ {0, 1}
● The "logit" model:
ln[p/(1-p)] = WᵀX + b
● p is the probability that the event Y occurs, p(Y=1)
● p/(1-p) is the "odds ratio"
● ln[p/(1-p)] is the log odds ratio, or "logit"
The Logistic Regression Model
● The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
● The estimated probability is:
p = 1/[1 + exp(-(WᵀX + b))]
Logistic Regression Training
● Training set {(x₁, y₁), ..., (xₙ, yₙ)} with yᵢ ∈ {0, 1}
● Likelihood, assuming independence:
L(W) = ∏ᵢ pᵢ^yᵢ · (1 - pᵢ)^(1 - yᵢ), where pᵢ = p(yᵢ = 1 | xᵢ)
● Log-likelihood, to be maximized:
ℓ(W) = Σᵢ [ yᵢ ln(pᵢ) + (1 - yᵢ) ln(1 - pᵢ) ]
Logistic Regression Training
● Let pᵢ = σ(WᵀXᵢ) = 1/[1 + exp(-WᵀXᵢ)]; this sigmoid term is the main computation that needs to be optimized.
● The negative log-likelihood, to be minimized:
J(W) = -Σᵢ [ yᵢ ln(pᵢ) + (1 - yᵢ) ln(1 - pᵢ) ]
● The gradient of the objective function:
∇J(W) = Σᵢ (pᵢ - yᵢ) · Xᵢ = Xᵀ (σ(XW) - y)
● And then the weights are updated using:
W ← W - η · ∇J(W)
Problem understanding
[Figure: X is an N_data × N_features matrix, y is an N_data × 1 vector, and w is an N_features × 1 vector]
N_data ~ 10⁶ - 10⁸
N_features ~ 10¹ - 10³
Problem understanding
Sub-problem 1: let’s call it the Sigmoid Kernel
● Compute X·w and apply the sigmoid to each element
● Subtract y to get the intermediate vector
Problem understanding
Sub-problem 2: let’s call it the Grad_compute kernel
● Multiply Xᵀ by the intermediate vector to get Grad, an N_features × 1 vector with the same shape as the weights
Training Routine (Pseudo Code)
1. initialize params
2. for epoch in 1, 2, 3, ... N:
   a. get next batch from the file
   b. compute intermediate vector [sigmoid(w.T*x) - y]
   c. compute gradient
   d. update weights
   e. repeat from (a) while the file has another batch
3. end
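A minimal host-side sketch of this routine in CUDA C++. The kernel names (sigmoid_kernel, grad_kernel, weight_update_kernel), the loader load_next_batch, the buffers, and the launch parameters are our placeholders rather than anything given in the slides; the individual kernels are developed in the following slides.

// Hypothetical host loop: one batch at a time, three kernel launches per batch.
// d_X, d_y, d_w, d_im, d_grad are device buffers and h_X, h_y host buffers,
// all assumed to be allocated elsewhere.
for (int epoch = 0; epoch < n_epochs; ++epoch) {
    while (load_next_batch(file, h_X, h_y, &batch_size)) {            // a. get batch from the file
        cudaMemcpy(d_X, h_X, batch_size * n_features * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, batch_size * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256, blocks = (batch_size + threads - 1) / threads;
        sigmoid_kernel<<<blocks, threads>>>(d_X, d_y, d_w, d_im,      // b. im = sigmoid(w.T x) - y
                                            batch_size, n_features);

        dim3 grad_block(32, 16);
        dim3 grad_grid((batch_size + 31) / 32, (n_features + 15) / 16);
        size_t grad_shmem = 32 * 16 * sizeof(float);
        cudaMemset(d_grad, 0, n_features * sizeof(float));            // c. grad = X.T * im
        grad_kernel<<<grad_grid, grad_block, grad_shmem>>>(d_X, d_im, d_grad,
                                                           batch_size, n_features);

        weight_update_kernel<<<1, n_features>>>(d_w, d_grad,          // d. w -= lr * grad
                                                learning_rate, n_features);
    }
}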
Datasets used
• HIGGS Dataset
– N_features = 28
– N_data = 500000
• DIGIT Dataset
– N_features = 784
– N_data = 10000
• We couldn’t load the entire HIGGS dataset on the machine, so N_data was smaller than the full set.
• We ran multiple epochs to effectively increase the N_data dimension in both cases.
Sequential Version
[Figure: the sequential baseline runs the Sigmoid Kernel step and the Grad compute step one after the other]
Sigmoid Kernel - 1
• Each thread processes one data point (Xi, Yi)
• X is stored row-major
• Uncoalesced access: threads in a warp read elements that are N_features apart
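A minimal CUDA sketch of this variant; the kernel name and signature are our assumptions, since the slides give no code. One thread handles one data point, and because X is row-major the per-feature loads are strided across a warp.

// Sketch: one thread computes sigmoid(w.T x_i) - y_i for data point i.
// X is row-major (n_data x n_features); threads in a warp touch addresses that are
// n_features floats apart, so the loads are uncoalesced.
__global__ void sigmoid_kernel_rowmajor(const float *X, const float *y, const float *w,
                                        float *im, int n_data, int n_features)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // data-point index
    if (i >= n_data) return;

    float dot = 0.0f;
    for (int j = 0; j < n_features; ++j)                // loop over the feature dimension
        dot += X[i * n_features + j] * w[j];            // stride-n_features reads across the warp

    im[i] = 1.0f / (1.0f + expf(-dot)) - y[i];          // intermediate vector entry
}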
Figure for understanding
[Figure: the Sigmoid Kernel on row-major X; each thread walks one row to compute sigmoid(X·w) - y]
Sigmoid Kernel - 2
• Each thread processes one data point (Xi, Yi)
• X is stored in column-major format
• Coalesced access: threads in a warp read consecutive elements within each feature column
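Only the indexing changes relative to the previous sketch (names and signature are again assumptions): with X stored column-major, consecutive threads read consecutive addresses for every feature j.

// Sketch: Xc is the same matrix in column-major order (n_features columns of length n_data).
__global__ void sigmoid_kernel_colmajor(const float *Xc, const float *y, const float *w,
                                        float *im, int n_data, int n_features)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // data-point index
    if (i >= n_data) return;

    float dot = 0.0f;
    for (int j = 0; j < n_features; ++j)
        dot += Xc[j * n_data + i] * w[j];               // a warp reads consecutive addresses

    im[i] = 1.0f / (1.0f + expf(-dot)) - y[i];
}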
Figure for understanding
[Figure: the Sigmoid Kernel on column-major X; threads in a warp read consecutive addresses]
Sigmoid Kernel - 3 (Shared memory)
• The weight vector is reused by every thread in a block.
• So the weights are loaded into shared memory once per block.
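A sketch of the shared-memory variant under the same naming assumptions: each block cooperatively caches the weight vector once, then every thread reads it from shared memory.

// Launch with n_features * sizeof(float) bytes of dynamic shared memory.
__global__ void sigmoid_kernel_shared_w(const float *Xc, const float *y, const float *w,
                                        float *im, int n_data, int n_features)
{
    extern __shared__ float w_s[];
    for (int j = threadIdx.x; j < n_features; j += blockDim.x)
        w_s[j] = w[j];                                  // block-wide, one-time copy of the weights
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_data) return;

    float dot = 0.0f;
    for (int j = 0; j < n_features; ++j)
        dot += Xc[j * n_data + i] * w_s[j];             // weights now served from shared memory

    im[i] = 1.0f / (1.0f + expf(-dot)) - y[i];
}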
Sigmoid Kernel - 4 (Constant memory)
• The weight values are constant within the kernel.
• So we tried storing them in constant memory.
• Problem:
– The weights need to be updated by a later kernel.
– So they must be copied back to the host and then into constant memory again before training the next batch.
– This round trip meant no improvement in computation speed.
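A sketch of the constant-memory attempt; MAX_FEATURES and all names are ours. The kernel itself is cheap, but the commented host step is the round trip described above.

#define MAX_FEATURES 1024                               // assumed upper bound on n_features

__constant__ float w_c[MAX_FEATURES];                   // weights in the constant bank

__global__ void sigmoid_kernel_const_w(const float *Xc, const float *y, float *im,
                                       int n_data, int n_features)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_data) return;
    float dot = 0.0f;
    for (int j = 0; j < n_features; ++j)
        dot += Xc[j * n_data + i] * w_c[j];             // broadcast reads from constant memory
    im[i] = 1.0f / (1.0f + expf(-dot)) - y[i];
}

// Host side, before every batch (the costly round trip after the weight-update kernel):
//     cudaMemcpy(w_host, d_w, n_features * sizeof(float), cudaMemcpyDeviceToHost);
//     cudaMemcpyToSymbol(w_c, w_host, n_features * sizeof(float));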
Sigmoid Kernel - 5 (Parallelized reduction)
• Problems with the previous kernels:
– All of them loop over the feature dimension to compute the dot-product sum.
– A higher feature dimension makes them slow.
• Solution:
– Consecutive threads each perform one multiplication xij * wj.
– The results are stored in shared memory.
– Each block does a private reduction on shared memory to finish one data point (Xi, Yi).
– We did the same with the weights in constant memory.
PS: If FEATURE_SIZE > 1024, thread coarsening is used: each thread performs multiple multiplications. (The data is not transposed; row-major order already gives memory coalescing here.)
Figure for understanding
[Figure: parallelized reduction; each block’s threads multiply features by weights into shared memory, then reduce]
Sigmoid Kernel - 5 (Parallelized reduction)
PS: If FEATURE_SIZE > 1024, thread coarsening is used: each thread performs multiple multiplications. (The data is not transposed.)
[Figure: the reduction step; the data is needed in row-major format for memory coalescing]
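Putting the last three slides together, a minimal sketch of the reduction variant (names and signature assumed): one block per data point, row-major X so that consecutive threads load consecutive features, a shared-memory tree reduction, and a coarsening loop when FEATURE_SIZE exceeds the block size.

// Launch with gridDim.x = n_data blocks, a power-of-two blockDim.x,
// and blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void sigmoid_kernel_reduction(const float *X, const float *y, const float *w,
                                         float *im, int n_data, int n_features)
{
    extern __shared__ float partial[];
    int i = blockIdx.x;                                 // one block per data point
    int t = threadIdx.x;

    float s = 0.0f;                                     // thread coarsening if n_features > blockDim.x
    for (int j = t; j < n_features; j += blockDim.x)
        s += X[i * n_features + j] * w[j];              // consecutive threads, consecutive addresses
    partial[t] = s;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {   // tree reduction
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }

    if (t == 0)
        im[i] = 1.0f / (1.0f + expf(-partial[0])) - y[i];
}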
Next sub-problem
Let’s call it the Grad_compute kernel
[Figure: Grad = Xᵀ × (intermediate vector), an N_features × 1 vector with the same shape as the weights]
Grad Computation Kernel - Basic
• 1 block in the grid, 2D block.
• Each thread computes one individual product Xij * IMi.
• Tiled computation over the entire data set.
• At each tile, the products are accumulated into shared memory.
• At the end, one set of threads loops over shared memory to reduce it.
Figure for explanation
[Figure: Grad_compute; tiles of X and the intermediate vector are multiplied and accumulated into shared memory]
Grad Computation Kernel - Basic
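A minimal sketch of this basic version as we read the slide; the kernel name, block shape, and shared-memory layout are our assumptions. A single 2D block: threadIdx.x walks the data dimension in tiles, threadIdx.y walks the features, and one set of threads folds the data axis at the end.

// Launch as grad_kernel_basic<<<1, dim3(TX, TY), TX * TY * sizeof(float)>>>(...);
// Xc is column-major, so the inner loop over data points i is coalesced.
__global__ void grad_kernel_basic(const float *Xc, const float *im, float *grad,
                                  int n_data, int n_features)
{
    extern __shared__ float acc[];                      // blockDim.y rows of blockDim.x partials
    int cell = threadIdx.y * blockDim.x + threadIdx.x;

    for (int j0 = 0; j0 < n_features; j0 += blockDim.y) {
        int j = j0 + threadIdx.y;                       // feature handled by this y-row
        float s = 0.0f;
        if (j < n_features)
            for (int i = threadIdx.x; i < n_data; i += blockDim.x)   // tile over the data
                s += Xc[j * n_data + i] * im[i];        // X_ij * IM_i
        acc[cell] = s;
        __syncthreads();

        if (threadIdx.x == 0 && j < n_features) {       // one set of threads reduces the partials
            float total = 0.0f;
            for (int k = 0; k < blockDim.x; ++k)
                total += acc[threadIdx.y * blockDim.x + k];
            grad[j] = total;
        }
        __syncthreads();
    }
}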
Grad Computation Kernel - 2 (1D Grid, 2D block)
• Problem with the previous kernel:
– It does not exploit all the available threads.
• Blocks are now used along the N_data dimension.
• Instead of one set of threads processing all tiles, each tile is processed by one block.
• A private reduction computes each tile’s value.
• Each block then atomically adds its partial result to global memory.
Figure for explanation
[Figure: Grad_compute with a 1D grid; each block reduces one tile of data points and atomically adds into Grad]
Grad Computation Kernel-2(1D Grid, 2D block)
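A sketch of the 1D-grid variant under the same assumptions: the grid now tiles the N_data dimension, each block privately reduces its tile in shared memory, and finishes with one atomicAdd per feature. grad must be zeroed before the launch.

// Launch with gridDim.x = ceil(n_data / blockDim.x), a power-of-two blockDim.x,
// and blockDim.x * blockDim.y * sizeof(float) bytes of dynamic shared memory.
__global__ void grad_kernel_1dgrid(const float *Xc, const float *im, float *grad,
                                   int n_data, int n_features)
{
    extern __shared__ float acc[];
    int cell = threadIdx.y * blockDim.x + threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // this block's tile of data points

    for (int j0 = 0; j0 < n_features; j0 += blockDim.y) {
        int j = j0 + threadIdx.y;
        acc[cell] = (i < n_data && j < n_features) ? Xc[j * n_data + i] * im[i] : 0.0f;
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {   // private reduction
            if (threadIdx.x < stride) acc[cell] += acc[cell + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0 && j < n_features)
            atomicAdd(&grad[j], acc[threadIdx.y * blockDim.x]);         // one atomic per block per feature
        __syncthreads();
    }
}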
Grad Computation Kernel - 3 (2D Grid, 2D block)
• Problems with the previous kernel:
– The number of threads per block is limited, so the data dimension per block can’t grow, which leads to more atomic adds.
– A large N_features dimension can’t be handled.
• A 2D grid is used so that blocks span both the N_data and N_features dimensions.
• A private reduction computes each tile’s value.
• Each block then atomically adds its partial result to global memory.
Grad Computation Kernel - 3 (2D Grid, 2D block)
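With a 2D grid the loop over features disappears; each block handles one (data-tile, feature-tile) pair. Again a sketch with assumed names, a power-of-two blockDim.x, and grad zeroed before the launch.

// Launch with grid = dim3(ceil(n_data / blockDim.x), ceil(n_features / blockDim.y)).
__global__ void grad_kernel_2dgrid(const float *Xc, const float *im, float *grad,
                                   int n_data, int n_features)
{
    extern __shared__ float acc[];                      // blockDim.x * blockDim.y partials
    int cell = threadIdx.y * blockDim.x + threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // data-point index
    int j = blockIdx.y * blockDim.y + threadIdx.y;      // feature index

    acc[cell] = (i < n_data && j < n_features) ? Xc[j * n_data + i] * im[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {       // private reduction
        if (threadIdx.x < stride) acc[cell] += acc[cell + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0 && j < n_features)
        atomicAdd(&grad[j], acc[threadIdx.y * blockDim.x]);
}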
Transpose Kernel (for memory coalescing)
• Solving sub-problem 1 with the parallelized reduction needs the data in row-major order for memory coalescing.
• Solving sub-problem 2 with the 1D-grid and 2D-grid kernels needs the data in column-major order for memory coalescing.
• So a kernel is required to transpose the data matrix X.
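A sketch of a tiled shared-memory transpose in the spirit of the NVIDIA post cited in the references; the tile sizes and names are ours. The +1 padding avoids shared-memory bank conflicts, and both the read and the write are coalesced.

#define TILE_DIM   32
#define BLOCK_ROWS 8
// Launch with block = dim3(TILE_DIM, BLOCK_ROWS),
// grid = dim3(ceil(cols / TILE_DIM), ceil(rows / TILE_DIM)).
__global__ void transpose_kernel(const float *in, float *out, int rows, int cols)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];      // padded to dodge bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;        // column in the input
    int y = blockIdx.y * TILE_DIM + threadIdx.y;        // row in the input
    for (int k = 0; k < TILE_DIM; k += BLOCK_ROWS)
        if (x < cols && y + k < rows)
            tile[threadIdx.y + k][threadIdx.x] = in[(y + k) * cols + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;            // transposed coordinates
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int k = 0; k < TILE_DIM; k += BLOCK_ROWS)
        if (x < rows && y + k < cols)
            out[(y + k) * rows + x] = tile[threadIdx.x][threadIdx.y + k];  // coalesced write
}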
Weight Update Kernel
• Kernel to update the weights according to W ← W - η · Grad.
• Since the weights are already in device memory, this kernel is very inexpensive.
• The kernel’s reads and writes are memory-coalesced.
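A sketch of the update kernel (the learning-rate parameter and signature are assumed): one thread per weight, so both the read of Grad and the write of W are coalesced.

__global__ void weight_update_kernel(float *w, const float *grad,
                                     float learning_rate, int n_features)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n_features)
        w[j] -= learning_rate * grad[j];                // w <- w - eta * grad, in place on the device
}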
Hardware Accelerated Exponentiation
• The sigmoid is evaluated over and over again throughout this computation.
• We accelerated it by using hardware-accelerated intrinsic functions:
• __expf() instead of exp()
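The swap is a one-liner wherever the sigmoid is evaluated, for example a small device helper like this sketch:

// __expf is the fast hardware intrinsic (lower precision than expf/exp).
__device__ __forceinline__ float fast_sigmoid(float z)
{
    return 1.0f / (1.0f + __expf(-z));
}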
Interleaving CPU and GPU computations using streams
• One key thing to note: continuous training over multiple epochs makes this process I/O expensive.
• Currently, data loading on the CPU and computation on the GPU happen sequentially.
• We exploit this by overlapping the load of the next batch on the CPU with computation on the GPU, using streams and double buffering.
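A host-side sketch of the double-buffering idea. The stream handling is standard CUDA API, but load_batch_from_file, has_next_batch, train_on_batch, and BATCH_BYTES are hypothetical placeholders: the GPU trains on one buffer while the CPU reads and copies the next batch into the other.

cudaStream_t compute_stream, copy_stream;
cudaStreamCreate(&compute_stream);
cudaStreamCreate(&copy_stream);

float *h_X[2], *d_X[2];                                  // double buffers
for (int b = 0; b < 2; ++b) {
    cudaMallocHost((void **)&h_X[b], BATCH_BYTES);       // pinned host memory, needed for async copies
    cudaMalloc((void **)&d_X[b], BATCH_BYTES);           // BATCH_BYTES: assumed size of one batch
}

int cur = 0;
load_batch_from_file(h_X[cur]);                          // hypothetical CPU loader
cudaMemcpy(d_X[cur], h_X[cur], BATCH_BYTES, cudaMemcpyHostToDevice);

while (has_next_batch()) {                               // hypothetical
    int nxt = 1 - cur;

    train_on_batch(d_X[cur], compute_stream);            // hypothetical: sigmoid + grad + update kernels

    load_batch_from_file(h_X[nxt]);                      // CPU I/O overlaps with the GPU kernels above
    cudaMemcpyAsync(d_X[nxt], h_X[nxt], BATCH_BYTES, cudaMemcpyHostToDevice, copy_stream);

    cudaStreamSynchronize(copy_stream);                  // next batch is resident on the device
    cudaStreamSynchronize(compute_stream);               // current batch's weight update has finished
    cur = nxt;
}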
Results
The GPU implementation accelerated the logistic regression computation by 57x.
References
• https://www.datanami.com/2018/09/05/how-to-build-a-better-machine-learning-pipeline/
• Class notes
• https://www.kaggle.com/c/digit-recognizer/data
• https://laurel.datsi.fi.upm.es/_media/proyectos/gopac/cuda-gdb.pdf
• https://archive.ics.uci.edu/ml/datasets/HIGGS
• https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/