EE-5351: Course project presentation
Accelerated Logistic Regression on GPU(s)
Rahul Bhojwani, Swaraj Khadanga, Anand Saharan
12/16/2018
Outline
• Problem description
• Key concepts
• Problem Understanding
• Datasets used
• Our Solutions
• Results
• References
• Model training and selection is the most costly
and repetitive step.
Our Focus
The Logistic Regression Model
● y = f(X), where y ∈ {0, 1}
● The "logit" model:
ln[p/(1-p)] = WᵀX + b
● p is the probability that the event Y occurs, p(Y=1)
● p/(1-p) is the "odds ratio"
● ln[p/(1-p)] is the log odds ratio, or "logit"
The Logistic Regression Model
● The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
● The estimated probability is:
p = 1/[1 + exp(-(WᵀX + b))]
Logistic Regression Training
● Training set {(x₁, y₁), ..., (xₙ, yₙ)} with yᵢ ∈ {0, 1}
● Likelihood, assuming independence:
L(W) = ∏ᵢ pᵢ^yᵢ · (1 - pᵢ)^(1 - yᵢ), where pᵢ = p(yᵢ = 1 | xᵢ)
● Log-likelihood, to be maximized:
ℓ(W) = Σᵢ [ yᵢ ln(pᵢ) + (1 - yᵢ) ln(1 - pᵢ) ]
Logistic Regression Training
● Let pᵢ = σ(WᵀXᵢ) = 1/[1 + exp(-WᵀXᵢ)]; this sigmoid term is the main computation that needs to be optimized.
● The negative log-likelihood, to be minimized:
J(W) = -Σᵢ [ yᵢ ln(pᵢ) + (1 - yᵢ) ln(1 - pᵢ) ]
● The gradient of the objective function:
∇J(W) = Σᵢ (pᵢ - yᵢ) · Xᵢ = Xᵀ (σ(XW) - y)
● And then the weights are updated using:
W ← W - η · ∇J(W)
Problem understanding
[Figure: X is an N_data × N_features matrix, y is an N_data × 1 vector, and w is an N_features × 1 vector]
N_data ~ 10⁶ - 10⁸
N_features ~ 10¹ - 10³
Problem understanding
Sub-problem 1: let’s call it the Sigmoid Kernel
● Compute X·w and apply the sigmoid to each element
● Subtract y to get the intermediate vector
Problem understanding
Sub-problem 2: let’s call it the Grad_compute kernel
● Multiply Xᵀ by the intermediate vector to get Grad, an N_features × 1 vector with the same shape as the weights
Training Routine (Pseudo Code)
1. initialize params
2. for epoch in 1, 2, 3, ... N:
   a. get next batch from the file
   b. compute intermediate vector [sigmoid(w.T*x) - y]
   c. compute gradient
   d. update weights
   e. repeat from (a) while the file has another batch
3. end
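A minimal host-side sketch of this routine in CUDA C++. The kernel names (sigmoid_kernel, grad_kernel, weight_update_kernel), the loader load_next_batch, the buffers, and the launch parameters are our placeholders rather than anything given in the slides; the individual kernels are developed in the following slides.

// Hypothetical host loop: one batch at a time, three kernel launches per batch.
// d_X, d_y, d_w, d_im, d_grad are device buffers and h_X, h_y host buffers,
// all assumed to be allocated elsewhere.
for (int epoch = 0; epoch < n_epochs; ++epoch) {
    while (load_next_batch(file, h_X, h_y, &batch_size)) {            // a. get batch from the file
        cudaMemcpy(d_X, h_X, batch_size * n_features * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, batch_size * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256, blocks = (batch_size + threads - 1) / threads;
        sigmoid_kernel<<<blocks, threads>>>(d_X, d_y, d_w, d_im,      // b. im = sigmoid(w.T x) - y
                                            batch_size, n_features);

        dim3 grad_block(32, 16);
        dim3 grad_grid((batch_size + 31) / 32, (n_features + 15) / 16);
        size_t grad_shmem = 32 * 16 * sizeof(float);
        cudaMemset(d_grad, 0, n_features * sizeof(float));            // c. grad = X.T * im
        grad_kernel<<<grad_grid, grad_block, grad_shmem>>>(d_X, d_im, d_grad,
                                                           batch_size, n_features);

        weight_update_kernel<<<1, n_features>>>(d_w, d_grad,          // d. w -= lr * grad
                                                learning_rate, n_features);
    }
}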
Datasets used
• HIGGS Dataset
– N_features = 28
– N_data = 500000
• DIGIT Dataset
– N_features = 784
– N_data = 10000
• We couldn’t load the entire HIGGS dataset on the machine, so N_data was smaller than the full set.
• We ran multiple epochs to effectively increase the N_data dimension in both cases.
Sequential Version
[Figure: the sequential baseline runs the Sigmoid Kernel step and the Grad compute step one after the other]
Sigmoid Kernel - 1
• Each thread processes one data point (Xi, Yi)
• X is stored row-major
• Uncoalesced access: threads in a warp read elements that are N_features apart
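A minimal CUDA sketch of this variant; the kernel name and signature are our assumptions, since the slides give no code. One thread handles one data point, and because X is row-major the per-feature loads are strided across a warp.

// Sketch: one thread computes sigmoid(w.T x_i) - y_i for data point i.
// X is row-major (n_data x n_features); threads in a warp touch addresses that are
// n_features floats apart, so the loads are uncoalesced.
__global__ void sigmoid_kernel_rowmajor(const float *X, const float *y, const float *w,
                                        float *im, int n_data, int n_features)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // data-point index
    if (i >= n_data) return;

    float dot = 0.0f;
    for (int j = 0; j < n_features; ++j)                // loop over the feature dimension
        dot += X[i * n_features + j] * w[j];            // stride-n_features reads across the warp

    im[i] = 1.0f / (1.0f + expf(-dot)) - y[i];          // intermediate vector entry
}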
Figure for understanding
[Figure: the Sigmoid Kernel on row-major X; each thread walks one row to compute sigmoid(X·w) - y]
Sigmoid Kernel - 2
• Each thread processes one data point (Xi, Yi)
• X is stored in column-major format
• Coalesced access: threads in a warp read consecutive elements within each feature column
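Only the indexing changes relative to the previous sketch (names and signature are again assumptions): with X stored column-major, consecutive threads read consecutive addresses for every feature j.

// Sketch: Xc is the same matrix in column-major order (n_features columns of length n_data).
__global__ void sigmoid_kernel_colmajor(const float *Xc, const float *y, const float *w,
                                        float *im, int n_data, int n_features)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // data-point index
    if (i >= n_data) return;

    float dot = 0.0f;
    for (int j = 0; j < n_features; ++j)
        dot += Xc[j * n_data + i] * w[j];               // a warp reads consecutive addresses

    im[i] = 1.0f / (1.0f + expf(-dot)) - y[i];
}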
Figure for understanding
[Figure: the Sigmoid Kernel on column-major X; threads in a warp read consecutive addresses]
Sigmoid Kernel - 3 (Shared memory)
• The weight vector is reused by every thread in a block.
• So the weights are loaded into shared memory once per block.
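A sketch of the shared-memory variant under the same naming assumptions: each block cooperatively caches the weight vector once, then every thread reads it from shared memory.

// Launch with n_features * sizeof(float) bytes of dynamic shared memory.
__global__ void sigmoid_kernel_shared_w(const float *Xc, const float *y, const float *w,
                                        float *im, int n_data, int n_features)
{
    extern __shared__ float w_s[];
    for (int j = threadIdx.x; j < n_features; j += blockDim.x)
        w_s[j] = w[j];                                  // block-wide, one-time copy of the weights
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_data) return;

    float dot = 0.0f;
    for (int j = 0; j < n_features; ++j)
        dot += Xc[j * n_data + i] * w_s[j];             // weights now served from shared memory

    im[i] = 1.0f / (1.0f + expf(-dot)) - y[i];
}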
Sigmoid Kernel - 4 (Constant memory)
• The weight values are constant within the kernel.
• So we tried storing them in constant memory.
• Problem:
– The weights need to be updated by a later kernel.
– So they must be copied back to the host and then into constant memory again before training the next batch.
– This round trip meant no improvement in computation speed.
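A sketch of the constant-memory attempt; MAX_FEATURES and all names are ours. The kernel itself is cheap, but the commented host step is the round trip described above.

#define MAX_FEATURES 1024                               // assumed upper bound on n_features

__constant__ float w_c[MAX_FEATURES];                   // weights in the constant bank

__global__ void sigmoid_kernel_const_w(const float *Xc, const float *y, float *im,
                                       int n_data, int n_features)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_data) return;
    float dot = 0.0f;
    for (int j = 0; j < n_features; ++j)
        dot += Xc[j * n_data + i] * w_c[j];             // broadcast reads from constant memory
    im[i] = 1.0f / (1.0f + expf(-dot)) - y[i];
}

// Host side, before every batch (the costly round trip after the weight-update kernel):
//     cudaMemcpy(w_host, d_w, n_features * sizeof(float), cudaMemcpyDeviceToHost);
//     cudaMemcpyToSymbol(w_c, w_host, n_features * sizeof(float));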
Sigmoid Kernel - 5 (Parallelized reduction)
• Problems with the previous kernels:
– All of them loop over the feature dimension to compute the dot-product sum.
– A higher feature dimension makes them slow.
• Solution:
– Consecutive threads each perform one multiplication xij * wj.
– The results are stored in shared memory.
– Each block does a private reduction on shared memory to finish one data point (Xi, Yi).
– We did the same with the weights in constant memory.
PS: If FEATURE_SIZE > 1024, thread coarsening is used: each thread performs multiple multiplications. (The data is not transposed; row-major order already gives memory coalescing here.)
Figure for understanding
[Figure: parallelized reduction; each block’s threads multiply features by weights into shared memory, then reduce]
Sigmoid Kernel - 5 (Parallelized reduction)
PS: If FEATURE_SIZE > 1024, thread coarsening is used: each thread performs multiple multiplications. (The data is not transposed.)
[Figure: the reduction step; the data is needed in row-major format for memory coalescing]
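Putting the last three slides together, a minimal sketch of the reduction variant (names and signature assumed): one block per data point, row-major X so that consecutive threads load consecutive features, a shared-memory tree reduction, and a coarsening loop when FEATURE_SIZE exceeds the block size.

// Launch with gridDim.x = n_data blocks, a power-of-two blockDim.x,
// and blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void sigmoid_kernel_reduction(const float *X, const float *y, const float *w,
                                         float *im, int n_data, int n_features)
{
    extern __shared__ float partial[];
    int i = blockIdx.x;                                 // one block per data point
    int t = threadIdx.x;

    float s = 0.0f;                                     // thread coarsening if n_features > blockDim.x
    for (int j = t; j < n_features; j += blockDim.x)
        s += X[i * n_features + j] * w[j];              // consecutive threads, consecutive addresses
    partial[t] = s;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {   // tree reduction
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }

    if (t == 0)
        im[i] = 1.0f / (1.0f + expf(-partial[0])) - y[i];
}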
Next sub-problem
Let’s call it the Grad_compute kernel
[Figure: Grad = Xᵀ × (intermediate vector), an N_features × 1 vector with the same shape as the weights]
Grad Computation Kernel - Basic
• 1 block in the grid, 2D block.
• Each thread computes one individual product Xij * IMi.
• Tiled computation over the entire data set.
• At each tile, the products are accumulated into shared memory.
• At the end, one set of threads loops over shared memory to reduce it.
Figure for explanation
[Figure: Grad_compute; tiles of X and the intermediate vector are multiplied and accumulated into shared memory]
Grad Computation Kernel - Basic
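A minimal sketch of this basic version as we read the slide; the kernel name, block shape, and shared-memory layout are our assumptions. A single 2D block: threadIdx.x walks the data dimension in tiles, threadIdx.y walks the features, and one set of threads folds the data axis at the end.

// Launch as grad_kernel_basic<<<1, dim3(TX, TY), TX * TY * sizeof(float)>>>(...);
// Xc is column-major, so the inner loop over data points i is coalesced.
__global__ void grad_kernel_basic(const float *Xc, const float *im, float *grad,
                                  int n_data, int n_features)
{
    extern __shared__ float acc[];                      // blockDim.y rows of blockDim.x partials
    int cell = threadIdx.y * blockDim.x + threadIdx.x;

    for (int j0 = 0; j0 < n_features; j0 += blockDim.y) {
        int j = j0 + threadIdx.y;                       // feature handled by this y-row
        float s = 0.0f;
        if (j < n_features)
            for (int i = threadIdx.x; i < n_data; i += blockDim.x)   // tile over the data
                s += Xc[j * n_data + i] * im[i];        // X_ij * IM_i
        acc[cell] = s;
        __syncthreads();

        if (threadIdx.x == 0 && j < n_features) {       // one set of threads reduces the partials
            float total = 0.0f;
            for (int k = 0; k < blockDim.x; ++k)
                total += acc[threadIdx.y * blockDim.x + k];
            grad[j] = total;
        }
        __syncthreads();
    }
}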
Grad Computation Kernel - 2 (1D Grid, 2D block)
• Problem with the previous kernel:
– It does not exploit all the available threads.
• Blocks are now used along the N_data dimension.
• Instead of one set of threads processing all tiles, each tile is processed by one block.
• A private reduction computes each tile’s value.
• Each block then atomically adds its partial result to global memory.
Figure for explanation
[Figure: Grad_compute with a 1D grid; each block reduces one tile of data points and atomically adds into Grad]
Grad Computation Kernel-2(1D Grid, 2D block)
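A sketch of the 1D-grid variant under the same assumptions: the grid now tiles the N_data dimension, each block privately reduces its tile in shared memory, and finishes with one atomicAdd per feature. grad must be zeroed before the launch.

// Launch with gridDim.x = ceil(n_data / blockDim.x), a power-of-two blockDim.x,
// and blockDim.x * blockDim.y * sizeof(float) bytes of dynamic shared memory.
__global__ void grad_kernel_1dgrid(const float *Xc, const float *im, float *grad,
                                   int n_data, int n_features)
{
    extern __shared__ float acc[];
    int cell = threadIdx.y * blockDim.x + threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // this block's tile of data points

    for (int j0 = 0; j0 < n_features; j0 += blockDim.y) {
        int j = j0 + threadIdx.y;
        acc[cell] = (i < n_data && j < n_features) ? Xc[j * n_data + i] * im[i] : 0.0f;
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {   // private reduction
            if (threadIdx.x < stride) acc[cell] += acc[cell + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0 && j < n_features)
            atomicAdd(&grad[j], acc[threadIdx.y * blockDim.x]);         // one atomic per block per feature
        __syncthreads();
    }
}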
Grad Computation Kernel - 3 (2D Grid, 2D block)
• Problems with the previous kernel:
– The number of threads per block is limited, so the data dimension per block can’t grow, which leads to more atomic adds.
– A large N_features dimension can’t be handled.
• A 2D grid is used so that blocks span both the N_data and N_features dimensions.
• A private reduction computes each tile’s value.
• Each block then atomically adds its partial result to global memory.
Grad Computation Kernel - 3 (2D Grid, 2D block)
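With a 2D grid the loop over features disappears; each block handles one (data-tile, feature-tile) pair. Again a sketch with assumed names, a power-of-two blockDim.x, and grad zeroed before the launch.

// Launch with grid = dim3(ceil(n_data / blockDim.x), ceil(n_features / blockDim.y)).
__global__ void grad_kernel_2dgrid(const float *Xc, const float *im, float *grad,
                                   int n_data, int n_features)
{
    extern __shared__ float acc[];                      // blockDim.x * blockDim.y partials
    int cell = threadIdx.y * blockDim.x + threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // data-point index
    int j = blockIdx.y * blockDim.y + threadIdx.y;      // feature index

    acc[cell] = (i < n_data && j < n_features) ? Xc[j * n_data + i] * im[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {       // private reduction
        if (threadIdx.x < stride) acc[cell] += acc[cell + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0 && j < n_features)
        atomicAdd(&grad[j], acc[threadIdx.y * blockDim.x]);
}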
Transpose Kernel (for memory coalescing)
• Solving sub-problem 1 with the parallelized reduction needs the data in row-major order for memory coalescing.
• Solving sub-problem 2 with the 1D-grid and 2D-grid kernels needs the data in column-major order for memory coalescing.
• So a kernel is required to transpose the data matrix X.
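A sketch of a tiled shared-memory transpose in the spirit of the NVIDIA post cited in the references; the tile sizes and names are ours. The +1 padding avoids shared-memory bank conflicts, and both the read and the write are coalesced.

#define TILE_DIM   32
#define BLOCK_ROWS 8
// Launch with block = dim3(TILE_DIM, BLOCK_ROWS),
// grid = dim3(ceil(cols / TILE_DIM), ceil(rows / TILE_DIM)).
__global__ void transpose_kernel(const float *in, float *out, int rows, int cols)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];      // padded to dodge bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;        // column in the input
    int y = blockIdx.y * TILE_DIM + threadIdx.y;        // row in the input
    for (int k = 0; k < TILE_DIM; k += BLOCK_ROWS)
        if (x < cols && y + k < rows)
            tile[threadIdx.y + k][threadIdx.x] = in[(y + k) * cols + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;            // transposed coordinates
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int k = 0; k < TILE_DIM; k += BLOCK_ROWS)
        if (x < rows && y + k < cols)
            out[(y + k) * rows + x] = tile[threadIdx.x][threadIdx.y + k];  // coalesced write
}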
Weight Update Kernel
• Kernel to update the weights according to W ← W - η · Grad.
• Since the weights are already in device memory, this kernel is very inexpensive.
• The kernel’s reads and writes are memory-coalesced.
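A sketch of the update kernel (the learning-rate parameter and signature are assumed): one thread per weight, so both the read of Grad and the write of W are coalesced.

__global__ void weight_update_kernel(float *w, const float *grad,
                                     float learning_rate, int n_features)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n_features)
        w[j] -= learning_rate * grad[j];                // w <- w - eta * grad, in place on the device
}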
Hardware Accelerated Exponentiation
• The sigmoid is evaluated over and over again throughout this computation.
• We accelerated it by using hardware-accelerated intrinsic functions:
• __expf() instead of exp()
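The swap is a one-liner wherever the sigmoid is evaluated, for example a small device helper like this sketch:

// __expf is the fast hardware intrinsic (lower precision than expf/exp).
__device__ __forceinline__ float fast_sigmoid(float z)
{
    return 1.0f / (1.0f + __expf(-z));
}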
Interleaving CPU and GPU computations using streams
• One key thing to note: continuous training over multiple epochs makes this process I/O expensive.
• Currently, data loading on the CPU and computation on the GPU happen sequentially.
• We exploit this by overlapping the load of the next batch on the CPU with computation on the GPU, using streams and double buffering.
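A host-side sketch of the double-buffering idea. The stream handling is standard CUDA API, but load_batch_from_file, has_next_batch, train_on_batch, and BATCH_BYTES are hypothetical placeholders: the GPU trains on one buffer while the CPU reads and copies the next batch into the other.

cudaStream_t compute_stream, copy_stream;
cudaStreamCreate(&compute_stream);
cudaStreamCreate(&copy_stream);

float *h_X[2], *d_X[2];                                  // double buffers
for (int b = 0; b < 2; ++b) {
    cudaMallocHost((void **)&h_X[b], BATCH_BYTES);       // pinned host memory, needed for async copies
    cudaMalloc((void **)&d_X[b], BATCH_BYTES);           // BATCH_BYTES: assumed size of one batch
}

int cur = 0;
load_batch_from_file(h_X[cur]);                          // hypothetical CPU loader
cudaMemcpy(d_X[cur], h_X[cur], BATCH_BYTES, cudaMemcpyHostToDevice);

while (has_next_batch()) {                               // hypothetical
    int nxt = 1 - cur;

    train_on_batch(d_X[cur], compute_stream);            // hypothetical: sigmoid + grad + update kernels

    load_batch_from_file(h_X[nxt]);                      // CPU I/O overlaps with the GPU kernels above
    cudaMemcpyAsync(d_X[nxt], h_X[nxt], BATCH_BYTES, cudaMemcpyHostToDevice, copy_stream);

    cudaStreamSynchronize(copy_stream);                  // next batch is resident on the device
    cudaStreamSynchronize(compute_stream);               // current batch's weight update has finished
    cur = nxt;
}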
Results
The GPU implementation accelerated the logistic regression computation by 57x.
References
• https://www.datanami.com/2018/09/05/how-to-build-a-better-machine-learning-pipeline/
• Class notes
• https://www.kaggle.com/c/digit-recognizer/data
• https://laurel.datsi.fi.upm.es/_media/proyectos/gopac/cuda-gdb.pdf
• https://archive.ics.uci.edu/ml/datasets/HIGGS
• https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/