CUDA Accelerated Face Recognition
Numaan. A
Third Year Undergraduate
Department of Electrical Engineering
Indian Institute of Technology, Madras, India
ee08b044@smail.iitm.ac.in
Sibi. A
NeST-NVIDIA Centre for GPU Computing
NeST, India
sibi.a@nestgroup.net
July 26, 2010
Abstract
This paper presents a study of the efficiency and performance speedup achieved by applying Graphics Processing Units to face recognition. We explore one way of parallelizing and optimizing a well-known face recognition algorithm, Principal Component Analysis (PCA) with Eigenfaces.
1 Introduction
In recent years, Graphics Processing Units (GPUs) have been the subject of extensive research, and their computation speed has been increasing rapidly. The computational power of the latest generation of GPUs, measured in FLOPS (floating-point operations per second), is several times that of a high-end CPU, and for this reason they are increasingly used for non-graphics applications, that is, general-purpose computing on GPUs (GPGPU). Traditionally, this power could be harnessed only through graphics APIs and was used primarily by professionals familiar with those APIs.
CUDA (Compute Unified Device Architecture), developed by NVIDIA, addresses this issue by bringing a familiar C-like development environment to GPGPU programming. It allows programmers to launch hundreds of concurrent threads on the massively parallel NVIDIA GPUs with very little software overhead. This paper describes our efforts to apply this power to a computationally intensive yet highly parallelizable PCA-based algorithm used in face recognition. We developed both serial CPU code and parallel GPU code, compared the execution time in each case, and measured the speedup achieved by the GPU over the CPU.
2 PCA Theory
2.1 Introduction
Principal Component Analysis (PCA) is one of the earliest and most successful techniques used for face recognition. PCA aims to reduce the dimensionality of data so that it can be represented and processed economically. The information contained in a human face is highly redundant, with each pixel highly correlated with its neighboring pixels. The main idea of using PCA for face recognition is to remove this redundancy and extract the features required for comparing faces. To increase the accuracy of the algorithm, we use a slightly modified method called Improved PCA, in which the training images are grouped into classes, each class containing multiple images of a single person with different facial expressions. The mathematics behind PCA is described in the following subsection.
2.2 Mathematics of PCA
First, the 2-D facial images are resized using bilinear interpolation to reduce the dimension of the data and speed up computation. Each resized image is represented as a 1-D vector by concatenating its rows into one long vector. Suppose we have $M$ training samples per class, each of size $N$ (the total number of pixels in the resized image). Let the training vectors be $x_i$, where the $p_j$ are pixel values:
$$x_i = [p_1 \; \ldots \; p_N]^T, \quad i = 1, \ldots, M$$
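The resize-and-flatten step can be sketched in C as follows. This is a minimal illustration, not the authors' code: the function name `resize_bilinear` and the half-pixel mapping between output and source coordinates are assumptions, and the paper does not state which convention its tools use. Because the output is written row after row, the result is already the flattened vector $x_i$.

```c
#include <assert.h>

/* Resize a row-major grayscale image (src_h x src_w) to dst_h x dst_w using
 * bilinear interpolation. dst is filled row after row, so it doubles as the
 * flattened 1-D image vector.
 * Assumption: output pixel centres map to source coordinates with a
 * half-pixel offset; other conventions exist. */
void resize_bilinear(const float *src, int src_h, int src_w,
                     float *dst, int dst_h, int dst_w)
{
    assert(src_h > 0 && src_w > 0 && dst_h > 0 && dst_w > 0);
    const float sy = (float)src_h / (float)dst_h;
    const float sx = (float)src_w / (float)dst_w;

    for (int r = 0; r < dst_h; ++r) {
        /* Source row coordinate, clamped to the valid range. */
        float fy = ((float)r + 0.5f) * sy - 0.5f;
        if (fy < 0.0f) fy = 0.0f;
        if (fy > (float)(src_h - 1)) fy = (float)(src_h - 1);
        const int y0 = (int)fy;
        const int y1 = (y0 + 1 < src_h) ? y0 + 1 : y0;
        const float wy = fy - (float)y0;

        for (int c = 0; c < dst_w; ++c) {
            /* Source column coordinate, clamped to the valid range. */
            float fx = ((float)c + 0.5f) * sx - 0.5f;
            if (fx < 0.0f) fx = 0.0f;
            if (fx > (float)(src_w - 1)) fx = (float)(src_w - 1);
            const int x0 = (int)fx;
            const int x1 = (x0 + 1 < src_w) ? x0 + 1 : x0;
            const float wx = fx - (float)x0;

            /* Blend the four neighbouring source pixels. */
            const float top = src[y0 * src_w + x0] * (1.0f - wx)
                            + src[y0 * src_w + x1] * wx;
            const float bot = src[y1 * src_w + x0] * (1.0f - wx)
                            + src[y1 * src_w + x1] * wx;
            dst[r * dst_w + c] = top * (1.0f - wy) + bot * wy;
        }
    }
}
```

For a 16 × 16 target, `dst` would hold $N = 256$ values per image.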
The training vectors are then normalized by subtracting the class mean image $m$ from them:
$$m = \frac{1}{M} \sum_{i=1}^{M} x_i$$
Let $w_i$ be the normalized images:
$$w_i = x_i - m$$
The matrix $W$ is formed by placing the normalized image vectors $w_i$ side by side as columns. The eigenvectors and eigenvalues of the covariance matrix $C$ are then computed:
$$C = W W^T$$
The size of $C$ is $N \times N$, which can be enormous. For example, images of size 16 × 16 give a covariance matrix of size 256 × 256. Solving for the eigenvectors of $C$ directly is not practical. Instead, the eigenvectors of the surrogate matrix $W^T W$, of size $M \times M$, are computed. The first $M - 1$ eigenvectors and eigenvalues of $C$ are then given by $W d_i$ and $\mu_i$, where $d_i$ and $\mu_i$ are the eigenvectors and eigenvalues of the surrogate matrix, respectively. This follows because $W^T W d_i = \mu_i d_i$ implies $W W^T (W d_i) = \mu_i (W d_i)$.
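The surrogate-matrix workaround can be sketched in C as below. This is an illustration under stated assumptions rather than the authors' MATLAB routine: the function names are hypothetical, the vectors are assumed to be stored back to back in memory (element $j$ of vector $i$ at index `i*N + j`), and the eigen-decomposition of the small $M \times M$ matrix is left to whatever solver is available.

```c
#include <assert.h>

/* Layout assumption: X and W hold M vectors of length N back to back,
 * so element j of vector i is at index i*N + j. */

/* Compute the class mean m and the mean-subtracted vectors w_i = x_i - m. */
void center_class(const float *X, int M, int N, float *mean, float *W)
{
    assert(M > 0 && N > 0);
    for (int j = 0; j < N; ++j) {
        float sum = 0.0f;
        for (int i = 0; i < M; ++i)
            sum += X[i * N + j];
        mean[j] = sum / (float)M;
    }
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            W[i * N + j] = X[i * N + j] - mean[j];
}

/* Surrogate matrix G = W^T W (M x M): G[a*M + b] is the dot product
 * of w_a and w_b. */
void surrogate_matrix(const float *W, int M, int N, float *G)
{
    assert(M > 0 && N > 0);
    for (int a = 0; a < M; ++a) {
        for (int b = 0; b < M; ++b) {
            float dot = 0.0f;
            for (int j = 0; j < N; ++j)
                dot += W[a * N + j] * W[b * N + j];
            G[a * M + b] = dot;
        }
    }
}

/* Given an eigenvector d of G (length M), write u = W d, which is an
 * unnormalized eigenvector of C = W W^T with the same eigenvalue.
 * Returns the squared length of u; dividing u by the square root of
 * that value gives a unit-length eigenface. */
float lift_eigenvector(const float *W, int M, int N, const float *d, float *u)
{
    assert(M > 0 && N > 0);
    float norm2 = 0.0f;
    for (int j = 0; j < N; ++j) {
        float acc = 0.0f;
        for (int i = 0; i < M; ++i)
            acc += W[i * N + j] * d[i];
        u[j] = acc;
        norm2 += acc * acc;
    }
    return norm2;
}
```

With $M$ in the tens and $N = 256$, the eigenproblem shrinks from 256 × 256 to $M \times M$, which is the saving the text describes.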
The eigenvectors corresponding to the non-zero eigenvalues of the covariance matrix, once scaled to unit length, form an orthonormal basis for the subspace within which most of the image data can be represented with minimum error. The eigenvector associated with the highest eigenvalue reflects the greatest variance in the images. These eigenvectors are known as eigenimages or eigenfaces, and when normalized they look like faces. The eigenvectors of all the classes are computed in the same way and placed side by side to make up the eigenspace $S$.
The mean image $m$ of the entire training set is then computed, and each training vector $x_i$ is normalized with it. Here $m$ denotes the mean over the whole training set, not the class mean used earlier. The normalized training vectors $w_i$ are projected onto the eigenspace $S$ to obtain the projected feature vectors $y_i$ of the training samples:
$$w_i = x_i - m$$
$$y_i = S^T w_i$$
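The projection $y_i = S^T (x_i - m)$ amounts to one dot product per eigenvector. The C sketch below illustrates this; the function name `project` and the storage of $S$ as $K$ eigenvectors of length $N$ laid back to back are assumptions made for the example.

```c
#include <assert.h>

/* Project one image onto the eigenspace: y = S^T (x - mean).
 * Assumed layout: S holds K eigenvectors of length N back to back,
 * so element j of eigenvector k is at S[k*N + j]. */
void project(const float *S, int K, int N,
             const float *x, const float *mean, float *y)
{
    assert(K > 0 && N > 0);
    for (int k = 0; k < K; ++k) {
        float acc = 0.0f;
        for (int j = 0; j < N; ++j)
            acc += S[k * N + j] * (x[j] - mean[j]);  /* subtract mean on the fly */
        y[k] = acc;
    }
}
```

Each feature vector has $K$ elements, so a 256-pixel image is reduced to $K$ numbers.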
The simplest method for determining which class the test face falls under is to find the class $k$ that minimizes the Euclidean distance. The test image is projected onto the eigenspace, and the Euclidean distance between the projected test image and each of the projected training samples is computed. If the minimum Euclidean distance falls below a predefined threshold $\theta$, the face is classified as belonging to the class containing the feature vector that yielded the minimum distance.
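A serial C sketch of this nearest-neighbour rule is given below. The function name, the `labels` array mapping each training sample to its class, and the return value of -1 for a rejected face are all hypothetical choices for the example. The sketch compares squared distances against $\theta^2$, which gives the same decision as comparing distances against $\theta$ because both sides are non-negative.

```c
#include <assert.h>

/* Return the class label of the training sample closest to y, or -1 when
 * even the closest sample is not within distance theta.
 * Assumed layout: F holds num_samples feature vectors of length K back to
 * back; labels[i] is the (hypothetical) class id of sample i. */
int classify(const float *F, const int *labels, int num_samples, int K,
             const float *y, float theta)
{
    assert(num_samples > 0 && K > 0 && theta >= 0.0f);
    int best = 0;
    float best_d2 = 0.0f;
    for (int i = 0; i < num_samples; ++i) {
        float d2 = 0.0f;
        for (int k = 0; k < K; ++k) {
            const float diff = F[i * K + k] - y[k];
            d2 += diff * diff;
        }
        if (i == 0 || d2 < best_d2) {
            best = i;
            best_d2 = d2;
        }
    }
    /* d < theta is equivalent to d^2 < theta^2 for non-negative values,
     * so no square root is needed. */
    return (best_d2 < theta * theta) ? labels[best] : -1;
}
```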
3 CPU Implementation
3.1 Database
For our implementation, we use one of the most common databases for testing face recognition solutions, the ORL Face Database (also known as the AT&T Database of Faces). The ORL Database contains 400 grayscale images of resolution 112 × 92, covering 40 subjects with 10 images per subject. The images were taken under varying conditions: at different times, from different angles, with different expressions (happy, angry, surprised, etc.) and with different facial details (with or without spectacles, with or without a beard, different hair styles, etc.). To truly show the power of a GPU, the image database has to be large. We therefore scaled up the ORL Database by copying and concatenating it multiple times to create bigger databases. This allowed us to measure GPU performance for very large databases of up to 15,000 images.
3.2 Training phase
The most time-consuming operation in the training phase is the extraction of feature vectors from the training samples by projecting each of them onto the eigenspace. Computing the eigenfaces and the eigenspace is comparatively cheap, because the surrogate matrix workaround reduces the size of the matrix whose eigenvectors must be found. We therefore decided to parallelize only the projection step of the training process. The steps up to the projection of the training samples are done in MATLAB, and the projection is written in C.
The MATLAB routine reads the images from files, resizes them to a standard 16 × 16 resolution, and computes the eigenfaces and the eigenspace. The data required for the projection step, namely the resized training samples, the database mean image and the eigenvectors, are then dumped into a binary file, which is later read by the C routine to complete the training process. The C routine extracts the feature vectors by normalizing the resized training samples and projecting each of them onto the eigenspace. This completes the training process, and the feature vectors are dumped to a binary file for use in the testing process.
3.3 Testing phase
The entire testing process is written in C. OpenCV is used to read the test images from file and resize them to the standard 16 × 16 resolution using bilinear interpolation. The resized image is normalized with the database mean image and projected onto the eigenspace computed in the training phase. The Euclidean distance between the test image feature vector and each training sample feature vector is computed, and the index of the feature vector yielding the minimum Euclidean distance is found. The face that yielded this feature vector is the most probable match for the input test face.
4 GPU Implementation
4.1 Training phase
4.1.1 Introduction
As mentioned in Section 3.2, only the final feature extraction step of the training phase is parallelized. Before the training samples can be projected, all the data required for the projection is copied to the device's global memory, and the time taken for copying is noted. As all the data is read-only, it is bound to textures to take advantage of the cached texture memory.
4.1.2 Kernel
The projection process is highly parallelizable and can be parallelized in two ways. Threads can be launched to parallelize the computation of a single feature vector, with each thread computing one element of the feature vector. Alternatively, threads can be launched to parallelize the projection of multiple training samples, with each thread projecting one training sample and computing its whole feature vector. Since the number of training samples is large, the latter is adopted for the projection operation in the training phase. We adopted the former in the testing phase, where only one image has to be projected; the details are explained in Section 4.2.
Before the projection kernel is called, the execution configuration is set. The number of threads per block, $T_1$, is set to 256, and the total number of blocks is $B_1 = \lceil N_1 / T_1 \rceil$, where $N_1$ is the total number of training samples.
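The block count is a ceiling division, which can be done in integer arithmetic. The helper below is a small C sketch with a hypothetical name, `blocks_for`.

```c
#include <assert.h>

/* Number of blocks needed so that blocks * threads_per_block >= n,
 * i.e. ceil(n / threads_per_block) computed with integers only. */
int blocks_for(int n, int threads_per_block)
{
    assert(n > 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, 15,000 training samples with 256 threads per block need 59 blocks, which launches 15,104 threads. The 104 surplus threads in the last block have no sample to process, so a kernel launched this way has to check its index against $N_1$ before doing any work.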
Each thread projects one training sample and computes its feature vector serially, one element at a time. Each element of the feature vector is obtained by taking the inner product of the normalized training image vector and the corresponding eigenvector in the eigenspace. The training sample is normalized with the database mean image element by element as it is fetched from texture memory, and the intermediate sum of the inner product with the eigenvector is stored in shared memory. After each element of the feature vector is computed, it is written back to global memory and the next element is computed. All the data is laid out in a columnar fashion to avoid uncoalesced memory accesses and shared memory bank conflicts.
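The per-thread work and the columnar layout can be illustrated with a serial C emulation, in which a loop index stands in for the CUDA thread index. This is a sketch under assumptions, not the authors' kernel: the function names are hypothetical, the columnar layout is taken to mean that element $j$ of sample $i$ sits at `X[j*num_samples + i]`, and texture fetches and shared memory are not modelled.

```c
#include <assert.h>

/* Work of one emulated thread: project training sample `tid`.
 * Assumed columnar layout: element j of sample i is at X[j*num_samples + i],
 * and element k of its feature vector goes to Y[k*num_samples + i], so
 * threads with consecutive ids touch consecutive addresses.
 * S holds K eigenvectors of length N back to back. */
static void project_sample(int tid, const float *X, const float *mean,
                           const float *S, int num_samples, int N, int K,
                           float *Y)
{
    if (tid >= num_samples)
        return;  /* surplus threads in the last block do nothing */
    for (int k = 0; k < K; ++k) {
        float acc = 0.0f;  /* running inner product for feature element k */
        for (int j = 0; j < N; ++j)
            acc += S[k * N + j] * (X[j * num_samples + tid] - mean[j]);
        Y[k * num_samples + tid] = acc;
    }
}

/* Serial stand-in for the kernel launch: visit every block and thread. */
void project_training_set(const float *X, const float *mean, const float *S,
                          int num_samples, int N, int K,
                          int threads_per_block, float *Y)
{
    assert(num_samples > 0 && N > 0 && K > 0 && threads_per_block > 0);
    const int blocks = (num_samples + threads_per_block - 1) / threads_per_block;
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads_per_block; ++t)
            project_sample(b * threads_per_block + t, X, mean, S,
                           num_samples, N, K, Y);
}
```

On a GPU the two outer loops disappear and every call to `project_sample` runs as its own thread. The columnar indexing is what lets neighbouring threads read neighbouring memory locations at each step of the inner loop.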
Figure 1: Threads in Projection Kernel (Training)
After the kernel has finished running on the device, the entire result is copied back to host memory and dumped to a binary file for use in the testing phase.
4.2 Testing Phase
4.2.1 Introduction
The testing process runs entirely on the GPU and is handled by three kernels. The first kernel normalizes the test image, projects it onto the eigenspace and extracts its feature vector. The second kernel computes, in parallel, the Euclidean distances between the feature vector of the test image and those of the training images. The final kernel finds the minimum Euclidean distance and the index of the training sample that yielded it. The resized test image, the database mean image, the eigenvectors and the projected training samples are first copied to device memory, and the test image, mean image and eigenvectors are bound to textures to take advantage of the cached texture memory. Because the projected training samples are comparatively large and the maximum texture memory is limited, they are not bound to a texture.
Figure 2: Recognition pipeline
4.2.2 Projection Kernel
As mentioned in Section 4.1.2, the projection kernel in the testing process is parallelized to compute the elements of the feature vector concurrently. The number of threads per block, $T_2$, is set to 256, and the total number of blocks is $B_2 = \lceil N_2 / T_2 \rceil$, where $N_2$ is the size of the feature vector.
Each thread computes one element of the feature vector, obtained by taking the inner product of the normalized test image vector and the corresponding eigenvector in the eigenspace. The test image is normalized with the database mean image element by element as it is fetched from texture memory, and the intermediate sum of the inner product with the eigenvector is stored in shared memory.
Figure 3: Threads in Projection Kernel (Testing)
After the entire feature vector is computed, the data is written back to global memory. The columnar layout of the eigenvectors avoids uncoalesced memory accesses and shared memory bank conflicts.
4.2.3 Euclidean Distance Kernel
The kernel for computing the Euclidean distances is very similar to the projection kernel used in the training phase. Threads are launched to compute concurrently the Euclidean distances between the test image feature vector and the training sample feature vectors. The number of threads per block, $T_3$, is set to 256, and the total number of blocks is $B_3 = \lceil N_3 / T_3 \rceil$, where $N_3$ is the total number of training samples.
Each thread computes one Euclidean distance serially. The differences between corresponding elements of the two vectors are computed, squared and summed, with the intermediate sum stored in shared memory. After all the Euclidean distances are computed, the data is written from shared memory to global memory. The columnar layout of the training sample feature vectors avoids uncoalesced memory accesses and shared memory bank conflicts.
Figure 4: Threads in Euclidean Distance Kernel
4.2.4 Minimum Kernel
The minimum kernel computes the minimum Euclidean distance and its index. The vector of Euclidean distances is divided into smaller blocks, and each thread serially finds the value and index of the minimum within one block. The kernel is called iteratively with fewer and fewer threads until only one block is left. After the kernel has executed, the minimum value and its index are copied back to host memory. The training sample at the index computed by the kernel is the most probable match for the test image.
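The iterative scheme described above can be emulated serially in C. In this sketch each pass of `min_pass` plays the role of one kernel call and each loop iteration plays the role of one thread. The function names, the chunk length parameter and the in-place reuse of the buffers are assumptions made for the example.

```c
#include <assert.h>

/* One pass (one emulated kernel call): "thread" t scans chunk t of the
 * n candidates and writes that chunk's minimum value and original index
 * to position t. Returns the number of candidates left. */
static int min_pass(const float *val, const int *idx, int n, int chunk,
                    float *out_val, int *out_idx)
{
    const int threads = (n + chunk - 1) / chunk;
    for (int t = 0; t < threads; ++t) {
        const int start = t * chunk;
        const int end = (start + chunk < n) ? start + chunk : n;
        int best = start;
        for (int i = start + 1; i < end; ++i)
            if (val[i] < val[best])
                best = i;
        out_val[t] = val[best];
        out_idx[t] = idx[best];
    }
    return threads;
}

/* Repeat passes with fewer and fewer emulated threads until one candidate
 * is left. val is overwritten; idx is scratch space of length n.
 * Writing results in place is safe here only because the emulation runs
 * the threads in order; a real kernel would need separate input and
 * output buffers. Returns the index of the minimum and stores its value
 * in *min_val. */
int argmin_iterative(float *val, int *idx, int n, int chunk, float *min_val)
{
    assert(n > 0 && chunk > 1);
    for (int i = 0; i < n; ++i)
        idx[i] = i;
    while (n > 1)
        n = min_pass(val, idx, n, chunk, val, idx);
    *min_val = val[0];
    return idx[0];
}
```

With a chunk length of $c$, each pass shrinks the candidate list by a factor of about $c$, so roughly $\log_c n$ passes are needed for $n$ distances.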
5 Performance Statistics
To test the performance of the CPU and GPU, 5 images per subject from the ORL Database were selected as the training set. This set of 200 images was then replicated and concatenated to create databases ranging in size from 1,000 to 15,000 images. The eigenvectors corresponding to the 4 highest eigenvalues per class were selected to form the eigenspace. This led to feature vectors that grew in size as the database grew. Each replicated database was trained on the CPU and on the GPU, the execution times were noted, and the GPU speedup for the training process was calculated.
For testing, one image per subject from the ORL Database was selected. The total time taken by the CPU and by the GPU to test all 40 test images was noted and used to calculate the GPU speedup for the testing process.
To get accurate performance statistics, the training and testing processes were run multiple times on different CPUs and GPUs. The following graphs were plotted from the data obtained in these performance tests. All CPU times are based on single-core performance.
Fig. 5 shows the time taken by three different CPUs to execute the projection process during training.
Figure 5: Training time for different CPUs
Fig. 6 shows the time taken by different NVIDIA GPUs to execute the projection process during training. It includes the time taken for data transfers to and from the device.
Figure 6: Training time for different GPUs
Fig. 7 shows the speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when training databases of varying sizes.
Figure 7: Training Speedup
Fig. 8 shows the total time taken by different CPUs to test 40 images.
Figure 8: Testing using CPU
Fig. 9 shows the total time taken by the GPUs to test 40 images. For this test, the trained database was copied to device memory once and the 40 images were tested one by one. The time includes transferring each test image to the device and reading the match index back from the device.
Figure 9: Testing using GPU
Fig. 10 shows the speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when testing 40 images.
Figure 10: Testing Speedup
Fig. 11 shows the execution time of the recognition pipeline on the CPU for varying database sizes.
Figure 11: Recognition pipeline on CPU
Fig. 12 shows the execution time of the recognition pipeline on the GPU for varying database sizes. It is the time taken to transfer the test image to the device, find the match and transfer its index back to the host.
Figure 12: Recognition Pipeline on GPU
Fig. 13 shows the speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when executing the recognition pipeline.
Figure 13: Recognition Pipeline Speedup
6 Conclusion
The recognition rate of a PCA-based face recognition solution depends heavily on how exhaustive the training samples are: the higher the number of training samples, the higher the recognition rate. But as the number of training samples increases, CPUs become heavily strained and the training process takes several minutes to complete (see Fig. 5). The same process, when run on a GPU, completes in a matter of seconds (see Fig. 6). The highest speedups achieved were 207x for the training process, 330x for the recognition pipeline and 165x for the overall testing process, on the latest GeForce GTX 480 GPU for a database of 15,000 images.
The execution time of the recognition pipeline on the GPU is on the order of a few milliseconds even for very large databases. This allows GPU-based testing to be integrated with real-time video and used for other applications involving large volumes of test images. Our primary purpose in writing this paper is to make clear the large performance gains that can be obtained by developing GPU-based face recognition solutions.
7 Future Works
Our future plans in this field include parallelizing other face recognition algorithms such as LDA (Linear Discriminant Analysis) and replacing the Euclidean-distance-based matching process with a neural-network-based one. We feel that algorithms with a high degree of parallelism, such as neural networks, will benefit immensely from being implemented on the GPU. We are also working on integrating the GPU recognition pipeline with real-time video.
9

More Related Content

DOCX
SVM & MLP on Matlab program
PDF
Background Estimation Using Principal Component Analysis Based on Limited Mem...
PDF
A broad ranging open access journal Fast and efficient online submission Expe...
PDF
Reconstructing the Path of the Object based on Time and Date OCR in Surveilla...
PDF
Medial axis transformation based skeletonzation of image patterns using image...
PDF
Spectral approach to image projection with cubic b spline interpolation
PDF
IRJET- Optimization of Semantic Image Retargeting by using Guided Fusion Network
PDF
Feed forward neural network for sine
SVM & MLP on Matlab program
Background Estimation Using Principal Component Analysis Based on Limited Mem...
A broad ranging open access journal Fast and efficient online submission Expe...
Reconstructing the Path of the Object based on Time and Date OCR in Surveilla...
Medial axis transformation based skeletonzation of image patterns using image...
Spectral approach to image projection with cubic b spline interpolation
IRJET- Optimization of Semantic Image Retargeting by using Guided Fusion Network
Feed forward neural network for sine

What's hot (18)

PPTX
Images in matlab
PDF
IRJET - Hand Gesture Recognition to Perform System Operations
PDF
Content Based Image Retrieval Using 2-D Discrete Wavelet Transform
PDF
An empirical assessment of different kernel functions on the performance of s...
PDF
Electricity Demand Forecasting Using ANN
PDF
Black-box modeling of nonlinear system using evolutionary neural NARX model
PDF
Electricity Demand Forecasting Using Fuzzy-Neural Network
PDF
A complete user adaptive antenna tutorial demonstration. a gui based approach...
PDF
A0270107
PDF
Lecture 2 Introduction to digital image
PDF
BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...
PDF
A Novel Algorithm for Watermarking and Image Encryption
PDF
Comparison of Different Methods for Fusion of Multimodal Medical Images
PDF
Lecture 1 Introduction to image processing
PDF
Lecture 4 Relationship between pixels
PDF
ROI Based Image Compression in Baseline JPEG
PDF
IRJET- Image Classification – Cat and Dog Images
PDF
GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...
Images in matlab
IRJET - Hand Gesture Recognition to Perform System Operations
Content Based Image Retrieval Using 2-D Discrete Wavelet Transform
An empirical assessment of different kernel functions on the performance of s...
Electricity Demand Forecasting Using ANN
Black-box modeling of nonlinear system using evolutionary neural NARX model
Electricity Demand Forecasting Using Fuzzy-Neural Network
A complete user adaptive antenna tutorial demonstration. a gui based approach...
A0270107
Lecture 2 Introduction to digital image
BER Performance of Antenna Array-Based Receiver using Multi-user Detection in...
A Novel Algorithm for Watermarking and Image Encryption
Comparison of Different Methods for Fusion of Multimodal Medical Images
Lecture 1 Introduction to image processing
Lecture 4 Relationship between pixels
ROI Based Image Compression in Baseline JPEG
IRJET- Image Classification – Cat and Dog Images
GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...
Ad

Viewers also liked (16)

PPTX
Back to School Kit for the Reluctant Learner Webinar
DOCX
Curriculum vitae alex ardila
PDF
Image Denoising Using WEAD
PDF
Speckle Reduction in Images with WEAD and WECD
PDF
BX-D – A Business Component & XML Driven Test Automation Framework
PDF
An Improved Hybrid Model for Molecular Image Denoising
PDF
Software Defined Networking – Virtualization of Traffic Engineering
PDF
Focal Cortical Dysplasia Lesion Analysis with Complex Diffusion Approach
PDF
UpnP in Digital Home Networking
PDF
A 1.2V 10-bit 165MSPS Video ADC
PDF
An Effective Design and Verification Methodology for Digital PLL
PDF
A Whitepaper on Hybrid Set-Top-Box
PDF
Advanced Driver Assistance System using FPGA
PDF
Real Time Video Processing in FPGA
PPTX
Introduction Powerpoint - Iraq
PPTX
Women and Religion
Back to School Kit for the Reluctant Learner Webinar
Curriculum vitae alex ardila
Image Denoising Using WEAD
Speckle Reduction in Images with WEAD and WECD
BX-D – A Business Component & XML Driven Test Automation Framework
An Improved Hybrid Model for Molecular Image Denoising
Software Defined Networking – Virtualization of Traffic Engineering
Focal Cortical Dysplasia Lesion Analysis with Complex Diffusion Approach
UpnP in Digital Home Networking
A 1.2V 10-bit 165MSPS Video ADC
An Effective Design and Verification Methodology for Digital PLL
A Whitepaper on Hybrid Set-Top-Box
Advanced Driver Assistance System using FPGA
Real Time Video Processing in FPGA
Introduction Powerpoint - Iraq
Women and Religion
Ad

Similar to CUDA Accelerated Face Recognition (20)

PPTX
Face recogntion Using PCA Algorithm
PPTX
Face Recognition using Eigen Values pptx
PPTX
Face Recognition
PDF
Ijebea14 276
PPTX
PCA Based Face Recognition System
PDF
Application of gaussian filter with principal component analysis
PDF
Application of gaussian filter with principal component analysis
PDF
Solr and Machine Vision - Scott Cote, Lucidworks & Trevor Grant, IBM
PDF
Implementation of Face Recognition in Cloud Vision Using Eigen Faces
PDF
Face Recognition Based on Image Processing in an Advanced Robotic System
PDF
Image Similarity Test Using Eigenface Calculation
PPT
L008.Eigenfaces And Nn Som
PDF
Face Recognition for Different Facial Expressions Using Principal Component a...
PDF
Volume 2-issue-6-2108-2113
PDF
Volume 2-issue-6-2108-2113
PPTX
Automated attendance system based on facial recognition
PDF
Criminal Detection System
PDF
IRJET- Face Recognition of Criminals for Security using Principal Component A...
PDF
Real Time Implementation Of Face Recognition System
Face recogntion Using PCA Algorithm
Face Recognition using Eigen Values pptx
Face Recognition
Ijebea14 276
PCA Based Face Recognition System
Application of gaussian filter with principal component analysis
Application of gaussian filter with principal component analysis
Solr and Machine Vision - Scott Cote, Lucidworks & Trevor Grant, IBM
Implementation of Face Recognition in Cloud Vision Using Eigen Faces
Face Recognition Based on Image Processing in an Advanced Robotic System
Image Similarity Test Using Eigenface Calculation
L008.Eigenfaces And Nn Som
Face Recognition for Different Facial Expressions Using Principal Component a...
Volume 2-issue-6-2108-2113
Volume 2-issue-6-2108-2113
Automated attendance system based on facial recognition
Criminal Detection System
IRJET- Face Recognition of Criminals for Security using Principal Component A...
Real Time Implementation Of Face Recognition System

More from QuEST Global (erstwhile NeST Software) (8)

PDF
High Performance Medical Reconstruction Using Stream Programming Paradigms
PDF
HPC Platform options: Cell BE and GPU
PDF
Real-Time Face Tracking with GPU Acceleration
PDF
Test Optimization Using Adaptive Random Testing Techniques
PDF
Ultra Fast SOM using CUDA
PDF
MR Brain Volume Analysis Using BrainAssist
PDF
A New Generation Software Test Automation Framework – CIVIM
PDF
FaSaT An Interoperable Test Automation Solution
High Performance Medical Reconstruction Using Stream Programming Paradigms
HPC Platform options: Cell BE and GPU
Real-Time Face Tracking with GPU Acceleration
Test Optimization Using Adaptive Random Testing Techniques
Ultra Fast SOM using CUDA
MR Brain Volume Analysis Using BrainAssist
A New Generation Software Test Automation Framework – CIVIM
FaSaT An Interoperable Test Automation Solution

Recently uploaded (20)

PDF
Encapsulation theory and applications.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Electronic commerce courselecture one. Pdf
PDF
cuic standard and advanced reporting.pdf
PDF
Modernizing your data center with Dell and AMD
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Machine learning based COVID-19 study performance prediction
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPT
Teaching material agriculture food technology
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
Encapsulation theory and applications.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Unlocking AI with Model Context Protocol (MCP)
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Understanding_Digital_Forensics_Presentation.pptx
20250228 LYD VKU AI Blended-Learning.pptx
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Electronic commerce courselecture one. Pdf
cuic standard and advanced reporting.pdf
Modernizing your data center with Dell and AMD
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Machine learning based COVID-19 study performance prediction
MYSQL Presentation for SQL database connectivity
Network Security Unit 5.pdf for BCA BBA.
The Rise and Fall of 3GPP – Time for a Sabbatical?
Teaching material agriculture food technology
NewMind AI Weekly Chronicles - August'25 Week I
Spectral efficient network and resource selection model in 5G networks
Dropbox Q2 2025 Financial Results & Investor Presentation

CUDA Accelerated Face Recognition

  • 1. CUDA Accelerated Face Recognition Numaan. A Third Year Undergraduate Department of Electrical Engineering Indian Institute of Technology, Madras, India ee08b044@smail.iitm.ac.in Sibi. A NeST-NVIDIA Centre for GPU Computing NeST, India sibi.a@nestgroup.net July 26, 2010 Abstract This paper presents a study of the efficiency and performance speedup achieved by applying Graph- ics Processing Units for Face Recognition Solutions. We explore one of the possibilities of parallelizing and optimizing a well-known Face Recognition algorithm, Principal Component Analysis (PCA) with Eigen- faces. 1 Introduction In recent years, the Graphics Processing Units (GPU) has been the subject of extensive research and the computation speed of GPUs has been rapidly increas- ing. The computational power of the latest genera- tion of GPUs, measured in Flops1 , is several times that of a high end CPU and for this reason, they are being increasingly used for non-graphics applica- tions or general-purpose computing (GPGPU). Tra- ditionally, this power of the GPUs could only be harnessed through graphics APIs and was primarily used only by professionals familiar with these APIs. 1floating-point operations per second CUDA (Compute Unified Device Architecture) de- veloped by NVIDIA, tries to address this issue by in- troducing a familiar C like development environment to GPGPU programming and allows programmers to launch hundreds of concurrent threads to run on the “massively” parallel NVIDIA GPUs, with very little software overhead. This paper portrays our efforts to use this power to tame a computationally intensive, yet highly parallelizable PCA based algorithm used in face recognition solutions. We developed both CPU serial code and GPU parallel code to compare the ex- ecution time in each case and measure the speed up achieved by the GPU over the CPU. 
2 PCA Theory 2.1 Introduction Principal Component Analysis (PCA) is one of the early and most successful techniques that have been used for face recognition. PCA aims to reduce the dimensionality of data so that it can be economically represented and processed. Information contained in a human face is highly redundant, with each pixel highly correlated to its neighboring pixels and the 1
  • 2. main idea of using PCA for face recognition is to remove this redundancy, and extract the features re- quired for comparison of faces. To increase accuracy of the algorithm, we are using a slightly modified method called Improved PCA, in which the images used for training are grouped into different classes and each class contains multiple images of a single person with different facial expressions. The mathe- matics behind PCA is described in the following sub- sections. 2.2 Mathematics of PCA Firstly, the 2-D facial images are resized using bilin- ear interpolation to reduce the dimension of data and increase the speed of computation. The resized im- age is represented as a 1-D vector by concatenating each row into one long vector . Let’s suppose we have M training samples per class, each of size N (=total pixels in the resized image). Let the training vectors be represented as xi. pj’s represent pixel values. xi = [p1 . . . pN ]T , i = 1, . . . , M The training vectors are then normalized by sub- tracting the class mean image m from them. m = 1 M M i=1 xi Let wi be the normalized images. wi = xi − m The matrix W is composed by placing the normal- ized image vectors wi side by side. The eigenvectors and eigenvalues of the covariance matrix C is com- puted. C = WWT The size of C is N × N which could be enormous. For example, images of size 16 × 16 give a covari- ance of size 256 × 256. It is not practical to solve for eigenvectors of C directly. Hence the eigenvectors of the surrogate matrix WT W of size M × M are computed and the first M −1 eigenvectors and eigen- values of C are given by Wdi and µi, where di and µi are eigenvectors and eigenvalues of the surrogate matrix, respectively. The eigenvectors corresponding to non-zero eigen- values of the covariance matrix make an orthonormal basis for the subspace within which most of the image data can be represented with minimum error. 
The eigenvector associated with the highest eigenvalue reflects the greatest variance in the image. These eigenvectors are known as eigenimages or eigenfaces, and when normalized they look like faces. The eigenvectors of all the classes are computed similarly, and all these eigenvectors are placed side by side to make up the eigenspace S.
The mean image of the entire training set, m, is computed and each training vector $x_i$ is normalized. The normalized training vectors $w_i$ are projected onto the eigenspace S, and the projected feature vectors $y_i$ of the training samples are obtained:
$w_i = x_i - m$
$y_i = S^T w_i$
The simplest method for determining which class the test face falls under is to find the class k that minimizes the Euclidean distance. The test image is projected onto the eigenspace, and the Euclidean distance between the projected test image and each of the projected training samples is computed. If the minimum Euclidean distance falls under a predefined threshold θ, the face is classified as belonging to the class that contains the feature vector yielding that minimum distance.
3 CPU Implementation
3.1 Database
For our implementation, we use one of the most common databases for testing face recognition solutions, the ORL Face Database (also known as the AT&T Database of Faces). The ORL Database
contains 400 grayscale images of resolution 112 × 92 of 40 subjects, with 10 images per subject. The images were taken under varying conditions, such as different times, angles, expressions (happy, angry, surprised, etc.) and facial details (with/without spectacles, with/without beard, different hair styles, etc.). To truly show the power of a GPU, the image database has to be large. So in our implementation we scaled the ORL Database multiple times by copying and concatenating it to create bigger databases. This allowed us to measure GPU performance for very large databases, with as many as 15000 images.
3.2 Training phase
The most time-consuming operation in the training phase is the extraction of feature vectors from the training samples by projecting each of them onto the eigenspace. The computation of the eigenfaces and the eigenspace is relatively less intensive, since we use the surrogate matrix workaround to decrease the size of the matrix and compute the eigenvectors with ease. Hence we decided to parallelize only the projection step of the training process. The steps up to the projection of the training samples are done in MATLAB, and the projection is written in C.
The MATLAB routine acquires the images from files, resizes them to a standard 16 × 16 resolution and computes the eigenfaces and eigenspace. The data required for the projection step, namely the resized training samples, the database mean image and the eigenvectors, are then dumped into a binary file, which is later read by the C routine to complete the training process. The C routine reads the data dumped by the MATLAB routine and extracts the feature vectors by normalizing the resized training samples and projecting each of them onto the eigenspace. With this, the training process is complete, and the feature vectors are dumped into a binary file to be used in the testing process.
3.3 Testing phase
The entire testing process is written in C.
OpenCV is used to acquire the test images from file and resize them to the standard 16 × 16 resolution using bilinear interpolation. The resized image is normalized with the database mean image and projected onto the eigenspace computed in the training phase. The Euclidean distance between the test image feature vector and each training sample feature vector is computed, and the index of the feature vector yielding the minimum Euclidean distance is found. The face that yielded this feature vector is the most probable match for the input test face.
4 GPU Implementation
4.1 Training phase
4.1.1 Introduction
As mentioned in Section 3.2, only the final feature extraction process of the training phase is parallelized. Before the training samples can be projected, all the data required for the projection process is copied to the device's global memory, and the time taken for copying is noted. As all the data are read-only, they are bound as textures to take advantage of the cached texture memory.
4.1.2 Kernel
The projection process is highly parallelizable and can be parallelized in two ways. The threads can be launched to parallelize the computation of a particular feature vector, with each thread computing a single element of the feature vector. Alternatively, the threads can be launched to parallelize the projection of multiple training samples, with each thread projecting a particular training sample and computing its feature vector. Since the number of training samples is large, the latter is adopted for the projection operation in the training phase. We have adopted the former in the testing phase, where only one image has to be projected; the details are explained in Section 4.2.
Before the projection kernel is called, the execution configuration is set. The number of threads per block, $T_1$, is set to a standard 256, and the total number of blocks is $B_1 = \lceil N_1 / T_1 \rceil$, where $N_1$ is the total number of training samples.
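The block count above is a ceiling division, which integer arithmetic can express without floating point. The sketch below is a plain Python model of that execution configuration, not CUDA code, and its function names are hypothetical. The bounds guard on the last block is an assumption on our part: it is the usual way to keep surplus threads idle when the sample count is not a multiple of 256, but the text does not say how the kernel handles this.

```python
def launch_config(num_items, threads_per_block=256):
    """Return (blocks, threads_per_block) with blocks = ceil(num_items / threads_per_block)."""
    if num_items <= 0 or threads_per_block <= 0:
        raise ValueError("num_items and threads_per_block must be positive")
    blocks = (num_items + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

def active_thread_ids(num_items, threads_per_block=256):
    """Yield the global index of every thread that has an item to process."""
    blocks, threads = launch_config(num_items, threads_per_block)
    for block in range(blocks):
        for thread in range(threads):
            # Same formula as blockIdx.x * blockDim.x + threadIdx.x in CUDA.
            index = block * threads + thread
            if index < num_items:  # assumed guard: surplus threads do nothing
                yield index

blocks, threads = launch_config(15000)  # 59 blocks of 256 threads
```

With 15000 training samples this gives 59 blocks, that is 15104 threads, of which the last 104 have no sample to project.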
Each thread projects a particular training sample and computes its feature vector by serially computing the elements of the feature vector one by one. Each element of the feature vector is obtained by taking the inner product of the training image vector and the corresponding eigenvector in the eigenspace. The training sample is normalized with the database mean image, element by element, as it is fetched from texture memory, and the intermediate sum of the inner product with the eigenvector is stored in shared memory. After each element of the feature vector is computed, it is written back to global memory and the next element is computed. All the data is aligned in a columnar fashion to avoid uncoalesced memory accesses and shared memory bank conflicts.
Figure 1: Threads in Projection Kernel (Training)
After the kernel has finished running on the device, the entire data is copied back to host memory and dumped as a binary file to be used in the testing phase.
4.2 Testing Phase
4.2.1 Introduction
The testing process runs entirely on the GPU and is handled by three kernels. The first kernel normalizes the test image, projects it onto the eigenspace and extracts the feature vector. The second kernel computes, in parallel, the Euclidean distances between the feature vector of the test image and those of the training images. The final kernel finds the minimum Euclidean distance and the index of the training sample that yielded it. The resized test image, the database mean image, the eigenvectors and the projected training samples are first copied to device memory, and the test image, mean image and eigenvectors are bound as textures to take advantage of the cached texture memory. Because the projected training samples are relatively large and the maximum texture memory is limited, the projected training samples are not bound as textures.
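As a reading aid, the three stages described above can be written as a serial reference in Python. This is a sketch only, not the CUDA kernels: the function names are hypothetical, `eigenvectors` is assumed to hold one eigenvector per row, and each list comprehension stands for work that the corresponding kernel distributes across threads.

```python
import math

def project(image, mean, eigenvectors):
    """Stage 1: y = S^T (x - m), one inner product per eigenvector."""
    centered = [p - m for p, m in zip(image, mean)]
    return [sum(e * c for e, c in zip(vector, centered))
            for vector in eigenvectors]

def distances(test_features, train_features):
    """Stage 2: Euclidean distance from the test vector to each training vector."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(test_features, row)))
            for row in train_features]

def argmin(values):
    """Stage 3: return (minimum value, index of that minimum)."""
    best = min(range(len(values)), key=values.__getitem__)
    return values[best], best

def recognize(image, mean, eigenvectors, train_features):
    """Run the three stages and return (distance, index) of the closest sample."""
    features = project(image, mean, eigenvectors)
    return argmin(distances(features, train_features))

# Toy data: 4-pixel images, a 2-dimensional eigenspace, 3 training samples.
mean = [1, 1, 1, 1]
eigenvectors = [[1, 0, 0, 0], [0, 1, 0, 0]]
train_features = [[0, 0], [3, 4], [10, 10]]
distance, index = recognize([4, 4, 1, 1], mean, eigenvectors, train_features)
```

Here the test image projects to the feature vector [3, 3], so the second training sample (index 1) is the closest match, at distance 1. A threshold test against θ, as in Section 2.2, could be applied to the returned distance.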
Figure 2: Recognition pipeline
4.2.2 Projection Kernel
As mentioned in Section 4.1.2, the projection kernel in the testing process is parallelized to compute the elements of the feature vector concurrently. The number of threads per block, $T_2$, is set to a standard 256, and the total number of blocks is $B_2 = \lceil N_2 / T_2 \rceil$, where $N_2$ is the size of the feature vector.
Each thread computes one element of the feature vector, which is obtained by taking the inner product of the test image vector and the corresponding eigenvector in the eigenspace. The test image is normalized with the database mean image, element by element, as it is fetched from texture memory, and the intermediate sum of the inner product with the eigenvector is
stored in shared memory.
Figure 3: Threads in Projection Kernel (Testing)
After the entire feature vector is computed, the data is written back to global memory. The columnar alignment of all the eigenvectors avoids uncoalesced memory accesses and shared memory bank conflicts.
4.2.3 Euclidean Distance Kernel
The kernel for computing the Euclidean distance is very similar to the projection kernel used in the training phase. Threads are launched to concurrently compute the Euclidean distances between the test image feature vector and the training sample feature vectors. The number of threads per block, $T_3$, is set to a standard 256, and the total number of blocks is $B_3 = \lceil N_3 / T_3 \rceil$, where $N_3$ is the total number of training samples.
Each thread computes one particular Euclidean distance serially. The differences between corresponding elements of the two vectors are computed, squared and summed. The intermediate sum is stored in shared memory. After all the Euclidean distances are computed, the data in shared memory is written to global memory. The columnar alignment of the training sample feature vectors avoids uncoalesced memory accesses and shared memory bank conflicts.
Figure 4: Threads in Euclidean Distance Kernel
4.2.4 Minimum Kernel
The minimum kernel computes the minimum Euclidean distance and its index. The vector containing the Euclidean distances is divided into smaller blocks, and each thread serially finds the value and index of the minimum in a particular block. The kernel is called iteratively with fewer and fewer threads until only one block is left. After execution of the kernel, the minimum value and its index are copied back to host memory. The training sample at the index computed by the kernel is the most probable match for the test image.
5 Performance Statistics
To test the performance of the CPU and GPU, 5 images per subject of the ORL Database were selected as the training set.
This set of 200 images was then replicated and concatenated to create databases ranging in size from 1000 to 15000 images. The eigenvectors corresponding to the 4 highest eigenvalues per class were selected to form the eigenspace. This led to feature vectors that grew in size as the database grew. Each replicated database was trained on the CPU and on the GPU, the execution times were noted, and the GPU speedup for the training process was calculated.
For testing, one image per subject from the ORL Database was selected, and the total time taken by the CPU and the GPU to test all 40 test images was noted and used to calculate the GPU speedup for the testing process.
To get accurate performance statistics, the training and testing processes were run multiple times on different CPUs and GPUs. The following graphs were plotted with the data obtained from the performance tests. All the CPU times are based on single-core performance.
Fig. 5 shows the time taken by three different CPUs to execute the projection process during training.
Figure 5: Training time for different CPUs
Fig. 6 shows the time taken by different NVIDIA GPUs to execute the projection process during training. It includes the time taken for data transfers to and from the device.
Figure 6: Training time for different GPUs
Fig. 7 shows the performance speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when training databases of varying sizes.
Figure 7: Training Speedup
Fig. 8 shows the total time taken by different CPUs for testing 40 images.
Figure 8: Testing using CPU
Fig. 9 shows the total time taken by GPUs to test 40 images. For this test, the trained database was copied to device memory once and the 40 images were tested one by one. It includes the time taken for transferring each test image to the device and getting the match index from the device.
Figure 9: Testing using GPU
Fig. 10 shows the performance speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when testing 40 images.
Figure 10: Testing Speedup
Fig. 11 shows the execution time of the recognition pipeline on the CPU for varying database sizes.
Figure 11: Recognition pipeline on CPU
Fig. 12 shows the execution time of the recognition pipeline on the GPU for varying database sizes. It is the time taken to transfer the test image to the device, find the match and transfer its index back to the host.
Figure 12: Recognition Pipeline on GPU
Fig. 13 shows the performance speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when executing the recognition pipeline.
Figure 13: Recognition Pipeline Speedup
6 Conclusion
The recognition rate of a PCA-based face recognition solution depends heavily on the exhaustiveness of the training samples: the higher the number of training samples, the higher the recognition rate. But as the number of training samples increases, CPUs become highly strained and the training process takes several minutes to complete (see Fig. 5). The same process, when run on a GPU, completes in a matter of seconds (see Fig. 6). The highest speedups achieved were 207x for the training process, 330x for the recognition pipeline and 165x for the overall testing process, on the latest GeForce GTX 480 GPU, for a database size of 15,000 images.
The execution time of the recognition pipeline on the GPU is on the order of a few milliseconds even for very large databases, which allows GPU-based testing to be integrated with real-time video and used for other applications involving large volumes of test images. Our primary purpose in writing this paper is to make clear the high performance boosts that can be obtained by developing GPU-based face recognition solutions.
7 Future Works
Our future plans in this field include parallelizing other face recognition algorithms, such as LDA (Linear Discriminant Analysis), and replacing the Euclidean distance based matching process with a neural network based one. We feel that algorithms with a high degree of parallelism, like neural networks, will benefit immensely if implemented on the GPU. We are also working on integrating the GPU recognition pipeline with real-time video.