David C. Wyld et al. (Eds) : CSITA, ISPR, ARIN, DMAP, CCSIT, AISC, SIPP, PDCTA, SOEN - 2017
pp. 67–75, 2017. © CS & IT-CSCP 2017. DOI: 10.5121/csit.2017.70107
MULTIMODAL BIOMETRICS
RECOGNITION FROM FACIAL VIDEO VIA
DEEP LEARNING
Sayan Maity, Mohamed Abdel-Mottaleb, and Shihab S. Asfour
University of Miami, 1251 Memorial Drive, Coral Gables, Florida 33146-0620
s.maity1@umail.miami.edu, mottaleb@miami.edu, sasfour@miami.edu
ABSTRACT
Biometrics identification using multiple modalities has attracted the attention of many
researchers as it produces more robust and trustworthy results than single modality biometrics.
In this paper, we present a novel multimodal recognition system that trains a Deep Learning
Network to automatically learn features after extracting multiple biometric modalities from a
single data source, i.e., facial video clips. Utilizing different modalities, i.e., left ear, left profile
face, frontal face, right profile face, and right ear, present in the facial video clips, we train
supervised denoising autoencoders to automatically extract robust and non-redundant features.
The automatically learned features are then used to train modality specific sparse classifiers to
perform the multimodal recognition. Experiments conducted on the constrained facial video
dataset (WVU) and the unconstrained facial video dataset (HONDA/UCSD) resulted in
rank-1 recognition rates of 99.17% and 97.14%, respectively. The multimodal recognition
accuracy demonstrates the superiority and robustness of the proposed approach irrespective of
the illumination, non-planar movement, and pose variations present in the video clips.
KEYWORDS
Multimodal Biometrics, Autoencoder, Deep Learning, Sparse Classification.
1. INTRODUCTION
There are several motivations for building robust multimodal biometric systems that extract
multiple modalities from a single source of biometrics, i.e., facial video clips. Firstly, acquiring
video clips of facial data is straightforward using conventional video cameras, which are
ubiquitous. Secondly, the nature of data collection is non-intrusive and the ear, frontal, and profile
face can appear in the same video. The proposed system, shown in Figure 1, consists of three
distinct components to perform the task of efficient multimodal recognition from facial video
clips. First, the object detection technique proposed by Viola and Jones [1], was adopted for the
automatic detection of modality specific regions from the video frames. Unconstrained facial
video clips contain significant head pose variations due to non-planar movements, and sudden
changes in facial expressions. This results in an uneven number of detected modality specific
video frames for the same subject across different video clips, and also a different number of
modality specific images for different subjects. From the standpoint of building a robust and accurate model, it
is always preferable to use the entire available training data. However, classification through
sparse representation (SRC) is vulnerable to an uneven number of modality specific
training samples across subjects. Thus, to overcome this vulnerability of SRC while still using
all of the detected modality specific regions, in the model building phase we train a supervised
denoising sparse autoencoder to construct a mapping function. This mapping function
automatically extracts discriminative features that remain robust to the variations present in the
unevenly sized sets of detected modality specific regions. Applying the Deep Learning Network
as the second component in the pipeline therefore yields an equal number of training sample
features for each subject. Finally, using the modality specific recognition
results, score level multimodal fusion is performed to obtain the multimodal recognition result.
Fig. 1. System Block Diagram: Multimodal Biometrics Recognition from Facial Video
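To make the three-stage structure of Figure 1 concrete, the pipeline can be outlined as follows. This is a hypothetical sketch in Python; every name is an illustrative placeholder, not the authors' code.

```python
# Hypothetical outline of the Figure 1 pipeline; each callable stands in
# for a component described in the text.
def recognize(video_frames, detectors, encoders, src_classifiers, fuse):
    """Multimodal recognition from one facial video clip."""
    scores = {}
    for modality, detect in detectors.items():
        # 1. Viola-Jones detection of modality specific regions.
        regions = [r for frame in video_frames for r in detect(frame)]
        if not regions:
            continue  # this modality never appeared in the clip
        # 2. Supervised denoising autoencoder maps an uneven number of
        #    detections to robust, fixed-length feature vectors.
        features = [encoders[modality](region) for region in regions]
        # 3. Modality specific classification via sparse representation.
        scores[modality] = src_classifiers[modality](features)
    # Score level fusion of the modality specific results.
    return fuse(scores)
```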
Due to the unavailability of proper datasets for multimodal recognition studies [2], often virtual
multimodal databases are synthetically obtained by pairing modalities of different subjects from
different databases. To the best of our knowledge, the proposed approach is the first study where
multiple modalities are extracted from a single data source that belongs to the same subject. The
main contribution of the proposed approach is the training of a Deep Learning
Network for automatic feature learning in multimodal biometrics recognition from a single
source of biometrics, i.e., facial video data, irrespective of the illumination, non-planar movement,
and pose variations present in the face video clips.
The remainder of this paper is organized as follows: Section 2 details the modality specific frame
detection from the facial video clips. Section 3 describes the automatic feature learning using
supervised denoising sparse autoencoder (deep-learning). Section 4 presents the modality specific
classification using sparse representation and multimodal fusion. Section 5 provides the
experimental results on the constrained facial video dataset (WVU [3]) and the unconstrained
facial video dataset (HONDA/UCSD [4]) to demonstrate the performance of the proposed
framework. Finally, conclusions and future research directions are presented in Section 6.
2. MODALITY SPECIFIC IMAGE FRAME DETECTION
To perform multimodal biometric recognition, we first need to detect the images of the different
modalities from the facial video. The facial video clips in the constrained dataset are collected in
a controlled environment, where the camera rotates around the subject's head. The video
sequences start with the left profile of each subject (0 degrees) and proceed to the right profile
(180 degrees). Each of these video sequences contains image frames of different modalities, e.g.,
left ear, left profile face, frontal face, right profile face, and right ear, in that order. The video
sequences in the unconstrained dataset contain uncontrolled and nonuniform head rotations and
changing facial expressions. Thus, the appearance of a specific modality in a certain frame of the
unconstrained video clip is random compared with the constrained video clips.
To automate the detection of the modality specific image frames, we adopt the AdaBoost object
detection technique proposed by Viola and Jones [1] and train it to detect each of the modalities
that appear in the facial video clips. The detector is trained to locate frontal and profile faces
using manually cropped frontal face images from the color FERET database and profile face
images from the University of Notre Dame Collection J2 database, respectively. It is also trained
with cropped ear images from the UND color ear database to detect ears in the video frames.
These modality specific trained detectors are then applied to the entire video sequence to detect
the face and ear regions in each frame.
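As an illustration, the detection step can be sketched with OpenCV's Viola-Jones cascade implementation. The cascade file names below are hypothetical stand-ins for the detectors trained on the FERET and UND data; only cv2.CascadeClassifier and detectMultiScale are real OpenCV APIs.

```python
import cv2

# Modality specific cascades; the XML file names are placeholders for
# detectors trained as described above.
cascades = {
    "frontal_face": cv2.CascadeClassifier("frontal_face_cascade.xml"),
    "profile_face": cv2.CascadeClassifier("profile_face_cascade.xml"),
    "ear": cv2.CascadeClassifier("ear_cascade.xml"),
}

def detect_modalities(frame):
    """Return {modality: list of cropped grayscale regions} for one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    regions = {}
    for name, cascade in cascades.items():
        boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        regions[name] = [gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]
    return regions
```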
Before using the detected modality specific regions from the video frames for extracting features,
some preprocessing steps are performed. The facial video clips recorded in the unconstrained
environment contain variations in illumination and low contrast. Histogram equalization is
performed to enhance the contrast of the images. Finally, all detected modality specific regions
from the facial video clips are resized: ear images to 110 × 70 pixels and face images (frontal
and profile) to 128 × 128 pixels.
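A minimal sketch of this preprocessing, assuming the ear crops are 110 pixels high by 70 wide (the orientation of the quoted sizes is an assumption):

```python
import cv2

def preprocess(region, modality):
    """Histogram-equalize and resize one detected grayscale region."""
    equalized = cv2.equalizeHist(region)  # expects an 8-bit grayscale image
    # cv2.resize takes (width, height): 70 x 110 for ears, 128 x 128 for faces.
    size = (70, 110) if modality == "ear" else (128, 128)
    return cv2.resize(equalized, size)
```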
3. AUTOMATIC FEATURE LEARNING USING DEEP NEURAL NETWORK
Even though the modality specific sparse classifiers achieve relatively high recognition accuracy
on the constrained face video clips, the accuracy suffers for unconstrained video because
the sparse classifier is vulnerable to bias in the number of training images from different
subjects. For example, subjects in the HONDA/UCSD dataset [4] randomly change their head
pose. This results in a nonuniform number of detected modality specific video frames across
different video clips, which is not ideal to perform classification through sparse representation.
In the subsequent sections, we first describe the Gabor feature extraction technique. Then, we
describe the supervised denoising sparse autoencoders, which we use to automatically learn an
equal number of feature vectors for each subject from the uneven number of modality specific detected
regions.
3.1 Feature Extraction
2D Gabor filters [5] are used in a broad range of applications to extract scale and rotation invariant
feature vectors. In our feature extraction step, uniform down-sampled Gabor wavelets are
computed for the detected regions:
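Following [5], the Gabor kernels take the standard form below; the scale and orientation counts are the usual choices and are assumed here:

```latex
\psi_{u,v}(z) = \frac{\lVert k_{u,v}\rVert^{2}}{\sigma^{2}}
\exp\!\left(-\frac{\lVert k_{u,v}\rVert^{2}\lVert z\rVert^{2}}{2\sigma^{2}}\right)
\left[\exp\!\left(i\,k_{u,v}\cdot z\right)-\exp\!\left(-\frac{\sigma^{2}}{2}\right)\right],
\quad
k_{u,v}=k_{v}\,e^{i\phi_{u}},\;
k_{v}=\frac{k_{\max}}{f^{v}},\;
\phi_{u}=\frac{\pi u}{8},
```

where z is the pixel coordinate, v indexes the scale (typically v = 0, ..., 4) and u the orientation (u = 0, ..., 7); the magnitudes of the 40 filter responses are uniformly down-sampled and concatenated to form the feature vector.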
3.2 Supervised Stacked Denoising Auto-encoder
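As a hedged illustration of the idea, one layer of a supervised denoising autoencoder in the spirit of [7] can be sketched in PyTorch as follows; the layer sizes, sigmoid nonlinearities, and the exact form of the supervised term are assumptions, not the authors' specification.

```python
import torch
import torch.nn as nn

class SupervisedDenoisingAE(nn.Module):
    """One autoencoder layer: maps a pose/illumination 'corrupted' variant
    of a subject toward that subject's canonical sample."""
    def __init__(self, in_dim=4160, hidden_dim=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x_variant):
        h = self.encoder(x_variant)   # robust hidden code
        return self.decoder(h), h     # reconstruction and code

def supervised_dae_loss(x_hat, x_canonical, h, h_same, h_diff, alpha=0.1):
    # Denoising term: the variant should reconstruct the canonical sample.
    rec = nn.functional.mse_loss(x_hat, x_canonical)
    # Supervised term (a simple contrastive-style surrogate): codes of the
    # same subject are pulled together, different subjects pushed apart.
    pull = (h - h_same).pow(2).mean()
    push = torch.relu(1.0 - (h - h_diff).pow(2).mean())
    return rec + alpha * (pull + push)
```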
3.3 Training the Deep Learning Network
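Stacked denoising autoencoders are conventionally trained greedily, one layer at a time; below is a sketch of such a loop, reusing the SupervisedDenoisingAE and loss from the sketch above (the optimizer, schedule, and batch format are assumptions):

```python
import torch

def train_stacked(layers, data_loader, epochs=30, lr=1e-3):
    """Greedy layer-wise training: layer k is fit on the codes produced
    by the already-trained layers 0..k-1."""
    for k, layer in enumerate(layers):
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        for _ in range(epochs):
            for x_var, x_can, x_same, x_diff in data_loader:
                with torch.no_grad():
                    for prev in layers[:k]:   # frozen earlier layers
                        x_var = prev.encoder(x_var)
                        x_can = prev.encoder(x_can)
                        x_same = prev.encoder(x_same)
                        x_diff = prev.encoder(x_diff)
                x_hat, h = layer(x_var)
                _, h_same = layer(x_same)
                _, h_diff = layer(x_diff)
                loss = supervised_dae_loss(x_hat, x_can, h, h_same, h_diff)
                opt.zero_grad()
                loss.backward()
                opt.step()
```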
4. MODALITY SPECIFIC AND MULTIMODAL RECOGNITION
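The modality specific classifiers use classification through sparse representation (SRC), as described in Sections 1 and 3. A minimal SRC sketch, using scikit-learn's Lasso as a surrogate for the l1-minimization (the dictionary layout and residual rule follow the standard SRC recipe, not necessarily the authors' exact implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(y, A, labels, alpha=0.01):
    """y: (d,) probe feature; A: (d, n) dictionary whose columns are
    L2-normalized training features; labels: (n,) subject id per column.
    Returns the predicted id and per-class residuals (usable as scores)."""
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    coder.fit(A, y)   # solves min ||y - Ax||^2 + alpha * ||x||_1
    x = coder.coef_
    residuals = {}
    for c in np.unique(labels):
        x_c = np.where(labels == c, x, 0.0)   # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - A @ x_c)
    return min(residuals, key=residuals.get), residuals
```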
4.1 Multimodal Recognition
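The modality specific results are then combined by score level fusion (see Section 1 and [8]). A minimal sketch, assuming min-max normalized residuals and a weighted-sum rule; the paper's exact fusion rule is not reproduced here:

```python
def fuse_scores(residuals_per_modality, weights=None):
    """residuals_per_modality: {modality: {subject_id: residual}}.
    Smaller residuals mean better matches; the fused minimum wins."""
    modalities = list(residuals_per_modality)
    classes = sorted(next(iter(residuals_per_modality.values())))
    weights = weights or {m: 1.0 / len(modalities) for m in modalities}
    fused = {c: 0.0 for c in classes}
    for m in modalities:
        r = residuals_per_modality[m]
        lo, hi = min(r.values()), max(r.values())
        for c in classes:   # min-max normalize, then weighted sum
            fused[c] += weights[m] * (r[c] - lo) / (hi - lo + 1e-12)
    return min(fused, key=fused.get)
```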
5. EXPERIMENTAL RESULTS
In this section we describe the results of the modality specific and multimodal recognition
experiments on both datasets. The feature vectors automatically learned by the trained Deep
Learning network have length 9600 for the frontal and profile face modalities and 4160 for the
ear modality. To decrease the computational complexity and to find the feature vector length
that maximizes recognition accuracy, the dimensionality of the feature vectors is reduced using
Principal Component Analysis (PCA) [9]. Using PCA, the number of features is reduced to 500
and 1000. Table 1 shows the modality specific recognition accuracy obtained with the reduced
feature vectors of length 500 and 1000; length 1000 resulted in the best accuracy for both the
modality specific and the multimodal recognition.
Table 1. Modality Specific and Multimodal Rank-1 Recognition Accuracy
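The reduction step can be sketched with scikit-learn's PCA (the data matrix below is a random placeholder for the learned face features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1200, 9600))   # placeholder for learned features
X_test = rng.normal(size=(300, 9600))

pca = PCA(n_components=1000)              # 1000 gave the best accuracy above
X_train_red = pca.fit_transform(X_train)  # fit on training features only
X_test_red = pca.transform(X_test)
```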
Table 2 compares the best rank-1 recognition rates, obtained using the ear, frontal face, and
profile face modalities for multimodal recognition, with the results reported in [10-12].
Table 2. Comparison of 2D multimodal (frontal face, profile face and ear) rank-1 recognition accuracy with
the state-of-the-art techniques
6. CONCLUSION
We proposed a system for multimodal recognition using a single biometrics data source, i.e.,
facial video clips. Using the AdaBoost detector, we automatically detect modality specific
regions. We use Gabor features extracted from the detected regions to automatically learn robust
and non-redundant features by training a Supervised Stacked Denoising Auto-encoder (Deep
Learning) network. Classification through sparse representation is performed for each modality,
and the final multimodal recognition result is obtained by fusing the modality specific results.
REFERENCES
[1] Viola, P. and Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
Computer Vision and Pattern Recognition, pp. 511–518 (2001).
[2] Zengxi Huang, Yiguang Liu, Chunguang Li, Menglong Yang and Liping Chen: A robust face and
ear based multimodal biometric system using sparse representation. In: Pattern Recognition,
pp. 2156–2168 (2013).
[3] Gamal Fahmy, Ahmed El-Sherbeeny, Susmita M., Mohamed Abdel-Mottaleb and Hany Ammar:
The effect of lighting direction/condition on the performance of face recognition algorithms. In:
SPIE Conference on Biometrics for Human Identification, pp. 188–200 (2006).
[4] K. C. Lee, J. Ho, M. H. Yang and D. Kriegman: Visual tracking and recognition using probabilistic
appearance manifolds. In: Computer Vision and Image Understanding (2005).
[5] Chengjun Liu and Wechsler, H.: Gabor feature based classification using the enhanced Fisher linear
discriminant model for face recognition. In: IEEE Transactions on Image Processing,
pp. 467–476 (2002).
[6] Rumelhart, David E. and McClelland, James L.: Parallel Distributed Processing: Explorations in the
Microstructure of Cognition. MIT Press, Cambridge, MA, USA (1986).
[7] Shenghua Gao, Yuting Zhang, Kui Jia, Jiwen Lu and Yingying Zhang: Single sample face
recognition via learning deep supervised autoencoders. In: IEEE Transactions on Information
Forensics and Security, pp. 2108–2118 (2015).
[8] Ross, A. A., Nandakumar, K. and Jain, A. K.: Handbook of Multibiometrics. Springer (2006).
[9] Turk, Matthew and Pentland, Alex: Eigenfaces for recognition. In: J. Cognitive Neuroscience, MIT
Press, pp. 71–86 (1991).
[10] Nazmeen Bibi Boodoo and R. K. Subramanian: Robust Multi biometric Recognition Using Face and
Ear Images. In: CoRR (2009).
[11] Dakshina Ranjan Kisku, Jamuna Kanta Sing and Phalguni Gupta: Multibiometrics Belief Fusion.
In: CoRR (2010).
[12] Xiuqin Pan, Yongcun Cao, Xiaona Xu, Yong Lu and Yue Zhao: Ear and face based multimodal
recognition based on KFDA. In: International Conference on Audio, Language and Image
Processing, pp. 965–969 (2008).
