David C. Wyld et al. (Eds) : CSITA, ISPR, ARIN, DMAP, CCSIT, AISC, SIPP, PDCTA, SOEN - 2017
pp. 67–75, 2017. © CS & IT-CSCP 2017. DOI: 10.5121/csit.2017.70107
MULTIMODAL BIOMETRICS
RECOGNITION FROM FACIAL VIDEO VIA
DEEP LEARNING
Sayan Maity, Mohamed Abdel-Mottaleb, and Shihab S. Asfour
University of Miami, 1251 Memorial Drive, Coral Gables, Florida 33146-0620
s.maity1@umail.miami.edu, mottaleb@miami.edu, sasfour@miami.edu
ABSTRACT
Biometrics identification using multiple modalities has attracted the attention of many
researchers as it produces more robust and trustworthy results than single modality biometrics.
In this paper, we present a novel multimodal recognition system that trains a Deep Learning
Network to automatically learn features after extracting multiple biometric modalities from a
single data source, i.e., facial video clips. Utilizing different modalities, i.e., left ear, left profile
face, frontal face, right profile face, and right ear, present in the facial video clips, we train
supervised denoising autoencoders to automatically extract robust and non-redundant features.
The automatically learned features are then used to train modality specific sparse classifiers to
perform the multimodal recognition. Experiments conducted on the constrained facial video
dataset (WVU) and the unconstrained facial video dataset (HONDA/UCSD) resulted in
rank-1 recognition rates of 99.17% and 97.14%, respectively. The multimodal recognition
accuracy demonstrates the superiority and robustness of the proposed approach irrespective of
the illumination, non-planar movement, and pose variations present in the video clips.
KEYWORDS
Multimodal Biometrics, Autoencoder, Deep Learning, Sparse Classification.
1. INTRODUCTION
There are several motivations for building robust multimodal biometric systems that extract
multiple modalities from a single source of biometrics, i.e., facial video clips. Firstly, acquiring
video clips of facial data is straightforward using conventional video cameras, which are
ubiquitous. Secondly, the nature of data collection is non-intrusive and the ear, frontal, and profile
face can appear in the same video. The proposed system, shown in Figure 1, consists of three
distinct components to perform the task of efficient multimodal recognition from facial video
clips. First, the object detection technique proposed by Viola and Jones [1], was adopted for the
automatic detection of modality specific regions from the video frames. Unconstrained facial
video clips contain significant head pose variations due to non-planar movements, and sudden
changes in facial expressions. This results in an uneven number of detected modality specific
video frames for the same subject across different video clips, and also a different number of
modality specific images for different subjects. From the standpoint of building a robust and accurate model, it
is always preferable to use the entire available training data. However, classification through
sparse representation (SRC) is vulnerable to an uneven number of modality specific
training samples across subjects. Thus, to overcome this vulnerability of SRC while still using
all of the detected modality specific regions, in the model building phase we train a supervised
denoising sparse autoencoder to construct a mapping function. This mapping function
automatically extracts discriminative features that remain robust to the variations present in the
unevenly sized sets of detected modality specific regions. Applying the Deep Learning Network
as the second component in the pipeline therefore yields an equal number of training sample
features for each subject. Finally, using the modality specific recognition
results, score level multimodal fusion is performed to obtain the multimodal recognition result.
Fig. 1. System Block Diagram: Multimodal Biometrics Recognition from Facial Video
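To make the three-stage structure of Figure 1 concrete, the pipeline can be outlined as follows. This is a hypothetical sketch in Python; every name is an illustrative placeholder, not the authors' code.

```python
# Hypothetical outline of the Figure 1 pipeline; each callable stands in
# for a component described in the text.
def recognize(video_frames, detectors, encoders, src_classifiers, fuse):
    """Multimodal recognition from one facial video clip."""
    scores = {}
    for modality, detect in detectors.items():
        # 1. Viola-Jones detection of modality specific regions.
        regions = [r for frame in video_frames for r in detect(frame)]
        if not regions:
            continue  # this modality never appeared in the clip
        # 2. Supervised denoising autoencoder maps an uneven number of
        #    detections to robust, fixed-length feature vectors.
        features = [encoders[modality](region) for region in regions]
        # 3. Modality specific classification via sparse representation.
        scores[modality] = src_classifiers[modality](features)
    # Score level fusion of the modality specific results.
    return fuse(scores)
```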
Due to the unavailability of proper datasets for multimodal recognition studies [2], often virtual
multimodal databases are synthetically obtained by pairing modalities of different subjects from
different databases. To the best of our knowledge, the proposed approach is the first study where
multiple modalities are extracted from a single data source that belongs to the same subject. The
main contribution of the proposed approach is the training of a Deep Learning
Network for automatic feature learning in multimodal biometrics recognition from a single
source of biometrics, i.e., facial video data, irrespective of the illumination, non-planar movement,
and pose variations present in the face video clips.
The remainder of this paper is organized as follows: Section 2 details the modality specific frame
detection from the facial video clips. Section 3 describes the automatic feature learning using
supervised denoising sparse autoencoder (deep-learning). Section 4 presents the modality specific
classification using sparse representation and multimodal fusion. Section 5 provides the
experimental results on the constrained facial video dataset (WVU [3]) and the unconstrained
facial video dataset (HONDA/UCSD [4]) to demonstrate the performance of the proposed
framework. Finally, conclusions and future research directions are presented in Section 6.
2. MODALITY SPECIFIC IMAGE FRAME DETECTION
To perform multimodal biometric recognition, we first need to detect the images of the different
modalities from the facial video. The facial video clips in the constrained dataset are collected in
a controlled environment, where the camera rotates around the subject's head. The video
sequences start with the left profile of each subject (0 degrees) and proceed to the right profile
(180 degrees). Each of these video sequences contains image frames of different modalities, e.g.,
left ear, left profile face, frontal face, right profile face, and right ear, in that order. The video
sequences in the unconstrained dataset contain uncontrolled and nonuniform head rotations and
changing facial expressions. Thus, the appearance of a specific modality in a certain frame of the
unconstrained video clip is random compared with the constrained video clips.
To automate the detection of the modality specific image frames, we adopt the AdaBoost object
detection technique proposed by Viola and Jones [1] and train it to detect each of the modalities
that appear in the facial video clips. The detector is trained to locate frontal and profile faces
using manually cropped frontal face images from the color FERET database and profile face
images from the University of Notre Dame Collection J2 database, respectively. It is also trained
with cropped ear images from the UND color ear database to detect ears in the video frames.
These modality specific trained detectors are then applied to the entire video sequence to detect
the face and ear regions in each frame.
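As an illustration, the detection step can be sketched with OpenCV's Viola-Jones cascade implementation. The cascade file names below are hypothetical stand-ins for the detectors trained on the FERET and UND data; only cv2.CascadeClassifier and detectMultiScale are real OpenCV APIs.

```python
import cv2

# Modality specific cascades; the XML file names are placeholders for
# detectors trained as described above.
cascades = {
    "frontal_face": cv2.CascadeClassifier("frontal_face_cascade.xml"),
    "profile_face": cv2.CascadeClassifier("profile_face_cascade.xml"),
    "ear": cv2.CascadeClassifier("ear_cascade.xml"),
}

def detect_modalities(frame):
    """Return {modality: list of cropped grayscale regions} for one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    regions = {}
    for name, cascade in cascades.items():
        boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        regions[name] = [gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]
    return regions
```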
Before using the detected modality specific regions from the video frames for extracting features,
some preprocessing steps are performed. The facial video clips recorded in the unconstrained
environment contain variations in illumination and low contrast. Histogram equalization is
performed to enhance the contrast of the images. Finally, all detected modality specific regions
from the facial video clips are resized: ear images to 110 × 70 pixels and face images (frontal
and profile) to 128 × 128 pixels.
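A minimal sketch of this preprocessing, assuming the ear crops are 110 pixels high by 70 wide (the orientation of the quoted sizes is an assumption):

```python
import cv2

def preprocess(region, modality):
    """Histogram-equalize and resize one detected grayscale region."""
    equalized = cv2.equalizeHist(region)  # expects an 8-bit grayscale image
    # cv2.resize takes (width, height): 70 x 110 for ears, 128 x 128 for faces.
    size = (70, 110) if modality == "ear" else (128, 128)
    return cv2.resize(equalized, size)
```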
3. AUTOMATIC FEATURE LEARNING USING DEEP NEURAL NETWORK
Even though the modality specific sparse classifiers achieve relatively high recognition accuracy
on the constrained face video clips, the accuracy suffers for unconstrained video because
the sparse classifier is vulnerable to bias in the number of training images from different
subjects. For example, subjects in the HONDA/UCSD dataset [4] randomly change their head
pose. This results in a nonuniform number of detected modality specific video frames across
different video clips, which is not ideal to perform classification through sparse representation.
In the subsequent sections, we first describe the Gabor feature extraction technique. Then, we
describe the supervised denoising sparse autoencoders, which we use to automatically learn an
equal number of feature vectors for each subject from the uneven number of modality specific detected
regions.
3.1 Feature Extraction
2D Gabor filters [5] are used in a broad range of applications to extract scale and rotation invariant
feature vectors. In our feature extraction step, uniform down-sampled Gabor wavelets are
computed for the detected regions:
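Following [5], the Gabor kernels take the standard form below; the scale and orientation counts are the usual choices and are assumed here:

```latex
\psi_{u,v}(z) = \frac{\lVert k_{u,v}\rVert^{2}}{\sigma^{2}}
\exp\!\left(-\frac{\lVert k_{u,v}\rVert^{2}\lVert z\rVert^{2}}{2\sigma^{2}}\right)
\left[\exp\!\left(i\,k_{u,v}\cdot z\right)-\exp\!\left(-\frac{\sigma^{2}}{2}\right)\right],
\quad
k_{u,v}=k_{v}\,e^{i\phi_{u}},\;
k_{v}=\frac{k_{\max}}{f^{v}},\;
\phi_{u}=\frac{\pi u}{8},
```

where z is the pixel coordinate, v indexes the scale (typically v = 0, ..., 4) and u the orientation (u = 0, ..., 7); the magnitudes of the 40 filter responses are uniformly down-sampled and concatenated to form the feature vector.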
3.2 Supervised Stacked Denoising Auto-encoder
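As a hedged illustration of the idea, one layer of a supervised denoising autoencoder in the spirit of [7] can be sketched in PyTorch as follows; the layer sizes, sigmoid nonlinearities, and the exact form of the supervised term are assumptions, not the authors' specification.

```python
import torch
import torch.nn as nn

class SupervisedDenoisingAE(nn.Module):
    """One autoencoder layer: maps a pose/illumination 'corrupted' variant
    of a subject toward that subject's canonical sample."""
    def __init__(self, in_dim=4160, hidden_dim=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x_variant):
        h = self.encoder(x_variant)   # robust hidden code
        return self.decoder(h), h     # reconstruction and code

def supervised_dae_loss(x_hat, x_canonical, h, h_same, h_diff, alpha=0.1):
    # Denoising term: the variant should reconstruct the canonical sample.
    rec = nn.functional.mse_loss(x_hat, x_canonical)
    # Supervised term (a simple contrastive-style surrogate): codes of the
    # same subject are pulled together, different subjects pushed apart.
    pull = (h - h_same).pow(2).mean()
    push = torch.relu(1.0 - (h - h_diff).pow(2).mean())
    return rec + alpha * (pull + push)
```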
3.3 Training the Deep Learning Network
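Stacked denoising autoencoders are conventionally trained greedily, one layer at a time; below is a sketch of such a loop, reusing the SupervisedDenoisingAE and loss from the sketch above (the optimizer, schedule, and batch format are assumptions):

```python
import torch

def train_stacked(layers, data_loader, epochs=30, lr=1e-3):
    """Greedy layer-wise training: layer k is fit on the codes produced
    by the already-trained layers 0..k-1."""
    for k, layer in enumerate(layers):
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        for _ in range(epochs):
            for x_var, x_can, x_same, x_diff in data_loader:
                with torch.no_grad():
                    for prev in layers[:k]:   # frozen earlier layers
                        x_var = prev.encoder(x_var)
                        x_can = prev.encoder(x_can)
                        x_same = prev.encoder(x_same)
                        x_diff = prev.encoder(x_diff)
                x_hat, h = layer(x_var)
                _, h_same = layer(x_same)
                _, h_diff = layer(x_diff)
                loss = supervised_dae_loss(x_hat, x_can, h, h_same, h_diff)
                opt.zero_grad()
                loss.backward()
                opt.step()
```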
4. MODALITY SPECIFIC AND MULTIMODAL RECOGNITION
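The modality specific classifiers use classification through sparse representation (SRC), as described in Sections 1 and 3. A minimal SRC sketch, using scikit-learn's Lasso as a surrogate for the l1-minimization (the dictionary layout and residual rule follow the standard SRC recipe, not necessarily the authors' exact implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(y, A, labels, alpha=0.01):
    """y: (d,) probe feature; A: (d, n) dictionary whose columns are
    L2-normalized training features; labels: (n,) subject id per column.
    Returns the predicted id and per-class residuals (usable as scores)."""
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    coder.fit(A, y)   # solves min ||y - Ax||^2 + alpha * ||x||_1
    x = coder.coef_
    residuals = {}
    for c in np.unique(labels):
        x_c = np.where(labels == c, x, 0.0)   # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - A @ x_c)
    return min(residuals, key=residuals.get), residuals
```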
4.1 Multimodal Recognition
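The modality specific results are then combined by score level fusion (see Section 1 and [8]). A minimal sketch, assuming min-max normalized residuals and a weighted-sum rule; the paper's exact fusion rule is not reproduced here:

```python
def fuse_scores(residuals_per_modality, weights=None):
    """residuals_per_modality: {modality: {subject_id: residual}}.
    Smaller residuals mean better matches; the fused minimum wins."""
    modalities = list(residuals_per_modality)
    classes = sorted(next(iter(residuals_per_modality.values())))
    weights = weights or {m: 1.0 / len(modalities) for m in modalities}
    fused = {c: 0.0 for c in classes}
    for m in modalities:
        r = residuals_per_modality[m]
        lo, hi = min(r.values()), max(r.values())
        for c in classes:   # min-max normalize, then weighted sum
            fused[c] += weights[m] * (r[c] - lo) / (hi - lo + 1e-12)
    return min(fused, key=fused.get)
```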
5. EXPERIMENTAL RESULTS
In this section we describe the results of the modality specific and multimodal recognition
experiments on both datasets. The feature vectors automatically learned by the trained Deep
Learning network have length 9600 for the frontal and profile face modalities and 4160 for the
ear modality. To decrease the computational complexity and to find the feature vector length
that maximizes recognition accuracy, the dimensionality of the feature vectors is reduced using
Principal Component Analysis (PCA) [9]. Using PCA, the number of features is reduced to 500
and 1000. Table 1 shows the modality specific recognition accuracy obtained with the reduced
feature vectors of length 500 and 1000; length 1000 resulted in the best accuracy for both the
modality specific and the multimodal recognition.
Table 1. Modality Specific and Multimodal Rank-1 Recognition Accuracy
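The reduction step can be sketched with scikit-learn's PCA (the data matrix below is a random placeholder for the learned face features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1200, 9600))   # placeholder for learned features
X_test = rng.normal(size=(300, 9600))

pca = PCA(n_components=1000)              # 1000 gave the best accuracy above
X_train_red = pca.fit_transform(X_train)  # fit on training features only
X_test_red = pca.transform(X_test)
```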
Table 2 compares the best rank-1 recognition rates, obtained using the ear, frontal face, and
profile face modalities for multimodal recognition, with the results reported in [10-12].
Table 2. Comparison of 2D multimodal (frontal face, profile face and ear) rank-1 recognition accuracy with
the state-of-the-art techniques
6. CONCLUSION
We proposed a system for multimodal recognition using a single biometrics data source, i.e.,
facial video clips. Using the AdaBoost detector, we automatically detect modality specific
regions. We use Gabor features extracted from the detected regions to automatically learn robust
and non-redundant features by training a Supervised Stacked Denoising Auto-encoder (Deep
Learning) network. Classification through sparse representation is performed for each modality,
and the final multimodal recognition result is obtained by fusing the modality specific results.
REFERENCES
[1] Viola, P. and Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
Computer Vision and Pattern Recognition, pp. 511–518 (2001).
[2] Zengxi Huang, Yiguang Liu, Chunguang Li, Menglong Yang and Liping Chen: A robust face and
ear based multimodal biometric system using sparse representation. In: Pattern Recognition,
pp. 2156–2168 (2013).
[3] Gamal Fahmy, Ahmed El-Sherbeeny, Susmita M., Mohamed Abdel-Mottaleb and Hany Ammar:
The effect of lighting direction/condition on the performance of face recognition algorithms. In:
SPIE Conference on Biometrics for Human Identification, pp. 188–200 (2006).
[4] K. C. Lee, J. Ho, M. H. Yang and D. Kriegman: Visual tracking and recognition using probabilistic
appearance manifolds. In: Computer Vision and Image Understanding (2005).
[5] Chengjun Liu and Wechsler, H.: Gabor feature based classification using the enhanced Fisher linear
discriminant model for face recognition. In: IEEE Transactions on Image Processing,
pp. 467–476 (2002).
[6] Rumelhart, David E. and McClelland, James L.: Parallel Distributed Processing: Explorations in the
Microstructure of Cognition. MIT Press, Cambridge, MA, USA (1986).
[7] Shenghua Gao, Yuting Zhang, Kui Jia, Jiwen Lu and Yingying Zhang: Single sample face
recognition via learning deep supervised autoencoders. In: IEEE Transactions on Information
Forensics and Security, pp. 2108–2118 (2015).
[8] Ross, A. A., Nandakumar, K. and Jain, A. K.: Handbook of Multibiometrics. Springer (2006).
[9] Turk, Matthew and Pentland, Alex: Eigenfaces for recognition. In: J. Cognitive Neuroscience, MIT
Press, pp. 71–86 (1991).
[10] Nazmeen Bibi Boodoo and R. K. Subramanian: Robust Multi biometric Recognition Using Face and
Ear Images. In: CoRR (2009).
[11] Dakshina Ranjan Kisku, Jamuna Kanta Sing and Phalguni Gupta: Multibiometrics Belief Fusion.
In: CoRR (2010).
[12] Xiuqin Pan, Yongcun Cao, Xiaona Xu, Yong Lu and Yue Zhao: Ear and face based multimodal
recognition based on KFDA. In: International Conference on Audio, Language and Image
Processing, pp. 965–969 (2008).
