The document describes a proposed method for facial expression recognition in videos using 3D convolutional neural networks and long short-term memory (LSTM) units. Specifically, it proposes a 3D Inception-ResNet architecture that extracts both spatial and temporal features from video sequences. Facial landmarks are incorporated during training, where they are used to generate filters that emphasize important facial components. Finally, an LSTM unit further extracts temporal information from the enhanced feature maps produced by the 3D Inception-ResNet layers. The proposed method is evaluated on several facial expression databases and is shown to outperform state-of-the-art methods.
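The landmark-based emphasis step can be illustrated with a small sketch. The weighting scheme below (a sum of Gaussian bumps centered on landmark coordinates, multiplied element-wise into the feature maps) is an assumption chosen for illustration; the document does not specify the exact form of the filters generated from the landmarks.

```python
import numpy as np

def landmark_weight_map(shape, landmarks, sigma=2.0):
    """Build a spatial weight map that peaks at each facial landmark.

    shape     -- (H, W) of the feature map
    landmarks -- iterable of (row, col) landmark coordinates
    sigma     -- spread of the Gaussian bump around each landmark

    The sum-of-Gaussians form is an illustrative assumption,
    not the paper's specification.
    """
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    weights = np.zeros(shape, dtype=np.float64)
    for r, c in landmarks:
        weights += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
    return weights

def emphasize(feature_maps, landmarks, sigma=2.0):
    """Scale each channel of (C, H, W) feature maps by the landmark weight map."""
    wmap = landmark_weight_map(feature_maps.shape[-2:], landmarks, sigma)
    return feature_maps * wmap  # broadcasts over the channel axis

# Toy example: 4 channels of an 8x8 feature map, two hypothetical landmarks.
fmaps = np.ones((4, 8, 8))
out = emphasize(fmaps, landmarks=[(2, 2), (5, 6)])
```

In a full pipeline, `out` would replace the raw 3D Inception-ResNet feature maps before they are passed to the LSTM, so that regions near the eyes, brows, and mouth contribute more strongly to the temporal model.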