International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 07 | July 2022 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 746
Sign Language Recognition using Facial Gesture and Skeleton
Keypoints
Sheela N1, Kiran Raghavendra2, Shashank C2, Sanjana S2, Dhanush N S2
1Assistant Professor, Dept of Computer Science Engineering, JSS Science and Technology University, Mysuru
2Dept of Computer Science Engineering, JSS Science and Technology University, Mysuru
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Sign Language is used by people who are
speech impaired or hard of hearing. Sign Language
Recognition aims to recognise signs performed by the signer
in the input videos. It is an extremely complex task as signs
are performed with complicated hand gestures, body
posture and mouth actions. In recent times, skeleton based
models have been preferred for sign language recognition owing to
their independence from subject appearance and background. Some
sign languages make use of mouthings/facial gestures in
addition to hand gestures for signing specific words. These
facial expressions can be used to assist the convoluted task
of sign language recognition. Skeleton based methods are
still an active area of research owing to the lack of annotations for hand
keypoints. Significant efforts have been made to tackle sign
language recognition using skeleton based multi-modal
ensemble methods, but to our knowledge none of them take
facial expressions into consideration. To this end, we
propose the use of face keypoints to assist skeleton based
sign language recognition methods. With the addition of facial
feature information, the skeleton based ensemble achieves an
accuracy of 93.26% on the AUTSL dataset.
Key Words: Sign language recognition, Skeleton based
methods, Face expression, SLGCN, SSTCN, Wholepose
keypoint estimator, AUTSL dataset
1. INTRODUCTION
Sign Language is a means of communication for people
who are hard of hearing or speech impaired. It is a visual
language involving hand gestures, body posture and
mouth actions. Comprehending sign language requires
remarkable effort and training, which is not feasible for the
general public. In addition, sign language is affected by the
language of communication (e.g., English, Chinese, Italian)
and region of usage (e.g., American Sign Language, Indian
Sign Language). With advancements in computer vision
and machine learning, it is essential to explore sign
language recognition (SLR) which translates sign language
and helps the deaf/speech impaired community to
communicate easily with others in their daily life.
In comparison with action recognition or pose estimation,
SLR is an extremely challenging task. Firstly, SLR requires
information about global body motion and the intricate
movement of hands and fingers to express a sign
correctly; similar signs can convey different meanings
depending on the number of times they are repeated. Secondly,
different signers perform signs differently (e.g., speed,
body shape and posture, left handed or right handed) thus
making SLR challenging.
Inspired by recent developments in SLR using multi-modal
methods [4], we propose the use of pretrained facial keypoint
estimators to provide additional facial gesture information to
the SLGCN + SSTCN ensemble framework proposed in [4].
Specifically, we use SLGCN [4] and SSTCN [4] to exploit facial
gesture information from face keypoints generated by pretrained
estimators, thus assisting the complex task of sign language recognition.
2. RELATED WORK
In this section, we review existing publicly available
datasets for sign language, and existing state-of-the-art
algorithms for sign language recognition.
2.1 Sign Language Datasets
A Word Level American Sign Language (WLASL) dataset is
proposed in [1], containing over 2000 words performed
by over 100 signers.
A Turkish Sign Language dataset is proposed in [6]. The
dataset consists of 226 signs performed by 43 different
signers, with 38,336 sign video samples in total. Samples
contain a variety of videos in different backgrounds (both
indoor and outdoor environments).
Reference [3] introduces a 3D hand pose dataset based on
synthetic hand models.
2.2 Sign Language Recognition Approaches
An appearance based approach and a 2D human pose based
approach are proposed in [1], creating baselines that aid
method benchmarking. In addition, [1] proposes a pose-based
temporal graph convolution network (Pose-TGCN)
that models spatial and temporal dependencies.
A Two-Stream Inflated 3D ConvNet (I3D) is proposed in
[7]. It is based on 2D ConvNet inflation: filters and pooling
kernels of very deep image classification convnets are
expanded into 3D, thus making it possible to learn spatio-
temporal features from video.
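To make the inflation step concrete, the following is a minimal sketch, not the code of [7], of how a pretrained 2D convolution kernel can be inflated to 3D: the kernel is repeated along a new temporal axis and rescaled so that the inflated filter initially behaves like the 2D filter applied to a static video. The kernel shape and temporal depth below are illustrative assumptions.

```python
import torch

def inflate_conv2d_weight(w2d: torch.Tensor, time_depth: int) -> torch.Tensor:
    """Inflate a (out_c, in_c, kH, kW) 2D kernel into a (out_c, in_c, T, kH, kW) 3D kernel."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_depth, 1, 1)  # copy weights along the temporal axis
    return w3d / time_depth                                 # rescale to preserve activation magnitude

# Example: inflating the first convolution of an ImageNet-pretrained 2D network.
w2d = torch.randn(64, 3, 7, 7)          # assumed 2D kernel shape
w3d = inflate_conv2d_weight(w2d, 5)     # -> torch.Size([64, 3, 5, 7, 7])
```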
A Skeleton Aware Multi-modal SLR framework (SAM-SLR) is
proposed in [4] to take advantage of multi-modal
information, thus aiding sign language recognition. RGB
and depth modalities are also added and assembled into
the final framework to provide information that is
complementary to the SL-GCN and SSTCN methods.
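The following is a minimal sketch of the score-level (late) fusion idea behind such multi-modal ensembles; the modality names and fusion weights are illustrative assumptions, not the tuned values from [4].

```python
import numpy as np

def fuse_scores(scores_by_modality: dict, weights: dict) -> np.ndarray:
    """Weighted sum of per-modality class scores, each of shape (num_samples, num_classes)."""
    return sum(weights[m] * scores_by_modality[m] for m in scores_by_modality)

rng = np.random.default_rng(0)
scores = {m: rng.random((4, 226)) for m in ["skeleton", "rgb", "depth"]}  # 226 AUTSL classes
weights = {"skeleton": 1.0, "rgb": 0.7, "depth": 0.7}                      # assumed fusion weights
predictions = fuse_scores(scores, weights).argmax(axis=1)                  # fused class predictions
```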
Reference [2] focuses on the translation system and
introduces the STMC-Transformer, which improves on the
current state-of-the-art by over 5 and 7 BLEU on gloss-to-text
and video-to-text translation, respectively, of the
PHOENIX-Weather 2014T dataset [8]. An approach that
estimates hand pose from RGB images has been proposed
in [3]. Reference [3] proposes a deep network that learns a
network-implicit 3D articulation prior; together with detected
keypoints in the images, this network yields good
estimates of the 3D pose. A method to automatically find
compact and problem-specific topology for spatio-
temporal graph convolutional networks in a progressive
manner has been proposed in [5]. A summary of the usage of
facial expressions by signers, along with an account of their
range, is presented in [9] using the three dimensions along
which facial expressions vary: semantic, compositional, and iconic. In
[12], a sign language recognition model is created using
Convolutional Neural Networks (CNNs), Feature Pooling
Module and Long Short-Term Memory Networks (LSTMs).
In the CNN part, a pre-trained VGG-16 model is used, after
adapting its weights to the dataset. The extracted features
are used to generate multi-scale features, and the feature
matrices are reduced to feature vectors using Global
Average Pooling (GAP). The resulting features are passed,
after instance normalization, to the LSTM architecture,
which generates the text.
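A rough sketch of this pipeline, under our reading of [12] rather than the authors' code, is shown below: per-frame features from a pretrained VGG-16 backbone are pooled to vectors with GAP, instance normalized, and fed to an LSTM whose final state is classified into a sign label. The hidden size and layer choices are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CnnLstmSLR(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 512):
        super().__init__()
        self.backbone = vgg16(weights="IMAGENET1K_V1").features  # convolutional part of VGG-16
        self.gap = nn.AdaptiveAvgPool2d(1)                        # Global Average Pooling
        self.norm = nn.InstanceNorm1d(512)                        # instance normalization
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape                                # clip: (batch, time, 3, H, W)
        feats = self.gap(self.backbone(clip.reshape(b * t, c, h, w)))
        feats = feats.reshape(b, t, 512)
        feats = self.norm(feats.transpose(1, 2)).transpose(1, 2)  # normalize each feature channel
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])                              # logits over sign classes
```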
Reference [14] investigates the task of 2D human whole-
body pose estimation, with the aim of localizing dense
landmarks on the entire human body including face,
hands, body, and feet. Different deep models are trained
independently on different datasets of the human face,
hand, and body. In addition, [14] introduces COCO-WholeBody,
which extends the COCO dataset with whole-body
annotations, i.e., 133 dense landmarks: 68 on the face,
42 on the hands, and 23 on the body and feet. The neural
network model ZoomNet is proposed in [14]; it takes into
account the hierarchical structure of the full human body
to address the scale variation across different body parts of
the same person.
Table -1: Statistical summary of the AUTSL dataset
Subset       Samples
Train        28,142
Validation    4,418
Test          3,742
3. AUTSL DATASET
The Ankara University Turkish Sign Language dataset [6] was
gathered for general Sign Language Recognition
tasks. It consists of 226 signs performed by 43 different
signers and 38,336 isolated sign video samples in total.
Samples contain a wide variety of backgrounds recorded
in indoor and outdoor environments. Moreover, spatial
positions and the postures of signers also vary in the
recordings. The collection method uses the Kinect V2 sensor
[10], [11]: each sample is recorded with a Microsoft Kinect
v2 and contains color image (RGB), depth, and skeleton
modalities. The dataset is split into training, validation, and
testing sets as shown in Table 1.
Fig-1: Surprise Sign
4. PROPOSED METHOD
In this section we discuss various deep learning model
benchmarks. We consider two baseline models: the
model benchmarked in [6] and the SLR challenge
leaderboard models (Baseline RGB and Baseline RGB-D). In
[12], a CNN + LSTM structure is employed, where a VGG-16
model extracts features from each video clip and these
features are given as input to the LSTM to generate text.
We consider the SLGCN and SSTCN models used in [4] as a
baseline and create an ensemble of the two models, with the
addition of facial features generated using the whole-body pose
keypoint estimator proposed in [14].
The SLGCN model used in [4] uses 27 keypoints: 7
keypoints model the upper body and 10 keypoints
model each hand. From these keypoints, bone data, joint
data, bone motion data and joint motion data are extracted
and given as input to the SLGCN network. The SSTCN model
considers 33 keypoints, i.e., 4 landmarks on the lips and
1 landmark on the nose in addition to the 27 keypoints
mentioned above. As mouth actions and facial gestures are
essential for a better understanding of sign language, we
explore sign language recognition by using face landmarks
to capture a variety of facial expressions, e.g., an eyebrow raise
while signing "surprise", as shown in Fig. 1.
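The four input streams can be derived from a keypoint sequence roughly as follows; this is a hedged sketch of the general recipe, and the parent list below is a placeholder rather than the actual 27-node graph of [4].

```python
import numpy as np

BONE_PARENTS = [0, 0, 1, 2, 3, 1, 5]  # hypothetical parent index for each keypoint

def skeleton_streams(joints: np.ndarray, parents=BONE_PARENTS):
    """joints: (T, V, C) array of T frames, V keypoints, C coordinates."""
    bones = joints - joints[:, parents, :]                      # bone vectors (child minus parent)
    joint_motion = np.diff(joints, axis=0, prepend=joints[:1])  # frame-to-frame joint differences
    bone_motion = np.diff(bones, axis=0, prepend=bones[:1])     # frame-to-frame bone differences
    return joints, bones, joint_motion, bone_motion

seq = np.random.rand(32, 7, 2)  # toy 32-frame, 7-keypoint, 2D sequence
joint, bone, joint_motion, bone_motion = skeleton_streams(seq)
```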
Fig-2: Face keypoints using wholepose estimator
We make use of the whole-body pose keypoint estimator proposed
in [14] to generate whole-body keypoints. Specifically, we
use 61 keypoints in total, i.e., 6 keypoints for the left eye, 6
for the right eye, 5 for the left eyebrow, 5 for the right
eyebrow, and 12 for the mouth, as shown in Fig. 2, together
with the 27 keypoints defined in [5] (10 keypoints for each
hand and 7 keypoints for the upper body). The ensemble of
the two models is used to make sign language predictions
as shown in Fig. 3.
Fig-3: SLGCN + SSTCN ensemble with facial features
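A small sketch of how the 61 keypoints could be sliced out of a 133-landmark whole-body prediction is given below: 34 face points (eyes, eyebrows, mouth) are kept together with the 27 hand and upper-body points. The index lists are placeholders, not the actual landmark ordering used by the estimator of [14].

```python
import numpy as np

BODY_HAND_IDS = list(range(27))                                              # assumed 27 hand/body indices
FACE_IDS = list(range(60, 72)) + list(range(83, 95)) + list(range(95, 105))  # assumed 34 face indices
SELECTED = BODY_HAND_IDS + FACE_IDS                                          # 61 keypoints in total

def select_keypoints(wholebody: np.ndarray) -> np.ndarray:
    """wholebody: (T, 133, C) keypoints for one video -> (T, 61, C) selected keypoints."""
    return wholebody[:, SELECTED, :]
```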
5. RESULTS
In this section, the performance of the SLGCN+SSTCN
ensemble model with facial features is presented. For
testing, we use the test set of the AUTSL dataset [6],
consisting of 3,742 examples. To capture facial expressions
and mouth gestures, we added 34 face keypoints. With this
addition, we observe a Top-1 accuracy of 93.26% and a
Top-5 accuracy of 95.66%, as shown in Table 2.
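For completeness, the Top-1 and Top-5 figures can be computed from the ensembled class scores on the test set as sketched below; the array names are assumptions and the snippet is illustrative only.

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """scores: (num_samples, num_classes), labels: (num_samples,); returns accuracy in percent."""
    top_k = np.argsort(scores, axis=1)[:, -k:]     # indices of the k highest-scoring classes
    hits = (top_k == labels[:, None]).any(axis=1)  # is the true label among the top k?
    return 100.0 * hits.mean()

# e.g. top_k_accuracy(fused_scores, test_labels, 1) and top_k_accuracy(fused_scores, test_labels, 5)
```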
The above results can be attributed to the imbalance
between signs with facial expressions and signs without
facial expressions in the AUTSL dataset, i.e., the dataset
contains very few signs involving facial expressions,
resulting in an absence of clear facial expression patterns.
Table -2: Performance of the SLGCN+SSTCN ensemble
SLGCN+SSTCN Ensemble          Top-1 (%)   Top-5 (%)
With facial features          93.26       95.66
6. CONCLUSION
In this paper, we propose the addition of facial features to
capture facial gestures and mouth actions, thus adding
valuable information to the task of sign language
recognition.
We construct a skeleton graph for Sign Language
Recognition using pretrained whole pose estimators and
make use of SLGCN [4] to model temporal and spatial
dynamics of facial keypoints and skeleton keypoints and
SSTCN [4] to extract information from skeleton features.
The proposed method has the potential to provide fruitful
results on sign languages that rely heavily on facial
expressions.
REFERENCES
[1] Li, Dongxu & Rodríguez, Cristian & Yu, Xin & Li,
Hongdong. (2019). Word-level deep sign language
recognition from video: A new large-scale dataset
and methods comparison.
[2] Yin, Kayo & Read, Jesse. (2020). Better sign
language translation with STMC-transformer.
5975-5989. 10.18653/v1/2020.coling-main.525.
[3] Zimmermann, Christian & Brox, Thomas. (2017).
Learning to estimate 3D hand pose from single
RGB images. 4913-4921. 10.1109/ICCV.2017.525.
[4] Jiang, Songyao & Sun, Bin & Wang, Lichen & Bai,
Yue & Li, Kunpeng & Fu, Yun. (2021). Sign
language recognition via skeleton-aware multi-
model ensemble.
[5] Heidari, Negar & Iosifidis, Alexandros. (2020).
Progressive spatio-temporal graph convolutional
network for skeleton-based human action
recognition.
[6] Mercanoglu, Ozge & Keles, Hacer. (2020). AUTSL:
A large scale multi-modal Turkish sign language
dataset and baseline methods. IEEE Access. 8.
181340-181355. 10.1109/ACCESS.2020.3028072.
[7] Carreira, J. & Zisserman, Andrew. (2017). Quo
Vadis, Action Recognition? A new model and the
kinetics dataset. 4724-4733.
10.1109/CVPR.2017.502.
[8] Forster, Jens & Schmidt, Christoph & Hoyoux,
Thomas & Koller, Oscar & Zelle, Uwe & Piater,
Justus & Ney, Hermann. (2012). RWTH-PHOENIX-
Weather: A large vocabulary sign language
recognition and translation corpus.
[9] Elliott, Eeva & Jacobs, Arthur. (2013). Facial
expressions, emotions, and sign languages.
Frontiers in psychology. 4. 115.
[10] Diana Pagliari and Livio Pinto. Calibration
of Kinect for Xbox One and comparison between
the two generations of Microsoft sensors.
Sensors, 15:27569–27589, 2015.
[11] Clemens Amon, Ferdinand Fuhrmann, and Franz
Graf. Evaluation of the spatial resolution accuracy
of the face tracking system for Kinect for windows
V1 and V2. In Proceedings of AAAI Conference
on Artificial, pages 16–17, 2014
[12] Ozge Mercanoglu Sincan, Anil Osman
Tur, and Hacer Yalim Keles. Isolated sign language
recognition with multi-scale features using LSTM,
2019.
[13] Colin Raffel and Daniel PW Ellis. Feed-forward
networks with attention can solve some
long-term memory problems. arXiv:1512.08756,
2015.
[14] Sheng Jin, Lumin Xu, Jin Xu, Can Wang, Wentao
Liu, Chen Qian, Wanli Ouyang, and Ping Luo.
Whole-body human pose estimation in the wild.
2020.