ACEEE Int. J. on Information Technology, Vol. 01, No. 01, Mar 2011




      Incremental Difference as Feature for Lipreading
                   Pravin L. Yannawar, Ganesh R. Manza, Bharti W. Gawali, Suresh C. Mehrotra
                              Department of Computer Science and Information Technology,
                   Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, India
      pravinyannawar@gmail.com, ganesh.maza@gmail.com, drbhartirokade@gmail.com, scmehrotra@yahoo.com


Abstract — This paper presents a method of computing incremental difference features on the basis of scan line projections and scan-converting lines for the lipreading problem on a set of isolated word utterances. These features are affine invariant and are found to be effective in identifying similarity between utterances by the speaker in the spatial domain.

Keywords — Incremental Difference Feature, Euclidean distance, Lipreading.

                        I. INTRODUCTION

    Automatic speech recognition (ASR) has been designed for well defined applications such as dictation and medium-vocabulary transaction processing tasks in relatively controlled environments. Researchers have observed that ASR performance is far from human performance across a variety of tasks and conditions; indeed, ASR to date is very sensitive to variations in the environmental channel (non-stationary noise sources such as babbled speech, reverberation in closed spaces such as cars, and multi-speaker environments) and in the style of speech (such as whispering) [1].

    Lipreading uses the visual channel as a source of speech information alongside the auditory channel. It provides redundancy with the acoustic speech signal but is less variable than the acoustic signal; the acoustic signal depends on lip, teeth, and tongue position to the extent that significant phonetic information is obtainable using lip movement recognition alone [2][3]. The intimate relation between the audio and imagery sensor domains in human recognition can be demonstrated with the McGurk effect [4][5], where the perceiver "hears" something other than what was said acoustically due to the influence of a conflicting visual stimulus. Current speech recognition technology may perform adequately in the absence of acoustic noise for moderate-size vocabularies, but even in the presence of moderate noise it fails except for very small vocabularies [6][7][8][9]. Humans also have difficulty distinguishing between some consonants when the acoustic signal is degraded.

    However, to date all automatic speechreading studies have been limited to very small vocabulary tasks and, in most cases, to a very small number of speakers. In addition, the diverse algorithms suggested in the literature for automatic speechreading are very difficult to compare, as they are hardly ever tested on a common audio-visual database. Furthermore, most such databases are of very small duration, which raises doubts about the generalization of reported results to larger populations and tasks. There is no specific answer to this yet, but researchers are concentrating more on speaker-independent audio-visual large-vocabulary continuous speech recognition systems [10].

    Many methods have been proposed by researchers to enhance speech recognition systems by synchronizing visual information with speech. One improvement on an automatic lipreading system incorporated dynamic time warping and vector quantization applied to alphabets and digits; the recognition was restricted to isolated utterances and was speaker dependent [2]. Later, Christoph Bregler (1993) worked on how recognition performance in automated speech perception can be significantly improved and introduced an extension to the existing Multi-State Time Delayed Neural Network architecture for handling both modalities, that is, acoustic and visual sensor input [11]. Similar work was done by Yuhas et al. (1993), who focused on neural networks for vowel recognition and worked on static images [12].

    Paul Duchnowski et al. (1995) worked on movement-invariant automatic lipreading and speech recognition [13]; Juergen Luettin (1996) used an active shape model and hidden Markov model for visual speech recognition [14]. K. L. Sum et al. (2001) proposed a new optimization procedure for extracting the point-based lip contour using an active shape model [16]. Caplier (2001) used an active shape model and Kalman filtering in the spatiotemporal domain to note visual deformations [17]. Ian Matthews et al. (2002) proposed a method for extraction of visual features of lipreading for audio-visual speech recognition [18]. Xiaopeng Hong et al. (2006) used a PCA-based DCT feature extraction method for lipreading [19]. Takeshi Saitoh et al. (2008) analyzed an efficient lipreading method for various languages, focusing on a limited set of words from English, Japanese, Nepalese, Chinese, and Mongolian; the words in English and their translations in the above listed languages were considered for the experiment [20]. Meng Li et al. (2008) proposed a novel motion-based lip feature extraction method for lipreading problems [21].

    The paper is organized in four sections. Section I deals with the introduction and literature review. Section II deals with the methodology adopted. Section III discusses the results obtained by applying the methodology, and Section IV contains the conclusion of the paper.


                        II. METHODOLOGY
    The system takes its input in the form of video (moving pictures), comprising visual and audio data, as shown in Figure 1. This acts as the input to the audio-visual speech recognition system. Samples were collected from subjects whose mother tongue uses the Devanagari script; isolated words (city names) were pronounced by the speakers.

                    Figure 1: Proposed Model

    Samples from a female subject have been chosen. Each speaker or subject was requested to begin and end each utterance of an isolated city name with the mouth in a closed-open-closed position. No head movement was allowed; speakers were provided with a close-up view of their mouth and urged not to move their face out of the frame. With these constraints the dataset was prepared. The video input was acquired in the acquisition phase and passed to a sampler, which samples the video into frames at a standard rate of 32 frames per second. Normally a video of 2 seconds was recorded for each subject; when these samples were provided to the sampler, it produced 64 images per utterance, treated as an image vector 'I' of 64 images, as shown in Figure 2.

    The image vector 'I' has to be enhanced because the images in 'I' depend on lighting conditions, head position, etc. Registration or realignment of image vector 'I' was not necessary, since the entire sample was collected from the subject in a constrained environment, as discussed above.

    Figure 2: Subject with utterance of the word "MUMBAI", Time 0.02 Sec @ 32 Fps

    Image vector 'I' was processed from color to gray and further to binary, with histogram equalization, background estimation, and image morphological operations, by defining structuring elements for open, close, and adjust operations. The outcome of this preprocessing is shown in Figure 3.

    Figure 3: Subject with utterance of the word "MUMBAI", Time 0.02 Sec @ 32 Fps; gray-to-binary image conversion using a morphological operation with structuring element 'Disk'

A. Region of Interest

    For the identification of the Region of Interest (ROI) from the binary image, the scan line projections of the rows, R(x), and of the columns, C(y), were computed as vectors for every frame. Each image in the vector is represented by a two-dimensional light intensity function F(x, y) returning the amplitude at coordinate (x, y). For an image of m rows and n columns:

    R(x) = \sum_{y=1}^{n} F(x, y),   x = 1, ..., m        ... (1)

    C(y) = \sum_{x=1}^{m} F(x, y),   y = 1, ..., n        ... (2)

    This process suggests the areas for segmentation of the eyes, nose, and mouth from every image of the vector. It was found to be helpful in classifying the open and closed states of the subject's mouth, and geometrical features such as the height and width of the mouth in every frame can easily be computed.

    Figure 4: (a) Horizontal (row) and vertical (column) scan line projections of the face, used to locate facial components such as eyes, eyebrows, nostrils, nose, and mouth; (b) isolation of the mouth region

    Masking was done so as to reduce the workspace. When R(x) and C(y) are plotted, the plot reveals the face components such as the eyes, nose, and mouth. The mask containing the mouth region was framed in accordance with the very first image in vector 'I'; this was accomplished by computing the horizontal scan line projections (row projections) and vertical scan line projections (column projections) as discussed above. On the source vector 'I', it was observed that face components such as eyes, eyebrows, nose, nostrils, and mouth could easily be isolated. The region of interest, the mouth, can easily be located as shown in Figure 4(a), and the face is easily segmented into three parts, eyes, nose, and mouth, as shown in Figure 4(b). The mask remained constant for all remaining images of the vector, and the window coordinates containing the mouth were fixed for the mask.
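    As an illustration of this subsection, the following minimal sketch computes the scan line projections of equations (1) and (2) on a binary frame and fixes a mouth window from the first frame. It is not the authors' code: it assumes NumPy and OpenCV are available, and the helper names, threshold choice, and 5x5 disk size are illustrative assumptions.

```python
import cv2
import numpy as np

def to_binary(frame_bgr):
    """Color -> gray -> binary preprocessing roughly as described above
    (histogram equalization, thresholding, morphological open/close with a
    disk-shaped structuring element); parameter values are illustrative."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, disk)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, disk)
    return binary

def scan_line_projections(binary):
    """Equations (1) and (2): row projection R(x) and column projection C(y)."""
    F = binary.astype(np.float64)
    R = F.sum(axis=1)   # R(x): one value per row x = 1..m
    C = F.sum(axis=0)   # C(y): one value per column y = 1..n
    return R, C

def mouth_window_from_first_frame(first_binary, row_range, col_range):
    """Fix the mask (window coordinates) from the first frame of vector 'I'.
    In the paper the row/column ranges are read off the plotted projections;
    here they are passed in explicitly for simplicity."""
    R, C = scan_line_projections(first_binary)
    # R and C would be plotted and inspected to choose the mouth rows/columns
    return (slice(*row_range), slice(*col_range))
```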




This mask was applied to all frames of the image vector so that the mouth frame was extracted from each source image. The result of this windowing operation is a vector called 'W', as shown in Figure 5.




 Figure 5: Masking result of Subject with word “MUMBAI”, Time 0.02
                              Sec @ 32Fps
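    A short sketch of this windowing step, assuming the window is a pair of row/column slices fixed from the first frame (for example by the hypothetical helper sketched earlier):

```python
def window_all_frames(binary_frames, mouth_window):
    """Apply the fixed mouth window to every frame of image vector 'I'
    (64 frames per utterance), producing the mouth-region vector 'W'."""
    rows, cols = mouth_window
    return [frame[rows, cols] for frame in binary_frames]
```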
B. Feature Extraction




 Figure 6: HRL and VRL used for identification of the incremental difference

    There are two approaches to extracting lip features from the frame vector. The first is low-level analysis of the image sequence data, which does not attempt to incorporate much prior knowledge. The other is a high-level approach that imposes a model on the data using prior knowledge; typically, high-level analysis uses lip tracking to extract lip shape information alone. Here, feature extraction was carried out by low-level analysis, directly processing image pixels, which is implicitly able to retrieve additional features that may be difficult to track, such as the teeth and tongue [22].
    Low-level analysis is adopted in order to compute the features. The Horizontal Reference Line (HRL) and Vertical Reference Line (VRL) for the lip are plotted, as illustrated in Figure 6; the points defining the HRL and VRL are chosen from the scan line projection vectors R(x) and C(y). The initial value for P1, the midpoint of the HRL, is calculated using equations (3) and (4).

    By equations (3) and (4) [23], with the support of a decision variable, the new coordinates P1(x, y) on the HRL and P2(x, y) on the VRL are computed. Pixel P1 lies at the middle of the HRL and pixel P2 at the middle of the VRL. The difference between P1 and P2 is taken as the incremental difference feature and is a unique feature for the frame; it is invariant to scale and rotation. This difference is computed for all frames of the utterance of a word and stored in a vector, referred to as the feature vector for the word. Feature vectors are formed for all samples of the words {AURANGABAD, MUMBAI, PARBHANI, KOLHAPUR, and OSMANABAD}.
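    Since equations (3) and (4) are not reproduced above, the sketch below substitutes the standard midpoint scan-conversion with an incremental decision variable in the spirit of [23], together with one plausible reading of the "pixel difference" between P1 and P2; it is an assumption of how the feature might be computed, not the authors' exact formulation.

```python
import math

def midpoint_scan_convert(p_start, p_end):
    """Midpoint (Bresenham-style) scan conversion of a line using an incremental
    decision variable, in the spirit of [23]. For brevity this covers
    horizontal-ish lines (0 <= dy <= dx), which suffices for the HRL."""
    (x0, y0), (x1, y1) = p_start, p_end
    dx, dy = x1 - x0, y1 - y0
    d = 2 * dy - dx                  # initial decision variable
    x, y = x0, y0
    pixels = [(x, y)]
    while x < x1:
        if d <= 0:
            d += 2 * dy              # keep y: east pixel chosen
        else:
            d += 2 * (dy - dx)       # step y: north-east pixel chosen
            y += 1
        x += 1
        pixels.append((x, y))
    return pixels

def incremental_difference_feature(hrl, vrl):
    """hrl and vrl are ((x0, y0), (x1, y1)) endpoint pairs taken from the scan
    line projections. P1 is the middle pixel of the scan-converted HRL and P2
    the middle pixel of the (vertical) VRL; the per-frame feature is taken here
    as the Euclidean distance between P1 and P2, one plausible reading of the
    'pixel difference' described above."""
    hrl_pixels = midpoint_scan_convert(*hrl)
    (vx, vy0), (_, vy1) = vrl        # vertical line: enumerate pixels directly
    vrl_pixels = [(vx, y) for y in range(vy0, vy1 + 1)]
    p1 = hrl_pixels[len(hrl_pixels) // 2]
    p2 = vrl_pixels[len(vrl_pixels) // 2]
    return math.dist(p1, p2), p1, p2
```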
                    III. RESULTS AND DISCUSSION
The midpoints (M) of the HRL and VRL were chosen on the basis of the method discussed above. The pixels P1 and P2, with their new coordinates (x_p, y_p), are marked as landmark points on every frame of the vector, as shown in Figure 7.




              Figure 7: Marking of all landmark points

           The pixel difference between P1 and P2 was recorded as the feature of the frame, and the corresponding differences for all frames of the vector were computed and stored in the feature vector. The feature vectors corresponding to all utterances of the word 'AURANGABAD' were formed, and their mean feature vector was also calculated. The Euclidean distance between the mean feature vector and each computed feature vector is presented in Table I. From Table I, it is observed that sample 1 and sample 2 of the word 'AURANGABAD' are similar, sample 4 and sample 5 are also similar, and sample 8 and sample 9 are the same. Similar results were obtained for the other samples of the words uttered by the speaker. Table II presents the Euclidean distance metrics for the word 'AURANGABAD' by speaker 1. Graph I shows the similarity between the maxima and minima of the feature vectors of sample 1 and sample 2 of the word 'AURANGABAD' uttered by speaker 1, and Graph II shows how two words, 'AURANGABAD' and 'KOLHAPUR', uttered by the same speaker differ on the basis of the computed feature vectors and the maxima and minima observed in Graph II. The mean feature vectors of these words are plotted; these feature vectors are formed with the help of the incremental difference procedure. Similar results are observed for the other samples and speakers.
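    A minimal sketch of the comparison behind Table I, assuming each utterance yields one feature vector (one incremental difference value per frame); the function and array names are illustrative, not taken from the paper.

```python
import numpy as np

def distance_table(feature_vectors):
    """feature_vectors: shape (num_samples, num_frames), one row per utterance
    of a word (e.g. 'AURANGABAD'). Returns the Euclidean distance of every
    sample's feature vector from the mean feature vector, as in Table I."""
    fv = np.asarray(feature_vectors, dtype=float)
    mean_fv = fv.mean(axis=0)
    return np.linalg.norm(fv - mean_fv, axis=1)

# Samples whose distances to the mean are small and close to one another
# (e.g. samples 1 and 2) are judged to be similar utterances of the word.
```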
                                                                                        IV.   CONCLUSION
   The incremental difference is a novel method for feature extraction for audio-visual speech recognition, and it is found to be suitable for the enhancement of speech recognition. This method helps in differentiating the words spoken by the speaker.
                        ACKNOWLEDGEMENT
    The authors would like to thank the university authorities for providing all the infrastructure required for the experiments.
                                                                                              REFERENCES
[1] J. R. Deller Jr., J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, Englewood Cliffs, 1993.
[2] Eric Petajan, Bradford Bischoff, and David Bodoff, An improved automatic lipreading system to enhance speech recognition, Technical Report TM 11251-871012-11, AT&T Bell Labs, Oct. 1987.



[3] Finn K. I., An investigation of visible lip information to be used in automated speech recognition, Ph.D. Thesis, Georgetown University, 1986.
[5] McGurk H. and MacDonald J., Hearing lips and seeing voices, Nature, vol. 264, pp. 746-748, Dec. 1976.
[6] Paul D. B., Lippmann R. P., Chen Y., and Weinstein C. J., Robust HMM based technique for recognition of speech produced under stress and in noise, Proceedings Speech Tech. 87, pp. 275-280, 1987.
[7] Malkin F. J., The effect on computer recognition of speech when speaking through protective masks, Proceedings Speech Tech. 87, pp. 265-268, 1986.
[8] Meisel W. S., A natural speech recognition system, Proceedings Speech Tech. 87, pp. 10-13, 1987.
[9] Moody T., Joost M., and Rodman R., A comparative evaluation of speech recognizers, Proceedings Speech Tech. 87, pp. 275-280, 1987.
[10] Chalapathy Neti, Gerasimos Potamianos, Juergen Luettin, Ian Matthews, Herve Glotin, Dimitra Vergyri, June Sison, Azad Mashari, and Jie Zhou, Audio-Visual Speech Recognition, Workshop 2000 Final Report, Oct. 2000.
[11] Christoph Bregler, Improving connected letter recognition by lipreading, IEEE, 1993.
[12] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, Integration of acoustic and visual speech signals using neural networks, IEEE Communications Magazine.
[13] Paul Duchnowski, Toward movement invariant automatic lipreading and speech recognition, IEEE, 1995.
[14] Juergen Luettin, Visual speech recognition using active shape models and hidden Markov models, IEEE, 1996.
[15] A. K. Jain, Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, January 2000.
[16] K. L. Sum, W. H. Lau, S. H. Leung, Alan W. C. Liew, and K. W. Tse, A new optimization procedure for extracting the point-based lip contour using active shape model, IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, pp. 1485-1488, 7th-11th May 2001.
[17] A. Caplier, Lip detection and tracking, 11th International Conference on Image Analysis and Processing (ICIAP 2001).
[18] Ian Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and Richard Harvey, Extraction of visual features for lipreading, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, February 2002.
[19] Xiaopeng Hong, Hongxun Yao, Yuqi Wan, and Rong Chen, A PCA based visual DCT feature extraction method for lipreading, International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2006.
[20] Takeshi Saitoh, Kazutoshi Morishita, and Ryosuke Konishi, Analysis of efficient lipreading method for various languages, International Conference on Pattern Recognition (ICPR), Tampa, FL, pp. 1-4, 8th-11th Dec. 2008.
[21] Meng Li and Yiu-ming Cheung, A novel motion based lip feature extraction for lipreading, IEEE International Conference on Computational Intelligence and Security, pp. 361-365, 2008.
[22] Ian Matthews, Features for Audio-Visual Speech Recognition, Ph.D. Thesis, School of Information Systems, University of East Anglia, 1998.
[23] James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes, Computer Graphics: Principles and Practice, Second Edition, Pearson Education Asia, ISBN 81-7808-038-9.




