ACEEE Int. J. on Information Technology, Vol. 01, No. 01, Mar 2011




      Incremental Difference as Feature for Lipreading
                   Pravin L. Yannawar, Ganesh R. Manza, Bharti W. Gawali, Suresh C. Mehrotra
                              Department of Computer Science and Information Technology,
                   Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, India
      pravinyannawar@gmail.com, ganesh.maza@gmail.com, drbhartirokade@gmail.com, scmehrotra@yahoo.com


Abstract — This paper presents a method of computing incremental difference features on the basis of scan line projections and scan-converting lines for the lipreading problem on a set of isolated word utterances. These features are affine invariant and are found to be effective in identifying similarity between utterances by the speaker in the spatial domain.

Keywords — Incremental Difference Feature, Euclidean distance, Lipreading.

                        I. INTRODUCTION

    Automatic speech recognition (ASR) has been designed for well defined applications such as dictation and medium-vocabulary transaction processing tasks in relatively controlled environments. Researchers have observed that ASR performance is far from human performance across a variety of tasks and conditions; indeed, ASR to date is very sensitive to variations in the environmental channel (non-stationary noise sources such as babbled speech, reverberation in closed spaces such as cars, and multi-speaker environments) and in the style of speech (such as whispering) [1].

    Lipreading uses the visual channel as a source of speech information alongside the auditory channel. It provides redundancy with the acoustic speech signal but is less variable than the acoustic signal; the acoustic signal depends on lip, teeth, and tongue position to the extent that significant phonetic information is obtainable using lip movement recognition alone [2][3]. The intimate relation between the audio and imagery sensor domains in human recognition can be demonstrated with the McGurk effect [4][5], where the perceiver "hears" something other than what was said acoustically due to the influence of a conflicting visual stimulus. Current speech recognition technology may perform adequately in the absence of acoustic noise for moderate-size vocabularies, but even in the presence of moderate noise it fails except for very small vocabularies [6][7][8][9]. Humans also have difficulty distinguishing between some consonants when the acoustic signal is degraded.

    However, to date all automatic speechreading studies have been limited to very small vocabulary tasks and, in most cases, to a very small number of speakers. In addition, the diverse algorithms suggested in the literature for automatic speechreading are very difficult to compare, as they are hardly ever tested on a common audio-visual database. Furthermore, most such databases are of very small duration, which raises doubts about the generalization of reported results to larger populations and tasks. There is no specific answer to this yet, but researchers are concentrating more on speaker-independent audio-visual large-vocabulary continuous speech recognition systems [10].

    Many methods have been proposed by researchers to enhance speech recognition systems by synchronizing visual information with speech. One improvement on an automatic lipreading system incorporated dynamic time warping and vector quantization applied to alphabets and digits; the recognition was restricted to isolated utterances and was speaker dependent [2]. Later, Christoph Bregler (1993) worked on how recognition performance in automated speech perception can be significantly improved and introduced an extension to the existing Multi-State Time Delayed Neural Network architecture for handling both modalities, that is, acoustic and visual sensor input [11]. Similar work was done by Yuhas et al. (1993), who focused on neural networks for vowel recognition and worked on static images [12].

    Paul Duchnowski et al. (1995) worked on movement-invariant automatic lipreading and speech recognition [13]; Juergen Luettin (1996) used an active shape model and hidden Markov model for visual speech recognition [14]. K. L. Sum et al. (2001) proposed a new optimization procedure for extracting the point-based lip contour using an active shape model [16]. Caplier (2001) used an active shape model and Kalman filtering in the spatiotemporal domain to note visual deformations [17]. Ian Matthews et al. (2002) proposed a method for extraction of visual features of lipreading for audio-visual speech recognition [18]. Xiaopeng Hong et al. (2006) used a PCA-based DCT feature extraction method for lipreading [19]. Takeshi Saitoh et al. (2008) analyzed an efficient lipreading method for various languages, focusing on a limited set of words from English, Japanese, Nepalese, Chinese, and Mongolian; the words in English and their translations in the above listed languages were considered for the experiment [20]. Meng Li et al. (2008) proposed a novel motion-based lip feature extraction method for lipreading problems [21].

    The paper is organized in four sections. Section I deals with the introduction and literature review. Section II deals with the methodology adopted. Section III discusses the results obtained by applying the methodology, and Section IV contains the conclusion of the paper.


                        II. METHODOLOGY
    The system takes its input in the form of video (moving pictures), comprising visual and audio data, as shown in Figure 1. This acts as the input to the audio-visual speech recognition system. Samples were collected from subjects whose mother tongue uses the Devanagari script; isolated words (city names) were pronounced by the speakers.

                    Figure 1: Proposed Model

    Samples from a female subject have been chosen. Each speaker or subject was requested to begin and end each utterance of an isolated city name with the mouth in a closed-open-closed position. No head movement was allowed; speakers were provided with a close-up view of their mouth and urged not to move their face out of the frame. With these constraints the dataset was prepared. The video input was acquired in the acquisition phase and passed to a sampler, which samples the video into frames at a standard rate of 32 frames per second. Normally a video of 2 seconds was recorded for each subject; when these samples were provided to the sampler, it produced 64 images per utterance, treated as an image vector 'I' of 64 images, as shown in Figure 2.

    The image vector 'I' has to be enhanced because the images in 'I' depend on lighting conditions, head position, etc. Registration or realignment of image vector 'I' was not necessary, since the entire sample was collected from the subject in a constrained environment, as discussed above.

    Figure 2: Subject with utterance of the word "MUMBAI", Time 0.02 Sec @ 32 Fps

    Image vector 'I' was processed from color to gray and further to binary, with histogram equalization, background estimation, and image morphological operations, by defining structuring elements for open, close, and adjust operations. The outcome of this preprocessing is shown in Figure 3.

    Figure 3: Subject with utterance of the word "MUMBAI", Time 0.02 Sec @ 32 Fps; gray-to-binary image conversion using a morphological operation with structuring element 'Disk'

A. Region of Interest

    For the identification of the Region of Interest (ROI) from the binary image, the scan line projections of the rows, R(x), and of the columns, C(y), were computed as vectors for every frame. Each image in the vector is represented by a two-dimensional light intensity function F(x, y) returning the amplitude at coordinate (x, y). For an image of m rows and n columns:

    R(x) = \sum_{y=1}^{n} F(x, y),   x = 1, ..., m        ... (1)

    C(y) = \sum_{x=1}^{m} F(x, y),   y = 1, ..., n        ... (2)

    This process suggests the areas for segmentation of the eyes, nose, and mouth from every image of the vector. It was found to be helpful in classifying the open and closed states of the subject's mouth, and geometrical features such as the height and width of the mouth in every frame can easily be computed.

    Figure 4: (a) Horizontal (row) and vertical (column) scan line projections of the face, used to locate facial components such as eyes, eyebrows, nostrils, nose, and mouth; (b) isolation of the mouth region

    Masking was done so as to reduce the workspace. When R(x) and C(y) are plotted, the plot reveals the face components such as the eyes, nose, and mouth. The mask containing the mouth region was framed in accordance with the very first image in vector 'I'; this was accomplished by computing the horizontal scan line projections (row projections) and vertical scan line projections (column projections) as discussed above. On the source vector 'I', it was observed that face components such as eyes, eyebrows, nose, nostrils, and mouth could easily be isolated. The region of interest, the mouth, can easily be located as shown in Figure 4(a), and the face is easily segmented into three parts, eyes, nose, and mouth, as shown in Figure 4(b). The mask remained constant for all remaining images of the vector, and the window coordinates containing the mouth were fixed for the mask.
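    As an illustration of this subsection, the following minimal sketch computes the scan line projections of equations (1) and (2) on a binary frame and fixes a mouth window from the first frame. It is not the authors' code: it assumes NumPy and OpenCV are available, and the helper names, threshold choice, and 5x5 disk size are illustrative assumptions.

```python
import cv2
import numpy as np

def to_binary(frame_bgr):
    """Color -> gray -> binary preprocessing roughly as described above
    (histogram equalization, thresholding, morphological open/close with a
    disk-shaped structuring element); parameter values are illustrative."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, disk)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, disk)
    return binary

def scan_line_projections(binary):
    """Equations (1) and (2): row projection R(x) and column projection C(y)."""
    F = binary.astype(np.float64)
    R = F.sum(axis=1)   # R(x): one value per row x = 1..m
    C = F.sum(axis=0)   # C(y): one value per column y = 1..n
    return R, C

def mouth_window_from_first_frame(first_binary, row_range, col_range):
    """Fix the mask (window coordinates) from the first frame of vector 'I'.
    In the paper the row/column ranges are read off the plotted projections;
    here they are passed in explicitly for simplicity."""
    R, C = scan_line_projections(first_binary)
    # R and C would be plotted and inspected to choose the mouth rows/columns
    return (slice(*row_range), slice(*col_range))
```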




This mask was applied to all frames of the image vector so that the mouth frame was extracted from each source image. The result of this windowing operation is a vector called 'W', as shown in Figure 5.




 Figure 5: Masking result of Subject with word “MUMBAI”, Time 0.02
                              Sec @ 32Fps
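    A short sketch of this windowing step, assuming the window is a pair of row/column slices fixed from the first frame (for example by the hypothetical helper sketched earlier):

```python
def window_all_frames(binary_frames, mouth_window):
    """Apply the fixed mouth window to every frame of image vector 'I'
    (64 frames per utterance), producing the mouth-region vector 'W'."""
    rows, cols = mouth_window
    return [frame[rows, cols] for frame in binary_frames]
```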
B. Feature Extraction




 Figure 6: HRL and VRL used for identification of the incremental difference

    There are two approaches to extracting lip features from the frame vector. The first is low-level analysis of the image sequence data, which does not attempt to incorporate much prior knowledge. The other is a high-level approach that imposes a model on the data using prior knowledge; typically, high-level analysis uses lip tracking to extract lip shape information alone. Here, feature extraction was carried out by low-level analysis, directly processing image pixels, which is implicitly able to retrieve additional features that may be difficult to track, such as the teeth and tongue [22].
    Low-level analysis is adopted in order to compute the features. The Horizontal Reference Line (HRL) and Vertical Reference Line (VRL) for the lip are plotted, as illustrated in Figure 6; the points defining the HRL and VRL are chosen from the scan line projection vectors R(x) and C(y). The initial value for P1, the midpoint of the HRL, is calculated using equations (3) and (4).

    By equations (3) and (4) [23], with the support of a decision variable, the new coordinates P1(x, y) on the HRL and P2(x, y) on the VRL are computed. Pixel P1 lies at the middle of the HRL and pixel P2 at the middle of the VRL. The difference between P1 and P2 is taken as the incremental difference feature and is a unique feature for the frame; it is invariant to scale and rotation. This difference is computed for all frames of the utterance of a word and stored in a vector, referred to as the feature vector for the word. Feature vectors are formed for all samples of the words {AURANGABAD, MUMBAI, PARBHANI, KOLHAPUR, and OSMANABAD}.
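    Since equations (3) and (4) are not reproduced above, the sketch below substitutes the standard midpoint scan-conversion with an incremental decision variable in the spirit of [23], together with one plausible reading of the "pixel difference" between P1 and P2; it is an assumption of how the feature might be computed, not the authors' exact formulation.

```python
import math

def midpoint_scan_convert(p_start, p_end):
    """Midpoint (Bresenham-style) scan conversion of a line using an incremental
    decision variable, in the spirit of [23]. For brevity this covers
    horizontal-ish lines (0 <= dy <= dx), which suffices for the HRL."""
    (x0, y0), (x1, y1) = p_start, p_end
    dx, dy = x1 - x0, y1 - y0
    d = 2 * dy - dx                  # initial decision variable
    x, y = x0, y0
    pixels = [(x, y)]
    while x < x1:
        if d <= 0:
            d += 2 * dy              # keep y: east pixel chosen
        else:
            d += 2 * (dy - dx)       # step y: north-east pixel chosen
            y += 1
        x += 1
        pixels.append((x, y))
    return pixels

def incremental_difference_feature(hrl, vrl):
    """hrl and vrl are ((x0, y0), (x1, y1)) endpoint pairs taken from the scan
    line projections. P1 is the middle pixel of the scan-converted HRL and P2
    the middle pixel of the (vertical) VRL; the per-frame feature is taken here
    as the Euclidean distance between P1 and P2, one plausible reading of the
    'pixel difference' described above."""
    hrl_pixels = midpoint_scan_convert(*hrl)
    (vx, vy0), (_, vy1) = vrl        # vertical line: enumerate pixels directly
    vrl_pixels = [(vx, y) for y in range(vy0, vy1 + 1)]
    p1 = hrl_pixels[len(hrl_pixels) // 2]
    p2 = vrl_pixels[len(vrl_pixels) // 2]
    return math.dist(p1, p2), p1, p2
```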
                    III. RESULTS AND DISCUSSION
The midpoints (M) of the HRL and VRL were chosen on the basis of the method discussed above. The pixels P1 and P2, with their new coordinates (x_p, y_p), are marked as landmark points on every frame of the vector, as shown in Figure 7.




              Figure 7: Marking of all landmark points

           The pixel difference between P1 and P2 was recorded as the feature of the frame, and the corresponding differences for all frames of the vector were computed and stored in the feature vector. The feature vectors corresponding to all utterances of the word 'AURANGABAD' were formed, and their mean feature vector was also calculated. The Euclidean distance between the mean feature vector and each computed feature vector is presented in Table I. From Table I, it is observed that sample 1 and sample 2 of the word 'AURANGABAD' are similar, sample 4 and sample 5 are also similar, and sample 8 and sample 9 are the same. Similar results were obtained for the other samples of the words uttered by the speaker. Table II presents the Euclidean distance metrics for the word 'AURANGABAD' by speaker 1. Graph I shows the similarity between the maxima and minima of the feature vectors of sample 1 and sample 2 of the word 'AURANGABAD' uttered by speaker 1, and Graph II shows how two words, 'AURANGABAD' and 'KOLHAPUR', uttered by the same speaker differ on the basis of the computed feature vectors and the maxima and minima observed in Graph II. The mean feature vectors of these words are plotted; these feature vectors are formed with the help of the incremental difference procedure. Similar results are observed for the other samples and speakers.
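    A minimal sketch of the comparison behind Table I, assuming each utterance yields one feature vector (one incremental difference value per frame); the function and array names are illustrative, not taken from the paper.

```python
import numpy as np

def distance_table(feature_vectors):
    """feature_vectors: shape (num_samples, num_frames), one row per utterance
    of a word (e.g. 'AURANGABAD'). Returns the Euclidean distance of every
    sample's feature vector from the mean feature vector, as in Table I."""
    fv = np.asarray(feature_vectors, dtype=float)
    mean_fv = fv.mean(axis=0)
    return np.linalg.norm(fv - mean_fv, axis=1)

# Samples whose distances to the mean are small and close to one another
# (e.g. samples 1 and 2) are judged to be similar utterances of the word.
```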
                                                                                        IV.   CONCLUSION
   The incremental difference is a novel method for feature extraction for audio-visual speech recognition, and it is found to be suitable for the enhancement of speech recognition. This method helps in differentiating the words spoken by the speaker.
                        ACKNOWLEDGEMENT
    The authors would like to thank the university authorities for providing all the infrastructure required for the experiments.
                                                                                              REFERENCES
[1] J. R. Deller Jr., J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, Englewood Cliffs, 1993.
[2] Eric Petajan, Bradford Bischoff, and David Bodoff, An improved automatic lipreading system to enhance speech recognition, Technical Report TM 11251-871012-11, AT&T Bell Labs, Oct. 1987.



[3] Finn K. I., An investigation of visible lip information to be used in automated speech recognition, Ph.D. Thesis, Georgetown University, 1986.
[5] McGurk H. and MacDonald J., Hearing lips and seeing voices, Nature, vol. 264, pp. 746-748, Dec. 1976.
[6] Paul D. B., Lippmann R. P., Chen Y., and Weinstein C. J., Robust HMM based technique for recognition of speech produced under stress and in noise, Proceedings Speech Tech. 87, pp. 275-280, 1987.
[7] Malkin F. J., The effect on computer recognition of speech when speaking through protective masks, Proceedings Speech Tech. 87, pp. 265-268, 1986.
[8] Meisel W. S., A natural speech recognition system, Proceedings Speech Tech. 87, pp. 10-13, 1987.
[9] Moody T., Joost M., and Rodman R., A comparative evaluation of speech recognizers, Proceedings Speech Tech. 87, pp. 275-280, 1987.
[10] Chalapathy Neti, Gerasimos Potamianos, Juergen Luettin, Ian Matthews, Herve Glotin, Dimitra Vergyri, June Sison, Azad Mashari, and Jie Zhou, Audio-Visual Speech Recognition, Workshop 2000 Final Report, Oct. 2000.
[11] Christoph Bregler, Improving connected letter recognition by lipreading, IEEE, 1993.
[12] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, Integration of acoustic and visual speech signals using neural networks, IEEE Communications Magazine.
[13] Paul Duchnowski, Toward movement invariant automatic lipreading and speech recognition, IEEE, 1995.
[14] Juergen Luettin, Visual speech recognition using active shape models and hidden Markov models, IEEE, 1996.
[15] A. K. Jain, Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, January 2000.
[16] K. L. Sum, W. H. Lau, S. H. Leung, Alan W. C. Liew, and K. W. Tse, A new optimization procedure for extracting the point-based lip contour using active shape model, IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, UT, USA, pp. 1485-1488, 7th-11th May 2001.
[17] A. Caplier, Lip detection and tracking, 11th International Conference on Image Analysis and Processing (ICIAP 2001).
[18] Ian Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and Richard Harvey, Extraction of visual features for lipreading, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, February 2002.
[19] Xiaopeng Hong, Hongxun Yao, Yuqi Wan, and Rong Chen, A PCA based visual DCT feature extraction method for lipreading, International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2006.
[20] Takeshi Saitoh, Kazutoshi Morishita, and Ryosuke Konishi, Analysis of efficient lipreading method for various languages, International Conference on Pattern Recognition (ICPR), Tampa, FL, pp. 1-4, 8th-11th Dec. 2008.
[21] Meng Li and Yiu-ming Cheung, A novel motion based lip feature extraction for lipreading, IEEE International Conference on Computational Intelligence and Security, pp. 361-365, 2008.
[22] Ian Matthews, Features for Audio-Visual Speech Recognition, Ph.D. Thesis, School of Information Systems, University of East Anglia, 1998.
[23] James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes, Computer Graphics: Principles and Practice, Second Edition, Pearson Education Asia, ISBN 81-7808-038-9.




