Visual speech to text conversion applicable to telephone communication

Visual-speech to text conversion applicable to telephone communication for deaf individuals
30TH APRIL 2013

INTRODUCTION
  In the lip-reading technique,
  speech is understood by interpreting the
movements of the lips, face and tongue.
  The correspondence between lip shapes and phonemes is not one-to-one
  It is impossible to distinguish all phonemes using
visual information alone

  The Cued Speech system
  Developed by Cornett
  Contains two components:
the hand shape and the hand position relative to the
face.
  Hand shapes code consonant phonemes
  Hand positions code vowel phonemes.
  Improves speech perception to a large extent

the Cued Speech system

AIM OF NEW SYSTEM
  To investigate the design of a system able to
automatically recognize Cued Speech and convert it
to text.
  Makes it possible for deaf or speech-impaired individuals to
communicate with each other and also with normal-hearing
persons
  Using gestures
  Captured by devices equipped with a camera

METHODS
 Corpus, feature extraction, and
statistical modeling
  The speakers’ lips were painted blue, and color
marks were placed on the speakers’ fingers.
 The data were derived from a video recording of
the cuers pronouncing and coding in Cued
Speech
  Landmarks with different colors were placed on
the fingers

  This allowed a faster and more accurate image
processing stage.
 The audio part of the video recording was
synchronized with the image.
  An automatic image processing method was
applied to the video
  lip width (A),
  lip aperture (B),
  lip area (S),
  pinching of the upper lip (Bsup),
  pinching of the lower lip (Binf).
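
As an illustration only, the sketch below shows one way the geometric lip parameters listed above could be computed once an image processing stage has produced lip contour points. The function name `lip_parameters`, the point format, and the definitions used (width and aperture as horizontal and vertical extents, area by the shoelace formula) are assumptions made for this example, not the procedure used in the study. The pinching parameters are not covered.

```python
def lip_parameters(contour):
    """Return (width, aperture, area) for a closed lip contour.

    contour: list of (x, y) pixel coordinates ordered around the contour.
    All three definitions below are assumptions made for illustration.
    """
    if len(contour) < 3:
        raise ValueError("a contour needs at least three points")
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    width = max(xs) - min(xs)      # assumed definition of A
    aperture = max(ys) - min(ys)   # assumed definition of B
    # Shoelace formula for the area enclosed by the polygon (assumed S).
    twice_area = 0.0
    for (x0, y0), (x1, y1) in zip(contour, contour[1:] + contour[:1]):
        twice_area += x0 * y1 - x1 * y0
    area = abs(twice_area) / 2.0
    return width, aperture, area


# Hypothetical diamond-shaped contour, in pixels.
contour = [(0, 5), (20, 0), (40, 5), (20, 10)]
A, B, S = lip_parameters(contour)
```

Painting the lips blue, as described earlier, is what would make such a contour easy to segment in the first place.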

 Concatenative feature fusion
  Tracks and extracts the xy coordinates
at each time frame,
  uses those values as features in the
HMM modeling.
  uses the concatenation of the
synchronous lip shape and hand features
as the joint feature vector, given by

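\[ O_t^{LH} = \left[ {O_t^{(L)}}^{\top},\ {O_t^{(H)}}^{\top} \right]^{\top} \in \mathbb{R}^{D} \]
where
  $O_t^{LH}$ is the joint lip-hand feature vector,
  $O_t^{(L)}$ is the lip shape feature vector,
  $O_t^{(H)}$ is the hand feature vector,
  $D$ is the dimensionality of the joint feature vector.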
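
The concatenation step described above can be sketched in a few lines. This is a minimal illustration, assuming the lip and hand streams are already synchronized frame by frame; the function name `fuse_features` and all feature values are hypothetical.

```python
def fuse_features(lip_frames, hand_frames):
    """Concatenate synchronous lip and hand feature vectors frame by frame."""
    if len(lip_frames) != len(hand_frames):
        raise ValueError("streams must have the same number of frames")
    return [list(lip) + list(hand) for lip, hand in zip(lip_frames, hand_frames)]


# Hypothetical data: 2 frames, 3 lip features and 2 hand features per frame.
lip_frames = [[31.0, 12.5, 240.0], [30.0, 11.0, 228.0]]
hand_frames = [[102.0, 87.0], [104.0, 85.0]]

joint = fuse_features(lip_frames, hand_frames)
D = len(joint[0])  # dimensionality of the joint vector: 3 + 2 = 5
```

Each joint vector is then treated as a single observation by the HMM, which is what makes this form of fusion simple to apply.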
 Parameters used for lip
shape modeling.

RESULTS
 Isolated word recognition
1. Recognition in normal-hearing subject

2. Recognition in deaf subject

3. Multi-speaker isolated word recognition:
  To investigate whether it is possible to train speaker-independent
HMMs for Cued Speech recognition.
  The training data consisted of 750 words from the
normal-hearing subject and 750 words from the
deaf subject.
  For testing, 700 words from the normal-hearing subject
and 700 words from the deaf subject were used.
 Each state was modeled with a mixture of 4
Gaussian distributions.
 For lip shape and hand shape integration,
concatenative feature fusion was used.
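
To make the state model concrete, the sketch below evaluates the log-likelihood of one feature vector under a mixture of 4 Gaussians, as a single HMM state would. It assumes diagonal covariances, which the slides do not specify, and all weights, means and variances are made-up values; the function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_emission(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture (log-sum-exp for stability)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical 4-component mixture over a 2-dimensional feature vector.
weights = [0.25, 0.25, 0.25, 0.25]
means = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
variances = [[1.0, 1.0]] * 4

score = log_emission([0.5, 0.5], weights, means, variances)
```

In a real system the mixture parameters would be estimated from the training words rather than set by hand.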

4. Continuous phoneme recognition
 Phoneme correct for continuous phoneme word
recognition in the case of a normal-hearing subject.

  Phoneme correct for continuous phoneme word
recognition in the case of a deaf subject.
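
The slides report "phoneme correct" and "accuracy" without defining them. The sketch below assumes the definitions commonly used with HMM toolkits such as HTK: Correct = (N - D - S) / N and Accuracy = (N - D - S - I) / N, where N is the number of reference labels and S, D, I are substitutions, deletions and insertions. The alignment here uses unit edit costs, which is a simplification, and the label sequences are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (edits, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                best = (e, s, d, k)            # match, no edit
            else:
                best = (e + 1, s + 1, d, k)    # substitution
            e, s, d, k = cost[i - 1][j]
            best = min(best, (e + 1, s, d + 1, k))  # deletion
            e, s, d, k = cost[i][j - 1]
            best = min(best, (e + 1, s, d, k + 1))  # insertion
            cost[i][j] = best
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins


def percent_correct(n, subs, dels):
    return 100.0 * (n - subs - dels) / n


def accuracy(n, subs, dels, ins):
    return 100.0 * (n - subs - dels - ins) / n


# Hypothetical reference and recognized label sequences.
ref = ["a", "b", "c", "d", "e"]
hyp = ["a", "x", "c", "d", "d", "e"]
S, D, I = align_counts(ref, hyp)
correct = percent_correct(len(ref), S, D)
acc = accuracy(len(ref), S, D, I)
```

Under these definitions, insertions lower the accuracy but not the correct rate, so the two figures should not be compared directly.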

CONCLUSION
  Hand shapes and lip shapes were integrated
using concatenative feature fusion, and HMM-based
automatic recognition was conducted.
  For continuous phoneme recognition, an 86%
phoneme correct was achieved for the normal-hearing
cuer and an 82.7% phoneme correct for
the deaf cuer.
  Isolated word recognition experiments in both
normal-hearing and deaf subjects were also conducted,
obtaining 94.9% and 89% accuracy, respectively.

  A multi-speaker experiment using data
from both the normal-hearing and the deaf subject
showed an 89.6% word accuracy on
average.
  This result indicates that training speaker-independent
HMMs for Cued Speech using
a large number of subjects should not face
particular difficulties.

REFERENCES
  G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior,
“Recent advances in the automatic recognition of audiovisual
speech,” in Proceedings of the IEEE, vol. 91, issue 9, pp.
1306–1326, 2003.
  S. Nakamura, K. Kumatani, and S. Tamura, “Multi-modal
temporal asynchronicity modeling by product HMMs for
robust audio-visual speech recognition,” in Proceedings of
Fourth IEEE International Conference on Multimodal
Interfaces (ICMI’02), p. 305, 2002.
  R. O. Cornett, “Cued speech,” American Annals of the Deaf,
vol. 112, pp. 3–13, 1967.
  J. Leybaert, “Phonology acquired through the eyes and
spelling in deaf children,” Journal of Experimental Child
Psychology, vol. 75, pp. 291–318, 2000.

ANY QUESTIONS?
