Part1 speech basics

Unit 6 Speech Signal
DR MINAKSHI PRADEEP ATRE
PVG’S COET & GKPIM PUNE

References
Book: Speech and Audio Processing, Dr. Shaila Apte
PDF document: http://cs.haifa.ac.il/~nimrod/Compression/Speech/S1Basics2010.pdf
Speech samples: https://www.signalogic.com/index.pl?page=speech_codec_wav_samples

Contents
Speech:
1. Basics of speech signal and its features
2. LTI representation of speech signal
3. LTV representation of speech signal
4. Estimation of fundamental frequency
5. Identification of voiced and unvoiced speech
6. Noise removal

Speech
Speech is a naturally occurring signal and is therefore treated as random
It is necessary to understand the generalized human speech production mechanism
A simple linear time-invariant (LTI) model describes speech production
Speech is inherently time-varying
This motivates the linear time-varying (LTV) model of speech
Speech types: consonants, fricatives
Voiced and unvoiced (V/UV) speech

Speech Production Mechanism: Pipelines Model
[Figure: vocal tract]

Vocal Tract
The vocal tract is the cavity between the vocal cords and the
lips. It acts as a resonator that spectrally shapes the
periodic input, much like the cavity of a musical wind
instrument.
A simple model of a steady-state vowel regards the vocal
tract as a linear time-invariant (LTI) filter with a periodic
impulse-like input.

What is a Speech Signal?
Created at the vocal cords, travels through the vocal tract, and
is produced at the speaker's mouth
Reaches the listener's ear as a pressure wave
Non-stationary, but can be divided into sound segments that share
common acoustic properties over a short time interval
Two major classes of phonemes: vowels and consonants

Phonemes
The basic sounds of a language (e.g. "a" in the word "father") are
called phonemes
A typical speech utterance consists of a string of vowel and
consonant phonemes whose temporal and spectral characteristics
change with time
In addition, the time-varying source and system can interact
nonlinearly in complex ways: the simple model is adequate for
a steady vowel, but the sounds of speech are not always well
represented by linear time-invariant systems

Vowel Production
In vowel production, air is forced from the lungs by contraction of
the muscles around the lung cavity
Air flows through the vocal cords, which are two masses of flesh,
causing periodic vibration of the cords whose rate gives the pitch of
the sound
Resulting periodic puffs of air act as an excitation input, or source,
to the vocal tract

Speech Production
A sound source excites a (vocal tract) filter
◦ Voiced: periodic source, created by the vocal cords
◦ Unvoiced: aperiodic, noisy source
Pitch is the fundamental frequency of vocal cord vibration (also called F0); above it lie 4-5
formants (F1 - F5) at higher frequencies
For a uniform-tube model of the vocal tract (about 17 cm long, closed at the glottis and open
at the lips), natural frequencies occur at odd multiples of about 500 Hz: 500, 1500, 2500 Hz, ...
These resonant frequencies are called formants.
Typical formant frequencies for selected vowels in Hz (first three formants, F1-F3):
Vowel   Adult Male            Adult Female
        F1    F2    F3        F1    F2    F3
(i)     255   2330  3000      340   2610  3210
(u)     290   940   2180      390   995   2585
(ae)    735   1625  2465      950   1955  2900
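The "odd multiples of 500 Hz" rule can be checked with a short calculation. The sketch below assumes a lossless uniform tube, closed at the glottis and open at the lips, with resonances F_k = (2k - 1) c / (4 L). The tube length (0.17 m), the speed of sound (340 m/s) and the function name are illustrative assumptions, not values taken from the slides.

```python
def uniform_tube_formants(length_m=0.17, speed_of_sound=340.0, count=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Assumed model: F_k = (2k - 1) * c / (4 * L), for k = 1, 2, 3, ...
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, count + 1)]


formants = uniform_tube_formants()
print([round(f) for f in formants])
```

Real vowels deviate from these values because the vocal tract is not uniform, which is why the measured formants in the table differ from vowel to vowel.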

LTI Model for speech production
[Block diagram: Impulse Train Generator (Glottis) or Random Signal Generator -> Impulse Response of Vocal Tract -> Generated Speech; the periodic (impulse train) source is selected]
Impulse train generator is used as the excitation signal when a voiced segment is
produced (vowel, e.g. “a”)
Basic assumption: the source of excitation and the vocal tract system are independent

LTI Model for speech production
[Block diagram: Impulse Train Generator (Glottis) or Random Signal Generator -> Impulse Response of Vocal Tract -> Generated Speech; the random source is selected]
Random signal generator is used as the excitation signal when an unvoiced segment is
produced (consonants, e.g. “s”)
The LTI model is used for a short segment of speech (about 10 ms), over which the
parameters of the vocal tract can be assumed to remain constant
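The LTI source-filter idea can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the vocal tract is reduced to a single two-pole resonator, and the sampling rate (8 kHz), pitch period (40 samples), resonance (735 Hz) and bandwidth (100 Hz) are all assumed values chosen for the example.

```python
import math
import random


def impulse_train(n_samples, period):
    """Voiced excitation: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced excitation: zero-mean uniform noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole LTI filter: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)          # pole radius
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y


FS = 8000          # assumed sampling rate in Hz
N = 80             # 10 ms segment at 8 kHz
voiced = resonator(impulse_train(N, period=40), 735.0, 100.0, FS)
unvoiced = resonator(white_noise(N), 735.0, 100.0, FS)
print(len(voiced), len(unvoiced))
```

The filter coefficients stay fixed for the whole segment, which is exactly the assumption the LTI model makes over a short window.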

Nature of Speech Signal
Speech is generated by components such as the vocal cords and the vocal tract
A speech signal cannot arise on its own, without these components
Speech is a random signal
Speech can have infinitely many features (recall the story of the blind people touching an
elephant, each describing what the elephant looks like from the part they touch)
So it is a complex problem
Uttering different words is possible because humans can change the resonant modes of
the vocal cavity and can also stretch the vocal cords to some extent, modifying the pitch
period for different vowels
This is why we need the linear time-varying (LTV) model

Linear Time-varying Model: Speech production
[Block diagram: Impulse Train Generator or Random Signal Generator -> Amplitude -> Impulse Response of Vocal Tract -> Generated Speech]
Pitch period is variable
Impulse response is variable
Speech Sound Categories
Periodic (Sonorants, Voiced)
Noisy (Fricatives, Unvoiced)
Impulsive (Plosive)
Example:
In the word “shop,” the “sh,” “o,” and “p” are generated from a
noisy, periodic, and impulsive source, respectively

Frequency Range
Speech:
Pitch frequency:
◦ male ~ 85-155 Hz;
◦ female ~ 165-255 Hz;
Singer’s vocal range, from bass to soprano: 80 Hz-1100 Hz

Pitch
Pitch period: The time duration of one glottal cycle
Pitch (fundamental frequency): The reciprocal of the pitch period.
Remember: pitch is calculated
only for voiced segments
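As a small worked example of the reciprocal relation, the sketch below converts a pitch frequency to a pitch period in milliseconds and in samples. The function names and the 8 kHz sampling rate are assumptions made for the illustration.

```python
def pitch_period_ms(f0_hz):
    """Pitch period in milliseconds: T0 = 1 / F0."""
    return 1000.0 / f0_hz


def pitch_period_samples(f0_hz, fs):
    """Pitch period rounded to the nearest whole sample at sampling rate fs."""
    return round(fs / f0_hz)


FS = 8000  # assumed sampling rate in Hz
for f0 in (85.0, 100.0, 255.0):
    print(f0, pitch_period_ms(f0), pitch_period_samples(f0, FS))
```

At 8 kHz, a 100 Hz pitch corresponds to a period of 10 ms, or 80 samples.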

Pitch Detection
The pitch period and V/UV decisions are fundamental to many speech coders
Common methods for the calculation:
◦ Autocorrelation function
◦ Zero-crossing rate (ZCR)
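Both methods can be sketched on synthetic frames. This is a hedged illustration, not a production pitch tracker: the autocorrelation is normalized by the energy of the two overlapping segments (a common variant that avoids bias on short frames), the search range of 80-400 Hz and the ZCR threshold of 0.25 are assumed values, and the test signals are a pure 100 Hz sine and seeded noise rather than real speech.

```python
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)


def autocorr_pitch(frame, fs, f0_min=80.0, f0_max=400.0):
    """Estimate F0 (Hz) from the lag with the largest normalized autocorrelation."""
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    best_lag, best_val = lag_min, -math.inf
    for lag in range(lag_min, lag_max + 1):
        head = frame[:len(frame) - lag]
        tail = frame[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        val = num / den if den > 0.0 else 0.0
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag


FS = 8000                                   # assumed sampling rate in Hz
voiced_frame = [math.sin(2.0 * math.pi * 100.0 * n / FS) for n in range(240)]
rng = random.Random(0)
noise_frame = [rng.uniform(-1.0, 1.0) for _ in range(240)]

ZCR_THRESHOLD = 0.25                        # hypothetical V/UV threshold
zcr_voiced = zero_crossing_rate(voiced_frame)
zcr_noise = zero_crossing_rate(noise_frame)
f0 = autocorr_pitch(voiced_frame, FS)
print(f0, zcr_voiced < ZCR_THRESHOLD, zcr_noise < ZCR_THRESHOLD)
```

The low zero-crossing rate marks the periodic frame as voiced, and only then is the autocorrelation peak interpreted as a pitch period.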

Features or categorization of speech sound
Speech sounds are studied and classified from the following
perspectives:
1) The nature of the source: periodic, noisy, or impulsive, and
combinations of the three
2) The shape of the vocal tract
3) The time-domain waveform, which gives the pressure change with
time at the lips
4) The time-varying spectral characteristics revealed through the
spectrogram

Spectrogram
Time-varying spectral characteristics of the speech signal can be displayed graphically
as a two-dimensional pattern
Vertical axis: frequency; horizontal axis: time
The pseudo-color of the pattern is proportional to signal energy (red: high energy)
The resonance frequencies of the vocal tract show up as “energy bands”
Voiced intervals are characterized by a striated appearance (due to the periodicity of the
signal)
Unvoiced intervals are more solidly filled in
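A spectrogram is a sequence of short-time spectra, one per frame. The sketch below computes one with a plain DFT so that it needs no external libraries; the frame length (64 samples), hop (32 samples), Hamming window and 1 kHz test tone are assumptions chosen to keep the example small, and a real analysis would normally use an FFT routine.

```python
import cmath
import math


def spectrogram(x, frame_len=64, hop=32):
    """Return a list of frames; each frame lists energies for bins 0..frame_len//2."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]              # Hamming window
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        energies = []
        for k in range(frame_len // 2 + 1):
            s = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))        # DFT bin k
            energies.append(abs(s) ** 2)
        frames.append(energies)
    return frames


FS = 8000                                             # assumed sampling rate in Hz
tone = [math.sin(2.0 * math.pi * 1000.0 * n / FS) for n in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), peak_bin * FS / 64)
```

Each inner list is one vertical slice of the picture (frequency axis) and the outer list runs along the time axis; plotting the energies as colors gives the familiar spectrogram image.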

Most common Manner of articulation
Plosive, or oral stop, where there is complete occlusion (blockage) of both the oral and nasal
cavities of the vocal tract, and therefore no air flow. Examples include English /p t k/ (voiceless)
and /b d g/ (voiced)
Nasal stop, where there is complete occlusion of the oral cavity, and the air passes instead
through the nose. The shape and position of the tongue determine the resonant cavity that
gives different nasal stops their characteristic sounds. Examples include English /m, n/
Fricative, sometimes called spirant, where there is continuous frication (turbulent and noisy
airflow) at the place of articulation. Examples include English /f, s/ (voiceless), /v, z/ (voiced), etc

Most common Manner of articulation
Sibilants are a type of fricative where the airflow is guided by a groove in the tongue toward the
teeth, creating a high-pitched and very distinctive sound. These are by far the most common
fricatives. English sibilants include /s/ and /z/.
Affricate, which begins like a plosive, but this releases into a fricative rather than having a
separate release of its own. The English letters "ch" and "j" represent affricates
Trill, in which the articulator (usually the tip of the tongue) is held in place, and the airstream
causes it to vibrate. The double "r" of Spanish "perro" is a trill.
Approximant, where there is very little obstruction. Examples include English /w/ and /r/. Lateral
approximants, usually shortened to lateral, are a type of approximant pronounced with the side
of the tongue. English /l/ is a lateral.
