Dolování dat z řeči pro bezpečnostní aplikace
(Data mining from speech for security applications)
Honza Černocký
BUT Speech@FIT, FIT VUT v Brně
Security Session, 11.4.2015
Agenda
• Introduction
• Gender ID example
• Speech recognition
• Language identification
• Speaker recognition
• Conclusions
Needle in a haystack
• Speech is the most important modality of human-to-human
communication (~80% of information) … and criminals and
terrorists communicate by speech too.
• Speech is easy to acquire in the scenarios of interest.
• It is more difficult to find what we are looking for.
• This is typically done by human experts, but always count on:
• Limited personnel
• Limited budget
• Not enough languages spoken
• Insufficient security clearances
Speech processing technologies are not almighty,
but they can help narrow the search space.
Data mining from spontaneous, unprepared speech
Input: audio (speech)
• Speaker/Voice Recognition – Who speaks? → John Doe
• Gender Recognition – What gender? → male or female
• Language Recognition – What language? → English/German/??
• Speech Recognition – What was said? → "Hello John!" ("John" spotted)
• Time/relation analysis – Who asked whom? → John asked Paul
How do we work?
• According to the recipes from pattern-recognition textbooks!
1. Collect data
2. Choose features
3. Choose model
4. Train model
5. Evaluate the classifier
– Happy (or deadline passed)? → deployment
– Unhappy? → go back and iterate
(A priori knowledge of the problem feeds into every step.)
The result
input → Feature extraction → Evaluation of probabilities or likelihoods → "Decoding" → decision
(The models feed the evaluation of probabilities/likelihoods.)
The simplest example … GID
Gender Identification
• Tag speech segments as male or
female.
So how is Gender-ID done?
input → MFCC feature extraction → evaluation of GMM likelihoods
(Gaussian Mixture models – boys, girls) → decision → male/female
Features – Mel Frequency Cepstral Coefficients
• The signal is not stationary, so it is analyzed in short frames.
• And human hearing is not linear in frequency, hence the mel warping (see the formula below).
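A commonly used form of the mel warping (given here for reference, not quoted from the slides) is

$$ m = 2595 \log_{10}\!\left(1 + \frac{f}{700\,\mathrm{Hz}}\right), $$

which compresses high frequencies roughly the way human hearing does.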
Features – one vector every 10 ms (a sketch of such extraction follows below)
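A minimal sketch of 10 ms frame-rate MFCC extraction; librosa, the file name and the parameter values are illustrative assumptions, not the toolkit used by the authors.

```python
import librosa

# Load telephone-band audio (the 8 kHz sampling rate is an assumption).
signal, sr = librosa.load("utterance.wav", sr=8000)

mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_mfcc=13,                   # 13 cepstral coefficients per frame
    n_fft=int(0.025 * sr),       # 25 ms analysis window
    hop_length=int(0.010 * sr),  # one feature vector every 10 ms
)
print(mfcc.shape)                # (13, number_of_10ms_frames)
```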
The evaluation of likelihoods: GMM
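For reference, a Gaussian Mixture Model evaluates the likelihood of a feature vector $\mathbf{x}$ as a weighted sum of Gaussians,

$$ p(\mathbf{x}\mid\lambda) = \sum_{i=1}^{M} w_i\,\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i), \qquad \sum_{i=1}^{M} w_i = 1, $$

and the per-utterance log-likelihood is obtained by summing $\log p(\mathbf{x}_t\mid\lambda)$ over all frames $t$.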
Decision – "decoding"
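A minimal sketch of the whole Gender-ID decision, assuming MFCC matrices of shape (n_frames, 13) are already available; scikit-learn and the 64-component model size are illustrative assumptions, not the production system described in the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gender_id(mfcc_male: np.ndarray, mfcc_female: np.ndarray, n_components: int = 64):
    """Fit one GMM per gender on pooled MFCC frames from labeled training speech."""
    gmm_m = GaussianMixture(n_components, covariance_type="diag").fit(mfcc_male)
    gmm_f = GaussianMixture(n_components, covariance_type="diag").fit(mfcc_female)
    return gmm_m, gmm_f

def gender_id(gmm_m: GaussianMixture, gmm_f: GaussianMixture, mfcc_test: np.ndarray) -> str:
    """Sum per-frame log-likelihoods under each model and pick the larger one."""
    ll_m = gmm_m.score_samples(mfcc_test).sum()
    ll_f = gmm_f.score_samples(mfcc_test).sum()
    return "male" if ll_m > ll_f else "female"
```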
Gender ID summary
Needed data:
•Several hours of speech (from the target channels)
labeled as M or F.
Accuracy:
•The most accurate of our speech data-mining tools:
>96% accuracy on challenging channels
What do we get:
•Limiting the search space by 50%
Speech recognition
• Voice2text (V2T), Speech2text (S2T), transcription …
• Large vocabulary continuous speech recognition
(LVCSR)
speech → Feature extraction → Evaluation of likelihoods (scores of hypotheses) → "Decoding" → text
The acoustic models, the language model and the pronunciation dictionary are combined into the recognition network used by the decoder.
LVCSR technically …
• Acoustic models
• … how speech segments match basic speech units
(phonemes)
• trained on large (>100 h) quantities of carefully transcribed
speech data
• classically Gaussian Mixture models
• Language models
• … how words follow each other:
"President George Bush" vs. "President George push"
• need to be trained on large quantities (gigabytes) of text from
the target domain
• Pronunciation dictionary
• translates words into phonemes: dog → d oh g
• the basis needs to be created by hand, the rest generated using a
trained grapheme-to-phoneme (g2p) converter
• A toolkit to do all this … HTK, Kaldi, or proprietary.
(The three components combine in the decoding rule sketched below.)
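These components fit together in the standard decoding rule (a textbook formulation, not specific to any single system): the decoder searches for

$$ \hat{W} = \underset{W}{\arg\max}\; p(X \mid W)\, P(W), $$

where the acoustic likelihood $p(X \mid W)$ comes from the acoustic models via the pronunciation dictionary and the prior $P(W)$ from the language model; in practice the LM score is scaled and a word-insertion penalty is added.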
Making LVCSR work well
• Neural networks
• DNNs are eating up the other techniques (feature extraction,
scoring, LM)
• Bottle-neck NNs.
• Speaker adaptation
• Dictation systems can ask the speaker to read a text …
• … here, unsupervised adaptation is needed!
• MAP, MLLR, CMLLR, RDLT, SAT …
Challenges in LVCSR
• LVCSR relatively mature in well represented languages (US
English, Modern Standard Arabic, Czech)
• Fast development of recognizers for new languages with
limited resources – IARPA BABEL project
• Limited language packs: 10 h of transcribed + some 70 h of untranscribed
data
• 2013 languages: Cantonese, Turkish, Pashto, Tagalog,
surprise – Vietnamese
• 2014 languages: Bengali, Assamese, Zulu, Haitian Creole, Lao,
surprise – Tamil
• How to re-use resources
from other languages ?
• How to adapt to user’s
language/domain without
seeing his/her data ?
Some examples of raw recognizer output …
and then they have one week to retrain their
keyword results ...
and ...
give you might ask why one we there a lot of
research or evaluation methods ...
the people are trying out what keywords or so it
is important to leave a ...
sufficient amount of time there as well ...
uhuh kade sengifowunelwe nguThami manje ithi
angazi e- ekhuluma nomunye ubhuti wakwamasipala
ukuthi ene usho ukuthi kunabantu ekufanele
baphelelwe ngumsebenzi ngoba uNomvula emecabanga
uzokhokha (()) ngoba yena uzoy ithela uzoyi
uzoyihlulisela ngoba phela kukhona aba- abaphethe
u-Adam angithi
LVCSR – what to expect
Accuracies (word accuracy):
•Dictation: >90%
•Reasonable languages: >70%
•Babel languages: ~70% WER (example on Tamil; see the WER definition below)
Is this OK??
•Usually not usable for direct reading, and it is questionable whether a
trained secretary would not be faster when 100% accurate output is needed.
•Definitely usable for search; for rare languages it is often the only
alternative.
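For reference, word error rate is computed from the substitutions, deletions and insertions in the best alignment of the recognizer output against a reference transcript:

$$ \mathrm{WER} = \frac{S + D + I}{N}, \qquad \text{word accuracy} = 1 - \mathrm{WER}, $$

where $N$ is the number of words in the reference.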
LVCSR – user data
• Speech (for acoustic models):
• Many hours of data as close as possible to the target use
(language, dialect, speaking style …)
• Needs to be transcribed better than in TV subtitles.
• Text (for language models)
• Newspapers and TV news work for dictation but not here.
• Need target text data (including very dirty language)
• Can be simulated by looking for dirty Internet data (Twitter,
discussion forums).
• Pronunciations: generally not a big deal, just needs a list of words;
problematic for languages where the expertise is missing.
• Privacy issues:
• Speech and text are sensitive.
• Re-training of LVCSR by the users so far not successful.
• Work on modularization: collection of statistics by the user,
shipping to development teams…
• Opportunity to collect this data jointly, especially for
languages relevant for security across Europe
Language identification
• Which language is in the recording? (LID)
Standard approaches
• Acoustics – modeling how each language sounds
• Phonotactics – modeling which phoneme sequences each language allows
LID: Current state-of-the-art system
• A large GMM ("Universal Background Model" – UBM)
collects sufficient statistics – a vector of
several thousand parameters per utterance (fixed size!)
• Projection to a "language print" – several hundred
values.
• These language prints are scored and the scores are calibrated
(a sketch of this back-end follows below).
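A minimal sketch of the scoring and calibration stage, assuming the fixed-size "language prints" have already been extracted; the cosine scoring and logistic-regression calibration shown here are common choices, not necessarily the exact back-end used by the authors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_norm(x: np.ndarray) -> np.ndarray:
    """Normalize prints to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def train_backend(prints: np.ndarray, labels: np.ndarray):
    """prints: (n_utterances, dim) language prints; labels: language id per utterance."""
    prints = length_norm(prints)
    languages = np.unique(labels)
    models = length_norm(np.stack([prints[labels == l].mean(axis=0) for l in languages]))
    raw_scores = prints @ models.T                      # cosine score against every language
    calibrator = LogisticRegression(max_iter=1000).fit(raw_scores, labels)
    return models, calibrator

def identify(models: np.ndarray, calibrator: LogisticRegression, test_print: np.ndarray):
    raw = length_norm(test_print)[None, :] @ models.T
    return calibrator.predict_proba(raw)[0]             # calibrated per-language posteriors
```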
LID – what to expect
• Performance on nice data: NIST LRE 2009, 23 languages
[chart: error rates in the 0–10% range for the best systems on 30 s, 10 s and 3 s test segments]
• And on terrible data: RATS 2014, 5 languages
[chart: EER across program phases]
LID – user data
• Tens of hours of data per target language or dialect
• Only the language label is needed; no transcription necessary.
• This allows the user to:
• Improve the model of an existing language.
• Add a new language or dialect, or even a target group.
• LID is a technology where users can modify the system
themselves.
• Language prints do not carry information about the content –
a potential for cooperation.
• Backup solution:
• automatic acquisition of language-specific telephone data from public
sources (EOARD project)
Speaker recognition
Two hypotheses
• H0: the speaker in the test recording IS THE SAME AS THE ONE
SEEN IN ENROLLMENT
• H1: the speaker in the test recording IS DIFFERENT
• The decision is based on the log-likelihood ratio (below)
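In the standard formulation, the verification score is

$$ \Lambda(X) = \log p(X \mid H_0) - \log p(X \mid H_1), $$

and the trial is accepted as a target (same speaker) if $\Lambda(X)$ exceeds a threshold.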
SRE classical scheme
• Feature extraction – Mel Frequency Cepstral
Coefficients
• Background model implemented as a Gaussian
Mixture model
• Adapted to the target speaker.
• At test time, both models produce log-likelihoods
that are subtracted and thresholded.
Such a system
• Can be built by a reasonably skilled student equipped
with Matlab in half a day (a rough sketch follows below)
• Will work reasonably well as long as enrollment and test
take place under similar conditions.
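A minimal sketch of such a classical GMM-UBM verifier (mean-only MAP adaptation), assuming MFCC matrices of shape (n_frames, dim); scikit-learn, the 256-component UBM and the relevance factor r=16 are illustrative choices, not the exact recipe from the talk.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_mfcc: np.ndarray, n_components: int = 256) -> GaussianMixture:
    """Fit the background GMM (UBM) on pooled MFCC frames from many speakers."""
    return GaussianMixture(n_components, covariance_type="diag").fit(background_mfcc)

def enroll(ubm: GaussianMixture, enroll_mfcc: np.ndarray, r: float = 16.0) -> GaussianMixture:
    """MAP-adapt the UBM means towards the target speaker's enrollment frames."""
    gamma = ubm.predict_proba(enroll_mfcc)          # responsibilities, (n_frames, n_components)
    n = gamma.sum(axis=0)                           # soft frame counts per component
    first_moment = gamma.T @ enroll_mfcc / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + r))[:, None]                  # adaptation coefficient per component
    speaker_model = copy.deepcopy(ubm)
    speaker_model.means_ = alpha * first_moment + (1.0 - alpha) * ubm.means_
    return speaker_model

def verify(speaker_model: GaussianMixture, ubm: GaussianMixture,
           test_mfcc: np.ndarray, threshold: float = 0.0):
    """Average per-frame log-likelihood ratio, thresholded to accept/reject."""
    llr = speaker_model.score(test_mfcc) - ubm.score(test_mfcc)
    return llr > threshold, llr
```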
IKR !
Inter-session variability
NOT HAVING THE SAME CONDITIONS !
Intrinsic variability
•Language
•Emotions, stress, Lombard effect
•Health condition
•Content of the message
Extrinsic variability
•Noise
•Transmission channel
•Codec (or series of codecs)
•Recording device …
Years of SRE R&D fighting the variability …
[Block diagram: front-end processing → target model (adapted from the background model) and background model → log-likelihood-ratio score → score normalization]
Compensation techniques by domain:
• Feature domain: noise removal, tone removal, cepstral mean subtraction, RASTA filtering, mean & variance normalization, feature warping, feature mapping, eigenchannel adaptation in the feature domain
• Model domain: Speaker Model Synthesis, eigenchannel compensation, Joint Factor Analysis, Nuisance Attribute Projection
• Score domain: Z-norm, T-norm, ZT-norm
Current state-of-the-art
• Low-dimensional representation of whole recordings
• i-Vectors (for R&D), Voiceprints (for business)
• Allows for very fast scoring (a toy illustration follows below).
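A toy illustration of why fixed-size voice-prints score so fast: comparing one test voice-print against a whole enrolled database is a single matrix-vector product. Cosine scoring is shown here for simplicity; as noted later in the slides, production back-ends typically use PLDA.

```python
import numpy as np

def cosine_scores(enrolled: np.ndarray, test: np.ndarray) -> np.ndarray:
    """enrolled: (n_speakers, dim) voice-prints; test: (dim,) voice-print."""
    enrolled = enrolled / np.linalg.norm(enrolled, axis=1, keepdims=True)
    test = test / np.linalg.norm(test)
    return enrolled @ test                  # one similarity score per enrolled speaker

# Scoring one test print against 100,000 random 400-dimensional prints:
scores = cosine_scores(np.random.randn(100_000, 400), np.random.randn(400))
best = scores.argmax()                      # index of the best-matching enrolled speaker
```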
What to expect I.
• Works very nicely for long telephone recordings (EER
~2%) – multiple successes in NIST evaluations.
• Examples …
What to expect II.
• Noise, varying communication channels and short
recordings (10 s) are still a problem – DARPA RATS
program
• Examples …
SRE – user data
• The performance of the SRE system crucially depends
on how close the training data is to the deployment conditions.
• UBM – needs lots (100s of hours) of unannotated data,
not very sensitive.
• VoicePrint extractor – ditto.
• Scoring done by PLDA
• Voice-prints with speaker labels (A,B,C, …) needed
• Even 50 speakers help to increase the accuracy by 30%.
• … but some users are not able to collect/label even this
amount.
• Work running on unsupervised adaptation on
unannotated data.
The charm of voice-prints
• Allowing for transfer of speaker identities
• without giving out the original WAV
• without the possibility to reconstruct what was said.
(The audio carries the content; the voice-print carries no content.)
• Opening a range of opportunities for
• Cooperation between customers and law enforcement
• Cooperation with R&D teams.
Conclusions
• Speech data mining technologies are already serving
in security and defense (and you can test, and possibly
buy, systems from several vendors)
• International crime asks for an international reaction:
standardization (even in the form of an informal
working draft) should take place ASAP to allow police
forces to exchange voice-prints regardless of vendor.
… we're on it.
Thanks for the invitation to Security Session!
Questions?
BACKUP
SLIDES
Who am I
• MSc. in Radioelectronics from BUT, 1993.
• PhD. in Signal Processing jointly from Universite d'Orsay
(France) and BUT.
• Started with speech coding in 1992 and has stayed in speech processing
ever since.
• Was with the Oregon Graduate Institute (Portland, OR) in the group
of Prof. Hermansky in 2001.
• Since 2002 at the Faculty of Information Technology of BUT,
habilitation to Associate Professor (Doc.) in 2003.
• Executive leader of BUT Speech@FIT research group
• Since 2008 Head of Department of Computer Graphics and
Multimedia
BUT Speech@FIT
• Founded in 1997 (1 person)
• ~20 people in 2013 (faculty, researchers, grad and pre-grad
students, support staff)
• Active in all technologies this presentation is about
• Supported by EU, local and US (DARPA and IARPA) grants
International cooperation and standardization
• NIST evaluation campaigns
• Allowing for objective comparison of technologies
• Often on unrealistically clean data.
• US-funded projects
• Realistic testing on noisy channels (DARPA RATS) and new
languages (IARPA Babel)
• Restricted to participants
• EU projects examples
• Past: MOBIO EU FP7 (mobile biometry) helped develop fast speaker
recognition based on low-dimensional voice-prints.
• SIIP – addressing topic SEC-2013.5.1-2 Audio and voice analysis,
speaker identification for security applications – Integration Project
- starting now.
Standardization – not much …
• UK Home Office Forensic Speech and Audio (FSA) Group – bringing
forensic speech and audio under the regulation of ISO 17025
• ANSI/NIST-ITL Standard 1-2013, Data Format for
Interchange, Record Type-11: Forensic and investigatory voice
record
Editor's Notes

  • #8: Put the speaker-ID picture here and frame the gender column!!!
  • #17: Can do this in more detail later …
  • #21: Question for the audience: where to get such data? Maybe a demo on Czech – swear words, etc.
  • #33: Question for the audience: what is the biggest challenge here? … finding where the speech is in the first place – VAD!
  • #34: It might be problematic to collect even these 50 speakers (if possible on different communication channels…)