Dolování dat z řeči pro bezpečnostní aplikace
(Data mining from speech for security applications)
Honza Černocký
BUT Speech@FIT, FIT VUT v Brně
Security Session, 11.4.2015
Agenda
• Introduction
• Gender ID example
• Speech recognition
• Language identification
• Speaker recognition
• Conclusions
Needle in a haystack
• Speech is the most important modality of human-to-human
communication (~80% of information) … and criminals and
terrorists communicate by speech too.
• Speech is easy to acquire in the scenarios of interest.
• It is more difficult to find what we are looking for.
• This is typically done by human experts, but always count on:
• Limited personnel
• Limited budget
• Not enough languages spoken
• Insufficient security clearances
Speech processing technologies are not almighty,
but they can help narrow the search space.
Data mining from spontaneous, unprepared speech
Input: audio (speech)
• Speaker/Voice Recognition – Who speaks? → John Doe
• Gender Recognition – What gender? → male or female
• Language Recognition – What language? → English/German/??
• Speech Recognition – What was said? → "Hello John!" ("John" spotted)
• Time/relation analysis – Who asked whom? → John asked Paul
How do we work?
• According to the recipes from pattern-recognition textbooks!
1. Collect data
2. Choose features
3. Choose model
4. Train model
5. Evaluate the classifier
– Happy (or deadline passed)? → deployment
– Unhappy? → go back and iterate
(A priori knowledge of the problem feeds into every step.)
The result
input → Feature extraction → Evaluation of probabilities or likelihoods → "Decoding" → decision
(The models feed the evaluation of probabilities/likelihoods.)
The simplest example … GID
Gender Identification
• Tag speech segments as male or
female.
So how is Gender-ID done?
input → MFCC feature extraction → evaluation of GMM likelihoods
(Gaussian Mixture models – boys, girls) → decision → male/female
Features – Mel Frequency Cepstral Coefficients
• The signal is not stationary, so it is analyzed in short frames.
• And human hearing is not linear in frequency, hence the mel warping (see the formula below).
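A commonly used form of the mel warping (given here for reference, not quoted from the slides) is

$$ m = 2595 \log_{10}\!\left(1 + \frac{f}{700\,\mathrm{Hz}}\right), $$

which compresses high frequencies roughly the way human hearing does.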
Features – one vector every 10 ms (a sketch of such extraction follows below)
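A minimal sketch of 10 ms frame-rate MFCC extraction; librosa, the file name and the parameter values are illustrative assumptions, not the toolkit used by the authors.

```python
import librosa

# Load telephone-band audio (the 8 kHz sampling rate is an assumption).
signal, sr = librosa.load("utterance.wav", sr=8000)

mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_mfcc=13,                   # 13 cepstral coefficients per frame
    n_fft=int(0.025 * sr),       # 25 ms analysis window
    hop_length=int(0.010 * sr),  # one feature vector every 10 ms
)
print(mfcc.shape)                # (13, number_of_10ms_frames)
```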
The evaluation of likelihoods: GMM
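For reference, a Gaussian Mixture Model evaluates the likelihood of a feature vector $\mathbf{x}$ as a weighted sum of Gaussians,

$$ p(\mathbf{x}\mid\lambda) = \sum_{i=1}^{M} w_i\,\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i), \qquad \sum_{i=1}^{M} w_i = 1, $$

and the per-utterance log-likelihood is obtained by summing $\log p(\mathbf{x}_t\mid\lambda)$ over all frames $t$.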
Decision – "decoding"
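A minimal sketch of the whole Gender-ID decision, assuming MFCC matrices of shape (n_frames, 13) are already available; scikit-learn and the 64-component model size are illustrative assumptions, not the production system described in the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gender_id(mfcc_male: np.ndarray, mfcc_female: np.ndarray, n_components: int = 64):
    """Fit one GMM per gender on pooled MFCC frames from labeled training speech."""
    gmm_m = GaussianMixture(n_components, covariance_type="diag").fit(mfcc_male)
    gmm_f = GaussianMixture(n_components, covariance_type="diag").fit(mfcc_female)
    return gmm_m, gmm_f

def gender_id(gmm_m: GaussianMixture, gmm_f: GaussianMixture, mfcc_test: np.ndarray) -> str:
    """Sum per-frame log-likelihoods under each model and pick the larger one."""
    ll_m = gmm_m.score_samples(mfcc_test).sum()
    ll_f = gmm_f.score_samples(mfcc_test).sum()
    return "male" if ll_m > ll_f else "female"
```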
Gender ID summary
Needed data:
•Several hours of speech (from the target channels)
labeled as M or F.
Accuracy:
•The most accurate of our speech data-mining tools:
>96% accuracy on challenging channels
What do we get:
•Limiting the search space by 50%
Speech recognition
• Voice2text (V2T), Speech2text (S2T), transcription …
• Large vocabulary continuous speech recognition
(LVCSR)
speech → Feature extraction → Evaluation of likelihoods (scores of hypotheses) → "Decoding" → text
The acoustic models, the language model and the pronunciation dictionary are combined into the recognition network used by the decoder.
LVCSR technically …
• Acoustic models
• … how speech segments match basic speech units
(phonemes)
• trained on large (>100 h) quantities of carefully transcribed
speech data
• classically Gaussian Mixture models
• Language models
• … how words follow each other:
"President George Bush" vs. "President George push"
• need to be trained on large quantities (gigabytes) of text from
the target domain
• Pronunciation dictionary
• translates words into phonemes: dog → d oh g
• the basis needs to be created by hand, the rest generated using a
trained grapheme-to-phoneme (g2p) converter
• A toolkit to do all this … HTK, Kaldi, or proprietary.
(The three components combine in the decoding rule sketched below.)
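These components fit together in the standard decoding rule (a textbook formulation, not specific to any single system): the decoder searches for

$$ \hat{W} = \underset{W}{\arg\max}\; p(X \mid W)\, P(W), $$

where the acoustic likelihood $p(X \mid W)$ comes from the acoustic models via the pronunciation dictionary and the prior $P(W)$ from the language model; in practice the LM score is scaled and a word-insertion penalty is added.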
Making LVCSR work well
• Neural networks
• DNNs are eating up the other techniques (feature extraction,
scoring, LM)
• Bottle-neck NNs.
• Speaker adaptation
• Dictation systems can ask the speaker to read a text …
• … here, unsupervised adaptation is needed!
• MAP, MLLR, CMLLR, RDLT, SAT …
Challenges in LVCSR
• LVCSR relatively mature in well represented languages (US
English, Modern Standard Arabic, Czech)
• Fast development of recognizers for new languages with
limited resources – IARPA BABEL project
• Limited language packs: 10 h of transcribed + some 70 h of untranscribed
data
• 2013 languages: Cantonese, Turkish, Pashto, Tagalog,
surprise – Vietnamese
• 2014 languages: Bengali, Assamese, Zulu, Haitian Creole, Lao,
surprise – Tamil
• How to re-use resources
from other languages ?
• How to adapt to user’s
language/domain without
seeing his/her data ?
Some examples of raw recognizer output …
and then they have one week to retrain their
keyword results ...
and ...
give you might ask why one we there a lot of
research or evaluation methods ...
the people are trying out what keywords or so it
is important to leave a ...
sufficient amount of time there as well ...
uhuh kade sengifowunelwe nguThami manje ithi
angazi e- ekhuluma nomunye ubhuti wakwamasipala
ukuthi ene usho ukuthi kunabantu ekufanele
baphelelwe ngumsebenzi ngoba uNomvula emecabanga
uzokhokha (()) ngoba yena uzoy ithela uzoyi
uzoyihlulisela ngoba phela kukhona aba- abaphethe
u-Adam angithi
LVCSR – what to expect
Accuracies (word accuracy):
•Dictation: >90%
•Reasonable languages: >70%
•Babel languages: ~70% WER (example on Tamil; see the WER definition below)
Is this OK??
•Usually not usable for direct reading, and it is questionable whether a
trained secretary would not be faster when 100% accurate output is needed.
•Definitely usable for search; for rare languages it is often the only
alternative.
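For reference, word error rate is computed from the substitutions, deletions and insertions in the best alignment of the recognizer output against a reference transcript:

$$ \mathrm{WER} = \frac{S + D + I}{N}, \qquad \text{word accuracy} = 1 - \mathrm{WER}, $$

where $N$ is the number of words in the reference.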
LVCSR – user data
• Speech (for acoustic models):
• Many hours of data as close as possible to the target use
(language, dialect, speaking style …)
• Needs to be transcribed better than in TV subtitles.
• Text (for language models)
• Newspapers and TV news work for dictation but not here.
• Need target text data (including very dirty language)
• Can be simulated by looking for dirty Internet data (Twitter,
discussion forums).
• Pronunciations: generally not a big deal, just needs a list of words;
problematic for languages where the expertise is missing.
• Privacy issues:
• Speech and text are sensitive.
• Re-training of LVCSR by the users so far not successful.
• Work on modularization: collection of statistics by the user,
shipping to development teams…
• Opportunity to collect this data jointly, especially for
languages relevant for security across Europe
Language identification
• Which language is in the recording? (LID)
Standard approaches
• Acoustics – modeling how each language sounds
• Phonotactics – modeling which phoneme sequences each language allows
LID: Current state-of-the-art system
• A large GMM ("Universal Background Model" – UBM)
collects sufficient statistics – a vector of
several thousand parameters per utterance (fixed size!)
• Projection to a "language print" – several hundred
values.
• These language prints are scored and the scores are calibrated
(a sketch of this back-end follows below).
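A minimal sketch of the scoring and calibration stage, assuming the fixed-size "language prints" have already been extracted; the cosine scoring and logistic-regression calibration shown here are common choices, not necessarily the exact back-end used by the authors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_norm(x: np.ndarray) -> np.ndarray:
    """Normalize prints to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def train_backend(prints: np.ndarray, labels: np.ndarray):
    """prints: (n_utterances, dim) language prints; labels: language id per utterance."""
    prints = length_norm(prints)
    languages = np.unique(labels)
    models = length_norm(np.stack([prints[labels == l].mean(axis=0) for l in languages]))
    raw_scores = prints @ models.T                      # cosine score against every language
    calibrator = LogisticRegression(max_iter=1000).fit(raw_scores, labels)
    return models, calibrator

def identify(models: np.ndarray, calibrator: LogisticRegression, test_print: np.ndarray):
    raw = length_norm(test_print)[None, :] @ models.T
    return calibrator.predict_proba(raw)[0]             # calibrated per-language posteriors
```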
LID – what to expect
• Performance on nice data: NIST LRE 2009, 23 languages
[chart: error rates in the 0–10% range for the best systems on 30 s, 10 s and 3 s test segments]
• And on terrible data: RATS 2014, 5 languages
[chart: EER across program phases]
LID – user data
• Tens of hours of data per target language or dialect
• Only the language label is needed; no transcription necessary.
• This allows the user to:
• Improve the model of an existing language.
• Add a new language or dialect, or even a target group.
• LID is a technology where users can modify the system
themselves.
• Language prints do not carry information about the content –
a potential for cooperation.
• Backup solution:
• automatic acquisition of language-specific telephone data from public
sources (EOARD project)
Speaker recognition
Two hypotheses
• H0: the speaker in the test recording IS THE SAME AS THE ONE
SEEN IN ENROLLMENT
• H1: the speaker in the test recording IS DIFFERENT
• The decision is based on the log-likelihood ratio (below)
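In the standard formulation, the verification score is

$$ \Lambda(X) = \log p(X \mid H_0) - \log p(X \mid H_1), $$

and the trial is accepted as a target (same speaker) if $\Lambda(X)$ exceeds a threshold.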
SRE classical scheme
• Feature extraction – Mel Frequency Cepstral
Coefficients
• Background model implemented as a Gaussian
Mixture model
• Adapted to the target speaker.
• At test time, both models produce log-likelihoods
that are subtracted and thresholded.
Such a system
• Can be built by a reasonably skilled student equipped
with Matlab in half a day (a rough sketch follows below)
• Will work reasonably well as long as enrollment and test
take place under similar conditions.
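A minimal sketch of such a classical GMM-UBM verifier (mean-only MAP adaptation), assuming MFCC matrices of shape (n_frames, dim); scikit-learn, the 256-component UBM and the relevance factor r=16 are illustrative choices, not the exact recipe from the talk.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_mfcc: np.ndarray, n_components: int = 256) -> GaussianMixture:
    """Fit the background GMM (UBM) on pooled MFCC frames from many speakers."""
    return GaussianMixture(n_components, covariance_type="diag").fit(background_mfcc)

def enroll(ubm: GaussianMixture, enroll_mfcc: np.ndarray, r: float = 16.0) -> GaussianMixture:
    """MAP-adapt the UBM means towards the target speaker's enrollment frames."""
    gamma = ubm.predict_proba(enroll_mfcc)          # responsibilities, (n_frames, n_components)
    n = gamma.sum(axis=0)                           # soft frame counts per component
    first_moment = gamma.T @ enroll_mfcc / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + r))[:, None]                  # adaptation coefficient per component
    speaker_model = copy.deepcopy(ubm)
    speaker_model.means_ = alpha * first_moment + (1.0 - alpha) * ubm.means_
    return speaker_model

def verify(speaker_model: GaussianMixture, ubm: GaussianMixture,
           test_mfcc: np.ndarray, threshold: float = 0.0):
    """Average per-frame log-likelihood ratio, thresholded to accept/reject."""
    llr = speaker_model.score(test_mfcc) - ubm.score(test_mfcc)
    return llr > threshold, llr
```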
IKR !
Inter-session variability
NOT HAVING THE SAME CONDITIONS !
Intrinsic variability
•Language
•Emotions, stress, Lombard effect
•Health condition
•Content of the message
Extrinsic variability
•Noise
•Transmission channel
•Codec (or series of codecs)
•Recording device …
Years of SRE R&D fighting the variability …
[Block diagram: front-end processing → target model (adapted from the background model) and background model → log-likelihood-ratio score → score normalization]
Compensation techniques by domain:
• Feature domain: noise removal, tone removal, cepstral mean subtraction, RASTA filtering, mean & variance normalization, feature warping, feature mapping, eigenchannel adaptation in the feature domain
• Model domain: Speaker Model Synthesis, eigenchannel compensation, Joint Factor Analysis, Nuisance Attribute Projection
• Score domain: Z-norm, T-norm, ZT-norm
Current state-of-the-art
• Low-dimensional representation of whole recordings
• i-Vectors (for R&D), Voiceprints (for business)
• Allows for very fast scoring (a toy illustration follows below).
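A toy illustration of why fixed-size voice-prints score so fast: comparing one test voice-print against a whole enrolled database is a single matrix-vector product. Cosine scoring is shown here for simplicity; as noted later in the slides, production back-ends typically use PLDA.

```python
import numpy as np

def cosine_scores(enrolled: np.ndarray, test: np.ndarray) -> np.ndarray:
    """enrolled: (n_speakers, dim) voice-prints; test: (dim,) voice-print."""
    enrolled = enrolled / np.linalg.norm(enrolled, axis=1, keepdims=True)
    test = test / np.linalg.norm(test)
    return enrolled @ test                  # one similarity score per enrolled speaker

# Scoring one test print against 100,000 random 400-dimensional prints:
scores = cosine_scores(np.random.randn(100_000, 400), np.random.randn(400))
best = scores.argmax()                      # index of the best-matching enrolled speaker
```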
What to expect I.
• Works very nicely for long telephone recordings (EER
~2%) – multiple successes in NIST evaluations.
• Examples …
What to expect II.
• Noise, varying communication channels and short
recordings (10 s) are still a problem – DARPA RATS
program
• Examples …
SRE – user data
• The performance of the SRE system crucially depends
on how close the training data is to the deployment conditions.
• UBM – needs lots (100s of hours) of unannotated data,
not very sensitive.
• VoicePrint extractor – ditto.
• Scoring done by PLDA
• Voice-prints with speaker labels (A,B,C, …) needed
• Even 50 speakers help to increase the accuracy by 30%.
• … but some users are not able to collect/label even this
amount.
• Work running on unsupervised adaptation on
unannotated data.
The charm of voice-prints
• Allowing for transfer of speaker identities
• without giving out the original WAV
• without the possibility to reconstruct what was said.
(The audio carries the content; the voice-print carries no content.)
• Opening a range of opportunities for
• Cooperation between customers and law enforcement
• Cooperation with R&D teams.
Conclusions
• Speech data mining technologies are already serving
in security and defense (and you can test, and possibly
buy, systems from several vendors)
• International crime asks for an international reaction:
standardization (even in the form of an informal
working draft) should take place ASAP to allow police
forces to exchange voice-prints regardless of vendor.
… we're on it.
Thanks for the invitation to Security Session!
Questions?
BACKUP
SLIDES
Who am I
• MSc. in Radioelectronics from BUT, 1993.
• PhD. in Signal Processing jointly from Universite d'Orsay
(France) and BUT.
• Started with speech coding in 1992 and has stayed in speech processing
ever since.
• Was with the Oregon Graduate Institute (Portland, OR) in the group
of Prof. Hermansky in 2001.
• Since 2002 at the Faculty of Information Technology of BUT,
habilitation to Associate Professor (Doc.) in 2003.
• Executive leader of BUT Speech@FIT research group
• Since 2008 Head of Department of Computer Graphics and
Multimedia
BUT Speech@FIT
• Founded in 1997 (1 person)
• ~20 people in 2013 (faculty, researchers, grad and pre-grad
students, support staff)
• Active in all technologies this presentation is about
• Supported by EU, local and US (DARPA and IARPA) grants
International cooperation and standardization
• NIST evaluation campaigns
• Allowing for objective comparison of technologies
• Often on unrealistically clean data.
• US-funded projects
• Realistic testing on noisy channels (DARPA RATS) and new
languages (IARPA Babel)
• Restricted to participants
• EU projects examples
• Past: MOBIO EU FP7 (mobile biometry) helped develop fast speaker
recognition based on low-dimensional voice-prints.
• SIIP – addressing topic SEC-2013.5.1-2 Audio and voice analysis,
speaker identification for security applications – Integration Project
- starting now.
Standardization – not much …
• UK Home Office Forensic Speech and Audio (FSA) Group – bringing
forensic speech and audio under the regulation of ISO 17025
• ANSI/NIST-ITL Standard 1-2013, Data Format for
Interchange, Record Type-11: Forensic and investigatory voice
record
Editor's Notes

  • #8: Put the speaker-ID picture here and frame the gender column!!!
  • #17: Can do this in more detail later …
  • #21: Question for the audience: where to get such data? Maybe a demo on Czech – swear words, etc.
  • #33: Question for the audience: what is the biggest challenge here? … finding where the speech is in the first place – VAD!
  • #34: It might be problematic to collect even these 50 speakers (if possible on different communication channels…)