Wreck a nice beach: adventures in speech recognition

Stephen Marquard
Centre for Educational Technology, University of Cape Town
stephen.marquard@uct.ac.za
Department of Computer Science Seminar, April 2011

Overview
- Project goals
- Speech recognition
- Acoustic modelling
- Language modelling
- Integration into a lecture capture system

Project goals
- Integrate speech recognition into a lecture capture system:
  - Opencast Matterhorn
  - CMU Sphinx ASR engine
- Generate automatic transcripts of recorded lectures
- Allow users to correct and improve the transcripts (crowdsourcing)
- Use feedback to improve recognition accuracy (of the same, similar or subsequent recordings)
- Experiment and implement at UCT

Why is it important?
- Video and audio are more useful if you can:
  - Navigate them easily
  - Locate relevant recordings from a large set
- Use by students:
  - Catch up on missed lectures (continuous play or read the transcript)
  - Revision: jump to a particular point or find the lectures which cover topic X
- On the public web:
  - Discoverability (search indexing)

Easy or hard?
- Easiest: small, fixed vocabulary, prescriptive grammar, discrete words, known audio conditions (command-and-control systems)
- Dictation applications in a specific domain, e.g. Dragon NaturallySpeaking
- Hardest: speaker-independent, large vocabulary continuous speech recognition, adverse or unknown audio conditions

Why is it hard?
- People have huge amounts of prior experience and a rich (complex) understanding of context
- Modelling of context in ASR engines is currently very limited
- Even people misrecognize speech (e.g. new / foreign accents, specialized terminology, background noise)

Speech recognition
- Wreck a nice beach … you sing calm incense
- Reckon eyes peach
- Recognize speech … using common sense

Early history
- First known device 1952 (digits)
- Image: IBM Shoebox, 1961
http://www-03.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html

Linguistics vs statistics
Early approaches tried to recognize individual phonemes (phonetic units) and hence the words they formed, but not very successfully.

Airplanes don’t flap their wings
“Every time I fire a linguist, my system improves”
Fred Jelinek, 1985/1988

Speech recognition pipeline
- Audio (signal processing, extract features)
- Acoustic model (features to phonemes)
- Pronunciation dictionary (lexicon)
- Language model (likelihood of words)
- Confusion lattice (possible options)
- Results > confidence score

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/lecture-notes/lecture1.pdf

Hidden Markov Models
HMMs model transition probabilities:
Alice talks to Bob three days in a row and discovers that on the first day he went for a walk, on the second day he went shopping, and on the third day he cleaned his apartment.
Alice has a question: what is the most likely sequence of rainy/sunny days that would explain these observations?
http://en.wikipedia.org/wiki/Viterbi_algorithm
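Alice's question can be answered with the Viterbi algorithm. The sketch below is a minimal illustration, not code from the talk: the start, transition and emission probabilities are assumed illustrative values (the slide gives none), and the function and variable names are hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               state at time t-1 on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return prob, path


# Assumed illustrative model: none of these numbers come from the slides.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

probability, path = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
print(path, probability)
```

Under these assumed probabilities the most likely explanation of walk, shop, clean is Sunny, Rainy, Rainy. In a speech recognizer the hidden states would be phoneme (sub-)states and the observations would be acoustic feature vectors rather than Bob's activities.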

Training in action
“training 3 (decision) trees to depth 20 from 1 million images takes about a day on a 1000 core cluster”
http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf

Characteristics of the field
“the standard approach in our field [is] state-of-the-art system A is gently perturbed to create system B, resulting in a relative decrease in error rate of from 1 to 10%”
Bourlard, Hermansky and Morgan. Towards increasing speech recognition error rates, 1996.
- Algorithmic, drawing on many disciplines (especially signal processing, statistics, linguistics, natural language processing)
- Empirical: lots of different algorithms and optimizations
- Almost no theory to describe why particular approaches work better than others, or how to find optimal solutions
- Massive infrastructure is a big advantage: large and varied data sets, significant computing resources.

Audio issues
- Bandwidth
- Recording noise
- Ambient noise
- Reverberation
- Microphones
- Microphone arrays

Acoustic models
- Generated from a corpus of recorded, transcribed audio
- Both artificial and natural corpora (TIMIT, Broadcast News, Meetings)
- Audio needs to match the application
- Audio bandwidth = ½ sampling rate
  - Phone speech (sampled at 8 kHz, bandwidth 4 kHz)
  - Microphone speech (sampled at 16 kHz, bandwidth 8 kHz, typical analysis on 130 Hz – 6800 Hz)
- There is a South African corpus of phone speech
- But no South African corpus of microphone speech

The TIMIT audio corpus

Sentence:
    0 47719 She had your dark suit in greasy wash water all year

Word alignment:
    2214 4428 she
    4428 8316 had
    7308 9691 your
    9691 15331 dark
    15331 19634 suit
    20929 22453 in
    22453 27697 greasy
    27697 32326 wash
    33120 36575 water
    37597 39644 all
    39644 43982 year

Phoneme alignment:
    0 2214 h#
    2214 3744 sh
    3744 4428 ax-h
    4428 5229 hv
    5229 6927 ae
    6927 7308 dcl
    7308 8316 jh
    8316 9691 axr
    9691 11697 dcl
    11697 12114 d
    12114 13075 aa
    …

- Word and phoneme alignment by timecode.
- 630 speakers from 8 US dialect regions, speaking 10 sentences each.
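The alignment lines above have the form start, end, label. A minimal sketch of reading them follows, assuming the two numbers are sample offsets and that the audio is sampled at 16 kHz (the slide does not state the rate); `parse_alignment` is a hypothetical helper name.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: TIMIT audio sampled at 16 kHz

# First few word-alignment lines from the slide.
WORD_ALIGNMENT = """\
2214 4428 she
4428 8316 had
7308 9691 your
9691 15331 dark
15331 19634 suit
"""


def parse_alignment(text, sample_rate=SAMPLE_RATE_HZ):
    """Convert 'start end label' lines into (start_sec, end_sec, label) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end, label = line.split(maxsplit=2)
        segments.append((int(start) / sample_rate, int(end) / sample_rate, label))
    return segments


segments = parse_alignment(WORD_ALIGNMENT)
for start, end, label in segments:
    print(f"{start:6.3f}s to {end:6.3f}s  {label}")
```

The same function works for the phoneme lines, since they share the format.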

Dialect regions
The Nationwide Speech Project: A new corpus of American English dialects
http://web.mit.edu/~nancyc/Public/Papers/Clopper_Pisoni_06_SC.pdf

Crowdsourcing the creation of a GPL speech corpus and open source acoustic models (Sphinx, ISIP, Julius, HTK). An important effort, but still small (84 hours as of Dec 2010).
www.voxforge.org

Language modelling
- Pronunciation dictionary (lexicon):
    TOMATO T AH0 M EY1 T OW2
    TOMATO(1) T AH0 M AA1 T OW2
- Language model: a statistical sequence model of words. Trigram models (3 words) are common:
    -2.0998 YORK MONEY FUND
    -0.0798 YORK HEDGE FUND
    -0.1392 YORK MUTUAL FUND
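Both kinds of entry above are easy to read programmatically. The sketch below assumes the lexicon uses the CMU dictionary convention (a `(1)` suffix marks an alternative pronunciation) and that the trigram scores are base-10 log probabilities, as in ARPA-format language models; the helper names are hypothetical.

```python
LEXICON = """\
TOMATO T AH0 M EY1 T OW2
TOMATO(1) T AH0 M AA1 T OW2
"""

TRIGRAMS = """\
-2.0998 YORK MONEY FUND
-0.0798 YORK HEDGE FUND
-0.1392 YORK MUTUAL FUND
"""


def parse_lexicon(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        entry, *phones = line.split()
        word = entry.split("(")[0]  # strip the "(1)" variant marker
        lexicon.setdefault(word, []).append(phones)
    return lexicon


def parse_trigrams(text):
    """Map (w1, w2, w3) to P(w3 | w1 w2), assuming log10 scores."""
    probs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        logprob, w1, w2, w3 = line.split()
        probs[(w1, w2, w3)] = 10 ** float(logprob)
    return probs


lexicon = parse_lexicon(LEXICON)
trigram_probs = parse_trigrams(TRIGRAMS)
print(lexicon["TOMATO"])
for words, p in trigram_probs.items():
    print(" ".join(words), round(p, 4))
```

Read this way, FUND is far more likely after YORK HEDGE than after YORK MONEY.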

Statistical sequence models
Truly Madly _____
- Widely used
- Applications:
  - Auto-suggest
  - Spell-checkers
  - Lossless compression
  - Machine translation
  - Language models for speech recognition
- Probability of token w in the context of preceding tokens c, e.g. P(deeply | “truly madly”)
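The probability of a token given its context can be estimated by counting. The sketch below uses a plain maximum-likelihood trigram estimate on a made-up toy corpus; real language models add smoothing and back-off, which are omitted here, and all names are hypothetical.

```python
from collections import Counter


def train_trigram_counts(tokens):
    """Count two-word contexts and the trigrams that extend them."""
    context_counts = Counter()
    trigram_counts = Counter()
    for i in range(len(tokens) - 2):
        context = (tokens[i], tokens[i + 1])
        context_counts[context] += 1
        trigram_counts[context + (tokens[i + 2],)] += 1
    return context_counts, trigram_counts


def probability(word, context, context_counts, trigram_counts):
    """Maximum-likelihood estimate of P(word | context); 0.0 if context unseen."""
    seen = context_counts[context]
    if seen == 0:
        return 0.0
    return trigram_counts[context + (word,)] / seen


# Made-up toy corpus, for illustration only.
corpus = "truly madly deeply . truly madly deeply . truly madly sadly .".split()
context_counts, trigram_counts = train_trigram_counts(corpus)

p_deeply = probability("deeply", ("truly", "madly"), context_counts, trigram_counts)
p_sadly = probability("sadly", ("truly", "madly"), context_counts, trigram_counts)
print(p_deeply, p_sadly)
```

Here "truly madly" occurs three times and is followed by "deeply" twice, so the estimate for "deeply" is 2/3.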

Context is king
- Micro-context (e.g. bi- and trigrams)
  - United Kingdom
  - United Airlines
  - United Arab Emirates
- Long-range context
  - “Cricket and rugby are amongst the most popular sports in the United _________” (example from The Sequence Memoizer, Wood et al, 2011).

Characteristics of language
- Power law frequency / rank distribution.
- Zipf’s law: “given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table”
  http://en.wikipedia.org/wiki/Zipf's_law
- More frequent words also tend to be shorter.
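A rank/frequency table makes the law concrete. The token counts below are synthetic, constructed to follow Zipf's law exactly so that rank times frequency is constant; real corpora only approximate this. The function name is hypothetical.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) rows, most frequent word first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


# Synthetic counts chosen so that frequency = 12 / rank.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["speech"] * 3

table = rank_frequency(tokens)
for rank, word, freq in table:
    print(rank, word, freq, rank * freq)
```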

How to get large language data sets
- Linguistic Data Consortium (by subscription, restricted)
- Some other more specialized corpora
- Microsoft (free, restricted)
- Google (Creative Commons license)
- Wikipedia (CC / GFDL license)

Using Wikipedia as a language resource
- Download a snapshot (6 GB compressed)
- Convert from XML and markup to plain text
- Create dictionaries of target size (by word frequency)
- Create language models of target size
- Approximately equal in size to English Gigaword Corpus
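The dictionary step can be sketched as keeping the most frequent words up to a target size. This is an illustration only, not the pipeline used in the project: the tokenizer is a deliberately crude regular expression, the three input lines stand in for plain text already extracted from Wikipedia, and `build_vocabulary` is a hypothetical name.

```python
import re
from collections import Counter


def build_vocabulary(lines, target_size):
    """Return the target_size most frequent words (ties broken alphabetically)."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z']+", line.lower()))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:target_size]]


# Stand-in for plain text extracted from a Wikipedia snapshot.
plain_text = [
    "Speech recognition converts speech to text.",
    "Recognition of speech is hard.",
    "Text is easy to search.",
]

vocabulary = build_vocabulary(plain_text, target_size=3)
print(vocabulary)
```

On a full snapshot the same counting would be done in a streaming fashion, since the text does not fit comfortably in memory.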

Grid computing for language modelling
- For when you need lots of RAM and/or lots of CPU
- www.sagrid.ac.za
- ICTS at UCT: Tim Carr, Andrew Lewis

Accounting for context: LM adaptation
- Adapt a language model to more closely resemble the target speech
- Using related text for:
  - Topic modelling (vocabulary, concepts)
  - Style-of-speech modelling
“ok and um it's quite useful to have a very good diagnostic test of of acute hepatitis um you know to prevent kind of unnecessary um surgery um so hepatitis is really one um example of a cause of acute abdominal pain that doesn't need surgery”

What’s special about lectures?
Possibly helpful assumptions:
- Coherent topic(s) within a course
- One lecturer presents many lectures
- Specialized vocabulary
- Spoken language differs from written language

Using Wikipedia for LM adaptation
- Goal is to adapt a “standard” LM to be specific to the topic of the audio
- Start somewhere: title, keywords, text from slides
- Select a set of documents, adapt the LM
- Using Wikipedia, select by similarity: identify the set of documents most closely related to the starting point or keywords

Vector space modelling
- Represents documents as n-dimensional vectors (n terms)
- Document similarity established by comparing vectors, producing a similarity score.
- Gensim VSM toolkit: independent of corpus size (so good for Wikipedia)
- LSI, LDA, TF-IDF measures.
- Create a “similarity crawler” to build a corpus of documents related to the topic
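The idea of comparing document vectors can be shown without any toolkit. The sketch below is a plain-Python TF-IDF and cosine similarity illustration, not the Gensim API and not the project's crawler; the three documents are made up, with the first standing in for seed text such as slide keywords.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up documents: a seed text and two candidate articles.
docs = [
    "hepatitis causes acute abdominal pain",
    "acute hepatitis diagnosis and liver disease",
    "cricket and rugby are popular sports",
]

vectors = tfidf_vectors(docs)
similarities = [cosine(vectors[0], v) for v in vectors]
print(similarities)
```

The medical article scores higher against the seed than the sports article, which shares no terms with it. A similarity crawler would keep the high-scoring documents and use them as new starting points.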

Metrics
- Perplexity (average number of guesses required)
- Word Error Rate (edit distance: insertions, deletions, substitutions)
- Information Retrieval: precision and recall
- What’s sufficient? Need to close an accuracy gap of
- Munteanu research: %WER for a transcript
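Word Error Rate can be computed as the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words. A minimal sketch follows; the function name is hypothetical and the example sentences reuse the talk's title pun.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "recognize speech using common sense",
    "wreck a nice beach using common sense",
)
print(f"WER = {wer:.0%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.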

What is lecture capture?
- Largely automated: Recording

Output
- Recreates the lecture experience by recording:
  - audio
  - screen output (VGA)
www.opencastproject.org

Licensing constraints
- Opencast Matterhorn is licensed under the ECL open source license (similar to the Apache 2.0 license)
- Allows closed commercial derivatives
- Therefore cannot use software or datasets which are licensed for non-commercial or research-only use.
- Can use Apache, BSD, LGPL, maybe GPL code and data.

Speech recognition software ecosystem
Licensing and patents
- Closed
- Proprietary
- FOSS
- Open
