Wreck a nice beach: adventures in speech recognition
Stephen Marquard
Centre for Educational Technology, University of Cape Town
stephen.marquard@uct.ac.za
Department of Computer Science Seminar, April 2011

Overview
• Project goals
• Speech recognition
• Acoustic modelling
• Language modelling
• Integration into a lecture capture system

Project goals
• Integrate speech recognition into a lecture capture system:
  • Opencast Matterhorn
  • CMU Sphinx ASR engine
• Generate automatic transcripts of recorded lectures
• Allow users to correct and improve the transcripts (crowdsourcing)
• Use feedback to improve recognition accuracy (of the same, similar or subsequent recordings)
• Experiment and implement at UCT

Why is it important?
• Video and audio are more useful if you can:
  • Navigate them easily
  • Locate relevant recordings from a large set
• Use by students:
  • Catch up on missed lectures (continuous play or read the transcript)
  • Revision: jump to a particular point or find the lectures which cover topic X
• On the public web:
  • Discoverability (search indexing)

Easy or hard?
• Easiest: small, fixed vocabulary, prescriptive grammar, discrete words, known audio conditions (command-and-control systems)
• Dictation applications in a specific domain, e.g. Dragon NaturallySpeaking
• Hardest: speaker-independent, large vocabulary continuous speech recognition, adverse or unknown audio conditions

Why is it hard?
• People have huge amounts of prior experience and a rich (complex) understanding of context
• Modelling of context in ASR engines is currently very limited
• Even people misrecognize speech (e.g. new / foreign accents, specialized terminology, background noise)

Speech recognition
Wreck a nice beach … you sing calm incense
Reckon eyes peach
Recognize speech … using common sense

Early history
• First known device 1952 (digits)
• Above: IBM Shoebox, 1961
http://www-03.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html

Linguistics vs statistics
• Early approaches tried to recognize individual phonemes (phonetic units) and hence the words they formed.
• But not very successfully.

Airplanes don’t flap their wings
“Every time I fire a linguist, my system improves”
Fred Jelinek, 1985/1988

Speech recognition pipeline
• Audio (signal processing, extract features)
• Acoustic model (features to phonemes)
• Pronunciation dictionary (lexicon)
• Language model (likelihood of words)
• Confusion lattice (possible options)
• Results > confidence score

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/lecture-notes/lecture1.pdf

Hidden Markov Models
• HMMs model transition probabilities:
• Alice talks to Bob three days in a row and discovers that on the first day he went for a walk, on the second day he went shopping, and on the third day he cleaned his apartment.
• Alice has a question: what is the most likely sequence of rainy/sunny days that would explain these observations?
http://en.wikipedia.org/wiki/Viterbi_algorithm

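Alice's question can be answered with the Viterbi algorithm. The sketch below is a minimal illustration, not code from the talk: the start, transition and emission probabilities are assumed values chosen for the example, and the function name `viterbi` is hypothetical.

```python
# Viterbi decoding for the rainy/sunny example.
# All probabilities are illustrative assumptions, not values from the talk.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] holds the best (probability, path) ending in state s so far.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that maximises the probability of reaching s.
            prob, path = max(
                (best[prev][0] * trans_p[prev][s], best[prev][1])
                for prev in states
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values())


prob, path = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
print(path, prob)
```

With these assumed numbers the most likely explanation is a sunny first day followed by two rainy days. A speech recognizer applies the same idea with phonemes or words as hidden states and acoustic features as observations.
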
Training in action
“training 3 (decision) trees to depth 20 from 1 million images takes about a day on a 1000 core cluster”
http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf

Characteristics of the field
“the standard approach in our field [is] state-of-the-art system A is gently perturbed to create system B, resulting in a relative decrease in error rate of from 1 to 10%”
Bourlard, Hermansky and Morgan. Towards increasing speech recognition error rates, 1996.
• Algorithmic, drawing on many disciplines (especially signal processing, statistics, linguistics, natural language processing)
• Empirical: lots of different algorithms and optimizations
• Almost no theory to describe why particular approaches work better than others, or how to find optimal solutions
• Massive infrastructure is a big advantage: large and varied data sets, significant computing resources.

Audio issues
• Bandwidth
• Recording noise
• Ambient noise
• Reverberation
• Microphones
• Microphone arrays

Acoustic models
• Generated from a corpus of recorded, transcribed audio
• Both artificial and natural corpora (TIMIT, Broadcast News, Meetings)
• Audio needs to match the application
• Audio bandwidth = ½ sampling rate
  • Phone speech (sampled 8 kHz, bandwidth 4 kHz)
  • Microphone speech (sampled 16 kHz, bandwidth 8 kHz, typical analysis on 130 Hz – 6800 Hz)
• There is a South African corpus of phone speech
• But no South African corpus of microphone speech

The TIMIT audio corpus
0 47719 She had your dark suit in greasy wash water all year

2214 4428 she
4428 8316 had
7308 9691 your
9691 15331 dark
15331 19634 suit
20929 22453 in
22453 27697 greasy
27697 32326 wash
33120 36575 water
37597 39644 all
39644 43982 year

0 2214 h#
2214 3744 sh
3744 4428 ax-h
4428 5229 hv
5229 6927 ae
6927 7308 dcl
7308 8316 jh
8316 9691 axr
9691 11697 dcl
11697 12114 d
12114 13075 aa
…

Word and phoneme alignment by timecode.
630 speakers from 8 US dialect regions, speaking 10 sentences each.

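The alignment columns above are sample offsets rather than seconds. As a minimal sketch, assuming the 16 kHz sampling rate of TIMIT audio, the lines can be converted to time ranges like this (the function name `parse_alignment` is hypothetical):

```python
SAMPLE_RATE_HZ = 16000  # TIMIT audio is sampled at 16 kHz


def parse_alignment(text, sample_rate=SAMPLE_RATE_HZ):
    """Parse 'start end label' lines into (start_s, end_s, label) tuples."""
    segments = []
    for line in text.strip().splitlines():
        start, end, label = line.split(maxsplit=2)
        segments.append((int(start) / sample_rate, int(end) / sample_rate, label))
    return segments


# First few word alignments from the slide.
words = parse_alignment("""
2214 4428 she
4428 8316 had
7308 9691 your
9691 15331 dark
""")

for start_s, end_s, word in words:
    print(f"{start_s:.3f}-{end_s:.3f} s  {word}")
```

The same function works for the phoneme lines, since they share the three-column layout.
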
Dialect regions
The Nationwide Speech Project: A new corpus of American English dialects
http://web.mit.edu/~nancyc/Public/Papers/Clopper_Pisoni_06_SC.pdf

www.voxforge.org
• Crowdsourcing the creation of a GPL speech corpus and open source acoustic models (Sphinx, ISIP, Julius, HTK).
• An important effort, but still small (84 hours at Dec 2010)

Language modelling
• Pronunciation dictionary (lexicon)
    TOMATO     T AH0 M EY1 T OW2
    TOMATO(1)  T AH0 M AA1 T OW2
• Language model: a statistical sequence model of words. Trigram models (3 words) are common:
    -2.0998 YORK MONEY FUND
    -0.0798 YORK HEDGE FUND
    -0.1392 YORK MUTUAL FUND

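The two entry formats above can be read with a few lines of code. This is a sketch under one stated assumption: the leading number on each trigram line is a base-10 log probability, as in the ARPA language model format. The helper names are hypothetical.

```python
def parse_dict_line(line):
    """Split 'WORD(n) PH1 PH2 ...' into (word, variant_index, phonemes)."""
    head, *phones = line.split()
    word, _, variant = head.partition("(")
    # 'TOMATO' is variant 0, 'TOMATO(1)' is the first alternative pronunciation.
    index = int(variant.rstrip(")")) if variant else 0
    return word, index, phones


def parse_trigram_line(line):
    """Split '<log10 prob> W1 W2 W3' into ((w1, w2, w3), probability)."""
    logprob, *words = line.split()
    # Assumption: the score is log base 10, so 10**score recovers the probability.
    return tuple(words), 10 ** float(logprob)


entry = parse_dict_line("TOMATO(1)  T AH0 M AA1 T OW2")
trigram, p = parse_trigram_line("-0.0798 YORK HEDGE FUND")
print(entry)
print(trigram, round(p, 3))
```

Read this way, "FUND" is far more likely after "YORK HEDGE" or "YORK MUTUAL" than after "YORK MONEY", whose score of -2.0998 corresponds to a probability below 1%.
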
Statistical sequence models
Truly Madly _____
• Widely used
• Applications
  • Auto-suggest
  • Spell-checkers
  • Lossless compression
  • Machine translation
  • Language models for speech recognition
• Probability of token w in context of preceding tokens c, e.g. P(deeply), given “truly madly”

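A minimal sketch of how P(w | c) can be estimated for a trigram model, using plain maximum-likelihood counts on a toy token sequence. The corpus and function names are invented for illustration; real language models add smoothing so that unseen trigrams do not get zero probability.

```python
from collections import Counter


def train(tokens):
    """Count trigrams and the two-word contexts they start with."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    context_counts = Counter()
    for (w1, w2, _), n in trigram_counts.items():
        context_counts[(w1, w2)] += n
    return trigram_counts, context_counts


def prob(word, context, trigram_counts, context_counts):
    """Maximum-likelihood estimate of P(word | context), no smoothing."""
    total = context_counts[context]
    return trigram_counts[context + (word,)] / total if total else 0.0


# Toy corpus, invented for the example.
tokens = "truly madly deeply truly madly deeply truly madly wordly".split()
tri, ctx = train(tokens)
p_deeply = prob("deeply", ("truly", "madly"), tri, ctx)
print(p_deeply)
```

Here "truly madly" occurs three times as a context and is followed by "deeply" twice, so the estimate is 2/3.
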
Context is king
• Micro-context (e.g. bi- and trigrams)
    United Kingdom
    United Airlines
    United Arab Emirates
• Long-range context
    “Cricket and rugby are amongst the most popular sports in the United _________”
(example from The Sequence Memoizer, Wood et al, 2011).

Characteristics of language
• Power law frequency / rank distribution. Zipf’s law:
    “given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table”
    http://en.wikipedia.org/wiki/Zipf's_law
• Also, more frequent words tend to be shorter.

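Zipf's law says that frequency times rank should stay roughly constant. The sketch below builds a rank/frequency table and checks that product on a synthetic token list constructed to follow the law exactly; the words and counts are invented for illustration, and real text only approximates the relationship.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) rows, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


# Synthetic corpus: the word at rank r occurs 120 / r times.
synthetic = []
for rank, word in enumerate(["the", "of", "and", "to", "in"], start=1):
    synthetic.extend([word] * (120 // rank))

table = rank_frequency(synthetic)
for rank, word, freq in table:
    print(rank, word, freq, rank * freq)
```

On a real corpus such as a Wikipedia dump, the product drifts rather than staying fixed, but the long tail of rare words is clearly visible.
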
How to get large language data sets
• Linguistic Data Consortium (by subscription, restricted)
• Some other more specialized corpora
• Microsoft (free, restricted)
• Google (Creative Commons license)
• Wikipedia (CC / GFDL license)

Using Wikipedia as a language resource
• Download a snapshot (6G compressed)
• Convert from XML and markup to plain text
• Create dictionaries of target size (by word frequency)
• Create language models of target size
• Approximately equal in size to English Gigaword Corpus

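The dictionary step above can be sketched as follows: count words in the plain-text lines, keep the most frequent ones up to the target size, and measure how many tokens of some new text fall outside that vocabulary. The tokenizer regex, the function names and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")  # crude tokenizer, an assumption for this sketch


def build_vocabulary(lines, size):
    """Return the `size` most frequent words found in `lines`."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return [word for word, _ in counts.most_common(size)]


def oov_rate(text, vocabulary):
    """Fraction of tokens in `text` that are not in the vocabulary."""
    tokens = WORD_RE.findall(text.lower())
    known = set(vocabulary)
    missing = sum(1 for token in tokens if token not in known)
    return missing / len(tokens) if tokens else 0.0


corpus = ["The cat sat on the mat", "The dog sat"]
vocab = build_vocabulary(corpus, size=2)
print(vocab, oov_rate("the dog sat", vocab))
```

Because `build_vocabulary` consumes lines one at a time, it can be fed a generator over a large file instead of a list held in memory.
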
Grid computing for language modelling
• For when you need lots of RAM and/or lots of CPU
• www.sagrid.ac.za
• ICTS at UCT: Tim Carr, Andrew Lewis

Accounting for context: LM adaptation
• Adapt a language model to more closely resemble the target speech
• Using related text for
  • Topic modelling (vocabulary, concepts)
  • Style-of-speech modelling
    “ok and um it's quite useful to have a very good diagnostic test of of acute hepatitis um you know to prevent kind of unnecessary um surgery um so hepatitis is really one um example of a cause of acute abdominal pain that doesn't need surgery”

What’s special about lectures?
• Possibly helpful assumptions:
  • Coherent topic(s) within a course
  • One lecturer presents many lectures
  • Specialized vocabulary
  • Spoken speech different to written speech

Using Wikipedia for LM adaptation
• Goal is to adapt a “standard” LM to be specific to the topic of the audio
• Start somewhere: title, keywords, text from slides
• Select a set of documents, adapt the LM
• Using Wikipedia, select by similarity: identify the set of documents most closely related to the starting point or keywords

Vector space modelling
• Represents documents as n-dimensional vectors (n terms)
• Document similarity established by comparing vectors, producing a similarity score.
• Gensim VSM toolkit: memory use is independent of corpus size (so good for Wikipedia)
• LSI, LDA, TF-IDF measures.
• Create a “similarity crawler” to build a corpus of documents related to the topic

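To make the vector comparison concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity. It does not use Gensim's API and is not the talk's implementation; the documents are invented, and the helper names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "hepatitis liver surgery",                # seed keywords for the lecture topic
    "acute hepatitis affects the liver",      # related article
    "rugby and cricket are popular sports",   # unrelated article
]
seed, *candidates = tfidf_vectors(docs)
scores = [cosine(seed, vec) for vec in candidates]
print(scores)
```

A similarity crawler along these lines would keep the candidates whose score exceeds some threshold and use their text to adapt the language model.
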
Metrics
• Perplexity (average number of guesses required)
• Word Error Rate (edit distance: insertions, deletions, substitutions)
• Information Retrieval: precision and recall
• What’s sufficient? Need to close an accuracy gap of
• Munteanu research: %WER for a transcript

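The first two metrics can be sketched in a few lines. Word Error Rate is computed here as the word-level edit distance divided by the number of reference words, and perplexity as the exponential of the average negative log probability the model assigns to each word. Function names are hypothetical and the inputs are illustrative.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance (sub + ins + del) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def perplexity(word_probs):
    """exp of the mean negative log probability assigned to each word."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


wer = word_error_rate(
    "recognize speech using common sense",
    "wreck a nice beach you sing calm incense",
)
print(wer, perplexity([0.25, 0.25, 0.25, 0.25]))
```

Because insertions count as errors, WER can exceed 100%: the mishearing above has no word in common with the five-word reference and is eight words long, giving 8 / 5 = 1.6. A model that assigns probability 1/4 to every word has perplexity 4, matching the "number of guesses" intuition.
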
What is lecture capture?
• Largely automated:
  • Recording
  • Processing
  • Output
• Recreates the lecture experience by recording:
  • audio
  • video
  • screen output (VGA)
www.opencastproject.org

Licensing constraints
• Opencast Matterhorn is licensed under the ECL open source license (similar to Apache 2.0 license)
• Allows closed commercial derivatives
• Therefore cannot use software or datasets which are non-commercial or research-only.
• Can use Apache, BSD, LGPL, maybe GPL code and data.

Speech recognition software ecosystem
Licensing and patents: Closed / Proprietary / FOSS / Open

Opencast in action

Prior work in ASR for lectures
• MIT Lecture Browser (SUMMIT recognizer)
• U. Toronto / ePresence PhD prototype by Cosmin Munteanu (SONIC recognizer)
• ETH Zurich: integration of CMU Sphinx with REPLAY

Work in progress
• Get consistently good quality audio recordings
• Implement dynamic language model adaptation
• Integrate into Opencast Matterhorn workflow
• Show transcript to users in UI, enable search
• Allow users to edit / improve transcript
• Use edits to improve recognition

Speech recognition in the cloud
• Google Android: 70 CPU-years to build models
• Nexiwave: cloud service using GPUs
• Advantages: potentially massive computing resources
• Disadvantages: generic issues and risks with cloud services
  • Bandwidth, lock-in, terms of service, data ownership and retention, etc.

Find out more
• Truly Madly Wordly: my blog on open source language modelling and speech recognition: http://trulymadlywordly.blogspot.com
• CMU Sphinx: http://cmusphinx.sourceforge.net/
• Opencast: http://www.opencastproject.org