THE STATE.
OF ASR 2023.
HELLO! WE’RE EXCITED TO CHAT ASR TODAY.
LILY BOND (She/Her)
SVP of Marketing @ 3Play Media
lily@3playmedia.com
TESSA KETTELBERGER (She/Her)
Senior Data Scientist @ 3Play Media
tessa@3playmedia.com
AGENDA.
ASR overview
Annual State of ASR report
Research results & trends
Key takeaways & conclusions
AN OVERVIEW OF ASR TECH
IMPROVING ASR
ASR gets better by modeling “truth”
data so the AI learns from its
mistakes. For example - ASR might
read “I need to call an über” until
the company name “Uber” is
added to its vocabulary.
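A toy sketch of the “Uber” example above. Production engines learn new vocabulary by retraining or biasing their models on corrected truth data; the lookup table below is purely hypothetical and only illustrates the before/after effect:

```python
# Illustrative only - real engines add vocabulary by retraining/biasing the model,
# not with a lookup table like this.
LEARNED_VOCABULARY = {"über": "Uber"}  # hypothetical entry learned from "truth" data

def apply_vocabulary(transcript: str) -> str:
    """Replace known misrecognitions with the corrected vocabulary."""
    return " ".join(LEARNED_VOCABULARY.get(word, word) for word in transcript.split())

print(apply_vocabulary("I need to call an über"))  # -> "I need to call an Uber"
```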
HOW IS IT USED?
ASR is used in many aspects of daily life -
from transcription to phone support to
automated assistants like Siri or Alexa.
WHAT IS ASR?
ASR stands for Automatic Speech
Recognition and refers to the use of
Machine Learning (ML), Natural Language
Processing (NLP), and Artificial Intelligence
(AI) technology to convert speech into
text.
ASR FOR TRANSCRIPTION
This session will specifically cover the
use case of ASR for transcription and
captioning
AUTO ASSISTANTS VS CAPTIONS
AUTOMATED ASSISTANTS:
● Single Speaker
● High quality audio, close
speaker
● Learns your voice
● Constrained tasks
● Clarification
● Did you catch my drift?
AUTOMATIC CAPTIONS:
● Usually multiple speakers
● Tasks are open-ended
● Background noise, poor audio
● Lost frequencies
● Most of us don’t speak
perfectly
● Changing audio conditions
.LET’S TALK.
.STATE OF ASR..
An annual review of the top ~8
speech recognition engines, testing how
they perform on the task of
captioning and transcription. We
test for both Word Error Rate
(WER) and Formatted Error Rate
(FER).
THE REPORT
Because we use speech
recognition as the first step in our
human-corrected captioning
process, we care about using the
best ASR out there. This annual
test keeps us on top of what’s
changing in the industry.
OUR GOAL
VARIETY
Long-form transcription and
captioning can present a variety of
environments and subjects.
LENGTH
Captioning relies on long-form
audio, not short commands &
feedback.
READABILITY
Captions are consumed by
humans and need to be
understandable, using proper
sentence case and grammar.
CAPTIONING.
PRESENTS A.
UNIQUE.
CHALLENGE.
.LET’S SEE THE.
.DATA..
10 ASR ENGINES ON.
107 HOURS & 929,795 WORDS.
ACROSS 549 VIDEOS.
FROM 9 INDUSTRIES.
WE TESTED ….
SPECIFICALLY ….
ASR ENGINES
● Speechmatics (SMX)
● Speechmatics with 3Play Media post-processing
● Microsoft
● Rev.ai
● IBM
● Google (Standard)
● Google (Enhanced/VM)
● Assembly AI
● Whisper (Tiny)
● Whisper (Large)
This year, we tested 57% more hours and 56% more
words than in 2022’s report.
DISTRIBUTION BY INDUSTRY
● 34% Higher Ed
● 16% Tech
● 15% Consumer Goods
● 9% Cinematic
● 8% Associations
● 7% Sports
● 4% Publishing
● 3% eLearning
● 3% News & Networks
Note: The duration, number of speakers, audio quality,
and speaking style (e.g. scripted vs. spontaneous) vary
greatly across this data.
3-STEP PROCESS
ASR is the first step of our captioning
process, followed by 2 rounds of human
editing and review. The better the ASR, the
easier the job of the humans.
POST-PROCESSING
We do our own post-processing on the ASR
engines we use to further improve the ASR
output. We have millions of accurately
transcribed words that we model on top of
ASR to further tune the results.
3PLAY + SMX
In this report, you’ll see the 3Play results
modeled on Speechmatics, our current
primary ASR engine. We would expect to see
a similar 10% relative improvement if we
applied our proprietary post-processing to
any engine in this report.
HOW DOES.
3PLAY USE.
ASR?.
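To make the “~10% relative improvement” figure concrete, here is a quick back-of-the-envelope check using the 2023 WERs reported later in this deck for Speechmatics alone vs. Speechmatics + 3Play post-processing:

```python
speechmatics_wer = 7.56  # 2023 WER, Speechmatics alone
smx_3play_wer = 6.86     # 2023 WER, Speechmatics + 3Play post-processing

relative_gain = (speechmatics_wer - smx_3play_wer) / speechmatics_wer
print(f"{relative_gain:.1%}")  # ~9.3%, i.e. roughly the ~10% relative improvement cited
```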
OUR R&D TEAM TESTED TWO.
METRICS: WER & FER..
Word Error Rate (WER)
Word Error Rate is the metric you typically see when
discussing caption accuracy. For example, “99%
accurate captions” would have a WER of 1%.
That means 1 in every 100 words is incorrect - the
standard for recorded captioning.
In addition to pure WER, we dig deeper to measure
insertions, substitutions, deletions, and corrections -
which provides nuance on how different engines get
to the measured WER.
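For readers who want to see what goes into that number, here is a minimal sketch of a word-level WER calculation - the standard edit-distance formulation, not 3Play’s production scoring code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + sub_cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# "99% accurate" captions correspond to a WER of 0.01: one error per 100 reference words.
print(wer("i need to call an uber", "i need to call an über"))  # 1 error / 6 words ≈ 0.167
```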
Formatted Error Rate (FER)
While WER is the most common measure of caption accuracy, we
think FER and CER (Character Error Rate) are most critical to the
human experience of caption accuracy.
FER takes into account formatting errors like punctuation,
grammar, capitalization, and other captioning requirements like
speaker identification and sound effects.
This is critical for the “read” experience of captioning, and as you’ll
see, some engines prioritize FER over others.
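The exact FER recipe varies by vendor, but the core idea is that the comparison happens on the formatted text viewers actually read, while WER is computed on normalized words. A simplified, self-contained illustration (the example sentences are made up, and this is not 3Play’s exact FER methodology):

```python
import re

def word_error_rate(ref_text: str, hyp_text: str) -> float:
    """Word-level edit distance (sub + ins + del) over reference length."""
    ref, hyp = ref_text.split(), hyp_text.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)] for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(ref)

def normalize(text: str) -> str:
    """Strip case and punctuation so only the spoken words are compared (WER-style)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).strip()

truth = "Okay, we'll meet Dr. Smith at 3 PM."  # made-up example sentence
asr = "okay we'll meet dr smith at 3 pm"

print(word_error_rate(normalize(truth), normalize(asr)))  # 0.0 - every spoken word is right
print(word_error_rate(truth, asr))  # 0.5 - capitalization and punctuation now count, FER-style
```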
2023’S REPORT IS THE MOST.
EXCITING STATE OF ASR YET!.
SPOILER ALERT ….
WORD ERROR RATES.
2022 2023
SMX + 3Play 7.96 6.86
AssemblyAI -- 7.5
Speechmatics 8.67 7.56
Whisper (Large) -- 8.42
Microsoft 10.6 9.69
Rev.ai 13.8 10.4
Google (Video) 12.8 13.5
Whisper (Tiny) -- 15.1
IBM 23.3 24.8
Google (Stand.) 26.1 28.1
KEY TAKEAWAYS
1. New entrants Whisper and AssemblyAI are very
interesting
2. Speechmatics, Microsoft, and Rev all made
impactful improvements
3. Google and IBM lost ground
4. 3Play proprietary post-processing adds an
incremental ~10% gain on any vendor - we tested
here with Speechmatics (our current primary
vendor), but we’d expect the same gains when
tuned to any other vendor.
Overall, it is fair to say that speech recognition for the
task of transcription has improved YOY from 2022.
DIFFERENT TYPES OF ERRORS.
%SUB %INS %DEL
SMX + 3Play 2.3 2.95 1.61
AssemblyAI 2.98 1.35 3.17
Speechmatics 2.48 3.61 1.48
Whisper (Large) 2.39 2.57 3.45
Microsoft 3.64 3.82 2.23
Rev.ai 3.86 4.53 2
Google (Video) 5.46 3.78 4.27
Whisper (Tiny) 7.48 4.1 3.49
IBM 12.6 5.45 6.7
Google (Stand.) 9.62 3.42 15.1
KEY TAKEAWAYS
● Speechmatics deletes by far the fewest words
● AssemblyAI inserts by far the fewest words
● SMX+3Play and Whisper substitute the fewest
words
● Meanwhile, Google deletes an alarming % of
words and IBM inserts an alarming % of words
● Error type breakdowns illustrate the strengths and
weaknesses and differing behavior between
engines
● These help us decide how to act when error rates
look very similar between top engines
● For our business needs, we believe lower deletion
rates are valuable
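One way to read this table against the earlier WER table: the three error types sum to the overall WER. A quick consistency check using the 2023 values for the top three rows:

```python
# %SUB, %INS, %DEL from the table above.
error_breakdown = {
    "SMX + 3Play": (2.30, 2.95, 1.61),
    "AssemblyAI": (2.98, 1.35, 3.17),
    "Speechmatics": (2.48, 3.61, 1.48),
}

# WER = %SUB + %INS + %DEL, so these should reproduce the 2023 WERs
# reported earlier (6.86, 7.5, and 7.56, up to rounding).
for engine, (sub, ins, dele) in error_breakdown.items():
    print(f"{engine}: {sub + ins + dele:.2f}")
```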
DIFFERENT TRANSCRIPT STYLES
CLEAN READ VERBATIM
AssemblyAI 6.39 14.2
Speechmatics 7.2 9.74
Whisper (Large) 8.02 10.8
Microsoft 9.06 13.5
Rev AI 9.92 13.2
Google (Enhanced) 12.3 20.6
Whisper (Tiny) 13.8 22.6
IBM Watson 23.2 34.2
Google (Standard) 25.9 21.6
KEY TAKEAWAYS
● Engines lie on a spectrum between “Clean Read” and
“Verbatim” transcript styles.
● AssemblyAI favors the “Clean Read” style
● Speechmatics is more in the “Verbatim” style
We offer two styles of transcription. Verbatim includes
disfluencies, false starts, and word repetitions. Clean Read does
not. Both of these styles could be considered correct and are
appropriate for different situations. When we split our test
sample into Clean Read and Verbatim, the relative ranking of
the engines is quite different between the two samples.
The majority of our content is done in Clean Read. This probably
imparts a slight bias towards scoring the Clean Read engines
favorably.
*Error rates overall tend to be higher on our Verbatim
content. This is related to the difficulty of the content
in the markets where each style is most popular.
FORMATTED ERROR RATES.
KEY TAKEAWAYS
1. Again - new entrants Whisper and AssemblyAI are
very interesting, and Speechmatics continues to
be a top engine.
2. It’s clear which engines are prioritizing the
captioning use case.
3. These results suggest engines may be plateauing
in the formatting space.
FER is the experienced accuracy of captioning, and even
the best performing engine is still only ~83% accurate.
This is far from a quality or “equal” captioning
experience.
For the captioning use case, FER is critical to readability
and meaning - and an accuracy rate of under 85% is
extremely noticeable.
2022 2023
Whisper (Large) -- 17.2
AssemblyAI -- 17.5
3Play 17.2 17.8
Speechmatics 17.9 18.3
Rev.ai 22.4 21.5
Microsoft 24.9 22.3
Whisper (Tiny) -- 25.4
Google (Video) 27.0 29.8
Google (Stand.) 38.6 41.6
IBM 38.2 42.5
POLL TIME! ASR PERFORMED BEST.
ON CONTENT FROM WHICH.
INDUSTRY?.
● Sports
● Cinematic
● News
● Publishing
● Tech
● Consumer Goods
● Higher Ed
● Associations
● eLearning
WER & FER BY INDUSTRY.
INDUSTRY AVG. WER AVG. FER
Sports 9.94 21.4
Cinematic 12.91 26.3
News 11.1 26.4
Publishing 7.74 18.2
Tech 5.5 14.5
Consumer Goods 8.72 17.7
Higher Ed 6.38 16.0
Associations 6.43 15.9
eLearning 4.07 13.4
KEY TAKEAWAYS
● Cinematic, News, and Sports content stand out as the
toughest for ASR to transcribe accurately - these markets
often have background noise, specific formatting needs,
overlapping speech, and difficult audio.
● Whisper performed particularly poorly for Cinematic content,
with a FER of 32.6% (vs 25%, 23.8%, and 23.7% for Assembly,
3Play, and SMX respectively).
● eLearning performed the best, followed by Tech - video in
these industries is usually professionally recorded, with clear
audio and a single speaker.
● FER remains high enough across industries to require human
oversight in creating quality captions.
● Industries with extremely clear audio and simple formatting
needs have the best chance of performing well. Those with
complex formatting and poor audio quality perform worst -
here, ASR is very far from being a good solution on its own.
*Note: These are averages of the top 4
engines (3Play, SMX, Whisper, Assembly).
TRAINING DATA
The quantity and quality of data - as well as
the type of data - a model is trained on
make a huge difference in output.
ARCHITECTURE
There are three major architectures -
Convolutional, Transformer, and Conformer
(a blend of the two, introduced in mid-2020).
Assembly uses Conformer; Whisper and SMX
use Transformer.
MODEL GOALS
Different companies have different goals for
their engines - broad vs specialized,
captioning vs auto assistants, ASR only vs
human correction. These goals matter.
NOT ALL.
MODELS ARE.
CREATED.
EQUAL.
.THE BEST.
.OF THE BEST.
Speechmatics (SMX)
Speechmatics transcribed more words
accurately, but made more insertions than
AssemblyAI - although most of these
insertions were disfluencies (uhm, y’know,
false starts). Their self-learning model
continues to see gains year over year.
AssemblyAI
Assembly missed more words than SMX but
didn’t insert as many (notably, they don’t
insert many disfluencies). AssemblyAI uses a
different architectural model than Whisper
and SMX and trains on specialized data.
Whisper
Trained on a very large but general data set
(680K hours), applying to ASR the same neural
scaling hypotheses used for GPT.
However, something odd happens with
Whisper (and no other engine …)
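For reference, the two Whisper rows in this report are OpenAI’s open-source checkpoints, which anyone can run locally. A minimal transcription sketch, assuming the openai-whisper package and a hypothetical local audio file (this is not the evaluation harness behind these results):

```python
# pip install openai-whisper
import whisper

# "tiny" and "large" are the two checkpoints compared in this report;
# larger checkpoints trade speed and memory for accuracy.
model = whisper.load_model("tiny")
result = model.transcribe("lecture_sample.mp3")  # hypothetical local audio file
print(result["text"])
```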
“.
… IT HALLUCINATES 👀👀👀.
Whisper’s greatest flaw seems to be its tendency to sometimes “hallucinate”
additional speech that doesn’t appear in the original audio sample. The
hallucinations look very credible if you aren’t listening to the audio. They are
usually sensible and on-topic, grammatically correct sentences. This would make
viewing the captions as a Deaf/HoH user really confusing. If auto-captions are
nonsensical, it’s clear they are making a mistake, but with these, you could easily
assume the mistakes are what is actually being said. Whisper’s scores don’t
adequately penalize hallucinations in my opinion. Hallucinations will show up as
errors, but an area where the text was completely invented may still get as low as
a 50% error rate (rather than 100%) because of common pronouns, function
words, and punctuation lining up with the real text.
”.
TRUTH → WHISPER (word-by-word alignment)
the → the
> → southeastern
mysteries → part
of → of
the → the
universe → state
in → it’s
a → a
(The “>” marks a position where Whisper inserted a word with no counterpart in the truth.)
● This example is from a news segment on the weather that transitioned to a
segment on a NASA launch
● Whisper tries to stay on topic and “hallucinates” a continued story about
the weather
● While 0% of this is correct, the WER is ~50% because of words like “the,” “of,”
and “a”
● If you relied on captions for this programming, you would get a made up
and inaccurate weather forecast
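To see how an entirely invented passage can still score near 50%, here is a quick word error rate check on the two columns above, assuming the open-source jiwer package (the column split and alignment are reconstructed from the slide, so treat it as approximate):

```python
# pip install jiwer
from jiwer import wer

truth = "the mysteries of the universe in a"
whisper_output = "the southeastern part of the state it's a"

# None of the hallucinated content is correct, but "the", "of", and "a" still align,
# so the score lands around 0.5-0.6 instead of 1.0.
print(wer(truth, whisper_output))  # ≈ 0.57
```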
.KEY.
FINDINGS:.
(TL;DR).
New Models Are Emerging
Whisper and AssemblyAI have different approaches but
have both emerged with exciting offerings - with
~equivalent accuracy to SMX, which has led the pack for
many years.
Source Material Matters
It’s clear that results are still heavily dependent on audio
quality and content difficulty. Most improvements are
driven by training techniques, not changes to technology.
Hallucination?
What is it about Whisper’s model that hallucinates
completely made up content? Does this have to do with
their scaled supervised learning approach?
Use Case Matters
These engines are ultimately trained for different use cases.
Understanding your use case and which engine best suits it
is critical to producing the highest-quality output.
Still Not Good Enough
It’s clear that ASR is still far from good enough for
compliance, where 99%+ accuracy is required to provide
an equal experience.
.WHAT THIS.
.MEANS FOR .
YOU..
While technology continues to improve, there is
still a significant leap to real accuracy from even
the best speech recognition engines, making
humans a crucial part of creating accurate
captions.
COMMON CAUSES OF ASR.
ERRORS:.
Word Errors:
● Multiple speakers or overlapping speech
● Background noise
● Poor audio quality
● False starts
● Acoustic errors
● “Function” words
Formatting Errors:
● Speaker labels
● Punctuation
● Grammar
● Numbers
● Non-speech elements
● [INAUDIBLE] tags
Incorrect punctuation can
change the meaning of
language tremendously.
FORMATTING
ERRORS
This example illustrates a very
common ASR error. Although
seemingly small, the meaning
is completely reversed.
“I can’t
attend the
meeting.”
vs.
“I can
attend the
meeting.”
FUNCTION
WORDS
These examples of names
and complex vocabulary
require human expertise &
knowledge. In each case, the
truth is on the left, and the
ASR is on the right.
COMPLEX
VOCABULARY
REMEMBER - ERRORS ADD UP.
QUICKLY ....
AT 85% ACCURACY, 1 IN 7 WORDS.
IS INCORRECT.
QUALITY MATTERS..
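A quick sanity check on that figure, since it can sound surprising:

```python
accuracy = 0.85
error_rate = 1 - accuracy
print(1 / error_rate)  # ≈ 6.7, i.e. roughly one wrong word in every 7
```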
.SO,.
TO RECAP:.
SPEECHMATICS, MICROSOFT,.
AND REV ALL IMPROVED YOY -.
WHISPER & ASSEMBLYAI ARE.
EXCITING ENTRANTS.
SPEECHMATICS IS NO LONGER.
THE CLEAR LEADER..
WHISPER AND ASSEMBLYAI.
APPEAR JUST AS GOOD..
THE BEST ENGINES CAN.
ACHIEVE UP TO 93% ACCURACY ….
FOR NON-SPECIALIZED CONTENT.
WITH GREAT AUDIO QUALITY.
THIS WAS THE MOST EXCITING.
STATE OF ASR WE’VE SEEN -.
BUT THERE’S STILL.
A LONG WAY TO GO.
TO REPLACE HUMANS..
THANK YOU!.
WHAT QUESTIONS.
DO YOU HAVE?.
STATE OF ASR
go.3playmedia.com/rs-2023-asr
3PLAY MEDIA
www.3playmedia.com | @3playmedia
LILY BOND
(She/Her)
lily@3playmedia.com
TESSA KETTELBERGER
(She/Her)
tessa@3playmedia.com
