THE STATE.
OF ASR 2023.
HELLO! WE’RE EXCITED TO CHAT ASR TODAY.
LILY BOND (She/Her)
SVP of Marketing @ 3Play Media
lily@3playmedia.com
TESSA KETTELBERGER (She/Her)
Senior Data Scientist @ 3Play Media
tessa@3playmedia.com
AGENDA.
ASR overview
Annual State of ASR report
Research results & trends
Key takeaways & conclusions
AN OVERVIEW OF ASR TECH
IMPROVING ASR
ASR gets better by modeling “truth”
data so the AI learns from its
mistakes. For example - ASR might
read “I need to call an über” until
the company name “Uber” is
added to its vocabulary.
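A toy sketch of the “Uber” example above. Production engines learn new vocabulary by retraining or biasing their models on corrected truth data; the lookup table below is purely hypothetical and only illustrates the before/after effect:

```python
# Illustrative only - real engines add vocabulary by retraining/biasing the model,
# not with a lookup table like this.
LEARNED_VOCABULARY = {"über": "Uber"}  # hypothetical entry learned from "truth" data

def apply_vocabulary(transcript: str) -> str:
    """Replace known misrecognitions with the corrected vocabulary."""
    return " ".join(LEARNED_VOCABULARY.get(word, word) for word in transcript.split())

print(apply_vocabulary("I need to call an über"))  # -> "I need to call an Uber"
```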
HOW IS IT USED?
ASR is used in many aspects of daily life -
from transcription to phone support to
automated assistants like Siri or Alexa.
WHAT IS ASR?
ASR stands for Automatic Speech
Recognition and refers to the use of
Machine Learning (ML), Natural Language
Processing (NLP), and Artificial Intelligence
(AI) technology to convert speech into
text.
ASR FOR TRANSCRIPTION
This session will specifically cover the
use case of ASR for transcription and
captioning
AUTO ASSISTANTS VS CAPTIONS
AUTOMATED ASSISTANTS:
● Single Speaker
● High quality audio, close
speaker
● Learns your voice
● Constrained tasks
● Clarification
● Did you catch my drift?
AUTOMATIC CAPTIONS:
● Usually multiple speakers
● Tasks are open-ended
● Background noise, poor audio
● Lost frequencies
● Most of us don’t speak
perfectly
● Changing audio conditions
.LET’S TALK.
.STATE OF ASR..
An annual review of the top ~8
speech recognition engines, testing how
they perform on the task of
captioning and transcription. We
test for both Word Error Rate
(WER) and Formatted Error Rate
(FER).
THE REPORT
Because we use speech
recognition as the first step in our
human-corrected captioning
process, we care about using the
best ASR out there. This annual
test keeps us on top of what’s
changing in the industry.
OUR GOAL
VARIETY
Long-form transcription and
captioning can present a variety of
environments and subjects.
LENGTH
Captioning relies on long-form
audio, not short commands &
feedback.
READABILITY
Captions are consumed by
humans and need to be
understandable, using proper
sentence case and grammar.
CAPTIONING.
PRESENTS A.
UNIQUE.
CHALLENGE.
.LET’S SEE THE.
.DATA..
10 ASR ENGINES ON.
107 HOURS & 929,795 WORDS.
ACROSS 549 VIDEOS.
FROM 9 INDUSTRIES.
WE TESTED ….
SPECIFICALLY ….
ASR ENGINES
● Speechmatics (SMX)
● Speechmatics with 3Play Media post-processing
● Microsoft
● Rev.ai
● IBM
● Google (Standard)
● Google (Enhanced/VM)
● Assembly AI
● Whisper (Tiny)
● Whisper (Large)
This year, we tested 57% more hours and 56% more
words than in 2022’s report.
DISTRIBUTION BY INDUSTRY
● 34% Higher Ed
● 16% Tech
● 15% Consumer Goods
● 9% Cinematic
● 8% Associations
● 7% Sports
● 4% Publishing
● 3% eLearning
● 3% News & Networks
Note: The duration, number of speakers, audio quality,
and speaking style (e.g. scripted vs. spontaneous) vary
greatly across this data.
3-STEP PROCESS
ASR is the first step of our captioning
process, followed by 2 rounds of human
editing and review. The better the ASR, the
easier the job of the humans.
POST-PROCESSING
We do our own post-processing on the ASR
engines we use to further improve the ASR
output. We have millions of accurately
transcribed words that we model on top of
ASR to further tune the results.
3PLAY + SMX
In this report, you’ll see the 3Play results
modeled on Speechmatics, our current
primary ASR engine. We would expect to see
a similar 10% relative improvement if we
applied our proprietary post-processing to
any engine in this report.
HOW DOES.
3PLAY USE.
ASR?.
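To make the “~10% relative improvement” figure concrete, here is a quick back-of-the-envelope check using the 2023 WERs reported later in this deck for Speechmatics alone vs. Speechmatics + 3Play post-processing:

```python
speechmatics_wer = 7.56  # 2023 WER, Speechmatics alone
smx_3play_wer = 6.86     # 2023 WER, Speechmatics + 3Play post-processing

relative_gain = (speechmatics_wer - smx_3play_wer) / speechmatics_wer
print(f"{relative_gain:.1%}")  # ~9.3%, i.e. roughly the ~10% relative improvement cited
```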
OUR R&D TEAM TESTED TWO.
METRICS: WER & FER..
Word Error Rate (WER)
Word Error Rate is the metric you typically see when
discussing caption accuracy. For example, “99%
accurate captions” would have a WER of 1%.
That means 1 in every 100 words is incorrect - the
standard for recorded captioning.
In addition to pure WER, we dig deeper to measure
insertions, substitutions, deletions, and corrections -
which provides nuance on how different engines get
to the measured WER.
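For readers who want to see what goes into that number, here is a minimal sketch of a word-level WER calculation - the standard edit-distance formulation, not 3Play’s production scoring code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + sub_cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# "99% accurate" captions correspond to a WER of 0.01: one error per 100 reference words.
print(wer("i need to call an uber", "i need to call an über"))  # 1 error / 6 words ≈ 0.167
```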
Formatted Error Rate (FER)
While WER is the most common measure of caption accuracy, we
think FER and CER (Character Error Rate) are most critical to the
human experience of caption accuracy.
FER takes into account formatting errors like punctuation,
grammar, capitalization, and other captioning requirements like
speaker identification and sound effects.
This is critical for the “read” experience of captioning, and as you’ll
see, some engines prioritize FER over others.
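The exact FER recipe varies by vendor, but the core idea is that the comparison happens on the formatted text viewers actually read, while WER is computed on normalized words. A simplified, self-contained illustration (the example sentences are made up, and this is not 3Play’s exact FER methodology):

```python
import re

def word_error_rate(ref_text: str, hyp_text: str) -> float:
    """Word-level edit distance (sub + ins + del) over reference length."""
    ref, hyp = ref_text.split(), hyp_text.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)] for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(ref)

def normalize(text: str) -> str:
    """Strip case and punctuation so only the spoken words are compared (WER-style)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).strip()

truth = "Okay, we'll meet Dr. Smith at 3 PM."  # made-up example sentence
asr = "okay we'll meet dr smith at 3 pm"

print(word_error_rate(normalize(truth), normalize(asr)))  # 0.0 - every spoken word is right
print(word_error_rate(truth, asr))  # 0.5 - capitalization and punctuation now count, FER-style
```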
2023’S REPORT IS THE MOST.
EXCITING STATE OF ASR YET!.
SPOILER ALERT ….
WORD ERROR RATES.
2022 2023
SMX + 3Play 7.96 6.86
AssemblyAI -- 7.5
Speechmatics 8.67 7.56
Whisper (Large) -- 8.42
Microsoft 10.6 9.69
Rev.ai 13.8 10.4
Google (Video) 12.8 13.5
Whisper (Tiny) -- 15.1
IBM 23.3 24.8
Google (Stand.) 26.1 28.1
KEY TAKEAWAYS
1. New entrants Whisper and AssemblyAI are very
interesting
2. Speechmatics, Microsoft, and Rev all made
impactful improvements
3. Google and IBM lost ground
4. 3Play proprietary post-processing adds an
incremental ~10% gain on any vendor - we tested
here with Speechmatics (our current primary
vendor), but we’d expect the same gains when
tuned to any other vendor.
Overall, it is fair to say that speech recognition for the
task of transcription has improved YOY from 2022.
DIFFERENT TYPES OF ERRORS.
%SUB %INS %DEL
SMX + 3Play 2.3 2.95 1.61
AssemblyAI 2.98 1.35 3.17
Speechmatics 2.48 3.61 1.48
Whisper (Large) 2.39 2.57 3.45
Microsoft 3.64 3.82 2.23
Rev.ai 3.86 4.53 2
Google (Video) 5.46 3.78 4.27
Whisper (Tiny) 7.48 4.1 3.49
IBM 12.6 5.45 6.7
Google (Stand.) 9.62 3.42 15.1
KEY TAKEAWAYS
● Speechmatics deletes by far the fewest words
● AssemblyAI inserts by far the fewest words
● SMX+3Play and Whisper substitute the fewest
words
● Meanwhile, Google deletes an alarming % of
words and IBM inserts an alarming % of words
● Error type breakdowns illustrate the strengths and
weaknesses and differing behavior between
engines
● These help us decide how to act when error rates
look very similar between top engines
● For our business needs, we believe lower deletion
rates are valuable
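One way to read this table against the earlier WER table: the three error types sum to the overall WER. A quick consistency check using the 2023 values for the top three rows:

```python
# %SUB, %INS, %DEL from the table above.
error_breakdown = {
    "SMX + 3Play": (2.30, 2.95, 1.61),
    "AssemblyAI": (2.98, 1.35, 3.17),
    "Speechmatics": (2.48, 3.61, 1.48),
}

# WER = %SUB + %INS + %DEL, so these should reproduce the 2023 WERs
# reported earlier (6.86, 7.5, and 7.56, up to rounding).
for engine, (sub, ins, dele) in error_breakdown.items():
    print(f"{engine}: {sub + ins + dele:.2f}")
```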
DIFFERENT TRANSCRIPT STYLES
CLEAN READ VERBATIM
AssemblyAI 6.39 14.2
Speechmatics 7.2 9.74
Whisper (Large) 8.02 10.8
Microsoft 9.06 13.5
Rev AI 9.92 13.2
Google (Enhanced) 12.3 20.6
Whisper (Tiny) 13.8 22.6
IBM Watson 23.2 34.2
Google (Standard) 25.9 21.6
KEY TAKEAWAYS
● Engines lie on a spectrum between “Clean Read” and
“Verbatim” transcript styles.
● AssemblyAI favors the “Clean Read” style
● Speechmatics is more in the “Verbatim” style
We offer two styles of transcription. Verbatim includes
disfluencies, false starts, and word repetitions. Clean Read does
not. Both of these styles could be considered correct and are
appropriate for different situations. When we split our test
sample into Clean Read and Verbatim, the relative ranking of
the engines is quite different between the two samples.
The majority of our content is done in Clean Read. This probably
imparts a slight bias towards scoring the Clean Read engines
favorably.
*Error rates overall tend to be higher on our Verbatim
content. This is related to the difficulty of the content
in the markets where each style is most popular.
FORMATTED ERROR RATES.
KEY TAKEAWAYS
1. Again - new entrants Whisper and AssemblyAI are
very interesting, and Speechmatics continues to
be a top engine.
2. It’s clear which engines are prioritizing the
captioning use case.
3. These results suggest engines may be plateauing
in the formatting space.
FER is the experienced accuracy of captioning, and even
the best performing engine is still only ~83% accurate.
This is far from a quality or “equal” captioning
experience.
For the captioning use case, FER is critical to readability
and meaning - and an accuracy rate of under 85% is
extremely noticeable.
2022 2023
Whisper (Large) -- 17.2
AssemblyAI -- 17.5
3Play 17.2 17.8
Speechmatics 17.9 18.3
Rev.ai 22.4 21.5
Microsoft 24.9 22.3
Whisper (Tiny) -- 25.4
Google (Video) 27.0 29.8
Google (Stand.) 38.6 41.6
IBM 38.2 42.5
POLL TIME! ASR PERFORMED BEST.
ON CONTENT FROM WHICH.
INDUSTRY?.
● Sports
● Cinematic
● News
● Publishing
● Tech
● Consumer Goods
● Higher Ed
● Associations
● eLearning
WER & FER BY INDUSTRY.
INDUSTRY AVG. WER AVG. FER
Sports 9.94 21.4
Cinematic 12.91 26.3
News 11.1 26.4
Publishing 7.74 18.2
Tech 5.5 14.5
Consumer Goods 8.72 17.7
Higher Ed 6.38 16.0
Associations 6.43 15.9
eLearning 4.07 13.4
KEY TAKEAWAYS
● Cinematic, News, and Sports content stand out as the
toughest for ASR to transcribe accurately - these markets
often have background noise, specific formatting needs,
overlapping speech, and difficult audio.
● Whisper performed particularly poorly for Cinematic content,
with a FER of 32.6% (vs 25%, 23.8%, and 23.7% for Assembly,
3Play, and SMX respectively).
● eLearning performed the best, followed by Tech - video in
these industries is usually professionally recorded, with clear
audio and a single speaker.
● FER remains high enough across industries to require human
oversight in creating quality captions.
● Industries with extremely clear audio and simple formatting
needs have the best chance of performing well. Those with
complex formatting and poor audio quality perform worst -
here, ASR is very far from being a good solution on its own.
*Note: These are averages of the top 4
engines (3Play, SMX, Whisper, Assembly).
TRAINING DATA
The quantity and quality of data - as well as
the type of data - a model is trained on
make a huge difference in output.
ARCHITECTURE
There are three major architectures -
Convolutional, Transformer, and Conformer
(a blend of the two, introduced in mid-2020).
Assembly uses Conformer; Whisper and SMX
use Transformer.
MODEL GOALS
Different companies have different goals for
their engines - broad vs specialized,
captioning vs auto assistants, ASR only vs
human correction. These goals matter.
NOT ALL.
MODELS ARE.
CREATED.
EQUAL.
.THE BEST.
.OF THE BEST.
Speechmatics (SMX)
Speechmatics transcribed more words
accurately, but made more insertions than
AssemblyAI - although most of these
insertions were disfluencies (uhm, y’know,
false starts). Their self-learning model
continues to see gains year over year.
AssemblyAI
Assembly missed more words than SMX but
didn’t insert as many (notably, they don’t
insert many disfluencies). AssemblyAI uses a
different architectural model than Whisper
and SMX and trains on specialized data.
Whisper
Trained on a very large but general data set
(680K hours), applying to ASR the same neural
scaling hypotheses used for GPT.
However, something odd happens with
Whisper (and no other engine …)
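For reference, the two Whisper rows in this report are OpenAI’s open-source checkpoints, which anyone can run locally. A minimal transcription sketch, assuming the openai-whisper package and a hypothetical local audio file (this is not the evaluation harness behind these results):

```python
# pip install openai-whisper
import whisper

# "tiny" and "large" are the two checkpoints compared in this report;
# larger checkpoints trade speed and memory for accuracy.
model = whisper.load_model("tiny")
result = model.transcribe("lecture_sample.mp3")  # hypothetical local audio file
print(result["text"])
```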
“.
… IT HALLUCINATES 👀👀👀.
Whisper’s greatest flaw seems to be its tendency to sometimes “hallucinate”
additional speech that doesn’t appear in the original audio sample. The
hallucinations look very credible if you aren’t listening to the audio. They are
usually sensible and on-topic, grammatically correct sentences. This would make
viewing the captions as a Deaf/HoH user really confusing. If auto-captions are
nonsensical, it’s clear they are making a mistake, but with these, you could easily
assume the mistakes are what is actually being said. Whisper’s scores don’t
adequately penalize hallucinations in my opinion. Hallucinations will show up as
errors, but an area where the text was completely invented may still get as low as
a 50% error rate (rather than 100%) because of common pronouns, function
words, and punctuation lining up with the real text.
”.
TRUTH → WHISPER (word-by-word alignment)
the → the
> → southeastern
mysteries → part
of → of
the → the
universe → state
in → it’s
a → a
(The “>” marks a position where Whisper inserted a word with no counterpart in the truth.)
● This example is from a news segment on the weather that transitioned to a
segment on a NASA launch
● Whisper tries to stay on topic and “hallucinates” a continued story about
the weather
● While 0% of this is correct, the WER is ~50% because of words like “the,” “of,”
and “a”
● If you relied on captions for this programming, you would get a made up
and inaccurate weather forecast
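To see how an entirely invented passage can still score near 50%, here is a quick word error rate check on the two columns above, assuming the open-source jiwer package (the column split and alignment are reconstructed from the slide, so treat it as approximate):

```python
# pip install jiwer
from jiwer import wer

truth = "the mysteries of the universe in a"
whisper_output = "the southeastern part of the state it's a"

# None of the hallucinated content is correct, but "the", "of", and "a" still align,
# so the score lands around 0.5-0.6 instead of 1.0.
print(wer(truth, whisper_output))  # ≈ 0.57
```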
.KEY.
FINDINGS:.
(TL;DR).
New Models Are Emerging
Whisper and AssemblyAI have different approaches but
have both emerged with exciting offerings - with
~equivalent accuracy to SMX, which has led the pack for
many years.
Source Material Matters
It’s clear that results are still heavily dependent on audio
quality and content difficulty. Most improvements are
driven by training techniques, not changes to technology.
Hallucination?
What is it about Whisper’s model that hallucinates
completely made up content? Does this have to do with
their scaled supervised learning approach?
Use Case Matters
These engines are ultimately trained for different use cases.
Understanding your use case and which engine best suits it
is critical to producing the highest-quality output.
Still Not Good Enough
It’s clear that ASR is still far from good enough for
compliance, where 99%+ accuracy is required to provide
an equal experience.
.WHAT THIS.
.MEANS FOR .
YOU..
While technology continues to improve, there is
still a significant leap to real accuracy from even
the best speech recognition engines, making
humans a crucial part of creating accurate
captions.
COMMON CAUSES OF ASR.
ERRORS:.
Word Errors:
● Multiple speakers or overlapping speech
● Background noise
● Poor audio quality
● False starts
● Acoustic errors
● “Function” words
Formatting Errors:
● Speaker labels
● Punctuation
● Grammar
● Numbers
● Non-speech elements
● [INAUDIBLE] tags
Incorrect punctuation can
change the meaning of
language tremendously.
FORMATTING
ERRORS
This example illustrates a very
common ASR error. Although
seemingly small, the meaning
is completely reversed.
“I can’t
attend the
meeting.”
vs.
“I can
attend the
meeting.”
FUNCTION
WORDS
These examples of names
and complex vocabulary
require human expertise &
knowledge. In each case, the
truth is on the left, and the
ASR is on the right.
COMPLEX
VOCABULARY
REMEMBER - ERRORS ADD UP.
QUICKLY ....
AT 85% ACCURACY, 1 IN 7 WORDS.
IS INCORRECT.
QUALITY MATTERS..
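A quick sanity check on that figure, since it can sound surprising:

```python
accuracy = 0.85
error_rate = 1 - accuracy
print(1 / error_rate)  # ≈ 6.7, i.e. roughly one wrong word in every 7
```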
.SO,.
TO RECAP:.
SPEECHMATICS, MICROSOFT,.
AND REV ALL IMPROVED YOY -.
WHISPER & ASSEMBLYAI ARE.
EXCITING ENTRANTS.
SPEECHMATICS IS NO LONGER.
THE CLEAR LEADER..
WHISPER AND ASSEMBLYAI.
APPEAR JUST AS GOOD..
THE BEST ENGINES CAN.
ACHIEVE UP TO 93% ACCURACY ….
FOR NON-SPECIALIZED CONTENT.
WITH GREAT AUDIO QUALITY.
THIS WAS THE MOST EXCITING.
STATE OF ASR WE’VE SEEN -.
BUT THERE’S STILL.
A LONG WAY TO GO.
TO REPLACE HUMANS..
THANK YOU!.
WHAT QUESTIONS.
DO YOU HAVE?.
STATE OF ASR
go.3playmedia.com/rs-2023-asr
3PLAY MEDIA
www.3playmedia.com | @3playmedia
LILY BOND
(She/Her)
lily@3playmedia.com
TESSA KETTELBERGER
(She/Her)
tessa@3playmedia.com
