Veton Këpuska. Int. Journal of Engineering Research and Application www.ijera.com
ISSN : 2248-9622, Vol. 7, Issue 3, ( Part -2) March 2017, pp.20-24
www.ijera.com DOI: 10.9790/9622-0703022024 20|P a g e
Comparing Speech Recognition Systems (Microsoft API, Google
API And CMU Sphinx)
Veton Këpuska1, Gamal Bohouta2
1,2 (Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, FL, USA)
ABSTRACT
This paper presents a tool designed to test and compare commercial speech recognition
systems, such as the Microsoft Speech API and the Google Speech API, with open-source speech
recognition systems such as Sphinx-4. The systems are compared across different environments by
running them on audio recordings selected from different sources and calculating the word error rate
(WER). Although the WERs of all three systems were acceptable, the Google API was observed to be
superior.
Keywords: Speech Recognition, Testing Speech Recognition Systems, Microsoft Speech API, Google Speech
API, CMU Sphinx-4 Speech Recognition.
I. INTRODUCTION
Automatic Speech Recognition (ASR) is
commonly employed in everyday applications. “One
of the goals of speech recognition is to
allow natural communication between humans and
computers via speech, where natural implies
similarity to the ways humans interact with each
other” [8]. ASR has produced many systems that
improve the interaction between users and
computers. According to Dale
Isaacs, “Today automatic speech recognition (ASR)
systems and text-to-speech (TTS) systems are quite
well established. These systems, using the latest
technologies, are operating at accuracies in excess of
90%” [6]. With the growing number of ASR
systems, such as Microsoft, Google, Sphinx, WUW,
HTK and Dragon, it has become difficult to know
which of them to choose. This paper therefore reports
the results of testing the Microsoft API, Google API,
and Sphinx-4 with a tool designed and implemented
in Java, using audio recordings selected from a
large number of sources. In comparing these
systems, several components were evaluated,
such as the acoustic model, the language model,
and the dictionary.
There are a number of commercial and
open-source systems, such as AT&T Watson,
Microsoft Speech API, Google Speech API,
Amazon Alexa API, Nuance Recognizer, WUW,
HTK and Dragon [2]. Three systems were selected
for our evaluation in different environments:
the Microsoft API, the Google API, and Sphinx-4.
Google and Microsoft are two of the biggest
companies building voice-powered applications [4].
The Microsoft API and Google API are commercial
speech recognition systems whose source code is
not accessible, whereas Sphinx-4 is an ASR system
whose code is freely available for download [3].
II. THE CMU SPHINX
The Sphinx system was developed at
Carnegie Mellon University (CMU). Currently,
“CMU Sphinx has a large vocabulary, speaker
independent speech recognition codebase, and its
code is available for download and use” [13].
Sphinx has several versions and packages for
different tasks and applications, such as Sphinx-2,
Sphinx-3 and Sphinx-4. There are also additional
packages such as Pocketsphinx, Sphinxbase, and
Sphinxtrain. In this paper, Sphinx-4 is
evaluated. Sphinx-4 is written in the Java
programming language. Moreover, “its structure has
been designed with a high degree of flexibility and
modularity” [13]. According to Juraj Kačur, “The
latest Sphinx-4 is written in JAVA, and Main
theoretical improvements are: support for finite
grammar called Java Speech API grammar, it
doesn’t impose the restriction using the same
structure for all models” [13] [5]. The Sphinx-4
structure has three main components: the Frontend,
the Decoder and the Linguist.
According to Willie Walker and others who
worked on Sphinx-4, "we created a number of
differing implementations for each module in the
framework. For example, the Frontend
implementations support MFCC, PLP, and LPC
feature extraction; the Linguist implementations
support a variety of language models, including
CFGs, FSTs, and N-Grams; and the Decoder
supports a variety of Search Manager
implementations" [1]. Sphinx-4 is therefore a
recent HMM-based speech recognizer with a
strong acoustic model, built on hidden Markov
models (HMMs) trained on a large vocabulary [2].
III. THE GOOGLE API
Google has improved its speech recognition
with new technology that is used in many applications
with the Google App, such as Goog411, Voice
Search on mobile, Voice Actions, Voice Input
(spoken input to keypad), Android Developer APIs,
Voice Search on desktop, YouTube transcription,
Translate, Navigate, and TTS.
After adopting deep learning neural
networks, Google achieved an 8 percent word error
rate in 2015, down from 23 percent in 2013.
According to Pichai, senior vice president of
Android, Chrome, and Apps at Google, “We have
the best investments in machine learning over the
past many years. Indeed, Google has acquired
several deep learning companies over the years,
including DeepMind, DNNresearch, and
Jetpac”[11].
IV. THE MICROSOFT API
Microsoft has been developing the Speech API
since 1993, when the company hired Xuedong (XD)
Huang, Fil Alleva, and Mei-Yuh Hwang, “three of
the four people responsible for the Carnegie Mellon
University Sphinx-II speech recognition system,
which achieved fame in the speech world in 1992
due to its unprecedented accuracy”. The first Speech
API (SAPI) 1.0 team was formed in 1994 [12].
Microsoft has continued to develop the
Speech API and has released a series of
increasingly powerful speech platforms, including
Speech API (SAPI) 5.3, released with Windows
Vista. On the developer front, "Windows Vista
includes a new WinFX® namespace,
System.Speech. This allows developers to easily
speech-enable Windows Forms applications and
apps based on the Windows Presentation
Framework"[12].
Microsoft has placed increasing emphasis
on speech recognition and has improved the
Speech API (SAPI) by using a context-dependent
deep neural network hidden Markov model
(CD-DNN-HMM). The researchers who worked with
Microsoft on the Speech API and the CD-DNN-HMM
models found that, for large-vocabulary speech
recognition, these models achieve substantially
better results than a context-dependent Gaussian
mixture model hidden Markov model
(CD-GMM-HMM) [12]. More recently, Microsoft
announced a “Historic Achievement:
Microsoft researchers reach human parity in
conversational speech recognition” [15].
V. EXPERIMENTS
A standard way to assess the quality of
ASR systems is to calculate the word error rate
(WER). The WER also makes it possible to evaluate
the individual models in an ASR system, such as the
acoustic model, the language model, and the
dictionary size. For this paper we developed a
tool to test these models in the Microsoft API,
Google API, and Sphinx-4. We used the tool to
calculate the WER by recognizing a list of
sentences that we collected as audio files with
their text transcriptions. The following sections
describe the steps we followed to design the tool
and test the Microsoft API, Google API, and
Sphinx-4.
VI. TESTING DATA
The audio files were selected from various
sources to evaluate the Microsoft API, Google API,
and Sphinx-4. According to CMUSphinx, the Sphinx-4
decoder supports only two audio sampling rates,
16000 Hz and 8000 Hz [13]. Google does not
accept the WAV format generally used with
Sphinx-4, so recognizing WAV files with Google
involves first converting them to the FLAC format.
Microsoft can recognize WAV files in any format.
We resolved these differences by having our tool
handle all audio files in the same format
(16000 Hz / 8000 Hz).
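The format constraint described above can be checked before any recognizer is called. The following is a minimal Python sketch, not the authors' tool: it reads a WAV header with the standard-library `wave` module and accepts only the two sampling rates named in the text. The requirement of mono, 16-bit samples is an additional assumption made here for illustration, and `make_silent_wav` is a hypothetical helper that only exists to produce demonstration input.

```python
import io
import wave

# Sampling rates named in the text; mono 16-bit PCM is an assumption of this sketch.
SUPPORTED_RATES = (8000, 16000)


def describe_wav(source):
    """Return (channels, sample_width_bytes, sample_rate) for a WAV path or file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth(), wav.getframerate()


def is_supported(source):
    """True if the audio is mono, 16-bit, and sampled at one of SUPPORTED_RATES."""
    channels, width, rate = describe_wav(source)
    return channels == 1 and width == 2 and rate in SUPPORTED_RATES


def make_silent_wav(rate, seconds=0.1):
    """Hypothetical helper: build a short in-memory mono 16-bit WAV of silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    for rate in (8000, 16000, 44100):
        print(rate, is_supported(make_silent_wav(rate)))
```

Files that fail the check would need resampling or conversion (for example to FLAC for Google) before recognition; that step is outside this sketch.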
Some of the audio files have been selected
from the TIMIT corpus. “The TIMIT corpus of read
speech is designed to provide speech data for
acoustic-phonetic studies and for the development
and evaluation of automatic speech recognition
systems. TIMIT contains broadband recordings of
630 speakers of eight major dialects of American
English, each reading ten phonetically rich
sentences” [14]. “The TIMIT corpus includes time-
aligned orthographic, phonetic and word
transcriptions as well as a 16-bit, 16kHz speech
waveform file for each utterance. Corpus design was
a joint effort among the Massachusetts Institute of
Technology (MIT), SRI International (SRI) and
Texas Instruments, Inc. (TI)” [9].
We also selected other audio files
from the ITU (International Telecommunication Union),
which is the United Nations specialized agency in
the field of telecommunications [10]. Examples of
the audio files are presented in Table 1
below:
Table 1. The Audio Files
VII. SYSTEM DESCRIPTION
The system was implemented in Java,
the same language used by Sphinx-4, together
with C#, which was used to test the
Microsoft API and Google API.
We also used several libraries, such as the Text to
Speech API, Graph API, and Math API, for different
tasks. The tool was connected to the
classes of Sphinx-4, the Microsoft API, and the
Google API so that they work together to recognize
the audio files. We then compared the recognition
results with the original texts of the recordings.
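The comparison loop described in this section can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the authors' Java/C# implementation: each recognizer is modeled as a plain callable that takes an audio path and returns text, and the function names `normalize` and `run_comparison` are hypothetical. No real Sphinx-4, Microsoft, or Google call is made; the stub below stands in for them.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace before comparison."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def run_comparison(recognizers, test_items):
    """Collect (reference, hypothesis) pairs per system.

    recognizers: dict mapping a system name to a callable(audio_path) -> str
    test_items:  list of (audio_path, reference_text) pairs
    """
    results = {name: [] for name in recognizers}
    for audio_path, reference in test_items:
        for name, recognize in recognizers.items():
            hypothesis = recognize(audio_path)
            results[name].append((normalize(reference), normalize(hypothesis)))
    return results


if __name__ == "__main__":
    # Stub recognizer with a canned answer; a real one would call the system under test.
    stubs = {"stub-system": lambda path: "The small boy THAT the worm on the hook."}
    items = [("example.wav", "the small boy put the worm on the hook")]
    for system, pairs in run_comparison(stubs, items).items():
        print(system, pairs)
```

Normalizing both sides first keeps differences in case or punctuation from being counted as recognition errors; whether the original tool did this is not stated in the paper.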
Figure 1. The System Interface.
VIII. EXPERIMENTAL RESULTS
The audio recordings and their original
sentences were used to test Sphinx-4, the Microsoft
API, and the Google API. Using our tool, we
tested all files and calculated the word error rate
(WER) and the word accuracy (WA) according to
the following equations.
WER = (I + D + S) / N
where N is the number of words in the reference,
I is the number of inserted words, D is the number
of deleted words, and S is the number of
substituted words.
The original text (Reference):
the small boy PUT the worm on the hook
The recognition text (Hypothesis):
the small boy THAT the worm on the hook
WER = (0 + 0 + 1) / 9 = 0.11
WA = (N - D - S) / N
The original text (Reference):
the coffee STANDARD is too high for the couch
The recognition text (Hypothesis):
the coffee STAND is too high for the couch
WA = (9 - 0 - 1) / 9 = 0.89
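The two equations can be checked mechanically. The sketch below, in Python, counts substitutions, deletions, and insertions with a standard minimum-edit-distance alignment over words and then applies the WER and accuracy formulas. It is an assumption that the authors' tool aligns words this way; the function names are illustrative.

```python
def align_counts(reference, hypothesis):
    """Return (S, D, I) from a minimum-edit-distance alignment of the two word sequences."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # cost[i][j] = (total_errors, S, D, I) for ref[:i] aligned with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, ins)          # match, no error
            else:
                diagonal = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = cost[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            # Tuples compare on total errors first, so min() picks a cheapest alignment.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, s, d, ins = cost[-1][-1]
    return s, d, ins


def word_error_rate(reference, hypothesis):
    """WER = (I + D + S) / N, with N the number of reference words."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    s, d, ins = align_counts(reference, hypothesis)
    return (ins + d + s) / n


def word_accuracy(reference, hypothesis):
    """Accuracy = (N - D - S) / N, as defined in the text."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    s, d, _ = align_counts(reference, hypothesis)
    return (n - d - s) / n


if __name__ == "__main__":
    print(round(word_error_rate("the small boy put the worm on the hook",
                                "the small boy that the worm on the hook"), 2))
    print(round(word_accuracy("the coffee standard is too high for the couch",
                              "the coffee stand is too high for the couch"), 2))
```

For the two example sentence pairs, each with one substitution in nine words, this gives a WER of about 0.11 and an accuracy of about 0.89 when rounded to two decimals.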
Figure 2. The Structure of The System.
Figure 3. The Result of Sphinx-4
Using our tool, we gathered the following
results: Sphinx-4 (37% WER), Google Speech API
(9% WER), and Microsoft Speech API (18% WER).
In the result tables, S denotes sentences, N
words, I inserted words, D deleted words,
S substituted words, CW correct words, and
EW error words.
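A single WER per system can be obtained by pooling the per-sentence counts before dividing. The paper does not say whether its overall figures were pooled or averaged per sentence, so the Python sketch below only illustrates the pooled variant, which weights long sentences more heavily. The `Counts` class and the numbers in the demonstration are made up for illustration and are not the paper's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Per-sentence (or pooled) tallies: N words, I insertions, D deletions, S substitutions."""
    words: int
    insertions: int
    deletions: int
    substitutions: int

    def __add__(self, other):
        return Counts(
            self.words + other.words,
            self.insertions + other.insertions,
            self.deletions + other.deletions,
            self.substitutions + other.substitutions,
        )

    @property
    def wer(self):
        """WER = (I + D + S) / N."""
        return (self.insertions + self.deletions + self.substitutions) / self.words

    @property
    def accuracy(self):
        """Accuracy = (N - D - S) / N."""
        return (self.words - self.deletions - self.substitutions) / self.words


def pooled(counts):
    """Sum the tallies of all sentences into one Counts value."""
    total = Counts(0, 0, 0, 0)
    for item in counts:
        total = total + item
    return total


if __name__ == "__main__":
    # Illustrative tallies only, not results from the paper.
    sentences = [Counts(9, 0, 0, 1), Counts(9, 0, 0, 1), Counts(6, 1, 0, 0)]
    total = pooled(sentences)
    print(total.words, total.wer)
```

With these made-up tallies the pooled WER is 3 / 24, whereas the mean of the three per-sentence WERs would be slightly higher, which is why the aggregation method is worth stating alongside the results.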
Table 3. The Final Results of Sphinx-4
Table 4. The Final Results of Microsoft API
Table 5. The Final Results of Google API
Table 6. Comparison Between Three Systems
Figure 4. Comparison Between Three Systems
IX. CONCLUSION
Using audio recordings selected from many
sources, together with their original sentences,
the tool that we built to test Sphinx-4, the
Microsoft API, and the Google API showed that
Sphinx-4 achieved a WER of 37%, the Microsoft API
18%, and the Google API 9%.
It can therefore be stated that the acoustic
modeling and language model of Google are superior.
REFERENCES
[1]. W. Walker, P. Lamere, P. Kwok, B. Raj, R.
Singh, E. Gouvea, P. Wolf, and J. Woelfel,
Sphinx-4: A Flexible Open Source
Framework for Speech Recognition, Sun
Microsystems, SMLI TR-2004-139, 2004,1-
14
[2]. C. Gaida, P. Lange, R. Petrick, P. Proba, A.
Malatawy, and D. Suendermann-Oeft,
Comparing Open-Source Speech Recognition
Toolkits. The Baden-Wuerttemberg Ministry
of Science and Arts as part of the research
project, 2011
[3]. K. Samudravijaya and M. Barol, Comparison
of Public Domain Software Tools for Speech
Recognition. ISCA Archive, 2013
[4]. P. Lange and D. Suendermann, Tuning
Sphinx to Outperform Google’s Speech
Recognition API, The Baden-Wuerttemberg
Ministry of Science and Arts as part of the
research project.
[5]. J. Kačur, HTK vs. Sphinx for Speech
Recognition. Department of
telecommunication FEI STU.
[6]. D. Isaacs and D. Mashao, A Comparison of
the Network Speech Recognition and
Distributed Speech Recognition Systems and
their effect on Speech Enabling Mobile
Devices, doctoral diss. Speech Technology
and Research Group, University of Cape
Town, 2010
[7]. R. Srikanth, L. Bo and J. Salsman, Automatic
Pronunciation Evaluation and
Mispronunciation Detection Using
CMUSphinx. COLING, 2012, 61-68
[8]. V. Kepuska, Wake-Up-Word Speech
Recognition. IN TECH, 2011
[9]. STAR. (2016) SRI International's Speech
Technology and Research (STAR)
Laboratory. SRI, http://www.speech.sri.com/.
[10]. ITU. (2016) Committed to connecting the
world. ITU, http://www.itu.int//.
[11]. V. Beat and J. Novet (2016) Google says its
speech recognition technology now has only
an 8% word error rate. Venture beat,
http://venturebeat.com/2015/05/28/.
[12]. Microsoft Corporation (2016) Exploring New
Speech Recognition and Synthesis APIs In
Windows Vista. Microsoft,
http://web.archive.org/.
[13]. CMUSphinx (2016) CMUSphinx Tutorial for
Developers. Carnegie Mellon University,
http://www.speech.cs.cmu.edu/sphinx/.
[14]. TIMIT (2016) TIMIT Acoustic-Phonetic
Continuous Speech Corpus. Linguistic Data
Consortium,
https://catalog.ldc.upenn.edu/LDC93S1.
[15]. Microsoft Corporation (2016) Historic
Achievement: Microsoft researchers reach
human parity in conversational speech
recognition, https://blogs.microsoft.com.

Comparing Speech Recognition Systems (Microsoft API, Google API And CMU Sphinx)

  • 1. Veton Këpuska. Int. Journal of Engineering Research and Application www.ijera.com ISSN : 2248-9622, Vol. 7, Issue 3, ( Part -2) March 2017, pp.20-24 www.ijera.com DOI: 10.9790/9622-0703022024 20|P a g e Comparing Speech Recognition Systems (Microsoft API, Google API And CMU Sphinx) Veton Këpuska1 , Gamal Bohouta2 1,2 (Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, FL, USA ABSTRACT The idea of this paper is to design a tool that will be used to test and compare commercial speech recognition systems, such as Microsoft Speech API and Google Speech API, with open-source speech recognition systems such as Sphinx-4. The best way to compare automatic speech recognition systems in different environments is by using some audio recordings that were selected from different sources and calculating the word error rate (WER). Although the WER of the three aforementioned systems were acceptable, it was observed that the Google API is superior. Keywords: Speech Recognition, Testing Speech Recognition Systems, Microsoft Speech API, Google Speech API, CMU Sphinx-4 Speech Recognition. I. INTRODUCTION Automatic Speech Recognition (ASR) is commonly employed in everyday applications. “One of the goals of speech recognition is to allow natural communication between humans and computers via speech, where natural implies similarity to the ways humans interact with each other” [8]. ASR has provided many systems that have been used to increase the interaction experience between users and computers. According to Dale Isaacs, “Today automatic speech recognition (ASR) systems and text-to-speech (TTS) systems are quite well established. These systems, using the latest technologies, are operating at accuracies in excess of 90%” [6]. Due to the increasing number of ASR systems, such as Microsoft, Google, Sphinx, WUW, HTK and Dragon, it becomes very difficult to know which of them we need. 
However, this paper shows the results of testing Microsoft API, Google API, and Sphinx4 by using a tool that has been designed and implemented using Java language with some audio recordings that were selected from a large number of sources. Also, in comparing those systems a number of various components were utilized and evaluated such as the acoustic model, the language model, and the dictionary. There are a number of commercial and open-source systems such as AT&T Watson, Microsoft API Speech, Google Speech API, Amazon Alexa API, Nuance Recognizer, WUW, HTK and Dragon [2]. Three systems were selected for our evaluation in different environments: Microsoft API, Google API, and Sphinx-4 automatic speech recognition systems. Two of the biggest companies building voice-powered applications are Google and Microsoft [4]. The Microsoft API and Google API are the commercial speech recognition systems whose code is inaccessible, and Sphinx-4 is one of the ASR systems whose code is freely available for download [3]. II. THE CMU SPHINX The Sphinx system has been developed at Carnegie Mellon University (CMU). Currently,” CMU Sphinx has a large vocabulary, speaker independent speech recognition codebase, and its code is available for download and use” [13]. The Sphinx has several versions and packages for different tasks and applications such as Sphinx-2, Sphinx-3 and Sphinx-4. Also, there are additional packages such as Pocketsphinx, Sphinxbase, Sphinxtrain. In this paper, the Sphinx-4 will be evaluated. The Sphinx-4 has been written by Java programming language. Moreover,” its structure has been designed with a high degree of flexibility and modularity” [13]. According to Juraj Kačur, “The latest Sphinx-4 is written in JAVA, and Main theoretical improvements are: support for finite grammar called Java Speech API grammar, it doesn’t impose the restriction using the same structure for all models” [13] [5]. 
There are three main components in the Sphinx-4 structure, which includes the Frontend, the Decoder and the Linguist. According to Willie Walker and other who have worked in Sphinx-4, "we created a number of differing implementations for each module in the framework. For example, the Frontend implementations support MFCC, PLP, and LPC feature extraction; the Linguist implementations support a variety of language models, including CFGs, FSTs, and N-Grams; and the Decoder supports a variety of Search Manager implementations" [1]. Therefore, Sphinx-4 has the most recent version of an HMM-based speech and a RESEARCH ARTICLE OPEN ACCESS
  • 2. Veton Këpuska. Int. Journal of Engineering Research and Application www.ijera.com ISSN : 2248-9622, Vol. 7, Issue 3, ( Part -2) March 2017, pp.20-24 www.ijera.com DOI: 10.9790/9622-0703022024 21|P a g e strong acoustic model by using HHM model with training large vocabulary [2]. III. THE GOOGLE API Google has improved its speech recognition by using a new technology in many applications with the Google App such as Goog411, Voice Search on mobile, Voice Actions, Voice Input (spoken input to keypad), Android Developer APIs, Voice Search on desktop, YouTube transcription and Translate, Navigate, TTS. After Google, has used the new technology that is the deep learning neural networks, Google achieved an 8 percent error rate in 2015 that is reduction of more than 23 percent from year 2013. According to Pichai, senior vice president of Android, Chrome, and Apps at Google, “We have the best investments in machine learning over the past many years. Indeed, Google has acquired several deep learning companies over the years, including DeepMind, DNNresearch, and Jetpac”[11]. IV. THE MICROSOFT API Microsoft has developed the Speech API since 1993, the company hired Xuedong (XD) Huang, Fil Alleva, and Mei-Yuh Hwang “three of the four people responsible for the Carnegie Mellon University Sphinx-II speech recognition system, which achieved fame in the speech world in 1992 due to its unprecedented accuracy. the first Speech API is (SAPI) 1.0 team in 1994” [12]. Microsoft has continued to develop the powerful speech API and has released a series of increasingly powerful speech platforms. The Microsoft team has released the Speech API (SAPI) 5.3 with Windows Vista which was very powerful and useful. On the developer front, "Windows Vista includes a new WinFX® namespace, System.Speech. This allows developers to easily speech-enable Windows Forms applications and apps based on the Windows Presentation Framework"[12]. 
Microsoft has focused on increasing emphasis on speech recognition systems and improved the Speech API (SAPI) by using a context- dependent deep neural network hidden Markov model (CD-DNN-HMM). According to the researchers who have worked with Microsoft to improve the Speech API and the CD-DNN-HMM models, they determined that the large-vocabulary speech recognition that achieves substantially better results than a Context-Dependent Gaussian Mixture Model Hidden Markov mode12]. Just recently Microsoft announced “Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition” [15]. V. EXPERIMENTS The best way to test the quality of various ASR systems is to calculate the word error rate (WER). According to the WER, we can also test the different models in the ASR systems, such as the acoustic model, the language model, and the dictionary size. However, in this paper we have developed a tool that we have used to test these models in Microsoft API, Google API, and Sphinx- 4. Also, we have calculated the WER by using this tool to recognize a list of sentences, which we collected in the form of audio files and text translation. In this paper, we follow these steps to design the tool and test Microsoft API, Google API, and Sphinx-4. VI. TESTING DATA The audio files were selected from various sources to evaluate the Microsoft API, Google API, and Sphinx-4. According to CMUSphin, Sphinx-4's decoder supports only one of the two specific audio formats (16000 Hz / 8000 Hz) [13]. Also, Google does not recognize the WAV format generally used with Sphinx-4. Part of the process of recognizing WAV files with Google involves converting the WAV files to the FLAC format. Microsoft can recognize any WAV files format. However, we solved this problem by making our tool recognize all audio files in the same format (16000 Hz / 8000 Hz). 
Some of the audio files have been selected from the TIMIT corpus.” The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences” [14]. “The TIMIT corpus includes time- aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI)” [9]. Also, we have selected other audio files from ITU (International Telecommunication Union) which is the United Nations Specialized Agency in the field of telecommunications [10]. Example of some of the audio files are presented in the table1 below:
Table 1. The Audio Files
VII. SYSTEM DESCRIPTION
The system was designed using the Java language, which is the language used in Sphinx-4, as well as C#, which was used to test the Microsoft API and Google API. We also used several libraries, such as Text to Speech API, Graph API and Math API, for different tasks. The tool was connected with the classes of Sphinx-4, Microsoft API and Google API so that they work together to recognize the audio files. We then compared the recognition results with the original recording texts.
Figure 1. The System Interface.
VIII. EXPERIMENTAL RESULTS
The audio recordings with the original sentences were used to test Sphinx-4, Microsoft API, and Google API. Using our tool, we tested all files and calculated the word error rate (WER) and accuracy according to these equations.
WER = (I + D + S) / N
where N is the number of words in the reference, I words were inserted, D words were deleted, and S words were substituted. For example:
The original text (Reference): the small boy PUT the worm on the hook
The recognition text (Hypothesis): the small boy THAT the worm on the hook
WER = (0 + 0 + 1) / 9 = 0.11
Accuracy = (N - D - S) / N
For example:
The original text (Reference): the coffee STANDARD is too high for the couch
The recognition text (Hypothesis): the coffee STAND is too high for the couch
WA = (9 - 0 - 1) / 9 = 0.89
Figure 2. The Structure of The System.
Figure 3. The Result of Sphinx-4
Using our tool, we gathered the following results: Sphinx-4 (37% WER), Google Speech API (9% WER) and Microsoft Speech API (18% WER). In the result tables, S denotes sentences, N words, I inserted words, D deleted words, S substituted words, CW correct words, and EW error words.
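The WER and accuracy equations of this section can be illustrated with a short Python sketch. It is not the authors' Java/C# tool: it assumes that I, D and S are taken from a minimum-edit (Levenshtein) word alignment, that comparison is case-insensitive, and that words are separated by whitespace. The function names `edit_counts`, `wer` and `word_accuracy` are hypothetical.

```python
def edit_counts(reference, hypothesis):
    """Return (S, D, I) for a minimum-edit word alignment of hypothesis to reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # cost[i][j] = (total edits, S, D, I) needed to turn ref[:i] into hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, ins)              # words match, no edit
            else:
                diagonal = (t + 1, s + 1, d, ins)      # substitution
            t, s, d, ins = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = cost[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            # Keep the candidate with the fewest total edits.
            cost[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, s, d, ins = cost[len(ref)][len(hyp)]
    return s, d, ins


def wer(reference, hypothesis):
    """WER = (I + D + S) / N, with N the number of reference words."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    s, d, ins = edit_counts(reference, hypothesis)
    return (ins + d + s) / n


def word_accuracy(reference, hypothesis):
    """Accuracy = (N - D - S) / N, with N the number of reference words."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    s, d, _ = edit_counts(reference, hypothesis)
    return (n - d - s) / n


if __name__ == "__main__":
    ref = "the small boy PUT the worm on the hook"
    hyp = "the small boy THAT the worm on the hook"
    print(round(wer(ref, hyp), 2), round(word_accuracy(ref, hyp), 2))
```

For the two example sentence pairs in this section, the sketch finds one substitution in nine reference words, which matches the worked values. A system-level WER would sum I, D, S and N over all sentences before dividing, rather than averaging per-sentence rates.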
Table 3. The Final Results of Sphinx-4
Table 4. The Final Results of Microsoft API
Table 5. The Final Results of Google API
Table 6. Comparison Between Three Systems
Figure 4. Comparison Between Three Systems
IX. CONCLUSION
In this paper, we built a tool to test Sphinx-4, Microsoft API, and Google API using audio recordings selected from several sources together with their original sentences. The tool showed that Sphinx-4 achieved 37% WER, Microsoft API achieved 18% WER, and Google API achieved 9% WER. Therefore, it can be stated that the acoustic modeling and language model of Google are superior.
REFERENCES
[1]. W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, Sphinx-4: A Flexible Open Source Framework for Speech Recognition, Sun Microsystems, SMLI TR-2004-139, 2004, 1-14
[2]. C. Gaida, P. Lange, R. Petrick, P. Proba, A. Malatawy, and D. Suendermann-Oeft, Comparing Open-Source Speech Recognition Toolkits. The Baden-Wuerttemberg Ministry of Science and Arts as part of the research project, 2011
[3]. K. Samudravijaya and M. Barol, Comparison of Public Domain Software Tools for Speech Recognition. ISCA Archive, 2013
[4]. P. Lange and D. Suendermann, Tuning Sphinx to Outperform Google’s Speech Recognition API, The Baden-Wuerttemberg Ministry of Science and Arts as part of the research project.
[5]. J. Kačur, HTK vs. Sphinx for Speech Recognition. Department of telecommunication FEI STU.
[6]. D. Isaacs and D. Mashao, A Comparison of the Network Speech Recognition and Distributed Speech Recognition Systems and their effect on Speech Enabling Mobile Devices, doctoral diss. Speech Technology and Research Group, University of Cape Town, 2010
[7]. R. Srikanth, L. Bo and J. Salsman, Automatic Pronunciation Evaluation and Mispronunciation Detection Using CMUSphinx. COLING, 2012, 61-68
[8]. V. Kepuska, Wake-Up-Word Speech Recognition. IN TECH, 2011
[9]. STAR. (2016) SRI International's Speech Technology and Research (STAR) Laboratory. SRI, http://www.speech.sri.com/.
[10]. ITU. (2016) Committed to connecting the world. ITU, http://www.itu.int//.
[11]. V. Beat and J. Novet (2016) Google says its speech recognition technology now has only an 8% word error rate. Venture beat, http://venturebeat.com/2015/05/28/.
[12]. Microsoft Corporation (2016) Exploring New Speech Recognition and Synthesis APIs In Windows Vista. Microsoft, http://web.archive.org/.
[13]. CMUSphinx (2016) CMUSphinx Tutorial for Developers. Carnegie Mellon University, http://www.speech.cs.cmu.edu/sphinx/.
[14]. TIMIT (2016) TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, https://catalog.ldc.upenn.edu/LDC93S1.
[15]. Microsoft Corporation (2016) Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition, https://blogs.microsoft.com.