Real-Time Voice Actuation

Team Jarvis
Final Presentation
Pragya Agrawal
Dominic Calabrese
David Martel
Nathan Sawicki

Project Goals
• Design and build a real-time speech recognition system
• Build it on embedded hardware
• Use the source-filter model of speech and a Support Vector Machine
classifier to recognize the commands “zero” through “nine”
• Finished system executes in real time and uses GPIO-based actuation
to demonstrate functional voice recognition

Source-Filter Model of Speech
• Word characterization should be
independent of volume, pitch, and duration
of the word
• Simplify the speech production model to two parts:
1. Source – vibration of the vocal cords
2. Filter – vocal tract (i.e., positioning of
tongue, mouth, etc.)
• Accurately modeling the filter provides a
basis for word recognition [4]
Broad sweeps of the spectrum (formants) result
from the filter configuration. Rapidly varying
peaks come from the harmonics of the source.

All-Pole Filter Coefficients
• First n filter coefficients can be roughly
calculated using the first n time shifts of
the autocorrelation of a signal
• Levinson-Durbin recursion algorithm
calculates all-pole filter coefficients from
autocorrelation
• To capture the spectral envelope, use
~10 filter coefficients [5]
Too many coefficients lead to over-fitting of the
curve
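
The autocorrelation-to-coefficients step above can be sketched as follows. This is a minimal illustration, not the team's C5515 or MATLAB code: the function names are hypothetical, and it assumes the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p and a non-zero r[0].

```python
def autocorrelation(x, max_lag):
    """Return [r[0], ..., r[max_lag]] for the sample sequence x (no windowing)."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Levinson-Durbin recursion.

    r     -- autocorrelation values r[0..order], with r[0] != 0 (assumed)
    order -- number of all-pole filter coefficients to compute
    Returns (a, err): a = [1, a1, ..., a_order] and the final
    prediction-error power err.
    """
    a = [1.0] + [0.0] * order
    err = float(r[0])
    for i in range(1, order + 1):
        # Reflection coefficient for stage i
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err
        # Update coefficients 1..i from the stage i-1 solution
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        # The prediction error shrinks at every stage
        err *= (1.0 - k * k)
    return a, err
```

For ~10 coefficients, one would call `levinson_durbin(autocorrelation(samples, 10), 10)`, so only lags 0 through 10 of the autocorrelation are needed.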

Cepstral Coefficients
• Cepstrum is useful in separating the source
and filter
• Cepstral coefficients are a very compact
representation of the spectral envelope and
are highly uncorrelated
• Filter coefficients are too sensitive to
numerical precision
• Better to transform LP coefficients into
cepstral coefficients[5]
Cepstral Analysis on source filter model
(a) DFT (b) log magnitude of DFT (c) IDFT
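
The LP-to-cepstral transform mentioned above has a well-known recursive form. The sketch below shows it under stated assumptions: the function name is hypothetical, the filter is 1/A(z) with A(z) = 1 + a1 z^-1 + ... + ap z^-p, and the gain term c[0] is ignored. The slides do not say which convention the project used.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert all-pole coefficients to cepstral coefficients c[1..n_ceps].

    a      -- [1, a1, ..., ap] for A(z) = 1 + sum_k a_k z^-k (assumed convention)
    n_ceps -- number of cepstral coefficients to return
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)  # c[0] (log-gain term) is not computed here
    for n in range(1, n_ceps + 1):
        acc = 0.0
        # Only terms with 1 <= n - k <= p contribute
        for k in range(max(1, n - p), n):
            acc += k * c[k] * a[n - k]
        c[n] = -acc / n
        if n <= p:
            c[n] -= a[n]
    return c[1:]
```

As a sanity check, for A(z) = 1 - 0.5 z^-1 the cepstrum of 1/A(z) is 0.5^n / n, which is what the recursion returns.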

Support Vector Machine Learning
• Support Vector Machine (SVM) is a supervised
learning algorithm used for classification and
regression
• We use a multi-class Support Vector Machine
• Our algorithm uses the one-against-one method to
construct k(k-1)/2 classifiers (k = number of
classes), one SVM for each pair of classes;
with k = 10 digits this gives 45 classifiers
• LIBSVM, an integrated software package for multi-class
support vector classification, is used [6]
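
The one-against-one scheme can be illustrated with a small voting sketch. This is not LIBSVM's API: `pairwise_decide` is a hypothetical stand-in for a trained two-class SVM, and the tie-break rule (earliest class wins) is an assumption made for the example.

```python
from itertools import combinations


def count_pairwise_classifiers(k):
    """Number of two-class classifiers needed for k classes."""
    return k * (k - 1) // 2


def one_vs_one_predict(pairwise_decide, classes, x):
    """Classify x by majority vote over all class pairs.

    pairwise_decide(a, b, x) must return either a or b (hypothetical
    stand-in for one trained two-class SVM).
    """
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, x)] += 1
    # Ties go to the class that appears first in `classes` (assumption)
    return max(classes, key=lambda c: votes[c])


def toy_decide(a, b, x):
    """Toy decision rule for demonstration only: pick the nearer label."""
    return a if abs(x - a) <= abs(x - b) else b
```

With the ten digits as classes, `one_vs_one_predict(toy_decide, list(range(10)), 6.8)` evaluates all 45 pairs, and the class that wins the most pairwise contests is returned.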

Library
• Stored autocorrelation coefficients calculated on the C5515
• Calculated cepstral coefficients in MATLAB
• Three male speakers with combined 1920 recordings
• 64 instances of each digit for each speaker
Confusion matrix, 9 coefficients:
9 Coef    0    1    2    3    4    5    6    7    8    9
0       154    0    0    4    0    0   22    6    0    6
1         0  166    1    1   23    1    0    0    0    0
2         1    0  168   22    0    0    1    0    0    0
3        13    0    6  172    0    0    1    0    0    0
4         1    9    0    0  181    0    0    1    0    0
5         0    1    0    0    0  190    0    1    0    0
6         4    0    1    0    0    0  187    0    0    0
7         1    0    0    0    1    0    0  189    0    1
8         0    0    1    0    0    0    0    0  191    0
9         0    0    1    0    0    0    0    2    0  189
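
The table can be summarized with a few lines of arithmetic. The sketch assumes that rows are the true digit and columns the classifier output. The slide does not state this, but each row sums to 192 (3 speakers × 64 instances), which is consistent with that reading. Function names are hypothetical.

```python
# Values copied from the table above (row = spoken digit, assumed)
CONFUSION = [
    [154, 0, 0, 4, 0, 0, 22, 6, 0, 6],
    [0, 166, 1, 1, 23, 1, 0, 0, 0, 0],
    [1, 0, 168, 22, 0, 0, 1, 0, 0, 0],
    [13, 0, 6, 172, 0, 0, 1, 0, 0, 0],
    [1, 9, 0, 0, 181, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 190, 0, 1, 0, 0],
    [4, 0, 1, 0, 0, 0, 187, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 189, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 191, 0],
    [0, 0, 1, 0, 0, 0, 0, 2, 0, 189],
]


def row_totals(m):
    """Number of recordings per row."""
    return [sum(row) for row in m]


def correct_count(m):
    """Sum of the diagonal: recordings assigned to their own digit."""
    return sum(m[i][i] for i in range(len(m)))


def overall_accuracy(m):
    """Fraction of all recordings that fall on the diagonal."""
    return correct_count(m) / sum(row_totals(m))


def per_digit_rate(m):
    """Diagonal entry divided by row total, one value per digit."""
    return [m[i][i] / sum(m[i]) for i in range(len(m))]
```

Under that reading, 1787 of the 1920 recordings land on the diagonal, about 93%. The weakest row is "zero", most often confused with "six".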

Rejected Methods
• Classification based on correlation of cepstral coefficients
    • Took maximum correlation between new signal and library
    • Neither robust to small variations nor scalable
• Classification using SVM on CRM database
    • Words cut off early in database or contaminated by other words
    • Recording conditions do not match our method

C5515: Vocalization Identification
• Implemented word vs. non-word identification
• Grab a frame of 256 samples, compute the RMS of the
frame, and compare it to a threshold
    • If RMS > Threshold
        • Accumulate frame data
    • Else if RMS < Threshold and Frames Acquired > 3
        • Compute autocorrelation
        • Transmit data
    • Else
        • Reset stored data
• Specific values determined experimentally
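
The frame logic above can be sketched as a small generator. This is an illustration in Python, not the C5515 firmware: the names are hypothetical, the threshold is a caller-supplied placeholder (the slides only say the values were found experimentally), and resetting the buffer after a word is emitted is an assumption.

```python
import math

FRAME_LEN = 256   # samples per frame, from the slide
MIN_FRAMES = 3    # a word must span more than this many frames


def frame_rms(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def segment_words(frames, threshold, min_frames=MIN_FRAMES):
    """Yield one list of samples per detected vocalization."""
    stored = []
    n_frames = 0
    for frame in frames:
        rms = frame_rms(frame)
        if rms > threshold:
            # Loud frame: keep accumulating
            stored.extend(frame)
            n_frames += 1
        elif rms < threshold and n_frames > min_frames:
            # Quiet frame after a long enough burst: hand the word on
            yield stored
            stored, n_frames = [], 0   # reset after emitting (assumption)
        else:
            # Quiet frame after too short a burst: discard it
            stored, n_frames = [], 0
```

Each yielded sample list would then go to the autocorrelation step and on to the UART. Bursts of three frames or fewer are dropped, which filters out clicks and short noises.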

C5515: UART Transmission
• Transmit autocorrelation coefficients
    • UART runs at 115200 baud, 8 data bits, no
    parity, 1 stop bit
    • Data is signed 16-bit
    • Bit masking on the C5515, reconstruction
    on the Raspberry Pi
• BlueSMiRF Bluetooth-UART pipes
    • Abstracts wireless transmission
    • Looks like UART to the microcontroller
    • Effectively plug and play
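
Since the UART moves 8 bits at a time, each signed 16-bit value has to be split into two bytes and rebuilt on the receiver. A minimal sketch of the masking and reconstruction, with hypothetical function names and an assumed byte order (high byte first), which the slides do not specify:

```python
def split_int16(value):
    """Split a signed 16-bit value into (high_byte, low_byte)."""
    if not -32768 <= value <= 32767:
        raise ValueError("value does not fit in signed 16 bits")
    u = value & 0xFFFF              # two's-complement bit pattern
    return (u >> 8) & 0xFF, u & 0xFF


def join_int16(high, low):
    """Rebuild the signed 16-bit value from two received bytes."""
    u = ((high & 0xFF) << 8) | (low & 0xFF)
    # Bit 15 set means the value is negative in two's complement
    return u - 0x10000 if u & 0x8000 else u
```

The sign handling in `join_int16` is the step that is easy to miss: without it, negative autocorrelation values would arrive as large positive numbers.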

C5515: Major Challenges Faced
• Autocorrelation coefficient overflow
    • Function generator provided too large a voltage
    • Forced the autocorrelation to overflow
    • Bit-shifting worked temporarily, but reduced data precision: poor
    classifier performance and threshold variability
    • Solution: switched to a microphone
• BlueSMiRF setup
    • Configuring the BlueSMiRF requires commands at precise times
    • Solution: implemented a long delay function on the C5515

Raspberry Pi: Word Classification
• Implemented all-pole model of speech
vocalization for classification
    • Computes LPC coefficients from the
    autocorrelation
    • Converts LPC coefficients into cepstral
    coefficients
    • LIBSVM multistage classifier
• Algorithm written in mixed C/C++
    • LPC and cepstral functions code-generated
    from MATLAB
    • Wrapper is hand-written code
    • Waits for autocorrelation input from UART

Raspberry Pi: Actuation
• State Machine implemented
• Displays infamous EECS 452 Fall 2014 Image on sequence of “452”
• Displays special Raspberry Pi Image on “314”
• GPIO array drives LED Binary Counter
• Capable of implementing more complicated functions
• Planned for Coffee Machine Actuation, ran out of time
• Renders graphics using OpenVG Library
• Displays Startup Image
• Displays Digit Image on Classification

Raspberry Pi: Major Challenges Faced
• Initially planned to implement the code with a Simulink model
    • Worked great for the algorithm
    • Did not work well for I/O
    • S-Functions are tricky to work with
    • Solution
        • Code-generate the core algorithm
        • Hand-write the wrapper
• MATLAB Coder Toolbox
    • Converts MATLAB code into ANSI C code, with processor-specific
    optimizations available
    • Extremely useful for complex algorithms
    • Very finicky to configure properly
    • Solution: Study, study, study

Looking Forward
• Coffee Machine Actuation
• Build Better Library
    • More speakers
    • Female speakers
    • Non-Midwestern speakers
• Investigate Tuning SVM Parameters

References
[1] http://www.spectrumdigital.com/product_info.php?cPath=31&products_id=238
[2] https://www.sparkfun.com/products/12577
[3] http://www.adafruit.com/product/1914
[4] Dutoit, T., Moreau, N., Kroon, P., How is speech processed in a cell
phone conversation?, 2009
[5] Rabiner, L., Schafer, R., Introduction to Digital Speech Processing,
2007
[6] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
