2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
DIY Jarvis
Wes Widner, Engineering Manager, CrowdStrike
The promise…
The promise of voice interaction
with computers is very old. Just
look at its ubiquity in our sci-fi.
Humans weren’t designed to type
The conversion from WPM to bits per second assumes an average of 5 characters
per word and 2 bits per character, giving the formula:
1 wpm = ((1/60) * 5 * 2) bps, so 1 wpm ≈ 0.166 bps
▪ The average person types 6.83 bits per second (41 wpm)
▪ The average person speaks 39.15 bits per second (236 wpm)
▪ Voice first is ~3x faster than typing, even after accounting for error correction
▪ The average person hears at 75 bits per second (450 wpm)
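The conversion above is easy to sanity-check in code; the 0.166 constant is just 5 characters × 2 bits ÷ 60 seconds:

```python
def wpm_to_bps(wpm: float) -> float:
    """Convert words per minute to bits per second.

    Assumes 5 characters per word and 2 bits per character,
    per the formula on this slide.
    """
    return wpm * 5 * 2 / 60

# 41 wpm typing is ~6.83 bps; 450 wpm listening is 75 bps
```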
The disappointment
A new hope
▪ We’ll build our own!
▪ Constraints:
▪ Everything must be local
▪ Needs to work in a closed
ecosystem
▪ No magical cloud services
▪ All data needs to remain local
▪ Must be extensible
▪ Must be scalable
▪ Must be easy to understand
▪ Must be easy to implement
General design
System design
Implementation
The service contracts
service AudioProcessor {
  rpc Subscribe (stream ProcessAudioRequest) returns (stream ProcessAudioResponse) {}
}

service EventResponder {
  rpc Subscribe (stream TextEventRequest) returns (stream TextEventResponse) {}
}

service AudioOutput {
  rpc Subscribe (stream OutputRequest) returns (stream OutputResponse) {}
}
Step 1 - Capture the voice
This is harder than you may think..
The canvas
▪ The human voice ranges between
300 and 3400 Hz
▪ Human hearing ranges between
20 (really 30) and 20000 Hz
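These ranges explain why the 16 kHz capture rate used later is enough: by the Nyquist theorem a sample rate captures frequencies up to half itself, comfortably above the 3.4 kHz top of the voice band. A minimal sketch of that check:

```python
VOICE_BAND_HZ = (300, 3400)  # human voice range from the slide


def nyquist_limit(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def covers_voice_band(sample_rate_hz: int) -> bool:
    """True if sampling at this rate preserves the whole voice band."""
    return nyquist_limit(sample_rate_hz) >= VOICE_BAND_HZ[1]
```

So 16 kHz covers voice (Nyquist limit 8 kHz), while full-spectrum hearing out to 20 kHz would need CD-style rates of 40 kHz or more.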
mic_capture
▪ Capture clips of audio to
send to the audio event
processor
▪ Source audio is either a
microphone or slices of a
wav file (wav_slicer)
message ProcessAudioRequest {
  string RequestId = 1;
  string SourceId = 2;
  uint64 AudioStartTime = 3;
  bytes AudioData = 5;
}

message ProcessAudioResponse {
  string requestId = 1;
  ProcessAudioResponseCode ResponseCode = 2;
  uint64 AudioStartTime = 3;
  string Output = 4;
}
The audio subsystem
▪ We can’t do anything until we
understand the basics
▪ Advanced Linux Sound Architecture
is the driver layer
▪ PulseAudio is how sound is shared
around the system
▪ It pays to know this system very
well, especially the
pactl command
▪ PulseAudio has native support for
sharing sound over TCP
▪ This means we can also set up our
own FOSS Sonos system
▪ Or, become Batman!
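As an illustration of the TCP-sharing point, the commands might look like the following (the subnet, server IP, and wav file are placeholders; adjust to your network):

```shell
# List capture sources so we know what to record from
pactl list short sources

# On the machine with speakers: accept audio over TCP from the LAN
pactl load-module module-native-protocol-tcp auth-ip-acl="127.0.0.1;192.168.1.0/24"

# On any other machine: point clients at that server and play a file
PULSE_SERVER=tcp:192.168.1.10 paplay chime.wav
```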
(Diagram: the audio stack; a sound application sits on PulseAudio, which sits on the ALSA driver layer and the audio hardware. Core Audio and OpenSL ES are platform equivalents.)
Mic check 1, 2, 1, 2
The right way to do a mic check is with
Harvard Sentences
72 ten-sentence stanzas that are
phonetically balanced, using specific
phonemes at the same frequency they
appear in English
"Tea served from the brown jug is
tasty" misheard as "Tea soaked in
Lebron James is tasty."
Microphones everywhere
▪ I start with either the Raspberry Pi
Zero WH or Raspberry Pi A+
▪ The biggest challenge to portability
is the changing noise landscape
▪ Your choice and placement of a
microphone is key
▪ More advanced work in this area
relies on DSPs and FPGAs
RaspiAudio Mic+ HAT
raspiaudio.com
Far field options
Seeed Studio ReSpeaker
seeedstudio.com
Matrix Voice
matrix.one
Technical challenges / opportunities
▪ Limited to 16 kHz mono uncompressed PCM (WAV) audio
▪ Need better voice envelope detection
▪ No filters or enhancements are being used
▪ No adaptive acoustical measurements are made
▪ I have no idea how to automatically package binaries for Raspbian
Step 2 - Turn voice into text
Aka throwing wav files at the black box
deepspeech
▪ Takes an audio clip and
passes it through Mozilla’s
DeepSpeech engine
▪ My implementation uses a
pre-trained English model
▪ This component takes up a lot
of memory and is CPU
intensive
▪ Should be able to run on a GPU
system like Nvidia’s JetPack
message ProcessAudioResponse {
  string requestId = 1;
  ProcessAudioResponseCode ResponseCode = 2;
  uint64 AudioStartTime = 3;
  string Output = 4;
}

message TextEventRequest {
  string RequestId = 1;
  string SourceId = 2;
  string Text = 4;
}
Hunting for phonemes
The acoustic space around us is pretty busy.
We need to:
1. Separate out formant frequencies using
a fast Fourier transform
2. Determine what signals contain
phonemes - the perceptually distinct
units of sound in a specified language
that distinguish one word from another
3. Match phoneme fragments to text
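Step 1 can be sketched with NumPy: synthesize two tones standing in for formants and recover their frequencies from the FFT peaks. The 700 Hz / 1200 Hz values below are illustrative, not measured formant data:

```python
import numpy as np


def dominant_frequencies(signal, sample_rate, top_n=2):
    """Return the top_n peak frequencies (Hz) from a real FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    # Indices of the strongest bins, mapped back to frequencies
    return sorted(freqs[np.argsort(spectrum)[-top_n:]])


# Synthetic "vowel": two tones standing in for formants F1 and F2
rate = 16_000
t = np.arange(rate) / rate  # one second of samples
vowel = np.sin(2 * np.pi * 700 * t) + np.sin(2 * np.pi * 1200 * t)
```

With a one-second window the bins land exactly on 700 and 1200 Hz; real speech needs short overlapping windows and peak-picking that tolerates leakage.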
Formants
▪ Phonemes have a unique formant
mapping between F1 and F2
▪ Our ears have evolved such that the
fundamental frequency doesn’t matter
What comes next?
The traditional approach (e.g. CMU Sphinx) is
to use a Hidden Markov Model to match
phoneme fragments to text / words
Simple Markov Model
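A toy illustration of the HMM idea (not CMU Sphinx's actual model): Viterbi decoding picks the most likely phoneme sequence for a series of acoustic observations. The states, observation labels, and probabilities below are all made up for the example:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence."""
    # Probability and best path ending in each state, at the first observation
    layers = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (layers[-1][p][0] * trans_p[p][s] * emit_p[s][o], layers[-1][p][1])
                for p in states
            )
            layer[s] = (prob, path + [s])
        layers.append(layer)
    return max(layers[-1].values())[1]


# Toy model: two phoneme states, two kinds of acoustic frames
STATES = ("T", "IY")
START = {"T": 0.6, "IY": 0.4}
TRANS = {"T": {"T": 0.3, "IY": 0.7}, "IY": {"T": 0.2, "IY": 0.8}}
EMIT = {"T": {"burst": 0.9, "tone": 0.1}, "IY": {"burst": 0.2, "tone": 0.8}}
```

Decoding ("burst", "tone", "tone") with this model yields the path T → IY → IY, i.e. a plosive followed by a sustained vowel.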
Mozilla DeepSpeech
▪ Code at
https://github.com/mozilla/DeepSpeech
▪ TensorFlow implementation of Baidu’s
DeepSpeech neural network architecture
▪ Its TensorFlow core means it can run on CPU or GPU
▪ Native clients are available in C++,
Go, Rust, Python, .NET, and Java
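A minimal sketch of feeding a clip to the Python client. DeepSpeech's stt() expects 16-bit PCM samples at the model's rate; the model filename below is a placeholder, and the deepspeech import is commented out since it needs the package plus a downloaded model:

```python
import wave

import numpy as np


def load_pcm16(path):
    """Read a WAV file into the int16 sample buffer DeepSpeech expects."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
        return np.frombuffer(frames, dtype=np.int16), w.getframerate()


# Hedged usage sketch (model path is a placeholder):
# from deepspeech import Model
# model = Model("deepspeech-english.pbmm")
# audio, rate = load_pcm16("clip.wav")
# print(model.stt(audio))
```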
Audrey (Bell Labs), 1952
Technical challenges / opportunities
▪ This needs to utilize a GPU to be performant
▪ Since it sits in the middle of the stack, it needs better reconnect handling
▪ I’m still not sure how thread-safe the underlying libraries are
▪ Even though it breaks the design constraints, I’d love to see a cloud service
integration implemented against Google, Amazon, and IBM voice recognition
services
Step 3 - Mine text for intent
Basically, creating a chat bot
text_event_processor
▪ Take text as input, parse
it, and produce some
sort of output
▪ Eventually this will
involve the use of
Natural Language
Processing to parse
semantics
▪ For now, it simply
responds to commands
message TextEventRequest {
  string RequestId = 1;
  string SourceId = 2;
  string Text = 4;
}

message TextEventResponse {
  string RequestId = 1;
  string SourceId = 2;
  ProcessAudioResponseCode responseCode = 3;
}
Language taxonomy
▪ Commands are defined in YAML
▪ The Zygomys library is used for
Go “scripting”

Commands:
  - command: "I'm home"
    return-output: false
    action-format: zygomys
    action: >
      (system "echo welcome home")
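The matching itself can be sketched without the YAML or Zygomys layers. Below, COMMANDS mirrors the parsed form of the YAML above (what a YAML loader would return), and the exact-match lookup stands in for the processor's current command handling:

```python
# Parsed form of the YAML above (what a YAML loader would return)
COMMANDS = [
    {
        "command": "I'm home",
        "return-output": False,
        "action-format": "zygomys",
        "action": '(system "echo welcome home")',
    },
]


def match_command(text):
    """Return the action for an exactly-matching command phrase, else None."""
    text = text.strip().lower()
    for entry in COMMANDS:
        if entry["command"].lower() == text:
            return entry["action"]
    return None
```

Exact matching is what makes the component command-and-response rather than chat-like; intent parsing would replace this lookup.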
Technical challenges / opportunities
▪ Variables are not yet implemented
▪ Sentences need to be parsed for intent and meaning
▪ This component needs to be more chat-like as opposed to command and
response
Step 4 - Send a response
Do it all in reverse
▪ This is the least developed part of
the system. And by least I mean
unimplemented
▪ The idea is to take either text in
Speech Synthesis Markup Language
(SSML) syntax, which is sent to
espeak-ng, or a URL to an
audio file or stream (eg an RTSP
source)
output_event_processor
message OutputRequest {
  string SinkId = 1;
  string RequestId = 2;
  string Text = 3;
  string MediaURL = 4;
}

message OutputResponse {
  uint64 RequestId = 1;
  string SinkId = 2;
  AudioOutputResponseCode ResponseCode = 3;
}
Text to speech
▪ We’ll use espeak-ng to synthesize
speech
▪ Uses n-grams and Markov models
to turn text back into phonemes
▪ The ng part is a second phase for
adding prosody to the overall word
▪ We can either feed it raw text or
Speech Synthesis Markup
Language
▪ Allows finer control over prosody
and dictation
▪ The current SSML version is 1.1
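For reference, a small SSML 1.1 document of the kind espeak-ng can interpret with its -m (markup) option; the contents are illustrative:

```xml
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Welcome <emphasis>home</emphasis>.
  <break time="300ms"/>
  <prosody rate="slow" pitch="low">All systems are local.</prosody>
</speak>
```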
Wolfgang von Kempelen (The Turk) 1769
Demo time!
make deepspeech-service
Two demos
▪ Simple wav slicing
▪ Live mic and text processing
Both start with deepspeech-service
Sample data from Brian Roemmele
Remix audio for wav slicing
$ sox --i 5d91700b51b062790d7b1674.mp3
Input File : '5d91700b51b062790d7b1674.mp3'
Channels : 2
Sample Rate : 44100
Precision : 16-bit
Duration : 00:33:50.83 = 89559735 samples = 152312 CDDA sectors
File Size : 65.0M
Bit Rate : 256k
Sample Encoding: MPEG audio (layer I, II or III)
$ sox 5d91700b51b062790d7b1674.mp3 -r 16k 5d91700b51b062790d7b1674.wav remix 1-2
The current limitation is a 16 kHz mono sample rate, so we’ll need to resample and remix the audio
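The sox listing's numbers are internally consistent; a quick worked check using the values it printed:

```python
# Values copied from the sox --i listing above
samples = 89_559_735
rate = 44_100  # Hz

seconds = samples / rate
minutes, secs = divmod(seconds, 60)
# 33 minutes and ~50.83 seconds, matching sox's 00:33:50.83

# CDDA sectors are 1/75 of a second each
cdda_sectors = int(seconds * 75)  # matches sox's 152312
```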
wav_slicer
export FILE=/data/5d91700b51b062790d7b1674.wav
export DATA_DIR=~/Downloads
make wav-slicer
mic_capture
export DURATION=3
make mic-capture
docker logs -f diy-jarvis-mic-capture
Thanks for coming!
Code and references are available at:
▪ https://github.com/kai5263499/diy-jarvis
▪ It’s Hacktoberfest, send me PRs!
Contact me at:
▪ Twitter - @kai5263499
▪ Email - wes.widner@crowdstrike.com
PS We’re hiring engineers! Talk to us at booth #50 right outside

Building your own open-source voice assistant
