2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
DIY Jarvis
Wes Widner, Engineering Manager, CrowdStrike
The promise…
The promise of voice interaction
with computers is very old. Just
look at its ubiquity in our sci-fi.
Humans weren’t designed to type
The conversion from WPM to bits per second assumes an average of 5 characters
per word and 2 bits per character, giving the formula:
1 wpm = ((1/60) * 5 * 2) bps, so 1 wpm ≈ 0.166 bps
▪ The average person types 6.83 bits per second (41 wpm)
▪ The average person speaks 39.15 bits per second (236 wpm)
▪ Voice first is ~3x faster than typing, even after accounting for error correction
▪ The average person hears at 75 bits per second (450 wpm)
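The conversion above is easy to sanity-check in code; the 0.166 constant is just 5 characters × 2 bits ÷ 60 seconds:

```python
def wpm_to_bps(wpm: float) -> float:
    """Convert words per minute to bits per second.

    Assumes 5 characters per word and 2 bits per character,
    per the formula on this slide.
    """
    return wpm * 5 * 2 / 60

# 41 wpm typing is ~6.83 bps; 450 wpm listening is 75 bps
```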
The disappointment
A new hope
▪ We’ll build our own!
▪ Constraints:
▪ Everything must be local
▪ Needs to work in a closed
ecosystem
▪ No magical cloud services
▪ All data needs to remain local
▪ Must be extensible
▪ Must be scalable
▪ Must be easy to understand
▪ Must be easy to implement
General design
System design
Implementation
The service contracts
service AudioProcessor {
  rpc Subscribe (stream ProcessAudioRequest) returns (stream ProcessAudioResponse) {}
}

service EventResponder {
  rpc Subscribe (stream TextEventRequest) returns (stream TextEventResponse) {}
}

service AudioOutput {
  rpc Subscribe (stream OutputRequest) returns (stream OutputResponse) {}
}
Step 1 - Capture the voice
This is harder than you may think..
The canvas
▪ The human voice ranges between
300 and 3400 Hz
▪ Human hearing ranges between
20 (really 30) and 20000 Hz
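These ranges explain why the 16 kHz capture rate used later is enough: by the Nyquist theorem a sample rate captures frequencies up to half itself, comfortably above the 3.4 kHz top of the voice band. A minimal sketch of that check:

```python
VOICE_BAND_HZ = (300, 3400)  # human voice range from the slide


def nyquist_limit(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def covers_voice_band(sample_rate_hz: int) -> bool:
    """True if sampling at this rate preserves the whole voice band."""
    return nyquist_limit(sample_rate_hz) >= VOICE_BAND_HZ[1]
```

So 16 kHz covers voice (Nyquist limit 8 kHz), while full-spectrum hearing out to 20 kHz would need CD-style rates of 40 kHz or more.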
mic_capture
▪ Capture clips of audio to
send to the audio event
processor
▪ Source audio is either a
microphone or slices of a
wav file (wav_slicer)
message ProcessAudioRequest {
  string RequestId = 1;
  string SourceId = 2;
  uint64 AudioStartTime = 3;
  bytes AudioData = 5;
}

message ProcessAudioResponse {
  string requestId = 1;
  ProcessAudioResponseCode ResponseCode = 2;
  uint64 AudioStartTime = 3;
  string Output = 4;
}
The audio subsystem
▪ We can’t do anything until we
understand the basics
▪ Advanced Linux Sound Architecture
is the driver layer
▪ PulseAudio is how sound is shared
around the system
▪ It pays to know this system very
well, especially the
pactl command
▪ PulseAudio has native support for
sharing sound over TCP
▪ This means we can also set up our
own FOSS Sonos system
▪ Or, become Batman!
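As an illustration of the TCP-sharing point, the commands might look like the following (the subnet, server IP, and wav file are placeholders; adjust to your network):

```shell
# List capture sources so we know what to record from
pactl list short sources

# On the machine with speakers: accept audio over TCP from the LAN
pactl load-module module-native-protocol-tcp auth-ip-acl="127.0.0.1;192.168.1.0/24"

# On any other machine: point clients at that server and play a file
PULSE_SERVER=tcp:192.168.1.10 paplay chime.wav
```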
(Diagram: the audio stack; a sound application sits on PulseAudio, which sits on the ALSA driver layer and the audio hardware. Core Audio and OpenSL ES are platform equivalents.)
Mic check 1, 2, 1, 2
The right way to do a mic check is with
Harvard Sentences
72 ten-sentence stanzas that are
phonetically balanced, using specific
phonemes at the same frequency they
appear in English
"Tea served from the brown jug is
tasty" misheard as "Tea soaked in
Lebron James is tasty."
Microphones everywhere
▪ I start with either the Raspberry Pi
Zero WH or Raspberry Pi A+
▪ The biggest challenge to portability
is the changing noise landscape
▪ Your choice and placement of a
microphone is key
▪ More advanced work in this area
relies on DSPs and FPGAs
RaspiAudio Mic+ HAT
raspiaudio.com
Far field options
Seeed Studio ReSpeaker
seeedstudio.com
Matrix Voice
matrix.one
Technical challenges / opportunities
▪ Limited to 16 kHz mono uncompressed PCM (WAV) audio
▪ Need better voice envelope detection
▪ No filters or enhancements are being used
▪ No adaptive acoustical measurements are made
▪ I have no idea how to automatically package binaries for Raspbian
Step 2 - Turn voice into text
Aka throwing wav files at the black box
deepspeech
▪ Takes an audio clip and
passes it through Mozilla’s
DeepSpeech engine
▪ My implementation uses a
pre-trained English model
▪ This component takes up a lot
of memory and is CPU
intensive
▪ Should be able to run on a GPU
system like Nvidia’s JetPack
message ProcessAudioResponse {
  string requestId = 1;
  ProcessAudioResponseCode ResponseCode = 2;
  uint64 AudioStartTime = 3;
  string Output = 4;
}

message TextEventRequest {
  string RequestId = 1;
  string SourceId = 2;
  string Text = 4;
}
Hunting for phonemes
The acoustic space around us is pretty busy.
We need to:
1. Separate out formant frequencies using
a fast Fourier transform
2. Determine what signals contain
phonemes - the perceptually distinct
units of sound in a specified language
that distinguish one word from another
3. Match phoneme fragments to text
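Step 1 can be sketched with NumPy: synthesize two tones standing in for formants and recover their frequencies from the FFT peaks. The 700 Hz / 1200 Hz values below are illustrative, not measured formant data:

```python
import numpy as np


def dominant_frequencies(signal, sample_rate, top_n=2):
    """Return the top_n peak frequencies (Hz) from a real FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    # Indices of the strongest bins, mapped back to frequencies
    return sorted(freqs[np.argsort(spectrum)[-top_n:]])


# Synthetic "vowel": two tones standing in for formants F1 and F2
rate = 16_000
t = np.arange(rate) / rate  # one second of samples
vowel = np.sin(2 * np.pi * 700 * t) + np.sin(2 * np.pi * 1200 * t)
```

With a one-second window the bins land exactly on 700 and 1200 Hz; real speech needs short overlapping windows and peak-picking that tolerates leakage.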
Formants
▪ Phonemes have a unique formant
mapping between F1 and F2
▪ Our ears have evolved such that the
fundamental frequency doesn’t matter
What comes next?
The traditional approach (e.g. CMU Sphinx) is
to use a Hidden Markov Model to match
phoneme fragments to text / words
Simple Markov Model
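A toy illustration of the HMM idea (not CMU Sphinx's actual model): Viterbi decoding picks the most likely phoneme sequence for a series of acoustic observations. The states, observation labels, and probabilities below are all made up for the example:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence."""
    # Probability and best path ending in each state, at the first observation
    layers = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (layers[-1][p][0] * trans_p[p][s] * emit_p[s][o], layers[-1][p][1])
                for p in states
            )
            layer[s] = (prob, path + [s])
        layers.append(layer)
    return max(layers[-1].values())[1]


# Toy model: two phoneme states, two kinds of acoustic frames
STATES = ("T", "IY")
START = {"T": 0.6, "IY": 0.4}
TRANS = {"T": {"T": 0.3, "IY": 0.7}, "IY": {"T": 0.2, "IY": 0.8}}
EMIT = {"T": {"burst": 0.9, "tone": 0.1}, "IY": {"burst": 0.2, "tone": 0.8}}
```

Decoding ("burst", "tone", "tone") with this model yields the path T → IY → IY, i.e. a plosive followed by a sustained vowel.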
Mozilla DeepSpeech
▪ Code at
https://github.com/mozilla/DeepSpeech
▪ TensorFlow implementation of Baidu’s
DeepSpeech neural network architecture
▪ Its TensorFlow core means it can run on CPU or GPU
▪ Native clients are available in C++,
Go, Rust, Python, .NET, and Java
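A minimal sketch of feeding a clip to the Python client. DeepSpeech's stt() expects 16-bit PCM samples at the model's rate; the model filename below is a placeholder, and the deepspeech import is commented out since it needs the package plus a downloaded model:

```python
import wave

import numpy as np


def load_pcm16(path):
    """Read a WAV file into the int16 sample buffer DeepSpeech expects."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
        return np.frombuffer(frames, dtype=np.int16), w.getframerate()


# Hedged usage sketch (model path is a placeholder):
# from deepspeech import Model
# model = Model("deepspeech-english.pbmm")
# audio, rate = load_pcm16("clip.wav")
# print(model.stt(audio))
```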
Audrey (Bell Labs), 1952
Technical challenges / opportunities
▪ This needs to utilize a GPU to be performant
▪ Since it sits in the middle of the stack, it needs better reconnect handling
▪ I’m still not sure how thread-safe the underlying libraries are
▪ Even though it breaks the design constraints, I’d love to see a cloud service
integration implemented against Google, Amazon, and IBM voice recognition
services
Step 3 - Mine text for intent
Basically, creating a chat bot
text_event_processor
▪ Take text as input, parse
it, and produce some
sort of output
▪ Eventually this will
involve the use of
Natural Language
Processing to parse
semantics
▪ For now, it simply
responds to commands
message TextEventRequest {
  string RequestId = 1;
  string SourceId = 2;
  string Text = 4;
}

message TextEventResponse {
  string RequestId = 1;
  string SourceId = 2;
  ProcessAudioResponseCode responseCode = 3;
}
Language taxonomy
▪ Commands are defined in YAML
▪ The Zygomys library is used for
Go “scripting”

Commands:
  - command: "I'm home"
    return-output: false
    action-format: zygomys
    action: >
      (system "echo welcome home")
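The matching itself can be sketched without the YAML or Zygomys layers. Below, COMMANDS mirrors the parsed form of the YAML above (what a YAML loader would return), and the exact-match lookup stands in for the processor's current command handling:

```python
# Parsed form of the YAML above (what a YAML loader would return)
COMMANDS = [
    {
        "command": "I'm home",
        "return-output": False,
        "action-format": "zygomys",
        "action": '(system "echo welcome home")',
    },
]


def match_command(text):
    """Return the action for an exactly-matching command phrase, else None."""
    text = text.strip().lower()
    for entry in COMMANDS:
        if entry["command"].lower() == text:
            return entry["action"]
    return None
```

Exact matching is what makes the component command-and-response rather than chat-like; intent parsing would replace this lookup.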
Technical challenges / opportunities
▪ Variables are not yet implemented
▪ Sentences need to be parsed for intent and meaning
▪ This component needs to be more chat-like as opposed to command and
response
Step 4 - Send a response
Do it all in reverse
▪ This is the least developed part of
the system. And by least I mean
unimplemented
▪ The idea is to take either text in
Speech Synthesis Markup Language
(SSML) syntax, which is sent to
espeak-ng, or a URL to an
audio file or stream (eg an RTSP
source)
output_event_processor
message OutputRequest {
  string SinkId = 1;
  string RequestId = 2;
  string Text = 3;
  string MediaURL = 4;
}

message OutputResponse {
  uint64 RequestId = 1;
  string SinkId = 2;
  AudioOutputResponseCode ResponseCode = 3;
}
Text to speech
▪ We’ll use espeak-ng to synthesize
speech
▪ Uses n-grams and Markov models
to turn text back into phonemes
▪ The ng part is a second phase for
adding prosody to the overall word
▪ We can either feed it raw text or
Speech Synthesis Markup
Language
▪ Allows finer control over prosody
and dictation
▪ The current SSML version is 1.1
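For reference, a small SSML 1.1 document of the kind espeak-ng can interpret with its -m (markup) option; the contents are illustrative:

```xml
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Welcome <emphasis>home</emphasis>.
  <break time="300ms"/>
  <prosody rate="slow" pitch="low">All systems are local.</prosody>
</speak>
```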
Wolfgang von Kempelen (The Turk) 1769
Demo time!
make deepspeech-service
Two demos
▪ Simple wav slicing
▪ Live mic and text processing
Both start with deepspeech-service
Sample data from Brian Roemmele
Remix audio for wav slicing
$ sox --i 5d91700b51b062790d7b1674.mp3
Input File : '5d91700b51b062790d7b1674.mp3'
Channels : 2
Sample Rate : 44100
Precision : 16-bit
Duration : 00:33:50.83 = 89559735 samples = 152312 CDDA sectors
File Size : 65.0M
Bit Rate : 256k
Sample Encoding: MPEG audio (layer I, II or III)
$ sox 5d91700b51b062790d7b1674.mp3 -r 16k 5d91700b51b062790d7b1674.wav remix 1-2
The current limitation is a 16 kHz mono sample rate, so we’ll need to resample and remix the audio
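The sox listing's numbers are internally consistent; a quick worked check using the values it printed:

```python
# Values copied from the sox --i listing above
samples = 89_559_735
rate = 44_100  # Hz

seconds = samples / rate
minutes, secs = divmod(seconds, 60)
# 33 minutes and ~50.83 seconds, matching sox's 00:33:50.83

# CDDA sectors are 1/75 of a second each
cdda_sectors = int(seconds * 75)  # matches sox's 152312
```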
wav_slicer
export FILE=/data/5d91700b51b062790d7b1674.wav
export DATA_DIR=~/Downloads
make wav-slicer
mic_capture
export DURATION=3
make mic-capture
docker logs -f diy-jarvis-mic-capture
Thanks for coming!
Code and references are available at:
▪ https://github.com/kai5263499/diy-jarvis
▪ It’s Hacktoberfest, send me PRs!
Contact me at:
▪ Twitter - @kai5263499
▪ Email - wes.widner@crowdstrike.com
PS We’re hiring engineers! Talk to us at booth #50 right outside

Building your own open-source voice assistant
