cwh.consulting
Artificial Intelligence in
Real Time Communications
(AI in RTC)
RTC Korea
1 November 2018
A blog for WebRTC developers
webrtcHacks.com
@webrtcHacks
AI & RTC blog
cogint.ai
@cogintai
WebRTC and ML for Developer Event
November 16, 2018 in San Francisco
krankygeek.com
About Me
Chad Hart
Analyst & Product Consultant
https://cwh.consulting
@chadwallacehart
chad@cwh.consulting
AI in RTC Research Study
• Authors
• Chad Hart – cwh.consulting
• Tsahi Levent-Levi - BlogGeek.me
• Methodology
• 40+ 1-on-1 vendor interviews
• ~100 respondent web survey
• Analysis of 126 companies & all major
products
• Output: 147-page report
What is AI in RTC?
[Figure: AI + RTC]
Image source:
pixabay.com/en/a-i-ai-anatomy-2729782
AI in RTC use case categories
speech analytics
voicebots
RTC optimization
computer vision
Image source:
pixabay.com/en/a-i-ai-anatomy-2729782
Speech Analytics
• Call center agent monitoring
• Transcription
• Translation
• Agent coaching
• Customer engagement
Promise:
machine transcription at human levels
Source: Google I/O 2017 keynote
Reality:
transcription quality is often not so great
Machine Transcription:
My name is a chat heart of you might be familiar with Dave from a brand or if you are, a web or to see people I've done about five years, I'm or so a of an independent analyst. So I'm mostly do park management strategy type. For a product, marketing.
Actual Transcription:
My name is Chad Hart. You might be familiar with me from a brand -- if you are WebRTC people; I've done webrtcHacks now for about five years or so. Outside of webrtcHacks, I have been an independent analyst. I mostly do product management and strategy type work and product marketing.
https://www.nojitter.com/post/240173958/when-speech-analytics-makes-gibberish-useful
Transcription challenges highlighted in the example above:
• Non-standard spelling
• Industry jargon
• Speech disfluencies
• US-English language assumption
https://www.nojitter.com/post/240173958/when-speech-analytics-makes-gibberish-useful
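One way to put a number on "not so great" is word error rate (WER): the word-level edit distance between the machine transcript and the actual transcript, divided by the number of words actually spoken. The slides do not say which metric was used, so treat this as a minimal sketch of a common metric. The function name and the short sample strings (the opening words of the two transcripts, with punctuation removed) are chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


actual = "My name is Chad Hart"
machine = "My name is a chat heart"
print(f"WER: {word_error_rate(actual, machine):.2f}")  # prints "WER: 0.60"
```

Three edits (one inserted word and two substituted words) against five spoken words gives 0.60 for this opening phrase alone.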
Higher-level speech analytics
• Perfect transcription is not needed to
provide useful analysis.
• Higher-level speech analytics systems look
for patterns in speech.
• These patterns can be matched to
business outcomes, such as whether a caller
ended up purchasing or gave a good
customer satisfaction score.
• There are often meaningful patterns
beyond the words that were spoken – like
how fast each party was speaking, or how
often the agent talked compared to the
customer.
• There is also a lot of work going into
looking at caller emotion and sentiment.
Source: CallMiner
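The patterns "beyond the words" mentioned above, such as speaking rate and agent-versus-customer talk share, can be computed from nothing more than speaker-labelled, time-stamped segments. The sketch below assumes a diarized transcript is already available. The Segment structure, the speaker labels, and the sample call are hypothetical and not taken from any vendor's product.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str   # "agent" or "customer" (labels are an assumption)
    start: float   # seconds from the start of the call
    end: float
    text: str      # machine transcript; it does not need to be perfect


def talk_time(segments, speaker):
    """Total seconds during which the given speaker was talking."""
    return sum(s.end - s.start for s in segments if s.speaker == speaker)


def words_per_minute(segments, speaker):
    """Speaking rate, counting whitespace-separated words."""
    seconds = talk_time(segments, speaker)
    words = sum(len(s.text.split()) for s in segments if s.speaker == speaker)
    return 60.0 * words / seconds if seconds else 0.0


def agent_talk_ratio(segments):
    """Share of total talk time taken by the agent (0.0 to 1.0)."""
    agent = talk_time(segments, "agent")
    total = agent + talk_time(segments, "customer")
    return agent / total if total else 0.0


call = [
    Segment("agent", 0.0, 6.0, "thanks for calling how can I help you today"),
    Segment("customer", 6.0, 10.0, "my order never arrived"),
    Segment("agent", 10.0, 16.0, "sorry about that let me check the status"),
]
print(agent_talk_ratio(call))           # 0.75
print(words_per_minute(call, "agent"))  # 85.0
```

Note that a few misrecognized words barely change these numbers, which is why such metrics stay useful even when the transcript itself is poor.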
Voicebots – Smart Speakers & Assistants
• IVR replacement
• Starting meetings
• In-call assistance
Voicebots – Smart Speakers & Assistants
• Another area we examined was voicebots.
• These are smart speakers like the Google Home, which was recently made available in
South Korea, and AI assistants like Bixby or Siri.
• Building a voicebot is complex. You not only need to transcribe the speech and run
some natural language understanding on it, as in speech analytics, but you also need to
generate speech and handle interaction with the customer in real time.
• There is very broad interest in using these voicebots.
• Every telephony device maker is interested in adding a voice user interface to their
products, and this is a natural fit since people “talk” to these devices already.
• Typical conference room equipment is already set up with microphone arrays to capture
good-quality audio with minimal noise from a variety of locations throughout the room.
• However, most companies are just starting to figure out how to use them in their
products.
Flattening the IVR:
humans don’t speak in menus
https://cogint.ai/dialogflow-phone-bot/
[Figure: 10 potential responses in an IVR menu hierarchy vs. a voicebot. Traditional IVR menu: several levels of Menu and DTMF steps before each Response. Voicebot: one Utterance matched to an Intent and its Response. Axis: time.]
Flattening the IVR:
humans don’t speak in menus
• One major area where voicebots will have an impact is in IVRs.
• Traditional IVRs were designed for DTMF input and are usually set up with multiple
levels of menus.
• Because people cannot remember more than a few menu options at a time, you
cannot put too many options in each menu.
• As a result, to fit many options, you need a complex menu with many
layers.
• Users hate these menus because they are difficult to navigate and take too long.
• Voicebots help to flatten the IVR into just a few layers.
• Rather than navigating a complex menu, users can just say what they want and use
natural language to get the information they need.
• This is good for call centers too because users are more likely to stay in the IVR
instead of immediately dropping out to an operator.
https://cogint.ai/dialogflow-phone-bot/
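To make the contrast concrete, here is a toy sketch of the same five responses reached two ways: through a nested DTMF menu and through a single spoken request. The menu contents and intent names are invented for illustration. The keyword matcher is only a stand-in for a real natural language understanding engine, which would do far more than count overlapping words.

```python
# Hypothetical bank IVR: each DTMF digit descends one menu level.
IVR_MENU = {
    "1": {"1": "checking balance", "2": "savings balance"},
    "2": {
        "1": "report lost card",
        "2": {"1": "activate card", "2": "change PIN"},
    },
}


def ivr_lookup(digits):
    """Walk the menu tree one key press at a time and return what is reached."""
    node = IVR_MENU
    for digit in digits:
        node = node[digit]
    return node


# Voicebot: one flat table of intents, matched from a single utterance.
INTENT_KEYWORDS = {
    "checking balance": {"checking", "balance"},
    "savings balance": {"savings", "balance"},
    "report lost card": {"lost", "stolen", "card"},
    "activate card": {"activate", "card"},
    "change PIN": {"change", "pin"},
}


def match_intent(utterance):
    """Pick the intent sharing the most keywords with the utterance, if any."""
    # Punctuation is not stripped here; real input would need cleaning first.
    words = set(utterance.lower().split())
    best = max(INTENT_KEYWORDS, key=lambda name: len(words & INTENT_KEYWORDS[name]))
    return best if words & INTENT_KEYWORDS[best] else None


print(ivr_lookup("222"))                        # three key presses: change PIN
print(match_intent("I need to change my PIN"))  # one utterance: change PIN
```

The menu needs three prompts and three key presses to reach the deepest option, while the intent table reaches any option in one turn no matter how many options are added.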
New voicebots: consumer ⇨ business
[Figure: Notable consumer voicebot market milestones. Source: Kranky Geek Research, krankygeek.com/research]
New voicebot technology threatens IVRs
[Chart: ability to offload human tasks over time, with “today” marked]
Computer vision
• Funny hats
• Face detection
• Gestures
• Object detection
• Emotion analysis
Object detection over WebRTC with TensorFlow
Blog post:
https://webrtchacks.com/webrtc-cv-tensorflow/
Demo video: https://youtu.be/vzTXW0hGINM
• Using open source libraries and existing work,
it is relatively simple to set up your own server
and process real-time video, even without a
PhD in computer vision.
• Here is an example of a server I set up to do
real-time analysis of a WebRTC stream.
Object detection over WebRTC with TensorFlow – example architecture
https://webrtchacks.com/webrtc-cv-tensorflow/
[Figure: the browser GETs web assets (index.html, local.js, objDetect.js) from a Flask server and sends it a POST with an image; the server runs TensorFlow Object Detection and returns object details.]
• This is just a very basic example that uses an
HTTP POST to send several images per
second to a cloud-based server for
processing.
• As you saw in the video, there can be a little
bit of lag.
• Using a GPU-accelerated server, or even
something like Google’s TPUs, which were
specifically designed to accelerate heavy
machine learning graphs, would have helped.
• But ultimately, streaming high-quality
images to a server will always have its limits.
• Wouldn’t it be nice if you could do the heavy
processing locally with hardware
acceleration, just like you can hardware
accelerate codecs like H.264?
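As a rough illustration of the "object details" step in this architecture, the sketch below turns raw detector output into the JSON a server could send back to the browser. It assumes the detector returns normalized boxes in [ymin, xmin, ymax, xmax] order with one score and class id per box, which is how the TensorFlow Object Detection API reports results. The function name, the JSON field names, and the label map are hypothetical and are not taken from the blog post's code.

```python
import json


def to_object_details(boxes, scores, classes, labels, threshold=0.5):
    """Keep detections at or above the threshold and serialize them as JSON."""
    details = []
    for box, score, class_id in zip(boxes, scores, classes):
        if score < threshold:
            continue  # drop low-confidence detections
        ymin, xmin, ymax, xmax = box
        details.append({
            "name": labels.get(class_id, "unknown"),
            "score": round(float(score), 3),
            # Normalized coordinates: the browser scales them to the video size.
            "x": xmin,
            "y": ymin,
            "width": xmax - xmin,
            "height": ymax - ymin,
        })
    return json.dumps(details)


# Illustrative values only; a real server would get these from the model.
sample_labels = {1: "person", 18: "dog"}
sample = to_object_details(
    boxes=[[0.25, 0.125, 0.75, 0.625], [0.0, 0.0, 0.5, 0.5]],
    scores=[0.9, 0.3],
    classes=[1, 18],
    labels=sample_labels,
)
print(sample)
```

Sending only this small JSON payload back keeps the return path cheap; the expensive part of each round trip remains the image upload and the inference itself.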
ML processing moving to the edge,
with faster, local processing
• That’s exactly what you can do with some new chipsets from vendors like
Intel.
• This is an example of a kit from Google called the AIY Vision Kit that
includes the Intel Movidius processor.
• The Movidius is designed to run deep neural networks locally and is
especially well-suited to low-power computer vision applications.
• This kit runs on a tiny, single-core Raspberry Pi Zero with only 512MB of RAM.
• Google used to sell just the Vision Bonnet add-on, the part with the chip, for $45.
Now you can buy the complete kit with the Raspberry Pi for $90 in the US.
• Note that Amazon also has a computer vision kit it calls DeepLens. That
runs on something more like an Intel NUC mini-PC and costs $250.
https://webrtchacks.com/aiy-vision-kit-uv4l-web-server/
Improvements with edge hardware (demonstration)
• Let’s look at this in action.
• This all runs locally on the Pi.
• So in this case, I am doing the computer
vision processing locally while sending the
stream and annotations remotely.
Blog post:
https://webrtchacks.com/aiy-vision-kit-uv4l-web-server
Video:
https://youtu.be/h0O18R1rI9U
Fun use cases with native mobile libraries
• With new native mobile libraries like
Apple’s Core ML and Google’s ML Kit,
adding machine learning to a mobile
app is relatively simple.
• Some of the engineers at Houseparty
wrote a blog post demonstrating how
to do smile detection.
• Similar libraries are available that
detect facial boundaries and let you
put hats, sunglasses, beards, and other
silly masks on people – I am sure you
have seen some of these!
• Similar techniques can be used in a
business context to blur out
backgrounds for remote workers who
call into a video conference.
https://webrtchacks.com/ml-kit-smile-detection/
ML Kit CPU consumption: high framerates are not practical (without special hardware)
[Chart: CPU usage (%) for different framerates processed by ML Kit]
https://webrtchacks.com/ml-kit-smile-detection/
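If analysing every frame costs too much CPU, one common mitigation is to run detection on only a subset of frames while the video itself keeps its full framerate. The slide does not prescribe this approach, so the sketch below is an assumption: the function name is hypothetical, and timestamps are taken in integer milliseconds to avoid floating-point drift.

```python
def frames_to_process(timestamps_ms, target_fps):
    """Return indices of frames to analyse so that at most target_fps
    frames per second reach the ML model."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    min_gap_ms = 1000 / target_fps
    selected = []
    last_ms = None
    for index, ts in enumerate(timestamps_ms):
        # Analyse a frame only if enough time has passed since the last one.
        if last_ms is None or ts - last_ms >= min_gap_ms:
            selected.append(index)
            last_ms = ts
    return selected


# One second of 30 fps video, analysed at 5 fps.
camera_frames = [i * 1000 // 30 for i in range(30)]
print(frames_to_process(camera_frames, 5))  # [0, 6, 12, 18, 24]
```

For something like smile detection, a handful of analysed frames per second is often enough, since facial expressions change slowly compared with the camera framerate.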
Resource consumption
MLKit is small compared to WebRTC
https://webrtchacks.com/ml-kit-smile-detection/
WebRTC CV is coming to the browser
https://w3c.github.io/webrtc-nv-use-cases/#funnyhats*
This is from a W3C document examining use cases for the next version of WebRTC
RTC optimization
• Noise suppression
• Echo cancellation
• Error correction
• Route optimization
Mozilla RNNoise – real time, low-power noise suppression with
deep learning
• One example is a research project
from Mozilla that uses Deep Learning
to provide better real-time noise
suppression.
• This is designed for lower power
devices and does not require any
specialized hardware.
• We do not have time now, but you can
go to that link and try some demos.
• Unfortunately this was just a research
project, but it gives you some idea of
what could be done in this and other
areas.
https://people.xiph.org/~jm/demo/rnnoise/
Special discount
for RTC Korea
Use code RTC-KOREA
until November 7
for $1000.00 off
krankygeek.com/research
or email me
purchase at
cwh.consulting
Questions?


Editor's Notes

  • #3: As a quick background, my name is Chad Hart. I am an analyst and consultant focused on real time communications products and services. Some of you may be familiar with webrtcHacks, a blog I have run since 2013 that aims to provide useful content for WebRTC developers. I also recently launched a blog to specifically explore topics related to AI, machine learning, and RTC. You can check that out at cogint.ai. Lastly, I also help to run the Kranky Geek series of events with the help of Google and other sponsors like Intel, Nexmo, and Agora. We hold an event every year in San Francisco. This year we will also be focusing on AI in RTC topics, with many great talks from companies like Facebook, Microsoft, IBM, and many more.
  • #4: The AI in RTC topic has been a major focus of mine. I recently came off a long-term project where I ran a new product incubator group that launched a speech analytics service inside a telco. I could see that speech analytics and other machine-learning-based technologies were starting to intersect with real time communications. To understand this better, I teamed up with Tsahi Levent-Levi of BlogGeek.me, another WebRTC analyst many of you know, to write a research report on this topic. We covered more than 125 vendors, ran an industry survey, and had 1-on-1 conversations with 40 vendors.
  • #5: So what is AI in RTC? I am not talking about science fiction robots making phone calls. I am going to talk about how modern machine learning techniques can be used to improve and expand real time communications.
  • #6: We saw 4 major categories of use cases: speech analytics, voicebots, computer vision, and using machine learning (ML) to optimize lower-level RTC protocols and networks.
  • #7: By far the most common use case was speech analytics. There is a broad range of use cases, from providing transcription on conference calls to providing real-time agent coaching based on what the customer is saying in the call center.
  • #8: Speech transcription, also known as automatic speech recognition (ASR) or speech-to-text (STT), has improved a lot over the past couple of years thanks to deep learning techniques. Many vendors now claim they are at human levels of accuracy.
  • #9: The reality is that transcription still has a number of challenges. The example here shows a transcription where I was introducing myself. As you can see, the machine transcription did not do such a great job.
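One common way to put a number on how far a machine transcript is from a human one is word error rate (WER): the word-level edit distance divided by the number of words in the reference. The sketch below is only an illustration, not something from the talk; it assumes whitespace tokenization and case-insensitive matching, and it uses the opening words of the slide's example as input.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic-programming table row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


actual = "My name is Chad Hart"
machine = "My name is a chat heart"
print(f"WER: {word_error_rate(actual, machine):.2f}")
```

For this short excerpt the machine output needs one inserted word removed and two words substituted against five reference words, so the score is high even though most of the sentence looks plausible.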
  • #10: This specific example is probably worse than average, but not uncommon. The first major challenge is getting languages and dialects correct. I am sure that this is a big struggle for this audience as you deal with STT technologies made outside of Korea. I am lucky that English, and particularly American English, is by far the best-supported language. Many vendors also support many varieties of English, such as British, Australian, and Indian. You will find much more limited support for Korean. I do not think I have seen any major international vendor support specific Korean dialects. Fortunately this is improving, and newer algorithms require less training data, so it is becoming easier to build support for new languages. Non-standard spellings and industry jargon that does not appear in the dictionary, like “WebRTC”, are also a challenge. Most systems now have techniques that let you specify a custom vocabulary to correct these.
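To make the custom vocabulary idea concrete, here is a minimal sketch that corrects known misrecognitions after the fact. This is a simplification: STT services typically apply a custom vocabulary inside the recognizer rather than as a text replacement. The `CUSTOM_VOCABULARY` mapping and the `apply_custom_vocabulary` helper are hypothetical names, and the sample phrases are taken from the slide's machine transcript.

```python
import re

# Hypothetical mapping from a known misrecognition to the intended term.
CUSTOM_VOCABULARY = {
    "web or to see": "WebRTC",
    "chat heart": "Chad Hart",
}


def apply_custom_vocabulary(transcript: str, vocabulary: dict) -> str:
    """Replace each misrecognized phrase with its intended spelling."""
    # Handle longer phrases first so overlapping entries do not clash.
    for wrong in sorted(vocabulary, key=len, reverse=True):
        replacement = vocabulary[wrong]
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids treating backslashes as escapes.
        transcript = pattern.sub(lambda _match: replacement, transcript)
    return transcript


print(apply_custom_vocabulary("if you are, a web or to see people", CUSTOM_VOCABULARY))
```

A lookup like this only fixes errors you have already seen, which is one reason recognizer-level vocabulary support is usually preferable when the vendor offers it.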
  • #11: It is also important to note that perfect transcription is not needed to provide useful analysis. Higher-level speech analytics systems look for patterns in speech. These patterns can be matched to business outcomes, such as whether a caller ended up purchasing or gave a good customer satisfaction score. There are often meaningful patterns beyond the words that were spoken, like how fast each party was speaking or how often the agent talked compared to the customer. There is also a lot of work going into looking at caller emotion and sentiment.
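Two of the non-word patterns mentioned here, speaking rate and how much the agent talks compared to the customer, can be computed from a transcript that carries speaker labels and timestamps. The sketch below assumes such segments are already available as `(speaker, start_sec, end_sec, text)` tuples; that format, the `talk_metrics` name, and the sample call are illustrative assumptions, not any vendor's output.

```python
from collections import defaultdict


def talk_metrics(segments):
    """Return talk share and words per minute for each speaker.

    segments: iterable of (speaker, start_sec, end_sec, text) tuples.
    """
    talk_time = defaultdict(float)
    word_count = defaultdict(int)
    for speaker, start, end, text in segments:
        talk_time[speaker] += end - start
        word_count[speaker] += len(text.split())

    total_time = sum(talk_time.values())
    metrics = {}
    for speaker, seconds in talk_time.items():
        metrics[speaker] = {
            # Fraction of all speech time taken by this speaker.
            "talk_share": seconds / total_time if total_time else 0.0,
            # Speaking rate over the time this speaker was talking.
            "words_per_minute": word_count[speaker] * 60 / seconds if seconds else 0.0,
        }
    return metrics


# Hypothetical call excerpt, made up for illustration.
call = [
    ("agent", 0.0, 6.0, "Thank you for calling how can I help you today"),
    ("customer", 6.0, 10.0, "I would like to upgrade my plan"),
    ("agent", 10.0, 16.0, "Sure I can help with that right now"),
]
for speaker, values in talk_metrics(call).items():
    print(speaker, values)
```

Features like these tolerate transcription errors well, since a misheard word still counts as a word and still occupies the same time.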
  • #12: Another area we examined was voicebots. These are smart speakers like the Google Home, which was recently made available in South Korea (https://voicebot.ai/2018/09/11/google-home-arriving-in-south-korean-on-september-18-pre-orders-start-today/), and AI assistants like Bixby or Siri. Building a voicebot is complex. You not only need to transcribe the speech and run some natural language understanding on it, as in speech analytics, but you also need to generate speech and deal with interactivity with the customer in real time. There is very broad interest in using these voicebots. Every telephony device maker is interested in adding a voice user interface to their products, and this is a natural fit since people “talk” to these devices already. Typical conference room equipment is already set up with microphone arrays to capture good-quality audio with minimal noise from a variety of locations throughout the room. However, most companies are just starting to figure out how to use voicebots in their products.
  • #14: One major area where voicebots will have an impact is in IVRs. Traditional IVRs were designed for DTMF input and are usually set up with multiple levels of menus. Because people cannot remember more than a few menu options at a time, you cannot put too many options in each menu. As a result, to fit many options, you need a complex menu with many layers. Users hate this because these menus are difficult to navigate and take too long. Voicebots help to flatten the IVR into just a few layers. Rather than navigating a complex menu, users can just say what they want and use natural language to get the information they need. This is good for call centers too, because users are more likely to stay in the IVR instead of immediately dropping out to an operator.
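The relationship between options per menu and menu depth is simple arithmetic: with k options per menu, a tree with n levels can reach at most k^n destinations. The sketch below works that out; the figures of 100 destinations and 4 options per menu are assumptions chosen for illustration, not numbers from the study.

```python
def menu_levels_needed(destinations: int, options_per_menu: int) -> int:
    """Smallest number of menu levels that can reach every destination."""
    if destinations < 1:
        raise ValueError("destinations must be at least 1")
    if options_per_menu < 2:
        raise ValueError("options_per_menu must be at least 2")
    levels = 0
    reachable = 1  # with zero menus the caller can only land in one place
    while reachable < destinations:
        reachable *= options_per_menu
        levels += 1
    return levels


# Assumed figures: 100 destinations, callers remember about 4 options at a time.
print(menu_levels_needed(100, 4))    # DTMF-style menu tree
print(menu_levels_needed(100, 100))  # one open prompt that understands every request
```

Under these assumptions the keypad menu needs four levels, while a bot that can map a spoken request directly to any destination needs only one prompt.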
  • #16: Actually, many advanced IVR systems, like those sold by Nuance, Aspect, and Genesys, already have natural language inputs and responses. One big change here is the growth of the consumer voicebot market. As this technology has matured, these solutions are now being targeted at business telephony use cases, not just consumers. For example, IBM launched a voice gateway option for its Watson Assistant. Amazon is integrating its natural language engine, called Lex, into Amazon Connect, its contact center solution. Microsoft’s language processing platform is called LUIS, and it has a bot-builder framework that can use this to integrate into the consumer Skype and Skype for Business. Just this summer, Google launched its Contact Center AI initiative, where it has partnered with many major communications providers and vendors. As part of Google’s solution, they are looking to penetrate call centers by using Dialogflow, their natural language understanding engine, and are using other tools to help agents answer questions more quickly.
  • #17: Existing IVR technology that incorporates natural language tends to be very expensive. Big vendors like Amazon, Google, and Microsoft are adapting technologies they built for the much larger consumer market and applying them to business use cases at much lower costs, often with better performance. One of Google’s customers, Marks and Spencer, commented that they were able to save the equivalent of 100 full-time employees using this technology across their call center.
  • #18: The last area I would like to discuss is computer vision. This domain already has a lot of usage in consumer applications and is just starting to find some business use cases. There are many application areas, including counting people, identifying faces, using gestures for controls, and even augmented reality.
  • #19: Using open source libraries and existing work, it is relatively simple to set up your own server and process real-time video, even without a PhD in computer vision. Here is an example of a server I set up to do real-time analysis of a WebRTC stream.
  • #20: This is just a very basic example that uses an HTTP POST to send several images per second to a cloud-based server for processing. As you saw in the video, there can be a little bit of lag. Using a GPU-accelerated server, or even something like Google’s TPUs, which were specifically designed to accelerate heavy machine learning graphs, would have helped. But ultimately, streaming high-quality images will always have its limits. Wouldn’t it be nice if you could do the heavy processing locally with hardware acceleration, just like you can hardware-accelerate codecs like H.264?
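The pattern described here, capping the frame rate and sending each frame in an HTTP POST, can be sketched in a few lines. This is a Python stand-in using only the standard library, not the code from the demo. The `FrameThrottle` class, the helper names, the 5 frames per second cap, and the URL in the usage note are all hypothetical, and the frame is assumed to be JPEG-encoded already.

```python
import urllib.request


class FrameThrottle:
    """Lets through at most `max_fps` frames per second."""

    def __init__(self, max_fps: float):
        self.min_interval = 1.0 / max_fps
        self.last_sent = None

    def ready(self, now: float) -> bool:
        """Return True (and record the send time) if a frame may be sent at `now`."""
        if self.last_sent is None or now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False


def build_frame_request(url: str, jpeg_bytes: bytes) -> urllib.request.Request:
    """Wrap one JPEG-encoded frame in an HTTP POST request."""
    return urllib.request.Request(
        url,
        data=jpeg_bytes,
        headers={"Content-Type": "image/jpeg"},
        method="POST",
    )


def send_frame(url: str, jpeg_bytes: bytes, timeout: float = 2.0) -> bytes:
    """POST the frame and return the server's raw response body."""
    with urllib.request.urlopen(build_frame_request(url, jpeg_bytes), timeout=timeout) as response:
        return response.read()
```

A capture loop would call `throttle.ready(time.monotonic())` for each new frame and, when it returns True, call `send_frame("http://localhost:5000/image", frame_bytes)` against whatever endpoint the processing server exposes. Every frame pays a full request and response round trip, which is one source of the lag mentioned above.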
  • #21: That’s exactly what you can do with some new chipsets from vendors like Intel. This is an example of a kit from Google called the AIY Vision Kit that includes the Intel Movidius processor. The Movidius is designed to run deep neural networks locally and is especially well-suited to low-power computer vision applications. This kit runs on a tiny, single-core Raspberry Pi Zero with only 512MB of RAM. Google used to sell just the Vision Bonnet add-on that carries the chip for $45. Now you can buy the complete kit with the Raspberry Pi for $90 in the US. Note that Amazon also has a computer vision kit it calls DeepLens. That runs on something more like an Intel NUC mini-PC and costs $250.
  • #23: Let’s look at this in action. This all runs locally on the Pi. So in this case, I am doing the computer vision processing locally while sending the stream and annotations to a remote viewer.
  • #24: With new native mobile libraries like Apple’s Core ML and Google’s ML Kit, running computer vision on a phone is relatively simple. Some of the engineers at Houseparty wrote a blog post demonstrating how to do smile detection. Similar libraries are available that detect facial boundaries and let you put hats, sunglasses, beards, and other silly masks on people. I am sure you have seen some of these! Similar techniques can be used in a business context to blur out backgrounds for remote workers who call into a video conference.
  • #28: The last area is RTC optimization. There are many opportunities to use machine learning to improve bandwidth estimation and echo cancellation and to perform better error correction. We were very surprised that there has been relatively little investment made here.
  • #29: One example is a research project from Mozilla that uses deep learning to provide better real-time noise suppression. This is designed for lower-power devices and does not require any specialized hardware. We do not have time now, but you can go to that link and try some demos. It is pretty neat. Unfortunately, this was just a research project, but it gives you some idea of what could be done in this and other areas.
  • #30: Before I take questions, I did want to mention we have a special discount code for RTC Korea attendees. If you are interested in seeing our full 147-page report, you can use that for a big discount.