AT&T 2012 DevLab Speech API Deep Dive

September 25, 2012

AT&T SPEECH API DEEP DIVE
Michael Owens (@mko on Twitter, mowens on Github)
Jay Lieske ( jay.lieske@att.com, jayatyp on Github)

AT&T Developer Program
© 2012 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.

WHAT IS THE
AT&T SPEECH API?

How the AT&T Speech API Works

Powered by AT&T WATSON℠
• Developed over more than 20 years
• Optimized for different usage scenarios:
  • Web Search
  • Business Search
  • Question & Answer
  • Voicemail-to-Text
  • Short Message (SMS)
  • TV Search/Remote (U-verse)
  • Generic Speech-to-Text

Simple Speech-to-Text
• One REST endpoint
• Accepts audio in WAV or AMR
• Structured JSON response
  • Text spoken by user
  • Metrics to evaluate recognition quality
• AT&T Native SDKs for Android and iOS handle audio capture and streaming

Apps in the Wild

AT&T Translator, Speak4it, U-verse Easy Remote

GETTING STARTED
WITH THE AT&T
SPEECH API

Sign Up for API Access
• j.mp/ATTDevSignUp
• Free API Access for DevLab Attendees
• Detailed Instructions in your Attendee Packet
• Sign up with code "APILAB12"
• AT&T Staff is on hand to answer questions and help get you set up

Before You Code
• Get your API Keys from the AT&T Developer Portal:
  • Client ID ("API Key" on the AT&T Developer Portal)
  • Client Secret ("Secret Key" on the AT&T Developer Portal)
• OAuth 2.0 client_credentials grant type
• OAuth 2.0 access_token
• Audio File Types:
  • AMR: narrowband, 12.2 kbit/s, 8 kHz sampling
  • WAV: 16-bit PCM, single channel, 8 kHz sampling
• Audio File Length:
  • Voicemail: 4 minutes or less
  • Other: 1 minute or less
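
The WAV requirements listed above can be checked before uploading. The following is a minimal sketch, not part of the original example app: the function name `checkWavFormat` is hypothetical, and it assumes a canonical RIFF header whose "fmt " chunk starts at byte 12 (files with extra chunks before "fmt " would need a fuller parser).

```javascript
// Sketch: verify that a WAV buffer is 16-bit PCM, single channel, 8 kHz.
// Assumes the canonical layout: "RIFF" at 0, "WAVE" at 8, "fmt " at 12.
function checkWavFormat(buf) {
  if (buf.length < 36 ||
      buf.toString('ascii', 0, 4) !== 'RIFF' ||
      buf.toString('ascii', 8, 12) !== 'WAVE' ||
      buf.toString('ascii', 12, 16) !== 'fmt ') {
    return { ok: false, reason: 'not a canonical RIFF/WAVE file' };
  }
  var format = {
    audioFormat: buf.readUInt16LE(20),   // 1 means uncompressed PCM
    channels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    bitsPerSample: buf.readUInt16LE(34)
  };
  var ok = format.audioFormat === 1 &&
           format.channels === 1 &&
           format.sampleRate === 8000 &&
           format.bitsPerSample === 16;
  return { ok: ok, format: format };
}
```

A caller could read the file with `fs.readFile` and refuse to upload when `ok` is false, reporting the `format` fields so the user can see which property is off.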

Step 1: Connect via OAuth
Request Method:  POST
Request URL:     https://api.att.com/oauth/token
Request Headers: Content-Type: application/x-www-form-urlencoded
Request Body:    client_id=ATT_API_CLIENT_ID
                 &client_secret=ATT_API_CLIENT_SECRET
                 &grant_type=client_credentials
                 &scope=SPEECH

Response Body:   {
                   "access_token": "xxyz123"
                 }
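
The token request above could be sent from Node.js roughly as follows. This is a sketch using only Node's built-in `querystring` and `https` modules; the function names `buildTokenRequest` and `requestAccessToken` are hypothetical, the host `api.att.com` is taken from the slide, and the response is assumed to be a JSON object with an `access_token` field as shown.

```javascript
var querystring = require('querystring');

// Build the request options and form-encoded body for the token request.
function buildTokenRequest(clientId, clientSecret) {
  var body = querystring.stringify({
    client_id: clientId,
    client_secret: clientSecret,
    grant_type: 'client_credentials',
    scope: 'SPEECH'
  });
  return {
    options: {
      method: 'POST',
      hostname: 'api.att.com',
      path: '/oauth/token',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(body)
      }
    },
    body: body
  };
}

// Send the request and pass the access_token (or an error) to the callback.
function requestAccessToken(clientId, clientSecret, callback) {
  var https = require('https');
  var request = buildTokenRequest(clientId, clientSecret);
  var req = https.request(request.options, function(res) {
    var chunks = [];
    res.on('data', function(chunk) { chunks.push(chunk); });
    res.on('end', function() {
      var parsed;
      try {
        parsed = JSON.parse(Buffer.concat(chunks).toString('utf8'));
      } catch (err) {
        return callback(err);
      }
      callback(null, parsed.access_token);
    });
  });
  req.on('error', callback);
  req.end(request.body);
}
```

Keeping the request builder separate from the network call makes the form body easy to inspect without contacting the server.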

Step 2: POST Audio to AT&T
(Non-Streaming HTTP Request)
Request URL:     https://api.att.com/rest/1/SpeechToText
Request Headers: Accept: application/json
                 Authorization: Bearer xxyz123
                 Content-Type: audio/wav
                 Content-Length: 1534
                 X-SpeechContext: BusinessSearch
Request Body:    AUDIO_BINARY_DATA
Note: The audio binary data goes directly in the POST body, not as a MIME attachment.
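
As a sketch of the non-streaming request above, the headers can be assembled from an in-memory audio buffer like this. The function name `buildSpeechRequest` is hypothetical; the header names, path, and context values are the ones shown on the slide.

```javascript
// Build request options for the non-streaming call. The audio buffer is
// sent as the raw POST body, so Content-Length is its length in bytes.
function buildSpeechRequest(accessToken, audio, contentType, speechContext) {
  return {
    method: 'POST',
    hostname: 'api.att.com',
    path: '/rest/1/SpeechToText',
    headers: {
      'Accept': 'application/json',
      'Authorization': 'Bearer ' + accessToken,
      'Content-Type': contentType,        // 'audio/wav' or 'audio/amr'
      'Content-Length': audio.length,
      'X-SpeechContext': speechContext    // e.g. 'BusinessSearch'
    }
  };
}
```

The returned options can be passed to `https.request(options, onResponse)`, with the audio buffer written as the body via `req.end(audio)`.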

Step 2: POST Audio to AT&T
(Streaming HTTP Request)
Request URL:     https://api.att.com/rest/1/SpeechToText
Request Headers: Accept: application/json
                 Authorization: Bearer xxyz123
                 Content-Type: audio/amr
                 Transfer-Encoding: chunked
                 X-SpeechContext: QuestionAndAnswer
Request Body:    200
                 AUDIO_BINARY_DATA_CHUNK
                 200
                 AUDIO_BINARY_DATA_CHUNK
                 0
Note: Numbers are the recommended chunk size in hexadecimal format.
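
The numbers in the streaming request body follow HTTP/1.1 chunked transfer encoding: each chunk is its size in hexadecimal, a CRLF, the data, and another CRLF, and a zero-size chunk ends the body. The sketch below frames a buffer that way purely for illustration; the function name `toChunkedBody` is hypothetical, and the default of 0x200 (512 bytes) is the size shown on the slide.

```javascript
// Frame a buffer as an HTTP/1.1 chunked body.
function toChunkedBody(audio, chunkSize) {
  var size = chunkSize || 0x200; // 0x200 hex = 512 bytes
  var parts = [];
  for (var offset = 0; offset < audio.length; offset += size) {
    var chunk = audio.subarray(offset, offset + size);
    // Chunk header: the chunk length in hexadecimal, followed by CRLF
    parts.push(Buffer.from(chunk.length.toString(16) + '\r\n', 'ascii'));
    parts.push(chunk);
    parts.push(Buffer.from('\r\n', 'ascii'));
  }
  // A zero-length chunk terminates the body
  parts.push(Buffer.from('0\r\n\r\n', 'ascii'));
  return Buffer.concat(parts);
}
```

In practice, Node's `http` and `https` clients apply this framing themselves when a request has no Content-Length header, so calling `req.write()` with 512-byte slices of captured audio is usually enough.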

AT&T SPEECH API
EXAMPLE
APPLICATION
Download the Source:
https://github.com/attdevsupport/2012DevLabExamples

Transcription in Three Steps
1. Capture Audio Input
Capturing audio input differs from platform to platform.
In our Basic Example, we use a small Adobe Flex app to access the mic via Flash, capture the audio in one of the two accepted formats, then save that newly created audio file to disk on the server.
In our Speech Labs, we will look at the methods by which you can capture and stream audio directly to the Speech API.

2. POST Audio to AT&T
Once the audio input has been captured, we send the compatible audio file from our server to the Speech API using a simple POST.
In our Basic Example, we use a small Node.js module called "Watson.js" (NPM: "watson-js") to OAuth to the Speech API and then POST the audio file.
In our Speech Labs, we will do this on iOS, Android, and Web.

3. Use AT&T API Response
The AT&T API sends back a very easy-to-parse JSON object with the interpreted text.
In our Basic Example, we output this to the user's screen pretty printed and syntax highlighted, but you could do much more.
In our Speech Labs, we will look at other ways to use this data, like searching for businesses on Foursquare.

Watson.js
Node.js API Wrapper for the AT&T Speech API

GitHub: http://github.com/mowens/watson-js/
NPM: https://npmjs.org/package/watson-js

Using Watson.js
1. Require API Wrapper
   var WatsonClient = require('watson-js');

2. Set API Client Options
   var options = {
     client_id: ATT_API_CLIENT_ID,
     client_secret: ATT_API_CLIENT_SECRET,
     access_token: ACCESS_TOKEN,
     scope: "SPEECH",
     context: "Generic",
     access_token_url: "https://api.att.com/oauth/token",
     api_domain: "api.att.com"
   };

3. Instantiate New API Client
   var Watson = new WatsonClient.Watson(options);

The Methods of Watson.js
Watson.getAccessToken(callback)
Requests a new OAuth access token using the Client Credentials grant type and passes the returned access token to the callback function.

Watson.speechToText(speechFile, accessToken, callback)
Pipes a speech file (given as an absolute file path) to the AT&T Speech API using the supplied access token. The API response is passed to the callback function as parsed JSON.
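
The two methods can be chained so that every transcription starts with a fresh token. This is a minimal sketch, not code from the example app: the helper name `transcribe` is hypothetical, and it assumes both callbacks receive an error as their first argument and the result as their second, which matches how `speechToText` is called in the walkthrough but is an assumption for `getAccessToken`.

```javascript
// Request a token, then transcribe the file with it.
// `client` is any object exposing getAccessToken and speechToText as
// described above (for example, an instantiated Watson client).
function transcribe(client, speechFile, callback) {
  client.getAccessToken(function(err, accessToken) {
    if (err) { return callback(err); }
    client.speechToText(speechFile, accessToken, callback);
  });
}
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub object when the real service is not reachable.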

AT&T SPEECH API
EXAMPLE APP CODE
WALKTHROUGH
Using the AT&T Speech API to convert
generic audio to text in a web browser.
example-basic in the examples repo

Frameworks &
Requirements:
Server-side:
• Node.js: JavaScript platform for building fast, scalable network apps
• FS: Node.js File System module
• Express: Minimal web application framework for Node.js
• Optimist: Lightweight option parsing module for Node.js
• HBS: Express View Engine wrapper for Handlebars
• Watson.js: Simple API Wrapper for AT&T Speech API

Client-side:
• jQuery: The gold standard of client-side JavaScript libraries
• swfobject: JavaScript to make embedding Flash objects easier
• Bootstrap: Twitter’s CSS framework for quickly developing web apps

Capture Audio Input
recorder.swf:
Adobe Flex app that accesses the user's microphone and emits events to JS
recorder.js:
JavaScript interface to receive events, update UI, and POST file to Node.js
Node.js upload script:
  // Copy the uploaded temporary file to its destination
  function cp(source, destination, callback) {
    fs.readFile(source, function(err, buf) {
      // Stop here if the uploaded file could not be read
      if (err) { return callback(err); }
      fs.writeFile(destination, buf, callback);
    });
  }
  app.post('/upload', function(req, res) {
    cp(req.files.upload_file.filename.path, __dirname +
        req.files.upload_file.filename.name, function(err) {
      // Report copy failures to client-side JS instead of claiming success
      if (err) { res.send({ error: err }); return; }
      res.send({ saved: 'saved' });
    });
  });

POST Audio to AT&T
AJAX Request via POST from client side to Node.js
  // Receive an AJAX POST from client-side JavaScript
  app.post('/speechToText', function(req, res) {

    // Pass the audio file and access token to AT&T Speech API
    // (this.access_token is assumed to hold a token obtained earlier,
    // e.g. via Watson.getAccessToken)
    Watson.speechToText(__dirname + '/public/audio/audio.wav',
        this.access_token, function(err, reply) {

      // Pass any errors associated with API call to client-side JS
      if (err) { res.send({ error: err }); return; }

      // Return the parsed JSON to client-side JavaScript
      res.send(reply);
    });

  });

Use Speech API Response
Example API Response, returned from call using Content-Type of 'application/json':

  {
    "Recognition": {
      "ResponseId": "74a964bf2fe",
      "NBest": [ {
        "WordScores": [1, 0.75, 1, 0.75],
        "Confidence": 0.75,
        "Grade": "accept",
        "ResultText": "This is a test.",
        "Words": ["This", "is", "a", "test."],
        "LanguageId": "en-us",
        "Hypothesis": "This is a test."
      } ]
    }
  }

Response Parameter   What the Response Parameter Means
Recognition          Body object for the AT&T Speech API Response
ResponseId           Unique identifier for a specific API call
NBest                Array of hypothesis objects (possible transcriptions of audio data).
ResultText           Plain-text, cleaned up representation of the Hypothesis.
                     This should be used when displaying the text to users.
Confidence           Confidence score for the overall Hypothesis.
                     Scored on a scale from 0 (not confident) to 1.0 (very confident)
Grade                Recommended action to take with the current Hypothesis:
                     accept, reject, or confirm
Words                Array of the individual words. Confidence scores for each word
                     are available in the WordScores array.
WordScores           Array of individual confidence scores for each word in the
                     ResultText parameter. Corresponds to Words array.
LanguageId           Representation of the response language. Supports English &
                     Spanish in Generic; English-only in other contexts.
Hypothesis           The raw transcription of the audio that was interpreted.
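
One way to act on these fields is to pick the highest-confidence hypothesis and let Grade decide what happens next. The sketch below is an illustration only: the function name `bestTranscript` and the shape of its return value are hypothetical, and it assumes the reply has already been parsed into an object with the structure shown above.

```javascript
// Pick the most confident hypothesis from a parsed Speech API reply.
// Returns null when there is no hypothesis or the best one is graded "reject".
function bestTranscript(reply) {
  var nbest = (reply && reply.Recognition && reply.Recognition.NBest) || [];
  var best = null;
  nbest.forEach(function(hypothesis) {
    if (best === null || hypothesis.Confidence > best.Confidence) {
      best = hypothesis;
    }
  });
  if (best === null || best.Grade === 'reject') { return null; }
  return {
    text: best.ResultText,                       // the display-ready text
    confidence: best.Confidence,
    needsConfirmation: best.Grade === 'confirm'  // ask the user before acting
  };
}
```

An app could show `text` directly when `needsConfirmation` is false, prompt the user when it is true, and ask for the audio again when the function returns null.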

Up Next:

Michael Fitzpatrick

Up Next:

Jason Goecke
Adam Kalsey

ADVANCED
EXAMPLES
What can you do with Speech-to-text?
You could…
• Make your mobile or web application accessible with voice commands
• Post tweets using voice commands in a simple Twitter app
• Add on-the-fly transcripts while recording in a podcasting app
• Add captioning to videos hosted on your website automatically
• Create real-time closed captions of a conference speaker’s presentation
• Search for nearby places to check in at on Foursquare

Speech Labs
We’re now going to break out into three clusters, each focusing on a
different technology stack. Work independently or with a partner!

Web (Flex + Node.js)
In the Web Speech Lab, Michael will be on hand to help get your Node.js app working with the AT&T Speech API. Code up your own Speech API app from scratch, or you can start from a boilerplate app that uses Foursquare to search for locations and allow you to check in from your web browser!

iOS (Objective-C)
In the iOS Speech Lab, Brant will help you try out the AT&T Speech API on iOS and go into more depth about the AT&T Speech SDK for iOS. The mobile SDK allows you to quickly capture and stream audio from your iPhone or iPad app to the AT&T Speech API.

Android (Java)
In the Android Speech Lab, Jay will help you try out the AT&T Speech API on Android and go into more depth about the AT&T Speech SDK for Android. The mobile SDK allows you to quickly capture and stream audio from your Android phone or tablet app to the AT&T Speech API.

September 25, 2012

THANKS! ANY QUESTIONS?
Michael Owens (@mko on Twitter, mowens on Github)
Jay Lieske ( jay.lieske@att.com, jayatyp on Github)

