Games Speech ASDK

L&H Games ASDK for Sony PlayStation 2
Sony PlayStation 2 Developers Conference, San Jose, March 2001

L&H Games ASDK Presentation
- Lernout & Hauspie
  - Technologies
  - Products
- Speech in Games
- Games ASDK (our suggestion)
  - Engine Properties
  - Concepts
  - Development Process
    - Grammars
    - Integration
- Engine: flexible, comprehensive feature-set, targeted

L&H Desktop Dictation Products
- Dragon NaturallySpeaking ®
- L&H Voice Xpress ™
- World's most popular continuous dictation products
- SDK available for free download: build your own speech-enabled applications under Windows
Jun 6, 2009

Diverse other Speech Products based on L&H technologies
- CrimeTalk Reporter
- ST Microelectronics
- VX Medicine
- Xsara (Citroen)
- Seaman (Vivarium/Sega)

Lernout & Hauspie Speech Products
- A complete portfolio of Speech and Language technologies
  - Speech Recognition (DNS), Text-to-Speech (RealSpeak), Speaker Verification, Speech and Music coders, Phonetic tools
  - Intelligent Content Management tools (spelling and grammar checkers, topic identifier, summarizer, translators, audio mining)
- Most complete set of languages
- Broad range of (application-specific) development kits
- Optimized development kits for each market segment
- Speech and Language technologies:
  - Ensure maximum interaction and sense of realism
  - Allow for natural dialog and interaction
- Speech is the most powerful interface

L&H Games ASDK
- Lernout & Hauspie
  - Technologies
  - Products
- Speech in Games
- Games ASDK (our suggestion)
  - Engine Properties
  - Concepts
  - Development Process
    - Grammars
    - Integration
- Engine: flexible, comprehensive feature-set, targeted

Speech Recognition for Games
- Speech as an interface
  - Most natural & intuitive
  - We all know how to use it
  - Use speech & game-pad at the same time
- Examples
  - Interact with NPCs
  - Voice command and control of complex games
    - Simulation games
    - Realtime strategy & multiplayer RPGs
    - Turn-based games
    - Empire building
  - Input to virtual/animated character
    - Build a personal relationship
  - (moving strategic units and looking for strategic resources)
- New genre for speech-enabled games?

L&H Games ASDK for PlayStation 2
- The only vendor of a Speech Recognition SDK for the PS2
- Two parts
  - Development
  - Runtime
- Brings code into events and user interface
- Hides linguistic and phonetic knowledge
- Lets developers concentrate on program flow and functionality

Types of Speech Recognition
- Continuous versus Discrete
- Command-and-Control versus Dictation
- Vocabulary size
- Speaker-independent versus Speaker-dependent
- Phoneme-based versus Word-based
- Environment (office, telephony, car, …)
- Language dependency

Basics on the ASR1600 engine
Basic characteristics of the ASR1600:
- continuous speech recognition engine
- command-and-control
- speaker-independent, with the ability to train words
- phoneme-based
- noise robust
- trained for office environment
- input audio format: 11 kHz, 16-bit linear PCM
- language modules (speaker-independent)

Definition of concepts
- Grammar
  - list of sentences/words that can be recognized
  - rule-based
    - a rule can be e.g. a list of names or commands
    - a rule can also be more complex, and e.g. define any string of one or more digits
    - the recognizer only considers the start rules
    - one rule may refer to another (local) rule
  - source grammar
    - source code that describes the rules in a readable format
    - syntax: L&H BNF grammar format
  - binary grammar
    - compiled source grammar
    - binary format (unreadable) used by the engine
    - can be saved to file

Definition of concepts
- Context
  - binary grammars cannot be loaded directly on an engine
  - one or more binary grammars can be converted to one context
    - rule resolution
    - all start rules in parallel
    - binary grammars must have been compiled for the same language
  - an engine must have one and only one context loaded
  - context subsets (i.e. rules and/or grammars) can be dynamically activated or deactivated while the context is loaded on the engine
  - a context cannot be saved to a file

Development Process
- Offline grammar creation
  - Offline development tools for Win32 platform
  - Development of grammars and contexts
  - Testing and tuning
- Runtime: Implementation / Integration

Offline Context Creation
- Not part of the compile / debug / build cycle
- Tested on PC
- Exported to format suitable for GSAPI runtime system

Grammar Specification
- L&H Backus-Naur format for grammar description
- Context Information File
  - Specification of events & parameters
  - On a per-context basis

Representation of a Grammar
- Specified as a list by designer or programmer
- Think of it as a directed graph

Grammar Simplification
- When sentences are generated, the tool walks the graph and specifies longer paths in terms of shorter paths:
    faster
    go ret_faster_at_1
    move ret_faster_at_1

L&H BNF Grammar
    !start<start>;
    <verb>:play|stop;
    <object>:music|radio;
    <start>:<verb>!optional(<object>);
Recognized sentences: play, stop, play music, stop music, play radio, stop radio
!optional defines an optional expression: the expression can be spoken, but does not have to be.
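To make the effect of !optional concrete, the following minimal sketch expands the <start> rule above by hand. It is not part of the L&H tools: the helper name expand_start_rule, the fixed arrays and the buffer sizes are all assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper, not an L&H API: expands
   <start>:<verb>!optional(<object>) into every sentence it accepts. */
static const char *VERBS[]   = { "play", "stop" };
static const char *OBJECTS[] = { "music", "radio" };

#define N_VERBS       2
#define N_OBJECTS     2
#define MAX_SENTENCES (N_VERBS * (1 + N_OBJECTS))

/* Fills out[] with the sentences and returns how many were written. */
static int expand_start_rule(char out[][32], int capacity)
{
    int n = 0;
    for (int v = 0; v < N_VERBS; ++v) {
        /* Optional part left out: the verb alone is a valid sentence. */
        if (n < capacity)
            snprintf(out[n++], 32, "%s", VERBS[v]);
        /* Optional part spoken: one sentence per <object> alternative. */
        for (int o = 0; o < N_OBJECTS; ++o)
            if (n < capacity)
                snprintf(out[n++], 32, "%s %s", VERBS[v], OBJECTS[o]);
    }
    return n;
}
```

Two verbs, each with three variants (no object, music, radio), give the six sentences listed on the slide.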

L&H BNF Grammar
grammar1.bnf (import rule):
    !start <rule> ;
    !import <dirs> ;
    <rule>: go <dirs> ;
grammar2.bnf (export rule):
    !export <dirs> ;
    <dirs> : left | right ;
Context: go left, go right

Development Process
- Offline Context Creation
- Implementation / Integration

GSAPI Implementation / Integration
- More like a server than just a library
  - Call gsapi_ExecServer() every vsync (we’re not dropping frames, now are we?)
- Initialization couldn’t be easier
  - Initialize API
  - Load a Language
  - Open a recognizer
    Error = gsapi_Init(cbResult, 0);
    Error = gsapi_LanguageLoadBuffer(pLng, 0);
    Error = gsapi_EngineOpen(0, &hEngine);
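The start-up sequence and the per-vsync server call could be wrapped roughly as follows. This is only a sketch: the slides do not show the GSAPI header, so every type, the success value GSAPI_OK, the argument lists of gsapi_LanguageLoadBuffer, gsapi_EngineOpen and gsapi_ExecServer, and the stub bodies below are assumptions standing in for the real library. Only the call order and the shape of gsapi_Init come from the slides.

```c
#include <stddef.h>

/* ---- Stand-ins for the real GSAPI header (all assumed) ---- */
typedef int   GSAPI_ERROR;
typedef void *GSAPI_HENGINE;
typedef int   GSAPI_ACTION;
typedef struct { int placeholder; } GSAPI_RESULT;
typedef void (*GSAPI_CBRESULT)(GSAPI_HENGINE, GSAPI_ACTION,
                               GSAPI_RESULT *, unsigned long);
#define GSAPI_OK 0 /* hypothetical name; assumes 0 means success */

static int g_server_ticks = 0;
static int g_dummy_engine;

/* Stubs so the sketch compiles without the SDK. */
static GSAPI_ERROR gsapi_Init(GSAPI_CBRESULT cb, unsigned long user)
{ (void)cb; (void)user; return GSAPI_OK; }
static GSAPI_ERROR gsapi_LanguageLoadBuffer(const void *lng, int flags)
{ (void)flags; return lng ? GSAPI_OK : 1; }
static GSAPI_ERROR gsapi_EngineOpen(int index, GSAPI_HENGINE *out)
{ (void)index; *out = &g_dummy_engine; return GSAPI_OK; }
static void gsapi_ExecServer(void) { ++g_server_ticks; }

static void on_result(GSAPI_HENGINE e, GSAPI_ACTION a,
                      GSAPI_RESULT *r, unsigned long u)
{ (void)e; (void)a; (void)r; (void)u; }

/* Runs the three start-up calls in slide order; stops at the first error. */
static GSAPI_ERROR speech_startup(const void *language_image,
                                  GSAPI_HENGINE *engine)
{
    GSAPI_ERROR err = gsapi_Init(on_result, 0);
    if (err != GSAPI_OK) return err;
    err = gsapi_LanguageLoadBuffer(language_image, 0);
    if (err != GSAPI_OK) return err;
    return gsapi_EngineOpen(0, engine);
}

/* Called once per vsync from the game loop to give the server its time slice. */
static void speech_frame(void) { gsapi_ExecServer(); }
```

Checking each return value before moving on keeps a failed language load from reaching gsapi_EngineOpen; the slide code assigns Error three times without showing the checks.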

GSAPI Implementation / Integration
- Specify mode with gsapi_EngineSetMode()
  - GSAPI_MODE_CONTINUOUS: speech recognition is always active (always listening in the background… privacy issues for Q&A)
  - GSAPI_MODE_SINGLE: speech recognition is activated and deactivated (PTT, dialog)
- In Continuous mode, load and activate a context, and start the recognizer.
- In Single mode, begin recognition in response to an event (e.g. a button press). The recognizer will automatically stop. The programmer then has the opportunity to set another context, and to enable or disable rules.
    gsapi_EngineDisableAllWords()
    gsapi_EngineEnableWords()
    gsapi_EngineStart()

Engine Operating Modes
- Record mode
  - Not intended for recognition
  - Wave data can be obtained through notification callback

Result Callback
- Upon initializing GSAPI, provide a result callback:
    GSAPI_ERROR gsapi_Init(GSAPI_CBRESULT pResultCallback, unsigned long u32UserData)
- When a result is available, the engine calls it:
    VOID (* GSAPI_CBRESULT)(GSAPI_HENGINE hEngine, GSAPI_ACTION Action, GSAPI_RESULT* pResult, unsigned long u32UserData)
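One way a game might use the callback is to store the result and let the game loop pick it up on its next frame. The sketch below follows the GSAPI_CBRESULT shape shown on the slide, but the slides do not show the contents of GSAPI_RESULT or the values of GSAPI_ACTION: the placeholder types, the text field, the use of u32UserData as a player slot index, and the helper take_command are all assumptions.

```c
#include <stdio.h>

/* ---- Placeholder types; the real ones come from the GSAPI header ---- */
typedef void *GSAPI_HENGINE;
typedef int   GSAPI_ACTION;
typedef struct { const char *text; } GSAPI_RESULT; /* hypothetical field */

#define MAX_PLAYERS 2

typedef struct { int pending; char command[32]; } PlayerSpeech;
static PlayerSpeech g_players[MAX_PLAYERS];

/* Same parameter list as GSAPI_CBRESULT on the slide.
   u32UserData is treated here as a player slot index (assumption). */
static void on_result(GSAPI_HENGINE hEngine, GSAPI_ACTION Action,
                      GSAPI_RESULT *pResult, unsigned long u32UserData)
{
    (void)hEngine; (void)Action;
    if (pResult == NULL || pResult->text == NULL || u32UserData >= MAX_PLAYERS)
        return;
    PlayerSpeech *slot = &g_players[u32UserData];
    /* Copy the text: the result may not outlive the callback. */
    snprintf(slot->command, sizeof slot->command, "%s", pResult->text);
    slot->pending = 1;
}

/* Game-loop side: returns 1 and copies the command if one is waiting. */
static int take_command(unsigned long player, char *out, size_t out_size)
{
    if (player >= MAX_PLAYERS || !g_players[player].pending)
        return 0;
    snprintf(out, out_size, "%s", g_players[player].command);
    g_players[player].pending = 0;
    return 1;
}
```

Keeping the callback this small leaves game logic in the main loop, which fits the server-style design where results arrive during the per-vsync call.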

Tips
- Don’t just translate controller input, create a SUI (speech user interface)
- We can provide state-of-the-art speech technologies, but don’t ignore limitations
- Plan and be creative
- Speech has strengths & weaknesses, so design it in from the start!
  - E.g. adding speech to a game that’s done: speech will never seem to be a natural part of the application unless the interface is redesigned so the speech parts make more sense.
- Users don’t want to learn commands by heart. You can make it intuitive.
- Misrecognitions should not be fatal
- Smart responses can add to the experience

L&H Games ASDK Presentation
- Lernout & Hauspie
  - Technologies
  - Products
- Speech in Games
- Games ASDK (our suggestion)
  - Engine Properties
  - Concepts
  - Development Process
    - Grammars
    - Integration
- Engine: flexible, comprehensive feature-set, targeted

Engine Parameters
- Quick preview
- Demonstrate flexibility
- Comprehensive set of features
  - Chosen with the game developer in mind, but
  - with many years of experience in the industry to know “what works”

Engine parameters
- N-best
  - maximum number of possible recognition results that should be returned per utterance
- Accuracy
  - the maximum number of hypotheses that are simultaneously searched at any time: if the number of hypotheses grows beyond the accuracy parameter, the ones that score worst are removed from the search
  - affects the precision of the recognizer, the CPU load and the memory consumption
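The pruning described for the accuracy parameter, and the cap that N-best puts on returned results, can both be pictured as "keep the best k". The sketch below is a generic illustration and not the ASR1600 implementation: the Hypothesis struct, the convention that a higher score is better, and the function prune are assumptions.

```c
#include <stdlib.h>

/* Hypothetical search hypothesis; higher score means better (assumption). */
typedef struct { int id; int score; } Hypothesis;

/* qsort comparator: best score first. */
static int by_score_desc(const void *a, const void *b)
{
    const Hypothesis *x = a, *y = b;
    return (x->score < y->score) - (x->score > y->score);
}

/* Sorts h[] best-first and returns how many entries survive the limit.
   Entries past the returned count are the worst-scoring ones and are
   simply ignored by the caller. */
static size_t prune(Hypothesis *h, size_t count, size_t limit)
{
    qsort(h, count, sizeof *h, by_score_desc);
    return count < limit ? count : limit;
}
```

Calling prune with the accuracy value during the search and with the N-best value on the final list shows the trade-off on the slide: a larger limit keeps more candidates alive at the cost of more CPU and memory.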

Engine parameters
- Rejection
  - the confidence level gives an indication of how confident the engine is that the recognized result was actually spoken, compared to the average speech model
  - the lower this parameter, the more the engine will lean towards the average speech model
  - does not influence the actual recognition results; only the confidence levels
  - the default is set such that, for a confidence level of 5000, the number of false rejections is equal to the number of false acceptances

Engine parameters
[Figure: number of false acceptances (#FAs) and false rejections (#FRs) plotted against the threshold (confidence level), axis from 0 to 10000 with 5000 marked]
- False acceptance (FA): a wrong result is accepted
- False rejection (FR): a correct result is rejected
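The FA/FR trade-off can be measured on a labeled test set by sweeping the application's acceptance threshold. The following sketch assumes confidence levels in the range 0 to 10000 (taken from the figure axis) and an accept rule of "confidence at or above threshold"; the struct and function names are hypothetical and the data in any test run would be your own recordings, not engine output shown here.

```c
#include <stddef.h>

/* One test utterance: engine confidence plus whether the result was right. */
typedef struct { int confidence; int correct; } Utterance;
typedef struct { int false_accepts; int false_rejects; } ErrorCounts;

/* Counts FAs and FRs for a given acceptance threshold. */
static ErrorCounts count_errors(const Utterance *u, size_t n, int threshold)
{
    ErrorCounts c = {0, 0};
    for (size_t i = 0; i < n; ++i) {
        int accepted = u[i].confidence >= threshold; /* assumed accept rule */
        if (accepted && !u[i].correct) ++c.false_accepts;  /* wrong, accepted  */
        if (!accepted && u[i].correct) ++c.false_rejects;  /* right, rejected  */
    }
    return c;
}
```

Raising the threshold can only lower the FA count and raise the FR count, which is why a game where misrecognitions are cheap might sit below 5000 and one where they are costly might sit above it.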

Engine parameters
- Garbage
  - garbage model (<…>): absorbs speech that is of no importance
  - the garbage parameter modulates the match on the garbage model: the lower the parameter, the more the engine will lean towards the garbage model and the more keywords could be missed
- Gender selection
  - can be any combination of male, female, child
  - by default, all genders are active: the engine detects the gender of the speaker
  - can only be set if a language (i.e. context) is loaded on the engine
  - all ASR1600 languages only support male, female and child

Engine parameters
- Speech detection
  - enables or disables the “Voice Activity Detector” (VAD)
  - if the VAD detects speech in the input signal, the engine goes from low CPU mode to full recognition mode
  - disable if the application requires tight control over the starting point of speech
- Sensitivity
  - minimal energy jump (dB/100) in the input signal that is perceived by the VAD as being speech
  - the higher, the more deaf the engine is
  - the lower, the higher the risk that the engine starts full recognition on background speech
- Minimum duration for start
  - minimum duration of speech in milliseconds to trigger the VAD
  - prevents the engine from reacting to short, high-energy events
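How sensitivity and minimum duration interact can be illustrated with a toy energy-based detector. This is not the engine's VAD: it assumes per-frame energies already expressed in dB/100, a fixed known noise floor, and a "jump" measured against that floor; the function name and parameters are hypothetical.

```c
#include <stddef.h>

/* Returns the index of the frame at which the detector fires, or -1.
   Energies, the noise floor and the sensitivity are in dB/100. */
static int vad_trigger_frame(const int *energy_cdb, size_t n_frames,
                             int noise_floor_cdb, int sensitivity_cdb,
                             int frame_ms, int min_duration_ms)
{
    int run_ms = 0; /* length of the current uninterrupted speech run */
    for (size_t i = 0; i < n_frames; ++i) {
        if (energy_cdb[i] - noise_floor_cdb >= sensitivity_cdb) {
            run_ms += frame_ms;
            if (run_ms >= min_duration_ms)
                return (int)i; /* loud for long enough: treat as speech */
        } else {
            run_ms = 0;        /* short burst ended: start over */
        }
    }
    return -1;
}
```

With 10 ms frames and a 30 ms minimum duration, a single loud frame (a click or a button press) resets without firing, while three loud frames in a row trigger the detector. Raising the sensitivity makes the same input fail to trigger at all.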

Engine parameters
- Trailing silence detection
  - enables or disables the trailing silence detector
  - the trailing silence detector triggers the engine to calculate a result
  - disable if the application requires tight control over the end point of speech
- Reaction time
  - minimum trailing silence in milliseconds
  - the higher it is, the slower the engine responds
  - has to be longer than the duration of a pause between two words
- Time out
  - indicates the longest time (in seconds) an utterance can take (low CPU mode included)
  - set to zero to disable time-out
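The relationship between reaction time, word pauses and time-out can be sketched as a small end-pointing state machine. Everything here is an assumption for illustration: the struct, the enum, the per-frame speech flag, and the use of milliseconds for the time-out (the engine parameter is in seconds).

```c
typedef enum { UTT_RUNNING, UTT_END_OF_SPEECH, UTT_TIMEOUT } UttState;

typedef struct {
    int reaction_ms;  /* minimum trailing silence before a result is computed */
    int timeout_ms;   /* longest allowed utterance; 0 disables the time-out   */
    int elapsed_ms;   /* time since the utterance started                      */
    int silence_ms;   /* current run of silence after speech                   */
    int heard_speech; /* set once any speech frame has been seen               */
} Endpointer;

/* Feeds one audio frame's speech/non-speech decision into the end-pointer. */
static UttState endpointer_feed(Endpointer *e, int frame_is_speech, int frame_ms)
{
    e->elapsed_ms += frame_ms;
    if (frame_is_speech) {
        e->heard_speech = 1;
        e->silence_ms = 0;             /* a word pause ended: keep listening */
    } else if (e->heard_speech) {
        e->silence_ms += frame_ms;     /* silence only counts after speech   */
    }

    if (e->heard_speech && e->silence_ms >= e->reaction_ms)
        return UTT_END_OF_SPEECH;
    if (e->timeout_ms > 0 && e->elapsed_ms >= e->timeout_ms)
        return UTT_TIMEOUT;
    return UTT_RUNNING;
}
```

A 100 ms pause between two words does not end the utterance when reaction_ms is 300, but the result also cannot arrive sooner than 300 ms after the last word, which is the latency cost the slide refers to.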

Engine parameters
- Continuous mode
  - if disabled: the engine stops automatically after each recognition
  - if enabled: the engine recognizes multiple utterances one after the other, until the application explicitly aborts the recognition process
  - if enabled, trailing silence detection and speech detection are enabled and time-out is disabled
- Far talk
  - on/off parameter
  - if off, then AGC is to use small increments when changing the volume settings

Engine parameters
- AGC
  - set to FALSE (0) to disable
  - if enabled: the engine will request the optimal gain
  - if the engine recognizes directly from the WaveIn device:
    - set to TRUE (1) to enable
    - the engine will automatically adjust the gain of the WaveIn device
    - will be reset to 0 if the WaveIn device does not allow AGC (immediately or when calling lhs_asrRecAcqOpen)
  - if the engine acquires audio data from the application:
    - set to current gain (1-65535) to enable
    - requests are notified through a callback function
