Using computers to authenticate a person using their voice
Bhusan Chettri gives an overview of the technology behind Voice Authentication using AI
So, what is Automatic Speaker Recognition?
Automatic Speaker Recognition is the task of recognizing people from their voice using a
computer. It generally comprises two tasks: speaker identification and speaker verification. Speaker
identification involves finding the correct person from a given pool of known speakers or voices. A speaker
identification system usually involves a set of N speakers who are already registered in the system, and
only these N speakers are allowed access. Speaker verification, on the other hand, involves verifying from
a voice sample whether a person is who he or she claims to be.
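To make the distinction concrete, here is a minimal sketch contrasting the two tasks; the scoring function, speaker names, and vectors are all hypothetical placeholders, not part of any real system. Identification picks the best-matching speaker from the enrolled pool, while verification compares a single claimed speaker's score against a threshold.

```python
import numpy as np

def score(utterance, speaker_model):
    # Placeholder: a real system would return a model-based similarity
    # (e.g. a log-likelihood ratio or a cosine score between embeddings).
    return float(np.dot(utterance, speaker_model))

# Hypothetical enrolled models: one vector per registered speaker.
enrolled = {"alice": np.array([0.9, 0.1]), "bob": np.array([0.2, 0.8])}
test_utterance = np.array([0.85, 0.15])

# Speaker identification: a 1-of-N decision over the enrolled pool.
identified = max(enrolled, key=lambda s: score(test_utterance, enrolled[s]))

# Speaker verification: a binary accept/reject for one claimed identity.
THRESHOLD = 0.5
claimed = "alice"
accepted = score(test_utterance, enrolled[claimed]) >= THRESHOLD
print(identified, accepted)
```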
These systems are further classified into two categories depending on the level of user cooperation: (1)
text-dependent and (2) text-independent. In a text-dependent application, the system has prior knowledge
of the spoken text and therefore expects the same utterance at test time (the deployment phase). For
example, a pass-phrase such as "My voice is my password" is used both during speaker enrollment
(registration) and during deployment (when the system is running). In text-independent systems, by
contrast, there is no prior knowledge of the lexical content, which makes these systems considerably
more complex than text-dependent ones.
So how do speaker verification algorithms work? How are they trained and deployed?
Bhusan Chettri says: to build an automatic speaker recognition system, the first thing we need is
data: large amounts of speech collected from hundreds or thousands of speakers across varied acoustic
conditions. The block diagram below summarises a typical speaker verification system. It consists of a
speaker enrollment phase (Fig. a) and a speaker verification phase (Fig. b). The role of the feature
extraction module is to transform the raw speech signal into a representation (features) that retains
speaker-specific attributes useful to the downstream components in building speaker models. The
enrollment phase comprises offline and online modes of building models. During the offline mode,
background models are trained on features computed from a large speech collection representing a
diverse population of speakers. The online phase builds a target speaker model using features computed
from the target speaker's speech. Training the target speaker model from scratch is usually avoided,
because learning reliable model parameters requires a sufficiently large amount of speech data, which is
rarely available for every individual speaker. To overcome this, the parameters of a pretrained background
model representing the speaker population are adapted using the speaker's data, yielding a reliable
speaker model estimate. During the speaker verification phase, for a given test utterance, the claimed
speaker's model and the background model (representing the world of all other possible speakers) are
used to derive a confidence score. The decision logic module then makes a binary decision: it either
accepts the claimed identity as genuine or rejects it as an impostor, based on a decision threshold (a
minimal sketch of this scoring step appears after the figure captions below).
(a) Speaker enrollment phase. The goal is to build speaker-specific models by adapting a background
model trained on a large speech database.
(b) Speaker verification phase. For a given speech utterance, the system obtains a verification score and
decides whether to accept or reject the claimed identity.
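As a hedged illustration of the scoring and decision logic described above, the sketch below computes a log-likelihood-ratio score between a claimed speaker's model and a background model. The synthetic data and threshold are assumptions for demonstration; the models are scikit-learn GaussianMixture instances, whose .score() method returns the average per-frame log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def verify(features, speaker_gmm, ubm, threshold=0.0):
    """Accept or reject a claimed identity via a log-likelihood ratio.

    features: (num_frames, num_dims) array of e.g. MFCC vectors.
    Both models are trained GaussianMixture instances.
    """
    llr = speaker_gmm.score(features) - ubm.score(features)
    return llr >= threshold, llr

# Toy demonstration with synthetic 2-D "features".
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=4, random_state=0).fit(rng.normal(size=(500, 2)))
speaker_gmm = GaussianMixture(n_components=4, random_state=0).fit(
    rng.normal(loc=1.0, size=(200, 2)))
accepted, llr = verify(rng.normal(loc=1.0, size=(100, 2)), speaker_gmm, ubm)
print(f"accepted={accepted}, LLR={llr:.2f}")
```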
How has the state of the art changed, driven by big data and AI?
Bhusan Chettri explains that there has been a major paradigm shift in the way we build these systems. To
bring clarity, Dr. Bhusan Chettri summarises recent advances in the state of the art under two broad
categories: (1) traditional approaches and (2) deep learning (and big data) approaches.
Traditional methods. By traditional methods he refers to approaches built around the Gaussian mixture
model with a universal background model (GMM-UBM), which dominated the ASV literature until deep
learning techniques became popular in the field. Mel-frequency cepstral coefficients (MFCCs) were the
most popular frame-level feature representation used in speaker verification. From sequences of short-term
MFCC feature vectors, utterance-level features such as i-vectors are often derived, and these have shown
state-of-the-art performance in speaker verification. Background models such as the universal background
model (UBM) and the total variability (T) matrix are learned in an offline phase using a large collection of
speech data; the UBM and T matrix are then used to compute i-vector representations (an i-vector is
simply a fixed-length vector representing a variable-length speech utterance). The training process involves
learning model (target or background) parameters from training data. As for modelling techniques, vector
quantization (VQ) was one of the earliest approaches used to represent a speaker, after which Gaussian
mixture models (GMMs), an extension of VQ methods, and support vector machines became popular
methods for speaker modelling. The traditional approach also includes training an i-vector extractor
(GMM-UBM plus T-matrix) on MFCCs and using a probabilistic linear discriminant analysis (PLDA)
backend for scoring.
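The following sketch illustrates the classical relevance-MAP adaptation of UBM component means that underlies GMM-UBM enrollment. It is a simplified, means-only version under stated assumptions: the relevance factor and toy data are invented for illustration, and a full system would also adapt weights and covariances.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, speaker_frames, relevance=16.0):
    """Derive a speaker model by MAP-adapting the UBM's means.

    speaker_frames: (num_frames, num_dims) enrollment features.
    Returns a copy of the UBM whose component means are shifted towards
    the speaker's data in proportion to the soft frame count n_c that
    each Gaussian component receives.
    """
    post = ubm.predict_proba(speaker_frames)        # (frames, components)
    n_c = post.sum(axis=0)                          # soft counts per component
    # First-order statistics: posterior-weighted mean of the frames.
    f_c = post.T @ speaker_frames                   # (components, dims)
    e_c = f_c / np.maximum(n_c, 1e-10)[:, None]
    alpha = (n_c / (n_c + relevance))[:, None]      # adaptation coefficient
    speaker = GaussianMixture(n_components=ubm.n_components)
    speaker.weights_, speaker.covariances_ = ubm.weights_, ubm.covariances_
    speaker.precisions_cholesky_ = ubm.precisions_cholesky_
    speaker.means_ = alpha * e_c + (1.0 - alpha) * ubm.means_
    return speaker

# Toy usage: fit a UBM on pooled data, adapt it with one speaker's frames.
rng = np.random.default_rng(1)
ubm = GaussianMixture(n_components=8, random_state=1).fit(rng.normal(size=(2000, 20)))
speaker_model = map_adapt_means(ubm, rng.normal(loc=0.5, size=(300, 20)))
```

Because each component's adaptation coefficient grows with its soft count, components that the enrollment data actually touches move towards the speaker while the rest stay at the UBM estimate, which is what makes adaptation viable with only a little speech per speaker.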
Deep learning methods. In deep learning based approaches to ASV, features are often learned in a
data-driven manner, either directly from the raw speech signal or from intermediate representations such
as filter-bank energies. Handcrafted features, for example MFCCs, are also commonly used as input for
training deep neural network (DNN) based ASV systems, and features learned by DNNs are in turn often
used to build traditional ASV systems. Researchers have used the output of the penultimate layer of a
pretrained DNN as features for a traditional i-vector PLDA setup (replacing i-vectors with DNN features).
Another common approach extracts bottleneck features (the output of a hidden layer with a relatively
small number of units) from a DNN to train a GMM-UBM system that scores with the log-likelihood ratio.
Utterance-level discriminative features, the so-called embeddings extracted from pretrained DNNs, have
recently become popular and demonstrate good results. End-to-end modelling approaches, in which
feature learning and model training are jointly optimised from the raw speech input, have also been
studied extensively in speaker verification with promising results. A wide range of neural architectures has
been explored for speaker verification, including feed-forward networks (commonly referred to as deep
neural networks, DNNs), convolutional neural networks (CNNs), recurrent neural networks, and attention
models. Training background models in deep learning approaches can be thought of as a pretraining
phase in which network parameters are learned on a large dataset. Speaker models are then derived by
adapting the pretrained parameters with speaker-specific data, much as a traditional GMM-UBM system
operates.
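As a hedged sketch of the embedding approach, the toy PyTorch network below (an invented architecture, not any specific system from the literature) pools frame-level features into a fixed-length utterance embedding taken from the penultimate layer; verification then reduces to comparing enrollment and test embeddings, for example with cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySpeakerNet(nn.Module):
    """Toy embedding extractor: frame encoder -> mean pooling -> embedding."""

    def __init__(self, feat_dim=40, emb_dim=128, num_train_speakers=1000):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.embedding = nn.Linear(256, emb_dim)   # penultimate layer
        # Speaker-classification head, used only during training.
        self.classifier = nn.Linear(emb_dim, num_train_speakers)

    def embed(self, frames):                       # frames: (batch, time, feat_dim)
        h = self.frame_encoder(frames)
        pooled = h.mean(dim=1)                     # temporal mean pooling
        return self.embedding(pooled)

    def forward(self, frames):
        return self.classifier(self.embed(frames))

# Verification by cosine scoring of enrollment vs. test embeddings
# (random tensors stand in for real acoustic features here).
net = TinySpeakerNet()
enroll = torch.randn(1, 200, 40)                   # 200 frames of 40-d features
test = torch.randn(1, 150, 40)
score = F.cosine_similarity(net.embed(enroll), net.embed(test)).item()
print("accept" if score >= 0.5 else "reject", score)
```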
So, Dr. Bhusan Chettri, where is this technology being used? What are its applications?
He explains that the technology can be used across a wide range of domains, such as (a) access control:
voice-based access control systems; (b) banking: authenticating a transaction by voice; and (c)
personalisation: unlocking mobile devices, or locking/unlocking a vehicle door and starting or stopping the
engine for a specific user.
Are they safe and secure? Are they prone to manipulation once deployed?
Bhusan Chettri further explains that although current algorithms, aided by big data, have achieved
remarkable state-of-the-art results, these systems are not 100% secure. They are prone to spoofing
attacks, in which an attacker manipulates a voice to sound like a registered user in order to gain
illegitimate access to the system. The ASV community has recently been promoting a significant amount
of research in this direction.