Deep Learning – Speaker
Recognition for Security & IoT
Sai Kiran Kadam (SK)
Description: Automatic Text-Independent Speaker
Recognition and Spoof Detection using Deep Learning
for Security and IoT
SPEAKER IDENTIFICATION & CLUSTERING USING
CONVOLUTIONAL NEURAL NETWORKS
Yanick Lukic, Carlo Vogt, Oliver Durr, Thilo Stadelmann
• Speaker ID using CNN; input to CNN: spectrograms (cepstral analysis)
• Speaker clustering: telling who spoke without prior knowledge of identity
• Technique/Method: apply CNNs to spectrograms to learn speaker-specific
features
• Libraries used: Python, with librosa (to compute the input) & Lasagne (to build and train the CNN)
• Training: dataset of studio-quality recordings from 630 people (192 F, 438 M)
• Experiments & Results:
• Optimal convolutional filter dimension
• Speaker identification performance: 97.0%, corresponding to 19 misidentified speakers
• Clustering performance: misclassification rate
• Use: apply the clustering approach and convolutional architecture to
my work
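As a quick check on the reported identification figure, the sketch below recomputes the accuracy from the counts on this slide (630 speakers, 19 misidentified). It assumes one identification decision per speaker, which the slide does not state explicitly; the function name is illustrative only.

```python
def identification_accuracy(num_speakers: int, num_misidentified: int) -> float:
    """Fraction of speakers identified correctly, assuming one decision per speaker."""
    if num_speakers <= 0:
        raise ValueError("num_speakers must be positive")
    return (num_speakers - num_misidentified) / num_speakers


# Counts taken from the slide: 630 speakers, 19 misidentified.
accuracy = identification_accuracy(630, 19)
print(f"{accuracy:.1%}")  # prints 97.0%
```

Under that assumption, 611 of 630 correct decisions rounds to the 97.0% quoted above.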
RECURRENT NEURAL NETWORKS FOR POLYPHONIC SOUND EVENT
DETECTION IN REAL LIFE RECORDINGS
Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen
• Presents an approach to polyphonic sound event detection (SED) in real-life recordings.
• Technique: BLSTM-RNN
• Training Data: 103 recordings (10-30 min long), total 1133 minutes,
from 10 real-life contexts, 8 to 14 recordings per context
• Testing Data: DB of 61 classes from 10 different real-life contexts
• Results/How good it is: Avg F1 score of 65.5% on 1-sec blocks & 64.7% on single
frames, with relative improvements over state-of-the-art methods of 6.8% & 15.1%
respectively.
• Limitations: Overfitting, since the dataset is small relative to the network (use data augmentation)
• Use: BLSTM-RNN with data augmentation for my thesis.
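To make the frame-level F1 metric concrete, here is a minimal sketch of a micro-averaged F1 for polyphonic (multi-label) frames. It is an illustration only: the event labels are hypothetical, and the paper's exact averaging over blocks and contexts may differ from this plain version.

```python
from typing import List, Set


def frame_f1(reference: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1 over frames; each frame holds the set of active event labels."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same number of frames")
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        tp += len(ref & pred)   # events correctly detected in this frame
        fp += len(pred - ref)   # events predicted but not present
        fn += len(ref - pred)   # events present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical labels; several events may overlap in one frame.
reference = [{"speech", "car"}, {"speech"}, set(), {"footsteps"}]
predicted = [{"speech"}, {"speech", "car"}, set(), {"footsteps"}]
print(round(frame_f1(reference, predicted), 3))
```

Because each frame carries a set of labels rather than a single class, overlapping events are scored independently, which is what distinguishes polyphonic from monophonic evaluation.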
Using Deep Learning for Detecting Spoofing Attacks on Speech
Signals
Alan Godoy, Flávio Simões, José Augusto Stuchi, Marcus de Assis Angeloni, Mário Uliani, Ricardo Violato
• About the Automatic Speaker Verification Spoofing & Countermeasures
Challenge (ASVspoof 2015), approached with deep neural networks.
• Biometric Spoofing: direct attack perpetrated against a biometric authentication system
by presenting a fake/forged biometric sample.
• Technique: DNN used as both a classifier and a feature extraction module.
• Feature Extraction: DNN-based MLP (input: a vector of 2668 features)
• Backpropagation + stochastic gradient descent optimization to train
the network
• How good it is: MLP showed EER < 0.5%, beating SVM-RBF & GMM
• Limitation: MLP is not as efficient as CNN/RNN with BLSTM
• Tradeoffs: BLSTM-RNN over MLP -> no loss of long-term info, EER increase
• Use: BLSTM-RNN with spoofing detection for security
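Since the results above are quoted as an equal error rate (EER), a small sketch of how an EER can be estimated from classifier scores may help. This is a simple threshold scan written for illustration, not the evaluation code used in the challenge; the scores are made up, and "higher score means more likely genuine" is an assumption.

```python
from typing import List, Tuple


def equal_error_rate(genuine: List[float], spoof: List[float]) -> Tuple[float, float]:
    """Return (eer, threshold) by scanning every observed score as a threshold.

    A trial is accepted as genuine when its score >= threshold.
    """
    if not genuine or not spoof:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer, best_thr = float("inf"), 1.0, 0.0
    for thr in sorted(set(genuine) | set(spoof)):
        far = sum(s >= thr for s in spoof) / len(spoof)      # spoofed trials accepted
        frr = sum(g < thr for g in genuine) / len(genuine)   # genuine trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Approximate the EER as the midpoint where the two error rates are closest.
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


# Hypothetical scores for illustration only.
genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.3]
spoof_scores = [0.1, 0.2, 0.35, 0.4, 0.7]
eer, threshold = equal_error_rate(genuine_scores, spoof_scores)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

The EER is the operating point where the false acceptance rate and false rejection rate are equal, so a lower value means the genuine and spoofed score distributions overlap less.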
END-TO-END ATTENTION BASED TEXT-DEPENDENT SPEAKER
VERIFICATION
Shi-Xiong Zhang, Zhuo Chen , Yong Zhao, Jinyu Li and Yifan Gong
• Presents an end-to-end system that uses CNNs to extract noise-robust, frame-level,
speaker-discriminative features
• These features are combined by an attention mechanism into an utterance-level speaker representation
• CNN + attention model are jointly optimized using an end-to-end criterion
• Technique: CNN + end-to-end architecture
• Tools: Theano framework, Keras package (Python)
• Testing: system is evaluated on the Windows 10 “Hey Cortana” SV task
• End-to-end architecture has 3 phases
• Training: 200k utterances from 10k speakers, each with 10-30 utterances
• Enrollment: 6 utterances of “Hey Cortana”
• Evaluation: 60k utterances from 3k target speakers & 3k impostors
• Attention mechanism with DNN outperforms CNN & LSTM.
• Use: attention model + BLSTM-RNN in my work? Not sure yet.
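The idea of combining frame-level features through attention can be sketched as a softmax-weighted average followed by a cosine-similarity comparison. This is a simplified illustration, not the paper's architecture: in the real system the per-frame scores would be produced by a learned network, whereas here they are passed in by hand, and all values are hypothetical.

```python
import math
from typing import List


def attention_pool(frames: List[List[float]], scores: List[float]) -> List[float]:
    """Softmax the per-frame scores and return the weighted average of the frames."""
    if not frames or len(frames) != len(scores):
        raise ValueError("need one score per frame and at least one frame")
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 2-dimensional frame features standing in for CNN outputs.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
equal = attention_pool(frames, [0.0, 0.0, 0.0])     # equal scores give the plain mean
focused = attention_pool(frames, [10.0, 0.0, 0.0])  # a high score lets frame 0 dominate
print(equal, focused)
print(cosine_similarity(equal, focused))
```

In a verification setting, the pooled vector of a test utterance would be compared with the pooled enrollment vector, and the similarity thresholded to accept or reject the claimed identity.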
Deep Neural Network Embeddings for Text-Independent
Speaker Verification
David Snyder, Daniel Garcia-Romero, Daniel Povey, Sanjeev Khudanpur
• Investigates replacing i-vectors with embeddings from a feed-forward DNN (for text-independent SV)
• i-vectors: low-dimensional representation of speech that captures speaker & channel characteristics
• Temporal pooling layer: captures long-term speaker characteristics, so the network can be trained to
discriminate between speakers from variable-length speech segments
• Tools: Kaldi speech recognition toolkit (USEFUL), with its nnet3 neural network library.
• Training Data: telephone speech, 65,000 recordings from 6,500 speakers
• Evaluation: on NIST SRE 2010 & SRE 2016
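The role of the temporal pooling layer can be illustrated with a short sketch: it collapses a variable number of frame-level vectors into one fixed-length vector. The slide does not say which statistics are pooled, so using the per-dimension mean and standard deviation here is an assumption, and the frame values are hypothetical.

```python
import math
from typing import List


def stats_pool(frames: List[List[float]]) -> List[float]:
    """Concatenate the per-dimension mean and standard deviation over all frames."""
    if not frames:
        raise ValueError("need at least one frame")
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) for d in range(dim)]
    return means + stds


# Two segments of different length (2 and 10 frames) with 2-dimensional features.
short_segment = [[1.0, 2.0], [3.0, 4.0]]
long_segment = [[0.0, 1.0]] * 5 + [[2.0, 3.0]] * 5
print(len(stats_pool(short_segment)), len(stats_pool(long_segment)))  # prints 4 4
```

Both segments map to a vector of the same size, which is what lets the layers after pooling operate on whole segments regardless of their duration.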
Related Research Papers
• D. Garcia-Romero, X. Zhang, A. McCree, and D. Povey, “Improving speaker recognition performance in the domain
adaptation challenge using deep neural networks,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014,
pp. 378–383.
• O. Novotný, P. Matějka, O. Glembek, O. Plchot, F. Grézl, L. Burget, and J. Černocký, “Analysis of the DNN-based SRE
systems in multi-language conditions,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016.
• E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint
text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International
Conference on. IEEE, 2014, pp. 4052–4056.
• Najim Dehak, Patrick J Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker
verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
• Felix Weninger, “Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit,” Journal of Machine
Learning Research, vol. 16, pp. 547–551, 2015.
• Yann LeCun and Yoshua Bengio, “Convolutional networks for images, speech, and time series,” The handbook of brain
theory and neural networks, vol. 3361, no. 10, pp. 1995, 1995.
• Ossama Abdel-Hamid, Li Deng, and Dong Yu, “Exploring convolutional neural network structures and optimization
techniques for speech recognition,” in INTERSPEECH, 2013, pp. 3366–3370.
• Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng, “Unsupervised feature learning for audio classification using
convolutional deep belief networks,” in Advances in neural information processing systems, 2009, pp. 1096–1104.

Deep Learning - Speaker Verification, Sound Event Detection
