Deep Learning - Speaker Verification, Sound Event Detection

Deep Learning – Speaker
Recognition for Security & IoT
Sai Kiran Kadam (SK)
Description: Automatic Text-Independent Speaker
Recognition and Spoof Detection using Deep Learning
for Security and IoT

SPEAKER IDENTIFICATION & CLUSTERING USING
CONVOLUTIONAL NEURAL NETWORKS
Yanick Lukic, Carlo Vogt, Oliver Dürr, Thilo Stadelmann
• Speaker ID using a CNN; input to the CNN: spectrograms (cepstral analysis)
• Speaker clustering: telling who spoke without prior knowledge of identity
• Technique/Method: Apply CNNs to spectrograms to learn speaker-specific
features
• Libraries Used: Python - LIBROSA (to compute the input) & LASAGNE (to build and train the CNN)
• Training: Dataset - Studio Quality Recordings - 630 people (192 F, 438 M)
• Experiments & Results:
• Optimal Convolutional Filter Dimension
• Evaluate speaker identification performance: 97.0% accuracy, corresponding to 19 misidentified speakers
• Evaluate clustering performance: misclassification rate
• Use: Apply the clustering approach and the convolutional architecture to
my work
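
The accuracy figure above can be checked against the dataset size. The sketch below is only a sanity check of the slide's numbers; it assumes one identification trial per speaker over all 630 speakers, which the slide does not state. The function name is hypothetical.

```python
# Figures taken from the slide. The one-trial-per-speaker setup is an
# assumption made for this check, not something the slide confirms.
NUM_FEMALE = 192
NUM_MALE = 438
NUM_SPEAKERS = NUM_FEMALE + NUM_MALE  # 630
NUM_MISIDENTIFIED = 19


def identification_accuracy(num_trials: int, num_errors: int) -> float:
    """Return the fraction of trials that were identified correctly."""
    if num_trials <= 0 or not 0 <= num_errors <= num_trials:
        raise ValueError("need num_trials > 0 and 0 <= num_errors <= num_trials")
    return (num_trials - num_errors) / num_trials


accuracy = identification_accuracy(NUM_SPEAKERS, NUM_MISIDENTIFIED)
print(f"{accuracy:.1%}")  # prints 97.0%
```

Under that assumption, 611 of 630 speakers are identified correctly, which rounds to the reported 97.0%.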

RECURRENT NEURAL NETWORKS FOR POLYPHONIC SOUND EVENT
DETECTION IN REAL LIFE RECORDINGS
Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen
• Presents approach to Polyphonic SED in real life recordings.
• Technique: BLSTM – RNN
• Training Data: 103 recordings (10-30 min long), 1133 minutes in total,
from 10 real-life contexts; 8 to 14 recordings per context
• Testing Data: DB of 61 classes from 10 different real-life contexts
• Results/How good it is: Avg F1 score of 65.5% on 1-sec blocks & 64.7% on single
frames, with relative improvement over state-of-the-art methods of 6.8% & 15.1%
respectively.
• Limitations: Overfitting, since the dataset is small relative to the network size (use data augmentation)
• Use: BLSTM-RNN with data augmentation for my thesis.
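
For reference, a minimal sketch of how an F1 score can be computed for polyphonic (multi-label) detection, where several events may be active in the same frame. The aggregation over all frames and classes used here is an assumption; the paper's exact evaluation protocol may differ. The function name and the toy labels are hypothetical.

```python
from typing import List, Set


def frame_f1(reference: List[Set[str]], predicted: List[Set[str]]) -> float:
    """F1 over all (frame, event) pairs; each frame holds a set of active events."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same number of frames")
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        tp += len(ref & pred)   # events active in both
        fp += len(pred - ref)   # events predicted but not present
        fn += len(ref - pred)   # events present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with four frames (hypothetical labels).
reference = [{"speech"}, {"speech", "car"}, {"car"}, set()]
predicted = [{"speech"}, {"speech"}, {"car", "dog"}, set()]
score = frame_f1(reference, predicted)
print(score)  # 3 hits, 1 false alarm, 1 miss -> 0.75
```

The same function applies to 1-sec blocks if each set holds the events active anywhere inside the block.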

Using Deep Learning for Detecting Spoofing Attacks on Speech
Signals
Alan Godoy, Flávio Simões, José Augusto Stuchi, Marcus de Assis Angeloni, Mário Uliani, Ricardo Violato
• About: a DNN-based approach to the Automatic Speaker Verification Spoofing &
Countermeasures Challenge (ASVspoof 2015).
• Biometric Spoofing: Direct attack perpetrated against a biometric authentication system
by presenting a fake/forged biometric sample.
• Technique: DNN used as a classifier and as a feature extraction module.
• Feature Extraction: DNN-based MLP (input: a vector of 2668 features)
• Training: backpropagation algorithm + stochastic gradient descent optimization to train
the network
• How good it is: MLP showed EER < 0.5%, beating SVM-RBF & GMM
• Limitation: MLP is not as efficient as CNN/RNN with BLSTM
• Tradeoffs: BLSTM-RNN over MLP -> no loss of long-term info, but EER increases
• Use: BLSTM-RNN with spoofing detection for security
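
The EER (equal error rate) quoted above is the operating point where the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from two score lists by sweeping thresholds. The scores are made up, the function name is hypothetical, and this is not the challenge's official scoring tool.

```python
from typing import List, Tuple


def equal_error_rate(genuine: List[float], spoof: List[float]) -> Tuple[float, float]:
    """Return (eer, threshold). A trial is accepted as genuine when score >= threshold."""
    if not genuine or not spoof:
        raise ValueError("both score lists must be non-empty")
    best_gap = None
    best_eer = 0.0
    best_threshold = 0.0
    for threshold in sorted(set(genuine) | set(spoof)):
        far = sum(s >= threshold for s in spoof) / len(spoof)     # spoof accepted
        frr = sum(g < threshold for g in genuine) / len(genuine)  # genuine rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
            best_threshold = threshold
    return best_eer, best_threshold


# Hypothetical classifier scores (higher = more likely genuine speech).
genuine_scores = [0.9, 0.8, 0.7, 0.4]
spoof_scores = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, spoof_scores)
print(eer, threshold)  # 0.25 0.6
```

With only a handful of scores the two error rates may never be exactly equal, so the sketch returns their average at the threshold where they are closest.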

END-TO-END ATTENTION BASED TEXT-DEPENDENT SPEAKER
VERIFICATION
Shi-Xiong Zhang, Zhuo Chen , Yong Zhao, Jinyu Li and Yifan Gong
• Presents an end-to-end system that uses CNNs to extract noise-robust, frame-level,
speaker-discriminative features
• These features are combined via an attention mechanism
• CNN + attention model are jointly optimized using an end-to-end criterion
• Technique: CNN + end-to-end architecture
• Tools: Theano framework, Keras package (Python)
• Testing: System is evaluated on the Windows 10 “Hey Cortana” SV task
• End-to-end architecture has 3 phases
• Training: 200k utterances from 10k speakers, each with 10-30 utterances
• Enrollment: 6 utterances of “Hey Cortana”
• Evaluation: 60k utterances from 3k target speakers & 3k impostors
• Attention mechanism with DNN outperforms CNN & LSTM.
• Use: Attention model + BLSTM-RNN for my work? Not sure yet.
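
A minimal sketch of the attention idea described above: frame-level features are scored, the scores are turned into weights with a softmax, and the weighted sum gives one utterance-level vector. The `query` vector is a hypothetical stand-in for the learned attention parameters; in the paper these are trained jointly with the CNN, which this sketch does not do.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; the returned weights sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: List[List[float]], query: List[float]) -> List[float]:
    """Weight each frame by softmax(dot(frame, query)) and sum over frames."""
    if not frames:
        raise ValueError("need at least one frame")
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]


# Three hypothetical 2-dimensional frame features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero query scores every frame equally, so pooling reduces to the plain mean.
uniform = attention_pool(frames, [0.0, 0.0])

# A query favouring the first dimension shifts weight to frames 1 and 3.
focused = attention_pool(frames, [5.0, 0.0])
print(uniform, focused)
```

Because the weights always sum to 1, the pooled vector has the same dimension as a single frame regardless of utterance length.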

Deep Neural Network Embeddings for Text-Independent
Speaker Verification
David Snyder, Daniel Garcia-Romero, Daniel Povey, Sanjeev Khudanpur
• Investigates replacing i-vectors with embeddings from a feed-forward DNN (for text-independent SV)
• i-vectors: low-dimensional representation of speech that captures speaker & channel characteristics
• Temporal pooling layer: captures long-term speaker characteristics, so the network can be trained to
discriminate between speakers from variable-length speech segments
• Tools: Kaldi Speech Recognition Toolkit (useful), with its nnet3 neural network setup.
• Training Data: 65,000 telephone speech recordings from 6,500 speakers
• Evaluation: on NIST - SRE2010 & SRE2016
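
To illustrate why a temporal pooling layer handles variable-length segments, the sketch below pools frame-level features into per-dimension mean and standard deviation. The choice of these two statistics is an assumption made for illustration (the slide only says "temporal pooling"), and the function name is hypothetical.

```python
import math
from typing import List


def temporal_pool(frames: List[List[float]]) -> List[float]:
    """Concatenate per-dimension mean and (population) standard deviation over time."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Two hypothetical segments of different length, 2 features per frame.
short_segment = [[1.0, 2.0], [3.0, 2.0]]
long_segment = [[1.0, 2.0], [3.0, 2.0], [1.0, 2.0], [3.0, 2.0]]

print(temporal_pool(short_segment))  # [2.0, 2.0, 1.0, 0.0]
print(temporal_pool(long_segment))   # [2.0, 2.0, 1.0, 0.0]
```

Both segments map to a vector of the same size (twice the feature dimension), so the layers after pooling see a fixed-length input however long the recording is.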

Related Research Papers
• D. Garcia-Romero, X. Zhang, A. McCree, and D. Povey, “Improving speaker recognition performance in the domain
adaptation challenge using deep neural networks,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014,
pp. 378–383.
• O. Novotný, P. Matějka, O. Glembek, O. Plchot, F. Grézl, L. Burget, and J. Černocký, “Analysis of the DNN-based SRE
systems in multi-language conditions,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016.
• E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint
text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International
Conference on. IEEE, 2014, pp. 4052–4056.
• Najim Dehak, Patrick J Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker
verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
• Felix Weninger, “Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit,” Journal of Machine
Learning Research, vol. 16, pp. 547–551, 2015.
• Yann LeCun and Yoshua Bengio, “Convolutional networks for images, speech, and time series,” The handbook of brain
theory and neural networks, vol. 3361, no. 10, pp. 1995, 1995
• Ossama Abdel-Hamid, Li Deng, and Dong Yu, “Exploring convolutional neural network structures and optimization
techniques for speech recognition,” in INTERSPEECH, 2013, pp. 3366–3370.
• Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng, “Unsupervised feature learning for audio classification using
convolutional deep belief networks,” in Advances in neural information processing systems, 2009, pp. 1096–1104
