IV WORKSHOP NVIDIA DE GPU E CUDA
Audio Processing using
Convolutional Neural Network
Diego Augusto
September 6, 2016
Speech Activity Detection (SAD)
❖ Distinguish speech and noise segments.
❖ Estimate start and end times of speech events, e.g.:
➢ #1, START: 1.2 sec, END: 2.5 sec
➢ #2, START: 3.3 sec, END: 4.9 sec
[Figure: waveform with the two detected speech segments highlighted]
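The segment output above (start/end times per speech event) can be sketched with a minimal energy-threshold detector. This is only an illustrative baseline, not the deck's actual front end; the frames, threshold, and 100 ms frame duration are made up.

```python
# Hypothetical minimal SAD: label each frame speech/noise by energy,
# then merge consecutive speech frames into (start, end) segments.

def energy_sad(frames, threshold, frame_dur=0.1):
    """Return a list of (start_sec, end_sec) speech segments."""
    labels = [1 if sum(s * s for s in f) / len(f) > threshold else 0
              for f in frames]
    segments, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = round(i * frame_dur, 2)          # speech onset
        elif lab == 0 and start is not None:
            segments.append((start, round(i * frame_dur, 2)))  # speech offset
            start = None
    if start is not None:                            # speech runs to the end
        segments.append((start, round(len(labels) * frame_dur, 2)))
    return segments

# Toy signal: quiet, loud burst, quiet, second loud burst.
frames = ([[0.01] * 160] * 3 + [[0.5] * 160] * 4 +
          [[0.01] * 160] * 2 + [[0.5] * 160] * 3)
print(energy_sad(frames, threshold=0.01))
```

On this toy input the detector emits two segments, matching the "#1 / #2" style of output shown above.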
Applications
❖ Segmentation of spontaneous speech:
➢ Live language translation.
➢ Speech transmission over audio codecs.
➢ Retrieval of speech in video and social networks.
❖ Pre-processing for speech engines:
➢ Speech Recognition - “what is being said?”
➢ Speaker Authentication - “who is speaking?”
➢ Speaker Diarization - “who spoke when?”
Challenges
❖ A large variety of noise types:
➢ Clicks, motor sounds, background voices.
❖ Voice distortion and overlapping sounds.
Convolutional Neural Network (CNN)
❖ CNN approach:
➢ Features are extracted automatically by the network.
❖ Inspired by the human visual system (visual cortex).
❖ Extracts distinctive features.
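The local feature extraction a CNN performs can be illustrated with a single 2D convolution over a small spectrogram patch. The patch and kernel values below are purely illustrative (the deck does not specify its network architecture); the kernel detects vertical edges, i.e. sudden energy onsets along the time axis.

```python
# Sketch of the core CNN operation: a valid-mode 2D convolution
# (strictly, cross-correlation, as in most deep-learning libraries).

def conv2d(patch, kernel):
    ph, pw = len(patch), len(patch[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ph - kh + 1):
        row = []
        for j in range(pw - kw + 1):
            # Sum of element-wise products over the kernel window.
            row.append(sum(patch[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# Toy "spectrogram" patch: energy jumps on at the third time step.
patch = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],        # responds strongly at the 0 -> 1 transition
          [-1, 1]]
print(conv2d(patch, kernel))
```

Stacking many such learned kernels, plus nonlinearities and pooling, is what lets the network extract distinctive time-frequency features automatically instead of relying on hand-crafted ones.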
CPqD Dataset
❖ > 300 hours of speech and noise:
➢ With ground-truth annotations.
❖ Environments:
➢ Phone conversation.
➢ PCs and IoT devices (mobile apps).
❖ Split into two parts:
➢ Development = 75%.
➢ Evaluation = 25%.
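A reproducible 75/25 development/evaluation split like the one above could be made along these lines; the file names and random seed are hypothetical.

```python
import random

# Hypothetical corpus file list; in practice these would be the
# dataset's annotated audio files.
files = [f"utt_{i:03d}.wav" for i in range(100)]

rng = random.Random(42)        # fixed seed so the split is reproducible
rng.shuffle(files)
cut = int(len(files) * 0.75)   # 75% development, 25% evaluation
development, evaluation = files[:cut], files[cut:]
print(len(development), len(evaluation))
```

Shuffling before cutting avoids any ordering bias (e.g. all phone recordings landing in one partition), and the fixed seed keeps the partitions stable across runs.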
Speech/Noise Features
[Figure: waveform and its spectrogram, with each frame labeled 0 or 1]
0 = NOISE
1 = SPEECH
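The per-frame 0/1 labels shown above can be derived from ground-truth (start, end) speech segments. A minimal sketch, with illustrative frame duration and segment times:

```python
# Convert ground-truth speech segments into per-frame labels
# (0 = noise, 1 = speech). A frame is speech if its center time
# falls inside any annotated segment.

def frame_labels(segments, n_frames, frame_dur=0.1):
    labels = [0] * n_frames
    for start, end in segments:
        for i in range(n_frames):
            t = (i + 0.5) * frame_dur      # frame center time in seconds
            if start <= t < end:
                labels[i] = 1
    return labels

# Two toy speech segments over a 1-second signal split into 10 frames.
print(frame_labels([(0.2, 0.5), (0.8, 1.0)], n_frames=10))
```

These labels are what the network is trained to predict for each spectrogram frame.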
Deep Learning Platform
[Diagram of the pipeline:]
➢ MANAGE: NVIDIA DIGITS 4, GPU GRID K520, Linux 64-bit.
➢ DEVELOPMENT: feature extraction, training.
➢ EVALUATION: testing, with results fed back into training (labeled "reinforcement learning").
NVIDIA DIGITS
[Screenshot: monitoring training and testing a model; example frame classification output: class 1 at 99.93%, class 0 at 0.07%]
Evaluation
[Figure: ground-truth vs. hypothesis segmentation over a spectrogram, with miss (MS) and false-alarm (FA) regions marked]
❖ Half-Total Error Rate: HTER = (MSR + FAR) / 2
➢ Miss Speech Rate, MSR (%):
■ (# speech samples not detected as speech / total # of speech samples) x 100
➢ False Alarm Rate, FAR (%):
■ (# nonspeech samples detected as speech / total # of nonspeech samples) x 100
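The HTER definition above can be computed directly from frame-level ground-truth and hypothesis label sequences. A minimal sketch; the label sequences are toy data.

```python
# Frame-level SAD scoring: miss-speech rate (MSR), false-alarm rate (FAR),
# and their average, the Half-Total Error Rate (HTER).

def hter(truth, hyp):
    misses = sum(1 for t, h in zip(truth, hyp) if t == 1 and h == 0)
    false_alarms = sum(1 for t, h in zip(truth, hyp) if t == 0 and h == 1)
    n_speech = sum(truth)
    n_noise = len(truth) - n_speech
    msr = 100.0 * misses / n_speech          # % of speech frames missed
    far = 100.0 * false_alarms / n_noise     # % of noise frames flagged as speech
    return msr, far, (msr + far) / 2

truth = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
hyp   = [0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
print(hter(truth, hyp))  # one miss out of 6 speech frames, one FA out of 4 noise frames
```

Averaging the two rates makes HTER robust to class imbalance: a system that labels everything as speech gets 0% MSR but 100% FAR, so its HTER is still 50%.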
Evaluation
❖ QUT-NOISE-TIMIT:
➢ A large-scale dataset for evaluating SAD algorithms.
❖ Technical challenges and future work:
➢ Automatic adaptation to the environment.
➢ Overlapping sound events.
➢ Applying the CNN approach to other problems.

Features      Classifier   HTER
Energy        Threshold    26.3%
MFCC          GMM-HMM      4.7%
Spectrogram   CNN          3.2%
References
● J. Sohn, N. S. Kim, and W. Sung, “A statistical model based voice activity detection,” Signal Processing Letters, IEEE, vol. 6,
no. 1, pp. 1–3, 1999.
● W. H. Abdulla, Z. Guan, and H. C. Sou, “Noise robust speech activity detection,” in Signal Processing and Information
Technology (ISSPIT), 2009 IEEE International Symposium on. IEEE, 2009, pp. 473–477.
● D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, "The QUT-NOISE-TIMIT corpus for the evaluation of voice
activity detection algorithms," Proceedings of Interspeech 2010, 2010.
● D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J.
Silovsky, G. Stemmer, and K. Vesely. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech
Recognition and Understanding. IEEE Signal Processing Society, 2011.
● S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing convolutional neural networks for speech activity detection in
mismatched acoustic conditions,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International
Conference on. IEEE, 2014, pp. 2519–2523.
● Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe:
Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
● H. Ghaemmaghami, D. Dean, S. Kalantari, S. Sridharan, and C. Fookes, “Complete-linkage clustering for voice
activity detection in audio and visual speech,” 2015.
● NVIDIA Deep Learning GPU Training System (DIGITS) 4. Retrieved July 18, 2016, from
https://developer.nvidia.com/digits.
www.cpqd.com.br
TURNING
INTO REALITY
Diego Augusto
diegoa@cpqd.com.br
