This document describes the implementation of several neural network architectures for speech recognition using a dataset from VoxForge. It covers preprocessing audio data into acoustic features and implementing recurrent neural networks (RNNs), convolutional neural networks (CNNs), and combinations of the two as acoustic models. Five models are implemented and evaluated: an RNN with a time-distributed dense layer, a CNN plus an RNN, a deeper RNN, a bidirectional RNN, and a custom architecture that combines a CNN with deep RNN layers. The best-performing model is then selected to generate predictions on the test data.
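One practical detail when a CNN is combined with an RNN is that the convolution can shorten the time axis, so the recurrent layers see fewer steps than there are input feature frames. The sketch below illustrates that arithmetic with the standard output-length formulas for a 1D convolution. The function name, the frame count, the kernel size, and the stride are all hypothetical values chosen for illustration; the summary does not state how the convolutional layer is configured.

```python
import math


def conv1d_output_length(input_length, kernel_size, stride=1, padding="valid"):
    """Return the number of time steps left after a 1D convolution.

    Hypothetical helper for illustration only; it is not taken from the
    document being summarized.
    """
    if padding == "same":
        # "same" padding keeps the length, up to the stride.
        return math.ceil(input_length / stride)
    if padding == "valid":
        # "valid" padding drops positions where the kernel would overhang.
        if input_length < kernel_size:
            return 0
        return (input_length - kernel_size) // stride + 1
    raise ValueError(f"unsupported padding: {padding!r}")


# Assumed example: 300 feature frames, kernel of 11 frames, stride of 2.
frames = 300
for mode in ("valid", "same"):
    steps = conv1d_output_length(frames, kernel_size=11, stride=2, padding=mode)
    print(f"{mode}: {frames} input frames, {steps} steps reach the RNN")
```

Under these assumed settings the recurrent layers would receive roughly half as many time steps as there are input frames, which matters for any loss or decoder that depends on the sequence length.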