Speech recognition converts the analog audio signal from a microphone into digital data, which is then analyzed with mathematical and statistical techniques. The digitized audio passes through several stages: it is split into short frames, each frame is decomposed into discrete frequency bands with a fast Fourier transform (FFT), and the resulting spectra are compared against a database of phonemes to identify the sounds. Hidden Markov models, triphones, and pruning help the recognizer cope with variation in how sounds are pronounced and decide where one phoneme ends and the next begins, so that the speech can be translated into recognized text.
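
To make the role of hidden Markov models and pruning more concrete, here is a minimal sketch of Viterbi decoding with beam pruning over a toy two-phoneme model. Everything in it is an assumption made for illustration: the function name `viterbi_beam`, the two states (standing in for the phonemes /s/ and /iy/), and all of the probabilities are invented, and a real recognizer would derive the per-frame likelihoods from acoustic features rather than from a hand-written table.

```python
import math

def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_beam(obs_loglik, log_trans, log_init, beam=10.0):
    """Return the most likely state sequence for a run of audio frames.

    obs_loglik[t][s]: log-likelihood of frame t under state s
    log_trans[p][s]:  log-probability of moving from state p to state s
    log_init[s]:      log-probability of starting in state s
    beam:             predecessor states scoring more than `beam` below
                      the current best are pruned (not expanded)
    """
    n_states = len(log_init)
    neg_inf = float("-inf")
    scores = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(obs_loglik)):
        best = max(scores)
        # Pruning: only keep predecessors within `beam` of the best score.
        active = [p for p in range(n_states) if scores[p] >= best - beam]
        new_scores = [neg_inf] * n_states
        ptr = [-1] * n_states
        for s in range(n_states):
            for p in active:
                cand = scores[p] + log_trans[p][s]
                if cand > new_scores[s]:
                    new_scores[s] = cand
                    ptr[s] = p
            new_scores[s] += obs_loglik[t][s]
        scores = new_scores
        backptrs.append(ptr)
    # Trace back from the best final state to recover the full path.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical model: state 0 = /s/, state 1 = /iy/ (as in "see").
# Left-to-right topology: /s/ may repeat or advance, /iy/ only repeats.
init = [safe_log(1.0), safe_log(0.0)]
trans = [[safe_log(0.6), safe_log(0.4)],
         [safe_log(0.0), safe_log(1.0)]]
# Made-up per-frame likelihoods P(frame | state) for five frames.
frame_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.1, 0.9]]
obs = [[safe_log(p) for p in row] for row in frame_probs]

path = viterbi_beam(obs, trans, init)
# The phoneme boundary is the first frame whose state differs from frame 0.
boundary = next(t for t, s in enumerate(path) if s != path[0])
print(path)      # [0, 0, 1, 1, 1]
print(boundary)  # 2
```

In this sketch the decoded path stays in the first state for two frames and then switches, so frame 2 is where the first phoneme ends and the second begins. A smaller `beam` discards more low-scoring hypotheses at each frame, which saves work at the risk of pruning the path that would have turned out best.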