This paper presents a novel lipreading approach implemented through a web application
that automatically generates subtitles for videos where the speaker's mouth movements are
visible. The proposed solution leverages a deep learning architecture combining 3D
convolutional neural networks (CNNs) with bidirectional Long Short-Term Memory (LSTM)
units to predict sentences accurately from visual input alone. A thorough review of
existing lipreading techniques over the past decade is provided to contextualize the
advancements introduced in this work. The primary goal is to improve the accuracy and
usability of lipreading technologies, with a focus on real-world applications. This study
contributes to the ongoing progress in the field, offering a robust, scalable solution for
enhancing automated visual speech recognition systems.
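The following is a minimal sketch of the kind of architecture the abstract describes, written here in PyTorch for illustration; the layer sizes, pooling scheme, vocabulary size, and input dimensions are assumptions and not taken from the paper.

```python
# Illustrative 3D-CNN + bidirectional LSTM lipreading model (not the authors' exact network).
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, vocab_size=40, hidden_size=256):
        super().__init__()
        # 3D convolutions extract spatio-temporal features from the mouth-region frames
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Collapse spatial dimensions, keep the time axis for the recurrent stage
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        # Bidirectional LSTM models temporal dependencies across video frames
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden_size,
                            num_layers=2, bidirectional=True, batch_first=True)
        # Per-frame distribution over output symbols (e.g. for a CTC-style decoder)
        self.classifier = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, x):
        # x: (batch, 1, frames, height, width) grayscale mouth crops
        feats = self.frontend(x)                                 # (B, 64, T, H', W')
        feats = self.pool(feats).squeeze(-1).squeeze(-1)         # (B, 64, T)
        feats = feats.permute(0, 2, 1)                           # (B, T, 64)
        out, _ = self.lstm(feats)                                # (B, T, 2*hidden)
        return self.classifier(out)                              # (B, T, vocab)

# Example usage with placeholder dimensions (75 frames of 50x100 crops):
model = LipReadingNet()
frames = torch.randn(2, 1, 75, 50, 100)
logits = model(frames)   # shape: (2, 75, 40)
```

In a sentence-level setup such as this, the per-frame logits would typically be trained with a sequence loss (for example CTC) so that no frame-level alignment between video and transcript is required.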