This document presents a new semantic masking strategy for training transformer-based end-to-end speech recognition models, aimed at improving generalization and reducing overfitting when training data is limited. Inspired by SpecAugment and BERT-style masked prediction, the method masks the input acoustic features aligned to selected output tokens, so the model must infer those tokens from the surrounding context rather than from the masked acoustics. Experimental results show that this approach achieves state-of-the-art performance on the LibriSpeech and TedLium2 datasets.
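To make the idea concrete, the sketch below masks the feature frames aligned to a random subset of output tokens and fills them with the utterance mean. It is a minimal illustration, not the paper's exact implementation: the function name, the mask probability, and the (start_frame, end_frame) alignment format are assumptions, and the token-to-frame spans would in practice come from a forced alignment.

```python
import numpy as np

def semantic_mask(features, token_spans, mask_prob=0.15, rng=None):
    """Mask the feature frames aligned to randomly chosen output tokens.

    features:    (T, F) array of acoustic features (e.g. log-mel filterbanks).
    token_spans: list of (start_frame, end_frame) pairs, one per output token,
                 e.g. obtained from a forced alignment (format assumed here).
    mask_prob:   probability of masking each token's span (illustrative value).
    """
    rng = rng or np.random.default_rng()
    masked = features.copy()
    # Fill masked spans with the utterance mean so the overall
    # feature statistics stay roughly unchanged.
    fill_value = features.mean(axis=0)
    for start, end in token_spans:
        if rng.random() < mask_prob:
            masked[start:end] = fill_value
    return masked
```

Because whole token spans are removed, the model cannot recover the corresponding labels from nearby frames alone and is pushed toward using contextual (language-model-like) information, which is the intended contrast with purely random time masking as in SpecAugment.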