07
Aug
arXiv:2408.02945v1 Announce Type: new Abstract: Self-supervised learning, such as with the wav2vec 2.0 framework significantly improves the accuracy of end-to-end automatic speech recognition (ASR). Wav2vec 2.0 has been applied to single-channel end-to-end ASR models. In this work, we explored a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework. As the multi-channel end-to-end ASR model, we focused on a multi-channel neural transducer. In pre-training, we compared three different methods for feature quantization to train a multi-channel conformer audio encoder: joint quantization, feature-wise quantization and channel-wise quantization. In fine-tuning, we trained the multi-channel conformer-transducer. All…