Boosting Wav2Vec2 with n-grams in 🤗 Transformers
Wav2Vec2 is a popular pre-trained model for speech recognition. Released in September 2020 by Meta AI Research, the novel architecture catalyzed progress in self-supervised pretraining for speech recognition, e.g. G. Ng et al., 2021, Chen et al., 2021, Hsu et al., 2021 and Babu et al., 2021. On the Hugging Face Hub, Wav2Vec2's most popular pre-trained checkpoint currently accounts for over 250,000 monthly downloads.

Using Connectionist Temporal Classification (CTC), pre-trained Wav2Vec2-like checkpoints are extremely easy to fine-tune on downstream speech recognition tasks. In a nutshell, fine-tuning pre-trained Wav2Vec2 checkpoints works as follows:

A single randomly initialized linear layer is stacked on top of the pre-trained checkpoint and trained to classify raw audio input into a sequence of letters (see the sketch after this list). It does so by:

extracting audio representations from the raw audio (using CNN layers),
processing the sequence of audio representations with a stack of transformer layers, and,
classifying the processed audio representations into a sequence of output letters.
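As a minimal sketch of this pipeline, inference with a fine-tuned checkpoint could look as follows. The checkpoint name is just one example of a CTC fine-tuned Wav2Vec2 model, and the one-second silent waveform is a placeholder for real 16 kHz audio:

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Example checkpoint; any Wav2Vec2 model fine-tuned with CTC works the same way
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Placeholder input: one second of silence at 16 kHz (replace with real audio)
waveform = np.zeros(16_000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    # CNN feature extractor -> transformer layers -> linear CTC head
    logits = model(inputs.input_values).logits  # shape: (batch, frames, vocab)

# Greedy decoding: pick the most likely letter per audio frame,
# then let the tokenizer collapse the CTC output into text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```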
Previously, audio classification models required an additional language model (LM) and a dictionary to transform the sequence of classified audio frames into a coherent transcription. Wav2Vec2's architecture is based on transformer layers, giving each processed audio representation context from all other audio representations. In addition, Wav2Vec2 leverages the CTC algorithm for fine-tuning, which solves the alignment problem arising from the varying ratio of input audio length to output text length.
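To make the alignment idea concrete, here is a minimal, hypothetical sketch of the greedy CTC collapse rule: repeated per-frame predictions are merged and the blank token is dropped, so many audio frames can map to far fewer output letters. The blank symbol and frame labels below are illustrative, not the actual Wav2Vec2 vocabulary:

```python
from itertools import groupby

BLANK = "_"  # illustrative CTC blank token

def ctc_collapse(frame_labels: str) -> str:
    """Collapse repeated frame labels, then drop blanks."""
    # 1. merge consecutive duplicates: "HH_EE_LL_LOO" -> "H_E_L_LO"
    deduped = [label for label, _ in groupby(frame_labels)]
    # 2. remove blanks, which keep repeated letters (e.g. "LL") distinct
    return "".join(label for label in deduped if label != BLANK)

print(ctc_collapse("HH_EE_LL_LOO"))  # -> HELLO
```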