Robust Speech Recognition via Large-Scale Weak Supervision

Original paper · Radford et al 2022

Speech recognition systems have never really impressed me: they have left a lot to be desired and never felt good enough to warrant incorporation into daily use. That has changed. Whisper is one of the first speech recognition systems that left me in awe. Speech recognition matters because speech is how we communicate, so it is a natural path toward integrating computers into our daily lives. Like many other recent advances, Whisper does away with the complex nuances of fine-tuning, task-specific architectures, and the like, and instead trains on the raw text of transcripts, relying on the expressiveness of sequence-to-sequence models to learn the mapping between speech and its transcribed form.

Data Curation Process

Data plays a pivotal role in the success of any machine learning model, and Whisper is no exception. The dataset used for training Whisper consists of audio snippets (a_i) paired with their corresponding transcriptions (t_i), collected from various sources on the internet. This diverse dataset includes audio from different environments, recorded with various setups, spoken by different individuals, and in multiple languages. While diversity in audio quality is beneficial, diversity in transcript quality can hinder the learning process.

One of the challenges the Whisper team faced was the presence of automated transcripts generated by speech recognition systems. These machine-generated transcriptions were problematic as research has shown that training on mixed human and machine-generated data can negatively impact the performance of translation systems. To tackle this issue, the team devised heuristics to curate the dataset, filtering out the automated transcripts and retaining only the human-transcribed data. Many of the machine-generated transcripts were identified through simple rule-based systems, as they often lacked proper whitespace and punctuation.
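To make the idea concrete, here is a minimal sketch of the kind of rule-based check described above. The function name and thresholds are my own illustration, not the authors' actual filtering code; the signals (missing punctuation, all-upper or all-lower case text) are the ones the paper mentions.

```python
import re

def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts showing signs of ASR output rather than human work.

    Heuristics in the spirit of the paper's description; thresholds here
    are illustrative, not the authors' actual values.
    """
    stripped = transcript.strip()
    if not stripped:
        return True
    # Long transcript with no punctuation at all is suspicious.
    if len(stripped) > 200 and not re.search(r"[.!?,]", stripped):
        return True
    # Entirely upper-case or entirely lower-case text is another tell.
    letters = [c for c in stripped if c.isalpha()]
    if letters and (all(c.isupper() for c in letters) or all(c.islower() for c in letters)):
        return True
    return False

print(looks_machine_generated("hello world this is a test"))    # True: all lower-case
print(looks_machine_generated("Hello, world. This is a test."))  # False
```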

Interestingly, the audio was broken into 30-second segments, including segments that contain no speech (sub-sampled so they make up only a small fraction of the data). These non-speech segments were used to train voice activity detection, folding yet another component of a practical speech recognition system into the same model.
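A quick sketch of the segmentation step, assuming 16 kHz mono audio (the sample rate Whisper operates on) and zero-padding the final window; the exact padding strategy is my assumption.

```python
import numpy as np

SAMPLE_RATE = 16_000          # Whisper operates on 16 kHz audio
SEGMENT_SECONDS = 30
SEGMENT_SAMPLES = SAMPLE_RATE * SEGMENT_SECONDS

def split_into_segments(waveform: np.ndarray) -> list[np.ndarray]:
    """Cut a mono waveform into fixed 30-second windows, zero-padding the last one."""
    segments = []
    for start in range(0, len(waveform), SEGMENT_SAMPLES):
        chunk = waveform[start:start + SEGMENT_SAMPLES]
        if len(chunk) < SEGMENT_SAMPLES:
            chunk = np.pad(chunk, (0, SEGMENT_SAMPLES - len(chunk)))
        segments.append(chunk)
    return segments

# Example: 75 seconds of silence becomes three 30-second segments.
segments = split_into_segments(np.zeros(75 * SAMPLE_RATE, dtype=np.float32))
print(len(segments), segments[0].shape)  # 3 (480000,)
```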

A particularly clever filtering method involved training an initial model and then manually inspecting data points that the model struggled with. This inspection revealed low-quality transcripts that the filtering heuristics had missed, including partially transcribed, poorly aligned, or machine-generated transcriptions. This highlights the importance of a meticulous data curation process.
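One way to surface such data points is to transcribe each source with the initial model, score it against the paired transcript, and review the worst offenders by hand. The sketch below assumes a simple word error rate (WER) metric and a hypothetical list of (source, reference, model output) triples; it illustrates the ranking idea, not the authors' pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# `predictions` would hold (source_id, reference_transcript, model_output) triples.
predictions = [
    ("podcast-001", "hello world how are you", "hello world how are you"),
    ("scrape-042", "the quick brown fox", "uh the the brown fox jumped"),
]
scored = sorted(((word_error_rate(ref, hyp), src) for src, ref, hyp in predictions), reverse=True)
print(scored)  # high-WER sources float to the top for manual review
```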

Whisper's approach underscores the significance of data quality and diversity in achieving model generalization, a key factor in its impressive performance.

Model and Training

Whisper employs a Transformer encoder-decoder architecture, a well-established design in natural language processing. The encoder processes a log-Mel spectrogram representation of the audio. While predicting the spoken words is a central aspect of Whisper, it is not the sole focus: a comprehensive speech recognition system involves components such as voice activity detection, speaker diarization, translation, and alignment. Traditionally, these components are handled separately, resulting in complex pipelines. Whisper instead integrates them into a single model and trains them end-to-end.
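As a rough illustration of the encoder's input, here is how one could compute an 80-channel log-Mel spectrogram with 25 ms windows and a 10 ms stride (the front-end parameters described in the paper), using librosa; this is a sketch under those assumptions, not the authors' exact feature code.

```python
import numpy as np
import librosa

def log_mel_spectrogram(audio: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """80-bin log-Mel spectrogram: 25 ms window (n_fft=400) and 10 ms hop (160) at 16 kHz."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=80
    )
    return librosa.power_to_db(mel)

audio = np.random.randn(30 * 16_000).astype(np.float32)  # one 30-second segment
features = log_mel_spectrogram(audio)
print(features.shape)  # roughly (80, 3001): mel bins x frames, fed to the encoder
```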

The decoder in Whisper is configured for multitask training using language tokens, task tokens, timestamp tokens, and transcription tokens. It can also condition on the transcript text of the preceding audio segment, which is supplied as a prefix in the decoder's context, while cross-attention connects the decoder to the encoded audio. This approach lets Whisper integrate multiple tasks in a single model and leverage their interdependencies during training.
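The token format can be sketched as follows. The special token names follow the multitask format described in the paper and the open-source release; the helper itself is my illustration, not the actual tokenizer.

```python
def build_decoder_prompt(language: str, task: str, timestamps: bool,
                         previous_text: str | None = None) -> list[str]:
    """Assemble the special-token prefix that tells the decoder what to do."""
    tokens = []
    if previous_text:
        # Transcript of the preceding segment is prepended as conditioning context.
        tokens += ["<|startofprev|>", previous_text]
    tokens += ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

print(build_decoder_prompt("en", "transcribe", timestamps=True,
                           previous_text="...previous segment text..."))
# ['<|startofprev|>', '...previous segment text...', '<|startoftranscript|>', '<|en|>', '<|transcribe|>']
```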

The training process involves a suite of models of varying sizes, facilitating the study of Whisper's scaling properties. Data parallelism across accelerators is employed, using FP16 with dynamic loss scaling and activation checkpointing. Models are trained with the AdamW optimizer and gradient norm clipping, with a linear learning rate decay to zero after an initial warm-up period. A batch size of 256 segments is used, and training runs for 2^20 updates, equivalent to two to three passes over the dataset. Remarkably, overfitting is not a major concern, and no data augmentation or regularization is applied; the inherent diversity of the vast dataset is relied upon to promote generalization and robustness.
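A skeleton of that optimization setup in PyTorch is shown below. The batch size and total update count come from the paper; the warm-up length, peak learning rate, clipping norm, and the stand-in model are placeholders of my own.

```python
import torch

TOTAL_UPDATES = 2 ** 20   # from the paper
WARMUP_UPDATES = 2048     # illustrative warm-up length
PEAK_LR = 1e-3            # illustrative peak learning rate

model = torch.nn.Linear(80, 512)  # stand-in for the actual encoder-decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    """Linear warm-up, then linear decay to zero over 2^20 updates."""
    if step < WARMUP_UPDATES:
        return step / WARMUP_UPDATES
    return max(0.0, (TOTAL_UPDATES - step) / (TOTAL_UPDATES - WARMUP_UPDATES))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(3):  # training loop skeleton
    loss = model(torch.randn(256, 80)).pow(2).mean()  # placeholder loss on a 256-segment batch
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```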

Analysis and Ablations

Whisper delivers exceptional performance across a range of evaluations, including language identification, transcription, translation, multilingual speech recognition, and long-form transcription. While it does not achieve perfect accuracy, it comes remarkably close to human-level performance.

A series of ablations examines the impact of various model and data choices on Whisper's performance:

  • Model Scaling: As with many large-scale models, Whisper demonstrates that performance improves with model size across multiple tasks, including multilingual speech recognition, speech translation, and language identification.

  • Dataset Scaling: The Whisper dataset, consisting of 680,000 hours of labeled audio, is one of the largest in supervised speech recognition. To assess the importance of the raw dataset, the authors trained medium-sized models on subsampled versions of the data at different percentages of the full size. Just like model scaling, dataset scaling plays a crucial role in performance improvement, with all evaluated tasks benefiting from larger datasets.

  • Multitask and Multilingual Transfer: A natural concern is that jointly training a single model on many tasks and languages could hurt performance relative to single-task, single-language models. To test this, jointly trained models were compared against English-only models on English speech recognition benchmarks. Small models did exhibit negative transfer between tasks and languages, underperforming their English-only counterparts. As compute and model size increased, however, the joint models benefited from their more diverse representations and ultimately outperformed the single-task models.

Final Thoughts

Whisper represents a remarkable achievement in data and model scaling. It stands as one of the foundational models in speech recognition, drawing lessons from the successes of language and vision modeling.

The use of multitask token formatting to specify tasks and languages is particularly intriguing. Whisper seamlessly combines the audio Mel spectrogram with special tokens that carry the crucial task information, a simple but effective piece of the design.

Since its initial release, Whisper has continued to evolve, with updates such as Whisper V2 and V3. These updates involve longer training and slight architectural modifications, reflecting an ongoing commitment to improving this powerful open-source ASR system. Whisper V3, in particular, has recently made its debut, promising even more impressive capabilities. It's an exciting time for speech recognition, and Whisper remains at the forefront of this advancement.