Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models
Transformer-based encoder-decoder models were proposed in Vaswani et al. (2017) and have recently experienced a surge of interest, e.g. Lewis et al. (2019), Raffel et al. (2019), Zhang et al. (2020), Zaheer et al. (2020), Yan et al. (2020).

Similar to BERT and GPT2, massive pre-trained encoder-decoder models have been shown to significantly boost performance on a variety of sequence-to-sequence tasks Lewis et al. (2019), Raffel et al. (2019). However, due to the enormous computational cost attached to pre-training encoder-decoder models, the development of such models is mainly limited to large companies and research institutes.

In Leveraging Pre-trained Checkpoints for Sequence Generation Tasks (2020), Sascha Rothe, Shashi Narayan and Aliaksei Severyn initialize encoder-decoder models with pre-trained encoder-only and/or decoder-only checkpoints (e.g. BERT, GPT2) to skip the costly pre-training. The authors show that such warm-started encoder-decoder models yield results competitive with large pre-trained encoder-decoder models, such as T5 and Pegasus, on multiple sequence-to-sequence tasks at a fraction of the training cost.
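To make this idea more concrete before diving into the theory, the following is a minimal sketch of such a warm-start using 🤗Transformers' EncoderDecoderModel. The choice of the bert-base-uncased and gpt2 checkpoints and the output directory name are purely illustrative; the practice section below walks through a complete example.

```python
from transformers import EncoderDecoderModel

# Warm-start an encoder-decoder model from two publicly available checkpoints:
# an encoder-only checkpoint ("bert-base-uncased") and a decoder-only
# checkpoint ("gpt2"). The self-attention weights of both models are loaded
# from the checkpoints, while the encoder-decoder cross-attention weights do
# not exist in the checkpoints and are therefore randomly initialized - they
# have to be learned during fine-tuning on a downstream task.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")

# The warm-started model can now be fine-tuned like any other
# sequence-to-sequence model and saved for later use.
model.save_pretrained("bert2gpt2-warm-started")
```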

In this notebook, we will explain in detail how encoder-decoder models can be warm-started, give practical tips based on Rothe et al. (2020), and finally go over a complete code example showing how to warm-start encoder-decoder models with 🤗Transformers.

This notebook is divided into 4 parts:

1. Introduction - Short summary of pre-trained language models in NLP and the need for warm-starting encoder-decoder models.
2. Warm-starting encoder-decoder models (Theory) - Illustrative explanation of how encoder-decoder models are warm-started.
3. Warm-starting encoder-decoder models (Analysis) - Summary of Leveraging Pre-trained Checkpoints for Sequence Generation Tasks (2020) - Which model combinations are effective for warm-starting encoder-decoder models, and how does this differ from task to task?
4. Warm-starting encoder-decoder models with 🤗Transformers (Practice) - Complete code example showing in detail how to use the EncoderDecoderModel framework to warm-start transformer-based encoder-decoder models.
It is highly recommended (probably even necessary) to have read this blog post about transformer-based encoder-decoder models.

Let's start by giving some background on warm-starting encoder-decoder models.