Overview of natively supported quantization schemes in 🤗 Transformers
We aim to give a clear overview of the pros and cons of each quantization scheme supported in transformers to help you decide which one you should go for.

Currently, quantized models are used for two main purposes:

- Running inference of a large model on a smaller device
- Fine-tuning adapters on top of quantized models (a sketch is given at the end of this section)
So far, two integration efforts have been made and are natively supported in transformers: bitsandbytes and auto-gptq. Note that some additional quantization schemes are also supported in the 🤗 optimum library, but this is out of scope for this blog post.
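
To make the inference use case concrete, here is a minimal sketch of loading a model in 4-bit through the bitsandbytes integration. It assumes `bitsandbytes` and `accelerate` are installed alongside transformers, and the checkpoint name is purely illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config for the bitsandbytes backend
# (NF4 data type with float16 compute).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# "facebook/opt-350m" is an illustrative checkpoint choice;
# device_map="auto" dispatches the weights across available devices.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The auto-gptq path works analogously: once `auto-gptq` is installed, a pre-quantized GPTQ checkpoint can be loaded with the same `from_pretrained` call.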

To learn more about each of the supported schemes, please have a look at the resources shared below, as well as the relevant sections of the documentation.

Note also that the details shared below are only valid for PyTorch models; quantization is currently out of scope for TensorFlow and Flax/JAX models.
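
For the second use case, fine-tuning adapters on top of a quantized model, here is a minimal sketch using the 🤗 PEFT library. The LoRA hyperparameters are arbitrary placeholders, and `model` is assumed to be the 4-bit model loaded in the snippet above:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training (e.g. casts layer norms
# to full precision and enables gradient checkpointing support).
model = prepare_model_for_kbit_training(model)

# LoRA adapter configuration; the values here are placeholders and
# the right target modules depend on the model architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the frozen quantized model with trainable LoRA adapters.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The key point is that only the small adapter weights are trained, while the quantized base model stays frozen, which is what makes fine-tuning on top of 4-bit weights practical on modest hardware.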