Block Sparse Matrices for Smaller and Faster Language Models
Saving space and time, one zero at a time
In previous blog posts, we introduced sparse matrices and what they can do to improve neural networks.

The basic assumption is that full dense layers are often overkill and can be pruned without a significant loss in precision. In some cases, sparse linear layers can even improve precision and/or generalization.

The main issue is that the currently available code that supports sparse algebra computation is severely lacking in efficiency. We are also still waiting for official PyTorch support.

That's why we ran out of patience and took some time this summer to address this "lacuna". Today, we are excited to release the extension pytorch_block_sparse.
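To give a flavor of what this looks like in practice, here is a minimal sketch, assuming the library's `BlockSparseLinear` layer acts as a drop-in replacement for `torch.nn.Linear` with a `density` parameter controlling the fraction of non-zero weight blocks, and that a CUDA device is available (the custom kernels run on GPU). The exact layer sizes and density value are illustrative, not prescriptive.

```python
import torch
from pytorch_block_sparse import BlockSparseLinear

# A regular dense 1024 -> 256 linear layer for comparison.
dense = torch.nn.Linear(1024, 256)

# Its block-sparse counterpart, keeping only ~10% of the weight blocks,
# so it stores roughly 10x fewer weight parameters.
sparse = BlockSparseLinear(1024, 256, density=0.1).cuda()

# Forward pass: same input/output shapes as the dense layer.
x = torch.randn(8, 1024, device="cuda")
y = sparse(x)  # shape: (8, 256)
```

In a model definition, the idea is simply to swap selected `torch.nn.Linear` layers for `BlockSparseLinear` ones and train as usual.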

By itself, or even better combined with other methods like distillation and quantization, this library enables networks that are both smaller and faster, something Hugging Face considers crucial to let anybody use neural networks in production at low cost and to improve the experience for the end user.