The world’s first multilingual ColBERT: Jina ColBERT V2 and its “Russian Doll” technology
In retrieval-augmented generation (RAG), the multi-vector model ColBERT improves retrieval accuracy by generating a separate vector for each token of a document. That design also sharply increases storage requirements, and the original model supports only English, which limits where it can be applied. To address these problems, we improved ColBERT's architecture and training process, with the largest gains in multilingual processing. The latest Jina-ColBERT-v2 supports 89 languages and offers a choice of output dimensions, which significantly reduces storage requirements and improves the efficiency and accuracy of multilingual retrieval.

Core highlights of the new version:

Performance enhancements: compared with the original ColBERT-v2, English retrieval performance improves by 6.5%; compared with the previous-generation jina-colbert-v1-en, it improves by 5.4%.

Multilingual support: the new version supports 89 languages, including Arabic, Chinese, English, Japanese, and Russian, and also supports programming languages.

Customizable output dimensions: the new version uses Matryoshka Representation Learning (MRL), the "Russian doll" technique, and offers 128-, 96-, and 64-dimensional output vectors, so users can choose the dimension that fits their needs.

The full technical report is available on arXiv: https://arxiv.org/abs/2408.16672
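To make the storage trade-off concrete, here is a minimal sketch of ColBERT-style late-interaction scoring (MaxSim) combined with shortened token vectors. It rests on several assumptions that the text above does not confirm: the token vectors are random stand-ins rather than real model output, a shorter vector is obtained by keeping the leading components and re-normalizing (how MRL is commonly described; the model's actual mechanism may differ), and each value is stored in 2 bytes. All function names are hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate(token_vectors, dim):
    """Keep the first `dim` components of each token vector, then re-normalize.

    Assumption: shorter Matryoshka-style vectors are prefixes of the full ones.
    """
    return [l2_normalize(v[:dim]) for v in token_vectors]


def maxsim_score(query_vectors, doc_vectors):
    """Late interaction: each query token picks its best-matching document
    token by dot product, and the per-token maxima are summed."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vectors)
        for qv in query_vectors
    )


def storage_bytes(num_tokens, dim, bytes_per_value=2):
    """Index size for one document: one vector per token.

    Assumption: 2 bytes per value (e.g. float16); real indexes may compress further.
    """
    return num_tokens * dim * bytes_per_value


# Random unit vectors standing in for model output (illustration only).
rng = random.Random(0)


def random_tokens(n, dim=128):
    return [l2_normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(n)]


query = random_tokens(4)   # 4 query tokens
doc = random_tokens(20)    # 20 document tokens

for dim in (128, 96, 64):
    score = maxsim_score(truncate(query, dim), truncate(doc, dim))
    print(f"dim={dim:3d}  score={score:.3f}  bytes per doc={storage_bytes(len(doc), dim)}")
```

Because a multi-vector index stores one vector per token, storage grows linearly with the vector dimension: under these assumptions, moving from 128 to 64 dimensions halves the per-document footprint. How much retrieval quality changes at each dimension depends on the trained model and is not something this sketch can show.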