I have put together a notebook on Multimodal RAG where, instead of processing documents with hefty pipelines, we natively use:
- vidore/colpali for retrieval: it doesn't need indexing with image-text pairs, just images! (retrieval sketch below)
- Qwen/Qwen2-VL-2B-Instruct for generation: feed images directly to a vision language model, with no conversion to text! (generation sketch at the end)
I used the ColPali implementation from the new Byaldi library by @bclavie
https://github.com/answerdotai/byaldi
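To give a feel for the retrieval side, here is a minimal sketch of indexing and searching with Byaldi. The folder path, index name, and query are placeholder assumptions, not values from the notebook:

```python
# Minimal Byaldi retrieval sketch: index a folder of documents as images,
# then search it with a text query. Paths and names are placeholders.
from byaldi import RAGMultiModalModel

# Load ColPali through Byaldi
model = RAGMultiModalModel.from_pretrained("vidore/colpali")

# Index a folder of PDFs/page images; no image-text pairs needed,
# ColPali embeds the page images directly
model.index(
    input_path="docs/",            # placeholder: your document folder
    index_name="my_docs",          # placeholder index name
    store_collection_with_index=False,
    overwrite=True,
)

# Retrieve the most relevant pages for a query
results = model.search("What does the chart on revenue show?", k=3)
for r in results:
    print(r.doc_id, r.page_num, r.score)
```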
Link to notebook: https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
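And for the generation side, a minimal sketch of feeding a retrieved page image straight to Qwen2-VL via transformers. The image path and prompt are placeholders, and error handling is omitted:

```python
# Minimal Qwen2-VL generation sketch: pass a page image as-is to the
# vision language model, no OCR or text extraction in between.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

image = Image.open("retrieved_page.png")  # placeholder: a page returned by retrieval
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What does the chart on revenue show?"},
    ]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens
answer = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```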