HuggingFaceM4/Idefics3-8B-Llama3
Transformers version: until the next Transformers PyPI release, please install Transformers from source and use this PR to be able to use Idefics3. TODO: change when the new version is released.

Idefics3
Idefics3 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded in multiple images, or simply behave as a pure language model without visual inputs. It improves upon Idefics1 and Idefics2, significantly enhancing capabilities around OCR, document understanding, and visual reasoning.
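As a sketch of the interleaved image-and-text interface described above, the snippet below builds a chat-style prompt with image placeholders and runs generation through Transformers. It assumes a source install of Transformers with Idefics3 support; the helper names (`build_chat`, `describe_image`) are illustrative, not part of the library.

```python
def build_chat(question: str, n_images: int = 1) -> list:
    """Build a chat message with image placeholders followed by a text question."""
    content = [{"type": "image"} for _ in range(n_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


def describe_image(image, question: str,
                   model_id: str = "HuggingFaceM4/Idefics3-8B-Llama3") -> str:
    """Run one round of visual question answering (downloads the checkpoint)."""
    # Deferred imports: requires torch and a Transformers build that ships Idefics3.
    import torch
    from transformers import AutoProcessor, AutoModelForVision2Seq

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Turn the structured chat into the model's prompt format, then tokenize
    # the text together with the pixel inputs.
    prompt = processor.apply_chat_template(build_chat(question), add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

    generated = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

The same `build_chat` structure extends to multi-image conversations: pass `n_images` matching the number of images handed to the processor, and the placeholders are filled in order.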

We release the checkpoints under the Apache 2.0 license.

Model Summary
Developed by: Hugging Face
Model type: Multi-modal model (image+text)
Language(s) (NLP): en
License: Apache 2.0
Parent Models: google/siglip-so400m-patch14-384 and meta-llama/Meta-Llama-3.1-8B-Instruct
Resources for more information:
Idefics1 paper: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Idefics2 paper: What matters when building vision-language models?
Idefics3 paper: Coming soon (TODO)