Image-text-to-text models take in an image and a text prompt and output text. These models are also called vision-language models, or VLMs. Unlike image-to-text models, they accept an additional text input, so they are not restricted to specific use cases such as image captioning, and they may also be trained to accept a conversation as input.
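As a rough sketch of what this looks like in practice, the snippet below queries a vision language model through the transformers image-text-to-text pipeline. It assumes a recent transformers release that ships this pipeline; the checkpoint, image URL, and generation settings are illustrative choices, not fixed requirements.

```python
# Minimal sketch, assuming a recent transformers release that provides the
# "image-text-to-text" pipeline; the checkpoint and image URL are illustrative.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

# The input is a chat-style message whose content mixes an image and a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image URL
            {"type": "text", "text": "Describe what is happening in this image."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=64)
print(outputs[0]["generated_text"])
```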

About Image-Text-to-Text
https://youtu.be/IoGaGfU1CIg
Different Types of Vision Language Models
Vision language models come in three types:

Base: Pre-trained models that can be fine-tuned. A good example of base models is the PaliGemma model family by Google.
Instruction: Base models fine-tuned on instruction datasets. A good example of an instruction fine-tuned model is idefics2-8b (a usage sketch follows this list).
Chatty/Conversational: Base models fine-tuned on conversation datasets. A good example of a chatty model is deepseek-vl-7b-chat.
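The sketch below shows how an instruction-tuned model such as idefics2-8b can be used directly, without the pipeline, by combining its processor and chat template. The pattern follows the common processor-plus-generate workflow in transformers; the image URL and generation length are placeholder values.

```python
# Sketch of using an instruction-tuned VLM (idefics2-8b) directly with its
# processor and chat template; image URL and generation settings are illustrative.
import requests
from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

checkpoint = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

# Load an input image (placeholder URL).
image = Image.open(BytesIO(requests.get("https://example.com/photo.jpg").content))

# A single user turn mixing an image placeholder and a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    }
]

# Render the conversation with the chat template, then process text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```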
https://huggingface.co/tasks/image-text-to-text