BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Model card for BLIP trained for image captioning on the COCO dataset, base architecture with a ViT-L (large) backbone.

[Figure: BLIP model overview (BLIP.gif), pulled from the official BLIP repository]
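As a minimal usage sketch, the checkpoint described above can be loaded through the 🤗 Transformers `BlipProcessor` and `BlipForConditionalGeneration` classes. The Hub identifier `Salesforce/blip-image-captioning-large` and the demo image URL below are assumptions based on the official BLIP release; substitute your own checkpoint or image as needed.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed Hub identifier for the COCO-pretrained captioning checkpoint (ViT-L backbone)
model_id = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Demo image from the official BLIP repository (assumed URL)
img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Conditional captioning: an optional text prompt steers the generated caption
inputs = processor(raw_image, "a photography of", return_tensors="pt")
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
```

Omitting the text prompt (passing only the image to the processor) yields unconditional captioning.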