The vision language model in this video has 0.5B parameters and can take in images, video and 3D! 🤯
LLaVA-NeXT-Interleave is a new vision language model trained on interleaved image, video and 3D data.
keep reading ⥥⥥