Vintern-1B ❄️ (Viet-InternVL2-1B) [🤗 HF Demo] - The LLaVA 🌋 Challenger
We are excited to introduce Vintern-1B, a Vietnamese 🇻🇳 multimodal model that combines the advanced Vietnamese language model Qwen2-0.5B-Instruct with the latest visual model, InternViT-300M-448px (CVPR 2024). This model excels at tasks such as OCR-VQA, Doc-VQA, and Chart-VQA. With only 1 billion parameters and a 4096-token context length, it was fine-tuned from the InternVL2-1B model on over 1.5 million specialized image-question-answer pairs for optical character recognition, text recognition, document extraction, and more. The model can be integrated into various on-device applications 📱, demonstrating its versatility and robust capabilities.
Vintern-1B is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. Vintern-1B consists of InternViT-300M-448px, an MLP projector, and Qwen2-0.5B-Instruct.
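As a rough sanity check on the "1 billion parameters" figure, the two named components already account for most of the budget; the MLP projector contributes comparatively few parameters. The numbers below are round figures read off the component names, not an official breakdown:

```python
# Approximate parameter budget for Vintern-1B (assumed round numbers
# taken from the component names, not from an official config).
vision_encoder = 300_000_000   # InternViT-300M-448px
language_model = 500_000_000   # Qwen2-0.5B-Instruct
mlp_projector = 10_000_000     # hypothetical small projector (assumption)

total = vision_encoder + language_model + mlp_projector
print(f"~{total / 1e9:.2f}B parameters")  # on the order of the ~1B claim
```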
Training details
The fine-tuning dataset was meticulously sampled in part from the following datasets:
Viet-OCR-VQA
Viet-Doc-VQA
Viet-Doc-VQA-II
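The image-question-answer pairs sampled from these datasets can be pictured as records like the following. Field names and the helper are illustrative assumptions; the actual schemas of the Viet-* datasets may differ:

```python
# A hypothetical training sample in the image-question-answer format
# described above. Field names are assumptions for illustration only.
sample = {
    "image": "invoice_0001.jpg",  # path to the document image
    "question": "What is the invoice's total amount?",
    "answer": "1,250,000 VND",
}

def to_conversation(s):
    """Flatten a sample into a chat-style instruction-tuning turn."""
    return [
        {"role": "user", "content": f"<image>\n{s['question']}"},
        {"role": "assistant", "content": s["answer"]},
    ]

turns = to_conversation(sample)
print(turns[0]["content"])
```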
Benchmarks
Since many different metrics still need to be tested, we first chose a quick and simple metric to guide the development of our model. Our metric is inspired by LaVy's paper. For the time being, we use GPT-4 to evaluate the quality of answers on two datasets: OpenViVQA and ViTextVQA. The inputs are images, questions, labels, and predicted answers; the model returns a score from 0 to 10 for the corresponding answer quality. The results table is shown below. Detailed results can be found at https://huggingface.co/5CD-AI/Viet-InternVL2-1B.
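The evaluation loop described above can be sketched as follows. `judge_score` is a hypothetical stand-in for the GPT-4 call, not a real API; only the 0-to-10 aggregation logic is shown:

```python
def mean_judge_score(examples, judge_score):
    """Average 0-10 quality scores over (image, question, label, prediction) records.

    `judge_score` stands in for the GPT-4-based judge: it receives one
    example and returns a score from 0 to 10.
    """
    scores = [judge_score(ex) for ex in examples]
    for s in scores:
        assert 0 <= s <= 10, "judge must return a score in [0, 10]"
    return sum(scores) / len(scores)

# Usage with a stubbed judge (a real run would send the image, question,
# label, and predicted answer to GPT-4 instead):
examples = [
    {"image": "a.jpg", "question": "q1", "label": "x", "prediction": "x"},
    {"image": "b.jpg", "question": "q2", "label": "y", "prediction": "z"},
]
stub_judge = lambda ex: 10 if ex["label"] == ex["prediction"] else 4
print(mean_judge_score(examples, stub_judge))  # 7.0
```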