Vintern-1B ❄️ (Viet-InternVL2-1B) [🤗 HF Demo] - The LLaVA 🌋 Challenger
We are excited to introduce Vintern-1B, a Vietnamese 🇻🇳 multimodal model that combines the Vietnamese language model Qwen2-0.5B-Instruct with the latest visual model, InternViT-300M-448px (CVPR 2024). It excels at tasks such as OCR-VQA, Doc-VQA, and Chart-VQA. With only 1 billion parameters and a 4096-token context length, it was fine-tuned from InternVL2-1B on over 1.5 million specialized image-question-answer pairs for optical character recognition 🔍, text recognition 🔤, document extraction 📑, and more. The model can be integrated into various on-device applications 📱, demonstrating its versatility and robust capabilities.

Vintern-1B is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. Vintern-1B consists of InternViT-300M-448px, an MLP projector, and Qwen2-0.5B-Instruct.
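Since InternViT-300M-448px encodes fixed 448x448 inputs, larger document images are typically split into tiles before encoding. The sketch below is a deliberately simplified approximation of that idea: the real InternVL dynamic preprocessing searches over aspect-ratio grids, and `tile_grid` and the `max_tiles` cap here are illustrative assumptions rather than the model's actual code.

```python
import math

TILE = 448  # input resolution of InternViT-300M-448px


def tile_grid(width: int, height: int, max_tiles: int = 12):
    """Approximate how a width x height image could be split into
    448x448 tiles using plain ceiling division, capped at max_tiles.
    (Illustrative only; InternVL's real preprocessing differs.)"""
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    # Shrink the larger dimension until the tile budget is respected.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows


# e.g. an A4 document scan at roughly 1240 x 1754 px
print(tile_grid(1240, 1754))  # (3, 4)
```

Under this simplified scheme, a typical A4 scan would be encoded as a 3x4 grid of tiles, which is why document-heavy tasks like Doc-VQA benefit from the tiled high-resolution input.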

Training details 📚
The fine-tuning dataset was meticulously sampled in part from the following datasets:

Viet-OCR-VQA
Viet-Doc-VQA
Viet-Doc-VQA-II
Benchmarks 📈
Since there are still many different metrics to test, we first chose a quick and simple metric to guide the development of our model. Our metric is inspired by Lavy's paper. For the time being, we use GPT-4 to evaluate answer quality on two datasets: OpenViVQA and ViTextVQA. The inputs are images, questions, labels, and predicted answers; the judge returns a score from 0 to 10 for each answer's quality. The results table is shown below.
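Aggregating the per-answer 0-10 judge scores into a single dataset-level number can be sketched as a simple mean. The helper name `judge_average` and the example scores are hypothetical, not part of the evaluation code:

```python
def judge_average(scores):
    """Average per-answer judge scores (each expected in 0..10)
    into one dataset-level quality number; out-of-range values
    are discarded as malformed judge output."""
    valid = [s for s in scores if 0 <= s <= 10]
    if not valid:
        raise ValueError("no valid scores to average")
    return sum(valid) / len(valid)


# hypothetical per-answer scores returned by the GPT-4 judge
print(judge_average([8, 9, 7, 10, 6]))  # 8.0
```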