Document details
Identifier

oai:arXiv.org:2409.04828

Subject
Computer Science - Computer Vision... Computer Science - Artificial Inte... Computer Science - Multimedia
Authors
Liu, Yuan; Zhao, Zhongyin; Zhuang, Ziyuan; Tian, Le; Zhou, Xiao; Zhou, Jie
Category

Computer Science

Year

2024

Date indexed

13/11/2024

Keywords
science strategies data computer
Abstract

In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving.

However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies.

2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome.

3) Fine-tuning often focuses on adding datasets, leading to diminishing returns.

To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique.

2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest-perplexity data for training.

This approach allowed us to train on a curated 1M dataset, achieving competitive performance.

3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements.
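As an illustration of contribution 3, model soup can be read as averaging the weights of several models that share one architecture but were fine-tuned on different datasets. The abstract does not say which soup variant was used, so the sketch below assumes the simplest one, a uniform average. The `Weights` type and the function name `uniform_soup` are hypothetical, and plain Python lists stand in for real tensors.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def uniform_soup(checkpoints: Sequence[Weights]) -> Weights:
    """Average several checkpoints parameter by parameter (uniform soup).

    Assumes every checkpoint comes from the same architecture, so all of
    them expose the same parameter names with the same lengths.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")

    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints do not share the same parameters")

    soup: Weights = {}
    for name in names:
        lengths = {len(ckpt[name]) for ckpt in checkpoints}
        if len(lengths) != 1:
            raise ValueError(f"parameter {name!r} has inconsistent sizes")
        # zip(*...) groups the i-th value of this parameter across checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(col) / len(checkpoints) for col in columns]
    return soup


# Two toy checkpoints, imagined as fine-tuned on different datasets.
model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = uniform_soup([model_a, model_b])
```

Merging happens once, after training, so the resulting model costs no more at inference time than any single fine-tuned model.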

These innovations resulted in a 9B-parameter model that performs competitively with state-of-the-art models.

Our strategies are efficient and lightweight, making them easily adoptable by the community.
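The perplexity filter from contribution 2 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes per-token natural-log probabilities for each sample have already been computed by some reference language model, and the names `perplexity`, `select_lowest_perplexity`, and `keep_fraction` are hypothetical. The abstract does not state what share of the data was kept, so the fraction is left as a parameter.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean per-token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_lowest_perplexity(
    samples: Dict[str, Sequence[float]], keep_fraction: float
) -> List[str]:
    """Return the ids of the lowest-perplexity samples.

    `samples` maps a sample id to its per-token log-probabilities;
    `keep_fraction` (hypothetical parameter) is the share of samples to keep.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Sort ids from lowest to highest perplexity.
    ranked = sorted(samples, key=lambda sid: perplexity(samples[sid]))
    keep = max(1, math.floor(len(ranked) * keep_fraction))
    return ranked[:keep]


# Toy scores for four samples (made-up numbers, for illustration only).
scores = {
    "a": [-0.1, -0.2],
    "b": [-2.0, -3.0],
    "c": [-0.5, -0.5],
    "d": [-1.0, -1.0],
}
kept = select_lowest_perplexity(scores, keep_fraction=0.5)
```

Averaging over tokens before exponentiating normalizes for sample length, which keeps long and short samples comparable when ranking.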

Comment: v2

Liu, Yuan; Zhao, Zhongyin; Zhuang, Ziyuan; Tian, Le; Zhou, Xiao; Zhou, Jie. 2024. POINTS: Improving Your Vision-language Model with Affordable Strategies
