Document details
ID

oai:arXiv.org:2409.04828

Topic
Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Multimedia
Author
Liu, Yuan; Zhao, Zhongyin; Zhuang, Ziyuan; Tian, Le; Zhou, Xiao; Zhou, Jie
Category

Computer Science

Year

2024

Listing date

13.11.2024

Keywords
science strategies data computer

Abstract

In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns.

To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest-perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance.
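The abstract does not give implementation details, but perplexity-based filtering of this kind can be illustrated with a minimal sketch. The snippet below is not the authors' code: it assumes a HuggingFace-style causal language model as the scorer (the model name "gpt2" and the `keep_ratio` parameter are illustrative placeholders), scores each text sample by its perplexity, and keeps the lowest-perplexity fraction.

```python
# Minimal sketch of perplexity-based data filtering (illustrative, not the authors' implementation).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text, device):
    """Perplexity of one text sample under a causal LM."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

def filter_lowest_perplexity(samples, model_name="gpt2", keep_ratio=0.2):
    """Score all samples and keep the keep_ratio fraction with the lowest perplexity."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    scored = sorted((perplexity(model, tokenizer, s, device), s) for s in samples)
    n_keep = int(len(scored) * keep_ratio)
    return [text for _, text in scored[:n_keep]]
```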

3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models.
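"Model soup" here refers to averaging the weights of checkpoints fine-tuned on different instruction datasets rather than training on their union. The sketch below is a generic uniform soup, not the authors' implementation; it assumes all checkpoints share the same architecture (identical state-dict keys and shapes), and the checkpoint paths in the usage comment are hypothetical.

```python
# Minimal sketch of a uniform model soup (illustrative, not the authors' implementation).
import torch

def uniform_soup(state_dicts):
    """Element-wise average of a list of state dicts with identical keys and shapes."""
    soup = {}
    for key in state_dicts[0]:
        # Stack the corresponding tensors from each checkpoint and take their mean.
        soup[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return soup

# Hypothetical usage: average checkpoints fine-tuned on different
# visual-instruction datasets, then load the averaged weights.
# checkpoints = [torch.load(p, map_location="cpu")
#                for p in ["ckpt_dataset_a.pt", "ckpt_dataset_b.pt", "ckpt_dataset_c.pt"]]
# model.load_state_dict(uniform_soup(checkpoints))
```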

Our strategies are efficient and lightweight, making them easily adoptable by the community.

Comment: v2

Liu, Yuan; Zhao, Zhongyin; Zhuang, Ziyuan; Tian, Le; Zhou, Xiao; Zhou, Jie (2024). POINTS: Improving Your Vision-language Model with Affordable Strategies.
