Document details
Identifier

oai:arXiv.org:2410.07599

Subject
Computer Science - Computer Vision...
Authors
Wang, Feng; Yang, Timing; Yu, Yaodong; Ren, Sucheng; Wei, Guoyizhe; Wang, Angtian; Shao, Wei; Zhou, Yuyin; Yuille, Alan; Xie, Cihang
Category

Computer Science

Year

2024

Date indexed

16/10/2024

Keywords
images
Abstract

In this work, we present a comprehensive analysis of causal image modeling and introduce the Adventurer series of models, in which we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations.

This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images.

In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers.
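The two designs named above can be illustrated with a toy sketch. This is not the authors' implementation: the function names are hypothetical, the causal layer is a simple prefix average standing in for a real uni-directional sequence model, the pooling token is assumed to be the mean of the patch tokens, and the flip is assumed to reverse only the patch tokens while the pooling token stays in front.

```python
def add_global_pooling_token(patches):
    """Prepend the mean of all patch tokens as the first token (assumed pooling)."""
    dim = len(patches[0])
    pooled = [sum(p[d] for p in patches) / len(patches) for d in range(dim)]
    return [pooled] + [list(p) for p in patches]


def toy_causal_layer(tokens):
    """Stand-in for a uni-directional mixer: output i depends only on tokens 0..i.

    A running prefix average is computed in one pass, so the cost grows
    linearly with the sequence length.
    """
    running = [0.0] * len(tokens[0])
    out = []
    for i, tok in enumerate(tokens, start=1):
        running = [r + t for r, t in zip(running, tok)]
        out.append([r / i for r in running])
    return out


def flip_patches(tokens):
    """Reverse the patch order; keep the pooling token first (assumption)."""
    return tokens[:1] + tokens[1:][::-1]


def forward(patches, num_layers=4):
    """Apply causal layers with a flip between consecutive layers."""
    tokens = add_global_pooling_token(patches)
    for layer in range(num_layers):
        tokens = toy_causal_layer(tokens)
        if layer < num_layers - 1:
            tokens = flip_patches(tokens)
    return tokens


# Six toy patch tokens of dimension 2.
patches = [[float(2 * i), float(2 * i + 1)] for i in range(6)]
out = forward(patches)
```

In this sketch the flip lets patches that sit late in one layer's scan order appear early in the next, so every patch can eventually influence every other despite the one-directional scan, while the leading pooling token gives each position a summary of the whole image from the first step.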

Extensive empirical studies demonstrate the significant efficiency and effectiveness of this causal image modeling paradigm.

For example, our base-sized Adventurer model attains a competitive test accuracy of 84.0% on the standard ImageNet-1k benchmark with a training throughput of 216 images/s, making it 5.3 times more efficient than vision transformers at reaching the same result.

Wang, Feng; Yang, Timing; Yu, Yaodong; Ren, Sucheng; Wei, Guoyizhe; Wang, Angtian; Shao, Wei; Zhou, Yuyin; Yuille, Alan; Xie, Cihang. 2024. Causal Image Modeling for Efficient Visual Understanding.
