What Makes Multimodal In-Context Learning Work?

Document detail

ID

oai:arXiv.org:2404.15736

Topic

Computer Science - Computer Vision... Computer Science - Artificial Inte...

Author

Baldassini, Folco Bertini; Shukor, Mustafa; Cord, Matthieu; Soulier, Laure; Piwowarski, Benjamin

Year

2024

Listing date

5/1/2024

Keywords

models computer m-icl

Abstract

Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples.

In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models.

We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks.

Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality.

(2) When used with an advanced ICL strategy (such as RICES), M-ICL performs no better than a simple strategy based on majority voting over context examples.

Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment.

Code available at https://gitlab.com/folbaeni/multimodal-icl

Comment: 20 pages, 16 figures. Accepted to CVPR 2024 Workshop on Prompting in Vision. Project page: https://folbaeni.gitlab.io/multimodal-icl

Baldassini, Folco Bertini; Shukor, Mustafa; Cord, Matthieu; Soulier, Laure; Piwowarski, Benjamin, 2024, What Makes Multimodal In-Context Learning Work?

Source

Articles recommended by ES/IODE AI

Computer Science