Document detail
ID

oai:arXiv.org:2408.14153

Topic
Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Computation and Language
Author
Möller, Lucas; Tilli, Pascal; Vu, Ngoc Thang; Padó, Sebastian
Category

Computer Science

Year

2024

Listing date

8/28/2024

Keywords
dual method science inputs computer
Abstract

Dual encoder architectures such as CLIP models map two types of inputs into a shared embedding space and learn similarities between them. However, it is not well understood how such models compare their two inputs. Here, we address this research gap with two contributions. First, we derive a method to attribute the predictions of any differentiable dual encoder onto feature-pair interactions between its inputs. Second, we apply our method to CLIP-type models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. However, this visual-linguistic grounding ability varies considerably between object classes, depends on the training data distribution, and improves substantially after in-domain training. Using our method, we can identify knowledge gaps about specific object classes in individual models and can monitor their improvement upon fine-tuning.
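The paper's first contribution, attributing a dual encoder's similarity score onto feature-pair interactions, can be sketched in a few lines of PyTorch. The toy encoders, input sizes, and the use of mixed second derivatives below are illustrative assumptions, not the authors' exact derivation; the sketch only shows what an interaction matrix between the features of two inputs looks like for a differentiable dual encoder.

```python
# Minimal sketch of feature-pair attribution for a dual encoder.
# NOTE: the tiny encoders and the mixed-second-derivative rule are
# illustrative assumptions, not the method derived in the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the two towers of a CLIP-style dual encoder,
# each mapping its input into a shared 4-dimensional embedding space.
enc_a = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 4))
enc_b = nn.Sequential(nn.Linear(6, 16), nn.Tanh(), nn.Linear(16, 4))

x = torch.randn(8, requires_grad=True)  # features of input A (e.g. image)
y = torch.randn(6, requires_grad=True)  # features of input B (e.g. caption)

def similarity(x, y):
    """Cosine similarity of the two embeddings in the shared space."""
    za, zb = enc_a(x), enc_b(y)
    return torch.dot(za, zb) / (za.norm() * zb.norm())

s = similarity(x, y)

# Mixed partials d^2 s / (dx_i dy_j) form an |x| x |y| interaction matrix:
# entry (i, j) measures how strongly feature i of input A and feature j
# of input B interact in producing the similarity score.
grad_x = torch.autograd.grad(s, x, create_graph=True)[0]
pair_attr = torch.stack([
    torch.autograd.grad(grad_x[i], y, retain_graph=True)[0]
    for i in range(x.numel())
])

print(f"similarity: {s.item():.4f}")
print("interaction matrix shape:", tuple(pair_attr.shape))  # (8, 6)
```

In the paper's CLIP setting, the rows and columns of such a matrix would correspond to image regions and caption tokens, which is what makes it possible to inspect visual-linguistic grounding per object class.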

Möller, Lucas; Tilli, Pascal; Vu, Ngoc Thang; Padó, Sebastian (2024). Explaining Vision-Language Similarities in Dual Encoders with Feature-Pair Attributions.
