Document details
ID

oai:arXiv.org:2403.12042

Subject
Computer Science - Computer Vision...
Authors
Zhu, Zixin; Feng, Xuelu; Chen, Dongdong; Yuan, Junsong; Qiao, Chunming; Hua, Gang
Category

Computer Science

Year

2024

Listing date

2024-07-10

Keywords
model diffusion noise video
Abstract

In this paper, we explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.

We hypothesize that the latent representation learned from a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding.

Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task.

We introduce a novel framework, termed "VD-IT", with purpose-built components on top of a fixed (frozen) pretrained T2V model.

Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching.

It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks.

In addition, instead of using standard Gaussian noise, we propose to predict video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and improves segmentation quality.

Through extensive experiments, we observe, somewhat surprisingly, that fixed generative T2V diffusion models exhibit better potential to maintain semantic alignment and temporal consistency than commonly used video backbones (e.g., Video Swin Transformer) pretrained with discriminative image/video pre-training tasks.

On standard benchmarks, VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.

The code is available at https://github.com/buxiangzhiren/VD-IT.

Comment: Appears at ECCV 2024.

Zhu, Zixin; Feng, Xuelu; Chen, Dongdong; Yuan, Junsong; Qiao, Chunming; Hua, Gang. 2024. Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation.
