Document details
IDENTIFICATION

oai:arXiv.org:2410.20971

Subject
Computer Science - Computer Vision... Computer Science - Artificial Inte... Computer Science - Machine Learnin...
Author
Zhao, Yunhan; Zheng, Xiang; Luo, Lin; Li, Yige; Ma, Xingjun; Jiang, Yu-Gang
Category

Computer Science

Year

2024

Citation date

19/2/2025

Keywords
science methods computer
Abstract

In this paper, we focus on black-box defense for VLMs against jailbreak attacks.

Existing black-box defense methods are either unimodal or bimodal.

Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment.

However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs.

To address these limitations, we propose a novel blue-team method, BlueSuffix, that defends target VLMs against jailbreak attacks in a black-box setting without compromising their performance.

BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator using reinforcement fine-tuning for enhancing cross-modal robustness.

We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin.

Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.

Code is available at https://github.com/Vinsonzyh/BlueSuffix.

Zhao, Yunhan; Zheng, Xiang; Luo, Lin; Li, Yige; Ma, Xingjun; Jiang, Yu-Gang. 2024. BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
