Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Document detail

ID

oai:arXiv.org:2404.14461

Topic

Computer Science - Computation and... Computer Science - Artificial Inte... Computer Science - Cryptography an... Computer Science - Machine Learnin...

Author

Rando, Javier Croce, Francesco Mitka, Kryštof Shabalin, Stepan Andriushchenko, Maksym Flammarion, Nicolas Tramèr, Florian

Year

2024

listing date

6/12/2024

Keywords

backdoors language competition report

Metrics

Abstract

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities.

However, previous work has shown that the alignment process is vulnerable to poisoning attacks.

Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely.

Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models.

This report summarizes the key findings and promising ideas for future research.

;Comment: Competition Report

Rando, Javier,Croce, Francesco,Mitka, Kryštof,Shabalin, Stepan,Andriushchenko, Maksym,Flammarion, Nicolas,Tramèr, Florian, 2024, Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Document

Open

Source

Articles recommended by ES/IODE AI

Computer Science

Equivariant Symmetries for Aided Inertial Navigation

biases imu solutions filter navigation

Neuroscience Bulletin

Hippocampus: Molecular, Cellular, and Circuit Features in Anxiety

hippocampal hippocampus anxiety

Medicine & Public Health

Exploration of the influence of GOLGA8B on prostate cancer progression and the resistance of castration-resistant prostate cancer to cabazitaxel

castration-resistant prostate canc... ... cabazitaxel cancer development expression genes influence crpc resistance golga8b pca prostate cancer cabazitaxel