Document detail
ID

oai:arXiv.org:2404.14461

Topic
Computer Science - Computation and... Computer Science - Artificial Inte... Computer Science - Cryptography an... Computer Science - Machine Learnin...
Author
Rando, Javier Croce, Francesco Mitka, Kryštof Shabalin, Stepan Andriushchenko, Maksym Flammarion, Nicolas Tramèr, Florian
Category

Computer Science

Year

2024

listing date

6/12/2024

Keywords
backdoors language competition report
Metrics

Abstract

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities.

However, previous work has shown that the alignment process is vulnerable to poisoning attacks.

Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely.

Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models.

This report summarizes the key findings and promising ideas for future research.

;Comment: Competition Report

Rando, Javier,Croce, Francesco,Mitka, Kryštof,Shabalin, Stepan,Andriushchenko, Maksym,Flammarion, Nicolas,Tramèr, Florian, 2024, Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Document

Open

Share

Source

Articles recommended by ES/IODE AI

Exploration of the influence of GOLGA8B on prostate cancer progression and the resistance of castration-resistant prostate cancer to cabazitaxel
castration-resistant prostate canc... ... cabazitaxel cancer development expression genes influence crpc resistance golga8b pca prostate cancer cabazitaxel