Document details
Identifier

oai:arXiv.org:2411.01565

Subject
Computer Science - Cryptography an...
Authors
Zhao, Jiawei; Chen, Kejiang; Zhang, Weiming; Yu, Nenghai
Category

Computer Science

Year

2024

Date indexed

5 March 2025

Keywords
prompts, SIJ, jailbreak, models
Abstract

In recent years, the rapid development of large language models (LLMs) has brought new vitality to various domains, generating substantial social and economic benefits.

However, jailbreaking, a form of attack that induces LLMs to produce harmful content through carefully crafted prompts, presents a significant challenge to the safe and trustworthy development of LLMs.

Previous jailbreak methods, such as optimization-based methods and methods that leverage the model's in-context learning abilities, primarily exploited the internal properties or capabilities of LLMs.

In this paper, we introduce a novel jailbreak method, SQL Injection Jailbreak (SIJ), which targets the external properties of LLMs, specifically, the way LLMs construct input prompts.

By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content.

For open-source models, SIJ achieves attack success rates near 100% on five well-known LLMs on the AdvBench and HEx-PHI benchmarks, while incurring lower time costs than previous methods.

For closed-source models, SIJ achieves an average attack success rate of over 85% across five models in the GPT and Doubao series.

Additionally, SIJ exposes a new vulnerability in LLMs that urgently requires mitigation.

To address this, we propose a simple defense method called Self-Reminder-Key to counter SIJ and demonstrate its effectiveness through experimental results.

Our code is available at https://github.com/weiyezhimeng/SQL-Injection-Jailbreak.

Zhao, Jiawei; Chen, Kejiang; Zhang, Weiming; Yu, Nenghai (2024). SQL Injection Jailbreak: A Structural Disaster of Large Language Models.
