Document detail
ID

oai:arXiv.org:2411.01565

Topic
Computer Science - Cryptography an...
Author
Zhao, Jiawei Chen, Kejiang Zhang, Weiming Yu, Nenghai
Category

Computer Science

Year

2024

listing date

3/5/2025

Keywords
prompts sij jailbreak models
Metrics

Abstract

In recent years, the rapid development of large language models (LLMs) has brought new vitality into various domains, generating substantial social and economic benefits.

However, jailbreaking, a form of attack that induces LLMs to produce harmful content through carefully crafted prompts, presents a significant challenge to the safe and trustworthy development of LLMs.

Previous jailbreak methods primarily exploited the internal properties or capabilities of LLMs, such as optimization-based jailbreak methods and methods that leveraged the model's context-learning abilities.

In this paper, we introduce a novel jailbreak method, SQL Injection Jailbreak (SIJ), which targets the external properties of LLMs, specifically, the way LLMs construct input prompts.

By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content.

For open-source models, SIJ achieves near 100\% attack success rates on five well-known LLMs on the AdvBench and HEx-PHI, while incurring lower time costs compared to previous methods.

For closed-source models, SIJ achieves an average attack success rate over 85\% across five models in the GPT and Doubao series.

Additionally, SIJ exposes a new vulnerability in LLMs that urgently requires mitigation.

To address this, we propose a simple defense method called Self-Reminder-Key to counter SIJ and demonstrate its effectiveness through experimental results.

Our code is available at https://github.com/weiyezhimeng/SQL-Injection-Jailbreak.

Zhao, Jiawei,Chen, Kejiang,Zhang, Weiming,Yu, Nenghai, 2024, SQL Injection Jailbreak: A Structural Disaster of Large Language Models

Document

Open

Share

Source

Articles recommended by ES/IODE AI