View a PDF of the paper titled GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation, by Govind Ramesh and 2 other authors
Abstract:Research on jailbreaking has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. This method first iteratively refines adversarial prompts through self-explanation, which is crucial for ensuring that even well-aligned LLMs obey adversarial instructions. IRIS then rates and enhances the output given the refined prompt to increase its harmfulness. We find that IRIS achieves jailbreak success rates of 98% on GPT-4, 92% on GPT-4 Turbo, and 94% on Llama-3.1-70B in under 7 queries. It significantly outperforms prior approaches in automatic, black-box, and interpretable jailbreaking, while requiring substantially fewer queries, thereby establishing a new standard for interpretable jailbreaking methods.
Submission history
From: Govind Ramesh [view email]
[v1]
Tue, 21 May 2024 03:16:35 UTC (8,725 KB)
[v2]
Tue, 15 Oct 2024 22:50:58 UTC (8,727 KB)
Source link
lol