Plentiful Jailbreaks with String Compositions, by Brian R.Y. Huang
Abstract: Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
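The idea of composing invertible string transformations can be illustrated with a minimal sketch. This is not the paper's implementation: the `Transform` class, the `compose` helper, and the choice of three transformations (ROT13, string reversal, Base64) are assumptions made for illustration only. Each transformation pairs an encoder with its exact inverse, so a sequence of them can be decoded by applying the inverses in reverse order.

```python
import base64
import codecs
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Transform:
    """Hypothetical container: an encoder paired with its exact inverse."""
    name: str
    encode: Callable[[str], str]
    decode: Callable[[str], str]


# Three transformations that are exactly invertible on any Unicode string.
ROT13 = Transform(
    "rot13",
    lambda s: codecs.encode(s, "rot_13"),
    lambda s: codecs.decode(s, "rot_13"),
)
REVERSE = Transform("reverse", lambda s: s[::-1], lambda s: s[::-1])
BASE64 = Transform(
    "base64",
    lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
    lambda s: base64.b64decode(s.encode("ascii")).decode("utf-8"),
)


def compose(transforms: Sequence[Transform]) -> Transform:
    """Build one Transform from a sequence of Transforms.

    Encoding applies the steps left to right; decoding applies each
    inverse right to left, so decode(encode(x)) == x.
    """
    steps = list(transforms)

    def encode(text: str) -> str:
        for t in steps:
            text = t.encode(text)
        return text

    def decode(text: str) -> str:
        for t in reversed(steps):
            text = t.decode(text)
        return text

    return Transform("+".join(t.name for t in steps), encode, decode)


if __name__ == "__main__":
    pipeline = compose([ROT13, REVERSE, BASE64])
    message = "hello world"
    encoded = pipeline.encode(message)
    # The round trip recovers the original string.
    assert pipeline.decode(encoded) == message
    print(pipeline.name, encoded)
```

Because every step is invertible, the number of distinct compositions grows quickly with the number of available transformations and the allowed sequence length, which is what makes the space combinatorially large. Transformations that lose information (for example, a leetspeak mapping applied to text that already contains digits) would need extra care to remain invertible and are left out of this sketch.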
Submission history
From: Brian Huang
[v1] Fri, 1 Nov 2024 23:53:00 UTC (6,044 KB)
[v2] Fri, 6 Dec 2024 08:39:13 UTC (6,044 KB)
[v3] Wed, 11 Dec 2024 03:23:44 UTC (6,044 KB)