Jailbreaking AI Models: It’s easy. Hundreds of millions of dollars have been thrown at AI Safety & Alignment over the years. Despite that, jailbreaking LLMs in April 2024 is easy.
Oddly enough, as LLMs become more capable and sophisticated, jailbreaking attacks are becoming easier to perform, more effective, and more frequent. Gary Marcus – who is hypercritical of LLMs and current AI trends – just published this very opinionated post: An unending array of jailbreaking attacks could be the death of LLMs.
I often speak to colleagues and clients about the “LLM jailbreaking elephant in the room.” They all agree that it is a serious concern and a deterrent to deploying LLMs in enterprise production.
I suppose that core components like the attention mechanism, tokenisation, next-token prediction, and prompting are both the strengths and the weaknesses of LLMs. That makes developing anti-jailbreaking methods and defences for LLMs really hard – a moving target. Check out this new free seminar on Robustness in the Era of LLMs: Jailbreaking Attacks and Defenses.
I believe it is a good thing that the AI community, researchers, and startups share jailbreaking methods to advance the AI Safety & Alignment field. In that spirit, here are 4 new, very effective LLM jailbreaking methods published in the last 10 days:
Simple adaptive attacks. In this paper, researchers at EPFL show that even the most recent safety-aligned LLMs like Claude or GPT-4 can’t resist simple adaptive jailbreaking attacks. The researchers demonstrate how to leverage the logprobs for jailbreaking: they first design an adversarial prompt template, then apply random search on a suffix to maximise the target logprob. For models that don’t expose logprobs (e.g. Claude), a 100% success rate is achieved via either a transfer or a pre-filling attack. Paper and repo: Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks.
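To give a feel for the random-search step, here is a minimal sketch (not the authors’ implementation): `query_logprob` is a hypothetical helper that sends the adversarial template plus a candidate suffix to the model and returns the logprob of the chosen target token, and the mutation scheme is a placeholder.

```python
import random
import string

def random_search_suffix(query_logprob, suffix_len=25, n_iters=1000):
    """Hill-climb a suffix to maximise the logprob of a chosen target token.

    query_logprob(suffix) -> float is a hypothetical callback that queries
    the model with template + suffix and returns the target-token logprob.
    """
    suffix = list(random.choices(string.ascii_letters, k=suffix_len))
    best_score = query_logprob("".join(suffix))
    for _ in range(n_iters):
        candidate = suffix.copy()
        # mutate a few random positions of the suffix
        for pos in random.sample(range(suffix_len), k=3):
            candidate[pos] = random.choice(string.ascii_letters)
        score = query_logprob("".join(candidate))
        # keep the mutation only if the target logprob improved
        if score > best_score:
            suffix, best_score = candidate, score
    return "".join(suffix), best_score
```

The point is how little machinery is needed: no gradients, just repeated queries and a greedy accept/reject rule.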
Faux dialogues in the context window. Researchers at Anthropic – who are well known for their leadership in AI Alignment & Safety – published a long blogpost explaining how to use the context window to jailbreak any model. The basis of many-shot jailbreaking is to include a faux dialogue between a human and an AI assistant within a single prompt for the LLM. That faux dialogue portrays the AI assistant readily answering potentially harmful queries from a user. At the end of the dialogue, one adds the final target query to which one wants the answer. Blogpost: Many-shot jailbreaking.
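Structurally, the attack is just prompt concatenation. A rough sketch, with placeholder strings only (the blogpost uses very long, model-specific dialogues):

```python
def build_many_shot_prompt(faux_pairs, target_query):
    """Concatenate many faux human/assistant turns, then the real target query.

    faux_pairs is a list of (question, compliant_answer) placeholder strings;
    the attack relies on packing a large number of them into one context window.
    """
    turns = []
    for question, answer in faux_pairs:
        turns.append(f"Human: {question}")
        turns.append(f"Assistant: {answer}")
    # the final, unanswered query the attacker actually wants answered
    turns.append(f"Human: {target_query}")
    turns.append("Assistant:")
    return "\n\n".join(turns)
```

The effectiveness scales with the number of faux turns, which is why long context windows make this attack more potent.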
Expert debates with Tree of Thoughts. The lead researcher at Agora – an open-source multimodal AI collective – introduced a clever and simple jailbreak method that uses Tree of Thoughts prompts. Using ToT, you split your malicious agent into three agents, each with a different personality. Then you instruct these expert agents to debate until they converge on a solution to the problem. Check out the full ToT jailbreak prompt in a raw repo here.
Progressive chat steering. Researchers at Microsoft Research introduced Crescendo, a multi-turn attack that starts with harmless chat and progressively steers the conversation toward the intended, prohibited objective. In fewer than 5 interactions, Crescendo can jailbreak GPT-4 and similar models by progressively prompting them to generate related content until the model has produced enough material to essentially override its safety alignment. Paper and examples here: Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack.
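The conversational pattern is easy to picture as a short loop. A hedged sketch with placeholders (the `chat` helper and the follow-up list are hypothetical, not the paper’s code): each new request references the model’s own previous output, which is where the paper’s title comes from.

```python
def crescendo_dialogue(chat, opening_question, follow_ups):
    """Multi-turn steering: start benign, then escalate by building on the
    model's own previous answer ("Great, now write an article about that").

    chat(messages) -> str is a hypothetical chat-completion helper;
    follow_ups is a short list of progressively more specific requests.
    """
    messages = [{"role": "user", "content": opening_question}]
    reply = chat(messages)
    for request in follow_ups:  # typically fewer than 5 turns
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": request})
        reply = chat(messages)
    return messages, reply
```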
Have a nice week.
- A Visual Exploration on How Large Vision Models Use Concepts
- [free course] Stanford CS25 Transformers United v4 Spring 2024
- RAGFlow – Opensource, Local RAG Engine for Deep Doc Understanding
- How to Accelerate Mixtral 8x7B MoE with Speculative Decoding and Quantisation
- Facebook AI: Schedule-Free Learning – A New Way to Train Models
- Instruction Tuning: Lessons from Building the Stanford Alpaca Model
- VoiceCraft: SOTA Zero-Shot Text-to-Speech in the Wild (demo, paper, repo)
- AniPortrait: Auto HQ Portrait Animation from Audio (demo, paper, repo)
- VAR: SOTA Autoregressive Scalable Image Generation (demo, paper, repo)
Tips? Suggestions? Feedback? email Carlos
Curated by @ds_ldn in the middle of the night.