jailbreak

28 Aug

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

stp2y | 0 Comments | AI | ai, AI safety, jailbreak, LLM

When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure languages. Excited by this result, we attempted to reproduce it and found something unexpected.