Ensuring that LLMs behave safely and ethically is a top priority for the major AI labs. These models are trained to refuse harmful, illegal, or otherwise undesirable requests. However, recent research by Maksym Andriushchenko and Nicolas Flammarion at EPFL highlights a surprising vulnerability: simply rephrasing a harmful request in the past tense is often enough to bypass these refusal mechanisms.
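To make the idea concrete, here is a minimal sketch of what such a past-tense reformulation pipeline could look like: an auxiliary model rewrites a request as a question about the past, and the rewritten version is then sent to the target model. This assumes the OpenAI Python SDK; the model names, the reformulation prompt, and the helper functions are illustrative stand-ins, not the paper's exact setup.

```python
# Minimal sketch of past-tense reformulation (illustrative, not the paper's exact method).
# Assumes the OpenAI Python SDK (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()


def reformulate_past_tense(request: str) -> str:
    """Ask an auxiliary model to rewrite a request as a question about the past."""
    prompt = (
        "Rewrite the following request as a question about how it was done in the past, "
        f"keeping the meaning intact:\n\n{request}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative choice of reformulation model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def query_target(request: str) -> str:
    """Send a request to the target model whose refusal behavior is being tested."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of target model
        messages=[{"role": "user", "content": request}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    original = "<a request the target model would normally refuse>"
    past_tense = reformulate_past_tense(original)
    print("Past-tense version:", past_tense)
    print("Target response:", query_target(past_tense))
```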
This paper is one of the top trending papers on AImodels.fyi this week. Remember, thousands of AI papers, models, and tools are released every day; only a few will be revolutionary. We scan repos, journals, and social media to bring them to you in bite-sized recaps.
If you want to find big papers like this among the thousands published each day, sign up now. With that, let's take a look at the paper!