Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

soroushjp; Quentin Feuillade--Montixi; Rusheb

This is a linkpost for https://arxiv.org/abs/2311.03348

Paper coauthors: Rusheb Shah, Quentin Feuillade--Montixi, Soroush J. Pour, Arush Tagade, Stephen Casper, Javier Rando.

Motivation

Our research team was motivated to show that state-of-the-art (SOTA) LLMs like GPT-4 and Claude 2 are not robust to misuse risk and can't be fully aligned to the desires of their creators, posing risk for societal harm. This is despite significant effort by their creators, showing that the current paradigm of pre-training, SFT, and RLHF is not adequate for model robustness.

We also wanted to explore & share findings around "persona modulation"^[1], a technique where the character-impersonation strengths of LLMs are used to steer them in powerful ways.

Summary

We introduce an automated, low cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama. We elicit a variety of harmful text, including instructions for making meth & bombs.

The key is *persona modulation*. We steer the model into adopting a specific personality that will comply with harmful instructions.

We introduce a way to automate jailbreaks by using one jailbroken model as an assistant for creating new jailbreaks for specific harmful behaviors. It takes our method less than $2 and 10 minutes to develop 15 jailbreak attacks.

Meanwhile, a human-in-the-loop can efficiently make these jailbreaks stronger with minor tweaks. We use this semi-automated approach to quickly get instructions from GPT-4 about how to synthesise meth 🧪💊.

Abstract

Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.

Full paper

You can find the full paper here on arXiv [TODO ADD LINK].

Safety and disclosure

We have notified the companies whose models we attacked
We did not release prompts or full attack details
We are happy to collaborate with researchers working on related safety work - please reach out via correspondence emails in the paper.

Acknowledgements

Thank you to Alexander Pan and Jason Hoelscher-Obermaier for feedback on early drafts of our paper.

^{^}
Credit goes to @Quentin FEUILLADE--MONTIXI for developing the model psychology and prompt engineering techniques that underlie persona modulation. Our research built upon these techniques to automate and scale them as a red-teaming method for jailbreaks.

Effective Altruism Forum
EA Forum