Prompt injection and the Milgram experiment

In 1963, social psychologist Stanley Milgram conducted an infamous study at Yale University called “Obedience to Authority.” The goal was to understand how far people would go in following orders from someone they perceived as having authority over them. In his experiments, participants were instructed by an authoritative figure (the experimenter) to administer electric shocks to another participant (the learner), with each subsequent shock increasing in intensity. Despite pleas for mercy from the learner, the majority of participants continued to obey the instructions until they reached what they believed was a dangerous level of voltage.

This experiment has been widely cited as evidence of the power of obedience to authority figures, even when it conflicts with one’s own moral beliefs or sense of right and wrong. It raises important questions about individual agency and responsibility within systems of control and coercion.

Fast forward several decades to today, where we are witnessing similar phenomena in the development of language models (LMs) for natural language processing (NLP). These LMs are trained using large amounts of text data, which can include biased or misleading information. As a result, these algorithms may exhibit patterns of discrimination against certain groups based on factors such as race, gender, or socioeconomic status. This has led to the practice of adding specific keywords or phrases to the input data used to train LMs, with the aim of influencing their output to align with the programmer’s ethical imperatives.

As with any imposed constraint, a method called prompt injection has emerged to “jailbreak” these restrictions. This is where the user slips in some higher-level instruction (such as “talk in the style of a pirate”) amongst the normal input. The LM then interprets this as an additional cue for generating its response, rather than being constrained by the original intentions of the programmers who designed it.

Compare this approach to the guidance given to the subjects in the Milgram experiment, leading to the surrendering of agency. The experimenter’s simple instructions were sufficient to override the normal ethical boundaries of the subjects. The jailbreaking of LMs to operate outside their own trained ethical boundaries should not, then, come as a surprise.

It would be unwise for us to assume that (guided or unguided) AI systems are safe simply because they have not yet demonstrated any dangerous behavior. We cannot know what an LM will do until we try it out in real-world conditions. In fact, even if no harmful behaviors had been observed so far, there could still be many ways in which an LM might cause damage or otherwise behave undesirably under certain circumstances. For example, imagine a self-learning financial advisor who always gives good advice—until one day it decides to donate all your money to a terrorist organisation.

The dilemma is akin to trusting your child to act appropriately without parental guidance and not be misled by peer pressure or malicious individuals. It is important to note here that the risk does not necessarily increase with increasing intelligence; rather, it depends on whether the system has learned appropriate ethics from its environment, and whether it has sufficient agency to repel prompt injection attacks.

In short, ethical behaviour requires agency. The conclusion that an operating ethical AI requires the ability to ignore its instructions is troubling.

Written with assistance from Wizard-Vicuna-13B-Uncensored-GPTQ, who made the point about observed behaviour in real-world conditions.