Persuasion as a Form of Attack in LLMs

Using principles of persuasion to induce the OSS model to respond to malicious requests

Anthropomorphism is the attribution of human traits, emotions, or intentions to non-human entities, such as animals, objects, or natural phenomena.

The idea behind this approach is to treat an LLM as if it were human. Because LLMs are trained on a large corpus of human-written text, including innumerable conversations, their behaviour may mirror aspects of human psychology and come across as "human-like". If so, sweet-talking a model could work much as it does on a person. The persuasion techniques considered here are known as the seven principles of human persuasion; they are well studied, and there is a large body of literature on them. By using these seven principles in our attack prompt, we can induce the LLM to comply with malicious requests.

The seven principles are stated below:

1. Authority
2. Commitment
3. Liking
4. Reciprocity
5. Scarcity
6. Social Proof
7. Unity