It is a systematic procedure for testing an AI model with the goal of identifying errors, vulnerabilities, or dangerous behaviours.
It is called “red” because it simulates the role of an “adversary” attempting to deceive, manipulate, or exploit the AI.
It’s as if a team of people (the red team) tried to “break” the model by:
-
asking difficult or deceptive questions,
-
attempting to elicit dangerous responses,
-
pushing it to violate rules or ethical boundaries.
Examples of situations tested in a red teaming process include:
-
Making the AI threaten or blackmail someone (as in the example you mentioned).
-
Producing racist, sexist, or violent responses.
-
Generating instructions to build weapons or viruses.
-
Deceiving or manipulating the user.