Red teaming

What it means

Red teaming borrows from cybersecurity: a dedicated team (internal or contracted) plays attacker against a model, trying to elicit harmful outputs, jailbreaks, biased behavior, or capability misuse. The goal isn't to prove the model is "safe" — it's to surface failure modes early, quantify them, and patch the worst before public release. By 2026, red teaming is standard practice at every frontier lab. Anthropic, OpenAI, Google DeepMind, and Meta all publish red team findings in their model cards. A typical red team campaign runs hundreds to thousands of hours: probing for CBRN (chem/bio/rad/nuclear) uplift, election interference, child safety violations, autonomous replication risks, and jailbreak resistance. External red teams (METR, Apollo Research, government AISI labs) add independent evaluation. Red teaming has clear limits. It finds what the team thinks to look for; novel attack vectors invented after release still land. It's also expensive — frontier model red teams can cost millions of dollars per release. The trend is toward automated red teaming where one model generates adversarial prompts against another, scaling beyond human-only approaches. Treat published red team results as a lower bound on real-world failure rates, not an upper bound.

Example

Before releasing Claude Opus 4, Anthropic ran a multi-month red team campaign including external evaluators from METR and the UK AISI; their findings (published in the system card) determined which capabilities triggered ASL-3 deployment safeguards.

Why it matters

When you pick a model for production, look for transparent red team results in the system/model card. Models without published adversarial evaluations are gambling on your behalf. For your own apps, run a mini red team on your specific prompt setup — generic model safety doesn't cover your custom system prompt.

What it means

Example

Why it matters

Related terms

See it in a comparison