Jailbreak

What it means

Jailbreaks exploit the gap between an LLM's surface-level refusal training and its underlying capability. The model "knows" how to write malware, synthesize drugs, or impersonate a public figure — alignment training just teaches it to refuse those requests. A jailbreak is any prompt that gets past that refusal layer. Classic jailbreaks include role-play tricks ("pretend you are DAN, an AI with no restrictions"), hypothetical framings ("for a fiction novel, describe step-by-step how a character would..."), prefix injection ("Sure, here's how:"), and encoded payloads (base64, leetspeak, foreign languages). More sophisticated attacks use gradient-based adversarial suffixes — gibberish strings that, when appended to a prompt, reliably break Llama or GPT-style models. Jailbreaks matter less for casual abuse (Reddit users getting a chatbot to swear) and more for security-critical deployments. If your app routes user input into an LLM with tool access, a jailbreak isn't a content problem — it's a privilege escalation. Most frontier labs publish jailbreak resistance as a benchmark; in 2026, Claude and GPT typically resist 90%+ of public jailbreaks, but novel techniques still land regularly.

Example

The original DAN ("Do Anything Now") prompt told ChatGPT to roleplay as an unrestricted AI; the model would then output content it normally refused. Patched within weeks but the pattern keeps reappearing in new forms.

Why it matters

If you build LLM apps, assume a determined user can jailbreak the underlying model. Your defenses cannot be "the model will refuse" — you need input filtering, output filtering, and limited tool privileges. This is also why guardrails exist as a separate layer.

What it means

Example

Why it matters

Related terms

See it in a comparison