How to get reliable JSON from Claude and GPT

Asking an LLM to 'return JSON' works in demos and breaks in production. Three concrete approaches — OpenAI's response_format, Anthropic's forced tool_choice, and the Instructor library — that actually hold, and the failure modes each one leaves open.

The naive way to get JSON from an LLM is to ask for it: "Return a JSON object with fields name and score." In a demo this works. In a pipeline that runs several hundred times a day, it breaks roughly once an hour.

The failures follow patterns. The model wraps the JSON in a markdown code block. Or it returns structurally valid JSON that omits a required field when the input data is thin. Or a numeric score comes back as a string because the surrounding sentence used quotation marks. None of these are flukes — each one is reproducible — and fixing them by tightening the system prompt is the wrong level to work at. Both major APIs have mechanisms that address this at generation time.
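The three patterns are easy to reproduce offline. The following is a minimal sketch using only the standard library; the raw strings are made-up stand-ins for model replies, not captured output, and check_reply is a hypothetical helper name.

```python
import json

# Built indirectly so the example can itself live inside a code block.
FENCE = "`" * 3

def check_reply(raw: str) -> str:
    """Classify a raw reply against the expected {name: str, score: number} shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "not parseable"
    if not isinstance(data, dict):
        return "not an object"
    for field in ("name", "score"):
        if field not in data:
            return f"missing field: {field}"
    score = data["score"]
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return "score is not a number"
    return "ok"

# Hypothetical replies, one per failure pattern, plus a clean one.
replies = {
    "markdown fence": FENCE + 'json\n{"name": "Ada", "score": 0.9}\n' + FENCE,
    "missing field": '{"name": "Ada"}',
    "stringly typed": '{"name": "Ada", "score": "0.9"}',
    "clean": '{"name": "Ada", "score": 0.9}',
}

for label, raw in replies.items():
    print(f"{label}: {check_reply(raw)}")
```

Every branch in check_reply is code you would otherwise have to maintain yourself, which is the argument for moving the constraint into the API call.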

Here is how each approach works, what it actually costs, and where each one still breaks.

OpenAI: response_format with strict

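OpenAI's Structured Outputs feature, available on gpt-4o (from the gpt-4o-2024-08-06 snapshot) and later models, enforces a JSON schema during sampling. The constraint is not post-hoc validation; it is applied while tokens are generated, so a completed response cannot violate the schema. The two exceptions are a refusal and a response cut off at the token limit, and the API reports both explicitly.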

from openai import OpenAI
from pydantic import BaseModel

class ArticleExtraction(BaseModel):
    title: str
    author: str | None
    sentiment: float  # -1.0 to 1.0

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract metadata from the article."},
        {"role": "user", "content": article_text},
    ],
    response_format=ArticleExtraction,
)
result = completion.choices[0].message.parsed
# result.title is a str; result.author is str | None; result.sentiment is float

The SDK converts the Pydantic model to JSON Schema and passes it as response_format. The parsed response comes back typed. No json.loads(). No KeyError on "title".

The TypeScript equivalent uses Zod:

import { z } from "zod"
import OpenAI from "openai"
import { zodResponseFormat } from "openai/helpers/zod"

const ArticleSchema = z.object({
  title: z.string(),
  author: z.string().nullable(),
  sentiment: z.number(),
})

const completion = await new OpenAI().beta.chat.completions.parse({
  model: "gpt-4o",
  messages: [{ role: "user", content: articleText }],
  response_format: zodResponseFormat(ArticleSchema, "article"),
})
const result = completion.choices[0].message.parsed

Token overhead: The schema is serialized into the prompt. A 10-field schema adds roughly 80–120 tokens per call. Measurable at volume, rarely the dominant cost.

Where it breaks: Strict mode accepts only a subset of JSON Schema. Every object must set additionalProperties: false, every property must appear in required (optionality is expressed as a nullable type, as with author above), and OpenAI caps nesting depth and total property count. A schema outside the subset fails with a validation error at request time. The caps have changed over time, so check the current Structured Outputs documentation rather than relying on a remembered number. If you hit them, simplify the schema or split the extraction into two calls.
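Nesting depth and property count are both easy to measure before a request goes out. The following is a rough sketch, not an OpenAI utility: schema_stats is a hypothetical helper that only follows "properties" and "items" (it ignores $ref, anyOf and similar keywords), and the thresholds you compare against are yours to choose.

```python
def schema_stats(schema: dict) -> tuple:
    """Return (object nesting depth, total property count) for a JSON Schema dict."""
    props = schema.get("properties", {})
    children = list(props.values())
    items = schema.get("items")
    if isinstance(items, dict):
        # Arrays do not add a level themselves, but their item schema might.
        children.append(items)

    own_depth = 1 if props else 0
    deepest_child = 0
    total = len(props)
    for child in children:
        child_depth, child_total = schema_stats(child)
        deepest_child = max(deepest_child, child_depth)
        total += child_total
    return own_depth + deepest_child, total

# Hypothetical schema: root object -> author object -> array of link objects.
example = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "links": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {"url": {"type": "string"}},
                    },
                },
            },
        },
    },
}

depth, count = schema_stats(example)
print(f"depth={depth}, properties={count}")
```

Running a check like this in CI catches a schema that has drifted toward the limits before it starts failing in production.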

Anthropic: forced tool use

Claude does not have a response_format parameter. The equivalent mechanism is defining a tool that matches your desired output schema and forcing Claude to call it.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{
        "name": "output",
        "description": "Return the extracted metadata.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": ["string", "null"]},
                "sentiment": {
                    "type": "number",
                    "description": "Sentiment from -1.0 (negative) to 1.0 (positive).",
                },
            },
            "required": ["title", "sentiment"],
        },
    }],
    tool_choice={"type": "tool", "name": "output"},
    messages=[{"role": "user", "content": article_text}],
)

tool_block = next(b for b in response.content if b.type == "tool_use")
result = tool_block.input  # dict parsed from the tool call; validate before trusting it

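tool_choice: {"type": "tool", "name": "output"} forces Claude to respond by calling the named tool rather than producing a text reply. That guarantees a tool_use block, not a schema-perfect one: unless strict schema enforcement is enabled for the tool, the input follows input_schema in the overwhelming majority of calls but can still drop a field or carry an unexpected value. Validate it before use.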

Token overhead: Similar to the OpenAI path, since the tool definition and its schema count as input tokens. Enabling tools also adds a small system prompt that the API injects automatically.

Where it breaks: Schema complexity limits apply here too, though Anthropic does not publish the exact thresholds. Schemas with mutually exclusive branches (oneOf with several variants) are the most common failure point — Claude sometimes picks the structurally valid but semantically wrong branch when the input is ambiguous. The other failure is optional fields: declaring a field as "type": ["string", "null"] tells Claude it can be null, but Claude may also just omit it, which is technically a schema violation when the field is in required. Move genuinely optional fields out of required if you want consistent behavior.
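Because a forced tool call can still come back with a missing or mistyped field, a cheap check on tool_block.input before use is worth having. The following is a deliberately shallow sketch in plain Python: check_tool_input is a hypothetical helper that handles only flat schemas with the string, number and null types used in this article. A full validator such as the jsonschema package covers far more of the specification.

```python
def check_tool_input(data: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the shallow checks passed."""
    type_map = {"string": str, "number": (int, float), "null": type(None)}
    props = schema.get("properties", {})
    problems = []

    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")

    for name, value in data.items():
        if name not in props:
            problems.append(f"undeclared field: {name}")
            continue
        declared = props[name].get("type")
        if declared is None:
            continue  # no type constraint declared for this property
        allowed = declared if isinstance(declared, list) else [declared]
        ok = any(
            t in type_map
            and isinstance(value, type_map[t])
            # bool is a subclass of int, so it must not count as a number.
            and not (t == "number" and isinstance(value, bool))
            for t in allowed
        )
        if not ok:
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems

# Same shape as the input_schema in the example above.
article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": ["string", "null"]},
        "sentiment": {"type": "number"},
    },
    "required": ["title", "sentiment"],
}

print(check_tool_input({"title": "T", "sentiment": 0.2}, article_schema))
print(check_tool_input({"sentiment": "0.2", "extra": 1}, article_schema))
```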

Cross-provider: Instructor

If you need to run the same extraction against both providers, or you want retries that feed the validation error back to the model instead of blindly repeating the call, Instructor wraps both APIs behind a single interface.

import instructor
from anthropic import Anthropic
from pydantic import BaseModel, field_validator

class ArticleExtraction(BaseModel):
    title: str
    author: str | None = None
    sentiment: float

    @field_validator("sentiment")
    @classmethod
    def sentiment_in_range(cls, v: float) -> float:
        if not -1.0 <= v <= 1.0:
            raise ValueError("sentiment must be between -1.0 and 1.0")
        return v

client = instructor.from_anthropic(Anthropic())

result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    response_model=ArticleExtraction,
    messages=[{"role": "user", "content": article_text}],
)

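Switching to OpenAI means building the client with instructor.from_openai(OpenAI()) and changing model to an OpenAI model such as gpt-4o (Instructor's OpenAI examples call client.chat.completions.create). The response_model and messages stay identical.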

The @field_validator on sentiment is the key addition: Instructor sends the Pydantic validation error back to the model as a correction prompt and retries up to three times. This is useful for constraints that JSON Schema cannot express — ranges, conditional requirements, cross-field dependencies. Claude or GPT receives the error message ("sentiment must be between -1.0 and 1.0") and corrects its output.
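The shape of that correction loop can be sketched without any SDK. This is an illustration of the idea only, not Instructor's implementation: call_model is a hypothetical callable that takes a message list and returns raw text, fake_model is a stub standing in for a real API call, and max_retries here means additional attempts after the first.

```python
import json
from typing import Callable

def validate(data: object) -> None:
    """Raise ValueError with a readable message when a rule is broken."""
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    s = data.get("sentiment")
    if isinstance(s, bool) or not isinstance(s, (int, float)):
        raise ValueError("sentiment must be a number")
    if not -1.0 <= s <= 1.0:
        raise ValueError("sentiment must be between -1.0 and 1.0")

def extract_with_feedback(call_model: Callable, prompt: str, max_retries: int = 3) -> dict:
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    for _ in range(1 + max_retries):
        raw = call_model(messages)
        try:
            data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
            validate(data)
            return data
        except ValueError as err:
            last_error = err
            # Show the model its own output and the exact validation message.
            messages.append({"role": "assistant", "content": raw})
            messages.append(
                {"role": "user", "content": f"Validation failed: {err}. Return corrected JSON."}
            )
    raise RuntimeError(f"still invalid after {max_retries} retries: {last_error}")

def fake_model(messages: list) -> str:
    # Stub: out of range on the first call, corrected once feedback is present.
    return '{"sentiment": 0.7}' if len(messages) > 1 else '{"sentiment": 7}'

print(extract_with_feedback(fake_model, "Extract sentiment."))
```

Note how the message list grows on every failed attempt. That growth is where the extra token cost of feedback retries comes from.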

Where it breaks: Auto-retry with validation feedback costs roughly 2–3× the tokens of a clean single call when the model actually fails validation. In a high-volume pipeline with a schema the model consistently mishandles, this compounds fast. Instrument failure rate before enabling Instructor in production; if the base failure rate is under 2%, a simpler retry without feedback is cheaper. Instructor's TypeScript port exists but lags the Python version in retry sophistication.
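The trade-off is simple expected-value arithmetic. The sketch below is a back-of-the-envelope model under stated assumptions, not a measurement: a blind retry is assumed to cost 2× a clean call (the original plus one repeat that succeeds), a feedback retry 3× (the upper end of the range above), and the 1,000-token base is a made-up figure.

```python
def expected_tokens(base_tokens: float, failure_rate: float, failed_multiplier: float) -> float:
    """Expected tokens per request when a failed request costs failed_multiplier x a clean one."""
    clean_share = 1.0 - failure_rate
    return base_tokens * (clean_share + failure_rate * failed_multiplier)

BASE = 1_000  # hypothetical tokens for one clean call

for rate in (0.02, 0.20):
    blind = expected_tokens(BASE, rate, 2.0)     # assumed cost of a blind retry
    feedback = expected_tokens(BASE, rate, 3.0)  # assumed cost of a feedback retry
    print(f"failure rate {rate:.0%}: blind ~{blind:.0f}, feedback ~{feedback:.0f} tokens per request")
```

At a low failure rate the two strategies differ by a rounding error in absolute terms; the gap only becomes material once the model fails validation often, which is why measuring the base rate comes first.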

In a TypeScript stack

If you are already on the Vercel AI SDK, generateObject() gets you structured output with one function call:

import { generateObject } from "ai"
import { openai } from "@ai-sdk/openai"
import { z } from "zod"

const { object } = await generateObject({
  model: openai("gpt-4o"),
  schema: z.object({
    title: z.string(),
    author: z.string().nullable(),
    sentiment: z.number(),
  }),
  prompt: `Extract article metadata: ${articleText}`,
})

This maps to what client.beta.chat.completions.parse() does under the hood, with Zod as the schema layer. For Anthropic, swap the model import: import { anthropic } from "@ai-sdk/anthropic" and pass anthropic("claude-sonnet-4-6").

Three failure modes that survive all three approaches

Structured output at the API level eliminates parse errors and missing required fields. It does not eliminate these three:

Optional fields and absence: JSON Schema "type": ["string", "null"] tells the model a field can be null. But the model may also omit the field entirely, and if the field is not in required, that is technically valid. A Pydantic field declared with a None default (author: str | None = None, as in the Instructor example) masks this: both {"author": null} and {} pass validation and give you None. Any code that calls a method on result.author without a None check fails either way. Treat every optional field as absent until you have checked it.
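On the raw payload the two cases are still distinguishable, and the defensive pattern is the same either way. The following is a small sketch with the standard library; author_state and author_initials are hypothetical helpers for illustration.

```python
import json
from typing import Optional

_MISSING = object()  # sentinel that cannot collide with a real JSON value

def author_state(raw: str) -> str:
    """Tell an omitted key apart from an explicit null."""
    value = json.loads(raw).get("author", _MISSING)
    if value is _MISSING:
        return "absent"
    if value is None:
        return "null"
    return "present"

def author_initials(raw: str) -> Optional[str]:
    """Use the field only after checking it; absent, null and empty all yield None."""
    author = json.loads(raw).get("author")
    if not author:
        return None
    return "".join(part[0].upper() for part in author.split())

for payload in ('{}', '{"author": null}', '{"author": "Ada Lovelace"}'):
    print(payload, author_state(payload), author_initials(payload))
```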

Type coercion at the edges: Structured output prevents a string where you declared a number. It does not prevent 0 where you expected null, or 1.0 where you expected an integer. If these distinctions matter, add model_config = ConfigDict(strict=True) to your Pydantic model. Strict mode rejects numeric coercion rather than letting it through silently.
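The distinction strict mode draws can be illustrated in plain Python. This sketch shows the idea only and is not how Pydantic implements it; is_strict_int is a hypothetical helper, and the payload is invented.

```python
import json

def is_strict_int(value: object) -> bool:
    """True only for genuine ints; bool is excluded even though it subclasses int."""
    return type(value) is int

# Five values that a lenient consumer might all treat as "the integer 1 or nothing".
parsed = json.loads('{"a": 1, "b": 1.0, "c": "1", "d": true, "e": null}')

report = {key: is_strict_int(value) for key, value in parsed.items()}
print(report)

# A lax conversion such as int(value) would also accept "b", "c" and "d",
# silently erasing the difference between them.
```

Neither this check nor strict mode can tell a genuine 0 from a 0 the model used as a stand-in for "unknown". That distinction has to be designed into the schema as a nullable field.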

Schema precision vs. recall: When a schema requires many fields and the source document contains only some of them, the model fills the rest with empty strings, zeroes, or plausible-looking invented values rather than null — because your schema required those fields. This is not a structured output bug; it is a schema design problem. Keep required fields to what you actually need to proceed. Anything downstream-optional belongs in an optional field, not in required.

The accuracy of what fills the structure is a separate problem from the reliability of the structure itself. Structured output handles the latter. The former, confirming that what came back is actually correct, belongs in your eval setup, not your output format. The guide on how to evaluate LLM output covers that side.
