
Test-time compute

Also known as: inference-time compute, inference-time scaling, test-time scaling

The principle that you can make a model smarter by spending more compute at inference time — not just at training time — by having it think longer, sample more, or search more.

What it means

For most of the deep learning era, "make the model better" meant "train a bigger model on more data." Test-time compute (also called inference-time compute) is the paradigm shift that says you can also make a model better by giving the existing model more compute when it is actually answering your question. Instead of one forward pass producing one answer, the model can think for thousands of tokens, sample dozens of candidate answers, run a search procedure, or use tools. All of these cost more compute per query, but they produce better answers.

This shift went mainstream in late 2024 with OpenAI's o1, which demonstrated that scaling test-time compute produced reliable accuracy gains on hard problems even without changing the underlying model. The scaling curves were remarkable: doubling thinking tokens often produced gains comparable to 10x-ing training compute. By 2025-2026, every major lab had reasoning variants, and "test-time compute" became the rallying cry for a generation of inference techniques: chain of thought, self-consistency, tree of thought, best-of-N sampling, and tool-augmented reasoning.

The economic implications are significant. Training a frontier model costs hundreds of millions of dollars, and that cost is mostly fixed. Test-time compute is a knob users (or providers) can turn per query: spend $0.50 on thinking for a hard problem, $0.001 for an easy one. This makes "intelligence as a service" much more flexible, and it shifts the competitive frontier from who has the biggest training run to who has the best reasoning training plus the most efficient inference infrastructure.
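
To make the sampling-based techniques concrete, here is a minimal sketch of best-of-N sampling with self-consistency-style majority voting. The `generate` callable is a hypothetical stand-in for any LLM API call (not a real library function); the voting logic is the technique itself.

```python
from collections import Counter
from typing import Callable

def self_consistency(
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> final answer
    prompt: str,
    n: int = 16,
) -> str:
    """Best-of-N via self-consistency: sample n answers, return the majority.

    Each sample is one full inference pass, so total cost scales linearly
    with n; accuracy rises because independent reasoning paths that agree
    are more likely to be correct.
    """
    answers = [generate(prompt) for _ in range(n)]
    # Majority vote over final answers (the reasoning traces may differ).
    return Counter(answers).most_common(1)[0][0]
```

In practice the vote is taken over a parsed final answer (a number, a multiple-choice letter) rather than the raw text, so that differently worded but equivalent answers count together.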

Example

A coding agent gets a hard bug report. Instead of one quick attempt, it runs the reasoning model with 50,000 thinking tokens, generates 5 candidate fixes, runs the test suite on each, and picks the one that passes. Total compute per query: 100x a normal request. Result: a fix that would have taken a human 2 hours.
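
A sketch of that verify-and-select loop, assuming hypothetical `propose_fix` and `run_tests` helpers (neither is a real API; the point is that a verifier turns extra samples into extra reliability):

```python
from typing import Callable, Optional

def best_of_n_with_verifier(
    propose_fix: Callable[[str], str],  # hypothetical: bug report -> candidate patch
    run_tests: Callable[[str], bool],   # hypothetical: patch -> does the suite pass?
    bug_report: str,
    n: int = 5,
) -> Optional[str]:
    """Generate up to n candidate fixes; return the first that passes the tests.

    Unlike majority voting, selection here uses an external verifier
    (the test suite), so a single correct candidate among n is enough.
    """
    for _ in range(n):
        candidate = propose_fix(bug_report)
        if run_tests(candidate):
            return candidate
    return None  # all candidates failed; the caller can escalate the budget
```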

Why it matters

Test-time compute is reshaping how the AI industry thinks about progress. It explains why reasoning models exist, why agents are suddenly viable, and why a 2026 model with the same parameter count as a 2024 model is dramatically more capable. If you're building with LLMs, knowing when to spend more compute (and when not to) is the new core skill.
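
One way that skill shows up in code is a per-query budget policy. This is a toy sketch, not a standard API: the `difficulty` score is assumed to come from some upstream router or classifier.

```python
def thinking_budget(
    difficulty: float,        # assumed estimate in [0, 1] from an upstream router
    base_tokens: int = 1_000,
    max_tokens: int = 50_000,
) -> int:
    """Toy policy: scale the thinking-token budget with estimated difficulty.

    Easy queries stay cheap; hard queries get the full budget.
    """
    difficulty = min(max(difficulty, 0.0), 1.0)
    return int(base_tokens + difficulty * (max_tokens - base_tokens))
```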
