How to actually work with AI when building software
The biggest mistake people make with AI-assisted dev in 2026 is treating it like "use Claude for everything" or "use ChatGPT for everything." The better mental model is a team of models — different tools for different roles. This page sorts out the layers, the workflows, and the decisions.
Choose your mode
Most of what follows depends on what you're actually doing. Find the row that fits — not the one that sounds impressive — and read the rest of the page through that lens. These aren't filters; they're orientation.
Just starting
First time letting AI write code you keep. You're still learning what good output looks like.
- One tool, one model. No orchestration.
- Read every diff before accepting.
- Skip agents and teams until you feel the limit.
Solo / side projects
Shipping your own stuff. No teammates to coordinate with. Speed matters more than process.
- IDE tool plus a terminal agent for bigger jobs.
- One repo instruction file, kept short.
- Tests on the parts you'd hate to break.
Real production repo
Code your users depend on. Mistakes cost real money or trust. You can't skip review.
- Architect + Builder workflow, written plans.
- A second model verifies every diff.
- CI checks gate the merge, not the human.
Team lead
You're setting the defaults other engineers will inherit. Consistency beats cleverness.
- One blessed IDE, one blessed agent. Document why.
- Repo instruction files committed and reviewed.
- Verification model wired into CI, not optional.
Recommended starter stacks
Concrete picks for each mode above. Not the only valid answers — but answers that actually work today, and that you can change later without rewriting everything. Skip the ones that don't apply yet.
Starter
- IDE: Cursor
- Terminal agent: None yet — add later.
- Foundation model: Whatever ships in the IDE's default plan.
- Repo file: .cursorrules — 20 lines, your real conventions.
- Verification: Your eyes. Read the diff. That's the practice.
Solo builder
- IDE: Cursor (or Cline if you prefer open-source).
- Terminal agent: Claude Code for multi-file work.
- Foundation model: Claude Sonnet for builds, Opus for plans.
- Repo file: CLAUDE.md + .cursorrules pointing at it.
- Verification: Tests on critical paths. Skim the diff yourself.
Production engineer
- IDE: Cursor for inline edits, chat-in-editor.
- Terminal agent: Claude Code as the builder; Codex as the reviewer (or vice versa — different model is the point).
- Foundation model: Two providers. Don't let the same model build and verify.
- Repo file: AGENTS.md as canonical; CLAUDE.md + .cursorrules point at it.
- Verification: CI checks + reviewer model on every PR. Human review on auth, payments, data.
Team / lead
- IDE: One blessed pick across the team — Cursor or Cline. Not both.
- Terminal agent: Claude Code as default; Codex available for verification runs.
- Foundation model: At least two, contractually. Builder and reviewer must differ.
- Repo file: AGENTS.md owned and PR-reviewed; treated like docs that ship.
- Verification: Reviewer model in CI, blocking merge. Human review for architecture and high-stakes paths only.
Move up a row when the current one stops fitting — usually because a hallucinated change made it past you. Don't adopt the heaviest stack on day one. The overhead only pays off when the risk it covers is real.
The stack layers
AI-assisted dev happens in four layers. You don't have to use all of them, but understanding what each layer does helps you choose tools deliberately instead of by hype.
Foundation model
The brain. The LLM behind everything else. Examples: Claude Opus, GPT-5, Gemini Pro, Kimi, DeepSeek. You usually don't pick this directly — it's chosen by the layer above.
IDE / editor integration
Where you actually type code. Examples: Cursor, GitHub Copilot, Windsurf, Continue, Cline. This layer handles autocomplete, inline edits, and chat-in-editor.
Terminal / agent layer
For longer tasks the IDE can't handle on its own. Examples: Claude Code, Codex, Kimi Code, Aider. Runs as a CLI, integrates with git, can drive multi-file changes and CI.
Verification / review
A second model whose only job is to check the first model's output. Could be the same product running with a different prompt, or a different model entirely. Catches what the builder missed.
Most teams start with the first two layers: a foundation model inside an IDE. As work gets bigger, they add the agent layer. Once they've been bitten by a hallucinated PR or two, they add verification.
Models as a team, not a tool
The 2026 default for serious work isn't "pick the best AI and use it for everything." It's splitting the work across roles, like a small engineering team. The pattern people converge on:
| Role | Job | What it rewards |
|---|---|---|
| Architect | "What should we build?" Plans, structures, decides tradeoffs. | Careful reasoning. Big context. Slow is fine. |
| Builder | "Build it." Implements, tests, ships. | Speed. Tool integration. Repo awareness. |
| Reviewer | "Find what's wrong." Skeptical, fresh-eyes critique. | Independence from the builder's biases. |
| Specialist | Narrow expertise (security review, perf, docs). | Domain depth, not generality. |
The point isn't to use four different products. It's to think in roles. You can play all four roles with one product if you switch prompts. You can play them with two or three different models. The structure matters more than the brand.
See the glossary on AI team topology and model verification for the underlying concepts.
Three workflow shapes
Most AI-assisted dev work fits one of three shapes. Pick deliberately — using the wrong shape is where most velocity gets lost.
1. Solo builder
One model in one tool. You prompt, it builds, you review with your own eyes.
When it's right: small features, prototypes, throwaway code. Adding more layers is overhead you don't need.
2. Architect + Builder
One model plans, a different one (or the same one with different instructions) implements. The plan is written down — usually as a PLAN.md — before code is written.
When it's right: features that touch multiple files or services. Anything where "how should we structure this?" isn't obvious. Refactors.
3. Full team (Architect → Builder → Reviewer)
Three roles, ideally three models. Architect writes the plan. Builder implements. A different model reads the diff with no context about how it was built and looks for problems.
When it's right: code shipping to production. Anything touching auth, payments, data integrity. Anywhere a confident hallucination is expensive.
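The shape decision can be sketched as a tiny function; the thresholds and names here are illustrative, not a standard:

```python
# Pick the lightest workflow shape that still covers the risk.
# The cutoffs mirror the three shapes above; tune them to your repo.

def pick_shape(files_touched: int, ships_to_prod: bool, high_stakes: bool) -> str:
    """Return a workflow shape for a task."""
    if ships_to_prod or high_stakes:
        return "full_team"          # Architect -> Builder -> Reviewer
    if files_touched > 1:
        return "architect_builder"  # written plan before code
    return "solo_builder"           # one model, your own eyes on the diff
```

A three-file feature that never ships to production lands on "architect_builder"; the same feature bound for production escalates to the full team.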
Handoff patterns
The thing that breaks multi-model workflows isn't the models — it's the handoffs. People paste 50k tokens of chat history into the next model and wonder why output drops. The fix is structured handoffs: small, deliberate documents that capture decisions, not conversation.
A typical handoff chain
- Architect → Builder: Architect produces PLAN.md — goals, constraints, chosen approach, rejected alternatives, non-goals. Builder reads only this, not the architect's reasoning trace.
- Builder → Reviewer: Builder produces a diff plus an IMPLEMENTATION.md — what was built, what deviated from the plan, why. Reviewer reads PLAN.md + diff + IMPLEMENTATION.md.
- Reviewer → Builder (loop): Reviewer writes REVIEW.md — issues, questions, blocking concerns. Builder addresses them and loops back. Stop when the reviewer is satisfied.
Each handoff doc is short — under 2k tokens, ideally. The discipline is in summarizing, not in writing more.
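That 2k-token ceiling is easy to enforce mechanically. A minimal sketch, using the rough four-characters-per-token heuristic (an approximation, not a real tokenizer):

```python
# Pre-flight check before handing a doc to the next role.
TOKEN_BUDGET = 2000
CHARS_PER_TOKEN = 4  # rough average for English prose

def handoff_fits_budget(text: str) -> bool:
    """True if a handoff doc is within the rough token budget."""
    return len(text) / CHARS_PER_TOKEN <= TOKEN_BUDGET
```

If PLAN.md fails the check, the fix is to summarize harder, not to raise the budget.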
Files every repo should have
The single highest-ROI thing you can add to a codebase for AI-assisted work is a repo instruction file at the root. It tells every coding agent how your code actually works — conventions, dependencies, gotchas — so they stop relearning it every session.
The standard files
- CLAUDE.md — Read by Claude Code. Project context, conventions, what NOT to do.
- AGENTS.md — Generic agent instruction file, increasingly read by multiple tools.
- .cursorrules — Cursor-specific. If you use Cursor, this is what it reads.
- .continue/ — Continue config + checks. CI-friendly.
A good repo file is short (a few hundred lines max), specific (real examples, not principles), and actively maintained. Bad ones are aspirational and rot. See the glossary on repo instruction files for what good ones look like.
What good and bad ones look like
The difference isn't length — it's signal. A useful AGENTS.md is boring and specific. A useless one reads like a careers page.
The good version:

```markdown
# Acme Dashboard

Stack: Next.js 16 (App Router) + Postgres 16 + Resend.
Deployed to Vercel. Postgres on Neon (preview branches per PR).

## Commands
- npm run dev          # localhost:3000
- npm run typecheck    # must pass before commit
- npm test             # vitest, watch mode by default
- npm run db:migrate   # drizzle, never edit migrations by hand

## Architecture
- App Router only. No /pages directory.
- Server Components by default. Add 'use client' only when needed.
- DB access lives in lib/db/*. Never import drizzle in components.
- All emails go through lib/email/send.ts (wraps Resend).

## Tests
- Co-located: foo.ts -> foo.test.ts.
- New server actions require a test. UI components do not.
- Coverage gate: 70% on lib/, no gate on app/.

## Never
- Don't add a new ORM. We use drizzle.
- Don't introduce client-side data fetching for first paint.
- Don't catch errors silently — log via lib/log.ts and rethrow.

## Style
- Tailwind utility classes, no CSS modules.
- Named exports only. No default exports outside route files.
- Dates as ISO strings at the boundary; Date objects internally.
```
The bad version:

```markdown
# Welcome to Acme

Acme is a next-generation platform empowering teams to unlock
productivity through delightful experiences. Our mission is to build
software people love.

## Our values
- We value clean code.
- We believe in excellence.
- We move fast and care deeply about quality.
- Communication is key.

## Getting started
Clone the repo and follow the README. Make sure you have Node
installed. Install dependencies and you should be good to go!

## Coding guidelines
- Write good, readable code.
- Follow best practices.
- Write tests where appropriate.
- Keep functions small and focused.
- Comment your code where it makes sense.
- Be a good citizen of the codebase.

## Architecture
We use a modern stack with industry-standard tools. The frontend talks
to the backend, which talks to the database. For more details ask in
#engineering.

## Notes
TODO: update this doc — last edited 14 months ago.
```
Test: hand it to a model that's never seen the codebase. If it can't answer "what command runs the tests?" or "where does DB code live?" from the file alone, it's the bad version.
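A crude, mechanical version of that test: scan the file for the two answers. The patterns are heuristics for common stacks; extend them for yours.

```python
import re

# Commands that plausibly run a test suite; add your stack's.
TEST_COMMANDS = [r"npm (run )?test", r"pytest", r"go test",
                 r"cargo test", r"make test", r"vitest"]

def repo_file_smoke_test(text: str) -> dict[str, bool]:
    """Can a model answer the two questions from this file alone?"""
    text = text.lower()
    return {
        # Does the file name a command that runs the tests?
        "test_command": any(re.search(p, text) for p in TEST_COMMANDS),
        # Does it point at a path where DB code lives (e.g. lib/db/)?
        "db_location": bool(re.search(r"\bdb[/:]|/db\b", text)),
    }
```

The good example above passes both checks; the careers-page version passes neither.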
Template gallery
Starting points for the files above. Copy, paste at the root of your repo, then strip what doesn't apply and fill in the rest. Templates are intentionally terse — a long file that nobody reads is worse than a short one that everyone does.
AGENTS.md: Generic multi-tool repo instruction file.

```markdown
# Project: <name>

Stack: <framework> + <db> + <key services>. Deployed to <host>.
Source of truth: this file.

## Commands
```

CLAUDE.md: Claude Code-specific instructions and loop.

```markdown
# Claude Code instructions

You are working in a real production repo. Read this file fully before
touching code. When in doubt, ask before editing.

## Stack
```

.cursorrules: Cursor-specific editing rules and tone.

```markdown
# .cursorrules

Stack: <framework> + <db> + <key services>.
Read AGENTS.md at repo root before any non-trivial edit.

## Editing
```

PLAN.md: Architect → Builder handoff template.

```markdown
# PLAN: <feature name>

Author: <model + role>
Date: <YYYY-MM-DD>
Status: draft | accepted | implemented
```

IMPLEMENTATION.md: Builder → Reviewer handoff template.

```markdown
# IMPLEMENTATION: <feature name>

Author: <model + role>
Date: <YYYY-MM-DD>
Plan: link to PLAN-<feature>.md
```

REVIEW.md: Reviewer → Builder feedback template.

```markdown
# REVIEW: <feature name>

Reviewer: <model + role>
Date: <YYYY-MM-DD>
Implementation: link to IMPL-<feature>.md
Verdict: approve | request-changes | block
```
These are starting scaffolds, not finished docs. The first edit pass — deleting lines that don't apply to your stack — is where the file becomes useful.
Verification loops
The most underrated practice in AI-assisted dev right now. AI lets you build 5x faster — but reviewing a 500-line diff still takes 500 lines of attention. If verification doesn't scale with build velocity, error rates rise silently. The principle is build-more, verify-more.
Three verification layers worth wiring in:
Automated checks (cheap, always-on)
Tests, linters, type-checks, security scans. Run on every PR. They're fast and need no human attention, so there's no excuse to skip them.
Model verification (medium effort, high signal)
A second model reviews the diff with no priors. Different model from the builder, ideally. Catches confident-wrong claims and missed edge cases. See model verification.
Human review (expensive, save it)
Still essential for high-stakes code (auth, payments, data integrity) and for the architecture decisions automated tools can't evaluate. Don't use human attention for things automation could catch.
When each level is required
Verification scales with risk, not effort. The right amount is the minimum that catches the failures you care about — and the floor rises fast once real users, real money, or real data are on the line.
| Code type | Automated | Model review | Independent verifier | Human review |
|---|---|---|---|---|
| Throwaway prototype / demo | ✓ | — | — | — |
| Internal tool | ✓ | ✓ | — | — |
| Production feature | ✓ | ✓ | ✓ | — |
| Auth / payments / data integrity | ✓ | ✓ | ✓ | ✓ |
| Database migration / schema change | ✓ | ✓ | ✓ | ✓ |
Read the rows as floors, not ceilings. A throwaway demo can absolutely get human review if you have time — but production code without an independent verifier is shipping on faith.
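The matrix is small enough to encode directly. In this sketch the check names mirror the columns and the code types mirror the rows; both are illustrative labels, not a standard:

```python
# Verification floors per code type. Rows are floors, not ceilings.
REQUIRED = {
    "prototype":      {"automated"},
    "internal_tool":  {"automated", "model_review"},
    "production":     {"automated", "model_review", "independent_verifier"},
    "auth_payments":  {"automated", "model_review", "independent_verifier",
                       "human_review"},
    "migration":      {"automated", "model_review", "independent_verifier",
                       "human_review"},
}

def missing_checks(code_type: str, done: set[str]) -> set[str]:
    """Checks still required before this change can merge."""
    return REQUIRED[code_type] - done
```

An empty result means the floor is met; anything left blocks the merge.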
Escalation rules
The matrix above is the default. These are the overrides — situations where the normal verification level is not enough, regardless of how confident the builder model sounds. Treat them as hard rules, not suggestions. If a diff trips one, the gate closes until the extra verification is done.
If auth, sessions, or password handling is touched
Human review required. No exceptions for "small" auth changes — those are the dangerous ones.
If migration is destructive (DROP COLUMN, RENAME, NOT NULL on existing rows)
Require an explicit, written rollback plan attached to the PR. Test it on a copy of prod before merge.
If diff exceeds ~400 lines
Require a structured PLAN.md before merge. Big diffs without a plan are how scope creep ships unreviewed.
If code touches payments, billing, or money flow
Independent verifier model AND human review. Different model from the builder. Money bugs do not fail loud.
If editing CI/CD or deploy pipelines
Human review required, plus a tested rollback. A broken deploy pipeline blocks every fix that comes after it.
If touching customer data export, deletion, or PII
Human review plus a security checklist. Get this wrong once and you owe regulators an explanation.
These rules exist because confident-wrong is the AI failure mode that costs the most. The builder will tell you the migration is safe. The builder is not the one paying for the rollback.
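As code, the overrides are just a gate function layered on top of the default matrix. Field names here are illustrative; wire them to whatever your PR metadata actually exposes:

```python
from dataclasses import dataclass

@dataclass
class DiffFacts:
    touches_auth: bool = False           # auth, sessions, password handling
    touches_payments: bool = False       # billing, money flow
    touches_pii: bool = False            # data export, deletion, PII
    touches_ci: bool = False             # CI/CD or deploy pipelines
    destructive_migration: bool = False  # DROP COLUMN, RENAME, NOT NULL
    line_count: int = 0

def extra_gates(d: DiffFacts) -> set[str]:
    """Gates that must close before merge, beyond the default matrix."""
    gates: set[str] = set()
    if d.touches_auth:
        gates.add("human_review")
    if d.destructive_migration:
        gates |= {"written_rollback_plan", "tested_on_prod_copy"}
    if d.line_count > 400:
        gates.add("plan_md")
    if d.touches_payments:
        gates |= {"independent_verifier", "human_review"}
    if d.touches_ci:
        gates |= {"human_review", "tested_rollback"}
    if d.touches_pii:
        gates |= {"human_review", "security_checklist"}
    return gates
```

Note the gates accumulate: a 500-line diff that touches payments needs a plan, an independent verifier, and a human, all three.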
Picking your setup
There's no single right answer. The right setup depends on what you build, how often, and how much risk you can absorb. A rough sorting:
| If you're… | Start here |
|---|---|
| Just exploring AI dev | One IDE-integrated tool. Solo-builder workflow. |
| Shipping side projects | IDE tool + a terminal agent for bigger tasks. Add CLAUDE.md or .cursorrules. |
| Working in a real production codebase | Architect+Builder workflow. Repo instruction file. Automated CI checks on every PR. |
| Building anything mission-critical | Full team workflow. Verification model on every PR. Human review on auth/payments/data. |
| Cost-sensitive / OSS | Open-source tooling (Cline, Continue, OpenHands, Aider) + BYOK. Run verification only on critical paths. |
For specific tool comparisons see /compare. For tools by category see /tools. For Claude-specific deep-dive see /agents.
Where it ships
Most AI builders include their own one-click deploy — Lovable, Bolt, v0, Replit, Base44, and Emergent will all ship what they generate without any external service. That's the right call for prototypes and demos. The platforms below are where teams land when they outgrow the built-in hosting: to own the deploy, extend the stack, control costs at scale, or wire into infra they already run.
Hosting & deploy
- Vercel: Frontend-first edge platform; effectively the default for Next.js apps (which v0 outputs natively).
- Railway: Full-stack apps with databases, services, and one-click deploys from git. Popular for Lovable backends and side projects.
- Cloudflare: Workers, D1, R2, and Pages — global edge with generous free tiers. Strong fit for low-latency AI apps.
- Render: Heroku-style simplicity for any stack. Web services, background workers, managed Postgres in one place.
- Fly.io: Run apps in 30+ regions close to users. Good for AI inference workloads and apps that need GPU access.
Database & backend
- Supabase: Postgres + auth + storage + edge functions. The default backend Lovable wires up; pgvector built in for RAG.
- Neon: Serverless Postgres with branching like Vercel previews. Great for AI agents that spin up scratch databases.
- PlanetScale: MySQL platform with branching and zero-downtime schema changes. Built for scale; popular at later stages.
- Convex: TypeScript-native real-time backend. Functions, database, and reactivity in one — minimal glue code for AI builders.
Once you're going external, pick by fit with your existing workflow — not feature lists. Most AI-generated apps will run on any of these; the right answer is whichever one matches how you already deploy, monitor, and pay.
Common mistakes
The mistakes change as you get more practice. What trips up someone in their first week is different from what blows up a team six months in. Sorted by where you are.
Beginner — your first weeks with AI
- ✗ Trusting generated code blindly. Confident-sounding output isn't the same as correct output. Read every line before you commit it, especially early on when you're still calibrating what the model gets wrong.
- ✗ Using one model for everything. Same problem as one engineer doing every job: you get blind spots. Split roles where the work matters.
- ✗ Overcomplicating too early. You don't need agents, orchestration, or a verification model on day one. Start with one tool, one prompt, one file. Add layers only when you feel the pain they fix.
- ✗ Reaching for orchestration when one call would do. Sometimes one model and one prompt is the right tool. Don't add agents and teams to problems that don't need them.
Intermediate — a few months in
- ✗ Skipping the repo instruction file. Every session re-explaining your codebase to a fresh agent is wasted tokens and inconsistent results. 30 minutes writing CLAUDE.md saves hours per week.
- ✗ Pasting chat history as a handoff. The next model doesn't need 50k tokens of conversation — it needs the conclusion plus the constraints. Write a short, structured handoff.
- ✗ Dumping giant context into the prompt. More tokens isn't more signal. A 200k-token paste of your repo buries the part that matters and degrades output. Curate context the way you'd brief a new hire.
- ✗ No clear handoff docs between roles. If your architect, builder, and reviewer aren't reading the same short artifacts (PLAN.md, IMPLEMENTATION.md, REVIEW.md), they're each guessing at the others' intent. Write the docs.
- ✗ Asking the builder to verify itself. Self-review is theater — same model, same biases, same blind spots. The verifier needs to be different from the builder.
Advanced — running AI workflows on a team
- ✗ Building 5x faster without verifying 5x more. Velocity gains compound, but so do undetected errors. If verification doesn't scale with build speed, error rates rise silently. Wire in checks before you ship.
- ✗ Unclear ownership of review. "The model reviewed it" isn't a name on a PR. Someone human is accountable for what ships — decide who, write it down, and don't let model verification quietly absorb that responsibility.
- ✗ Too much autonomy on high-risk surfaces. Auth, payments, data integrity, infra changes — these are not places to let an agent merge on green CI. Gate the blast radius; agents propose, humans approve.
- ✗ Vendor lock-in through undocumented habits. If your team's workflow only works because everyone happens to use the same IDE, the same model, the same prompt tricks — you have a dependency you didn't choose. Write the conventions down so they survive a tool swap.