LLMs: which model, when
How to pick the right model for each call. Default to gpt-5-nano with Zod schemas, escalate only when the work demands it, and route everything through one helper.
Most MVPs overspend on tokens because the builder reaches for the biggest model first and never revisits the choice. This page lays out a default that works for almost every call we make, the handful of cases where it doesn't, and the cheap routing trick that keeps the bill flat as you scale. Costs and model names below are current as of April 2026.
The default: gpt-5-nano with reasoning effort 'minimal'
This site uses gpt-5-nano with reasoning_effort: 'minimal' and Zod-typed structured output via the Responses API for almost every model call. Three reasons:
- A typical call costs around $0.0001 (input plus output, single-digit thousand tokens).
- Latency lands sub-500ms in our logs, often closer to 250ms when the prompt is short.
- The Zod schema becomes the contract. Either the model returns valid data or you raise. There's no string parsing layer to break.
The Responses API plus zodTextFormat is the part that makes nano usable. Without a schema, nano hallucinates JSON shape. With one, it doesn't get to. We haven't had a "the model returned text instead of JSON" incident since switching everything to Zod.
Long version of the why-nano case: /blog/gpt-5-nano-what-its-good-at/. The integration pattern: /blog/typed-openai-zod/.
import OpenAI from "openai";
import { z } from "zod";
import { zodTextFormat } from "openai/helpers/zod";

const Classification = z.object({
  category: z.enum(["bug", "feature", "question", "spam"]),
  confidence: z.number().min(0).max(1),
  rationale: z.string().max(140),
});

const openai = new OpenAI();
const msg = "App crashes when I upload a PNG"; // example input

const r = await openai.responses.parse({
  model: "gpt-5-nano",
  reasoning: { effort: "minimal" },
  input: [
    { role: "system", content: "Classify the user's support message." },
    { role: "user", content: msg },
  ],
  // zodTextFormat is the Responses API helper; zodResponseFormat is the
  // Chat Completions equivalent and doesn't fit text.format.
  text: { format: zodTextFormat(Classification, "classification") },
});

const result = r.output_parsed; // typed as z.infer<typeof Classification>
That's the whole template. Most model calls in this codebase look like that with a different schema.
When to escalate inside OpenAI
Nano is great until it isn't. Here's the ladder:
| Need | Model | Reasoning effort |
|---|---|---|
| Classification, extraction, short rewrites | gpt-5-nano | minimal |
| Multi-field reasoning, light planning | gpt-5-mini | minimal or low |
| Hard reasoning, code generation, math | gpt-5 | medium |
| Multi-step plans, agentic flows | gpt-5 | high |
| Image understanding | gpt-5 | low |
Concrete tells you've outgrown nano: the model starts mixing up which input field maps to which output field, or it picks a plausible-but-wrong enum value when the right answer needs a chain of reasoning. Move up one step. If gpt-5-mini solves it, stay there — the price gap to full gpt-5 is real.
Don't crank reasoning effort past what the task needs. effort: 'high' on gpt-5 can take 10–30 seconds and cost cents per call. Use it for one-shot deep work (drafting a migration plan, debugging a thorny prompt), not user-facing latency paths.
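If you want the ladder encoded rather than remembered, one option is a single constant that the shared helper reads from. A sketch, with task names that are ours rather than any API's:

const LADDER = {
  classify: { model: "gpt-5-nano", effort: "minimal" },
  plan:     { model: "gpt-5-mini", effort: "low" },
  reason:   { model: "gpt-5",      effort: "medium" },
  agent:    { model: "gpt-5",      effort: "high" },
  vision:   { model: "gpt-5",      effort: "low" },
} as const;

type Task = keyof typeof LADDER; // "classify" | "plan" | "reason" | "agent" | "vision"

One constant means one diff when a price change makes you rethink a rung, which is the same argument as the router pattern below.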
Other providers worth knowing
Single-vendor lock-in is fine for an MVP, but knowing the alternatives matters when you hit a wall.
| Provider | Model | What it's good at | Rough cost (input / output per 1M tokens) |
|---|---|---|---|
| OpenAI | gpt-5-nano | Cheap structured output, classification | $0.05 / $0.40 |
| OpenAI | gpt-5-mini | General purpose, good price/quality | $0.25 / $2.00 |
| OpenAI | gpt-5 | Hard reasoning, vision, code | $1.25 / $10.00 |
| Anthropic | Claude Haiku 4.5 | Fast, cheap, strong on tool-use | $1.00 / $5.00 |
| Anthropic | Claude Sonnet 4.6 | Best-in-class code generation, long context | $3.00 / $15.00 |
| Google | Gemini 2.5 Flash | Cheap, native multimodal, 1M context | $0.30 / $2.50 |
| Mistral | Mistral Large 2 | EU-hosted option, decent general model | $2.00 / $6.00 |
| Groq | Llama 3.3 70B (hosted) | Sub-100ms latency, free tier for low volume | $0.59 / $0.79 |
A few notes from running these in production:
- Claude Sonnet 4.6 is what we'd reach for if we were generating long-form code or multi-file edits inside an app. It's also what most coding agents use under the hood. For one-off generation in a user-facing app, the price is hard to justify. For agent loops where quality dominates cost, it's the right call.
- Claude Haiku 4.5 is the closest direct competitor to gpt-5-mini. It's a better fit if your prompts are heavy on tool-calling.
- Gemini 2.5 Flash wins when you need to feed in a 200-page PDF or hours of audio. The 1M context isn't a gimmick for those use cases.
- Groq is the right answer if you need sub-second turnaround on a Llama-class model and can tolerate occasional capacity issues. The free tier is generous enough to prototype with (example after this list).
- Mistral is mostly relevant if you have an EU data residency requirement.
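Most of these providers speak the OpenAI wire format, which makes trying one a baseURL swap rather than a rewrite. A sketch against Groq, assuming the endpoint and model id Groq documents at the time of writing (check their docs before copying):

import OpenAI from "openai";

// Groq exposes an OpenAI-compatible endpoint, so the same SDK works.
const groq = new OpenAI({
  apiKey: process.env.GROQ_API_KEY,
  baseURL: "https://api.groq.com/openai/v1",
});

// Stream so the user sees tokens immediately.
const stream = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Say hello in one line." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}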
Open-source models worth knowing
You probably don't want to self-host on day one. But if you're building something that runs locally, processes regulated data, or needs to ship as a desktop app, the OSS world is genuinely useful in 2026.
- Llama 3.3 70B — Meta's general-purpose workhorse. Runs on a single H100 or two RTX 4090s with quantization. Quality is in the gpt-5-mini ballpark for most tasks.
- Qwen 2.5 72B — Alibaba's release. Strong on code and math, especially the Coder variants (Qwen 2.5 Coder 32B is small enough to run on a Mac Studio).
- DeepSeek V3 — MoE architecture, very cheap to run inference on. Good for batch jobs.
For self-hosting, use Ollama for prototyping and vLLM for production. Don't try to write your own inference server.
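Both Ollama and vLLM expose OpenAI-compatible endpoints, so the OpenAI SDK works against them with a different baseURL (structured-output support varies by server and model, so test your schemas). A sketch, assuming Ollama's default port and a tag you've already pulled:

import OpenAI from "openai";

// Ollama serves an OpenAI-compatible API on localhost:11434 by default
// (vLLM defaults to localhost:8000). Ollama ignores the API key.
const local = new OpenAI({
  apiKey: "ollama",
  baseURL: "http://localhost:11434/v1",
});

const r = await local.chat.completions.create({
  model: "llama3.3:70b", // whatever tag you pulled with `ollama pull`
  messages: [{ role: "user", content: "ping" }],
});
console.log(r.choices[0].message.content);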
Decision matrix by task
This is the lookup table we actually use:
| Task | Pick | Notes |
|---|---|---|
| Classification (under ~20 classes) | gpt-5-nano | Always with a Zod enum. |
| Field extraction from text | gpt-5-nano | Schema does the heavy lifting. |
| Summarization (under 2k tokens in) | gpt-5-nano | Cap output length in the schema. |
| Long-form generation (blog posts, emails) | gpt-5-mini | Nano's prose gets repetitive past a few paragraphs. |
| Code generation inside an app | claude-sonnet-4.6 or gpt-5 | Cost is real. Cache the output if you can. |
| Code generation as a developer | Claude Code, Cursor | Use the agent, not your own API call. |
| Image understanding | gpt-5 (low effort) or claude-sonnet-4.6 | Gemini Flash if cost matters more than nuance. |
| Real-time chat UX | Groq Llama or gpt-5-nano with streaming | Streaming is mandatory to keep perceived latency under 1s. |
| Embedding for semantic search | text-embedding-3-small | Cheap, fine for most. |
| Re-ranking search results | gpt-5-nano | One call with a list-of-IDs schema (sketch below). |
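The re-ranking row is the least obvious one, so here's a sketch. It reuses the client and imports from the template at the top of the page; query and candidates are assumed inputs from your search layer:

const Rerank = z.object({
  ranked_ids: z.array(z.string()), // best match first
});

const r = await openai.responses.parse({
  model: "gpt-5-nano",
  reasoning: { effort: "minimal" },
  input: [
    {
      role: "system",
      content:
        "Rank the candidate documents by relevance to the query. Return ids only, best first.",
    },
    { role: "user", content: JSON.stringify({ query, candidates }) },
  ],
  text: { format: zodTextFormat(Rerank, "rerank") },
});

const orderedIds = r.output_parsed?.ranked_ids ?? [];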
The router pattern
Once you have more than a couple of distinct AI tasks, you'll be tempted to start sprinkling model names around the codebase. Don't. Make a tiny router call instead:
const Route = z.object({
  task: z.enum(["classify", "extract", "draft", "code", "vision"]),
  difficulty: z.enum(["trivial", "normal", "hard"]),
});

const route = await classify(userInput, Route); // a gpt-5-nano call

const model =
  route.task === "code" || route.difficulty === "hard"
    ? "gpt-5"
    : route.task === "draft"
      ? "gpt-5-mini"
      : "gpt-5-nano";
A nano router call adds maybe 200ms and a fraction of a cent. It saves you from paying gpt-5 prices on a one-line classification, and it gives you a single place to retune model choices when prices change (and they do change — gpt-5 family pricing dropped twice in 2025).
Anti-patterns
A short list of the mistakes we see most often:
- Defaulting to the biggest model. "I'll start with gpt-5 and downgrade later" almost never gets downgraded. Start at nano and prove you need more.
- Free-form text outputs. If you're parsing the response with a regex, you've already lost. Use a schema.
- Sending PII to OpenAI without redaction. OpenAI's enterprise terms cover a lot, but even then, redact what you don't strictly need. Email, phone, address — strip them at the edge unless the prompt actually requires them.
- Reasoning effort 'high' on user-facing paths. The latency is brutal. Reserve high effort for background jobs.
- Re-doing the same call on every request. Cache by content hash. Even a 10-minute in-memory cache cuts spend on hot paths (see the sketch after this list).
- No max_output_tokens. A runaway response can cost $0.50 by itself. Set a ceiling.
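For the caching bullet, a minimal sketch of a content-hash cache. It assumes a single Node process and a key space you trust (there's no eviction, so don't ship it as-is for hot public endpoints):

import { createHash } from "node:crypto";

const cache = new Map<string, { value: unknown; at: number }>();
const TTL_MS = 10 * 60 * 1000; // the 10-minute window mentioned above

export async function cached<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const k = createHash("sha256").update(key).digest("hex");
  const hit = cache.get(k);
  if (hit && Date.now() - hit.at < TTL_MS) return hit.value as T;
  const value = await fn();
  cache.set(k, { value, at: Date.now() });
  return value;
}

Usage is one wrapper at the call site, something like await cached(msg, () => classifyMessage(msg)), where classifyMessage is whatever you already have.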
Our recommendation
Default everything to gpt-5-nano with reasoning_effort: 'minimal' and a Zod schema. Centralize calls in a single lib/ai.ts helper that takes a schema and a system/user pair, returns the parsed object, and logs cost. When you need more, escalate the model on a per-call basis from inside that helper — never from the call site.
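A minimal sketch of what that helper could look like. The function name and the cost line are illustrative; the body is the same responses.parse template from the top of the page:

import OpenAI from "openai";
import { z } from "zod";
import { zodTextFormat } from "openai/helpers/zod";

const openai = new OpenAI();

type Escalation = {
  model?: "gpt-5-nano" | "gpt-5-mini" | "gpt-5";
  effort?: "minimal" | "low" | "medium" | "high";
};

export async function ai<S extends z.AnyZodObject>(
  system: string,
  user: string,
  schema: S,
  opts: Escalation = {},
): Promise<z.infer<S>> {
  const r = await openai.responses.parse({
    model: opts.model ?? "gpt-5-nano", // escalate here, never at the call site
    reasoning: { effort: opts.effort ?? "minimal" },
    input: [
      { role: "system", content: system },
      { role: "user", content: user },
    ],
    text: { format: zodTextFormat(schema, "output") },
    max_output_tokens: 1024, // the ceiling from the anti-patterns list
  });
  console.log("ai call:", r.usage?.total_tokens, "total tokens"); // stand-in for real cost logging
  return r.output_parsed as z.infer<S>;
}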
If you're building something that does a lot of code generation inside the product (not for the developer, for the end user), Claude Sonnet 4.6 is the second model you should add to the helper.
If you're cost-sensitive enough that nano still feels expensive at scale, look at Groq Llama for the highest-volume path. Below that you're squeezing pennies.