Resources · AI & Monetization

LLMs: which model, when

How to pick the right model for each call. Default to gpt-5-nano with Zod schemas, escalate only when the work demands it, and route everything through one helper.

Most MVPs over-spend on tokens because the builder reaches for the biggest model first and never revisits the choice. This page lays out a default that works for almost every call we make, the handful of cases where it doesn't, and the cheap routing trick that keeps the bill flat as you scale. Costs and model names below are current as of April 2026.

The default: gpt-5-nano with reasoning effort 'minimal'

This site uses gpt-5-nano with reasoning_effort: 'minimal' and Zod-typed structured output via the Responses API for almost every model call. Three reasons:

  • A typical call costs around $0.0001 (input plus output, single-digit thousand tokens).
  • Latency lands sub-500ms in our logs, often closer to 250ms when the prompt is short.
  • The Zod schema becomes the contract. Either the model returns valid data or you raise. There's no string parsing layer to break.

The Responses API plus zodTextFormat is the part that makes nano usable. Without a schema, nano hallucinates JSON shape. With one, it doesn't get to. We've never had a "the model returned text instead of JSON" incident since switching everything to Zod.

Long version of the why-nano case: /blog/gpt-5-nano-what-its-good-at/. The integration pattern: /blog/typed-openai-zod/.

import OpenAI from "openai";
import { z } from "zod";
import { zodTextFormat } from "openai/helpers/zod";

const Classification = z.object({
  category: z.enum(["bug", "feature", "question", "spam"]),
  confidence: z.number().min(0).max(1),
  rationale: z.string().max(140),
});

const openai = new OpenAI();

const msg = "The export button 500s every time I click it."; // the user's message

const r = await openai.responses.parse({
  model: "gpt-5-nano",
  reasoning: { effort: "minimal" },
  input: [
    { role: "system", content: "Classify the user's support message." },
    { role: "user", content: msg },
  ],
  text: { format: zodTextFormat(Classification, "classification") },
});

const result = r.output_parsed; // typed; invalid output throws at parse

That's the whole template. Most model calls in this codebase look like that with a different schema.

When to escalate inside OpenAI

Nano is great until it isn't. Here's the ladder:

Need                                       | Model      | Reasoning effort
Classification, extraction, short rewrites | gpt-5-nano | minimal
Multi-field reasoning, light planning      | gpt-5-mini | minimal or low
Hard reasoning, code generation, math      | gpt-5      | medium
Multi-step plans, agentic flows            | gpt-5      | high
Image understanding                        | gpt-5      | low

Concrete tells you've outgrown nano: the model starts mixing up which input field maps to which output field, or it picks a plausible-but-wrong enum value when the right answer needs a chain of reasoning. Move up one step. If gpt-5-mini solves it, stay there — the price gap to full gpt-5 is real.
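One way to make "move up one step" mechanical is to encode the ladder as a lookup. A minimal sketch, assuming a task taxonomy of our own invention (the `Task` names are illustrative, only the model names come from this page):

```typescript
type Effort = "minimal" | "low" | "medium" | "high";
type Task =
  | "classify" | "extract" | "rewrite"   // nano territory
  | "plan"                               // mini territory
  | "reason" | "code" | "agent" | "vision"; // full gpt-5

// Mirrors the escalation ladder above: start cheap, escalate per task.
function pickModel(task: Task): { model: string; effort: Effort } {
  switch (task) {
    case "classify":
    case "extract":
    case "rewrite":
      return { model: "gpt-5-nano", effort: "minimal" };
    case "plan":
      return { model: "gpt-5-mini", effort: "low" };
    case "reason":
    case "code":
      return { model: "gpt-5", effort: "medium" };
    case "agent":
      return { model: "gpt-5", effort: "high" };
    case "vision":
      return { model: "gpt-5", effort: "low" };
  }
}
```

Keeping the mapping in one function means retuning it later is a one-file change.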

Don't crank reasoning effort past what the task needs. effort: 'high' on gpt-5 can take 10–30 seconds and cost cents per call. Use it for one-shot deep work (drafting a migration plan, debugging a thorny prompt), not user-facing latency paths.

Other providers worth knowing

Single-vendor lock-in is fine for an MVP, but knowing the alternatives matters when you hit a wall.

Provider  | Model                  | What it's good at                          | Rough cost (input / output per 1M tokens)
OpenAI    | gpt-5-nano             | Cheap structured output, classification    | $0.05 / $0.40
OpenAI    | gpt-5-mini             | General purpose, good price/quality        | $0.25 / $2.00
OpenAI    | gpt-5                  | Hard reasoning, vision, code               | $1.25 / $10.00
Anthropic | Claude Haiku 4.5       | Fast, cheap, strong on tool use            | $1.00 / $5.00
Anthropic | Claude Sonnet 4.6      | Best-in-class code generation, long context | $3.00 / $15.00
Google    | Gemini 2.5 Flash       | Cheap, native multimodal, 1M context       | $0.30 / $2.50
Mistral   | Mistral Large 2        | EU-hosted option, decent general model     | $2.00 / $6.00
Groq      | Llama 3.3 70B (hosted) | Sub-100ms latency, free tier for low volume | $0.59 / $0.79

A few notes from running these in production:

  • Claude Sonnet 4.6 is what we'd reach for if we were generating long-form code or multi-file edits inside an app. It's also what most coding agents use under the hood. For one-off generation in a user-facing app, the price is hard to justify. For agent loops where quality dominates cost, it's the right call.
  • Claude Haiku 4.5 is the closest direct competitor to gpt-5-mini. It's a better fit if your prompts are heavy on tool-calling.
  • Gemini 2.5 Flash wins when you need to feed in a 200-page PDF or hours of audio. The 1M context isn't a gimmick for those use cases.
  • Groq is the right answer if you need sub-second turnaround on a Llama-class model and can tolerate occasional capacity issues. The free tier is generous enough to prototype with.
  • Mistral is mostly relevant if you have an EU data residency requirement.

Open-source models worth knowing

You probably don't want to self-host on day one. But if you're building something that runs locally, processes regulated data, or needs to ship as a desktop app, the OSS world is genuinely useful in 2026.

  • Llama 3.3 70B — Meta's general-purpose workhorse. Runs on a single H100 or two RTX 4090s with quantization. Quality is in the gpt-5-mini ballpark for most tasks.
  • Qwen 2.5 72B — Alibaba's release. Strong on code and math, especially the Coder variants (Qwen 2.5 Coder 32B is small enough to run on a Mac Studio).
  • DeepSeek V3 — MoE architecture, very cheap to run inference on. Good for batch jobs.

For self-hosting, use Ollama for prototyping and vLLM for production. Don't try to write your own inference server.
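For the Ollama path, the OpenAI-compatible endpoint means your existing call sites barely change. A sketch of building a request against it with plain fetch, assuming Ollama's default port and path (the model tag is an example; the helper name is ours):

```typescript
// Ollama exposes an OpenAI-compatible API at /v1 on its default port 11434.
// This builds the request; POST it with fetch against a running instance.
function ollamaChatRequest(model: string, prompt: string) {
  return {
    url: "http://localhost:11434/v1/chat/completions",
    body: {
      model, // e.g. "llama3.3:70b", whatever tag you pulled
      messages: [{ role: "user", content: prompt }],
      stream: false,
    },
  };
}

// Usage (against a running Ollama instance):
// const { url, body } = ollamaChatRequest("llama3.3:70b", "Summarize this…");
// const res = await fetch(url, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
```

Because the wire format matches OpenAI's, swapping the base URL in your existing client is usually all the migration there is.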

Decision matrix by task

This is the lookup table we actually use:

Task                                      | Pick                                     | Notes
Classification (under ~20 classes)        | gpt-5-nano                               | Always with a Zod enum.
Field extraction from text                | gpt-5-nano                               | Schema does the heavy lifting.
Summarization (under 2k tokens in)        | gpt-5-nano                               | Cap output length in the schema.
Long-form generation (blog posts, emails) | gpt-5-mini                               | Nano's prose gets repetitive past a few paragraphs.
Code generation inside an app             | claude-sonnet-4.6 or gpt-5               | Cost is real. Cache the output if you can.
Code generation as a developer            | Claude Code, Cursor                      | Use the agent, not your own API call.
Image understanding                       | gpt-5 (low effort) or claude-sonnet-4.6  | Gemini Flash if cost matters more than nuance.
Real-time chat UX                         | Groq Llama or gpt-5-nano with streaming  | Streaming is mandatory if you want sub-1s perceived latency.
Embedding for semantic search             | text-embedding-3-small                   | Cheap, fine for most workloads.
Re-ranking search results                 | gpt-5-nano                               | One call with a list-of-IDs schema.
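For the semantic-search row, the embedding call is a single API request; the part you write yourself is the ranking. A minimal sketch of cosine-similarity ranking over embedding vectors (the function names are ours; any fixed-length `number[]` vectors work, including those from text-embedding-3-small):

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored document vectors against a query vector, best match first.
function rank(query: number[], docs: { id: string; vec: number[] }[]) {
  return [...docs]
    .map((d) => ({ id: d.id, score: cosine(query, d.vec) }))
    .sort((x, y) => y.score - x.score);
}
```

A brute-force loop like this is fine into the tens of thousands of documents; reach for a vector index only after that.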

The router pattern

Once you have more than a couple of distinct AI tasks, you'll be tempted to start sprinkling model names around the codebase. Don't. Make a tiny router call instead:

const Route = z.object({
  task: z.enum(["classify", "extract", "draft", "code", "vision"]),
  difficulty: z.enum(["trivial", "normal", "hard"]),
});

const route = await classify(userInput, Route); // a gpt-5-nano call

const model =
  route.task === "code" || route.difficulty === "hard"
    ? "gpt-5"
    : route.task === "draft"
    ? "gpt-5-mini"
    : "gpt-5-nano";

A nano router call adds maybe 200ms and a fraction of a cent. It saves you from paying gpt-5 prices on a one-line classification, and it gives you a single place to retune model choices when prices change (and they do change — the gpt-5 family dropped twice in 2025).

Anti-patterns

A short list of the mistakes we see most often:

  • Defaulting to the biggest model. "I'll start with gpt-5 and downgrade later" almost never gets downgraded. Start at nano and prove you need more.
  • Free-form text outputs. If you're parsing the response with a regex, you've already lost. Use a schema.
  • Sending PII to OpenAI without redaction. OpenAI's enterprise terms cover a lot, but even then, redact what you don't strictly need. Email, phone, address — strip them at the edge unless the prompt actually requires them.
  • Reasoning effort 'high' on user-facing paths. The latency is brutal. Reserve high effort for background jobs.
  • Re-doing the same call on every request. Cache by content hash. Even a 10-minute in-memory cache cuts spend on hot paths.
  • No max_output_tokens. A runaway response can cost $0.50 by itself. Set a ceiling.
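The cache-by-content-hash point can be sketched in a few lines. An in-memory version with a TTL, assuming Node (the helper names are ours, not a library):

```typescript
import { createHash } from "node:crypto";

// In-memory cache keyed by a hash of everything that affects the answer.
const cache = new Map<string, { value: unknown; expires: number }>();

// Hash the inputs that determine the response: model, system prompt, user text.
function contentKey(parts: string[]): string {
  return createHash("sha256").update(parts.join("\u0000")).digest("hex");
}

async function cached<T>(
  parts: string[],
  ttlMs: number,
  compute: () => Promise<T>,
): Promise<T> {
  const key = contentKey(parts);
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) return hit.value as T;
  const value = await compute();
  cache.set(key, { value, expires: Date.now() + ttlMs });
  return value;
}

// Usage: cached(["gpt-5-nano", systemPrompt, userText], 10 * 60_000, () => callModel(...))
```

The joined-with-NUL key avoids collisions between ["ab", "c"] and ["a", "bc"]; swap the Map for Redis when you outgrow a single process.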

Our recommendation

Default everything to gpt-5-nano with reasoning_effort: 'minimal' and a Zod schema. Centralize calls in a single lib/ai.ts helper that takes a schema and a system/user pair, returns the parsed object, and logs cost. When you need more, escalate the model on a per-call basis from inside that helper — never from the call site.
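A sketch of what that lib/ai.ts helper can look like. The price table values come from the cost table on this page; the `ask` signature, the loosely typed client parameter, and the logging format are assumptions, not a prescribed API:

```typescript
type Usage = { input_tokens: number; output_tokens: number };

// Per-1M-token prices from the table above; retune here when prices change.
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-5-nano": { input: 0.05, output: 0.4 },
  "gpt-5-mini": { input: 0.25, output: 2.0 },
  "gpt-5": { input: 1.25, output: 10.0 },
};

function costUSD(model: string, usage: Usage): number {
  const p = PRICES[model];
  if (!p) throw new Error(`no price entry for ${model}`);
  return (usage.input_tokens * p.input + usage.output_tokens * p.output) / 1e6;
}

// The single entry point: owns model choice, parsing, and cost logging.
// `client` is the Responses API surface we use, typed loosely so the
// sketch stays dependency-free; pass openai.responses in practice.
async function ask<T>(
  client: { parse: (req: object) => Promise<{ output_parsed: T; usage: Usage }> },
  model: string,
  system: string,
  user: string,
  format: object, // e.g. a zodTextFormat(...) result
): Promise<T> {
  const r = await client.parse({
    model,
    reasoning: { effort: "minimal" },
    input: [
      { role: "system", content: system },
      { role: "user", content: user },
    ],
    text: { format },
  });
  console.log(`[ai] ${model} $${costUSD(model, r.usage).toFixed(6)}`);
  return r.output_parsed;
}
```

Every call site passes a schema and two strings; escalation decisions and price updates live in this one file.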

If you're building something that does a lot of code generation inside the product (not for the developer, for the end user), Claude Sonnet 4.6 is the second model you should add to the helper.

If you're cost-sensitive enough that nano still feels expensive at scale, look at Groq Llama for the highest-volume path. Below that you're squeezing pennies.