Updated June 2026 · product fit, not benchmark scores

Which AI Model Should I Use?

The latest Anthropic and OpenAI models, compared within each company. What each tier is for, when to reach for it, when it's overkill — and where the quality gap actually shows up.

What this is. A guide to picking the right tier inside each lineup — Haiku vs Sonnet vs Opus vs Fable at Anthropic; nano vs mini vs GPT-5.4 vs GPT-5.5 at OpenAI. It is not Anthropic-vs-OpenAI, and it deliberately ignores this week's benchmark numbers (they change constantly and rarely decide the right tier). The question that matters is fit: how hard is your task, how much does a mistake cost, and how much are you willing to pay per call.
Freshness warning. Model names and prices move monthly. The lineups below are as of mid-2026: Anthropic Haiku 4.5 · Sonnet 5 · Opus 4.8 · Fable 5; OpenAI GPT-5.4 nano/mini · GPT-5.4 · GPT-5.5 · GPT-5.5 Pro, plus codex and deep-research variants. Both vendors iterate point releases (Sonnet 5.x, GPT-5.6…) fast. Confirm current names and prices on the vendor page before committing. The positioning below outlives the version numbers.

Quick reference — pick by what you're doing

Anthropic (Claude)

If you're doing… | Reach for
Classification, extraction, routing, high-volume cheap calls | Haiku 4.5
Most production work: chat, summaries, agentic coding, tool loops | Sonnet 5 (default)
Hard reasoning, judgment calls, prose with a voice, tricky debugging | Opus 4.8
The hardest long-horizon autonomous runs (overnight refactors) | Fable 5

OpenAI (GPT)

If you're doing… | Reach for
Classification, tagging, autocomplete, massive-scale cheap calls | GPT-5.4 nano
Fast, cheap-but-capable coding, computer use, subagents | GPT-5.4 mini
Most production work at a lower price than flagship | GPT-5.4
Hardest coding & professional work, deep reasoning | GPT-5.5 (flagship)
Research-grade, cost-no-object one-shots | GPT-5.5 Pro

Rule of thumb both sides share: start one tier higher than you think you need, get it working, then drop down until quality breaks. It's far cheaper to down-shift a working prompt than to debug why a too-small model keeps failing.

The mental model: it's a ladder, not a menu

Both companies now sell the same shape: one family, several rungs. As you climb a rung you buy more intelligence and pay more per token and (usually) more latency. You are not choosing between different kinds of model — you're choosing how much capability the task is worth.

Small / fast rung
Haiku · nano. Sub-second, pennies. It knows the standard answer and formats it. No deep reasoning.
Workhorse rung
Sonnet · mini/GPT-5.4. The 80% default: capable, quick, affordable. Where most production traffic should live.
Frontier rung
Opus/Fable · GPT-5.5/Pro. Reserved for hard reasoning, long autonomy, high cost-of-error. Slower, pricier, smarter.
The 2026 shift you should internalize. "Reasoning models" stopped being a separate product. There's no o-series to pick anymore for flagship work, and Anthropic never split them out. Instead, reasoning depth is a dial on the model you already chose (effort at Anthropic, a reasoning level such as xhigh at OpenAI). So model choice is now two decisions, not one: which rung (capability tier) and how hard it should think (effort). See "The second decision: how hard should it think?" below.
Why not just pick by benchmark? Leaderboard deltas between adjacent tiers are small, saturating, self-reported, and non-comparable across scaffolds — and they invert week to week. They almost never tell you the thing that decides the right rung: whether a wrong answer costs you a re-run or a customer. Trial two adjacent tiers on your actual task and pick the cheaper one that doesn't break.

Anthropic — Claude lineup

Four rungs. One tokenizer family and one API surface across the top three, so moving between them is mostly a model-ID swap. Prices are per 1M tokens, input / output (as of mid-2026).
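To make the per-1M-token prices concrete, here is a minimal sketch that turns them into a per-call cost. The model IDs and prices are the mid-2026 figures quoted in this guide; the function name and the example token counts are illustrative assumptions, not part of any vendor SDK.

```python
# USD per 1M tokens (input, output), as quoted in this guide (mid-2026).
# Verify against the vendor page before relying on them.
CLAUDE_PRICES = {
    "claude-haiku-4-5": (1.00, 5.00),
    "claude-sonnet-5": (3.00, 15.00),
    "claude-opus-4-8": (5.00, 25.00),
    "claude-fable-5": (10.00, 50.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call at list price (no caching, no batch)."""
    in_price, out_price = CLAUDE_PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 8,000 tokens in, 1,000 tokens out.
    for model in CLAUDE_PRICES:
        print(f"{model}: ${call_cost(model, 8_000, 1_000):.4f}")
```

For that hypothetical request the list-price cost works out to $0.0130 on Haiku, $0.0390 on Sonnet, $0.0650 on Opus, and $0.1300 on Fable. Every rung here prices output at 5× input, so the ratio between two rungs does not depend on the token mix: Opus is about 1.67× Sonnet and Fable is 2× Opus for any request.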

Claude Haiku 4.5 (fast · cheap)
claude-haiku-4-5
$1 / $5 · 200K ctx

What it's for. The volume tier. Fastest and cheapest Claude; built for simple, well-specified tasks you run a lot: classification, sentiment, routing, tagging, extraction, cheap tool-call steps, first-pass filtering, sub-agents doing narrow reads.

  • Use when the task is bounded and the "right" answer is unambiguous, latency matters, or you're doing it thousands of times an hour.
  • Don't use when the task needs multi-step reasoning, judgment, or holding a lot of context — it'll confidently give you a shallow answer.

Smaller context window (200K vs 1M on the others) and lower max output (64K). No effort/max reasoning dial. It's the one rung that's genuinely a different capability class, not just a cheaper Sonnet.

Claude Sonnet 5 (the default)
claude-sonnet-5
$3 / $15 · 1M ctx

What it's for. The workhorse, and the right default for almost everything. Near-Opus quality on coding and agentic work at about 60% of Opus's list price ($3/$15 vs $5/$25). If you don't have a specific reason to go up or down, start here.

  • Use when you're building anything in production: chat, RAG, summarization, agentic coding, tool loops, document work. High-volume-but-not-trivial lives here.
  • Don't use when a mistake is expensive and subtle (go Opus), or the task is dead-simple and you're paying 3× for nothing (go Haiku).

1M context, 128K output, full effort range including xhigh. Adaptive thinking is on by default. Introductory pricing (~$2/$10) runs through 2026-08-31; check current rates.

Claude Opus 4.8 (deep reasoning)
claude-opus-4-8
$5 / $25 · 1M ctx

What it's for. The judgment tier. Reach for Opus when the task is genuinely hard, the cost of a wrong answer is high, or you want an answer with a point of view rather than a survey. Best-in-class long-horizon agentic coding, tricky multi-file debugging, architecture decisions, dense knowledge work, and writing with actual voice.

  • Use when Sonnet's answer is almost right but keeps missing something; when the problem needs real reasoning; when a human would otherwise spend an hour on it.
  • Don't use when Sonnet already nails it — you're paying ~1.7× for no gain — or for latency-sensitive interactive typing.

Same request surface as Sonnet 5 (adaptive thinking only; sampling params removed). Supports xhigh and max effort, fast mode, and mid-session system prompts. 1M context at standard pricing — no long-context premium.

Claude Fable 5 (frontier · premium)
claude-fable-5
$10 / $50 · 1M ctx

What it's for. Anthropic's most capable widely-released model — the top of the ladder, priced accordingly. It is not the default "upgrade from Opus." It earns its premium only on the hardest, longest-horizon autonomous work: overnight coding runs that finish without human correction, first-shot builds of well-specified systems, end-to-end analysis/deliverables, deep multi-agent orchestration.

  • Use when you're handing it a hard problem to chew on for many minutes autonomously and the outcome justifies 2× Opus pricing. Give it the full spec up front and high effort.
  • Don't use when Opus 4.8 already does the job (usually it does), for interactive/latency-sensitive use, or for security/bio-adjacent work where its safety classifiers may refuse.

Quirks that matter operationally: thinking is always on (you can't disable it); the raw chain-of-thought is never returned; single requests can run many minutes (plan for streaming/timeouts); requires 30-day data retention (no zero-data-retention). Same tokenizer as Opus 4.8.

Where the Sonnet → Opus gap actually shows up

For everyday, well-trodden requests you mostly won't notice the difference — both are well-calibrated on common ground. The gap opens in three specific places, and knowing them tells you when the up-tier is worth it:

1. Holding tension without collapsing.

Ask "is Hamlet actually indecisive?" Sonnet gives the standard reading plus the counter-reading, tidy and balanced. Opus is likelier to commit to a position and push back on the framing itself. Sonnet surveys; Opus argues.

2. Catching the real question.

"Should I tell my dying mother I'm not religious anymore?" Sonnet gives a clean ethics breakdown. Opus is likelier to notice it isn't really an ethics question and answer the question behind the question.

3. Prose with a voice.

Sonnet writes smooth, competent, slightly generic prose. Opus takes sharper turns and is likelier to hand you a sentence you'd actually quote. For anything read by a human for pleasure, that matters.

Where the gap is small: factual recall, explaining established concepts, summarizing a known argument — anything whose answer already exists clearly in the world. Where it's near zero: when you just want the standard view, fast. Pay for Sonnet there.

The calibration test. Take a question you've already thought hard about and have a real opinion on. Run it through both. The model that disagrees with you more interestingly is Opus; the one that maps the landscape more cleanly is Sonnet. That single comparison teaches you where your own workload sits better than any leaderboard.

The same logic scales one rung further: Opus → Fable only pays off when the work is long-horizon and autonomous, not when it's merely "important." A hard one-shot answer is Opus territory; a hard multi-hour project is where Fable pulls ahead.

OpenAI — GPT lineup

OpenAI splits its ladder across two point releases: the cheaper, broader GPT-5.4 sub-family (nano / mini / standard) for volume and production, and the GPT-5.5 flagship (+ Pro) for the hardest work. Reasoning depth is set by a reasoning level on the model, not by a separate o-series. Prices per 1M tokens, input / output (as of mid-2026).

GPT-5.4 nano (cheapest)
gpt-5.4-nano
$0.20 / $1.25

What it's for. The absolute-volume floor. Cheapest and fastest GPT; for embedding-scale classification, tagging, autocomplete, routing, and simple structured extraction where you're paying by the millions of calls.

  • Use when throughput and cost per call dominate and the task is trivial and well-specified.
  • Don't use when the task needs any real reasoning or nuance — nano will happily return a plausible wrong answer. Step up to mini.

OpenAI's rough analog of Haiku, one notch cheaper still.

GPT-5.4 mini (fast workhorse)
gpt-5.4-mini
$0.75 / $4.50 · 400K ctx

What it's for. Cheap-but-genuinely-capable. OpenAI positions it as its strongest mini yet for coding, computer use, and subagents — so it's the natural pick for high-volume agentic steps and interactive features where mini quality is "good enough" and latency/cost matter.

  • Use when you want fast, affordable coding/agent loops, tool-calling, or interactive UX at scale; a strong sub-agent worker.
  • Don't use when the task is the reasoning bottleneck of your system — promote it to GPT-5.4 or 5.5.

400K context (vs 1M on the big models). The closest OpenAI equivalent to "Sonnet-but-cheaper."

GPT-5.4 (value default)
gpt-5.4
$2.50 / $15 · 1M ctx

What it's for. The affordable full-size model — flagship-class breadth at half the flagship price. The sensible default for most production work that's beyond mini's depth but doesn't need the very top.

  • Use when you want strong general capability and 1M context without paying GPT-5.5 rates; the workhorse for serious apps.
  • Don't use when you're on the hardest coding/reasoning tasks where the flagship's extra headroom pays for itself — or when mini already suffices.

Exactly half GPT-5.5's per-token cost. A gpt-5.4-pro ($30/$180) exists for maximum capability within the 5.4 line.

GPT-5.5 (flagship)
gpt-5.5
$5 / $30 · 1M ctx

What it's for. OpenAI's recommended starting point for complex work — "a new class of intelligence for coding and professional work." The tier for the hardest coding, deepest reasoning, and highest-stakes professional output.

  • Use when the task is genuinely hard, correctness is worth the premium, or you're prototyping and want the best answer before optimizing cost downward.
  • Don't use when GPT-5.4 or mini already clears the bar: at list price the flagship costs 2× GPT-5.4 and nearly 7× mini. Reserve it for where the extra capability shows.

1M context. Reasoning depth is controlled via the reasoning level (up to xhigh) — the equivalent of Anthropic's effort.

GPT-5.5 Pro (frontier · premium)
gpt-5.5-pro
$30 / $180 · 1M ctx

What it's for. Maximum capability for research-grade problems, at cost-no-object pricing (6× the flagship). Think one-shot answers to genuinely hard problems where being right once is worth more than being cheap.

  • Use when a single high-stakes answer justifies the spend — hard math/science, deep analysis, the last few points of quality that a human expert would otherwise supply.
  • Don't use for anything at volume; the price makes it a scalpel, not a workhorse.

The rough OpenAI counterpart to reaching for Fable 5 at Anthropic — a deliberate step past the flagship, not a default.

Specialized variants (niche): codex · deep-research · realtime · image

Beyond the general ladder, OpenAI ships task-tuned models. Pick these only when your task is that task:

  • gpt-5.3-codex ($1.75/$14) — coding-agent-tuned; for autonomous software workflows (the model behind Codex).
  • o3-deep-research ($5/$20) & o4-mini-deep-research ($1/$4) — the surviving o-series, now specialized for long multi-step web research with citations, not general chat.
  • GPT-Realtime-2 and the realtime family — low-latency voice / speech-to-speech / translation / transcription.
  • GPT Image 2 — image generation.

Where the GPT-5.4 → GPT-5.5 gap shows up

The two share DNA, so on routine work the difference is subtle. It widens in the same places Sonnet→Opus does — hard reasoning, depth, and not-losing-the-thread over long tasks — plus a couple that are specific to how OpenAI tiers:

Depth under sustained reasoning.

On multi-step problems — a gnarly bug spanning several files, a proof, a plan with dependencies — GPT-5.4 is more likely to take a plausible shortcut; GPT-5.5 holds the whole structure and follows it through. Turning up reasoning on 5.4 narrows this but doesn't fully close it.

The mini vs full-size cliff.

The bigger, sharper jump is mini → full-size, not 5.4 → 5.5. Mini is tuned for speed and cost; on anything that's actually the reasoning bottleneck, moving to GPT-5.4 (full) buys more than the 5.4→5.5 step does. Diagnose which jump you need before paying for the top.

The calibration test (OpenAI edition). Run your hardest real task on GPT-5.4 at high reasoning, then on GPT-5.5. If 5.5's answer isn't visibly better on your task, you've found your tier — stay on 5.4 and pocket the difference. If it is, you've justified the flagship. Repeat one rung down (mini vs 5.4) for cheaper workloads.
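The calibration test can be written down as a small decision rule. This is a sketch under assumptions: the 1-to-5 grading scale, the `min_gain` threshold, and the function name are all hypothetical, and the grades are ones you assign yourself after running the same tasks on both tiers.

```python
from statistics import mean


def pick_tier(cheaper_scores, pricier_scores, min_gain=0.5):
    """Compare two adjacent tiers graded on the same tasks.

    Both lists hold your own grades (assumed 1-5 here), one per task, in the
    same task order. The pricier tier wins only if its average beats the
    cheaper tier's average by at least `min_gain`.
    Returns (winner, gain) where winner is "cheaper" or "pricier".
    """
    if not cheaper_scores or len(cheaper_scores) != len(pricier_scores):
        raise ValueError("grade the same non-empty task set on both tiers")
    gain = mean(pricier_scores) - mean(cheaper_scores)
    winner = "pricier" if gain >= min_gain else "cheaper"
    return winner, gain


# Made-up grades for five real tasks run on both tiers.
winner, gain = pick_tier([4, 4, 3, 5, 4], [4, 5, 4, 5, 4])
print(winner, round(gain, 2))
```

With these made-up grades the pricier tier averages 0.4 points higher, which is under the 0.5 threshold, so the rule says stay on the cheaper tier. Where you set `min_gain` is a judgment call about what "visibly better" means for your workload.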

The second decision: how hard should it think?

Picking the rung is half the choice. The other half is the reasoning dial — the single biggest lever on quality, latency, and cost within a model. Turning it up on a cheaper rung often beats jumping to the next rung at low effort, for less money.

Setting | Anthropic | OpenAI
The dial | effort: low · medium · high · xhigh · max | reasoning level: up to xhigh
Default | high; adaptive thinking decides depth per request | Model-dependent; set it explicitly for hard work
Turn it up for | Coding & agentic work (xhigh), correctness-critical one-shots (max) | The hardest reasoning/coding tasks
Turn it down for | Chat, classification, latency-sensitive UX (low/medium) | Simple, fast, cheap tasks
The catch | Higher effort = more thinking tokens = higher cost + latency; give max_tokens headroom | Reasoning tokens bill at output rates; high effort can multiply cost several-fold
The move most people miss. Before up-tiering, try same rung, higher effort. A workhorse model thinking hard frequently matches a frontier model thinking lazily — at a fraction of the price. Only climb the ladder when max effort on the cheaper rung still isn't enough.
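Whether "same rung, higher effort" is actually cheaper depends on how many extra thinking tokens it burns. The sketch below works that out for GPT-5.4 vs GPT-5.5 using this guide's list prices and the statement above that reasoning tokens bill at output rates. The token counts and function names are hypothetical; measure your own.

```python
# USD per 1M tokens (input, output), from this guide's mid-2026 figures.
PRICES = {"gpt-5.4": (2.50, 15.00), "gpt-5.5": (5.00, 30.00)}


def call_cost(model, input_tokens, output_tokens, reasoning_tokens):
    """Cost of one call, billing hidden reasoning tokens at the output rate."""
    in_price, out_price = PRICES[model]
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * in_price + billed_output * out_price) / 1_000_000


def breakeven_reasoning_tokens(cheap, pricey, input_tokens, output_tokens,
                               pricey_reasoning_tokens):
    """Reasoning tokens at which the cheap model's call costs as much as the
    pricey model's call for the same prompt and visible answer length."""
    target = call_cost(pricey, input_tokens, output_tokens, pricey_reasoning_tokens)
    base = call_cost(cheap, input_tokens, output_tokens, 0)
    out_price = PRICES[cheap][1]
    return (target - base) * 1_000_000 / out_price


# Hypothetical call: 5,000 tokens in, 800 visible tokens out.
workhorse_high = call_cost("gpt-5.4", 5_000, 800, reasoning_tokens=3_000)
flagship_low = call_cost("gpt-5.5", 5_000, 800, reasoning_tokens=1_000)
limit = breakeven_reasoning_tokens("gpt-5.4", "gpt-5.5", 5_000, 800, 1_000)
print(f"{workhorse_high:.4f} {flagship_low:.4f} {limit:.0f}")
```

Under these assumed counts the workhorse thinking hard costs $0.0695 per call against $0.0790 for the flagship thinking lightly, and the workhorse stays cheaper until its reasoning runs past roughly 3,633 tokens. Past that point the up-tier is the cheaper option, so check the token counts before assuming either direction.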

Cost & context ladders (at a glance)

Per 1M tokens, input / output (as of mid-2026; verify). The ladders line up roughly tier-for-tier, which is the easiest way to translate a habit from one vendor to the other.

Anthropic

Haiku 4.5 | volume | $1 / $5 · 200K
Sonnet 5 | workhorse | $3 / $15 · 1M
Opus 4.8 | reasoning | $5 / $25 · 1M
Fable 5 | frontier | $10 / $50 · 1M

OpenAI

GPT-5.4 nano | volume | $0.20 / $1.25
GPT-5.4 mini | fast | $0.75 / $4.50 · 400K
GPT-5.4 | value | $2.50 / $15 · 1M
GPT-5.5 | flagship | $5 / $30 · 1M
GPT-5.5 Pro | frontier | $30 / $180 · 1M

Rough cross-reads (positioning, not identical capability): Haiku ≈ nano/mini · Sonnet ≈ GPT-5.4(-mini) · Opus ≈ GPT-5.5 · Fable ≈ GPT-5.5 Pro. Both sides also offer prompt caching (cache reads ~0.1× input) and batch (~50% off, non-urgent) — often a bigger lever on your bill than the tier choice itself.
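A rough arithmetic sketch of why caching and batch can outweigh the tier choice. It assumes the ~0.1× cache-read and ~50% batch figures above, ignores any cache-write surcharge, and treats batch as a flat discount on the whole call. Those simplifications, the function name, and the workload numbers are assumptions for illustration; check the vendor's billing rules, including whether the two discounts combine.

```python
def monthly_cost(calls, input_tokens, output_tokens, in_price, out_price,
                 cached_fraction=0.0, cache_read_multiplier=0.1,
                 batch_discount=0.0):
    """Estimate a monthly bill in USD from per-1M-token prices.

    cached_fraction: share of each call's input served from the prompt cache.
    cache_read_multiplier: price of a cached input token relative to a fresh one.
    batch_discount: flat fraction taken off the call price (0.5 = 50% off).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * cache_read_multiplier) * in_price
    output_cost = output_tokens * out_price
    per_call = (input_cost + output_cost) / 1_000_000
    return calls * per_call * (1.0 - batch_discount)


# Hypothetical input-heavy workload: 100,000 calls, 20,000 tokens in, 300 out.
sonnet_plain = monthly_cost(100_000, 20_000, 300, 3.0, 15.0)
sonnet_cached = monthly_cost(100_000, 20_000, 300, 3.0, 15.0, cached_fraction=0.9)
sonnet_batch = monthly_cost(100_000, 20_000, 300, 3.0, 15.0, batch_discount=0.5)
haiku_plain = monthly_cost(100_000, 20_000, 300, 1.0, 5.0)
print(round(sonnet_plain), round(sonnet_cached), round(sonnet_batch), round(haiku_plain))
```

For this made-up workload, Sonnet at list price comes to about $6,450 a month, $3,225 with batch, and $1,590 with 90% of the input cached, while dropping to Haiku without caching costs $2,150. The ordering flips for output-heavy or low-reuse workloads, so run it with your own token mix.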

How to actually choose (a 4-step recipe)

  1. Prototype on the flagship. Start with Opus / GPT-5.5 at high effort. Get the task working correctly first — don't optimize a broken prompt. This tells you the ceiling.
  2. Down-shift until it breaks. Move to the workhorse (Sonnet / GPT-5.4), then the fast tier (Haiku / mini/nano). Stop one rung above where quality fails. Most production traffic ends up on the workhorse or fast tier.
  3. Tune effort before re-tiering. If the cheaper rung is almost good enough, raise its effort/reasoning before jumping back up. Cheaper and often sufficient.
  4. Split the workload. Real systems mix tiers: fast model for routing/extraction, workhorse for the main loop, frontier only for the hard sub-steps or a final review pass. You don't have to pick one.
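Steps 1 to 3 amount to a short search loop, sketched here under assumptions. The `passes` callback is hypothetical and stands in for your own evaluation (run the task on that rung at that effort and decide whether quality holds); the rung and effort labels are placeholders, not model IDs.

```python
from typing import Callable, Optional, Sequence, Tuple


def cheapest_passing(
    ladder: Sequence[str],   # rungs ordered most capable first
    efforts: Sequence[str],  # effort levels ordered lowest first
    passes: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Walk down the ladder and return the lowest rung that still passes,
    with the lowest effort at which it passes.

    A rung counts as broken only if it fails at every effort level, which is
    the "tune effort before re-tiering" step. Returns None if even the top
    rung fails.
    """
    best = None
    for rung in ladder:
        effort = next((e for e in efforts if passes(rung, e)), None)
        if effort is None:
            break  # quality broke here: keep the rung above
        best = (rung, effort)
    return best


# Stub evaluator with made-up results, standing in for a real test suite.
PASSING = {("frontier", "low"), ("frontier", "medium"), ("frontier", "high"),
           ("workhorse", "high")}

choice = cheapest_passing(
    ladder=["frontier", "workhorse", "fast"],
    efforts=["low", "medium", "high"],
    passes=lambda rung, effort: (rung, effort) in PASSING,
)
print(choice)
```

With the stub results above, the loop settles on the workhorse rung at high effort: the frontier rung works, the workhorse works once effort is raised, and the fast rung fails at every level. Step 4 (splitting the workload) would mean running this search once per pipeline stage rather than once for the whole system.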
The one-line heuristic. Cost of a wrong answer × how often you run it decides the tier. Cheap-and-frequent-and-forgiving → fast tier. Rare-and-expensive-to-get-wrong → frontier. Everything else → the workhorse, which is why it's the default.
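One possible way to put numbers on that heuristic is an expected-cost comparison: token cost per call plus error rate times the cost of a wrong answer. This formalization is an assumption layered on the guide, not something either vendor publishes, and every figure below is made up; the error rates in particular have to come from trials on your own task.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    cost_per_call: float  # USD token cost of one call
    error_rate: float     # fraction of wrong answers measured on your task


def expected_cost_per_call(tier: Tier, cost_of_error: float) -> float:
    """Token cost plus the average cost of the mistakes this tier makes."""
    return tier.cost_per_call + tier.error_rate * cost_of_error


def best_tier(tiers, cost_of_error: float) -> Tier:
    """Pick the tier with the lowest expected cost per call."""
    return min(tiers, key=lambda t: expected_cost_per_call(t, cost_of_error))


# Hypothetical measurements for three rungs.
TIERS = [
    Tier("fast", cost_per_call=0.002, error_rate=0.08),
    Tier("workhorse", cost_per_call=0.010, error_rate=0.02),
    Tier("frontier", cost_per_call=0.050, error_rate=0.005),
]

for cost_of_error in (0.05, 1.00, 50.00):
    print(cost_of_error, best_tier(TIERS, cost_of_error).name)
```

With these invented numbers, a 5-cent mistake favors the fast tier, a 1-dollar mistake favors the workhorse, and a 50-dollar mistake favors the frontier. Call volume multiplies every option equally, so it does not change which tier wins here; it changes how many dollars the choice is worth.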

Common mistakes & anti-patterns

Defaulting to the flagship for everything. The most expensive habit there is. Most calls don't need it; you're paying 2–5× for capability you can't see in the output.
Using the nano/Haiku tier for reasoning. The opposite failure. A cheap model on a task that needs thinking returns confident, plausible, wrong answers — the costliest kind.
Ignoring the effort dial. People up-tier when they should just raise effort/reasoning — or burn money at max effort on a task medium would nail.
Picking by this week's benchmark. Adjacent-tier deltas are small, saturating, and non-comparable. Trial on your own task; that's the only benchmark that predicts your result.
Forgetting reasoning tokens are billed. Especially on OpenAI, hidden reasoning bills at output rates and can multiply cost 3–10×. Your invoice ≠ your visible output length.
Reaching for Fable / GPT-5.5 Pro as a default upgrade. They're specialists for hardest-case work, not "Opus/GPT-5.5 but better." Usually the tier below already wins on value.
One model for the whole pipeline. Routing, extraction, main loop, and final review have different needs. Mixing tiers is cheaper and better than one-size-fits-all.
Skipping caching & batch. Prompt caching (~0.1× on reads) and batch (~50% off) often cut the bill more than dropping a tier — and with no quality loss.

Straight answers

"I have no idea where to start — just tell me one."

Sonnet 5 (Anthropic) or GPT-5.4 (OpenAI). The workhorse tier is the right default for almost everything. Get it working there, then move down for cost or up for quality only if you have a concrete reason.

"When is Opus / GPT-5.5 actually worth it?"

When the workhorse's answer is almost right but keeps missing the point, when the problem needs genuine multi-step reasoning, or when a mistake is expensive and subtle. If Sonnet/GPT-5.4 already nails your task, the up-tier buys you nothing visible.

"Fable 5 or GPT-5.5 Pro — should I ever use these?"

Rarely, and deliberately. Fable earns its 2× Opus price on long-horizon autonomous runs; Pro earns its 6× flagship price on research-grade one-shots. Neither is a routine "upgrade." If you're not sure you need it, you don't.

"Should I use the cheap tier to save money?"

Yes — for bounded, well-specified, high-volume tasks (classification, extraction, routing). No — for anything needing reasoning or judgment, where a wrong answer costs more than the tokens you saved. And try caching + batch before down-tiering; they cut cost without cutting quality.

"Reasoning model or normal model?"

That distinction has mostly collapsed. Reasoning is now a dial (effort / reasoning) on the model you already picked, not a separate product. Pick the tier for capability; turn the dial up for hard reasoning, down for speed. OpenAI's remaining o-series models are specialized deep-research tools, not general chat.

"How do I tell which tier my task really needs?"

Run the calibration test: take a real task you have strong judgment about, try two adjacent tiers, and see whether the pricier one is visibly better on your work. That single comparison beats any leaderboard for predicting your result.