The model zoo, explained.

Martin Lulham

We get asked the same question most weeks. Some version of: which model should I be using? And the honest answer is that the market is confusing on purpose, nobody starts you at the start, and the press releases assume you already know a dozen things you probably don't.

So this is the walk-up-from-the-bottom version. Not exhaustive. Enough to hold your own in a conversation, pick something sensible on Monday, and stop nodding along to acronyms.

Start with tokens: the chunks of text a model reads and writes, and the unit you're billed in. Once you've got tokens, the first thing that clears up is that the vendors aren't shipping "one model." They're shipping families, and each family has tiers (small, medium, large) dressed up with brand names. OpenAI calls them Nano, Mini and the full model. Anthropic calls them Haiku, Sonnet and Opus. Google calls them Flash, Pro and Ultra. The names are there for marketing. The job they do is the same: trade cost against capability. Small tiers are fast and cheap and fine for most of what businesses actually do. Large tiers are slow, expensive, and worth it when the task genuinely needs the horsepower.

The thing that trips everyone up, us included in our first year, is thinking version numbers mean something across families. They don't. Claude 4.7 and GPT-5 are not on the same scale. They're internal version numbers — Anthropic has got to 4.7, OpenAI has got to 5, and the numbers tell you nothing about which is stronger for your task. Every comparison you read between them is benchmark cherry-picking, one way or the other. Stop trying to rank them on a number line. It doesn't exist.

The second thing that's landed in the last year — and the single biggest source of confusion when you open a dropdown — is thinking. You'll see it under different names. Anthropic calls it extended thinking. OpenAI has its o-series — "reasoning models." Google surfaces it as thinking modes on Gemini Pro and as deep research on the consumer side. Underneath, they're the same idea: the model is allowed to spend more tokens chewing privately before it answers. You don't see most of those tokens, but you pay for them, and the answer at the end is usually better.

Beyond the big three families, there's a growing set of open-weights and non-US alternatives worth knowing the shape of. Kimi (from Moonshot, out of China) has been shipping strong cheap models and long context windows. DeepSeek has been the story of the last year on the cost floor — the releases that forced the frontier labs to rethink pricing. Mistral is the French open-weights counterweight. Llama is Meta's open-weights line. You can use them through hosted APIs, and several of them you can also run yourself — more on that in a moment. They matter for three reasons: price pressure on the big three, sovereignty and privacy positioning, and — for the right workload — they can be the right answer outright.

Which brings us to the box in the cupboard. Quietly, without the headlines, running models locally has become genuinely practical. Tools like Ollama and LM Studio have made it a one-evening project. A modern desktop GPU (a 4090 with 24GB of VRAM, or a 5090 with 32GB) runs quantised 30-billion-parameter open-weights models at sensible speed. A small server with a couple of cards runs bigger ones. The quality gap to the frontier is real, but it's closing, and for certain jobs it has already closed. There's no per-token cost once you've paid for the hardware and the electricity, no vendor relationship, and the data never leaves the building. For privacy-critical workloads, high-volume classification or extraction, or anything that runs offline, it's a serious option and most businesses haven't noticed. Why isn't it the default? Because the frontier models are still better at the hard generalist tasks, and the operational overhead of keeping something running, patched and available is real. But "AI means OpenAI in a browser tab" is no longer the whole map.
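A rough way to sanity-check the "30-billion parameters on a 24GB card" claim is back-of-envelope arithmetic on memory. The sketch below is ours, not a vendor formula. It assumes the weights dominate, that a quantised model stores roughly `bits_per_weight` bits per parameter, and that a flat `overhead_gb` allowance (an arbitrary 4 GB here) covers the context cache and the runtime. Real usage varies with the tool, the context length and the quantisation format.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_on_card(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed flat overhead fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weights_gb(30, bits)
        print(f"30B at {bits}-bit: ~{size:.0f} GB of weights, "
              f"fits on a 24 GB card: {fits_on_card(30, bits, 24)}")
```

On these assumptions a 30-billion-parameter model needs about 60 GB of weights at 16-bit, 30 GB at 8-bit and 15 GB at 4-bit, which is why quantisation is the thing that makes a single desktop card workable.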

Most of the money people spend on AI goes wrong at one of two places.

The first is the sticker-price trap. You look at $/million tokens, pick the cheap one, and feel clever. But a "cheap" model asked a vague question wastes more tokens stumbling than a good model uses to answer cleanly. And a cheap thinking model on a task that doesn't need thinking can cost you ten times the same task on the right non-thinking model. The price per token is rarely the price of the answer.
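To make the trap concrete, here is a minimal sketch of the arithmetic. Every number in it is hypothetical and chosen only to show the shape of the problem; the `Run` fields are our own labels, and we assume hidden thinking tokens are billed at the output rate, which you should check against your vendor's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One request. Prices are in dollars per million tokens."""
    input_tokens: int
    answer_tokens: int      # the visible output
    thinking_tokens: int    # hidden reasoning, assumed billed as output
    price_in: float
    price_out: float


def cost_per_answer(run: Run) -> float:
    """Dollar cost of a single answer, counting hidden thinking tokens."""
    billed_output = run.answer_tokens + run.thinking_tokens
    return (run.input_tokens * run.price_in
            + billed_output * run.price_out) / 1_000_000


# Hypothetical figures, not real price-list entries.
mid_tier = Run(input_tokens=1_000, answer_tokens=300, thinking_tokens=0,
               price_in=3.0, price_out=15.0)
cheap_thinker = Run(input_tokens=1_000, answer_tokens=300, thinking_tokens=8_000,
                    price_in=1.0, price_out=4.0)

if __name__ == "__main__":
    for name, run in (("mid tier, no thinking", mid_tier),
                      ("cheap tier, thinking on", cheap_thinker)):
        print(f"{name}: ${cost_per_answer(run):.4f} per answer")
```

With these made-up figures the model with the lower sticker price comes out at roughly $0.034 per answer against $0.0075, about four and a half times more, because the hidden thinking tokens dominate the bill.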

The second is reaching for the top tier by default. Opus and the o-series and the big Gemini are extraordinary, and the demos look extraordinary, and they are not the right tool for nine-tenths of the work most businesses do. Classification, extraction, summarising, drafting, pulling structured data out of a PDF — the middle tier does it almost as well, five to twenty times cheaper, at a fraction of the latency.

So the routing heuristic we use on ourselves and recommend to clients looks roughly like this. Pick a family — any of the big three, honestly, you won't go wrong, consistency matters more than cross-shopping. Default to the middle tier. Drop to the small tier for anything repetitive, extractive, or high-volume. Step up to the big tier, or turn on thinking, only when you've tried the middle tier and it genuinely wasn't enough. And for sensitive or high-volume workloads, look seriously at whether an open-weights model — hosted or in your own cupboard — does the job. Most of the time, when a client says "we need the top one," we find the middle one is fine and the real problem is somewhere else in the task.
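Written down as code, the heuristic is short. This is a sketch of the rule of thumb above, not a production router: the `Task` flags and tier names are our own illustrative labels, and letting "the middle tier fell short" take precedence over "it's repetitive" is an assumption we made to break ties.

```python
from dataclasses import dataclass
from enum import Enum
from typing import NamedTuple


class Tier(Enum):
    SMALL = "small"
    MIDDLE = "middle"
    LARGE_OR_THINKING = "large or thinking"


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a workload."""
    repetitive: bool = False
    extractive: bool = False
    high_volume: bool = False
    sensitive: bool = False
    middle_tier_fell_short: bool = False  # set only after actually trying it


class Route(NamedTuple):
    tier: Tier
    consider_open_weights: bool


def route(task: Task) -> Route:
    """Middle tier by default, small for bulk work, big only on evidence."""
    if task.middle_tier_fell_short:
        tier = Tier.LARGE_OR_THINKING
    elif task.repetitive or task.extractive or task.high_volume:
        tier = Tier.SMALL
    else:
        tier = Tier.MIDDLE
    # Sensitive or high-volume work also warrants a look at open weights.
    return Route(tier, task.sensitive or task.high_volume)


if __name__ == "__main__":
    print(route(Task()))                                    # everyday drafting
    print(route(Task(extractive=True, high_volume=True)))   # invoice extraction
    print(route(Task(middle_tier_fell_short=True)))         # the hard one
```

The point of the `middle_tier_fell_short` flag is that stepping up is a response to evidence, not a starting position.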

A last honest point. The confusion isn't going anywhere. New families will ship, tiers will be renamed, thinking will get weirder, and — as the Mythos preview just made visible — some of what ships will be deliberately held back. Waiting until you understand all of it before picking one is itself a decision, and not a good one. Pick a family. Use the middle tier as your default. Keep the cheat sheet below for when you need to reach for something specific. Revisit in six months. That's the whole exercise.

The economics of all this are the story under the story — things that were far too expensive to commission eighteen months ago are now the middle tier of the cheapest family, on a Tuesday afternoon, for a handful of tokens. That curve hasn't finished.

/ Battle cards

Pick the card that fits the task.

Four families, four different shapes of trade-off. Pick whichever you're already standing next to, default to the middle tier, and reach for the rest only when the middle tier wasn't enough.

  1. OpenAI · GPT + o-series

    The default, and the broadest tier ladder.

    It has the widest tier ladder, its reasoning models are a separate line (the o-series), and the consumer ChatGPT surface draws in most non-technical staff whether you sanctioned it or not.

    • Tiers: GPT-5 Nano (cheapest), Mini (most day-to-day work), the full GPT-5 for heavier lifts.
    • Thinking: the o-series models — o3, o4-mini — are separate. Reach for them when the task is multi-step reasoning, hard code, or analysis.
    • Deep research mode on ChatGPT Pro spends tokens aggressively. Useful. Expensive.

    Good looks like:

    Mini is our default; we step up to GPT-5 or an o-series model only when Mini wasn't enough.

  2. Anthropic · Claude

    The strongest bench for code and long-form work.

    Three named tiers plus extended thinking as a toggle on Sonnet and Opus. Claude Code has pulled a lot of the coding conversation to Anthropic over the last year.

    • Tiers: Haiku (fast and cheap), Sonnet (the workhorse), Opus (the heavyweight).
    • Extended thinking: a toggle, not a separate model. Turn it on for hard problems; leave it off for everything else — billed thinking tokens add up fast.
    • Claude Code is the agentic coding surface. If your work is engineering-heavy, this is where the conversation is.

    Good looks like:

    Sonnet is our default; we enable extended thinking when the question needs it, not by habit.

  3. Google · Gemini

    The value tier, and the long-context option.

    Three tiers plus thinking modes on 2.5 Pro and up, very large context windows on the bigger tiers, and deep native integration into Google Workspace.

    • Tiers: Flash (cheap, fast), Pro (default), Ultra (heavy lifts).
    • Thinking: surfaced as 'thinking' on Pro / Ultra and as Deep Research on the consumer side. Same idea as the others — more tokens chewed, better answer, higher cost.
    • Long-context strength makes it a strong pick for 'here is a whole codebase / contract / report, now…' workflows.

    Good looks like:

    Pro is our default; Flash for volume; we lean on the long context window when the work genuinely needs it.

  4. Open + alt · Kimi, DeepSeek, Mistral, Llama

    The cost floor, the sovereignty play, the box in the cupboard.

    Hosted open-weights and non-US options. Can be the right answer outright — not just a fallback — for cost-sensitive, privacy-sensitive or high-volume workloads. Some you can run yourself.

    • Cheap per token through hosted APIs — useful as a cost floor for high-volume, extractive or classification work.
    • Sovereignty matters: non-US hosting, non-US data paths and non-US vendors land differently in UK GDPR conversations.
    • Open weights mean you can run them yourself on a GPU under your desk or a small server. Ollama and LM Studio make this a one-evening project. Real option for privacy-critical or offline work.

    Good looks like:

    We know when an open model beats the big three on our task — and when running it ourselves beats paying for it.
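For a sense of how little code "run it yourself" involves, here is a minimal sketch of calling a model served by Ollama on the same machine, using only the Python standard library. It assumes Ollama is running on its default local port and that you have already pulled a model; the model name `llama3.1` and the helper names `build_request` and `ask_local` are placeholders of ours.

```python
import json
import urllib.request

# Ollama's local generate endpoint on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, model: str = "llama3.1") -> str:
    """Send the prompt and return the generated text. Needs Ollama running."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask_local("Classify this email as invoice, complaint or other: ...")` sends nothing outside the machine, which is the whole appeal for privacy-critical or offline work.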
