Luka Mrkić
Head of BD
Insights, strategies, and real-world playbooks on AI-powered marketing.
JUN 11, 2026
If you are evaluating who should build a multi-model frontier stack for your team, this guide gives you both the technical blueprint and the standards to evaluate the work.
Three labs ship the publicly available frontier today. Anthropic released Claude Fable 5 on June 9, 2026 as the first generally available Mythos-class model. Google released Gemini 3 Pro on the same week in preview across the Gemini app, AI Studio, and Vertex AI. OpenAI’s GPT-5 launched in 2025 and now sits behind GPT-5.5 in the OpenAI docs, with GPT-5 still generally available for coding, reasoning, and agentic tasks across domains.
Each lab anchors its launch on a different kind of evidence. Google publishes a clean numeric leaderboard. Anthropic publishes a small set of capability claims and a long bench of customer testimonials. OpenAI publishes detailed API documentation and a tier-by-tier pricing page. None of the three publishes a head-to-head score against the other two on the same set of evals, so a fair comparison requires reading what each lab actually measured.
The right framing for a buyer is this. Frontier work splits into four buckets that show up in production: reasoning under uncertainty, code generation, multimodal understanding, and long-horizon agentic execution. The three models have meaningfully different strengths on each bucket, and the published evidence reflects that.
Most teams running a frontier model in production today have a single-vendor stack. The temptation when a new tier ships is to swap names and ship. That works for a quick prototype. It breaks the moment your eval set covers workloads outside the single-vendor strength. Fable 5 is exceptional on hours-long agentic work. Gemini 3 Pro is exceptional on multimodal reasoning and vibe-coded UI. GPT-5 ships at a lower per-token price that wins on volume.
A real multi-model stack needs a router, an eval harness, a refusal handler for the Anthropic side, and a cost dashboard split by model. The point of comparing the three is to decide which model wins which workload. The leaderboard score is the start of that decision. Your own eval on your own data is the end of it.
If you are interested in building AI agents and automation like this for your team, book a call here.

Gemini 3 Pro publishes the cleanest set of numbers in this bucket. On Humanity’s Last Exam without tools, Gemini 3 Pro scores 37.5 percent and Gemini 3 Deep Think scores 41.0 percent. On GPQA Diamond, Gemini 3 Pro hits 91.9 percent and Deep Think reaches 93.8 percent. On MathArena Apex, Gemini 3 Pro publishes a new state of the art of 23.4 percent. On SimpleQA Verified, Gemini 3 Pro reaches 72.1 percent for factual accuracy.
Claude Fable 5 publishes capability claims rather than HLE or GPQA numbers. Anthropic frames Fable 5 as state of the art on nearly all tested benchmarks of AI capability. Customer evidence stands in for numeric leaderboards. IMC’s Damian Miraglia reported Fable 5 is the strongest finance-first model they have tested. BPGC’s Matthew Pines reported that on frontier physics research, Fable 5 reached in 36 hours where GPT-5.5 landed after four days, while using a third of the reasoning tokens. That last claim is the most useful operational signal in the announcement: time to insight on hard scientific work.
GPT-5 sits between the two in evidence shape. OpenAI’s docs describe GPT-5 as a higher reasoning model with configurable effort across minimal, low, medium, and high. Independent third-party leaderboards measured GPT-5 against earlier Claude and Gemini generations through 2025. The fair read in June 2026 is that GPT-5 is no longer the latest from OpenAI, which means head-to-head reasoning comparisons against Fable 5 or Gemini 3 should usually run against GPT-5.5 instead.
Coding is the bucket where the three labs publish the most different shapes of evidence. Google publishes Gemini 3 Pro at 76.2 percent on SWE-Bench Verified, 54.2 percent on Terminal-Bench 2.0, and a WebDev Arena Elo of 1487. The narrative around Gemini 3 Pro is that it is exceptional at zero-shot generation and at building richer interactive web UI.
Anthropic publishes Fable 5 as the highest scoring model on Cognition’s FrontierCode evaluation, which tests whether a model can pass difficult coding tasks while meeting the standards of high-quality production codebases. Stripe reported the headline operational result of the launch: a 50-million-line Ruby codebase migration completed in a day where the manual path would have taken a whole team over two months. Cognition’s Scott Wu reported that Fable 5 is also the highest scoring model on FrontierBench, Cognition’s frontier coding eval, and that it generalizes to unfamiliar tools out of the box.
GPT-5’s coding strength was the headline of the 2025 launch and remains real. OpenAI now positions GPT-5.5 as the recommended coding model in the docs. For teams already on GPT-5, the practical question is whether the upgrade to 5.5 or the move to Fable 5 or Gemini 3 Pro is worth the eval and migration cost. The signal that matters is your own pass rate on your own coding eval.
Gemini 3 Pro publishes the strongest numbers on multimodal evals: 81 percent on MMMU-Pro and 87.6 percent on Video-MMMU. The launch positions Gemini 3 as the best model in the world for multimodal understanding, and the underlying capability is native multimodality across text, images, video, audio, and code with a 1 million-token context window. For workloads that combine long video, dense PDFs, and code in a single prompt, Gemini 3 Pro is the cleanest fit.
Fable 5 publishes a different multimodal story. Anthropic positions Fable 5 as the new state of the art for tasks involving vision and shows it extracting precise numbers from detailed scientific figures, rebuilding a web app’s source code from screenshots alone, and beating Pokemon FireRed with a vision-only harness from raw game screenshots. The harness story is the operational signal: Fable 5 needs less scaffolding than previous Claude models did to operate from pixels.
GPT-5 supports text and image input with text output. Audio and video are handled by separate OpenAI models. For teams that have standardized on a single OpenAI model, the multimodal ceiling on GPT-5 is lower than on Gemini 3 Pro and Fable 5. For teams that route by modality, GPT-5 still ships at a lower price for the text-and-image workloads it handles.
This is the bucket where Fable 5 and Gemini 3 Pro publish the most directly comparable evidence. Anthropic showed that when Fable 5 played Slay the Spire with persistent file-based memory, performance improved three times more than for Opus 4.8, and the model reached the game’s final act three times more often. Yusuke Kaji at MUFG reported that at the highest effort, Fable 5 reflects on and validates its own work, and that for highly autonomous operations the extra thinking pays for itself.
Google showed Gemini 3 Pro at the top of Vending-Bench 2, which tests longer horizon planning by managing a simulated vending machine business. Gemini 3 Pro maintains consistent tool usage and decision making for a full simulated year of operation. Deep Think publishes 45.1 percent on ARC-AGI-2 with code execution, demonstrating performance on novel challenges. The Antigravity launch ties Gemini 3 Pro into an agent-first IDE where the model plans, codes, and validates end-to-end software tasks in parallel.
GPT-5’s agentic story runs through Codex and the OpenAI Agents SDK. Codex is the agentic surface OpenAI ships for coding agents and sandboxed automation, with reviews, subagents, and workflows already in production. The right comparison frame is platform versus model. Anthropic and Google are publishing model-level agentic claims; OpenAI is publishing a richer agent platform on top of GPT-5.

Pricing is one of the cleanest places to compare. Fable 5 is priced at $10 per million input tokens and $50 per million output tokens. GPT-5 is priced at $1.25 per million input tokens and $10 per million output tokens, with cached input at 12.5 cents per million. Google has not published a public per-token price for Gemini 3 Pro in preview at launch; pricing on Vertex AI and the Gemini API typically lands within weeks of the preview window.
Context windows split the field. Fable 5 ships with 1 million tokens of context and up to 128k output. GPT-5 ships with 400k context and 128k max output. Gemini 3 Pro ships with 1 million tokens of context. For workloads that genuinely need millions of tokens of context, Fable 5 and Gemini 3 Pro are the two choices. For workloads inside 400k, GPT-5 is the cheapest of the three by a wide margin.
Operational constraints are different across the three. Fable 5 is a Covered Model with mandatory 30-day data retention and no zero-data-retention option. Gemini 3 Pro inherits Google Cloud’s data residency and retention controls through Vertex AI. GPT-5 supports zero data retention on eligible workloads through the OpenAI API. For regulated workflows, the data retention posture often picks the model before the benchmark does.
Three gaps belong on the design page before you trust any leaderboard. First, none of the three labs publishes a head-to-head against the other two on the same eval set. The closest signal is the Matthew Pines testimonial that Fable 5 outpaced GPT-5.5 on frontier physics research, and even that is a customer claim from a single early tester. Treat cross-lab claims as directional, then run your own eval.
Second, every benchmark is gameable. Strong leaderboard performance can mask poor production behavior on the long tail of workloads outside the eval distribution. The Gemini 3 Pro scores on MMMU-Pro and SWE-Bench Verified are real and useful for narrowing the shortlist. They will not tell you how the model behaves on your own data shape.
Third, the comparison surface keeps moving. OpenAI’s docs already point developers to GPT-5.5 as the latest model. Anthropic published Mythos Preview in April and Fable 5 in June. Google ships Gemini 3 Deep Think to Ultra subscribers in the coming weeks. A scorecard built today against three named SKUs is a scorecard that needs a quarterly refresh.

A simple rubric covers most production routing decisions today. Route to Claude Fable 5 when the task would be assigned to a senior engineer or a senior analyst, when the horizon spans hours, and when long context plus file-based memory carry the workload. Route to Gemini 3 Pro when the work is multimodal, when the output is interactive UI, when PhD-grade math and science Q&A matter, and when zero-shot generation against a long video or a stack of PDFs is the shape of the task.
Route to GPT-5 when latency and unit economics dominate, when the workload fits comfortably inside 400k tokens, and when the agent platform around the model (Codex, Agents SDK, Apps SDK) is doing the heavy lifting. GPT-5 mini and nano cover the same intent at much lower cost for tasks where the marginal capability gap with the frontier does not move the metric.
The dominant failure mode in early multi-model rollouts is overspending on the strongest model where the weakest one would do. Default the router to the cheapest tier that passes your eval, and only promote workflows to a higher tier after a side-by-side run shows the lift is worth the per-token cost.
Four metrics belong on a dashboard the day a second frontier model joins your stack. Per-workflow win rate on your own eval set tells you whether each model is actually carrying the workload you assigned it. Per-workflow token cost split by model tells you whether the more expensive model is paying for itself on the workloads where you routed it. Latency at the p50 and p95 tells you whether the workflow’s user experience holds up under load. Refusal and fallback rate on the Anthropic side tells you whether Fable’s safety classifiers are firing where you expect.
Pair these with a monthly leaderboard review. Read the three labs’ latest published scores against your own internal eval and check whether your routing rules still match the evidence. If a model your router prefers has dropped two places on an eval that matters to your workload, the routing rule needs an update or your eval set has aged.
If you want a multi-model frontier stack designed and shipped cleanly inside your AI engineering org with workflow-keyed routing, refusal handling, fallback retries, eval harnesses, and a dashboard your team can audit, that is the kind of work we ship at Espressio.
No single model is better across the board. Fable 5 leads on long-horizon agentic work, production codebase migrations, and senior-grade analytical tasks. Gemini 3 Pro leads on multimodal reasoning, math and science Q&A, and zero-shot interactive UI generation. GPT-5 leads on cost per call inside its 400k context window and on the maturity of its agent platform. Pick the model by workload.
Anthropic chose to publish a smaller set of named capability claims along with customer testimonials and a system card for the launch. The system card includes detailed safety and capability evaluations. The lack of a public HLE or GPQA number does not mean Fable 5 underperforms on those evals; it means Anthropic anchored the launch on different evidence.
For a head-to-head in mid-2026, compare GPT-5.5. The OpenAI docs explicitly recommend GPT-5.5 as the latest model for coding, reasoning, and agentic tasks. Compare GPT-5 only when you have an explicit reason to lock the version, for example to keep an existing eval set stable.
Claude Fable 5 ships with a 1 million-token context window and up to 128k output tokens. Gemini 3 Pro ships with a 1 million-token context window. GPT-5 ships with a 400k-token context window and 128k max output tokens.
Claude Fable 5 is $10 per million input tokens and $50 per million output tokens. GPT-5 is $1.25 per million input tokens and $10 per million output tokens, with cached input at 12.5 cents per million. Gemini 3 Pro pricing on Vertex AI and the Gemini API was not published at preview launch and typically lands within weeks.
Gemini 3 Pro publishes 76.2 percent on SWE-Bench Verified, 54.2 percent on Terminal-Bench 2.0, and a WebDev Arena Elo of 1487. Claude Fable 5 publishes the highest score on Cognition’s FrontierCode and FrontierBench evals, plus the Stripe 50 million-line Ruby migration case. The right comparison depends on whether your workload looks more like SWE-Bench (issue resolution) or FrontierCode (production-quality changes).
Claude Fable 5 publishes the strongest evidence on long-horizon agentic work, with three-times memory leverage over Opus 4.8 on Slay the Spire and senior-grade reasoning at high effort. Gemini 3 Pro publishes strong evidence on long-horizon planning through Vending-Bench 2 and on agentic coding through Antigravity. Run your own eval on a representative workflow of your own.
Yes. Deep Think pushes HLE to 41 percent, GPQA Diamond to 93.8 percent, and ARC-AGI-2 to 45.1 percent with code execution. It rolls out to Google AI Ultra subscribers in the coming weeks after additional safety testing. For PhD-level reasoning and novel-task workloads, Deep Think is the configuration to test.
If you want a frontier multi-model stack designed and shipped cleanly inside your AI engineering org with routing, eval harness, refusal handling, and observability built in, let’s talk.