01 — The closed frontier
A four-way race decided in weeks
Opus 4.8 launched at the same price as 4.7 ($5 / $25 per M input/output tokens) and retook the lead on most agentic, coding, and knowledge-work tests. But no lab owns the frontier outright: leadership is now task-specific and changes hands within weeks of each release.
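As a quick illustration of what that list price means per request, the sketch below prices a single call at the $5 / $25 per million input/output token rates quoted above. The function name and the token counts are hypothetical example values, and any discounts are ignored.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 5.0,    # USD per million input tokens (from the text)
    output_price_per_m: float = 25.0,  # USD per million output tokens (from the text)
) -> float:
    """List-price cost in USD of one request, with no discounts applied."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical agentic request: 200k tokens read, 20k tokens written.
example_cost = request_cost_usd(200_000, 20_000)
print(f"Example request cost: ${example_cost:.2f}")
```

Because output tokens cost five times as much as input tokens at these rates, workloads that generate a lot of text are priced very differently from ones that mostly read.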
Figure 1. Source: Anthropic, Claude Opus 4.8 System Card (28 May 2026). Higher is better; first-party scores, not independently replicated. The leader on each benchmark is marked.
| Model | Where it is ahead today |
|---|---|
| Opus 4.8 | Agentic coding (SWE-Bench Pro 69.2%), computer use (OSWorld 83.4%), reasoning-with-tools (HLE 57.9%), financial analysis, knowledge work (GDPval-AA 1890 vs GPT-5.5 1769 and Gemini 3.1 Pro 1314), and, as of 1 June, the aggregate Artificial Analysis Intelligence Index (61.4). |
| GPT-5.5 | Agentic terminal coding (Terminal-Bench 2.1 78.2%); it led the aggregate Intelligence Index until Opus 4.8 was indexed on 1 June. |
| Gemini 3.1 Pro | Pure scientific reasoning (GPQA Diamond 94.1%) and the lowest token price among the major-lab frontier models. |
02 — The open frontier
The gap that nearly closed
On Artificial Analysis's independently run Intelligence Index, the best open-weight model (Kimi K2.6, 54) sits ~7 points behind the closed ceiling, now Claude Opus 4.8 at 61.4 (indexed 1 June, just ahead of GPT-5.5's 60.2). Kimi K2.6 is already at parity on several individual benchmarks, and on coding the gap is effectively gone: it scores 80.2% on SWE-bench Verified against Opus 4.6's 80.8%.
Legend: closed weights vs. open weights; ~7-point gap between the open and closed ceilings.
Figure 2. Independent composite of 10 evaluations, bars scaled across the 45–62 range. Claude Opus 4.8, indexed by Artificial Analysis on 1 June 2026, now tops the Index at 61.4, just ahead of GPT-5.5 (60.2). Source: Artificial Analysis.
Figure 3. Months behind the closed frontier. Source: Epoch AI, Capabilities Index. The lag has compressed from ~12 months to ~3 in a year.
03 — Pace
How fast this is actually moving
The field reshuffles monthly because capability is on an exponential, and the recent cadence is what matters. On METR's time-horizon test, the length of task a frontier model finishes at 50% reliability has been doubling roughly every four months since 2023. Anthropic's Opus line shows it plainly, going from under two hours (Opus 4.1, August 2025) to about 14.5 hours (Opus 4.6, February 2026), with the latest models now past the ~16-hour mark where METR's current suite saturates.
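The doubling arithmetic can be checked in a few lines. This is a sketch under stated assumptions, not METR's method: the function names are hypothetical, and the 1.8-hour starting value is an assumed stand-in for "under two hours". A two-point estimate like this is far noisier than a fitted trend.

```python
import math


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth over `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


def implied_doubling_months(start: float, end: float, months: float) -> float:
    """Doubling time implied by two observations taken `months` apart."""
    if start <= 0 or end <= start:
        raise ValueError("need 0 < start < end")
    return months * math.log(2.0) / math.log(end / start)


# A steady four-month doubling compounds to 8x over twelve months.
yearly = growth_factor(12, 4)

# Assumed inputs: ~1.8 h (stand-in for "under two hours") to 14.5 h,
# observed six months apart (August to February).
opus_doubling = implied_doubling_months(1.8, 14.5, 6)

print(f"12 months at a 4-month doubling: {yearly:.1f}x")
print(f"Implied doubling time for the Opus line: {opus_doubling:.1f} months")
```

Under these assumed inputs the Opus line implies a doubling time of about two months, faster than the long-run four-month trend, which fits the point that the recent cadence is what matters.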
Legend: Anthropic Opus and OpenAI GPT series, log scale; ≈ 4-month doubling; Opus up ≈ 8× since August 2025.
Figure 4. Source: METR, Time Horizon 1.1 (May 2026). Estimates near and above the ~16 h mark are noisy as the suite saturates. METR has not yet published a time horizon for GPT-5.5; the most recent OpenAI model on its suite is the GPT-5 Codex line.
Figure 5. Sources: Artificial Analysis, vendor model cards. GPQA is saturating: Claude Opus 4.8 (93.6), GPT-5.5 (~93.6) and Gemini 3.1 Pro (94.1) are statistically tied at the frontier, ~29 pts above the ~65% human-expert baseline.
04 — Implications
What this means for the market
| What we are seeing | Strategic implication |
|---|---|
| Model-layer differentiation is eroding | Moats migrate to product, distribution, agentic reliability, and enterprise trust and compliance. Differentiation has moved from the weights to the stack around the model. |
| Inference cost is collapsing (30–60%/yr) | Margin pressure on "model-as-product"; a tailwind to the application layer and to inference-optimization and routing infrastructure. |
| Capability is compounding (~4–7 mo doublings) | Capability-gated TAM (autonomous agents, long-horizon knowledge work) opens earlier than a straight-line read implies, which makes timing the hardest variable to call. |
| Open weights are a credible option | Viable on cost, control, data residency, and sovereignty grounds; switching cost is the right lens for evaluating model dependency. |
| Frontier safety is now a gating factor | Release cadence is now driven by safety review more than by training, and the next tier already exists in restricted hands, so the gated pipeline tells you as much as the shipped models. |
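To put rough numbers on the cost and capability rows above, the sketch below compounds the quoted ranges (30–60%/yr price decline, 4–7 month doublings) over one to three years. It assumes the rates stay constant, which the table does not claim, and the function names are illustrative.

```python
def cost_remaining(annual_decline: float, years: float) -> float:
    """Fraction of today's price left after `years` of a constant annual decline."""
    if not 0.0 <= annual_decline < 1.0:
        raise ValueError("annual_decline must be in [0, 1)")
    return (1.0 - annual_decline) ** years


def capability_multiple(doubling_months: float, years: float) -> float:
    """Capability multiple after `years` at a constant doubling time in months."""
    return 2.0 ** (12.0 * years / doubling_months)


for years in (1, 2, 3):
    # Price band: 60%/yr decline is the fast case, 30%/yr the slow case.
    fast_price = cost_remaining(0.60, years)
    slow_price = cost_remaining(0.30, years)
    # Capability band: 7-month doubling is the slow case, 4-month the fast case.
    slow_cap = capability_multiple(7, years)
    fast_cap = capability_multiple(4, years)
    print(
        f"{years}y: price {fast_price:.0%} to {slow_price:.0%} of today, "
        f"capability {slow_cap:.1f}x to {fast_cap:.1f}x"
    )
```

The width of the resulting bands is the practical reason timing is hard to call: small differences in the assumed rate compound into large differences within two or three years.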
Methodology & sources. Figures as of 1 June 2026. Cross-model capability uses the Artificial Analysis Intelligence Index v4.0 (independently run); Opus 4.8's Intelligence Index (61.4) and GPQA Diamond (93.6) are from Artificial Analysis (indexed 1 June 2026); its agentic-benchmark figures (Figure 1) are Anthropic first-party (System Card). Pace metrics from METR (task time horizon) and Epoch AI (open–closed lag, Capabilities Index). Benchmark trajectories from Scale AI (HLE), vals.ai / SWE-bench, llm-stats, and vendor model cards. Cost economics draw on Epoch AI (inference-price decline) and RouteLLM (multi-model routing, ~95% of frontier quality at ~15% of cost); the Q1 2026 model-release count is from llm-stats. Caveats: vendor-reported scores can differ from independent runs; human-preference (LMArena) and composite indices can disagree, and individual-benchmark leadership shifts week to week. Company and funding figures reflect public reporting as of 1 June 2026. For informational purposes only; not investment advice, and not a recommendation regarding any security.