The Frontier Compresses — Karmel Mid-Year Brief 2026

01 — The closed frontier

A four-way race decided in weeks

Opus 4.8 launched at the same price as 4.7 ($5 / $25 per M input/output tokens) and retook the lead on most agentic, coding, and knowledge-work tests. But no lab owns the frontier outright: leadership is now task-specific and changes hands within weeks of each release.

Opus 4.8 vs. the closed frontier — first-party benchmark suite

Figure 1. Source: Anthropic, Claude Opus 4.8 System Card (28 May 2026). Higher is better; first-party scores, not independently replicated. The leader on each benchmark is marked.

Leads on	Where each model is ahead today
Opus 4.8	Agentic coding (SWE-Bench Pro 69.2%), computer use (OSWorld 83.4%), reasoning-with-tools (HLE 57.9%), financial analysis, knowledge work (GDPval-AA 1890 vs GPT-5.5 at 1769 and Gemini 3.1 Pro at 1314), and, as of 1 June, the aggregate Artificial Analysis Intelligence Index (61.4).
GPT-5.5	Agentic terminal coding (Terminal-Bench 2.1 78.2%); it led the aggregate Intelligence Index until Opus 4.8 was indexed on 1 June.
Gemini 3.1 Pro	Pure scientific reasoning (GPQA Diamond 94.1%) and the lowest token price among the major-lab frontier models.

02 — The open frontier

The gap that nearly closed

On Artificial Analysis's independently run Intelligence Index, the best open-weight model (Kimi K2.6, 54) sits ~7 points behind the closed ceiling, now Claude Opus 4.8 at 61.4 (indexed 1 June, just ahead of GPT-5.5's 60.2). Kimi K2.6 is already at parity on several individual benchmarks, and on coding the gap is effectively gone: it scores 80.2% on SWE-bench Verified against Opus 4.6's 80.8%.

Artificial Analysis Intelligence Index v4.0 — mid-May 2026

Legend: closed weights; open weights. Annotation: ~7-point gap, open vs closed ceiling.

Figure 2. Independent composite of 10 evaluations, bars scaled across the 45–62 range. Claude Opus 4.8, indexed by Artificial Analysis on 1 June 2026, now tops the Index at 61.4, just ahead of GPT-5.5 (60.2). Source: Artificial Analysis.

Time the best open model trails the closed frontier

Figure 3. Months behind the closed frontier. Source: Epoch AI, Capabilities Index. The lag has compressed from ~12 months to ~3 in a year.

The disappearing premium

The capability gap is narrowing, and the premium buyers will pay for it is narrowing faster. Chinese labs (Moonshot's Kimi, DeepSeek, Alibaba's Qwen, and Z.ai's GLM) now set the open-weight frontier at 8–100× lower cost per token. For most workloads, multi-model routing, which sends the bulk of traffic to cheap open or small models and a sliver to a frontier model, reaches roughly 95% of frontier quality at about 15% of the cost (RouteLLM).

Model	Input / output price
Opus 4.8	$5 / $25
Gemini 3.1 Pro	$2 / $12
Kimi K2.6	$0.60 / $2.50
DeepSeek V4 Pro	$0.44 / $0.87

Price per million tokens (input / output). Prices fell an estimated 30–60% across the board over the past year (Epoch AI).
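
To make the routing arithmetic concrete, the sketch below uses the list prices quoted above to work out what share of traffic could go to a frontier model while keeping the blended bill at about 15% of an all-frontier bill. It is a minimal illustration, not RouteLLM's method: the function names are hypothetical, the 1:1 input-to-output token mix is an assumption, and it assumes both models consume the same number of tokens per request. It says nothing about the quality side of the trade-off.

```python
# Prices in USD per million tokens (input, output), as listed in this brief.
PRICES = {
    "Opus 4.8": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Kimi K2.6": (0.60, 2.50),
    "DeepSeek V4 Pro": (0.44, 0.87),
}


def workload_cost(model, in_mtok=1.0, out_mtok=1.0):
    """Cost in USD of a workload sent entirely to one model.

    in_mtok / out_mtok are millions of input / output tokens.
    The default 1:1 mix is an illustrative assumption.
    """
    p_in, p_out = PRICES[model]
    return p_in * in_mtok + p_out * out_mtok


def blended_cost(frontier_share, frontier, cheap, in_mtok=1.0, out_mtok=1.0):
    """Cost when frontier_share of traffic goes to the frontier model
    and the remainder goes to the cheap model."""
    return (frontier_share * workload_cost(frontier, in_mtok, out_mtok)
            + (1.0 - frontier_share) * workload_cost(cheap, in_mtok, out_mtok))


def share_for_budget(budget_ratio, frontier, cheap, in_mtok=1.0, out_mtok=1.0):
    """Frontier traffic share at which the blended cost equals
    budget_ratio times the all-frontier cost."""
    full = workload_cost(frontier, in_mtok, out_mtok)
    floor = workload_cost(cheap, in_mtok, out_mtok)
    if budget_ratio * full < floor:
        raise ValueError("budget is below the cost of the cheap model alone")
    return (budget_ratio * full - floor) / (full - floor)


# Hypothetical pairing: Opus 4.8 as the frontier model, Kimi K2.6 as the cheap one.
share = share_for_budget(0.15, "Opus 4.8", "Kimi K2.6")
print(f"Frontier share for a 15% budget: {share:.1%}")
```

Under these assumptions, only around one request in twenty could go to the frontier model, which matches the "sliver" described above. A different token mix or a cheaper fallback model shifts the figure.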

03 — Pace

How fast this is actually moving

The field reshuffles monthly because capability is growing exponentially, and the recent cadence is what matters. On METR's time-horizon test, the length of task a frontier model finishes at 50% reliability has been doubling roughly every four months since 2023. Anthropic's Opus line shows it plainly, going from under two hours (Opus 4.1, August 2025) to about 14.5 hours (Opus 4.6, February 2026), with the latest models now past the ~16-hour mark where METR's current suite saturates.

METR 50%-reliability task-time horizon

Series: Anthropic Opus; OpenAI GPT. Log scale · ≈ 4-month doubling · Opus up ≈ 8× since Aug ’25

Figure 4. Source: METR, Time Horizon 1.1 (May 2026). Estimates near and above the ~16 h mark are noisy as the suite saturates. METR has not yet published a time horizon for GPT-5.5; the most recent OpenAI model on its suite is the GPT-5 Codex line.
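
The doubling-time arithmetic behind these figures can be sketched as follows. This is an illustration under stated assumptions, not METR's estimation method: the function names are hypothetical, the Opus 4.1 horizon is backed out from the "≈ 8×" annotation rather than taken from a published value, and the six-month interval is the August 2025 to February 2026 span. A two-point estimate is noisy, so the result should be read alongside the longer-run trend of roughly four months, not as a replacement for it.

```python
import math


def doubling_time_months(start_hours, end_hours, months_elapsed):
    """Doubling time implied by exponential growth between two measurements."""
    if end_hours <= start_hours:
        raise ValueError("end_hours must exceed start_hours")
    return months_elapsed * math.log(2) / math.log(end_hours / start_hours)


def project_horizon(hours_now, months_ahead, doubling_months):
    """Task-time horizon after months_ahead if the doubling time held constant."""
    return hours_now * 2 ** (months_ahead / doubling_months)


# Figures from the brief: Opus 4.6 at about 14.5 hours, Opus up about 8x.
opus_46_hours = 14.5
opus_41_hours = opus_46_hours / 8  # assumption: derived from the ~8x annotation
months_between = 6                 # August 2025 to February 2026

implied = doubling_time_months(opus_41_hours, opus_46_hours, months_between)
print(f"Implied Opus doubling time: {implied:.1f} months")

# Purely illustrative extrapolation at the brief's ~4-month doubling.
projected = project_horizon(opus_46_hours, months_ahead=12, doubling_months=4)
print(f"Horizon after 12 more months at a 4-month doubling: {projected:.0f} hours")
```

An 8× gain in six months is three doublings, so this pair of Opus points implies a faster pace than the four-month trend line. The extrapolation assumes the trend continues unchanged and that a benchmark exists to measure it, which the suite's saturation near 16 hours already calls into question.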

GPQA Diamond — graduate-level science

Figure 5. Sources: Artificial Analysis, vendor model cards. GPQA is saturating: Claude Opus 4.8 (93.6), GPT-5.5 (~93.6) and Gemini 3.1 Pro (94.1) are statistically tied at the frontier, ~29 pts above the ~65% human-expert baseline.

04 — Implications

What this means for the market

What we are seeing	Strategic implication
Model-layer differentiation is eroding	Moats migrate to product, distribution, agentic reliability, and enterprise trust and compliance. Differentiation has moved from the weights to the stack around the model.
Inference cost is collapsing (30–60%/yr)	Margin pressure on "model-as-product"; a tailwind to the application layer and to inference-optimization and routing infrastructure.
Capability is compounding (~4–7 mo doublings)	Capability-gated TAM (autonomous agents, long-horizon knowledge work) opens earlier than a straight-line read implies, which makes timing the hardest variable to call.
Open weights are a credible option	A viable option for cost, control, data residency, and sovereignty; switching cost is the right lens for evaluating model dependency.
Frontier safety is now a gating factor	Release cadence is now driven by safety review more than by training, and the next tier already exists in restricted hands, so the gated pipeline tells you as much as the shipped models.

Methodology & sources. Figures as of 1 June 2026. Cross-model capability uses the Artificial Analysis Intelligence Index v4.0 (independently run); Opus 4.8's Intelligence Index (61.4) and GPQA Diamond (93.6) are from Artificial Analysis (indexed 1 June 2026); its agentic-benchmark figures (Figure 1) are Anthropic first-party (System Card). Pace metrics from METR (task time horizon) and Epoch AI (open–closed lag, Capabilities Index). Benchmark trajectories from Scale AI (HLE), vals.ai / SWE-bench, llm-stats, and vendor model cards. Cost economics draw on Epoch AI (inference-price decline) and RouteLLM (multi-model routing, ~95% of frontier quality at ~15% of cost); the Q1 2026 model-release count is from llm-stats. Caveats: vendor-reported scores can differ from independent runs; human-preference (LMArena) and composite indices can disagree, and individual-benchmark leadership shifts week to week. Company and funding figures reflect public reporting as of 1 June 2026. For informational purposes only; not investment advice, and not a recommendation regarding any security.