Karmel Mid-Year Brief · Technology & AI · June 2026

The Frontier Compresses

Anthropic shipped Claude Opus 4.8 last week. It's a strong model, but the real story is the convergence: the leading systems are bunching up, across labs and across the open–closed line, even as the pace of gains keeps rising.

  • ~3 mo: Open-weight lag behind the closed frontier, down from ~12 mo a year ago
  • ~4–7 mo: Capability doubling time (task length AI completes at 50% reliability)
  • 61 vs 54: Intelligence Index, closed ceiling vs best open-weight model
  • 8–100×: Cheaper per token for frontier-class open weights

01 — The closed frontier

A four-way race decided in weeks

Opus 4.8 launched at the same price as 4.7 ($5 / $25 per M input/output tokens) and retook the lead on most agentic, coding, and knowledge-work tests. But no lab owns the frontier outright: leadership is now task-specific and changes hands within weeks of each release.
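
To make the per-token pricing concrete, here is a minimal sketch of the arithmetic using the $5 / $25 per million input/output tokens quoted above. The function name and the example token counts are hypothetical, chosen only for illustration; the sketch ignores caching, batch discounts, or any other pricing tier the brief does not mention.

```python
# List prices quoted in the brief for Opus 4.8, in USD per million tokens.
INPUT_USD_PER_M = 5.0
OUTPUT_USD_PER_M = 25.0


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_USD_PER_M,
    output_price: float = OUTPUT_USD_PER_M,
) -> float:
    """Return the list-price cost of one request in USD.

    Hypothetical helper: token counts in, dollars out, no discounts applied.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Illustrative (assumed) workload: 200k tokens read, 20k tokens written.
example_cost = request_cost_usd(200_000, 20_000)
print(f"${example_cost:.2f}")
```

Because output tokens cost five times as much as input tokens at these prices, the 20k output tokens in this assumed workload account for a third of the total.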

Opus 4.8 vs. the closed frontier — first-party benchmark suite

Figure 1. Source: Anthropic, Claude Opus 4.8 System Card (28 May 2026). Higher is better; first-party scores, not independently replicated. The leader on each benchmark is marked.

Leads on | Where each model is ahead today
Opus 4.8 | Agentic coding (SWE-Bench Pro 69.2%), computer use (OSWorld 83.4%), reasoning-with-tools (HLE 57.9%), financial analysis, knowledge work (GDPval-AA 1890 vs GPT-5.5 1769, Gemini 1314), and, as of 1 June, the aggregate Artificial Analysis Intelligence Index (61.4).
GPT-5.5 | Agentic terminal coding (Terminal-Bench 2.1 78.2%); it led the aggregate Intelligence Index until Opus 4.8 was indexed on 1 June.
Gemini 3.1 Pro | Pure scientific reasoning (GPQA Diamond 94.1%) and the lowest token price among the major-lab frontier models.

02 — The open frontier

The gap that nearly closed

On Artificial Analysis's independently run Intelligence Index, the best open-weight model (Kimi K2.6, 54) sits ~7 points behind the closed ceiling, now Claude Opus 4.8 at 61.4 (indexed 1 June, just ahead of GPT-5.5's 60.2). Kimi K2.6 is already at parity on several individual benchmarks, and on coding the gap is effectively gone: it scores 80.2% on SWE-bench Verified against Opus 4.6's 80.8%.

Artificial Analysis Intelligence Index v4.0 — mid-May 2026

Legend: closed weights; open weights; ~7-point gap, open vs closed ceiling.

Figure 2. Independent composite of 10 evaluations, bars scaled across the 45–62 range. Claude Opus 4.8, indexed by Artificial Analysis on 1 June 2026, now tops the Index at 61.4, just ahead of GPT-5.5 (60.2). Source: Artificial Analysis.

Time the best open model trails the closed frontier

Figure 3. Months behind the closed frontier. Source: Epoch AI, Capabilities Index. The lag has compressed from ~12 months to ~3 in a year.

03 — Pace

How fast this is actually moving

The reason the field reshuffles monthly: capability is on an exponential, and the recent cadence is what matters. On METR's time-horizon test, the length of task a frontier model finishes at 50% reliability has been doubling roughly every four months since 2023. Anthropic's Opus line shows it plainly, going from under two hours (Opus 4.1, August) to about 14.5 hours (Opus 4.6, February), with the latest models now past the ~16-hour mark where METR's current suite saturates.
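
The doubling-time arithmetic behind that paragraph can be checked with a short sketch. It assumes clean exponential growth between two observations, takes 2.0 hours as a stand-in for "under two hours" (an upper bound, so the true ratio is somewhat larger), and assumes six months elapsed between the August and February data points. The function names are hypothetical.

```python
import math


def doubling_time_months(start_hours: float, end_hours: float, elapsed_months: float) -> float:
    """Doubling time implied by two observations, assuming exponential growth."""
    if start_hours <= 0 or end_hours <= start_hours or elapsed_months <= 0:
        raise ValueError("need 0 < start_hours < end_hours and elapsed_months > 0")
    return elapsed_months * math.log(2) / math.log(end_hours / start_hours)


def projected_horizon_hours(start_hours: float, doubling_months: float, elapsed_months: float) -> float:
    """Horizon after elapsed_months if it doubles every doubling_months."""
    if start_hours <= 0 or doubling_months <= 0:
        raise ValueError("start_hours and doubling_months must be positive")
    return start_hours * 2 ** (elapsed_months / doubling_months)


# Assumed inputs: 2.0 h is an upper-bound stand-in for "under two hours".
implied = doubling_time_months(start_hours=2.0, end_hours=14.5, elapsed_months=6.0)
print(f"implied doubling time: {implied:.1f} months")

# What a steady 4-month doubling would give over the same six months.
steady = projected_horizon_hours(start_hours=2.0, doubling_months=4.0, elapsed_months=6.0)
print(f"horizon at a 4-month doubling: {steady:.1f} hours")
```

Under these assumptions the two Opus points imply a doubling time of roughly two months, faster than the ~4-month long-run figure. Two estimates near a saturating suite are noisy, so this is a consistency check on the quoted numbers and not a trend estimate.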

METR 50%-reliability task-time horizon

Legend: Anthropic Opus; OpenAI GPT. Log scale · ≈ 4-month doubling · Opus up ≈ 8× since Aug ’25

Figure 4. Source: METR, Time Horizon 1.1 (May 2026). Estimates near and above the ~16 h mark are noisy as the suite saturates. METR has not yet published a time horizon for GPT-5.5; the most recent OpenAI model on its suite is the GPT-5 Codex line.

GPQA Diamond — graduate-level science

Figure 5. Sources: Artificial Analysis, vendor model cards. GPQA is saturating: Claude Opus 4.8 (93.6), GPT-5.5 (~93.6) and Gemini 3.1 Pro (94.1) are statistically tied at the frontier, ~29 pts above the ~65% human-expert baseline.

04 — Implications

What this means for the market

What we are seeing | Strategic implication
Model-layer differentiation is eroding | Moats migrate to product, distribution, agentic reliability, and enterprise trust and compliance. Differentiation has moved from the weights to the stack around the model.
Inference cost is collapsing (30–60%/yr) | Margin pressure on "model-as-product"; a tailwind to the application layer and to inference-optimization and routing infrastructure.
Capability is compounding (~4–7 mo doublings) | Capability-gated TAM (autonomous agents, long-horizon knowledge work) opens earlier than a straight-line read implies, which makes timing the hardest variable to call.
Open weights are a credible option | A viable option for cost, control, data residency, and sovereignty; switching cost is the right lens for evaluating model dependency.
Frontier safety is now a gating factor | Release cadence is now driven by safety review more than by training, and the next tier already exists in restricted hands, so the gated pipeline tells you as much as the shipped models.
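
The 30–60%/yr inference-cost decline in the table compounds quickly. The sketch below shows the arithmetic, assuming a constant annual decline (a simplification; the brief gives only a range) and using hypothetical function names.

```python
import math


def price_after(years: float, annual_decline: float, start_price: float = 1.0) -> float:
    """Price after `years` of a constant fractional annual decline."""
    if not 0.0 < annual_decline < 1.0:
        raise ValueError("annual_decline must be between 0 and 1")
    return start_price * (1.0 - annual_decline) ** years


def years_to_reduction(factor: float, annual_decline: float) -> float:
    """Years until price has fallen by `factor` (e.g. 10 for a 10x reduction)."""
    if factor <= 1.0 or not 0.0 < annual_decline < 1.0:
        raise ValueError("need factor > 1 and 0 < annual_decline < 1")
    return math.log(factor) / -math.log(1.0 - annual_decline)


# The two ends of the range quoted in the table.
for decline in (0.30, 0.60):
    remaining = price_after(3, decline)
    years_10x = years_to_reduction(10, decline)
    print(f"{decline:.0%}/yr: {remaining:.1%} of today's price after 3 years, "
          f"10x cheaper in {years_10x:.1f} years")
```

At the low end of the range, prices fall to about a third of today's level in three years; at the high end, to about 6%. The spread between the two ends is why margin assumptions for "model-as-product" businesses are so sensitive to which rate holds.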

Methodology & sources. Figures as of 1 June 2026. Cross-model capability uses the Artificial Analysis Intelligence Index v4.0 (independently run); Opus 4.8's Intelligence Index (61.4) and GPQA Diamond (93.6) are from Artificial Analysis (indexed 1 June 2026); its agentic-benchmark figures (Figure 1) are Anthropic first-party (System Card). Pace metrics from METR (task time horizon) and Epoch AI (open–closed lag, Capabilities Index). Benchmark trajectories from Scale AI (HLE), vals.ai / SWE-bench, llm-stats, and vendor model cards. Cost economics draw on Epoch AI (inference-price decline) and RouteLLM (multi-model routing, ~95% of frontier quality at ~15% of cost); the Q1 2026 model-release count is from llm-stats. Caveats: vendor-reported scores can differ from independent runs; human-preference (LMArena) and composite indices can disagree, and individual-benchmark leadership shifts week to week. Company and funding figures reflect public reporting as of 1 June 2026. For informational purposes only; not investment advice, and not a recommendation regarding any security.