CAISI Says DeepSeek V4 Pro Lags US Frontier by 8 Months

The US Center for AI Standards and Innovation (CAISI), housed within NIST, released its evaluation of DeepSeek V4 Pro on May 3, 2026, and the headline finding is concrete: DeepSeek V4 Pro lags the US frontier by approximately eight months on aggregated benchmarks, performing similarly to GPT-5 (released eight months earlier) and behind GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Ultra. CAISI also found that DeepSeek’s self-reported scores overstate the model’s standing relative to the top US models: the gap is meaningfully larger on CAISI’s contamination-resistant benchmarks than on public ones. The evaluation is the first comprehensive US government benchmark of a Chinese frontier open-weights model, and it arrives at a politically charged moment in China-US AI competition. The implications go beyond the eight-month number itself.

What’s actually new

CAISI’s evaluation covered nine benchmarks across five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics. Two of those benchmarks were held-out and non-public — ARC-AGI-2’s semi-private dataset and CAISI’s internally developed PortBench, a software-engineering evaluation built specifically to resist contamination by models trained against public benchmark data. The held-out benchmarks matter because they cannot be optimized against; performance on them is closer to true capability than performance on public benchmarks where leakage and overfitting are real concerns.

The headline finding: DeepSeek V4 Pro is the most capable Chinese AI model CAISI has evaluated to date, but its aggregated performance places it roughly where GPT-5 was eight months earlier. That puts the model meaningfully behind the current US frontier (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Ultra) but ahead of older US frontier models. The “eight-month gap” is now the working number for US-China AI competition discussions.

Domain-specific findings produced a more nuanced picture. Math is the one area where DeepSeek V4 Pro nearly matches the top US models: V4 Pro scored 97% on OTIS-AIME-2025, 96% on PUMaC 2024, and 96% on SMT 2025, slightly better than Claude Opus 4.6 across all three and only 2-3 points behind GPT-5.5. In the other domains the gap is wider. Coding (SWE-bench, PortBench) shows DeepSeek behind the frontier; cybersecurity benchmarks show similar gaps; abstract reasoning (ARC-AGI-2) shows the largest gap.

Cost efficiency tells a different story. Compared to the most cost-competitive US reference model (GPT-5.4 mini), DeepSeek V4 was more cost efficient on 5 out of 7 benchmarks. DeepSeek charges roughly 1/10 the per-token price of frontier US models. For organizations whose use cases tolerate the eight-month capability gap, the cost differential is large enough that DeepSeek wins on total cost of ownership.

The discrepancy with DeepSeek’s own published numbers matters. DeepSeek’s self-reported evaluations placed V4 Pro near Claude Opus 4.6 and GPT-5.4 — roughly 4-5 months behind frontier rather than eight. CAISI’s evaluation on contamination-resistant benchmarks produces the larger gap. The implication is not that DeepSeek’s published numbers are dishonest, but that public-benchmark scores can drift from real-world capability when models are tuned heavily for benchmarks. CAISI’s PortBench and held-out ARC-AGI-2 sets reset that distortion.
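
One way to make that drift measurable is to compare a model's deficit to a reference model on a public benchmark with its deficit on a held-out benchmark in the same domain. The sketch below is illustrative only: the `DomainScores` type, the `deficit_shift` function, and the scores in the usage lines are hypothetical placeholders, not CAISI's figures or CAISI's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainScores:
    """Scores for one model in one domain, both on a 0-100 scale."""
    public: float     # score on a public benchmark
    held_out: float   # score on a held-out benchmark in the same domain


def deficit_shift(model: DomainScores, reference: DomainScores) -> float:
    """Return how much larger the model's deficit is on held-out data.

    A positive value means the model looks closer to the reference on the
    public benchmark than on the held-out one, which is the pattern you
    would expect if public scores are inflated by benchmark tuning.
    """
    public_deficit = reference.public - model.public
    held_out_deficit = reference.held_out - model.held_out
    return held_out_deficit - public_deficit


# Hypothetical placeholder scores, NOT CAISI results.
candidate = DomainScores(public=80.0, held_out=60.0)
frontier = DomainScores(public=85.0, held_out=75.0)

shift = deficit_shift(candidate, frontier)
print(f"Deficit grows by {shift:.1f} points on held-out data")
```

A shift near zero suggests public and held-out results agree; a large positive shift is a signal to trust the held-out number.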

Why it matters

  • The US-China AI competition narrative now has a working number. Eight months is concrete enough for policy discussions and short enough that closing it is plausible. Expect the figure to appear in congressional testimony, export-control debates, and AI policy frameworks through 2026.
  • Public benchmarks are losing their authority. The CAISI methodology of held-out, contamination-resistant evaluation is becoming the gold standard for serious capability assessment. Expect vendor-published numbers to be checked against held-out alternatives.
  • Open-weights models force capability discussions to consider cost. A model 8 months behind on capability but 10x cheaper has different deployment economics than a frontier closed model. Procurement decisions in 2026 increasingly model both capability and cost rather than picking on capability alone.
  • DeepSeek’s math performance reveals where Chinese AI labs are competitive. Math is harder to fake than reading-comprehension or knowledge benchmarks because correct answers can be verified objectively. DeepSeek’s near-frontier math scores suggest genuine technical capability in narrower domains, even if aggregate performance lags.
  • Compute and access constraints shape the gap. The eight-month gap exists despite Chinese labs operating under US export controls on the most advanced AI chips. Whether the gap widens or narrows depends partly on whether export controls tighten or loosen, partly on Chinese domestic chip development, and partly on Western lab progress velocity.
  • Enterprises now have a US-government-validated reference for open-weights deployment. Organizations evaluating DeepSeek V4 Pro for their own use cases can cite CAISI’s findings as a baseline rather than relying on vendor claims. This streamlines procurement decisions.

How to use the CAISI DeepSeek evaluation today

For organizations evaluating Chinese open-weights models for production use, the CAISI evaluation provides concrete reference data. Three steps integrate the findings into procurement decisions.

  1. Read the full CAISI report, not just summaries. The report at nist.gov includes per-benchmark methodology, results, and discussion that doesn’t fit in news coverage. The methodology section in particular helps you reproduce or extend the findings on your specific use cases.
  2. Build an internal benchmark on your actual workloads. CAISI’s findings are aggregate; your specific use case may map to one of the domains where DeepSeek is competitive (math) or one where the gap is wider (cybersecurity, abstract reasoning). Run a representative test on your real workload before procurement.
  3. Calculate cost-adjusted capability for your use case. An 8-month capability gap at 10x cost difference produces a different procurement calculation depending on volume and quality requirements. For high-volume use cases where the capability is sufficient, DeepSeek’s cost advantage may be decisive; for use cases requiring frontier capability, the gap matters more than cost.
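
Step 3 can be sketched as a cost-per-solved-task calculation. This is a minimal illustration, not CAISI's methodology: the `ModelOption` type, the pass rates, and the token counts are hypothetical inputs that you would replace with measurements from your own benchmark in step 2. The prices are the approximate per-million-token figures quoted in this article for DeepSeek V4 Pro and Claude Opus 4.7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    input_price: float    # USD per 1M input tokens
    output_price: float   # USD per 1M output tokens
    pass_rate: float      # fraction of YOUR tasks solved; measure this yourself


def cost_per_task(m: ModelOption, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single attempt with the given token counts."""
    total = input_tokens * m.input_price + output_tokens * m.output_price
    return total / 1_000_000


def cost_per_solved_task(m: ModelOption, input_tokens: int, output_tokens: int) -> float:
    """Expected USD spent per successfully solved task.

    Assumes every attempt costs the same and failed attempts are wasted.
    """
    if m.pass_rate <= 0:
        return float("inf")
    return cost_per_task(m, input_tokens, output_tokens) / m.pass_rate


# Prices from this article's table; pass rates are hypothetical placeholders.
open_weights = ModelOption("open-weights option", 0.27, 1.10, pass_rate=0.80)
frontier = ModelOption("frontier option", 5.00, 25.00, pass_rate=0.90)

for option in (open_weights, frontier):
    usd = cost_per_solved_task(option, input_tokens=2000, output_tokens=500)
    print(f"{option.name}: ${usd:.5f} per solved task")
```

Cost per solved task only tells part of the story: if a failed task carries a business cost (a wrong answer shipped to a customer, an engineer's time to review), add that cost to the failed fraction before comparing.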

For developers benchmarking their own models against the CAISI methodology, the held-out PortBench is described in NIST documentation:

# Conceptual structure of CAISI's PortBench-style evaluation
# (the actual benchmark is held-out; this illustrates the methodology)

from statistics import mean


class HeldOutEval:
    def __init__(self, sealed_test_set, eval_metric):
        self.test_set = sealed_test_set    # never published
        self.metric = eval_metric          # callable: (response, ground_truth) -> float

    def evaluate(self, model_endpoint):
        # model_endpoint is any callable that maps a prompt to a response
        results = []
        for problem in self.test_set:
            response = model_endpoint(problem.prompt)
            score = self.metric(response, problem.ground_truth)
            results.append(score)
        # Aggregation is assumed here to be the mean per-problem score
        return mean(results)

# The discipline: never make the test set public, never publish
# individual problem signatures, refresh the test set when models
# might have been trained on similar problems.

The methodological pattern matters more than the specific benchmark. Held-out evaluation produces capability assessments that survive vendor optimization; public evaluation does not.

How it compares

The CAISI DeepSeek evaluation positions DeepSeek V4 Pro alongside the major foundation models. The table below summarizes capability and cost positioning as of mid-2026 based on CAISI methodology and public pricing.

| Model | Capability tier | Strength domain | Cost (input / output, USD per 1M tokens) | Open weights |
|---|---|---|---|---|
| Claude Opus 4.7 (Anthropic) | Frontier | Coding, reasoning, agentic | $5 / $25 | No |
| GPT-5.5 (OpenAI) | Frontier | Broad capability, tools | $6 / $30 (approx) | No |
| Gemini 3.1 Ultra (Google) | Frontier | Long context, multimodal | $5 / $25 (approx) | No |
| DeepSeek V4 Pro | ~8 months behind frontier | Math, cost-efficient | $0.27 / $1.10 (approx) | Yes |
| Kimi K2.6 (Moonshot) | Comparable to V4 Pro | Coding, agentic | $0.30 / $1.20 (approx) | Yes |
| GLM-5.1 (Z.ai) | Comparable to V4 Pro | Coding, multilingual | $0.30 / $1.20 (approx) | Yes |
| GPT-5 (OpenAI, 8 months ago) | Approximately V4 Pro level | Reasoning, tools | $3 / $15 | No |

Two takeaways. First, the open-weights frontier (DeepSeek V4 Pro, Kimi K2.6, GLM-5.1) clusters around the GPT-5 capability tier from eight months ago, at roughly 10% of the cost. For use cases where that capability tier is sufficient, the cost advantage is decisive. Second, the closed-source US frontier maintains a measurable lead, with Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Ultra outperforming the open-weights cohort on aggregate capability. The market is bifurcating into “frontier capability at premium cost” and “near-frontier capability at commodity cost,” with the choice driven by use-case requirements.
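
The "roughly 10% of the cost" figure can be checked against the table's own prices. The short calculation below uses only the approximate list prices quoted in the table; the `PRICES` dictionary layout and the `price_ratio` helper are illustrative names, not part of any CAISI tooling.

```python
# (input, output) prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (6.00, 30.00),
    "Gemini 3.1 Ultra": (5.00, 25.00),
    "GPT-5": (3.00, 15.00),
    "DeepSeek V4 Pro": (0.27, 1.10),
}


def price_ratio(cheaper: str, pricier: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of one model's price to another's."""
    c_in, c_out = PRICES[cheaper]
    p_in, p_out = PRICES[pricier]
    return c_in / p_in, c_out / p_out


for reference in ("Claude Opus 4.7", "GPT-5.5", "Gemini 3.1 Ultra", "GPT-5"):
    r_in, r_out = price_ratio("DeepSeek V4 Pro", reference)
    print(f"vs {reference}: {r_in:.1%} of input price, {r_out:.1%} of output price")
```

On these list prices, DeepSeek V4 Pro comes out at roughly 4% to 5% of the current frontier models' prices and roughly 7% to 9% of GPT-5's, so "roughly 10%" is a conservative round number.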

What’s next

Three things to watch over the next two quarters.

  1. DeepSeek’s response. DeepSeek will likely release a V5 or V4 Pro Max generation that closes some of the gap. The cadence of Chinese open-weights releases through 2025-2026 has been roughly one major generation every 6-9 months; the next iteration is expected in mid-to-late 2026. Whether the gap narrows from 8 months to 4-6 months, holds at 8, or widens depends on the relative pace of progress at Western and Chinese labs.
  2. Broader CAISI evaluations. NIST has signaled it will continue evaluating frontier models with the held-out methodology. Expect evaluations of next-generation Western models, additional Chinese models, and possibly models from other geographies. The methodology will increasingly anchor capability discussions.
  3. Policy implications. The eight-month gap will be cited in export-control debates, federal AI procurement decisions, and national-security AI assessments. Whether the gap is treated as a comfortable margin or as an urgent threat depends on the policy framing that administrations choose.

The longer-term implication is that AI capability assessment is becoming a regulated science rather than a vendor-dominated discipline. CAISI’s role parallels NIST’s role in cryptographic standards, semiconductor measurements, and other technical domains where independent assessment matters. The maturation should produce more reliable capability information for buyers, more honest competitive dynamics among vendors, and better-informed policy debates. Enterprises and policymakers should track CAISI publications as routinely as they track vendor announcements.

Frequently Asked Questions

What does the eight-month gap actually mean for my use case?

It means DeepSeek V4 Pro performs roughly where US frontier models did eight months ago. For use cases where that earlier capability was sufficient (as it still is for many production workloads), DeepSeek is a viable choice with substantial cost advantages. For use cases that pushed the limits of older models, you likely need current frontier capability rather than DeepSeek. Test on your actual workload to determine which category applies.

Is DeepSeek V4 Pro safe to use given its Chinese origin?

The model itself is open-weights, so you can run it on your own infrastructure with full data isolation. Concerns about Chinese-origin AI typically center on hosted services where data flow could be observed. Self-hosted deployments mitigate most of those concerns. CAISI evaluated the open-weights model itself and did not flag specific safety or security concerns beyond the capability assessment.

Why does DeepSeek’s self-reported performance differ from CAISI’s findings?

Public benchmarks can be optimized against — models trained heavily on public benchmark data score better on those benchmarks than their general capability would suggest. CAISI’s held-out PortBench and ARC-AGI-2 semi-private set are designed to resist this optimization. The differing scores are not necessarily evidence of dishonesty; they’re evidence that public benchmarks are weakening as authoritative capability measures.

What is PortBench and why does it matter?

PortBench is CAISI’s internally developed software-engineering benchmark, kept private to prevent training-data contamination. The benchmark evaluates models on realistic engineering tasks where the test set is not available for model training. PortBench-style evaluations are increasingly the gold standard for capability assessment because they produce results that survive vendor optimization on public benchmarks.

How do CAISI’s findings affect AI export controls and policy?

The eight-month gap will likely be cited in policy debates as evidence that current export controls are working — Chinese AI capability lags despite substantial domestic investment. Other voices will argue the gap is closing rapidly and current controls are insufficient. The policy discussion will evolve through 2026-2027; CAISI’s ongoing evaluations will provide updated data points.

Should US enterprises avoid using DeepSeek V4 Pro for production?

It depends on the use case. The CAISI evaluation does not recommend against using DeepSeek; it characterizes the capability and cost trade-offs. For commercial use cases where DeepSeek’s capability is sufficient and cost matters, DeepSeek is a reasonable choice with self-hosting recommended. For use cases involving sensitive data, regulated industries, or government workflows, organizations typically prefer Western models for procurement and political reasons even when DeepSeek’s capability would suffice.
