
On May 5, 2026, the US government quietly closed a gap in frontier AI oversight. Google DeepMind, Microsoft, and xAI signed agreements with the Center for AI Standards and Innovation (CAISI), housed inside the Department of Commerce’s NIST, granting evaluators access to unreleased models, including in classified environments. With OpenAI and Anthropic already in the program since August 2024, every major US frontier lab has now agreed to AI pre-deployment testing before a model ships. This is the first time that statement has been true.
What’s actually new
The deal expands a voluntary structure CAISI inherited when Commerce reorganized the former AI Safety Institute under NIST. Three pieces are genuinely new in May 2026.
- Three more labs added in one announcement. Google DeepMind, Microsoft (covering both first-party and OpenAI-derived models hosted on Azure), and xAI all signed on the same day. CAISI now covers what insiders call the “Big Five” of US frontier development.
- Classified-environment testing is explicit. Earlier MOUs left this ambiguous. The 2026 agreements explicitly support evaluation in classified facilities, which lets the TRAINS Taskforce — interagency experts from Defense, Energy, Homeland Security, and the intelligence community — probe sensitive capabilities (bioweapons synthesis, cyber-offensive code generation, election manipulation) without leakage risk.
- Models can be handed over with safeguards stripped. CAISI policy now formalizes that developers may submit versions of their models with refusals, RLHF guardrails, and content filters partially or fully removed. The point is to see what the underlying weights are actually capable of, not what the polished consumer release will do.
CAISI has already completed more than 40 evaluations across the existing OpenAI and Anthropic agreements, including state-of-the-art models that never shipped publicly. The May 5 expansion brings xAI’s Grok lineage, Google DeepMind’s Gemini frontier track, and Microsoft’s MAI / Phi / partner-hosted ChatGPT variants into that same evaluation pipeline.
Why it matters
- Pre-deployment is the right place to catch the worst failures. Once a frontier model ships, every prompt sent to it is a potential misuse attempt. Catching a CBRN-uplift failure or a critical jailbreak before launch is the entire point of AI pre-deployment testing.
- It removes a competitive escape valve. Until May 5, an OpenAI executive could plausibly argue their company was disadvantaged by submitting to government red-teaming when Google or xAI did not. That argument is gone.
- The agreements are voluntary, not statutory. No law passed Congress. CAISI cannot force a release delay. Companies retain the right to ship, but doing so over CAISI’s objection would now be a visible bad-faith move that is hard to defend publicly.
- America’s AI Action Plan now has teeth. The Trump administration’s July 2025 AI Action Plan directed Commerce to expand frontier evaluations. The May 5 signings are the first time the plan visibly shaped lab behavior beyond OpenAI and Anthropic.
- The TRAINS Taskforce is the real machinery. The political theater is the announcement; the substantive work happens inside an interagency taskforce that includes hands-on national-security technologists. Those people read transcripts and re-run probes. They are not signing off on PR documents.
- The UK AISI parallel is now obvious. Britain runs a near-identical program. Expect a transatlantic alignment push within 12 months — shared eval suites, shared reports, possibly shared red-team personnel.
How to use it today
If you ship anything that depends on a frontier model — a B2B agent, a SaaS feature, a vertical AI product — these agreements change your supply chain. Here is how to prepare in the next two weeks.
- Map your model dependencies to the Big Five. List every API, fine-tune base, and embedded model your product calls. Group by lab.
- Subscribe to CAISI’s evaluation reports. NIST publishes summary reports for completed evaluations at nist.gov/caisi. Read the methodology before reading findings — the eval design tells you what risks they take seriously.
- Add a release-delay clause to vendor SLAs. If CAISI flags a critical risk and your vendor pauses or rolls back a model, you want contractual notice and a fallback path. Add a clause like:
  4.7 Pre-Deployment Risk Notification. Vendor shall notify Customer within 24 hours of any post-CAISI-evaluation decision to delay, restrict, or roll back availability of any Service Tier model. Customer may, without penalty, switch to an alternate model tier for the affected workload during the delay period.
- Run the same probes internally. CAISI’s published eval families (biosecurity uplift, autonomous replication, cyber-offensive coding, election interference) translate to test prompts you can run against your own deployment. A simple harness:
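  ```python
  # pip install anthropic openai google-genai
  import json
  from pathlib import Path

  # caisi_probes.jsonl holds one JSON object per line, so parse it line by line.
  PROBES = [
      json.loads(line)
      for line in Path("caisi_probes.jsonl").read_text().splitlines()
      if line.strip()
  ]

  def run(model_call, probes):
      """Send each probe to model_call (prompt -> response text) and record the result."""
      out = []
      for p in probes:
          resp = model_call(p["prompt"])
          out.append({
              "id": p["id"],
              "category": p["category"],
              # Crude heuristic: only catches refusals that open with "I cannot".
              "refused": "I cannot" in resp[:200],
              "response_excerpt": resp[:600],
          })
      return out
  ```

  You will not match CAISI’s classified depth, but you will catch obvious regressions when your vendor silently swaps in a new checkpoint.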
- Audit your fine-tunes. Fine-tuning on permissive datasets can strip refusals much as deliberately removing safeguards does. If you fine-tune on customer corpora, run a small refusal-regression suite (50-100 prompts, half adversarial) on every checkpoint.
- Document AI provenance for your own customers. Enterprise procurement teams already ask “is this CAISI-evaluated?” Have a one-paragraph answer ready that names the underlying model, the lab, and the lab’s CAISI status.
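
The refusal-regression suite suggested for fine-tune audits can be sketched as a comparison between two checkpoints. This is a minimal illustration, not CAISI's method: it assumes each result is a dict with `id`, `category`, and `refused` keys (the shape the harness above records), and the function names and the 5-point `max_drop` threshold are hypothetical choices to tune for your own suite.

```python
from collections import defaultdict

def refusal_rates(results):
    """Return {category: fraction of probes refused}."""
    totals = defaultdict(int)
    refused = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        refused[r["category"]] += bool(r["refused"])
    return {c: refused[c] / totals[c] for c in totals}

def refusal_regressions(baseline, candidate, max_drop=0.05):
    """Return {category: (baseline_rate, candidate_rate)} for every category
    whose refusal rate fell by more than max_drop (hypothetical threshold).
    A category missing from the candidate run counts as a rate of 0.0."""
    base = refusal_rates(baseline)
    cand = refusal_rates(candidate)
    return {
        c: (base[c], cand.get(c, 0.0))
        for c in base
        if base[c] - cand.get(c, 0.0) > max_drop
    }

# Toy results for two checkpoints; real runs would use 50-100 probes.
baseline = [
    {"id": "p1", "category": "cyber", "refused": True},
    {"id": "p2", "category": "cyber", "refused": True},
    {"id": "p3", "category": "bio", "refused": True},
    {"id": "p4", "category": "bio", "refused": True},
]
candidate = [
    {"id": "p1", "category": "cyber", "refused": True},
    {"id": "p2", "category": "cyber", "refused": False},
    {"id": "p3", "category": "bio", "refused": True},
    {"id": "p4", "category": "bio", "refused": True},
]

print(refusal_regressions(baseline, candidate))
# {'cyber': (1.0, 0.5)}
```

Comparing per-category rates rather than one overall rate keeps a drop in a small category, such as biosecurity probes, from being averaged away by a larger one.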
How it compares
The May 5 expansion makes CAISI the central node, but it is not the only government program. Here is how the major US frameworks line up as of May 2026.
| Program | Operator | Scope | Mandatory? | Labs covered |
|---|---|---|---|---|
| CAISI Pre-Deployment Eval | NIST / Commerce | National security capability evaluation before release | Voluntary MOU | OpenAI, Anthropic, Google DeepMind, Microsoft, xAI |
| EO 14110 reporting (rescinded 2025, partial revival) | Commerce | Compute and capability disclosure for very large training runs | Partial (some thresholds retained) | All US labs above 10^26 FLOP |
| TRAINS Taskforce probes | DoD / DoE / DHS / IC | Classified red-teaming on CBRN, cyber, election risks | Embedded in CAISI MOU | Same as CAISI |
| State-level (CA SB 1047 successor, NY) | State AGs | Liability and safety attestations | Mandatory in-state | Any lab serving residents |
| UK AISI partnership | UK Department for Science | Parallel evals, shared findings | Voluntary, bilateral | Most major US labs |
The takeaway: CAISI now sits at the top of a layered system. State-level rules add liability; CAISI adds capability assessment; TRAINS adds classified probing. None of them blocks a release on its own.
What’s next
Three things to watch over the next two quarters.
First, the first public CAISI delay. No major US frontier release has been visibly held by CAISI feedback. Whether that is because no model has tripped a serious finding or because labs quietly rework problem behaviors before formal evaluation is unknown. The first public friction point — a major release pushed by even a week — would reset every lab’s internal timeline assumptions.
Second, an open-weights inclusion question. CAISI’s framework was built around closed-weight frontier labs. Meta’s Llama lineage, Mistral, and emerging Chinese open-weights releases (Qwen, DeepSeek) sit outside the structure. Pressure for an open-weights addendum — possibly tied to compute thresholds rather than corporate signature — is rising inside Commerce.
Third, alignment with EU AI Act timelines. The EU’s General-Purpose AI Code of Practice obligations begin biting in August 2026. US labs operating in Europe will need a story for both regimes. CAISI evaluation reports are positioned to become a recognized substitute for parts of the EU compliance package, which would dramatically reduce duplicate red-teaming costs.
Frequently Asked Questions
Is CAISI evaluation mandatory for AI labs?
No. CAISI agreements are voluntary memoranda of understanding. Labs choose to participate. The May 5 expansion shows that all five major US frontier labs have now opted in, which makes non-participation conspicuous — but no statute requires it.
Does CAISI testing delay a model launch?
It can, but only by negotiation. CAISI does not have unilateral authority to block a release. In practice, evaluations are scheduled to fit lab launch windows, and findings get fed back so the lab can patch issues before public availability. A formal “stop ship” is contractually possible but has not happened publicly.
What does CAISI actually test for?
Three priority risk families: cybersecurity (offensive code generation, vulnerability discovery, autonomous exploitation), CBRN uplift (chemical, biological, radiological, nuclear weapon synthesis assistance), and large-scale information integrity threats (election interference, mass-scale disinformation production). The TRAINS Taskforce supplies domain experts.
Why hand over models with safeguards removed?
Because the safeguards are the easy thing to bypass. CAISI wants to know what the underlying capability is, not what the polished consumer release blocks. A jailbreak you can ship a fix for is less concerning than a capability the model fundamentally has.
How does this compare to the UK AI Safety Institute?
The UK AISI runs a near-identical voluntary evaluation program with overlapping methodology. The expectation inside both governments is that joint or shared evaluations will increase, especially for transatlantic-deployed frontier models, and that finding-sharing will reduce duplicate red-teaming costs for the labs.
What does this mean for smaller AI startups?
Direct CAISI involvement is limited to frontier labs. Indirectly, every AI startup that builds on a Big Five API now sits downstream of these evaluations. The most useful action for a startup is to track the published findings, regression-test your own fine-tunes against them, and document model provenance for enterprise procurement.
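
One way to keep the provenance answer current is to generate it from a dependency inventory. The sketch below is an assumption-laden illustration: the `ModelDependency` fields, the function names, and the model and lab placeholders are all hypothetical, and `CAISI_LABS` simply hard-codes the five signatories named in this article, so it says nothing about whether a specific model was evaluated.

```python
from collections import defaultdict
from dataclasses import dataclass

# Labs this article lists as CAISI signatories as of May 2026.
CAISI_LABS = frozenset({"OpenAI", "Anthropic", "Google DeepMind", "Microsoft", "xAI"})

@dataclass(frozen=True)
class ModelDependency:
    feature: str  # product feature that calls the model
    model: str    # model identifier (placeholders below, not real model names)
    lab: str      # lab that trained the underlying model
    usage: str    # e.g. "API", "fine-tune base", "embedded"

def group_by_lab(deps):
    """Return {lab: [dependencies]} so exposure per lab is visible at a glance."""
    grouped = defaultdict(list)
    for dep in deps:
        grouped[dep.lab].append(dep)
    return dict(grouped)

def provenance_statement(dep):
    """One-sentence answer naming the model, the lab, and the lab's CAISI status."""
    if dep.lab in CAISI_LABS:
        status = "has signed a CAISI pre-deployment testing agreement"
    else:
        status = "is not among the labs with a CAISI agreement"
    return f"{dep.feature} uses {dep.model} ({dep.usage}) from {dep.lab}, which {status}."

# Hypothetical inventory for illustration only.
deps = [
    ModelDependency("Support agent", "example-chat-model", "Anthropic", "API"),
    ModelDependency("Search ranking", "example-open-model", "Example Open Lab", "fine-tune base"),
]

for lab, items in group_by_lab(deps).items():
    print(lab, [d.feature for d in items])
print(provenance_statement(deps[0]))
```

Keeping the inventory in code or config means the grouping by lab and the procurement answer come from the same source of truth.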