Four Chinese Open-Weights Coding Models Just Shipped in 12 Days

Four Chinese AI labs shipped frontier-grade open-weights coding models inside a twelve-day window in late April 2026. Z.ai’s GLM-5.1, MiniMax M2.7, Moonshot’s Kimi K2.6, and DeepSeek V4 all landed at roughly the same capability ceiling on agentic engineering benchmarks — and none of them costs more than a third of Claude Opus 4.7 to run. The release cluster forces a question Western developers can no longer wave away: are open-weights coding models from China now a serious alternative to closed frontier models for production work? The short answer, after a week of community benchmarking, is yes for most use cases. This guide unpacks what’s actually new, why it matters, and how to evaluate the four contenders for your own pipeline.

What’s actually new

The headline is the simultaneity. Open-weights coding models have been creeping closer to closed-frontier capability for a year, but the late-April release cluster compressed that progress into a twelve-day burst. Each lab released independently — there was no coordination — and yet each landed within a few benchmark points of the others. The implication: the recipe for a frontier-grade open-weights coding model is now well-understood inside multiple labs, and the bottleneck is no longer research, it’s compute and training data.

The four releases also share a common shape. All four ship under permissive licenses (Apache 2.0 or near-equivalents). All four target the agentic-coding use case specifically — long-horizon tool use, multi-file edits, error recovery — rather than the older “complete this function” benchmark suite. All four publish detailed eval reports. And all four are cheaper to run than equivalent closed models by a factor of three to ten, depending on your inference setup.

Specifics matter. Z.ai’s GLM-5.1 is a 235B-parameter mixture-of-experts model with 32 experts and 8B active parameters per token, putting it in the same architectural neighborhood as DeepSeek V3. It scores 76.4% on SWE-Bench Verified and 71.2% on Aider’s polyglot benchmark — within a point of GPT-5.5 on both. MiniMax M2.7 is dense at 65B parameters with full long-context support out to 1M tokens, optimized for codebases that exceed typical context windows. Kimi K2.6 is the smallest of the four at 32B parameters, leans heavily into reasoning before code generation, and ships with a custom inference runtime that gets 3x the throughput of standard llama.cpp on consumer hardware. DeepSeek V4 is the most ambitious — 671B total parameters, 37B active, with strong claims on math-heavy code (numerical computing, scientific simulation) where the others trail.

For deployment, the practical news is that all four publish quantized versions on Hugging Face the same day weights drop. GLM-5.1 in 4-bit fits on two H100s. MiniMax M2.7 fits on a single H100 with INT8 quantization. Kimi K2.6 runs comfortably on a 24GB consumer GPU. DeepSeek V4 needs a serious cluster, but the API price (run by DeepSeek directly) is $0.30 per million input tokens and $1.10 per million output — roughly one-tenth of Claude Opus 4.7.

Why it matters

  • The “open-weights tax” is gone. A year ago, choosing open-weights for coding meant accepting a 10-15% capability gap. Today, the gap is small enough that for most agentic-coding tasks you can’t tell the difference in production. The cost savings (often 70-90%) are real and immediate.
  • Vendor lock-in becomes a deliberate choice, not an architectural inevitability. Teams that built their stack around one closed-frontier API now have a credible second source. The standard procurement playbook applies: dual-source, run continuous evals, switch when the cost-quality curve crosses.
  • The competitive frontier moves to inference economics. When four labs ship the same capability the same week, capability stops being the differentiator. The new game is throughput per dollar, time-to-first-token, and how cleanly the model integrates with agent frameworks. This favors infrastructure operators (DeepSeek, Together, Fireworks, Groq) over base-model labs.
  • Self-hosting goes from niche to normal. A 32B-parameter Kimi K2.6 fits on a $2,000 GPU. For a team handling sensitive code that can’t go to a US-based API, self-hosting a coding model just got cheap and viable.
  • Compliance and data-residency questions get easier. EU teams, healthcare teams, and government teams that previously had to fight for closed-frontier model access can now point to open-weights coding models with comparable capability and run them on infrastructure they fully control.
  • The Western frontier labs face real pricing pressure. When DeepSeek V4 API costs are 10x cheaper for similar capability, Anthropic and OpenAI either drop prices, differentiate harder on agentic / multimodal capability, or watch volume migrate. Expect price cuts within the next two quarters.

How to use it today

If you’ve been running an agentic-coding workflow on a closed API, here’s the fastest way to swap in an open-weights coding model without rewriting your stack.

  1. Pick the right model for your hardware budget. Use the comparison table below as a quick filter. For a single H100, MiniMax M2.7 is the easiest fit. For consumer hardware, Kimi K2.6. For a managed API at the lowest cost, DeepSeek V4. For maximum capability on agentic benchmarks, GLM-5.1.
  2. Run the model behind an OpenAI-compatible API shim. All four ship with vLLM and SGLang support, both of which expose an OpenAI-compatible endpoint. Your existing client code that calls openai.chat.completions.create works unchanged — you just point it at a different base URL.
    MASK13
  3. Wire up tool-calling. All four support OpenAI-style function calling natively. If you’re already using LangChain, LlamaIndex, or a custom agent loop, the migration is changing the model name. If you’re using MCP via the OpenAI MCP adapter, the same flow works:
    MASK14
  4. Add a quality gate. Open-weights models are genuinely close to closed-frontier on most coding tasks but not all. Run your existing eval suite on the new model before flipping production traffic. The cleanest pattern: 5% canary traffic for a week, compare blast-radius metrics (PR merge rate, revert rate, customer-reported bugs) against the closed-frontier baseline. If they hold, ramp.
  5. Plan for the long-context cases. If your workload regularly exceeds 128K tokens (large repos, long multi-file edits), MiniMax M2.7 with its 1M context is the only one of the four that handles it cleanly. The others degrade at the upper end. Don’t migrate without checking your context-length distribution first.
  6. Set up cost tracking from day one. Self-hosting feels free; it’s not. Track GPU-hours, monitor utilization, and build a per-request cost figure you can compare against the closed-API baseline. Most teams break even on a self-hosted 32B model at 2-3M tokens per day; below that, the API is cheaper.

How it compares

Model Params (active) Context SWE-Bench Verified Aider Polyglot API cost / 1M output Min hardware
Z.ai GLM-5.1 Air 235B (8B) 128K 76.4% 71.2% $0.55 2x H100
MiniMax M2.7 65B dense 1M 74.8% 68.5% $0.80 1x H100
Moonshot Kimi K2.6 32B dense 128K 72.1% 66.0% $0.40 1x RTX 4090 (24GB)
DeepSeek V4 671B (37B) 128K 77.9% 72.0% $1.10 8x H100 (or API)
GPT-5.5 (closed) 200K 78.2% 72.4% $10.00 API only
Claude Opus 4.7 (closed) 500K 79.5% 73.8% $15.00 API only

The headline number: open-weights coding models are within 1-3 capability points of the closed frontier on the standard benchmarks, at one-tenth to one-thirtieth the cost. For workloads where every percentage point of capability translates to measurable revenue (high-stakes engineering, autonomous code-shipping), the closed-frontier price premium is still defensible. For workloads where the model is part of a broader pipeline and the cost is felt directly (high-volume agent loops, customer-facing copilots), the open-weights option is now hard to beat.

One caveat the table doesn’t show: refusal rates. The Chinese open-weights models have meaningfully different refusal patterns than Western closed models — both more permissive in some areas (less hesitant on ambiguous code requests) and more restrictive in others (some politically sensitive topics). For pure-coding workloads this rarely matters; for general-purpose assistants, evaluate carefully.

What’s next

The twelve-day release cluster is itself the news. The next phase is consolidation — which of the four labs sustains the cadence and which falls behind. Smart money watches three signals.

First, does Western frontier pricing crack? Anthropic and OpenAI have publicly held the line on premium pricing while justifying it with capability. If DeepSeek V4 captures meaningful API volume in Q3, expect counter-moves: tiered pricing, “good enough” model variants, or aggressive volume discounts to enterprise. Watch for OpenAI announcing a “GPT-5.5 mini” or Anthropic positioning Sonnet 4.6 more aggressively against open-weights at the value tier.

Second, does the U.S. government respond? Open-weights models from Chinese labs running in U.S. infrastructure raise compliance and supply-chain questions. The current administration has signaled wariness without specific restrictions. Possible 2026 H2 actions: export-control parallels for foreign-origin model weights, mandatory disclosures for AI systems used in regulated industries, or stricter procurement guidance for federal contractors. None of these are guaranteed; all are credible.

Third, does the agent-framework ecosystem fully embrace open-weights? The major frameworks (LangGraph, CrewAI, AutoGen) all support arbitrary OpenAI-compatible endpoints in principle, but their default templates, tutorials, and recommended models still point at closed APIs. Expect updated guidance in Q3 that puts open-weights options on equal footing — at which point the migration friction drops further.

The deeper trend underneath all of this: the gap between Western and Chinese frontier capability is now measured in single percentage points, while the gap in deployment economics is multiples. That asymmetry doesn’t reverse on its own. Either Western labs find a way to ship comparable capability at comparable cost, or the buy-vs-build calculus permanently shifts toward self-hosted open-weights for cost-sensitive workloads. The next two quarters will tell us which way it breaks.

Frequently Asked Questions

Are these models safe to use in production?

The base-model safety story is comparable to closed Western models — they refuse the obvious harmful requests, leak the obvious training-data artifacts, and have predictable jailbreaking surface area. The differentiator is operational: when you self-host an open-weights coding model, you are responsible for monitoring, abuse prevention, and incident response. If your team has run any other inference workload in production, you have the muscle. If not, the closed APIs handle this for you.

What about training data and IP risk?

All four labs publish their training-data summaries; none publishes the full corpora. The risk profile is similar to Western open-weights coding models — assume the model has been trained on a broad mix of public code, including code under licenses that prohibit redistribution. For high-IP-stakes workloads, consider a code-attribution layer (BigCode’s stack-attribution, GitHub’s blame matching) on the model’s output as a defensive measure.

Will my existing agent framework work with these?

Yes. All four expose OpenAI-compatible APIs through vLLM, SGLang, or their hosted endpoints. LangChain, LlamaIndex, AutoGen, CrewAI, and the OpenAI Agents SDK all work unchanged with a base URL swap and a model name change. MCP integration works through the OpenAI MCP adapter. The migration friction is genuinely small.

Should I switch from Claude or GPT-5.5 today?

Run the eval. Capability gap is small enough that for many workloads the answer is “yes, save the money.” For workloads where you’ve extensively tuned prompts to a specific model’s quirks, expect a few percentage points of regression on the new model that you’ll need to recover through prompt iteration. Plan for two to four weeks of evaluation and adjustment, not a one-day flip.

What about latency?

Self-hosted on appropriate hardware, all four are comparable to or faster than the closed-frontier APIs. DeepSeek V4 hosted (their own infrastructure) trails GPT-5.5 by about 30% on time-to-first-token but matches on total tokens-per-second once the response starts. Kimi K2.6 on consumer hardware is meaningfully slower than the API alternatives — buy the inference time only if the cost savings justify it.

How does this affect closed-source coding tools like Cursor and Copilot?

The product layer doesn’t move much. Cursor and Copilot win on editor integration, indexing, and UX — the model is one component. But the model line item on their cost structure just dropped meaningfully if they choose to swap. Expect price drops, expanded free tiers, or new “premium” tiers that justify the closed-model premium with non-model features (advanced indexing, team collaboration). The end user wins either way.

Scroll to Top