Cerebras Systems just published an inference benchmark that resets expectations for what real-time AI looks like at frontier model size: 969 tokens per second on Llama 3.1 405B and roughly 3,000 tokens per second on gpt-oss-120B, with Groq’s competing LPU stack measuring around 476 t/s on the same gpt-oss-120B workload. The numbers matter because inference speed determines whether agentic workflows, real-time voice assistants, and long-document reasoning feel snappy or feel broken. At 969 t/s on a 405-billion-parameter model, an answer that took 30 seconds in 2024 finishes in under three. The gap between Cerebras and Groq is no longer hypothetical: it’s measured, replicable, and meaningful for anyone choosing where to run inference in mid-2026.
What’s actually new
Cerebras shipped two records in close succession. The first: 969 tokens per second on Meta’s Llama 3.1 405B, the largest open-weights model in wide production use. That number is the highest any third-party benchmark has measured on a frontier-sized open model. The second: approximately 3,000 tokens per second on gpt-oss-120B, the open-weights model OpenAI released in August 2025, against Artificial Analysis’s measurement of around 476 t/s for Groq on the same workload. Cerebras’s CS-3 wafer-scale chip, with its 900,000 cores and 44 GB of on-chip SRAM, does per device what GPU-based inference needs dozens of accelerators to do.
The technical story underneath is that Cerebras keeps the entire model on chip. Most GPU inference systems pay a heavy memory-bandwidth tax shuttling weights between HBM and the compute units. The CS-3’s wafer-scale design eliminates the tax for models that fit. For Llama 3.1 405B, the model fits across a small cluster of CS-3s with weights pinned to on-chip SRAM; for gpt-oss-120B, it fits comfortably on fewer chips. The latency profile then becomes dominated by the actual matrix-multiply work, not by data movement. Groq’s LPU achieves something architecturally similar (deterministic execution, on-chip weight residence) but at smaller per-chip scale, which is why Groq wins on smaller models and tight time-to-first-token while Cerebras pulls away on the largest models.
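A back-of-the-envelope bound shows why the bandwidth tax dominates. In single-stream decoding of a dense model, every generated token reads each weight once, so tokens per second cannot exceed aggregate memory bandwidth divided by model size in bytes. The sketch below is illustrative only. The 16-bit weights, the 3.35 TB/s per-GPU HBM figure, the eight-GPU node, and the 21 PB/s vendor-quoted SRAM figure are all assumptions rather than numbers from the benchmarks above, and the bound ignores batching, KV-cache traffic, interconnect overhead, and speculative decoding.

```python
def decode_ceiling_tps(params_billion: float, bytes_per_param: float,
                       bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed (tokens/s) when every
    generated token must read all model weights from memory once."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes


# Assumption: Llama 3.1 405B stored as 16-bit weights (2 bytes per parameter).
# Assumption: 8 GPUs at 3.35 TB/s of HBM bandwidth each, perfectly aggregated.
gpu_node_ceiling = decode_ceiling_tps(405, 2, 8 * 3.35)

# Assumption: on-chip SRAM bandwidth on the order of 21 PB/s (21,000 TB/s).
sram_ceiling = decode_ceiling_tps(405, 2, 21_000)

print(f"GPU node ceiling:     {gpu_node_ceiling:,.0f} t/s")
print(f"On-chip SRAM ceiling: {sram_ceiling:,.0f} t/s")
```

Under these assumptions the GPU node tops out in the low tens of tokens per second per stream, while the SRAM ceiling sits far above the measured 969 t/s. That is consistent with the point above: once weights live on chip, the limit is the compute itself rather than data movement.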
The benchmarks are not theoretical. Cerebras Inference is a hosted API customers can hit today, with documented SLAs. Groq’s API has been live for over a year. Both publish per-token pricing. The speed difference at the top end is no longer a research datapoint — it’s a production architecture decision for any team running real-time AI at scale.
Why it matters
- Agentic workflows become viable at frontier model size. Multi-step agent loops that need 5-10 model calls were latency-prohibitive on 405B-class models when each call took 8-15 seconds. At 969 t/s, those calls finish in 1-3 seconds, which makes interactive agents on the largest open model finally usable.
- Real-time voice on big models is now possible. Voice agents need round-trip latency under 800ms to feel natural. Cerebras inference speed brings frontier-sized models into that envelope for short responses, opening conversational use cases that used to require model distillation.
- Open models close the gap with closed inference economics. The argument for proprietary frontier APIs has always been “they’re faster and easier than self-hosted open models.” Cerebras and Groq are dismantling the speed half of that argument; the ease half remains, but it’s a smaller moat than it was.
- The cost-per-token landscape gets weirder. Wafer-scale and LPU economics are different from GPU economics. At high throughput on the largest models, Cerebras can offer pricing that competes with mid-tier API offerings while delivering frontier latency. The pricing matrix that procurement teams have to reason about expands.
- Long-context reasoning gets a usability upgrade. A model working through 200K tokens at 200 t/s takes roughly 17 minutes. At 969 t/s, the same job finishes in about three and a half minutes. The difference between “set it and walk away” and “stay engaged” is precisely that gap.
- Hardware diversity in AI inference becomes a real strategy. A team that locks all inference to a single vendor takes on real risk; a team that splits workloads — Groq for latency-critical, Cerebras for throughput-critical, GPU clusters for fine-tuning and batch — gets a more resilient stack with better unit economics.
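The latency claims in the list above all reduce to one piece of arithmetic: time is roughly time-to-first-token plus tokens divided by throughput. The sketch below works that through for the scenarios mentioned. The helper names, the 7-call agent loop with 1,500 tokens per call, and the 0.3-second time-to-first-token are hypothetical values chosen for illustration, not measurements.

```python
def generation_seconds(tokens: int, tokens_per_second: float,
                       ttft_seconds: float = 0.0) -> float:
    """Approximate wall-clock time: time-to-first-token plus streaming time."""
    return ttft_seconds + tokens / tokens_per_second


def tokens_within_budget(budget_seconds: float, tokens_per_second: float,
                         ttft_seconds: float = 0.0) -> int:
    """How many tokens fit inside a latency budget (0 if TTFT alone exceeds it)."""
    remaining = max(0.0, budget_seconds - ttft_seconds)
    return int(remaining * tokens_per_second)


# Long-context case from the text: 200K tokens at 200 t/s versus 969 t/s.
slow_minutes = generation_seconds(200_000, 200) / 60
fast_minutes = generation_seconds(200_000, 969) / 60

# Hypothetical agent loop: 7 sequential calls of 1,500 tokens each at 969 t/s.
agent_loop_seconds = 7 * generation_seconds(1_500, 969)

# Voice case: 800 ms round trip, with an assumed 0.3 s time-to-first-token.
voice_tokens = tokens_within_budget(0.8, 969, ttft_seconds=0.3)

print(f"200K tokens: {slow_minutes:.1f} min vs {fast_minutes:.1f} min")
print(f"Agent loop:  {agent_loop_seconds:.1f} s")
print(f"Voice reply: up to {voice_tokens} tokens inside the budget")
```

The 200K-token case comes out to about 16.7 minutes versus about 3.4 minutes. The voice budget is the most sensitive to the assumed time-to-first-token, so that input is worth measuring on the actual provider rather than guessing.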
How to use Cerebras inference today
Cerebras Inference exposes an OpenAI-compatible API surface that drops into existing code with minor changes. The platform supports Llama 3.1 in 8B, 70B, and 405B sizes, gpt-oss-120B, and a handful of other open-weights models. Pricing is per-token with a free tier for evaluation. Three steps get a working integration:
- Provision an API key. Sign up at cloud.cerebras.ai, generate an API key from the dashboard, and store it as the CEREBRAS_API_KEY environment variable.
- Switch your existing OpenAI client. Cerebras’s API is OpenAI-compatible. Most existing code needs only a base URL change and the Cerebras key.
- Pick the right model for your workload. For raw throughput on a frontier-size model, use Llama 3.1 405B. For balanced cost and speed, use gpt-oss-120B or Llama 3.1 70B. For tight latency on smaller workloads, Llama 3.1 8B runs at roughly 1,800 t/s on Cerebras and is hard to beat anywhere else.
Drop-in Python integration: