Cerebras Hits 969 Tokens/Sec on Llama 3.1 405B Inference

Cerebras Systems just published an inference benchmark that resets expectations for what real-time AI looks like at frontier model size: 969 tokens per second on Llama 3.1 405B and roughly 3,000 tokens per second on gpt-oss-120B, with Groq’s competing LPU stack measuring around 476 t/s on the same gpt-oss-120B workload. The numbers matter because Cerebras inference speed is the metric that controls whether agentic workflows, real-time voice assistants, and long-document reasoning feel snappy or feel broken. At 969 t/s on a 405-billion-parameter model, an answer that took 30 seconds in 2024 finishes in under three. The gap between Cerebras and Groq is no longer hypothetical — it’s measured, replicable, and meaningful for anyone choosing where to run inference in mid-2026.

What’s actually new

Cerebras shipped two records in close succession. The first: 969 tokens per second on Meta’s Llama 3.1 405B, the largest open-weights model in wide production use. That number is the highest any third-party benchmark has measured on a frontier-sized open model. The second: approximately 3,000 tokens per second on the gpt-oss-120B open model released by OpenAI earlier in 2026, against Artificial Analysis’s measurement of around 476 t/s for Groq on the same workload. Cerebras’s CS-3 wafer-scale chip, with its 900,000 cores and 44 GB of on-chip SRAM, is doing in one device what cluster-based GPU inference does in dozens.

The technical story underneath is that Cerebras keeps the entire model on chip. Most GPU inference systems pay a heavy memory-bandwidth tax shuttling weights between HBM and the compute units. The CS-3’s wafer-scale design eliminates the tax for models that fit. For Llama 3.1 405B, the model fits across a small cluster of CS-3s with weights pinned to on-chip SRAM; for gpt-oss-120B, it fits comfortably on fewer chips. The latency profile then becomes dominated by the actual matrix-multiply work, not by data movement. Groq’s LPU achieves something architecturally similar (deterministic execution, on-chip weight residence) but at smaller per-chip scale, which is why Groq wins on smaller models and tight time-to-first-token while Cerebras pulls away on the largest models.

The benchmarks are not theoretical. Cerebras Inference is a hosted API customers can hit today, with documented SLAs. Groq’s API has been live for over a year. Both publish per-token pricing. The speed difference at the top end is no longer a research datapoint — it’s a production architecture decision for any team running real-time AI at scale.

Why it matters

  • Agentic workflows become viable at frontier model size. Multi-step agent loops that need 5-10 model calls were latency-prohibitive on 405B-class models when each call took 8-15 seconds. At 969 t/s, those calls finish in 1-3 seconds, which makes interactive agents on the largest open model finally usable.
  • Real-time voice on big models is now possible. Voice agents need round-trip latency under 800ms to feel natural. Cerebras inference speed brings frontier-sized models into that envelope for short responses, opening conversational use cases that used to require model distillation.
  • Open models close the gap with closed inference economics. The argument for proprietary frontier APIs has always been “they’re faster and easier than self-hosted open models.” Cerebras and Groq are dismantling the speed half of that argument; the ease half remains, but it’s a smaller moat than it was.
  • The cost-per-token landscape gets weirder. Wafer-scale and LPU economics are different from GPU economics. At high throughput on the largest models, Cerebras can offer pricing that competes with mid-tier API offerings while delivering frontier latency. The pricing matrix that procurement teams have to reason about expands.
  • Long-context reasoning gets a usability upgrade. A model processing 200K tokens of context at 200 t/s takes 16 minutes for a long answer. At 969 t/s, the same task finishes in 3 minutes. The difference between “set it and walk away” and “stay engaged” is precisely that gap.
  • Hardware diversity in AI inference becomes a real strategy. A team that locks all inference to a single vendor takes on real risk; a team that splits workloads — Groq for latency-critical, Cerebras for throughput-critical, GPU clusters for fine-tuning and batch — gets a more resilient stack with better unit economics.

How to use Cerebras inference today

Cerebras Inference exposes an OpenAI-compatible API surface that drops into existing code with minor changes. The platform supports Llama 3.1 8B, 70B, 405B, gpt-oss-120B, and a handful of other open-weights models. Pricing is per-token with a free tier for evaluation. Three steps get a working integration:

  1. Provision an API key. Sign up at cloud.cerebras.ai, generate an API key from the dashboard, and store it as the CEREBRAS_API_KEY environment variable.
  2. Switch your existing OpenAI client. Cerebras’s API is OpenAI-compatible. Most existing code needs only a base URL change and the Cerebras key.
  3. Pick the right model for your workload. For raw throughput on frontier-size, use Llama 3.1 405B. For balanced cost-and-speed, gpt-oss-120B or Llama 3.1 70B. For tight latency on smaller workloads, Llama 3.1 8B will run at 1,800 t/s on Cerebras and is hard to beat anywhere else.

Drop-in Python integration:

MASK13

Streaming is the right default. At 969 t/s, the user perceives near-instant output rather than waiting for a complete response. Most Cerebras integrations stream by default and never look back.

How it compares

The inference-speed landscape in May 2026 has clear leaders for different workload shapes. The table summarizes measured throughput across the four production-ready inference platforms on the most-cited benchmarks. Numbers are tokens per second from the most recent published benchmarks; they shift as platforms tune.

Platform Llama 3.1 405B gpt-oss-120B Llama 3.1 70B Llama 3.1 8B Sweet spot
Cerebras CS-3 ~969 t/s ~3,000 t/s ~450 t/s ~1,800 t/s Throughput on big models
Groq LPU n/a (size limit) ~476 t/s ~250 t/s ~1,200 t/s Low time-to-first-token
NVIDIA H100/H200 ~150-220 t/s ~280 t/s ~120 t/s ~700 t/s Flexibility, fine-tuning
SambaNova ~440 t/s ~600 t/s ~250 t/s ~900 t/s Enterprise on-prem

Two takeaways. First, Cerebras dominates raw throughput on frontier-sized open models, and the gap widens as models get larger. The wafer-scale advantage compounds with model size. Second, the right platform depends on the workload, not the brand. Latency-sensitive consumer apps may prefer Groq’s time-to-first-token; high-throughput enterprise reasoning may prefer Cerebras; fine-tuning and bespoke architectures still lean on NVIDIA. A multi-platform stack is the architectural pattern that has emerged for serious AI shops.

What’s next

Three things to watch over the next two quarters. First, Cerebras’s path to Llama 3 successors and Llama 4 (when it lands). Each generation pushes parameter counts higher, and the wafer-scale advantage gets more compelling at every step. Cerebras has hinted publicly at upcoming benchmarks on 700B-class models that the GPU stack genuinely cannot match for real-time inference. Second, Groq’s response. Groq has expanded its LPU compute footprint and is targeting larger models; whether they can match Cerebras’s throughput at the top end remains to be seen, but a competitive Groq pushes both vendors forward. Third, NVIDIA’s reaction. NVIDIA’s Blackwell B200 and the upcoming Rubin generation aim to close the inference-speed gap through different architectural paths (more memory bandwidth, sparser compute, FP4 support). The 2027 inference landscape may look very different from today’s.

The longer-term implication is that inference is becoming a specialized hardware market with multiple credible architectures rather than a GPU monoculture. Buyers benefit. The economics of frontier-model deployment improve as competition matures. The architectural decision for AI teams shifts from “where do we run inference” to “which inference platform fits which workload” — a more nuanced and lower-risk question than the previous lock-in default.

Frequently Asked Questions

Is Cerebras Inference available to anyone, or only enterprise customers?

Available to anyone with an API key. Cerebras Cloud has a free evaluation tier with rate limits, then per-token pricing on production usage. Enterprise contracts with committed volume and SLAs are also available, but no enterprise gate sits between developers and the API. The signup flow at cloud.cerebras.ai is self-service.

How does Cerebras inference speed translate into real-world latency for a chat application?

For a 500-token answer at 969 t/s, the generation phase takes about 0.52 seconds. Add network round-trip and prompt processing, and a typical chat exchange completes in under one second on Llama 3.1 405B. That is the difference between an app that feels alive and one that feels sluggish. Voice apps doing TTS on top of the model output need additional headroom, but the latency budget is workable where it was previously not.

Can Cerebras run my fine-tuned or custom model?

The hosted Cerebras Inference service runs supported open-weights models out of the box. Customer-specific fine-tunes are supported through Cerebras’s enterprise offering, which includes dedicated CS-3 capacity for the customer’s models. The self-service tier focuses on the standard open-weights catalog because that’s what wafer-scale economics work best for; bespoke deployments require sales engagement.

How does the cost compare to running on GPUs?

For high-throughput inference on frontier-size open models, Cerebras pricing is competitive with or better than GPU-based services like Together AI or Anyscale. The advantage compounds with model size — the larger the model, the more Cerebras’s wafer-scale economics outperform GPU clusters. For smaller models or batch workloads, GPUs remain price-competitive. Run a representative benchmark on your actual workload before committing.

What models will Cerebras support next?

Cerebras tracks the major open-weights releases. Llama 4 (when it lands), gpt-oss successor models, and the larger Chinese open-weights models (DeepSeek V4, Kimi K2.6, GLM-5.1) are the obvious near-term additions. The platform’s design supports arbitrary transformer architectures, so the limiting factor is engineering work to add a model rather than fundamental capability.

Does Cerebras inference speed matter for everyone, or only specific use cases?

It matters most for use cases where end-to-end latency drives user experience — agentic loops, real-time voice, interactive long-form generation, real-time decisioning. For batch workloads (overnight document processing, periodic report generation, embedding pipelines), throughput per dollar matters more than wall-clock latency, and the platform choice is more nuanced. Don’t pick Cerebras just for the headline benchmark; pick it where the benchmark translates into actual user-experience or operational improvements.

Scroll to Top