Cerebras

Executive Summary

What it is: Cerebras is an inference provider that runs open-source models (GLM, GPT OSS, Llama, Qwen) on wafer-scale chips, claiming 20x faster throughput than GPU-based providers. Its Cerebras Code product offers subscription-based coding agent access at $50/mo (Pro) and $200/mo (Max), and the Developer API operates on pay-per-token pricing starting at a $10 minimum payment. There is no first-party model; Cerebras hosts third-party open-weight models.

What to watch out for: Cerebras Code Pro and Max have been sold out for an extended period with no restock ETA. Free and Developer tier rate limits are undisclosed (described only as "10x higher than Free" for the paid tier, with no absolute numbers). Llama 3.1 8B and Qwen 3 235B Instruct will be deprecated on May 27, 2026. GLM 4.7 and GPT OSS 120B are preview models not intended for production use.

Bottom line: Cerebras's speed advantage is real (1,000 to 3,000 tok/s, confirmed by the published pricing page), but the practical impact for agentic coding, where thinking time often dominates generation speed, is limited. With the Code product sold out, two models pending deprecation, and rate limits opaque, Cerebras is difficult to evaluate for production use. It is best suited as a low-latency inference backend for specific high-throughput tasks where token generation speed is the bottleneck.

Key Terms

  • Wafer-scale inference - Cerebras runs inference on CS-3 wafer-scale chips instead of GPUs, claiming 20x faster throughput than GPU-based providers. Source: Cerebras – Pricing
  • Cerebras Code - a subscription coding product (separate from the API) providing token-based access to open-source models for IDE integrations and agentic workflows. Source: Cerebras – Introducing Cerebras Code
  • Pay-per-token - the Developer API tier charges per token processed, similar to other inference providers. Source: Cerebras – Cerebras Inference Now Available Via Pay Per Token
  • Inference provider - Cerebras does not train its own models. It runs third-party open-source models (GLM, GPT OSS, Llama, Qwen) on its hardware at high speed. Source: Cerebras – Pricing
  • Preview model - a model available on the platform for evaluation only, not recommended for production workloads. GLM 4.7 and GPT OSS 120B are currently in preview. Source: Cerebras – Pricing

Latest Changes

First report for this supplier. All models, plans, and pricing are listed as current state.

  • Deprecation (upcoming): Llama 3.1 8B and Qwen 3 235B Instruct will be deprecated on May 27, 2026.
  • Feature added: Blog posts on multi-agent workflows and a Figma multi-agent integration (April 16, 2026).
  • Feature added: MCP vs. CLI speed debate blog post (April 6, 2026).
  • Plan change: Cerebras Code Pro ($50/mo) and Max ($200/mo) remain sold out with no restock ETA.
  • Feature added: Cerebras available through AWS Marketplace (March 13, 2026).

Plans

Inference API Tiers

| Plan | Price | Usage | Key Inclusions |
|---|---|---|---|
| Free | $0 | Undisclosed rate limits | Access to all models, community support via Discord |
| Developer | Starting at $10 (self-serve payment) | 10x higher rate limits than Free | Everything in Free, higher priority processing |
| Enterprise | Custom (contact sales) | Highest rate limits | Dedicated queue, custom model weights, model fine-tuning, dedicated support |

Cerebras Code (Coding Agent Product)

| Plan | Price | Usage | Key Inclusions |
|---|---|---|---|
| Pro | $50/month | 24M tokens/day ($48/day value) | For indie devs and simple agentic workflows. Sold out as of April 2026 |
| Max | $200/month | 120M tokens/day ($240/day value) | For full-time development. Sold out as of April 2026 |

Source: Cerebras – Pricing

API Pricing

Developer API Per-Token Pricing

| Model | Speed | Input ($/MTok) | Output ($/MTok) | Notes |
|---|---|---|---|---|
| ZAI GLM 4.7 | ~1,000 tok/s | $2.25 | $2.75 | Preview model, not for production |
| OPENAI GPT OSS 120B | ~3,000 tok/s | $0.35 | $0.75 | Preview model, not for production |
| META Llama 3.1 8B | ~2,200 tok/s | $0.10 | $0.10 | Deprecated May 27, 2026 |
| QWEN Qwen 3 235B Instruct | ~1,400 tok/s | $0.60 | $1.20 | Deprecated May 27, 2026 |

Implied Code product pricing: Cerebras Code Pro provides 24M tokens/day at $50/month, which works out to roughly $0.07/1M tokens if fully utilized (720M tokens/month). This is significantly below API rates, suggesting the Code product bundles a subsidized allocation.
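The implied rate above can be reproduced with a quick back-of-the-envelope calculation, assuming the full daily allowance is used every day of a 30-day month:

```python
# Implied per-token cost of Cerebras Code Pro at full utilization.
# Assumes a 30-day billing month; actual billing terms are not published.
monthly_price = 50.00        # Cerebras Code Pro, $/month
tokens_per_day = 24_000_000  # daily token allowance
days_per_month = 30          # assumed month length

tokens_per_month = tokens_per_day * days_per_month          # 720M tokens
cost_per_mtok = monthly_price / (tokens_per_month / 1_000_000)

print(f"{tokens_per_month / 1_000_000:.0f}M tokens/month")  # 720M tokens/month
print(f"${cost_per_mtok:.3f} per 1M tokens")                # $0.069 per 1M tokens
```

At roughly $0.07/MTok against API rates of $0.10 to $2.75/MTok, the subsidy only materializes if the daily allowance is actually consumed; light users pay an effectively higher per-token rate.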

Source: Cerebras – Pricing
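For orientation, pay-per-token inference providers generally accept an OpenAI-style chat-completions request, and the sketch below builds such a request body. The endpoint URL and model identifier are illustrative assumptions, not values confirmed by the pricing page:

```python
import json

# Hypothetical pay-per-token request body in the common OpenAI-compatible
# chat-completions shape. Endpoint and model id are assumptions for
# illustration; check the provider's API docs for actual values.
ENDPOINT = "https://api.cerebras.ai/v1/chat/completions"  # assumed URL
payload = {
    "model": "llama3.1-8b",  # assumed id; billed at $0.10/MTok in and out
    "messages": [
        {"role": "user", "content": "Summarize wafer-scale inference in one line."}
    ],
    "max_tokens": 256,  # caps billable output tokens for this request
}
body = json.dumps(payload)  # JSON string to POST with an Authorization header
```

Because billing is per token processed, `max_tokens` doubles as a per-request cost ceiling on the output side.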

Model Performance / Benchmarks

Cerebras does not train its own models and does not publish benchmark scores. It hosts third-party open-weight models on wafer-scale chips. Key performance claim:

  • 1,000 to 3,000 tokens/second generation speed (confirmed by the published pricing page)
  • 20x faster throughput than GPU-based providers (vendor claim)

The practical impact for agentic coding, where thinking time often dominates generation speed, is debated in the community. Speed advantage is most relevant for high-throughput tasks where token generation is the bottleneck.
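The trade-off can be made concrete with a rough latency model: each agent step pays a fixed overhead (planning, tool calls, network round trips) that generation speed does not touch. All numbers below are illustrative assumptions, not measurements:

```python
# Rough end-to-end latency model for one agent step:
# fixed non-generation overhead + token generation time.
# All inputs are illustrative assumptions, not measured values.

def step_latency(output_tokens: int, tok_per_s: float, overhead_s: float) -> float:
    """Seconds for one agent step: fixed overhead plus generation time."""
    return overhead_s + output_tokens / tok_per_s

tokens = 1_000   # output tokens per agent step (assumed)
overhead = 8.0   # seconds of thinking/tool-use overhead per step (assumed)

gpu = step_latency(tokens, 100, overhead)         # assumed GPU baseline, ~100 tok/s
cerebras = step_latency(tokens, 2_000, overhead)  # mid-range of 1,000-3,000 tok/s

print(f"GPU:      {gpu:.1f}s")       # 18.0s
print(f"Cerebras: {cerebras:.1f}s")  # 8.5s
```

Generation alone is 20x faster in this sketch (10s vs. 0.5s), but the fixed overhead compresses the end-to-end speedup to roughly 2x, which is the community's point about thinking-time-dominated workloads.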

Source: Cerebras – Pricing

Latest News

Lessons Learned from Building Multi-Agent Workflows (April 16, 2026)

Blog post sharing engineering insights from deploying multi-agent systems on Cerebras infrastructure. Covers practical challenges in orchestrating multiple AI agents for production workloads.

Source: Cerebras – Lessons Learned From Building Multi Agent Workflows

Figma MultiAgents (April 16, 2026)

Blog post about a multi-agent integration with Figma, demonstrating how fast inference enables real-time collaborative AI-assisted design workflows.

Source: Cerebras – Figma Multiagents

The Debate of MCP vs. CLI Centers on Speed (April 6, 2026)

Blog post arguing that the choice between MCP (Model Context Protocol) servers and CLI-based agent interfaces comes down to inference speed. Cerebras positions its speed advantage as critical for CLI-based agentic workflows where round-trip latency compounds.

Source: Cerebras – Mcpvscli

Cerebras Coming to AWS (March 13, 2026)

Cerebras announced availability through AWS Marketplace, allowing customers to access Cerebras's low-latency inference through AWS with flexible pricing. This is a distribution expansion, not a new product.

Source: Cerebras – Cerebras Is Coming To Aws

Introducing Cerebras Code (August 1, 2025)

Cerebras Code launched as a subscription coding product with Pro ($50/month) and Max ($200/month) tiers. Both tiers are currently sold out.

Source: Cerebras – Introducing Cerebras Code

Qwen3 Coder 480B Live on Cerebras (August 1, 2025)

The Qwen3 Coder 480B model, a key coding-focused model in the lineup, was made available on Cerebras inference.

Source: Cerebras – Qwen3 Coder 480B Is Live On Cerebras

Community Signals

  • Cerebras Code Pro and Max tiers have been sold out for an extended period. No ETA for reopening has been communicated. This limits access to the most cost-effective coding plans.
  • Cerebras is frequently mentioned in discussions about inference speed, with community benchmarks confirming 10-20x faster token generation compared to GPU-based providers. However, the practical impact for coding tasks (where thinking time dominates over generation speed) is debated.
  • The OpenAI Codex Spark partnership (February 2026) generated significant interest, as it positions Cerebras as the fast-inference backend for an OpenAI product.

Enterprise Readiness

| Feature | Available? | Details |
|---|---|---|
| SSO (SAML) | No | Not mentioned. Enterprise plan requires contacting sales. |
| SSO (OIDC) | No | Not mentioned. |
| SCIM | No | Not mentioned. |
| Audit logs | No | Not mentioned. |
| IP indemnity | No | Not mentioned. Cerebras is an inference provider, not a model creator. |
| Data residency | No | Not mentioned. |
| HIPAA | No | Not mentioned. |
| Air-gapped / on-prem | No | Not available. Cerebras requires its proprietary wafer-scale hardware. |
| SLA | No | No published SLA. |
| Admin controls (RBAC) | No | No admin controls documented. |

Transparency Gaps

| Metric | Status | Notes |
|---|---|---|
| Free tier rate limits | Undisclosed | No published requests/minute or tokens/day |
| Developer tier rate limits | Undisclosed | Described as "10x higher than Free" with no absolute numbers |
| Enterprise pricing | Undisclosed | Contact sales required |
| Cerebras Code restock timeline | Undisclosed | Pro and Max have been sold out with no announced reopening date |
| Preview model GA timeline | Undisclosed | GLM 4.7 and GPT OSS 120B are marked as preview with no stated production readiness date |
| Code product model allocation | Undisclosed | Pro and Max plans do not specify which models are available or whether model selection is restricted |