Cerebras

Executive Summary

What it is: Cerebras is an inference-only provider that runs open-weight models from other companies (OpenAI OSS, Zhipu AI, Alibaba Qwen, Moonshot AI, Meta, Mistral, DeepSeek, and others) on custom wafer-scale chips at speeds up to 3,000 tokens/sec. It offers a per-token API (Free/Developer/Enterprise tiers), a subscription coding product called Cerebras Code (currently sold out), and dedicated endpoints for enterprise customers. Pricing ranges from free (rate-limited shared API) to pay-per-token ($0.35-$2.75 per million tokens depending on model) to custom enterprise contracts.

What to watch out for: Both Cerebras Code plans (Pro at $50/month and Max at $200/month) are sold out with no announced timeline for re-opening, limiting access to the flat-rate coding tier. The shared API lacks prompt caching pricing, making agentic workflows with long conversation histories expensive compared to providers like Anthropic or DeepSeek that discount cached input tokens. Only two models are available on the public shared endpoint (GPT OSS 120B and GLM 4.7); access to the broader model catalog (Qwen3 235B, Kimi K2.6, DeepSeek V3.2, etc.) requires an enterprise dedicated endpoint.

Bottom line: Cerebras delivers very high inference speed for open-weight models (up to ~3,000 tok/s on GPT OSS 120B by its own measurement), but its current value proposition is narrow: two models on the shared API, sold-out subscription plans, and enterprise-only access to the full model catalog. Teams evaluating Cerebras should test the free API tier first and verify that the available models meet their quality requirements before committing to enterprise contracts.

Key Terms

  • Wafer-Scale Engine (WSE-3) -- Cerebras's custom AI processor, 58x larger than a standard GPU chip, designed for ultra-fast inference and training. Source: Cerebras – Chip
  • Dedicated Endpoint -- A private, provisioned inference instance reserved for a single organization, offering guaranteed throughput, custom model weights, and fine-tuning. Not available on the shared API. Source: Inference-Docs – Overview
  • Cerebras Code -- A subscription product providing flat-rate access to open-weight models for coding workflows via API, usable with third-party IDEs and CLI tools (not a standalone IDE). Source: Cerebras – Pricing
  • Prompt Caching -- A feature that stores and reuses previously processed prompt tokens to reduce latency and cost for repeated queries. Cerebras supports prompt caching on its API but does not publicly disclose a pricing discount for cached tokens. Source: Inference-Docs – Prompt Caching
  • Multi-LoRA -- Multi-adapter support for Low-Rank Adaptation, allowing teams to apply different LoRA specializations per request on a single base model. Launched in private preview in May 2026 for dedicated endpoints. Source: Cerebras – Introducing Multi Lora On Cerebras Inference

Latest Changes

First report for this supplier. All models, plans, and pricing are listed as current state.

Plans

Inference API Access

| Plan | Price | Rate Limits | Key Inclusions |
| --- | --- | --- | --- |
| Free | $0 | 5 RPM, 30K input tokens/min, 1M tokens/day (GPT OSS 120B) | Access to all shared models, community support via Discord |
| Developer | Starting at $10 (self-serve, pay-per-token) | 1,000 RPM, 1M input tokens/min (GPT OSS 120B); 500 RPM, 500K input tokens/min (GLM 4.7); no daily token cap disclosed | 10x higher rate limits than Free, higher priority processing |
| Enterprise | Undisclosed (contact sales) | Highest rate limits (undisclosed), dedicated queue priority | Custom model weights, fine-tuning/training services, dedicated support team with response time guarantees, Multi-LoRA, dedicated endpoints |

Terms explained:

  • RPM -- Requests per minute, a standard rate limiting metric for APIs.
  • Self-serve payment starting at $10 -- Cerebras states the Developer tier starts at $10 but does not disclose what this buys in terms of token credits or billing increments. Actual cost is usage-based per token.

Cerebras Code (Subscription)

| Plan | Price | Token Allowance | Status |
| --- | --- | --- | --- |
| Pro | $50/month | Up to 24M tokens/day (~$48/day value) | Sold out |
| Max | $200/month | Up to 120M tokens/day (~$240/day value) | Sold out |

Both plans provide access to top open-source models for coding via API, usable with third-party IDEs and CLI tools. Neither plan is currently available for purchase.
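
The "~$/day value" figures in the table imply a blended per-token rate. The sketch below backs that rate out, assuming the quoted value is simply the daily token allowance multiplied by a flat price; Cerebras does not state how the figure is derived, and the function and dictionary names are hypothetical.

```python
# Assumption: "daily value" = daily token allowance x one flat $/MTok rate.
# Plan numbers are copied from the table above; all names are illustrative.
PLANS = {
    "Pro": {"monthly_fee": 50.0, "daily_tokens_m": 24.0, "daily_value": 48.0},
    "Max": {"monthly_fee": 200.0, "daily_tokens_m": 120.0, "daily_value": 240.0},
}


def implied_rate_per_mtok(daily_value_usd: float, daily_tokens_m: float) -> float:
    """Blended $/MTok implied by a quoted daily value and token allowance."""
    return daily_value_usd / daily_tokens_m


def full_use_days_to_cover_fee(monthly_fee_usd: float, daily_value_usd: float) -> float:
    """Days of full-allowance usage whose quoted value equals the monthly fee."""
    return monthly_fee_usd / daily_value_usd


for name, plan in PLANS.items():
    rate = implied_rate_per_mtok(plan["daily_value"], plan["daily_tokens_m"])
    days = full_use_days_to_cover_fee(plan["monthly_fee"], plan["daily_value"])
    print(f"{name}: implied ${rate:.2f}/MTok, fee covered after {days:.2f} full-use days")
```

Under that assumption both plans work out to $2.00 per million tokens, and the monthly fee equals roughly one day of full-allowance usage at the quoted value.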

API Pricing

Per-token pricing for models on the shared public endpoint. All prices are per million tokens.

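| Model | Model ID | Status | Input ($/MTok) | Output ($/MTok) | Speed (tok/s) | Context (paid) | Max Output (paid) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI GPT OSS 120B | gpt-oss-120b | Production | $0.35 | $0.75 | ~3,000 | 131K | 40K |
| Z.ai GLM 4.7 | zai-glm-4.7 | Preview | $2.25 | $2.75 | ~1,000 | 131K | 40K |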
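
As a rough illustration of how these list prices translate into per-request cost, the sketch below multiplies token counts by the rates in the table. The `Price` class, the `request_cost` helper, and the example token counts are hypothetical; only the prices and model IDs come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# List prices for the shared public endpoint, as reported above.
PRICES = {
    "gpt-oss-120b": Price(input_per_mtok=0.35, output_per_mtok=0.75),
    "zai-glm-4.7": Price(input_per_mtok=2.25, output_per_mtok=2.75),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price (no caching discount applied)."""
    price = PRICES[model]
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000


# Hypothetical request: 20K input tokens, 2K output tokens.
for model_id in PRICES:
    print(model_id, f"${request_cost(model_id, 20_000, 2_000):.4f}")
```

For that example request the cost is about $0.0085 on GPT OSS 120B and about $0.0505 on GLM 4.7, roughly a 6x difference between the two shared models.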

Rate Limits by Tier

GPT OSS 120B:

| Tier | Requests/min | Input Tokens/min | Daily Tokens |
| --- | --- | --- | --- |
| Free Trial | 5 | 30K | 1M |
| Developer | 1,000 | 1M | N/A (undisclosed) |

GLM 4.7:

| Tier | Requests/min | Input Tokens/min | Daily Tokens |
| --- | --- | --- | --- |
| Free Trial | 5 | 30K | 1M |
| Developer | 500 | 500K | N/A (undisclosed) |
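
Which limit binds in practice depends on prompt size: with long prompts the input-token budget is exhausted well before the request cap. The sketch below estimates the sustainable request rate as the smaller of the two limits. The `Limits` class and `sustainable_rpm` helper are hypothetical names, and the calculation ignores the daily token cap and any burst behavior, neither of which is specified in enough detail to model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    requests_per_min: int
    input_tokens_per_min: int


# Limits copied from the tables above.
FREE = Limits(requests_per_min=5, input_tokens_per_min=30_000)
DEVELOPER = {
    "gpt-oss-120b": Limits(requests_per_min=1_000, input_tokens_per_min=1_000_000),
    "zai-glm-4.7": Limits(requests_per_min=500, input_tokens_per_min=500_000),
}


def sustainable_rpm(limits: Limits, avg_input_tokens: int) -> float:
    """Requests/min that satisfy both the RPM cap and the input-token cap."""
    if avg_input_tokens <= 0:
        raise ValueError("avg_input_tokens must be positive")
    token_bound = limits.input_tokens_per_min / avg_input_tokens
    return min(limits.requests_per_min, token_bound)


# Hypothetical workload: prompts averaging 20K input tokens.
print("Free:", sustainable_rpm(FREE, 20_000))
for model_id, limits in DEVELOPER.items():
    print(f"Developer {model_id}:", sustainable_rpm(limits, 20_000))
```

With 20K-token prompts, the token budget rather than the request cap is the binding constraint on every tier: about 1.5 requests/min on Free, 50 on Developer for GPT OSS 120B, and 25 for GLM 4.7.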

Dedicated Endpoint Models (Enterprise Only)

Dedicated endpoints support 30+ models across 10+ model families. Key models relevant to coding:

| Model Family | Key Models |
| --- | --- |
| Alibaba Qwen | Qwen3-235B-A22B, Qwen3-Coder-480B-A35B, Qwen3-32B, Qwen3-30B-A3B |
| OpenAI OSS | GPT-OSS-120B, GPT-OSS-20B |
| Moonshot AI | Kimi-K2.6, Kimi-K2.5, Kimi-K2-Instruct, Kimi-K2-Thinking |
| Z.AI | GLM-5.1, GLM-5, GLM-4.7, GLM-4.7-Flash, GLM-4.6 |
| DeepSeek | DeepSeek-V3.2, DeepSeek-V3.1, DeepSeek-V3 |
| Meta | Llama-4-Maverick (402B), Llama-4-Scout (109B), Llama-3.3-70B |
| Mistral | Mistral-Large-3-675B, Devstral-Small-2-24B, Codestral-22B |
| MiniMax | MiniMax-M2.5, MiniMax-M2.1 |
| ByteDance | Seed-OSS-36B |
| ServiceNow | Apriel-1.6-15B-Thinker |

Dedicated endpoint pricing is not publicly listed. Source: Inference-Docs – Overview

Model Performance / Benchmarks

Cerebras does not publish its own coding benchmarks for the models it serves. Speed figures come from a mix of third-party measurement (Artificial Analysis) and Cerebras's own measurements, as marked in the Metric column:

| Model | Metric | Value | Source |
| --- | --- | --- | --- |
| Kimi K2.6 | Output speed (Artificial Analysis) | 981 tok/s | Cerebras – Cerebras Kimi K2 Enterprise |
| GPT OSS 120B | Output speed (Cerebras measurement) | ~3,000 tok/s | Inference-Docs – Openai Oss |
| GLM 4.7 | Output speed (Cerebras measurement) | ~1,000 tok/s | Inference-Docs – Zai Glm 47 |
| SWE-1.6 (Cognition) | Output speed on Windsurf Fast tier | ~950 tok/s | Cerebras – Case Study Cognition X Cerebras |
| SWE-1.6 (Cognition) | SWE-Bench Pro score | 50.4% (vs 40.1% for SWE-1.5) | Cerebras – Case Study Cognition X Cerebras |
| Kimi K2.6 | SWE-Bench Pro score | 58.6% (model-level benchmark) | Cerebras – Cerebras Kimi K2 Enterprise |

Note: SWE-Bench Pro scores are model-level benchmarks published by the model creators, not by Cerebras. Cerebras's contribution is inference speed, not model capability.

Latest News

Cerebras IPO on NASDAQ (May 14, 2026)

Cerebras went public on the Nasdaq Global Select Market under ticker "CBRS" at $185/share, offering 30M shares (with a 4.5M over-allotment option). Lead underwriters: Morgan Stanley, Citigroup, Barclays, UBS. The company positions itself as delivering "up to 15x faster inference than leading GPU-based solutions" on its Wafer-Scale Engine 3 (WSE-3). Source: Cerebras – Cerebras Systems Announces Pricing Of Initial Public Offering

Kimi K2.6 Enterprise Inference (May 19, 2026)

Cerebras announced enterprise trials of Kimi K2.6, a 1T-parameter open-weight model from Moonshot AI, at 981 output tokens/sec. This is the first trillion-parameter model served on Cerebras. Artificial Analysis measured the performance at 6.7x faster than the next GPU-based cloud and 23x faster than the median provider. For a 10K-token input with 500 output tokens, Cerebras delivered the full response in 5.6 seconds vs. 163.7 seconds on the official Kimi endpoint. K2.6 tops SWE-Bench Pro at 58.6%. Source: Cerebras – Cerebras Kimi K2 Enterprise

Multi-LoRA Private Preview (May 6, 2026)

Cerebras launched Multi-LoRA support in private preview for dedicated endpoint users. The feature allows deploying multiple LoRA adapters alongside a shared base model, with per-request adapter switching. Use cases include specializing coding assistants by language, framework, or customer. Available at no additional cost for dedicated endpoint users. Source: Cerebras – Introducing Multi Lora On Cerebras Inference

Cognition (Windsurf) Case Study (May 1, 2026)

Cerebras published a case study detailing its partnership with Cognition AI. SWE-1.6, Cognition's coding model, runs at up to 950 tok/s on Cerebras in Windsurf's "fast tier," compared to ~200 tok/s on GPU. SWE-1.6 scored 50.4% on SWE-Bench Pro (vs 40.1% for SWE-1.5). The case study highlights co-optimization of model, agent harness, and inference layer. Source: Cerebras – Case Study Cognition X Cerebras

Sovereign AI Blog (May 26, 2026)

Cerebras published an overview of its "Cerebras for Nations" sovereign AI initiative, covering partnerships with the US (DOE Genesis Mission), UAE (G42/MBZUAI, JAIS 2 model), and India (8 exaflops national AI supercomputer with G42 and C-DAC). The post positions speed as a sovereign advantage. Not a product announcement but relevant to enterprise/government buyers. Source: Cerebras – What Is Sovereign Ai And How Cerebras Helps Nations

UI Generation Best Practices (May 8, 2026)

A practical blog post on generating better UIs with AI, including 8 methods for improving output quality. References Codex-Spark running at ~1,200 tok/s on Cerebras. Source: Cerebras – Generating Beautiful Uis

Community Signals

Cerebras Code sold-out status generates frustration. Both subscription plans (Pro $50/month, Max $200/month) have been sold out for an extended period with no announced reopening date. Multiple HN commenters expressed frustration at being unable to access the flat-rate product. Source: News – Item

Rate limit transparency is a recurring complaint. Users on the Cerebras Code Pro plan reported hitting limits well below the advertised "1,000 messages per day." One HN commenter noted: "While they advertise a 1,000-request limit, the actual daily constraint is a 7.5 million-token limit" (citing a Reddit thread titled "Cerebras Pro Coder Deceptive Limits"). Another reported being rate-limited at under 1M tokens. The FAQ clarified that limits are token-based, not message-based, but the marketing initially led users to expect 1,000 API calls per day. Source: News – Item

Speed is universally praised, but quality concerns persist. HN commenters consistently highlight the speed advantage as genuine and impressive. However, multiple users noted that model quality on open-weight models (Qwen3-Coder, GPT OSS 120B) does not match Claude Sonnet or Opus for complex coding tasks. One commenter said: "The quality is also not quite what Claude Code gave me, but the speed is definitely way faster." Source: News – Item

Lack of prompt caching pricing is a cost concern for agentic workflows. Multiple HN users flagged that without cached token pricing, agentic coding workflows (where the full conversation history is re-sent with each tool call) become expensive on the per-token API. One commenter noted: "Without caching, this becomes very expensive very quickly. After each new tool call, you're sending the entire previous message history as input tokens." Source: News – Item
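
The cost growth the commenter describes can be made concrete with a simple model: if every tool call re-sends the whole history, billed input tokens grow roughly quadratically with the number of calls. The sketch below assumes a fixed starting prompt and a fixed number of tokens appended per call; both figures and all function names are hypothetical, and only the $/MTok input prices come from the API Pricing section.

```python
def cumulative_input_tokens(base_tokens: int, added_per_call: int, calls: int) -> int:
    """Total input tokens billed when each call re-sends the full history.

    Call k (0-indexed) sends the base prompt plus everything appended by
    the k earlier calls.
    """
    return sum(base_tokens + k * added_per_call for k in range(calls))


def unique_tokens(base_tokens: int, added_per_call: int, calls: int) -> int:
    """Tokens that are genuinely new across the whole session."""
    return base_tokens + (calls - 1) * added_per_call


def input_cost_usd(total_input_tokens: int, price_per_mtok: float) -> float:
    """Input cost at a flat list price, with no cached-token discount."""
    return total_input_tokens * price_per_mtok / 1_000_000


# Hypothetical session: 10K-token starting prompt, 2K tokens added per call, 30 calls.
billed = cumulative_input_tokens(10_000, 2_000, 30)
fresh = unique_tokens(10_000, 2_000, 30)
print("billed input tokens:", billed)
print("unique tokens:", fresh)
print("re-send multiplier:", round(billed / fresh, 1))
for model_id, price in {"gpt-oss-120b": 0.35, "zai-glm-4.7": 2.25}.items():
    print(model_id, f"${input_cost_usd(billed, price):.2f}")
```

In this example the session bills 1,170,000 input tokens for only 68,000 unique tokens, about $0.41 on GPT OSS 120B and $2.63 on GLM 4.7 for input alone. A cached-token discount would apply to the re-sent portion, which is why its absence matters most for long agentic sessions.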

IPO generated limited community buzz. The Cerebras IPO (May 14) received only 3 points and no comments on Hacker News, suggesting the developer community is more focused on product availability than financial milestones. Source: News – Item

Free tier rate limits too restrictive for coding agents. Users reported that the 5 RPM / 30K input tokens/min free tier limit is insufficient for agentic workflows, where a single task can generate dozens of tool calls. "It hits the request per minute limit instantly and then you wait a minute," one user reported when using Cerebras via OpenRouter with Claude Code Router. Source: News – Item
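
One way to avoid tripping the request limit is to pace calls on the client side. The sketch below is a minimal sliding-window limiter, not anything provided by Cerebras: the class names are hypothetical, and it assumes the 5 RPM limit is enforced over a rolling 60-second window, which the source does not confirm. A fake clock is injected so the waiting time can be simulated without actually sleeping.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any rolling `window_s` seconds."""

    def __init__(self, max_calls: int, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of calls still inside the window

    def acquire(self) -> None:
        """Block until one more call fits in the window, then record it."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) >= self.max_calls:
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self.window_s - (now - self._sent[0]))
            now = self._clock()
            self._sent.popleft()
        self._sent.append(now)


class FakeClock:
    """Simulated clock: sleeping just advances the time counter."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


# Simulate a hypothetical agent task issuing 40 back-to-back tool calls at 5 RPM.
sim = FakeClock()
limiter = SlidingWindowLimiter(max_calls=5, window_s=60.0, clock=sim.now, sleep=sim.sleep)
for _ in range(40):
    limiter.acquire()
print("simulated seconds spent waiting:", sim.t)
```

Under these assumptions a 40-call task spends 420 simulated seconds (7 minutes) waiting on the request limit alone, before any inference time, which matches the experience users describe. In real use the limiter would be constructed with the default `time.monotonic` and `time.sleep`.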

Enterprise Readiness

| Feature | Available? | Details |
| --- | --- | --- |
| SSO (SAML/OIDC) | Undisclosed | Not mentioned in pricing, docs, or enterprise tier description. Contact sales for details. |
| SCIM | Undisclosed | Not mentioned in public documentation. |
| Audit logs | Undisclosed | Not mentioned. Dedicated endpoint metrics available in Prometheus format. Source: Inference-Docs – Metrics |
| IP indemnity | No | Cerebras is an inference provider, not a model creator. IP indemnity would depend on the model used. Not mentioned in Cerebras's own terms. |
| Data residency | Partial | Sovereign AI initiative supports on-premises deployment in specific countries (US, UAE, India). Cloud inference data residency is undisclosed. Source: Cerebras – What Is Sovereign Ai And How Cerebras Helps Nations |
| HIPAA | Undisclosed | Not mentioned in public documentation. |
| Air-gapped/on-prem | Yes | Cerebras systems can be deployed on-premises. The CS-3 system is sold as hardware for customer datacenters. Source: Cerebras – Ai Supercomputer |
| SLA | Undisclosed | Enterprise tier mentions "response time guarantees" for support, but inference SLA (uptime, latency) is not publicly documented. |
| Admin controls (RBAC) | Partial | Cloud Console supports Projects for organizing workloads and managing team access. Full RBAC details not publicly documented. Source: Inference-Docs – Projects |

Transparency Gaps

  1. Cerebras Code plans sold out with no timeline. Both Pro ($50/month) and Max ($200/month) are listed as "sold out" with no indication of when or whether they will reopen. This blocks the primary flat-rate access point for individual developers.
  2. Developer tier daily token limit undisclosed. The rate limit table lists "N/A" for daily tokens on the Developer tier, so it is unclear whether there is no daily cap or Cerebras simply does not disclose it. Without this number, developers cannot size sustained workloads or bound their maximum daily spend.
  3. Prompt caching pricing not disclosed. Prompt caching is listed as a supported capability, but no pricing discount for cached input tokens is published. Competitors like Anthropic ($0.30/MTok cached vs $3/MTok standard for Sonnet) and Google make this explicit. Without it, cost comparisons for agentic workflows are incomplete.
  4. Enterprise pricing is entirely opaque. No pricing information, rate limits, or SLA terms are publicly available for the Enterprise tier or dedicated endpoints. Buyers must engage sales to get any numbers.
  5. Dedicated endpoint model pricing varies but is not listed. The dedicated endpoint supports 30+ models with different computational requirements, but no per-model pricing or throughput guarantees are published.
  6. Model quality benchmarks absent. Cerebras publishes speed benchmarks but does not publish any quality benchmarks (SWE-Bench, LiveCodeBench, etc.) for the models it serves. Quality claims rely entirely on the model creators' benchmarks.
  7. Free tier context window reduced vs paid. GPT OSS 120B has 65K context on the free tier vs 131K on paid, and 32K max output vs 40K. These differences are documented but easy to miss.
  8. GLM 4.7 labeled "Preview" with no production commitment. GLM 4.7 is listed as a preview model that "may be discontinued on short notice." Teams building on GLM 4.7 have no guarantee of continued availability.

---

*Sources: All pricing and plan data from Cerebras – Pricing and Inference-Docs – Overview (accessed 2026-05-31). Blog data from Cerebras – Blog. Community signals from Hacker News.*