Key Terms
- Open-weight model - a model whose trained weights are published for anyone to download, modify, and deploy. Meta's Llama models are open-weight, so the cost of inference depends entirely on how you deploy them: self-hosted or through a third-party provider. Source: Llama
- Inference provider - a third-party service (e.g., Cerebras, Together AI, Groq, AWS Bedrock, Azure AI) that hosts Llama models and charges for API access. Meta does not sell inference directly. Source: Llama
- Distributed inference - running inference across multiple GPUs or nodes to serve large models. Llama 4 Scout and Maverick both support distributed inference, with estimated costs of $0.19-$0.49/MTok. Source: Llama
Latest Changes
First report for this supplier. All models, plans, and pricing reflect the current state.
- New model: Llama 4 Scout and Maverick available as open-weight models. Both feature 10M token context windows and native multimodal support.
- Deprecation (upcoming): Cerebras deprecating Llama 3.1 8B on May 27, 2026.
- Plan change: Website migrated from llama.meta.com to Llama. The old URL returns HTTP 400 errors.
Plans
Meta does not offer a coding agent product or subscription plans. Llama models are open-weight and free to download. The cost structure depends entirely on how you deploy them.
| Deployment Method | Cost | Notes |
|---|---|---|
| Download and self-host | Free (hardware costs only) | Requires GPU infrastructure. Cost depends on your hardware and electricity |
| Third-party inference API | Varies by provider | See API Pricing table below for concrete per-provider rates |
| Cerebras (fast inference) | See API Pricing table | Pay-per-token via Cerebras API or Cerebras Code subscription |
| AWS Bedrock / Azure AI | See respective provider pricing | Pay-per-token through cloud marketplace |
| Together AI / Fireworks AI | See respective provider pricing | Competitive pricing for open-source model inference |
| Groq | See Groq pricing | Fast inference on LPU hardware |
Source: Llama
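The trade-off in the table above can be sketched numerically. The following is a minimal break-even sketch comparing ongoing API spend with a one-time self-hosting investment; all numbers (token volume, hardware cost, power cost) are illustrative assumptions, not Meta or provider figures.

```python
# Rough break-even sketch: monthly API spend vs. self-hosting.
# All concrete numbers below are illustrative assumptions.

def monthly_api_cost(input_mtok: float, output_mtok: float,
                     in_rate: float, out_rate: float) -> float:
    """Monthly API cost in USD, given token volume in millions and $/MTok rates."""
    return input_mtok * in_rate + output_mtok * out_rate

def breakeven_months(hardware_cost: float, monthly_api: float,
                     monthly_hosting: float) -> float:
    """Months until self-hosting hardware pays for itself vs. API spend."""
    savings = monthly_api - monthly_hosting
    if savings <= 0:
        return float("inf")  # API is cheaper at this volume
    return hardware_cost / savings

# Example: 500 MTok in / 100 MTok out per month at the Vertex AI
# Llama 4 Scout rates from the API Pricing table ($0.25 / $0.70):
api = monthly_api_cost(500, 100, 0.25, 0.70)  # 125.0 + 70.0 = 195.0 USD/month
months = breakeven_months(30_000, api, 50)    # assumed $30k GPU, $50/mo power
```

At low volumes the API side wins; the function returns infinity when monthly API spend never exceeds hosting overhead.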
API Pricing
Meta does not offer an API directly. Inference costs are set by third-party providers. Concrete pricing from Google Vertex AI and Cerebras:
| Model | Provider | Input ($/MTok) | Output ($/MTok) | Batch Input ($/MTok) | Batch Output ($/MTok) | Notes |
|---|---|---|---|---|---|---|
| Llama 4 Scout | Google Vertex AI | $0.25 | $0.70 | $0.125 | $0.35 | |
| Llama 4 Maverick | Google Vertex AI | $0.35 | $1.15 | $0.175 | $0.575 | |
| Llama 3.3 70B | Google Vertex AI | $0.72 | $0.72 | $0.36 | $0.36 | |
| Llama 3.1 8B | Cerebras | $0.10 | $0.10 | - | - | Deprecating May 27, 2026 |
Terms explained:
- Batch API - a lower-cost inference mode where requests are queued and processed asynchronously (not real-time), typically at 50% of standard pricing. Google Vertex AI offers batch pricing for all Llama models listed above. Source: Google – Pricing
Source: Llama, Google – Pricing, Cerebras – Pricing
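The per-MTok rates and the 50% batch discount above translate into per-request costs as follows; this is a minimal sketch using only the Vertex AI rates from the table (the model keys are shorthand, not official model IDs).

```python
# Per-request cost sketch using the Vertex AI $/MTok rates in the table above.
VERTEX_RATES = {  # model key -> (input $/MTok, output $/MTok), standard tier
    "llama-4-scout":    (0.25, 0.70),
    "llama-4-maverick": (0.35, 1.15),
    "llama-3.3-70b":    (0.72, 0.72),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 batch: bool = False) -> float:
    """Cost in USD for one request; batch mode halves both rates."""
    in_rate, out_rate = VERTEX_RATES[model]
    discount = 0.5 if batch else 1.0
    return discount * (input_tokens / 1e6 * in_rate +
                       output_tokens / 1e6 * out_rate)

# 200k input / 10k output tokens on Maverick:
standard = request_cost("llama-4-maverick", 200_000, 10_000)        # ~$0.0815
batched  = request_cost("llama-4-maverick", 200_000, 10_000, True)  # ~$0.0408
```

Note that output tokens dominate for Maverick (the output rate is more than 3x the input rate), so generation-heavy workloads benefit most from batching.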
Model Performance / Benchmarks
| Benchmark | Llama 4 Maverick | Llama 4 Scout |
|---|---|---|
| MMLU Pro | 80.5 | 74.3 |
| LiveCodeBench | 43.4 | 32.8 |
Additional specifications:
- Both models: 10M token context window, text + image (natively multimodal)
- Llama 4 Scout: optimized for efficient inference on a single H100 GPU
- Llama 4 Maverick: targets frontier-level performance with higher resource requirements
- Estimated distributed inference cost: $0.19-$0.49/MTok
Source: Llama
Latest News
Llama 4 Scout and Maverick Release
Llama 4 Scout and Llama 4 Maverick are available as open-weight models. Both feature 10M token context windows and native multimodal support (text + image). Llama 4 Maverick achieves 80.5 MMLU Pro and 43.4 LiveCodeBench. Llama 4 Scout achieves 74.3 MMLU Pro and 32.8 LiveCodeBench. Both can be downloaded from Llama and are supported by major inference providers including Cerebras, Google Vertex AI, Together AI, Fireworks AI, AWS Bedrock, and Azure AI.
Cerebras Deprecation of Llama 3.1 8B
Cerebras is deprecating Llama 3.1 8B on its platform effective May 27, 2026. The model was priced at $0.10/MTok input and $0.10/MTok output. Users should migrate to Llama 4 Scout or Maverick before the deprecation date.
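A migration like this can be handled in client code with a simple fallback map. The sketch below is a hypothetical pattern, not a Cerebras API feature, and the model identifiers are illustrative placeholders rather than official IDs.

```python
# Hedged sketch: route requests away from a deprecated model after its cutoff.
# Model identifiers and the fallback choice are illustrative assumptions.
from datetime import date

DEPRECATIONS = {"llama-3.1-8b": date(2026, 5, 27)}  # Cerebras cutoff per report
FALLBACKS = {"llama-3.1-8b": "llama-4-scout"}

def resolve_model(requested: str, today: date) -> str:
    """Return a fallback model if the requested one is past its deprecation date."""
    cutoff = DEPRECATIONS.get(requested)
    if cutoff and today >= cutoff:
        return FALLBACKS[requested]
    return requested

resolve_model("llama-3.1-8b", date(2026, 6, 1))  # -> "llama-4-scout"
```

Centralizing the mapping means the cutover happens automatically on the deprecation date instead of requiring a coordinated code change.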
Website Migration
Meta's Llama website has migrated from llama.meta.com to Llama. The old URL returns HTTP 400 errors. All model downloads and documentation are now hosted at the new domain.
Source: Llama, Cerebras – Pricing
Community Signals
LiveCodeBench and Context Window Discussion
Llama 4 Maverick's LiveCodeBench score of 43.4 is frequently cited in coding benchmark discussions, with community members comparing it favorably to GPT-4-class models for code generation tasks. The 10M token context window is a major talking point, with developers noting it enables processing very large codebases in a single prompt. However, practical latency and cost at that context length are still being evaluated by the community.
Self-Hosting for Privacy
The lack of a first-party coding agent product from Meta means users must build their own agent infrastructure using tools like Continue, Cline, Aider, or OpenHands on top of Llama models. Organizations with strong privacy requirements (defense, healthcare, finance) often choose Llama models for on-premises deployment to avoid sending code to third-party APIs.
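Wiring one of those agent tools to an on-prem Llama deployment typically means pointing it at a server (such as vLLM) that exposes an OpenAI-compatible chat completions route. The sketch below only builds the request; the base URL and model name are placeholders, not official values.

```python
# Minimal sketch of targeting a self-hosted, OpenAI-compatible Llama endpoint.
# The URL and model name are illustrative assumptions.
import json

def build_chat_request(base_url: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Return (url, body) for an OpenAI-compatible chat completion call."""
    url = f"{base_url}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode("utf-8")
    return url, body

url, body = build_chat_request("http://localhost:8000", "llama-4-scout",
                               "Explain this function.")
# POST `body` to `url` with Content-Type: application/json. When the server
# runs on-prem, no code or prompt data leaves your network.
```

This is the property privacy-sensitive organizations are buying: the agent tooling changes, but the inference traffic stays inside their own infrastructure.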
Provider Lock-In Concerns
The Cerebras Llama 3.1 8B deprecation has prompted discussion about provider lock-in when relying on a single inference provider for open-weight models.
Source: Llama
Enterprise Readiness
| Feature | Available? | Details |
|---|---|---|
| SSO (SAML) | N/A | Meta does not offer a managed platform. Models are downloaded and deployed by the user. |
| SSO (OIDC) | N/A | Same as above. |
| SCIM | N/A | Same as above. |
| Audit logs | N/A | Same as above. |
| IP indemnity | No | Not offered. Models are open-weight with no commercial indemnification from Meta. |
| Data residency | Yes | Full control when self-hosting. Models can run on any infrastructure in any region. |
| HIPAA | N/A | Self-hosted deployments can be made HIPAA-compliant by the deploying organization. |
| Air-gapped / on-prem | Yes | Models can be downloaded and deployed on air-gapped infrastructure. Full data isolation. Source: Llama |
| SLA | N/A | No managed service. Availability depends on the user's own infrastructure. |
| Admin controls (RBAC) | N/A | No managed platform. Controls depend on the user's deployment infrastructure. |
Transparency Gaps
| Metric | Status | Notes |
|---|---|---|
| Recommended inference costs | not applicable | Meta does not set inference pricing |
| Self-hosting hardware requirements | undisclosed | No official guidance on minimum GPU specs for Llama 4 Scout or Maverick |
| Fine-tuning tools | partially disclosed | Meta provides Llama fine-tuning guides but specifics vary by model size |
| Cerebras Llama 4 Maverick pricing | undisclosed | Cerebras lists Llama 4 Maverick as supported but has not published per-token pricing |
| Together AI / Fireworks AI Llama 4 pricing | undisclosed | Pricing pages not updated with Llama 4 per-token rates at time of report |
| Context window performance at scale | undisclosed | 10M token context is claimed but no official latency/cost benchmarks published at that scale |