Key Terms
- Open-weight model — model weights are published for download, allowing anyone to run inference, fine-tune, or modify the model locally or on their own infrastructure. Unlike "open source," the Llama 4 Community License imposes restrictions on derivative naming and has a 700M MAU commercial threshold. Source: GitHub – License
- Mixture-of-experts (MoE) — architecture where only a subset of parameters are activated per token. Llama 4 always activates 17B parameters per forward pass, but routes through different expert subsets (16 for Scout, 128 for Maverick). This keeps inference cost low despite large total parameter counts. Source: Llama – Llama4
- Early fusion — a multimodal training technique where text and vision data are processed together from the first layer, rather than using separate frozen vision encoders. Llama 4 uses early fusion for native multimodality. Source: Llama
- Llama 4 Community License — Meta's custom license for Llama 4 models. Permits free commercial use with two conditions: (1) entities with >700M monthly active users must request a separate license from Meta, and (2) derivative models must prefix their name with "Llama" and display "Built with Llama" on related materials. Source: GitHub – License
- Llama API — Meta's hosted inference API at llama.developer.meta.com, currently in waitlist mode. Not yet publicly available. Source: Llama
Latest Changes
Changes since the 2026-04 report.
- No new models. Llama 4 Scout and Llama 4 Maverick remain the flagship models, released April 5, 2025. No new Llama model has been announced or released since.
- No pricing changes. Meta does not sell API access, so there are no pricing changes from Meta. Third-party provider pricing may have shifted independently.
- No Llama-specific blog posts in May 2026. The AI at Meta blog's most recent posts are about Muse Spark (April 8, 2026), SAM 3.1 (March 27, 2026), and MTIA chips (March 11, 2026). None relate to Llama. Source: Meta
- Llama API remains in waitlist mode. No update on pricing, availability, or capabilities of the first-party Llama API. Source: Llama – Join Waitlist
- Llama 4 models not available on Together AI. As of May 31, 2026, Together AI's pricing page lists Llama 3.3 70B ($0.88/$0.88 per MTok) but does not carry Llama 4 Scout or Maverick. Source: Together – Pricing
- Llama 4 available on OpenRouter. Scout at $0.08/$0.30 per MTok (10.4B tokens processed in the last 7 days) and Maverick at $0.15/$0.60 per MTok (28.2B tokens processed in the last 7 days). Source: Openrouter – Models
Plans
Meta does not offer subscription plans, tiers, or bundled products. Models are free to download and use under the Llama 4 Community License.
| Access Method | Price | Details |
|---|---|---|
| Direct download (self-host) | Free | Download weights from Llama – Llama Downloads or HuggingFace. Requires own GPU infrastructure. |
| Llama API (Meta-hosted) | Undisclosed (waitlist) | Llama – Join Waitlist. No public pricing or availability timeline. |
| Third-party API providers | Varies by provider | OpenRouter, Fireworks, Groq, DeepInfra, etc. Pricing and SLAs vary. |
API Pricing
Meta does not publish API pricing because it does not offer a public API. Third-party providers set their own prices for serving Llama 4 models. All prices below are per 1M tokens.
| Provider | Model | Input | Output | Context | Notes |
|---|---|---|---|---|---|
| OpenRouter | Llama 4 Scout | $0.08 | $0.30 | 10M | 10.4B tokens/week. Source: Openrouter – Llama 4 Scout |
| OpenRouter | Llama 4 Maverick | $0.15 | $0.60 | 1.05M | 28.2B tokens/week. Source: Openrouter – Llama 4 Maverick |
| Together AI | Llama 4 Scout | N/A | N/A | N/A | Not available. Only Llama 3.3 70B listed. Source: Together – Pricing |
| Together AI | Llama 4 Maverick | N/A | N/A | N/A | Not available. Source: Together – Pricing |
| Meta (self-host estimate) | Llama 4 Maverick | $0.19 (blended) | $0.30-$0.49 (blended) | 1M | Meta's own cost estimate for distributed inference. Source: Llama |
Self-hosting hardware requirements:
- Llama 4 Scout: Can run on a single H100 GPU with INT4 quantization (109B total params).
- Llama 4 Maverick: Cannot run on a single GPU. FP8 quantized weights fit on a single H100 DGX host (8 GPUs). BF16 weights require multi-host deployment (400B total params). Source: Hugging Face – Llama 4 Maverick 17B 128E
Model Performance / Benchmarks
Meta reports the following benchmarks for instruction-tuned Llama 4 models. All evaluations conducted on BF16 weights with 0-shot, temperature=0. For high-variance benchmarks, Meta averages over multiple generations.
| Model | Benchmark | Score | Notes |
|---|---|---|---|
| Llama 4 Maverick | LiveCodeBench (10/01/2024-02/01/2025) | 43.4 | Pass@1. Source: Llama |
| Llama 4 Scout | LiveCodeBench (10/01/2024-02/01/2025) | 32.8 | Pass@1. Source: Llama |
| Llama 4 Maverick | MMLU Pro | 80.5 | Macro avg/acc. Source: Llama |
| Llama 4 Scout | MMLU Pro | 74.3 | Macro avg/acc. Source: Llama |
| Llama 4 Maverick | GPQA Diamond | 69.8 | Accuracy. Source: Llama |
| Llama 4 Scout | GPQA Diamond | 57.2 | Accuracy. Source: Llama |
| Llama 4 Maverick | MMMU | 73.4 | Accuracy. Multimodal benchmark. Source: Llama |
| Llama 4 Scout | MMMU | 69.4 | Accuracy. Source: Llama |
| Llama 4 Maverick | MathVista | 73.7 | Accuracy. Source: Llama |
| Llama 4 Scout | MathVista | 70.7 | Accuracy. Source: Llama |
| Llama 4 Maverick | ChartQA | 90.0 | Relaxed accuracy. Source: Llama |
| Llama 4 Scout | ChartQA | 88.8 | Relaxed accuracy. Source: Llama |
| Llama 4 Maverick | DocVQA | 94.4 | ANLS. Source: Llama |
| Llama 4 Scout | DocVQA | 94.4 | ANLS. Source: Llama |
| Llama 4 Maverick | MMLU Multi | 84.6 | Source: Llama |
| Llama 4 Maverick | MBPP (pretrained) | 77.6 | Pass@1 (3-shot). Source: Hugging Face – Llama 4 Maverick 17B 128E |
| Llama 4 Scout | MBPP (pretrained) | 67.8 | Pass@1 (3-shot). Source: Hugging Face – Llama 4 Maverick 17B 128E |
Context for comparison: Llama 4 Maverick's LiveCodeBench score of 43.4 is lower than DeepSeek V4 Pro (undisclosed exact score, community-rated "close to Opus 4.5") and well below current closed-source frontier models. However, at $0.15/$0.60 per MTok via OpenRouter, Maverick offers strong performance-per-dollar.
Latest News
- 2026-04-08: Muse Spark announced. Meta's AI at Meta blog published "Introducing Muse Spark: Scaling Towards Personal Superintelligence." This is not directly Llama-related but represents Meta's broader AI direction. No Llama model updates were included. Source: Meta – Introducing Muse Spark Msl
- 2026-03-27: SAM 3.1 released. Meta released Segment Anything Model 3.1 for real-time video detection and tracking. Not Llama-related, but demonstrates continued investment in open-source AI tooling. Source: Meta – Segment Anything Model 3
- 2026-03-11: Four MTIA Chips in Two Years. Meta detailed its custom AI chip roadmap. Relevant because MTIA chips are designed to serve Llama models at scale, potentially reducing Meta's inference costs and enabling the Llama API. Source: Meta – Meta Mtia Scale Ai Chips For Billions
- No Llama 4 Behemoth update. When Llama 4 launched in April 2025, Meta mentioned a larger "Llama 4 Behemoth" model was in training. As of May 2026, there has been no update on Behemoth's status or release timeline.
- No Llama 4.1 or Llama 5 announcement. Meta has not announced any successor to the Llama 4 series. The models are now over 13 months old.
Community Signals
Llama 4 adoption on OpenRouter is moderate but growing. Llama 4 Maverick processes 28.2B tokens/week on OpenRouter (second among Llama models, behind Llama 3.1 8B at 83.3B tokens/week). Llama 4 Scout processes 10.4B tokens/week. For comparison, DeepSeek V4 Pro processes significantly more volume through its direct API. Source: Openrouter – Models
Llama 4 not prominently discussed in May 2026 HN threads. The dominant community conversations in May 2026 focused on DeepSeek V4 Pro's permanent pricing, DeepClaude (using DeepSeek with Claude Code), and coding agent competition. Llama 4 was mentioned in passing as an open-weight alternative but was not the center of any major discussion thread. This suggests Llama 4 has settled into a stable "commodity" position in the community rather than being a hot topic.
OpenCode integration. Multiple HN commenters in the DeepClaude thread mentioned using OpenCode as a harness for Llama and other open-weight models. User rurban said: "I'm working with Deepseek for a few weeks with opencode, and there are no desires left." While this specifically references DeepSeek, it validates the opencode harness pattern for open-weight models including Llama 4. Source: News – Item
Llama 4 Scout praised for local deployment. Community members on r/LocalLLaMA and HN have noted that Scout's ability to run on a single H100 (with INT4 quantization) makes it one of the most capable models that can be deployed without multi-GPU infrastructure. However, the 10M context window claim has been met with skepticism, as it requires 512 GPUs with 5D parallelism in Meta's benchmark configuration.
Knowledge cutoff is a recurring concern. With an August 2024 knowledge cutoff, Llama 4 models are over 20 months behind current events. Community members have flagged this as a significant limitation for coding tasks that involve recently released libraries, APIs, or frameworks.
Enterprise Readiness
| Feature | Available? | Details |
|---|---|---|
| SSO (SAML/OIDC) | N/A | Meta does not offer a hosted product with authentication. Self-hosting or third-party providers handle auth. |
| SCIM | N/A | No hosted product. |
| Audit logs | N/A | No hosted product. Self-hosting users implement their own. |
| IP indemnity | No | The Llama 4 Community License disclaims all warranties (Section 3) and limits liability (Section 4). Meta provides no IP indemnity for Llama model outputs. |
| Data residency | Partial | Self-hosting provides full data residency control. Third-party providers vary: OpenRouter routes through multiple providers; check individual provider policies. Meta's license does not address data processing. |
| HIPAA | N/A | No hosted product to certify. Self-hosting may be HIPAA-compliant with proper infrastructure controls. |
| Air-gapped / On-prem | Yes | Models are downloadable and can run fully offline. Scout runs on 1x H100 with INT4; Maverick requires multi-GPU setup. |
| SLA | No | Meta provides no SLA for model availability, performance, or support. |
| Admin controls (RBAC) | N/A | No hosted product. |
Source: GitHub – License ; Llama
Transparency Gaps
- Context window contradiction for Maverick. The llama.com homepage claims Maverick has "10M-token context for long-form work," but the official model card documentation page states "1M tokens" as the maximum context length. OpenRouter lists 1.05M context. The homepage claim appears to be marketing copy that does not match the technical specification. Source: Llama vs Llama – Llama4
- Llama API status and pricing are undisclosed. Meta launched a Llama API waitlist at llama.developer.meta.com, but has not published pricing, rate limits, capabilities, or a general availability timeline. The waitlist has been open since at least April 2025 with no update.
- Llama 4 Behemoth status unknown. Meta mentioned a larger "Behemoth" model during the Llama 4 launch in April 2025. Thirteen months later, there has been no update on its status, capabilities, or release plans.
- Training data composition is vague. Meta states Llama 4 was trained on "a mix of publicly available, licensed data and information from Meta's products and services. This includes publicly shared posts from Instagram and Facebook and people's interactions with Meta AI." The exact composition, data filtering methodology, and opt-out mechanisms are not disclosed. Source: Hugging Face – Llama 4 Maverick 17B 128E
- Knowledge cutoff is 20+ months old. Llama 4 models were trained on data with an August 2024 cutoff. Meta has not announced any plans for an updated model or continued pretraining on fresher data.
- No benchmark methodology disclosure beyond summary. Meta reports benchmark results as single numbers but does not publish full evaluation code, prompts, or raw results. The methodology note on the homepage states "0 shot evaluation with temperature = 0" and "we average over multiple generations" for high-variance benchmarks, but the number of generations and confidence intervals are not disclosed.
- Self-hosting cost estimates lack detail. Meta estimates serving Maverick at $0.19-$0.49 per MTok (blended, 3:1 ratio), but does not disclose the assumptions behind these estimates (GPU utilization, batch size, hardware depreciation, power cost, etc.). Source: Llama