Meta

Executive Summary

What it is: Meta does not offer a coding agent product. Llama 4 Scout and Llama 4 Maverick are open-weight models that can be downloaded for free and deployed through third-party inference providers (Cerebras, Google Vertex AI, Together AI, Fireworks AI, AWS Bedrock, Azure AI) or self-hosted on your own GPU infrastructure. Inference costs are set entirely by the hosting provider, ranging from $0.10/MTok for input and output (Cerebras, Llama 3.1 8B) to $1.15/MTok for output (Google Vertex AI, Llama 4 Maverick).

What to watch out for: Meta's Llama website moved from llama.meta.com to https://www.llama.com (the old URL returns HTTP 400 errors). There is no first-party IDE, CLI, or web coding product, so users must build their own agent infrastructure using tools like Continue, Cline, Aider, or OpenHands. Cerebras is deprecating Llama 3.1 8B on May 27, 2026. Self-hosting requires GPU infrastructure with no official hardware guidance from Meta.

Bottom line: Llama models are the right choice when you need full control over data, model weights, and deployment (defense, healthcare, finance use cases with strict data residency requirements). For everyone else, the lack of a first-party coding agent means significant integration effort compared to managed options like Claude Code, Copilot, or Cursor. Llama 4 Maverick scores 43.4 on LiveCodeBench, competitive with GPT-4-class models, but without a managed product, Llama is an infrastructure choice rather than a tooling choice.

Key Terms

  • Open-weight model - a model whose trained weights are published for anyone to download, modify, and deploy. Meta's Llama models are open-weight, meaning the cost of inference depends entirely on which hosting provider you choose. Source: Llama
  • Inference provider - a third-party service (e.g., Cerebras, Together AI, Groq, AWS Bedrock, Azure AI) that hosts Llama models and charges for API access. Meta does not sell inference directly. Many providers expose OpenAI-compatible endpoints (see the client sketch after this list). Source: Llama
  • Distributed inference - running inference across multiple GPUs or nodes to serve large models. Llama 4 Scout and Maverick both support distributed inference, with estimated costs of $0.19-$0.49/MTok. Source: Llama
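
Because many hosted Llama endpoints follow the OpenAI API shape, switching providers is often just a change of base URL and model ID. A minimal client sketch; the endpoint URL, environment variable, and model ID below are illustrative assumptions, so check your provider's documentation for exact values:

```python
# Minimal sketch: calling a hosted Llama model through a third-party
# inference provider's OpenAI-compatible endpoint. The base_url,
# API-key variable, and model ID are illustrative assumptions.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed provider endpoint
    api_key=os.environ["PROVIDER_API_KEY"],  # provider-issued key
)

response = client.chat.completions.create(
    model="llama-4-scout-17b-16e-instruct",  # assumed provider model ID
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)
```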

Latest Changes

First report for this supplier. All models, plans, and pricing reflect the current state.

  • New model: Llama 4 Scout and Llama 4 Maverick available as open-weight models. Scout features a 10M token context window, Maverick a 1M token window; both are natively multimodal.
  • Deprecation (upcoming): Cerebras is deprecating Llama 3.1 8B on May 27, 2026.
  • Website change: The Llama website migrated from llama.meta.com to https://www.llama.com; the old URL returns HTTP 400 errors.

Plans

Meta does not offer a coding agent product or subscription plans. Llama models are open-weight and free to download. The cost structure depends entirely on how you deploy them.

Deployment Method | Cost | Notes
Download and self-host | Free (hardware costs only) | Requires GPU infrastructure; cost depends on your hardware and electricity (see the self-hosting sketch below)
Third-party inference API | Varies by provider | See API Pricing table below for concrete per-provider rates
Cerebras (fast inference) | See API Pricing table | Pay-per-token via Cerebras API or Cerebras Code subscription
AWS Bedrock / Azure AI | See respective provider pricing | Pay-per-token through cloud marketplace
Together AI / Fireworks AI | See respective provider pricing | Competitive pricing for open-source model inference
Groq | See Groq pricing | Fast inference on LPU hardware

Source: Llama
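
For the download-and-self-host row, one common path is an open-source inference engine such as vLLM. A minimal sketch, assuming `pip install vllm`, a CUDA GPU with enough memory for the chosen checkpoint, and access to the weights; the Hugging Face repo ID is an assumption, so substitute the checkpoint you actually downloaded:

```python
# Minimal self-hosting sketch using vLLM's offline inference API.
# Assumes: `pip install vllm`, a CUDA GPU with sufficient memory, and
# downloaded weights (the repo ID below is an assumption).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")  # assumed repo ID

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain what an open-weight model is."], params)
print(outputs[0].outputs[0].text)
```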

API Pricing

Meta does not offer an API directly. Inference costs are set by third-party providers. Concrete pricing from Google Vertex AI and Cerebras:

Model | Provider | Input ($/MTok) | Output ($/MTok) | Batch Input ($/MTok) | Batch Output ($/MTok) | Notes
Llama 4 Scout | Google Vertex AI | $0.25 | $0.70 | $0.125 | $0.35 | -
Llama 4 Maverick | Google Vertex AI | $0.35 | $1.15 | $0.175 | $0.575 | -
Llama 3.3 70B | Google Vertex AI | $0.72 | $0.72 | $0.36 | $0.36 | -
Llama 3.1 8B | Cerebras | $0.10 | $0.10 | - | - | Deprecating May 27, 2026

Terms explained:

  • Batch API - a lower-cost inference mode where requests are queued and processed asynchronously (not real-time), typically at 50% of standard pricing. Google Vertex AI offers batch pricing for all Llama models listed above (a per-request cost sketch follows below). Source: Google – Pricing

Source: Llama, Google – Pricing, Cerebras – Pricing
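
To translate these per-token rates into request-level costs, a small arithmetic sketch using the Vertex AI figures from the table above, with batch mode modeled at 50% of standard pricing as described under Terms explained:

```python
# Sketch: estimating per-request cost from the $/MTok rates above.
# Rates are the Google Vertex AI figures from the pricing table;
# batch mode is modeled at 50% of the standard rate.
RATES = {  # model: (input $/MTok, output $/MTok)
    "llama-4-scout": (0.25, 0.70),
    "llama-4-maverick": (0.35, 1.15),
    "llama-3.3-70b": (0.72, 0.72),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 batch: bool = False) -> float:
    """Estimated USD cost of a single request."""
    in_rate, out_rate = RATES[model]
    multiplier = 0.5 if batch else 1.0
    return multiplier * (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 50k-token prompt with a 2k-token completion on Maverick.
print(f"standard: ${request_cost('llama-4-maverick', 50_000, 2_000):.4f}")  # $0.0198
print(f"batch:    ${request_cost('llama-4-maverick', 50_000, 2_000, batch=True):.4f}")  # $0.0099
```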

Model Performance / Benchmarks

Benchmark | Llama 4 Maverick | Llama 4 Scout
MMLU Pro | 80.5 | 74.3
LiveCodeBench | 43.4 | 32.8

Additional specifications:

  • Context windows: 10M tokens (Llama 4 Scout), 1M tokens (Llama 4 Maverick); both natively multimodal (text + image)
  • Llama 4 Scout: optimized for efficient inference on a single H100 GPU
  • Llama 4 Maverick: targets frontier-level performance with higher resource requirements (a multi-GPU serving sketch follows below)
  • Estimated distributed inference cost: $0.19-$0.49/MTok

Source: Llama
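
Where a deployment exceeds a single GPU, as the Maverick bullet above suggests, engines such as vLLM can shard a model across devices via tensor parallelism. A minimal sketch; the repo ID and GPU count are assumptions to adjust to your checkpoint and hardware:

```python
# Sketch: multi-GPU (distributed) inference via vLLM tensor parallelism.
# Assumes one node with 8 GPUs; the repo ID and tensor_parallel_size
# are assumptions to adjust to your checkpoint and hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed repo ID
    tensor_parallel_size=8,  # shard the weights across 8 GPUs
)
outputs = llm.generate(
    ["Summarize the trade-offs of tensor parallelism."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```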

Latest News

Llama 4 Scout and Maverick Release

Llama 4 Scout and Llama 4 Maverick are available as open-weight models. Llama 4 Scout features a 10M token context window and Llama 4 Maverick a 1M token window; both offer native multimodal support (text + image). Llama 4 Maverick achieves 80.5 MMLU Pro and 43.4 LiveCodeBench; Llama 4 Scout achieves 74.3 MMLU Pro and 32.8 LiveCodeBench. Both can be downloaded from www.llama.com and are supported by major inference providers including Cerebras, Google Vertex AI, Together AI, Fireworks AI, AWS Bedrock, and Azure AI.

Cerebras Deprecation of Llama 3.1 8B

Cerebras is deprecating Llama 3.1 8B on its platform effective May 27, 2026. The model is priced at $0.10/MTok for both input and output. Users should migrate to Llama 4 Scout or Maverick before the deprecation date.

Website Migration

Meta's Llama website has migrated from llama.meta.com to https://www.llama.com. The old URL returns HTTP 400 errors. All model downloads and documentation are now hosted at the new domain.

Source: Llama, Cerebras – Pricing

Community Signals

LiveCodeBench and Context Window Discussion

Llama 4 Maverick's LiveCodeBench score of 43.4 is frequently cited in coding benchmark discussions, with community members comparing it favorably to GPT-4-class models for code generation tasks. Scout's 10M token context window is a major talking point, with developers noting it enables processing very large codebases in a single prompt. However, practical latency and cost at that context length are still being evaluated by the community.
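
As a rough illustration of what "a very large codebase in a single prompt" involves, a sketch that concatenates a repository into one prompt and estimates its token count; the directory path, suffix filter, and 4-characters-per-token heuristic are assumptions, so use the model's actual tokenizer for real budgeting:

```python
# Sketch: packing a repository into a single long-context prompt.
# The directory path, suffix filter, and 4-chars-per-token estimate
# are assumptions; use the model's real tokenizer for actual budgeting.
from pathlib import Path

def pack_repo(root: str, suffixes: tuple[str, ...] = (".py", ".md")) -> str:
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            parts.append(f"### {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

prompt = pack_repo("./my_project")  # assumed project directory
est_tokens = len(prompt) // 4       # crude ~4 chars/token heuristic
print(f"~{est_tokens:,} tokens; fits 10M window: {est_tokens < 10_000_000}")
```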

Self-Hosting for Privacy

The lack of a first-party coding agent product from Meta means users must build their own agent infrastructure using tools like Continue, Cline, Aider, or OpenHands on top of Llama models. Organizations with strong privacy requirements (defense, healthcare, finance) often choose Llama models for on-premises deployment to avoid sending code to third-party APIs.
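
As a sense of the integration effort involved, a minimal sketch of one agent-style edit step against a self-hosted, OpenAI-compatible endpoint; the localhost URL and model name assume a local vLLM server, and real tools such as Aider or OpenHands layer diffing, retries, and sandboxing on top:

```python
# Sketch of one agent-style edit step against a self-hosted endpoint.
# Assumes a local OpenAI-compatible server (e.g., vLLM) on localhost:8000;
# the model name is an assumption. Code never leaves the machine.
from pathlib import Path

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

def edit_file(path: str, instruction: str) -> None:
    """Send one file plus an instruction, write back the revised file."""
    source = Path(path).read_text()
    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed repo ID
        messages=[
            {"role": "system",
             "content": "Return only the full revised file, with no commentary."},
            {"role": "user", "content": f"{instruction}\n\n{source}"},
        ],
    )
    Path(path).write_text(response.choices[0].message.content)

edit_file("app.py", "Add type hints to every function.")  # illustrative call
```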

Provider Lock-In Concerns

The Cerebras Llama 3.1 8B deprecation has prompted discussion about provider lock-in when relying on a single inference provider for open-weight models.

Source: Llama

Enterprise Readiness

Feature | Available? | Details
SSO (SAML) | N/A | Meta does not offer a managed platform. Models are downloaded and deployed by the user.
SSO (OIDC) | N/A | Same as above.
SCIM | N/A | Same as above.
Audit logs | N/A | Same as above.
IP indemnity | No | Not offered. Models are open-weight with no commercial indemnification from Meta.
Data residency | Yes | Full control when self-hosting. Models can run on any infrastructure in any region.
HIPAA | N/A | Self-hosted deployments can be made HIPAA-compliant by the deploying organization.
Air-gapped / on-prem | Yes | Models can be downloaded and deployed on air-gapped infrastructure. Full data isolation. Source: Llama
SLA | N/A | No managed service. Availability depends on the user's own infrastructure.
Admin controls (RBAC) | N/A | No managed platform. Controls depend on the user's deployment infrastructure.

Transparency Gaps

Metric | Status | Notes
Recommended inference costs | not applicable | Meta does not set inference pricing
Self-hosting hardware requirements | undisclosed | No official guidance on minimum GPU specs for Llama 4 Scout or Maverick
Fine-tuning tools | partially disclosed | Meta provides Llama fine-tuning guides but specifics vary by model size
Cerebras Llama 4 Maverick pricing | undisclosed | Cerebras lists Llama 4 Maverick as supported but has not published per-token pricing
Together AI / Fireworks AI Llama 4 pricing | undisclosed | Pricing pages not updated with Llama 4 per-token rates at time of report
Context window performance at scale | undisclosed | 10M token context (Scout) is claimed but no official latency/cost benchmarks published at that scale