<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Llama-4 on RockB</title><link>https://baeseokjae.github.io/tags/llama-4/</link><description>Recent content in Llama-4 on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 02 May 2026 21:07:51 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/llama-4/index.xml" rel="self" type="application/rss+xml"/><item><title>Llama 4 API Developer Guide 2026: Scout, Maverick, MoE Architecture and Integration</title><link>https://baeseokjae.github.io/posts/llama-4-api-developer-guide-2026/</link><pubDate>Sat, 02 May 2026 21:07:51 +0000</pubDate><guid>https://baeseokjae.github.io/posts/llama-4-api-developer-guide-2026/</guid><description>Complete developer guide to Llama 4 Scout and Maverick APIs: MoE architecture, 10M-token context, pricing, vLLM deployment, and OpenAI-compatible integration.</description><content:encoded><![CDATA[<p>Llama 4 Scout and Maverick are Meta&rsquo;s open-weight multimodal models — available today via multiple API providers with OpenAI-compatible endpoints. Scout offers a 10M-token context window at $0.08–$0.15 per 1M input tokens; Maverick beats GPT-4o on MMLU, HumanEval, and SWE-bench. Here&rsquo;s how to integrate both.</p>
<h2 id="what-is-llama-4-scout-maverick-and-behemoth-explained">What Is Llama 4? Scout, Maverick, and Behemoth Explained</h2>
<p>Llama 4 is Meta&rsquo;s fourth-generation open-weight large language model family, released in April 2025 as a multimodal, Mixture-of-Experts architecture covering three tiers: Scout, Maverick, and the research-preview Behemoth. Scout has 17B active parameters out of ~109B total across 16 experts, with a groundbreaking 10-million-token context window — the largest available in any production API as of May 2026. Maverick scales to ~400B total parameters (still 17B active per forward pass) across 128 experts and delivers benchmark scores of 91.8% MMLU, 91.5% HumanEval, and 74.2% SWE-bench, outperforming GPT-4o and Gemini 2.0 Flash. Behemoth sits at ~2 trillion total parameters with 288B active — still in training and research preview, not yet available via public API. All three models support multimodal inputs (text + images), structured output, function calling, and streaming. The key architectural insight is that active parameter count — not total — determines inference cost, which is why both Scout and Maverick run at the speed of a ~17B dense model while achieving quality far above their class. Meta released these models under a custom Llama 4 Community License that permits commercial use with attribution for most use cases.</p>
<h2 id="moe-architecture-deep-dive-how-llama-4-achieves-17b-active-parameters">MoE Architecture Deep Dive: How Llama 4 Achieves 17B Active Parameters</h2>
<p>Mixture-of-Experts (MoE) is a neural network design where each token is routed to only a subset of &ldquo;expert&rdquo; sub-networks during the forward pass, rather than activating all model weights. Llama 4 Scout uses 16 experts with 17B active parameters out of ~109B total — meaning each token uses roughly 15% of the full parameter space per inference call. Maverick uses 128 experts (alternating MoE and dense layers) with the same 17B active parameter budget per token but dramatically more total capacity. In practice, this means a prompt sent to Maverick costs the same compute as sending it to a 17B dense model, while benefiting from 400B parameters worth of learned representations distributed across experts. For developers, the implication is straightforward: you pay for active compute, not total parameters. Maverick at $0.35–$0.65 per 1M input tokens competes with GPT-4o at $5.00/M on quality benchmarks while running at a fraction of the cost. Additionally, Maverick was co-distilled from Llama Behemoth using a novel loss function that dynamically weights student/teacher logits — the 2T parameter teacher is effectively baked into the 400B student&rsquo;s weights without needing to run it at inference time.</p>
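<p>To make the routing idea concrete, here is a toy top-k routing layer in plain NumPy. It is an illustrative sketch of the mechanism described above, not Meta&rsquo;s implementation; the expert count and dimensions are arbitrary.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def moe_layer(x, experts, router, top_k=1):
    """Route one token's hidden state to its top-k experts and mix their outputs."""
    scores = router @ x                          # routing logits, one per expert
    chosen = np.argsort(scores)[-top_k:]         # indices of the top-k experts
    gate = np.exp(scores[chosen])
    gate = gate / gate.sum()                     # softmax over the chosen experts only
    # Only the selected experts run; the remaining weights stay idle, which is
    # why inference cost tracks active (not total) parameters.
    return sum(g * (experts[e] @ x) for g, e in zip(gate, chosen))

rng = np.random.default_rng(0)
hidden, num_experts = 8, 16                      # Scout-like expert count
experts = [rng.standard_normal((hidden, hidden)) for _ in range(num_experts)]
router = rng.standard_normal((num_experts, hidden))
print(moe_layer(rng.standard_normal(hidden), experts, router))
</code></pre></div>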
<h3 id="why-expert-count-matters-for-your-use-case">Why Expert Count Matters for Your Use Case</h3>
<p>Higher expert counts mean more specialized knowledge domains get dedicated capacity. Scout&rsquo;s 16 experts excel at long-context retrieval and document understanding. Maverick&rsquo;s 128 experts deliver stronger reasoning, coding, and instruction-following because more specialized sub-networks can activate for domain-specific patterns. For most coding and reasoning tasks, choose Maverick. For document ingestion, legal review, or any workflow involving files larger than 100K tokens, Scout&rsquo;s 10M context window is the architectural differentiator.</p>
<h2 id="irope-and-the-10m-token-context-window">iRoPE and the 10M-Token Context Window</h2>
<p>The iRoPE (interleaved Rotary Position Embedding) architecture is the mechanism behind Scout&rsquo;s 10-million-token context window — a leap from the 128K tokens common in competing models. iRoPE works by alternating between standard RoPE attention layers (which apply rotational position encoding to give tokens relative position awareness) and NoPE layers (No Position Encoding) every 4 layers throughout the transformer stack. Standard RoPE layers struggle to extrapolate far beyond their training context length because the rotational frequencies saturate. NoPE layers apply no explicit positional encoding at all — ordering comes only from the causal mask — which paradoxically helps the model generalize across arbitrary distances. By interleaving the two, Llama 4 Scout maintains local position awareness for nearby tokens while allowing global attention across millions of tokens without the positional extrapolation failures that degrade attention at extreme lengths. For developers, this means no chunking middleware, no RAG pipelines for large documents, and no retrieval approximation errors. You can send an entire 500-page PDF, a full codebase, or a year of chat history as a single context window and let the model reason over all of it at once. Current API providers cap at 512K–1M tokens per request even when the model supports 10M, but this limit is expected to increase as infrastructure scales.</p>
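<p>The interleaving pattern itself is simple enough to sketch. This is purely illustrative, with the one-NoPE-layer-in-four ratio taken from the description above rather than from a published Meta configuration:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def layer_uses_rope(layer_idx, nope_interval=4):
    """Every nope_interval-th transformer layer skips positional encoding entirely."""
    return (layer_idx + 1) % nope_interval != 0

layout = ["RoPE" if layer_uses_rope(i) else "NoPE" for i in range(8)]
print(layout)  # ['RoPE', 'RoPE', 'RoPE', 'NoPE', 'RoPE', 'RoPE', 'RoPE', 'NoPE']
</code></pre></div>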
<h2 id="getting-api-access-meta-llama-api-groq-togetherai-fireworks-and-more">Getting API Access: Meta Llama API, Groq, Together.ai, Fireworks, and More</h2>
<p>Llama 4 Scout and Maverick are available through Meta&rsquo;s own Llama API and at least five major third-party inference providers as of May 2026. Meta&rsquo;s Llama API (llama.developer.meta.com) is OpenAI SDK compatible — you swap the base URL and model name, and existing code works unchanged. Groq delivers 250+ tokens per second for Llama 4 at approximately $0.59 per 1M input tokens, making it the fastest option for latency-sensitive applications like chat UIs and real-time agents. Together.ai and Fireworks.ai offer batch processing at lower cost, with Scout pricing starting at $0.08 per 1M input tokens — the cheapest available. Replicate and HuggingFace Inference also host both models with per-request billing. There is a nearly 2x price spread between the cheapest and most expensive provider for Maverick (see the comparison table below), which means routing strategy (high-volume batch to Together.ai, interactive real-time to Groq) meaningfully reduces costs at scale. All providers require account creation and API key generation; most offer a free tier with rate limits. The Meta Llama API currently provides the most reliable access with direct model updates from Meta, but Groq&rsquo;s throughput advantage is significant for production chat workloads.</p>
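<p>Because every provider speaks the same OpenAI-compatible protocol, routing is just a choice of <code>base_url</code>. A minimal routing sketch follows; the Together.ai base URL and the environment-variable names are assumptions to adapt to your own setup:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import os

from openai import OpenAI

# Interactive traffic goes to Groq for throughput; batch jobs go to Together.ai for price.
PROVIDERS = {
    "groq": ("https://api.groq.com/openai/v1", os.environ["GROQ_API_KEY"]),
    "together": ("https://api.together.xyz/v1", os.environ["TOGETHER_API_KEY"]),
}

def client_for(workload):
    name = "groq" if workload == "interactive" else "together"
    base_url, api_key = PROVIDERS[name]
    return OpenAI(base_url=base_url, api_key=api_key)

chat_client = client_for("interactive")   # low-latency chat UI calls
batch_client = client_for("batch")        # high-volume offline processing
</code></pre></div>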
<h2 id="quickstart-using-llama-4-with-the-openai-compatible-api-python-examples">Quickstart: Using Llama 4 with the OpenAI-Compatible API (Python Examples)</h2>
<p>Meta&rsquo;s Llama API maintains OpenAI SDK compatibility — existing OpenAI client code works with only a <code>base_url</code> and <code>model</code> name change. This is the fastest migration path from GPT-4o to Llama 4. Below are minimal working examples for the most common patterns.</p>
<p><strong>Install dependencies:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install openai
</span></span></code></pre></div><p><strong>Basic completion (Meta Llama API):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI(
</span></span><span style="display:flex;"><span>    api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;your-llama-api-key&#34;</span>,
</span></span><span style="display:flex;"><span>    base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;https://api.llama.com/compat/v1&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;meta-llama/Llama-4-Scout-17B-16E-Instruct&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Explain MoE architecture in 3 sentences.&#34;</span>}],
</span></span><span style="display:flex;"><span>    max_tokens<span style="color:#f92672">=</span><span style="color:#ae81ff">512</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content)
</span></span></code></pre></div><p><strong>Using Groq for low-latency inference:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI(
</span></span><span style="display:flex;"><span>    api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;your-groq-api-key&#34;</span>,
</span></span><span style="display:flex;"><span>    base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;https://api.groq.com/openai/v1&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;meta-llama/llama-4-scout-17b-16e-instruct&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Write a Python function to parse JSON.&#34;</span>}]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content)
</span></span></code></pre></div><p><strong>Streaming:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>stream <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;meta-llama/Llama-4-Maverick-17B-128E-Instruct&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Generate a test suite for this function.&#34;</span>}],
</span></span><span style="display:flex;"><span>    stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> chunk <span style="color:#f92672">in</span> stream:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> chunk<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>delta<span style="color:#f92672">.</span>content:
</span></span><span style="display:flex;"><span>        print(chunk<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>delta<span style="color:#f92672">.</span>content, end<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;</span>, flush<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span></code></pre></div><h2 id="scout-vs-maverick-which-model-should-you-use">Scout vs Maverick: Which Model Should You Use?</h2>
<p>Scout and Maverick are architecturally similar — both use 17B active parameters per forward pass — but they are optimized for entirely different use cases, and choosing the wrong one is a common and expensive mistake. Scout excels when context length is the bottleneck: its 10-million-token window eliminates chunking middleware, RAG pipelines, and retrieval approximation errors for large-document workflows. Maverick excels when reasoning depth is the bottleneck: with 128 experts across ~400B total parameters and Behemoth co-distillation, it outscores GPT-4o on HumanEval (91.5% vs 90.2%) and SWE-bench (74.2% vs 72.0%), making it the strongest open-weight model for coding and complex instruction-following as of May 2026. Price also differs: Scout starts at $0.08 per 1M input tokens while Maverick ranges from $0.35–$0.65/M depending on provider. For most developers, the decision rule is straightforward — if your prompt fits in 128K tokens and involves reasoning or code, use Maverick. If you&rsquo;re processing documents, codebases, or long histories, use Scout.</p>
<table>
  <thead>
      <tr>
          <th>Criterion</th>
          <th>Scout</th>
          <th>Maverick</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Active parameters</td>
          <td>17B</td>
          <td>17B</td>
      </tr>
      <tr>
          <td>Total parameters</td>
          <td>~109B</td>
          <td>~400B</td>
      </tr>
      <tr>
          <td>Experts</td>
          <td>16</td>
          <td>128</td>
      </tr>
      <tr>
          <td>Context window</td>
          <td>10M tokens</td>
          <td>1M tokens</td>
      </tr>
      <tr>
          <td>MMLU score</td>
          <td>~79%</td>
          <td>91.8%</td>
      </tr>
      <tr>
          <td>HumanEval</td>
          <td>~72%</td>
          <td>91.5%</td>
      </tr>
      <tr>
          <td>SWE-bench</td>
          <td>~61%</td>
          <td>74.2%</td>
      </tr>
      <tr>
          <td>Input price (cheapest)</td>
          <td>$0.08/M</td>
          <td>$0.35/M</td>
      </tr>
      <tr>
          <td>Best for</td>
          <td>Long-context, document analysis</td>
          <td>Coding, reasoning, complex tasks</td>
      </tr>
  </tbody>
</table>
<p>Choose <strong>Scout</strong> when: processing large documents (contracts, codebases, transcripts), building RAG-free pipelines, or running high-volume batch jobs where cost per token matters most. Choose <strong>Maverick</strong> when: writing or reviewing code, solving multi-step reasoning problems, handling complex instruction-following, or competing with GPT-4o on quality benchmarks.</p>
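<p>The decision rule above is easy to encode in a small dispatcher. In the sketch below, the 128K-token threshold and the task buckets mirror this guide&rsquo;s heuristics, not an official recommendation:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">SCOUT = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
MAVERICK = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

def pick_model(prompt_tokens, task):
    """Route long-context work to Scout and reasoning/coding work to Maverick."""
    if prompt_tokens &gt; 128_000:
        return SCOUT                  # only Scout's 10M window can hold it
    if task in {"coding", "reasoning", "agent"}:
        return MAVERICK               # stronger benchmarks for these tasks
    return SCOUT                      # cheapest reasonable default

print(pick_model(500_000, "document-analysis"))  # Scout
print(pick_model(4_000, "coding"))               # Maverick
</code></pre></div>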
<h2 id="function-calling-streaming-and-vision-inputs">Function Calling, Streaming, and Vision Inputs</h2>
<p>Llama 4 supports three advanced API features that cover most production use cases: function calling (tool use), streaming responses, and vision/image inputs. Function calling follows the OpenAI tools API format — define a JSON schema for your function, pass it in the <code>tools</code> parameter, and the model returns structured <code>tool_call</code> objects when it decides to invoke a function. This works identically to the OpenAI SDK, so existing tool-use code requires no changes beyond the <code>base_url</code> swap. Vision inputs accept standard base64-encoded images or URLs in the <code>content</code> array using the multimodal message format. Both Scout and Maverick are natively multimodal — there is no separate vision model to call. Streaming uses server-sent events (SSE) and is enabled with <code>stream=True</code> on the client; the response yields <code>ChatCompletionChunk</code> objects compatible with OpenAI&rsquo;s streaming format. One important difference from OpenAI: Llama 4 has no enforced output filtering. The model will not automatically refuse or sanitize responses. Developers building consumer-facing applications must implement their own content moderation layer — Meta recommends Llama Guard 3 (available via the same API) as a post-processing filter.</p>
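<h3 id="vision-input-example">Vision Input Example</h3>
<p>A minimal sketch of an image request using the OpenAI multimodal message format; the image URL is a placeholder, and image support should be verified with your provider:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the chart in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
</code></pre></div>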
<h3 id="function-calling-example">Function Calling Example</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>tools <span style="color:#f92672">=</span> [{
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;function&#34;</span>: {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;search_codebase&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;description&#34;</span>: <span style="color:#e6db74">&#34;Search for a symbol in the codebase&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;parameters&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;object&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;properties&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;symbol&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;file_pattern&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>}
</span></span><span style="display:flex;"><span>            },
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;required&#34;</span>: [<span style="color:#e6db74">&#34;symbol&#34;</span>]
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;meta-llama/Llama-4-Maverick-17B-128E-Instruct&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Find where UserService is defined.&#34;</span>}],
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>tools,
</span></span><span style="display:flex;"><span>    tool_choice<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;auto&#34;</span>
</span></span><span style="display:flex;"><span>)
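</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span># Handling the result is symmetrical to OpenAI's tools API. Sketch only:
</span></span><span style="display:flex;"><span># run_search is a hypothetical stand-in for your own search_codebase code.
</span></span><span style="display:flex;"><span>import json
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>msg = response.choices[0].message
</span></span><span style="display:flex;"><span>if msg.tool_calls:
</span></span><span style="display:flex;"><span>    call = msg.tool_calls[0]
</span></span><span style="display:flex;"><span>    args = json.loads(call.function.arguments)
</span></span><span style="display:flex;"><span>    result = run_search(**args)          # execute the requested tool locally
</span></span><span style="display:flex;"><span>    final = client.chat.completions.create(
</span></span><span style="display:flex;"><span>        model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
</span></span><span style="display:flex;"><span>        messages=[
</span></span><span style="display:flex;"><span>            {"role": "user", "content": "Find where UserService is defined."},
</span></span><span style="display:flex;"><span>            msg,                         # the assistant turn containing the tool call
</span></span><span style="display:flex;"><span>            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
</span></span><span style="display:flex;"><span>        ],
</span></span><span style="display:flex;"><span>        tools=tools,
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    print(final.choices[0].message.content)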
</span></span></code></pre></div><h2 id="api-provider-comparison-pricing-speed-and-rate-limits">API Provider Comparison: Pricing, Speed, and Rate Limits</h2>
<table>
  <thead>
      <tr>
          <th>Provider</th>
          <th>Scout Input</th>
          <th>Maverick Input</th>
          <th>Speed</th>
          <th>Rate Limit</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Meta Llama API</td>
          <td>$0.10/M</td>
          <td>$0.50/M</td>
          <td>Medium</td>
          <td>200 req/min</td>
          <td>Official, most up-to-date</td>
      </tr>
      <tr>
          <td>Groq</td>
          <td>$0.11/M</td>
          <td>$0.59/M</td>
          <td>250+ tok/s</td>
          <td>30 req/min (free)</td>
          <td>Fastest for real-time</td>
      </tr>
      <tr>
          <td>Together.ai</td>
          <td>$0.08/M</td>
          <td>$0.35/M</td>
          <td>Medium</td>
          <td>60 req/min</td>
          <td>Cheapest batch option</td>
      </tr>
      <tr>
          <td>Fireworks.ai</td>
          <td>$0.09/M</td>
          <td>$0.40/M</td>
          <td>Fast</td>
          <td>100 req/min</td>
          <td>Good reliability</td>
      </tr>
      <tr>
          <td>Replicate</td>
          <td>$0.12/M</td>
          <td>$0.65/M</td>
          <td>Variable</td>
          <td>Per-model</td>
          <td>Easy setup</td>
      </tr>
  </tbody>
</table>
<p><strong>Rate limit handling</strong> — all providers return HTTP 429 when a rate limit is exceeded. Use exponential backoff with jitter: start at 1 second, double on each retry, cap at 60 seconds, and add ±10% random jitter. Most providers also return rate-limit headers (such as <code>x-ratelimit-remaining-requests</code> and <code>x-ratelimit-remaining-tokens</code>) that report the remaining quota in the current window; check your provider&rsquo;s documentation for the exact header names.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> time<span style="color:#f92672">,</span> random
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">call_with_backoff</span>(client, max_retries<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>, <span style="color:#f92672">**</span>kwargs):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> attempt <span style="color:#f92672">in</span> range(max_retries):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(<span style="color:#f92672">**</span>kwargs)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> <span style="color:#a6e22e">Exception</span> <span style="color:#66d9ef">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> <span style="color:#e6db74">&#34;429&#34;</span> <span style="color:#f92672">in</span> str(e) <span style="color:#f92672">and</span> attempt <span style="color:#f92672">&lt;</span> max_retries <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span>:
</span></span><span style="display:flex;"><span>                wait <span style="color:#f92672">=</span> min(<span style="color:#ae81ff">60</span>, (<span style="color:#ae81ff">2</span> <span style="color:#f92672">**</span> attempt)) <span style="color:#f92672">*</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">+</span> random<span style="color:#f92672">.</span>uniform(<span style="color:#f92672">-</span><span style="color:#ae81ff">0.1</span>, <span style="color:#ae81ff">0.1</span>))
</span></span><span style="display:flex;"><span>                time<span style="color:#f92672">.</span>sleep(wait)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">raise</span>
</span></span></code></pre></div><h2 id="self-hosting-llama-4-vllm-and-ollama-deployment-guide">Self-Hosting Llama 4: vLLM and Ollama Deployment Guide</h2>
<p>Self-hosting Llama 4 Scout is realistic for teams with GPU infrastructure: its ~109B total parameters come to roughly 110GB of weights at FP8, so plan on two H100 80GB GPUs (a single H100 80GB works only with Int4 quantization, as in the Ollama option below). Maverick&rsquo;s ~400B total parameters need roughly 400GB at FP8, which in practice means a full 8x H100 80GB node. The recommended self-hosting path is vLLM, which added native support for Llama 4&rsquo;s MoE architecture in its 0.8.x releases.</p>
<p><strong>vLLM Scout deployment (2x H100, FP8):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install vllm&gt;<span style="color:#f92672">=</span>0.5.0
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model meta-llama/Llama-4-Scout-17B-16E-Instruct <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --quantization fp8 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --max-model-len <span style="color:#ae81ff">131072</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --tensor-parallel-size <span style="color:#ae81ff">2</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8000</span>
</span></span></code></pre></div><p><strong>vLLM Maverick deployment (8x H100):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model meta-llama/Llama-4-Maverick-17B-128E-Instruct <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --quantization fp8 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --max-model-len <span style="color:#ae81ff">65536</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --tensor-parallel-size <span style="color:#ae81ff">8</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --enable-chunked-prefill <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8000</span>
</span></span></code></pre></div><p>Enable <code>--enable-chunked-prefill</code> for Maverick to handle long-context prompts without OOM errors by breaking prefill into chunks. For Ollama users, Scout is available as <code>ollama pull llama4:scout</code>; the Q4-quantized weights come to roughly 65GB, so plan on a single 80GB GPU (or partial CPU offload) — context window is capped at 128K by default in Ollama&rsquo;s current config.</p>
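<p>Once the server is up, the Quickstart client code works against the local endpoint unchanged, because vLLM serves an OpenAI-compatible API under <code>/v1</code>. A quick smoke test (the API key is a dummy value unless you started vLLM with <code>--api-key</code>):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
reply = local.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Say hello from the local deployment."}],
    max_tokens=64,
)
print(reply.choices[0].message.content)
</code></pre></div>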
<h2 id="safety-and-guardrails-what-developers-must-handle-themselves">Safety and Guardrails: What Developers Must Handle Themselves</h2>
<p>Llama 4 has no enforced output filtering — there is no automatic refusal layer in the model weights or in the API. This is a deliberate design choice from Meta that distinguishes Llama 4 from OpenAI&rsquo;s GPT-4o and Google&rsquo;s Gemini, both of which bake content policies into the inference pipeline. For developers building internal tools, research applications, or B2B products, this is a feature: the model will not refuse legitimate but edge-case requests. For consumer-facing applications — anything accessible by the general public, minors, or users who may produce harmful outputs — developers are legally and operationally responsible for implementing their own content moderation. Meta provides Llama Guard 3, a 1B-parameter safety classifier, as a companion model available through the same API. Call it as a post-processing step on any response you show to users. You can also use it as a pre-filter on user inputs. The Llama Guard 3 API call costs roughly $0.001 per check at Meta&rsquo;s pricing — a negligible addition that substantially reduces liability. Additionally, Llama 4&rsquo;s system prompt is fully controllable: you can define persona, restrictions, and output format through the <code>system</code> role in the messages array. A well-designed system prompt is your first line of defense; Llama Guard 3 is your second.</p>
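<p>A post-processing moderation sketch along the lines Meta recommends. The Llama Guard 3 model ID and the convention that it replies with a line starting with &ldquo;safe&rdquo; or &ldquo;unsafe&rdquo; are assumptions to confirm against your provider&rsquo;s documentation:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def is_safe(client, user_prompt, assistant_reply):
    """Ask Llama Guard 3 to classify a finished exchange before showing it to users."""
    check = client.chat.completions.create(
        model="meta-llama/Llama-Guard-3-1B",   # assumed model ID; verify with your provider
        messages=[
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": assistant_reply},
        ],
        max_tokens=32,
    )
    verdict = check.choices[0].message.content.strip().lower()
    return not verdict.startswith("unsafe")
</code></pre></div>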
<h2 id="llama-4-vs-gpt-4o-vs-gemini-when-to-choose-open-weight">Llama 4 vs GPT-4o vs Gemini: When to Choose Open-Weight</h2>
<p>Llama 4 Maverick outperforms GPT-4o on MMLU (91.8% vs 88.7%), HumanEval (91.5% vs 90.2%), and SWE-bench (74.2% vs 72.0%) at roughly a tenth of the API cost for high-volume workloads. The open-weight architecture means you can self-host and pay for GPU time instead of per-token fees, audit the model weights, and fine-tune on proprietary data without data leaving your infrastructure. Choose <strong>Llama 4</strong> when: cost is a primary constraint, you need self-hosting for data privacy or compliance (HIPAA, GDPR, SOC 2), you want to fine-tune on domain-specific data, or you are building high-volume batch processing pipelines where per-token costs compound. Choose <strong>GPT-4o or Gemini</strong> when: you need guaranteed SLA with enterprise support contracts, you require built-in content moderation without building your own, you use OpenAI-specific features (Assistants API, persistent threads, Code Interpreter), or you need Google&rsquo;s native search integration via Gemini. The nearly 2x provider price spread for Llama 4 Maverick means that picking the right provider still matters once you have picked the right model. At 1M requests per month averaging a few thousand input tokens each, routing Maverick to Together.ai instead of the most expensive provider saves on the order of $15K annually, and moving the same traffic off GPT-4o saves well over $200K per year (see the cost sketch below).</p>
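<p>Back-of-the-envelope arithmetic makes the routing decision concrete. The token volume below is an assumption, so plug in your own traffic; prices come from the provider table above and the GPT-4o figure quoted earlier in this guide, and output-token costs are ignored for simplicity:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def monthly_cost(requests, avg_input_tokens, price_per_million):
    """Input-token spend per month in dollars."""
    return requests * avg_input_tokens / 1_000_000 * price_per_million

REQUESTS, AVG_TOKENS = 1_000_000, 4_000   # assumed: 1M requests/month, ~4K input tokens each

print(monthly_cost(REQUESTS, AVG_TOKENS, 0.35))  # Maverick on Together.ai: $1,400/month
print(monthly_cost(REQUESTS, AVG_TOKENS, 0.65))  # Maverick on Replicate:   $2,600/month
print(monthly_cost(REQUESTS, AVG_TOKENS, 5.00))  # GPT-4o at $5.00/M:       $20,000/month
</code></pre></div>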
<h2 id="faq">FAQ</h2>
<p>The following questions cover the most common developer concerns when integrating Llama 4 Scout and Maverick into production applications: API access options, the Scout vs Maverick decision, OpenAI migration compatibility, self-hosting hardware requirements, and multimodal input support. Each answer draws on current provider documentation and benchmark data as of May 2026. If you are evaluating Llama 4 for a new project, read the Scout vs Maverick section above first — picking the right model tier has a larger impact on cost and quality than any other configuration decision. For production deployments, review the Safety and Guardrails section carefully: unlike GPT-4o, Llama 4 has no built-in output filtering, and that responsibility falls entirely on the developer. Llama 4 represents the most capable open-weight model available today at production quality, and this FAQ is designed to help developers get unblocked quickly.</p>
<h3 id="is-llama-4-free-to-use">Is Llama 4 free to use?</h3>
<p>Llama 4 Scout and Maverick are free to download from Meta&rsquo;s model hub (HuggingFace, llama.com) for self-hosting. API usage through Meta&rsquo;s Llama API and third-party providers is paid, with Scout starting at $0.08 per 1M input tokens — among the cheapest frontier-quality models available. Most providers offer a free tier with rate limits suitable for development and testing.</p>
<h3 id="what-is-the-difference-between-llama-4-scout-and-maverick">What is the difference between Llama 4 Scout and Maverick?</h3>
<p>Scout and Maverick share the same 17B active parameter architecture but differ in total capacity and context window. Scout has ~109B total parameters, 16 experts, and a 10M-token context window — optimized for long-document processing. Maverick has ~400B total parameters, 128 experts, and a 1M-token context window — optimized for reasoning, coding, and complex instruction-following. Maverick scores significantly higher on reasoning benchmarks; Scout&rsquo;s main advantage is the extreme context window and lower price.</p>
<h3 id="can-i-migrate-from-openai-gpt-4o-to-llama-4-without-rewriting-my-code">Can I migrate from OpenAI GPT-4o to Llama 4 without rewriting my code?</h3>
<p>Yes. Meta&rsquo;s Llama API and Groq both expose OpenAI-compatible endpoints. Change <code>base_url</code> to the provider&rsquo;s URL and update the <code>model</code> parameter. Tool calling, streaming, system prompts, and message format are identical. The one difference is that Llama 4 does not enforce content filtering — you will need to add your own moderation if your current code relies on OpenAI&rsquo;s built-in refusals.</p>
<h3 id="what-hardware-do-i-need-to-self-host-llama-4">What hardware do I need to self-host Llama 4?</h3>
<p>Scout&rsquo;s ~109B total parameters come to roughly 110GB of weights at FP8, so plan on two H100 80GB GPUs, or a single H100 80GB with Int4 quantization. Maverick needs roughly 400GB at FP8, in practice a full 8x H100 80GB node. Use vLLM 0.8.x or newer for best performance; enable <code>--enable-chunked-prefill</code> for Maverick to avoid OOM on long prompts. In Q4 quantization via Ollama, Scout runs on a single 80GB GPU but with a reduced context window.</p>
<h3 id="does-llama-4-support-multimodal-image-inputs">Does Llama 4 support multimodal (image) inputs?</h3>
<p>Yes. Both Scout and Maverick are natively multimodal and accept image inputs via the standard OpenAI multimodal message format (base64 or URL in the <code>content</code> array). There is no separate vision model — the same endpoint handles text and image inputs. Image support is available through Meta&rsquo;s Llama API and most major providers; verify with your specific provider as image support may lag slightly behind text-only on some platforms.</p>
]]></content:encoded></item></channel></rss>