The Llama 4 herd

Llama 4 Models:
  - Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each.
  - They are natively multimodal: text + image input, text-only output.
  - Key achievements include industry-leading context lengths, strong coding/reasoning performance, and improved multilingual capabilities.
  - Knowledge cutoff: August 2024.

  Llama 4 Scout:
  - 17B active parameters, 16 experts, 109B total.
  - Fits on a single H100 GPU (INT4-quantized).
  - 10M token context window.
  - Outperforms previous Llama releases on multimodal tasks while being more resource-friendly.
  - Employs iRoPE architecture for efficient long-context attention.
  - Tested with up to 8 images per prompt.

  Llama 4 Maverick:
  - 17B active parameters, 128 experts, 400B total.
  - 1M token context window.
  - Does not fit on a single GPU; runs on a single H100 DGX host, or can be distributed across multiple hosts for greater efficiency.
  - Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and multilingual tests at a competitive cost.
  - Maintains strong image understanding and grounded reasoning ability.
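
The single-GPU claim for Scout and the single-host claim for Maverick can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: it counts weight storage and ignores the KV cache and activations. It also assumes an 80 GB H100 and an 8-GPU DGX host; neither figure comes from these notes.

```python
# Back-of-the-envelope weight-memory estimate.
# Weights only: KV cache and activations are ignored, so real needs are higher.

GB = 1e9  # decimal gigabytes

def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GB."""
    return total_params * bits_per_param / 8 / GB

H100_MEMORY_GB = 80      # assumption: 80 GB H100 variant
GPUS_PER_DGX_HOST = 8    # assumption: 8-GPU H100 DGX host
HOST_MEMORY_GB = H100_MEMORY_GB * GPUS_PER_DGX_HOST

scout_int4 = weight_memory_gb(109e9, 4)      # Scout: 109B total parameters
maverick_int4 = weight_memory_gb(400e9, 4)   # Maverick: 400B total parameters
maverick_8bit = weight_memory_gb(400e9, 8)

print(f"Scout INT4:     {scout_int4:.1f} GB, fits one GPU: {scout_int4 <= H100_MEMORY_GB}")
print(f"Maverick INT4:  {maverick_int4:.1f} GB, fits one GPU: {maverick_int4 <= H100_MEMORY_GB}")
print(f"Maverick 8-bit: {maverick_8bit:.1f} GB, fits one host: {maverick_8bit <= HOST_MEMORY_GB}")
```

Under these assumptions, Scout's INT4 weights take about 54.5 GB and fit within one 80 GB GPU, while Maverick's weights exceed a single GPU even at 4 bits but fit within the combined memory of one 8-GPU host.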

  Llama 4 Behemoth (Preview):
  - 288B active parameters, 16 experts, nearly 2T total.
  - Still in training; not yet released.
  - Outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on several STEM benchmarks (e.g., MATH-500, GPQA Diamond).
  - Serves as the “teacher” model for Scout and Maverick via co-distillation.

  Misc:
  - MoE Architecture: Only a fraction of the total parameters is activated per token (17B for Scout and Maverick, 288B for Behemoth), reducing inference cost.
  - Native Multimodality: Early fusion of text and vision tokens in a unified model backbone, pre-trained on large-scale unlabeled data.
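
To illustrate why an MoE model touches only part of its weights per token, here is a minimal sketch of generic top-k expert routing. It is not Llama 4's actual router: the toy scalar "experts", the random router logits, and the function names are all hypothetical stand-ins. The parameter shares at the end use only the figures listed in these notes, with "nearly 2T" rounded to 2T.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=1):
    """Return (expert_index, weight) pairs for the top_k experts.

    Weights are renormalized so they sum to 1 over the chosen experts.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]

def moe_layer(token, experts, router_logits, top_k=1):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](token) for i, w in route_token(router_logits, top_k))

# Toy experts: scalar functions standing in for feed-forward blocks (hypothetical).
experts = [lambda x, s=s: s * x for s in range(1, 17)]  # 16 experts, as in Scout

random.seed(0)
logits = [random.gauss(0, 1) for _ in experts]  # stand-in for a learned router
selected = route_token(logits, top_k=1)
output = moe_layer(2.0, experts, logits, top_k=1)

# Share of total parameters active per token, from the figures in these notes.
active_share = {
    "Scout": 17 / 109,
    "Maverick": 17 / 400,
    "Behemoth": 288 / 2000,  # "nearly 2T" rounded to 2T
}
for name, share in active_share.items():
    print(f"{name}: {share:.1%} of parameters active per token")
```

With these figures, Scout activates roughly 16% of its weights per token, Maverick about 4%, and Behemoth about 14%; the remaining experts are skipped for that token, which is where the inference savings come from.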

Our Rules