
xAI: Grok 4.20 Multi-Agent

Grok 4.20 Multi-Agent is a variant of xAI’s Grok 4.20 designed for collaborative, agent-based workflows. Multiple agents operate in parallel to conduct deep research, coordinate tool use, and synthesize information...

Anyone in the Space can @-mention xAI: Grok 4.20 Multi-Agent with the team's shared context - pooled credits, one chat, one memory.


Verdict

Grok 4.20 Multi-Agent brings a 2M-token context window and multi-agent orchestration at $1.25/$2.50 per Mtok — competitive pricing for massive-context work. The multi-agent architecture lets you decompose complex tasks into parallel sub-problems, which shines for research synthesis and multi-document analysis. Trade-off: no public benchmarks yet, so you're flying blind on coding accuracy and reasoning depth versus Claude or GPT-4. Reach for this when you need enormous context and can tolerate some uncertainty on raw performance.

Best for

Multi-document research synthesis across hundreds of pages
Parallel task decomposition with agent orchestration
Cost-sensitive massive-context workflows
Exploratory analysis on large codebases or datasets
Vision tasks combined with long-form text

Strengths

The 2M-token window handles entire books or sprawling codebases in one pass, and the multi-agent design lets you split complex queries into parallel streams, which is useful for comparing dozens of documents or running multi-step research workflows. At $1.25/Mtok input, pricing sits well below Claude Opus 4's $15/Mtok list price, making it viable for high-volume, context-heavy work. Multimodal support (text, image, file) means you can mix screenshots and PDFs in the same session.

Trade-offs

No public benchmarks means you can't compare coding accuracy, math reasoning, or instruction-following against Claude Sonnet 4.5 or GPT-4o. Early xAI models lagged on nuanced reasoning tasks, and without MMLU or HumanEval scores we can't confirm this generation closed the gap. The multi-agent feature is novel but adds latency and complexity — simpler queries may perform worse than a single-shot model. If you need proven performance on STEM or code, wait for benchmark data.

Specifications

Provider: x-ai
Category: llm
Context length: 2,000,000 tokens
Max output: —
Modalities: text, image, file
License: proprietary
Released: 2026-03-31

Pricing

Input: $1.25/Mtok
Output: $2.50/Mtok
Model ID: x-ai/grok-4.20-multi-agent

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Seats: 5 people
Messages / seat / day: 80
Avg turn size: 2 ktok
Output share: 30 %

Estimated monthly spend

$28.60

17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
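
The estimate above can be reproduced from the calculator inputs and the listed per-token prices. This is a minimal sketch, not Switchy's actual metering logic: the function name is hypothetical, "output share" is assumed to be the fraction of tokens billed at the output rate, and 22 billable days per month is an inference (it is the value that yields the 17.6M tokens shown).

```python
INPUT_PRICE_PER_MTOK = 1.25   # USD, from the Pricing section
OUTPUT_PRICE_PER_MTOK = 2.50  # USD, from the Pricing section


def estimate_monthly_spend(seats, msgs_per_seat_per_day, avg_turn_tokens,
                           output_share, days_per_month=22):
    """Return (tokens_per_month, usd_per_month).

    days_per_month=22 is an assumption: it is the value that reproduces
    the 17.6M tokens/month shown on the page.
    """
    tokens = seats * msgs_per_seat_per_day * avg_turn_tokens * days_per_month
    output_tokens = tokens * output_share
    input_tokens = tokens - output_tokens
    usd = (input_tokens * INPUT_PRICE_PER_MTOK
           + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
    return tokens, round(usd, 2)


tokens, usd = estimate_monthly_spend(5, 80, 2_000, 0.30)
print(f"{tokens / 1e6:.1f}M tokens / month, ${usd:.2f}")
# prints: 17.6M tokens / month, $28.60
```

Raising the output share moves the total faster than adding messages does, because output tokens are billed at twice the input rate.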

Providers

Provider	Context	Input	Output	P50 latency	Throughput	30d uptime
x-ai	2000k	$1.25/Mtok	$2.50/Mtok	—	—	—

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Multi-Document Research Synthesis

I'm uploading 25 PDFs on climate adaptation strategies. Identify the three most-cited intervention types, compare their cost-effectiveness across studies, and flag any contradictory findings. Organize your analysis by intervention type.

Codebase Architecture Review

Review this 80k-line Python codebase. Identify architectural bottlenecks, flag modules with high cyclomatic complexity, and suggest three refactoring priorities. Focus on scalability and maintainability.

Legal Contract Cross-Reference

I'm providing 15 vendor contracts. Extract all indemnification clauses, compare liability caps, and flag any non-standard termination terms. Summarize discrepancies in a table.

Vision-Heavy Technical Documentation

This 200-page technical manual includes 40 diagrams. Extract all safety warnings, cross-reference them with the diagrams, and create a checklist for field technicians. Prioritize by risk level.

Parallel Hypothesis Testing

Given this dataset on customer churn, test five hypotheses in parallel: pricing sensitivity, feature usage, support ticket volume, contract length, and seasonality. Report which factors show statistical significance and recommend retention strategies.

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

I have a 50-page technical specification PDF and three competing vendor proposals. Extract the key requirements, then build a comparison matrix showing which vendor meets each requirement. Flag any gaps or ambiguities.

Output

This example would produce a structured analysis beginning with a numbered list of extracted requirements (security certifications, API response times, data residency constraints, etc.), followed by a three-column comparison table mapping each requirement to vendor compliance status. The model would highlight two requirements where the spec uses vague language ('reasonable uptime' without SLA definition) and note that Vendor B's proposal doesn't address the API versioning requirement at all. The output would span roughly 800 words with clear section headers, making the 2M token context window's document-handling strength immediately visible.

Notes

Showcases the massive context window handling multiple large documents simultaneously: a 50-page PDF plus three proposals easily fits in 2M tokens. The multi-agent architecture likely coordinates separate analysis and synthesis steps. At $2.50/Mtok output, a thorough 1000-token response costs $0.0025, but long multi-document tasks will accumulate input costs quickly.

Prompt

Review this codebase (attaching 12 Python files, ~8000 lines total). Identify architectural inconsistencies, suggest a refactor plan to consolidate the three different error-handling patterns, and estimate migration effort.

Output

This example would produce a hierarchical breakdown: first, a diagram-style ASCII representation of the current architecture showing how modules interact; second, annotated code excerpts demonstrating the three error-handling patterns (try-except blocks, error return codes, and a custom Result wrapper); third, a phased refactor plan recommending the Result pattern as the standard with specific file-by-file migration steps; fourth, an effort estimate (roughly 3 developer-days, assuming test coverage exists). The response would reference specific line numbers and function names from the attached files, demonstrating that the model has genuinely parsed the entire codebase context.

Notes

Demonstrates file upload handling and the ability to reason across a mid-sized codebase within the 2M token window. The multi-agent framing suggests separate agents might handle static analysis versus refactor planning. However, without public benchmarks, we can't verify accuracy on complex refactoring logic compared to models with published HumanEval or SWE-bench scores.

Prompt

Attached: a blurry photo of a handwritten flowchart on a whiteboard. Convert this to Mermaid diagram syntax, then explain three potential bottlenecks in the process it describes.

Output

This example would first produce clean Mermaid code reconstructing the flowchart (likely 15-20 nodes with decision diamonds and process boxes), interpreting the handwritten labels despite the photo quality. Then it would provide a numbered analysis: (1) the approval step creates a single point of failure with no parallel path, (2) the data validation loop has no timeout, risking infinite retries, (3) the final aggregation step processes serially when the diagram suggests the inputs could be gathered concurrently. The explanation would reference specific node IDs from the generated Mermaid code.

Notes

Highlights image input capability combined with structured output generation and process analysis. The multi-agent architecture might split OCR/image understanding from logical reasoning. The 2M token context is overkill here but allows follow-up questions referencing the same image. Trade-off: at $1.25/Mtok input, image tokens add up faster than text, though a single photo is negligible.

Use-case deep-dives

Multi-stage research synthesis

When 2M tokens lets you process entire research libraries in one pass

A 4-person biotech consultancy needs to synthesize 40+ clinical trial PDFs (averaging 80 pages each) into a single comparative report every week. Grok 4.20's 2M token context window fits the entire corpus in one prompt: no chunking, no retrieval pipeline, no context loss between documents. At $1.25/Mtok input, processing 1.5M tokens of source material costs about $1.88 per report. The multi-agent architecture handles the synthesis workflow (extract → compare → draft → cite) without manual orchestration. If your documents are under 500K tokens combined, chunking them through a standard 128K model at $0.15/Mtok saves you money. Above that threshold, Grok's window eliminates the engineering overhead of RAG systems and keeps cross-document reasoning intact. Buy this if you're routinely hitting context limits on document-heavy workflows.
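
Before committing to a single-pass workflow, it can help to estimate whether a corpus fits the window and what the input side costs. The sketch below uses the 2,000,000-token context length and the $1.25/Mtok input price listed on this page. The function name, the 470 tokens-per-page figure, and the 20,000-token output reserve are illustrative assumptions; real token counts depend on the tokenizer and the documents.

```python
CONTEXT_WINDOW_TOKENS = 2_000_000  # from the Specifications section
INPUT_PRICE_PER_MTOK = 1.25        # USD, from the Pricing section


def plan_single_pass(num_docs, pages_per_doc, tokens_per_page=470,
                     reserve_for_output=20_000):
    """Estimate whether a corpus fits in one prompt and its input cost.

    tokens_per_page and reserve_for_output are illustrative assumptions;
    measure real token counts before relying on this estimate.
    """
    corpus_tokens = num_docs * pages_per_doc * tokens_per_page
    fits = corpus_tokens + reserve_for_output <= CONTEXT_WINDOW_TOKENS
    input_cost_usd = corpus_tokens * INPUT_PRICE_PER_MTOK / 1_000_000
    return {"corpus_tokens": corpus_tokens, "fits": fits,
            "input_cost_usd": round(input_cost_usd, 2)}


print(plan_single_pass(num_docs=40, pages_per_doc=80))
```

With these assumptions, 40 documents of 80 pages land at roughly 1.5M tokens and fit. At 60 documents the same estimate exceeds the window, so chunking or retrieval would be needed.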

High-volume image triage

Multi-agent vision workflows for content moderation at scale

A 12-person e-commerce platform reviews 8,000 user-uploaded product photos daily for policy violations (prohibited items, misleading angles, copyright issues). Grok's multi-agent mode lets you chain vision analysis (flag suspicious images) → text reasoning (check against policy rules) → decision routing (auto-approve, escalate, reject) in a single API call. The image modality handles photos natively without preprocessing. At the listed $1.25 input / $2.50 output per Mtok, a typical 3-stage workflow should cost under $0.008 per image reviewed, which keeps 8K images below $64/day. Without public vision benchmarks, you'll want to pilot 500 images against your policy edge-cases before committing. If accuracy on your specific violations stays above 92%, the agent orchestration saves 15 hours/week versus manual review queues. Deploy this when your moderation backlog exceeds 2,000 items/day.
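
The three stages described above (vision analysis, policy check, decision routing) can be prototyped client-side for a pilot run. This is a sketch of the routing logic only, not a description of how the model chains agents internally. `analyze_image` and `check_policy` are hypothetical callables standing in for model calls, and the thresholds are placeholders to tune on a pilot set.

```python
from dataclasses import dataclass


@dataclass
class Review:
    image_id: str
    risk_score: float      # 0.0 (benign) to 1.0 (clear violation)
    matched_rules: list


def route(review, approve_below=0.2, reject_above=0.9):
    """Stage 3: map a scored review to a decision.

    Thresholds are placeholders to be tuned on a pilot set.
    """
    if review.risk_score >= reject_above and review.matched_rules:
        return "reject"
    if review.risk_score < approve_below and not review.matched_rules:
        return "auto-approve"
    return "escalate"


def triage(image_ids, analyze_image, check_policy):
    """Run vision analysis -> policy check -> routing for each image.

    analyze_image and check_policy are caller-supplied callables standing in
    for model calls; they are hypothetical, not part of any real SDK.
    """
    decisions = {}
    for image_id in image_ids:
        risk = analyze_image(image_id)            # stage 1
        rules = check_policy(image_id, risk)      # stage 2
        decisions[image_id] = route(Review(image_id, risk, rules))
    return decisions


# Stubbed stages so the sketch runs without network access.
fake_scores = {"a.jpg": 0.05, "b.jpg": 0.95, "c.jpg": 0.5}
result = triage(
    fake_scores,
    analyze_image=lambda i: fake_scores[i],
    check_policy=lambda i, risk: ["prohibited-item"] if risk > 0.4 else [],
)
print(result)
```

Anything that is neither clearly benign nor clearly a violation falls through to "escalate", so threshold mistakes cost reviewer time rather than producing wrong automatic decisions.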

Long-context legal discovery

When contract review needs cross-reference across 50+ documents

A 3-attorney firm handles M&A due diligence where a single deal involves 60 contracts (NDAs, employment agreements, vendor MSAs, IP assignments) totaling 1.2M tokens. Grok's 2M window loads the entire deal room so the model can trace obligations across documents, for example finding where an NDA's confidentiality term conflicts with a vendor contract's disclosure clause. The multi-agent setup runs parallel analysis tracks (financial terms, liability caps, termination rights) then consolidates findings. At $1.25/Mtok input, each deal review costs $1.50 in processing. Output generation (the memo) is billed at $2.50/Mtok, so a 20K-token report adds another $0.05. If your deals are under 200K tokens, a 200K-context model such as Claude 3.5 Sonnet can also hold them in one prompt, but at $3/Mtok input it costs more per token. Above 500K tokens, Grok's window and agent routing justify the cost by eliminating multi-pass workflows.
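
The "parallel tracks, then consolidate" pattern described above can be mimicked client-side to compare against the model's own orchestration. This is a minimal sketch using Python's standard `concurrent.futures`. `analyze` is a hypothetical callable standing in for a model call, and the stub below only counts keyword matches so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

TRACKS = ["financial terms", "liability caps", "termination rights"]


def run_tracks(contracts, analyze, max_workers=3):
    """Run one analysis per track in parallel, then consolidate.

    `analyze(track, contracts)` is a caller-supplied callable standing in
    for a model call; it is hypothetical.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {track: pool.submit(analyze, track, contracts)
                   for track in TRACKS}
        findings = {track: fut.result() for track, fut in futures.items()}
    # Consolidation step: flatten per-track findings into one memo outline.
    return [f"{track}: {item}" for track in TRACKS for item in findings[track]]


def fake_analyze(track, contracts):
    # Stub: reports how many contracts mention the track's first keyword.
    keyword = track.split()[0]
    hits = [name for name, text in contracts.items() if keyword in text]
    return [f"{len(hits)} contract(s) mention '{keyword}'"]


contracts = {
    "nda.txt": "confidentiality and termination on 30 days notice",
    "msa.txt": "liability capped at fees paid; financial penalties apply",
}
memo = run_tracks(contracts, fake_analyze)
for line in memo:
    print(line)
```

Threads suit this shape because each track spends its time waiting on a network call, and keeping the consolidation step separate makes it easy to check that no track's findings are dropped.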

Frequently asked

Is Grok 4.20 Multi-Agent good for complex reasoning tasks?

Plausibly, but it is unverified. The multi-agent architecture is designed for tasks that need multiple reasoning steps or perspectives, and the 2M token context window lets it handle extensive documentation and long-form analysis. Without public benchmarks we can't compare it directly to GPT-4 or Claude on MMLU or HumanEval, so treat the multi-agent framing as a sign of what xAI built it for (decomposing complex problems rather than simple chat), not as evidence of reasoning quality.

Is Grok 4.20 cheaper than GPT-4o or Claude Sonnet?

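On list price per token, yes. At $1.25 input and $2.50 output per Mtok, Grok 4.20 Multi-Agent costs about half as much on input as GPT-4o ($2.50/$10) and well under half as much as Claude Sonnet 4 ($3/$15), with a larger gap on output. Per-token price is not the whole picture: without public benchmarks you can't yet compare quality per dollar. If you don't need the multi-agent capability or the 2M context window, choose on proven performance rather than price alone.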

Can Grok 4.20 actually use the full 2 million token context?

The spec says 2M tokens, which is competitive with Gemini 2.0 Pro. Whether it maintains coherence across that entire window depends on your use case: long-context models often lose accuracy on needle-in-haystack retrieval as the prompt grows toward the limit. For code repos or legal document analysis, test it on your data before assuming perfect recall. The multi-agent design might help by chunking context across agents.
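
One way to "test it on your data" is a small needle-in-a-haystack harness: plant a known fact at different depths of a long prompt and check whether the reply retrieves it. The sketch below is a generic harness under stated assumptions, not an xAI tool. `ask_model` is a hypothetical callable that takes a prompt string and returns the model's text, and the stub model here pretends to read only the first 60% of the prompt so the example runs offline.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    parts = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(parts)


def recall_by_depth(ask_model, filler_sentences, needle, question, answer,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {depth: True/False} for whether the answer was retrieved.

    `ask_model(prompt)` is a caller-supplied callable that returns the
    model's text reply; it is a hypothetical stand-in for a real client.
    """
    results = {}
    for depth in depths:
        prompt = build_haystack(filler_sentences, needle, depth)
        reply = ask_model(f"{prompt}\n\nQuestion: {question}")
        results[depth] = answer.lower() in reply.lower()
    return results


def forgetful_model(prompt):
    # Stub: only "reads" the first 60% of the prompt's characters.
    visible = prompt[: int(len(prompt) * 0.6)]
    return "7421" if "7421" in visible else "I don't know"


filler = [f"Sentence number {i} is filler." for i in range(1000)]
recall = recall_by_depth(forgetful_model, filler,
                         needle="The vault code is 7421.",
                         question="What is the vault code?", answer="7421")
print(recall)
```

Swapping the stub for a real client and your own documents as filler gives a per-depth recall map. Repeating the run at several prompt lengths shows where retrieval starts to slip.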

How does Grok 4.20 compare to previous Grok versions?

We don't have benchmark data for earlier Grok models in this dataset, so direct comparison is impossible. The "4.20" version number and multi-agent label suggest a major architecture change from earlier releases. If you're already using Grok 3 or earlier, run your own evals—xAI hasn't published enough public data to make the upgrade case clear.

Should I use Grok 4.20 for production API calls?

Only if you need the multi-agent reasoning or the 2M context window. At $2.50/Mtok output, price is unlikely to be the blocker. For standard chat or summarization, a single-shot model such as GPT-4o or Claude will likely be faster and has published benchmarks to plan against. The lack of public benchmarks means you're taking on integration risk, so plan for thorough testing before committing production traffic.