
xAI: Grok 4.20

Grok 4.20 is a reasoning model from xAI with industry-leading speed and agentic tool calling capabilities. It combines the lowest hallucination rate on the market with strict prompt adherence, delivering...

Anyone in the Space can @-mention xAI: Grok 4.20 with the team's shared context - pooled credits, one chat, one memory.


Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.

Verdict

Grok 4.20 offers a massive 2M token context window at competitive pricing ($1.25/$2.50 per Mtok), making it a strong choice for document-heavy workflows where you need to process entire codebases, legal contracts, or research archives in a single pass. The trade-off is limited public benchmark data — you're betting on xAI's engineering without the usual third-party validation. Reach for this when context length is your bottleneck and you're comfortable with early adoption risk.

Best for

Processing entire codebases in one context
Multi-document legal contract analysis
Long-form research synthesis across papers
Cost-effective ultra-long-context tasks
File and image analysis at scale

Strengths

The 2M token window is among the largest available, letting you load hundreds of pages or thousands of code files without chunking strategies. Pricing undercuts many competitors at this context length: at $1.25 versus $3.00 per Mtok, you pay less than half what Claude 3.5 Sonnet charges for input tokens. Multimodal support (text, image, file) means you can mix PDFs, screenshots, and plain text in the same request without preprocessing.

Trade-offs

No public benchmark scores yet, so you can't compare reasoning quality, coding accuracy, or instruction-following against Claude, GPT-4, or Gemini on standard evals. This makes it harder to predict performance on your specific tasks without hands-on testing. The model is new enough that community tooling, fine-tuning recipes, and edge-case documentation are still thin compared to more established options.

Specifications

Provider: x-ai
Category: llm
Context length: 2,000,000 tokens
Max output: —
Modalities: text, image, file
License: proprietary
Released: 2026-03-31

Pricing

Input: $1.25/Mtok
Output: $2.50/Mtok
Model ID: x-ai/grok-4.20

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Seats: 5 people · Messages / seat / day: 80 · Avg turn size: 2 ktok · Output share: 30 %

Estimated monthly spend

$28.60

17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
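The calculator figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not Switchy's actual formula: the function names are hypothetical, and the 22 working days per month is an assumption inferred from the numbers shown (5 seats × 80 messages × 2 ktok × 22 days = 17.6M tokens), not something the page states.

```python
# Upstream per-token prices from the Pricing section.
INPUT_USD_PER_MTOK = 1.25
OUTPUT_USD_PER_MTOK = 2.50

# Assumption: inferred from the calculator's output, not stated on the page.
WORKING_DAYS_PER_MONTH = 22


def monthly_tokens(seats: int, msgs_per_seat_per_day: int, avg_turn_tokens: int,
                   days: int = WORKING_DAYS_PER_MONTH) -> int:
    """Total tokens per month across all seats."""
    return seats * msgs_per_seat_per_day * avg_turn_tokens * days


def monthly_spend(total_tokens: int, output_share: float) -> float:
    """Split the total into input and output tokens and price each side."""
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    cost = (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    return round(cost, 2)


tokens = monthly_tokens(seats=5, msgs_per_seat_per_day=80, avg_turn_tokens=2_000)
print(tokens)                       # 17600000
print(monthly_spend(tokens, 0.30))  # 28.6
```

Raising the output share moves spend quickly, because output tokens cost twice as much as input tokens at these rates.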

Providers

Provider	Context	Input	Output	P50 latency	Throughput	30d uptime
x-ai	2000k	$1.25/Mtok	$2.50/Mtok	—	—	—

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Codebase Architecture Review

I've uploaded my full codebase as files. Review the architecture, identify tight coupling between modules, and suggest three specific refactorings to improve maintainability. Focus on dependencies that cross logical boundaries.


Multi-Contract Clause Comparison

I've provided 15 vendor contracts as files. Extract all indemnification clauses, compare their scope and liability caps, then flag any contracts missing force majeure language. Present findings in a table.


Research Paper Synthesis

I've uploaded 20 research papers on transformer attention mechanisms. Synthesize the key findings into a 500-word literature review, organizing by approach (sparse attention, linear attention, etc.) and citing paper titles inline.


Screenshot-to-Spec Documentation

I've uploaded 8 screenshots of our new dashboard UI. Write a technical spec describing each component's behavior, data requirements, and interaction states. Include edge cases like empty states and error handling.


Long-Form Transcript Analysis

I've provided a 3-hour earnings call transcript. Extract all forward-looking statements about product launches, list risks mentioned by executives, and summarize analyst questions about revenue guidance. Group by topic.


Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Review this Python function for potential race conditions. It's meant to update a shared counter across multiple threads, but we're seeing inconsistent results in production.

Output

The function lacks thread synchronization. When multiple threads call `increment_counter()` simultaneously, they can read the same value before any writes complete, causing lost updates. The read-modify-write sequence isn't atomic. You need one of three fixes: a `threading.Lock` around the entire operation, `threading.local()` if each thread only needs its own count, or `multiprocessing.Value` with its `lock` parameter if this spans processes. The 2M token context window here means I can see your entire codebase structure: if you paste related modules, I can trace whether this counter is also accessed elsewhere without protection.

Notes

Grok 4.20's extended context window (2M tokens) lets it reason across large codebases when debugging concurrency issues. The response correctly identifies the race condition and offers multiple remediation paths. However, without public benchmarks, it's unclear how this model's code reasoning compares to alternatives like Claude 3.7 Sonnet or GPT-4.5.
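For readers who want to see the first remediation path concretely, here is a minimal sketch of a lock-protected counter. The original function was not shown, so everything except the name `increment_counter()` is hypothetical, including the module-level variables and the `run_workers` helper used to exercise it.

```python
import threading

# Hypothetical shared state; the original function body was not shown.
_counter = 0
_counter_lock = threading.Lock()


def increment_counter() -> None:
    """Increment the shared counter; the lock makes read-modify-write atomic."""
    global _counter
    with _counter_lock:
        _counter += 1


def run_workers(num_threads: int, increments_per_thread: int) -> int:
    """Run several threads against the counter and return how much it grew."""
    with _counter_lock:
        start = _counter

    def worker() -> None:
        for _ in range(increments_per_thread):
            increment_counter()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with _counter_lock:
        return _counter - start
```

With the lock in place, `run_workers(8, 1000)` should report exactly 8000 increments, because no two threads can interleave inside the `with` block.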

Prompt

I have a blurry photo of a handwritten recipe card from my grandmother. Can you transcribe the ingredients list and suggest what dish this might be?

Output

The card shows an ingredients list in cursive: '2 cups flour, 1 tsp salt, 3 eggs, 1/2 cup milk, butter for frying.' The handwriting has some smudging in the flour measurement area, but context suggests '2 cups' rather than '3 cups.' This appears to be a basic crepe or thin pancake batter: the egg-to-flour ratio and the 'butter for frying' note are characteristic of French-style crepes rather than American pancakes. If there's a second page with instructions, I can confirm, but the ingredient proportions strongly indicate a crepe recipe.

Notes

Grok 4.20 supports image input, enabling OCR and visual reasoning tasks. The model attempts contextual interpretation (distinguishing crepes from pancakes based on ratios), which is useful for ambiguous handwriting. The lack of public vision benchmarks means we can't quantify its OCR accuracy against models like GPT-4.5 Vision or Gemini 2.0 Flash.

Prompt

Summarize the key regulatory changes in this 847-page EU AI Act PDF, focusing on obligations for general-purpose AI providers and how they differ from high-risk system requirements.

Output

General-purpose AI providers face transparency obligations under Article 53: they must publish training data summaries, energy consumption estimates, and model capability documentation. High-risk systems (Annex III) require conformity assessments, risk management systems, and post-market monitoring, which is significantly heavier compliance. The key distinction: GP-AI providers document what the model *can* do; high-risk deployers must prove their *specific application* is safe. Providers of GP-AI models with systemic risk (Article 51) face additional obligations including adversarial testing and incident reporting, triggered when training compute exceeds 10^25 FLOPs.

Notes

The 2M token context window allows Grok 4.20 to ingest entire regulatory documents and extract structured comparisons. This is a genuine differentiator for legal and compliance work where context length matters. At $1.25/Mtok input, processing an 847-page PDF (~500K tokens) costs roughly $0.63 — competitive for document analysis, though output costs ($2.50/Mtok) add up for long summaries.
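The per-request arithmetic in the note above can be sketched as follows. The function name and the 2K-token summary length are hypothetical, and treating the 2M limit as a cap on input tokens only is an assumption; the prices and the ~500K token estimate come from the text.

```python
# Upstream prices and context limit from the Specifications and Pricing sections.
INPUT_USD_PER_MTOK = 1.25
OUTPUT_USD_PER_MTOK = 2.50
CONTEXT_LIMIT_TOKENS = 2_000_000


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Upstream cost in USD for a single request."""
    # Assumption: the context limit is checked against input tokens only.
    if input_tokens > CONTEXT_LIMIT_TOKENS:
        raise ValueError("input exceeds the 2M token context window")
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


pdf_only = request_cost(500_000, 0)
with_summary = request_cost(500_000, 2_000)  # hypothetical 2K-token summary

print(f"input only: ${pdf_only:.3f}")              # input only: $0.625
print(f"with 2K summary: ${with_summary:.3f}")     # with 2K summary: $0.630
```

At this ratio the input side dominates; output cost only becomes significant when the summary grows into the tens of thousands of tokens.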

Use-case deep-dives

Multi-document contract negotiation

When 2M token context beats splitting contracts across threads

A 4-person legal ops team handling SaaS vendor negotiations needs to compare redlines across 8-12 contracts simultaneously (MSAs, DPAs, order forms) without losing cross-reference context. Grok 4.20's 2M token window fits roughly 1.5M words, enough to load every active contract plus internal playbook rules in a single prompt. At $1.25/Mtok input, a 500K token load costs $0.63, versus $1.50 at Claude 3.5 Sonnet's $3.00/Mtok input rate for the same context. The trade-off: no public benchmark data means you're trusting xAI's internal evals for accuracy on legal reasoning. If your team already validates every AI output through counsel review and the cost delta matters at 20+ negotiations per month, this is the cheapest way to avoid context-window chunking errors. Under 10 deals monthly, pay for Claude's proven accuracy instead.

Startup pitch deck analysis

Why image-plus-text input works for investor memo generation

A 3-person venture scout team reviews 40 pitch decks weekly, turning slide images and founder emails into standardized investment memos. Grok 4.20 accepts image files directly, so the workflow is: upload the 15-slide PDF as images, paste the founder's follow-up email, prompt for a memo matching the fund's template. The 2M token context means you can include the last 6 months of funded-deal memos as few-shot examples without hitting limits. At $2.50/Mtok output, a 3K token memo costs $0.0075—cheap enough to run on every inbound deck. The risk: without MMMU or DocVQA scores published, you're betting on xAI's multimodal accuracy for chart and table extraction. If your memos feed directly into partner meetings with no human check, use GPT-4o's benchmarked vision instead. If scouts review every memo before forwarding, the speed and cost here win.

Customer support ticket triage

When real-time API latency matters more than benchmark leaderboards

A 12-person B2B SaaS support team routes 200 tickets daily from Zendesk into Slack channels by urgency and product area. The triage prompt includes the ticket body, the last 5 messages from that customer, and a 20K token internal KB article. Grok 4.20's pricing at $1.25 input per Mtok makes each classification cost roughly $0.03 if the average context is 25K tokens. The model's file-upload modality lets you attach error logs or screenshots without preprocessing. The catch: no published latency benchmarks and no LMSYS Chatbot Arena ranking means you're flying blind on response time and accuracy versus GPT-4o-mini, which at $0.15/Mtok input is also roughly 8× cheaper per token. If your SLA tolerates 3-5 second API calls and you're processing 4K+ tickets monthly where cost compounds, test this for a week. Under 1K tickets monthly, the amounts at stake don't justify the eval risk.
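To see how volume changes the picture, the sketch below compares monthly input cost at the two input prices quoted above. The function name is hypothetical, and it counts input tokens only, ignoring the short classification output; the 25K token context and the ticket volumes come from the scenario.

```python
def monthly_input_cost(tickets_per_month: int, tokens_per_ticket: int,
                       usd_per_mtok: float) -> float:
    """Input-side cost in USD for a month of triage calls."""
    return tickets_per_month * tokens_per_ticket * usd_per_mtok / 1_000_000


TOKENS_PER_TICKET = 25_000  # average context from the scenario above

for tickets in (1_000, 4_000):
    grok = monthly_input_cost(tickets, TOKENS_PER_TICKET, 1.25)  # Grok 4.20
    mini = monthly_input_cost(tickets, TOKENS_PER_TICKET, 0.15)  # GPT-4o-mini
    print(f"{tickets} tickets/month: Grok 4.20 ${grok:.2f}, GPT-4o-mini ${mini:.2f}")
```

Running a week-long test with both models on the same tickets gives the accuracy and latency data needed to weigh against these figures.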

Frequently asked

Is Grok 4.20 good for long-context analysis?

Yes. With a 2M token context window, Grok 4.20 handles entire codebases, multi-hour transcripts, or 500+ page documents in a single prompt. That's roughly 1.5 million words — enough for most real-world long-context tasks without chunking or summarization tricks.

Is Grok 4.20 cheaper than GPT-4o or Claude Sonnet?

Yes, at the listed rates. At $1.25 input / $2.50 output per Mtok, Grok 4.20 costs less than GPT-4o ($2.50 / $10.00) and Claude Sonnet 3.5 ($3.00 / $15.00) on both input and output. For high-output workloads like code generation or writing, output tokens cost 4× less than GPT-4o and 6× less than Claude Sonnet 3.5.

Can Grok 4.20 process images and files?

Yes. Grok 4.20 supports text, image, and file inputs, so you can upload PDFs, spreadsheets, or screenshots alongside your prompt. This makes it viable for multimodal tasks like document extraction or chart analysis, though no public benchmarks exist to compare its vision performance against GPT-4o or Gemini.

How does Grok 4.20 compare to earlier Grok versions?

No public benchmarks are available for Grok 4.20, so direct performance comparisons to Grok 3 or earlier are impossible. The 2M context window is a clear upgrade if previous versions had smaller limits, but without MMLU, HumanEval, or reasoning scores, you're flying blind on capability improvements.

Should I use Grok 4.20 for production chatbots?

Only after internal testing. At the listed per-token rates Grok 4.20 is cheaper than GPT-4o or Claude Sonnet, but for typical chat (sub-10k context) the 2M context window adds nothing, and those models have proven performance on public benchmarks. Grok 4.20's lack of published scores makes it risky for production until you have run your own evals.