xAI: Grok 3

Grok 3 is the latest model from xAI. It's their flagship model that excels at enterprise use cases like data extraction, coding, and text summarization. Possesses deep domain knowledge in...

Anyone in the Space can @-mention xAI: Grok 3 with the team's shared context - pooled credits, one chat, one memory.

Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.

Verdict

Grok 3 is xAI's flagship, built around real-time access to X data and a less-cautious training posture than the closed-lab flagships. The right pick when you want a model that engages directly with controversial questions instead of hedging, or when current-event awareness matters.

What we notice: Grok 3's reasoning is genuinely competitive with Sonnet 4.5 on coding and analysis. The differentiator is the voice and disposition: it's more willing to give a definite opinion, less inclined to caveat into uselessness, and has access to live X content for trending-topic queries. Function calling is improving but lags the OpenAI and Anthropic lines.

Best for: queries that benefit from current X / Twitter content (trends, real-time events); analysis tasks where you want the model to commit to a position; teams that find the closed flagships' caution-first training frustrating; cost-conscious flagship workloads where xAI's pricing is competitive.

Avoid for: workflows where consistent voice across sessions matters (Grok's tone moves around more); sensitive deployments where the looser training posture is a risk rather than a feature; long-context synthesis past 128k tokens.

Pricing frame: at roughly $3/Mtok in, $15/Mtok out, Grok 3 is in the same range as Sonnet 4.5. A 5-person team at 200 daily messages lands around $90-110/month. Buy for the voice and the X data, not for cost.

Best for

  • Creative writing and brainstorming sessions
  • Conversational agents with personality
  • Document analysis under 130k tokens
  • Cost-conscious teams exploring alternatives
  • Prototyping before committing to premium models

Strengths

Grok 3's 131k context window lets you process long reports, transcripts, or codebases in a single pass without splitting. The $3/$15 pricing sits in the same range as Claude Sonnet 4.5 rather than undercutting it, so high-volume generation is viable mainly where that price tier is already in budget. Early user reports highlight strong conversational tone and creative flexibility: it handles open-ended prompts and stylistic requests more naturally than many mid-tier alternatives. xAI's training approach emphasizes real-time data and less-filtered outputs, which can yield fresher perspectives on current-event queries.

Trade-offs

Lack of public benchmarks means you can't compare Grok 3 directly to peers on MMLU, HumanEval, or other standard tests — you're relying on xAI's internal metrics and anecdotal feedback. Users report occasional inconsistency on structured tasks like JSON extraction or multi-step reasoning chains, where Claude Sonnet 4.5 and GPT-4o show tighter adherence to instructions. The model's less-filtered training can produce edgier or less-polished outputs in professional contexts. No vision or multimodal support yet, limiting use cases to text-only workflows.

Specifications

Provider: x-ai
Category: llm
Context length: 131,072 tokens
Max output: (no value listed)
Modalities: text
License: proprietary
Released: 2025-06-10

Pricing

Input: $3.00/Mtok
Output: $15.00/Mtok
Model ID: x-ai/grok-3

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Estimated monthly spend
$116.16
17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
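
The calculator's figure can be approximated with simple arithmetic. The sketch below is not Switchy's documented formula: the 22 working days per month, 2,000 tokens per message, and 70/30 input/output split are assumptions chosen because, together with the listed $3/$15 prices, they happen to reproduce the 17.6M tokens and $116.16 shown above. `UsageAssumptions` and `estimateMonthlySpend` are hypothetical names.

```typescript
// Upstream prices from the Pricing section, in USD per million tokens.
const INPUT_PRICE_PER_MTOK = 3.0;
const OUTPUT_PRICE_PER_MTOK = 15.0;

interface UsageAssumptions {
  seats: number;
  messagesPerSeatPerDay: number;
  workingDaysPerMonth: number; // assumption, not a documented Switchy value
  tokensPerMessage: number;    // assumption: input and output combined
  inputShare: number;          // assumption: fraction of tokens billed as input
}

function estimateMonthlySpend(u: UsageAssumptions): { tokens: number; usd: number } {
  const messages = u.seats * u.messagesPerSeatPerDay * u.workingDaysPerMonth;
  const tokens = messages * u.tokensPerMessage;
  const inputTokens = tokens * u.inputShare;
  const outputTokens = tokens - inputTokens;
  const usd =
    (inputTokens / 1e6) * INPUT_PRICE_PER_MTOK +
    (outputTokens / 1e6) * OUTPUT_PRICE_PER_MTOK;
  return { tokens, usd };
}

// The calculator's displayed inputs: 5 seats, 80 messages per seat per day.
const estimate = estimateMonthlySpend({
  seats: 5,
  messagesPerSeatPerDay: 80,
  workingDaysPerMonth: 22,
  tokensPerMessage: 2000,
  inputShare: 0.7,
});

console.log(`${(estimate.tokens / 1e6).toFixed(1)}M tokens, $${estimate.usd.toFixed(2)}`);
```

Because output tokens cost five times as much as input tokens, the `inputShare` assumption moves the estimate more than any other parameter, so it is worth replacing with a measured value from your own usage.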

Providers

| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
| --- | --- | --- | --- | --- | --- | --- |
| x-ai | 131k | $3.00/Mtok | $15.00/Mtok | | | |

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Brainstorm Campaign Angles

I'm launching a new productivity app for remote teams. Generate five distinct marketing campaign angles, each with a headline, core message, and target audience. Make them bold and memorable.

Summarize Long Transcript

Below is a transcript from a three-day strategy offsite. Extract the top five strategic decisions, the rationale for each, and any assigned action items with owners. [paste transcript]

Draft Conversational FAQ

Write five FAQ answers for our SaaS billing page. Tone: friendly and clear, not corporate. Cover: payment methods, refund policy, plan changes, invoice access, and cancellation process.

Explore Contrarian Takes

Our team wants to adopt a four-day workweek. Present three contrarian arguments against this move, each grounded in research or case studies. Be specific and cite reasoning.

Expand Outline to Draft

Expand this outline into a 600-word blog post. Keep the structure but add examples, transitions, and a strong conclusion. Tone: professional but approachable. [paste outline]

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Review this TypeScript function for potential race conditions and suggest a fix that preserves the current API surface:

```typescript
async function updateUserBalance(userId: string, delta: number) {
  const user = await db.users.findOne({ id: userId });
  user.balance += delta;
  await db.users.update({ id: userId }, user);
}
```

Output

This function has a classic read-modify-write race condition. If two requests call `updateUserBalance` concurrently for the same user, both read the initial balance, apply their deltas independently, then write back, and the second write clobbers the first update.

Fix using atomic operations:

```typescript
async function updateUserBalance(userId: string, delta: number) {
  await db.users.updateOne(
    { id: userId },
    { $inc: { balance: delta } }
  );
}
```

This delegates the increment to the database layer, which handles concurrency correctly. The API signature stays identical.

Notes

Grok 3's 131k token context window allows pasting entire codebases for architectural review, but this example shows precision on a focused snippet. The response identifies the concurrency flaw immediately and proposes an idiomatic MongoDB solution. Trade-off: at $15/Mtok output, verbose explanations for simple fixes add cost quickly.
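
To make the lost update in this example concrete, here is a minimal sketch that interleaves two callers by hand against an in-memory map. It is a stand-in for illustration only: `Store`, `readBalance`, `writeBalance`, and `incrementBalance` are hypothetical helpers, not part of any database client, and the single-step increment only mimics what a database-side `$inc` provides.

```typescript
// In-memory stand-in for a users table: userId -> balance.
type Store = Map<string, number>;

// Unsafe pattern: the read and the write are separate steps, so another
// caller can run between them.
function readBalance(store: Store, userId: string): number {
  return store.get(userId) ?? 0;
}

function writeBalance(store: Store, userId: string, balance: number): void {
  store.set(userId, balance);
}

// Safe pattern: the increment happens in one indivisible step.
function incrementBalance(store: Store, userId: string, delta: number): void {
  store.set(userId, (store.get(userId) ?? 0) + delta);
}

// Two read-modify-write callers interleaved: both read, then both write.
const unsafeStore: Store = new Map([["u1", 100]]);
const seenByA = readBalance(unsafeStore, "u1");
const seenByB = readBalance(unsafeStore, "u1");
writeBalance(unsafeStore, "u1", seenByA + 50);
writeBalance(unsafeStore, "u1", seenByB + 30); // overwrites caller A's update
const unsafeResult = unsafeStore.get("u1");

// The same two deltas applied as single-step increments.
const safeStore: Store = new Map([["u1", 100]]);
incrementBalance(safeStore, "u1", 50);
incrementBalance(safeStore, "u1", 30);
const safeResult = safeStore.get("u1");

console.log({ unsafeResult, safeResult });
```

Starting from 100, the interleaved version ends at 130 because the +50 is lost, while the increment version ends at 180.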

Prompt

I'm preparing a board deck on our Q4 retention metrics. Explain why cohort analysis beats aggregate retention rate for SaaS products, using a hypothetical example with two customer segments that would mislead if aggregated.

Output

Aggregate retention hides composition effects. Suppose you have two segments:

  • **Enterprise customers** (20% of base): 95% monthly retention, slow growth.
  • **Self-serve users** (80% of base): 70% retention, rapid growth.

If self-serve grows faster, your blended retention rate *drops* even though neither cohort worsened; you're just mixing in more of the lower-retention segment. A board seeing "retention fell from 75% to 73%" might panic. Cohort analysis separates this: you'd see Enterprise holding steady at 95%, self-serve stable at 70%, and the shift explained by acquisition mix. That tells you whether to fix retention or adjust growth strategy.

Notes

This example demonstrates Grok 3's ability to construct clear business reasoning with concrete numbers. The hypothetical is immediately usable in a real deck. The model structures the explanation for a non-technical executive audience. Trade-off: without benchmark data, it's unclear how Grok 3's business reasoning compares to GPT-4 or Claude on nuanced strategy questions.
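
The composition effect in this example is easy to check numerically. The sketch below computes blended retention as a share-weighted average. The starting 20/80 mix and the 95%/70% rates come from the example output; the later 12/88 mix is an assumption, picked because it is the mix that yields the 73% figure quoted there. `Segment` and `blendedRetention` are hypothetical names.

```typescript
interface Segment {
  name: string;
  share: number;     // fraction of the customer base
  retention: number; // monthly retention rate for this segment
}

// Blended retention is the share-weighted average of segment retention.
function blendedRetention(segments: Segment[]): number {
  return segments.reduce((sum, s) => sum + s.share * s.retention, 0);
}

// Mix stated in the example output.
const before: Segment[] = [
  { name: "Enterprise", share: 0.2, retention: 0.95 },
  { name: "Self-serve", share: 0.8, retention: 0.7 },
];

// Assumed later mix after self-serve outgrows enterprise; per-segment
// retention is unchanged.
const after: Segment[] = [
  { name: "Enterprise", share: 0.12, retention: 0.95 },
  { name: "Self-serve", share: 0.88, retention: 0.7 },
];

const blendedBefore = blendedRetention(before);
const blendedAfter = blendedRetention(after);

console.log(`before: ${(blendedBefore * 100).toFixed(1)}%`);
console.log(`after:  ${(blendedAfter * 100).toFixed(1)}%`);
```

Neither segment's retention changes between the two mixes, yet the blended figure moves from 75% to 73%, which is exactly the misleading signal the example warns about.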

Prompt

Generate a SQL query to find users who opened the app at least once in each of the last 6 months, but haven't opened it in the current month. Schema: events table with columns user_id (int), event_type (varchar), created_at (timestamp).

Output

```sql
WITH monthly_active AS (
  SELECT user_id, DATE_TRUNC('month', created_at) AS month
  FROM events
  WHERE event_type = 'app_open'
    AND created_at >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '6 months'
    AND created_at < DATE_TRUNC('month', CURRENT_DATE)
  GROUP BY user_id, month
),
active_all_six AS (
  SELECT user_id
  FROM monthly_active
  GROUP BY user_id
  HAVING COUNT(DISTINCT month) = 6
)
SELECT user_id
FROM active_all_six
WHERE user_id NOT IN (
  SELECT DISTINCT user_id
  FROM events
  WHERE event_type = 'app_open'
    AND created_at >= DATE_TRUNC('month', CURRENT_DATE)
);
```

Notes

Grok 3 handles multi-step SQL logic cleanly: the CTE structure makes the query readable and the date arithmetic is correct for PostgreSQL. This "at-risk power user" query is a common retention analysis pattern. Trade-off: the pricing ($3 input / $15 output per million tokens) makes Grok 3 expensive for high-volume query generation compared to smaller code-specialized models.

Use-case deep-dives

Real-time social media monitoring

When Grok 3 handles high-speed brand sentiment at scale

A 12-person marketing team tracking brand mentions across Twitter, Reddit, and news feeds needs sub-second classification of 2,000+ posts per hour. Grok 3 wins here because the $3/Mtok input rate makes high-volume streaming affordable: you're paying roughly $0.60 per 200k input tokens (on the order of 1,000 short posts once prompt overhead is counted), versus $2-5 for the same volume on competing models. The 131k context window lets you batch 400-500 tweets into a single call for pattern detection without losing thread context. If your volume drops below 500 posts/hour, the output cost ($15/Mtok) starts to hurt on summary generation, so consider a cheaper model for the digest step. For teams running 24/7 monitoring where input tokens dominate the bill, Grok 3 keeps the math simple.
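
Batching posts into a single call means packing them under a token budget. The sketch below shows one greedy way to do that. It rests on assumptions: the 4-characters-per-token estimate is a rough heuristic rather than xAI's tokenizer, the 4,000-token reserve for instructions and the reply is a hypothetical allowance, and `estimateTokens` and `packBatches` are illustrative names.

```typescript
const CONTEXT_WINDOW_TOKENS = 131072; // context length from the Specifications section
const RESERVED_TOKENS = 4000;         // hypothetical allowance for instructions and the reply

// Rough heuristic only: about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily fill each batch until the next post would exceed the budget.
// A single post larger than the budget still gets a batch of its own.
function packBatches(posts: string[], budgetTokens: number): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let used = 0;
  for (const post of posts) {
    const cost = estimateTokens(post);
    if (current.length > 0 && used + cost > budgetTokens) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(post);
    used += cost;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}

// Synthetic posts of roughly 1,000 characters each, for illustration.
const samplePosts = Array.from({ length: 1000 }, (_, i) => `post ${i}: ` + "x".repeat(1000));
const sampleBatches = packBatches(samplePosts, CONTEXT_WINDOW_TOKENS - RESERVED_TOKENS);
console.log(sampleBatches.map((b) => b.length));
```

For production use, the estimate should be replaced with a real token count, since the character heuristic can be off enough to overflow the window on non-English or emoji-heavy posts.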

Multi-document legal discovery

Why Grok 3 struggles with complex cross-reference tasks

A 4-lawyer firm needs to extract contradictions across 80 pages of depositions, contracts, and emails for a single case. Grok 3's 131k window technically fits the job, but the absence of public benchmarks on reasoning tasks (MMLU, GPQA, legal-specific evals) makes this a risky bet when you're billing clients for accuracy. Models with published 85+ MMLU scores and proven chain-of-thought performance give you defensible quality metrics; Grok 3 doesn't yet. The $15/Mtok output rate also stings when you need 5-10k tokens of structured findings per case. If you're running discovery at volume and need audit trails, wait for benchmark data or use a model with published legal task scores. Grok 3 might get there, but you can't bill clients on 'might.'

Customer support ticket triage

When Grok 3 makes sense for high-throughput classification

A 20-person SaaS support team processes 1,200 inbound tickets daily, routing them to billing, technical, or sales queues based on 200-400 word descriptions. Grok 3's $3 input pricing beats most alternatives when you're reading far more than you're writing: each ticket costs roughly $0.0024 to classify if you're generating 50-token routing tags. The 131k context window lets you include a curated sample of similar tickets from the last 30 days as few-shot examples in a single prompt, improving accuracy without fine-tuning, though every example token is billed as input on each call. Output cost ($15/Mtok) stays low because you're emitting labels, not essays. If your team grows past 3,000 tickets/day, the lack of batch API pricing (compared to providers offering 50% discounts on async jobs) starts to matter. Below that threshold, Grok 3's input economics win the triage game.
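
The per-ticket figure can be sanity-checked with a short calculation. In the sketch below, the 550 input tokens per ticket (description plus routing instructions) is an assumption chosen because it reproduces the roughly $0.0024 quoted above, and the 20,000 tokens of few-shot examples is a hypothetical figure used only to show how examples change the bill. `costPerCallUsd` is an illustrative name.

```typescript
// Listed upstream prices: $3 per million input tokens, $15 per million output tokens.
function costPerCallUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens * 3 + outputTokens * 15) / 1e6;
}

const TICKETS_PER_DAY = 1200;   // from the scenario above
const OUTPUT_TOKENS = 50;       // routing tag length from the scenario above
const TICKET_INPUT_TOKENS = 550; // assumption: description plus instructions
const FEW_SHOT_TOKENS = 20000;   // hypothetical block of example tickets

const bareCost = costPerCallUsd(TICKET_INPUT_TOKENS, OUTPUT_TOKENS);
const fewShotCost = costPerCallUsd(TICKET_INPUT_TOKENS + FEW_SHOT_TOKENS, OUTPUT_TOKENS);

console.log(`per ticket, no examples:   $${bareCost.toFixed(4)}`);
console.log(`per ticket, with examples: $${fewShotCost.toFixed(4)}`);
console.log(`per day, no examples:      $${(bareCost * TICKETS_PER_DAY).toFixed(2)}`);
console.log(`per day, with examples:    $${(fewShotCost * TICKETS_PER_DAY).toFixed(2)}`);
```

Under these assumptions the few-shot block dominates the cost, so the size of the example set is worth tuning against the accuracy it actually buys.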

Frequently asked

Is Grok 3 good for general reasoning and coding tasks?

Grok 3 handles general reasoning and coding competently, though without public benchmarks it's hard to rank against GPT-4o or Claude Sonnet. The 131k context window supports moderately long codebases. If you need proven performance metrics for production decisions, consider models with published MMLU or HumanEval scores instead.

Is Grok 3 cheaper than GPT-4o or Claude Sonnet?

No. At $3 input and $15 output per million tokens, Grok 3 is priced in the same range as Claude Sonnet 4.5 and costs about 25× more than GPT-4o Mini on output ($15 vs $0.60 per million tokens). You're paying for xAI's approach without the cost advantage of Gemini Flash or the benchmark transparency of Anthropic's lineup.

Can Grok 3 handle 128k token contexts reliably?

The 131k window matches GPT-4 Turbo's capacity, so it should handle large documents or multi-file codebases. However, effective context utilization varies by model architecture. Without needle-in-haystack or long-context QA benchmarks published, you're testing this empirically rather than buying on proven specs.

How does Grok 3 compare to Grok 2?

xAI hasn't released detailed comparison data. Typically a major version bump means architecture changes and capability gains, but without benchmarks or a public changelog, you're relying on anecdotal reports. If you're already using Grok 2, run your eval suite before migrating production workloads.

Should I use Grok 3 for customer-facing chatbots?

Only if you've tested its tone and accuracy for your domain. The $15/Mtok output cost makes high-volume chat expensive compared to GPT-4o Mini at $0.60/Mtok output. Grok's brand association with X may also influence user perception. Prototype first, then decide based on response quality and cost per conversation.

Data last verified 799 hours ago. Sources aggregated hourly to weekly. See docs/architecture/model-directory.md.