Inception: Mercury 2

Mercury 2 is an extremely fast reasoning LLM, and the first reasoning diffusion LLM (dLLM). Instead of generating tokens sequentially, Mercury 2 produces and refines multiple tokens in parallel, achieving...

Anyone in the Space can @-mention Inception: Mercury 2 with the team's shared context - pooled credits, one chat, one memory.

Verdict

Mercury 2 positions itself as a budget option for teams needing extended context at minimal cost. At $0.25/$0.75 per Mtok with 128K context, it undercuts most frontier models by 80-90% while maintaining respectable performance on standard tasks. The trade-off is clear: you sacrifice the reasoning depth and nuance of Claude or GPT-4 class models. Reach for Mercury 2 when cost matters more than cutting-edge capability — batch document processing, internal tool prototyping, or high-volume customer support where good-enough beats expensive-and-perfect.

Best for

  • High-volume document summarization on budget
  • Internal tooling and workflow automation
  • Cost-sensitive chatbot backends
  • Batch processing of structured data
  • Prototyping before scaling to premium models

Strengths

Mercury 2's core strength is economic efficiency paired with a generous 128K context window. This combination makes it viable for processing entire codebases, long PDFs, or multi-document analysis without the per-token costs that make frontier models prohibitive at scale. The pricing structure favors input-heavy workloads — reading is cheap, generation costs more but remains reasonable. For teams running thousands of requests daily on structured tasks, the cost savings compound quickly without requiring architectural changes.

Trade-offs

Without public benchmarks, Mercury 2's performance relative to peers remains opaque. Early adopters report it handles straightforward extraction and summarization well but struggles with complex reasoning, nuanced instruction-following, and creative tasks where GPT-4o or Claude Sonnet excel. The lack of multimodal support limits use cases compared to vision-capable alternatives at similar price points. Output quality on ambiguous prompts tends to be inconsistent — you'll need tighter prompt engineering than with more capable models.

Specifications

Provider: inception
Category: llm
Context length: 128,000 tokens
Max output: 50,000 tokens
Modalities: text
License: proprietary
Released: 2026-03-04

Pricing

Input: $0.25/Mtok
Output: $0.75/Mtok
Model ID: inception/mercury-2

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Estimated monthly spend: $7.04 (17.6M tokens / month, 5 seats · 80 msgs/day)

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
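The calculator's arithmetic can be sketched in a few lines. This is a rough reconstruction, not Switchy's actual formula: the page does not state the number of working days or the tokens per message, so the values below (22 days, 1,400 input and 600 output tokens per message, 80 messages per seat) are assumptions chosen only because they reproduce the displayed figures at the listed upstream prices. The function name `monthly_spend` is hypothetical.

```python
# Upstream list prices from the Pricing section, in USD per million tokens.
INPUT_USD_PER_MTOK = 0.25
OUTPUT_USD_PER_MTOK = 0.75


def monthly_spend(seats, msgs_per_seat_per_day, days_per_month,
                  input_tokens_per_msg, output_tokens_per_msg):
    """Return (total_tokens, usd) for one month of team usage."""
    messages = seats * msgs_per_seat_per_day * days_per_month
    input_tokens = messages * input_tokens_per_msg
    output_tokens = messages * output_tokens_per_msg
    usd = (input_tokens * INPUT_USD_PER_MTOK
           + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    return input_tokens + output_tokens, usd


# Assumed inputs (not stated on the page): 22 working days per month,
# 1,400 input and 600 output tokens per message.
total_tokens, usd = monthly_spend(5, 80, 22, 1_400, 600)
print(f"{total_tokens / 1e6:.1f}M tokens / month, ${usd:.2f}")
# prints "17.6M tokens / month, $7.04"
```

Other combinations of days and message sizes give the same totals, so treat these inputs as placeholders and substitute your team's own numbers.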

Providers

Provider   Context  Input       Output      P50 latency  Throughput  30d uptime
inception  128k     $0.25/Mtok  $0.75/Mtok  -            -           -

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Extract Invoice Line Items

Extract all line items from this invoice into a JSON array. Each entry should include description, quantity, unit_price, and total. Return only valid JSON with no additional commentary.
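Because the prompt demands "only valid JSON", it is worth checking the reply before it reaches downstream systems. The following is a minimal sketch using only the Python standard library; the function name `validate_line_items` and the rounding tolerance are illustrative assumptions, and the four field names come from the prompt above.

```python
import json

REQUIRED_KEYS = {"description", "quantity", "unit_price", "total"}


def validate_line_items(raw_reply, tolerance=0.01):
    """Parse the reply and return (items, problems).

    problems is a list of human-readable strings; an empty list means
    the reply passed every check.
    """
    try:
        items = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        return [], [f"not valid JSON: {exc.msg}"]
    if not isinstance(items, list):
        return [], ["top-level value is not a JSON array"]

    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: not an object")
            continue
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append(f"item {index}: missing {sorted(missing)}")
            continue
        try:
            expected = float(item["quantity"]) * float(item["unit_price"])
            if abs(expected - float(item["total"])) > tolerance:
                problems.append(
                    f"item {index}: total does not equal quantity * unit_price")
        except (TypeError, ValueError):
            problems.append(f"item {index}: non-numeric quantity, unit_price, or total")
    return items, problems
```

A non-empty `problems` list is a reasonable trigger for a retry or a human check, which fits the review-first workflows described elsewhere on this page.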

Summarize Support Tickets

Read this support ticket thread and provide a 3-sentence summary covering: the customer's core issue, any troubleshooting already attempted, and the next recommended action.

Generate FAQ Answers

Write a clear, helpful FAQ answer to this question in 2-3 paragraphs. Use simple language, avoid jargon, and include one concrete example if relevant.

Compare Document Versions

Compare these two document versions and list the 5 most significant changes. For each change, note the section affected and whether it's an addition, deletion, or modification.

Classify Support Requests

Classify this support request into exactly one category: billing, technical, feature_request, or general_inquiry. Return only the category name, nothing else.
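Even with a "return only the category name" instruction, replies can arrive with stray punctuation or extra words. A small guard like the one below, a sketch with a hypothetical `parse_category` helper, maps the reply onto the four labels from the prompt. The fallback to `general_inquiry` is an assumption; routing ambiguous replies to a human queue may suit some teams better.

```python
ALLOWED_CATEGORIES = ("billing", "technical", "feature_request", "general_inquiry")


def parse_category(raw_reply, fallback="general_inquiry"):
    """Map a model reply onto one allowed category.

    Returns (category, was_clean). was_clean is False when the reply had
    to be repaired or replaced by the fallback, so callers can log it.
    """
    cleaned = raw_reply.strip().strip("\"'.`").lower().replace(" ", "_")
    if cleaned in ALLOWED_CATEGORIES:
        return cleaned, cleaned == raw_reply
    # Accept replies such as "Category: billing" only when exactly one
    # allowed label appears; anything ambiguous uses the fallback.
    hits = [c for c in ALLOWED_CATEGORIES if c in cleaned]
    if len(hits) == 1:
        return hits[0], False
    return fallback, False
```

Tracking how often `was_clean` is False gives a cheap signal of instruction-following quality on your own traffic.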
Open in a Space →

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Review this Python function for potential race conditions. It's meant to update a shared counter across multiple threads, but we're seeing inconsistent results in production.

Output

The function lacks thread synchronization. When multiple threads read `counter.value`, increment it, then write back, you get classic read-modify-write races. Thread A reads 5, Thread B reads 5, both increment to 6, both write 6—you've lost an increment. Wrap the read-increment-write sequence in a `threading.Lock()` context manager, or switch to `threading.local()` if each thread needs its own counter. The 128K context window here means you can paste your entire module with test cases, and I'll trace the exact interleaving that causes your production bug.

Notes

Mercury 2's extended context window (128K tokens) lets users submit large codebases for holistic analysis rather than isolated snippets. The response demonstrates strong concurrency reasoning. However, at $0.75/Mtok output, verbose explanations add cost—teams doing high-volume code review may prefer terser models for routine checks.
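The lock-based fix described in the output can be illustrated as follows. The user's original function is not shown on this page, so the `Counter` class and `run` helper here are hypothetical stand-ins; only the `counter.value` attribute and the use of `threading.Lock()` come from the example text.

```python
import threading


class Counter:
    """Shared counter whose increment is a guarded read-modify-write."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads can both read the same value and
        # both write value + 1, losing one update.
        with self._lock:
            self.value += 1


def run(num_threads=8, increments_per_thread=10_000):
    counter = Counter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


print(run())  # prints 80000: every increment holds the lock
```

Removing the `with self._lock:` line reintroduces the race, although a lost update may not show up on every run.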

Prompt

Summarize the last six months of meeting notes and pull out recurring blockers for the infrastructure team. I'm attaching 47 documents totaling about 95,000 tokens.

Output

Three blockers dominate: (1) Database migration tooling—mentioned in 18 meetings, unresolved since March due to vendor API changes. (2) CI pipeline flakiness—12 mentions, root cause identified (Docker layer caching) but fix deprioritized twice. (3) On-call rotation gaps—9 mentions, staffing issue compounded by two departures in Q2. The database migration is your critical path; the CI issue has a known fix; the on-call problem needs hiring or process change. I've cross-referenced action items against subsequent meetings to flag which commitments went unaddressed.

Notes

This example highlights Mercury 2's ability to synthesize long-form documents within its 128K context window—no chunking or retrieval layer needed. The structured output and cross-referencing show strong information extraction. The trade-off: input cost is $0.25/Mtok, so a 95K-token corpus costs ~$0.024 per query, which adds up for frequent retrospectives.
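Before sending a corpus in one call, it helps to confirm that it fits and to estimate the input cost. The sketch below uses the limits and price listed in the Specifications and Pricing sections. It assumes that input and output share the 128,000-token window, which this page does not confirm, and `plan_single_call` is a hypothetical helper name.

```python
CONTEXT_TOKENS = 128_000      # from the Specifications section
MAX_OUTPUT_TOKENS = 50_000    # from the Specifications section
INPUT_USD_PER_MTOK = 0.25     # from the Pricing section


def plan_single_call(input_tokens, reserved_output_tokens):
    """Return (fits, input_cost_usd) for one un-chunked request.

    Assumption: input and output tokens share the context window.
    """
    fits = (reserved_output_tokens <= MAX_OUTPUT_TOKENS
            and input_tokens + reserved_output_tokens <= CONTEXT_TOKENS)
    input_cost_usd = input_tokens * INPUT_USD_PER_MTOK / 1_000_000
    return fits, input_cost_usd


# The 95,000-token corpus from the example, reserving 4,000 output tokens.
fits, cost = plan_single_call(95_000, 4_000)
print(fits, f"${cost:.3f}")  # prints "True $0.024"
```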

Prompt

I'm drafting a technical RFC for migrating our monolith to microservices. What are the top three arguments against this move that I should address preemptively?

Output

(1) **Operational complexity**: You're trading one deployment for dozens. Monitoring, logging, distributed tracing, and incident response all become harder. If your team lacks strong DevOps maturity, you'll spend more time firefighting than shipping features. (2) **Latency and failure modes**: Network calls replace function calls. A chain of five services means five potential timeout points. Partial failures require sophisticated retry logic and circuit breakers—monoliths fail atomically. (3) **Data consistency**: Distributed transactions are notoriously difficult. If your domain has strong consistency requirements (financial ledgers, inventory), you'll need sagas or event sourcing, both of which add cognitive load and debugging surface area. Address these with concrete mitigation plans—don't handwave them.

Notes

Mercury 2 delivers balanced, skeptical analysis rather than cheerleading the user's premise—useful for RFCs that need to withstand peer review. The response is opinionated but grounded in real trade-offs. Without public benchmarks, it's unclear how Mercury 2 compares to peers on reasoning tasks, so teams may want to trial it against known-strong models before committing to critical decision-support workflows.

Use-case deep-dives

Budget-conscious content drafting

When Mercury 2 makes sense for high-volume blog production

A 4-person content studio pushing 80 blog drafts per week needs something cheaper than GPT-4o but more capable than GPT-3.5. Mercury 2 hits that slot at $0.25/$0.75 per Mtok: with a 3k-token prompt and 1.5k tokens of output per piece, each draft costs about $0.002 at list price, so roughly 350 drafts a month come to well under $1 before retries and revisions. The 128k context window handles full brand guides and competitor research in one pass, which matters when you're templating at scale. The catch: no public benchmarks means you're flying blind on accuracy until you test it yourself. If your editorial process already includes human review and you're optimizing for cost per draft rather than first-pass quality, run a 50-piece pilot. If more than 15% need major rewrites, the labor cost will eat the savings and you should move to a benchmarked alternative.

Internal documentation summarization

Mercury 2 for turning Slack threads into wiki entries

A 12-person engineering team wants to auto-generate wiki summaries from weekly Slack decision threads, each running 8k-15k tokens with code snippets and links. Mercury 2's 128k window means you can dump an entire week's worth of threads (60k-80k tokens) and ask for a structured summary in one call, no chunking required. At $0.25/Mtok input, processing 80k tokens costs $0.02; even at 200 threads/month you're under $5. The risk is accuracy on technical details without benchmark proof, but since a human engineer reviews every summary before it goes live, the workflow absorbs errors naturally. This works if your review step is fast (under 3 minutes per summary). If you're trying to publish summaries without human checks, the lack of verified benchmark scores makes this a poor fit; switch to a model with published MMLU or HumanEval numbers.

Customer support ticket triage

When to skip Mercury 2 for real-time support routing

A 20-person SaaS company fields 300 support tickets daily and wants to auto-route them by urgency and category. Mercury 2's pricing ($0.25/$0.75 per Mtok) looks attractive: routing 300 tickets at 1k prompt + 200 token classification output costs about $0.12/day, or $3.60 over a 30-day month. The problem isn't cost, it's confidence. Without public benchmarks, you can't predict accuracy on multi-label classification under pressure, and a 10% misroute rate (30 tickets/day to the wrong team) will cost more in customer frustration than you save on the model. If you're running this in a pilot with a human safety net reviewing every route for the first two weeks, Mercury 2 is worth testing. If you need day-one production reliability, choose a model with published benchmark results, such as Claude 3.5 Haiku or GPT-4o-mini. Claude 3.5 Haiku costs more per token, while GPT-4o-mini's list price is in the same range as Mercury 2's, so check current pricing before assuming a premium.
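During a reviewed pilot, the human corrections double as evaluation data. The following sketch shows one way to turn them into a misroute rate and per-category precision, recall, and F1. The function names and the `(model_route, human_route)` pair format are assumptions for illustration, not part of any Switchy or Inception API.

```python
from collections import Counter


def per_category_f1(pairs):
    """Compute precision, recall, and F1 per category.

    pairs is an iterable of (model_route, human_route) tuples collected
    while a reviewer checks every routing decision.
    """
    true_pos, model_count, human_count = Counter(), Counter(), Counter()
    for model_route, human_route in pairs:
        model_count[model_route] += 1
        human_count[human_route] += 1
        if model_route == human_route:
            true_pos[human_route] += 1

    report = {}
    for category in set(model_count) | set(human_count):
        precision = (true_pos[category] / model_count[category]
                     if model_count[category] else 0.0)
        recall = (true_pos[category] / human_count[category]
                  if human_count[category] else 0.0)
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[category] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def misroute_rate(pairs):
    """Fraction of tickets where the model and the reviewer disagree."""
    pairs = list(pairs)
    wrong = sum(1 for model_route, human_route in pairs
                if model_route != human_route)
    return wrong / len(pairs) if pairs else 0.0
```

Two weeks of reviewed tickets at 300 per day gives a few thousand pairs, which is usually enough to see whether the misroute rate sits near the 10% level discussed above.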

Frequently asked

Is Inception Mercury 2 good for general text tasks?

Mercury 2 handles standard text generation, summarization, and Q&A reasonably well at its price point. With a 128K context window, it can process moderately long documents. However, without public benchmarks, you're buying blind—no MMLU, HumanEval, or MT-Bench scores to validate quality claims. Consider models with proven performance data if accuracy matters.

Is Mercury 2 cheaper than GPT-4o or Claude Sonnet?

Yes, significantly. At $0.25/$0.75 per Mtok, Mercury 2 costs roughly 90-93% less than GPT-4o ($2.50/$10.00) and 92-95% less than Claude 3.5 Sonnet ($3.00/$15.00), depending on the input/output mix. You're trading proven capability for budget savings. If your use case tolerates lower accuracy or you're prototyping, the price advantage is real.

Can Mercury 2 handle 128K tokens in practice?

The 128K context window matches GPT-4 Turbo's limit, so it can technically ingest a document of roughly 300 pages. Real-world performance depends on how well the model maintains coherence across that span, which is something public long-context benchmarks would reveal. Until needle-in-a-haystack results are published, treat accuracy on very long inputs as unverified and test at the lengths you actually plan to use.
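A needle-in-a-haystack check is simple enough to run yourself. The sketch below builds documents of increasing length, hides one fact at several depths, and records whether the reply contains it. Everything here is illustrative: `ask_model` is a function you supply that wraps whatever client you use, the filler text and "vault code" needle are made up, and sentence counts stand in for token counts.

```python
def build_haystack(filler_sentence, needle, total_sentences, depth):
    """Return a document with the needle placed at a relative depth (0.0-1.0)."""
    position = round(depth * total_sentences)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(position, needle)
    return " ".join(sentences)


def run_needle_test(ask_model, total_sentences_options, depths, secret="7421"):
    """Return {(total_sentences, depth): found} for each combination.

    ask_model is a caller-supplied function that takes a prompt string
    and returns the model's reply as a string.
    """
    needle = f"The vault code is {secret}."
    question = "What is the vault code? Reply with the number only."
    results = {}
    for total_sentences in total_sentences_options:
        for depth in depths:
            document = build_haystack(
                "The weather report for the region was unremarkable.",
                needle, total_sentences, depth)
            reply = ask_model(f"{document}\n\n{question}")
            results[(total_sentences, depth)] = secret in reply
    return results
```

Plotting the results by length and depth shows where retrieval starts to fail, if it does. Repetitive filler makes for an easy test, so real documents from your own workload are a better haystack once the harness works.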

How does Mercury 2 compare to other budget LLMs?

Without benchmarks, direct comparison is impossible. Models like Llama 3.1 70B or Mixtral 8x22B publish MMLU and coding scores, letting you assess capability per dollar. Mercury 2's pricing is competitive, but you're gambling on undocumented quality. Request sample outputs or run your own evals before committing production workloads.

Should I use Mercury 2 for customer-facing chatbots?

Only if you can afford unpredictable responses. Customer-facing deployments demand reliability—hallucination rates, instruction-following accuracy, safety guardrails. Mercury 2 provides none of that transparency. Use it for internal tools, content drafts, or low-stakes automation where mistakes don't damage brand trust. Test extensively before exposing to users.

Data last verified 7 hours ago. Sources aggregated hourly to weekly. See docs/architecture/model-directory.md.