NVIDIA: Nemotron 3 Super
NVIDIA Nemotron 3 Super is a 120B-parameter open hybrid MoE model, activating just 12B parameters for maximum compute efficiency and accuracy in complex multi-agent applications. Built on a hybrid Mamba-Transformer...
Anyone in the Space can @-mention NVIDIA: Nemotron 3 Super with the team's shared context - pooled credits, one chat, one memory.
Verdict
Best for
- Long-context document analysis on budget
- Internal knowledge base Q&A
- Batch processing large text corpora
- Cost-sensitive summarization tasks
Strengths
The 262K context window handles full-length books, legal filings, or multi-document research sessions without chunking. At $0.09/Mtok input, you can afford to put entire codebases or policy manuals into a single prompt. NVIDIA's GPU heritage suggests strong inference throughput, which matters when you're processing hundreds of long documents daily, though no throughput figures are listed for this model. The output pricing at $0.45/Mtok keeps generation costs reasonable even for verbose responses.
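Whether a given set of documents fits in one prompt can be estimated before sending anything. The sketch below is a rough check, assuming about 4 characters per token for English text; that ratio is a common heuristic, not this model's actual tokenizer, and the helper names and the output reserve are hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 262_144  # context length from the specification list
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not the model's tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], reserved_for_output: int = 4_096) -> bool:
    """Check whether all documents plus an output reserve fit in one prompt."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved_for_output <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    manual = "x" * 400_000  # stand-in for a ~400 KB policy manual
    print(fits_in_context([manual]))
```

For a real deployment, count tokens with the provider's tokenizer instead of the character heuristic, since code and non-English text can tokenize very differently.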
Trade-offs
No public benchmarks means you're flying blind on reasoning quality, instruction-following, or coding ability relative to Claude, GPT-4, or Gemini. NVIDIA hasn't published MMLU, HumanEval, or GPQA scores, so you'll need to run your own evals on your actual use cases. The model may lag behind frontier models on complex multi-step reasoning or nuanced creative tasks—test thoroughly before committing production workloads.
Specifications
- Provider: nvidia
- Category: llm
- Context length: 262,144 tokens
- Max output: not listed
- Modalities: text
- License: proprietary
- Released: 2026-03-11
Pricing
- Input: $0.09/Mtok
- Output: $0.45/Mtok
- Model ID: nvidia/nemotron-3-super-120b-a12b
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
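The per-token prices translate into a per-request upstream cost with simple arithmetic. A minimal sketch, assuming the listed rates; `estimate_cost` and the constant names are hypothetical helpers, not part of any Switchy or NVIDIA API, and the result says nothing about how Switchy credits are metered.

```python
# Listed upstream rates from the pricing section (USD per million tokens).
INPUT_USD_PER_MTOK = 0.09
OUTPUT_USD_PER_MTOK = 0.45


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the upstream cost in USD for a single request."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


if __name__ == "__main__":
    # A prompt that fills the whole 262,144-token window, with a 1,000-token reply.
    print(f"${estimate_cost(262_144, 1_000):.4f}")
```

Under these rates, even a request that fills the entire context window costs about $0.024 upstream.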
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| nvidia | 262k | $0.09/Mtok | $0.45/Mtok | — | — | — |
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Multi-Document Synthesis
I've pasted three research papers below (separated by ---). Extract the common themes, conflicting findings, and actionable recommendations. Prioritize findings that appear in at least two papers.
Codebase Q&A
Here are the contents of five Python modules from our API service. Explain how authentication flows from the login endpoint through middleware to the database layer. Cite specific function names.
Contract Clause Extraction
Below is a 150-page SaaS agreement. List every clause that mentions data retention, including section numbers and exact wording. Flag any ambiguous language.
Meeting Transcript Analysis
This is a transcript of a six-hour product planning session. Identify all decisions made, who owns each action item, and any topics that need follow-up. Group by theme.
Knowledge Base Consolidation
I've combined ten wiki pages on our deployment process. Remove duplicate instructions, reconcile conflicting steps, and output a single canonical runbook with clear prerequisites.
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Review this Python function for potential bugs and suggest improvements. Focus on edge cases and performance:

```python
def merge_sorted_lists(list1, list2):
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    return result
```
The function has a critical bug: it drops remaining elements after one list is exhausted. After the while loop, you need:

```python
result.extend(list1[i:])
result.extend(list2[j:])
```

Edge cases to handle: with that fix, empty inputs work correctly, because the loop is skipped and the `extend` calls copy whatever remains. For performance, this is O(n+m), which is optimal for merging. However, if you're merging many lists repeatedly, consider using `heapq.merge()` from the standard library: it returns a lazy iterator over any number of sorted inputs, so no elements are dropped and no intermediate list is built. Also consider whether you need a new list or can modify in-place depending on your use case.
Demonstrates code review capability with concrete bug identification and practical alternatives. The 262K token context window means this model can review entire codebases in one prompt. Without public benchmarks, we can't verify how it compares to specialized code models on complex refactoring tasks.
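For reference, the fix described in the example review can be assembled into a complete function. This is a sketch of the corrected version, checked against `heapq.merge` from the standard library; it is not a recorded model output.

```python
import heapq


def merge_sorted_lists(list1, list2):
    """Merge two already-sorted lists into a new sorted list."""
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # At most one of these slices is non-empty: the leftover tail
    # of whichever list was not exhausted by the loop.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result


if __name__ == "__main__":
    a, b = [1, 4, 9], [2, 3, 10, 11]
    # heapq.merge yields a lazy iterator, so wrap it in list() to compare.
    assert merge_sorted_lists(a, b) == list(heapq.merge(a, b))
    print(merge_sorted_lists(a, b))
```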
Explain the CAP theorem to a junior developer who understands databases but hasn't worked with distributed systems. Use a concrete example they'd recognize.
The CAP theorem says a distributed database can only guarantee two of three properties: Consistency (everyone sees the same data), Availability (every request gets a response), and Partition tolerance (system works even when network fails between nodes). Think of a social media app with servers in New York and London. If the transatlantic cable breaks (partition), you must choose: either lock writes until it's fixed (losing Availability), or let both servers accept posts independently and merge later (losing Consistency—users might see different feeds temporarily). Most modern systems choose AP (available but eventually consistent) because users tolerate slightly stale data better than error messages.
Shows technical explanation ability with appropriate abstraction level. The large context window is less relevant for single-concept explanations like this. At $0.45/Mtok output, smaller models with lower output rates may be the cheaper choice for high-volume educational content generation.
Analyze this customer support transcript and extract: 1) the core issue, 2) customer sentiment, 3) whether the agent resolved it, 4) suggested follow-up action. [Transcript: 380-word conversation about a delayed shipment, customer initially frustrated, agent offers refund, customer accepts but mentions this is second delay]
**Core issue:** Order #47293 delayed 8 days beyond promised delivery; customer needed item for an event that has now passed. **Sentiment:** Started angry (3/10), improved to neutral-positive (6/10) after refund offer. Underlying concern about reliability remains. **Resolution status:** Immediate issue resolved with full refund. However, systemic issue unresolved—customer mentioned this is their second delayed order in three months. **Suggested follow-up:** 1) Process refund within 24hrs as promised, 2) Flag account for shipping audit, 3) Proactive outreach in 2 weeks with discount code and assurance that shipping process has been reviewed. Consider account-level shipping upgrade for next order to rebuild trust.
Highlights structured data extraction from conversational text—useful for support ticket analysis at scale. The 262K context window handles hundreds of transcripts in a single analysis pass. Output pricing makes this expensive for real-time per-ticket processing compared to smaller models fine-tuned for classification tasks.
Use-case deep-dives
When 262K context justifies the premium for contract teams
A 4-person legal ops team needs to compare 40-page vendor contracts against master service agreements and flag deviations. Nemotron 3 Super's 262,144-token window fits multiple full contracts in a single prompt, with no chunking and no retrieval overhead. At $0.09/$0.45 per Mtok, you're paying roughly $0.01 per 100K-token analysis run: about $0.009 for the input plus a fraction of a cent for a few thousand output tokens. If your team processes 20+ contracts per week and accuracy on clause-level differences matters more than speed, this model's context capacity beats splitting documents across cheaper alternatives. Below 10 contracts per week, the cost delta isn't worth it; use a 128K model and accept the chunking workflow.
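The per-run figure can be multiplied out for a weekly budget. A minimal sketch using the listed rates; the 100,000-token input comes from the scenario, while the 2,000-token output per run and the function names are assumptions for illustration.

```python
INPUT_USD_PER_MTOK = 0.09   # listed input rate
OUTPUT_USD_PER_MTOK = 0.45  # listed output rate


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Upstream cost in USD of one analysis run."""
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


def weekly_cost(
    contracts_per_week: int,
    input_tokens: int = 100_000,
    output_tokens: int = 2_000,  # assumption: length of the deviation report
) -> float:
    """Upstream cost in USD of one run per contract for a week."""
    return contracts_per_week * run_cost(input_tokens, output_tokens)


if __name__ == "__main__":
    for volume in (10, 20, 40):
        print(f"{volume} contracts/week: ${weekly_cost(volume):.2f}")
```

At these rates the weekly upstream spend stays well under a dollar for the volumes discussed, so the choice between a single long prompt and a chunking workflow turns mostly on accuracy and engineering effort.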
Where Nemotron 3 Super fits in tiered support routing
A 12-person SaaS support team routes Tier 2 escalations through AI before human handoff. Nemotron 3 Super handles the middle layer: complex troubleshooting threads with 8-12 back-and-forth exchanges, pulling context from knowledge base snippets and prior ticket history. The 262K window means you can load the entire conversation plus 50+ KB of documentation without summarization. At the listed rates, a 20,000-token prompt with a 1,000-token reply costs about $0.002, and even a prompt that fills the entire window costs under $0.03. If your Tier 2 volume is under 200 tickets per day, this works. Above that threshold, consider using a faster, cheaper model for initial triage and reserving Nemotron for the 15% of cases that need deep context.
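The tiered routing described here can be sketched as a small decision function. Everything except the Nemotron model ID is hypothetical: the `Ticket` fields, the triage model name, and the token threshold are placeholders, and the 8-exchange cutoff is borrowed from the scenario rather than from any measured data.

```python
from dataclasses import dataclass

DEEP_CONTEXT_MODEL = "nvidia/nemotron-3-super-120b-a12b"  # model ID from this page
TRIAGE_MODEL = "hypothetical/small-triage-model"  # placeholder, not a real model ID


@dataclass
class Ticket:
    exchanges: int       # back-and-forth messages so far
    context_tokens: int  # estimated tokens of thread plus attached docs


def choose_model(
    ticket: Ticket,
    min_exchanges: int = 8,             # from the 8-12 exchange scenario
    min_context_tokens: int = 8_000,    # assumption: tune on your own traffic
) -> str:
    """Send long or context-heavy threads to the deep-context model."""
    needs_deep_context = (
        ticket.exchanges >= min_exchanges
        or ticket.context_tokens >= min_context_tokens
    )
    return DEEP_CONTEXT_MODEL if needs_deep_context else TRIAGE_MODEL


if __name__ == "__main__":
    print(choose_model(Ticket(exchanges=2, context_tokens=900)))
    print(choose_model(Ticket(exchanges=10, context_tokens=15_000)))
```

Keeping the thresholds as parameters makes it easy to adjust the share of tickets that reach the larger model once real escalation data is available.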
When to skip Nemotron 3 Super for high-frequency moderation
A 3-person community team moderates 800 forum posts daily, flagging policy violations and toxic language. Nemotron 3 Super is a poor fit here, though not because of output cost: at $0.45/Mtok, 800 verdicts of 50-100 tokens each total 40,000-80,000 output tokens, or roughly $0.02-0.04 per day. The mismatch is in what the task needs: binary classification doesn't use a 262K context, and this model's strength is long-context reasoning, not high-throughput categorization. For moderation at this scale, use a sub-$0.10 output model or a fine-tuned classifier. Reserve Nemotron for the 5% of edge cases where you need to analyze entire comment threads (2K+ tokens) against nuanced community guidelines.
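The daily output cost for this scenario can be checked by multiplying it out. A minimal sketch from the stated volumes and the listed output rate; `daily_output_cost` is a hypothetical helper name, and input tokens are deliberately left out.

```python
OUTPUT_USD_PER_MTOK = 0.45  # listed output rate


def daily_output_cost(
    posts_per_day: int,
    tokens_per_verdict: int,
    usd_per_mtok: float = OUTPUT_USD_PER_MTOK,
) -> float:
    """Output-token cost in USD for one day of moderation verdicts."""
    return posts_per_day * tokens_per_verdict * usd_per_mtok / 1_000_000


if __name__ == "__main__":
    low = daily_output_cost(800, 50)    # 40,000 output tokens
    high = daily_output_cost(800, 100)  # 80,000 output tokens
    print(f"${low:.3f} to ${high:.3f} per day")
```

The same function works for comparing candidate models: pass a different `usd_per_mtok` to see how the daily figure changes.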
Frequently asked
Is Nemotron 3 Super good for general text generation?
Yes, it handles standard text tasks well with a massive 262k token context window. Without public benchmarks, we can't compare it directly to GPT-4 or Claude, but the pricing suggests it's positioned as a mid-tier option. The large context makes it suitable for document analysis and long-form content work where you need to process entire codebases or books in one pass.
Is Nemotron 3 Super cheaper than GPT-4o?
Yes, significantly. At $0.09 input and $0.45 output per million tokens, it is likely more than 10x cheaper than GPT-4o at that model's published list prices, though the exact ratio depends on the task mix and on current rates. However, GPT-4o has proven performance across benchmarks while Nemotron 3 Super lacks public evaluation data. If cost is your primary constraint and you can validate quality yourself, the price advantage is real.
Can Nemotron 3 Super handle 250k token documents in practice?
The 262k context window suggests yes, but real-world performance depends on how well the model maintains coherence across that span. Without benchmark data on long-context tasks like RULER or Needle-in-Haystack, you'll need to test your specific use case. Many models lose retrieval accuracy well before their advertised limit, so don't treat the technical maximum as the usable one.
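A simple way to test this yourself is a needle-in-a-haystack probe: plant one fact at different depths of a long filler text and ask the model to retrieve it. The sketch below only builds the prompts; the function name, default depths, and prompt layout are assumptions, and sending the prompts to the model is left to whatever client you use.

```python
def build_needle_prompts(
    filler: list[str],
    needle: str,
    question: str,
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> list[tuple[float, str]]:
    """Return (depth, prompt) pairs with the needle inserted at each depth.

    Depth 0.0 places the needle before all filler, 1.0 after all of it.
    """
    prompts = []
    for depth in depths:
        if not 0.0 <= depth <= 1.0:
            raise ValueError(f"depth must be in [0, 1], got {depth}")
        position = round(depth * len(filler))
        sentences = filler[:position] + [needle] + filler[position:]
        prompts.append((depth, " ".join(sentences) + "\n\nQuestion: " + question))
    return prompts


if __name__ == "__main__":
    filler = [f"Filler sentence number {n}." for n in range(1_000)]
    needle = "The archive passphrase is 'violet-harbor'."
    for depth, prompt in build_needle_prompts(filler, needle, "What is the archive passphrase?"):
        print(depth, len(prompt))
```

Scale the filler until the prompt approaches the lengths you plan to use in production, then compare retrieval accuracy across depths and lengths.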
How does Nemotron 3 Super compare to Llama 3.1 405B?
We can't make a quality comparison without benchmarks for Nemotron 3 Super. Llama 3.1 405B has proven performance on MMLU, HumanEval, and other standard tests. Nemotron's advantage is the larger context window (262k vs 128k) and potentially lower hosting costs through NVIDIA's infrastructure. If you need validated performance, stick with Llama until NVIDIA publishes evaluation results.
Should I use Nemotron 3 Super for production chatbots?
Only if you can thoroughly test it first. The lack of public benchmarks means you're flying blind on quality metrics like instruction-following, safety, and factual accuracy. The pricing is attractive and the context window is generous, but production deployments need proven reliability. Run your own evals on representative tasks before committing to it over established alternatives like GPT-4o-mini or Claude Haiku.
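Running your own evals does not require much tooling to start. The sketch below is a minimal pass-rate harness under stated assumptions: the model is any callable from prompt to text, a case passes when the expected substring appears in the reply, and `stub_model` is a stand-in for a real API call, not an actual client.

```python
from typing import Callable


def pass_rate(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases whose reply contains the expected substring."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(
        expected.lower() in model(prompt).lower() for prompt, expected in cases
    )
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Stand-in for a real API call; returns a canned answer."""
    if "refund" in prompt.lower():
        return "The refund window is 30 days."
    return "I don't know."


if __name__ == "__main__":
    cases = [
        ("What is the refund window?", "30 days"),
        ("Who signs the master agreement?", "legal"),
    ]
    print(pass_rate(stub_model, cases))
```

Substring matching is a crude scorer; for instruction-following or safety checks you would likely replace it with task-specific graders, but the same loop lets you compare candidate models on identical cases.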