OpenAI: gpt-oss-120b
gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...
Verdict
Best for
- High-volume content generation at scale
- Cost-sensitive production chatbots
- Document summarization under 100K tokens
- Code review and refactoring tasks
- Batch processing text workflows
Strengths
At $0.04 input and $0.18 output per million tokens, this model undercuts GPT-4o ($2.50/$10.00 per million tokens) by roughly 98% while maintaining OpenAI's characteristic coherence and instruction-following. The 131K context window covers most real-world documents without chunking. For teams running thousands of API calls daily on straightforward text tasks (customer support, content drafts, code comments), the cost savings compound fast without sacrificing reliability.
Trade-offs
This is a text-only model, so vision tasks and multimodal workflows are off the table. With no benchmark results listed on this page, there is little to go on for how it stacks up against Gemini Flash or Claude Haiku on reasoning-heavy prompts. The parameter count (117B total, 5.1B active per forward pass) suggests it may lag behind 405B+ models on complex logic, nuanced creative writing, and domain-specific expertise. If your prompts demand frontier-level reasoning or you need image understanding, look elsewhere.
Specifications
- Provider: openai
- Category: llm
- Context length: 131,072 tokens
- Max output: not listed
- Modalities: text
- License: Apache 2.0 (open weights)
- Released: 2025-08-05
Pricing
- Input: $0.04/Mtok
- Output: $0.18/Mtok
- Model ID: openai/gpt-oss-120b
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
Team cost calculator
5 seats · 80 msgs/day
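As a rough illustration of how the per-token prices translate into a monthly upstream figure, the sketch below uses the calculator inputs above (5 seats, 80 messages per day). The tokens-per-message values and the 30-day month are assumptions chosen for illustration, and the result is the upstream token cost, not a Switchy credit amount.

```python
# Upstream prices from the pricing list (USD per million tokens).
INPUT_USD_PER_MTOK = 0.04
OUTPUT_USD_PER_MTOK = 0.18


def token_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Upstream cost in USD for the given input and output token counts."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


def team_monthly_cost_usd(
    seats: int,
    msgs_per_seat_per_day: int,
    input_tokens_per_msg: int,
    output_tokens_per_msg: int,
    days: int = 30,
) -> float:
    """Monthly upstream cost for a team, assuming uniform message sizes."""
    messages = seats * msgs_per_seat_per_day * days
    return token_cost_usd(
        messages * input_tokens_per_msg,
        messages * output_tokens_per_msg,
    )


# 5 seats and 80 msgs/day come from the calculator; 1,000 input tokens and
# 300 output tokens per message are assumed values, not measured ones.
monthly = team_monthly_cost_usd(5, 80, 1_000, 300)
print(f"${monthly:.2f} per month")
```

Under these assumptions the team sends 12,000 messages a month, which comes to a little over one dollar in upstream token cost.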
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| openai | 131k | $0.04/Mtok | $0.18/Mtok | — | — | — |
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Summarize Support Tickets
Read this support ticket thread and produce a three-sentence summary covering the customer's core issue, steps already taken, and recommended next action.
Refactor Code Blocks
Refactor this function to improve readability and performance. Add inline comments explaining each optimization and flag any potential edge cases.
Draft Marketing Copy
Write three variations of a 50-word product description for this item. Each version should emphasize a different benefit: cost, quality, and convenience.
Extract Structured Data
Extract the following fields from this contract: parties involved, effective date, termination clause, and payment terms. Return results as a JSON object.
Rewrite for Clarity
Rewrite this paragraph at an 8th-grade reading level. Preserve all key facts but replace jargon with everyday terms and shorten sentences.
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Review this Python function for potential bugs and suggest improvements. Focus on edge cases and performance:

```python
def merge_sorted_lists(list1, list2):
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    return result
```
The function has a critical bug: it doesn't handle remaining elements after one list is exhausted. When the while loop exits, any leftover items in list1 or list2 are silently dropped. Add these lines after the loop:

```python
result.extend(list1[i:])
result.extend(list2[j:])
```

Performance-wise, repeated `len()` calls are negligible in modern Python, but you could cache them if this runs in a tight loop. The algorithm is already O(n+m), which is optimal for this problem. Consider adding type hints and handling None inputs if this is production code.
This example shows the model's code review capability at a 120B parameter scale. The response correctly identifies the logic error and provides a clean fix. At $0.18/Mtok output, this is positioned as a cost-effective option for bulk code analysis tasks where you need solid reasoning but don't require the absolute ceiling performance of 400B+ models.
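Putting the reviewed function and the suggested fix together, a complete version might look like the sketch below. The type hints are an addition in the spirit of the review's closing suggestion, and the comparison operator is kept exactly as in the prompt.

```python
from typing import List


def merge_sorted_lists(list1: List[int], list2: List[int]) -> List[int]:
    """Merge two already-sorted lists into one sorted list in O(n + m)."""
    result: List[int] = []
    i = j = 0
    # Take the smaller head element until one list runs out.
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # At most one of these slices is non-empty; without them the tail
    # of the longer list would be silently dropped.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result


print(merge_sorted_lists([1, 3, 5], [2, 4, 6, 8]))
```

Because the comparison is a strict `<`, equal elements are taken from `list2` first. Switching to `<=` would keep `list1` elements ahead on ties, if that ordering matters for your data.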
Explain how HTTP/2 multiplexing works to a backend engineer who's familiar with HTTP/1.1 but hasn't worked with HTTP/2 yet. Keep it practical—focus on what changes in their mental model.
In HTTP/1.1, your mental model is: one request, one response, per TCP connection. Want parallel requests? Open multiple connections (browsers typically cap at 6 per domain). HTTP/2 flips this: one TCP connection carries multiple request/response pairs simultaneously. Each pair gets a stream ID. The connection is a shared highway; streams are lanes. A large image download on stream 3 doesn't block a small API call on stream 5—they interleave at the frame level. Practically: you stop worrying about domain sharding and sprite sheets. Server push becomes possible (server sends resources before you ask). But head-of-line blocking moves from HTTP to TCP—if packets drop, all streams stall. That's why HTTP/3 switches to QUIC.
Demonstrates the model's ability to translate technical concepts with appropriate context for the audience. The 131K token context window means engineers can paste entire API documentation or codebases for similar explanations. The explanation balances accuracy with accessibility, though a larger model might provide more nuanced discussion of QUIC's loss recovery mechanisms.
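The frame-level interleaving described in the sample answer can be pictured with a toy model. The sketch below is not an HTTP/2 implementation: it assumes strict round-robin scheduling and a made-up frame size, whereas real connections schedule frames according to flow control and prioritization. The function names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def interleave_frames(streams: Dict[int, bytes], frame_size: int) -> List[Tuple[int, bytes]]:
    """Split each stream's payload into frames and emit them round-robin,
    as if they shared a single connection."""
    queues: Dict[int, Deque[bytes]] = {
        stream_id: deque(
            payload[k:k + frame_size] for k in range(0, len(payload), frame_size)
        )
        for stream_id, payload in streams.items()
    }
    wire: List[Tuple[int, bytes]] = []
    while any(queues.values()):  # an empty deque is falsy
        for stream_id, queue in queues.items():
            if queue:
                wire.append((stream_id, queue.popleft()))
    return wire


def reassemble(wire: List[Tuple[int, bytes]]) -> Dict[int, bytes]:
    """Rebuild each stream's payload from the tagged frames."""
    payloads: Dict[int, bytes] = {}
    for stream_id, chunk in wire:
        payloads[stream_id] = payloads.get(stream_id, b"") + chunk
    return payloads


# Stream 3 is a large download, stream 5 a small API response.
streams = {3: b"IMAGE-DATA-" * 4, 5: b"{ok}"}
wire = interleave_frames(streams, frame_size=8)
for stream_id, chunk in wire:
    print(stream_id, chunk)
```

In this toy run the small response on stream 5 is fully delivered as the second frame on the wire, even though stream 3 still has most of its frames pending.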
Draft a concise incident postmortem summary for our engineering team. Incident: API gateway returned 503s for 14 minutes during peak traffic. Root cause: autoscaling policy had a 5-minute cooldown, and traffic spiked 3x in 90 seconds. What we did: manually scaled, then adjusted cooldown to 60s and added predictive scaling.
**Incident Summary: API Gateway Outage (14min)**

**Impact:** 503 errors across all endpoints, 2:15-2:29 PM PST. ~8,400 failed requests during peak checkout window.

**Root Cause:** Autoscaling cooldown (5min) prevented rapid scale-out when traffic jumped from 2K to 6K req/s in 90 seconds. Existing capacity exhausted within 2 minutes.

**Resolution:** Manual intervention at 2:22 PM to force scale-out. Service recovered by 2:29 PM.

**Prevention:**
- Reduced cooldown to 60s (deployed 2:45 PM)
- Enabled predictive scaling based on historical traffic patterns (deployed next day)
- Added alerting for req/s > 4K sustained over 30s

**Action Items:** Load test the new config at 10x normal traffic by Friday.
Shows the model handling structured business writing with technical detail. The output follows standard postmortem format and extracts key facts from the prompt efficiently. At this price point ($0.04 input), teams can affordably process large volumes of incident logs or support tickets. The model maintains clarity but doesn't add the deeper architectural recommendations a frontier model might surface.
Use-case deep-dives
When gpt-oss-120b handles 500+ daily support tickets under budget
A 12-person SaaS company routing 600 inbound support emails per day needs fast intent classification and initial response drafts before human review. gpt-oss-120b wins here because the $0.04/Mtok input rate means processing 600 tickets at ~800 tokens each (about 480K tokens) costs roughly $0.02/day, while the 131k context window lets you pack full conversation histories plus knowledge base snippets into single calls. The $0.18 output rate stays reasonable because triage responses average 200-300 tokens, which adds another $0.02-$0.03/day. If your tickets need multi-turn reasoning, or you're seeing accuracy gaps and have no benchmark results here to validate quality against, test against gpt-4o-mini first. Otherwise, this model's price-to-capacity ratio makes it the default for teams burning through thousands of short-to-medium inference calls daily where context matters more than cutting-edge reasoning.
Why gpt-oss-120b compresses 80-page reports without chunking overhead
A 4-person legal tech startup summarizing 80-page discovery documents into 2-page briefs needs a model that ingests entire PDFs in one pass. gpt-oss-120b's 131k token window fits most full documents (roughly 100 pages of dense text) without the chunking-and-stitching workflow that adds latency and costs extra calls. At $0.04/Mtok input, a 90k-token document costs about $0.0036 to process; the $0.18 output rate adds another $0.00036 for a 2k-token summary. The trade-off: with no benchmark results listed on this page, you have little to go on for summarization accuracy compared to models with published ROUGE or human-eval scores. Run a 20-document pilot against your ground-truth summaries before committing. If quality holds, this model's context capacity and input pricing beat chunking-based approaches for any workflow ingesting 50+ page documents daily.
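Before sending a whole document in one call, it helps to check that it plausibly fits the window. The sketch below is a rough pre-flight check only: the four-characters-per-token ratio is a common rule of thumb for English text rather than this model's tokenizer, and the overhead and reserved-output figures are assumptions. It also assumes the prompt and the generated output share the same 131,072-token window.

```python
CONTEXT_WINDOW_TOKENS = 131_072  # from the specifications list
CHARS_PER_TOKEN = 4              # rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_one_call(
    text: str,
    reserved_output_tokens: int = 2_000,
    prompt_overhead_tokens: int = 500,
) -> bool:
    """True if the document, instructions, and expected output fit the window."""
    needed = estimate_tokens(text) + prompt_overhead_tokens + reserved_output_tokens
    return needed <= CONTEXT_WINDOW_TOKENS


# A stand-in for an extracted document of about 90k estimated tokens.
document = "x" * 360_000
print(estimate_tokens(document), fits_in_one_call(document))
```

For anything near the limit, counting with the model's actual tokenizer is the safer option, since the character heuristic can be off by a wide margin on legal text, tables, or non-English content.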
gpt-oss-120b as the low-cost testbed for early-stage conversational AI
A 3-person team building a customer-facing chatbot for a regional bank needs to iterate on prompt templates, conversation flows, and edge-case handling across 200+ test dialogues before launch. gpt-oss-120b's $0.04 input and $0.18 output pricing means running 200 multi-turn conversations (averaging 5k tokens in, 1k out per session) costs roughly $0.08 total: about $0.04 for 1M input tokens plus $0.036 for 200K output tokens. That lets you burn through dozens of prompt variations without budget anxiety. The 131k context window supports long conversation histories for testing memory and coherence. The risk: deploying to production without benchmark validation means you're guessing on accuracy and safety compared to models with published evals. Use this model to prove the UX and flow, then run a head-to-head against gpt-4o or claude-3-5-sonnet on 50 real customer transcripts before go-live. For prototyping velocity at team scale, the price advantage is hard to beat.
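Per-session input totals in multi-turn testing grow faster than the visible messages suggest, because a typical chat setup resends the whole history on every turn. The sketch below shows that accumulation under simple assumptions: fixed message sizes, no history truncation, and no prompt caching. The function name and the example sizes are hypothetical.

```python
from typing import Tuple


def conversation_token_usage(
    system_tokens: int,
    user_tokens_per_turn: int,
    assistant_tokens_per_turn: int,
    turns: int,
) -> Tuple[int, int]:
    """Return (billed_input_tokens, billed_output_tokens) for one session
    when the full history is resent as input on each turn."""
    billed_input = 0
    billed_output = 0
    history = system_tokens
    for _ in range(turns):
        history += user_tokens_per_turn
        billed_input += history            # whole history goes in as input
        billed_output += assistant_tokens_per_turn
        history += assistant_tokens_per_turn
    return billed_input, billed_output


# Assumed sizes: 200-token system prompt, 100-token user messages,
# 200-token replies, 5 turns.
tokens_in, tokens_out = conversation_token_usage(200, 100, 200, 5)
print(tokens_in, tokens_out)
```

With these sizes, five turns of fairly short messages already bill 4,500 input tokens and 1,000 output tokens, in the same range as the per-session averages quoted above.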
Frequently asked
Is gpt-oss-120b good for general text generation tasks?
Yes. With a 131K context window and a text-only focus, it handles long documents, summarization, and general writing well. The 120B parameter count suggests strong reasoning and coherence. With no benchmark results listed on this page we can't compare it directly to GPT-4 or Claude, but the context size makes it viable for most production text work.
Is gpt-oss-120b cheaper than GPT-4o or Claude Sonnet?
Significantly cheaper. At $0.04 input and $0.18 output per million tokens, it undercuts GPT-4o ($2.50/$10.00) and Claude Sonnet 4 ($3.00/$15.00) by roughly 55-85x. If you're processing high volumes of text and don't need multimodal or the absolute ceiling of reasoning, the cost savings are substantial.
Can it handle the full 131K tokens in practice, or does quality degrade?
The 131K window is there, but with no needle-in-haystack or long-context results listed on this page, we can't confirm retrieval accuracy at max length. Long-context quality often degrades well before the nominal limit, so treat the upper range as unverified. Test your specific use case: if you're doing RAG or multi-document analysis, validate recall before committing to production.
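One way to run that validation is a small needle-in-a-haystack harness: place a known fact at different depths in filler text and check whether the model retrieves it. The sketch below is a minimal version under stated assumptions. `ask_model` is a hypothetical callable you would wire to your own client, and `stub_model` is a stand-in so the harness runs offline; neither is a real API.

```python
from typing import Callable, Dict, List


def build_haystack(filler: str, needle: str, total_sentences: int, depth: float) -> str:
    """Repeat the filler sentence and insert the needle at a relative depth
    (0.0 = start, 1.0 = end)."""
    sentences = [filler] * total_sentences
    sentences.insert(int(depth * total_sentences), needle)
    return " ".join(sentences)


def recall_by_depth(
    ask_model: Callable[[str], str],
    needle: str,
    question: str,
    expected_answer: str,
    depths: List[float],
    total_sentences: int,
    filler: str = "The sky was a pale grey that morning.",
) -> Dict[float, bool]:
    """For each depth, record whether the reply contains the expected answer."""
    results: Dict[float, bool] = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, total_sentences, depth)
        reply = ask_model(prompt + "\n\n" + question)
        results[depth] = expected_answer in reply
    return results


def stub_model(prompt: str) -> str:
    """Offline stand-in for a real model call; replace with your own client."""
    return "The code is 4812." if "4812" in prompt else "I could not find it."


results = recall_by_depth(
    ask_model=stub_model,
    needle="The vault access code is 4812.",
    question="What is the vault access code?",
    expected_answer="4812",
    depths=[0.0, 0.5, 1.0],
    total_sentences=1_000,
)
print(results)
```

Against a real model, raise `total_sentences` until the prompt approaches the lengths you plan to use, and repeat with several different needles so a single lucky match does not skew the picture.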
How does gpt-oss-120b compare to GPT-4 Turbo for accuracy?
Unknown from the data here: this page lists no benchmark results for the model, so a direct accuracy comparison isn't available. The 120B size is smaller than GPT-4's rumored architecture, so expect lower performance on complex reasoning, math, and code. If you need GPT-4-level accuracy, pay for GPT-4. If you need cost efficiency for simpler tasks, this is worth testing.
Should I use gpt-oss-120b for high-volume content moderation?
Yes, if latency and cost matter more than perfect accuracy. The pricing makes it feasible to process millions of messages daily. The 131K context lets you include full conversation threads for better judgment. Just run a calibration set first—without benchmarks, you need to validate precision and recall against your moderation policy.
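A calibration run of that kind boils down to comparing the model's flags with human labels. The sketch below computes precision and recall from two parallel lists of booleans. The function name and the sample labels are made up for illustration, and returning 0.0 when a denominator is zero is one possible convention among several.

```python
from typing import List, Tuple


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Precision and recall for the 'violates policy' class.

    predicted[i] is the model's flag, actual[i] the human label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Illustrative labels only: five messages, model flags vs. human review.
model_flags = [True, True, False, True, False]
human_labels = [True, False, False, True, True]
precision, recall = precision_recall(model_flags, human_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Which of the two numbers matters more depends on the policy: low precision means over-blocking legitimate messages, low recall means harmful ones slip through.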