OpenAI: gpt-oss-120b (free)
gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...
Anyone in the Space can @-mention OpenAI: gpt-oss-120b (free) with the team's shared context - pooled credits, one chat, one memory.
Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.
Verdict
Best for
- Prototyping workflows before production deployment
- High-volume low-stakes content generation
- Cost-sensitive educational or research projects
- Testing prompt chains with zero spend
Strengths
The standout feature is zero cost at both input and output, removing financial friction from experimentation. The 128K context window matches mid-tier commercial models, allowing full-document processing without chunking. OpenAI's infrastructure means reliable uptime and familiar API patterns. For teams validating ideas or running internal tools where output quality can be manually reviewed, the economics are unbeatable.
Trade-offs
Lack of public benchmarks makes capability assessment difficult—assume this underperforms GPT-4o and Claude Sonnet significantly on reasoning, code generation, and nuanced instruction-following. The 'free' designation typically signals rate limits or deprioritized compute, so expect slower response times under load. Not suitable for customer-facing applications or tasks requiring high accuracy. You're trading performance for price, and that trade is steep.
Specifications
- Provider
- openai
- Category
- llm
- Context length
- 131,072 tokens
- Max output
- 131,072 tokens
- Modalities
- text
- License
- proprietary
- Released
- 2025-08-05
Pricing
- Input
- $0.00/Mtok
- Output
- $0.00/Mtok
- Model ID
openai/gpt-oss-120b:free
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
Team cost calculator
5 seats · 80 msgs/day
Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| openai | 131k | $0.00/Mtok | $0.00/Mtok | — | — | — |
Performance
Benchmarks
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Draft Internal Email
Write a friendly 3-paragraph email to the engineering team announcing our weekly sync has moved from Tuesday 2pm to Thursday 10am. Keep it casual and mention we'll cover sprint planning.Open in a Space →
Summarize Meeting Notes
Read these meeting notes and create a bulleted summary of action items, decisions made, and topics tabled for next time. Focus on what requires follow-up.Open in a Space →
Brainstorm Feature Ideas
Our app helps freelancers track invoices. Brainstorm 10 new feature ideas that would make billing less painful. Be creative—don't worry about feasibility yet.Open in a Space →
Rewrite for Clarity
Here's a paragraph from our documentation. Rewrite it to be clearer and more concise while keeping all the technical details intact: [paste paragraph]Open in a Space →
Generate Test Data
Generate 20 realistic customer support tickets for a SaaS product. Include variety: billing questions, feature requests, bug reports. Make them feel authentic.Open in a Space →
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Review this Python function for potential bugs. It's supposed to merge two sorted lists but users report incorrect output when one list is empty.
The function has an off-by-one error in the while loop condition. When `list_a` is empty, the code tries to access `list_a[i]` before checking bounds, causing an IndexError. Move the length checks before the comparison. Also, the final concatenation logic assumes both pointers are valid — you need separate conditions for appending remaining elements from each list. Here's the corrected version with explicit boundary handling for both edge cases.
This example shows the model's debugging capability within its 131K token context window — enough to handle substantial codebases. The free pricing makes it accessible for exploratory debugging sessions. However, without public benchmarks, we can't verify its accuracy against established coding evaluation suites like HumanEval.
Summarize the key regulatory changes in this 40-page compliance document, focusing on what our finance team needs to action before Q2.
Three critical changes require immediate attention: (1) New expense categorization rules for software subscriptions take effect April 1st — all SaaS contracts must be reclassified under operating expenses rather than capital expenditure. (2) Monthly reconciliation deadlines move from the 10th to the 5th business day. (3) International wire transfer documentation now requires additional beneficiary verification for amounts exceeding $50K. The document also introduces voluntary best-practice guidelines for ESG reporting, but these carry no compliance deadline for your industry segment.
The 131K context window handles long documents without chunking, making this model suitable for enterprise document analysis. Zero cost per token means teams can process high volumes of internal documents without budget concerns. The trade-off is uncertainty around accuracy on specialized domains without benchmark validation.
I'm teaching a 10-year-old about photosynthesis. Explain it using an analogy they'd understand, avoiding scientific jargon.
Think of a plant like a tiny solar-powered factory. The leaves are like solar panels that catch sunlight. Inside the leaves, there are millions of little workers (called chloroplasts, but let's call them 'green helpers'). These green helpers take three ingredients — sunlight from above, water from the roots, and carbon dioxide from the air we breathe out — and mix them together to make food (sugar) for the plant. The leftover stuff they don't need? That's oxygen, which they release back into the air for us to breathe. So plants are basically running on sunshine and making the air better for us at the same time.
This demonstrates the model's ability to adjust technical depth for different audiences — a key strength for educational content and internal knowledge-sharing. The free tier removes cost barriers for high-volume tutoring or documentation use cases. Without benchmark data on instruction-following tasks, we can't compare its pedagogical effectiveness to models with measured MMLU or HellaSwag scores.
Use-case deep-dives
When free inference makes sense for early-stage bot testing
A 4-person startup building a customer support bot needs to iterate on prompts and conversation flows without burning runway on API costs. gpt-oss-120b at $0.00/Mtok lets you run thousands of test conversations during the design phase—swap system prompts, test edge cases, refine tone—without watching a meter. The 131k context window handles multi-turn dialogues and lets you include full conversation history plus documentation chunks. The trade-off: no public benchmarks means you're flying blind on quality until you test it yourself, and 'oss' in the name suggests this may be a community or experimental release with less reliability than production models. Use this for the first 2-3 weeks of prototyping, then migrate to a benchmarked model once you've locked the interaction pattern and need predictable output quality.
Why zero-cost inference breaks down at moderation scale
A 12-person social platform needs to classify 200k user comments per day for policy violations. At $0.00/Mtok, gpt-oss-120b looks like the obvious choice—your monthly inference bill stays at zero even as volume spikes. The problem is accuracy. Content moderation demands low false-positive rates (don't ban good users) and low false-negatives (don't miss real abuse), and without MMLU, HumanEval, or safety benchmarks, you can't validate this model's precision on adversarial text. The 131k window is overkill here since each moderation call is under 500 tokens. If you're moderating under 10k items/day and can manually review flags, this works as a first-pass filter. Above that threshold, switch to a model with published safety evals—the cost difference is negligible compared to the reputation risk of a moderation failure.
When free LLM inference works for low-stakes knowledge retrieval
A 20-person engineering team wants to build a Slack bot that answers questions about internal wikis, runbooks, and API docs. gpt-oss-120b's 131k context window fits 40-50 pages of documentation in a single prompt, and at $0.00/Mtok you can serve unlimited queries without justifying budget to finance. The use-case is low-stakes: if the bot hallucinates a config setting, an engineer will catch it before pushing to prod. The missing benchmarks matter less here because you're not making high-consequence decisions based on the output—you're just surfacing relevant passages faster than Cmd+F. Deploy this for 2-3 months and track answer accuracy in Slack reactions. If accuracy stays above 80% and the team uses it daily, keep it. If you see frequent corrections or engineers stop asking, upgrade to a model with published retrieval benchmarks and tighter instruction-following.
Frequently asked
Is gpt-oss-120b good for general text tasks?
Yes, for basic use cases where cost matters more than performance. With 120B parameters and a 131k token context window, it handles standard summarization, Q&A, and content generation adequately. Being free makes it ideal for prototyping or high-volume low-stakes work. Expect weaker reasoning than GPT-4 class models, but it's solid for straightforward tasks that don't require cutting-edge capabilities.
How does the free pricing compare to GPT-4o or Claude?
It's $0 versus $2.50-$15 per million tokens for GPT-4o or Claude Sonnet, making it infinitely cheaper for budget-constrained projects. The trade-off is performance: no public benchmarks exist for gpt-oss-120b, so you're gambling on quality. Use it when volume matters more than accuracy, or when you need to burn through millions of tokens without budget approval.
Can it handle 131k token context windows reliably?
The 131k context window matches GPT-4 Turbo's capacity on paper, but real-world performance at max context is unknown without benchmarks. Smaller open-source models often degrade past 32k tokens. Test your specific use case before committing. For long-document analysis or large codebases, you might hit quality issues that paid models with proven long-context performance avoid.
How does this compare to other open-source 120B models?
Without public benchmarks, it's impossible to rank against Falcon 180B or other open weights models objectively. The 120B parameter count suggests mid-tier capability, likely below GPT-3.5 but above smaller 7-13B models. The free hosting through OpenAI's infrastructure is the real differentiator here—you skip deployment costs that plague self-hosted open-source alternatives.
Should I use this for production customer-facing features?
Only if you can tolerate unpredictable output quality and have fallback logic. The lack of benchmarks means you're flying blind on accuracy, safety, and consistency. Fine for internal tools, content drafts, or non-critical automation. For customer support, content moderation, or anything reputation-sensitive, pay for GPT-4o or Claude where quality is documented and support exists.