OpenAI: gpt-oss-120b (free)

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...

Anyone in the Space can @-mention OpenAI: gpt-oss-120b (free) with the team's shared context - pooled credits, one chat, one memory.

Verdict

OpenAI's gpt-oss-120b is listed here as a free-tier model with a 128K (131,072-token) context window, making it viable for prototyping and low-stakes automation where cost is the primary constraint. No performance data has been collected on this page yet, so treat its capability relative to GPT-4 class models as unverified; the zero-dollar price alone tells you nothing about quality. Plan for competent output on standard tasks and check anything harder yourself. Reach for this when you need to test workflows, handle high-volume low-complexity requests, or when the budget cannot accommodate paid tiers.

Best for

Prototyping workflows before production deployment
High-volume low-stakes content generation
Cost-sensitive educational or research projects
Testing prompt chains with zero spend

Strengths

The standout feature is zero cost for both input and output tokens, removing financial friction from experimentation. The 128K (131,072-token) context window matches mid-tier commercial models, allowing full-document processing without chunking. The listed provider is openai, so API patterns should be familiar, although this page has no uptime or latency data yet. For teams validating ideas or running internal tools where output quality can be manually reviewed, the economics are hard to beat.

Trade-offs

No benchmark scores are listed on this page, which makes capability assessment difficult. Until your own tests show otherwise, plan as if it underperforms GPT-4o and Claude Sonnet on reasoning, code generation, and nuanced instruction-following. The 'free' designation typically signals rate limits or deprioritized compute, so expect slower response times under load. It is not suitable for customer-facing applications or tasks requiring high accuracy unless you have evaluated it yourself. You're trading verified performance for price, and that trade can be steep.

Specifications

Provider: openai
Category: llm
Context length: 131,072 tokens
Max output: 131,072 tokens
Modalities: text
License: Apache 2.0 (open weights)
Released: 2025-08-05

Pricing

Input: $0.00/Mtok
Output: $0.00/Mtok
Model ID: openai/gpt-oss-120b:free

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Seats: 5 people
Messages / seat / day: 80
Avg turn size: 2 ktok
Output share: 30%

Estimated monthly spend

Free: no token cost

17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
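
The 17.6M tokens / month estimate above can be reproduced with simple arithmetic. The sketch below assumes 22 working days per month; that figure is not stated on the page and is inferred from the numbers (5 × 80 × 2,000 × 22 = 17,600,000). The function name and the input/output split are illustrative only.

```python
def monthly_tokens(seats: int, msgs_per_seat_per_day: int,
                   avg_turn_tokens: int, days_per_month: int = 22) -> int:
    """Estimate total tokens per month for a team.

    days_per_month=22 is an assumption (working days), not a documented value.
    """
    return seats * msgs_per_seat_per_day * avg_turn_tokens * days_per_month


# Calculator inputs from the page: 5 seats, 80 messages/seat/day, 2 ktok per turn.
total = monthly_tokens(seats=5, msgs_per_seat_per_day=80, avg_turn_tokens=2_000)

# Split by the 30% output share shown in the calculator.
output_tokens = total * 30 // 100
input_tokens = total - output_tokens

# Both prices are $0.00/Mtok for this model, so the token cost is zero.
cost_usd = (input_tokens * 0.00 + output_tokens * 0.00) / 1_000_000

print(f"{total / 1e6:.1f}M tokens / month")  # 17.6M tokens / month
print(f"input={input_tokens:,} output={output_tokens:,} cost=${cost_usd:.2f}")
```

Changing `days_per_month` to 30 gives 24.0M tokens, so the working-day assumption matters when comparing against a paid model.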

Providers

Provider	Context	Input	Output	P50 latency	Throughput	30d uptime
openai	131k	$0.00/Mtok	$0.00/Mtok	—	—	—

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Draft Internal Email

Write a friendly 3-paragraph email to the engineering team announcing our weekly sync has moved from Tuesday 2pm to Thursday 10am. Keep it casual and mention we'll cover sprint planning.

Summarize Meeting Notes

Read these meeting notes and create a bulleted summary of action items, decisions made, and topics tabled for next time. Focus on what requires follow-up.

Brainstorm Feature Ideas

Our app helps freelancers track invoices. Brainstorm 10 new feature ideas that would make billing less painful. Be creative—don't worry about feasibility yet.

Rewrite for Clarity

Here's a paragraph from our documentation. Rewrite it to be clearer and more concise while keeping all the technical details intact: [paste paragraph]

Generate Test Data

Generate 20 realistic customer support tickets for a SaaS product. Include variety: billing questions, feature requests, bug reports. Make them feel authentic.

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Review this Python function for potential bugs. It's supposed to merge two sorted lists but users report incorrect output when one list is empty.

Output

The function has an off-by-one error in the while loop condition. When `list_a` is empty, the code tries to access `list_a[i]` before checking bounds, causing an IndexError. Move the length checks before the comparison. Also, the final concatenation logic assumes both pointers are valid — you need separate conditions for appending remaining elements from each list. Here's the corrected version with explicit boundary handling for both edge cases.

Notes

This example shows the model's debugging capability within its 131K token context window — enough to handle substantial codebases. The free pricing makes it accessible for exploratory debugging sessions. However, without public benchmarks, we can't verify its accuracy against established coding evaluation suites like HumanEval.
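
The illustrative output above describes a fix but shows no code, and the original buggy function is not included on this page. As a minimal sketch of what a merge with the described properties might look like (bounds checked before any indexing, leftovers from each list appended separately), assuming plain Python lists and a hypothetical function name `merge_sorted`:

```python
def merge_sorted(list_a: list, list_b: list) -> list:
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Check both bounds BEFORE indexing, so an empty list never raises IndexError.
    while i < len(list_a) and j < len(list_b):
        if list_a[i] <= list_b[j]:
            merged.append(list_a[i])
            i += 1
        else:
            merged.append(list_b[j])
            j += 1
    # At most one of these slices is non-empty; each list is handled on its own.
    merged.extend(list_a[i:])
    merged.extend(list_b[j:])
    return merged


print(merge_sorted([], [1, 2, 3]))      # [1, 2, 3]
print(merge_sorted([1, 4], [2, 3, 5]))  # [1, 2, 3, 4, 5]
```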

Prompt

Summarize the key regulatory changes in this 40-page compliance document, focusing on what our finance team needs to action before Q2.

Output

Three critical changes require immediate attention: (1) New expense categorization rules for software subscriptions take effect April 1st — all SaaS contracts must be reclassified under operating expenses rather than capital expenditure. (2) Monthly reconciliation deadlines move from the 10th to the 5th business day. (3) International wire transfer documentation now requires additional beneficiary verification for amounts exceeding $50K. The document also introduces voluntary best-practice guidelines for ESG reporting, but these carry no compliance deadline for your industry segment.

Notes

The 131K context window handles long documents without chunking, making this model suitable for enterprise document analysis. Zero cost per token means teams can process high volumes of internal documents without budget concerns. The trade-off is uncertainty around accuracy on specialized domains without benchmark validation.

Prompt

I'm teaching a 10-year-old about photosynthesis. Explain it using an analogy they'd understand, avoiding scientific jargon.

Output

Think of a plant like a tiny solar-powered factory. The leaves are like solar panels that catch sunlight. Inside the leaves, there are millions of little workers (called chloroplasts, but let's call them 'green helpers'). These green helpers take three ingredients — sunlight from above, water from the roots, and carbon dioxide from the air we breathe out — and mix them together to make food (sugar) for the plant. The leftover stuff they don't need? That's oxygen, which they release back into the air for us to breathe. So plants are basically running on sunshine and making the air better for us at the same time.

Notes

This demonstrates the model's ability to adjust technical depth for different audiences — a key strength for educational content and internal knowledge-sharing. The free tier removes cost barriers for high-volume tutoring or documentation use cases. Without benchmark data on instruction-following tasks, we can't compare its pedagogical effectiveness to models with measured MMLU or HellaSwag scores.

Use-case deep-dives

Prototype chatbot development

When free inference makes sense for early-stage bot testing

A 4-person startup building a customer support bot needs to iterate on prompts and conversation flows without burning runway on API costs. gpt-oss-120b at $0.00/Mtok lets you run thousands of test conversations during the design phase (swap system prompts, test edge cases, refine tone) without watching a meter. The 131k context window handles multi-turn dialogues and lets you include full conversation history plus documentation chunks. The trade-off: this page lists no benchmark scores or uptime data, so you're flying blind on quality and reliability until you test it yourself. The 'oss' in the name refers to the open weights and says nothing about how dependable the free hosted endpoint is. Use this for the first 2-3 weeks of prototyping, then migrate to a benchmarked model once you've locked the interaction pattern and need predictable output quality.

High-volume content moderation

Why zero-cost inference breaks down at moderation scale

A 12-person social platform needs to classify 200k user comments per day for policy violations. At $0.00/Mtok, gpt-oss-120b looks like the obvious choice: your monthly inference bill stays at zero even as volume spikes. The problem is accuracy. Content moderation demands a low false-positive rate (don't ban good users) and a low false-negative rate (don't miss real abuse), and with no safety or classification benchmark scores listed here, you can't validate this model's precision on adversarial text. The 131k window is overkill, since each moderation call is under 500 tokens. If you're moderating under 10k items/day and can manually review flags, this works as a first-pass filter. Above that threshold, switch to a model with published safety evals; the added cost is small compared to the reputation risk of a moderation failure.

Internal documentation Q&A

When free LLM inference works for low-stakes knowledge retrieval

A 20-person engineering team wants to build a Slack bot that answers questions about internal wikis, runbooks, and API docs. gpt-oss-120b's 131k context window fits 40-50 pages of documentation in a single prompt, and at $0.00/Mtok you can serve unlimited queries without justifying budget to finance. The use-case is low-stakes: if the bot hallucinates a config setting, an engineer will catch it before pushing to prod. The missing benchmarks matter less here because you're not making high-consequence decisions based on the output—you're just surfacing relevant passages faster than Cmd+F. Deploy this for 2-3 months and track answer accuracy in Slack reactions. If accuracy stays above 80% and the team uses it daily, keep it. If you see frequent corrections or engineers stop asking, upgrade to a model with published retrieval benchmarks and tighter instruction-following.
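
The keep-or-upgrade rule above can be made concrete with a small tally over reactions. This is a sketch under stated assumptions: the reaction names (`+1`, `white_check_mark`, `-1`, `x`) and both function names are hypothetical choices, and fetching reactions from Slack is left out entirely.

```python
from collections import Counter
from typing import Optional

# Assumed mapping of emoji reactions to "answer was right" / "answer was wrong".
POSITIVE = {"+1", "white_check_mark"}
NEGATIVE = {"-1", "x"}


def answer_accuracy(reactions: list) -> Optional[float]:
    """Share of rated answers marked correct, or None if nothing was rated."""
    counts = Counter(reactions)
    good = sum(counts[name] for name in POSITIVE)
    bad = sum(counts[name] for name in NEGATIVE)
    rated = good + bad
    return good / rated if rated else None


def should_keep(accuracy: Optional[float], threshold: float = 0.80) -> bool:
    """Keep the model only while measured accuracy stays above the threshold."""
    return accuracy is not None and accuracy > threshold


week = ["+1"] * 9 + ["x"]  # nine correct answers, one correction
acc = answer_accuracy(week)
print(acc, should_keep(acc))  # 0.9 True
```

Unrated answers are ignored here, which biases the figure toward whoever bothers to react; a periodic manual sample would be a useful cross-check.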

Frequently asked

Is gpt-oss-120b good for general text tasks?

Yes, for basic use cases where cost matters more than performance. With 117B total parameters (5.1B active per forward pass) and a 131k token context window, it handles standard summarization, Q&A, and content generation adequately. Being free makes it ideal for prototyping or high-volume low-stakes work. Do not count on GPT-4 class reasoning until you have tested it, but it's solid for straightforward tasks that don't require cutting-edge capabilities.

How does the free pricing compare to GPT-4o or Claude?

It's $0 versus $2.50-$15 per million tokens for GPT-4o or Claude Sonnet, so the token cost disappears entirely for budget-constrained projects. The trade-off is performance: no benchmark scores are listed on this page for gpt-oss-120b, so you're gambling on quality until you run your own tests. Use it when volume matters more than accuracy, or when you need to burn through millions of tokens without budget approval.
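
To put the $2.50-$15 range in concrete terms, the sketch below prices the 17.6M tokens / month from the team cost calculator on this page at both ends of that range. Applying one flat rate to all tokens is a simplification made here for illustration (paid models usually price input and output differently), and `cost_usd` is a hypothetical helper.

```python
def cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Cost in dollars for a token count at a flat per-million-token price."""
    return tokens * price_per_mtok / 1_000_000


MONTHLY_TOKENS = 17_600_000  # from the calculator: 5 seats, 80 msgs/day, 2 ktok

free = cost_usd(MONTHLY_TOKENS, 0.00)
low = cost_usd(MONTHLY_TOKENS, 2.50)    # low end of the quoted range
high = cost_usd(MONTHLY_TOKENS, 15.00)  # high end of the quoted range

print(f"free tier: ${free:.2f}")               # free tier: $0.00
print(f"paid range: ${low:.2f} to ${high:.2f}")  # paid range: $44.00 to $264.00
```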

Can it handle 131k token context windows reliably?

The 131k context window matches GPT-4 Turbo's capacity on paper, but real-world performance at max context is unknown without benchmarks. Smaller open-source models often degrade past 32k tokens. Test your specific use case before committing. For long-document analysis or large codebases, you might hit quality issues that paid models with proven long-context performance avoid.
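
Before testing quality at long context, it helps to know whether your documents fit at all. The sketch below is a rough pre-check only: the 4-characters-per-token ratio is a common rule of thumb for English text and not this model's real tokenizer, and the 4,096-token output reserve is an arbitrary assumption. Use a real tokenizer when the estimate is close to the limit.

```python
CONTEXT_LIMIT = 131_072  # context length listed in Specifications
CHARS_PER_TOKEN = 4      # rough heuristic for English prose (assumption)


def estimate_tokens(text: str) -> int:
    """Estimate token count from character length, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(docs: list, reserved_for_output: int = 4_096) -> bool:
    """True if the estimated prompt plus an output reserve fits the window."""
    prompt_tokens = sum(estimate_tokens(doc) for doc in docs)
    return prompt_tokens + reserved_for_output <= CONTEXT_LIMIT


small = ["a" * 400_000]  # about 100k estimated tokens
large = ["a" * 520_000]  # about 130k estimated tokens
print(fits_in_context(small))  # True
print(fits_in_context(large))  # False
```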

How does this compare to other open-source 120B models?

No benchmark scores are listed on this page, so it can't be ranked objectively against Falcon 180B or other open-weight models here. Parameter count is a weak guide: gpt-oss-120b is a Mixture-of-Experts model with 117B total parameters but only 5.1B active per forward pass, so size-based comparisons with dense models are unreliable. The free hosted endpoint is the real differentiator: you skip the deployment costs of self-hosting open-weight alternatives.

Should I use this for production customer-facing features?

Only if you can tolerate unpredictable output quality and have fallback logic. The lack of benchmarks means you're flying blind on accuracy, safety, and consistency. Fine for internal tools, content drafts, or non-critical automation. For customer support, content moderation, or anything reputation-sensitive, pay for GPT-4o or Claude where quality is documented and support exists.
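
The "fallback logic" mentioned above can be as simple as trying the free model first and handing off when the call fails or the result does not pass a check. This is a minimal sketch: `primary`, `fallback`, and `is_acceptable` are hypothetical callables you would supply, and no real API client is shown.

```python
from typing import Callable


def generate_with_fallback(prompt: str,
                           primary: Callable[[str], str],
                           fallback: Callable[[str], str],
                           is_acceptable: Callable[[str], bool]) -> str:
    """Return the primary model's answer if it passes the check, else fall back."""
    try:
        result = primary(prompt)
        if is_acceptable(result):
            return result
    except Exception:
        # Broad catch is deliberate in this sketch: rate limits, timeouts, and
        # other provider errors are all treated the same way.
        pass
    return fallback(prompt)


def flaky_free_model(prompt: str) -> str:
    raise TimeoutError("simulated rate limit")  # stand-in for a failed call


def paid_model(prompt: str) -> str:
    return f"paid answer to: {prompt}"  # stand-in for a paid model call


answer = generate_with_fallback("hello", flaky_free_model, paid_model,
                                is_acceptable=lambda text: len(text) > 0)
print(answer)  # paid answer to: hello
```

The acceptance check is where the real work lies; a length test like the one above only catches empty replies.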