LLM · openai · Plan: Pro and up

OpenAI: GPT-5

GPT-5 is OpenAI’s most advanced model, offering major improvements in reasoning, code quality, and user experience. It is optimized for complex tasks that require step-by-step reasoning, instruction following, and accuracy...

Anyone in the Space can @-mention OpenAI: GPT-5 with the team's shared context - pooled credits, one chat, one memory.

Verdict

GPT-5 is OpenAI's flagship for the era when "flagship" started meaning "good at agent loops, not just one-shot prompts." It's the model with the cleanest tool-call discipline of any major release: pair it with a half-decent agent harness and it'll keep its head through 30+ steps without losing the plot.

What we notice: GPT-5's reasoning is genuinely an upgrade over GPT-4o, especially on multi-step problems where you need the model to plan before acting. It's also more willing to ask clarifying questions than the 4.x line was, which is good when the prompt is ambiguous and occasionally annoying when you want it to just go.

Best for: agent loops with multiple tool calls; coding work that spans more than three files; structured output with strict JSON schemas (it almost never breaks the contract); voice and image-generation pipelines that benefit from the OpenAI ecosystem; deep research tasks where you want the model to gather then synthesise.

Avoid for: ultra-cheap classification (GPT-5 mini does this for a fraction of the price); chatty back-and-forth where you'd rather not pay flagship rates per turn; tasks where Claude's voice is just nicer (GPT-5 is competent at writing, Sonnet has more taste).

Pricing frame: roughly competitive with Sonnet 4.7 on input, slightly cheaper on output. A 5-person team at 200 messages a day lands around $90/month, which is flagship coverage at workhorse cost.

Best for

  • Multi-document analysis across 400K tokens
  • Codebase-wide refactoring and review
  • Legal contract comparison at scale
  • Long-form content with deep context retention
  • Vision tasks requiring file uploads

Strengths

The 400K context window is the headline feature, enabling analysis of 50+ documents or entire repositories without chunking strategies. Multimodal support handles images and file uploads natively, reducing preprocessing overhead. Pricing undercuts o1-series models while maintaining the full OpenAI ecosystem integration—function calling, structured outputs, and vision all work as expected. The input rate of $1.25/Mtok makes it viable for high-volume document ingestion pipelines.

Trade-offs

Output pricing at $10/Mtok is eight times the $1.25 input rate, so verbose responses or code generation tasks rack up costs quickly. Without public benchmarks yet, performance relative to Claude Sonnet 4.5 or Gemini 2.0 Flash on reasoning-heavy tasks remains unproven. The model lacks the extended thinking mode of o1-preview, so complex multi-step problems may still require prompt engineering or chain-of-thought scaffolding. Early-access pricing and availability may limit team adoption until general release.

Specifications

Provider: openai
Category: llm
Context length: 400,000 tokens
Max output: 128,000 tokens
Modalities: text, image, file
License: proprietary
Released: 2025-08-07

Pricing

Input: $1.25/Mtok
Output: $10.00/Mtok
Model ID: openai/gpt-5

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Estimated monthly spend: $68.20
17.6M tokens / month · 5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
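
The calculator's figure can be checked against the listed per-token rates. The sketch below is a back-of-envelope estimate, not Switchy's actual metering logic. It assumes the monthly tokens split 70% input / 30% output, a ratio inferred here only because it reproduces the $68.20 shown; the function and parameter names are illustrative.

```python
# Per-token rates from the Pricing section (USD per million tokens).
INPUT_USD_PER_MTOK = 1.25
OUTPUT_USD_PER_MTOK = 10.00


def monthly_spend(total_tokens: int, input_share: float = 0.70) -> float:
    """Estimate monthly spend in USD for a given token volume.

    input_share is an ASSUMPTION (not documented on this page): the
    fraction of tokens billed at the input rate; the rest is output.
    """
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    cost = (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    return round(cost, 2)


# 17.6M tokens / month, as shown in the calculator above.
print(monthly_spend(17_600_000))
```

Because output is billed at eight times the input rate, the estimate is most sensitive to `input_share`: shifting the mix toward output raises the bill much faster than adding input tokens does.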

Providers

Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime
openai | 400k | $1.25/Mtok | $10.00/Mtok | - | - | -

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Codebase Architecture Review

Review this codebase for architectural patterns, identify any inconsistencies in how modules interact, and flag technical debt that spans multiple files. Focus on maintainability issues that aren't visible in single-file reviews.

Multi-Contract Comparison

Compare these contracts and identify any conflicting clauses, missing standard provisions, or terms that differ materially across documents. Highlight risks that emerge only when viewing all agreements together.

Research Paper Synthesis

Synthesize the key findings from these research papers, noting where studies agree or conflict. Identify gaps in the literature and suggest areas where the collective evidence is strongest.

Screenshot Documentation Generator

Analyze these UI screenshots and generate technical documentation describing each component, user flow, and interaction pattern. Include accessibility considerations visible in the designs.

Long-Context Summarization

Summarize this document in 500 words, preserving the author's main arguments and any nuanced positions that emerge across chapters. Flag sections where the author's stance evolves or contradicts earlier claims.

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Review this Python function for edge cases and suggest improvements. It's meant to parse ISO 8601 timestamps but users report intermittent failures with timezone offsets.

Output

The function assumes all timezone offsets follow the ±HH:MM format, but ISO 8601 also permits ±HH and the literal 'Z' suffix. When your regex hits an offset like '+05' or a UTC indicator, it silently fails the match and returns None instead of raising a clear exception. Add explicit branches for these cases. Also consider that some ISO strings include fractional seconds beyond milliseconds — your current split logic truncates at three decimal places, which loses precision for scientific logging. I'd recommend switching to Python's datetime.fromisoformat() for the heavy lifting, then layering your custom validation on top.

Notes

With a 400k token context window, GPT-5 can hold entire codebases in memory and cross-reference usage patterns across files. This example shows its strength in surfacing subtle spec violations that require domain knowledge of ISO 8601. The trade-off: at $10/Mtok output, verbose explanations add up quickly if you're batch-processing hundreds of functions.
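
The advice in that illustrative output can be sketched in code. The function under review is not shown in the example, so the helper below is a hypothetical stand-in (`parse_iso8601` is an invented name): it normalises the `Z` suffix, hour-only offsets such as `+05`, and over-long fractional seconds, then hands off to `datetime.fromisoformat()` and raises instead of returning None. Python's `datetime` stores microseconds only, so digits beyond six are truncated.

```python
import re
from datetime import datetime

# Matches a trailing hour-only offset such as "+05" or "-08".
_HOUR_ONLY_OFFSET = re.compile(r"([+-]\d{2})$")
# Matches the fractional-seconds part, e.g. ".123456789".
_FRACTION = re.compile(r"\.(\d+)")


def parse_iso8601(text: str) -> datetime:
    """Parse an ISO 8601 timestamp, raising ValueError on bad input."""
    s = text.strip()
    # The literal 'Z' suffix means UTC.
    if s[-1:] in ("Z", "z"):
        s = s[:-1] + "+00:00"
    # Expand an hour-only offset (+05) to +05:00. The "T" guard keeps
    # date-only strings such as "2024-01-05" from being misread.
    if "T" in s:
        s = _HOUR_ONLY_OFFSET.sub(r"\1:00", s)
    # Pad or truncate fractional seconds to exactly six digits.
    s = _FRACTION.sub(lambda m: "." + m.group(1)[:6].ljust(6, "0"), s)
    try:
        return datetime.fromisoformat(s)
    except ValueError as exc:
        raise ValueError(f"unsupported ISO 8601 timestamp: {text!r}") from exc


print(parse_iso8601("2024-03-01T12:30:00+05"))
print(parse_iso8601("2024-03-01T12:30:00.123456789Z"))
```

Normalising the string first keeps the sketch independent of interpreter version, since older Python releases accept a narrower range of formats in `fromisoformat()`.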

Prompt

I'm preparing a board deck on our Q3 retention metrics. Generate three headline insights from this CSV, then draft speaker notes that connect them to our pricing changes in July. [CSV with 50k rows of user activity data attached]

Output

Headline 1: Day-7 retention climbed from 34% to 41% among users who signed up after the July pricing shift, suggesting the new entry tier reduced friction. Headline 2: However, Day-30 retention for that cohort sits at 18%, below the 23% baseline — early adopters aren't sticking. Headline 3: Power users (>15 sessions/month) grew 9% quarter-over-quarter, concentrated in the legacy pricing tier. Speaker notes: Open with Headline 1 to show the pricing change worked as intended for top-of-funnel. Transition to Headline 2 as the 'but' — we're converting more trials, yet failing to demonstrate long-term value before the first renewal. Use Headline 3 to propose a hypothesis: our best customers are insulated from the new pricing, so we're not learning what features drive retention at scale. Recommend a controlled experiment to migrate a segment of power users and measure engagement delta.

Notes

Multimodal file ingestion lets GPT-5 parse raw CSVs without preprocessing, and the extended context means it can hold the entire dataset plus your historical decks for stylistic consistency. This example highlights its ability to synthesise narrative from tabular data. The trade-off: a prompt that fills the 400k window costs roughly $0.50 in input tokens per run at $1.25/Mtok, which is manageable for one-off decks but adds up quickly for automated reporting.

Prompt

Translate this product launch email into Japanese, preserving the enthusiastic tone but adjusting idioms for a Tokyo B2B audience. The English version uses baseball metaphors that won't land. [Email text attached as image screenshot]

Output

The model would produce a Japanese translation that replaces 'knocked it out of the park' with a business idiom like '大成功を収めました' (achieved great success), and swaps 'stepping up to the plate' for '新たな挑戦に取り組む' (taking on a new challenge). It would retain the exclamation marks and upbeat pacing, but dial back the casual second-person address to the more formal 'お客様' (valued customer) expected in Japanese B2B correspondence. The output would also flag that the attached product screenshot contains English UI text, suggesting either a localised screenshot or a parenthetical note that the Japanese interface is coming soon.

Notes

Image input allows GPT-5 to read screenshots directly, avoiding copy-paste errors from formatted emails. This example demonstrates cultural localisation beyond literal translation — it understands that metaphor choice and formality levels shift across business contexts. The trade-off: without public benchmarks, you're relying on OpenAI's internal evals for translation quality, and at this price point, human review remains essential for customer-facing content.

Use-case deep-dives

Multi-document legal synthesis

When 400k context justifies the premium for contract teams

A 4-person legal ops team at a Series B startup needs to cross-reference 80+ vendor contracts against new compliance requirements every quarter. GPT-5's 400k context window lets you load the entire contract portfolio plus the new policy doc in a single prompt: no chunking, no retrieval layer, no risk of missing a clause buried in appendix C. At $1.25 input / $10 output per Mtok, a prompt that fills the window costs about $0.50 in input plus $0.10 per 10k tokens of output, so a full-portfolio analysis with several follow-up passes stays in the low single dollars, which pencils out if you're doing this monthly or more. If you're only running ad-hoc queries on 2-3 contracts at a time, the context advantage disappears and you're paying flagship rates for a window you don't use. Buy GPT-5 when you need the entire corpus in working memory and the frequency justifies the per-run cost.

High-stakes customer research synthesis

Why product teams pay the GPT-5 premium for interview analysis

A 10-person product team at a B2B SaaS company runs 40 customer interviews per quarter and needs to synthesize themes, pain points, and feature requests into a single roadmap brief. GPT-5's 400k window fits all 40 transcripts (averaging 8k tokens each, 320k in total) plus the existing roadmap doc in one prompt, so the model sees every mention of a pain point across the full dataset without summarization loss. Input is the larger line item per run: 320k tokens of transcripts cost about $0.40 at $1.25/Mtok, while a 15k-token synthesis report costs $0.15 at $10/Mtok. If you're iterating on the prompt or generating multiple views (by persona, by vertical), you'll hit $2-5 per research cycle. That's viable if this synthesis drives a $500k+ product bet; it's overkill if you're just tagging themes in Dovetail. The threshold is whether the decision weight justifies paying for zero-compromise context and the team has <5 hours to turn interviews into a board-ready brief.
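
Whether a corpus like this fits can be checked before sending anything. The sketch below uses the limits from the Specifications section and assumes, without confirmation from this page, that output tokens count against the same 400k window; the 10k-token roadmap doc is a made-up figure for illustration.

```python
# Limits from the Specifications section.
CONTEXT_WINDOW = 400_000
MAX_OUTPUT = 128_000


def fits_in_context(input_tokens: int, reserved_output: int) -> bool:
    """Return True if the prompt plus reserved output fits the window.

    ASSUMPTION: output tokens share the context window with the input.
    """
    if reserved_output > MAX_OUTPUT:
        raise ValueError("reserved_output exceeds the model's max output")
    return input_tokens + reserved_output <= CONTEXT_WINDOW


transcripts = 40 * 8_000   # 40 interviews at ~8k tokens each
roadmap_doc = 10_000       # hypothetical size, not from the source
report = 15_000            # synthesis report length from the example

print(fits_in_context(transcripts + roadmap_doc, report))
```

With these numbers the prompt uses 345k of the 400k tokens, so about 55k tokens of headroom remain for instructions or extra transcripts.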

Codebase-wide refactor planning

When GPT-5's context beats retrieval for architecture rewrites

A 3-engineer team at a fintech startup needs to plan a monolith-to-microservices migration across 120k lines of Python spanning 200+ files. GPT-5's 400k context fits the entire codebase (at ~2 tokens per line, that's 240k tokens) plus architecture docs and dependency graphs, so the model can trace call chains and flag circular dependencies without a RAG layer that might miss an obscure import. The input cost is ~$0.30 per full-codebase prompt, and a 20k-token migration plan with service boundaries and data-flow diagrams adds $0.20 in output, so each pass costs about $0.50. You'll likely iterate 5-10 times, so budget $2.50-5 per refactor cycle. This makes sense for one-time architecture rewrites where the cost of a missed dependency is a week of rework; it's too expensive for daily code review or routine PR summaries where a 128k model at 1/5 the price would suffice.

Frequently asked

Is GPT-5 good for coding and software development?

Yes, GPT-5 handles complex codebases well with its 400k token context window — enough for entire repositories in a single prompt. It understands multi-file refactoring, generates production-ready code, and catches edge cases better than GPT-4. The $10/Mtok output cost adds up fast for long code generation, but the quality justifies it for professional work.

Is GPT-5 cheaper than Claude Sonnet 4.5 or Gemini 2.0?

It depends on the comparison. GPT-5 costs $1.25 input and $10 output per Mtok, while Claude Sonnet 4.5 runs $3/$15 and Gemini 2.0 Flash is $0.10/$0.40. That makes GPT-5 cheaper than Sonnet 4.5 on both input and output, but far more expensive than Gemini 2.0 Flash. For high-volume applications where cost matters most, a Flash-class model wins; use GPT-5 when you need the reasoning quality and can afford the premium.
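
The rates quoted in this answer can be compared directly for any token mix. A minimal sketch, using only the three price pairs above; the sample workload of one million input and one million output tokens is arbitrary, and the names are illustrative.

```python
# (input, output) rates in USD per million tokens, as quoted above.
RATES = {
    "GPT-5": (1.25, 10.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Gemini 2.0 Flash": (0.10, 0.40),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one workload on the given model."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def rank_by_cost(input_tokens: int, output_tokens: int) -> list[str]:
    """Model names ordered from cheapest to most expensive."""
    return sorted(
        RATES, key=lambda m: workload_cost(m, input_tokens, output_tokens)
    )


for name in rank_by_cost(1_000_000, 1_000_000):
    print(name, workload_cost(name, 1_000_000, 1_000_000))
```

Because GPT-5 sits between the other two on both rates, the ordering is the same for any mix of input and output tokens.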

Can GPT-5 handle 400k tokens in practice or does quality degrade?

The 400k context is real and usable — GPT-5 maintains coherence across the full window without the needle-in-haystack failures that plagued earlier models. You can load entire technical manuals, legal documents, or codebases and get accurate references throughout. Latency increases with context size, but quality stays consistent even at maximum capacity.

How much better is GPT-5 than GPT-4o for reasoning tasks?

GPT-5 shows major improvements in multi-step reasoning, mathematical proof, and ambiguous problem-solving. It catches logical errors GPT-4o missed and handles chain-of-thought prompts more reliably. The gap is most visible in code debugging, research synthesis, and complex planning tasks. For simple queries, the difference is marginal and doesn't justify paying flagship rates.

Should I use GPT-5 for real-time chat applications?

Only if response quality matters more than speed. GPT-5 has higher latency than GPT-4o, especially with large contexts or image inputs. For customer support or conversational AI where sub-second responses matter, stick with GPT-4o or Claude Sonnet. Use GPT-5 for analyst-style interactions where users expect thoughtful, detailed answers and can wait 3-5 seconds.

Data last verified 7 hours ago. Sources aggregated hourly to weekly. See docs/architecture/model-directory.md.