OpenAI: GPT-4
OpenAI's flagship model, GPT-4 is a large-scale multimodal language model capable of solving difficult problems with greater accuracy than previous models due to its broader general knowledge and advanced reasoning...
Verdict
Best for
- Legacy workflows requiring GPT-4 compatibility
- Complex multi-step reasoning tasks
- Applications with strict safety requirements
- Tasks under 6K tokens of context
Strengths
GPT-4 established the template for modern instruction-following and excels at breaking down complex problems into logical steps. Its training emphasized safety and refusal behavior, making it predictable in production environments where content filtering matters. The model handles nuanced instructions well and maintains coherent reasoning across multi-turn conversations within its context window. Its maturity means extensive community knowledge and debugging resources.
Trade-offs
The 8K context window is the primary limitation: you'll exhaust it with a single long document or a few turns of conversation with code snippets. At $30 input and $60 output per million tokens, it costs roughly 6-12x more than GPT-4o ($2.50 input, $10 output) while delivering similar or inferior performance on most tasks. It lacks vision capabilities, and function calling feels less polished than in GPT-4 Turbo or GPT-4o. For new projects, the cost-performance ratio rarely justifies choosing this over newer OpenAI models.
Specifications
- Provider: openai
- Category: llm
- Context length: 8,191 tokens
- Max output: 4,096 tokens
- Modalities: text
- License: proprietary
- Released: 2023-05-28
Pricing
- Input: $30.00/Mtok
- Output: $60.00/Mtok
- Model ID: openai/gpt-4
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| openai | 8k | $30.00/Mtok | $60.00/Mtok | — | — | — |
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Multi-Step Problem Breakdown
Break down this problem into discrete steps, explaining your reasoning at each stage: [describe your problem]. Number each step and show dependencies between them.
Nuanced Policy Interpretation
Read this policy document and identify edge cases where the rules might conflict or require judgment calls: [paste policy text]. Explain your interpretation for each case.
Structured Data Extraction
Extract key entities from this text and return them as JSON with fields for name, role, and relationship: [paste text]. Ensure valid JSON syntax.
Code Review with Explanations
Review this code for logic errors, edge cases, and maintainability issues: [paste code]. For each issue, explain the potential impact and suggest a fix.
Argument Evaluation
Analyze the logical structure of this argument and identify any fallacies, unsupported claims, or gaps in reasoning: [paste argument]. Rate the overall strength of the conclusion.
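For the Structured Data Extraction prompt, it can help to check the reply before it reaches a database. The following is a minimal sketch, assuming the model returns either one JSON object or an array of objects; the function name `parse_entities` and the sample reply are hypothetical and not part of any Switchy or OpenAI API.

```python
import json

# Fields requested by the Structured Data Extraction prompt.
REQUIRED_FIELDS = ("name", "role", "relationship")


def parse_entities(raw_reply: str) -> list[dict]:
    """Parse a model reply and return entity records with the required fields.

    Hypothetical helper: raises ValueError on invalid JSON or missing fields.
    """
    data = json.loads(raw_reply)  # json.JSONDecodeError is a ValueError subclass
    if isinstance(data, dict):
        data = [data]  # tolerate a single object instead of an array
    if not isinstance(data, list):
        raise ValueError("expected a JSON object or an array of objects")

    entities = []
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("each entity must be a JSON object")
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            raise ValueError(f"entity is missing fields: {missing}")
        # Keep only the requested fields and drop anything extra.
        entities.append({field: item[field] for field in REQUIRED_FIELDS})
    return entities


if __name__ == "__main__":
    sample = '[{"name": "Ada", "role": "engineer", "relationship": "reports to Bo"}]'
    print(parse_entities(sample))
```

Rejecting a malformed reply outright, and retrying the prompt, is usually simpler than trying to repair partial JSON.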
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Refactor this Python function to use list comprehensions and remove the nested loops. Keep the same logic but make it more Pythonic.
The model would produce a clean refactor that replaces nested for-loops with a single list comprehension or generator expression, preserving edge cases like empty inputs. It typically adds inline comments explaining non-obvious transformations and suggests type hints if the original code lacked them. The refactored version maintains readability while reducing line count by 30-40%.
GPT-4 excels at idiomatic rewrites that balance brevity with clarity. Its 8K context window handles medium-sized functions comfortably, though it may truncate explanations when refactoring larger modules. The model occasionally over-optimizes for conciseness at the expense of beginner readability.
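To make the refactoring prompt concrete, here is a hand-written sketch of the kind of before-and-after it asks for. Both functions are hypothetical examples written for this page, not recorded model output.

```python
def even_squares_nested(rows):
    """Before: nested loops that collect the squares of even values."""
    result = []
    for row in rows:
        for value in row:
            if value % 2 == 0:
                result.append(value * value)
    return result


def even_squares_comprehension(rows: list[list[int]]) -> list[int]:
    """After: the same logic as a single list comprehension.

    The for-clauses appear in the same order as the nested loops,
    so the output order is unchanged and empty inputs yield [].
    """
    return [value * value for row in rows for value in row if value % 2 == 0]


if __name__ == "__main__":
    data = [[1, 2], [3, 4], []]
    print(even_squares_nested(data))
    print(even_squares_comprehension(data))
```

Comparing the two versions on a few inputs, including empty ones, is a cheap way to confirm a refactor preserved behaviour.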
Explain the CAP theorem to a product manager who needs to choose between PostgreSQL and Cassandra for a new feature. Focus on practical trade-offs, not theory.
The model would frame CAP theorem as a decision tree: PostgreSQL guarantees consistency and tolerates partition failures by sacrificing some availability (transactions may block). Cassandra prioritizes availability and partition tolerance, meaning reads might return stale data briefly. It would then map these to product scenarios—'use Postgres if you need strict inventory counts; use Cassandra if you need a global activity feed that can't go down.' The explanation avoids academic jargon.
GPT-4 translates technical concepts into business context effectively, a strength for cross-functional documentation. The 8K window allows it to include 2-3 concrete examples without truncation. However, it sometimes hedges excessively ('it depends') rather than making a clear recommendation when one exists.
Draft a three-paragraph email declining a vendor proposal. Tone: polite but firm. Reason: their pricing model doesn't align with our usage patterns, but we want to stay in touch.
The model would produce a structured email opening with appreciation for the proposal, a second paragraph diplomatically explaining the pricing mismatch (e.g., 'our intermittent usage makes flat-rate licensing cost-prohibitive'), and a closing paragraph expressing interest in future conversations if their model evolves. The tone balances professionalism with warmth—no corporate clichés, no over-apologizing.
GPT-4 handles nuanced communication tasks well, capturing tone constraints without sounding robotic. At $60/Mtok output, this use case is expensive relative to simpler models that could draft emails adequately. The model shines when the task requires reading subtext or navigating political sensitivity.
Use-case deep-dives
When GPT-4 justifies the premium on high-stakes writing
A 4-person consulting shop sends 8-12 proposals a month, each going through 3-4 revision cycles with client feedback. At $60 per Mtok of output, a 3,000-token proposal draft costs roughly $0.18, which is negligible against a $40k contract. The 8K context window holds the full RFP, previous proposal sections, and client comments in one prompt, so the model rewrites with all context in view. Cheaper models at this context length either hallucinate requirements or lose the thread between sections. The cost threshold: if you're drafting under 20 documents a month where each mistake costs you a deal, GPT-4's reliability pays for itself. Beyond that volume, test whether GPT-4o or Claude 3.5 Sonnet hit your quality bar at a quarter of the output cost or less.
Where GPT-4 loses to newer models on repetitive extraction
A 10-person finance team processes 200 invoices a week, pulling vendor names, line items, and totals into a database. GPT-4 can handle the task but at $30 input per Mtok, scanning 200 PDFs (average 4k tokens each) costs $24/week or $1,248/year. The 8k context cap means multi-page invoices need chunking, which introduces errors at page boundaries. GPT-4o runs the same workload at $2.50 input per Mtok—$2/week—with a 128k window that swallows entire documents. GPT-4's instruction-following was best-in-class in 2023, but for high-volume structured tasks where the schema is fixed, newer models with lower input pricing and larger windows are the correct call. Use GPT-4 here only if you're on a legacy integration that hasn't migrated to the newer endpoint.
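The invoice figures above can be reproduced with a few lines of arithmetic. This is a sketch that counts input tokens only, as the paragraph does; the function name `input_cost` is hypothetical, and the prices are the per-Mtok input prices quoted above.

```python
def input_cost(documents: int, tokens_per_document: int, price_per_mtok: float) -> float:
    """Return the input-token cost in dollars for a batch of documents."""
    total_tokens = documents * tokens_per_document
    return total_tokens * price_per_mtok / 1_000_000


if __name__ == "__main__":
    invoices_per_week = 200
    tokens_per_invoice = 4_000  # average from the scenario above

    gpt4_weekly = input_cost(invoices_per_week, tokens_per_invoice, 30.00)
    gpt4o_weekly = input_cost(invoices_per_week, tokens_per_invoice, 2.50)

    print(f"GPT-4:  ${gpt4_weekly:.2f}/week, ${gpt4_weekly * 52:.2f}/year")
    print(f"GPT-4o: ${gpt4o_weekly:.2f}/week, ${gpt4o_weekly * 52:.2f}/year")
```

Output tokens for the extracted fields would add to both totals, so treat these numbers as a lower bound.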
Why GPT-4's latency and cost don't fit live chat workflows
A 15-person SaaS startup wants an AI to read incoming Slack support threads and tag them as billing, technical, or sales before routing. GPT-4 handles the classification but averages 4-6 seconds per response and costs roughly $8 per 1,000 customer messages at typical token counts. At 500 messages a day, that's about $120/month and a noticeable lag that frustrates customers expecting instant acknowledgment. GPT-4o delivers sub-2-second responses at $15/month for the same volume, and the 128K context window lets the model see the full thread history without truncation. GPT-4 made sense when it was the only reliable classifier; now it's the wrong tool for any workflow where humans are waiting on the model to respond. Save it for the complex escalations that need the extra reasoning depth.
Frequently asked
Is GPT-4 still good for general text tasks in 2025?
Yes, but it's outclassed by newer models in most categories. GPT-4 handles reasoning, summarisation, and creative writing competently, but GPT-4 Turbo, Claude 3.5 Sonnet, and Gemini 1.5 Pro all deliver better performance at lower cost. Use GPT-4 only if you need the original model for consistency with existing workflows or specific fine-tuned behaviour.
Is GPT-4 cheaper than GPT-4 Turbo or Claude?
No. At $30 input and $60 output per Mtok, GPT-4 costs 3× the input price and 2× the output price of GPT-4 Turbo ($10/$30), and 10× the input price and 4× the output price of Claude 3.5 Sonnet ($3/$15). The 8K context window makes it even less economical for document-heavy work. Unless you're locked into the original GPT-4 API for legacy reasons, switch to a newer model.
Can GPT-4 handle long documents with its 8K context?
Not effectively. 8,191 tokens is roughly 6,000 words — enough for short reports but inadequate for research papers, legal contracts, or codebases. GPT-4 Turbo offers 128K tokens, Claude 3.5 Sonnet gives you 200K, and Gemini 1.5 Pro reaches 2M. For anything beyond basic chat, the context limit is a deal-breaker.
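If a document has to go through the 8,191-token window anyway, a rough word-based split shows how many requests it would take. This is a sketch only: it assumes the ratio stated above (8,191 tokens to about 6,000 words), which is a heuristic, and a real tokenizer will give different counts. The function names are hypothetical.

```python
CONTEXT_TOKENS = 8_191
WORDS_PER_CONTEXT = 6_000  # rough ratio from the answer above, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate token count from word count, rounding up (integer math only)."""
    words = len(text.split())
    return -(-words * CONTEXT_TOKENS // WORDS_PER_CONTEXT)


def split_for_context(text: str, budget_tokens: int = CONTEXT_TOKENS) -> list[str]:
    """Split text into word-based chunks whose estimate fits the token budget."""
    words = text.split()
    words_per_chunk = max(1, budget_tokens * WORDS_PER_CONTEXT // CONTEXT_TOKENS)
    return [
        " ".join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]


if __name__ == "__main__":
    document = "word " * 13_000  # stand-in for a long report
    chunks = split_for_context(document)
    print(estimate_tokens(document), "estimated tokens in", len(chunks), "chunks")
```

In practice the budget passed in should be smaller than the full window, since the instructions and the model's reply share the same context. Splitting on word count also ignores section boundaries, which is one reason chunked workflows lose information at the seams.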
How does GPT-4 compare to GPT-4 Turbo?
GPT-4 Turbo is faster, cheaper, and has 16× the context window. The original GPT-4 was OpenAI's flagship in 2023, but Turbo replaced it as the default for good reason. Performance is comparable on most tasks, with Turbo occasionally scoring higher on coding and maths benchmarks. Stick with Turbo unless you need exact GPT-4 behaviour for reproducibility.
Should I use GPT-4 for production chatbots?
Only if you're maintaining a legacy system. The 8K context means conversations truncate quickly, and the $60/Mtok output pricing makes high-volume chat expensive. GPT-4 Turbo, Claude 3.5 Sonnet, or even GPT-3.5 Turbo deliver better cost-per-conversation metrics. For new deployments, choose a model with larger context and lower output costs.