
Qwen: Qwen Plus 0728 (thinking)

Qwen Plus 0728, based on the Qwen3 foundation model, is a hybrid reasoning model with a 1-million-token context window that balances performance, speed, and cost.

Anyone in the Space can @-mention Qwen: Qwen Plus 0728 (thinking) with the team's shared context - pooled credits, one chat, one memory.


Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.

Verdict

Qwen Plus 0728 with thinking mode exposes its reasoning process before delivering answers, similar to OpenAI's o1 models. The million-token context window handles book-length documents and sprawling codebases. At $0.26/$0.78 per Mtok, it undercuts GPT-4o ($2.50/$10.00) and Claude Sonnet 4 ($3.00/$15.00) list prices by roughly 90%, though no public benchmarks yet confirm how its reasoning quality compares. Reach for this when you need transparent step-by-step logic on complex problems without burning budget on Claude or GPT-4.

Best for

  • Multi-step reasoning with visible chain-of-thought
  • Long-context analysis under $1 per million tokens
  • Cost-sensitive research and document synthesis
  • Complex problem decomposition with transparency
  • Budget-conscious teams needing frontier-class reasoning

Strengths

The thinking mode surfaces intermediate reasoning steps, making it easier to debug logic errors and verify conclusions. The million-token context window swallows entire codebases or multi-chapter documents in one pass. Pricing sits roughly 90% below GPT-4o and Claude Sonnet 4 list prices, making it viable for high-volume reasoning tasks where Western models blow budgets. Qwen's architecture handles Chinese and English with near-parity, useful for bilingual workflows.

Trade-offs

Without public benchmarks, you're flying blind on head-to-head performance against o1-mini or Claude Sonnet 4.5. Thinking mode adds latency—expect 2-5x longer response times than standard inference. Proprietary license means no fine-tuning or on-premise deployment. Qwen models historically lag OpenAI and Anthropic on nuanced creative writing and edge-case instruction following. If you need sub-second responses or proven MMLU scores, look elsewhere.

Specifications

Provider
qwen
Category
llm
Context length
1,000,000 tokens
Max output
32,768 tokens
Modalities
text
License
proprietary
Released
2025-09-08

Pricing

Input
$0.26/Mtok
Output
$0.78/Mtok
Model ID
qwen/qwen-plus-2025-07-28:thinking

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Estimated monthly spend
$7.32
17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
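
The calculator's figures can be reproduced with a few lines of arithmetic. The page only shows seats and messages per day, so the 22 working days per month, 2,000 tokens per message, and 70/30 input/output split below are assumptions, chosen because they reproduce the displayed 17.6M tokens and $7.32. The real calculator may use different defaults.

```python
INPUT_USD_PER_MTOK = 0.26   # from the Pricing section
OUTPUT_USD_PER_MTOK = 0.78  # from the Pricing section

def monthly_estimate(seats, msgs_per_day, working_days=22,
                     tokens_per_msg=2_000, input_share=0.70):
    """Return (total_tokens, cost_usd) for one month.

    working_days, tokens_per_msg and input_share are assumed defaults,
    not values documented by the calculator.
    """
    total_tokens = seats * msgs_per_day * working_days * tokens_per_msg
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    cost = (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    return total_tokens, round(cost, 2)

tokens, cost = monthly_estimate(seats=5, msgs_per_day=80)
print(f"{tokens / 1e6:.1f}M tokens / month, ${cost:.2f}")
```

Thinking mode emits extra reasoning tokens, so a workload that leans more heavily on output than the assumed 30% will cost more than this estimate.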

Providers

Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime
qwen | 1000k | $0.26/Mtok | $0.78/Mtok | - | - | -

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Debug Logical Fallacy

Walk through this argument step by step and identify any logical fallacies or unsupported leaps: [paste argument]. Show your reasoning at each stage.

Multi-Step Math Problem

Solve this problem showing all intermediate steps: [paste problem]. Explain your reasoning at each stage before moving to the next.

Codebase Architecture Analysis

Here's a full codebase dump. Analyze the architecture, identify design patterns, and suggest refactoring opportunities. Think through dependencies first: [paste code].

Research Synthesis

I'm pasting five research papers on [topic]. Compare their methodologies, identify contradictions, and synthesize a unified conclusion. Show your reasoning: [paste papers].

Strategic Decision Tree

Help me decide between [option A] and [option B] for [context]. Build a decision tree showing pros, cons, and second-order effects at each branch. Think through dependencies first.

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Walk me through your reasoning: Why does the birthday paradox show a 50% collision probability with just 23 people, when there are 365 days in a year?

Output

Let me think through this step by step. First, I need to consider what we're actually calculating. We're not asking 'what's the chance someone shares MY birthday' — we're asking 'what's the chance ANY two people share a birthday.' With 23 people, there are (23 × 22) / 2 = 253 unique pairs. Each pair has a 1/365 chance of matching. But we can't just multiply 253 × (1/365) because these events aren't independent — if persons A and B match, and B and C match, then A and C definitely match. The cleaner approach: calculate the probability that NO pairs match, then subtract from 1. For person 2, there's a 364/365 chance they don't match person 1. For person 3, there's a 363/365 chance they don't match persons 1 or 2. Continuing this pattern: (364/365) × (363/365) × ... × (343/365) ≈ 0.493. So the probability of at least one match is 1 - 0.493 ≈ 50.7%. The counterintuitive result comes from the quadratic growth of pair combinations.

Notes

This example demonstrates the model's extended reasoning capability — the 'thinking' variant shows its work before delivering conclusions. The step-by-step breakdown is clear and pedagogically sound. With a 1M token context window, this model can maintain coherence through much longer reasoning chains than shown here, though the output verbosity may increase token costs at $0.78/Mtok.
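
The arithmetic in the birthday example is easy to check independently. The sketch below computes the pair count and the complement-rule probability, assuming 365 equally likely, independent birthdays (the same simplification the example output makes).

```python
from math import comb, prod

def collision_probability(people, days=365):
    """P(at least two of `people` share a birthday), uniform birthdays assumed."""
    # Probability that every person avoids all earlier birthdays:
    # (365/365) * (364/365) * ... * ((365 - people + 1)/365)
    p_no_match = prod((days - k) / days for k in range(people))
    return 1 - p_no_match

print(comb(23, 2))                          # 253 unique pairs
print(round(collision_probability(23), 3))  # 0.507
print(round(collision_probability(22), 3))  # below 0.5, so 23 is the threshold
```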

Prompt

Review this Python function for edge cases and suggest improvements: `def calculate_discount(price, percent): return price - (price * percent / 100)`

Output

Let me analyze this function systematically.

Immediate issues I see:

  1. No type hints: unclear if `price` and `percent` should be int, float, or Decimal
  2. No validation: negative prices or percentages over 100 would produce nonsensical results
  3. Floating-point arithmetic can introduce rounding errors for currency, because binary floats cannot represent most decimal fractions exactly (e.g., 0.1 + 0.2 evaluates to 0.30000000000000004)
  4. No handling of None or non-numeric inputs

Improved version:

```python
from decimal import Decimal

def calculate_discount(price: Decimal, percent: float) -> Decimal:
    # Reject inputs that would produce a nonsensical discounted price.
    if price < 0:
        raise ValueError("Price cannot be negative")
    if not 0 <= percent <= 100:
        raise ValueError("Discount percent must be between 0 and 100")
    # Convert percent via str() so the Decimal does not inherit float error.
    remaining = Decimal("100") - Decimal(str(percent))
    # Round the result to whole cents.
    return (price * remaining / Decimal("100")).quantize(Decimal("0.01"))
```

This handles currency precision correctly and validates inputs, though it's more verbose.

Notes

Shows the model's code review capabilities with practical attention to real-world edge cases like floating-point precision in financial calculations. The reasoning process helps junior developers understand WHY each issue matters. However, for rapid iteration, the extended thinking may feel slower than models that jump straight to solutions — a trade-off between thoroughness and speed.

Prompt

I'm designing a rate limiter for an API. Compare token bucket vs. sliding window log approaches for a service handling 10K requests/second with burst tolerance of 2x.

Output

Let me work through the trade-offs for your specific requirements.

Token bucket:

  • Refills at 10K tokens/second, bucket capacity of 20K
  • Pros: Constant memory (just track token count and last refill time), extremely fast O(1) checks, naturally allows bursts up to capacity
  • Cons: Once a burst drains the bucket, requests are admitted only at the refill rate until tokens accumulate again; doesn't provide granular visibility into request timing patterns
  • Memory: ~16 bytes per user (token count + timestamp)

Sliding window log:

  • Stores timestamp of each request, evicts entries older than 1 second
  • Pros: Precise enforcement (no window-boundary artifacts), detailed audit trail of request patterns
  • Cons: Memory grows with request rate (at 20K burst = 20K timestamps = ~160KB per user), worst-case O(n) eviction on a request
  • Memory: ~8 bytes × request_count per user

For 10K RPS with burst tolerance, I'd recommend token bucket. The memory efficiency is critical at scale, and the O(1) performance keeps latency predictable. Sliding window log's precision advantage doesn't justify the 10,000× memory overhead (160KB versus 16 bytes per user) unless you need forensic request timing data.

Notes

Demonstrates the model's ability to reason through system design trade-offs with concrete numbers. The thinking process weighs multiple dimensions (memory, latency, precision) against the stated requirements. At $0.26 input / $0.78 output per Mtok, the detailed reasoning is cost-effective for high-stakes architectural decisions, though overkill for simple queries where a direct answer suffices.
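
For readers who want to see the recommended approach concretely, here is a minimal single-process token bucket sketch using the example's figures (10K tokens/second, capacity 20K). The class name, the `clock` parameter, and the lazy-refill design are illustrative choices, not something the example output specifies. A production limiter would also need locking or a shared store.

```python
import time

class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, holds at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock             # injectable so tests can control time
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last = clock()

    def allow(self, cost=1.0):
        """Admit a request if enough tokens remain; O(1) work per call."""
        now = self.clock()
        # Lazily credit the tokens earned since the previous call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# 10K requests/second sustained, with a 2x burst allowance.
limiter = TokenBucket(rate=10_000, capacity=20_000)
print(limiter.allow())
```

The per-user state is two numbers (token count and last-refill timestamp), which is where the roughly 16 bytes quoted in the example comes from.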

Use-case deep-dives

Multi-document legal discovery

When 1M tokens lets you skip the summarization step entirely

A 4-person litigation support team needs to extract key clauses from 80-page depositions and cross-reference them with contract exhibits. Qwen Plus 0728 handles this in a single pass: the 1M token context window holds the full deposition plus three related contracts without chunking or retrieval overhead. At $0.26/Mtok input, processing 800k tokens costs $0.21—cheaper than running a summarization pipeline on a smaller-context model, then feeding summaries to a second pass. The thinking mode adds 10-15 seconds of latency per query, which matters if you're running real-time chat but disappears when you're batching 20 discovery requests overnight. If you're processing fewer than 10 documents per case, the latency won't justify the cost savings; above that threshold, this becomes the default.
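
Before sending a document set in one pass, it helps to check that it fits the window and to estimate the input cost. The sketch below uses the limits from the Specifications section. The 4-characters-per-token heuristic and the choice to reserve the full output budget inside the context window are assumptions; an accurate count needs the model's own tokenizer.

```python
CONTEXT_TOKENS = 1_000_000   # Specifications: context length
MAX_OUTPUT_TOKENS = 32_768   # Specifications: max output
CHARS_PER_TOKEN = 4          # assumption: rough heuristic for English text

def estimate_tokens(text):
    """Crude token estimate; replace with a real tokenizer when available."""
    return len(text) // CHARS_PER_TOKEN + 1

def fits_in_context(documents, reserve=MAX_OUTPUT_TOKENS):
    """Return (fits, estimated_input_tokens), reserving room for the reply."""
    needed = sum(estimate_tokens(doc) for doc in documents)
    return needed + reserve <= CONTEXT_TOKENS, needed

def input_cost_usd(tokens, usd_per_mtok=0.26):
    """Input-side cost at the listed $0.26/Mtok."""
    return tokens * usd_per_mtok / 1_000_000

print(round(input_cost_usd(800_000), 2))  # 0.21, matching the figure above
```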

Codebase-wide refactoring analysis

Why this model struggles against specialized code alternatives

A 12-engineer SaaS team wants to analyze breaking changes across 40 TypeScript files before a major dependency upgrade. Qwen Plus 0728's 1M token window easily fits the entire codebase, but the lack of public benchmarks on code tasks is a red flag: without HumanEval or MBPP scores, there is no way to tell how far it trails purpose-built code models on multi-file reasoning. Output price is not the obstacle (a 50k-token refactor diff costs $0.039 at $0.78/Mtok), though reasoning tokens add to the output bill. The thinking mode might help with complex dependency graphs, but you're paying for reasoning overhead without proof it closes the gap on code-specific tasks. Use this if you're already locked into the Qwen ecosystem; otherwise, pick a model with published code benchmarks.

Customer support ticket triage

When thinking mode costs more than the problem it solves

A 20-seat e-commerce support team routes 300 tickets daily into billing, shipping, or product categories based on email text and order history. Qwen Plus 0728's thinking mode adds reasoning steps before classification, but ticket triage is a solved problem—baseline models hit 94% accuracy without extended reasoning, and the 10-second thinking latency breaks real-time workflows. At 200 tokens average per ticket (150 input, 50 output), you're paying $0.078 per 1k tickets; faster models like Gemini 1.5 Flash cost $0.04 and return answers in under 2 seconds. The 1M context window is overkill here—you rarely need more than the current ticket plus 2-3 previous exchanges. Save this model for cases where reasoning depth actually moves the accuracy needle, not commodity classification tasks.

Frequently asked

Is Qwen Plus 0728 good for complex reasoning tasks?

Yes, the 'thinking' designation means this model uses chain-of-thought reasoning before answering, similar to OpenAI's o1 series. It trades speed for accuracy on math, logic, and multi-step problems. Expect longer response times but better results on tasks where showing your work matters—code debugging, proof validation, strategic planning.

Is Qwen Plus 0728 cheaper than GPT-4o or Claude Sonnet?

Significantly cheaper. At $0.26/$0.78 per Mtok, Qwen Plus costs roughly one-tenth of GPT-4o ($2.50/$10.00) and between one-twelfth and one-nineteenth of Claude Sonnet 4 ($3.00/$15.00), depending on the input/output mix. The catch: no public benchmarks yet, so you're trading proven performance for cost savings. Test it on your actual workload before committing production traffic.
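
The price gap can be checked directly from the list prices quoted in this answer. The sketch below blends input and output prices at a 70/30 mix, which is an assumed workload shape rather than a figure from the source. It ignores the extra reasoning tokens that thinking mode bills as output, so real savings will be somewhat smaller.

```python
PRICES = {  # USD per million tokens (input, output), as quoted above
    "Qwen Plus 0728 (thinking)": (0.26, 0.78),
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4": (3.00, 15.00),
}

def blended(prices, input_share):
    """Average price per Mtok for a given share of input tokens."""
    inp, out = prices
    return inp * input_share + out * (1 - input_share)

def savings_vs(other, input_share=0.70):
    """Fractional saving of Qwen Plus against `other` at the given mix."""
    qwen = blended(PRICES["Qwen Plus 0728 (thinking)"], input_share)
    return 1 - qwen / blended(PRICES[other], input_share)

for name in ("GPT-4o", "Claude Sonnet 4"):
    print(f"{name}: {savings_vs(name):.0%} cheaper at a 70/30 mix")
```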

Can Qwen Plus 0728 handle 1M token context windows reliably?

The 1M token window is in the same class as Gemini 1.5 and far larger than GPT-4o's 128K, but real-world performance depends on how the thinking mode handles long contexts. Extended reasoning chains consume more tokens internally, so effective capacity may be lower than advertised. For document analysis or codebase queries, start with 500K tokens and monitor quality degradation.

How does Qwen Plus 0728 compare to the standard Qwen Plus?

The thinking variant of Qwen Plus 0728 adds explicit reasoning steps before final answers, increasing latency and output token costs. Use standard (non-thinking) Qwen Plus for straightforward Q&A, classification, or summarization. Reserve the thinking model for problems where intermediate steps matter: competitive programming, medical diagnosis chains, financial modeling with audit trails.

Should I use Qwen Plus 0728 for production chatbots?

Not for customer-facing chat. The thinking mode's slower responses (often 10-30 seconds) will frustrate users expecting instant replies. Better fit: backend analysis, overnight batch jobs, or internal tools where accuracy trumps speed. For chat, use standard Qwen Plus or Claude Haiku and escalate complex queries to the thinking model asynchronously.
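
The escalation pattern described above can be sketched with `asyncio`: answer immediately with a fast model, and hand complex queries to the thinking model in the background. Every function here (`fast_model`, `thinking_model`, `is_complex`) is a hypothetical stand-in, and the word-count heuristic is only a placeholder for a real routing rule.

```python
import asyncio

async def fast_model(query):
    # Hypothetical stand-in for a low-latency chat model.
    return f"quick answer to: {query}"

async def thinking_model(query):
    # Hypothetical stand-in for the thinking model; the sleep is a
    # placeholder for its much longer reasoning time.
    await asyncio.sleep(0.01)
    return f"detailed answer to: {query}"

def is_complex(query):
    # Placeholder heuristic; a real router might use a classifier.
    return len(query.split()) > 20

async def handle(query, follow_ups):
    """Reply right away; queue a background follow-up for complex queries."""
    reply = await fast_model(query)
    if is_complex(query):
        follow_ups.append(asyncio.create_task(thinking_model(query)))
    return reply

async def main():
    follow_ups = []
    print(await handle("Where is my order?", follow_ups))
    print(await handle("word " * 25, follow_ups))
    # Detailed answers arrive later, e.g. as a follow-up message.
    for detailed in await asyncio.gather(*follow_ups):
        print(detailed)

asyncio.run(main())
```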

Data last verified 8 hours ago. Sources aggregated hourly to weekly. See docs/architecture/model-directory.md.