Qwen: Qwen3 VL 235B A22B Thinking
Qwen3-VL-235B-A22B Thinking is a multimodal model that unifies strong text generation with visual understanding across images and video. The Thinking model is optimized for multimodal reasoning in STEM and math....
Anyone in the Space can @-mention Qwen: Qwen3 VL 235B A22B Thinking with the team's shared context - pooled credits, one chat, one memory.
Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.
Verdict
Best for
- Multimodal document analysis with images
- Vision-language tasks requiring long context
- Cost-sensitive image understanding workflows
- Reasoning over screenshots and diagrams
- Batch processing of visual content
Strengths
The 131K token context window handles lengthy multimodal documents without chunking. Pricing undercuts GPT-4o and Claude Sonnet 4 on input tokens by 60-75%, making it viable for high-volume vision tasks. The 235B parameter base suggests competitive capability on complex reasoning, and the 'Thinking' designation implies chain-of-thought or extended reasoning modes that can improve accuracy on multi-step problems.
Trade-offs
No public benchmark data makes it impossible to compare against Claude Sonnet 4.5, GPT-4o, or Gemini 1.5 Pro on standard vision-language tasks. The proprietary license limits deployment flexibility. Output pricing at $2.60/Mtok is 3-4x higher than input, penalizing verbose responses. Without MMMU, VQA, or OCRBench scores, you're flying blind on actual vision performance—plan to run your own evals before committing production workloads.
Specifications
- Provider
- qwen
- Category
- llm
- Context length
- 131,072 tokens
- Max output
- 32,768 tokens
- Modalities
- text, image
- License
- proprietary
- Released
- 2025-09-23
Pricing
- Input
- $0.26/Mtok
- Output
- $2.60/Mtok
- Model ID
qwen/qwen3-vl-235b-a22b-thinking
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
Team cost calculator
5 seats · 80 msgs/day
Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| qwen | 131k | $0.26/Mtok | $2.60/Mtok | — | — | — |
Performance
Benchmarks
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Analyze Technical Diagram
Examine this diagram and list every component, connection, and labeled value. Then explain the system's function in two sentences.Open in a Space →
Multi-Page Document Summary
Summarize this 40-page report, highlighting key findings and any charts or tables that support the main argument.Open in a Space →
Screenshot Debugging
Review this screenshot for usability problems, visual bugs, or accessibility issues. Prioritize by severity.Open in a Space →
Invoice Data Extraction
Extract vendor name, invoice number, line items with quantities and prices, subtotal, tax, and total from this image.Open in a Space →
Visual Reasoning Chain
This image shows a physics problem setup. Identify all forces, write the equations, then solve for the unknown variable step-by-step.Open in a Space →
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Analyze this product photo and suggest three specific improvements to lighting and composition that would increase conversion rates for an e-commerce listing.
Looking at the current setup, the primary issue is the harsh overhead lighting creating a shadow under the product's base, which obscures detail and reduces perceived quality. First, add a diffused fill light at 45 degrees from the left to eliminate that shadow while maintaining definition. Second, shift the camera angle down by approximately 15 degrees—the current high angle makes the product appear smaller than it is. Third, introduce a subtle rim light from behind-right to separate the product from the background and add depth. The background itself is good—neutral without competing—but these lighting changes would lift the product's perceived value by 20-30% based on typical A/B testing results for similar categories.
This example demonstrates the model's multimodal reasoning—combining visual analysis with domain knowledge about e-commerce photography. The 131K context window allows it to reference detailed product photography principles while maintaining coherent recommendations. However, without published vision benchmarks, we can't quantify its accuracy against specialized vision models.
I'm refactoring a Python service that processes 10M records daily. Walk me through converting this synchronous database query pattern to async/await without breaking existing error handling.
Start by wrapping your current connection pool with `asyncpg` instead of `psycopg2`—the API surface is similar enough that you'll recognize the patterns. Your existing `try/except` blocks translate directly: `async with pool.acquire() as conn` replaces the synchronous context manager, and exceptions propagate the same way. The critical change is in your batch processing loop: instead of `for record in records`, you'll use `asyncio.gather()` with a semaphore to limit concurrency—I'd start with 50 concurrent queries based on your volume. Keep your existing retry logic but add `asyncio.sleep()` for backoff delays. One gotcha: if you're using SQLAlchemy ORM, you'll need the async extension, which changes session management significantly. For raw queries like yours, the migration is straightforward—expect 3-4x throughput improvement with the same error visibility.
Shows the model's ability to provide architectural guidance grounded in real-world constraints (10M records, existing error handling). The thinking mode likely helps it reason through migration trade-offs. At $2.60/Mtok output, this 150-word response costs ~$0.0004—reasonable for complex technical guidance, though repeated queries add up quickly.
Explain why this quarterly revenue chart shows concerning trends even though top-line growth is 15% YoY. Focus on what the CFO should ask the sales team next week.
The 15% headline growth masks two structural problems visible in the data. First, customer acquisition cost increased 28% quarter-over-quarter while average contract value only grew 6%—your unit economics are deteriorating. Second, the revenue composition shifted heavily toward Q4, with 61% of annual bookings landing in December versus a historical 35-40% pattern. This suggests either aggressive discounting to hit targets or deals pulled forward that would naturally close in Q1. The CFO should ask: (1) What percentage of Q4 deals included non-standard terms or extended payment schedules? (2) What's the 90-day retention rate for Q4 cohorts versus Q2-Q3? (3) Are sales comp plans creating incentives for quarter-end stuffing? The growth is real, but if it's coming from margin compression and timing games, you're building on sand.
Demonstrates analytical reasoning across both quantitative data and business context—a strength of large-context models that can hold multiple data series in working memory. The thinking component likely helps it identify non-obvious patterns. Trade-off: without vision benchmark data, we can't verify how accurately it would parse actual chart images versus described data.
Use-case deep-dives
When 131K context handles full vendor packet analysis without chunking
A 9-person procurement team processes 200+ vendor invoices weekly, each with 8-12 pages of line items, terms, and compliance attachments. Qwen3 VL 235B fits entire invoice packets in one 131K-token pass—no chunking, no context loss across pages. The vision layer reads tables and handwritten notes that pure-text models miss. At $0.26 input per million tokens, a 40-page packet (roughly 60K tokens with images) costs $0.016 to process. Output is expensive at $2.60/Mtok, so keep extraction templates tight: structured JSON, not verbose summaries. If your invoices average under 20 pages and you're processing 500+/day, the input savings beat GPT-4V. Above that volume, test whether the output cost (generating 2K tokens = $0.005/invoice) still pencils. Buy this model when document density and context continuity matter more than raw speed.
Why vision + reasoning works for screenshot-heavy support queues
A 14-person SaaS support team handles 400 tickets daily, half with user-submitted screenshots of error states, config panels, or broken UI. Qwen3 VL parses the image, reads error codes from the screenshot, cross-references against the text description, and routes to the right specialist—all in one model call. The 131K context window holds the full ticket thread (10-15 back-and-forth messages) plus 3-4 screenshots without truncation. Input cost is negligible: a typical ticket with 2 images and 5K tokens of text runs $0.0015. Output triage (300 tokens: severity, category, suggested owner) costs $0.00078. Total per-ticket cost under $0.0025 makes this viable at scale. The thinking layer helps with ambiguous cases where the screenshot contradicts the user's description. If your tickets are text-only, skip the vision tax and use a cheaper text model. If screenshots are central, this is the call.
When long-context reasoning beats multi-pass summarization for analyst reports
A 4-person market research consultancy synthesizes 80-page competitor filings, product roadmaps, and earnings transcripts into client briefings. Qwen3 VL's 131K window ingests the full document set in one pass—no lossy summarization chains, no hallucinated connections between sections 60 pages apart. The thinking layer surfaces non-obvious strategic pivots (a pricing change on page 12 that contradicts the growth narrative on page 71). Input cost for an 80-page PDF (roughly 120K tokens) is $0.031. Generating a 4K-token executive brief costs $0.0104. At $0.041 per report, you're paying for coherence across the full document. If your reports are under 30 pages, a cheaper 32K-context model will do. Above 60 pages, the single-pass accuracy justifies the spend. The vision modality is a bonus for charts and tables, but the real win is reasoning over the full context without stitching.
Frequently asked
Is Qwen3 VL 235B A22B Thinking good for vision-language tasks?
Yes, it's built for multimodal work combining text and images. The 131K token context window lets you process multiple images with detailed prompts in one request. Without public benchmarks we can't compare it directly to GPT-4V or Claude 3.5 Sonnet, but the 235B parameter count suggests strong reasoning capability for visual analysis, OCR, and document understanding.
Is Qwen3 VL 235B A22B Thinking cheaper than GPT-4 Turbo with vision?
Yes, significantly. At $0.26 input and $2.60 output per million tokens, it's roughly 10x cheaper than GPT-4 Turbo's vision pricing. The output cost is higher than input, so it works best for tasks where you send images and get concise responses rather than generating long text. For bulk image analysis, the savings add up fast.
Can Qwen3 VL 235B A22B Thinking handle long documents with images?
The 131K token context window gives you room for roughly 30-40 pages of mixed text and images, depending on image resolution. That's enough for most reports, slide decks, or technical documentation. You'll hit limits with full books or large datasets, but for typical business documents it handles the entire file in one pass without chunking.
How does the A22B Thinking variant differ from standard Qwen3 VL?
The A22B designation likely indicates an active parameter subset or distilled architecture from the full 235B model, optimized for inference speed. The "Thinking" label suggests extended chain-of-thought reasoning, similar to OpenAI's o1 approach. Without benchmarks, expect slower responses than standard Qwen3 VL but potentially better accuracy on complex visual reasoning tasks that benefit from step-by-step analysis.
Should I use Qwen3 VL 235B A22B Thinking for production image classification?
Only if you need complex reasoning over images. For simple classification or object detection, a fine-tuned vision model or smaller multimodal model will be faster and cheaper. Use this when you need the model to explain what it sees, compare multiple images, or extract structured data from complex visual layouts like invoices or diagrams.