StepFun: Step 3.5 Flash
Step 3.5 Flash is StepFun's most capable open-source foundation model. Built on a sparse Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token....
Anyone in the Space can @-mention StepFun: Step 3.5 Flash with the team's shared context - pooled credits, one chat, one memory.
Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.
Verdict
Best for
- Budget-conscious long document processing
- Large codebase analysis under cost pressure
- Multi-document synthesis at scale
- Prototyping context-heavy workflows cheaply
Strengths
The 262K context window rivals frontier models, while input pricing of $0.09/Mtok sits far below flagship rates such as GPT-4o ($2.50/Mtok) and Claude Sonnet 4 ($3.00/Mtok). This combination makes Step 3.5 Flash viable for high-volume document ingestion, repository-scale code analysis, and batch processing where per-token costs dominate your budget. The text-only focus keeps the API surface simple for teams that don't need multimodal capabilities.
Trade-offs
StepFun lacks the public benchmark track record of Anthropic, OpenAI, or Google models — no MMLU, HumanEval, or GPQA scores to anchor expectations. The model's reasoning depth and instruction-following quality remain unproven in independent evaluations. Teams accustomed to Claude or GPT-4 level output may find Step 3.5 Flash requires more prompt iteration or human review to match quality standards, especially on nuanced tasks.
Specifications
- Provider: stepfun
- Category: llm
- Context length: 262,144 tokens
- Max output: 16,384 tokens
- Modalities: text
- License: proprietary
- Released: 2026-01-29
Pricing
- Input: $0.09/Mtok
- Output: $0.30/Mtok
- Model ID: stepfun/step-3.5-flash
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
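To make the listed rates concrete, here is a minimal sketch of the upstream cost arithmetic. It uses only the $0.09 and $0.30 per-Mtok figures from the pricing list; the function name `estimate_cost_usd` is made up for illustration, and the result is the upstream price, not the number of Switchy credits drawn from the shared pool.

```python
# Listed upstream rates in USD per million tokens (from the Pricing section).
INPUT_USD_PER_MTOK = 0.09
OUTPUT_USD_PER_MTOK = 0.30


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the upstream cost in USD for one request or one batch."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


if __name__ == "__main__":
    # A prompt that fills the whole context window plus a maximum-length reply.
    print(f"${estimate_cost_usd(262_144, 16_384):.4f}")
```

Even the largest single request the specifications allow stays under three cents at these rates, which is why volume rather than request size drives the bill.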
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| stepfun | 262k | $0.09/Mtok | $0.30/Mtok | — | — | — |
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Multi-File Code Summary
You are reviewing this codebase. Summarize the overall architecture, identify the three most critical functions, and flag any obvious security concerns. Focus on how components interact rather than line-by-line details.
Contract Comparison Analysis
Compare these three contracts. List every clause where terms differ materially, then summarize which contract offers the most favorable terms for the buyer. Use a table format for the clause-by-clause comparison.
Research Paper Synthesis
These papers all study the same phenomenon. Extract the consensus findings, highlight where results conflict, and identify which methodology appears most rigorous. Cite specific paper sections in your analysis.
Meeting Transcript Extraction
From this meeting transcript, create two lists: (1) all action items with assigned owners and deadlines, (2) decisions made with brief rationale. Preserve exact wording for commitments.
Customer Feedback Clustering
Analyze these customer support messages. Group them into 5-7 thematic clusters, rank clusters by frequency, and for each cluster provide the most representative example plus a one-sentence root cause hypothesis.
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Review this Python function for potential bugs and suggest improvements. Focus on edge cases and performance:

```python
def merge_sorted_lists(list1, list2):
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    return result
```
The function has a critical bug: it drops remaining elements after one list is exhausted. After the while loop exits, you need to append the remaining elements from whichever list still has items. Add these lines after the loop:

```python
result.extend(list1[i:])
result.extend(list2[j:])
```

Performance-wise, the repeated `len()` calls are minor but could be cached. More importantly, if you're merging many lists repeatedly, consider using `heapq.merge()` from the standard library; it's optimized for this exact pattern and handles multiple iterables elegantly.
Step 3.5 Flash identifies the logic error immediately and provides a correct fix, demonstrating solid code review capability. The 262K token context window lets it review codebases that fit within that limit in one pass. However, without public benchmarks, it's unclear how it performs against specialized code models on complex refactoring tasks.
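For readers who want to confirm the fix described in the example, the following sketch applies the two `extend` calls to the original function and cross-checks the result against `heapq.merge` from the standard library. The sample lists are made up for illustration.

```python
import heapq


def merge_sorted_lists(list1, list2):
    """Merge two already-sorted lists into one sorted list."""
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # The fix: one of these slices is empty, the other holds the leftovers.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result


if __name__ == "__main__":
    a, b = [1, 4, 9], [2, 3, 10, 11]
    merged = merge_sorted_lists(a, b)
    # heapq.merge returns a lazy iterator, so wrap it in list() to compare.
    assert merged == list(heapq.merge(a, b))
    print(merged)
```

Without the two `extend` lines, the same inputs would lose `10` and `11`, which is the edge case the review points out.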
Explain the Byzantine Generals Problem in distributed systems. I'm a backend engineer who understands consensus basics but haven't studied the formal theory. Keep it practical—how does this relate to database replication?
The Byzantine Generals Problem asks: how do distributed nodes reach agreement when some nodes might be malicious or faulty, sending conflicting information to different peers? In database replication, this maps directly to your cluster. Say you have five replicas. Three are healthy, one has a disk corruption bug (sends garbage), and one has a network partition (sends different data to different nodes). How do the healthy nodes commit a transaction everyone agrees on? Practical solutions: Raft and Paxos assume 'crash failures' only—nodes either work correctly or stop. Byzantine Fault Tolerant (BFT) protocols like PBFT handle malicious behavior but need 3f+1 nodes to tolerate f failures, which is why most production databases stick with crash-tolerant consensus and rely on authentication to prevent malicious nodes.
The model bridges formal theory and engineering practice effectively, matching the prompt's request for practical framing. The $0.09/$0.30 pricing makes it economical for documentation generation and technical explanation tasks. The response stays grounded without oversimplifying, though depth on specific BFT implementations remains surface-level.
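The 3f+1 bound quoted in the response can be turned into a quick sanity check. The sketch below is illustrative only: the 2f+1 majority bound for crash-tolerant protocols is standard background that the response does not state, and both function names are made up.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1 (the PBFT-style bound from the response)."""
    return max((n - 1) // 3, 0)


def max_crash_faults(n: int) -> int:
    """Largest f such that n >= 2f + 1 (majority quorum, as in Raft or Paxos)."""
    return max((n - 1) // 2, 0)


if __name__ == "__main__":
    for replicas in (3, 4, 5, 7):
        print(
            replicas,
            "replicas: tolerates",
            max_byzantine_faults(replicas), "Byzantine or",
            max_crash_faults(replicas), "crash faults",
        )
```

Applied to the five-replica cluster in the example, this gives one tolerable Byzantine fault, so a BFT protocol would need seven replicas to survive the two misbehaving nodes described there.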
I have meeting transcripts from the last quarter (about 80,000 words). Extract all action items assigned to the engineering team, group them by project area, and flag any that mention deadlines in the past that might have been missed.
**Infrastructure (7 items)**
- Migrate staging database to new cluster (assigned: Dev, deadline: Sept 15, OVERDUE)
- Set up monitoring for API gateway latency (assigned: SRE, no deadline mentioned)
- Document disaster recovery runbook (assigned: Platform, deadline: Oct 1, OVERDUE)

**Product Features (12 items)**
- Implement OAuth refresh token rotation (assigned: Auth team, deadline: Nov 30)
- Add pagination to analytics dashboard (assigned: Frontend, mentioned as 'urgent' on Oct 10, no formal deadline)

**Security (3 items)**
- Audit third-party dependencies for CVEs (assigned: Security, quarterly recurring)

*Note: Two items flagged as overdue. Five items lack explicit deadlines but were described as 'urgent' or 'high priority' in discussions.*
This showcases the practical value of the 262K token context window: a full quarter of transcripts fits in one prompt. The model structures output clearly and flags temporal inconsistencies. At $0.09 per million input tokens, 80K words (roughly 107K tokens, assuming about 0.75 words per token) cost about $0.01 to ingest, making bulk document analysis highly affordable. The extraction accuracy depends on transcript quality and consistent terminology across meetings.
Use-case deep-dives
When Step 3.5 Flash handles 500+ tickets/day under budget
A 12-person SaaS support team routing 600 inbound tickets daily needs fast classification without burning budget. Step 3.5 Flash fits here: at $0.09 input per million tokens, a 150-token ticket costs about $0.0000135 to read, or under one cent a day for all 600. The 262k context window means you can load similar tickets from the last 30 days as few-shot examples in a single call, improving routing accuracy without fine-tuning, though those examples then dominate the input bill. Output at $0.30/Mtok keeps response generation cheap even when the model writes 200-token summaries (about $0.00006 each). If your ticket volume drops below 200/day, the setup overhead isn't worth it; just use a general-purpose model. Above 500/day, the savings versus competitors charging $1+ input per Mtok scale with prompt size: a few dollars a month with bare 150-token inputs, reaching the hundreds of dollars only when each call carries tens of thousands of tokens of few-shot context.
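One way to use the window for few-shot routing is to add labeled tickets, newest first, until a token budget is reached. The sketch below is a rough illustration under stated assumptions: `rough_token_count` uses a crude four-characters-per-token estimate rather than StepFun's tokenizer, and the prompt layout, function names, and queue labels are all hypothetical.

```python
from typing import List, Tuple

CONTEXT_WINDOW = 262_144  # from the Specifications section
MAX_OUTPUT = 16_384       # from the Specifications section


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); not the real tokenizer."""
    return max(1, len(text) // 4)


def build_routing_prompt(
    new_ticket: str,
    labeled_examples: List[Tuple[str, str]],  # (ticket text, queue), newest first
    instructions: str = "Classify the final ticket into one of the queues used in the examples.",
    reserve_for_output: int = MAX_OUTPUT,
) -> str:
    """Add examples newest-first until the estimated token budget is used up."""
    tail = f"Ticket: {new_ticket}\nQueue:"
    budget = CONTEXT_WINDOW - reserve_for_output
    used = rough_token_count(instructions) + rough_token_count(tail)
    chosen = []
    for text, label in labeled_examples:
        block = f"Ticket: {text}\nQueue: {label}\n"
        cost = rough_token_count(block)
        if used + cost > budget:
            break  # stop at the first example that no longer fits
        chosen.append(block)
        used += cost
    return "\n".join([instructions, *chosen, tail])


if __name__ == "__main__":
    examples = [("Cannot log in after update", "auth"), ("Charged twice", "billing")]
    print(build_routing_prompt("Password reset email never arrives", examples))
```

Because every example is billed as input on every call, it is worth measuring how routing accuracy changes with the number of examples before filling the whole window.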
Why Step 3.5 Flash works for multi-contract analysis sessions
A 4-person legal ops team compares vendor agreements (each 40-80 pages, 25k-50k tokens) across 6-12 contracts per review session. Step 3.5 Flash's 262k context window fits 4-5 of the longest contracts in one prompt (more when they are shorter), letting you ask "which vendor has the most restrictive IP clause?" without chunking or retrieval pipelines. At $0.09 input per Mtok, loading 200k tokens costs about $0.02, which is negligible when you're billing $300/hour. The trade-off: without public benchmarks, you'll need to test accuracy on your contract language during a 2-week pilot. If the model misses key clauses or hallucinates terms, switch to a benchmarked alternative. If it holds up, you've eliminated the engineering overhead of RAG systems for this workload.
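When a session covers more contracts than fit in one prompt, they have to be split across several calls. The following sketch groups contracts with a simple first-fit-decreasing heuristic; the 2,000-token allowance for the question and the contract names are assumptions for illustration, and the token counts would come from whatever tokenizer estimate you trust.

```python
from typing import Dict, List

CONTEXT_WINDOW = 262_144   # from the Specifications section
RESERVED = 16_384 + 2_000  # max output plus a hypothetical allowance for the question


def group_contracts(
    token_counts: Dict[str, int],
    budget: int = CONTEXT_WINDOW - RESERVED,
) -> List[List[str]]:
    """First-fit decreasing: put each contract in the first prompt with room."""
    groups: List[List[str]] = []
    remaining: List[int] = []  # tokens still free in each group
    ordered = sorted(token_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, tokens in ordered:
        if tokens > budget:
            raise ValueError(f"{name} alone exceeds the budget and would need chunking")
        for idx, room in enumerate(remaining):
            if tokens <= room:
                groups[idx].append(name)
                remaining[idx] -= tokens
                break
        else:
            # No existing prompt has room, so start a new one.
            groups.append([name])
            remaining.append(budget - tokens)
    return groups


if __name__ == "__main__":
    session = {"vendor_a": 50_000, "vendor_b": 48_000, "vendor_c": 31_000,
               "vendor_d": 27_000, "vendor_e": 50_000, "vendor_f": 44_000}
    for prompt_no, names in enumerate(group_contracts(session), start=1):
        print(prompt_no, names)
```

Questions that compare vendors across groups still need a second pass over the per-group answers, so keeping closely related contracts in the same group matters more than packing tightly.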
When Step 3.5 Flash moderates 10k posts overnight for pennies
A 3-person community platform runs nightly moderation on 8,000-12,000 user posts (average 80 tokens each). Step 3.5 Flash processes this at $0.09 input per Mtok: 10k posts × 80 tokens = 800k tokens = $0.07 input cost. Output is binary (flag/pass) plus a 20-token reason, so 10k × 20 tokens = 200k output tokens = $0.06. Total nightly cost: about $0.13. Batch size is limited by the 16,384-token output cap rather than the 262k context window: at 20 output tokens per post, at most around 800 posts fit in one API call. The risk: no public benchmarks means you can't predict false-positive rates before deployment. Run a 1,000-post labeled test set first. If precision drops below 92%, the cost savings aren't worth the manual review load. Above that threshold, you're spending about $4/month instead of $40+ on higher-priced models.
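The precision check on the labeled test set can be scripted in a few lines. This is a minimal sketch: the 0.92 threshold comes from the paragraph above, while the function name and the boolean encoding of flags are assumptions made for illustration.

```python
from typing import Sequence

PRECISION_THRESHOLD = 0.92  # the cut-off suggested in the text above


def flag_precision(predicted_flags: Sequence[bool], true_flags: Sequence[bool]) -> float:
    """Precision of the 'flag' class: correct flags / all flags the model raised."""
    if len(predicted_flags) != len(true_flags):
        raise ValueError("predictions and labels must be the same length")
    pairs = list(zip(predicted_flags, true_flags))
    true_pos = sum(1 for p, t in pairs if p and t)
    false_pos = sum(1 for p, t in pairs if p and not t)
    raised = true_pos + false_pos
    if raised == 0:
        raise ValueError("the model raised no flags, so precision is undefined")
    return true_pos / raised


if __name__ == "__main__":
    # Made-up results for a 1,000-post test set: 50 flags raised, 47 correct.
    predicted = [True] * 50 + [False] * 950
    labels = [True] * 47 + [False] * 3 + [False] * 950
    precision = flag_precision(predicted, labels)
    print(f"precision={precision:.2f}", "OK" if precision >= PRECISION_THRESHOLD else "too low")
```

Precision only measures how many raised flags were justified; posts the model should have flagged but passed would need a separate recall check on the same labeled set.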
Frequently asked
Is Step 3.5 Flash good for general text tasks?
Step 3.5 Flash handles standard text generation, summarization, and Q&A adequately for most business use cases. Without public benchmarks, it's hard to rank against GPT-4 or Claude, but the 262k token context window makes it viable for long-document work. Test it on your specific prompts before committing to production.
Is Step 3.5 Flash cheaper than GPT-4o or Claude Sonnet?
Yes. At $0.09 input and $0.30 output per million tokens, Step 3.5 Flash undercuts GPT-4o ($2.50/$10.00) and Claude Sonnet 4 ($3.00/$15.00) by roughly 28-50x. If you're processing high volumes of text and quality is acceptable, the cost savings are substantial. Run parallel tests to confirm output quality meets your bar.
Can Step 3.5 Flash handle 200k+ token documents in one request?
The 262k context window supports it technically, but real-world performance depends on prompt structure and on where the relevant passages sit in the prompt. Many models have shown degraded recall on needle-in-a-haystack tasks past 100k tokens. For critical long-document workflows, validate recall accuracy on your data before relying on the full window. Consider chunking strategies as backup.
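As one possible shape for the chunking fallback, the sketch below splits a document into overlapping windows so that a passage cut at one boundary still appears whole in the neighboring chunk. It treats whitespace-separated words as stand-ins for tokens, which is an assumption; the chunk size and overlap values are placeholders to tune on your own data.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again".split()
    for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk can then be sent as its own request and the partial answers merged, at the price of extra calls and some duplicated input tokens from the overlap.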
How does Step 3.5 Flash compare to earlier StepFun models?
StepFun hasn't published benchmark deltas between Step 3.5 Flash and prior versions, so direct comparison is speculative. The "Flash" suffix typically signals speed-optimized inference, likely trading some reasoning depth for lower latency. If you used an earlier Step model, A/B test on your prompts to measure any quality shift.
Should I use Step 3.5 Flash for customer-facing chatbots?
Only if cost is the primary constraint and you can tolerate occasional quality gaps. Without MMLU, HumanEval, or safety benchmarks, you're flying blind on accuracy and hallucination rates. Run a shadow deployment alongside a proven model like GPT-4o mini, compare outputs for two weeks, then decide based on error rates.