StepFun: Step 3.7 Flash
Step 3.7 Flash is StepFun's latest high-efficiency multimodal Mixture-of-Experts model. It pairs a 196B-parameter language backbone with a vision encoder for native image and video understanding, activating roughly 11B parameters...
Anyone in the Space can @-mention StepFun: Step 3.7 Flash with the team's shared context — pooled credits, one chat, one memory.
Verdict
Best for
- Budget-conscious multimodal analysis
- Video content understanding at scale
- Long-context document processing with images
- Prototyping vision features before production
- High-volume screenshot interpretation
Strengths
At $0.20/Mtok input and $1.15/Mtok output, pricing sits 40-60% below comparable vision models, making high-volume multimodal work economically viable. The 256K context window accommodates full-length transcripts paired with video frames, or multi-page PDFs with embedded diagrams, without chunking. Native video support eliminates the frame-extraction preprocessing that models without video input require. StepFun's architecture appears optimized for throughput over raw capability, which suits teams that need acceptable quality across thousands of requests rather than perfect accuracy on dozens.
Trade-offs
No public benchmarks are available, so there are no MMMU, MathVista, or DocVQA scores to anchor expectations against GPT-4V or Gemini Pro Vision. Early adopters report variable performance on complex reasoning chains that mix visual and textual evidence. Latency is undocumented (the provider table lists no P50 latency or throughput figures), so real-time applications need their own testing. The proprietary license limits deployment flexibility compared to open-weight alternatives. Teams that require auditable performance metrics for compliance or client reporting will struggle without third-party validation.
Specifications
- Provider: stepfun
- Category: llm
- Context length: 256,000 tokens
- Max output: 256,000 tokens
- Modalities: text, image, video
- License: proprietary
- Released: 2026-05-28
Pricing
- Input: $0.20/Mtok
- Output: $1.15/Mtok
- Model ID: stepfun/step-3.7-flash
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool — one plan, one balance for everyone.
Team cost calculator
5 seats · 80 msgs/day
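The upstream cost for a team of this size can be estimated from the listed per-token prices. The sketch below is illustrative only: it assumes "80 msgs/day" means 80 messages per seat per day (pass `seats=1` if the figure is team-wide), and the tokens-per-message values are hypothetical placeholders, since the page does not state them. It also does not model how images or video frames are tokenized, and it gives the upstream dollar cost, not the number of Switchy credits consumed.

```python
# Listed upstream prices for stepfun/step-3.7-flash (USD per million tokens).
INPUT_USD_PER_MTOK = 0.20
OUTPUT_USD_PER_MTOK = 1.15


def monthly_cost_usd(seats, msgs_per_seat_per_day,
                     input_tokens_per_msg, output_tokens_per_msg, days=30):
    """Estimate upstream spend for one month of usage.

    Token counts per message are assumptions supplied by the caller;
    measure real traffic before relying on the result.
    """
    messages = seats * msgs_per_seat_per_day * days
    input_cost = messages * input_tokens_per_msg / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = messages * output_tokens_per_msg / 1_000_000 * OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical workload: 2,000 input and 500 output tokens per message.
    estimate = monthly_cost_usd(5, 80, 2_000, 500)
    print(f"${estimate:.2f} per month")  # prints "$11.70 per month"
```

With these assumed token counts, 12,000 messages a month cost about $4.80 for input and $6.90 for output. Output tokens dominate the bill despite being a quarter of the volume, so response length is the first thing to tune.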
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| stepfun | 256k | $0.20/Mtok | $1.15/Mtok | — | — | — |
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Extract Invoice Line Items
Extract all line items from this invoice image into a JSON array. For each item include: description, quantity, unit_price, and total. Preserve exact values as they appear.
Summarize Video Meeting
Watch this meeting recording and create a summary with three sections: key decisions made, action items with owners, and unresolved questions. Focus on what was said, not slide content.
Compare Product Screenshots
Compare these two app screenshots and list every visual difference you find. Organize by: layout changes, color/styling updates, text modifications, and new or removed elements.
Analyze Chart Trends
Describe the trends shown in this chart. What's the main story? Are there any anomalies or inflection points? What questions would you ask about the underlying data?
Generate Alt Text
Write concise alt text for this image suitable for screen readers. Describe the key visual elements and any text present. Keep it under 125 characters while conveying essential information.
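Because no public accuracy benchmarks exist for this model, replies to the invoice extraction prompt above are worth checking before they reach downstream systems. The following is a minimal sketch of such a check, not part of any Switchy or StepFun API: `check_line_items` is a hypothetical helper that confirms the reply is a JSON array whose items carry the four requested keys, and, where the three numeric fields came back as JSON numbers, that quantity times unit price equals the total. Values returned as strings (for example with currency symbols) skip the arithmetic check.

```python
import json
from decimal import Decimal

# Keys requested by the invoice extraction prompt.
REQUIRED_KEYS = {"description", "quantity", "unit_price", "total"}


def check_line_items(raw_json: str) -> list[str]:
    """Return a list of problems found in the reply; empty means it passed."""
    try:
        # Parse floats as Decimal so 4.50 * 2 compares exactly with 9.00.
        items = json.loads(raw_json, parse_float=Decimal)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(items, list):
        return ["top-level value is not a JSON array"]

    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append(f"item {i}: missing {sorted(missing)}")
            continue
        nums = [item[k] for k in ("quantity", "unit_price", "total")]
        # Only check arithmetic when all three values are JSON numbers.
        if all(isinstance(n, (int, Decimal)) and not isinstance(n, bool)
               for n in nums):
            qty, unit, total = (Decimal(n) for n in nums)
            if qty * unit != total:
                problems.append(f"item {i}: {qty} x {unit} != {total}")
    return problems


if __name__ == "__main__":
    reply = '[{"description": "Widget", "quantity": 2, "unit_price": 4.50, "total": 9.00}]'
    print(check_line_items(reply))  # prints []
```

A non-empty result is a signal to retry the request or route the invoice to manual review. Real invoices with discounts or per-line tax would need a looser arithmetic rule than the exact equality used here.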