
StepFun: Step 3.7 Flash

Step 3.7 Flash is StepFun's latest high-efficiency multimodal Mixture-of-Experts model. It pairs a 196B-parameter language backbone with a vision encoder for native image and video understanding, activating roughly 11B parameters...

Anyone in the Space can @-mention StepFun: Step 3.7 Flash with the team's shared context — pooled credits, one chat, one memory.



Verdict

Step 3.7 Flash targets teams that need multimodal reasoning across text, images, and video at an aggressive price point. The 256K context window handles lengthy documents and extended video clips, while pricing of $0.20 input / $1.15 output per Mtok undercuts most vision-capable models by 40-60%. With no public benchmarks available, you are trading proven performance data for cost savings and early access to StepFun's architecture. Best for teams willing to validate quality in-house on multimodal workflows where budget constraints rule out GPT-4V or Claude Sonnet.

Best for

Budget-conscious multimodal analysis
Video content understanding at scale
Long-context document processing with images
Prototyping vision features before production
High-volume screenshot interpretation

Strengths

Pricing sits 40-60% below comparable vision models, making high-volume multimodal work economically viable. The 256K context window accommodates full-length transcripts paired with video frames or multi-page PDFs with embedded diagrams without chunking. Native video support eliminates frame-extraction preprocessing that other models require. StepFun's architecture appears optimized for throughput over raw capability, fitting teams that need acceptable quality across thousands of requests rather than perfect accuracy on dozens.

Trade-offs

The absence of public benchmarks means there are no MMMU, MathVista, or DocVQA scores to anchor expectations against GPT-4V or Gemini Pro Vision. Early adopters report variable performance on complex reasoning chains that mix visual and textual evidence. Latency characteristics remain undocumented, so real-time applications need testing. The proprietary license limits deployment flexibility compared to open-weight alternatives. Teams that require auditable performance metrics for compliance or client reporting will struggle without third-party validation.

Specifications

Provider: stepfun
Category: llm
Context length: 256,000 tokens
Max output: 256,000 tokens
Modalities: text, image, video
License: proprietary
Released: 2026-05-28

Pricing

Input: $0.20/Mtok
Output: $1.15/Mtok
Model ID: stepfun/step-3.7-flash

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool — one plan, one balance for everyone.

Team cost calculator

Seats: 5 people
Messages / seat / day: 80
Avg turn size: 2 ktok
Output share: 30%

Estimated monthly spend

$8.54

17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool — one plan, one balance for everyone.
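
The estimate above can be reproduced from the listed per-token prices. The sketch below is a minimal reconstruction, not Switchy's actual calculator: the function name `monthly_estimate` is hypothetical, and the figure of 22 active days per month is an assumption inferred from the 17.6M token total (5 seats x 80 messages x 2 ktok x 22 days), since the page does not state it.

```python
# Upstream prices taken from the Pricing section of this page.
INPUT_USD_PER_MTOK = 0.20
OUTPUT_USD_PER_MTOK = 1.15

# Assumption: 22 active days per month, inferred from the 17.6M token total.
ACTIVE_DAYS_PER_MONTH = 22


def monthly_estimate(seats, msgs_per_seat_per_day, avg_turn_ktok,
                     output_share, days=ACTIVE_DAYS_PER_MONTH):
    """Return (total_tokens, cost_usd) for one month of team usage.

    output_share is the fraction of tokens billed at the output rate;
    the remainder is billed at the input rate.
    """
    total_tokens = seats * msgs_per_seat_per_day * avg_turn_ktok * 1000 * days
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    cost_usd = (input_tokens * INPUT_USD_PER_MTOK
                + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    return total_tokens, round(cost_usd, 2)


# The calculator's default inputs: 5 seats, 80 msgs/day, 2 ktok turns, 30% output.
tokens, cost = monthly_estimate(5, 80, 2, 0.30)
print(f"{tokens / 1e6:.1f}M tokens / month, ${cost:.2f}")
```

Because output tokens cost almost six times as much as input tokens, the output share tends to dominate the bill: in this example 30% of the tokens account for roughly 70% of the spend.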

Providers

Provider	Context	Input	Output	P50 latency	Throughput	30d uptime
stepfun	256k	$0.20/Mtok	$1.15/Mtok	—	—	—

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Extract Invoice Line Items

Extract all line items from this invoice image into a JSON array. For each item include: description, quantity, unit_price, and total. Preserve exact values as they appear.
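
Since the Verdict recommends validating quality in-house, a reply to this prompt can be checked mechanically before it is trusted. The sketch below is one possible check, under stated assumptions: the helper name `check_line_items` is hypothetical, and it assumes the model returns plain numeric values (no currency symbols or thousands separators), which the prompt itself does not guarantee.

```python
import json
from decimal import Decimal, InvalidOperation

# Keys requested by the starter prompt.
REQUIRED_KEYS = {"description", "quantity", "unit_price", "total"}


def check_line_items(raw_json):
    """Return a list of problems found in the reply; an empty list means it passed."""
    # parse_float=Decimal keeps prices exact instead of rounding to binary floats.
    items = json.loads(raw_json, parse_float=Decimal)
    if not isinstance(items, list):
        return ["top-level value is not a JSON array"]

    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append(f"item {i}: missing keys {sorted(missing)}")
            continue
        try:
            expected = Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
            total = Decimal(str(item["total"]))
        except InvalidOperation:
            problems.append(f"item {i}: non-numeric quantity, unit_price, or total")
            continue
        # Arithmetic cross-check: quantity x unit_price should equal total.
        if expected != total:
            problems.append(f"item {i}: {expected} != stated total {total}")
    return problems


reply = '[{"description": "Widget", "quantity": 2, "unit_price": 9.99, "total": 19.98}]'
print(check_line_items(reply))
```

A mismatch does not always mean the model misread the invoice, since discounts or per-line tax can make a printed total differ from quantity times unit price, so flagged items are best routed to manual review rather than rejected outright.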


Summarize Video Meeting

Watch this meeting recording and create a summary with three sections: key decisions made, action items with owners, and unresolved questions. Focus on what was said, not slide content.


Compare Product Screenshots

Compare these two app screenshots and list every visual difference you find. Organize by: layout changes, color/styling updates, text modifications, and new or removed elements.


Analyze Chart Trends

Describe the trends shown in this chart. What's the main story? Are there any anomalies or inflection points? What questions would you ask about the underlying data?


Generate Alt Text

Write concise alt text for this image suitable for screen readers. Describe the key visual elements and any text present. Keep it under 125 characters while conveying essential information.
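
The 125-character limit in this prompt is easy to enforce after the fact. A minimal sketch, assuming a hypothetical helper `alt_text_ok` and that whitespace runs should be collapsed before counting:

```python
MAX_ALT_TEXT_CHARS = 125  # limit stated in the prompt ("under 125 characters")


def alt_text_ok(text):
    """True if the alt text is non-empty and strictly under the character limit."""
    cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
    return 0 < len(cleaned) < MAX_ALT_TEXT_CHARS


print(alt_text_ok("A red bicycle leaning against a brick wall."))
```

Replies that fail the check can be sent back to the model with a request to shorten them.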
