StepFun: Step 3.7 Flash

Step 3.7 Flash is StepFun's latest high-efficiency multimodal Mixture-of-Experts model. It pairs a 196B-parameter language backbone with a vision encoder for native image and video understanding, activating roughly 11B parameters...

Anyone in the Space can @-mention StepFun: Step 3.7 Flash with the team's shared context — pooled credits, one chat, one memory.

Starter is free forever — 1 Space, 100 credits/month, 1 MCP. No card.

Verdict

Step 3.7 Flash targets teams needing multimodal reasoning across text, images, and video at aggressive price points. The 256K context window handles lengthy documents and extended video clips, while $0.20/$1.15 per Mtok undercuts most vision-capable models by 40-60%. Without public benchmarks, you're trading proven performance data for cost savings and early access to StepFun's architecture. Best for teams willing to validate quality in-house on multimodal workflows where budget constraints rule out GPT-4V or Claude Sonnet.

Best for

  • Budget-conscious multimodal analysis
  • Video content understanding at scale
  • Long-context document processing with images
  • Prototyping vision features before production
  • High-volume screenshot interpretation

Strengths

Pricing sits 40-60% below comparable vision models, making high-volume multimodal work economically viable. The 256K context window accommodates full-length transcripts paired with video frames or multi-page PDFs with embedded diagrams without chunking. Native video support eliminates frame-extraction preprocessing that other models require. StepFun's architecture appears optimized for throughput over raw capability, fitting teams that need acceptable quality across thousands of requests rather than perfect accuracy on dozens.

Trade-offs

Absence of public benchmarks means no MMMU, MathVista, or DocVQA scores to anchor expectations against GPT-4V or Gemini Pro Vision. Early adopters report variable performance on complex reasoning chains that mix visual and textual evidence. Latency characteristics remain undocumented, so real-time applications need testing. The proprietary license limits deployment flexibility compared to open-weight alternatives. Teams requiring auditable performance metrics for compliance or client reporting will struggle without third-party validation.

Specifications

Provider: stepfun
Category: llm
Context length: 256,000 tokens
Max output: 256,000 tokens
Modalities: text, image, video
License: proprietary
Released: 2026-05-28

Pricing

Input: $0.20/Mtok
Output: $1.15/Mtok
Model ID: stepfun/step-3.7-flash

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool — one plan, one balance for everyone.

Team cost calculator

Estimated monthly spend: $8.54
Volume: 17.6M tokens / month
Team: 5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool — one plan, one balance for everyone.
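
The estimate above can be reproduced from the listed per-token prices with a short sketch. The page does not state the calculator's settings, so the defaults below (22 working days per month, 2,000 tokens per message, a 70/30 input/output split) are assumptions chosen because they match the displayed figures; the function and parameter names are hypothetical.

```python
# Upstream prices taken from the Pricing section (USD per million tokens).
INPUT_PRICE_PER_MTOK = 0.20
OUTPUT_PRICE_PER_MTOK = 1.15


def monthly_cost(seats, msgs_per_day,
                 days_per_month=22,      # assumption: working days only
                 tokens_per_msg=2000,    # assumption: average tokens per message
                 input_share=0.70):      # assumption: 70% input, 30% output
    """Return (total_tokens, estimated_cost_usd) for one month of team usage."""
    total_tokens = seats * msgs_per_day * days_per_month * tokens_per_msg
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    cost = (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
    return total_tokens, cost


total, cost = monthly_cost(seats=5, msgs_per_day=80)
print(f"{total / 1e6:.1f}M tokens / month, ${cost:.2f}")
# prints "17.6M tokens / month, $8.54"
```

Because output tokens cost almost six times as much as input tokens, the input/output split moves the estimate more than any other assumption. Output-heavy workloads such as long summaries will land above this figure.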

Providers

Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime
stepfun | 256k | $0.20/Mtok | $1.15/Mtok | - | - | -

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Extract Invoice Line Items

Extract all line items from this invoice image into a JSON array. For each item include: description, quantity, unit_price, and total. Preserve exact values as they appear.
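
Since no public benchmarks are available for this model, it may be worth sanity-checking extracted invoices before using them downstream. The sketch below assumes the model's reply is a bare JSON array with the four keys named in the prompt. The function name and the sample reply are hypothetical, and the arithmetic check is skipped for values the model preserved as non-numeric strings (for example "$3.50").

```python
import json
from decimal import Decimal, InvalidOperation

REQUIRED_KEYS = {"description", "quantity", "unit_price", "total"}


def check_line_items(reply_text):
    """Return a list of human-readable problems found in the model's JSON reply."""
    items = json.loads(reply_text, parse_float=Decimal)  # keep decimals exact
    if not isinstance(items, list):
        return ["top-level value is not a JSON array"]
    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {i}: missing {sorted(missing)}")
            continue
        try:
            expected = Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
            total = Decimal(str(item["total"]))
        except InvalidOperation:
            continue  # values kept as printed (e.g. with currency symbols)
        if expected != total:
            problems.append(f"item {i}: quantity x unit_price = {expected}, total = {total}")
    return problems


# Hypothetical reply used only to exercise the checker.
sample = ('[{"description": "Widget", "quantity": 2, "unit_price": 3.50, "total": 7.00},'
          ' {"description": "Gadget", "quantity": 1, "unit_price": 4.25, "total": 5.00}]')
for problem in check_line_items(sample):
    print(problem)
```

A mismatch does not always mean the model misread the invoice, because discounts or per-line taxes can legitimately break the quantity x unit_price relation. Flagged items are candidates for manual review rather than confirmed errors.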

Summarize Video Meeting

Watch this meeting recording and create a summary with three sections: key decisions made, action items with owners, and unresolved questions. Focus on what was said, not slide content.

Compare Product Screenshots

Compare these two app screenshots and list every visual difference you find. Organize by: layout changes, color/styling updates, text modifications, and new or removed elements.

Generate Alt Text

Write concise alt text for this image suitable for screen readers. Describe the key visual elements and any text present. Keep it under 125 characters while conveying essential information.

Data last verified 2 hours ago. Sources aggregated hourly to weekly. See docs/architecture/model-directory.md.