Mistral: Voxtral Small 24B 2507
Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding. Input audio...
Anyone in the Space can @-mention Mistral: Voxtral Small 24B 2507 with the team's shared context - pooled credits, one chat, one memory.
Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.
Verdict
Best for
- Voice-enabled chatbots on tight budgets
- Audio transcription with contextual understanding
- Bilingual voice applications
- Real-time speech-to-text workflows
- Prototyping multimodal voice features
Strengths
Native audio processing eliminates the latency and error cascade of separate transcription pipelines. The 32K context window handles long conversations or multi-turn voice sessions without truncation. Pricing at $0.10 input makes it roughly 5× cheaper than GPT-4o's audio tier for high-volume voice applications. Mistral's prior models have shown strong multilingual performance, likely carrying over to audio modalities.
Trade-offs
No public benchmarks means you're flying blind on accuracy versus GPT-4o Audio, Gemini 2.0 Flash, or other audio-native models. The 24B size may lag behind larger audio models on complex reasoning tasks that require both speech understanding and deep inference. Mistral's ecosystem is less mature than OpenAI's for audio tooling—expect fewer off-the-shelf integrations. Early-stage release risk: edge cases in audio parsing may surface in production.
Specifications
- Provider
- mistralai
- Category
- llm
- Context length
- 32,000 tokens
- Max output
- —
- Modalities
- text, audio, file
- License
- proprietary
- Released
- 2025-10-30
Pricing
- Input
- $0.10/Mtok
- Output
- $0.30/Mtok
- Model ID
mistralai/voxtral-small-24b-2507
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
Team cost calculator
5 seats · 80 msgs/day
Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| mistralai | 32k | $0.10/Mtok | $0.30/Mtok | — | — | — |
Performance
Benchmarks
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Summarize Voice Meeting
Listen to this meeting recording and extract: (1) decisions made, (2) action items with owners, (3) unresolved questions. Format as a bulleted list.Open in a Space →
Bilingual Customer Support
You're a customer support agent. Listen to the customer's question in any language, then respond in the same language with a helpful, concise answer.Open in a Space →
Transcribe With Context
Transcribe this audio. Label each speaker as Speaker A, Speaker B, etc. Correct obvious speech recognition errors using conversational context.Open in a Space →
Voice-Driven Data Entry
Listen to the user dictating form fields. Extract the data into JSON with keys: name, email, phone, issue_description. Ask clarifying questions if any field is unclear.Open in a Space →
Accent-Robust Commands
You control a smart home. Listen to voice commands and respond with the action you'll take: 'lights on', 'set thermostat to 72', etc. Confirm ambiguous requests before acting.Open in a Space →
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Listen to this 30-second customer support call and summarize the issue, sentiment, and any action items. The caller sounds frustrated about a delayed shipment.
The caller reports their order #8472 was promised for delivery three days ago but hasn't arrived. Sentiment is frustrated but controlled—they mention needing the item for a work presentation tomorrow. Action items: (1) locate the shipment via tracking, (2) expedite if still in transit, (3) offer overnight replacement if lost. The caller emphasized time sensitivity twice, suggesting follow-up within 2 hours would prevent escalation.
Voxtral Small's native audio input eliminates transcription latency and preserves prosodic cues like frustration markers. At 24B parameters it handles nuanced sentiment analysis well, though the 32K context window means longer calls may need chunking—fine for typical support interactions under 10 minutes.
Generate a Python function that validates email addresses using regex, handles common edge cases like plus-addressing and international domains, and returns both a boolean and an error message if invalid.
```python import re def validate_email(email: str) -> tuple[bool, str]: pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' if not email: return False, "Email cannot be empty" if len(email) > 254: return False, "Email exceeds maximum length" if not re.match(pattern, email): return False, "Invalid email format" local, domain = email.rsplit('@', 1) if len(local) > 64: return False, "Local part too long" return True, "Valid email" ``` Supports plus-addressing via the `+` character and handles basic international TLDs.
The code is clean and handles the specified edge cases, though the regex doesn't cover all RFC 5322 complexities (quoted strings, comments). At $0.30/Mtok output, Voxtral Small sits in the budget tier—acceptable for straightforward coding tasks where you'll review output anyway, less ideal for production-critical generation.
I'm preparing a 5-minute conference talk on WebAssembly adoption trends. Draft an outline with three main points, each supported by a recent statistic or case study, plus a memorable closing hook.
**Outline: WebAssembly's Quiet Revolution** 1. **Performance at Scale** – Figma's editor runs 3× faster in WASM vs. JS (2022 rewrite); Shopify reports 50% reduction in cart-rendering time. 2. **Beyond the Browser** – Cloudflare Workers now support WASM for edge compute; Fermyon's Spin framework shows 80% of new serverless projects targeting WASM runtimes. 3. **Language Diversity** – Rust adoption up 40% YoY in WASM contexts (Stack Overflow 2024); Python (via Pyodide) and Go now compile to WASM, lowering barriers. **Closing hook:** "Five years ago, WASM was a browser curiosity. Today, it's rewriting the rules for where code runs—and which languages get to play."
Voxtral Small structures the outline logically and weaves in plausible data points (though you'd verify specifics before presenting). The 32K context window comfortably holds background research if you paste in source material, but the model lacks real-time web access—statistics here are illustrative, not fact-checked.
Use-case deep-dives
When audio-native models beat the transcript-then-analyze pipeline
A 12-person SaaS support team fields 200+ calls daily and needs sentiment flags plus ticket summaries in Salesforce. Voxtral Small processes audio directly at $0.10/Mtok input, skipping the Whisper→GPT chain that costs $0.06 transcription + $0.50+ analysis per call. The 32k context window handles 45-minute calls without chunking. At 200 calls/day you're spending ~$24/day vs. $112+ on the two-step flow. The model's 24B parameter count sits between efficiency and accuracy—fine for support routing, but if you need deep call analytics or multi-language nuance, test against Gemini 1.5 Flash's audio mode first. Buy this if your call volume justifies native audio and your use-case tolerates a model without public benchmarks.
Native audio processing for solo creators on tight margins
A solo podcaster publishes 3 episodes weekly and wants show notes, social clips, and blog drafts from each 60-minute recording. Voxtral Small's audio input means you drop the MP3 directly into the prompt—no transcription API, no preprocessing pipeline. At $0.10 input + $0.30 output per Mtok, a 60-minute episode (~9k tokens audio, 2k tokens output) costs roughly $0.90 vs. $1.80+ for Whisper + Claude Haiku. The 32k window fits full episodes, and the text+audio modality lets you reference timestamps in the output. The catch: no public benchmarks means you're flying blind on summarization quality. If you're producing 12+ episodes/month and can afford one revision pass, the cost savings justify the risk. Above 50 episodes/month, benchmark against Gemini 1.5 Flash for quality-per-dollar.
When your team needs speaker tone, not just transcripts
A 6-person product team records sprint planning calls and wants minutes that flag when stakeholders sound uncertain or pushback happens. Voxtral Small's audio modality captures tone and hesitation that disappears in text transcripts—critical for reading between the lines on scope creep. At $0.10/$0.30 per Mtok, a 90-minute call (~13k tokens audio, 1.5k output) runs $1.30 vs. $2+ for transcript-then-sentiment pipelines. The 32k context handles the full call, and the 24B size keeps latency reasonable for same-day turnaround. The risk: without public benchmarks you can't validate sentiment accuracy against MMLU or audio-specific evals. If your team runs 20+ calls/month and tone matters more than perfect transcription, this is the play. If you need legally defensible minutes, stick to Whisper + a benchmarked text model.
Frequently asked
Is Voxtral Small 24B good for voice-to-text applications?
Yes, it's built for audio+text workflows. The 24B parameter count gives you decent reasoning on transcribed speech without the cost of larger models. At $0.10/$0.30 per Mtok, you can process voice inputs economically. The 32k context window handles long conversations or meeting transcripts in a single pass.
Is Voxtral Small cheaper than GPT-4o for audio tasks?
Much cheaper. GPT-4o runs $2.50/$10.00 per Mtok — 25x and 33x more expensive on input and output respectively. If your use case tolerates a smaller model's reasoning limits, Voxtral Small cuts audio processing costs dramatically. You trade some accuracy for budget headroom.
Can it handle real-time voice chat with low latency?
Probably not ideal for real-time. The 24B size means slower inference than 7B-class models, and Mistral hasn't published latency numbers. Use this for batch transcription, voice memo analysis, or async voice assistants where a 2-3 second delay is acceptable. For live chat, try Gemini Flash or smaller Whisper+LLM combos.
How does Voxtral Small compare to Whisper plus a text LLM?
It's a single-model solution versus a two-step pipeline. You skip the Whisper API call and the glue code, which simplifies deployment. The trade-off: you lose Whisper's best-in-class transcription accuracy and can't swap the text LLM independently. Choose Voxtral if you want simplicity; choose Whisper+LLM if you need maximum control.
Should I use this for customer support call analysis?
Yes, if you're analyzing recordings post-call. The 32k window fits most support calls, and the pricing makes high-volume analysis feasible. You can extract sentiment, action items, and compliance flags in one pass. For live agent assist, the latency will frustrate users — stick to post-call batch jobs.