Google: Gemini 3.1 Pro Preview Custom Tools
Gemini 3.1 Pro Preview Custom Tools is a variant of Gemini 3.1 Pro that improves tool selection behavior by preventing overuse of a general bash tool when more efficient third-party...
Anyone in the Space can @-mention Google: Gemini 3.1 Pro Preview Custom Tools with the team's shared context - pooled credits, one chat, one memory.
Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.
Verdict
Best for
- Agentic workflows with multiple API calls
- Multimodal document processing with tool outputs
- Function calling across audio and video inputs
- Prototyping custom tool integrations
- Long-context tasks requiring structured responses
Strengths
The 1M token context window handles entire codebases or multi-hour transcripts without chunking. Multimodal support across five input types—including audio and video—gives it unusual flexibility for processing mixed-media datasets. The custom tools optimization means function calling and structured JSON outputs should be more reliable than general-purpose models. Pricing sits between budget and premium tiers, making it viable for production prototypes that need Google's infrastructure without Gemini Ultra costs.
Trade-offs
Preview status means no public benchmarks, limited documentation, and potential breaking changes as Google iterates. Tool-calling performance relative to GPT-4o or Claude 3.5 Sonnet remains unverified without published evals. The $12/Mtok output rate climbs quickly on verbose tool responses—budget carefully for multi-turn agentic loops. Multimodal capabilities may lag specialized models like GPT-4 Vision for pure image tasks, since this model prioritizes tool integration over single-modality excellence.
Specifications
- Provider: Google
- Category: llm
- Context length: 1,048,576 tokens
- Max output: 65,536 tokens
- Modalities: text, audio, image, video, file
- License: proprietary
- Released: 2026-02-25
Pricing
- Input: $2.00/Mtok
- Output: $12.00/Mtok
- Model ID: google/gemini-3.1-pro-preview-customtools
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
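To make the per-token rates concrete, here is a minimal sketch of the arithmetic, assuming the listed $2.00/$12.00 per Mtok rates apply uniformly to every token. The function names and the sample workload are hypothetical, and the sketch covers upstream cost only, not Switchy credits.

```python
# List prices from the Pricing section, in USD per million tokens.
INPUT_USD_PER_MTOK = 2.00
OUTPUT_USD_PER_MTOK = 12.00


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the upstream cost of one request at the listed rates."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


def loop_cost_usd(turns: int, input_tokens_per_turn: int,
                  output_tokens_per_turn: int) -> float:
    """Cost of a multi-turn loop, assuming identical token counts on every turn."""
    return turns * request_cost_usd(input_tokens_per_turn, output_tokens_per_turn)


if __name__ == "__main__":
    # Hypothetical workload: 10 agentic turns, 20k tokens in and 2k tokens out per turn.
    print(f"${loop_cost_usd(10, 20_000, 2_000):.2f}")  # prints $0.64
```

In this hypothetical loop the 2k output tokens account for more than a third of each turn's cost despite being a tenth of the input volume, which is why verbose tool responses deserve a budget of their own.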
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| | 1049k | $2.00/Mtok | $12.00/Mtok | — | — | — |
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Multi-Step Research Agent
You have access to search_web(), fetch_document(), and summarize_content() functions. Research the latest developments in quantum error correction, retrieve three recent papers, and synthesize key findings into a technical brief.
Video Analysis with Actions
Analyze this product demo video. Use extract_timestamps() to mark key feature demonstrations, then call generate_social_clips() to create three 15-second highlight reels for different platforms.
Codebase Documentation Generator
Read this 50-file Python codebase. Use parse_dependencies() to map module relationships, then call generate_markdown() to create API documentation with usage examples for each public function.
Audio Transcription Pipeline
Transcribe this 2-hour meeting audio. Use identify_speakers() to label participants, extract_action_items() to pull tasks, then call update_database() and send_notifications() to assigned team members.
Multimodal Data Extraction
You have access to parse_pdf(), ocr_image(), and format_json() functions. Extract all financial figures from these quarterly reports (3 PDFs, 2 scanned images) and return a structured dataset with company, quarter, revenue, and profit fields.
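The starter prompts above assume the named functions already exist on your side. As a rough sketch of what that could look like, the snippet below declares two of them in a generic JSON-schema style and routes a model-requested call to local code. Everything here is an assumption for illustration: the function bodies are stubs, the declaration envelope is not any provider's exact format, and the `{"name": ..., "args": ...}` call shape is hypothetical.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical stub implementations; real ones would call a PDF parser or OCR service.
def parse_pdf(path: str) -> str:
    return f"text extracted from {path}"


def ocr_image(path: str) -> str:
    return f"text recognized in {path}"


def _path_tool(name: str, description: str) -> Dict[str, Any]:
    """Build a generic JSON-schema-style declaration for a tool taking one path."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }


# Declarations you would send alongside the prompt (envelope format is assumed).
TOOL_DECLARATIONS = [
    _path_tool("parse_pdf", "Extract text from a PDF file."),
    _path_tool("ocr_image", "Recognize text in a scanned image."),
]

# Maps declared tool names to the local functions that implement them.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "parse_pdf": parse_pdf,
    "ocr_image": ocr_image,
}


def dispatch(tool_call: Dict[str, Any]) -> str:
    """Run one tool call of the assumed form {"name": ..., "args": {...}}."""
    name = tool_call["name"]
    if name not in TOOL_REGISTRY:
        # Report unknown tools back to the model instead of raising.
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOL_REGISTRY[name](**tool_call.get("args", {}))
    return json.dumps({"name": name, "result": result})
```

Returning an error payload for unknown tool names, rather than raising, gives the model a chance to pick a declared tool on the next turn.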
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Analyze this quarterly sales spreadsheet and create a Python script that identifies underperforming regions, then draft an email to the regional managers explaining the findings.
This example would demonstrate the model processing a multi-sheet Excel file, extracting sales data across regions, and producing both a working Python analysis script and a professionally-toned email. The script would include pandas operations for aggregation and filtering, with inline comments explaining the logic. The email would reference specific numbers from the analysis, maintain appropriate business tone, and suggest actionable next steps—all generated in a single conversational turn without requiring the user to switch contexts between coding and writing tasks.
Showcases the 1M-token context window handling large file uploads and the multimodal capability to reason across structured data and natural language simultaneously. The custom tools feature would allow this model to execute the Python script and verify results before finalizing the email, though the $12/Mtok output rate makes iterative refinement expensive for budget-conscious teams.
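For a sense of the aggregation-and-filtering step described above, here is a small sketch. It swaps pandas for the standard library so it runs anywhere, and the column names, sample rows, and the "below 80% of the mean regional total" rule are all assumptions chosen for illustration, not output from the model.

```python
import csv
import io
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Hypothetical data standing in for the spreadsheet; the columns are assumed.
SAMPLE_CSV = """region,month,sales
North,Jan,120
North,Feb,130
South,Jan,60
South,Feb,50
East,Jan,100
East,Feb,110
"""


def underperforming_regions(csv_text: str, threshold: float = 0.8) -> List[str]:
    """Return regions whose total sales fall below threshold * mean regional total."""
    totals: Dict[str, float] = defaultdict(float)
    # Aggregate: sum sales per region.
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] += float(row["sales"])
    # Filter: keep regions under the cutoff, sorted for stable output.
    cutoff = threshold * mean(list(totals.values()))
    return sorted(region for region, total in totals.items() if total < cutoff)


if __name__ == "__main__":
    print(underperforming_regions(SAMPLE_CSV))  # prints ['South']
```

With the sample rows, the regional totals are 250, 110, and 210, so the cutoff is 152 and only South falls below it.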
Watch this 8-minute product demo video and generate a technical FAQ covering the five most complex features shown, with timestamps linking back to relevant moments.
This example would show the model ingesting video content, identifying key technical demonstrations (API integration sequences, configuration workflows, error-handling scenarios), and producing a structured FAQ document. Each answer would include a timestamp reference like "See 3:42 for the webhook setup process" and explain the feature in clearer terms than the presenter used. The model would distinguish between features that were clearly explained versus those that need additional clarification, organizing answers by complexity rather than chronological order.
Highlights native video understanding without requiring transcription preprocessing. The model's ability to cross-reference visual demonstrations with spoken explanations makes it useful for documentation teams working from recorded content. However, without public benchmarks, accuracy on domain-specific technical content (medical devices, financial systems) remains unvalidated compared to specialized alternatives.
I'm debugging a React app where the useEffect hook fires twice on mount in development. Here's my component code, console logs, and a screenshot of the Network tab showing duplicate API calls. What's happening?
This example would demonstrate the model analyzing code, log output, and a browser screenshot together to diagnose the issue. The response would explain React 18's Strict Mode behavior causing intentional double-mounting in development, reference the specific lines in the provided code where side effects occur, and point to the duplicate POST requests visible in the Network tab screenshot. It would then provide a corrected code snippet using a ref to prevent duplicate API calls, explaining why this pattern works and when the double-mount behavior disappears in production builds.
Demonstrates multimodal debugging where code, logs, and visual browser state must be synthesized. The audio input capability (not used here) would theoretically allow developers to verbally walk through reproduction steps. The $2/Mtok input pricing makes uploading screenshots and large codebases economical, though the model's code-specific performance relative to Claude 3.5 Sonnet or GPT-4 remains unclear without published coding benchmarks.
Use-case deep-dives
When support tickets arrive as screenshots, PDFs, and voice clips
A 12-person SaaS support team gets 200+ tickets daily, and half arrive as phone recordings, annotated screenshots, or scanned contracts. Gemini 3.1 Pro Preview Custom Tools accepts all four of the input types involved here (text, audio, image, file) in a single API call, so you skip the pre-processing pipeline that costs you 90 seconds per mixed-media ticket. At $2/$12 per Mtok, a 500-token triage prompt with a 2MB image costs roughly $0.03, which pencils out if your team's hourly rate exceeds $40. The 1M-token context window means you can dump an entire 40-page PDF contract into the prompt without chunking. If your tickets are text-only or you're processing under 50/day, Claude Sonnet 4 at $3/$15 will save you the custom-tools learning curve.
Diffing 300-page agreements without chunking or retrieval overhead
A 4-attorney firm reviews merger agreements that average 80,000 words each. Gemini 3.1 Pro Preview's 1M-token window fits two full contracts plus a 5,000-word diff prompt in one pass, with no vector store, no retrieval step, and no context-window juggling. You paste both PDFs (via file input), ask for clause-level changes, and get a structured response in 15-20 seconds. At $2 input per Mtok, a 200k-token comparison costs $0.40 in input fees, versus $60 of associate time for the same first-pass review. The trade-off: no public benchmarks yet, so you'll want to spot-check the first 20 diffs against human review before trusting it on high-stakes deals. If your agreements are under 30 pages, GPT-4o at $2.50/$10 is cheaper on output and has published MMLU scores.
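Whether two contracts fit in one pass is easy to sanity-check before sending anything. The sketch below uses the context and output limits from the Specifications section, and assumes a rough ratio of 4/3 tokens per English word, which is an approximation rather than a measured value for this model's tokenizer. It also conservatively assumes the output budget shares the context window.

```python
from typing import List

CONTEXT_LIMIT_TOKENS = 1_048_576  # from the Specifications section
MAX_OUTPUT_TOKENS = 65_536        # from the Specifications section
TOKENS_PER_WORD = 4 / 3           # ASSUMPTION: rough English average, not measured


def estimate_tokens(words: int) -> int:
    """Estimate the token count for a document of the given word count."""
    return round(words * TOKENS_PER_WORD)


def fits_in_context(word_counts: List[int],
                    reserve_output: int = MAX_OUTPUT_TOKENS) -> bool:
    """Check that all documents plus a reserved output budget fit in the window."""
    total_input = sum(estimate_tokens(words) for words in word_counts)
    return total_input + reserve_output <= CONTEXT_LIMIT_TOKENS


if __name__ == "__main__":
    # Two 80,000-word contracts plus a 5,000-word diff prompt.
    docs = [80_000, 80_000, 5_000]
    print(sum(estimate_tokens(w) for w in docs))  # prints 220001
    print(fits_in_context(docs))                  # prints True
```

Under these assumptions the comparison lands near 220k tokens, in the same range as the 200k-token figure quoted above, and leaves most of the window unused.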
Scanning user-uploaded videos for policy violations at scale
A 20-person social app moderates 800 user videos daily (30-90 seconds each). Gemini 3.1 Pro Preview ingests video natively, so you skip the frame-extraction step that adds 12 seconds and $0.08 per video in your current pipeline. You send the raw MP4, a 200-token policy checklist, and get a violation report in one call. At $2/$12 per Mtok and an average 8,000-token video encoding, the input side of each moderation costs roughly $0.016 (8,200 tokens at $2/Mtok) plus output fees for the report, against the $0.08 per video you currently spend on frame extraction alone. The 1M-token context means you can batch-review a user's last 10 uploads in a single prompt to catch repeat offenders. If you're under 200 videos/day, GPT-4o Vision at $2.50/$10 is simpler to integrate and has stronger benchmark coverage for edge cases.
Frequently asked
Is Gemini 3.1 Pro Preview good for complex reasoning tasks?
Probably, but it is unverified. The 3.1 Pro series targets advanced reasoning and multimodal understanding, yet without public benchmarks you're betting on Google's track record with the Gemini line. The 1M token context window handles long documents well, and custom tools support means it can orchestrate complex workflows. If you need proven scores, wait for independent evals or test it yourself on your specific use case.
Is Gemini 3.1 Pro Preview cheaper than GPT-4o or Claude Sonnet?
At $2 input and $12 output per Mtok, it sits between budget and premium tiers. GPT-4o runs around $2.50/$10, Claude Sonnet 4 is $3/$15. Gemini 3.1 Pro Preview is cheaper on output than Sonnet but pricier than 4o. For heavy generation workloads, 4o wins on cost. For multimodal tasks with moderate output, this pricing is competitive.
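Which model is cheapest depends on the ratio of output to input tokens. The sketch below compares the three rate pairs quoted in this answer; the prices are taken from the text as-is rather than verified against provider price lists, and the dictionary keys are illustrative labels, not official model IDs.

```python
from typing import Dict, Tuple

# (input, output) in USD per million tokens, as quoted in the answer above.
PRICES: Dict[str, Tuple[float, float]] = {
    "gemini-3.1-pro-preview": (2.00, 12.00),
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet-4": (3.00, 15.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request for the given model at the quoted rates."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def cheapest(input_tokens: int, output_tokens: int) -> str:
    """Return the label of the lowest-cost model for this token mix."""
    return min(PRICES, key=lambda m: cost_usd(m, input_tokens, output_tokens))


if __name__ == "__main__":
    print(cheapest(100_000, 10_000))  # input-heavy: prints gemini-3.1-pro-preview
    print(cheapest(10_000, 10_000))   # generation-heavy: prints gpt-4o
```

Setting the two cost formulas equal gives the break-even point: at these rates Gemini 3.1 Pro Preview undercuts GPT-4o only while output tokens stay below a quarter of input tokens, and it is cheaper than Claude Sonnet 4 at any mix.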
Can Gemini 3.1 Pro Preview handle video and audio inputs natively?
Yes, it accepts text, audio, image, video, and file inputs directly. This makes it useful for transcription-plus-analysis workflows or video content moderation without preprocessing. The 1M token context means you can feed long videos or multiple files in one request. Just watch output token costs at $12/Mtok if you're generating summaries of hour-long content.
How does Gemini 3.1 Pro Preview compare to Gemini 2.0 Flash?
The 3.1 Pro series is Google's reasoning-focused flagship; 2.0 Flash prioritizes speed and cost. If you need multimodal understanding with custom tool orchestration, 3.1 Pro Preview is the pick. If you're running high-volume chat or simple classification, Flash's lower latency and price make more sense. The 3.1 line trades speed for capability depth.
Should I use Gemini 3.1 Pro Preview for production chatbots?
Only if you need the multimodal features or custom tools integration. The $12/Mtok output cost adds up fast in conversational apps. For text-only chat, GPT-4o offers lower output pricing at $10/Mtok, and both it and Claude Sonnet 4 offer better benchmark transparency. Use this model when you're processing uploaded files, images, or video alongside chat, and the custom tools feature solves a real orchestration problem.