LLMgoogle

Google: Gemini 3.1 Pro Preview Custom Tools

Gemini 3.1 Pro Preview Custom Tools is a variant of Gemini 3.1 Pro that improves tool selection behavior by preventing overuse of a general bash tool when more efficient third-party...

Anyone in the Space can @-mention Google: Gemini 3.1 Pro Preview Custom Tools with the team's shared context - pooled credits, one chat, one memory.

All models

Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.

Verdict

Gemini 3.1 Pro Preview Custom Tools is Google's experimental model optimized for function calling and tool use workflows. It handles multimodal inputs—text, audio, image, video, files—across a 1M token context window at $2/$12 per Mtok. The custom tools designation signals tuning for structured outputs and API integrations, making it ideal for agentic systems that need to orchestrate external services. Trade-off: as a preview model, expect less stability and documentation than production releases. Reach for this when you're building tool-heavy agents and need Google's multimodal reach at mid-tier pricing.

Best for

  • Agentic workflows with multiple API calls
  • Multimodal document processing with tool outputs
  • Function calling across audio and video inputs
  • Prototyping custom tool integrations
  • Long-context tasks requiring structured responses

Strengths

The 1M token context window handles entire codebases or multi-hour transcripts without chunking. Multimodal support across five input types—including audio and video—gives it unusual flexibility for processing mixed-media datasets. The custom tools optimization means function calling and structured JSON outputs should be more reliable than general-purpose models. Pricing sits between budget and premium tiers, making it viable for production prototypes that need Google's infrastructure without Gemini Ultra costs.

Trade-offs

Preview status means no public benchmarks, limited documentation, and potential breaking changes as Google iterates. Tool-calling performance relative to GPT-4o or Claude 3.5 Sonnet remains unverified without published evals. The $12/Mtok output rate climbs quickly on verbose tool responses—budget carefully for multi-turn agentic loops. Multimodal capabilities may lag specialized models like GPT-4 Vision for pure image tasks, since this model prioritizes tool integration over single-modality excellence.

Specifications

Provider
google
Category
llm
Context length
1,048,576 tokens
Max output
65,536 tokens
Modalities
text, audio, image, video, file
License
proprietary
Released
2026-02-25

Pricing

Input
$2.00/Mtok
Output
$12.00/Mtok
Model ID
google/gemini-3.1-pro-preview-customtools

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Estimated monthly spend
$88.00
17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.

Providers

ProviderContextInputOutputP50 latencyThroughput30d uptime
google1049k$2.00/Mtok$12.00/Mtok

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Multi-Step Research Agent

You have access to search_web(), fetch_document(), and summarize_content() functions. Research the latest developments in quantum error correction, retrieve three recent papers, and synthesize key findings into a technical brief.
Open in a Space →

Video Analysis with Actions

Analyze this product demo video. Use extract_timestamps() to mark key feature demonstrations, then call generate_social_clips() to create three 15-second highlight reels for different platforms.
Open in a Space →

Codebase Documentation Generator

Read this 50-file Python codebase. Use parse_dependencies() to map module relationships, then call generate_markdown() to create API documentation with usage examples for each public function.
Open in a Space →

Audio Transcription Pipeline

Transcribe this 2-hour meeting audio. Use identify_speakers() to label participants, extract_action_items() to pull tasks, then call update_database() and send_notifications() to assigned team members.
Open in a Space →

Multimodal Data Extraction

You have access to parse_pdf(), ocr_image(), and format_json() functions. Extract all financial figures from these quarterly reports (3 PDFs, 2 scanned images) and return a structured dataset with company, quarter, revenue, and profit fields.
Open in a Space →

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Analyze this quarterly sales spreadsheet and create a Python script that identifies underperforming regions, then draft an email to the regional managers explaining the findings.

Output

This example would demonstrate the model processing a multi-sheet Excel file, extracting sales data across regions, and producing both a working Python analysis script and a professionally-toned email. The script would include pandas operations for aggregation and filtering, with inline comments explaining the logic. The email would reference specific numbers from the analysis, maintain appropriate business tone, and suggest actionable next steps—all generated in a single conversational turn without requiring the user to switch contexts between coding and writing tasks.

Notes

Showcases the 1M+ token context window handling large file uploads and the multimodal capability to reason across structured data and natural language simultaneously. The custom tools feature would allow this model to execute the Python script and verify results before finalizing the email, though latency at $12/Mtok output makes iterative refinement expensive for budget-conscious teams.

Prompt

Watch this 8-minute product demo video and generate a technical FAQ covering the five most complex features shown, with timestamps linking back to relevant moments.

Output

This example would show the model ingesting video content, identifying key technical demonstrations (API integration sequences, configuration workflows, error-handling scenarios), and producing a structured FAQ document. Each answer would include a timestamp reference like "See 3:42 for the webhook setup process" and explain the feature in clearer terms than the presenter used. The model would distinguish between features that were clearly explained versus those that need additional clarification, organizing answers by complexity rather than chronological order.

Notes

Highlights native video understanding without requiring transcription preprocessing. The model's ability to cross-reference visual demonstrations with spoken explanations makes it useful for documentation teams working from recorded content. However, without public benchmarks, accuracy on domain-specific technical content (medical devices, financial systems) remains unvalidated compared to specialized alternatives.

Prompt

I'm debugging a React app where the useEffect hook fires twice on mount in development. Here's my component code, console logs, and a screenshot of the Network tab showing duplicate API calls. What's happening?

Output

This example would demonstrate the model analyzing code, log output, and a browser screenshot together to diagnose the issue. The response would explain React 18's Strict Mode behavior causing intentional double-mounting in development, reference the specific lines in the provided code where side effects occur, and point to the duplicate POST requests visible in the Network tab screenshot. It would then provide a corrected code snippet using a ref to prevent duplicate API calls, explaining why this pattern works and when the double-mount behavior disappears in production builds.

Notes

Demonstrates multimodal debugging where code, logs, and visual browser state must be synthesized. The audio input capability (not used here) would theoretically allow developers to verbally walk through reproduction steps. The $2 input pricing makes uploading screenshots and large codebases economical, though the model's code-specific performance relative to Claude 3.5 Sonnet or GPT-4 remains unclear without published coding benchmarks.

Use-case deep-dives

Multi-format customer support triage

When support tickets arrive as screenshots, PDFs, and voice clips

A 12-person SaaS support team gets 200+ tickets daily—half arrive as phone recordings, annotated screenshots, or scanned contracts. Gemini 3.1 Pro Preview Custom Tools handles all four input types (text, audio, image, file) in a single API call, so you skip the pre-processing pipeline that costs you 90 seconds per mixed-media ticket. At $2/$12 per Mtok, a 500-token triage prompt with a 2MB image costs roughly $0.03, which pencils if your team's hourly rate exceeds $40. The 1M-token context window means you can dump an entire 40-page PDF contract into the prompt without chunking. If your tickets are text-only or you're processing under 50/day, Claude Sonnet 4 at $3/$15 will save you the custom-tools learning curve.

Long-context legal document comparison

Diffing 300-page agreements without chunking or retrieval overhead

A 4-attorney firm reviews merger agreements that average 80,000 words each. Gemini 3.1 Pro Preview's 1M-token window fits two full contracts plus a 5,000-word diff prompt in one pass—no vector store, no retrieval step, no context-window juggling. You paste both PDFs (via file input), ask for clause-level changes, and get a structured response in 15-20 seconds. At $2 input per Mtok, a 200k-token comparison costs $0.40, versus $60 of associate time for the same first-pass review. The trade-off: no public benchmarks yet, so you'll want to spot-check the first 20 diffs against human review before trusting it on high-stakes deals. If your agreements are under 30 pages, GPT-4o at $2.50/$10 is cheaper and has published MMLU scores.

Video content moderation pipeline

Scanning user-uploaded videos for policy violations at scale

A 20-person social app moderates 800 user videos daily (30-90 seconds each). Gemini 3.1 Pro Preview ingests video natively, so you skip the frame-extraction step that adds 12 seconds and $0.08 per video in your current pipeline. You send the raw MP4, a 200-token policy checklist, and get a violation report in one call. At $2/$12 per Mtok and an average 8,000-token video encoding, each moderation costs roughly $0.12 in API fees—half your current spend when you factor in the frame-extraction overhead. The 1M-token context means you can batch-review a user's last 10 uploads in a single prompt to catch repeat offenders. If you're under 200 videos/day, GPT-4o Vision at $2.50/$10 is simpler to integrate and has stronger benchmark coverage for edge cases.

Frequently asked

Is Gemini 3.1 Pro Preview good for complex reasoning tasks?

Yes, the 3.1 Pro series targets advanced reasoning and multimodal understanding. Without public benchmarks yet, you're betting on Google's track record with the Gemini line. The 1M token context window handles long documents well, and custom tools support means it can orchestrate complex workflows. If you need proven scores, wait for independent evals or test it yourself on your specific use case.

Is Gemini 3.1 Pro Preview cheaper than GPT-4o or Claude Sonnet?

At $2 input and $12 output per Mtok, it sits between budget and premium tiers. GPT-4o runs around $2.50/$10, Claude Sonnet 4 is $3/$15. Gemini 3.1 Pro Preview is cheaper on output than Sonnet but pricier than 4o. For heavy generation workloads, 4o wins on cost. For multimodal tasks with moderate output, this pricing is competitive.

Can Gemini 3.1 Pro Preview handle video and audio inputs natively?

Yes, it accepts text, audio, image, video, and file inputs directly. This makes it useful for transcription-plus-analysis workflows or video content moderation without preprocessing. The 1M token context means you can feed long videos or multiple files in one request. Just watch output token costs at $12/Mtok if you're generating summaries of hour-long content.

How does Gemini 3.1 Pro Preview compare to Gemini 2.0 Flash?

The 3.1 Pro series is Google's reasoning-focused flagship; 2.0 Flash prioritizes speed and cost. If you need multimodal understanding with custom tool orchestration, 3.1 Pro Preview is the pick. If you're running high-volume chat or simple classification, Flash's lower latency and price make more sense. The 3.1 line trades speed for capability depth.

Should I use Gemini 3.1 Pro Preview for production chatbots?

Only if you need the multimodal features or custom tools integration. The $12/Mtok output cost adds up fast in conversational apps. For text-only chat, GPT-4o or Claude Sonnet 4 offer better benchmark transparency and lower output pricing. Use this model when you're processing uploaded files, images, or video alongside chat, and the custom tools feature solves a real orchestration problem.

Data last verified 8 hours ago.Sources aggregated hourly to weekly. See docs/architecture/model-directory.md.