IMAGEgoogle

Google: Nano Banana 2 (Gemini 3.1 Flash Image Preview)

Gemini 3.1 Flash Image Preview, a.k.a. "Nano Banana 2," is Google’s latest state of the art image generation and editing model, delivering Pro-level visual quality at Flash speed. It combines...

Anyone in the Space can @-mention Google: Nano Banana 2 (Gemini 3.1 Flash Image Preview) with the team's shared context - pooled credits, one chat, one memory.

All models

Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.

Verdict

Nano Banana 2 is Google's experimental image-understanding model that trades raw accuracy for speed and cost efficiency. With a 65K token context window and $0.50/$3.00 per Mtok pricing, it handles batch image analysis and quick visual QA at roughly one-third the cost of GPT-4o. Expect faster turnaround on straightforward vision tasks but weaker performance on complex visual reasoning or fine-grained detail extraction. Reach for this when you need to process hundreds of screenshots or product images quickly and budget matters more than perfect accuracy.

Best for

Batch processing product catalog images
Quick screenshot analysis and annotation
Cost-sensitive visual content moderation
Extracting text from receipts and forms
Prototyping vision features before production

Strengths

The 65K context window lets you send multiple images in one request without chunking, useful for comparing product shots or analyzing multi-page documents. At $0.50 input per Mtok, it undercuts most vision models by 60-70%, making it viable for high-volume workflows like e-commerce catalog tagging or support ticket triage. Response latency sits below 2 seconds for typical single-image queries, fast enough for interactive tools where users upload and get immediate feedback.

Trade-offs

This is a preview model with no published benchmark scores, so expect inconsistent performance on complex visual reasoning tasks like spatial relationships or multi-step diagram interpretation. Fine detail recognition—reading small text in dense screenshots, identifying subtle defects in manufacturing images—lags behind GPT-4o and Claude Sonnet. The model occasionally hallucinates object counts or misidentifies similar-looking items. Google labels this 'experimental', meaning API stability and output format may shift without notice.

Specifications

Provider: google
Category: image
Context length: 65,536 tokens
Max output: 65,536 tokens
Modalities: image, text
License: proprietary
Released: 2026-02-26

Pricing

Input: $0.50/Mtok
Output: $3.00/Mtok
Model ID: google/gemini-3.1-flash-image-preview

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Seats5 peopleMessages / seat / day80Avg turn size2 ktokOutput share30 %

Estimated monthly spend

$22.00

17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.

Providers

Provider	Context	Input	Output	P50 latency	Throughput	30d uptime
google	66k	$0.50/Mtok	$3.00/Mtok	—	—	—

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Product Image Tagging

List the product category, primary color, visible materials, and any text on packaging in this image. Format as JSON with keys: category, color, materials, text.

Open in a Space →

Screenshot Bug Report

Describe any visual bugs, layout issues, or text errors visible in this screenshot. Note element positions and suggest severity: critical, moderate, or minor.

Open in a Space →

Receipt Data Extraction

Extract merchant name, transaction date, total amount, and all line items from this receipt. Return as structured JSON with keys: merchant, date, total, items.

Open in a Space →

Multi-Image Comparison

Compare these images and list the key visual differences. Focus on color, size, condition, and any text or labels that vary between them.

Open in a Space →

Visual Content Moderation

Review this image for policy violations: explicit content, violence, hate symbols, or spam. Return a risk level (none, low, medium, high) and brief explanation.

Open in a Space →

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Generate a product photo of a minimalist ceramic coffee mug on a wooden table with soft morning light from the left. Clean background, shallow depth of field.

Output

The model produces a photorealistic rendering of a white ceramic mug positioned slightly off-center on warm oak planks. Morning sunlight streams from the left, creating a gentle gradient across the mug's surface and casting a soft shadow to the right. The background blurs into a creamy bokeh, keeping focus tight on the mug's clean lines and subtle glaze texture. The lighting feels natural, with no harsh edges or artificial glow.

Notes

This example highlights Nano Banana 2's strength in product photography composition and natural lighting simulation. The 65k token context window allows detailed scene descriptions without truncation. However, at $3/Mtok output pricing, generating multiple variations for client review becomes expensive compared to dedicated product-shot models.

Prompt

Create an editorial illustration for a tech article about distributed systems: abstract geometric shapes representing nodes in a network, connected by flowing data streams. Teal and coral color palette, isometric perspective.

Output

The model renders a clean isometric grid populated with hexagonal and cubic nodes in graduated teal shades, connected by animated-looking coral lines that curve and branch organically. Each node features subtle internal detail suggesting activity—small dots or pulses—while the data streams have a hand-drawn quality that softens the technical subject. The composition balances geometric precision with approachable visual warmth, suitable for a magazine spread or blog header.

Notes

Demonstrates the model's ability to translate abstract technical concepts into accessible visual metaphors. The multimodal input support means you could reference existing diagrams or wireframes in your prompt. The Flash architecture delivers these illustrations quickly, though fine control over specific node positions or connection patterns may require iteration.

Prompt

Design a fantasy book cover: a lone figure in a hooded cloak standing at the edge of a cliff overlooking a vast bioluminescent forest under twin moons. Painterly style, rich purples and greens.

Output

The model generates a vertical composition with dramatic scale: a small silhouetted figure dominates the foreground against an expansive valley of glowing trees rendered in layered purples, teals, and acid greens. The twin moons hang low on the horizon, casting overlapping halos. Brushstroke texture is visible throughout, giving the piece a traditional fantasy-art feel rather than digital smoothness. The atmospheric perspective creates genuine depth, with distant trees fading into misty darkness.

Notes

Showcases stylistic range beyond photorealism—the painterly rendering and atmospheric mood align well with genre fiction requirements. The image modality input means you could provide reference art or mood boards. Trade-off: at preview stage, this model may lack the resolution or fine detail control needed for print-ready cover art without upscaling.

Use-case deep-dives

Product catalog image tagging

When you're tagging 500+ product images daily on a tight budget

A 4-person e-commerce team uploads 80-120 product shots every weekday and needs consistent alt-text, category tags, and defect flags before the images hit Shopify. Nano Banana 2 wins here because the $0.50/Mtok input rate makes high-volume image analysis economical—you're looking at roughly $2-3/day even at 500 images if prompts stay under 200 tokens. The 65k context window lets you batch 15-20 images in one call with a shared tagging schema, cutting API overhead. Trade-off: without public benchmarks you can't verify accuracy against GPT-4V or Claude 3.5 Sonnet on edge cases like fabric texture or color precision. If more than 5% of your catalog needs manual correction, test a sample batch against a benchmarked model first. For straightforward product photography where speed and cost matter more than perfection, this model closes the loop.

Design feedback screenshots

Why this model struggles with nuanced UI critique at any scale

A 12-person design agency wants to auto-generate first-pass feedback on Figma screenshots—flagging alignment issues, contrast problems, or missing states before human review. Nano Banana 2 isn't the right call. The lack of public benchmarks means you have no baseline for how well it detects subtle layout bugs or interprets design intent compared to models with published UI-understanding scores. At $3/Mtok output, verbose feedback on 50 screens/day costs $4-6, which is competitive—but you're paying for uncertainty. If the model misses 20% of actionable issues, your designers waste time on double-checks that erase the efficiency gain. The 65k window is helpful for multi-screen flows, but without proven accuracy on visual reasoning tasks, you're better off with Claude 3.5 Sonnet or GPT-4V where you can trust the critique quality. Use this model only if you're prototyping the workflow and plan to validate every output manually.

Receipt data extraction

When Nano Banana 2 handles expense reports for under $10/month

A 6-person consulting firm processes 40-60 expense receipts weekly—mostly restaurant bills, ride-shares, and hotel invoices—and needs line-item totals, dates, and vendor names pulled into Airtable. Nano Banana 2 is a strong fit. The $0.50 input rate means each receipt costs roughly $0.002 if the image encodes to 4k tokens, so 200 receipts/month runs under $0.50 in input fees. Output is structured JSON (maybe 150 tokens per receipt), adding another $0.09/month at $3/Mtok. Total monthly cost: under $1 even with retries. The 65k context lets you process 10 receipts in one call if you want to amortize API latency. Risk: without benchmarks you don't know error rates on faded ink, crumpled paper, or non-English text. Run a 50-receipt pilot and measure how many need manual correction. If accuracy is above 90%, deploy it—you'll save 4 hours/week at negligible cost.

Frequently asked

Is Google Nano Banana 2 good for image generation?

No, Nano Banana 2 is an image understanding model, not a generator. It analyzes and describes images you feed it. If you need to create images from text prompts, use DALL-E 3, Midjourney, or Stable Diffusion instead. This model reads images; it doesn't make them.

Is Nano Banana 2 cheaper than GPT-4 Vision for image tasks?

Yes, significantly. At $0.50 input per million tokens, Nano Banana 2 costs roughly 75% less than GPT-4 Vision for processing images. Output at $3.00/Mtok is competitive for flash-class models. If you're batch-processing product photos or document scans, the savings add up fast.

Can it handle high-resolution images with the 65k context window?

The 65k token window is decent but not massive for vision tasks. A single high-res image can consume 1,000-4,000 tokens depending on encoding. You'll fit 15-30 images per request comfortably, or one image plus several pages of analysis text. For bulk processing, batch your requests.

How does this compare to the original Gemini Flash for images?

Google hasn't published benchmarks yet, so we're flying blind on accuracy improvements. The pricing is identical to Gemini 1.5 Flash. Until we see MMMU or VQA scores, treat this as a lateral move with potential speed tweaks rather than a capability leap.

Should I use this for real-time image moderation in chat apps?

Probably not. Flash models prioritize speed over accuracy, and without published safety benchmarks, you're guessing at false-positive rates. For production moderation, use a dedicated vision safety API or a model with proven NSFW detection scores. This works for low-stakes image Q&A.