LLMbytedance

ByteDance: UI-TARS 7B

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it builds upon the UI-TARS framework with reinforcement...

Anyone in the Space can @-mention ByteDance: UI-TARS 7B with the team's shared context - pooled credits, one chat, one memory.

All models

Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.

Verdict

UI-TARS 7B is ByteDance's specialized model for understanding and reasoning about user interfaces. At $0.10/$0.20 per Mtok with vision support, it undercuts GPT-4o and Claude Sonnet on price while targeting UI-specific tasks like element detection, layout analysis, and accessibility auditing. The 7B parameter count keeps latency low for real-time applications. Trade-off: no public benchmarks yet, so performance relative to general-purpose vision models remains unproven. Reach for this when you need fast, cost-effective UI analysis and can tolerate early-stage model risk.

Best for

UI element detection and classification
Accessibility compliance checking on screenshots
Layout analysis for design systems
Real-time interface testing automation
Cost-sensitive vision tasks under 128K tokens

Strengths

The 7B parameter size delivers sub-second inference for UI tasks where GPT-4o would be overkill. Vision support at $0.10 input makes it 5-10x cheaper than Claude Sonnet 4 for screenshot analysis at scale. The 128K context window handles multi-screen flows or dense design specs in a single call. ByteDance's domain focus suggests training data weighted toward interface understanding rather than general vision.

Trade-offs

No public benchmarks means you're flying blind on accuracy versus GPT-4o Vision or Claude Sonnet 4.5 on UI tasks. The 7B size likely struggles with complex reasoning chains or nuanced design critique compared to 70B+ models. Proprietary license locks you into ByteDance's infrastructure. Early release status means API stability and feature completeness are question marks — expect breaking changes and limited documentation.

Specifications

Provider: bytedance
Category: llm
Context length: 128,000 tokens
Max output: 2,048 tokens
Modalities: image, text
License: proprietary
Released: 2025-07-22

Pricing

Input: $0.10/Mtok
Output: $0.20/Mtok
Model ID: bytedance/ui-tars-1.5-7b

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Seats5 peopleMessages / seat / day80Avg turn size2 ktokOutput share30 %

Estimated monthly spend

$2.29

17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.

Providers

Provider	Context	Input	Output	P50 latency	Throughput	30d uptime
bytedance	128k	$0.10/Mtok	$0.20/Mtok	—	—	—

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Detect Clickable Elements

Analyze this screenshot and list every clickable element. For each, provide: element type (button, link, input), visible text or label, and approximate position (top-left, center, etc.).

Open in a Space →

Check WCAG Compliance

Review this UI screenshot for WCAG 2.1 AA violations. Flag issues with color contrast, missing alt text indicators, keyboard navigation hints, and touch target sizing. Prioritize by severity.

Open in a Space →

Compare Layout Versions

I'm providing two screenshots of the same interface. Identify all layout differences: moved elements, spacing changes, added or removed components, and typography shifts. Focus on structural changes, not color.

Open in a Space →

Extract Design Tokens

From this screenshot, extract design tokens: font sizes, font weights, spacing values between elements, border radius values, and the color palette used. Format as a structured list.

Open in a Space →

Generate Test Selectors

For each interactive element in this screenshot, suggest a robust CSS selector or XPath that a test automation script could use. Prefer data attributes or semantic HTML over brittle class names.

Open in a Space →

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Analyze this wireframe screenshot and suggest three specific improvements to the navigation hierarchy that would reduce cognitive load for first-time users.

Output

Looking at your wireframe, I can identify three navigation improvements: 1. **Consolidate the secondary menu** — You currently have 8 items in the top bar. Group 'Settings', 'Profile', and 'Help' under a single account dropdown to reduce visual clutter. 2. **Add breadcrumb navigation** — The nested structure (Projects > Templates > Details) isn't visible to users. A breadcrumb trail below the header would provide spatial orientation. 3. **Differentiate primary actions** — The 'Create New' and 'Import' buttons have equal visual weight. Make 'Create New' the primary CTA with higher contrast, since it's your main conversion path. These changes maintain your information architecture while making the interface more scannable for new users.

Notes

This example demonstrates UI-TARS 7B's multimodal analysis capability — processing both visual layout and UX principles. The 128K context window supports analyzing multiple screens in one conversation. However, without public benchmarks, it's unclear how its UI-specific reasoning compares to general-purpose vision models on complex interface tasks.

Prompt

I'm building a design system component library. Generate TypeScript interfaces for a Card component that supports three variants (elevated, outlined, filled) and optional header/footer slots.

Output

```typescript type CardVariant = 'elevated' | 'outlined' | 'filled'; interface CardSlot { content: React.ReactNode; className?: string; } interface CardProps { variant: CardVariant; children: React.ReactNode; header?: CardSlot; footer?: CardSlot; elevation?: 1 | 2 | 3; // Only applies to 'elevated' variant borderColor?: string; // Only applies to 'outlined' variant className?: string; onClick?: () => void; } ``` This structure keeps variant-specific props optional while enforcing the base requirements. The `CardSlot` interface ensures consistent header/footer APIs across your component library.

Notes

Shows the model's ability to generate design-system-aware code with proper TypeScript patterns. The output balances flexibility with type safety. At $0.20/Mtok output, this is 2× the cost of GPT-4o, so batch processing design system documentation would add up quickly compared to cheaper code-focused alternatives.

Prompt

Explain the accessibility implications of using CSS `display: contents` on a navigation landmark, and provide an alternative approach that preserves semantic structure.

Output

`display: contents` removes an element from the accessibility tree in most browsers, which breaks landmark navigation for screen reader users. When you apply it to a `<nav>` element, assistive technology loses the 'navigation' role entirely — users can't jump to that landmark with keyboard shortcuts. **Alternative approach:** Instead of removing the `<nav>` from the box model, use flexbox or grid on its parent container and make the `<nav>` a direct participant: ```css .parent { display: flex; } .nav { /* nav stays in DOM and a11y tree */ } ``` If you need the nav's children to behave as siblings to other elements, restructure your HTML rather than using `display: contents`. Semantic structure should drive layout decisions, not the reverse.

Notes

Highlights UI-TARS 7B's grasp of the intersection between CSS implementation and accessibility standards — a niche but critical area for frontend teams. The explanation is technically precise without requiring prior context. The 128K window means you could paste entire stylesheets for holistic accessibility audits in a single prompt.

Use-case deep-dives

Visual bug triage for mobile apps

When UI-TARS handles screenshot-to-ticket workflows under budget

A 4-person mobile dev shop gets 30-40 bug reports daily, half with screenshots showing layout breaks or color mismatches. UI-TARS 7B at $0.10/$0.20 per Mtok processes each screenshot plus 2-3K tokens of context (device info, reproduction steps) for under $0.01 per ticket. The 128K window handles batch processing: drop 20 screenshots with their metadata into one prompt, get structured Jira-ready output in 8 seconds. No public benchmarks exist yet, so you're flying blind on accuracy versus GPT-4V or Claude 3.5 Sonnet—but if your triage load is predictable and you can afford a week testing output quality against your existing flow, the 5-10x cost savings versus vision-capable frontier models pays for the experiment. Best fit if you're processing 500+ visual tickets monthly and can build a feedback loop to catch misclassifications early.

Multilingual product catalog enrichment

UI-TARS for image-text SKU tagging when volume beats precision

An 8-person e-commerce team maintains 12,000 SKUs across English, Spanish, and Mandarin listings. Each product has 3-5 photos; the goal is auto-generating alt text, color/material tags, and category suggestions to feed the CMS. UI-TARS 7B's image+text modality handles the photo analysis, and the 128K context window lets you batch 40-50 products per call (each product = 4 images + existing metadata). At $0.10 input, processing the full catalog costs roughly $30-40 versus $200+ on GPT-4V. The trade-off: ByteDance hasn't published accuracy numbers, so you'll need a human spot-check on 5-10% of output. If your catalog refresh is quarterly and you can tolerate 8-12% tag errors that get caught in QA, UI-TARS delivers fast, cheap enrichment. Skip it if product liability or brand consistency demands frontier-model precision.

Internal document Q&A with charts

When UI-TARS covers mixed-media knowledge base queries at scale

A 15-person ops team fields 200 questions weekly about internal SOPs stored as PDFs with embedded flowcharts, tables, and diagrams. UI-TARS 7B ingests the full document (often 20-30 pages, 40K tokens) plus the chart images in one prompt, then answers procedural questions like 'What's the approval path for vendor contracts over $50K?' The 128K window means no chunking or retrieval layer for 90% of docs. At $0.10/$0.20 per Mtok, each query costs $0.008-0.015 versus $0.04-0.06 on Claude 3.5 Sonnet. The risk: no benchmarks means you're trusting ByteDance's internal evals. Run a 2-week pilot on 50 real questions, compare answers to your current Slack search, and measure time-to-resolution. If accuracy clears 85% and you're burning 10+ hours weekly on repeat questions, UI-TARS pays for itself in month one.

Frequently asked

Is UI-TARS 7B good for UI understanding and automation tasks?

Yes, that's its core purpose. UI-TARS 7B is built specifically for understanding user interfaces from screenshots and text. It can parse layouts, identify interactive elements, and reason about UI workflows. The 7B parameter count keeps it fast enough for real-time automation while the 128K context window handles complex multi-screen flows without truncation.

Is UI-TARS 7B cheaper than GPT-4o for UI automation?

Significantly cheaper. At $0.10/$0.20 per million tokens, UI-TARS costs roughly 5-10x less than GPT-4o for equivalent tasks. For high-volume UI testing or RPA workflows processing thousands of screenshots daily, this price difference compounds fast. The trade-off is narrower capabilities — UI-TARS won't write essays or solve math problems like GPT-4o.

Can UI-TARS 7B handle mobile app screenshots reliably?

It should, given the image+text modality support, but without public benchmarks we can't verify accuracy on iOS vs Android layouts or complex mobile gestures. Test it on your specific app designs before committing. The 128K context helps with multi-step mobile flows, but you'll need to validate element detection rates yourself since ByteDance hasn't published UI-specific accuracy numbers.

How does UI-TARS 7B compare to GPT-4V for screenshot analysis?

UI-TARS is purpose-built for UI tasks while GPT-4V is a general vision model. UI-TARS likely understands button hierarchies and form structures better, but GPT-4V handles broader visual reasoning. The 10x price advantage and faster inference from the smaller 7B size make UI-TARS better for dedicated automation pipelines. Use GPT-4V when you need general image understanding beyond UI elements.

Should I use UI-TARS 7B for web scraping and testing automation?

Yes, if you're building visual scrapers or end-to-end test suites that need to understand page layouts without brittle CSS selectors. The image modality lets it adapt to UI changes that break traditional DOM-based scrapers. Pair it with Playwright or Selenium for execution. The 128K window handles entire user journeys in one prompt, reducing the need for complex state management across API calls.