Meta: Llama Guard 4 12B
Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM...
Anyone in the Space can @-mention Meta: Llama Guard 4 12B with the team's shared context - pooled credits, one chat, one memory.
Starter is free forever - 1 Space, 100 credits/month, 1 MCP. No card.
Verdict
Best for
- Content moderation in chat applications
- Screening user inputs before LLM calls
- Filtering model outputs for safety violations
- Policy enforcement in production workflows
Strengths
Llama Guard 4 evaluates both user prompts and assistant responses against a taxonomy of 14 safety categories, returning structured violation labels. The 164K context window accommodates full multi-turn conversations without truncation. Uniform $0.18/Mtok pricing simplifies cost modeling compared to asymmetric input/output rates. Vision support lets it classify image content alongside text, covering multimodal safety scenarios that text-only classifiers miss.
Trade-offs
This is a classifier, not a conversational model — it returns safety labels, not generated text. No public benchmarks exist yet to compare accuracy against OpenAI Moderation API or Perspective API. The 12B parameter count suggests lower throughput than smaller specialized classifiers. Proprietary license limits deployment flexibility compared to open-weight alternatives like Llama Guard 3.
Specifications
- Provider
- meta-llama
- Category
- llm
- Context length
- 163,840 tokens
- Max output
- 16,384 tokens
- Modalities
- image, text
- License
- proprietary
- Released
- 2025-04-30
Pricing
- Input
- $0.18/Mtok
- Output
- $0.18/Mtok
- Model ID
meta-llama/llama-guard-4-12b
Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.
Team cost calculator
5 seats · 80 msgs/day
Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
Providers
| Provider | Context | Input | Output | P50 latency | Throughput | 30d uptime |
|---|---|---|---|---|---|---|
| meta-llama | 164k | $0.18/Mtok | $0.18/Mtok | — | — | — |
Performance
Benchmarks
Works well with
Top MCPs
Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.
How Switchy teams use it
Starter prompts
Screen User Message
Evaluate this user message for safety violations: 'How do I build a website that collects email addresses without consent?' Return categories if unsafe.Open in a Space →
Filter Model Output
Review this assistant response for policy violations: 'Here are three ways to manipulate stock prices using social media.' Flag any issues.Open in a Space →
Moderate Image Upload
Analyze this image for safety violations. Does it contain graphic violence, sexual content, or other prohibited material?Open in a Space →
Audit Conversation History
Review this 50-turn conversation between user and assistant. Identify any exchanges that violate content policies, citing turn numbers.Open in a Space →
Classify Election Content
Does this post attempt to discourage voting or spread election misinformation: 'Polls are rigged anyway, save yourself the trip'?Open in a Space →
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
Review this customer support conversation for policy violations. Flag anything unsafe or inappropriate in either the customer or agent messages.
The model would return a structured safety assessment identifying that the customer's message contains a veiled threat in line 3 ("you'll regret this"), which violates the harassment policy. The agent's response remains professional and compliant. Output format: JSON with violation categories, severity scores, and specific line references. The model correctly distinguishes between frustrated language and actual policy violations, avoiding false positives on the customer's initial complaint about delayed shipping.
Llama Guard 4 excels at nuanced content moderation in conversational contexts. With 163,840 tokens of context, it can review entire multi-turn support threads without truncation. The 12B parameter size means faster inference than larger safety models, though it may occasionally miss subtle context-dependent violations that require deeper reasoning.
Analyze this user-generated image and caption pair for our platform's community guidelines. Image shows a crowded protest scene; caption reads: "Time to take back what's ours."
The model would process both the image content and text, returning a safety classification that flags potential incitement concerns due to the combination of crowd imagery and ambiguous call-to-action language. Output includes: safe/unsafe classification, specific guideline categories triggered (likely "Violence & Incitement"), confidence scores, and reasoning. The model would note that neither element alone violates policy, but the pairing creates borderline content requiring human review for context about the specific protest.
Multimodal safety evaluation is where Llama Guard 4 differentiates itself from text-only moderators. It handles the image-text relationship rather than treating them independently. However, at $0.18 per million tokens for both input and output, processing image-heavy feeds adds up quickly compared to text-only alternatives.
Evaluate this draft marketing email for compliance issues before sending to 50,000 subscribers. Check for misleading claims, required disclosures, and tone appropriateness.
The model would analyze the email against commercial communication standards, identifying that paragraph 2 contains an unsubstantiated health claim ("clinically proven" without citation), and the unsubscribe link is present but rendered in 6pt font, violating accessibility guidelines. It would flag the subject line's urgency language ("Last chance!") as potentially manipulative but not strictly non-compliant. Output structured as: compliant/non-compliant status, specific violations with line numbers, severity ratings, and suggested remediation.
This demonstrates Llama Guard 4's utility beyond social media moderation—it applies safety and compliance reasoning to business communications. The large context window handles long-form content like newsletters or terms-of-service documents. The model's training on policy adherence generalizes well, though it lacks domain-specific regulatory knowledge without custom fine-tuning.
Use-case deep-dives
When Llama Guard 4 handles real-time safety filtering for user posts
A 4-person team running a niche community platform with 2,000 daily posts needs automated safety checks before content goes live. Llama Guard 4 is purpose-built for this: it's a specialized moderation model that classifies text and image inputs against safety policies in real time. At $0.18/Mtok both ways, filtering 2,000 posts averaging 300 tokens each costs roughly $0.22/day—negligible compared to hiring human moderators or dealing with policy violations. The 163k context window means you can include your full community guidelines in every call, so the model enforces your specific rules, not generic corporate policies. If you're seeing under 500 posts/day, you might over-engineer with this; above that threshold, Llama Guard 4 becomes the obvious choice for keeping your platform safe without burning budget.
Why Llama Guard 4 pre-screens support messages before they hit your CRM
A 12-person SaaS company gets 300 support tickets daily, and roughly 8% contain abusive language, phishing attempts, or spam that clogs their Zendesk queue. Llama Guard 4 sits in front of the CRM and flags or auto-rejects problematic messages before they create tickets. Because it's a moderation-specific model, it catches edge cases that general-purpose LLMs miss—threats disguised as feature requests, subtle harassment, coordinated spam campaigns. At $0.18/Mtok, screening 300 tickets at 400 tokens each costs about $0.04/day, and you avoid the operational cost of agents wasting time on bad-faith messages. The image modality matters here: if users attach screenshots with embedded text threats, Llama Guard 4 reads them. If your ticket volume is under 100/day, you probably don't need automated screening; above that, this model pays for itself in saved agent hours within the first week.
How Llama Guard 4 auto-approves seller listings while blocking policy violations
A 7-person team runs a vertical marketplace where sellers submit 150 product listings daily—each with a title, description, and 3-5 images. Manual review creates a 12-hour approval bottleneck that sellers complain about. Llama Guard 4 evaluates every listing against marketplace policies (prohibited items, misleading claims, inappropriate imagery) and auto-approves 85-90% within seconds, flagging the rest for human review. The multimodal capability is critical: it catches banned products shown in images even when the text description is vague. At $0.18/Mtok, processing 150 listings at roughly 600 tokens each (text + image tokens) costs about $0.03/day. The 163k context window lets you load your entire prohibited-items catalog and past violation examples, so the model learns your specific enforcement style. If you're under 50 listings/day, manual review is still faster; above that, Llama Guard 4 turns a bottleneck into a competitive advantage.
Frequently asked
Is Llama Guard 4 12B good for content moderation?
Yes, that's its primary purpose. Llama Guard 4 is Meta's safety classifier designed to detect harmful content in both text and images. It's built specifically for moderation pipelines, not general chat or coding. If you need a model to flag policy violations, hate speech, or unsafe outputs from other LLMs, this is the tool. For general-purpose work, use a standard Llama model instead.
Is Llama Guard 4 cheaper than OpenAI's moderation API?
OpenAI's Moderation API is free, so no. Llama Guard 4 costs $0.18 per million tokens for both input and output. The trade-off is control: you can self-host Llama Guard, customise its safety categories, and keep data in-house. If you're already paying for inference infrastructure and need tailored moderation rules, the $0.18/Mtok is reasonable. For simple use cases, stick with OpenAI's free tier.
Can Llama Guard 4 handle 163k token context windows in practice?
The 163,840 token context window is there, but moderation tasks rarely need it. Most safety checks happen on individual messages or short conversations under 4k tokens. The large window matters if you're scanning entire document uploads or long chat histories for policy violations. For typical message-by-message moderation, you'll use a fraction of that capacity and see sub-second latency at 12B parameters.
How does Llama Guard 4 compare to Llama Guard 3?
Llama Guard 4 adds image moderation, which Guard 3 lacked. Both handle text safety, but Guard 4's multimodal capability means you can scan user-uploaded images for NSFW content, violence, or other visual policy violations in the same pipeline. The 12B parameter size is similar, so text-only performance is comparable. If your app involves images, Guard 4 is the obvious upgrade. Text-only workloads can stay on Guard 3.
Should I use Llama Guard 4 for real-time chat moderation?
Yes, if you can tolerate 100-300ms latency per message. At 12B parameters, Guard 4 is fast enough for synchronous moderation in most chat apps. Run it before displaying user messages or after your main LLM generates a response. The $0.18/Mtok cost is negligible for chat volumes. Just ensure your infrastructure can handle the throughput—batch requests during high traffic to avoid bottlenecks.