
Nous: Hermes 3 405B Instruct

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the...

Anyone in the Space can @-mention Nous: Hermes 3 405B Instruct with the team's shared context - pooled credits, one chat, one memory.



Verdict

Hermes 3 405B Instruct is Nous Research's flagship instruction-tuned model built on Llama 3.1 405B, optimized for function calling and agentic workflows. At $1/Mtok for both input and output, it undercuts GPT-4-class models while delivering strong reasoning and tool use. The 128K (131,072-token) context window handles most document tasks comfortably. Reach for this when you need reliable function calling at a fraction of the per-token cost of Claude or GPT-4, but expect slightly weaker performance on nuanced creative writing compared to frontier models.

Best for

Function calling and tool use workflows
Agentic systems with structured outputs
Cost-sensitive reasoning tasks under 128K tokens
Multi-turn conversations requiring context retention
JSON schema adherence and structured generation

Strengths

Hermes 3 405B excels at function calling and structured output generation, making it a strong choice for agentic applications where reliability matters more than raw creativity. The symmetric $1/Mtok pricing removes the usual input/output cost asymmetry, simplifying budget planning for retrieval-augmented or multi-turn workflows. Built on Llama 3.1 405B, it inherits solid reasoning capabilities while Nous Research's instruction tuning sharpens tool use and JSON adherence beyond the base model.

Trade-offs

Without public benchmark data, performance relative to GPT-4o or Claude Sonnet 4.5 remains unverified in standardized evals. Anecdotal reports suggest it trails frontier models on creative writing, nuanced tone control, and edge-case reasoning. The 128K context window, while adequate, falls short of Claude's 200K or Gemini's 1M+ for extreme long-context tasks. This listing marks the license as proprietary even though the model is a fine-tune of open-weight Llama 3.1 405B, so check the license terms before planning a self-hosted deployment.

Specifications

Provider: nousresearch
Category: llm
Context length: 131,072 tokens
Max output: 16,384 tokens
Modalities: text
License: proprietary
Released: 2024-08-16

Pricing

Input: $1.00/Mtok
Output: $1.00/Mtok
Model ID: nousresearch/hermes-3-llama-3.1-405b

Per-token prices show what the model costs upstream. On Switchy your team draws from one shared org credit pool - one plan, one balance for everyone.

Team cost calculator

Seats: 5 people
Messages / seat / day: 80
Avg turn size: 2 ktok
Output share: 30 %

Estimated monthly spend

$17.60

17.6M tokens / month
5 seats · 80 msgs/day

Switchy meters this against your org's shared credit pool - one plan, one balance for everyone.
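
The estimate above can be reproduced with a short calculation. This is a minimal sketch, not Switchy's actual metering logic: the function name is hypothetical, and the 22 working days per month is an assumption inferred because it reproduces the 17.6M-token figure (the page does not state it). Because input and output are both priced at $1.00/Mtok, the 30 % output share does not change the total.

```python
def estimate_monthly_spend(seats, msgs_per_seat_per_day, avg_turn_ktok,
                           output_share, input_price=1.00, output_price=1.00,
                           working_days=22):
    """Return (tokens_per_month, usd_per_month) for a team.

    Prices are USD per million tokens.
    working_days=22 is an assumption, not a documented Switchy setting.
    """
    tokens = seats * msgs_per_seat_per_day * avg_turn_ktok * 1_000 * working_days
    output_tokens = tokens * output_share
    input_tokens = tokens - output_tokens
    usd = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return tokens, round(usd, 2)


tokens, usd = estimate_monthly_spend(5, 80, 2, 0.30)
print(f"{tokens / 1e6:.1f}M tokens / month, ${usd:.2f}")
```

With a model that prices input and output differently, pass its own `input_price` and `output_price` and the output share starts to matter.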

Providers

Provider	Context	Input	Output	P50 latency	Throughput	30d uptime
nousresearch	131k	$1.00/Mtok	$1.00/Mtok	—	—	—

Performance

Performance snapshots are collected daily. Check back after the next ingestion run.

Benchmarks

Public benchmark scores are not available yet for this model. Check back after the next ingestion run.

Works well with

Top MCPs

Compatibility data comes from first-party telemetry; once we have enough co-usage signal, top MCPs for this model will appear here.

How Switchy teams use it

Not enough Spaces have used this model yet to share anonymised team stats. We wait for at least 50 distinct Spaces per week before publishing any aggregate.

Starter prompts

Multi-Step Research Agent

You have access to search_web() and summarize_document() functions. Research the latest developments in solid-state battery technology, summarize three key papers, then synthesize findings into a 200-word executive brief.


Structured Data Extraction

Extract all mentioned companies, funding amounts, and dates from this press release into a JSON array with fields: {company, amount_usd, date_announced, investor_lead}. Ensure strict schema compliance.


Code Review with Tool Use

Review this Python function for bugs and style issues. Use run_pylint() and check_security() tools, then provide a prioritized list of fixes with severity ratings.


Long-Context Contract Analysis

Analyze this 40-page SaaS agreement for non-standard clauses in sections 8-15. Flag any liability caps below industry norms and unusual termination provisions.


Agentic Customer Support Router

You're a support router with access to check_account_status(), escalate_to_human(), and search_kb() functions. Handle this customer inquiry by checking their account, searching relevant docs, and deciding whether to escalate.


Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

Review this Python function for potential race conditions. It's meant to update a shared counter across multiple threads, but users report inconsistent totals.

Output

The function lacks thread synchronization. When multiple threads call `increment_counter()` simultaneously, they can read the same value before any writes complete, causing lost updates. The read-modify-write sequence isn't atomic. You need either a threading.Lock around the entire operation, or switch to `threading.local()` for per-thread counters that you sum later. If this counter feeds into financial calculations, the Lock approach is safer—local counters can miss updates if threads terminate unexpectedly. Also consider `multiprocessing.Value` with a lock if you're actually using processes, not threads.

Notes

This example highlights Hermes 3's concurrency reasoning: it identifies the specific race condition pattern and offers two solutions with trade-off analysis. However, without benchmark data, we can't verify its accuracy against models like GPT-4 or Claude on threading edge cases.
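
For readers who want to see the Lock approach from the output in code, here is a minimal sketch. The prompt does not show the original function, so the body of `increment_counter()`, the `run()` helper, and the thread counts are all hypothetical.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def increment_counter(times):
    """Add 1 to the shared counter `times` times, taking the lock per update."""
    global counter
    for _ in range(times):
        # Only one thread at a time may run the read-modify-write below.
        with counter_lock:
            counter += 1


def run(num_threads=8, times=10_000):
    """Reset the counter, run the worker threads, and return the final total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=increment_counter, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(run())
```

Holding the lock around the whole `counter += 1` statement is what matters. Locking only the read or only the write would leave the same lost-update window open.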

Prompt

Explain how transformer attention mechanisms work, but write it for a PM who needs to brief executives on why our API costs scale with input length.

Output

Transformers process text by having each word 'look at' every other word to understand context—that's the attention mechanism. The catch: if your input is 1,000 words, the model performs roughly 1,000 × 1,000 = 1 million comparisons. Double the input to 2,000 words, and you're at 4 million comparisons. This quadratic growth is why longer prompts cost more—the computational work isn't linear. For your exec brief: 'Every token we send compares itself to every other token. A 10-page document requires 100× more processing than a single page, which directly drives our per-token pricing.'

Notes

Demonstrates Hermes 3's ability to translate technical concepts into business context without losing accuracy. The 131K token context window makes it practical for analyzing full documents before explaining them. The response stays jargon-light while remaining technically correct—useful for cross-functional communication, though we lack head-to-head comparisons on explanation quality.
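
The quadratic growth in the output can be checked with a few lines of arithmetic. This is a simplified sketch: the function name is hypothetical, and it counts only pairwise comparisons in full self-attention, ignoring layers, heads, and every other cost in a real model.

```python
def attention_comparisons(num_tokens):
    """Pairwise comparisons when every token attends to every token: n * n."""
    return num_tokens * num_tokens


for n in (1_000, 2_000, 10_000):
    print(f"{n:>6} tokens -> {attention_comparisons(n):>12,} comparisons")

# Ten times the input means one hundred times the comparisons.
ratio = attention_comparisons(10_000) // attention_comparisons(1_000)
print(f"10x tokens -> {ratio}x comparisons")
```

Note that this describes compute, not the invoice. At the flat $1.00/Mtok rate listed on this page, the bill itself grows linearly with token count.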

Prompt

Generate a SQL query to find customers who made 3+ purchases in Q1 2024 but zero purchases in Q2 2024. Our schema has `orders(id, customer_id, order_date, amount)` and `customers(id, name, email)`.

Output

```sql
SELECT c.id, c.name, c.email
FROM customers c
WHERE c.id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
    GROUP BY customer_id
    HAVING COUNT(*) >= 3
)
AND c.id NOT IN (
    SELECT customer_id
    FROM orders
    WHERE order_date BETWEEN '2024-04-01' AND '2024-06-30'
);
```

This uses subqueries for clarity. For large datasets, a LEFT JOIN with `WHERE o2.id IS NULL` on the Q2 check would perform better; subqueries can trigger full table scans twice.

Notes

Shows structured query generation with performance awareness. The model provides working SQL and flags optimization opportunities, which matters at scale. The $1/$1 per-Mtok pricing makes it cost-competitive for batch analytics tasks. Without MMLU-Pro or HumanEval scores, though, we can't quantify its SQL correctness rate versus specialized code models.
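
The LEFT JOIN variant the output mentions is not spelled out, so here is one way it could look, run against an in-memory SQLite database. The sample rows and the `q1` alias are made up for illustration. Whether this form is actually faster depends on the database engine and its indexes, so treat the performance claim as something to measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     order_date TEXT, amount REAL);
INSERT INTO customers VALUES (1, 'Ana', 'ana@example.com'),
                             (2, 'Ben', 'ben@example.com'),
                             (3, 'Cy',  'cy@example.com');
-- Ana: 3 orders in Q1, none in Q2. Ben: 3 in Q1, 1 in Q2. Cy: 2 in Q1.
INSERT INTO orders (customer_id, order_date, amount) VALUES
  (1, '2024-01-05', 10), (1, '2024-02-10', 20), (1, '2024-03-15', 30),
  (2, '2024-01-07', 10), (2, '2024-02-11', 20), (2, '2024-03-20', 30),
  (2, '2024-04-02', 40),
  (3, '2024-01-09', 10), (3, '2024-03-01', 20);
""")

# Anti-join: keep Q1 qualifiers for whom no Q2 order row matches.
# The Q2 date filter must sit in the ON clause, not in WHERE,
# otherwise the unmatched (NULL) rows would be filtered out.
ANTI_JOIN = """
SELECT c.id, c.name, c.email
FROM customers c
JOIN (
    SELECT customer_id
    FROM orders
    WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
    GROUP BY customer_id
    HAVING COUNT(*) >= 3
) q1 ON q1.customer_id = c.id
LEFT JOIN orders o2
       ON o2.customer_id = c.id
      AND o2.order_date BETWEEN '2024-04-01' AND '2024-06-30'
WHERE o2.id IS NULL
ORDER BY c.id;
"""

rows = conn.execute(ANTI_JOIN).fetchall()
print(rows)
```

Unlike `NOT IN`, the anti-join is not tripped up by a NULL `customer_id` in `orders`, which is a second reason to prefer it on real data.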

Use-case deep-dives

Multi-document legal research

When 131K context handles complex contract comparison without re-prompting

A 4-person legal tech startup needs to compare clauses across 8-12 vendor contracts simultaneously—each contract runs 15-20 pages. Hermes 3 405B's 131K token window fits all documents in a single prompt, so the model sees every clause at once when you ask it to flag liability gaps or pricing inconsistencies. At $1/Mtok symmetrical pricing, a typical 80K-token analysis (contracts in, structured report out) costs $0.08. The trade-off: without public benchmarks, you're trusting Nous's fine-tuning on faith—run a 10-contract pilot before committing to production. If your contracts average under 10 pages each, Claude 3.5 Sonnet's proven accuracy may justify the slight price premium. For teams drowning in multi-document cross-reference work, this model's context ceiling and cost floor make it the default first test.

High-frequency customer support triage

Why symmetrical $1 pricing matters when outputs match inputs at scale

A 12-person SaaS company routes 400 support tickets daily through an AI triage layer. Each ticket includes conversation history, account metadata, and KB snippets, averaging 2K tokens in and 1.5K tokens out (classification + suggested reply). Hermes 3 405B's symmetrical $1/$1 pricing means each triage costs $0.0035, or $1.40/day for 400 tickets. Compare that to models with asymmetric pricing (often $3 input / $15 output): the same ticket costs about $0.0285 there, so the $1/$1 rate is roughly 88% cheaper, and the gap widens as output volume approaches input volume. The risk: no public benchmarks means you need a 2-week A/B test against GPT-4o or Claude to confirm classification accuracy holds. If accuracy drops below 92%, the cost savings evaporate in escalation overhead. For teams where AI writes as much as it reads, this pricing structure is the unlock.
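
The per-ticket arithmetic for this scenario can be checked with a small helper. This is a sketch under the figures given in the text (2K tokens in, 1.5K out, 400 tickets a day); the function name is hypothetical, and the $3/$15 rates are the illustrative asymmetric prices from the paragraph, not a quote for any specific model.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """USD cost of one request; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


TICKETS_PER_DAY = 400

hermes = request_cost(2_000, 1_500, 1.00, 1.00)
asymmetric = request_cost(2_000, 1_500, 3.00, 15.00)

print(f"$1/$1:  ${hermes:.4f} per ticket, ${hermes * TICKETS_PER_DAY:.2f} per day")
print(f"$3/$15: ${asymmetric:.4f} per ticket, ${asymmetric * TICKETS_PER_DAY:.2f} per day")
```

Swapping in your own token counts shows how quickly the output price dominates once replies get long.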

Batch content localization

When overnight batch jobs need 405B reasoning without GPT-4 pricing

A 3-person e-learning studio localizes 50-80 lesson scripts per week from English into Spanish and French—each script is 3K tokens, and localization requires cultural adaptation (not just translation). Hermes 3 405B's $1 symmetrical pricing means a 3K-in/3.5K-out job costs $0.0065 per script, or $0.52 for 80 scripts. Run it overnight as a batch, review in the morning, and ship. The model's 405B parameter count handles idiomatic rewrites and context-aware terminology better than smaller models, but without MMLU or MT-Bench scores, you're flying blind on accuracy—budget 15% of scripts for human QA until you trust the output. If you're localizing under 20 scripts/week, the setup overhead isn't worth it; just use DeepL. For studios shipping localized content at double-digit weekly volume, this model's cost and context make it the pragmatic pick.

Frequently asked

Is Hermes 3 405B good for complex reasoning tasks?

Likely, but verify on your own tasks. At 405B parameters it targets the same tier of work as GPT-4 and Claude Opus: multi-step reasoning, code generation, and nuanced instruction following. Nous Research fine-tuned this on synthetic data designed for agentic workflows, so tool use and chain-of-thought prompting are its intended strengths. Public benchmark scores are not yet listed here, so treat strong performance on logic puzzles, math, and structured output tasks as an expectation rather than a measured result.

Is Hermes 3 405B cheaper than GPT-4o or Claude Sonnet?

Yes, at the list prices quoted here. At $1.00 per million tokens for both input and output, it is priced below GPT-4o ($2.50 input / $10 output) and Claude 3.5 Sonnet ($3 input / $15 output) on both sides of the bill, so it costs less at any input/output mix. The gap is smallest for long-context reads with minimal output and widest for chatbots generating verbose replies, where output tokens dominate.

Can Hermes 3 405B handle 128K token contexts reliably?

The 131,072-token (128K) context window is there, but real-world performance depends on your provider's infrastructure and the task. Nous Research trained this model for extended context, so retrieval from the middle of long documents should work. Test with your actual data: some open-weight models degrade past 64K tokens despite advertised limits. For critical production use, validate needle-in-haystack accuracy yourself.
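
A needle-in-haystack check can be set up in a few lines. The sketch below only builds the prompts and scores replies; it does not call any API. All names, the filler sentence, and the planted code are hypothetical, and `reply_for` in the final comment stands in for whatever client you use.

```python
def build_haystack_prompt(filler_sentence, needle, total_sentences, depth):
    """Return a prompt with `needle` placed at a relative depth in filler text.

    depth is a fraction in [0, 1]: 0.0 puts the needle first, 1.0 last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler_sentence] * total_sentences
    sentences.insert(round(depth * total_sentences), needle)
    context = " ".join(sentences)
    question = "What is the secret code mentioned in the text above?"
    return f"{context}\n\n{question}"


def is_hit(model_answer, expected):
    """Score one trial: does the reply contain the planted value?"""
    return expected.lower() in model_answer.lower()


needle = "The secret code is 7421."
prompts = {
    depth: build_haystack_prompt("The sky was clear that day.", needle, 1_000, depth)
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
# Send each prompt to the model with your own client, then score the reply:
# hits = {depth: is_hit(reply_for(depth), "7421") for depth in prompts}
```

Raise `total_sentences` until the prompt approaches the context limit, and use filler drawn from your own documents rather than a repeated sentence for a more realistic test.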

How does Hermes 3 405B compare to Llama 3.1 405B?

Hermes 3 is a fine-tune of Llama 3.1 405B, not a separate base model. Nous Research added instruction tuning and synthetic agentic data on top of Meta's foundation. You get better tool calling and structured output adherence than raw Llama 3.1, but the core reasoning ceiling is the same. If Llama 3.1 fails a task, Hermes 3 probably will too.

Should I use Hermes 3 405B for production chatbots?

Only if you need the 405B scale and want an open-weight alternative to proprietary models. At $1/$1 per Mtok it costs less than the GPT-4o and Claude rates quoted above, but smaller models are typically cheaper still for high-output chat. Latency is likely to be higher than GPT-4o or Claude given the parameter count, though this page has no latency measurements yet. If you're already running Llama 3.1 405B and need better instruction following, Hermes 3 is a drop-in upgrade. Otherwise, test cheaper models first.