Honeyhive

HoneyHive is a modern AI observability and evaluation platform that enables developers and domain experts to collaboratively build reliable AI applications faster.

Verdict

Honeyhive tracks and evaluates your AI's behavior across conversations. When you @mention it in Switchy, your team can log model interactions, create test datasets from real chat sessions, and run evaluations on how well the AI performed. Product managers use it to spot patterns in user requests; engineers use it to debug prompt failures. The MCP exposes 24 tools for logging events, managing datasets, and triggering evaluation runs. Setup requires an API key from your Honeyhive dashboard. Note that batch operations work best when you've already collected the data you want to log—real-time streaming isn't the primary use case here.

Common use cases

  • Log AI responses for quality review
  • Build test datasets from live chat sessions
  • Run evaluations on prompt performance
  • Track tool usage across team conversations
  • Debug model failures with event history

Integration

Vendor: Honeyhive
Category: other
Auth: API_KEY
Tools: 24
Composio slug: honeyhive

Tools

  • Add datapoints to dataset

    Tool to add datapoints to a dataset. Use when you need to append multiple entries with specified input, ground truth, and history mappings.

  • Create Batch Model Events

    Tool to create multiple model events in a single request. Use when you need to log a batch of event interactions to HoneyHive.

  • Create Batch Tool Events

    Tool to log a batch of external API calls as tool events. Use when you need to record multiple tool events in one request, after you have gathered all the event data.

  • Create Dataset

    Tool to create a dataset. Use when you need to initialize a new dataset within a project.

  • Create Tool

    Tool to create a new tool. Use when you need to register a new function or plugin for invocation.

  • Delete Datapoint
    destructive

    Tool to delete a specific datapoint by its ID. Use when you need to remove a datapoint from HoneyHive after confirming its identifier.

  • Delete Dataset
    destructive

    Tool to delete a dataset by ID. Use when you need to remove a dataset after confirming its ID.

  • End Evaluation Run

    Tool to mark an evaluation run as completed. Use after finishing manual evaluations to update the run status to completed.

  • Get Configurations

    Tool to retrieve a list of configurations. Use when you need to fetch all configurations for a specific project before making changes.

  • Get Datasets

    Tool to retrieve a list of datasets. Use when you need to fetch datasets for a specific project with optional filters.

  • Get Metrics

    Tool to retrieve all metrics. Use when you need to list metrics for a specific project, after obtaining project context.

  • Get Projects

    Tool to retrieve projects. Use when you need to list all available projects.

  • List Tools

    Tool to list all available Honeyhive tools. Use when you need to discover which functions or plugins are registered for use.

  • Retrieve Datapoint

    Tool to retrieve a specific datapoint by its ID. Use when you have a datapoint ID and need its full details.

  • Retrieve Datapoints

    Tool to retrieve a list of datapoints. Use when you need to fetch datapoints for a project with optional filters.

  • Retrieve Events

    Tool to retrieve events by filters. Use when you need to export events based on filter criteria, date range, and pagination.

  • Retrieve Experiment Result

    Tool to retrieve the result of a specific experiment run. Use when you need the status, metrics, and datapoint-level details of a completed experiment.

  • Start Evaluation Run

    Tool to initiate an evaluation run using external datasets. Use after selecting a project and events; optionally link a dataset.

  • Start Session

    Tool to start a new session. Use when you need to initiate a new tracking session and retrieve its session_id.

  • Update Datapoint

    Tool to update a specific datapoint. Use when you need to modify fields of an existing datapoint.

  • Update Dataset

    Tool to update an existing dataset. Use when you need to modify a dataset's details (name, description, datapoints, linked evaluations, or metadata) after confirming its ID.

  • Update Event

    Tool to update an event. Use when updating event details by ID.

  • Update Metric

    Tool to update an existing metric. Use when you need to modify a metric’s properties after creation. Ensure you retrieve the metric first to verify its current state.

  • Update Project

    Tool to update a project's name or description. Use when you need to modify an existing project by its ID after creation.
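
The two batch tools above are meant to be called once the event data has been gathered. A minimal Python sketch of that gather-then-batch pattern follows. The event fields (`event_type`, `session_id`, `inputs`, `outputs`) and the batch size are assumptions made for illustration, not Honeyhive's documented schema, and the sketch only prepares batches; it sends nothing.

```python
from typing import Any, Dict, Iterator, List

Event = Dict[str, Any]


def make_model_event(session_id: str, prompt: str, response: str) -> Event:
    """Build one model-event record (field names are hypothetical)."""
    return {
        "event_type": "model",
        "session_id": session_id,
        "inputs": {"prompt": prompt},
        "outputs": {"response": response},
    }


def chunked(events: List[Event], batch_size: int) -> Iterator[List[Event]]:
    """Yield consecutive slices of at most batch_size events."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


# Gather everything first, then split into batches for logging.
collected = [
    make_model_event("demo-session", f"question {i}", f"answer {i}")
    for i in range(5)
]
batches = list(chunked(collected, batch_size=2))
print([len(batch) for batch in batches])  # prints [2, 2, 1]
```

Each batch would then map to one Create Batch Model Events request.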

Setup

Setup guide

  1. In Switchy, open your workspace settings and navigate to the MCP Integrations section.
  2. Click 'Add Integration' and select Honeyhive from the list.
  3. Log into your Honeyhive account at app.honeyhive.ai and go to Settings > API Keys.
  4. Generate a new API key with read and write permissions for events, datasets, and evaluations.
  5. Copy the key and paste it into the Switchy connection dialog, then click 'Connect'.
  6. Switchy will verify the key and confirm the connection within a few seconds.
  7. To test, open any Space and type '@Honeyhive create a dataset called test-dataset in project demo-project'. If the MCP responds with a dataset ID, the connection works.
  8. From now on, @mention Honeyhive in any Space to log events, add datapoints, or trigger evaluations without leaving the conversation.

What teammates see: by default, memories from Honeyhive are scoped to the Space (PROJECT visibility) - you can mark any memory PRIVATE or share it ORG-wide.

Works well with

Top models

Compatibility data appears once enough Spaces have used this MCP together with a given model.

How Switchy teams use it

Not enough Spaces yet to publish anonymised usage stats (we require ≥ 50 Spaces per week).

Starter prompts

Log This Conversation

@Honeyhive log the last five AI responses in this Space as model events in the 'team-chat' project with session ID 'space-{{space_id}}'.

Create Evaluation Dataset

@Honeyhive create a dataset called 'user-questions-{{date}}' in project 'prod-monitoring' and add the last ten user messages as datapoints with their AI responses as ground truth.

Batch Log Tool Calls

@Honeyhive create batch tool events for every external API call made in this Space today, tagging each with session ID 'debug-{{timestamp}}'.

Review Model Performance

@Honeyhive retrieve all model events from session 'support-ticket-4729' in the 'customer-support' project and summarize any errors or low-confidence responses.

End Evaluation Run

@Honeyhive end the evaluation run with ID 'eval-run-{{id}}' and mark it as completed now that we've finished scoring the responses.

Example outputs

Illustrative - representative of the model's voice and quality, not literal recordings.

Prompt

@Honeyhive show me all datasets in the 'customer-support' project and their row counts

Output

Found 3 datasets in your customer-support project:

1. **support-tickets-q1**: 1,247 datapoints (last updated 3 days ago)
2. **escalation-examples**: 89 datapoints (last updated 2 weeks ago)
3. **sentiment-training**: 3,402 datapoints (last updated yesterday)

Each dataset includes input/output pairs with ground truth labels. This example reflects typical dataset metadata; actual counts and names depend on your HoneyHive account state.

Notes

This read-heavy example demonstrates dataset discovery, useful for auditing training data before fine-tuning or evaluation runs. Requires API key with read access to the specified project. Response shape varies by project configuration.

Prompt

@Honeyhive create a new dataset called 'refund-edge-cases' in the customer-support project, then add these 5 examples I'm pasting below

Output

Created dataset **refund-edge-cases** (ID: ds_8x3k9m) in customer-support. Added 5 datapoints:

- Partial refund after 90 days (mapped to input/ground_truth)
- Subscription cancellation mid-cycle (mapped with history)
- Duplicate charge dispute (mapped to input/ground_truth)
- Gift card refund request (mapped with history)
- International currency refund (mapped to input/ground_truth)

Dataset is now available for evaluation runs. This example shows the two-step workflow; actual datapoint IDs and timestamps will differ.

Notes

This write example chains dataset creation with bulk datapoint insertion, a common workflow for building evaluation sets from real support logs. Side-effect: creates persistent resources in HoneyHive. Ensure project exists before running; deletion requires separate API call.
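
The mapping step in this workflow can be pictured as splitting each raw record into input, ground truth, and history sections. The sketch below is a minimal illustration, assuming hypothetical record keys (`question`, `expected_answer`, `previous_turns`) and a hypothetical mapping shape; the real tool's parameters may differ.

```python
from typing import Any, Dict, List

# Hypothetical mapping: which keys of a raw record go into which section.
MAPPING: Dict[str, List[str]] = {
    "inputs": ["question"],
    "ground_truth": ["expected_answer"],
    "history": ["previous_turns"],
}


def map_record(
    record: Dict[str, Any], mapping: Dict[str, List[str]]
) -> Dict[str, Dict[str, Any]]:
    """Split one raw record into sections; keys absent from the record are skipped."""
    return {
        section: {key: record[key] for key in keys if key in record}
        for section, keys in mapping.items()
    }


# Made-up support examples, for illustration only.
raw_records = [
    {
        "question": "Duplicate charge dispute",
        "expected_answer": "Refund the duplicate charge",
    },
    {
        "question": "Gift card refund request",
        "expected_answer": "Escalate to billing",
        "previous_turns": ["Customer asked about gift card balance"],
    },
]

datapoints = [map_record(record, MAPPING) for record in raw_records]
print(datapoints[0]["history"])  # prints {}
```

A record without prior turns ends up with an empty history section, which mirrors the difference between the "mapped to input/ground_truth" and "mapped with history" entries in the output above.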

Prompt

@Honeyhive log the last 10 model events from our chatbot as a batch, then summarize which prompts triggered the most tool calls

Output

Logged 10 model events to HoneyHive (event IDs me_7a2x through me_9k1p). Analyzing tool invocation patterns:

**Highest tool activity:**

- "Check my order status" → 4 tool calls (order lookup, shipping API, inventory check, ETA calculator)
- "Cancel my subscription" → 3 tool calls (auth verify, billing API, confirmation email)
- "Update payment method" → 2 tool calls (payment gateway, audit log)

Remaining 7 prompts averaged 0.4 tool calls. This analysis is illustrative; actual event metadata depends on your chatbot's instrumentation and HoneyHive session configuration.

Notes

This synthesis example pairs batch event logging with AI reasoning over the logged data, useful for debugging complex agent workflows. Requires events to include tool metadata. Be aware of rate limits when logging high-frequency production traffic; batch API helps but doesn't eliminate quota concerns.

Use-case deep-dives

LLM eval runs for product teams

When Honeyhive fits teams shipping AI features weekly

A 6-person product team ships a customer-facing chatbot every two weeks and needs to track prompt performance across deploys. Honeyhive wins here because the batch event tools let you log every user interaction without blocking your API response, and the dataset tools make it trivial to snapshot real conversations for regression testing. The eval run workflow—create dataset from prod logs, run evals, mark complete—maps directly to a pre-deploy checklist. This breaks down if your team isn't already instrumenting LLM calls or if you're prototyping one-off experiments instead of maintaining a production feature. If you're logging 500+ LLM events per day and need to compare prompt versions before shipping, Honeyhive is the right call.
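
The pre-deploy comparison described above amounts to checking, per datapoint, whether the candidate prompt scores lower than the current one. Here is a minimal sketch, assuming evaluation scores have already been exported as plain dictionaries keyed by datapoint ID. The IDs, scores, and tolerance are invented for illustration and do not reflect Honeyhive's result format.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return sorted IDs of datapoints whose candidate score fell
    more than `tolerance` below the baseline score."""
    return sorted(
        dp_id
        for dp_id, old_score in baseline.items()
        if dp_id in candidate and candidate[dp_id] < old_score - tolerance
    )


# Hypothetical per-datapoint scores for two prompt versions.
baseline_scores = {"dp-1": 0.9, "dp-2": 0.5, "dp-3": 0.7}
candidate_scores = {"dp-1": 0.6, "dp-2": 0.8, "dp-3": 0.7}

regressions = find_regressions(baseline_scores, candidate_scores)
print(regressions)  # prints ['dp-1']
```

An empty list would be the signal that the new prompt version is safe to ship on this dataset.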

Support team knowledge base tuning

Honeyhive for iterating on RAG answer quality

A 3-person support ops team uses an LLM to answer tier-1 questions from a knowledge base and wants to improve answer accuracy without hiring an ML engineer. Honeyhive's datapoint tools let you flag bad answers during support shifts, then batch them into a dataset for the next tuning cycle. The tool event logging captures which KB articles the model retrieved, so you can see if the problem is retrieval or generation. This works when your support volume is under 200 tickets per day and you're willing to manually curate 20-30 examples per iteration. If your KB has fewer than 50 articles or you're not using RAG at all, this is overkill—just fix the prompt. For teams running RAG in production and iterating monthly, Honeyhive keeps the feedback loop tight.
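
The retrieval-versus-generation question above can be answered mechanically once each flagged answer records which articles were retrieved. A minimal sketch, assuming a hypothetical record shape in which a reviewer has noted the article that should have been used; the article IDs are invented.

```python
from collections import Counter
from typing import List


def classify_failure(retrieved_ids: List[str], expected_id: str) -> str:
    """Label a bad answer: 'retrieval' if the expected article was never
    fetched, otherwise 'generation' (the model had the right source)."""
    return "generation" if expected_id in retrieved_ids else "retrieval"


# Hypothetical flagged answers collected during support shifts.
flagged_answers = [
    {"retrieved": ["kb-12", "kb-40"], "expected": "kb-40"},
    {"retrieved": ["kb-07"], "expected": "kb-19"},
    {"retrieved": ["kb-19", "kb-03"], "expected": "kb-19"},
]

summary = Counter(
    classify_failure(answer["retrieved"], answer["expected"])
    for answer in flagged_answers
)
print(dict(summary))  # prints {'generation': 2, 'retrieval': 1}
```

A summary dominated by 'generation' points at the prompt, while one dominated by 'retrieval' points at the knowledge base search.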

Agency client LLM cost tracking

When Honeyhive helps agencies bill AI usage accurately

A 10-person agency builds custom LLM features for 8 clients and needs per-client token usage and error rates for monthly invoicing. Honeyhive's batch model event tools let you tag every LLM call with a client ID, then pull usage reports without writing custom analytics. The dataset tools also let you save client-specific test cases so new engineers can validate changes without breaking existing integrations. This setup makes sense when you're managing 3+ client projects simultaneously and each client's LLM usage varies enough to matter for billing. If you're a solo consultant or all clients share the same prompt, a spreadsheet and your LLM provider's dashboard are simpler. For agencies juggling multiple AI deployments and needing audit trails, Honeyhive centralizes the logging you'd otherwise build yourself.
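
Once events carry a client tag, the per-client numbers for an invoice reduce to a simple aggregation. The sketch below assumes events have been exported as dictionaries with hypothetical `client_id`, `total_tokens`, and `error` fields; Honeyhive's actual event schema may name these differently.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_by_client(events: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate call count, token total, error count, and error rate per client."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "tokens": 0, "errors": 0}
    )
    for event in events:
        stats = totals[event["client_id"]]
        stats["calls"] += 1
        stats["tokens"] += event.get("total_tokens", 0)
        if event.get("error"):
            stats["errors"] += 1
    return {
        client: {**stats, "error_rate": stats["errors"] / stats["calls"]}
        for client, stats in totals.items()
    }


# Hypothetical exported events tagged with a client ID.
events = [
    {"client_id": "client-a", "total_tokens": 120, "error": None},
    {"client_id": "client-a", "total_tokens": 80, "error": "timeout"},
    {"client_id": "client-b", "total_tokens": 50, "error": None},
]

report = summarize_by_client(events)
print(report["client-a"])
# prints {'calls': 2, 'tokens': 200, 'errors': 1, 'error_rate': 0.5}
```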

Frequently asked

What does the Honeyhive MCP do in Switchy?

It logs AI model interactions and tool calls to Honeyhive's observability platform. You can create datasets, add training datapoints, batch-log events from your AI workflows, and manage evaluation runs. Use it when you need to track prompt performance, debug model outputs, or build evaluation datasets from your team's AI usage in Switchy.

Do I need a Honeyhive account to connect this MCP?

Yes. You need an active Honeyhive account and an API key from your project settings. The MCP authenticates with that key — no OAuth flow. Any team member with the API key can connect it, but you'll want to use a project-level key rather than a personal one so access doesn't break when someone leaves.

Can the Honeyhive MCP automatically log every AI request I make in Switchy?

No. The MCP provides tools to manually create event logs and datasets — it doesn't auto-capture your Switchy sessions. You'd need to explicitly call tools like Create Batch Model Events after a workflow completes. If you want passive observability, use Honeyhive's SDK in your own code instead of this MCP.

How is this different from just using Honeyhive's API directly?

The MCP wraps Honeyhive's API into natural-language tools your AI agent can invoke. Instead of writing curl commands or SDK calls, you describe what you want logged and the agent handles the API structure. The trade-off: you lose fine-grained control over request timing and error handling compared to direct integration.

Does connecting Honeyhive count against my Switchy plan limits?

No. MCP connections don't consume Switchy seats or storage. However, every tool call the agent makes to Honeyhive counts toward your Honeyhive plan's event quota. If your team logs thousands of model events per day, check your Honeyhive tier limits before connecting this MCP at scale.
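
A quick way to sanity-check the quota concern is to estimate how many days of logging a plan allows. The figures below are placeholders, not Honeyhive's real tier limits; substitute the numbers from your own plan.

```python
def days_until_quota(events_per_day: int, monthly_quota: int, already_used: int = 0) -> int:
    """Whole days of logging left before the event quota is exhausted."""
    if events_per_day < 1:
        raise ValueError("events_per_day must be at least 1")
    remaining = max(monthly_quota - already_used, 0)
    return remaining // events_per_day


# Placeholder numbers: 2,000 events per day against a 50,000-event quota.
print(days_until_quota(events_per_day=2_000, monthly_quota=50_000))  # prints 25
```

If the result is shorter than your billing period, the team would run out of quota before the period ends at that logging rate.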

Data last verified 607 hours ago. Sources aggregated hourly to weekly. See docs/architecture/model-directory.md.