
HybridClaw Documentation

Observability for AI Agents

Metrics taxonomy, KPIs, cost & safety telemetry — built into the runtime, not bolted on.

Live Monitoring Dashboard: Gateway, Agent and Execution spans, with latency and cost-per-task KPIs.

Metrics Taxonomy

Every span, every tool call, and every model invocation emits structured telemetry. The taxonomy is stable, documented, and consistent across agents, so dashboards, alerts, and SLOs work the same way for every team. A sketch of how a runtime hook might emit these metrics follows the table.

| Metric | What it measures | Aggregation | Typical alert threshold |
|---|---|---|---|
| task.latency_ms | Wall-clock duration from task arrival to terminal state. | p50, p95, p99 per skill | p95 > 2x baseline |
| task.cost_eur | Total cost (model + tools) per task in EUR. | Sum per agent / skill / day | Cost per task > budget |
| task.tokens_in / task.tokens_out | Input and output tokens per LLM call. | Sum per model | Sudden 50% increase week over week |
| eval.score | Score from regression evals against task-specific test cases. | Min and mean per skill version | Score drop > threshold blocks deploy |
| safety.flag_rate | Fraction of tasks flagged by the safety classifier. | Rolling 1 h and 24 h windows | > baseline + 2σ |
| tool.error_rate | Fraction of tool calls returning an error. | Per tool, rolling 5 min | > 5% sustained |
| human_handoff.rate | Fraction of tasks escalated to human approval. | Per skill, daily | Sudden change indicates skill drift |
| skill.regression_delta | Eval-score change after a skill or model update. | Per deploy | Negative delta blocks promotion |
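
As a sketch, assuming a hypothetical `emit` helper and `run_task` wrapper (neither is the HybridClaw API), a runtime hook could report the taxonomy like this:

```python
import time

def emit(metric: str, value: float, **labels) -> None:
    # Illustrative sink; a real runtime would ship this to the telemetry pipeline.
    print(metric, value, labels)

def run_task(task_id: str, skill: str, handler) -> dict:
    start = time.monotonic()
    result = handler()  # the agent does its work; returns cost/token accounting
    elapsed_ms = (time.monotonic() - start) * 1000

    # Stable metric names mean every team's dashboards aggregate the same way.
    emit("task.latency_ms", elapsed_ms, skill=skill, task=task_id)
    emit("task.cost_eur", result["cost_eur"], skill=skill, task=task_id)
    emit("task.tokens_in", result["tokens_in"], model=result["model"])
    emit("task.tokens_out", result["tokens_out"], model=result["model"])
    return result
```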

KPIs Operators Actually Use

Per-call metrics are useful for debugging. KPIs are what you put in front of a sponsor or board to show whether the agent program is paying off.

Cost per resolved task

Total spend (model + tools + human review) divided by the number of tasks reaching a terminal success state; it compares directly against the human-only baseline.
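
As a sketch, the KPI reduces to a plain ratio; the numbers are made up for illustration:

```python
def cost_per_resolved_task(model_eur: float, tools_eur: float,
                           review_eur: float, resolved_tasks: int) -> float:
    # Total spend over tasks that reached a terminal success state.
    if resolved_tasks == 0:
        return float("inf")  # nothing resolved yet; the ratio is unbounded
    return (model_eur + tools_eur + review_eur) / resolved_tasks

# EUR 1200 model + EUR 150 tools + EUR 400 human review over 700 resolved tasks
print(cost_per_resolved_task(1200.0, 150.0, 400.0, 700))  # 2.5 EUR per task
```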

Mean time to human escalation

How long agents work autonomously before a task is handed off. Rising = improving autonomy; falling = skill drift.

Skill regression delta after deploy

Eval-score change between current and prior skill version. Catches model upgrades that silently degrade behavior.

Human-approval acceptance rate

Fraction of agent-proposed actions that humans accept. A low rate signals poor agent judgment; a sustained rate of 1.0 means the trust threshold is reached and the gate can be automated.

Safety incident rate

Flagged outputs per 1000 tasks. Tracked per channel and skill — used to catch new prompt-injection or jailbreak patterns.

Time to first useful output

Measured from the incoming channel message to the first useful agent response: the user-facing latency that determines whether agents feel real-time.

Reliability — Evals as a Deploy Gate

Models change. Skills get edited. Without an eval gate, every deploy is a coin flip. HybridClaw records real task runs as trajectories, replays them against new skill or model versions, and surfaces regressions before they reach production.

  • Trajectory replay. Past runs become test cases. New versions must match or beat the score.
  • Skill scorecards. Per-skill dashboard shows pass-rate, latency and cost across the eval set. Operators can see at a glance whether a skill is production-ready.
  • Deploy gate. Configurable thresholds (score, latency, cost) block promotion automatically. No human is in the deploy loop unless a regression actually appears; a sketch of the gate logic follows this list.
  • Content-addressed rollback. Every skill version is content-addressed. Rollback is one command, deterministic, and auditable.
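
A minimal sketch of the gate logic, assuming hypothetical `EvalResult` fields and threshold parameters rather than the actual configuration surface:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float              # mean eval score over the replayed trajectories
    p95_latency_ms: float
    cost_eur_per_task: float

def may_promote(candidate: EvalResult, baseline: EvalResult,
                max_latency_ms: float, max_cost_eur: float) -> bool:
    # A negative skill.regression_delta blocks promotion outright.
    if candidate.score < baseline.score:
        return False
    # Latency and cost are checked against the same eval set.
    if candidate.p95_latency_ms > max_latency_ms:
        return False
    return candidate.cost_eur_per_task <= max_cost_eur
```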

Cost Control

Agent platforms can burn through model budgets fast. HybridClaw makes cost a first-class observable metric — measured per agent, per skill, per task — and exposes the levers operators need.

Model routing

Send simple tasks to cheaper models, complex ones to capable ones — based on cost-per-quality measured by evals.
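
For illustration, a router could pick the cheapest model whose eval score clears the task's quality bar; the model names and numbers below are placeholders:

```python
# Eval-measured quality and per-task cost per model (placeholder values).
MODELS = {
    "small-model": {"eval_score": 0.81, "cost_eur": 0.002},
    "large-model": {"eval_score": 0.94, "cost_eur": 0.020},
}

def pick_model(required_score: float) -> str:
    # Cheapest model whose measured eval score meets the quality bar.
    eligible = [(name, spec["cost_eur"]) for name, spec in MODELS.items()
                if spec["eval_score"] >= required_score]
    if not eligible:
        raise ValueError("no model meets the quality bar")
    return min(eligible, key=lambda pair: pair[1])[0]

print(pick_model(0.80))  # small-model: cheap and good enough
print(pick_model(0.90))  # large-model: only it clears the bar
```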

Per-agent budgets

Soft and hard caps per day / week / month. Soft cap warns, hard cap stops new tasks until lifted.
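
A sketch of the cap semantics, assuming a per-agent ledger; the class and its methods are illustrative, not the real API:

```python
class AgentBudget:
    def __init__(self, soft_eur: float, hard_eur: float):
        self.soft_eur, self.hard_eur = soft_eur, hard_eur
        self.spent_eur = 0.0

    def admit(self, estimated_cost_eur: float) -> bool:
        # Hard cap: refuse new tasks until an operator lifts the cap.
        if self.spent_eur + estimated_cost_eur > self.hard_eur:
            return False
        # Soft cap: keep running, but warn the operator.
        if self.spent_eur >= self.soft_eur:
            print("warning: soft budget cap exceeded")
        return True

    def charge(self, actual_cost_eur: float) -> None:
        self.spent_eur += actual_cost_eur
```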

Cache layers

In-memory, on-disk and shared caches for skill outputs and retrieval. A cache hit incurs no model or tool spend.
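
A layered lookup in miniature; plain dicts stand in for the real in-memory, on-disk, and shared stores:

```python
def cached_lookup(key, memory: dict, disk: dict, shared: dict, compute):
    # Check layers in order of access cost; any hit avoids model/tool spend.
    for layer in (memory, disk, shared):
        if key in layer:
            return layer[key]
    value = compute(key)   # miss everywhere: pay for the model or tool call
    memory[key] = value    # populate the fastest layer for next time
    return value
```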

Trajectory pruning

Skills that consistently exceed token budgets are flagged for refactoring or context compression.
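
A sketch of the flagging rule, assuming per-skill lists of tokens consumed per run; the 80% threshold is an arbitrary illustration:

```python
def over_budget_skills(token_usage: dict[str, list[int]],
                       budget: int, threshold: float = 0.8) -> list[str]:
    # Flag skills whose runs blow the token budget most of the time;
    # they are candidates for refactoring or context compression.
    return [skill for skill, runs in token_usage.items()
            if runs and sum(t > budget for t in runs) / len(runs) >= threshold]

usage = {"summarize": [900, 1100, 1300], "triage": [4100, 3900, 4500]}
print(over_budget_skills(usage, budget=2000))  # ['triage']
```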

Cost reports

Daily / weekly cost reports per agent and skill, exportable as CSV.

Cost-per-outcome KPI

Track total spend against business outcomes (resolved tickets, generated leads, etc.) — not just raw token consumption.

Safety Telemetry

Safety is not an output filter; it is a continuous signal in the same telemetry stream as latency and cost. Operators see safety incidents alongside the actions that caused them, with a full audit trail.

  • Every flagged output is logged with the trajectory that produced it — so post-incident review takes minutes, not days.
  • High-impact actions (transfers, external emails, deletions) require human approval, configurable per skill and audit-logged.
  • Tamper-evident audit log: every action is hash-chained and content-addressed, so operators can prove what an agent did, when, and on whose authority. A minimal chaining sketch follows this list.
  • EU-hosted control plane. GDPR and EU AI Act compliant by design.
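
A minimal sketch of the chaining idea: each entry embeds the hash of its predecessor, so editing any earlier entry invalidates every hash after it. Field names are illustrative:

```python
import hashlib, json

def append_action(log: list, action: dict) -> dict:
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"action": action, "prev": prev_hash}
    # The entry's address covers its own content plus its predecessor's hash.
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)
    return body

def verify(log: list) -> bool:
    prev = "0" * 64
    for entry in log:
        body = {"action": entry["action"], "prev": entry["prev"]}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False  # chain broken: someone altered history
        prev = entry["hash"]
    return True

log = []
append_action(log, {"agent": "billing", "op": "refund", "amount_eur": 40})
print(verify(log))  # True; mutate any entry and this flips to False
```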