How to Monitor OpenClaw with Prometheus Alerts
Expose OpenClaw Gateway metrics through the official Prometheus diagnostics plugin and build alerts for cost, latency, queues, and cardinality.
Prometheus monitoring is the practical way to stop guessing whether your OpenClaw setup is healthy, expensive, or quietly stuck. The official diagnostics-prometheus plugin turns Gateway diagnostics into a protected Prometheus text endpoint. That gives you metrics for model calls, tokens, cost, runs, tools, messages, queues, sessions, memory, and exporter health without scraping raw logs.
30-second answer
Install and enable clawhub:@openclaw/diagnostics-prometheus, set diagnostics.enabled: true, restart the Gateway, and scrape /api/diagnostics/prometheus with the same Gateway auth your operator clients use. Build alerts around spend, model latency, tool failures, queue wait, message delivery errors, memory pressure, and the dropped-series counter.
When this pays off
This pays off the moment OpenClaw becomes production infrastructure. A founder running support automation wants to know if the bot stopped replying. An agency wants per-client latency and cost confidence. A solo operator wants to catch a bill spike before it outruns the revenue the automation was supposed to earn. Prometheus is especially strong when you already run Grafana, VictoriaMetrics, or another Prometheus-compatible scraper.
Operator runbook
- Install the plugin with openclaw plugins install clawhub:@openclaw/diagnostics-prometheus. Then enable it through config or openclaw plugins enable diagnostics-prometheus. The HTTP route is registered when the plugin starts, so plan a Gateway restart or reload instead of expecting an already-running process to expose it instantly.
- Enable diagnostics. The docs note that diagnostics.enabled: true is required. Without it, the plugin can register the route but diagnostic events will not flow, and the response comes back empty. That failure mode looks like monitoring is wired up while the Gateway has no metrics to export. A config sketch follows this list.
- Scrape the protected route. The endpoint is GET /api/diagnostics/prometheus and uses normal Gateway auth. In Prometheus, set metrics_path to that route and pass credentials through an authorization credentials_file or an equivalent secret-safe mechanism; a scrape job sketch follows this list. Do not create a public unauthenticated /metrics shortcut.
- Start with buyer-relevant dashboards. Track openclaw_model_cost_usd_total, token counters, run duration histograms, tool execution outcomes, message delivery outcomes, queue depth, queue wait, and memory pressure. These tell you whether automation is profitable, responsive, and actually delivering messages. A starter spend alert appears after this list.
- Add cardinality protection alerts. The exporter caps retained series at 2048 and increments openclaw_prometheus_series_dropped_total when new series are dropped. Alert on any increase over a short window; an increase usually means an upstream label is leaking high-cardinality values and needs to be fixed at the source. See the dropped-series rule after this list.
- Keep labels low-cardinality. The docs state that raw run IDs, session IDs, message IDs, request IDs, prompts, responses, tool inputs, and tool outputs do not appear in metrics. Preserve that property in your own dashboards and alert annotations instead of copying sensitive values into labels.
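For the install and enable steps, the wiring usually reduces to two switches. Below is a minimal sketch assuming a YAML-style Gateway config; the exact file location, schema, and plugin key name are assumptions to check against your OpenClaw config reference.

```yaml
# Hypothetical config shape -- confirm key names in the OpenClaw config reference.
diagnostics:
  enabled: true            # required: without this the route registers but returns nothing
plugins:
  diagnostics-prometheus:
    enabled: true          # or run: openclaw plugins enable diagnostics-prometheus
```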
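For the scrape step, a minimal Prometheus job could look like the sketch below. The metrics_path matches the documented route; the scheme, target address, and token file path are placeholders for your deployment.

```yaml
scrape_configs:
  - job_name: openclaw-gateway
    metrics_path: /api/diagnostics/prometheus
    scheme: https                                 # assumption: TLS in front of the Gateway
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/openclaw-token  # token stays out of this file
    static_configs:
      - targets: ["gateway.internal:3000"]        # placeholder host:port
```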
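For the alerting steps, two starter rules cover sustained spend and the dropped-series counter. The metric names are the ones documented above; the thresholds, windows, and severity labels are placeholders to tune for your budget and paging policy.

```yaml
groups:
  - name: openclaw-starter
    rules:
      - alert: OpenClawSpendBurn
        # Projected hourly burn over the last 30m exceeds a placeholder ceiling of $5/hour.
        expr: sum(rate(openclaw_model_cost_usd_total[30m])) * 3600 > 5
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "OpenClaw spend is burning above budget"
      - alert: OpenClawSeriesDropped
        # Any increase means the exporter hit its 2048-series cap and is dropping new series.
        expr: increase(openclaw_prometheus_series_dropped_total[15m]) > 0
        labels:
          severity: ticket
        annotations:
          summary: "Exporter is dropping series; look for a leaking high-cardinality label"
```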
Verification
After deployment, curl the route locally with bearer auth and confirm Prometheus-format text. Then generate a small agent run and verify counters move. In Grafana, check both current values and rate/increase queries so Gateway restarts do not look like false drops. Finally, trigger one safe auth failure and confirm logs catch what metrics intentionally do not expose.
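A minimal local check, assuming bearer-token auth and a placeholder port; substitute whatever your operator clients use:

```bash
# Token and port are placeholders for your deployment.
curl -sf -H "Authorization: Bearer $OPENCLAW_TOKEN" \
  http://127.0.0.1:3000/api/diagnostics/prometheus | head -n 20
# Expect Prometheus text format: "# HELP" / "# TYPE" lines and openclaw_* series.
```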
Common mistakes
The easy mistake is treating Prometheus as a public health endpoint. It is operator-scoped and may reveal operational shape even though it avoids raw content. Another mistake is over-alerting on every slow model call. Alert on sustained burn, queue wait, delivery failures, and dropped-series behavior; use dashboards for normal model variance.
Turn it into a repeatable operating system
The Playbook turns metrics into decisions: when to switch models, when to pause a channel, when to split a Gateway, and when an automation is spending more than it earns. Prometheus tells you what happened; the Playbook gives you the operating rules for what to do next.
Before rollout
Before rollout, write down which alert wakes a human and which alert only creates a dashboard note. Prometheus can become noise quickly. Tie pages to buyer-impacting failures such as message delivery, sustained queue wait, rising cost, or missing metrics from a Gateway that should be alive.
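One way to encode that split is a severity label on every rule plus an Alertmanager route that pages only on severity: page. A minimal sketch; the receiver names are placeholders and need real notification configs attached.

```yaml
route:
  receiver: dashboard-note          # default: recorded, no human woken
  routes:
    - matchers:
        - severity = "page"
      receiver: oncall-pager        # buyer-impacting failures only
receivers:
  - name: dashboard-note            # attach a low-urgency channel here
  - name: oncall-pager              # attach pagerduty_configs / webhook_configs here
```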
Frequently Asked Questions
What plugin exposes Prometheus metrics?
Use the official clawhub:@openclaw/diagnostics-prometheus plugin, installed and enabled through the openclaw plugins commands.
What route does Prometheus scrape?
The documented route is GET /api/diagnostics/prometheus on the Gateway HTTP port.
Does the metrics route require auth?
Yes. It uses Gateway authentication with operator scope. Do not expose it as a public unauthenticated /metrics endpoint.
What should I alert on first?
Start with model cost, run duration, queue wait, and openclaw_prometheus_series_dropped_total for cardinality problems. Watch auth failures through logs, since the metrics intentionally omit them.