Metrics Pipeline Debugging Runbook
Overview
This runbook provides step-by-step guidance for diagnosing and resolving issues in a metrics pipeline consisting of:
- A Python program emitting StatsD metrics
- Vector (StatsD source → Prometheus exporter sink)
- Prometheus (scraping metrics from Vector)
- Grafana (visualizing Prometheus data)
Debugging Workflow
1. Verify Python Program is Emitting StatsD Metrics
Check if metrics are being sent
Manually send a test StatsD metric to the same address/port the Python program targets:
echo "orchestrator_jobs_active:1|g" | nc -u -w1 127.0.0.1 9000
Check Vector logs for received metrics
journalctl -u vector -f | grep statsd
✅ If logs indicate received metrics → Vector is receiving data.
❌ If no logs appear → The Python program might not be sending metrics correctly or is targeting the wrong address/port.
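For reference, here is a minimal sketch of what the emitting side should be doing: a plain UDP StatsD gauge sent to 127.0.0.1:9000, the address used by the nc test above. The metric name and endpoint come from this runbook; the helper function and 5-second loop are illustrative assumptions, not your program's actual code.
import socket
import time

# Assumed StatsD endpoint; must match Vector's statsd source address/port.
STATSD_ADDR = ("127.0.0.1", 9000)

def send_gauge(name: str, value: float) -> None:
    # Build a StatsD gauge datagram ("name:value|g") and fire it over UDP.
    payload = f"{name}:{value}|g".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, STATSD_ADDR)

if __name__ == "__main__":
    # Emit a test gauge every 5 seconds so it shows up in Vector's logs.
    while True:
        send_gauge("orchestrator_jobs_active", 1)
        time.sleep(5)
If your program uses a StatsD client library instead of raw sockets, confirm it is configured with the same host and port.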
2. Verify Vector is Exposing Metrics
Check if Vector’s Prometheus exporter is publishing metrics
curl -s http://127.0.0.1:9598/metrics | grep orchestrator
✅ If metrics appear → Vector is working correctly.
❌ If no metrics appear →
- Ensure that Vector’s flush_period_secs is correctly configured (see 4A below) and that the statsd source is listening on the address/port the Python program targets.
- Check Vector logs for errors: journalctl -u vector -f
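As a single end-to-end check for steps 1 and 2, the sketch below sends a test gauge over UDP and then reads Vector's exporter page. The addresses (127.0.0.1:9000 for the statsd source, 127.0.0.1:9598 for the exporter) are the ones used elsewhere in this runbook and may need adjusting; the 2-second wait is an arbitrary grace period.
import socket
import time
import urllib.request

STATSD_ADDR = ("127.0.0.1", 9000)               # Vector statsd source (assumed)
EXPORTER_URL = "http://127.0.0.1:9598/metrics"  # Vector prometheus_exporter sink

# 1. Send a test gauge, equivalent to the `echo ... | nc` command in step 1.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.sendto(b"orchestrator_jobs_active:1|g", STATSD_ADDR)

# 2. Give Vector a moment to process the datagram, then scrape the exporter page.
time.sleep(2)
body = urllib.request.urlopen(EXPORTER_URL, timeout=5).read().decode()

matches = [line for line in body.splitlines() if "orchestrator" in line]
print("\n".join(matches) if matches else "metric not found on exporter page")
If the metric shows up here but never reaches Prometheus, the problem is downstream (step 3).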
3. Verify Prometheus is Scraping Metrics
Check Prometheus targets
curl -s "http://localhost:9090/api/v1/targets" | jq '.data.activeTargets[] | {scrapeUrl, lastScrape, lastError, health}'
✅ If health is "up" and lastError is empty → Prometheus is successfully scraping.
❌ If health is "down" → There is a scraping issue (check lastError).
Check if Prometheus has seen the metric
curl -s "http://localhost:9090/api/v1/series?match[]=orchestrator_jobs_active" | jq .
✅ If the query returns data → Prometheus has recorded the metric.
❌ If empty → Check Vector and StatsD configuration.
Query latest metric values
curl -s "http://localhost:9090/api/v1/query?query=orchestrator_jobs_active" | jq .
✅ If a value is returned → The metric is stored.
❌ If no value is returned → Metrics may be expiring before they are scraped.
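The three curl checks above can also be run in one pass. This is only a convenience sketch against the standard Prometheus HTTP API (/api/v1/targets, /api/v1/series, /api/v1/query); it assumes Prometheus at localhost:9090 and the metric name used throughout this runbook.
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"        # assumed Prometheus address
METRIC = "orchestrator_jobs_active"

def get(path, params=None):
    # GET a Prometheus API endpoint and return the decoded JSON body.
    url = PROM + path + ("?" + urllib.parse.urlencode(params) if params else "")
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

# Target health, same fields as the /api/v1/targets check above.
for t in get("/api/v1/targets")["data"]["activeTargets"]:
    print(f"{t['scrapeUrl']}: health={t['health']} lastError={t['lastError']!r}")

# Has the series been seen at all, and what is its latest value?
series = get("/api/v1/series", {"match[]": METRIC})["data"]
result = get("/api/v1/query", {"query": METRIC})["data"]["result"]
print(f"series known: {bool(series)}, current samples: {result}")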
4. Fix Potential Issues
A. Vector’s flush_period_secs is Too Short
If metrics disappear before Prometheus scrapes them, increase flush_period_secs in vector.toml:
[sinks.prometheus_exporter]
type = "prometheus_exporter"
inputs = ["statsd"]
address = "0.0.0.0:9598"
flush_period_secs = 30 # Set higher than Prometheus scrape interval
Restart Vector:
systemctl restart vector
B. Prometheus Scrape Interval is Too Long
Ensure Prometheus scrapes more frequently than Vector's flush interval (prometheus.yml):
scrape_configs:
  - job_name: "vector"
    scrape_interval: 5s   # Must be less than Vector’s flush_period_secs
    static_configs:
      - targets: ["127.0.0.1:9598"]
Restart Prometheus:
systemctl restart prometheus
C. Metrics Are Not Updating
Manually send a StatsD metric:
echo "orchestrator_jobs_active:5|g" | nc -u -w1 127.0.0.1 9000
Immediately query Prometheus:
curl -s "http://localhost:9090/api/v1/query?query=orchestrator_jobs_active" | jq .
If the metric appears but disappears later, Vector may be expiring stale metrics too soon.
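To see exactly when the value disappears, a small watcher like this sketch can poll the instant-query endpoint once a second (same assumed Prometheus address and metric name as above) and print the sample, or its absence, with a timestamp:
import json
import time
import urllib.parse
import urllib.request

# Instant-query URL for the gauge used throughout this runbook (assumed address).
QUERY_URL = ("http://localhost:9090/api/v1/query?"
             + urllib.parse.urlencode({"query": "orchestrator_jobs_active"}))

while True:
    with urllib.request.urlopen(QUERY_URL, timeout=5) as resp:
        result = json.load(resp)["data"]["result"]
    stamp = time.strftime("%H:%M:%S")
    if result:
        # Instant-query samples come back as [unix_timestamp, "value"].
        print(stamp, "value =", result[0]["value"][1])
    else:
        print(stamp, "metric absent (expired or not yet scraped)")
    time.sleep(1)
If the value vanishes between scrapes, compare Vector's flush_period_secs (4A) with Prometheus's scrape_interval (4B).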
D. Verify Prometheus Retention Settings
Check the retention period Prometheus was started with (the --storage.tsdb.retention.time flag; the default is 15d):
curl -s "http://localhost:9090/api/v1/status/flags" | jq '.data["storage.tsdb.retention.time"]'
If retention is low (e.g., 1h), increase it by restarting Prometheus with a longer retention, e.g.:
prometheus --storage.tsdb.retention.time=7d
Final Debugging Checklist
- ✅ Python program emits StatsD metrics (nc -u -w1 127.0.0.1 9000)
- ✅ Vector receives metrics (journalctl -u vector -f | grep statsd)
- ✅ Vector exposes metrics correctly (curl -s http://127.0.0.1:9598/metrics)
- ✅ Prometheus scrapes successfully (api/v1/targets with health: "up")
- ✅ Metrics are stored in Prometheus (api/v1/query?query=orchestrator_jobs_active)
Working through these steps in order isolates the failing stage of the pipeline (Python program → Vector → Prometheus → Grafana) and points to the appropriate fix.