Metrics Pipeline Debugging Runbook
Overview
This runbook provides step-by-step guidance for diagnosing and resolving issues in a metrics pipeline consisting of:
- A Python program emitting StatsD metrics
- Vector (StatsD source → Prometheus exporter sink)
- Prometheus (scraping metrics from Vector)
- Grafana (visualizing Prometheus data)
Debugging Workflow
1. Verify Python Program is Emitting StatsD Metrics
Check if metrics are being sent
To test the intake path independently of the Python program, manually send a StatsD metric to the same address and port the program targets:
echo "orchestrator_jobs_active:1|g" | nc -u -w1 127.0.0.1 9000
Check Vector logs for received metrics
journalctl -u vector -f | grep statsd
✅ If logs indicate received metrics → Vector is receiving data.
❌ If no logs appear → The Python program might not be sending metrics correctly or is targeting the wrong address/port.
2. Verify Vector is Exposing Metrics
Check if Vector’s Prometheus exporter is publishing metrics
curl -s http://127.0.0.1:9598/metrics | grep orchestrator
✅ If metrics appear → Vector is working correctly.
❌ If no metrics appear →
- Ensure that Vector’s flush_period_secs is correctly configured.
- Check Vector logs for errors: journalctl -u vector -f
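If you prefer scripting the endpoint check above, here is a small Python sketch using only the standard library; the exporter URL and the orchestrator metric prefix are assumptions carried over from the curl example.
import urllib.request

EXPORTER_URL = "http://127.0.0.1:9598/metrics"

# Fetch Vector's Prometheus exporter page and print any matching metric lines
with urllib.request.urlopen(EXPORTER_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

matches = [line for line in body.splitlines() if "orchestrator" in line]
if matches:
    print("\n".join(matches))
else:
    print("No orchestrator_* metrics exposed by Vector")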
3. Verify Prometheus is Scraping Metrics
Check Prometheus targets
curl -s "http://localhost:9090/api/v1/targets" | jq '.data.activeTargets[] | {scrapeUrl, lastScrape, lastError, health}'
✅ If health is "up" and lastError is empty → Prometheus is successfully scraping.
❌ If health is "down" → There is a scraping issue (check lastError).
Check if Prometheus has seen the metric
curl -s "http://localhost:9090/api/v1/series?match[]=orchestrator_jobs_active" | jq .
✅ If the query returns data → Prometheus has recorded the metric.
❌ If empty → Check Vector and StatsD configuration.
Query latest metric values
curl -s "http://localhost:9090/api/v1/query?query=orchestrator_jobs_active" | jq .
✅ If a value is returned → The metric is stored.
❌ If no value is returned → Metrics may be expiring before they are scraped.
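The three Prometheus checks above can be combined into one script against the standard HTTP API. The sketch below assumes the same base URL (localhost:9090) and metric name used in this section.
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"
METRIC = "orchestrator_jobs_active"

def api(path, params=None):
    # Helper: GET a Prometheus API path and decode the JSON response
    url = PROM + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

# 1. Target health (equivalent to the api/v1/targets check)
for target in api("/api/v1/targets")["data"]["activeTargets"]:
    print(target["scrapeUrl"], target["health"], target["lastError"] or "-")

# 2. Has the series been recorded at all?
series = api("/api/v1/series", {"match[]": METRIC})["data"]
print("series known:", bool(series))

# 3. Latest value, if any
result = api("/api/v1/query", {"query": METRIC})["data"]["result"]
print("latest value:", result[0]["value"][1] if result else "none")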
4. Fix Potential Issues
A. Vector’s flush_period_secs is Too Short
If metrics disappear before Prometheus scrapes them, increase flush_period_secs in vector.toml:
[sinks.prometheus_exporter]
type = "prometheus_exporter"
inputs = ["statsd"]
address = "0.0.0.0:9598"
flush_period_secs = 30 # Set higher than Prometheus scrape interval
Restart Vector:
systemctl restart vector
B. Prometheus Scrape Interval is Too Long
Ensure Prometheus scrapes more frequently than Vector's flush interval (prometheus.yml):
scrape_configs:
  - job_name: "vector"
    scrape_interval: 5s  # Must be less than Vector’s flush_period_secs
    static_configs:
      - targets: ["127.0.0.1:9598"]  # Vector's prometheus_exporter address
Restart Prometheus:
systemctl restart prometheus
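After the restart, you can confirm which settings Prometheus is actually running with by reading back the loaded configuration from its status API. A short Python sketch, assuming the default localhost:9090 address:
import json
import urllib.request

# /api/v1/status/config returns the currently loaded configuration as YAML text
with urllib.request.urlopen("http://localhost:9090/api/v1/status/config", timeout=5) as resp:
    config_yaml = json.load(resp)["data"]["yaml"]

# Print only the scrape-related lines for a quick visual check
for line in config_yaml.splitlines():
    if "job_name" in line or "scrape_interval" in line or "targets" in line:
        print(line)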
C. Metrics Are Not Updating
Manually send a StatsD metric:
echo "orchestrator_jobs_active:5|g" | nc -u -w1 127.0.0.1 9000
Immediately query Prometheus:
curl -s "http://localhost:9090/api/v1/query?query=orchestrator_jobs_active" | jq .
If the metric appears but disappears later, Vector may be expiring stale metrics too soon.
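To measure how long the value actually survives, the sketch below sends one gauge and then polls Prometheus once per second; the host, ports, and metric name are the same assumptions used throughout this runbook.
import json
import socket
import time
import urllib.parse
import urllib.request

STATSD_ADDR = ("127.0.0.1", 9000)
QUERY_URL = "http://localhost:9090/api/v1/query?" + urllib.parse.urlencode(
    {"query": "orchestrator_jobs_active"}
)

# Send one gauge sample via StatsD
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"orchestrator_jobs_active:5|g", STATSD_ADDR)

# Poll Prometheus and report when the sample appears and when it disappears
for second in range(120):
    with urllib.request.urlopen(QUERY_URL, timeout=5) as resp:
        result = json.load(resp)["data"]["result"]
    print(f"t={second:3d}s present={bool(result)}")
    time.sleep(1)
If "present" flips back to False after a short time, compare Vector's flush_period_secs with the Prometheus scrape_interval as described in sections A and B above.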
D. Verify Prometheus Retention Settings
Check retention settings:
curl -s "http://localhost:9090/api/v1/status/flags" | jq '.data["storage.tsdb.retention.time"]'
If retention is low (e.g., 1h), increase it by adding the flag to Prometheus’s startup options (for example in its systemd unit) and restarting:
--storage.tsdb.retention.time=7d
Final Debugging Checklist
- ✅ Python program emits StatsD metrics (nc -u -w1 127.0.0.1 9000)
- ✅ Vector receives metrics (journalctl -u vector -f | grep statsd)
- ✅ Vector exposes metrics correctly (curl -s http://127.0.0.1:9598/metrics)
- ✅ Prometheus scrapes successfully (api/v1/targets with health: "up")
- ✅ Metrics are stored in Prometheus (api/v1/query?query=orchestrator_jobs_active)
Following this runbook will help identify and resolve issues efficiently when debugging the metrics pipeline.