Metrics Pipeline Debugging Runbook

Overview

This runbook provides step-by-step guidance for diagnosing and resolving issues in a metrics pipeline consisting of:

  • A Python program emitting StatsD metrics
  • Vector (StatsD source → Prometheus exporter sink)
  • Prometheus (scraping metrics from Vector)
  • Grafana (visualizing Prometheus data)

Debugging Workflow

1. Verify Python Program is Emitting StatsD Metrics

Check if metrics are being sent

Run the following command to manually send a StatsD metric:

echo "orchestrator_jobs_active:1|g" | nc -u -w1 127.0.0.1 9000

Check Vector logs for received metrics

journalctl -u vector -f | grep statsd

✅ If logs indicate received metrics → Vector is receiving data. (At the default log level Vector may not log each received event; raising the log level to debug makes per-event activity easier to spot.)

❌ If no logs appear → The Python program might not be sending metrics correctly, or it is targeting the wrong address/port.
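
If nothing shows up, also confirm that the statsd source in vector.toml listens on the same address and port the program sends to. A minimal sketch (the exact layout of your configuration may differ):

[sources.statsd]
type = "statsd"
address = "127.0.0.1:9000"  # must match the address/port the Python program targets
mode = "udp"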


2. Verify Vector is Exposing Metrics

Check if Vector’s Prometheus exporter is publishing metrics

curl -s http://127.0.0.1:9598/metrics | grep orchestrator
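
If the sink is exporting the gauge, the response contains it in Prometheus exposition format, roughly like this (HELP/TYPE lines depend on the Vector version):

# TYPE orchestrator_jobs_active gauge
orchestrator_jobs_active 1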

✅ If metrics appear → Vector is working correctly.

❌ If no metrics appear →

  • Ensure that Vector’s flush_period_secs is set longer than the Prometheus scrape interval (see section 4A below).
  • Check Vector logs for errors: journalctl -u vector -f.

3. Verify Prometheus is Scraping Metrics

Check Prometheus targets

curl -s "http://localhost:9090/api/v1/targets" | jq '.data.activeTargets[] | {scrapeUrl, lastScrape, lastError, health}'

✅ If health is "up" and lastError is empty → Prometheus is successfully scraping.

❌ If health is "down" → There is a scraping issue (check lastError).

Check if Prometheus has seen the metric

curl -s "http://localhost:9090/api/v1/series?match[]=orchestrator_jobs_active" | jq .

✅ If the query returns data → Prometheus has recorded the metric.

❌ If empty → Check Vector and StatsD configuration.

Query latest metric values

curl -s "http://localhost:9090/api/v1/query?query=orchestrator_jobs_active" | jq .

✅ If a value is returned → The metric is stored.

❌ If no value is returned → Metrics may be expiring before they are scraped.


4. Fix Potential Issues

A. Vector’s flush_period_secs is Too Short

If metrics disappear before Prometheus scrapes them, increase flush_period_secs in vector.toml:

[sinks.prometheus_exporter]
type = "prometheus_exporter"
inputs = ["statsd"]
address = "0.0.0.0:9598"
flush_period_secs = 30  # Set higher than Prometheus scrape interval

Restart Vector:

systemctl restart vector

B. Prometheus Scrape Interval is Too Long

Ensure Prometheus scrapes more frequently than Vector's flush interval (prometheus.yml):

scrape_configs:
- job_name: "vector"
  scrape_interval: 5s  # Must be less than Vector’s flush_period_secs
  static_configs:
    - targets: ["0.0.0.0:9598"]

Restart Prometheus:

systemctl restart prometheus
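
To confirm the running instance picked up the new interval, the loaded configuration can be inspected over the HTTP API:

curl -s "http://localhost:9090/api/v1/status/config" | jq -r '.data.yaml' | grep -B1 -A3 'vector'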

C. Metrics Are Not Updating

Manually send a StatsD metric:

echo "orchestrator_jobs_active:5|g" | nc -u -w1 127.0.0.1 9000

Immediately query Prometheus:

curl -s "http://localhost:9090/api/v1/query?query=orchestrator_jobs_active" | jq .

If the metric appears but disappears later, Vector may be expiring stale metrics too soon.
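
To measure how quickly the series drops out, a small script can send one gauge and then poll the query API until the result set comes back empty; the addresses and ports below are the same assumptions used throughout this runbook.

# sketch: send one gauge, then time how long Prometheus keeps returning it
import json, socket, time, urllib.request

STATSD_ADDR = ("127.0.0.1", 9000)
PROM_QUERY = "http://localhost:9090/api/v1/query?query=orchestrator_jobs_active"

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.sendto(b"orchestrator_jobs_active:5|g", STATSD_ADDR)

start = time.time()
while True:
    with urllib.request.urlopen(PROM_QUERY) as resp:
        result = json.load(resp)["data"]["result"]
    print(f"{time.time() - start:5.1f}s  {len(result)} series")
    if not result or time.time() - start > 600:
        break  # series expired (or give up after 10 minutes)
    time.sleep(5)

If the series vanishes shortly after the last update, revisit flush_period_secs (section 4A); if it survives for several minutes, Prometheus staleness handling is the more likely reason the last point stops appearing in instant queries.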

D. Verify Prometheus Retention Settings

Check which retention the running Prometheus was started with via the flags API (the default, when not set explicitly, is 15d):

curl -s "http://localhost:9090/api/v1/status/flags" | jq -r '.data["storage.tsdb.retention.time"]'

If retention is very low (e.g., 1h), raise it by passing a longer value in the Prometheus startup flags, for example:

--storage.tsdb.retention.time=7d
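
For a systemd-managed install, that typically means editing the service unit (paths below are illustrative, not taken from this environment) and restarting:

# /etc/systemd/system/prometheus.service (excerpt)
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=7d

systemctl daemon-reload
systemctl restart prometheus

Note that retention only controls how long samples are kept; a short value limits how far back Grafana can query, but it does not stop fresh samples from appearing.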

Final Debugging Checklist

  • Python program emits StatsD metrics (nc -u -w1 127.0.0.1 9000)
  • Vector receives metrics (journalctl -u vector -f | grep statsd)
  • Vector exposes metrics correctly (curl -s http://127.0.0.1:9598/metrics)
  • Prometheus scrapes successfully (api/v1/targets with health: "up")
  • Metrics are stored in Prometheus (api/v1/query?query=orchestrator_jobs_active)

Following this runbook will help identify and resolve issues efficiently when debugging the metrics pipeline.