Monitoring Microservices with Prometheus and Grafana
Observability is the cornerstone of reliable distributed systems. In this guide, we'll set up a complete monitoring stack using Prometheus for metrics collection and Grafana for visualization.
Why Prometheus?
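Prometheus has become a de facto standard for cloud-native monitoring for four reasons:
- Pull-based model: Prometheus scrapes metrics from your services over HTTP, so services do not need to know where the monitoring server lives
- Service discovery: it finds scrape targets automatically through integrations such as Kubernetes and Consul
- PromQL: a query language built for filtering and aggregating time series
- Dimensional data: every series carries key-value labels, which makes filtering and grouping easy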
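To make the "dimensional data" point concrete, here is a minimal sketch of what a single labeled sample looks like in the text format Prometheus scrapes. The helper names `formatSample` and `escapeLabelValue`, and the metric name `http_requests_total`, are made up for illustration; in practice a client library produces this output for you.

```javascript
// Escape the characters that the text exposition format requires
// to be escaped inside a label value: backslash, double quote, newline.
function escapeLabelValue(value) {
  return value
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render one sample as `name{label="value",...} number`.
// Hypothetical helper, shown only to illustrate the format.
function formatSample(name, labels, value) {
  const pairs = Object.entries(labels)
    .map(([key, val]) => `${key}="${escapeLabelValue(String(val))}"`)
    .join(',');
  return pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`;
}

console.log(formatSample('http_requests_total', { method: 'GET', status: '200' }, 1027));
// http_requests_total{method="GET",status="200"} 1027
```

Each distinct combination of label values is its own time series, so a PromQL selector such as `http_requests_total{status="200"}` picks out just the matching ones.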
Architecture
[App 1] <--- Scrape --- [Prometheus] ---> [Grafana]
[App 2] <--- Scrape ---/     |
                             v
                      [Alertmanager]
Step 1: Instrumenting Applications
To let Prometheus scrape metrics, your app needs to expose a /metrics endpoint.
Node.js Example
const express = require('express');
const client = require('prom-client');

const app = express();

// Register the default Node.js process metrics (CPU, memory, event loop lag, GC).
// Recent prom-client releases collect these on each scrape, so the old
// `timeout` polling option is no longer needed.
client.collectDefaultMetrics();

// Expose every registered metric in the Prometheus text format.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
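The default metrics describe the Node.js process, not your traffic. As a sketch of how request-level metrics could be added, the helper below registers a counter and a histogram and records both when each response finishes. The function name `instrumentHttp`, the metric names, and the bucket boundaries are assumptions chosen for illustration; `app` and `client` are the objects from the example above.

```javascript
// Hypothetical helper: adds request count and duration metrics to an Express app.
// `client` is the prom-client module, passed in so the helper stays easy to test.
function instrumentHttp(app, client) {
  const requests = new client.Counter({
    name: 'http_requests_total', // assumed metric name
    help: 'Total HTTP requests handled',
    labelNames: ['method', 'status'],
  });

  const duration = new client.Histogram({
    name: 'http_request_duration_seconds', // assumed metric name
    help: 'HTTP request duration in seconds',
    labelNames: ['method', 'status'],
    buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5], // assumed boundaries, in seconds
  });

  app.use((req, res, next) => {
    // startTimer() returns a function that observes the elapsed seconds when called.
    const stop = duration.startTimer();
    res.on('finish', () => {
      const labels = { method: req.method, status: String(res.statusCode) };
      requests.inc(labels);
      stop(labels);
    });
    next();
  });

  return { requests, duration };
}
```

Call `instrumentHttp(app, client)` before registering your routes so the middleware sees every request. The labels are deliberately limited to method and status: putting raw URLs or user IDs in labels creates a new time series per value and can overwhelm Prometheus.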
Step 2: Configuring Prometheus
Create a prometheus.yml configuration file:
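global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-app'
    static_configs:
      # 'localhost' only works when Prometheus runs on the same host as the app.
      # From inside a container, use an address that resolves to the app instead.
      - targets: ['localhost:3000']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']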
Step 3: Running with Docker Compose
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secret
    ports:
      # Published on host port 3001 because the Node.js app already listens on 3000.
      - "3001:3000"
    depends_on:
      - prometheus
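Once the stack is up, it is worth confirming that Prometheus can actually reach its targets. The sketch below reads the `/api/v1/targets` endpoint of the Prometheus HTTP API and reduces the response to the fields that matter for a quick check. The function names are made up for this guide, the base URL is whatever address your Prometheus is reachable on, and `checkTargets` assumes Node.js 18 or later for the built-in `fetch`.

```javascript
// Reduce a /api/v1/targets response to one row per active scrape target.
function summarizeTargets(response) {
  return response.data.activeTargets.map((target) => ({
    job: target.labels.job,
    url: target.scrapeUrl,
    health: target.health, // "up", "down" or "unknown"
    lastError: target.lastError || null,
  }));
}

// Fetch the target list from a running Prometheus and print it as a table.
// `prometheusUrl` is an assumption, for example 'http://localhost:9090'.
async function checkTargets(prometheusUrl) {
  const res = await fetch(new URL('/api/v1/targets', prometheusUrl));
  if (!res.ok) {
    throw new Error(`Prometheus returned HTTP ${res.status}`);
  }
  console.table(summarizeTargets(await res.json()));
}
```

Running `checkTargets('http://localhost:9090')` should list both jobs. A target whose health is `down` usually comes with a `lastError` such as a refused connection, which is the first thing to look at when a graph stays empty.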
Step 4: Visualizing in Grafana
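- Log in to Grafana (user admin, password secret, as set in the Compose file)
- Add a data source: choose Prometheus and set the URL to http://prometheus:9090 (the Compose service name resolves inside the Docker network)
- Import a dashboard (e.g., Node Exporter Full, ID 1860). Note that this dashboard charts host metrics from node_exporter, so it stays empty until you add that exporter as a scrape target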
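If you would rather script the data source step than click through the UI, Grafana exposes it through its HTTP API at `POST /api/datasources`. The following is a sketch, not a complete provisioning setup: the function name `buildDataSourceRequest` is made up, the Grafana base URL depends on how you published the container's port, and basic auth with the admin account is assumed to be enabled.

```javascript
// Build the URL and fetch() options for creating a Prometheus data source.
// Nothing is sent here; the caller decides when to make the request.
function buildDataSourceRequest(grafanaUrl, user, password) {
  const credentials = Buffer.from(`${user}:${password}`).toString('base64');
  return {
    url: new URL('/api/datasources', grafanaUrl).toString(),
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Basic ${credentials}`,
      },
      body: JSON.stringify({
        name: 'Prometheus',
        type: 'prometheus',
        access: 'proxy', // Grafana's backend makes the requests, not the browser
        url: 'http://prometheus:9090',
        isDefault: true,
      }),
    },
  };
}
```

With Node.js 18 or later you can send it with `const { url, options } = buildDataSourceRequest(process.env.GRAFANA_URL, 'admin', 'secret'); await fetch(url, options);`, where `GRAFANA_URL` is an environment variable you set yourself.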
Important Metrics to Watch
RED Method
- Rate: Request rate (req/s)
- Errors: Error rate (%)
- Duration: Request duration (latency)
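As a sketch of how the three RED signals translate into PromQL, the helper below builds one query per signal, plus a URL for the Prometheus instant-query API. The metric names `http_requests_total` and `http_request_duration_seconds`, the `status` label, and the function names are assumptions: substitute whatever your own instrumentation exports.

```javascript
// Build PromQL expressions for the RED signals of one scrape job.
// Metric and label names are assumed, not something Prometheus defines.
function redQueries(job, window = '5m') {
  const all = `http_requests_total{job="${job}"}`;
  const failed = `http_requests_total{job="${job}",status=~"5.."}`;
  const buckets = `http_request_duration_seconds_bucket{job="${job}"}`;
  return {
    // Requests per second, averaged over the window.
    rate: `sum(rate(${all}[${window}]))`,
    // Fraction of requests that returned a 5xx status (multiply by 100 for %).
    errors: `sum(rate(${failed}[${window}])) / sum(rate(${all}[${window}]))`,
    // 95th percentile latency estimated from histogram buckets.
    duration: `histogram_quantile(0.95, sum by (le) (rate(${buckets}[${window}])))`,
  };
}

// Build a URL for the instant-query endpoint (/api/v1/query) of the HTTP API.
function instantQueryUrl(prometheusUrl, expr) {
  const url = new URL('/api/v1/query', prometheusUrl);
  url.searchParams.set('query', expr);
  return url.toString();
}
```

The same expressions can be pasted into a Grafana panel or the Prometheus expression browser. The window (`5m` here) trades responsiveness for smoothness: shorter windows react faster but produce noisier graphs.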
USE Method (for Infrastructure)
- Utilization: % time busy
- Saturation: Queue length
- Errors: Count of errors
Alerting
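Dashboards only help when someone is looking at them. Define alerting rules for critical conditions in a rules file, reference that file under rule_files in prometheus.yml, and let Alertmanager route the notifications.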
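# alert.rules.yml
groups:
  - name: example
    rules:
      - alert: HighRequestLatency
        # Assumes a recording rule named job:request_latency_seconds:mean5m
        # already exists; it is not defined in this guide.
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency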
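The `for: 10m` clause is what keeps a brief spike from paging anyone: the condition has to stay true for the whole period before the alert fires. The function below is a simplified model of that behavior, not Prometheus's actual implementation. The name `alertStates` and the sample shape `{ t, value }` (with `t` in seconds) are invented for this sketch.

```javascript
// Simplified model of an alerting rule with a `for` duration.
// Returns the alert state after each evaluation: inactive, pending or firing.
function alertStates(samples, threshold, forSeconds) {
  let activeSince = null; // time at which the condition first became true
  return samples.map(({ t, value }) => {
    if (!(value > threshold)) {
      activeSince = null; // condition cleared, so the timer resets
      return 'inactive';
    }
    if (activeSince === null) {
      activeSince = t;
    }
    return t - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

const states = alertStates(
  [
    { t: 0, value: 0.6 },
    { t: 300, value: 0.7 },
    { t: 600, value: 0.8 },
    { t: 900, value: 0.2 },
  ],
  0.5, // threshold from the rule above
  600  // for: 10m
);
console.log(states); // [ 'pending', 'pending', 'firing', 'inactive' ]
```

Notice that a single evaluation below the threshold resets the timer, so a flapping metric may never fire. That is usually what you want for a paging alert, but it is a reason to alert on a smoothed value such as a 5-minute mean rather than on raw samples.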
Conclusion
A robust monitoring stack gives you the confidence to deploy faster. By measuring what matters (RED/USE), you can detect and fix issues before users notice.