Traditional monitoring fires alerts when something is already broken. What if your monitoring could warn you 10 minutes before a problem becomes an incident? That is exactly what we built for one of our clients.
The Problem
The team was getting paged at 2 AM for incidents that, in hindsight, had clear warning signs hours earlier — gradually rising memory usage, slowly increasing error rates, growing queue depths. The data was all there in Prometheus. No one was looking at it the right way.
The Architecture
We built a lightweight anomaly detection service in Python that:
- Pulls metrics from Prometheus every 60 seconds via the HTTP API
- Runs each metric through an Isolation Forest model (from scikit-learn) trained on 30 days of normal behaviour
- Scores each metric and raises an early warning whenever the anomaly score crosses a threshold
- Sends alerts to PagerDuty and Slack with context (which metric, how anomalous, recent trend)
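The scoring and alert-context steps above can be sketched roughly as follows. This is a minimal illustration, not the production service: the synthetic training data, the 99th-percentile threshold, the metric name `queue_depth`, and the helper names `train_detector`, `anomaly_scores`, and `build_warning` are all assumptions made for the example. One detail worth noting is that scikit-learn's `score_samples()` returns lower values for more abnormal points, so the sketch negates it to get a score where higher means more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def train_detector(history: np.ndarray, seed: int = 0) -> IsolationForest:
    """Fit an Isolation Forest on samples of one metric from normal operation."""
    model = IsolationForest(n_estimators=100, random_state=seed)
    model.fit(history.reshape(-1, 1))  # scikit-learn expects a 2-D feature matrix
    return model


def anomaly_scores(model: IsolationForest, values: np.ndarray) -> np.ndarray:
    """Return scores where higher means more anomalous.

    score_samples() gives lower values to more abnormal points, so negate it.
    """
    return -model.score_samples(values.reshape(-1, 1))


def build_warning(metric: str, score: float, threshold: float, recent: np.ndarray) -> dict:
    """Build a Slack-style message body: which metric, how anomalous, recent trend."""
    change = float(recent[-1] - recent[0])
    text = (
        f"Early warning: {metric} anomaly score {score:.3f} "
        f"(threshold {threshold:.3f}), change over window {change:+.1f}"
    )
    return {"text": text}


# Synthetic stand-in for 30 days of normal behaviour (hypothetical values).
rng = np.random.default_rng(0)
history = rng.normal(loc=100.0, scale=5.0, size=2000)
model = train_detector(history)

# Assumed threshold: the 99th percentile of scores seen on the training data.
threshold = float(np.quantile(anomaly_scores(model, history), 0.99))

recent = np.array([101.0, 140.0, 260.0, 500.0])  # a metric drifting far out of range
latest_score = float(anomaly_scores(model, recent)[-1])
if latest_score > threshold:
    print(build_warning("queue_depth", latest_score, threshold, recent))
```

Deriving the threshold from the training scores keeps it on the same scale as the model's output. A real deployment would likely tune the percentile per metric to balance early warnings against noise.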
Key Code: Pulling Prometheus Metrics
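import time

import requests

# Seconds per unit for lookback strings such as "30m", "1h", or "7d".
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def query_prometheus(metric: str, lookback: str = "1h") -> list:
    url = "http://prometheus:9090/api/v1/query_range"
    # query_range needs Unix (or RFC 3339) timestamps; it does not accept "now-1h".
    end = time.time()
    start = end - int(lookback[:-1]) * UNITS[lookback[-1]]
    resp = requests.get(
        url,
        params={"query": metric, "start": start, "end": end, "step": "60s"},
        timeout=10,
    )
    resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return resp.json()["data"]["result"]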
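The list returned by a Prometheus range query holds one entry per series, each with a `metric` label map and a `values` list of `[timestamp, "value"]` pairs, where the sample values arrive as strings. Before scoring, those need to be converted to floats. A minimal sketch of that step, where `extract_values` is a hypothetical helper name and the sample payload is invented for illustration:

```python
def extract_values(result: list) -> dict:
    """Map each series (keyed by its sorted labels) to a list of float samples."""
    series_values = {}
    for series in result:
        labels = series["metric"]
        # Build a stable key such as "__name__=queue_depth,instance=a".
        key = ",".join(f"{name}={value}" for name, value in sorted(labels.items()))
        # Each sample is [unix_timestamp, "value"]; the value is a string.
        series_values[key] = [float(value) for _, value in series["values"]]
    return series_values


# Invented payload in the shape Prometheus returns for a range query.
sample_result = [
    {
        "metric": {"__name__": "queue_depth", "instance": "a"},
        "values": [[1700000000, "3"], [1700000060, "4.5"]],
    }
]
print(extract_values(sample_result))
```

Keying by the full label set keeps series from different instances separate, so each can be scored against its own history.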
Results
After 60 days in production, mean time to detect (MTTD) dropped by 40%, and overnight pages fell from 12 per month to 3, a 75% reduction. The model also flagged a slow memory leak that would otherwise have caused an outage; it was caught and fixed during business hours.
Interested in AI-powered observability for your infrastructure? Let’s talk.