How We Built an AI-Powered Monitoring System with Python and Prometheus

June 20, 2026 · ekamops_admin

Traditional monitoring fires alerts when something is already broken. What if your monitoring could warn you 10 minutes before a problem becomes an incident? That is exactly what we built for one of our clients.

The Problem

The team was getting paged at 2 AM for incidents that, in hindsight, had clear warning signs hours earlier — gradually rising memory usage, slowly increasing error rates, growing queue depths. The data was all there in Prometheus. No one was looking at it the right way.

The Architecture

We built a lightweight anomaly detection service in Python that:

1. Pulls metrics from Prometheus every 60 seconds via the HTTP API
2. Runs each metric through an Isolation Forest model (from scikit-learn) trained on 30 days of normal behaviour
3. Scores each metric; anything above a threshold triggers an early warning
4. Sends alerts to PagerDuty and Slack with context (which metric, how anomalous, recent trend)
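
The scoring and alert-context steps above might look roughly like the following. This is a minimal sketch, not the production service: the function names, the `ANOMALY_THRESHOLD` value, and the alert fields are assumptions made for illustration. It treats each metric as a one-dimensional series and flips the sign of scikit-learn's `score_samples` so that a higher score means "more anomalous".

```python
from typing import Optional

import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical cut-off, not taken from the original service; tune per metric.
ANOMALY_THRESHOLD = 0.60


def train_model(history: np.ndarray) -> IsolationForest:
    """Fit an Isolation Forest on a 1-D array of samples from normal operation."""
    model = IsolationForest(n_estimators=100, random_state=0)
    model.fit(history.reshape(-1, 1))
    return model


def anomaly_score(model: IsolationForest, value: float) -> float:
    """Return a score where higher means more anomalous.

    score_samples() returns values where lower means more abnormal,
    so the sign is flipped here.
    """
    return float(-model.score_samples(np.array([[value]]))[0])


def build_alert(metric: str, score: float, recent: np.ndarray,
                threshold: float = ANOMALY_THRESHOLD) -> Optional[dict]:
    """Return alert context if the score exceeds the threshold, else None."""
    if score <= threshold:
        return None
    # Slope of a least-squares line through the recent samples (units per step).
    slope = float(np.polyfit(np.arange(len(recent)), recent, 1)[0])
    return {"metric": metric, "score": round(score, 3), "trend_per_step": slope}
```

A caller would train one model per metric, score the newest sample on each 60-second cycle, and forward any non-`None` result to the alerting channels. Keeping the threshold as a parameter makes it easy to loosen it for noisy metrics.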

Key Code: Pulling Prometheus Metrics

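import time

import requests

def query_prometheus(metric: str, lookback_seconds: int = 3600) -> list:
    """Return the range-query result for `metric` over the lookback window."""
    url = "http://prometheus:9090/api/v1/query_range"
    # query_range takes Unix timestamps or RFC 3339 times, not "now-1h".
    end = time.time()
    resp = requests.get(url, params={
        "query": metric,
        "start": end - lookback_seconds,
        "end":   end,
        "step":  "60s",
    }, timeout=10)
    resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return resp.json()["data"]["result"]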
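
The list returned by `query_prometheus` still needs unpacking before it can be fed to a model. In a range-query response, each series carries a `metric` label set and a `values` list of `[timestamp, "value"]` pairs, with the sample value encoded as a string. The helper below is a hypothetical sketch of that conversion; the name `series_to_floats` and the label-based key format are assumptions, not part of the original service.

```python
from typing import Dict, List, Tuple


def series_to_floats(result: list) -> Dict[str, List[Tuple[float, float]]]:
    """Map each returned series to a list of (timestamp, value) float pairs."""
    parsed: Dict[str, List[Tuple[float, float]]] = {}
    for series in result:
        labels = series.get("metric", {})
        # Build a stable, readable key from the sorted label set.
        key = ",".join(f"{name}={value}" for name, value in sorted(labels.items()))
        # Sample values arrive as strings such as "0.42" or "NaN".
        parsed[key] = [(float(ts), float(val)) for ts, val in series.get("values", [])]
    return parsed
```

Keying by the full label set keeps series from different jobs or instances apart, so each one can be scored against its own baseline.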

Results

After 60 days in production: mean time to detect (MTTD) dropped by 40%, and the team went from 12 overnight pages per month to 3. The model also flagged a slow memory leak that would have caused an outage — caught and fixed during business hours.

Interested in AI-powered observability for your infrastructure? Let’s talk.
