✈️ PilotOps MCP
AI-powered Incident Response Autopilot for DevOps & SRE teams
Connect Claude AI to your entire monitoring stack and respond to incidents in natural language — no more jumping between 5 different tools at 3am.
The Problem
When an incident fires at 3am, an SRE must manually:
| Step | Tool | Time |
|------|------|------|
| Check alerts | Prometheus | 2 min |
| Analyze metrics | Grafana | 5 min |
| Search logs | Loki / ELK | 10 min |
| Diagnose root cause | Brain | 15 min |
| Write runbook | Notion / Confluence | 10 min |
| Page on-call | PagerDuty | 2 min |
| Notify team | Slack | 2 min |
| **Total** | **7 tools** | **~46 min** |
The Solution
With PilotOps MCP, you just tell Claude:
"There's an alert on prod, investigate and generate a runbook"
And Claude handles everything in under 2 minutes.
How It Works
```
┌─────────────────────────────────────────────────────────────┐
│                    You (Claude Desktop)                     │
│      "Investigate the active alert on prod-server-01"       │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                     PilotOps MCP Server                     │
│                                                             │
│  1. prometheus_get_active_alerts()                          │
│     → CPU 95% on prod-server-01 for 10 min                  │
│                                                             │
│  2. prometheus_get_metrics("node_cpu...")                   │
│     → Spike started at 22:15, still climbing                │
│                                                             │
│  3. loki_get_logs('{host="prod-server-01"}')                │
│     → 847 errors: "OOM Killer activated"                    │
│                                                             │
│  4. analyze_incident(alerts, metrics, logs)                 │
│     → P1 | Memory leak in payments-api | Confidence: HIGH   │
│                                                             │
│  5. generate_runbook("memory_leak", "P1")                   │
│     → 4-phase runbook generated                             │
│                                                             │
│  6. pagerduty_create_incident("P1: Memory leak")            │
│     → On-call engineer paged                                │
│                                                             │
│  7. slack_notify("#incidents", severity="critical")         │
│     → Team notified with communication template             │
│                                                             │
│  8. grafana_create_annotation("[P1 START] 22:15")           │
│     → Incident marked on all dashboards                     │
└─────────────────────────────────────────────────────────────┘
```
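The eight steps above can be sketched as a plain-Python pipeline. This is an illustrative sketch only: the stub functions stand in for the real MCP tool calls (which talk to live Prometheus, Loki, PagerDuty, and Slack APIs), and their return values are canned sample data.

```python
# Hypothetical sketch of the investigation pipeline Claude drives through
# the MCP tools. Stubs return canned data; the real tools call live APIs.

def get_active_alerts(host):
    return [{"alertname": "HighMemoryUsage", "host": host, "state": "firing"}]

def get_metrics(host):
    return {"memory_pct": 95, "cpu_pct": 40}

def get_logs(host):
    return ["kernel: Out of memory: OOM Killer activated"]

def analyze(alerts, metrics, logs):
    # Correlate signals: high memory plus OOM log lines -> memory leak.
    oom = any("oom" in line.lower() for line in logs)
    if metrics["memory_pct"] > 85 and oom:
        return {"type": "memory_leak", "severity": "P1", "confidence": "HIGH"}
    return {"type": "unknown", "severity": "P3", "confidence": "LOW"}

def investigate(host):
    alerts = get_active_alerts(host)             # 1. fetch firing alerts
    metrics = get_metrics(host)                  # 2. query metrics
    logs = get_logs(host)                        # 3. pull recent logs
    diagnosis = analyze(alerts, metrics, logs)   # 4. correlate
    # Steps 5-8 (runbook, PagerDuty, Slack, Grafana annotation) would
    # follow here, driven by the diagnosis.
    return diagnosis
```

Running `investigate("prod-server-01")` against this sample data yields a P1 memory-leak diagnosis, mirroring the flow in the diagram.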
Features
- 12 MCP Tools across 5 integrations
- AI Correlation Engine — matches alerts + metrics + logs against 7 incident patterns
- Auto Runbook Generator — produces 4-phase runbooks (Triage → Mitigation → Investigation → Resolution)
- Slack Communication Templates — ready-to-send status updates
- Full Docker Demo Stack — simulate real incidents locally with 1 command
- Zero vendor lock-in — works with any Prometheus-compatible stack
Tools Reference
Prometheus
| Tool | Description |
|------|-------------|
| prometheus_get_active_alerts | Fetch all firing alerts with severity, labels, and annotations |
| prometheus_get_metrics | Query any PromQL expression with time range |
| prometheus_silence_alert | Silence an alert for a specified duration |
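`prometheus_get_active_alerts` presumably wraps Prometheus's `/api/v1/alerts` HTTP endpoint. A minimal sketch of parsing that response shape (the sample payload below is illustrative, not real output):

```python
def parse_firing_alerts(payload: dict) -> list[dict]:
    """Extract firing alerts from a Prometheus /api/v1/alerts response."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        {
            "name": a["labels"].get("alertname", "unknown"),
            "severity": a["labels"].get("severity", "none"),
            "summary": a.get("annotations", {}).get("summary", ""),
        }
        for a in alerts
        if a.get("state") == "firing"  # skip alerts still in "pending"
    ]

# Illustrative sample payload in the /api/v1/alerts response format.
sample = {
    "status": "success",
    "data": {"alerts": [
        {"state": "firing",
         "labels": {"alertname": "HighCPUUsage", "severity": "critical"},
         "annotations": {"summary": "CPU > 80% on prod-server-01"}},
        {"state": "pending",
         "labels": {"alertname": "SlowResponseTime", "severity": "warning"},
         "annotations": {}},
    ]},
}
```

Only the firing `HighCPUUsage` alert survives the filter; the pending one is dropped.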
Grafana
| Tool | Description |
|------|-------------|
| grafana_get_dashboards | List and search available dashboards |
| grafana_create_annotation | Mark incident start/end on dashboards for post-mortem |
Loki
| Tool | Description |
|------|-------------|
| loki_get_logs | Query logs via LogQL with level filtering and error detection |
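Under the assumption that `loki_get_logs` uses Loki's `/loki/api/v1/query_range` endpoint, the request it builds might look like this (the helper name and defaults are illustrative):

```python
import time
from urllib.parse import urlencode

def build_loki_query(base_url: str, selector: str,
                     minutes: int = 15, limit: int = 1000) -> str:
    """Build a Loki query_range URL for recent logs.

    `selector` is a LogQL expression, e.g. '{host="prod-server-01"} |= "error"'.
    Loki expects start/end as nanosecond Unix timestamps.
    """
    end_ns = int(time.time() * 1e9)
    start_ns = end_ns - minutes * 60 * 10**9
    params = urlencode({
        "query": selector,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    })
    return f"{base_url}/loki/api/v1/query_range?{params}"

url = build_loki_query("http://localhost:3100",
                       '{host="prod-server-01"} |= "error"')
```

The LogQL selector is percent-encoded into the `query` parameter; fetching the URL returns a JSON body with log streams and their entries.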
PagerDuty
| Tool | Description |
|------|-------------|
| pagerduty_get_incidents | List open incidents by status |
| pagerduty_create_incident | Create P1-P4 incident and page on-call |
| pagerduty_update_incident | Acknowledge or resolve with timeline note |
Slack
| Tool | Description |
|------|-------------|
| slack_notify | Send color-coded alert with severity emoji |
AI Core
| Tool | Description |
|------|-------------|
| analyze_incident | Correlates alerts + metrics + logs → root cause + confidence |
| generate_runbook | Generates structured 4-phase runbook with Slack template |
Supported Incident Types
| Type | Trigger | Pattern |
|------|---------|---------|
| memory_leak | OOM kills, heap growth | Memory > 85% + OOM logs |
| high_cpu | CPU saturation | CPU > 80% sustained |
| disk_full | Disk space exhaustion | No space left errors |
| network_issue | Connectivity problems | Timeouts + packet loss |
| database_issue | DB overload / deadlocks | Slow queries + connection pool |
| service_crash | App crash / restart loop | Segfault + panic logs |
| deployment_issue | Failed K8s rollout | CrashLoopBackOff + ImagePull |
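The pattern column above can be encoded as simple predicates over the collected signals. A sketch of how such a correlation table might be matched (thresholds come from the table; the structure and function names are illustrative, not the project's actual implementation):

```python
# Each entry pairs an incident type with a predicate over (metrics, logs).
# Order matters: more specific patterns are checked first.
PATTERNS = [
    ("memory_leak", lambda m, logs: m.get("memory_pct", 0) > 85
                    and any("oom" in line.lower() for line in logs)),
    ("high_cpu",    lambda m, logs: m.get("cpu_pct", 0) > 80),
    ("disk_full",   lambda m, logs: any("no space left" in line.lower()
                                        for line in logs)),
]

def classify(metrics: dict, logs: list[str]) -> str:
    """Return the first incident type whose pattern matches."""
    for name, predicate in PATTERNS:
        if predicate(metrics, logs):
            return name
    return "unknown"

classify({"memory_pct": 92}, ["OOM Killer activated"])  # -> "memory_leak"
```

Checking the most specific patterns first avoids, for example, an OOM-driven CPU spike being misclassified as plain `high_cpu`.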
Tech Stack
- Language: Python 3.11+
- MCP Server: FastMCP (official Anthropic SDK)
- Metrics: Prometheus + Alertmanager
- Dashboards: Grafana
- Logs: Loki + Promtail
- Incidents: PagerDuty
- Alerts: Slack
- Containers: Docker + Docker Compose
Quick Start
Prerequisites
- Python 3.11+
- Docker & Docker Compose
- Claude Desktop
1. Clone & install
```bash
git clone https://github.com/muhammedehab35/PILOT_OPS-MCP.git
cd PILOT_OPS-MCP
pip install -r requirements.txt
```
2. Configure
```bash
cp .env.example .env

# Minimum required for local demo
PROMETHEUS_URL=http://localhost:9090
GRAFANA_URL=http://localhost:3000
GRAFANA_API_KEY=your_grafana_api_key
LOKI_URL=http://localhost:3100

# Optional: for full incident workflow
PAGERDUTY_API_KEY=your_pagerduty_key
PAGERDUTY_SERVICE_ID=PXXXXXX
SLACK_BOT_TOKEN=xoxb-your-slack-token
SLACK_DEFAULT_CHANNEL=#incidents
```
3. Launch the full demo stack
```bash
cd docker
docker-compose up -d
```
| Service | URL | Credentials |
|---------|-----|-------------|
| Demo App | http://localhost:8080 | — |
| Prometheus | http://localhost:9090 | — |
| Alertmanager | http://localhost:9093 | — |
| Grafana | http://localhost:3000 | admin / admin123 |
| Loki | http://localhost:3100 | — |
4. Trigger a real incident
```bash
# CPU spike → fires HighCPUUsage alert after 30s
curl -X POST http://localhost:8080/simulate/cpu-spike

# Memory leak → fires HighMemoryUsage alert after 30s
curl -X POST http://localhost:8080/simulate/memory-leak

# High error rate → fires HighErrorRate alert after 30s
curl -X POST http://localhost:8080/simulate/high-errors

# Slow responses → fires SlowResponseTime alert after 30s
curl -X POST http://localhost:8080/simulate/slow-response

# Reset all incidents
curl -X POST http://localhost:8080/simulate/reset
```
5. Connect to Claude Desktop
Add to `%APPDATA%\Claude\claude_desktop_config.json` (Windows) or `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS):

```json
{
  "mcpServers": {
    "pilotops": {
      "command": "python",
      "args": ["/full/path/to/PILOT_OPS-MCP/server.py"],
      "env": {
        "PROMETHEUS_URL": "http://localhost:9090",
        "GRAFANA_URL": "http://localhost:3000",
        "GRAFANA_API_KEY": "your_key",
        "LOKI_URL": "http://localhost:3100",
        "PAGERDUTY_API_KEY": "your_key",
        "SLACK_BOT_TOKEN": "your_token"
      }
    }
  }
}
```
Restart Claude Desktop → look for the 🔨 hammer icon in the chat bar.
6. Run your first incident response
```
You:    "There's an active alert on prod, investigate and generate a runbook"

Claude: → Fetching active alerts from Prometheus...
        → Querying CPU and memory metrics...
        → Pulling last 15 minutes of error logs from Loki...
        → Analyzing correlation...
        → [P1] Memory leak detected in payments-api (confidence: HIGH)
        → Generating runbook...
        → Creating PagerDuty incident #42...
        → Notifying #incidents on Slack...

✅ Full incident response completed in 45 seconds.
```
Project Structure
```
PILOT_OPS-MCP/
├── server.py              # FastMCP server — registers all 12 tools
├── config.py              # Pydantic settings — loads from .env
├── requirements.txt
├── .env.example
│
├── tools/                 # One file per integration
│   ├── prometheus.py      # get_alerts, get_metrics, silence
│   ├── grafana.py         # dashboards, annotations
│   ├── loki.py            # log queries via LogQL
│   ├── pagerduty.py       # create / update incidents
│   └── slack.py           # team notifications
│
├── core/                  # AI intelligence layer
│   ├── correlator.py      # Pattern-matching correlation engine
│   └── runbook.py         # 4-phase runbook generator (7 types)
│
└── docker/                # Full local demo environment
    ├── docker-compose.yml
    ├── demo-app/          # Flask app — simulates real incidents
    │   ├── app.py         # /simulate/* endpoints + Prometheus metrics
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── prometheus/
    │   ├── prometheus.yml # Scrape config
    │   └── alerts.yml     # 5 alert rules
    ├── grafana/
    │   ├── provisioning/  # Auto-configured datasources
    │   └── dashboards/    # Pre-built infrastructure dashboard
    ├── loki/loki-config.yml
    ├── promtail/promtail-config.yml
    └── alertmanager/alertmanager.yml
```
Example Runbook Output
```
📋 RUNBOOK: Memory Leak / OOM Incident
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Severity : P1 | SLA: 15 minutes
Services : payments-api
Hosts    : prod-server-01

PHASE 1 — TRIAGE
1. Confirm memory usage: free -h or Grafana memory dashboard
2. Identify top memory consumers: ps aux --sort=-%mem | head -20
3. Check OOM kills: dmesg | grep -i 'oom'

PHASE 2 — MITIGATION
1. Restart the affected service to free memory immediately
2. Enable memory limits (K8s: resources.limits.memory)
3. Set up swap if not present

PHASE 3 — INVESTIGATION
1. Collect heap dump (JVM: jmap, Go: pprof)
2. Review recent code changes for memory regressions
3. Check GC logs for anomalies

PHASE 4 — RESOLUTION
1. Deploy fix or roll back the problematic version
2. Verify memory returns to baseline
3. Resolve PagerDuty + post-mortem

💬 SLACK TEMPLATE:
[P1 INCIDENT] Memory Leak / OOM
• Affected: payments-api
• Hosts: prod-server-01
• Status: Investigating
• SLA: Resolve within 15 minutes
• Next update: In 15 minutes
```
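Structurally, a runbook like the one above could come from a generator along these lines. This is a sketch, not the project's `core/runbook.py`: the exact signature, the P2–P4 SLA values, and the step wording are assumptions (only the P1/15-minute SLA and the memory-leak steps appear in the example above).

```python
# Assumed SLA table: only the P1 value (15 min) is taken from the example.
SLA_MINUTES = {"P1": 15, "P2": 30, "P3": 60, "P4": 240}
PHASES = ("TRIAGE", "MITIGATION", "INVESTIGATION", "RESOLUTION")

# Per-incident-type step templates (memory_leak shown; the project covers 7).
STEPS = {
    "memory_leak": {
        "TRIAGE": ["Confirm memory usage (free -h / Grafana)",
                   "Identify top consumers (ps aux --sort=-%mem)"],
        "MITIGATION": ["Restart the affected service", "Enable memory limits"],
        "INVESTIGATION": ["Collect heap dump", "Review recent code changes"],
        "RESOLUTION": ["Deploy fix or roll back", "Verify memory baseline"],
    },
}

def generate_runbook(incident_type, severity, services, hosts):
    """Assemble a 4-phase runbook plus a ready-to-send Slack template."""
    sla = SLA_MINUTES[severity]
    return {
        "title": f"{severity} {incident_type} on {', '.join(hosts)}",
        "sla_minutes": sla,
        "phases": [(phase, STEPS[incident_type][phase]) for phase in PHASES],
        "slack_template": (
            f"[{severity} INCIDENT] {incident_type}\n"
            f"• Affected: {', '.join(services)}\n"
            f"• Status: Investigating\n"
            f"• SLA: Resolve within {sla} minutes"
        ),
    }

rb = generate_runbook("memory_leak", "P1", ["payments-api"], ["prod-server-01"])
```

Keeping the phase order in a tuple guarantees every runbook walks Triage → Mitigation → Investigation → Resolution, regardless of incident type.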
Contributing
Contributions are welcome! Ideas for new integrations:
- [ ] OpsGenie support
- [ ] Datadog metrics
- [ ] Kubernetes events via kubectl
- [ ] Jira ticket creation
- [ ] Email notifications
Author
Ehab Muhammed — DevOps Engineer. GitHub: @muhammedehab35
License
MIT © 2026 Ehab Muhammed