✈️ PilotOps MCP
AI-powered Incident Response Autopilot for DevOps & SRE teams
Connect Claude AI to your entire monitoring stack and respond to incidents in natural language — no more jumping between 5 different tools at 3am.
The Problem
When an incident fires at 3am, an SRE must manually:
| Step | Tool | Time |
|------|------|------|
| Check alerts | Prometheus | 2 min |
| Analyze metrics | Grafana | 5 min |
| Search logs | Loki / ELK | 10 min |
| Diagnose root cause | Brain | 15 min |
| Write runbook | Notion / Confluence | 10 min |
| Page on-call | PagerDuty | 2 min |
| Notify team | Slack | 2 min |
| **Total** | **7 tools** | **~46 min** |
The Solution
With PilotOps MCP, you just tell Claude:
"There's an alert on prod, investigate and generate a runbook"
And Claude handles everything in under 2 minutes.
How It Works
```
┌─────────────────────────────────────────────────────────────┐
│                    You (Claude Desktop)                     │
│      "Investigate the active alert on prod-server-01"       │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                     PilotOps MCP Server                     │
│                                                             │
│  1. prometheus_get_active_alerts()                          │
│     → CPU 95% on prod-server-01 for 10 min                  │
│                                                             │
│  2. prometheus_get_metrics("node_cpu...")                   │
│     → Spike started at 22:15, still climbing                │
│                                                             │
│  3. loki_get_logs('{host="prod-server-01"}')                │
│     → 847 errors: "OOM Killer activated"                    │
│                                                             │
│  4. analyze_incident(alerts, metrics, logs)                 │
│     → P1 | Memory leak in payments-api | Confidence: HIGH   │
│                                                             │
│  5. generate_runbook("memory_leak", "P1")                   │
│     → 4-phase runbook generated                             │
│                                                             │
│  6. pagerduty_create_incident("P1: Memory leak")            │
│     → On-call engineer paged                                │
│                                                             │
│  7. slack_notify("#incidents", severity="critical")         │
│     → Team notified with communication template             │
│                                                             │
│  8. grafana_create_annotation("[P1 START] 22:15")           │
│     → Incident marked on all dashboards                     │
└─────────────────────────────────────────────────────────────┘
```
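The eight steps above can be sketched as a plain-Python pipeline. This is an illustrative sketch only: the stub functions stand in for the real MCP tool calls (which talk to live Prometheus, Loki, PagerDuty, and Slack APIs), and their return values are canned sample data.

```python
# Hypothetical sketch of the investigation pipeline Claude drives through
# the MCP tools. Stubs return canned data; the real tools call live APIs.

def get_active_alerts(host):
    return [{"alertname": "HighMemoryUsage", "host": host, "state": "firing"}]

def get_metrics(host):
    return {"memory_pct": 95, "cpu_pct": 40}

def get_logs(host):
    return ["kernel: Out of memory: OOM Killer activated"]

def analyze(alerts, metrics, logs):
    # Correlate signals: high memory plus OOM log lines -> memory leak.
    oom = any("oom" in line.lower() for line in logs)
    if metrics["memory_pct"] > 85 and oom:
        return {"type": "memory_leak", "severity": "P1", "confidence": "HIGH"}
    return {"type": "unknown", "severity": "P3", "confidence": "LOW"}

def investigate(host):
    alerts = get_active_alerts(host)             # 1. fetch firing alerts
    metrics = get_metrics(host)                  # 2. query metrics
    logs = get_logs(host)                        # 3. pull recent logs
    diagnosis = analyze(alerts, metrics, logs)   # 4. correlate
    # Steps 5-8 (runbook, PagerDuty, Slack, Grafana annotation) would
    # follow here, driven by the diagnosis.
    return diagnosis
```

Running `investigate("prod-server-01")` against this sample data yields a P1 memory-leak diagnosis, mirroring the flow in the diagram.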
Features
- 12 MCP Tools across 5 integrations
- AI Correlation Engine — matches alerts + metrics + logs against 7 incident patterns
- Auto Runbook Generator — produces 4-phase runbooks (Triage → Mitigation → Investigation → Resolution)
- Slack Communication Templates — ready-to-send status updates
- Full Docker Demo Stack — simulate real incidents locally with 1 command
- Zero vendor lock-in — works with any Prometheus-compatible stack
Tools Reference
Prometheus
| Tool | Description |
|------|-------------|
| prometheus_get_active_alerts | Fetch all firing alerts with severity, labels, and annotations |
| prometheus_get_metrics | Query any PromQL expression with time range |
| prometheus_silence_alert | Silence an alert for a specified duration |
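`prometheus_get_active_alerts` presumably wraps Prometheus's `/api/v1/alerts` HTTP endpoint. A minimal sketch of parsing that response shape (the sample payload below is illustrative, not real output):

```python
def parse_firing_alerts(payload: dict) -> list[dict]:
    """Extract firing alerts from a Prometheus /api/v1/alerts response."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        {
            "name": a["labels"].get("alertname", "unknown"),
            "severity": a["labels"].get("severity", "none"),
            "summary": a.get("annotations", {}).get("summary", ""),
        }
        for a in alerts
        if a.get("state") == "firing"  # skip alerts still in "pending"
    ]

# Illustrative sample payload in the /api/v1/alerts response format.
sample = {
    "status": "success",
    "data": {"alerts": [
        {"state": "firing",
         "labels": {"alertname": "HighCPUUsage", "severity": "critical"},
         "annotations": {"summary": "CPU > 80% on prod-server-01"}},
        {"state": "pending",
         "labels": {"alertname": "SlowResponseTime", "severity": "warning"},
         "annotations": {}},
    ]},
}
```

Only the firing `HighCPUUsage` alert survives the filter; the pending one is dropped.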
Grafana
| Tool | Description |
|------|-------------|
| grafana_get_dashboards | List and search available dashboards |
| grafana_create_annotation | Mark incident start/end on dashboards for post-mortem |
Loki
| Tool | Description |
|------|-------------|
| loki_get_logs | Query logs via LogQL with level filtering and error detection |
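Under the assumption that `loki_get_logs` uses Loki's `/loki/api/v1/query_range` endpoint, the request it builds might look like this (the helper name and defaults are illustrative):

```python
import time
from urllib.parse import urlencode

def build_loki_query(base_url: str, selector: str,
                     minutes: int = 15, limit: int = 1000) -> str:
    """Build a Loki query_range URL for recent logs.

    `selector` is a LogQL expression, e.g. '{host="prod-server-01"} |= "error"'.
    Loki expects start/end as nanosecond Unix timestamps.
    """
    end_ns = int(time.time() * 1e9)
    start_ns = end_ns - minutes * 60 * 10**9
    params = urlencode({
        "query": selector,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    })
    return f"{base_url}/loki/api/v1/query_range?{params}"

url = build_loki_query("http://localhost:3100",
                       '{host="prod-server-01"} |= "error"')
```

The LogQL selector is percent-encoded into the `query` parameter; fetching the URL returns a JSON body with log streams and their entries.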
PagerDuty
| Tool | Description |
|------|-------------|
| pagerduty_get_incidents | List open incidents by status |
| pagerduty_create_incident | Create P1-P4 incident and page on-call |
| pagerduty_update_incident | Acknowledge or resolve with timeline note |
Slack
| Tool | Description |
|------|-------------|
| slack_notify | Send color-coded alert with severity emoji |
AI Core
| Tool | Description |
|------|-------------|
| analyze_incident | Correlates alerts + metrics + logs → root cause + confidence |
| generate_runbook | Generates structured 4-phase runbook with Slack template |
Supported Incident Types
| Type | Trigger | Pattern |
|------|---------|---------|
| memory_leak | OOM kills, heap growth | Memory > 85% + OOM logs |
| high_cpu | CPU saturation | CPU > 80% sustained |
| disk_full | Disk space exhaustion | No space left errors |
| network_issue | Connectivity problems | Timeouts + packet loss |
| database_issue | DB overload / deadlocks | Slow queries + connection pool |
| service_crash | App crash / restart loop | Segfault + panic logs |
| deployment_issue | Failed K8s rollout | CrashLoopBackOff + ImagePull |
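The pattern column above can be encoded as simple predicates over the collected signals. A sketch of how such a correlation table might be matched (thresholds come from the table; the structure and function names are illustrative, not the project's actual implementation):

```python
# Each entry pairs an incident type with a predicate over (metrics, logs).
# Order matters: more specific patterns are checked first.
PATTERNS = [
    ("memory_leak", lambda m, logs: m.get("memory_pct", 0) > 85
                    and any("oom" in line.lower() for line in logs)),
    ("high_cpu",    lambda m, logs: m.get("cpu_pct", 0) > 80),
    ("disk_full",   lambda m, logs: any("no space left" in line.lower()
                                        for line in logs)),
]

def classify(metrics: dict, logs: list[str]) -> str:
    """Return the first incident type whose pattern matches."""
    for name, predicate in PATTERNS:
        if predicate(metrics, logs):
            return name
    return "unknown"

classify({"memory_pct": 92}, ["OOM Killer activated"])  # -> "memory_leak"
```

Checking the most specific patterns first avoids, for example, an OOM-driven CPU spike being misclassified as plain `high_cpu`.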
Tech Stack
- Language: Python 3.11+
- MCP Server: FastMCP (official Anthropic SDK)
- Metrics: Prometheus + Alertmanager
- Dashboards: Grafana
- Logs: Loki + Promtail
- Incidents: PagerDuty
- Alerts: Slack
- Containers: Docker + Docker Compose
Quick Start
Prerequisites
- Python 3.11+
- Docker & Docker Compose
- Claude Desktop
1. Clone & install
```bash
git clone https://github.com/muhammedehab35/PILOT_OPS-MCP.git
cd PILOT_OPS-MCP
pip install -r requirements.txt
```
2. Configure
```bash
cp .env.example .env

# Minimum required for local demo
PROMETHEUS_URL=http://localhost:9090
GRAFANA_URL=http://localhost:3000
GRAFANA_API_KEY=your_grafana_api_key
LOKI_URL=http://localhost:3100

# Optional: for full incident workflow
PAGERDUTY_API_KEY=your_pagerduty_key
PAGERDUTY_SERVICE_ID=PXXXXXX
SLACK_BOT_TOKEN=xoxb-your-slack-token
SLACK_DEFAULT_CHANNEL=#incidents
```
3. Launch the full demo stack
```bash
cd docker
docker-compose up -d
```
| Service | URL | Credentials |
|---------|-----|-------------|
| Demo App | http://localhost:8080 | — |
| Prometheus | http://localhost:9090 | — |
| Alertmanager | http://localhost:9093 | — |
| Grafana | http://localhost:3000 | admin / admin123 |
| Loki | http://localhost:3100 | — |
4. Trigger a real incident
```bash
# CPU spike → fires HighCPUUsage alert after 30s
curl -X POST http://localhost:8080/simulate/cpu-spike

# Memory leak → fires HighMemoryUsage alert after 30s
curl -X POST http://localhost:8080/simulate/memory-leak

# High error rate → fires HighErrorRate alert after 30s
curl -X POST http://localhost:8080/simulate/high-errors

# Slow responses → fires SlowResponseTime alert after 30s
curl -X POST http://localhost:8080/simulate/slow-response

# Reset all incidents
curl -X POST http://localhost:8080/simulate/reset
```
5. Connect to Claude Desktop
Add to `%APPDATA%\Claude\claude_desktop_config.json` (Windows) or `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS):

```json
{
  "mcpServers": {
    "pilotops": {
      "command": "python",
      "args": ["/full/path/to/PILOT_OPS-MCP/server.py"],
      "env": {
        "PROMETHEUS_URL": "http://localhost:9090",
        "GRAFANA_URL": "http://localhost:3000",
        "GRAFANA_API_KEY": "your_key",
        "LOKI_URL": "http://localhost:3100",
        "PAGERDUTY_API_KEY": "your_key",
        "SLACK_BOT_TOKEN": "your_token"
      }
    }
  }
}
```
Restart Claude Desktop → look for the 🔨 hammer icon in the chat bar.
6. Run your first incident response
```
You:    "There's an active alert on prod, investigate and generate a runbook"

Claude: → Fetching active alerts from Prometheus...
        → Querying CPU and memory metrics...
        → Pulling last 15 minutes of error logs from Loki...
        → Analyzing correlation...
        → [P1] Memory leak detected in payments-api (confidence: HIGH)
        → Generating runbook...
        → Creating PagerDuty incident #42...
        → Notifying #incidents on Slack...

✅ Full incident response completed in 45 seconds.
```
Project Structure
```
PILOT_OPS-MCP/
├── server.py              # FastMCP server — registers all 12 tools
├── config.py              # Pydantic settings — loads from .env
├── requirements.txt
├── .env.example
│
├── tools/                 # One file per integration
│   ├── prometheus.py      # get_alerts, get_metrics, silence
│   ├── grafana.py         # dashboards, annotations
│   ├── loki.py            # log queries via LogQL
│   ├── pagerduty.py       # create / update incidents
│   └── slack.py           # team notifications
│
├── core/                  # AI intelligence layer
│   ├── correlator.py      # Pattern-matching correlation engine
│   └── runbook.py         # 4-phase runbook generator (7 types)
│
└── docker/                # Full local demo environment
    ├── docker-compose.yml
    ├── demo-app/          # Flask app — simulates real incidents
    │   ├── app.py         # /simulate/* endpoints + Prometheus metrics
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── prometheus/
    │   ├── prometheus.yml # Scrape config
    │   └── alerts.yml     # 5 alert rules
    ├── grafana/
    │   ├── provisioning/  # Auto-configured datasources
    │   └── dashboards/    # Pre-built infrastructure dashboard
    ├── loki/loki-config.yml
    ├── promtail/promtail-config.yml
    └── alertmanager/alertmanager.yml
```
Example Runbook Output
```
📋 RUNBOOK: Memory Leak / OOM Incident
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Severity : P1 | SLA: 15 minutes
Services : payments-api
Hosts    : prod-server-01

PHASE 1 — TRIAGE
1. Confirm memory usage: free -h or Grafana memory dashboard
2. Identify top memory consumers: ps aux --sort=-%mem | head -20
3. Check OOM kills: dmesg | grep -i 'oom'

PHASE 2 — MITIGATION
1. Restart the affected service to free memory immediately
2. Enable memory limits (K8s: resources.limits.memory)
3. Set up swap if not present

PHASE 3 — INVESTIGATION
1. Collect heap dump (JVM: jmap, Go: pprof)
2. Review recent code changes for memory regressions
3. Check GC logs for anomalies

PHASE 4 — RESOLUTION
1. Deploy fix or roll back the problematic version
2. Verify memory returns to baseline
3. Resolve PagerDuty + post-mortem

💬 SLACK TEMPLATE:
[P1 INCIDENT] Memory Leak / OOM
• Affected: payments-api
• Hosts: prod-server-01
• Status: Investigating
• SLA: Resolve within 15 minutes
• Next update: In 15 minutes
```
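Structurally, a runbook like the one above could come from a generator along these lines. This is a sketch, not the project's `core/runbook.py`: the exact signature, the P2–P4 SLA values, and the step wording are assumptions (only the P1/15-minute SLA and the memory-leak steps appear in the example above).

```python
# Assumed SLA table: only the P1 value (15 min) is taken from the example.
SLA_MINUTES = {"P1": 15, "P2": 30, "P3": 60, "P4": 240}
PHASES = ("TRIAGE", "MITIGATION", "INVESTIGATION", "RESOLUTION")

# Per-incident-type step templates (memory_leak shown; the project covers 7).
STEPS = {
    "memory_leak": {
        "TRIAGE": ["Confirm memory usage (free -h / Grafana)",
                   "Identify top consumers (ps aux --sort=-%mem)"],
        "MITIGATION": ["Restart the affected service", "Enable memory limits"],
        "INVESTIGATION": ["Collect heap dump", "Review recent code changes"],
        "RESOLUTION": ["Deploy fix or roll back", "Verify memory baseline"],
    },
}

def generate_runbook(incident_type, severity, services, hosts):
    """Assemble a 4-phase runbook plus a ready-to-send Slack template."""
    sla = SLA_MINUTES[severity]
    return {
        "title": f"{severity} {incident_type} on {', '.join(hosts)}",
        "sla_minutes": sla,
        "phases": [(phase, STEPS[incident_type][phase]) for phase in PHASES],
        "slack_template": (
            f"[{severity} INCIDENT] {incident_type}\n"
            f"• Affected: {', '.join(services)}\n"
            f"• Status: Investigating\n"
            f"• SLA: Resolve within {sla} minutes"
        ),
    }

rb = generate_runbook("memory_leak", "P1", ["payments-api"], ["prod-server-01"])
```

Keeping the phase order in a tuple guarantees every runbook walks Triage → Mitigation → Investigation → Resolution, regardless of incident type.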
Contributing
Contributions are welcome! Ideas for new integrations:
- [ ] OpsGenie support
- [ ] Datadog metrics
- [ ] Kubernetes events via kubectl
- [ ] Jira ticket creation
- [ ] Email notifications
Author
Ehab Muhammed — DevOps Engineer. GitHub: @muhammedehab35
License
MIT © 2026 Ehab Muhammed