DocScout-MCP

Give your AI assistant a reliable map of your entire GitHub organization.

An MCP server written in Go that continuously scans your GitHub org, builds a persistent knowledge graph from manifests and docs, and exposes it to Claude, Cursor, Copilot, Gemini CLI, and any other MCP-compatible AI client, so architecture answers come from indexed facts rather than guesses.

The Problem

Your AI assistant knows nothing about your internal services. Every time you ask "which teams own the payment service?" or "what breaks if I take down the DB?", it either hallucinates or burns tokens scanning dozens of repos.

DocScout-MCP solves this by pre-computing the answer graph and serving it deterministically over MCP.

How It Works

graph LR
    GH["GitHub Org\n(repos, manifests, docs)"]
    S["Scanner\n(concurrent, retry-safe)"]
    P["Parsers\ngo.mod · pom.xml · package.json\nCODEOWNERS · catalog-info.yaml\nDockerfile · Helm · Terraform · OpenAPI"]
    G["Knowledge Graph\nSQLite · PostgreSQL"]
    AI["AI Clients\nClaude · Cursor · Copilot · Gemini"]

    GH -->|"GitHub API + Webhooks"| S
    S --> P
    P -->|"entities + relations"| G
    G -->|"23 MCP tools"| AI

Scan: Crawls every repo in your org, covering docs, manifests, infra files, and root tooling files. Repeats on a configurable interval (SCAN_INTERVAL, default 30m) and reacts to GitHub webhooks for near-immediate updates.
Parse: Extracts services, owners, dependencies, and relations from go.mod, pom.xml, package.json, CODEOWNERS, catalog-info.yaml, and more.
Graph: Persists everything as entities and relations in SQLite or PostgreSQL. The graph survives restarts when DATABASE_URL points at a SQLite file or a PostgreSQL database; the default in-memory SQLite does not.
Answer: AI clients query the graph via 23 MCP tools. No file-reading loops, no token waste, no guessing.
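To make the Parse and Graph steps concrete, here is a minimal sketch of turning a go.mod file into one entity plus depends_on relations. The Entity and Relation types and the parseGoMod function are hypothetical names chosen for illustration, and the line-based scan is a simplification; DocScout's actual parser and storage types may look different.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Entity and Relation are hypothetical record shapes, not DocScout's real types.
type Entity struct {
	Name         string
	Type         string
	Observations []string
}

type Relation struct {
	From, Type, To string
}

// parseGoMod scans go.mod text line by line and returns the module as an
// entity plus one depends_on relation per direct dependency.
func parseGoMod(src string) (Entity, []Relation) {
	ent := Entity{Type: "service", Observations: []string{"_source:go.mod"}}
	var deps []string
	inRequire := false

	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "//") {
			continue
		}
		switch {
		case strings.HasPrefix(line, "module "):
			ent.Name = strings.TrimSpace(strings.TrimPrefix(line, "module "))
		case strings.HasPrefix(line, "go "):
			ent.Observations = append(ent.Observations,
				"go_version:"+strings.TrimSpace(strings.TrimPrefix(line, "go ")))
		case line == "require (":
			inRequire = true
		case line == ")":
			inRequire = false
		case inRequire || strings.HasPrefix(line, "require "):
			// Lines marked "// indirect" are transitive, so skip them.
			if strings.Contains(line, "// indirect") {
				continue
			}
			fields := strings.Fields(strings.TrimPrefix(line, "require "))
			if len(fields) >= 2 {
				deps = append(deps, fields[0])
			}
		}
	}

	// Build relations last so the module name is known even if it
	// appeared after a require line.
	rels := make([]Relation, 0, len(deps))
	for _, d := range deps {
		rels = append(rels, Relation{From: ent.Name, Type: "depends_on", To: d})
	}
	return ent, rels
}

func main() {
	// Made-up sample input for illustration.
	src := `module github.com/myorg/payment-service

go 1.26

require (
	github.com/jackc/pgx/v5 v5.7.0
	golang.org/x/sync v0.10.0 // indirect
)
`
	ent, rels := parseGoMod(src)
	fmt.Println(ent.Name, ent.Observations)
	for _, r := range rels {
		fmt.Println(r.From, r.Type, r.To)
	}
}
```

A production parser would more likely rely on a dedicated go.mod library than on string prefixes; the sketch only shows the shape of the data that ends up in the graph.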

Why DocScout?

| Approach | Accuracy | Token Cost | Setup |
| --- | --- | --- | --- |
| AI reads files raw | Hallucination-prone | ~27,000/question | None |
| Backstage catalog | High (manual) | Medium | Heavy (infra team) |
| DocScout-MCP | Verified (F1 1.00) | ~290/question | 5 minutes |

DocScout pre-computes the answer graph from your repos so your AI never reads files to answer architecture questions. See benchmark/RESULTS.md for methodology.

See It In Action

"What happens if I shut down component:db? Which systems go offline, and who do I notify?"

→ search_nodes("component:db")
  Found: component:db — incoming edge: payment-service depends_on

→ open_nodes(["payment-service"])
  Entity: payment-service (service)
  Observations: _source:go.mod, go_version:1.26, _scan_repo:myorg/payment-service

→ search_nodes("payments-team")
  Entity: payments-team (team)
  Observations: github_handle:@myorg/payments-team
  Relations: payments-team → owns → payment-service

Claude: "Shutting down component:db will impact payment-service.
         Notify @myorg/payments-team. No other services have a direct dependency."

The AI answers from verified graph facts — not file naming conventions or guesses.
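The walkthrough above amounts to a breadth-first walk over reversed depends_on edges, followed by an owns lookup. The sketch below illustrates that idea on an in-memory slice of relations. The Relation type and the impacted and owners functions are hypothetical, and checkout-service is made-up sample data added to show a transitive dependent; none of this is DocScout's actual traversal code.

```go
package main

import "fmt"

// Relation is a hypothetical edge shape: From --Type--> To.
type Relation struct {
	From, Type, To string
}

// impacted returns every entity that depends on target, directly or
// transitively, in breadth-first order.
func impacted(rels []Relation, target string) []string {
	// Reverse adjacency: dependency -> entities that depend on it.
	dependents := map[string][]string{}
	for _, r := range rels {
		if r.Type == "depends_on" {
			dependents[r.To] = append(dependents[r.To], r.From)
		}
	}

	seen := map[string]bool{target: true}
	queue := []string{target}
	var out []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, d := range dependents[cur] {
			if !seen[d] {
				seen[d] = true
				out = append(out, d)
				queue = append(queue, d)
			}
		}
	}
	return out
}

// owners returns the distinct sources of owns edges pointing at any of
// the given entities, in the order they appear in rels.
func owners(rels []Relation, entities []string) []string {
	want := map[string]bool{}
	for _, e := range entities {
		want[e] = true
	}
	seen := map[string]bool{}
	var out []string
	for _, r := range rels {
		if r.Type == "owns" && want[r.To] && !seen[r.From] {
			seen[r.From] = true
			out = append(out, r.From)
		}
	}
	return out
}

func main() {
	rels := []Relation{
		{"payment-service", "depends_on", "component:db"},
		{"checkout-service", "depends_on", "payment-service"}, // made-up transitive dependent
		{"payments-team", "owns", "payment-service"},
	}
	hit := impacted(rels, "component:db")
	fmt.Println("impacted:", hit)
	fmt.Println("notify:", owners(rels, hit))
}
```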

Quick Start

1. Get a Fine-Grained GitHub PAT

Go to GitHub → Settings → Developer settings → Personal access tokens → Fine-grained tokens. Grant read-only access to Contents and Metadata for your org's repositories.

2. Add to Your AI Client

Claude CLI (recommended):

claude mcp add --transport stdio \
  --env GITHUB_TOKEN=github_pat_... \
  --env GITHUB_ORG=my-org \
  docscout-mcp -- go run github.com/doc-scout/mcp-server@latest

Or build and run locally:

git clone https://github.com/doc-scout/mcp-server
cd mcp-server

GITHUB_TOKEN="github_pat_..." GITHUB_ORG="my-org" go run .

Docker:

docker run -i \
  -e GITHUB_TOKEN="github_pat_..." \
  -e GITHUB_ORG="my-org" \
  ghcr.io/doc-scout/mcp-server:latest

3. Ask Away

"Which services depend on the billing library?" "Who owns the checkout service?" "List all repos with a Helm chart." "What Go services have direct dependencies on pgx?"

MCP Tools (23)

| Category | Tool | What it does |
| --- | --- | --- |
| Scanner | list_repos | All repos with indexed files, filterable by type |
| | search_docs | Search file paths and repo names |
| | get_file_content | Raw content of any indexed file (path-traversal protected) |
| | get_scan_status | Scanner state, last scan time, cache size |
| | trigger_scan | Queue an immediate full scan without waiting for next interval |
| | search_content | Full-text search across cached docs (SCAN_CONTENT=true) |
| Knowledge Graph | create_entities | Add nodes to the graph |
| | create_relations | Add directed edges between nodes |
| | add_observations | Append facts to existing entities |
| | update_entity | Rename an entity or change its type atomically |
| | read_graph | Return the full graph |
| | list_entities | List all entities, optionally filtered by type |
| | list_relations | List relations, filtered by type and/or source entity |
| | search_nodes | Search by name, type, or observation |
| | open_nodes | Retrieve entities with their relations |
| | traverse_graph | BFS traversal: impact analysis, dependency chains |
| | find_path | Shortest connection path between two entities |
| | get_integration_map | Full integration topology of a service in one call |
| | delete_entities | Remove entities (> 10 requires confirm: true) |
| | delete_observations | Remove specific facts |
| | delete_relations | Remove specific edges |
| Observability | get_usage_stats | Per-tool call counts + top 20 most-fetched docs |
| Semantic Search | semantic_search | Natural-language vector search (requires embedding provider) |
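MCP clients invoke these tools with JSON-RPC 2.0 requests whose method is tools/call. As a rough illustration, the sketch below builds such a request for search_nodes. The request and callParams types and the buildToolCall helper are hypothetical, and the argument name query is an assumption; check the tool's published input schema for the real parameter names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// callParams and request mirror the general shape of an MCP tools/call
// message. They are illustrative types, not taken from DocScout.
type callParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

type request struct {
	JSONRPC string     `json:"jsonrpc"`
	ID      int        `json:"id"`
	Method  string     `json:"method"`
	Params  callParams `json:"params"`
}

// buildToolCall serializes a tools/call request for the named tool.
func buildToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	return json.Marshal(request{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  callParams{Name: tool, Arguments: args},
	})
}

func main() {
	// "query" is an assumed argument name for search_nodes.
	b, err := buildToolCall(1, "search_nodes", map[string]any{"query": "component:db"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

In practice the AI client builds and sends these messages for you; the sketch is only meant to show what travels over the transport.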

What Gets Scanned

Root-level manifests (extracted into the knowledge graph):

| File | Extracts |
| --- | --- |
| catalog-info.yaml | Backstage entity, lifecycle, owner, relations |
| go.mod | Module path, Go version, direct dependencies |
| package.json | Package name, version, runtime dependencies |
| pom.xml | Maven artifact, version, compile/runtime deps |
| CODEOWNERS | Team and person ownership per repo |
| Dockerfile, Makefile, docker-compose.yml, .mise.toml | Tooling presence |
| README.md, openapi.yaml, swagger.json | Documentation surface |

Recursive directories: docs/ and .agents/ (.md files) · deploy/, infra/, .github/workflows/ (Helm, Terraform, K8s, workflows)
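One way to picture these rules is as a path filter: root-level files match by exact name, while files under the recursive directories match by prefix. The sketch below is a guess at that logic, not the scanner's real implementation. The shouldScan function is hypothetical, and accepting every file under deploy/, infra/, and .github/workflows/ is a simplifying assumption, since the real scanner presumably filters those by file type.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// rootFiles lists the root-level file names from the table above.
var rootFiles = map[string]bool{
	"catalog-info.yaml": true, "go.mod": true, "package.json": true,
	"pom.xml": true, "CODEOWNERS": true, "Dockerfile": true,
	"Makefile": true, "docker-compose.yml": true, ".mise.toml": true,
	"README.md": true, "openapi.yaml": true, "swagger.json": true,
}

// markdownDirs are scanned for .md files only; infraDirs accept any
// file in this sketch (an assumption, see the lead-in).
var (
	markdownDirs = []string{"docs/", ".agents/"}
	infraDirs    = []string{"deploy/", "infra/", ".github/workflows/"}
)

// shouldScan reports whether a repo-relative path falls under the
// scanning rules described above.
func shouldScan(p string) bool {
	p = path.Clean(p)
	if !strings.Contains(p, "/") {
		return rootFiles[p] // root-level file: exact name match
	}
	for _, d := range markdownDirs {
		if strings.HasPrefix(p, d) {
			return strings.HasSuffix(p, ".md")
		}
	}
	for _, d := range infraDirs {
		if strings.HasPrefix(p, d) {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{
		"go.mod", "cmd/go.mod", "docs/arch/overview.md",
		"docs/logo.png", "infra/main.tf", "src/main.go",
	} {
		fmt.Println(p, shouldScan(p))
	}
}
```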

Key Configuration

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| GITHUB_TOKEN | ✅ | (none) | Fine-grained PAT (read-only Contents + Metadata) |
| GITHUB_ORG | ✅ | (none) | GitHub org or username |
| SCAN_INTERVAL | ❌ | 30m | Re-scan interval (10s, 5m, 1h) |
| DATABASE_URL | ❌ | in-memory SQLite | sqlite://path.db or postgres://... |
| HTTP_ADDR | ❌ | (none) | Enable HTTP transport at this address (e.g. :8080) |
| SCAN_CONTENT | ❌ | false | Cache file contents for full-text search |
| GITHUB_WEBHOOK_SECRET | ❌ | (none) | Enable incremental scans on push events |

See the full environment variable reference for all options, including SCAN_FILES, SCAN_DIRS, REPO_TOPICS, REPO_REGEX, EXTRA_REPOS, and more.
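For readers wondering how the key variables fit together, the following sketch loads them with the documented defaults. The Config struct and loadConfig function are hypothetical names, and the parsing choices (time.ParseDuration for SCAN_INTERVAL, strconv.ParseBool for SCAN_CONTENT) are assumptions about reasonable behavior rather than a description of DocScout's code.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"time"
)

// Config is a hypothetical holder for the key variables in the table above.
type Config struct {
	Token         string
	Org           string
	ScanInterval  time.Duration
	DatabaseURL   string // empty means in-memory SQLite
	HTTPAddr      string // empty means HTTP transport is disabled
	ScanContent   bool
	WebhookSecret string
}

// loadConfig reads settings through getenv so tests can inject values.
// A real program would pass os.Getenv.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		Token:         getenv("GITHUB_TOKEN"),
		Org:           getenv("GITHUB_ORG"),
		ScanInterval:  30 * time.Minute, // documented default
		DatabaseURL:   getenv("DATABASE_URL"),
		HTTPAddr:      getenv("HTTP_ADDR"),
		WebhookSecret: getenv("GITHUB_WEBHOOK_SECRET"),
	}
	if cfg.Token == "" || cfg.Org == "" {
		return Config{}, errors.New("GITHUB_TOKEN and GITHUB_ORG are required")
	}
	if v := getenv("SCAN_INTERVAL"); v != "" {
		d, err := time.ParseDuration(v) // accepts values like 10s, 5m, 1h
		if err != nil {
			return Config{}, fmt.Errorf("SCAN_INTERVAL: %w", err)
		}
		cfg.ScanInterval = d
	}
	if v := getenv("SCAN_CONTENT"); v != "" {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return Config{}, fmt.Errorf("SCAN_CONTENT: %w", err)
		}
		cfg.ScanContent = b
	}
	return cfg, nil
}

func main() {
	// Sample values stand in for the process environment.
	env := map[string]string{
		"GITHUB_TOKEN":  "github_pat_...",
		"GITHUB_ORG":    "my-org",
		"SCAN_INTERVAL": "5m",
	}
	cfg, err := loadConfig(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("scanning %s every %s\n", cfg.Org, cfg.ScanInterval)
}
```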

AI Client Setup

| Client | Guide |
| --- | --- |
| Claude Desktop / CLI | docs/claude.md |
| VS Code (Copilot Chat) | docs/vscode.md |
| GitHub Copilot | docs/copilot.md |
| Antigravity (Google) | docs/antigravity.md |
| Gemini CLI | docs/gemini.md |
| ChatGPT Desktop | docs/chatgpt.md |

Architecture & Security

Path-traversal protection: Only files verified by the scanner are accessible. The AI cannot read arbitrary files.
STDIO safety: In stdio mode, stdout carries only JSON-RPC messages and all logs go to stderr, so log output cannot corrupt the protocol stream.
Rate limit resilience: Every GitHub API call uses exponential backoff with smart Retry-After handling.
Graph integrity: Observations are sanitized before storage. Mass deletions (> 10 entities) require explicit confirmation.
Audit log: Every graph mutation emits a structured slog line to stderr.
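The rate-limit point can be sketched as a delay function: honor the server's Retry-After value when one is given, otherwise double the wait on each attempt up to a cap. The retryDelay function and the baseDelay and maxDelay constants below are hypothetical, and the values are illustrative. The sketch handles only the seconds form of Retry-After (the header may also carry an HTTP date) and leaves out jitter.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// Illustrative values; the real client may use different ones.
const (
	baseDelay = time.Second
	maxDelay  = time.Minute
)

// retryDelay returns how long to wait before retry number attempt
// (0-based). retryAfter is the raw Retry-After header value, for
// example resp.Header.Get("Retry-After"); pass "" when it is absent.
func retryDelay(retryAfter string, attempt int) time.Duration {
	// Prefer the server's hint when it is a non-negative number of seconds.
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	// Otherwise fall back to exponential backoff, capped at maxDelay.
	// Doubling in a loop avoids the overflow a bit shift could cause.
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Println(attempt, retryDelay("", attempt))
	}
	fmt.Println("server hint:", retryDelay("120", 0))
}
```

Adding random jitter to the fallback delay is a common refinement when many workers retry at once.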

For a deep dive, see How It Works.

Roadmap

See ROADMAP.md for completed features and upcoming work, including:

Semantic Search & RAG — vector embeddings via pgvector
Custom Parser Extensions — plug in new manifest formats without forking
Integration Topology Discovery — Kafka, gRPC, HTTP call graph from config files
Multi-Cloud Adapters — GitLab, Bitbucket, Confluence
Documentation Wiki (gh-pages) — move the detailed guides to a dedicated GitHub Pages site

Contributing

# Install dependencies
go mod tidy

# Build
go build -o docscout-mcp .

# Test (unit + E2E integration)
go test ./...

Review the Development Guidelines and AGENTS.md before submitting a PR.

License

GNU AGPL v3

Disclaimer

This software is provided "as is", without warranty of any kind. AI-generated output depends on indexed repository data — always verify before acting on it. See DISCLAIMER.md for full details.
