mpaas-docs

MCP server for semantic search over Alipay+ Mini Program documentation. It combines hybrid search (vector + FTS5) with graph navigation and exposes MCP tools, prompts, and resources for AI assistants.

Pipeline

crawl ──► extract ──► parse ──► migrate ──► load ──► serve
  │          │           │          │          │        │
  ▼          ▼           ▼          ▼          ▼        ▼
graph.json state.json  *.md        SQLite    chunks+   FastMCP
(URL-graph) (URL list) (markdown)  schema    embed-    server
                                             dings     (tools)

Source docs: Alipay+ Mini Program — ~170 pages covering Quick Start, Mini Program Studio, IAPMiniProgram SDK (Android/iOS/Flutter/React Native), OpenAPIs, JSAPI reference, Capabilities, Framework, Custom Components, and antd-mini extended components.

Quick start

uv pip install -e ".[dev]"
playwright install chromium

Full pipeline

Downloads, chunks, embeds, and indexes all documentation:

mpaas-docs pipeline

Skip completed steps on re-run:

mpaas-docs pipeline --skip-crawl --skip-extract --skip-parse
mpaas-docs pipeline --skip-migrate --skip-load
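
The effect of the skip flags can be pictured as filtering an ordered list of stages. The following is only a sketch: the `PIPELINE_STAGES` list and the `select_stages` helper are hypothetical names chosen to mirror the flags above, not the project's actual code.

```python
# Hypothetical sketch: stage names mirror the --skip-* flags shown above.
PIPELINE_STAGES = ["crawl", "extract", "parse", "migrate", "load"]


def select_stages(skip: set) -> list:
    """Return the stages that still need to run, in pipeline order."""
    unknown = skip - set(PIPELINE_STAGES)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    return [stage for stage in PIPELINE_STAGES if stage not in skip]


# Roughly what `pipeline --skip-crawl --skip-extract --skip-parse` leaves to run.
print(select_stages({"crawl", "extract", "parse"}))
```

Under this assumption, skipping the first three stages leaves only `migrate` and `load`, which is the typical re-run after the markdown files are already on disk.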

Step by step

| Command | What it does | Produces |
|---------|-------------|----------|
| mpaas-docs crawl | Scrapes sidebar URL graph via Playwright | data/raw/graph.json |
| mpaas-docs extract | Flattens graph into page list | data/raw/state.json |
| mpaas-docs parse | Downloads each page as markdown | data/raw/md/**/*.md |
| mpaas-docs migrate | Creates SQLite tables (pages, chunks, fts5, vec0, graph) | data/sqlite/docs.db |
| mpaas-docs load | Chunks text + generates embeddings + builds graph | Populates DB |
| mpaas-docs serve | Starts MCP server | - |
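
To illustrate what the extract step does conceptually, here is a minimal sketch that flattens a nested sidebar graph into a de-duplicated URL list. The node shape (`title`, `url`, `children`) and the example paths are assumptions made for illustration; the real graph.json layout may differ.

```python
from typing import Any, Optional


def flatten_graph(node: dict, seen: Optional[set] = None) -> list:
    """Depth-first walk that collects each page URL once, in sidebar order."""
    if seen is None:
        seen = set()
    urls = []
    url: Any = node.get("url")
    if url and url not in seen:
        seen.add(url)
        urls.append(url)
    for child in node.get("children", []):
        urls.extend(flatten_graph(child, seen))
    return urls


# Hypothetical graph: section nodes may lack a URL, and pages can be linked twice.
graph = {
    "title": "Docs",
    "children": [
        {"title": "Quick Start", "url": "/docs/quick-start", "children": []},
        {"title": "JSAPI", "url": "/docs/jsapi", "children": [
            {"title": "Example API", "url": "/docs/jsapi/example"},
            {"title": "Quick Start (link)", "url": "/docs/quick-start"},
        ]},
    ],
}
print(flatten_graph(graph))
```

The `seen` set keeps a page that appears under several sidebar sections from being downloaded more than once by a later step.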

MCP server

# HTTP (default)
mpaas-docs serve

# stdio — for AI assistant integration
mpaas-docs serve --transport stdio

HTTP flags:

| Flag | Default |
|------|---------|
| --host | 127.0.0.1 |
| --port | 8000 |
| --path | /mcp |
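
Since the CLI is built on argparse, the serve flags could be wired up roughly as follows. This is a sketch under assumptions: `build_parser` is a hypothetical helper, and only the flag names and defaults come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the `serve` subcommand."""
    parser = argparse.ArgumentParser(prog="mpaas-docs")
    sub = parser.add_subparsers(dest="command", required=True)
    serve = sub.add_parser("serve", help="Start the MCP server")
    serve.add_argument("--transport", choices=["http", "stdio"], default="http")
    serve.add_argument("--host", default="127.0.0.1")
    serve.add_argument("--port", type=int, default=8000)
    serve.add_argument("--path", default="/mcp")
    return parser


# Override one flag and keep the other defaults.
args = build_parser().parse_args(["serve", "--port", "9000"])
print(f"http://{args.host}:{args.port}{args.path}")
```

With the defaults untouched, a client would connect to `http://127.0.0.1:8000/mcp`.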

MCP interface

Tools

| Tool | Description |
|------|-------------|
| search_docs(query, mode, code_only, include_context, filter_section, limit) | Hybrid search: vector (cosine via sqlite-vec), FTS5 (BM25), or combined with score summation |
| get_page(page_id, url) | Full page with all chunks |
| get_related_pages(page_id, direction, depth) | Graph-traverse related pages via CONTAINS edges |
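
One way to picture the combined mode of search_docs is to bring both score types onto a common scale and add them per chunk. The sketch below is an assumption, not the server's actual scoring: the min-max normalization, the `combine` helper, and the sample scores are all illustrative. The sign flip reflects the fact that FTS5's bm25() returns lower values for better matches.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1]; if all scores are equal, each maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def combine(vector_sim: dict, bm25_raw: dict, limit: int = 10) -> list:
    """Sum normalized vector and text scores per chunk ID, best first."""
    # bm25() is "lower is better", so negate before normalizing.
    text = min_max({cid: -score for cid, score in bm25_raw.items()})
    vec = min_max(vector_sim)
    totals = {
        cid: vec.get(cid, 0.0) + text.get(cid, 0.0)
        for cid in vec.keys() | text.keys()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Made-up scores keyed by chunk ID.
vector_sim = {1: 0.91, 2: 0.80, 3: 0.55}
bm25_raw = {2: -7.2, 3: -3.1, 4: -1.0}
print(combine(vector_sim, bm25_raw, limit=3))
```

Chunk 2 ranks first here because it scores well in both result sets, while chunk 1 appears only in the vector results.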

Prompts

| Prompt | Purpose |
|--------|---------|
| implement_feature(requirement) | Structured plan with step breakdown, code, and tradeoffs |
| explain_concept(concept) | 3-layer explanation: analogy → technical → code |
| debug_code(code) | Systematic root-cause analysis before proposing a fix |

Resources

| URI | Description |
|-----|-------------|
| docs://structure | JSON navigation tree (section/page IDs and names) |

Database schema

┌─────────┐       ┌──────────────┐       ┌───────────────────┐
│  pages  │       │    chunks    │       │ chunk_embeddings  │
├─────────┤       ├──────────────┤       ├───────────────────┤
│ id (PK) │──┐    │ id (PK)      │       │ chunk_id (PK)     │
│ content │  └──► │ source_page_id├──────►│ embedding FLOAT[] │
│ url     │       │ content      │       └───────────────────┘
└─────────┘       │ chunk_index  │       ┌──────────────┐
                   │ metadata_json│       │  chunks_fts  │
                   └──────────────┘       │ (FTS5)       │
                                           └──────────────┘
          ┌─────────────────────────────────────────┐
          │  graphqlite: nodes + edges (CONTAINS)   │
          │  hierarchical nav tree (Section→Page→   │
          │  Chunk)                                 │
          └─────────────────────────────────────────┘
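
The Section→Page→Chunk hierarchy lends itself to a bounded breadth-first walk, which is roughly what a depth-limited traversal such as get_related_pages needs. The sketch below uses an in-memory edge list instead of the graph tables; the node IDs and the `"out"`/`"in"` direction values are assumptions for illustration only.

```python
from collections import deque

# (parent, child) pairs standing in for CONTAINS edges; IDs are made up.
EDGES = [("sec:jsapi", "page:a"), ("sec:jsapi", "page:b"), ("page:a", "chunk:a0")]


def related(start: str, direction: str = "out", depth: int = 1) -> list:
    """Breadth-first walk along CONTAINS edges, at most `depth` hops from start."""
    if direction not in ("out", "in"):
        raise ValueError("direction must be 'out' or 'in'")
    neighbours = {}
    for parent, child in EDGES:
        # "out" follows parent -> child, "in" follows child -> parent.
        src, dst = (parent, child) if direction == "out" else (child, parent)
        neighbours.setdefault(src, []).append(dst)
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand beyond the requested depth
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, hops + 1))
    return order


print(related("sec:jsapi", "out", depth=2))
print(related("chunk:a0", "in", depth=2))
```

Walking outward from a section returns its pages and then their chunks; walking inward from a chunk returns its page and then the enclosing section.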

Configuration

Single YAML file (mpaas-docs.yaml at project root):

paths:
  data_dir: data
  db_path: data/sqlite/docs.db

crawler:
  start_url: "https://miniprogram.alipayplus.com/docs/miniprogram/mpdev"
  base_url: "https://miniprogram.alipayplus.com"
  headless: true
  wait_timeout: 10000
  request_delay: 2

llm:
  base_url: "http://localhost:1234/v1"
  api_key: "lm-studio"
  embed_model: "text-embedding-qwen3-embedding-0.6b"
  embedding_dim: 1024

chunking:
  chunk_size: 1024
  chunk_overlap_ratio: 0.2

server:
  name: "docs-search"
  host: "127.0.0.1"
  port: 8000
  transport: "http"
  path: "/mcp"
  search_default_limit: 10
  resource_uri: "docs://structure"

fts:
  tokenizer: "porter unicode61"

Requires a running OpenAI-compatible embedding endpoint (default: LM Studio at http://localhost:1234/v1).
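
To make the chunking settings concrete, the sketch below shows how chunk_size and chunk_overlap_ratio interact in a plain sliding window. It assumes sizes are measured in characters and ignores markdown structure, so it is an illustration of the two parameters rather than the loader's actual splitter; `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap_ratio: float = 0.2) -> list:
    """Split text into windows of chunk_size that overlap by the given ratio."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be > 0 and overlap_ratio in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)  # 204 characters with the defaults
    step = chunk_size - overlap                # each window starts 820 characters later
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# 2000 characters of cycling letters so that overlaps can be compared.
sample = "".join(chr(97 + i % 26) for i in range(2000))
pieces = chunk_text(sample)
print([len(piece) for piece in pieces])
```

With the default values, consecutive chunks share 204 characters, so a sentence that straddles a boundary still appears whole in at least one chunk in most cases.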

Project structure

src/mpaas_docs/
├── cli.py                 # CLI entry point (argparse)
├── config/settings.py     # YAML config loader (lru_cached)
├── crawler/
│   ├── spider.py          # Playwright sidebar graph scraper
│   ├── extract.py         # Graph → flat URL list
│   └── parser.py          # Page download → markdown
├── db/
│   ├── connection.py      # sqlite-vec factory
│   └── migrations.py      # Schema (pages, chunks, FTS5, vec0)
├── etl/
│   ├── chunker.py         # Legacy chunker (MarkdownSyntax splitter)
│   └── loader.py          # Chunking + embeddings + graph building
├── server/
│   ├── app.py             # FastMCP instance + registration
│   ├── tools.py           # search_docs, get_page, get_related_pages
│   ├── prompts.py         # implement_feature, explain_concept, debug_code
│   └── resources.py       # docs://structure nav tree
├── __init__.py
└── __main__.py            # python -m mpaas_docs

Development

ruff check src/mpaas_docs/ tests/
ruff format src/mpaas_docs/ tests/

Manual search test (requires populated DB + running embedding endpoint):

python tests/test_vec_search.py
