wiki-local-mcp

Local offline Wikipedia search engine as an MCP server. Semantic search over 7M+ articles with ChromaDB + SentenceTransformers. Works with llama.cpp, Open WebUI, Claude Code.
Part of Project Independence — building a fully offline, self-sufficient knowledge and tools stack that runs on consumer hardware.
A local, offline Wikipedia search engine exposed as an MCP (Model Context Protocol) server. It gives any MCP-compatible LLM instant access to the full English Wikipedia — semantic search, title lookup, category browsing, and full article retrieval — all running on your own hardware with zero API calls.
Demo
wiki-local-mcp in action with llama.cpp running Qwen3.5-9B-Q4_K_M (TurboQuant tbq3) at 63 tok/s on an NVIDIA RTX 5060 8GB with 50K context — the same machine serving the MCP search:
Built for speed: 7-8 ms semantic search across 7M+ articles, 3-7 ms full article retrieval via byte-offset seeking into the raw XML dumps.
Features
- Semantic search — find articles by meaning, not just keywords, using cosine similarity over 384-dimensional sentence embeddings
- Full article retrieval — fetch complete article text on demand via byte-offset seeking (no need to load multi-GB XML files into memory)
- Title search — fast substring matching against all article titles
- Category browsing — find articles by Wikipedia category
- Redirect resolution — automatically follows Wikipedia redirects (up to 5 hops)
- GPU-accelerated ingestion — bulk embedding on GPU, serving on CPU (keeps GPU free for your LLM)
- Incremental ingestion — add new XML dump files without re-processing existing ones
- MCP-native — works with any MCP client: Claude Code, llama.cpp, Open WebUI, and more
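The redirect resolution described above can be sketched as a bounded loop. This is an illustrative stand-in, not the actual db.py code: `redirects` here is a plain dict playing the role of the SQLite redirect table, and the cycle guard is an assumption about how the server avoids loops.

```python
MAX_REDIRECT_HOPS = 5  # mirrors the "up to 5 hops" limit above

def resolve_redirects(title: str, redirects: dict[str, str]) -> str:
    """Follow a redirect chain until a real article, a cycle, or the hop limit.

    `redirects` maps a redirect page's title to its target title; a title
    absent from the map is treated as a real article.
    """
    seen = set()
    for _ in range(MAX_REDIRECT_HOPS):
        if title not in redirects or title in seen:
            return title
        seen.add(title)
        title = redirects[title]
    return title

# Example chain: "USA" -> "United States of America" -> "United States"
redirects = {"USA": "United States of America",
             "United States of America": "United States"}
print(resolve_redirects("USA", redirects))  # United States
```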
Architecture
Wikipedia XML Dumps (10-200+ GB)
|
v
[3-Phase Ingestion Pipeline]
|
+-----+-----+------------------+
| | |
v v v
SQLite ChromaDB XML Files
(metadata, (HNSW vector (full text,
offsets, index, read on
redirects) embeddings) demand)
|
v
[MCP Server]
- wiki_search (semantic)
- wiki_get_article (full text)
- wiki_search_titles (substring)
- wiki_search_by_category
- wiki_stats
|
v
LLM Clients
(Claude Code, llama.cpp, Open WebUI, ...)
How storage works
| Layer | What it stores | Size (full English Wikipedia) |
|-------|---------------|-------------------------------|
| XML dumps | Raw Wikipedia articles (source of truth) | ~216 GB |
| SQLite | Article metadata, intro text, byte offsets into XML, redirects | ~13 GB |
| ChromaDB | HNSW vector index + 384-dim embeddings + lightweight metadata | ~7 GB |
| Total | | ~236 GB |
Full article text is never duplicated — it lives only in the XML dumps and is read on demand via byte-offset seeking. SQLite stores the byte offset and length for each article's <page> block, so retrieval is a single seek() + read() — no scanning.
Requirements
- Python >= 3.11
- ~250 GB disk space (for full English Wikipedia)
- NVIDIA GPU recommended for ingestion (optional — CPU works, just slower)
- Wikipedia XML dump files (see Downloading Wikipedia dumps)
Installation
git clone https://github.com/robtacconelli/wiki-local-mcp.git
cd wiki-local-mcp
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
PyTorch with CUDA (recommended for ingestion)
If you have an NVIDIA GPU and want faster ingestion:
pip install torch --index-url https://download.pytorch.org/whl/cu124
Downloading Wikipedia dumps
Download the English Wikipedia XML dump files from Wikimedia. We used the mediawiki_content_current dumps:
https://dumps.wikimedia.org/other/mediawiki_content_current/enwiki/2026-04-01/xml/bzip2/
mkdir -p xml_db
cd xml_db
# Download all split XML files
wget -r -np -nd -A "*.xml.bz2" \
"https://dumps.wikimedia.org/other/mediawiki_content_current/enwiki/2026-04-01/xml/bzip2/"
# Decompress
bunzip2 *.bz2
Place all .xml files in the xml_db/ directory. The ingestion pipeline will process every XML file it finds there.
You can also use a single monolithic dump file or a different dump date — the parser handles any valid MediaWiki XML export. Check https://dumps.wikimedia.org/other/mediawiki_content_current/enwiki/ for available dates.
Usage
1. Ingest Wikipedia dumps
./ingest.sh
This runs a 3-phase pipeline:
| Phase | What it does | Time (full Wikipedia, GPU) |
|-------|-------------|---------------------------|
| Phase 1 | Parse XML with streaming expat parser, record byte offsets | ~10 min per 10 GB file |
| Phase 1.5 | Write metadata + offsets to SQLite | ~12 sec |
| Phase 2 | Embed all articles with all-MiniLM-L6-v2 | ~6 min per 350K articles (GPU) |
| Phase 3 | Batch insert embeddings into ChromaDB HNSW index | ~4 min per 350K articles |
Ingestion is incremental — already-processed files are skipped. To force a full re-ingestion:
./ingest.sh --force
The GPU is used automatically if available; otherwise ingestion falls back to CPU, using half the available cores to avoid thermal throttling.
2. Start the MCP server
./start-mcp.sh
This starts an HTTP server on 0.0.0.0:8081 with streamable-HTTP transport. The server runs on CPU only — GPU stays free for your LLM.
The embedding model (~90 MB) loads lazily on first query and stays resident in memory. First query takes ~1 second; subsequent queries take 7-8 ms.
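The load-once, stay-resident behaviour can be sketched with a cached loader. This is a pattern sketch, not the server's actual code: the real loader would construct SentenceTransformer("all-MiniLM-L6-v2"), which is simulated here with a sleep and a placeholder object.

```python
from functools import lru_cache
import time

@lru_cache(maxsize=1)
def get_embedder():
    """Load the embedding model once; later calls return the cached instance.

    Stand-in loader: the real server would load the SentenceTransformer here,
    which is why the first query pays ~1 s and later queries do not.
    """
    time.sleep(0.1)  # simulate the one-time model load
    return object()  # placeholder for the loaded model

t0 = time.perf_counter()
first = get_embedder()   # cold: pays the load cost
cold = time.perf_counter() - t0

t0 = time.perf_counter()
second = get_embedder()  # warm: returns the resident instance
warm = time.perf_counter() - t0

assert first is second and warm < cold
```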
3. Connect your LLM client
llama.cpp
In the llama.cpp web UI, go to MCP and add a new server:
- Server URL:
http://<your-ip>:8081/mcp
Open WebUI
In Settings > Integrations > MCP Tool Server, add:
- URL:
http://<your-ip>:8081/mcp
If Open WebUI runs in Docker, use the host machine's LAN IP (not 127.0.0.1).
Claude Code
claude mcp add wiki-search -- /path/to/wiki-mcp/venv/bin/python -m wiki_mcp
Claude Code uses stdio transport directly — no need to start the HTTP server.
MCP tools reference
wiki_search
Semantic similarity search across all articles.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| query | string | (required) | Natural language search query (English) |
| n_results | integer | 5 | Number of results (max 20) |
Returns: JSON array of {title, short_description, categories, intro, relevance_score, page_id}
Example result:
[
{
"title": "Quantum mechanics",
"short_description": "Description of physical properties at the atomic and subatomic scale",
"categories": "Quantum mechanics, Concepts in physics",
"intro": "Quantum mechanics is a fundamental theory in physics that describes ...",
"relevance_score": 0.635,
"page_id": "7954681"
}
]
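The relevance_score above relates to the cosine metric the index uses. ChromaDB's cosine space reports distance = 1 - cosine similarity; how server.py converts that into relevance_score is an assumption here, and the tiny 3-dim vectors below merely stand in for the real 384-dim embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.2, 0.7, 0.1]    # toy stand-ins for 384-dim embeddings
doc_vec = [0.25, 0.65, 0.2]

similarity = cosine_similarity(query_vec, doc_vec)
distance = 1.0 - similarity    # what ChromaDB's cosine space returns
print(round(similarity, 3), round(distance, 3))
```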
wiki_get_article
Retrieve a full Wikipedia article by exact title. Resolves redirects automatically.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| title | string | (required) | Exact article title (English) |
Returns: JSON {title, page_id, short_description, categories, intro, full_text}
wiki_get_article_by_id
Retrieve a full article by its Wikipedia page ID. Useful after wiki_search returns page IDs.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| page_id | string | (required) | Wikipedia page ID |
Returns: Same as wiki_get_article
wiki_search_titles
Fast substring search against article titles.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| query | string | (required) | Substring to match (English) |
| limit | integer | 10 | Max results (max 50) |
Returns: JSON array of {page_id, title, short_description, intro}
wiki_search_by_category
Find articles belonging to a specific Wikipedia category.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| category | string | (required) | Category name (e.g. "Physics", "French novels") |
| n_results | integer | 10 | Max results (max 50) |
Returns: JSON array of {title, short_description, categories, intro, page_id}
wiki_stats
Get database statistics.
Returns: JSON {articles, redirects, chroma_count, ingested_files: [{file, articles, redirects, ingested_at}]}
Database management
Backup
./backup.sh
Creates a timestamped archive in backups/ containing wiki.db and chroma_db/. The backup includes a manifest of required XML files and verifies they exist.
Restore
./restore.sh # restore latest backup
./restore.sh backups/wiki-backup-20260403.tar # restore specific backup
Checks that the XML files referenced by the backup are present in xml_db/ before restoring. Warns if any are missing (full article retrieval would fail without them, but search and metadata still work).
Reset and re-ingest
./reset.sh
Deletes all databases and runs a fresh ingestion from the XML files in xml_db/.
Project structure
wiki-mcp/
wiki_mcp/
__init__.py
__main__.py # CLI entry point (ingest vs. server)
config.py # Paths, model name, tuning constants
server.py # MCP server with tool definitions
db.py # SQLite + ChromaDB wrapper
wiki_parser.py # Streaming XML parser + wikitext cleaner
ingest.py # 3-phase ingestion pipeline
xml_db/ # Wikipedia XML dump files
chroma_db/ # ChromaDB persistent storage (generated)
wiki.db # SQLite database (generated)
hf_cache/ # HuggingFace model cache
backups/ # Backup archives
venv/ # Python virtual environment
ingest.sh # Run ingestion (GPU)
start-mcp.sh # Start MCP server (CPU)
reset.sh # Delete DBs + re-ingest
backup.sh # Create backup
restore.sh # Restore from backup
pyproject.toml # Project metadata + dependencies
README.md
Configuration
Key constants in wiki_mcp/config.py:
| Constant | Default | Description |
|----------|---------|-------------|
| EMBEDDING_MODEL | all-MiniLM-L6-v2 | SentenceTransformer model |
| EMBEDDING_DIM | 384 | Vector dimensions |
| BATCH_SIZE | 500 | Articles per ChromaDB insert batch |
| INTRO_MAX_CHARS | 1500 | Max intro characters to store and embed |
Ingestion tuning in wiki_mcp/ingest.py:
| Setting | GPU | CPU |
|---------|-----|-----|
| Embedding batch size | 256 | 512 |
| Torch threads | N/A | cpu_count / 2 |
Performance benchmarks
Tested on: AMD Ryzen 7 (16 threads) + NVIDIA RTX 5060 8GB
Ingestion (single 10.5 GB XML file, 354K articles)
| Phase | Time |
|-------|------|
| XML parsing | 580 sec |
| SQLite write | 12 sec |
| Embedding (GPU) | 365 sec |
| ChromaDB insert | 228 sec |
| Total | ~20 min |
Query latency (354K articles, CPU)
| Operation | Latency |
|-----------|---------|
| Semantic search (first query, cold) | ~1 sec |
| Semantic search (warm) | 7-8 ms |
| Full article retrieval (XML byte seek) | 3-7 ms |
| Title search (SQLite LIKE) | < 1 ms |
How it works
Ingestion pipeline
Phase 1 — XML Parsing: A streaming expat parser processes the XML dump in 8 MB chunks. For each <page> element, it records the byte offset and length in the file, extracts the article title, intro, categories, infobox type, and short description, and cleans the wikitext markup. Redirects are stored separately.
Phase 2 — Embedding: Article text (title + description + intro + categories) is encoded into 384-dimensional vectors using the all-MiniLM-L6-v2 sentence transformer. GPU is used when available, with automatic fallback to multi-threaded CPU inference.
Phase 3 — Indexing: Embeddings are inserted into a ChromaDB collection backed by an HNSW (Hierarchical Navigable Small World) index with cosine similarity. Lightweight metadata (title, description, categories) is stored alongside the vectors for fast result formatting without hitting SQLite.
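Phase 3's fixed-size batching can be sketched as follows. The slicing helper below is illustrative rather than the pipeline's actual code; collection.add with ids/embeddings/metadatas is ChromaDB's real insert API, shown only as a comment so the sketch stays self-contained.

```python
BATCH_SIZE = 500  # matches the config default elsewhere in this README

def batches(items, size=BATCH_SIZE):
    """Yield consecutive fixed-size slices so each ChromaDB insert stays bounded."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

articles = [{"id": str(i), "embedding": [0.0] * 384} for i in range(1234)]

# In the real pipeline, each batch would be inserted roughly as:
#   collection.add(ids=[a["id"] for a in batch],
#                  embeddings=[a["embedding"] for a in batch],
#                  metadatas=[...])  # title, description, categories
sizes = [len(batch) for batch in batches(articles)]
print(sizes)  # [500, 500, 234]
```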
Query flow
- User query is encoded into a 384-dim vector using the same sentence transformer
- ChromaDB HNSW index returns the top-N most similar article vectors with cosine distance
- Article intros are fetched from SQLite using the returned page IDs
- If full text is requested, the article's <page> block is read directly from the XML file using the stored byte offset — a single seek() + read() call
Why byte offsets?
Wikipedia XML dumps are 10+ GB per file. Loading them into memory or scanning them for each request is impractical. Instead, during ingestion, we record exactly where each article starts and how long it is (in bytes) within the XML file. At query time, we open the file, seek to that position, and read just that article's XML — making full article retrieval a constant-time operation regardless of dump size.
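Under those assumptions, full-article retrieval reduces to a few lines. This is a self-contained sketch, not the actual db.py code: the stand-in "dump" is a string written to a temp file, and the offset/length bookkeeping mimics what Phase 1 records for the real multi-GB files.

```python
import os
import tempfile

def read_page(xml_path: str, offset: int, length: int) -> str:
    """Read one article's <page> block with a single seek() + read()."""
    with open(xml_path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")

# Build a tiny stand-in "dump" and record one page's byte offset and length.
page = "<page><title>Example</title><text>Hello</text></page>"
dump = "<mediawiki>" + page + "</mediawiki>"
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write(dump)
    path = f.name

offset = dump.index("<page>")  # all-ASCII here, so char index == byte offset
length = len(page)
text = read_page(path, offset, length)
os.remove(path)
print(text)  # the <page>...</page> block, with no scanning
```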
Limitations
- English Wikipedia only — all search queries and titles must be in English
- Single-file XML parsing is single-threaded — parsing a 10 GB file takes ~10 minutes; this is I/O bound and hard to parallelize with expat's streaming model
- Intro-based embeddings — semantic search is based on article intros (first 1500 chars), not full text. This works well for most queries but may miss articles where the relevant content is deep in the body
- No real-time updates — to update the database, download a new Wikipedia dump and re-ingest
License
Apache License 2.0 — see LICENSE for details.