wiki-local-mcp

Local offline Wikipedia search engine as an MCP server. Semantic search over 7M+ articles with ChromaDB + SentenceTransformers. Works with llama.cpp, Open WebUI, Claude Code.
Part of Project Independence — building a fully offline, self-sufficient knowledge and tools stack that runs on consumer hardware.
A local, offline Wikipedia search engine exposed as an MCP (Model Context Protocol) server. It gives any MCP-compatible LLM instant access to the full English Wikipedia — semantic search, title lookup, category browsing, and full article retrieval — all running on your own hardware with zero API calls.
Demo
wiki-local-mcp in action with llama.cpp running Qwen3.5-9B-Q4_K_M (TurboQuant tbq3) at 63 tok/s on an NVIDIA RTX 5060 8GB with 50K context — the same machine serving the MCP search:
Built for speed: 7-8 ms semantic search across 7M+ articles, 3-7 ms full article retrieval via byte-offset seeking into the raw XML dumps.
Features
- Semantic search — find articles by meaning, not just keywords, using cosine similarity over 384-dimensional sentence embeddings
- Full article retrieval — fetch complete article text on demand via byte-offset seeking (no need to load multi-GB XML files into memory)
- Title search — fast substring matching against all article titles
- Category browsing — find articles by Wikipedia category
- Redirect resolution — automatically follows Wikipedia redirects (up to 5 hops)
- GPU-accelerated ingestion — bulk embedding on GPU, serving on CPU (keeps GPU free for your LLM)
- Incremental ingestion — add new XML dump files without re-processing existing ones
- MCP-native — works with any MCP client: Claude Code, llama.cpp, Open WebUI, and more
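The redirect resolution described above can be sketched as a bounded loop. This is an illustrative stand-in, not the actual db.py code: `redirects` here is a plain dict playing the role of the SQLite redirect table, and the cycle guard is an assumption about how the server avoids loops.

```python
MAX_REDIRECT_HOPS = 5  # mirrors the "up to 5 hops" limit above

def resolve_redirects(title: str, redirects: dict[str, str]) -> str:
    """Follow a redirect chain until a real article, a cycle, or the hop limit.

    `redirects` maps a redirect page's title to its target title; a title
    absent from the map is treated as a real article.
    """
    seen = set()
    for _ in range(MAX_REDIRECT_HOPS):
        if title not in redirects or title in seen:
            return title
        seen.add(title)
        title = redirects[title]
    return title

# Example chain: "USA" -> "United States of America" -> "United States"
redirects = {"USA": "United States of America",
             "United States of America": "United States"}
print(resolve_redirects("USA", redirects))  # United States
```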
Architecture
Wikipedia XML Dumps (10-200+ GB)
|
v
[3-Phase Ingestion Pipeline]
|
+-----+-----+------------------+
| | |
v v v
SQLite ChromaDB XML Files
(metadata, (HNSW vector (full text,
offsets, index, read on
redirects) embeddings) demand)
|
v
[MCP Server]
- wiki_search (semantic)
- wiki_get_article (full text)
- wiki_search_titles (substring)
- wiki_search_by_category
- wiki_stats
|
v
LLM Clients
(Claude Code, llama.cpp, Open WebUI, ...)
How storage works
| Layer | What it stores | Size (full English Wikipedia) |
|-------|---------------|-------------------------------|
| XML dumps | Raw Wikipedia articles (source of truth) | ~216 GB |
| SQLite | Article metadata, intro text, byte offsets into XML, redirects | ~13 GB |
| ChromaDB | HNSW vector index + 384-dim embeddings + lightweight metadata | ~7 GB |
| Total | | ~236 GB |
Full article text is never duplicated — it lives only in the XML dumps and is read on demand via byte-offset seeking. SQLite stores the byte offset and length for each article's <page> block, so retrieval is a single seek() + read() — no scanning.
Requirements
- Python >= 3.11
- ~250 GB disk space (for full English Wikipedia)
- NVIDIA GPU recommended for ingestion (optional — CPU works, just slower)
- Wikipedia XML dump files (see Downloading Wikipedia dumps)
Installation
git clone https://github.com/robtacconelli/wiki-local-mcp.git
cd wiki-local-mcp
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
PyTorch with CUDA (recommended for ingestion)
If you have an NVIDIA GPU and want faster ingestion:
pip install torch --index-url https://download.pytorch.org/whl/cu124
Downloading Wikipedia dumps
Download the English Wikipedia XML dump files from Wikimedia. We used the mediawiki_content_current dumps:
https://dumps.wikimedia.org/other/mediawiki_content_current/enwiki/2026-04-01/xml/bzip2/
mkdir -p xml_db
cd xml_db
# Download all split XML files
wget -r -np -nd -A "*.xml.bz2" \
"https://dumps.wikimedia.org/other/mediawiki_content_current/enwiki/2026-04-01/xml/bzip2/"
# Decompress
bunzip2 *.bz2
Place all .xml files in the xml_db/ directory. The ingestion pipeline will process every XML file it finds there.
You can also use a single monolithic dump file or a different dump date — the parser handles any valid MediaWiki XML export. Check https://dumps.wikimedia.org/other/mediawiki_content_current/enwiki/ for available dates.
Usage
1. Ingest Wikipedia dumps
./ingest.sh
This runs a 3-phase pipeline:
| Phase | What it does | Time (full Wikipedia, GPU) |
|-------|-------------|---------------------------|
| Phase 1 | Parse XML with streaming expat parser, record byte offsets | ~10 min per 10 GB file |
| Phase 1.5 | Write metadata + offsets to SQLite | ~12 sec |
| Phase 2 | Embed all articles with all-MiniLM-L6-v2 | ~6 min per 350K articles (GPU) |
| Phase 3 | Batch insert embeddings into ChromaDB HNSW index | ~4 min per 350K articles |
Ingestion is incremental — already-processed files are skipped. To force a full re-ingestion:
./ingest.sh --force
The GPU is used automatically if available; otherwise ingestion falls back to CPU, using half the available cores to avoid thermal throttling.
2. Start the MCP server
./start-mcp.sh
This starts an HTTP server on 0.0.0.0:8081 with streamable-HTTP transport. The server runs on CPU only — GPU stays free for your LLM.
The embedding model (~90 MB) loads lazily on first query and stays resident in memory. First query takes ~1 second; subsequent queries take 7-8 ms.
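The load-once, stay-resident behaviour can be sketched with a cached loader. This is a pattern sketch, not the server's actual code: the real loader would construct SentenceTransformer("all-MiniLM-L6-v2"), which is simulated here with a sleep and a placeholder object.

```python
from functools import lru_cache
import time

@lru_cache(maxsize=1)
def get_embedder():
    """Load the embedding model once; later calls return the cached instance.

    Stand-in loader: the real server would load the SentenceTransformer here,
    which is why the first query pays ~1 s and later queries do not.
    """
    time.sleep(0.1)  # simulate the one-time model load
    return object()  # placeholder for the loaded model

t0 = time.perf_counter()
first = get_embedder()   # cold: pays the load cost
cold = time.perf_counter() - t0

t0 = time.perf_counter()
second = get_embedder()  # warm: returns the resident instance
warm = time.perf_counter() - t0

assert first is second and warm < cold
```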
3. Connect your LLM client
llama.cpp
In the llama.cpp web UI, go to MCP and add a new server:
- Server URL:
http://<your-ip>:8081/mcp
Open WebUI
In Settings > Integrations > MCP Tool Server, add:
- URL:
http://<your-ip>:8081/mcp
If Open WebUI runs in Docker, use the host machine's LAN IP (not 127.0.0.1).
Claude Code
claude mcp add wiki-search -- /path/to/wiki-mcp/venv/bin/python -m wiki_mcp
Claude Code uses stdio transport directly — no need to start the HTTP server.
MCP tools reference
wiki_search
Semantic similarity search across all articles.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| query | string | (required) | Natural language search query (English) |
| n_results | integer | 5 | Number of results (max 20) |
Returns: JSON array of {title, short_description, categories, intro, relevance_score, page_id}
Example result:
[
{
"title": "Quantum mechanics",
"short_description": "Description of physical properties at the atomic and subatomic scale",
"categories": "Quantum mechanics, Concepts in physics",
"intro": "Quantum mechanics is a fundamental theory in physics that describes ...",
"relevance_score": 0.635,
"page_id": "7954681"
}
]
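The relevance_score above relates to the cosine metric the index uses. ChromaDB's cosine space reports distance = 1 - cosine similarity; how server.py converts that into relevance_score is an assumption here, and the tiny 3-dim vectors below merely stand in for the real 384-dim embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [0.2, 0.7, 0.1]    # toy stand-ins for 384-dim embeddings
doc_vec = [0.25, 0.65, 0.2]

similarity = cosine_similarity(query_vec, doc_vec)
distance = 1.0 - similarity    # what ChromaDB's cosine space returns
print(round(similarity, 3), round(distance, 3))
```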
wiki_get_article
Retrieve a full Wikipedia article by exact title. Resolves redirects automatically.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| title | string | (required) | Exact article title (English) |
Returns: JSON {title, page_id, short_description, categories, intro, full_text}
wiki_get_article_by_id
Retrieve a full article by its Wikipedia page ID. Useful after wiki_search returns page IDs.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| page_id | string | (required) | Wikipedia page ID |
Returns: Same as wiki_get_article
wiki_search_titles
Fast substring search against article titles.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| query | string | (required) | Substring to match (English) |
| limit | integer | 10 | Max results (max 50) |
Returns: JSON array of {page_id, title, short_description, intro}
wiki_search_by_category
Find articles belonging to a specific Wikipedia category.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| category | string | (required) | Category name (e.g. "Physics", "French novels") |
| n_results | integer | 10 | Max results (max 50) |
Returns: JSON array of {title, short_description, categories, intro, page_id}
wiki_stats
Get database statistics.
Returns: JSON {articles, redirects, chroma_count, ingested_files: [{file, articles, redirects, ingested_at}]}
Database management
Backup
./backup.sh
Creates a timestamped archive in backups/ containing wiki.db and chroma_db/. The backup includes a manifest of required XML files and verifies they exist.
Restore
./restore.sh # restore latest backup
./restore.sh backups/wiki-backup-20260403.tar # restore specific backup
Checks that the XML files referenced by the backup are present in xml_db/ before restoring. Warns if any are missing (full article retrieval would fail without them, but search and metadata still work).
Reset and re-ingest
./reset.sh
Deletes all databases and runs a fresh ingestion from the XML files in xml_db/.
Project structure
wiki-mcp/
wiki_mcp/
__init__.py
__main__.py # CLI entry point (ingest vs. server)
config.py # Paths, model name, tuning constants
server.py # MCP server with tool definitions
db.py # SQLite + ChromaDB wrapper
wiki_parser.py # Streaming XML parser + wikitext cleaner
ingest.py # 3-phase ingestion pipeline
xml_db/ # Wikipedia XML dump files
chroma_db/ # ChromaDB persistent storage (generated)
wiki.db # SQLite database (generated)
hf_cache/ # HuggingFace model cache
backups/ # Backup archives
venv/ # Python virtual environment
ingest.sh # Run ingestion (GPU)
start-mcp.sh # Start MCP server (CPU)
reset.sh # Delete DBs + re-ingest
backup.sh # Create backup
restore.sh # Restore from backup
pyproject.toml # Project metadata + dependencies
README.md
Configuration
Key constants in wiki_mcp/config.py:
| Constant | Default | Description |
|----------|---------|-------------|
| EMBEDDING_MODEL | all-MiniLM-L6-v2 | SentenceTransformer model |
| EMBEDDING_DIM | 384 | Vector dimensions |
| BATCH_SIZE | 500 | Articles per ChromaDB insert batch |
| INTRO_MAX_CHARS | 1500 | Max intro characters to store and embed |
Ingestion tuning in wiki_mcp/ingest.py:
| Setting | GPU | CPU |
|---------|-----|-----|
| Embedding batch size | 256 | 512 |
| Torch threads | N/A | cpu_count / 2 |
Performance benchmarks
Tested on: AMD Ryzen 7 (16 threads) + NVIDIA RTX 5060 8GB
Ingestion (single 10.5 GB XML file, 354K articles)
| Phase | Time |
|-------|------|
| XML parsing | 580 sec |
| SQLite write | 12 sec |
| Embedding (GPU) | 365 sec |
| ChromaDB insert | 228 sec |
| Total | ~20 min |
Query latency (354K articles, CPU)
| Operation | Latency |
|-----------|---------|
| Semantic search (first query, cold) | ~1 sec |
| Semantic search (warm) | 7-8 ms |
| Full article retrieval (XML byte seek) | 3-7 ms |
| Title search (SQLite LIKE) | < 1 ms |
How it works
Ingestion pipeline
Phase 1 — XML Parsing: A streaming expat parser processes the XML dump in 8 MB chunks. For each <page> element, it records the byte offset and length in the file, extracts the article title, intro, categories, infobox type, and short description, and cleans the wikitext markup. Redirects are stored separately.
Phase 2 — Embedding: Article text (title + description + intro + categories) is encoded into 384-dimensional vectors using the all-MiniLM-L6-v2 sentence transformer. GPU is used when available, with automatic fallback to multi-threaded CPU inference.
Phase 3 — Indexing: Embeddings are inserted into a ChromaDB collection backed by an HNSW (Hierarchical Navigable Small World) index with cosine similarity. Lightweight metadata (title, description, categories) is stored alongside the vectors for fast result formatting without hitting SQLite.
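Phase 3's fixed-size batching can be sketched as follows. The slicing helper below is illustrative rather than the pipeline's actual code; collection.add with ids/embeddings/metadatas is ChromaDB's real insert API, shown only as a comment so the sketch stays self-contained.

```python
BATCH_SIZE = 500  # matches the config default elsewhere in this README

def batches(items, size=BATCH_SIZE):
    """Yield consecutive fixed-size slices so each ChromaDB insert stays bounded."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

articles = [{"id": str(i), "embedding": [0.0] * 384} for i in range(1234)]

# In the real pipeline, each batch would be inserted roughly as:
#   collection.add(ids=[a["id"] for a in batch],
#                  embeddings=[a["embedding"] for a in batch],
#                  metadatas=[...])  # title, description, categories
sizes = [len(batch) for batch in batches(articles)]
print(sizes)  # [500, 500, 234]
```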
Query flow
- User query is encoded into a 384-dim vector using the same sentence transformer
- ChromaDB HNSW index returns the top-N most similar article vectors with cosine distance
- Article intros are fetched from SQLite using the returned page IDs
- If full text is requested, the article's <page> block is read directly from the XML file using the stored byte offset — a single seek() + read() call
Why byte offsets?
Wikipedia XML dumps are 10+ GB per file. Loading them into memory or scanning them for each request is impractical. Instead, during ingestion, we record exactly where each article starts and how long it is (in bytes) within the XML file. At query time, we open the file, seek to that position, and read just that article's XML — making full article retrieval a constant-time operation regardless of dump size.
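Under those assumptions, full-article retrieval reduces to a few lines. This is a self-contained sketch, not the actual db.py code: the stand-in "dump" is a string written to a temp file, and the offset/length bookkeeping mimics what Phase 1 records for the real multi-GB files.

```python
import os
import tempfile

def read_page(xml_path: str, offset: int, length: int) -> str:
    """Read one article's <page> block with a single seek() + read()."""
    with open(xml_path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")

# Build a tiny stand-in "dump" and record one page's byte offset and length.
page = "<page><title>Example</title><text>Hello</text></page>"
dump = "<mediawiki>" + page + "</mediawiki>"
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write(dump)
    path = f.name

offset = dump.index("<page>")  # all-ASCII here, so char index == byte offset
length = len(page)
text = read_page(path, offset, length)
os.remove(path)
print(text)  # the <page>...</page> block, with no scanning
```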
Limitations
- English Wikipedia only — all search queries and titles must be in English
- Single-file XML parsing is single-threaded — parsing a 10 GB file takes ~10 minutes; this is I/O bound and hard to parallelize with expat's streaming model
- Intro-based embeddings — semantic search is based on article intros (first 1500 chars), not full text. This works well for most queries but may miss articles where the relevant content is deep in the body
- No real-time updates — to update the database, download a new Wikipedia dump and re-ingest
License
Apache License 2.0 — see LICENSE for details.