Self-RAG MCP Server
A backend-only Model Context Protocol (MCP) server for self-reflective retrieval-augmented generation (Self-RAG). It can answer simple questions directly, retrieve indexed PDF content from Qdrant Cloud, and use Tavily web search as a bounded fallback.
Features
- FastMCP server with local stdio and Streamable HTTP transports.
- LangGraph workflow with retrieval decisions, relevance grading, source extraction, groundedness checks, and bounded web-search retries.
- Qdrant Cloud hybrid retrieval using Gemini dense embeddings and local BM25 sparse embeddings.
- Idempotent PDF ingestion with stable chunk IDs and replacement of stale chunks when a file is re-ingested.
- User-facing citations with PDF page numbers or web URLs.
- Lazy external clients: the server can start with empty API keys and report configuration errors through MCP tools.
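The idempotent ingestion described above depends on chunk IDs that stay the same across runs. The following is a minimal sketch of one way to get that property, assuming IDs are derived from the file path and chunk position with `uuid.uuid5`. The namespace value and the helper names `chunk_id` and `plan_reingest` are hypothetical and are not taken from this project's code.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def chunk_id(source_path: str, chunk_index: int) -> str:
    """Return the same ID every time for the same file and chunk position."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{source_path}#{chunk_index}"))


def plan_reingest(
    source_path: str, old_ids: set[str], new_chunk_count: int
) -> tuple[list[str], set[str]]:
    """Return (IDs to upsert, stale IDs to delete) when a file is re-ingested."""
    new_ids = [chunk_id(source_path, i) for i in range(new_chunk_count)]
    stale_ids = old_ids - set(new_ids)
    return new_ids, stale_ids
```

Because the IDs are deterministic, ingesting an unchanged file twice overwrites the same points instead of duplicating them, and a file that shrinks leaves a known set of stale IDs to delete.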
Architecture
Open the standalone LangGraph visual guide in a browser for an interactive execution map, node-by-node reference, state dictionary, routing rules, and ingestion pipeline.
flowchart TD
client["MCP client"] --> server["FastMCP server"]
server --> graph["LangGraph Self-RAG workflow"]
graph --> gemini["Gemini chat and embeddings"]
graph --> qdrant["Qdrant Cloud hybrid collection"]
graph --> tavily["Tavily web search"]
The graph contains ten nodes:
decide_retrieval
  |-- no --> generate_direct --> END
  `-- yes --> retrieve --> is_relevant
        |-- yes --> extract_sources --> generate_from_context
        |           --> grade_hallucination --> END or retry
        `-- no --> rewrite_query --> web_search --> is_relevant
Web-search retries stop after MAX_WEB_ATTEMPTS attempts (2 by default). If no useful evidence has been found by then, the no_docs_fallback node returns a clear response saying so.
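The routing rules above can be sketched as two plain Python functions that return the name of the next node. This is an illustration only: the state keys in `GraphState` are hypothetical, and the sketch assumes that a retry after a failed groundedness check goes back through rewrite_query, which this README does not spell out.

```python
from typing import TypedDict

MAX_WEB_ATTEMPTS = 2  # default listed in the environment-variable table


class GraphState(TypedDict):
    """Hypothetical subset of the graph state needed for routing."""

    has_relevant_docs: bool
    is_grounded: bool
    web_search_attempts: int


def route_after_relevance(state: GraphState) -> str:
    """Pick the node that follows is_relevant."""
    if state["has_relevant_docs"]:
        return "extract_sources"
    if state["web_search_attempts"] < MAX_WEB_ATTEMPTS:
        return "rewrite_query"
    # Retry budget exhausted and still no usable evidence.
    return "no_docs_fallback"


def route_after_grading(state: GraphState) -> str:
    """Pick the node that follows grade_hallucination."""
    if state["is_grounded"]:
        return "END"
    if state["web_search_attempts"] < MAX_WEB_ATTEMPTS:
        # Assumption: a retry re-enters the web-search branch.
        return "rewrite_query"
    # Out of retries: finish with an answer that was not fully supported.
    return "END"
```

Counting attempts in the state is what keeps the loop bounded: every path either ends or increments the counter toward MAX_WEB_ATTEMPTS.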
MCP Tools
| Tool | Purpose |
| --- | --- |
| query_knowledge_base(question) | Answer a question through the Self-RAG graph. |
| ingest_documents(file_paths) | Index server-side PDF files in Qdrant Cloud. |
| get_knowledge_base_info() | Report collection status and indexed chunk count. |
Query results include answer, sources, is_grounded, answer_status, used_retrieval,
used_web, and web_search_attempts.
answer_status distinguishes verified RAG answers from direct general-knowledge answers:
| Status | Meaning |
| --- | --- |
| verified | The answer passed the context-groundedness check. |
| direct_unverified | The graph answered directly from Gemini general knowledge. |
| unverified | The answer could not be fully supported after retrying. |
| not_found | No useful knowledge-base or web evidence was found. |
| error | Configuration or external-service failure. |
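A client can branch on answer_status to decide how to present a result. The sketch below types the result fields listed above and maps each status to a short message. The field types, the element shape of `sources`, and the `summarize` helper are assumptions made for illustration; only the field names and status values come from this README.

```python
from typing import Any, Literal, TypedDict

AnswerStatus = Literal[
    "verified", "direct_unverified", "unverified", "not_found", "error"
]


class QueryResult(TypedDict):
    """Assumed shape of a query_knowledge_base result."""

    answer: str
    sources: list[Any]  # element shape is not specified in this README
    is_grounded: bool
    answer_status: AnswerStatus
    used_retrieval: bool
    used_web: bool
    web_search_attempts: int


def summarize(result: QueryResult) -> str:
    """Return a one-line, user-facing note for a result."""
    status = result["answer_status"]
    if status == "verified":
        return f"Grounded answer with {len(result['sources'])} source(s)."
    if status == "direct_unverified":
        return "Direct answer from general knowledge; no sources were checked."
    if status == "unverified":
        return "Answer could not be fully supported; treat with caution."
    if status == "not_found":
        return "No useful evidence was found."
    return "Configuration or external-service failure."
```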
Requirements
- Python 3.12 or newer.
- A Gemini API key.
- A Qdrant Cloud cluster URL and API key.
- A Tavily API key if web-search fallback is required.
Setup
git clone https://github.com/vvanshkkumar/Self-RAG-MCP-Server.git
cd Self-RAG-MCP-Server
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
Add your keys to .env. Real keys must never be committed.
GOOGLE_API_KEY=
QDRANT_URL=
QDRANT_API_KEY=
TAVILY_API_KEY=
The defaults use gemini-2.5-flash for chat and gemini-embedding-001 for embeddings. Both can be changed through GEMINI_CHAT_MODEL and GEMINI_EMBEDDING_MODEL in .env without editing Python files.
Environment Variables
| Variable | Default | Purpose |
| --- | --- | --- |
| GOOGLE_API_KEY | empty | Gemini chat and embedding API key. |
| GEMINI_CHAT_MODEL | gemini-2.5-flash | Chat model for generation and grading. |
| GEMINI_EMBEDDING_MODEL | gemini-embedding-001 | Dense embedding model. |
| GEMINI_EMBEDDING_DIMENSIONS | 768 | Dense Qdrant vector size. |
| QDRANT_URL | empty | Qdrant Cloud cluster URL. |
| QDRANT_API_KEY | empty | Qdrant Cloud API key. |
| QDRANT_COLLECTION | self_rag_documents_v1 | Hybrid collection name. |
| TAVILY_API_KEY | empty | Tavily web-search API key. |
| RETRIEVAL_TOP_K | 6 | Hybrid chunks retrieved before grading. |
| MAX_WEB_ATTEMPTS | 2 | Maximum Tavily fallback calls per query. |
| CHUNK_SIZE | 800 | PDF chunk size in characters. |
| CHUNK_OVERLAP | 150 | PDF chunk overlap in characters. |
| INGEST_ALLOWED_ROOT | empty | Optional directory boundary for PDF ingestion. |
| MCP_HOST | 127.0.0.1 | Streamable HTTP bind address. |
| MCP_PORT | 8000 | Streamable HTTP port. |
Changing the embedding model or dimensions requires a new Qdrant collection name and re-ingestion of your PDFs.
Ingest PDFs
Use the CLI for initial ingestion:
source .venv/bin/activate
python ingest.py ./documents/
python ingest.py ./documents/policy.pdf
python ingest.py "./documents/*.pdf"
Folders are searched recursively. PDF extension matching supports .pdf and .PDF.
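Recursive discovery of this kind can be sketched with `pathlib`. The `find_pdfs` helper is hypothetical and is not the CLI's actual code. Note that it compares the extension case-insensitively, so it also accepts mixed-case variants such as .Pdf, which is slightly broader than the two spellings named above.

```python
from pathlib import Path


def find_pdfs(folder: Path) -> list[Path]:
    """Recursively collect PDF files under folder, sorted for stable output."""
    return sorted(
        path
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```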
The MCP ingest_documents tool accepts absolute file paths on the server machine. Set
INGEST_ALLOWED_ROOT in .env when you want to prevent ingestion outside a known documents
directory.
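A directory boundary like this is usually enforced by resolving both paths and checking containment. The sketch below shows that idea with `pathlib`. The `is_within_root` helper is hypothetical, and treating an empty INGEST_ALLOWED_ROOT as "no restriction" is an assumption based on the variable being optional.

```python
from pathlib import Path


def is_within_root(candidate: str, allowed_root: str) -> bool:
    """True if candidate resolves to a location inside allowed_root."""
    if not allowed_root:
        # Assumption: an empty setting disables the boundary check.
        return True
    root = Path(allowed_root).resolve()
    target = Path(candidate).resolve()
    # resolve() collapses ".." segments and symlinks before the comparison,
    # and is_relative_to() compares whole path components, not string prefixes.
    return target.is_relative_to(root)
```

Comparing resolved paths rather than raw strings is what blocks inputs such as /srv/docs/../secrets/file.pdf or a sibling directory like /srv/docs-other.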
Run the MCP Server
For desktop clients using stdio:
source .venv/bin/activate
python mcp_server.py
For local Streamable HTTP:
source .venv/bin/activate
python mcp_server.py --transport http --host 127.0.0.1 --port 8000
The HTTP server binds to localhost by default. Do not expose it publicly without authentication and
network controls. The ingestion tool can read server-side PDF paths, so INGEST_ALLOWED_ROOT should
also be configured before any remote deployment.
Claude Desktop Configuration
Use absolute paths and point the command at this project's virtual-environment Python:
{
"mcpServers": {
"self-rag": {
"command": "/absolute/path/to/Self-RAG-MCP-Server/.venv/bin/python",
"args": ["/absolute/path/to/Self-RAG-MCP-Server/mcp_server.py"]
}
}
}
The server loads API keys from .env, so keys do not need to be duplicated in the client
configuration.
Development Checks
source .venv/bin/activate
pytest
ruff check .
ruff format --check .
mypy rag ingest.py mcp_server.py
python -m compileall -q rag ingest.py mcp_server.py tests
Tests use local fakes and FastMCP's in-memory client. They do not require API keys or external network calls.