MCP server by groxaxo
╔════════════════════════════════════════════════════════════╗
║ ║
║ ███╗ ███╗ ██████╗██████╗ ██╗ ██╗ ███╗ ███╗║
║ ████╗ ████║██╔════╝██╔══██╗ ██║ ██║ ████╗ ████║║
║ ██╔████╔██║██║ ██████╔╝ ██║ ██║ ██╔████╔██║║
║ ██║╚██╔╝██║██║ ██╔═══╝ ██║ ██║ ██║╚██╔╝██║║
║ ██║ ╚═╝ ██║╚██████╗██║ ███████╗███████╗██║ ╚═╝ ██║║
║ ╚═╝ ╚═╝ ╚═════╝╚═╝ ╚══════╝╚══════╝╚═╝ ╚═╝║
║ ║
║ L L M R O U T E R ║
║ ║
╚════════════════════════════════════════════════════════════╝
MCP LLM Router
A Model Context Protocol (MCP) server for routing LLM requests across multiple providers and connecting to other MCP servers. Designed with an "all-local except the brain" architecture for privacy and control.
Features (Unified Router + Judge)
- One server, two roles: mcp_llm_router.server now ships Judge tools in-process; no separate mcp-as-a-judge server required.
- Multi-Provider LLM Routing: Route requests to OpenAI, OpenRouter, DeepInfra, and other OpenAI-compatible APIs.
- Configurable "Brain" Model: Choose DeepSeek reasoning or any OpenAI-compatible model as the router brain.
- Session Management: Track agent sessions with goals, constraints, and event logging.
- Quality Gating (Judge): Plan → code → test → completion validation using the embedded Judge toolset.
- Local-First Memory: Local embeddings via Ollama by default, with an optional ChromaDB vector store for efficient semantic search; OpenAI-compatible endpoints are supported as a fallback.
- Local Cross-Encoder Reranking: Optional privacy-focused reranking using Qwen3-Reranker-0.6B for improved search relevance without external API calls.
- MCP Server Orchestration: Connect to and orchestrate multiple MCP servers.
- Cross-Server Tool Calling: Call tools across different MCP servers.
- Universal MCP Compatibility: Works with any MCP-compatible client (not tied to specific IDEs).
Architecture: All-Local Except the Brain
This project follows an "all-local except the brain" design philosophy:
- ✅ Embeddings: Run locally via Ollama (default: qwen3-embedding:0.6b)
- ✅ Vector Storage: SQLite (default) or ChromaDB with HNSW indexing (optional RAG package)
- ✅ Document Chunking: Token-based chunking with overlap (optional RAG package)
- ✅ Semantic Search: Local cosine similarity with L2-normalized vectors
- ✅ Reranking: Optional local cross-encoder reranking with Qwen3-Reranker-0.6B
- 🌐 LLM "Brain": Configurable external API (DeepSeek, OpenAI, etc.) for reasoning and generation
Why? This architecture keeps your data and semantic search private and fast, while leveraging powerful external LLMs only for high-level reasoning tasks.
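The local semantic search step is cheap by design: once embeddings are L2-normalized, cosine similarity reduces to a plain dot product. A minimal sketch in pure Python (illustrative only; the server's actual logic lives in mcp_llm_router/memory.py):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine_sim(a, b):
    """Dot product of two L2-normalized vectors equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Toy 3-dimensional "embeddings"
query = l2_normalize([1.0, 2.0, 2.0])
doc = l2_normalize([2.0, 4.0, 4.0])     # same direction -> similarity 1.0
other = l2_normalize([-2.0, 1.0, 0.0])  # orthogonal -> similarity 0.0

print(round(cosine_sim(query, doc), 6))    # 1.0
print(round(cosine_sim(query, other), 6))  # 0.0
```

Normalizing once at index time means every later search is just dot products over stored vectors.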
Installation
Quick Install (Recommended)
One-command automated installation:
./install.sh
This script will:
- ✅ Create a Python virtual environment
- ✅ Install all dependencies from pyproject.toml
- ✅ Check for Ollama installation
- ✅ Verify the setup
- ✅ Display next steps with your specific paths
Manual Installation
If you prefer manual installation or need a Conda environment:
# Clone the repository
git clone https://github.com/groxaxo/mcp-llm-router.git
cd mcp-llm-router
# Option 1: Using venv (recommended)
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -U pip
pip install -e .
# Option 2: Using Conda
conda create -n mcp-router python=3.12 -y
conda activate mcp-router
pip install -U pip
pip install -e .
Ollama Setup (Required for Local Embeddings)
Install Ollama for local, privacy-focused embeddings:
# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh
# Or download from https://ollama.ai
Pull the embedding model:
ollama pull qwen3-embedding:0.6b
Verify Ollama is running:
curl http://localhost:11434/api/version
Alternative Embedding Models:
- nomic-embed-text - General-purpose embeddings
- mxbai-embed-large - Larger model for better quality
Set via environment variable:
export EMBEDDINGS_MODEL="nomic-embed-text"
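For reference, Ollama's /api/embed endpoint accepts a JSON body containing the model name and the input texts. A small sketch that builds such a request from the EMBEDDINGS_* variables above (it only constructs the URL and body; actually posting it and parsing the returned embeddings is left out):

```python
import json
import os

def build_embed_request(texts):
    """Build the URL and JSON body for an Ollama /api/embed call from the
    EMBEDDINGS_* environment variables, falling back to the documented defaults."""
    base_url = os.environ.get("EMBEDDINGS_BASE_URL", "http://localhost:11434")
    path = os.environ.get("EMBEDDINGS_PATH", "/api/embed")
    model = os.environ.get("EMBEDDINGS_MODEL", "qwen3-embedding:0.6b")
    body = {"model": model, "input": texts}
    return base_url + path, json.dumps(body)

url, body = build_embed_request(["hello world"])
print(url)   # e.g. http://localhost:11434/api/embed, unless overridden
print(body)
```

POSTing that body to the URL returns a JSON object whose "embeddings" field holds one vector per input text.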
Project Structure
mcp-llm-router/
├── install.sh # Automated installation script
├── README.md # This file
├── pyproject.toml # Python package configuration
│
├── mcp_llm_router/ # Main package
│ ├── server.py # MCP server entry point
│ ├── brain.py # LLM routing logic
│ ├── memory.py # Memory management (embeddings, search, rerank)
│ ├── codex.py # MCP server orchestration
│ └── judge/ # Embedded judge tools for quality gating
│
├── rag/ # Optional RAG package (ChromaDB, chunking)
│ ├── main.py # CLI for indexing and queries
│ ├── indexer.py # Document indexing
│ ├── retriever.py # Vector search
│ └── reranker.py # Local cross-encoder reranking
│
├── scripts/ # Utility scripts
│ ├── verify_server.py # Installation verification
│ ├── opencode # CLI tool for direct LLM requests
│ ├── mcp_client.py # MCP client for testing
│ └── mcp_manager.py # MCP server management
│
├── examples/ # Example configurations and demos
│ ├── demo_judge_gating.py # End-to-end judge workflow demo
│ ├── local_reranker_example.py # Local reranking example
│ ├── mcp-config.deepseek-ollama.json
│ └── mcp-config.local-reranker.json
│
└── tests/ # Test suite
├── test_server.py
├── test_mcp.py
└── test_local_reranker.py
Configuration
MCP Server Configuration (mcp-config.json)
{
"mcpServers": {
"llm-router": {
"command": "python",
"args": ["-m", "mcp_llm_router.server"],
"env": {
"OPENAI_API_KEY": "your-openai-key",
"DEEPINFRA_API_KEY": "your-deepinfra-key",
"OPENROUTER_API_KEY": "your-openrouter-key"
}
},
"other-server": {
"command": "python",
"args": ["-m", "other_mcp_server"],
"env": {}
}
}
}
Example Config + Demo
- examples/mcp-config.deepseek-ollama.json - DeepSeek brain + Ollama embeddings + judge history persistence.
- examples/mcp-config.local-reranker.json - DeepSeek brain + Ollama embeddings + local cross-encoder reranking.
- examples/demo_judge_gating.py - End-to-end demo that indexes memory and walks a task through judge gating via router_chat.
- examples/local_reranker_example.py - Example of using local cross-encoder reranking to improve search relevance.
Run the demo:
python examples/demo_judge_gating.py --config examples/mcp-config.deepseek-ollama.json
Run the local reranker example:
python examples/local_reranker_example.py
Note: the demo skips request_plan_approval because it requires user elicitation. Ensure DEEPSEEK_API_KEY (or LLM_API_KEY) is set and Ollama is running for embeddings.
Environment Variables
Set API keys in your environment or in the config:
export OPENAI_API_KEY="sk-proj-..."
export DEEPINFRA_API_KEY="..."
export OPENROUTER_API_KEY="sk-or-..."
export DEEPSEEK_API_KEY="..."
Brain Configuration (Router LLM)
# Core brain settings
export ROUTER_BRAIN_MODEL="deepseek-reasoner"
export ROUTER_BRAIN_PROVIDER="deepseek"
export ROUTER_BRAIN_API_KEY_ENV="DEEPSEEK_API_KEY"
# Optional overrides
export ROUTER_BRAIN_BASE_URL="https://api.deepseek.com"
export ROUTER_BRAIN_MAX_TOKENS="4000"
export ROUTER_BRAIN_TEMPERATURE="0.2"
You can also set the brain per session using the configure_brain tool.
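The ROUTER_BRAIN_* variables above can be collected into a config dict like the following sketch. This is a hypothetical helper for illustration (the example values above are used as fallbacks; the server's own loader may choose different defaults):

```python
import os

def load_brain_config():
    """Hypothetical helper: read the ROUTER_BRAIN_* variables, using the
    example values from this README as fallbacks. Illustration only."""
    return {
        "model": os.environ.get("ROUTER_BRAIN_MODEL", "deepseek-reasoner"),
        "provider": os.environ.get("ROUTER_BRAIN_PROVIDER", "deepseek"),
        "api_key_env": os.environ.get("ROUTER_BRAIN_API_KEY_ENV", "DEEPSEEK_API_KEY"),
        "base_url": os.environ.get("ROUTER_BRAIN_BASE_URL", "https://api.deepseek.com"),
        "max_tokens": int(os.environ.get("ROUTER_BRAIN_MAX_TOKENS", "4000")),
        "temperature": float(os.environ.get("ROUTER_BRAIN_TEMPERATURE", "0.2")),
    }

cfg = load_brain_config()
print(cfg["model"], cfg["max_tokens"])
```

Note that max_tokens and temperature arrive as strings from the environment and need explicit numeric conversion.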
Memory Configuration (Embeddings + Rerank)
Default: Local Ollama Embeddings (Recommended)
No API keys required! The default configuration uses local Ollama embeddings:
# Storage paths
export MCP_ROUTER_DATA_DIR="./.mcp-llm-router"
export MCP_ROUTER_MEMORY_DB="./.mcp-llm-router/memory.db"
# Local embeddings via Ollama (DEFAULT - no API key needed)
export EMBEDDINGS_PROVIDER="ollama"
export EMBEDDINGS_BASE_URL="http://localhost:11434"
export EMBEDDINGS_MODEL="qwen3-embedding:0.6b"
export EMBEDDINGS_PATH="/api/embed"
# No EMBEDDINGS_API_KEY_ENV needed for local Ollama!
Alternative: OpenAI-Compatible Embeddings
If you prefer cloud-based embeddings:
# Embeddings via OpenAI
export EMBEDDINGS_PROVIDER="openai"
export EMBEDDINGS_BASE_URL="https://api.openai.com/v1"
export EMBEDDINGS_MODEL="text-embedding-3-small"
export EMBEDDINGS_API_KEY_ENV="OPENAI_API_KEY"
export EMBEDDINGS_PATH="/embeddings"
Reranking (Optional)
Reranking is optional and defaults to "none". Three modes are available:
1. Local Cross-Encoder Reranking (Recommended for Privacy)
Uses the local Qwen3-Reranker-0.6B model for reranking without external API calls:
# Local cross-encoder reranking (requires transformers and torch)
export RERANK_PROVIDER="local"
export RERANK_MODE="local"
export RERANK_MODEL="tomaarsen/Qwen3-Reranker-0.6B-seq-cls" # Default model
Requirements:
- Install PyTorch: pip install torch
- Install Transformers: pip install transformers
- The model is downloaded automatically on first use (~1.2GB)
2. LLM-Based Reranking
Uses an external LLM API for reranking:
# Rerank using OpenAI-compatible LLM (optional)
export RERANK_PROVIDER="openai"
export RERANK_BASE_URL="https://api.openai.com/v1"
export RERANK_MODEL="gpt-4o-mini"
export RERANK_API_KEY_ENV="OPENAI_API_KEY"
export RERANK_PATH="/chat/completions"
export RERANK_MODE="llm"
3. Disable Reranking
# Or disable reranking entirely (default)
export RERANK_PROVIDER="none"
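When reranking is enabled, either mode does the same thing conceptually: take the top-k hits from the vector search and reorder them by a (query, document) relevance score. A mode-agnostic sketch with a stand-in scoring function (real scores would come from the Qwen3 cross-encoder or the configured LLM, not from this toy term-overlap scorer):

```python
def rerank(query, hits, score_fn):
    """Reorder search hits by a (query, document) relevance score, descending.
    score_fn stands in for a cross-encoder or LLM scorer."""
    scored = [(score_fn(query, h["doc"]), h) for h in hits]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in scored]

def overlap_score(query, doc):
    """Toy scorer: count how many query terms appear in the document."""
    terms = set(query.lower().split())
    return sum(1 for t in terms if t in doc.lower())

hits = [
    {"doc": "Unrelated notes about builds"},
    {"doc": "JWT authentication flow for the API"},
]
top = rerank("api authentication", hits, overlap_score)
print(top[0]["doc"])  # the JWT authentication document ranks first
```

Cross-encoders score each (query, document) pair jointly, which is slower than the bi-encoder vector search but usually more accurate, hence reranking only the top-k.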
Judge Persistence (embedded Judge)
# Persist judge conversation history + task metadata
export MCP_JUDGE_DATABASE_URL="sqlite:///./.mcp-llm-router/judge_history.db"
Advanced: ChromaDB + Token Chunking (RAG Package)
For enhanced semantic search with vector indexing and intelligent chunking, this repository includes an optional rag package that provides:
- Token-based chunking with overlap for consistent semantic granularity
- ChromaDB vector store with HNSW indexing for fast similarity search
- L2-normalized embeddings for consistent cosine similarity
- Batch embedding and efficient upserts
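The chunking scheme can be illustrated with a whitespace tokenizer standing in for the real one (the rag package defaults to 400-token chunks with 80-token overlap, so chunk starts advance by 320 tokens):

```python
def chunk_tokens(tokens, chunk_size=400, overlap=80):
    """Split a token list into fixed-size chunks whose starts advance by
    chunk_size - overlap, so consecutive chunks share `overlap` tokens."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Whitespace "tokens" as a stand-in for a real tokenizer's output
text = " ".join(f"tok{i}" for i in range(1000))
chunks = chunk_tokens(text.split(), chunk_size=400, overlap=80)
print(len(chunks))                     # 3 chunks for 1000 tokens
print(chunks[0][-80] == chunks[1][0])  # True: consecutive chunks overlap
```

The overlap keeps sentences that straddle a chunk boundary fully present in at least one chunk, which improves retrieval recall.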
Using the RAG Package
1. Install additional dependencies (already included in pyproject.toml):

   pip install -e .  # chromadb and transformers are now included

2. Index your codebase:

   python -m rag.main --path . --exts .py,.md --interactive

   This will:
   - Scan the current directory for .py and .md files
   - Chunk them into 400-token segments with 80-token overlap
   - Embed using Ollama (qwen3-embedding:0.6b)
   - Store in ChromaDB at data/chroma/
   - Enter interactive mode for testing queries

3. Use in your code:

   from rag.retriever import retrieve
   from rag.indexer import index_path

   # Index documents
   stats = index_path("/path/to/docs", exts=[".py", ".md"])
   print(f"Indexed {stats['files_indexed']} files")

   # Retrieve relevant chunks
   results = retrieve("How does authentication work?", top_k=5)
   for hit in results:
       print(f"Score: {hit['distance']:.4f}")
       print(f"File: {hit['meta']['path']}")
       print(f"Content: {hit['doc']}\n")
RAG Package Components:
- rag/embedding_config.py - Configuration constants
- rag/chunker.py - Token-based text chunking
- rag/ollama_embedder.py - Ollama embedding with normalization
- rag/chroma_store.py - ChromaDB initialization and management
- rag/indexer.py - Document indexing pipeline
- rag/retriever.py - Vector search and retrieval
- rag/main.py - CLI for indexing and queries
Note: The RAG package is a self-contained enhancement. The core MCP server works with its built-in SQLite memory store without requiring ChromaDB.
Usage
Running MCP Servers
Using the Server Runner
# List configured servers
python scripts/mcp_server_runner.py list
# Run a specific server
python scripts/mcp_server_runner.py run llm-router
Using the Server Manager
# Add a new server
python scripts/mcp_manager.py add my-server python -m my_mcp_server
# List servers
python scripts/mcp_manager.py list
# Test server connection
python scripts/mcp_manager.py test llm-router
# Remove a server
python scripts/mcp_manager.py remove my-server
Connecting to MCP Servers
Using the MCP Client
# List tools on a server
python scripts/mcp_client.py list-tools llm-router
# Call a tool on a server
python scripts/mcp_client.py call-tool llm-router start_session '{"goal": "Test session"}'
Using the Server Manager for Cross-Server Operations
# Call a tool across all configured servers
python scripts/mcp_manager.py call start_session '{"goal": "Test all servers"}'
MCP Tools Available
Session Management
- start_session(goal, constraints, context, metadata) - Start a new agent session
- log_event(session_id, kind, message, details) - Log events to a session
- get_session_context(session_id) - Retrieve full session data
LLM Routing
- agent_llm_request(session_id, prompt, model, base_url, api_key_env, ...) - Route to LLM providers
- configure_brain(...) - Set the global or per-session brain model/settings
- get_brain_config(session_id) - Read the active brain configuration
- router_chat(session_id, message, ...) - Main brain chat (memory + workflow guidance)
Memory (Embeddings + Rerank)
- configure_memory(...) - Set embedding/rerank configuration globally or per-session
- memory_index(namespace, texts, metadatas, doc_ids) - Index texts into memory
- memory_search(namespace, query, top_k, rerank) - Retrieve relevant memory hits
- memory_delete(namespace, doc_id) - Delete one doc or a whole namespace
- memory_list_namespaces() - List namespaces
- memory_stats() - Show memory counts
MCP Server Orchestration
- connect_mcp_server(server_name, command, args, env) - Configure connection to another MCP server
- list_mcp_servers() - List configured MCP server connections
- call_mcp_tool(server_name, tool_name, arguments) - Call tools on other MCP servers
- list_mcp_tools(server_name) - List tools available on another MCP server
Judge Tools (built-in)
- set_coding_task(...)
- get_current_coding_task()
- request_plan_approval(...)
- judge_coding_plan(...)
- judge_code_change(...)
- judge_testing_implementation(...)
- judge_coding_task_completion(...)
- raise_obstacle(...)
- raise_missing_requirements(...)
Integration with MCP Clients
Any MCP-Compatible Client
The server works with any client that supports the MCP protocol:
{
"mcpServers": {
"llm-router": {
"command": "python",
"args": ["-m", "mcp_llm_router.server"],
"env": {
"OPENAI_API_KEY": "your-key"
}
}
}
}
Example: Claude Desktop
Add to your Claude Desktop MCP configuration:
{
"mcpServers": {
"llm-router": {
"command": "python",
"args": ["-m", "mcp_llm_router.server"],
"env": {
"OPENAI_API_KEY": "sk-...",
"DEEPINFRA_API_KEY": "..."
}
}
}
}
Example: Custom MCP Client
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
async def main():
server_params = StdioServerParameters(
command="python",
args=["-m", "mcp_llm_router.server"],
env={"OPENAI_API_KEY": "your-key"}
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
# Start a session
result = await session.call_tool("start_session", {
"goal": "Test the MCP server"
})
print("Session started:", result)
if __name__ == "__main__":
asyncio.run(main())
Provider Configuration
OpenAI
{
"base_url": null, # Uses default
"api_key_env": "OPENAI_API_KEY"
}
OpenRouter
{
"base_url": "https://openrouter.ai/api/v1",
"api_key_env": "OPENROUTER_API_KEY"
}
DeepInfra
{
"base_url": "https://api.deepinfra.com/v1/openai",
"api_key_env": "DEEPINFRA_API_KEY"
}
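The three provider configs above can be collapsed into a single lookup table. A hypothetical helper for illustration (resolve_provider and the PROVIDERS table are not part of the server's API; they just mirror the configs shown):

```python
import os

# The three provider configs above, collected into one table.
PROVIDERS = {
    "openai": {"base_url": None, "api_key_env": "OPENAI_API_KEY"},
    "openrouter": {"base_url": "https://openrouter.ai/api/v1",
                   "api_key_env": "OPENROUTER_API_KEY"},
    "deepinfra": {"base_url": "https://api.deepinfra.com/v1/openai",
                  "api_key_env": "DEEPINFRA_API_KEY"},
}

def resolve_provider(name):
    """Hypothetical helper: return (base_url, api_key) for a provider,
    reading the key from the environment variable each config names."""
    cfg = PROVIDERS[name]
    return cfg["base_url"], os.environ.get(cfg["api_key_env"])

base_url, key = resolve_provider("openrouter")
print(base_url)  # https://openrouter.ai/api/v1
```

A base_url of None means the client library's default endpoint is used, as in the OpenAI config above.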
CLI Tool
The opencode command provides direct CLI access:
# Basic usage
scripts/opencode run "What is Python?"
# Use specific provider
scripts/opencode run "Explain Docker" --provider deepinfra --model meta-llama/Meta-Llama-3.1-70B-Instruct
Development
Running the Server Directly
cd ~/mcp-llm-router
conda activate mcp-router
python -m mcp_llm_router.server
Testing
# Test server startup
timeout 5 python -m mcp_llm_router.server
# Test CLI
scripts/opencode run "Hello world"
# Test MCP client
python scripts/mcp_client.py list-tools llm-router
Architecture
┌─────────────────┐ ┌──────────────────────────────────────┐
│ MCP Client │◄──►│ LLM Router MCP Server │
│ (Claude, etc.) │ │ ┌────────────────────────────────┐ │
└─────────────────┘ │ │ Session & Memory Management │ │
│ │ • SQLite/ChromaDB (local) │ │
│ │ • Ollama Embeddings (local) │ │
│ │ • L2-normalized vectors │ │
│ └────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────┐ │
│ │ Brain (External LLM API) │ │
│ │ • DeepSeek / OpenAI / etc. │ │
│ │ • Reasoning & Generation │ │
│ └────────────────────────────────┘ │
└──────────────────────────────────────┘
│
▼
┌──────────────────┐
│ Other MCP Servers│
│ • File system │
│ • Database │
│ • APIs │
└──────────────────┘
All-Local Except the Brain:
✅ Embeddings: Ollama (local, no API key)
✅ Vector Store: SQLite or ChromaDB (local)
✅ Semantic Search: Local cosine similarity
🌐 LLM Brain: External API (configurable)
License
MIT License - see LICENSE file for details.
# Basic usage with OpenAI (default)
scripts/opencode run "Explain quantum computing"
# Use a specific provider
scripts/opencode run "Write a Python function" --provider openrouter --model anthropic/claude-3-opus
# Use DeepInfra
scripts/opencode run "Summarize this text" --provider deepinfra --model meta-llama/Llama-3.1-70B-Instruct
Available providers:
- openai (default) - Uses OPENAI_API_KEY
- openrouter - Uses OPENROUTER_API_KEY
- deepinfra - Uses DEEPINFRA_API_KEY
MCP Tools
When used as an MCP server in Antigravity, the following tools are available:
start_session
Start a new agent session with a goal and constraints.
{
"goal": "Implement user authentication",
"constraints": "Use JWT tokens, no external dependencies",
"context": "FastAPI application"
}
log_event
Log events during an agent session (info, error, warning, success).
{
"session_id": "uuid-here",
"kind": "error",
"message": "Build failed",
"details": {"exit_code": 1}
}
agent_llm_request
Make a request to an LLM provider within a session.
{
"session_id": "uuid-here",
"prompt": "How do I fix this error?",
"model": "gpt-4",
"base_url": "https://openrouter.ai/api/v1", # optional
"api_key_env": "OPENROUTER_API_KEY"
}
get_session_context
Retrieve full session history and events.
{
"session_id": "uuid-here"
}
Example Agent Workflow in Antigravity
1. Start session: Call start_session with goal="Build a REST API for task management"
2. Work on task: Create files, run commands, etc.
3. Log progress: Call log_event with kind="info", message="Created database schema"
4. When stuck: Call agent_llm_request with prompt="How do I handle authentication?"
5. Review context: Call get_session_context to see full history
Environment Variables
Set these in your ~/.bashrc or Antigravity config:
export OPENAI_API_KEY="sk-..."
export OPENROUTER_API_KEY="sk-or-..."
export DEEPINFRA_API_KEY="..."