MCP Jina Supabase RAG

A lean MCP server that crawls documentation websites and indexes them in Supabase for Retrieval-Augmented Generation (RAG).

Features

Smart URL Discovery: Tries sitemap.xml first, falls back to Crawl4AI recursive discovery
Hybrid Content Extraction: Uses Jina AI for fast content extraction, Crawl4AI as fallback
Multi-Project Support: Index multiple documentation sites as separate projects, isolated from each other by a source column in Supabase
Efficient Chunking: Intelligent text chunking with configurable size and overlap
Vector Embeddings: OpenAI embeddings stored in Supabase pgvector
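
To make "configurable size and overlap" concrete, here is a minimal sketch of character-based chunking. It is an illustration only, not necessarily how this server chunks text: the function name `chunk_text` and the default values are assumptions, and the real implementation may split on headings or sentences instead of raw character offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text cut at a chunk
    boundary keeps some of its surrounding context in the next chunk.
    The defaults are illustrative, not the server's actual settings.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

A larger overlap improves the chance that a retrieved chunk contains a complete thought, at the cost of storing and embedding more duplicated text.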

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    MCP Server Tools                         │
├─────────────────────────────────────────────────────────────┤
│  1. crawl_and_index(url_pattern, project_name)             │
│  2. list_projects()                                         │
│  3. search_documents(query, project_name, limit)           │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   Discovery Layer                           │
├─────────────────────────────────────────────────────────────┤
│  • Try sitemap.xml (fast)                                   │
│  • Try common doc patterns                                  │
│  • Crawl4AI recursive discovery (fallback)                  │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  Extraction Layer                           │
├─────────────────────────────────────────────────────────────┤
│  • Jina AI Reader API (primary, fast)                       │
│  • Crawl4AI (fallback for complex pages)                    │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│              Chunking & Embedding Layer                     │
├─────────────────────────────────────────────────────────────┤
│  • Smart text chunking                                      │
│  • OpenAI embeddings (text-embedding-3-small)               │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   Supabase Storage                          │
├─────────────────────────────────────────────────────────────┤
│  • pgvector for similarity search                           │
│  • Project isolation via source column                      │
└─────────────────────────────────────────────────────────────┘
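
The Discovery and Extraction layers both follow a "try the fast path, then fall back" pattern. The sketch below shows one way such a chain could be expressed. The helper `first_successful` and the stand-in strategy functions are hypothetical names used for illustration and are not taken from this repository.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_successful(
    strategies: Sequence[tuple[str, Callable[[], Optional[T]]]],
) -> tuple[str, T]:
    """Run strategies in order and return (name, result) for the first one
    that neither raises nor returns an empty result."""
    errors = []
    for name, strategy in strategies:
        try:
            result = strategy()
        except Exception as exc:  # a failing strategy just triggers the fallback
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: empty result")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


# Stand-ins for the real discovery steps (hypothetical).
def from_sitemap() -> list[str]:
    raise LookupError("no sitemap.xml found")


def from_crawl() -> list[str]:
    return ["https://docs.example.com/intro"]


used, urls = first_successful([("sitemap", from_sitemap), ("crawl", from_crawl)])
print(used, urls)
```

Collecting the error from each failed strategy makes the final exception useful when every path fails, for example when a site has no sitemap and also blocks crawlers.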

Installation

Prerequisites

Python 3.12+
Supabase account
OpenAI API key
Jina AI API key (optional, recommended)

Setup

Clone the repository:

git clone https://github.com/yourusername/mcp-jina-supabase-rag.git
cd mcp-jina-supabase-rag

Install dependencies:

# Using uv (recommended)
uv venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
uv pip install -e .

# Or using pip
pip install -e .

Set up Supabase database:

# Run the SQL in supabase_schema.sql in your Supabase SQL Editor

Configure environment:

cp .env.example .env
# Edit .env with your credentials

Usage

Running the MCP Server

# SSE transport (recommended for remote connections)
python src/main.py

# The server will start on http://localhost:8052/sse

Configure MCP Client

Claude Code

claude mcp add --transport sse jina-supabase http://localhost:8052/sse

Cursor / Claude Desktop

{
  "mcpServers": {
    "jina-supabase": {
      "transport": "sse",
      "url": "http://localhost:8052/sse"
    }
  }
}

Slash Command

Create ~/.claude/commands/jina.md:

---
allowed-tools: mcp__jina-supabase
argument-hint: <url_pattern> <project_name>
description: Crawl documentation and index to Supabase RAG
---

# Index Documentation to Supabase

Use the jina-supabase MCP server to crawl and index documentation.

Arguments:
- $1: URL pattern (e.g., https://docs.example.com/*)
- $2: Project name for isolation

Example:
/jina https://docs.anthropic.com/claude/* anthropic-docs

Tools

`crawl_and_index`

Crawl a documentation site and index to Supabase.

Parameters:

url_pattern (string): URL or pattern to crawl
project_name (string): Project identifier for isolation
discovery_method (string, optional): auto, sitemap, or crawl
extraction_method (string, optional): auto, jina, or crawl4ai

Example:

await crawl_and_index(
    url_pattern="https://supabase.com/docs/*",
    project_name="supabase-docs",
    discovery_method="auto",
    extraction_method="jina"
)

`list_projects`

List all indexed projects.

Returns: List of project names with document counts
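
The exact return shape is not specified here. As a rough sketch, assuming each stored row carries the `source` column mentioned in the architecture diagram, the per-project counts could be derived as follows. The function name and the `project_name` / `document_count` keys are assumptions, and this version counts stored rows, which may be chunks instead of whole pages.

```python
from collections import Counter


def count_documents_by_project(rows: list[dict]) -> list[dict]:
    """Group rows by their `source` value and count them.

    `rows` stands in for records read from Supabase; only the `source`
    key is used. Output keys are illustrative, not the tool's real schema.
    """
    counts = Counter(row["source"] for row in rows)
    return [
        {"project_name": name, "document_count": count}
        for name, count in sorted(counts.items())
    ]
```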

`search_documents`

Search indexed documents using vector similarity.

Parameters:

query (string): Search query
project_name (string, optional): Filter by project
limit (int, optional): Max results (default: 5)

Example:

results = await search_documents(
    query="How do I set up authentication?",
    project_name="supabase-docs",
    limit=10
)
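
For intuition about what vector similarity search does, the sketch below ranks rows in memory by cosine similarity and applies the optional project filter. It is not the server's implementation: in practice the ranking presumably runs inside Postgres via pgvector, and whether cosine similarity is the metric used is an assumption. The row keys `content` and `embedding` are hypothetical; `source` comes from the architecture diagram.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_rows(query_embedding, rows, project_name=None, limit=5):
    """Return the `limit` rows most similar to the query embedding.

    When project_name is given, only rows whose `source` matches it are
    considered, which mirrors the optional project filter of the tool.
    """
    candidates = [
        row for row in rows
        if project_name is None or row["source"] == project_name
    ]
    scored = [(cosine_similarity(query_embedding, row["embedding"]), row)
              for row in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"content": row["content"], "source": row["source"], "similarity": score}
        for score, row in scored[:limit]
    ]
```

The query text must be embedded with the same model as the stored chunks before it can be compared against them.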

Configuration

See .env.example for all configuration options.

Discovery Methods

auto: Try sitemap first, fall back to crawl
sitemap: Only use sitemap.xml (fast, fails if no sitemap)
crawl: Only use Crawl4AI recursive discovery (slow, comprehensive)
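
As an illustration of the sitemap path, the following sketch extracts page URLs from a sitemap document and keeps those matching a pattern such as `https://docs.example.com/*`. It assumes the `*` in `url_pattern` is a shell-style wildcard, which this README does not confirm, and the function name is hypothetical. Fetching the sitemap and following sitemap index files are left out.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str, url_pattern: str) -> list[str]:
    """Return the <url><loc> entries of a sitemap that match url_pattern.

    Matching uses shell-style wildcards (an assumption about how the
    server interprets patterns). Note that `?` and `[` in the pattern are
    also treated as wildcard syntax by fnmatchcase.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url and fnmatchcase(url, url_pattern):
            urls.append(url)
    return urls
```

Parsing a sitemap needs a single HTTP request, whereas recursive crawling has to load and render pages to find links, which is why the sitemap path is the fast one.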

Extraction Methods

auto: Use Jina for bulk extraction (>10 URLs), Crawl4AI otherwise
jina: Use Jina AI Reader API (fast, requires API key)
crawl4ai: Use Crawl4AI browser automation (slow, no API key needed)
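
The `auto` rule and the Jina path can be sketched as below. This is a hedged illustration: the function names are hypothetical, the threshold constant simply restates the ">10 URLs" rule above, and the request builder relies on the public Jina Reader convention of prefixing the target URL with `https://r.jina.ai/` and sending the API key as a bearer token. Whether the server calls the Reader in exactly this way is an assumption.

```python
import urllib.request
from typing import Optional

JINA_READER_PREFIX = "https://r.jina.ai/"
BULK_THRESHOLD = 10  # "auto" treats more than 10 URLs as bulk extraction


def choose_extraction_method(method: str, url_count: int) -> str:
    """Resolve 'auto' to a concrete extractor; pass explicit choices through."""
    if method in ("jina", "crawl4ai"):
        return method
    if method != "auto":
        raise ValueError(f"unknown extraction method: {method!r}")
    return "jina" if url_count > BULK_THRESHOLD else "crawl4ai"


def build_jina_request(target_url: str,
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a Reader request for target_url."""
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(JINA_READER_PREFIX + target_url, headers=headers)
```

Passing the returned request to `urllib.request.urlopen` would perform the actual fetch; that step is omitted so the sketch has no network dependency.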

Development

# Install dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest

# Format code
black src/

# Lint
ruff check src/

Differences from mcp-crawl4ai-rag

| Feature | mcp-crawl4ai-rag | mcp-jina-supabase-rag |
|---------|------------------|------------------------|
| Focus | Full-featured RAG with knowledge graphs | Lean documentation indexer |
| Discovery | Recursive only | Sitemap first, crawl fallback |
| Extraction | Crawl4AI only | Jina primary, Crawl4AI fallback |
| Dependencies | Heavy (Neo4j, etc.) | Light (core only) |
| Use Case | Advanced RAG with hallucination detection | Fast doc indexing |

License

MIT

Contributing

Contributions welcome! Please open an issue first to discuss changes.
