CrawMcp

Gives Claude (or any MCP client) a web-crawling capability that runs on your own server. It fetches pages blocked by anti-bot walls, JS challenges, or IP bans, without depending on paid, quota-limited scraping services like Firecrawl, Bright Data, or Browserbase. Bring your own proxy, self-host it, and content is deleted automatically by TTL. License: AGPL-3.0.

A URL-crawling MCP server. The model (Claude) cannot send requests directly to most sites. Instead it hands the URL to this MCP, and the MCP publishes it to RabbitMQ. A separate real crawler (worker) picks it up from the request queue, opens the page, writes the content to MinIO, and publishes only a pointer to the results queue. The MCP reads that pointer, fetches the content from MinIO, and returns it to the model.

Heavy page content lives in MinIO (object store), job status/metadata lives in Redis, and RabbitMQ only carries small messages. This is the classic Claim Check pattern.

Claude ──[crawl_url]──▶ MCP ──publish(url)──▶ RabbitMQ: crawl.requests
                         │                          │
                  (metadata to Redis)       crawler: opens the page
                         │                          │
                         │                    content ─▶ MinIO: crawl/<jobId>  (gzip)
                         │                          │
Claude ◀─[crawl_result]─ MCP ◀── pointer+meta ── RabbitMQ: crawl.results
                         │
                MCP fetches content from MinIO by storageKey

RabbitMQ = transport pipe (messages are small, deleted once acked).
Redis crawmcp:job:<id> = status + metadata (NO text; has TTL, shared).
MinIO crawl/<jobId> = heavy content (gzipped, the real persistent home).

crawl_url returns a job_id right away (does not block). You fetch the content with crawl_result using that job_id. It returns pending until the worker finishes.
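
The non-blocking flow above can be sketched with in-memory stand-ins. This is only an illustration of the Claim Check idea, not the server's actual code: the `Map`s and the array play the roles of Redis, MinIO, and RabbitMQ, and the names `crawlUrl`, `crawlResult`, `runWorkerOnce`, the `job-<n>` id format, and the `"unknown"` status are all assumptions made for the example.

```typescript
// In-memory stand-ins (assumption: the real server talks to RabbitMQ, Redis, MinIO).
type JobMeta =
  | { state: "pending"; url: string }
  | { state: "done"; url: string; storageKey: string }
  | { state: "error"; url: string; error: string };

interface RequestMessage {
  jobId: string;
  url: string;
  storageKey: string;
}

const requestQueue: RequestMessage[] = [];   // plays the role of crawl.requests
const jobs = new Map<string, JobMeta>();      // plays the role of Redis (status only)
const objects = new Map<string, string>();    // plays the role of MinIO (heavy content)

let counter = 0;

// crawl_url: record metadata, enqueue a small message, return right away.
function crawlUrl(url: string): { job_id: string } {
  const jobId = `job-${++counter}`; // hypothetical id format
  jobs.set(jobId, { state: "pending", url });
  requestQueue.push({ jobId, url, storageKey: `crawl/${jobId}` });
  return { job_id: jobId };
}

// crawl_result: "pending" until a worker has stored content and reported the pointer.
function crawlResult(jobId: string): { status: string; content?: string; error?: string } {
  const meta = jobs.get(jobId);
  if (!meta) return { status: "unknown" }; // hypothetical: expired or never created
  if (meta.state === "pending") return { status: "pending" };
  if (meta.state === "error") return { status: "error", error: meta.error };
  return { status: "done", content: objects.get(meta.storageKey) };
}

// Hypothetical worker step: take one request, store the page, record only the pointer.
function runWorkerOnce(fetchPage: (url: string) => string): void {
  const msg = requestQueue.shift();
  if (!msg) return;
  objects.set(msg.storageKey, fetchPage(msg.url));
  jobs.set(msg.jobId, { state: "done", url: msg.url, storageKey: msg.storageKey });
}
```

Note that the page text only ever sits in `objects`; the queue entry and the job record carry the key, which is the point of the pattern.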

Tools

| Tool | What it does |
|------|----------|
| crawl_url | { url } → writes to RabbitMQ, returns job_id |
| crawl_result | { job_id, max_length? } → pending / page content (from MinIO) / error |
| crawl_forget | { job_id } → deletes content from MinIO and metadata from Redis (manual/early cleanup) |

Content cleanup (deletion)

To stop content from piling up, job lifetime = content lifetime:

In sync with TTL (automatic): when the job record in Redis expires per JOB_TTL_SECONDS, the MCP catches it via a Redis expired event and deletes the matching MinIO object (crawl/<jobId>). (The MCP sets notify-keyspace-events itself at startup.)
Lifecycle backstop: in case an event is missed, a day-based expiry rule on the bucket (CONTENT_BACKSTOP_DAYS, default 1 day) stands as a safety net.
Manual/early: crawl_forget(job_id) deletes the content + metadata immediately.
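
The TTL-synced path can be sketched as a mapping from an expired Redis key to the object that should be removed. This is a minimal sketch, not the project's code: it assumes the listener receives the expired key name as the message (which is how Redis keyevent notifications on `__keyevent@<db>__:expired` work when `notify-keyspace-events` includes `Ex`), and `storageKeyForExpiredKey`, `onExpired`, and the injected `deleteObject` are hypothetical names.

```typescript
const JOB_KEY_PREFIX = "crawmcp:job:";

// Map an expired Redis key to the default MinIO key, or null if the key is
// not one of our job records.
function storageKeyForExpiredKey(expiredKey: string): string | null {
  if (!expiredKey.startsWith(JOB_KEY_PREFIX)) return null;
  const jobId = expiredKey.slice(JOB_KEY_PREFIX.length);
  return jobId.length > 0 ? `crawl/${jobId}` : null;
}

// Hypothetical handler: deleteObject stands in for an S3/MinIO delete call.
// Returns true when a delete was issued.
async function onExpired(
  expiredKey: string,
  deleteObject: (key: string) => Promise<void>,
): Promise<boolean> {
  const storageKey = storageKeyForExpiredKey(expiredKey);
  if (storageKey === null) return false; // unrelated key, ignore
  await deleteObject(storageKey);
  return true;
}
```

By the time the event arrives, the job record is already gone, so the object key has to be derived from the job id alone. In this sketch that means only the default `crawl/<jobId>` key is cleaned up on expiry; anything stored elsewhere would be left to the lifecycle backstop.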

Setup

npm install
npm run build

Running

npm run infra:up       # RabbitMQ + Redis + MinIO + worker-rs (Rust crawler), docker
npm run build          # build the MCP server → dist/
# The MCP server is normally started by the client (Claude Code/Desktop); to run it standalone: npm start
# npm run worker:example → pure-JS reference worker (contract/local test; production = worker-rs)

UIs: RabbitMQ → http://localhost:15672 (guest/guest), MinIO → http://localhost:9001 (minioadmin/minioadmin). To stop: npm run infra:down.

Adding it to Claude Code

claude mcp add crawmcp -- node <abs>/dist/index.js   # <abs> = absolute path of this repo

Message contract (between crawler and MCP)

This is all you need to know when writing your own crawler. Heavy text does not go in the message. The crawler writes content to MinIO and sends a pointer to the queue.

Request, MCP → crawl.requests:

{ "jobId": "crawl_1_...", "url": "https://...", "requestedAt": 1730000000000, "storageKey": "crawl/crawl_1_..." }

Result, crawler → crawl.results (first write the gzipped content to storageKey, then publish):

{
  "jobId": "crawl_1_...",
  "result": {
    "ok": true,
    "status": 200,
    "contentType": "text/html",
    "size": 528,
    "storageKey": "crawl/crawl_1_...",
    "encoding": "gzip"
  }
}

On error: { "jobId": "...", "result": { "ok": false, "error": "..." } }

The MCP suggests the storageKey (crawl/<jobId>). If the crawler uses a different key, it just needs to report the key it wrote in the result. The MCP reads the content with that key. If encoding: "gzip", the MCP decompresses it.
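
For the consuming side, the contract above could be expressed as types plus a small validator and decoder. This is a sketch under assumptions: the type names, `parseResultMessage`, and `decodeContent` are illustrative, and treating `encoding` as optional is a guess based on the "if encoding is gzip" wording.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

type CrawlOutcome =
  | {
      ok: true;
      status: number;
      contentType: string;
      size: number;
      storageKey: string;
      encoding?: string; // assumption: absent means "stored as-is"
    }
  | { ok: false; error: string };

interface CrawlResultMessage {
  jobId: string;
  result: CrawlOutcome;
}

// Parse a raw queue payload; return null for anything that does not match
// the contract instead of throwing.
function parseResultMessage(raw: string): CrawlResultMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const msg = data as { jobId?: unknown; result?: unknown };
  if (typeof msg.jobId !== "string") return null;
  if (typeof msg.result !== "object" || msg.result === null) return null;
  const r = msg.result as Record<string, unknown>;
  if (r["ok"] === true && typeof r["storageKey"] === "string") return data as CrawlResultMessage;
  if (r["ok"] === false && typeof r["error"] === "string") return data as CrawlResultMessage;
  return null;
}

// Decompress only when the crawler said the object is gzipped.
function decodeContent(stored: Buffer, encoding?: string): string {
  const bytes = encoding === "gzip" ? gunzipSync(stored) : stored;
  return bytes.toString("utf8");
}

// Round trip: what a crawler would store, and what the reader gets back.
const roundTrip = decodeContent(gzipSync("<html></html>"), "gzip");
console.log(roundTrip); // prints "<html></html>"
```

The validator checks only the fields the reader depends on (`jobId`, `ok`, and either `storageKey` or `error`); a stricter version could also verify `status`, `contentType`, and `size`.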

Crawler / worker

Production crawler: worker-rs. A Rust multi-strategy scraper (TLS-fingerprint HTTP strategies + a real-Chromium --dump-dom BrowserFetch fallback, JS-challenge solver, optional proxy rotation). It runs as the worker-rs service in docker-compose.yml (npm run infra:up starts it).

worker/example-worker.mjs is a minimal pure-JS reference. It shows the message contract (fetch → gzip → MinIO → pointer), for local testing (npm run worker:example). Any worker that follows the contract above works with this MCP without changes.
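
A custom worker following the contract might look roughly like the sketch below. It is not a copy of worker/example-worker.mjs: `WorkerDeps` and `handleRequest` are hypothetical names, the fetch, storage, and publish steps are injected so the ordering (content first, pointer second) is the only thing shown, and reporting `size` as the uncompressed byte length is an assumption.

```typescript
import { gzipSync } from "node:zlib";

interface CrawlRequest {
  jobId: string;
  url: string;
  requestedAt: number;
  storageKey: string;
}

type CrawlOutcome =
  | { ok: true; status: number; contentType: string; size: number; storageKey: string; encoding: string }
  | { ok: false; error: string };

interface CrawlResultMessage {
  jobId: string;
  result: CrawlOutcome;
}

// Injected side effects (assumption: real code would wrap an HTTP client,
// an S3/MinIO client, and a RabbitMQ channel behind these).
interface WorkerDeps {
  fetchPage(url: string): Promise<{ status: number; contentType: string; body: string }>;
  putObject(key: string, body: Buffer): Promise<void>;
  publishResult(message: CrawlResultMessage): Promise<void>;
}

async function handleRequest(req: CrawlRequest, deps: WorkerDeps): Promise<void> {
  let message: CrawlResultMessage;
  try {
    const page = await deps.fetchPage(req.url);
    const raw = Buffer.from(page.body, "utf8");
    // Write the gzipped content first, so the pointer never refers to a missing object.
    await deps.putObject(req.storageKey, gzipSync(raw));
    message = {
      jobId: req.jobId,
      result: {
        ok: true,
        status: page.status,
        contentType: page.contentType,
        size: raw.length, // assumption: uncompressed size in bytes
        storageKey: req.storageKey,
        encoding: "gzip",
      },
    };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    message = { jobId: req.jobId, result: { ok: false, error } };
  }
  // Only the small pointer (or error) goes on the results queue.
  await deps.publishResult(message);
}
```

Building the message inside the `try` and publishing once afterwards keeps a failed publish from being reported as a crawl error.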

Environment variables

| Variable | Default |
|----------|-----------|
| RABBITMQ_URL | amqp://guest:guest@localhost:5672 |
| REDIS_URL | redis://localhost:6379 |
| CRAWL_REQUEST_QUEUE | crawl.requests |
| CRAWL_RESULT_QUEUE | crawl.results |
| JOB_TTL_SECONDS | 3600 (lifetime of metadata in Redis) |
| MINIO_ENDPOINT | http://localhost:9000 |
| MINIO_ACCESS_KEY / MINIO_SECRET_KEY | minioadmin / minioadmin |
| MINIO_BUCKET | crawl-content |
| S3_REGION | us-east-1 |
| CONTENT_BACKSTOP_DAYS | 1 (lifecycle safety net; the real deletion is by TTL) |

Because it uses the AWS SDK, you can switch to Cloudflare R2 / AWS S3 with the same code by changing MINIO_ENDPOINT and the keys.
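
As an illustration of how these variables and defaults fit together, here is a small loader sketch. It is an assumption about structure, not the server's actual configuration code: `CrawMcpConfig` and `loadConfig` are hypothetical names, and the environment is passed in as a plain object so the function stays easy to test.

```typescript
interface CrawMcpConfig {
  rabbitmqUrl: string;
  redisUrl: string;
  requestQueue: string;
  resultQueue: string;
  jobTtlSeconds: number;
  endpoint: string;
  accessKey: string;
  secretKey: string;
  bucket: string;
  region: string;
  backstopDays: number;
}

type Env = Record<string, string | undefined>;

function loadConfig(env: Env): CrawMcpConfig {
  // Read a positive number, falling back to the documented default when unset.
  const num = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw.trim() === "") return fallback;
    const n = Number(raw);
    if (!Number.isFinite(n) || n <= 0) {
      throw new Error(`${name} must be a positive number, got "${raw}"`);
    }
    return n;
  };
  return {
    rabbitmqUrl: env["RABBITMQ_URL"] ?? "amqp://guest:guest@localhost:5672",
    redisUrl: env["REDIS_URL"] ?? "redis://localhost:6379",
    requestQueue: env["CRAWL_REQUEST_QUEUE"] ?? "crawl.requests",
    resultQueue: env["CRAWL_RESULT_QUEUE"] ?? "crawl.results",
    jobTtlSeconds: num("JOB_TTL_SECONDS", 3600),
    endpoint: env["MINIO_ENDPOINT"] ?? "http://localhost:9000",
    accessKey: env["MINIO_ACCESS_KEY"] ?? "minioadmin",
    secretKey: env["MINIO_SECRET_KEY"] ?? "minioadmin",
    bucket: env["MINIO_BUCKET"] ?? "crawl-content",
    region: env["S3_REGION"] ?? "us-east-1",
    backstopDays: num("CONTENT_BACKSTOP_DAYS", 1),
  };
}
```

In a Node process this would typically be called as `loadConfig(process.env)`. Pointing at another S3-compatible store then only means setting MINIO_ENDPOINT, the two keys, and possibly S3_REGION and MINIO_BUCKET.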

Notes

Data placement: content → MinIO (gzip), metadata/pointer → Redis, RabbitMQ stays light. This way heavy HTML neither bloats Redis RAM nor strains the queue.
Multiple MCP instances can run because job status is shared in Redis.
Cleanup: content is deleted in sync with the job TTL (Redis expired-event → MinIO delete) + day-based lifecycle backstop + manual crawl_forget. See "Content cleanup".
Note: Redis expired events are not guaranteed to be delivered (they are missed if the listener is down or disconnected at that moment). The backstop exists for exactly this case.
Not yet present: retry, per-job timeout, dead-letter queue (DLQ), returning extracted text/markdown instead of raw content. These will be added when needed.

License

AGPL-3.0-only, see LICENSE. Use, modify, and self-host it freely. The one condition: if you offer this to others as a network service (hosted SaaS), you must also share the source of your changes under AGPL.

worker-rs statically links the GPL-3.0 licensed wreq-util crate, so it is already under copyleft. AGPL-3.0 is compatible with it and covers the whole project under one license.

Contributing

See CONTRIBUTING.md and, for architecture notes, docs/ARCHITECTURE.md.
