PDF OCR MCP Server

A production-oriented Model Context Protocol (MCP) server for working with OCR text extracted from PDF page images. The repository includes an OCR pipeline for generating page text files and an HTTP-based MCP server built with Express and the official MCP TypeScript SDK.

This project is designed for local corpora that have already been split into page images. It exposes that OCR corpus through MCP tools, resources, and prompts so MCP clients can search, inspect, and summarize page content efficiently.

Features

HTTP-based MCP server using Streamable HTTP and Express
OCR pipeline for converting page images into .txt files with Tesseract
Safe file access patterns to prevent path traversal
In-memory LRU-style cache for hot text pages
Fast file-list caching for repeated MCP calls
Session-aware MCP transport with graceful shutdown
Health and readiness endpoints for local ops and deployment checks
MCP tools, resources, and prompts for page discovery and retrieval
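
To make the "LRU-style cache" item concrete, here is a minimal sketch of how such a cache could work. The class name `LruCache` and its methods are hypothetical and are not taken from this repository. The sketch relies on the fact that a JavaScript `Map` iterates its keys in insertion order.

```typescript
// Hypothetical LRU-style cache; names are illustrative, not the project's API.
export class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert the entry so it becomes the most recently used one.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Reading a page moves it to the "recent" end of the map, so frequently requested pages stay cached while cold pages are evicted first.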

Repository Structure

.
|-- pages/                  # Source page images (.png)
|-- texts/                  # OCR output text files (.txt)
|-- src/
|   |-- scripts/
|   |   `-- index.ts        # OCR generation script
|   `-- tools/
|       |-- http-server.ts  # Express + HTTP MCP server entrypoint
|       |-- mcp-server.ts   # MCP tools/resources/prompts registration
|       |-- text-repository.ts
|       `-- tools.ts        # Compatibility entrypoint
|-- package.json
`-- tsconfig.json

Architecture

The codebase is split into three clean layers:

TextRepository: Handles safe file resolution, page listing, cached reads, range reads, and text search.
createTextMcpServer: Defines the MCP contract exposed to clients, including tools, resource templates, and prompts.
http-server: Hosts the MCP server over Express using Streamable HTTP transport, manages sessions, and exposes operational endpoints.
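
The "safe file resolution" responsibility can be sketched as follows. This is an illustration under assumptions, not the repository's actual code: the function name `resolveTextFile` and the allowed file-name pattern are hypothetical.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve a client-supplied file name inside the texts/
// directory and reject anything that could escape it.
export function resolveTextFile(baseDir: string, fileName: string): string {
  // Assumed policy: only plain .txt names, no directory separators.
  if (!/^[A-Za-z0-9._-]+\.txt$/.test(fileName)) {
    throw new Error(`Invalid file name: ${fileName}`);
  }
  const root = path.resolve(baseDir);
  const resolved = path.resolve(root, fileName);
  // Defense in depth: the resolved path must still sit under the root.
  if (!resolved.startsWith(root + path.sep)) {
    throw new Error(`Path escapes base directory: ${fileName}`);
  }
  return resolved;
}
```

With these rules, `resolveTextFile("texts", "page-1.txt")` returns an absolute path inside `texts/`, while names such as `../secret.txt` or `a/b.txt` are rejected before any file system access.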

Requirements

Node.js 18+
npm
Tesseract OCR installed and available on PATH

The OCR script currently invokes tesseract directly, so the binary must be accessible from your shell.

Installation

npm install

OCR Workflow

If your page images already exist in pages/, generate text files with:

npm run generate:textfiles

This will:

read .png files from pages/
sort them by page number
run Tesseract with English language data
write matching .txt files into texts/
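
The steps above can be sketched roughly as shown below. The helper names and the `page-12.png` naming scheme are assumptions made for illustration; the real script in `src/scripts/index.ts` may differ. The only external call is the standard `tesseract <image> <outputbase> -l eng` invocation.

```typescript
import { execFile } from "node:child_process";
import * as path from "node:path";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Assumption: the page number is the first run of digits in the file name.
export function pageNumber(fileName: string): number {
  const match = fileName.match(/\d+/);
  return match ? Number(match[0]) : Number.MAX_SAFE_INTEGER;
}

// Keep only .png files and order them numerically, so page-2 precedes page-10.
export function sortPageImages(fileNames: string[]): string[] {
  return fileNames
    .filter((name) => name.toLowerCase().endsWith(".png"))
    .sort((a, b) => pageNumber(a) - pageNumber(b) || a.localeCompare(b));
}

// Build the Tesseract arguments; Tesseract appends ".txt" to the output base.
export function tesseractArgs(pagesDir: string, textsDir: string, image: string): string[] {
  const outputBase = path.join(textsDir, path.basename(image, path.extname(image)));
  return [path.join(pagesDir, image), outputBase, "-l", "eng"];
}

// Run OCR for one page; requires the tesseract binary on PATH.
export async function ocrPage(pagesDir: string, textsDir: string, image: string): Promise<void> {
  await execFileAsync("tesseract", tesseractArgs(pagesDir, textsDir, image));
}
```

Sorting numerically rather than alphabetically matters because a plain string sort would place `page-10.png` before `page-2.png`.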

Running the MCP Server

Start the HTTP-based MCP server with:

npm run start

The server defaults to:

MCP endpoint: http://127.0.0.1:3000/mcp
health endpoint: http://127.0.0.1:3000/healthz
readiness endpoint: http://127.0.0.1:3000/readyz

Environment Variables

The server supports the following environment variables:

| Variable            | Default     | Description                                |
| ------------------- | ----------- | ------------------------------------------ |
| MCP_HOST            | 127.0.0.1   | Host interface to bind the server to       |
| MCP_PORT            | 3000        | Port used by the HTTP MCP server           |
| MCP_BODY_LIMIT      | 1mb         | Maximum JSON request body size             |
| MCP_ALLOWED_HOSTS   | empty       | Optional comma-separated host allowlist    |
| MCP_PRELOAD_CACHE   | true        | Preload text content into cache on startup |

Example:

MCP_HOST=127.0.0.1
MCP_PORT=3000
MCP_BODY_LIMIT=1mb
MCP_PRELOAD_CACHE=true
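
As a rough illustration of how these variables might be read, the sketch below maps them onto a typed config object with the documented defaults. The `ServerConfig` shape, the `loadConfig` name, the port validation, and the rule that only the literal string `false` disables preloading are all assumptions, not the server's confirmed behavior.

```typescript
// Hypothetical config shape; field names are illustrative.
export interface ServerConfig {
  host: string;
  port: number;
  bodyLimit: string;
  allowedHosts: string[];
  preloadCache: boolean;
}

// Pass process.env (or any plain object) as `env`.
export function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const port = Number(env.MCP_PORT ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid MCP_PORT: ${env.MCP_PORT}`);
  }
  return {
    host: env.MCP_HOST ?? "127.0.0.1",
    port,
    bodyLimit: env.MCP_BODY_LIMIT ?? "1mb",
    // Split the comma-separated allowlist and drop empty entries.
    allowedHosts: (env.MCP_ALLOWED_HOSTS ?? "")
      .split(",")
      .map((host) => host.trim())
      .filter((host) => host.length > 0),
    // Assumption: anything other than "false" keeps preloading enabled.
    preloadCache: (env.MCP_PRELOAD_CACHE ?? "true").toLowerCase() !== "false",
  };
}
```

Validating the port at startup surfaces a typo in `MCP_PORT` immediately instead of as a confusing bind error later.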

MCP Capabilities

Tools

list_text_pages: Lists available OCR text files with pagination support.
read_text_page: Reads a single OCR page by file name and can optionally truncate the response.
read_text_range: Reads a bounded character range from a page for more efficient retrieval.
search_text_pages: Searches the corpus and returns contextual snippets for matching pages.
get_corpus_stats: Returns repository and cache metrics for diagnostics and monitoring.
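
To give a feel for what the range and search tools do, here is a minimal sketch of a bounded range read and a substring search that returns a snippet around the first match on each page. The function names, the `SearchHit` shape, and the default context width are hypothetical; the actual tool inputs and outputs may differ.

```typescript
// Hypothetical result shape for one matching page.
export interface SearchHit {
  file: string;
  index: number;
  snippet: string;
}

// Bounded range read: clamp offset and length to the text boundaries.
export function readRange(text: string, offset: number, length: number): string {
  const start = Math.max(0, Math.min(offset, text.length));
  const end = Math.min(text.length, start + Math.max(0, length));
  return text.slice(start, end);
}

// Case-insensitive substring search; reports the first match per page with
// `context` characters on each side. Note: lowercasing can shift indices for
// a few non-ASCII characters, which this sketch ignores.
export function searchPages(
  pages: Array<{ file: string; text: string }>,
  query: string,
  context = 40,
): SearchHit[] {
  const needle = query.toLowerCase();
  const hits: SearchHit[] = [];
  if (needle.length === 0) return hits;
  for (const page of pages) {
    const index = page.text.toLowerCase().indexOf(needle);
    if (index === -1) continue;
    const start = Math.max(0, index - context);
    const snippet = page.text.slice(start, index + needle.length + context);
    hits.push({ file: page.file, index, snippet });
  }
  return hits;
}
```

A client would typically call the search first, then use the returned index with a range read to fetch only the surrounding text instead of the whole page.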

Resources

texts://page/{file}: Exposes OCR text pages as MCP resources through a dynamic resource template.
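
The mapping between a file name and its resource URI can be sketched like this. Both helper names are hypothetical, and percent-encoding the file name is an assumption about how the template is filled in.

```typescript
// Build a resource URI for a page file, e.g. texts://page/page-001.txt.
export function resourceUriForFile(file: string): string {
  return `texts://page/${encodeURIComponent(file)}`;
}

// Extract the file name from a resource URI, or return null if it does not
// match the texts://page/{file} template.
export function fileFromResourceUri(uri: string): string | null {
  const match = /^texts:\/\/page\/([^/?#]+)$/.exec(uri);
  if (!match) return null;
  try {
    // The decoded name should still pass through safe file resolution.
    return decodeURIComponent(match[1]);
  } catch {
    // Malformed percent-encoding.
    return null;
  }
}
```

Because a decoded name can contain characters such as `/`, the result should be treated as untrusted input and validated before it touches the file system.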

Prompts

summarize_text_page: Generates a reusable prompt for summarizing a specific OCR page, with optional focus text.

Operational Notes

The HTTP server includes several production-friendly behaviors:

per-request request IDs for logging
request duration logging
JSON parse error handling
Cache-Control: no-store on MCP endpoints
session tracking for stateful MCP transport
graceful cleanup on SIGINT and SIGTERM
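
Session tracking and graceful cleanup could be wired together along the following lines. This is only a sketch: `SessionRegistry`, `Closable`, and `installShutdownHandlers` are hypothetical names, and it assumes each tracked transport exposes an async `close()` method.

```typescript
import { randomUUID } from "node:crypto";

// Assumption: anything with an async close() can be tracked as a session.
export interface Closable {
  close(): Promise<void>;
}

export class SessionRegistry {
  private readonly sessions = new Map<string, Closable>();

  // Register a transport under a fresh session ID and return that ID.
  add(transport: Closable): string {
    const id = randomUUID();
    this.sessions.set(id, transport);
    return id;
  }

  get(id: string): Closable | undefined {
    return this.sessions.get(id);
  }

  remove(id: string): boolean {
    return this.sessions.delete(id);
  }

  get size(): number {
    return this.sessions.size;
  }

  // Close every open session; one failing close does not block the others.
  async closeAll(): Promise<void> {
    const open = Array.from(this.sessions.values());
    this.sessions.clear();
    await Promise.allSettled(open.map((session) => session.close()));
  }
}

// Close all sessions on SIGINT or SIGTERM, then exit.
export function installShutdownHandlers(registry: SessionRegistry): void {
  for (const signal of ["SIGINT", "SIGTERM"] as const) {
    process.once(signal, () => {
      void registry.closeAll().finally(() => process.exit(0));
    });
  }
}
```

Using `Promise.allSettled` means a transport that throws during close cannot prevent the remaining sessions from being cleaned up.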

Development

Type-check the project with:

npx tsc --noEmit

Suggested MCP Client Usage

This repository is a good fit for clients that need to:

search large OCR corpora before loading full pages
retrieve only relevant text ranges instead of full files
consume page text as MCP resources
build summarization or extraction workflows on top of OCR output

Limitations

OCR currently assumes English text via -l eng
OCR input is currently limited to .png page images in pages/
Search is simple substring matching, not full-text indexed search
Cache is in-memory and resets on process restart

Future Improvements

add structured metadata for page numbers, source PDFs, and sections
support multi-language OCR
add full-text indexing for faster corpus-wide search
add tests for repository and transport behavior
add Docker support and deployment examples

License

This repository is currently marked as ISC in package.json.
