MCP server for class 12 physics students.
PDF OCR MCP Server
A production-oriented Model Context Protocol (MCP) server for working with OCR text extracted from PDF page images. The repository includes an OCR pipeline for generating page text files and an HTTP-based MCP server built with Express and the official MCP TypeScript SDK.
This project is designed for local corpora that have already been split into page images. It exposes that OCR corpus through MCP tools, resources, and prompts so MCP clients can search, inspect, and summarize page content efficiently.
Features
- HTTP-based MCP server using Streamable HTTP and Express
- OCR pipeline for converting page images into .txt files with Tesseract
- Safe file access patterns to prevent path traversal
- In-memory LRU-style cache for hot text pages
- Fast file-list caching for repeated MCP calls
- Session-aware MCP transport with graceful shutdown
- Health and readiness endpoints for local ops and deployment checks
- MCP tools, resources, and prompts for page discovery and retrieval
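The "LRU-style" cache for hot pages can be pictured with the sketch below. It is not the repository's implementation; the class name `LruCache` and its methods are hypothetical, and it assumes eviction by entry count rather than by byte size.

```typescript
// Minimal LRU cache sketch (hypothetical, not the repository's code).
// A Map preserves insertion order, so its first key is always the
// least recently used entry.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    if (maxEntries < 1) throw new Error("maxEntries must be at least 1");
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Evict the oldest entry (first key in iteration order).
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A page read would check the cache first and fall back to disk on a miss, storing the result with `set` so that repeated MCP calls for the same page avoid file I/O.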
Repository Structure
.
|-- pages/ # Source page images (.png)
|-- texts/ # OCR output text files (.txt)
|-- src/
| |-- scripts/
| | `-- index.ts # OCR generation script
| `-- tools/
| |-- http-server.ts # Express + HTTP MCP server entrypoint
| |-- mcp-server.ts # MCP tools/resources/prompts registration
| |-- text-repository.ts
| `-- tools.ts # Compatibility entrypoint
|-- package.json
`-- tsconfig.json
Architecture
The codebase is split into three clean layers:
- TextRepository: Handles safe file resolution, page listing, cached reads, range reads, and text search.
- createTextMcpServer: Defines the MCP contract exposed to clients, including tools, resource templates, and prompts.
- http-server: Hosts the MCP server over Express using Streamable HTTP transport, manages sessions, and exposes operational endpoints.
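As an illustration of the safe file resolution handled by the repository layer, here is a minimal sketch of one common approach: allowlisting bare file names before they are joined to the text directory. The helper names `isSafeTextFileName` and `resolveTextPath` are hypothetical, and the actual TextRepository may use a different check (for example, resolving the path and verifying that it stays inside texts/).

```typescript
// Hypothetical validator: accepts only bare .txt file names that start with
// a letter or digit and contain no path separators.
const SAFE_TEXT_FILE = /^[A-Za-z0-9][A-Za-z0-9._-]*\.txt$/;

function isSafeTextFileName(name: string): boolean {
  // Reject separators and parent-directory segments before the pattern check.
  if (name.includes("/") || name.includes("\\") || name.includes("..")) {
    return false;
  }
  return SAFE_TEXT_FILE.test(name);
}

function resolveTextPath(baseDir: string, name: string): string {
  if (!isSafeTextFileName(name)) {
    throw new Error(`Rejected unsafe file name: ${name}`);
  }
  // Strip trailing slashes from the base directory, then join.
  return `${baseDir.replace(/\/+$/, "")}/${name}`;
}
```

With this check, a request for `../package.json` is rejected before any file system call is made, while `page-001.txt` resolves to `texts/page-001.txt`.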
Requirements
- Node.js 18+
- npm
- Tesseract OCR installed and available on PATH
The OCR script currently invokes tesseract directly, so the binary must be accessible from your shell.
Installation
npm install
OCR Workflow
If your page images already exist in pages/, generate text files with:
npm run generate:textfiles
This will:
- read .png files from pages/
- sort them by page number
- run Tesseract with English language data
- write matching .txt files into texts/
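The steps above can be sketched as two pure helpers. This is an illustration only, not the script in src/scripts/index.ts: the function names are hypothetical, and it assumes file names such as `page-12.png` in which the first run of digits is the page number.

```typescript
// Extract the first run of digits from a file name, e.g. "page-12.png" -> 12.
// Files without digits sort last (assumption made for this sketch).
function pageNumber(fileName: string): number {
  const match = fileName.match(/\d+/);
  return match ? Number(match[0]) : Number.POSITIVE_INFINITY;
}

// Keep only .png files and order them numerically, so page 2 precedes page 10.
function sortPages(fileNames: string[]): string[] {
  return fileNames
    .filter((name) => name.toLowerCase().endsWith(".png"))
    .sort((a, b) => pageNumber(a) - pageNumber(b) || a.localeCompare(b));
}

// Build the argument list for: tesseract <image> <outputbase> -l eng
// Tesseract appends ".txt" to the output base on its own.
function tesseractArgs(fileName: string): string[] {
  const base = fileName.replace(/\.png$/i, "");
  return [`pages/${fileName}`, `texts/${base}`, "-l", "eng"];
}
```

Numeric sorting matters because a plain lexicographic sort would place `page-10.png` before `page-2.png`. The argument list could then be passed to a process spawner such as Node's `execFile("tesseract", args)`.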
Running the MCP Server
Start the HTTP-based MCP server with:
npm run start
The server defaults to:
- MCP endpoint: http://127.0.0.1:3000/mcp
- health endpoint: http://127.0.0.1:3000/healthz
- readiness endpoint: http://127.0.0.1:3000/readyz
Environment Variables
The server supports the following environment variables:
| Variable | Default | Description |
| ------------------- | ----------- | ------------------------------------------ |
| MCP_HOST | 127.0.0.1 | Host interface to bind the server to |
| MCP_PORT | 3000 | Port used by the HTTP MCP server |
| MCP_BODY_LIMIT | 1mb | Maximum JSON request body size |
| MCP_ALLOWED_HOSTS | empty | Optional comma-separated host allowlist |
| MCP_PRELOAD_CACHE | true | Preload text content into cache on startup |
Example:
MCP_HOST=127.0.0.1
MCP_PORT=3000
MCP_BODY_LIMIT=1mb
MCP_PRELOAD_CACHE=true
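For readers who want to see how these variables might be turned into a typed configuration, here is a minimal sketch. The `ServerConfig` shape and `loadConfig` function are hypothetical; the defaults mirror the table above, while the validation rules (port range, treating any value other than "false" as true) are assumptions.

```typescript
interface ServerConfig {
  host: string;
  port: number;
  bodyLimit: string;
  allowedHosts: string[];
  preloadCache: boolean;
}

// Hypothetical loader: takes the environment as a plain object so it can be
// exercised without touching the real process environment.
function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const rawPort = env["MCP_PORT"] ?? "3000";
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid MCP_PORT: ${rawPort}`);
  }
  return {
    host: env["MCP_HOST"] ?? "127.0.0.1",
    port,
    bodyLimit: env["MCP_BODY_LIMIT"] ?? "1mb",
    // An empty or missing allowlist yields an empty array.
    allowedHosts: (env["MCP_ALLOWED_HOSTS"] ?? "")
      .split(",")
      .map((host) => host.trim())
      .filter((host) => host.length > 0),
    // Assumption: any value other than "false" keeps preloading enabled.
    preloadCache: (env["MCP_PRELOAD_CACHE"] ?? "true").toLowerCase() !== "false",
  };
}
```

In a Node.js entrypoint this would typically be called once at startup as `loadConfig(process.env)`.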
MCP Capabilities
Tools
- list_text_pages: Lists available OCR text files with pagination support.
- read_text_page: Reads a single OCR page by file name and can optionally truncate the response.
- read_text_range: Reads a bounded character range from a page for more efficient retrieval.
- search_text_pages: Searches the corpus and returns contextual snippets for matching pages.
- get_corpus_stats: Returns repository and cache metrics for diagnostics and monitoring.
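To make the behavior of the search and range tools more concrete, the sketch below shows substring search with contextual snippets and a clamped range read over an in-memory map of page text. The function names, the `SearchHit` shape, and the default context width are hypothetical; the real tools may return different fields, and this sketch reports only the first match per page.

```typescript
interface SearchHit {
  file: string;
  index: number;   // character offset of the first match
  snippet: string; // matched text plus surrounding context
}

// Case-insensitive substring search. Offsets are computed on the lowercased
// text, which lines up with the original for typical English OCR output.
function searchPages(
  pages: Map<string, string>,
  query: string,
  context: number = 40,
): SearchHit[] {
  const needle = query.toLowerCase();
  const hits: SearchHit[] = [];
  if (needle.length === 0) return hits;
  pages.forEach((text, file) => {
    const index = text.toLowerCase().indexOf(needle);
    if (index === -1) return;
    const start = Math.max(0, index - context);
    const end = Math.min(text.length, index + needle.length + context);
    hits.push({ file, index, snippet: text.slice(start, end) });
  });
  return hits;
}

// Read at most `length` characters starting at `start`, clamped to the text.
function readRange(text: string, start: number, length: number): string {
  const from = Math.min(Math.max(0, start), text.length);
  const to = Math.min(text.length, from + Math.max(0, length));
  return text.slice(from, to);
}
```

A client could combine the two: search first, then use the returned `index` with a range read to pull only the passage around the match instead of the whole page.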
Resources
- texts://page/{file}: Exposes OCR text pages as MCP resources through a dynamic resource template.
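As a rough illustration of how the `texts://page/{file}` template maps between URIs and file names, consider the helpers below. They are hypothetical and not part of the server; in particular, percent-encoding the file name is an assumption about how clients and the server agree on special characters.

```typescript
const PAGE_URI_PREFIX = "texts://page/";

// Build a resource URI for a page file name.
function pageUri(file: string): string {
  return PAGE_URI_PREFIX + encodeURIComponent(file);
}

// Recover the file name from a resource URI, or undefined if the URI does
// not match the template or contains extra path segments.
function fileFromPageUri(uri: string): string | undefined {
  if (!uri.startsWith(PAGE_URI_PREFIX)) return undefined;
  const encoded = uri.slice(PAGE_URI_PREFIX.length);
  if (encoded.length === 0 || encoded.includes("/")) return undefined;
  try {
    return decodeURIComponent(encoded);
  } catch {
    // Malformed percent-encoding.
    return undefined;
  }
}
```

The decoded name would still need to pass the server's own file name validation before any read takes place.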
Prompts
- summarize_text_page: Generates a reusable prompt for summarizing a specific OCR page, with optional focus text.
Operational Notes
The HTTP server includes several production-friendly behaviors:
- per-request IDs for logging
- request duration logging
- JSON parse error handling
- Cache-Control: no-store on MCP endpoints
- session tracking for stateful MCP transport
- graceful cleanup on SIGINT and SIGTERM
Development
Type-check the project with:
npx tsc --noEmit
Suggested MCP Client Usage
This repository is a good fit for clients that need to:
- search large OCR corpora before loading full pages
- retrieve only relevant text ranges instead of full files
- consume page text as MCP resources
- build summarization or extraction workflows on top of OCR output
Limitations
- OCR currently assumes English text via -l eng
- OCR input is currently limited to .png page images in pages/
- Search is simple substring matching, not full-text indexed search
- Cache is in-memory and resets on process restart
Future Improvements
- add structured metadata for page numbers, source PDFs, and sections
- support multi-language OCR
- add full-text indexing for faster corpus-wide search
- add tests for repository and transport behavior
- add Docker support and deployment examples
License
This repository is currently marked as ISC in package.json.