paperbase-mcp

A research-grade MCP server for academic literature. Composes arXiv, Semantic Scholar, OpenAlex, and (optionally) unarXive into a single tool surface for finding papers, walking citation graphs, and generating BibTeX — without leaving your chat.

No PDF downloads, no local PDF cache, no full-text search. If you want those, see arxiv-mcp-server. If you want metadata, relations, and venue-correct BibTeX, you're in the right place.

What it looks like in Claude Desktop

You: Give me the 25 most-cited descendants of arXiv 1706.03762, dedup,
     drop self-citations, with BibTeX.

Claude: [calls related_work with identifier="1706.03762", limit=25,
        include_self_citations=false]

The top descendants of "Attention Is All You Need" are:

 1. BERT (Devlin et al., 2018) — 80k citations, NAACL
 2. GPT-3 (Brown et al., 2020) — 30k citations, NeurIPS
 3. T5 (Raffel et al., 2020) — 15k citations, JMLR
 …

BibTeX for all 25 below:

@inproceedings{devlin2018bert,
  title = {BERT: Pre-training of Deep Bidirectional Transformers for
           Language Understanding},
  author = {Jacob Devlin and ...},
  booktitle = {NAACL},
  year = {2018},
  …
}
…

Drop into your own .bib and keep working.

Install

Add the server to your MCP client config:

{
  "mcpServers": {
    "paperbase": {
      "command": "uvx",
      "args": ["paperbase-mcp"],
      "env": { "PAPERBASE_MAILTO": "you@example.com" }
    }
  }
}

PAPERBASE_MAILTO is required: the address goes into the User-Agent header and into OpenAlex's polite-pool query parameter. Snippets for Claude Desktop, Cursor, Codex, and Gemini CLI are in docs/install.md.
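To make the two uses of the address concrete, here is a minimal sketch of how a client might attach it. The helper names and the exact User-Agent format are assumptions for illustration, not the server's actual code. The only upstream detail relied on is that OpenAlex accepts a `mailto` query parameter for its polite pool.

```python
from urllib.parse import urlencode

OPENALEX_BASE = "https://api.openalex.org"


def user_agent(mailto: str) -> str:
    # Hypothetical format: the README only says the address appears in the header.
    return f"paperbase-mcp (mailto:{mailto})"


def openalex_url(path: str, params: dict[str, str], mailto: str) -> str:
    # OpenAlex routes requests carrying a mailto parameter to its polite pool.
    query = urlencode({**params, "mailto": mailto})
    return f"{OPENALEX_BASE}/{path.lstrip('/')}?{query}"


url = openalex_url("works", {"search": "attention"}, "you@example.com")
headers = {"User-Agent": user_agent("you@example.com")}
```

Note that `urlencode` percent-encodes the `@`, so the address arrives as `you%40example.com` in the query string.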

Tools

| Tool | What it does |
|---|---|
| search_papers | Keyword search across S2 (with arXiv fallback). ranking is "relevance" or "date". |
| find_paper | Resolve an arXiv ID / DOI / S2 id / URL / title to one Paper with field-level provenance. |
| related_work | References + citations, deduped, ranked, with venue-correct BibTeX attached. |
| citation_graph | Bounded walk over cites / cited-by, with truncated flag when the cap is hit. |
| author_papers | Recent papers + venues for a (disambiguated) author. |
| paper_sections | Parsed sections via unarXive; abstract-only if not configured. |
| bibtex_for | BibTeX with collision-free citation keys; published venue (@article / @inproceedings) when DOI known, else arXiv preprint (@misc). |
| compare_papers | Side-by-side facets for 2–5 papers. Regex-based extraction. |

Every tool returns the same envelope:

{ok, data?, error_kind, error_message, retry_after?,
 upstreams_used, partial, provenance}

Errors never throw across the MCP boundary. partial=true is the soft-failure signal (one upstream was down, we returned what we could).
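Because failures arrive as data rather than exceptions, a caller has to branch on the envelope itself. The sketch below shows one way to do that. The field types, the `triage` helper, and the reading of `retry_after` as a hint that the failure is transient are all assumptions; only the field names come from the envelope above.

```python
from typing import Any, Optional, TypedDict


class Envelope(TypedDict, total=False):
    # Field names follow the envelope above; the types are assumed.
    ok: bool
    data: Any
    error_kind: Optional[str]
    error_message: Optional[str]
    retry_after: Optional[float]
    upstreams_used: list[str]
    partial: bool
    provenance: dict[str, Any]


def triage(envelope: Envelope) -> str:
    """Classify a tool result as 'ok', 'partial', 'retry', or 'failed'."""
    if not envelope.get("ok", False):
        # Assumption: a retry_after value marks a transient failure.
        return "retry" if envelope.get("retry_after") is not None else "failed"
    # ok with partial=true: usable data, but at least one upstream was down.
    return "partial" if envelope.get("partial", False) else "ok"
```

An agent might use the 'partial' outcome to tell the user which sources are missing, using `upstreams_used`, rather than treating the result as complete.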

Recipes

Worked examples — written as the workflow, not the API:

related-work synthesis from an arXiv ID
most-cited descendants of a paper
author venue trends over time
setting up a local unarXive index (enables paper_sections beyond abstract-only)

Why not just curl the APIs

Four of the eight tools really are wrappers; we're honest about that. The four that aren't:

find_paper runs arXiv ‖ S2 in parallel, merges with primary-wins precedence, and emits field-level provenance — so you know whether abstract came from arXiv or S2.
related_work dedupes the same paper across s2_paper_id / arxiv_id / doi, drops self-citations by author overlap, ranks by log1p(citations) * exp(-age/10) with a 1.5× bump for S2's isInfluential, and attaches venue-correct BibTeX.
bibtex_for prefers @inproceedings / @article over @misc when a venue is known, generates collision-free keys, and resolves duplicates across the input list.
citation_graph is a bounded BFS with truncation signalling, so agents can adapt when they hit the cap.
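The related_work ranking formula is compact enough to sketch directly. This version assumes that age is measured in years and that the 1.5× bump multiplies the whole score. The README states neither, and the function and field names are hypothetical.

```python
import math


def rank_score(citations: int, age_years: float, is_influential: bool = False) -> float:
    # log1p damps very large citation counts; exp(-age/10) decays older papers.
    score = math.log1p(citations) * math.exp(-age_years / 10)
    # Assumed: the isInfluential bump is a multiplier on the final score.
    return score * 1.5 if is_influential else score


candidates = [
    {"title": "A", "citations": 1000, "age_years": 8.0, "is_influential": False},
    {"title": "B", "citations": 100, "age_years": 1.0, "is_influential": True},
    {"title": "C", "citations": 100, "age_years": 1.0, "is_influential": False},
]

ranked = sorted(
    candidates,
    key=lambda p: rank_score(p["citations"], p["age_years"], p["is_influential"]),
    reverse=True,
)
```

Under these assumptions a recent paper with 100 citations outranks an eight-year-old paper with 1,000, which is the point of the age decay: the list favours work that is still accumulating attention.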

You can hand-write any of these in an afternoon. The server gives you that afternoon back, and it does so the same way every time.

Behaviour notes

Per-upstream token bucket (arXiv 1 req / 3 s, S2 1 RPS, OpenAlex 8 RPS). On a 429 the server backs off, honouring Retry-After.
Per-upstream circuit breaker — 3 failures opens, 60-second cool-down, one half-open probe.
SQLite cache at ~/.cache/paperbase-mcp/cache.sqlite (XDG / Library / AppData via platformdirs). 2xx and 404 cached, 5xx / 429 never.
Cache TTLs: search 6h, find 7d, related/graph/author 24h, sections 7d, bibtex 30d, not-found 1h. Override any with PAPERBASE_TTL_<NAME>.
Missing fields stay null. The server never imputes.
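The circuit-breaker note above maps onto a small state machine. The following is a minimal sketch, not the server's implementation. It assumes the three failures are consecutive and that a failed half-open probe restarts the cool-down. The class and method names are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        if self._clock() - self._opened_at < self.cooldown_s:
            return False  # open: still cooling down
        if self._probe_in_flight:
            return False  # half-open: only one probe at a time
        self._probe_in_flight = True
        return True  # half-open: let a single probe through

    def record_success(self) -> None:
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._failures += 1
        # Assumed: a failed probe re-opens immediately and restarts the cool-down.
        if self._probe_in_flight or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
            self._probe_in_flight = False
```

One breaker instance per upstream keeps an arXiv outage from blocking S2 or OpenAlex calls, which is what lets a tool return partial=true instead of failing outright.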

Development

pip install -e ".[dev]"
make ci    # ruff + mypy --strict + pytest

Fixture-backed tests via respx; no live upstream calls in the default suite. CI matrix is Python 3.11 / 3.12 / 3.13 × Linux / macOS / Windows.

Read docs/00-brief.md for the design rationale and docs/01-architecture.md for the layered internals. The reviewer-lens docs at docs/03-review-*.md are the honest critique passes (MCP power user, ML researcher, upstream maintainer, HN skeptic).

License

Apache-2.0.
