paperbase-mcp
A research-grade MCP server for academic literature. Composes arXiv, Semantic Scholar, OpenAlex, and (optionally) unarXive into a single tool surface for finding papers, walking citation graphs, and generating BibTeX — without leaving your chat.
No PDF downloads, no local PDF cache, no full-text search. If you want
those, see arxiv-mcp-server.
If you want metadata, relations, and venue-correct BibTeX, you're in the
right place.
What it looks like in Claude Desktop
You: Give me the 25 most-cited descendants of arXiv 1706.03762, dedup,
drop self-citations, with BibTeX.
Claude: [calls related_work with identifier="1706.03762", limit=25,
include_self_citations=false]
The top descendants of "Attention Is All You Need" are:
1. BERT (Devlin et al., 2018) — 80k citations, NAACL
2. GPT-3 (Brown et al., 2020) — 30k citations, NeurIPS
3. T5 (Raffel et al., 2020) — 15k citations, JMLR
…
BibTeX for all 25 below:
@inproceedings{devlin2018bert,
title = {BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding},
author = {Jacob Devlin and ...},
booktitle = {NAACL},
year = {2018},
…
}
…
Drop into your own .bib and keep working.
Install
Add the server to your MCP client config:
{
  "mcpServers": {
    "paperbase": {
      "command": "uvx",
      "args": ["paperbase-mcp"],
      "env": { "PAPERBASE_MAILTO": "you@example.com" }
    }
  }
}
The mailto is required: it goes into the User-Agent and into OpenAlex's polite-pool query param. Snippets for Claude Desktop, Cursor, Codex, and Gemini CLI are in docs/install.md.
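To make the two uses of the mailto concrete, here is a minimal sketch of how a contact address could be attached to an outgoing request. The `mailto` query parameter is what OpenAlex documents for its polite pool; the function name `build_polite_request` and the exact User-Agent format are assumptions for illustration, not the server's actual code.

```python
from urllib.parse import urlencode


def build_polite_request(
    base_url: str, params: dict[str, str], mailto: str
) -> tuple[str, dict[str, str]]:
    """Return (url, headers) with the contact address attached twice.

    Hypothetical helper: the real server may format both values differently.
    """
    if not mailto or "@" not in mailto:
        raise ValueError("PAPERBASE_MAILTO must be a contact email address")
    # OpenAlex reads the address from the `mailto` query parameter.
    query = urlencode({**params, "mailto": mailto})
    # Assumed User-Agent layout; only the presence of the address matters here.
    headers = {"User-Agent": f"paperbase-mcp (mailto:{mailto})"}
    return f"{base_url}?{query}", headers


url, headers = build_polite_request(
    "https://api.openalex.org/works", {"search": "attention"}, "you@example.com"
)
print(url)
print(headers["User-Agent"])
```

Failing fast on a missing address mirrors the "required" wording above: a request without it would fall outside the polite pool.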
Tools
| Tool | What it does |
|---|---|
| search_papers | Keyword search across S2 (with arXiv fallback). ranking is "relevance" or "date". |
| find_paper | Resolve an arXiv ID / DOI / S2 id / URL / title to one Paper with field-level provenance. |
| related_work | References + citations, deduped, ranked, with venue-correct BibTeX attached. |
| citation_graph | Bounded walk over cites / cited-by, with a truncated flag set when the cap is hit. |
| author_papers | Recent papers + venues for a (disambiguated) author. |
| paper_sections | Parsed sections via unarXive; abstract-only if not configured. |
| bibtex_for | BibTeX with collision-free citation keys; published venue (@article / @inproceedings) when DOI known, else arXiv preprint (@misc). |
| compare_papers | Side-by-side facets for 2–5 papers. Regex-based extraction. |
Every tool returns the same envelope:
{ok, data?, error_kind, error_message, retry_after?,
upstreams_used, partial, provenance}
Errors never throw across the MCP boundary. partial=true is the
soft-failure signal (one upstream was down, we returned what we could).
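On the client side, handling the envelope might look like the sketch below. The field names come from the envelope above; the value types (a list of upstream names, seconds for retry_after) and the sample error kind are assumptions, and `summarize` is a hypothetical helper.

```python
from typing import Any


def summarize(envelope: dict[str, Any]) -> str:
    """Turn a tool envelope into a one-line status without raising."""
    if not envelope.get("ok"):
        msg = f"failed ({envelope.get('error_kind')}): {envelope.get('error_message')}"
        retry = envelope.get("retry_after")
        if retry is not None:
            # Assumed unit: seconds.
            msg += f"; retry after {retry}s"
        return msg
    # partial=True means at least one upstream was down but data is usable.
    status = "partial" if envelope.get("partial") else "complete"
    used = ", ".join(envelope.get("upstreams_used", []))
    return f"{status} result from {used}"


print(summarize({"ok": True, "data": {}, "partial": True, "upstreams_used": ["arxiv"]}))
print(summarize({"ok": False, "error_kind": "rate_limited",
                 "error_message": "slow down", "retry_after": 3}))
```

Because errors arrive as data, the caller branches on `ok` and `partial` instead of wrapping every call in exception handling.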
Recipes
Worked examples — written as the workflow, not the API:
- related-work synthesis from an arXiv ID
- most-cited descendants of a paper
- author venue trends over time
- setting up a local unarXive index (enables paper_sections beyond abstract-only)
Why not just curl the APIs
Four of the eight tools really are wrappers; we're honest about that. The four that aren't:
- find_paper runs arXiv ‖ S2 in parallel, merges with primary-wins precedence, and emits field-level provenance, so you know whether abstract came from arXiv or S2.
- related_work dedupes the same paper across s2_paper_id / arxiv_id / doi, drops self-citations by author overlap, ranks by log1p(citations) * exp(-age/10) with a 1.5× bump for S2's isInfluential, and attaches venue-correct BibTeX.
- bibtex_for prefers @inproceedings / @article over @misc when a venue is known, generates collision-free keys, and resolves duplicates across the input list.
- citation_graph is a bounded BFS with truncation signalling, so agents can adapt when they hit the cap.
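The related_work ranking formula can be written down directly. This is a sketch, assuming that age is measured in years and that the 1.5× bump is a plain multiplier on the score; neither detail is spelled out above, and `rank_score` is a hypothetical name.

```python
import math


def rank_score(citations: int, age_years: float, is_influential: bool = False) -> float:
    """log1p(citations) * exp(-age/10), times 1.5 for influential citations."""
    # log1p damps the gap between heavily and moderately cited papers;
    # the exponential decays the score with age (assumed unit: years).
    score = math.log1p(citations) * math.exp(-age_years / 10)
    # Assumed: S2's isInfluential flag multiplies the score by 1.5.
    return score * 1.5 if is_influential else score


papers = [
    ("old, heavily cited", rank_score(50_000, age_years=20)),
    ("recent, moderately cited", rank_score(500, age_years=1)),
    ("recent and influential", rank_score(500, age_years=1, is_influential=True)),
]
for label, score in sorted(papers, key=lambda p: p[1], reverse=True):
    print(f"{score:6.3f}  {label}")
```

Under these assumptions, a recent paper with a few hundred citations can outrank a much older paper with tens of thousands, which is the point of the age decay.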
You can hand-write any of these in an afternoon. The server gives you that afternoon back, and it does so the same way every time.
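As one example of what that afternoon covers, here is a minimal sketch of a bounded BFS with a truncation flag, in the spirit of citation_graph. It assumes the cap is a maximum node count and that neighbours come from a caller-supplied function; the name `bounded_bfs` and its return shape are illustrative, not the server's API.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_bfs(
    start: str,
    neighbors: Callable[[str], Iterable[str]],
    max_nodes: int,
) -> tuple[list[str], list[tuple[str, str]], bool]:
    """Breadth-first walk that stops admitting nodes once max_nodes is reached.

    Returns (nodes in visit order, edges between admitted nodes, truncated).
    """
    seen = {start}
    order = [start]
    edges: list[tuple[str, str]] = []
    queue = deque([start])
    truncated = False
    while queue:
        node = queue.popleft()
        for nxt in neighbors(node):
            if nxt not in seen:
                if len(seen) >= max_nodes:
                    # Cap hit: signal it instead of silently dropping nodes.
                    truncated = True
                    break
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
            edges.append((node, nxt))
    return order, edges, truncated


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bounded_bfs("a", lambda n: graph.get(n, []), max_nodes=10))
print(bounded_bfs("a", lambda n: graph.get(n, []), max_nodes=2))
```

The flag lets an agent tell "this is the whole neighbourhood" apart from "this is what fit under the cap", and narrow or repeat the query accordingly.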
Behaviour notes
- Per-upstream token bucket (arXiv 1 req / 3 s, S2 1 RPS, OpenAlex 8 RPS). Backs off on 429, honouring Retry-After.
- Per-upstream circuit breaker: 3 failures opens, 60-second cool-down, one half-open probe.
- SQLite cache at ~/.cache/paperbase-mcp/cache.sqlite (XDG / Library / AppData via platformdirs). 2xx and 404 cached, 5xx / 429 never.
- Cache TTLs: search 6h, find 7d, related/graph/author 24h, sections 7d, bibtex 30d, not-found 1h. Override any with PAPERBASE_TTL_<NAME>.
- Missing fields stay null. The server never imputes.
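The circuit-breaker behaviour listed above (3 failures opens, 60-second cool-down, one half-open probe) can be sketched as a small state machine. This is an illustration, not the server's code: it assumes the failures are consecutive, that a failed probe restarts the cool-down, and the class and method names are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> one probe after cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    def allow(self) -> bool:
        """Return True if a request may be sent to the upstream right now."""
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        if self._clock() - self._opened_at < self._cooldown_s:
            return False  # open: still cooling down
        if self._probe_in_flight:
            return False  # half-open: only one probe at a time
        self._probe_in_flight = True
        return True

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._probe_in_flight = False
        self._failures += 1
        if self._opened_at is not None or self._failures >= self._threshold:
            # Open the breaker, or re-open it after a failed probe.
            self._opened_at = self._clock()
```

In use, a caller would check `allow()` before each upstream request and report the outcome with `record_success()` or `record_failure()`; one breaker instance per upstream keeps an outage at one API from blocking the others.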
Development
pip install -e ".[dev]"
make ci # ruff + mypy --strict + pytest
Fixture-backed tests via respx; no live upstream calls in the default suite. CI matrix is Python 3.11 / 3.12 / 3.13 × Linux / macOS / Windows.
Read docs/00-brief.md for the design rationale and
docs/01-architecture.md for the layered
internals. The reviewer-lens docs at docs/03-review-*.md are the
honest critique passes (MCP power user, ML researcher, upstream
maintainer, HN skeptic).
License
Apache-2.0.