MCP Servers

A comprehensive directory of Model Context Protocol servers, frameworks, SDKs, and templates.

Tool Attention MCP
By @tanhaien

Drop-in MCP middleware that eliminates the MCP Tax — 98.6% reduction in per-turn schema tokens via dynamic Tool Attention gating and lazy schema loading. Paper: arXiv:2604.21816

Created 5/6/2026
Updated about 3 hours ago
Repository documentation and setup instructions

Tool Attention MCP 🧰✨


Your AI agent shouldn't blow its entire context window just to know what tools exist.

Tool Attention MCP is a lightweight middleware server that solves a simple but expensive problem: the MCP Tax — where every single message to an AI agent carries the full descriptions of all available tools, even when only 2-3 are relevant.

Based on the Tool Attention paper (Sadani, 2026). Works with any MCP-compatible agent — OpenClaw, Claude Desktop, VS Code, or your own client.


The Problem: Why Your Agent Is Wasting Tokens

Here's what happens in a typical MCP setup:

You: "Hey, can you read the latest sales report?"

Your agent's brain (the prompt sent to the LLM):

Messages (your query) ...... 500 tokens
+ Tools: read_file ........ 1,200 tokens (full JSON schema)
+ Tools: send_email ....... 1,300 tokens (full JSON schema)
+ Tools: query_database ... 1,400 tokens (full JSON schema)
+ Tools: create_pdf ....... 1,200 tokens (full JSON schema)
+ Tools: deploy_server .... 1,500 tokens (full JSON schema)
+ ... 35 more tools ....... 50,000+ tokens
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total: ~57,000 tokens  ← 99% of it completely irrelevant

The agent doesn't need deploy_server's 10-parameter JSON schema just to read a file. But the MCP protocol sends everything, every time, because the APIs are stateless and can't remember what was sent before.

This causes three real problems:

  1. 💸 Costs 5-10x more — you're paying for unused tool descriptions every turn
  2. 🧠 Worse reasoning — the model gets buried in noise, context fills up, and it starts hallucinating tool names and parameters
  3. 🔓 Security risk — malicious tool descriptions can hijack the model's attention ("Tool Poisoning")

The Solution: Only Load What You Need

Tool Attention replaces "shove everything in" with "figure out what's relevant, then load it."

┌─────────────────────────────────────────────┐
│  Your Query: "Read the sales report"        │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│  1. Quick Scan (Intent-Schema Overlap)      │
│     → "read_file" scores 0.92               │
│     → "send_email" scores 0.15              │
│     → "deploy_server" scores 0.02           │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│  2. Only useful tools make the cut          │
│     ✓ read_file (score > threshold)        │
│     ✗ send_email, query_database, ...      │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│  3. Send compact prompt to the LLM          │
│     Messages .............. 500 tokens      │
│     All tool summaries .... 700 tokens      │
│     Full schema: read_file  800 tokens      │
│     ─────────────────────────────────       │
│     Total: ~2,000 tokens                    │
│     (vs 57,000 before)                      │
└─────────────────────────────────────────────┘

Result: 98.6% fewer tokens per turn.


How It Works (In Plain Terms)

1. Quick Scan (ISO Score)

Every tool has a short summary (~40 words). When you ask a question, Tool Attention converts both your query and each tool summary into "embeddings" — think of them as fingerprints in a 384-dimensional space. Tools that "smell" similar to your query score higher.

*Example:* You say "list files" → read_file scores 0.92, deploy_server scores 0.02.
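
A minimal sketch of that scan, using the sentence-transformers package from the install step below (the tool summaries here are made up, and the server's actual scoring may differ):

from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 produces the 384-dimensional "fingerprints" described above
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

summaries = {
    "read_file": "Read the contents of a file from disk.",
    "deploy_server": "Deploy a service to a production environment.",
}

query_vec = model.encode("list files", normalize_embeddings=True)
for tool_id, text in summaries.items():
    vec = model.encode(text, normalize_embeddings=True)
    # cosine similarity between query and summary embeddings
    print(tool_id, round(util.cos_sim(query_vec, vec).item(), 2))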

2. State Check (Gating)

Even if a tool is relevant, it might not be available right now (needs auth, wrong workflow stage). The gating function checks these preconditions.
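
For illustration only (the fields requires_auth and stage, and the session dict, are invented for this sketch, not part of the server's API):

def gate(tool: dict, session: dict) -> bool:
    """Return True only if the tool's preconditions hold right now."""
    if tool.get("requires_auth") and not session.get("authenticated"):
        return False
    if tool.get("stage") and tool["stage"] != session.get("stage"):
        return False
    return True

# A gated tool keeps its ISO score; an unavailable one drops out:
# final_score = iso_score if gate(tool, session) else 0.0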

3. Smart Loading (Two-Phase)

  • Phase 1 (always visible): Just the short summaries of all tools — this stays in context permanently (and gets cached after the first turn, so it costs almost nothing).
  • Phase 2 (on-demand): Only the full JSON schemas for the top-ranked tools (usually 3-5).

The agent can still see all tools it has via Phase 1, but only loads the ones it needs.
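
Roughly, the prompt assembly works like this (an illustrative sketch, not the server's actual code):

def build_tool_context(tools: list[dict], ranked_ids: list[str], top_k: int = 5):
    # Phase 1: one-line summaries for every tool. Small, and cacheable
    # by the LLM provider after the first turn.
    phase1 = "\n".join(f"- {t['id']}: {t['summary']}" for t in tools)
    # Phase 2: full JSON schemas, but only for the top-ranked tools.
    by_id = {t["id"]: t for t in tools}
    phase2 = [by_id[i]["schema"] for i in ranked_ids[:top_k]]
    return phase1, phase2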


The Numbers 📊

| Method | Tokens/Turn | Reduction |
|---|---:|---:|
| ❌ Full schemas for 120 tools | 57,452 | 0% |
| 😕 Just grab top-k by keyword | 5,390 | 90.6% |
| ✅ Tool Attention (everyday use) | 787 | 98.6% |
| ✅ Tool Attention (first message only) | 5,459 | 90.5% |

Phase 1 is cached after the first message. So for messages 2, 3, 4... you only pay ~800 tokens instead of ~5,400.
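
Back-of-the-envelope for a 10-turn session: 5,459 + 9 × 787 ≈ 12,500 tokens with Tool Attention, versus 10 × 57,452 ≈ 574,500 with full schemas, roughly a 45x saving overall.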


Quick Start 🚀

Install

git clone https://github.com/tanhaien/tool-attention-mcp.git
cd tool-attention-mcp
python -m venv .venv
source .venv/bin/activate
pip install -e .

And if you want the full Tool Attention backend (recommended):

pip install sentence-transformers

Check It Works

tool-attention-cli health

Rank Some Tools

tool-attention-cli rank -i examples/rank_input.json

You'll see output like:

{
  "ranked_tools": [
    {"id": "read_file", "score": 0.92, "confidence": "high"},
    {"id": "list_directory", "score": 0.78, "confidence": "medium"}
  ],
  "model_info": {"backend": "tool_attention"}
}

Run Tests

pytest -q

Project Structure

tool-attention-mcp/
├── src/tool_attention_mcp/     ← The MCP server (the thing you use)
│   ├── server.py               Handles JSON-RPC messages over stdin/stdout
│   ├── service.py              Picks the best ranking backend
│   ├── adapter_tool_attention.py  Connects to the reference implementation
│   ├── scorer_semantic.py      TF-IDF ranker (backup #2)
│   ├── scorer_ollama.py        LLM-based ranker (backup #1)
│   ├── scorer_fallback.py      Word-match ranker (always available)
│   ├── config.py               Setup from environment variables
│   └── models.py               Data types used everywhere
├── .vendor/tool-attention/     ← The paper's reference code (read-only)
│   └── code/
│       ├── vector_store.py     FAISS index for tool summaries
│       ├── intent_router.py    The core ranking algorithm
│       └── benchmark.py        Token-counting benchmark
├── examples/                   ← Example configs and test data
├── tests/                      ← Test suite
└── Dockerfile                  ← For container deployment

Graceful Degradation

Even if the full Tool Attention backend fails, the server keeps working:

Preferred:    Tool Attention (sentence embeddings + FAISS)
Fallback #1:  LLM-based ranking (Ollama)
Fallback #2:  Semantic TF-IDF (scikit-learn)
Fallback #3:  Keyword matching (always works, no deps)

No single point of failure.
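
One way to implement that kind of chain (a self-contained sketch; the real logic lives in service.py, and the module paired with each backend here is a guess):

import importlib

def pick_backend() -> str:
    chain = [
        ("tool_attention", "sentence_transformers"),  # embeddings + FAISS
        ("ollama", "httpx"),                          # LLM-based ranking
        ("semantic", "sklearn"),                      # TF-IDF
    ]
    for backend, dependency in chain:
        try:
            importlib.import_module(dependency)  # probe for the backend's deps
            return backend
        except ImportError:
            continue  # missing dependency, try the next backend
    return "fallback"  # keyword matching: stdlib only, always available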


MCP Tools Exposed

Once connected to OpenClaw or any MCP host, you get four tools:

| Tool | What It Does |
|---|---|
| tool_attention.rank_tools | "Here's my query, rank these tools by relevance" |
| tool_attention.pick_tool | "What's the single best tool for this query?" |
| tool_attention.explain_ranking | "Why did you pick that tool over this one?" |
| tool_attention.health | "Is everything working?" |

Real Example

{
  "query": "Find all customers from last month",
  "tools": [
    {"id": "query_database", "name": "Query Database", "description": "Run SQL queries on the database", "tags": ["sql", "read"]},
    {"id": "send_email", "name": "Send Email", "description": "Send an email to a recipient", "tags": ["communication"]}
  ],
  "top_k": 2
}

pick_tool returns query_database (score 0.95) over send_email (0.12).
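
A plausible response shape, modeled on the rank output shown earlier (the field names here are illustrative guesses, not the documented schema):

{
  "picked_tool": {"id": "query_database", "score": 0.95, "confidence": "high"},
  "runner_up": {"id": "send_email", "score": 0.12},
  "model_info": {"backend": "tool_attention"}
}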


OpenClaw Configuration

Add this to your OpenClaw MCP servers config:

{
  "mcpServers": {
    "tool-attention": {
      "command": "/path/to/tool-attention-mcp/.venv/bin/tool-attention-mcp",
      "args": [],
      "env": {
        "TA_USE_TOOL_ATTENTION": "1",
        "TA_THRESHOLD": "0.28",
        "TA_TOP_K_DEFAULT": "5",
        "TA_VENDOR_PATH": "/path/to/tool-attention-mcp/.vendor/tool-attention/code"
      }
    }
  }
}

Then use tool_attention.pick_tool in your agent's routing policy before calling any real tool.
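
As a sketch, here's what that routing step could look like with the official mcp Python SDK (pip install mcp); the argument shape follows the Real Example above, and the paths are placeholders:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="/path/to/tool-attention-mcp/.venv/bin/tool-attention-mcp",
    env={"TA_USE_TOOL_ATTENTION": "1", "TA_THRESHOLD": "0.28"},
)

async def route(query: str, tool_summaries: list[dict]):
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Ask the middleware for the best tool before loading any full schema.
            return await session.call_tool(
                "tool_attention.pick_tool",
                {"query": query, "tools": tool_summaries},
            )

# asyncio.run(route("Find all customers from last month", my_tools))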

Environment Variables

| Variable | Default | What It Does |
|---|---|---|
| TA_USE_TOOL_ATTENTION | 1 | Turn on the full Tool Attention backend (1=on, 0=fallback chain only) |
| TA_THRESHOLD | 0.28 | Minimum relevance score for a tool to be considered (0.0 to 1.0) |
| TA_TOP_K_DEFAULT | 5 | Default number of top tools to return |
| TA_VENDOR_PATH | — | Path to the reference implementation code |
| TA_ENCODER | sentence-transformers/all-MiniLM-L6-v2 | Which embedding model to use for scoring |
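
For a quick local experiment you can set these inline; this assumes the CLI reads the same variables as the server (per config.py):

TA_USE_TOOL_ATTENTION=0 TA_TOP_K_DEFAULT=3 tool-attention-cli rank -i examples/rank_input.json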


Docker

docker build -t tool-attention-mcp .
docker run -i tool-attention-mcp

Note: the slim Docker image skips sentence-transformers to keep size small. Extend it if you need the full backend.
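
One minimal way to extend it, assuming you've built the local tool-attention-mcp image as above:

FROM tool-attention-mcp
RUN pip install sentence-transformers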


Credits 📖

Based on the Tool Attention paper (Sadani, 2026), arXiv:2604.21816. This server wraps the paper's reference implementation, vendored under .vendor/tool-attention/.

License

MIT — see LICENSE.

The vendored reference code (.vendor/tool-attention/) is MIT licensed by the original authors.

Quick Setup
Installation guide for this server

Install the package (if needed)

uvx tool-attention-mcp

Cursor configuration (mcp.json)

{ "mcpServers": { "tanhaien-tool-attention-mcp": { "command": "uvx", "args": [ "tool-attention-mcp" ] } } }