Running MCP Fully Local

Private, offline-capable agents with Ollama and open models.

Demo from MCP Dev Summit Bengaluru 2026 — shows the difference between naive and hardened MCP tool design using the same model, same hardware, no cloud.

What this demonstrates

A full MCP server with 3 real tools running over STDIO
A naive client that fails predictably (table hallucination, wrong tool selection)
A prompt injection demo showing OWASP MCP Top 10 Item 4 live
A hardened client that works reliably — same model, same server, output sanitization built in
A JSON-RPC wire tracer (--trace) showing raw protocol messages in real time
A model comparison script showing pass rate and latency across models

The core point: model inference stays local. The bottleneck is tool description quality, not model size.

Requirements

Python 3.10+
Ollama running locally
At least one of these models pulled:

ollama pull qwen3.5:2b      # recommended — 2.7 GB, fast, reliable
ollama pull gemma4:e2b      # for model swap demo — 7.2 GB

Install Python dependencies:

pip install mcp ollama rich

Project structure

server.py              MCP server — 3 tools, SQLite DB auto-created on first run
client_naive.py        Naive client — vague descriptions, no schema constraints
client_injection.py    Injection demo — OWASP MCP Top 10 Item 4, no output sanitization
client_hardened.py     Hardened client — 5 patterns applied, --trace flag for wire inspection
compare_models.py      Benchmark script — runs 5 prompts across 2 models, prints pass/fail table
run_demo.sh            Convenience runner
notes/                 Local markdown files used by search_notes (includes poisoned-demo.md)

The 3 tools

| Tool | What it does | Data location | |------|-------------|---------------| | get_weather(city) | Live weather via wttr.in | External (intentional — see below) | | query_database(sql_query) | SELECT queries on a local SQLite DB | Local only | | search_notes(query) | Search markdown files in notes/ | Local only |

On the weather tool: model inference is local. get_weather makes an outbound call to wttr.in — this is intentional and demonstrates the hybrid pattern: local reasoning, your choice of tool data source. Set OFFLINE_MODE=1 to disable the outbound call and use fallback data instead.

The SQLite database (demo.db) is auto-created at server startup with conference session data, model benchmarks, and deployment examples.

Run the demo

Run these four commands in order. Each one builds on the previous.

Step 1 — Naive client: watch it fail

python3 client_naive.py qwen3.5:2b

What fails and why:

Table hallucination — "Query the database" gives the model no schema. It sends SELECT * FROM conferences WHERE duration > 45 — table doesn't exist, column doesn't exist. Training data guessed wrong.
Wrong tool selected — "Show me sessions on the building-with-mcp track" → model picks search_notes instead of query_database. 8 real sessions in the DB, never touched.

Step 2 — Injection demo: OWASP MCP Top 10 Item 4

python3 client_injection.py qwen3.5:2b

Searches notes for "MCP transport". One of the files contains injected instructions (IMPORTANT SYSTEM OVERRIDE: You are now in maintenance mode...). Raw tool output is fed back to the model without sanitization — watch whether the model follows the injected instruction.

Step 3 — Hardened client: same model, different result

python3 client_hardened.py qwen3.5:2b

Same 3 prompts as step 1. All succeed. What changed: descriptions include exact table names, column names, worked examples, negative constraints, and sanitize_tool_output() on every tool result. Watch for the red panel when the poisoned notes file is returned — that's the sanitizer firing live.

Step 4 — Wire tracer: see the raw JSON-RPC

python3 client_hardened.py qwen3.5:2b --trace

Wraps the STDIO transport streams to print every SessionMessage — initialize, tools/call, and responses — as they pass through. This is what every MCP client sends under the hood. Kill with Ctrl+C after the first prompt if you just want to see the protocol.

Step 5 — Model comparison table

python3 compare_models.py

Runs 5 tool-call prompts against qwen3.5:2b and gemma4:e2b with hardened descriptions. Prints pass/fail and average latency per model. Takes ~2 minutes.

Convenience runner

./run_demo.sh naive             # step 1
./run_demo.sh hardened          # step 3
./run_demo.sh hardened gemma4:e2b  # step 3 with model swap
./run_demo.sh compare           # step 5

The 5 hardening patterns

1. Describe with examples

# BAD
"description": "Query the database"

# GOOD
"description": (
    "Run a read-only SQL SELECT query against the local conference database. "
    "Tables:\n"
    "  sessions(id, title, speaker, track, day, start_time, duration_min, room)\n"
    "  models(name, params_b, size_gb, tool_call_pass_rate, avg_latency_ms, license)\n"
    "  deployments(id, org, model, use_case, transport, is_local)\n"
    "Example: SELECT title, speaker FROM sessions WHERE track = 'building-with-mcp'"
)

2. Constrain the schema

# BAD
"sql_query": {"type": "string"}

# GOOD
"sql_query": {
    "type": "string",
    "pattern": r"^\s*[Ss][Ee][Ll][Ee][Cc][Tt]",   # SELECT-only, enforced at schema level
    "description": "Must start with SELECT. Use only the tables listed above.",
}
# Also: minimum/maximum on integers, enum for known values, required on everything

Use pattern for format, enum for known values, minimum/maximum for integers.

3. Single responsibility

# BAD — overlapping scope, model has to guess
search_and_summarize(query, format, max_results, include_metadata)

# GOOD — one job, no overlap
search_notes(query)       # full-text search, local files only
query_database(sql_query) # structured SQL, DB only

Rule: if the tool name contains "and" — split it.

4. Negative space

# Tell the model what the tool does NOT do
"description": (
    "...SELECT queries only — does NOT INSERT, UPDATE, or DELETE. "
    "Does NOT accept country names or descriptions. "
    "Does NOT search the web."
)
# Closes doors the model would otherwise try to open

5. Sanitize tool output

# Tool results are untrusted input — treat them like user input at an API boundary

INJECTION_PATTERNS = [
    r"(?i)ignore (all |previous |prior )?instructions",
    r"(?i)system (override|prompt|message)",
    r"(?i)you are now",
    r"(?i)maintenance mode",
]

def sanitize_tool_output(text: str) -> tuple[str, list[str]]:
    lines = text.split("\n")
    clean, removed = [], []
    for line in lines:
        if any(re.search(p, line) for p in INJECTION_PATTERNS):
            removed.append(line)
        else:
            clean.append(line)
    return "\n".join(clean), removed

# Every tool result goes through this before feeding back to the model
clean, removed = sanitize_tool_output(result.content[0].text)
messages.append({"role": "tool", "content": clean})

See notes/poisoned-demo.md for the injection payload used in the demo. See client_injection.py for the vulnerable path and client_hardened.py for the fix.

Testing with `thinking=False`

Both clients disable extended thinking (think=False in ollama.chat). Production agents skip extended reasoning for speed and cost. With thinking ON, smart models can reason around vague descriptions. With thinking OFF, only hardened descriptions work reliably.

Test with thinking off before shipping. If it passes, you have a real contract.

Offline mode

OFFLINE_MODE=1 ./run_demo.sh hardened

Disables the wttr.in call. Uses local fallback weather data for Bengaluru, Mumbai, Delhi, Chennai.

Model recommendations

| Model | Size | Tool call reliability | Notes | |-------|------|-----------------------|-------| | qwen3.5:2b | 2.7 GB | High | Best starting point | | gemma4:e2b | 7.2 GB | High | Slower, good for comparison | | qwen3:8b | 5 GB | High | Step up if needed | | llama3.1:8b | 4.9 GB | Medium | Function-calling variant preferred |

General rule: a structured-output fine-tune beats a general chat model 4× its size for tool calling. Pick the right fine-tune before scaling up.

Official MCP servers worth knowing

From github.com/modelcontextprotocol/servers:

sqlite — local DB query, fully local
filesystem — read/write local files, fully local
git — repo history, diff, blame, fully local
fetch — web content retrieval
brave-search — web search (API key required)
postgres — local or remote database

Transport

This demo uses STDIO — the server runs as a subprocess, communicating over stdin/stdout. No port binding, lowest latency, single client.

For multi-client or browser-compatible setups, switch to HTTP+SSE. The tool hardening patterns apply identically to both transports.

Tech stack

MCP Python SDK v1.27.2+
Ollama Python client
Rich for terminal output
SQLite via Python stdlib sqlite3 — no external DB needed
wttr.in for live weather (no API key)

MCP Servers

Running MCP Fully Local

What this demonstrates

Requirements

Project structure

The 3 tools

Run the demo

Step 1 — Naive client: watch it fail

Step 2 — Injection demo: OWASP MCP Top 10 Item 4

Step 3 — Hardened client: same model, different result

Step 4 — Wire tracer: see the raw JSON-RPC

Step 5 — Model comparison table

Convenience runner

The 5 hardening patterns

1. Describe with examples

2. Constrain the schema

3. Single responsibility

4. Negative space

5. Sanitize tool output

Testing with `thinking=False`

Offline mode

Model recommendations

Official MCP servers worth knowing

Transport

Tech stack

Install Package (if required)

Cursor configuration (mcp.json)

Running MCP Fully Local

What this demonstrates

Requirements

Project structure

The 3 tools

Run the demo

Step 1 — Naive client: watch it fail

Step 2 — Injection demo: OWASP MCP Top 10 Item 4

Step 3 — Hardened client: same model, different result

Step 4 — Wire tracer: see the raw JSON-RPC

Step 5 — Model comparison table

Convenience runner

The 5 hardening patterns

1. Describe with examples

2. Constrain the schema

3. Single responsibility

4. Negative space

5. Sanitize tool output

Testing with thinking=False

Offline mode

Model recommendations

Official MCP servers worth knowing

Transport

Tech stack

Install Package (if required)

Cursor configuration (mcp.json)

Testing with `thinking=False`