MCP coordination primitive for multi-agent workflows. LLM-agnostic, CLI-agnostic, self-hostable. TypeScript + SQLite.
bot-relay-mcp
A local-first message relay for AI coding agents. Two interfaces, one shared SQLite database, zero infrastructure.
v2.1 — architecturally complete. 25 tools. Everything v2.0 delivered (smart routing, task leases, session-aware reads, lazy health monitor, busy/DND status, webhook retries, channels) + the v2.1 sweep: explicit auth_state machine with revoke/recovery flow, managed-agent rotation grace, keyring-based encryption with online rotation, unified relay CLI with recover + re-encrypt + doctor + init + test + generate-hooks + backup + restore. 14 of 14 Codex architectural findings closed. See CHANGELOG for the full arc.
What is this?
bot-relay-mcp gives AI coding agents and external systems a way to coordinate.
Two audiences, two transports:
- AI coding agents (Claude Code, Cursor, Cline, Zed) connect via stdio MCP. Drop one entry into ~/.claude.json and the relay's tools appear inside Claude. No daemon required.
- External systems (n8n, Slack, Telegram, custom scripts) connect via HTTP+SSE with optional Bearer auth. Trigger agent actions or receive webhook events.
Everything reads and writes the same SQLite file at ~/.bot-relay/relay.db. There is no cloud, no daemon you have to install, no service mesh.
Quick Start (30 seconds)
Once published to npm, setup is a single config entry — no cloning, no compiling, no absolute paths.
Add to your ~/.claude.json:
{
"mcpServers": {
"bot-relay": {
"command": "npx",
"args": ["-y", "bot-relay-mcp"],
"type": "stdio"
}
}
}
The first invocation fetches the package and starts the server. Subsequent launches are instant.
Quick Start (from source)
git clone https://github.com/Maxlumiere/bot-relay-mcp.git
cd bot-relay-mcp
npm install
npm run build
Add to ~/.claude.json:
{
"mcpServers": {
"bot-relay": {
"command": "node",
"args": ["/absolute/path/to/bot-relay-mcp/dist/index.js"],
"type": "stdio"
}
}
}
Open two Claude Code terminals and try it:
Terminal A:
> Register on the relay as "planner" with role "orchestrator"
> Discover other agents
> Send a message to "builder": "Can you handle the API layer?"
Terminal B:
> Register on the relay as "builder" with role "builder"
> Check my relay messages
> Reply to planner: "On it."
The database is created automatically at ~/.bot-relay/relay.db on first use. That is the full setup.
File permissions (v2.1). The relay creates ~/.bot-relay/ at 0700 and relay.db + backup tarballs at 0600 — owner-only. config.json is operator-managed; the relay never chmods it but logs a warning at startup if it's more permissive than 0600. POSIX only — native Windows NTFS uses ACLs, not POSIX modes, so the chmod calls are no-ops there (documented).
Tools
Identity
| Tool | Inputs | Description |
|------|--------|-------------|
| register_agent | name, role, capabilities[] | Register this terminal as a named agent. Uses upsert — safe to call multiple times. |
| unregister_agent | name | Remove an agent from the relay. Idempotent. Fires agent.unregistered webhook on success. |
| discover_agents | role (optional) | List all registered agents with status (online/stale/offline). |
| spawn_agent | name, role, capabilities, cwd?, initial_message? | Spawn a new Claude Code terminal pre-configured as a relay agent. Cross-platform (v1.9): macOS (iTerm2/Terminal.app), Linux (gnome-terminal/konsole/xterm/tmux fallback chain — tmux covers headless servers), Windows (wt.exe/powershell.exe/cmd.exe). See docs/cross-platform-spawn.md. |
Messaging
| Tool | Inputs | Description |
|------|--------|-------------|
| send_message | from, to, content, priority | Send a direct message to another agent by name. |
| get_messages | agent_name, status, limit | Check your mailbox. Pending messages are auto-marked as read. |
| broadcast | from, content, role (optional) | Send a message to all registered agents (or filter by role). |
Tasks
| Tool | Inputs | Description |
|------|--------|-------------|
| post_task | from, to, title, description, priority | Assign a task to another agent. |
| post_task_auto (v2.0) | from, title, description, required_capabilities[], priority | Auto-route to the least-loaded agent whose capabilities match ALL required. Queues if no match; assigns on the next capable registration. |
| update_task | task_id, agent_name, action, result? | Actions: accept / complete / reject / cancel (v2.0, requester-only) / heartbeat (v2.0, renews lease). State machine + CAS enforced. |
| get_tasks | agent_name, role, status, limit | Query your task queue (assigned to you or posted by you). |
| get_task | task_id | Get a single task by ID with full details. |
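The routing rule described for post_task_auto (pick the least-loaded agent whose capabilities cover every required capability, otherwise queue) can be sketched as below. This is an illustration, not the relay's actual code: the AgentLoad shape, the openTasks load metric, and the pickAgent name are assumptions made for the example.

```typescript
// Hypothetical view of an agent for routing purposes.
interface AgentLoad {
  name: string;
  capabilities: string[];
  openTasks: number; // assumed load metric: tasks currently assigned to the agent
}

// Returns the least-loaded agent that has ALL required capabilities,
// or null when no agent qualifies (the caller would queue the task).
export function pickAgent(agents: AgentLoad[], required: string[]): AgentLoad | null {
  const capable = agents.filter((agent) =>
    required.every((cap) => agent.capabilities.includes(cap)),
  );
  if (capable.length === 0) return null;
  // Keep the earlier agent on ties so the result is deterministic.
  return capable.reduce((best, agent) => (agent.openTasks < best.openTasks ? agent : best));
}
```

Returning null rather than throwing mirrors the "queues if no match" behaviour: an unmatched task is a normal outcome, not an error.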
Channels (v2.0)
| Tool | Inputs | Description |
|------|--------|-------------|
| create_channel | name, description?, creator | Create a named channel for multi-agent coordination. Requires channels capability. |
| join_channel | channel_name, agent_name | Join any public channel. |
| leave_channel | channel_name, agent_name | Leave a channel. |
| post_to_channel | channel_name, from, content, priority | Post to a channel you are a member of. |
| get_channel_messages | channel_name, agent_name, limit, since? | Read messages posted to a channel since your join time. |
Status + Health (v2.0)
| Tool | Inputs | Description |
|------|--------|-------------|
| set_status | agent_name, status | Signal online / busy / away / offline. busy/away exempt you from health-monitor task reassignment. |
| health_check | (none) | Report relay version, uptime, and live counts (agents, messages, tasks, channels). No auth required. |
Webhooks (v1.2+)
| Tool | Inputs | Description |
|------|--------|-------------|
| register_webhook | url, event, filter, secret | Subscribe to relay events via HTTP POST. |
| list_webhooks | (none) | List all registered webhook subscriptions. |
| delete_webhook | webhook_id | Remove a webhook subscription. |
Supported events: message.sent, message.broadcast, task.posted, task.accepted, task.completed, task.rejected, task.cancelled (v2.0), task.auto_routed (v2.0), task.health_reassigned (v2.0), channel.message_posted (v2.0), agent.unregistered, agent.spawned, webhook.delivery_failed, * (all).
v2.0 — retry with backoff. Failed webhook deliveries retry at 60s / 300s / 900s (3 attempts). CAS-claimed per row — no double delivery. Piggybacks on webhook-firing tool calls, no background thread.
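As a rough sketch of the retry schedule above, the helper below maps the number of retries already made to the time of the next attempt. It assumes each delay is measured from the most recent failure; the source does not say whether the delays are relative or cumulative, and nextRetryAt is a hypothetical name.

```typescript
// Delays taken from the retry description: 60s, 300s, 900s, then give up.
const RETRY_DELAYS_SECONDS: readonly number[] = [60, 300, 900];

// retriesSoFar: how many retries have already been attempted for this delivery.
// Returns the time of the next retry, or null once all three retries are used.
export function nextRetryAt(lastFailure: Date, retriesSoFar: number): Date | null {
  const delay: number | undefined = RETRY_DELAYS_SECONDS[retriesSoFar];
  if (delay === undefined) return null;
  return new Date(lastFailure.getTime() + delay * 1000);
}
```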
When secret is provided, each delivery includes an X-Relay-Signature: sha256=... HMAC header. Filter optionally restricts firing to events where from_agent or to_agent matches.
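A receiver can check the X-Relay-Signature header along these lines. The sketch assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body, keyed with the secret passed to register_webhook; confirm the exact signed payload and encoding against the relay source before relying on it. signBody and verifySignature are illustrative names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: signature = "sha256=" + hex(HMAC-SHA256(secret, rawBody)).
export function signBody(rawBody: string, secret: string): string {
  const digest = createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  return `sha256=${digest}`;
}

// Returns true only when the header matches the signature computed locally.
export function verifySignature(
  rawBody: string,
  header: string | undefined,
  secret: string,
): boolean {
  if (!header || !header.startsWith("sha256=")) return false;
  const expected = Buffer.from(signBody(rawBody, secret), "utf8");
  const received = Buffer.from(header, "utf8");
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

Verify against the raw body bytes as received; re-serializing parsed JSON can change whitespace or key order and break the comparison.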
Example: Task Delegation
Terminal A — Orchestrator:
1. register_agent("orchestrator", "planner", ["delegation", "review"])
2. discover_agents() → sees "worker" is online
3. post_task(from: "orchestrator", to: "worker",
title: "Write auth tests",
description: "Cover login, logout, token refresh. Use vitest.",
priority: "high")
4. send_message(from: "orchestrator", to: "worker",
content: "Task posted — check your queue.")
Terminal B — Worker:
1. register_agent("worker", "builder", ["testing", "backend"])
2. get_messages("worker") → message from orchestrator
3. get_tasks("worker", role: "assigned", status: "posted") → auth test task
4. update_task(task_id, "worker", "accept")
5. ... does the work ...
6. update_task(task_id, "worker", "complete", result: "12 tests passing")
7. send_message(from: "worker", to: "orchestrator",
content: "Auth tests done. All passing.")
Terminal A checks results:
1. get_messages("orchestrator") → "Auth tests done."
2. get_task(task_id) → status: completed, result: "12 tests passing"
How It Works
Every Claude Code terminal spawns its own MCP server process via stdio. All processes read and write the same SQLite file at ~/.bot-relay/relay.db. SQLite WAL mode handles concurrent access safely. Messages older than 7 days and completed tasks older than 30 days are purged automatically on startup.
Unified relay CLI (v2.1)
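One entry point, relay, with six setup and maintenance subcommands: doctor / init / test / generate-hooks / backup / restore. (recover, re-encrypt, and pair are covered in their own sections.) First-run setup: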
relay init # interactive
relay init --yes # defaults + random HTTP secret
relay doctor runs a diagnostic sweep; relay test runs a minimal self-check against a throwaway relay; relay generate-hooks emits Claude Code hook JSON for ~/.claude/settings.json. Full reference in docs/cli.md. The standalone bin/relay-backup + bin/relay-restore from Phase 2c have been absorbed into relay backup + relay restore.
Token lifecycle (v2.1)
Two new tools for credential hygiene: rotate_token lets an agent swap its own token with history preserved; revoke_token lets an admin-capable agent nullify another agent's token_hash (target re-bootstraps via the Phase 2b migration path). New admin capability is never auto-granted — register admin agents explicitly. Full operator runbook in docs/token-lifecycle.md.
Error codes (v2.1)
Every tool error response carries a stable error_code token alongside the free-form error string. Branch on the code; never string-match the message. Full catalog + stability guarantee in docs/error-codes.md. Source of truth: src/error-codes.ts.
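Client code that follows this advice might look like the sketch below. The ToolError shape is inferred from the description above (an error string plus an error_code token). AUTH_FAILED appears elsewhere in this README; RATE_LIMITED is a made-up code used only to show a second branch. Check docs/error-codes.md for the real catalog.

```typescript
// Assumed shape of a tool error payload.
interface ToolError {
  error: string;      // free-form message, for humans only
  error_code: string; // stable token, safe to branch on
}

type Recovery = "reauthenticate" | "retry_later" | "fail";

// Branch on the stable code and never on the message text.
export function classifyError(err: ToolError): Recovery {
  switch (err.error_code) {
    case "AUTH_FAILED":
      return "reauthenticate";
    case "RATE_LIMITED": // hypothetical code, for illustration only
      return "retry_later";
    default:
      return "fail";
  }
}
```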
Protocol version (v2.1)
Beyond the package version string, the relay surfaces a protocol_version via register_agent + health_check responses. Clients should key compatibility on protocol_version (bumps only on tool-surface changes) rather than the package version (bumps on every ship). See docs/protocol-version.md for SemVer rules + a client-side compatibility snippet.
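A client-side check could be as simple as the sketch below. It assumes protocol_version is a MAJOR.MINOR.PATCH string and that "compatible" means same major version and at least the client's minimum; the authoritative rules are in docs/protocol-version.md, and isCompatible is a hypothetical helper.

```typescript
// Parses "MAJOR.MINOR.PATCH"; returns null for anything else.
function parseVersion(version: string): [number, number, number] | null {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version.trim());
  if (!match) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// True when the server speaks the same major protocol version and is at
// least as new as the minimum version the client was written against.
export function isCompatible(serverVersion: string, clientMinimum: string): boolean {
  const server = parseVersion(serverVersion);
  const minimum = parseVersion(clientMinimum);
  if (!server || !minimum) return false;
  if (server[0] !== minimum[0]) return false;
  if (server[1] !== minimum[1]) return server[1] > minimum[1];
  return server[2] >= minimum[2];
}
```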
HTTP Mode — for n8n, Slack, Telegram, custom scripts
Run the relay as an HTTP daemon and any HTTP client can drive it:
RELAY_TRANSPORT=http RELAY_HTTP_SECRET=your-shared-secret node dist/index.js
# Listens on http://127.0.0.1:3777
Production deployment: set RELAY_HTTP_SECRET. v2.1 refuses to start on a non-loopback host (0.0.0.0, a public IP, Docker -p 3777:3777 without loopback pinning) unless RELAY_HTTP_SECRET is set. Loopback binds (127.0.0.1, ::1, localhost) stay zero-config for local development.
Dev-only escape hatch: RELAY_ALLOW_OPEN_PUBLIC=1 lets the relay start anyway on a public host without a secret. This is useful for throwaway local Docker nets, but it logs a loud warning on every startup. Never use it in production.
Endpoints:
- POST /mcp: JSON-RPC (the MCP protocol over HTTP+SSE). Requires Authorization: Bearer <secret> if http_secret is configured.
- GET /health: server status (always open, no auth).
- GET /: built-in dashboard (live view of agents, messages, tasks, webhooks). v2.1 Phase 4d: protected by a Host-header allowlist (DNS-rebinding defense) plus an auth gate (RELAY_DASHBOARD_SECRET, or RELAY_HTTP_SECRET as fallback). Loopback binds allow no-secret access for dev; non-loopback binds require a secret. Full policy in docs/dashboard-security.md.
- GET /api/snapshot: JSON snapshot of relay state (same gates as /).
Three transport modes:
- stdio (default): per-terminal, for AI coding agents
- http: daemon, for external systems
- both: HTTP daemon plus a stdio connection (useful for bridge scripts)
All transports share the same SQLite database. Stdio agents and HTTP clients see the same world.
Process-boundary reminder (v2.1.3): stdio MCP clients and the HTTP daemon are separate processes. Each Claude Code terminal with "type":"stdio" in ~/.claude.json spawns its own node dist/index.js child. Restarting the :3777 HTTP daemon never affects stdio clients — their own child processes are untouched. Operator /mcp reconnect is only needed after restart for "type":"http" MCP clients pointed at the daemon URL. See docs/transport-architecture.md for the full topology + post-restart operator checklist.
n8n integration example
Trigger a Claude Code agent from an n8n workflow:
POST http://127.0.0.1:3777/mcp
Authorization: Bearer your-shared-secret
Content-Type: application/json
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "post_task",
"arguments": {
"from": "n8n-workflow-42",
"to": "builder",
"title": "Process new lead",
"description": "Lead data: ...",
"priority": "high"
}
}
}
Then register a webhook so n8n hears about completion:
{
"jsonrpc": "2.0", "id": 2, "method": "tools/call",
"params": {
"name": "register_webhook",
"arguments": {
"url": "https://your-n8n.example.com/webhook/abc",
"event": "task.completed",
"secret": "shared-with-n8n"
}
}
}
When a task completes, n8n receives a POST with the result and an HMAC signature in X-Relay-Signature.
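For custom scripts, the same two requests can be built and sent from TypeScript. This is a minimal sketch that mirrors the JSON-RPC envelope and headers shown above; buildToolCall and callTool are illustrative names, and the response is returned as raw text because the body may be plain JSON or an SSE stream depending on the server.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Builds the JSON-RPC envelope used by the examples above.
export function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Sends the request to the relay's /mcp endpoint (uses the global fetch in Node 18+).
export async function callTool(
  baseUrl: string,
  secret: string,
  request: ToolCallRequest,
): Promise<string> {
  const response = await fetch(`${baseUrl}/mcp`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${secret}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(request),
  });
  if (!response.ok) throw new Error(`relay returned HTTP ${response.status}`);
  return response.text();
}
```

Usage would be along the lines of callTool("http://127.0.0.1:3777", secret, buildToolCall(1, "post_task", { from: "n8n-workflow-42", to: "builder", title: "Process new lead", description: "Lead data: ...", priority: "high" })).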
Config File (v1.2+)
Optional ~/.bot-relay/config.json:
{
"transport": "http",
"http_port": 3777,
"http_host": "127.0.0.1",
"webhook_timeout_ms": 5000,
"http_secret": null,
"trusted_proxies": []
}
Env vars override file config: RELAY_TRANSPORT, RELAY_HTTP_PORT, RELAY_HTTP_HOST, RELAY_HTTP_SECRET, RELAY_TRUSTED_PROXIES (comma-separated CIDRs).
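The precedence rule (environment over file over defaults) can be sketched as follows. The key names and env var names come from the text above; the default values are assumptions drawn from the examples in this README, and resolveConfig is a hypothetical helper, not the relay's loader.

```typescript
interface RelayConfig {
  transport: string;
  http_port: number;
  http_host: string;
  webhook_timeout_ms: number;
  http_secret: string | null;
  trusted_proxies: string[];
}

// Assumed defaults, taken from the examples in this README.
const DEFAULTS: RelayConfig = {
  transport: "stdio",
  http_port: 3777,
  http_host: "127.0.0.1",
  webhook_timeout_ms: 5000,
  http_secret: null,
  trusted_proxies: [],
};

// Defaults first, then config.json values, then env vars on top.
export function resolveConfig(
  file: Partial<RelayConfig>,
  env: Record<string, string | undefined>,
): RelayConfig {
  const config: RelayConfig = { ...DEFAULTS, ...file };
  const transport = env["RELAY_TRANSPORT"];
  if (transport) config.transport = transport;
  const port = env["RELAY_HTTP_PORT"];
  if (port) config.http_port = Number(port);
  const host = env["RELAY_HTTP_HOST"];
  if (host) config.http_host = host;
  const secret = env["RELAY_HTTP_SECRET"];
  if (secret) config.http_secret = secret;
  const proxies = env["RELAY_TRUSTED_PROXIES"];
  if (proxies) {
    // Comma-separated CIDRs; tolerate surrounding whitespace.
    config.trusted_proxies = proxies.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  }
  return config;
}
```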
Trusted Proxies and X-Forwarded-For (v1.6.2)
By default, the relay IGNORES the X-Forwarded-For header completely. Rate limits are keyed on the direct socket peer IP only. This prevents a caller from sending a spoofed header to get their own rate-limit bucket.
If you front the relay with Cloudflare, nginx, or any other reverse proxy, configure trusted_proxies with CIDRs of those proxies:
{
"trusted_proxies": ["127.0.0.0/8", "::1/128", "10.0.0.0/8"]
}
Or via env var:
RELAY_TRUSTED_PROXIES="127.0.0.0/8,::1/128,10.0.0.0/8"
When the direct peer IP falls in the trusted list, the relay walks the X-Forwarded-For chain right-to-left, skipping trusted hops, and uses the first untrusted hop it reaches (the rightmost untrusted entry) as the real client IP. This matches RFC 7239 §7.4 and how nginx/Express normally handle this.
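The walk can be sketched with Node's built-in net.BlockList for CIDR matching. This illustrates the rule as described and is not the relay's implementation; resolveClientIp and its helpers are hypothetical names, and falling back to the peer IP when every hop is trusted is an assumption.

```typescript
import { BlockList, isIP } from "node:net";

// Builds a matcher from CIDR strings such as "10.0.0.0/8" or "::1/128".
function buildTrusted(cidrs: string[]): BlockList {
  const list = new BlockList();
  for (const cidr of cidrs) {
    const slash = cidr.indexOf("/");
    const address = slash === -1 ? cidr : cidr.slice(0, slash);
    const family = isIP(address);
    if (slash === -1 || family === 0) throw new Error(`bad CIDR: ${cidr}`);
    list.addSubnet(address, Number(cidr.slice(slash + 1)), family === 4 ? "ipv4" : "ipv6");
  }
  return list;
}

function isTrusted(list: BlockList, ip: string): boolean {
  const family = isIP(ip);
  if (family === 0) return false; // not a valid IP, so never trusted
  return list.check(ip, family === 4 ? "ipv4" : "ipv6");
}

// Ignores X-Forwarded-For unless the direct peer is a trusted proxy, then
// walks the chain right-to-left and returns the first untrusted hop.
export function resolveClientIp(
  peerIp: string,
  forwardedFor: string | undefined,
  trustedCidrs: string[],
): string {
  const trusted = buildTrusted(trustedCidrs);
  if (!forwardedFor || !isTrusted(trusted, peerIp)) return peerIp;
  const hops = forwardedFor.split(",").map((h) => h.trim()).filter((h) => h.length > 0);
  for (let i = hops.length - 1; i >= 0; i--) {
    const hop = hops[i] as string;
    if (!isTrusted(trusted, hop)) return hop;
  }
  return peerIp; // assumption: every hop was a trusted proxy
}
```

Entries to the left of the first untrusted hop are client-supplied and can be spoofed, which is why the walk stops there.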
Per-Agent Tokens (v1.7)
Every tool call (other than first-time register_agent and /health) requires an agent token. The token identifies WHO is calling — separate from and stronger than the shared HTTP secret, which only identifies a trusted network.
Issuing a token — first registration:
# The response returns `agent_token` ONCE. Save it.
{
"jsonrpc": "2.0", "id": 1, "method": "tools/call",
"params": {
"name": "register_agent",
"arguments": { "name": "builder", "role": "builder", "capabilities": ["tasks"] }
}
}
The server stores only a bcrypt hash. The raw token is surfaced once in the response agent_token field and echoed to stderr as [auth] New agent_token issued for "builder". Save it: RELAY_AGENT_TOKEN=.... If you lose it, unregister_agent (auth'd) then re-register.
Presenting the token on every subsequent call — three ways:
- Arg field (works for stdio + HTTP):
{ "name": "send_message", "arguments": { "from": "builder", "to": "ops", "content": "hi", "agent_token": "..." } }
- HTTP header:
X-Agent-Token: <token>
- Env var (stdio flow, also picked up by HTTP client wrappers):
export RELAY_AGENT_TOKEN=<token>
Capabilities are set at first registration and are immutable (v1.7.1). To change an agent's capability set, call unregister_agent (with its token) then register_agent with the new capability list. Re-register attempts that change caps are ignored with a capabilities_note in the response.
Capability catalog:
- spawn: required for spawn_agent
- tasks: required for post_task, update_task
- webhooks: required for register_webhook, list_webhooks, delete_webhook
- broadcast: required for broadcast
- channels: required for create_channel (v2.0)
- All other tools are always allowed for any authenticated agent (no capability check).
Migration for pre-v1.7 agents (v2.1+): agents registered before v1.7 have no token hash. A register_agent call against such a row self-migrates — the relay detects the null hash, issues a fresh token, and the agent is first-class from that point on. No RELAY_ALLOW_LEGACY=1 required for the migration call itself. RELAY_ALLOW_LEGACY is still available as a coarser escape hatch for non-register tool calls against unmigrated legacy rows (e.g., if you want send_message to work before an agent has migrated); turn it OFF once all your agents have migrated.
Encryption at Rest (v1.7 opt-in; keyring + rotation in v2.1 Phase 4b.3)
Set the keyring to encrypt message/task/audit/webhook content fields in the SQLite database with AES-256-GCM. Three configuration sources (pick exactly one — multi-set is rejected at startup):
# 1. Inline JSON (for CI / secrets managers)
export RELAY_ENCRYPTION_KEYRING='{"current":"k1","keys":{"k1":"<base64-32>"}}'
# 2. File path (operator-friendly; chmod 600)
export RELAY_ENCRYPTION_KEYRING_PATH=~/.bot-relay/keyring.json
# 3. Legacy single-key (auto-wraps to { current: "k1", keys: { k1: <value> } }; deprecation warning at startup)
export RELAY_ENCRYPTION_KEY="<base64-32>"
# Generate a key:
openssl rand -base64 32
# or:
node -e 'console.log(require("crypto").randomBytes(32).toString("base64"))'
When the keyring is set, the relay transparently encrypts on write (with current key) and decrypts on read (with any key in the keyring). Every ciphertext carries an enc:<key_id>:... prefix so rows are self-describing. Legacy enc1:... rows (pre-Phase-4b.3 deployments) decrypt via RELAY_ENCRYPTION_LEGACY_KEY_ID (default k1).
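The encrypt-with-current, decrypt-with-any behaviour can be sketched with node:crypto. Only the enc:<key_id>: prefix and AES-256-GCM come from the text above. The payload layout after the prefix (base64 of IV, auth tag, ciphertext) is an assumption for this example and will not match the relay's on-disk format. encryptField and decryptField are hypothetical names.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Same shape as RELAY_ENCRYPTION_KEYRING: key ids mapped to base64 32-byte keys.
interface Keyring {
  current: string;
  keys: Record<string, string>;
}

function keyFor(ring: Keyring, keyId: string): Buffer {
  const encoded: string | undefined = ring.keys[keyId];
  if (encoded === undefined) throw new Error(`unknown key id: ${keyId}`);
  const key = Buffer.from(encoded, "base64");
  if (key.length !== 32) throw new Error(`key ${keyId} is not 32 bytes`);
  return key;
}

// Encrypts with the current key and tags the row with that key's id.
export function encryptField(ring: Keyring, plaintext: string): string {
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", keyFor(ring, ring.current), iv);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // Assumed payload layout: iv (12 bytes) | auth tag (16 bytes) | ciphertext.
  const payload = Buffer.concat([iv, cipher.getAuthTag(), body]).toString("base64");
  return `enc:${ring.current}:${payload}`;
}

// Decrypts with whichever key the prefix names, so old rows stay readable.
export function decryptField(ring: Keyring, stored: string): string {
  const match = /^enc:([^:]+):(.+)$/.exec(stored);
  if (!match) return stored; // no prefix: treated as a plaintext row
  const raw = Buffer.from(match[2] as string, "base64");
  const decipher = createDecipheriv("aes-256-gcm", keyFor(ring, match[1] as string), raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString("utf8");
}
```

Because every row names its own key, flipping current only changes which key new writes use; nothing already stored needs to be touched until the old key is retired.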
Rotating keys (online)
Full runbook at docs/key-rotation.md. In summary:
1. Add the new key to the keyring while keeping the old one (both decrypt; current still points to old).
2. Flip current to the new key; restart. New writes use the new key.
3. relay re-encrypt --from old_key_id --to new_key_id --yes: scans + migrates all existing rows across 5 encrypted columns. Resumable.
4. relay re-encrypt --verify-clean old_key_id: exit 0 = safe to retire.
5. Remove the old key from the keyring; restart.
Without the keyring set, content is stored plaintext (default, convenient for local dev).
Rotation Guide — HTTP Shared Secret (v1.7)
The RELAY_HTTP_SECRET shared secret can be rotated without downtime using a grace window:
Step 1 — promote the new secret as primary, keep the old as previous:
RELAY_HTTP_SECRET="new-secret-v2" \
RELAY_HTTP_SECRET_PREVIOUS="old-secret-v1" \
RELAY_TRANSPORT=http node dist/index.js
During this window, BOTH secrets are accepted. Requests using the old secret receive an X-Relay-Secret-Deprecated: true response header as a signal to upgrade.
Step 2 — update every client to present new-secret-v2 in their Authorization: Bearer … or X-Relay-Secret header.
Step 3 — watch for the deprecation header on your dashboard/logs until no more requests use the old secret.
Step 4 — drop the old secret:
RELAY_HTTP_SECRET="new-secret-v2" \
RELAY_TRANSPORT=http node dist/index.js # RELAY_HTTP_SECRET_PREVIOUS unset
Multiple previous secrets are supported as a comma-separated list:
RELAY_HTTP_SECRET_PREVIOUS="v1-secret,v0-secret"
Secret comparisons are timing-safe (v1.7.1 — crypto.timingSafeEqual), so an attacker cannot recover the secret via byte-by-byte response-timing measurement.
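The grace-window check can be sketched as below. crypto.timingSafeEqual requires equal-length inputs, so this example hashes both sides with SHA-256 first; that is one common approach and may not be how the relay does it. checkSecret and its return shape are illustrative.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both values so the buffers are always 32 bytes;
// timingSafeEqual throws when the lengths differ.
function safeEqual(a: string, b: string): boolean {
  const hashA = createHash("sha256").update(a, "utf8").digest();
  const hashB = createHash("sha256").update(b, "utf8").digest();
  return timingSafeEqual(hashA, hashB);
}

// previousCsv mirrors RELAY_HTTP_SECRET_PREVIOUS (comma-separated list).
// deprecated = true means the caller should be sent the deprecation header.
export function checkSecret(
  presented: string,
  primary: string,
  previousCsv = "",
): { ok: boolean; deprecated: boolean } {
  if (safeEqual(presented, primary)) return { ok: true, deprecated: false };
  const previous = previousCsv.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  let matchedOld = false;
  for (const oldSecret of previous) {
    // No early exit, so timing does not reveal which old secret matched.
    if (safeEqual(presented, oldSecret)) matchedOld = true;
  }
  return { ok: matchedOld, deprecated: matchedOld };
}
```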
Multi-machine: centralized deployment (v2.1)
bot-relay-mcp is LLM-agnostic, CLI-agnostic, and deployment-flexible. Pick the path that fits your setup:
Path A — Single-machine (default). Stdio transport, per-terminal process, zero infrastructure. Best for solo development on one laptop. No secrets, no reverse proxies, no ops — run npm install + add the stdio entry to your MCP client config and you're done. Covered throughout this README.
Path B — Multi-machine (centralized, v2.1 Phase 7r). One bot-relay-mcp hub on a VPS, multiple thin MCP clients connecting via HTTP. Agents on different machines can send_message, post tasks, subscribe to webhooks, and join channels through shared state. No new architecture — just the HTTP transport we've had since v1.2, packaged with a convenience CLI in v2.1.
When to pick centralized
- Two or more machines in play (dev laptop + CI, work + personal, family devices)
- AI agents running on different hosts that need to coordinate
- Team environments where multiple people connect their MCP clients to a shared relay
Quick pair flow
On the hub (VPS, reachable at e.g. https://relay.example.com): install bot-relay-mcp, run under systemd with RELAY_TRANSPORT=http + RELAY_HTTP_SECRET, terminate TLS with Caddy/nginx. See docs/multi-machine-deployment.md for the worked VPS runbook.
On each client machine:
relay pair https://relay.example.com \
--name "$(whoami)-$(hostname -s)" \
--role operator \
--capabilities spawn,tasks,webhooks,broadcast,channels \
--secret "$RELAY_HTTP_SECRET"
relay pair probes the hub, registers this machine as an agent, captures the returned one-time agent_token, and emits an MCP client config snippet ready to paste into ~/.claude.json / ~/.cursor/mcp.json / etc. Persist the token (export RELAY_AGENT_TOKEN=… in your shell rc) so hooks can authenticate on every terminal open.
Verify after pairing:
relay doctor --remote https://relay.example.com
Expected: PASS on reachability + protocol compatibility + token auth + hub auth config.
Trust-model tradeoffs
- Hub operator can read plaintext messages in RAM (even with RELAY_ENCRYPTION_KEY set, decryption happens server-side for routing)
- Hub is a single point of failure for cross-machine coordination
- Recommended for trusted groups (families, small teams, personal multi-machine setups)
- NOT recommended for mutually distrustful parties sharing a single hub, or compliance-bound workloads where in-RAM access by the operator is a policy violation
See SECURITY.md §Centralized deployment trust model for the full posture + incident response playbook.
Bridge to other tools via MCP
bot-relay-mcp is MCP-compatible, so your MCP client can connect to both bot-relay-mcp AND other MCP servers (Slack, Discord, Matrix, email, etc.) simultaneously. That's an operator deployment choice — we don't integrate those into bot-relay-mcp. See docs/multi-machine-deployment.md §3 for the pattern.
Zero-Friction Setup
To skip approval prompts for relay tools, add this to your project's .claude/settings.json:
{
"permissions": {
"allow": [
"mcp__bot-relay__register_agent",
"mcp__bot-relay__discover_agents",
"mcp__bot-relay__send_message",
"mcp__bot-relay__get_messages",
"mcp__bot-relay__broadcast",
"mcp__bot-relay__post_task",
"mcp__bot-relay__update_task",
"mcp__bot-relay__get_tasks",
"mcp__bot-relay__get_task",
"mcp__bot-relay__register_webhook",
"mcp__bot-relay__list_webhooks",
"mcp__bot-relay__delete_webhook"
]
}
}
Auto-Check on Session Start
Add a SessionStart hook so every terminal automatically checks the relay for pending messages when it opens. See docs/hooks.md for the full configuration.
Near-Real-Time Mail Delivery (v1.8)
The SessionStart hook only fires when a terminal opens. If an agent is actively working and mail arrives mid-session, it does not see the message until next startup (or a human pastes it in).
v1.8 adds a PostToolUse hook — hooks/post-tool-use-check.sh — that fires after every tool call, checks the mailbox, and injects pending messages as additionalContext so the running session picks them up immediately.
Install per-project (NOT global), in <project>/.claude/settings.json:
{
"hooks": {
"PostToolUse": [
{
"matcher": "*",
"hooks": [
{
"type": "command",
"command": "/path/to/bot-relay-mcp/hooks/post-tool-use-check.sh",
"timeout": 5
}
]
}
]
}
}
Important: if your path contains spaces, single-quote it inside the JSON string. Claude Code passes the command to the shell, which splits on whitespace. Without the single quotes the hook silently fails with errors like /bin/sh: ... is a directory. Example for a real installation at /Users/maxime/Documents/Ai stuff/Claude AI/bot-relay-mcp/:
"command": "'/Users/maxime/Documents/Ai stuff/Claude AI/bot-relay-mcp/hooks/post-tool-use-check.sh'"
The outer double-quotes are JSON; the inner single-quotes are shell. Paths with no spaces do not need this treatment.
The hook prefers the HTTP path when RELAY_AGENT_TOKEN is set and the daemon is running (full auth + audit), falling back to direct sqlite on RELAY_DB_PATH otherwise. It does NOT re-register (SessionStart handles that), does NOT check tasks (simpler focus, less context pressure), and silent-exits when there is no mail. Full docs + troubleshooting in docs/post-tool-use-hook.md.
Honest limitation: idle terminals get no delivery. The hook only fires when the agent is actively running tool calls. For long-idle windows, still rely on SessionStart + human attention.
Turn-End Mail Delivery (v2.1)
PostToolUse only fires on turns that include at least one tool call. A text-only turn (Claude responds with no tool invocation) does not trigger it. The Stop hook — hooks/stop-check.sh — closes that gap by firing on every turn-end, whether or not the turn invoked tools. Install both together in your project's .claude/settings.json:
{
"hooks": {
"Stop": [
{
"matcher": "*",
"hooks": [
{
"type": "command",
"command": "/path/to/bot-relay-mcp/hooks/stop-check.sh",
"timeout": 5
}
]
}
]
}
}
Same single-quote-the-path-if-it-contains-spaces rule, same env vars, same HTTP/sqlite fallback, same silent-fail contract as PostToolUse. Full docs + troubleshooting in docs/stop-hook.md.
Honest limitation: Stop does NOT wake truly idle terminals. If no turn is in progress, neither hook fires. For long-idle windows, use the Layer 2 Managed Agent reference (examples/managed-agent-reference/).
Backup & Restore (v2.1)
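Two commands for disaster recovery. The unified CLI exposes them as relay backup and relay restore; the examples below use the standalone relay-backup / relay-restore names from Phase 2c: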
relay-backup # snapshot to ~/.bot-relay/backups/
relay-backup --output /srv/backup.tgz # custom path
relay-restore ~/.bot-relay/backups/relay-backup-<iso>.tar.gz
relay-backup produces a tar.gz of the live DB (via a consistent VACUUM INTO snapshot — safe while the daemon is running), the optional config.json, and a manifest.json with schema version and row counts. Works identically on the native better-sqlite3 driver and the optional sql.js wasm driver.
relay-restore always safety-backs-up the current DB first (to ~/.bot-relay/backups/pre-restore-<iso>.tar.gz). If that safety backup fails, the restore aborts untouched. It then refuses if the daemon appears to be running (/health probe, best-effort), refuses schema-version mismatches (higher = hard refuse, lower = --force overrides), runs PRAGMA integrity_check on the extracted DB, and finally atomic-swaps the new DB into place.
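The schema-version gate reduces to a small decision function. This sketch reads "higher" as the backup having a newer schema than the installed relay, which is an interpretation of the sentence above; checkSchema is a hypothetical name.

```typescript
type RestoreDecision = "proceed" | "refuse";

// backupVersion: schema version recorded in the backup's manifest.json.
// currentVersion: schema version the installed relay understands.
export function checkSchema(
  backupVersion: number,
  currentVersion: number,
  force: boolean,
): RestoreDecision {
  // A newer backup cannot be read safely: hard refuse, --force does not help.
  if (backupVersion > currentVersion) return "refuse";
  // An older backup is allowed only when the operator passes --force.
  if (backupVersion < currentVersion) return force ? "proceed" : "refuse";
  return "proceed";
}
```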
Full docs + troubleshooting in docs/backup-restore.md.
Lost-Token Recovery (v2.1)
Close a terminal, lose RELAY_AGENT_TOKEN, and the relay rejects your register_agent with AUTH_FAILED because the row is intact. Clear the registration so the agent can re-bootstrap:
relay recover <agent-name> # interactive confirm
relay recover <agent-name> --yes # skip confirm (for scripts)
relay recover <agent-name> --dry-run # show what would change, commit nothing
relay recover <agent-name> --db-path PATH # non-default DB location
Messages and tasks addressed to the agent are preserved — only the agents + agent_capabilities rows are cleared. After recovery, the operator calls register_agent with the same name/role/capabilities and captures a fresh agent_token.
Trust model: filesystem access to ~/.bot-relay/relay.db IS the authority (same boundary the daemon relies on). Not an MCP tool — the caller by definition cannot authenticate. The CLI emits an audit_log entry with tool='recovery.cli' + the operator's OS username for incident traceability.
Cross-Platform Spawn (v1.9)
spawn_agent opens a new Claude Code terminal on macOS, Linux, and Windows via a driver abstraction:
- macOS: bin/spawn-agent.sh (iTerm2 → Terminal.app). Unchanged from v1.6.4, preserves the 3-layer hardening + 19-payload adversarial test suite.
- Linux: gnome-terminal → konsole → xterm → tmux fallback chain. The tmux fallback creates a detached session (attach later with tmux attach -t <agent-name>), which covers headless servers with no GUI.
- Windows: wt.exe (Windows Terminal) → powershell.exe → cmd.exe.
Driver selection: RELAY_TERMINAL_APP override (allowlist-gated) > process.platform auto-detect > in-driver fallback chain.
Full install requirements per platform + manual smoke-test checklists + troubleshooting: docs/cross-platform-spawn.md.
Env-var propagation is minimal by default (principle of least authority): system essentials + anything prefixed RELAY_*. Secrets like AWS_SECRET_ACCESS_KEY are NOT passed to spawned agents unless explicitly prefixed.
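The allowlist approach can be sketched as a filter over the parent environment. The RELAY_ prefix rule comes from the text above; the list of "system essentials" here is a guess for illustration, and filterSpawnEnv is a hypothetical name.

```typescript
// Assumed set of system essentials; the relay's real list may differ.
const ESSENTIALS = new Set(["PATH", "HOME", "USER", "SHELL", "LANG", "TERM"]);

// Keeps only essentials and RELAY_-prefixed variables; everything else is dropped.
export function filterSpawnEnv(env: Record<string, string | undefined>): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const [name, value] of Object.entries(env)) {
    if (value === undefined) continue;
    if (ESSENTIALS.has(name) || name.startsWith("RELAY_")) kept[name] = value;
  }
  return kept;
}
```

An allowlist fails closed: a secret added to the parent shell later is withheld by default, which a denylist would not guarantee.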
Plug-and-play defaults (v2.1.2): spawned terminals are configured for autonomous work out of the box — they auto-pull mail from their inbox on first turn instead of idling, run with --permission-mode bypassPermissions so they don't ask the operator to approve every tool call, get an iTerm2 / session-picker title set to the agent name, and run at --effort high to cap token spend (parent terminals doing strategic work may use xhigh; spawned children doing mechanical work shouldn't inherit it). Every default has an env override for the rare case the legacy behavior is wanted:
| Default | Env override | Notes |
|---|---|---|
| Kickstart prompt sent as positional arg | RELAY_SPAWN_KICKSTART="custom" / RELAY_SPAWN_NO_KICKSTART=1 | The default tells the spawned agent to pull get_messages and act on inbox. |
| --permission-mode bypassPermissions | RELAY_SPAWN_PERMISSION_MODE=<mode> | Allowlist: acceptEdits, auto, bypassPermissions, default, dontAsk, plan. |
| --name <agent> | RELAY_SPAWN_DISPLAY_NAME="custom" | Shows up as the iTerm2 tab + Claude Code session title. |
| --effort high | RELAY_SPAWN_EFFORT=<level> | Allowlist: low, medium, high, xhigh, max. |
Layer 2: Managed Agents (v1.10)
Agents that are NOT Claude Code terminals — Python daemons, Node workers, Hermes/Ollama integrations, custom scripts. They connect to the relay via HTTP (recommended) or direct SQLite, use the same 25 MCP tools (v2.1), and authenticate with per-agent tokens. If registered with managed:true, they also receive token-rotation push-messages over the normal get_messages channel — see docs/managed-agent-protocol.md.
Full integration guide with mental model, auth flow, lifecycle, error patterns, and security notes: docs/managed-agent-integration.md.
Runnable reference implementations (stdlib-only, ~200 LOC each):
- Python: examples/managed-agent-reference/python/agent.py
- Node: examples/managed-agent-reference/node/agent.js
Both demonstrate: register, send/receive messages, accept + complete tasks, discover peers, SIGINT cleanup. Each has a SMOKE.md with a 5-step manual verification checklist.
SQLite Driver Options (v1.11)
The relay uses SQLite for persistent state. Two drivers are available:
- native (default): better-sqlite3, a compiled C addon. Fast, supports WAL mode, multi-process safe. Requires a C++ compiler at npm install time.
- wasm: sql.js, SQLite compiled to WebAssembly. Zero native compilation. Slightly slower writes (in-memory + write-back-to-file). Single-process only (not safe for multi-terminal stdio).
Switch with one env var:
npm install sql.js # one-time install of the optional dep
RELAY_SQLITE_DRIVER=wasm node dist/index.js
Both drivers read the same relay.db file format. Full details, performance notes, and limitations: docs/sqlite-wasm-driver.md.
Roadmap
- v1.1: Local relay, 9 tools, SQLite, auto-purge
- v1.2: HTTP transport, webhook system, config file — 12 tools
- v1.3: Presence integrity, unregister_agent, hook delivers mail. 13 tools.
- v1.4: spawn_agent + role templates + dashboard. 14 tools.
- v1.5: Built-in security: Bearer auth, audit log, rate limiting
- v1.6: Hardening pass: SSRF, input validation, path traversal, stdout discipline
- v1.7: Per-agent tokens, secret rotation, at-rest encryption, capability scoping
- v1.8: Near-real-time mail via PostToolUse hook
- v1.9: Cross-platform spawn (macOS / Linux / Windows / tmux), Node/TS driver abstraction
- v1.10: Layer 2 Managed Agents, reference Python + Node workers
- v1.11: SQLite WASM driver (sql.js opt-in), zero native compilation on Windows/Alpine/Docker/CI
- v2.0: Plug-and-play: channels, smart routing (post_task_auto), task leases + heartbeat, lazy health monitor, session-aware reads, busy/DND, health_check, webhook retry with CAS, payload size limits, config validation, auto-unregister, dead-agent purge, debug mode. 22 tools.
- v2.1 (current): Architectural completion: explicit auth_state machine, managed-agent rotation grace, versioned ciphertext + keyring with online rotation (relay re-encrypt), lost-token recovery CLI (relay recover), admin-initiated cross-agent rotation (rotate_token_admin), structured error_code catalog, protocol_version surface, Phase 4p webhook-secret encryption, Phase 4b.1 v2 revoke/recovery redesign. 25 tools. 14 of 14 Codex architectural findings closed.
- v2.2: Polish: batch operations, fan-out/fan-in, scheduled messages, metrics endpoint, message ACK, private channels, token scoping, idle-terminal wake.
- v2.5: Federation — cross-machine peering, E2E encryption.
Dashboard
When running in http or both mode, open http://127.0.0.1:3777/ in a browser. You'll see:
- Live agent presence (online / stale / offline)
- Active tasks with priority and assignment
- Recent messages
- Registered webhooks
- Recently completed tasks
Auto-refreshes every 3 seconds. Useful for "what's happening across all my terminals right now?" at a glance.
Role Templates
See roles/ for drop-in role specs. Examples:
- planner.md — orchestrator that delegates and synthesizes
- builder.md — worker that accepts and completes tasks
- reviewer.md — skeptical reviewer with structured output
- researcher.md — investigates questions, returns findings
Three ways to apply a role: paste into project CLAUDE.md, pass as initial_message when spawning, or wire via shell alias.
Requirements
- Node.js 18+
- Claude Code (or any MCP-compatible client)
License
MIT