CatoBot autoexperiment MCP Server
Domain-Agnostic MCP Server for Autonomous Experimentation
Domain-agnostic MCP server for autonomous experimentation with metric-driven keep/rollback decisions and reproducible experiment history.
A generalisation of Karpathy's autoresearch pattern into a reusable Model Context Protocol (MCP) server that any AI agent can drive, pointed at any domain.
Documentation Map
- Core usage and setup: this README
- Example catalog: example_experiments/README.md
- Experiment design guide: example_experiments/Autoexperiment_Design_Guide.md
- Contribution guide: CONTRIBUTING.md
- Security policy: SECURITY.md
- Support channels: SUPPORT.md
- License and notices: LICENSE, NOTICE
Example Experiments
- Shell-only experiment with CatoBot autoexperiment MCP: example_experiments/shell/
- External orchestration of CatoBot autoexperiment MCP + Text2Sim MCP: example_experiments/external/DES_Text2Sim/
The Pattern
modify something → run it → measure a result → keep or discard → repeat
The server exposes this loop as a standard set of MCP tools. The domain (what gets modified, how it runs, and what gets measured) is defined entirely in a JSON config file. The agent-side logic stays the same regardless of domain.
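The loop the agent drives can be sketched in Python. This is an illustrative sketch only, not the server's implementation; `run_and_measure`, `modify`, `keep`, and `rollback` are hypothetical stand-ins for the MCP tools described below:

```python
import random

def autoexperiment_loop(run_and_measure, modify, keep, rollback, budget=20):
    # Sketch of the modify -> run -> measure -> keep/discard loop.
    # All four callables are hypothetical stand-ins for the MCP tools.
    best = run_and_measure()          # establish the baseline first
    for _ in range(budget):
        modify()                      # edit an editable file
        score = run_and_measure()     # run the experiment, extract the metric
        if score is not None and score < best:  # metric_direction: "lower"
            best = score
            keep()                    # commit the improvement
        else:
            rollback()                # revert to the last good state
    return best

# Toy usage: minimise |x| by random perturbation.
random.seed(0)
state = {"x": 10.0, "trial": 10.0}
best = autoexperiment_loop(
    run_and_measure=lambda: abs(state["trial"]),
    modify=lambda: state.update(trial=state["x"] + random.uniform(-2, 1)),
    keep=lambda: state.update(x=state["trial"]),
    rollback=lambda: state.update(trial=state["x"]),
)
print(best <= 10.0)  # True: the loop never keeps a regression
```

The real server replaces the toy callables with file edits, a shell command, and git commits, but the control flow is the same.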
Architecture
┌─────────────────────────────────────────────────────┐
│ AI Agent (Claude Code, Codex, etc.) │
│ │
│ Reads status → plans change → edits file → │
│ runs experiment → checks result → keeps/discards │
└──────────────┬──────────────────────────────────────┘
│ MCP (stdio)
┌──────────────▼──────────────────────────────────────┐
│ autoexperiment MCP server │
│ │
│ Tools: │
│ autoexp_get_status — session overview │
│ autoexp_read_file — read allowed file │
│ autoexp_update_file — full file replace │
│ autoexp_patch_file — targeted find/repl │
│ autoexp_run_experiment — execute + measure │
│ autoexp_begin_experiment — open pending record │
│ autoexp_complete_experiment — close with metric │
│ autoexp_set_baseline — mark as baseline │
│ autoexp_rollback — revert to last good │
│ autoexp_get_history — review past runs │
│ autoexp_run_setup — one-time setup │
│ │
│ Resources: │
│ autoexp://status — session status (JSON) │
│ autoexp://history — experiment history │
│ autoexp://file/{path} — read allowed files │
│ │
│ Config: autoexperiment.json (domain adapter) │
│ Ledger: .autoexperiment_ledger.json (state) │
└──────────────┬──────────────────────────────────────┘
│ subprocess / external MCP server
┌──────────────▼──────────────────────────────────────┐
│ Your domain │
│ (training script, benchmark, simulation, etc.) │
└─────────────────────────────────────────────────────┘
Code Structure
The server is implemented as a Python package (autoexperiment_mcp/) with a thin server.py entry point:
autoexperiment-mcp-server/
├── server.py # Entry point: imports mcp, calls mcp.run()
└── autoexperiment_mcp/
├── models.py # Pydantic models (DomainConfig, ExperimentRecord, …)
├── utils.py # Pure utilities (git, hash, path, regex, time, coercion)
├── store.py # State I/O, snapshot management, TSV logging, query helpers
├── experiment.py # Core lifecycle: begin/complete experiment, keep decision
├── lifespan.py # Startup validation, app_lifespan context manager
├── app.py # mcp = FastMCP("autoexperiment_mcp", lifespan=…)
├── tools.py # All 11 @mcp.tool() registrations
├── resources.py # All 3 @mcp.resource() registrations
└── __init__.py # Imports app + triggers tool/resource registration
Installation
Prerequisites
- Python 3.12 or higher
- uv package manager
Install uv
macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Learn more: astral-sh/uv
Clone the repository
git clone https://github.com/IamCatoBot/catobot-autoexperiment-mcp.git
cd catobot-autoexperiment-mcp
Install dependencies
uv sync
Quick Start
1. Prepare your experiment folder
Your experiment folder needs a working baseline, an evaluation script, and a config file:
my-experiment/
├── autoexperiment.json ← config (you write this)
├── solution.py ← editable (agent modifies this)
├── benchmark.py ← evaluation (read-only)
└── data.csv ← test data (read-only)
2. Create autoexperiment.json
{
"project_name": "My Experiment",
"description": "What you're trying to optimise",
"workspace_dir": "/absolute/path/to/my-experiment",
"editable_files": ["solution.py"],
"read_only_files": ["benchmark.py", "data.csv"],
"run_command": "python benchmark.py 2>&1",
"timeout_seconds": 60,
"metric_name": "rmse",
"metric_regex": "^rmse:\\s*([\\d.]+)",
"metric_direction": "lower",
"use_git": true
}
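Before starting the server, you can sanity-check metric_regex against a sample of your command's output. The exact flags the server uses when matching are an assumption; re.MULTILINE is applied here so the ^ anchor can match a metric line in the middle of the output:

```python
import re

metric_regex = r"^rmse:\s*([\d.]+)"   # same pattern as the config above
sample_stdout = "loading data.csv\nrmse: 12.345678\ndone\n"

m = re.search(metric_regex, sample_stdout, re.MULTILINE)
assert m is not None, "metric_regex did not match the sample output"
print(float(m.group(1)))  # 12.345678
```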
3. Initialise git in the experiment folder
Git tracking is enabled by default (use_git: true). The experiment folder must be a git repository with an initial commit before the server will start.
cd /path/to/my-experiment
git init
git add -A
git commit -m "initial baseline"
4. Verify the run command works
Run your experiment command manually and check the output contains the metric in the expected format:
cd /path/to/my-experiment
python benchmark.py
# Should print something like: rmse: 12.345678
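The same check can be scripted end to end. This is a sketch of run-then-extract, not the server's code; `echo` stands in for `python benchmark.py`, and the rmse pattern comes from the example config above:

```python
import re
import subprocess

def run_and_extract(command, pattern, timeout=60):
    # Run the experiment command, capture its output, extract the metric.
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    m = re.search(pattern, proc.stdout + proc.stderr, re.MULTILINE)
    return float(m.group(1)) if m else None

# Usage: echo stands in for your real run command.
score = run_and_extract("echo 'rmse: 12.345678'", r"^rmse:\s*([\d.]+)")
print(score)  # 12.345678
```

If this returns None for your real command, fix the output format or the regex before registering the server.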
5. Register the MCP server
Recommended: pass AUTOEXPERIMENT_CONFIG pointing to your config file. MCP hosts may launch the server process from a different working directory, so an explicit path is the safest default.
Claude Code (-e for env vars):
claude mcp add autoexperiment \
-e AUTOEXPERIMENT_CONFIG=/path/to/my-experiment/autoexperiment.json \
-- uv run \
--project PATH_TO_AUTOEXPERIMENT_MCP_SERVER \
python PATH_TO_AUTOEXPERIMENT_MCP_SERVER/server.py
Codex (--env for env vars):
codex mcp add autoexperiment \
--env AUTOEXPERIMENT_CONFIG=/path/to/my-experiment/autoexperiment.json \
-- uv run \
--project PATH_TO_AUTOEXPERIMENT_MCP_SERVER \
python PATH_TO_AUTOEXPERIMENT_MCP_SERVER/server.py
Replace /path/to/my-experiment/autoexperiment.json with the absolute path to your config file.
Optional shortcut: if the server process is launched from your experiment folder and the config filename is autoexperiment.json, you can omit the environment variable.
You only need to register the MCP server once per MCP client profile; after that, reconnect normally in new sessions.
Note: Replace PATH_TO_AUTOEXPERIMENT_MCP_SERVER with the actual path to your cloned repository. If the uv command is not found, run which uv (Unix) or Get-Command uv (PowerShell) and use the full path in the "command" field.
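For hosts configured through a JSON file rather than a CLI, the equivalent entry looks roughly like this. This is a sketch: the exact file location and schema depend on your MCP client, and the /absolute/path placeholders are yours to fill in:

```json
{
  "mcpServers": {
    "autoexperiment": {
      "command": "uv",
      "args": [
        "run",
        "--project", "/absolute/path/to/catobot-autoexperiment-mcp",
        "python", "/absolute/path/to/catobot-autoexperiment-mcp/server.py"
      ],
      "env": {
        "AUTOEXPERIMENT_CONFIG": "/absolute/path/to/my-experiment/autoexperiment.json"
      }
    }
  }
}
```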
6. Start experimenting
Launch Claude Code, Codex, or another MCP client from your experiment folder and prompt it:
Read the experiment status, review the editable and read-only files, run the baseline first, then iterate until improvements plateau and no meaningful gains remain.
Security Warning
setup_command and run_command execute shell commands on your host machine. This server does not provide sandboxing or container isolation by default.
Startup Validation
The server validates the configuration at startup and will refuse to start if:
- workspace_dir does not exist or is not a directory
- Any file in editable_files or read_only_files is missing
- A file appears in both editable_files and read_only_files
- metric_regex is not a valid regular expression
- use_git is true but the workspace is not a git repository
Error messages are specific and tell you exactly what to fix.
Configuration
Everything domain-specific lives in autoexperiment.json:
| Field | Required | Default | Description |
|---|---|---|---|
| project_name | yes | | Human-readable name |
| description | no | "" | What you're trying to achieve |
| workspace_dir | yes | | Absolute path to the experiment folder |
| editable_files | yes | | Files the agent is allowed to modify (at least one) |
| read_only_files | no | [] | Files the agent can read but not change |
| execution_mode | no | "hybrid" | "shell", "external", or "hybrid" |
| run_command | shell/hybrid | | Shell command to run one experiment |
| timeout_seconds | no | 300 | Max time per experiment (10–7200s) |
| setup_command | no | null | One-time setup (deps, data download, etc.) |
| metric_name | yes | | Name of the metric being optimised |
| metric_regex | shell/hybrid | | Regex with one capture group to extract a float from stdout |
| metric_direction | yes | | "lower" or "higher" |
| require_baseline_first | no | true | Require a baseline experiment before non-baseline runs |
| use_git | no | true | Track experiments as git commits. Requires the workspace to be a git repo with an initial commit. |
| git_branch_prefix | no | "autoexp" | Prefix for experiment branches |
| keep_policy | no | see below | Multi-gate keep/discard policy |
Keep Policy
The keep_policy object controls when a completed experiment is kept vs discarded. All gates must pass for a run to be kept.
| Field | Default | Description |
|---|---|---|
| required_true_keys | [] | Metadata keys that must be boolean true |
| numeric_min | {} | Metadata keys with a floor value (e.g. {"utilization": 45}) |
| numeric_max | {} | Metadata keys with a ceiling value (e.g. {"latency_ms": 250}) |
| require_numeric_keys_present | true | If true, missing keys in numeric_min/numeric_max cause discard |
| allow_equal_metric_if_simpler | true | Keep a tied run if its complexity_score is lower |
| equal_metric_tolerance | 1e-9 | Tolerance for treating two metric values as equal |
| complexity_key | "complexity_score" | Metadata key used for complexity tie-breaking |
The agent sees the full keep_policy in autoexp_get_status and receives a required_metadata_keys reminder in every autoexp_begin_experiment response — so it always knows exactly what to include in the metadata argument when calling autoexp_complete_experiment.
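The gating semantics can be sketched as follows. This is illustrative only; the server's actual logic lives in experiment.py and may differ in detail, and the gate keys in the usage example are hypothetical:

```python
def passes_keep_policy(metadata, policy):
    # All gates must pass for a completed run to be kept (sketch).
    for key in policy.get("required_true_keys", []):
        if metadata.get(key) is not True:
            return False                      # boolean gate failed
    for gates, ok in ((policy.get("numeric_min", {}), lambda v, b: v >= b),
                      (policy.get("numeric_max", {}), lambda v, b: v <= b)):
        for key, bound in gates.items():
            if key not in metadata:
                # Missing key discards when require_numeric_keys_present is true.
                if policy.get("require_numeric_keys_present", True):
                    return False
            elif not ok(metadata[key], bound):
                return False                  # numeric bound violated
    return True

# Usage with hypothetical gate keys:
policy = {"required_true_keys": ["converged"],
          "numeric_min": {"utilization": 45},
          "numeric_max": {"latency_ms": 250}}
print(passes_keep_policy(
    {"converged": True, "utilization": 52.0, "latency_ms": 180.0}, policy))  # True
print(passes_keep_policy(
    {"converged": True, "utilization": 40.0, "latency_ms": 180.0}, policy))  # False
```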
Tool Reference
| Tool | Purpose | Destructive? |
|---|---|---|
| autoexp_get_status | Session overview, best score, editable files, keep_policy gates | No |
| autoexp_read_file | Read any allowed file | No |
| autoexp_update_file | Replace entire file contents | Yes |
| autoexp_patch_file | Targeted find-and-replace | No |
| autoexp_run_experiment | Execute the run command, extract metric (shell mode) | No (but slow) |
| autoexp_begin_experiment | Open a pending experiment record (external/hybrid mode) | No |
| autoexp_complete_experiment | Close a pending experiment with metric + metadata (external/hybrid mode) | No |
| autoexp_set_baseline | Mark an existing completed experiment as the baseline | No |
| autoexp_rollback | Revert files to a specific experiment's state via git | Yes |
| autoexp_get_history | Review past experiments and results | No |
| autoexp_run_setup | Run one-time setup command | No |
How the Loop Works
Shell mode (execution_mode: "shell")
- Agent calls autoexp_get_status → learns the domain, metric, and current best.
- Agent calls autoexp_read_file → reads the editable file(s) to understand the code.
- Agent calls autoexp_patch_file or autoexp_update_file → makes a change.
- Agent calls autoexp_run_experiment with a hypothesis → server runs it, extracts metric.
- If improved: server auto-commits via git and records the commit hash. Agent plans next experiment.
- If regressed or crashed: agent calls autoexp_rollback, then tries something else.
- Agent calls autoexp_get_history periodically to review trends and avoid repetition.
- Repeat indefinitely.
External / hybrid mode (execution_mode: "external" or "hybrid")
Use this when another MCP server (e.g. a physics sim, a cloud evaluator) runs the experiment.
- Agent calls autoexp_get_status → notes the keep_policy field — it lists every metadata key the policy will gate on.
- Agent edits the editable file(s) via autoexp_update_file / autoexp_patch_file.
- Agent calls autoexp_begin_experiment → receives experiment_id and a required_metadata_keys reminder.
- Agent triggers the external system and waits for results.
- Agent assembles a metadata dict containing all keys from required_metadata_keys (both from simulation output and any input-parameter constraints defined in numeric_min/numeric_max).
- Agent calls autoexp_complete_experiment with experiment_id, metric_value, and the assembled metadata dict.
- Server evaluates the keep policy and responds with kept, keep_reason, is_best.
- If not kept: agent calls autoexp_rollback and adjusts its approach.
Important: numeric_min/numeric_max gates often reference input parameters (e.g. service-time bounds from a config file) rather than simulation outputs. You must read those values yourself and include them in metadata alongside the simulator's results.
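Assembling that metadata dict can be sketched like this. The key names and values are hypothetical; the point is to merge simulator outputs with the gated input parameters and verify nothing the policy will check is missing:

```python
def assemble_metadata(sim_results, input_params, required_keys):
    # Merge simulator outputs with gated input parameters, then verify
    # every key the keep policy will check is present before completing.
    metadata = {**sim_results, **input_params}
    missing = [k for k in required_keys if k not in metadata]
    if missing:
        raise ValueError(f"metadata missing required keys: {missing}")
    return metadata

# Usage with hypothetical keys from a required_metadata_keys reminder:
metadata = assemble_metadata(
    sim_results={"utilization": 52.3, "latency_ms": 180.0},
    input_params={"service_time_mean": 0.8},  # read from your own config file
    required_keys=["utilization", "latency_ms", "service_time_mean"],
)
print(sorted(metadata))  # ['latency_ms', 'service_time_mean', 'utilization']
```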
Design Principles
- Domain-agnostic. The server knows nothing about ML, sorting, prompts, or any specific domain. All domain knowledge lives in the config file and the agent's reasoning.
- Single metric. One number determines success. If your problem needs multiple metrics, your run command should combine them into a single score.
- Fixed time budget. Each experiment gets the same wall-clock timeout, making results comparable.
- Git as memory. Every improvement is committed with its commit hash recorded. Every regression can be rolled back to a specific experiment. The full history is always recoverable.
- Agent autonomy. The server provides tools, not opinions. The agent decides what to try, when to rollback, and when to change strategy.
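As an example of the single-metric principle, a run command can fold several measurements into one score before printing the line metric_regex will match. The weight and metric names here are hypothetical:

```python
def combined_score(rmse, latency_ms, w_latency=0.01):
    # Fold two measurements into the single number the server extracts.
    # Lower is better for both, so a weighted sum preserves that direction.
    return rmse + w_latency * latency_ms

score = combined_score(rmse=12.3, latency_ms=180.0)
print(f"score: {score:.6f}")  # prints score: 14.100000
```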
Maintainer
The CatoBot autoexperiment MCP Server is an open source project developed and maintained by Nikolaos Maniatis, The Cato Bot Company Limited.
Disclaimer
- Work in progress: the software is actively evolving; features may change and some functionality may be incomplete.
- LLM-powered workflow: model/code quality depends on the capabilities of the LLM driving the loop.
- Validate outputs: always critically review and validate generated models, code changes, and metrics before relying on results.
Citation
For academic use, cite:
Maniatis, N. (2026). CatoBot autoexperiment MCP Server (v1.0.0). https://github.com/IamCatoBot/catobot-autoexperiment-mcp. Copyright The Cato Bot Company Limited. Licensed under Apache 2.0.