AutoPrepAI is an autonomous multi-agent data science pipeline that uses LLM-driven orchestration, A2A communication, and MCP tools to perform dataset profiling, EDA, preprocessing, feature engineering, AutoML, reflection, and report generation through distributed AI agents and deterministic execution.
🧠 AutoPrepAI v5
Autonomous Multi-Agent Data Science Pipeline
Upload a dataset → LLM orchestrates 9 agents → Get a trained ML model + full report
100% local • No API keys • Runs on CPU via Ollama llama3.2
What is AutoPrepAI?
AutoPrepAI is a fully autonomous data science pipeline in which a local LLM (Ollama llama3.2) makes every decision. You upload a raw dataset (CSV, Excel, or JSON), and the system cleans and analyzes the data, engineers features, trains 6 ML models, and delivers a full explainability report, all without you writing a single line of code.
Nine specialized AI agents communicate through the A2A and MCP protocols and are orchestrated by an LLM brain. If model performance is poor, a Reflection Agent diagnoses the root cause, applies a fix, and retries training automatically.
Architecture
graph TB
UI["🖥️ Streamlit UI<br/>Premium Dark Theme"]
ORCH["🧠 Orchestrator<br/>LLM Brain + Memory + Discovery"]
UI --> ORCH
subgraph A2A["🤝 A2A Agent Servers (ports 8201–8209)"]
DU["📋 DataUnderstanding"]
DQ["🔍 DataQuality"]
EDA["📈 EDA"]
MV["🩹 MissingValue"]
ENC["🏷️ Encoding"]
FE["⚙️ FeatureEngineering"]
AML["🏆 AutoML"]
REF["🔄 Reflection"]
REP["📝 Report"]
end
subgraph MCP["📡 MCP Tools Server (port 8100)"]
T1["describe_data"]
T2["check_missing"]
T3["correlation_analysis"]
T4["detect_outliers"]
T5["distribution_analysis"]
T6["train_model"]
T7["evaluate_model"]
T8["pipeline_status"]
end
ORCH -- "A2A Protocol<br/>JSON-RPC over HTTP" --> A2A
ORCH -- "MCP Protocol<br/>FastMCP" --> MCP
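The diagram labels the orchestrator-to-agent link as JSON-RPC over HTTP. As a rough sketch of what such a request could look like, assuming JSON-RPC 2.0 framing: the method name `run_stage`, the `dataset_path` parameter, and the order in which agents map to ports 8201–8209 are all hypothetical and not taken from the AutoPrepAI source. The request is built but never sent, so the snippet runs without any servers.

```python
import json
import urllib.request

# Assumption: agents are bound to ports 8201-8209 in the order the diagram lists them.
AGENT_NAMES = [
    "DataUnderstanding", "DataQuality", "EDA", "MissingValue", "Encoding",
    "FeatureEngineering", "AutoML", "Reflection", "Report",
]
AGENT_PORTS = {name: 8201 + i for i, name in enumerate(AGENT_NAMES)}


def build_request(agent: str, method: str, params: dict, request_id: int = 1) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 POST request for one agent."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return urllib.request.Request(
        url=f"http://localhost:{AGENT_PORTS[agent]}/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "run_stage" and "dataset_path" are hypothetical names used only for illustration.
req = build_request("EDA", "run_stage", {"dataset_path": "workdir/stage_0_raw.csv"})
print(req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it; the real agents may expect a different method name and parameter shape.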
How It Works
The pipeline runs in 7 autonomous stages. The LLM decides what to do — deterministic code (pandas/sklearn) handles execution.
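One way to picture the split between LLM decisions and deterministic execution is a plan (plain data) handed to ordinary code. The sketch below is an assumption about the shape of such a plan, not the project's actual schema. It uses the standard library's `statistics` module as a stand-in for pandas so that it runs without dependencies.

```python
from statistics import median, mode

# Hypothetical plan an LLM might return: one imputation strategy per column.
plan = {"age": "median", "city": "mode"}

# Toy dataset stored column-wise; None marks a missing value.
data = {
    "age": [31, None, 45, 27],
    "city": ["Pune", "Delhi", None, "Pune"],
}

# Deterministic implementations keyed by strategy name.
STRATEGIES = {"median": median, "mode": mode}


def apply_plan(columns: dict, plan: dict) -> dict:
    """Fill missing values column by column using the strategy named in the plan."""
    result = {}
    for name, values in columns.items():
        strategy = plan.get(name)
        if strategy is None:
            result[name] = list(values)  # no instruction: leave the column untouched
            continue
        if strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy {strategy!r} for column {name!r}")
        observed = [v for v in values if v is not None]
        fill = STRATEGIES[strategy](observed)
        result[name] = [fill if v is None else v for v in values]
    return result


cleaned = apply_plan(data, plan)
print(cleaned)
```

Rejecting unknown strategy names is one way to keep a malformed LLM response from reaching the data.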
flowchart TD
START(["📂 Upload CSV / Excel / JSON"])
S1["📋 Step 1: Data Understanding<br/>Profile columns, detect target, classify types"]
S2["🔍 Step 2: Data Quality<br/>Grade quality A–F, flag risky columns"]
S3["📈 Step 3: EDA<br/>Outliers, correlations, skewness, charts"]
S4["🩹 Step 4: Missing Values<br/>LLM picks per-column strategy: median, mode, ffill..."]
S5["🏷️ Step 5: Encoding<br/>LLM picks per-column method: onehot, label, tfidf..."]
S6["⚙️ Step 6: Feature Engineering<br/>LLM recommends: log, sqrt, interactions, binning"]
S7["🏆 Step 7: AutoML<br/>Train 6 models, LLM selects the best"]
CHECK{"Score OK?"}
REFLECT["🔄 Reflection Agent<br/>Diagnose root cause → Fix → Retry"]
REPORT["📝 Generate Final Report<br/>Markdown + HTML + JSON"]
DONE(["✅ Pipeline Complete<br/>Processed CSV + Model + Report"])
START --> S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 --> CHECK
CHECK -- "✅ Yes" --> REPORT --> DONE
CHECK -- "❌ No" --> REFLECT --> S7
style START fill:#6366f1,stroke:#818cf8,color:#fff
style S7 fill:#f59e0b,stroke:#fbbf24,color:#000
style REFLECT fill:#f43f5e,stroke:#fb7185,color:#fff
style DONE fill:#10b981,stroke:#34d399,color:#fff
style REPORT fill:#06b6d4,stroke:#22d3ee,color:#000
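The "Score OK?" branch in the flowchart amounts to a bounded retry loop around AutoML. The source does not say how the score is judged or how many retries are allowed, so the `threshold` and `max_retries` parameters below are assumptions, and the training and reflection functions are toy stand-ins for the real agents.

```python
from typing import Callable


def run_with_reflection(
    train: Callable[[dict], float],
    reflect: Callable[[dict, float], dict],
    config: dict,
    threshold: float,
    max_retries: int = 3,
) -> tuple[float, dict, int]:
    """Train once; while the score is below threshold, let `reflect` adjust the config and retrain."""
    score = train(config)
    retries = 0
    while score < threshold and retries < max_retries:
        config = reflect(config, score)
        score = train(config)
        retries += 1
    return score, config, retries


# Toy stand-ins: the score improves as more feature sets are enabled.
TOY_SCORES = {1: 0.55, 2: 0.72, 3: 0.88}


def toy_train(config: dict) -> float:
    return TOY_SCORES[config["feature_sets"]]


def toy_reflect(config: dict, score: float) -> dict:
    # A real Reflection Agent would diagnose a root cause; here we just enable one more feature set.
    return {**config, "feature_sets": config["feature_sets"] + 1}


final_score, final_config, retries = run_with_reflection(
    toy_train, toy_reflect, {"feature_sets": 1}, threshold=0.8
)
print(final_score, final_config, retries)
```

Capping the retries keeps the pipeline from looping forever on a dataset where no fix reaches the threshold; in that case the caller still gets the last score back and can report it.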
Agent Collaboration
Agents talk to each other in real time using the A2A protocol:
graph LR
EDA["📈 EDAAgent"] -- "A2A: quality risks?" --> DQ["🔍 DataQualityAgent"]
AML["🏆 AutoMLAgent"] -- "A2A: feature risks?" --> EDA2["📈 EDAAgent"]
REF["🔄 ReflectionAgent"] -- "A2A: model weaknesses?" --> AML2["🏆 AutoMLAgent"]
style EDA fill:#6366f1,color:#fff
style DQ fill:#06b6d4,color:#fff
style AML fill:#f59e0b,color:#000
style EDA2 fill:#6366f1,color:#fff
style REF fill:#f43f5e,color:#fff
style AML2 fill:#f59e0b,color:#000
Dataset Versioning
Every successful step saves a snapshot, so you can inspect or roll back to any stage:
| Stage | File | Description |
|:---:|---|---|
| 0 | stage_0_raw.csv | Your original data |
| 3 | stage_3_missing_fixed.csv | After imputation |
| 4 | stage_4_encoded.csv | After encoding |
| 6 | stage_6_final.csv | ML-ready data |
| — | automl_results.json | All model metrics |
| — | eda_report.json | EDA findings + charts |
| — | final_report.md | Report (Markdown) |
| — | final_report.html | Report (HTML) |
| — | pipeline_memory.json | Full execution log |
How to Run
Prerequisites
| Tool | Purpose | Install |
|---|---|---|
| Python 3.10+ | Runtime | python.org |
| Ollama | Local LLM | ollama.com |
Step 1 — Install Dependencies
git clone https://github.com/yourusername/AutoPrepAI.git
cd AutoPrepAI
python -m venv myenv
myenv\Scripts\activate # Windows
# source myenv/bin/activate # Linux/Mac
pip install -r requirements.txt
Step 2 — Setup Ollama
ollama serve # Start Ollama (if not running)
ollama pull llama3.2 # Download model (~2GB, one-time)
Step 3 — Start All Servers
python start_servers.py
Wait until you see:
[OK] 10 servers started successfully
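If the success message scrolls past or you want to re-check later, a quick TCP probe of the ports shown in the architecture diagram (8100 for MCP, 8201–8209 for the agents) can confirm that something is listening. This is a minimal sketch, not part of AutoPrepAI; `is_listening` and `missing_servers` are hypothetical helper names, and it assumes the servers bind to localhost.

```python
import socket

MCP_PORT = 8100
AGENT_PORTS = range(8201, 8210)  # 8201-8209 inclusive


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def missing_servers() -> list[int]:
    """Return the expected ports that do not accept a connection."""
    return [p for p in (MCP_PORT, *AGENT_PORTS) if not is_listening(p)]


if __name__ == "__main__":
    down = missing_servers()
    print("all 10 servers reachable" if not down else f"not reachable: {down}")
```

A successful connection only shows that some process is bound to the port; it does not verify that the process is the expected agent.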
Step 4 — Launch the App
streamlit run app.py
Open http://localhost:8501 → Upload your CSV → Click "Run Autonomous Pipeline" → Done! 🎉
Project Structure
AutoPrepAI/
├── app.py # Streamlit UI
├── orchestrator.py # LLM orchestrator brain
├── discovery.py # Dynamic agent/tool discovery
├── memory.py # Pipeline memory
├── start_servers.py # One-command server launcher
├── requirements.txt # Dependencies
│
├── llm/
│ └── ollama_engine.py # Ollama LLM engine + Pydantic schemas
│
├── mcp_server/
│ └── server.py # FastMCP server (8 data tools)
│
├── a2a_agents/
│ ├── base.py # Shared agent utilities
│ ├── data_understanding.py # Dataset profiling
│ ├── data_quality.py # Quality assessment
│ ├── eda.py # EDA + visual charts
│ ├── missing_values.py # Imputation
│ ├── encoding.py # Encoding
│ ├── feature_engineering.py # Feature engineering
│ ├── automl.py # AutoML + model selection
│ ├── reflection.py # Self-healing reflection
│ └── report.py # Explainability report
│
└── workdir/ # Runtime artifacts (auto-generated)
Built with 💜 using Ollama • FastMCP • A2A SDK • Streamlit