MCP server by haji-mi
System Perception MCP Server
A high-performance Model Context Protocol (MCP) server designed for AI Agents to perceive and control the Windows operating system with ultra-low latency and zero physical mouse/keyboard interference.
🌟 Core Features
- Ultra-Low Latency Screen Perception: Bypasses traditional slow screenshot methods. Uses `dxcam` for direct DXGI VRAM capture and OpenCV for in-memory compression, delivering screen frames to the agent in roughly 120 ms.
- Silent Background Control: Eliminates the fragile and disruptive nature of physical mouse/keyboard simulation. Uses `win32api` and `uiautomation` to send underlying system messages (`PostMessage`) and invoke UI elements silently.
- UI Tree Parsing (`get_ui_tree`): Instantly reads the accessibility tree of standard Windows applications, bypassing the slow Vision-Language Model (VLM) coordinate-calculation bottleneck.
- Instant Execution (`invoke_ui_element`): Directly triggers standard OS elements (desktop icons, buttons, text fields) in under a second, addressing them by UI element identity rather than screen coordinates.
🛠️ Requirements
- OS: Windows 10 / 11 (Requires DXGI and Windows UIAutomation APIs)
- Python: 3.8+
- Agent Harness: Any MCP-compatible client (e.g., Claude Desktop, DeerFlow)
📦 Installation
- Clone this repository:

  ```
  git clone <YOUR_GITHUB_REPO_URL>
  cd system-perception-mcp
  ```

- Install the required dependencies:

  ```
  pip install -r requirements.txt
  ```
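To register the server with an MCP client such as Claude Desktop, an entry along these lines can be added to the client's configuration file. The `server.py` entrypoint and install path below are placeholders; adjust them to match this repository's actual launch script.

```json
{
  "mcpServers": {
    "system-perception": {
      "command": "python",
      "args": ["C:\\path\\to\\system-perception-mcp\\server.py"]
    }
  }
}
```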
🚀 Exposed Tools
Once connected to an MCP client, the following tools become available to the LLM/Agent:
- `get_gpu_frame()`: Instantly captures the current screen from the GPU frame buffer.
- `get_ui_tree()`: Scans and returns the current window's hierarchical UI structure.
- `invoke_ui_element(element_name/id)`: Directly interacts with a specific UI node without moving the physical cursor.
- `silent_mouse_click(x, y, hwnd)`: Sends a background click event to specific coordinates within a target window.
- `silent_keyboard_type(text, hwnd)`: Injects keystrokes directly into a background application's message queue.
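The shape of the data behind `get_ui_tree` can be sketched as a simple recursive walk. This assumes the `uiautomation` package's `Control` interface (`Name`, `ControlTypeName`, `GetChildren`); the `walk_ui_tree` helper is illustrative, not the server's actual code.

```python
from typing import List


def walk_ui_tree(control, depth: int = 0, max_depth: int = 3) -> List[dict]:
    """Flatten a UIAutomation control subtree into JSON-friendly nodes.

    Works with any object exposing Name, ControlTypeName, and GetChildren(),
    e.g. controls from `uiautomation.GetRootControl()` on Windows.
    """
    nodes = [{
        "name": control.Name,
        "type": control.ControlTypeName,
        "depth": depth,
    }]
    if depth < max_depth:  # cap recursion; real UI trees can be very deep
        for child in control.GetChildren():
            nodes.extend(walk_ui_tree(child, depth + 1, max_depth))
    return nodes
```

A flat list of named, typed nodes like this is cheap to serialize and lets the agent pick a target element by name instead of asking a VLM for pixel coordinates.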
💡 Why This Approach?
Traditional visual AI agents rely on taking screenshots, sending them to a VLM, waiting 2-4 seconds for coordinate calculation, and then physically moving the user's cursor. This is slow, fragile, and prevents the user from using their computer while the agent is working.
System Perception MCP solves this by fusing computer vision with native OS UI Automation, allowing the agent to "see" instantly and "act" invisibly.
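The "act invisibly" half of that fusion can be sketched with pywin32's `PostMessage`, which delivers a click to a window's message queue without touching the physical cursor. The helper names below are illustrative, not the server's internals; the message packing follows the documented `WM_LBUTTONDOWN` convention (client-area x in the low word of lParam, y in the high word).

```python
def make_lparam(x: int, y: int) -> int:
    """Pack client-area coordinates into the lParam of a Windows mouse message."""
    return ((y & 0xFFFF) << 16) | (x & 0xFFFF)


def silent_click(hwnd: int, x: int, y: int) -> None:
    """Windows-only: post a left click to hwnd without moving the real cursor."""
    import win32api
    import win32con

    lparam = make_lparam(x, y)
    win32api.PostMessage(hwnd, win32con.WM_LBUTTONDOWN, win32con.MK_LBUTTON, lparam)
    win32api.PostMessage(hwnd, win32con.WM_LBUTTONUP, 0, lparam)
```

Because the message is posted to one window's queue, the user's own mouse and focus are left alone; note that some applications ignore posted input and need the UIAutomation `invoke` path instead.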
📝 License
MIT License