OfficeReader-MCP
A Model Context Protocol (MCP) server that converts Microsoft Office documents (Word, Excel, PowerPoint) to Markdown format with intelligent image extraction and optimization.
Features
- Multi-Format Support: Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt)
- Intelligent Image Processing: Automatic extraction and optimization with WebP compression
- Format Preservation: Maintains document structure including headings, tables, lists, and formatting
- Metadata Extraction: Access document properties (author, title, creation date, etc.)
- Efficient Caching: Smart caching system for quick reuse of converted documents
- Cross-Platform: Works on Windows, macOS, and Linux
Supported Formats
| Format | Extensions | Features |
|--------|------------|----------|
| Word | .docx, .doc | Text formatting, headings, lists, tables, images |
| Excel | .xlsx, .xls | Multi-sheet support, tables, charts, embedded images |
| PowerPoint | .pptx, .ppt | Slides, text boxes, images, speaker notes, tables |
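The table above implies a simple extension-based lookup. A minimal sketch of that idea follows; `detect_format` and `SUPPORTED_EXTENSIONS` are hypothetical names used for illustration, not necessarily what the project's file type detection looks like.

```python
from pathlib import Path
from typing import Optional

# Extension-to-format mapping copied from the Supported Formats table.
SUPPORTED_EXTENSIONS = {
    ".docx": "Word",
    ".doc": "Word",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".ppt": "PowerPoint",
}


def detect_format(file_path: str) -> Optional[str]:
    """Return the format name for a path, or None if the extension is unsupported.

    Hypothetical helper: the real server may detect file types differently.
    """
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    return SUPPORTED_EXTENSIONS.get(suffix)


print(detect_format("keynote.pptx"))
print(detect_format("notes.txt"))
```

A lookup like this is useful for filtering a folder before asking for a batch conversion, since unsupported files can be skipped up front.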
Installation
Prerequisites
- Python 3.10 or higher
- Claude Desktop or Claude Code
Step 1: Install the Package
# Clone the repository
git clone https://github.com/Asunainlove/office-reader-mcp.git
cd office-reader-mcp
# Install in editable mode
pip install -e .
Step 2: Configure Claude
For Claude Desktop
Add to your Claude Desktop config file:
Windows: %APPDATA%\Claude\claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "officereader": {
      "command": "python",
      "args": ["-m", "officereader_mcp.server"],
      "env": {
        "OFFICEREADER_CACHE_DIR": "/path/to/cache"
      }
    }
  }
}
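If the config file already lists other MCP servers, the officereader entry has to be merged in rather than pasted over the file. The sketch below shows one way to do that merge on an already-loaded config dictionary; `add_officereader_entry` is a hypothetical helper, and the `"other"` server in the example is made up for illustration.

```python
import json
from copy import deepcopy


def add_officereader_entry(config: dict, cache_dir: str) -> dict:
    """Return a copy of config with the officereader server entry added.

    Existing entries under "mcpServers" are left untouched.
    """
    updated = deepcopy(config)
    servers = updated.setdefault("mcpServers", {})
    servers["officereader"] = {
        "command": "python",
        "args": ["-m", "officereader_mcp.server"],
        "env": {"OFFICEREADER_CACHE_DIR": cache_dir},
    }
    return updated


# Example: a config that already contains another (made-up) server.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(add_officereader_entry(existing, "/path/to/cache"), indent=2))
```

Working on a copy keeps the original dictionary unchanged, so the result can be inspected before it is written back to disk.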
For Claude Code
Add to your Claude Code settings:
Windows: %LOCALAPPDATA%\claude-code\settings.json
macOS/Linux: ~/.config/claude-code/settings.json
{
  "mcpServers": {
    "officereader": {
      "command": "python",
      "args": ["-m", "officereader_mcp.server"],
      "env": {
        "OFFICEREADER_CACHE_DIR": "/path/to/cache"
      }
    }
  }
}
Step 3: Restart Claude
Restart Claude Desktop or Claude Code to load the MCP server.
Quick Start
After installation, you can use OfficeReader-MCP directly in your conversations with Claude:
Convert my Excel file at D:\Reports\sales_2024.xlsx to markdown
Extract text and images from D:\Presentations\keynote.pptx
Get metadata from my document at C:\Documents\report.docx
Available Tools
1. convert_document
Convert any supported Office document to Markdown format.
Parameters:
- file_path (required): Absolute path to the document
- extract_images (optional, default: true): Extract embedded images
- image_format (optional, default: "file"): How to handle images
  - "file": Save images to disk (recommended)
  - "base64": Embed images as base64 in markdown
  - "both": Both save and embed
- output_name (optional): Custom name for output files
Example:
Convert D:\Documents\report.xlsx with images
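Claude normally builds the tool call for you, but it can help to see what the request roughly looks like on the wire. The sketch below assembles an MCP `tools/call` JSON-RPC request for convert_document using the parameters listed above; `build_convert_request` is a hypothetical helper, and the exact argument validation performed by the server is an assumption.

```python
import json
from typing import Optional

ALLOWED_IMAGE_FORMATS = {"file", "base64", "both"}


def build_convert_request(
    file_path: str,
    extract_images: bool = True,
    image_format: str = "file",
    output_name: Optional[str] = None,
    request_id: int = 1,
) -> dict:
    """Build a JSON-RPC 2.0 request that calls the convert_document tool."""
    if image_format not in ALLOWED_IMAGE_FORMATS:
        raise ValueError(f"image_format must be one of {sorted(ALLOWED_IMAGE_FORMATS)}")
    arguments = {
        "file_path": file_path,
        "extract_images": extract_images,
        "image_format": image_format,
    }
    if output_name is not None:  # optional parameter, sent only when set
        arguments["output_name"] = output_name
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "convert_document", "arguments": arguments},
    }


print(json.dumps(build_convert_request(r"D:\Documents\report.xlsx"), indent=2))
```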
2. read_converted_markdown
Read the full content of a previously converted markdown file.
Parameters:
- markdown_path (required): Path to the markdown file
Example:
Read the markdown at D:\cache\output\report_abc12345\report_abc12345.md
3. list_conversions
List all cached document conversions with details.
Example:
List all converted documents
4. clear_cache
Clear all cached conversions to free up disk space.
Example:
Clear the document cache
5. get_document_metadata
Extract metadata from a document without full conversion (faster).
Parameters:
- file_path (required): Path to the document
Example:
Get metadata from D:\Documents\presentation.pptx
6. get_supported_formats
Get list of all supported file formats and extensions.
Example:
What file formats does officereader support?
Output Structure
Converted documents are organized in the cache directory:
cache/
└── output/
    └── document_name_abc12345/
        ├── document_name_abc12345.md    # Converted markdown
        └── images/
            ├── image_001.webp           # Optimized images
            ├── slide2_image_002.webp
            └── excel_image_003.webp
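Given that layout, converted files can also be located from a script without going through the MCP tools. The sketch below walks a cache directory and pairs each markdown file with its images; `find_conversions` is a hypothetical helper that assumes only the folder structure shown above.

```python
from pathlib import Path
from typing import List, Tuple


def find_conversions(cache_dir: str) -> List[Tuple[Path, List[Path]]]:
    """Return (markdown_path, image_paths) for each folder under <cache_dir>/output.

    Assumes each conversion folder contains a markdown file named after the
    folder and an optional images/ subdirectory.
    """
    output_root = Path(cache_dir) / "output"
    results: List[Tuple[Path, List[Path]]] = []
    if not output_root.is_dir():
        return results
    for folder in sorted(p for p in output_root.iterdir() if p.is_dir()):
        markdown = folder / f"{folder.name}.md"
        if not markdown.is_file():
            continue  # not a conversion folder, skip it
        images_dir = folder / "images"
        images = sorted(images_dir.iterdir()) if images_dir.is_dir() else []
        results.append((markdown, images))
    return results


for markdown, images in find_conversions("/path/to/cache"):
    print(markdown, len(images), "image(s)")
```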
Image Optimization
Images are automatically optimized to reduce file size while maintaining quality:
- Max Dimensions: 1920×1080 pixels (configurable)
- Format: WebP (preferred) or PNG/JPEG fallback
- Quality: 80% for photos, 85% for JPEG, lossless PNG for graphics with transparency
- Typical Compression: 50-80% size reduction
- Smart Detection: Automatically distinguishes between photos and graphics
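To make the dimension limit concrete, here is the arithmetic for shrinking an image to fit a 1920×1080 box while keeping its aspect ratio. This is a sketch under the assumption that the limit acts as a bounding box and that smaller images are never enlarged; `fit_within` is a hypothetical helper, and the optimizer itself may apply the limit differently (for example, as a single maximum dimension).

```python
from typing import Tuple


def fit_within(
    width: int, height: int, max_width: int = 1920, max_height: int = 1080
) -> Tuple[int, int]:
    """Scale (width, height) down to fit the box, preserving aspect ratio.

    Images already inside the box are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # The tighter of the two constraints decides the scale; cap at 1.0.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(3840, 2160))  # 4K landscape image
print(fit_within(800, 600))    # already small enough
```

For a 3840×2160 source both constraints give a scale of 0.5, so the result is exactly 1920×1080; an 800×600 source is left as it is.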
Technical Details
Architecture
OfficeReader-MCP/
├── src/officereader_mcp/
│   ├── server.py              # MCP server implementation
│   ├── converter.py           # Word converter (DocxConverter, OfficeConverter)
│   ├── excel_converter.py     # Excel to Markdown converter
│   ├── pptx_converter.py      # PowerPoint to Markdown converter
│   ├── image_optimizer.py     # Image compression utility
│   └── __init__.py            # Package initialization
├── test/
│   ├── test_converter.py      # Basic functionality tests
│   └── test_all_formats.py    # Comprehensive test suite
├── pyproject.toml             # Project configuration
└── README.md                  # Documentation
Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| mcp | >=1.0.0 | Model Context Protocol SDK |
| python-docx | >=1.1.0 | DOCX file parsing and manipulation |
| mammoth | >=1.6.0 | DOC/DOCX to HTML conversion (fallback) |
| Pillow | >=10.0.0 | Image processing and optimization |
| markdownify | >=0.11.0 | HTML to Markdown conversion |
| openpyxl | >=3.1.0 | Excel file parsing |
| python-pptx | >=0.6.21 | PowerPoint file parsing |
All dependencies are automatically installed when you run pip install -e .
Testing
Run Tests
# Basic converter test
python test/test_converter.py
# Comprehensive test suite for all formats
python test/test_all_formats.py
# Test with a specific document
python test/test_converter.py path/to/your/document.docx
Test Coverage
The test suite verifies:
- Module imports and initialization
- Converter functionality for all formats
- Image extraction and optimization
- File type detection
- Cache management
- Metadata extraction
Configuration
OfficeReader-MCP supports multiple configuration methods to customize cache locations and behavior.
Quick Configuration (Recommended)
- Copy the example config file:
  cp config.example.json config.json
- Edit config.json to set your cache directory:
  {
    "cache_dir": "D:/MyDocuments/OfficeReaderCache",
    "image_optimization": {
      "enabled": true,
      "max_dimension": 1920,
      "quality": 80
    }
  }
- The config file will be automatically loaded on startup.
For detailed configuration options, see CONFIG.md.
Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| OFFICEREADER_CACHE_DIR | Directory for cached conversions | System temp directory |
Example usage:
# Set custom cache directory
export OFFICEREADER_CACHE_DIR=/path/to/custom/cache
# Or in Windows
set OFFICEREADER_CACHE_DIR=C:\path\to\custom\cache
Note: Environment variables take priority over config file settings.
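The precedence described here (environment variable first, then config file, then the system temp directory) can be sketched as a small resolver. `resolve_cache_dir` is a hypothetical helper; the `cache_dir` key comes from the config example, and the exact default subfolder the server uses inside the temp directory is not specified, so the sketch returns the temp directory itself.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping


def resolve_cache_dir(environ: Mapping[str, str], config: Mapping[str, object]) -> Path:
    """Pick the cache directory: environment variable, then config, then temp dir."""
    env_value = environ.get("OFFICEREADER_CACHE_DIR")
    if env_value:
        return Path(env_value)
    config_value = config.get("cache_dir")
    if config_value:
        return Path(str(config_value))
    # Assumption: fall back to the system temp directory as the documented default.
    return Path(tempfile.gettempdir())


print(resolve_cache_dir(os.environ, {"cache_dir": "D:/MyDocuments/OfficeReaderCache"}))
```

Passing the environment and config in as arguments, rather than reading them inside the function, makes the precedence easy to check in isolation.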
Usage Examples
Converting Excel with Multiple Sheets
User: Convert my Excel file at D:\Reports\Q4_sales.xlsx
Claude: I'll convert that Excel file. Each sheet will be converted to a separate
section in the markdown with properly formatted tables...
[Output includes all sheets as markdown tables with preserved formatting]
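To illustrate what "sheets as markdown tables" means in the exchange above, the sketch below renders a list of rows as a Markdown table, treating the first row as the header. It is an illustration only and not the project's converter: `rows_to_markdown_table` is a hypothetical helper, and the sample data is made up.

```python
from typing import List, Sequence


def rows_to_markdown_table(rows: Sequence[Sequence[object]]) -> str:
    """Render rows as a Markdown table; the first row becomes the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def format_cell(cell: object) -> str:
        text = "" if cell is None else str(cell)
        # Escape pipes and flatten newlines so a cell cannot break the table.
        return text.replace("|", "\\|").replace("\n", " ")

    def format_row(row: Sequence[object]) -> str:
        cells: List[str] = [format_cell(c) for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([format_row(header), separator, *(format_row(r) for r in body)])


# Made-up sample data standing in for one worksheet.
print(rows_to_markdown_table([["Region", "Q4"], ["North", 1200], ["South", None]]))
```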
Extracting PowerPoint Content
User: Extract all text and images from D:\Presentations\product_launch.pptx
Claude: Converting the PowerPoint presentation. I'll extract text from each slide,
including speaker notes, along with all embedded images...
[Output includes slide-by-slide breakdown with images and notes]
Batch Processing
User: Convert all Office documents in D:\Documents\
Claude: I'll convert each document and cache the results for quick access...
[Processes all supported files and provides summary]
Troubleshooting
"Module not found" Error
# Reinstall the package
pip install -e .
Configuration Not Loading
- Verify the config file location is correct
- Check JSON syntax is valid (use a JSON validator)
- Restart Claude Desktop or Claude Code completely
- Check logs for error messages
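For the JSON syntax check, Python's standard library is enough and reports where parsing failed. The sketch below is a generic validator, not part of OfficeReader-MCP; `check_json_file` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Optional


def check_json_file(path: str) -> Optional[str]:
    """Return None if the file parses as JSON, otherwise a short error message."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return f"{path}: file not found"
    except json.JSONDecodeError as exc:
        # lineno and colno point at the position where parsing failed.
        return f"{path}: line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


problem = check_json_file("config.json")
print(problem or "config.json is valid JSON")
```

A common cause of failure is a trailing comma after the last entry in an object, which strict JSON does not allow.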
Images Not Extracting
Possible causes:
- Document contains linked images (not embedded)
- Insufficient write permissions for cache directory
- Image format not supported by the document library
Solution:
# Verify cache directory exists and check its permissions (Unix/macOS)
ls -la /path/to/cache
# On Windows
dir C:\path\to\cache
# Check if images are embedded
# Use convert_document with extract_images=true explicitly
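A directory listing shows permissions but does not prove that the current user can actually write there. As a cross-platform alternative, the sketch below tries to create a temporary file in the directory; `is_writable` is a hypothetical helper and not part of OfficeReader-MCP.

```python
import tempfile
from pathlib import Path


def is_writable(directory: str) -> bool:
    """Return True if a file can be created in the directory, False otherwise."""
    path = Path(directory)
    if not path.is_dir():
        return False
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


print(is_writable("/path/to/cache"))
```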
Encoding Issues
The converter uses UTF-8 encoding throughout. If you see garbled text:
- Check the source document encoding
- Ensure your terminal/console supports UTF-8
- Try converting with different system locale settings
Changelog
v2.0.0 (2024-11)
Major Features:
- Added Excel (.xlsx, .xls) support with multi-sheet conversion
- Added PowerPoint (.pptx, .ppt) support with slide extraction
- Implemented intelligent image optimization with WebP compression
- Added unified OfficeConverter interface for all document types
- Enhanced metadata extraction for all formats
Improvements:
- Smart caching system with hash-based file identification
- Lazy-loading of format-specific converters for better performance
- Better error handling and validation
- Comprehensive test suite for all formats
Tools:
- Added get_supported_formats tool
- Enhanced get_document_metadata for all formats
- Improved list_conversions with detailed cache information
v1.0.0 (2024-09)
- Initial release
- Word document (.docx, .doc) conversion
- Basic image extraction
- MCP server implementation
Contributing
Contributions are welcome! Here's how you can help:
- Report Bugs: Open an issue with details and steps to reproduce
- Suggest Features: Describe your idea and use case
- Submit Pull Requests:
  - Fork the repository
  - Create a feature branch (git checkout -b feature/amazing-feature)
  - Commit your changes (git commit -m 'Add amazing feature')
  - Push to your branch (git push origin feature/amazing-feature)
  - Open a Pull Request
Development Setup
# Clone and install with dev dependencies
git clone https://github.com/Asunainlove/office-reader-mcp.git
cd office-reader-mcp
pip install -e ".[dev]"
# Run tests
python test/test_all_formats.py
# Run linting (if configured)
black src/
ruff check src/
License
MIT License - see LICENSE file for details.
Author
Asunainlove
- GitHub: @Asunainlove
- Repository: office-reader-mcp
- Issues: Report a bug
Acknowledgments
This project uses the following open-source libraries:
- Model Context Protocol (MCP) by Anthropic
- python-docx for Word processing
- openpyxl for Excel processing
- python-pptx for PowerPoint processing
- Pillow for image processing
Support
If you find this project helpful, please:
- ⭐ Star the repository
- 🐛 Report bugs and issues
- 💡 Suggest new features
- 🔀 Contribute code improvements
Happy converting! 🚀