MCP server by bogdanovich
media-mcp
Backend-agnostic MCP server for durable asynchronous media download and transcription jobs.
media-mcp is meant to be used by chat agents such as PicoClaw, OpenClaw, or any MCP-capable agent runtime. It downloads media with yt-dlp, stores artifacts on disk, persists job state in SQLite, and returns artifact paths plus metadata to the calling agent.
It deliberately does not send files to Telegram, Matrix, Discord, or any other chat backend. Delivery should be handled by the agent runtime's native send_file / message tools.
Why Async Jobs
Media work is slow and failure-prone:
- social/video downloads may take seconds or minutes;
- transcription can be CPU-heavy;
- an agent runtime may restart while work is running;
- users often need status/result inspection later.
For that reason, media-mcp stores jobs durably in SQLite and runs workers in detached child processes.
Tools
download_async
Starts a durable async media download job.
Arguments:
- url (string, required): URL supported by yt-dlp.
- mode (string): audio or video. Defaults to audio.
- quality (string): low or best. Defaults to low.
- prepare_transcription_audio (boolean): create a 16 kHz mono WAV for transcription. Defaults to true.
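The defaults above can be made concrete with a minimal Go sketch of argument resolution. DownloadArgs, ResolvedDownload and resolve are hypothetical names, not media-mcp's actual API. Rejecting unknown mode or quality values, rather than silently falling back, is also an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// DownloadArgs mirrors the documented download_async arguments. Pointer
// fields let us tell "omitted" apart from an explicit false or empty value.
type DownloadArgs struct {
	URL                       string  `json:"url"`
	Mode                      *string `json:"mode,omitempty"`
	Quality                   *string `json:"quality,omitempty"`
	PrepareTranscriptionAudio *bool   `json:"prepare_transcription_audio,omitempty"`
}

// ResolvedDownload is the argument set after defaults have been applied.
type ResolvedDownload struct {
	URL     string
	Mode    string
	Quality string
	Prepare bool
}

// resolve applies the documented defaults (mode=audio, quality=low,
// prepare_transcription_audio=true) and rejects values outside the
// documented choices (rejection is an assumption of this sketch).
func resolve(a DownloadArgs) (ResolvedDownload, error) {
	r := ResolvedDownload{URL: a.URL, Mode: "audio", Quality: "low", Prepare: true}
	if r.URL == "" {
		return r, errors.New("url is required")
	}
	if a.Mode != nil {
		r.Mode = *a.Mode
	}
	if a.Quality != nil {
		r.Quality = *a.Quality
	}
	if a.PrepareTranscriptionAudio != nil {
		r.Prepare = *a.PrepareTranscriptionAudio
	}
	if r.Mode != "audio" && r.Mode != "video" {
		return r, fmt.Errorf("mode must be audio or video, got %q", r.Mode)
	}
	if r.Quality != "low" && r.Quality != "best" {
		return r, fmt.Errorf("quality must be low or best, got %q", r.Quality)
	}
	return r, nil
}

func main() {
	// Only url and mode are sent; quality and prepare_transcription_audio
	// fall back to their defaults.
	raw := `{"url": "https://example.com/watch?v=abc", "mode": "video"}`
	var a DownloadArgs
	if err := json.Unmarshal([]byte(raw), &a); err != nil {
		panic(err)
	}
	r, err := resolve(a)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", r)
}
```

Pointer fields matter for prepare_transcription_audio: with a plain bool, an omitted field and an explicit false would both decode to false, and the true default could never apply.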
Returns immediately with:
{
"ok": true,
"accepted": true,
"job_id": "job_20260504231751_bf964e43",
"type": "download",
"status": "queued"
}
When finished, the job result includes fields such as:
- asset_id
- media_path
- audio_path
- normalized_audio_path
- sendable_file_path
- title
- platform
- duration_seconds
transcribe_async
Starts a durable async transcription job for a previously downloaded asset.
Arguments:
- asset_id (string, required).
- model (string): transcription model name passed to the helper. Defaults to base.
- language (string): optional language hint such as en or ru.
- timestamps (boolean): request segment timestamps when supported.
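A minimal Go sketch of building transcribe_async arguments. TranscribeArgs and encodeTranscribeArgs are hypothetical names. The omitempty tags drop unset optional fields so the server's defaults apply; this assumes timestamps is off by default, since an explicit false cannot be sent this way.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// TranscribeArgs mirrors the documented transcribe_async arguments.
// omitempty drops unset optional fields so the server's defaults apply.
type TranscribeArgs struct {
	AssetID    string `json:"asset_id"`
	Model      string `json:"model,omitempty"`
	Language   string `json:"language,omitempty"`
	Timestamps bool   `json:"timestamps,omitempty"`
}

// encodeTranscribeArgs validates and serializes the tool-call arguments.
func encodeTranscribeArgs(a TranscribeArgs) (string, error) {
	if a.AssetID == "" {
		return "", errors.New("asset_id is required")
	}
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Model is left empty, so the server default (base) applies.
	s, err := encodeTranscribeArgs(TranscribeArgs{
		AssetID:    "20260504231751_7ad348ff",
		Language:   "ru",
		Timestamps: true,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```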
job_status
Fast status check for a job.
Use this for quick checks or when the user explicitly asks whether a job is still running.
job_wait
Waits for a job to finish and returns the same payload shape as job_result.
Arguments:
- job_id (string, required).
- timeout_seconds (number): defaults to 60, maximum 180.
- poll_interval_seconds (number): defaults to 2.
Agents should prefer job_wait over repeated job_status polling when a job is expected to finish during the current turn. This avoids wasting LLM/tool iterations.
job_result
Reads the durable result or error for a job.
Use this when a job is already known to be finished, or when resuming a previously started job by id.
Install
git clone https://github.com/your-org/media-mcp.git
cd media-mcp
go build -o media-mcp ./cmd/media-mcp
Build and runtime dependencies:
- Go 1.24+ to build.
- yt-dlp for downloads.
- ffmpeg for audio extraction/normalization.
- Optional: a transcriber command for transcribe_async.
Run
./media-mcp \
--db-path /var/lib/media-mcp/media.db \
--asset-root /var/lib/media-mcp/assets \
--yt-dlp /usr/local/bin/yt-dlp \
--ffmpeg /usr/bin/ffmpeg \
--transcriber-command python3 \
--transcriber-args "/opt/media-mcp/examples/transcribers/faster_whisper.py {audio} --model {model} {language_arg} {timestamps_arg}"
Flags:
- --db-path: SQLite job/asset database path. Required.
- --asset-root: root directory for downloaded media and transcripts. Required.
- --yt-dlp: yt-dlp executable. Defaults to yt-dlp.
- --ffmpeg: ffmpeg executable. Defaults to ffmpeg.
- --transcriber-command: executable used by transcribe_async, for example python3.
- --transcriber-args: argument template for the transcriber command.
- --transcribe-helper: deprecated compatibility shortcut for Python helpers. It expands to --transcriber-command python3 --transcriber-args "<helper> {audio} --model {model} {language_arg} {timestamps_arg}".
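The flag list maps naturally onto Go's standard flag package, which accepts both -flag and --flag. This is a hypothetical sketch of that flag surface, not the server's actual parsing code. Config and parseFlags are illustrative names, and applying the --transcribe-helper expansion only when --transcriber-command is unset is an assumption.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// Config holds the documented media-mcp flags.
type Config struct {
	DBPath, AssetRoot, YtDlp, FFmpeg    string
	TranscriberCommand, TranscriberArgs string
}

func parseFlags(args []string) (Config, error) {
	fs := flag.NewFlagSet("media-mcp", flag.ContinueOnError)
	var c Config
	var helper string
	fs.StringVar(&c.DBPath, "db-path", "", "SQLite job/asset database path (required)")
	fs.StringVar(&c.AssetRoot, "asset-root", "", "root directory for media and transcripts (required)")
	fs.StringVar(&c.YtDlp, "yt-dlp", "yt-dlp", "yt-dlp executable")
	fs.StringVar(&c.FFmpeg, "ffmpeg", "ffmpeg", "ffmpeg executable")
	fs.StringVar(&c.TranscriberCommand, "transcriber-command", "", "executable used by transcribe_async")
	fs.StringVar(&c.TranscriberArgs, "transcriber-args", "", "argument template for the transcriber")
	fs.StringVar(&helper, "transcribe-helper", "", "deprecated Python helper shortcut")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.DBPath == "" || c.AssetRoot == "" {
		return c, errors.New("--db-path and --asset-root are required")
	}
	// Documented expansion of the deprecated shortcut.
	if helper != "" && c.TranscriberCommand == "" {
		c.TranscriberCommand = "python3"
		c.TranscriberArgs = helper + " {audio} --model {model} {language_arg} {timestamps_arg}"
	}
	return c, nil
}

func main() {
	c, err := parseFlags([]string{
		"--db-path", "/var/lib/media-mcp/media.db",
		"--asset-root", "/var/lib/media-mcp/assets",
		"--transcribe-helper", "/opt/media-mcp/examples/transcribers/faster_whisper.py",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", c)
}
```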
Transcriber Adapter Contract
media-mcp delegates speech-to-text to an external command so the core server remains provider-agnostic. You can use local faster-whisper, whisper.cpp, OpenAI, Deepgram, AssemblyAI, or any other backend as long as your adapter follows the stdout JSON contract.
The command is configured with --transcriber-command and --transcriber-args.
Supported argument placeholders:
- {audio}: normalized 16 kHz mono WAV path.
- {model}: model requested by the tool call.
- {language}: raw language hint, if any.
- {language_arg}: expands to --language <language> when a language is set, otherwise empty.
- {timestamps_arg}: expands to --timestamps when timestamps are requested, otherwise empty.
Example:
--transcriber-command python3
--transcriber-args "/opt/media-mcp/examples/transcribers/faster_whisper.py {audio} --model {model} {language_arg} {timestamps_arg}"
The adapter should print JSON to stdout:
{
"backend": "local-faster-whisper",
"model": "base",
"requested_language": "ru",
"detected_language": "ru",
"language_probability": 0.98,
"text": "Transcript text...",
"segments": [
{ "start": 0.0, "end": 2.4, "text": "Transcript text..." }
]
}
segments may be omitted when timestamps are not requested.
A reference faster-whisper adapter is included at examples/transcribers/faster_whisper.py.
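The stdout contract can be expressed as Go types so that adapter output is checked before use. AdapterOutput, Segment and parseAdapterOutput are hypothetical names, and the sanity checks (text must be present, segments must not end before they start) are assumptions rather than documented server behavior. Text is a pointer so a missing field can be told apart from an empty transcript.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Segment is one timed span of the transcript.
type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

// AdapterOutput mirrors the stdout JSON contract.
type AdapterOutput struct {
	Backend             string    `json:"backend"`
	Model               string    `json:"model"`
	RequestedLanguage   string    `json:"requested_language"`
	DetectedLanguage    string    `json:"detected_language"`
	LanguageProbability float64   `json:"language_probability"`
	Text                *string   `json:"text"`     // pointer: detect a missing field
	Segments            []Segment `json:"segments"` // may be absent without timestamps
}

func parseAdapterOutput(stdout []byte) (AdapterOutput, error) {
	var out AdapterOutput
	if err := json.Unmarshal(stdout, &out); err != nil {
		return out, fmt.Errorf("adapter did not print valid JSON: %w", err)
	}
	if out.Text == nil {
		return out, errors.New(`adapter output is missing "text"`)
	}
	for i, s := range out.Segments {
		if s.End < s.Start {
			return out, fmt.Errorf("segment %d ends before it starts", i)
		}
	}
	return out, nil
}

func main() {
	// The example adapter output from the contract.
	stdout := []byte(`{
		"backend": "local-faster-whisper",
		"model": "base",
		"requested_language": "ru",
		"detected_language": "ru",
		"language_probability": 0.98,
		"text": "Transcript text...",
		"segments": [{"start": 0.0, "end": 2.4, "text": "Transcript text..."}]
	}`)
	out, err := parseAdapterOutput(stdout)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Backend, out.DetectedLanguage, *out.Text, len(out.Segments))
}
```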
MCP Config Examples
PicoClaw-style config
{
"tools": {
"mcp": {
"enabled": true,
"servers": {
"media": {
"enabled": true,
"deferred": false,
"command": "/opt/media-mcp/media-mcp",
"args": [
"--db-path",
"/var/lib/media-mcp/media.db",
"--asset-root",
"/var/lib/media-mcp/assets",
"--yt-dlp",
"/usr/local/bin/yt-dlp",
"--ffmpeg",
"/usr/bin/ffmpeg",
"--transcriber-command",
"python3",
"--transcriber-args",
"/opt/media-mcp/examples/transcribers/faster_whisper.py {audio} --model {model} {language_arg} {timestamps_arg}"
],
"type": "stdio"
}
}
}
}
}
OpenClaw-style config
{
"mcp": {
"servers": {
"media": {
"command": "/opt/media-mcp/media-mcp",
"args": [
"--db-path",
"/var/lib/media-mcp/media.db",
"--asset-root",
"/var/lib/media-mcp/assets",
"--yt-dlp",
"/usr/local/bin/yt-dlp",
"--ffmpeg",
"/usr/bin/ffmpeg",
"--transcriber-command",
"python3",
"--transcriber-args",
"/opt/media-mcp/examples/transcribers/faster_whisper.py {audio} --model {model} {language_arg} {timestamps_arg}"
]
}
}
}
}
Different runtimes expose MCP tool names differently. For a server named media, the agent may see tools as mcp_media_download_async, media__download_async, or similar.
Recommended Agent Instructions
See examples/TOOLS.md for a copy-pasteable instruction block.
The important rules are:
- Prefer quality: "low" for normal downloads.
- Use quality: "best" only when explicitly requested or when low quality is unusable.
- Use job_wait instead of repeatedly polling job_status.
- Do not treat the initial accepted: true response as completion.
- Do not ask media-mcp to send files to chat. Use the agent runtime's native delivery tool with sendable_file_path, transcript_path, or another artifact path.
Typical Flows
Download a video and send it to the user
- Call download_async with mode: "video" and quality: "low".
- Call job_wait with the returned job_id.
- Read result.sendable_file_path.
- Use the runtime's native file delivery tool.
Extract a recipe from Instagram/TikTok/YouTube media
- Call download_async with:
  - mode: "video"
  - quality: "low"
  - prepare_transcription_audio: true
- Call job_wait.
- Call transcribe_async with the resulting asset_id.
- Call job_wait.
- Use result.text or result.transcript_path to extract ingredients and steps.
- Reply with a concise structured recipe.
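The tool-call sequence of this flow can be sketched in Go against a stand-in client. CallTool, startJob, waitResult, transcribeURL and fakeServer are all hypothetical; the real calls go through whatever MCP client the agent runtime provides. The sketch assumes the job_wait payload carries status and result keys, as the result.text references suggest, and treats a job still running after the timeout as an error where a real agent could call job_wait again. The recipe extraction itself is left to the model.

```go
package main

import (
	"errors"
	"fmt"
)

// CallTool sends one tool call and returns the decoded JSON object.
type CallTool func(name string, args map[string]any) (map[string]any, error)

// startJob calls an *_async tool and returns the accepted job_id.
func startJob(call CallTool, tool string, args map[string]any) (string, error) {
	acc, err := call(tool, args)
	if err != nil {
		return "", err
	}
	id, ok := acc["job_id"].(string)
	if !ok {
		return "", fmt.Errorf("%s returned no job_id", tool)
	}
	return id, nil
}

// waitResult calls job_wait and returns the result object of a succeeded job.
func waitResult(call CallTool, jobID string) (map[string]any, error) {
	res, err := call("job_wait", map[string]any{"job_id": jobID, "timeout_seconds": 180})
	if err != nil {
		return nil, err
	}
	if res["status"] != "succeeded" {
		return nil, fmt.Errorf("job %s ended with status %v", jobID, res["status"])
	}
	result, ok := res["result"].(map[string]any)
	if !ok {
		return nil, errors.New("job_wait payload has no result object")
	}
	return result, nil
}

// transcribeURL runs download, wait, transcribe, wait and returns the text.
func transcribeURL(call CallTool, url string) (string, error) {
	id, err := startJob(call, "download_async", map[string]any{
		"url": url, "mode": "video", "quality": "low", "prepare_transcription_audio": true,
	})
	if err != nil {
		return "", err
	}
	dl, err := waitResult(call, id)
	if err != nil {
		return "", err
	}
	if id, err = startJob(call, "transcribe_async", map[string]any{"asset_id": dl["asset_id"]}); err != nil {
		return "", err
	}
	tr, err := waitResult(call, id)
	if err != nil {
		return "", err
	}
	text, _ := tr["text"].(string)
	return text, nil
}

// fakeServer simulates media-mcp in memory for the demo; every job succeeds.
func fakeServer() CallTool {
	results := map[string]map[string]any{}
	return func(name string, args map[string]any) (map[string]any, error) {
		switch name {
		case "download_async", "transcribe_async":
			id := fmt.Sprintf("job_%d", len(results)+1)
			results[id] = map[string]any{"asset_id": "asset_1", "text": "demo transcript"}
			return map[string]any{"ok": true, "accepted": true, "job_id": id, "status": "queued"}, nil
		case "job_wait":
			r, ok := results[fmt.Sprint(args["job_id"])]
			if !ok {
				return nil, fmt.Errorf("unknown job %v", args["job_id"])
			}
			return map[string]any{"status": "succeeded", "result": r}, nil
		}
		return nil, fmt.Errorf("unknown tool %q", name)
	}
}

func main() {
	text, err := transcribeURL(fakeServer(), "https://example.com/reel")
	fmt.Println(text, err)
}
```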
Resume an old job
- Call job_status with the known job_id.
- If succeeded or failed, call job_result.
- If still running, call job_wait with a reasonable timeout or tell the user it is still running.
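The resume decision reduces to a small mapping from job_status values to the next tool. nextTool is a hypothetical helper; treating queued like running is an assumption, since the flow only names running, and unknown statuses are left for the agent to report rather than guessed at.

```go
package main

import "fmt"

// nextTool maps a job_status value to the follow-up call from the resume flow.
// An empty result means "report the status to the user".
func nextTool(status string) string {
	switch status {
	case "succeeded", "failed":
		return "job_result"
	case "queued", "running":
		return "job_wait"
	default:
		return ""
	}
}

func main() {
	for _, s := range []string{"queued", "running", "succeeded", "failed", "unknown"} {
		fmt.Printf("%s -> %q\n", s, nextTool(s))
	}
}
```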
Storage Layout
SQLite stores job and asset metadata. Media artifacts are stored under --asset-root by asset_id.
Example:
assets/
  20260504231751_7ad348ff/
    source.mp4
    source.info.json
    source.m4a
    audio_16k.wav
    transcript.txt
    transcript.json
  _worker-logs/
    job_20260504231818_bbb042b2.log
Security Notes
- Treat downloaded media as untrusted user-controlled files.
- Do not expose --asset-root publicly without an access-control layer.
- Keep SQLite and artifacts outside your agent prompt/workspace if users should not browse them directly.
- Use OS permissions to restrict who can read transcripts and media files.
- Be careful with cookies/browser profiles used by yt-dlp; this server does not manage secrets for you.
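One defensive step an agent runtime might add (a suggestion, not something media-mcp does) is checking that a path taken from a job result lies under --asset-root before handing it to a delivery tool. insideRoot is a hypothetical helper built on filepath.Rel and filepath.IsLocal (Go 1.20+). The check is lexical, so it does not resolve symlinks; call filepath.EvalSymlinks on both paths first if the asset tree may contain links.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// insideRoot reports whether path lies strictly under root (lexical check).
func insideRoot(root, path string) (bool, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return false, err
	}
	absPath, err := filepath.Abs(path)
	if err != nil {
		return false, err
	}
	rel, err := filepath.Rel(absRoot, absPath)
	if err != nil {
		return false, nil // e.g. different volumes on Windows
	}
	// IsLocal rejects "..", "../x" and absolute results; "." is the root itself.
	return rel != "." && filepath.IsLocal(rel), nil
}

func main() {
	root := "/var/lib/media-mcp/assets"
	for _, p := range []string{
		"/var/lib/media-mcp/assets/20260504231751_7ad348ff/source.mp4",
		"/var/lib/media-mcp/assets/../media.db",
		"/etc/passwd",
	} {
		ok, err := insideRoot(root, p)
		fmt.Println(p, ok, err)
	}
}
```

A plain strings.HasPrefix check would wrongly accept a sibling such as /var/lib/media-mcp/assets-other, which is why the sketch compares relative paths instead.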
Development
go test ./...
go build -o media-mcp ./cmd/media-mcp