MCP server by bogdanovich

Created 5/4/2026

media-mcp

Backend-agnostic MCP server for durable asynchronous media download and transcription jobs.

media-mcp is designed for chat agents such as PicoClaw and OpenClaw, or any other MCP-capable agent runtime. It downloads media with yt-dlp, stores artifacts on disk, persists job state in SQLite, and returns artifact paths plus metadata to the calling agent.

It deliberately does not send files to Telegram, Matrix, Discord, or any other chat backend. Delivery should be handled by the agent runtime's native send_file / message tools.

Why Async Jobs

Media work is slow and failure-prone:

  • social/video downloads may take seconds or minutes;
  • transcription can be CPU-heavy;
  • an agent runtime may restart while work is running;
  • users often need status/result inspection later.

For that reason, media-mcp stores jobs durably in SQLite and runs workers in detached child processes.
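The durable-job pattern can be sketched with Python's sqlite3 module (the table and field names here are illustrative, not the server's actual schema):

```python
import json
import sqlite3

# Illustrative schema; the real media-mcp schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE jobs (
        job_id TEXT PRIMARY KEY,
        type   TEXT NOT NULL,
        status TEXT NOT NULL,   -- queued | running | succeeded | failed
        result TEXT             -- JSON payload once finished
    )"""
)

# Enqueue: the MCP tool call returns immediately after this insert.
conn.execute(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    ("job_20260504231751_bf964e43", "download", "queued", None),
)

# A detached worker later records the outcome. The state survives agent
# restarts because it lives in SQLite, not in the agent process.
conn.execute(
    "UPDATE jobs SET status = ?, result = ? WHERE job_id = ?",
    ("succeeded", json.dumps({"media_path": "/tmp/video.mp4"}),
     "job_20260504231751_bf964e43"),
)

row = conn.execute(
    "SELECT status FROM jobs WHERE job_id = ?",
    ("job_20260504231751_bf964e43",),
).fetchone()
print(row[0])  # succeeded
```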

Tools

download_async

Starts a durable async media download job.

Arguments:

  • url string, required: URL supported by yt-dlp.
  • mode string: audio or video. Defaults to audio.
  • quality string: low or best. Defaults to low.
  • prepare_transcription_audio boolean: create 16 kHz mono WAV for transcription. Defaults to true.

Returns immediately with:

{
  "ok": true,
  "accepted": true,
  "job_id": "job_20260504231751_bf964e43",
  "type": "download",
  "status": "queued"
}

When finished, the job result includes fields such as:

  • asset_id
  • media_path
  • audio_path
  • normalized_audio_path
  • sendable_file_path
  • title
  • platform
  • duration_seconds

transcribe_async

Starts a durable async transcription job for a previously downloaded asset.

Arguments:

  • asset_id string, required.
  • model string: transcription model name passed to the helper. Defaults to base.
  • language string: optional language hint such as en or ru.
  • timestamps boolean: request segment timestamps when supported.

job_status

Fast status check for a job.

Use this for quick checks or when the user explicitly asks whether a job is still running.

job_wait

Waits for a job to finish and returns the same payload shape as job_result.

Arguments:

  • job_id string, required.
  • timeout_seconds number: defaults to 60, maximum 180.
  • poll_interval_seconds number: defaults to 2.

Agents should prefer job_wait over repeated job_status polling when a job is expected to finish during the current turn. This avoids wasting LLM/tool iterations.
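The difference can be illustrated with a client-side sketch: job_wait effectively runs this loop on the server, so the agent spends one tool call instead of many (get_status below is a hypothetical stand-in for a job_status call):

```python
import time

def wait_for_job(get_status, timeout_seconds=60, poll_interval_seconds=2):
    """Poll until the job leaves queued/running or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status not in ("queued", "running"):
            return status
        time.sleep(poll_interval_seconds)
    return "timeout"

# Simulated job that finishes on its third status check.
statuses = iter(["queued", "running", "succeeded"])
print(wait_for_job(lambda: next(statuses), poll_interval_seconds=0))  # succeeded
```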

job_result

Reads the durable result or error for a job.

Use this when a job is already known to be finished, or when resuming a previously started job by id.

Install

git clone https://github.com/bogdanovich/media-mcp.git
cd media-mcp
go build -o media-mcp ./cmd/media-mcp

Runtime dependencies:

  • Go 1.24+ to build.
  • yt-dlp for downloads.
  • ffmpeg for audio extraction/normalization.
  • Optional transcriber command for transcribe_async.

Run

./media-mcp \
  --db-path /var/lib/media-mcp/media.db \
  --asset-root /var/lib/media-mcp/assets \
  --yt-dlp /usr/local/bin/yt-dlp \
  --ffmpeg /usr/bin/ffmpeg \
  --transcriber-command python3 \
  --transcriber-args "/opt/media-mcp/examples/transcribers/faster_whisper.py {audio} --model {model} {language_arg} {timestamps_arg}"

Flags:

  • --db-path: SQLite job/asset database path. Required.
  • --asset-root: root directory for downloaded media and transcripts. Required.
  • --yt-dlp: yt-dlp executable. Defaults to yt-dlp.
  • --ffmpeg: ffmpeg executable. Defaults to ffmpeg.
  • --transcriber-command: executable used by transcribe_async, for example python3.
  • --transcriber-args: argument template for the transcriber command.
  • --transcribe-helper: deprecated compatibility shortcut for Python helpers. It expands to --transcriber-command python3 --transcriber-args "<helper> {audio} --model {model} {language_arg} {timestamps_arg}".

Transcriber Adapter Contract

media-mcp delegates speech-to-text to an external command so the core server remains provider-agnostic. You can use local faster-whisper, whisper.cpp, OpenAI, Deepgram, AssemblyAI, or any other backend as long as your adapter follows the stdout JSON contract.

The command is configured with --transcriber-command and --transcriber-args.

Supported argument placeholders:

  • {audio}: normalized 16 kHz mono WAV path.
  • {model}: model requested by the tool call.
  • {language}: raw language hint, if any.
  • {language_arg}: expands to --language <language> when language is set, otherwise empty.
  • {timestamps_arg}: expands to --timestamps when timestamps are requested, otherwise empty.
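The placeholder expansion above can be sketched in Python (a hypothetical re-implementation of what the server does, not its actual code):

```python
def expand_transcriber_args(template, audio, model, language=None, timestamps=False):
    """Substitute the supported placeholders into a --transcriber-args template."""
    replacements = {
        "{audio}": audio,
        "{model}": model,
        "{language}": language or "",
        "{language_arg}": f"--language {language}" if language else "",
        "{timestamps_arg}": "--timestamps" if timestamps else "",
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    # Collapse the doubled spaces left behind by empty expansions.
    return " ".join(template.split())

print(expand_transcriber_args(
    "helper.py {audio} --model {model} {language_arg} {timestamps_arg}",
    audio="/tmp/audio_16k.wav", model="base", language="ru", timestamps=True,
))
# helper.py /tmp/audio_16k.wav --model base --language ru --timestamps
```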

Example:

--transcriber-command python3
--transcriber-args "/opt/media-mcp/examples/transcribers/faster_whisper.py {audio} --model {model} {language_arg} {timestamps_arg}"

The adapter should print JSON to stdout:

{
  "backend": "local-faster-whisper",
  "model": "base",
  "requested_language": "ru",
  "detected_language": "ru",
  "language_probability": 0.98,
  "text": "Transcript text...",
  "segments": [
    { "start": 0.0, "end": 2.4, "text": "Transcript text..." }
  ]
}

segments may be omitted when timestamps are not requested.

A reference faster-whisper adapter is included at examples/transcribers/faster_whisper.py.
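A minimal adapter skeleton that satisfies the stdout JSON contract might look like this (the transcription itself is stubbed out; a real adapter would parse its command-line arguments and call faster-whisper, an HTTP API, or another backend):

```python
import json
import sys

def build_result(audio_path, model, language=None, timestamps=False):
    """Produce the stdout JSON contract; transcription is stubbed here."""
    text = f"(stub transcript of {audio_path})"  # replace with a real backend call
    result = {
        "backend": "stub",
        "model": model,
        "requested_language": language,
        "detected_language": language or "en",
        "text": text,
    }
    # segments may be omitted when timestamps are not requested.
    if timestamps:
        result["segments"] = [{"start": 0.0, "end": 1.0, "text": text}]
    return result

# The server reads whatever the adapter prints to stdout.
json.dump(build_result("/tmp/audio_16k.wav", "base", language="ru"), sys.stdout)
```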

MCP Config Examples

PicoClaw-style config

{
  "tools": {
    "mcp": {
      "enabled": true,
      "servers": {
        "media": {
          "enabled": true,
          "deferred": false,
          "command": "/opt/media-mcp/media-mcp",
          "args": [
            "--db-path",
            "/var/lib/media-mcp/media.db",
            "--asset-root",
            "/var/lib/media-mcp/assets",
            "--yt-dlp",
            "/usr/local/bin/yt-dlp",
            "--ffmpeg",
            "/usr/bin/ffmpeg",
            "--transcriber-command",
            "python3",
            "--transcriber-args",
            "/opt/media-mcp/examples/transcribers/faster_whisper.py {audio} --model {model} {language_arg} {timestamps_arg}"
          ],
          "type": "stdio"
        }
      }
    }
  }
}

OpenClaw-style config

{
  "mcp": {
    "servers": {
      "media": {
        "command": "/opt/media-mcp/media-mcp",
        "args": [
          "--db-path",
          "/var/lib/media-mcp/media.db",
          "--asset-root",
          "/var/lib/media-mcp/assets",
          "--yt-dlp",
          "/usr/local/bin/yt-dlp",
          "--ffmpeg",
          "/usr/bin/ffmpeg",
          "--transcriber-command",
          "python3",
          "--transcriber-args",
          "/opt/media-mcp/examples/transcribers/faster_whisper.py {audio} --model {model} {language_arg} {timestamps_arg}"
        ]
      }
    }
  }
}

Different runtimes expose MCP tool names differently. For a server named media, the agent may see tools as mcp_media_download_async, media__download_async, or similar.

Recommended Agent Instructions

See examples/TOOLS.md for a copy-pasteable instruction block.

The important rules are:

  • Prefer quality: "low" for normal downloads.
  • Use quality: "best" only when explicitly requested or when low quality is unusable.
  • Use job_wait instead of repeatedly polling job_status.
  • Do not treat the initial accepted: true response as completion.
  • Do not ask media-mcp to send files to chat. Use the agent runtime's native delivery tool with sendable_file_path, transcript_path, or another artifact path.

Typical Flows

Download a video and send it to the user

  1. Call download_async with mode: "video" and quality: "low".
  2. Call job_wait with the returned job_id.
  3. Read result.sendable_file_path.
  4. Use the runtime's native file delivery tool.
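In agent pseudocode, the flow above looks roughly like this (call_tool and send_file are hypothetical stand-ins for the runtime's MCP dispatch and native delivery tool; real tool names vary by runtime, as noted above):

```python
def download_and_send(call_tool, send_file, url):
    """Download media at low quality, wait for the job, deliver the file."""
    job = call_tool("download_async", {"url": url, "mode": "video", "quality": "low"})
    done = call_tool("job_wait", {"job_id": job["job_id"], "timeout_seconds": 120})
    if done["status"] != "succeeded":
        raise RuntimeError(f"download failed: {done}")
    send_file(done["result"]["sendable_file_path"])

# Demo with stubbed tool responses.
sent = []
fake_results = {
    "download_async": {"job_id": "job_1"},
    "job_wait": {"status": "succeeded",
                 "result": {"sendable_file_path": "/tmp/v.mp4"}},
}
download_and_send(lambda name, args: fake_results[name], sent.append,
                  "https://example.com/v")
print(sent)  # ['/tmp/v.mp4']
```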

Extract a recipe from Instagram/TikTok/YouTube media

  1. Call download_async with:
    • mode: "video"
    • quality: "low"
    • prepare_transcription_audio: true
  2. Call job_wait.
  3. Call transcribe_async with the resulting asset_id.
  4. Call job_wait.
  5. Use result.text or result.transcript_path to extract ingredients and steps.
  6. Reply with a concise structured recipe.

Resume an old job

  1. Call job_status with the known job_id.
  2. If succeeded or failed, call job_result.
  3. If still running, call job_wait with a reasonable timeout or tell the user it is still running.
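The resume pattern can be sketched the same way (again with a hypothetical call_tool stand-in): a cheap status check first, then either read the durable result or wait for completion.

```python
def resume_job(call_tool, job_id):
    """Resume a previously started job by id."""
    status = call_tool("job_status", {"job_id": job_id})["status"]
    if status in ("succeeded", "failed"):
        return call_tool("job_result", {"job_id": job_id})
    return call_tool("job_wait", {"job_id": job_id, "timeout_seconds": 60})

# Demo with stubbed tool responses for an already-finished job.
responses = {
    "job_status": {"status": "succeeded"},
    "job_result": {"ok": True, "status": "succeeded"},
}
print(resume_job(lambda name, args: responses[name], "job_1"))
```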

Storage Layout

SQLite stores job and asset metadata. Media artifacts are stored under --asset-root by asset_id.

Example:

assets/
  20260504231751_7ad348ff/
    source.mp4
    source.info.json
    source.m4a
    audio_16k.wav
    transcript.txt
    transcript.json
  _worker-logs/
    job_20260504231818_bbb042b2.log

Security Notes

  • Treat downloaded media as untrusted user-controlled files.
  • Do not expose --asset-root publicly without an access-control layer.
  • Keep SQLite and artifacts outside your agent prompt/workspace if users should not browse them directly.
  • Use OS permissions to restrict who can read transcripts and media files.
  • Be careful with cookies/browser profiles used by yt-dlp; this server does not manage secrets for you.

Development

go test ./...
go build -o media-mcp ./cmd/media-mcp