Extract Audio from Video in n8n

Pull audio from video without touching a terminal

If you have video interviews to transcribe, extracting the audio in n8n is probably the cleanest way to automate the whole thing. Record the interview, drop it in a folder, and let the workflow extract the audio, convert it to the right format, and hand it off to a transcription service, all without touching a terminal or remembering FFmpeg flags.

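FFmpeg handles the extraction in one command. n8n triggers it automatically whenever a new video appears. The tricky part is that n8n has no built-in way to run FFmpeg: the cloud version doesn't allow shell commands, and on self-hosted instances running it in-process ties up the worker. That is why the setup below uses RenderIO's API as the processing layer.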

The problem: n8n can't extract audio natively

n8n doesn't have an audio extraction node. The cloud version doesn't allow shell commands. Even self-hosted, running FFmpeg inside n8n blocks the worker and risks crashes on large files.

The solution: send the extraction command to RenderIO's API via n8n's HTTP Request node. RenderIO runs FFmpeg in an isolated container. Your n8n instance stays responsive.

## Use the RenderIO n8n node

RenderIO has a partner-verified community node on the n8n marketplace. Install from Settings → Community Nodes → search "renderio". It provides a visual interface for FFmpeg commands, including audio extraction.

The node handles authentication and request formatting automatically. The extraction examples below use HTTP Request nodes for full flexibility, but the same FFmpeg commands work with the community node. For a broader overview of video processing in n8n, the complete n8n video processing guide covers the setup patterns in detail.

Basic extraction: MP4 to MP3

The simplest workflow: video URL in, MP3 URL out.

HTTP Request node configuration:

Method: POST
URL: https://renderio.dev/api/v1/run-ffmpeg-command
Authentication: Header Auth (X-API-KEY)
Body:

{
  "ffmpeg_command": "-i {{in_video}} -vn -acodec libmp3lame -q:a 2 {{out_audio}}",
  "input_files": {
    "in_video": "{{ $json.videoUrl }}"
  },
  "output_files": {
    "out_audio": "extracted.mp3"
  }
}

-vn disables video. -q:a 2 sets MP3 quality (0=best, 9=worst, 2 is high quality at ~190kbps).
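If you want to call the endpoint from a script instead of the HTTP Request node, the same request can be sketched in JavaScript. The URL, header name, and body fields are taken from the configuration above. The helper name, the Content-Type header, and the choice to return a fetch-ready options object are assumptions made for illustration, and the response shape is left out because it isn't specified here.

```javascript
// Endpoint and header name as given in the node configuration above.
const RENDERIO_URL = "https://renderio.dev/api/v1/run-ffmpeg-command";

// Hypothetical helper: builds the URL and fetch options for an MP4 -> MP3 job.
function buildExtractionRequest(videoUrl, apiKey) {
  const body = {
    // {{in_video}} and {{out_audio}} are placeholders mapped by the two objects below.
    ffmpeg_command: "-i {{in_video}} -vn -acodec libmp3lame -q:a 2 {{out_audio}}",
    input_files: { in_video: videoUrl },
    output_files: { out_audio: "extracted.mp3" },
  };
  return {
    url: RENDERIO_URL,
    options: {
      method: "POST",
      // Content-Type is an assumption; the article only names X-API-KEY.
      headers: { "X-API-KEY": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Usage sketch (Node 18+ ships a global fetch):
// const { url, options } = buildExtractionRequest(
//   "https://example.com/interview1.mp4",
//   process.env.RENDERIO_API_KEY
// );
// const response = await fetch(url, options);
```

Keeping the builder separate from the network call makes it easy to log or unit-test the payload before sending anything.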

Poll for completion, then use the output URL.

Extraction formats

MP3 (most compatible)

{
  "ffmpeg_command": "-i {{in_video}} -vn -acodec libmp3lame -q:a 2 {{out_audio}}",
  "input_files": { "in_video": "{{ $json.videoUrl }}" },
  "output_files": { "out_audio": "audio.mp3" }
}

Best for: sharing, podcast distribution, general use.

WAV (lossless)

{
  "ffmpeg_command": "-i {{in_video}} -vn -acodec pcm_s16le -ar 44100 {{out_audio}}",
  "input_files": { "in_video": "{{ $json.videoUrl }}" },
  "output_files": { "out_audio": "audio.wav" }
}

Best for: transcription services (they often prefer WAV), audio editing, archival.

AAC (Apple/streaming)

{
  "ffmpeg_command": "-i {{in_video}} -vn -acodec aac -b:a 192k {{out_audio}}",
  "input_files": { "in_video": "{{ $json.videoUrl }}" },
  "output_files": { "out_audio": "audio.m4a" }
}

Best for: Apple devices, streaming platforms, smaller files than MP3 at same quality.

FLAC (lossless compressed)

{
  "ffmpeg_command": "-i {{in_video}} -vn -acodec flac {{out_audio}}",
  "input_files": { "in_video": "{{ $json.videoUrl }}" },
  "output_files": { "out_audio": "audio.flac" }
}

Best for: archival when you want lossless but smaller than WAV (typically 50-60% of WAV size).

OGG/Opus (web)

{
  "ffmpeg_command": "-i {{in_video}} -vn -acodec libopus -b:a 128k {{out_audio}}",
  "input_files": { "in_video": "{{ $json.videoUrl }}" },
  "output_files": { "out_audio": "audio.ogg" }
}

Best for: web applications, voice recordings, VoIP.
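Since the five bodies differ only in codec arguments and file extension, they can be generated from a lookup table. This is a minimal sketch: the codec flags and extensions are copied from the examples above, while the function name and the short format keys are made up for illustration.

```javascript
// Codec arguments and file extensions copied from the five examples above.
const AUDIO_FORMATS = {
  mp3: { args: "-acodec libmp3lame -q:a 2", ext: "mp3" },
  wav: { args: "-acodec pcm_s16le -ar 44100", ext: "wav" },
  aac: { args: "-acodec aac -b:a 192k", ext: "m4a" },
  flac: { args: "-acodec flac", ext: "flac" },
  opus: { args: "-acodec libopus -b:a 128k", ext: "ogg" },
};

// Hypothetical helper: returns a request body for the chosen format.
function buildFormatBody(videoUrl, format, baseName = "audio") {
  if (!Object.hasOwn(AUDIO_FORMATS, format)) {
    throw new Error(`Unsupported format: ${format}`);
  }
  const { args, ext } = AUDIO_FORMATS[format];
  return {
    ffmpeg_command: `-i {{in_video}} -vn ${args} {{out_audio}}`,
    input_files: { in_video: videoUrl },
    output_files: { out_audio: `${baseName}.${ext}` },
  };
}

// Usage sketch:
// buildFormatBody("https://example.com/interview1.mp4", "flac");
```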

Complete workflow: Extract and transcribe

Combine audio extraction with a transcription service:

Google Drive Trigger (new video)
  → HTTP Request: Extract audio (RenderIO)
  → Wait + Poll
  → HTTP Request: Download audio
  → HTTP Request: Send to Whisper API / AssemblyAI / Deepgram
  → Google Sheets: Write transcript
  → Slack: Notify team

Node 1: Google Drive Trigger

Watches a "Videos" folder for new uploads.

Node 2: Extract audio (HTTP Request)

{
  "ffmpeg_command": "-i {{in_video}} -vn -acodec pcm_s16le -ar 16000 -ac 1 {{out_audio}}",
  "input_files": { "in_video": "{{ $json.downloadUrl }}" },
  "output_files": { "out_audio": "for_transcription.wav" }
}

Note: -ar 16000 -ac 1 converts to 16kHz mono. This is the format most transcription APIs prefer. Smaller files, faster uploads, same transcription quality.

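Nodes 3-5: Poll and get result

Standard polling loop: wait, check the job status, and repeat until the job finishes, then download the audio from the output URL.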
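The decision made on each pass of a polling loop can be sketched as a small function. The status strings below are placeholders, not confirmed RenderIO values, and the 30-attempt cap is an arbitrary assumption; check the API documentation for the real status field and values.

```javascript
// Hypothetical status values; replace with the ones the API actually returns.
const STATUS_DONE = "SUCCESS";
const STATUS_FAILED = "FAILED";

// Decides what the loop should do after each status check.
function nextPollAction(status, attempt, maxAttempts = 30) {
  if (status === STATUS_DONE) return "continue";  // move on to the download step
  if (status === STATUS_FAILED) return "fail";    // route to error handling
  if (attempt >= maxAttempts) return "timeout";   // give up instead of looping forever
  return "wait";                                  // pause, then check again
}
```

In n8n the same branching is usually an IF or Switch node placed after the status request, with the "wait" branch looping back through a Wait node. The attempt cap matters: without it, a job that never finishes keeps the execution alive indefinitely.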

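Node 6: Send to transcription

The OpenAI endpoint expects a multipart/form-data upload, so send the binary file from the download step rather than its URL:

{
  "method": "POST",
  "url": "https://api.openai.com/v1/audio/transcriptions",
  "headers": { "Authorization": "Bearer {{ $credentials.openAiApi.apiKey }}" },
  "contentType": "multipart/form-data",
  "body": {
    "model": "whisper-1",
    "file": "<binary data from the Download audio node>"
  }
}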
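OpenAI's transcription endpoint takes the audio as a file part in a multipart/form-data request. Outside n8n, that upload can be sketched with the FormData and Blob globals available in Node 18+. The helper name and the default file name are illustrative, and apiKey and audioUrl in the usage comment are placeholders.

```javascript
// Builds the multipart form for POST https://api.openai.com/v1/audio/transcriptions.
// Requires Node 18+ for the global FormData and Blob.
function buildTranscriptionForm(audioBytes, fileName = "for_transcription.wav") {
  const form = new FormData();
  form.append("model", "whisper-1");
  // The audio travels as a file part, not as a URL string.
  form.append("file", new Blob([audioBytes], { type: "audio/wav" }), fileName);
  return form;
}

// Usage sketch (apiKey and audioUrl are placeholders):
// const audioBytes = await (await fetch(audioUrl)).arrayBuffer();
// const response = await fetch("https://api.openai.com/v1/audio/transcriptions", {
//   method: "POST",
//   headers: { Authorization: `Bearer ${apiKey}` },
//   body: buildTranscriptionForm(audioBytes),
// });
// const { text } = await response.json();
```

Note that no Content-Type header is set by hand: fetch adds the multipart boundary itself when the body is a FormData object.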

Batch extraction from a video library

Process an entire folder of videos:

Step 1: Get video list

Use a Code node or fetch from a spreadsheet:

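// Each entry needs a URL the API can fetch and a name for the output file.
const videos = [
  { url: "https://example.com/interview1.mp4", name: "interview1" },
  { url: "https://example.com/interview2.mp4", name: "interview2" },
  { url: "https://example.com/interview3.mp4", name: "interview3" },
];

// n8n expects one item per video, each wrapped in a json key.
return videos.map(v => ({ json: v }));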

Step 2: Split in Batches (size: 5; recent n8n versions list this node as Loop Over Items)

Step 3: Submit extraction for each

{
  "ffmpeg_command": "-i {{in_video}} -vn -acodec libmp3lame -q:a 2 {{out_audio}}",
  "input_files": { "in_video": "{{ $json.url }}" },
  "output_files": { "out_audio": "{{ $json.name }}.mp3" }
}

Step 4: Poll and collect URLs

Step 5: Write results to spreadsheet

Video	Audio URL	Status
interview1	https://media.renderio.dev/interview1.mp3	extracted
interview2	https://media.renderio.dev/interview2.mp3	extracted
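Steps 2 and 3 come down to chunking the list and building one request body per video. Here is a minimal sketch of that logic in plain JavaScript; the helper names are illustrative, and the batch size of 5 mirrors the Split in Batches setting above.

```javascript
// Splits a list into consecutive groups of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Builds the extraction body for one { url, name } entry.
function buildBatchBody(video) {
  return {
    ffmpeg_command: "-i {{in_video}} -vn -acodec libmp3lame -q:a 2 {{out_audio}}",
    input_files: { in_video: video.url },
    output_files: { out_audio: `${video.name}.mp3` },
  };
}

// Usage sketch: seven videos become two batches (5 + 2) of request bodies.
// const library = [{ url: "https://example.com/interview1.mp4", name: "interview1" } /* ... */];
// const bodiesPerBatch = chunk(library, 5).map(batch => batch.map(buildBatchBody));
```

Submitting in small batches keeps the number of in-flight jobs bounded, which makes polling and retrying easier to reason about.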

Audio processing after extraction

Once you have the audio, you can process it further:

Normalize volume:

-i {{in_audio}} -af loudnorm=I=-16:TP=-1.5:LRA=11 {{out_audio}}

Trim silence from start/end:

-i {{in_audio}} -af silenceremove=start_periods=1:start_silence=0.5:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_silence=0.5:start_threshold=-50dB,areverse {{out_audio}}

Convert sample rate:

-i {{in_audio}} -ar 44100 {{out_audio}}

Chain these into your workflow as additional processing steps after extraction.
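FFmpeg's -af option accepts a comma-separated filter chain, so these steps can also be combined into one command instead of running one after another. The sketch below assembles such a command from the filter strings shown above. The function and option names are illustrative, and trimming before normalizing is just one reasonable ordering.

```javascript
// Filter strings copied from the examples above.
const TRIM_ONE_END = "silenceremove=start_periods=1:start_silence=0.5:start_threshold=-50dB";
const FILTERS = {
  // Trim the start, reverse, trim the (former) end, reverse back.
  trimSilence: `${TRIM_ONE_END},areverse,${TRIM_ONE_END},areverse`,
  normalize: "loudnorm=I=-16:TP=-1.5:LRA=11",
};

// Hypothetical helper: builds one command from the selected processing steps.
function buildProcessingCommand({ trimSilence = false, normalize = false, sampleRate } = {}) {
  const chain = [];
  if (trimSilence) chain.push(FILTERS.trimSilence);
  if (normalize) chain.push(FILTERS.normalize);

  const parts = ["-i {{in_audio}}"];
  if (chain.length > 0) parts.push(`-af ${chain.join(",")}`);
  if (sampleRate) parts.push(`-ar ${sampleRate}`);
  parts.push("{{out_audio}}");
  return parts.join(" ");
}

// Usage sketch:
// buildProcessingCommand({ normalize: true, sampleRate: 44100 });
```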

Choosing the right audio format for your use case

The five formats above aren't interchangeable. Here's how they stack up in practice:

Format	Lossy/Lossless	Typical file size (1hr audio)	Best for
MP3	Lossy	~86MB at 192kbps	Podcasts, general distribution, sharing
WAV	Lossless	~635MB (16-bit, 44.1kHz stereo)	Transcription APIs, audio editing, archival
AAC	Lossy	~86MB at 192kbps	Apple ecosystem, streaming, smaller than MP3 at comparable quality when encoded at a lower bitrate
FLAC	Lossless (compressed)	~320-380MB	Archival with lossless quality but manageable size
OGG/Opus	Lossy	~58MB at 128kbps	Web apps, voice, anything WebRTC-adjacent

For transcription workflows, WAV at 16kHz mono is almost always the right choice. It's the format OpenAI Whisper, AssemblyAI, and Deepgram all prefer, and the files are smaller than you'd expect: 16-bit mono at 16kHz is 32KB per second, or about 115MB per hour.

For general audio extraction where you want something shareable, MP3 at 192kbps hits the sweet spot.
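The file sizes follow from simple arithmetic: bitrate times duration for lossy formats, and sample rate times channels times bytes per sample times duration for uncompressed PCM. A minimal sketch, using decimal megabytes (1MB = 1,000,000 bytes) and ignoring container overhead; the function names are illustrative.

```javascript
const SECONDS_PER_HOUR = 3600;

// Size of a constant-bitrate stream: kilobits per second -> megabytes.
function compressedSizeMB(kbps, seconds) {
  return (kbps * 1000 * seconds) / 8 / 1e6;
}

// Size of uncompressed PCM audio (what a WAV file holds).
function pcmSizeMB(sampleRate, channels, bitsPerSample, seconds) {
  return (sampleRate * channels * (bitsPerSample / 8) * seconds) / 1e6;
}

// One hour of audio:
// compressedSizeMB(192, SECONDS_PER_HOUR)        -> 86.4   (MP3 or AAC at 192kbps)
// compressedSizeMB(128, SECONDS_PER_HOUR)        -> 57.6   (Opus at 128kbps)
// pcmSizeMB(44100, 2, 16, SECONDS_PER_HOUR)      -> 635.04 (WAV, 44.1kHz stereo)
// pcmSizeMB(16000, 1, 16, SECONDS_PER_HOUR)      -> 115.2  (WAV, 16kHz mono)
```

The last two lines show why 16kHz mono suits transcription uploads: it is less than a fifth the size of a 44.1kHz stereo WAV.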

Cost breakdown

Each audio extraction is one API command. Here's what that looks like at scale:

Plan	Cost	Commands/month	Videos/month
Starter	$12/mo	500	500
Growth	$29/mo	1,000	1,000
Business	$99/mo	20,000	20,000
Overage (Starter)	$0.08/cmd	—	—

For a transcription workflow processing 20 video interviews a week, that's about 80 extractions per month. The Starter plan easily covers it.

Batch workflows that also normalize and convert sample rate use 2 commands per video (extract, then process). Factor that in if you're building a multi-step pipeline.
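To check which plan fits a given pipeline, the arithmetic can be sketched as follows. Prices and quotas are copied from the table above and may be out of date. Treating a month as four weeks is a simplification, and overage is only computed for Starter because that is the only overage rate listed. Amounts are kept in cents to avoid floating-point rounding.

```javascript
// Plan data copied from the table above (prices in cents).
const PLANS = [
  { name: "Starter", centsPerMonth: 1200, commands: 500, overageCents: 8 },
  { name: "Growth", centsPerMonth: 2900, commands: 1000 },
  { name: "Business", centsPerMonth: 9900, commands: 20000 },
];

// Commands used per month; weeksPerMonth = 4 is a rough simplification.
function monthlyCommands(videosPerWeek, commandsPerVideo = 1, weeksPerMonth = 4) {
  return videosPerWeek * weeksPerMonth * commandsPerVideo;
}

// Cost of one plan for a command volume, or null if the volume exceeds
// the quota and no overage rate is listed for that plan.
function planCostCents(plan, commands) {
  if (commands <= plan.commands) return plan.centsPerMonth;
  if (plan.overageCents === undefined) return null;
  return plan.centsPerMonth + (commands - plan.commands) * plan.overageCents;
}

// Cheapest plan that can be priced for the given volume.
function cheapestPlan(commands) {
  return PLANS
    .map(plan => ({ name: plan.name, cents: planCostCents(plan, commands) }))
    .filter(option => option.cents !== null)
    .sort((a, b) => a.cents - b.cents)[0];
}

// Usage sketch: 20 interviews a week, extract + process (2 commands each).
// cheapestPlan(monthlyCommands(20, 2));
```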

Error handling

Common extraction failures:

No audio track: Some screen recordings or animations have no audio. FFmpeg returns an error. Handle with an IF node that checks the error message for "does not contain any stream."

Corrupted audio: Add -err_detect ignore_err before -i to attempt extraction despite minor corruption.

Very long videos: Extraction is fast (typically 10–30 seconds) because FFmpeg only copies or transcodes the audio stream and skips the video. Job time still grows with length, but far more slowly than a video re-encode would. If you're hitting timeouts, check the polling setup before blaming the command.
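The first two cases can be handled with a couple of small helpers. This is a sketch under assumptions: how the FFmpeg error text reaches your workflow depends on the API response, so errorMessage here is simply whatever string you extract from it, and the helper names are made up.

```javascript
// Substring FFmpeg prints when -vn leaves nothing to write,
// for example "Output file #0 does not contain any stream".
const NO_STREAM_MARKER = "does not contain any stream";

// Labels a failure so the workflow can branch on it.
function classifyExtractionError(errorMessage) {
  const text = String(errorMessage ?? "");
  return text.includes(NO_STREAM_MARKER) ? "no_audio_track" : "other";
}

// Inserts -err_detect ignore_err ahead of the first -i so it applies to the input.
function withErrorTolerance(ffmpegCommand) {
  if (ffmpegCommand.includes("-err_detect")) return ffmpegCommand; // already present
  return ffmpegCommand.replace("-i ", "-err_detect ignore_err -i ");
}

// Usage sketch: retry a failed job once with the tolerant command.
// const retryCommand = withErrorTolerance("-i {{in_video}} -vn -acodec flac {{out_audio}}");
```

A "no_audio_track" result is usually worth skipping rather than retrying, since the input genuinely has nothing to extract.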

For a full list of FFmpeg audio flags and troubleshooting, the FFmpeg audio command reference has examples for most edge cases.

FAQ

Can n8n extract audio from video without the RenderIO node?

On self-hosted n8n, you can run shell commands with the Execute Command node if FFmpeg is installed on the same machine. The problem is it blocks the n8n worker process and risks crashes on large files. The API approach keeps n8n responsive and offloads the processing to an isolated container.

What video formats does audio extraction support?

Anything FFmpeg supports: MP4, MOV, MKV, AVI, WebM, FLV, and most others. FFmpeg auto-detects the container format; you don't need to tell it what type of file it's reading.

How long does extraction take?

Typically 10–30 seconds. Extraction copies or transcodes only the audio stream and skips the video, so job time grows much more slowly with length than a video re-encode would.

Can I extract only part of the audio from a long video?

Yes. Add -ss 00:10:00 -t 00:05:00 to your FFmpeg command to start at 10 minutes and extract 5 minutes. This is useful for long recordings where you only need a specific segment.
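If the segment boundaries come from data (a spreadsheet column of offsets, say), the command can be built programmatically. A minimal sketch, assuming whole-second offsets and MP3 output; the helper names are illustrative. Both options are placed before -i here, where FFmpeg treats them as input options (seek, then limit how much is read).

```javascript
// Converts whole seconds to HH:MM:SS.
function toTimestamp(totalSeconds) {
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return [hours, minutes, seconds].map(n => String(n).padStart(2, "0")).join(":");
}

// Builds an extraction command limited to one segment of the input.
function buildSegmentCommand(startSeconds, durationSeconds) {
  const start = toTimestamp(startSeconds);
  const duration = toTimestamp(durationSeconds);
  return `-ss ${start} -t ${duration} -i {{in_video}} -vn -acodec libmp3lame -q:a 2 {{out_audio}}`;
}

// Usage sketch: start at 10 minutes, keep 5 minutes.
// buildSegmentCommand(600, 300);
```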

What's the best format to send to Whisper or other transcription APIs?

WAV at 16kHz mono (-acodec pcm_s16le -ar 16000 -ac 1). Most transcription services downsample anyway, so sending at the target rate saves upload time and reduces file size without affecting transcription quality.

If your workflow also involves processing the video before extracting audio (trimming, resizing, or stripping metadata), the FFmpeg API reference has the command patterns for chaining operations.