# Pull audio from video without touching a terminal
If you have video interviews to transcribe, extracting the audio with n8n is probably the cleanest way to automate the whole thing. Record the interview, drop it in a folder, and let the workflow extract the audio, convert it to the right format, and hand it off to a transcription service, all without touching a terminal or remembering FFmpeg flags.
FFmpeg handles the extraction in one command, and n8n triggers it automatically whenever a new video appears. The tricky part is that n8n has no built-in way to run FFmpeg (n8n Cloud doesn't allow shell commands at all), which is why the setup below uses RenderIO's API as the processing layer.
## The problem: n8n can't extract audio natively
n8n doesn't have an audio extraction node. The cloud version doesn't allow shell commands. Even self-hosted, running FFmpeg inside n8n blocks the worker and risks crashes on large files.
The solution: send the extraction command to RenderIO's API via n8n's HTTP Request node. RenderIO runs FFmpeg in an isolated container. Your n8n instance stays responsive.
## Use the RenderIO n8n node
RenderIO has a partner-verified community node on the n8n marketplace. Install from Settings → Community Nodes → search "renderio". It provides a visual interface for FFmpeg commands, including audio extraction.
The node handles authentication and request formatting automatically. The extraction examples below use HTTP Request nodes for full flexibility, but the same FFmpeg commands work with the native node. For a broader overview of video processing in n8n, the complete n8n video processing guide covers the setup patterns in detail.
## Basic extraction: MP4 to MP3
The simplest workflow: video URL in, MP3 URL out.
HTTP Request node configuration:
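- Method: POST
- URL: https://renderio.dev/api/v1/run-ffmpeg-command
- Authentication: Header Auth (X-API-KEY)
- Body: the FFmpeg command, along with the input video URL and the output file name

In the command, `-vn` disables video and `-q:a 2` sets the MP3 VBR quality (0 is best, 9 is worst; 2 is high quality at roughly 190kbps).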
Poll for completion, then use the output URL.
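
As a rough sketch, the request body for this step could be assembled like this, for example in a Code node placed before the HTTP Request node. The field names (`ffmpeg_command`, `input_files`, `output_files`), the `{{placeholder}}` syntax, and the example URL are assumptions made for illustration, so check them against RenderIO's API documentation before relying on them.

```javascript
// Sketch of the JSON body for the HTTP Request node.
// ASSUMPTION: field names and the {{placeholder}} syntax are illustrative,
// not confirmed by this article.
function buildExtractionBody(videoUrl, outputName = "output.mp3") {
  return {
    // -vn drops the video stream; -q:a 2 selects high-quality VBR MP3
    ffmpeg_command: "-i {{in_video}} -vn -q:a 2 {{out_audio}}",
    input_files: { in_video: videoUrl },
    output_files: { out_audio: outputName },
  };
}

// Hypothetical input URL, for illustration only.
const extractionBody = buildExtractionBody("https://example.com/interview.mp4");
console.log(JSON.stringify(extractionBody, null, 2));
```

Keeping the command in one small function makes it easy to swap the flags later without touching the rest of the workflow.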
## Extraction formats

### MP3 (most compatible)

Best for: sharing, podcast distribution, general use.

### WAV (lossless)

Best for: transcription services (they often prefer WAV), audio editing, archival.

### AAC (Apple/streaming)

Best for: Apple devices, streaming platforms, smaller files than MP3 at the same quality.

### FLAC (lossless compressed)

Best for: archival when you want lossless but smaller than WAV (typically 50-60% of WAV size).

### OGG/Opus (web)

Best for: web applications, voice recordings, VoIP.
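
One way to keep the five formats manageable is a lookup table of FFmpeg audio options. The sketch below uses standard FFmpeg encoders and flags, but the specific bitrates and file extensions are illustrative defaults rather than values taken from this article, and the helper names are hypothetical.

```javascript
// FFmpeg audio options per output format. -vn drops the video stream in every case.
// Bitrates and extensions are assumed defaults; adjust to taste.
const FORMAT_ARGS = {
  mp3:  { ext: "mp3",  args: "-vn -c:a libmp3lame -q:a 2" },
  wav:  { ext: "wav",  args: "-vn -c:a pcm_s16le" },
  aac:  { ext: "m4a",  args: "-vn -c:a aac -b:a 192k" },
  flac: { ext: "flac", args: "-vn -c:a flac" },
  opus: { ext: "ogg",  args: "-vn -c:a libopus -b:a 128k" },
};

// Build the FFmpeg argument string for one format.
function extractionCommand(format, input, outputBase) {
  const spec = FORMAT_ARGS[format];
  if (!spec) throw new Error(`Unsupported format: ${format}`);
  return `-i ${input} ${spec.args} ${outputBase}.${spec.ext}`;
}

console.log(extractionCommand("flac", "interview.mp4", "interview"));
```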
## Complete workflow: Extract and transcribe

Combine audio extraction with a transcription service:

### Node 1: Google Drive Trigger

Watches a "Videos" folder for new uploads.

### Node 2: Extract audio (HTTP Request)

Note: `-ar 16000 -ac 1` converts to 16kHz mono. This is the format most transcription APIs prefer. Smaller files, faster uploads, same transcription quality.
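
A minimal sketch of the FFmpeg arguments for this node, assuming 16-bit PCM WAV output (`pcm_s16le`) alongside the 16kHz mono settings from the note. The function name and file names are hypothetical.

```javascript
// Build FFmpeg arguments for transcription-ready audio:
// no video, 16-bit PCM, 16kHz sample rate, one channel.
function transcriptionAudioCommand(input, output) {
  return [
    "-i", input,
    "-vn",                 // drop the video stream
    "-acodec", "pcm_s16le", // 16-bit PCM (WAV)
    "-ar", "16000",        // 16kHz sample rate
    "-ac", "1",            // mono
    output,
  ].join(" ");
}

console.log(transcriptionAudioCommand("interview.mp4", "interview.wav"));
```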
### Nodes 3-5: Poll and get result

Standard polling loop: wait, check the job status, and repeat until the output URL is available.
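
In n8n this loop is usually built from Wait, HTTP Request, and IF nodes, but the logic can be sketched in plain JavaScript. The status strings (`"SUCCESS"`, `"FAILED"`) and the shape of the status object are hypothetical, and `fetchStatus` stands in for whatever call retrieves the job state.

```javascript
// Poll a job until it finishes, fails, or the attempt budget runs out.
// ASSUMPTION: status values and the `error` field are placeholders.
async function pollUntilDone(fetchStatus, { intervalMs = 3000, maxAttempts = 40 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await fetchStatus();
    if (job.status === "SUCCESS") return job;
    if (job.status === "FAILED") {
      throw new Error(`Extraction failed: ${job.error ?? "unknown error"}`);
    }
    // Not finished yet: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job still running after ${maxAttempts} polls`);
}
```

Capping the number of attempts matters more than the exact interval: without a cap, a stuck job keeps the workflow execution open indefinitely.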
### Node 6: Send to transcription
## Batch extraction from a video library

Process an entire folder of videos:

### Step 1: Get video list

Use a Code node or fetch from a spreadsheet:
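
A Code node version might look roughly like the sketch below. n8n expects a Code node to return an array of items, each with a `json` key; the video names and URLs here are hypothetical placeholders.

```javascript
// Produce one n8n item per video. In a real Code node, the final
// statement would be `return getVideoList();` at the top level.
function getVideoList() {
  const videos = [
    { name: "interview1", url: "https://example.com/videos/interview1.mp4" },
    { name: "interview2", url: "https://example.com/videos/interview2.mp4" },
  ];
  // Wrap each entry in the { json: ... } shape n8n items use.
  return videos.map((video) => ({ json: video }));
}

console.log(getVideoList());
```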
### Step 2: Split in Batches (size: 5)

### Step 3: Submit extraction for each

### Step 4: Poll and collect URLs

### Step 5: Write results to spreadsheet

| Video | Audio URL | Status |
| --- | --- | --- |
| interview1 | https://media.renderio.dev/interview1.mp3 | extracted |
| interview2 | https://media.renderio.dev/interview2.mp3 | extracted |
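
The batching and the spreadsheet rows can be sketched in a few lines. The Split in Batches node does the chunking for you inside n8n; this only illustrates the logic. The `failed` status is an assumption added for rows where no audio URL came back.

```javascript
// Split a list into groups of `size` (5 matches the batch size above).
function chunk(items, size = 5) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Shape one result as a spreadsheet row with the table's column names.
function toResultRow(name, audioUrl) {
  return {
    Video: name,
    "Audio URL": audioUrl ?? "",
    Status: audioUrl ? "extracted" : "failed", // "failed" is a hypothetical status
  };
}

console.log(chunk(["a", "b", "c", "d", "e", "f"]).length); // 2 batches
```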
## Audio processing after extraction
Once you have the audio, you can process it further:
Normalize volume:
Trim silence from start/end:
Convert sample rate:
Chain these into your workflow as additional processing steps after extraction.
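
As an illustration, the three steps map onto standard FFmpeg options: the `loudnorm` filter, the `silenceremove` filter, and `-ar`. The loudness targets, the -50dB silence threshold, and the 44.1kHz rate below are assumed example values, not settings taken from this article.

```javascript
// FFmpeg options for each post-processing step (values are assumptions).
const silenceFilter = "silenceremove=start_periods=1:start_threshold=-50dB";
const PROCESSING_ARGS = {
  // EBU R128 loudness normalization
  normalize: "-af loudnorm=I=-16:TP=-1.5:LRA=11",
  // Trim leading silence, reverse, trim again (the former tail), reverse back
  trimSilence: `-af ${silenceFilter},areverse,${silenceFilter},areverse`,
  // Resample to 44.1kHz
  resample: "-ar 44100",
};

function processingCommand(step, input, output) {
  const args = PROCESSING_ARGS[step];
  if (!args) throw new Error(`Unknown processing step: ${step}`);
  return `-i ${input} ${args} ${output}`;
}

console.log(processingCommand("normalize", "raw.mp3", "normalized.mp3"));
```

Note that `areverse` buffers the whole stream in memory, so the silence-trimming variant is better suited to short and medium-length recordings.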
## Choosing the right audio format for your use case

The five formats above aren't interchangeable. Here's how they stack up in practice:

| Format | Lossy/Lossless | Typical file size (1hr audio) | Best for |
| --- | --- | --- | --- |
| MP3 | Lossy | ~85MB at 192kbps | Podcasts, general distribution, sharing |
| WAV | Lossless | ~600MB (16-bit, 44.1kHz stereo) | Transcription APIs, audio editing, archival |
| AAC | Lossy | ~70MB at 160kbps | Apple ecosystem, streaming, files smaller than MP3 |
| FLAC | Lossless (compressed) | ~300-360MB (50-60% of WAV) | Archival with lossless quality but manageable size |
| OGG/Opus | Lossy | ~55MB at 128kbps | Web apps, voice, anything WebRTC-adjacent |
For transcription workflows, WAV at 16kHz mono is almost always the right choice. It works well with OpenAI Whisper, AssemblyAI, and Deepgram, and the files are smaller than you'd expect at that sample rate: roughly 115MB per hour of 16-bit audio.
For general audio extraction where you want something shareable, MP3 at 192kbps hits the sweet spot.
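
The sizes quoted in this section follow from simple arithmetic: bitrate times duration for lossy formats, and sample rate times bytes per sample times channels times duration for uncompressed PCM. A small sketch, with hypothetical helper names:

```javascript
// Size of a constant-bitrate lossy file, in megabytes (1 MB = 1,000,000 bytes).
function lossySizeMB(kbps, seconds) {
  return (kbps * 1000 * seconds) / 8 / 1e6;
}

// Size of uncompressed PCM audio (WAV payload), in megabytes.
function pcmSizeMB(sampleRate, bitDepth, channels, seconds) {
  return (sampleRate * (bitDepth / 8) * channels * seconds) / 1e6;
}

const ONE_HOUR = 3600;
console.log(lossySizeMB(192, ONE_HOUR).toFixed(1));         // "86.4" (MP3 at 192kbps)
console.log(pcmSizeMB(44100, 16, 2, ONE_HOUR).toFixed(1));  // "635.0" (CD-quality stereo WAV)
console.log(pcmSizeMB(16000, 16, 1, ONE_HOUR).toFixed(1));  // "115.2" (16kHz mono WAV)
```

Variable-bitrate encodes and FLAC depend on the content, so treat those figures as estimates rather than exact sizes.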
## Cost breakdown

Each audio extraction is one API command. Here's what that looks like at scale:

| Plan | Cost | Commands/month | Videos/month |
| --- | --- | --- | --- |
| Starter | $12/mo | 500 | 500 |
| Growth | $29/mo | 1,000 | 1,000 |
| Business | $99/mo | 20,000 | 20,000 |
| Overage (Starter) | $0.08/cmd | n/a | n/a |
For a transcription workflow processing 20 video interviews a week, that's roughly 80 extractions per month. The Starter plan easily covers it.
Batch workflows that also normalize and convert sample rate use 2 commands per video (extract, then process). Factor that in if you're building a multi-step pipeline.
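
To estimate usage for a given pipeline, multiply videos by commands per video. The sketch below uses the Starter figures from the table and assumes a four-week month and that overage is billed per command beyond the included 500; the function names are hypothetical.

```javascript
// Commands consumed per month for a weekly video volume.
function monthlyCommands(videosPerWeek, commandsPerVideo = 1, weeksPerMonth = 4) {
  return videosPerWeek * weeksPerMonth * commandsPerVideo;
}

// Starter plan cost in dollars: $12 base, 500 included, $0.08 per extra command.
function starterMonthlyCost(commands) {
  const overage = Math.max(0, commands - 500);
  return 12 + overage * 0.08;
}

console.log(monthlyCommands(20));     // 80 (extract only)
console.log(monthlyCommands(20, 2));  // 160 (extract, then process)
console.log(starterMonthlyCost(160)); // 12
```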
## Error handling

Common extraction failures:

- No audio track: Some screen recordings or animations have no audio. FFmpeg returns an error. Handle with an IF node that checks the error message for "does not contain any stream."
- Corrupted audio: Add `-err_detect ignore_err` before `-i` to attempt extraction despite minor corruption.
- Very long videos: Extraction is fast (typically 10–30 seconds, even for long videos) because it only copies or transcodes the audio stream, not the video. If you're hitting timeouts, check the polling setup, not the command itself.
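
The IF-node check and the corruption retry can be sketched as two small helpers. Only the "does not contain any stream" pattern comes from the text above; the corruption patterns are an assumed, non-exhaustive heuristic, and the branch names are hypothetical.

```javascript
// Map FFmpeg error text to a handling branch.
function classifyExtractionError(message = "") {
  if (message.includes("does not contain any stream")) return "no_audio_track";
  // ASSUMPTION: heuristic patterns for damaged input, not a complete list.
  if (/invalid data found|error while decoding/i.test(message)) return "possibly_corrupted";
  return "unknown";
}

// Retry variant: insert -err_detect ignore_err before the first -i.
function withErrorTolerance(command) {
  return command.replace(/(^|\s)-i\s/, "$1-err_detect ignore_err -i ");
}

console.log(classifyExtractionError("Output file #0 does not contain any stream"));
console.log(withErrorTolerance("-i in.mp4 -vn out.mp3"));
```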
For a full list of FFmpeg audio flags and troubleshooting, the FFmpeg audio command reference has examples for most edge cases.
## FAQ

### Can n8n extract audio from video without the RenderIO node?
On self-hosted n8n, you can run shell commands with the Execute Command node if FFmpeg is installed on the same machine. The problem is it blocks the n8n worker process and risks crashes on large files. The API approach keeps n8n responsive and offloads the processing to an isolated container.
### What video formats does audio extraction support?
Anything FFmpeg supports: MP4, MOV, MKV, AVI, WebM, FLV, and most others. FFmpeg auto-detects the container format; you don't need to tell it what type of file it's reading.
### How long does extraction take?

Typically 10–30 seconds, even for long videos. Extraction copies or transcodes only the audio stream and doesn't touch the video, so job time grows far more slowly with video length than a full video transcode would.
### Can I extract only part of the audio from a long video?

Yes. Add `-ss 00:10:00 -t 00:05:00` to your FFmpeg command to start at 10 minutes and extract 5 minutes. This is useful for long recordings where you only need a specific segment.
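
A minimal sketch of a segment command, assuming MP3 output and placing `-ss` and `-t` before `-i` so that FFmpeg seeks in the input rather than decoding from the start. The function name and file names are hypothetical.

```javascript
// Extract `duration` of audio starting at `start` (both as HH:MM:SS).
function segmentCommand(input, output, start = "00:10:00", duration = "00:05:00") {
  // As input options, -ss seeks to the start point and -t limits how much is read.
  return `-ss ${start} -t ${duration} -i ${input} -vn -q:a 2 ${output}`;
}

console.log(segmentCommand("long-recording.mp4", "segment.mp3"));
```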
### What's the best format to send to Whisper or other transcription APIs?

WAV at 16kHz mono (`-acodec pcm_s16le -ar 16000 -ac 1`). Most transcription services downsample anyway, so sending at the target rate saves upload time and reduces file size without affecting transcription quality.
If your workflow also involves processing the video before extracting audio (trimming, resizing, or stripping metadata), the FFmpeg API reference has the command patterns for chaining operations.