FFmpeg CUDA and NVENC: GPU Hardware Acceleration Guide

March 24, 2026 · RenderIO

FFmpeg is slow because it's using your CPU

You ran ffmpeg -i input.mp4 -c:v libx264 -crf 23 output.mp4 and it took 12 minutes for a 5-minute clip. The progress line read something like speed=0.8x. Your CPU pinned at 100%. Everything else on the machine froze.

The fix is one flag away. FFmpeg CUDA acceleration lets you offload encoding to dedicated hardware on NVIDIA GPUs. The encoder is called NVENC. The decoder is NVDEC. Together with CUDA for frame processing, they push encoding speeds to 10-50x realtime instead of 0.8x.

This guide covers setup, the three NVENC encoders (H.264, HEVC, AV1), CUDA-accelerated filters, the memory optimization that doubles throughput, and what to do when it doesn't work. If you're looking for general codec conversion without GPU acceleration, see the FFmpeg transcoding guide first. For a quick reference of common commands without the explanations, the FFmpeg cheat sheet has 50 commands organized by task.

CUDA vs NVENC vs NVDEC: three different things

People mix these up constantly, and the confusion causes real debugging headaches.

NVENC is a fixed-function hardware encoder on NVIDIA GPUs. It encodes video into H.264, HEVC, or AV1. It runs on dedicated silicon, not on CUDA cores. You could be running a heavy ML training job and NVENC would still encode at full speed because it's physically separate hardware. Higher-end Ada Lovelace GPUs (like the RTX 4090) actually have two or three NVENC engines on the same chip, and the driver load-balances between them automatically.

NVDEC is the hardware decoder. Same idea, separate silicon. It decodes compressed video into raw frames. Every NVIDIA GPU since Kepler (2012) has one.

CUDA is the general-purpose compute framework. In the FFmpeg context, CUDA matters for two reasons: keeping decoded frames in GPU memory (avoiding PCIe bus copies), and running GPU-accelerated filters like scale_cuda and overlay_cuda.

When you see -hwaccel cuda in an FFmpeg command, that tells FFmpeg to use CUDA for decode-side acceleration. When you see -c:v h264_nvenc, that selects the NVENC hardware encoder. They work together but they're not the same thing.

NVENC generation support matrix

Not every GPU supports every codec. Here's what matters:

GPU generation          | H.264 | HEVC         | HEVC 10-bit | AV1 | HEVC B-frames
Maxwell (GTX 900)       | Yes   | 2nd gen only | No          | No  | No
Pascal (GTX 10xx)       | Yes   | Yes          | Yes         | No  | No
Turing (RTX 20xx)       | Yes   | Yes          | Yes         | No  | Yes
Ampere (RTX 30xx)       | Yes   | Yes          | Yes         | No  | Yes
Ada Lovelace (RTX 40xx) | Yes   | Yes          | Yes         | Yes | Yes
Blackwell (RTX 50xx)    | Yes   | Yes          | Yes         | Yes | Yes

Note: H.264 B-frame support has been available since the first NVENC generation (Kepler). The column above tracks HEVC B-frame support, which was added with Turing.

If you're buying a GPU specifically for encoding work, Ada Lovelace is the sweet spot right now. It's the oldest generation with AV1 hardware encoding, and the dual NVENC engines on the RTX 4070 Ti and above roughly double throughput compared to single-engine cards.

Check if your FFmpeg build supports NVENC

Before writing any commands, verify your FFmpeg binary was compiled with NVENC support:

ffmpeg -encoders 2>/dev/null | grep nvenc

You should see lines like:

V....D h264_nvenc           NVIDIA NVENC H.264 encoder (codec h264)
V....D hevc_nvenc           NVIDIA NVENC hevc encoder (codec hevc)
V....D av1_nvenc            NVIDIA NVENC av1 encoder (codec av1)

If nothing shows up, your FFmpeg wasn't built with NVIDIA support. Most package manager installs (apt install ffmpeg) don't include it. You need either a custom build or a static binary from a source that compiled with --enable-nvenc.

Also check for NVDEC decoder support:

ffmpeg -decoders 2>/dev/null | grep cuvid

You should see h264_cuvid, hevc_cuvid, etc. These are the GPU decoders that pair with NVENC.
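To confirm the decode path works end to end (not just that the decoders are compiled in), run a decode-only pass that discards the frames. This is a sketch; input.mp4 stands in for any H.264 or HEVC file you have on hand:

```shell
# Decode via NVDEC and discard the output; -benchmark prints wall-clock timing
ffmpeg -benchmark -hwaccel cuda -i input.mp4 -f null -
```

If this runs noticeably faster than the same command without -hwaccel cuda, hardware decoding is working.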

To build from source on Linux:

# Install the codec headers (not the full CUDA toolkit for runtime)
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
cd nv-codec-headers && sudo make install && cd ..

# Clone and configure FFmpeg
git clone https://git.ffmpeg.org/ffmpeg.git && cd ffmpeg
./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp \
  --extra-cflags=-I/usr/local/cuda/include \
  --extra-ldflags=-L/usr/local/cuda/lib64 \
  --disable-static --enable-shared
make -j$(nproc) && sudo make install

Also verify your GPU driver is recent enough. Run nvidia-smi and check the driver version. NVENC support depends on the Video Codec SDK version, which requires a minimum driver. SDK 13.0 needs driver 570+ on Linux.
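If you just want the driver number to compare against that requirement, nvidia-smi can print it alone (these query flags are standard nvidia-smi options):

```shell
# Print only the driver version, one line per GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```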

A common mistake: people install the CUDA toolkit but forget the nv-codec-headers. FFmpeg needs the headers at compile time, not the full toolkit at runtime. The headers are a lightweight repo that just provides the API definitions.

FFmpeg CUDA encoding: the basic commands

H.264 with NVENC

The simplest GPU-accelerated encode:

ffmpeg -i input.mp4 -c:v h264_nvenc -b:v 5M output.mp4

This decodes on the CPU and encodes on the GPU. It's faster than libx264, but we can do better. More on that in the pipeline section below.

For quality control, NVENC has two quality-targeted rate-control modes: constant QP (-rc constqp -qp N), which fixes the quantizer for every frame, and constant quality (-cq N), a VBR mode that is the closer analog of libx264's CRF. In both, lower values mean higher quality:

ffmpeg -i input.mp4 -c:v h264_nvenc -rc constqp -qp 23 -preset p7 -c:a copy output.mp4

The -preset flag goes from p1 (fastest, lowest quality) to p7 (slowest, highest quality). Even p7 on NVENC is still much faster than medium on libx264.
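The tradeoff is easy to measure on your own content with a quick sweep. A sketch, assuming input.mp4 and an NVENC-capable build:

```shell
# Encode the same clip at three preset levels; compare timing and file size
for p in p1 p4 p7; do
  ffmpeg -y -benchmark -hwaccel cuda -hwaccel_output_format cuda \
    -i input.mp4 -c:v h264_nvenc -preset "$p" -b:v 5M "out_${p}.mp4"
done
ls -l out_p*.mp4
```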

Important: NVENC's -qp and libx264's -crf are not the same scale, even though they look similar. QP 23 on NVENC does not produce the same quality as CRF 23 on libx264. You'll need to test with your own content and compare visually or with VMAF scores. As a rough starting point, QP 20-24 on NVENC produces results in the same ballpark as CRF 20-24 on libx264, but with slightly lower compression efficiency.
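If your FFmpeg was built with libvmaf (a compile-time option many binaries omit), you can put a number on that comparison instead of eyeballing it. A sketch; the first input is the encode, the second the original:

```shell
# Prints a VMAF score (0-100) for output.mp4 measured against input.mp4
# Both inputs must have the same resolution; scale one first if they differ
ffmpeg -i output.mp4 -i input.mp4 -lavfi libvmaf -f null -
```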

HEVC (H.265) with NVENC

Same idea, different codec. HEVC gets you 30-40% smaller files at the same quality compared to H.264. If you're already familiar with CPU-based HEVC encoding, the compression guide covers CRF tuning for libx265 in detail.

ffmpeg -i input.mp4 -c:v hevc_nvenc -rc constqp -qp 29 \
  -preset slow -profile:v main10 -rc-lookahead 32 \
  -spatial_aq 1 -tag:v hvc1 -c:a copy output.mp4

A few things worth explaining here:

  • -rc-lookahead 32 lets the encoder look 32 frames ahead for better bitrate decisions. 32 is the max NVENC supports.

  • -spatial_aq 1 enables spatial adaptive quantization. It spends more bits on complex regions of a frame and fewer on flat areas. This matters for content with mixed detail levels.

  • -tag:v hvc1 forces the HEVC tag that Apple devices need. Without it, Safari and iOS won't play your file.

  • -qp 29 roughly matches what CRF 28 produces with the CPU encoder libx265. The numbers aren't 1:1 between QP and CRF, so test with your own content.

  • -profile:v main10 enables 10-bit encoding, which gives better gradient handling (fewer banding artifacts in skies, dark scenes) even when your source is 8-bit. You may also need -pix_fmt p010le so the encoder actually receives 10-bit frames; with an 8-bit input format the output stays 8-bit. Requires Pascal or newer.

AV1 with NVENC

AV1 is the newest codec NVENC supports. You need an Ada Lovelace GPU (RTX 4000 series) or Blackwell (RTX 5000 series) for hardware AV1 encoding:

ffmpeg -i input.mp4 -c:v av1_nvenc -cq 30 -preset p5 -c:a copy output.mp4

AV1 gives another 20-30% size reduction over HEVC at similar quality. The catch: only recent GPUs have the hardware for it, and while browser support is now solid (Chrome, Firefox, Edge, Safari all decode AV1), some older devices still can't play it. For maximum compatibility, H.264 remains the safe bet. For minimum file size on modern platforms, AV1 wins.

One benefit of NVENC AV1 over software AV1 encoders: speed. SVT-AV1 on CPU runs at roughly 0.3-0.6x realtime for 1080p. AV1 NVENC on an RTX 4090 encodes at hundreds of frames per second. If you're encoding AV1 at scale, hardware encoding makes it actually practical. The transcoding guide covers software AV1 encoding with SVT-AV1 and libaom if you need the maximum compression efficiency that CPU encoders provide.

The GPU memory trick that doubles throughput

This one change matters more than anything else in the article. Compare these two commands:

# Slow: decode on CPU, copy frames to GPU for encode
ffmpeg -i input.mp4 -c:v h264_nvenc -b:v 5M output.mp4

# Fast: keep everything in GPU memory
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -i input.mp4 -c:v h264_nvenc -b:v 5M output.mp4

The difference is -hwaccel cuda -hwaccel_output_format cuda. Without it, FFmpeg decodes the video on the CPU, then copies every raw frame across the PCIe bus to the GPU for encoding. Those raw frames are huge: a single 1080p frame in YUV420 is about 3MB, and at 30fps that's roughly 90MB/second crossing the bus.
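Those numbers fall straight out of the pixel format: YUV 4:2:0 stores 1.5 bytes per pixel, so shell arithmetic gets you there:

```shell
# 1.5 bytes per pixel for YUV 4:2:0 (NV12): width * height * 3 / 2
echo $((1920 * 1080 * 3 / 2))        # 3110400 bytes, about 3MB per frame
echo $((1920 * 1080 * 3 / 2 * 30))   # 93312000 bytes, roughly 90MB per second at 30fps
```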

With -hwaccel_output_format cuda, FFmpeg uses NVDEC to decode on the GPU, and the decoded frames stay in GPU memory. No copies. The encoder reads them directly. NVIDIA's own benchmarks show this alone can give you up to 2x throughput.

Always use both flags together. -hwaccel cuda alone only sets up GPU decoding. Adding -hwaccel_output_format cuda is what keeps the frames on the GPU.

Selecting a specific GPU

If you have multiple NVIDIA GPUs, you can target a specific one:

ffmpeg -hwaccel cuda -hwaccel_device 0 -hwaccel_output_format cuda \
  -i input.mp4 -c:v h264_nvenc -gpu 0 -b:v 5M output.mp4

The -hwaccel_device selects which GPU does the decoding. The -gpu flag on the encoder selects which GPU does the encoding. In a multi-GPU setup, you can decode on one GPU and encode on another, though keeping everything on the same GPU avoids inter-GPU memory transfers.

CUDA filters: resize and process without leaving the GPU

Standard FFmpeg filters like scale and overlay run on the CPU. If you're using hardware acceleration and then apply a CPU filter, FFmpeg has to download frames from GPU memory, process them on the CPU, then upload them back. That kills your performance gains.

CUDA filters avoid this by running directly on the GPU:

Resize with scale_cuda

ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -i input.mp4 -vf scale_cuda=1280:720 \
  -c:v h264_nvenc -b:v 3M output_720p.mp4

Resize with scale_npp (NVIDIA Performance Primitives)

scale_npp is another GPU scaler with more interpolation options. For downscaling, the super-sampling algorithm gives noticeably better quality than bilinear:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -i input.mp4 -vf "scale_npp=1280:720:interp_algo=super" \
  -c:v h264_nvenc -b:v 3M output_720p.mp4

Deinterlacing with yadif_cuda

If you're working with interlaced source material (common with broadcast footage, old DVDs, security cameras):

ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -i interlaced_input.mp4 -vf yadif_cuda=0:-1:0 \
  -c:v h264_nvenc -b:v 5M output_progressive.mp4

This deinterlaces on the GPU without pulling frames back to the CPU.

Multiple outputs from a single decode

This is where GPU acceleration earns its keep. Decode once, encode multiple resolutions, all in GPU memory:

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
  -vf scale_npp=1920:1080 -c:a copy -c:v h264_nvenc -b:v 5M output_1080p.mp4 \
  -vf scale_npp=1280:720 -c:a copy -c:v h264_nvenc -b:v 3M output_720p.mp4 \
  -vf scale_npp=640:360 -c:a copy -c:v h264_nvenc -b:v 1M output_360p.mp4

One input, three outputs, all processed on the GPU. This is the standard pattern for ABR (adaptive bitrate) transcoding in video pipelines. The input gets decoded once, scaled to multiple resolutions on the GPU, and encoded in parallel. If you're doing this kind of batch work regularly, the FFmpeg commands list has ready-to-use API equivalents for every operation here.
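The same fan-out can also be written as a single filter graph with split, which makes the decode-once structure explicit and gives you per-output stream mapping. A sketch with the same bitrates as above:

```shell
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
  -filter_complex "[0:v]split=3[a][b][c];[a]scale_npp=1920:1080[v1080];[b]scale_npp=1280:720[v720];[c]scale_npp=640:360[v360]" \
  -map "[v1080]" -map 0:a -c:v h264_nvenc -b:v 5M -c:a copy output_1080p.mp4 \
  -map "[v720]" -map 0:a -c:v h264_nvenc -b:v 3M -c:a copy output_720p.mp4 \
  -map "[v360]" -map 0:a -c:v h264_nvenc -b:v 1M -c:a copy output_360p.mp4
```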

Upload CPU-decoded frames to GPU

Sometimes you need a CPU filter that has no CUDA equivalent. You can mix CPU and GPU processing with hwupload_cuda:

ffmpeg -i input.mp4 \
  -vf "fade=t=in:d=2,hwupload_cuda,scale_npp=1280:720" \
  -c:v h264_nvenc output.mp4

This decodes on CPU, applies the fade filter on CPU, uploads the frame to GPU memory, scales on the GPU, then encodes on the GPU. Not ideal, but better than doing everything on the CPU. For extracting frames from video rather than re-encoding, GPU acceleration helps less since the bottleneck is usually disk I/O, not decoding speed.

Available CUDA filters

Here's the full list of GPU-accelerated filters in FFmpeg:

Filter          | Purpose                   | GPU equivalent of
scale_cuda      | Resize                    | scale
scale_npp       | Resize (more algorithms)  | scale
overlay_cuda    | Picture-in-picture        | overlay
yadif_cuda      | Deinterlace               | yadif
thumbnail_cuda  | Scene detection thumbnail | thumbnail
transpose_npp   | Rotate 90/180/270         | transpose
chromakey_cuda  | Green screen removal      | chromakey
colorspace_cuda | Color space conversion    | colorspace

If your filter chain needs something not on this list, you'll have to use hwdownload,format=nv12 to pull frames back to CPU, apply the filter, then hwupload_cuda to push them back. It's slower, but it keeps the decode and encode steps on the GPU.
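As a concrete example of that round trip, here's a sketch that runs hqdn3d (a CPU-only denoiser, chosen as an arbitrary stand-in for any CPU filter) in the middle of an otherwise GPU pipeline:

```shell
# GPU decode -> download to CPU -> denoise on CPU -> upload -> GPU encode
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
  -vf "hwdownload,format=nv12,hqdn3d,hwupload_cuda" \
  -c:v h264_nvenc -b:v 5M output.mp4
```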

CPU vs GPU encoding: when NVENC wins (and when it doesn't)

NVENC is not always the right choice. Here's the actual tradeoff.

Speed: NVENC wins by a wide margin. A typical 1080p encode runs at 5-15x realtime on NVENC versus 0.5-2x on libx264 with the medium preset. For HEVC, the gap is even larger since libx265 is painfully slow on CPU. AV1 is where it gets dramatic: SVT-AV1 at preset 6 runs at 0.3-0.6x realtime, while av1_nvenc on Ada Lovelace hardware encodes at hundreds of fps.

Quality per bit: At the same bitrate, libx264 produces slightly better quality than h264_nvenc. The CPU encoder has more time and more sophisticated algorithms to make compression decisions. If you're archiving footage and file size matters more than speed, CPU encoding at a slower preset will give you better results. The video compression guide covers this quality-vs-size tradeoff in depth.

Quality at high bitrates: Above roughly 10Mbps for 1080p, the quality difference between CPU and GPU encoding shrinks to nearly invisible. If you can afford the bitrate, NVENC gives you the same visual quality at 10x the speed.

Parallel sessions: NVENC runs on separate hardware from CUDA cores. You can encode video while running other GPU workloads (ML inference, rendering, gaming) without either slowing down. Consumer GeForce GPUs limit concurrent NVENC sessions to 12 per system (raised from 8 in late 2025, from 5 in 2024, and from 3 before that). Professional cards (Quadro, A-series, L-series) have no session limit. There's a community driver patch on GitHub that removes the consumer limit on Linux, but use it at your own discretion.

When to use CPU encoding: Offline archival at minimum file size, content where every bit of quality matters (film mastering), or when you don't have an NVIDIA GPU.

When to use NVENC: Live streaming, batch processing large video libraries, ABR transcoding, anything where speed matters more than squeezing out the last 5% of compression efficiency. If you compress video with FFmpeg at scale, GPU encoding is almost always the right call. For batch processing that also involves making videos unique (different crops, overlays, metadata), the batch uniqueness guide covers techniques you can combine with GPU acceleration.

Troubleshooting common NVENC errors

"Unknown encoder h264_nvenc"

Your FFmpeg wasn't compiled with NVENC support. Either build from source with --enable-nvenc or download a static build that includes it. The ffmpeg -encoders | grep nvenc check from earlier confirms support.

On Ubuntu/Debian, the snap version of FFmpeg (snap install ffmpeg) typically includes NVENC. The apt version usually doesn't.

"Cannot load libcuda.so.1"

The NVIDIA driver isn't installed, or it's not in the library path. Run nvidia-smi to check. If that works but FFmpeg still complains, try:

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

On Docker containers, this is especially common. Make sure you're using the nvidia/cuda base image or passing --gpus all to docker run. The NVIDIA Container Toolkit needs to be installed on the host.
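A quick sanity check that a container actually sees the GPU (the image tag here is an example; any nvidia/cuda base tag works):

```shell
# If this prints the GPU table, passthrough works; if it errors, fix Docker first
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```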

"No NVENC capable devices found"

Either your GPU doesn't support NVENC (check the NVIDIA support matrix), or the driver is too old. Update your driver and try again.

In Docker: this almost always means the container can't access the GPU. Verify with nvidia-smi inside the container. If it fails, the issue is your Docker GPU passthrough, not FFmpeg.

"Too many concurrent sessions"

Consumer GeForce GPUs limit you to 12 simultaneous NVENC sessions per system. If you're running multiple FFmpeg processes, you'll hit this. Options:

  1. Use a professional card (Quadro, A-series, L-series) which has no session limit

  2. Serialize your encoding queue so only 12 run at a time

  3. Apply the keylase/nvidia-patch on Linux to remove the limit (unofficial, use at your own risk)

  4. Offload to a cloud API where session limits aren't your problem
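For option 2, you don't need a full job queue; xargs can cap concurrency from the shell. A sketch, assuming one encode per input file (the output naming is a placeholder):

```shell
# -P 12 keeps at most 12 ffmpeg processes (and NVENC sessions) alive at once
find . -name '*.mp4' -print0 | xargs -0 -P 12 -I{} \
  ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i {} \
  -c:v h264_nvenc -b:v 5M {}.nvenc.mp4
```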

"Failed to initialize NVENC"

Usually a driver/SDK mismatch. The nv-codec-headers version used to build FFmpeg must be compatible with your installed driver. Check the compatibility table in the nv-codec-headers repo.

The fix is usually: update nv-codec-headers to match your driver version, then rebuild FFmpeg.

"Incompatible pixel format"

If you see errors about pixel formats when using -hwaccel_output_format cuda, the issue is usually that a CPU filter in your chain expects a software pixel format (like yuv420p) but receives a hardware format (cuda). Either use GPU-native filters, or explicitly convert with hwdownload,format=nv12 before the CPU filter.

When GPU infrastructure is more trouble than it's worth

All of the above assumes you have an NVIDIA GPU, the right drivers, a custom FFmpeg build, and the patience to manage it. For batch video processing in production, that's a lot of infrastructure to maintain.

Consumer GPU session limits mean you can run at most 12 encodes in parallel per system. Scaling beyond that means buying more GPUs or professional cards. Driver updates can break your pipeline. Different GPU generations support different codecs and features. And if you're running in the cloud, GPU instances (like AWS g4dn or p3) cost 3-10x more than CPU instances.

Running FFmpeg on AWS Lambda doesn't help either, since Lambda has no GPU access at all. Serverless FFmpeg setups hit the same wall.

If you're processing video at scale and don't want to manage GPU infrastructure, an FFmpeg API handles all of this for you. You send a command over HTTP, and the service runs it on optimized hardware. You don't deal with drivers or session limits. The comparison of hosted vs self-hosted FFmpeg has the full cost breakdown, and the best FFmpeg API services comparison covers pricing across providers.

For example, the same transcode becomes a single API call:

curl -X POST https://api.renderio.dev/api/v1/run-ffmpeg-command \
  -H "Content-Type: application/json" \
  -H "X-API-KEY: ffsk_your_api_key" \
  -d '{
    "ffmpeg_command": "-i {{in_video}} -c:v libx264 -preset fast -b:v 5M {{out_video}}",
    "input_files": { "in_video": "https://storage.example.com/input.mp4" },
    "output_files": { "out_video": "output.mp4" }
  }'

No GPU infrastructure on your end. RenderIO runs full FFmpeg 7.x with software encoders (libx264, libx265, libsvtav1) in a secure cloud sandbox. You skip the driver management, session limits, and custom builds entirely. Grab an API key and try it. You can also run FFmpeg in the cloud without a server or use it as a managed service over HTTP without touching infrastructure.

Quick reference

H.264 GPU encode:
  ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i in.mp4 -c:v h264_nvenc -b:v 5M out.mp4

HEVC GPU encode:
  ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i in.mp4 -c:v hevc_nvenc -qp 29 -preset slow out.mp4

AV1 GPU encode:
  ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i in.mp4 -c:v av1_nvenc -cq 30 out.mp4

GPU resize to 720p:
  ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i in.mp4 -vf scale_cuda=1280:720 -c:v h264_nvenc out.mp4

GPU deinterlace:
  ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i in.mp4 -vf yadif_cuda -c:v h264_nvenc out.mp4

List available GPUs:
  ffmpeg -i in.mp4 -c:v h264_nvenc -gpu list -f null -

Check NVENC support:
  ffmpeg -encoders 2>/dev/null | grep nvenc

Check NVDEC support:
  ffmpeg -decoders 2>/dev/null | grep cuvid

For the full FFmpeg command reference with API examples, including every operation covered here as a REST call, check the companion guide.

FAQ

Does FFmpeg automatically use GPU if I have an NVIDIA card?

No. FFmpeg defaults to CPU-based software encoding. You have to explicitly select GPU encoding with -c:v h264_nvenc (or hevc_nvenc, av1_nvenc) and hardware decoding with -hwaccel cuda. FFmpeg also needs to be compiled with NVENC support, which most package manager installs don't include.

How much faster is NVENC compared to libx264?

It depends on the preset and content, but typically 5-15x faster for 1080p. A video that takes 10 minutes with libx264 at medium preset finishes in under 2 minutes with h264_nvenc. For HEVC, the gap is larger since libx265 is slower than libx264. One benchmark on a GTX 1080 (technomancer.com) hit 900-1300 FPS for 1080p HEVC, and Ada Lovelace GPUs with dual NVENC engines push even higher.

Is NVENC quality worse than software encoding?

At the same bitrate, yes, slightly. CPU encoders like libx264 have more sophisticated algorithms and can spend more compute on each frame. But the difference is small, especially at higher bitrates (above 10Mbps for 1080p). For most practical uses like streaming, social media uploads, and batch processing, the quality is good enough that the 10x speed improvement is worth the tradeoff.

Can I use NVENC in Docker containers?

Yes, but you need the NVIDIA Container Toolkit installed on the host, and you must run containers with --gpus all (or --gpus device=0 for a specific GPU). The container also needs an FFmpeg build with NVENC support. The nvidia/cuda base images provide the driver libraries, but you still need to install or build FFmpeg inside the container.

What's the NVENC session limit on consumer GPUs?

As of late 2025, consumer GeForce GPUs allow 12 simultaneous NVENC sessions per system. NVIDIA has gradually increased this over the years (it was 3, then 5, then 8, now 12). Professional cards (Quadro, A-series, L-series) have no artificial limit. If 12 isn't enough, the keylase/nvidia-patch project on GitHub removes the limit on Linux.

Do I need the full CUDA toolkit to use NVENC?

No. NVENC runs on separate hardware from CUDA cores and doesn't require the CUDA toolkit at runtime. You only need the NVIDIA display driver and an FFmpeg build that includes NVENC headers. The CUDA toolkit is only needed if you're building FFmpeg from source with --enable-cuda-nvcc for CUDA filter support (like scale_cuda).

Can I use GPU encoding with FFmpeg on Mac or AMD GPUs?

NVENC is NVIDIA-only. On Mac, use VideoToolbox (-c:v hevc_videotoolbox), which uses Apple Silicon's dedicated media engine. On Linux with AMD GPUs, use VAAPI (-c:v h264_vaapi) or AMF (-c:v h264_amf). The commands in this guide are NVIDIA-specific. The transcoding guide covers VideoToolbox and VAAPI briefly.