Summary of NVIDIA Hardware Acceleration in ffmpeg

Original source: blog.csdn.net

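0. Overview

FFmpeg can be accelerated with NVIDIA GPUs. The high-level interface to the GPU's resources is the Video Codec SDK, which ships a complete set of high-performance tools, sample source code, and documentation, and runs on both Windows and Linux. On the software side, the SDK exposes two hardware-acceleration interfaces: the NVENCODE API for encoding and the NVDECODE API for decoding (formerly called the NVCUVID API). On the hardware side, an NVIDIA GPU contains one or more hardware codec units (the decoder is also referred to as a hardware acceleration engine) that are independent of the CUDA cores. In terms of video formats, encoding supports H.264, H.265, and lossless compression, at 8-bit and 10-bit depth, with YUV 4:4:4 and 4:2:0 chroma sampling, at resolutions up to 8K. Decoding supports MPEG-2, VC1, VP8, VP9, H.264, H.265, and lossless compression, at 8-bit, 10-bit, and 12-bit depth, with YUV 4:2:0 chroma sampling, at resolutions up to 8K. The Video Codec SDK is already integrated into the ffmpeg project, but ffmpeg exposes relatively few codec configuration parameters; to exploit the codecs fully you still need to program against the SDK directly.

A figure in the original post compares the NVIDIA encoder with x264 running on the CPU, in both performance and quality. Performance is measured in frames encoded per second, quality in PSNR.

It shows the NVIDIA encoder running 2 to 5 times faster than x264. On quality, the NVIDIA encoder beats x264 in the fast-stream scenario but falls behind it in the high-quality scenario. The comparison does not say which NVIDIA product was used, nor the CPU model and platform capabilities behind the x264 numbers. A second figure shows that for 1080p@30fps, NVENC can sustain 21 encoding streams, or 9 high-quality encoding streams.

The original post then gives a table of the encoding capabilities of different GPU models.

A further figure gives NVIDIA decoder performance, though only for two Tesla products.

The decoding capabilities are likewise given as a table.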

1. Installing the driver and SDK

1.1 Preparation

All open-source display drivers must be disabled
vi /etc/modprobe.d/blacklist.conf
Add:
blacklist amd76x_edac
blacklist vga16fb
blacklist nouveau
blacklist nvidiafb
blacklist rivatv
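
Editing blacklist.conf by hand makes it easy to add the same entry twice. The following is a minimal sketch of an idempotent alternative. The helper name ensure_blacklisted is made up for illustration, and the demo writes to a scratch file rather than to /etc/modprobe.d/blacklist.conf.

```shell
#!/bin/sh
# ensure_blacklisted is a hypothetical helper, not part of any NVIDIA tool.
# Usage: ensure_blacklisted <conf-file> <module>...
# Appends "blacklist <module>" only when that exact entry is not present yet.
ensure_blacklisted() {
    conf=$1
    shift
    for mod in "$@"; do
        if ! grep -Eq "^[[:space:]]*blacklist[[:space:]]+${mod}[[:space:]]*\$" "$conf" 2>/dev/null; then
            printf 'blacklist %s\n' "$mod" >> "$conf"
        fi
    done
}

# Demo on a scratch file; as root, pass /etc/modprobe.d/blacklist.conf instead.
scratch=$(mktemp)
ensure_blacklisted "$scratch" amd76x_edac vga16fb nouveau nvidiafb rivatv
ensure_blacklisted "$scratch" nouveau    # second call adds nothing
cat "$scratch"
```

Running the helper again after a driver reinstall leaves the file unchanged, which is the point of the check.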

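1.2 Driver installation

(1). Remove the existing driver
apt-get remove --purge 'nvidia*'
(2). Download the run-file driver from the official site and install it
service lightdm stop
chmod 777 NVIDIA-Linux-x86_64-367.44.run
./NVIDIA-Linux-x86_64-367.44.run
service lightdm start
reboot
(3). Verify the driver installation
Run nvidia-smi; if it prints the driver version and the GPU status table, the installation succeeded

Problem 1: if, after rebooting, you cannot get into the graphical desktop and are stuck in a login loop, the display driver was not installed completely and must be reinstalled. The safe approach is to install over the network
Run in a console:
apt-get remove --purge 'nvidia-*'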
add-apt-repository ppa:graphics-drivers/ppa
apt-get update
service lightdm stop
apt-get install nvidia-375 nvidia-settings nvidia-prime
nvidia-xconfig
apt-get install mesa-common-dev   # install the missing libraries
apt-get install freeglut3-dev
update-initramfs -u
reboot

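1.3 SDK installation

(1). Download the run-file installer (here the CUDA 8.0 toolkit) from the official site and install it
sh cuda_8.0.44_linux.run --no-opengl-libs   # OpenGL support is not needed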
apt-get install freeglut3-dev build-essential libx11-dev
apt-get install libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa
apt-get install libglu1-mesa-dev
gedit ~/.bashrc
Add:
export PATH=/usr/local/cuda/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

gedit /etc/ld.so.conf.d/cuda.conf
Add:
/usr/local/cuda/lib64
/lib
/lib32
/lib64
/usr/lib
/usr/lib32
sudo ldconfig
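
The two export lines above prepend unconditionally, so every time ~/.bashrc is sourced again the CUDA directories are added once more. A minimal sketch of a guard follows; prepend_once is a hypothetical helper name, and the paths are the ones used in this section.

```shell
#!/bin/sh
# prepend_once is a hypothetical helper for illustration.
# Usage: prepend_once <dir> <colon-separated-list>
# Prints the list with <dir> in front, unless <dir> is already in it.
prepend_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)
            if [ -n "$2" ]; then
                printf '%s:%s\n' "$1" "$2"
            else
                printf '%s\n' "$1"
            fi
            ;;
    esac
}

PATH=$(prepend_once /usr/local/cuda/bin "$PATH")
LD_LIBRARY_PATH=$(prepend_once /usr/local/cuda/lib64 "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

Placed in ~/.bashrc instead of the plain exports, this keeps both variables free of duplicates.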
(2). Verify the SDK installation
Run nvcc -V; if it prints the CUDA compiler version information, the installation succeeded.
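
Both verification steps (nvidia-smi for the driver, nvcc -V for the toolkit) depend on the tool being reachable through PATH in the first place. The sketch below checks only that; have_tool is a made-up helper name, and the script does not try to judge whether the versions are the right ones.

```shell
#!/bin/sh
# have_tool is a hypothetical helper: succeeds if the command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

for tool in nvidia-smi nvcc; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: NOT found on PATH"
    fi
done
```

If nvcc is reported missing right after the install, the usual cause is that the ~/.bashrc changes have not been loaded into the current shell yet.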

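2. Testing the samples

2.1 Building the samples

Enter the Samples directory and run make. If the OpenGL libraries are not installed, NvDecodeGL fails to build
The purpose of each project is described in NVIDIA_Video_Codec_SDK_Samples_Guide
NvEncoder: basic encoding
NvEncoderCudaInterop: encoding from CUDA surfaces
NvEncoderD3D9Interop: encoding from D3D9 surfaces; not available on Linux
NvEncoderLowLatency: use of the low-latency features, such as intra refresh and reference picture invalidation (RPI)
NvEncoderPerf: maximum-performance encoding
NvTranscoder: transcoding with NVENC
NvDecodeD3D9: video decoding with D3D9 display; not available on Linux
NvDecodeD3D11: video decoding with D3D11 display; not available on Linux
NvDecodeGL: video decoding with OpenGL display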

2.2 Running the samples

See NVIDIA_Video_Codec_SDK_Samples_Guide
Problem 2: a sample fails at run time with "libcuda.so failed!"
Create a symbolic link named libcuda.so in /usr/lib/x86_64-linux-gnu that points to libcuda.so.375.26
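
The fix for Problem 2 can be sketched as a small function. link_libcuda is a hypothetical name, the version suffix must match the driver that is actually installed (375.26 is just the one from this post), and the demo runs in a scratch directory so it can be tried without root.

```shell
#!/bin/sh
# link_libcuda is a hypothetical helper for illustration.
# Usage: link_libcuda <lib-dir> <driver-version>
# Creates <lib-dir>/libcuda.so as a symlink to libcuda.so.<driver-version>.
link_libcuda() {
    if [ ! -e "$1/libcuda.so.$2" ]; then
        echo "missing $1/libcuda.so.$2" >&2
        return 1
    fi
    ln -sf "libcuda.so.$2" "$1/libcuda.so"
}

# Demo in a scratch directory with an empty stand-in for the real library.
demo=$(mktemp -d)
: > "$demo/libcuda.so.375.26"
link_libcuda "$demo" 375.26
ls -l "$demo"
```

On the real system the call would be link_libcuda /usr/lib/x86_64-linux-gnu 375.26, run as root and followed by ldconfig.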

3. Integrating with ffmpeg

3.1 Building ffmpeg

3.1.1 Preparation

Make sure the Video_Codec_SDK_7.1.9/Samples/common/inc directory contains the basic header files
Make sure the Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 directory contains libGLEW.a
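
These two checks can be scripted before running configure. The sketch below is an assumption-laden example: check_sdk_files is a made-up helper, and nvEncodeAPI.h is my guess at the header that matters most for --enable-nvenc, since the text above only says "basic header files".

```shell
#!/bin/sh
# check_sdk_files is a hypothetical helper; the header name is an assumption.
# Usage: check_sdk_files <sdk-root>
# Returns non-zero if any expected file is missing.
check_sdk_files() {
    missing=0
    for f in Samples/common/inc/nvEncodeAPI.h \
             Samples/common/lib/linux/x86_64/libGLEW.a; do
        if [ -f "$1/$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing=1
        fi
    done
    return "$missing"
}

# SDK location taken from the configure command in this post.
check_sdk_files /root/workspace/Video_Codec_SDK_7.1.9 ||
    echo "fix the SDK path before running configure"
```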

3.1.2 The configure command

./configure \
  --enable-version3 \
  --enable-libfdk-aac \
  --enable-libmp3lame \
  --enable-libx264 \
  --enable-nvenc \
  --extra-cflags=-I/root/workspace/Video_Codec_SDK_7.1.9/Samples/common/inc \
  --extra-ldflags=-L/root/workspace/Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 \
  --enable-shared \
  --enable-gpl \
  --enable-postproc \
  --enable-nonfree \
  --enable-avfilter \
  --enable-pthreads

3.1.3 make

Run make && make install

3.2 Testing ffmpeg

Run ffmpeg -codecs|grep nvenc
Output like the following shows that the NVENC encoders are available:

ffmpeg version 3.0.git Copyright (c) 2000-2016 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.1) 20160609
  configuration: --enable-version3 --enable-libfdk-aac --enable-libmp3lame --enable-libx264 --enable-nvenc --extra-cflags=-I/workspace/Video_Codec_SDK_7.1.9/Samples/common/inc --extra-ldflags=-L/workspace/Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 --enable-shared --enable-gpl --enable-postproc --enable-nonfree --enable-avfilter --enable-pthreads
  libavutil      55. 29.100 / 55. 29.100
  libavcodec     57. 54.100 / 57. 54.100
  libavformat    57. 48.100 / 57. 48.100
  libavdevice    57.  0.102 / 57.  0.102
  libavfilter     6. 57.100 /  6. 57.100
  libswscale      4.  1.100 /  4.  1.100
  libswresample   2.  1.100 /  2.  1.100
  libpostproc    54.  0.100 / 54.  0.100
 DEV.LS h264                 H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10 (encoders: libx264 libx264rgb h264_nvenc nvenc nvenc_h264 )
 DEV.L. hevc                 H.265 / HEVC (High Efficiency Video Coding) (encoders: nvenc_hevc hevc_nvenc )

The prefix letters have the following meaning:
D..... = Decoding supported
.E.... = Encoding supported
..V... = Video codec
..A... = Audio codec
..S... = Subtitle codec
...I.. = Intra frame-only codec
....L. = Lossy compression
.....S = Lossless compression
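
To get just the encoder names out of that listing, the "(encoders: ...)" part of each line can be extracted. A rough sketch follows; list_nvenc_encoders is a hypothetical helper name, and it assumes the line layout shown in the sample output above.

```shell
#!/bin/sh
# list_nvenc_encoders is a hypothetical helper for illustration.
# Reads `ffmpeg -codecs` output on stdin and prints each NVENC encoder name
# found inside an "(encoders: ...)" group, one per line.
list_nvenc_encoders() {
    sed -n 's/.*(encoders:\(.*\)).*/\1/p' | tr ' ' '\n' | grep nvenc | sort -u
}

# Only query ffmpeg when it is actually installed.
if command -v ffmpeg >/dev/null 2>&1; then
    ffmpeg -hide_banner -codecs 2>/dev/null | list_nvenc_encoders ||
        echo "no NVENC encoders listed"
fi
```

With the build from this post the list should contain h264_nvenc and hevc_nvenc plus the older alias names.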

3.3 Using the codecs

H.265 encoding test
(1). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec hevc_nvenc -r 60 -y 2_60.265
(2). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec hevc_nvenc -r 30 -y 2_30.265

H.264 encoding test
(3). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec h264_nvenc -r 60 -y 2_60.264
(4). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec h264_nvenc -r 30 -y 2_30.264

H.264 to H.265
(5). ffmpeg -i 1_60.264 -vcodec hevc_nvenc -r 60 -y 2_60_264to265.265
(6). ffmpeg -i 1_30.264 -vcodec hevc_nvenc -r 30 -y 2_30_264to265.265

H.265 to H.264
(7). ffmpeg -i 1_60.265 -vcodec h264_nvenc -r 60 -y 2_60_265to264.264
(8). ffmpeg -i 1_30.265 -vcodec h264_nvenc -r 30 -y 2_30_265to264.264
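
Commands (1) to (4) differ only in the encoder name, the frame rate, and the output extension, so they can be generated in a loop. The sketch below only prints the command lines; gen_encode_cmds is a hypothetical helper name and nothing is encoded until the output is passed to a shell.

```shell
#!/bin/sh
# gen_encode_cmds is a hypothetical helper for illustration.
# Usage: gen_encode_cmds <raw-yuv-file>
# Prints (does not run) one ffmpeg command per encoder / frame rate pair.
gen_encode_cmds() {
    src=$1
    for codec in hevc_nvenc h264_nvenc; do
        case $codec in
            hevc_nvenc) ext=265 ;;
            *)          ext=264 ;;
        esac
        for fps in 60 30; do
            printf 'ffmpeg -s 1920x1080 -pix_fmt yuv420p -i %s -vcodec %s -r %s -y 2_%s.%s\n' \
                "$src" "$codec" "$fps" "$fps" "$ext"
        done
    done
}

gen_encode_cmds BQTerrace_1920x1080_60.yuv
```

Appending "| sh" to the last line runs the four encodes one after another.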

3.4 Using the encoders in a program

avcodec_find_encoder_by_name("h264_nvenc");
avcodec_find_encoder_by_name("hevc_nvenc");

4. Auxiliary tools

watch -n 1 nvidia-smi
Shows GPU resource usage, refreshed at 1-second intervals
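
When only the encoder engine load is of interest, nvidia-smi dmon -s u prints one utilization row per second with separate enc and dec columns. The sketch below pulls out the enc column; enc_util is a hypothetical helper name, and the header layout it relies on ("# gpu sm mem enc dec") is written from memory, so check it against your driver version.

```shell
#!/bin/sh
# enc_util is a hypothetical helper; the dmon header layout is an assumption.
# Reads `nvidia-smi dmon -s u` output on stdin and prints the encoder
# utilization (percent) from every data row.
enc_util() {
    awk '
        # Header row: locate the "enc" column. Data rows have no leading "#",
        # so the column index there is one lower than in the header.
        $1 == "#" && $2 == "gpu" {
            for (i = 2; i <= NF; i++) if ($i == "enc") col = i - 1
            next
        }
        $1 == "#" { next }          # units row
        col { print $col }
    '
}

# Take five samples, but only if the tool is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi dmon -s u -c 5 | enc_util
fi
```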

5. Measured results

5.1 Hardware

I tested with a GeForce GTX1070 and a Tesla P4, both of which are Pascal-architecture GPUs.
(1). Hardware information for the GTX1070 (as reported by deviceQuery):

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1070"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8110 MBytes (8504279040 bytes)
  (15) Multiprocessors, (128) CUDA Cores/MP:     1920 CUDA Cores
  GPU Max Clock rate:                            1683 MHz (1.68 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1070
Result = PASS

(2). Hardware information for the P4:

 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla P4"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 7606 MBytes (7975862272 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1114 MHz (1.11 GHz)
  Memory Clock rate:                             3003 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla P4
Result = PASS

5.2 Experimental results

(1). GTX1070
| Target rate (-r) | HEVC encode | H.264 encode | H.264 to H.265 | H.265 to H.264 |
| 60fps | 387fps(6.45x) | 430fps(7.17x) | 348fps(5.79x) | 170fps(2.84x) |
| 30fps | 345fps(11.5x) | 429fps(14.3x) | 318fps(10.6x) | 94fps(3.13x) |
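
The figure in parentheses is the measured encoding speed divided by the target frame rate, that is, how many times faster than real time the encoder ran. A small sketch of that arithmetic follows; speedup is a made-up helper name.

```shell
#!/bin/sh
# speedup is a hypothetical helper for illustration.
# Usage: speedup <measured-fps> <target-fps>
# Prints measured/target with two decimals.
speedup() {
    LC_ALL=C awk -v fps="$1" -v rt="$2" 'BEGIN { printf "%.2f\n", fps / rt }'
}

speedup 387 60    # HEVC encode at -r 60, prints 6.45
speedup 429 30    # H.264 encode at -r 30, prints 14.30
```

Read the other way round, 14.3x at 30fps means roughly 14 real-time 1080p30 streams' worth of H.264 encoding throughput on this card.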
(2). P4

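5.3 Analysis

On paper, the P4 has slightly less memory and a lower clock than the GTX1070 but 33% more CUDA cores. Yet in the experiments the P4 trails the GTX1070 in every case except h265->h264, where the two are level. This is consistent with the official statement that the codec engines are independent of the CUDA cores.

6. Source code analysis

A video codec integrated into the ffmpeg framework has to define an AVCodec structure, which includes (among other things) a private AVClass structure and three callback functions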

6.1 The H.264 part

(1). Structures (nvenc_h264.c)

AVCodec ff_h264_nvenc_encoder = {
    .name           = "h264_nvenc",
    .long_name      = NULL_IF_CONFIG_SMALL("NVIDIA NVENC H.264 encoder"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_H264,
    .init           = ff_nvenc_encode_init, // initialization function
    .encode2        = ff_nvenc_encode_frame, // encoding function
    .close          = ff_nvenc_encode_close, // teardown function
    .priv_data_size = sizeof(NvencContext),  // internal data structure, see nvenc.h
    .priv_class     = &h264_nvenc_class,     // private AVClass
    .defaults       = defaults,
    .capabilities   = AV_CODEC_CAP_DELAY,
    .caps_internal  = FF_CODEC_CAP_INIT_CLEANUP,
    .pix_fmts       = ff_nvenc_pix_fmts,
};

static const AVClass h264_nvenc_class = {
    .class_name = "h264_nvenc",
    .item_name = av_default_item_name,
    .option = options, // the encoder's option parameters live in this AVOption array
    .version = LIBAVUTIL_VERSION_INT,
};

Note that there are two more AVCodec definitions, one named nvenc and one named nvenc_h264; their three functions are the same as those of h264_nvenc
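
The entries of the AVOption array referenced by the AVClass above are what ffmpeg exposes as the encoder's private command-line options, and ffmpeg -h encoder=NAME prints them. A minimal sketch follows; show_encoder_options is a hypothetical wrapper, and its optional second argument (the ffmpeg binary to use) exists only so the missing-binary path can be exercised.

```shell
#!/bin/sh
# show_encoder_options is a hypothetical wrapper for illustration.
# Usage: show_encoder_options <encoder-name> [ffmpeg-binary]
# Prints the encoder's help text, or returns 127 if the binary is missing.
show_encoder_options() {
    ff=${2:-ffmpeg}
    if ! command -v "$ff" >/dev/null 2>&1; then
        echo "$ff not found on PATH" >&2
        return 127
    fi
    "$ff" -hide_banner -h encoder="$1"
}

show_encoder_options h264_nvenc || true
```

The same call with hevc_nvenc lists the options of the H.265 encoder.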
(2). Handler functions (nvenc.c)

av_cold int ff_nvenc_encode_init(AVCodecContext *avctx)
{
   NvencContext *ctx = avctx->priv_data; // fetch the encoder's private context
   ...
   // the steps below are helpers built on the NVENC API
   nvenc_load_libraries
   nvenc_setup_device
   nvenc_setup_encoder
   nvenc_setup_surfaces
   nvenc_setup_extradata
   ...
}
int ff_nvenc_encode_frame(AVCodecContext *avctx, AVPacket *pkt,
                          const AVFrame *frame, int *got_packet)
{
    ...
    if (frame) {
        inSurf = get_free_frame(ctx); // take a free input surface for the incoming frame
        ...
        res = nvenc_upload_frame(avctx, frame, inSurf); // upload the frame for encoding
        ...
    }
}
av_cold int ff_nvenc_encode_close(AVCodecContext *avctx)
{
   ...
   // free and destroy the resources
}

6.2 The H.265 part

(1). Structures (nvenc_hevc.c)

AVCodec ff_hevc_nvenc_encoder = {
    .name           = "hevc_nvenc",
    .long_name      = NULL_IF_CONFIG_SMALL("NVIDIA NVENC hevc encoder"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_HEVC,
    .init           = ff_nvenc_encode_init, // initialization function
    .encode2        = ff_nvenc_encode_frame, // encoding function
    .close          = ff_nvenc_encode_close, // teardown function
    .priv_data_size = sizeof(NvencContext),  // internal data structure, see nvenc.h
    .priv_class     = &hevc_nvenc_class, // private AVClass
    .defaults       = defaults,
    .pix_fmts       = ff_nvenc_pix_fmts,
    .capabilities   = AV_CODEC_CAP_DELAY,
    .caps_internal  = FF_CODEC_CAP_INIT_CLEANUP,
};

static const AVClass hevc_nvenc_class = {
    .class_name = "hevc_nvenc",
    .item_name = av_default_item_name,
    .option = options, // the encoder's option parameters live in this AVOption array
    .version = LIBAVUTIL_VERSION_INT,
};

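Note that there is one more AVCodec definition, named nvenc_hevc; its three functions are the same as those of h264_nvenc
(2). Handler functions (nvenc.c)
Same as the H.264 handler functions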
