Basic idea: YOLOX has just been released. Many of my previous projects deployed on the nano with TensorRT, so I suddenly wanted to try deploying on the nano with Tengine as well; this is a quick record of the process.
The plan, in steps: 1) first test YOLOX performance on the PC, using the official demo plus material found online;
2) then pick the best deployment for the nano and collect test numbers.
TensorRT: grab the code from the experts: GitHub - Megvii-BaseDetection/YOLOX: YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
(1) Build the demo source. I had already set up my TensorRT environment earlier, so I won't rehash that here; see 50、ubuntu18.04/20.04进行TensorRT环境搭建和YOLO5部署(含安装vulkan)_sxj731533730-CSDN博客
ubuntu@ubuntu:~$ cd YOLOX/demo/TensorRT/cpp/
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp$ mkdir build
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp$ cd build/
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ubuntu/YOLOX/demo/TensorRT/cpp/build
(2) During the build, the first problem:
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ make -j 8
[ 50%] Building CXX object CMakeFiles/yolox.dir/yolox.cpp.o
/home/ubuntu/YOLOX/demo/TensorRT/cpp/yolox.cpp:9:10: fatal error: NvInfer.h: No such file or directory
    9 | #include "NvInfer.h"
      |          ^~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/yolox.dir/build.make:63: CMakeFiles/yolox.dir/yolox.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/yolox.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
The TensorRT and CUDA headers need to be copied into the demo source directory (adding the include paths to the demo's CMakeLists.txt would be cleaner, but copying also works):
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp$ cp /home/ubuntu/NVIDIA_CUDA-11.1_Samples/TensorRT-7.2.2.3/include/* .
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp$ cp -r /usr/local/cuda/include/* .
(3) Building again hits the next problem:
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ make -j 8
[100%] Linking CXX executable yolox
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/yolox.dir/build.make:99: yolox] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/yolox.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
(4) Fix the linker error by symlinking libcudart into a default search path, then build again:
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ sudo ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ make
[ 50%] Linking CXX executable yolox
[100%] Built target yolox
(5) Download the model weights (it's a release asset, so fetch it with wget or a browser rather than git clone): https://github.com/Megvii-BaseDetection/storage/releases/download/0.0.1/yolox_s.pth
First, set up the environment:
ubuntu@ubuntu:~$ git clone https://github.com/NVIDIA-AI-IOT/torch2trt
ubuntu@ubuntu:~$ cd torch2trt
ubuntu@ubuntu:~/torch2trt$ python3 setup.py install --user
ubuntu@ubuntu:~/torch2trt$ python3
Python 3.8.10 (default, Jun 2 2021, 10:49:15)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch2trt
>>> exit()
ubuntu@ubuntu:~/YOLOX$ pip install -r requirements.txt
Convert the model to a TensorRT engine:
ubuntu@ubuntu:~/YOLOX$ cp tools/trt.py .
ubuntu@ubuntu:~/YOLOX$ python3 trt.py -n yolox-s -c /home/ubuntu/a/yolox_s.pth
2021-07-24 10:50:26.033 | INFO | __main__:main:52 - loaded checkpoint done.
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] WARNING: Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
My laptop GPU only has 4 GB of memory, so I tweaked trt.py to shrink the workspace:
model_trt = torch2trt(
    model,
    [x],
    fp16_mode=True,
    log_level=trt.Logger.INFO,
    # max_workspace_size=(1 << 32),
    max_workspace_size=(1 << 22),
)
After that the conversion succeeds.
Path of the generated engine:
/home/ubuntu/YOLOX/YOLOX_outputs/yolox_s/model_trt.engine
(6) Test the result (this rookie's PC GPU is a 1050 Ti):
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ ./yolox ../../../../YOLOX_outputs/yolox_s/model_trt.engine -i ../../../../assets/dog.jpg
[07/24/2021-11:16:26] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[07/24/2021-11:16:26] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
blob image
24ms
num of boxes before nms: 38
num of boxes: 3
1 = 0.92316 at 111.95 128.16 456.10 x 312.89
16 = 0.88935 at 133.35 214.42 183.07 x 329.38
7 = 0.64439 at 468.19 76.37 221.86 x 92.72
save vis file
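For context, the heavy lifting in the demo is deserializing the engine that trt.py serialized, then running it through an execution context. A minimal sketch of the loading step, assuming the TensorRT 7-era API (the Logger class and loadEngine name are mine, not the demo's):

#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>
#include "NvInfer.h"

// Trivial logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            printf("[TRT] %s\n", msg);
    }
};

// Read a serialized engine file and deserialize it with the runtime.
nvinfer1::ICudaEngine* loadEngine(const char* path, nvinfer1::IRuntime*& runtime, Logger& logger)
{
    std::ifstream file(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());
    runtime = nvinfer1::createInferRuntime(logger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}

The returned engine gets wrapped in an IExecutionContext, input/output device buffers are bound by binding index, and enqueue() runs the inference; that is where the 24ms above comes from.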
(7) I won't walk through building the NCNN environment again; straight to the NCNN test.
One quick note: after downloading the ncnn source, build it; to test ncnn without the Vulkan GPU acceleration, turn it off explicitly (it seems to be on by default):
ubuntu@ubuntu:~/ncnn/build$ cmake -DNCNN_VULKAN=OFF ..
Generate the ONNX model with the provided script. Oddly, I had to copy the script out of the tools folder before it would run~
ubuntu@ubuntu:~/YOLOX$ cp tools/export_onnx.py .
ubuntu@ubuntu:~/YOLOX$ python3 export_onnx.py -n yolox-s -c /home/ubuntu/a/yolox_s.pth
2021-07-24 14:03:49.491 | INFO | __main__:main:50 - args value: Namespace(ckpt='/home/ubuntu/a/yolox_s.pth', exp_file=None, experiment_name=None, input='images', name='yolox-s', no_onnxsim=False, opset=11, opts=[], output='output', output_name='yolox.onnx')
2021-07-24 14:03:49.623 | INFO | __main__:main:74 - loaded checkpoint done.
2021-07-24 14:03:53.540 | INFO | __main__:main:84 - generate onnx named yolox.onnx
Simplifying...
Checking 0/3...
Checking 1/3...
Checking 2/3...
Ok!
2021-07-24 14:03:57.516 | INFO | __main__:main:89 - generate simplify onnx named yolox.onnx
Convert the model:
ubuntu@ubuntu:~/ncnn/build/tools/onnx$ ./onnx2ncnn ../../../../YOLOX/yolox.onnx ../../../../YOLOX/yolox-s.param ../../../../YOLOX/yolox-s.bin
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Those Unsupported slice step warnings come from the stride-2 slice ops of the Focus layer. Edit the model by hand following my earlier post 45、NCNN之ONNX模型解析及其使用(YOLO5)_sxj731533730-CSDN博客 or nihui's Zhihu write-up 详细记录u版YOLOv5目标检测ncnn实现 - 知乎.
Original param file (first 16 lines):
7767517
235 268
Input images 0 1 images
Split splitncnn_input0 1 4 images images_splitncnn_0 images_splitncnn_1 images_splitncnn_2 images_splitncnn_3
Crop Slice_4 1 1 images_splitncnn_3 467 -23309=1,0 -23310=1,2147483647 -23311=1,1
Crop Slice_9 1 1 467 472 -23309=1,0 -23310=1,2147483647 -23311=1,2
Crop Slice_14 1 1 images_splitncnn_2 477 -23309=1,0 -23310=1,2147483647 -23311=1,1
Crop Slice_19 1 1 477 482 -23309=1,1 -23310=1,2147483647 -23311=1,2
Crop Slice_24 1 1 images_splitncnn_1 487 -23309=1,1 -23310=1,2147483647 -23311=1,1
Crop Slice_29 1 1 487 492 -23309=1,0 -23310=1,2147483647 -23311=1,2
Crop Slice_34 1 1 images_splitncnn_0 497 -23309=1,1 -23310=1,2147483647 -23311=1,1
Crop Slice_39 1 1 497 502 -23309=1,1 -23310=1,2147483647 -23311=1,2
Concat Concat_40 4 1 472 492 482 502 503 0=0
Convolution Conv_41 1 1 503 877 0=32 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=3456
Swish Mul_43 1 1 877 507
Convolution Conv_44 1 1 507 880 0=64 1=3 11=3 2=1 12=1 3=2 13=2 4=1 14=1 15=1 16=1 5=1 6=18432
Modified param file (the Split, eight Crop, and Concat layers are replaced by a single YoloV5Focus layer, so the layer count drops from 235 to 226):
7767517
226 268
Input images 0 1 images
YoloV5Focus focus 1 1 images 503
Convolution Conv_41 1 1 503 877 0=32 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=3456
Swish Mul_43 1 1 877 507
Convolution Conv_44 1 1 507 880 0=64 1=3 11=3 2=1 12=1 3=2 13=2 4=1 14=1 15=1 16=1 5=1 6=18432
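YoloV5Focus is not a built-in ncnn layer, so it has to be registered as a custom layer in the inference code before load_param. For reference, a sketch of the layer, essentially the implementation from nihui's ncnn YOLOv5 example (the surrounding ncnn::Net setup is assumed):

#include "layer.h"
#include "net.h"

// Rearranges a w x h x c blob into (w/2) x (h/2) x 4c by sampling every
// second pixel, i.e. exactly what the removed Slice/Concat chain computed.
class YoloV5Focus : public ncnn::Layer
{
public:
    YoloV5Focus() { one_blob_only = true; }

    virtual int forward(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob, const ncnn::Option& opt) const
    {
        int w = bottom_blob.w;
        int h = bottom_blob.h;
        int channels = bottom_blob.c;

        int outw = w / 2;
        int outh = h / 2;
        int outc = channels * 4;

        top_blob.create(outw, outh, outc, 4u, 1, opt.blob_allocator);
        if (top_blob.empty())
            return -100;

        #pragma omp parallel for num_threads(opt.num_threads)
        for (int p = 0; p < outc; p++)
        {
            const ncnn::Mat m = bottom_blob.channel(p % channels);
            float* outptr = top_blob.channel(p);

            for (int i = 0; i < outh; i++)
            {
                const float* ptr = m.row(i * 2 + ((p / channels) % 2)) + ((p / channels) / 2);
                for (int j = 0; j < outw; j++)
                {
                    *outptr = *ptr;
                    outptr += 1;
                    ptr += 2;
                }
            }
        }
        return 0;
    }
};
DEFINE_LAYER_CREATOR(YoloV5Focus)

Register it with net.register_custom_layer("YoloV5Focus", YoloV5Focus_layer_creator); before net.load_param(...).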
Test the CPU + ncnn inference time:
/home/ubuntu/CLionProjects/untitled1/cmake-build-debug/untitled1
output height: 3549, width: 85, channels: 1, dims:2
1 = 0.94330 at 118.83 128.06 449.39 x 289.91
196ms
16 = 0.87024 at 131.04 219.33 183.01 x 321.65
2 = 0.77507 at 464.40 80.81 226.02 x 92.19
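A note on that output shape: with the demo's 416x416 input, 3549 = 52*52 + 26*26 + 13*13 grid cells at strides 8, 16 and 32, and each of the 85 columns is cx, cy, w, h, objectness plus the 80 COCO class scores. The demo decodes it by first enumerating per-cell grid/stride pairs, roughly like this sketch (names are mine; the decode formula is YOLOX's anchor-free one):

#include <vector>

struct GridAndStride
{
    int grid0;
    int grid1;
    int stride;
};

// Enumerate one entry per output row: 52*52 + 26*26 + 13*13 = 3549 for a 416 input.
static void generate_grids_and_stride(int target_size, const std::vector<int>& strides,
                                      std::vector<GridAndStride>& grid_strides)
{
    for (int stride : strides) // {8, 16, 32}
    {
        int num_grid = target_size / stride; // 52, 26, 13
        for (int g1 = 0; g1 < num_grid; g1++)
            for (int g0 = 0; g0 < num_grid; g0++)
                grid_strides.push_back(GridAndStride{g0, g1, stride});
    }
}

// Row k of the 3549 x 85 output then decodes as:
//   x_center = (out[0] + grid0) * stride;   y_center = (out[1] + grid1) * stride;
//   width    = exp(out[2]) * stride;        height   = exp(out[3]) * stride;
//   score    = out[4] (objectness) * out[5 + class_id];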
I also gave the Python demo a quick spin:
ubuntu@ubuntu:~/YOLOX$ python3 tools/demo.py image -f exps/default/yolox_s.py --trt --nms 0.5 --conf 0.5 --save_result
2021-08-27 17:29:07.976 | INFO | __main__:main:239 - Args: Namespace(camid=0, ckpt=None, conf=0.5, demo='image', device='gpu', exp_file='exps/default/yolox_s.py', experiment_name='yolox_s', fp16=False, fuse=False, name=None, nms=0.5, path='./assets/dog.jpg', save_result=True, trt=True, tsize=None)
/home/ps/anaconda2/envs/tensorflow/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
2021-08-27 17:29:08.168 | INFO | __main__:main:249 - Model Summary: Params: 8.97M, Gflops: 26.81
2021-08-27 17:29:14.971 | INFO | __main__:main:278 - Using TensorRT to inference
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
2021-08-27 17:29:16.636 | INFO | __main__:inference:150 - Infer time: 0.0095s
2021-08-27 17:29:16.638 | INFO | __main__:image_demo:187 - Saving detection result in ./YOLOX_outputs/yolox_s/vis_res/2021_08_27_17_29_16/dog.jpg
The test image:
(8) Download the source developed by 圈圈's team: https://github.com/OAID/Tengine.git
It turns out the official repo already ships YOLOX example source, so let's just test that; my own "homework" wasn't even finished yet~~
Take the ONNX model produced by the official tutorial above and convert it to a Tengine test model.
ubuntu@ubuntu:~/Tengine/build$ cmake -DTENGINE_BUILD_CONVERT_TOOL=ON -DTENGINE_BUILD_QUANT_TOOL=ON -DTENGINE_ENABLE_CUDA=ON -DTENGINE_ENABLE_TENSORRT=ON ..
First I moved the Tengine source and the built libraries into a CLion project. Tengine is quite considerate here: there are only two public headers, and the build produces both a dynamic .so and a static .a.
Porting hit one problem, missing header files, which turned out to live in the examples' common folder; rather than pull that in, I checked which functions in the test code actually depend on those headers.
A small edit is enough: comment out the two files that aren't shipped and add one header:
//#include "common.h"
#include "tengine/c_api.h"
//#include "tengine_operations.h"
#include <unistd.h>
Change the timing code to use clock() and it compiles (note clock() returns ticks, so divide by CLOCKS_PER_SEC to get seconds):
double start = clock(); // get_current_time();
if (run_graph(graph, 1) < 0)
{
    fprintf(stderr, "Run graph failed\n");
    return -1;
}
double end = clock(); // get_current_time();
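With the includes sorted, the whole test program only needs the C API. A bare-bones sketch of the flow (the yolox.tmfile name and the 416x416 input shape are assumptions carried over from the model converted below; preprocessing and output decoding are omitted):

#include <cstdio>
#include <vector>
#include "tengine/c_api.h"

int main()
{
    if (init_tengine() != 0)
        return -1;

    graph_t graph = create_graph(nullptr, "tengine", "yolox.tmfile");
    if (graph == nullptr)
    {
        fprintf(stderr, "create_graph failed\n");
        return -1;
    }

    // declare the input shape, then let Tengine allocate the graph
    int dims[4] = {1, 3, 416, 416}; // NCHW float32
    tensor_t input_tensor = get_graph_input_tensor(graph, 0, 0);
    set_tensor_shape(input_tensor, dims, 4);
    if (prerun_graph(graph) < 0)
    {
        fprintf(stderr, "prerun_graph failed\n");
        return -1;
    }

    // hand the preprocessed image to the input tensor and run
    std::vector<float> input(1 * 3 * 416 * 416);
    set_tensor_buffer(input_tensor, input.data(), (int)(input.size() * sizeof(float)));
    if (run_graph(graph, 1) < 0)
    {
        fprintf(stderr, "Run graph failed\n");
        return -1;
    }

    // fetch the raw output; decode it the same way as in the ncnn section
    tensor_t output_tensor = get_graph_output_tensor(graph, 0, 0);
    const float* out = (const float*)get_tensor_buffer(output_tensor);
    (void)out;

    postrun_graph(graph);
    destroy_graph(graph);
    release_tengine();
    return 0;
}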
(9) Model conversion; see 模型转换工具 — Tengine 文档.
First build the tools from source; I turned these four options on:
OPTION (TENGINE_BUILD_CONVERT_TOOL "Build convert tool" ON)
OPTION (TENGINE_BUILD_QUANT_TOOL "Build quantization tool" ON)
OPTION (TENGINE_ENABLE_CUDA "With nVIDIA CUDA support" ON)
OPTION (TENGINE_ENABLE_TENSORRT "With nVIDIA TensorRT support" ON)
The build is a bit odd: even though my cuda-11.1 is already mapped to the cuda directory, it still goes looking under cuda-11.1 directly, so I gave in and copied the files into the cuda-11.1 directory, per 50、ubuntu18.04/20.04进行TensorRT环境搭建和YOLO5部署(含安装vulkan)_sxj731533730-CSDN博客.
First problem:
/usr/local/cuda-11.1/include/cudnn.h:60:10: fatal error: cudnn_version.h: No such file or directory
   60 | #include "cudnn_version.h"
The fix (still odd: my environment variables map to the cuda directory, which does contain these files, yet it insists on cuda-11.1; not going to dig into it~):
ubuntu@ubuntu:~$ tar -zxvf cudnn-11.2-linux-x64-v8.1.0.77.tgz
ubuntu@ubuntu:~$ sudo cp cuda/include/* /usr/local/cuda-11.1/include
ubuntu@ubuntu:~$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda-11.1/lib64
ubuntu@ubuntu:~$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda-11.1/lib64/libcudnn*
Another problem:
/home/ubuntu/Tengine/source/device/tensorrt/trt_limit.hpp:43:10: fatal error: NvInfer.h: No such file or directory
   43 | #include <NvInfer.h>
      |          ^~~~~~~~~~~
The fix: a guru told me you can simply add the include path in the CMakeLists.txt, which amounts to the same thing; I just copied the headers~
ubuntu@ubuntu:~/Tengine/source/device/tensorrt$ cp /home/ubuntu/Downloads/TensorRT/TensorRT-7.2.2.3.Ubuntu-18.04.x86_64-gnu.cuda-11.1.cudnn8.0/TensorRT-7.2.2.3/include/* .
After that it compiled cleanly~
For the model conversion, refer to 虫叔's Zhihu post: Tengine 支持 NPU 模型部署-YOLOX - 知乎 (suddenly I feel I should learn how to modify ops!!).
I also followed his yolov5.pt op-optimization process: Tengine/tools/optimize at tengine-lite · OAID/Tengine · GitHub
Conversion result:
Then run the command line from that guide (I suddenly feel like a mere porter of knowledge; maybe that's just how it spreads~):
$ python3 yolov5s-opt.py --input yolov5s.v5.onnx --output yolov5s.v5.opt.onnx --in_tensor 167 --out_tensor 397,458,519
For the original YOLOX model, just mimic the preset inputs of the yolov5 example above.
The op-modified model:
This is the output after the optimize script runs successfully:
/usr/bin/python3 /home/ubuntu/Tengine/tools/optimize/yolov5s-opt.py
---- Tengine YOLOv5 Optimize Tool ----

Input model      : /home/ubuntu/YOLOX/yolox.onnx
Output model : /home/ubuntu/YOLOX/yolox.opt.onnx
Input tensor : 503
Output tensor : output
[Quant Tools Info]: Step 0, load original onnx model from /home/ubuntu/YOLOX/yolox.onnx.
278
[Quant Tools Info]: Step 1, Remove the focus and postprocess nodes.
[Quant Tools Info]: Step 2, Using hardswish replace the sigmoid and mul.
[Quant Tools Info]: Step 3, Rebuild onnx graph nodes.
[Quant Tools Info]: Step 4, Update input and output tensor.
[Quant Tools Info]: Step 5, save the new onnx model to /home/ubuntu/YOLOX/yolox.opt.onnx.
---- Tengine YOLOv5s Optimize onnx create success, best wish for your inference has a high accuracy ...\(^0^)/ ----
Run the conversion:
ubuntu@ubuntu:~/Tengine/build/install/bin$ ./convert_tool -f onnx -m ~/YOLOX/yolox.opt.onnx -o ~/YOLOX/yolox.tmfile
---- Tengine Convert Tool ----

Version     : v1.0, 11:27:47 Jul 25 2021
Status      : float32
----------onnx2tengine begin----------
Model op set is :11
----------onnx2tengine done.----------
graph opt begin
graph opt done.
Convert model success. /home/ubuntu/YOLOX/yolox.opt.onnx -----> /home/ubuntu/YOLOX/yolox.tmfile
And the other model I converted:
python3 yolov5s-opt.py --input yolox-smi.onnx --output yolox-smiopt.onnx --in_tensor 503 --out_tensor output
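The boundary tensors line up with the param dump earlier: 503 is the tensor produced by the Focus slice/concat chain (which the script strips out and replaces with a plain input), and output is the final head tensor, matching the Input tensor : 503 and Output tensor : output lines in the optimize log above.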
Unfortunately the board the customer gave me runs Android. I had first done another project (12、 Android+RK3399 pro+USB直连摄像头+NCNN+Nanodet进行检测_sxj731533730-CSDN博客) and then reflashed the board; flashing this thing is a real pain, too many problems to record here. With it flashed to a Linux system, time to test YOLOX.
root@teamhd:~/Tengine/build/benchmark# ./tm_benchmark -r 5 -t 1 -p 1
Tengine benchmark:
   loops: 5
   threads: 1
   cluster: 1
   affinity: 0xFFFFFFFF
Tengine-lite library version: 1.5-dev
squeezenet_v1.1   min =   58.57 ms   max =   62.29 ms   avg =   61.25 ms
mobilenetv1       min =  110.43 ms   max =  115.09 ms   avg =  113.89 ms
mobilenetv2       min =  117.23 ms   max =  117.53 ms   avg =  117.40 ms
mobilenetv3       min =   81.17 ms   max =   81.76 ms   avg =   81.53 ms
shufflenetv2      min =   36.52 ms   max =   37.24 ms   avg =   36.88 ms
resnet18          min =  200.26 ms   max =  200.65 ms   avg =  200.47 ms
resnet50          min =  582.59 ms   max =  694.41 ms   avg =  627.50 ms
googlenet         min =  243.06 ms   max =  244.56 ms   avg =  243.80 ms
inceptionv3       min = 1134.42 ms   max = 1139.62 ms   avg = 1137.18 ms
vgg16             min = 1069.57 ms   max = 1206.45 ms   avg = 1132.67 ms
mssd              min =  240.48 ms   max =  242.75 ms   avg =  241.59 ms
retinaface        min =   40.51 ms   max =   40.94 ms   avg =   40.78 ms
yolov3_tiny       min =  302.06 ms   max =  303.28 ms   avg =  302.71 ms
mobilefacenets    min =   50.08 ms   max =   53.15 ms   avg =   51.04 ms
ALL TEST DONE.
Now test my converted model; RK3399 Pro results:
root@teamhd:~/Tengine/build/examples# ./tm_yolox -m ~/Tengine/yolox.tmfile -i ~/Tengine/dog.jpg -r 1 -t 8
tengine-lite library version: 1.5-dev
Repeat 1 times, thread 8, avg time 952.34 ms, max_time 952.34 ms, min_time 952.34 ms
--------------------------------------
detection num: 3
 1:  94%, [ 117,  131,  564,  414], bicycle
16:  90%, [ 128,  210,  309,  549], dog
 2:  74%, [ 468,   82,  679,  171], car
root@teamhd:~/Tengine/build/examples# ^C
root@teamhd:~/Tengine/build/examples# export TG_DEBUG_TIME=1
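As far as I understand this debug switch, with TG_DEBUG_TIME=1 exported the next run should dump per-node execution times, which is handy for seeing which ops dominate on the board; the log ends here, though.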