Basic idea: YOLOX has just been released. Many of my previous projects deployed on the nano with TensorRT, so I suddenly wanted to try deploying on the nano with Tengine as well; this is a quick record of the process.
The plan, in steps: 1) first test YOLOX performance on the PC, using the official demo plus material found online;
2) then pick the best deployment for the nano and collect test numbers.
TensorRT: grab the code from the experts: GitHub - Megvii-BaseDetection/YOLOX: YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
(1) Build the demo source. I had already set up my TensorRT environment earlier, so I won't rehash that here; see 50、ubuntu18.04/20.04进行TensorRT环境搭建和YOLO5部署(含安装vulkan)_sxj731533730-CSDN博客
ubuntu@ubuntu:~$ cd YOLOX/demo/TensorRT/cpp/
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp$ mkdir build
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp$ cd build/
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ubuntu/YOLOX/demo/TensorRT/cpp/build
(2) During the build, the first problem:
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ make -j 8
[ 50%] Building CXX object CMakeFiles/yolox.dir/yolox.cpp.o
/home/ubuntu/YOLOX/demo/TensorRT/cpp/yolox.cpp:9:10: fatal error: NvInfer.h: No such file or directory
    9 | #include "NvInfer.h"
      |          ^~~~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/yolox.dir/build.make:63: CMakeFiles/yolox.dir/yolox.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/yolox.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
The TensorRT and CUDA headers need to be copied into the demo source directory (adding the include paths to the demo's CMakeLists.txt would be cleaner, but copying also works):
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp$ cp /home/ubuntu/NVIDIA_CUDA-11.1_Samples/TensorRT-7.2.2.3/include/* .
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp$ cp -r /usr/local/cuda/include/* .
(3) Building again hits the next problem:
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ make -j 8
[100%] Linking CXX executable yolox
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/yolox.dir/build.make:99: yolox] Error 1
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/yolox.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
(4) Fix the linker error by symlinking libcudart into a default search path, then build again:
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ sudo ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ make
[ 50%] Linking CXX executable yolox
[100%] Built target yolox
(5) Download the model weights (it's a release asset, so fetch it with wget or a browser rather than git clone): https://github.com/Megvii-BaseDetection/storage/releases/download/0.0.1/yolox_s.pth
First, set up the environment:
ubuntu@ubuntu:~$ git clone https://github.com/NVIDIA-AI-IOT/torch2trt
ubuntu@ubuntu:~$ cd torch2trt
ubuntu@ubuntu:~/torch2trt$ python3 setup.py install --user
ubuntu@ubuntu:~/torch2trt$ python3
Python 3.8.10 (default, Jun 2 2021, 10:49:15)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch2trt
>>> exit()
ubuntu@ubuntu:~/YOLOX$ pip install -r requirements.txt
Convert the model to a TensorRT engine:
ubuntu@ubuntu:~/YOLOX$ cp tools/trt.py .
ubuntu@ubuntu:~/YOLOX$ python3 trt.py -n yolox-s -c /home/ubuntu/a/yolox_s.pth
2021-07-24 10:50:26.033 | INFO | __main__:main:52 - loaded checkpoint done.
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.
[TensorRT] WARNING: Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
My laptop GPU only has 4 GB of memory, so I tweaked trt.py to shrink the workspace:
model_trt = torch2trt(
    model,
    [x],
    fp16_mode=True,
    log_level=trt.Logger.INFO,
    # max_workspace_size=(1 << 32),
    max_workspace_size=(1 << 22),
)
After that the conversion succeeds.
Path of the generated engine:
/home/ubuntu/YOLOX/YOLOX_outputs/yolox_s/model_trt.engine
(6) Test the result (this rookie's PC GPU is a 1050 Ti):
ubuntu@ubuntu:~/YOLOX/demo/TensorRT/cpp/build$ ./yolox ../../../../YOLOX_outputs/yolox_s/model_trt.engine -i ../../../../assets/dog.jpg
[07/24/2021-11:16:26] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[07/24/2021-11:16:26] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
blob image
24ms
num of boxes before nms: 38
num of boxes: 3
1 = 0.92316 at 111.95 128.16 456.10 x 312.89
16 = 0.88935 at 133.35 214.42 183.07 x 329.38
7 = 0.64439 at 468.19 76.37 221.86 x 92.72
save vis file
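For context, the heavy lifting in the demo is deserializing the engine that trt.py serialized, then running it through an execution context. A minimal sketch of the loading step, assuming the TensorRT 7-era API (the Logger class and loadEngine name are mine, not the demo's):

#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>
#include "NvInfer.h"

// Trivial logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            printf("[TRT] %s\n", msg);
    }
};

// Read a serialized engine file and deserialize it with the runtime.
nvinfer1::ICudaEngine* loadEngine(const char* path, nvinfer1::IRuntime*& runtime, Logger& logger)
{
    std::ifstream file(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());
    runtime = nvinfer1::createInferRuntime(logger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}

The returned engine gets wrapped in an IExecutionContext, input/output device buffers are bound by binding index, and enqueue() runs the inference; that is where the 24ms above comes from.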
(7) I won't walk through building the NCNN environment again; straight to the NCNN test.
One quick note: after downloading the ncnn source, build it; to test ncnn without the Vulkan GPU acceleration, turn it off explicitly (it seems to be on by default):
ubuntu@ubuntu:~/ncnn/build$ cmake -DNCNN_VULKAN=OFF ..
Generate the ONNX model with the provided script. Oddly, I had to copy the script out of the tools folder before it would run~
ubuntu@ubuntu:~/YOLOX$ cp tools/export_onnx.py .
ubuntu@ubuntu:~/YOLOX$ python3 export_onnx.py -n yolox-s -c /home/ubuntu/a/yolox_s.pth
2021-07-24 14:03:49.491 | INFO | __main__:main:50 - args value: Namespace(ckpt='/home/ubuntu/a/yolox_s.pth', exp_file=None, experiment_name=None, input='images', name='yolox-s', no_onnxsim=False, opset=11, opts=[], output='output', output_name='yolox.onnx')
2021-07-24 14:03:49.623 | INFO | __main__:main:74 - loaded checkpoint done.
2021-07-24 14:03:53.540 | INFO | __main__:main:84 - generate onnx named yolox.onnx
Simplifying...
Checking 0/3...
Checking 1/3...
Checking 2/3...
Ok!
2021-07-24 14:03:57.516 | INFO | __main__:main:89 - generate simplify onnx named yolox.onnx
Convert the model:
ubuntu@ubuntu:~/ncnn/build/tools/onnx$ ./onnx2ncnn ../../../../YOLOX/yolox.onnx ../../../../YOLOX/yolox-s.param ../../../../YOLOX/yolox-s.bin
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Unsupported slice step !
Those Unsupported slice step warnings come from the stride-2 slice ops of the Focus layer. Edit the model by hand following my earlier post 45、NCNN之ONNX模型解析及其使用(YOLO5)_sxj731533730-CSDN博客 or nihui's Zhihu write-up 详细记录u版YOLOv5目标检测ncnn实现 - 知乎.
Original param file (first 16 lines):
7767517
235 268
Input images 0 1 images
Split splitncnn_input0 1 4 images images_splitncnn_0 images_splitncnn_1 images_splitncnn_2 images_splitncnn_3
Crop Slice_4 1 1 images_splitncnn_3 467 -23309=1,0 -23310=1,2147483647 -23311=1,1
Crop Slice_9 1 1 467 472 -23309=1,0 -23310=1,2147483647 -23311=1,2
Crop Slice_14 1 1 images_splitncnn_2 477 -23309=1,0 -23310=1,2147483647 -23311=1,1
Crop Slice_19 1 1 477 482 -23309=1,1 -23310=1,2147483647 -23311=1,2
Crop Slice_24 1 1 images_splitncnn_1 487 -23309=1,1 -23310=1,2147483647 -23311=1,1
Crop Slice_29 1 1 487 492 -23309=1,0 -23310=1,2147483647 -23311=1,2
Crop Slice_34 1 1 images_splitncnn_0 497 -23309=1,1 -23310=1,2147483647 -23311=1,1
Crop Slice_39 1 1 497 502 -23309=1,1 -23310=1,2147483647 -23311=1,2
Concat Concat_40 4 1 472 492 482 502 503 0=0
Convolution Conv_41 1 1 503 877 0=32 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=3456
Swish Mul_43 1 1 877 507
Convolution Conv_44 1 1 507 880 0=64 1=3 11=3 2=1 12=1 3=2 13=2 4=1 14=1 15=1 16=1 5=1 6=18432
Modified param file (the Split, eight Crop, and Concat layers are replaced by a single YoloV5Focus layer, so the layer count drops from 235 to 226):
7767517
226 268
Input images 0 1 images
YoloV5Focus focus 1 1 images 503
Convolution Conv_41 1 1 503 877 0=32 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=3456
Swish Mul_43 1 1 877 507
Convolution Conv_44 1 1 507 880 0=64 1=3 11=3 2=1 12=1 3=2 13=2 4=1 14=1 15=1 16=1 5=1 6=18432
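YoloV5Focus is not a built-in ncnn layer, so it has to be registered as a custom layer in the inference code before load_param. For reference, a sketch of the layer, essentially the implementation from nihui's ncnn YOLOv5 example (the surrounding ncnn::Net setup is assumed):

#include "layer.h"
#include "net.h"

// Rearranges a w x h x c blob into (w/2) x (h/2) x 4c by sampling every
// second pixel, i.e. exactly what the removed Slice/Concat chain computed.
class YoloV5Focus : public ncnn::Layer
{
public:
    YoloV5Focus() { one_blob_only = true; }

    virtual int forward(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob, const ncnn::Option& opt) const
    {
        int w = bottom_blob.w;
        int h = bottom_blob.h;
        int channels = bottom_blob.c;

        int outw = w / 2;
        int outh = h / 2;
        int outc = channels * 4;

        top_blob.create(outw, outh, outc, 4u, 1, opt.blob_allocator);
        if (top_blob.empty())
            return -100;

        #pragma omp parallel for num_threads(opt.num_threads)
        for (int p = 0; p < outc; p++)
        {
            const ncnn::Mat m = bottom_blob.channel(p % channels);
            float* outptr = top_blob.channel(p);

            for (int i = 0; i < outh; i++)
            {
                const float* ptr = m.row(i * 2 + ((p / channels) % 2)) + ((p / channels) / 2);
                for (int j = 0; j < outw; j++)
                {
                    *outptr = *ptr;
                    outptr += 1;
                    ptr += 2;
                }
            }
        }
        return 0;
    }
};
DEFINE_LAYER_CREATOR(YoloV5Focus)

Register it with net.register_custom_layer("YoloV5Focus", YoloV5Focus_layer_creator); before net.load_param(...).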
Test the CPU + ncnn inference time:
/home/ubuntu/CLionProjects/untitled1/cmake-build-debug/untitled1
output height: 3549, width: 85, channels: 1, dims:2
1 = 0.94330 at 118.83 128.06 449.39 x 289.91
196ms
16 = 0.87024 at 131.04 219.33 183.01 x 321.65
2 = 0.77507 at 464.40 80.81 226.02 x 92.19
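A note on that output shape: with the demo's 416x416 input, 3549 = 52*52 + 26*26 + 13*13 grid cells at strides 8, 16 and 32, and each of the 85 columns is cx, cy, w, h, objectness plus the 80 COCO class scores. The demo decodes it by first enumerating per-cell grid/stride pairs, roughly like this sketch (names are mine; the decode formula is YOLOX's anchor-free one):

#include <vector>

struct GridAndStride
{
    int grid0;
    int grid1;
    int stride;
};

// Enumerate one entry per output row: 52*52 + 26*26 + 13*13 = 3549 for a 416 input.
static void generate_grids_and_stride(int target_size, const std::vector<int>& strides,
                                      std::vector<GridAndStride>& grid_strides)
{
    for (int stride : strides) // {8, 16, 32}
    {
        int num_grid = target_size / stride; // 52, 26, 13
        for (int g1 = 0; g1 < num_grid; g1++)
            for (int g0 = 0; g0 < num_grid; g0++)
                grid_strides.push_back(GridAndStride{g0, g1, stride});
    }
}

// Row k of the 3549 x 85 output then decodes as:
//   x_center = (out[0] + grid0) * stride;   y_center = (out[1] + grid1) * stride;
//   width    = exp(out[2]) * stride;        height   = exp(out[3]) * stride;
//   score    = out[4] (objectness) * out[5 + class_id];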
I also gave the Python demo a quick spin:
ubuntu@ubuntu:~/YOLOX$ python3 tools/demo.py image -f exps/default/yolox_s.py --trt --nms 0.5 --conf 0.5 --save_result
2021-08-27 17:29:07.976 | INFO | __main__:main:239 - Args: Namespace(camid=0, ckpt=None, conf=0.5, demo='image', device='gpu', exp_file='exps/default/yolox_s.py', experiment_name='yolox_s', fp16=False, fuse=False, name=None, nms=0.5, path='./assets/dog.jpg', save_result=True, trt=True, tsize=None)
/home/ps/anaconda2/envs/tensorflow/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
2021-08-27 17:29:08.168 | INFO | __main__:main:249 - Model Summary: Params: 8.97M, Gflops: 26.81
2021-08-27 17:29:14.971 | INFO | __main__:main:278 - Using TensorRT to inference
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.3.0 but loaded cuBLAS/cuBLAS LT 11.2.1
2021-08-27 17:29:16.636 | INFO | __main__:inference:150 - Infer time: 0.0095s
2021-08-27 17:29:16.638 | INFO | __main__:image_demo:187 - Saving detection result in ./YOLOX_outputs/yolox_s/vis_res/2021_08_27_17_29_16/dog.jpg
The test image:
(8) Download the source developed by 圈圈's team: https://github.com/OAID/Tengine.git
It turns out the official repo already ships YOLOX example source, so let's just test that; my own "homework" wasn't even finished yet~~
Take the ONNX model produced by the official tutorial above and convert it to a Tengine test model.
ubuntu@ubuntu:~/Tengine/build$ cmake -DTENGINE_BUILD_CONVERT_TOOL=ON -DTENGINE_BUILD_QUANT_TOOL=ON -DTENGINE_ENABLE_CUDA=ON -DTENGINE_ENABLE_TENSORRT=ON ..
First I moved the Tengine source and the built libraries into a CLion project. Tengine is quite considerate here: there are only two public headers, and the build produces both a dynamic .so and a static .a.
Porting hit one problem, missing header files, which turned out to live in the examples' common folder; rather than pull that in, I checked which functions in the test code actually depend on those headers.
A small edit is enough: comment out the two files that aren't shipped and add one header:
//#include "common.h"
#include "tengine/c_api.h"
//#include "tengine_operations.h"
#include <unistd.h>
Change the timing code to use clock() and it compiles (note clock() returns ticks, so divide by CLOCKS_PER_SEC to get seconds):
double start = clock(); // get_current_time();
if (run_graph(graph, 1) < 0)
{
    fprintf(stderr, "Run graph failed\n");
    return -1;
}
double end = clock(); // get_current_time();
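With the includes sorted, the whole test program only needs the C API. A bare-bones sketch of the flow (the yolox.tmfile name and the 416x416 input shape are assumptions carried over from the model converted below; preprocessing and output decoding are omitted):

#include <cstdio>
#include <vector>
#include "tengine/c_api.h"

int main()
{
    if (init_tengine() != 0)
        return -1;

    graph_t graph = create_graph(nullptr, "tengine", "yolox.tmfile");
    if (graph == nullptr)
    {
        fprintf(stderr, "create_graph failed\n");
        return -1;
    }

    // declare the input shape, then let Tengine allocate the graph
    int dims[4] = {1, 3, 416, 416}; // NCHW float32
    tensor_t input_tensor = get_graph_input_tensor(graph, 0, 0);
    set_tensor_shape(input_tensor, dims, 4);
    if (prerun_graph(graph) < 0)
    {
        fprintf(stderr, "prerun_graph failed\n");
        return -1;
    }

    // hand the preprocessed image to the input tensor and run
    std::vector<float> input(1 * 3 * 416 * 416);
    set_tensor_buffer(input_tensor, input.data(), (int)(input.size() * sizeof(float)));
    if (run_graph(graph, 1) < 0)
    {
        fprintf(stderr, "Run graph failed\n");
        return -1;
    }

    // fetch the raw output; decode it the same way as in the ncnn section
    tensor_t output_tensor = get_graph_output_tensor(graph, 0, 0);
    const float* out = (const float*)get_tensor_buffer(output_tensor);
    (void)out;

    postrun_graph(graph);
    destroy_graph(graph);
    release_tengine();
    return 0;
}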
(9) Model conversion; see 模型转换工具 — Tengine 文档.
First build the tools from source; I turned these four options on:
OPTION (TENGINE_BUILD_CONVERT_TOOL "Build convert tool" ON)
OPTION (TENGINE_BUILD_QUANT_TOOL "Build quantization tool" ON)
OPTION (TENGINE_ENABLE_CUDA "With nVIDIA CUDA support" ON)
OPTION (TENGINE_ENABLE_TENSORRT "With nVIDIA TensorRT support" ON)
The build is a bit odd: even though my cuda-11.1 is already mapped to the cuda directory, it still goes looking under cuda-11.1 directly, so I gave in and copied the files into the cuda-11.1 directory, per 50、ubuntu18.04/20.04进行TensorRT环境搭建和YOLO5部署(含安装vulkan)_sxj731533730-CSDN博客.
First problem:
/usr/local/cuda-11.1/include/cudnn.h:60:10: fatal error: cudnn_version.h: No such file or directory
   60 | #include "cudnn_version.h"
The fix (still odd: my environment variables map to the cuda directory, which does contain these files, yet it insists on cuda-11.1; not going to dig into it~):
ubuntu@ubuntu:~$ tar -zxvf cudnn-11.2-linux-x64-v8.1.0.77.tgz
ubuntu@ubuntu:~$ sudo cp cuda/include/* /usr/local/cuda-11.1/include
ubuntu@ubuntu:~$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda-11.1/lib64
ubuntu@ubuntu:~$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda-11.1/lib64/libcudnn*
Another problem:
/home/ubuntu/Tengine/source/device/tensorrt/trt_limit.hpp:43:10: fatal error: NvInfer.h: No such file or directory
   43 | #include <NvInfer.h>
      |          ^~~~~~~~~~~
The fix: a guru told me you can simply add the include path in the CMakeLists.txt, which amounts to the same thing; I just copied the headers~
ubuntu@ubuntu:~/Tengine/source/device/tensorrt$ cp /home/ubuntu/Downloads/TensorRT/TensorRT-7.2.2.3.Ubuntu-18.04.x86_64-gnu.cuda-11.1.cudnn8.0/TensorRT-7.2.2.3/include/* .
After that it compiled cleanly~
For the model conversion, refer to 虫叔's Zhihu post: Tengine 支持 NPU 模型部署-YOLOX - 知乎 (suddenly I feel I should learn how to modify ops!!).
I also followed his yolov5.pt op-optimization process: Tengine/tools/optimize at tengine-lite · OAID/Tengine · GitHub
Conversion result:
Then run the command line from that guide (I suddenly feel like a mere porter of knowledge; maybe that's just how it spreads~):
$ python3 yolov5s-opt.py --input yolov5s.v5.onnx --output yolov5s.v5.opt.onnx --in_tensor 167 --out_tensor 397,458,519
For the original YOLOX model, just mimic the preset inputs of the yolov5 example above.
The op-modified model:
This is the output after the optimize script runs successfully:
/usr/bin/python3 /home/ubuntu/Tengine/tools/optimize/yolov5s-opt.py
---- Tengine YOLOv5 Optimize Tool ----

Input model      : /home/ubuntu/YOLOX/yolox.onnx
Output model : /home/ubuntu/YOLOX/yolox.opt.onnx
Input tensor : 503
Output tensor : output
[Quant Tools Info]: Step 0, load original onnx model from /home/ubuntu/YOLOX/yolox.onnx.
278
[Quant Tools Info]: Step 1, Remove the focus and postprocess nodes.
[Quant Tools Info]: Step 2, Using hardswish replace the sigmoid and mul.
[Quant Tools Info]: Step 3, Rebuild onnx graph nodes.
[Quant Tools Info]: Step 4, Update input and output tensor.
[Quant Tools Info]: Step 5, save the new onnx model to /home/ubuntu/YOLOX/yolox.opt.onnx.
---- Tengine YOLOv5s Optimize onnx create success, best wish for your inference has a high accuracy ...\(^0^)/ ----
Run the conversion:
ubuntu@ubuntu:~/Tengine/build/install/bin$ ./convert_tool -f onnx -m ~/YOLOX/yolox.opt.onnx -o ~/YOLOX/yolox.tmfile
---- Tengine Convert Tool ----

Version     : v1.0, 11:27:47 Jul 25 2021
Status      : float32
----------onnx2tengine begin----------
Model op set is :11
----------onnx2tengine done.----------
graph opt begin
graph opt done.
Convert model success. /home/ubuntu/YOLOX/yolox.opt.onnx -----> /home/ubuntu/YOLOX/yolox.tmfile
And the other model I converted:
python3 yolov5s-opt.py --input yolox-smi.onnx --output yolox-smiopt.onnx --in_tensor 503 --out_tensor output
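The boundary tensors line up with the param dump earlier: 503 is the tensor produced by the Focus slice/concat chain (which the script strips out and replaces with a plain input), and output is the final head tensor, matching the Input tensor : 503 and Output tensor : output lines in the optimize log above.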
Unfortunately the board the customer gave me runs Android. I had first done another project (12、 Android+RK3399 pro+USB直连摄像头+NCNN+Nanodet进行检测_sxj731533730-CSDN博客) and then reflashed the board; flashing this thing is a real pain, too many problems to record here. With it flashed to a Linux system, time to test YOLOX.
root@teamhd:~/Tengine/build/benchmark# ./tm_benchmark -r 5 -t 1 -p 1
Tengine benchmark:
   loops: 5
   threads: 1
   cluster: 1
   affinity: 0xFFFFFFFF
Tengine-lite library version: 1.5-dev
squeezenet_v1.1   min =   58.57 ms   max =   62.29 ms   avg =   61.25 ms
mobilenetv1       min =  110.43 ms   max =  115.09 ms   avg =  113.89 ms
mobilenetv2       min =  117.23 ms   max =  117.53 ms   avg =  117.40 ms
mobilenetv3       min =   81.17 ms   max =   81.76 ms   avg =   81.53 ms
shufflenetv2      min =   36.52 ms   max =   37.24 ms   avg =   36.88 ms
resnet18          min =  200.26 ms   max =  200.65 ms   avg =  200.47 ms
resnet50          min =  582.59 ms   max =  694.41 ms   avg =  627.50 ms
googlenet         min =  243.06 ms   max =  244.56 ms   avg =  243.80 ms
inceptionv3       min = 1134.42 ms   max = 1139.62 ms   avg = 1137.18 ms
vgg16             min = 1069.57 ms   max = 1206.45 ms   avg = 1132.67 ms
mssd              min =  240.48 ms   max =  242.75 ms   avg =  241.59 ms
retinaface        min =   40.51 ms   max =   40.94 ms   avg =   40.78 ms
yolov3_tiny       min =  302.06 ms   max =  303.28 ms   avg =  302.71 ms
mobilefacenets    min =   50.08 ms   max =   53.15 ms   avg =   51.04 ms
ALL TEST DONE.
Now test my converted model; RK3399 Pro results:
root@teamhd:~/Tengine/build/examples# ./tm_yolox -m ~/Tengine/yolox.tmfile -i ~/Tengine/dog.jpg -r 1 -t 8
tengine-lite library version: 1.5-dev
Repeat 1 times, thread 8, avg time 952.34 ms, max_time 952.34 ms, min_time 952.34 ms
--------------------------------------
detection num: 3
 1:  94%, [ 117,  131,  564,  414], bicycle
16:  90%, [ 128,  210,  309,  549], dog
 2:  74%, [ 468,   82,  679,  171], car
root@teamhd:~/Tengine/build/examples# ^C
root@teamhd:~/Tengine/build/examples# export TG_DEBUG_TIME=1
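As far as I understand this debug switch, with TG_DEBUG_TIME=1 exported the next run should dump per-node execution times, which is handy for seeing which ops dominate on the board; the log ends here, though.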