ECCV-2018

Caffe code: https://github.com/miaow1988/ShuffleNet_V2_pytorch_caffe/blob/master/shufflenet_v2_x1.0.prototxt
Caffe model visualization tool: http://ethereon.github.io/netscope/#/editor

Table of Contents

1 Background and Motivation
2 Advantages / Contributions
3 Innovations
4 Method
- 4.1 Practical Guidelines for Efficient Network Design
  - 4.1.1 G1: Equal channel width minimizes memory access cost (MAC)
  - 4.1.2 G2: Excessive group convolution increases MAC
  - 4.1.3 G3: Network fragmentation reduces degree of parallelism
  - 4.1.4 G4: Element-wise operations are non-negligible
- 4.2 ShuffleNet V2
5 Experiments
- 5.1 Datasets
- 5.2 Speed and Accuracy
6 Conclusion
7 Supplement

1 Background and Motivation

Current network architecture design is mostly guided by an indirect metric, FLOPs (floating-point operations, here meaning the number of multiply-adds). This overlooks factors that affect the direct metric (speed or latency), such as memory access cost and platform characteristics (e.g. GPU, ARM), so the resulting architectures may be sub-optimal.

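Evidence: MobileNet v2 is much faster than NASNet-A even though their FLOPs are comparable. The figure below makes this clearer: at the same FLOPs (fix a point on the horizontal axis), different networks run at different speeds.
[Image]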

Guided by the direct metric (speed or latency), the authors design a new architecture called ShuffleNet V2.

The results are shown below.
[Image]

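Differences between the indirect metric and the direct metric:
1) Several important factors that have considerable effect on speed are not taken into account by FLOPs.

memory access cost (MAC): takes up a large share of the cost in group convolution
degree of parallelism: at the same FLOPs, a model with a high degree of parallelism runs faster than one with a low degree

2) Operations with the same FLOPs could have different running time, depending on the platform (GPU, ARM).

e.g. tensor decomposition reduces FLOPs by 75% yet runs slower; this depends on the CUDNN version, which has dedicated optimizations for particular operations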

2 Advantages / Contributions

Four practical guidelines for architecture design
The proposed ShuffleNet V2 has the best accuracy at equal FLOPs, and its inference time is also excellent (second only to MobileNet v1)

3 Innovations

A shift from the indirect metric (FLOPs) to the direct metric (speed)
Four practical guidelines for architecture design, derived from the direct metric
ShuffleNet V2

4 Method

For details of ShuffleNet, see 【ShuffleNet】《ShuffleNet：An Extremely Efficient Convolutional Neural Network for Mobile Devices》

4.1 Practical Guidelines for Efficient Network Design

[Image]
The FLOPs metric only accounts for the convolution part. The other parts of the runtime include:

data I/O
data shuffle
element-wise operations (Add, ReLU)

4.1.1 G1: Equal channel width minimizes memory access cost (MAC)

In depthwise separable convolutions, most of the computation is in the 1×1 (pointwise) convolution. For a 1×1 convolution:

FLOPs: $B = hwc_1c_2$ (input of size $hwc_1$, output of size $hwc_2$)
MAC: $hw(c_1+c_2) + c_1c_2$ (input and output feature maps, plus the kernel weights)

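By the mean value inequality $a+b\geq2\sqrt{ab}$ (with $a = c_1$, $b = c_2$), we get

$MAC\geq 2\sqrt{hwB}+ \frac{B}{hw}$
This form is not easy to read. Substituting the expressions for $B$ and $MAC$ makes it obvious:
$hw(c_1+c_2)+c_1c_2\geq 2hw\sqrt{c_1c_2}+c_1c_2$

Equality holds when $a = b$, i.e. $c_1 = c_2$, so for a given FLOPs budget MAC is smallest when the number of input channels equals the number of output channels.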
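As a quick numerical check of the bound above, the sketch below keeps the FLOPs $B = hwc_1c_2$ fixed and varies the ratio $c_1 : c_2$. The feature-map size and the channel counts are illustrative values chosen here, not settings taken from the paper.

```python
import math

def conv1x1_flops(h, w, c1, c2):
    """FLOPs (multiply-adds) of a 1x1 convolution: B = h*w*c1*c2."""
    return h * w * c1 * c2

def conv1x1_mac(h, w, c1, c2):
    """Memory access cost: input and output feature maps plus kernel weights."""
    return h * w * (c1 + c2) + c1 * c2

# Illustrative setting (an assumption): 56x56 feature map, c1 * c2 fixed at 128 * 128.
h, w = 56, 56
channel_pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]

for c1, c2 in channel_pairs:
    B = conv1x1_flops(h, w, c1, c2)
    mac = conv1x1_mac(h, w, c1, c2)
    # Lower bound from the inequality: MAC >= 2*sqrt(hwB) + B/(hw)
    bound = 2 * math.sqrt(h * w * B) + B / (h * w)
    print(f"c1:c2 = {c1}:{c2}  FLOPs = {B}  MAC = {mac}  bound = {bound:.0f}")
```

Every row has the same FLOPs and the same lower bound; only the balanced case $c_1 = c_2$ reaches the bound, and MAC grows as the ratio becomes more skewed.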

The authors verify this experimentally: on different platforms, the network runs fastest when $c_1 = c_2$.
[Image]

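4.1.2 G2: Excessive group convolution increases MAC

For a $1 \times 1$ group convolution, MAC and $B$ (FLOPs) are related by
$MAC=hw(c_1+c_2)+\frac{c_1c_2}{g}= hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw}$
where $B$ is
$B = \frac{hwc_1c_2}{g}$
So with the input shape fixed and $B$ held constant, MAC grows as $g$ grows.
One detail to note: at equal FLOPs, more groups also means more output channels, because a group convolution with $g$ groups cuts the FLOPs to $1/g$ of the ungrouped convolution.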

[Image]
The larger $g$ is, the slower the network, because MAC increases for the same input and the same FLOPs.
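The relation between $g$ and MAC can be checked numerically. This is a minimal sketch that fixes the input shape and the FLOPs and widens the output as $g$ grows; the concrete sizes are assumptions made for illustration only.

```python
def group_conv1x1_flops(h, w, c1, c2, g):
    """FLOPs of a 1x1 group convolution with g groups: B = h*w*c1*c2/g."""
    return h * w * c1 * c2 // g

def group_conv1x1_mac(h, w, c1, c2, g):
    """MAC of a 1x1 group convolution: feature maps plus grouped kernel weights."""
    assert c1 % g == 0 and c2 % g == 0, "channels must be divisible by g"
    return h * w * (c1 + c2) + (c1 * c2) // g

# Illustrative setting (an assumption): 56x56 input with 128 channels.
h, w, c1 = 56, 56, 128
base_c2 = 128  # output channels when g = 1

for g in (1, 2, 4, 8):
    c2 = base_c2 * g  # widen the output so that the FLOPs stay constant
    B = group_conv1x1_flops(h, w, c1, c2, g)
    mac = group_conv1x1_mac(h, w, c1, c2, g)
    print(f"g = {g}  c2 = {c2}  FLOPs = {B}  MAC = {mac}")
```

The FLOPs column stays constant while MAC rises with $g$, driven by the $hwc_2 = Bg/c_1$ term.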

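4.1.3 G3: Network fragmentation reduces degree of parallelism

[Image]
Some networks, such as Inception and the AutoML-generated NASNet-A, tend to use a "multi-path" structure, in which a single block contains many different convolutions or pooling operations. (See ShuffleNetV2：轻量级CNN网络中的桂冠)

Fragmentation refers to the small convolutions, pooling operations and so on that sit on each branch of a multi-branch block (for example, one large convolution split into smaller convolutions that run separately on each branch). Such fragmented structures can improve accuracy, but they reduce efficiency on highly parallel devices and add extra overhead (kernel launching, synchronization and so on). (See ShufflenetV2_高效网络的4条实用准则)

Number of fragmented operators in one building block:

NASNet-A: 13
ResNet: 2 or 3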

4.1.4 G4: Element-wise operations are non-negligible

Element-wise operations:

ReLU
AddTensor
AddBias
depthwise convolution (the authors also count it as element-wise because of its high MAC/FLOPs ratio)

These operations have small FLOPs but relatively heavy MAC.

[Image]
[5] is ResNet
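To see why depthwise convolution is grouped with the element-wise operations, the sketch below compares the MAC/FLOPs ratio of a 3×3 depthwise convolution with that of a 1×1 convolution. It reuses the cost model from G1 (feature maps plus weights) and assumes stride 1, "same" padding and equal input and output channels; the sizes are illustrative.

```python
def ratio_conv1x1(h, w, c):
    """MAC/FLOPs of a 1x1 convolution with c input and c output channels."""
    flops = h * w * c * c
    mac = 2 * h * w * c + c * c
    return mac / flops

def ratio_depthwise3x3(h, w, c):
    """MAC/FLOPs of a 3x3 depthwise convolution (one 3x3 filter per channel)."""
    flops = h * w * c * 9
    mac = 2 * h * w * c + 9 * c
    return mac / flops

# Illustrative setting (an assumption): 56x56 feature map with 128 channels.
h, w, c = 56, 56, 128
print(f"1x1 conv:       MAC/FLOPs = {ratio_conv1x1(h, w, c):.4f}")
print(f"3x3 depthwise:  MAC/FLOPs = {ratio_depthwise3x3(h, w, c):.4f}")
```

Under these assumptions the depthwise layer moves far more memory per multiply-add than the 1×1 layer, which matches the "small FLOPs but relatively heavy MAC" description.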

4.2 ShuffleNet V2

[Image]

Problems in ShuffleNet V1:

bottleneck-like structure (violates G1)
point-wise group convolution (violates G2)
too many groups (violates G3)
shortcut connection (the element-wise Add violates G4)

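ShuffleNet V2 introduces channel split: the $C$ input channels are split into two branches (or two groups) with $C - C^{'}$ and $C^{'}$ channels. The authors set $C^{'} = \frac{C}{2}$.

Modified bottleneck (1×1 → 3×3 → 1×1) that keeps the channel count $c$ unchanged: satisfies G1
Modified bottleneck whose point-wise convolutions are no longer grouped: satisfies G2
Channel split into one identity branch and one modified bottleneck branch: gives the effect of grouping (two groups) while avoiding too many groups, which satisfies G3
The two branches are merged by concatenation rather than Add: satisfies G4 and further helps G1

Finally, a channel shuffle as in ShuffleNet V1 lets information flow between the two branches.

The Add is gone, and element-wise operations like ReLU and depth-wise convolutions exist only in one branch.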
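The data flow of the basic unit described above (split, transform one half, concatenate, shuffle) can be traced on plain channel labels. This is a simplified sketch, not the authors' implementation: the hypothetical `mark` function stands in for the whole 1×1 → 3×3 → 1×1 branch and maps channels one to one, whereas a real convolution mixes all channels of its branch.

```python
def channel_split(channels, keep):
    """Split a channel list into an identity part (first `keep`) and the rest."""
    return channels[:keep], channels[keep:]

def channel_shuffle(channels, groups):
    """Interleave channels: view as (groups, n), transpose, then flatten."""
    assert len(channels) % groups == 0
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]

def shufflenet_v2_unit(channels, branch):
    """Basic (stride-1) unit: split in half, transform one half, concat, shuffle."""
    identity, rest = channel_split(channels, len(channels) // 2)
    processed = [branch(ch) for ch in rest]  # stand-in for the conv branch
    return channel_shuffle(identity + processed, groups=2)

def mark(ch):
    """Hypothetical placeholder for the conv branch: tags a channel as processed."""
    return ch + "'"

out = shufflenet_v2_unit(["a", "b", "c", "d", "e", "f", "g", "h"], mark)
print(out)
```

After the shuffle, untouched and processed channels alternate, so the next unit's split hands a mix of both to each branch.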

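The overall structure is as follows:

[Image]
ShuffleNet V2 is not only fast but also accurate, for two reasons:

First, the high efficiency in each building block enables using more feature channels and larger network capacity. (At equal FLOPs, higher parameter efficiency means more capacity.)
Second, $\frac{C}{2}$ of the feature channels reach the next block directly through the shortcut connection, which can be viewed as a kind of feature reuse, as in DenseNet and CondenseNet.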

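Feature reuse

[Image]

Panel (a) comes from DenseNet (see 【DenseNet】《Densely Connected Convolutional Networks》). It shows the l1-norm of the weights connecting each layer to the others. Read it row by row: the first row gives the weights from layer 1 to layer 1 through the last layer, the second row gives the weights from layer 2 to layer 2 through the last layer, and so on.

Panel (b) shows, for ShuffleNet V2, how many channels of layer s feed directly into layer l. It is also best read row by row. With $C^{'} = \frac{C}{2}$ the fractions should go 1, 1/2, 1/4, 1/8, ..., and the color gradient in each row follows exactly this pattern.

Panel (a) shows that dense connection between all layers could introduce redundancy (the weights between adjacent layers are clearly stronger, i.e. redder, which CondenseNet also confirms), whereas in (b) the reuse decays exponentially with the distance between blocks.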
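The 1, 1/2, 1/4, 1/8 pattern can be reproduced with a small simulation. The sketch below tags every channel with the index of the block that last recomputed it (0 means "still the raw input"), assuming the first half of the channels is the identity branch and that each unit ends with a two-group channel shuffle; the channel count of 16 is an arbitrary choice.

```python
from collections import Counter

def channel_shuffle(tags, groups=2):
    """Interleave channels: view as (groups, n), transpose, then flatten."""
    n = len(tags) // groups
    return [tags[g * n + i] for i in range(n) for g in range(groups)]

def simulate(num_channels, num_blocks):
    """Return, after each block, how many channels carry each tag."""
    tags = [0] * num_channels
    history = []
    half = num_channels // 2
    for block in range(1, num_blocks + 1):
        identity = tags[:half]       # passes through unchanged
        recomputed = [block] * half  # output of the conv branch
        tags = channel_shuffle(identity + recomputed)
        history.append(Counter(tags))
    return history

history = simulate(num_channels=16, num_blocks=4)
for block, counts in enumerate(history, start=1):
    print(f"after block {block}: {counts[0]} of 16 channels come straight from the input")
```

The number of surviving input channels halves with every block, which is the exponential decay visible in panel (b).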

5 Experiments

5.1 Datasets

ImageNet 2012 classification dataset
COCO for object detection

5.2 Speed and Accuracy

1) Accuracy vs. FLOPs
ShuffleNet V2 is clearly the most accurate across the board, especially under smaller computational budgets. At small FLOPs MobileNet v2 performs poorly, because it has too few channels.
[Image]
2) Inference Speed vs. FLOPs/Accuracy

From the table:
On GPU, at high FLOPs, ShuffleNet V2 is far ahead
On ARM, at high FLOPs, the models are roughly on par, but at low FLOPs MobileNet v2 falls behind (it violates G1 and G4, which is significant on mobile devices)

MobileNet v1 has only average accuracy, yet it is extremely fast (faster than ShuffleNet V2). Why? This was my biggest question the first time I read Table 8. The authors explain:
We believe this is because its structure satisfies most of proposed guidelines (e.g. for G3, the fragments of MobileNet v1 are even fewer than ShuffleNet v2).

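Why are IGCV2 and IGCV3 slow?
usage of too many convolution groups

Both of these "surprises" are covered by the authors' practical guidelines.

Why are models from automatic model search slow?

They violate G3 through usage of too many fragments. Combining the search with the authors' guidelines could produce better models.

[Image]
On accuracy, ShuffleNet V2 is again the clear leader.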

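3) Compatibility with other methods
Adding SE improves accuracy by 0.5 points (top-1 error 25.1 to 24.6), while FLOPs only go from 591M to 597M.

4) Generalization to Large Models

>2 GFLOPs
[Image]
Fewer FLOPs, stronger results! Haha, it works like a potion where a single drop is enough.
Truly a magic staff that bends to your will!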

5) Object Detection
standard mmAP, i.e. the averaged mAPs at the box IoU thresholds from 0.5 to 0.95.
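As a small illustration of the metric, the sketch below averages per-threshold mAP values over the IoU thresholds 0.50, 0.55, ..., 0.95. The step of 0.05 follows the usual COCO convention, which the text above does not spell out, and the per-threshold numbers are placeholders made up for the example, not results from the paper.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 (step 0.05, assumed COCO convention)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

def mean_of_maps(map_per_threshold):
    """Average the per-threshold mAP values into a single mmAP score."""
    return sum(map_per_threshold) / len(map_per_threshold)

thresholds = coco_iou_thresholds()
# Placeholder mAP values, one per threshold, invented purely for illustration.
example_maps = [0.50, 0.47, 0.44, 0.40, 0.36, 0.31, 0.25, 0.18, 0.10, 0.03]

for t, m in zip(thresholds, example_maps):
    print(f"IoU >= {t:.2f}: mAP = {m:.2f}")
print(f"mmAP = {mean_of_maps(example_maps):.3f}")
```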
[Image]
An interesting observation:

object detection performance: ShuffleNet V2 > Xception ≥ ShuffleNet V1 > MobileNet v2
classification: ShuffleNet V2 ≥ MobileNet v2 > ShuffleNet V1 > Xception

Apart from first place, the ranking is reversed.
This is probably due to the larger receptive field of Xception building blocks than the other counterparts (7 vs. 3).

The figure below shows the Xception architecture. Modules 5-12 each have a receptive field of 7, larger than that of the modified bottleneck.
[Image]
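The "7 vs. 3" figures follow from the standard receptive-field recurrence for stacked convolutions. The sketch below assumes that one Xception middle-flow module stacks three 3×3 separable convolutions with stride 1, and that a ShuffleNet V2 basic unit contains a single 3×3 depthwise convolution between two 1×1 convolutions.

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride             # stride scales the step between output pixels
    return rf

# Assumed layer stacks (kernel_size, stride); 1x1 convs do not enlarge the field.
xception_module = [(3, 1), (3, 1), (3, 1)]
shufflenet_v2_unit = [(1, 1), (3, 1), (1, 1)]

print(receptive_field(xception_module))     # 7
print(receptive_field(shufflenet_v2_unit))  # 3
```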

For an introduction to Xception, see 【Xception】《Xception: Deep Learning with Depthwise Separable Convolutions》

6 Conclusion

The four practical guidelines:

G1: Equal channel width minimizes memory access cost (MAC)
G2: Excessive group convolution increases MAC
G3: Network fragmentation reduces degree of parallelism
G4: Element-wise operations are non-negligible

ShuffleNet V1
- point-wise group convolution (violates G2)
- bottleneck-like structure (violates G1)
- too many groups (violates G3)
- shortcut connection (the element-wise Add violates G4)
MobileNet v2
- inverted bottleneck structure (violates G1)
- depth-wise convolution (against G4)
- ReLU on “thick” feature map (against G4)
auto-generated structures (violate G3)

Judging by inference speed (Table 8), MobileNet v1 is very well designed: it follows G1-G4.