I. Data Processing

RetinaFace's data processing lives mainly in two scripts: wider_face.py and data_augment.py.

1. wider_face.py

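wider_face.py first defines a class: class WiderFaceDetection(data.Dataset).
It has three main methods: 1. the initializer def __init__(self, txt_path, preproc=None); 2. def __len__(self), which returns the number of samples in the dataset; 3. def __getitem__(self, index), which returns the data for one sample.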

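Steps
(1) Take the path of the .txt file to process (it contains the paths of the WIDER FACE images, plus the box coordinates and landmark coordinates of every face in each image);
(2) Take preproc, the class that applies the various data augmentations to each input image; it is defined in data_augment.py and is covered later;
(3) Define one list to store the image paths and one list to store the face box and landmark information;
(4) Process label.txt: read the file, store its lines in lines, and iterate over them. A line starting with "#" is an image path and is put into img_path; the lines that follow describe that image. Each annotation line holds the box x, y, then w, h, then the 5 landmarks, which are separated by 0.0 values.
(5) This information is put into words. isFirst is the flag used to detect that one image has been fully processed: on every "#" line except the first, the annotations gathered so far are flushed into words.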
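The parsing steps above can be sketched as follows. This is a minimal illustration, not the code from wider_face.py: the function name parse_widerface_labels and the sample lines are hypothetical, and it assumes every annotation line is a whitespace-separated list of numbers.

```python
def parse_widerface_labels(lines):
    """Split label lines into image paths and per-image annotation lists.

    Hypothetical helper for illustration only; it mirrors the "#"-line
    and isFirst logic described in the text.
    """
    img_paths, words = [], []
    labels = []
    is_first = True
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A new image starts: flush the annotations of the previous one,
            # except on the very first "#" line where nothing was gathered yet.
            if is_first:
                is_first = False
            else:
                words.append(labels)
                labels = []
            img_paths.append(line[1:].strip())
        else:
            labels.append([float(v) for v in line.split()])
    if not is_first:
        # The last image has no following "#" line, so flush it here.
        words.append(labels)
    return img_paths, words


sample = ["# a.jpg", "10 20 30 40 1 2 0.0 3 4 0.0", "# b.jpg", "1 2 3 4"]
paths, words = parse_widerface_labels(sample)
```

After the call, paths holds one entry per image and words holds, at the same index, the list of annotation rows for that image.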

2. data_augment.py

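This script contains the data augmentation functions (cropping, distortion, padding to a square, mirroring, mean subtraction, and so on) plus one class that ties them together, namely the preproc mentioned in wider_face.py. This class applies all of the augmentations above in turn and finally returns the processed image and its targets.
The flow, as the code shows:
(1) Get the image information;
(2) Augment the image in the order crop -> distort -> pad to square -> mirror -> subtract mean.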

class preproc(object):

    def __init__(self, img_dim, rgb_means):
        self.img_dim = img_dim
        self.rgb_means = rgb_means

    def __call__(self, image, targets):
        assert targets.shape[0] > 0, "this image does not have gt"

        # Each target row: 4 box values, landmark values, then the label.
        boxes = targets[:, :4].copy()
        labels = targets[:, -1].copy()
        landm = targets[:, 4:-1].copy()

        image_t, boxes_t, labels_t, landm_t, pad_image_flag = _crop(image, boxes, labels, landm, self.img_dim)
        image_t = _distort(image_t)
        image_t = _pad_to_square(image_t, self.rgb_means, pad_image_flag)
        image_t, boxes_t, landm_t = _mirror(image_t, boxes_t, landm_t)
        height, width, _ = image_t.shape
        image_t = _resize_subtract_mean(image_t, self.img_dim, self.rgb_means)

        # Normalize x coordinates by width and y coordinates by height.
        boxes_t[:, 0::2] /= width
        boxes_t[:, 1::2] /= height

        landm_t[:, 0::2] /= width
        landm_t[:, 1::2] /= height

        labels_t = np.expand_dims(labels_t, 1)
        targets_t = np.hstack((boxes_t, landm_t, labels_t))

        return image_t, targets_t
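To make the mirror and normalization steps concrete, here is a small sketch on a single box. The helper names mirror_box and normalize_box are hypothetical, and the sketch assumes boxes are stored as corner coordinates [x1, y1, x2, y2] in pixels; it is not the _mirror function from data_augment.py.

```python
def mirror_box(box, width):
    """Flip a corner-form box [x1, y1, x2, y2] horizontally in an image of the given width."""
    x1, y1, x2, y2 = box
    # After the flip the old right edge becomes the new left edge.
    return [width - x2, y1, width - x1, y2]


def normalize_box(box, width, height):
    """Scale pixel coordinates to [0, 1], x by width and y by height."""
    x1, y1, x2, y2 = box
    return [x1 / width, y1 / height, x2 / width, y2 / height]


flipped = mirror_box([10, 20, 30, 40], width=100)
normalized = normalize_box(flipped, width=100, height=200)
```

The same even-index/odd-index split is what boxes_t[:, 0::2] /= width and boxes_t[:, 1::2] /= height express in the class above.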

II. Default Box Generation

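Default boxes are generated mainly in prior_box.py, by the class PriorBox(object).
Steps
(1) The forward function builds the set of default boxes. It first iterates over the three feature maps (their sizes are [80, 40, 20]; the paper describes 5, but the code implements only 3, as shown below) and reads the default box sizes for each one, min_sizes[k];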

anchors = []
for k, f in enumerate(self.feature_maps):
    min_sizes = self.min_sizes[k]

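(2) It then iterates over every pixel of the feature map (itertools.product() returns the Cartesian product of its inputs, i.e. every pairing of the coordinate values). s_kx and s_ky are the width w and height h of the default box; dense_cx and dense_cy hold the center of the default box, i.e. the feature map pixel being visited;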

    for i, j in product(range(f[0]), range(f[1])):
        for min_size in min_sizes:
            # s_kx, s_ky: width w and height h of the default box
            s_kx = min_size / self.image_size[1]
            s_ky = min_size / self.image_size[0]
            # dense_cx, dense_cy: center of the default box, i.e. the feature map pixel being visited
            # x and y are multiplied by steps / image_size: the feature map size times the step
            # equals the image size, so this walks the whole image in equal strides
            dense_cx = [x * self.steps[k] / self.image_size[1] for x in [j + 0.5]]
            dense_cy = [y * self.steps[k] / self.image_size[0] for y in [i + 0.5]]

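(3) Finally, reshape anchors into rows of [x, y, w, h] and, when clip is set, limit the values to the range 0-1 (clamp() squeezes its input between min and max: values below min become min, values above max become max). Because the paper keeps the aspect ratio at 1:1, the result can be returned directly.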

            # Collect the anchors as [cx, cy, w, h]; the aspect ratio is 1:1 as in the paper,
            # so no further ratios are enumerated
            for cy, cx in product(dense_cy, dense_cx):
                anchors += [cx, cy, s_kx, s_ky]

# back to torch land
# Reshape to rows of [x, y, w, h]; clamp_() limits the values to [0, 1]
# (values below min become min, values above max become max)
output = torch.Tensor(anchors).view(-1, 4)
if self.clip:
    output.clamp_(max=1, min=0)
return output
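Putting the three steps together, the sketch below generates the same kind of default boxes in plain Python so the anchor count can be checked by hand. The function name generate_priors is hypothetical, and the configuration values (640x640 input, steps [8, 16, 32], min_sizes [[16, 32], [64, 128], [256, 512]]) are assumptions chosen to reproduce the [80, 40, 20] feature maps mentioned above.

```python
from itertools import product
from math import ceil


def generate_priors(image_size, min_sizes, steps, clip=False):
    """Return default boxes as [cx, cy, w, h] rows, normalized by the image size."""
    height, width = image_size
    anchors = []
    for k, step in enumerate(steps):
        # Feature map size for this level, e.g. 640 / 8 = 80.
        f_h, f_w = ceil(height / step), ceil(width / step)
        for i, j in product(range(f_h), range(f_w)):
            for min_size in min_sizes[k]:
                s_kx = min_size / width          # box width
                s_ky = min_size / height         # box height
                cx = (j + 0.5) * step / width    # box center x
                cy = (i + 0.5) * step / height   # box center y
                anchors.append([cx, cy, s_kx, s_ky])
    if clip:
        anchors = [[min(max(v, 0.0), 1.0) for v in a] for a in anchors]
    return anchors


priors = generate_priors((640, 640), [[16, 32], [64, 128], [256, 512]], [8, 16, 32])
print(len(priors))  # (80*80 + 40*40 + 20*20) * 2 = 16800
```

With two sizes per location and one aspect ratio, every feature map cell contributes exactly two boxes, which is why the total is twice the number of cells.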

III. Network Architecture

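The network is defined mainly in two files, retinaface.py and net.py. The RetinaFace backbone is mainly ResNet50 or MobileNet.
Main components
(1) MobileNet
It is built from depthwise separable convolutions, which greatly reduce the number of network parameters.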

class MobileNetV1(nn.Module):
    def __init__(self):
        super(MobileNetV1, self).__init__()
        # The trailing comments track the receptive field after each layer.
        self.stage1 = nn.Sequential(
            conv_bn(3, 8, 2, leaky=0.1),  # 3
            conv_dw(8, 16, 1),  # 7
            conv_dw(16, 32, 2),  # 11
            conv_dw(32, 32, 1),  # 19
            conv_dw(32, 64, 2),  # 27
            conv_dw(64, 64, 1),  # 43
        )
        self.stage2 = nn.Sequential(
            conv_dw(64, 128, 2),  # 43 + 16 = 59
            conv_dw(128, 128, 1),  # 59 + 32 = 91
            conv_dw(128, 128, 1),  # 91 + 32 = 123
            conv_dw(128, 128, 1),  # 123 + 32 = 155
            conv_dw(128, 128, 1),  # 155 + 32 = 187
            conv_dw(128, 128, 1),  # 187 + 32 = 219
        )
        self.stage3 = nn.Sequential(
            conv_dw(128, 256, 2),  # 219 + 32 = 251
            conv_dw(256, 256, 1),  # 251 + 64 = 315
        )
        self.avg = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256, 1000)

    def forward(self, x):
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.avg(x)
        # x = self.model(x)
        x = x.view(-1, 256)
        x = self.fc(x)
        return x
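The claim that depthwise separable convolutions cut the parameter count can be checked with simple arithmetic. The sketch below compares a standard 3x3 convolution with a depthwise 3x3 followed by a pointwise 1x1; the function names are hypothetical, and biases and batch-norm parameters are left out as a simplifying assumption.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of one k x k convolution mapping c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv (no bias)."""
    return k * k * c_in + c_in * c_out


std = standard_conv_params(128, 128)        # 9 * 128 * 128 = 147456
dws = depthwise_separable_params(128, 128)  # 9 * 128 + 128 * 128 = 17536
print(std, dws, round(std / dws, 1))
```

For a 128-to-128 layer like the ones in stage2, the separable form needs roughly an eighth of the weights.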

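(2) FPN feature pyramid
The feature pyramid is built to cope with targets of different sizes: the deeper, coarser feature maps are upsampled and merged with the shallower, finer ones. The advantage of this approach is good detection across a wide range of sizes; the drawback is the added time cost.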

class FPN(nn.Module):
    def __init__(self, in_channels_list, out_channels):
        super(FPN, self).__init__()
        leaky = 0
        if out_channels <= 64:
            leaky = 0.1
        self.output1 = conv_bn1X1(in_channels_list[0], out_channels, stride=1, leaky=leaky)
        self.output2 = conv_bn1X1(in_channels_list[1], out_channels, stride=1, leaky=leaky)
        self.output3 = conv_bn1X1(in_channels_list[2], out_channels, stride=1, leaky=leaky)

        self.merge1 = conv_bn(out_channels, out_channels, leaky=leaky)
        self.merge2 = conv_bn(out_channels, out_channels, leaky=leaky)

    def forward(self, input):
        # names = list(input.keys())
        input = list(input.values())

        output1 = self.output1(input[0])
        output2 = self.output2(input[1])
        output3 = self.output3(input[2])

        up3 = F.interpolate(output3, size=[output2.size(2), output2.size(3)], mode="nearest")
        output2 = output2 + up3
        output2 = self.merge2(output2)

        up2 = F.interpolate(output2, size=[output1.size(2), output1.size(3)], mode="nearest")
        output1 = output1 + up2
        output1 = self.merge1(output1)

        out = [output1, output2, output3]
        return out
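The top-down merge in forward (upsample the coarser map to the finer map's size, then add) can be illustrated on tiny single-channel maps stored as nested lists. This is only a sketch of the data flow: the helper names are hypothetical, and the 1x1 lateral convolutions and 3x3 merge convolutions of the real module are deliberately omitted.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list to (out_h, out_w)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def top_down_merge(fine, coarse):
    """Upsample the coarse map to the fine map's size and add element-wise."""
    up = upsample_nearest(coarse, len(fine), len(fine[0]))
    return [[f + u for f, u in zip(f_row, u_row)] for f_row, u_row in zip(fine, up)]


coarse = [[1, 2], [3, 4]]            # stands in for the deeper level
fine = [[10] * 4 for _ in range(4)]  # stands in for the shallower level
merged = top_down_merge(fine, coarse)
for row in merged:
    print(row)
```

Each coarse value is spread over a 2x2 block of the fine map before the addition, which is how information from the deeper level reaches the higher-resolution level.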

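(3) SSH
After the feature pyramid we have three effective feature layers. To enlarge the receptive field further, the SSH module is applied. The idea behind SSH is simple: it uses three parallel branches in which stacked 3x3 convolutions stand in for 5x5 and 7x7 convolutions. The left branch is a single 3x3 convolution, the middle branch uses two 3x3 convolutions in place of a 5x5, and the right branch uses three 3x3 convolutions in place of a 7x7.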

class SSH(nn.Module):
    def __init__(self, in_channel, out_channel):
        super(SSH, self).__init__()
        assert out_channel % 4 == 0
        leaky = 0
        if out_channel <= 64:
            leaky = 0.1
        self.conv3X3 = conv_bn_no_relu(in_channel, out_channel // 2, stride=1)

        self.conv5X5_1 = conv_bn(in_channel, out_channel // 4, stride=1, leaky=leaky)
        self.conv5X5_2 = conv_bn_no_relu(out_channel // 4, out_channel // 4, stride=1)

        self.conv7X7_2 = conv_bn(out_channel // 4, out_channel // 4, stride=1, leaky=leaky)
        self.conv7x7_3 = conv_bn_no_relu(out_channel // 4, out_channel // 4, stride=1)

    def forward(self, input):
        conv3X3 = self.conv3X3(input)

        conv5X5_1 = self.conv5X5_1(input)
        conv5X5 = self.conv5X5_2(conv5X5_1)

        conv7X7_2 = self.conv7X7_2(conv5X5_1)
        conv7X7 = self.conv7x7_3(conv7X7_2)

        out = torch.cat([conv3X3, conv5X5, conv7X7], dim=1)
        out = F.relu(out)
        return out
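Two facts about SSH can be verified with a few lines of arithmetic: stacking n stride-1 3x3 convolutions gives a receptive field of 2n + 1, and the three branches split the output channels as 1/2, 1/4, 1/4 so that the concatenation has out_channel channels again. The helper names below are hypothetical and exist only for this check.

```python
def stacked_receptive_field(num_convs, kernel=3):
    """Receptive field of num_convs stacked stride-1 convolutions."""
    rf = 1
    for _ in range(num_convs):
        rf += kernel - 1  # each stride-1 layer widens the field by kernel - 1
    return rf


def ssh_branch_channels(out_channel):
    """Channels produced by the 3x3, 5x5-equivalent and 7x7-equivalent branches."""
    assert out_channel % 4 == 0
    return out_channel // 2, out_channel // 4, out_channel // 4


print([stacked_receptive_field(n) for n in (1, 2, 3)])  # [3, 5, 7]
print(ssh_branch_channels(64))                          # (32, 16, 16)
```

The middle and right branches share their first 3x3 convolution (conv5X5_1), so the 7x7-equivalent path adds only two more convolutions on top of it.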

IV. Loss Function

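The loss function is implemented mainly in two scripts: multibox_loss.py and box_utils.py.
The RetinaFace loss is a multi-task loss made up of three terms: the facial landmark regression loss loss_landm, the box regression loss loss_l, and the face classification loss loss_c.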

loss_landm = F.smooth_l1_loss(landm_p, landm_t, reduction='sum')
loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')
loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

box_utils.py

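Start with box_utils.py. Its main job is to select anchors with the match function. match produces loc_t, conf_t and landm_t (as the call in multibox_loss.py shows, they are written into the tensors passed in), and the regression targets among them are computed by the encode function. Anchors are selected by IoU (intersection over union): every anchor is assigned its best-matching ground truth, and every ground truth also gets its best-matching anchor.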
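As a rough illustration of the two ingredients named above, the sketch below computes the IoU of two corner-form boxes and encodes a ground-truth box relative to a prior using the SSD-style offset formula with variances [0.1, 0.2] (the values set in MultiBoxLoss). It works on single boxes in plain Python and is an assumption-laden sketch, not the batched tensor code in box_utils.py.

```python
from math import log


def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def encode(gt, prior, variances=(0.1, 0.2)):
    """Encode a corner-form ground truth against a center-form prior [cx, cy, w, h]."""
    g_cx, g_cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    g_w, g_h = gt[2] - gt[0], gt[3] - gt[1]
    # Center offsets are scaled by the prior size, sizes are log-ratios.
    dx = (g_cx - prior[0]) / (variances[0] * prior[2])
    dy = (g_cy - prior[1]) / (variances[0] * prior[3])
    dw = log(g_w / prior[2]) / variances[1]
    dh = log(g_h / prior[3]) / variances[1]
    return [dx, dy, dw, dh]


print(iou([0, 0, 2, 2], [1, 1, 3, 3]))  # overlap 1, union 7
print(encode([0.4, 0.4, 0.6, 0.6], [0.5, 0.5, 0.2, 0.2]))
```

A ground truth that coincides with its prior encodes to offsets of approximately zero, which is the target the network is then trained to regress.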

multibox_loss.py

loss_landm = F.smooth_l1_loss(landm_p, landm_t, reduction='sum')
loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')

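Both the facial landmark regression loss loss_landm and the box regression loss loss_l use the Smooth L1 loss, which was introduced in Fast R-CNN. Compared with L1 loss, Smooth L1 fixes the non-smoothness at zero; compared with L2 loss, it is less sensitive to outliers when x is large, because it grows only linearly there. The formula is:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$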
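A scalar version of Smooth L1 makes the two regimes easy to see. This is a minimal sketch with a hypothetical function name; the beta parameter marks the switch point between the quadratic and linear parts and is set to 1 here, which is an assumption about the configuration in use.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 of a single residual x: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


for residual in (0.0, 0.5, 1.0, 2.0, -2.0):
    print(residual, smooth_l1(residual))
```

With reduction='sum', as in the calls above, the per-element values are simply added over all matched boxes or landmarks.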

loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

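The face classification loss loss_c is the cross-entropy loss. The formula is:
$$L_{cls} = -\sum_{i} y_i \log(p_i)$$
where y_i is the one-hot ground-truth label and p_i is the softmax probability predicted for class i.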
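For a single sample with raw scores (logits), cross-entropy can be written as log-sum-exp of the scores minus the score of the true class, which is the same form the loss code uses during hard negative mining (log_sum_exp(batch_conf) minus the gathered score). The sketch below is a plain-Python version with hypothetical function signatures.

```python
from math import exp, log


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(v))) over a list of scores."""
    m = max(logits)
    return m + log(sum(exp(v - m) for v in logits))


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    return log_sum_exp(logits) - logits[target]


print(cross_entropy([0.0, 0.0], 0))   # equal scores: log(2)
print(cross_entropy([5.0, -5.0], 0))  # confident and correct: close to 0
print(cross_entropy([5.0, -5.0], 1))  # confident and wrong: close to 10
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow for large scores.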

The code is as follows:

class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching  ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched  'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """

    def __init__(self, num_classes, overlap_thresh, prior_for_matching, bkg_label, neg_mining, neg_pos, neg_overlap,
                 encode_target):
        super(MultiBoxLoss, self).__init__()
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = [0.1, 0.2]

    def forward(self, predictions, priors, targets):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
            and prior boxes from SSD net.
                conf shape: torch.size(batch_size,num_priors,num_classes)
                loc shape: torch.size(batch_size,num_priors,4)
                priors shape: torch.size(num_priors,4)
            ground_truth (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size,num_objs,5] (last idx is the label).
        """
        loc_data, conf_data, landm_data = predictions
        priors = priors
        num = loc_data.size(0)
        num_priors = (priors.size(0))

        # match priors (default boxes) and ground truth boxes
        loc_t = torch.Tensor(num, num_priors, 4)
        landm_t = torch.Tensor(num, num_priors, 10)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :4].data
            labels = targets[idx][:, -1].data
            landms = targets[idx][:, 4:14].data
            defaults = priors.data
            # match (from box_utils.py) fills loc_t, conf_t and landm_t in place
            match(self.threshold, truths, defaults, self.variance, labels, landms, loc_t, conf_t, landm_t, idx)
        if GPU:
            loc_t = loc_t.cpu()
            conf_t = conf_t.cpu()
            landm_t = landm_t.cpu()

        zeros = torch.tensor(0).cpu()
        # landm Loss (Smooth L1)
        # Shape: [batch,num_priors,10]
        # pos1 selects the priors whose conf_t is greater than 0; only these are used
        # for the landmark loss, which is computed with Smooth L1
        pos1 = conf_t > zeros
        num_pos_landm = pos1.long().sum(1, keepdim=True)
        N1 = max(num_pos_landm.data.sum().float(), 1)
        pos_idx1 = pos1.unsqueeze(pos1.dim()).expand_as(landm_data)
        landm_p = landm_data[pos_idx1].view(-1, 10)
        landm_t = landm_t[pos_idx1].view(-1, 10)
        loss_landm = F.smooth_l1_loss(landm_p, landm_t, reduction='sum')

        pos = conf_t != zeros
        conf_t[pos] = 1

        # Localization Loss (Smooth L1)
        # Shape: [batch,num_priors,4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)
        loc_t = loc_t[pos_idx].view(-1, 4)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')

        # Compute max conf across batch for hard negative mining
        batch_conf = conf_data.view(-1, self.num_classes)
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

        # Hard Negative Mining
        # First set the loss of the positives to 0, then sort the losses (within each image)
        # and keep the loss of the top self.negpos_ratio * num_pos negatives.
        # Next, loss_c is reshaped to [batch, num_priors].
        loss_c[pos.view(-1, 1)] = 0  # filter out pos boxes for now
        loss_c = loss_c.view(num, -1)
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(self.negpos_ratio * num_pos, max=pos.size(1) - 1)
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Confidence Loss Including Positive and Negative Examples
        # The steps above exist to obtain pos_idx and neg_idx.
        # conf_data has shape [batch, num_priors, num_classes].
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        # My guess at why (pos_idx + neg_idx).gt(0) is used: the selected positives and
        # negatives may overlap, so values greater than 1 are turned back into 1.
        # In my experiments, though, Tensor[mask] also works when mask holds values greater than 1.
        conf_p = conf_data[(pos_idx + neg_idx).gt(0)].view(-1, self.num_classes)
        targets_weighted = conf_t[(pos + neg).gt(0)]
        loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        N = max(num_pos.data.sum().float(), 1)
        loss_l /= N
        loss_c /= N
        loss_landm /= N1

        return loss_l, loss_c, loss_landm
        # loss_l: L(box), box regression loss, smooth l1 loss
        # loss_c: L(cls), face classification loss, cross entropy
        # loss_landm: L(pts), facial landmark regression loss, smooth l1 loss
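The hard negative mining step relies on a double sort: sorting the losses in descending order and then sorting the resulting indices gives each prior its rank, so idx_rank < num_neg keeps exactly the highest-loss negatives. The sketch below reproduces that logic for a single image in plain Python; the function name is hypothetical and the 3:1 ratio is taken from the docstring default, not from any particular training configuration.

```python
def select_hard_negatives(losses, is_positive, negpos_ratio=3):
    """Return a boolean mask marking the negatives kept for the classification loss."""
    # Zero out the positives so they are never picked as negatives.
    masked = [0.0 if p else l for l, p in zip(losses, is_positive)]
    # order[r] is the index of the prior with the r-th largest loss ...
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    # ... and rank[i] is the position of prior i in that ordering.
    rank = [0] * len(masked)
    for r, i in enumerate(order):
        rank[i] = r
    num_pos = sum(is_positive)
    num_neg = min(negpos_ratio * num_pos, len(masked) - 1)
    return [r < num_neg for r in rank]


losses = [0.1, 2.0, 0.5, 0.05, 3.0, 0.2]
is_positive = [False, False, False, False, True, False]
neg_mask = select_hard_negatives(losses, is_positive)
print(neg_mask)
```

With one positive and a 3:1 ratio, the three negatives with the largest losses (indices 1, 2 and 5) are kept and the easy negatives are ignored.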