Table of Contents
- Abstract
- 1. Introduction
- 2. Framework
- 2.1 Backbone network: HRNet
- 2.1.1 Multi-resolution parallel convolutions
- 2.1.2 Multi-resolution fusion
- 2.1.3 Representation heads
- 2.1.4 Code for each stage of HRNet
- 2.2 Deconvolution module
- 2.3 Keypoint grouping
- 2.4 Heatmap aggregation strategy
- 2.5 HigherHRNet code
- 3. Contributions and limitations
- 3.1 Contributions
- 3.2 Limitations
- References
- Summary
Abstract
HigherHRNet is a multi-person pose estimation model built on the HRNet backbone. Its key contribution lies in addressing the scale variation problem faced by bottom-up multi-person pose estimation. The model combines the high-resolution network HRNet with a deconvolution module to generate feature maps at two different resolutions, effectively handling pose estimation for both large- and small-scale subjects. In this framework, HigherHRNet simultaneously leverages the fine-grained details of the high-resolution feature map and the contextual information of the low-resolution feature map, enhancing its ability to recognize multi-person poses in complex scenes. Additionally, HigherHRNet introduces a multi-resolution feature fusion strategy, strengthening the network's ability to perceive objects at different scales and improving the model's accuracy and robustness in multi-person scenarios. However, HigherHRNet still faces challenges in computational overhead, particularly during the generation and processing of high-resolution feature maps, which require significant computational resources.
1. Introduction
Human pose estimation methods fall into two families: top-down and bottom-up. Top-down methods rely on a person detector to find person instances and then reduce the problem to simpler single-person pose estimation. Because top-down methods normalize every person by cropping and resizing the detected bounding boxes, they are largely insensitive to scale variation among people and achieve excellent pose estimation results. In contrast, bottom-up methods first locate the keypoints of all persons by predicting heatmaps for the different keypoint types, then assign those keypoints to individual person instances. Since bottom-up methods must cope with scale variation themselves, a large gap remains between them and top-down methods on small-scale persons.
Predicting keypoints for small persons poses two challenges: first, handling scale variation, i.e. improving pose estimation for small persons without sacrificing performance on large ones; second, generating high-resolution heatmaps for localizing the keypoints of small persons. To address these challenges, the paper proposes HigherHRNet with HRNet as its backbone.
2. Framework
The figure below shows the structure of a three-stage HigherHRNet model. To the left of the vertical line is the HRNet backbone. For the upper output on the right of the vertical line, the loss combines the heatmap loss with the grouping loss discussed below; for the lower output, only the heatmap loss is computed. The heatmap loss is the mean squared error between the ground-truth and predicted heatmaps.
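To make the heatmap loss concrete, here is a minimal PyTorch sketch; the name heatmap_mse_loss is illustrative rather than from the official repository, whose loss additionally applies masking details omitted here:

import torch

def heatmap_mse_loss(pred, gt):
    """Mean squared error between predicted and ground-truth heatmaps.
    pred, gt: tensors of shape (batch, num_joints, H, W)."""
    return ((pred - gt) ** 2).mean()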
2.1 Backbone network: HRNet
The figure below shows the main body of a four-stage HRNet. Before entering this main body, the input first passes through two stride-2 convolutional layers with $3\times3$ filters, which reduce the feature-map resolution to 1/4 of the input (a sketch follows below).
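A minimal sketch of such a stem, assuming the 64-channel width used in the official code; the snippet just demonstrates how two stride-2 convolutions take a 512×512 input to 1/4 resolution:

import torch
import torch.nn as nn

# Two stride-2 3x3 convolutions: each halves H and W, so the output is 1/4 resolution.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 512, 512)
print(stem(x).shape)  # torch.Size([1, 64, 128, 128]), i.e. 1/4 of 512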
2.1.1 Multi-resolution parallel convolutions
HRNet keeps a high-resolution convolution stream as its first stage, then gradually adds streams from high to low resolution, with each new stage retaining all streams of the previous stage, as shown in the figure below. In the figure, $N_{sr}$ denotes the sub-stream of stage $s$ with resolution index $r$: its feature maps are $\frac{1}{2^{r-1}}$ the size of those in the first stream (see the small helper below). The blocks in every stream are residual units.
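The resolution bookkeeping can be made explicit with a tiny helper (purely illustrative; stream_resolution is not a function from the repository):

def stream_resolution(r, base_hw):
    """Feature-map size of the stream with resolution index r, given the
    size (H, W) of the first (highest-resolution) stream: each increment
    of r halves both dimensions, i.e. a factor of 1/2**(r-1)."""
    h, w = base_hw
    return h // 2 ** (r - 1), w // 2 ** (r - 1)

# With a 128x128 first stream (1/4 of a 512x512 input):
for r in (1, 2, 3, 4):
    print(r, stream_resolution(r, (128, 128)))
# 1 (128, 128) / 2 (64, 64) / 3 (32, 32) / 4 (16, 16)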
2.1.2 Multi-resolution fusion
The purpose of this module is to exchange information across feature maps of multiple resolutions, and it is repeated after every four residual blocks. The figure below shows the multi-resolution fusion module at the end of the third stage. The module takes inputs $\{R_r^i,\ r=1,2,3\}$ and produces outputs $\{R_r^o,\ r=1,2,3\}$, where each output is the sum of the three inputs after applying the transform function $f_{xr}(\cdot)$: $R_r^o=f_{1r}(R_1^i)+f_{2r}(R_2^i)+f_{3r}(R_3^i)$. In addition, since this module sits at the end of the third stage and is followed by the fourth stage, it has an extra output $R_4^o=f_{14}(R_1^i)+f_{24}(R_2^i)+f_{34}(R_3^i)$.
The transform function $f_{xr}(\cdot)$ depends on the input resolution index $x$ and the output resolution index $r$:

$$f_{xr}(R)=\begin{cases}R, & \text{if } x = r\\ (r-x)\ \text{stride-2}\ 3\times3\ \text{convolutions applied to}\ R, & \text{if } x < r\\ \text{bilinear upsampling of}\ R\ \text{followed by a}\ 1\times1\ \text{convolution}, & \text{if } x > r\end{cases}$$
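A minimal sketch of this transform following the formula above; make_f_xr and its parameters are illustrative names, and note that the official fuse layers use nearest-neighbor upsampling while the text here describes bilinear interpolation:

import torch.nn as nn

def make_f_xr(x, r, channels_x, channels_r):
    """Build the transform f_xr from resolution index x to index r,
    following the piecewise definition above (a sketch, not the repo's exact code)."""
    if x == r:
        return nn.Identity()
    if x < r:
        # Downsample: (r - x) stride-2 3x3 convolutions.
        layers, c = [], channels_x
        for _ in range(r - x):
            layers += [nn.Conv2d(c, channels_r, 3, stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(channels_r)]
            c = channels_r
        return nn.Sequential(*layers)
    # Upsample: bilinear interpolation, then a 1x1 convolution.
    return nn.Sequential(
        nn.Upsample(scale_factor=2 ** (x - r), mode='bilinear', align_corners=False),
        nn.Conv2d(channels_x, channels_r, kernel_size=1, bias=False),
        nn.BatchNorm2d(channels_r),
    )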
2.1.3 Representation heads
HRNet comes with three representation heads: HRNetV1, HRNetV2, and HRNetV2p. HRNetV1 keeps only the highest-resolution feature map and discards the rest; HRNetV2 upsamples the lower-resolution feature maps with bilinear interpolation, concatenates all maps along the channel dimension, and then applies a $1\times1$ convolution; HRNetV2p downsamples the HRNetV2 output into multiple feature maps of different sizes but the same number of channels.
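A sketch of the HRNetV2-style head described above, assuming a list of branch outputs ordered from highest to lowest resolution (HRNetV2Head is an illustrative name, not the repository's class):

import torch
import torch.nn as nn
import torch.nn.functional as F

class HRNetV2Head(nn.Module):
    """Upsample all branches to the highest resolution, concatenate
    along the channel dimension, then apply a 1x1 convolution."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, feats):  # feats: list of maps, highest resolution first
        h, w = feats[0].shape[2:]
        ups = [feats[0]] + [
            F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
            for f in feats[1:]
        ]
        return self.conv(torch.cat(ups, dim=1))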
2.1.4 Code for each stage of HRNet
import logging

import torch.nn as nn

BN_MOMENTUM = 0.1  # value used in the official repository
logger = logging.getLogger(__name__)


def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes, momentum=BN_MOMENTUM)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes, momentum=BN_MOMENTUM)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes, momentum=BN_MOMENTUM)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes, momentum=BN_MOMENTUM)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1,
                               bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion,
                                  momentum=BN_MOMENTUM)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class HighResolutionModule(nn.Module):
    """One multi-resolution module: parallel branches followed by fusion."""

    def __init__(self, num_branches, blocks, num_blocks, num_inchannels,
                 num_channels, fuse_method, multi_scale_output=True):
        super(HighResolutionModule, self).__init__()
        self._check_branches(
            num_branches, blocks, num_blocks, num_inchannels, num_channels)

        self.num_inchannels = num_inchannels
        self.fuse_method = fuse_method
        self.num_branches = num_branches

        self.multi_scale_output = multi_scale_output

        self.branches = self._make_branches(
            num_branches, blocks, num_blocks, num_channels)
        self.fuse_layers = self._make_fuse_layers()
        self.relu = nn.ReLU(True)

    def _check_branches(self, num_branches, blocks, num_blocks,
                        num_inchannels, num_channels):
        if num_branches != len(num_blocks):
            error_msg = 'NUM_BRANCHES({}) <> NUM_BLOCKS({})'.format(
                num_branches, len(num_blocks))
            logger.error(error_msg)
            raise ValueError(error_msg)

        if num_branches != len(num_channels):
            error_msg = 'NUM_BRANCHES({}) <> NUM_CHANNELS({})'.format(
                num_branches, len(num_channels))
            logger.error(error_msg)
            raise ValueError(error_msg)

        if num_branches != len(num_inchannels):
            error_msg = 'NUM_BRANCHES({}) <> NUM_INCHANNELS({})'.format(
                num_branches, len(num_inchannels))
            logger.error(error_msg)
            raise ValueError(error_msg)

    def _make_one_branch(self, branch_index, block, num_blocks, num_channels,
                         stride=1):
        downsample = None
        if stride != 1 or \
                self.num_inchannels[branch_index] != \
                num_channels[branch_index] * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.num_inchannels[branch_index],
                          num_channels[branch_index] * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(num_channels[branch_index] * block.expansion,
                               momentum=BN_MOMENTUM),
            )

        layers = []
        layers.append(block(self.num_inchannels[branch_index],
                            num_channels[branch_index], stride, downsample))
        self.num_inchannels[branch_index] = \
            num_channels[branch_index] * block.expansion
        for i in range(1, num_blocks[branch_index]):
            layers.append(block(self.num_inchannels[branch_index],
                                num_channels[branch_index]))

        return nn.Sequential(*layers)

    def _make_branches(self, num_branches, block, num_blocks, num_channels):
        branches = []
        for i in range(num_branches):
            branches.append(
                self._make_one_branch(i, block, num_blocks, num_channels))
        return nn.ModuleList(branches)

    def _make_fuse_layers(self):
        if self.num_branches == 1:
            return None

        num_branches = self.num_branches
        num_inchannels = self.num_inchannels
        fuse_layers = []
        for i in range(num_branches if self.multi_scale_output else 1):
            fuse_layer = []
            for j in range(num_branches):
                if j > i:
                    # Lower resolution -> this branch: 1x1 conv + upsample.
                    fuse_layer.append(nn.Sequential(
                        nn.Conv2d(num_inchannels[j],
                                  num_inchannels[i],
                                  1, 1, 0, bias=False),
                        nn.BatchNorm2d(num_inchannels[i]),
                        nn.Upsample(scale_factor=2**(j-i), mode='nearest')))
                elif j == i:
                    fuse_layer.append(None)
                else:
                    # Higher resolution -> this branch: stride-2 3x3 convs.
                    conv3x3s = []
                    for k in range(i-j):
                        if k == i - j - 1:
                            num_outchannels_conv3x3 = num_inchannels[i]
                            conv3x3s.append(nn.Sequential(
                                nn.Conv2d(num_inchannels[j],
                                          num_outchannels_conv3x3,
                                          3, 2, 1, bias=False),
                                nn.BatchNorm2d(num_outchannels_conv3x3)))
                        else:
                            num_outchannels_conv3x3 = num_inchannels[j]
                            conv3x3s.append(nn.Sequential(
                                nn.Conv2d(num_inchannels[j],
                                          num_outchannels_conv3x3,
                                          3, 2, 1, bias=False),
                                nn.BatchNorm2d(num_outchannels_conv3x3),
                                nn.ReLU(True)))
                    fuse_layer.append(nn.Sequential(*conv3x3s))
            fuse_layers.append(nn.ModuleList(fuse_layer))

        return nn.ModuleList(fuse_layers)

    def get_num_inchannels(self):
        return self.num_inchannels

    def forward(self, x):
        if self.num_branches == 1:
            return [self.branches[0](x[0])]

        for i in range(self.num_branches):
            x[i] = self.branches[i](x[i])

        x_fuse = []
        for i in range(len(self.fuse_layers)):
            y = x[0] if i == 0 else self.fuse_layers[i][0](x[0])
            for j in range(1, self.num_branches):
                if i == j:
                    y = y + x[j]
                else:
                    y = y + self.fuse_layers[i][j](x[j])
            x_fuse.append(self.relu(y))

        return x_fuse


blocks_dict = {
    'BASIC': BasicBlock,
    'BOTTLENECK': Bottleneck
}
2.2 Deconvolution module
The deconvolution module produces high-quality feature maps at twice the input resolution. It consists of a $4\times4$ deconvolution (transposed convolution) layer followed by BatchNorm and ReLU, plus four subsequent residual blocks whose purpose is to refine the upsampled feature maps. Its input is the concatenation of the feature maps with the predicted heatmaps, which come either from the final output of HRNet or from the previous deconvolution module.
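A standalone sketch of such a module, reusing BasicBlock from section 2.1.4 (deconv_module is an illustrative name; the official _make_deconv_layers shown in section 2.5 additionally widens the input channels when predicted heatmaps are concatenated):

import torch.nn as nn

def deconv_module(in_channels, out_channels, num_blocks=4):
    """4x4 transposed convolution with stride 2 (doubles H and W) + BN + ReLU,
    followed by residual blocks that refine the upsampled feature maps."""
    layers = [
        nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4,
                           stride=2, padding=1, output_padding=0, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    ]
    layers += [BasicBlock(out_channels, out_channels) for _ in range(num_blocks)]
    return nn.Sequential(*layers)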
2.3 Keypoint grouping
HigherHRNet groups keypoints with the associative embedding method, taking as input the feature maps at 1/4 the resolution of the original image. Let $h_k\in\mathbb{R}^{w\times h}$ be the predicted tag map for the $k$-th keypoint. Suppose an image contains $N$ people with ground-truth keypoint locations $T=\{x_{nk} : n=1,\dots,N,\ k=1,\dots,K\}$; the grouping loss $L_g$ is then:
$$L_g(h, T)=\frac{1}{NK}\sum_{n}\sum_{k}\left(\overline{h}_n-h_k(x_{nk})\right)^2+\frac{1}{N^2}\sum_{n}\sum_{n'}\exp\left\{-\frac{1}{2\sigma^2}\left(\overline{h}_n-\overline{h}_{n'}\right)^2\right\}$$
where $\overline{h}_n=\frac{1}{K}\sum_{k}h_k(x_{nk})$ is the mean (reference) embedding of person $n$.
The first term of $L_g$ pulls the embeddings of all keypoints of one person toward their mean, so keypoints of the same person cluster tightly in the embedding space, which helps group them correctly. The second term pushes the mean embeddings of different people apart, so keypoints of different people spread out in the embedding space, which helps avoid grouping keypoints across people.
At inference time, candidate keypoint locations and the embeddings at those locations are read off the predicted heatmaps. The L2 distance between a keypoint's embedding and those of other keypoints is then computed, and two keypoints whose distance falls below a threshold are considered to belong to the same group (the same person).
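A minimal sketch of the grouping loss $L_g$ above; grouping_loss and its argument layout are illustrative, and a production implementation would vectorize this and handle missing joints:

import torch

def grouping_loss(tags, joints_per_person, sigma=1.0):
    """tags: (K, H, W) predicted tag maps; joints_per_person: for each
    person, a list of (k, y, x) ground-truth keypoint positions.
    The diagonal n == n' term contributes a constant, as in the formula above."""
    means, pull = [], 0.0
    for joints in joints_per_person:
        person_tags = torch.stack([tags[k, y, x] for k, y, x in joints])
        mean = person_tags.mean()
        means.append(mean)
        # Pull term: keypoints of one person toward their mean embedding.
        pull = pull + ((person_tags - mean) ** 2).mean()
    means = torch.stack(means)
    n = len(means)
    # Push term: mean embeddings of different people away from each other.
    diff = means[:, None] - means[None, :]
    push = torch.exp(-diff ** 2 / (2 * sigma ** 2)).sum() / n ** 2
    return pull / n + push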
2.4 Heatmap aggregation strategy
HigherHRNet upsamples the predicted heatmaps of the different resolutions to the input image size with bilinear interpolation, then averages them to obtain the final heatmaps used for prediction.
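A sketch of this aggregation, assuming the tag channels have already been stripped so all heatmap tensors share the same number of joint channels (aggregate_heatmaps is an illustrative name):

import torch
import torch.nn.functional as F

def aggregate_heatmaps(heatmaps, input_size):
    """Bilinearly upsample each predicted heatmap to the input image size,
    then average them. heatmaps: list of (B, J, H_s, W_s) tensors,
    e.g. the 1/4- and 1/2-resolution predictions."""
    upsampled = [
        F.interpolate(h, size=input_size, mode='bilinear', align_corners=False)
        for h in heatmaps
    ]
    return torch.stack(upsampled).mean(dim=0)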
2.5 HigherHRNet code
import os

import torch
import torch.nn as nn


class PoseHigherResolutionNet(nn.Module):
    """HigherHRNet: HRNet backbone plus deconvolution head(s).

    Relies on BasicBlock, Bottleneck, HighResolutionModule, blocks_dict,
    BN_MOMENTUM and logger from the listing in section 2.1.4.
    """

    def __init__(self, cfg, **kwargs):
        self.inplanes = 64
        extra = cfg.MODEL.EXTRA
        super(PoseHigherResolutionNet, self).__init__()

        # stem net: two stride-2 3x3 convs -> 1/4 of the input resolution
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1,
                               bias=False)
        self.bn1 = nn.BatchNorm2d(64, momentum=BN_MOMENTUM)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1,
                               bias=False)
        self.bn2 = nn.BatchNorm2d(64, momentum=BN_MOMENTUM)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self._make_layer(Bottleneck, 64, 4)

        self.stage2_cfg = cfg['MODEL']['EXTRA']['STAGE2']
        num_channels = self.stage2_cfg['NUM_CHANNELS']
        block = blocks_dict[self.stage2_cfg['BLOCK']]
        num_channels = [
            num_channels[i] * block.expansion for i in range(len(num_channels))]
        self.transition1 = self._make_transition_layer([256], num_channels)
        self.stage2, pre_stage_channels = self._make_stage(
            self.stage2_cfg, num_channels)

        self.stage3_cfg = cfg['MODEL']['EXTRA']['STAGE3']
        num_channels = self.stage3_cfg['NUM_CHANNELS']
        block = blocks_dict[self.stage3_cfg['BLOCK']]
        num_channels = [
            num_channels[i] * block.expansion for i in range(len(num_channels))]
        self.transition2 = self._make_transition_layer(
            pre_stage_channels, num_channels)
        self.stage3, pre_stage_channels = self._make_stage(
            self.stage3_cfg, num_channels)

        self.stage4_cfg = cfg['MODEL']['EXTRA']['STAGE4']
        num_channels = self.stage4_cfg['NUM_CHANNELS']
        block = blocks_dict[self.stage4_cfg['BLOCK']]
        num_channels = [
            num_channels[i] * block.expansion for i in range(len(num_channels))]
        self.transition3 = self._make_transition_layer(
            pre_stage_channels, num_channels)
        self.stage4, pre_stage_channels = self._make_stage(
            self.stage4_cfg, num_channels, multi_scale_output=False)

        self.final_layers = self._make_final_layers(cfg, pre_stage_channels[0])
        self.deconv_layers = self._make_deconv_layers(
            cfg, pre_stage_channels[0])

        self.num_deconvs = extra.DECONV.NUM_DECONVS
        self.deconv_config = cfg.MODEL.EXTRA.DECONV
        self.loss_config = cfg.LOSS

        self.pretrained_layers = cfg['MODEL']['EXTRA']['PRETRAINED_LAYERS']

    def _make_final_layers(self, cfg, input_channels):
        dim_tag = cfg.MODEL.NUM_JOINTS if cfg.MODEL.TAG_PER_JOINT else 1
        extra = cfg.MODEL.EXTRA

        final_layers = []
        # Tag (embedding) channels are added only where the AE loss is on.
        output_channels = cfg.MODEL.NUM_JOINTS + dim_tag \
            if cfg.LOSS.WITH_AE_LOSS[0] else cfg.MODEL.NUM_JOINTS
        final_layers.append(nn.Conv2d(
            in_channels=input_channels,
            out_channels=output_channels,
            kernel_size=extra.FINAL_CONV_KERNEL,
            stride=1,
            padding=1 if extra.FINAL_CONV_KERNEL == 3 else 0
        ))

        deconv_cfg = extra.DECONV
        for i in range(deconv_cfg.NUM_DECONVS):
            input_channels = deconv_cfg.NUM_CHANNELS[i]
            output_channels = cfg.MODEL.NUM_JOINTS + dim_tag \
                if cfg.LOSS.WITH_AE_LOSS[i+1] else cfg.MODEL.NUM_JOINTS
            final_layers.append(nn.Conv2d(
                in_channels=input_channels,
                out_channels=output_channels,
                kernel_size=extra.FINAL_CONV_KERNEL,
                stride=1,
                padding=1 if extra.FINAL_CONV_KERNEL == 3 else 0
            ))

        return nn.ModuleList(final_layers)

    def _make_deconv_layers(self, cfg, input_channels):
        dim_tag = cfg.MODEL.NUM_JOINTS if cfg.MODEL.TAG_PER_JOINT else 1
        extra = cfg.MODEL.EXTRA
        deconv_cfg = extra.DECONV

        deconv_layers = []
        for i in range(deconv_cfg.NUM_DECONVS):
            if deconv_cfg.CAT_OUTPUT[i]:
                # Input is the feature maps concatenated with the heatmaps.
                final_output_channels = cfg.MODEL.NUM_JOINTS + dim_tag \
                    if cfg.LOSS.WITH_AE_LOSS[i] else cfg.MODEL.NUM_JOINTS
                input_channels += final_output_channels
            output_channels = deconv_cfg.NUM_CHANNELS[i]
            deconv_kernel, padding, output_padding = \
                self._get_deconv_cfg(deconv_cfg.KERNEL_SIZE[i])

            layers = []
            layers.append(nn.Sequential(
                nn.ConvTranspose2d(
                    in_channels=input_channels,
                    out_channels=output_channels,
                    kernel_size=deconv_kernel,
                    stride=2,
                    padding=padding,
                    output_padding=output_padding,
                    bias=False),
                nn.BatchNorm2d(output_channels, momentum=BN_MOMENTUM),
                nn.ReLU(inplace=True)
            ))
            # Residual blocks that refine the upsampled feature maps.
            for _ in range(cfg.MODEL.EXTRA.DECONV.NUM_BASIC_BLOCKS):
                layers.append(nn.Sequential(
                    BasicBlock(output_channels, output_channels),
                ))
            deconv_layers.append(nn.Sequential(*layers))
            input_channels = output_channels

        return nn.ModuleList(deconv_layers)

    def _get_deconv_cfg(self, deconv_kernel):
        if deconv_kernel == 4:
            padding = 1
            output_padding = 0
        elif deconv_kernel == 3:
            padding = 1
            output_padding = 1
        elif deconv_kernel == 2:
            padding = 0
            output_padding = 0

        return deconv_kernel, padding, output_padding

    def _make_transition_layer(
            self, num_channels_pre_layer, num_channels_cur_layer):
        num_branches_cur = len(num_channels_cur_layer)
        num_branches_pre = len(num_channels_pre_layer)

        transition_layers = []
        for i in range(num_branches_cur):
            if i < num_branches_pre:
                if num_channels_cur_layer[i] != num_channels_pre_layer[i]:
                    transition_layers.append(nn.Sequential(
                        nn.Conv2d(num_channels_pre_layer[i],
                                  num_channels_cur_layer[i],
                                  3, 1, 1, bias=False),
                        nn.BatchNorm2d(num_channels_cur_layer[i]),
                        nn.ReLU(inplace=True)))
                else:
                    transition_layers.append(None)
            else:
                conv3x3s = []
                for j in range(i+1-num_branches_pre):
                    inchannels = num_channels_pre_layer[-1]
                    outchannels = num_channels_cur_layer[i] \
                        if j == i-num_branches_pre else inchannels
                    conv3x3s.append(nn.Sequential(
                        nn.Conv2d(inchannels, outchannels, 3, 2, 1, bias=False),
                        nn.BatchNorm2d(outchannels),
                        nn.ReLU(inplace=True)))
                transition_layers.append(nn.Sequential(*conv3x3s))

        return nn.ModuleList(transition_layers)

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion, momentum=BN_MOMENTUM),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def _make_stage(self, layer_config, num_inchannels,
                    multi_scale_output=True):
        num_modules = layer_config['NUM_MODULES']
        num_branches = layer_config['NUM_BRANCHES']
        num_blocks = layer_config['NUM_BLOCKS']
        num_channels = layer_config['NUM_CHANNELS']
        block = blocks_dict[layer_config['BLOCK']]
        fuse_method = layer_config['FUSE_METHOD']

        modules = []
        for i in range(num_modules):
            # multi_scale_output is only used by the last module
            if not multi_scale_output and i == num_modules - 1:
                reset_multi_scale_output = False
            else:
                reset_multi_scale_output = True

            modules.append(
                HighResolutionModule(
                    num_branches,
                    block,
                    num_blocks,
                    num_inchannels,
                    num_channels,
                    fuse_method,
                    reset_multi_scale_output)
            )
            num_inchannels = modules[-1].get_num_inchannels()

        return nn.Sequential(*modules), num_inchannels

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.layer1(x)

        x_list = []
        for i in range(self.stage2_cfg['NUM_BRANCHES']):
            if self.transition1[i] is not None:
                x_list.append(self.transition1[i](x))
            else:
                x_list.append(x)
        y_list = self.stage2(x_list)

        x_list = []
        for i in range(self.stage3_cfg['NUM_BRANCHES']):
            if self.transition2[i] is not None:
                x_list.append(self.transition2[i](y_list[-1]))
            else:
                x_list.append(y_list[i])
        y_list = self.stage3(x_list)

        x_list = []
        for i in range(self.stage4_cfg['NUM_BRANCHES']):
            if self.transition3[i] is not None:
                x_list.append(self.transition3[i](y_list[-1]))
            else:
                x_list.append(y_list[i])
        y_list = self.stage4(x_list)

        # First prediction at 1/4 resolution, then one per deconv module.
        final_outputs = []
        x = y_list[0]
        y = self.final_layers[0](x)
        final_outputs.append(y)

        for i in range(self.num_deconvs):
            if self.deconv_config.CAT_OUTPUT[i]:
                x = torch.cat((x, y), 1)

            x = self.deconv_layers[i](x)
            y = self.final_layers[i+1](x)
            final_outputs.append(y)

        return final_outputs

    def init_weights(self, pretrained='', verbose=True):
        logger.info('=> init weights from normal distribution')
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, std=0.001)
                for name, _ in m.named_parameters():
                    if name in ['bias']:
                        nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.ConvTranspose2d):
                nn.init.normal_(m.weight, std=0.001)
                for name, _ in m.named_parameters():
                    if name in ['bias']:
                        nn.init.constant_(m.bias, 0)

        parameters_names = set()
        for name, _ in self.named_parameters():
            parameters_names.add(name)

        buffers_names = set()
        for name, _ in self.named_buffers():
            buffers_names.add(name)

        if os.path.isfile(pretrained):
            pretrained_state_dict = torch.load(pretrained)
            logger.info('=> loading pretrained model {}'.format(pretrained))

            need_init_state_dict = {}
            for name, m in pretrained_state_dict.items():
                if name.split('.')[0] in self.pretrained_layers \
                   or self.pretrained_layers[0] == '*':  # '==', not 'is'
                    if name in parameters_names or name in buffers_names:
                        if verbose:
                            logger.info(
                                '=> init {} from {}'.format(name, pretrained))
                        need_init_state_dict[name] = m
            self.load_state_dict(need_init_state_dict, strict=False)
3. Contributions and limitations
3.1 Contributions
HigherHRNet adopts HRNet as its backbone. HRNet preserves multi-scale information, letting the model extract rich features at every resolution; this multi-scale feature fusion makes it more effective on people of different sizes. HigherHRNet also deconvolves the 1/4-resolution feature maps and heatmaps up to 1/2 resolution, improving detection of small persons: the traditional trick of shrinking the Gaussian kernel makes training harder and hurts performance on small persons, a problem the deconvolution module avoids. Finally, HigherHRNet trains with multi-resolution supervision: a ground-truth heatmap is built at every scale, a mean-squared-error loss is computed at every scale, and the losses of all scales are summed into the final training loss (a sketch follows below). This effectively improves keypoint prediction accuracy across scales.
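A sketch of this multi-resolution supervision, summing a per-scale MSE over predictions and per-scale ground-truth heatmaps (multi_resolution_loss is an illustrative name, reusing the heatmap MSE idea from section 2):

def multi_resolution_loss(preds, gts):
    """One ground-truth heatmap per scale, an MSE loss at each scale,
    and the sum over scales as the training loss.
    preds, gts: lists of (B, J, H_s, W_s) tensors, one pair per scale."""
    return sum(((p - g) ** 2).mean() for p, g in zip(preds, gts))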
3.2 Limitations
Because HigherHRNet uses a multi-resolution feature pyramid and multi-resolution supervision, the model is comparatively complex, and training and inference can be slow, especially on large-scale datasets. Moreover, although HigherHRNet makes clear progress on detecting small persons, there is still room for improvement in some complex scenes, such as cluttered backgrounds or heavily overlapping people.
References
Jingdong Wang, Ke Sun, Tianheng Cheng, et al. Deep High-Resolution Representation Learning for Visual Recognition.
Alejandro Newell, Zhiao Huang, Jia Deng. Associative Embedding: End-to-End Learning for Joint Detection and Grouping.
Bowen Cheng, Bin Xiao, Jingdong Wang, et al. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation.
Code source: https://github.com/HRNet/HigherHRNet-Human-Pose-Estimation
Summary
HigherHRNet extracts feature maps at different scales through multi-resolution feature fusion and raises detection accuracy for small persons with its deconvolution module. After the initial convolutional stem, feature maps are fused and upsampled across multiple resolutions to produce heatmaps, which are refined by keypoint grouping into the final pose estimates. Despite this breakthrough on small persons, HigherHRNet still carries high computational complexity, with relatively slow training and inference, and its accuracy can still improve in scenes with cluttered backgrounds or overlapping people.