Improving the YOLOv7 series: 28. YOLOv7 combined with the Swin Transformer V2 structure (Swin Transformer V2: the road to large vision models)


🔥🔥🔥 YOLO series + Swin Transformer V2 structure: this post was first published by CSDN author 芒果汁没有芒果


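- 💡 A single YOLO code framework is used throughout; different modules are combined to build different YOLO object detection models.

🔥 Every improvement article in the 《芒果书》 column series covers several ways of modifying the model, and all of them apply to YOLOv3, YOLOv4, YOLOR, YOLOX, YOLOv5, YOLOv7 and YOLOv8 (key point)!

🔥 Quite a few readers have told me that the tutorials in this column already improved accuracy on their own datasets, and that gains can be obtained on the COCO dataset as well.

Every post includes the source code of the improvement, so training can be started with a single command.

🔥 The earlier you subscribe to the column, the earlier you can apply these original ideas to your own models and get a head start.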

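The content of the 《芒果书》 improved-YOLO columns applies to the following mainstream models
… all of the model improvements above are covered by the 《芒果书》 🥭 column series

芒果书: click the links below to see each column's article directory 🔗

  • 💡🎈☁️: 1. CSDN original 《芒果改进YOLO高阶指南》 (advanced guide to improving YOLO), strongly recommended for accuracy gains! 📚 Rating: 🌟🌟🌟🌟🌟

  • 💡🎈☁️: 2. CSDN original, YOLO intermediate level | 《芒果改进YOLO进阶指南》 (intermediate guide to improving YOLO), recommended for accuracy gains! 📚 Rating: 🌟🌟🌟🌟🌟

  • 💡🎈☁️: 3. CSDN exclusive, first published here | 《目标检测YOLO改进指南》 (guide to improving YOLO for object detection), recommended for accuracy gains! Rating: 🌟🌟🌟🌟🌟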



The Swin Transformer paper



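The authors of the paper present techniques for scaling Swin Transformer up to 3 billion parameters and for training it with images of up to 1,536×1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1/54.4 box/mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification.

These techniques are generally applicable to scaling up vision models, which has not been explored as widely as the scaling of NLP language models, partly because of two difficulties in training and application: 1) vision models often become unstable at scale, and 2) many downstream vision tasks need high-resolution images or windows, and it is unclear how to transfer a model pre-trained at low resolution to a higher one. GPU memory consumption is also a problem when the image resolution is high.

To address these issues, the authors propose several techniques, using Swin Transformer as a case study: 1) post-normalization and scaled cosine attention, to improve the stability of large vision models; 2) a log-spaced continuous position bias, to transfer models pre-trained on low-resolution images and windows to their higher-resolution counterparts. They also share key implementation details that save a significant amount of GPU memory and make it feasible to train large vision models on regular GPUs. With these techniques and self-supervised pre-training, they train a strong 3-billion-parameter Swin Transformer model and transfer it effectively to various vision tasks involving high-resolution images or windows, reaching state-of-the-art accuracy on a variety of benchmarks. The abstract adds that the code will be made available.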
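The two attention-related ideas in the abstract can be illustrated with plain Python. The sketch below is not the paper's code: the function names `scaled_cosine_logit` and `log_spaced`, the fixed temperature `tau=0.1` and the scalar `bias` are assumptions made for illustration (in Swin Transformer V2 the temperature is learnable and the bias comes from a small network). It shows that a cosine logit does not grow with the magnitude of the vectors, and that log-spaced coordinates grow far less than linear ones when the window gets larger.

```python
import math


def scaled_cosine_logit(q, k, tau=0.1, bias=0.0):
    """Attention logit as cosine similarity divided by a temperature, plus a bias."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    tau = max(tau, 0.01)  # keep the temperature at or above 0.01
    return dot / (norm_q * norm_k) / tau + bias


def log_spaced(delta):
    """Map a linear relative offset to a log-spaced coordinate: sign(d) * log(1 + |d|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


if __name__ == "__main__":
    q = [1.0, 2.0]
    # Scaling k by 1000 leaves the cosine logit unchanged, unlike a raw dot product.
    print(scaled_cosine_logit(q, [2.0, 1.0]))
    print(scaled_cosine_logit(q, [2000.0, 1000.0]))

    # The largest offset in an 8x8 window is 7; in a 16x16 window it is 15.
    print("linear growth:", 15 / 7)
    print("log-spaced growth:", log_spaced(15) / log_spaced(7))
```

Because the logit is bounded by `1 / tau`, a few very large activations cannot dominate the softmax, which is the stability argument made for large models. The smaller growth of the log-spaced coordinates is what makes the position bias easier to extrapolate to larger windows.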

Tutorial: YOLOv7 combined with Swin Transformer V2



# YOLOv7 🚀, GPL-3.0 license
# parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 1.0  # layer channel multiple

# anchors
anchors:
  - [12,16, 19,36, 40,28]  # P3/8
  - [36,75, 76,55, 72,146]  # P4/16
  - [142,110, 192,243, 459,401]  # P5/32

# yolov7 backbone by yoloair
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [32, 3, 1]],  # 0
   [-1, 1, Conv, [64, 3, 2]],  # 1-P1/2
   [-1, 1, Conv, [64, 3, 1]],
   [-1, 1, Conv, [128, 3, 2]],  # 3-P2/4
   [-1, 1, SwinV2_CSPB, [128, 128]],
   [-1, 1, Conv, [256, 3, 2]],
   [-1, 1, MP, []],
   [-1, 1, Conv, [128, 1, 1]],
   [-3, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 2]],
   [[-1, -3], 1, Concat, [1]],  # 16-P3/8
   [-1, 1, Conv, [128, 1, 1]],
   [-2, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [-1, 1, Conv, [128, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [512, 1, 1]],
   [-1, 1, MP, []],
   [-1, 1, Conv, [256, 1, 1]],
   [-3, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, -3], 1, Concat, [1]],
   [-1, 1, Conv, [256, 1, 1]],
   [-2, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [-1, 1, Conv, [256, 3, 1]],
   [[-1, -3, -5, -6], 1, Concat, [1]],
   [-1, 1, Conv, [1024, 1, 1]],
   [-1, 1, MP, []],
   [-1, 1, Conv, [512, 1, 1]],
   [-3, 1, Conv, [512, 1, 1]],
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, -3], 1, Concat, [1]],
   [-1, 1, SwinV2_CSPB, [1024, 1024]],
   [-1, 1, Conv, [256, 3, 1]],
  ]

# yolov7 head by yoloair
head:
  [[-1, 1, SPPCSPC, [512]],
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [31, 1, Conv, [256, 1, 1]],
   [[-1, -2], 1, Concat, [1]],
   [-1, 1, C3, [128]],
   [-1, 1, Conv, [128, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [18, 1, Conv, [128, 1, 1]],
   [[-1, -2], 1, Concat, [1]],
   [-1, 1, C3, [128]],
   [-1, 1, MP, []],
   [-1, 1, Conv, [128, 1, 1]],
   [-3, 1, Conv, [128, 1, 1]],
   [-1, 1, Conv, [128, 3, 2]],
   [[-1, -3, 44], 1, Concat, [1]],
   [-1, 1, C3, [256]],
   [-1, 1, MP, []],
   [-1, 1, Conv, [256, 1, 1]],
   [-3, 1, Conv, [256, 1, 1]],
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, -3, 39], 1, Concat, [1]],
   [-1, 3, C3, [512]],

   # detection head -----------------------------
   [49, 1, RepConv, [256, 3, 1]],
   [55, 1, RepConv, [512, 3, 1]],
   [61, 1, RepConv, [1024, 3, 1]],
   [[62,63,64], 1, IDetect, [nc, anchors]],   # Detect(P3, P4, P5)
  ]
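The `from` field of each row in the configuration above is either relative (negative, counted back from the current row) or absolute (non-negative row index). The following is a minimal sketch of how such references resolve, assuming rows are numbered from 0 through the backbone and then the head in file order, as YOLOv5/YOLOv7-style model parsers do. The helper name `resolve_from` is hypothetical and not part of any YOLO codebase.

```python
def resolve_from(layer_index, from_spec):
    """Resolve a YOLO-style `from` field to absolute layer indices.

    Negative entries are relative to the current layer; non-negative
    entries are already absolute.
    """
    specs = from_spec if isinstance(from_spec, list) else [from_spec]
    return [layer_index + f if f < 0 else f for f in specs]


if __name__ == "__main__":
    # Row indices below assume counting from 0 through backbone and head.
    print(resolve_from(42, 31))            # lateral Conv fed by backbone row 31
    print(resolve_from(54, [-1, -3, 44]))  # Concat mixing relative and absolute inputs
    print(resolve_from(65, [62, 63, 64]))  # IDetect fed by the three RepConv rows
```

With this numbering, the absolute references 31, 18, 44, 39, 49, 55 and 61 in the head line up with rows of this file. They are worth re-checking whenever a layer is added to or removed from the backbone, because every absolute index after the change shifts.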



import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Conv is the YOLOv7 convolution block; this code is meant to be added to models/common.py.
# DropPath is only needed when drop_path > 0 and is assumed to be available,
# e.g. via `from timm.models.layers import DropPath`.


class WindowAttention_v2(nn.Module):

    def __init__(self, dim, window_size, num_heads, qkv_bias=True, attn_drop=0., proj_drop=0.,
                 pretrained_window_size=(0, 0)):
        super().__init__()
        self.dim = dim
        self.window_size = window_size  # Wh, Ww
        self.pretrained_window_size = pretrained_window_size
        self.num_heads = num_heads

        # learnable per-head logit scale for cosine attention
        self.logit_scale = nn.Parameter(torch.log(10 * torch.ones((num_heads, 1, 1))), requires_grad=True)

        # mlp to generate continuous relative position bias
        self.cpb_mlp = nn.Sequential(nn.Linear(2, 512, bias=True),
                                     nn.ReLU(inplace=True),
                                     nn.Linear(512, num_heads, bias=False))

        # get relative_coords_table
        relative_coords_h = torch.arange(-(self.window_size[0] - 1), self.window_size[0], dtype=torch.float32)
        relative_coords_w = torch.arange(-(self.window_size[1] - 1), self.window_size[1], dtype=torch.float32)
        relative_coords_table = torch.stack(
            torch.meshgrid([relative_coords_h,
                            relative_coords_w])).permute(1, 2, 0).contiguous().unsqueeze(0)  # 1, 2*Wh-1, 2*Ww-1, 2
        if pretrained_window_size[0] > 0:
            relative_coords_table[:, :, :, 0] /= (pretrained_window_size[0] - 1)
            relative_coords_table[:, :, :, 1] /= (pretrained_window_size[1] - 1)
        else:
            relative_coords_table[:, :, :, 0] /= (self.window_size[0] - 1)
            relative_coords_table[:, :, :, 1] /= (self.window_size[1] - 1)
        relative_coords_table *= 8  # normalize to -8, 8
        # log-spaced coordinates
        relative_coords_table = torch.sign(relative_coords_table) * torch.log2(
            torch.abs(relative_coords_table) + 1.0) / np.log2(8)

        self.register_buffer("relative_coords_table", relative_coords_table)

        # get pair-wise relative position index for each token inside the window
        coords_h = torch.arange(self.window_size[0])
        coords_w = torch.arange(self.window_size[1])
        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
        coords_flatten = torch.flatten(coords, 1)  # 2, Wh*Ww
        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # 2, Wh*Ww, Wh*Ww
        relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # Wh*Ww, Wh*Ww, 2
        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0
        relative_coords[:, :, 1] += self.window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
        relative_position_index = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww
        self.register_buffer("relative_position_index", relative_position_index)

        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        if qkv_bias:
            self.q_bias = nn.Parameter(torch.zeros(dim))
            self.v_bias = nn.Parameter(torch.zeros(dim))
        else:
            self.q_bias = None
            self.v_bias = None
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, mask=None):
        B_, N, C = x.shape
        qkv_bias = None
        if self.q_bias is not None:
            # the key projection has no bias, so its slot is filled with zeros
            qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
        qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
        qkv = qkv.reshape(B_, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        # cosine attention
        attn = (F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1))
        logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01))).exp()
        attn = attn * logit_scale

        relative_position_bias_table = self.cpb_mlp(self.relative_coords_table).view(-1, self.num_heads)
        relative_position_bias = relative_position_bias_table[self.relative_position_index.view(-1)].view(
            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1)  # Wh*Ww,Wh*Ww,nH
        relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()  # nH, Wh*Ww, Wh*Ww
        relative_position_bias = 16 * torch.sigmoid(relative_position_bias)
        attn = attn + relative_position_bias.unsqueeze(0)

        if mask is not None:
            nW = mask.shape[0]
            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)
            attn = attn.view(-1, self.num_heads, N, N)
            attn = self.softmax(attn)
        else:
            attn = self.softmax(attn)

        attn = self.attn_drop(attn)

        try:
            x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        except RuntimeError:
            # dtype mismatch when v is in half precision: retry with attn cast to half
            x = (attn.half() @ v).transpose(1, 2).reshape(B_, N, C)

        x = self.proj(x)
        x = self.proj_drop(x)
        return x

    def extra_repr(self) -> str:
        return f'dim={self.dim}, window_size={self.window_size}, ' \
               f'pretrained_window_size={self.pretrained_window_size}, num_heads={self.num_heads}'

    def flops(self, N):
        # calculate flops for 1 window with token length of N
        flops = 0
        # qkv = self.qkv(x)
        flops += N * self.dim * 3 * self.dim
        # attn = (q @ k.transpose(-2, -1))
        flops += self.num_heads * N * (self.dim // self.num_heads) * N
        # x = (attn @ v)
        flops += self.num_heads * N * N * (self.dim // self.num_heads)
        # x = self.proj(x)
        flops += N * self.dim * self.dim
        return flops


class Mlp_v2(nn.Module):

    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.SiLU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


# add 2 functions
# The original post only marks this spot and does not list the two helpers.
# The versions below are an assumption that follows the standard Swin Transformer
# reference implementation of window_partition / window_reverse.
def window_partition_v2(x, window_size):
    # x: (B, H, W, C) -> windows: (num_windows*B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows


def window_reverse_v2(windows, window_size, H, W):
    # windows: (num_windows*B, window_size, window_size, C) -> x: (B, H, W, C)
    B = int(windows.shape[0] / (H * W / window_size / window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x


class SwinTransformerLayer_v2(nn.Module):

    def __init__(self, dim, num_heads, window_size=7, shift_size=0,
                 mlp_ratio=4., qkv_bias=True, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.SiLU, norm_layer=nn.LayerNorm, pretrained_window_size=0):
        super().__init__()
        self.dim = dim
        # input_resolution is removed: the feature map size is read in forward()
        self.num_heads = num_heads
        self.window_size = window_size
        self.shift_size = shift_size
        self.mlp_ratio = mlp_ratio
        assert 0 <= self.shift_size < self.window_size, "shift_size must in 0-window_size"

        self.norm1 = norm_layer(dim)
        self.attn = WindowAttention_v2(
            dim, window_size=(self.window_size, self.window_size), num_heads=num_heads,
            qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop,
            pretrained_window_size=(pretrained_window_size, pretrained_window_size))

        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp_v2(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def create_mask(self, H, W):
        # calculate attention mask for SW-MSA
        img_mask = torch.zeros((1, H, W, 1))  # 1 H W 1
        h_slices = (slice(0, -self.window_size),
                    slice(-self.window_size, -self.shift_size),
                    slice(-self.shift_size, None))
        w_slices = (slice(0, -self.window_size),
                    slice(-self.window_size, -self.shift_size),
                    slice(-self.shift_size, None))
        cnt = 0
        for h in h_slices:
            for w in w_slices:
                img_mask[:, h, w, :] = cnt
                cnt += 1

        # use the helper defined above (the original called window_partition here)
        mask_windows = window_partition_v2(img_mask, self.window_size)  # nW, window_size, window_size, 1
        mask_windows = mask_windows.view(-1, self.window_size * self.window_size)
        attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
        attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))
        return attn_mask

    def forward(self, x):
        # reshape x[b c h w] to x[b l c]
        _, _, H_, W_ = x.shape

        Padding = False
        if min(H_, W_) < self.window_size or H_ % self.window_size != 0 or W_ % self.window_size != 0:
            # feature map is smaller than, or not divisible by, the window size: pad right and bottom
            Padding = True
            pad_r = (self.window_size - W_ % self.window_size) % self.window_size
            pad_b = (self.window_size - H_ % self.window_size) % self.window_size
            x = F.pad(x, (0, pad_r, 0, pad_b))

        B, C, H, W = x.shape
        L = H * W
        x = x.permute(0, 2, 3, 1).contiguous().view(B, L, C)  # b, L, c

        # create mask from init to forward
        if self.shift_size > 0:
            attn_mask = self.create_mask(H, W).to(x.device)
        else:
            attn_mask = None

        shortcut = x
        x = x.view(B, H, W, C)

        # cyclic shift
        if self.shift_size > 0:
            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        else:
            shifted_x = x

        # partition windows
        x_windows = window_partition_v2(shifted_x, self.window_size)  # nW*B, window_size, window_size, C
        x_windows = x_windows.view(-1, self.window_size * self.window_size, C)  # nW*B, window_size*window_size, C

        # W-MSA/SW-MSA
        attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C

        # merge windows
        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
        shifted_x = window_reverse_v2(attn_windows, self.window_size, H, W)  # B H' W' C

        # reverse cyclic shift
        if self.shift_size > 0:
            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        else:
            x = shifted_x
        x = x.view(B, H * W, C)
        # post-normalization: the norm is applied to the residual branch output
        x = shortcut + self.drop_path(self.norm1(x))

        # FFN
        x = x + self.drop_path(self.norm2(self.mlp(x)))
        x = x.permute(0, 2, 1).contiguous().view(-1, C, H, W)  # b c h w

        if Padding:
            x = x[:, :, :H_, :W_]  # reverse padding

        return x

    def extra_repr(self) -> str:
        # input_resolution is no longer an attribute, so it is not reported here
        return f"dim={self.dim}, num_heads={self.num_heads}, " \
               f"window_size={self.window_size}, shift_size={self.shift_size}, mlp_ratio={self.mlp_ratio}"

    def flops(self, H, W):
        # H, W are passed in because input_resolution is not stored on the layer
        flops = 0
        # norm1
        flops += self.dim * H * W
        # W-MSA/SW-MSA
        nW = H * W / self.window_size / self.window_size
        flops += nW * self.attn.flops(self.window_size * self.window_size)
        # mlp
        flops += 2 * H * W * self.dim * self.dim * self.mlp_ratio
        # norm2
        flops += self.dim * H * W
        return flops


class SwinTransformer2Block(nn.Module):

    def __init__(self, c1, c2, num_heads, num_layers, window_size=7):
        super().__init__()
        self.conv = None
        if c1 != c2:
            self.conv = Conv(c1, c2)

        # remove input_resolution
        self.blocks = nn.Sequential(*[SwinTransformerLayer_v2(dim=c2, num_heads=num_heads, window_size=window_size,
                                                              shift_size=0 if (i % 2 == 0) else window_size // 2)
                                      for i in range(num_layers)])

    def forward(self, x):
        if self.conv is not None:
            x = self.conv(x)
        x = self.blocks(x)
        return x


class SwinV2_CSPB(nn.Module):
    # CSP Bottleneck https://github.com/WongKinYiu/CrossStagePartialNetworks
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(SwinV2_CSPB, self).__init__()
        c_ = int(c2)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1, 1)
        num_heads = c_ // 32
        self.m = SwinTransformer2Block(c_, c_, num_heads, n)
        # self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])

    def forward(self, x):
        x1 = self.cv1(x)
        y1 = self.m(x1)
        y2 = self.cv2(x1)
        return self.cv3(torch.cat((y1, y2), dim=1))
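The `forward` method of `SwinTransformerLayer_v2` pads the feature map on the right and bottom whenever its height or width is not a multiple of the window size, and `SwinTransformer2Block` alternates the shift between 0 and `window_size // 2`. The sketch below reproduces just that arithmetic in plain Python so the effect on a given feature map can be checked by hand. The helper names `window_padding` and `shift_sizes` and the feature map sizes are hypothetical and chosen only for illustration.

```python
def window_padding(h, w, window_size=7):
    """Return (pad_bottom, pad_right, num_windows) for an h x w feature map."""
    pad_b = (window_size - h % window_size) % window_size
    pad_r = (window_size - w % window_size) % window_size
    num_windows = ((h + pad_b) // window_size) * ((w + pad_r) // window_size)
    return pad_b, pad_r, num_windows


def shift_sizes(num_layers, window_size=7):
    """Shift used by each stacked layer: 0 for even layers, window_size // 2 for odd ones."""
    return [0 if i % 2 == 0 else window_size // 2 for i in range(num_layers)]


if __name__ == "__main__":
    for h, w in [(14, 14), (20, 20), (80, 40)]:
        print((h, w), window_padding(h, w))
    print(shift_sizes(4))
```

A feature map that is already a multiple of 7 is left untouched, while any other size is padded up to the next multiple and cropped back after attention. With a single stacked layer the shift is 0, so shifted-window attention only comes into play when at least two layers are stacked.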


python train.py --cfg yolov7_swin_transfomrer-V2.yaml


