Project webpage: https://splice-vit.github.io
Abstract
Splice the structure of objects in one image with the style (appearance) of semantically related objects in another image.
1、Introduction
What is the purpose of this Class Token?
Without it, feeding only the 9 patch vectors (1–9) into the Transformer yields 9 encoded vectors, but for an image-classification task, which of these outputs should be used for the final classification?
ViT therefore introduces a learnable embedding, the Class Token (vector 0). It is fed into the Transformer together with the 9 patch vectors, producing 10 encoded vectors, and the encoded Class Token is then used for the classification prediction.
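As a minimal illustration (not from the paper), the snippet below shows how a learnable [CLS] token is prepended to the patch tokens before the Transformer; the dimensions (9 patches, d = 768) are chosen only to match the example above.

```python
import torch
import torch.nn as nn

# Minimal sketch of the [CLS]-token mechanism described above (illustrative dims).
n_patches, d = 9, 768                                        # vectors 1~9, embedding dim d
patch_tokens = torch.randn(1, n_patches, d)                  # one image's patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, d))               # vector 0, learned during training
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d))   # learned positional embeddings

tokens = torch.cat([cls_token, patch_tokens], dim=1) + pos_embed  # 10 tokens in total
print(tokens.shape)   # torch.Size([1, 10, 768]) -> fed through the Transformer layers
# After the last layer, tokens[:, 0] (the encoded [CLS] token) is used for classification.
```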
2、Related Work
Domain transfer: adapting distributions, in particular the marginal distribution, without taking class information into account. How to do domain transfer: on top of the usual deep-network loss, add a confusion loss that measures whether a classifier can separate the two domains; optimizing both losses together is domain transfer. (https://zhuanlan.zhihu.com/p/30621691 — by the author of 《迁移学习导论》)
These methods aim to learn a mapping between a source image domain and a target image domain; the typical approach is to train a GAN. (Note: an image domain is a set of images whose content shares the same attribute. Image-to-image translation converts image content from domain X to domain Y, i.e., it removes the original attribute X and assigns the new attribute Y.)
Swapping Autoencoder (SA) trains a domain-specific GAN to disentangle an image into structure and texture, and then swaps them between two images within that domain.
Single-pair (one-shot) image-to-image translation methods have also emerged.
(Compared with SA, the method in this paper is not restricted to a specific image domain, does not require a dataset for training, and involves no adversarial training.)
However, these methods only exploit low-level information and lack semantic understanding.
STROTSS represents style with pre-trained VGG features and captures structure via self-similarity, within an optimization-based framework that performs style transfer in a global manner.
Semantic style-transfer methods match semantically related regions between two images, but they are limited to color transformations or rely on additional semantic inputs.
The goal of this paper is to transfer appearance between semantically related objects in a pair of natural images, where the objects can be arbitrary and the setting remains flexible.
Paper walkthrough: DINO – self-supervised vision Transformers: https://new.qq.com/rain/a/20211202A02CHQ00
DINO: a ViT-based self-supervised algorithm: https://zhuanlan.zhihu.com/p/439244656
3、Method
Source structure image: Is; target appearance image: It; generated output image: Io.
Io := the objects in Is "painted" with the visual appearance of their semantically related counterparts in It.
Given the input image pair {Is, It}, a generator Gθ(Is) = Io is trained.
Loss: the training losses are established with DINO-ViT, a self-supervised, pre-trained ViT model.
The structure/appearance images are fed through this model, and Gθ is trained to generate the target image.
Lapp is the appearance loss between Io and It (defined via the [CLS] token, not a cross-entropy; see Sec. 3.3).
Lstructure is the structure loss between Io and Is.
1、For a given pair {Is, It}, we train a generator Gθ(Is) = Io.
2、To establish our training losses, we leverage DINO-ViT – a self-supervised, pre-trained ViT model – which is kept fixed and serves as an external high-level prior.
3、We propose new deep representations for structure and appearance in DINO-ViT feature space:
we represent structure via the self-similarity of keys in the deepest attention module (Self-Sim), and appearance via the [CLS] token in the deepest layer.
4、We train Gθ to output an image that, when fed into DINO-ViT, matches the source structure and target appearance representations.
Our training objective is twofold: (i) Lapp, which encourages the deep appearance representations of Io and It to match, and (ii) Lstructure, which encourages the deep structure representations of Io and Is to match. (A schematic training-loop sketch follows below.)
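A rough sketch of items 1–4, under the following assumptions: `objective` is a hypothetical callable computing the Sec. 3.3 losses from frozen DINO-ViT features, and the optimizer settings are illustrative placeholders rather than the authors' values.

```python
import torch

def train_generator(G, dino_vit, Is, It, objective, steps=1000, lr=2e-3):
    """Schematic training loop: G is optimized while DINO-ViT stays frozen."""
    dino_vit.eval()
    for p in dino_vit.parameters():
        p.requires_grad_(False)                 # DINO-ViT serves as a fixed external prior
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(steps):
        Io = G(Is)                              # output should keep the structure of Is...
        loss = objective(dino_vit, Io, Is, It)  # ...while matching the appearance of It
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G(Is).detach()                       # final spliced image Io
```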
3.1 Vision Transformers – overview (ViT recap)
Reference blog posts:
1、Introduction to the ViT network architecture: https://blog.csdn.net/m0_63156697/article/details/126889774
2、ViT (Vision Transformer) explained (more detailed): https://zhuanlan.zhihu.com/p/445122996
3、ViT study notes (with code): https://blog.csdn.net/m0_53374472/article/details/127665215
ViT as formulated in the paper:
1、An image I is processed as a sequence of n non-overlapping patches as follows:
2、spatial tokens are formed by linearly embedding each patch to a d-dimensional vector
3、and adding learned positional embeddings.
4、An additional learnable token, a.k.a. the [CLS] token, serves as a global representation of the image.
The set of tokens is then passed through L Transformer layers, each consisting of layer normalization (LN), Multihead Self-Attention (MSA) modules, and MLP blocks:
5、After the last layer, the [CLS] token is passed through an additional MLP to form the final output, e.g., an output distribution over a set of labels. (A minimal sketch of such a Transformer block is given below.)
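A minimal pre-norm Transformer block matching the LN/MSA/MLP description above — a generic sketch, not DINO-ViT's exact implementation; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One Transformer layer: LN -> MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, d=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, mlp_ratio * d), nn.GELU(),
                                 nn.Linear(mlp_ratio * d, d))

    def forward(self, z):                       # z: (batch, n + 1 tokens, d)
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # MSA + residual
        z = z + self.mlp(self.ln2(z))                        # MLP + residual
        return z

tokens = torch.randn(1, 10, 768)                # [CLS] + 9 patch tokens, as above
print(Block()(tokens).shape)                    # torch.Size([1, 10, 768])
```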
In our framework, we leverage DINO-ViT, in which the model has been trained in a self-supervised manner using a self-distillation approach.
3.2 Structure & Appearance in ViT’s Feature Space
cos-sim is the cosine similarity between keys (see Eq. 1); the dimensions of the self-similarity matrix:
•Understanding and visualizing DINO-ViT’s features
•φ(I) denotes the target features.
•If we disregard the appearance information carried by the keys and consider only their self-similarity (see the sketch below)
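A sketch of the key self-similarity used as the structure representation: pairwise cosine similarity between the keys of the deepest attention module. In practice the keys would be read out of DINO-ViT's last block (e.g. via a forward hook); the shapes below are made up.

```python
import torch
import torch.nn.functional as F

def self_similarity(keys: torch.Tensor) -> torch.Tensor:
    """keys: (tokens, d_k) from the deepest layer -> (tokens, tokens) self-similarity."""
    keys = F.normalize(keys, dim=-1)   # unit norm, so the dot product equals cos-sim
    return keys @ keys.t()             # S_ij = cos-sim(k_i, k_j), cf. Eq. (1)

keys = torch.randn(10, 64)             # e.g. 10 tokens, key dim 64 (illustrative)
print(self_similarity(keys).shape)     # torch.Size([10, 10])
```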
3.3. Splicing ViT Features
Training our generator. The overall objective is L_splice = Lapp + α·Lstructure + β·Lid, where α and β set the relative weights of the corresponding terms. The driving loss of the objective is Lapp; α = 0.1 and β = 0.1 are used in all experiments.
Appearance loss: The term Lapp encourages the output image to match the appearance of It, and is defined as the difference in [CLS] token between the generated and appearance images: Lapp = ‖t_[CLS]^L(It) − t_[CLS]^L(Io)‖₂, where t_[CLS]^L(·) is the deepest-layer [CLS] token.
Identity loss: Lid is applied to the keys in the deepest ViT layer, which form a semantically invertible representation of the input image.
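Putting the three terms together, a hedged sketch of the objective (distances are written as MSE purely for illustration; `cls_of`, `selfsim_of`, `keys_of` are hypothetical wrappers returning the deepest [CLS] token, key self-similarity, and keys of the frozen DINO-ViT). This plays the role of the `objective` placeholder in the earlier training-loop sketch, with the feature extractors passed explicitly.

```python
import torch.nn.functional as F

def splice_loss(G, cls_of, selfsim_of, keys_of, Io, Is, It, alpha=0.1, beta=0.1):
    # Appearance: match the deepest [CLS] token of the output and the appearance image.
    L_app = F.mse_loss(cls_of(Io), cls_of(It).detach())
    # Structure: match the key self-similarity of the output and the structure image.
    L_structure = F.mse_loss(selfsim_of(Io), selfsim_of(Is).detach())
    # Identity: feeding the appearance image should reproduce its own deepest keys.
    L_id = F.mse_loss(keys_of(G(It)), keys_of(It).detach())
    return L_app + alpha * L_structure + beta * L_id
```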
Data augmentations and training
From the single input pair {Is, It}, additional training examples are created by applying augmentations such as crops and color jittering.
Gθ is therefore trained on multiple internal examples; it has to learn a good mapping for a dataset of N examples, rather than solving a test-time optimization problem for a single instance.
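For illustration, the extra internal training examples could be built with torchvision transforms roughly as below; the authors' exact augmentation pipeline and parameters are not reproduced here, and the values shown are placeholders.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),   # placeholder crop settings
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
])

def make_internal_dataset(Is, It, N=16):
    """Create N augmented (structure, appearance) pairs from a single input pair."""
    return [(augment(Is), augment(It)) for _ in range(N)]
```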
4、Results
The authors collected their own dataset, named Wild-Pairs. The image resolution ranges from 512px to 2000px.
4.1 Comparisons to Prior Work: Qualitative comparison
4.1 Comparisons to Prior Work: Quantitative comparison
•Human Perceptual Evaluation: participants are asked, “Which image best shows the shape/structure of image A combined with the appearance/style of image B?”
4.2 Ablation
4.3 Limitations
•The objects are semantically related, but one of the images is highly non-realistic (and thus out of DINO-ViT's distribution);
in this case the method fails to semantically relate the bird to the airplane.