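[Multimodal LLM Study Notes] Notes on Positional Encoding

0. Preface
1. Sinusoidal encoding
- 1.0 Math background: complex numbers
  - 1.0.1 Complex multiplication and complex conjugates
  - 1.0.2 Exponential form of complex numbers
- 1.1 Where sinusoidal encoding comes from
- 1.2 Code implementation
2. Rotary Positional Embedding (RoPE)
- 2.1 Where RoPE comes from
- 2.2 Code implementation
  - 2.2.1 GPT-J style 1D-RoPE
  - 2.2.2 GPT-NeoX style 1D-RoPE
3. Two-dimensional rotary positional encoding (2D-RoPE)
- 3.1 The implementation in the ECCV 2D-RoPE paper
- 3.2 The qwen2-vl implementation
4. M-RoPE, proposed by qwen2-vl
5. Positional encoding in qwen2.5-vl
6. Good references
7. TODO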

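0. Preface

These are my notes from recently studying positional encoding. I used to skip anything about positional encoding, but after reading some recent models and Su Jianlin's blog posts I realized it is an important topic.
Starting directly from the newest methods is confusing: it is hard to tell what the positional encoding code is doing, or why it is written that way. So these notes start from the earliest method and follow the order in which I learned things.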

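1. Sinusoidal encoding

For this part I recommend reading Su Jianlin's blog. When something does not click, you can take a screenshot and keep asking Tongyi Qianwen about it.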

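1.0 Math background: complex numbers

1.0.1 Complex multiplication and complex conjugates

A complex number can be written as $a + bi$, where $a$ and $b$ are real numbers and $i$ is the imaginary unit (satisfying $i^2 = -1$). A complex number can be drawn as a vector in the 2D plane, with the horizontal axis as the real part and the vertical axis as the imaginary part.
Given two 2D vectors $[x_1, y_1]$ and $[x_2, y_2]$, we can treat them as two complex numbers $z_1 = x_1 + y_1i$ and $z_2 = x_2 + y_2i$.
Complex multiplication follows this rule:

$z_1 \cdot z_2 = (x_1 + y_1i) \cdot (x_2 + y_2i) = x_1x_2 - y_1y_2 + (x_1y_2 + x_2y_1)i$

To compute the inner product of two 2D vectors with complex numbers, we only care about the real part of a product, and the conjugate is the tool for that. Given a complex number $z = a + bi$, its conjugate is defined as $z^* = a - bi$. Multiplying a complex number by its own conjugate gives the squared modulus:
$z \cdot z^* = (a + bi) \cdot (a - bi) = a(a) + a(-bi) + (bi)a + (bi)(-bi) = a^2 - abi + abi - b^2i^2 = a^2 + b^2$

The conjugate "cancels" the imaginary part, so the result is a real number. For two complex numbers $z_1$ and $z_2$, multiplying $z_1$ by the conjugate of $z_2$ gives
$z_1 \cdot z_2^* = (x_1 + y_1i) \cdot (x_2 - y_2i) = x_1x_2 + y_1y_2 + (x_2y_1 - x_1y_2)i$

Their inner product can then be defined as:

$\langle z_1, z_2 \rangle = x_1x_2 + y_1y_2 = \text{Re}[z_1 \cdot z_2^*]$

In other words, given two complex numbers $z_1 = x_1 + y_1i$ and $z_2 = x_2 + y_2i$, their inner product as 2D vectors, $\langle z_1, z_2 \rangle = x_1x_2 + y_1y_2$, is exactly the real part of $z_1 \cdot z_2^*$.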
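As a quick numerical check of the identity $\langle z_1, z_2 \rangle = \text{Re}[z_1 \cdot z_2^*]$, here is a minimal sketch in plain Python. The helper name `inner_via_conjugate` and the sample vectors are made up for illustration.

```python
def inner_via_conjugate(v1, v2):
    """Inner product of two 2D vectors, computed as Re[z1 * conj(z2)]."""
    z1 = complex(v1[0], v1[1])
    z2 = complex(v2[0], v2[1])
    return (z1 * z2.conjugate()).real


v1, v2 = (3.0, 4.0), (1.0, 2.0)
dot = v1[0] * v2[0] + v1[1] * v2[1]      # ordinary dot product: 3*1 + 4*2
print(dot, inner_via_conjugate(v1, v2))  # 11.0 11.0
```

The product is $(3 + 4i)(1 - 2i) = 11 - 2i$, and only its real part is kept.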

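1.0.2 Exponential form of complex numbers

Complex numbers also have an exponential form. It is based on Euler's formula, which connects complex numbers with trigonometric and exponential functions:

$e^{i\theta} = \cos(\theta) + i\sin(\theta)$

Here $e$ is the base of the natural logarithm and $\theta$ is an angle in radians.

Any non-zero complex number $z = a + bi$ can be converted to polar form, i.e. described by its modulus (magnitude) and its argument:

Modulus (absolute value): $r = |z| = \sqrt{a^2 + b^2}$
Argument: $\theta = \arg(z)$, the angle from the positive real axis to the line joining the origin and the point $z$.

So any non-zero complex number can be written as:

$z = r(\cos(\theta) + i\sin(\theta))$

With Euler's formula this simplifies to the exponential form:

$z = re^{i\theta}$

Here $r$ is the length (size) of the complex number and $e^{i\theta}$ describes its direction.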
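The conversion between $a + bi$ and $re^{i\theta}$ can be tried out with Python's standard `cmath` module. This is only a small sketch; the sample value $1 + i$ is chosen for illustration.

```python
import cmath
import math

z = complex(1.0, 1.0)
r, theta = cmath.polar(z)            # modulus r and argument theta (radians)
z_back = r * cmath.exp(1j * theta)   # rebuild z from r * e^{i*theta}

print(r, theta)   # r is sqrt(2), theta is pi/4
print(z_back)     # approximately (1+1j)
```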

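1.1 Where sinusoidal encoding comes from

Positional encoding is needed because, without a mask, the attention function $f$ is symmetric in its inputs. For example, swapping two vectors $x_m$ and $x_n$ in the input sequence gives $f(\ldots, x_m, \ldots, x_n, \ldots) = f(\ldots, x_n, \ldots, x_m, \ldots)$, so the output cannot tell whether the input at a given position was $x_m$ or $x_n$.
So in the $Q \cdot K$ product of attention, $Q$ and $K$ each need to carry position information. The vector at each position gets a position-dependent vector $p$ added to it, giving for example $\{x_1+p_1,...,x_m+p_m,...,x_n+p_n\}$, where $p_m$ is the positional encoding vector.
Considering only the positional encodings of the two positions $m$ and $n$, a Taylor expansion shows that $p_m^T\mathcal{H} p_n$ is the only term that contains both $p_m$ and $p_n$ (here $\mathcal{H}$ is the second-derivative matrix from the expansion). In the simplest case take $\mathcal{H}=\mathcal{I}$, so that $p_m^T\mathcal{H} p_n=p_m^Tp_n=\langle p_m,p_n\rangle$. We would like this term to express the relative position of $m$ and $n$, ideally through some function $g(\cdot)$ such that
$\langle p_m,p_n\rangle=g(m-n)$
To make this easier to follow, start with the 2-dimensional case. If $Q$ is 2-dimensional, we can use complex numbers as a tool and write $\langle p_m,p_n\rangle= \text{Re}[p_m \cdot p_n^*]$
Suppose there is a complex number $q_{m-n}$ with $p_m \cdot p_n^*=q_{m-n}$, which makes the equation above hold. In exponential form, let $p_m=r_me^{i \phi_m}$, $p_n^*=r_ne^{-i \phi_n}$, $q_{m-n}=R_{m-n}e^{i \Phi_{m-n}}$

Solve the equation:
$r_mr_ne^{i(\phi_m-\phi_n)}=R_{m-n}e^{i\Phi_{m-n}}$
Matching modulus and argument gives:
$\left\{ \begin{array}{l} r_m r_n = R_{m-n} \\ \phi_m - \phi_n = \Phi_{m-n} \end{array} \right.$

For the first condition, $r_m r_n = R_{m-n}$:
Setting $n = m$ gives $r_m^2 = R_0$. So $r_m$ is a constant (because $R_0$ is a fixed value); for simplicity set $r_m = 1$.
For the second condition, $\phi_m - \phi_n = \Phi_{m-n}$:
First set $n = 0$, which gives $\phi_m - \phi_0 = \Phi_m$. If we assume $\phi_0 = 0$ (without loss of generality, since angles are relative), then $\phi_m = \Phi_m$. Next set $n = m - 1$, which gives $\phi_m - \phi_{m-1} = \Phi_1$, where $\Phi_1$ is a fixed phase difference. Because $\Phi_{m-n}$ depends only on the relative position, $\Phi_1$ is a constant. So $\{\phi_m\}$ is an arithmetic sequence with common difference $\Phi_1$; that is, there is a constant $\theta$ (here $\theta = \Phi_1$) such that $\phi_m = m\theta$.

This gives one solution for the positional encoding when the hidden dimension is 2. Written with cos and sin via Euler's formula:
$p_m = e^{im\theta} \quad \Leftrightarrow \quad p_m = \begin{pmatrix} \cos(m\theta) \\ \sin(m\theta) \end{pmatrix}$
When the hidden dimension of $Q$ is $d$, the encoding vector for $Q_m$ at position $m$ is $p_m=\begin{pmatrix} \cos(m\theta_0) \\ \sin(m\theta_0) \\ \cos(m\theta_1) \\ \sin(m\theta_1) \\ ... \\ \cos(m\theta_{d/2-1}) \\ \sin(m\theta_{d/2-1}) \end{pmatrix}$
Note that $d$ is the hidden dimension, while $m$ says that the vector is the $m$-th one in the sequence. In "Attention Is All You Need" the positional encoding is computed as follows, which corresponds to choosing $\theta_i = 10000^{-2i/d}$:
$\begin{cases} p_{k,2i} = \sin\left(\frac{k}{10000^{2i/d}}\right) \\ p_{k,2i+1} = \cos\left(\frac{k}{10000^{2i/d}}\right) \end{cases}$
Here $p_{k,2i}$ and $p_{k,2i+1}$ are components $2i$ and $2i+1$ of the encoding vector at position $k$, $k$ is the position index (the $m$ of the derivation above), $i$ indexes the sin/cos pairs ($i = 0, ..., d/2-1$), and $d$ is the total dimension of the vector. The encoding for each position is added to the input before the attention computation, so that the resulting $Q$ and $K$ carry position information.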
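The property $\langle p_m, p_n\rangle = g(m-n)$ can be checked numerically. The sketch below builds the sin/cos vector from the formula above in plain Python; the function names `sinusoidal` and `dot` and the chosen positions are illustrative only.

```python
import math


def sinusoidal(k, d, base=10000.0):
    """Sinusoidal encoding of position k as (sin, cos) pairs; d must be even."""
    p = []
    for i in range(d // 2):
        angle = k / base ** (2 * i / d)
        p += [math.sin(angle), math.cos(angle)]
    return p


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


d = 16
# Three different (m, n) pairs that all have the same offset m - n = 3.
same_offset = [dot(sinusoidal(m, d), sinusoidal(m - 3, d)) for m in (3, 10, 57)]
print(same_offset)  # the three values agree up to floating-point error
```

Each inner product equals $\sum_i \cos((m-n)\theta_i)$, which is why only the offset matters.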

1.2 Code implementation

Looking through code, many earlier multimodal models use learned positional embeddings that are simply added to the features. There is a third-party PyTorch implementation of "Attention Is All You Need" that uses sinusoidal encoding and follows the formula above exactly. Here too the encoding is added: it is added to the embedded input x, from which q and k are later projected.

# https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py
import numpy as np
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):

    def __init__(self, d_hid, n_position=200):
        super(PositionalEncoding, self).__init__()

        # Not a parameter
        self.register_buffer('pos_table', self._get_sinusoid_encoding_table(n_position, d_hid))

    def _get_sinusoid_encoding_table(self, n_position, d_hid):
        ''' Sinusoid position encoding table '''
        # TODO: make it with torch instead of numpy

        def get_position_angle_vec(position):
            # hid_j // 2 makes dimensions 2i and 2i+1 share the same angle
            return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]

        sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
        sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
        sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1

        return torch.FloatTensor(sinusoid_table).unsqueeze(0)

    def forward(self, x):
        return x + self.pos_table[:, :x.size(1)].clone().detach()  # added directly

2. Rotary Positional Embedding (RoPE)

Before the operation the vectors carry absolute position information; after the operation the result carries relative position information.

2.1 Where RoPE comes from

RoPE is a positional encoding method proposed by Su Jianlin on his blog; for the background and proofs, see his blog. If some parts are hard to follow, you can again screenshot the blog and ask Tongyi Qianwen, which can read the image and give a very detailed explanation.
RoPE applies an absolute encoding, yet the inner product of $Q$ and $K$ ends up carrying the relative position of $Q$ and $K$. Multiply $q_m$ at position $m$ and $k_n$ at position $n$ by the absolute positional encodings $e^{im\theta}$ and $e^{in\theta}$, giving $q_me^{im\theta}$ and $k_ne^{in\theta}$. Taking the inner product then shows that the result contains relative information:
$\langle q_m e^{im\theta}, k_n e^{in\theta} \rangle = \operatorname{Re} \left[ (q_m e^{im\theta}) (k_n e^{in\theta})^* \right] = \operatorname{Re} \left[ q_m k_n^* e^{i(m-n)\theta} \right]$
In the simplest case, when the hidden dimension of $Q$ is $d = 2$, this operation rotates the vector $q_m$ at position $m$:
$q_m e^{im\theta} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} q_m^0 \\ q_m^1 \end{pmatrix}$
In general, for the $Q$ vector $q_m$ at position $m$ (i.e. position_id = m), the rotary positional encoding is computed as:
$\begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \cos m\theta_0 \\ \cos m\theta_0 \\ \cos m\theta_1 \\ \cos m\theta_1 \\ \vdots \\ \cos m\theta_{d/2-1} \\ \cos m\theta_{d/2-1} \end{pmatrix} + \begin{pmatrix} -q_1 \\ q_0 \\ -q_3 \\ q_2 \\ \vdots \\ -q_{d-1} \\ q_{d-2} \end{pmatrix} \otimes \begin{pmatrix} \sin m\theta_0 \\ \sin m\theta_0 \\ \sin m\theta_1 \\ \sin m\theta_1 \\ \vdots \\ \sin m\theta_{d/2-1} \\ \sin m\theta_{d/2-1} \end{pmatrix}$
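The element-wise formula above can be written almost literally in NumPy. The following is a minimal sketch, not taken from any library: the function name `rope_1d`, the base 10000 for $\theta_i$ and the random test vectors are assumptions made for illustration. It also checks that the attention score depends only on the offset $m-n$.

```python
import numpy as np


def rope_1d(x, m, base=10000.0):
    """Rotate an even-length vector x at position m, pairing (x[2i], x[2i+1])."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # theta_0 ... theta_{d/2-1}
    cos = np.repeat(np.cos(m * theta), 2)       # cos(m*theta_0), cos(m*theta_0), ...
    sin = np.repeat(np.sin(m * theta), 2)       # sin(m*theta_0), sin(m*theta_0), ...
    x_rot = np.empty_like(x)
    x_rot[0::2] = -x[1::2]                      # (-x1, x0, -x3, x2, ...)
    x_rot[1::2] = x[0::2]
    return x * cos + x_rot * sin


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score_a = rope_1d(q, 7) @ rope_1d(k, 4)       # positions (7, 4), offset 3
score_b = rope_1d(q, 103) @ rope_1d(k, 100)   # positions (103, 100), offset 3
print(np.isclose(score_a, score_b))           # True
```

Because each pair is only rotated, the norm of the vector is unchanged; only its direction encodes the position.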

2.2 Code implementation

RoPE applies the positional encoding by multiplication, not by addition.

2.2.1 GPT-J style 1D-RoPE

The code for this part can be a headache. An implementation that follows the formula above literally would be easy to read, so start with that style, which is known as the GPT-J style. It can be found in Meta's official llama code. That code does not use the element-wise multiplication written above, though; it uses complex arithmetic. One complex number corresponds to 2 real numbers, so once $q$ is viewed as complex it has only $d/2$ entries, and it returns to $d$ dimensions when converted back to real numbers. Also note that $q_0$ and $q_1$ both use $m\theta_0$, so freqs_cis only needs half the dimension of $q$ and $k$.

# https://github.com/meta-llama/llama/blob/main/llama/model.py
from typing import Tuple

import torch


# Precompute the rotation angles for every position from 0 to the maximum length.
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    """
    Precompute the frequency tensor for complex exponentials (cis) with given dimensions.

    This function calculates a frequency tensor with complex exponentials using the given dimension 'dim'
    and the end index 'end'. The 'theta' parameter scales the frequencies.
    The returned tensor contains complex values in complex64 data type.

    Args:
        dim (int): Dimension of the frequency tensor.
        end (int): End index for precomputing frequencies.
        theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.

    Returns:
        torch.Tensor: Precomputed frequency tensor with complex exponentials.
    """
    # Dimensions 2i and 2i+1 share the same angle m*theta_i, hence arange(0, dim, 2).
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))  # theta_i
    t = torch.arange(end, device=freqs.device)  # end is the maximum length; t holds the positions m
    freqs = torch.outer(t, freqs).float()  # m * theta_i, shape [end, dim // 2]; one row per position
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex form: modulus 1, angle freqs
    return freqs_cis


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Apply rotary embeddings to input tensors using the given frequency tensor.

    This function applies rotary embeddings to the given query 'xq' and key 'xk' tensors using the provided
    frequency tensor 'freqs_cis'. The input tensors are reshaped as complex numbers, and the frequency tensor
    is reshaped for broadcasting compatibility. The resulting tensors contain rotary embeddings and are
    returned as real tensors.

    Args:
        xq (torch.Tensor): Query tensor to apply rotary embeddings.
        xk (torch.Tensor): Key tensor to apply rotary embeddings.
        freqs_cis (torch.Tensor): Precomputed frequency tensor for complex exponentials.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: Tuple of modified query tensor and key tensor with rotary embeddings.
    """
    # View q as complex numbers, two components per complex number, e.g. (q0, q1) -> qc_0.
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # reshape_for_broadcast is defined in the same source file (not shown here).
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)  # the complex multiplication
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)


# Excerpt from the forward pass of each attention block (not standalone):
xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
......
score = torch.matmul(xq, xk.transpose(2, 3))  # attention scores are computed right after the positional encoding

The code above uses complex multiplication, which may not be very intuitive. Consider the simplest case $d = 2$ with $q=(q_0,q_1)$ at position $m$; these two components are rotated by the angle $m\theta_0$.
First, view_as_complex inside apply_rotary_emb() combines $q_0$ and $q_1$ into a single complex number $q_c = q_0 + i \cdot q_1$.
The rotation factor in freqs_cis for this position and frequency is $e^{im\theta_0} = \cos(m\theta_0) + i\sin(m\theta_0)$, i.e. the pair [$\cos(m\theta_0)$, $\sin(m\theta_0)$]. Note that precompute_freqs_cis() also stores it in complex form, so the cos and sin become one complex number: freqs_cis[m][0] = $\cos(m\theta_0) + i \cdot \sin(m\theta_0)$.

Rotating $q_0$ and $q_1$ means performing the complex multiplication xq_ * freqs_cis:
$q_c e^{im\theta_0} = (q_0 + i \cdot q_1) \times (\cos(m\theta_0) + i \cdot \sin(m\theta_0))$

By the complex multiplication rule:

$(a + bi) \times (c + di) = (ac - bd) + i(ad + bc)$

the expression expands to:

$(q_0 \cdot \cos(m\theta_0) - q_1 \cdot \sin(m\theta_0)) + i(q_0 \cdot \sin(m\theta_0) + q_1 \cdot \cos(m\theta_0))$

So the rotated result is a new complex number with:

Real part: $q_0 \cdot \cos(m\theta_0) - q_1 \cdot \sin(m\theta_0)$
Imaginary part: $q_0 \cdot \sin(m\theta_0) + q_1 \cdot \cos(m\theta_0)$

This matches the result of the 1D-RoPE formula:

$(q_0 \cdot \cos(m\theta_0) - q_1 \cdot \sin(m\theta_0),\ q_0 \cdot \sin(m\theta_0) + q_1 \cdot \cos(m\theta_0))$
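The equivalence between the complex product and the 2×2 rotation can be confirmed with a few lines of standard-library Python. This is a sketch for one pair $(q_0, q_1)$; the function names and sample numbers are hypothetical.

```python
import cmath
import math


def rotate_complex(q0, q1, angle):
    """Rotate (q0, q1) by multiplying q0 + i*q1 with e^{i*angle}."""
    z = complex(q0, q1) * cmath.exp(1j * angle)
    return z.real, z.imag


def rotate_matrix(q0, q1, angle):
    """Rotate (q0, q1) with the explicit cos/sin rotation matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return q0 * c - q1 * s, q0 * s + q1 * c


m, theta_0 = 5, 0.1                       # position and frequency, chosen arbitrarily
a = rotate_complex(0.3, -1.2, m * theta_0)
b = rotate_matrix(0.3, -1.2, m * theta_0)
print(a, b)                               # the two pairs agree
```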

2.2.2 GPT-NeoX style 1D-RoPE

EleutherAI's official implementation and the llama code in Hugging Face transformers use a different style of rotary positional encoding that does not rely on complex arithmetic. Focus on forward, rotate_half and apply_rotary_pos_emb.

# https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
import torch
from torch import nn
from transformers import LlamaConfig
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS


class LlamaRotaryEmbedding(nn.Module):
    def __init__(self, config: LlamaConfig, device=None):
        super().__init__()
        # BC: "rope_type" was originally "type"
        if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
        else:
            self.rope_type = "default"
        self.max_seq_len_cached = config.max_position_embeddings
        self.original_max_seq_len = config.max_position_embeddings

        self.config = config
        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.original_inv_freq = self.inv_freq

    def _dynamic_frequency_update(self, position_ids, device):
        """
        dynamic RoPE layers should recompute `inv_freq` in the following situations:
        1 - growing beyond the cached sequence length (allow scaling)
        2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
        """
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_seq_len_cached:  # growth
            inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
            self.max_seq_len_cached = seq_len

        if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
            # This .to() is needed if the model has been moved to a device after being initialized (because
            # the buffer is automatically moved, but not the original copy)
            self.original_inv_freq = self.original_inv_freq.to(device)
            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
            self.max_seq_len_cached = self.original_max_seq_len

    @torch.no_grad()
    def forward(self, x, position_ids):
        if "dynamic" in self.rope_type:
            self._dynamic_frequency_update(position_ids, device=x.device)

        # Core RoPE block
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()
        # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
        device_type = x.device.type
        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
            emb = torch.cat((freqs, freqs), dim=-1)
            cos = emb.cos()
            sin = emb.sin()

        # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
        cos = cos * self.attention_scaling
        sin = sin * self.attention_scaling

        # Closer to the original RoPE formula: no complex numbers, cos and sin are kept separate.
        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)  # looks like the cos and sin terms of the original formula
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

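At first glance something looks wrong. Without rotate_half the code would be easy to follow, but with rotate_half it no longer matches the original formula. Originally $(q_0,q_1)$ form a pair and give $(q_0 \cdot \cos(m\theta_0) - q_1 \cdot \sin(m\theta_0),\ q_0 \cdot \sin(m\theta_0) + q_1 \cdot \cos(m\theta_0))$. Now $q_0$ is paired with $q_{d/2}$ instead. And, surprisingly, a model trained with Meta's code can be loaded with the transformers code.
There is a GitHub issue that explains this; see that issue for details. The conclusion is that these are two different RoPE layouts, and applied to the same vector they do not give the same result. With some conversion code added, however, one layout can be made to match the other.
To load Meta's GPT-J style model with transformers, the $W_q$ and $W_k$ matrices first have to be converted, and there is a permute function dedicated to this.

# https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
# Excerpt: dim, loaded, layer_i and n_heads are defined in the enclosing conversion function.
def permute(w, n_heads, dim1=dim, dim2=dim):
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

f"model.layers.{layer_i}.self_attn.q_proj.weight": permute(
    loaded[f"layers.{layer_i}.attention.wq.weight"], n_heads=n_heads
)
........

The effect of this function is shown in a thread on the official forum. Comparing before and after carefully, the two really do match. I did not expect that the positions of the elements inside $W_q$ and $W_k$ would be rearranged.
(Figure: the effect of permute, from the forum thread.)

As for why this style is used instead of the GPT-J style, the maintainers' answer was that this style has lower overhead and, most importantly, that the GPT-J style implementation involves licensing issues.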
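To see why rearranging the weights makes the two styles agree, here is a minimal NumPy sketch. It is a simplification under stated assumptions: instead of permuting the rows of $W_q$ and $W_k$, it permutes a single head's $q$ and $k$ vectors directly (which is what the permuted weights would produce), and all function names are made up for illustration.

```python
import numpy as np


def rope_interleaved(x, m, theta):
    """GPT-J style: pairs (x[2i], x[2i+1]) are rotated by m * theta[i]."""
    cos = np.repeat(np.cos(m * theta), 2)
    sin = np.repeat(np.sin(m * theta), 2)
    x_rot = np.empty_like(x)
    x_rot[0::2] = -x[1::2]
    x_rot[1::2] = x[0::2]
    return x * cos + x_rot * sin


def rope_half(x, m, theta):
    """GPT-NeoX style: pairs (x[i], x[i + d/2]) are rotated by m * theta[i]."""
    half = x.shape[-1] // 2
    cos = np.concatenate([np.cos(m * theta)] * 2)
    sin = np.concatenate([np.sin(m * theta)] * 2)
    x_rot = np.concatenate([-x[half:], x[:half]])   # same idea as rotate_half
    return x * cos + x_rot * sin


def to_half_layout(x):
    """Move even-indexed components to the first half, odd-indexed to the second."""
    return np.concatenate([x[0::2], x[1::2]])


d = 8
theta = 10000.0 ** (-np.arange(0, d, 2) / d)
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

score_gptj = rope_interleaved(q, 5, theta) @ rope_interleaved(k, 2, theta)
score_neox = rope_half(to_half_layout(q), 5, theta) @ rope_half(to_half_layout(k), 2, theta)
print(np.isclose(score_gptj, score_neox))   # True
```

The rotated vectors are the same up to the same reordering of components, and since $q$ and $k$ are reordered identically, the attention scores are unchanged.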

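3. Two-dimensional rotary positional encoding (2D-RoPE)

In 1D-RoPE or 2D-RoPE, the dimension refers to how many axes of position information there are, i.e. the dimension of position_id. Text is a one-dimensional sequence, so there is only information along the $x$ axis: position_id = {0,1,2,3…}, which is the $m$ from before and says which position this $q$ is at. For an image, ViT splits it into patches, and each patch has to be identified by its row and column, so the form {w,h} is needed. For video there is also the time dimension (frame information), so a three-dimensional form {t,w,h} is needed.
Extending the 1D-RoPE above to more dimensions is simple and blunt: for X-D RoPE, split the $q$ and $k$ vectors evenly into X parts from start to end, and apply 1D-RoPE to each part with its own component of position_id. For 2D RoPE, for example, split $q$ into $q[0 : d/2]$ and $q[d/2 :]$, split $k$ into $k[0 : d/2]$ and $k[d/2 :]$, take position_id in the form (x, y), and let the first half compute $\cdot e^{ix\theta}$ and the second half compute $\cdot e^{iy\theta}$.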
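The split-in-two recipe can be sketched in NumPy as follows. This is an illustrative sketch, not code from any of the models discussed: `rope_1d` and `rope_2d` are hypothetical names, the vector length is assumed to be a multiple of 4, and each half pairs adjacent components.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """1D RoPE on an even-length vector, pairing (x[2i], x[2i+1])."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    cos = np.repeat(np.cos(pos * theta), 2)
    sin = np.repeat(np.sin(pos * theta), 2)
    x_rot = np.empty_like(x)
    x_rot[0::2] = -x[1::2]
    x_rot[1::2] = x[0::2]
    return x * cos + x_rot * sin


def rope_2d(x, pos_x, pos_y):
    """Axial 2D RoPE: the first half of x encodes pos_x, the second half pos_y."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[:half], pos_x), rope_1d(x[half:], pos_y)])


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Both pairs of positions differ by the same offset (1, 2).
score_a = rope_2d(q, 2, 5) @ rope_2d(k, 1, 3)
score_b = rope_2d(q, 12, 9) @ rope_2d(k, 11, 7)
print(np.isclose(score_a, score_b))   # True
```

The score depends only on the 2D offset between the two patches, which is the 2D analogue of the relative-position property of 1D-RoPE.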
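The hard part of 2D-RoPE is probably computing position_id, especially when the input is not only images but a mix of images and text. Su Jianlin analyzes the feasible schemes thoroughly in his blog post on multimodal positional encoding, which is worth reading carefully. One scheme he mentions: if the input is an image, give position_id in the form (x, y); if the input is text, set $x = y$.
As an example, take an image split into $w \times h$ patches ($w$ columns, $h$ rows). If the image comes first:

position_id	row 1	…	row h
x	1 … 1	…	h … h
y	1 … w	…	1 … w

If instead the input starts with a piece of text of length $L$, the sentence gets the position IDs $\{(1,1),(2,2),...,(L,L)\}$, and when the image above follows this text its position IDs become

position_id	row 1	…	row h
x	L+1 … L+1	…	L+h … L+h
y	L+1 … L+w	…	L+1 … L+w

Su Jianlin points out that this does not look perfect because it is not symmetric. The last token of the sentence has position ID $(L, L)$, and its difference from the first patch of the image, $(L + 1, L + 1)$, is $(1, 1)$. But if another sentence follows the image, the last token of the image has position ID $(L + h, L + w)$, and if $w \neq h$ the following sentence cannot also differ from it by $(1, 1)$, which is inelegant. The layout is symmetric only when $w = h$.
(Figure: illustration of the symmetric case $w = h$.)
For the more advanced positional encoding schemes that Su Jianlin discusses, see his blog, especially the post on multimodal positional encoding, as well as the readers' comments under it, which are all very valuable. At this point you should be ready to read the 2D-RoPE source code of multimodal models.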

3.1 The implementation in the ECCV 2D-RoPE paper

The paper "Rotary Position Embedding for Vision Transformer" implements 2D-RoPE in ViT. It has two versions, a mixed 2D-RoPE and an axial 2D-RoPE; here we mainly look at the axial version.

# https://github.com/naver-ai/rope-vit/blob/main/models/vit_rope.py
# Abridged excerpt: Attention, vit_models and reshape_for_broadcast come from the same repository.
from functools import partial

import torch
import torch.nn as nn


# Compute the position ids.
def init_t_xy(end_x: int, end_y: int):
    t = torch.arange(end_x * end_y, dtype=torch.float32)
    t_x = (t % end_x).float()
    t_y = torch.div(t, end_x, rounding_mode='floor').float()
    return t_x, t_y


# Compute the RoPE angles m * theta.
def compute_axial_cis(dim: int, end_x: int, end_y: int, theta: float = 100.0):
    freqs_x = 1.0 / (theta ** (torch.arange(0, dim, 4)[: (dim // 4)].float() / dim))
    freqs_y = 1.0 / (theta ** (torch.arange(0, dim, 4)[: (dim // 4)].float() / dim))

    t_x, t_y = init_t_xy(end_x, end_y)
    freqs_x = torch.outer(t_x, freqs_x)
    freqs_y = torch.outer(t_y, freqs_y)
    freqs_cis_x = torch.polar(torch.ones_like(freqs_x), freqs_x)
    freqs_cis_y = torch.polar(torch.ones_like(freqs_y), freqs_y)
    return torch.cat([freqs_cis_x, freqs_cis_y], dim=-1)


def apply_rotary_emb(xq: torch.Tensor, xk: torch.Tensor, freqs_cis: torch.Tensor):
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq).to(xq.device), xk_out.type_as(xk).to(xk.device)


class RoPEAttention(Attention):
    """Multi-head Attention block with rotary position embeddings."""

    def forward(self, x, freqs_cis):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # The first token is skipped because x[0] in self-attention is the [CLS] token.
        q[:, :, 1:], k[:, :, 1:] = apply_rotary_emb(q[:, :, 1:], k[:, :, 1:], freqs_cis=freqs_cis)

        attn = (q * self.scale) @ k.transpose(-2, -1)
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x


# The attention block of this model is the RoPEAttention above.
# Abridged: embed_dim, num_heads, img_size, patch_size, H and W come from the full source.
class rope_vit_models(vit_models):
    def __init__(self, rope_theta=100.0, rope_mixed=False, use_ape=False, **kwargs):
        super().__init__(**kwargs)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.compute_cis = partial(compute_axial_cis, dim=embed_dim // num_heads, theta=rope_theta)

    def forward_features(self, x):
        freqs_cis = self.compute_cis(end_x=img_size // patch_size, end_y=img_size // patch_size)
        self.freqs_cis = freqs_cis
        if self.freqs_cis.shape[0] != x.shape[1] - 1:
            freqs_cis = self.compute_cis(end_x=W // self.patch_size, end_y=H // self.patch_size)
        else:
            freqs_cis = self.freqs_cis
        freqs_cis = freqs_cis.to(x.device)

        for i, blk in enumerate(self.blocks):
            x = blk(x, freqs_cis=freqs_cis)
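To make the position ids produced by init_t_xy concrete, the same arithmetic can be reproduced in plain Python for a small grid. This is only a sketch of the indexing; `grid_positions` is a hypothetical name and it uses integers instead of float tensors.

```python
def grid_positions(end_x, end_y):
    """Row-major (x, y) coordinates of an end_x-by-end_y patch grid."""
    t_x = [t % end_x for t in range(end_x * end_y)]    # column index of each patch
    t_y = [t // end_x for t in range(end_x * end_y)]   # row index of each patch
    return t_x, t_y


t_x, t_y = grid_positions(end_x=3, end_y=2)
print(t_x)  # [0, 1, 2, 0, 1, 2]
print(t_y)  # [0, 0, 0, 1, 1, 1]
```

Patch number t in the flattened sequence therefore gets the 2D position (t_x[t], t_y[t]), and each coordinate drives its own part of the rotation frequencies.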

3.2 The qwen2-vl implementation

One thing to make clear first: the qwen2-vl paper mainly emphasizes the advantages of M-RoPE, but it also states that qwen2-vl retrained the ViT, and that the ViT uses 2D-RoPE. 2D-RoPE is used only inside the ViT: the image is first turned into patches by a convolution layer, then goes through the ViT blocks, and 2D-RoPE is applied inside those blocks. After the ViT, the image vectors are filled into the slots reserved in the text sequence, and M-RoPE is applied only to this mixed image-text sequence.
The positional encoding in qwen2-vl follows the GPT-NeoX style, so it has a rotate_half() function. First look at how the angles are generated and at the final multiplication $\cdot e^{i \theta}$. While reading this part you may wonder where the position_id code went; we will get to it afterwards.

# modeling_qwen2_vl.py
# The relevant function is apply_rotary_pos_emb_vision; q is used as the example.
# Its inputs are q and the positional information rotary_pos_emb.
import math

import torch
from torch import nn


# 1. rotary_pos_emb holds the angles m * theta.
class VisionRotaryEmbedding(nn.Module):
    def __init__(self, dim: int, theta: float = 10000.0) -> None:
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, seqlen: int) -> torch.Tensor:
        seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        return freqs


# Excerpt from the vision transformer's __init__ (not standalone):
head_dim = config.embed_dim // config.num_heads
self.rotary_pos_emb = VisionRotaryEmbedding(head_dim // 2)  # 2D RoPE, so each axis only needs half


# Copied from transformers.models.llama.modeling_llama.rotate_half
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb_vision(tensor: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    orig_dtype = tensor.dtype
    tensor = tensor.float()
    cos = freqs.cos()
    sin = freqs.sin()
    # unsqueeze(1) adds a dimension at index 1, giving [seqlen, 1, head_dim // 2].
    # repeat(1, 1, 2) leaves dims 0 and 1 alone and tiles dim 2 twice, giving
    # [seqlen, 1, head_dim], e.g. [a0, a1, a2, a3] -> [a0, a1, a2, a3, a0, a1, a2, a3].
    cos = cos.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
    sin = sin.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
    output = (tensor * cos) + (rotate_half(tensor) * sin)  # positional encoding: q * e^{i*theta}
    output = output.to(orig_dtype)
    return output


# 2. Inside a ViT block, position information is applied to q and k.
class VisionAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 16) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)

    def forward(
        self, hidden_states: torch.Tensor, cu_seqlens: torch.Tensor, rotary_pos_emb: torch.Tensor = None
    ) -> torch.Tensor:
        seq_length = hidden_states.shape[0]
        q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
        q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
        k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)

        attention_mask = torch.full(
            [1, seq_length, seq_length], torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype
        )
        for i in range(1, len(cu_seqlens)):
            attention_mask[..., cu_seqlens[i - 1] : cu_seqlens[i], cu_seqlens[i - 1] : cu_seqlens[i]] = 0

        q = q.transpose(0, 1)
        k = k.transpose(0, 1)
        v = v.transpose(0, 1)
        attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
        attn_weights = attn_weights + attention_mask
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
        attn_output = torch.matmul(attn_weights, v)
        attn_output = attn_output.transpose(0, 1)
        attn_output = attn_output.reshape(seq_length, -1)
        attn_output = self.proj(attn_output)
        return attn_output

The position_id code is defined in the qwen2-vl vision model. It also contains a permute step. If you skip the permute and print pos_ids, you get the very regular sequence $(0, 0), (0, 1), \ldots, (0, w - 1), (1, 0), \ldots, (h - 1, w - 1)$. The encoding is then applied by multiplication, as above.

# Abridged excerpt: cu_seqlens and other details are omitted.
class Qwen2VisionTransformerPretrainedModel(Qwen2VLPreTrainedModel):
    def __init__(self, config) -> None:
        self.spatial_merge_size = config.spatial_merge_size
        head_dim = config.embed_dim // config.num_heads
        self.rotary_pos_emb = VisionRotaryEmbedding(head_dim // 2)

    # grid_thw has the form [[t, h, w]]; for a single image t = 1.
    def rot_pos_emb(self, grid_thw):
        pos_ids = []
        for t, h, w in grid_thw:
            hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
            hpos_ids = hpos_ids.reshape(
                h // self.spatial_merge_size,
                self.spatial_merge_size,
                w // self.spatial_merge_size,
                self.spatial_merge_size,
            )
            hpos_ids = hpos_ids.permute(0, 2, 1, 3)  # try printing the result without this permute
            hpos_ids = hpos_ids.flatten()

            wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
            wpos_ids = wpos_ids.reshape(
                h // self.spatial_merge_size,
                self.spatial_merge_size,
                w // self.spatial_merge_size,
                self.spatial_merge_size,
            )
            wpos_ids = wpos_ids.permute(0, 2, 1, 3)
            wpos_ids = wpos_ids.flatten()
            pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))  # (h, w) pairs, repeated t times
        pos_ids = torch.cat(pos_ids, dim=0)
        max_grid_size = grid_thw[:, 1:].max()
        rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
        rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)  # look up the angles for each patch's (h, w) position
        return rotary_pos_emb

    def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.Tensor:
        hidden_states = self.patch_embed(hidden_states)
        rotary_pos_emb = self.rot_pos_emb(grid_thw)
        for blk in self.blocks:
            hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
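The effect of the permute step in rot_pos_emb is easier to see on a tiny grid. The sketch below mirrors the reshape/permute arithmetic in NumPy rather than torch; the function name `vision_pos_ids` and the 4×4 grid with a merge size of 2 are assumptions for illustration.

```python
import numpy as np


def vision_pos_ids(h, w, merge=2, permute=True):
    """(h, w) position id of every patch, optionally grouped into merge x merge blocks."""
    hpos = np.broadcast_to(np.arange(h)[:, None], (h, w))   # row index of each patch
    wpos = np.broadcast_to(np.arange(w)[None, :], (h, w))   # column index of each patch
    out = []
    for ids in (hpos, wpos):
        ids = ids.reshape(h // merge, merge, w // merge, merge)
        if permute:
            ids = ids.transpose(0, 2, 1, 3)   # bring the patches of one block together
        out.append(ids.reshape(-1))
    return np.stack(out, axis=-1)


plain = vision_pos_ids(4, 4, permute=False)
merged = vision_pos_ids(4, 4, permute=True)
print(plain[:5].tolist())    # [[0, 0], [0, 1], [0, 2], [0, 3], [1, 0]]
print(merged[:8].tolist())   # [[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [0, 3], [1, 2], [1, 3]]
```

Without the permute the ids are in plain row-major order. With it, the four patches of each 2×2 block become adjacent in the sequence, while every patch still keeps its true (h, w) coordinate.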

4. M-RoPE, proposed by qwen2-vl

(Figure: the position_id layout of M-RoPE from the qwen2-vl paper.)
To be clear from the start, M-RoPE is a 3D-RoPE: the multiplication and so on work the same way as in 3D-RoPE, and the custom part is how position_id is computed. Recall the different schemes Su Jianlin discusses for multimodal input in the 2D-RoPE section, i.e. how exactly to arrange text and images. The qwen2-vl paper gives a figure showing the position_id layout: image patches are numbered in order, with one extra coordinate for time, so position_id is 3-dimensional, $(t, h, w)$. For text that follows an image, the starting position is the maximum over all dimensions of the preceding vision position_ids, plus 1.
In the code there are two places to look at. One is how this position_id is computed, which is defined in the get_rope_index function. The function is fairly long, but its docstring describes it: it implements the logic of the figure and computes the position_id of every position.

Each embedding sequence contains vision embedding and text embedding or just contains text embedding.

For pure text embedding sequence, the rotary position embedding has no difference with modern LLMs.
Examples:
    input_ids: [T T T T T], here T is for text.
    temporal position_ids: [0, 1, 2, 3, 4]
    height position_ids: [0, 1, 2, 3, 4]
    width position_ids: [0, 1, 2, 3, 4]

For vision and text embedding sequence, we calculate 3D rotary position embedding for vision part
and 1D rotary position embedding for text part.
Examples:
    Assume we have a video input with 3 temporal patches, 2 height patches and 2 width patches.
    input_ids: [V V V V V V V V V V V V T T T T T], here V is for vision.
    vision temporal position_ids: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
    vision height position_ids: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
    vision width position_ids: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
    text temporal position_ids: [3, 4, 5, 6, 7]
    text height position_ids: [3, 4, 5, 6, 7]
    text width position_ids: [3, 4, 5, 6, 7]
    Here we calculate the text start position_ids as the max vision position_ids plus 1.
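The example in the docstring can be reproduced with a short plain-Python sketch. It is a heavily simplified stand-in for get_rope_index, under stated assumptions: a single vision block at the start of the sequence followed by text, no batching and no spatial merging. The function name `mrope_position_ids` is hypothetical.

```python
def mrope_position_ids(t, h, w, num_text):
    """Position ids for one vision block of t*h*w patches followed by text tokens."""
    temporal, height, width = [], [], []
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                temporal.append(ti)
                height.append(hi)
                width.append(wi)
    # Text starts at (max vision position id) + 1 and uses the same id on all three axes.
    start = max(temporal + height + width) + 1
    text = list(range(start, start + num_text))
    return temporal + text, height + text, width + text


t_ids, h_ids, w_ids = mrope_position_ids(t=3, h=2, w=2, num_text=5)
print(t_ids)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 4, 5, 6, 7]
print(h_ids)  # [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 3, 4, 5, 6, 7]
print(w_ids)  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 4, 5, 6, 7]
```

For text tokens the three ids are equal, so on the text part the encoding behaves like ordinary 1D-RoPE.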

The other place is the M-RoPE function itself, which applies the 3D RoPE encoding by multiplying $q$ and $k$ with the cos and sin of the angles. It also takes an mrope_section argument, which comes from the rope_scaling entry in the config. I have not worked through that part yet and will come back to it later.

# Excerpt from the attention forward pass (not standalone):
query_states, key_states = apply_multimodal_rotary_pos_emb(
    query_states, key_states, cos, sin, self.rope_scaling["mrope_section"]
)


def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section, unsqueeze_dim=1):
    mrope_section = mrope_section * 2
    cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
        unsqueeze_dim
    )
    sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
        unsqueeze_dim
    )

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
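Reading the function, mrope_section appears to decide which channels of the head dimension take their angle from the temporal, height or width position id. The NumPy sketch below mimics only the split-and-select step. It assumes that the leading dimension of cos/sin has size 3 (one slice per axis t, h, w), and the tiny sizes (head_dim 8, section [1, 2, 1]) are made up for illustration rather than taken from a real config.

```python
import numpy as np


def select_mrope_axes(table, mrope_section):
    """table: [3, seq_len, head_dim], one slice per axis (t, h, w)."""
    sections = list(mrope_section) * 2               # same doubling as in the original function
    split_points = np.cumsum(sections)[:-1]
    chunks = np.split(table, split_points, axis=-1)  # chunk sizes follow `sections`
    # Chunk i takes its values from axis i % 3 (0 = t, 1 = h, 2 = w).
    return np.concatenate([chunk[i % 3] for i, chunk in enumerate(chunks)], axis=-1)


head_dim, seq_len = 8, 2
# Fill each axis slice with its own index so the selection is visible in the output.
table = np.stack([np.full((seq_len, head_dim), axis) for axis in range(3)])
picked = select_mrope_axes(table, mrope_section=[1, 2, 1])
print(picked[0].tolist())  # [0, 1, 1, 2, 0, 1, 1, 2]
```

In this toy setup, channels 0 and 4 follow the temporal id, channels 1, 2, 5, 6 the height id, and channels 3 and 7 the width id. Since rotate_half pairs channel i with channel i + head_dim/2, both members of a pair always use the same axis.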

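According to the paper, M-RoPE gives qwen2-vl quite good length extrapolation.
(Figure: extrapolation results with M-RoPE from the qwen2-vl paper.)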

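5. Positional encoding in qwen2.5-vl

In my recent tests qwen2.5-vl performed worse than qwen2-vl, possibly because it uses window attention (the qwen-vl paper mentions that window attention performs worse than global attention), although qwen2.5 can handle a wider range of tasks than qwen2. If you want to use qwen2.5, remember to install transformers like this: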

pip install git+https://github.com/huggingface/transformers.git@9985d06add07a4cc691dc54a7e34f54205c04d40

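As far as I can tell, the main difference between the positional encoding of qwen2.5-vl and that of qwen2-vl lies in position_id, specifically in how the time_id dimension is computed. In qwen2.5-vl, different sampling rates map to different time_ids, so that time_id is aligned with absolute time.
(Figure: temporal position ids aligned to absolute time in qwen2.5-vl.)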

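6. Good references

1. Su Jianlin's "Transformer升级之路" (The Road to Upgrading the Transformer) series. Search the "归档" (Archive) page of his site and read it post by post, for example:

Transformer升级之路：2、博采众长的旋转式位置编码
Transformer升级之路：4、二维位置的旋转式位置编码
“闭门造车”之多模态思路浅谈（三）：位置编码
Transformer升级之路：17、多模态位置编码的简单思考

2. EleutherAI's blog: https://blog.eleuther.ai/rotary-embeddings/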

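7. TODO

Tianchi has recently launched a new LLM competition.
Many reinforcement learning tutorials are hard to follow. I found that the Mushroom Book (蘑菇书, EasyRL) is very good, though I have not read the code yet: https://datawhalechina.github.io/easy-rl/#/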