[CVPR 2025] Neuro-3D: Towards 3D Visual Decoding from EEG Signals

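Paper link: Neuro-3D: Towards 3D Visual Decoding from EEG Signals

Code: GitHub - gzq17/neuro-3D

The English here is typed entirely by hand! It is a summary and paraphrase of the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution.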

1. Thoughts

2. Close reading of the paper, section by section

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. 2D Visual Decoding from Brain Activity

2.3.2. 3D Reconstruction from fMRI

2.3.3. Diffusion Models

2.4. EEG-3D Dataset

2.4.1. Participants

2.4.2. Stimuli

2.4.3. Data Acquisition and Preprocessing

2.4.4. Dataset Attributes

2.5. Method

2.5.1. Overview

2.5.2. Dynamic-Static EEG-fusion Encoder

2.5.3. Colored Point Cloud Decoder

2.6. Experiments

2.6.1. Experimental Setup

2.6.2. Classification Task

2.6.3. 3D Reconstruction Task

2.6.4. Analysis of Brain Regions

2.7. Discussion and Conclusion

3. Reference

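1. Thoughts

（1）I have not read CVPR papers for a long time, mainly because the vision community had little that relates to brain imaging; recently, brain-signal decoding has given the topic a lift again

（2）Color/object classification and reconstruction are handled separately

2. Close reading of the paper, section by section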

2.1. Abstract

①They are the first to propose the EEG-3D dataset, which contains EEG signals recorded from 12 subjects while they viewed 72 categories of 3D objects, presented as both videos and images

②They correspondingly propose the Neuro-3D model to decode brain signals and reconstruct 3D objects

2.2. Introduction

①Process of 3D object reconstruction:

②Source of 3D objects: a subset of the Objaverse dataset, with 360-degree rotating videos

2.3. Related Work

2.3.1. 2D Visual Decoding from Brain Activity

①The authors list several EEG-based methods for 2D visual perception

2.3.2. 3D Reconstruction from fMRI

①Existing work uses fMRI signals, which have low temporal resolution and are not portable

②Existing work focuses only on shape reconstruction

2.3.3. Diffusion Models

①3D diffusion models can be used for image reconstruction

2.4. EEG-3D Dataset

2.4.1. Participants

①Subjects: 12 (5 male, 7 female)

②Vision of subjects: normal or corrected-to-normal

2.4.2. Stimuli

①All objects come from the Objaverse dataset

②Object categories: 72, with 10 objects per category

③Data split: 8 objects for training and 2 for testing in each category

④Color classes: 6

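⑤Visual stimuli: “Following the procedure in Zero-123, a camera is simulated in Blender that captures a 360-degree view of each object through incremental rotation, yielding 180 high-resolution images (1024×1024 pixels). Objects are tilted at an optimal angle to provide a comprehensive view.”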

⑥Experimental paradigm:

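“Rotating 3D object videos provide multi-perspective views that capture the overall appearance of 3D objects. However, the long duration of such videos, together with factors such as eye movements, blink artifacts, task load and lapses of attention, usually leads to EEG signals with a low signal-to-noise ratio. In contrast, static image stimuli provide single-perspective but more stable information, which can complement dynamic EEG signals by mitigating their noise. We therefore collected EEG signals for both dynamic video and static image stimuli. The stimulus presentation paradigm is shown in Fig. 2. Specifically, the multi-view images are compiled into a 6-second video at 30 Hz. Each object stimulus block consists of an 8-second sequence of events: a 0.5-second static image stimulus at the beginning and at the end, a 6-second rotating video, and a brief blank-screen transition between segments. In each experimental session, one 3D object is randomly selected from each category, and a 1-second fixation cross is shown between object blocks to direct participants' attention. Participants manually initiate each new object presentation. Training-set objects have 2 measurement repetitions and test-set objects have 4, giving 24 sessions in total. Participants rest for 2-3 minutes between sessions. Following an established protocol [16], 5 minutes of resting-state data are recorded at the beginning and end of all sessions to support further analysis. The total experiment time for each participant is about 5.5 hours, split across two acquisitions.”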
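The session count quoted in the paradigm can be checked with a little arithmetic. The sketch below assumes (my reading, not stated explicitly in the notes) that every session shows exactly one object per category, so the number of sessions equals the number of (object, repetition) slots within one category. Variable names are illustrative.

```python
# Counts taken from the notes above; variable names are illustrative.
n_categories = 72
train_objects_per_category = 8
test_objects_per_category = 2
train_repetitions = 2
test_repetitions = 4

# Assumption: each session shows one object per category, so the number of
# sessions equals the number of (object, repetition) slots in one category.
n_sessions = (train_objects_per_category * train_repetitions
              + test_objects_per_category * test_repetitions)
blocks_per_session = n_categories
total_blocks = n_sessions * blocks_per_session

print(n_sessions, blocks_per_session, total_blocks)
```

Under that assumption the 8 × 2 + 2 × 4 slots reproduce the 24 sessions mentioned in the paradigm.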

2.4.3. Data Acquisition and Preprocessing

①Screen resolution: 1920 × 1080

②Sitting posture and distance: participants sat about 95 cm from the screen, so that the stimulus occupied a visual angle of about 8.4 degrees
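As a rough sanity check, the on-screen stimulus size implied by these two numbers can be estimated from the usual visual-angle relation, size = 2 · d · tan(θ/2). This is only a sketch: it assumes the stimulus is centered on the line of sight and that the quoted angle spans its full extent, neither of which is stated in the notes.

```python
import math

viewing_distance_cm = 95.0  # from the notes
visual_angle_deg = 8.4      # from the notes

# Assumption: centered stimulus, angle spans the full stimulus extent.
half_angle_rad = math.radians(visual_angle_deg / 2)
stimulus_size_cm = 2 * viewing_distance_cm * math.tan(half_angle_rad)

print(round(stimulus_size_cm, 2))
```

Under these assumptions the stimulus would be roughly 14 cm across.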

2.4.4. Dataset Attributes

①Comparison with other datasets:

In the comparison, brain activity is categorized as resting-state (Re), responses to static stimuli (St) and responses to dynamic stimuli (Dy). The analysis data includes images (Img), videos (Vid), text captions (Text), 3D shape (3D (S)) and color attributes (3D (C))

2.5. Method

2.5.1. Overview

①Overall framework:

2.5.2. Dynamic-Static EEG-fusion Encoder

①Pre-processed static and dynamic EEG signals: $e_{\mathrm{s}}\in\mathbb{R}^{C\times T_{\mathrm{s}}}$ and $e_{\mathrm{d}}\in\mathbb{R}^{C\times T_{\mathrm{d}}}$

②Two embedders: $z_{\mathrm{s}}=E_{\mathrm{s}}(e_{\mathrm{s}}),z_{\mathrm{d}}=E_{\mathrm{d}}(e_{\mathrm{d}})$ , each with multiple temporal self-attention layers and MLP mapping layers

③Complementary attention-based neural aggregator:

$Q=W^{Q}z_{\mathrm{s}},K=W^{K}z_{\mathrm{d}},V=W^{V}z_{\mathrm{d}}$

$z_{\mathrm{sd}}=\mathrm{Softmax}(\frac{QK^{T}}{\sqrt{d}})\cdot V$

where $z_{\mathrm{sd}}$ denotes the aggregated EEG feature
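The aggregator above is a standard cross-attention in which the static feature provides the queries and the dynamic feature provides the keys and values. A minimal NumPy sketch follows; the token counts, feature sizes and random weights are made up for illustration, and tokens are stored as rows, so the projections are written `z @ W` rather than `W z`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(z_s, z_d, W_q, W_k, W_v):
    """Cross-attention: queries from static tokens, keys/values from dynamic tokens."""
    Q = z_s @ W_q                                   # (n_s, d)
    K = z_d @ W_k                                   # (n_d, d)
    V = z_d @ W_v                                   # (n_d, d)
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (n_s, n_d)
    return attn @ V, attn

# Illustrative shapes only (not the paper's settings).
rng = np.random.default_rng(0)
n_s, n_d, d_in, d = 4, 10, 16, 8
z_s = rng.standard_normal((n_s, d_in))
z_d = rng.standard_normal((n_d, d_in))
W_q, W_k, W_v = (rng.standard_normal((d_in, d)) for _ in range(3))

z_sd, attn = aggregate(z_s, z_d, W_q, W_k, W_v)
print(z_sd.shape)
```

Each row of `attn` sums to 1, so every static token receives a convex combination of the dynamic value vectors.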

2.5.3. Colored Point Cloud Decoder

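①Decouple $z_{\mathrm{sd}}$ into geometry and appearance features $f_g$ and $f_a$ with linear layers (the rightmost block in the figure above. This feels like a bit of a stretch to me: it is not even convolutional feature extraction, just a linear layer. Also, aren't geometry and appearance rather similar things? Why didn't the authors simply call them shape and color, given that shape and color are what they ultimately predict?)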

②Align $f_g$ and $f_a$ with the encoded video using contrastive and MSE losses:

$L_{\mathrm{align}}(f,f_{\mathrm{v}})=\alpha\,\mathrm{CLIP}(f,f_{\mathrm{v}})+(1-\alpha)\,\mathrm{MSE}(f,f_{\mathrm{v}}),\quad f_{\mathrm{v}}=\sum_{i=1}^{n}E_{\mathrm{v}}(v_{i})/n,$

where $f$ denotes $f_g$ or $f_a$ , and $\{v_i\}_{i=1}^n$ denotes the downsampled video sequence
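To make the alignment loss concrete, here is a minimal NumPy sketch. The exact form of the CLIP term is not given in these notes, so the sketch assumes a symmetric InfoNCE loss over a batch with a temperature of 0.07; both choices are assumptions. The per-frame video embeddings are random stand-ins for $E_{\mathrm{v}}(v_i)$.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(f, f_v, temperature=0.07):
    # Assumed form: symmetric InfoNCE on L2-normalized features.
    f_n = f / np.linalg.norm(f, axis=1, keepdims=True)
    v_n = f_v / np.linalg.norm(f_v, axis=1, keepdims=True)
    logits = f_n @ v_n.T / temperature            # (B, B), matches on the diagonal
    idx = np.arange(len(f))
    eeg_to_video = -log_softmax(logits, axis=1)[idx, idx].mean()
    video_to_eeg = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (eeg_to_video + video_to_eeg)

def align_loss(f, frame_feats, alpha=0.01):
    # frame_feats: (B, n, D) per-frame video embeddings; f_v is their mean over n.
    f_v = frame_feats.mean(axis=1)
    mse = np.mean((f - f_v) ** 2)
    return alpha * clip_loss(f, f_v) + (1 - alpha) * mse

rng = np.random.default_rng(0)
B, n, D = 8, 4, 32                                # illustrative sizes
frame_feats = rng.standard_normal((B, n, D))      # stand-in for E_v(v_i)
f = rng.standard_normal((B, D))                   # stand-in for f_g or f_a
loss = align_loss(f, frame_feats)
print(float(loss))
```

With a small $\alpha$ the MSE term dominates, so the contrastive term mostly acts as a regularizer on the relative ordering within a batch.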

③Cross-entropy (CE) loss for color and shape prediction:

$L_{\mathrm{c}}=\mathrm{CE}(\hat{y}_{\mathrm{g}},y_{\mathrm{g}})+\mathrm{CE}(\hat{y}_{\mathrm{a}},y_{\mathrm{a}})$

where $\hat{y}_g$ and $\hat{y}_a$ are the predicted shape and color, respectively, and $y_g$ and $y_a$ are the ground truths

④Total loss:

$L=L_{\mathrm{align}}(f_{\mathrm{g}},f_{\mathrm{v}})+L_{\mathrm{align}}(f_{\mathrm{a}},f_{\mathrm{v}})+\gamma L_{\mathrm{c}}$
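A small sketch of how the classification term and the total loss combine. The two alignment losses are passed in as plain numbers, and the logits are placeholders; the class counts (72 shape categories, 6 color classes) come from the dataset description in these notes, and the default $\gamma$ is the value listed in the experimental setup.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Mean cross-entropy over a batch of integer class labels.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(align_g, align_a, shape_logits, y_g, color_logits, y_a, gamma=0.1):
    l_c = cross_entropy(shape_logits, y_g) + cross_entropy(color_logits, y_a)
    return align_g + align_a + gamma * l_c

# Placeholder batch: uniform (all-zero) logits for 72 shape and 6 color classes.
batch = 5
shape_logits = np.zeros((batch, 72))
color_logits = np.zeros((batch, 6))
y_g = np.array([0, 10, 20, 30, 71])
y_a = np.array([0, 1, 2, 3, 5])

loss = total_loss(0.0, 0.0, shape_logits, y_g, color_logits, y_a)
print(float(loss))
```

With uniform predictions and zero alignment losses, the result reduces to $\gamma(\ln 72+\ln 6)$, which is a handy baseline when checking an implementation.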

⑤Forward diffusion of the point cloud $X_0\in\mathbb{R}^{N\times3}$, with the noise variance scheduled by hyperparameters $\{\beta_{t}\}_{t=0}^{T}$ (reconstruction is not my area, so this part is carried over from the paper):

$q(X_{t}\mid X_{t-1})=\mathcal{N}\left(\sqrt{1-\beta_{t}}X_{t-1},\beta_{t}\mathbf{I}\right)$
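A minimal NumPy sketch of this forward noising step. The linear $\beta$ schedule and $T=1000$ are assumptions borrowed from common DDPM practice, not values from these notes, and the point cloud is a random stand-in.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    # Sample X_t ~ N(sqrt(1 - beta_t) * X_{t-1}, beta_t * I).
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
N, T = 8192, 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
x = rng.uniform(-1, 1, size=(N, 3))     # random stand-in for the point cloud X_0

for beta_t in betas:
    x = forward_step(x, beta_t, rng)

print(float(x.mean()), float(x.std()))
```

After enough steps the coordinates are close to a standard Gaussian, which is the state $X_T$ that the reverse process starts from.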

⑥The transition from the Gaussian state $X_T$ back to the initial point cloud $X_0$ can be represented as:

$p_{\theta}(X_{t-1}\mid X_{t},f_{\mathrm{g}})=\mathcal{N}\left(\mu_{\theta}(X_{t},t,f_{\mathrm{g}}),\sigma_{t}^{2}\mathbf{I}\right)$

$p_{\theta}(X_{0:T})=p(X_{T})\prod_{t=1}^{T}p_{\theta}(X_{t-1}\mid X_{t},f_{\mathrm{g}})$

where the parameterized network $\mu_\theta$ is a learnable model that iteratively predicts the reverse diffusion steps

⑦Reconstruction loss:

$\mathcal{L}_{t}=\mathbb{E}_{X_{0}\sim q(X_{0})}\mathbb{E}_{\epsilon_{t}\sim\mathcal{N}(0,\mathbf{I})}\|\epsilon_{t}-\mu_{\theta}\left(X_{t},t,f_{\mathrm{g}}\right)\|^{2}$
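A one-sample Monte Carlo estimate of this loss can be sketched as follows. It relies on the standard DDPM closed form $X_t=\sqrt{\bar\alpha_t}X_0+\sqrt{1-\bar\alpha_t}\,\epsilon_t$ with $\bar\alpha_t=\prod_{s\le t}(1-\beta_s)$. The linear schedule is an assumption, and `zero_net` is a hypothetical placeholder for the conditioned network written $\mu_\theta$ in the formula, not the paper's model.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, eps):
    # Closed-form sample of X_t given X_0 (standard DDPM identity).
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def diffusion_loss(net, x0, f_g, alpha_bar, rng):
    t = int(rng.integers(0, len(alpha_bar)))   # random timestep
    eps = rng.standard_normal(x0.shape)        # target noise
    x_t = q_sample(x0, t, alpha_bar, eps)
    pred = net(x_t, t, f_g)
    # Squared error per point, averaged over the N points (a rescaled norm).
    return np.mean(np.sum((eps - pred) ** 2, axis=-1))

def zero_net(x_t, t, f_g):
    # Hypothetical placeholder network: always predicts zero noise.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)          # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.uniform(-1, 1, size=(8192, 3))        # random stand-in point cloud
f_g = np.zeros(1024)                           # stand-in geometry feature

loss = diffusion_loss(zero_net, x0, f_g, alpha_bar, rng)
print(float(loss))
```

A network that always outputs zero gives a loss near 3 per point (the expected squared norm of a 3D standard Gaussian), so a trained model should land well below that.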

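⑧“Previous studies on point cloud generation have shown that jointly generating geometry and color information often leads to degraded performance and increased model complexity.” The authors therefore “learn a separate single-step coloring model $h_{\phi}$ to reconstruct the object's color in addition to its shape. The generated point cloud $\hat{X}_0$, conditioned on the appearance EEG feature $f_a$, is fed to the coloring model $h_{\phi}$ to estimate the color of the point cloud. Because EEG signals provide limited information, predicting a different color for every point of the 3D structure is a major challenge. As a first step toward solving this problem, we simplify the task by aggregating the color information from the ground-truth point cloud. Through a majority-voting mechanism, we select the dominant color to represent the whole object, which reduces the complexity of the color prediction process.”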
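The majority-voting step can be sketched as below. The notes do not say how per-point colors are discretized, so this sketch assumes each point is assigned to the nearest of six reference colors; the `PALETTE` entries are hypothetical and are not the paper's six color classes.

```python
import numpy as np

# Hypothetical six-color palette (RGB in [0, 1]); the actual classes are not given here.
PALETTE = {
    "red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1),
    "yellow": (1, 1, 0), "white": (1, 1, 1), "black": (0, 0, 0),
}

def dominant_color(point_colors):
    """Assign each point to its nearest palette color, then take a majority vote."""
    names = list(PALETTE)
    refs = np.array([PALETTE[n] for n in names], dtype=float)       # (6, 3)
    dists = np.linalg.norm(point_colors[:, None, :] - refs[None, :, :], axis=-1)
    votes = np.bincount(dists.argmin(axis=1), minlength=len(names))
    return names[int(votes.argmax())], votes

# Toy colored point cloud: 70 reddish points and 30 bluish points.
colors = np.vstack([
    np.tile([0.9, 0.1, 0.1], (70, 1)),
    np.tile([0.1, 0.1, 0.9], (30, 1)),
])
name, votes = dominant_color(colors)
print(name, votes.tolist())
```

The winning label then serves as the single color target for the whole object, instead of a per-point color.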

2.6. Experiments

2.6.1. Experimental Setup

①Optimizer: AdamW with $\beta=(0.95,0.999)$

②Initial learning rate: 1e-3

③Loss coefficients $\alpha=0.01$ and $\gamma=0.1$

④Dimension of $f_g$ or $f_a$ : 1024

⑤Points of point cloud: $N=8192$

⑥Downsampled video sequence: $n=4$ frames

⑦Diffusion model in colored point cloud decoder: Point-Voxel Network (PVN)

2.6.2. Classification Task

（1）Comparison with Related Methods

①Performance table:

（2）Ablation Study

①Module ablation:

2.6.3. 3D Reconstruction Task

（1）Quantitative Results

①Module ablation:

（2）Reconstructed Examples

①Reconstructions:

2.6.4. Analysis of Brain Regions

①“Saliency maps were generated for 3 subjects by removing each of the 64 electrode channels in turn” (the authors put it very bluntly!):
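The channel-removal analysis can be sketched as a simple occlusion loop. Here "removing" a channel is assumed to mean zeroing it, and `toy_score` is a hypothetical stand-in for the model's evaluation metric; neither detail is specified in these notes.

```python
import numpy as np

def channel_saliency(eeg, score_fn):
    """Saliency of channel c = score drop when channel c is zeroed out."""
    base = score_fn(eeg)
    saliency = np.zeros(eeg.shape[0])
    for c in range(eeg.shape[0]):
        occluded = eeg.copy()
        occluded[c] = 0.0                  # assumption: removal = zeroing the channel
        saliency[c] = base - score_fn(occluded)
    return saliency

def toy_score(x):
    # Hypothetical metric that only depends on channels 10 and 20.
    return np.abs(x[[10, 20]]).mean()

rng = np.random.default_rng(0)
eeg = rng.standard_normal((64, 250))       # 64 channels, illustrative length
saliency = channel_saliency(eeg, toy_score)
print(np.nonzero(saliency)[0].tolist())
```

Plotting `saliency` on the electrode layout would give a topographic map of the kind described; in this toy case only the two channels the score depends on light up.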

2.7. Discussion and Conclusion

①Texture reconstruction and cross-subject decoding deserve further exploration

3. Reference

@article{guo2024neuro,
title={Neuro-3D: Towards 3D Visual Decoding from EEG Signals},
author={Guo, Zhanqiang and Wu, Jiamin and Song, Yonghao and Mai, Weijian and Zheng, Qihao and Ouyang, Wanli and Song, Chunfeng},
journal={arXiv preprint arXiv:2411.12248},
year={2024}
}