[TPAMI 2024]Vision-Language Models for Vision Tasks: A Survey

devtools/2024/12/4 22:47:59/

Paper URL: Vision-Language Models for Vision Tasks: A Survey | IEEE Journals & Magazine | IEEE Xplore

Paper GitHub page: GitHub - jingyi0000/VLM_survey: Collection of AWESOME vision-language models for vision tasks

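The English here was typed entirely by hand! It summarizes and paraphrases the original paper. Some spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post is closer to personal notes, so read it with caution.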


1. Thoughts

2. Paragraph-by-Paragraph Close Reading

2.1. Abstract

2.2. Introduction

2.3. Background

2.3.1. Training Paradigms for Visual Recognition

2.3.2. Development of VLMs for Visual Recognition

2.3.3. Relevant Surveys

2.4. VLM Foundations

2.4.1. Network Architectures

2.4.2. VLM Pre-Training Objectives

2.4.3. VLM Pre-Training Frameworks

2.4.4. Evaluation Setups and Downstream Tasks

2.5. Datasets

2.5.1. Datasets for Pre-Training VLMs

2.5.2. Datasets for VLM Evaluation

2.6. Vision-Language Model Pre-Training

2.6.1. VLM Pre-Training With Contrastive Objectives

2.6.2. VLM Pre-Training With Generative Objectives

2.6.3. VLM Pre-Training With Alignment Objectives

2.6.4. Summary and Discussion

2.7. VLM Transfer Learning

2.7.1. Motivation of Transfer Learning

2.7.2. Common Setup of Transfer Learning

2.7.3. Common Transfer Learning Methods

2.7.4. Summary and Discussion

2.8. VLM Knowledge Distillation

2.8.1. Motivation of Distilling Knowledge From VLMs

2.8.2. Common Knowledge Distillation Methods

2.8.3. Summary and Discussion

2.9. Performance Comparison

2.9.1. Performance of VLM Pre-Training

2.9.2. Performance of VLM Transfer Learning

2.9.3. Performance of VLM Knowledge Distillation

2.9.4. Summary

2.10. Future Directions

2.11. Conclusion

3. Reference

1. Thoughts






2. Paragraph-by-Paragraph Close Reading

2.1. Abstract

        ①Existing problems: training a DNN for each visual task is laborious and time-consuming

        ②Content: a) background of VLMs in visual tasks, b) foundations of VLMs, c) datasets, d) pre-training, transfer learning and knowledge distillation methods for VLMs, e) benchmarks, f) challenges

laborious  adj. requiring much effort; arduous

2.2. Introduction

        ①New paradigm: Pre-training (on large-scale data w/ or w/o labels), Fine-tuning (on task-specific labelled training data), and Prediction, see (a) and (b):

        ②Vision-Language Model Pre-training and Zero-shot Prediction, which does not need fine-tuning:

        ③VLM publication number on Google Scholar:

frisbee  n. a flying disc used in throwing games

2.3. Background

2.3.1. Training Paradigms for Visual Recognition

(1)Traditional Machine Learning and Prediction

        ①Mostly hand-crafted and lightweight, but hard to cope with complex tasks or multiple tasks

        ②Poor scalability

(2)Deep Learning From Scratch and Prediction

        ①Slow convergence when training from scratch

        ②A large amount of labelled data is needed

(3)Supervised Pre-Training, Fine-Tuning and Prediction

        ①Speed up convergence

(4)Unsupervised Pre-Training, Fine-Tuning & Prediction

        ①Does not require labelled data

        ②Better performance from learning on a larger number of samples

(5)VLM Pre-Training and Zero-Shot Prediction

        ①Discarding fine-tuning

        ②Future directions: a) large scale informative image-text data, b) high-capacity models, c) new pre-training objectives

2.3.2. Development of VLMs for Visual Recognition

        ①3 improvements to VLMs:

2.3.3. Relevant Surveys

        ①Framework of their review:

2.4. VLM Foundations

2.4.1. Network Architectures

        ①Number of image-text pairs: N

        ②Dataset of image-text pairs: \mathcal{D}=\left \{ x^I_n, x^T_n\right \}^N_{n=1}, where x_n^I denotes an image sample and x_n^T denotes its paired text

        ③Image encoder and text encoder in DNN: f_\theta / f_\phi

        ④Encoding operation: z_n^I=f_\theta(x_n^I) and z_n^T=f_\phi(x_n^T)

(1)Architectures for Learning Image Features

        ①CNN-based architectures: such as VGG, ResNet and EfficientNet

        ②Transformer-based architectures: such as ViT

(2)Architectures for Learning Language Features

        ①The framework of the standard Transformer: 6 blocks in the encoder (each with a multi-head self-attention layer and an MLP) and 6 blocks in the decoder (each with a multi-head attention layer, a masked multi-head attention layer and an MLP)

2.4.2. VLM Pre-Training Objectives

(1)Contrastive Objectives

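        ①Image Contrastive Learning: pull the query close to its positive keys and far away from its negative keys in the embedding space. For B images (the authors phrase this in a particular way: they say "for a batch size of B", which is close to how code is written; conceptually, just read it as B samples in total), the loss is usually: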

\mathcal{L}_I^\mathrm{InfoNCE}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp{(z_i^I\cdot z_+^I/\tau)}}{\sum_{j=1,j\neq i}^{B+1}\exp(z_i^I\cdot z_j^I/\tau)}

where z_i^I denotes the query embedding, \{z_j^I\}_{j=1,j\neq i}^{B+1} denotes the key embeddings, z_+^I denotes the positive key of the i-th sample, and \tau denotes the temperature hyper-parameter

        ②Image-Text Contrastive Learning: pull paired embeddings close and push the others away:

\begin{gathered} \mathcal{L}_{I\to T} =-\frac1B\sum_{i=1}^B\log\frac{\exp{(z_i^I\cdot z_i^T/\tau)}}{\sum_{j=1}^B\exp(z_i^I\cdot z_j^T/\tau)} \\ \mathcal{L}_{T\to I} =-\frac1B\sum_{i=1}^B\log\frac{\exp{(z_i^T\cdot z_i^I/\tau)}}{\sum_{j=1}^B\exp(z_i^T\cdot z_j^I/\tau)}\\ \mathcal{L}_{\mathrm{infoNCE}}^{IT}=\mathcal{L}_{I\to T}+\mathcal{L}_{T\to I} \end{gathered}

where \mathcal{L}_{I\to T} denotes contrasting the query image with the text keys, \mathcal{L}_{T\to I} denotes contrasting the query text with image keys

        ③Image-Text-Label Contrastive Learning: supervised:

\begin{gathered} \mathcal{L}_{I\to T}^{ITL} =-\sum_{i=1}^B\frac1{|\mathcal{P}(i)|}\sum_{k\in\mathcal{P}(i)}\log\frac{\exp{(z_i^I\cdot z_k^T/\tau)}}{\sum_{j=1}^B\exp(z_i^I\cdot z_j^T/\tau)} \\ \mathcal{L}_{T\to I}^{ITL} =-\sum_{i=1}^B\frac1{|\mathcal{P}(i)|}\sum_{k\in\mathcal{P}(i)}\log\frac{\exp{(z_i^T\cdot z_k^I/\tau)}}{\sum_{j=1}^B\exp(z_i^T\cdot z_j^I/\tau)}\\ \mathcal{L}_{\mathrm{infoNCE}}^{ITL}=\mathcal{L}_{I\to T}^{ITL}+\mathcal{L}_{T\to I}^{ITL} \end{gathered}

where \mathcal{P}(i)=\{k \mid k\in B,y_k=y_i\} and y denotes the class label of (z^I,z^T) (this effectively adds one more loop over the samples that share the same class)
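The image-text and image-text-label losses above can be sketched in plain Python as follows. This is a minimal illustration, not the survey's or any library's implementation: embeddings are plain lists of floats, the helper names (`dot`, `log_softmax`, `similarity`) are made up for this note, and the toy inputs are hypothetical. The image-only InfoNCE loss follows the same pattern with image keys in place of text keys.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def similarity(queries, keys, tau):
    """Temperature-scaled dot products, one row per query."""
    return [[dot(q, k) / tau for k in keys] for q in queries]


def image_text_contrastive(img, txt, tau=0.07):
    """L_{I->T} + L_{T->I}: the i-th image and the i-th text are the positive pair."""
    B = len(img)
    i2t = similarity(img, txt, tau)
    t2i = similarity(txt, img, tau)
    loss_i2t = -sum(log_softmax(i2t[i])[i] for i in range(B)) / B
    loss_t2i = -sum(log_softmax(t2i[i])[i] for i in range(B)) / B
    return loss_i2t + loss_t2i


def image_text_label_contrastive(img, txt, labels, tau=0.07):
    """Label-aware variant: every k with labels[k] == labels[i] is a positive for i."""
    B = len(img)
    total = 0.0
    for sim in (similarity(img, txt, tau), similarity(txt, img, tau)):
        for i in range(B):
            positives = [k for k in range(B) if labels[k] == labels[i]]  # P(i)
            log_probs = log_softmax(sim[i])
            total += -sum(log_probs[k] for k in positives) / len(positives)
    return total


# Toy embeddings (hypothetical): two perfectly aligned image-text pairs.
img = [[1.0, 0.0], [0.0, 1.0]]
txt = [[1.0, 0.0], [0.0, 1.0]]
print(image_text_contrastive(img, txt, tau=1.0))
print(image_text_label_contrastive(img, txt, labels=[0, 0], tau=1.0))
```

Note that, following the formulas as written, the label-aware loss sums over the batch instead of averaging over B.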

(2)Generative Objectives

        ①Masked Image Modelling: learns cross-patch correlation by masking a set of patches and reconstructing images. The loss usually is:

\mathcal{L}_{MIM}=-\frac1B\sum_{i=1}^B\log f_\theta(\overline{x}_i^I\mid\hat{x}_i^I)

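where \overline{x}_i^I denotes masked patches, \hat{x}_i^I denotes unmasked patches (what is this "|"? A conditional probability? At first it did not seem to make sense to me: the probability of the masked patches given the unmasked ones? It felt backwards, or maybe I was the one confused. Reading it as the likelihood of reconstructing the masked patches given the visible ones does match the reconstruction task described above)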

        ②Masked Language Modelling: mask at a specific ratio:

\mathcal{L}_{MLM}=-\frac1B\sum_{i=1}^B\log f_\phi(\overline{x}_i^T\mid\hat{x}_i^T)

        ③Masked Cross-Modal Modelling: randomly masks a subset of image patches and a subset of text tokens, then reconstructs them from the unmasked ones:

\mathcal{L}_{MCM}=-\frac{1}{B}\sum_{i=1}^{B}[\log f_{\theta}(\overline{x}_{i}^{I}|\hat{x}_{i}^{I},\hat{x}_{i}^{T})+\log f_{\phi}(\overline{x}_{i}^{T}|\hat{x}_{i}^{I},\hat{x}_{i}^{T})]

        ④Image-to-Text Generation: predicts the paired text token by token, conditioned on the image embedding and the preceding tokens:

\mathcal{L}_{ITG}=-\sum_{l=1}^L \log f_\theta(x^T\mid x_{<l}^T,z^I)

where L denotes the number of tokens, z^I is the embedding of the image paired with x^T
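The generative objectives above share two ingredients: a random split into masked and visible items, and a negative log-likelihood over the targets. The sketch below shows both in plain Python. It assumes the likelihood factorizes over individual masked items or tokens, which the formulas above do not state explicitly; the function names and the probabilities passed in are hypothetical stand-ins for real model outputs.

```python
import math
import random


def random_mask(num_items, mask_ratio, seed=0):
    """Split indices 0..num_items-1 into (masked, visible) at the given ratio."""
    rng = random.Random(seed)
    num_masked = round(num_items * mask_ratio)
    masked = sorted(rng.sample(range(num_items), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_items) if i not in masked_set]
    return masked, visible


def negative_log_likelihood(target_probs):
    """-sum(log p) over the probabilities assigned to the true targets.

    With one probability per text token this has the shape of L_ITG.
    """
    return -sum(math.log(p) for p in target_probs)


def masked_modelling_loss(batch_target_probs):
    """Batch average of per-sample reconstruction NLL (the 1/B sum in L_MIM / L_MLM)."""
    per_sample = [negative_log_likelihood(p) for p in batch_target_probs]
    return sum(per_sample) / len(per_sample)


# Mask half of 8 patches (or tokens); the model would see only `visible`.
masked, visible = random_mask(num_items=8, mask_ratio=0.5)
print(masked, visible)
```

For masked cross-modal modelling, the same split would be applied to image patches and text tokens separately, and the two reconstruction terms added.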

(3)Alignment Objectives

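        ①Image-Text Matching: BCE loss, which in its standard form reads:

\mathcal{L}_{IT}=-\left[p\log\mathcal{S}(z^I,z^T)+(1-p)\log\left(1-\mathcal{S}(z^I,z^T)\right)\right]

where \mathcal{S}\left ( \cdot \right ) measures the alignment probability between the image and text, p=1 when the image and text match and 0 otherwise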

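        ②Region-Word Matching: models local cross-modal correlation in dense scenes, using the same standard BCE form at the region-word level:

\mathcal{L}_{RW}=-\left[p\log\mathcal{S}(r^I,w^T)+(1-p)\log\left(1-\mathcal{S}(r^I,w^T)\right)\right]

where (r^I,w^T) denotes a region-word pair, p=1 when the region and word match and 0 otherwise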
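Both alignment objectives are binary cross-entropy on a match label, so one small function covers image-text and region-word pairs. This is a minimal sketch with the conventional leading minus sign; the clipping constant `eps` is an assumption added only to keep the logarithm finite.

```python
import math


def matching_loss(score, is_match, eps=1e-12):
    """Binary cross-entropy between an alignment probability and a 0/1 match label."""
    s = min(max(score, eps), 1.0 - eps)  # clip so log() stays finite
    p = 1.0 if is_match else 0.0
    return -(p * math.log(s) + (1.0 - p) * math.log(1.0 - s))


# A confident correct prediction costs less than a confident wrong one.
print(matching_loss(0.9, True), matching_loss(0.1, True))
```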

2.4.3. VLM Pre-Training Frameworks

        ①two-tower, two-leg and one-tower pre-training approaches:

2.4.4. Evaluation Setups and Downstream Tasks

(1)Zero-Shot Prediction

        ①Image Classification: apply prompt engineering and compare embeddings of images and texts

        ②Semantic Segmentation: comparing the embeddings of the given image pixels and texts

        ③Object Detection: comparing the embeddings of the given object proposals and texts

        ④Image-Text Retrieval: retrieve the demanded samples from one modality given the cues from another modality, text-to-image or image-to-text
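Zero-shot image classification as described above comes down to building one text prompt per class and picking the class whose text embedding is closest to the image embedding. The sketch below assumes cosine similarity and the prompt template "a photo of a {}."; the embeddings are toy vectors standing in for real encoder outputs, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_prompts(class_names, template="a photo of a {}."):
    """Prompt engineering: wrap each class name in a sentence template."""
    return [template.format(name) for name in class_names]


def zero_shot_classify(image_embedding, text_embeddings, class_names):
    """Return the class whose prompt embedding is most similar to the image."""
    scores = [cosine(image_embedding, t) for t in text_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores


class_names = ["cat", "dog"]
prompts = build_prompts(class_names)        # would be fed to the text encoder
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for encoded prompts
image_embedding = [0.9, 0.1]                # toy stand-in for an encoded image
print(zero_shot_classify(image_embedding, text_embeddings, class_names)[0])
```

Segmentation and detection follow the same comparison, with pixel or object-proposal embeddings in place of the whole-image embedding.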

(2)Linear Probing

        ①Freeze the pre-trained VLM → extract embeddings → train a linear classifier on these embeddings
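Linear probing can be sketched as logistic regression trained on fixed embeddings. The example below is a toy under stated assumptions: the features are hand-written 2-D vectors standing in for frozen VLM embeddings, the task is binary, and training is plain full-batch gradient descent with hypothetical hyper-parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit weights w and bias b of a logistic-regression classifier.

    The features are treated as constants: only w and b are updated,
    which mirrors keeping the pre-trained encoder frozen.
    """
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Predict class 1 if the linear score is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Toy "frozen embeddings" for two classes.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_linear_probe(features, labels)
print([predict(w, b, x) for x in features])
```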

2.5. Datasets

         ①Widely Used Image-Text Datasets for VLM Pre-Training:

        ②Widely-Used Visual Recognition Datasets for VLM Evaluation:

2.5.1. Datasets for Pre-Training VLMs

        ①Collection of image-text data is easier and cheaper than traditional crowd-labelled data

        ②⭐Some studies use auxiliary datasets to provide additional information for better vision-language modelling, e.g. GLIP leverages Objects365 to extract region-level features

2.5.2. Datasets for VLM Evaluation

        ①Counts the datasets of each type

2.6. Vision-Language Model Pre-Training

        ①Vision-Language Model Pre-Training Methods:

2.6.1. VLM Pre-Training With Contrastive Objectives

(1)Image Contrastive Learning

        ①e.g. SLIP utilizes infoNCE loss to learn the discriminative image features

(2)Image-Text Contrastive Learning

        ①Learns the correlation between paired images and texts by pulling matched pairs together and pushing mismatched pairs apart:

(3)Image-Text-Label Contrastive Learning

        ①Encoding image, text and label into one shared space:


        ①Challenge 1: Jointly optimizing positive and negative pairs is complicated and challenging

        ②Challenge 2: Heuristic temperature hyper-parameter selection

2.6.2. VLM Pre-Training With Generative Objectives

(1)Masked Image Modelling

        ①Image patches mask strategy:

(2)Masked Language Modelling

        ①Text mask strategy:

(3)Masked Cross-Modal Modelling

        ①Mask image and text at the same time

(4)Image-to-Text Generation

        ①Encode images and then decode them to match the texts


        ①Learning context information

2.6.3. VLM Pre-Training With Alignment Objectives

(1)Image-Text Matching

        ①Match image and text pairs

(2)Region-Word Matching

        ①Match region and text pairs:


        ①Alignment objectives usually serve to enhance contextual information or cross-modal correlation

2.6.4. Summary and Discussion

        ①Recent VLM pre-training either learns global vision-language correlation or models local fine-grained vision-language correlation via region-word matching

2.7. VLM Transfer Learning

2.7.1. Motivation of Transfer Learning

        ①Challenges for pre-trained VLMs: a) different downstream distributions, b) different downstream tasks

2.7.2. Common Setup of Transfer Learning

        ①Unsupervised methods are more efficient and promising

2.7.3. Common Transfer Learning Methods

        ①3 types of VLM transfer models:

(1)Transfer Via Prompt Tuning

        ①Transfer with Text Prompt Tuning: 

        ②Transfer with Visual Prompt Tuning: 

        ③Transfer with Text-Visual Prompt Tuning: tune image and text together

        ④Discussion: prompt tuning has limited flexibility because the prompts must follow the manifold (distribution) of the original VLM

(2)Transfer Via Feature Adaptation

        ①Fine-tune the features with an additional feature adapter:

but this modifies the network architecture, so it may not be usable for VLMs with intellectual property concerns
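The note above only says that an extra feature adapter fine-tunes the features. One common form, used here as an assumption in the style of CLIP-Adapter, passes the frozen feature through a small two-layer MLP and blends the result back with the original feature using a ratio alpha. The weights `W1`, `W2` and the value of `alpha` below are hypothetical; in practice only the adapter weights would be trained.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def adapt_feature(feature, W1, W2, alpha=0.2):
    """Residual blend: alpha * adapter(feature) + (1 - alpha) * feature.

    The VLM feature itself is left untouched; only W1 and W2 are learnable.
    """
    adapted = relu(matvec(W2, relu(matvec(W1, feature))))
    return [alpha * a + (1.0 - alpha) * f for a, f in zip(adapted, feature)]


identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
print(adapt_feature([1.0, 2.0], identity, double, alpha=0.5))
```

Setting alpha to 0 recovers the original VLM feature, which is what makes the blend a cautious way to adapt without drifting far from the pre-trained representation.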

(3)Other Transfer Methods

        ①Lists other methods

2.7.4. Summary and Discussion

        ①2 main methods of VLM transfer learning: prompt tuning and feature adapter

2.8. VLM Knowledge Distillation

2.8.1. Motivation of Distilling Knowledge From VLMs

        ①VLM knowledge distillation distils general and robust VLM knowledge to task-specific models without the restriction of VLM architecture

intact  adj. complete; undamaged

2.8.2. Common Knowledge Distillation Methods

(1)Knowledge Distillation for Object Detection

        ①Introduced basic and prompt based knowledge distillation for open vocabulary object detection

(2)Knowledge Distillation for Semantic Segmentation

        ①Likewise covers basic and weakly supervised distillation methods

2.8.3. Summary and Discussion

        ①More flexible than transfer learning

2.9. Performance Comparison

2.9.1. Performance of VLM Pre-Training

        ①Performance comparison on image classification:

        ②Data and model size test:

        ③The main source of VLM advantages: a) large samples, b) large model, c) task-agnostic learning

        ④Segmentation performance:

        ⑤Detection performance:

        ⑥Limitations: a) performance saturates as the model scale keeps growing, b) high computing costs in pre-training, c) excessive computation and memory overheads in both training and inference

2.9.2. Performance of VLM Transfer Learning

        ①Image classification performance:

2.9.3. Performance of VLM Knowledge Distillation

        ①Object detection performance:

        ②Semantic segmentation performance:

2.9.4. Summary

        ①The baseline tests are not unified

2.10. Future Directions

(1)For VLM pretraining:

        ①Fine-grained vision-language correlation modelling

        ②Unification of vision and language learning

        ③Pre-training VLMs with multiple languages

        ④Data-efficient VLMs: extract more supervision from the image-text pairs used in training

        ⑤Pre-training VLMs with LLMs: 

(2)For VLM transfer learning:

        ①Unsupervised VLM transfer

        ②VLM transfer with visual prompt/adapter

        ③Test-time VLM transfer

        ④VLM transfer with LLMs

(3)VLM knowledge distillation

        ①Distil knowledge from multiple VLMs

        ②Other visual tasks, such as instance segmentation, panoptic segmentation, person re-identification etc.

panoptic  adj. showing the whole view at once; all-encompassing

2.11. Conclusion


3. Reference

Zhang, J. et al. (2024) Vision-Language Models for Vision Tasks: A Survey. TPAMI, 46(8): 5625-5644. doi: 10.1109/TPAMI.2024.3369699



