Paper: Vision-Language Models for Vision Tasks: A Survey | IEEE Journals & Magazine | IEEE Xplore
Paper GitHub page: GitHub - jingyi0000/VLM_survey: Collection of AWESOME vision-language models for vision tasks
The English here is typed entirely by hand, as my own summarizing and paraphrasing of the original paper, so some spelling and grammar slips are hard to avoid; if you spot any, corrections in the comments are welcome! This post reads more like personal notes, so take it with a grain of salt.
Table of Contents
1. Thoughts
2. Section-by-Section Reading
2.1. Abstract
2.2. Introduction
2.3. Background
2.3.1. Training Paradigms for Visual Recognition
2.3.2. Development of VLMs for Visual Recognition
2.3.3. Relevant Surveys
2.4. VLM Foundations
2.4.1. Network Architectures
2.4.2. VLM Pre-Training Objectives
2.4.3. VLM Pre-Training Frameworks
2.4.4. Evaluation Setups and Downstream Tasks
2.5. Datasets
2.5.1. Datasets for Pre-Training VLMs
2.5.2. Datasets for VLM Evaluation
2.6. Vision-Language Model Pre-Training
2.6.1. VLM Pre-Training With Contrastive Objectives
2.6.2. VLM Pre-Training With Generative Objectives
2.6.3. VLM Pre-Training With Alignment Objectives
2.6.4. Summary and Discussion
2.7. VLM Transfer Learning
2.7.1. Motivation of Transfer Learning
2.7.2. Common Setup of Transfer Learning
2.7.3. Common Transfer Learning Methods
2.7.4. Summary and Discussion
2.8. VLM Knowledge Distillation
2.8.1. Motivation of Distilling Knowledge From VLMs
2.8.2. Common Knowledge Distillation Methods
2.8.3. Summary and Discussion
2.9. Performance Comparison
2.9.1. Performance of VLM Pre-Training
2.9.2. Performance of VLM Transfer Learning
2.9.3. Performance of VLM Knowledge Distillation
2.9.4. Summary
2.10. Future Directions
2.11. Conclusion
3. Reference
1. Thoughts
(1)Another relaxed read; it has also been a while since my last TPAMI paper, and I have always thought highly of TPAMI's quality, so let's dig in
(2)Compared with long-winded walkthroughs of n models one after another, highlighting the key point of each model works really well. Like another TPAMI survey I read before, it mainly writes out the losses. Spending too much space describing datasets is also a bit silly, and this paper keeps that part concise. Good, good
(3)What is nice is that some summaries are placed in tables, without that unpleasant feeling of having walls of text shoved at me
(4)When introducing some novel models, there is no need to go through each one from beginning to end; just highlight the aspect in which it is novel and write up that innovation
(5)Not bad; I would give it a fairly high rating
2. Section-by-Section Reading
2.1. Abstract
①Existing problem: a DNN has to be trained for each visual task, which is laborious and time-consuming
②Content: a) background of VLMs in visual tasks, b) foundations of VLMs, c) datasets, d) pre-training, transfer learning and knowledge distillation methods of VLMs, e) benchmarks, f) challenges
laborious adj. requiring much effort; toilsome
2.2. Introduction
①New paradigm: Pre-training (on large-scale data w/ or w/o labels), Fine-tuning (on specific labelled training data), and Prediction, see (a) and (b):
②Vision-Language Model Pre-training and Zero-shot Prediction, which does not need fine-tuning:
③Number of VLM publications on Google Scholar:
frisbee n. a flying disc (used in throwing games)
2.3. Background
2.3.1. Training Paradigms for Visual Recognition
(1)Traditional Machine Learning and Prediction
①Mostly hand-crafted and lightweight, but hard to cope with complex or multiple tasks
②Poor scalability
(2)Deep Learning From Scratch and Prediction
①Slow convergence when training from scratch
②A large amount of labelled data is needed
(3)Supervised Pre-Training, Fine-Tuning and Prediction
①Speed up convergence
(4)Unsupervised Pre-Training, Fine-Tuning & Prediction
①Does not require labelled data
②Better performance thanks to learning from larger amounts of data
(5)VLM Pre-Training and Zero-Shot Prediction
①Discards fine-tuning
②Future directions: a) large-scale informative image-text data, b) high-capacity models, c) new pre-training objectives
2.3.2. Development of VLMs for Visual Recognition
①3 improvements to VLMs:
2.3.3. Relevant Surveys
①Framework of their review:
2.4. VLM Foundations
2.4.1. Network Architectures
①A pre-training dataset of $N$ image-text pairs: $\mathcal{D}=\{(x_n^I, x_n^T)\}_{n=1}^{N}$
②Features extracted from the pairs: $z^I$ and $z^T$, where the superscript $I$ denotes an image sample and $T$ denotes a text sample
③Image encoder and text encoder of the DNN: $f_\theta$ / $f_\phi$
④Encoding operations: $z^I=f_\theta(x^I)$ and $z^T=f_\phi(x^T)$
(1)Architectures for Learning Image Features
①CNN-based architectures: such as VGG, ResNet and EfficientNet
②Transformer-based architectures: such as ViT
(2)Architectures for Learning Language Features
①The standard Transformer framework: 6 blocks in the encoder (each with a multi-head self-attention layer and an MLP) and 6 blocks in the decoder (each with a masked multi-head attention layer, a multi-head attention layer over the encoder outputs, and an MLP)
2.4.2. VLM Pre-Training Objectives
(1)Contrastive Objectives
①Image Contrastive Learning: pulls the query close to its positive keys and away from its negative keys in the embedding space. For a batch of $B$ images (the authors phrase this in a rather code-like way as "for such a batch size"; conceptually, just read $B$ as the total number of samples), the loss is usually the InfoNCE loss:
$$\mathcal{L}_{I}^{InfoNCE}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(z_{i}^{I}\cdot z_{+}^{I}/\tau\right)}{\sum_{j=1,\,j\neq i}^{B+1}\exp\left(z_{i}^{I}\cdot z_{j}^{I}/\tau\right)}$$
where $z_{i}^{I}$ denotes the query embedding, $\{z_{j}^{I}\}_{j=1,\,j\neq i}^{B+1}$ denote the key embeddings, $z_{+}^{I}$ denotes the positive key of the $i$-th sample, and $\tau$ denotes a temperature hyper-parameter
②Image-Text Contrastive Learning: pulls the embeddings of paired images and texts close and pushes the others apart, usually as the sum of two symmetric terms $\mathcal{L}_{infoNCE}^{IT}=\mathcal{L}_{I\rightarrow T}+\mathcal{L}_{T\rightarrow I}$ (a code sketch of this loss follows this list):
$$\mathcal{L}_{I\rightarrow T}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(z_{i}^{I}\cdot z_{i}^{T}/\tau\right)}{\sum_{j=1}^{B}\exp\left(z_{i}^{I}\cdot z_{j}^{T}/\tau\right)},\qquad \mathcal{L}_{T\rightarrow I}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(z_{i}^{T}\cdot z_{i}^{I}/\tau\right)}{\sum_{j=1}^{B}\exp\left(z_{i}^{T}\cdot z_{j}^{I}/\tau\right)}$$
where $\mathcal{L}_{I\rightarrow T}$ denotes contrasting the query image with the text keys, and $\mathcal{L}_{T\rightarrow I}$ denotes contrasting the query text with the image keys
③Image-Text-Label Contrastive Learning: a supervised variant that also uses class labels:
$$\mathcal{L}_{I\rightarrow T}^{ITL}=-\sum_{i=1}^{B}\frac{1}{|\mathcal{P}(i)|}\sum_{k\in\mathcal{P}(i)}\log\frac{\exp\left(z_{i}^{I}\cdot z_{k}^{T}/\tau\right)}{\sum_{j=1}^{B}\exp\left(z_{i}^{I}\cdot z_{j}^{T}/\tau\right)}$$
where $\mathcal{P}(i)=\{k\mid k\in B,\ y_{k}=y_{i}\}$, and $y_{i}$ denotes the class label of $(x_{i}^{I}, x_{i}^{T})$ (this effectively adds an extra inner loop over the samples sharing the same class); the text-to-image term is defined symmetrically
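The image-text contrastive loss (②) is what CLIP-style pre-training optimizes in practice; below is a minimal PyTorch sketch of it (tensor names, shapes and the default temperature are my own illustrative assumptions, not the paper's code). The single-modality InfoNCE loss in ① has the same softmax-over-similarities form, just with image keys only.

```python
# Minimal PyTorch sketch of the symmetric image-text contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """img_emb, txt_emb: [B, D] embeddings of B paired images and texts."""
    img_emb = F.normalize(img_emb, dim=-1)           # cosine similarity needs unit vectors
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau             # [B, B] similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)  # diagonal = positives
    loss_i2t = F.cross_entropy(logits, targets)      # contrast each image against all texts
    loss_t2i = F.cross_entropy(logits.t(), targets)  # contrast each text against all images
    return (loss_i2t + loss_t2i) / 2                 # average of the two directions
```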
(2)Generative Objectives
①Masked Image Modelling: learns cross-patch correlation by masking a set of patches and reconstructing them. The loss is usually:
$$\mathcal{L}_{MIM}=-\frac{1}{B}\sum_{i=1}^{B}\log f_{\theta}\left(\bar{x}_{i}^{I}\mid\hat{x}_{i}^{I}\right)$$
where $\bar{x}_{i}^{I}$ denotes the masked patches and $\hat{x}_{i}^{I}$ the unmasked patches (I first wondered whether this conditional was written backwards, but it reads as the likelihood of the masked patches given the unmasked ones, i.e. reconstructing what was masked from what stays visible, so the direction is fine)
②Masked Language Modelling: masks text tokens at a specific ratio and reconstructs them:
$$\mathcal{L}_{MLM}=-\frac{1}{B}\sum_{i=1}^{B}\log f_{\theta}\left(\bar{x}_{i}^{T}\mid\hat{x}_{i}^{T}\right)$$
③Masked Cross-Modal Modelling: randomly masks a subset of image patches and a subset of text tokens, then reconstructs them from the unmasked patches and tokens:
$$\mathcal{L}_{MCM}=-\frac{1}{B}\sum_{i=1}^{B}\left[\log f_{\theta}\left(\bar{x}_{i}^{I}\mid\hat{x}_{i}^{I},\hat{x}_{i}^{T}\right)+\log f_{\theta}\left(\bar{x}_{i}^{T}\mid\hat{x}_{i}^{I},\hat{x}_{i}^{T}\right)\right]$$
④Image-to-Text Generation: predicts the text autoregressively from its paired image:
$$\mathcal{L}_{ITG}=-\sum_{l=1}^{L}\log f_{\theta}\left(x_{l}^{T}\mid x_{<l}^{T},\,z^{I}\right)$$
where $L$ denotes the number of tokens of $x^{T}$, and $z^{I}$ is the embedding of the image paired with $x^{T}$
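As an illustration of the image-to-text generation objective (④), here is a minimal PyTorch sketch of the autoregressive token-level cross-entropy; the `decoder` interface is an assumption for illustration. The masked objectives (①-③) follow the same per-token cross-entropy pattern, just computed over masked positions.

```python
# Minimal sketch of the image-to-text generation loss: next-token prediction
# conditioned on the paired image embedding (teacher forcing).
import torch.nn.functional as F

def image_to_text_generation_loss(decoder, img_emb, token_ids):
    """
    decoder:   callable mapping (img_emb, input tokens) -> logits [B, L-1, vocab]
    img_emb:   [B, D] embeddings of the paired images
    token_ids: [B, L] ground-truth caption token ids (incl. BOS/EOS)
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # shift for next-token targets
    logits = decoder(img_emb, inputs)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```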
(3)Alignment Objectives
①Image-Text Matching: a binary cross-entropy (BCE) loss (a code sketch follows this list):
$$\mathcal{L}_{ITM}=-\left[p\log S\left(z^{I},z^{T}\right)+(1-p)\log\left(1-S\left(z^{I},z^{T}\right)\right)\right]$$
where $S(\cdot)$ measures the alignment probability between the image and the text, and $p=1$ when they match, otherwise $p=0$
②Region-Word Matching: models local cross-modal correlation in dense scenes by matching region-word pairs:
$$\mathcal{L}_{RW}=-\left[p\log S^{r}\left(r^{I},w^{T}\right)+(1-p)\log\left(1-S^{r}\left(r^{I},w^{T}\right)\right)\right]$$
where $\left(r^{I},w^{T}\right)$ denotes a region-word pair, and $p=1$ when they match, otherwise $p=0$
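A minimal sketch of the image-text matching objective (①), assuming a small fusion head `itm_head` that outputs one matching logit per pair; negatives would be built by pairing images with mismatched texts.

```python
# Minimal sketch of the image-text matching (ITM) objective as a BCE loss.
import torch
import torch.nn.functional as F

def itm_loss(itm_head, img_emb, txt_emb, labels):
    """
    itm_head: module mapping the fused pair representation to a single logit
    img_emb, txt_emb: [B, D] embeddings of candidate image-text pairs
    labels:   [B] tensor with 1 for a true pair and 0 for a mismatched pair
    """
    fused = torch.cat([img_emb, txt_emb], dim=-1)   # simple concatenation fusion
    logits = itm_head(fused).squeeze(-1)            # alignment score S(z_I, z_T) as a logit
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```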
2.4.3. VLM Pre-Training Frameworks
①two-tower, two-leg and one-tower pre-training approaches:
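For reference, a minimal sketch of the two-tower design (the most common one, e.g. CLIP): two separate encoders project images and texts into a shared embedding space, on which the objectives above operate. The backbones and dimensions are placeholders.

```python
# Minimal sketch of a two-tower VLM: separate image/text encoders plus
# projection layers into one shared, L2-normalized embedding space.
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerVLM(nn.Module):
    def __init__(self, image_backbone, text_backbone, dim_img, dim_txt, dim_shared=512):
        super().__init__()
        self.image_backbone = image_backbone      # e.g. a ResNet/ViT trunk
        self.text_backbone = text_backbone        # e.g. a Transformer text encoder
        self.img_proj = nn.Linear(dim_img, dim_shared)
        self.txt_proj = nn.Linear(dim_txt, dim_shared)

    def forward(self, images, texts):
        z_img = self.img_proj(self.image_backbone(images))
        z_txt = self.txt_proj(self.text_backbone(texts))
        return F.normalize(z_img, dim=-1), F.normalize(z_txt, dim=-1)
```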
2.4.4. Evaluation Setups and Downstream Tasks
(1)Zero-Shot Prediction
①Image Classification: apply prompt engineering to build class prompts and compare the embeddings of images and texts (see the sketch after this list)
②Semantic Segmentation: comparing the embeddings of the given image pixels and texts
③Object Detection: comparing the embeddings of the given object proposals and texts
④Image-Text Retrieval: retrieve the demanded samples from one modality given the cues from another modality, text-to-image or image-to-text
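A minimal sketch of zero-shot classification (①): class names are wrapped in a prompt template, the text embeddings act as classifier weights, and the image is assigned to the most similar class. `encode_image` / `encode_text` stand for the frozen VLM encoders; their interface and the template are assumptions for illustration.

```python
# Minimal sketch of zero-shot image classification with a frozen two-tower VLM.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(encode_image, encode_text, image, class_names,
                       template="a photo of a {}."):
    prompts = [template.format(name) for name in class_names]  # prompt engineering
    txt = F.normalize(encode_text(prompts), dim=-1)             # [C, D] class "weights"
    img = F.normalize(encode_image(image), dim=-1)              # [1, D]
    sims = img @ txt.t()                                        # cosine similarities
    return class_names[sims.argmax(dim=-1).item()]
```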
(2)Linear Probing
①Freeze the pre-trained VLM → extract embeddings → train a linear classifier on these embeddings
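A minimal sketch of linear probing, assuming the frozen VLM embeddings have already been extracted as numpy arrays; scikit-learn's logistic regression is used here as the linear classifier, as is common in CLIP-style evaluations.

```python
# Minimal sketch of linear probing on frozen VLM image embeddings.
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """train_feats/test_feats: [N, D] frozen VLM embeddings (numpy arrays)."""
    clf = LogisticRegression(max_iter=1000)     # only this linear classifier is trained
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)   # top-1 accuracy
```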
2.5. Datasets
①Widely Used Image-Text Datasets for VLM Pre-Training:
②Widely-Used Visual Recognition Datasets for VLM Evaluation:
2.5.1. Datasets for Pre-Training VLMs
①Collecting image-text data is easier and cheaper than traditional crowd-labelled data
②⭐Some studies utilize auxiliary datasets to provide additional information for better vision-language modelling; e.g., GLIP leverages Objects365 to extract region-level features
2.5.2. Datasets for VLM Evaluation
①Counts of each type of evaluation dataset
2.6. Vision-Language Model Pre-Training
①Vision-Language Model Pre-Training Methods:
2.6.1. VLM Pre-Training With Contrastive Objectives
(1)Image Contrastive Learning
①e.g., SLIP utilizes the InfoNCE loss to learn discriminative image features
(2)Image-Text Contrastive Learning
①Learns the correlation between paired images and texts, and pushes irrelevant pairs apart:
(3)Image-Text-Label Contrastive Learning
①Encodes images, texts and labels into one shared space:
(4)Discussion
①Challenge 1: Jointly optimizing over positive and negative pairs is complicated and challenging
②Challenge 2: Heuristic temperature hyper-parameter selection
2.6.2. VLM Pre-Training With Generative Objectives
(1)Masked Image Modelling
①Image patch masking strategy:
(2)Masked Language Modelling
①Text masking strategy:
(3)Masked Cross-Modal Modelling
①Masks image patches and text tokens at the same time
(4)Image-to-Text Generation
①Encodes images and then decodes text from them to match the paired captions
(5)Discussion
①Learning context information
2.6.3. VLM Pre-Training With Alignment Objectives
(1)Image-Text Matching
①Match image and text pairs
(2)Region-Word Matching
①Matches region-word pairs:
(3)Discussion
①Alignment objectives usually serve to enhance contextual information or cross-modal correlation
2.6.4. Summary and Discussion
①Recent VLM pre-training focuses on learning global vision-language correlation or models local fine-grained vision-language correlation via region-word matching
2.7. VLM Transfer Learning
2.7.1. Motivation of Transfer Learning
①Challenges for pre-trained VLMs: a) different downstream distributions, b) different downstream tasks
2.7.2. Common Setup of Transfer Learning
①Unsupervised methods are more efficient and promising
2.7.3. Common Transfer Learning Methods
①3 types of VLM transfer methods:
(1)Transfer Via Prompt Tuning
①Transfer with Text Prompt Tuning (see the sketch after this list):
②Transfer with Visual Prompt Tuning:
③Transfer with Text-Visual Prompt Tuning: tune image and text together
④Discussion: the challenge here is low flexibility, since prompting has to follow the manifold (distribution) of the original VLM
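A minimal sketch of CoOp-style text prompt tuning (①): a few learnable context vectors replace the hand-written prompt words and are prepended to the frozen class-name token embeddings; only these vectors are optimized while the VLM stays frozen. Shapes and the downstream text-encoder interface are assumptions for illustration.

```python
# Minimal sketch of learnable text prompts (CoOp-style): trainable context
# vectors + frozen class-name token embeddings, fed to the frozen text encoder.
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    def __init__(self, n_ctx, ctx_dim, class_token_embs):
        super().__init__()
        # [n_ctx, D] learnable context vectors shared across all classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # [C, L, D] frozen token embeddings of the C class names
        self.register_buffer("class_embs", class_token_embs)

    def forward(self):
        C = self.class_embs.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)     # broadcast context to every class
        return torch.cat([ctx, self.class_embs], dim=1)   # [C, n_ctx + L, D] prompt tokens
```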
(2)Transfer Via Feature Adaptation
①Fine-tunes the features with an additional feature adapter:
though this can raise intellectual property concerns
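A minimal sketch of a CLIP-Adapter-style feature adapter: a small bottleneck MLP refines the frozen VLM feature and is blended back residually; only the adapter is trained. The reduction ratio and mixing weight are illustrative assumptions.

```python
# Minimal sketch of a residual feature adapter on top of frozen VLM features.
import torch.nn as nn

class FeatureAdapter(nn.Module):
    def __init__(self, dim, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, feat):
        # residual blend keeps the adapted feature close to the original VLM feature
        return self.alpha * self.mlp(feat) + (1 - self.alpha) * feat
```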
(3)Other Transfer Methods
①Lists other methods
2.7.4. Summary and Discussion
①2 main families of VLM transfer learning: prompt tuning and feature adapters
2.8. VLM Knowledge Distillation
2.8.1. Motivation of Distilling Knowledge From VLMs
①VLM knowledge distillation distils general and robust VLM knowledge to task-specific models without the restriction of VLM architecture
intact adj. complete; whole; undamaged
2.8.2. Common Knowledge Distillation Methods
(1)Knowledge Distillation for Object Detection
①Introduces basic and prompt-based knowledge distillation for open-vocabulary object detection
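A minimal sketch of the feature-distillation idea used in ViLD-style open-vocabulary detection: the detector's region embeddings are pushed towards the frozen VLM image embeddings of the corresponding cropped proposals. The proposal cropping/encoding pipeline is assumed to exist outside this snippet.

```python
# Minimal sketch of region-level knowledge distillation from a frozen VLM
# image encoder into a detector head (L1 loss on normalized embeddings).
import torch.nn.functional as F

def region_distillation_loss(region_feats, vlm_crop_embs):
    """
    region_feats:  [R, D] detector embeddings for R region proposals
    vlm_crop_embs: [R, D] frozen VLM embeddings of the cropped proposals
    """
    region_feats = F.normalize(region_feats, dim=-1)
    vlm_crop_embs = F.normalize(vlm_crop_embs, dim=-1)
    return F.l1_loss(region_feats, vlm_crop_embs)
```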
(2)Knowledge Distillation for Semantic Segmentation
①Also covers basic and weakly-supervised distillation methods
2.8.3. Summary and Discussion
①More flexible than transfer learning
2.9. Performance Comparison
2.9.1. Performance of VLM Pre-Training
①Performance comparison on image classification:
②Data and model size test:
③The main source of VLM advantages: a) large samples, b) large model, c) task-agnostic learning
④Segmentation performance:
⑤Detection performance:
⑥Limitations: a) performance saturates when continuously expanding the scale of the model, b) heavy computing costs in pre-training, c) excessive computation and memory overheads in both training and inference
2.9.2. Performance of VLM Transfer Learning
①Image classification performance:
2.9.3. Performance of VLM Knowledge Distillation
①Object detection performance:
②Semantic segmentation performance:
2.9.4. Summary
①The baselines and evaluation setups are not unified across methods
2.10. Future Directions
(1)For VLM pretraining:
①Fine-grained vision-language correlation modelling
②Unification of vision and language learning
③Pre-training VLMs with multiple languages
④Data-efficient VLMs: make better use of the supervision within image-text pairs during training
⑤Pre-training VLMs with LLMs:
(2)For VLM transfer learning:
①Unsupervised VLM transfer
②VLM transfer with visual prompt/adapter
③Test-time VLM transfer
④VLM transfer with LLMs
(3)VLM knowledge distillation
①Extract knowledge from multiple VLMs
②Other visual tasks, such as instance segmentation, panoptic segmentation, person re-identification etc.
panoptic adj. presenting a whole view; showing everything of an object at a glance
2.11. Conclusion
good
3. Reference
Zhang, J. et al. (2024) Vision-Language Models for Vision Tasks: A Survey. TPAMI, 46(8): 5625-5644. doi: 10.1109/TPAMI.2024.3369699