The 32 Best AI Papers of 2022: DALL·E 2, Stable Diffusion, ChatGPT, and More

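This list was compiled by Louis Bouchard, a PhD student at Mila, and on the whole it is fairly reliable. His GitHub repository also has short video explainers, written summaries, and code links for many of the papers.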

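In the list below we have added each paper's main contributing institution. Institutions that contributed but sit late in the author list, and may be there in name only, are ignored; a paper credited to two institutions counts as 0.5 for each. The counts seem to reflect where each company stands in AI: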

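Tier 1: Google with 8 papers and Meta with 6 hold the top two spots. OpenAI has 3, but two of them (DALL·E 2 and ChatGPT) are hugely influential; judged by flagship work, it arguably holds its own against the two giants.
Tier 2: NVIDIA with 2.5 papers.
Tier 3: In China, Tencent, Baidu, and Microsoft (from Microsoft Research Asia) have 1 paper each. Outside China, Samsung and Disney have 1 each. Snap and Adobe have 0.5 each.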

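Universities account for 5.5 papers in total, fewer than either of the two giants alone, which is a noticeably weaker showing. Among them: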

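Tel Aviv University ranks first with 1.5 papers, but Stable Diffusion from LMU Munich has had enormous impact, so Munich should also be counted in the first tier.
CMU and Nanyang Technological University have 1 paper each: second tier.
USC and UC Berkeley have 0.5 each: third tier.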
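
The counts above can be reproduced from the affiliation labels in the list below. This is a minimal sketch, assuming each paper is worth 1 and a label of the form "A&B" splits that credit equally between A and B. The shortened English labels, the `UNIVERSITIES` set, and the `tally` function are names introduced here for illustration only.

```python
from collections import Counter
from fractions import Fraction

# One affiliation label per entry [1]..[32], in list order.
# "A&B" marks a paper credited jointly to two institutions.
# Entry [12] (Google DeepMind) is counted under Google, as in the tally above.
LABELS = [
    "Samsung", "Tel Aviv", "USC&Snap", "Google", "Tencent", "Google",
    "NVIDIA", "OpenAI", "Google", "Meta", "Berkeley&Adobe", "Google",
    "Google", "Craiyon", "Meta", "CMU", "Meta", "Meta",
    "LMU Munich", "NTU", "Tel Aviv&NVIDIA", "Microsoft", "Meta", "OpenAI",
    "Google", "Google", "NVIDIA", "Google", "Meta", "Baidu",
    "OpenAI", "Disney",
]

# Which labels are treated as universities (an assumption of this sketch).
UNIVERSITIES = {"Tel Aviv", "USC", "Berkeley", "CMU", "LMU Munich", "NTU"}


def tally(labels):
    """Give each paper a total credit of 1, split equally among its institutions."""
    counts = Counter()
    for label in labels:
        parts = label.split("&")
        share = Fraction(1, len(parts))
        for name in parts:
            counts[name] += share
    return counts


counts = tally(LABELS)
university_total = sum(counts[name] for name in UNIVERSITIES)

# Print institutions from most to fewest papers.
for name, value in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{name}: {float(value)}")
print(f"Universities combined: {float(university_total)}")
```

Using `Fraction` rather than floats keeps the half-paper credits exact, so the per-institution counts always add back up to the 32 entries.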

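By topic, large models, text-to-image generation, and cross-modal work are without question this year's hot areas. There are also several computer vision papers, on GANs and related topics.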

[1] Samsung: Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K. and Lempitsky, V., 2022. Resolution-robust Large Mask Inpainting with Fourier Convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2149–2159), https://arxiv.org/pdf/2109.07161.pdf

[2] Tel Aviv University: Tzaban, R., Mokady, R., Gal, R., Bermano, A.H. and Cohen-Or, D., 2022. Stitch it in Time: GAN-Based Facial Editing of Real Videos. https://arxiv.org/abs/2201.08361

[3] USC & Snap: Kuang, Z., Olszewski, K., Chai, M., Huang, Z., Achlioptas, P. and Tulyakov, S., 2022. NeROIC: Neural Rendering of Objects from Online Image Collections. https://arxiv.org/pdf/2201.02533.pdf

[4] Google: Borsos, Z., Sharifi, M. and Tagliasacchi, M., 2022. SpeechPainter: Text-conditioned Speech Inpainting. https://arxiv.org/pdf/2202.07273.pdf

[5] Tencent: Wang, X., Li, Y., Zhang, H. and Shan, Y., 2021. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9168–9178), https://arxiv.org/pdf/2101.04061.pdf

[6] Google: Piergiovanni, A.J., Casser, V., Ryoo, M.S. and Angelova, A., 2021. 4D-Net for learned multi-modal alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15435–15445), https://openaccess.thecvf.com/content/ICCV2021/papers/Piergiovanni_4D-Net_for_Learned_Multi-Modal_Alignment_ICCV_2021_paper.pdf

[7] NVIDIA: Müller, T., Evans, A., Schied, C. and Keller, A., 2022. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding, https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf

[8] OpenAI/DALL·E 2: Ramesh et al., 2022, Hierarchical Text-Conditional Image Generation with CLIP Latents, https://cdn.openai.com/papers/dall-e-2.pdf

[9] Google: Nitzan, Y., Aberman, K., He, Q., Liba, O., Yarom, M., Gandelsman, Y., Mosseri, I., Pritch, Y. and Cohen-Or, D., 2022. MyStyle: A Personalized Generative Prior. arXiv preprint arXiv:2203.17272.

[10] Meta/OPT: Zhang, Susan et al. OPT: Open Pre-trained Transformer Language Models. https://arxiv.org/abs/2205.01068

[11] UC Berkeley & Adobe: Epstein, D., Park, T., Zhang, R., Shechtman, E. and Efros, A.A., 2022. BlobGAN: Spatially Disentangled Scene Representations. arXiv preprint arXiv:2205.02837.

[12] Google DeepMind: Reed, S. et al., 2022. Gato: A Generalist Agent.

[13] Google/Imagen: Saharia et al., 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.

[14] Craiyon: Dayma, et al., 2021, DALL·E Mini, doi:10.5281/zenodo.5146400. Code on GitHub. (A reproduction of DALL·E; only technical reports are available, and no formal paper was found.)

[15] Meta: NLLB Team et al., 2022, No Language Left Behind: Scaling Human-Centered Machine Translation. https://arxiv.org/abs/2207.04672

[16] CMU: Sheinin, M., Chan, D., O’Toole, M. and Narasimhan, S.G., 2022. Dual-Shutter Optical Vibration Sensing. Proc. IEEE CVPR. (CVPR 2022 best paper finalist.)

[17] Meta: Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. and Taigman, Y., 2022. Make-a-scene: Scene-based text-to-image generation with human priors. https://arxiv.org/pdf/2203.13131.pdf

[18] Meta: Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A. and Joo, H., 2022. Banmo: Building animatable 3d neural models from many casual videos. In CVPR2022 (pp. 2863-2873). https://arxiv.org/abs/2112.12761

[19] LMU Munich/Stable Diffusion: Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In CVPR2022 (pp. 10684–10695), https://arxiv.org/pdf/2112.10752.pdf

[20] Nanyang Technological University: Yang, J., Ang, Y.Z., Guo, Z., Zhou, K., Zhang, W. and Liu, Z., 2022. Panoptic Scene Graph Generation. arXiv preprint arXiv:2207.11247.

[21] Tel Aviv University & NVIDIA: Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G. and Cohen-Or, D., 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion.

[22] Microsoft: Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S. and Ling, H., 2022. Expanding Language-Image Pretrained Models for General Video Recognition. arXiv preprint arXiv:2208.02816.

[23] Meta/Make-A-Video: Singer et al., 2022. Make-A-Video: Text-To-Video Generation without Text-Video Data, https://makeavideo.studio/Make-A-Video.pdf

[24] OpenAI/Whisper: Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C. and Sutskever, I., Robust Speech Recognition via Large-Scale Weak Supervision. Code on GitHub.

[25] Google: Poole, B., Jain, A., Barron, J.T. and Mildenhall, B., 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv preprint arXiv:2209.14988.

[26] Google/Imagic: Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I. and Irani, M., 2022. Imagic: Text-Based Real Image Editing with Diffusion Models. arXiv preprint arXiv:2210.09276.

[27] NVIDIA: Balaji, Y. et al., 2022, eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, https://arxiv.org/abs/2211.01324

[28] Google: Li, Z., Wang, Q., Snavely, N. and Kanazawa, A., 2022. InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images. In ECCV (pp. 515–534). Springer, Cham, https://arxiv.org/abs/2207.11148

[29] Meta/Galactica: Taylor et al., 2022. Galactica: A Large Language Model for Science. Galactica demo.

[30] Baidu: Tang, J., Wang, K., Zhou, H., Chen, X., He, D., Hu, T., Liu, J., Zeng, G. and Wang, J., 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. arXiv preprint arXiv:2211.12368.

[31] OpenAI/ChatGPT: ChatGPT: Optimizing Language Models for Dialogue.

[32] Disney/FRAN: Zoss et al., DisneyResearch, 2022. Production-Ready Face Re-Aging for Visual Effects, https://studios.disneyresearch.com