Frontier Research Digest
Reposted from: 爱可可-爱生活 (Zhihu)
Summary: pretraining is all you need for image-to-image translation; contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods; neural 3D reconstruction in the wild; masked conditional video diffusion for prediction, generation, and interpolation; compression ensembles quantify aesthetic complexity and the evolution of visual art; single-view view synthesis in the wild with learned adaptive multiplane images; analyzing the training dynamics of large language models; language models with image descriptors are strong few-shot video-language learners; the Selectively Adaptive Lasso
1、[CV] Pretraining is All You Need for Image-to-Image Translation
T Wang, T Zhang, B Zhang, H Ouyang, D Chen, Q Chen, F Wen
[The Hong Kong University of Science and Technology & Microsoft Research Asia]
Pretraining is all you need for image-to-image translation. This paper proposes using pretraining to boost general image-to-image translation. Prior image-to-image translation methods typically require dedicated architectural designs and train each translation model from scratch, which makes high-quality generation of complex scenes difficult, especially when paired training data are not abundant. The paper treats each image-to-image translation problem as a downstream task and introduces a simple and generic framework that adapts a pretrained diffusion model to various kinds of image-to-image translation. It also proposes adversarial training to strengthen texture synthesis during diffusion model training, combined with normalized guidance sampling to improve generation quality. Extensive empirical comparisons across various tasks on challenging benchmarks such as ADE20K, COCO-Stuff, and DIODE show that the proposed pretraining-based image-to-image translation (PITI) can synthesize images of unprecedented realism and faithfulness.
We propose to use pretraining to boost general image-to-image translation. Prior image-to-image translation methods usually need dedicated architectural design and train individual translation models from scratch, struggling for high-quality generation of complex scenes, especially when paired training data are not abundant. In this paper, we regard each image-to-image translation problem as a downstream task and introduce a simple and generic framework that adapts a pretrained diffusion model to accommodate various kinds of image-to-image translation. We also propose adversarial training to enhance the texture synthesis in the diffusion model training, in conjunction with normalized guidance sampling to improve the generation quality. We present extensive empirical comparison across various tasks on challenging benchmarks such as ADE20K, COCO-Stuff, and DIODE, showing the proposed pretraining-based image-to-image translation (PITI) is capable of synthesizing images of unprecedented realism and faithfulness. Code will be available on the project webpage.
https://arxiv.org/abs/2205.12952
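The abstract mentions "normalized guidance sampling" without giving details. Below is a minimal, hypothetical Python sketch of classifier-free guidance with an added norm-rescaling step, included only to illustrate the general idea; the function name, guidance scale, and rescaling rule are assumptions, not the paper's actual formulation.

import torch

def guided_prediction(eps_cond, eps_uncond, w=3.0, normalize=True):
    # Hypothetical classifier-free guidance with norm rescaling.
    # eps_cond / eps_uncond: model noise predictions with / without conditioning.
    eps = eps_uncond + w * (eps_cond - eps_uncond)  # standard guidance combination
    if normalize:
        # Rescale each sample so guidance does not inflate the prediction norm;
        # one plausible reading of "normalized guidance", not the paper's rule.
        dims = tuple(range(1, eps.dim()))
        scale = eps_cond.norm(dim=dims, keepdim=True) / (eps.norm(dim=dims, keepdim=True) + 1e-8)
        eps = eps * scale
    return eps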
2、[LG] Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods
R Balestriero, Y LeCun
[Meta AI Research]
Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. Self-supervised learning (SSL) posits that inputs and pairwise positive relations are enough to learn meaningful representations. Although SSL has recently reached a milestone, outperforming supervised methods in many modalities, its theoretical foundations remain limited and method-specific, leaving practitioners without principled design guidelines. Inspired by spectral manifold learning, this paper proposes a unifying framework to address these limitations. The study rigorously demonstrates that VICReg, SimCLR, BarlowTwins and others correspond to eponymous spectral methods such as Laplacian Eigenmaps and Multidimensional Scaling. This unification yields: (i) the closed-form optimal representation for each method; (ii) the closed-form optimal network parameters in the linear regime for each method; (iii) the impact of the pairwise relations used during training on each of these quantities and on downstream task performance; and, most importantly, (iv) a theoretical bridge between contrastive and non-contrastive methods, which correspond to global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example: (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover the supervised method, but in the low-data regime SimCLR or VICReg with a high invariance hyperparameter should be preferred; (ii) if the pairwise relation is misaligned with the downstream task, BarlowTwins or VICReg with a small invariance hyperparameter should be preferred.
Self-Supervised Learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone: outperforming supervised methods in many modalities... the theoretical foundations are limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework under the helm of spectral manifold learning to address those limitations. Through the course of this study, we will rigorously demonstrate that VICReg, SimCLR, BarlowTwins et al. correspond to eponymous spectral methods such as Laplacian Eigenmaps, Multidimensional Scaling et al. This unification will then allow us to obtain (i) the closed-form optimal representation for each method, (ii) the closed-form optimal network parameters in the linear regime for each method, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performances, and most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods towards global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example, (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover the supervised method, but in the low data regime, SimCLR or VICReg with high invariance hyper-parameter should be preferred; (ii) if the pairwise relation is misaligned with the downstream task, BarlowTwins or VICReg with small invariance hyper-parameter should be preferred.
https://arxiv.org/abs/2205.11508
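For context, Laplacian Eigenmaps, one of the spectral methods the paper links to non-contrastive SSL, solves the standard textbook problem below (generic notation, not the paper's):

\min_{Z \in \mathbb{R}^{n \times d}} \sum_{i,j} W_{ij} \, \lVert z_i - z_j \rVert^2 \quad \text{s.t.} \quad Z^\top D Z = I

Here W encodes the pairwise (positive-pair) relations, D is its degree matrix, and the solution is given by the bottom non-trivial eigenvectors of the graph Laplacian L = D - W; the paper's result is that SSL objectives recover such global or local spectral embeddings.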
3、[CV] Neural 3D Reconstruction in the Wild
J Sun, X Chen, Q Wang, Z Li, H Averbuch-Elor, X Zhou, N Snavely
[Image Derivative Inc. & Zhejiang University & Cornell University]
Neural 3D reconstruction in the wild. We are witnessing an explosion of neural implicit representations in computer vision and graphics. Their applicability has recently expanded beyond tasks such as shape generation and image-based rendering to the fundamental problem of image-based 3D reconstruction. However, existing methods typically assume constrained 3D environments with constant illumination, captured by a small set of roughly uniformly distributed cameras. This paper presents a new method that enables efficient and accurate surface reconstruction from Internet photo collections in the presence of varying illumination. To achieve this, it proposes a hybrid voxel- and surface-guided sampling technique that allows more efficient ray sampling around surfaces and leads to significant improvements in reconstruction quality. A new benchmark and protocol are also introduced for evaluating reconstruction performance on such in-the-wild scenes. Extensive experiments demonstrate that the method surpasses both classical and neural reconstruction methods on a wide variety of metrics.
We are witnessing an explosion of neural implicit representations in computer vision and graphics. Their applicability has recently expanded beyond tasks such as shape generation and image-based rendering to the fundamental problem of image-based 3D reconstruction. However, existing methods typically assume constrained 3D environments with constant illumination captured by a small set of roughly uniformly distributed cameras. We introduce a new method that enables efficient and accurate surface reconstruction from Internet photo collections in the presence of varying illumination. To achieve this, we propose a hybrid voxel- and surface-guided sampling technique that allows for more efficient ray sampling around surfaces and leads to significant improvements in reconstruction quality. Further, we present a new benchmark and protocol for evaluating reconstruction performance on such in-the-wild scenes. We perform extensive experiments, demonstrating that our approach surpasses both classical and neural reconstruction methods on a wide variety of metrics. Code and data will be made available at https://zju3dv.github.io/neuralrecon-w.
https://arxiv.org/abs/2205.12955
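The abstract describes the hybrid voxel- and surface-guided sampling only at a high level. The sketch below shows one plausible reading: coarse samples restricted to the ray segment inside occupied voxels, plus fine samples concentrated around an estimated surface depth. The function name, parameters, and the Gaussian sampling rule are illustrative assumptions, not the paper's actual algorithm.

import numpy as np

def hybrid_ray_samples(t_near, t_far, surface_t=None, n_coarse=32, n_fine=32, sigma=0.05):
    # t_near / t_far: ray segment clipped to occupied voxels (assumed given by a
    # sparse voxel grid); surface_t: estimated surface depth along the ray, if any.
    coarse = np.linspace(t_near, t_far, n_coarse)            # voxel-guided coverage
    if surface_t is None:
        return coarse
    fine = np.random.normal(surface_t, sigma, size=n_fine)   # surface-guided samples
    fine = np.clip(fine, t_near, t_far)
    return np.sort(np.concatenate([coarse, fine]))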
4、[CV] Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation
V Voleti, A Jolicoeur-Martineau, C Pal
[Mila, University of Montreal & Canada CIFAR AI Chair]
Masked conditional video diffusion for prediction, generation, and interpolation. Video prediction is a challenging task. Video frames produced by current state-of-the-art (SOTA) generative models tend to be of poor quality, and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks typically cannot simultaneously handle other video-related tasks such as unconditional generation or interpolation. This paper devises a general-purpose framework for all of these video synthesis tasks, Masked Conditional Video Diffusion (MCVD), using a probabilistic conditional score-based denoising diffusion model conditioned on past and/or future frames. The model is trained by randomly and independently masking all past frames or all future frames. This new but straightforward setup allows training a single model that can execute a broad range of video tasks, specifically: future/past prediction, when only future/past frames are masked; unconditional generation, when both past and future frames are masked; and interpolation, when neither past nor future frames are masked. Experiments show that this approach can generate high-quality frames for diverse types of videos. The MCVD models are built from simple non-recurrent 2D convolutional architectures, conditioning on blocks of frames and generating blocks of frames, and they generate videos of arbitrary length autoregressively in a block-wise manner. The method yields SOTA results on standard video prediction and interpolation benchmarks, with training times of 1-12 days on ≤4 GPUs.
Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction – when only future/past frames are masked; unconditional generation – when both past and future frames are masked; and interpolation – when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using ≤ 4 GPUs. https://mask-cond-video-diffusion.github.io/
https://arxiv.org/abs/2205.09853
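A minimal sketch of the masking scheme described in the abstract, independently dropping all past frames and/or all future frames from the conditioning; the masking probability, tensor layout, and function name are assumptions for illustration.

import torch

def mask_conditioning(past, future, p_mask=0.5):
    # past, future: conditioning frames of shape (batch, frames, C, H, W).
    # With independent probability p_mask, zero out ALL past frames and,
    # separately, ALL future frames for each sample in the batch.
    b = past.shape[0]
    keep_past = (torch.rand(b, device=past.device) > p_mask).float().view(b, 1, 1, 1, 1)
    keep_future = (torch.rand(b, device=future.device) > p_mask).float().view(b, 1, 1, 1, 1)
    # Both masked -> unconditional generation; only future masked -> future
    # prediction; neither masked -> interpolation (as described in the abstract).
    return past * keep_past, future * keep_future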
5、[CV] Compression ensembles quantify aesthetic complexity and the evolution of visual art
A Karjus, M C Solà, T Ohm, S E. Ahnert, M Schich
[Tallinn University & University of Cambridge]
Compression ensembles quantify aesthetic complexity and the evolution of visual art. The quantification of visual aesthetics and complexity has a long history, the latter previously operationalized via compression algorithms. This paper generalizes and extends the compression approach beyond simple complexity measures to quantify algorithmic distance in historical and contemporary visual media. The proposed "ensemble" approach works by compressing a large number of transformed versions of a given input image, yielding a vector of associated compression ratios. This approach is more efficient than other compression-based algorithmic distances and is particularly suited to the quantitative analysis of visual artifacts, because human creative processes can be understood as algorithms in the broadest sense. Unlike comparable image embedding methods based on machine learning, the approach is fully explainable through the transformations. The method is shown to be cognitively plausible and fit for purpose by evaluating it against human complexity judgments and on automated authorship and style detection tasks. The paper shows how the approach can reveal and quantify trends in art-historical data, both on the scale of centuries and in the rapidly evolving contemporary NFT art market. Temporal resemblance is further quantified to distinguish artists outside the documented mainstream from those deeply embedded in the Zeitgeist. Compression ensembles constitute a quantitative representation of the concept of visual family resemblance, since distinct sets of dimensions correspond to shared visual characteristics that are otherwise hard to pin down. The approach offers a new perspective for the study of visual art, algorithmic image analysis, and quantitative aesthetics more generally.
The quantification of visual aesthetics and complexity have a long history, the latter previously operationalized via the application of compression algorithms. Here we generalize and extend the compression approach beyond simple complexity measures to quantify algorithmic distance in historical and contemporary visual media. The proposed “ensemble” approach works by compressing a large number of transformed versions of a given input image, resulting in a vector of associated compression ratios. This approach is more efficient than other compression-based algorithmic distances, and is particularly suited for the quantitative analysis of visual artifacts, because human creative processes can be understood as algorithms in the broadest sense. Unlike comparable image embedding methods using machine learning, our approach is fully explainable through the transformations. We demonstrate that the method is cognitively plausible and fit for purpose by evaluating it against human complexity judgments, and on automated detection tasks of authorship and style. We show how the approach can be used to reveal and quantify trends in art historical data, both on the scale of centuries and in rapidly evolving contemporary NFT art markets. We further quantify temporal resemblance to disambiguate artists outside the documented mainstream from those who are deeply embedded in Zeitgeist. Finally, we note that compression ensembles constitute a quantitative representation of the concept of visual family resemblance, as distinct sets of dimensions correspond to shared visual characteristics otherwise hard to pin down. Our approach provides a new perspective for the study of visual art, algorithmic image analysis, and quantitative aesthetics more generally.
https://arxiv.org/abs/2205.10271
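A minimal sketch of the "compression ensemble" idea from the abstract: apply a set of image transformations and record a compression ratio for each, producing a feature vector per image. The particular transformations, the PNG compressor, and the function names are illustrative assumptions; the paper's actual transformation set and compressor may differ.

import io
import numpy as np
from PIL import Image, ImageFilter, ImageOps

def compressed_size(img):
    # Bytes needed to store the image with lossless PNG compression.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.tell()

def compression_ensemble(img):
    # One compression ratio per transformation (compressed size of the
    # transformed image relative to the original) -> feature vector.
    transforms = [
        lambda x: x,                                        # identity
        lambda x: ImageOps.grayscale(x).convert("RGB"),     # drop color
        lambda x: x.filter(ImageFilter.GaussianBlur(2)),    # drop fine detail
        lambda x: x.resize((x.width // 2, x.height // 2)),  # downscale
        lambda x: x.filter(ImageFilter.FIND_EDGES),         # keep edges only
    ]
    base = compressed_size(img)
    return np.array([compressed_size(t(img)) / base for t in transforms])

# Example usage: vec = compression_ensemble(Image.open("painting.jpg").convert("RGB"))

Distances between such vectors for two images would then play the role of the algorithmic distance mentioned in the abstract.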
A few more papers worth attention:
[CV] Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images
Y Han, R Wang, J Yang
[Tsinghua University & USTC & Microsoft Research Asia]
https://arxiv.org/abs/2205.11733
[CL] Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
K Tirumala, A H. Markosyan, L Zettlemoyer, A Aghajanyan
[Meta AI Research]
https://arxiv.org/abs/2205.10770
[CV] Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Z Wang, M Li, R Xu, L Zhou, J Lei, X Lin, S Wang, Z Yang, C Zhu, D Hoiem, S Chang…
[UIUC & MSR & UNC & Columbia University]
https://arxiv.org/abs/2205.10747
[LG] The Selectively Adaptive Lasso
A Schuler, M v d Laan
https://arxiv.org/abs/2205.10697