Frontier Paper Recommendations
-
Reposted from: 爱可可-爱生活 (Zhihu)
Summary: neural feature fusion fields; geometric multimodal representation learning; unsupervised scene sketch to photo synthesis; translating raw descriptions into images via prompt-based cross-modal generation; studying GAN bias through the lens of race; text-free learning of a natural language interface for pretrained face generators; specializing foundation models for expert task applications; non-Gaussian process regression; customized prompt generation for zero-shot image classification.
1. [CV] Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations
V Tschernezki, I Laina, D Larlus, A Vedaldi
[University of Oxford & NAVER LABS Europe]
Neural Feature Fusion Fields: 3D distillation of self-supervised 2D image representations. The paper proposes Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extractors when they are applied to multiple images that can be reconstructed as a 3D scene. Given an image feature extractor, for example one pre-trained with self-supervision, N3F uses it as a teacher to learn a student network defined in 3D space. The 3D student resembles a neural radiance field that distills the teacher's features and can be trained with the usual differentiable rendering machinery. N3F is therefore readily applicable to most neural rendering formulations, including vanilla NeRF and its extensions to complex dynamic scenes. The method not only enables semantic understanding of scene-specific neural fields without any manual labels, but also consistently improves over the self-supervised 2D baselines, as demonstrated on tasks such as 2D object retrieval, 3D segmentation, and scene editing across diverse sequences, including long egocentric videos from the EPIC-KITCHENS benchmark.
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extractors when the latter are applied to the analysis of multiple images reconstructible as a 3D scene. Given an image feature extractor, for example pre-trained using self-supervision, N3F uses it as a teacher to learn a student network defined in 3D space. The 3D student network is similar to a neural radiance field that distills said features and can be trained with the usual differentiable rendering machinery. As a consequence, N3F is readily applicable to most neural rendering formulations, including vanilla NeRF and its extensions to complex dynamic scenes. We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines. This is demonstrated by considering various tasks, such as 2D object retrieval, 3D segmentation, and scene editing, in diverse sequences, including long egocentric videos in the EPIC-KITCHENS benchmark. Project page: https://www.robots.ox.ac.uk/~vadim/n3f/
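As a rough, hypothetical sketch of the distillation idea above (not the authors' code): a frozen 2D teacher such as a self-supervised ViT supervises a 3D feature field through the same volume-rendering weights used for color. The names feature_field and render_weights are placeholder assumptions.

    import torch

    # N3F-style distillation sketch: composite the 3D student's features along
    # each ray with the volume-rendering weights and regress them onto the
    # teacher's 2D features at the corresponding pixels.
    def distill_features(teacher_feats_2d, feature_field, points, render_weights):
        # teacher_feats_2d: (R, C) teacher features at the pixels hit by R rays
        # points:           (R, S, 3) sample locations along each ray
        # render_weights:   (R, S) volume-rendering compositing weights
        student_3d = feature_field(points)                                 # (R, S, C)
        student_2d = (render_weights[..., None] * student_3d).sum(dim=1)   # (R, C)
        return torch.nn.functional.mse_loss(student_2d, teacher_feats_2d)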
https://arxiv.org/abs/2209.03494
2. [LG] Geometric multimodal representation learning
Y Ektefaie, G Dasoulas, A Noori, M Farhat, M Zitnik
[Harvard Medical School & Harvard University]
Geometric multimodal representation learning. Graph-centric artificial intelligence (graph AI) has achieved remarkable success in modeling the interacting systems that are prevalent in nature, from dynamical systems in biology to particle physics. The increasing heterogeneity of data calls for graph neural architectures that can combine multiple inductive biases; however, combining data from different sources is challenging because the appropriate inductive bias may vary by data modality. Multimodal learning methods address this challenge by fusing multiple data modalities while leveraging cross-modal dependencies. The paper surveys 140 studies in graph-centric AI and finds that diverse data types are increasingly brought together using graphs and fed into sophisticated multimodal models, which stratify into image-, language-, and knowledge-grounded multimodal learning. Based on this categorization, the paper puts forward an algorithmic blueprint for multimodal graph learning, which groups state-of-the-art architectures for multimodal data according to the choice of four different components. This effort can pave the way toward standardized designs of sophisticated multimodal architectures for highly complex real-world problems.
Graph-centric artificial intelligence (graph AI) has achieved remarkable success in modeling interacting systems prevalent in nature, from dynamical systems in biology to particle physics. The increasing heterogeneity of data calls for graph neural architectures that can combine multiple inductive biases. However, combining data from various sources is challenging because appropriate inductive bias may vary by data modality. Multimodal learning methods fuse multiple data modalities while leveraging cross-modal dependencies to address this challenge. Here, we survey 140 studies in graph-centric AI and realize that diverse data types are increasingly brought together using graphs and fed into sophisticated multimodal models. These models stratify into image-, language-, and knowledge-grounded multimodal learning. We put forward an algorithmic blueprint for multimodal graph learning based on this categorization. The blueprint serves as a way to group state-of-the-art architectures that treat multimodal data by choosing appropriately four different components. This effort can pave the way for standardizing the design of sophisticated multimodal architectures for highly complex real-world problems.
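The survey does not prescribe a single implementation, but a generic multimodal graph model in the spirit of the blueprint (per-modality encoders, a graph over entities, message passing, a fused readout) might look like the sketch below; the module names and the single propagation round are illustrative assumptions, not the paper's actual components.

    import torch
    import torch.nn as nn

    class MultimodalGraphModel(nn.Module):
        # Toy multimodal graph learner: project each modality into a shared space,
        # place the embeddings on graph nodes, propagate once over the adjacency
        # matrix, then read out a graph-level prediction.
        def __init__(self, d=128):
            super().__init__()
            self.image_enc = nn.LazyLinear(d)   # stand-in for a CNN/ViT encoder
            self.text_enc = nn.LazyLinear(d)    # stand-in for a language encoder
            self.msg = nn.Linear(d, d)          # one round of message passing
            self.readout = nn.Linear(d, 1)

        def forward(self, image_feats, text_feats, adj):
            x = torch.cat([self.image_enc(image_feats),
                           self.text_enc(text_feats)], dim=0)   # (N, d) node states
            x = torch.relu(x + adj @ self.msg(x))               # aggregate neighbor messages
            return self.readout(x.mean(dim=0))                  # graph-level output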
https://arxiv.org/abs/2209.03299
3. [CV] Unsupervised Scene Sketch to Photo Synthesis
J Wang, S Jeon, S X. Yu, X Zhang, H Arora, Y Lou
[UC Berkeley & Amazon]
Unsupervised scene sketch to photo synthesis. Sketches are an intuitive and powerful form of visual expression, since they are quickly executed freehand drawings. The paper presents a method for synthesizing realistic photos from scene sketches. Without requiring sketch-photo pairs, the framework learns directly from readily available large-scale photo datasets in an unsupervised manner. To this end, a standardization module is introduced that provides pseudo sketch-photo pairs during training by converting both photos and sketches into a standardized domain, namely the edge map. The reduced domain gap between sketches and photos also makes it possible to disentangle them into two components: the holistic scene structure and low-level visual styles such as color and texture. Exploiting this, a photo-realistic image is synthesized by combining the structure of a sketch with the visual style of a reference photo. Extensive experiments on perceptual similarity metrics and human perception studies show that the method generates realistic, high-fidelity photos from scene sketches and outperforms state-of-the-art photo synthesis baselines. The framework also enables controllable manipulation of photo synthesis by editing strokes of the corresponding sketch, delivering finer-grained detail than previous approaches that rely on region-level editing.
Sketches make an intuitive and powerful visual expression as they are fast executed freehand drawings. We present a method for synthesizing realistic photos from scene sketches. Without the need for sketch and photo pairs, our framework directly learns from readily available large-scale photo datasets in an unsupervised manner. To this end, we introduce a standardization module that provides pseudo sketch-photo pairs during training by converting photos and sketches to a standardized domain, i.e. the edge map. The reduced domain gap between sketch and photo also allows us to disentangle them into two components: holistic scene structures and low-level visual styles such as color and texture. Taking this advantage, we synthesize a photo-realistic image by combining the structure of a sketch and the visual style of a reference photo. Extensive experimental results on perceptual similarity metrics and human perceptual studies show the proposed method could generate realistic photos with high fidelity from scene sketches and outperform state-of-the-art photo synthesis baselines. We also demonstrate that our framework facilitates a controllable manipulation of photo synthesis by editing strokes of corresponding sketches, delivering more fine-grained details than previous approaches that rely on region-level editing.
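A rough illustration of the pseudo-pairing step described above (not the paper's learned standardization module): during training, photos can be mapped into a shared edge-map domain so that each photo supplies its own "sketch". A plain Canny detector stands in here for the standardization step.

    import cv2
    import numpy as np

    def pseudo_pair(photo_bgr: np.ndarray):
        # Map the photo into the standardized edge-map domain to act as a
        # pseudo sketch; the photo itself is the reconstruction target.
        gray = cv2.cvtColor(photo_bgr, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, threshold1=100, threshold2=200)
        return edges, photo_bgr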
https://arxiv.org/abs/2209.02834
4. [CV] AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation
Y Ma, H Yang, B Liu, J Fu, J Liu
[Peking University & Microsoft Research]
AI Illustrator: translating raw descriptions into images by prompt-based cross-modal generation. An AI illustrator aims to automatically design visually appealing images for books that evoke rich thoughts and emotions. To achieve this, the paper proposes a framework for translating raw descriptions with complex semantics into semantically corresponding images. The main challenge lies in the semantic complexity of raw descriptions, which can be hard to visualize (e.g., "gloomy" or "Asian") and which existing methods struggle to handle. To address this, the paper proposes a Prompt-based Cross-Modal generation framework (PCM-Frame) that leverages two powerful pre-trained models, CLIP and StyleGAN. The framework consists of two components: a prompt-based projection module from text embeddings to image embeddings, and an adapted image generation module built on StyleGAN that takes image embeddings as input and is trained with combined semantic consistency losses. To bridge the gap between realistic images and illustration designs, a stylization model is further adopted as post-processing for better visual effects. Benefiting from the pre-trained models, the method can handle complex descriptions and requires no external paired data for training. The paper also builds a benchmark of 200 raw descriptions, and a user study demonstrates the framework's advantage over competing methods on complicated texts.
AI illustrator aims to automatically design visually appealing images for books to provoke rich thoughts and emotions. To achieve this goal, we propose a framework for translating raw descriptions with complex semantics into semantically corresponding images. The main challenge lies in the complexity of the semantics of raw descriptions, which may be hard to be visualized (e.g., “gloomy” or “Asian”). It usually poses challenges for existing methods to handle such descriptions. To address this issue, we propose a Prompt-based Cross-Modal Generation Framework (PCM-Frame) to leverage two powerful pre-trained models, including CLIP and StyleGAN. Our framework consists of two components: a projection module from Text Embeddings to Image Embeddings based on prompts, and an adapted image generation module built on StyleGAN which takes Image Embeddings as inputs and is trained by combined semantic consistency losses. To bridge the gap between realistic images and illustration designs, we further adopt a stylization model as post-processing in our framework for better visual effects. Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training. Furthermore, we have built a benchmark that consists of 200 raw descriptions. We conduct a user study to demonstrate our superiority over the competing methods with complicated texts. We release our code at this https URL.
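A hedged sketch of the two components described above: a projector maps a CLIP text embedding to a CLIP image embedding, which conditions a StyleGAN-like generator, and a semantic consistency term re-embeds the synthesized image with CLIP. Here clip_text_encode, clip_image_encode, and generator are placeholders for the pretrained models, and the 512-dimensional width is an assumption.

    import torch
    import torch.nn as nn

    # Projection from text embeddings to image embeddings (placeholder widths).
    projector = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

    def semantic_consistency_loss(text_tokens, clip_text_encode, clip_image_encode, generator):
        t = clip_text_encode(text_tokens)   # (B, 512) text embedding
        img_emb = projector(t)              # predicted image embedding
        image = generator(img_emb)          # StyleGAN-style synthesis
        i = clip_image_encode(image)        # re-embed the generated image
        # Pull the generated image's CLIP embedding toward the description.
        return 1 - torch.nn.functional.cosine_similarity(i, t, dim=-1).mean()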
https://arxiv.org/abs/2209.03160
5. [CV] Studying Bias in GANs through the Lens of Race
V H. Maluleke, N Thakkar, T Brooks, E Weber, T Darrell, A A. Efros, A Kanazawa, D Guillory
[UC Berkeley]
Studying bias in GANs through the lens of race. The paper studies how the performance and evaluation of generative image models are affected by the racial composition of their training datasets. By examining and controlling the racial distributions of various training datasets, the authors observe how different training distributions affect generated image quality and the racial distribution of the generated images. The results show that the racial composition of generated images successfully preserves that of the training data. However, truncation, a technique used at inference time to generate higher-quality images, is observed to exacerbate racial imbalances in the data. When examining the relationship between image quality and race, the images of a given race with the highest perceived visual quality come from a distribution in which that race is well represented, and annotators consistently prefer generated images of white people over those of Black people.
In this work, we study how the performance and evaluation of generative image models are impacted by the racial composition of their training datasets. By examining and controlling the racial distributions in various training datasets, we are able to observe the impacts of different training distributions on generated image quality and the racial distributions of the generated images. Our results show that the racial compositions of generated images successfully preserve that of the training data. However, we observe that truncation, a technique used to generate higher quality images during inference, exacerbates racial imbalances in the data. Lastly, when examining the relationship between image quality and race, we find that the highest perceived visual quality images of a given race come from a distribution where that race is well-represented, and that annotators consistently prefer generated images of white people over those of Black people.
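For context, the "truncation" mentioned in the abstract is the standard StyleGAN-family trick of pulling sampled latents toward the mean latent, trading diversity for fidelity; a minimal sketch (tensor shapes and the default psi are assumptions) is:

    import torch

    def truncate(w: torch.Tensor, w_avg: torch.Tensor, psi: float = 0.7) -> torch.Tensor:
        # psi = 1.0 leaves the latent unchanged; smaller psi moves samples toward
        # the average latent, raising fidelity but shrinking diversity.
        return w_avg + psi * (w - w_avg)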
https://arxiv.org/abs/2209.02836
Other papers worth noting:
[CV] Text-Free Learning of a Natural Language Interface for Pretrained Face Generators
X Du, R A. Yeh, N Kolkin, E Shechtman, G Shakhnarovich
[TTI-Chicago & Purdue University & Adobe Research]
https://arxiv.org/abs/2209.03953
[CV] FETA: Towards Specializing Foundation Models for Expert Task Applications
A Alfassy, A Arbelle, O Halimi, S Harary…
[IBM Research & MIT-IBM AI-Watson Lab]
https://arxiv.org/abs/2209.03648
[LG] Non-Gaussian Process Regression
Y Kındap, S Godsill
[University of Cambridge]
https://arxiv.org/abs/2209.03117
[CV] What does a platypus look like? Generating customized prompts for zero-shot image classification
S Pratt, R Liu, A Farhadi
[University of Washington & Google Research]
https://arxiv.org/abs/2209.03320