Frontier Highlights
Reposted from: Zhihu (爱可可-爱生活)
Summary: a comprehensive survey of image augmentation techniques for deep learning; lifting the curse of multilinguality by pre-training modular Transformers; simple open-vocabulary object detection with Vision Transformers; 3D Moments from near-duplicate photos; interactive model cards; a sparsely activated model for general-purpose natural language understanding; fundamental limitations on optimization in variational quantum algorithms; dual diffusion implicit bridges for image-to-image translation; high-accuracy differentially private image classification
1. [CV] A Comprehensive Survey of Image Augmentation Techniques for Deep Learning
M Xu, S Yoon, A Fuentes, D S Park
[Jeonbuk National University & Mokpo National University]
A comprehensive survey of image augmentation techniques for deep learning. Deep learning has achieved decent performance in computer vision, which requires large volumes of images; in many scenarios, however, collecting images is expensive and difficult. To alleviate this problem, many image augmentation algorithms have been proposed as effective and efficient strategies. Understanding the current algorithms is essential for finding a suitable method or developing new techniques for a given task. This paper presents a comprehensive survey of image augmentation for deep learning, organized by a new informative taxonomy. To convey the basic idea of why image augmentation is needed, it introduces the fundamental challenges of computer vision tasks. The image augmentation algorithms are split into three categories: model-free, model-based, and optimizing-policy-based. The model-free category employs image processing methods, while model-based methods leverage trainable image generation models; in contrast, optimizing-policy-based approaches aim to find the optimal operations or combinations of them. In addition, the paper discusses current trends in common applications together with two more active topics: leveraging different ways of understanding image augmentation, such as group and kernel theory, and deploying image augmentation for unsupervised learning.
Deep learning has been achieving decent performance in computer vision requiring a large volume of images, however, collecting images is expensive and difficult in many scenarios. To alleviate this issue, many image augmentation algorithms have been proposed as effective and efficient strategies. Understanding current algorithms is essential to find suitable methods or develop novel techniques for given tasks. In this paper, we perform a comprehensive survey on image augmentation for deep learning with a novel informative taxonomy. To get the basic idea why we need image augmentation, we introduce the challenges in computer vision tasks and vicinity distribution. Then, the algorithms are split into three categories: model-free, model-based, and optimizing policy-based. The model-free category employs image processing methods while the model-based method leverages trainable image generation models. In contrast, the optimizing policy-based approach aims to find the optimal operations or their combinations. Furthermore, we discuss the current trend of common applications with two more active topics, leveraging different ways to understand image augmentation, such as group and kernel theory, and deploying image augmentation for unsupervised learning. Based on the analysis, we believe that our survey gives a better understanding helpful to choose suitable methods or design novel algorithms for practical applications.
https://arxiv.org/abs/2205.01491
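To make the taxonomy concrete, below is a minimal sketch of the "model-free" category described above: classical image-processing operations applied on the fly, plus a mixup-style combination of samples. The function names and parameters are illustrative assumptions, not code from the survey.

```python
import numpy as np

def random_flip(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Horizontally flip an HxWxC image with probability 0.5."""
    return img[:, ::-1, :] if rng.random() < 0.5 else img

def random_crop(img: np.ndarray, size: int, rng: np.random.Generator) -> np.ndarray:
    """Crop a random size x size patch (assumes the image is at least size x size)."""
    h, w, _ = img.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size, :]

def mixup(img_a, img_b, label_a, label_b, rng, alpha=0.2):
    """Mixup: a convex combination of two images and their one-hot labels."""
    lam = rng.beta(alpha, alpha)
    return lam * img_a + (1 - lam) * img_b, lam * label_a + (1 - lam) * label_b

# Usage: augment one synthetic 32x32 RGB image.
rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
aug = random_crop(random_flip(img, rng), size=28, rng=rng)
print(aug.shape)  # (28, 28, 3)
```

Model-based methods would replace these fixed operations with a trainable generator, and optimizing-policy-based methods would search over which operations (and their magnitudes) to apply.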
2. [CL] Lifting the Curse of Multilinguality by Pre-training Modular Transformers
J Pfeiffer, N Goyal, X V Lin, X Li, J Cross, S Riedel, M Artetxe
[New York University & Meta AI]
Lifting the curse of multilinguality by pre-training modular Transformers. Multilingual pre-trained models are known to suffer from the curse of multilinguality: per-language performance drops as the model covers more languages. This paper addresses the issue by introducing language-specific modules, which make it possible to grow the total capacity of the model while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, the modules of the proposed Cross-lingual Modular (XMOD) models are pre-trained from the start. Experiments on natural language inference, named entity recognition, and question answering show that the approach not only mitigates negative interference between languages but also enables positive transfer, improving both monolingual and cross-lingual performance. Moreover, languages can be added post-hoc with no measurable drop in performance, so use of the model is no longer limited to the set of pre-trained languages.
Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (XMOD) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates the negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.
https://arxiv.org/abs/2205.06266
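As a concrete illustration of the modular idea above: each Transformer layer carries a small bottleneck module per language, only the module matching the input language runs on a given forward pass, and adding a language adds one module without growing the per-language trainable parameters. This is a minimal sketch of the concept, not the authors' implementation; all class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class ModularLayer(nn.Module):
    """Shared Transformer sublayers plus one small bottleneck module per language."""
    def __init__(self, d_model, n_heads, d_bottleneck, languages):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # shared
        self.norm = nn.LayerNorm(d_model)
        # One module per language; adding a language post-hoc adds one entry here.
        self.lang_modules = nn.ModuleDict({
            lang: nn.Sequential(
                nn.Linear(d_model, d_bottleneck), nn.ReLU(),
                nn.Linear(d_bottleneck, d_model),
            ) for lang in languages
        })

    def forward(self, x, lang):
        h, _ = self.attn(x, x, x)                # shared computation
        h = self.norm(x + h)
        return h + self.lang_modules[lang](h)    # only this language's module runs

layer = ModularLayer(d_model=64, n_heads=4, d_bottleneck=16, languages=["en", "de"])
x = torch.randn(2, 10, 64)     # (batch, sequence, d_model)
print(layer(x, "de").shape)    # torch.Size([2, 10, 64])
```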
3. [CV] Simple Open-Vocabulary Object Detection with Vision Transformers
M Minderer, A Gritsenko, A Stone, M Neumann, D Weissenborn, A Dosovitskiy, A Mahendran, A Arnab, M Dehghani, Z Shen…
[Google Research]
Simple open-vocabulary object detection with Vision Transformers. Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary settings, where training data is relatively scarce. This paper proposes a strong recipe for transferring image-text models to open-vocabulary object detection: a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. An analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yields consistent improvements on the downstream detection task. The paper provides the adaptation strategies and regularizations needed to attain strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection.
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
https://arxiv.org/abs/2205.06230
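The zero-shot, text-conditioned part of the recipe can be sketched in a few lines: the detector produces one embedding per predicted box, arbitrary class names are embedded by the text encoder, and boxes are labeled by similarity, so the vocabulary is chosen at inference time. The encoders below are random stubs standing in for the contrastively pre-trained image and text Transformers; nothing here is the paper's actual model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64  # shared embedding dimension of the (stand-in) image-text model

def encode_queries(class_names):
    """Stub text encoder: one embedding per free-form class name."""
    return F.normalize(torch.randn(len(class_names), d), dim=-1)

def encode_objects(image, n_objects=100):
    """Stub detection head: one embedding per predicted box (box coordinates omitted)."""
    return F.normalize(torch.randn(n_objects, d), dim=-1)

# The label set is supplied as free text at inference time -- no fixed vocabulary.
queries = encode_queries(["a cat", "a red bicycle", "a coffee mug"])
objects = encode_objects(torch.zeros(3, 224, 224))

logits = objects @ queries.T           # (n_objects, n_queries) similarity scores
scores, labels = logits.max(dim=-1)    # best-matching query per predicted box
print(scores.shape, labels.shape)      # torch.Size([100]) torch.Size([100])
```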
4. [CV] 3D Moments from Near-Duplicate Photos
Q Wang, Z Li, D Salesin, N Snavely, B Curless, J Kontkanen
[Google Research]
3D Moments from near-duplicate photos. The paper introduces 3D Moments, a new computational photography effect. The input is a pair of near-duplicate photos, i.e., photos of a moving subject taken from similar viewpoints, as are common in people's photo collections. The output is a video that smoothly interpolates the scene motion from the first photo to the second, while also producing camera motion with parallax that gives a heightened sense of 3D. To achieve this effect, the scene is represented as a pair of feature-based layered depth images augmented with scene flow. This representation enables motion interpolation combined with independent control of the camera viewpoint. The system produces photorealistic space-time videos with motion parallax and scene dynamics, while plausibly recovering regions occluded in the original views. Extensive experiments demonstrate performance superior to baselines on both public datasets and in-the-wild photos.
We introduce 3D Moments, a new computational photography effect. As input we take a pair of near-duplicate photos, i.e., photos of moving subjects from similar viewpoints, common in people’s photo collections. As output, we produce a video that smoothly interpolates the scene motion from the first photo to the second, while also producing camera motion with parallax that gives a heightened sense of 3D. To achieve this effect, we represent the scene as a pair of feature-based layered depth images augmented with scene flow. This representation enables motion interpolation along with independent control of the camera viewpoint. Our system produces photorealistic space-time videos with motion parallax and scene dynamics, while plausibly recovering regions occluded in the original views. We conduct extensive experiments demonstrating superior performance over baselines on public datasets and in-the-wild photos. Project page: https://3d-moments.github.io/.
https://arxiv.org/abs/2205.06255
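The interpolation at the heart of the effect can be illustrated numerically: each 3D point carries a scene-flow vector toward its position in the second photo, an intermediate frame at time t moves points a fraction t along that vector, and the virtual camera moves independently to create parallax. This toy sketch assumes linear flow and a pinhole camera looking down +z; it is not the paper's layered-depth-image rendering pipeline.

```python
import numpy as np

def interpolate_points(points_t0, scene_flow, t):
    """Move each 3D point a fraction t (0 <= t <= 1) along its scene-flow vector."""
    return points_t0 + t * scene_flow

def project(points, camera_pos, focal=1.0):
    """Toy pinhole projection relative to a translated camera."""
    rel = points - camera_pos
    return focal * rel[:, :2] / rel[:, 2:3]

points = np.array([[0.0, 0.0, 2.0], [0.5, 0.1, 3.0]])  # 3D points at photo 1
flow = np.array([[0.2, 0.0, 0.0], [0.0, 0.3, 0.0]])    # displacement toward photo 2

# Scene motion and camera motion are controlled independently:
for t in (0.0, 0.5, 1.0):
    cam = np.array([0.1 * t, 0.0, 0.0])                # virtual camera path (parallax)
    print(t, project(interpolate_points(points, flow, t), cam))
```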
5. [CL] Interactive Model Cards: A Human-Centered Approach to Model Documentation
A Crisan, M Drouhard, J Vig, N Rajani
[Tableau Research & User Research & Salesforce Research & Hugging Face]
Interactive model cards: a human-centered approach to model documentation. Deep learning models for natural language processing (NLP) are increasingly adopted and deployed by analysts without formal training in NLP or machine learning (ML). However, the documentation intended to convey a model's details and appropriate use is tailored primarily to individuals with ML or NLP expertise. To address this gap, the paper conducts a design inquiry into interactive model cards, which augment traditionally static model cards with affordances for exploring the model documentation and interacting with the model itself. The investigation consists of an initial conceptual study with experts in ML, NLP, and AI ethics, followed by a separate evaluative study with non-expert analysts who use ML models in their work. Using a semi-structured interview format coupled with a think-aloud protocol, feedback was collected from a total of 30 participants who engaged with different versions of standard and interactive model cards. A thematic analysis of the collected data identified several conceptual dimensions that summarize the strengths and limitations of standard and interactive model cards, including: stakeholders; design; guidance; understandability and interpretability; sensemaking and skepticism; and trust and safety. The findings demonstrate the importance of carefully considered design and interactivity for orienting and supporting non-expert analysts who use deep learning models, along with the need to consider broader sociotechnical contexts and organizational dynamics. The paper also identifies design elements, such as language, visual cues, and warnings, that support interactivity and make non-interactive content accessible, summarizes the findings as design guidelines, and discusses their implications for a human-centered approach to AI/ML documentation.
Deep learning models for natural language processing (NLP) are increasingly adopted and deployed by analysts without formal training in NLP or machine learning (ML). However, the documentation intended to convey the model’s details and appropriate use is tailored primarily to individuals with ML or NLP expertise. To address this gap, we conduct a design inquiry into interactive model cards, which augment traditionally static model cards with affordances for exploring model documentation and interacting with the models themselves. Our investigation consists of an initial conceptual study with experts in ML, NLP, and AI Ethics, followed by a separate evaluative study with non-expert analysts who use ML models in their work. Using a semi-structured interview format coupled with a think-aloud protocol, we collected feedback from a total of 30 participants who engaged with different versions of standard and interactive model cards. Through a thematic analysis of the collected data, we identified several conceptual dimensions that summarize the strengths and limitations of standard and interactive model cards, including: stakeholders; design; guidance; understandability & interpretability; sensemaking & skepticism; and trust & safety. Our findings demonstrate the importance of carefully considered design and interactivity for orienting and supporting non-expert analysts using deep learning models, along with a need for consideration of broader sociotechnical contexts and organizational dynamics. We have also identified design elements, such as language, visual cues, and warnings, among others, that support interactivity and make non-interactive content accessible. We summarize our findings as design guidelines and discuss their implications for a human-centered approach towards AI/ML documentation.
https://arxiv.org/abs/2205.02894
Several more papers worth noting:
[CL] SkillNet-NLU: A Sparsely Activated Model for General-Purpose Natural Language Understanding
SkillNet-NLU: a sparsely activated model for general-purpose natural language understanding
F Zhang, D Tang, Y Dai, C Zhou, S Wu, S Shi
[Tencent AI Lab]
https://arxiv.org/abs/2203.03312
[LG] Fundamental limitations on optimization in variational quantum algorithms
Fundamental limitations on optimization in variational quantum algorithms
H Zhang, C Zhu, G Liu, X Wang
[Baidu Research]
https://arxiv.org/abs/2205.05056
[CV] Dual Diffusion Implicit Bridges for Image-to-Image Translation
Dual diffusion implicit bridges for image-to-image translation
X Su, J Song, C Meng, S Ermon
[Stanford University]
https://arxiv.org/abs/2203.08382
[LG] Unlocking High-Accuracy Differentially Private Image Classification through Scale
Unlocking high-accuracy differentially private image classification through scale
S De, L Berrada, J Hayes, S L Smith, B Balle
[DeepMind]
https://arxiv.org/abs/2204.13650