Frontier Highlights
-
Repost: 爱可可-爱生活 (Zhihu)
Summary: a visual exploration of what vision Transformers learn; learned incremental representations for parsing; a robotics Transformer for real-world control at scale; path-consistent LiDAR-camera deep feature fusion; learning robotic navigation from experience; unified language, image, and point cloud representation learning for 3D understanding; blind face restoration with diffused error contraction; structured prompting to scale in-context learning to 1,000 examples; a study of pre-training for visuo-motor control
1. [CV] What do Vision Transformers Learn? A Visual Exploration
A Ghiasi, H Kazemi, E Borgnia, S Reich, M Shu, M Goldblum, A G Wilson, T Goldstein
[University of Maryland & New York University]
Key points:
- Performs a visual analysis of ViTs to understand how they work and what they learn;
- Neurons in ViTs trained with language-model supervision are activated by semantic concepts rather than visual features;
- Transformers detect image background features but depend far less on high-frequency information than CNNs;
- Both Transformers and CNNs learn texture-like attributes in shallow layers and high-level object features or abstract concepts in deep layers.
Abstract:
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks, an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but their predictions depend far less on high-frequency information. On the other hand, both architecture types behave similarly in the way features progress from abstract patterns in early layers to concrete objects in late layers. In addition, we show that ViTs maintain spatial information in all layers except the final layer. In contrast to previous works, we show that the last layer most likely discards the spatial information and behaves as a learned global pooling operation. Finally, we conduct large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method.
https://arxiv.org/abs/2212.06727
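The paper's core tool is gradient-based feature visualization adapted to ViTs. Below is a minimal sketch of that general idea in PyTorch: optimize an input image to maximize one channel of a hidden ViT activation captured with a forward hook. The choice of torchvision's vit_b_16, the hooked block, the channel index, and all hyperparameters are illustrative assumptions, not the authors' pipeline (which adds augmentations and regularization).

```python
import torch
import torchvision.models as models

model = models.vit_b_16(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)          # only the input image is optimized

acts = {}
def hook(module, inputs, output):
    acts["feat"] = output            # (batch, num_tokens, channels)

# Assumption: hook the MLP of encoder block 6; any block works similarly.
model.encoder.layers[6].mlp.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
channel = 42                         # hypothetical neuron index

for step in range(200):
    opt.zero_grad()
    model(img)
    # Ascend the mean activation of one channel over all patch tokens.
    loss = -acts["feat"][0, 1:, channel].mean()   # skip the CLS token
    loss.backward()
    opt.step()
```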
2. [CL] Learned Incremental Representations for Parsing
N Kitaev, T Lu, D Klein
[UC Berkeley]
Key points:
- Proposes an incremental syntactic representation that assigns a single discrete label to each word in a sentence;
- The learned representations reach high F1 on the Penn Treebank with as few as 5 bits per word;
- Analysis of the system improves our understanding of incremental parsing and sequential decision-making.
Abstract:
We present an incremental syntactic representation that consists of assigning a single discrete label to each word in a sentence, where the label is predicted using strictly incremental processing of a prefix of the sentence, and the sequence of labels for a sentence fully determines a parse tree. Our goal is to induce a syntactic representation that commits to syntactic choices only as they are incrementally revealed by the input, in contrast with standard representations that must make output choices such as attachments speculatively and later throw out conflicting analyses. Our learned representations achieve 93.72 F1 on the Penn Treebank with as few as 5 bits per word, and at 8 bits per word they achieve 94.97 F1, which is comparable with other state of the art parsing models when using the same pre-trained embeddings. We also provide an analysis of the representations learned by our system, investigating properties such as the interpretable syntactic features captured by the system and mechanisms for deferred resolution of syntactic ambiguities.
https://aclanthology.org/2022.acl-long.220/
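As a toy illustration of the strict incrementality the abstract describes, the sketch below (a hypothetical architecture, not the paper's model) uses a unidirectional encoder so that the label for word i depends only on words 1..i, with a 32-way label head matching the 5-bits-per-word setting; the learned decoder that maps label sequences to parse trees is omitted.

```python
import torch
import torch.nn as nn

class IncrementalLabeler(nn.Module):
    def __init__(self, vocab=10000, dim=128, num_labels=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # A unidirectional LSTM sees only the prefix up to each word,
        # so label i is committed using words 1..i alone.
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_labels)

    def forward(self, word_ids):              # (batch, seq)
        h, _ = self.encoder(self.embed(word_ids))
        return self.head(h).argmax(-1)        # one discrete label per word

labels = IncrementalLabeler()(torch.randint(0, 10000, (1, 7)))
print(labels)   # seven labels, each fitting in 5 bits (0..31)
```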
3. [RO] RT-1: Robotics Transformer for Real-World Control at Scale
A Brohan, N Brown, J Carbajal, Y Chebotar…
[Google]
Key points:
- Transferring knowledge from large, diverse, task-agnostic datasets makes it possible to solve downstream tasks at a high level of performance;
- A key to success in robotics lies in open-ended, task-agnostic training combined with high-capacity architectures;
- RT-1 can absorb large amounts of data and generalize to new tasks, environments, objects, and other robot morphologies.
Abstract:
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project’s website and videos can be found at this http URL
https://arxiv.org/abs/2212.06817
4. [CV] PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion
L Wu, D Wang, M Li, Y Xiong, R Krishnamoorthi, Q Liu, V Chandra
[Meta Reality Labs & University of Texas at Austin & Peking University]
Key points:
- Fuses LiDAR with camera to improve 3D detection accuracy;
- Proposes PathFusion to enforce path consistency in deep LiDAR-camera feature fusion;
- PathFusion improves the Focals Conv baseline by more than 1.2% mAP on the nuScenes test split.
Abstract:
Fusing camera with LiDAR is a promising technique to improve the accuracy of 3D detection due to the complementary physical properties. While most existing methods focus on fusing camera features directly with raw LiDAR point clouds or shallow 3D features, it is observed that direct deep 3D feature fusion achieves inferior accuracy due to feature misalignment. The misalignment that originates from the feature aggregation across large receptive fields becomes increasingly severe for deep network stages. In this paper, we propose PathFusion to enable path-consistent LiDAR-camera deep feature fusion. PathFusion introduces a path consistency loss between shallow and deep features, which encourages the 2D backbone and its fusion path to transform 2D features in a way that is semantically aligned with the transform of the 3D backbone. We apply PathFusion to the prior-art fusion baseline, Focals Conv, and observe more than 1.2% mAP improvements on the nuScenes test split consistently with and without testing-time augmentations. Moreover, PathFusion also improves KITTI AP3D (R11) by more than 0.6% on moderate level.
https://arxiv.org/abs/2212.06244
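A minimal sketch of what a path-consistency loss between shallow and deep fusion features could look like, assuming hypothetical `project` and `transform_3d` modules standing in for the fusion path and the 3D backbone's stage transform; PathFusion's actual loss placement, projections, and backbones differ.

```python
import torch
import torch.nn.functional as F

def path_consistency_loss(shallow_2d, deep_2d, transform_3d, project):
    """shallow_2d, deep_2d: camera features from a shallow and a deep stage.
    transform_3d: stand-in for the 3D backbone's stage transform.
    project: hypothetical projection from 2D into the 3D feature space."""
    # Path A: fuse the shallow 2D features, then deepen on the 3D side.
    path_a = transform_3d(project(shallow_2d))
    # Path B: deepen on the 2D side first, then fuse.
    path_b = project(deep_2d)
    # Encourage the two paths to agree semantically.
    return F.mse_loss(path_b, path_a.detach())

# Toy usage with 1x1 / 3x3 convs standing in for the real modules.
project = torch.nn.Conv2d(64, 64, 1)
transform_3d = torch.nn.Conv2d(64, 64, 3, padding=1)
loss = path_consistency_loss(torch.randn(1, 64, 32, 32),
                             torch.randn(1, 64, 32, 32),
                             transform_3d, project)
```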
5. [RO] Learning Robotic Navigation from Experience: Principles, Methods, and Recent Results
S Levine, D Shah
[UC Berkeley]
Key points:
- Machine learning offers a promising way to move beyond traditional geometry and planning and to base robotic navigation on prior experience;
- The paper presents a toolkit for experiential learning of navigation skills that unifies recent approaches;
- The algorithms can be combined to solve temporally extended navigation problems.
Abstract:
Navigation is one of the most heavily studied problems in robotics, and is conventionally approached as a geometric mapping and planning problem. However, real-world navigation presents a complex set of physical challenges that defies simple geometric abstractions. Machine learning offers a promising way to go beyond geometry and conventional planning, allowing for navigational systems that make decisions based on actual prior experience. Such systems can reason about traversability in ways that go beyond geometry, accounting for the physical outcomes of their actions and exploiting patterns in real-world environments. They can also improve as more data is collected, potentially providing a powerful network effect. In this article, we present a general toolkit for experiential learning of robotic navigation skills that unifies several recent approaches, describe the underlying design principles, summarize experimental results from several of our recent papers, and discuss open problems and directions for future work.
https://arxiv.org/abs/2212.06759
A few more papers worth noting:
[CV] ULIP: Learning Unified Representation of Language, Image and Point Cloud for 3D Understanding
L Xue, M Gao, C Xing, R Martín-Martín…
[Salesforce Research & UT Austin & Stanford University]
Key points:
- Proposes ULIP, a pre-training framework that aligns the image, text, and point cloud modalities in a single feature space;
- Experiments show that ULIP effectively improves the representations of 3D backbone models;
- ULIP achieves state-of-the-art performance on both zero-shot and standard 3D classification, with potential applications in cross-modal retrieval.
https://arxiv.org/abs/2212.05171
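A minimal sketch of the kind of alignment objective ULIP-style training relies on: a symmetric contrastive (InfoNCE) loss pulling each point-cloud embedding toward its paired image and text embeddings. The encoders are stubbed with random tensors; ULIP's actual backbones, frozen vision-language model, and training recipe are not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive(a, b, temp=0.07):
    # Symmetric InfoNCE between two batches of paired embeddings.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp
    targets = torch.arange(a.shape[0])
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# pc_emb would come from the trainable 3D encoder; img_emb and txt_emb
# from a frozen pre-trained vision-language model, precomputed per triple.
pc_emb = torch.randn(8, 512, requires_grad=True)
img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
loss = contrastive(pc_emb, img_emb) + contrastive(pc_emb, txt_emb)
loss.backward()
```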
[CV] DifFace: Blind Face Restoration with Diffused Error Contraction
Z Yue, C C Loy
[Nanyang Technological University]
Key points:
- Most deep-learning-based blind face restoration methods suffer from two main limitations: degradations outside the training distribution and reliance on multiple training constraints;
- Proposes DifFace, which needs much simpler training and handles unseen complex degradations better;
- DifFace is built from a transition distribution and a Markov chain partially borrowed from a pre-trained diffusion model.
https://arxiv.org/abs/2212.06512
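A minimal sketch of the "diffuse, then partially denoise" idea the key points describe: a rough restoration is diffused to an intermediate timestep, contracting the restorer's error, and is then refined by the tail of a pre-trained diffusion model's reverse chain. The `alphas_cumprod` attribute and `p_sample` call are hypothetical stand-ins for a diffusion library, and the paper's transition distribution is constructed more carefully.

```python
import torch

def restore(y_lq, rough_restorer, diffusion, N=400):
    """y_lq: low-quality face; rough_restorer: any cheap restoration net;
    diffusion: a pre-trained diffusion model (hypothetical interface)."""
    x0_hat = rough_restorer(y_lq)                 # rough clean estimate
    a_bar = diffusion.alphas_cumprod[N]           # hypothetical attribute
    # Diffuse the estimate to timestep N; added noise shrinks the
    # contribution of the restorer's residual error.
    x = a_bar.sqrt() * x0_hat + (1 - a_bar).sqrt() * torch.randn_like(x0_hat)
    # Refine with only the last N reverse steps of the pre-trained chain.
    for t in reversed(range(N)):
        x = diffusion.p_sample(x, t)              # hypothetical API
    return x
```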
[CL] Structured Prompting: Scaling In-Context Learning to 1,000 Examples
Y Hao, Y Sun, L Dong, Z Han, Y Gu, F Wei
[Microsoft Research]
Key points:
- Large language models show strong zero-shot and few-shot performance without parameter updates, but are usually limited by context length;
- Proposes structured prompting to break the length limit and scale in-context learning to thousands of examples, with complexity linear in length;
- Experiments across a diverse set of tasks show that, as the number of examples grows, the method improves end-task performance, reduces evaluation variance, and is more stable than conventional in-context learning.
https://arxiv.org/abs/2212.06713
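A minimal sketch of why structured prompting scales linearly: demonstration groups are encoded independently (no cross-group attention), and the test input then attends over all cached key/values at once. The single softmax below is a simplification of the paper's rescaled attention, and all shapes and the `encode` stub are toy assumptions.

```python
import torch

d = 16                                       # toy hidden size
def encode(group):                           # stand-in for a causal LM pass
    h = torch.randn(len(group), d)           # hypothetical hidden states
    return h, h                              # cached (keys, values)

groups = [["ex1", "ex2"], ["ex3", "ex4"]]    # demonstration groups
caches = [encode(g) for g in groups]         # cost linear in total length

q = torch.randn(d)                           # a query from the test input
keys = torch.cat([k for k, _ in caches])
vals = torch.cat([v for _, v in caches])
weights = torch.softmax(keys @ q / d ** 0.5, dim=0)
out = weights @ vals                         # combine all groups at once
```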
[LG] On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline
N Hansen, Z Yuan, Y Ze, T Mu, A Rajeswaran, H Su, H Xu, X Wang
[University of California San Diego & Tsinghua University]
Key points:
- Revisits a simple learning-from-scratch baseline for visuo-motor control that uses data augmentation and a shallow ConvNet, and finds it competitive with recent methods based on frozen visual representations;
- The proposed LfS baseline is competitive with, and sometimes outperforms, frozen pre-trained representations across 3 task domains and 3 algorithm classes;
- The authors suggest that strong baselines and better, harder tasks will help future work exploit pre-trained representations.
https://arxiv.org/abs/2212.05749
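A minimal sketch of the two ingredients the baseline leans on: a shallow from-scratch ConvNet encoder and random-shift image augmentation. Layer sizes and the pad-and-crop shift range are assumptions; the paper evaluates this recipe inside standard RL algorithms rather than in isolation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowEncoder(nn.Module):
    # A few strided convs trained from scratch; no pre-trained weights.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
    def forward(self, x):
        return self.net(x)

def random_shift(imgs, pad=4):
    # Pad then randomly crop back: the common +/- 4-pixel shift augmentation.
    n, _, h, w = imgs.shape
    padded = F.pad(imgs, (pad,) * 4, mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(n):
        dx, dy = torch.randint(0, 2 * pad + 1, (2,))
        out[i] = padded[i, :, dy:dy + h, dx:dx + w]
    return out

feats = ShallowEncoder()(random_shift(torch.rand(8, 3, 84, 84)))
```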