Frontier Research Digest
-
Reposted from: 爱可可-爱生活 (Zhihu)
Summary: YOLOv7: trainable bag-of-freebies sets a new state of the art for real-time object detection; pure Transformers as powerful graph learners; stylized neural implicit representations for 3D scenes; predicting out-of-domain generalization with local manifold smoothness; when bias transfers in transfer learning; counterfactual image manipulation via CLIP; a quantitative characterization, via linear stability, of when SGD favors flat minima; LaserMix for semi-supervised LiDAR semantic segmentation; scalable real-world web interaction with grounded language agents
1. [CV] YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
C Wang, A Bochkovskiy, H M Liao
[Academia Sinica]
This paper proposes a new architecture for real-time object detectors and a corresponding model scaling method. In the process, the authors identify a replacement problem for re-parameterized modules and an assignment problem for dynamic label assignment, and propose trainable bag-of-freebies methods to solve both and raise detection accuracy. On this basis, they develop the YOLOv7 series of object detectors. YOLOv7 surpasses all known object detectors in both speed and accuracy over the 5 FPS to 160 FPS range, and reaches the highest accuracy, 56.8% AP, among all known real-time detectors running at 30 FPS or higher on a V100 GPU. YOLOv7-E6 (56 FPS on V100, 55.9% AP) improves on both the Transformer-based SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP) and the convolution-based ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP) in speed and accuracy, and YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other detectors. YOLOv7 is trained from scratch on MS COCO alone, with no other datasets or pre-trained weights.
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy 56.8% AP among all known real-time object detectors with 30 FPS or higher on GPU V100. YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms both transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) by 509% in speed and 2% in accuracy, and convolutional-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy, as well as YOLOv7 outperforms: YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B and many other object detectors in speed and accuracy. Moreover, we train YOLOv7 only on MS COCO dataset from scratch without using any other datasets or pre-trained weights. Source code is released in https://github.com/WongKinYiu/yolov7.
https://arxiv.org/abs/2207.02696
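For background on the "re-parameterization" the abstract refers to: the YOLO line folds train-time structures into a single operator at inference. Below is a minimal, generic sketch of that idea in PyTorch, using the standard conv+BatchNorm fusion as an illustration; this is an assumption for intuition, not YOLOv7's planned re-parameterized modules.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return one conv equivalent (at inference time) to conv followed by bn."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta, folded into W and b.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # per out-channel
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    base = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = bn.bias.data + (base - bn.running_mean) * scale
    return fused
```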
2. [LG] Pure Transformers are Powerful Graph Learners
J Kim, T D Nguyen, S Min, S Cho, M Lee, H Lee, S Hong
[KAIST & LG AI Research]
This paper shows, both in theory and in practice, that a standard Transformer without any graph-specific modifications can deliver promising results in graph learning. Given a graph, simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, the authors prove this approach is at least as expressive as an invariant graph network (2-IGN) built from equivariant linear layers, which is itself already more expressive than all message-passing graph neural networks (GNNs). Trained on a large-scale graph dataset (PCQM4Mv2), the proposed Tokenized Graph Transformer (TokenGT) achieves significantly better results than GNN baselines and competitive results against Transformer variants with sophisticated graph-specific inductive biases.
We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning both in theory and practice. Given a graph, we simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, we prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing Graph Neural Networks (GNN). When trained on a large-scale graph dataset (PCQM4Mv2), our method coined Tokenized Graph Transformer (TokenGT) achieves significantly better results compared to GNN baselines and competitive results compared to Transformer variants with sophisticated graph-specific inductive bias. Our implementation is available at https://github.com/jw9730/tokengt.
https://arxiv.org/abs/2207.02505
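The tokenization recipe is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the idea, assuming random unit vectors as a stand-in for the paper's orthonormal node identifiers and a mean-pooled readout; the actual implementation is at https://github.com/jw9730/tokengt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenGTSketch(nn.Module):
    def __init__(self, feat_dim: int, d_model: int = 128, n_layers: int = 4, id_dim: int = 16):
        super().__init__()
        self.id_dim = id_dim
        self.proj = nn.Linear(feat_dim + 2 * id_dim, d_model)
        self.type_emb = nn.Embedding(2, d_model)  # 0 = node token, 1 = edge token
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # plain Transformer

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor, edge_attr: torch.Tensor) -> torch.Tensor:
        # x: [n, feat_dim] node features; edge_index: [2, m]; edge_attr: [m, feat_dim]
        n, m = x.size(0), edge_index.size(1)
        # Stand-in for the paper's orthonormal node identifiers: random unit vectors.
        node_id = F.normalize(torch.randn(n, self.id_dim), dim=-1)
        # Node token: features plus its own identifier in both identifier slots.
        node_tok = torch.cat([x, node_id, node_id], dim=-1)
        # Edge token: features plus the identifiers of its two endpoints.
        src, dst = edge_index
        edge_tok = torch.cat([edge_attr, node_id[src], node_id[dst]], dim=-1)
        tokens = self.proj(torch.cat([node_tok, edge_tok], dim=0))        # [n+m, d_model]
        types = torch.cat([torch.zeros(n, dtype=torch.long), torch.ones(m, dtype=torch.long)])
        h = self.encoder((tokens + self.type_emb(types)).unsqueeze(0))    # [1, n+m, d_model]
        return h.mean(dim=1)  # graph-level readout
```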
3. [CV] SNeRF: Stylized Neural Implicit Representations for 3D Scenes
T Nguyen-Phuoc, F Liu, L Xiao
[Meta]
This paper presents a stylized novel view synthesis method. Applying state-of-the-art stylization methods to novel views frame by frame often causes jittering artifacts due to the lack of cross-view consistency, so the paper instead studies 3D scene stylization, which provides a strong inductive bias for consistent novel view synthesis. It adopts the emerging neural radiance fields (NeRF) as the 3D scene representation, for their ability to render high-quality novel views of a wide variety of scenes. However, rendering a novel view from a NeRF requires a large number of samples, so training a stylized NeRF demands more GPU memory than off-the-shelf GPUs offer. The paper introduces a new training method that solves this by alternating between NeRF and stylization optimization steps, making full use of hardware memory to both generate images at higher resolution and adopt more expressive image style transfer methods. Experiments show the method produces stylized NeRFs for a wide range of content, including indoor, outdoor, and dynamic scenes, and synthesizes high-quality novel views with cross-view consistency.
This paper presents a stylized novel view synthesis method. Applying state-of-the-art stylization methods to novel views frame by frame often causes jittering artifacts due to the lack of cross-view consistency. Therefore, this paper investigates 3D scene stylization that provides a strong inductive bias for consistent novel view synthesis. Specifically, we adopt the emerging neural radiance fields (NeRF) as our choice of 3D scene representation for their capability to render high-quality novel views for a variety of scenes. However, as rendering a novel view from a NeRF requires a large number of samples, training a stylized NeRF requires a large amount of GPU memory that goes beyond an off-the-shelf GPU capacity. We introduce a new training method to address this problem by alternating the NeRF and stylization optimization steps. Such a method enables us to make full use of our hardware memory capacity to both generate images at higher resolution and adopt more expressive image style transfer methods. Our experiments show that our method produces stylized NeRFs for a wide range of content, including indoor, outdoor and dynamic scenes, and synthesizes high-quality novel views with cross-view consistency.
https://arxiv.org/abs/2207.02363
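The alternating scheme can be sketched as a simple two-stage loop. In the sketch below, render_nerf and stylize are hypothetical placeholders for a NeRF renderer and an off-the-shelf image style-transfer method, and the MSE reconstruction loss is an assumption; the point of alternating is that only one stage has to hold activations in GPU memory at a time.

```python
import random
import torch
import torch.nn.functional as F

def train_snerf_sketch(nerf, optimizer, cameras, style_image,
                       render_nerf, stylize,
                       num_outer_steps: int = 10, num_nerf_steps: int = 100):
    for _ in range(num_outer_steps):
        # Stage 1: render current views and stylize them in image space
        # (no gradients needed, so memory stays modest).
        with torch.no_grad():
            targets = [stylize(render_nerf(nerf, cam), style_image) for cam in cameras]
        # Stage 2: fit the NeRF to the stylized targets, baking the style
        # into the 3D representation so novel views stay consistent.
        for _ in range(num_nerf_steps):
            i = random.randrange(len(cameras))
            loss = F.mse_loss(render_nerf(nerf, cameras[i]), targets[i])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```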
4. [LG] Predicting Out-of-Domain Generalization with Local Manifold Smoothness
N Ng, K Cho, N Hulkund, M Ghassemi
[MIT & New York University]
Understanding how machine learning models generalize to new environments is a critical part of their safe deployment. Recent work has proposed a variety of complexity measures that directly predict or theoretically bound a model's generalization capacity, but these methods rely on strong assumptions that are not always satisfied in practice. Motivated by the limited settings in which existing measures apply, this paper proposes a new complexity measure based on the local manifold smoothness of a classifier, defined as the sensitivity of the classifier's output to perturbations in the manifold neighborhood around a given test point. Intuitively, a classifier that is less sensitive to these perturbations should generalize better. To estimate smoothness, points are sampled with data augmentation and the fraction classified into the majority class is measured. The method only requires choosing a data augmentation method and makes no other assumptions about the model or data distribution, so it can be applied even in out-of-domain (OOD) settings where existing methods cannot. In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, the manifold smoothness measure correlates strongly with actual OOD generalization across more than 3,000 models evaluated on over 100 train/test domain pairs.
Understanding how machine learning models generalize to new environments is a critical part of their safe deployment. Recent work has proposed a variety of complexity measures that directly predict or theoretically bound the generalization capacity of a model. However, these methods rely on a strong set of assumptions that in practice are not always satisfied. Motivated by the limited settings in which existing measures can be applied, we propose a novel complexity measure based on the local manifold smoothness of a classifier. We define local manifold smoothness as a classifier's output sensitivity to perturbations in the manifold neighborhood around a given test point. Intuitively, a classifier that is less sensitive to these perturbations should generalize better. To estimate smoothness we sample points using data augmentation and measure the fraction of these points classified into the majority class. Our method only requires selecting a data augmentation method and makes no other assumptions about the model or data distributions, meaning it can be applied even in out-of-domain (OOD) settings where existing methods cannot. In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our manifold smoothness measure and actual OOD generalization on over 3,000 models evaluated on over 100 train/test domain pairs.
https://arxiv.org/abs/2207.02093
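The estimator described in the abstract is simple enough to write down directly. A minimal sketch, assuming a classification model that returns logits and a user-chosen augment function (the method's only knob); both names are illustrative placeholders:

```python
import torch

@torch.no_grad()
def local_manifold_smoothness(model, x: torch.Tensor, augment, n_samples: int = 64) -> float:
    """Fraction of augmented copies of x classified into the majority class."""
    model.eval()
    batch = torch.stack([augment(x) for _ in range(n_samples)])  # sampled manifold neighbors
    preds = model(batch).argmax(dim=-1)                          # predicted class ids
    majority_count = torch.bincount(preds).max()
    return (majority_count / n_samples).item()  # in (0, 1]; higher = smoother
```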
5. [LG] When does Bias Transfer in Transfer Learning?
H Salman, S Jain, A Ilyas, L Engstrom, E Wong, A Madry
[MIT]
Using transfer learning to adapt a pre-trained "source model" to a downstream "target task" can dramatically improve performance, with seemingly no downside. This paper demonstrates that a downside can exist after all: bias transfer, the tendency for biases of the source model to persist even after the model has been adapted to the target task. Through a combination of synthetic and natural experiments, the paper shows that bias transfer (a) arises in realistic settings, such as when pre-training on ImageNet or other standard datasets, and (b) can occur even when the target dataset has been explicitly de-biased. As transfer-learned models are increasingly deployed in the real world, the work highlights the importance of understanding the limitations of pre-trained source models.
Using transfer learning to adapt a pre-trained "source model" to a downstream "target task" can dramatically increase performance with seemingly no downside. In this work, we demonstrate that there can exist a downside after all: bias transfer, or the tendency for biases of the source model to persist even after adapting the model to the target class. Through a combination of synthetic and natural experiments, we show that bias transfer both (a) arises in realistic settings (such as when pre-training on ImageNet or other standard datasets) and (b) can occur even when the target dataset is explicitly de-biased. As transfer-learned models are increasingly deployed in the real world, our work highlights the importance of understanding the limitations of pre-trained source models.
https://arxiv.org/abs/2207.02842
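As a hypothetical illustration of how bias transfer might be probed (this is not the authors' protocol): fine-tune a pre-trained source model on a de-biased target set, then measure how much a planted spurious cue still shifts its predictions. add_spurious_cue below is an assumed helper, e.g. one that pastes a fixed patch into each image.

```python
import torch

@torch.no_grad()
def bias_shift(model, loader, add_spurious_cue) -> float:
    """Mean total-variation shift in predictions when the spurious cue is added."""
    model.eval()
    shifts = []
    for x, _ in loader:
        p_clean = model(x).softmax(dim=-1)
        p_cued = model(add_spurious_cue(x)).softmax(dim=-1)
        shifts.append(0.5 * (p_cued - p_clean).abs().sum(dim=-1).mean())
    return torch.stack(shifts).mean().item()  # ~0 if the source bias did not transfer
```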
A few more papers worth noting:
[CV] Towards Counterfactual Image Manipulation via CLIP
Y Yu, F Zhan, R Wu, J Zhang, S Lu, M Cui, X Xie, X Hua, C Miao
[Nanyang Technological University & Max Planck Institute for Informatics & Alibaba Group]
https://arxiv.org/abs/2207.02812
[LG] When does SGD favor flat minima? A quantitative characterization via linear stability
L Wu, M Wang, W Su
[Peking University & University of Pennsylvania]
https://arxiv.org/abs/2207.02628
[CV] LaserMix for Semi-Supervised LiDAR Semantic Segmentation
L Kong, J Ren, L Pan, Z Liu
[Nanyang Technological University]
https://arxiv.org/abs/2207.00026
[CL] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
S Yao, H Chen, J Yang, K Narasimhan
[Princeton University]
https://arxiv.org/abs/2207.01206