Frontier Research Digest
-
Reposted from: 爱可可-爱生活 (Zhihu)
Summary: images as sets of points; discovering latent knowledge in language models without supervision; Transformers with multiresolution attention heads; pretraining one language model for all with the text-to-text framework using model-generated signals; accurate quantization for generative pre-trained Transformers; a note on "Assessing Generalization of SGD via Disagreement"; high-fidelity visualization of what your self-supervised representation knows about; simplifying node classification on heterophilous graphs with compatible label propagation; finding and fixing spurious patterns with explanations.
1. [CV] Image as Set of Points
Viewing an image as a set of points. What is an image, and how should latent features be extracted? Convolutional networks (ConvNets) treat an image as organized pixels in a rectangular shape and extract features by convolution over local regions; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features with an attention mechanism over a global range. This paper proposes a straightforward and promising paradigm for visual representation, called Context Clusters. Context Clusters (CoCs) view an image as a set of unorganized points and extract features with a simplified clustering algorithm. In detail, each point includes raw features (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is used to group points hierarchically and extract deep features. CoCs are convolution- and attention-free, relying only on the clustering algorithm for spatial interaction. Thanks to the simple design, CoCs offer gratifying interpretability through visualization of the clustering process. CoCs aim to provide a new perspective on images and visual representation, which may find broad application in different domains and yield deep insights. Although SOTA performance is not the goal, CoCs still achieve performance comparable to or even better than ConvNets or ViTs on several benchmarks.
Convolutional Networks (ConvNets) consider an image as organized pixels in a rectangular shape and extract features via convolutional operation in a local region; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features via attention mechanism in a global range. In this work, we introduce a straightforward and promising paradigm for visual representation, which is called Context Clusters. Context clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm. In detail, each point includes the raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group and extract deep features hierarchically. Our CoCs are convolution- and attention-free, only relying on clustering algorithm for spatial interaction. Owing to the simple design, we show CoCs endow gratifying interpretability via the visualization of the clustering process. Our CoCs aim at providing a new perspective on image and visual representation, which may enjoy broad applications in different domains and exhibit profound insights. Even though we are not targeting SOTA performance, CoCs still achieve comparable or even better performance than ConvNets or ViTs on several benchmarks.
https://openreview.net/forum?id=awnvqZja69
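For intuition, here is a minimal, self-contained sketch of the idea described above (not the authors' implementation): every pixel becomes a point carrying its raw features plus normalized coordinates, each point is assigned to the most similar cluster center, and each cluster aggregates its members into a new feature. The number of centers, the cosine-similarity assignment, and the mean aggregation are illustrative assumptions.

```python
# Toy context-cluster layer: image -> set of points -> clustered features.
# This is a sketch of the idea, not the paper's architecture.
import torch
import torch.nn.functional as F

def context_cluster_layer(img, num_centers=16):
    """img: (B, C, H, W) -> clustered features of shape (B, num_centers, C + 2)."""
    B, C, H, W = img.shape
    # Build the point set: raw features plus normalized (x, y) coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij"
    )
    pos = torch.stack([xs, ys]).expand(B, 2, H, W)
    pts_img = torch.cat([img, pos], dim=1)                 # (B, C+2, H, W)
    points = pts_img.flatten(2).transpose(1, 2)            # (B, HW, C+2)

    # Initialize centers by average-pooling the point set on a coarse grid.
    g = int(num_centers ** 0.5)
    centers = F.adaptive_avg_pool2d(pts_img, (g, g)).flatten(2).transpose(1, 2)

    # Assign each point to its most similar center (cosine similarity).
    sim = F.normalize(points, dim=-1) @ F.normalize(centers, dim=-1).transpose(1, 2)
    assign = sim.argmax(dim=-1)                            # (B, HW)

    # Aggregate: each cluster's new feature is the mean of its member points.
    out = torch.zeros_like(centers)
    for b in range(B):
        for k in range(centers.shape[1]):
            members = points[b, assign[b] == k]
            out[b, k] = members.mean(dim=0) if len(members) else centers[b, k]
    return out

feats = context_cluster_layer(torch.rand(2, 3, 32, 32))
print(feats.shape)  # torch.Size([2, 16, 5])
```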
2. [CL] Discovering Latent Knowledge in Language Models Without Supervision
Discovering latent knowledge in language models without supervision. Existing techniques for training language models can be misaligned with the truth: models trained with imitation learning may reproduce errors that humans make, and models trained to generate text that humans rate highly may output errors that human evaluators cannot detect. This paper proposes circumventing the problem by directly finding latent knowledge inside a language model's internal activations in a purely unsupervised way. It introduces a method for accurately answering yes-no questions given only unlabeled model activations. The method works by finding a direction in activation space that satisfies logical-consistency properties, for example that a statement and its negation have opposite truth values. The paper shows that, despite using no supervision and no model outputs, the method recovers diverse knowledge in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. It also cuts prompt sensitivity in half and maintains high accuracy even when models are prompted to produce incorrect answers. These results are a first step toward discovering what language models know, as distinct from what they say, even without access to explicit ground-truth labels.
Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can’t detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don’t have access to explicit ground truth labels.
https://openreview.net/forum?id=ETKGuby0hcs
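The core mechanism, finding a direction in activation space that respects logical consistency, can be sketched as an unsupervised linear probe trained with a consistency term (a statement and its negation should receive opposite probabilities) plus a confidence term (ruling out the degenerate answer of 0.5 everywhere). The random tensors below stand in for real contrast-pair activations, and the probe and training loop are illustrative assumptions rather than the paper's exact setup.

```python
# Unsupervised consistency probe on (stand-in) model activations.
import torch

def consistency_loss(p_pos, p_neg):
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # "x" and "not x" should disagree
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage p_pos = p_neg = 0.5
    return (consistency + confidence).mean()

d = 768
acts_pos = torch.randn(256, d)   # activations for "statement + yes" prompts (stand-in)
acts_neg = torch.randn(256, d)   # activations for "statement + no" prompts (stand-in)
probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(200):
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)
    loss = consistency_loss(p_pos, p_neg)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)
    answers = 0.5 * (p_pos + (1.0 - p_neg)) > 0.5   # True = "yes"; no labels were used
```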
3. [LG] Transformers with Multiresolution Attention Heads
Transformers with multiresolution attention heads. This paper proposes the Transformer with Multiresolution-head Attention (MrsFormer), a class of efficient Transformers inspired by multiresolution approximation (MRA), which approximates a signal f with wavelet bases. MRA decomposes a signal into components lying on orthogonal subspaces at different scales. Similarly, MrsFormer decomposes the attention heads of multi-head attention into fine-scale and coarse-scale heads, modeling attention patterns between tokens and between groups of tokens. Computing the attention heads in MrsFormer requires significantly less computation and memory than the standard softmax Transformer with multi-head attention. The paper analyzes and validates MrsFormer's advantage over standard Transformers on a wide range of applications, including image and time-series classification.
We propose the Transformer with Multiresolution-head Attention (MrsFormer), a class of efficient transformers inspired by the multiresolution approximation (MRA) for approximating a signal f using wavelet bases. MRA decomposes a signal into components that lie on orthogonal subspaces at different scales. Similarly, MrsFormer decomposes the attention heads in the multi-head attention into fine-scale and coarse-scale heads, modeling the attention patterns between tokens and between groups of tokens. Computing the attention heads in MrsFormer requires significantly less computation and memory footprint compared to the standard softmax transformer with multi-head attention. We analyze and validate the advantage of MrsFormer over the standard transformers on a wide range of applications including image and time series classification.
https://openreview.net/forum?id=L8qKBr_bht
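A rough sketch of the fine/coarse decomposition described above: a fine-scale head attends between individual tokens, while a coarse-scale head average-pools adjacent tokens into groups, attends between the groups, and broadcasts the result back to token resolution. Query/key/value projections, the grouping scheme, and the way the two heads are mixed are simplifications for illustration, not the paper's exact design.

```python
# Mixing a fine-scale (token-level) and a coarse-scale (group-level) attention head.
import torch

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def multiresolution_attention(x, group_size=4):
    """x: (B, N, D) with N divisible by group_size; projections omitted for brevity."""
    B, N, D = x.shape
    fine = attention(x, x, x)                              # token-to-token head

    # Coarse head: pool tokens into groups, attend over groups, broadcast back.
    groups = x.reshape(B, N // group_size, group_size, D).mean(dim=2)
    coarse = attention(groups, groups, groups)             # group-to-group head
    coarse = coarse.repeat_interleave(group_size, dim=1)   # back to (B, N, D)

    # The coarse head attends over N/group_size positions, so it is much cheaper.
    return 0.5 * (fine + coarse)

out = multiresolution_attention(torch.randn(2, 64, 32))
print(out.shape)  # torch.Size([2, 64, 32])
```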
4. [CL] Pretraining One Language Model for All With the Text-To-Text Framework Using Model-Generated Signals
Pretraining one language model for all with the text-to-text framework using model-generated signals. Pretrained encoder-decoder language models offer the flexibility to unify various language scenarios into one text-to-text framework, but several recent studies have raised concerns that their pretraining efficiency and effectiveness lag behind encoder-only and decoder-only models. This paper improves the performance of encoder-decoder language models at unifying NLP tasks by pretraining with ELECTRA-style, model-generated signals. It first shows the challenges of pretraining encoder-decoder models such as T5 with model-generated signals, including ill-formed targets, label leakage, and training instability. It then proposes Metro-T5, a new formulation of the denoising pretraining task together with a multi-task learning loss that lets encoder-decoder models incorporate ELECTRA-style pretraining. Metro-T5 outperforms T5 on a variety of language tasks, both in standard fine-tuning and in prompt-based zero-shot/few-shot scenarios. Analysis shows that Metro-T5 achieves similar generalization ability with much better efficiency, outperforming T0 (3B) in prompt-based learning with only 8% of its parameters and outperforming T5 on all tasks with fewer GPU hours.
Pretrained encoder-decoder language models provide the flexibility to unify various language scenarios into one text-to-text framework, but various recent studies raised concerns about their inferior pretraining efficiency and effectiveness compared to encoder only and decoder only models. In this paper, we improve the performance of encoder-decoder language models in unifying NLP tasks by pretraining with ELECTRA-style model-generated signals. We first show the challenges of pretraining encoder-decoder models (such as T5) using model-generated signals, including ill-formed target, label leakage, and training instability. We then propose Metro-T5, a new formulation of the denoising pretraining task and multi-task learning loss for encoder-decoder models to incorporate ELECTRA-Style pretraining. Metro-T5 outperforms T5 on a variety of language tasks in standard fine-tuning and prompt-based zero/few-shot scenarios. Our analysis shows Metro-T5 achieves similar generalization ability with much better efficiency, outperforming T0 (3B) in prompt-based learning with only 8% parameters and T5 in all tasks with fewer GPU hours. Our pretraining code and model checkpoints will be open-sourced.
https://openreview.net/forum?id=us3brYx_ZBZ
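To make "model-generated signals" concrete, here is a toy sketch of ELECTRA-style corruption for a text-to-text model: a small masked-LM generator fills masked positions with plausible tokens, and the main encoder-decoder would then be trained to map the corrupted sequence back to the original text. The toy generator and the target format are illustrative assumptions; the actual Metro-T5 formulation additionally deals with ill-formed targets, label leakage, and training instability.

```python
# Building model-generated (ELECTRA-style) training pairs for a text-to-text model.
import torch

vocab_size, mask_id = 1000, 0

class ToyGenerator(torch.nn.Module):
    """Stand-in for a small masked LM that proposes replacement tokens."""
    def __init__(self, d=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)
        self.head = torch.nn.Linear(d, vocab_size)

    def forward(self, ids):
        return self.head(self.emb(ids))                     # (B, N, vocab)

def make_training_pair(ids, generator, mask_rate=0.15):
    """ids: (B, N) original token ids -> (corrupted encoder input, decoder target)."""
    masked = ids.clone()
    is_masked = torch.rand_like(ids, dtype=torch.float) < mask_rate
    masked[is_masked] = mask_id
    with torch.no_grad():
        logits = generator(masked)
        samples = torch.distributions.Categorical(logits=logits).sample()
    corrupted = torch.where(is_masked, samples, ids)         # model-generated corruption
    return corrupted, ids                                    # decoder learns to restore ids

gen = ToyGenerator()
original = torch.randint(1, vocab_size, (2, 16))
enc_input, dec_target = make_training_pair(original, gen)
print((enc_input != dec_target).float().mean())              # fraction of tokens changed
```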
5. [LG] GPTQ: Accurate Quantization for Generative Pre-trained Transformers
GPTQ: accurate quantization for generative pre-trained Transformers. Generative Pre-trained Transformer (GPT) models have set themselves apart with breakthrough performance on complex language-modeling tasks, but also with extremely high computational costs. Specifically, because of memory costs, even inference with large, highly accurate GPT models may require multiple high-performance GPUs, which limits their usability. While work on relieving this pressure through model compression is emerging, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. This paper addresses that challenge with GPTQ, a new one-shot weight-quantization method based on approximate second-order information that is both highly accurate and highly efficient. Specifically, GPTQ can quantize a GPT model with 175 billion parameters in roughly four GPU hours, reducing the bitwidth to 3 or 4 bits per weight with negligible accuracy degradation relative to the uncompressed baseline. The method more than doubles the compression gains of previously proposed one-shot quantization methods while preserving accuracy, making it possible for the first time to run a 175-billion-parameter model on a single GPU. Experiments show that these improvements translate into end-to-end inference speedups over FP16 of about 2x on a high-end GPU (NVIDIA A100) and 4x on a more cost-effective one (NVIDIA A6000).
Generative Pre-trained Transformer (GPT) models have set themselves apart by breakthrough performance across complex language modelling tasks, but also by their extremely high computational costs. Specifically, due to memory costs, even inference for large, highly-accurate GPT models may require multiple performant GPUs to execute, which limits their usability. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute an 175 billion-parameter model inside a single GPU. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 2x when using high-end GPUs (NVIDIA A100) and 4x when using more cost-effective ones (NVIDIA A6000).
https://openreview.net/forum?id=tcbBPnfwxS
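A simplified sketch of one-shot, second-order weight quantization in the spirit described above: the weights of a linear layer are rounded to a uniform grid one column at a time, and each column's rounding error is propagated to the not-yet-quantized columns using the inverse Hessian of the layer inputs. Blocking, Cholesky-based updates, per-group scales, and the real 3/4-bit kernels are omitted; this is an illustration, not the authors' implementation.

```python
# Column-by-column weight quantization with second-order error compensation.
import torch

def quantize_layer(W, X, bits=4, damp=0.01):
    """W: (rows, cols) weights of a linear layer; X: (n_samples, cols) calibration inputs."""
    W = W.clone()
    cols = W.shape[1]
    scale = W.abs().max() / (2 ** (bits - 1) - 1)              # shared uniform grid

    H = 2 * X.T @ X                                            # approximate Hessian of the layer output error
    H += damp * torch.mean(torch.diag(H)) * torch.eye(cols)    # dampening for invertibility
    Hinv = torch.linalg.inv(H)

    Q = torch.zeros_like(W)
    for j in range(cols):
        q = torch.round(W[:, j] / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        # Compensate the not-yet-quantized columns for this column's rounding error.
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q

W = torch.randn(8, 32)
X = torch.randn(128, 32)
Q = quantize_layer(W, X)
print(((X @ W.T - X @ Q.T) ** 2).mean())   # output reconstruction error on the calibration data
```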
Several other papers worth noting:
[LG] A Note on “Assessing Generalization of SGD via Disagreement”
A Kirsch, Y Gal
[University of Oxford]
https://openreview.net/forum?id=oRP8urZ8Fx
[LG] High Fidelity Visualization of What Your Self-Supervised Representation Knows About
F Bordes, R Balestriero, P Vincent
[Meta AI Research]
https://openreview.net/forum?id=urfWb7VjmL
[LG] Simplifying Node Classification on Heterophilous Graphs with Compatible Label Propagation
Z Zhong, S Ivanov, J Pang
[University of Luxembourg & Criteo]
https://openreview.net/forum?id=JBuCfkmKYu
[LG] Finding and Fixing Spurious Patterns with Explanations
G Plumb, MT Ribeiro, A Talwalkar
[CMU & Microsoft Research]
https://openreview.net/forum?id=whJPugmP5I