Frontier Highlights
-
Reposted from: 爱可可-爱生活 (Zhihu)
Summary: Unveiling Transformers with the synthetic reasoning task LEGO; a universal neural vocoder with large-scale training; CNNs can be more robust than Transformers; infinite recommendation networks; learning backward compatible embeddings; multimodal contrastive learning with LIMoE; efficient neural radiance fields; rotation-equivariant attention for vector neurons; decomposed linear dynamical systems (dLDS) for learning the latent components of neural dynamics.

1. [LG] Unveiling Transformers with LEGO: a synthetic reasoning task
Y Zhang, A Backurs, S Bubeck, R Eldan, S Gunasekar, T Wagner
[Microsoft Research]
Unveiling Transformers with the synthetic reasoning task LEGO. The paper proposes a synthetic task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and studies how the Transformer architecture learns it. Particular attention is paid to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., different chain lengths at training and test time), as well as architectural variants such as weight-tied layers or added convolutional components. The authors study how trained models eventually succeed at the task and, in particular, manage to understand (to some extent) some of the attention heads and how information flows through the network. Based on these observations, they hypothesize that pretraining helps merely because it provides a sensible initialization, not because deep knowledge is stored in the network. They also observe that in some data regimes the trained Transformer finds "shortcut" solutions for following the reasoning chain, which impedes generalization to simple variants of the main task; such shortcuts can be prevented with appropriate architectural modifications or careful data preparation. Motivated by these findings, the paper begins to explore the task of learning to execute C programs, where a convolutional modification to the Transformer, adding convolutional structure to the key/query/value maps, shows an encouraging edge.
We propose a synthetic task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the transformer architecture learns this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we are able to understand (to some extent) some of the attention heads as well as how the information flows in the network. Based on these observations we propose a hypothesis that here pretraining helps merely due to being a smart initialization rather than some deep knowledge stored in the network. We also observe that in some data regime the trained transformer finds “shortcut” solutions to follow the chain of reasoning, which impedes the model’s ability to generalize to simple variants of the main task, and moreover we find that one can prevent such shortcut with appropriate architecture modification or careful data preparation. Motivated by our findings, we begin to explore the task of learning to execute C programs, where a convolutional modification to transformers, namely adding convolutional structures in the key/query/value maps, shows an encouraging edge.
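The abstract does not spell out what a LEGO sample looks like. As a rough, hypothetical sketch of a chain-of-reasoning instance over the group {+1, -1}, here is a small Python generator; the variable names, the ±1 value set, and the sequential clause order are assumptions for illustration, not the paper's exact data pipeline.

```python
import random

def make_lego_chain(length=4, seed=None):
    """Sketch of a LEGO-style reasoning chain (illustrative only): each
    variable is defined as +/- a previously defined variable, and the task
    is to resolve every variable to its group value (+1 or -1)."""
    rng = random.Random(seed)
    names = [chr(ord("a") + i) for i in range(length)]
    clauses, values = [f"{names[0]} = +1"], {names[0]: 1}
    for prev, cur in zip(names, names[1:]):
        sign = rng.choice([1, -1])
        clauses.append(f"{cur} = {'+' if sign == 1 else '-'}{prev}")
        values[cur] = sign * values[prev]
    # The paper also varies chain length and clause order; kept sequential here.
    return "; ".join(clauses), values

chain, targets = make_lego_chain(length=5, seed=0)
print(chain)    # e.g. "a = +1; b = -a; c = +b; ..."
print(targets)  # ground-truth value of each variable
```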
https://arxiv.org/abs/2206.04301

2. [AS] BigVGAN: A Universal Neural Vocoder with Large-Scale Training
S Lee, W Ping, B Ginsburg, B Catanzaro, S Yoon
[Seoul National University (SNU) & NVIDIA]
BigVGAN: a universal neural vocoder with large-scale training. Despite recent progress in generative adversarial network (GAN)-based vocoders, which generate raw waveforms conditioned on mel spectrograms, it remains challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments. This paper presents BigVGAN, a universal vocoder that generalizes well under various unseen conditions in the zero-shot setting. Periodic nonlinearities and anti-aliased representations are introduced into the generator, bringing the desired inductive bias for waveform synthesis and significantly improving audio quality. Based on the improved generator and state-of-the-art discriminators, the GAN vocoder is trained at the largest scale to date, up to 112M parameters. The paper identifies and addresses training instabilities specific to this scale while maintaining high-fidelity output without over-regularization. BigVGAN achieves state-of-the-art zero-shot performance across a variety of out-of-distribution scenarios, including new speakers, novel languages, singing voices, music, and instrumental audio in unseen (even noisy) recording environments.
Despite recent progress in generative adversarial network(GAN)-based vocoders, where the model generates raw waveform conditioned on mel spectrogram, it is still challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in zero-shot setting. We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis and significantly improves audio quality. Based on our improved generator and the state-of-the-art discriminators, we train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address the training instabilities specific to such scale, while maintaining high-fidelity output without over-regularization. Our BigVGAN achieves the state-of-the-art zero-shot performance for various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music and instrumental audio in unseen (even noisy) recording environments. We will release our code and model at: this https URL
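As a rough illustration of what a "periodic nonlinearity" can look like, below is a sketch of a Snake-style activation, x + sin²(αx)/α with a learnable per-channel α. Treat the exact functional form and the per-channel parameterization as assumptions rather than the paper's implementation, which also wraps such activations in anti-aliased up/downsampling (omitted here).

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Periodic activation f(x) = x + sin^2(alpha * x) / alpha with a
    learnable frequency alpha per channel (sketch, not the official code)."""
    def __init__(self, channels: int, alpha_init: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1), alpha_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        alpha = self.alpha.clamp(min=1e-4)                # avoid division by zero
        return x + torch.sin(alpha * x) ** 2 / alpha

x = torch.randn(2, 8, 100)        # toy feature map
y = Snake(channels=8)(x)
print(y.shape)                     # torch.Size([2, 8, 100])
```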
https://arxiv.org/abs/2206.04658

3. [CV] Can CNNs Be More Robust Than Transformers?
Z Wang, Y Bai, Y Zhou, C Xie
[University of California, Santa Cruz & Johns Hopkins University]
Can CNNs be more robust than Transformers? The recent success of Vision Transformers is shaking the decade-long dominance of convolutional neural networks (CNNs) in image recognition. In particular, regarding robustness on out-of-distribution samples, recent research finds that Transformers are inherently more robust than CNNs regardless of the training setup, and this superiority is largely credited to their self-attention-like architecture per se. This paper questions that belief by closely examining the design of Transformers. The findings lead to three highly effective architectural designs for boosting robustness, each simple enough to implement in a few lines of code: a) patchifying input images, b) enlarging the kernel size, and c) reducing the number of activation and normalization layers. Combining these components yields pure CNN architectures, without any attention-like operations, that are as robust as, or even more robust than, Transformers. The authors hope this work helps the community better understand the design of robust neural architectures.
The recent success of Vision Transformers is shaking the long dominance of Convolutional Neural Networks (CNNs) in image recognition for a decade. Specifically, in terms of robustness on out-of-distribution samples, recent research finds that Transformers are inherently more robust than CNNs, regardless of different training setups. Moreover, it is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se. In this paper, we question that belief by closely examining the design of Transformers. Our findings lead to three highly effective architecture designs for boosting robustness, yet simple enough to be implemented in several lines of code, namely a) patchifying input images, b) enlarging kernel size, and c) reducing activation layers and normalization layers. Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that is as robust as, or even more robust than, Transformers. We hope this work can help the community better understand the design of robust neural architectures. The code is publicly available at this https URL.
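The three design changes are named explicitly in the abstract; below is a minimal, hypothetical PyTorch sketch of what they can look like. The layer widths, patch size, kernel width of 11, and the single-norm/single-activation block layout are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """(a) Patchify the input: non-overlapping stride-p convolution, akin to a ViT patch embedding."""
    def __init__(self, in_ch=3, dim=96, patch=8):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x)

class LargeKernelBlock(nn.Module):
    """(b) Enlarged depthwise kernel; (c) only one normalization and one activation per block."""
    def __init__(self, dim=96, kernel=11):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pw1 = nn.Conv2d(dim, 4 * dim, 1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))

x = torch.randn(1, 3, 224, 224)
y = LargeKernelBlock()(PatchifyStem()(x))
print(y.shape)  # torch.Size([1, 96, 28, 28])
```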
https://arxiv.org/abs/2206.03452

4. [IR] Infinite Recommendation Networks: A Data-Centric Approach
N Sachdeva, M P Dhaliwal, C Wu, J McAuley
[University of California, San Diego & Facebook AI Research]
Infinite recommendation networks: a data-centric approach. This paper leverages the Neural Tangent Kernel and its equivalence to training infinitely wide neural networks to devise ∞-AE, an autoencoder with an infinitely wide bottleneck layer. The result is a highly expressive yet very simple recommendation model with a single hyperparameter and a closed-form solution. Leveraging ∞-AE's simplicity, the paper also proposes Distill-CF for synthesizing tiny, high-fidelity data summaries that distill the most important knowledge from extremely large and sparse user-item interaction matrices for efficient and accurate downstream data usage such as model training, inference, and architecture search. This is a data-centric approach to recommendation: the goal is to improve the quality of logged user-feedback data for subsequent modeling, independent of the learning algorithm. Differentiable Gumbel sampling is used to handle the inherent data heterogeneity, sparsity, and semi-structuredness, while scaling to datasets with hundreds of millions of user-item interactions. Both proposed approaches significantly outperform their respective state of the art and, when used together, achieve 96-105% of ∞-AE's full-dataset performance with as little as 0.1% of the original dataset size, leading to the counter-intuitive question: is more data really what you need for better recommendation?
We leverage the Neural Tangent Kernel and its equivalence to training infinitely-wide neural networks to devise ∞-AE: an autoencoder with infinitely-wide bottleneck layers. The outcome is a highly expressive yet simplistic recommendation model with a single hyper-parameter and a closed-form solution. Leveraging ∞-AE’s simplicity, we also develop Distill-CF for synthesizing tiny, high-fidelity data summaries which distill the most important knowledge from the extremely large and sparse user-item interaction matrix for efficient and accurate subsequent data-usage like model training, inference, architecture search, etc. This takes a data-centric approach to recommendation, where we aim to improve the quality of logged user-feedback data for subsequent modeling, independent of the learning algorithm. We particularly utilize the concept of differentiable Gumbel-sampling to handle the inherent data heterogeneity, sparsity, and semi-structuredness, while being scalable to datasets with hundreds of millions of user-item interactions. Both of our proposed approaches significantly outperform their respective state-of-the-art and when used together, we observe 96-105% of ∞-AE’s performance on the full dataset with as little as 0.1% of the original dataset size, leading us to explore the counter-intuitive question: Is more data what you need for better recommendation?
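The abstract highlights that ∞-AE has a closed-form solution. As a hedged illustration of what an infinitely-wide-network closed form looks like in practice, here is kernel ridge regression that reconstructs the interaction matrix, with a plain RBF kernel standing in for the Neural Tangent Kernel; the kernel choice, the reconstruction target, and the single regularizer `lam` are assumptions for illustration, not the paper's derivation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Toy stand-in for the NTK: RBF kernel between rows of A (n, d) and B (m, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def infinite_ae_like_predict(X_train, X_query, lam=1.0):
    """X_*: user-item interaction matrices (users x items). Closed-form
    kernel ridge regression against the interactions themselves."""
    K_train = rbf_kernel(X_train, X_train)
    K_query = rbf_kernel(X_query, X_train)
    alpha = np.linalg.solve(K_train + lam * np.eye(len(X_train)), X_train)
    return K_query @ alpha  # predicted item scores for the query users

X = (np.random.rand(50, 20) < 0.1).astype(float)  # toy sparse interactions
scores = infinite_ae_like_predict(X, X[:5])
print(scores.shape)  # (5, 20)
```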
https://arxiv.org/abs/2206.02626

5. [LG] Learning Backward Compatible Embeddings
W Hu, R Bansal, K Cao, N Rao, K Subbian, J Leskovec
[Stanford University & Amazon]
Learning backward compatible embeddings. Embeddings, low-dimensional vector representations of objects, are fundamental to building modern machine learning systems. In industrial settings there is usually an embedding team that trains an embedding model to solve intended tasks (e.g., product recommendation); the produced embeddings are then widely consumed by other teams to solve their unintended tasks (e.g., fraud detection). However, as the embedding model is updated and retrained to improve performance on the intended task, the newly generated embeddings are no longer compatible with existing consumer models. This means either that historical versions of the embeddings can never be retired, or that all consumer teams must retrain their models against the latest embedding version, both of which are extremely costly in practice. This paper studies the problem of embedding version updates and their backward compatibility, formalizing it so that the embedding team can keep updating the embedding version while consumer teams do not have to retrain their models. The proposed solution, based on learning backward compatible embeddings, allows the embedding model to be updated frequently while the latest embedding version can be quickly transformed into any backward compatible historical version. Under this framework, six methods are explored and systematically evaluated on a real-world recommender system application. The best method, BC-Aligner, maintains backward compatibility with existing unintended tasks even after multiple model version updates, while achieving intended-task performance similar to an embedding model optimized solely for the intended task.
Embeddings, low-dimensional vector representation of objects, are fundamental in building modern machine learning systems. In industrial settings, there is usually an embedding team that trains an embedding model to solve intended tasks (e.g., product recommendation). The produced embeddings are then widely consumed by consumer teams to solve their unintended tasks (e.g., fraud detection). However, as the embedding model gets updated and retrained to improve performance on the intended task, the newly-generated embeddings are no longer compatible with the existing consumer models. This means that historical versions of the embeddings can never be retired or all consumer teams have to retrain their models to make them compatible with the latest version of the embeddings, both of which are extremely costly in practice. Here we study the problem of embedding version updates and their backward compatibility. We formalize the problem where the goal is for the embedding team to keep updating the embedding version, while the consumer teams do not have to retrain their models. We develop a solution based on learning backward compatible embeddings, which allows the embedding model version to be updated frequently, while also allowing the latest version of the embedding to be quickly transformed into any backward compatible historical version of it, so that consumer teams do not have to retrain their models. Under our framework, we explore six methods and systematically evaluate them on a real-world recommender system application. We show that the best method, which we call BC-Aligner, maintains backward compatibility with existing unintended tasks even after multiple model version updates. Simultaneously, BC-Aligner achieves the intended task performance similar to the embedding model that is solely optimized for the intended task.
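The abstract describes transforming the latest embedding version into any backward compatible historical version, but the concrete BC-Aligner objective is not given here. A minimal sketch of that general idea, assuming a simple least-squares linear alignment map (an assumption, not the paper's method):

```python
import numpy as np

def fit_backward_alignment(E_new, E_old, lam=1e-3):
    """Fit a ridge-regularized linear map A so that E_new @ A approximates E_old.
    E_new: (n, d_new) and E_old: (n, d_old) are embeddings of the same objects
    under the new and the historical model versions."""
    d = E_new.shape[1]
    A = np.linalg.solve(E_new.T @ E_new + lam * np.eye(d), E_new.T @ E_old)
    return A  # (d_new, d_old)

rng = np.random.default_rng(0)
E_old = rng.normal(size=(1000, 64))    # version-1 embeddings
E_new = rng.normal(size=(1000, 128))   # version-2 embeddings of the same items
A = fit_backward_alignment(E_new, E_old)
backcast = E_new @ A                   # feed this to unchanged consumer models
print(backcast.shape)  # (1000, 64)
```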
https://arxiv.org/abs/2206.03040

A few other papers worth noting:
[CV] Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Multimodal contrastive learning with LIMoE: the language-image mixture of experts
B Mustafa, C Riquelme, J Puigcerver, R Jenatton, N Houlsby
[Google Brain]
https://arxiv.org/abs/2206.02770

[CV] EfficientNeRF: Efficient Neural Radiance Fields
EfficientNeRF: efficient neural radiance fields
T Hu, S Liu, Y Chen, T Shen, J Jia
[The Chinese University of Hong Kong & SmartMore]
https://arxiv.org/abs/2206.00878

[CV] VN-Transformer: Rotation-Equivariant Attention for Vector Neurons
VN-Transformer: rotation-equivariant attention for vector neurons
S Assaad, C Downey, R Al-Rfou, N Nayakanti, B Sapp
[Duke University & Waymo LLC]
https://arxiv.org/abs/2206.04176

[LG] Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics
Decomposed linear dynamical systems (dLDS) for learning the latent components of neural dynamics
N Mudrik, Y Chen, E Yezerets, C J. Rozell, A S. Charles
[Johns Hopkins University & Georgia Institute of Technology]
https://arxiv.org/abs/2206.02972