2020年 ICLR 国际会议最终接受论文(poster-paper)列表(一)

来源:AINLPer微信公众号(点击了解一下吧
编辑: ShuYini
校稿: ShuYini
时间: 2020-01-22

    2020年的ICLR会议将于今年的4月26日-4月30日在Millennium Hall, Addis Ababa ETHIOPIA(埃塞俄比亚首都亚的斯亚贝巴 千禧大厅)举行。

    2020年ICLR会议(Eighth International Conference on Learning Representations)论文接受结果刚刚出来,今年的论文接受情况如下:poster-paper共523篇,Spotlight-paper共107篇,演讲Talk共48篇,共计接受678篇文章,被拒论文(reject-paper)共计1907篇,接受率为:26.48%。

    下面是ICLR2020接受的论文(poster-paper)列表,欢迎大家Ctrl+F进行搜索查看。

    关注 AINLPer ,回复:ICLR2020 获取会议全部列表PDF,其中一共有四个文件(2020-ICLR-accept-poster.pdf、2020-ICLR-accept-spotlight.pdf、2020-ICLR-accept-talk.pdf、2020-ICLR-reject.pdf)

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Author: Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh
link: https://openreview.net/pdf?id=Syx4wnEtvH
Code: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
Abstract: Training large deep neural networks on massive datasets is computationally very challenging. There has been recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32868 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes.
Keyword: large-batch optimization, distributed training, fast optimizer

SELF: Learning to Filter Noisy Labels with Self-Ensembling
Author: Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, Thomas Brox
link: https://openreview.net/pdf?id=HkgsPhNYPS
Code: None
Abstract: Deep neural networks (DNNs) have been shown to over-fit a dataset when being trained with noisy labels for a long enough time. To overcome this problem, we present a simple and effective method self-ensemble label filtering (SELF) to progressively filter out the wrong labels during training. Our method improves the task performance by gradually allowing supervision only from the potentially non-noisy (clean) labels and stops learning on the filtered noisy labels. For the filtering, we form running averages of predictions over the entire training dataset using the network output at different training epochs. We show that these ensemble estimates yield more accurate identification of inconsistent predictions throughout training than the single estimates of the network at the most recent training epoch. While filtered samples are removed entirely from the supervised training loss, we dynamically leverage them via semi-supervised learning in the unsupervised loss. We demonstrate the positive effect of such an approach on various image classification tasks under both symmetric and asymmetric label noise and at different noise ratios. It substantially outperforms all previous works on noise-aware learning across different datasets and can be applied to a broad set of network architectures.
Keyword: Ensemble Learning, Robust Learning, Noisy Labels, Labels Filtering

Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation
Author: Yu Chen, Lingfei Wu, Mohammed J. Zaki
link: https://openreview.net/pdf?id=HygnDhEtvr
Code: https://github.com/hugochan/RL-based-Graph2Seq-for-NQG
Abstract: Natural question generation (QG) aims to generate questions from a passage and an answer. Previous works on QG either (i) ignore the rich structure information hidden in text, (ii) solely rely on cross-entropy loss that leads to issues like exposure bias and inconsistency between train/test measurement, or (iii) fail to fully exploit the answer information. To address these limitations, in this paper, we propose a reinforcement learning (RL) based graph-to-sequence (Graph2Seq) model for QG. Our model consists of a Graph2Seq generator with a novel Bidirectional Gated Graph Neural Network based encoder to embed the passage, and a hybrid evaluator with a mixed objective combining both cross-entropy and RL losses to ensure the generation of syntactically and semantically valid text. We also introduce an effective Deep Alignment Network for incorporating the answer information into the passage at both the word and contextual levels. Our model is end-to-end trainable and achieves new state-of-the-art scores, outperforming existing methods by a significant margin on the standard SQuAD benchmark.
Keyword: deep learning, reinforcement learning, graph neural networks, natural language processing, question generation

Sharing Knowledge in Multi-Task Deep Reinforcement Learning
Author: Carlo D’Eramo, Davide Tateo, Andrea Bonarini, Marcello Restelli, Jan Peters
link: https://openreview.net/pdf?id=rkgpv2VFvr
Code: https://github.com/carloderamo/shared/tree/master
Abstract: We study the benefit of sharing representations among tasks to enable the effective use of deep neural networks in Multi-Task Reinforcement Learning. We leverage the assumption that learning from different tasks, sharing common properties, is helpful to generalize the knowledge of them resulting in a more effective feature extraction compared to learning a single task. Intuitively, the resulting set of features offers performance benefits when used by Reinforcement Learning algorithms. We prove this by providing theoretical guarantees that highlight the conditions for which is convenient to share representations among tasks, extending the well-known finite-time bounds of Approximate Value-Iteration to the multi-task setting. In addition, we complement our analysis by proposing multi-task extensions of three Reinforcement Learning algorithms that we empirically evaluate on widely used Reinforcement Learning benchmarks showing significant improvements over the single-task counterparts in terms of sample efficiency and performance.
Keyword: Deep Reinforcement Learning, Multi-Task

On the Weaknesses of Reinforcement Learning for Neural Machine Translation
Author: Leshem Choshen, Lior Fox, Zohar Aizenbud, Omri Abend
link: https://openreview.net/pdf?id=H1eCw3EKvH
Code: None
Abstract: Reinforcement learning (RL) is frequently used to increase performance in text generation tasks,
including machine translation (MT),
notably through the use of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN).
However, little is known about what and how these methods learn in the context of MT.
We prove that one of the most common RL methods for MT does not optimize the
expected reward, as well as show that other methods take an infeasibly long time to converge.
In fact, our results suggest that RL practices in MT are likely to improve performance
only where the pre-trained parameters are already close to yielding the correct translation.
Our findings further suggest that observed gains may be due to effects unrelated to the training signal, concretely, changes in the shape of the distribution curve.
Keyword: Reinforcement learning, MRT, minimum risk training, reinforce, machine translation, peakkiness, generation

StructPool: Structured Graph Pooling via Conditional Random Fields
Author: Hao Yuan, Shuiwang Ji
link: https://openreview.net/pdf?id=BJxg_hVtwH
Code: None
Abstract: Learning high-level representations for graphs is of great importance for graph analysis tasks. In addition to graph convolution, graph pooling is an important but less explored research area. In particular, most of existing graph pooling techniques do not consider the graph structural information explicitly. We argue that such information is important and develop a novel graph pooling technique, know as the StructPool, in this work. We consider the graph pooling as a node clustering problem, which requires the learning of a cluster assignment matrix. We propose to formulate it as a structured prediction problem and employ conditional random fields to capture the relationships among assignments of different nodes. We also generalize our method to incorporate graph topological information in designing the Gibbs energy function. Experimental results on multiple datasets demonstrate the effectiveness of our proposed StructPool.
Keyword: Graph Pooling, Representation Learning, Graph Analysis

Learning deep graph matching with channel-independent embedding and Hungarian attention
Author: Tianshu Yu, Runzhong Wang, Junchi Yan, Baoxin Li
link: https://openreview.net/pdf?id=rJgBd2NYPH
Code: None
Abstract: Graph matching aims to establishing node-wise correspondence between two graphs, which is a classic combinatorial problem and in general NP-complete. Until very recently, deep graph matching methods start to resort to deep networks to achieve unprecedented matching accuracy. Along this direction, this paper makes two complementary contributions which can also be reused as plugin in existing works: i) a novel node and edge embedding strategy which stimulates the multi-head strategy in attention models and allows the information in each channel to be merged independently. In contrast, only node embedding is accounted in previous works; ii) a general masking mechanism over the loss function is devised to improve the smoothness of objective learning for graph matching. Using Hungarian algorithm, it dynamically constructs a structured and sparsely connected layer, taking into account the most contributing matching pairs as hard attention. Our approach performs competitively, and can also improve state-of-the-art methods as plugin, regarding with matching accuracy on three public benchmarks.
Keyword: deep graph matching, edge embedding, combinatorial problem, Hungarian loss

Graph inference learning for semi-supervised classification
Author: Chunyan Xu, Zhen Cui, Xiaobin Hong, Tong Zhang, Jian Yang, Wei Liu
link: https://openreview.net/pdf?id=r1evOhEKvH
Code: None
Abstract: In this work, we address the semi-supervised classification of graph data, where the categories of those unlabeled nodes are inferred from labeled nodes as well as graph structures. Recent works often solve this problem with the advanced graph convolution in a conventional supervised manner, but the performance could be heavily affected when labeled data is scarce. Here we propose a Graph Inference Learning (GIL) framework to boost the performance of node classification by learning the inference of node labels on graph topology. To bridge the connection of two nodes, we formally define a structure relation by encapsulating node attributes, between-node paths and local topological structures together, which can make inference conveniently deduced from one node to another node. For learning the inference process, we further introduce meta-optimization on structure relations from training nodes to validation nodes, such that the learnt graph inference capability can be better self-adapted into test nodes. Comprehensive evaluations on four benchmark datasets (including Cora, Citeseer, Pubmed and NELL) demonstrate the superiority of our GIL when compared with other state-of-the-art methods in the semi-supervised node classification task.
Keyword: semi-supervised classification, graph inference learning

SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards
Author: Siddharth Reddy, Anca D. Dragan, Sergey Levine
link: https://openreview.net/pdf?id=S1xKd24twB
Code: None
Abstract: Learning to imitate expert behavior from demonstrations can be challenging, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, it can drift away from demonstrated states due to error accumulation. Recent methods based on reinforcement learning (RL), such as inverse RL and generative adversarial imitation learning (GAIL), overcome this issue by training an RL agent to match the demonstrations over a long horizon. Since the true reward function for the task is unknown, these methods learn a reward function from the demonstrations, often using complex and brittle approximation techniques that involve adversarial training. We propose a simple alternative that still uses RL, but does not require learning a reward function. The key idea is to provide the agent with an incentive to match the demonstrations over a long horizon, by encouraging it to return to demonstrated states upon encountering new, out-of-distribution states. We accomplish this by giving the agent a constant reward of r=+1 for matching the demonstrated action in a demonstrated state, and a constant reward of r=0 for all other behavior. Our method, which we call soft Q imitation learning (SQIL), can be implemented with a handful of minor modifications to any standard Q-learning or off-policy actor-critic algorithm. Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation. Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo. This paper is a proof of concept that illustrates how a simple imitation method based on RL with constant rewards can be as effective as more complex methods that use learned rewards.
Keyword: Imitation Learning, Reinforcement Learning

Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data
Author: Sergei Popov, Stanislav Morozov, Artem Babenko
link: https://openreview.net/pdf?id=r1eiu2VtwH
Code: https://github.com/anonICLR2020/node
Abstract: Nowadays, deep neural networks (DNNs) have become the main instrument for machine learning tasks within a wide range of domains, including vision, NLP, and speech. Meanwhile, in an important case of heterogenous tabular data, the advantage of DNNs over shallow counterparts remains questionable. In particular, there is no sufficient evidence that deep learning machinery allows constructing methods that outperform gradient boosting decision trees (GBDT), which are often the top choice for tabular problems. In this paper, we introduce Neural Oblivious Decision Ensembles (NODE), a new deep learning architecture, designed to work with any tabular data. In a nutshell, the proposed NODE architecture generalizes ensembles of oblivious decision trees, but benefits from both end-to-end gradient-based optimization and the power of multi-layer hierarchical representation learning. With an extensive experimental comparison to the leading GBDT packages on a large number of tabular datasets, we demonstrate the advantage of the proposed NODE architecture, which outperforms the competitors on most of the tasks. We open-source the PyTorch implementation of NODE and believe that it will become a universal framework for machine learning on tabular data.
Keyword: tabular data, architectures, DNN

Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification
Author: Yixiao Ge, Dapeng Chen, Hongsheng Li
link: https://openreview.net/pdf?id=rJlnOhVYPS
Code: https://github.com/yxgeee/MMT
Abstract: Person re-identification (re-ID) aims at identifying the same persons’ images across different cameras. However, domain diversities between different datasets pose an evident challenge for adapting the re-ID model trained on one dataset to another one. State-of-the-art unsupervised domain adaptation methods for person re-ID transferred the learned knowledge from the source domain by optimizing with pseudo labels created by clustering algorithms on the target domain. Although they achieved state-of-the-art performances, the inevitable label noise caused by the clustering procedure was ignored. Such noisy pseudo labels substantially hinders the model’s capability on further improving feature representations on the target domain. In order to mitigate the effects of noisy pseudo labels, we propose to softly refine the pseudo labels in the target domain by proposing an unsupervised framework, Mutual Mean-Teaching (MMT), to learn better features from the target domain via off-line refined hard pseudo labels and on-line refined soft pseudo labels in an alternative training manner. In addition, the common practice is to adopt both the classification loss and the triplet loss jointly for achieving optimal performances in person re-ID models. However, conventional triplet loss cannot work with softly refined labels. To solve this problem, a novel soft softmax-triplet loss is proposed to support learning with soft pseudo triplet labels for achieving the optimal domain adaptation performance. The proposed MMT framework achieves considerable improvements of 14.4%, 18.2%, 13.1% and 16.4% mAP on Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks.
Keyword: Label Refinery, Unsupervised Domain Adaptation, Person Re-identification

Automatically Discovering and Learning New Visual Categories with Ranking Statistics
Author: Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman
link: https://openreview.net/pdf?id=BJl2_nVFPB
Code: http://www.robots.ox.ac.uk/~vgg/research/auto_novel/
Abstract: We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data. In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model’s knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin.
Keyword: deep learning, classification, novel classes, transfer learning, clustering, incremental learning

Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
Author: Qingfeng Lan, Yangchen Pan, Alona Fyshe, Martha White
link: https://openreview.net/pdf?id=Bkg0u3Etwr
Code: https://github.com/qlan3/Explorer
Abstract: Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias interacts with performance, and the extent to which existing algorithms mitigate bias. In this paper, we 1) highlight that the effect of overestimation bias on learning efficiency is environment-dependent; 2) propose a generalization of Q-learning, called \emph{Maxmin Q-learning}, which provides a parameter to flexibly control bias; 3) show theoretically that there exists a parameter choice for Maxmin Q-learning that leads to unbiased estimation with a lower approximation variance than Q-learning; and 4) prove the convergence of our algorithm in the tabular case, as well as convergence of several previous Q-learning variants, using a novel Generalized Q-learning framework. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
Keyword: reinforcement learning, bias and variance reduction

Federated Adversarial Domain Adaptation
Author: Xingchao Peng, Zijun Huang, Yizhe Zhu, Kate Saenko
link: https://openreview.net/pdf?id=HJezF3VYPB
Code: https://drive.google.com/file/d/1OekTpqB6qLfjlE2XUjQPm3F110KDMFc0/view?usp=sharing
Abstract: Federated learning improves data privacy and efficiency in machine learning performed over networks of distributed devices, such as mobile phones, IoT and wearable devices, etc. Yet models trained with federated learning can still fail to generalize to new devices due to the problem of domain shift. Domain shift occurs when the labeled data collected by source nodes statistically differs from the target node’s unlabeled data. In this work, we present a principled approach to the problem of federated domain adaptation, which aims to align the representations learned among the different nodes with the data distribution of the target node. Our approach extends adversarial adaptation techniques to the constraints of the federated setting. In addition, we devise a dynamic attention mechanism and leverage feature disentanglement to enhance knowledge transfer. Empirically, we perform extensive experiments on several image and text classification tasks and show promising results under unsupervised federated domain adaptation setting.
Keyword: Federated Learning, Domain Adaptation, Transfer Learning, Feature Disentanglement

Depth-Adaptive Transformer
Author: Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli
link: https://openreview.net/pdf?id=SJg7KhVKPH
Code: None
Abstract: State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation as well as the model capacity. On IWSLT German-English translation our approach matches the accuracy of a well tuned baseline Transformer while using less than a quarter of the decoder layers.
Keyword: Deep learning, natural language processing, sequence modeling

DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures
Author: Huanrui Yang, Wei Wen, Hai Li
link: https://openreview.net/pdf?id=rylBK34FDS
Code: https://github.com/yanghr/DeepHoyer
Abstract: In seeking for sparse and efficient neural network models, many previous works investigated on enforcing L1 or L0 regularizers to encourage weight sparsity during training. The L0 regularizer measures the parameter sparsity directly and is invariant to the scaling of parameter values. But it cannot provide useful gradients and therefore requires complex optimization techniques. The L1 regularizer is almost everywhere differentiable and can be easily optimized with gradient descent. Yet it is not scale-invariant and causes the same shrinking rate to all parameters, which is inefficient in increasing sparsity. Inspired by the Hoyer measure (the ratio between L1 and L2 norms) used in traditional compressed sensing problems, we present DeepHoyer, a set of sparsity-inducing regularizers that are both differentiable almost everywhere and scale-invariant. Our experiments show that enforcing DeepHoyer regularizers can produce even sparser neural network models than previous works, under the same accuracy level. We also show that DeepHoyer can be applied to both element-wise and structural pruning.
Keyword: Deep neural network, Sparsity inducing regularizer, Model compression

Evaluating The Search Phase of Neural Architecture Search
Author: Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, Mathieu Salzmann
link: https://openreview.net/pdf?id=H1loF2NFwr
Code: https://github.com/kcyu2014/eval-nas
Abstract:
Neural Architecture Search (NAS) aims to facilitate the design of deep networks for new tasks. Existing techniques rely on two stages: searching over the architecture space and validating the best architecture. NAS algorithms are currently compared solely based on their results on the downstream task. While intuitive, this fails to explicitly evaluate the effectiveness of their search strategies. In this paper, we propose to evaluate the NAS search phase.
To this end, we compare the quality of the solutions obtained by NAS search policies with that of random architecture selection. We find that: (i) On average, the state-of-the-art NAS algorithms perform similarly to the random policy; (ii) the widely-used weight sharing strategy degrades the ranking of the NAS candidates to the point of not reflecting their true performance, thus reducing the effectiveness of the search process.
We believe that our evaluation framework will be key to designing NAS strategies that consistently discover architectures superior to random ones.
Keyword: Neural architecture search, parameter sharing, random search, evaluation framework

Diverse Trajectory Forecasting with Determinantal Point Processes
Author: Ye Yuan, Kris M. Kitani
link: https://openreview.net/pdf?id=ryxnY3NYPS
Code: None
Abstract: The ability to forecast a set of likely yet diverse possible future behaviors of an agent (e.g., future trajectories of a pedestrian) is essential for safety-critical perception systems (e.g., autonomous vehicles). In particular, a set of possible future behaviors generated by the system must be diverse to account for all possible outcomes in order to take necessary safety precautions. It is not sufficient to maintain a set of the most likely future outcomes because the set may only contain perturbations of a dominating single outcome (major mode). While generative models such as variational autoencoders (VAEs) have been shown to be a powerful tool for learning a distribution over future trajectories, randomly drawn samples from the learned implicit likelihood model may not be diverse – the likelihood model is derived from the training data distribution and the samples will concentrate around the major mode of the data. In this work, we propose to learn a diversity sampling function (DSF) that generates a diverse yet likely set of future trajectories. The DSF maps forecasting context features to a set of latent codes which can be decoded by a generative model (e.g., VAE) into a set of diverse trajectory samples. Concretely, the process of identifying the diverse set of samples is posed as DSF parameter estimation. To learn the parameters of the DSF, the diversity of the trajectory samples is evaluated by a diversity loss based on a determinantal point process (DPP). Gradient descent is performed over the DSF parameters, which in turn moves the latent codes of the sample set to find an optimal set of diverse yet likely trajectories. Our method is a novel application of DPPs to optimize a set of items (forecasted trajectories) in continuous space. We demonstrate the diversity of the trajectories produced by our approach on both low-dimensional 2D trajectory data and high-dimensional human motion data.
Keyword: Diverse Inference, Generative Models, Trajectory Forecasting

ProxSGD: Training Structured Neural Networks under Regularization and Constraints
Author: Yang Yang, Yaxiong Yuan, Avraam Chatzimichailidis, Ruud JG van Sloun, Lei Lei, Symeon Chatzinotas
link: https://openreview.net/pdf?id=HygpthEtvr
Code: https://github.com/optyang/proxsgd;
Abstract: In this paper, we consider the problem of training neural networks (NN). To promote a NN with specific structures, we explicitly take into consideration the nonsmooth regularization (such as L1-norm) and constraints (such as interval constraint). This is formulated as a constrained nonsmooth nonconvex optimization problem, and we propose a convergent proximal-type stochastic gradient descent (Prox-SGD) algorithm. We show that under properly selected learning rates, momentum eventually resembles the unknown real gradient and thus is crucial in analyzing the convergence. We establish that with probability 1, every limit point of the sequence generated by the proposed Prox-SGD is a stationary point. Then the Prox-SGD is tailored to train a sparse neural network and a binary neural network, and the theoretical analysis is also supported by extensive numerical tests.
Keyword: stochastic gradient descent, regularization, constrained optimization, nonsmooth optimization

LAMOL: LAnguage MOdeling for Lifelong Language Learning
Author: Fan-Keng Sun*, Cheng-Hao Ho*, Hung-Yi Lee
link: https://openreview.net/pdf?id=Skgxcn4YDS
Code: https://github.com/jojotenya/LAMOL
Abstract: Most research on lifelong learning applies to images or games, but not language.
We present LAMOL, a simple yet effective method for lifelong language learning (LLL) based on language modeling.
LAMOL replays pseudo-samples of previous tasks while requiring no extra memory or model capacity.
Specifically, LAMOL is a language model that simultaneously learns to solve the tasks and generate training samples.
When the model is trained for a new task, it generates pseudo-samples of previous tasks for training alongside data for the new task.
The results show that LAMOL prevents catastrophic forgetting without any sign of intransigence and can perform five very different language tasks sequentially with only one model.
Overall, LAMOL outperforms previous methods by a considerable margin and is only 2-3% worse than multitasking, which is usually considered the LLL upper bound.
The source code is available at
Keyword: NLP, Deep Learning, Lifelong Learning

Learning Expensive Coordination: An Event-Based Deep RL Approach
Author: Zhenyu Shi*, Runsheng Yu*, Xinrun Wang*, Rundong Wang, Youzhi Zhang, Hanjiang Lai, Bo An
link: https://openreview.net/pdf?id=ryeG924twB
Code: None
Abstract: Existing works in deep Multi-Agent Reinforcement Learning (MARL) mainly focus on coordinating cooperative agents to complete certain tasks jointly. However, in many cases of the real world, agents are self-interested such as employees in a company and clubs in a league. Therefore, the leader, i.e., the manager of the company or the league, needs to provide bonuses to followers for efficient coordination, which we call expensive coordination. The main difficulties of expensive coordination are that i) the leader has to consider the long-term effect and predict the followers’ behaviors when assigning bonuses and ii) the complex interactions between followers make the training process hard to converge, especially when the leader’s policy changes with time. In this work, we address this problem through an event-based deep RL approach. Our main contributions are threefold. (1) We model the leader’s decision-making process as a semi-Markov Decision Process and propose a novel multi-agent event-based policy gradient to learn the leader’s long-term policy. (2) We exploit the leader-follower consistency scheme to design a follower-aware module and a follower-specific attention module to predict the followers’ behaviors and make accurate response to their behaviors. (3) We propose an action abstraction-based policy gradient algorithm to reduce the followers’ decision space and thus accelerate the training process of followers. Experiments in resource collections, navigation, and the predator-prey game reveal that our approach outperforms the state-of-the-art methods dramatically.
Keyword: Multi-Agent Deep Reinforcement Learning, Deep Reinforcement Learning, Leader–Follower Markov Game, Expensive Coordination

Curvature Graph Network
Author: Ze Ye, Kin Sum Liu, Tengfei Ma, Jie Gao, Chao Chen
link: https://openreview.net/pdf?id=BylEqnVFDB
Code: None
Abstract: Graph-structured data is prevalent in many domains. Despite the widely celebrated success of deep neural networks, their power in graph-structured data is yet to be fully explored. We propose a novel network architecture that incorporates advanced graph structural features. In particular, we leverage discrete graph curvature, which measures how the neighborhoods of a pair of nodes are structurally related. The curvature of an edge (x, y) defines the distance taken to travel from neighbors of x to neighbors of y, compared with the length of edge (x, y). It is a much more descriptive feature compared to previously used features that only focus on node specific attributes or limited topological information such as degree. Our curvature graph convolution network outperforms state-of-the-art on various synthetic and real-world graphs, especially the larger and denser ones.
Keyword: Deep Learning, Graph Convolution, Ricci Curvature.

Distance-Based Learning from Errors for Confidence Calibration
Author: Chen Xing, Sercan Arik, Zizhao Zhang, Tomas Pfister
link: https://openreview.net/pdf?id=BJeB5hVtvB
Code: https://drive.google.com/open?id=1UThGvkkvFvKX8ogsfwvdA3uY8xzDlIuL
Abstract: Deep neural networks (DNNs) are poorly calibrated when trained in conventional ways. To improve confidence calibration of DNNs, we propose a novel training method, distance-based learning from errors (DBLE). DBLE bases its confidence estimation on distances in the representation space. In DBLE, we first adapt prototypical learning to train classification models. It yields a representation space where the distance between a test sample and its ground truth class center can calibrate the model’s classification performance. At inference, however, these distances are not available due to the lack of ground truth labels. To circumvent this by inferring the distance for every test sample, we propose to train a confidence model jointly with the classification model. We integrate this into training by merely learning from mis-classified training samples, which we show to be highly beneficial for effective learning. On multiple datasets and DNN architectures, we demonstrate that DBLE outperforms alternative single-model confidence calibration approaches. DBLE also achieves comparable performance with computationally-expensive ensemble approaches with lower computational cost and lower number of parameters.
Keyword: Confidence Calibration, Uncertainty Estimation, Prototypical Learning

Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient
Author: Tianshu Yu, Yikang Li, Baoxin Li
link: https://openreview.net/pdf?id=rkeIq2VYPr
Code: None
Abstract: Determinantal point processes (DPPs) is an effective tool to deliver diversity on multiple machine learning and computer vision tasks. Under deep learning framework, DPP is typically optimized via approximation, which is not straightforward and has some conflict with diversity requirement. We note, however, there has been no deep learning paradigms to optimize DPP directly since it involves matrix inversion which may result in highly computational instability. This fact greatly hinders the wide use of DPP on some specific objectives where DPP serves as a term to measure the feature diversity. In this paper, we devise a simple but effective algorithm to address this issue to optimize DPP term directly expressed with L-ensemble in spectral domain over gram matrix, which is more flexible than learning on parametric kernels. By further taking into account some geometric constraints, our algorithm seeks to generate valid sub-gradients of DPP term in case when the DPP gram matrix is not invertible (no gradients exist in this case). In this sense, our algorithm can be easily incorporated with multiple deep learning tasks. Experiments show the effectiveness of our algorithm, indicating promising performance for practical learning problems.
Keyword: determinantal point processes, deep learning, optimization

N-BEATS: Neural basis expansion analysis for interpretable time series forecasting
Author: Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, Yoshua Bengio
link: https://openreview.net/pdf?id=r1ecqn4YwB
Code: None
Abstract: We focus on solving the univariate times series point forecasting problem using deep learning. We propose a deep neural architecture based on backward and forward residual links and a very deep stack of fully-connected layers. The architecture has a number of desirable properties, being interpretable, applicable without modification to a wide array of target domains, and fast to train. We test the proposed architecture on several well-known datasets, including M3, M4 and TOURISM competition datasets containing time series from diverse domains. We demonstrate state-of-the-art performance for two configurations of N-BEATS for all the datasets, improving forecast accuracy by 11% over a statistical benchmark and by 3% over last year’s winner of the M4 competition, a domain-adjusted hand-crafted hybrid between neural network and statistical time series models. The first configuration of our model does not employ any time-series-specific components and its performance on heterogeneous datasets strongly suggests that, contrarily to received wisdom, deep learning primitives such as residual blocks are by themselves sufficient to solve a wide range of forecasting problems. Finally, we demonstrate how the proposed architecture can be augmented to provide outputs that are interpretable without considerable loss in accuracy.
Keyword: time series forecasting, deep learning

Automated Relational Meta-learning
Author: Huaxiu Yao, Xian Wu, Zhiqiang Tao, Yaliang Li, Bolin Ding, Ruirui Li, Zhenhui Li
link: https://openreview.net/pdf?id=rklp93EtwH
Code: https://github.com/huaxiuyao/ARML
Abstract: In order to efficiently learn with small amount of data on new tasks, meta-learning transfers knowledge learned from previous tasks to the new ones. However, a critical challenge in meta-learning is the task heterogeneity which cannot be well handled by traditional globally shared meta-learning methods. In addition, current task-specific meta-learning methods may either suffer from hand-crafted structure design or lack the capability to capture complex relations between tasks. In this paper, motivated by the way of knowledge organization in knowledge bases, we propose an automated relational meta-learning (ARML) framework that automatically extracts the cross-task relations and constructs the meta-knowledge graph. When a new task arrives, it can quickly find the most relevant structure and tailor the learned structure knowledge to the meta-learner. As a result, the proposed framework not only addresses the challenge of task heterogeneity by a learned meta-knowledge graph, but also increases the model interpretability. We conduct extensive experiments on 2D toy regression and few-shot image classification and the results demonstrate the superiority of ARML over state-of-the-art baselines.
Keyword: meta-learning, task heterogeneity, meta-knowledge graph

To Relieve Your Headache of Training an MRF, Take AdVIL
Author: Chongxuan Li, Chao Du, Kun Xu, Max Welling, Jun Zhu, Bo Zhang
link: https://openreview.net/pdf?id=Sylgsn4Fvr
Code: https://anonymous.4open.science/r/8c779fbc-6394-40c7-8273-e52504814703/
Abstract: We propose a black-box algorithm called {\it Adversarial Variational Inference and Learning} (AdVIL) to perform inference and learning on a general Markov random field (MRF). AdVIL employs two variational distributions to approximately infer the latent variables and estimate the partition function of an MRF, respectively. The two variational distributions provide an estimate of the negative log-likelihood of the MRF as a minimax optimization problem, which is solved by stochastic gradient descent. AdVIL is proven convergent under certain conditions. On one hand, compared with contrastive divergence, AdVIL requires a minimal assumption about the model structure and can deal with a broader family of MRFs. On the other hand, compared with existing black-box methods, AdVIL provides a tighter estimate of the log partition function and achieves much better empirical results.
Keyword: Markov Random Fields, Undirected Graphical Models, Variational Inference, Black-box Infernece

Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware
Author: Xiandong Zhao, Ying Wang, Xuyi Cai, Cheng Liu, Lei Zhang
link: https://openreview.net/pdf?id=H1lBj2VFPS
Code: https://anonymous.4open.science/r/c05a5b6a-1d0c-4201-926f-e7b52034f7a5/
Abstract: With the proliferation of specialized neural network processors that operate on low-precision integers, the performance of Deep Neural Network inference becomes increasingly dependent on the result of quantization. Despite plenty of prior work on the quantization of weights or activations for neural networks, there is still a wide gap between the software quantizers and the low-precision accelerator implementation, which degrades either the efficiency of networks or that of the hardware for the lack of software and hardware coordination at design-phase. In this paper, we propose a learned linear symmetric quantizer for integer neural network processors, which not only quantizes neural parameters and activations to low-bit integer but also accelerates hardware inference by using batch normalization fusion and low-precision accumulators (e.g., 16-bit) and multipliers (e.g., 4-bit). We use a unified way to quantize weights and activations, and the results outperform many previous approaches for various networks such as AlexNet, ResNet, and lightweight models like MobileNet while keeping friendly to the accelerator architecture. Additional, we also apply the method to object detection models and witness high performance and accuracy in YOLO-v2. Finally, we deploy the quantized models on our specialized integer-arithmetic-only DNN accelerator to show the effectiveness of the proposed quantizer. We show that even with linear symmetric quantization, the results can be better than asymmetric or non-linear methods in 4-bit networks. In evaluation, the proposed quantizer induces less than 0.4% accuracy drop in ResNet18, ResNet34, and AlexNet when quantizing the whole network as required by the integer processors.
Keyword: quantization, integer-arithmetic-only DNN accelerator, acceleration

Weakly Supervised Clustering by Exploiting Unique Class Count
Author: Mustafa Umit Oner, Hwee Kuan Lee, Wing-Kin Sung
link: https://openreview.net/pdf?id=B1xIj3VYvr
Code: http://bit.ly/uniqueclasscount
Abstract: A weakly supervised learning based clustering framework is proposed in this paper. As the core of this framework, we introduce a novel multiple instance learning task based on a bag level label called unique class count (ucc), which is the number of unique classes among all instances inside the bag. In this task, no annotations on individual instances inside the bag are needed during training of the models. We mathematically prove that with a perfect ucc classifier, perfect clustering of individual instances inside the bags is possible even when no annotations on individual instances are given during training. We have constructed a neural network based ucc classifier and experimentally shown that the clustering performance of our framework with our weakly supervised ucc classifier is comparable to that of fully supervised learning models where labels for all instances are known. Furthermore, we have tested the applicability of our framework to a real world task of semantic segmentation of breast cancer metastases in histological lymph node sections and shown that the performance of our weakly supervised framework is comparable to the performance of a fully supervised Unet model.
Keyword: weakly supervised clustering, weakly supervised learning, multiple instance learning

Scalable and Order-robust Continual Learning with Additive Parameter Decomposition
Author: Jaehong Yoon, Saehoon Kim, Eunho Yang, Sung Ju Hwang
link: https://openreview.net/pdf?id=r1gdj2EKPB
Code: https://github.com/iclr2020-apd/anonymous_iclr2020_apd_code
Abstract: While recent continual learning methods largely alleviate the catastrophic problem on toy-sized datasets, there are issues that remain to be tackled in order to apply them to real-world problem domains. First, a continual learning model should effectively handle catastrophic forgetting and be efficient to train even with a large number of tasks. Secondly, it needs to tackle the problem of order-sensitivity, where the performance of the tasks largely varies based on the order of the task arrival sequence, as it may cause serious problems where fairness plays a critical role (e.g. medical diagnosis). To tackle these practical challenges, we propose a novel continual learning method that is scalable as well as order-robust, which instead of learning a completely shared set of weights, represents the parameters for each task as a sum of task-shared and sparse task-adaptive parameters. With our Additive Parameter Decomposition (APD), the task-adaptive parameters for earlier tasks remain mostly unaffected, where we update them only to reflect the changes made to the task-shared parameters. This decomposition of parameters effectively prevents catastrophic forgetting and order-sensitivity, while being computation- and memory-efficient. Further, we can achieve even better scalability with APD using hierarchical knowledge consolidation, which clusters the task-adaptive parameters to obtain hierarchically shared parameters. We validate our network with APD, APD-Net, on multiple benchmark datasets against state-of-the-art continual learning methods, which it largely outperforms in accuracy, scalability, and order-robustness.
Keyword: Continual Learning, Lifelong Learning, Catastrophic Forgetting, Deep Learning

Continual Learning with Adaptive Weights (CLAW)
Author: Tameem Adel, Han Zhao, Richard E. Turner
link: https://openreview.net/pdf?id=Hklso24Kwr
Code: None
Abstract: Approaches to continual learning aim to successfully learn a set of related tasks that arrive in an online manner. Recently, several frameworks have been developed which enable deep learning to be deployed in this learning scenario. A key modelling decision is to what extent the architecture should be shared across tasks. On the one hand, separately modelling each task avoids catastrophic forgetting but it does not support transfer learning and leads to large models. On the other hand, rigidly specifying a shared component and a task-specific part enables task transfer and limits the model size, but it is vulnerable to catastrophic forgetting and restricts the form of task-transfer that can occur. Ideally, the network should adaptively identify which parts of the network to share in a data driven way. Here we introduce such an approach called Continual Learning with Adaptive Weights (CLAW), which is based on probabilistic modelling and variational inference. Experiments show that CLAW achieves state-of-the-art performance on six benchmarks in terms of overall continual learning performance, as measured by classification accuracy, and in terms of addressing catastrophic forgetting.
Keyword: Continual learning

Transferable Perturbations of Deep Feature Distributions
Author: Nathan Inkawhich, Kevin Liang, Lawrence Carin, Yiran Chen
link: https://openreview.net/pdf?id=rJxAo2VYwr
Code: None
Abstract: Almost all current adversarial attacks of CNN classifiers rely on information derived from the output layer of the network. This work presents a new adversarial attack based on the modeling and exploitation of class-wise and layer-wise deep feature distributions. We achieve state-of-the-art targeted blackbox transfer-based attack results for undefended ImageNet models. Further, we place a priority on explainability and interpretability of the attacking process. Our methodology affords an analysis of how adversarial attacks change the intermediate feature distributions of CNNs, as well as a measure of layer-wise and class-wise feature distributional separability/entanglement. We also conceptualize a transition from task/data-specific to model-specific features within a CNN architecture that directly impacts the transferability of adversarial examples.
Keyword: adversarial attacks, transferability, interpretability

A Learning-based Iterative Method for Solving Vehicle Routing Problems
Author: Hao Lu, Xingwen Zhang, Shuang Yang
link: https://openreview.net/pdf?id=BJe1334YDH
Code: None
Abstract: This paper is concerned with solving combinatorial optimization problems, in particular, the capacitated vehicle routing problems (CVRP). Classical Operations Research (OR) algorithms such as LKH3 (Helsgaun, 2017) are extremely inefficient (e.g., 13 hours on CVRP of only size 100) and difficult to scale to larger-size problems. Machine learning based approaches have recently shown to be promising, partly because of their efficiency (once trained, they can perform solving within minutes or even seconds). However, there is still a considerable gap between the quality of a machine learned solution and what OR methods can offer (e.g., on CVRP-100, the best result of learned solutions is between 16.10-16.80, significantly worse than LKH3’s 15.65). In this paper, we present ’‘learn to Improve’‘ (L2I), the first learning based approach for CVRP that is efficient in solving speed and at the same time outperforms OR methods. Starting with a random initial solution, L2I learns to iteratively refine the solution with an improvement operator, selected by a reinforcement learning based controller. The improvement operator is selected from a pool of powerful operators that are customized for routing problems. By combining the strengths of the two worlds, our approach achieves the new state-of-the-art results on CVRP, e.g., an average cost of 15.57 on CVRP-100.
Keyword: vehicle routing, reinforcement learning, optimization, heuristics

Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
Author: Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, Jason Weston
link: https://openreview.net/pdf?id=SkxgnnNFvH
Code: None
Abstract: The use of deep pre-trained transformers has led to remarkable progress in a number of applications (Devlin et al., 2018). For tasks that make pairwise comparisons between sequences, matching a given input with a corresponding label, two approaches are common: Cross-encoders performing full self-attention over the pair and Bi-encoders encoding the pair separately. The former often performs better, but is too slow for practical use. In this work, we develop a new transformer architecture, the Poly-encoder, that learns global rather than token level self-attention features. We perform a detailed comparison of all three approaches, including what pre-training and fine-tuning strategies work best. We show our models achieve state-of-the-art results on four tasks; that Poly-encoders are faster than Cross-encoders and more accurate than Bi-encoders; and that the best results are obtained by pre-training on large datasets similar to the downstream tasks.
Keyword: None

**AutoQ: Automated Kernel-Wise Neural Network Quantization **
Author: Qian Lou, Feng Guo, Minje Kim, Lantao Liu, Lei Jiang.
link: https://openreview.net/pdf?id=rygfnn4twS
Code: None
Abstract: Network quantization is one of the most hardware friendly techniques to enable the deployment of convolutional neural networks (CNNs) on low-power mobile devices. Recent network quantization techniques quantize each weight kernel in a convolutional layer independently for higher inference accuracy, since the weight kernels in a layer exhibit different variances and hence have different amounts of redundancy. The quantization bitwidth or bit number (QBN) directly decides the inference accuracy, latency, energy and hardware overhead. To effectively reduce the redundancy and accelerate CNN inferences, various weight kernels should be quantized with different QBNs. However, prior works use only one QBN to quantize each convolutional layer or the entire CNN, because the design space of searching a QBN for each weight kernel is too large. The hand-crafted heuristic of the kernel-wise QBN search is so sophisticated that domain experts can obtain only sub-optimal results. It is difficult for even deep reinforcement learning (DRL) DDPG-based agents to find a kernel-wise QBN configuration that can achieve reasonable inference accuracy. In this paper, we propose a hierarchical-DRL-based kernel-wise network quantization technique, AutoQ, to automatically search a QBN for each weight kernel, and choose another QBN for each activation layer. Compared to the models quantized by the state-of-the-art DRL-based schemes, on average, the same models quantized by AutoQ reduce the inference latency by 54.06%, and decrease the inference energy consumption by 50.69%, while achieving the same inference accuracy.
Keyword: AutoML, Kernel-Wise Neural Networks Quantization, Hierarchical Deep Reinforcement Learning

Understanding Architectures Learnt by Cell-based Neural Architecture Search
Author: Yao Shu, Wei Wang, Shaofeng Cai
link: https://openreview.net/pdf?id=BJxH22EKPS
Code: https://github.com/shuyao95/Understanding-NAS.git
Abstract: Neural architecture search (NAS) searches architectures automatically for given tasks, e.g., image classification and language modeling. Improving the search efficiency and effectiveness has attracted increasing attention in recent years. However, few efforts have been devoted to understanding the generated architectures. In this paper, we first reveal that existing NAS algorithms (e.g., DARTS, ENAS) tend to favor architectures with wide and shallow cell structures. These favorable architectures consistently achieve fast convergence and are consequently selected by NAS algorithms. Our empirical and theoretical study further confirms that their fast convergence derives from their smooth loss landscape and accurate gradient information. Nonetheless, these architectures may not necessarily lead to better generalization performance compared with other candidate architectures in the same search space, and therefore further improvement is possible by revising existing NAS algorithms.
Keyword: Neural Architecture Search, connection pattern, optimization, convergence, Lipschitz smoothness, gradient variance, generalization

SVQN: Sequential Variational Soft Q-Learning Networks
Author: Shiyu Huang, Hang Su, Jun Zhu, Ting Chen
link: https://openreview.net/pdf?id=r1xPh2VtPB
Code: None
Abstract: Partially Observable Markov Decision Processes (POMDPs) are popular and flexible models for real-world decision-making applications that demand the information from past observations to make optimal decisions. Standard reinforcement learning algorithms for solving Markov Decision Processes (MDP) tasks are not applicable, as they cannot infer the unobserved states. In this paper, we propose a novel algorithm for POMDPs, named sequential variational soft Q-learning networks (SVQNs), which formalizes the inference of hidden states and maximum entropy reinforcement learning (MERL) under a unified graphical model and optimizes the two modules jointly. We further design a deep recurrent neural network to reduce the computational complexity of the algorithm. Experimental results show that SVQNs can utilize past information to help decision making for efficient inference, and outperforms other baselines on several challenging tasks. Our ablation study shows that SVQNs have the generalization ability over time and are robust to the disturbance of the observation.
Keyword: reinforcement learning, POMDP, variational inference, generative model

Ranking Policy Gradient
Author: Kaixiang Lin, Jiayu Zhou
link: https://openreview.net/pdf?id=rJld3hEYvS
Code: None
Abstract: Sample inefficiency is a long-lasting problem in reinforcement learning (RL). The state-of-the-art estimates the optimal action values while it usually involves an extensive search over the state-action space and unstable optimization. Towards the sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal rank of a set of discrete actions. To accelerate the learning of policy gradient methods, we establish the equivalence between maximizing the lower bound of return and imitating a near-optimal policy without accessing any oracles. These results lead to a general off-policy learning framework, which preserves the optimality, reduces variance, and improves the sample-efficiency. We conduct extensive experiments showing that when consolidating with the off-policy learning framework, RPG substantially reduces the sample complexity, comparing to the state-of-the-art.
Keyword: Sample-efficient reinforcement learning, off-policy learning.

On Mutual Information Maximization for Representation Learning
Author: Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, Mario Lucic
link: https://openreview.net/pdf?id=rkxoh24FPH
Code: https://storage.googleapis.com/mi_for_rl_files/code.zip
Abstract: Many recent methods for unsupervised or self-supervised representation learning train feature extractors by maximizing an estimate of the mutual information (MI) between different views of the data. This comes with several immediate problems: For example, MI is notoriously hard to estimate, and using it as an objective for representation learning may lead to highly entangled representations due to its invariance under arbitrary invertible transformations. Nevertheless, these methods have been repeatedly shown to excel in practice. In this paper we argue, and provide empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators. Finally, we establish a connection to deep metric learning and argue that this interpretation may be a plausible explanation for the success of the recently introduced methods.
Keyword: mutual information, representation learning, unsupervised learning, self-supervised learning

Observational Overfitting in Reinforcement Learning
Author: Xingyou Song, Yiding Jiang, Stephen Tu, Yilun Du, Behnam Neyshabur
link: https://openreview.net/pdf?id=HJli2hNKDH
Code: None
Abstract: A major component of overfitting in model-free reinforcement learning (RL) involves the case where the agent may mistakenly correlate reward with certain spurious features from the observations generated by the Markov Decision Process (MDP). We provide a general framework for analyzing this scenario, which we use to design multiple synthetic benchmarks from only modifying the observation space of an MDP. When an agent overfits to different observation spaces even if the underlying MDP dynamics is fixed, we term this observational overfitting. Our experiments expose intriguing properties especially with regards to implicit regularization, and also corroborate results from previous works in RL generalization and supervised learning (SL).
Keyword: observational, overfitting, reinforcement, learning, generalization, implicit, regularization, overparametrization

Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier
Author: Connie Kou, Hwee Kuan Lee, Ee-Chien Chang, Teck Khim Ng
link: https://openreview.net/pdf?id=BkgWahEFvr
Code: None
Abstract: Adversarial attacks on convolutional neural networks (CNN) have gained significant attention and there have been active research efforts on defense mechanisms. Stochastic input transformation methods have been proposed, where the idea is to recover the image from adversarial attack by random transformation, and to take the majority vote as consensus among the random samples. However, the transformation improves the accuracy on adversarial images at the expense of the accuracy on clean images. While it is intuitive that the accuracy on clean images would deteriorate, the exact mechanism in which how this occurs is unclear. In this paper, we study the distribution of softmax induced by stochastic transformations. We observe that with random transformations on the clean images, although the mass of the softmax distribution could shift to the wrong class, the resulting distribution of softmax could be used to correct the prediction. Furthermore, on the adversarial counterparts, with the image transformation, the resulting shapes of the distribution of softmax are similar to the distributions from the clean images. With these observations, we propose a method to improve existing transformation-based defenses. We train a separate lightweight distribution classifier to recognize distinct features in the distributions of softmax outputs of transformed images. Our empirical studies show that our distribution classifier, by training on distributions obtained from clean images only, outperforms majority voting for both clean and adversarial images. Our method is generic and can be integrated with existing transformation-based defenses.
Keyword: adversarial attack, transformation defenses, distribution classifier

Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks
Author: Yuhang Li, Xin Dong, Wei Wang
link: https://openreview.net/pdf?id=BkgXT24tDS
Code: https://github.com/yhhhli/APoT_Quantization
Abstract: We propose Additive Powers-of-Two~(APoT) quantization, an efficient non-uniform quantization scheme for the bell-shaped and long-tailed distribution of weights and activations in neural networks. By constraining all quantization levels as the sum of Powers-of-Two terms, APoT quantization enjoys high computational efficiency and a good match with the distribution of weights. A simple reparameterization of the clipping function is applied to generate a better-defined gradient for learning the clipping threshold. Moreover, weight normalization is presented to refine the distribution of weights to make the training more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with the full-precision models, demonstrating the effectiveness of our proposed APoT quantization. For example, our 4-bit quantized ResNet-50 on ImageNet achieves 76.6% top-1 accuracy without bells and whistles; meanwhile, our model reduces 22% computational cost compared with the uniformly quantized counterpart.
Keyword: Quantization, Efficient Inference, Neural Networks

Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information
Author: Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, Jun Zhu
link: https://openreview.net/pdf?id=rJx4p3NYDB
Code: None
Abstract: Counterfactual regret minimization (CFR) methods are effective for solving two-player zero-sum extensive games with imperfect information with state-of-the-art results. However, the vanilla CFR has to traverse the whole game tree in each round, which is time-consuming in large-scale games. In this paper, we present Lazy-CFR, a CFR algorithm that adopts a lazy update strategy to avoid traversing the whole game tree in each round. We prove that the regret of Lazy-CFR is almost the same to the regret of the vanilla CFR and only needs to visit a small portion of the game tree. Thus, Lazy-CFR is provably faster than CFR. Empirical results consistently show that Lazy-CFR is significantly faster than the vanilla CFR.
Keyword: None

Knowledge Consistency between Neural Networks and Beyond
Author: Ruofan Liang, Tianlin Li, Longfei Li, Jing Wang, Quanshi Zhang
link: https://openreview.net/pdf?id=BJeS62EtwH
Code: None
Abstract: This paper aims to analyze knowledge consistency between pre-trained deep neural networks. We propose a generic definition for knowledge consistency between neural networks at different fuzziness levels. A task-agnostic method is designed to disentangle feature components, which represent the consistent knowledge, from raw intermediate-layer features of each neural network. As a generic tool, our method can be broadly used for different applications. In preliminary experiments, we have used knowledge consistency as a tool to diagnose representations of neural networks. Knowledge consistency provides new insights to explain the success of existing deep-learning techniques, such as knowledge distillation and network compression. More crucially, knowledge consistency can also be used to refine pre-trained networks and boost performance.
Keyword: Deep Learning, Interpretability, Convolutional Neural Networks

Image-guided Neural Object Rendering
Author: Justus Thies, Michael Zollhöfer, Christian Theobalt, Marc Stamminger, Matthias Nießner
link: https://openreview.net/pdf?id=Hyg9anEFPS
Code: None
Abstract: We propose a learned image-guided rendering technique that combines the benefits of image-based rendering and GAN-based image synthesis. The goal of our method is to generate photo-realistic re-renderings of reconstructed objects for virtual and augmented reality applications (e.g., virtual showrooms, virtual tours and sightseeing, the digital inspection of historical artifacts). A core component of our work is the handling of view-dependent effects. Specifically, we directly train an object-specific deep neural network to synthesize the view-dependent appearance of an object.
As input data we are using an RGB video of the object. This video is used to reconstruct a proxy geometry of the object via multi-view stereo. Based on this 3D proxy, the appearance of a captured view can be warped into a new target view as in classical image-based rendering. This warping assumes diffuse surfaces, in case of view-dependent effects, such as specular highlights, it leads to artifacts. To this end, we propose EffectsNet, a deep neural network that predicts view-dependent effects. Based on these estimations, we are able to convert observed images to diffuse images. These diffuse images can be projected into other views. In the target view, our pipeline reinserts the new view-dependent effects. To composite multiple reprojected images to a final output, we learn a composition network that outputs photo-realistic results. Using this image-guided approach, the network does not have to allocate capacity on ``remembering’’ object appearance, instead it learns how to combine the appearance of captured images. We demonstrate the effectiveness of our approach both qualitatively and quantitatively on synthetic as well as on real data.
Keyword: Neural Rendering, Neural Image Synthesis

Implicit Bias of Gradient Descent based Adversarial Training on Separable Data
Author: Yan Li, Ethan X.Fang, Huan Xu, Tuo Zhao
link: https://openreview.net/pdf?id=HkgTTh4FDH
Code: None
Abstract: Adversarial training is a principled approach for training robust neural networks. Despite of tremendous successes in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights of gradient descent based adversarial training by studying its computational properties, specifically on its implicit bias. We take the binary classification task on linearly separable data as an illustrative example, where the loss asymptotically attains its infimum as the parameter diverges to infinity along certain directions. Specifically, we show that for any fixed iteration T T , when the adversarial perturbation during training has proper bounded L2 norm, the classifier learned by gradient descent based adversarial training converges in direction to the maximum L2 norm margin classifier at the rate of O ( 1 / T ) O(1/\sqrt{T}) , significantly faster than the rate KaTeX parse error: Expected 'EOF', got '}' at position 11: O(1/\log T}̲ of training with clean data. In addition, when the adversarial perturbation during training has bounded Lq norm, the resulting classifier converges in direction to a maximum mixed-norm margin classifier, which has a natural interpretation of robustness, as being the maximum L2 norm margin classifier under worst-case bounded Lq norm perturbation to the data. Our findings provide theoretical backups for adversarial training that it indeed promotes robustness against adversarial perturbation.
Keyword: implicit bias, adversarial training, robustness, gradient descent

TabFact: A Large-scale Dataset for Table-based Fact Verification
Author: Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, William Yang Wang
link: https://openreview.net/pdf?id=rkeJRhNYDH
Code: None
Abstract: The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc), while verification under structured evidence, such as tables, graphs, and databases, remains unexplored. This paper specifically aims to study the fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. LPA parses statements into LISP-like programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. We also perform a comprehensive analysis to demonstrate great future opportunities.
Keyword: Fact Verification, Tabular Data, Symbolic Reasoning

ES-MAML: Simple Hessian-Free Meta Learning
Author: Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, Yunhao Tang
link: https://openreview.net/pdf?id=S1exA2NtDB
Code: None
Abstract: We introduce ES-MAML, a new framework for solving the model agnostic meta learning (MAML) problem based on Evolution Strategies (ES). Existing algorithms for MAML are based on policy gradients, and incur significant difficulties when attempting to estimate second derivatives using backpropagation on stochastic policies. We show how ES can be applied to MAML to obtain an algorithm which avoids the problem of estimating second derivatives, and is also conceptually simple and easy to implement. Moreover, ES-MAML can handle new types of nonsmooth adaptation operators, and other techniques for improving performance and estimation of ES methods become applicable. We show empirically that ES-MAML is competitive with existing methods and often yields better adaptation with fewer queries.
Keyword: ES, MAML, evolution, strategies, meta, learning, gaussian, perturbation, reinforcement, learning, adaptation

Neural Stored-program Memory
Author: Hung Le, Truyen Tran, Svetha Venkatesh
link: https://openreview.net/pdf?id=rkxxA24FDr
Code: https://github.com/thaihungle/NSM
Abstract: Neural networks powered with external memory simulate computer behaviors. These models, which use the memory to store data for a neural controller, can learn algorithms and other complex tasks. In this paper, we introduce a new memory to store weights for the controller, analogous to the stored-program memory in modern computer architectures. The proposed model, dubbed Neural Stored-program Memory, augments current memory-augmented neural networks, creating differentiable machines that can switch programs through time, adapt to variable contexts and thus fully resemble the Universal Turing Machine. A wide range of experiments demonstrate that the resulting machines not only excel in classical algorithmic problems, but also have potential for compositional, continual, few-shot learning and question-answering tasks.
Keyword: Memory Augmented Neural Networks, Universal Turing Machine, fast-weight

Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation
Author: Suraj Nair, Chelsea Finn
link: https://openreview.net/pdf?id=H1gzR2VKDH
Code: https://drive.google.com/file/d/1INQ6_lqefSW3zXEd6Y8LQphfFwUBIyqx/view
Abstract: Video prediction models combined with planning algorithms have shown promise in enabling robots to learn to perform many vision-based tasks through only self-supervision, reaching novel goals in cluttered scenes with unseen objects. However, due to the compounding uncertainty in long horizon video prediction and poor scalability of sampling-based planning optimizers, one significant limitation of these approaches is the ability to plan over long horizons to reach distant goals. To that end, we propose a framework for subgoal generation and planning, hierarchical visual foresight (HVF), which generates subgoal images conditioned on a goal image, and uses them for planning. The subgoal images are directly optimized to decompose the task into easy to plan segments, and as a result, we observe that the method naturally identifies semantically meaningful states as subgoals. Across three out of four simulated vision-based manipulation tasks, we find that our method achieves more than 20% absolute performance improvement over planning without subgoals and model-free RL approaches. Further, our experiments illustrate that our approach extends to real, cluttered visual scenes.
Keyword: video prediction, reinforcement learning, planning

Multi-agent Reinforcement Learning for Networked System Control
Author: Tianshu Chu, Sandeep Chinchali, Sachin Katti
link: https://openreview.net/pdf?id=Syx7A3NFvH
Code: https://github.com/cts198859/deeprl_network
Abstract: This paper considers multi-agent reinforcement learning (MARL) in networked system control. Specifically, each agent learns a decentralized control policy based on local observations and messages from connected neighbors. We formulate such a networked MARL (NMARL) problem as a spatiotemporal Markov decision process and introduce a spatial discount factor to stabilize the training of each local agent. Further, we propose a new differentiable communication protocol, called NeurComm, to reduce information loss and non-stationarity in NMARL. Based on experiments in realistic NMARL scenarios of adaptive traffic signal control and cooperative adaptive cruise control, an appropriate spatial discount factor effectively enhances the learning curves of non-communicative MARL algorithms, while NeurComm outperforms existing communication protocols in both learning efficiency and control performance.
Keyword: deep reinforcement learning, multi-agent reinforcement learning, decision and control

FSPool: Learning Set Representations with Featurewise Sort Pooling
Author: Yan Zhang, Jonathon Hare, Adam Prügel-Bennett
link: https://openreview.net/pdf?id=HJgBA2VYwH
Code: https://github.com/Cyanogenoid/fspool
Abstract: Traditional set prediction models can struggle with simple datasets due to an issue we call the responsibility problem. We introduce a pooling method for sets of feature vectors based on sorting features across elements of the set. This can be used to construct a permutation-equivariant auto-encoder that avoids this responsibility problem. On a toy dataset of polygons and a set version of MNIST, we show that such an auto-encoder produces considerably better reconstructions and representations. Replacing the pooling function in existing set encoders with FSPool improves accuracy and convergence speed on a variety of datasets.
Keyword: set auto-encoder, set encoder, pooling

Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction
Author: Taeuk Kim, Jihun Choi, Daniel Edmiston, Sang-goo Lee
link: https://openreview.net/pdf?id=H1xPR3NtPB
Code: None
Abstract: With the recent success and popularity of pre-trained language models (LMs) in natural language processing, there has been a rise in efforts to understand their inner workings.
In line with such interest, we propose a novel method that assists us in investigating the extent to which pre-trained LMs capture the syntactic notion of constituency.
Our method provides an effective way of extracting constituency trees from the pre-trained LMs without training.
In addition, we report intriguing findings in the induced trees, including the fact that pre-trained LMs outperform other approaches in correctly demarcating adverb phrases in sentences.
Keyword: None

Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning
Author: Xiaoran Xu, Wei Feng, Yunsheng Jiang, Xiaohui Xie, Zhiqing Sun, Zhi-Hong Deng
link: https://openreview.net/pdf?id=rkeuAhVKvB
Code: https://github.com/netpaladinx/DPMPN
Abstract: We propose Dynamically Pruned Message Passing Networks (DPMPN) for large-scale knowledge graph reasoning. In contrast to existing models, embedding-based or path-based, we learn an input-dependent subgraph to explicitly model a sequential reasoning process. Each subgraph is dynamically constructed, expanding itself selectively under a flow-style attention mechanism. In this way, we can not only construct graphical explanations to interpret prediction, but also prune message passing in Graph Neural Networks (GNNs) to scale with the size of graphs. We take the inspiration from the consciousness prior proposed by Bengio to design a two-GNN framework to encode global input-invariant graph-structured representation and learn local input-dependent one coordinated by an attention module. Experiments show the reasoning capability in our model that is providing a clear graphical explanation as well as predicting results accurately, outperforming most state-of-the-art methods in knowledge base completion tasks.
Keyword: knowledge graph reasoning, graph neural networks, attention mechanism

Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks
Author: Tianyu Pang*, Kun Xu*, Jun Zhu
link: https://openreview.net/pdf?id=ByxtC2VtPB
Code: https://github.com/P2333/Mixup-Inference
Abstract: It has been widely recognized that adversarial examples can be easily crafted to fool deep networks, which mainly root from the locally non-linear behavior nearby input examples. Applying mixup in training provides an effective mechanism to improve generalization performance and model robustness against adversarial perturbations, which introduces the globally linear behavior in-between training examples. However, in previous work, the mixup-trained models only passively defend adversarial attacks in inference by directly classifying the inputs, where the induced global linearity is not well exploited. Namely, since the locality of the adversarial perturbations, it would be more efficient to actively break the locality via the globality of the model predictions. Inspired by simple geometric intuition, we develop an inference principle, named mixup inference (MI), for mixup-trained models. MI mixups the input with other random clean samples, which can shrink and transfer the equivalent perturbation if the input is adversarial. Our experiments on CIFAR-10 and CIFAR-100 demonstrate that MI can further improve the adversarial robustness for the models trained by mixup and its variants.
Keyword: Trustworthy Machine Learning, Adversarial Robustness, Inference Principle, Mixup

Theory and Evaluation Metrics for Learning Disentangled Representations
Author: Kien Do, Truyen Tran
link: https://openreview.net/pdf?id=HJgK0h4Ywr
Code: None
Abstract: We make two theoretical contributions to disentanglement learning by (a) defining precise semantics of disentangled representations, and (b) establishing robust metrics for evaluation. First, we characterize the concept “disentangled representations” used in supervised and unsupervised methods along three dimensions–informativeness, separability and interpretability–which can be expressed and quantified explicitly using information-theoretic constructs. This helps explain the behaviors of several well-known disentanglement learning models. We then propose robust metrics for measuring informativeness, separability and interpretability. Through a comprehensive suite of experiments, we show that our metrics correctly characterize the representations learned by different methods and are consistent with qualitative (visual) results. Thus, the metrics allow disentanglement learning methods to be compared on a fair ground. We also empirically uncovered new interesting properties of VAE-based methods and interpreted them with our formulation. These findings are promising and hopefully will encourage the design of more theoretically driven models for learning disentangled representations.
Keyword: disentanglement, metrics

Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
Author: Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, Olivier Bousquet
link: https://openreview.net/pdf?id=SygcCnNKwr
Code: https://github.com/google-research/google-research/tree/master/cfq
Abstract: State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.

Keyword: compositionality, generalization, natural language understanding, benchmark, compositional generalization, compositional modeling, semantic parsing, generalization measurement

Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
Author: Tianyu Pang, Kun Xu, Yinpeng Dong, Chao Du, Ning Chen, Jun Zhu
link: https://openreview.net/pdf?id=Byg9A24tvB
Code: https://github.com/P2333/Max-Mahalanobis-Training
Abstract: Previous work shows that adversarially robust generalization requires larger sample complexity, and the same dataset, e.g., CIFAR-10, which enables good standard accuracy may not suffice to train robust models. Since collecting new training data could be costly, we focus on better utilizing the given data by inducing the regions with high sample density in the feature space, which could lead to locally sufficient samples for robust learning. We first formally show that the softmax cross-entropy (SCE) loss and its variants convey inappropriate supervisory signals, which encourage the learned feature points to spread over the space sparsely in training. This inspires us to propose the Max-Mahalanobis center (MMC) loss to explicitly induce dense feature regions in order to benefit robustness. Namely, the MMC loss encourages the model to concentrate on learning ordered and compact representations, which gather around the preset optimal centers for different classes. We empirically demonstrate that applying the MMC loss can significantly improve robustness even under strong adaptive attacks, while keeping state-of-the-art accuracy on clean inputs with little extra computation compared to the SCE loss.
Keyword: Trustworthy Machine Learning, Adversarial Robustness, Training Objective, Sample Density

The Implicit Bias of Depth: How Incremental Learning Drives Generalization
Author: Daniel Gissin, Shai Shalev-Shwartz, Amit Daniely
link: https://openreview.net/pdf?id=H1lj0nNFwB
Code: https://github.com/dsgissin/Incremental-Learning
Abstract: A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning.
Keyword: gradient flow, gradient descent, implicit regularization, implicit bias, generalization, optimization, quadratic network, matrix sensing

The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget
Author: Anirudh Goyal, Yoshua Bengio, Matthew Botvinick, Sergey Levine
link: https://openreview.net/pdf?id=Hye1kTVFDS
Code: None
Abstract: In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision about which input features are relevant.
The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information), and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a ``privileged’’ input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, this compression operation itself needs unrestricted, lossless access. In this work, we propose the variational bandwidth bottleneck, which decides for each example on the estimated value of the privileged information before seeing it, i.e., only based on the standard input, and then accordingly chooses stochastically, whether to access the privileged input or not. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information.
Keyword: Variational Information Bottleneck, Reinforcement learning

Learning the Arrow of Time for Problems in Reinforcement Learning
Author: Nasim Rahaman, Steffen Wolf, Anirudh Goyal, Roman Remme, Yoshua Bengio
link: https://openreview.net/pdf?id=rylJkpEtwS
Code: https://www.sendspace.com/file/0mx0en
Abstract: We humans have an innate understanding of the asymmetric progression of time, which we use to efficiently and safely perceive and manipulate our environment. Drawing inspiration from that, we approach the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture salient information about the environment, which in turn can be used to measure reachability, detect side-effects and to obtain an intrinsic reward signal. Finally, we propose a simple yet effective algorithm to parameterize the problem at hand and learn an arrow of time with a function approximator (here, a deep neural network). Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well known notion of an arrow of time due to Jordan, Kinderlehrer and Otto (1998).
Keyword: Arrow of Time, Reinforcement Learning, AI-Safety

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives
Author: Anirudh Goyal, Shagun Sodhani, Jonathan Binas, Xue Bin Peng, Sergey Levine, Yoshua Bengio
link: https://openreview.net/pdf?id=ryxgJTEYDr
Code: None
Abstract: Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states.
In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state.
We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.
Keyword: Reinforcement Learning, Variational Information Bottleneck, Learning primitives

Robust Local Features for Improving the Generalization of Adversarial Training
Author: Chuanbiao Song, Kun He, Jiadong Lin, Liwei Wang, John E. Hopcroft
link: https://openreview.net/pdf?id=H1lZJpVFvr
Code: https://github.com/JHL-HUST/RLFAT
Abstract: Adversarial training has been demonstrated as one of the most effective methods for training robust models to defend against adversarial examples. However, adversarially trained models often lack adversarially robust generalization on unseen testing data. Recent works show that adversarially trained models are more biased towards global structure features. Instead, in this work, we would like to investigate the relationship between the generalization of adversarial training and the robust local features, as the robust local features generalize well for unseen shape variation. To learn the robust local features, we develop a Random Block Shuffle (RBS) transformation to break up the global structure features on normal adversarial examples. We continue to propose a new approach called Robust Local Features for Adversarial Training (RLFAT), which first learns the robust local features by adversarial training on the RBS-transformed adversarial examples, and then transfers the robust local features into the training of normal adversarial examples. To demonstrate the generality of our argument, we implement RLFAT in currently state-of-the-art adversarial training frameworks. Extensive experiments on STL-10, CIFAR-10 and CIFAR-100 show that RLFAT significantly improves both the adversarially robust generalization and the standard generalization of adversarial training. Additionally, we demonstrate that our models capture more local features of the object on the images, aligning better with human perception.
Keyword: adversarial robustness, adversarial training, adversarial example, deep learning

Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification
Author: Bennet Breier, Arno Onken
link: https://openreview.net/pdf?id=rJgQkT4twH
Code: https://github.com/Benji4/zebrafish-learning.git
Abstract: Semmelhack et al. (2014) have achieved high classification accuracy in distinguishing swim bouts of zebrafish using a Support Vector Machine (SVM). Convolutional Neural Networks (CNNs) have reached superior performance in various image recognition tasks over SVMs, but these powerful networks remain a black box. Reaching better transparency helps to build trust in their classifications and makes learned features interpretable to experts. Using a recently developed technique called Deep Taylor Decomposition, we generated heatmaps to highlight input regions of high relevance for predictions. We find that our CNN makes predictions by analyzing the steadiness of the tail’s trunk, which markedly differs from the manually extracted features used by Semmelhack et al. (2014). We further uncovered that the network paid attention to experimental artifacts. Removing these artifacts ensured the validity of predictions. After correction, our best CNN beats the SVM by 6.12%, achieving a classification accuracy of 96.32%. Our work thus demonstrates the utility of AI explainability for CNNs.
Keyword: convolutional neural networks, neural network transparency, AI explainability, deep Taylor decomposition, supervised classification, zebrafish, transparency, behavioral research, optical flow

Learning Disentangled Representations for CounterFactual Regression
Author: Negar Hassanpour, Russell Greiner
link: https://openreview.net/pdf?id=HkxBJT4YvB
Code: https://www.dropbox.com/sh/vrux2exqwc9uh7k/AAAR4tlJLScPlkmPruvbrTJQa?dl=0
Abstract: We consider the challenge of estimating treatment effects from observational data; and point out that, in general, only some factors based on the observed covariates X contribute to selection of the treatment T, and only some to determining the outcomes Y. We model this by considering three underlying sources of {X, T, Y} and show that explicitly modeling these sources offers great insight to guide designing models that better handle selection bias. This paper is an attempt to conceptualize this line of thought and provide a path to explore it further.
In this work, we propose an algorithm to (1) identify disentangled representations of the above-mentioned underlying factors from any given observational dataset D and (2) leverage this knowledge to reduce, as well as account for, the negative impact of selection bias on estimating the treatment effects from D. Our empirical results show that the proposed method achieves state-of-the-art performance in both individual and population based evaluation measures.
Keyword: Counterfactual Regression, Causal Effect Estimation, Selection Bias, Off-policy Learning

Exploration in Reinforcement Learning with Deep Covering Options
Author: Yuu Jinnai, Jee Won Park, Marlos C. Machado, George Konidaris
link: https://openreview.net/pdf?id=SkeIyaVtwB
Code: None
Abstract: While many option discovery methods have been proposed to accelerate exploration in reinforcement learning, they are often heuristic. Recently, covering options was proposed to discover a set of options that provably reduce the upper bound of the environment’s cover time, a measure of the difficulty of exploration. Covering options are computed using the eigenvectors of the graph Laplacian, but they are constrained to tabular tasks and are not applicable to tasks with large or continuous state-spaces.
We introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration. We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward.
Keyword: Reinforcement learning, temporal abstraction, exploration

AE-OT: A NEW GENERATIVE MODEL BASED ON EXTENDED SEMI-DISCRETE OPTIMAL TRANSPORT
Author: Dongsheng An, Yang Guo, Na Lei, Zhongxuan Luo, Shing-Tung Yau, Xianfeng Gu
link: https://openreview.net/pdf?id=HkldyTNYwH
Code: None
Abstract: Generative adversarial networks (GANs) have attracted huge attention due to
its capability to generate visual realistic images. However, most of the existing
models suffer from the mode collapse or mode mixture problems. In this work, we
give a theoretic explanation of the both problems by Figalli’s regularity theory of
optimal transportation maps. Basically, the generator compute the transportation
maps between the white noise distributions and the data distributions, which are
in general discontinuous. However, DNNs can only represent continuous maps.
This intrinsic conflict induces mode collapse and mode mixture. In order to
tackle the both problems, we explicitly separate the manifold embedding and the
optimal transportation; the first part is carried out using an autoencoder to map the
images onto the latent space; the second part is accomplished using a GPU-based
convex optimization to find the discontinuous transportation maps. Composing the
extended OT map and the decoder, we can finally generate new images from the
white noise. This AE-OT model avoids representing discontinuous maps by DNNs,
therefore effectively prevents mode collapse and mode mixture.
Keyword: Generative model, auto-encoder, optimal transport, mode collapse, regularity

Logic and the 2-Simplicial Transformer
Author: James Clift, Dmitry Doryn, Daniel Murfet, James Wallbridge
link: https://openreview.net/pdf?id=rkecJ6VFvr
Code: https://github.com/dmurfet/2simplicialtransformer
Abstract: We introduce the 2-simplicial Transformer, an extension of the Transformer which includes a form of higher-dimensional attention generalising the dot-product attention, and uses this attention to update entity representations with tensor products of value vectors. We show that this architecture is a useful inductive bias for logical reasoning in the context of deep reinforcement learning.

Keyword: transformer, logic, reinforcement learning, reasoning

Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards
Author: Allan Zhou, Eric Jang, Daniel Kappler, Alex Herzog, Mohi Khansari, Paul Wohlhart, Yunfei Bai, Mrinal Kalakrishnan, Sergey Levine, Chelsea Finn
link: https://openreview.net/pdf?id=SJg5J6NtDr
Code: https://drive.google.com/open?id=1f1LzO0fe1m-kINY8DTgL6JGimVGiQOuz
Abstract: Imitation learning allows agents to learn complex behaviors from demonstrations. However, learning a complex vision-based task may require an impractical number of demonstrations. Meta-imitation learning is a promising approach towards enabling agents to learn a new task from one or a few demonstrations by leveraging experience from learning similar tasks. In the presence of task ambiguity or unobserved dynamics, demonstrations alone may not provide enough information; an agent must also try the task to successfully infer a policy. In this work, we propose a method that can learn to learn from both demonstrations and trial-and-error experience with sparse reward feedback. In comparison to meta-imitation, this approach enables the agent to effectively and efficiently improve itself autonomously beyond the demonstration data. In comparison to meta-reinforcement learning, we can scale to substantially broader distributions of tasks, as the demonstration reduces the burden of exploration. Our experiments show that our method significantly outperforms prior approaches on a set of challenging, vision-based control tasks.
Keyword: meta-learning, reinforcement learning, imitation learning

Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking
Author: Yunhan Jia, Yantao Lu, Junjie Shen, Qi Alfred Chen, Hao Chen, Zhenyu Zhong, Tao Wei
link: https://openreview.net/pdf?id=rJl31TNYPr
Code: https://github.com/anonymousjack/hijacking
Abstract: Recent work in adversarial machine learning started to focus on the visual perception in autonomous driving and studied Adversarial Examples (AEs) for object detection models. However, in such visual perception pipeline the detected objects must also be tracked, in a process called Multiple Object Tracking (MOT), to build the moving trajectories of surrounding obstacles. Since MOT is designed to be robust against errors in object detection, it poses a general challenge to existing attack techniques that blindly target objection detection: we find that a success rate of over 98% is needed for them to actually affect the tracking results, a requirement that no existing attack technique can satisfy. In this paper, we are the first to study adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving, and discover a novel attack technique, tracker hijacking, that can effectively fool MOT using AEs on object detection. Using our technique, successful AEs on as few as one single frame can move an existing object in to or out of the headway of an autonomous vehicle to cause potential safety hazards. We perform evaluation using the Berkeley Deep Drive dataset and find that on average when 3 frames are attacked, our attack can have a nearly 100% success rate while attacks that blindly target object detection only have up to 25%.
Keyword: Adversarial examples, object detection, object tracking, security, autonomous vehicle, deep learning

DivideMix: Learning with Noisy Labels as Semi-supervised Learning
Author: Junnan Li, Richard Socher, Steven C.H. Hoi
link: https://openreview.net/pdf?id=HJgExaVtwr
Code: None
Abstract: Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner. To avoid confirmation bias, we simultaneously train two diverged networks where each network uses the dataset division from the other network. During the semi-supervised training phase, we improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods. Code is available at
Keyword: label noise, semi-supervised learning

Improving Adversarial Robustness Requires Revisiting Misclassified Examples
Author: Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, Quanquan Gu
link: https://openreview.net/pdf?id=rklOg6EFwS
Code: None
Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by imperceptible perturbations. A range of defense techniques have been proposed to improve DNN robustness to adversarial examples, among which adversarial training has been demonstrated to be the most effective. Adversarial training is often formulated as a min-max optimization problem, with the inner maximization for generating adversarial examples. However, there exists a simple, yet easily overlooked fact that adversarial examples are only defined on correctly classified (natural) examples, but inevitably, some (natural) examples will be misclassified during training. In this paper, we investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training. Specifically, we find that misclassified examples indeed have a significant impact on the final robustness. More surprisingly, we find that different maximization techniques on misclassified examples may have a negligible influence on the final robustness, while different minimization techniques are crucial. Motivated by the above discovery, we propose a new defense algorithm called {\em Misclassification Aware adveRsarial Training} (MART), which explicitly differentiates the misclassified and correctly classified examples during the training. We also propose a semi-supervised extension of MART, which can leverage the unlabeled data to further improve the robustness. Experimental results show that MART and its variant could significantly improve the state-of-the-art adversarial robustness.
Keyword: Robustness, Adversarial Defense, Adversarial Training

V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
Author: H. Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick
link: https://openreview.net/pdf?id=SylOlp4FvH
Code: None
Abstract: Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function. We show that V-MPO surpasses previously reported scores for both the Atari-57 and DMLab-30 benchmark suites in the multi-task setting, and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters. On individual DMLab and Atari levels, the proposed algorithm can achieve scores that are substantially higher than has previously been reported. V-MPO is also applicable to problems with high-dimensional, continuous action spaces, which we demonstrate in the context of learning to control simulated humanoids with 22 degrees of freedom from full state observations and 56 degrees of freedom from pixel observations, as well as example OpenAI Gym tasks where V-MPO achieves substantially higher asymptotic scores than previously reported.
Keyword: reinforcement learning, policy iteration, multi-task learning, continuous control

Interpretable Complex-Valued Neural Networks for Privacy Protection
Author: Liyao Xiang, Hao Zhang, Haotian Ma, Yifan Zhang, Jie Ren, Quanshi Zhang
link: https://openreview.net/pdf?id=S1xFl64tDr
Code: None
Abstract: Previous studies have found that an adversary attacker can often infer unintended input information from intermediate-layer features. We study the possibility of preventing such adversarial inference, yet without too much accuracy degradation. We propose a generic method to revise the neural network to boost the challenge of inferring input attributes from features, while maintaining highly accurate outputs. In particular, the method transforms real-valued features into complex-valued ones, in which the input is hidden in a randomized phase of the transformed features. The knowledge of the phase acts like a key, with which any party can easily recover the output from the processing result, but without which the party can neither recover the output nor distinguish the original input. Preliminary experiments on various datasets and network structures have shown that our method significantly diminishes the adversary’s ability in inferring about the input while largely preserves the resulting accuracy.
Keyword: Deep Learning, Privacy Protection, Complex-Valued Neural Networks

Accelerating SGD with momentum for over-parameterized learning
Author: Chaoyue Liu, Mikhail Belkin
link: https://openreview.net/pdf?id=r1gixp4FPH
Code: https://github.com/ts66395/MaSS
Abstract:
Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. Indeed, as we show in this paper, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge for step sizes that ensure convergence of ordinary SGD. This is in contrast to the classical results in the deterministic setting, where the same step size ensures accelerated convergence of the Nesterov’s method over optimal gradient descent.

  To address the non-acceleration issue, we  introduce a compensation term to Nesterov SGD. The resulting  algorithm, which we call MaSS, converges  for same step sizes as SGD. We prove that MaSS obtains an accelerated convergence rates over SGD for any mini-batch size in the linear setting.  For full batch, the convergence rate of MaSS matches the well-known accelerated rate of the Nesterov's method. 
  
  We also analyze the  practically important question of the dependence of the convergence rate and  optimal hyper-parameters on the mini-batch size, demonstrating three distinct regimes: linear scaling, diminishing returns and saturation.
  
  Experimental evaluation of MaSS for several standard  architectures of deep networks, including ResNet and convolutional networks, shows improved performance over SGD, Nesterov SGD  and Adam. 

Keyword: SGD, acceleration, momentum, stochastic, over-parameterized, Nesterov

A critical analysis of self-supervision, or what we can learn from a single image
Author: Asano YM., Rupprecht C., Vedaldi A.
link: https://openreview.net/pdf?id=B1esx6EYvr
Code: None
Abstract: We look critically at popular self-supervision techniques for learning deep convolutional neural networks without manual labels. We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training.
We conclude that:
(1) the weights of the early layers of deep networks contain limited information about the statistics of natural images, that
(2) such low-level statistics can be learned through self-supervision just as well as through strong supervision, and that
(3) the low-level statistics can be captured via synthetic transformations instead of using a large image dataset.
Keyword: self-supervision, feature representation learning, CNN

Disentangling Factors of Variations Using Few Labels
Author: Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem
link: https://openreview.net/pdf?id=SygagpEKwB
Code: None
Abstract: Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al. (2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow to consistently learn disentangled representations. However, in many practical settings, one might have access to a limited amount of supervision, for example through manual labeling of (some) factors of variation in a few training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large scale study, training over 52000 models under well-defined and reproducible experimental conditions. We observe that a small number of labeled examples (0.01–0.5% of the data set), with potentially imprecise and incomplete labels, is sufficient to perform model selection on state-of-the-art unsupervised models. Further, we investigate the benefit of incorporating supervision into the training process. Overall, we empirically validate that with little and imprecise supervision it is possible to reliably learn disentangled representations.
Keyword: None

Functional vs. parametric equivalence of ReLU networks
Author: Mary Phuong, Christoph H. Lampert
link: https://openreview.net/pdf?id=Bylx-TNKvH
Code: None
Abstract: We address the following question: How redundant is the parameterisation of ReLU networks? Specifically, we consider transformations of the weight space which leave the function implemented by the network intact. Two such transformations are known for feed-forward architectures: permutation of neurons within a layer, and positive scaling of all incoming weights of a neuron coupled with inverse scaling of its outgoing weights. In this work, we show for architectures with non-increasing widths that permutation and scaling are in fact the only function-preserving weight transformations. For any eligible architecture we give an explicit construction of a neural network such that any other network that implements the same function can be obtained from the original one by the application of permutations and rescaling. The proof relies on a geometric understanding of boundaries between linear regions of ReLU networks, and we hope the developed mathematical tools are of independent interest.

Keyword: ReLU networks, symmetry, functional equivalence, over-parameterization

Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models
Author: Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F. Núñez, Jordi Luque
link: https://openreview.net/pdf?id=SyxIWpVYvr
Code: None
Abstract: Likelihood-based generative models are a promising resource to detect out-of-distribution (OOD) inputs which could compromise the robustness or reliability of a machine learning system. However, likelihoods derived from such models have been shown to be problematic for detecting certain types of inputs that significantly differ from training data. In this paper, we pose that this problem is due to the excessive influence that input complexity has in generative models’ likelihoods. We report a set of experiments supporting this hypothesis, and use an estimate of input complexity to derive an efficient and parameter-free OOD score, which can be seen as a likelihood-ratio, akin to Bayesian model comparison. We find such score to perform comparably to, or even better than, existing OOD detection approaches under a wide range of data sets, models, model sizes, and complexity estimates.
Keyword: OOD, generative models, likelihood

RTFM: Generalising to New Environment Dynamics via Reading
Author: Victor Zhong, Tim Rocktäschel, Edward Grefenstette
link: https://openreview.net/pdf?id=SJgob6NKvH
Code: None
Abstract: Obtaining policies that can generalise to new environments in reinforcement learning is challenging. In this work, we demonstrate that language understanding via a reading policy learner is a promising vehicle for generalisation to new environments. We propose a grounded policy learning problem, Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, relevant dynamics described in a document, and environment observations. We procedurally generate environment dynamics and corresponding language descriptions of the dynamics, such that agents must read to understand new environment dynamics instead of memorising any particular information. In addition, we propose txt2π, a model that captures three-way interactions between the goal, document, and observations. On RTFM, txt2π generalises to new environments with dynamics not seen during training via reading. Furthermore, our model outperforms baselines such as FiLM and language-conditioned CNNs on RTFM. Through curriculum learning, txt2π produces policies that excel on complex RTFM tasks requiring several reasoning and coreference steps.
Keyword: reinforcement learning, policy learning, reading comprehension, generalisation

What graph neural networks cannot learn: depth vs width
Author: Andreas Loukas
link: https://openreview.net/pdf?id=B1l2bp4YwS
Code: None
Abstract: This paper studies the expressive power of graph neural networks falling within the message-passing framework (GNNmp). Two results are presented. First, GNNmp are shown to be Turing universal under sufficient conditions on their depth, width, node attributes, and layer expressiveness. Second, it is discovered that GNNmp can lose a significant portion of their power when their depth and width is restricted. The proposed impossibility statements stem from a new technique that enables the repurposing of seminal results from distributed computing and leads to lower bounds for an array of decision, optimization, and estimation problems involving graphs. Strikingly, several of these problems are deemed impossible unless the product of a GNNmp’s depth and width exceeds a polynomial of the graph size; this dependence remains significant even for tasks that appear simple or when considering approximation.
Keyword: graph neural networks, capacity, impossibility results, lower bounds, expressive power

Progressive Memory Banks for Incremental Domain Adaptation
Author: Nabiha Asghar, Lili Mou, Kira A. Selby, Kevin D. Pantasdo, Pascal Poupart, Xin Jiang
link: https://openreview.net/pdf?id=BkepbpNFwr
Code: https://github.com/nabihach/IDA
Abstract: This paper addresses the problem of incremental domain adaptation (IDA) in natural language processing (NLP). We assume each domain comes one after another, and that we could only access data in the current domain. The goal of IDA is to build a unified model performing well on all the domains that we have encountered. We adopt the recurrent neural network (RNN) widely used in NLP, but augment it with a directly parameterized memory bank, which is retrieved by an attention mechanism at each step of RNN transition. The memory bank provides a natural way of IDA: when adapting our model to a new domain, we progressively add new slots to the memory bank, which increases the number of parameters, and thus the model capacity. We learn the new memory slots and fine-tune existing parameters by back-propagation. Experimental results show that our approach achieves significantly better performance than fine-tuning alone. Compared with expanding hidden states, our approach is more robust for old domains, shown by both empirical and theoretical results. Our model also outperforms previous work of IDA including elastic weight consolidation and progressive neural networks in the experiments.
Keyword: natural language processing, domain adaptation

Automated curriculum generation through setter-solver interactions
Author: Sebastien Racaniere, Andrew Lampinen, Adam Santoro, David Reichert, Vlad Firoiu, Timothy Lillicrap
link: https://openreview.net/pdf?id=H1e0Wp4KvH
Code: None
Abstract: Reinforcement learning algorithms use correlations between policies and rewards to improve agent performance. But in dynamic or sparsely rewarding environments these correlations are often too small, or rewarding events are too infrequent to make learning feasible. Human education instead relies on curricula –the breakdown of tasks into simpler, static challenges with dense rewards– to build up to complex behaviors. While curricula are also useful for artificial agents, hand-crafting them is time consuming. This has lead researchers to explore automatic curriculum generation. Here we explore automatic curriculum generation in rich,dynamic environments. Using a setter-solver paradigm we show the importance of considering goal validity, goal feasibility, and goal coverage to construct useful curricula. We demonstrate the success of our approach in rich but sparsely rewarding 2D and 3D environments, where an agent is tasked to achieve a single goal selected from a set of possible goals that varies between episodes, and identify challenges for future work. Finally, we demonstrate the value of a novel technique that guides agents towards a desired goal distribution. Altogether, these results represent a substantial step towards applying automatic task curricula to learn complex, otherwise unlearnable goals, and to our knowledge are the first to demonstrate automated curriculum generation for goal-conditioned agents in environments where the possible goals vary between episodes.
Keyword: Deep Reinforcement Learning, Automatic Curriculum

On Identifiability in Transformers
Author: Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer
link: https://openreview.net/pdf?id=BJg1f6EFDB
Code: None
Abstract: In this paper we delve deep in the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. We show that, for sequences longer than the attention head dimension, attention weights are not identifiable. We propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Furthermore, we show that input tokens retain to a large degree their identity across the model. We also find evidence suggesting that identity information is mainly encoded in the angle of the embeddings and gradually decreases with depth. Finally, we demonstrate strong mixing of input information in the generation of contextual embeddings by means of a novel quantification method based on gradient attribution. Overall, we show that self-attention distributions are not directly interpretable and present tools to better understand and further investigate Transformer models.
Keyword: Self-attention, interpretability, identifiability, BERT, Transformer, NLP, explanation, gradient attribution

Exploring Model-based Planning with Policy Networks
Author: Tingwu Wang, Jimmy Ba
link: https://openreview.net/pdf?id=H1exf64KwH
Code: None
Abstract: Model-based reinforcement learning (MBRL) with model-predictive control or
online planning has shown great potential for locomotion control tasks in both
sample efficiency and asymptotic performance. Despite the successes, the existing
planning methods search from candidate sequences randomly generated in the
action space, which is inefficient in complex high-dimensional environments. In
this paper, we propose a novel MBRL algorithm, model-based policy planning
(POPLIN), that combines policy networks with online planning. More specifically,
we formulate action planning at each time-step as an optimization problem using
neural networks. We experiment with both optimization w.r.t. the action sequences
initialized from the policy network, and also online optimization directly w.r.t. the
parameters of the policy network. We show that POPLIN obtains state-of-the-art
performance in the MuJoCo benchmarking environments, being about 3x more
sample efficient than the state-of-the-art algorithms, such as PETS, TD3 and SAC.
To explain the effectiveness of our algorithm, we show that the optimization surface
in parameter space is smoother than in action space. Further more, we found the
distilled policy network can be effectively applied without the expansive model
predictive control during test time for some environments such as Cheetah. Code
is released.
Keyword: reinforcement learning, model-based reinforcement learning, planning

Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling
Author: Yuping Luo, Huazhe Xu, Tengyu Ma
link: https://openreview.net/pdf?id=rke-f6NKvS
Code: None
Abstract: Imitation learning, followed by reinforcement learning algorithms, is a promising paradigm to solve complex control tasks sample-efficiently. However, learning from demonstrations often suffers from the covariate shift problem, which results
in cascading errors of the learned policy. We introduce a notion of conservatively extrapolated value functions, which provably lead to policies with self-correction. We design an algorithm Value Iteration with Negative Sampling (VINS) that practically learns such value functions with conservative extrapolation. We show that VINS can correct mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose the algorithm of using VINS to initialize a reinforcement learning algorithm, which is shown to outperform prior works in sample efficiency.
Keyword: imitation learning, model-based imitation learning, model-based RL, behavior cloning, covariate shift

Geometric Insights into the Convergence of Nonlinear TD Learning
Author: David Brandfonbrener, Joan Bruna
link: https://openreview.net/pdf?id=SJezGp4YPr
Code: None
Abstract: While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence.
Keyword: TD, nonlinear, convergence, value estimation, reinforcement learning

Few-shot Text Classification with Distributional Signatures
Author: Yujia Bao, Menghua Wu, Shiyu Chang, Regina Barzilay
link: https://openreview.net/pdf?id=H1emfT4twB
Code: https://github.com/YujiaBao/Distributional-Signatures
Abstract: In this paper, we explore meta-learning for few-shot text classification. Meta-learning has shown strong performance in computer vision, where low-level patterns are transferable across learning tasks. However, directly applying this approach to text is challenging–lexical features highly informative for one task may be insignificant for another. Thus, rather than learning solely from words, our model also leverages their distributional signatures, which encode pertinent word occurrence patterns. Our model is trained within a meta-learning framework to map these signatures into attention scores, which are then used to weight the lexical representations of words. We demonstrate that our model consistently outperforms prototypical networks learned on lexical knowledge (Snell et al., 2017) in both few-shot text classification and relation classification by a significant margin across six benchmark datasets (20.0% on average in 1-shot classification).
Keyword: text classification, meta learning, few shot learning

Escaping Saddle Points Faster with Stochastic Momentum
Author: Jun-Kun Wang, Chi-Heng Lin, Jacob Abernethy
link: https://openreview.net/pdf?id=rkeNfp4tPr
Code: None
Abstract: Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the gradient at the current iterate on a batch of examples, where the addition of a ``momentum’’ term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely-observed empirical phenomenon is that in training deep networks stochastic momentum appears to significantly improve convergence time, variants of it have flourished in the development of other popular update methods, e.g. ADAM, AMSGrad, etc. Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second order stationary point. Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter–our analysis suggests that β [ 0 , 1 ) \beta \in [0,1) should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions.
Keyword: SGD, momentum, escaping saddle point

Adversarial Policies: Attacking Deep Reinforcement Learning
Author: Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell
link: https://openreview.net/pdf?id=HJgEMpVFwB
Code: https://github.com/humancompatibleai/adversarial-policies
Abstract: Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent’s observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. Videos are available at
Keyword: deep RL, adversarial examples, security, multi-agent

VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation
Author: Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, Durk Kingma
link: https://openreview.net/pdf?id=rJgUfTEYvH
Code: https://storage.googleapis.com/iclr_code/videoflow_code.zip
Abstract: Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modeling of video.
Keyword: Video generation, flow-based generative models, stochastic video prediction

GLAD: Learning Sparse Graph Recovery
Author: Harsh Shrivastava, Xinshi Chen, Binghong Chen, Guanghui Lan, Srinivas Aluru, Han Liu, Le Song
link: https://openreview.net/pdf?id=BkxpMTEtPB
Code: https://github.com/Harshs27/GLAD
Abstract: Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an 1 \ell_1 regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation to recover the graph structure. Recently, there is a surge of interest to learn algorithms directly based on data, and in this case, learn to map empirical covariance to the sparse precision matrix. However, it is a challenging task in this case, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as our model inductive bias, and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data.
Keyword: Meta learning, automated algorithm design, learning structure recovery, Gaussian graphical models

Pruned Graph Scattering Transforms
Author: Vassilis N. Ioannidis, Siheng Chen, Georgios B. Giannakis
link: https://openreview.net/pdf?id=rJeg7TEYwB
Code: None
Abstract: Graph convolutional networks (GCNs) have achieved remarkable performance in a variety of network science learning tasks. However, theoretical analysis of such approaches is still at its infancy. Graph scattering transforms (GSTs) are non-trainable deep GCN models that are amenable to generalization and stability analyses. The present work addresses some limitations of GSTs by introducing a novel so-termed pruned §GST approach. The resultant pruning algorithm is guided by a graph-spectrum-inspired criterion, and retains informative scattering features on-the-fly while bypassing the exponential complexity associated with GSTs. It is further established that pGSTs are stable to perturbations of the input graph signals with bounded energy. Experiments showcase that i) pGST performs comparably to the baseline GST that uses all scattering features, while achieving significant computational savings; ii) pGST achieves comparable performance to state-of-the-art GCNs; and iii) Graph data from various domains lead to different scattering patterns, suggesting domain-adaptive pGST network architectures.
Keyword: Graph scattering transforms, pruning, graph convolutional networks, stability, deep learning

Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model
Author: Wenhan Xiong, Jingfei Du, William Yang Wang, Veselin Stoyanov
link: https://openreview.net/pdf?id=BJlzm64tDH
Code: None
Abstract: Recent breakthroughs of pretrained language models have shown the effectiveness of self-supervised learning for a wide range of natural language processing (NLP) tasks. In addition to standard syntactic and semantic NLP tasks, pretrained models achieve strong improvements on tasks that involve real-world knowledge, suggesting that large-scale language modeling could be an implicit method to capture knowledge. In this work, we further investigate the extent to which pretrained models such as BERT capture knowledge using a zero-shot fact completion task. Moreover, we propose a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities. Models trained with our new objective yield significant improvements on the fact completion task. When applied to downstream tasks, our model consistently outperforms BERT on four entity-related question answering datasets (i.e., WebQuestions, TriviaQA, SearchQA and Quasar-T) with an average 2.7 F1 improvements and a standard fine-grained entity typing dataset (i.e., FIGER) with 5.7 accuracy gains.
Keyword: None

Can gradient clipping mitigate label noise?
Author: Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
link: https://openreview.net/pdf?id=rklB76EKPr
Code: None
Abstract: Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum. This intuition has been made precise in a line of recent works, which show that suitable clipping can yield significantly faster convergence than vanilla gradient descent. In this paper, we propose a new lens for studying gradient clipping, namely, robustness: informally, one expects clipping to provide robustness to noise, since one does not overly trust any single sample. Surprisingly, we prove that for the common problem of label noise in classification, standard gradient clipping does not in general provide robustness. On the other hand, we show that a simple variant of gradient clipping is provably robust, and corresponds to suitably modifying the underlying loss function. This yields a simple, noise-robust alternative to the standard cross-entropy loss which performs well empirically.
Keyword: None

Editable Neural Networks
Author: Anton Sinitsin, Vsevolod Plokhotnyuk, Dmitry Pyrkin, Sergei Popov, Artem Babenko
link: https://openreview.net/pdf?id=HJedXaEtvS
Code: https://github.com/editable-ICLR2020/editable
Abstract: These days deep neural networks are ubiquitously used in a wide range of tasks, from image classification and machine translation to face identification and self-driving cars. In many applications, a single model error can lead to devastating financial, reputational and even life-threatening consequences. Therefore, it is crucially important to correct model mistakes quickly as they appear. In this work, we investigate the problem of neural network editing - how one can efficiently patch a mistake of the model on a particular sample, without influencing the model behavior on other samples. Namely, we propose Editable Training, a model-agnostic training technique that encourages fast editing of the trained model. We empirically demonstrate the effectiveness of this method on large-scale image classification and machine translation tasks.
Keyword: editing, editable, meta-learning, maml

LEARNING EXECUTION THROUGH NEURAL CODE FUSION
Author: Zhan Shi, Kevin Swersky, Daniel Tarlow, Parthasarathy Ranganathan, Milad Hashemi
link: https://openreview.net/pdf?id=SJetQpEYvB
Code: https://www.dropbox.com/s/yrjhx8ifowdktwh/ncf_code.zip?dl=0
Abstract: As the performance of computer systems stagnates due to the end of Moore’s Law,
there is a need for new models that can understand and optimize the execution
of general purpose code. While there is a growing body of work on using Graph
Neural Networks (GNNs) to learn static representations of source code, these
representations do not understand how code executes at runtime. In this work, we
propose a new approach using GNNs to learn fused representations of general
source code and its execution. Our approach defines a multi-task GNN over
low-level representations of source code and program state (i.e., assembly code
and dynamic memory states), converting complex source code constructs and data
structures into a simpler, more uniform format. We show that this leads to improved
performance over similar methods that do not use execution and it opens the door
to applying GNN models to new tasks that would not be feasible from static code
alone. As an illustration of this, we apply the new model to challenging dynamic
tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite,
outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we
use the learned fused graph embeddings to demonstrate transfer learning with high
performance on an indirectly related algorithm classification task.
Keyword: code understanding, graph neural networks, learning program execution, execution traces, program performance

FasterSeg: Searching for Faster Real-time Semantic Segmentation
Author: Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, Zhangyang Wang
link: https://openreview.net/pdf?id=BJgqQ6NYvB
Code: https://github.com/TAMU-VITA/FasterSeg
Abstract: We present FasterSeg, an automatically designed semantic segmentation network with not only state-of-the-art performance but also faster speed than current methods. Utilizing neural architecture search (NAS), FasterSeg is discovered from a novel and broader search space integrating multi-resolution branches, that has been recently found to be vital in manually designed segmentation models. To better calibrate the balance between the goals of high accuracy and low latency, we propose a decoupled and fine-grained latency regularization, that effectively overcomes our observed phenomenons that the searched networks are prone to “collapsing” to low-latency yet poor-accuracy models. Moreover, we seamlessly extend FasterSeg to a new collaborative search (co-searching) framework, simultaneously searching for a teacher and a student network in the same single run. The teacher-student distillation further boosts the student model’s accuracy. Experiments on popular segmentation benchmarks demonstrate the competency of FasterSeg. For example, FasterSeg can run over 30% faster than the closest manually designed competitor on Cityscapes, while maintaining comparable accuracy.
Keyword: neural architecture search, real-time, segmentation

Difference-Seeking Generative Adversarial Network–Unseen Sample Generation
Author: Yi Lin Sung, Sung-Hsien Hsieh, Soo-Chang Pei, Chun-Shien Lu
link: https://openreview.net/pdf?id=rygjmpVFvB
Code: https://drive.google.com/open?id=18aQzyPbTT7_4fdkFjxL2MLjxMLK_hCuH
Abstract:
Unseen data, which are not samples from the distribution of training data and are difficult to collect, have exhibited importance in numerous applications, ({\em e.g.,} novelty detection, semi-supervised learning, and adversarial training). In this paper, we introduce a general framework called \textbf{d}ifference-\textbf{s}eeking \textbf{g}enerative \textbf{a}dversarial \textbf{n}etwork (DSGAN), to generate various types of unseen data. Its novelty is the consideration of the probability density of the unseen data distribution as the difference between two distributions p d ˉ p_{\bar{d}} and p d p_{d} whose samples are relatively easy to collect.

  The DSGAN can learn the target distribution, $p_{t}$, (or the unseen data distribution)  from only the samples from the two distributions, $p_{d}$ and $p_{\bar{d}}$. In our scenario, $p_d$ is the distribution of the seen data, and $p_{\bar{d}}$ can be obtained from $p_{d}$ via simple operations, so that  we only need the samples of $p_{d}$ during the training. 
  Two key applications, semi-supervised learning and novelty detection, are taken as case studies to illustrate that the DSGAN enables the production of various unseen data. We also provide theoretical analyses about the convergence of the DSGAN.

Keyword: generative adversarial network, semi-supervised learning, novelty detection

Stochastic AUC Maximization with Deep Neural Networks
Author: Mingrui Liu, Zhuoning Yuan, Yiming Ying, Tianbao Yang
link: https://openreview.net/pdf?id=HJepXaVYDr
Code: https://drive.google.com/drive/folders/1nPM6fmvN5fTsSaWsOcGFbhMVW7Fxso-Y
Abstract: Stochastic AUC maximization has garnered an increasing interest due to better fit to imbalanced data classification. However, existing works are limited to stochastic AUC maximization with a linear predictive model, which restricts its predictive power when dealing with extremely complex data. In this paper, we consider stochastic AUC maximization problem with a deep neural network as the predictive model. Building on the saddle point reformulation of a surrogated loss of AUC, the problem can be cast into a {\it non-convex concave} min-max problem. The main contribution made in this paper is to make stochastic AUC maximization more practical for deep neural networks and big data with theoretical insights as well. In particular, we propose to explore Polyak-\L{}ojasiewicz (PL) condition that has been proved and observed in deep learning, which enables us to develop new stochastic algorithms with even faster convergence rate and more practical step size scheme. An AdaGrad-style algorithm is also analyzed under the PL condition with adaptive convergence rate. Our experimental results demonstrate the effectiveness of the proposed algorithms.
Keyword: Stochastic AUC Maximization, Deep Neural Networks

Semantically-Guided Representation Learning for Self-Supervised Monocular Depth
Author: Vitor Guizilini, Rui Hou, Jie Li, Rares Ambrus, Adrien Gaidon
link: https://openreview.net/pdf?id=ByxT7TNFvH
Code: https://github.com/tri-ml/packnet-sfm
Abstract: Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage more directly this semantic structure to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. Furthermore, we propose a two-stage training process to overcome a common semantic bias on dynamic objects via resampling. Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories.

Keyword: computer vision, machine learning, deep learning, monocular depth estimation, self-supervised learning

MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius
Author: Runtian Zhai, Chen Dan, Di He, Huan Zhang, Boqing Gong, Pradeep Ravikumar, Cho-Jui Hsieh, Liwei Wang
link: https://openreview.net/pdf?id=rJx1Na4Fwr
Code: https://github.com/RuntianZ/macer
Abstract: Adversarial training is one of the most popular ways to learn robust models but is usually attack-dependent and time costly. In this paper, we propose the MACER algorithm, which learns robust models without using adversarial training but performs better than all existing provable l2-defenses. Recent work shows that randomized smoothing can be used to provide a certified l2 radius to smoothed classifiers, and our algorithm trains provably robust smoothed classifiers via MAximizing the CErtified Radius (MACER). The attack-free characteristic makes MACER faster to train and easier to optimize. In our experiments, we show that our method can be applied to modern deep neural networks on a wide range of datasets, including Cifar-10, ImageNet, MNIST, and SVHN. For all tasks, MACER spends less training time than state-of-the-art adversarial training algorithms, and the learned models achieve larger average certified radius.
Keyword: Adversarial Robustness, Provable Adversarial Defense, Randomized Smoothing, Robustness Certification

Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions
Author: Yao Qin, Nicholas Frosst, Sara Sabour, Colin Raffel, Garrison Cottrell, Geoffrey Hinton
link: https://openreview.net/pdf?id=Skgy464Kvr
Code: None
Abstract: Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack which seeks both to cause a misclassification and a low reconstruction error. This reconstructive attack produces undetected adversarial examples but with much smaller success rate. Among all these attacks, we find that CapsNets always perform better than convolutional networks. Then, we diagnose the adversarial examples for CapsNets and find that the success of the reconstructive attack is highly related to the visual similarity between the source and target class. Additionally, the resulting perturbations can cause the input image to appear visually more like the target class and hence become non-adversarial. This suggests that CapsNets use features that are more aligned with human perception and have the potential to address the central issue raised by adversarial examples.
Keyword: Adversarial Examples, Detection of adversarial attacks

Adversarial Example Detection and Classification with Asymmetrical Adversarial Training
Author: Xuwang Yin, Soheil Kolouri, Gustavo K Rohde
link: https://openreview.net/pdf?id=SJeQEp4YDH
Code: https://github.com/xuwangyin
Abstract: The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks is proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper we first present an adversarial example detection method that provides performance guarantee to norm constrained adversaries. The method is based on the idea of training adversarial robust subspace detectors using asymmetrical adversarial training (AAT). The novel AAT objective presents a minimax problem similar to that of GANs; it has the same convergence property, and consequently supports the learning of class conditional distributions. We first demonstrate that the minimax problem could be reasonably solved by PGD attack, and then use the learned class conditional generative models to define generative detection/classification models that are both robust and more interpretable. We provide comprehensive evaluations of the above methods, and demonstrate their competitive performances and compelling properties on adversarial detection and robust classification problems.
Keyword: adversarial example detection, adversarial examples classification, robust optimization, ML security, generative modeling, generative classification

Variational Recurrent Models for Solving Partially Observable Control Tasks
Author: Dongqi Han, Kenji Doya, Jun Tani
link: https://openreview.net/pdf?id=r1lL4a4tDB
Code: https://github.com/oist-cnru/Variational-Recurrent-Models
Abstract: In partially observable (PO) environments, deep reinforcement learning (RL) agents often suffer from unsatisfactory performance, since two problems need to be tackled together: how to extract information from the raw observations to solve the task, and how to improve the policy. In this study, we propose an RL algorithm for solving PO tasks. Our method comprises two parts: a variational recurrent model (VRM) for modeling the environment, and an RL controller that has access to both the environment and the VRM. The proposed algorithm was tested in two types of PO robotic control tasks, those in which either coordinates or velocities were not observable and those that require long-term memorization. Our experiments show that the proposed algorithm achieved better data efficiency and/or learned more optimal policy than other alternative approaches in tasks in which unobserved states cannot be inferred from raw observations in a simple manner.
Keyword: Reinforcement Learning, Deep Learning, Variational Inference, Recurrent Neural Network, Partially Observable, Robotic Control, Continuous Control

Population-Guided Parallel Policy Search for Reinforcement Learning
Author: Whiyoung Jung, Giseung Park, Youngchul Sung
link: https://openreview.net/pdf?id=rJeINp4KwH
Code: None
Abstract: In this paper, a new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL). In the proposed scheme, multiple identical learners with their own value-functions and policies share a common experience replay buffer, and search a good policy in collaboration with the guidance of the best policy information. The key point is that the information of the best policy is fused in a soft manner by constructing an augmented loss function for policy update to enlarge the overall search region by the multiple learners. The guidance by the previous best policy and the enlarged range enable faster and better policy search, and monotone improvement of the expected cumulative return by the proposed scheme is proved theoretically. Working algorithms are constructed by applying the proposed scheme to the twin delayed deep deterministic (TD3) policy gradient algorithm, and numerical results show that the constructed P3S-TD3 outperforms most of the current state-of-the-art RL algorithms, and the gain is significant in the case of sparse reward environment.
Keyword: Reinforcement Learning, Parallel Learning, Population Based Learning

Compositional languages emerge in a neural iterated learning model
Author: Yi Ren, Shangmin Guo, Matthieu Labeau, Shay B. Cohen, Simon Kirby
link: https://openreview.net/pdf?id=HkePNpVKPB
Code: https://github.com/Joshua-Ren/Neural_Iterated_Learning
Abstract: The principle of compositionality, which enables natural language to represent complex concepts via a structured combination of simpler ones, allows us to convey an open-ended set of messages using a limited vocabulary. If compositionality is indeed a natural property of language, we may expect it to appear in communication protocols that are created by neural agents via grounded language learning. Inspired by the iterated learning framework, which simulates the process of language evolution, we propose an effective neural iterated learning algorithm that, when applied to interacting neural agents, facilitates the emergence of a more structured type of language. Indeed, these languages provide specific advantages to neural agents during training, which translates as a larger posterior probability, which is then incrementally amplified via the iterated learning procedure. Our experiments confirm our analysis, and also demonstrate that the emerged languages largely improve the generalization of the neural agent communication.
Keyword: Compositionality, Multi-agent, Emergent language, Iterated learning

Black-Box Adversarial Attack with Transferable Model-based Embedding
Author: Zhichao Huang, Tong Zhang
link: https://openreview.net/pdf?id=SJxhNTNYwB
Code: https://github.com/TransEmbedBA/TREMBA
Abstract: We present a new method for black-box adversarial attack. Unlike previous methods that combined transfer-based and scored-based methods by using the gradient or initialization of a surrogate white-box model, this new method tries to learn a low-dimensional embedding using a pretrained model, and then performs efficient search within the embedding space to attack an unknown target network. The method produces adversarial perturbations with high level semantic patterns that are easily transferable. We show that this approach can greatly improve the query efficiency of black-box adversarial attack across different target network architectures. We evaluate our approach on MNIST, ImageNet and Google Cloud Vision API, resulting in a significant reduction on the number of queries. We also attack adversarially defended networks on CIFAR10 and ImageNet, where our method not only reduces the number of queries, but also improves the attack success rate.
Keyword: adversarial examples, black-box attack, embedding

I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively
Author: Haotao Wang, Tianlong Chen, Zhangyang Wang, Kede Ma
link: https://openreview.net/pdf?id=rJehNT4YPr
Code: https://github.com/TAMU-VITA/MAD
Abstract: The learning of hierarchical representations for image classification has experienced an impressive series of successes due in part to the availability of large-scale labeled data for training. On the other hand, the trained classifiers have traditionally been evaluated on small and fixed sets of test images, which are deemed to be extremely sparsely distributed in the space of all natural images. It is thus questionable whether recent performance improvements on the excessively re-used test sets generalize to real-world natural images with much richer content variations. Inspired by efficient stimulus selection for testing perceptual models in psychophysical and physiological studies, we present an alternative framework for comparing image classifiers, which we name the MAximum Discrepancy (MAD) competition. Rather than comparing image classifiers using fixed test images, we adaptively sample a small test set from an arbitrarily large corpus of unlabeled images so as to maximize the discrepancies between the classifiers, measured by the distance over WordNet hierarchy. Human labeling on the resulting model-dependent image sets reveals the relative performance of the competing classifiers, and provides useful insights on potential ways to improve them. We report the MAD competition results of eleven ImageNet classifiers while noting that the framework is readily extensible and cost-effective to add future classifiers into the competition. Codes can be found at
Keyword: model comparison

Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models
Author: Cheolhyoung Lee, Kyunghyun Cho, Wanmo Kang
link: https://openreview.net/pdf?id=HkgaETNtDB
Code: https://github.com/bloodwass/mixout
Abstract: In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as “mixout”, motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE.
Keyword: regularization, finetuning, dropout, dropconnect, adaptive L2-penalty, BERT, pretrained language model

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
Author: Yuanhao Wang, Kefan Dong, Xiaoyu Chen, Liwei Wang
link: https://openreview.net/pdf?id=BkglSTNFDB
Code: None
Abstract: A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. (2018) proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by O ~ ( S A ϵ 2 ( 1 γ ) 7 ) \tilde{O}({\frac{SA}{\epsilon^2(1-\gamma)^7}}) . This improves the previously best known result of O ~ ( S A ϵ 4 ( 1 γ ) 8 ) \tilde{O}({\frac{SA}{\epsilon^4(1-\gamma)^8}}) in this setting achieved by delayed Q-learning (Strehlet al., 2006), and matches the lower bound in terms of ϵ \epsilon as well as S S and A A up to logarithmic factors.
Keyword: theory, reinforcement learning, sample complexity

Deep Network Classification by Scattering and Homotopy Dictionary Learning
Author: John Zarka, Louis Thiry, Tomas Angles, Stephane Mallat
link: https://openreview.net/pdf?id=SJxWS64FwH
Code: None
Abstract: We introduce a sparse scattering deep convolutional neural network, which provides a simple model to analyze properties of deep representation learning for classification. Learning a single dictionary matrix with a classifier yields a higher classification accuracy than AlexNet over the ImageNet 2012 dataset. The network first applies a scattering transform that linearizes variabilities due to geometric transformations such as translations and small deformations.
A sparse 1 \ell^1 dictionary coding reduces intra-class variability while preserving class separation through projections over unions of linear spaces. It is implemented in a deep convolutional network with a homotopy algorithm having an exponential convergence. A convergence proof is given in a general framework that includes ALISTA. Classification results are analyzed on ImageNet.
Keyword: dictionary learning, scattering transform, sparse coding, imagenet

Data-Independent Neural Pruning via Coresets
Author: Ben Mussay, Margarita Osadchy, Vladimir Braverman, Samson Zhou, Dan Feldman
link: https://openreview.net/pdf?id=H1gmHaEKwB
Code: None
Abstract: Previous work showed empirically that large neural networks can be significantly reduced in size while preserving their accuracy. Model compression became a central research topic, as it is crucial for deployment of neural networks on devices with limited computational and memory resources. The majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrarily new sample.

  We propose the first efficient, data-independent neural pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework, which finds a small weighted subset of points that provably approximates the original inputs. Specifically, we approximate the output of a layer of neurons by a coreset of neurons in the previous layer and discard the rest. We apply this framework in a layer-by-layer fashion from the top to the bottom. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one. We demonstrate the effectiveness of our method on popular network architectures. In particular, our coresets yield 90% compression of the LeNet-300-100 architecture on MNIST while improving the accuracy.

Keyword: coresets, neural pruning, network compression

Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks
Author: Arsalan Sharifnassab, Saber Salehkaleybar, S. Jamaloddin Golestani
link: https://openreview.net/pdf?id=BkgXHTNtvS
Code: None
Abstract: We study the landscape of squared loss in neural networks with one-hidden layer and ReLU activation functions. Let m m and d d be the widths of hidden and input layers, respectively. We show that there exist poor local minima with positive curvature for some training sets of size n m + 2 d 2 n\geq m+2d-2 . By positive curvature of a local minimum, we mean that within a small neighborhood the loss function is strictly increasing in all directions. Consequently, for such training sets, there are initialization of weights from which there is no descent path to global optima. It is known that for n m n\le m , there always exist descent paths to global optima from all initial weights. In this perspective, our results provide a somewhat sharp characterization of the over-parameterization required for “existence of descent paths” in the loss landscape.
Keyword: Spurious local minima, Loss landscape, Over-parameterization, Theory of deep learning, Optimization, Descent path

Novelty Detection Via Blurring
Author: Sungik Choi, Sae-Young Chung
link: https://openreview.net/pdf?id=ByeNra4FDB
Code: None
Abstract: Conventional out-of-distribution (OOD) detection schemes based on variational autoencoder or Random Network Distillation (RND) are known to assign lower uncertainty to the OOD data than the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to the blurred images. Based on the observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient in test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns a better target distribution representation than the baselines. Finally, SVD-RND combined with geometric transform achieves near-perfect detection accuracy in CelebA domain.
Keyword: novelty, anomaly, uncertainty

Piecewise linear activations substantially shape the loss surfaces of neural networks
Author: Fengxiang He, Bohan Wang, Dacheng Tao
link: https://openreview.net/pdf?id=B1x6BTEKwr
Code: None
Abstract: Understanding the loss surface of a neural network is fundamentally important to the understanding of deep learning. This paper presents how piecewise linear activation functions substantially shape the loss surfaces of neural networks. We first prove that the loss surfaces of many neural networks have infinite spurious local minima, which are defined as the local minima with higher empirical risks than the global minima. Our result holds for any neural network with arbitrary depth and arbitrary piecewise linear activation functions (excluding linear functions) under most loss functions in practice with some mild assumptions. This result demonstrates that the networks with piecewise linear activations possess substantial differences to the well-studied linear neural networks. Essentially, the underlying assumptions for the above result are consistent with most practical circumstances where the output layer is narrower than any hidden layer. In addition, the loss surface of a neural network with piecewise linear activations is partitioned into multiple smooth and multilinear cells by nondifferentiable boundaries. The constructed spurious local minima are concentrated in one cell as a valley: they are connected with each other by a continuous path, on which empirical risk is invariant. Further for one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
Keyword: neural network, nonlinear activation, loss surface, spurious local minimum

Relational State-Space Model for Stochastic Multi-Object Systems
Author: Fan Yang, Ling Chen, Fan Zhou, Yusong Gao, Wei Cao
link: https://openreview.net/pdf?id=B1lGU64tDr
Code: None
Abstract: Real-world dynamical systems often consist of multiple stochastic subsystems that interact with each other. Modeling and forecasting the behavior of such dynamics are generally not easy, due to the inherent hardness in understanding the complicated interactions and evolutions of their constituents. This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model that makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. By letting GNNs cooperate with SSM, R-SSM provides a flexible way to incorporate relational information into the modeling of multi-object dynamics. We further suggest augmenting the model with normalizing flows instantiated for vertex-indexed random variables and propose two auxiliary contrastive objectives to facilitate the learning. The utility of R-SSM is empirically evaluated on synthetic and real time series datasets.
Keyword: state-space model, time series, deep sequential model, graph neural network

Learning Efficient Parameter Server Synchronization Policies for Distributed SGD
Author: Rong Zhu, Sheng Yang, Andreas Pfadler, Zhengping Qian, Jingren Zhou
link: https://openreview.net/pdf?id=rJxX8T4Kvr
Code: None
Abstract: We apply a reinforcement learning (RL) based approach to learning optimal synchronization policies used for Parameter Server-based distributed training of machine learning models with Stochastic Gradient Descent (SGD). Utilizing a formal synchronization policy description in the PS-setting, we are able to derive a suitable and compact description of states and actions, allowing us to efficiently use the standard off-the-shelf deep Q-learning algorithm. As a result, we are able to learn synchronization policies which generalize to different cluster environments, different training datasets and small model variations and (most importantly) lead to considerable decreases in training time when compared to standard policies such as bulk synchronous parallel (BSP), asynchronous parallel (ASP), or stale synchronous parallel (SSP). To support our claims we present extensive numerical results obtained from experiments performed in simulated cluster environments. In our experiments training time is reduced by 44 on average and learned policies generalize to multiple unseen circumstances.
Keyword: Distributed SGD, Paramter-Server, Synchronization Policy, Reinforcement Learning

Action Semantics Network: Considering the Effects of Actions in Multiagent Systems
Author: Weixun Wang, Tianpei Yang, Yong Liu, Jianye Hao, Xiaotian Hao, Yujing Hu, Yingfeng Chen, Changjie Fan, Yang Gao
link: https://openreview.net/pdf?id=ryg48p4tPH
Code: https://github.com/MAS-anony/ASN
Abstract: In multiagent systems (MASs), each agent makes individual decisions but all of them contribute globally to the system evolution. Learning in MASs is difficult since each agent’s selection of actions must take place in the presence of other co-learning agents. Moreover, the environmental stochasticity and uncertainties increase exponentially with the increase in the number of agents. Previous works borrow various multiagent coordination mechanisms into deep learning architecture to facilitate multiagent coordination. However, none of them explicitly consider action semantics between agents that different actions have different influences on other agents. In this paper, we propose a novel network architecture, named Action Semantics Network (ASN), that explicitly represents such action semantics between agents. ASN characterizes different actions’ influence on other agents using neural networks based on the action semantics between them. ASN can be easily combined with existing deep reinforcement learning (DRL) algorithms to boost their performance. Experimental results on StarCraft II micromanagement and Neural MMO show ASN significantly improves the performance of state-of-the-art DRL approaches compared with several network architectures.
Keyword: multiagent coordination, multiagent learning

Vid2Game: Controllable Characters Extracted from Real-World Videos
Author: Oran Gafni, Lior Wolf, Yaniv Taigman
link: https://openreview.net/pdf?id=SkxBUpEKwH
Code: None
Abstract: We extract a controllable model from a video of a person performing a certain activity. The model generates novel image sequences of that person, according to user-defined control signals, typically marking the displacement of the moving body. The generated video can have an arbitrary background, and effectively capture both the dynamics and appearance of the person.

  The method is based on two networks. The first  maps a current pose, and a single-instance control signal to the next pose. The second maps the current pose, the new pose, and a given background, to an output frame. Both networks include multiple novelties that enable high-quality performance. This is demonstrated on multiple characters extracted from various videos of dancers and athletes.

Keyword: None

Self-Adversarial Learning with Comparative Discrimination for Text Generation
Author: Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, Ming Zhou
link: https://openreview.net/pdf?id=B1l8L6EtDS
Code: None
Abstract: Conventional Generative Adversarial Networks (GANs) for text generation tend to have issues of reward sparsity and mode collapse that affect the quality and diversity of generated samples. To address the issues, we propose a novel self-adversarial learning (SAL) paradigm for improving GANs’ performance in text generation. In contrast to standard GANs that use a binary classifier as its discriminator to predict whether a sample is real or generated, SAL employs a comparative discriminator which is a pairwise classifier for comparing the text quality between a pair of samples. During training, SAL rewards the generator when its currently generated sentence is found to be better than its previously generated samples. This self-improvement reward mechanism allows the model to receive credits more easily and avoid collapsing towards the limited number of real samples, which not only helps alleviate the reward sparsity issue but also reduces the risk of mode collapse. Experiments on text generation benchmark datasets show that our proposed approach substantially improves both the quality and the diversity, and yields more stable performance compared to the previous GANs for text generation.
Keyword: adversarial learning, text generation

Robust training with ensemble consensus
Author: Jisoo Lee, Sae-Young Chung
link: https://openreview.net/pdf?id=ryxOUTVYDH
Code: None
Abstract: Since deep neural networks are over-parameterized, they can memorize noisy examples. We address such memorizing issue in the presence of annotation noise. From the fact that deep neural networks cannot generalize neighborhoods of the features acquired via memorization, we hypothesize that noisy examples do not consistently incur small losses on the network under a certain perturbation. Based on this, we propose a novel training method called Learning with Ensemble Consensus (LEC) that prevents overfitting noisy examples by eliminating them using the consensus of an ensemble of perturbed networks. One of the proposed LECs, LTEC outperforms the current state-of-the-art methods on noisy MNIST, CIFAR-10, and CIFAR-100 in an efficient manner.
Keyword: Annotation noise, Noisy label, Robustness, Ensemble, Perturbation

Identifying through Flows for Recovering Latent Representations
Author: Shen Li, Bryan Hooi, Gim Hee Lee
link: https://openreview.net/pdf?id=SklOUpEYvB
Code: None
Abstract: Identifiability, or recovery of the true latent representations from which the observed data originates, is de facto a fundamental goal of representation learning. Yet, most deep generative models do not address the question of identifiability, and thus fail to deliver on the promise of the recovery of the true latent sources that generate the observations. Recent work proposed identifiable generative modelling using variational autoencoders (iVAE) with a theory of identifiability. Due to the intractablity of KL divergence between variational approximate posterior and the true posterior, however, iVAE has to maximize the evidence lower bound (ELBO) of the marginal likelihood, leading to suboptimal solutions in both theory and practice. In contrast, we propose an identifiable framework for estimating latent representations using a flow-based model (iFlow). Our approach directly maximizes the marginal likelihood, allowing for theoretical guarantees on identifiability, thereby dispensing with variational approximations. We derive its optimization objective in analytical form, making it possible to train iFlow in an end-to-end manner. Simulations on synthetic data validate the correctness and effectiveness of our proposed method and demonstrate its practical advantages over other existing methods.
Keyword: Representation learning, identifiable generative models, nonlinear-ICA

Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing
Author: Jinyuan Jia, Xiaoyu Cao, Binghui Wang, Neil Zhenqiang Gong
link: https://openreview.net/pdf?id=BkeWw6VFwr
Code: https://github.com/jjy1994/Certify_Topk
Abstract: It is well-known that classifiers are vulnerable to adversarial perturbations. To defend against adversarial perturbations, various certified robustness results have been derived. However, existing certified robustnesses are limited to top-1 predictions. In many real-world applications, top- k k predictions are more relevant. In this work, we aim to derive certified robustness for top- k k predictions. In particular, our certified robustness is based on randomized smoothing, which turns any classifier to a new classifier via adding noise to an input example. We adopt randomized smoothing because it is scalable to large-scale neural networks and applicable to any classifier. We derive a tight robustness in 2 \ell_2 norm for top- k k predictions when using randomized smoothing with Gaussian noise. We find that generalizing the certified robustness from top-1 to top- k k predictions faces significant technical challenges. We also empirically evaluate our method on CIFAR10 and ImageNet. For example, our method can obtain an ImageNet classifier with a certified top-5 accuracy of 62.8% when the 2 \ell_2 -norms of the adversarial perturbations are less than 0.5 (=127/255). Our code is publicly available at: \url{
Keyword: Certified Adversarial Robustness, Randomized Smoothing, Adversarial Examples

Optimistic Exploration even with a Pessimistic Initialisation
Author: Tabish Rashid, Bei Peng, Wendelin Boehmer, Shimon Whiteson
link: https://openreview.net/pdf?id=r1xGP6VYwH
Code: None
Abstract: Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In the tabular case, all provably efficient model-free algorithms rely on it. However, model-free deep RL algorithms do not use optimistic initialisation despite taking inspiration from these provably efficient tabular algorithms. In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values due to commonly used network initialisation schemes, a pessimistic initialisation. Merely initialising the network to output optimistic Q-values is not enough, since we cannot ensure that they remain optimistic for novel state-action pairs, which is crucial for exploration. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network. We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting. Our algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure optimism during both action selection and bootstrapping. We show that OPIQ outperforms non-optimistic DQN variants that utilise a pseudocount-based intrinsic motivation in hard exploration tasks, and that it predicts optimistic estimates for novel state-action pairs.
Keyword: Reinforcement Learning, Exploration, Optimistic Initialisation

VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Author: Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
link: https://openreview.net/pdf?id=SygXPaEYvH
Code: https://github.com/jackroos/VL-BERT
Abstract: We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either of a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit for most of the visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved the first place of single model on the leaderboard of the VCR benchmark.
Keyword: Visual-Linguistic, Generic Representation, Pre-training

发布了51 篇原创文章 · 获赞 3 · 访问量 1万+

猜你喜欢

转载自blog.csdn.net/yinizhilianlove/article/details/104430189