# Paper Digest: NeurIPS 2019 Highlights

Download NIPS-2019-Paper-Digests.pdf: highlights of all 1,427 NeurIPS 2019 papers.

The Conference on Neural Information Processing Systems (NeurIPS, formerly NIPS) is one of the top machine learning conferences in the world. In 2019, it is to be held in Vancouver, Canada. There were 6,743 paper submissions, of which around 1,427 were accepted. Many papers also published their code (download link).

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights / summaries to quickly get the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to get updates on new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: NeurIPS 2019 Papers

No. | Title | Authors | Highlight |
---|---|---|---|

1 | Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation | Risto Vuorio, Shao-Hua Sun, Hexiang Hu, Joseph J. Lim | In this paper, we augment MAML with the capability to identify the mode of tasks sampled from a multimodal task distribution and adapt quickly through gradient updates. |

2 | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee | We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. |

3 | Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers | Liwei Wu, Shuqing Li, Cho-Jui Hsieh, James L. Sharpnack | Alternatively, we propose stochastically shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). |

4 | Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video | Jiawang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, Chunhua Shen, Ming-Ming Cheng, Ian Reid | This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions. |

5 | Zero-shot Learning via Simultaneous Generating and Learning | Hyeonwoo Yu, Beomhee Lee | Beyond exploiting relations between classes of seen and unseen, we present a deep generative model to provide the model with experience about both seen and unseen classes. |

6 | Ask not what AI can do, but what AI should do: Towards a framework of task delegability | Brian Lubars, Chenhao Tan | We approach this problem of task delegability from a human-centered perspective by developing a framework on human perception of task delegation to AI. |

7 | Stand-Alone Self-Attention in Vision Models | Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jon Shlens | In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. |

8 | High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks | Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, Honglak Lee | In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: finding minimal inductive bias for video prediction while maximizing network capacity. |

9 | Unsupervised learning of object structure and dynamics from videos | Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin P. Murphy, Honglak Lee | To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. |

10 | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen | To address the need for efficient and task-independent model parallelism, we introduce TensorPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. |

11 | Meta-Learning with Implicit Gradients | Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, Sergey Levine | By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. |

12 | Adversarial Examples Are Not Bugs, They Are Features | Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madry | We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features (derived from patterns in the data distribution) that are highly predictive, yet brittle and (thus) incomprehensible to humans. |

13 | Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks | Vineet Kosaraju, Amir Sadeghian, Roberto Martín-Martín, Ian Reid, Hamid Rezatofighi, Silvio Savarese | In this paper, we present Social-BiGAT, a graph-based generative adversarial network that generates realistic, multimodal trajectory predictions for multiple pedestrians in a scene. |

14 | FreeAnchor: Learning to Match Anchors for Visual Object Detection | Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, Qixiang Ye | In this study, we propose a learning-to-match approach to break IoU restriction, allowing objects to match anchors in a flexible manner. |

15 | Private Hypothesis Selection | Mark Bun, Gautam Kamath, Thomas Steinke, Steven Z. Wu | We provide a differentially private algorithm for hypothesis selection. |

16 | Differentially Private Algorithms for Learning Mixtures of Separated Gaussians | Gautam Kamath, Or Sheffet, Vikrant Singhal, Jonathan Ullman | In this work, we give new algorithms for learning the parameters of a high-dimensional, well separated, Gaussian mixture model subject to the strong constraint of differential privacy. |

17 | Average-Case Averages: Private Algorithms for Smooth Sensitivity and Mean Estimation | Mark Bun, Thomas Steinke | We propose the trimmed mean estimator, which interpolates between the mean and the median, as a way of attaining much lower sensitivity on average while losing very little in terms of statistical accuracy. |

18 | Multi-Resolution Weak Supervision for Sequential Data | Paroma Varma, Frederic Sala, Shiori Sagawa, Jason Fries, Daniel Fu, Saelig Khattar, Ashwini Ramamoorthy, Ke Xiao, Kayvon Fatahalian, James Priest, Christopher Ré | We propose Dugong, the first framework to model multi-resolution weak supervision sources with complex correlations to assign probabilistic labels to training data. |

19 | DeepUSPS: Deep Robust Unsupervised Saliency Prediction via Self-supervision | Tam Nguyen, Maximilian Dax, Chaithanya Kumar Mummadi, Nhung Ngo, Thi Hoai Phuong Nguyen, Zhongyu Lou, Thomas Brox | In this work, we propose a two-stage mechanism for robust unsupervised object saliency prediction, where the first stage involves refinement of the noisy pseudo labels generated from different handcrafted methods. |

20 | The Point Where Reality Meets Fantasy: Mixed Adversarial Generators for Image Splice Detection | Vladimir V. Kniaz, Vladimir Knyaz, Fabio Remondino | In this paper, we propose a new framework for training of discriminative segmentation model via an adversarial process. |

21 | You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle | Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, Bin Dong | In this paper, we show that adversarial training can be cast as a discrete time differential game. |

22 | Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement | Chao Yang, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu, Junzhou Huang, Chuang Gan | In this paper, we investigate LfO and its difference with LfD in both theoretical and practical perspectives. |

23 | Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance | Kimia Nadjahi, Alain Durmus, Umut Simsekli, Roland Badeau | In this study, we investigate the asymptotic properties of estimators that are obtained by minimizing SW. |

24 | Generalized Sliced Wasserstein Distances | Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, Gustavo Rohde | In this paper, we first clarify the mathematical connection between the SW distance and the Radon transform. We then utilize the generalized Radon transform to define a new family of distances for probability measures, which we call generalized sliced-Wasserstein (GSW) distances. |

25 | First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise | Thanh Huy Nguyen, Umut Simsekli, Mert Gurbuzbalaban, Gaël Richard | In this study, we provide formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. |

26 | Blind Super-Resolution Kernel Estimation using an Internal-GAN | Sefi Bell-Kligler, Assaf Shocher, Michal Irani | In this paper we show how this powerful cross-scale recurrence property can be realized using Deep Internal Learning. |

27 | Noise-tolerant fair classification | Alex Lamy, Ziyuan Zhong | In this paper, we answer the question in the affirmative: we show that if one measures fairness using the mean-difference score, and sensitive features are subject to noise from the mutually contaminated learning model, then owing to a simple identity we only need to change the desired fairness-tolerance. |

28 | Generalization in Generative Adversarial Networks: A Novel Perspective from Privacy Protection | Bingzhe Wu, Shiwan Zhao, Chaochao Chen, Haoyang Xu, Li Wang, Xiaolu Zhang, Guangyu Sun, Jun Zhou | In this paper, we aim to understand the generalization properties of generative adversarial networks (GANs) from a new perspective of privacy protection. |

29 | Joint-task Self-supervised Learning for Temporal Correspondence | Xueting Li, Sifei Liu, Shalini De Mello, Xiaolong Wang, Jan Kautz, Ming-Hsuan Yang | This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. |

30 | Provable Gradient Variance Guarantees for Black-Box Variational Inference | Justin Domke | Recent variational inference methods use stochastic gradient estimators whose variance is not well understood. |

31 | Divide and Couple: Using Monte Carlo Variational Objectives for Posterior Approximation | Justin Domke, Daniel R. Sheldon | This paper gives bounds for the common “reparameterization” estimators when the target is smooth and the variational family is a location-scale distribution. |

32 | Experience Replay for Continual Learning | David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, Gregory Wayne | Here, we introduce CLEAR, a replay-based method that greatly reduces catastrophic forgetting in multi-task reinforcement learning. |

33 | Deep ReLU Networks Have Surprisingly Few Activation Patterns | Boris Hanin, David Rolnick | In this paper, we show that the average number of activation patterns for ReLU networks at initialization is bounded by the total number of neurons raised to the input dimension. |

34 | Chasing Ghosts: Instruction Following as Bayesian State Tracking | Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, Stefan Lee | Based on this intuition, we formulate the problem of finding the goal location in Vision-and-Language Navigation (VLN) within the framework of Bayesian state tracking – learning observation and motion models conditioned on these expectable events. |

35 | Block Coordinate Regularization by Denoising | Yu Sun, Jiaming Liu, Ulugbek Kamilov | In this work, we develop a new block coordinate RED algorithm that decomposes a large-scale estimation problem into a sequence of updates over a small subset of the unknown variables. |

36 | Reducing Noise in GAN Training with Variance Reduced Extragradient | Tatjana Chavdarova, Gauthier Gidel, François Fleuret, Simon Lacoste-Julien | We address this issue with a novel stochastic variance-reduced extragradient (SVRE) optimization algorithm, which for a large class of games improves upon the previous convergence rates proposed in the literature. |

37 | Learning Erdos-Renyi Random Graphs via Edge Detecting Queries | Zihan Li, Matthias Fresacher, Jonathan Scarlett | In this paper, we consider the problem of learning an unknown graph via queries on groups of nodes, with the result indicating whether or not at least one edge is present among those nodes. |

38 | A Primal-Dual link between GANs and Autoencoders | Hisham Husain, Richard Nock, Robert C. Williamson | In this work, we study the $f$-GAN and WAE models and make two main discoveries. |

39 | muSSP: Efficient Min-cost Flow Algorithm for Multi-object Tracking | Congchao Wang, Yizhi Wang, Yinxue Wang, Chiung-Ting Wu, Guoqiang Yu | In this paper, by exploiting the special structures and properties of the graphs formulated in MOT problems, we develop an efficient min-cost flow algorithm, namely, minimum-update Successive Shortest Path (muSSP). |

40 | Category Anchor-Guided Unsupervised Domain Adaptation for Semantic Segmentation | Qiming Zhang, Jing Zhang, Wei Liu, Dacheng Tao | In this paper, we propose a novel category anchor-guided (CAG) UDA model for semantic segmentation, which explicitly enforces category-aware feature alignment to learn shared discriminative features and classifiers simultaneously. |

41 | Invert to Learn to Invert | Patrick Putzky, Max Welling | In this work, we propose an iterative inverse model with constant memory that relies on invertible networks to avoid storing intermediate activations. |

42 | Equitable Stable Matchings in Quadratic Time | Nikolaos Tziavelis, Ioannis Giannakopoulos, Katerina Doka, Nectarios Koziris, Panagiotis Karras | In this paper, we propose an alternative that is computationally simpler and achieves high equity too. |

43 | Zero-Shot Semantic Segmentation | Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, Patrick Pérez | In this paper, we introduce the new task of zero-shot semantic segmentation: learning pixel-wise classifiers for never-seen object categories with zero training examples. |

44 | Metric Learning for Adversarial Robustness | Chengzhi Mao, Ziyuan Zhong, Junfeng Yang, Carl Vondrick, Baishakhi Ray | Motivated by this observation, we propose to regularize the representation space under attack with metric learning to produce more robust classifiers. |

45 | DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction | Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, Ulrich Neumann | In this paper, we present DISN, a Deep Implicit Surface Network which can generate a high-quality detail-rich 3D mesh from a 2D image by predicting the underlying signed distance fields. |

46 | Batched Multi-armed Bandits Problem | Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou | In this paper, we study the multi-armed bandit problem in the batched setting where the employed policy must split data into a small number of batches. |

47 | vGraph: A Generative Model for Joint Community Detection and Node Representation Learning | Fan-Yun Sun, Meng Qu, Jordan Hoffmann, Chin-Wei Huang, Jian Tang | We propose a probabilistic generative model called vGraph to learn community membership and node representation collaboratively. |

48 | Differentially Private Bayesian Linear Regression | Garrett Bernstein, Daniel R. Sheldon | We investigate the problem of Bayesian linear regression, with the goal of computing posterior distributions that correctly quantify uncertainty given privately released statistics. |

49 | Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos | Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, Wenwu Zhu | In this paper, we propose a novel semantic conditioned dynamic modulation (SCDM) mechanism, which relies on the sentence semantics to modulate the temporal convolution operations for better correlating and composing the sentence related video contents over time. |

50 | AGEM: Solving Linear Inverse Problems via Deep Priors and Sampling | Bichuan Guo, Yuxing Han, Jiangtao Wen | In this paper we propose to use a denoising autoencoder (DAE) prior to simultaneously solve a linear inverse problem and estimate its noise parameter. |

51 | CPM-Nets: Cross Partial Multi-View Networks | Changqing Zhang, Zongbo Han, Yajie Cui, Huazhu Fu, Joey Tianyi Zhou, Qinghua Hu | To address the challenge, we propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets). |

52 | Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis | Xihui Liu, Guojun Yin, Jing Shao, Xiaogang Wang, Hongsheng Li | In order to better exploit the semantic layout for the image generator, we propose to predict convolutional kernels conditioned on the semantic label map to generate the intermediate feature maps from the noise maps and eventually generate the images. |

53 | Staying up to Date with Online Content Changes Using Reinforcement Learning for Scheduling | Andrey Kolobov, Yuval Peres, Cheng Lu, Eric J. Horvitz | We propose a novel optimization objective for this setting that has several practically desirable properties, and efficient algorithms for it with optimality guarantees even in the face of mixed content change observability and initially unknown change model parameters. |

54 | SySCD: A System-Aware Parallel Coordinate Descent Algorithm | Nikolas Ioannou, Celestine Mendler-Dünner, Thomas Parnell | In this paper we propose a novel parallel stochastic coordinate descent (SCD) algorithm with convergence guarantees that exhibits strong scalability. |

55 | Importance Weighted Hierarchical Variational Inference | Artem Sobolev, Dmitry P. Vetrov | To overcome this roadblock, we introduce a new family of variational upper bounds on a marginal log-density in the case of hierarchical models (also known as latent variable models). |

56 | RSN: Randomized Subspace Newton | Robert Gower, Dmitry Kovalev, Felix Lieder, Peter Richtarik | We develop a randomized Newton method capable of solving learning problems with huge dimensional feature spaces, which is a common setting in applications such as medical imaging, genomics and seismology. |

57 | Trust Region-Guided Proximal Policy Optimization | Yuhui Wang, Hao He, Xiaoyang Tan, Yaozhong Gan | In this paper, we give an in-depth analysis on the exploration behavior of PPO, and show that PPO is prone to suffer from the risk of lack of exploration especially under the case of bad initialization, which may lead to the failure of training or being trapped in bad local optima. |

58 | Adversarial Self-Defense for Cycle-Consistent GANs | Dina Bashkirova, Ben Usman, Kate Saenko | In this paper, we show how such self-attacking behavior of unsupervised translation methods affects their performance and provide two defense techniques. |

59 | Towards closing the gap between the theory and practice of SVRG | Othmane Sebbouh, Nidham Gazagnadou, Samy Jelassi, Francis Bach, Robert Gower | Our first contribution is that we take several steps towards closing this gap. |

60 | Uniform Error Bounds for Gaussian Process Regression with Application to Safe Control | Armin Lederer, Jonas Umlauft, Sandra Hirche | In this paper, we employ the Gaussian process distribution and continuity arguments to derive a novel uniform error bound under weaker assumptions. |

61 | ETNet: Error Transition Network for Arbitrary Style Transfer | Chunjin Song, Zhijie Wu, Yang Zhou, Minglun Gong, Hui Huang | Inspired by the works on error-correction, instead, we propose a self-correcting model to predict what is wrong with the current stylization and refine it accordingly in an iterative manner. |

62 | No Pressure! Addressing the Problem of Local Minima in Manifold Learning Algorithms | Max Vladymyrov | We propose a natural extension to several manifold learning methods aimed at identifying pressured points, i.e., points that are stuck in poor local minima and have poor embedding quality. |

63 | Deep Equilibrium Models | Shaojie Bai, J. Zico Kolter, Vladlen Koltun | We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). |

64 | Saccader: Improving Accuracy of Hard Attention Models for Vision | Gamaleldin Elsayed, Simon Kornblith, Quoc V. Le | Here, we propose a novel hard attention model, which we term Saccader. |

65 | Multiway clustering via tensor block models | Miaoyan Wang, Yuchen Zeng | We propose a tensor block model, develop a unified least-square estimation, and obtain the theoretical accuracy guarantees for multiway clustering. |

66 | Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives | Wang Chi Cheung | We propose a no-regret algorithm based on the Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), as well as a crucial and novel gradient threshold procedure. |

67 | NAT: Neural Architecture Transformer for Accurate and Compact Architectures | Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Jian Chen, Peilin Zhao, Junzhou Huang | To make the problem feasible, we cast the optimization problem into a Markov decision process (MDP) and seek to learn a Neural Architecture Transformer (NAT) to replace the redundant operations with the more computationally efficient ones (e.g., skip connection or directly removing the connection). |

68 | Selecting Optimal Decisions via Distributionally Robust Nearest-Neighbor Regression | Ruidi Chen, Ioannis Paschalidis | This paper develops a prediction-based prescriptive model for optimal decision making that (i) predicts the outcome under each action using a robust nonlinear model, and (ii) adopts a randomized prescriptive policy determined by the predicted outcomes. |

69 | Network Pruning via Transformable Architecture Search | Xuanyi Dong, Yi Yang | To break the structure limitation of the pruned networks, we propose to apply neural architecture search to search directly for a network with flexible channel and layer sizes. |

70 | Differentiable Cloth Simulation for Inverse Problems | Junbang Liang, Ming Lin, Vladlen Koltun | We propose a differentiable cloth simulator that can be embedded as a layer in deep neural networks. |

71 | Poisson-Randomized Gamma Dynamical Systems | Aaron Schein, Scott Linderman, Mingyuan Zhou, David Blei, Hanna Wallach | This paper presents the Poisson-randomized gamma dynamical system (PRGDS), a model for sequentially observed count tensors that encodes a strong inductive bias toward sparsity and burstiness. |

72 | Volumetric Correspondence Networks for Optical Flow | Gengshan Yang, Deva Ramanan | Instead, we introduce several simple modifications that dramatically simplify the use of volumetric layers – (1) volumetric encoder-decoder architectures that efficiently capture large receptive fields, (2) multi-channel cost volumes that capture multi-dimensional notions of pixel similarities, and finally, (3) separable volumetric filtering that significantly reduces computation and parameters while preserving accuracy. |

73 | Learning Conditional Deformable Templates with Convolutional Networks | Adrian Dalca, Marianne Rakic, John Guttag, Mert Sabuncu | In this work, we present a probabilistic model and efficient learning strategy that yields either universal or *conditional* templates, jointly with a neural network that provides efficient alignment of the images to these templates. |

74 | Fast Low-rank Metric Learning for Large-scale and High-dimensional Data | Han Liu, Zhizhong Han, Yu-Shen Liu, Ming Gu | To address this issue, we present a novel fast low-rank metric learning (FLRML) method. |

75 | Efficient Symmetric Norm Regression via Linear Sketching | Zhao Song, Ruosong Wang, Lin Yang, Hongyang Zhang, Peilin Zhong | We provide efficient algorithms for overconstrained linear regression problems with size $n \times d$ when the loss function is a symmetric norm (a norm invariant under sign-flips and coordinate-permutations). |

76 | RUBi: Reducing Unimodal Biases for Visual Question Answering | Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, Devi Parikh | We propose RUBi, a new learning strategy to reduce biases in any VQA model. |

77 | Why Can’t I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition | Jinwoo Choi, Chen Gao, Joseph C. E. Messou, Jia-Bin Huang | In this paper, we propose to mitigate scene bias for video representation learning. |

78 | NeurVPS: Neural Vanishing Point Scanning via Conic Convolution | Yichao Zhou, Haozhi Qi, Jingwei Huang, Yi Ma | In this work, we identify a canonical conic space in which the neural network can effectively compute the global geometric information of vanishing points locally, and we propose a novel operator named conic convolution that can be implemented as regular convolutions in this space. |

79 | DATA: Differentiable ArchiTecture Approximation | Jianlong Chang, Xinbang Zhang, Yiwen Guo, Gaofeng Meng, Shiming Xiang, Chunhong Pan | To bridge this gap, we develop Differentiable ArchiTecture Approximation (DATA) with an Ensemble Gumbel-Softmax (EGS) estimator to automatically approximate architectures during searching and validating in a differentiable manner. |

80 | Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge | Tingting Qiao, Jing Zhang, Duanqing Xu, Dacheng Tao | In this paper, and inspired by this process, we propose a novel text-to-image method called LeicaGAN to combine the above three phases in a unified framework. |

81 | Memory-oriented Decoder for Light Field Salient Object Detection | Miao Zhang, Jingjing Li, Ji Wei, Yongri Piao, Huchuan Lu | In this paper, we present a deep-learning-based method where a novel memory-oriented decoder is tailored for light field saliency detection. |

82 | Multi-label Co-regularization for Semi-supervised Facial Action Unit Recognition | Xuesong Niu, Hu Han, Shiguang Shan, Xilin Chen | In this work, we propose a semi-supervised approach for AU recognition utilizing a large number of web face images without AU labels and a small face dataset with AU labels inspired by the co-training methods. |

83 | Correlated Uncertainty for Learning Dense Correspondences from Noisy Labels | Natalia Neverova, David Novotny, Andrea Vedaldi | We address this issue by augmenting neural network predictors with the ability to output a distribution over labels, thus explicitly and introspectively capturing the aleatoric uncertainty in the annotations. |

84 | Powerset Convolutional Neural Networks | Chris Wendler, Markus Püschel, Dan Alistarh | We present a novel class of convolutional neural networks (CNNs) for set functions, i.e., data indexed with the powerset of a finite set. |

85 | Optimal Pricing in Repeated Posted-Price Auctions with Different Patience of the Seller and the Buyer | Arsenii Vanunts, Alexey Drutsa | We study revenue optimization pricing algorithms for repeated posted-price auctions where a seller interacts with a single strategic buyer that holds a fixed private valuation. |

86 | An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums | Hadrien Hendrikx, Francis Bach, Laurent Massoulié | In this work, we propose an efficient **A**ccelerated **D**ecentralized stochastic algorithm for **F**inite **S**ums named ADFS, which uses local stochastic proximal updates and randomized pairwise communications between nodes. |

87 | Point-Voxel CNN for Efficient 3D Deep Learning | Zhijian Liu, Haotian Tang, Yujun Lin, Song Han | In this paper, we propose PVCNN that represents the 3D input data in points to reduce the memory consumption, while performing the convolutions in voxels to reduce the irregular, sparse data access and improve the locality. |

88 | Deep Learning without Weight Transport | Mohamed Akrout, Collin Wilson, Peter Humphreys, Timothy Lillicrap, Douglas B. Tweed | Here we describe two mechanisms — a neural circuit called a weight mirror and a modification of an algorithm proposed by Kolen and Pollack in 1994 — both of which let the feedback path learn appropriate synaptic weights quickly and accurately even in large networks, without weight transport or complex wiring. |

89 | Combinatorial Bandits with Relative Feedback | Aadirupa Saha, Aditya Gopalan | For both settings, we devise instance-dependent and order-optimal regret algorithms with regret $O(\frac{n}{m} \ln T)$ and $O(\frac{n}{k} \ln T)$, respectively. |

90 | General Proximal Incremental Aggregated Gradient Algorithms: Better and Novel Results under General Scheme | Tao Sun, Yuejiao Sun, Dongsheng Li, Qing Liao | In this paper, we propose a general proximal incremental aggregated gradient algorithm, which contains various existing algorithms including the basic incremental aggregated gradient method. |

91 | A Condition Number for Joint Optimization of Cycle-Consistent Networks | Leonidas J. Guibas, Qixing Huang, Zhenxiao Liang | This paper presents an algorithm that selects a subset of weighted cycles to minimize a condition number of the induced joint optimization problem. |

92 | Explicit Disentanglement of Appearance and Perspective in Generative Models | Nicki Skafte, Søren Hauberg | Specifically, we propose a model with two latent spaces: one that represents spatial transformations of the input data, and another that represents the transformed data. |

93 | Polynomial Cost of Adaptation for X-Armed Bandits | Hedi Hadiji | In the context of stochastic continuum-armed bandits, we present an algorithm that adapts to the unknown smoothness of the objective function. |

94 | Learning to Propagate for Graph Meta-Learning | Lu Liu, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang | In this paper, we show that a meta-learner that explicitly relates tasks on a graph describing the relations of their output dimensions (e.g., classes) can significantly improve few-shot learning. |

95 | Secretary Ranking with Minimal Inversions | Sepehr Assadi, Eric Balkanski, Renato Leme | We present an algorithm that ranks $n$ elements with only $O(n^{3/2})$ inversions in expectation, and show that any algorithm necessarily suffers $\Omega(n^{3/2})$ inversions when there are $n$ available positions. |

96 | Nonparametric Regressive Point Processes Based on Conditional Gaussian Processes | Siqi Liu, Milos Hauskrecht | In this work, we propose and develop a new nonparametric regressive point process model based on Gaussian processes. |

97 | Learning Perceptual Inference by Contrasting | Chi Zhang, Baoxiong Jia, Feng Gao, Yixin Zhu, HongJing Lu, Song-Chun Zhu | In this work, we study how to improve machines’ reasoning ability on one challenging task of this kind: Raven’s Progressive Matrices (RPM). |

98 | Selecting the independent coordinates of manifolds with large aspect ratios | Yu-Chia Chen, Marina Meila | Hence, we propose a bicriterial Independent Eigencoordinate Selection (IES) algorithm that selects smooth embeddings with few eigenvectors. |

99 | Region-specific Diffeomorphic Metric Mapping | Zhengyang Shen, Francois-Xavier Vialard, Marc Niethammer | We introduce a region-specific diffeomorphic metric mapping (RDMM) registration approach. |

100 | Deep Supervised Summarization: Algorithm and Application to Learning Instructions | Chengguang Xu, Ehsan Elhamifar | To do so, we propose to learn representations of data so that the input of transformed data to the facility location recovers their ground-truth representatives. |

101 | Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations | Vincent Sitzmann, Michael Zollhoefer, Gordon Wetzstein | We propose Scene Representation Networks (SRNs), a continuous, 3D-structure-aware scene representation that encodes both geometry and appearance. |

102 | Reconciling λ-Returns with Experience Replay | Brett Daley, Christopher Amato | Towards this, we propose the first method to enable practical use of λ-returns in arbitrary replay-based methods without relying on other forms of decorrelation such as asynchronous gradient updates. |

103 | Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence | Fengxiang He, Tongliang Liu, Dacheng Tao | This paper reports both theoretical and empirical evidence for a training strategy: to generalize well, the ratio of batch size to learning rate should not be too large. |

104 | Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs | Max Simchowitz, Kevin G. Jamieson | This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. |

105 | A Graph Theoretic Framework of Recomputation Algorithms for Memory-Efficient Backpropagation | Mitsuru Kusumoto, Takuya Inoue, Gentaro Watanabe, Takuya Akiba, Masanori Koyama | In this paper, we will propose a novel and efficient recomputation method that can be applied to a wider range of neural nets than previous methods. |

106 | Combinatorial Inference against Label Noise | Paul Hongsuck Seo, Geeho Kim, Bohyung Han | To handle the label noise issue in a principled way, we propose a unique classification framework of constructing multiple models in heterogeneous coarse-grained meta-class spaces and making joint inference of the trained models for the final predictions in the original (base) class space. |

107 | Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning | Chao Qu, Shie Mannor, Huan Xu, Yuan Qi, Le Song, Junwu Xiong | We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. |

108 | Convolution with even-sized kernels and symmetric padding | Shuang Wu, Guanrui Wang, Pei Tang, Feng Chen, Luping Shi | In this work, we quantify the shift problem that occurs in even-sized kernel convolutions by an information erosion hypothesis, and eliminate it by proposing symmetric padding on four sides of the feature maps (C2sp, C4sp). |

109 | On The Classification-Distortion-Perception Tradeoff | Dong Liu, Haochen Zhang, Zhiwei Xiong | In this paper, we extend the previous perception-distortion tradeoff to the case of classification-distortion-perception (CDP) tradeoff, where we introduce the classification error rate of the restored signal in addition to distortion and perceptual difference. |

110 | Optimal Statistical Rates for Decentralised Non-Parametric Regression with Linear Speed-Up | Dominic Richards, Patrick Rebeschini | We analyse the learning performance of Distributed Gradient Descent in the context of multi-agent decentralised non-parametric regression with the square loss function when i.i.d. samples are assigned to agents. |

111 | Online sampling from log-concave distributions | Holden Lee, Oren Mangoubi, Nisheeth Vishnoi | Technically, lack of strong convexity is a significant barrier to analysis and, here, our main contribution is a martingale exit time argument that shows our Markov chain remains in a ball of radius roughly poly-logarithmic in $T$ for enough time to reach within $\epsilon$ of $\pi_t$. |

112 | Envy-Free Classification | Maria-Florina F. Balcan, Travis Dick, Ritesh Noothigattu, Ariel D. Procaccia | On a conceptual level, we argue that envy-freeness also provides a compelling notion of fairness for classification tasks, especially when individuals have heterogeneous preferences. |

113 | Finding Friend and Foe in Multi-Agent Games | Jack Serrino, Max Kleiman-Weiner, David C. Parkes, Josh Tenenbaum | Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on “The Resistance: Avalon”, the most popular hidden role game. |

114 | Image Synthesis with a Single (Robust) Classifier | Shibani Santurkar, Andrew Ilyas, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madry | We show that the basic classification framework alone can be used to tackle some of the most challenging tasks in image synthesis. |

115 | Model Compression with Adversarial Robustness: A Unified Optimization Framework | Shupeng Gui, Haotao N. Wang, Haichuan Yang, Chen Yu, Zhangyang Wang, Ji Liu | We propose a novel Adversarially Trained Model Compression (ATMC) framework. |

116 | Cross-channel Communication Networks | Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, Devi Parikh | We introduce a novel network unit called Cross-channel Communication (C3) block, a simple yet effective module to encourage the neuron communication within the same layer. |

117 | CondConv: Conditionally Parameterized Convolutions for Efficient Inference | Brandon Yang, Gabriel Bender, Quoc V. Le, Jiquan Ngiam | We propose conditionally parameterized convolutions (CondConv), which learn specialized convolutional kernels for each example. |

118 | Regression Planning Networks | Danfei Xu, Roberto Martín-Martín, De-An Huang, Yuke Zhu, Silvio Savarese, Li F. Fei-Fei | In this work, we combine the benefits of these two paradigms and propose a learning-to-plan method that can directly generate a long-term symbolic plan conditioned on high-dimensional observations. |

119 | Twin Auxilary Classifiers GAN | Mingming Gong, Yanwu Xu, Chunyuan Li, Kun Zhang, Kayhan Batmanghelich | In this paper, we identify the source of low diversity issue theoretically and propose a practical solution to the problem. |

120 | Conditional Structure Generation through Graph Variational Generative Adversarial Nets | Carl Yang, Peiye Zhuang, Wenhan Shi, Alan Luu, Pan Li | While existing graph generative models only consider graph structures without semantic contexts, we formulate the novel problem of conditional structure generation, and propose a novel unified model of graph variational generative adversarial nets (CondGen) to handle the intrinsic challenges of flexible context-structure conditioning and permutation-invariant generation. |

121 | Distributional Policy Optimization: An Alternative Approach for Continuous Control | Chen Tessler, Guy Tennenholtz, Shie Mannor | We identify a fundamental problem in policy gradient-based methods in continuous control. |

122 | Sampling Sketches for Concave Sublinear Functions of Frequencies | Edith Cohen, Ofir Geri | Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies. |

123 | Deliberative Explanations: visualizing network insecurities | Pei Wang, Nuno Vasconcelos | A new approach to explainable AI, denoted *deliberative explanations*, is proposed. |

124 | Computing Full Conformal Prediction Set with Approximate Homotopy | Eugene Ndiaye, Ichiro Takeuchi | We propose efficient algorithms to compute conformal prediction set using approximated solution of (convex) regularized empirical risk minimization. |

125 | Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift | Stephan Rabanser, Stephan Günnemann, Zachary Lipton | This paper explores the problem of building ML systems that fail loudly, investigating methods for detecting dataset shift, identifying exemplars that most typify the shift, and quantifying shift malignancy. |

126 | Hierarchical Reinforcement Learning with Advantage-Based Auxiliary Rewards | Siyuan Li, Rui Wang, Minxue Tang, Chongjie Zhang | In this paper, we aim to adapt low-level skills to downstream tasks while maintaining the generality of reward design. |

127 | Multi-View Reinforcement Learning | Minne Li, Lisheng Wu, Jun Wang, Haitham Bou Ammar | We define the MVRL framework by extending partially observable Markov decision processes (POMDPs) to support more than one observation model and propose two solution methods through observation augmentation and cross-view policy transfer. |

128 | Cascade RPN: Delving into High-Quality Region Proposal Network with Adaptive Convolution | Thang Vu, Hyunjun Jang, Trung X. Pham, Chang Yoo | This paper considers an architecture referred to as Cascade Region Proposal Network (Cascade RPN) for improving the region-proposal quality and detection performance by systematically addressing the limitation of the conventional RPN that heuristically defines the anchors and aligns the features to the anchors. |

129 | Neural Diffusion Distance for Image Segmentation | Jian Sun, Zongben Xu | In this work, we propose a spec-diff-net for computing diffusion distance on graph based on approximate spectral decomposition. |

130 | Fine-grained Optimization of Deep Neural Networks | Mete Ozay | In this work, we conjecture that if we can impose multiple constraints on weights of DNNs to upper bound the norms of the weights, and train the DNNs with these weights, then we can attain empirical generalization errors closer to the derived theoretical bounds, and improve accuracy of the DNNs. To this end, we pose two problems. |

131 | Extending Stein's unbiased risk estimator to train deep denoisers with correlated pairs of noisy images | Magauiya Zhussip, Shakarim Soltanayev, Se Young Chun | Here, we propose an extended SURE (eSURE) to train deep denoisers with correlated pairs of noise realizations per image, and apply it to the case with two uncorrelated realizations per image to achieve better performance than the SURE-based method and results comparable to Noise2Noise. |

132 | Fixing Implicit Derivatives: Trust-Region Based Learning of Continuous Energy Functions | Chris Russell, Matteo Toso, Neill Campbell | We present a new technique for the learning of continuous energy functions that we refer to as Wibergian Learning. |

133 | Hyperspherical Prototype Networks | Pascal Mettes, Elise van der Pol, Cees Snoek | This paper introduces hyperspherical prototype networks, which unify classification and regression with prototypes on hyperspherical output spaces. |

134 | Expressive power of tensor-network factorizations for probabilistic modeling | Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, Ignacio Cirac | Inspired by these developments, and the natural correspondence between tensor networks and probabilistic graphical models, we provide a rigorous analysis of the expressive power of various tensor-network factorizations of discrete multivariate probability distributions. |

135 | HyperGCN: A New Method For Training Graph Convolutional Networks on Hypergraphs | Naganand Yadati, Madhav Nimishakavi, Prateek Yadav, Vikram Nitin, Anand Louis, Partha Talukdar | Motivated by the fact that a graph convolutional network (GCN) has been effective for graph-based SSL, we propose HyperGCN, a novel GCN for SSL on attributed hypergraphs. |

136 | SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points | Zhize Li | We analyze stochastic gradient algorithms for optimizing nonconvex problems. |

137 | Efficient Meta Learning via Minibatch Proximal Update | Pan Zhou, Xiaotong Yuan, Huan Xu, Shuicheng Yan, Jiashi Feng | To remedy this deficiency, in this paper we propose a minibatch proximal update based meta-learning approach for learning efficient hypothesis transfer. |

138 | Unconstrained Monotonic Neural Networks | Antoine Wehenkel, Gilles Louppe | In this work, we propose the Unconstrained Monotonic Neural Network (UMNN) architecture based on the insight that a function is monotonic as long as its derivative is strictly positive. |

139 | Guided Similarity Separation for Image Retrieval | Chundi Liu, Guangwei Yu, Maksims Volkovs, Cheng Chang, Himanshu Rai, Junwei Ma, Satya Krishna Gorti | In this work we propose a different approach where we leverage graph convolutional networks to directly encode neighbor information into image descriptors. |

140 | Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss | Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, Tengyu Ma | Deep learning algorithms can fare poorly when the training dataset suffers from heavy class-imbalance but the testing criterion requires good generalization on less frequent classes. We design two novel methods to improve performance in such scenarios. |

141 | Strategizing against No-regret Learners | Yuan Deng, Jon Schneider, Balasubramanian Sivan | We study this question and show that under some mild assumptions, the player can always guarantee himself a utility of at least what he would get in a Stackelberg equilibrium. |

142 | D-VAE: A Variational Autoencoder for Directed Acyclic Graphs | Muhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, Yixin Chen | In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (D-VAE). |

143 | Hierarchical Optimal Transport for Document Representation | Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, Justin M. Solomon | As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. |

144 | Multivariate Sparse Coding of Nonstationary Covariances with Gaussian Processes | Rui Li | We propose a unified nonstationary modeling framework to jointly encode the observation correlations to generate a piece-wise representation with a hyper-level Gaussian process (GP) governing the overall contour of the pieces. |

145 | Positional Normalization | Boyi Li, Felix Wu, Kilian Q. Weinberger, Serge Belongie | In this paper, we propose a novel normalization method that deviates from this theme. |

146 | A New Defense Against Adversarial Images: Turning a Weakness into a Strength | Shengyuan Hu, Tao Yu, Chuan Guo, Wei-Lun Chao, Kilian Q. Weinberger | In this paper, we adopt a novel perspective and regard the omnipresence of adversarial perturbations as a strength rather than a weakness. |

147 | Quadratic Video Interpolation | Xiangyu Xu, Li Siyao, Wenxiu Sun, Qian Yin, Ming-Hsuan Yang | To address these issues, we propose a quadratic video interpolation method which exploits the acceleration information in videos. |

148 | ResNets Ensemble via the Feynman-Kac Formalism to Improve Natural and Robust Accuracies | Bao Wang, Zuoqiang Shi, Stanley Osher | Based on this unified viewpoint, we propose a simple yet effective ResNets ensemble algorithm to boost the accuracy of the robustly trained model on both clean and adversarial images. |

149 | Incremental Scene Synthesis | Benjamin Planche, Xuejian Rong, Ziyan Wu, Srikrishna Karanam, Harald Kosch, YingLi Tian, Jan Ernst, Andreas Hutter | We present a method to incrementally generate complete 2D or 3D scenes with the following properties: (a) it is globally consistent at each step according to a learned scene prior, (b) real observations of a scene can be incorporated while observing global consistency, (c) unobserved regions can be hallucinated locally in consistence with previous observations, hallucinations and global priors, and (d) hallucinations are statistical in nature, i.e., different scenes can be generated from the same observations. |

150 | Self-Supervised Generalisation with Meta Auxiliary Learning | Shikun Liu, Andrew Davison, Edward Johns | We propose a new method which automatically learns appropriate labels for an auxiliary task, such that any supervised learning task can be improved without requiring access to any further data. |

151 | Variational Denoising Network: Toward Blind Noise Modeling and Removal | Zongsheng Yue, Hongwei Yong, Qian Zhao, Deyu Meng, Lei Zhang | In this work we propose a new variational inference method, which integrates both noise estimation and image denoising into a unique Bayesian framework, for blind image denoising. |

152 | Fast Sparse Group Lasso | Yasutoshi Ida, Yasuhiro Fujiwara, Hisashi Kashima | This paper proposes a fast Block Coordinate Descent for Sparse Group Lasso. |

153 | Learnable Tree Filter for Structure-preserving Feature Transform | Lin Song, Yanwei Li, Zeming Li, Gang Yu, Hongbin Sun, Jian Sun, Nanning Zheng | In this paper, we propose the learnable tree filter to form a generic tree filtering module that leverages the structural property of minimal spanning tree to model long-range dependencies while preserving the details. |

154 | Data-Dependence of Plateau Phenomenon in Learning with Neural Network — Statistical Mechanical Analysis | Yuki Yoshida, Masato Okada | In this paper, using statistical mechanical formulation, we clarified the relationship between the plateau phenomenon and the statistical property of the data learned. |

155 | Coordinated hippocampal-entorhinal replay as structural inference | Talfan Evans, Neil Burgess | We propose that this offline inference corresponds to coordinated hippocampal-entorhinal replay during sharp wave ripples. |

156 | Cascaded Dilated Dense Network with Two-step Data Consistency for MRI Reconstruction | Hao Zheng, Faming Fang, Guixu Zhang | Inspired by recent deep learning methods, we propose a Cascaded Dilated Dense Network (CDDN) for MRI reconstruction. |

157 | On the Ineffectiveness of Variance Reduced Optimization for Deep Learning | Aaron Defazio, Leon Bottou | We show that naive application of the SVRG technique and related approaches fail, and explore why. |

158 | On the Curved Geometry of Accelerated Optimization | Aaron Defazio | In this work we propose a differential geometric motivation for Nesterov’s accelerated gradient method (AGM) for strongly-convex problems. |

159 | Multi-marginal Wasserstein GAN | Jiezhang Cao, Langyuan Mo, Yifan Zhang, Kui Jia, Chunhua Shen, Mingkui Tan | In this paper, we propose a novel Multi-marginal Wasserstein GAN (MWGAN) to minimize Wasserstein distance among domains. |

160 | Better Exploration with Optimistic Actor Critic | Kamil Ciosek, Quan Vuong, Robert Loftin, Katja Hofmann | To address both of these phenomena, we introduce a new algorithm, Optimistic Actor Critic, which approximates a lower and upper confidence bound on the state-action value function. |

161 | Importance Resampling for Off-policy Prediction | Matthew Schlegel, Wesley Chung, Daniel Graves, Jian Qian, Martha White | In this work, we explore a resampling strategy as an alternative to reweighting. |

162 | The Label Complexity of Active Learning from Observational Data | Songbai Yan, Kamalika Chaudhuri, Tara Javidi | We provably demonstrate that the result of this is an algorithm which is statistically consistent as well as more label-efficient than prior work. |

163 | Meta-Learning Representations for Continual Learning | Khurram Javed, Martha White | In this paper, we propose OML, an objective that directly minimizes catastrophic interference by learning representations that accelerate future learning and are robust to forgetting under online updates in continual learning. |

164 | Defense Against Adversarial Attacks Using Feature Scattering-based Adversarial Training | Haichao Zhang, Jianyu Wang | We introduce a feature scattering-based adversarial training approach for improving model robustness against adversarial attacks. |

165 | Visualizing the PHATE of Neural Networks | Scott Gigante, Adam S. Charles, Smita Krishnaswamy, Gal Mishne | To this end, we introduce a novel visualization algorithm that reveals the internal geometry of such networks: Multislice PHATE (M-PHATE), the first method designed explicitly to visualize how a neural network’s hidden representations of data evolve throughout the course of training. |

166 | The Cells Out of Sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers | Alex Lu, Amy Lu, Wiebke Schormann, Marzyeh Ghassemi, David Andrews, Alan Moses | We created a public dataset of 132,209 images of mouse cells, COOS-7 (Cells Out Of Sample 7-Class). |

167 | Nonconvex Low-Rank Tensor Completion from Noisy Data | Changxiao Cai, Gen Li, H. Vincent Poor, Yuxin Chen | Focusing on “incoherent” and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm — (vanilla) gradient descent following a rough initialization — that achieves the best of both worlds. |

168 | Beyond Online Balanced Descent: An Optimal Algorithm for Smoothed Online Optimization | Gautam Goel, Yiheng Lin, Haoyuan Sun, Adam Wierman | No existing algorithms have competitive ratios matching this bound, and we show that the state-of-the-art algorithm, Online Balanced Descent (OBD), has a competitive ratio that is $\Omega(m^{-2/3})$. |

169 | Channel Gating Neural Networks | Weizhe Hua, Yuan Zhou, Christopher M. De Sa, Zhiru Zhang, G. Edward Suh | This paper introduces channel gating, a dynamic, fine-grained, and hardware-efficient pruning scheme to reduce the computation cost for convolutional neural networks (CNNs). |

170 | Neural networks grown and self-organized by noise | Guruprasad Raghavan, Matt Thomson | In this paper, we propose a biologically inspired developmental algorithm that can ‘grow’ a functional, layered neural network from a single initial cell. |

171 | Catastrophic Forgetting Meets Negative Transfer: Batch Spectral Shrinkage for Safe Transfer Learning | Xinyang Chen, Sinan Wang, Bo Fu, Mingsheng Long, Jianmin Wang | In this paper, we launch an in-depth empirical investigation into negative transfer in fine-tuning and find that, for the weight parameters and feature representations, transferability of their spectral components is diverse. |

172 | Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting | Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, Deyu Meng | To address this issue, we propose a method capable of adaptively learning an explicit weighting function directly from data. |

173 | Variational Structured Semantic Inference for Diverse Image Captioning | Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Baochang Zhang, Xuri Ge, Yongjian Wu, Feiyue Huang, Yan Wang | To model these two inherent diversities in image captioning, we propose a Variational Structured Semantic Inferring model (termed VSSI-cap) executed in a novel structured encoder-inferer-decoder schema. |

174 | Mapping State Space using Landmarks for Universal Goal Reaching | Zhiao Huang, Hao Su, Fangchen Liu | We propose a method to address this issue in large MDPs with sparse rewards, in which exploration and routing across remote states are both extremely challenging. |

175 | Transferable Normalization: Towards Improving Transferability of Deep Neural Networks | Ximei Wang, Ying Jin, Mingsheng Long, Jianmin Wang, Michael I. Jordan | In this paper, we delve into the components of DNN architectures and propose Transferable Normalization (TransNorm) in place of existing normalization techniques. |

176 | Random deep neural networks are biased towards simple functions | Giacomo De Palma, Bobak Kiani, Seth Lloyd | We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. |

177 | XNAS: Neural Architecture Search with Expert Advice | Niv Nayman, Asaf Noy, Tal Ridnik, Itamar Friedman, Rong Jin, Lihi Zelnik | This paper introduces a novel optimization method for differential neural architecture search, based on the theory of prediction with expert advice. |

178 | CNN^{2}: Viewpoint Generalization via a Binocular Vision | Wei-Da Chen, Shan-Hung (Brandon) Wu | Observing that humans use binocular vision to understand the world, we study in this paper whether the 3D viewpoint generalizability of CNNs can be achieved via a binocular vision. |

179 | Generalized Off-Policy Actor-Critic | Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson | We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. |

180 | DAC: The Double Actor-Critic Architecture for Learning Options | Shangtong Zhang, Shimon Whiteson | We apply an actor-critic algorithm on each augmented MDP, yielding the Double Actor-Critic (DAC) architecture. |

181 | Numerically Accurate Hyperbolic Embeddings Using Tiling-Based Models | Tao Yu, Christopher M. De Sa | To address this, we propose a new model which uses an integer-based tiling to represent *any* point in hyperbolic space with provably bounded numerical error. |

182 | Controlling Neural Level Sets | Matan Atzmon, Niv Haim, Lior Yariv, Ofer Israelov, Haggai Maron, Yaron Lipman | In this paper we present a simple and scalable approach to directly control level sets of a deep neural network. |

183 | Blended Matching Pursuit | Cyrille Combettes, Sebastian Pokutta | We present a blended matching pursuit algorithm, combining coordinate descent-like steps with stronger gradient descent steps, for minimizing a smooth convex function over a linear space spanned by a set of atoms. |

184 | An Improved Analysis of Training Over-parameterized Deep Neural Networks | Difan Zou, Quanquan Gu | In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. |

185 | Controllable Text-to-Image Generation | Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, Philip Torr | In this paper, we propose a novel controllable text-to-image generative adversarial network (ControlGAN), which can effectively synthesise high-quality images and also control parts of the image generation according to natural language descriptions. |

186 | Improving Textual Network Learning with Variational Homophilic Embeddings | Wenlin Wang, Chenyang Tao, Zhe Gan, Guoyin Wang, Liqun Chen, Xinyuan Zhang, Ruiyi Zhang, Qian Yang, Ricardo Henao, Lawrence Carin | Different from most existing methods that optimize a discriminative objective, we introduce Variational Homophilic Embedding (VHE), a fully generative model that learns network embeddings by modeling the semantic (textual) information with a variational autoencoder, while accounting for the structural (topology) information through a novel homophilic prior design. |

187 | Rethinking Generative Mode Coverage: A Pointwise Guaranteed Approach | Peilin Zhong, Yuchen Mo, Chang Xiao, Pengyu Chen, Changxi Zheng | Rethinking this problem from a game-theoretic perspective, we show that a complete mode coverage is firmly attainable. |

188 | The Randomized Midpoint Method for Log-Concave Sampling | Ruoqi Shen, Yin Tat Lee | To solve the sampling problem, we propose a new framework to discretize stochastic differential equations. |

189 | Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update | Su Young Lee, Choi Sungik, Sae-Young Chung | We propose Episodic Backward Update (EBU) – a novel deep reinforcement learning algorithm with a direct value propagation. |

190 | Fully Neural Network based Model for General Temporal Point Processes | Takahiro Omi, Naonori Ueda, Kazuyuki Aihara | We herein propose a novel RNN based model in which the time course of the intensity function is represented in a general manner. |

191 | Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks | Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, Ping Wang | In this work, we propose a global filter pruning algorithm called Gate Decorator, which transforms a vanilla CNN module by multiplying its output by the channel-wise scaling factors (i.e. gate). |

192 | Discrimination in Online Markets: Effects of Social Bias on Learning from Reviews and Policy Design | Faidra Georgia Monachou, Itai Ashlagi | We study this problem using a two-sided large market model with employers and workers mediated by a platform. |

193 | Provably Powerful Graph Networks | Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, Yaron Lipman | Differently put, we suggest a simple model that interleaves applications of standard Multilayer-Perceptron (MLP) applied to the feature dimension and matrix multiplication. |

194 | Order Optimal One-Shot Distributed Learning | Arsalan Sharifnassab, Saber Salehkaleybar, S. Jamaloddin Golestani | We propose an algorithm called Multi-Resolution Estimator (MRE) whose expected error is no larger than $\tilde{O}( m^{-1/\max(d,2)} n^{-1/2})$, where $d$ is the dimension of the parameter space. |

195 | Information Competing Process for Learning Diversified Representations | Jie Hu, Rongrong Ji, ShengChuan Zhang, Xiaoshuai Sun, Qixiang Ye, Chia-Wen Lin, Qi Tian | Towards learning diversified representations, a new approach, termed Information Competing Process (ICP), is proposed in this paper. |

196 | GENO — GENeric Optimization for Classical Machine Learning | Soeren Laue, Matthias Mitterreiter, Joachim Giesen | We show on a wide variety of classical but also some recently suggested problems that the automatically generated solvers are (1) as efficient as well engineered, specialized solvers, (2) more efficient by a decent margin than recent state-of-the-art solvers, and (3) orders of magnitude more efficient than classical modeling language plus solver approaches. |

197 | Conditional Independence Testing using Generative Adversarial Networks | Alexis Bellot, Mihaela van der Schaar | Our contribution is a new test statistic based on samples from a generative adversarial network designed to approximate directly a conditional distribution that encodes the null hypothesis, in a manner that maximizes power (the rate of true negatives). |

198 | Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function | Aviv Rosenberg, Yishay Mansour | We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. |

199 | Partitioning Structure Learning for Segmented Linear Regression Trees | Xiangyu Zheng, Song Xi Chen | This paper proposes a partitioning structure learning method for segmented linear regression trees (SLRT), which assigns linear predictors over the terminal nodes. |

200 | A Tensorized Transformer for Language Modeling | Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Ming Zhou, Dawei Song | In this paper, based on the ideas of tensor decomposition and parameters sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). |

201 | Kernel Stein Tests for Multiple Model Comparison | Jen Ning Lim, Makoto Yamada, Bernhard Schölkopf, Wittawat Jitkrittum | We address the problem of non-parametric multiple model comparison: given $l$ candidate models, decide whether each candidate is as good as the best one(s) or worse than it. |

202 | Disentangled behavioural representations | Amir Dezfouli, Hassan Ashtiani, Omar Ghattas, Richard Nock, Peter Dayan, Cheng Soon Ong | To achieve this, we propose a novel end-to-end learning framework in which an encoder is trained to map the behavior of subjects into a low-dimensional latent space. |

203 | More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation | Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne, Marco Pistoia, David Cox | To address this problem, we present a lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures by using only a fraction of resources. |

204 | Rethinking the CSC Model for Natural Images | Dror Simon, Michael Elad | In this work we provide new insights regarding the CSC model and its capability to represent natural images, and suggest a Bayesian connection between this model and its patch-based ancestor. |

205 | Integrating Bayesian and Discriminative Sparse Kernel Machines for Multi-class Active Learning | Weishi Shi, Qi Yu | We propose a novel active learning (AL) model that integrates Bayesian and discriminative kernel machines for fast and accurate multi-class data sampling. |

206 | Learning to Control Self-Assembling Morphologies: A Study of Generalization via Modularity | Deepak Pathak, Christopher Lu, Trevor Darrell, Phillip Isola, Alexei A. Efros | In contrast, this paper investigates a modular co-evolution strategy: a collection of primitive agents learns to dynamically self-assemble into composite bodies while also learning to coordinate their behavior to control these bodies. |

207 | Perceiving the arrow of time in autoregressive motion | Kristof Meding, Dominik Janzing, Bernhard Schölkopf, Felix A. Wichmann | We employ a so-called frozen noise paradigm enabling us to compare human performance with four different algorithms on a trial-by-trial basis: a causal inference algorithm exploiting the dependence structure of additive noise terms, a neurally inspired network, a Bayesian ideal observer model as well as a simple heuristic. |

208 | DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections | Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li | In this work, we propose an algorithm, DualDICE, for estimating these quantities. |

209 | Hyper-Graph-Network Decoders for Block Codes | Eliya Nachmani, Lior Wolf | In this work, we extend these results to much larger families of algebraic block codes, by performing message passing with graph neural networks. |

210 | Large Scale Markov Decision Processes with Changing Rewards | Adrian Rivera Cardoso, He Wang, Huan Xu | By approximating the state-action occupancy measures with a linear architecture of dimension $d \ll \lvert S \rvert$, we propose a modified algorithm with a computational complexity polynomial in $d$ and independent of $\lvert S \rvert$. |

211 | Multiview Aggregation for Learning Category-Specific Shape Reconstruction | Srinath Sridhar, Davis Rempe, Julien Valentin, Sofien Bouaziz, Leonidas J. Guibas | We present a method that can estimate dense 3D shape, and aggregate shape across multiple and varying number of input views. |

212 | Semi-Parametric Dynamic Contextual Pricing | Virag Shah, Ramesh Johari, Jose Blanchet | Motivated by the application of real-time pricing in e-commerce platforms, we consider the problem of revenue-maximization in a setting where the seller can leverage contextual information describing the customer’s history and the product’s type to predict her valuation of the product. |

213 | Interlaced Greedy Algorithm for Maximization of Submodular Functions in Nearly Linear Time | Alan Kuhnle | A deterministic approximation algorithm is presented for the maximization of non-monotone submodular functions over a ground set of size $n$ subject to cardinality constraint $k$; the algorithm is based upon the idea of interlacing two greedy procedures. |

214 | Initialization of ReLUs for Dynamical Isometry | Rebekka Burkholz, Alina Dubatovka | We derive the joint signal output distribution exactly, without mean field assumptions, for fully-connected networks with Gaussian weights and biases, and analyze deviations from the mean field results. |

215 | Gradient Information for Representation and Modeling | Jie Ding, Robert Calderbank, Vahid Tarokh | Motivated by Fisher divergence, in this paper we present a new set of information quantities which we refer to as gradient information. |

216 | SpiderBoost and Momentum: Faster Variance Reduction Algorithms | Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, Vahid Tarokh | In this paper, we propose SpiderBoost as an improved scheme, which allows the use of a much larger constant-level stepsize while maintaining the same near-optimal oracle complexity, and can be extended with proximal mapping to handle composite optimization (which is nonsmooth and nonconvex) with provable convergence guarantee. |

217 | Minimax Optimal Estimation of Approximate Differential Privacy on Neighboring Databases | Xiyang Liu, Sewoong Oh | We pose it as a property estimation problem, and study the fundamental trade-offs involved in the accuracy in estimated privacy guarantees and the number of samples required. |

218 | Backprop with Approximate Activations for Memory-efficient Network Training | Ayan Chakrabarti, Benjamin Moseley | In this paper, we propose a new implementation for back-propagation that significantly reduces memory usage, by enabling the use of approximations with negligible computational cost and minimal effect on training performance. |

219 | Training Image Estimators without Image Ground Truth | Zhihao Xia, Ayan Chakrabarti | In this paper, we introduce an unsupervised framework for training image estimation networks, from a training set that contains only measurements—with two varied measurements per image—but no ground-truth for the full images desired as output. |

220 | Deep Structured Prediction for Facial Landmark Detection | Lisha Chen, Hui Su, Qiang Ji | This paper proposes a method for deep structured facial landmark detection based on combining a deep Convolutional Network with a Conditional Random Field. |

221 | Information-Theoretic Confidence Bounds for Reinforcement Learning | Xiuyuan Lu, Benjamin Van Roy | We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. |

222 | Transfer Anomaly Detection by Inferring Latent Domain Representations | Atsutoshi Kumagai, Tomoharu Iwata, Yasuhiro Fujiwara | We propose a method to improve the anomaly detection performance on target domains by transferring knowledge on related domains. |

223 | Total Least Squares Regression in Input Sparsity Time | Huaian Diao, Zhao Song, David Woodruff, Xin Yang | We give an algorithm for finding a solution $X$ to the linear system $\hat{A}X=\hat{B}$ for which the cost $\lVert A-\hat{A}\rVert_F^2 + \lVert B-\hat{B}\rVert_F^2$ is at most a multiplicative $(1+\epsilon)$ factor times the optimal cost, up to an additive error $\eta$ that may be an arbitrarily small function of $n$. |

224 | Park: An Open Platform for Learning-Augmented Computer Systems | Hongzi Mao, Parimarjan Negi, Akshay Narayan, Hanrui Wang, Jiacheng Yang, Haonan Wang, Ryan Marcus, Ravichandra Addanki, Mehrdad Khani Shirkoohi, Songtao He, Vikram Nathan, Frank Cangialosi, Shaileshh Venkatakrishnan, Wei-Hung Weng, Song Han, Tim Kraska, Mohammad Alizadeh | We present Park, a platform for researchers to experiment with Reinforcement Learning (RL) for computer systems. |

225 | Adapting Neural Networks for the Estimation of Treatment Effects | Claudia Shi, David Blei, Victor Veitch | We propose two adaptations based on insights from the statistical literature on the estimation of treatment effects. |

226 | Learning Transferable Graph Exploration | Hanjun Dai, Yujia Li, Chenglong Wang, Rishabh Singh, Po-Sen Huang, Pushmeet Kohli | We propose a “learning to explore” framework where we learn a policy from a distribution of environments. |

227 | Conformal Prediction Under Covariate Shift | Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candes, Aaditya Ramdas | We extend conformal prediction methodology beyond the case of exchangeable data. |

228 | Optimal Analysis of Subset-Selection Based L_p Low-Rank Approximation | Chen Dan, Hong Wang, Hongyang Zhang, Yuchen Zhou, Pradeep K. Ravikumar | We show that for the problem of $\ell_p$ rank-$k$ approximation of any given matrix over $R^{n\times m}$ and $C^{n\times m}$, the algorithm of column subset selection enjoys approximation ratio $(k+1)^{1/p}$ for $1\le p\le 2$ and $(k+1)^{1-1/p}$ for $p\ge 2$. |

229 | Asymmetric Valleys: Beyond Sharp and Flat Local Minima | Haowei He, Gao Huang, Yang Yuan | In this paper, we observe that local minima of modern deep networks are more than being flat or sharp. |

230 | Positive-Unlabeled Compression on the Cloud | Yixing Xu, Yunhe Wang, Jia Zeng, Kai Han, Chunjing Xu, Dacheng Tao, Chang Xu | In this paper, we present a novel positive-unlabeled (PU) setting for addressing this problem. |

231 | Direct Estimation of Differential Functional Graphical Models | Boxin Zhao, Y. Samuel Wang, Mladen Kolar | We consider the problem of estimating the difference between two functional undirected graphical models with shared structures. |

232 | On the Calibration of Multiclass Classification with Rejection | Chenri Ni, Nontawat Charoenphakdee, Junya Honda, Masashi Sugiyama | We propose rejection criteria for more general losses for this approach and guarantee calibration to the Bayes-optimal solution. |

233 | Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller | Pratyusha Sharma, Deepak Pathak, Abhinav Gupta | Our central insight is to enforce this structure explicitly during learning by decoupling what to achieve (intended task) from how to perform it (controller). |

234 | Stagewise Training Accelerates Convergence of Testing Error Over SGD | Zhuoning Yuan, Yan Yan, Rong Jin, Tianbao Yang | This paper provides some theoretical evidence for explaining this faster convergence. |

235 | Learning Robust Options by Conditional Value at Risk Optimization | Takuya Hiraoka, Takahisa Imagawa, Tatsuya Mori, Takashi Onishi, Yoshimasa Tsuruoka | In this paper, we propose a conditional value at risk (CVaR)-based method to learn options that work well in both the average and worst cases. |

236 | Non-asymptotic Analysis of Stochastic Methods for Non-Smooth Non-Convex Regularized Problems | Yi Xu, Rong Jin, Tianbao Yang | Our contributions are two-fold: (i) we show that they enjoy the same complexities as their counterparts for solving convex regularized non-convex problems in terms of finding an approximate stationary point; (ii) we develop more practical variants using dynamic mini-batch size instead of a fixed mini-batch size without requiring the target accuracy level of solution. |

237 | On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective | Lili Su, Pengkun Yang | We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) using gradient descent (GD) method. |

238 | Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries | Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez | We propose Drill-down, an effective framework for encoding multiple queries with an efficient compact state representation that significantly extends current methods for single-round image retrieval. |

239 | Visual Sequence Learning in Hierarchical Prediction Networks and Primate Visual Cortex | Jielin Qiu, Ge Huang, Tai Sing Lee | In this paper we developed a computational hierarchical network model to understand the spatiotemporal sequence learning effects observed in the primate visual cortex. |

240 | Dual Variational Generation for Low Shot Heterogeneous Face Recognition | Chaoyou Fu, Xiang Wu, Yibo Hu, Huaibo Huang, Ran He | This paper considers HFR as a dual generation problem, and proposes a novel Dual Variational Generation (DVG) framework. |

241 | Discovering Neural Wirings | Mitchell Wortsman, Ali Farhadi, Mohammad Rastegari | In this work we propose a method for discovering neural wirings. |

242 | On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems | Baekjin Kim, Ambuj Tewari | We investigate the optimality of perturbation based algorithms in the stochastic and adversarial multi-armed bandit problems. |

243 | Knowledge Extraction with No Observable Data | Jaemin Yoo, Minyong Cho, Taebum Kim, U Kang | In this work, we propose KegNet (Knowledge Extraction with Generative Networks), a novel approach to extract the knowledge of a trained deep neural network and to generate artificial data points that replace the missing training data in knowledge distillation. |

244 | PAC-Bayes under potentially heavy tails | Matthew Holland | We derive PAC-Bayesian learning guarantees for heavy-tailed losses, and obtain a novel optimal Gibbs posterior which enjoys finite-sample excess risk bounds at logarithmic confidence. |

245 | One-Shot Object Detection with Co-Attention and Co-Excitation | Ting-I Hsieh, Yi-Chen Lo, Hwann-Tzong Chen, Tyng-Luh Liu | This paper aims to tackle the challenging problem of one-shot object detection. |

246 | Quaternion Knowledge Graph Embeddings | Shuai Zhang, Yi Tay, Lina Yao, Qi Liu | In this work, we move beyond the traditional complex-valued representations, introducing more expressive hypercomplex representations to model entities and relations for knowledge graph embeddings. |

247 | Glyce: Glyph-vectors for Chinese Character Representations | Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Yuxian Meng, Jiwei Li | In this paper, we address this gap by presenting Glyce, the glyph-vectors for Chinese character representations. |

248 | Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels | Yihan Jiang, Hyeji Kim, Himanshu Asnani, Sreeram Kannan, Sewoong Oh, Pramod Viswanath | In this work, we make significant progress on this problem by designing a fully end-to-end jointly trained neural encoder and decoder, namely, Turbo Autoencoder (TurboAE), with the following contributions: (a) under moderate block lengths, TurboAE approaches state-of-the-art performance under canonical channels; (b) moreover, TurboAE outperforms the state-of-the-art codes under non-canonical settings in terms of reliability. |

249 | Heterogeneous Graph Learning for Visual Commonsense Reasoning | Weijiang Yu, Jingwen Zhou, Weihao Yu, Xiaodan Liang, Nong Xiao | In this paper, we propose a new Heterogeneous Graph Learning (HGL) framework for seamlessly integrating the intra-graph and inter-graph reasoning in order to bridge the vision and language domain. |

250 | Probabilistic Watershed: Sampling all spanning forests for seeded segmentation and semi-supervised learning | Enrique Fita Sanmartin, Sebastian Damrich, Fred A. Hamprecht | We propose instead to consider all possible spanning forests and calculate, for every node, the probability of sampling a forest connecting a certain seed with that node. |

251 | Classification-by-Components: Probabilistic Modeling of Reasoning over a Set of Components | Sascha Saralajew, Lars Holdijk, Maike Rees, Ebubekir Asan, Thomas Villmann | In this work, a network architecture, denoted as Classification-By-Components network (CBC), is proposed. |

252 | Identifying Causal Effects via Context-specific Independence Relations | Santtu Tikka, Antti Hyttinen, Juha Karvanen | Motivated by this, we design a calculus and an automated search procedure for identifying causal effects in the presence of CSIs. |

253 | Bridging Machine Learning and Logical Reasoning by Abductive Learning | Wang-Zhou Dai, Qiuling Xu, Yang Yu, Zhi-Hua Zhou | In this paper, we present the abductive learning targeted at unifying the two AI paradigms in a mutually beneficial way, where the machine learning model learns to perceive primitive logic facts from data, while logical reasoning can exploit symbolic domain knowledge and correct the wrongly perceived facts for improving the machine learning models. |

254 | Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function | Zihan Zhang, Xiangyang Ji | We present an algorithm based on the *Optimism in the Face of Uncertainty* (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. |

255 | On the Global Convergence of (Fast) Incremental Expectation Maximization Methods | Belhal Karimi, Hoi-To Wai, Eric Moulines, Marc Lavielle | In this paper, we analyze incremental and stochastic versions of the EM algorithm as well as the variance-reduced version of [Chen et al., 2018] in a common unifying framework. |

256 | A Linearly Convergent Proximal Gradient Algorithm for Decentralized Optimization | Sulaiman Alghunaim, Kun Yuan, Ali H. Sayed | This work studies decentralized composite optimization problems with non-smooth regularization terms. |

257 | Regularizing Trajectory Optimization with Denoising Autoencoders | Rinu Boney, Norman Di Palo, Mathias Berglund, Alexander Ilin, Juho Kannala, Antti Rasmus, Harri Valpola | We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. |

258 | Learning Hierarchical Priors in VAEs | Alexej Klushyn, Nutan Chen, Richard Kurle, Botond Cseke, Patrick van der Smagt | We introduce a graph-based interpolation method, which shows that the topology of the learned latent representation corresponds to the topology of the data manifold—and present several examples, where desired properties of latent representation such as smoothness and simple explanatory factors are learned by the prior. |

259 | Epsilon-Best-Arm Identification in Pay-Per-Reward Multi-Armed Bandits | Sivan Sabato | We provide an algorithm for this setting, that with a high probability returns an epsilon-best arm, while incurring a cost that depends only linearly on the total expected reward of all arms, and does not depend at all on the number of arms. |

260 | Safe Exploration for Interactive Machine Learning | Matteo Turchetta, Felix Berkenkamp, Andreas Krause | In this paper, we introduce a novel framework that renders any existing unsafe IML algorithm safe. |

261 | Addressing Failure Detection by Learning Model Confidence | Charles Corbière, Nicolas Thome, Avner Bar-Hen, Matthieu Cord, Patrick Pérez | In this paper, we propose a new target criterion for model confidence, corresponding to the True Class Probability (TCP). |

262 | Combinatorial Bayesian Optimization using the Graph Cartesian Product | Changyong Oh, Jakub Tomczak, Efstratios Gavves, Max Welling | We introduce COMBO, a new Gaussian Process (GP) BO. |

263 | Fooling Neural Network Interpretations via Adversarial Model Manipulation | Juyeon Heo, Sunghwan Joo, Taesup Moon | We propose two types of fooling, Passive and Active, and demonstrate such foolings generalize well to the entire validation set as well as transfer to other interpretation methods. |

264 | On Lazy Training in Differentiable Programming | Lénaïc Chizat, Edouard Oyallon, Francis Bach | In this work, we show that this “lazy training” phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. |

265 | Quality Aware Generative Adversarial Networks | Kancharla Parimala, Sumohana Channappayya | In this work, we show how a distance metric that is a variant of the Structural SIMilarity (SSIM) index (a popular full-reference image quality assessment algorithm), and a novel quality aware discriminator gradient penalty function that is inspired by the Natural Image Quality Evaluator (NIQE, a popular no-reference image quality assessment algorithm) can each be used as excellent regularizers for GAN objective functions. |

266 | Copula-like Variational Inference | Marcel Hirt, Petros Dellaportas, Alain Durmus | This paper considers a new family of variational distributions motivated by Sklar’s theorem. |

267 | Implicit Regularization for Optimal Sparse Recovery | Tomas Vaskevicius, Varun Kanade, Patrick Rebeschini | We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an underdetermined system of linear measurements under the restricted isometry assumption. |

268 | Locally Private Gaussian Estimation | Matthew Joseph, Janardhan Kulkarni, Jieming Mao, Steven Z. Wu | We study a basic private estimation problem: each of $n$ users draws a single i.i.d. sample from an unknown Gaussian distribution $N(\mu,\sigma^2)$, and the goal is to estimate $\mu$ while guaranteeing local differential privacy for each user. |

269 | Multi-mapping Image-to-Image Translation via Learning Disentanglement | Xiaoming Yu, Yuanqi Chen, Shan Liu, Thomas Li, Ge Li | To address this issue, we propose a novel unified model, which bridges these two objectives. |

270 | Spatially Aggregated Gaussian Processes with Multivariate Areal Outputs | Yusuke Tanaka, Toshiyuki Tanaka, Tomoharu Iwata, Takeshi Kurashima, Maya Okawa, Yasunori Akagi, Hiroyuki Toda | We propose a probabilistic model for inferring the multivariate function from multiple areal data sets with various granularities. |

271 | Fast Structured Decoding for Sequence Models | Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, Zhihong Deng | Specifically, we design an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models, and further propose a dynamic transition technique to model positional contexts in the CRF. |

272 | Learning Temporal Pose Estimation from Sparsely-Labeled Videos | Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani | To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. |

273 | Putting An End to End-to-End: Gradient-Isolated Learning of Representations | Sindy Löwe, Peter O’Connor, Bastiaan Veeling | We propose a novel deep learning method for local self-supervised representation learning that does not require labels nor end-to-end backpropagation but exploits the natural order in data instead. |

274 | Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching | Hongteng Xu, Dixin Luo, Lawrence Carin | We propose a scalable Gromov-Wasserstein learning (S-GWL) method and establish a novel and theoretically-supported paradigm for large-scale graph analysis. |

275 | Meta-Reinforced Synthetic Data for One-Shot Fine-Grained Visual Recognition | Satoshi Tsutsui, Yanwei Fu, David Crandall | To this end, this paper proposes a meta-learning framework to reinforce the generated images by original images so that these images can facilitate one-shot learning. |

276 | Real-Time Reinforcement Learning | Simon Ramstedt, Chris Pal | In this paper, we introduce a new framework, in which states and actions evolve simultaneously and show how it is related to the classical MDP formulation. |

277 | Robust Multi-agent Counterfactual Prediction | Alexander Peysakhovich, Christian Kroer, Adam Lerer | We propose a method for analyzing the sensitivity of counterfactual conclusions to violations of these assumptions, which we call robust multi-agent counterfactual prediction (RMAC). |

278 | Approximate Inference Turns Deep Networks into Gaussian Processes | Mohammad Emtiyaz E. Khan, Alexander Immer, Ehsan Abedi, Maciej Korzepa | In this paper, we show that certain Gaussian posterior approximations for Bayesian DNNs are equivalent to GP posteriors. |

279 | Deep Signature Transforms | Patrick Kidger, Patric Bonnier, Imanol Perez Arribas, Cristopher Salvi, Terry Lyons | We propose a novel approach which combines the advantages of the signature transform with modern deep learning frameworks. |

280 | Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits | Yogev Bar-On, Yishay Mansour | We present algorithms both for the case that the communication graph is known to all the agents, and for the case that the graph is unknown. |

281 | Convergent Policy Optimization for Safe Reinforcement Learning | Ming Yu, Zhuoran Yang, Mladen Kolar, Zhaoran Wang | We study the safe reinforcement learning problem with nonlinear function approximation, where policy optimization is formulated as a constrained optimization problem with both the objective and the constraint being nonconvex functions. |

282 | Augmented Neural ODEs | Emilien Dupont, Arnaud Doucet, Yee Whye Teh | To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs. |

283 | Thompson Sampling for Multinomial Logit Contextual Bandits | Min-hwan Oh, Garud Iyengar | The distinguishing feature in this work is that this feedback has a multinomial logistic distribution. |

284 | Backpropagation-Friendly Eigendecomposition | Wei Wang, Zheng Dang, Yinlin Hu, Pascal Fua, Mathieu Salzmann | In this paper, we introduce a numerically stable and differentiable approach to leveraging eigenvectors in deep networks. |

285 | FastSpeech: Fast, Robust and Controllable Text to Speech | Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu | In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. |

286 | Ultrametric Fitting by Gradient Descent | Giovanni Chierchia, Benjamin Perret | We aim to overcome this limitation by presenting a general optimization framework for ultrametric fitting. |

287 | Distinguishing Distributions When Samples Are Strategically Transformed | Hanrui Zhang, Yu Cheng, Vincent Conitzer | In this paper, we give necessary and sufficient conditions for when the principal can distinguish between agents of “good” and “bad” types, when the type affects the distribution of samples that the agent has access to. |

288 | Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks | Gauthier Gidel, Francis Bach, Simon Lacoste-Julien | Using a time rescaling, we show that, with a vanishing initialization and a small enough step size, this dynamics sequentially learns the solutions of a reduced-rank regression with a gradually increasing rank. |

289 | Deep Set Prediction Networks | Yan Zhang, Jonathon Hare, Adam Prugel-Bennett | We propose a general model for predicting sets that properly respects the structure of sets and avoids this problem. |

290 | DppNet: Approximating Determinantal Point Processes with Deep Networks | Zelda E. Mariet, Yaniv Ovadia, Jasper Snoek | We approach this problem by introducing DppNets: generative deep models that produce DPP-like samples for arbitrary ground sets. |

291 | Efficient Communication in Multi-Agent Reinforcement Learning via Variance Based Control | Sai Qian Zhang, Qi Zhang, Jieyu Lin | In this work, we propose Variance Based Control (VBC), a simple yet efficient technique to improve communication efficiency in MARL. |

292 | Neural Lyapunov Control | Ya-Chien Chang, Nima Roohi, Sicun Gao | We propose new methods for learning control policies and neural network Lyapunov functions for nonlinear control problems, with provable guarantee of stability. |

293 | Fully Dynamic Consistent Facility Location | Vincent Cohen-Addad, Niklas Oskar D. Hjuler, Nikos Parotsidis, David Saulpic, Chris Schwiegelshohn | In this paper, we focus on general metric spaces and mainly on the facility location problem. |

294 | SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems | Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel Bowman | In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. |

295 | A Flexible Generative Framework for Graph-based Semi-supervised Learning | Jiaqi Ma, Weijing Tang, Ji Zhu, Qiaozhu Mei | In this work, we propose a flexible generative framework for graph-based semi-supervised learning, which approaches the joint distribution of the node features, labels, and the graph structure. |

296 | Inherent Weight Normalization in Stochastic Neural Networks | Georgios Detorakis, Sourav Dutta, Abhishek Khanna, Matthew Jerry, Suman Datta, Emre Neftci | Here, we further demonstrate that always-on multiplicative stochasticity combined with simple threshold neurons provide a sufficient substrate for deep learning machines. |

297 | Optimal Decision Tree with Noisy Outcomes | Su Jia, Viswanath Nagarajan, Fatemeh Navidi, R Ravi | We design new approximation algorithms for both the non-adaptive setting, where the test sequence must be fixed a priori, and the adaptive setting where the test sequence depends on the outcomes of prior tests. |

298 | Meta-Curvature | Eunbyung Park, Junier B. Oliva | We propose meta-curvature (MC), a framework to learn curvature information for better generalization and fast model adaptation. |

299 | Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning | Nathan Kallus, Masatoshi Uehara | We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. |

300 | KerGM: Kernelized Graph Matching | Zhen Zhang, Yijian Xiang, Lingfei Wu, Bing Xue, Arye Nehorai | In our paper, we provide a unifying view for these two problems by introducing new rules for array operations in Hilbert spaces. |

301 | Transfusion: Understanding Transfer Learning for Medical Imaging | Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio | In this paper, we explore properties of transfer learning for medical imaging. |

302 | Adversarial training for free! | Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, Tom Goldstein | We present an algorithm that eliminates the overhead cost of generating adversarial examples by recycling the gradient information computed when updating model parameters. |

303 | Communication-Efficient Distributed Learning via Lazily Aggregated Quantized Gradients | Jun Sun, Tianyi Chen, Georgios Giannakis, Zaiyue Yang | The present paper develops a novel aggregated gradient approach for distributed machine learning that adaptively compresses the gradient communication. |

304 | Implicitly learning to reason in first-order logic | Vaishak Belle, Brendan Juba | In this work, we present a new theoretical approach to robustly learning to reason in first-order logic, and consider universally quantified clauses over a countably infinite domain. |

305 | Kernel-Based Approaches for Sequence Modeling: Connections to Neural Methods | Kevin Liang, Guoyin Wang, Yitong Li, Ricardo Henao, Lawrence Carin | We investigate time-dependent data analysis from the perspective of recurrent kernel machines, from which models with hidden units and gated memory cells arise naturally. |

306 | PC-Fairness: A Unified Framework for Measuring Causality-based Fairness | Yongkai Wu, Lu Zhang, Xintao Wu, Hanghang Tong | In this paper, we develop a framework for measuring different causality-based fairness. |

307 | Arbicon-Net: Arbitrary Continuous Geometric Transformation Networks for Image Registration | Jianchun Chen, Lingjing Wang, Xiang Li, Yi Fang | To address this issue, we present an end-to-end trainable deep neural network, named Arbitrary Continuous Geometric Transformation Networks (Arbicon-Net), to directly predict the dense displacement field for pairwise image alignment. |

308 | Assessing Disparate Impact of Personalized Interventions: Identifiability and Bounds | Nathan Kallus, Angela Zhou | We prove how we can nonetheless point-identify these quantities under the additional assumption of monotone treatment response, which may be reasonable in many applications. |

309 | The Fairness of Risk Scores Beyond Classification: Bipartite Ranking and the XAUC Metric | Nathan Kallus, Angela Zhou | To better account for this, in this paper, we investigate the fairness of predictive risk scores from the point of view of a bipartite ranking task, where one seeks to rank positive examples higher than negative ones. |

310 | HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models | Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F. Fei-Fei, Michael Bernstein | Our work establishes a gold standard human benchmark for generative realism. |

311 | First order expansion of convex regularized estimators | Pierre Bellec, Arun Kuchibhotla | We consider first order expansions of convex penalized estimators in high-dimensional regression problems with random designs. |

312 | Capacity Bounded Differential Privacy | Kamalika Chaudhuri, Jacob Imola, Ashwin Machanavajjhala | In this work, we present a novel relaxation of differential privacy, capacity bounded differential privacy, where the adversary that distinguishes output distributions is assumed to be capacity-bounded — i.e. bounded not in computational power, but in terms of the function class from which their attack algorithm is drawn. |

313 | Universal Boosting Variational Inference | Trevor Campbell, Xinglong Li | We thus develop universal boosting variational inference (UBVI), a BVI scheme that exploits the simple geometry of probability densities under the Hellinger metric to prevent the degeneracy of other gradient-based BVI methods, avoid difficult joint optimizations of both component and weight, and simplify fully-corrective weight optimizations. |

314 | SGD on Neural Networks Learns Functions of Increasing Complexity | Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, Haofeng Zhang | More generally, we give evidence for the hypothesis that, as iterations progress, SGD learns functions of increasing complexity. |

315 | The Landscape of Non-convex Empirical Risk with Degenerate Population Risk | Shuang Li, Gongguo Tang, Michael B. Wakin | In this work, we focus on the situation where the corresponding population risk is a degenerate non-convex loss function, namely, the Hessian of the population risk can have zero eigenvalues. |

316 | Making AI Forget You: Data Deletion in Machine Learning | Antonio Ginart, Melody Guan, Gregory Valiant, James Y. Zou | In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. |

317 | Practical Differentially Private Top-k Selection with Pay-what-you-get Composition | David Durfee, Ryan M. Rogers | We study the problem of top-k selection over a large domain universe subject to user-level differential privacy. |

318 | Conformalized Quantile Regression | Yaniv Romano, Evan Patterson, Emmanuel Candes | In this paper we propose a new method that is fully adaptive to heteroscedasticity. |

319 | Thompson Sampling with Information Relaxation Penalties | Seungki Min, Costis Maglaras, Ciamac C. Moallemi | We consider a finite-horizon multi-armed bandit (MAB) problem in a Bayesian setting, for which we propose an information relaxation sampling framework. |

320 | Deep Generalized Method of Moments for Instrumental Variable Analysis | Andrew Bennett, Nathan Kallus, Tobias Schnabel | In this paper, we propose the DeepGMM algorithm to overcome this. |

321 | Learning Sample-Specific Models with Low-Rank Personalized Regression | Ben Lengerich, Bryon Aragam, Eric P. Xing | To address this challenge, we propose to estimate sample-specific models that tailor inference and prediction at the individual level. |

322 | Dancing to Music | Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, Jan Kautz | In this paper, we propose a synthesis-by-analysis learning framework to generate dance from music. |

323 | Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask | Hattie Zhou, Janice Lan, Rosanne Liu, Jason Yosinski | In this paper we study the three critical components of the Lottery Ticket (LT) algorithm, showing that each may be varied significantly without impacting the overall results. |

324 | Implicit Generation and Modeling with Energy Based Models | Yilun Du, Igor Mordatch | We present techniques to scale MCMC based EBM training on continuous neural networks, and we show its success on the high-dimensional data domains of ImageNet32x32, ImageNet128x128, CIFAR-10, and robotic hand trajectories, achieving better samples than other likelihood models and nearing the performance of contemporary GAN approaches, while covering all modes of the data. |

325 | LCA: Loss Change Allocation for Neural Network Training | Janice Lan, Rosanne Liu, Hattie Zhou, Jason Yosinski | We propose a new window into training called Loss Change Allocation (LCA), in which credit for changes to the network loss is conservatively partitioned to the parameters. |

326 | Predicting the Politics of an Image Using Webly Supervised Data | Christopher Thomas, Adriana Kovashka | In this paper, we model visual political bias in contemporary media sources at scale, using webly supervised data. We collect a dataset of over one million unique images and associated news articles from left- and right-leaning news sources, and develop a method to predict the image’s political leaning. |

327 | Adaptive GNN for Image Analysis and Editing | Lingyu Liang, LianWen Jin, Yong Xu | In mathematical analysis, we propose an adaptive GNN model by recursive definition, and derive its relation with two basic operations in CV: filtering and propagation operations. |

328 | Ultra Fast Medoid Identification via Correlated Sequential Halving | Tavor Baharav, David Tse | In this work, we show that we can better exploit the structure of the underlying computation problem by modifying the traditional bandit sampling strategy and using it in conjunction with a suitably chosen multi-armed bandit algorithm. |

329 | Tight Dimension Independent Lower Bound on the Expected Convergence Rate for Diminishing Step Sizes in SGD | Phuong Ha Nguyen, Lam Nguyen, Marten van Dijk | We study the convergence of Stochastic Gradient Descent (SGD) for strongly convex objective functions. |

330 | Asymptotics for Sketching in Least Squares Regression | Edgar Dobriban, Sifan Liu | In this paper, we make progress on this problem, working in an asymptotic framework where the number of datapoints and dimension of features goes to infinity. |

331 | MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies | Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, Sergey Levine | In this work, we propose multiplicative compositional policies (MCP), a method for learning reusable motor skills that can be composed to produce a range of complex behaviors. |

332 | Exact inference in structured prediction | Kevin Bello, Jean Honorio | We consider the generative process proposed by Globerson et al. (2015) and apply it to general connected graphs. |

333 | Coda: An End-to-End Neural Program Decompiler | Cheng Fu, Huili Chen, Haolan Liu, Xinyun Chen, Yuandong Tian, Farinaz Koushanfar, Jishen Zhao | To address the above problems, we propose Coda, the first end-to-end neural-based framework for code decompilation. |

334 | Bat-G net: Bat-inspired High-Resolution 3D Image Reconstruction using Ultrasonic Echoes | Gunpil Hwang, Seohyeon Kim, Hyeon-Min Bae | In this paper, a bat-inspired high-resolution ultrasound 3D imaging system is presented. |

335 | Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates | Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, Simon Lacoste-Julien | We propose to use line-search techniques to automatically set the step-size when training models that can interpolate the data. |

336 | Scalable Structure Learning of Continuous-Time Bayesian Networks from Incomplete Data | Dominik Linzner, Michael Schmidt, Heinz Koeppl | Instead of sampling and scoring all possible structures individually, we assume the generator of the CTBN to be composed as a mixture of generators stemming from different structures. In this framework, structure learning can be performed via a gradient-based optimization of mixture weights. |

337 | Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation | Devin Reich, Ariel Todoki, Rafael Dowsley, Martine De Cock, Anderson Nascimento | We propose the first privacy-preserving solution for text classification that is provably secure. |

338 | Efficiently Estimating Erdos-Renyi Graphs with Node Differential Privacy | Jonathan Ullman, Adam Sealfon | We give a simple, computationally efficient, and node-differentially-private algorithm for estimating the parameter of an Erdos-Renyi graph—that is, estimating p in a G(n,p)—with near-optimal accuracy. |

339 | Learning Representations for Time Series Clustering | Qianli Ma, Jiawei Zheng, Sen Li, Gary W. Cottrell | Here we propose a novel unsupervised temporal representation learning model, named Deep Temporal Clustering Representation (DTCR), which integrates the temporal reconstruction and K-means objective into the seq2seq model. |

340 | Verified Uncertainty Calibration | Ananya Kumar, Percy S. Liang, Tengyu Ma | To get the best of both worlds, we introduce the scaling-binning calibrator, which first fits a parametric function that acts like a baseline for variance reduction and then bins the function values to actually ensure calibration. |

341 | A Normative Theory for Causal Inference and Bayes Factor Computation in Neural Circuits | Wenhao Zhang, Si Wu, Brent Doiron, Tai Sing Lee | In this paper, we consider the causal inference in multisensory processing and propose a novel generative model based on neural population code that takes into account both stimulus feature and stimulus reliability in the inference. |

342 | Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction | Yunji Kim, Seonghyeon Nam, In Cho, Seon Joo Kim | We propose a deep video prediction model conditioned on a single image and an action class. |

343 | Subspace Attack: Exploiting Promising Subspaces for Query-Efficient Black-box Attacks | Yiwen Guo, Ziang Yan, Changshui Zhang | In this paper, we aim at reducing the query complexity of black-box attacks in this category. |

344 | Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction | Difan Zou, Pan Xu, Quanquan Gu | In this paper, we propose a Stochastic Recursive Variance-Reduced gradient HMC (SRVR-HMC) algorithm. |

345 | Learning Latent Process from High-Dimensional Event Sequences via Efficient Sampling | Qitian Wu, Zixuan Zhang, Xiaofeng Gao, Junchi Yan, Guihai Chen | To these ends, in this paper, we propose a seminal adversarial imitation learning framework for high-dimension event sequence generation which could be decomposed into: 1) a latent structural intensity model that estimates the adjacent nodes without explicit networks and learns to capture the temporal dynamics in the latent space of markers over observed sequence; 2) an efficient random walk based generation model that aims at imitating the generation process of high-dimension event sequences from a bottom-up view; 3) a discriminator specified as a seq2seq network optimizing the rewards to help the generator output event sequences as real as possible. |

346 | Cross-sectional Learning of Extremal Dependence among Financial Assets | Xing Yan, Qi Wu, Wen Zhang | We propose a novel probabilistic model to facilitate the learning of multivariate tail dependence of multiple financial assets. |

347 | Principal Component Projection and Regression in Nearly Linear Time through Asymmetric SVRG | Yujia Jin, Aaron Sidford | In this paper we provide the first algorithms that solve these problems in nearly linear time for fixed eigenvalue distribution and large n. |

348 | Compression with Flows via Local Bits-Back Coding | Jonathan Ho, Evan Lohn, Pieter Abbeel | To fill in this gap, we introduce local bits-back coding, a new compression technique for flow models. |

349 | Exact Rate-Distortion in Autoencoders via Echo Noise | Rob Brekelmans, Daniel Moyer, Aram Galstyan, Greg Ver Steeg | We introduce a new noise channel, Echo noise, that admits a simple, exact expression for mutual information for arbitrary input distributions. |

350 | iSplit LBI: Individualized Partial Ranking with Ties via Split LBI | Qianqian Xu, Xinwei Sun, Zhiyong Yang, Xiaochun Cao, Qingming Huang, Yuan Yao | In this paper, instead of learning a global ranking that agrees with the consensus, we pursue the tie-aware partial ranking from an individualized perspective. |

351 | Domes to Drones: Self-Supervised Active Triangulation for 3D Human Pose Reconstruction | Aleksis Pirinen, Erik Gärtner, Cristian Sminchisescu | In order to address the view selection problem in a principled way, we here introduce ACTOR, an active triangulation agent for 3D human pose reconstruction. |

352 | MetaQuant: Learning to Quantize by Learning to Penetrate Non-differentiable Quantization | Shangyu Chen, Wenya Wang, Sinno Jialin Pan | In this paper, we propose to learn $g_r$ by a neural network. |

353 | Improved Precision and Recall Metric for Assessing Generative Models | Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, Timo Aila | We present an evaluation metric that can separately and reliably measure both of these aspects in image generation tasks by forming explicit, non-parametric representations of the manifolds of real and generated data. |

354 | A First-Order Algorithmic Framework for Distributionally Robust Logistic Regression | Jiajin Li, Sen Huang, Anthony Man-Cho So | In this paper, we take a first step towards resolving the above difficulty by developing a first-order algorithmic framework for tackling a class of Wasserstein distance-based distributionally robust logistic regression (DRLR) problems. |

355 | PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph | Yikang Li, Tao Ma, Yeqi Bai, Nan Duan, Sining Wei, Xiaogang Wang | Therefore, to generate the images with preferred objects and rich interactions, we propose a semi-parametric method, PasteGAN, for generating the image from the scene graph and the image crops, where spatial arrangements of the objects and their pair-wise relationships are defined by the scene graph and the object appearances are determined by the given object crops. |

356 | Handling correlated and repeated measurements with the smoothed multivariate square-root Lasso | Quentin Bertrand, Mathurin Massias, Alexandre Gramfort, Joseph Salmon | In this work, we propose a concomitant estimator that can cope with complex noise structure by using non-averaged measurements, its data-fitting term arising as a smoothing of the nuclear norm. |

357 | Joint Optimization of Tree-based Index and Deep Model for Recommender Systems | Han Zhu, Daqing Chang, Ziru Xu, Pengye Zhang, Xiang Li, Jie He, Han Li, Jian Xu, Kun Gai | Our purpose, in this paper, is to develop a method to jointly learn the index structure and user preference prediction model. |

358 | Learning Generalizable Device Placement Algorithms for Distributed Machine Learning | Ravichandra Addanki, Shaileshh Bojja Venkatakrishnan, Shreyan Gupta, Hongzi Mao, Mohammad Alizadeh | We present Placeto, a reinforcement learning (RL) approach to efficiently find device placements for distributed neural network training. |

359 | Uncoupled Regression from Pairwise Comparison Data | Liyuan Xu, Junya Honda, Gang Niu, Masashi Sugiyama | We propose two practical methods for uncoupled regression from pairwise comparison data and show that the learned regression model converges to the optimal model with the optimal parametric convergence rate when the target variable distributes uniformly. |

360 | Cross Attention Network for Few-shot Classification | Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, Xilin Chen | In this work, we propose a novel Cross Attention Network to address the challenging problems in few-shot classification. |

361 | A Nonconvex Approach for Exact and Efficient Multichannel Sparse Blind Deconvolution | Qing Qu, Xiao Li, Zhihui Zhu | We study the multi-channel sparse blind deconvolution (MCS-BD) problem, whose task is to simultaneously recover a kernel $\mathbf a$ and multiple sparse inputs $\{\mathbf x_i\}_{i=1}^p$ from their circulant convolution $\mathbf y_i = \mathbf a \circledast \mathbf x_i$ ($i=1,\cdots,p$). |

362 | SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models | Linfeng Zhang, Zhanhong Tan, Jiebo Song, Jingwei Chen, Chenglong Bao, Kaisheng Ma | To address this problem, we propose the so-called SCAN framework for networks training and inference, which is orthogonal and complementary to existing acceleration and compression methods. |

363 | Revisiting the Bethe-Hessian: Improved Community Detection in Sparse Heterogeneous Graphs | Lorenzo Dall’Amico, Romain Couillet, Nicolas Tremblay | This article studies spectral clustering based on the Bethe-Hessian matrix $H_r = (r^2-1)I_n + D - rA$ for sparse heterogeneous graphs (following the degree-corrected stochastic block model) in a two-class setting. |

364 | Teaching Multiple Concepts to a Forgetful Learner | Anette Hunziker, Yuxin Chen, Oisin Mac Aodha, Manuel Gomez Rodriguez, Andreas Krause, Pietro Perona, Yisong Yue, Adish Singla | In this paper, we look at the problem from the perspective of discrete optimization and introduce a novel algorithmic framework for teaching multiple concepts with strong performance guarantees. |

365 | Regularized Weighted Low Rank Approximation | Frank Ban, David Woodruff, Richard Zhang | We derive provably sharper guarantees for the regularized version by obtaining parameterized complexity bounds in terms of the statistical dimension rather than the rank, allowing for a rank-independent runtime that can be significantly faster. |

366 | Practical and Consistent Estimation of f-Divergences | Paul Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, Ilya O. Tolstikhin | Under these assumptions we propose and study an estimator that can be easily implemented, works well in high dimensions, and enjoys faster rates of convergence. |

367 | Approximation Ratios of Graph Neural Networks for Combinatorial Problems | Ryoma Sato, Makoto Yamada, Hisashi Kashima | In this paper, from a theoretical perspective, we study how powerful graph neural networks (GNNs) can be for learning approximation algorithms for combinatorial problems. |

368 | Thinning for Accelerating the Learning of Point Processes | Tianbo Li, Yiping Ke | We propose thinning as a downsampling method for accelerating the learning of point processes. |

369 | A Prior of a Googol Gaussians: a Tensor Ring Induced Prior for Generative Models | Maxim Kuznetsov, Daniil Polykovskiy, Dmitry P. Vetrov, Alex Zhebrak | Altogether, we propose a novel plug-and-play framework for generative models that can be utilized in any GAN and VAE-like architectures. |

370 | Differentially Private Markov Chain Monte Carlo | Mikko Heikkilä, Joonas Jälkö, Onur Dikmen, Antti Honkela | In this paper, we further extend the applicability of DP Bayesian learning by presenting the first general DP Markov chain Monte Carlo (MCMC) algorithm whose privacy guarantees are not subject to unrealistic assumptions on Markov chain convergence and that is applicable to posterior inference in arbitrary models. |

371 | Full-Gradient Representation for Neural Network Visualization | Suraj Srinivas, François Fleuret | We introduce a new tool for interpreting neural nets, namely full-gradients, which decomposes the neural net response into input sensitivity and per-neuron sensitivity components. |

372 | q-means: A quantum algorithm for unsupervised machine learning | Iordanis Kerenidis, Jonas Landman, Alessandro Luongo, Anupam Prakash | In this paper, we introduce q-means, a new quantum algorithm for clustering. |

373 | Learner-aware Teaching: Inverse Reinforcement Learning with Preferences and Constraints | Sebastian Tschiatschek, Ahana Ghosh, Luis Haug, Rati Devidze, Adish Singla | In this paper, we consider the setting where the learner has its own preferences that it additionally takes into consideration. |

374 | Limitations of the empirical Fisher approximation for natural gradient descent | Frederik Kunstner, Philipp Hennig, Lukas Balles | We dispute this argument by showing that the empirical Fisher—unlike the Fisher—does not generally capture second-order information. |

375 | Flow-based Image-to-Image Translation with Feature Disentanglement | Ruho Kondo, Keisuke Kawano, Satoshi Koide, Takuro Kutsuna | To this end we propose a flow-based image-to-image model, called Flow U-Net with Squeeze modules (FUNS), that allows us to disentangle the features while retaining the ability to generate high-quality diverse images from condition images. |

376 | Learning dynamic polynomial proofs | Alhussein Fawzi, Mateusz Malinowski, Hamza Fawzi, Omar Fawzi | In this paper, we consider the fundamental computational task of automatically searching for proofs of polynomial inequalities. |

377 | Shape and Time Distortion Loss for Training Deep Time Series Forecasting Models | Vincent Le Guen, Nicolas Thome | To handle this challenging task, we introduce DILATE (DIstortion Loss including shApe and TimE), a new objective function for training deep neural networks. |

378 | Understanding Attention and Generalization in Graph Neural Networks | Boris Knyazev, Graham W. Taylor, Mohamed Amer | Motivated by insights from the work on Graph Isomorphism Networks, we design simple graph reasoning tasks that allow us to study attention in a controlled environment. |

379 | Data Cleansing for Models Trained with SGD | Satoshi Hara, Atsushi Nitanda, Takanori Maehara | In this paper, we propose an algorithm that can identify influential instances without using any domain knowledge. |

380 | Curvilinear Distance Metric Learning | Shuo Chen, Lei Luo, Jian Yang, Chen Gong, Jun Li, Heng Huang | After that, by extending such straight lines to general curved forms, we propose a Curvilinear Distance Metric Learning (CDML) method, which adaptively learns the nonlinear geometries of the training data. |

381 | Embedding Symbolic Knowledge into Deep Networks | Xie Yaqi, Ziwei Xu, Kuldeep S Meel, Mohan Kankanhalli, Harold Soh | In this work, we aim to leverage prior symbolic knowledge to improve the performance of deep models. |

382 | Modeling Uncertainty by Learning a Hierarchy of Deep Neural Connections | Raanan Yehezkel Rohekar, Yaniv Gurwicz, Shami Nisimov, Gal Novik | We propose an approach for modeling this confounder by sharing neural connectivity patterns between the generative and discriminative networks. |

383 | Efficient Graph Generation with Graph Recurrent Attention Networks | Renjie Liao, Yujia Li, Yang Song, Shenlong Wang, Will Hamilton, David K. Duvenaud, Raquel Urtasun, Richard Zemel | We propose a new family of efficient and expressive deep generative models of graphs, called Graph Recurrent Attention Networks (GRANs). |

384 | Beyond Alternating Updates for Matrix Factorization with Inertial Bregman Proximal Gradient Algorithms | Mahesh Chandra Mukkamala, Peter Ochs | We exploit this theory by proposing a novel Bregman distance for matrix factorization problems, which, at the same time, allows for simple/closed form update steps. |

385 | Learning Deep Bilinear Transformation for Fine-grained Image Representation | Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, Jiebo Luo | In this paper, we propose a deep bilinear transformation (DBT) block, which can be deeply stacked in convolutional neural networks to learn fine-grained image representations. |

386 | Practical Deep Learning with Bayesian Principles | Kazuki Osawa, Siddharth Swaroop, Mohammad Emtiyaz E. Khan, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota | In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. |

387 | Training Language GANs from Scratch | Cyprien de Masson d’Autume, Shakir Mohamed, Mihaela Rosca, Jack Rae | We combine existing techniques such as large batch sizes, dense rewards and discriminator regularization to stabilize and improve language GANs. |

388 | Pseudo-Extended Markov chain Monte Carlo | Christopher Nemeth, Fredrik Lindsten, Maurizio Filippone, James Hensman | In this paper, we introduce the pseudo-extended MCMC method as a simple approach for improving the mixing of the MCMC sampler for multi-modal posterior distributions. |

389 | Differentially Private Bagging: Improved utility and cheaper privacy than subsample-and-aggregate | James Jordon, Jinsung Yoon, Mihaela van der Schaar | In this paper, we extend this approach by dividing the data several times (rather than just once) and learning models on each chunk within each division. |

390 | Propagating Uncertainty in Reinforcement Learning via Wasserstein Barycenters | Alberto Maria Metelli, Amarildo Likmeta, Marcello Restelli | In this paper, we address this question by proposing a Bayesian framework in which we employ approximate posterior distributions to model the uncertainty of the value function and Wasserstein barycenters to propagate it across state-action pairs. |

391 | On Adversarial Mixup Resynthesis | Christopher Beckham, Sina Honari, Alex M. Lamb, Vikas Verma, Farnoosh Ghadiri, R Devon Hjelm, Yoshua Bengio, Chris Pal | In this paper, we explore new approaches to combining information encoded within the learned representations of auto-encoders. |

392 | A Geometric Perspective on Optimal Representations for Reinforcement Learning | Marc Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, Clare Lyle | We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. |

393 | Learning New Tricks From Old Dogs: Multi-Source Transfer Learning From Pre-Trained Networks | Joshua Lee, Prasanna Sattigeri, Gregory Wornell | For such scenarios, we consider the multi-source learning problem of training a classifier using an ensemble of pre-trained neural networks for a set of classes that have not been observed by any of the source networks, and for which we have very few training samples. |

394 | Understanding and Improving Layer Normalization | Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin | In this paper, our main contribution is to take a step further in understanding LayerNorm. |

395 | Uncertainty-based Continual Learning with Adaptive Regularization | Hongjoon Ahn, Sungmin Cha, Donggyu Lee, Taesup Moon | We introduce a new neural network-based continual learning algorithm, dubbed as Uncertainty-regularized Continual Learning (UCL), which builds on traditional Bayesian online learning framework with variational inference. |

396 | LIIR: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning | Yali Du, Lei Han, Meng Fang, Ji Liu, Tianhong Dai, Dacheng Tao | In this paper, we propose to merge the two directions and learn for each agent an intrinsic reward function which diversely stimulates the agents at each time step. |

397 | U-Time: A Fully Convolutional Network for Time Series Segmentation Applied to Sleep Staging | Mathias Perslev, Michael Jensen, Sune Darkner, Poul Jørgen Jennum, Christian Igel | We propose U-Time, a fully feed-forward deep learning approach to physiological time series segmentation developed for the analysis of sleep data. |

398 | Massively scalable Sinkhorn distances via the Nystrom method | Jason Altschuler, Francis Bach, Alessandro Rudi, Jonathan Niles-Weed | In this work, we show that this challenge is surprisingly easy to circumvent: combining two simple techniques—the Nyström method and Sinkhorn scaling—provably yields an accurate approximation of the Sinkhorn distance with significantly lower time and memory requirements than other approaches. |

399 | Double Quantization for Communication-Efficient Distributed Optimization | Yue Yu, Jiaxiang Wu, Longbo Huang | In this paper, to reduce the communication complexity, we propose double quantization, a general scheme for quantizing both model parameters and gradients. |

400 | Globally optimal score-based learning of directed acyclic graphs in high-dimensions | Bryon Aragam, Arash Amini, Qing Zhou | We prove that $\Omega(s\log p)$ samples suffice to learn a sparse Gaussian directed acyclic graph (DAG) from data, where $s$ is the maximum Markov blanket size. |

401 | Multi-relational Poincare Graph Embeddings | Ivana Balazevic, Carl Allen, Timothy Hospedales | To address this, we propose a model that embeds multi-relational graph data in the Poincaré ball model of hyperbolic space. |

402 | No-Press Diplomacy: Modeling Multi-Agent Gameplay | Philip Paquette, Yuchen Lu, Seton Steven Bocco, Max Smith, Satya O.-G., Jonathan K. Kummerfeld, Joelle Pineau, Satinder Singh, Aaron C. Courville | In this work, we focus on training an agent that learns to play the No Press version of Diplomacy where there is no dedicated communication channel between players. |

403 | State Aggregation Learning from Markov Transition Data | Yaqi Duan, Tracy Ke, Mengdi Wang | In this paper, we propose a tractable algorithm that estimates the probabilistic aggregation map from the system’s trajectory. |

404 | Disentangling Influence: Using disentangled representations to audit model predictions | Charles Marx, Richard Phillips, Sorelle Friedler, Carlos Scheidegger, Suresh Venkatasubramanian | In this paper, we develop disentangled influence audits, a procedure to audit the indirect influence of features. |

405 | Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning | David Janz, Jiri Hron, Przemyslaw Mazur, Katja Hofmann, José Miguel Hernández-Lobato, Sebastian Tschiatschek | We use these insights to design Successor Uncertainties (SU), a cheap and easy-to-implement RVF algorithm that retains key properties of PSRL. |

406 | Partially Encrypted Deep Learning using Functional Encryption | Théo Ryffel, David Pointcheval, Francis Bach, Edouard Dufour-Sans, Romain Gay | We propose a practical framework to perform partially encrypted and privacy-preserving predictions which combines adversarial training and functional encryption. |

407 | Decentralized Cooperative Stochastic Bandits | David Martínez-Rubio, Varun Kanade, Patrick Rebeschini | We study a decentralized cooperative stochastic multi-armed bandit problem with K arms on a network of N agents. |

408 | Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem | Gonzalo Mena, Jonathan Niles-Weed | We prove several fundamental statistical bounds for entropic OT with the squared Euclidean cost between subgaussian probability measures in arbitrary dimension. |

409 | Efficient Deep Approximation of GMMs | Shirin Jalali, Carl Nuzman, Iraj Saniee | In this work, we extend this idea to a rich class of functions, namely the discriminant functions that arise in optimal Bayesian classification of Gaussian mixture models (GMMs) in $\mathbb{R}^n$. |

410 | Learning low-dimensional state embeddings and metastable clusters from time series data | Yifan Sun, Yaqi Duan, Hao Gong, Mengdi Wang | In the spirit of diffusion map, we propose an efficient method for learning a low-dimensional state embedding and capturing the process’s dynamics. |

411 | Exploiting Local and Global Structure for Point Cloud Semantic Segmentation with Contextual Point Representations | Xu Wang, Jingming He, Lin Ma | In this paper, we propose a novel model for point cloud semantic segmentation, which exploits both the local and global structures within the point cloud based on the contextual point representations. |

412 | Scalable Bayesian dynamic covariance modeling with variational Wishart and inverse Wishart processes | Creighton Heaukulani, Mark van der Wilk | We implement gradient-based variational inference routines for Wishart and inverse Wishart processes, which we apply as Bayesian models for the dynamic, heteroskedastic covariance matrix of a multivariate time series. |

413 | Kernel Instrumental Variable Regression | Rahul Singh, Maneesh Sahani, Arthur Gretton | We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). |

414 | Symmetry-Based Disentangled Representation Learning requires Interaction with Environments | Hugo Caselles-Dupré, Michael Garcia Ortiz, David Filliat | We build on their work and make observations, theoretical and empirical, that lead us to argue that Symmetry-Based Disentangled Representation Learning cannot only be based on static observations: agents should interact with the environment to discover its symmetries. |

415 | Fast Efficient Hyperparameter Tuning for Policy Gradient Methods | Supratik Paul, Vitaly Kurin, Shimon Whiteson | In this paper, we propose Hyperparameter Optimisation on the Fly (HOOF), a gradient-free algorithm that requires no more than one training run to automatically adapt the hyperparameters that affect the policy update directly through the gradient. |

416 | Offline Contextual Bayesian Optimization | Ian Char, Youngseog Chung, Willie Neiswanger, Kirthevasan Kandasamy, Oak Nelson, Mark Boyer, Egemen Kolemen | In this work, we describe a theoretically grounded Bayesian optimization method to tackle this problem. |

417 | Making the Cut: A Bandit-based Approach to Tiered Interviewing | Candice Schumann, Zhi Lang, Jeffrey Foster, John Dickerson | We present new algorithms in both the probably approximately correct (PAC) and fixed-budget settings that select a near-optimal cohort with provable guarantees. |

418 | Unsupervised Scalable Representation Learning for Multivariate Time Series | Jean-Yves Franceschi, Aymeric Dieuleveut, Martin Jaggi | In this paper, we tackle this challenge by proposing an unsupervised method to learn universal embeddings of time series. |

419 | A state-space model for inferring effective connectivity of latent neural dynamics from simultaneous EEG/fMRI | Tao Tu, John Paisley, Stefan Haufe, Paul Sajda | In this study, we develop a linear state-space model to infer the effective connectivity in a distributed brain network based on simultaneously recorded EEG and fMRI data. |

420 | End to end learning and optimization on graphs | Bryan Wilder, Eric Ewing, Bistra Dilkina, Milind Tambe | Here, we propose an alternative decision-focused learning approach that integrates a differentiable proxy for common graph optimization problems as a layer in learned systems. |

421 | Game Design for Eliciting Distinguishable Behavior | Fan Yang, Liu Leqi, Yifan Wu, Zachary Lipton, Pradeep K. Ravikumar, Tom M. Mitchell, William W. Cohen | In this paper, we formulate the task of designing behavior diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. |

422 | When does label smoothing help? | Rafael Müller, Simon Kornblith, Geoffrey E. Hinton | We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. |

423 | Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning | Harsh Gupta, R. Srikant, Lei Ying | We present finite-time performance bounds for the case where the learning rate is fixed. |

424 | Rethinking Deep Neural Network Ownership Verification: Embedding Passports to Defeat Ambiguity Attacks | Lixin Fan, Kam Woh Ng, Chee Seng Chan | As remedies to the above-mentioned loophole, this paper proposes novel passport-based DNN ownership verification schemes which are both robust to network modifications and resilient to ambiguity attacks. |

425 | Scalable Spike Source Localization in Extracellular Recordings using Amortized Variational Inference | Cole Hurwitz, Kai Xu, Akash Srivastava, Alessio Buccino, Matthias Hennig | In this work, we present a Bayesian modelling approach for localizing the source of individual spikes on high-density, microelectrode arrays. |

426 | Optimal Sketching for Kronecker Product Regression and Low Rank Approximation | Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, David Woodruff | In this work, we provide significantly faster algorithms. |

427 | Distribution-Independent PAC Learning of Halfspaces with Massart Noise | Ilias Diakonikolas, Themis Gouleakis, Christos Tzamos | We study the problem of distribution-independent PAC learning of halfspaces in the presence of Massart noise. |

428 | The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies | Basri Ronen, David Jacobs, Yoni Kasten, Shira Kritchman | We study the relationship between the frequency of a function and the speed at which a neural network learns it. |

429 | Adaptive Auxiliary Task Weighting for Reinforcement Learning | Xingyu Lin, Harjatin Baweja, George Kantor, David Held | In this work, we propose a principled online learning algorithm that dynamically combines different auxiliary tasks to speed up training for reinforcement learning. |

430 | Blocking Bandits | Soumya Basu, Rajat Sen, Sujay Sanghavi, Sanjay Shakkottai | We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. |

431 | Global Convergence of Least Squares EM for Demixing Two Log-Concave Densities | Wei Qian, Yuqian Zhang, Yudong Chen | We demonstrate that Least Squares EM, a variant of the EM algorithm, converges to the true location parameter from a randomly initialized point. |

432 | Prior-Free Dynamic Auctions with Low Regret Buyers | Yuan Deng, Jon Schneider, Balasubramanian Sivan | In this work, we do away with this assumption and consider the prior-free setting where the buyer’s value each round is chosen adversarially (possibly adaptively). |

433 | On Single Source Robustness in Deep Fusion Models | Taewan Kim, Joydeep Ghosh | Motivated by this discovery, two possible approaches are proposed to increase robustness: a carefully designed loss with corresponding training algorithms for deep fusion models, and a simple convolutional fusion layer that has a structural advantage in dealing with noise. |

434 | Policy Evaluation with Latent Confounders via Optimal Balance | Andrew Bennett, Nathan Kallus | Instead, we propose an adversarial objective and weights that minimize it, ensuring sufficient balance in the latent confounders regardless of outcome model. |

435 | Think Globally, Act Locally: A Deep Neural Network Approach to High-Dimensional Time Series Forecasting | Rajat Sen, Hsiang-Fu Yu, Inderjit S. Dhillon | In this paper, we seek to correct this deficiency and propose DeepGLO, a deep forecasting model which thinks globally and acts locally. |

436 | Adaptive Cross-Modal Few-shot Learning | Chen Xing, Negar Rostamzadeh, Boris Oreshkin, Pedro O. O. Pinheiro | In this paper, we propose to leverage cross-modal information to enhance metric-based few-shot learning methods. |

437 | Spectral Modification of Graphs for Improved Spectral Clustering | Ioannis Koutis, Huong Le | In this paper we show that for any graph $G$, there exists a ‘spectral maximizer’ graph $H$ which is cut-similar to $G$, but has eigenvalues that are near the theoretical limit implied by the cut structure of $G$. |

438 | Hyperbolic Graph Convolutional Neural Networks | Ines Chami, Zhitao Ying, Christopher Ré, Jure Leskovec | Here we propose Hyperbolic Graph Convolutional Neural Network (HGCN), the first inductive hyperbolic GCN that leverages both the expressiveness of GCNs and hyperbolic geometry to learn inductive node representations for hierarchical and scale-free graphs. |

439 | Cost Effective Active Search | Shali Jiang, Roman Garnett, Benjamin Moseley | We propose simple and fast approximations for computing its expectation, which plays an essential role in our proposed policy. |

440 | Exploration Bonus for Regret Minimization in Discrete and Continuous Average Reward MDPs | Jian QIAN, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric | While it has been analyzed in infinite-horizon discounted and finite-horizon problems, we focus on designing and analysing the exploration bonus in the more challenging infinite-horizon undiscounted setting. |

441 | Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks | Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Xiaodong Cui, Wei Zhang, Kailash Gopalakrishnan | Using theoretical insights, we propose a hybrid FP8 (HFP8) format and DNN end-to-end distributed training procedure. |

442 | Tight Certificates of Adversarial Robustness for Randomly Smoothed Classifiers | Guang-He Lee, Yang Yuan, Shiyu Chang, Tommi Jaakkola | In particular, we offer adversarial robustness guarantees and associated algorithms for the discrete case where the adversary is $\ell_0$ bounded. |

443 | Poisson-Minibatching for Gibbs Sampling with Convergence Rate Guarantees | Ruqi Zhang, Christopher M. De Sa | In this paper, we propose a new auxiliary-variable minibatched Gibbs sampling method, Poisson-minibatching Gibbs, which both produces unbiased samples and has a theoretical guarantee on its convergence rate. |

444 | One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers | Ari Morcos, Haonan Yu, Michela Paganini, Yuandong Tian | Here, we attempt to answer this question by generating winning tickets for one training configuration (optimizer and dataset) and evaluating their performance on another configuration. |

445 | Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces | Chuan Guo, Ali Mousavi, Xiang Wu, Daniel N. Holtmann-Rice, Satyen Kale, Sashank Reddi, Sanjiv Kumar | In this paper, we demonstrate that theoretically there is no limitation to using low-dimensional embedding-based methods, and provide experimental evidence that overfitting is the root cause of the poor performance of embedding-based methods. |

446 | Fair Algorithms for Clustering | Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, Maryam Negahbani | We study the problem of finding low-cost *fair clusterings* in data where each data point may belong to many protected groups. |

447 | Learning Mean-Field Games | Xin Guo, Anran Hu, Renyuan Xu, Junzi Zhang | This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. |

448 | SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers | Igor Fedorov, Ryan P. Adams, Matthew Mattina, Paul Whatmough | This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. |

449 | Deep imitation learning for molecular inverse problems | Eric Jonas | We treat this as a problem of graph-structured prediction, where armed with per-vertex information on a subset of the vertices, we infer the edges and edge types. |

450 | Visual Concept-Metaconcept Learning | Chi Han, Jiayuan Mao, Chuang Gan, Josh Tenenbaum, Jiajun Wu | In this paper, we propose the visual concept-metaconcept learner (VCML) for joint learning of concepts and metaconcepts from images and associated question-answer pairs. |

451 | Few-shot Video-to-Video Synthesis | Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Bryan Catanzaro, Jan Kautz | To address the limitations, we propose a few-shot vid2vid framework, which learns to synthesize videos of previously unseen subjects or scenes by leveraging few example images of the target at test time. |

452 | Neural Similarity Learning | Weiyang Liu, Zhen Liu, James M. Rehg, Le Song | By generalizing inner product with a bilinear matrix, we propose the neural similarity which serves as a learnable parametric similarity measure for CNNs. |

453 | Ordered Memory | Yikang Shen, Shawn Tan, Arian Hosseini, Zhouhan Lin, Alessandro Sordoni, Aaron C. Courville | In this paper, we propose the Ordered Memory architecture. |

454 | MixMatch: A Holistic Approach to Semi-Supervised Learning | David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, Colin A. Raffel | In this work, we unify the current dominant approaches for semi-supervised learning to produce a new algorithm, MixMatch, that guesses low-entropy labels for data-augmented unlabeled examples and mixes labeled and unlabeled data using MixUp. |

455 | Multivariate Triangular Quantile Maps for Novelty Detection | Jingjing Wang, Sun Sun, Yaoliang Yu | In this work, we present a general framework for neural novelty detection that centers around a multivariate extension of the univariate quantile function. |

456 | Fast Parallel Algorithms for Statistical Subset Selection Problems | Sharon Qian, Yaron Singer | In this paper, we propose a new framework for designing fast parallel algorithms for fundamental statistical subset selection tasks that include feature selection and experimental design. |

457 | PHYRE: A New Benchmark for Physical Reasoning | Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, Ross Girshick | We develop the PHYRE benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. |

458 | On the number of variables to use in principal component regression | Ji Xu, Daniel J. Hsu | We study least squares linear regression over uncorrelated Gaussian features that are selected in order of decreasing variance. |

459 | Factor Group-Sparse Regularization for Efficient Low-Rank Matrix Recovery | Jicong Fan, Lijun Ding, Yudong Chen, Madeleine Udell | This paper develops a new class of nonconvex regularizers for low-rank matrix recovery. |

460 | Mutually Regressive Point Processes | Ifigeneia Apostolopoulou, Scott Linderman, Kyle Miller, Artur Dubrawski | In this paper, we introduce the first general class of Bayesian point process models extended with a nonlinear component that allows both excitatory and inhibitory relationships in continuous time. |

461 | Data-driven Estimation of Sinusoid Frequencies | Gautier Izacard, Sreyas Mohan, Carlos Fernandez-Granda | In this work, we propose a novel neural-network architecture that produces a significantly more accurate representation, and combine it with an additional neural-network module trained to detect the number of frequencies. |

462 | E2-Train: Training State-of-the-art CNNs with Over 80% Less Energy | Ziyu Jiang, Yue Wang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, Zhangyang Wang | This paper attempts to explore an orthogonal direction: how to conduct more energy-efficient training of CNNs, so as to enable on-device training? |

463 | ANODEV2: A Coupled Neural ODE Framework | Tianjun Zhang, Zhewei Yao, Amir Gholami, Joseph E. Gonzalez, Kurt Keutzer, Michael W. Mahoney, George Biros | Here, we propose ANODEV2, which extends this approach by introducing a framework that allows ODE-based evolution for both the weights and the activations, in a coupled formulation. |

464 | Estimating Entropy of Distributions in Constant Space | Jayadev Acharya, Sourbh Bhadane, Piotr Indyk, Ziteng Sun | Our main contribution is an algorithm that requires $O\left(\frac{k \log (1/\varepsilon)^2}{\varepsilon^3}\right)$ samples and a constant $O(1)$ memory words of space and outputs a $\pm\varepsilon$ estimate of $H(p)$. |

465 | On the Utility of Learning about Humans for Human-AI Coordination | Micah Carroll, Rohin Shah, Mark K. Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, Anca Dragan | To demonstrate this, we introduce a simple environment that requires challenging coordination, based on the popular game Overcooked, and learn a simple model that mimics human play. |

466 | Efficient Regret Minimization Algorithm for Extensive-Form Correlated Equilibrium | Gabriele Farina, Chun Kai Ling, Fei Fang, Tuomas Sandholm | In this paper, we introduce the first efficient regret minimization algorithm for computing extensive-form correlated equilibria in large two-player general-sum games with no chance moves. |

467 | Learning in Generalized Linear Contextual Bandits with Stochastic Delays | Zhengyuan Zhou, Renyuan Xu, Jose Blanchet | In this paper, we consider online learning in generalized linear contextual bandits where rewards are not immediately observed. |

468 | Empirically Measuring Concentration: Fundamental Limits on Intrinsic Robustness | Saeed Mahloujifar, Xiao Zhang, Mohammad Mahmoody, David Evans | This paper presents a method for empirically measuring and bounding the concentration of a concrete dataset which is proven to converge to the actual concentration. |

469 | Optimistic Regret Minimization for Extensive-Form Games via Dilated Distance-Generating Functions | Gabriele Farina, Christian Kroer, Tuomas Sandholm | We study the performance of optimistic regret-minimization algorithms for both minimizing regret in, and computing Nash equilibria of, zero-sum extensive-form games. |

470 | Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model | Erik Nijkamp, Mitch Hill, Song-Chun Zhu, Ying Nian Wu | This paper studies a curious phenomenon in learning energy-based model (EBM) using MCMC. |

471 | Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting | Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, Xifeng Yan | In this paper, we propose to tackle such forecasting problem with Transformer. |

472 | On the Accuracy of Influence Functions for Measuring Group Effects | Pang Wei W. Koh, Kai-Siang Ang, Hubert Teo, Percy S. Liang | In this paper, we find that across many different types of groups and for a range of real-world datasets, the predicted effect (using influence functions) of a group correlates surprisingly well with its actual effect, even if the absolute and relative errors are large. |

473 | Face Reconstruction from Voice using Generative Adversarial Networks | Yandong Wen, Bhiksha Raj, Rita Singh | In this paper, we address the challenge posed by a subtask of voice profiling – reconstructing someone’s face from their voice. |

474 | Incremental Few-Shot Learning with Attention Attractor Networks | Mengye Ren, Renjie Liao, Ethan Fetaya, Richard Zemel | To this end, we propose a meta-learning model, the Attention Attractor Network, which regularizes the learning of novel classes. |

475 | On Testing for Biases in Peer Review | Ivan Stelmakh, Nihar Shah, Aarti Singh | We consider the issue of biases in scholarly research, specifically, in peer review. |

476 | Learning Disentangled Representation for Robust Person Re-identification | Chanho Eom, Bumsub Ham | To tackle this problem, we propose to disentangle identity-related and -unrelated features from person images. |

477 | Balancing Efficiency and Fairness in On-Demand Ridesourcing | Nixie S. Lesmana, Xuan Zhang, Xiaohui Bei | In this paper, we focus on both the system efficiency and the fairness among drivers and quantitatively analyze the trade-offs between these two objectives. |

478 | Latent Ordinary Differential Equations for Irregularly-Sampled Time Series | Yulia Rubanova, Tian Qi Chen, David K. Duvenaud | We generalize RNNs to have continuous-time hidden dynamics defined by ordinary differential equations (ODEs), a model we call ODE-RNNs. |

479 | Deep RGB-D Canonical Correlation Analysis For Sparse Depth Completion | Yiqi Zhong, Cho-Ying Wu, Suya You, Ulrich Neumann | In this paper, we propose our Correlation For Completion Network (CFCNet), an end-to-end deep learning model that uses the correlation between two data sources to perform sparse depth completion. |

480 | Input Similarity from the Neural Network Perspective | Guillaume Charpiat, Nicolas Girard, Loris Felardos, Yuliya Tarabalka | Given a trained neural network, we aim at understanding how similar it considers any two samples. |

481 | Adaptive Sequence Submodularity | Marko Mitrovic, Ehsan Kazemi, Moran Feldman, Andreas Krause, Amin Karbasi | In this paper, we view the problem of adaptive and sequential decision making through the lens of submodularity and propose an adaptive greedy policy with strong theoretical guarantees. |

482 | Weight Agnostic Neural Networks | Adam Gaier, David Ha | In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. |

483 | Learning to Predict Without Looking Ahead: World Models Without Forward Prediction | Daniel Freeman, David Ha, Luke Metz | In this work, we introduce a modification to traditional reinforcement learning which we call observational dropout, whereby we limit the agent's ability to observe the real environment at each timestep. |

484 | Reducing the variance in online optimization by transporting past gradients | Sébastien Arnold, Pierre-Antoine Manzagol, Reza Babanezhad Harikandeh, Ioannis Mitliagkas, Nicolas Le Roux | We propose to correct this staleness using the idea of *implicit gradient transport* (IGT) which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. |

485 | Characterizing Bias in Classifiers using Generative Models | Daniel McDuff, Shuang Ma, Yale Song, Ashish Kapoor | We propose a simulation-based approach for interrogating classifiers using generative adversarial models in a systematic manner. |

486 | Optimal Stochastic and Online Learning with Individual Iterates | Yunwen Lei, Peng Yang, Ke Tang, Ding-Xuan Zhou | In this paper, we propose a theoretically sound strategy to select an individual iterate of the vanilla SCMD, which is able to achieve optimal rates for both convex and strongly convex problems in a non-smooth learning setting. |

487 | Policy Learning for Fairness in Ranking | Ashudeep Singh, Thorsten Joachims | To address this need, we propose a general LTR framework that can optimize a wide range of utility metrics (e.g. NDCG) while satisfying fairness of exposure constraints with respect to the items. |

488 | Off-Policy Evaluation via Off-Policy Classification | Alexander Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine | In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. |

489 | Regularized Gradient Boosting | Corinna Cortes, Mehryar Mohri, Dmitry Storcheus | We introduce a new algorithm, called rgb, that directly benefits from these generalization bounds and that, at every boosting round, applies the *Structural Risk Minimization* principle to search for a base predictor with the best empirical fit versus complexity trade-off. |

490 | Efficient Probabilistic Inference in the Quest for Physics Beyond the Standard Model | Atilim Gunes Baydin, Lei Shao, Wahid Bhimji, Lukas Heinrich, Saeid Naderiparizi, Andreas Munk, Jialin Liu, Bradley Gram-Hansen, Gilles Louppe, Lawrence Meadows, Philip Torr, Victor Lee, Kyle Cranmer, Mr. Prabhat, Frank Wood | We present a novel probabilistic programming framework that couples directly to existing large-scale simulators through a cross-platform probabilistic execution protocol, which allows general-purpose inference engines to record and control random number draws within simulators in a language-agnostic way. |

491 | Markov Random Fields for Collaborative Filtering | Harald Steck | In this paper, we model the dependencies among the items that are recommended to a user in a collaborative-filtering problem via a Gaussian Markov Random Field (MRF). |

492 | A Step Toward Quantifying Independently Reproducible Machine Learning Research | Edward Raff | We take the first step toward a quantifiable answer by manually attempting to implement 255 papers published from 1984 until 2017, recording features of each paper, and performing statistical analysis of the results. |

493 | Scalable Global Optimization via Local Bayesian Optimization | David Eriksson, Michael Pearce, Jacob Gardner, Ryan D. Turner, Matthias Poloczek | We propose the TuRBO algorithm that fits a collection of local models and performs a principled global allocation of samples across these models via an implicit bandit approach. |

494 | Time-series Generative Adversarial Networks | Jinsung Yoon, Daniel Jarrett, M Van Der Schaar | We propose a novel framework for generating realistic time-series data that combines the flexibility of the unsupervised paradigm with the control afforded by supervised training. |

495 | Ouroboros: On Accelerating Training of Transformer-Based Language Models | Qian Yang, Zhouyuan Huo, Wenlin Wang, Lawrence Carin | We propose the first model-parallel algorithm that speeds the training of Transformer-based language models. |

496 | A Refined Margin Distribution Analysis for Forest Representation Learning | Shen-Huan Lyu, Liang Yang, Zhi-Hua Zhou | In this paper, we formulate the forest representation learning approach called CasDF as an additive model which boosts the augmented feature instead of the prediction. |

497 | Robustness to Adversarial Perturbations in Learning from Incomplete Data | Amir Najafi, Shin-ichi Maeda, Masanori Koyama, Takeru Miyato | We develop a generalization theory for our framework based on a number of novel complexity measures, such as an adversarial extension of Rademacher complexity and its semi-supervised analogue. |

498 | Exploring Unexplored Tensor Network Decompositions for Convolutional Neural Networks | Kohei Hayashi, Taiki Yamaguchi, Yohei Sugawara, Shin-ichi Maeda | In this study, we first characterize a decomposition class specific to CNNs by adopting a flexible graphical notation. |

499 | An Adaptive Empirical Bayesian Method for Sparse Deep Learning | Wei Deng, Xiao Zhang, Faming Liang, Guang Lin | We propose a novel adaptive empirical Bayesian (AEB) method for sparse deep learning, where the sparsity is ensured via a class of self-adaptive spike-and-slab priors. |

500 | Adaptive Influence Maximization with Myopic Feedback | Binghui Peng, Wei Chen | We study the adaptive influence maximization problem with myopic feedback under the independent cascade model: one sequentially selects k nodes as seeds one by one from a social network, and each selected seed returns the immediate neighbors it activates as the feedback available for later selections, and the goal is to maximize the expected number of total activated nodes, referred to as the influence spread. |

501 | Focused Quantization for Sparse CNNs | Yiren Zhao, Xitong Gao, Daniel Bates, Robert Mullins, Cheng-Zhong Xu | In this paper, we attend to the statistical properties of sparse CNNs and present focused quantization, a novel quantization strategy based on power-of-two values, which exploits the weight distributions after fine-grained pruning. |

502 | Quantum Embedding of Knowledge for Reasoning | Dinesh Garg, Shajith Ikbal Mohamed, Santosh K. Srivastava, Harit Vishwakarma, Hima Karanam, L Venkata Subramaniam | We present a novel approach called Embed2Reason (E2R) that embeds a symbolic KB into a vector space in a logical structure preserving manner. |

503 | Optimal Best Markovian Arm Identification with Fixed Confidence | Vrettos Moulos | We give a complete characterization of the sampling complexity of best Markovian arm identification in one-parameter Markovian bandit models. |

504 | Limiting Extrapolation in Linear Approximate Value Iteration | Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, Emma Brunskill | We introduce an algorithm that approximates value functions by combining Q-values estimated at a set of anchor states. |

505 | Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model | Andrea Zanette, Mykel J. Kochenderfer, Emma Brunskill | We propose an algorithm that is initially agnostic to the MDP but that can leverage the specific MDP structure, expressed in terms of variances of the rewards and next-state value function, and gaps in the optimal action-value function to reduce the sample complexity needed to find a good policy, precisely highlighting the contribution of each state-action pair to the final sample complexity. |

506 | Invertible Convolutional Flow | Mahdi Karami, Dale Schuurmans, Jascha Sohl-Dickstein, Laurent Dinh, Daniel Duckworth | As an alternative, we investigate a set of novel normalizing flows based on the circular and symmetric convolutions. |

507 | A Latent Variational Framework for Stochastic Optimization | Philippe Casgrain | This paper provides a unifying theoretical framework for stochastic optimization algorithms by means of a latent stochastic variational problem. |

508 | Topology-Preserving Deep Image Segmentation | Xiaoling Hu, Fuxin Li, Dimitris Samaras, Chao Chen | We propose a novel method that learns to segment with correct topology. |

509 | Connective Cognition Network for Directional Visual Commonsense Reasoning | Aming Wu, Linchao Zhu, Yahong Han, Yi Yang | Inspired by this idea, towards VCR, we propose a connective cognition network (CCN) to dynamically reorganize the visual neuron connectivity that is contextualized by the meaning of questions and answers. |

510 | Online Markov Decoding: Lower Bounds and Near-Optimal Approximation Algorithms | Vikas Garg, Tamar Pichkhadze | We resolve the fundamental problem of online decoding with general nth order ergodic Markov chain models. |

511 | A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning | Francisco Garcia, Philip S. Thomas | In this paper we consider the problem of how a reinforcement learning agent that is tasked with solving a sequence of reinforcement learning problems (a sequence of Markov decision processes) can use knowledge acquired early in its lifetime to improve its ability to solve new problems. |

512 | Push-pull Feedback Implements Hierarchical Information Retrieval Efficiently | Xiao Liu, Xiaolong Zou, Zilong Ji, Gengshuo Tian, Yuanyuan Mi, Tiejun Huang, K. Y. Michael Wong, Si Wu | Here, we investigate the role of feedback in hierarchical information retrieval. |

513 | Learning Disentangled Representations for Recommendation | Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, Wenwu Zhu | In this paper, we present the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations from user behavior. |

514 | Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels | Simon S. Du, Kangcheng Hou, Russ R. Salakhutdinov, Barnabas Poczos, Ruosong Wang, Keyulu Xu | The current paper presents a new class of graph kernels, Graph Neural Tangent Kernels (GNTKs), which correspond to infinitely wide multi-layer GNNs trained by gradient descent. |

515 | In-Place Zero-Space Memory Protection for CNN | Hui Guan, Lin Ning, Zhen Lin, Xipeng Shen, Huiyang Zhou, Seung-Hwan Lim | This paper introduces in-place zero-space ECC assisted with a new training scheme, weight distribution-oriented training. |

516 | Acceleration via Symplectic Discretization of High-Resolution Differential Equations | Bin Shi, Simon S. Du, Weijie Su, Michael I. Jordan | We study first-order optimization algorithms obtained by discretizing ordinary differential equations (ODEs) corresponding to Nesterov’s accelerated gradient methods (NAGs) and Polyak’s heavy-ball method. |

517 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, Quoc V. Le | In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. |

518 | Comparison Against Task Driven Artificial Neural Networks Reveals Functional Properties in Mouse Visual Cortex | Jianghong Shi, Eric Shea-Brown, Michael Buice | We find that the comparison procedure is robust to different choices of stimuli set and the level of sub-sampling that one might expect in a large scale brain survey with thousands of neurons. |

519 | Variance Reduced Policy Evaluation with Smooth Function Approximation | Hoi-To Wai, Mingyi Hong, Zhuoran Yang, Zhaoran Wang, Kexin Tang | We formulate the policy evaluation problem as a non-convex primal-dual, finite-sum optimization problem, whose primal sub-problem is non-convex and dual sub-problem is strongly concave. |

520 | Learning GANs and Ensembles Using Discrepancy | Ben Adlam, Corinna Cortes, Mehryar Mohri, Ningshan Zhang | We present efficient algorithms using discrepancy for two tasks: training a GAN directly, namely DGAN, and mixing previously trained generative models, namely EDGAN. |

521 | Co-Generation with GANs using AIS based HMC | Tiantian Fang, Alexander Schwing | Therefore, in this paper, we study the occurring challenges for co-generation with GANs. |

522 | AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification | Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, Shanfeng Zhu | We propose a new label tree-based deep learning model for XMTC, called AttentionXML, with two unique features: 1) a multi-label attention mechanism with raw text as input, which allows to capture the most relevant part of text to each label; and 2) a shallow and wide probabilistic label tree (PLT), which allows to handle millions of labels, especially for “tail labels”. |

523 | Addressing Sample Complexity in Visual Tasks Using HER and Hallucinatory GANs | Himanshu Sahni, Toby Buckley, Pieter Abbeel, Ilya Kuzovkin | In this work, we show how visual trajectories can be hallucinated to appear successful by altering agent observations using a generative model trained on relatively few snapshots of the goal. |

524 | Abstract Reasoning with Distracting Features | Kecheng Zheng, Zheng-Jun Zha, Wei Wei | Inspired by this fact, we propose the feature robust abstract reasoning (FRAR) model, which consists of a reinforcement learning based teacher network to determine the sequence of training and a student network for predictions. |

525 | Generalized Block-Diagonal Structure Pursuit: Learning Soft Latent Task Assignment against Negative Transfer | Zhiyong Yang, Qianqian Xu, Yangbangyan Jiang, Xiaochun Cao, Qingming Huang | To circumvent this issue, we propose a novel multi-task learning method, which simultaneously learns latent task representations and a block-diagonal Latent Task Assignment Matrix (LTAM). |

526 | Adversarial Training and Robustness for Multiple Perturbations | Florian Tramer, Dan Boneh | Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. |

527 | Doubly-Robust Lasso Bandit | Gi-Soo Kim, Myunghee Cho Paik | We consider the stochastic linear contextual bandit problem and propose a novel algorithm, namely the Doubly-Robust Lasso Bandit algorithm, which exploits the sparse structure of the regression parameter as in Lasso, while blending the doubly-robust technique used in missing data literature. |

528 | DM2C: Deep Mixed-Modal Clustering | Yangbangyan Jiang, Qianqian Xu, Zhiyong Yang, Xiaochun Cao, Qingming Huang | In this paper, we consider a more challenging task where each instance is represented in only one modality, which we call mixed-modal data. |

529 | MaCow: Masked Convolutional Generative Flow | Xuezhe Ma, Xiang Kong, Shanghang Zhang, Eduard Hovy | In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture of generative flow using masked convolution. |

530 | Learning by Abstraction: The Neural State Machine | Drew Hudson, Christopher D. Manning | We introduce the Neural State Machine, seeking to bridge the gap between the neural and symbolic views of AI and integrate their complementary strengths for the task of visual reasoning. |

531 | Adaptive Gradient-Based Meta-Learning Methods | Mikhail Khodak, Maria-Florina F. Balcan, Ameet S. Talwalkar | We build a theoretical framework for designing and understanding practical meta-learning methods that integrates sophisticated formalizations of task-similarity with the extensive literature on online convex optimization and sequential prediction algorithms. |

532 | Equipping Experts/Bandits with Long-term Memory | Kai Zheng, Haipeng Luo, Ilias Diakonikolas, Liwei Wang | We propose the first black-box approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth, 2002, by reducing the problem to achieving typical switching regret. |

533 | A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning | Wenhao Yang, Xiang Li, Zhihua Zhang | We propose and study a general framework for regularized Markov decision processes (MDPs) where the goal is to find an optimal policy that maximizes the expected discounted total reward plus a policy regularization term. |

534 | Scalable inference of topic evolution via models for latent geometric structures | Mikhail Yurochkin, Zhiwei Fan, Aritra Guha, Paraschos Koutris, XuanLong Nguyen | We develop new models and algorithms for learning the temporal dynamics of the topic polytopes and related geometric objects that arise in topic model based inference. |

535 | Effective End-to-end Unsupervised Outlier Detection via Inlier Priority of Discriminative Network | Siqi Wang, Yijie Zeng, Xinwang Liu, En Zhu, Jianping Yin, Chuanfu Xu, Marius Kloft | In this paper, we propose a framework named E^3Outlier, which can perform UOD in a both effective and end-to-end manner: First, instead of the commonly-used autoencoders in previous end-to-end UOD methods, E^3Outlier for the first time leverages a discriminative DNN for better representation learning, by using surrogate supervision to create multiple pseudo classes from original unlabelled data. |

536 | Deep Active Learning with a Neural Architecture Search | Yonatan Geifman, Ran El-Yaniv | We challenge this assumption and propose a novel active strategy whereby the learning algorithm searches for effective architectures on the fly, while actively learning. |

537 | Efficiently escaping saddle points on manifolds | Christopher Criscitiello, Nicolas Boumal | Generalizing Jin et al.’s recent work on perturbed gradient descent (PGD) for optimization on linear spaces [How to Escape Saddle Points Efficiently (2017), Stochastic Gradient Descent Escapes Saddle Points Efficiently (2019)], we study a version of perturbed Riemannian gradient descent (PRGD) to show that necessary optimality conditions can be met approximately with high probability, without evaluating the Hessian. |

538 | AutoAssist: A Framework to Accelerate Training of Deep Neural Networks | Jiong Zhang, Hsiang-Fu Yu, Inderjit S. Dhillon | In this paper, we propose AutoAssist, a simple framework to accelerate training of a deep neural network. |

539 | DFNets: Spectral CNNs for Graphs with Feedback-Looped Filters | W. O. K. Asiri Suranga Wijesinghe, Qing Wang | We propose a novel spectral convolutional neural network (CNN) model on graph structured data, namely Distributed Feedback-Looped Networks (DFNets). |

540 | Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning | Wonjae Kim, Yoonho Lee | We propose Dynamics of Attention for Focus Transition (DAFT) as a human prior for machine reasoning. |

541 | Comparing Unsupervised Word Translation Methods Step by Step | Mareike Hartmann, Yova Kementchedjhieva, Anders Søgaard | We focus on the first step and compare distribution matching techniques in the context of language pairs for which mixed training stability and evaluation scores have been reported. |

542 | Learning from Bad Data via Generation | Tianyu Guo, Chang Xu, Boxin Shi, Chao Xu, Dacheng Tao | We suppose the real data distribution lies in a distribution set supported by the empirical distribution of bad data. A worst-case formulation can be developed over this distribution set, and then be interpreted as a generation task in an adversarial manner. |

543 | Constrained deep neural network architecture search for IoT devices accounting for hardware calibration | Florian Scheidegger, Luca Benini, Costas Bekas, A. Cristiano I. Malossi | We propose a unique narrow-space architecture search that focuses on delivering low-cost and rapidly executing networks that respect strict memory and time requirements typical of Internet-of-Things (IoT) near-sensor computing platforms. |

544 | Quantum Entropy Scoring for Fast Robust Mean Estimation and Improved Outlier Detection | Yihe Dong, Samuel Hopkins, Jerry Li | We study two problems in high-dimensional robust statistics: *robust mean estimation* and *outlier detection*. |

545 | Iterative Least Trimmed Squares for Mixed Linear Regression | Yanyao Shen, Sujay Sanghavi | In this paper, we analyze ILTS in the setting of mixed linear regression with corruptions (MLR-C). |

546 | Dynamic Ensemble Modeling Approach to Nonstationary Neural Decoding in Brain-Computer Interfaces | Yu Qi, Bin Liu, Yueming Wang, Gang Pan | We propose a dynamic ensemble modeling (DyEnsemble) approach that is capable of adapting to changes in neural signals by employing a proper combination of decoding functions. |

547 | Divergence-Augmented Policy Optimization | Qing Wang, Yingru Li, Jiechao Xiong, Tong Zhang | This paper introduces a method to stabilize policy optimization when off-policy data are reused. |

548 | Intrinsic dimension of data representations in deep neural networks | Alessio Ansuini, Alessandro Laio, Jakob H. Macke, Davide Zoccolan | Here we study the intrinsic dimensionality (ID) of data representations, i.e. the minimal number of parameters needed to describe a representation. |

549 | Towards a Zero-One Law for Column Subset Selection | Zhao Song, David Woodruff, Peilin Zhong | In this work we give approximation algorithms for *every* function $g$ which is approximately monotone and satisfies an approximate triangle inequality, and we show both of these conditions are necessary. |

550 | Compositional De-Attention Networks | Yi Tay, Anh Tuan Luu, Aston Zhang, Shuohang Wang, Siu Cheung Hui | This paper proposes a new quasi-attention that is compositional in nature, i.e., learning whether to *add*, *subtract* or *nullify* a certain vector when learning representations. |

551 | Dual Adversarial Semantics-Consistent Network for Generalized Zero-Shot Learning | Jian Ni, Shanghang Zhang, Haiyong Xie | To address these limitations, we propose a Dual Adversarial Semantics-Consistent Network (referred to as DASCN), which learns both primal and dual Generative Adversarial Networks (GANs) in a unified framework for GZSL. |

552 | Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers | Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang | In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. |

553 | Mining GOLD Samples for Conditional GANs | Sangwoo Mo, Chiheon Kim, Sungwoong Kim, Minsu Cho, Jinwoo Shin | We introduce a simple yet effective approach to improving cGANs by measuring the discrepancy between the data distribution and the model distribution on given samples. |

554 | Deep Model Transferability from Attribution Maps | Jie Song, Yixin Chen, Xinchao Wang, Chengchao Shen, Mingli Song | In this paper, we propose an embarrassingly simple yet very efficacious approach to estimating the transferability of deep networks, especially those handling vision tasks. |

555 | Fully Parameterized Quantile Function for Distributional Reinforcement Learning | Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, Tie-Yan Liu | In this paper, we propose fully parameterized quantile function that parameterizes both the quantile fraction axis (i.e., the x-axis) and the value axis (i.e., y-axis) for distributional RL. |

556 | Direct Optimization through $\arg\max$ for Discrete Variational Auto-Encoder | Guy Lorberbom, Tommi Jaakkola, Andreea Gane, Tamir Hazan | In contrast to previous works which resort to *softmax*-based relaxations, we propose to optimize it directly by applying the *direct loss minimization* approach. |

557 | Distributional Reward Decomposition for Reinforcement Learning | Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, Guangwen Yang | In this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm which captures the multiple reward channel structure under distributional setting. |

558 | L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise | Yilun Xu, Peng Cao, Yuqing Kong, Yizhou Wang | In this paper, we propose a novel information-theoretic loss function, L_DMI, for training deep neural networks robust to label noise. |

559 | Convergence Guarantees for Adaptive Bayesian Quadrature Methods | Motonobu Kanagawa, Philipp Hennig | In this work, for a broad class of adaptive Bayesian quadrature methods, we prove consistency, deriving non-tight but informative convergence rates. |

560 | Progressive Augmentation of GANs | Dan Zhang, Anna Khoreva | To mitigate this issue we introduce a new regularization technique – progressive augmentation of GANs (PA-GAN). |

561 | UniXGrad: A Universal, Adaptive Algorithm with Optimal Guarantees for Constrained Optimization | Ali Kavis, Kfir Y. Levy, Francis Bach, Volkan Cevher | We propose a novel adaptive, accelerated algorithm for the stochastic constrained convex optimization setting. Our method, which is inspired by the Mirror-Prox method, *simultaneously* achieves the optimal rates for smooth/non-smooth problems with either deterministic/stochastic first-order oracles. |

562 | Meta-Surrogate Benchmarking for Hyperparameter Optimization | Aaron Klein, Zhenwen Dai, Frank Hutter, Neil Lawrence, Javier Gonzalez | This work proposes a method to alleviate these issues by means of a meta-surrogate model for HPO tasks trained on off-line generated data. |

563 | Learning to Perform Local Rewriting for Combinatorial Optimization | Xinyun Chen, Yuandong Tian | In this paper, we propose NeuRewriter that learns a policy to pick heuristics and rewrite the local components of the current solution to iteratively improve it until convergence. |

564 | Anti-efficient encoding in emergent communication | Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, Marco Baroni | We study whether the same pattern emerges when two neural networks, a “speaker” and a “listener”, are trained to play a signaling game. |

565 | Singleshot: a scalable Tucker tensor decomposition | Abraham Traore, Maxime Berar, Alain Rakotomamonjy | This paper introduces a new approach for the scalable Tucker decomposition problem. |

566 | Neural Machine Translation with Soft Prototype | Yiren Wang, Yingce Xia, Fei Tian, Fei Gao, Tao Qin, Cheng Xiang Zhai, Tie-Yan Liu | In this work, we propose a new framework that introduces a soft prototype into the encoder-decoder architecture, which allows the decoder to have indirect access to both past and future information, such that each target word can be generated based on the better global understanding. |

567 | Reliable training and estimation of variance networks | Nicki Skafte, Martin Jørgensen, Søren Hauberg | We propose and investigate new complementary methodologies for estimating predictive variance networks in regression neural networks. |

568 | Copula Multi-label Learning | Weiwei Liu | In particular, the paper first leverages the kernel trick to construct continuous distribution in the output space, and then estimates our proposed model semiparametrically where the copula is modeled parametrically, while the marginal distributions are modeled nonparametrically. |

569 | Bayesian Learning of Sum-Product Networks | Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, Zoubin Ghahramani | In this paper, we introduce a well-principled Bayesian framework for SPN structure learning. |

570 | Bayesian Batch Active Learning as Sparse Subset Approximation | Robert Pinsler, Jonathan Gordon, Eric Nalisnick, José Miguel Hernández-Lobato | In this paper, we introduce a novel Bayesian batch active learning approach that mitigates these issues. |

571 | Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation | Zengfeng Huang, Ziyue Huang, Yilei Wang, Ke Yi | We propose a new sparsity-aware algorithm, which improves previous results both theoretically and empirically. |

572 | Global Sparse Momentum SGD for Pruning Very Deep Neural Networks | Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, Ji Liu | In this paper, we propose a novel momentum-SGD-based optimization method to reduce the network complexity by on-the-fly pruning. |

573 | Variational Bayesian Decision-making for Continuous Utilities | Tomasz Kusmierczyk, Joseph Sakaya, Arto Klami | We present an automatic pipeline that co-opts continuous utilities into variational inference algorithms to account for decision-making. |

574 | The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks | Ryo Karakida, Shotaro Akaho, Shun-ichi Amari | We reveal that batch normalization in the last layer contributes to drastically decreasing such pathological sharpness if the width and sample number satisfy a specific condition. |

575 | Single-Model Uncertainties for Deep Learning | Natasa Tagasovska, David Lopez-Paz | To estimate aleatoric uncertainty, we propose Simultaneous Quantile Regression (SQR), a loss function to learn all the conditional quantiles of a given target variable. |

576 | Is Deeper Better only when Shallow is Good? | Eran Malach, Shai Shalev-Shwartz | In this work we explore the relation between expressivity properties of deep networks and the ability to train them efficiently using gradient-based algorithms. |

577 | Wasserstein Weisfeiler-Lehman Graph Kernels | Matteo Togninalli, Elisabetta Ghisu, Felipe Llinares-López, Bastian Rieck, Karsten Borgwardt | We propose a novel method that relies on the Wasserstein distance between the node feature vector distributions of two graphs, which allows to find subtler differences in data sets by considering graphs as high-dimensional objects, rather than simple means. |

578 | Domain Generalization via Model-Agnostic Learning of Semantic Features | Qi Dou, Daniel Coelho de Castro, Konstantinos Kamnitsas, Ben Glocker | We investigate the challenging problem of domain generalization, i.e., training a model on multi-domain source data such that it can directly generalize to target domains with unknown statistics. |

579 | Grid Saliency for Context Explanations of Semantic Segmentation | Lukas Hoyer, Mauricio Munoz, Prateek Katiyar, Anna Khoreva, Volker Fischer | To overcome this limitation, we extend the existing approaches to generate grid saliencies, which provide spatially coherent visual explanations for (pixel-level) dense prediction networks. |

580 | First-order methods almost always avoid saddle points: The case of vanishing step-sizes | Ioannis Panageas, Georgios Piliouras, Xiao Wang | In this paper, we resolve this question in the affirmative for gradient descent, mirror descent, manifold descent and proximal point. |

581 | Maximum Mean Discrepancy Gradient Flow | Michael Arbel, Anna Korba, Adil Salim, Arthur Gretton | We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties. |

582 | Oblivious Sampling Algorithms for Private Data Analysis | Sajin Sasy, Olga Ohrimenko | We study secure and privacy-preserving data analysis based on queries executed on samples from a dataset. |

583 | Semi-supervisedly Co-embedding Attributed Networks | Zaiqiao Meng, Shangsong Liang, Jinyuan Fang, Teng Xiao | In this paper, to deal with the problem, we present a semi-supervised co-embedding model for attributed networks (SCAN) based on the generalized SVAE for the heterogeneous data, which collaboratively learns low-dimensional vector representations of both nodes and attributes for partially labelled attributed networks semi-supervisedly. |

584 | From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI | Roman Beliy, Guy Gaziv, Assaf Hoogi, Francesca Strappini, Tal Golan, Michal Irani | We present a novel approach which, in addition to the scarce labeled data (training pairs), allows to train fMRI-to-image reconstruction networks also on “unlabeled” data (i.e., images without fMRI recording, and fMRI recording without images). |

585 | Copulas as High-Dimensional Generative Models: Vine Copula Autoencoders | Natasa Tagasovska, Damien Ackerer, Thibault Vatter | We introduce the vine copula autoencoder (VCAE), a flexible generative model for high-dimensional distributions built in a straightforward three-step procedure. |

586 | Nonstochastic Multiarmed Bandits with Unrestricted Delays | Tobias Sommer Thune, Nicolò Cesa-Bianchi, Yevgeny Seldin | We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. |

587 | BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling | Lars Maaløe, Marco Fraccaro, Valentin Liévin, Ole Winther | In this paper we close the performance gap by constructing VAE models that can effectively utilize a deep hierarchy of stochastic variables and model complex covariance structures. |

588 | Code Generation as a Dual Task of Code Summarization | Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, Zhi Jin | In this paper, we apply the relations between two tasks to improve the performance of both tasks. |

589 | Diffeomorphic Temporal Alignment Nets | Ron A. Shapira Weber, Matan Eyal, Nicki Skafte, Oren Shriki, Oren Freifeld | Here we propose the Diffeomorphic Temporal alignment Net (DTAN), a learning-based method for time-series joint alignment. |

590 | Weakly Supervised Instance Segmentation using the Bounding Box Tightness Prior | Cheng-Chun Hsu, Kuang-Jui Hsu, Chung-Chi Tsai, Yen-Yu Lin, Yung-Yu Chuang | This paper presents a weakly supervised instance segmentation method that consumes training data with tight bounding box annotations. |

591 | On the Power and Limitations of Random Features for Understanding Neural Networks | Gilad Yehudai, Ohad Shamir | In this paper, we formalize the link between existing results and random features, and argue that despite the impressive positive results, random feature approaches are also inherently limited in what they can explain. |

592 | Efficient Pure Exploration in Adaptive Round model | Tianyuan Jin, Jieming Shi, Xiaokui Xiao, Enhong Chen | In this paper, we study both PAC and exact top-$k$ arm identification problems and design efficient algorithms considering both round complexity and query complexity. |

593 | Multi-objects Generation with Amortized Structural Regularization | Taufik Xu, Chongxuan Li, Jun Zhu, Bo Zhang | In this paper, we propose amortized structural regularization (ASR), which adopts posterior regularization (PR) to embed human knowledge into DGMs via a set of structural constraints. |

594 | Neural Shuffle-Exchange Networks – Sequence Processing in O(n log n) Time | Karlis Freivalds, Emils Ozoliņš, Agris Šostaks | We introduce a new Shuffle-Exchange neural network model for sequence to sequence tasks which has O(log n) depth and O(n log n) total complexity. |

595 | DetNAS: Backbone Search for Object Detection | Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, Jian Sun | In this work, we present DetNAS to use Neural Architecture Search (NAS) for the design of better backbones for object detection. |

596 | Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates | Adil Salim, Dmitry Kovalev, Peter Richtarik | We propose a new algorithm, the Stochastic Proximal Langevin Algorithm (SPLA), for sampling from a log concave distribution. |

597 | Fast AutoAugment | Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, Sungwoong Kim | In this paper, we propose an algorithm called Fast AutoAugment that finds effective augmentation policies via a more efficient search strategy based on density matching. |

598 | On the Convergence Rate of Training Recurrent Neural Networks | Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song | More importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. |

599 | Interval timing in deep reinforcement learning agents | Ben Deverett, Ryan Faulkner, Meire Fortunato, Gregory Wayne, Joel Z. Leibo | In artificial agents, little work has directly addressed (1) which architectural components are necessary for successful development of this ability, (2) how this timing ability comes to be represented in the units and actions of the agent, and (3) whether the resulting behavior of the system converges on solutions similar to those of biology. |

600 | Graph-based Discriminators: Sample Complexity and Expressiveness | Roi Livni, Yishay Mansour | For $k\geq 2$ we introduce a notion similar to the VC-dimension, and show that it controls the sample complexity. |

601 | Large Scale Structure of Neural Network Loss Landscapes | Stanislav Fort, Stanislaw Jastrzebski | We propose and experimentally verify a unified phenomenological model of the loss landscape that incorporates many of them. |

602 | Learning Nonsymmetric Determinantal Point Processes | Mike Gartrell, Victor-Emmanuel Brunel, Elvis Dohmatob, Syrine Krichene | We present a method that enables a tractable algorithm, based on maximum likelihood estimation, for learning nonsymmetric DPPs from data composed of observed subsets. |

603 | Hypothesis Set Stability and Generalization | Dylan J. Foster, Spencer Greenberg, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan | We present a study of generalization for data-dependent hypothesis sets. |

604 | Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds | Bo Yang, Jianan Wang, Ronald Clark, Qingyong Hu, Sen Wang, Andrew Markham, Niki Trigoni | We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. |

605 | Precision-Recall Balanced Topic Modelling | Seppo Virtanen, Mark Girolami | We formulate topic modelling as an information retrieval task, where the goal is, based on the latent topic representation, to capture relevant term co-occurrence patterns. |

606 | Learning Sparse Distributions using Iterative Hard Thresholding | Jacky Y. Zhang, Rajiv Khanna, Anastasios Kyrillidis, Oluwasanmi O. Koyejo | In this work, we consider IHT as a solution to the problem of learning sparse discrete distributions. |

607 | Discriminative Topic Modeling with Logistic LDA | Iryna Korshunova, Hanchen Xiong, Mateusz Fedoryszak, Lucas Theis | We propose logistic LDA, a novel discriminative variant of latent Dirichlet allocation which is easy to apply to arbitrary inputs. |

608 | Quantum Wasserstein Generative Adversarial Networks | Shouvanik Chakrabarti, Huang Yiming, Tongyang Li, Soheil Feizi, Xiaodi Wu | Inspired by previous studies on the adversarial training of classical and quantum generative models, we propose the first design of quantum Wasserstein Generative Adversarial Networks (WGANs), which has been shown to improve the robustness and the scalability of the adversarial training of quantum generative models even on noisy quantum hardware. |

609 | Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion | Joan Serrà, Santiago Pascual, Carlos Segura Perales | In this paper, we propose Blow, a single-scale normalizing flow using hypernetwork conditioning to perform many-to-many voice conversion between raw audio. |

610 | Hyperparameter Learning via Distributional Transfer | Ho Chung Law, Peilin Zhao, Leung Sing Chan, Junzhou Huang, Dino Sejdinovic | We propose to transfer information across tasks using learnt representations of training datasets used in those tasks. |

611 | Discriminator optimal transport | Akinori Tanaka | Based on some experiments and a bit of OT theory, we propose discriminator optimal transport (DOT) scheme to improve generated images. |

612 | High-dimensional multivariate forecasting with low-rank Gaussian Copula Processes | David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, Jan Gasthaus | We propose to combine an RNN-based time series model with a Gaussian copula process output model with a low-rank covariance structure to reduce the computational complexity and handle non-Gaussian marginal distributions. |

613 | Are Anchor Points Really Indispensable in Label-Noise Learning? | Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, Masashi Sugiyama | In this paper, without employing anchor points, we propose a transition-revision (T-Revision) method to effectively learn transition matrices, leading to better classifiers. |

614 | Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations | Fenglin Liu, Yuanxin Liu, Xuancheng Ren, Xiaodong He, Xu Sun | In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. |

615 | Differentiable Ranking and Sorting using Optimal Transport | Marco Cuturi, Olivier Teboul, Jean-Philippe Vert | We propose in this paper to replace the usual sort procedure with a differentiable proxy. |

616 | Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks | Gaël Letarte, Pascal Germain, Benjamin Guedj, Francois Laviolette | We present a comprehensive study of multilayer neural networks with binary activation, relying on the PAC-Bayesian theory. |

617 | Likelihood-Free Overcomplete ICA and Applications In Causal Discovery | Chenwei Ding, Mingming Gong, Kun Zhang, Dacheng Tao | To tackle these problems, we present a Likelihood-Free Overcomplete ICA algorithm (LFOICA) that estimates the mixing matrix directly by back-propagation without any explicit assumptions on the density function of independent components. |

618 | Interior-Point Methods Strike Back: Solving the Wasserstein Barycenter Problem | DongDong Ge, Haoyue Wang, Zikai Xiong, Yinyu Ye | In this paper, we overcome the difficulty by developing a new adapted interior-point method that fully exploits the problem’s special matrix structure to reduce the iteration complexity and speed up the Newton procedure. |

619 | Beyond Vector Spaces: Compact Data Representation as Differentiable Weighted Graphs | Denis Mazur, Vage Egiazarian, Stanislav Morozov, Artem Babenko | In this paper, we aim to eliminate the inductive bias imposed by the embedding space geometry. |

620 | Subspace Detours: Building Transport Plans that are Optimal on Subspace Projections | Boris Muzellec, Marco Cuturi | We propose in this work two methods to extrapolate, from a transport map that is optimal on a subspace, one that is nearly optimal in the entire space. |

621 | Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent | Huizhuo Yuan, Xiangru Lian, Chris Junchi Li, Ji Liu, Wenqing Hu | In this paper, we investigate the stochastic compositional optimization in the general smooth non-convex setting. |

622 | On the convergence of single-call stochastic extra-gradient methods | Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos | In this paper, we develop a synthetic view of such algorithms, and we complement the existing literature by showing that they retain a $O(1/t)$ ergodic convergence rate in smooth, deterministic problems. |

623 | Infra-slow brain dynamics as a marker for cognitive function and decline | Shagun Ajmera Shyam Sunder Ajmera, Shreya Rajagopal, Razi Rehman, Devarajan Sridharan | We investigated this question with a novel application of Gaussian Process Factor Analysis (GPFA) and machine learning to fMRI data. |

624 | Robust Principal Component Analysis with Adaptive Neighbors | Rui Zhang, Hanghang Tong | To tackle the issue, we propose a general framework namely robust weight learning with adaptive neighbors (RWL-AN), via which adaptive weight vector is automatically obtained with both robustness and sparse neighbors. |

625 | High-Quality Self-Supervised Deep Image Denoising | Samuli Laine, Tero Karras, Jaakko Lehtinen, Timo Aila | We describe a novel method for training high-quality image denoising models based on unorganized collections of corrupted images. |

626 | Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup | Sebastian Goldt, Madhu Advani, Andrew M. Saxe, Florent Krzakala, Lenka Zdeborová | We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. |

627 | GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs | Yuan Liu, Zehong Shen, Zhixuan Lin, Sida Peng, Hujun Bao, Xiaowei Zhou | In this paper, we introduce a novel visual descriptor named Group Invariant Feature Transform (GIFT), which is both discriminative and robust to geometric transformations. |

628 | Online Prediction of Switching Graph Labelings with Cluster Specialists | Mark Herbster, James Robinson | We present an algorithm based on a specialist approach; we develop the machinery of cluster specialists which probabilistically exploits the cluster structure in the graph. |

629 | Graph-Based Semi-Supervised Learning with Non-ignorable Non-response | Fan Zhou, Tengfei Li, Haibo Zhou, Hongtu Zhu, Ye Jieping | To solve the problem, we propose a Graph-based joint model with Non-ignorable Non-response (GNN), followed by a joint inverse weighting estimation procedure incorporated with sampling imputation approach. |

630 | BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning | Andreas Kirsch, Joost van Amersfoort, Yarin Gal | We develop BatchBALD, a tractable approximation to the mutual information between a batch of points and model parameters, which we use as an acquisition function to select multiple informative points jointly for the task of deep Bayesian active learning. |

631 | A Mean Field Theory of Quantized Deep Networks: The Quantization-Depth Trade-Off | Yaniv Blumenfeld, Dar Gilboa, Daniel Soudry | We apply mean field techniques to networks with quantized activations in order to evaluate the degree to which quantization degrades signal propagation at initialization. |

632 | Beyond Confidence Regions: Tight Bayesian Ambiguity Sets for Robust MDPs | Marek Petrik, Reazul Hasan Russel | This paper proposes a new paradigm that can achieve better solutions with the same robustness guarantees without using confidence regions as ambiguity sets. |

633 | Cross-lingual Language Model Pretraining | Alexis CONNEAU, Guillaume Lample | In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. |

634 | Approximate Bayesian Inference for a Mechanistic Model of Vesicle Release at a Ribbon Synapse | Cornelius Schröder, Ben James, Leon Lagnado, Philipp Berens | Here, we develop an approximate Bayesian inference scheme for a fully stochastic, biophysically inspired model of glutamate release at the ribbon synapse, a highly specialized synapse found in different sensory systems. |

635 | Updates of Equilibrium Prop Match Gradients of Backprop Through Time in an RNN with Static Input | Maxence Ernoult, Benjamin Scellier, Yoshua Bengio, Damien Querlioz, Julie Grollier | In this work, we introduce a discrete-time version of EP with simplified equations and with reduced simulation time, bringing EP closer to practical machine learning tasks. |

636 | Universal Invariant and Equivariant Graph Neural Networks | Nicolas Keriven, Gabriel Peyré | In this paper, we consider a specific class of invariant and equivariant networks, for which we prove new universality theorems. |

637 | Are sample means in multi-armed bandits positively or negatively biased? | Jaehyeok Shin, Aaditya Ramdas, Alessandro Rinaldo | In this paper, we decouple three different sources of this selection bias: adaptive *sampling* of arms, adaptive *stopping* of the experiment, and adaptively *choosing* which arm to study. |

638 | On the Correctness and Sample Complexity of Inverse Reinforcement Learning | Abi Komanduru, Jean Honorio | An L1-regularized Support Vector Machine formulation of the IRL problem motivated by the geometric analysis is then proposed with the basic objective of the inverse reinforcement problem in mind: to find a reward function that generates a specified optimal policy. |

639 | VIREL: A Variational Inference Framework for Reinforcement Learning | Matthew Fellows, Anuj Mahajan, Tim G. J. Rudner, Shimon Whiteson | We propose VIREL, a theoretically grounded probabilistic inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP, generalising existing approaches. |

640 | First Order Motion Model for Image Animation | Aliaksandr Siarohin, Stephane Lathuillere, Sergey Tulyakov, Elisa Ricci, Nicu Sebe | Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. |

641 | Tensor Monte Carlo: Particle Methods for the GPU era | Laurence Aitchison | To address these issues, we developed tensor Monte-Carlo (TMC) which gives exponentially many importance samples by separately drawing samples for each of the latent variables, then averaging over all possible combinations. |

642 | Unsupervised Emergence of Egocentric Spatial Structure from Sensorimotor Prediction | Alban Laflaquière, Michael Garcia Ortiz | We propose a simple sensorimotor predictive scheme, apply it to different agents and types of exploration, and evaluate the pertinence of these hypotheses. |

643 | Learning from Label Proportions with Generative Adversarial Networks | Jiabin Liu, Bo Wang, Zhiquan Qi, YingJie Tian, Yong Shi | In this paper, we leverage generative adversarial networks (GANs) to derive an effective algorithm LLP-GAN for learning from label proportions (LLP), where only the bag-level proportional information in labels is available. |

644 | Efficient and Thrifty Voting by Any Means Necessary | Debmalya Mandal, Ariel D. Procaccia, Nisarg Shah, David Woodruff | We take an unorthodox view of voting by expanding the design space to include both the elicitation rule, whereby voters map their (cardinal) preferences to votes, and the aggregation rule, which transforms the reported votes into collective decisions. |

645 | PointDAN: A Multi-Scale 3D Domain Adaption Network for Point Cloud Representation | Can Qin, Haoxuan You, Lichen Wang, C.-C. Jay Kuo, Yun Fu | In this paper, we propose a novel 3D Domain Adaptation Network for point cloud data (PointDAN). |

646 | ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization | Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, David Cox | In this paper, we propose a zeroth-order AdaMM (ZO-AdaMM) algorithm, that generalizes AdaMM to the gradient-free regime. |

647 | Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning | Erwan Lecarpentier, Emmanuel Rachelson | This work tackles the problem of robust zero-shot planning in non-stationary stochastic environments. |

648 | Depth-First Proof-Number Search with Heuristic Edge Cost and Application to Chemical Synthesis Planning | Akihiro Kishimoto, Beat Buesser, Bei Chen, Adi Botea | We address this disadvantage of DFPN in RA with a novel approach to combine DFPN with Heuristic Edge Initialization. |

649 | Toward a Characterization of Loss Functions for Distribution Learning | Nika Haghtalab, Cameron Musco, Bo Waggoner | In this work we study loss functions for learning and evaluating probability distributions over large discrete domains. |

650 | Coresets for Archetypal Analysis | Sebastian Mair, Ulf Brefeld | In this paper, we propose efficient coresets for archetypal analysis. |

651 | Emergence of Object Segmentation in Perturbed Generative Models | Adam Bielski, Paolo Favaro | We introduce a novel framework to build a model that can learn how to segment objects from a collection of images without any human annotation. |

652 | Optimal Sparse Decision Trees | Xiyang Hu, Cynthia Rudin, Margo Seltzer | This work introduces the first practical algorithm for optimal decision trees for binary variables. |

653 | Escaping from saddle points on Riemannian manifolds | Yue Sun, Nicolas Flammarion, Maryam Fazel | We consider minimizing a nonconvex, smooth function $f$ on a Riemannian manifold $\mathcal{M}$. |

654 | Multi-source Domain Adaptation for Semantic Segmentation | Sicheng Zhao, Bo Li, Xiangyu Yue, Yang Gu, Pengfei Xu, Runbo Hu, Hua Chai, Kurt Keutzer | In this paper, we propose to investigate multi-source domain adaptation for semantic segmentation. |

655 | Localized Structured Prediction | Carlo Ciliberto, Francis Bach, Alessandro Rudi | In this work we propose the first theoretical framework to deal with part-based data from a general perspective and study a novel method within the setting of statistical learning theory. |

656 | Nonzero-sum Adversarial Hypothesis Testing Games | Sarath Yasodharan, Patrick Loiseau | We study nonzero-sum hypothesis testing games that arise in the context of adversarial classification, in both the Bayesian as well as the Neyman-Pearson frameworks. |

657 | Manifold-regression to predict from MEG/EEG brain signals without source modeling | David Sabbagh, Pierre Ablin, Gael Varoquaux, Alexandre Gramfort, Denis A. Engemann | In this article, we focus on the task of regression with rank-reduced covariance matrices. |

658 | Modeling Tabular data using Conditional GAN | Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni | We design CTGAN, which uses a conditional generative adversarial network to address these challenges. |

659 | Normalization Helps Training of Quantized LSTM | Lu Hou, Jinhua Zhu, James Kwok, Fei Gao, Tao Qin, Tie-Yan Liu | In this paper, we first show theoretically that training a quantized LSTM is difficult because quantization makes the exploding gradient problem more severe, particularly when the LSTM weight matrices are large. We then show that the popularly used weight/layer/batch normalization schemes can help stabilize the gradient magnitude in training quantized LSTMs. |

660 | Trajectory of Alternating Direction Method of Multipliers and Adaptive Acceleration | Clarice Poon, Jingwei Liang | By studying the geometric properties of ADMM, we discuss the limitations of current inertial accelerated ADMM and then present and analyze an adaptive acceleration scheme for the method. |

661 | Deep Scale-spaces: Equivariance Over Scale | Daniel Worrall, Max Welling | We introduce deep scale-spaces, a generalization of convolutional neural networks, exploiting the scale symmetry structure of conventional image recognition tasks. |

662 | GRU-ODE-Bayes: Continuous Modeling of Sporadically-Observed Time Series | Edward De Brouwer, Jaak Simm, Adam Arany, Yves Moreau | To address these challenges, we propose (1) a continuous-time version of the Gated Recurrent Unit, building upon the recent Neural Ordinary Differential Equations (Chen et al., 2018), and (2) a Bayesian update network that processes the sporadic observations. |

663 | Estimating Convergence of Markov chains with L-Lag Couplings | Niloy Biswas, Pierre E. Jacob, Paul Vanetti | We introduce L-lag couplings to generate computable, non-asymptotic upper bound estimates for the total variation or the Wasserstein distance of general Markov chains. |

664 | Learning-Based Low-Rank Approximations | Piotr Indyk, Ali Vakilian, Yang Yuan | We introduce a “learning-based” algorithm for the low-rank decomposition problem: given an $n \times d$ matrix $A$, and a parameter $k$, compute a rank-$k$ matrix $A'$ that minimizes the approximation loss $\|A-A'\|_F$. |

665 | Implicit Regularization in Deep Matrix Factorization | Sanjeev Arora, Nadav Cohen, Wei Hu, Yuping Luo | We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. |

666 | List-decodable Linear Regression | Sushrut Karmalkar, Adam Klivans, Pravesh Kothari | To solve the problem we introduce a new framework for list-decodable learning that strengthens the “identifiability to algorithms” paradigm based on the sum-of-squares method. |

667 | Learning elementary structures for 3D shape generation and matching | Theo Deprelle, Thibault Groueix, Matthew Fisher, Vladimir Kim, Bryan Russell, Mathieu Aubry | More precisely, we present two complementary approaches to learn elementary structures in a deep learning framework: (i) continuous surface deformation learning and (ii) 3D structure points learning. |

668 | On the Hardness of Robust Classification | Pascale Gourdeau, Varun Kanade, Marta Kwiatkowska, James Worrell | In this paper we study the feasibility of robust learning from the perspective of computational learning theory, considering both sample and computational complexity. |

669 | Foundations of Comparison-Based Hierarchical Clustering | Debarghya Ghoshdastidar, Michaël Perrot, Ulrike von Luxburg | We address the classical problem of hierarchical clustering, but in a framework where one does not have access to a representation of the objects or their pairwise similarities. |

670 | What the Vec? Towards Probabilistically Grounded Embeddings | Carl Allen, Ivana Balazevic, Timothy Hospedales | We show that different interactions of PMI vectors encode semantic properties that can be captured in low dimensional word embeddings by suitable projection, theoretically explaining why the embeddings of W2V and Glove work, and, in turn, revealing an interesting mathematical interconnection between the semantic relationships of relatedness, similarity, paraphrase and analogy. |

671 | Minimizers of the Empirical Risk and Risk Monotonicity | Marco Loog, Tom Viering, Alexander Mey | Our work introduces the formal notion of risk monotonicity, which asks the risk to not deteriorate with increasing training set sizes in expectation over the training samples. |

672 | Explicit Planning for Efficient Exploration in Reinforcement Learning | Liangpeng Zhang, Ke Tang, Xin Yao | We argue that explicit planning for exploration can help alleviate such a problem, and propose a Value Iteration for Exploration Cost (VIEC) algorithm which computes the optimal exploration scheme by solving an augmented MDP. |

673 | Lower Bounds on Adversarial Robustness from Optimal Transport | Arjun Nitin Bhagoji, Daniel Cullina, Prateek Mittal | In this paper, we use optimal transport to characterize the maximum achievable accuracy in an adversarial classification scenario. |

674 | Neural Spline Flows | Conor Durkan, Artur Bekasov, Iain Murray, George Papamakarios | Building upon recent work, we propose a fully-differentiable module based on monotonic rational-quadratic splines, which enhances the flexibility of both coupling and autoregressive transforms while retaining analytic invertibility. |

675 | Phase Transitions and Cyclic Phenomena in Bandits with Switching Constraints | David Simchi-Levi, Yunzong Xu | We consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. |

676 | Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization | Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, Roeland Nusselder | In this paper, we argue that these latent weights cannot be treated analogously to weights in real-valued networks. |

677 | Nonlinear scaling of resource allocation in sensory bottlenecks | Laura Rose Edmondson, Alejandro Jimenez Rodriguez, Hannes P. Saal | Here, we show analytically and numerically that resource allocation scales nonlinearly in efficient coding models that maximize information transfer, when inputs arise from separate regions with different receptor densities. |

678 | Constrained Reinforcement Learning Has Zero Duality Gap | Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, Alejandro Ribeiro | This work provides theoretical support to these approaches by establishing that despite its non-convexity, this problem has zero duality gap, i.e., it can be solved exactly in the dual domain, where it becomes convex. |

679 | Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules | Niklas Gebauer, Michael Gastegger, Kristof Schütt | Here, we introduce a generative neural network for 3d point sets that respects the rotational invariance of the targeted structures. |

680 | An adaptive nearest neighbor rule for classification | Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund, Shay Moran | We introduce a variant of the $k$-nearest neighbor classifier in which $k$ is chosen adaptively for each query, rather than supplied as a parameter. |

681 | Coresets for Clustering with Fairness Constraints | Lingxiao Huang, Shaofeng Jiang, Nisheeth Vishnoi | The main contribution of this paper is an approach to clustering with fairness constraints that involve *multiple, non-disjoint* attributes, that is *also scalable*. |

682 | PerspectiveNet: A Scene-consistent Image Generator for New View Synthesis in Real Indoor Environments | Ben Graham, David Novotny, Jeremy Reizenstein | Given a set of reference RGBD views of an indoor environment, and a new viewpoint, our goal is to predict the view from that location. |

683 | MAVEN: Multi-Agent Variational Exploration | Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, Shimon Whiteson | In this paper, we analyse value-based methods that are known to have superior performance in complex environments. |

684 | Competitive Gradient Descent | Florian Schaefer, Anima Anandkumar | We introduce a new algorithm for the numerical computation of Nash equilibria of competitive two-player games. |

685 | Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses | Ulysse Marteau-Ferey, Francis Bach, Alessandro Rudi | In this paper, we study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. |

686 | Continual Unsupervised Representation Learning | Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, Raia Hadsell | In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. |

687 | Self-Routing Capsule Networks | Taeyoung Hahn, Myeongjang Pyeon, Gunhee Kim | In this work, we propose a novel and surprisingly simple routing strategy called self-routing where each capsule is routed independently by its subordinate routing network. |

688 | The Parameterized Complexity of Cascading Portfolio Scheduling | Eduard Eiben, Robert Ganian, Iyad Kanj, Stefan Szeider | In this paper we study the parameterized complexity of this problem and establish its fixed-parameter tractability by utilizing structural properties of the success relation between algorithms and test instances. |

689 | Maximum Expected Hitting Cost of a Markov Decision Process and Informativeness of Rewards | Falcon Dai, Matthew Walter | We propose a new complexity measure for Markov decision processes (MDPs), the maximum expected hitting cost (MEHC). |

690 | Bipartite expander Hopfield networks as self-decoding high-capacity error correcting codes | Rishidev Chaudhuri, Ila Fiete | We prove that it is possible to construct an associative content-addressable network that combines the properties of strong error correcting codes and Hopfield networks: it simultaneously possesses exponentially many stable states, these states are robust enough, with large enough basins of attraction that they can be correctly recovered despite errors in a finite fraction of all nodes, and the errors are intrinsically corrected by the network’s own dynamics. |

691 | Sequence Modeling with Unconstrained Generation Order | Dmitrii Emelianenko, Elena Voita, Pavel Serdyukov | In contrast, we propose a more general model that can generate the output sequence by inserting tokens in any arbitrary order. |

692 | Probabilistic Logic Neural Networks for Reasoning | Meng Qu, Jian Tang | In this paper, we propose the probabilistic Logic Neural Network (pLogicNet), which combines the advantages of both methods. |

693 | A Polynomial Time Algorithm for Log-Concave Maximum Likelihood via Locally Exponential Families | Brian Axelrod, Ilias Diakonikolas, Alistair Stewart, Anastasios Sidiropoulos, Gregory Valiant | Specifically, we present an algorithm which, given $n$ points in $\mathbb{R}^d$ and an accuracy parameter $\epsilon>0$, runs in time $\mathrm{poly}(n,d,1/\epsilon)$, and returns a log-concave distribution which, with high probability, has the property that the likelihood of the $n$ points under the returned distribution is at most an additive $\epsilon$ less than the maximum likelihood that could be achieved via any log-concave distribution. |

694 | A Unifying Framework for Spectrum-Preserving Graph Sparsification and Coarsening | Gecia Bravo Hermsdorff, Lee Gunderson | In this work, we provide a unifying framework that captures both of these operations, allowing one to simultaneously sparsify and coarsen a graph while preserving its large-scale structure. |

695 | Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond | Xuechen Li, Yi Wu, Lester Mackey | In this paper, we establish the convergence rate of sampling algorithms obtained by discretizing smooth Itô diffusions exhibiting fast $2$-Wasserstein contraction, based on local deviation properties of the integration scheme. |

696 | The Implicit Bias of AdaGrad on Separable Data | Qian Qian, Xiaoyuan Qian | We study the implicit bias of AdaGrad on separable linear classification problems. |

697 | On two ways to use determinantal point processes for Monte Carlo integration | Guillaume Gautier, Rémi Bardenet, Michal Valko | In this paper, we first take the EZ estimator out of the cellar, and analyze it using modern arguments. Second, we provide an efficient implementation to sample exactly a particular multidimensional DPP called multivariate Jacobi ensemble. |

698 | LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition | Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, Larry S. Davis | This paper presents LiteEval, a simple yet effective coarse-to-fine framework for resource efficient video recognition, suitable for both online and offline scenarios. |

699 | How degenerate is the parametrization of neural networks with the ReLU activation function? | Dennis Maximilian Elbrächter, Julius Berner, Philipp Grohs | We present pathologies which prevent inverse stability in general, and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability w.r.t. a Sobolev norm. |

700 | Spike-Train Level Backpropagation for Training Deep Recurrent Spiking Neural Networks | Wenrui Zhang, Peng Li | To enable supervised training of RSNNs under a well-defined loss function, we present a novel Spike-Train level RSNNs Backpropagation (ST-RSBP) algorithm for training deep RSNNs. |

701 | Re-examination of the Role of Latent Variables in Sequence Modeling | Guokun Lai, Zihang Dai, Yiming Yang, Shinjae Yoo | Our analysis reveals that under the restriction of fully factorized output distribution in previous evaluations, the stochastic variants were implicitly leveraging intra-step correlation but the deterministic recurrent baselines were prohibited from doing so, resulting in an unfair comparison. |

702 | Max-value Entropy Search for Multi-Objective Bayesian Optimization | Syrine Belakaria, Aryan Deshwal, Janardhan Rao Doppa | We propose a novel approach referred to as Max-value Entropy Search for Multi-objective Optimization (MESMO) to solve this problem. |

703 | Stein Variational Gradient Descent With Matrix-Valued Kernels | Dilin Wang, Ziyang Tang, Chandrajit Bajaj, Qiang Liu | In this work, we enhance SVGD by leveraging preconditioning matrices, such as the Hessian and Fisher information matrix, to incorporate geometric information into SVGD updates. |

704 | Crowdsourcing via Pairwise Co-occurrences: Identifiability and Algorithms | Shahana Ibrahim, Xiao Fu, Nikolaos Kargas, Kejun Huang | We propose an algebraic algorithm reminiscent of convex geometry-based structured matrix factorization to solve the model identification problem efficiently, and an identifiability-enhanced algorithm for handling more challenging and critical scenarios. |

705 | Detecting Overfitting via Adversarial Examples | Roman Werpachowski, András György, Csaba Szepesvari | We propose a new hypothesis test that uses only the original test data to detect overfitting. |

706 | A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment | Felix Leibfried, Sergio Pascual-Díaz, Jordi Grau-Moya | In this paper, we investigate the use of empowerment in the presence of an extrinsic reward signal. |

707 | SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies | Seyed Kamyar Seyed Ghasemipour, Shixiang (Shane) Gu, Richard Zemel | In this work, we propose SMILe, a scalable framework for Meta Inverse Reinforcement Learning (Meta-IRL) based on maximum entropy IRL, which can learn high-quality policies from few demonstrations. |

708 | Towards Understanding the Importance of Shortcut Connections in Residual Networks | Tianyi Liu, Minshuo Chen, Mo Zhou, Simon S. Du, Enlu Zhou, Tuo Zhao | In this paper, we study a two-layer non-overlapping convolutional ResNet. |

709 | Modular Universal Reparameterization: Deep Multi-task Learning Across Diverse Domains | Elliot Meyerson, Risto Miikkulainen | To approach this question, deep multi-task learning is extended in this paper to the setting where there is no obvious overlap between task architectures. |

710 | Solving Interpretable Kernel Dimensionality Reduction | Chieh Wu, Jared Miller, Yale Chang, Mario Sznaier, Jennifer Dy | This work extends the theoretical guarantees of ISM to an entire family of kernels, thereby empowering ISM to solve any kernel method of the same objective. |

711 | Interaction Hard Thresholding: Consistent Sparse Quadratic Regression in Sub-quadratic Time and Space | Shuo Yang, Yanyao Shen, Sujay Sanghavi | In this paper, we provide a new algorithm – Interaction Hard Thresholding (IntHT) which is the first one to provably accurately solve this problem in sub-quadratic time and space. |

712 | A Model to Search for Synthesizable Molecules | John Bradshaw, Brooks Paige, Matt J. Kusner, Marwin Segler, José Miguel Hernández-Lobato | We propose a new molecule generation model, mirroring a more realistic real-world process, where (a) reactants are selected, and (b) combined to form more complex molecules. |

713 | Post training 4-bit quantization of convolutional networks for rapid-deployment | Ron Banner, Yury Nahshan, Daniel Soudry | This paper introduces the first practical 4-bit post training quantization approach: it does not involve training the quantized model (fine-tuning), nor does it require the availability of the full dataset. |

714 | Fast and Flexible Multi-Task Classification using Conditional Neural Adaptive Processes | James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, Richard E. Turner | The goal of this paper is to design image classification systems that, after an initial multi-task training phase, can automatically adapt to new tasks encountered at test time. |

715 | Differentially Private Anonymized Histograms | Ananda Theertha Suresh | Motivated by these applications, we propose the first differentially private mechanism to release anonymized histograms that achieves near-optimal privacy utility trade-off both in terms of number of items and the privacy parameter. |

716 | Dynamic Local Regret for Non-convex Online Forecasting | Sergul Aydore, Tianhao Zhu, Dean P. Foster | We introduce a local regret for non-convex models in a dynamic environment. |

717 | Learning Local Search Heuristics for Boolean Satisfiability | Emre Yolcu, Barnabas Poczos | We present an approach to learn SAT solver heuristics from scratch through deep reinforcement learning with a curriculum. |

718 | Provably Efficient Q-Learning with Low Switching Cost | Yu Bai, Tengyang Xie, Nan Jiang, Yu-Xiang Wang | Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for $H$-step episodic MDP that achieves sublinear regret whose local switching cost in $K$ episodes is $O(H^3SA\log K)$, and we provide a lower bound of $\Omega(HSA)$ on the local switching cost for any no-regret algorithm. |

719 | Solving graph compression via optimal transport | Vikas Garg, Tommi Jaakkola | We propose a new approach to graph compression by appeal to optimal transport. |

720 | PyTorch: An Imperative Style, High-Performance Deep Learning Library | Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala | In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. |

721 | Stability of Graph Scattering Transforms | Fernando Gama, Alejandro Ribeiro, Joan Bruna | In this work, we extend scattering transforms to network data by using multi-resolution graph wavelets, whose computation can be obtained by means of graph convolutions. |

722 | A Debiased MDI Feature Importance Measure for Random Forests | Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, Bin Yu | In this paper, we address the feature selection bias of MDI from both theoretical and methodological perspectives. |

723 | Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle | Simon S. Du, Yuping Luo, Ruosong Wang, Hanrui Zhang | The current paper presents a provably efficient algorithm for Q-learning with linear function approximation. |

724 | Sparse Logistic Regression Learns All Discrete Pairwise Graphical Models | Shanshan Wu, Sujay Sanghavi, Alexandros G. Dimakis | The algorithm is (appropriately regularized) maximum conditional log-likelihood, which involves solving a convex program for each node; for Ising models this is $\ell_1$-constrained logistic regression, while for more general alphabets an $\ell_{2,1}$ group-norm constraint needs to be used. |

725 | Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks | Guodong Zhang, James Martens, Roger B. Grosse | In this work, we analyze for the first time the speed of convergence to global optimum for natural gradient descent on non-linear neural networks with the squared error loss. |

726 | Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices | Santosh Vempala, Andre Wibisono | We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution $\nu = e^{-f}$ on $\mathbb{R}^n$. |

727 | Learning Distributions Generated by One-Layer ReLU Networks | Shanshan Wu, Alexandros G. Dimakis, Sujay Sanghavi | We consider the problem of estimating the parameters of a $d$-dimensional rectified Gaussian distribution from i.i.d. samples. |

728 | Large-scale optimal transport map estimation using projection pursuit | Cheng Meng, Yuan Ke, Jingyi Zhang, Mengrui Zhang, Wenxuan Zhong, Ping Ma | Instead, we propose an estimation method of large-scale OTM by combining the idea of projection pursuit regression and sufficient dimension reduction. |

729 | A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning | Nicolas Carion, Nicolas Usunier, Gabriel Synnaeve, Alessandro Lazaric | By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. |

730 | On Exact Computation with an Infinitely Wide Neural Net | Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, Ruosong Wang | The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. |

731 | Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning | Gregory Farquhar, Shimon Whiteson, Jakob Foerster | Our objective is compatible with arbitrary advantage estimators, which allows the control of the bias and variance of any-order derivatives when using function approximation. |

732 | Chirality Nets for Human Pose Regression | Raymond Yeh, Yuan-Ting Hu, Alexander Schwing | We propose Chirality Nets, a family of deep nets that is equivariant to the “chirality transform,” i.e., the transformation to create a chiral pair. |

733 | Efficient Approximation of Deep ReLU Networks for Functions on Low Dimensional Manifolds | Minshuo Chen, Haoming Jiang, Wenjing Liao, Tuo Zhao | In this paper, we prove that neural networks can efficiently approximate functions supported on low dimensional manifolds. |

734 | Fast Decomposable Submodular Function Minimization using Constrained Total Variation | Senanayak Sesh Kumar Karri, Francis Bach, Thomas Pock | In this paper, we consider a modified convex problem requiring a constrained version of the total variation oracles that can be solved with significantly fewer calls to the simple minimization oracles. |

735 | Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model | Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, Roger B. Grosse | In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments and analysis using a simple noisy quadratic model (NQM). |

736 | Spherical Text Embedding | Yu Meng, Jiaxin Huang, Guangyuan Wang, Chao Zhang, Honglei Zhuang, Lance Kaplan, Jiawei Han | To close this gap, we propose a spherical generative model based on which unsupervised word and paragraph embeddings are jointly learned. |

737 | Mobius Transformation for Fast Inner Product Search on Graph | Zhixin Zhou, Shulong Tan, Zhaozhuo Xu, Ping Li | We present a fast search on graph algorithm for Maximum Inner Product Search (MIPS). |

738 | Hyperbolic Graph Neural Networks | Qi Liu, Maximilian Nickel, Douwe Kiela | Motivated by recent advances in geometric representation learning, we propose a novel GNN architecture for learning representations on Riemannian manifolds with differentiable exponential and logarithmic maps. |

739 | Average Individual Fairness: Algorithms, Generalization and Experiments | Saeed Sharifi-Malvajerdi, Michael Kearns, Aaron Roth | We propose a new family of fairness definitions for classification problems that combine some of the best properties of both statistical and individual notions of fairness. |

740 | Fixing the train-test resolution discrepancy | Hugo Touvron, Andrea Vedaldi, Matthijs Douze, Herve Jegou | This paper first shows that existing augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: in fact, a lower train resolution improves the classification at test time! We then propose a simple strategy to optimize the classifier performance, that employs different train and test resolutions. |

741 | Modeling Dynamic Functional Connectivity with Latent Factor Gaussian Processes | Lingge Li, Dustin Pluta, Babak Shahbaba, Norbert Fortin, Hernando Ombao, Pierre Baldi | We present a latent factor Gaussian process model which addresses these challenges by learning a parsimonious representation of connectivity dynamics. |

742 | Manipulating a Learning Defender and Ways to Counteract | Jiarui Gan, Qingyu Guo, Long Tran-Thanh, Bo An, Michael Wooldridge | In this paper, however, we show that these algorithms can be easily manipulated if the attacker responds untruthfully. |

743 | Learning-In-The-Loop Optimization: End-To-End Control And Co-Design Of Soft Robots Through Learned Deep Latent Representations | Andrew Spielberg, Allan Zhao, Yuanming Hu, Tao Du, Wojciech Matusik, Daniela Rus | We present a learning-in-the-loop co-optimization algorithm in which a latent state representation is learned as the robot figures out how to solve the task. |

744 | Learning to Infer Implicit Surfaces without 3D Supervision | Shichen Liu, Shunsuke Saito, Weikai Chen, Hao Li | To this end, we propose a novel ray-based field probing technique for efficient image-to-field supervision, as well as a general geometric regularizer for implicit surfaces, which provides natural shape priors in unconstrained regions. |

745 | Fast and Accurate Least-Mean-Squares Solvers | Ibrahim Jubran, Alaa Maalouf, Dan Feldman | We suggest an algorithm that gets a finite set of $n$ $d$-dimensional real vectors and returns a weighted subset of $d+1$ vectors whose sum is *exactly* the same. |

746 | Certifiable Robustness to Graph Perturbations | Aleksandar Bojchevski, Stephan Günnemann | We propose the first method for verifying certifiable (non-)robustness to graph perturbations for a general class of models that includes graph neural networks and label/feature propagation. |

747 | Fast Convergence of Belief Propagation to Global Optima: Beyond Correlation Decay | Frederic Koehler | We show that under a natural initialization, BP converges quickly to the global optimum of the Bethe free energy for Ising models on arbitrary graphs, as long as the Ising model is *ferromagnetic* (i.e. neighbors prefer to be aligned). |

748 | Paradoxes in Fair Machine Learning | Paul Goelz, Anson Kahng, Ariel D. Procaccia | We extend equalized odds to the setting of cardinality-constrained fair classification, where we have a bounded amount of a resource to distribute. |

749 | Provably Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost | Zhuoran Yang, Yongxin Chen, Mingyi Hong, Zhaoran Wang | To understand the instability of actor-critic, we focus on its application to linear quadratic regulators, a simple yet fundamental setting of reinforcement learning. |

750 | The spiked matrix model with generative priors | Benjamin Aubin, Bruno Loureiro, Antoine Maillard, Florent Krzakala, Lenka Zdeborová | In this paper we study spiked matrix models, where a low-rank matrix is observed through a noisy channel. |

751 | Gradient Dynamics of Shallow Univariate ReLU Networks | Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, Joan Bruna | We present a theoretical and empirical study of the gradient dynamics of overparameterized shallow ReLU networks with one-dimensional input, solving least-squares interpolation. |

752 | Robust and Communication-Efficient Collaborative Learning | Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, Ramtin Pedarsani | In this paper, we tackle these bottlenecks by proposing a novel decentralized and gradient-based optimization algorithm named QuanTimed-DSGD. |

753 | Multiclass Learning from Contradictions | Sauptik Dhar, Vladimir Cherkassky, Mohak Shah | We introduce the notion of learning from contradictions, a.k.a Universum learning, for multiclass problems and propose a novel formulation for multiclass universum SVM (MU-SVM). |

754 | Learning from Trajectories via Subgoal Discovery | Sujoy Paul, Jeroen van Baar, Amit Roy-Chowdhury | In this paper, we propose an approach which uses the expert trajectories and learns to decompose the complex main task into smaller sub-goals. |

755 | Distributed Low-rank Matrix Factorization With Exact Consensus | Zhihui Zhu, Qiuwei Li, Xinshuo Yang, Gongguo Tang, Michael B. Wakin | In this paper, we study low-rank matrix factorization in the distributed setting, where local variables at each node encode parts of the overall matrix factors, and consensus is encouraged among certain such variables. |

756 | Online Normalization for Training Neural Networks | Vitaliy Chiley, Ilya Sharapov, Atli Kosson, Urs Koster, Ryan Reece, Sofia Samaniego de la Fuente, Vishal Subbiah, Michael James | We resolve a theoretical limitation of Batch Normalization by introducing an unbiased technique for computing the gradient of normalized activations. |

757 | The Synthesis of XNOR Recurrent Neural Networks with Stochastic Logic | Arash Ardakani, Zhengyun Ji, Amir Ardakani, Warren Gross | In this paper, we propose a method that converts all the multiplications in LSTMs to XNOR operations using stochastic computing. |

758 | An adaptive Mirror-Prox method for variational inequalities with singular operators | Kimon Antonakopoulos, Veronica Belmega, Panayotis Mertikopoulos | To address this issue, we propose a novel smoothness condition which we call Bregman smoothness, and which relates the variation of the operator to that of a suitably chosen Bregman function. |

759 | N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules | Shengchao Liu, Mehmet F. Demirel, Yingyu Liang | This paper introduces the N-gram graph, a simple unsupervised representation for molecules. |

760 | Characterizing the Exact Behaviors of Temporal Difference Learning Algorithms Using Markov Jump Linear System Theory | Bin Hu, Usman Syed | In this paper, we provide a unified analysis of temporal difference learning algorithms with linear function approximators by exploiting their connections to Markov jump linear systems (MJLS). |

761 | Facility Location Problem in Differential Privacy Model Revisited | Yunus Esencayi, Marco Gaboardi, Shi Li, Di Wang | In this paper we study the facility location problem in the model of differential privacy (DP) with uniform facility cost. |

762 | Energy-Inspired Models: Learning with Sampler-Induced Distributions | John Lawson, George Tucker, Bo Dai, Rajesh Ranganath | This yields a class of energy-inspired models (EIMs) that incorporate learned energy functions while still providing exact samples and tractable log-likelihood lower bounds. |

763 | Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator | Karl Krauth, Stephen Tu, Benjamin Recht | We study the sample complexity of approximate policy iteration (PI) for the Linear Quadratic Regulator (LQR), building on a recent line of work using LQR as a testbed to understand the limits of reinforcement learning (RL) algorithms on continuous control tasks. |

764 | A Universally Optimal Multistage Accelerated Stochastic Gradient Method | Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, Asuman Ozdaglar | We propose a novel multistage accelerated algorithm that is universally optimal in the sense that it achieves the optimal rate both in the deterministic and stochastic case and operates without knowledge of noise characteristics. |

765 | From deep learning to mechanistic understanding in neuroscience: the structure of retinal prediction | Hidenori Tanaka, Aran Nayebi, Niru Maheswaranathan, Lane McIntosh, Stephen Baccus, Surya Ganguli | We develop such a systematic approach by combining dimensionality reduction and modern attribution methods for determining the relative importance of interneurons for specific visual computations. |

766 | Large Memory Layers with Product Keys | Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, Herve Jegou | This paper introduces a structured memory which can be easily integrated into a neural network. |

767 | Learning Deterministic Weighted Automata with Queries and Counterexamples | Gail Weiss, Yoav Goldberg, Eran Yahav | We present an algorithm for reconstruction of a probabilistic deterministic finite automaton (PDFA) from a given black-box language model, such as a recurrent neural network (RNN). |

768 | Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent | Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington | In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. |

769 | Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals | Surbhi Goel, Sushrut Karmalkar, Adam Klivans | We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). |

770 | Visualizing and Measuring the Geometry of BERT | Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, Been Kim | This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. |

771 | Self-Critical Reasoning for Robust Visual Question Answering | Jialin Wu, Raymond Mooney | To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates. |

772 | Learning to Screen | Alon Cohen, Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Shay Moran | We model such scenarios as an assignment problem between items (candidates) and categories (departments): the items arrive one-by-one in an online manner, and upon processing each item the algorithm decides, based on its value and the categories it can be matched with, whether to retain or discard it (this decision is irrevocable). |

773 | A Communication Efficient Stochastic Multi-Block Alternating Direction Method of Multipliers | Hao Yu | In this paper, we propose a new parallel multi-block stochastic ADMM for distributed stochastic optimization, where each node is only required to perform simple stochastic gradient descent updates. |

774 | A Little Is Enough: Circumventing Defenses For Distributed Learning | Gilad Baruch, Moran Baruch, Yoav Goldberg | We observe that if the empirical variance between the gradients of workers is high enough, an attacker could take advantage of this and launch a non-omniscient attack that operates within the population variance. |

775 | Error Correcting Output Codes Improve Probability Estimation and Adversarial Robustness of Deep Neural Networks | Gunjan Verma, Ananthram Swami | In this paper, we propose a fundamentally different approach which instead changes the way the output is represented and decoded. |

776 | A Robust Non-Clairvoyant Dynamic Mechanism for Contextual Auctions | Yuan Deng, Sébastien Lahaie, Vahab Mirrokni | In this paper, we consider the problem of contextual auctions where the seller gradually learns a model of the buyer’s valuation as a function of the context (e.g., item features) and seeks a pricing policy that optimizes revenue. |

777 | Finite-Sample Analysis for SARSA with Linear Function Approximation | Shaofeng Zou, Tengyu Xu, Yingbin Liang | In this paper, we develop a novel technique to explicitly characterize the stochastic bias of a type of stochastic approximation procedures with time-varying Markov transition kernels. |

778 | Who is Afraid of Big Bad Minima? Analysis of gradient-flow in spiked matrix-tensor models | Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Lenka Zdeborová | Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. |

779 | Graph Structured Prediction Energy Networks | Colin Graber, Alexander Schwing | To address this shortcoming, we introduce ‘Graph Structured Prediction Energy Networks,’ for which we develop inference techniques that model both explicit local and implicit higher-order correlations while maintaining tractability of inference. |

780 | Private Learning Implies Online Learning: An Efficient Reduction | Alon Gonen, Elad Hazan, Shay Moran | In this paper we resolve this open question in the context of pure differential privacy. |

781 | Graph Agreement Models for Semi-Supervised Learning | Otilia Stretcu, Krishnamurthy Viswanathan, Dana Movshovitz-Attias, Emmanouil Platanios, Sujith Ravi, Andrew Tomkins | To address this, we propose Graph Agreement Models (GAM), which introduces an auxiliary model that predicts the probability of two nodes sharing the same label as a learned function of their features. |

782 | Latent distance estimation for random geometric graphs | Ernesto Araya Valdivia, Yohann De Castro | We introduce a spectral estimator of the pairwise distance between latent points and we prove that its rate of convergence is the same as the nonparametric estimation of a function on $\mathbb{S}^{d-1}$, up to a logarithmic factor. |

783 | Seeing the Wind: Visual Wind Speed Prediction with a Coupled Convolutional and Recurrent Neural Network | Jennifer Cardona, Michael Howland, John Dabiri | Here, we demonstrate a coupled convolutional neural network and recurrent neural network architecture that extracts the wind speed encoded in visually recorded flow-structure interactions of a flag and tree in naturally occurring wind. |

784 | The Functional Neural Process | Christos Louizos, Xiahan Shi, Klamer Schutte, Max Welling | We present a new family of exchangeable stochastic processes, the Functional Neural Processes (FNPs). |

785 | Recurrent Registration Neural Networks for Deformable Image Registration | Robin Sandkühler, Simon Andermatt, Grzegorz Bauman, Sylvia Nyilas, Christoph Jud, Philippe C. Cattin | We reformulate the pairwise registration problem as a recursive sequence of successive alignments. |

786 | Unsupervised State Representation Learning in Atari | Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, R Devon Hjelm | We introduce a method that tries to learn better state representations by maximizing mutual information across spatially and temporally distinct features of a neural encoder of the observations. |

787 | Unlocking Fairness: a Trade-off Revisited | Michael Wick, Swetasudha Panda, Jean-Baptiste Tristan | We investigate fairness and accuracy, but this time under a variety of controlled conditions in which we vary the amount and type of bias. |

788 | Fisher Efficient Inference of Intractable Models | Song Liu, Takafumi Kanamori, Wittawat Jitkrittum, Yu Chen | In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion implemented via density ratio estimation and a Stein operator. |

789 | Thompson Sampling and Approximate Inference | My Phan, Yasin Abbasi Yadkori, Justin Domke | We study the effects of approximate inference on the performance of Thompson sampling in $k$-armed bandit problems. |

790 | PRNet: Self-Supervised Learning for Partial-to-Partial Registration | Yue Wang, Justin M. Solomon | We present a simple, flexible, and general framework titled Partial Registration Network (PRNet), for partial-to-partial point cloud registration. |

791 | Surrogate Objectives for Batch Policy Optimization in One-step Decision Making | Minmin Chen, Ramki Gummadi, Chris Harris, Dale Schuurmans | We investigate batch policy optimization for cost-sensitive classification and contextual bandits—two related tasks that obviate exploration but require generalizing from observed rewards to action selections in unseen contexts. |

792 | Modelling heterogeneous distributions with an Uncountable Mixture of Asymmetric Laplacians | Axel Brando, Jose A. Rodriguez, Jordi Vitria, Alberto Rubio Muñoz | In this paper, we propose a generic deep learning framework that learns an Uncountable Mixture of Asymmetric Laplacians (UMAL), which allows us to estimate heterogeneous distributions of the output variable, and we show its connections to quantile regression. |

793 | Learning Macroscopic Brain Connectomes via Group-Sparse Factorization | Farzane Aminmansour, Andrew Patterson, Lei Le, Yisu Peng, Daniel Mitchell, Franco Pestilli, Cesar F. Caiafa, Russell Greiner, Martha White | In this work, we explore a framework that facilitates applying learning algorithms to automatically extract brain connectomes. |

794 | Approximating the Permanent by Sampling from Adaptive Partitions | Jonathan Kuck, Tri Dao, Hamid Rezatofighi, Ashish Sabharwal, Stefano Ermon | We present ADAPART, a simple and efficient method for exact sampling of permutations, each associated with a weight as determined by a matrix. |

795 | Retrosynthesis Prediction with Conditional Graph Logic Network | Hanjun Dai, Chengtao Li, Connor Coley, Bo Dai, Le Song | In this work, we propose a new approach to this task using the Conditional Graph Logic Network, a conditional graphical model built upon graph neural networks that learns when rules from reaction templates should be applied, implicitly considering whether the resulting reaction would be both chemically feasible and strategic. |

796 | Procrastinating with Confidence: Near-Optimal, Anytime, Adaptive Algorithm Configuration | Robert Kleinberg, Kevin Leyton-Brown, Brendan Lucier, Devon Graham | This paper introduces a new algorithm, “Structured Procrastination with Confidence”, that preserves the near-optimality and anytime properties of Structured Procrastination while adding adaptivity. |

797 | Online Learning via the Differential Privacy Lens | Jacob D. Abernethy, Young Hun Jung, Chansoo Lee, Audra McMillan, Ambuj Tewari | In this paper, we use differential privacy as a lens to examine online learning in both full and partial information settings. |

798 | PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points | Siyuan Huang, Yixin Chen, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu | To address this challenge, we propose to adopt perspective points as a new intermediate representation for 3D object detection, defined as the 2D projections of local Manhattan 3D keypoints to locate an object; these perspective points satisfy geometric constraints imposed by the perspective projection. |

799 | Parameter elimination in particle Gibbs sampling | Anna Wigren, Riccardo Sven Risuleo, Lawrence Murray, Fredrik Lindsten | We focus on particle Gibbs (PG) and particle Gibbs with ancestor sampling (PGAS), improving their performance beyond that of the ideal Gibbs sampler (which they approximate) by marginalizing out one or more parameters. |

800 | This Looks Like That: Deep Learning for Interpretable Image Recognition | Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, Jonathan K. Su | In this work, we introduce a deep network architecture, the prototypical part network (ProtoPNet), that reasons in a similar way: the network dissects the image by finding prototypical parts, and combines evidence from the prototypes to make a final classification. |

801 | Adaptively Aligned Image Captioning via Adaptive Attention Time | Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen | In this paper, we propose a novel attention model, namely Adaptive Attention Time (AAT), to align the source and the target adaptively for image captioning. |

802 | Accurate Uncertainty Estimation and Decomposition in Ensemble Learning | Jeremiah Liu, John Paisley, Marianthi-Anna Kioumourtzoglou, Brent Coull | We introduce a Bayesian nonparametric ensemble (BNE) approach that augments an existing ensemble model to account for different sources of model uncertainty. |

803 | Learning Bayesian Networks with Low Rank Conditional Probability Tables | Adarsh Barik, Jean Honorio | In this paper, we provide a method to learn the directed structure of a Bayesian network using data. |

804 | Equal Opportunity in Online Classification with Partial Feedback | Yahav Bechavod, Katrina Ligett, Aaron Roth, Bo Waggoner, Steven Z. Wu | We study an online classification problem with partial feedback in which individuals arrive one at a time from a fixed but unknown distribution, and must be classified as positive or negative. |

805 | Modeling Expectation Violation in Intuitive Physics with Coarse Probabilistic Object Representations | Kevin Smith, Lingjie Mei, Shunyu Yao, Jiajun Wu, Elizabeth Spelke, Josh Tenenbaum, Tomer Ullman | We propose ADEPT, a model that uses a coarse (approximate geometry) object-centric representation for dynamic 3D scene understanding. |

806 | Neural Multisensory Scene Inference | Jae Hyun Lim, Pedro O. O. Pinheiro, Negar Rostamzadeh, Chris Pal, Sungjin Ahn | In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes which are partially observable through multiple sensory modalities. |

807 | Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems | Young Hun Jung, Ambuj Tewari | In this paper, we analyze the performance of Thompson sampling in episodic restless bandits with unknown parameters. |

808 | What Can ResNet Learn Efficiently, Going Beyond Kernels? | Zeyuan Allen-Zhu, Yuanzhi Li | We prove neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption. |

809 | Better Transfer Learning with Inferred Successor Maps | Tamas Madarasz, Tim Behrens | We thus provide a novel algorithmic approach for multi-task learning, as well as a common normative framework that links together these different characteristics of the brain’s spatial representation. |

810 | Unsupervised Co-Learning on G-Manifolds Across Irreducible Representations | Yifeng Fan, Tingran Gao, Zhizhen Jane Zhao | We introduce a novel co-learning paradigm for manifolds naturally admitting an action of a transformation group $\mathcal{G}$, motivated by recent developments on learning a manifold from attached fibre bundle structures. |

811 | Defending Against Neural Fake News | Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, Yejin Choi | We thus present a model for controllable text generation called Grover. |

812 | Sample Adaptive MCMC | Michael Zhu | In this paper, we present Sample Adaptive MCMC (SA-MCMC), an MCMC method based on a reversible Markov chain for $\pi^{\otimes N}$ that uses an adaptive proposal distribution based on the current state of $N$ points and a sequential substitution procedure with one new likelihood evaluation per iteration and at most one updated point each iteration. |

813 | A Stochastic Composite Gradient Method with Incremental Variance Reduction | Junyu Zhang, Lin Xiao | We propose a stochastic composite gradient method that employs incremental variance-reduced estimators for both the inner vector mapping and its Jacobian. |

814 | Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses | Ananya Uppal, Shashank Singh, Barnabas Poczos | We study the problem of estimating a nonparametric probability distribution under a family of losses called Besov IPMs. |

815 | STAR-Caps: Capsule Networks with Straight-Through Attentive Routing | Karim Ahmed, Lorenzo Torresani | In this work, we propose STAR-Caps, a capsule-based network that exploits straight-through attentive routing to address the drawbacks of capsule networks. |

816 | Limitations of Lazy Training of Two-layers Neural Network | Song Mei, Theodor Misiakiewicz, Behrooz Ghorbani, Andrea Montanari | We study the supervised learning problem under either of the following two models: (1) Feature vectors $x_i$ are $d$-dimensional Gaussian and responses are $y_i = f_*(x_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors $x_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and the $y_i$ are the corresponding class labels. |

817 | Reconciling meta-learning and continual learning with online mixtures of tasks | Ghassen Jerfel, Erin Grant, Tom Griffiths, Katherine A. Heller | We use the connection between gradient-based meta-learning and hierarchical Bayes to propose a Dirichlet process mixture of hierarchical Bayesian models over the parameters of an arbitrary parametric model such as a neural network. |

818 | Distributionally Robust Optimization and Generalization in Kernel Methods | Matthew Staib, Stefanie Jegelka | In this paper, we study DRO with uncertainty sets measured via maximum mean discrepancy (MMD). |

819 | A General Theory of Equivariant CNNs on Homogeneous Spaces | Taco S. Cohen, Mario Geiger, Maurice Weiler | We present a general theory of Group equivariant Convolutional Neural Networks (G-CNNs) on homogeneous spaces such as Euclidean space and the sphere. |

820 | Trivializations for Gradient-Based Optimization on Manifolds | Mario Lezcano Casado | We introduce a framework to study the transformation of problems with manifold constraints into unconstrained problems through parametrizations in terms of a Euclidean space. |

821 | Write, Execute, Assess: Program Synthesis with a REPL | Kevin Ellis, Maxwell Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, Armando Solar-Lezama | We present a neural program synthesis approach integrating components which write, execute, and assess code to navigate the search space of possible programs. |

822 | A Meta-Analysis of Overfitting in Machine Learning | Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, Ludwig Schmidt | We conduct the first large meta-analysis of overfitting due to test set reuse in the machine learning community. |

823 | (Nearly) Efficient Algorithms for the Graph Matching Problem on Correlated Random Graphs | Boaz Barak, Chi-Ning Chou, Zhixian Lei, Tselil Schramm, Yueqi Sheng | We consider the graph matching/similarity problem of determining how similar two given graphs $G_0,G_1$ are and recovering the permutation $\pi$ on the vertices of $G_1$ that minimizes the symmetric difference between the edges of $G_0$ and $\pi(G_1)$. |

824 | Preference-Based Batch and Sequential Teaching: Towards a Unified View of Models | Farnam Mansouri, Yuxin Chen, Ara Vartanian, Jerry Zhu, Adish Singla | To better understand the connections between these different batch and sequential models, we develop a novel framework which captures the teaching process via preference functions. |

825 | Online Continuous Submodular Maximization: From Full-Information to Bandit Feedback | Mingrui Zhang, Lin Chen, Hamed Hassani, Amin Karbasi | In this paper, we propose three online algorithms for submodular maximization. |

826 | Sampling Networks and Aggregate Simulation for Online POMDP Planning | Hao (Jackson) Cui, Roni Khardon | The paper introduces a new algorithm for planning in partially observable Markov decision processes (POMDP) based on the idea of aggregate simulation. |

827 | Correlation in Extensive-Form Games: Saddle-Point Formulation and Benchmarks | Gabriele Farina, Chun Kai Ling, Fei Fang, Tuomas Sandholm | To showcase how this novel formulation can inspire new algorithms to compute EFCEs, we propose a simple subgradient descent method which exploits this formulation and structural properties of EFCEs. |

828 | GNNExplainer: Generating Explanations for Graph Neural Networks | Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, Jure Leskovec | Here we propose GNNExplainer, the first general, model-agnostic approach for providing interpretable explanations for predictions of any GNN-based model on any graph-based machine learning task. |

829 | Linear Stochastic Bandits Under Safety Constraints | Sanae Amani, Mahnoosh Alizadeh, Christos Thrampoulidis | In this paper, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend (linearly) on an unknown parameter vector. |

830 | A coupled autoencoder approach for multi-modal analysis of cell types | Rohan Gala, Nathan Gouwens, Zizhen Yao, Agata Budzillo, Osnat Penn, Bosiljka Tasic, Gabe Murphy, Hongkui Zeng, Uygar Sümbül | We pose this issue of cross-modal alignment as an optimization problem and develop an approach based on coupled training of autoencoders as a framework for such analyses. |

831 | Towards Automatic Concept-based Explanations | Amirata Ghorbani, James Wexler, James Y. Zou, Been Kim | In this work, we propose principles and desiderata for *concept*-based explanation, which goes beyond per-sample features to identify higher-level human-understandable concepts that apply across the entire dataset. |

832 | Deep Generative Video Compression | Salvator Lombardo, Jun Han, Christopher Schroers, Stephan Mandt | Here, we propose an end-to-end, deep generative modeling approach to compress temporal sequences with a focus on video. |

833 | Budgeted Reinforcement Learning in Continuous State Space | Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin | This observation allows us to introduce natural extensions of Deep Reinforcement Learning algorithms to address large-scale BMDPs. |

834 | Discovery of Useful Questions as Auxiliary Tasks | Vivek Veeriah, Matteo Hessel, Zhongwen Xu, Janarthanan Rajendran, Richard L. Lewis, Junhyuk Oh, Hado P. van Hasselt, David Silver, Satinder Singh | We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value functions or GVFs, a fairly rich form of knowledge representation. |

835 | Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm | Giulia Luise, Saverio Salzo, Massimiliano Pontil, Carlo Ciliberto | We present a novel algorithm to estimate the barycenter of arbitrary probability distributions with respect to the Sinkhorn divergence. |

836 | Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias | Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, Joan Bruna | The aim of this work is to understand this fact through the lens of dynamics in the loss landscape. |

837 | Correlation clustering with local objectives | Sanchit Kalhan, Konstantin Makarychev, Timothy Zhou | In this paper, we study algorithms for minimizing $\ell_q$ norms ($q \geq 1$) of the disagreements vector for both arbitrary and complete graphs. |

838 | Multiclass Performance Metric Elicitation | Gaurush Hiranandani, Shant Boodaghians, Ruta Mehta, Oluwasanmi O. Koyejo | In this paper, we propose novel strategies for eliciting multiclass classification performance metrics using only relative preference feedback. |

839 | Algorithmic Analysis and Statistical Estimation of SLOPE via Approximate Message Passing | Zhiqi Bu, Jason Klusowski, Cynthia Rush, Weijie Su | In this paper, we develop an asymptotically exact characterization of the SLOPE solution under Gaussian random designs through solving the SLOPE problem using approximate message passing (AMP). |

840 | Explicit Explore-Exploit Algorithms in Continuous State Spaces | Mikael Henaff | We present a new model-based algorithm for reinforcement learning (RL) which consists of explicit exploration and exploitation phases, and is applicable in large or infinite state spaces. |

841 | ADDIS: an adaptive discarding algorithm for online FDR control with conservative nulls | Jinjin Tian, Aaditya Ramdas | In this work, we introduce a new adaptive discarding method called ADDIS that provably controls the FDR and achieves the best of both worlds: it enjoys appreciable power increase over all existing methods if nulls are conservative (the practical case), and rarely loses power if nulls are exactly uniformly distributed (the ideal case). |

842 | Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices | Vincent Chen, Sen Wu, Alexander J. Ratner, Jen Weng, Christopher Ré | We propose Slice-based Learning, a new programming model in which the slicing function (SF), a programmer abstraction, is used to specify additional model capacity for each slice. |

843 | Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse | James Lucas, George Tucker, Roger B. Grosse, Mohammad Norouzi | This paper presents a simple and intuitive explanation for posterior collapse through the analysis of linear VAEs and their direct correspondence with Probabilistic PCA (pPCA). |

844 | Language as an Abstraction for Hierarchical Deep Reinforcement Learning | YiDing Jiang, Shixiang (Shane) Gu, Kevin P. Murphy, Chelsea Finn | In this paper, we propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalization, while retaining tremendous flexibility, making it suitable for a variety of problems. |

845 | Efficient online learning with kernels for adversarial large scale problems | Rémi Jézéquel, Pierre Gaillard, Alessandro Rudi | Our contributions are twofold: 1) For the Gaussian kernel, we propose to build the basis beforehand (independently of the data) through Taylor expansion. |

846 | A Linearly Convergent Method for Non-Smooth Non-Convex Optimization on the Grassmannian with Applications to Robust Subspace and Dictionary Learning | Zhihui Zhu, Tianyu Ding, Daniel Robinson, Manolis Tsakiris, René Vidal | In this paper we show that if the objective satisfies a certain Riemannian regularity condition with respect to some point in the Grassmannian, then a Riemannian subgradient method with appropriate initialization and geometrically diminishing step size converges at a linear rate to that point. |

847 | ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models | Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, Boris Katz | We collect a large real-world test set, ObjectNet, for object recognition with controls where object backgrounds, rotations, and imaging viewpoints are random. |

848 | Certified Adversarial Robustness with Additive Noise | Bai Li, Changyou Chen, Wenlin Wang, Lawrence Carin | To address these limitations, we introduce a framework that is scalable and provides certified bounds on the norm of the input manipulation for constructing adversarial examples. |

849 | Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels | Michela Meister, Tamas Sarlos, David Woodruff | We give a new analysis of this sketch, providing nearly optimal bounds. |

850 | Non-Cooperative Inverse Reinforcement Learning | Xiangyuan Zhang, Kaiqing Zhang, Erik Miehling, Tamer Basar | To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. |

851 | DINGO: Distributed Newton-Type Method for Gradient-Norm Optimization | Rixon Crane, Fred Roosta | For optimization of a large sum of functions in a distributed computing environment, we present a novel communication efficient Newton-type algorithm that enjoys a variety of advantages over similar existing methods. |

852 | Sobolev Independence Criterion | Youssef Mroueh, Tom Sercu, Mattia Rigotti, Inkit Padhi, Cicero Nogueira dos Santos | We propose the Sobolev Independence Criterion (SIC), an interpretable dependency measure between a high dimensional random variable X and a response variable Y. SIC decomposes to the sum of feature importance scores and hence can be used for nonlinear feature selection. |

853 | Maximum Entropy Monte-Carlo Planning | Chenjun Xiao, Ruitong Huang, Jincheng Mei, Dale Schuurmans, Martin Müller | We develop a new algorithm for online planning in large scale sequential decision problems that improves upon the worst case efficiency of UCT. |

854 | Learning from brains how to regularize machines | Zhe Li, Wieland Brendel, Edgar Walker, Erick Cobos, Taliah Muhammad, Jacob Reimer, Matthias Bethge, Fabian Sinz, Zachary Pitkow, Andreas Tolias | We propose to regularize CNNs using large-scale neuroscience data to learn more robust neural features in terms of representational similarity. |

855 | Using Statistics to Automate Stochastic Optimization | Hunter Lang, Lin Xiao, Pengchuan Zhang | Rather than changing the learning rate at each iteration, we propose an approach that automates the most common hand-tuning heuristic: use a constant learning rate until “progress stops,” then drop. |

856 | Zero-shot Knowledge Transfer via Adversarial Belief Matching | Paul Micaelli, Amos J. Storkey | We propose a novel method which trains a student to match the predictions of its teacher without using any data or metadata. |

857 | Differentiable Convex Optimization Layers | Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, J. Zico Kolter | In this paper, we propose an approach to differentiating through disciplined convex programs, a subclass of convex optimization problems used by domain-specific languages (DSLs) for convex optimization. |

858 | Random Tessellation Forests | Shufei Ge, Shijia Wang, Yee Whye Teh, Liangliang Wang, Lloyd Elliott | Motivated by the need for a multi-dimensional partitioning tree with non-axis aligned cuts, we propose the Random Tessellation Process, a framework that includes the Mondrian process as a special case. |

859 | Learning Nearest Neighbor Graphs from Noisy Distance Samples | Blake Mason, Ardhendu Tripathy, Robert Nowak | In this paper, we propose an active algorithm to find the graph with high probability and analyze its query complexity. |

860 | Lookahead Optimizer: k steps forward, 1 step back | Michael Zhang, James Lucas, Jimmy Ba, Geoffrey E. Hinton | In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. |

861 | Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer | Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, Sanja Fidler | In this paper, we present DIB-Render, a novel rendering framework through which gradients can be analytically computed. |

862 | Covariate-Powered Empirical Bayes Estimation | Nikolaos Ignatiadis, Stefan Wager | In this paper, we propose a flexible plug-in empirical Bayes estimator that synthesizes both sources of information and may leverage any black-box predictive model. |

863 | Understanding the Role of Momentum in Stochastic Gradient Methods | Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao | In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. |

864 | A neurally plausible model for online recognition and postdiction in a dynamical environment | Li Kevin Wenliang, Maneesh Sahani | Here, we propose a general framework for neural probabilistic inference in dynamic models based on the distributed distributional code (DDC) representation of uncertainty, naturally extending the underlying encoding to incorporate implicit probabilistic beliefs about both present and past. |

865 | Guided Meta-Policy Search | Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, Chelsea Finn | To this end, we propose to learn a reinforcement learning procedure in a federated way, where individual off-policy learners can solve the individual meta-training tasks, and then consolidate these solutions into a single meta-learner. |

866 | Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling | Tengyang Xie, Yifei Ma, Yu-Xiang Wang | Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) — the problem of evaluating a new policy using the historical data obtained by different behavior policies — under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. |

867 | Contextual Bandits with Cross-Learning | Santiago Balseiro, Negin Golrezaei, Mohammad Mahdian, Vahab Mirrokni, Jon Schneider | We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on $C$ and achieve regret $\tilde{O}(\sqrt{KT})$ (when contexts are stochastic with known distribution), $\tilde{O}(K^{1/3}T^{2/3})$ (when contexts are stochastic with unknown distribution), and $\tilde{O}(\sqrt{KT})$ (when contexts are adversarial but rewards are stochastic). |

868 | Evaluating Protein Transfer Learning with TAPE | Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, Yun Song | To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. |

869 | A Bayesian Theory of Conformity in Collective Decision Making | Koosha Khalvati, Saghar Mirbagheri, Seongmin A. Park, Jean-Claude Dreher, Rajesh PN Rao | In this paper, we present a new Bayesian theory of collective decision making based on a simple yet most commonly observed behavior: conformity. |

870 | Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel | Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma | We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in $d$ dimensions which the optimal regularized neural net learns with $O(d)$ samples but the NTK requires $\Omega(d^2)$ samples to learn. |

871 | Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation | Colin Wei, Tengyu Ma | For feedforward neural nets as well as RNNs, we obtain tighter Rademacher complexity bounds by considering additional data-dependent properties of the network: the norms of the hidden layers of the network, and the norms of the Jacobians of each layer with respect to all previous layers. |

872 | A Benchmark for Interpretability Methods in Deep Neural Networks | Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim | We propose an empirical measure of the approximate accuracy of feature importance estimates in deep neural networks. |

873 | Memory Efficient Adaptive Optimization | Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer | We describe an effective and flexible adaptive optimization method with greatly reduced memory overhead. |

874 | Dynamic Incentive-Aware Learning: Robust Pricing in Contextual Auctions | Negin Golrezaei, Adel Javanmard, Vahab Mirrokni | We propose two learning policies that are robust to such strategic behavior. |

875 | Convergence-Rate-Matching Discretization of Accelerated Optimization Flows Through Opportunistic State-Triggered Control | Miguel Vaquero, Jorge Cortes | This paper provides a novel approach through the idea of opportunistic state-triggered control. |

876 | A Unified Framework for Data Poisoning Attack to Graph-based Semi-supervised Learning | Xuanqing Liu, Si Si, Jerry Zhu, Yang Li, Cho-Jui Hsieh | In this paper, we propose a general framework for data poisoning attacks on graph-based semi-supervised learning (G-SSL). |

877 | Compositional generalization through meta sequence-to-sequence learning | Brenden M. Lake | In this paper, I show how memory-augmented neural networks can be trained to generalize compositionally through meta seq2seq learning. |

878 | Bayesian Joint Estimation of Multiple Graphical Models | Lingrui Gan, Xinming Yang, Naveen Narisetty, Feng Liang | In this paper, we propose a novel Bayesian group regularization method based on the spike and slab Lasso priors for jointly estimating multiple graphical models. |

879 | Practical Two-Step Lookahead Bayesian Optimization | Jian Wu, Peter Frazier | This paper proposes a computationally efficient algorithm that provides an accurate solution to the two-step lookahead Bayesian optimization problem in seconds to at most several minutes of computation per batch of evaluations. |

880 | Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models | Yunfei Teng, Wenbo Gao, François Chalus, Anna E. Choromanska, Donald Goldfarb, Adrian Weller | We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). |

881 | A Convex Relaxation Barrier to Tight Robustness Verification of Neural Networks | Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, Pengchuan Zhang | In this paper, we unify all existing LP-relaxed verifiers, to the best of our knowledge, under a general convex relaxation framework. |

882 | Neural Jump Stochastic Differential Equations | Junteng Jia, Austin R. Benson | To this end, we introduce Neural Jump Stochastic Differential Equations that provide a data-driven approach to learn continuous and discrete dynamic behavior, i.e., hybrid systems that both flow and jump. |

883 | Learning metrics for persistence-based summaries and applications for graph classification | Qi Zhao, Yusu Wang | We study this problem and develop a new weighted kernel, called WKPI, for persistence summaries, as well as an optimization framework to learn the weight (and thus kernel). |

884 | On the Value of Target Data in Transfer Learning | Steve Hanneke, Samory Kpotufe | We aim to understand the value of additional labeled or unlabeled target data in transfer learning, for any given amount of source data; this is motivated by practical questions around minimizing sampling costs, whereby target data is usually harder or costlier to acquire than source data, but can yield better accuracy. |

885 | Stochastic Variance Reduced Primal Dual Algorithms for Empirical Composition Optimization | Adithya M Devraj, Jianshu Chen | We take a novel approach to solving this problem by reformulating the original minimization objective into an equivalent min-max objective, which brings out all the empirical averages that are originally inside the nonlinear loss functions. |

886 | On Robustness of Principal Component Regression | Anish Agarwal, Devavrat Shah, Dennis Shen, Dogyoon Song | As the main contribution of this work, we address this challenge by rigorously establishing that PCR is robust to noisy, sparse, and possibly mixed valued covariates. |

887 | Meta Learning with Relational Information for Short Sequences | Yujia Xie, Haoming Jiang, Feng Liu, Tuo Zhao, Hongyuan Zha | This paper proposes a new meta-learning method, named HARMLESS (HAwkes Relational Meta Learning method for Short Sequences), for learning heterogeneous point process models from a collection of short event sequence data along with a relational network. |

888 | Residual Flows for Invertible Generative Modeling | Tian Qi Chen, Jens Behrmann, David K. Duvenaud, Joern-Henrik Jacobsen | We give a tractable unbiased estimate of the log density, and reduce the memory required during training by a factor of ten. |

889 | Multi-Agent Common Knowledge Reinforcement Learning | Christian Schroeder de Witt, Jakob Foerster, Gregory Farquhar, Philip Torr, Wendelin Boehmer, Shimon Whiteson | In this paper, we show that common knowledge between agents allows for complex decentralised coordination. |

890 | Learning to Learn By Self-Critique | Antreas Antoniou, Amos J. Storkey | In this paper, we propose a framework called *Self-Critique and Adapt* or SCA. |

891 | Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes | Greg Yang | More generally, we introduce a language for expressing neural network computations, and our result encompasses all such expressible neural networks. |

892 | Neural Networks with Cheap Differential Operators | Tian Qi Chen, David K. Duvenaud | We describe a family of neural network architectures that allow easy access to a family of differential operators involving *dimension-wise derivatives*, and we show how to modify the backward computation graph to compute them efficiently. |

893 | Transductive Zero-Shot Learning with Visual Structure Constraint | Ziyu Wan, Dongdong Chen, Yan Li, Xingguang Yan, Junge Zhang, Yizhou Yu, Jing Liao | Based on the observation that visual features of test instances can be separated into different clusters, we propose a new visual structure constraint on class centers for transductive ZSL, to improve the generality of the projection function (i.e., alleviate the above domain shift problem). |

894 | Dying Experts: Efficient Algorithms with Optimal Regret Bounds | Hamid Shayestehmanesh, Sajjad Azami, Nishant A. Mehta | We study a variant of decision-theoretic online learning in which the set of experts that are available to Learner can shrink over time. |

895 | Model Similarity Mitigates Test Set Overuse | Horia Mania, John Miller, Ludwig Schmidt, Moritz Hardt, Benjamin Recht | We proffer a new explanation for the apparent longevity of test data: Many proposed models are similar in their predictions and we prove that this similarity mitigates overfitting. |

896 | A unified theory for the origin of grid cells through the lens of pattern formation | Ben Sorscher, Gabriel Mel, Surya Ganguli, Samuel Ocko | Here we provide an analytic theory that unifies the two perspectives by casting the learning dynamics of neural networks trained on navigational tasks as a pattern forming dynamical system. |

897 | On Sample Complexity Upper and Lower Bounds for Exact Ranking from Noisy Comparisons | Wenbo Ren, Jia (Kevin) Liu, Ness Shroff | Different from most previous works, in this paper, we have three main novelties: (i) compared to prior works, our upper bounds (algorithms) and lower bounds on the sample complexity (aka number of comparisons) require minimal assumptions on the instances, and are not restricted to specific models; (ii) we give lower bounds and upper bounds on instances with *unequal* noise levels; and (iii) this paper aims at the *exact* ranking without knowledge of the instances, while most of the previous works either focus on approximate rankings or study exact ranking but require prior knowledge. |

898 | Hierarchical Decision Making by Generating and Following Natural Language Instructions | Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, Mike Lewis | We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. |

899 | SHE: A Fast and Accurate Deep Neural Network for Encrypted Data | Qian Lou, Lei Jiang | In this paper, we propose a Shift-accumulation-based LHE-enabled deep neural network (SHE) for fast and accurate inferences on encrypted data. |

900 | Locality-Sensitive Hashing for f-Divergences: Mutual Information Loss and Beyond | Lin Chen, Hossein Esfandiari, Gang Fu, Vahab Mirrokni | In this paper, we aim to develop LSH schemes for distance functions that measure the distance between two probability distributions, particularly for f-divergences as well as a generalization to capture mutual information loss. |

901 | A Game Theoretic Approach to Class-wise Selective Rationalization | Shiyu Chang, Yang Zhang, Mo Yu, Tommi Jaakkola | To this end, we propose a new game theoretic approach to class-dependent rationalization, where the method is specifically trained to highlight evidence supporting alternative conclusions. |

902 | Efficiently avoiding saddle points with zero order methods: No gradients required | Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, Georgios Piliouras | We consider the case of derivative-free algorithms for non-convex optimization, also known as zero order algorithms, that use only function evaluations rather than gradients. |

903 | Metamers of neural networks reveal divergence from human perceptual systems | Jenelle Feather, Alex Durango, Ray Gonzalez, Josh McDermott | To more thoroughly investigate their similarity to biological systems, we synthesized model metamers – stimuli that produce the same responses at some stage of a network’s representation. |

904 | Spatial-Aware Feature Aggregation for Image based Cross-View Geo-Localization | Yujiao Shi, Liu Liu, Xin Yu, Hongdong Li | In this paper, we develop a new deep network to explicitly address these inherent differences between ground and aerial views. |

905 | Decentralized sketching of low rank matrices | Rakshith Sharma Srinivasa, Kiryung Lee, Marius Junge, Justin Romberg | Leveraging the joint structure between the columns, we propose a method to recover the matrix to within an $\epsilon$ relative error in the Frobenius norm from a total of $O(r(d_1 + d_2)\log^6(d_1 + d_2)/\epsilon^2)$ observations. |

906 | Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss | Zhao Song, David Woodruff, Peilin Zhong | We study the column subset selection problem with respect to the entrywise $\ell_1$-norm loss. |

907 | Efficient Forward Architecture Search | Hanzhang Hu, John Langford, Rich Caruana, Saurajit Mukherjee, Eric J. Horvitz, Debadeepta Dey | We propose a neural architecture search (NAS) algorithm, Petridish, to iteratively add shortcut connections to existing network layers. |

908 | Unsupervised Meta-Learning for Few-Shot Image Classification | Siavash Khodadadeh, Ladislau Boloni, Mubarak Shah | In this paper, we propose UMTRA, an algorithm that performs unsupervised, model-agnostic meta-learning for classification tasks. |

909 | Learning Mixtures of Plackett-Luce Models from Structured Partial Orders | Zhibing Zhao, Lirong Xia | In this paper, we focus on three popular structures of partial orders: ranked top-$l_1$, $l_2$-way, and choice data over a subset of alternatives. |

910 | Certainty Equivalence is Efficient for Linear Quadratic Control | Horia Mania, Stephen Tu, Benjamin Recht | We show that for both the fully and partially observed settings, the sub-optimality gap between the cost incurred by playing the certainty equivalent controller on the true system and the cost incurred by using the optimal LQ controller enjoys a fast statistical rate, scaling as the square of the parameter error. |

911 | Scalable Bayesian inference of dendritic voltage via spatiotemporal recurrent state space models | Ruoxi Sun, Ian Kinsella, Scott Linderman, Liam Paninski | Here we introduce a scalable fully Bayesian approach. |

912 | Logarithmic Regret for Online Control | Naman Agarwal, Elad Hazan, Karan Singh | We study optimal regret bounds for control in linear dynamical systems under adversarially changing strongly convex cost functions, given the knowledge of transition dynamics. |

913 | Elliptical Perturbations for Differential Privacy | Matthew Reimherr, Jordan Awan | We study elliptical distributions in locally convex vector spaces, and determine conditions when they can or cannot be used to satisfy differential privacy (DP). |

914 | Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks | Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, Yang Liu | Inspired by the work on manually-defined patterns of vulnerabilities from various code representation graphs and the recent advance on graph neural networks, we propose Devign, a general graph neural network based model for graph-level classification through learning on a rich set of code semantic representations. |

915 | KNG: The K-Norm Gradient Mechanism | Matthew Reimherr, Jordan Awan | This paper presents a new mechanism for producing sanitized statistical summaries that achieve *differential privacy*, called the *K-Norm Gradient* Mechanism, or KNG. |

916 | CXPlain: Causal Explanations for Model Interpretation under Uncertainty | Patrick Schwab, Walter Karlen | We present experiments that demonstrate that CXPlain is significantly more accurate and faster than existing model-agnostic methods for estimating feature importance. |

917 | Regularized Anderson Acceleration for Off-Policy Deep Reinforcement Learning | Wenjie Shi, Shiji Song, Hui Wu, Ya-Chu Hsu, Cheng Wu, Gao Huang | To tackle this problem, we propose a general acceleration method for model-free, off-policy deep RL algorithms by drawing the idea underlying regularized Anderson acceleration (RAA), which is an effective approach to accelerating the solving of fixed point problems with perturbations. |

918 | STREETS: A Novel Camera Network Dataset for Traffic Flow | Corey Snyder, Minh Do | In this paper, we introduce STREETS, a novel traffic flow dataset from publicly available web cameras in the suburbs of Chicago, IL. |

919 | Sequential Neural Processes | Gautam Singh, Jaesik Yoon, Youngsung Son, Sungjin Ahn | In this paper, we propose Sequential Neural Processes (SNP) which incorporates a temporal state-transition model of stochastic processes and thus extends its modeling capabilities to dynamic stochastic processes. |

920 | Policy Continuation with Hindsight Inverse Dynamics | Hao Sun, Zhizhong Li, Xiaotong Liu, Bolei Zhou, Dahua Lin | To tackle this difficulty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). |

921 | Learning to Self-Train for Semi-Supervised Few-Shot Classification | Xinzhe Li, Qianru Sun, Yaoyao Liu, Qin Zhou, Shibao Zheng, Tat-Seng Chua, Bernt Schiele | In this paper we propose a novel semi-supervised meta-learning method called learning to self-train (LST) that leverages unlabeled data and specifically meta-learns how to cherry-pick and label such unsupervised data to further improve performance. |

922 | Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations. | Sawyer Birnbaum, Volodymyr Kuleshov, Zayd Enam, Pang Wei W. Koh, Stefano Ermon | Here, we propose Temporal Feature-Wise Linear Modulation (TFiLM) — a novel architectural component inspired by adaptive batch normalization and its extensions — that uses a recurrent neural network to alter the activations of a convolutional model. |

923 | From Complexity to Simplicity: Adaptive ES-Active Subspaces for Blackbox Optimization | Krzysztof M. Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Vikas Sindhwani | We present a new algorithm (ASEBO) for optimizing high-dimensional blackbox functions. |

924 | On the Expressive Power of Deep Polynomial Neural Networks | Joe Kileel, Matthew Trager, Joan Bruna | This paper proposes the dimension of this variety as a precise measure of the expressive power of polynomial neural networks. |

925 | DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation | Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos | In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. |

926 | Can SGD Learn Recurrent Neural Networks with Provable Generalization? | Zeyuan Allen-Zhu, Yuanzhi Li | In this paper, we show that, using vanilla stochastic gradient descent (SGD), RNNs can actually learn some notable concept class *efficiently*, meaning that both time and sample complexity scale *polynomially* in the input length (or almost polynomially, depending on the concept). |

927 | Limits of Private Learning with Access to Public Data | Raef Bassily, Shay Moran, Noga Alon | We consider learning problems where the training set consists of two types of examples: private and public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. |

928 | Discrete Object Generation with Reversible Inductive Construction | Ari Seff, Wenda Zhou, Farhan Damani, Abigail Doyle, Ryan P. Adams | Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity. |

929 | Efficient Near-Optimal Testing of Community Changes in Balanced Stochastic Block Models | Aditya Gangrade, Praveen Venkatesh, Bobak Nazer, Venkatesh Saligrama | We propose and analyze the problems of *community goodness-of-fit and two-sample testing* for stochastic block models (SBM), where changes arise due to modification in community memberships of nodes. |

930 | Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards | Alexander Trott, Stephan Zheng, Caiming Xiong, Richard Socher | We introduce a simple and effective model-free method to learn from shaped distance-to-goal rewards on tasks where success depends on reaching a goal state. |

931 | Superset Technique for Approximate Recovery in One-Bit Compressed Sensing | Larkin Flodin, Venkata Gandikota, Arya Mazumdar | In this paper, we propose a generic approach for signal recovery in nonadaptive 1bCS that leads to improved sample complexity for approximate recovery for a variety of signal models, including nonnegative signals and binary signals. |

932 | Bandits with Feedback Graphs and Switching Costs | Raman Arora, Teodor Vanislavov Marinov, Mehryar Mohri | We study the adversarial multi-armed bandit problem where the learner is supplied with partial observations modeled by a *feedback graph* and where shifting to a new action incurs a fixed *switching cost*. |

933 | Functional Adversarial Attacks | Cassidy Laidlaw, Soheil Feizi | We propose functional adversarial attacks, a novel class of threat models for crafting adversarial examples to fool machine learning models. |

934 | Statistical-Computational Tradeoff in Single Index Models | Lingxiao Wang, Zhuoran Yang, Zhaoran Wang | In this paper, we investigate the case when this critical assumption fails to hold, where the problem becomes considerably harder. |

935 | On Fenchel Mini-Max Learning | Chenyang Tao, Liqun Chen, Shuyang Dai, Junya Chen, Ke Bai, Dong Wang, Jianfeng Feng, Wenlian Lu, Georgiy Bobashev, Lawrence Carin | We present a novel probabilistic learning framework, called Fenchel Mini-Max Learning (FML), that accommodates all four desiderata in a flexible and scalable manner. |

936 | MarginGAN: Adversarial Training in Semi-Supervised Learning | Jinhao Dong, Tong Lin | Our method is motivated by the success of large-margin classifiers and the recent viewpoint that good semi-supervised learning requires a “bad” GAN. |

937 | Poincaré Recurrence, Cycles and Spurious Equilibria in Gradient-Descent-Ascent for Non-Convex Non-Concave Zero-Sum Games | Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, Georgios Piliouras | We study a wide class of non-convex non-concave min-max games that generalizes over standard bilinear zero-sum games. |

938 | A unified variance-reduced accelerated gradient method for convex optimization | Guanghui Lan, Zhize Li, Yi Zhou | We propose a novel randomized incremental gradient algorithm, namely, VAriance-Reduced Accelerated Gradient (Varag), for finite-sum optimization. |

939 | Nearly Tight Bounds for Robust Proper Learning of Halfspaces with a Margin | Ilias Diakonikolas, Daniel Kane, Pasin Manurangsi | We study the problem of properly learning large margin halfspaces in the agnostic PAC model. |

940 | Same-Cluster Querying for Overlapping Clusters | Wasim Huleihel, Arya Mazumdar, Muriel Medard, Soumyabrata Pal | In this paper, we look at the more practical scenario of overlapping clusters, and provide upper bounds (with algorithms) on the sufficient number of queries. |

941 | Efficient Convex Relaxations for Streaming PCA | Raman Arora, Teodor Vanislavov Marinov | In this work, we give improved bounds on per iteration cost of mini-batched variants of both MSG and $\ell_2$-RMSG and arrive at an algorithm with total computational complexity matching that of Oja’s algorithm. |

942 | Learning Robust Global Representations by Penalizing Local Predictive Power | Haohan Wang, Songwei Ge, Zachary Lipton, Eric P. Xing | This paper proposes a method for training robust convolutional networks by penalizing the predictive power of the local representations learned by earlier layers. |

943 | Unsupervised Curricula for Visual Meta-Reinforcement Learning | Allan Jabri, Kyle Hsu, Abhishek Gupta, Ben Eysenbach, Sergey Levine, Chelsea Finn | We formulate unsupervised meta-RL as information maximization between a latent task variable and the meta-learner’s data distribution, and describe a practical instantiation which alternates between integration of recent experience into the task distribution and meta-learning of the updated tasks. |

944 | Sample Complexity of Learning Mixture of Sparse Linear Regressions | Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, Soumyabrata Pal | In this paper, we consider the case where the signal vectors are sparse; this generalizes the popular compressed sensing paradigm. |

945 | Large Scale Adversarial Representation Learning | Jeff Donahue, Karen Simonyan | In this work we show that progress in image generation quality translates to substantially improved representation learning performance. |

946 | G2SAT: Learning to Generate SAT Formulas | Jiaxuan You, Haoze Wu, Clark Barrett, Raghuram Ramanujan, Jure Leskovec | In this work, we present G2SAT, the first deep generative framework that learns to generate SAT formulas from a given set of input formulas. |

947 | Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy | Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang | In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. |

948 | Dimensionality reduction: theoretical perspective on practical measures | Yair Bartal, Nova Fandina, Ofer Neiman | The goal of this paper is to bridge the gap between theory and practice view-points of metric dimensionality reduction, laying the foundation for a theoretical study of more practically oriented analysis. |

949 | Oracle-Efficient Algorithms for Online Linear Optimization with Bandit Feedback | Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, Ken-Ichi Kawarabayashi | We propose computationally efficient algorithms for *online linear optimization with bandit feedback*, in which a player chooses an *action vector* from a given (possibly infinite) set $\mathcal{A} \subseteq \mathbb{R}^d$, and then suffers a loss that can be expressed as a linear function in action vectors. |

950 | Multilabel reductions: what is my loss optimising? | | In this paper, we study five commonly used reductions, including the one-versus-all reduction, a reduction to multiclass classification, and normalised versions of the same, wherein the contribution of each instance is normalised by the number of relevant labels. |

951 | Tight Sample Complexity of Learning One-hidden-layer Convolutional Neural Networks | Yuan Cao, Quanquan Gu | We propose a novel algorithm called approximate gradient descent for training CNNs, and show that, with high probability, the proposed algorithm with random initialization grants a linear convergence to the ground-truth parameters up to statistical precision. |

952 | Deep Gamblers: Learning to Abstain with Portfolio Theory | Ziyin Liu, Zhikang Wang, Paul Pu Liang, Russ R. Salakhutdinov, Louis-Philippe Morency, Masahito Ueda | Inspired by portfolio theory, we propose a loss function for the selective classification problem based on the doubling rate of gambling. |

953 | Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples | Tengyu Xu, Shaofeng Zou, Yingbin Liang | In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under independent and identically distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. |

954 | Transfer Learning via Minimizing the Performance Gap Between Domains | Boyu Wang, Jorge Mendez, Mingbo Cai, Eric Eaton | To formalize this intuition, we define the performance gap as a measure of the discrepancy between the source and target domains. |

955 | Splitting Steepest Descent for Growing Neural Architectures | Lemeng Wu, Dilin Wang, Qiang Liu | We develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons into multiple offspring. |

956 | Sequential Experimental Design for Transductive Linear Bandits | Lalit Jain, Kevin G. Jamieson, Tanner Fiez, Lillian Ratliff | In this paper we introduce the pure exploration *transductive linear bandit problem*: given a set of measurement vectors $\mathcal{X}\subset \mathbb{R}^d$, a set of items $\mathcal{Z}\subset \mathbb{R}^d$, a fixed confidence $\delta$, and an unknown vector $\theta^{\ast}\in \mathbb{R}^d$, the goal is to infer $\arg\max_{z\in \mathcal{Z}} z^\top\theta^{\ast}$ with probability $1-\delta$ by making as few sequentially chosen noisy measurements of the form $x^\top\theta^{\ast}$ as possible. |

957 | Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence | Aditya Sharad Golatkar, Alessandro Achille, Stefano Soatto | Deep neural networks (DNNs), however, challenge this view: We show that removing regularization after an initial transient period has little effect on generalization, even if the final loss landscape is the same as if there had been no regularization. |

958 | Outlier-Robust High-Dimensional Sparse Estimation via Iterative Filtering | Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Eric Price, Alistair Stewart | Specifically, we focus on the fundamental problems of robust sparse mean estimation and robust sparse PCA. We give the first practically viable robust estimators for these problems. |

959 | Variational Graph Recurrent Neural Networks | Ehsan Hajiramezanali, Arman Hasanzadeh, Krishna Narayanan, Nick Duffield, Mingyuan Zhou, Xiaoning Qian | In this paper, we develop a novel hierarchical variational model that introduces additional latent random variables to jointly model the hidden states of a graph recurrent neural network (GRNN) to capture both topology and node attribute changes in dynamic graphs. |

960 | Semi-Implicit Graph Variational Auto-Encoders | Arman Hasanzadeh, Ehsan Hajiramezanali, Krishna Narayanan, Nick Duffield, Mingyuan Zhou, Xiaoning Qian | Semi-implicit graph variational auto-encoder (SIG-VAE) is proposed to expand the flexibility of variational graph auto-encoders (VGAE) to model graph data. |

961 | Unsupervised Learning of Object Keypoints for Perception and Control | Tejas D. Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, Volodymyr Mnih | In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). |

962 | A Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation | Xueying Bai, Jian Guan, Hongning Wang | In this work, we propose a model-based reinforcement learning solution which models the user-agent interaction for offline policy learning via a generative adversarial network. |

963 | Optimizing Generalized Rate Metrics with Three Players | Harikrishna Narasimhan, Andrew Cotter, Maya Gupta | We present a general framework for solving a large class of learning problems with non-linear functions of classification rates. |

964 | Consistency-based Semi-supervised Learning for Object detection | Jisoo Jeong, Seungeui Lee, Jeesoo Kim, Nojun Kwak | To alleviate this problem, we propose a Consistency-based Semi-supervised learning method for object Detection (CSD), which is a way of using consistency constraints as a tool for enhancing detection performance by making full use of available unlabeled data. |

965 | Rates of Convergence for Large-scale Nearest Neighbor Classification | Xingye Qiao, Jiexin Duan, Guang Cheng | For a large data set which cannot be loaded into the memory of a single machine due to computation, communication, privacy, or ownership limitations, we consider the divide and conquer scheme: the entire data set is divided into small subsamples, on which nearest neighbor predictions are made, and then a final decision is reached by aggregating the predictions on subsamples by majority voting. |

966 | An Embedding Framework for Consistent Polyhedral Surrogates | Jessica Finocchiaro, Rafael Frongillo, Bo Waggoner | We formalize and study the natural approach of designing convex surrogate loss functions via embeddings for problems such as classification or ranking. |

967 | Cross-Modal Learning with Adversarial Samples | Chao Li, Shangqian Gao, Cheng Deng, De Xie, Wei Liu | In this paper, we propose a novel Cross-Modal correlation Learning with Adversarial samples, namely CMLA, which for the first time presents the existence of adversarial samples in cross-modal data. |

968 | Fast-rate PAC-Bayes Generalization Bounds via Shifted Rademacher Processes | Jun Yang, Shengyang Sun, Daniel M. Roy | The goal of this paper is to extend this bridge between Rademacher complexity and state-of-the-art PAC-Bayesian theory. |

969 | Input-Cell Attention Reduces Vanishing Saliency of Recurrent Neural Networks | Aya Abdelsalam Ismail, Mohamed Gunady, Luiz Pessoa, Hector Corrada Bravo, Soheil Feizi | In this work we analyze saliency-based methods for RNNs, both classical and gated cell architectures. |

970 | Program Synthesis and Semantic Parsing with Learned Code Idioms | Eui Chul Shin, Miltiadis Allamanis, Marc Brockschmidt, Alex Polozov | In this work, we present Patois, a system that allows a neural program synthesizer to explicitly interleave high-level and low-level reasoning at every generation step. |

971 | Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks | Yuan Cao, Quanquan Gu | We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. |

972 | High-Dimensional Optimization in Adaptive Random Subspaces | Jonathan Lacotte, Mert Pilanci, Marco Pavone | We propose a new randomized optimization method for high-dimensional problems which can be seen as a generalization of coordinate descent to random subspaces. |

973 | Random Projections with Asymmetric Quantization | Xiaoyun Li, Ping Li | In this paper, we investigate the cosine similarity estimators derived in such setting under the Lloyd-Max (LM) quantization scheme. |

974 | Superposition of many models into one | Brian Cheung, Alexander Terekhov, Yubei Chen, Pulkit Agrawal, Bruno Olshausen | We present a method for storing multiple models within a single set of parameters. |

975 | Private Testing of Distributions via Sample Permutations | Maryam Aliakbarpour, Ilias Diakonikolas, Daniel Kane, Ronitt Rubinfeld | In this paper, we use the framework of property testing to design algorithms to test the properties of the distribution that the data is drawn from with respect to differential privacy. |

976 | McDiarmid-Type Inequalities for Graph-Dependent Variables and Stability Bounds | Rui (Ray) Zhang, Xingwu Liu, Yuyi Wang, Liwei Wang | We consider learning problems in which examples are dependent and their dependency relation is characterized by a graph. |

977 | How to Initialize your Network? Robust Initialization for WeightNorm & ResNets | Devansh Arpit, Víctor Campos, Yoshua Bengio | To address these issues, we propose a novel parameter initialization strategy that avoids explosion/vanishment of information across layers for weight normalized networks with and without residual connections. |

978 | On Making Stochastic Classifiers Deterministic | Andrew Cotter, Maya Gupta, Harikrishna Narasimhan | In this paper, we attempt to answer the theoretical question of how well a stochastic classifier can be approximated by a deterministic one, and compare several different approaches, proving lower and upper bounds. |

979 | Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection | Xiaoyi Gu, Leman Akoglu, Alessandro Rinaldo | In this paper we are concerned with investigating the performance of NN-based methods for anomaly detection. |

980 | Improving Black-box Adversarial Attacks with a Transfer-based Prior | Shuyu Cheng, Yinpeng Dong, Tianyu Pang, Hang Su, Jun Zhu | To address these problems, we propose a prior-guided random gradient-free (P-RGF) method to improve black-box adversarial attacks, which takes advantage of a transfer-based prior and the query information simultaneously. |

981 | Break the Ceiling: Stronger Multi-scale Deep Graph Convolutional Networks | Sitao Luan, Mingde Zhao, Xiao-Wen Chang, Doina Precup | In this paper, we first analyze key factors constraining the expressive power of existing Graph Convolutional Networks (GCNs), including the activation function and shallow learning mechanisms. Then, we generalize spectral graph convolution and deep GCN in block Krylov subspace forms, upon which we devise two architectures, both scalable in depth but making use of multi-scale information differently. |

982 | Statistical Model Aggregation via Parameter Matching | Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang | Exploiting tools from Bayesian nonparametrics, we develop a general meta-modeling framework that learns shared global latent structures by identifying correspondences among local model parameterizations. |

983 | On the (In)fidelity and Sensitivity of Explanations | Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I. Inouye, Pradeep K. Ravikumar | We propose simple robust variants of two notions that have been considered in recent literature: (in)fidelity, and sensitivity. |

984 | Exponential Family Estimation via Adversarial Dynamics Embedding | Bo Dai, Zhen Liu, Hanjun Dai, Niao He, Arthur Gretton, Le Song, Dale Schuurmans | We present an efficient algorithm for maximum likelihood estimation (MLE) of exponential family models, with a general parametrization of the energy function that includes neural networks. |

985 | The Broad Optimality of Profile Maximum Likelihood | Yi Hao, Alon Orlitsky | We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. |

986 | MintNet: Building Invertible Neural Networks with Masked Convolutions | Yang Song, Chenlin Meng, Stefano Ermon | We propose a new way of constructing invertible neural networks by combining simple building blocks with a novel set of composition rules. |

987 | Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates | Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, Daniel M. Roy | In this work, we improve upon the stepwise analysis of noisy iterative learning algorithms initiated by Pensia, Jog, and Loh (2018) and recently extended by Bu, Zou, and Veeravalli (2019). |

988 | On Distributed Averaging for Stochastic k-PCA | Aditya Bhaskara, Pruthuvi Maheshakya Wijewardena | We consider a slight variant of the well-studied “distributed averaging” approach, and prove that this leads to significantly better bounds on the dependence between ‘n’ and the eigenvalue gaps. |

989 | Controllable Unsupervised Text Attribute Transfer via Editing Entangled Latent Representation | Ke Wang, Hang Hua, Xiaojun Wan | To address the above problems, we propose a more flexible unsupervised text attribute transfer framework which replaces the process of modeling attribute with minimal editing of latent representations based on an attribute classifier. |

990 | MaxGap Bandit: Adaptive Algorithms for Approximate Ranking | Sumeet Katariya, Ardhendu Tripathy, Robert Nowak | We propose elimination and UCB-style algorithms and show that they are minimax optimal. |

991 | Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting | Aditya Grover, Jiaming Song, Ashish Kapoor, Kenneth Tran, Alekh Agarwal, Eric J. Horvitz, Stefano Ermon | We employ this likelihood-free importance weighting method to correct for the bias in generative models. |

992 | Online Forecasting of Total-Variation-bounded Sequences | Dheeraj Baby, Yu-Xiang Wang | We consider the problem of online forecasting of sequences of length $n$ with total-variation at most $C_n$ using observations contaminated by independent $\sigma$-subgaussian noise. |

993 | Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization | Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, Viveck Cadambe | In this paper, we study local distributed SGD, where data is partitioned among computation nodes, and the computation nodes perform local updates while periodically exchanging the model among the workers to perform averaging. |

994 | Data Parameters: A New Family of Parameters for Learning a Differentiable Curriculum | Shreyas Saxena, Oncel Tuzel, Dennis DeCoste | In this work, we address this problem by introducing data parameters. |

995 | Unified Sample-Optimal Property Estimation in Near-Linear Time | Yi Hao, Alon Orlitsky | We consider the fundamental learning problem of estimating properties of distributions over large domains. |

996 | Region Mutual Information Loss for Semantic Segmentation | Shuai Zhao, Yang Wang, Zheng Yang, Deng Cai | In this paper, we develop a region mutual information (RMI) loss to model the dependencies among pixels more simply and efficiently. |

997 | Learning Stable Deep Dynamics Models | J. Zico Kolter, Gaurav Manek | In this paper, we propose an approach for learning dynamical systems that are guaranteed to be stable over the entire state space. |

998 | Image Captioning: Transforming Objects into Words | Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares | In this work we introduce the Object Relation Transformer, that builds upon this approach by explicitly incorporating information about the spatial relationship between input detected objects through geometric attention. |

999 | Greedy Sampling for Approximate Clustering in the Presence of Outliers | Aditya Bhaskara, Sharvaree Vadgama, Hong Xu | In this work we show that for k-means and k-center clustering, simple modifications to the well-studied greedy algorithms result in nearly identical guarantees, while additionally being robust to outliers. |

1000 | Adversarial Fisher Vectors for Unsupervised Representation Learning | Joshua Susskind, Shuangfei Zhai, Walter Talbott, Carlos Guestrin | We examine Generative Adversarial Networks (GANs) through the lens of deep Energy Based Models (EBMs), with the goal of exploiting the density model that follows from this formulation. |

1001 | On Tractable Computation of Expected Predictions | Pasha Khosravi, YooJung Choi, Yitao Liang, Antonio Vergari, Guy Van den Broeck | In this paper, we identify a pair of generative and discriminative models that enables tractable computation of expectations, as well as moments of any order, of the latter with respect to the former in case of regression. |

1002 | Levenshtein Transformer | Jiatao Gu, Changhan Wang, Junbo Zhao | In this work, we develop Levenshtein Transformer, a new partially autoregressive model devised for more flexible and amenable sequence generation. |

1003 | Unlabeled Data Improves Adversarial Robustness | Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C. Duchi, Percy S. Liang | We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. |

1004 | Machine Teaching of Active Sequential Learners | Tomi Peltola, Mustafa Mert Çelikok, Pedram Daee, Samuel Kaski | We formulate this sequential teaching problem, which current techniques in machine teaching do not address, as a Markov decision process, with the dynamics nesting a model of the learner and the actions being the teacher’s responses. |

1005 | Gaussian-Based Pooling for Convolutional Neural Networks | Takumi Kobayashi | In this paper, to improve performance of CNNs, we propose a novel local pooling method based on the Gaussian-based probabilistic model over local neuron activations for flexibly pooling (extracting) features, in contrast to the previous model restricting the output within the convex hull of local neurons. |

1006 | Meta Architecture Search | Albert Shaw, Wei Wei, Weiyang Liu, Le Song, Bo Dai | We propose the Bayesian Meta Architecture SEarch (BASE) framework which takes advantage of a Bayesian formulation of the architecture search problem to learn over an entire set of tasks simultaneously. |

1007 | NAOMI: Non-Autoregressive Multiresolution Sequence Imputation | Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue | In this paper, we take a non-autoregressive approach and propose a novel deep generative model: Non-AutOregressive Multiresolution Imputation (NAOMI) to impute long-range sequences given arbitrary missing patterns. |

1008 | Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks | Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, Quanquan Gu | To deal with the above two problems, we propose a new effective sampling algorithm called LAyer-Dependent ImportancE Sampling (LADIES). |

1009 | Two Generator Game: Learning to Sample via Linear Goodness-of-Fit Test | Lizhong Ding, Mengyang Yu, Li Liu, Fan Zhu, Yong Liu, Yu Li, Ling Shao | To solve this problem, we formulate a deep energy adversarial network (DEAN), which casts the energy model learned from real data into an optimization of a goodness-of-fit (GOF) test statistic. |

1010 | Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards | Anmol Kagrecha, Jayakrishnan Nair, Krishna Jagannathan | In this paper, we consider the problem of selecting the arm that optimizes a linear combination of the expected reward and the associated Conditional Value at Risk (CVaR) in a fixed budget best-arm identification framework. |

1011 | Private Stochastic Convex Optimization with Optimal Rates | Raef Bassily, Vitaly Feldman, Kunal Talwar, Abhradeep Guha Thakurta | We study differentially private (DP) algorithms for stochastic convex optimization (SCO). |

1012 | Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers | Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, Greg Yang | In this paper, we employ adversarial training to improve the performance of randomized smoothing. |

1013 | Demystifying Black-box Models with Symbolic Metamodels | Ahmed M. Alaa, Mihaela van der Schaar | To address this issue, we introduce the symbolic metamodeling framework – a general methodology for interpreting predictions by converting “black-box” models into “white-box” functions that are understandable to human subjects. |

1014 | Neural Temporal-Difference Learning Converges to Global Optima | Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang | In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. |

1015 | Privacy-Preserving Q-Learning with Functional Noise in Continuous Spaces | Baoxiang Wang, Nidhi Hegde | Our aim is to protect the value function approximator, without regard to the number of states queried to the function. |

1016 | Attentive State-Space Modeling of Disease Progression | Ahmed M. Alaa, Mihaela van der Schaar | In this paper, we develop the attentive state-space model, a deep probabilistic model that learns accurate and interpretable structured representations for disease trajectories. |

1017 | Online EXP3 Learning in Adversarial Bandits with Delayed Feedback | Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet | For the case where $\sum_{t=1}^{T}d_{t}$ and $T$ are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of $O\left(\sqrt{\ln K\left(K^{2}T+\sum_{t=1}^{T}d_{t}\right)}\right)$. |

1018 | A Direct $\tilde{O}(1/\epsilon)$ Iteration Parallel Algorithm for Optimal Transport | Arun Jambulapati, Aaron Sidford, Kevin Tian | We give an algorithm which solves the problem to additive $\epsilon$ accuracy with $\tilde{O}(1/\epsilon)$ parallel depth and $\tilde{O}\left(n^2/\epsilon\right)$ work. |

1019 | Faster Boosting with Smaller Memory | Julaiti Alafate, Yoav S. Freund | This paper presents an alternative approach to implementing the boosted trees, which achieves a significant speedup over XGBoost and LightGBM, especially when the memory size is small. |

1020 | Variance Reduction for Matrix Games | Yair Carmon, Yujia Jin, Aaron Sidford, Kevin Tian | We present a randomized primal-dual algorithm that solves the problem $\min_x \max_y y^\top A x$ to additive error $\epsilon$ in time $\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)\, n}/\epsilon$, for a matrix $A$ with larger dimension $n$ and $\mathrm{nnz}(A)$ nonzero entries. |

1021 | Learning Neural Networks with Adaptive Regularization | Han Zhao, Yao-Hung Hubert Tsai, Russ R. Salakhutdinov, Geoffrey J. Gordon | To optimize the model, we present an efficient block coordinate descent algorithm with analytical solutions. |

1022 | Distributed estimation of the inverse Hessian by determinantal averaging | Michal Derezinski, Michael W. Mahoney | To address this, we propose determinantal averaging, a new approach for correcting the inversion bias. |

1023 | Smoothing Structured Decomposable Circuits | Andy Shih, Guy Van den Broeck, Paul Beame, Antoine Amarilli | We propose a near-linear time algorithm for this task and explore lower bounds for smoothing decomposable circuits, using existing results on range-sum queries. |

1024 | Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks | Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, George Pappas | In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. |

1025 | Provable Non-linear Inductive Matrix Completion | Kai Zhong, Zhao Song, Prateek Jain, Inderjit S. Dhillon | In this paper, we provide the first theoretical analysis for a simple NIMC model in the realizable setting, where the relevance score of a (query, item) pair is formulated as the inner product between their single-layer neural representations. |

1026 | Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback | Shuai Zheng, Ziyue Huang, James Kwok | In this paper, we propose a general distributed compressed SGD with Nesterov’s momentum. |

1027 | Sparse Variational Inference: Bayesian Coresets from Scratch | Trevor Campbell, Boyan Beronov | In the present work we remove this requirement by formulating coreset construction as sparsity-constrained variational inference within an exponential family. |

1028 | Personalizing Many Decisions with High-Dimensional Covariates | Nima Hamidi, Mohsen Bayati, Kapil Gupta | The main contribution of this paper is to introduce and theoretically analyze a new algorithm (REAL Bandit) with a regret that scales as $r^2(k+d)$, where $r$ is the rank of the $k \times d$ matrix of unknown parameters. |

1029 | A Necessary and Sufficient Stability Notion for Adaptive Generalization | Moshe Shenfeld, Katrina Ligett | We introduce a new notion of the stability of computations, which holds under post-processing and adaptive composition. |

1030 | Necessary and Sufficient Geometries for Gradient Methods | Daniel Levy, John C. Duchi | We study the impact of the constraint set and gradient geometry on the convergence of online and stochastic methods for convex optimization, providing a characterization of the geometries for which stochastic gradient and adaptive gradient methods are (minimax) optimal. |

1031 | Landmark Ordinal Embedding | Nikhil Ghosh, Yuxin Chen, Yisong Yue | In this paper, we aim to learn a low-dimensional Euclidean representation from a set of constraints of the form “item j is closer to item i than item k”. |

1032 | Identification of Conditional Causal Effects under Markov Equivalence | Amin Jaber, Jiji Zhang, Elias Bareinboim | In this work, we derive an algorithm to identify conditional effects, which are particularly useful for evaluating conditional plans or policies. |

1033 | The Thermodynamic Variational Objective | Vaden Masrani, Tuan Anh Le, Frank Wood | We introduce the thermodynamic variational objective (TVO) for learning in both continuous and discrete deep generative models. |

1034 | Global Guarantees for Blind Demodulation with Generative Priors | Paul Hand, Babhru Joshi | We study a deep learning inspired formulation for the blind demodulation problem, which is the task of recovering two unknown vectors from their entrywise multiplication. |

1035 | Exact sampling of determinantal point processes with sublinear time preprocessing | Michal Derezinski, Daniele Calandriello, Michal Valko | For this purpose we provide DPP-VFX, a new algorithm which, given access only to L, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is n poly(k), i.e., sublinear in the size of L, and (2) its sampling cost is poly(k), i.e., independent of the size of L. Prior to our results, state-of-the-art exact samplers required O(n^3) preprocessing time and sampling time linear in n or dependent on the spectral properties of L. |

1036 | Geometry-Aware Neural Rendering | Joshua Tobin, Wojciech Zaremba, Pieter Abbeel | We propose Epipolar Cross Attention (ECA), an attention mechanism that leverages the geometry of the scene to perform efficient non-local operations, requiring only $O(n)$ comparisons per spatial dimension instead of $O(n^2)$. |

1037 | Variational Temporal Abstraction | Taesup Kim, Sungjin Ahn, Yoshua Bengio | We introduce a variational approach to learning and inference of temporally hierarchical structure and representation for sequential data. |

1038 | Subquadratic High-Dimensional Hierarchical Clustering | Amir Abboud, Vincent Cohen-Addad, Hussein Houdrouge | We consider the widely-used average-linkage, single-linkage, and Ward’s methods for computing hierarchical clusterings of high-dimensional Euclidean inputs. |

1039 | Learning Auctions with Robust Incentive Guarantees | Jacob D. Abernethy, Rachel Cummings, Bhuvesh Kumar, Sam Taggart, Jamie H. Morgenstern | In this paper, we combine tools from differential privacy, mechanism design, and sample complexity to give a repeated auction that (1) learns bidder demand from past data, (2) is approximately revenue-optimal, and (3) is strategically robust, as it incentivizes bidders to behave truthfully. |

1040 | Policy Optimization Provably Converges to Nash Equilibria in Zero-Sum Linear Quadratic Games | Kaiqing Zhang, Zhuoran Yang, Tamer Basar | Building upon this, we develop three projected nested-gradient methods that are guaranteed to converge to the NE of the game. |

1041 | Uniform convergence may be unable to explain generalization in deep learning | Vaishnavh Nagarajan, J. Zico Kolter | Through these findings, we cast doubt on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well. |

1042 | A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions | Mejbah Alam, Justin Gottschlich, Nesime Tatbul, Javier S. Turek, Tim Mattson, Abdullah Muzahid | In this paper, we apply machine programming (MP) to the automation of software performance regression testing. |

1043 | DTWNet: a Dynamic Time Warping Network | Xingyu Cai, Tingyang Xu, Jinfeng Yi, Junzhou Huang, Sanguthevar Rajasekaran | In this paper, we propose a novel component in an artificial neural network. |

1044 | Structured Graph Learning Via Laplacian Spectral Constraints | Sandeep Kumar, Jiaxi Ying, Jose Vinicius de Miranda Cardoso, Daniel Palomar | In this paper, we first show that, for a set of important graph families, it is possible to convert the combinatorial constraints of structure into eigenvalue constraints of the graph Laplacian matrix. We then introduce a unified graph learning framework that integrates the spectral properties of the Laplacian matrix with Gaussian graphical modeling and is capable of learning structures of a large class of graph families. |

1045 | Thresholding Bandit with Optimal Aggregate Regret | Chao Tao, Saúl Blanco, Jian Peng, Yuan Zhou | We introduce LSA, a new, simple and anytime algorithm that aims to minimize the aggregate regret (or the expected number of mis-classified arms). |

1046 | Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks | Yuanzhi Li, Colin Wei, Tengyu Ma | Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. |

1047 | Rethinking Kernel Methods for Node Representation Learning on Graphs | Yu Tian, Long Zhao, Xi Peng, Dimitris Metaxas | Here, we present a novel theoretical kernel-based framework for node classification that can bridge the gap between these two representation learning problems on graphs. |

1048 | Causal Confusion in Imitation Learning | Pim de Haan, Dinesh Jayaraman, Sergey Levine | We investigate how this problem arises, and propose a solution to combat it through targeted interventions—either environment interaction or expert queries—to determine the correct causal model. |

1049 | Optimizing Generalized PageRank Methods for Seed-Expansion Community Detection | Pan Li, I Chien, Olgica Milenkovic | Given this result, we propose a new GPR, termed Inverse PR (IPR), with LP weights that increase for the initial few steps of the walks. |

1050 | The Case for Evaluating Causal Models Using Interventional Measures and Empirical Data | Amanda Gentzel, Dan Garant, David Jensen | We argue for more frequent use of evaluation techniques that examine interventional measures rather than structural or observational measures, and that evaluate those measures on empirical data rather than synthetic data. |

1051 | Dimension-Free Bounds for Low-Precision Training | Zheng Li, Christopher M. De Sa | In this paper, we derive new bounds for low-precision training algorithms that do not contain the dimension $d$, which lets us better understand what affects the convergence of these algorithms as parameters scale. |

1052 | Concentration of risk measures: A Wasserstein distance approach | Sanjay P. Bhat, Prashanth L.A. | Concentration of risk measures: A Wasserstein distance approach |

1053 | Meta-Inverse Reinforcement Learning with Probabilistic Context Variables | Lantao Yu, Tianhe Yu, Chelsea Finn, Stefano Ermon | To this end, we propose a deep latent variable model that is capable of learning rewards from unstructured, multi-task demonstration data, and critically, use this experience to infer robust rewards for new, structurally-similar tasks from a single demonstration. |

1054 | Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction | Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, Sergey Levine | Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). |

1055 | Bayesian Optimization with Unknown Search Space | Huong Ha, Santu Rana, Sunil Gupta, Thanh Nguyen, Hung Tran-The, Svetha Venkatesh | To address this problem, we propose a systematic volume expansion strategy for Bayesian optimization. |

1056 | On the Downstream Performance of Compressed Word Embeddings | Avner May, Jian Zhang, Tri Dao, Christopher Ré | We thus propose the eigenspace overlap score as a new measure. |

1057 | Multivariate Distributionally Robust Convex Regression under Absolute Error Loss | Jose Blanchet, Peter W. Glynn, Jun Yan, Zhengqing Zhou | This paper proposes a novel non-parametric multidimensional convex regression estimator which is designed to be robust to adversarial perturbations in the empirical measure. |

1058 | Neural Relational Inference with Fast Modular Meta-learning | Ferran Alet, Erica Weng, Tomás Lozano-Pérez, Leslie Pack Kaelbling | We frame relational inference as a modular meta-learning problem, where neural modules are trained to be composed in different ways to solve many tasks. |

1059 | Gradient based sample selection for online continual learning | Rahaf Aljundi, Min Lin, Baptiste Goujaud, Yoshua Bengio | In this work, we formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning. |

1060 | Attribution-Based Confidence Metric For Deep Neural Networks | Susmit Jha, Sunny Raj, Steven Fernandes, Sumit K. Jha, Somesh Jha, Brian Jalaian, Gunjan Verma, Ananthram Swami | We propose a novel confidence metric, namely, attribution-based confidence (ABC) for deep neural networks (DNNs). |

1061 | Theoretical evidence for adversarial robustness through randomization | Rafael Pinot, Laurent Meunier, Alexandre Araujo, Hisashi Kashima, Florian Yger, Cedric Gouy-Pailler, Jamal Atif | This paper investigates the theory of robustness against adversarial attacks. |

1062 | Online Continual Learning with Maximal Interfered Retrieval | Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, Lucas Page-Caccia | In this work, we consider a controlled sampling of memories for replay. |

1063 | Neural Attribution for Semantic Bug-Localization in Student Programs | Rahul Gupta, Aditya Kanade, Shirish Shevade | In this work, we present NeuralBugLocator, a deep learning based technique, that can localize the bugs in a faulty program with respect to a failing test, without even running the program. |

1064 | Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates | Carlos Riquelme, Hugo Penedones, Damien Vincent, Hartmut Maennel, Sylvain Gelly, Timothy A. Mann, Andre Barreto, Gergely Neu | In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. |

1065 | SPoC: Search-based Pseudocode to Code | Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, Percy S. Liang | We consider the task of mapping pseudocode to executable code, assuming a one-to-one correspondence between lines of pseudocode and lines of code. |

1066 | Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song, Stefano Ermon | We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. |

1067 | Adversarial Music: Real world Audio Adversary against Wake-word Detection System | Juncheng Li, Shuhui Qu, Xinjian Li, Joseph Szurley, J. Zico Kolter, Florian Metze | In this work, we target our attack on the wake-word detection system. |

1068 | Prediction of Spatial Point Processes: Regularized Method with Out-of-Sample Guarantees | Muhammad Osama, Dave Zachariah, Peter Stoica | In this paper, we develop a method to infer predictive intensity intervals by learning a spatial model using a regularized criterion. |

1069 | Debiased Bayesian inference for average treatment effects | Kolyan Ray, Botond Szabo | Working in the standard potential outcomes framework, we propose a data-driven modification to an arbitrary (nonparametric) prior based on the propensity score that corrects for the first-order posterior bias, thereby improving performance. |

1070 | Margin-Based Generalization Lower Bounds for Boosted Classifiers | Allan Grønlund, Lior Kamma, Kasper Green Larsen, Alexander Mathiasen, Jelani Nelson | In this paper, we give the first margin-based lower bounds on the generalization error of boosted classifiers. |

1071 | Connections Between Mirror Descent, Thompson Sampling and the Information Ratio | Julian Zimmert, Tor Lattimore | We make a formal connection, showing that the information-theoretic bounds in most applications are derived from existing techniques from online convex optimisation. |

1072 | Graph Transformer Networks | Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, Hyunwoo J. Kim | In this paper, we propose Graph Transformer Networks (GTNs) that are capable of generating new graph structures, which involve identifying useful connections between unconnected nodes on the original graph, while learning effective node representation on the new graphs in an end-to-end fashion. |

1073 | Learning to Confuse: Generating Training Time Adversarial Data with Auto-Encoder | Ji Feng, Qi-Zhi Cai, Zhi-Hua Zhou | In this work, we consider one challenging training time attack that modifies training data with bounded perturbation, aiming to manipulate the behavior (either targeted or non-targeted) of any corresponding trained classifier during test time when facing clean samples. |

1074 | The Impact of Regularization on High-dimensional Logistic Regression | Fariborz Salehi, Ehsan Abbasi, Babak Hassibi | In the high-dimensional regime the underlying parameter vector is often structured (sparse, block-sparse, finite-alphabet, etc.) and so in this paper we study regularized logistic regression (RLR), where a convex regularizer that encourages the desired structure is added to the negative of the log-likelihood function. |

1075 | Adaptive Density Estimation for Generative Models | Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek | As a solution, we propose the use of deep invertible transformations in the latent variable decoder. |

1076 | Fast and Provable ADMM for Learning with Generative Priors | Fabian Latorre, Armin Eftekhari, Volkan Cevher | In this work, we propose a (linearized) Alternating Direction Method-of-Multipliers (ADMM) algorithm for minimizing a convex function subject to a nonconvex constraint. |

1077 | Weighted Linear Bandits for Non-Stationary Environments | Yoan Russac, Claire Vernade, Olivier Cappé | To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. |

1078 | Improved Regret Bounds for Bandit Combinatorial Optimization | Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, Ken-Ichi Kawarabayashi | In this paper, we aim to reveal the property that makes bandit combinatorial optimization hard. |

1079 | Pareto Multi-Task Learning | Xi Lin, Hui-Ling Zhen, Zhenhua Li, Qing-Fu Zhang, Sam Kwong | In this paper, we generalize this idea and propose a novel Pareto multi-task learning algorithm (Pareto MTL) to find a set of well-distributed Pareto solutions which can represent different trade-offs among different tasks. |

1080 | SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits | Etienne Boursier, Vianney Perchet | We present a decentralized algorithm that achieves the same performance as a centralized one, contradicting the existing lower bounds for that problem. |

1081 | Novel positional encodings to enable tree-based transformers | Vighnesh Shiv, Chris Quirk | Motivated by this property, we propose a method to extend transformers to tree-structured data, enabling sequence-to-tree, tree-to-sequence, and tree-to-tree mappings. |

1082 | A Domain Agnostic Measure for Monitoring and Evaluating GANs | Paulina Grnarova, Kfir Y. Levy, Aurelien Lucchi, Nathanael Perraudin, Ian Goodfellow, Thomas Hofmann, Andreas Krause | We leverage the notion of duality gap from game theory to propose a measure that addresses both (i) and (ii) at a low computational cost. |

1083 | Submodular Function Minimization with Noisy Evaluation Oracle | Shinji Ito | For this problem, we provide an algorithm that returns an $O(n^{3/2}/\sqrt{T})$-additive approximate solution in expectation, where $n$ and $T$ stand for the size of the problem and the number of oracle calls, respectively. |

1084 | Counting the Optimal Solutions in Graphical Models | Radu Marinescu, Rina Dechter | We introduce #opt, a new inference task for graphical models which calls for counting the number of optimal solutions of the model. |

1085 | Modelling the Dynamics of Multiagent Q-Learning in Repeated Symmetric Games: a Mean Field Theoretic Approach | Shuyue Hu, Chin-wing Leung, Ho-fung Leung | In this paper, we study an n-agent setting with n tending to infinity, in which agents learn their policies concurrently over repeated symmetric bimatrix games with some other agents. |

1086 | Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling | Ming Hou, Jiajia Tang, Jianhai Zhang, Wanzeng Kong, Qibin Zhao | In this work, we first propose a polynomial tensor pooling (PTP) block for integrating multimodal features by considering high-order moments, followed by a tensorized fully connected layer. Treating PTP as a building block, we further establish a hierarchical polynomial fusion network (HPFN) to recursively transmit local correlations into global ones. |

1087 | Bootstrapping Upper Confidence Bound | Botao Hao, Yasin Abbasi Yadkori, Zheng Wen, Guang Cheng | In this paper, we propose a non-parametric and data-dependent UCB algorithm based on the multiplier bootstrap. |

1088 | Integer Discrete Flows and Lossless Compression | Emiel Hoogeboom, Jorn Peters, Rianne van den Berg, Max Welling | For that reason, we introduce a flow-based generative model for ordinal discrete data called Integer Discrete Flow (IDF): a bijective integer map that can learn rich transformations on high-dimensional data. |

1089 | Structured Prediction with Projection Oracles | Mathieu Blondel | We propose in this paper a general framework for deriving loss functions for structured prediction. |

1090 | A Primal Dual Formulation For Deep Learning With Constraints | Yatin Nandwani, Abhishek Pathak, Mausam, Parag Singla | In this paper, we present a constrained optimization formulation for training a deep network with a given set of hard constraints on output labels. |

1091 | Screening Sinkhorn Algorithm for Regularized Optimal Transport | Mokhtar Z. Alaya, Maxime Berar, Gilles Gasso, Alain Rakotomamonjy | We introduce in this paper a novel strategy for efficiently approximating the Sinkhorn distance between two discrete measures. |

1092 | PAC-Bayes Un-Expected Bernstein Inequality | Zakaria Mhammedi, Peter Grünwald, Benjamin Guedj | We present a new PAC-Bayesian generalization bound. |

1093 | Are Labels Required for Improving Adversarial Robustness? | Jean-Baptiste Alayrac, Jonathan Uesato, Po-Sen Huang, Alhussein Fawzi, Robert Stanforth, Pushmeet Kohli | Our main insight is that unlabeled data can be a competitive alternative to labeled data for training adversarially robust models. |

1094 | Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies | Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, Shie Mannor | In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies (acting by 1-step planning) can achieve tight minimax performance in terms of regret, $O(\sqrt{HSAT})$. |

1095 | Multi-objective Bayesian optimisation with preferences over objectives | Majid Abdolshah, Alistair Shilton, Santu Rana, Sunil Gupta, Svetha Venkatesh | We present a multi-objective Bayesian optimisation algorithm that allows the user to express preference-order constraints on the objectives of the type "objective A is more important than objective B". |

1096 | Think out of the "Box": Generically-Constrained Asynchronous Composite Optimization and Hedging | Pooria Joulani, András György, Csaba Szepesvari | We present two new algorithms, ASYNCADA and HEDGEHOG, for asynchronous sparse online and stochastic optimization. |

1097 | Calibration tests in multi-class classification: A unifying framework | David Widmann, Fredrik Lindsten, Dave Zachariah | We propose and study calibration measures for multi-class classification that generalize existing measures such as the expected calibration error, the maximum calibration error, and the maximum mean calibration error. |

1098 | Classification Accuracy Score for Conditional Generative Models | Suman Ravuri, Oriol Vinyals | To test this latter hypothesis, we use class-conditional generative models from a number of model classes – variational autoencoders, autoregressive models, and generative adversarial networks (GANs) – to infer the class labels of real data. |

1099 | Theoretical Analysis of Adversarial Learning: A Minimax Approach | Zhuozhuo Tu, Jingwei Zhang, Dacheng Tao | In this paper, we propose a general theoretical method for analyzing the risk bound in the presence of adversaries. |

1100 | Multiagent Evaluation under Incomplete Information | Mark Rowland, Shayegan Omidshafiei, Karl Tuyls, Julien Perolat, Michal Valko, Georgios Piliouras, Remi Munos | We propose adaptive algorithms for accurate ranking, provide correctness and sample complexity guarantees, then introduce a means of connecting uncertainties in noisy match outcomes to uncertainties in rankings. |

1101 | Tree-Sliced Variants of Wasserstein Distances | Tam Le, Makoto Yamada, Kenji Fukumizu, Marco Cuturi | We consider in this work a more general family of ground metrics, namely *tree metrics*, which also yield fast closed-form computations and negative definite, and of which the sliced-Wasserstein distance is a particular case (the tree is a chain). |

1102 | Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration | Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter Flach | We propose a natively multiclass calibration method applicable to classifiers from any model class, derived from Dirichlet distributions and generalising the beta calibration method from binary classification. |

1103 | Comparing distributions: $\ell_1$ geometry improves kernel two-sample testing | Meyer Scetbon, Gael Varoquaux | Here, we show that $L^p$ distances (with $p\geq 1$) between these distribution representatives give metrics on the space of distributions that are well-behaved to detect differences between distributions as they metrize the weak convergence. |

1104 | Robustness Verification of Tree-based Models | Hongge Chen, Huan Zhang, Si Si, Yang Li, Duane Boning, Cho-Jui Hsieh | For general problems, by exploiting the boxicity of the graph, we devise an efficient verification algorithm that can give tight lower bounds on robustness of decision tree ensembles, and allows iterative improvement and any-time termination. |

1105 | Towards Interpretable Reinforcement Learning Using Attention Augmented Agents | Alexander Mott, Daniel Zoran, Mike Chrzanowski, Daan Wierstra, Danilo Jimenez Rezende | Inspired by recent work in attention models for image captioning and question answering, we present a soft attention model for the reinforcement learning domain. |

1106 | Fast and Accurate Stochastic Gradient Estimation | Beidi Chen, Yingchen Xu, Anshumali Shrivastava | In this paper, we break this barrier by providing the first demonstration of a scheme, Locality sensitive hashing (LSH) sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of the uniform sampling. |

1107 | Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning | Igor Colin, Ludovic Dos Santos, Kevin Scaman | We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. |

1108 | Root Mean Square Layer Normalization | Biao Zhang, Rico Sennrich | In this paper, we hypothesize that re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or RMSNorm. |

1109 | Universality in Learning from Linear Measurements | Ehsan Abbasi, Fariborz Salehi, Babak Hassibi | We study the problem of recovering a structured signal from independently and identically drawn linear measurements. |

1110 | Planning in entropy-regularized Markov decision processes and games | Jean-Bastien Grill, Omar Darwiche Domingues, Pierre Menard, Remi Munos, Michal Valko | We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. |

1111 | Exponentially convergent stochastic k-PCA without variance reduction | Cheng Tang | We present Matrix Krasulina, an algorithm for online k-PCA, by generalizing the classic Krasulina’s method (Krasulina, 1969) from vector to matrix case. |

1112 | R2D2: Reliable and Repeatable Detector and Descriptor | Jerome Revaud, Cesar De Souza, Martin Humenberger, Philippe Weinzaepfel | In this work, we argue that repeatable regions are not necessarily discriminative and can therefore lead to the selection of suboptimal keypoints. |

1113 | Selective Sampling-based Scalable Sparse Subspace Clustering | Shin Matsushima, Maria Brbic | To overcome this limitation, we introduce the Selective Sampling-based Scalable Sparse Subspace Clustering (S5C) algorithm, which selects subsamples based on the approximated subgradients and scales linearly with the number of data points in terms of time and memory requirements. |

1114 | A General Framework for Symmetric Property Estimation | Moses Charikar, Kirankumar Shiragur, Aaron Sidford | In this paper we provide a general framework for estimating symmetric properties of distributions from i.i.d. samples. |

1115 | Structured Variational Inference in Continuous Cox Process Models | Virginia Aglietti, Edwin V. Bonilla, Theodoros Damoulas, Sally Cripps | We propose a scalable framework for inference in a continuous sigmoidal Cox process that assumes the corresponding intensity function is given by a Gaussian process (GP) prior transformed with a scaled logistic sigmoid function. |

1116 | Generalization of Reinforcement Learners with Working and Episodic Memory | Meire Fortunato, Melissa Tan, Ryan Faulkner, Steven Hansen, Adrià Puigdomènech Badia, Gavin Buttimore, Charles Deck, Joel Z. Leibo, Charles Blundell | In this paper, we aim to develop a comprehensive methodology to test different kinds of memory in an agent and assess how well the agent can apply what it learns in training to a holdout set that differs from the training set along dimensions that we suggest are relevant for evaluating memory-specific generalization. |

1117 | Distribution Learning of a Random Spatial Field with a Location-Unaware Mobile Sensor | Meera Pai, Animesh Kumar | While GPS or other localization methods can reduce this uncertainty, we address a more fundamental question: can a location-unaware mobile sensor, recording samples on a directed non-uniform random walk, learn the statistical distribution (as a function of space) of an underlying random process (spatial field)? |

1118 | Hindsight Credit Assignment | Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Gheshlaghi Azar, Bilal Piot, Nicolas Heess, Hado P. van Hasselt, Gregory Wayne, Satinder Singh, Doina Precup, Remi Munos | We consider the problem of efficient credit assignment in reinforcement learning. |

1119 | Efficient Identification in Linear Structural Causal Models with Instrumental Cutsets | Daniel Kumor, Bryant Chen, Elias Bareinboim | In this paper, we investigate graphical conditions to allow efficient identification in arbitrary linear structural causal models (SCMs). |

1120 | Kernelized Bayesian Softmax for Text Generation | Ning Miao, Hao Zhou, Chengqi Zhao, Wenxian Shi, Lei Li | In this paper, we propose KerBS, a novel approach for learning better embeddings for text generation. |

1121 | When to Trust Your Model: Model-Based Policy Optimization | Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine | In this paper, we study the role of model usage in policy optimization both theoretically and empirically. |

1122 | Correlation Clustering with Adaptive Similarity Queries | Marco Bressan, Nicolò Cesa-Bianchi, Andrea Paudice, Fabio Vitale | In this work we investigate correlation clustering as an active learning problem: each similarity score can be learned by making a query, and the goal is to minimise both the disagreements and the total number of queries. |

1123 | Control What You Can: Intrinsically Motivated Task-Planning Agent | Sebastian Blaes, Marin Vlastelica Pogancic, Jiajie Zhu, Georg Martius | We present a novel intrinsically motivated agent that learns how to control the environment in a sample efficient manner, that is with as few environment interactions as possible, by optimizing learning progress. |

1124 | Selecting causal brain features with a single conditional independence test per feature | Atalanti Mastakouri, Bernhard Schölkopf, Dominik Janzing | We propose a constraint-based causal feature selection method for identifying causes of a given target variable, selecting from a set of candidate variables, while there can also be hidden variables acting as common causes with the target. |

1125 | Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders | Emile Mathieu, Charline Le Lan, Chris J. Maddison, Ryota Tomioka, Yee Whye Teh | We therefore endow VAEs with a Poincaré ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space. |

1126 | A Generic Acceleration Framework for Stochastic Composite Optimization | Andrei Kulunchakov, Julien Mairal | In this paper, we introduce various mechanisms to obtain accelerated first-order stochastic optimization algorithms when the objective function is convex or strongly convex. |

1127 | Beating SGD Saturation with Tail-Averaging and Minibatching | Nicole Muecke, Gergely Neu, Lorenzo Rosasco | In this paper, we consider least squares learning in a nonparametric setting and contribute to filling this gap by focusing on the effect and interplay of multiple passes, mini-batching and averaging, in particular tail averaging. |

1128 | Random Quadratic Forms with Dependence: Applications to Restricted Isometry and Beyond | Arindam Banerjee, Qilong Gu, Vidyashankar Sivakumar, Steven Z. Wu | In this paper, we show that such independence is in fact not needed for such results which continue to hold under fairly general dependence structures. |

1129 | Continuous-time Models for Stochastic Optimization Algorithms | Antonio Orvieto, Aurelien Lucchi | We propose new continuous-time formulations for first-order stochastic optimization algorithms such as mini-batch gradient descent and variance-reduced methods. |

1130 | Curriculum-guided Hindsight Experience Replay | Meng Fang, Tianyi Zhou, Yali Du, Lei Han, Zhengyou Zhang | In this paper, we propose to 1) adaptively select the failed experiences for replay according to the proximity to the true goals and the curiosity of exploration over diverse pseudo goals, and 2) gradually change the proportion of the goal-proximity and the diversity-based curiosity in the selection criteria: we adopt a human-like learning strategy that enforces more curiosity in earlier stages and changes to larger goal-proximity later. |

1131 | Implicit Semantic Data Augmentation for Deep Networks | Yulin Wang, Xuran Pan, Shiji Song, Hong Zhang, Gao Huang, Cheng Wu | In this paper, we propose a novel implicit semantic data augmentation (ISDA) approach to complement traditional augmentation techniques like flipping, translation or rotation. |

1132 | MetaInit: Initializing learning by learning to initialize | Yann N. Dauphin, Samuel Schoenholz | In this work, we introduce an algorithm called MetaInit as a step towards automating the search for good initializations using meta-learning. |

1133 | Scalable Deep Generative Relational Model with High-Order Node Dependence | Xuhui Fan, Bin Li, Caoyuan Li, Scott Sisson, Ling Chen | In this work, we propose a probabilistic framework for relational data modelling and latent structure exploring. |

1134 | Random Path Selection for Continual Learning | Jathushan Rajasegaran, Munawar Hayat, Salman H. Khan, Fahad Shahbaz Khan, Ling Shao | In this paper, we propose a random path selection algorithm, called RPS-Net, that progressively chooses optimal paths for the new tasks while encouraging parameter sharing and reuse. |

1135 | Efficient Algorithms for Smooth Minimax Optimization | Kiran K. Thekumparampil, Prateek Jain, Praneeth Netrapalli, Sewoong Oh | For strongly-convex $g(\cdot, y),\ \forall y$, we propose a new direct optimal algorithm combining Mirror-Prox and Nesterov’s AGD, and show that it can find global optimum in $\widetilde{O}\left(1/k^2 \right)$ iterations, improving over current state-of-the-art rate of $O(1/k)$. |

1136 | Shadowing Properties of Optimization Algorithms | Antonio Orvieto, Aurelien Lucchi | In an attempt to encourage the use of continuous-time methods in optimization, we show that, if some additional regularity on the objective is assumed, the ODE representations of Gradient Descent and Heavy-ball do not suffer from the aforementioned problem, once we allow for a small perturbation on the algorithm initial condition. |

1137 | Causal Regularization | Dominik Janzing | We argue that regularizing terms in standard regression methods not only help against overfitting finite data, but sometimes also help in getting better causal models. |

1138 | Learning Hawkes Processes from a handful of events | Farnood Salehi, William Trouleau, Matthias Grossglauser, Patrick Thiran | To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. |

1139 | Unsupervised Object Segmentation by Redrawing | Mickaël Chen, Thierry Artières, Ludovic Denoyer | We present ReDO, a new model able to extract objects from images without any annotation in an unsupervised way. |

1140 | Regret Bounds for Learning State Representations in Reinforcement Learning | Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, Odalric-Ambrym Maillard | We propose an algorithm (UCB-MS) with $O(\sqrt{T})$ regret in any communicating Markov decision process. |

1141 | Band-Limited Gaussian Processes: The Sinc Kernel | Felipe Tobar | We propose a novel class of Gaussian processes (GPs) whose spectra have compact support, meaning that their sample trajectories are almost-surely band limited. |

1142 | Leveraging Labeled and Unlabeled Data for Consistent Fair Binary Classification | Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Luca Oneto, Massimiliano Pontil | We study the problem of fair binary classification using the notion of Equal Opportunity. |

1143 | Learning search spaces for Bayesian optimization: Another view of hyperparameter transfer learning | Valerio Perrone, Huibin Shen | In this work, we introduce a method to automatically design the BO search space by relying on evaluations of previous black-box functions. |

1144 | Streaming Bayesian Inference for Crowdsourced Classification | Edoardo Manino, Long Tran-Thanh, Nicholas Jennings | In this paper, we revisit the problem of binary classification from crowdsourced data. |

1145 | Neuropathic Pain Diagnosis Simulator for Causal Discovery Algorithm Evaluation | Ruibo Tu, Kun Zhang, Bo Bertilson, Hedvig Kjellstrom, Cheng Zhang | In this work, we handle the problem of evaluating causal discovery algorithms by building a flexible simulator in the medical setting. We develop a neuropathic pain diagnosis simulator, inspired by the fact that the biological processes of neuropathic pathophysiology are well studied with well-understood causal influences. |

1146 | Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs | Jonas Kubilius, Martin Schrimpf, Ha Hong, Najib Majaj, Rishi Rajalingham, Elias Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Kailyn Schmidt, Aran Nayebi, Daniel Bear, Daniel L. Yamins, James J. DiCarlo | Here we demonstrate that better anatomical alignment to the brain and high performance on machine learning as well as neuroscience measures do not have to be in contradiction. |

1147 | k-Means Clustering of Lines for Big Data | Yair Marom, Dan Feldman | We suggest the first PTAS that computes a $(1+\epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$, and constant integers $k, d \geq 1$. |

1148 | Random Projections and Sampling Algorithms for Clustering of High-Dimensional Polygonal Curves | Stefan Meintrup, Alexander Munteanu, Dennis Rohde | We study the $k$-median clustering problem for high-dimensional polygonal curves with finite but unbounded number of vertices. |

1149 | Recurrent Space-time Graph Neural Networks | Andrei Nicolicioiu, Iulia Duta, Marius Leordeanu | We propose a neural graph model, recurrent in space and time, suitable for capturing both the local appearance and the complex higher-level interactions of different entities and objects within the changing world scene. |

1150 | Uncertainty on Asynchronous Time Event Prediction | Bertrand Charpentier, Marin Biloš, Stephan Günnemann | In this work, we tackle the task of predicting the next event (given a history), and how this prediction changes with the passage of time. |

1151 | Accurate, reliable and fast robustness evaluation | Wieland Brendel, Jonas Rauber, Matthias Kümmerer, Ivan Ustyuzhaninov, Matthias Bethge | We here develop a new set of gradient-based adversarial attacks which (a) are more reliable in the face of gradient-masking than other gradient-based attacks, (b) perform better and are more query efficient than current state-of-the-art gradient-based attacks, (c) can be flexibly adapted to a wide range of adversarial criteria and (d) require virtually no hyperparameter tuning. |

1152 | Sparse High-Dimensional Isotonic Regression | David Gamarnik, Julia Gaudio | We consider the problem of estimating an unknown coordinate-wise monotone function given noisy measurements, known as the isotonic regression problem. |

1153 | Triad Constraints for Learning Causal Structure of Latent Variables | Ruichu Cai, Feng Xie, Clark Glymour, Zhifeng Hao, Kun Zhang | In this paper, by properly leveraging the non-Gaussianity of the data, we propose to estimate the structure over latent variables with the so-called Triad constraints: we design a form of “pseudo-residual” from three variables, and show that when causal relations are linear and noise terms are non-Gaussian, the causal direction between the latent variables for the three observed variables is identifiable by checking a certain kind of independence relationship. |

1154 | On the Inductive Bias of Neural Tangent Kernels | Alberto Bietti, Julien Mairal | In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compare to other known kernels for similar architectures. |

1155 | Cross-Domain Transferability of Adversarial Perturbations | Muhammad Muzammal Naseer, Salman H. Khan, Muhammad Haris Khan, Fahad Shahbaz Khan, Fatih Porikli | To this end, we propose a framework capable of launching highly transferable attacks that crafts adversarial patterns to mislead networks trained on wholly different domains. |

1156 | Shallow RNN: Accurate Time-series Classification on Resource Constrained Devices | Don Dennis, Durmus Alp Emre Acar, Vikram Mandikal, Vinu Sankar Sadasivan, Venkatesh Saligrama, Harsha Vardhan Simhadri, Prateek Jain | To induce long-term dependencies, and yet admit parallelization, we introduce novel shallow RNNs. |

1157 | Kernel quadrature with DPPs | Ayoub Belhadji, Rémi Bardenet, Pierre Chainais | We study quadrature rules for functions living in an RKHS, using nodes sampled from a projection determinantal point process (DPP). |

1158 | REM: From Structural Entropy to Community Structure Deception | Yiwei Liu, Jiamou Liu, Zijian Zhang, Liehuang Zhu, Angsheng Li | To this end, we propose a community-based structural entropy to express the amount of information revealed by a community structure. |

1159 | Sim2real transfer learning for 3D human pose estimation: motion to the rescue | Carl Doersch, Andrew Zisserman | In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person’s motion, notably as optical flow and the motion of 2D keypoints. |

1160 | Self-Supervised Deep Learning on Point Clouds by Reconstructing Space | Bjarne Sievers, Jonathan Sauder | We propose a self-supervised learning task for deep learning on raw point cloud data in which a neural network is trained to reconstruct point clouds whose parts have been randomly rearranged. |

1161 | Piecewise Strong Convexity of Neural Networks | Tristan Milne | We study the loss surface of a feed-forward neural network with ReLU non-linearities, regularized with weight decay. |

1162 | Minimum Stein Discrepancy Estimators | Alessandro Barp, Francois-Xavier Briol, Andrew Duncan, Mark Girolami, Lester Mackey | We provide a unifying perspective of these techniques as minimum Stein discrepancy estimators, and use this lens to design new diffusion kernel Stein discrepancy (DKSD) and diffusion score matching (DSM) estimators with complementary strengths. |

1163 | Fast and Furious Learning in Zero-Sum Games: Vanishing Regret with Non-Vanishing Step Sizes | James Bailey, Georgios Piliouras | We show for the first time that it is possible to reconcile in online learning in zero-sum games two seemingly contradictory objectives: vanishing time-average regret and non-vanishing step sizes. |

1164 | Generalization Bounds for Neural Networks via Approximate Description Length | Amit Daniely, Elad Granot | We investigate the sample complexity of networks with bounds on the magnitude of its weights. |

1165 | Provably robust boosted decision stumps and trees against adversarial attacks | Maksym Andriushchenko, Matthias Hein | We show in this paper that for boosted decision stumps the *exact* min-max robust loss and test error for an $l_\infty$-attack can be computed in $O(T\log T)$ time per input, where $T$ is the number of decision stumps and the optimal update step of the ensemble can be done in $O(n^2\,T\log T)$, where $n$ is the number of data points. |

1166 | Convergence of Adversarial Training in Overparametrized Neural Networks | Ruiqi Gao, Tianle Cai, Haochuan Li, Cho-Jui Hsieh, Liwei Wang, Jason D. Lee | This paper provides a partial answer to the success of adversarial training, by showing that it converges to a network where the surrogate loss with respect to the attack algorithm is within $\epsilon$ of the optimal robust loss. |

1167 | A Composable Specification Language for Reinforcement Learning Tasks | Kishor Jothimurugan, Rajeev Alur, Osbert Bastani | We propose a language for specifying complex control tasks, along with an algorithm that compiles specifications in our language into a reward function and automatically performs reward shaping. |

1168 | The Option Keyboard: Combining Skills in Reinforcement Learning | Andre Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, Doina Precup | Based on this premise, we propose a framework for combining skills using the formalism of options. |

1169 | Unified Language Model Pre-training for Natural Language Understanding and Generation | Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon | This paper presents a new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks. |

1170 | Learning to Correlate in Multi-Player General-Sum Sequential Games | Andrea Celli, Alberto Marchesi, Tommaso Bianchi, Nicola Gatti | In this paper, we focus on coarse correlated equilibria (CCEs) in sequential games. |

1171 | Stochastic Continuous Greedy ++: When Upper and Lower Bounds Match | Amin Karbasi, Hamed Hassani, Aryan Mokhtari, Zebang Shen | In this paper, we develop Stochastic Continuous Greedy++ (SCG++), the first efficient variant of a conditional gradient method for maximizing a continuous submodular function subject to a convex constraint. |

1172 | Generative Well-intentioned Networks | Justin Cosentino, Jun Zhu | We propose Generative Well-intentioned Networks (GWINs), a novel framework for increasing the accuracy of certainty-based, closed-world classifiers. |

1173 | Online-Within-Online Meta-Learning | Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, Massimiliano Pontil | We study the problem of learning a series of tasks in a fully online Meta-Learning setting. |

1174 | Learning step sizes for unfolded sparse coding | Pierre Ablin, Thomas Moreau, Mathurin Massias, Alexandre Gramfort | In this paper, we study the selection of adapted step sizes for ISTA. |

1175 | Biases for Emergent Communication in Multi-agent Reinforcement Learning | Tom Eccles, Yoram Bachrach, Guy Lever, Angeliki Lazaridou, Thore Graepel | We introduce inductive biases for positive signalling and positive listening, which ease this problem. |

1176 | Episodic Memory in Lifelong Language Learning | Cyprien de Masson d’Autume, Sebastian Ruder, Lingpeng Kong, Dani Yogatama | We propose an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in this setup. |

1177 | A Simple Baseline for Bayesian Uncertainty in Deep Learning | Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, Andrew Gordon Wilson | We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose approach for uncertainty representation and calibration in deep learning. |

1178 | Communication-efficient Distributed SGD with Sketching | Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Vladimir Braverman, Ion Stoica, Raman Arora | Motivated by the success of sketching methods in sub-linear/streaming algorithms, we introduce Sketched-SGD, an algorithm for carrying out distributed SGD by communicating sketches instead of full gradients. |

1179 | Modeling Conceptual Understanding in Image Reference Games | Rodolfo Corona Rodriguez, Stephan Alaniz, Zeynep Akata | In this work, we present both an image reference game between a speaker and a population of listeners where reasoning about the concepts other agents can comprehend is necessary and a model formulation with this capability. |

1180 | Kalman Filter, Sensor Fusion, and Constrained Regression: Equivalences and Insights | Maria Jahja, David Farrow, Roni Rosenfeld, Ryan J. Tibshirani | In this work, we show that the state estimates from the KF in a standard linear dynamical system setting are equivalent to those given by the KF in a transformed system, with infinite process noise (i.e., a “flat prior”) and an augmented measurement space. |

1181 | Near Neighbor: Who is the Fairest of Them All? | Sariel Har-Peled, Sepideh Mahabadi | In this work we study a “fair” variant of the near neighbor problem. |

1182 | Outlier-robust estimation of a sparse linear model using $\ell_1$-penalized Huber's M-estimator | Arnak Dalalyan, Philip Thompson | We study the problem of estimating a $p$-dimensional $s$-sparse vector in a linear model with Gaussian design. |

1183 | Learning nonlinear level sets for dimensionality reduction in function approximation | Guannan Zhang, Jiaxin Zhang, Jacob Hinkle | We developed a Nonlinear Level-set Learning (NLL) method for dimensionality reduction in high-dimensional function approximation with small data. |

1184 | Assessing Social and Intersectional Biases in Contextualized Word Representations | Yi Chern Tan, L. Elisa Celis | In this paper, we analyze the extent to which state-of-the-art models for contextual word representations, such as BERT and GPT-2, encode biases with respect to gender, race, and intersectional identities. |

1185 | Online Convex Matrix Factorization with Representative Regions | Jianhao Peng, Olgica Milenkovic, Abhishek Agarwal | We address both problems by proposing the first online convex MF algorithm that maintains a collection of constant-size sets of representative data samples needed for interpreting each of the basis (Ding et al., 2010) and has the same almost sure convergence guarantees as the online learning algorithm of Mairal et al., 2010. |

1186 | Self-supervised GAN: Analysis and Improvement with Multi-class Minimax Game | Ngoc-Trung Tran, Viet-Hung Tran, Bao-Ngoc Nguyen, Linxiao Yang, Ngai-Man (Man) Cheung | In this work, we perform an in-depth analysis to understand how SS tasks interact with learning of generator. |

1187 | Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products | Tharun Kumar Reddy Medini, Qixuan Huang, Yiqiu Wang, Vijai Mohan, Anshumali Shrivastava | To alleviate this problem, we present Merged-Average Classifiers via Hashing (MACH), a generic $K$-classification algorithm where memory provably scales at $O(\log K)$ without any assumption on the relation between classes. |

1188 | A Fourier Perspective on Model Robustness in Computer Vision | Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, Justin Gilmer | We find that both methods improve robustness to corruptions that are concentrated in the high frequency domain while reducing robustness to corruptions that are concentrated in the low frequency domain. |

1189 | The continuous Bernoulli: fixing a pervasive error in variational autoencoders | Gabriel Loaiza-Ganem, John P. Cunningham | We introduce and fully characterize a new [0,1]-supported, single parameter distribution: the continuous Bernoulli, which patches this pervasive bug in VAE. |

1190 | Privacy Amplification by Mixing and Diffusion Mechanisms | Borja Balle, Gilles Barthe, Marco Gaboardi, Joseph Geumlek | In this paper we investigate under what conditions stochastic post-processing can amplify the privacy of a mechanism. |

1191 | Variance Reduction in Bipartite Experiments through Correlation Clustering | Jean Pouget-Abadie, Kevin Aydin, Warren Schudy, Kay Brodersen, Vahab Mirrokni | This paper introduces a novel clustering objective and a corresponding algorithm that partitions a bipartite graph so as to maximize the statistical power of a bipartite experiment on that graph. |

1192 | Gossip-based Actor-Learner Architectures for Deep Reinforcement Learning | Mahmoud (“Mido”) Assran, Joshua Romoff, Nicolas Ballas, Joelle Pineau, Mike Rabbat | In this work, we propose Gossip-based Actor-Learner Architectures (GALA) where several actor-learners (such as A2C agents) are organized in a peer-to-peer communication topology, and exchange information through asynchronous gossip in order to take advantage of a large number of distributed simulators. |

1193 | Metalearned Neural Memory | Tsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, Adam Trischler | We augment recurrent neural networks with an external memory mechanism that builds upon recent progress in metalearning. |

1194 | Learning Multiple Markov Chains via Adaptive Allocation | Mohammad Sadegh Talebi, Odalric-Ambrym Maillard | We present a novel learning algorithm that efficiently balances *exploration* and *exploitation* intrinsic to this problem, without any prior knowledge of the chains. |

1195 | Diffusion Improves Graph Learning | Johannes Klicpera, Stefan Weißenberger, Stephan Günnemann | In this work, we remove the restriction of using only the direct neighbors by introducing a powerful, yet spatially localized graph convolution: Graph diffusion convolution (GDC). |

1196 | Deep Random Splines for Point Process Intensity Estimation of Neural Population Data | Gabriel Loaiza-Ganem, Sean Perkins, Karen Schroeder, Mark Churchland, John P. Cunningham | Here we propose Deep Random Splines, a flexible class of random functions obtained by transforming Gaussian noise through a deep neural network whose output are the parameters of a spline. |

1197 | Variational Bayes under Model Misspecification | Yixin Wang, David Blei | In this work, we study VB under model misspecification. |

1198 | Global Convergence of Gradient Descent for Deep Linear Residual Networks | Lei Wu, Qingcan Wang, Chao Ma | We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. |

1199 | On Differentially Private Graph Sparsification and Applications | Raman Arora, Jalaj Upadhyay | In this paper, we study private sparsification of graphs. |

1200 | Manifold denoising by Nonlinear Robust Principal Component Analysis | He Lyu, Ningyu Sha, Shuyang Qin, Ming Yan, Yuying Xie, Rongrong Wang | We answer these two questions affirmatively by proposing and analyzing an optimization framework that separates the sparse component from the manifold under noisy data. |

1201 | Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes | Junzhe Zhang, Elias Bareinboim | In this paper, we investigate the online reinforcement learning (RL) problem for selecting optimal DTRs provided that observational data is available. |

1202 | ODE2VAE: Deep generative second order ODEs with Bayesian neural networks | Cagatay Yildiz, Markus Heinonen, Harri Lahdesmaki | We present Ordinary Differential Equation Variational Auto-Encoder (ODE2VAE), a latent second order ODE model for high-dimensional sequential data. |

1203 | Optimal Sampling and Clustering in the Stochastic Block Model | Se-Young Yun, Alexandre Proutiere | This paper investigates the design of joint adaptive sampling and clustering algorithms in networks whose structure follows the celebrated Stochastic Block Model (SBM). |

1204 | Recurrent Kernel Networks | Dexiong Chen, Laurent Jacob, Julien Mairal | In this paper, we revisit this link by generalizing convolutional kernel networks—originally related to a relaxation of the mismatch kernel—to model gaps in sequences. |

1205 | Cold Case: The Lost MNIST Digits | Chhavi Yadav, Leon Bottou | We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. |

1206 | Hierarchical Optimal Transport for Multimodal Distribution Alignment | John Lee, Max Dabagia, Eva Dyer, Christopher Rozell | To solve this numerically, we propose a distributed ADMM algorithm that also exploits the Sinkhorn distance, thus it has an efficient computational complexity that scales quadratically with the size of the largest cluster. |

1207 | Exploration via Hindsight Goal Generation | Zhizhou Ren, Kefan Dong, Yuan Zhou, Qiang Liu, Jian Peng | In this paper, we introduce Hindsight Goal Generation (HGG), a novel algorithmic framework that generates valuable hindsight goals which are easy for an agent to achieve in the short term and are also potential for guiding the agent to reach the actual goal in the long term. |

1208 | Shaping Belief States with Generative Environment Models for RL | Karol Gregor, Danilo Jimenez Rezende, Frederic Besse, Yan Wu, Hamza Merzic, Aaron van den Oord | We propose a way to efficiently train expressive generative models in complex environments. |

1209 | Globally Optimal Learning for Structured Elliptical Losses | Yoav Wald, Nofar Noy, Gal Elidan, Ami Wiesel | In this work, we analyze robust alternatives. |

1210 | Object landmark discovery through unsupervised adaptation | Enrique Sanchez, Georgios Tzimiropoulos | This paper proposes a method to ease the unsupervised learning of object landmark detectors. |

1211 | Specific and Shared Causal Relation Modeling and Mechanism-Based Clustering | Biwei Huang, Kun Zhang, Pengtao Xie, Mingming Gong, Eric P. Xing, Clark Glymour | In this paper, we develop a unified framework for causal discovery and mechanism-based group identification. |

1212 | Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks | Amirmohammad Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum | In this paper, we instead use efficient truncated randomized search in this reward function to train structured prediction energy networks (SPENs), which provide efficient test-time inference using gradient-based search on a smooth, learned representation of the score landscape, and have previously yielded state-of-the-art results in structured prediction. |

1213 | Accelerating Rescaled Gradient Descent: Fast Optimization of Smooth Functions | Ashia C. Wilson, Lester Mackey, Andre Wibisono | We present a family of algorithms, called descent algorithms, for optimizing convex and non-convex functions. |

1214 | RUDDER: Return Decomposition for Delayed Rewards | Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, Sepp Hochreiter | We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). |

1215 | Graph Normalizing Flows | Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, Kevin Swersky | We introduce graph normalizing flows: a new, reversible graph neural network model for prediction and generation. |

1216 | Explanations can be manipulated and geometry is to blame | Ann-Kathrin Dombrowski, Maximillian Alber, Christopher Anders, Marcel Ackermann, Klaus-Robert Müller, Pan Kessel | In this paper, we demonstrate a property of explanation methods which is disconcerting for both of these purposes. |

1217 | Communication trade-offs for Local-SGD with large step size | Aymeric Dieuleveut, Kumar Kshitij Patel | We propose a non-asymptotic error analysis, which enables comparison to *one-shot averaging*, i.e., a single communication round among independent workers, and *mini-batch averaging*, i.e., communicating at every step. |

1218 | Non-normal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics | Giancarlo Kerg, Kyle Goyette, Maximilian Puelma Touzel, Gauthier Gidel, Eugene Vorontsov, Yoshua Bengio, Guillaume Lajoie | We propose a novel connectivity structure based on the Schur decomposition and a splitting of the Schur form into normal and non-normal parts. |

1219 | No-Regret Learning in Unknown Games with Correlated Payoffs | Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, Andreas Krause | In this paper, we consider a natural model where, besides a noisy measurement of the obtained reward, the player can also observe the opponents’ actions. |

1220 | Alleviating Label Switching with Optimal Transport | Pierre Monteiller, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, Justin M. Solomon, Mikhail Yurochkin | We propose a resolution to label switching that leverages machinery from optimal transport. |

1221 | Paraphrase Generation with Latent Bag of Words | Yao Fu, Yansong Feng, John P. Cunningham | Inspired by variational autoencoders with discrete latent structures, in this work, we propose a latent bag of words (BOW) model for paraphrase generation. |

1222 | An Algorithmic Framework For Differentially Private Data Analysis on Trusted Processors | Joshua Allen, Bolin Ding, Janardhan Kulkarni, Harsha Nori, Olga Ohrimenko, Sergey Yekhanin | In this work, we propose a framework based on trusted processors and a new definition of differential privacy called Oblivious Differential Privacy, which combines the best of both local and global models. |

1223 | Compacting, Picking and Growing for Unforgetting Continual Learning | Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, Chu-Song Chen | In this paper, we propose a simple but effective approach to continual deep learning. |

1224 | Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems | Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, Rosalind Picard | In particular, we propose a self-play scenario where the dialog system talks to itself and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. |

1225 | A New Distribution on the Simplex with Auto-Encoding Applications | Andrew Stirn, Tony Jebara, David Knowles | We construct a new distribution for the simplex using the Kumaraswamy distribution and an ordered stick-breaking process. |

1226 | AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters | Xia Xiao, Zigeng Wang, Sanguthevar Rajasekaran | To build a better generalized and easy-to-use pruning method, we propose AutoPrune, which prunes the network through optimizing a set of trainable auxiliary parameters instead of original weights. |

1227 | A neurally plausible model learns successor representations in partially observable environments | Eszter Vértes, Maneesh Sahani | Here, we introduce a neurally plausible model using *distributional successor features*, which builds on the distributed distributional code for the representation and computation of uncertainty, and which allows for efficient value function computation in partially observed environments via the successor representation. |

1228 | Learning about an exponential amount of conditional distributions | Mohamed Belghazi, Maxime Oquab, David Lopez-Paz | We introduce the Neural Conditioner (NC), a self-supervised machine able to learn about all the conditional distributions of a random vector X. |

1229 | Towards modular and programmable architecture search | Renato Negrinho, Matthew Gormley, Geoffrey J. Gordon, Darshan Patil, Nghia Le, Daniel Ferreira | In this work, we propose a formal language for encoding search spaces over general computational graphs. |

1230 | Towards Hardware-Aware Tractable Learning of Probabilistic Models | Laura I. Galindez Olascoaga, Wannes Meert, Nimish Shah, Marian Verhelst, Guy Van den Broeck | We propose a novel resource-aware cost metric that takes into consideration the hardware’s properties in determining whether the inference task can be efficiently deployed. |

1231 | On Robustness to Adversarial Examples and Polynomial Optimization | Pranjal Awasthi, Abhratanu Dutta, Aravindan Vijayaraghavan | The main contribution of this work is to exhibit a strong connection between achieving robustness to adversarial examples, and a rich class of polynomial optimization problems, thereby making progress on the above questions. |

1232 | Rand-NSG: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node | Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, Rohan Kadekodi | We present a new graph-based indexing and search system called DiskANN that can index, store, and search a billion point database on a single workstation with just 64GB RAM and an inexpensive solid-state drive (SSD). |

1233 | A Solvable High-Dimensional Model of GAN | Chuang Wang, Hong Hu, Yue Lu | We present a theoretical analysis of the training process for a single-layer GAN fed by high-dimensional input data. |

1234 | Using Embeddings to Correct for Unobserved Confounding in Networks | Victor Veitch, Yixin Wang, David Blei | We consider causal inference in the presence of unobserved confounding. |

1235 | MonoForest framework for tree ensemble analysis | Igor Kuralenok, Vasilii Ershov, Igor Labutin | In this work, we introduce a new decision tree ensemble representation framework: instead of using a graph model we transform each tree into a well-known polynomial form. |

1236 | Bayesian Optimization under Heavy-tailed Payoffs | Sayak Ray Chowdhury, Aditya Gopalan | We consider black box optimization of an unknown function in the nonparametric Gaussian process setting when the noise in the observed function values can be heavy tailed. |

1237 | Combining Generative and Discriminative Models for Hybrid Inference | Victor Garcia Satorras, Max Welling, Zeynep Akata | In this work we propose a hybrid model that combines graphical inference with a learned inverse model, which we structure as in a graph neural network, while the iterative algorithm as a whole is formulated as a recurrent neural network. |

1238 | A Graph Theoretic Additive Approximation of Optimal Transport | Nathaniel Lahn, Deepika Mulchandani, Sharath Raghvendra | We present an adaptation of the classical graph algorithm of Gabow and Tarjan and provide a novel analysis of this algorithm that bounds its execution time by $\mathcal{O}(\frac{n^2 C}{\delta}+ \frac{nC^2}{\delta^2})$. |

1239 | Adversarial Robustness through Local Linearization | Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, Pushmeet Kohli | In this work, we introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. |

1240 | Sampled Softmax with Random Fourier Features | Ankit Singh Rawat, Jiecao Chen, Felix Xinnan X. Yu, Ananda Theertha Suresh, Sanjiv Kumar | Motivated by our analysis and the work on kernel-based sampling, we propose the Random Fourier Softmax (RF-softmax) method that utilizes the powerful Random Fourier Features to enable more efficient and accurate sampling from an approximate softmax distribution. |

1241 | Semi-flat minima and saddle points by embedding neural networks to overparameterization | Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, Mirai Tanaka | We consider three basic methods for embedding a network into a wider one with more hidden units, and discuss whether a minimum point of the narrower network gives a minimum or saddle point of the wider one. |

1242 | Learning Fairness in Multi-Agent Systems | Jiechuan Jiang, Zongqing Lu | To tackle these difficulties, we propose FEN, a novel hierarchical reinforcement learning model. |

1243 | Primal-Dual Block Generalized Frank-Wolfe | Qi Lei, Jiacheng Zhuo, Constantine Caramanis, Inderjit S. Dhillon, Alexandros G. Dimakis | We propose a generalized variant of the Frank-Wolfe algorithm for solving a class of sparse/low-rank optimization problems. |

1244 | GOT: An Optimal Transport framework for Graph comparison | Hermina Petric Maretic, Mireille El Gheche, Giovanni Chierchia, Pascal Frossard | We present a novel framework based on optimal transport for the challenging problem of comparing graphs. |

1245 | On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks | Sunil Thulasidasan, Gopinath Chennupati, Jeff A. Bilmes, Tanmoy Bhattacharya, Sarah Michalak | In this work, we discuss a hitherto untouched aspect of mixup training — the calibration and predictive uncertainty of models trained with mixup. |

1246 | Complexity of Highly Parallel Non-Smooth Convex Optimization | Sebastien Bubeck, Qijia Jiang, Yin-Tat Lee, Yuanzhi Li, Aaron Sidford | Namely we consider optimization algorithms interacting with a highly parallel gradient oracle, that is one that can answer poly(d) gradient queries in parallel. |

1247 | Inverting Deep Generative models, One layer at a time | Qi Lei, Ajil Jalal, Inderjit S. Dhillon, Alexandros G. Dimakis | In this paper we obtain several novel theoretical results for the inversion problem. |

1248 | Calculating Optimistic Likelihoods Using (Geodesically) Convex Optimization | Viet Anh Nguyen, Soroosh Shafieezadeh Abadeh, Man-Chung Yue, Daniel Kuhn, Wolfram Wiesemann | We thus propose to replace each nominal distribution with an ambiguity set containing all distributions in its vicinity and to evaluate an optimistic likelihood, that is, the maximum of the likelihood over all distributions in the ambiguity set. |

1249 | The Implicit Metropolis-Hastings Algorithm | Kirill Neklyudov, Evgenii Egorov, Dmitry P. Vetrov | For any implicit probabilistic model and a target distribution represented by a set of samples, implicit Metropolis-Hastings operates by learning a discriminator to estimate the density-ratio and then generating a chain of samples. |

1250 | An Inexact Augmented Lagrangian Framework for Nonconvex Optimization with Nonlinear Constraints | Mehmet Fatih Sahin, Armin Eftekhari, Ahmet Alacaoglu, Fabian Latorre, Volkan Cevher | We propose a practical inexact augmented Lagrangian method (iALM) for nonconvex problems with nonlinear constraints. |

1251 | Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck | Maximilian Igl, Kamil Ciosek, Yingzhen Li, Sebastian Tschiatschek, Cheng Zhang, Sam Devlin, Katja Hofmann | We discuss those differences and propose modifications to existing regularization techniques in order to better adapt them to RL. |

1252 | Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift | Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D. Sculley, Joshua Dillon, Jie Ren, Zachary Nado | We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. |

1253 | Accurate Layerwise Interpretable Competence Estimation | Vickram Rajendran, William LeVine | In this paper, we seek to examine, understand, and predict the pointwise competence of classification models. |

1254 | A New Perspective on Pool-Based Active Classification and False-Discovery Control | Lalit Jain, Kevin G. Jamieson | In this paper, we provide the first provably sample efficient adaptive algorithm for this problem. |

1255 | Defending Neural Backdoors via Generative Distribution Modeling | Ximing Qiao, Yukun Yang, Hai Li | In this work, we explore the space formed by the pixel values of all possible backdoor triggers. |

1256 | Are Sixteen Heads Really Better than One? | Paul Michel, Omer Levy, Graham Neubig | However we observe that, in practice, a large proportion of attention heads can be removed at test time without significantly impacting performance, and that some layers can even be reduced to a single head. |

1257 | Multi-resolution Multi-task Gaussian Processes | Oliver Hamelijnck, Theodoros Damoulas, Kangrui Wang, Mark Girolami | We offer a multi-resolution multi-task (MRGP) framework that allows for both inter-task and intra-task multi-resolution and multi-fidelity. |

1258 | Variational Bayesian Optimal Experimental Design | Adam Foster, Martin Jankowiak, Elias Bingham, Paul Horsfall, Yee Whye Teh, Thomas Rainforth, Noah Goodman | To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. |

1259 | Universal Approximation of Input-Output Maps by Temporal Convolutional Nets | Joshua Hanson, Maxim Raginsky | We prove that TCNs can approximate a large class of input-output maps having approximately finite memory to arbitrary error tolerance. |

1260 | Provable Certificates for Adversarial Examples: Fitting a Ball in the Union of Polytopes | Matt Jordan, Justin Lewis, Alexandros G. Dimakis | We propose a novel method for computing exact pointwise robustness of deep neural networks for all convex lp norms. |

1261 | Reinforcement Learning with Convex Constraints | Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miro Dudik, Robert E. Schapire | In this paper, we propose an algorithmic scheme that can handle a wide class of constraints in RL tasks: specifically, any constraints that require expected values of some vector measurements (such as the use of an action) to lie in a convex set. |

1262 | User-Specified Local Differential Privacy in Unconstrained Adaptive Online Learning | Dirk van der Hoeven | In this paper we generalize this approach by allowing the provider of the data to choose the distribution of the noise without disclosing any parameters of the distribution to the learner, under the constraint that the distribution is symmetrical. |

1263 | Stochastic Bandits with Context Distributions | Johannes Kirschner, Andreas Krause | We introduce a stochastic contextual bandit model where at each time step the environment chooses a distribution over a context set and samples the context from this distribution. |

1264 | Inducing brain-relevant bias in natural language processing models | Dan Schwartz, Mariya Toneva, Leila Wehbe | We demonstrate that a version of BERT, a recently introduced and powerful language model, can improve the prediction of brain activity after fine-tuning. |

1265 | Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning | Harm Van Seijen, Mehdi Fatemi, Arash Tavakoli | We propose an alternative hypothesis that identifies the size-difference of the action-gap across the state-space as the primary cause. |

1266 | Recovering Bandits | Ciara Pike-Burke, Steffen Grunewalder | In this work, we explore the use of Gaussian processes to tackle the estimation and planning problem. |

1267 | Computing Linear Restrictions of Neural Networks | Matthew Sotoudeh, Aditya V. Thakur | We present an efficient algorithm for computing ExactLine for networks that use ReLU, MaxPool, batch normalization, fully-connected, convolutional, and other layers, along with several applications. |

1268 | Learning Positive Functions with Pseudo Mirror Descent | Yingxiang Yang, Haoxiang Wang, Negar Kiyavash, Niao He | In this paper, we propose a novel algorithm, pseudo mirror descent, that performs efficient estimation of positive functions within a Hilbert space without expensive projections. |

1269 | Correlation Priors for Reinforcement Learning | Bastian Alt, Adrian Šošić, Heinz Koeppl | In this work, we present a Bayesian learning framework based on Pólya-Gamma augmentation that enables an analogous reasoning in such cases. |

1270 | Fast, Provably convergent IRLS Algorithm for p-norm Linear Regression | Deeksha Adil, Richard Peng, Sushant Sachdeva | We propose p-IRLS, the first IRLS algorithm that provably converges geometrically for any $p \in [2,\infty)$. |

1271 | A Similarity-preserving Network Trained on Transformed Images Recapitulates Salient Features of the Fly Motion Detection Circuit | Yanis Bahroun, Dmitri Chklovskii, Anirvan Sengupta | Here we propose a biologically plausible model of motion detection. |

1272 | Differentially Private Covariance Estimation | Kareem Amin, Travis Dick, Alex Kulesza, Andres Munoz, Sergei Vassilvitskii | In this work we propose a new epsilon-differentially private algorithm for computing the covariance matrix of a dataset that addresses both of these limitations. |

1273 | Outlier Detection and Robust PCA Using a Convex Measure of Innovation | Mostafa Rahmani, Ping Li | This paper presents a provable and strong algorithm, termed Innovation Search (iSearch), for robust Principal Component Analysis (PCA) and outlier detection. |

1274 | Integrating Markov processes with structural causal modeling enables counterfactual inference in complex systems | Robert Ness, Kaushal Paneri, Olga Vitek | This manuscript contributes a general and practical framework for casting a Markov process model of a system at equilibrium as a structural causal model, and carrying out counterfactual inference. |

1275 | Are Disentangled Representations Helpful for Abstract Visual Reasoning? | Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, Olivier Bachem | In this paper, we conduct a large-scale study that investigates whether disentangled representations are more suitable for abstract reasoning tasks. |

1276 | PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi | We propose a low-rank gradient compressor that can i) compress gradients rapidly, ii) efficiently aggregate the compressed gradients using all-reduce, and iii) achieve test performance on par with SGD. |

1277 | Stochastic Frank-Wolfe for Composite Convex Minimization | Francesco Locatello, Alp Yurtsever, Olivier Fercoq, Volkan Cevher | In this work, we propose the first conditional-gradient-type method for solving stochastic optimization problems under affine constraints. |

1278 | Constraint-based Causal Structure Learning with Consistent Separating Sets | Honghao Li, Vincent Cabeli, Nadir Sella, Herve Isambert | In this paper, we propose a simple modification of PC and PC-derived algorithms so as to ensure that all separating sets identified to remove dispensable edges are consistent with the final graph, thus enhancing the explainability of constraint-based methods. |

1279 | Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis | David Clark, Jesse Livezey, Kristofer Bouchard | Combining these approaches, we introduce Dynamical Components Analysis (DCA), a linear dimensionality reduction method which discovers a subspace of high-dimensional time series data with maximal predictive information, defined as the mutual information between the past and future. |

1280 | Sample Efficient Active Learning of Causal Trees | Kristjan Greenewald, Dmitriy Katz, Karthikeyan Shanmugam, Sara Magliacane, Murat Kocaoglu, Enric Boix Adsera, Guy Bresler | We propose an adaptive framework that determines the next intervention based on a Bayesian prior updated with the outcomes of previous experiments, focusing on the setting where observational data is cheap (assumed infinite) and interventional data is expensive. |

1281 | Efficient Neural Architecture Transformation Search in Channel-Level for Object Detection | Junran Peng, Ming Sun, Zhao-Xiang Zhang, Tieniu Tan, Junjie Yan | To overcome this obstacle, we introduce a practical neural architecture transformation search (NATS) algorithm for object detection in this paper. |

1282 | Robust Attribution Regularization | Jiefeng Chen, Xi Wu, Vaibhav Rastogi, Yingyu Liang, Somesh Jha | We propose training objectives in classic robust optimization models to achieve robust IG attributions. |

1283 | Computational Mirrors: Blind Inverse Light Transport by Deep Matrix Factorization | Miika Aittala, Prafull Sharma, Lukas Murmann, Adam Yedidia, Gregory Wornell, Bill Freeman, Fredo Durand | We solve this problem by factoring the observed video into a matrix product between the unknown hidden scene video and an unknown light transport matrix. |

1284 | When to use parametric models in reinforcement learning? | Hado P. van Hasselt, Matteo Hessel, John Aslanides | We examine the question of when and how parametric models are most useful in reinforcement learning. |

1285 | General E(2)-Equivariant Steerable CNNs | Maurice Weiler, Gabriele Cesa | Here we give a general description of E(2)-equivariant convolutions in the framework of Steerable CNNs. |

1286 | Characterization and Learning of Causal Graphs with Latent Variables from Soft Interventions | Murat Kocaoglu, Amin Jaber, Karthikeyan Shanmugam, Elias Bareinboim | In this paper, we investigate the more general scenario where multiple observational and experimental distributions are available. |

1287 | Structure Learning with Side Information: Sample Complexity | Saurabh Sihag, Ali Tajer | This paper focuses on Ising graphical models, and considers the problem of simultaneously learning the structures of two *partially* similar graphs, where any inference about the structure of one graph offers side information for the other graph. |

1288 | Untangling in Invariant Speech Recognition | Cory Stephenson, Jenelle Feather, Suchismita Padhy, Oguz Elibol, Hanlin Tang, Josh McDermott, SueYeon Chung | In this work, we employ a recently developed statistical mechanical theory that connects geometric properties of network representations and the separability of classes to probe how information is untangled within neural networks trained to recognize speech. |

1289 | Flexible information routing in neural populations through stochastic comodulation | Caroline Haimerl, Cristina Savin, Eero Simoncelli | Here, we propose a novel solution based on functionally-targeted stochastic modulation. |

1290 | Generalization Bounds in the Predict-then-Optimize Framework | Othman El Balghiti, Adam Elmachtoub, Paul Grigas, Ambuj Tewari | In this work, we provide an assortment of generalization bounds for the SPO loss function. |

1291 | Categorized Bandits | Matthieu Jedor, Vianney Perchet, Jonathan Louedec | We introduce a new stochastic multi-armed bandit setting where arms are grouped inside “ordered” categories. |

1292 | Worst-Case Regret Bounds for Exploration via Randomized Value Functions | Daniel Russo | By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, we show that planning with respect to these randomized value functions can induce provably efficient exploration. |

1293 | Efficient characterization of electrically evoked responses for neural interfaces | Nishal Shah, Sasidhar Madugula, Pawel Hottowy, Alexander Sher, Alan Litke, Liam Paninski, E.J. Chichilnisky | This work tests the idea that using prior information from previous experiments and closed-loop measurements may greatly increase the efficiency of the neural interface. |

1294 | Differentially Private Distributed Data Summarization under Covariate Shift | Kanthi Sarpatwar, Karthikeyan Shanmugam, Venkata Sitaramagiridharganesh Ganapavarapu, Ashish Jagmohan, Roman Vaculin | We introduce a novel “noiseless” differentially private auctioning protocol, which may be of independent interest. |

1295 | Hamiltonian descent for composite objectives | Brendan O’Donoghue, Chris J. Maddison | In this paper we consider a convex optimization problem consisting of the sum of two convex functions, sometimes referred to as a composite objective, and we identify the duality gap to be the ‘energy’ of the system. |

1296 | Implicit Regularization of Accelerated Methods in Hilbert Spaces | Nicolò Pagliana, Lorenzo Rosasco | We study learning properties of accelerated gradient descent methods for linear least-squares in Hilbert spaces. |

1297 | Non-Asymptotic Pure Exploration by Solving Games | Rémy Degenne, Wouter M. Koolen, Pierre Ménard | We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. |

1298 | Implicit Posterior Variational Inference for Deep Gaussian Processes | Haibin Yu, Yizhou Chen, Bryan Kian Hsiang Low, Patrick Jaillet, Zhongxiang Dai | This paper presents an implicit posterior variational inference (IPVI) framework for DGPs that can ideally recover an unbiased posterior belief and still preserve time efficiency. |

1299 | Deep Multi-State Dynamic Recurrent Neural Networks Operating on Wavelet Based Neural Features for Robust Brain Machine Interfaces | Benyamin Allahgholizadeh Haghi, Spencer Kellis, Sahil Shah, Maitreyi Ashok, Luke Bashford, Daniel Kramer, Brian Lee, Charles Liu, Richard Andersen, Azita Emami | We present a new deep multi-state Dynamic Recurrent Neural Network (DRNN) architecture for Brain Machine Interface (BMI) applications. |

1300 | Censored Semi-Bandits: A Framework for Resource Allocation with Censored Feedback | Arun Verma, Manjesh Hanawal, Arun Rajkumar, Raman Sankaran | In this paper, we study Censored Semi-Bandits, a novel variant of the semi-bandits problem. |

1301 | Cormorant: Covariant Molecular Neural Networks | Brandon Anderson, Truong Son Hy, Risi Kondor | We propose Cormorant, a rotationally covariant neural network architecture for learning the behavior and properties of complex many-body physical systems. |

1302 | Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness | Andrey Malinin, Mark Gales | First, we show that the appropriate training criterion for Prior Networks is the reverse KL-divergence between Dirichlet distributions. This addresses issues in the nature of the training data target distributions, enabling prior networks to be successfully trained on classification tasks with arbitrarily many classes, as well as improving out-of-distribution detection performance. Second, taking advantage of this new training criterion, this paper investigates using Prior Networks to detect adversarial attacks and proposes a generalized form of adversarial training. |

1303 | Reflection Separation using a Pair of Unpolarized and Polarized Images | Youwei Lyu, Zhaopeng Cui, Si Li, Marc Pollefeys, Boxin Shi | In this paper, we propose to exploit physical constraints from a pair of unpolarized and polarized images to separate reflection and transmission layers. |

1304 | Policy Poisoning in Batch Reinforcement Learning and Control | Yuzhe Ma, Xuezhou Zhang, Wen Sun, Jerry Zhu | We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: tabular certainty equivalence learner in reinforcement learning and linear quadratic regulator in control. |

1305 | Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees | Alix Lheritier, Frederic Cazals | We propose a novel nonparametric online predictor for discrete labels conditioned on multivariate continuous features. |

1306 | Pure Exploration with Multiple Correct Answers | Rémy Degenne, Wouter M. Koolen | We present a new algorithm which extends Track-and-Stop to the multiple-answer case and has asymptotic sample complexity matching the lower bound. |

1307 | Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets | Rohith Kuditipudi, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu, Rong Ge, Sanjeev Arora | We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. |

1308 | On the Fairness of Disentangled Representations | Francesco Locatello, Gabriele Abbati, Thomas Rainforth, Stefan Bauer, Bernhard Schölkopf, Olivier Bachem | In this paper, we investigate the usefulness of different notions of disentanglement for improving the fairness of downstream prediction tasks based on representations. |

1309 | Compiler Auto-Vectorization with Imitation Learning | Charith Mendis, Cambridge Yang, Yewen Pu, Saman Amarasinghe, Michael Carbin | In this work, we explore whether it is feasible to imitate optimal decisions made by their ILP solution by fitting a graph neural network policy. |

1310 | A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation | Runzhe Yang, Xingyuan Sun, Karthik Narasimhan | We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. |

1311 | Exact Gaussian Processes on a Million Data Points | Ke Wang, Geoff Pleiss, Jacob Gardner, Stephen Tyree, Kilian Q. Weinberger, Andrew Gordon Wilson | In this paper, we develop a scalable approach for exact GPs that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication. |

1312 | Bayesian Layers: A Module for Neural Network Uncertainty | Dustin Tran, Mike Dusenberry, Mark van der Wilk, Danijar Hafner | We describe Bayesian Layers, a module designed for fast experimentation with neural network uncertainty. |

1313 | Learning Compositional Neural Programs with Recursive Tree Search and Planning | Thomas Pierrot, Guillaume Ligner, Scott E. Reed, Olivier Sigaud, Nicolas Perrin, Alexandre Laterre, David Kas, Karim Beguir, Nando de Freitas | We propose a novel reinforcement learning algorithm, AlphaNPI, that incorporates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. |

1314 | Nonparametric Contextual Bandits in Metric Spaces with Unknown Metric | Nirandika Wanigasekara, Christina Yu | We present a novel algorithm which learns data-driven similarities amongst the arms, in order to implement adaptive partitioning of the context-arm space for more efficient learning. |

1315 | Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification and Local Computations | Debraj Basu, Deepesh Data, Can Karakus, Suhas Diggavi | In this paper we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients. |

1316 | Likelihood Ratios for Out-of-Distribution Detection | Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, Balaji Lakshminarayanan | We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. |

1317 | Discrete Flows: Invertible Generative Models of Discrete Data | Dustin Tran, Keyon Vafa, Kumar Agrawal, Laurent Dinh, Ben Poole | In this paper, we show that flows can in fact be extended to discrete events—and under a simple change-of-variables formula not requiring log-determinant-Jacobian computations. |

1318 | A Self Validation Network for Object-Level Human Attention Estimation | Zehua Zhang, Chen Yu, David Crandall | In this paper, we propose a novel unified model that incorporates both spatial and temporal evidence in identifying as well as locating the attended object in first-person videos. |

1319 | Model Selection for Contextual Bandits | Dylan J. Foster, Akshay Krishnamurthy, Haipeng Luo | We introduce the problem of model selection for contextual bandits, where a learner must adapt to the complexity of the optimal policy while balancing exploration and exploitation. |

1320 | Sliced Gromov-Wasserstein | Titouan Vayer, Rémi Flamary, Nicolas Courty, Romain Tavenard, Laetitia Chapel | This paper proposes a new divergence based on GW akin to SW. |

1321 | Towards Practical Alternating Least-Squares for CCA | Zhiqiang Xu, Ping Li | To promote the practical use of ALS for CCA, we propose truly alternating least-squares. |

1322 | Deep Leakage from Gradients | Ligeng Zhu, Zhijian Liu, Song Han | However, in this paper, we show that we can obtain the private training set from the publicly shared gradients. |

1323 | Invariance-inducing regularization using worst-case transformations suffices to boost accuracy and spatial robustness | Fanny Yang, Zuowen Wang, Christina Heinze-Deml | This work provides theoretical and empirical evidence that invariance-inducing regularizers can increase predictive accuracy for worst-case spatial transformations (spatial robustness). |

1324 | Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks | Spencer Frei, Yuan Cao, Quanquan Gu | In this work, we analyze overparameterized deep residual networks trained by gradient descent following random initialization, and demonstrate that (i) the class of networks learned by gradient descent constitutes a small subset of the entire neural network function class, and (ii) this subclass of networks is sufficiently large to guarantee small training error. |

1325 | Value Function in Frequency Domain and the Characteristic Value Iteration Algorithm | Amir-massoud Farahmand | It presents a new representational framework to maintain the uncertainty of returns and provides mathematical tools to compute it. |

1326 | Icebreaker: Element-wise Efficient Information Acquisition with a Bayesian Deep Latent Gaussian Model | Wenbo Gong, Sebastian Tschiatschek, Sebastian Nowozin, Richard E. Turner, José Miguel Hernández-Lobato, Cheng Zhang | In this paper, we address the ice-start problem, i.e., the challenge of deploying machine learning models when only a little or no training data is initially available, and acquiring each feature element of data is associated with costs. |

1327 | Algorithmic Guarantees for Inverse Imaging with Untrained Network Priors | Gauri Jagatap, Chinmay Hegde | Specifically, we consider the problem of solving linear inverse problems, such as compressive sensing, as well as non-linear problems, such as compressive phase retrieval. |

1328 | Planning with Goal-Conditioned Policies | Soroush Nasiriany, Vitchyr Pong, Steven Lin, Sergey Levine | We show that goal-conditioned policies learned with RL can be incorporated into planning, such that a planner can focus on which states to reach, rather than how those states are reached. |

1329 | Don't take it lightly: Phasing optical random projections with unknown operators | Sidharth Gupta, Remi Gribonval, Laurent Daudet, Ivan Dokmanic | In this paper we tackle the problem of recovering the phase of complex linear measurements when only magnitude information is available and we control the input. |

1330 | Generating Diverse High-Fidelity Images with VQ-VAE-2 | Ali Razavi, Aaron van den Oord, Oriol Vinyals | We explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE) models for large scale image generation. |

1331 | Generalized Matrix Means for Semi-Supervised Learning with Multilayer Graphs | Pedro Mercado, Francesco Tudisco, Matthias Hein | We propose a regularizer based on the generalized matrix mean, which is a one-parameter family of matrix means that includes the arithmetic, geometric and harmonic means as particular cases. |

1332 | Online Optimal Control with Linear Dynamics and Predictions: Algorithms and Regret Analysis | Yingying Li, Xin Chen, Na Li | We design online algorithms, Receding Horizon Gradient-based Control (RHGC), that utilize the predictions through finite steps of gradient computations. |

1333 | Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption | Wei Ma, George H. Chen | In this paper, we suggest a simple approach to estimating these probabilities that avoids these shortcomings. |

1334 | MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis | Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, Aaron C. Courville | In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. |

1335 | Offline Contextual Bandits with High Probability Fairness Guarantees | Blossom Metevier, Stephen Giguere, Sarah Brockman, Ari Kobren, Yuriy Brun, Emma Brunskill, Philip S. Thomas | We present RobinHood, an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints. |

1336 | Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods | Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn | In this paper, we study the problem in the non-convex regime and show that an $\varepsilon$-first order stationary point of the game can be computed when one of the players' objectives can be optimized to global optimality efficiently. |

1337 | Semantic-Guided Multi-Attention Localization for Zero-Shot Learning | Yizhe Zhu, Jianwen Xie, Zhiqiang Tang, Xi Peng, Ahmed Elgammal | In this paper, we study the significance of the discriminative region localization. |

1338 | Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain) | Mariya Toneva, Leila Wehbe | We propose here a novel interpretation approach that relies on the only processing system we have that does understand language: the human brain. |

1339 | Function-Space Distributions over Kernels | Gregory Benton, Wesley J. Maddox, Jayson Salkey, Julio Albinati, Andrew Gordon Wilson | In this paper, we develop functional kernel learning (FKL) to directly infer functional posteriors over kernels. |

1340 | The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares | Rong Ge, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli | Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? |

1341 | Compositional Plan Vectors | Coline Devin, Daniel Geng, Pieter Abbeel, Trevor Darrell, Sergey Levine | We introduce compositional plan vectors (CPVs) to enable a policy to perform compositions of tasks without additional supervision. |

1342 | Locally Private Learning without Interaction Requires Separation | Amit Daniely, Vitaly Feldman | We consider learning under the constraint of local differential privacy (LDP). |

1343 | Robust Bi-Tempered Logistic Loss Based on Bregman Divergences | Ehsan Amid, Manfred K. K. Warmuth, Rohan Anil, Tomer Koren | We introduce a temperature into the exponential function and replace the softmax output layer of the neural networks by a high-temperature generalization. |

1344 | Computational Separations between Sampling and Optimization | Kunal Talwar | We present a simpler and stronger separation. |

1345 | Surfing: Iterative Optimization Over Incrementally Trained Deep Networks | Ganlin Song, Zhou Fan, John Lafferty | We investigate a sequential optimization procedure to minimize the empirical risk functional $f_{\hat\theta}(x) = \frac{1}{2}\|G_{\hat\theta}(x) - y\|^2$ for certain families of deep networks $G_{\theta}(x)$. |

1346 | Learning to Optimize in Swarms | Yue Cao, Tianlong Chen, Zhangyang Wang, Yang Shen | To overcome the limitations, we propose a meta-optimizer that learns in the algorithmic space of both point-based and population-based optimization algorithms. |

1347 | On Human-Aligned Risk Minimization | Liu Leqi, Adarsh Prasad, Pradeep K. Ravikumar | In this paper, we pose the following simple question: in contrast to minimizing expected loss, could we minimize a better human-aligned risk measure? |

1348 | Semi-Parametric Efficient Policy Learning with Continuous Actions | Victor Chernozhukov, Mert Demirer, Greg Lewis, Vasilis Syrgkanis | We propose a doubly robust off-policy estimate for this setting and show that off-policy optimization based on this doubly robust estimate is robust to estimation errors of the policy function or the regression model. |

1349 | Multi-task Learning for Aggregated Data using Gaussian Processes | Fariba Yousefi, Michael T. Smith, Mauricio Álvarez | In this paper, we present a novel multi-task learning model based on Gaussian processes for joint learning of variables that have been aggregated at different input scales. |

1350 | Minimal Variance Sampling in Stochastic Gradient Boosting | Bulat Ibragimov, Gleb Gusev | In this paper, we formulate the problem of randomization in SGB in terms of optimization of sampling probabilities to maximize the estimation accuracy of split scoring used to train decision trees. |

1351 | Beyond the Single Neuron Convex Barrier for Neural Network Certification | Gagandeep Singh, Rupanshu Ganvir, Markus Püschel, Martin Vechev | We propose a new parametric framework, called k-ReLU, for computing precise and scalable convex relaxations used to certify neural networks. |

1352 | An Algorithm to Learn Polytree Networks with Hidden Nodes | Firoozeh Sepehr, Donatello Materassi | In this article, we develop an algorithm to exactly recover graphical models of random variables with underlying polytree structures when the latent nodes satisfy specific degree conditions. |

1353 | Efficiently Learning Fourier Sparse Set Functions | Andisheh Amrollahi, Amir Zandieh, Michael Kapralov, Andreas Krause | In this paper we consider the problem of efficiently learning set functions that are defined over a ground set of size $n$ and that are sparse (say $k$-sparse) in the Fourier domain. |

1354 | Projected Stein Variational Newton: A Fast and Scalable Bayesian Inference Method in High Dimensions | Peng Chen, Keyi Wu, Joshua Chen, Tom O’Leary-Roseberry, Omar Ghattas | We propose a projected Stein variational Newton (pSVN) method for high-dimensional Bayesian inference. |

1355 | Invariance and identifiability issues for word embeddings | Rachel Carrington, Karthik Bharath, Simon Preston | We provide a formal treatment of the above identifiability issue, present some numerical examples, and discuss possible resolutions. |

1356 | Generalization Error Analysis of Quantized Compressive Learning | Xiaoyun Li, Ping Li | In this paper, we consider the learning problem where the projected data is further compressed by scalar quantization, which is called quantized compressive learning. |

1357 | Multi-Criteria Dimensionality Reduction with Applications to Fairness | Uthaipon Tantipongpipat, Samira Samadi, Mohit Singh, Jamie H. Morgenstern, Santosh Vempala | In this paper, we introduce the multi-criteria dimensionality reduction problem where we are given multiple objectives that need to be optimized simultaneously. |

1358 | Efficient Rematerialization for Deep Networks | Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, Joshua Wang | In this work we consider the rematerialization problem and devise efficient algorithms that use structural characterizations of computation graphs—treewidth and pathwidth—to obtain provably efficient rematerialization schedules. |

1359 | Mo’ States Mo’ Problems: Emergency Stop Mechanisms from Observation | Samuel Ainsworth, Matt Barnes, Siddhartha Srinivasa | We develop a simple technique using emergency stops (e-stops) to exploit this phenomenon. |

1360 | Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments | Vasilis Syrgkanis, Victor Lei, Miruna Oprescu, Maggie Hei, Keith Battocchi, Greg Lewis | We develop a statistical learning approach to the estimation of heterogeneous effects, reducing the problem to the minimization of an appropriate loss function that depends on a set of auxiliary models (each corresponding to a separate prediction task). |

1361 | Understanding Sparse JL for Feature Hashing | Meena Jagadeesan | In this paper, we demonstrate the benefits of using sparsity s greater than 1 in sparse JL on feature vectors. |

1362 | Text-Based Interactive Recommendation via Constraint-Augmented Reinforcement Learning | Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, Changyou Chen | To alleviate this issue, we propose a novel constraint-augmented reinforcement learning (RL) framework to efficiently incorporate user preferences over time. |

1363 | Flexible Modeling of Diversity with Strongly Log-Concave Distributions | Joshua Robinson, Suvrit Sra, Stefanie Jegelka | We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance. |

1364 | Momentum-Based Variance Reduction in Non-Convex SGD | Ashok Cutkosky, Francesco Orabona | We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. |

1365 | Search on the Replay Buffer: Bridging Planning and Reinforcement Learning | Ben Eysenbach, Russ R. Salakhutdinov, Sergey Levine | We introduce a general-purpose control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. |

1366 | Can Unconditional Language Models Recover Arbitrary Sentences? | Nishant Subramani, Samuel Bowman, Kyunghyun Cho | To do this, we introduce a pair of effective, complementary methods for feeding representations into pretrained unconditional language models and a corresponding set of methods to map sentences into and out of this representation space, the reparametrized sentence space. |

1367 | Group Retention when Using Machine Learning in Sequential Decision Making: the Interplay between User Dynamics and Fairness | Xueru Zhang, Mohammadmahdi Khaliligarekani, Cem Tekin, Mingyan Liu | In this study, we seek to understand the interplay between ML decisions and the underlying group representation, how they evolve in a sequential framework, and how the use of fairness criteria plays a role in this process. |

1368 | Faster width-dependent algorithm for mixed packing and covering LPs | Digvijay Boob, Saurabh Sawlani, Di Wang | In this paper, we give a faster width-dependent algorithm for mixed packing-covering LPs. |

1369 | Flattening a Hierarchical Clustering through Active Learning | Fabio Vitale, Anand Rajagopalan, Claudio Gentile | We investigate active learning by pairwise similarity over the leaves of trees originating from hierarchical clustering procedures. |

1370 | DeepWave: A Recurrent Neural-Network for Real-Time Acoustic Imaging | Matthieu SIMEONI, Sepand Kashani, Paul Hurley, Martin Vetterli | We propose a recurrent neural-network for real-time reconstruction of acoustic camera spherical maps. |

1371 | Certifying Geometric Robustness of Neural Networks | Mislav Balunovic, Maximilian Baader, Gagandeep Singh, Timon Gehr, Martin Vechev | In this work, we propose a new method to compute sound and asymptotically optimal linear relaxations for any composition of transformations. |

1372 | Goal-conditioned Imitation Learning | Yiming Ding, Carlos Florensa, Pieter Abbeel, Mariano Phielipp | In this work we propose a novel algorithm goalGAIL, which incorporates demonstrations to drastically speed up the convergence to a policy able to reach any goal, surpassing the performance of an agent trained with other Imitation Learning algorithms. |

1373 | Robust exploration in linear quadratic reinforcement learning | Jack Umenberger, Mina Ferizbegovic, Thomas B. Schön, Håkan Hjalmarsson | We present a method, based on convex optimization, that accomplishes this task ‘robustly’, i.e., the worst-case cost, accounting for system uncertainty given the observed data, is minimized. |

1374 | DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs | Ali Sadeghian, Mohammadreza Armandpour, Patrick Ding, Daisy Zhe Wang | In this paper, we study the problem of learning probabilistic logical rules for inductive and interpretable link prediction. |

1375 | Kernel Truncated Randomized Ridge Regression: Optimal Rates and Low Noise Acceleration | Kwang-Sung Jun, Ashok Cutkosky, Francesco Orabona | In this paper we consider the nonparametric least square regression in a Reproducing Kernel Hilbert Space (RKHS). |

1376 | Input-Output Equivalence of Unitary and Contractive RNNs | Melikasadat Emami, Mojtaba Sahraee Ardakan, Sundeep Rangan, Alyson K. Fletcher | This work shows that for any contractive RNN with ReLU activations, there is a URNN with at most twice the number of hidden states and the identical input-output mapping. |

1377 | Hamiltonian Neural Networks | Samuel Greydanus, Misko Dzamba, Jason Yosinski | In this paper, we draw inspiration from Hamiltonian mechanics to train models that learn and respect exact conservation laws in an unsupervised manner. |

1378 | Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks | Qiyang Li, Saminul Haque, Cem Anil, James Lucas, Roger B. Grosse, Joern-Henrik Jacobsen | In particular, we present the Block Convolution Orthogonal Parameterization (BCOP), an expressive parameterization of orthogonal convolution operations. |

1379 | Structured and Deep Similarity Matching via Structured and Deep Hebbian Networks | Dina Obeid, Hugo Ramambason, Cengiz Pehlevan | In this paper, we introduce structured and deep similarity matching cost functions, and show how they can be optimized in a gradient-based manner by neural networks with local learning rules. |

1380 | Understanding the Representation Power of Graph Neural Networks in Learning Graph Topology | Nima Dehmamy, Albert-Laszlo Barabasi, Rose Yu | To deepen our understanding of graph neural networks, we investigate the representation power of Graph Convolutional Networks (GCN) through the looking glass of graph moments, a key property of graph topology encoding path of various lengths. |

1381 | Multiple Futures Prediction | Charlie Tang, Russ R. Salakhutdinov | Towards these goals, we introduce a probabilistic framework that efficiently learns latent variables to jointly model the multi-step future motions of agents in a scene. |

1382 | Explicitly disentangling image content from translation and rotation with spatial-VAE | Tristan Bepler, Ellen Zhong, Kotaro Kelley, Edward Brignole, Bonnie Berger | We propose a method for explicitly disentangling image rotation and translation from other unstructured latent factors in a variational autoencoder (VAE) framework. |

1383 | Power analysis of knockoff filters for correlated designs | Jingbo Liu, Philippe Rigollet | In this work we study the case where the predictors have a general covariance matrix $\boldsymbol{\Sigma}$. |

1384 | A Kernel Loss for Solving the Bellman Equation | Yihao Feng, Lihong Li, Qiang Liu | In this paper, we propose a novel loss function, which can be optimized using standard gradient-based methods with guaranteed convergence. |

1385 | Low-Rank Bandit Methods for High-Dimensional Dynamic Pricing | Jonas W. Mueller, Vasilis Syrgkanis, Matt Taddy | We consider dynamic pricing with many products under an evolving but low-dimensional demand model. |

1386 | Differential Privacy Has Disparate Impact on Model Accuracy | Eugene Bagdasaryan, Omid Poursaeed, Vitaly Shmatikov | We demonstrate that in the neural networks trained using differentially private stochastic gradient descent (DP-SGD), this cost is not borne equally: accuracy of DP models drops much more for the underrepresented classes and subgroups. |

1387 | Riemannian batch normalization for SPD neural networks | Daniel Brooks, Olivier Schwander, Frederic Barbaresco, Jean-Yves Schneider, Matthieu Cord | In our article, we introduce a Riemannian batch normalization (batch-norm) algorithm, which generalizes the one used in Euclidean nets. |

1388 | Neural Taskonomy: Inferring the Similarity of Task-Derived Representations from Brain Activity | Yuan Wang, Michael Tarr, Leila Wehbe | To address this problem, we used learned representations drawn from 21 computer vision tasks to construct encoding models for predicting brain responses from BOLD5000—a large-scale dataset comprised of fMRI scans collected while observers viewed over 5000 naturalistic scene and object images. |

1389 | Stacked Capsule Autoencoders | Adam Kosiorek, Sara Sabour, Yee Whye Teh, Geoffrey E. Hinton | We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. |

1390 | Learning Reward Machines for Partially Observable Reinforcement Learning | Rodrigo Toro Icarte, Ethan Waldie, Toryn Klassen, Rick Valenzano, Margarita Castro, Sheila McIlraith | We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. |

1391 | Learning Representations by Maximizing Mutual Information Across Views | Philip Bachman, R Devon Hjelm, William Buchwalter | We propose an approach to self-supervised representation learning based on maximizing mutual information between features extracted from multiple views of a shared context. |

1392 | Amortized Bethe Free Energy Minimization for Learning MRFs | Sam Wiseman, Yoon Kim | We propose to learn deep undirected graphical models (i.e., MRFs) with a non-ELBO objective for which we can calculate exact gradients. |

1393 | Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity | Chulhee Yun, Suvrit Sra, Ali Jadbabaie | We study finite sample expressivity, i.e., memorization power of ReLU networks. |

1394 | Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks | Aaron Voelker, Ivana Kajic, Chris Eliasmith | We propose a novel memory cell for recurrent neural networks that dynamically maintains information across long windows of time using relatively few resources. |

1395 | Exact Combinatorial Optimization with Graph Convolutional Neural Networks | Maxime Gasse, Didier Chetelat, Nicola Ferroni, Laurent Charlin, Andrea Lodi | We propose a new graph convolutional neural network model for learning branch-and-bound variable selection policies, which leverages the natural variable-constraint bipartite graph representation of mixed-integer linear programs. |

1396 | Fast structure learning with modular regularization | Greg Ver Steeg, Hrayr Harutyunyan, Daniel Moyer, Aram Galstyan | We introduce a novel method that leverages a newly discovered connection between information-theoretic measures and structured latent factor models to derive an optimization objective which encourages modular structures where each observed variable has a single latent parent. |

1397 | Wasserstein Dependency Measure for Representation Learning | Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, Pierre Sermanet | In this work, we empirically demonstrate that mutual information-based representation learning approaches do fail to learn complete representations on a number of designed and real-world tasks. |

1398 | TAB-VCR: Tags and Attributes based VCR Baselines | Jingxiang Lin, Unnat Jain, Alexander Schwing | Here we show that a much simpler model obtained by ablating and pruning the existing intricate baseline can perform better with half the number of trainable parameters. |

1399 | Universality and individuality in neural dynamics across large populations of recurrent networks | Niru Maheswaranathan, Alex Williams, Matthew Golub, Surya Ganguli, David Sussillo | To address these foundational questions, we study populations of thousands of networks of commonly used RNN architectures trained to solve neuroscientifically motivated tasks and characterize their low-dimensional dynamics via CCA and nonlinear dynamical systems analysis. |

1400 | End-to-End Learning on 3D Protein Structure for Interface Prediction | Raphael Townshend, Rishi Bedi, Patricia Suriana, Ron Dror | To address this question, we focused on a central problem in biology: predicting how proteins interact with one another—that is, which surfaces of one protein bind to those of another protein. |

1401 | A Family of Robust Stochastic Operators for Reinforcement Learning | Yingdong Lu, Mark Squillante, Chai Wah Wu | We consider a new family of stochastic operators for reinforcement learning with the goal of alleviating negative effects and becoming more robust to approximation or estimation errors. |

1402 | Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty | Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, Dawn Song | We find that self-supervision can benefit robustness in a variety of ways, including robustness to adversarial examples, label corruption, and common input corruptions. |

1403 | Inherent Tradeoffs in Learning Fair Representations | Han Zhao, Geoff Gordon | In this paper, through the lens of information theory, we provide the first result that quantitatively characterizes the tradeoff between demographic parity and the joint utility across different population groups. |

1404 | Are deep ResNets provably better than linear predictors? | Chulhee Yun, Suvrit Sra, Ali Jadbabaie | Our main theorem on deep ResNets shows under simple geometric conditions that, any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. |

1405 | Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics | Niru Maheswaranathan, Alex Williams, Matthew Golub, Surya Ganguli, David Sussillo | In this work, we use tools from dynamical systems analysis to reverse engineer recurrent networks trained to perform sentiment classification, a foundational natural language processing task. |

1406 | BehaveNet: nonlinear embedding and Bayesian neural decoding of behavioral videos | Eleanor Batty, Matthew Whiteway, Shreya Saxena, Dan Biderman, Taiga Abe, Simon Musall, Winthrop Gillis, Jeffrey Markowitz, Anne Churchland, John P. Cunningham, Sandeep R. Datta, Scott Linderman, Liam Paninski | Here we introduce a probabilistic framework for the analysis of behavioral video and neural activity. |

1407 | Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models | Yuge Shi, Siddharth N, Brooks Paige, Philip Torr | In this work, we characterise successful learning of such models as the fulfilment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. |

1408 | Gradient-based Adaptive Markov Chain Monte Carlo | Michalis Titsias, Petros Dellaportas | We introduce a gradient-based learning method to automatically adapt Markov chain Monte Carlo (MCMC) proposal distributions to intractable targets. |

1409 | On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset | Muhammad Waleed Gondal, Manuel Wuthrich, Djordje Miladinovic, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer | In this paper, we propose a novel data-set which consists of over 1 million images of physical 3D objects with seven factors of variation, such as object color, shape, size and position. |

1410 | Imitation-Projected Programmatic Reinforcement Learning | Abhinav Verma, Hoang Le, Yisong Yue, Swarat Chaudhuri | We study the problem of programmatic reinforcement learning, in which policies are represented as short programs in a symbolic language. |

1411 | Learning Data Manipulation for Augmentation and Weighting | Zhiting Hu, Bowen Tan, Russ R. Salakhutdinov, Tom M. Mitchell, Eric P. Xing | In this work, we propose a new method that supports learning different manipulation schemes with the same gradient-based algorithm. |

1412 | Exploring Algorithmic Fairness in Robust Graph Covering Problems | Aida Rahmattalabi, Phebe Vayanos, Anthony Fulginiti, Eric Rice, Bryan Wilder, Amulya Yadav, Milind Tambe | To remediate this issue, we propose a novel formulation of the robust covering problem with fairness constraints and a tractable approximation scheme applicable to real world instances. |

1413 | Abstraction based Output Range Analysis for Neural Networks | Pavithra Prabhakar, Zahra Rahimi Afzal | In this paper, we consider the problem of output range analysis for feed-forward neural networks. |

1414 | Space and Time Efficient Kernel Density Estimation in High Dimensions | Arturs Backurs, Piotr Indyk, Tal Wagner | In this work, we present an improvement to their framework that retains the same query time, while requiring only linear space and linear preprocessing time. |

1415 | PIDForest: Anomaly Detection via Partial Identification | Parikshit Gopalan, Vatsal Sharan, Udi Wieder | We propose a framework called Partial Identification which captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values. |

1416 | Generative Models for Graph-Based Protein Design | John Ingraham, Vikas Garg, Regina Barzilay, Tommi Jaakkola | We develop relational language models for protein sequences that directly condition on a graph specification of the target structure. |

1417 | The Geometry of Deep Networks: Power Diagram Subdivision | Randall Balestriero, Romain Cosentino, Behnaam Aazhang, Richard Baraniuk | We study the geometry of deep (neural) networks (DNs) with piecewise affine and convex nonlinearities. |

1418 | Approximate Feature Collisions in Neural Nets | Ke Li, Tianhao Zhang, Jitendra Malik | In this paper, we show the opposite: neural nets could be surprisingly insensitive to adversarially chosen changes of large magnitude. |

1419 | Ease-of-Teaching and Language Structure from Emergent Communication | Fushan Li, Michael Bowling | By introducing new agents periodically to replace old ones, sequentially and within a population, we explore such a new pressure – ease of teaching – and show its impact on the structure of the resulting language. |

1420 | Generalization in multitask deep neural classifiers: a statistical physics approach | Anthony Ndirango, Tyler Lee | We develop an analytic theory of the nonlinear dynamics of generalization of deep neural networks trained to solve classification tasks using softmax outputs and cross-entropy loss, addressing both single task and multitask settings. |

1421 | Optimistic Distributionally Robust Optimization for Nonparametric Likelihood Approximation | Viet Anh Nguyen, Soroosh Shafieezadeh Abadeh, Man-Chung Yue, Daniel Kuhn, Wolfram Wiesemann | In this paper, we propose a non-parametric approximation of the likelihood that identifies a probability measure which lies in the neighborhood of the nominal measure and that maximizes the probability of observing the given sample point. |

1422 | On Relating Explanations and Adversarial Examples | Alexey Ignatiev, Nina Narodytska, Joao Marques-Silva | This paper demonstrates that explanations and adversarial examples are related by a generalized form of hitting set duality, which extends earlier work on hitting set duality observed in model-based diagnosis and knowledge compilation. |

1423 | On the equivalence between graph isomorphism testing and function approximation with GNNs | Zhengdao Chen, Soledad Villar, Lei Chen, Joan Bruna | Our work connects these two perspectives and proves their equivalence. |

1424 | Surround Modulation: A Bio-inspired Connectivity Structure for Convolutional Neural Networks | Hosein Hasani, Mahdieh Soleymani, Hamid Aghajan | Inspired by the notion of surround modulation, we designed new excitatory-inhibitory connections between a unit and its surrounding units in the convolutional neural network (CNN) to achieve a more biologically plausible network. |

1425 | Self-attention with Functional Time Representation Learning | Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, Kannan Achan | We propose several models to learn the functional time representation and the interactions with event representation. |

1426 | Re-randomized Densification for One Permutation Hashing and Bin-wise Consistent Weighted Sampling | Ping Li, Xiaoyun Li, Cun-Hui Zhang | In this paper, we propose a strategy named “re-randomization” in the process of densification that could achieve the smallest variance among all densification schemes. |

1427 | Enabling hyperparameter optimization in sequential autoencoders for spiking neural data | Mohammad Reza Keshtkaran, Chethan Pandarinath | We develop and test two potential solutions: an alternate validation method (“sample validation”) and a novel regularization method (“coordinated dropout”). These innovations prevent overfitting quite effectively, and allow us to test whether SAEs can achieve good performance on limited data through large-scale HP optimization. |