# Paper Digest: ICML 2019 Highlights

Download ICML-2019-Paper-Digests.pdf: highlights of all ICML 2019 papers (PDF file, ~0.5 MB).

The 2019 International Conference on Machine Learning (ICML) is one of the top machine learning conferences in the world. In 2019, it is to be held in Long Beach, California. There were ~3,400 paper submissions, of which 774 were accepted. Of the accepted papers, 519 also published their code (download link).

To help the AI community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights / summaries to quickly get the main idea of each paper.

We thank all authors for writing these interesting papers, and readers for reading our digests. If you do not want to miss any interesting AI paper, you are welcome to **sign up for our free paper digest service** to get new paper updates customized to your own interests on a daily basis.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: ICML 2019 Papers

No. | Title | Authors | Highlight |
---|---|---|---|

1 | AReS and MaRS - Adversarial and MMD-Minimizing Regression for SDEs | Gabriele Abbati, Philippe Wenk, Michael A. Osborne, Andreas Krause, Bernhard Schölkopf, Stefan Bauer | In this paper, we propose a novel, probabilistic model for estimating the drift and diffusion given noisy observations of the underlying stochastic system. |

2 | Dynamic Weights in Multi-Objective Deep Reinforcement Learning | Axel Abels, Diederik Roijers, Tom Lenaerts, Ann Nowé, Denis Steckelmacher | We generalize across weight changes and high-dimensional inputs by proposing a multi-objective Q-network whose outputs are conditioned on the relative importance of objectives and we introduce Diverse Experience Replay (DER) to counter the inherent non-stationarity of the Dynamic Weights setting. |

3 | MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing | Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, Aram Galstyan | To address this weakness, we propose a new model, MixHop, that can learn these relationships, including difference operators, by repeatedly mixing feature representations of neighbors at various distances. |

4 | Communication-Constrained Inference and the Role of Shared Randomness | Jayadev Acharya, Clement Canonne, Himanshu Tyagi | We propose a general purpose simulate-and-infer strategy that uses only private-coin communication protocols and is sample-optimal for distribution learning. |

5 | Distributed Learning with Sublinear Communication | Jayadev Acharya, Chris De Sa, Dylan Foster, Karthik Sridharan | Our main result is that by slightly relaxing the standard boundedness assumptions for linear models, we can obtain distributed algorithms that enjoy optimal error with communication logarithmic in dimension. |

6 | Communication Complexity in Locally Private Distribution Estimation and Heavy Hitters | Jayadev Acharya, Ziteng Sun | We propose a sample-optimal $\varepsilon$-locally differentially private (LDP) scheme for distribution estimation, where each user communicates one bit, and requires no public randomness. |

7 | Learning Models from Data with Measurement Error: Tackling Underreporting | Roy Adams, Yuelong Ji, Xiaobin Wang, Suchi Saria | In this paper we present a method for estimating the distribution of an outcome given a binary exposure that is subject to underreporting. As studies based on observational data are increasingly used to inform decisions with real-world impact, it is critical that we develop a robust set of techniques for analyzing and adjusting for these biases. |

8 | TibGM: A Transferable and Information-Based Graphical Model Approach for Reinforcement Learning | Tameem Adel, Adrian Weller | Here we propose a flexible GM-based RL framework which leverages efficient inference procedures to enhance generalisation and transfer power. |

9 | PAC Learnability of Node Functions in Networked Dynamical Systems | Abhijin Adiga, Chris J Kuhlman, Madhav Marathe, S Ravi, Anil Vullikanti | We consider the PAC learnability of the local functions at the vertices of a discrete networked dynamical system, assuming that the underlying network is known. |

10 | Static Automatic Batching In TensorFlow | Ashish Agarwal | To address this we extend TensorFlow with pfor, a parallel-for loop optimized using static loop vectorization. |

11 | Efficient Full-Matrix Adaptive Regularization | Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, Yi Zhang | We show how to modify full-matrix adaptive regularization in order to make it practical and effective. |

12 | Online Control with Adversarial Disturbances | Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, Karan Singh | We present an efficient algorithm that achieves nearly-tight regret bounds in this setting. |

13 | Fair Regression: Quantitative Definitions and Reduction-Based Algorithms | Alekh Agarwal, Miroslav Dudik, Zhiwei Steven Wu | In this paper, we study the prediction of a real-valued target, such as a risk score or recidivism rate, while guaranteeing a quantitative notion of fairness with respect to a protected attribute such as gender or race. |

14 | Learning to Generalize from Sparse and Underspecified Rewards | Rishabh Agarwal, Chen Liang, Dale Schuurmans, Mohammad Norouzi | We propose Meta Reward Learning (MeRL) to construct an auxiliary reward function that provides more refined feedback for learning. |

15 | The Kernel Interaction Trick: Fast Bayesian Discovery of Pairwise Interactions in High Dimensions | Raj Agrawal, Brian Trippe, Jonathan Huggins, Tamara Broderick | Our key insight is that many hierarchical models of practical interest admit a Gaussian process representation such that rather than maintaining a posterior over all O(p^2) interactions, we need only maintain a vector of O(p) kernel hyper-parameters. |

16 | Understanding the Impact of Entropy on Policy Optimization | Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, Dale Schuurmans | In this work, we analyze this claim using new visualizations of the optimization landscape based on randomly perturbing the loss function. |

17 | Fairwashing: the risk of rationalization | Ulrich Aivodji, Hiromi Arai, Olivier Fortineau, Sébastien Gambs, Satoshi Hara, Alain Tapp | Our solution, LaundryML, is based on a regularized rule list enumeration algorithm whose objective is to search for fair rule lists approximating an unfair black-box model. |

18 | Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search | Youhei Akimoto, Shinichi Shirakawa, Nozomu Yoshinari, Kento Uchida, Shota Saito, Kouhei Nishida | We propose a stochastic natural gradient method with an adaptive step-size mechanism built upon our theoretical investigation (robust). |

19 | Projections for Approximate Policy Iteration Algorithms | Riad Akrour, Joni Pajarinen, Jan Peters, Gerhard Neumann | In this paper, we propose to improve over such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one which is then solved by standard gradient descent. |

20 | Validating Causal Inference Models via Influence Functions | Ahmed Alaa, Mihaela Van Der Schaar | In this paper, we use influence functions (the functional derivatives of a loss function) to develop a model validation procedure that estimates the estimation error of causal inference methods. |

21 | Multi-objective training of Generative Adversarial Networks with multiple discriminators | Isabela Albuquerque, Joao Monteiro, Thang Doan, Breandan Considine, Tiago Falk, Ioannis Mitliagkas | In this work, we revisit the multiple-discriminator setting by framing the simultaneous minimization of losses provided by different models as a multi-objective optimization problem. |

22 | Graph Element Networks: adaptive, structured computation and memory | Ferran Alet, Adarsh Keshav Jeewajee, Maria Bauza Villalonga, Alberto Rodriguez, Tomas Lozano-Perez, Leslie Kaelbling | We explore the use of graph neural networks (GNNs) to model spatial processes in which there is no a priori graphical structure. |

23 | Analogies Explained: Towards Understanding Word Embeddings | Carl Allen, Timothy Hospedales | We derive a probabilistically grounded definition of paraphrasing that we re-interpret as word transformation, a mathematical description of “$w_x$ is to $w_y$”. |

24 | Infinite Mixture Prototypes for Few-shot Learning | Kelsey Allen, Evan Shelhamer, Hanul Shin, Joshua Tenenbaum | We propose infinite mixture prototypes to adaptively represent both simple and complex data distributions for few-shot learning. |

25 | A Convergence Theory for Deep Learning via Over-Parameterization | Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song | In this work, we prove simple algorithms such as stochastic gradient descent (SGD) can find Global Minima on the training objective of DNNs in Polynomial Time. |

26 | Asynchronous Batch Bayesian Optimisation with Improved Local Penalisation | Ahsan Alvi, Binxin Ru, Jan-Peter Calliess, Stephen Roberts, Michael A. Osborne | We address this problem by developing an approach, Penalising Locally for Asynchronous Bayesian Optimisation on K Workers (PLAyBOOK), for asynchronous parallel BO. |

27 | Bounding User Contributions: A Bias-Variance Trade-off in Differential Privacy | Kareem Amin, Alex Kulesza, Andres Munoz, Sergei Vassilvitskii | Here, we characterize this trade-off for an empirical risk minimization setting, showing that in general there is a “sweet spot” that depends on measurable properties of the dataset, but that there is also a concrete cost to privacy that cannot be avoided simply by collecting more data. |

28 | Explaining Deep Neural Networks with a Polynomial Time Algorithm for Shapley Value Approximation | Marco Ancona, Cengiz Oztireli, Markus Gross | In this work, by leveraging recent results on uncertainty propagation, we propose a novel, polynomial-time approximation of Shapley values in deep neural networks. |

29 | Scaling Up Ordinal Embedding: A Landmark Approach | Jesse Anderton, Javed Aslam | We propose a novel landmark-based method as a partial solution. |

30 | Sorting Out Lipschitz Function Approximation | Cem Anil, James Lucas, Roger Grosse | Based on this, we propose to combine a gradient norm preserving activation function, GroupSort, with norm-constrained weight matrices. |

31 | Sparse Multi-Channel Variational Autoencoder for the Joint Analysis of Heterogeneous Data | Luigi Antelmi, Nicholas Ayache, Philippe Robert, Marco Lorenzi | To tackle this problem, in this work we extend the variational framework of VAE to bring parsimony and interpretability when jointly accounting for latent relationships across multiple channels. |

32 | Unsupervised Label Noise Modeling and Loss Correction | Eric Arazo, Diego Ortego, Paul Albert, Noel O'Connor, Kevin McGuinness | Specifically, we propose a beta mixture to estimate this probability and correct the loss by relying on the network prediction (the so-called bootstrapping loss). |

33 | Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks | Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, Ruosong Wang | This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR’17]. (ii) Generalization bound independent of network size, using a data-dependent complexity measure. |

34 | Distributed Weighted Matching via Randomized Composable Coresets | Sepehr Assadi, Mohammadhossein Bateni, Vahab Mirrokni | In this paper, we develop a simple distributed algorithm for the problem on general graphs with approximation guarantee of $2+\varepsilon$ that (nearly) matches that of the sequential greedy algorithm. |

35 | Stochastic Gradient Push for Distributed Deep Learning | Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, Mike Rabbat | This paper studies Stochastic Gradient Push (SGP), which combines PushSum with stochastic gradient updates. |

36 | Bayesian Optimization of Composite Functions | Raul Astudillo, Peter Frazier | We consider optimization of composite objective functions, i.e., of the form $f(x)=g(h(x))$, where $h$ is a black-box derivative-free expensive-to-evaluate function with vector-valued outputs, and $g$ is a cheap-to-evaluate real-valued function. |

37 | Linear-Complexity Data-Parallel Earth Mover's Distance Approximations | Kubilay Atasu, Thomas Mittelholzer | We propose novel approximation algorithms that overcome both of these limitations, yet still achieve linear time complexity. |

38 | Benefits and Pitfalls of the Exponential Mechanism with Applications to Hilbert Spaces and Functional PCA | Jordan Awan, Ana Kenney, Matthew Reimherr, Aleksandra Slavkovic | We study its extension to settings with summaries based on infinite dimensional outputs such as with functional data analysis, shape analysis, and nonparametric statistics. |

39 | Feature Grouping as a Stochastic Regularizer for High-Dimensional Structured Data | Sergul Aydore, Bertrand Thirion, Gael Varoquaux | We propose a new regularizer specifically designed to leverage structure in the data in a way that can be applied efficiently to complex models. |

40 | Beyond the Chinese Restaurant and Pitman-Yor processes: Statistical Models with double power-law behavior | Fadhel Ayed, Juho Lee, Francois Caron | In this paper, we introduce a class of completely random measures which are doubly regularly-varying. |

41 | Scalable Fair Clustering | Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, Tal Wagner | In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time. |

42 | Entropic GANs meet VAEs: A Statistical Approach to Compute Sample Likelihoods in GANs | Yogesh Balaji, Hamed Hassani, Rama Chellappa, Soheil Feizi | In this work, we resolve this issue by constructing an explicit probability model that can be used to compute sample likelihood statistics in GANs. |

43 | Provable Guarantees for Gradient-Based Meta-Learning | Maria-Florina Balcan, Mikhail Khodak, Ameet Talwalkar | We study the problem of meta-learning through the lens of online convex optimization, developing a meta-algorithm bridging the gap between popular gradient-based meta-learning and classical regularization-based multi-task transfer methods. |

44 | Open-ended learning in symmetric zero-sum games | David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech Czarnecki, Julien Perolat, Max Jaderberg, Thore Graepel | In this paper, we introduce a geometric framework for formulating agent objectives in zero-sum games, in order to construct adaptive sequences of objectives that yield open-ended learning. |

45 | Concrete Autoencoders: Differentiable Feature Selection and Reconstruction | Muhammed Fatih Balin, Abubakar Abid, James Zou | We introduce the concrete autoencoder, an end-to-end differentiable method for global feature selection, which efficiently identifies a subset of the most informative features and simultaneously learns a neural network to reconstruct the input data from the selected features. |

46 | HOList: An Environment for Machine Learning of Higher Order Logic Theorem Proving | Kshitij Bansal, Sarah Loos, Markus Rabe, Christian Szegedy, Stewart Wilcox | We present an environment, benchmark, and deep learning driven automated theorem prover for higher-order logic. |

47 | Structured agents for physical construction | Victor Bapst, Alvaro Sanchez-Gonzalez, Carl Doersch, Kimberly Stachenfeld, Pushmeet Kohli, Peter Battaglia, Jessica Hamrick | We examine how a range of deep reinforcement learning agents fare on these challenges, and introduce several new approaches which provide superior performance. |

48 | Learning to Route in Similarity Graphs | Dmitry Baranchuk, Dmitry Persiyanov, Anton Sinitsin, Artem Babenko | In this paper we propose to learn the routing function that overcomes local minima via incorporating information about the graph global structure. |

49 | A Personalized Affective Memory Model for Improving Emotion Recognition | Pablo Barros, German Parisi, Stefan Wermter | In this paper, we present a neural model based on a conditional adversarial autoencoder to learn how to represent and edit general emotion expressions. |

50 | Scale-free adaptive planning for deterministic dynamics & discounted rewards | Peter Bartlett, Victor Gabillon, Jennifer Healey, Michal Valko | We introduce PlaTypOOS, an adaptive, robust, and efficient alternative to the OLOP (open-loop optimistic planning) algorithm. |

51 | Pareto Optimal Streaming Unsupervised Classification | Soumya Basu, Steven Gutstein, Brent Lance, Sanjay Shakkottai | In this paper, we characterize the Pareto-optimal region of accuracy and arrival rate, and develop an algorithm that can operate at any point within this region. |

52 | Categorical Feature Compression via Submodular Optimization | Mohammadhossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab Mirrokni, Afshin Rostamizadeh | To address this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and also design a data structure to query the submodular function in amortized O(log n) time (where n is the input vocabulary size). |

53 | Noise2Self: Blind Denoising by Self-Supervision | Joshua Batson, Loic Royer | We propose a general framework for denoising high-dimensional measurements which requires no prior on the signal, no estimate of the noise, and no clean training data. |

54 | Efficient optimization of loops and limits with randomized telescoping sums | Alex Beatson, Ryan P Adams | We propose randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates. |

55 | Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces | Philipp Becker, Harit Pandya, Gregor Gebhardt, Cheng Zhao, C. James Taylor, Gerhard Neumann | We propose a new deep approach to Kalman filtering which can be learned directly in an end-to-end manner using backpropagation without additional approximations. |

56 | Switching Linear Dynamics for Variational Bayes Filtering | Philip Becker-Ehmck, Jan Peters, Patrick Van Der Smagt | Leveraging Bayesian inference, Variational Autoencoders and Concrete relaxations, we show how to learn a richer and more meaningful state space, e.g. encoding joint constraints and collisions with walls in a maze, from partial and high-dimensional observations. |

57 | Active Learning for Probabilistic Structured Prediction of Cuts and Matchings | Sima Behpour, Anqi Liu, Brian Ziebart | We propose an adversarial approach for active learning with structured prediction domains that is tractable for cuts and matching. |

58 | Invertible Residual Networks | Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, Joern-Henrik Jacobsen | To compute likelihoods, we introduce a tractable approximation to the Jacobian log-determinant of a residual block. |

59 | Greedy Layerwise Learning Can Scale To ImageNet | Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon | Here we use 1-hidden layer learning problems to sequentially build deep networks layer by layer, which can inherit properties from shallow networks. |

60 | Overcoming Multi-model Forgetting | Yassine Benyahia, Kaicheng Yu, Kamil Bennani Smires, Martin Jaggi, Anthony C. Davison, Mathieu Salzmann, Claudiu Musat | To overcome this, we introduce a statistically-justified weight plasticity loss that regularizes the learning of a model’s shared parameters according to their importance for the previous models, and demonstrate its effectiveness when training two models sequentially and for neural architecture search. |

61 | Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning | Frederik Benzing, Marcelo Matheus Gauy, Asier Mujika, Anders Martinsson, Angelika Steger | We present a new approximation algorithm of RTRL, Optimal Kronecker-Sum Approximation (OK). |

62 | Adversarially Learned Representations for Information Obfuscation and Inference | Martin Bertran, Natalia Martinez, Afroditi Papadaki, Qiang Qiu, Miguel Rodrigues, Galen Reeves, Guillermo Sapiro | In this work, we take an information theoretic approach that is implemented as an unconstrained adversarial game between Deep Neural Networks in a principled, data-driven manner. |

63 | Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case | Alina Beygelzimer, David Pal, Balazs Szorenyi, Devanathan Thiruvenkatachari, Chen-Yu Wei, Chicheng Zhang | In this work, we take a first step towards this problem. |

64 | Analyzing Federated Learning through an Adversarial Lens | Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, Seraphin Calo | In this work, we explore how the federated learning setting gives rise to a new threat, namely model poisoning, which differs from traditional data poisoning. |

65 | Optimal Continuous DR-Submodular Maximization and Applications to Provable Mean Field Inference | Yatao Bian, Joachim Buhmann, Andreas Krause | In this work we propose provable mean field methods for probabilistic log-submodular models and their posterior agreement (PA) with strong approximation guarantees. |

66 | More Efficient Off-Policy Evaluation through Regularized Targeted Learning | Aurelien Bibaut, Ivana Malenica, Nikos Vlassis, Mark Van Der Laan | In particular, we introduce a novel doubly-robust estimator for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature. |

67 | A Kernel Perspective for Regularizing Deep Neural Networks | Alberto Bietti, Grégoire Mialon, Dexiong Chen, Julien Mairal | We propose a new point of view for regularizing deep neural networks by using the norm of a reproducing kernel Hilbert space (RKHS). |

68 | Rethinking Lossy Compression: The Rate-Distortion-Perception Tradeoff | Yochai Blau, Tomer Michaeli | In this paper, we adopt the mathematical definition of perceptual quality recently proposed by Blau & Michaeli (2018), and use it to study the three-way tradeoff between rate, distortion, and perception. |

69 | Correlated bandits or: How to minimize mean-squared error online | Vinay Praneeth Boda, Prashanth L.A. | Under a best-arm identification framework, we propose a successive rejects type algorithm and provide bounds on the probability of error in identifying the best arm. |

70 | Adversarial Attacks on Node Embeddings via Graph Poisoning | Aleksandar Bojchevski, Stephan Günnemann | We provide the first adversarial vulnerability analysis on the widely used family of methods based on random walks. |

71 | Online Variance Reduction with Mixtures | Zalán Borsos, Sebastian Curi, Kfir Yehuda Levy, Andreas Krause | In this work, we propose a new framework for variance reduction that enables the use of mixtures over predefined sampling distributions, which can naturally encode prior knowledge about the data. |

72 | Compositional Fairness Constraints for Graph Embeddings | Avishek Bose, William Hamilton | Here, we introduce an adversarial framework to enforce fairness constraints on graph embeddings. |

73 | Unreproducible Research is Reproducible | Xavier Bouthillier, César Laurent, Pascal Vincent | This work is an attempt to promote the use of more rigorous and diversified methodologies. |

74 | Blended Conditional Gradients | Gábor Braun, Sebastian Pokutta, Dan Tu, Stephen Wright | We present a blended conditional gradient approach for minimizing a smooth convex function over a polytope P, combining the Frank-Wolfe algorithm (also called conditional gradient) with gradient-based steps, different from away steps and pairwise steps, but still achieving linear convergence for strongly convex functions, along with good practical performance. |

75 | Coresets for Ordered Weighted Clustering | Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, Xuan Wu | Our main result is a construction of a simultaneous coreset of size $O_{\epsilon,d}(k^2 \log^2 |X|)$ for Ordered k-Median. |

76 | Target Tracking for Contextual Bandits: Application to Demand Side Management | Margaux Brégère, Pierre Gaillard, Yannig Goude, Gilles Stoltz | We propose a contextual-bandit approach for demand side management by offering price incentives. |

77 | Active Manifolds: A non-linear analogue to Active Subspaces | Robert Bridges, Anthony Gruber, Christopher Felder, Miki Verma, Chelsey Hoff | We present an approach to analyze $C^1(\mathbb{R}^m)$ functions that addresses limitations present in the Active Subspaces (AS) method of Constantine et al. (2014; 2015). |

78 | Conditioning by adaptive sampling for robust design | David Brookes, Hahnbeom Park, Jennifer Listgarten | We present a method for design problems wherein the goal is to maximize or specify the value of one or more properties of interest (e.g. maximizing the fluorescence of a protein). |

79 | Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations | Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum | In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations. |

80 | Deep Counterfactual Regret Minimization | Noam Brown, Adam Lerer, Sam Gross, Tuomas Sandholm | This paper introduces Deep Counterfactual Regret Minimization, a form of CFR that obviates the need for abstraction by instead using deep neural networks to approximate the behavior of CFR in the full game. |

81 | Understanding the Origins of Bias in Word Embeddings | Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, Richard Zemel | In this work we develop a technique to address this question. |

82 | Low Latency Privacy Preserving Inference | Alon Brutzkus, Ran Gilad-Bachrach, Oren Elisha | In this study we provide two solutions that address these limitations. |

83 | Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem | Alon Brutzkus, Amir Globerson | In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weight clustering and feature exploration at initialization. |

84 | Adversarial examples from computational constraints | Sebastien Bubeck, Yin Tat Lee, Eric Price, Ilya Razenshteyn | Why are classifiers in high dimension vulnerable to “adversarial” perturbations? We show that it is likely not due to information theoretic limitations, but rather it could be due to computational constraints. |

85 | Self-similar Epochs: Value in arrangement | Eliav Buchnik, Edith Cohen, Avinatan Hasidim, Yossi Matias | We hypothesize that the training can be more effective with self-similar arrangements that potentially allow each epoch to provide benefits of multiple ones. |

86 | Learning Generative Models across Incomparable Spaces | Charlotte Bunne, David Alvarez-Melis, Andreas Krause, Stefanie Jegelka | In this work, we propose an approach to learn generative models across such incomparable spaces, and demonstrate how to steer the learned distribution towards target properties. |

87 | Rates of Convergence for Sparse Variational Gaussian Process Regression | David Burt, Carl Edward Rasmussen, Mark Van Der Wilk | We show that with high probability the KL divergence can be made arbitrarily small by growing $M$ more slowly than $N$. |

88 | What is the Effect of Importance Weighting in Deep Learning? | Jonathon Byrd, Zachary Lipton | We present the surprising finding that while importance weighting impacts models early in training, its effect diminishes over successive epochs. |

89 | A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent | Yongqiang Cai, Qianxiao Li, Zuowei Shen | In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS), where the precise dynamical properties of gradient descent (GD) are completely known, thus allowing us to isolate and compare the additional effects of BN. |

90 | Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances | Bugra Can, Mert Gurbuzbalaban, Lingjiong Zhu | For strongly convex problems, we show that the distribution of the iterates of AG converges with the accelerated $O(\sqrt{\kappa}\log(1/\varepsilon))$ linear rate to a ball of radius $\varepsilon$ centered at a unique invariant distribution in the 1-Wasserstein metric where $\kappa$ is the condition number as long as the noise variance is smaller than an explicit upper bound we can provide. |

91 | Active Embedding Search via Noisy Paired Comparisons | Gregory Canal, Andy Massimino, Mark Davenport, Christopher Rozell | In such tasks, queries can be extremely costly and subject to varying levels of response noise; thus, we aim to actively choose pairs that are most informative given the results of previous comparisons. |

92 | Dynamic Learning with Frequent New Product Launches: A Sequential Multinomial Logit Bandit Problem | Junyu Cao, Wei Sun | For the offline version with known customers’ preferences, we propose a polynomial-time algorithm and characterize the properties of the optimal tiered product recommendation. |

93 | Competing Against Nash Equilibria in Adversarially Changing Zero-Sum Games | Adrian Rivera Cardoso, Jacob Abernethy, He Wang, Huan Xu | But when the payoff matrix evolves over time our goal is to find a sequential algorithm that can compete with, in a certain sense, the NE of the long-term-averaged payoff matrix. |

94 | Automated Model Selection with Bayesian Quadrature | Henry Chai, Jean-Francois Ton, Michael A. Osborne, Roman Garnett | We present a novel technique for tailoring Bayesian quadrature (BQ) to model selection. |

95 | Learning Action Representations for Reinforcement Learning | Yash Chandak, Georgios Theocharous, James Kostas, Scott Jordan, Philip Thomas | We provide an algorithm to both learn and use action representations and provide conditions for its convergence. |

96 | Dynamic Measurement Scheduling for Event Forecasting using Deep RL | Chun-Hao Chang, Mingjie Mai, Anna Goldenberg | We answer this question by deep reinforcement learning (RL) that jointly minimizes the measurement cost and maximizes predictive gain, by scheduling strategically-timed measurements. |

97 | On Symmetric Losses for Learning from Corrupted Labels | Nontawat Charoenphakdee, Jongyeong Lee, Masashi Sugiyama | This paper aims to provide a better understanding of a symmetric loss. |

98 | Online learning with kernel losses | Niladri Chatterji, Aldo Pacchiano, Peter Bartlett | We present a generalization of the adversarial linear bandits framework, where the underlying losses are kernel functions (with an associated reproducing kernel Hilbert space) rather than linear functions. |

99 | Neural Network Attributions: A Causal Perspective | Aditya Chattopadhyay, Piyushi Manupriya, Anirban Sarkar, Vineeth N Balasubramanian | We propose a new attribution method for neural networks developed using first principles of causality (to the best of our knowledge, the first such). |

100 | PAC Identification of Many Good Arms in Stochastic Multi-Armed Bandits | Arghya Roy Chaudhuri, Shivaram Kalyanakrishnan | We present a lower bound on the worst-case sample complexity for general k, and a fully sequential PAC algorithm, LUCB-k-m, which is more sample-efficient on easy instances. |

101 | Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates | George Chen | We establish the first nonasymptotic error bounds for Kaplan-Meier-based nearest neighbor and kernel survival probability estimators where feature vectors reside in metric spaces. |

102 | Stein Point Markov Chain Monte Carlo | Wilson Ye Chen, Alessandro Barp, Francois-Xavier Briol, Jackson Gorham, Mark Girolami, Lester Mackey, Chris Oates | Stein Point Markov Chain Monte Carlo |

103 | Particle Flow Bayes’ Rule | Xinshi Chen, Hanjun Dai, Le Song | We present a particle flow realization of Bayes’ rule, where an ODE-based neural operator is used to transport particles from a prior to its posterior after a new observation. |

104 | Proportionally Fair Clustering | Xingyu Chen, Brandon Fain, Liang Lyu, Kamesh Munagala | We present and analyze algorithms to efficiently compute, optimize, and audit proportional solutions. |

105 | Information-Theoretic Considerations in Batch Reinforcement Learning | Jinglin Chen, Nan Jiang | In this paper, we revisit these assumptions and provide theoretical results towards answering the above questions, and make steps towards a deeper understanding of value-function approximation. |

106 | Generative Adversarial User Model for Reinforcement Learning Based Recommendation System | Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, Le Song | In this paper, we propose a novel model-based reinforcement learning framework for recommendation systems, where we develop a generative adversarial network to imitate user behavior dynamics and learn her reward function. |

107 | Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels | Pengfei Chen, Ben Ben Liao, Guangyong Chen, Shengyu Zhang | In this paper, we find that the test accuracy can be quantitatively characterized in terms of the noise ratio in datasets. |

108 | A Gradual, Semi-Discrete Approach to Generative Network Training via Explicit Wasserstein Minimization | Yucheng Chen, Matus Telgarsky, Chao Zhang, Bolton Bailey, Daniel Hsu, Jian Peng | This paper provides a simple procedure to fit generative networks to target distributions, with the goal of a small Wasserstein distance (or other optimal transport costs). |

109 | Transferability vs. Discriminability: Batch Spectral Penalization for Adversarial Domain Adaptation | Xinyang Chen, Sinan Wang, Mingsheng Long, Jianmin Wang | In this paper, a series of experiments based on spectral analysis of the feature representations have been conducted, revealing an unexpected deterioration of the discriminability while learning transferable features adversarially. |

110 | Fast Incremental von Neumann Graph Entropy Computation: Theory, Algorithm, and Applications | Pin-Yu Chen, Lingfei Wu, Sijia Liu, Indika Rajapakse | In this paper, we propose a new computational framework, Fast Incremental von Neumann Graph EntRopy (FINGER), which approaches VNGE with a performance guarantee. |

111 | Katalyst: Boosting Convex Katayusha for Non-Convex Problems with a Large Condition Number | Zaiyi Chen, Yi Xu, Haoyuan Hu, Tianbao Yang | In this paper, we present a simple but non-trivial boosting of a state-of-the-art SVRG-type method for convex problems (namely Katyusha) to enjoy an improved complexity for solving non-convex problems with a large condition number (that is close to a convex function). |

112 | Multivariate-Information Adversarial Ensemble for Scalable Joint Distribution Matching | Ziliang Chen, Zhanfu Yang, Xiaoxi Wang, Xiaodan Liang, Xiaopeng Yan, Guanbin Li, Liang Lin | In this paper, we propose a domain-scalable DGM, i.e., MMI-ALI for $m$-domain joint distribution matching. |

113 | Robust Decision Trees Against Adversarial Examples | Hongge Chen, Huan Zhang, Duane Boning, Cho-Jui Hsieh | In this paper, we show that tree-based models are also vulnerable to adversarial examples and develop a novel algorithm to learn robust trees. |

114 | RaFM: Rank-Aware Factorization Machines | Xiaoshuang Chen, Yin Zheng, Jiaxing Wang, Wenye Ma, Junzhou Huang | Different from existing FM-based approaches which use a fixed rank for all features, this paper proposes a Rank-Aware FM (RaFM) model which adopts pairwise interactions from embeddings with different ranks. |

115 | Control Regularization for Reduced Variance Reinforcement Learning | Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, Joel Burdick | Focusing on problems arising in continuous control, we propose a functional regularization approach to augmenting model-free RL. |

116 | Predictor-Corrector Policy Optimization | Ching-An Cheng, Xinyan Yan, Nathan Ratliff, Byron Boots | We present a predictor-corrector framework, called PicCoLO, that can transform a first-order model-free reinforcement or imitation learning algorithm into a new hybrid method that leverages predictive models to accelerate policy learning. |

117 | Variational Inference for sparse network reconstruction from count data | Julien Chiquet, Stephane Robin, Mahendra Mariadassou | In this work, we consider instead a full-fledged probabilistic model with a latent layer where the counts follow Poisson distributions, conditional to latent (hidden) Gaussian correlated variables. |

118 | Random Walks on Hypergraphs with Edge-Dependent Vertex Weights | Uthsav Chitra, Benjamin Raphael | In this paper, we use random walks to develop a spectral theory for hypergraphs with edge-dependent vertex weights: hypergraphs where every vertex v has a weight $\gamma_e(v)$ for each incident hyperedge e that describes the contribution of v to the hyperedge e. |

119 | Neural Joint Source-Channel Coding | Kristy Choi, Kedar Tatwawadi, Aditya Grover, Tsachy Weissman, Stefano Ermon | In this work, we propose to jointly learn the encoding and decoding processes using a new discrete variational autoencoder model. |

120 | Beyond Backprop: Online Alternating Minimization with Auxiliary Variables | Anna Choromanska, Benjamin Cowen, Sadhana Kumaravel, Ronny Luss, Mattia Rigotti, Irina Rish, Paolo Diachille, Viatcheslav Gurev, Brian Kingsbury, Ravi Tejwani, Djallel Bouneffouf | The main contribution of our work is a novel online (stochastic/mini-batch) alternating minimization (AM) approach for training deep neural networks, together with the first theoretical convergence guarantees for AM in stochastic settings and promising empirical results on a variety of architectures and datasets. |

121 | Unifying Orthogonal Monte Carlo Methods | Krzysztof Choromanski, Mark Rowland, Wenyu Chen, Adrian Weller | In this paper, we present a unifying perspective of many approximate methods by considering Givens transformations, propose new approximate methods based on this framework, and demonstrate the first statistical guarantees for families of approximate methods in kernel approximation. |

122 | Probability Functional Descent: A Unifying Perspective on GANs, Variational Inference, and Reinforcement Learning | Casey Chu, Jose Blanchet, Peter Glynn | The goal of this paper is to provide a unifying view of a wide range of problems of interest in machine learning by framing them as the minimization of functionals defined on the space of probability measures. |

123 | MeanSum: A Neural Model for Unsupervised Multi-Document Abstractive Summarization | Eric Chu, Peter Liu | In our work, we consider the setting where there are only documents (product or business reviews) with no summaries provided, and propose an end-to-end, neural model architecture to perform unsupervised abstractive summarization. Finally, we collect a ground-truth evaluation dataset and show that our model outperforms a strong extractive baseline. |

124 | Weak Detection of Signal in the Spiked Wigner Model | Hye Won Chung, Ji Oon Lee | In case the signal-to-noise ratio is under the threshold below which a reliable detection is impossible, we propose a hypothesis test based on the linear spectral statistics of the data matrix. |

125 | New results on information theoretic clustering | Ferdinando Cicalese, Eduardo Laber, Lucas Murtinho | We study the problem of optimizing the clustering of a set of vectors when the quality of the clustering is measured by the Entropy or the Gini impurity measure. |

126 | Sensitivity Analysis of Linear Structural Causal Models | Carlos Cinelli, Daniel Kumor, Bryant Chen, Judea Pearl, Elias Bareinboim | In this paper, we develop a formal, systematic approach to sensitivity analysis for arbitrary linear Structural Causal Models (SCMs). |

127 | Dimensionality Reduction for Tukey Regression | Kenneth Clarkson, Ruosong Wang, David Woodruff | We give the first dimensionality reduction methods for the overconstrained Tukey regression problem. |

128 | On Medians of (Randomized) Pairwise Means | Stephan Clemencon, Pierre Laforgue, Patrice Bertail | It is the purpose of this paper to extend this approach, in order to address other learning problems in particular, for which the performance criterion takes the form of an expectation over pairs of observations rather than over one single observation, as may be the case in pairwise ranking, clustering or metric learning. |

129 | Quantifying Generalization in Reinforcement Learning | Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman | In this paper, we investigate the problem of overfitting in deep reinforcement learning. |

130 | Empirical Analysis of Beam Search Performance Degradation in Neural Sequence Models | Eldan Cohen, Christopher Beck | We perform an empirical study of the behavior of beam search across three sequence synthesis tasks. |

131 | Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret | Alon Cohen, Tomer Koren, Yishay Mansour | We present the first computationally-efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret for learning in Linear Quadratic Control systems with unknown dynamics. |

132 | Certified Adversarial Robustness via Randomized Smoothing | Jeremy Cohen, Elan Rosenfeld, Zico Kolter | We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the L2 norm. |

133 | Gauge Equivariant Convolutional Networks and the Icosahedral CNN | Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, Max Welling | Here we show how this principle can be extended beyond global symmetries to local gauge transformations. |

134 | CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning | Cédric Colas, Pierre-Yves Oudeyer, Olivier Sigaud, Pierre Fournier, Mohamed Chetouani | This paper proposes CURIOUS, an algorithm that leverages 1) a modular Universal Value Function Approximator with hindsight learning to achieve a diversity of goals of different kinds within a unique policy and 2) an automated curriculum learning mechanism that biases the attention of the agent towards goals maximizing the absolute learning progress. |

135 | A fully differentiable beam search decoder | Ronan Collobert, Awni Hannun, Gabriel Synnaeve | We introduce a new beam search decoder that is fully differentiable, making it possible to optimize at training time through the inference procedure. |

136 | Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets | Rob Cornish, Paul Vanetti, Alexandre Bouchard-Cote, George Deligiannidis, Arnaud Doucet | We propose the Scalable Metropolis-Hastings (SMH) kernel that only requires processing on average $O(1)$ or even $O(1/\sqrt{n})$ data points per step. |

137 | Adjustment Criteria for Generalizing Experimental Findings | Juan Correa, Jin Tian, Elias Bareinboim | In this paper, we investigate the assumptions and machinery necessary for using covariate adjustment to correct for the biases generated by both of these problems, and generalize experimental data to infer causal effects in a new domain. |

138 | Online Learning with Sleeping Experts and Feedback Graphs | Corinna Cortes, Giulia Desalvo, Claudio Gentile, Mehryar Mohri, Scott Yang | Our main contribution is then to relax this assumption, present a more general notion of sleeping regret, and derive a general algorithm with strong theoretical guarantees. |

139 | Active Learning with Disagreement Graphs | Corinna Cortes, Giulia Desalvo, Mehryar Mohri, Ningshan Zhang, Claudio Gentile | We present two novel enhancements of an online importance-weighted active learning algorithm IWAL, using the properties of disagreements among hypotheses. |

140 | Shape Constraints for Set Functions | Andrew Cotter, Maya Gupta, Heinrich Jiang, Erez Louidor, James Muller, Tamann Narayan, Serena Wang, Tao Zhu | We propose making set functions more understandable and regularized by capturing domain knowledge through shape constraints. |

141 | Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints | Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, Seungil You | To improve generalization, we frame the problem as a two-player game where one player optimizes the model parameters on a training dataset, and the other player enforces the constraints on an independent validation dataset. |

142 | Monge blunts Bayes: Hardness Results for Adversarial Training | Zac Cranko, Aditya Menon, Richard Nock, Cheng Soon Ong, Zhan Shi, Christian Walder | We suggest a formal answer for losses that satisfy the minimal statistical requirement of being proper. |

143 | Boosted Density Estimation Remastered | Zac Cranko, Richard Nock | We show how to combine this latter approach and the classical boosting theory in supervised learning to get the first density estimation algorithm that provably achieves geometric convergence under very weak assumptions. |

144 | Submodular Cost Submodular Cover with an Approximate Oracle | Victoria Crawford, Alan Kuhnle, My Thai | In this work, we study the Submodular Cost Submodular Cover problem, which is to minimize the submodular cost required to ensure that the submodular benefit function exceeds a given threshold. |

145 | Flexibly Fair Representation Learning by Disentanglement | Elliot Creager, David Madras, Joern-Henrik Jacobsen, Marissa Weis, Kevin Swersky, Toniann Pitassi, Richard Zemel | Taking inspiration from the disentangled representation learning literature, we propose an algorithm for learning compact representations of datasets that are useful for reconstruction and prediction, but are also flexibly fair, meaning they can be easily modified at test time to achieve subgroup demographic parity with respect to multiple sensitive attributes and their conjunctions. |

146 | Anytime Online-to-Batch, Optimism and Acceleration | Ashok Cutkosky | We close this gap by introducing a black-box modification to any online learning algorithm whose iterates converge to the optimum in stochastic scenarios. |

147 | Matrix-Free Preconditioning in Online Learning | Ashok Cutkosky, Tamas Sarlos | We provide an online convex optimization algorithm with regret that interpolates between the regret of an algorithm using an optimal preconditioning matrix and one using a diagonal preconditioning matrix. |

148 | Minimal Achievable Sufficient Statistic Learning | Milan Cvitkovic, Günther Koliander | We introduce Minimal Achievable Sufficient Statistic (MASS) Learning, a machine learning training objective for which the minima are minimal sufficient statistics with respect to a class of functions being optimized over (e.g., deep networks). |

149 | Open Vocabulary Learning on Source Code with a Graph-Structured Cache | Milan Cvitkovic, Badal Singh, Animashree Anandkumar | We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. |

150 | The Value Function Polytope in Reinforcement Learning | Robert Dadashi, Marc G. Bellemare, Adrien Ali Taiga, Nicolas Le Roux, Dale Schuurmans | Our main contribution is the characterization of the nature of its shape: a general polytope (Aigner et al., 2010). |

151 | Bayesian Optimization Meets Bayesian Optimal Stopping | Zhongxiang Dai, Haibin Yu, Bryan Kian Hsiang Low, Patrick Jaillet | This paper proposes to unify BO (specifically, Gaussian process-upper confidence bound (GP-UCB)) with Bayesian optimal stopping (BO-BOS) to boost the epoch efficiency of BO. |

152 | Policy Certificates: Towards Accountable Reinforcement Learning | Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill | We address this lack of accountability by proposing that algorithms output policy certificates. |

153 | Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations | Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, Christopher Re | Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. |

154 | A Kernel Theory of Modern Data Augmentation | Tri Dao, Albert Gu, Alexander Ratner, Virginia Smith, Chris De Sa, Christopher Re | In this paper, we seek to establish a theoretical framework for understanding data augmentation. |

155 | TarMAC: Targeted Multi-Agent Communication | Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Mike Rabbat, Joelle Pineau | We propose a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially-observable environments. |

156 | Teaching a black-box learner | Sanjoy Dasgupta, Daniel Hsu, Stefanos Poulis, Xiaojin Zhu | We consider the problem of teaching a learner whose representation and hypothesis class are unknown—that is, the learner is a black box. |

157 | Stochastic Deep Networks | Gwendoline De Bie, Gabriel Peyré, Marco Cuturi | We propose in this work a deep framework designed to handle crucial aspects of measures, namely permutation invariances, variations in weights and cardinality. |

158 | Learning-to-Learn Stochastic Gradient Descent with Biased Regularization | Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, Massimiliano Pontil | We present an average excess risk bound for such a learning algorithm that quantifies the potential benefit of using a bias vector with respect to the unbiased case. |

159 | A Multitask Multiple Kernel Learning Algorithm for Survival Analysis with Application to Cancer Biology | Onur Dereli, Ceyda Oguz, Mehmet Gönen | Rather than performing survival analysis on each data set to predict survival times of cancer patients, we developed a novel multitask approach based on multiple kernel learning (MKL). |

160 | Learning to Convolve: A Generalized Weight-Tying Approach | Nichita Diaconu, Daniel Worrall | In this paper, we learn how to transform filters for use in the group convolution, focussing on roto-translation. |

161 | Sever: A Robust Meta-Algorithm for Stochastic Optimization | Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, Alistair Stewart | To address this, we introduce a new meta-algorithm that can take in a base learner such as least squares or stochastic gradient descent, and harden the learner to be resistant to outliers. |

162 | Approximated Oracle Filter Pruning for Destructive CNN Width Optimization | Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, Chenggang Yan | To address these problems, we propose Approximated Oracle Filter Pruning (AOFP), which keeps searching for the least important filters in a binary search manner, makes pruning attempts by masking out filters randomly, accumulates the resulting errors, and finetunes the model via a multi-path framework. |

163 | Noisy Dual Principal Component Pursuit | Tianyu Ding, Zhihui Zhu, Tianjiao Ding, Yunchen Yang, Daniel Robinson, Manolis Tsakiris, Rene Vidal | Noisy Dual Principal Component Pursuit |

164 | Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation on Multi-Agent Reinforcement Learning | Thinh Doan, Siva Maguluri, Justin Romberg | Our main contribution is providing a finite-time analysis for the convergence of the distributed TD(0) algorithm. |

165 | Trajectory-Based Off-Policy Deep Reinforcement Learning | Andreas Doerr, Michael Volpp, Marc Toussaint, Sebastian Trimpe, Christian Daniel | This work addresses these weaknesses by combining recent improvements in the reuse of off-policy data and exploration in parameter space with deterministic behavioral policies. |

166 | Generalized No Free Lunch Theorem for Adversarial Robustness | Elvis Dohmatob | This manuscript presents some new impossibility results on adversarial robustness in machine learning, a very important yet largely open problem. |

167 | Width Provably Matters in Optimization for Deep Linear Neural Networks | Simon Du, Wei Hu | We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\widetilde{\Omega}\left(L \cdot r \cdot d_{out} \cdot \kappa^3 \right)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{out}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. |

168 | Provably efficient RL with Rich Observations via Latent State Decoding | Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, John Langford | We study the exploration problem in episodic MDPs with rich observations generated from a small number of latent states. |

169 | Gradient Descent Finds Global Minima of Deep Neural Networks | Simon Du, Jason Lee, Haochuan Li, Liwei Wang, Xiyu Zhai | The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). |

170 | Incorporating Grouping Information into Bayesian Decision Tree Ensembles | Junliang Du, Antonio Linero | We consider the problem of nonparametric regression in the high-dimensional setting in which $P \gg N$. |

171 | Task-Agnostic Dynamics Priors for Deep Reinforcement Learning | Yilun Du, Karthic Narasimhan | In this work, we propose an approach to learn task-agnostic dynamics priors from videos and incorporate them into an RL agent. |

172 | Optimal Auctions through Deep Learning | Paul Duetting, Zhe Feng, Harikrishna Narasimhan, David Parkes, Sai Srivatsa Ravindranath | In this work, we initiate the exploration of the use of tools from deep learning for the automated design of optimal auctions. |

173 | Wasserstein of Wasserstein Loss for Learning Generative Models | Yonatan Dukler, Wuchen Li, Alex Lin, Guido Montufar | We propose to use the Wasserstein distance itself as the ground metric on the sample space of images. |

174 | Learning interpretable continuous-time models of latent stochastic dynamical systems | Lea Duncker, Gergo Bohner, Julien Boussard, Maneesh Sahani | This form yields a flexible nonparametric model of the dynamics, with a representation corresponding directly to the interpretable portraits routinely employed in the study of nonlinear dynamical systems. |

175 | Autoregressive Energy Machines | Conor Durkan, Charlie Nash | We propose the Autoregressive Energy Machine, an energy-based model which simultaneously learns an unnormalized density and computes an importance-sampling estimate of the normalizing constant for each conditional in an autoregressive decomposition. |

176 | Band-limited Training and Inference for Convolutional Neural Networks | Adam Dziedzic, John Paparrizos, Sanjay Krishnan, Aaron Elmore, Michael Franklin | We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. |

177 | Imitating Latent Policies from Observation | Ashley Edwards, Himanshu Sahni, Yannick Schroecker, Charles Isbell | In this paper, we describe a novel approach to imitation learning that infers latent policies directly from state observations. |

178 | Semi-Cyclic Stochastic Gradient Descent | Hubert Eichner, Tomer Koren, Brendan McMahan, Nathan Srebro, Kunal Talwar | We show that such block-cyclic structure can significantly deteriorate the performance of SGD, but propose a simple approach that allows prediction with the same guarantees as for i.i.d., non-cyclic, sampling. |

179 | GDPP: Learning Diverse Generations using Determinantal Point Processes | Mohamed Elfeki, Camille Couprie, Morgane Riviere, Mohamed Elhoseiny | In this work, we draw inspiration from Determinantal Point Process (DPP) to propose an unsupervised penalty loss that alleviates mode collapse while producing higher quality samples. |

180 | Sequential Facility Location: Approximate Submodularity and Greedy Algorithm | Ehsan Elhamifar | We propose a cardinality-constrained sequential facility location function that finds a fixed number of representatives, where the sequence of representatives is compatible with the dynamic model and well encodes the data. |

181 | Improved Convergence for $\ell_1$ and $\ell_\infty$ Regression via Iteratively Reweighted Least Squares | Alina Ene, Adrian Vladu | In this paper we propose a simple and natural version of IRLS for solving $\ell_\infty$ and $\ell_1$ regression, which provably converges to a $(1+\epsilon)$-approximate solution in $O(m^{1/3}\log(1/\epsilon)/\epsilon^{2/3} + \log m/\epsilon^2)$ iterations, where $m$ is the number of rows of the input matrix. |

182 | Exploring the Landscape of Spatial Robustness | Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, Aleksander Madry | In this work, we thoroughly investigate the vulnerability of neural network–based classifiers to rotations and translations. |

183 | Cross-Domain 3D Equivariant Image Embeddings | Carlos Esteves, Avneesh Sud, Zhengyi Luo, Kostas Daniilidis, Ameesh Makadia | In this paper we learn 2D image embeddings with a similar equivariant structure: embedding the image of a 3D object should commute with rotations of the object. |

184 | On the Connection Between Adversarial Robustness and Saliency Map Interpretability | Christian Etmann, Sebastian Lunz, Peter Maass, Carola Schoenlieb | We aim to quantify this behaviour by considering the alignment between input image and saliency map. |

185 | Non-monotone Submodular Maximization with Nearly Optimal Adaptivity and Query Complexity | Matthew Fahrbach, Vahab Mirrokni, Morteza Zadimoghaddam | In this paper, we give the first constant-factor approximation algorithm for maximizing a non-monotone submodular function subject to a cardinality constraint $k$ that runs in $O(\log(n))$ adaptive rounds and makes $O(n \log(k))$ oracle queries in expectation. |

186 | Multi-Frequency Vector Diffusion Maps | Yifeng Fan, Zhizhen Zhao | We introduce multi-frequency vector diffusion maps (MFVDM), a new framework for organizing and analyzing high dimensional data sets. |

187 | Stable-Predictive Optimistic Counterfactual Regret Minimization | Gabriele Farina, Christian Kroer, Noam Brown, Tuomas Sandholm | In this work we present the first CFR variant that breaks the square-root dependence on iterations. |

188 | Regret Circuits: Composability of Regret Minimizers | Gabriele Farina, Christian Kroer, Tuomas Sandholm | In this paper we study the general composability of regret minimizers. |

189 | Dead-ends and Secure Exploration in Reinforcement Learning | Mehdi Fatemi, Shikhar Sharma, Harm Van Seijen, Samira Ebrahimi Kahou | To deal with the bridge effect, we propose a condition for exploration, called security. |

190 | Invariant-Equivariant Representation Learning for Multi-Class Data | Ilya Feige | We introduce an approach to probabilistic modelling that learns to represent data with two separate deep representations: an invariant representation that encodes the information of the class from which the data belongs, and an equivariant representation that encodes the symmetry transformation defining the particular data point within the class manifold (equivariant in the sense that the representation varies naturally with symmetry transformations). |

191 | The advantages of multiple classes for reducing overfitting from test set reuse | Vitaly Feldman, Roy Frostig, Moritz Hardt | We show a new upper bound of $\tilde O(\max\{\sqrt{k\log(n)/(mn)}, k/n\})$ on the worst-case bias that any attack can achieve in a prediction problem with $m$ classes. |

192 | Decentralized Exploration in Multi-Armed Bandits | Raphael Feraud, Reda Alami, Romain Laroche | We consider the decentralized exploration problem: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. |

193 | Almost surely constrained convex optimization | Olivier Fercoq, Ahmet Alacaoglu, Ion Necoara, Volkan Cevher | We propose a stochastic gradient framework for solving stochastic composite convex optimization problems with (possibly) infinite number of linear inclusion constraints that need to be satisfied almost surely. |

194 | Online Meta-Learning | Chelsea Finn, Aravind Rajeswaran, Sham Kakade, Sergey Levine | This work introduces an online meta-learning setting, which merges ideas from both paradigms to better capture the spirit and practice of continual lifelong learning. |

195 | DL2: Training and Querying Neural Networks with Logic | Marc Fischer, Mislav Balunovic, Dana Drachsler-Cohen, Timon Gehr, Ce Zhang, Martin Vechev | We present DL2, a system for training and querying neural networks with logical constraints. |

196 | Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning | Jakob Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew Botvinick, Michael Bowling | We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. |

197 | Scalable Nonparametric Sampling from Multimodal Posteriors with the Posterior Bootstrap | Edwin Fong, Simon Lyddon, Chris Holmes | We present a scalable Bayesian nonparametric learning routine that enables posterior sampling through the optimization of suitably randomized objective functions. |

198 | On discriminative learning of prediction uncertainty | Vojtech Franc, Daniel Prusa | We propose a discriminative algorithm learning an uncertainty function which preserves ordering of the input space induced by the conditional risk, and hence can be used to construct optimal rejection strategies. |

199 | Learning Discrete Structures for Graph Neural Networks | Luca Franceschi, Mathias Niepert, Massimiliano Pontil, Xiao He | With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. |

200 | Distributional Multivariate Policy Evaluation and Exploration with the Bellman GAN | Dror Freirich, Tzahi Shimkin, Ron Meir, Aviv Tamar | In this work, we show that the distributional Bellman equation, which drives DiRL methods, is equivalent to a generative adversarial network (GAN) model. |

201 | Approximating Orthogonal Matrices with Effective Givens Factorization | Thomas Frerix, Joan Bruna | We analyze effective approximation of unitary matrices. |

202 | Fast and Flexible Inference of Joint Distributions from their Marginals | Charlie Frogner, Tomaso Poggio | In this paper, we treat the inference problem generally and propose a unified class of models that encompasses some of those previously proposed while including many new ones. |

203 | Analyzing and Improving Representations with the Soft Nearest Neighbor Loss | Nicholas Frosst, Nicolas Papernot, Geoffrey Hinton | We explore and expand the Soft Nearest Neighbor Loss to measure the entanglement of class manifolds in representation space: i.e., how close pairs of points from the same class are relative to pairs of points from different classes. |

204 | Diagnosing Bottlenecks in Deep Q-learning Algorithms | Justin Fu, Aviral Kumar, Matthew Soh, Sergey Levine | In this work, we aim to experimentally investigate potential issues in Q-learning, by means of a “unit testing” framework where we can utilize oracles to disentangle sources of error. |

205 | MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement | Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, Shou-De Lin | To overcome this issue, we propose a novel MetricGAN approach with an aim to optimize the generator with respect to one or multiple evaluation metrics. |

206 | Beyond Adaptive Submodularity: Approximation Guarantees of Greedy Policy with Adaptive Submodularity Ratio | Kaito Fujii, Shinsaku Sakaue | We propose a new concept named adaptive submodularity ratio to study the greedy policy for sequential decision making. |

207 | Off-Policy Deep Reinforcement Learning without Exploration | Scott Fujimoto, David Meger, Doina Precup | In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning with data uncorrelated to the distribution under the current policy, making them ineffective for this fixed batch setting. |

208 | Transfer Learning for Related Reinforcement Learning Tasks via Image-to-Image Translation | Shani Gamrian, Yoav Goldberg | We demonstrate the approach on synthetic visual variants of the Breakout game, as well as on transfer between subsequent levels of Road Fighter, a Nintendo car-driving game. |

209 | Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities | Octavian Ganea, Sylvain Gelly, Gary Becigneul, Aliaksei Severyn | As an efficient and effective solution to alleviate this issue, we propose to learn parametric monotonic functions on top of the logits. |

210 | Graph U-Nets | Hongyang Gao, Shuiwang Ji | To address these challenges, we propose novel graph pooling (gPool) and unpooling (gUnpool) operations in this work. |

211 | Deep Generative Learning via Variational Gradient Flow | Yuan Gao, Yuling Jiao, Yang Wang, Yao Wang, Can Yang, Shunkang Zhang | We propose a framework to learn deep generative models via Variational Gradient Flow (VGrow) on probability spaces. |

212 | Rate Distortion For Model Compression: From Theory To Practice | Weihao Gao, Yu-Han Liu, Chong Wang, Sewoong Oh | In this paper, we propose principled approaches to improve upon the common heuristics used in those building blocks, by studying the fundamental limit for model compression via the rate distortion theory. |

213 | Demystifying Dropout | Hongchang Gao, Jian Pei, Heng Huang | In this paper, unlike existing works, we explore it from a new perspective to provide new insight into this line of research. |

214 | Geometric Scattering for Graph Data Analysis | Feng Gao, Guy Wolf, Matthew Hirn | We explore the generalization of scattering transforms from traditional (e.g., image or audio) signals to graph data, analogous to the generalization of ConvNets in geometric deep learning, and the utility of extracted graph features in graph data analysis. |

215 | Multi-Frequency Phase Synchronization | Tingran Gao, Zhizhen Zhao | We propose a novel formulation for phase synchronization—the statistical problem of jointly estimating alignment angles from noisy pairwise comparisons—as a nonconvex optimization problem that enforces consistency among the pairwise comparisons in multiple frequency channels. |

216 | Optimal Mini-Batch and Step Sizes for SAGA | Nidham Gazagnadou, Robert Gower, Joseph Salmon | Using these bounds, and since the SAGA algorithm is part of this JacSketch family, we suggest a new standard practice for setting the step and mini-batch sizes for SAGA that are competitive with a numerical grid search. |

217 | SelectiveNet: A Deep Neural Network with an Integrated Reject Option | Yonatan Geifman, Ran El-Yaniv | We consider the problem of selective prediction (also known as reject option) in deep neural networks, and introduce SelectiveNet, a deep neural architecture with an integrated reject option. |

218 | A Theory of Regularized Markov Decision Processes | Matthieu Geist, Bruno Scherrer, Olivier Pietquin | We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. |

219 | DeepMDP: Learning Continuous Latent Space Models for Representation Learning | Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, Marc G. Bellemare | To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable latent space losses: prediction of rewards and prediction of the distribution over next latent states. |

220 | Partially Linear Additive Gaussian Graphical Models | Sinong Geng, Minhao Yan, Mladen Kolar, Sanmi Koyejo | We propose a partially linear additive Gaussian graphical model (PLA-GGM) for the estimation of associations between random variables distorted by observed confounders. |

221 | Learning and Data Selection in Big Datasets | Hossein Shokri Ghadikolaei, Hadi Ghauch, Carlo Fischione, Mikael Skoglund | More specifically, we propose a framework that jointly learns the input-output mapping as well as the most representative samples of the dataset (sufficient dataset). |

222 | Improved Parallel Algorithms for Density-Based Network Clustering | Mohsen Ghaffari, Silvio Lattanzi, Slobodan Mitrovic | In the case of $k$-core decomposition, our work improves exponentially on the algorithm provided by Esfandiari et al. (ICML’18). |

223 | Recursive Sketches for Modular Deep Learning | Badih Ghazi, Rina Panigrahy, Joshua Wang | We present a mechanism to compute a sketch (succinct summary) of how a complex modular deep network processes its inputs. |

224 | An Instability in Variational Inference for Topic Models | Behrooz Ghorbani, Hamid Javadi, Andrea Montanari | We show that these methods suffer from an instability that can produce misleading conclusions. |

225 | An Investigation into Neural Net Optimization via Hessian Eigenvalue Density | Behrooz Ghorbani, Shankar Krishnan, Ying Xiao | To understand the dynamics of training in deep neural networks, we study the evolution of the Hessian eigenvalue density throughout the optimization process. |

226 | Data Shapley: Equitable Valuation of Data for Machine Learning | Amirata Ghorbani, James Zou | In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. |

227 | Efficient Dictionary Learning with Gradient Descent | Dar Gilboa, Sam Buchanan, John Wright | We study one such problem, complete orthogonal dictionary learning, and provide convergence guarantees for randomly initialized gradient descent to the neighborhood of a global optimum. |

228 | A Tree-Based Method for Fast Repeated Sampling of Determinantal Point Processes | Jennifer Gillenwater, Alex Kulesza, Zelda Mariet, Sergei Vassilvitskii | In this work we address both of these shortcomings. |

229 | Learning to Groove with Inverse Sequence Transformations | Jon Gillick, Adam Roberts, Jesse Engel, Douglas Eck, David Bamman | We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using seq2seq and recurrent variational information bottleneck (VIB) models. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. |

230 | Adversarial Examples Are a Natural Consequence of Test Error in Noise | Justin Gilmer, Nicolas Ford, Nicholas Carlini, Ekin Cubuk | In this paper we provide both empirical and theoretical evidence that these are two manifestations of the same underlying phenomenon, and therefore the adversarial robustness and corruption robustness research programs are closely related. |

231 | Discovering Conditionally Salient Features with Statistical Guarantees | Jaime Roquero Gimenez, James Zou | We study a more fine-grained statistical problem: conditional feature selection, where a feature may be relevant depending on the values of the other features. |

232 | Estimating Information Flow in Deep Neural Networks | Ziv Goldfeld, Ewout Van Den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, Yury Polyanskiy | Focusing on feedforward networks with fixed weights and noisy internal representations, we develop a rigorous framework for accurate estimation of $I(X;T_\ell)$. |

233 | Amortized Monte Carlo Integration | Adam Golinski, Frank Wood, Tom Rainforth | In this paper, we address this inefficiency by introducing AMCI, a method for amortizing Monte Carlo integration directly. |

234 | Online Algorithms for Rent-Or-Buy with Expert Advice | Sreenivas Gollapudi, Debmalya Panigrahi | In particular, we consider the classical rent-or-buy problem (also called ski rental), and obtain algorithms that provably improve their performance over the adversarial scenario by using these predictions. |

235 | The information-theoretic value of unlabeled data in semi-supervised learning | Alexander Golovnev, David Pal, Balazs Szorenyi | More specifically, we prove a separation by $\Theta(\log n)$ multiplicative factor for the class of projections over the Boolean hypercube of dimension $n$. |

236 | Efficient Training of BERT by Progressively Stacking | Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, Tieyan Liu | In this paper, we explore an efficient training method for the state-of-the-art bidirectional Transformer (BERT) model. |

237 | Quantile Stein Variational Gradient Descent for Batch Bayesian Optimization | Chengyue Gong, Jian Peng, Qiang Liu | In this paper, we introduce a novel variational framework for batch query optimization, based on the argument that the query batch should be selected to have both high diversity and good worst case performance. |

238 | Obtaining Fairness using Optimal Transport Theory | Paula Gordaliza, Eustasio Del Barrio, Fabrice Gamboa, Jean-Michel Loubes | We propose a Random Repair which yields a tradeoff between minimal information loss and a certain amount of fairness. |

239 | Combining parametric and nonparametric models for off-policy evaluation | Omer Gottesman, Yao Liu, Scott Sussex, Emma Brunskill, Finale Doshi-Velez | We consider a model-based approach to perform batch off-policy evaluation in reinforcement learning. |

240 | Counterfactual Visual Explanations | Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, Stefan Lee | In this work, we develop a technique to produce counterfactual visual explanations. |

241 | Adaptive Sensor Placement for Continuous Spaces | James Grant, Alexis Boukouvalas, Ryan-Rhys Griffiths, David Leslie, Sattar Vakili, Enrique Munoz De Cote | We present a new formulation of the problem as a continuum-armed bandit problem with feedback in the form of partial observations of realisations of an inhomogeneous Poisson process. |

242 | A Statistical Investigation of Long Memory in Language and Music | Alexander Greaves-Tunnell, Zaid Harchaoui | We contribute a statistical framework for investigating long-range dependence in current applications of deep sequence modeling, drawing on the well-developed theory of long memory stochastic processes. |

243 | Automatic Posterior Transformation for Likelihood-Free Inference | David Greenberg, Marcel Nonnenmacher, Jakob Macke | Here we present automatic posterior transformation (APT), a new sequential neural posterior estimation method for simulation-based inference. |

244 | Learning to Optimize Multigrid PDE Solvers | Daniel Greenfeld, Meirav Galun, Ronen Basri, Irad Yavneh, Ron Kimmel | In this paper we propose a framework for learning multigrid solvers. |

245 | Multi-Object Representation Learning with Iterative Variational Inference | Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, Alexander Lerchner | Instead, we argue for the importance of learning to segment and represent objects jointly. |

246 | Graphite: Iterative Generative Modeling of Graphs | Aditya Grover, Aaron Zweig, Stefano Ermon | In this work, we propose Graphite, an algorithmic framework for unsupervised learning of representations over nodes in large graphs using deep latent variable generative models. |

247 | Fast Algorithm for Generalized Multinomial Models with Ranking Data | Jiaqi Gu, Guosheng Yin | Based on this property, we propose an iterative algorithm that is easy to implement and interpret, and is guaranteed to converge. |

248 | Towards a Deep and Unified Understanding of Deep Neural Models in NLP | Chaoyu Guan, Xiting Wang, Quanshi Zhang, Runjin Chen, Di He, Xing Xie | We define a unified information-based measure to provide quantitative explanations on how intermediate layers of deep Natural Language Processing (NLP) models leverage information of input words. |

249 | An Investigation of Model-Free Planning | Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sebastien Racaniere, Theophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, Timothy Lillicrap | In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. |

250 | Humor in Word Embeddings: Cockamamie Gobbledegook for Nincompoops | Limor Gultchin, Genevieve Patterson, Nancy Baym, Nathaniel Swinger, Adam Kalai | While humor is often thought to be beyond the reach of Natural Language Processing, we show that several aspects of single-word humor correlate with simple linear directions in Word Embeddings. |

251 | Simple Black-box Adversarial Attacks | Chuan Guo, Jacob Gardner, Yurong You, Andrew Gordon Wilson, Kilian Weinberger | We propose an intriguingly simple method for the construction of adversarial images in the black-box setting. |

252 | Exploring interpretable LSTM neural networks over multi-variable data | Tian Guo, Tao Lin, Nino Antulov-Fantulin | In this paper, we explore the structure of LSTM recurrent neural networks to learn variable-wise hidden states, with the aim to capture different dynamics in multi-variable time series and distinguish the contribution of variables to the prediction. |

253 | Learning to Exploit Long-term Relational Dependencies in Knowledge Graphs | Lingbing Guo, Zequn Sun, Wei Hu | In this paper, we propose recurrent skipping networks (RSNs), which employ a skipping mechanism to bridge the gaps between entities. |

254 | Memory-Optimal Direct Convolutions for Maximizing Classification Accuracy in Embedded Applications | Albert Gural, Boris Murmann | This paper presents memory-optimal direct convolutions as a way to push classification accuracy as high as possible given strict hardware memory constraints at the expense of extra compute. |

255 | IMEXnet: A Forward Stable Deep Neural Network | Eldad Haber, Keegan Lensink, Eran Treister, Lars Ruthotto | We introduce the IMEXnet that addresses these challenges by adapting semi-implicit methods for partial differential equations. We also present a new dataset for semantic segmentation and demonstrate the effectiveness of our architecture using the NYU Depth dataset. |

256 | On The Power of Curriculum Learning in Training Deep Networks | Guy Hacohen, Daphna Weinshall | In this work, we analyze the effect of curriculum learning, which involves the non-uniform sampling of mini-batches, on the training of deep networks, and specifically CNNs trained for image recognition. |

257 | Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization | Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, Viveck Cadambe | In this paper, we advocate the use of redundancy towards communication-efficient distributed stochastic algorithms for non-convex optimization. |

258 | Learning Latent Dynamics for Planning from Pixels | Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson | We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space. |

259 | Neural Separation of Observed and Unobserved Distributions | Tavi Halperin, Ariel Ephrat, Yedid Hoshen | In this work, we introduce a new method—Neural Egg Separation—to tackle the scenario of extracting a signal from an unobserved distribution additively mixed with a signal from an observed distribution. |

260 | Grid-Wise Control for Multi-Agent Reinforcement Learning in Video Game AI | Lei Han, Peng Sun, Yali Du, Jiechao Xiong, Qing Wang, Xinghai Sun, Han Liu, Tong Zhang | To address the issue, we propose a novel architecture that learns a spatial joint representation of all the agents and outputs grid-wise actions. |

261 | Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning | Seungyul Han, Youngchul Sung | In this paper, we consider PPO, a representative on-policy algorithm, and propose its improvement by dimension-wise IS weight clipping which separately clips the IS weight of each action dimension to avoid large bias and adaptively controls the IS weight to bound policy update from the current policy. |

262 | Complexity of Linear Regions in Deep Networks | Boris Hanin, David Rolnick | In this paper, we provide a mathematical framework to count the number of linear regions of a piecewise linear network and measure the volume of the boundaries between these regions. |

263 | Importance Sampling Policy Evaluation with an Estimated Behavior Policy | Josiah Hanna, Scott Niekum, Peter Stone | In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. |

264 | Doubly-Competitive Distribution Estimation | Yi Hao, Alon Orlitsky | This paper combines and strengthens the two frameworks. |

265 | Random Shuffling Beats SGD after Finite Epochs | Jeff Haochen, Suvrit Sra | Building upon Gürbüzbalaban et al. (2015), we present the first (to our knowledge) non-asymptotic results for this problem by proving that after a reasonable number of epochs RandomShuffle converges faster than SGD. |

266 | Submodular Maximization beyond Non-negativity: Guarantees, Fast Algorithms, and Applications | Chris Harshaw, Moran Feldman, Justin Ward, Amin Karbasi | We present an algorithm for maximizing $g - c$ under a $k$-cardinality constraint which produces a random feasible set $S$ such that $\mathbb{E}[g(S) - c(S)] \geq (1 - e^{-\gamma} - \epsilon) g(\mathrm{OPT}) - c(\mathrm{OPT})$, whose running time is $O(\frac{n}{\epsilon} \log^2 \frac{1}{\epsilon})$, independent of $k$. |

267 | Per-Decision Option Discounting | Anna Harutyunyan, Peter Vrancx, Philippe Hamel, Ann Nowe, Doina Precup | We propose a modification to the options framework that naturally scales the agent’s horizon with option length. |

268 | Submodular Observation Selection and Information Gathering for Quadratic Models | Abolfazl Hashemi, Mahsa Ghasemi, Haris Vikalo, Ufuk Topcu | We study the problem of selecting the most informative subset of a large observation set to enable accurate estimation of unknown parameters. |

269 | Understanding and Controlling Memory in Recurrent Neural Networks | Doron Haviv, Alexander Rivkind, Omri Barak | Here, we utilize different training protocols, datasets and architectures to obtain a range of networks solving a delayed classification task with similar performance, alongside substantial differences in their ability to extrapolate for longer delays. |

270 | On the Impact of the Activation function on Deep Neural Networks Training | Soufiane Hayou, Arnaud Doucet, Judith Rousseau | We give a comprehensive theoretical analysis of the Edge of Chaos and show that we can indeed tune the initialization parameters and the activation function in order to accelerate the training and improve the performance. |

271 | Provably Efficient Maximum Entropy Exploration | Elad Hazan, Sham Kakade, Karan Singh, Abby Van Soest | We provide an efficient algorithm to optimize such intrinsically defined objectives, when given access to a black box planning oracle (which is robust to function approximation). |

272 | On the Long-term Impact of Algorithmic Decision Policies: Effort Unfairness and Feature Segregation through Social Learning | Hoda Heidari, Vedant Nanda, Krishna Gummadi | We propose an effort-based measure of fairness and present a data-driven framework for characterizing the long-term impact of algorithmic policies on reshaping the underlying population. |

273 | Graph Resistance and Learning from Pairwise Comparisons | Julien Hendrickx, Alexander Olshevsky, Venkatesh Saligrama | We consider the problem of learning the qualities of a collection of items by performing noisy comparisons among them. |

274 | Using Pre-Training Can Improve Model Robustness and Uncertainty | Dan Hendrycks, Kimin Lee, Mantas Mazeika | We show that although pre-training may not improve performance on traditional classification metrics, it improves model robustness and uncertainty estimates. |

275 | Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design | Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, Pieter Abbeel | In this paper, we investigate and improve upon three limiting design choices employed by flow-based models in prior work: the use of uniform noise for dequantization, the use of inexpressive affine flows, and the use of purely convolutional conditioning networks in coupling layers. |

276 | Population Based Augmentation: Efficient Learning of Augmentation Policy Schedules | Daniel Ho, Eric Liang, Xi Chen, Ion Stoica, Pieter Abbeel | In this paper, we introduce a new data augmentation algorithm, Population Based Augmentation (PBA), which generates nonstationary augmentation policy schedules instead of a fixed augmentation policy. |

277 | Collective Model Fusion for Multiple Black-Box Experts | Minh Hoang, Nghia Hoang, Bryan Kian Hsiang Low, Carleton Kingsford | The proposed method will enable this by addressing the key issues of how black-box experts interact to understand the predictive behaviors of one another; how these understandings can be represented and shared efficiently among themselves; and how the shared understandings can be combined to generate high-quality consensus prediction. |

278 | Connectivity-Optimized Representation Learning via Persistent Homology | Christoph Hofer, Roland Kwitt, Marc Niethammer, Mandar Dixit | Under mild conditions, this loss is differentiable and we present a theoretical analysis of the properties induced by the loss. |

279 | Better generalization with less data using robust gradient descent | Matthew Holland, Kazushi Ikeda | In pursuit of stronger performance under weaker assumptions, we propose a technique which uses a cheap and robust iterative estimate of the risk gradient, which can be easily fed into any steepest descent procedure. |

280 | Emerging Convolutions for Generative Normalizing Flows | Emiel Hoogeboom, Rianne Van Den Berg, Max Welling | We propose two methods to produce invertible convolutions, that have receptive fields identical to standard convolutions: Emerging convolutions are obtained by chaining specific autoregressive convolutions, and periodic convolutions are decoupled in the frequency domain. |

281 | Nonconvex Variance Reduced Optimization with Arbitrary Sampling | Samuel Horváth, Peter Richtarik | We provide the first importance sampling variants of variance reduced algorithms for empirical risk minimization with non-convex loss functions. |

282 | Parameter-Efficient Transfer Learning for NLP | Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly | As an alternative, we propose transfer with adapter modules. |

283 | Stay With Me: Lifetime Maximization Through Heteroscedastic Linear Bandits With Reneging | Ping-Chun Hsieh, Xi Liu, Anirban Bhattacharya, P R Kumar | To address the above issue, this paper proposes a model of heteroscedastic linear bandits with reneging, which allows each participant to have a distinct “satisfaction level,” with any interaction outcome falling short of that level resulting in that participant reneging. |

284 | Finding Mixed Nash Equilibria of Generative Adversarial Networks | Ya-Ping Hsieh, Chen Liu, Volkan Cevher | In this work, we tackle the training of GANs by rethinking the problem formulation from the mixed Nash Equilibria (NE) perspective. |

285 | Classification from Positive, Unlabeled and Biased Negative Data | Yu-Guan Hsieh, Gang Niu, Masashi Sugiyama | We provide a method based on empirical risk minimization to address this PUbN classification problem. |

286 | Bayesian Deconditional Kernel Mean Embeddings | Kelvin Hsu, Fabio Ramos | Critically, we introduce the notion of task transformed Gaussian processes and establish deconditional kernel mean embeddings as their posterior predictive mean. |

287 | Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization | Feihu Huang, Songcan Chen, Heng Huang | In this paper, we propose a faster stochastic alternating direction method of multipliers (ADMM) for nonconvex optimization by using a new stochastic path-integrated differential estimator (SPIDER), called SPIDER-ADMM. |

288 | Unsupervised Deep Learning by Neighbourhood Discovery | Jiabo Huang, Qi Dong, Shaogang Gong, Xiatian Zhu | In this work, we introduce a generic unsupervised deep learning approach to training deep models without the need for any manual label supervision. |

289 | Detecting Overlapping and Correlated Communities without Pure Nodes: Identifiability and Algorithm | Kejun Huang, Xiao Fu | We adopt the mixed-membership stochastic blockmodel as the underlying probabilistic model, and give conditions under which the memberships of a subset of nodes can be uniquely identified. |

290 | Hierarchical Importance Weighted Autoencoders | Chin-Wei Huang, Kris Sankaran, Eeshan Dhekane, Alexandre Lacoste, Aaron Courville | Theoretically, we analyze the condition under which convergence of the estimator variance can be connected to convergence of the lower bound. |

291 | Stable and Fair Classification | Lingxiao Huang, Nisheeth Vishnoi | We propose an extended framework based on fair classification algorithms that are formulated as optimization problems, by introducing a stability-focused regularization term. |

292 | Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment | Chen Huang, Shuangfei Zhai, Walter Talbott, Miguel Bautista Martin, Shih-Yu Sun, Carlos Guestrin, Josh Susskind | In this work we assess this assumption by meta-learning an adaptive loss function to directly optimize the evaluation metric. |

293 | Causal Discovery and Forecasting in Nonstationary Environments with State-Space Models | Biwei Huang, Kun Zhang, Mingming Gong, Clark Glymour | In this paper, we study causal discovery and forecasting for nonstationary time series. |

294 | Composing Entropic Policies using Divergence Correction | Jonathan Hunt, Andre Barreto, Timothy Lillicrap, Nicolas Heess | As part of this analysis, we extend an important generalization of policy improvement to the maximum entropy framework and introduce an algorithm for the practical implementation of successor features in continuous action spaces. |

295 | HexaGAN: Generative Adversarial Nets for Real World Classification | Uiwon Hwang, Dahuin Jung, Sungroh Yoon | In this paper, we propose HexaGAN, a generative adversarial network framework that shows promising classification performance for all three problems. |

296 | Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models | Alessandro Davide Ialongo, Mark Van Der Wilk, James Hensman, Carl Edward Rasmussen | We identify a new variational inference scheme for dynamical systems whose transition function is modelled by a Gaussian process. |

297 | Learning Structured Decision Problems with Unawareness | Craig Innes, Alex Lascarides | In this paper, we learn Bayesian Decision Networks from both domain exploration and expert assertions in a way which guarantees convergence to optimal behaviour, even when the agent starts unaware of actions or belief variables that are critical to success. |

298 | Phase transition in PCA with missing data: Reduced signal-to-noise ratio, not sample size! | Niels Ipsen, Lars Kai Hansen | Here we generalize this analysis to include missing data. |

299 | Actor-Attention-Critic for Multi-Agent Reinforcement Learning | Shariq Iqbal, Fei Sha | We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism which selects relevant information for each agent at every timestep. |

300 | Complementary-Label Learning for Arbitrary Losses and Models | Takashi Ishida, Gang Niu, Aditya Menon, Masashi Sugiyama | The goal of this paper is to derive a novel framework of complementary-label learning with an unbiased estimator of the classification risk, for arbitrary losses and models—all existing methods have failed to achieve this goal. |

301 | Causal Identification under Markov Equivalence: Completeness Results | Amin Jaber, Jiji Zhang, Elias Bareinboim | In this paper, we relax this requirement and consider that the knowledge is articulated in the form of an equivalence class of causal diagrams, in particular, a partial ancestral graph (PAG). |

302 | Learning from a Learner | Alexis Jacq, Matthieu Geist, Ana Paiva, Olivier Pietquin | In this paper, we propose a novel setting for Inverse Reinforcement Learning (IRL), namely “Learning from a Learner” (LfL). |

303 | Differentially Private Fair Learning | Matthew Jagielski, Michael Kearns, Jieming Mao, Alina Oprea, Aaron Roth, Saeed Sharifi-Malvajerdi, Jonathan Ullman | Motivated by settings in which predictive models may be required to be non-discriminatory with respect to certain attributes (such as race), but even collecting the sensitive attribute may be forbidden or restricted, we initiate the study of fair learning under the constraint of differential privacy. |

304 | Sum-of-Squares Polynomial Flow | Priyank Jaini, Kira A. Selby, Yaoliang Yu | Based on triangular maps, we propose a general framework for high-dimensional density estimation, by specifying one-dimensional transformations (equivalently conditional densities) and appropriate conditioner networks. |

305 | DBSCAN++: Towards fast and scalable density clustering | Jennifer Jang, Heinrich Jiang | We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a chosen subset of points. |

306 | Learning What and Where to Transfer | Yunhun Jang, Hankook Lee, Sung Ju Hwang, Jinwoo Shin | To address the issue, we propose a novel transfer learning approach based on meta-learning that can automatically learn what knowledge to transfer from the source network to where in the target network. |

307 | Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning | Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, Dj Strouse, Joel Z Leibo, Nando De Freitas | We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL), through rewarding agents for having causal influence over other agents’ actions. |

308 | A Deep Reinforcement Learning Perspective on Internet Congestion Control | Nathan Jay, Noga Rotman, Brighten Godfrey, Michael Schapira, Aviv Tamar | We present and investigate a novel and timely application domain for deep reinforcement learning (RL): Internet congestion control. |

309 | Graph Neural Network for Music Score Data and Modeling Expressive Piano Performance | Dasaem Jeong, Taegyun Kwon, Yoojin Kim, Juhan Nam | In this paper, we represent the unique form of musical score using graph neural network and apply it for rendering expressive piano performance from the music score. |

310 | Ladder Capsule Network | Taewon Jeong, Youngmin Lee, Heeyoung Kim | We propose a new architecture of the capsule network called the ladder capsule network, which has an alternative building block to the dynamic routing algorithm in the capsule network (Sabour et al., 2017). |

311 | Training CNNs with Selective Allocation of Channels | Jongheon Jeong, Jinwoo Shin | In this paper, we propose a simple way to improve the capacity of any CNN model having large-scale features, without adding more parameters. |

312 | Learning Discrete and Continuous Factors of Data via Alternating Disentanglement | Yeonwoo Jeong, Hyun Oh Song | We address the problem of unsupervised disentanglement of discrete and continuous explanatory factors of data. |

313 | Improved Zeroth-Order Variance Reduced Algorithms and Analysis for Nonconvex Optimization | Kaiyi Ji, Zhe Wang, Yi Zhou, Yingbin Liang | In this paper, we propose a new algorithm ZO-SVRG-Coord-Rand and develop a new analysis for an existing ZO-SVRG-Coord algorithm proposed in Liu et al. 2018b, and show that both ZO-SVRG-Coord-Rand and ZO-SVRG-Coord (under our new analysis) outperform other existing SVRG-type zeroth-order methods as well as ZO-GD and ZO-SGD. |

314 | Neural Logic Reinforcement Learning | Zhengyao Jiang, Shan Luo | To address these two challenges, we propose a novel algorithm named Neural Logic Reinforcement Learning (NLRL) to represent the policies in reinforcement learning by first-order logic. |

315 | Finding Options that Minimize Planning Time | Yuu Jinnai, David Abel, David Hershkowitz, Michael Littman, George Konidaris | We formalize the problem of selecting the optimal set of options for planning as that of computing the smallest set of options so that planning converges in less than a given maximum of value-iteration passes. |

316 | Discovering Options for Exploration by Minimizing Cover Time | Yuu Jinnai, Jee Won Park, David Abel, George Konidaris | We introduce a new option discovery algorithm that diminishes the expected cover time by connecting the most distant states in the state-space graph with options. |

317 | Kernel Mean Matching for Content Addressability of GANs | Wittawat Jitkrittum, Patsorn Sangkloy, Muhammad Waleed Gondal, Amit Raj, James Hays, Bernhard Schölkopf | We propose a novel procedure which adds “content-addressability” to any given unconditional implicit model, e.g., a generative adversarial network (GAN). |

318 | GOODE: A Gaussian Off-The-Shelf Ordinary Differential Equation Solver | David John, Vincent Heuveline, Michael Schober | Our method based on iterated Gaussian process (GP) regression returns a GP posterior over the solution of nonlinear ODEs, which provides a meaningful error estimation via its predictive posterior standard deviation. |

319 | Bilinear Bandits with Low-rank Structure | Kwang-Sung Jun, Rebecca Willett, Stephen Wright, Robert Nowak | We introduce the bilinear bandit problem with low-rank structure in which an action takes the form of a pair of arms from two different entity types, and the reward is a bilinear function of the known feature vectors of the arms. |

320 | Statistical Foundations of Virtual Democracy | Anson Kahng, Min Kyung Lee, Ritesh Noothigattu, Ariel Procaccia, Christos-Alexandros Psomas | One of the key questions is which aggregation method – or voting rule – to use; we offer a novel statistical viewpoint that provides guidance. |

321 | Molecular Hypergraph Grammar with Its Application to Molecular Optimization | Hiroshi Kajino | This paper presents a molecular hypergraph grammar variational autoencoder (MHG-VAE), which uses a single VAE to achieve 100% validity. |

322 | Robust Influence Maximization for Hyperparametric Models | Dimitris Kalimeris, Gal Kaplun, Yaron Singer | In this paper we study the problem of robust influence maximization in the independent cascade model under a hyperparametric assumption. |

323 | Classifying Treatment Responders Under Causal Effect Monotonicity | Nathan Kallus | In the context of individual-level causal inference, we study the problem of predicting whether someone will respond or not to a treatment based on their features and past examples of features, treatment indicator (e.g., drug/no drug), and a binary outcome (e.g., recovery from disease). |

324 | Trainable Decoding of Sets of Sequences for Neural Sequence Models | Ashwin Kalyan, Peter Anderson, Stefan Lee, Dhruv Batra | To address this, we propose $\nabla$BS, a trainable decoding procedure that outputs a set of sequences, highly valued according to the metric. |

325 | Myopic Posterior Sampling for Adaptive Goal Oriented Design of Experiments | Kirthevasan Kandasamy, Willie Neiswanger, Reed Zhang, Akshay Krishnamurthy, Jeff Schneider, Barnabas Poczos | In this work, we design a new myopic strategy for a wide class of adaptive design of experiment (DOE) problems, where we wish to collect data in order to fulfil a given goal. |

326 | Differentially Private Learning of Geometric Concepts | Haim Kaplan, Yishay Mansour, Yossi Matias, Uri Stemmer | We present differentially private efficient algorithms for learning union of polygons in the plane (which are not necessarily convex). |

327 | Policy Consolidation for Continual Reinforcement Learning | Christos Kaplanis, Murray Shanahan, Claudia Clopath | We propose a method for tackling catastrophic forgetting in deep reinforcement learning that is agnostic to the timescale of changes in the distribution of experiences, does not require knowledge of task boundaries and can adapt in continuously changing environments. |

328 | Error Feedback Fixes SignSGD and other Gradient Compression Schemes | Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, Martin Jaggi | We show simple convex counter-examples where signSGD does not converge to the optimum. |

329 | Riemannian adaptive stochastic gradient algorithms on matrix manifolds | Hiroyuki Kasai, Pratik Jawanpuria, Bamdev Mishra | We propose novel stochastic gradient algorithms for problems on Riemannian matrix manifolds by adapting the row and column subspaces of gradients. |

330 | Neural Inverse Knitting: From Images to Manufacturing Instructions | Alexandre Kaspar, Tae-Hyun Oh, Liane Makatura, Petr Kellnhofer, Wojciech Matusik | Motivated by the recent potential of mass customization brought by whole-garment knitting machines, we introduce the new problem of automatic machine instruction generation using a single image of the desired physical product, which we apply to machine knitting. We create a curated dataset of real samples with their instruction counterpart and propose to use synthetic images to augment it in a novel way. |

331 | Processing Megapixel Images with Deep Attention-Sampling Models | Angelos Katharopoulos, Francois Fleuret | To tackle this limitation, we propose a fully differentiable end-to-end trainable model that samples and processes only a fraction of the full resolution input image. |

332 | Robust Estimation of Tree Structured Gaussian Graphical Models | Ashish Katiyar, Jessica Hoffmann, Constantine Caramanis | Robust Estimation of Tree Structured Gaussian Graphical Models. |

333 | Shallow-Deep Networks: Understanding and Mitigating Network Overthinking | Yigitcan Kaya, Sanghyun Hong, Tudor Dumitras | For prediction transparency, we propose the Shallow-Deep Network (SDN), a generic modification to off-the-shelf DNNs that introduces internal classifiers. |

334 | Submodular Streaming in All Its Glory: Tight Approximation, Minimum Memory and Low Adaptive Complexity | Ehsan Kazemi, Marko Mitrovic, Morteza Zadimoghaddam, Silvio Lattanzi, Amin Karbasi | In this paper, we study the problem of maximizing a monotone submodular function in the streaming setting with a cardinality constraint $k$. |

335 | Adaptive Scale-Invariant Online Algorithms for Learning Linear Models | Michal Kempka, Wojciech Kotlowski, Manfred K. Warmuth | In this paper, we resolve the tuning problem by proposing online algorithms making predictions which are invariant under arbitrary rescaling of the features. |

336 | CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network | Tom Kenter, Vincent Wan, Chun-An Chan, Rob Clark, Jakub Vit | We present a new, hierarchically structured conditional variational auto-encoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. |

337 | Collaborative Evolutionary Reinforcement Learning | Shauharda Khadka, Somdeb Majumdar, Tarek Nassar, Zach Dwiel, Evren Tumer, Santiago Miret, Yinyin Liu, Kagan Tumer | In this paper, we introduce Collaborative Evolutionary Reinforcement Learning (CERL), a scalable framework that comprises a portfolio of policies that simultaneously explore and exploit diverse regions of the solution space. |

338 | Geometry Aware Convolutional Filters for Omnidirectional Images Representation | Renata Khasanova, Pascal Frossard | In this paper we aim at improving popular deep convolutional neural networks so that they can properly take into account the specific properties of omnidirectional data. |

339 | EMI: Exploration with Mutual Information | Hyoungseok Kim, Jaekyeom Kim, Yeonwoo Jeong, Sergey Levine, Hyun Oh Song | We propose EMI, which is an exploration method that constructs embedding representation of states and actions that does not rely on generative decoding of the full observation but extracts predictive signals that can be used to guide exploration based on forward prediction in the representation space. |

340 | FloWaveNet: A Generative Flow for Raw Audio | Sungwon Kim, Sang-Gil Lee, Jongyoon Song, Jaehyeon Kim, Sungroh Yoon | We propose FloWaveNet, a flow-based generative model for raw audio synthesis. |

341 | Curiosity-Bottleneck: Exploration By Distilling Task-Specific Novelty | Youngjin Kim, Wontae Nam, Hyunwoo Kim, Ji-Hoon Kim, Gunhee Kim | We introduce an information-theoretic exploration strategy named Curiosity-Bottleneck that distills task-relevant information from observation. |

342 | Contextual Multi-armed Bandit Algorithm for Semiparametric Reward Model | Gi-Soo Kim, Myunghee Cho Paik | This paper proposes a new contextual MAB algorithm for a relaxed, semiparametric reward model that supports nonstationarity. |

343 | Uniform Convergence Rate of the Kernel Density Estimator Adaptive to Intrinsic Volume Dimension | Jisu Kim, Jaehyeok Shin, Alessandro Rinaldo, Larry Wasserman | We derive concentration inequalities for the supremum norm of the difference between a kernel density estimator (KDE) and its point-wise expectation that hold uniformly over the selection of the bandwidth and under weaker conditions on the kernel and the data generating distribution than previously used in the literature. |

344 | Bit-Swap: Recursive Bits-Back Coding for Lossless Compression with Hierarchical Latent Variables | Friso Kingma, Pieter Abbeel, Jonathan Ho | In this paper we propose Bit-Swap, a new compression scheme that generalizes BB-ANS and achieves strictly better compression rates for hierarchical latent variable models with Markov chain structure. |

345 | CompILE: Compositional Imitation Learning and Execution | Thomas Kipf, Yujia Li, Hanjun Dai, Vinicius Zambaldi, Alvaro Sanchez-Gonzalez, Edward Grefenstette, Pushmeet Kohli, Peter Battaglia | We introduce Compositional Imitation Learning and Execution (CompILE): a framework for learning reusable, variable-length segments of hierarchically-structured behavior from demonstration data. |

346 | Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces | Johannes Kirschner, Mojmir Mutny, Nicole Hiller, Rasmus Ischebeck, Andreas Krause | In order to scale the method and keep its benefits, we propose an algorithm (LineBO) that restricts the problem to a sequence of iteratively chosen one-dimensional sub-problems that can be solved efficiently. |

347 | AUCμ: A Performance Metric for Multi-Class Machine Learning Models | Ross Kleiman, David Page | We provide in this work a multi-class extension of AUC that we call AUCμ that is derived from first principles of the binary class AUC. |

348 | Fair k-Center Clustering for Data Summarization | Matthäus Kleindessner, Pranjal Awasthi, Jamie Morgenstern | In this paper, we resolve this gap by providing a simple approximation algorithm for the $k$-center problem under the fairness constraint with running time linear in the size of the data set and $k$. |

349 | Guarantees for Spectral Clustering with Fairness Constraints | Matthäus Kleindessner, Samira Samadi, Pranjal Awasthi, Jamie Morgenstern | Given the widespread popularity of spectral clustering (SC) for partitioning graph data, we study a version of constrained SC in which we try to incorporate the fairness notion proposed by Chierichetti et al. (2017). |

350 | POPQORN: Quantifying Robustness of Recurrent Neural Networks | Ching-Yun Ko, Zhaoyang Lyu, Lily Weng, Luca Daniel, Ngai Wong, Dahua Lin | In this work, we propose POPQORN (Propagated-output Quantified Robustness for RNNs), a general algorithm to quantify robustness of RNNs, including vanilla RNNs, LSTMs, and GRUs. |

351 | Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication | Anastasia Koloskova, Sebastian Stich, Martin Jaggi | We present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time $O(1/(\rho^2\delta) \log(1/\epsilon))$ for accuracy $\epsilon > 0$. |

352 | Robust Learning from Untrusted Sources | Nikola Konstantinov, Christoph Lampert | In this work, we address the question of how to learn robustly in such scenarios. |

353 | Stochastic Beams and Where To Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement | Wouter Kool, Herke Van Hoof, Max Welling | We show how to implicitly apply this ’Gumbel-Top-$k$’ trick on a factorized distribution over sequences, allowing to draw exact samples without replacement using a Stochastic Beam Search. |

354 | LIT: Learned Intermediate Representation Training for Model Compression | Animesh Koratana, Daniel Kang, Peter Bailis, Matei Zaharia | In this work, we introduce Learned Intermediate representation Training (LIT), a novel model compression technique that outperforms a range of recent model compression techniques by leveraging the highly repetitive structure of modern DNNs (e.g., ResNet). |

355 | Similarity of Neural Network Representations Revisited | Simon Kornblith, Mohammad Norouzi, Honglak Lee, Geoffrey Hinton | We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. |

356 | On the Complexity of Approximating Wasserstein Barycenters | Alexey Kroshnin, Nazarii Tupitsa, Darina Dvinskikh, Pavel Dvurechensky, Alexander Gasnikov, Cesar Uribe | To overcome this issue, we propose a novel proximal-IBP algorithm, which can be seen as a proximal gradient method, which uses IBP on each iteration to make a proximal step. |

357 | Estimate Sequences for Variance-Reduced Stochastic Composite Optimization | Andrei Kulunchakov, Julien Mairal | In this paper, we propose a unified view of gradient-based algorithms for stochastic convex composite optimization by extending the concept of estimate sequence introduced by Nesterov. |

358 | Faster Algorithms for Binary Matrix Factorization | Ravi Kumar, Rina Panigrahy, Ali Rahimi, David Woodruff | We give faster approximation algorithms for well-studied variants of Binary Matrix Factorization (BMF), where we are given a binary $m \times n$ matrix $A$ and would like to find binary rank-$k$ matrices $U, V$ to minimize the Frobenius norm of $U \cdot V - A$. |

359 | Loss Landscapes of Regularized Linear Autoencoders | Daniel Kunin, Jonathan Bloom, Aleksandrina Goeva, Cotton Seed | In this paper, we prove that $L_2$-regularized LAEs are symmetric at all critical points and learn the principal directions as the left singular vectors of the decoder. |

360 | Geometry and Symmetry in Short-and-Sparse Deconvolution | Han-Wen Kuo, Yenson Lau, Yuqian Zhang, John Wright | We propose a method based on nonconvex optimization, which under certain conditions recovers the target short and sparse signals, up to a signed shift symmetry which is intrinsic to this model. |

361 | A Large-Scale Study on Regularization and Normalization in GANs | Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, Sylvain Gelly | In this work we take a sober view of the current state of GANs from a practical perspective. |

362 | Making Decisions that Reduce Discriminatory Impacts | Matt Kusner, Chris Russell, Joshua Loftus, Ricardo Silva | To address this, we describe causal methods that model the relevant parts of the real-world system in which the decisions are made. |

363 | Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits | Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Tor Lattimore, Mohammad Ghavamzadeh | We propose a bandit algorithm that explores by randomizing its history of rewards. |

364 | Characterizing Well-Behaved vs. Pathological Deep Neural Networks | Antoine Labatie | We introduce a novel approach, requiring only mild assumptions, for the characterization of deep neural networks at initialization. |

365 | State-Reification Networks: Improving Generalization by Modeling the Distribution of Hidden Representations | Alex Lamb, Jonathan Binas, Anirudh Goyal, Sandeep Subramanian, Ioannis Mitliagkas, Yoshua Bengio, Michael Mozer | We introduce a method, which we refer to as _state reification_, that involves modeling the distribution of hidden states over the training data and then projecting hidden states observed during testing toward this distribution. |

366 | A Recurrent Neural Cascade-based Model for Continuous-Time Diffusion | Sylvain Lamprier | In this paper we propose a model at the crossroads of these two extremes, which embeds the history of diffusion in infected nodes as hidden continuous states. |

367 | Projection onto Minkowski Sums with Application to Constrained Learning | Kenneth Lange, Joong-Ho Won, Jason Xu | We introduce block descent algorithms for projecting onto Minkowski sums of sets. |

368 | Safe Policy Improvement with Baseline Bootstrapping | Romain Laroche, Paul Trichelair, Remi Tachet Des Combes | This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement Learning (Batch RL): from a fixed dataset and without direct access to the true environment, train a policy that is guaranteed to perform at least as well as the baseline policy used to collect the data. |

369 | A Better k-means++ Algorithm via Local Search | Silvio Lattanzi, Christian Sohler | In this paper, we develop a new variant of k-means++ seeding that in expectation achieves a constant approximation guarantee. |

370 | Lorentzian Distance Learning for Hyperbolic Representations | Marc Law, Renjie Liao, Jake Snell, Richard Zemel | We introduce an approach to learn representations based on the Lorentzian distance in hyperbolic geometry. |

371 | DP-GP-LVM: A Bayesian Non-Parametric Model for Learning Multivariate Dependency Structures | Andrew Lawrence, Carl Henrik Ek, Neill Campbell | We present a non-parametric Bayesian latent variable model capable of learning dependency structures across dimensions in a multivariate setting. |

372 | POLITEX: Regret Bounds for Policy Iteration using Expert Prediction | Nevena Lazic, Yasin Abbasi-Yadkori, Kush Bhatia, Gellert Weisz, Peter Bartlett, Csaba Szepesvari | We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration where each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies, and analyze its regret in continuing RL problems. |

373 | Batch Policy Learning under Constraints | Hoang Le, Cameron Voloshin, Yisong Yue | As part of off-policy learning, we propose a simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. |

374 | Target-Based Temporal-Difference Learning | Donghwan Lee, Niao He | In this work, we introduce a new family of target-based temporal difference (TD) learning algorithms that maintain two separate learning parameters: the target variable and the online variable. |

375 | Functional Transparency for Structured Data: a Game-Theoretic Approach | Guang-He Lee, Wengong Jin, David Alvarez-Melis, Tommi Jaakkola | We provide a new approach to training neural models to exhibit transparency in a well-defined, functional manner. |

376 | Self-Attention Graph Pooling | Junhyun Lee, Inyeop Lee, Jaewoo Kang | In this paper, we propose a graph pooling method based on self-attention. |

377 | Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks | Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, Yee Whye Teh | We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. |

378 | First-Order Algorithms Converge Faster than $O(1/k)$ on Convex Problems | Ching-Pei Lee, Stephen Wright | In this work, we improve this rate to $o(1/k)$. |

379 | Robust Inference via Generative Classifiers for Handling Noisy Labels | Kimin Lee, Sukmin Yun, Kibok Lee, Honglak Lee, Bo Li, Jinwoo Shin | To mitigate the issue, we propose a novel inference method, termed Robust Generative classifier (RoG), applicable to any discriminative (e.g., softmax) neural classifier pre-trained on noisy datasets. |

380 | Sublinear Time Nearest Neighbor Search over Generalized Weighted Space | Yifan Lei, Qiang Huang, Mohan Kankanhalli, Anthony Tung | Based on the idea of Asymmetric Locality-Sensitive Hashing (ALSH), we introduce a novel spherical asymmetric transformation and propose the first two novel weight-oblivious hashing schemes SL-ALSH and S2-ALSH accordingly. |

381 | MONK Outlier-Robust Mean Embedding Estimation by Median-of-Means | Matthieu Lerasle, Zoltan Szabo, Timothée Mathieu, Guillaume Lecue | In this paper, we show how the recently emerged principle of median-of-means can be used to design estimators for kernel mean embedding and MMD with excessive resistance properties to outliers, and optimal sub-Gaussian deviation bounds under mild assumptions. |

382 | Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group | Mario Lezcano-Casado, David Martínez-Rubio | We introduce a novel approach to perform first-order optimization with orthogonal and unitary constraints. |

383 | Are Generative Classifiers More Robust to Adversarial Attacks? | Yingzhen Li, John Bradshaw, Yash Sharma | In this paper, we propose and investigate the deep Bayes classifier, which improves classical naive Bayes with conditional deep generative models. |

384 | Sublinear quantum algorithms for training linear and kernel-based classifiers | Tongyang Li, Shouvanik Chakrabarti, Xiaodi Wu | We investigate quantum algorithms for classification, a fundamental problem in machine learning, with provable guarantees. |

385 | LGM-Net: Learning to Generate Matching Networks for Few-Shot Learning | Huaiyu Li, Weiming Dong, Xing Mei, Chongyang Ma, Feiyue Huang, Bao-Gang Hu | In this work, we propose a novel meta-learning approach for few-shot classification, which learns transferable prior knowledge across tasks and directly produces network parameters for similar unseen tasks with training samples. |

386 | Graph Matching Networks for Learning the Similarity of Graph Structured Objects | Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, Pushmeet Kohli | This paper addresses the challenging problem of retrieval and matching of graph structured objects, and makes two key contributions. |

387 | Area Attention | Yang Li, Lukasz Kaiser, Samy Bengio, Si Si | We propose area attention: a way to attend to areas in the memory, where each area contains a group of items that are structurally adjacent, e.g., spatially for a 2D memory such as images, or temporally for a 1D memory such as natural language sentences. |

388 | Online Learning to Rank with Features | Shuai Li, Tor Lattimore, Csaba Szepesvari | We introduce a new model for online ranking in which the click probability factors into an examination and attractiveness function and the attractiveness function is a linear function of a feature vector and an unknown parameter. |

389 | NATTACK: Learning the Distributions of Adversarial Examples for an Improved Black-Box Attack on Deep Neural Networks | Yandong Li, Lijun Li, Liqiang Wang, Tong Zhang, Boqing Gong | In this paper, we propose a black-box adversarial attack algorithm that can defeat both vanilla DNNs and those generated by various defense techniques developed recently. |

390 | Bayesian Joint Spike-and-Slab Graphical Lasso | Zehang Li, Tyler McCormick, Samuel Clark | In this article, we propose a new class of priors for Bayesian inference with multiple Gaussian graphical models. |

391 | Exploiting Worker Correlation for Label Aggregation in Crowdsourcing | Yuan Li, Benjamin Rubinstein, Trevor Cohn | In this paper, we argue that existing crowdsourcing approaches do not sufficiently model worker correlations observed in practical settings; we propose in response an enhanced Bayesian classifier combination (EBCC) model, with inference based on a mean-field variational approach. |

392 | Adversarial camera stickers: A physical camera-based attack on deep learning systems | Juncheng Li, Frank Schmidt, Zico Kolter | In this work, we consider an alternative question: is it possible to fool deep classifiers, over all perceived objects of a certain type, by physically manipulating the camera itself? |

393 | Towards a Unified Analysis of Random Fourier Features | Zhu Li, Jean-Francois Ton, Dino Oglic, Dino Sejdinovic | We study both the standard random Fourier features method for which we improve the existing bounds on the number of features required to guarantee the corresponding minimax risk convergence rate of kernel ridge regression, as well as a data-dependent modification which samples features proportional to ridge leverage scores and further reduces the required number of features. |

394 | Feature-Critic Networks for Heterogeneous Domain Generalization | Yiying Li, Yongxin Yang, Wei Zhou, Timothy Hospedales | In this work, we propose a learning to learn approach, where the auxiliary loss that helps generalisation is itself learned. |

395 | Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting | Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, Caiming Xiong | This paper presents a conceptually simple yet general and effective framework for handling catastrophic forgetting in continual learning with DNNs. |

396 | Alternating Minimizations Converge to Second-Order Optimal Solutions | Qiuwei Li, Zhihui Zhu, Gongguo Tang | We show that under mild assumptions on the (nonconvex) objective function, both algorithms avoid strict saddles almost surely from random initialization. |

397 | Cautious Regret Minimization: Online Optimization with Long-Term Budget Constraints | Nikolaos Liakopoulos, Apostolos Destounis, Georgios Paschos, Thrasyvoulos Spyropoulos, Panayotis Mertikopoulos | We study a class of online convex optimization problems with long-term budget constraints that arise naturally as reliability guarantees or total consumption constraints. |

398 | Regularization in directable environments with application to Tetris | Jan Lichtenberg, Simsek | We present a regularized linear model called STEW that benefits from a generic and prevalent form of prior knowledge: feature directions. |

399 | Inference and Sampling of $K_{33}$-free Ising Models | Valerii Likhosherstov, Yury Maximov, Misha Chertkov | Inference and Sampling of $K_{33}$-free Ising Models. |

400 | Kernel-Based Reinforcement Learning in Robust Markov Decision Processes | Shiau Hong Lim, Arnaud Autef | We extend these results to the much larger class of kernel-based approximators and show, both analytically and empirically that the robust policies can significantly outperform the non-robust counterpart. |

401 | On Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms | Tianyi Lin, Nhat Ho, Michael Jordan | We provide theoretical analyses for two algorithms that solve the regularized optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. |

402 | Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family Approximations | Wu Lin, Mohammad Emtiyaz Khan, Mark Schmidt | In this paper, we extend their application to estimate structured approximations such as mixtures of EF distributions. |

403 | Acceleration of SVRG and Katyusha X by Inexact Preconditioning | Yanli Liu, Fei Feng, Wotao Yin | In this paper, we propose to accelerate these two algorithms by inexact preconditioning. The proposed methods employ fixed preconditioners; although the subproblem in each epoch becomes harder, it suffices to apply a fixed number of simple subroutines to solve it inexactly, without losing the overall convergence. |

404 | Transferable Adversarial Training: A General Approach to Adapting Deep Classifiers | Hong Liu, Mingsheng Long, Jianmin Wang, Michael Jordan | To this end, we propose Transferable Adversarial Training (TAT) to enable the adaptation of deep classifiers. |

405 | Rao-Blackwellized Stochastic Gradients for Discrete Distributions | Runjing Liu, Jeffrey Regier, Nilesh Tripuraneni, Michael Jordan, Jon McAuliffe | In this paper, we describe a technique that can be applied to reduce the variance of any such estimator without changing its bias; in particular, unbiasedness is retained. |

406 | Sparse Extreme Multi-label Learning with Oracle Property | Weiwei Liu, Xiaobo Shen | To fill this gap, we present a unified framework for SLEEC with nonconvex penalty. |

407 | Data Poisoning Attacks on Stochastic Bandits | Fang Liu, Ness Shroff | In this paper, we propose a framework of offline attacks on bandit algorithms and study convex optimization based attacks on several popular bandit algorithms. |

408 | The Implicit Fairness Criterion of Unconstrained Learning | Lydia T. Liu, Max Simchowitz, Moritz Hardt | We clarify what fairness guarantees we can and cannot expect to follow from unconstrained machine learning. |

409 | Taming MAML: Efficient unbiased meta-reinforcement learning | Hao Liu, Richard Socher, Caiming Xiong | We propose a surrogate objective function named Taming MAML (TMAML) that adds control variates into gradient estimation via automatic differentiation. |

410 | On Certifying Non-Uniform Bounds against Adversarial Attacks | Chen Liu, Ryota Tomioka, Volkan Cevher | We formulate our target as an optimization problem with nonlinear constraints. |

411 | Understanding and Accelerating Particle-Based Variational Inference | Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu | We propose an acceleration framework and a principled bandwidth-selection method for general ParVIs; these are based on the developed theory and leverage the geometry of the Wasserstein space. |

412 | Understanding MCMC Dynamics as Flows on the Wasserstein Space | Chang Liu, Jingwei Zhuo, Jun Zhu | In this work, by developing novel concepts, we propose a theoretical framework that recognizes a general MCMC dynamics as the fiber-gradient Hamiltonian flow on the Wasserstein space of a fiber-Riemannian Poisson manifold. |

413 | Sliced-Wasserstein Flows: Nonparametric Generative Modeling via Optimal Transport and Diffusions | Antoine Liutkus, Umut Simsekli, Szymon Majewski, Alain Durmus, Fabian-Robert Stöter | By building upon the recent theory that established the connection between implicit generative modeling (IGM) and optimal transport, in this study, we propose a novel parameter-free algorithm for learning the underlying distributions of complicated datasets and sampling from them. |

414 | Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations | Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem | In this paper, we provide a sober look at recent progress in the field and challenge some common assumptions. |

415 | Bayesian Counterfactual Risk Minimization | Ben London, Ted Sandler | We present a Bayesian view of counterfactual risk minimization (CRM) for offline learning from logged bandit feedback. |

416 | PA-GD: On the Convergence of Perturbed Alternating Gradient Descent to Second-Order Stationary Points for Structured Nonconvex Optimization | Songtao Lu, Mingyi Hong, Zhengdao Wang | In this paper, we consider a smooth unconstrained nonconvex optimization problem, and propose a perturbed A-GD (PA-GD) which is able to converge (with high probability) to the second-order stationary points (SOSPs) with a global sublinear rate. |

417 | Neurally-Guided Structure Inference | Sidi Lu, Jiayuan Mao, Joshua Tenenbaum, Jiajun Wu | In this paper, we propose a hybrid inference algorithm, the Neurally-Guided Structure Inference (NG-SI), keeping the advantages of both search-based and data-driven methods. |

418 | Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards | Shiyin Lu, Guanghui Wang, Yao Hu, Lijun Zhang | To address this limitation, in this paper we relax the assumption on rewards to allow arbitrary distributions that have finite $(1+\epsilon)$-th moments for some $\epsilon \in (0, 1]$, and propose algorithms that enjoy a sublinear regret of $\widetilde{O}(T^{(d_z\epsilon + 1)/(d_z \epsilon + \epsilon + 1)})$ where $T$ is the time horizon and $d_z$ is the zooming dimension. |

419 | CoT: Cooperative Training for Generative Modeling of Discrete Data | Sidi Lu, Lantao Yu, Siyuan Feng, Yaoming Zhu, Weinan Zhang | In this paper, we study the generative models of sequential discrete data. |

420 | Generalized Approximate Survey Propagation for High-Dimensional Estimation | Carlo Lucibello, Luca Saglietti, Yue Lu | In this paper, we propose a new algorithm, named Generalized Approximate Survey Propagation (GASP), for solving GLE in the presence of prior or model misspecifications. Furthermore, we present a set of state evolution equations that can precisely characterize the performance of GASP in the high-dimensional limit. |

421 | High-Fidelity Image Generation With Fewer Labels | Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, Sylvain Gelly | In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art on both unsupervised ImageNet synthesis, as well as in the conditional setting. |

422 | Leveraging Low-Rank Relations Between Surrogate Tasks in Structured Prediction | Giulia Luise, Dimitrios Stamos, Massimiliano Pontil, Carlo Ciliberto | We propose an efficient algorithm based on trace norm regularization which, differently from previous methods, does not require explicit knowledge of the coding/decoding functions of the surrogate framework. |

423 | Differentiable Dynamic Normalization for Learning Deep Representation | Ping Luo, Peng Zhanglin, Shao Wenqi, Zhang Ruimao, Ren Jiamin, Wu Lingyun | This work presents Dynamic Normalization (DN), which is able to learn arbitrary normalization operations for different convolutional layers in a deep ConvNet. |

424 | Disentangled Graph Convolutional Networks | Jianxin Ma, Peng Cui, Kun Kuang, Xin Wang, Wenwu Zhu | In this paper, we introduce the disentangled graph convolutional network (DisenGCN) to learn disentangled node representations. |

425 | Variational Implicit Processes | Chao Ma, Yingzhen Li, Jose Miguel Hernandez-Lobato | We introduce the implicit processes (IPs), a stochastic process that places implicitly defined multivariate distributions over any finite collections of random variables. |

426 | EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE | Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez-Lobato, Sebastian Nowozin, Cheng Zhang | To this end, we propose a principled framework, named EDDI (Efficient Dynamic Discovery of high-value Information), based on the theory of Bayesian experimental design. |

427 | Bayesian leave-one-out cross-validation for large data | Måns Magnusson, Michael Andersen, Johan Jonasson, Aki Vehtari | We propose a combination of using approximate inference techniques and probability-proportional-to-size-sampling (PPS) for fast LOO model evaluation for large datasets. |

428 | Composable Core-sets for Determinant Maximization: A Simple Near-Optimal Algorithm | Sepideh Mahabadi, Piotr Indyk, Shayan Oveis Gharan, Alireza Rezaei | In this work, we consider efficient construction of composable core-sets for the determinant maximization problem. |

429 | Guided evolutionary strategies: augmenting random search with surrogate gradients | Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, Jascha Sohl-Dickstein | We propose Guided Evolutionary Strategies (GES), a method for optimally using surrogate gradient directions to accelerate random search. |

430 | Data Poisoning Attacks in Multi-Party Learning | Saeed Mahloujifar, Mohammad Mahmoody, Ameer Mohammed | In this work, we demonstrate universal multi-party poisoning attacks that adapt and apply to any multi-party learning process with arbitrary interaction pattern between the parties. |

431 | Traditional and Heavy Tailed Self Regularization in Neural Network Models | Michael Mahoney, Charles Martin | Building on recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of Implicit Self-Regularization. |

432 | Curvature-Exploiting Acceleration of Elastic Net Computations | Vien Mai, Mikael Johansson | This paper introduces an efficient second-order method for solving the elastic net problem. |

433 | Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms | Ashok Makkuva, Pramod Viswanath, Sreeram Kannan, Sewoong Oh | In this paper, we introduce the first algorithm that learns the true parameters of a MoE model for a wide class of non-linearities with global consistency guarantees. |

434 | Calibrated Model-Based Deep Reinforcement Learning | Ali Malik, Volodymyr Kuleshov, Jiaming Song, Danny Nemer, Harlan Seymour, Stefano Ermon | We describe a simple way to augment any model-based reinforcement learning agent with a calibrated model and show that doing so consistently improves planning, sample complexity, and exploration. |

435 | Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems | Timothy Arthur Mann, Sven Gowal, Andras Gyorgy, Huiyi Hu, Ray Jiang, Balaji Lakshminarayanan, Prav Srinivasan | Motivated by our regret analysis, we propose two neural network architectures: Factored Forecaster (FF) which is ideal if the proxy is informative of the outcome in hindsight, and Residual Factored Forecaster (RFF) that is robust to a non-informative proxy. |

436 | Passed & Spurious: Descent Algorithms and Local Minima in Spiked Matrix-Tensor Models | Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborova | In this work we analyse quantitatively the interplay between the loss landscape and performance of descent algorithms in a prototypical inference problem, the spiked matrix-tensor model. |

437 | A Baseline for Any Order Gradient Estimation in Stochastic Computation Graphs | Jingkai Mao, Jakob Foerster, Tim Rocktäschel, Maruan Al-Shedivat, Gregory Farquhar, Shimon Whiteson | To improve the sample efficiency of DiCE, we propose a new baseline term for higher order gradient estimation. |

438 | Adversarial Generation of Time-Frequency Features with application in audio synthesis | Andrés Marafioti, Nathanaël Perraudin, Nicki Holighaus, Piotr Majdak | In this article, focusing on the short-time Fourier transform, we discuss the challenges that arise in audio synthesis based on generated invertible TF features and how to overcome them. |

439 | On the Universality of Invariant Networks | Haggai Maron, Ethan Fetaya, Nimrod Segol, Yaron Lipman | In this paper, we consider a fundamental question that has received very little attention to date: Can these networks approximate any (continuous) invariant function? |

440 | Decomposing feature-level variation with Covariate Gaussian Process Latent Variable Models | Kaspar Martens, Kieran Campbell, Christopher Yau | In this paper, we propose to achieve this through a structured kernel decomposition in a hybrid Gaussian Process model which we call the Covariate Gaussian Process Latent Variable Model (c-GPLVM). |

441 | Fairness-Aware Learning for Continuous Attributes and Treatments | Jeremie Mary, Clément Calauzènes, Noureddine El Karoui | As common fairness metrics can be expressed as measures of (conditional) independence between variables, we propose to use the Rényi maximum correlation coefficient to generalize fairness measurement to continuous variables. |

442 | Optimal Minimal Margin Maximization with Boosting | Alexander Mathiasen, Kasper Green Larsen, Allan Grønlund | Our main contribution is a new algorithm refuting this conjecture. |

443 | Disentangling Disentanglement in Variational Autoencoders | Emile Mathieu, Tom Rainforth, N Siddharth, Yee Whye Teh | We develop a generalisation of disentanglement in variational autoencoders (VAEs)—decomposition of the latent representation—characterising it as the fulfilment of two factors: a) the latent encodings of the data having an appropriate level of overlap, and b) the aggregate encoding of the data conforming to a desired structure, represented through the prior. |

444 | MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets | Pierre-Alexandre Mattei, Jes Frellsen | We consider the problem of handling missing data with deep latent variable models (DLVMs). |

445 | Distributional Reinforcement Learning for Efficient Exploration | Borislav Mavrin, Hengshuai Yao, Linglong Kong, Kaiwen Wu, Yaoliang Yu | We propose a novel and efficient exploration method for deep RL that has two components. |

446 | Graphical-model based estimation and inference for differential privacy | Ryan Mckenna, Daniel Sheldon, Gerome Miklau | In this work, we provide an approach to solve this estimation problem efficiently using graphical models, which is particularly effective when the distribution is high-dimensional but the measurements are over low-dimensional marginals. |

447 | Efficient Amortised Bayesian Inference for Hierarchical and Nonlinear Dynamical Systems | Ted Meeds, Geoffrey Roeder, Paul Grant, Andrew Phillips, Neil Dalchau | We introduce a flexible, scalable Bayesian inference framework for nonlinear dynamical systems characterised by distinct and hierarchical variability at the individual, group, and population levels. |

448 | Toward Controlling Discrimination in Online Ad Auctions | Anay Mehrotra, Elisa Celis, Nisheeth Vishnoi | To prevent this, we propose a constrained ad auction framework that maximizes the platform’s revenue conditioned on ensuring that the audience seeing an advertiser’s ad is distributed appropriately across sensitive types such as gender or race. |

449 | Stochastic Blockmodels meet Graph Neural Networks | Nikhil Mehta, Lawrence Carin, Piyush Rai | In this work, we unify these two directions by developing a sparse variational autoencoder for graphs that retains the interpretability of SBMs while also enjoying the excellent predictive performance of graph neural nets. |

450 | Imputing Missing Events in Continuous-Time Event Streams | Hongyuan Mei, Guanghui Qin, Jason Eisner | Given a probability model of complete sequences, we propose particle smoothing—a form of sequential importance sampling—to impute the missing events in an incomplete sequence. |

451 | Same, Same But Different: Recovering Neural Network Quantization Error Through Weight Factorization | Eldad Meller, Alexander Finkelstein, Uri Almog, Mark Grobman | In this paper, we exploit an oft-overlooked degree of freedom in most networks – for a given layer, individual output channels can be scaled by any factor provided that the corresponding weights of the next layer are inversely scaled. |

452 | The Wasserstein Transform | Facundo Memoli, Zane Smith, Zhengchao Wan | We introduce the Wasserstein transform, a method for enhancing and denoising datasets defined on general metric spaces. |

453 | Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks | Charith Mendis, Alex Renda, Saman Amarasinghe, Michael Carbin | In this paper we present Ithemal, the first tool which learns to predict the throughput of a set of instructions. |

454 | Geometric Losses for Distributional Learning | Arthur Mensch, Mathieu Blondel, Gabriel Peyré | Building upon recent advances in entropy-regularized optimal transport, and upon Fenchel duality between measures and continuous functions, we propose a generalization of the logistic loss that incorporates a metric or cost between classes. |

455 | Spectral Clustering of Signed Graphs via Matrix Power Means | Pedro Mercado, Francesco Tudisco, Matthias Hein | We provide a thorough analysis of the proposed approach in the setting of a general Stochastic Block Model that includes models such as the Labeled Stochastic Block Model and the Censored Block Model. |

456 | Simple Stochastic Gradient Methods for Non-Smooth Non-Convex Regularized Optimization | Michael Metel, Akiko Takeda | We present two simple stochastic gradient algorithms, for finite-sum and general stochastic optimization problems, which have superior convergence complexities compared to the current state-of-the-art. |

457 | Reinforcement Learning in Configurable Continuous Environments | Alberto Maria Metelli, Emanuele Ghelfi, Marcello Restelli | In this paper, we fill this gap by proposing a trust-region method, Relative Entropy Model Policy Search (REMPS), able to learn both the policy and the MDP configuration in continuous domains without requiring the knowledge of the true model of the environment. |

458 | Understanding and correcting pathologies in the training of learned optimizers | Luke Metz, Niru Maheswaranathan, Jeremy Nixon, Daniel Freeman, Jascha Sohl-Dickstein | In this work we propose a training scheme which overcomes both of these difficulties, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance. |

459 | Optimality Implies Kernel Sum Classifiers are Statistically Efficient | Raphael Meyer, Jean Honorio | We propose a novel combination of optimization tools with learning theory bounds in order to analyze the sample complexity of optimal kernel sum classifiers. |

460 | On Dropout and Nuclear Norm Regularization | Poorya Mianjy, Raman Arora | We give a formal and complete characterization of the explicit regularizer induced by dropout in deep linear networks with squared loss. |

461 | Discriminative Regularization for Latent Variable Models with Applications to Electrocardiography | Andrew Miller, Ziad Obermeyer, John Cunningham, Sendhil Mullainathan | We propose a generative model training objective that uses a black-box discriminative model as a regularizer to learn representations that preserve this predictive variation. |

462 | Formal Privacy for Functional Data with Gaussian Perturbations | Ardalan Mirshani, Matthew Reimherr, Aleksandra Slavkovic | Motivated by the rapid rise in statistical tools in Functional Data Analysis, we consider the Gaussian mechanism for achieving differential privacy (DP) with parameter estimates taking values in a, potentially infinite-dimensional, separable Banach space. |

463 | Co-manifold learning with missing data | Gal Mishne, Eric Chi, Ronald Coifman | We propose utilizing this coupled structure to perform co-manifold learning: uncovering the underlying geometry of both the rows and the columns of a given matrix, where we focus on a missing data setting. |

464 | Agnostic Federated Learning | Mehryar Mohri, Gary Sivek, Ananda Theertha Suresh | Instead, we propose a new framework of agnostic federated learning, where the centralized model is optimized for any target distribution formed by a mixture of the client distributions. |

465 | Flat Metric Minimization with Applications in Generative Modeling | Thomas Möllenhoff, Daniel Cremers | In our theoretical contribution we prove that the flat metric between a parametrized current and a reference current is Lipschitz continuous in the parameters. |

466 | Parsimonious Black-Box Adversarial Attacks via Efficient Combinatorial Optimization | Seungyong Moon, Gaon An, Hyun Oh Song | We propose an efficient discrete surrogate to the optimization problem which does not require estimating the gradient and consequently becomes free of the first order update hyperparameters to tune. |

467 | Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization | Hesham Mostafa, Xin Wang | Here we present a novel dynamic sparse reparameterization method that addresses the limitations of previous techniques such as high computational cost and the need for manual configuration of the number of free parameters allocated to each layer. |

468 | A Dynamical Systems Perspective on Nesterov Acceleration | Michael Muehlebach, Michael Jordan | We present a dynamical system framework for understanding Nesterov’s accelerated gradient method. |

469 | Relational Pooling for Graph Representations | Ryan Murphy, Balasubramaniam Srinivasan, Vinayak Rao, Bruno Ribeiro | This work generalizes graph neural networks (GNNs) beyond those based on the Weisfeiler-Lehman (WL) algorithm, graph Laplacians, and diffusions. |

470 | Learning Optimal Fair Policies | Razieh Nabi, Daniel Malinsky, Ilya Shpitser | In this paper, we consider how to make optimal but fair decisions, which “break the cycle of injustice” by correcting for the unfair dependence of both decisions and outcomes on sensitive features (e.g., variables that correspond to gender, race, disability, or other protected attributes). |

471 | Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models | Mor Shpigel Nacson, Suriya Gunasekar, Jason Lee, Nathan Srebro, Daniel Soudry | For non-homogeneous ensemble models, whose output is a sum of homogeneous sub-models, we show that this solution discards the shallowest sub-models if they are unnecessary. |

472 | A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning | Yoshihiro Nagano, Shoichiro Yamaguchi, Yasuhiro Fujita, Masanori Koyama | In this paper, we present a novel hyperbolic distribution called hyperbolic wrapped distribution, a wrapped normal distribution on hyperbolic space whose density can be evaluated analytically and differentiated with respect to the parameters. |

473 | SGD without Replacement: Sharper Rates for General Smooth Convex Functions | Dheeraj Nagaraj, Prateek Jain, Praneeth Netrapalli | We study stochastic gradient descent without replacement (SGDo) for smooth convex functions. |

474 | Dropout as a Structured Shrinkage Prior | Eric Nalisnick, Jose Miguel Hernandez-Lobato, Padhraic Smyth | We propose a novel framework for understanding multiplicative noise in neural networks, considering continuous distributions as well as Bernoulli noise (i.e. dropout). |

475 | Hybrid Models with Deep and Invertible Features | Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, Balaji Lakshminarayanan | We propose a neural hybrid model consisting of a linear model defined on a set of features computed by a deep, invertible transformation (i.e. a normalizing flow). |

476 | Learning Context-dependent Label Permutations for Multi-label Classification | Jinseok Nam, Young-Bum Kim, Eneldo Loza Mencia, Sunghyun Park, Ruhi Sarikaya, Johannes Fürnkranz | In this work, we propose a multi-label classification approach which allows to choose a dynamic, context-dependent label ordering. |

477 | Zero-Shot Knowledge Distillation in Deep Networks | Gaurav Kumar Nayak, Konda Reddy Mopuri, Vaisakh Shaj, Venkatesh Babu Radhakrishnan, Anirban Chakraborty | Hence, in this paper, we propose a novel data-free method to train the Student from the Teacher. |

478 | A Framework for Bayesian Optimization in Embedded Subspaces | Amin Nayebi, Alexander Munteanu, Matthias Poloczek | We present a theoretically founded approach for high-dimensional Bayesian optimization based on low-dimensional subspace embeddings. |

479 | Phaseless PCA: Low-Rank Matrix Recovery from Column-wise Phaseless Measurements | Seyedehsara Nayer, Praneeth Narayanamurthy, Namrata Vaswani | We introduce a simple algorithm that is provably correct as long as the subspace changes are piecewise constant. This work proposes the first set of simple, practically useful, and provable algorithms for two inter-related problems. |

480 | Safe Grid Search with Optimal Complexity | Eugene Ndiaye, Tam Le, Olivier Fercoq, Joseph Salmon, Ichiro Takeuchi | In this paper, we revisit the techniques of approximating the regularization path up to predefined tolerance $\epsilon$ in a unified framework and show that its complexity is $O(1/\sqrt[d]{\epsilon})$ for uniformly convex loss of order $d \geq 2$ and $O(1/\sqrt{\epsilon})$ for Generalized Self-Concordant functions. |

481 | Learning to bid in revenue-maximizing auctions | Thomas Nedelec, Noureddine El Karoui, Vianney Perchet | Using a variational approach, we study the complexity of the original objective and we introduce a relaxation of the objective functional in order to use gradient descent methods. |

482 | On Connected Sublevel Sets in Deep Learning | Quynh Nguyen | This paper shows that every sublevel set of the loss function of a class of deep over-parameterized neural nets with piecewise linear activation functions is connected and unbounded. |

483 | Anomaly Detection With Multiple-Hypotheses Predictions | Duc Tam Nguyen, Zhongyu Lou, Michael Klar, Thomas Brox | We propose to learn the data distribution of the foreground more efficiently with a multi-hypotheses autoencoder. |

484 | Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization | Than Huy Nguyen, Umut Simsekli, Gael Richard | In this study, we analyze the non-asymptotic behavior of FLMC for non-convex optimization and prove finite-time bounds for its expected suboptimality. |

485 | Rotation Invariant Householder Parameterization for Bayesian PCA | Rajbir Nirwan, Nils Bertschinger | Here, we propose a parameterization based on Householder transformations, which remove the rotational symmetry of the posterior. |

486 | Lossless or Quantized Boosting with Integer Arithmetic | Richard Nock, Robert Williamson | We build a learning algorithm which is able, under mild assumptions, to achieve a lossless boosting-compliant training. |

487 | Training Neural Networks with Local Error Signals | Arild Nøkland, Lars Hiller Eidnes | In this paper we demonstrate, for the first time, that layer-wise training can approach the state-of-the-art on a variety of image datasets. |

488 | Remember and Forget for Experience Replay | Guido Novati, Petros Koumoutsakos | We introduce Remember and Forget Experience Replay (ReF-ER), a novel method that can enhance RL algorithms with parameterized policies. |

489 | Learning to Infer Program Sketches | Maxwell Nye, Luke Hewitt, Joshua Tenenbaum, Armando Solar-Lezama | The key idea of this work is that a flexible combination of pattern recognition and explicit reasoning can be used to solve these complex programming problems. |

490 | Tensor Variable Elimination for Plated Factor Graphs | Fritz Obermeyer, Eli Bingham, Martin Jankowiak, Neeraj Pradhan, Justin Chiu, Alexander Rush, Noah Goodman | To exploit efficient tensor algebra in graphs with plates of variables, we generalize undirected factor graphs to plated factor graphs and variable elimination to a tensor variable elimination algorithm that operates directly on plated factor graphs. |

491 | Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models | Michael Oberst, David Sontag | In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). |

492 | Model Function Based Conditional Gradient Method with Armijo-like Line Search | Peter Ochs, Yura Malitsky | As special cases, for example, we develop an algorithm for additive composite problems and an algorithm for non-linear composite problems which leads to a Gauss-Newton-type algorithm. |

493 | TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing | Augustus Odena, Catherine Olsson, David Andersen, Ian Goodfellow | We introduce testing techniques for neural networks that can discover errors occurring only for rare inputs. |

494 | Scalable Learning in Reproducing Kernel Krein Spaces | Dino Oglic, Thomas Gärtner | We provide the first mathematically complete derivation of the Nyström method for low-rank approximation of indefinite kernels and propose an efficient method for finding an approximate eigendecomposition of such kernel matrices. |

495 | Approximation and non-parametric estimation of ResNet-type convolutional neural networks | Kenta Oono, Taiji Suzuki | We show a ResNet-type CNN can attain the minimax optimal error rates in these classes in more plausible situations – it can be dense, and its width, channel size, and filter size are constant with respect to sample size. |

496 | Orthogonal Random Forest for Causal Inference | Miruna Oprescu, Vasilis Syrgkanis, Zhiwei Steven Wu | We propose the orthogonal random forest, an algorithm that combines Neyman-orthogonality to reduce sensitivity with respect to estimation error of nuisance parameters with generalized random forests (Athey et al., 2017)—a flexible non-parametric method for statistical estimation of conditional moment models using random forests. |

497 | Inferring Heterogeneous Causal Effects in Presence of Spatial Confounding | Muhammad Osama, Dave Zachariah, Thomas Schön | We address the problem of inferring the causal effect of an exposure on an outcome across space, using observational data. |

498 | Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path? | Samet Oymak, Mahdi Soltanolkotabi | In this paper we demonstrate that when the loss has certain properties over a minimally small neighborhood of the initial point, first order methods such as (stochastic) gradient descent have a few intriguing properties: (1) the iterates converge at a geometric rate to a global optima even when the loss is nonconvex, (2) among all global optima of the loss the iterates converge to one with a near minimal distance to the initial point, (3) the iterates take a near direct route from the initial point to this global optimum. |

499 | Multiplicative Weights Updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always | Ioannis Panageas, Georgios Piliouras, Xiao Wang | In this paper we focus on constrained non-concave maximization. |

500 | Improving Adversarial Robustness via Promoting Ensemble Diversity | Tianyu Pang, Kun Xu, Chao Du, Ning Chen, Jun Zhu | This paper presents a new method that explores the interaction among individual networks to improve robustness for ensemble models. |

501 | Nonparametric Bayesian Deep Networks with Local Competition | Konstantinos Panousis, Sotirios Chatzis, Sergios Theodoridis | The aim of this work is to enable inference of deep networks that retain high accuracy for the least possible model complexity, with the latter deduced from the data during inference. |

502 | Optimistic Policy Optimization via Multiple Importance Sampling | Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, Marcello Restelli | In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. |

503 | Deep Residual Output Layers for Neural Language Generation | Nikolaos Pappas, James Henderson | In this paper, we investigate the usefulness of more powerful shared mappings for output labels, and propose a deep residual output mapping with dropout between layers to better capture the structure of the output space and avoid overfitting. |

504 | Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians | Vardan Papyan | We show this term is not a Covariance but a second moment matrix, i.e., it is influenced by means of gradients. |

505 | Generalized Majorization-Minimization | Sobhan Naderi Parizi, Kun He, Reza Aghajani, Stan Sclaroff, Pedro Felzenszwalb | We generalize MM by relaxing this constraint, and propose a new optimization framework, named Generalized Majorization-Minimization (G-MM), that is more flexible. |

506 | Variational Laplace Autoencoders | Yookoon Park, Chris Kim, Gunhee Kim | We present a novel approach that addresses both challenges. |

507 | The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study | Daniel Park, Jascha Sohl-Dickstein, Quoc Le, Samuel Smith | We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. |

508 | Spectral Approximate Inference | Sejun Park, Eunho Yang, Se-Young Yun, Jinwoo Shin | To overcome the limitation, we propose a novel approach utilizing the global spectral feature of GM. |

509 | Self-Supervised Exploration via Disagreement | Deepak Pathak, Dhiraj Gandhi, Abhinav Gupta | In this paper, we propose a formulation for exploration inspired by the work in active learning literature. |

510 | Subspace Robust Wasserstein Distances | François-Pierre Paty, Marco Cuturi | We propose in this work a “max-min” robust variant of the Wasserstein distance by considering the maximal possible distance that can be realized between two measures, assuming they can be projected orthogonally on a lower k-dimensional subspace. |

511 | Fingerprint Policy Optimisation for Robust Reinforcement Learning | Supratik Paul, Michael A. Osborne, Shimon Whiteson | In this paper, we present fingerprint policy optimisation (FPO), which finds a policy that is optimal in expectation across the distribution of environment variables. |

512 | COMIC: Multi-view Clustering Without Parameter Selection | Xi Peng, Zhenyu Huang, Jiancheng Lv, Hongyuan Zhu, Joey Tianyi Zhou | In this paper, we study two challenges in clustering analysis, namely, how to cluster multi-view data and how to perform clustering without parameter selection on cluster size. |

513 | Domain Agnostic Learning with Disentangled Representations | Xingchao Peng, Zijun Huang, Ximeng Sun, Kate Saenko | In this paper, we propose the task of Domain-Agnostic Learning (DAL): How to transfer knowledge from a labeled source domain to unlabeled data from arbitrary target domains? |

514 | Collaborative Channel Pruning for Deep Networks | Hanyu Peng, Jiaxiang Wu, Shifeng Chen, Junzhou Huang | In this paper, we propose a novel algorithm, namely collaborative channel pruning (CCP), to reduce the computational overhead with negligible performance degradation. |

515 | Exploiting structure of uncertainty for efficient matroid semi-bandits | Pierre Perrault, Vianney Perchet, Michal Valko | We improve the efficiency of algorithms for stochastic combinatorial semi-bandits. |

516 | Cognitive model priors for predicting human decisions | Joshua Peterson, David Bourgin, Daniel Reichman, Thomas Griffiths, Stuart Russell | We argue that this is mainly due to data scarcity, since noisy human behavior requires massive sample sizes to be accurately captured by off-the-shelf machine learning methods. Second, we present the first large-scale dataset for human decision-making, containing over 240,000 human judgments across over 13,000 decision problems. |

517 | Towards Understanding Knowledge Distillation | Mary Phuong, Christoph Lampert | In this work, we provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers. |

518 | Temporal Gaussian Mixture Layer for Videos | AJ Piergiovanni, Michael Ryoo | We present our fully convolutional video models with multiple TGM layers for activity detection. |

519 | Voronoi Boundary Classification: A High-Dimensional Geometric Approach via Weighted Monte Carlo Integration | Vladislav Polianskii, Florian T. Pokorny | We propose a Monte-Carlo integration based approach that instead computes a weighted integral over the boundaries of Voronoi cells, thus incorporating additional information about the Voronoi cell structure. |

520 | On Variational Bounds of Mutual Information | Ben Poole, Sherjil Ozair, Aaron van den Oord, Alex Alemi, George Tucker | In this work, we unify these recent developments in a single framework. |

521 | Hiring Under Uncertainty | Manish Purohit, Sreenivas Gollapudi, Manish Raghavan | In this paper we introduce the hiring under uncertainty problem to model the questions faced by hiring committees in large enterprises and universities alike. |

522 | SAGA with Arbitrary Sampling | Xun Qian, Zheng Qu, Peter Richtarik | We remedy this situation and propose a general and flexible variant of SAGA following the arbitrary sampling paradigm. |

523 | SGD with Arbitrary Sampling: General Analysis and Improved Rates | Xun Qian, Peter Richtarik, Robert Gower, Alibek Sailanbayev, Nicolas Loizou, Egor Shulgin | We propose a general yet simple theorem describing the convergence of SGD under the arbitrary sampling paradigm. |

524 | AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss | Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson | In this paper, we propose a new style transfer scheme that involves only an autoencoder with a carefully designed bottleneck. |

525 | Fault Tolerance in Iterative-Convergent Machine Learning | Aurick Qiao, Bryon Aragam, Bingjing Zhang, Eric Xing | In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms. |

526 | Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition | Yao Qin, Nicholas Carlini, Garrison Cottrell, Ian Goodfellow, Colin Raffel | This paper makes progress on both of these fronts. |

527 | GMNN: Graph Markov Neural Networks | Meng Qu, Yoshua Bengio, Jian Tang | In this paper, we propose the Graph Markov Neural Network (GMNN) that combines the advantages of both worlds. |

528 | Nonlinear Distributional Gradient Temporal-Difference Learning | Chao Qu, Shie Mannor, Huan Xu | In the control setting, we propose the distributional Greedy-GQ using similar derivation. |

529 | Learning to Collaborate in Markov Decision Processes | Goran Radanovic, Rati Devidze, David Parkes, Adish Singla | We consider a two-agent MDP framework where agents repeatedly solve a task in a collaborative setting. |

530 | Meta-Learning Neural Bloom Filters | Jack Rae, Sergey Bartunov, Timothy Lillicrap | In this paper we explore the learning of approximate set membership over a set of data in one-shot via meta-learning. |

531 | Direct Uncertainty Prediction for Medical Second Opinions | Maithra Raghu, Katy Blumer, Rory Sayres, Ziad Obermeyer, Bobby Kleinberg, Sendhil Mullainathan, Jon Kleinberg | In this work, we show that machine learning models can be successfully trained to give uncertainty scores to data instances that result in high expert disagreements. |

532 | Game Theoretic Optimization via Gradient-based Nikaido-Isoda Function | Arvind Raghunathan, Anoop Cherian, Devesh Jha | To this end, we introduce the Gradient-based Nikaido-Isoda (GNI) function, which (i) serves as a merit function, vanishing only at the first-order stationary points of each player’s optimization problem, and (ii) provides error bounds to a stationary Nash point. |

533 | On the Spectral Bias of Neural Networks | Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, Aaron Courville | In this work we present properties of neural networks that complement this aspect of expressivity. |

534 | Look Ma, No Latent Variables: Accurate Cutset Networks via Compilation | Tahrima Rahman, Shasha Jin, Vibhav Gogate | To address this problem, in this paper, we propose a novel approach for inducing cutset networks, a well-known tractable, highly interpretable representation that does not use latent variables and admits linear time MAR as well as MAP inference. |

535 | Does Data Augmentation Lead to Positive Margin? | Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, Dimitris Papailiopoulos | In this work, we analyze the robustness that DA begets by quantifying the margin that DA enforces on empirical risk minimizers. |

536 | Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables | Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, Deirdre Quillen | In this paper, we address these challenges by developing an off-policy meta-RL algorithm that disentangles task inference and control. |

537 | Screening rules for Lasso with non-convex Sparse Regularizers | Alain Rakotomamonjy, Gilles Gasso, Joseph Salmon | The approach we propose is based on an iterative majorization-minimization (MM) strategy that includes a screening rule in the inner solver and a condition for propagating screened variables between iterations of MM. |

538 | Topological Data Analysis of Decision Boundaries with Application to Model Selection | Karthikeyan Natesan Ramamurthy, Kush Varshney, Krishnan Mody | We propose the labeled Čech complex, the plain labeled Vietoris-Rips complex, and the locally scaled labeled Vietoris-Rips complex to perform persistent homology inference of decision boundaries in classification tasks. |

539 | HyperGAN: A Generative Model for Diverse, Performant Neural Networks | Neale Ratzlaff, Li Fuxin | We introduce HyperGAN, a generative model that learns to generate all the parameters of a deep neural network. |

540 | Efficient On-Device Models using Neural Projections | Sujith Ravi | We propose a neural projection approach for training compact on-device neural networks. |

541 | A Block Coordinate Descent Proximal Method for Simultaneous Filtering and Parameter Estimation | Ramin Raziperchikolaei, Harish Bhat | We propose and analyze a block coordinate descent proximal algorithm (BCD-prox) for simultaneous filtering and parameter estimation of ODE models. |

542 | Do ImageNet Classifiers Generalize to ImageNet? | Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, Vaishaal Shankar | We build new test sets for the CIFAR-10 and ImageNet datasets. |

543 | Fast Rates for a kNN Classifier Robust to Unknown Asymmetric Label Noise | Henry Reeve, Ata Kaban | We consider classification in the presence of class-dependent asymmetric label noise with unknown noise probabilities. |

544 | Almost Unsupervised Text to Speech and Automatic Speech Recognition | Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu | In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that leverages only a few hundred paired examples plus extra unpaired data for TTS and ASR. |

545 | Adaptive Antithetic Sampling for Variance Reduction | Hongyu Ren, Shengjia Zhao, Stefano Ermon | In this paper, we propose a general-purpose adaptive antithetic sampling framework. |

546 | Adversarial Online Learning with noise | Alon Resler, Yishay Mansour | Specifically, we consider binary losses xored with the noise, which is a Bernoulli random variable. |

547 | A Polynomial Time MCMC Method for Sampling from Continuous Determinantal Point Processes | Alireza Rezaei, Shayan Oveis Gharan | We study the Gibbs sampling algorithm for discrete and continuous $k$-determinantal point processes. |

548 | A Persistent Weisfeiler-Lehman Procedure for Graph Classification | Bastian Rieck, Christian Bock, Karsten Borgwardt | Our method, which we formalise as a generalisation of Weisfeiler–Lehman subtree features, exhibits favourable classification accuracy and its improvements in predictive performance are mainly driven by including cycle information. |

549 | Efficient learning of smooth probability functions from Bernoulli tests with guarantees | Paul Rolland, Ali Kavis, Alexander Immer, Adish Singla, Volkan Cevher | We study the fundamental problem of learning an unknown, smooth probability function via point-wise Bernoulli tests. |

550 | Separable value functions across time-scales | Joshua Romoff, Peter Henderson, Ahmed Touati, Yann Ollivier, Joelle Pineau, Emma Brunskill | We present an extension of temporal difference (TD) learning, which we call TD($\Delta$), that breaks down a value function into a series of components based on the differences between value functions with smaller discount factors. |

551 | Online Convex Optimization in Adversarial Markov Decision Processes | Aviv Rosenberg, Yishay Mansour | We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the learner. |

552 | Good Initializations of Variational Bayes for Deep Models | Simone Rossi, Pietro Michiardi, Maurizio Filippone | We address this by proposing a novel layer-wise initialization strategy based on Bayesian linear models. |

553 | The Odds are Odd: A Statistical Test for Detecting Adversarial Examples | Kevin Roth, Yannic Kilcher, Thomas Hofmann | We investigate conditions under which test statistics exist that can reliably detect examples that have been adversarially manipulated in a white-box attack. |

554 | Neuron birth-death dynamics accelerates gradient descent and converges asymptotically | Grant Rotskoff, Samy Jelassi, Joan Bruna, Eric Vanden-Eijnden | In this work, we propose a non-local mass transport dynamics that leads to a modified PDE with the same minimizer. |

555 | Iterative Linearized Control: Stable Algorithms and Complexity Guarantees | Vincent Roulet, Dmitriy Drusvyatskiy, Siddhartha Srinivasa, Zaid Harchaoui | We examine popular gradient-based algorithms for nonlinear control in the light of the modern complexity analysis of first-order optimization algorithms. |

556 | Statistics and Samples in Distributional Reinforcement Learning | Mark Rowland, Robert Dadashi, Saurabh Kumar, Remi Munos, Marc G. Bellemare, Will Dabney | We present a unifying framework for designing and analysing distributional reinforcement learning (DRL) algorithms in terms of recursively estimating statistics of the return distribution. |

557 | A Contrastive Divergence for Combining Variational Inference and MCMC | Francisco Ruiz, Michalis Titsias | To make inference tractable, we introduce the variational contrastive divergence (VCD), a new divergence that replaces the standard Kullback-Leibler (KL) divergence used in VI. |

558 | Plug-and-Play Methods Provably Converge with Properly Trained Denoisers | Ernest Ryu, Jialin Liu, Sicheng Wang, Xiaohan Chen, Zhangyang Wang, Wotao Yin | In this paper, we theoretically establish convergence of PnP-FBS and PnP-ADMM, without using diminishing stepsizes, under a certain Lipschitz condition on the denoisers. |

559 | White-box vs Black-box: Bayes Optimal Strategies for Membership Inference | Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Yann Ollivier, Herve Jegou | In this paper, we derive the optimal strategy for membership inference with a few assumptions on the distribution of the parameters. |

560 | Tractable n-Metrics for Multiple Graphs | Sam Safavi, Jose Bento | In this paper, we introduce a new family of multi-distances (a distance between more than two elements) that satisfies a generalization of the properties of metrics to multiple elements. |

561 | An Optimal Private Stochastic-MAB Algorithm based on Optimal Private Stopping Rule | Touqir Sajed, Or Sheffet | We present a provably optimal differentially private algorithm for the stochastic multi-arm bandit problem, as opposed to the private analogue of the UCB-algorithm (Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016) which doesn’t meet the recently discovered lower-bound of $\Omega \left(\frac{K\log(T)}{\epsilon} \right)$ (Shariff and Sheffet, 2018). |

562 | Deep Gaussian Processes with Importance-Weighted Variational Inference | Hugh Salimbeni, Vincent Dutordoir, James Hensman, Marc Deisenroth | We instead incorporate noisy variables as latent covariates, and propose a novel importance-weighted objective, which leverages analytic results and provides a mechanism to trade off computation for improved accuracy. |

563 | Multivariate Submodular Optimization | Richard Santiago, F. Bruce Shepherd | In this work we focus on a more general class of multivariate submodular optimization (MVSO) problems: $\min/\max f (S_1,S_2,\ldots,S_k): S_1 \uplus S_2 \uplus \cdots \uplus S_k \in \mathcal{F}$. |

564 | Near optimal finite time identification of arbitrary linear dynamical systems | Tuhin Sarkar, Alexander Rakhlin | We provide the first analysis of the general case when eigenvalues of the LTI system are arbitrarily distributed in three regimes: stable, marginally stable, and explosive. |

565 | Breaking Inter-Layer Co-Adaptation by Classifier Anonymization | Ikuro Sato, Kohta Ishikawa, Guoqing Liu, Masayuki Tanaka | We introduce a method called Feature-extractor Optimization through Classifier Anonymization (FOCA), which is designed to avoid an explicit co-adaptation between a feature extractor and a particular classifier by using many randomly-generated, weak classifiers during optimization. |

566 | A Theoretical Analysis of Contrastive Unsupervised Representation Learning | Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, Hrishikesh Khandeparkar | The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. |

567 | Locally Private Bayesian Inference for Count Models | Aaron Schein, Zhiwei Steven Wu, Alexandra Schofield, Mingyuan Zhou, Hanna Wallach | We present a general and modular method for privacy-preserving Bayesian inference for Poisson factorization, a broad class of models that includes some of the most widely used models in the social sciences. |

568 | Weakly-Supervised Temporal Localization via Occurrence Count Learning | Julien Schroeter, Kirill Sidorov, David Marshall | We propose a novel model for temporal detection and localization which allows the training of deep neural networks using only counts of event occurrences as training labels. |

569 | Discovering Context Effects from Raw Choice Data | Arjun Seshadri, Alex Peysakhovich, Johan Ugander | In this work, our goal is to discover such choice set effects from raw choice data. |

570 | On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference | Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan | Our goal is for agents to optimize the right reward function, despite how difficult it is for us to specify what that is. |

571 | Exploration Conscious Reinforcement Learning Revisited | Lior Shani, Yonathan Efroni, Shie Mannor | In this work, we take a different approach and study exploration-conscious criteria that result in optimal policies with respect to the exploration mechanism. |

572 | Compressed Factorization: Fast and Accurate Low-Rank Factorization of Compressively-Sensed Data | Vatsal Sharan, Kai Sheng Tai, Peter Bailis, Gregory Valiant | In this work, we consider the question of accurately and efficiently computing low-rank matrix or tensor factorizations given data compressed via random projections. |

573 | Conditional Independence in Testing Bayesian Networks | Yujia Shen, Haiying Huang, Arthur Choi, Adnan Darwiche | In this paper, we study conditional independence in TBNs, showing that it can be inferred from d-separation as in BNs. |

574 | Learning to Clear the Market | Weiran Shen, Sebastien Lahaie, Renato Paes Leme | In this work, we cast the problem of predicting clearing prices into a learning framework and use the resulting models to perform revenue optimization in auctions and markets with contextual information. |

575 | Mixture Models for Diverse Machine Translation: Tricks of the Trade | Tianxiao Shen, Myle Ott, Michael Auli, Marc'Aurelio Ranzato | Mixture Models for Diverse Machine Translation: Tricks of the Trade. |

576 | Hessian Aided Policy Gradient | Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qian, Chao Mi | This paper presents a Hessian aided policy gradient method with the first improved sample complexity of $\mathcal{O}(1/\epsilon^3)$. |

577 | Learning with Bad Training Data via Iterative Trimmed Loss Minimization | Yanyao Shen, Sujay Sanghavi | In this paper, we study a simple and generic framework to tackle the problem of learning model parameters when a fraction of the training samples are corrupted. |

578 | Replica Conditional Sequential Monte Carlo | Alex Shestopaloff, Arnaud Doucet | We propose a Markov chain Monte Carlo (MCMC) scheme to perform state inference in non-linear non-Gaussian state-space models. |

579 | Scalable Training of Inference Networks for Gaussian-Process Models | Jiaxin Shi, Mohammad Emtiyaz Khan, Jun Zhu | We propose an algorithm that enables such training by tracking a stochastic, functional mirror-descent algorithm. |

580 | Fast Direct Search in an Optimally Compressed Continuous Target Space for Efficient Multi-Label Active Learning | Weishi Shi, Qi Yu | We propose a novel CS-BPCA process that integrates compressed sensing and Bayesian principal component analysis to perform a two-level label transformation, resulting in an optimally compressed continuous target space. |

581 | Model-Based Active Exploration | Pranav Shyam, Wojciech Jaskowski, Faustino Gomez | This paper introduces an efficient active exploration algorithm, Model-Based Active eXploration (MAX), which uses an ensemble of forward models to plan to observe novel events. |

582 | Rehashing Kernel Evaluation in High Dimensions | Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, Philip Levis | In this paper, we close the gap between theory and practice by addressing these challenges via provable and practical procedures for adaptive sample size selection, preprocessing time reduction, and refined variance bounds that quantify the data-dependent performance of random sampling and hashing-based kernel evaluation methods. |

583 | Revisiting precision recall definition for generative modeling | Loic Simon, Ryan Webster, Julien Rabin | In this article we revisit the definition of Precision-Recall (PR) curves for generative models proposed by (Sajjadi et al., 2018). |

584 | First-Order Adversarial Vulnerability of Neural Networks and Input Dimension | Carl-Johann Simon-Gabriel, Yann Ollivier, Leon Bottou, Bernhard Schölkopf, David Lopez-Paz | We show that adversarial vulnerability increases with the gradients of the training objective when viewed as a function of the inputs. |

585 | Refined Complexity of PCA with Outliers | Kirill Simonov, Fedor Fomin, Petr Golovach, Fahad Panolan | We provide a rigorous algorithmic analysis of the problem. |

586 | A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks | Umut Simsekli, Levent Sagun, Mert Gurbuzbalaban | Accordingly, we propose to analyze SGD as an SDE driven by a Lévy motion. |

587 | Non-Parametric Priors For Generative Adversarial Networks | Rajhans Singh, Pavan Turaga, Suren Jayasuriya, Ravi Garg, Martin Braun | We present a straightforward formalization of this problem; using basic results from probability theory and off-the-shelf-optimization tools, we develop ways to arrive at appropriate non-parametric priors. |

588 | Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation | Sahil Singla, Eric Wallace, Shi Feng, Soheil Feizi | We use an L0 – L1 relaxation technique along with proximal gradient descent to efficiently compute group-feature importance values. |

589 | kernelPSI: a Post-Selection Inference Framework for Nonlinear Variable Selection | Lotfi Slim, Clément Chatelain, Chloe-Agathe Azencott, Jean-Philippe Vert | In the present work, we exploit recent advances in post-selection inference to propose a valid statistical test for the association of a joint model of the selected kernels with the outcome. |

590 | GEOMetrics: Exploiting Geometric Structure for Graph-Encoded Objects | Edward Smith, Adriana Romero, Scott Fujimoto, David Meger | In this paper, we argue that the graph representation of geometric objects allows for additional structure, which should be leveraged for enhanced reconstruction. |

591 | The Evolved Transformer | David So, Quoc Le, Chen Liang | Our goal is to apply NAS to search for a better alternative to the Transformer. |

592 | QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning | Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, Yung Yi | In this paper, we propose a new factorization method for MARL, QTRAN, which is free from such structural constraints and takes on a new approach to transforming the original joint action-value function into an easily factorizable one, with the same optimal actions. |

593 | Distribution calibration for regression | Hao Song, Tom Diethe, Meelis Kull, Peter Flach | We introduce the novel concept of distribution calibration, and demonstrate its advantages over the existing definition of quantile calibration. |

594 | SELFIE: Refurbishing Unclean Samples for Robust Deep Learning | Hwanjun Song, Minseok Kim, Jae-Gil Lee | To overcome overfitting on the noisy labels, we propose a novel robust training method called SELFIE. |

595 | Revisiting the Softmax Bellman Operator: New Benefits and New Perspective | Zhao Song, Ron Parr, Lawrence Carin | To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that (i) it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and (ii) the distance of its Q function from the optimal one can be bounded. |

596 | MASS: Masked Sequence to Sequence Pre-training for Language Generation | Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu | Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tasks. |

597 | Dual Entangled Polynomial Code: Three-Dimensional Coding for Distributed Matrix Multiplication | Pedro Soto, Jun Li, Xiaodi Fan | In this paper, we propose dual entangled polynomial (DEP) codes that require around 25% fewer tasks than EP codes by executing two matrix multiplications on each task. |

598 | Compressing Gradient Optimizers via Count-Sketches | Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava | Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large-models. |

599 | Escaping Saddle Points with Adaptive Gradient Methods | Matthew Staib, Sashank Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra | In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. |

600 | Faster Attend-Infer-Repeat with Tractable Probabilistic Models | Karl Stelzner, Robert Peharz, Kristian Kersting | In this paper, we show that the speed and robustness of learning in AIR can be considerably improved by replacing the intractable object representations with tractable probabilistic models. |

601 | Insertion Transformer: Flexible Sequence Generation via Insertion Operations | Mitchell Stern, William Chan, Jamie Kiros, Jakob Uszkoreit | We present the Insertion Transformer, an iterative, partially autoregressive model for sequence generation based on insertion operations. |

602 | BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning | Asa Cooper Stickland, Iain Murray | We explore multi-task approaches that share a single BERT model with a small number of additional task-specific parameters. |

603 | Learning Optimal Linear Regularizers | Matthew Streeter | We present algorithms for efficiently learning regularizers that improve generalization. |

604 | CAB: Continuous Adaptive Blending for Policy Evaluation and Learning | Yi Su, Lequn Wang, Michele Santacatterina, Thorsten Joachims | In this paper, we identify a family of counterfactual estimators which subsumes most such estimators proposed to date. |

605 | Learning Distance for Sequences by Learning a Ground Metric | Bing Su, Ying Wu | We propose to learn the distance for sequences through learning a ground Mahalanobis metric for the vectors in sequences. |

606 | Contextual Memory Trees | Wen Sun, Alina Beygelzimer, Hal Daumé III, John Langford, Paul Mineiro | We design and study a Contextual Memory Tree (CMT), a learning memory controller that inserts new memories into an experience store of unbounded size. |

607 | Provably Efficient Imitation Learning from Observation Alone | Wen Sun, Anirudh Vemula, Byron Boots, Drew Bagnell | We design a new model-free algorithm for ILFO, Forward Adversarial Imitation Learning (FAIL), which learns a sequence of time-dependent policies by minimizing an Integral Probability Metric between the observation distributions of the expert policy and the learner. |

608 | Active Learning for Decision-Making from Imbalanced Observational Data | Iiris Sundin, Peter Schulam, Eero Siivola, Aki Vehtari, Suchi Saria, Samuel Kaski | We propose to assess the decision-making reliability by estimating the ITE model’s Type S error rate, which is the probability of the model inferring the sign of the treatment effect wrong. |

609 | Robustly Disentangled Causal Mechanisms: Validating Deep Representations for Interventional Robustness | Raphael Suter, Djordje Miladinovic, Bernhard Schölkopf, Stefan Bauer | We provide a causal perspective on representation learning which covers disentanglement and domain shift robustness as special cases. |

610 | Hyperbolic Disk Embeddings for Directed Acyclic Graphs | Ryota Suzuki, Ryusuke Takahama, Shun Onoda | Tackling this problem, we develop Disk Embeddings, a framework for embedding DAGs into quasi-metric spaces. |

611 | Accelerated Flow for Probability Distributions | Amirhossein Taghvaei, Prashant Mehta | This paper presents a methodology and numerical algorithms for constructing accelerated gradient flows on the space of probability distributions. |

612 | Equivariant Transformer Networks | Kai Sheng Tai, Peter Bailis, Gregory Valiant | We propose Equivariant Transformers (ETs), a family of differentiable image-to-image mappings that improve the robustness of models towards pre-defined continuous transformation groups. |

613 | Making Deep Q-learning methods robust to time discretization | Corentin Tallec, Léonard Blier, Yann Ollivier | In this paper, we identify sensitivity to time discretization in near continuous-time environments as a critical factor; this covers, e.g., changing the number of frames per second, or the action frequency of the controller. |

614 | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | Mingxing Tan, Quoc Le | In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. |

615 | Hierarchical Decompositional Mixtures of Variational Autoencoders | Ping Liang Tan, Robert Peharz | Since these problems become generally more severe in high dimensions, we propose a novel hierarchical mixture model over low-dimensional VAE experts. |

616 | Mallows ranking models: maximum likelihood estimate and regeneration | Wenpin Tang | Motivated by the infinite top-$t$ ranking model, we propose an algorithm to select the model size $t$ automatically. |

617 | Correlated Variational Auto-Encoders | Da Tang, Dawen Liang, Tony Jebara, Nicholas Ruozzi | We propose Correlated Variational Auto-Encoders (CVAEs) that can take the correlation structure into consideration when learning latent representations with VAEs. |

618 | The Variational Predictive Natural Gradient | Da Tang, Rajesh Ranganath | To address this, we construct a new natural gradient called the Variational Predictive Natural Gradient (VPNG). |

619 | DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-pass Error-Compensated Compression | Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, Ji Liu | In this work, we provide a detailed analysis on this two-pass communication model, with error-compensated compression both on the worker nodes and on the parameter server. |

620 | Adaptive Neural Trees | Ryutaro Tanno, Kai Arulkumaran, Daniel Alexander, Antonio Criminisi, Aditya Nori | We unite the two via adaptive neural trees (ANTs), a model that incorporates representation learning into edges, routing functions and leaf nodes of a decision tree, along with a backpropagation-based training algorithm that adaptively grows the architecture from primitive modules (e.g., convolutional layers). |

621 | Variational Annealing of GANs: A Langevin Perspective | Chenyang Tao, Shuyang Dai, Liqun Chen, Ke Bai, Junya Chen, Chang Liu, Ruiyi Zhang, Georgiy Bobashev, Lawrence Carin | We highlight new insights from variational theory of diffusion processes to derive a likelihood-based regularizing scheme for GAN training, and present a novel approach to train GANs with an unnormalized distribution instead of empirical samples. |

622 | Predicate Exchange: Inference with Declarative Knowledge | Zenna Tavares, Rajesh Ranganath, Javier Burroni, Armando Solar-Lezama, Edgar Minasyan | To support a broader class of predicates, we develop an inference procedure called predicate exchange, which softens predicates. |

623 | The Natural Language of Actions | Guy Tennenholtz, Shie Mannor | We introduce Act2Vec, a general framework for learning context-based action representation for Reinforcement Learning. |

624 | Kernel Normalized Cut: a Theoretical Revisit | Yoshikazu Terada, Michio Yamamoto | In this paper, we study the theoretical properties of clustering based on the kernel normalized cut. |

625 | Action Robust Reinforcement Learning and Applications in Continuous Control | Chen Tessler, Yonathan Efroni, Shie Mannor | In this work we formalize two new criteria of robustness to action uncertainty. |

626 | Concentration Inequalities for Conditional Value at Risk | Philip Thomas, Erik Learned-Miller | In this paper we derive new concentration inequalities for the conditional value at risk (CVaR) of a random variable, and compare them to the previous state of the art (Brown, 2007). |

627 | Combating Label Noise in Deep Learning using Abstention | Sunil Thulasidasan, Tanmoy Bhattacharya, Jeff Bilmes, Gopinath Chennupati, Jamal Mohd-Yusof | We introduce a novel method to combat label noise when training deep neural networks for classification. |

628 | ELF OpenGo: an analysis and open reimplementation of AlphaZero | Yuandong Tian, Jerry Ma, Qucheng Gong, Shubho Sengupta, Zhuoyuan Chen, James Pinkerton, Larry Zitnick | Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. |

629 | Random Matrix Improved Covariance Estimation for a Large Class of Metrics | Malik Tiomoko, Romain Couillet, Florent Bouchard, Guillaume Ginolhac | Relying on recent advances in statistical estimation of covariance distances based on random matrix theory, this article proposes an improved covariance and precision matrix estimation for a wide family of metrics. |

630 | Transfer of Samples in Policy Search via Multiple Importance Sampling | Andrea Tirinzoni, Mattia Salvini, Marcello Restelli | In this paper, we consider the more complex case of reusing samples in policy search methods, in which the agent is required to transfer entire trajectories between environments with different transition models. |

631 | Optimal Transport for structured data with application on graphs | Titouan Vayer, Nicolas Courty, Romain Tavenard, Laetitia Chapel, Rémi Flamary | This work considers the problem of computing distances between structured objects such as undirected graphs, seen as probability distributions in a specific metric space. |

632 | Discovering Latent Covariance Structures for Multiple Time Series | Anh Tong, Jaesik Choi | We present a pragmatic search algorithm which explores a larger structure space efficiently. |

633 | Bayesian Generative Active Deep Learning | Toan Tran, Thanh-Toan Do, Ian Reid, Gustavo Carneiro | In this paper, we propose a Bayesian generative active deep learning approach that combines active learning with data augmentation – we provide theoretical and empirical evidence (MNIST, CIFAR-$\{10,100\}$, and SVHN) that our approach has more efficient training and better classification results than data augmentation and active learning. |

634 | DeepNose: Using artificial neural networks to represent the space of odorants | Ngoc Tran, Daniel Kepple, Sergey Shuvaev, Alexei Koulakov | We propose that the DeepNose network can extract de novo chemical features predictive of various bioactivities and can help understand the factors influencing the composition of the OR ensemble. |

635 | LR-GLM: High-Dimensional Bayesian Inference Using Low-Rank Data Approximations | Brian Trippe, Jonathan Huggins, Raj Agrawal, Tamara Broderick | We propose to reduce time and memory costs with a low-rank approximation of the data in an approach we call LR-GLM. |

636 | Learning Hawkes Processes Under Synchronization Noise | William Trouleau, Jalal Etesami, Matthias Grossglauser, Negar Kiyavash, Patrick Thiran | We characterize the robustness of the classic maximum likelihood estimator to synchronization noise, and we introduce a new approach for learning the causal structure in the presence of noise. |

637 | Homomorphic Sensing | Manolis Tsakiris, Liangzu Peng | In this paper we introduce an abstraction of this problem which we call “homomorphic sensing”. |

638 | Metropolis-Hastings Generative Adversarial Networks | Ryan Turner, Jane Hung, Eric Frank, Yunus Saatchi, Jason Yosinski | We introduce the Metropolis-Hastings generative adversarial network (MH-GAN), which combines aspects of Markov chain Monte Carlo and GANs. |

639 | Distributed, Egocentric Representations of Graphs for Detecting Critical Structures | Ruo-Chun Tzeng, Shan-Hung Wu | In this paper, we propose a novel graph embedding model, called Ego-CNN, that employs ego-convolutions at each layer and stacks up layers in an ego-centric way to detect precise critical structures efficiently. |

640 | Sublinear Space Private Algorithms Under the Sliding Window Model | Jalaj Upadhyay | In this paper, we study heavy hitters in the sliding window model with window size $w$. |

641 | Fairness without Harm: Decoupled Classifiers with Preference Guarantees | Berk Ustun, Yang Liu, David Parkes | In this work, we argue that when there is this kind of treatment disparity, then it should be in the best interest of each group. |

642 | Large-Scale Sparse Kernel Canonical Correlation Analysis | Viivi Uurtio, Sahely Bhadra, Juho Rousu | This paper presents gradKCCA, a large-scale sparse non-linear canonical correlation method. |

643 | Characterization of Convex Objective Functions and Optimal Expected Convergence Rates for SGD | Marten Van Dijk, Lam Nguyen, Phuong Ha Nguyen, Dzung Phan | We introduce a definitional framework and theory that defines and characterizes a core property, called curvature, of convex objective functions. |

644 | Composing Value Functions in Reinforcement Learning | Benjamin Van Niekerk, Steven James, Adam Earle, Benjamin Rosman | Under the assumption of deterministic dynamics, we prove that optimal value function composition can be achieved in entropy-regularised reinforcement learning (RL), and extend this result to the standard RL setting. |

645 | Model Comparison for Semantic Grouping | Francisco Vargas, Kamen Brestnichki, Nils Hammerla | We introduce a probabilistic framework for quantifying the semantic similarity between two groups of embeddings. |

646 | Learning Dependency Structures for Weak Supervision Models | Paroma Varma, Frederic Sala, Ann He, Alexander Ratner, Christopher Ré | We focus on a robust PCA-based algorithm for learning these dependency structures, establish improved theoretical recovery rates, and outperform existing methods on various real-world tasks. |

647 | Probabilistic Neural Symbolic Models for Interpretable Visual Question Answering | Ramakrishna Vedantam, Karan Desai, Stefan Lee, Marcus Rohrbach, Dhruv Batra, Devi Parikh | We propose a new class of probabilistic neural-symbolic models, that have symbolic functional programs as a latent, stochastic variable. |

648 | Manifold Mixup: Better Representations by Interpolating Hidden States | Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, Yoshua Bengio | To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. |

649 | Maximum Likelihood Estimation for Learning Populations of Parameters | Ramya Korlakai Vinayak, Weihao Kong, Gregory Valiant, Sham Kakade | After observing the outcomes of $t$ independent Bernoulli trials, i.e., $X_i \sim \text{Binomial}(t, p_i)$ per individual, our objective is to accurately estimate $P^\star$ in the sparse regime, namely when $t \ll N$. |

650 | Understanding Priors in Bayesian Neural Networks at the Unit Level | Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, Julyan Arbel | We investigate deep Bayesian neural networks with Gaussian priors on the weights and a class of ReLU-like nonlinearities. |

651 | On the Design of Estimators for Bandit Off-Policy Evaluation | Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, Tony Jebara | We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets. |

652 | Learning to select for a predefined ranking | Aleksandr Vorobev, Aleksei Ustimenko, Gleb Gusev, Pavel Serdyukov | In this paper, we formulate a novel problem of learning to select a set of items maximizing the quality of their ordered list, where the order is predefined by some explicit rule. |

653 | On the Limitations of Representing Functions on Sets | Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, Michael A. Osborne | Motivated by this observation, we prove that an implementation of this model via continuous mappings (as provided by e.g. neural networks or Gaussian processes) actually imposes a constraint on the dimensionality of the latent space. |

654 | Graph Convolutional Gaussian Processes | Ian Walker, Ben Glocker | We propose a novel Bayesian nonparametric method to learn translation-invariant relationships on non-Euclidean domains. |

655 | Gaining Free or Low-Cost Interpretability with Interpretable Partial Substitute | Tong Wang | Under this framework, we develop a Hybrid Rule Sets (HyRS) model that uses decision rules to capture the subspace of data where the rules are as accurate or almost as accurate as the black-box provided. |

656 | Convolutional Poisson Gamma Belief Network | Chaojie Wang, Bo Chen, Sucheng Xiao, Mingyuan Zhou | In this paper, we propose convolutional Poisson factor analysis (CPFA) that directly operates on a lossless representation that processes the words in each document as a sequence of high-dimensional one-hot vectors. |

657 | Differentially Private Empirical Risk Minimization with Non-convex Loss Functions | Di Wang, Changyou Chen, Jinhui Xu | We study the problem of Empirical Risk Minimization (ERM) with (smooth) non-convex loss functions under the differential-privacy (DP) model. |

658 | Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation | Ruohan Wang, Carlo Ciliberto, Pierluigi Vito Amadori, Yiannis Demiris | We propose a new framework for imitation learning by estimating the support of the expert policy to compute a fixed reward function, which allows us to re-frame imitation learning within the standard reinforcement learning setting. |

659 | SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver | Po-Wei Wang, Priya Donti, Bryan Wilder, Zico Kolter | In this paper, we propose a new direction toward this goal by introducing a differentiable (smoothed) maximum satisfiability (MAXSAT) solver that can be integrated into the loop of larger deep learning systems. |

660 | Improving Neural Language Modeling via Adversarial Training | Dilin Wang, Chengyue Gong, Qiang Liu | In this paper, we present a simple yet highly effective adversarial training mechanism for regularizing neural language models. |

661 | EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis | Chaoqi Wang, Roger Grosse, Sanja Fidler, Guodong Zhang | In particular, we highlight that the improvements are especially significant for more challenging datasets and networks. |

662 | Nonlinear Stein Variational Gradient Descent for Learning Diversified Mixture Models | Dilin Wang, Qiang Liu | In this work, we present a variational approach for diversity-promoting learning, which leverages the entropy functional as a natural mechanism for enforcing diversity. |

663 | On the Convergence and Robustness of Adversarial Training | Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, Quanquan Gu | In this paper, we propose such a criterion, namely First-Order Stationary Condition for constrained optimization (FOSC), to quantitatively evaluate the convergence quality of adversarial examples found in the inner maximization. |

664 | State-Regularized Recurrent Neural Networks | Cheng Wang, Mathias Niepert | We aim to address both shortcomings with a class of recurrent networks that use a stochastic state transition mechanism between cell applications. |

665 | Deep Factors for Forecasting | Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, Tim Januschowski | In this paper, we propose a hybrid model that incorporates the benefits of both approaches. |

666 | Repairing without Retraining: Avoiding Disparate Impact with Counterfactual Distributions | Hao Wang, Berk Ustun, Flavio Calmon | In this paper, we exploit this fact to reduce the disparate impact of a fixed classification model over a population of interest. |

667 | On Sparse Linear Regression in the Local Differential Privacy Model | Di Wang, Jinhui Xu | In this paper, we study the sparse linear regression problem under the Local Differential Privacy (LDP) model. |

668 | Doubly Robust Joint Learning for Recommendation on Data Missing Not at Random | Xiaojie Wang, Rui Zhang, Yu Sun, Jianzhong Qi | To achieve good performance guarantees, based on this estimator, we propose joint learning of rating prediction and error imputation, which outperforms the state-of-the-art approaches on four real-world datasets. |

669 | On the Generalization Gap in Reparameterizable Reinforcement Learning | Huan Wang, Stephan Zheng, Caiming Xiong, Richard Socher | We focus on the special class of reparameterizable RL problems, where the trajectory distribution can be decomposed using the reparametrization trick. |

670 | Bias Also Matters: Bias Attribution for Deep Neural Network Explanation | Shengjie Wang, Tianyi Zhou, Jeff Bilmes | In this paper, we observe that since the bias in a DNN also has a non-negligible contribution to the correctness of predictions, it can also play a significant role in understanding DNN behavior. |

671 | Jumpout: Improved Dropout for Deep Neural Networks with ReLUs | Shengjie Wang, Tianyi Zhou, Jeff Bilmes | We discuss three novel insights about dropout for DNNs with ReLUs: 1) dropout encourages each local linear piece of a DNN to be trained on data points from nearby regions; 2) the same dropout rate results in different (effective) deactivation rates for layers with different portions of ReLU-deactivated neurons; and 3) the rescaling factor of dropout causes a normalization inconsistency between training and test when used together with batch normalization. |

672 | AdaGrad stepsizes: sharp convergence over nonconvex landscapes | Rachel Ward, Xiaoxia Wu, Leon Bottou | We bridge this gap by providing strong theoretical guarantees for the convergence of AdaGrad over smooth, nonconvex landscapes. |

673 | Generalized Linear Rule Models | Dennis Wei, Sanjeeb Dash, Tian Gao, Oktay Gunluk | This paper considers generalized linear models using rule-based features, also referred to as rule ensembles, for regression and probabilistic classification. |

674 | On the statistical rate of nonlinear recovery in generative models with heavy-tailed data | Xiaohan Wei, Zhuoran Yang, Zhaoran Wang | In this paper, we make a step towards such a direction by considering the scenario where the measurements are non-Gaussian, subject to possibly unknown nonlinear transformations and the responses are heavy-tailed. |

675 | CapsAndRuns: An Improved Method for Approximately Optimal Algorithm Configuration | Gellert Weisz, Andras Gyorgy, Csaba Szepesvari | In this paper we present a new algorithm, CapsAndRuns, which finds a near-optimal configuration while using time that scales (in a problem dependent way) with the optimal expected capped runtime, significantly strengthening previous results which could only guarantee a bound that scaled with the potentially much larger optimal expected uncapped runtime. |

676 | Non-Monotonic Sequential Text Generation | Sean Welleck, Kianté Brantley, Hal Daumé III, Kyunghyun Cho | In this work, we propose a framework for training models of text generation that operate in non-monotonic orders; the model directly learns good orders, without any additional annotation. |

677 | PROVEN: Verifying Robustness of Neural Networks with a Probabilistic Approach | Lily Weng, Pin-Yu Chen, Lam Nguyen, Mark Squillante, Akhilan Boopathy, Ivan Oseledets, Luca Daniel | We propose a novel framework PROVEN to **PRO**babilistically **VE**rify **N**eural network’s robustness with statistical guarantees. |

678 | Learning deep kernels for exponential family densities | Li Wenliang, Dougal Sutherland, Heiko Strathmann, Arthur Gretton | We provide a scheme for learning a kernel parameterized by a deep network, which can find complex location-dependent local features of the data geometry. |

679 | Improving Model Selection by Employing the Test Data | Max Westphal, Werner Brannath | We investigate the properties of novel evaluation strategies, namely when the final model is selected based on empirical performances on the test data. |

680 | Automatic Classifiers as Scientific Instruments: One Step Further Away from Ground-Truth | Jacob Whitehill, Anand Ramakrishnan | We examine how the accuracy of d, as quantified by the correlation q of d’s outputs with the ground-truth construct U, impacts the estimated correlation between U (e.g., stress) and some other phenomenon V (e.g., academic performance). |

681 | Moment-Based Variational Inference for Markov Jump Processes | Christian Wildner, Heinz Koeppl | We propose moment-based variational inference as a flexible framework for approximate smoothing of latent Markov jump processes. |

682 | End-to-End Probabilistic Inference for Nonstationary Audio Analysis | William Wilkinson, Michael Andersen, Joshua D. Reiss, Dan Stowell, Arno Solin | We show how time-frequency analysis and nonnegative matrix factorisation can be jointly formulated as a spectral mixture Gaussian process model with nonstationary priors over the amplitude variance parameters. |

683 | Fairness risk measures | Robert Williamson, Aditya Menon | In this paper, we propose a new definition of fairness that generalises some existing proposals, while allowing for generic sensitive features and resulting in a convex objective. |

684 | Partially Exchangeable Networks and Architectures for Learning Summary Statistics in Approximate Bayesian Computation | Samuel Wiqvist, Pierre-Alexandre Mattei, Umberto Picchini, Jes Frellsen | We present a novel family of deep neural architectures, named partially exchangeable networks (PENs) that leverage probabilistic symmetries. |

685 | Wasserstein Adversarial Examples via Projected Sinkhorn Iterations | Eric Wong, Frank Schmidt, Zico Kolter | In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance. |

686 | Imitation Learning from Imperfect Demonstration | Yueh-Hua Wu, Nontawat Charoenphakdee, Han Bao, Voot Tangkaratt, Masashi Sugiyama | To effectively learn from imperfect demonstrations, we propose a novel approach that utilizes confidence scores, which describe the quality of demonstrations. |

687 | Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling | Shanshan Wu, Alex Dimakis, Sujay Sanghavi, Felix Yu, Daniel Holtmann-Rice, Dmitry Storcheus, Afshin Rostamizadeh, Sanjiv Kumar | In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used $\ell_1$ decoder. |

688 | Heterogeneous Model Reuse via Optimizing Multiparty Multiclass Margin | Xi-Zhu Wu, Song Liu, Zhi-Hua Zhou | In this paper, we define a multiparty multiclass margin to measure the global behavior of a set of heterogeneous local models, and propose a general learning method called HMR (Heterogeneous Model Reuse) to optimize the margin. |

689 | Deep Compressed Sensing | Yan Wu, Mihaela Rosca, Timothy Lillicrap | Here we propose a novel framework that significantly improves both the performance and speed of signal recovery by jointly training a generator and the optimisation process for reconstruction via meta-learning. |

690 | Simplifying Graph Convolutional Networks | Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, Kilian Weinberger | In this paper, we reduce this excess complexity through successively removing nonlinearities and collapsing weight matrices between consecutive layers. |

691 | Domain Adaptation with Asymmetrically-Relaxed Distribution Alignment | Yifan Wu, Ezra Winston, Divyansh Kaushik, Zachary Lipton | We propose asymmetrically-relaxed distribution alignment, a new approach that overcomes some limitations of standard domain-adversarial algorithms. |

692 | On Scalable and Efficient Computation of Large Scale Optimal Transport | Yujia Xie, Minshuo Chen, Haoming Jiang, Tuo Zhao, Hongyuan Zha | To address the scalability issue, we propose an implicit generative learning-based framework called SPOT (Scalable Push-forward of Optimal Transport). |

693 | Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance | Cong Xie, Sanmi Koyejo, Indranil Gupta | We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. |

694 | Differentiable Linearized ADMM | Xingyu Xie, Jianlong Wu, Guangcan Liu, Zhisheng Zhong, Zhouchen Lin | In this paper, we propose Differentiable Linearized ADMM (D-LADMM) for solving the problems with linear constraints. |

695 | Calibrated Approximate Bayesian Inference | Hanwen Xing, Geoff Nicholls, Jeong Lee | We give a general purpose computational framework for estimating the bias in coverage resulting from making approximations in Bayesian inference. |

696 | Power k-Means Clustering | Jason Xu, Kenneth Lange | This paper explores an alternative to Lloyd’s algorithm that retains its simplicity and mitigates its tendency to get trapped by local minima. |

697 | Gromov-Wasserstein Learning for Graph Matching and Node Embedding | Hongteng Xu, Dixin Luo, Hongyuan Zha, Lawrence Carin | We apply the proposed method to matching problems in real-world networks, and demonstrate its superior performance compared to alternative approaches. |

698 | Stochastic Optimization for DC Functions and Non-smooth Non-convex Regularizers with Non-asymptotic Convergence | Yi Xu, Qi Qi, Qihang Lin, Rong Jin, Tianbao Yang | In this paper, we propose new stochastic optimization algorithms and study their first-order convergence theories for solving a broad family of DC functions. |

699 | Learning a Prior over Intent via Meta-Inverse Reinforcement Learning | Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, Chelsea Finn | In this work, we exploit the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a “prior” that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations. |

700 | Variational Russian Roulette for Deep Bayesian Nonparametrics | Kai Xu, Akash Srivastava, Charles Sutton | Instead, we propose a new variational approximation, based on a method from statistical physics called Russian roulette sampling. |

701 | Supervised Hierarchical Clustering with Exponential Linkage | Nishant Yadav, Ari Kobren, Nicholas Monath, Andrew McCallum | In this paper, we introduce a method for training the dissimilarity function in a way that is tightly coupled with hierarchical clustering, in particular single linkage. |

702 | Learning to Prove Theorems via Interacting with Proof Assistants | Kaiyu Yang, Jia Deng | In this paper, we study the problem of using machine learning to automate the interaction with proof assistants. |

703 | Sample-Optimal Parametric Q-Learning Using Linearly Additive Features | Lin Yang, Mengdi Wang | We propose a parametric Q-learning algorithm that finds an approximate-optimal policy using a sample size proportional to the feature dimension $K$ and invariant with respect to the size of the state space. |

704 | LegoNet: Efficient Convolutional Neural Networks with Lego Filters | Zhaohui Yang, Yunhe Wang, Chuanjian Liu, Hanting Chen, Chunjing Xu, Boxin Shi, Chao Xu, Chang Xu | This paper aims to build efficient convolutional neural networks using a set of Lego filters. |

705 | SWALP: Stochastic Weight Averaging in Low Precision Training | Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Chris De Sa | This paper proposes SWALP, an approach to low precision training that averages low-precision SGD iterates with a modified learning rate schedule. |

706 | ME-Net: Towards Effective Adversarial Robustness with Matrix Estimation | Yuzhe Yang, Guo Zhang, Zhi Xu, Dina Katabi | This paper proposes ME-Net, a defense method that leverages matrix estimation (ME). |

707 | Efficient Nonconvex Regularized Tensor Completion with Structure-aware Proximal Iterations | Quanming Yao, James Tin-Yau Kwok, Bo Han | In this paper, we extend this to the more challenging problem of low-rank tensor completion. |

708 | Hierarchically Structured Meta-learning | Huaxiu Yao, Ying Wei, Junzhou Huang, Zhenhui Li | In this paper, based on gradient-based meta-learning, we propose a hierarchically structured meta-learning (HSML) algorithm that explicitly tailors the transferable knowledge to different clusters of tasks. |

709 | Tight Kernel Query Complexity of Kernel Ridge Regression and Kernel $k$-means Clustering | Taisuke Yasuda, David Woodruff, Manuel Fernandez | In this work, we present nearly tight lower bounds on the number of kernel evaluations required to approximately solve kernel ridge regression (KRR) and kernel $k$-means clustering (KKMC) on $n$ input points. |

710 | Understanding Geometry of Encoder-Decoder CNNs | Jong Chul Ye, Woon Kyoung Sung | Inspired by recent theoretical understanding on generalizability, expressivity and optimization landscape of neural networks, as well as the theory of convolutional framelets, here we provide a unified theoretical framework that leads to a better understanding of geometry of encoder-decoder CNNs. |

711 | Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning | Dong Yin, Yudong Chen, Kannan Ramchandran, Peter Bartlett | As a by-product, we give a simpler algorithm and analysis for escaping saddle points in the usual non-Byzantine setting. |

712 | Rademacher Complexity for Adversarially Robust Generalization | Dong Yin, Kannan Ramchandran, Peter Bartlett | In this paper, we focus on $\ell_\infty$ attacks, and study the adversarially robust generalization problem through the lens of Rademacher complexity. |

713 | ARSM: Augment-REINFORCE-Swap-Merge Estimator for Gradient Backpropagation Through Categorical Variables | Mingzhang Yin, Yuguang Yue, Mingyuan Zhou | To address the challenge of backpropagating the gradient through categorical variables, we propose the augment-REINFORCE-swap-merge (ARSM) gradient estimator that is unbiased and has low variance. |

714 | NAS-Bench-101: Towards Reproducible Neural Architecture Search | Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, Frank Hutter | We aim to ameliorate these problems by introducing NAS-Bench-101, the first public architecture dataset for NAS research. |

715 | TapNet: Neural Network Augmented with Task-Adaptive Projection for Few-Shot Learning | Sung Whan Yoon, Jun Seo, Jaekyun Moon | We propose TapNets, neural networks augmented with task-adaptive projection for improved few-shot learning. |

716 | Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation | Kaichao You, Ximei Wang, Mingsheng Long, Michael Jordan | To this end, we propose Deep Embedded Validation (DEV), which embeds adapted feature representation into the validation procedure to obtain unbiased estimation of the target risk with bounded variance. |

717 | Position-aware Graph Neural Networks | Jiaxuan You, Rex Ying, Jure Leskovec | Here we propose Position-aware Graph Neural Networks (P-GNNs), a new class of GNNs for computing position-aware node embeddings. |

718 | Learning Neurosymbolic Generative Models via Program Synthesis | Halley Young, Osbert Bastani, Mayur Naik | We propose to address this problem by incorporating programs representing global structure into generative models (e.g., a 2D for-loop may represent a repeating pattern of windows), along with a framework for learning these models by leveraging program synthesis to obtain training data. |

719 | DAG-GNN: DAG Structure Learning with Graph Neural Networks | Yue Yu, Jie Chen, Tian Gao, Mo Yu | Motivated by the widespread success of deep learning that is capable of capturing complex nonlinear mappings, in this work we propose a deep generative model and apply a variant of the structural constraint to learn the DAG. |

720 | How does Disagreement Help Generalization against Label Corruption? | Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, Masashi Sugiyama | To tackle this issue, we propose a robust learning paradigm called Co-teaching+, which bridges the “Update by Disagreement” strategy with the original Co-teaching. |

721 | On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization | Hao Yu, Rong Jin | For general stochastic non-convex optimization, we propose a Catalyst-like algorithm to achieve the fastest known $O(1/\sqrt{NT})$ convergence with only $O(\sqrt{NT}\log(\frac{T}{N}))$ communication rounds. |

722 | On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization | Hao Yu, Rong Jin, Sen Yang | This paper fills the gap by considering a distributed communication efficient momentum SGD method and proving its linear speedup property. |

723 | Multi-Agent Adversarial Inverse Reinforcement Learning | Lantao Yu, Jiaming Song, Stefano Ermon | In this paper, we propose MA-AIRL, a new framework for multi-agent inverse reinforcement learning, which is effective and scalable for Markov games with high-dimensional state-action space and unknown dynamics. |

724 | Distributed Learning over Unreliable Networks | Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu | In this paper, we connect these two trends, and consider the following question: Can we design machine learning systems that are tolerant to network unreliability during training? |

725 | Online Adaptive Principal Component Analysis and Its Extensions | Jianjun Yuan, Andrew Lamperski | We propose algorithms for online principal component analysis (PCA) and variance minimization for adaptive settings. |

726 | Generative Modeling of Infinite Occluded Objects for Compositional Scene Representation | Jinyang Yuan, Bin Li, Xiangyang Xue | We present a deep generative model which explicitly models object occlusions for compositional scene representation. |

727 | Differential Inclusions for Modeling Nonsmooth ADMM Variants: A Continuous Limit Theory | Huizhuo Yuan, Yuren Zhou, Chris Junchi Li, Qingyun Sun | In this paper, we analyze some well-known and widely used ADMM variants for nonsmooth optimization problems using tools of differential inclusions. |

728 | Trimming the $\ell_1$ Regularizer: Statistical Analysis, Optimization, and Applications to Deep Learning | Jihun Yun, Peng Zheng, Eunho Yang, Aurelie Lozano, Aleksandr Aravkin | We present the first statistical analyses for M-estimation, and characterize support recovery, $\ell_\infty$ and $\ell_2$ error of the trimmed $\ell_1$ estimates as a function of the trimming parameter h. |

729 | Bayesian Nonparametric Federated Learning of Neural Networks | Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, Yasaman Khazaeni | We develop a Bayesian nonparametric framework for federated learning with neural networks. |

730 | Dirichlet Simplex Nest and Geometric Inference | Mikhail Yurochkin, Aritra Guha, Yuekai Sun, Xuanlong Nguyen | We propose Dirichlet Simplex Nest, a class of probabilistic models suitable for a variety of data types, and develop fast and provably accurate inference algorithms by accounting for the model’s convex geometry and low dimensional simplicial structure. |

731 | A Conditional-Gradient-Based Augmented Lagrangian Framework | Alp Yurtsever, Olivier Fercoq, Volkan Cevher | To this end, we propose a new conditional gradient method, based on a unified treatment of smoothing and augmented Lagrangian frameworks. |

732 | Conditional Gradient Methods via Stochastic Path-Integrated Differential Estimator | Alp Yurtsever, Suvrit Sra, Volkan Cevher | We propose a class of variance-reduced stochastic conditional gradient methods. |

733 | Context-Aware Zero-Shot Learning for Object Recognition | Eloi Zablocki, Patrick Bordes, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari | Following the intuitive principle that objects tend to be found in certain contexts but not others, we propose a new and challenging approach, context-aware ZSL, that leverages semantic representations in a new way to model the conditional likelihood of an object to appear in a given context. |

734 | Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds | Andrea Zanette, Emma Brunskill | As a step towards this, we derive an algorithm and analysis for finite horizon discrete MDPs with state-of-the-art worst-case regret bounds and substantially tighter bounds if the RL environment has special features but without a priori knowledge of the environment from the algorithm. |

735 | Global Convergence of Block Coordinate Descent in Deep Learning | Jinshan Zeng, Tim Tsz-Kit Lau, Shaobo Lin, Yuan Yao | In this paper, we aim at providing a general methodology for provable convergence guarantees for this type of method. |

736 | Making Convolutional Networks Shift-Invariant Again | Richard Zhang | We show that when integrated correctly, anti-aliasing (low-pass filtering before downsampling) is compatible with existing architectural components, such as max-pooling. |

737 | Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback | Chicheng Zhang, Alekh Agarwal, Hal Daumé III, John Langford, Sahand Negahban | We investigate the feasibility of learning from both fully-labeled supervised data and contextual bandit data. |

738 | When Samples Are Strategically Selected | Hanrui Zhang, Yu Cheng, Vincent Conitzer | In this paper, we introduce a theoretical framework for this problem and provide key structural and computational results. |

739 | Self-Attention Generative Adversarial Networks | Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena | In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. |

740 | Circuit-GNN: Graph Neural Networks for Distributed Circuit Design | Guo Zhang, Hao He, Dina Katabi | We present Circuit-GNN, a graph neural network (GNN) model for designing distributed circuits. |

741 | LatentGNN: Learning Efficient Non-local Relations for Visual Recognition | Songyang Zhang, Xuming He, Shipeng Yan | In this work, we propose an efficient and yet flexible non-local relation representation based on a novel class of graph neural networks. |

742 | Neural Collaborative Subspace Clustering | Tong Zhang, Pan Ji, Mehrtash Harandi, Wenbing Huang, Hongdong Li | We introduce the Neural Collaborative Subspace Clustering, a neural model that discovers clusters of data points drawn from a union of low-dimensional subspaces. |

743 | Incremental Randomized Sketching for Online Kernel Learning | Xiao Zhang, Shizhong Liao | To address these issues, we propose a novel incremental randomized sketching approach for online kernel learning, which supports efficient incremental maintenance with theoretical guarantees. |

744 | Bridging Theory and Algorithm for Domain Adaptation | Yuchen Zhang, Tianle Liu, Mingsheng Long, Michael Jordan | We introduce Margin Disparity Discrepancy, a novel measurement with rigorous generalization bounds, tailored to the distribution comparison with the asymmetric margin loss, and to the minimax optimization for easier training. |

745 | Adaptive Regret of Convex and Smooth Functions | Lijun Zhang, Tie-Yan Liu, Zhi-Hua Zhou | To this end, we develop novel adaptive algorithms for convex and smooth functions, and establish problem-dependent regret bounds over any interval. |

746 | Random Function Priors for Correlation Modeling | Aonan Zhang, John Paisley | In this paper, we introduce random function priors for $Z_n$ for modeling correlations among its $K$ dimensions $Z_{n1}$ through $Z_{nK}$, which we call population random measure embedding (PRME). |

747 | Co-Representation Network for Generalized Zero-Shot Learning | Fei Zhang, Guangming Shi | Hence we propose an embedding model called co-representation network to learn a more uniform visual embedding space that effectively alleviates the bias problem and helps with classification. |

748 | SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning | Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew Johnson, Sergey Levine | In this paper, we present a method for learning representations that are suitable for iterative model-based policy improvement, even when the underlying dynamical system has complex dynamics and image observations, in that these representations are optimized for inferring simple dynamics and cost models given data from the current policy. |

749 | A Composite Randomized Incremental Gradient Method | Junyu Zhang, Lin Xiao | We propose a composite randomized incremental gradient method by extending the SAGA framework. |

750 | Fast and Stable Maximum Likelihood Estimation for Incomplete Multinomial Models | Chenyang Zhang, Guosheng Yin | We propose a fixed-point iteration approach to the maximum likelihood estimation for the incomplete multinomial model, which provides a unified framework for ranking data analysis. |

751 | Theoretically Principled Trade-off between Robustness and Accuracy | Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, Michael Jordan | In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. |

752 | Learning Novel Policies For Tasks | Yunbo Zhang, Wenhao Yu, Greg Turk | In this work, we present a reinforcement learning algorithm that can find a variety of policies (novel policies) for a task that is given by a task reward function. |

753 | Greedy Orthogonal Pivoting Algorithm for Non-Negative Matrix Factorization | Kai Zhang, Sheng Zhang, Jun Liu, Jun Wang, Jie Zhang | To address this challenge, we propose an innovative procedure called Greedy Orthogonal Pivoting Algorithm (GOPA). |

754 | Interpreting Adversarially Trained Convolutional Neural Networks | Tianyuan Zhang, Zhanxing Zhu | We design systematic approaches to interpret AT-CNNs in both qualitative and quantitative ways and compare them with normally trained models. To achieve quantitative verification, we construct additional test datasets that destroy either textures or shapes, such as style-transferred versions of clean data, saturated images and patch-shuffled ones, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. |

755 | Adaptive Monte Carlo Multiple Testing via Multi-Armed Bandits | Martin Zhang, James Zou, David Tse | In this paper, we propose Adaptive MC multiple Testing (AMT) to estimate MC p-values and control false discovery rate in multiple testing. |

756 | On Learning Invariant Representations for Domain Adaptation | Han Zhao, Remi Tachet Des Combes, Kun Zhang, Geoffrey Gordon | To give a sufficient condition for domain adaptation, we propose a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift. |

757 | Metric-Optimized Example Weights | Sen Zhao, Mahdi Milani Fard, Harikrishna Narasimhan, Maya Gupta | Motivated by known connections between complex test metrics and cost-weighted learning, we propose addressing these issues by using a weighted loss function with a standard loss, where the weights on the training examples are learned to optimize the test metric on a validation set. |

758 | Improving Neural Network Quantization without Retraining using Outlier Channel Splitting | Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, Zhiru Zhang | In this work, we propose outlier channel splitting (OCS), which duplicates channels containing outliers, then halves the channel values. |

759 | Maximum Entropy-Regularized Multi-Goal Reinforcement Learning | Rui Zhao, Xudong Sun, Volker Tresp | On a set of multi-goal robotic tasks of OpenAI Gym, we compare our method with other baselines and show promising improvements in both performance and sample-efficiency. |

760 | Stochastic Iterative Hard Thresholding for Graph-structured Sparsity Optimization | Baojian Zhou, Feng Chen, Yiming Ying | In this paper, we propose a stochastic gradient-based method for solving graph-structured sparsity constraint problems, not restricted to the least square loss. |

761 | Lower Bounds for Smooth Nonconvex Finite-Sum Optimization | Dongruo Zhou, Quanquan Gu | In this paper, we study the lower bounds for smooth nonconvex finite-sum optimization, where the objective function is the average of $n$ nonconvex component functions. |

762 | Lipschitz Generative Adversarial Nets | Zhiming Zhou, Jiadong Liang, Yuxuan Song, Lantao Yu, Hongwei Wang, Weinan Zhang, Yong Yu, Zhihua Zhang | In this paper we show that generative adversarial networks (GANs) without restriction on the discriminative function space commonly suffer from the problem that the gradient produced by the discriminator is uninformative to guide the generator. |

763 | Toward Understanding the Importance of Noise in Training Neural Networks | Mo Zhou, Tianyi Liu, Yan Li, Dachao Lin, Enlu Zhou, Tuo Zhao | This implies that the noise enables the algorithm to efficiently escape from the spurious local optimum. |

764 | BayesNAS: A Bayesian Approach for Neural Architecture Search | Hongpeng Zhou, Minghao Yang, Jun Wang, Wei Pan | In this paper, we employ the classic Bayesian learning approach to alleviate these two issues by modeling architecture parameters using hierarchical automatic relevance determination (HARD) priors. |

765 | Transferable Clean-Label Poisoning Attacks on Deep Neural Nets | Chen Zhu, W. Ronny Huang, Hengduo Li, Gavin Taylor, Christoph Studer, Tom Goldstein | In this paper, we explore clean-label poisoning attacks on deep convolutional networks with access to neither the network’s output nor its architecture or parameters. |

766 | Improved Dynamic Graph Learning through Fault-Tolerant Sparsification | Chunjiang Zhu, Sabine Storandt, Kam-Yiu Lam, Song Han, Jinbo Bi | We propose a new type of graph sparsification, namely fault-tolerant (FT) sparsification, to significantly reduce the cost to only a constant. |

767 | Poission Subsampled Rényi Differential Privacy | Yuqing Zhu, Yu-Xiang Wang | We consider the problem of privacy amplification by subsampling under the Rényi Differential Privacy framework. |

768 | Learning Classifiers for Target Domain with Limited or No Labels | Pengkai Zhu, Hanxiao Wang, Venkatesh Saligrama | We propose a novel visual attribute encoding method that encodes each image as a low-dimensional probability vector composed of prototypical part-type probabilities. |

769 | The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects | Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, Jinwen Ma | Along this line, we study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. |

770 | Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization | Zhenxun Zhuang, Ashok Cutkosky, Francesco Orabona | In this paper, we propose new surrogate losses to cast the problem of learning the optimal stepsizes for the stochastic optimization of a non-convex smooth objective function onto an online convex optimization problem. |

771 | Latent Normalizing Flows for Discrete Sequences | Zachary Ziegler, Alexander Rush | We propose a VAE-based generative model which jointly learns a normalizing flow-based distribution in the latent space and a stochastic mapping to an observed discrete space. |

772 | Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously | Julian Zimmert, Haipeng Luo, Chen-Yu Wei | We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$. |

773 | Fast Context Adaptation via Meta-Learning | Luisa Zintgraf, Kyriacos Shiarli, Vitaly Kurin, Katja Hofmann, Shimon Whiteson | We propose CAVIA for meta-learning, a simple extension to MAML that is less prone to meta-overfitting, easier to parallelise, and more interpretable. |

774 | Natural Analysts in Adaptive Data Analysis | Tijana Zrnic, Moritz Hardt | In this work, we propose notions of natural analysts that smoothly interpolate between the optimal non-adaptive bounds and the best-known adaptive generalization bounds. |