# Paper Digest: ICML 2020 Highlights

Download ICML-2020-Paper-Digests.pdf – highlights of all ICML 2020 papers. Readers can also browse these highlights on our console, which lets users filter papers by keyword and find related papers and patents.

The International Conference on Machine Learning (ICML) is one of the top machine learning conferences in the world. In 2020 it was held virtually due to the COVID-19 pandemic.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights to quickly get the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn for new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: Paper Digest: ICML 2020 Highlights

No. | Title | Authors | Highlight |
---|---|---|---|

1 | Reverse-engineering deep ReLU networks | David Rolnick; Konrad Kording; | Here, we prove that in fact it is often possible to identify the architecture, weights, and biases of an unknown deep ReLU network by observing only its output. |

2 | My Fair Bandit: Distributed Learning of Max-Min Fairness with Multi-player Bandits | Ilai Bistritz; Tavor Baharav; Amir Leshem; Nicholas Bambos; | We present an algorithm and prove that it is regret optimal up to a log(log T) factor. |

3 | Scalable Differentiable Physics for Learning and Control | Yi-Ling Qiao; Junbang Liang; Vladlen Koltun; Ming Lin; | We develop a scalable framework for differentiable physics that can support a large number of objects and their interactions. |

4 | Generalization to New Actions in Reinforcement Learning | Ayush Jain; Andrew Szot; Joseph Lim; | To approach this problem, we propose a two-stage framework where the agent first infers action representations from acquired action observations and then learns to use these in reinforcement learning with added generalization objectives. |

5 | Randomized Block-Diagonal Preconditioning for Parallel Learning | Celestine Mendler-Dünner; Aurelien Lucchi; | Our main contribution is to demonstrate that the convergence of these methods can significantly be improved by a randomization technique which corresponds to repartitioning coordinates across tasks during the optimization procedure. |

6 | Stochastic Flows and Geometric Optimization on the Orthogonal Group | Krzysztof Choromanski; David Cheikhi; Jared Davis; Valerii Likhosherstov; Achille Nazaret; Achraf Bahamou; Xingyou Song; Mrugank Akarte; Jack Parker-Holder; Jacob Bergquist; YUAN GAO; Aldo Pacchiano; Tamas Sarlos; Adrian Weller; Vikas Sindhwani; | We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group O(d) and naturally reductive homogeneous manifolds obtained from the action of the rotation group SO(d). |

7 | PackIt: A Virtual Environment for Geometric Planning | Ankit Goyal; Jia Deng; | We present PackIt, a virtual environment to evaluate and potentially learn the ability to do geometric planning. We also construct a set of challenging packing tasks using an evolutionary algorithm. |

8 | Soft Threshold Weight Reparameterization for Learnable Sparsity | Aditya Kusupati; Vivek Ramanujan; Raghav Somani; Mitchell Wortsman; Prateek Jain; Sham Kakade; Ali Farhadi; | This work proposes Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. |
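The soft-threshold operator at the heart of STR has a one-line form. A minimal sketch (illustrative only, not the authors' implementation; in the paper the threshold itself is learned per layer):

```python
import numpy as np

def soft_threshold(w, s):
    """Soft-threshold operator: shrink weights toward zero and zero out
    those whose magnitude falls below the threshold s, inducing sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - s, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.6])
print(soft_threshold(w, 0.1))  # small-magnitude weights become exactly zero
```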

9 | Stochastic Latent Residual Video Prediction | Jean-Yves Franceschi; Edouard Delasalles; Mickael Chen; Sylvain Lamprier; Patrick Gallinari; | In this paper, we overcome these difficulties by introducing a novel stochastic temporal model whose dynamics are governed in a latent space by a residual update rule. |

10 | Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise | Umut Simsekli; Lingjiong Zhu; Yee Whye Teh; Mert Gurbuzbalaban; | In this study, we consider a continuous-time variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. |

11 | Context Aware Local Differential Privacy | Jayadev Acharya; Keith Bonawitz; Peter Kairouz; Daniel Ramage; Ziteng Sun; | We propose a context-aware framework for LDP that allows the privacy level to vary across the data domain, enabling system designers to place privacy constraints where they matter without paying the cost where they do not. |

12 | Privately Learning Markov Random Fields | Gautam Kamath; Janardhan Kulkarni; Steven Wu; Huanyu Zhang; | Our learning goals include both structure learning, where we try to estimate the underlying graph structure of the model, as well as the harder goal of parameter learning, in which we additionally estimate the parameter on each edge. |

13 | A Mean Field Analysis Of Deep ResNet And Beyond: Towards Provably Optimization Via Overparameterization From Depth | Yiping Lu; Chao Ma; Yulong Lu; Jianfeng Lu; Lexing Ying; | To understand the success of SGD for training deep neural networks, this work presents a mean-field analysis of deep residual networks, based on a line of works which interpret the continuum limit of the deep residual network as an ordinary differential equation as the network capacity tends to infinity. |

14 | Provable Smoothness Guarantees for Black-Box Variational Inference | Justin Domke; | This paper shows that for location-scale family approximations, if the target is M-Lipschitz smooth, then so is the “energy” part of the variational objective. |

15 | Enhancing Simple Models by Exploiting What They Already Know | Amit Dhurandhar; Karthikeyan Shanmugam; Ronny Luss; | In this paper, we propose a novel method SRatio that can utilize information from high performing complex models (viz. deep neural networks, boosted trees, random forests) to reweight a training dataset for a potentially low performing simple model of much lower complexity such as a decision tree or a shallow network enhancing its performance. |

16 | Fiduciary Bandits | Gal Bahar; Omer Ben-Porat; Kevin Leyton-Brown; Moshe Tennenholtz; | More formally, we introduce a model in which a recommendation system faces an exploration-exploitation tradeoff under the constraint that it can never recommend any action that it knows yields lower reward in expectation than an agent would achieve if it acted alone. |

17 | Training Deep Energy-Based Models with f-Divergence Minimization | Lantao Yu; Yang Song; Jiaming Song; Stefano Ermon; | In this paper, we propose a general variational framework termed f-EBM to train EBMs using any desired f-divergence. |

18 | Progressive Graph Learning for Open-Set Domain Adaptation | Yadan Luo; Zijian Wang; Zi Huang; Mahsa Baktashmotlagh; | More specifically, we introduce an end-to-end Progressive Graph Learning (PGL) framework where a graph neural network with episodic training is integrated to suppress underlying conditional shift and adversarial learning is adopted to close the gap between the source and target distributions. |

19 | Learning De-biased Representations with Biased Representations | Hyojin Bahng; SANGHYUK CHUN; Sangdoo Yun; Jaegul Choo; Seong Joon Oh; | In this work, we propose a novel framework to train a de-biased representation by encouraging it to be *different* from a set of representations that are biased by design. |

20 | Generalized Neural Policies for Relational MDPs | Sankalp Garg; Aniket Bajpai; Mausam; | We present the first neural approach for solving RMDPs, expressed in the probabilistic planning language of RDDL. |

21 | Feature-map-level Online Adversarial Knowledge Distillation | Inseop Chung; SeongUk Park; Kim Jangho; NOJUN KWAK; | Thus in this paper, we propose an online knowledge distillation method that transfers not only the knowledge of the class probabilities but also that of the feature map using the adversarial training framework. |

22 | DRWR: A Differentiable Renderer without Rendering for Unsupervised 3D Structure Learning from Silhouette Images | Zhizhong Han; Chao Chen; Yu-Shen Liu; Matthias Zwicker; | In contrast, here we propose a Differentiable Renderer Without Rendering (DRWR) that omits these steps. |

23 | Towards Accurate Post-training Network Quantization via Bit-Split and Stitching | Peisong Wang; Qiang Chen; Xiangyu He; Jian Cheng; | In this paper, we propose a Bit-Split and Stitching framework for lower-bit post-training quantization with minimal accuracy degradation. |

24 | Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization | Pan Zhou; Xiao-Tong Yuan; | To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems that enjoys provably improved data-size-independent complexity guarantees. |

25 | Reserve Pricing in Repeated Second-Price Auctions with Strategic Bidders | Alexey Drutsa; | We propose a novel algorithm that has strategic regret upper bound of $O(\log\log T)$ for worst-case valuations. |

26 | On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems | Tianyi Lin; Chi Jin; Michael Jordan; | In this paper, we present the complexity results on two-time-scale GDA for solving nonconvex-concave minimax problems, showing that the algorithm can find a stationary point of the function $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ efficiently. |
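The two-time-scale idea is simply that the max-player's step size is much larger than the min-player's. A toy sketch on an assumed illustrative objective f(x, y) = xy - y²/2 (concave in y, with Φ(x) = x²/2 minimized at x = 0; this instance is not from the paper):

```python
# Two-time-scale gradient descent ascent on f(x, y) = x*y - y**2/2.
def grad_x(x, y): return y          # df/dx
def grad_y(x, y): return x - y      # df/dy

x, y = 1.0, -1.0
eta_x, eta_y = 0.01, 0.1            # two time scales: y moves faster than x
for _ in range(5000):
    x -= eta_x * grad_x(x, y)       # descent step for the min-player
    y += eta_y * grad_y(x, y)       # ascent step for the max-player
print(x, y)                         # both drift toward the stationary point (0, 0)
```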

27 | Training Binary Neural Networks through Learning with Noisy Supervision | Kai Han; Yunhe Wang; Yixing Xu; Chunjing Xu; Enhua Wu; Chang Xu; | In contrast to classical hand crafted rules (e.g., hard thresholding) to binarize full-precision neurons, we propose to learn a mapping from full-precision neurons to the target binary ones. |
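The classical hand-crafted baseline that the paper contrasts against can be written in two lines. A sketch (the mean-magnitude scaling is one common convention, not necessarily the paper's):

```python
import numpy as np

def hard_binarize(w):
    """Hard-threshold binarization: replace each weight by its sign,
    scaled by the mean magnitude to preserve the layer's weight scale."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

print(hard_binarize(np.array([0.5, -0.25, 0.75])))
```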

28 | Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization | Geoffrey Negiar; Gideon Dresdner; Alicia Yi-Ting Tsai; Laurent El Ghaoui; Francesco Locatello; Fabian Pedregosa; | We propose a novel Stochastic Frank-Wolfe ($\equiv$ Conditional Gradient) algorithm with fixed batch size tailored to the constrained optimization of a finite sum of smooth objectives. |
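A generic stochastic Frank-Wolfe (conditional gradient) step is easy to sketch; the example below uses an assumed least-squares objective over the probability simplex and the classic 2/(t+2) step size, not the paper's fixed-batch-size variant:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

x = np.ones(5) / 5                                 # start inside the simplex
for t in range(200):
    idx = rng.choice(100, size=10, replace=False)  # stochastic mini-batch
    g = A[idx].T @ (A[idx] @ x - b[idx])           # mini-batch gradient of 0.5*||Ax - b||^2
    v = np.zeros(5)
    v[np.argmin(g)] = 1.0                          # linear minimization oracle: best simplex vertex
    x += 2.0 / (t + 2) * (v - x)                   # convex-combination step keeps x feasible
print(x.sum())                                     # iterates remain on the simplex
```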

29 | Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation | Jian Liang; Dapeng Hu; Jiashi Feng; | In this work we tackle a novel setting where only a trained source model is available and investigate how we can effectively utilize such a model without source data to solve UDA problems. |

30 | Acceleration through spectral density estimation | Fabian Pedregosa; Damien Scieur; | We develop a framework for designing optimal optimization methods in terms of their average-case runtime. |

31 | Graph Structure of Neural Networks | Jiaxuan You; Kaiming He; Jure Leskovec; Saining Xie; | Here we systematically investigate this relationship, via developing a novel graph-based representation of neural networks called relational graph, where computation is specified by rounds of message exchange along the graph structure. |

32 | Optimal Continual Learning has Perfect Memory and is NP-hard | Jeremias Knoblauch; Hisham Husain; Tom Diethe; | Designing CL algorithms that perform reliably and avoid so-called catastrophic forgetting has proven a persistent challenge. The current paper develops a theoretical approach that explains why. |

33 | Clinician-in-the-Loop Decision Making: Reinforcement Learning with Near-Optimal Set-Valued Policies | Shengpu Tang; Aditya Modi; Michael Sjoding; Jenna Wiens; | We propose a model-free, off-policy algorithm based on temporal difference learning and a near-greedy action selection heuristic. |

34 | Computational and Statistical Tradeoffs in Inferring Combinatorial Structures of Ising Model | Ying Jin; Zhaoran Wang; Junwei Lu; | Under the framework of oracle computational model where an algorithm interacts with an oracle that discourses a randomized version of truth, we characterize the computational lower bounds of learning combinatorial structures in polynomial time, under which no algorithms within polynomial-time can distinguish between graphs with and without certain structures. |

35 | On the Number of Linear Regions of Convolutional Neural Networks | Huan Xiong; Lei Huang; Mengyang Yu; Li Liu; Fan Zhu; Ling Shao; | In this paper, we provide several mathematical results needed for studying the linear regions of CNNs, and use them to derive the maximal and average numbers of linear regions for one-layer ReLU CNNs. |

36 | Deep Streaming Label Learning | Zhen Wang; Liu Liu; Dacheng Tao; | In order to fill in these research gaps, we propose a novel deep neural network (DNN) based framework, Deep Streaming Label Learning (DSLL), to classify instances with newly emerged labels effectively. |

37 | From Importance Sampling to Doubly Robust Policy Gradient | Jiawei Huang; Nan Jiang; | Starting from the doubly robust (DR) estimator (Jiang & Li, 2016), we provide a simple derivation of a very general and flexible form of PG, which subsumes the state-of-the-art variance reduction technique (Cheng et al., 2019) as its special case and immediately hints at further variance reduction opportunities overlooked by existing literature. |

38 | Loss Function Search for Face Recognition | Xiaobo Wang; Shuo Wang; Shifeng Zhang; Cheng Chi; Tao Mei; | In this paper, we first analyze that the key to enhance the feature discrimination is actually how to reduce the softmax probability. We then design a unified formulation for the current margin-based softmax losses. |

39 | Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search | Yong Guo; Yaofo Chen; Yin Zheng; Peilin Zhao; Jian Chen; Junzhou Huang; Mingkui Tan; | To alleviate this issue, we propose a curriculum search method that starts from a small search space and gradually incorporates the learned knowledge to guide the search in a large space. |

40 | Automatic Reparameterisation of Probabilistic Programs | Maria Gorinova; Dave Moore; Matthew Hoffman; | This enables new inference algorithms, and we propose two: a simple approach using interleaved sampling and a novel variational formulation that searches over a continuous space of parameterisations. |

41 | Kernel Methods for Cooperative Multi-Agent Learning with Delays | Abhimanyu Dubey; Alex 'Sandy' Pentland; | In this paper, we consider the kernelised contextual bandit problem, where the reward obtained by an agent is an arbitrary linear function of the contexts' images in the related reproducing kernel Hilbert space (RKHS), and a group of agents must cooperate to collectively solve their unique decision problems. |

42 | Robust Multi-Agent Decision-Making with Heavy-Tailed Payoffs | Abhimanyu Dubey; Alex 'Sandy' Pentland; | We propose \textsc{MP-UCB}, a decentralized multi-agent algorithm for the cooperative stochastic bandit that incorporates robust estimation with a message-passing protocol. |

43 | Learning the Valuations of a $k$-demand Agent | Hanrui Zhang; Vincent Conitzer; | We study problems where a learner aims to learn the valuations of an agent by observing which goods he buys under varying price vectors. |

44 | Rigging the Lottery: Making All Tickets Winners | Utku Evci; Trevor Gale; Jacob Menick; Pablo Samuel Castro; Erich Elsen; | In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. |

45 | Active Learning on Attributed Graphs via Graph Cognizant Logistic Regression and Preemptive Query Generation | Florence Regol; Soumyasundar Pal; Yingxue Zhang; Mark Coates; | We propose a novel graph-based active learning algorithm for the task of node classification in attributed graphs. |

46 | Performative Prediction | Juan Perdomo; Tijana Zrnic; Celestine Mendler-Dünner; Moritz Hardt; | We develop a risk minimization framework for performative prediction bringing together concepts from statistics, game theory, and causality. |

47 | On Layer Normalization in the Transformer Architecture | Ruibin Xiong; Yunchang Yang; Di He; Kai Zheng; Shuxin Zheng; Chen Xing; Huishuai Zhang; Yanyan Lan; Liwei Wang; Tie-Yan Liu; | Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. |

48 | The many Shapley values for model explanation | Mukund Sundararajan; Amir Najmi; | In this paper, we use the axiomatic approach to study the differences between some of the many operationalizations of the Shapley value for attribution, and propose a technique called Baseline Shapley (BShap) that is backed by a proper uniqueness result. |

49 | Linear Convergence of Randomized Primal-Dual Coordinate Method for Large-scale Linear Constrained Convex Programming | Daoli Zhu; Lei Zhao; | We propose the randomized primal-dual coordinate (RPDC) method, a randomized coordinate extension of the first-order primal-dual method by Cohen and Zhu, 1984 and Zhao and Zhu, 2019, to solve LCCP. |

50 | New Oracle-Efficient Algorithms for Private Synthetic Data Release | Giuseppe Vietri; Steven Wu; Mark Bun; Thomas Steinke; Grace Tian; | We present three new algorithms for constructing differentially private synthetic data—a sanitized version of a sensitive dataset that approximately preserves the answers to a large collection of statistical queries. |

51 | Oracle Efficient Private Non-Convex Optimization | Seth Neel; Aaron Roth; Giuseppe Vietri; Steven Wu; | This technique augments a given optimization problem (e.g. deriving from an ERM problem) with a random linear term, and then exactly solves it. However, to date, analyses of this approach crucially rely on the convexity and smoothness of the objective function. We give two algorithms that extend this approach substantially. |

52 | Universal Asymptotic Optimality of Polyak Momentum | Damien Scieur; Fabian Pedregosa; | We consider the average-case runtime analysis of algorithms for minimizing quadratic objectives. |

53 | Adversarial Robustness via Runtime Masking and Cleansing | Yi-Hsuan Wu; Chia-Hung Yuan; Shan-Hung (Brandon) Wu; | In this paper, we propose improving the adversarial robustness of a network by leveraging the potentially large test data seen at runtime. |

54 | Implicit Euler Skip Connections: Enhancing Adversarial Robustness via Numerical Stability | Mingjie Li; Lingshen He; Zhouchen Lin; | On this account, we try to address such an issue from the perspective of dynamic system in this work. |

55 | Best Arm Identification for Cascading Bandits in the Fixed Confidence Setting | Zixin Zhong; Wang Chi Cheung; Vincent Tan; | We design and analyze CascadeBAI, an algorithm for finding the best set of K items, also called an arm, within the framework of cascading bandits. |

56 | Robustness to Programmable String Transformations via Augmented Abstract Training | Yuhao Zhang; Aws Albarghouthi; Loris D’Antoni; | To fill this gap, we present a technique to train models that are robust to user-defined string transformations. |

57 | The Complexity of Finding Stationary Points with Stochastic Gradient Descent | Yoel Drori; Ohad Shamir; | We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions. |

58 | Sample Complexity Bounds for 1-bit Compressive Sensing and Binary Stable Embeddings with Generative Priors | Zhaoqiang Liu; Selwyn Gomes; Avtansh Tiwari; Jonathan Scarlett; | Motivated by recent advances in compressive sensing with generative models, where a generative modeling assumption replaces the usual sparsity assumption, we study the problem of 1-bit compressive sensing with generative models. |

59 | Class-Weighted Classification: Trade-offs and Robust Approaches | Ziyu Xu; Chen Dan; Justin Khim; Pradeep Ravikumar; | We consider imbalanced classification, the problem in which a label may have low marginal probability relative to other labels, by weighting losses according to the correct class. |

60 | Neural Architecture Search in a Proxy Validation Loss Landscape | Yanxi Li; Minjing Dong; Yunhe Wang; Chang Xu; | In this paper, we propose to approximate the validation loss landscape by learning a mapping from neural architectures to their corresponding validate losses. |

61 | Almost Tune-Free Variance Reduction | Bingcong Li; Lingda Wang; Georgios B. Giannakis; | This work introduces ‘almost tune-free’ SVRG and SARAH schemes equipped with i) Barzilai-Borwein (BB) step sizes; ii) averaging; and, iii) the inner loop length adjusted to the BB step sizes. |

62 | Uniform Convergence of Rank-weighted Learning | Liu Leqi; Justin Khim; Adarsh Prasad; Pradeep Ravikumar; | In this work, we study a novel notion of L-Risk based on the classical idea of rank-weighted learning. |

63 | Non-autoregressive Translation with Disentangled Context Transformer | Jungo Kasai; James Cross; Marjan Ghazvininejad; Jiatao Gu; | We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts. |

64 | More Information Supervised Probabilistic Deep Face Embedding Learning | Ying Huang; Shangfeng Qiu; Wenwei Zhang; Xianghui Luo; Jinzhuo Wang; | In this paper, we analyse margin based softmax loss in probability view. |

65 | Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism | Wang Chi Cheung; David Simchi-Levi; Ruihao Zhu; | We overcome the challenge by a novel confidence widening technique that incorporates additional optimism. |

66 | Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards | Aadirupa Saha; Pierre Gaillard; Michal Valko; | In this paper, we consider the problem of sleeping bandits with stochastic action sets and adversarial rewards. |

67 | From PAC to Instance-Optimal Sample Complexity in the Plackett-Luce Model | Aadirupa Saha; Aditya Gopalan; | We consider PAC learning a good item from $k$-subsetwise feedback sampled from a Plackett-Luce probability model, with instance-dependent sample complexity performance. |

68 | Reliable Fidelity and Diversity Metrics for Generative Models | Muhammad Ferjad Naeem; Seong Joon Oh; Yunjey Choi; Youngjung Uh; Jaejun Yoo; | In this paper, we show that even the latest version of the precision and recall metrics are not reliable yet. |

69 | Learning Factorized Weight Matrix for Joint Image Filtering | Xiangyu Xu; Yongrui Ma; Wenxiu Sun; | In this work, we propose to learn the weight matrix for joint image filtering. |

70 | Likelihood-free MCMC with Amortized Approximate Ratio Estimators | Joeri Hermans; Volodimir Begy; Gilles Louppe; | This work introduces a novel approach to address the intractability of the likelihood and the marginal model. |

71 | Attacks Which Do Not Kill Training Make Adversarial Learning Stronger | Jingfeng Zhang; Xilie Xu; Bo Han; Gang Niu; Lizhen Cui; Masashi Sugiyama; Mohan Kankanhalli; | In this paper, we raise a fundamental question—do we have to trade off natural generalization for adversarial robustness? |

72 | GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values | Shangtong Zhang; Bo Liu; Shimon Whiteson; | We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. |

73 | Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation | Shangtong Zhang; Bo Liu; Hengshuai Yao; Shimon Whiteson; | We present the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation. |

74 | Adversarial Attacks on Probabilistic Autoregressive Forecasting Models | Raphaël Dang-Nhu; Gagandeep Singh; Pavol Bielik; Martin Vechev; | The key technical challenge we address is how to effectively differentiate through the Monte-Carlo estimation of statistics of the output sequence joint distribution. |

75 | Informative Dropout for Robust Representation Learning: A Shape-bias Perspective | Baifeng Shi; Dinghuai Zhang; Qi Dai; Jingdong Wang; Zhanxing Zhu; Yadong Mu; | In this work, we attempt at improving various kinds of robustness universally by alleviating CNN’s texture bias. |

76 | Graph Convolutional Network for Recommendation with Low-pass Collaborative Filters | Wenhui Yu; Zheng Qin; | To address this gap, we leverage the original graph convolution in GCN and propose a Low-pass Collaborative Filter (LCF) to make it applicable to the large graph. |

77 | SoftSort: A Differentiable Continuous Relaxation of the argsort Operator | Sebastian Prillo; Julian Eisenschlos; | In this work we propose a simple continuous relaxation for the argsort operator. |
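The relaxation can be sketched in a few lines of numpy: each row is a softmax over negative distances between the sorted values and the input, yielding a row-stochastic matrix that approaches the true sorting permutation as the temperature goes to zero. (A sketch of the idea; parameter names and the descending-order convention are assumptions here.)

```python
import numpy as np

def softsort(s, tau=0.1):
    """Continuous relaxation of argsort: row i is softmax over
    -|sorted(s)[i] - s[j]| / tau, so each row concentrates on the index
    of the i-th largest element as tau -> 0."""
    s_sorted = np.sort(s)[::-1]                              # descending sort
    logits = -np.abs(s_sorted[:, None] - s[None, :]) / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True))   # stable softmax
    return e / e.sum(axis=1, keepdims=True)

P = softsort(np.array([0.1, 2.0, 1.0]), tau=0.01)
print(np.round(P))  # near-permutation matrix placing the largest value first
```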

78 | Too Relaxed to Be Fair | Michael Lohaus; Michaël Perrot; Ulrike von Luxburg; | We address the problem of classification under fairness constraints. Given a notion of fairness, the goal is to learn a classifier that is not discriminatory against a group of individuals. |

79 | Lorentz Group Equivariant Neural Network for Particle Physics | Alexander Bogatskiy; Brandon Anderson; Jan Offermann; Marwah Roussi; David Miller; Risi Kondor; | We present a neural network architecture that is fully equivariant with respect to transformations under the Lorentz group, a fundamental symmetry of space and time in physics. |

80 | One-shot Distributed Ridge Regression in High Dimensions | Yue Sheng; Edgar Dobriban; | Here we study a fundamental problem in this area: How to do ridge regression in a distributed computing environment? |

81 | Streaming k-Submodular Maximization under Noise subject to Size Constraint | Lan N. Nguyen; My T. Thai; | In this paper, we investigate a more realistic scenario of this problem that (1) obtaining exact evaluation of an objective function is impractical, instead, its noisy version is acquired; and (2) algorithms are required to take only one single pass over dataset, producing solutions in a timely manner. |

82 | Variational Imitation Learning with Diverse-quality Demonstrations | Voot Tangkaratt; Bo Han; Mohammad Emtiyaz Khan; Masashi Sugiyama; | We propose a new method for imitation learning in such scenarios. |

83 | Task Understanding from Confusing Multi-task Data | Xin Su; Yizhou Jiang; Shangqi Guo; Feng Chen; | We propose Confusing Supervised Learning (CSL) that takes these confusing samples and extracts task concepts by differentiating between these samples. |

84 | Cost-effective Interactive Attention Learning with Neural Attention Process | Jay Heo; Junhyeon Park; Hyewon Jeong; Kwang Joon Kim; Juho Lee; Eunho Yang; Sung Ju Hwang; | We propose a novel interactive learning framework which we refer to as Interactive Attention Learning (IAL), in which the human supervisors interactively manipulate the allocated attentions, to correct the model’s behavior by updating the attention-generating network. |

85 | Channel Equilibrium Networks for Learning Deep Representation | Wenqi Shao; Shitao Tang; Xingang Pan; Ping Tan; Xiaogang Wang; Ping Luo; | Unlike prior arts that simply removed the inhibited channels, we propose to “wake them up” during training by designing a novel neural building block, termed Channel Equilibrium (CE) block, which enables channels at the same layer to contribute equally to the learned representation. |

86 | Optimal Non-parametric Learning in Repeated Contextual Auctions with Strategic Buyer | Alexey Drutsa; | We introduce a novel non-parametric learning algorithm that is horizon-independent and has tight strategic regret upper bound of $\Theta(T^{d/(d+1)})$. |

87 | Topological Autoencoders | Michael Moor; Max Horn; Bastian Rieck; Karsten Borgwardt; | We propose a novel approach for preserving topological structures of the input space in latent representations of autoencoders. |

88 | An Accelerated DFO Algorithm for Finite-sum Convex Functions | Yuwen Chen; Antonio Orvieto; Aurelien Lucchi; | In this work, we exploit the finite-sum structure of the objective to design a variance-reduced DFO algorithm that provably yields an accelerated rate of convergence. |

89 | The Shapley Taylor Interaction Index | Mukund Sundararajan; Kedar Dhamdhere; Ashish Agarwal; | We propose a generalization of the Shapley value called Shapley-Taylor index that attributes the model’s prediction to interactions of subsets of features up to some size $k$. |

90 | Privately detecting changes in unknown distributions | Rachel Cummings; Sara Krehbiel; Yuliia Lut; Wanrong Zhang; | This work develops differentially private algorithms for solving the change-point problem when the data distributions are unknown. |

91 | CAUSE: Learning Granger Causality from Event Sequences using Attribution Methods | Wei Zhang; Thomas Panum; Somesh Jha; Prasad Chalasani; David Page; | To address these weaknesses, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a novel framework for the studied task. |

92 | Efficient Continuous Pareto Exploration in Multi-Task Learning | Pingchuan Ma; Tao Du; Wojciech Matusik; | We present a novel, efficient method that generates locally continuous Pareto sets and Pareto fronts, which opens up the possibility of continuous analysis of Pareto optimal solutions in machine learning problems. |

93 | WaveFlow: A Compact Flow-based Model for Raw Audio | Wei Ping; Kainan Peng; Kexin Zhao; Zhao Song; | In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. |

94 | Multi-Agent Determinantal Q-Learning | Yaodong Yang; Ying Wen; Jun Wang; Liheng Chen; Kun Shao; David Mguni; Weinan Zhang; | Though practical, current methods rely on restrictive assumptions to decompose the centralized value function across agents for execution. In this paper, we eliminate this restriction by proposing multi-agent determinantal Q-learning. |

95 | Revisiting Spatial Invariance with Low-Rank Local Connectivity | Gamaleldin Elsayed; Prajit Ramachandran; Jon Shlens; Simon Kornblith; | To test this hypothesis, we design a method to relax the spatial invariance of a network layer in a controlled manner. |

96 | Minimax Weight and Q-Function Learning for Off-Policy Evaluation | Masatoshi Uehara; Jiawei Huang; Nan Jiang; | Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et.al, 2018), (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value-functions in MWL. |

97 | Tensor denoising and completion based on ordinal observations | Chanwoo Lee; Miaoyan Wang; | We propose a multi-linear cumulative link model, develop a rank-constrained M-estimator, and obtain theoretical accuracy guarantees. |

98 | Learning Human Objectives by Evaluating Hypothetical Behavior | Siddharth Reddy; Anca Dragan; Sergey Levine; Shane Legg; Jan Leike; | We propose an algorithm that safely and efficiently learns a model of the user’s reward function by posing ‘what if?’ |

99 | Counterfactual Cross-Validation: Stable Model Selection Procedure for Causal Inference Models | Yuta Saito; Shota Yasui; | We study the model selection problem in *conditional average treatment effect* (CATE) prediction. |

100 | Learning Efficient Multi-agent Communication: An Information Bottleneck Approach | Rundong Wang; Xu He; Runsheng Yu; Wei Qiu; Bo An; Zinovi Rabinovich; | In this paper, we develop an Informative Multi-Agent Communication (IMAC) method to learn efficient communication protocols as well as scheduling. |

101 | MoNet3D: Towards Accurate Monocular 3D Object Localization in Real Time | Xichuan Zhou; Yicong Peng; Chunqiao Long; Fengbo Ren; Cong Shi; | The MoNet3D algorithm is a novel and effective framework that can predict the 3D position of each object in a monocular image and draw a 3D bounding box for each object. |

102 | SIGUA: Forgetting May Make Learning with Noisy Labels More Robust | Bo Han; Gang Niu; Xingrui Yu; Quanming Yao; Miao Xu; Ivor Tsang; Masashi Sugiyama; | In this paper, to relieve this issue, we propose stochastic integrated gradient underweighted ascent (SIGUA): in a mini-batch, we adopt gradient descent on good data as usual, and learning-rate-reduced gradient ascent on bad data; SIGUA is a versatile approach in which data goodness or badness is defined with respect to desired or undesired memorization under a given base learning method. |

103 | Multinomial Logit Bandit with Low Switching Cost | Kefan Dong; Yingkai Li; Qin Zhang; Yuan Zhou; | We present an anytime algorithm (AT-DUCB) with $O(N \log T)$ assortment switches, almost matching the lower bound $\Omega(\frac{N \log T}{ \log \log T})$. |

104 | Deep Reasoning Networks for Unsupervised Pattern De-mixing with Constraint Reasoning | Di Chen; Yiwei Bai; Wenting Zhao; Sebastian Ament; John Gregoire; Carla Gomes; | We introduce Deep Reasoning Networks (DRNets), an end-to-end framework that combines deep learning with constraint reasoning for solving pattern de-mixing problems, typically in an unsupervised or very-weakly-supervised setting. |

105 | Uncertainty-Aware Lookahead Factor Models for Improved Quantitative Investing | Lakshay Chauhan; John Alberg; Zachary Lipton; | We propose lookahead factor models to act upon these predictions, plugging the predicted future fundamentals into traditional factors. |

106 | On the Unreasonable Effectiveness of the Greedy Algorithm: Greedy Adapts to Sharpness | Sebastian Pokutta; Mohit Singh; Alfredo Torrico; | In this work, we define sharpness for submodular functions as a candidate explanation for this phenomenon. |

107 | Stronger and Faster Wasserstein Adversarial Attacks | Kaiwen Wu; Allen Wang; Yaoliang Yu; | We address this gap in two ways: (a) we develop an exact yet efficient projection operator to enable a stronger projected gradient attack; (b) we show for the first time that the conditional gradient method equipped with a suitable linear minimization oracle works extremely fast under Wasserstein constraints. |

108 | Optimizing Multiagent Cooperation via Policy Evolution and Shared Experiences | Somdeb Majumdar; Shauharda Khadka; Santiago Miret; Stephen Mcaleer; Kagan Tumer; | We introduce Multiagent Evolutionary Reinforcement Learning (MERL), a split-level training platform that handles the two objectives separately through two optimization processes. |

109 | Why Are Learned Indexes So Effective? | Paolo Ferragina; Fabrizio Lillo; Giorgio Vinciguerra; | In this paper, we present the first mathematically-grounded answer to this open problem. |

110 | Fast OSCAR and OWL with Safe Screening Rules | Runxue Bao; Bin Gu; Heng Huang; | To address this challenge, we propose the first safe screening rule for the OWL regularized regression, which effectively avoids the updates of the parameters whose coefficients must be zeros. |

111 | Which Tasks Should Be Learned Together in Multi-task Learning? | Trevor Standley; Amir Zamir; Dawn Chen; Leonidas Guibas; Jitendra Malik; Silvio Savarese; | We systematically study task cooperation and competition and propose a framework for assigning tasks to a few neural networks such that cooperating tasks are computed by the same neural network, while competing tasks are computed by different networks. |

112 | Inertial Block Proximal Methods for Non-Convex Non-Smooth Optimization | Hien Le; Nicolas Gillis; Panagiotis Patrinos; | We propose inertial versions of block coordinate descent methods for solving non-convex non-smooth composite optimization problems. |

113 | Adversarial Neural Pruning with Latent Vulnerability Suppression | Divyam Madaan; Jinwoo Shin; Sung Ju Hwang; | In this paper, we conjecture that the leading cause of this adversarial vulnerability is the distortion in the latent feature space, and provide methods to suppress it effectively. |

114 | Lifted Disjoint Paths with Application in Multiple Object Tracking | Andrea Hornakova; Roberto Henschel; Bodo Rosenhahn; Paul Swoboda; | We present an extension to the disjoint paths problem in which additional lifted edges are introduced to provide path connectivity priors. |

115 | Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks | Agustinus Kristiadi; Matthias Hein; Philipp Hennig; | We theoretically analyze approximate Gaussian posterior distributions on the weights of ReLU networks and show that they fix the overconfidence problem. |

116 | SCAFFOLD: Stochastic Controlled Averaging for Federated Learning | Sai Praneeth Reddy Karimireddy; Satyen Kale; Mehryar Mohri; Sashank Jakkam Reddi; Sebastian Stich; Ananda Theertha Suresh; | As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the ‘client drift’. |

117 | Statistically Preconditioned Accelerated Gradient Method for Distributed Optimization | Hadrien Hendrikx; Lin Xiao; Sebastien Bubeck; Francis Bach; Laurent Massoulié; | In order to reduce the number of communications required to reach a given accuracy, we propose a preconditioned accelerated gradient method where the preconditioning is done by solving a local optimization problem over a subsampled dataset at the server. |

118 | Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification | Hui Ye; Zhiyu Chen; Da-Han Wang; Brian Davison; | Our approach fine-tunes the recently released generalized autoregressive pretraining model (XLNet) to learn the dense representation for the input text. We propose the Adaptive Probabilistic Label Cluster (APLC) to approximate the cross entropy loss by exploiting the unbalanced label distribution to form clusters that explicitly reduce the computational time. |

119 | Frequentist Uncertainty in Recurrent Neural Networks via Blockwise Influence Functions | Ahmed Alaa; Mihaela van der Schaar; | Capitalizing on ideas from classical jackknife resampling, we develop a frequentist alternative that: (a) is computationally efficient, (b) does not interfere with model training or compromise its accuracy, (c) applies to any RNN architecture, and (d) provides theoretical coverage guarantees on the estimated uncertainty intervals. |

120 | Disentangling Trainability and Generalization in Deep Neural Networks | Lechao Xiao; Jeffrey Pennington; Samuel Schoenholz; | In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. |

121 | Moniqua: Modulo Quantized Communication in Decentralized SGD | Yucheng Lu; Christopher De Sa; | In this paper we propose Moniqua, a technique that allows decentralized SGD to use quantized communication. |

122 | Expectation Maximization with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation | Amr Mohamed Alexandari; Anshul Kundaje; Avanti Shrikumar; | We show that by combining EM with a type of calibration we call bias-corrected calibration, we outperform both BBSL and RLLS across diverse datasets and distribution shifts. |

123 | Expert Learning through Generalized Inverse Multiobjective Optimization: Models, Insights and Algorithms | Chaosheng Dong; Bo Zeng; | Leveraging these critical insights and connections, we propose two algorithms to solve IMOP through manifold learning and clustering. |

124 | Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures | Mohamed El Amine Seddik; Cosme Louart; Mohamed Tamaazousti; Romain Couillet; | This paper shows that deep learning (DL) representations of data produced by generative adversarial nets (GANs) are random vectors which fall within the class of so-called \textit{concentrated} random vectors. |

125 | Optimizing Data Usage via Differentiable Rewards | Xinyi Wang; Hieu Pham; Paul Michel; Antonios Anastasopoulos; Jaime Carbonell; Graham Neubig; | To efficiently optimize data usage, we propose a reinforcement learning approach called Differentiable Data Selection (DDS). |

126 | Optimistic Policy Optimization with Bandit Feedback | Lior Shani; Yonathan Efroni; Aviv Rosenberg; Shie Mannor; | In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. |

127 | Maximum-and-Concatenation Networks | Xingyu Xie; Hao Kong; Jianlong Wu; Wayne Zhang; Guangcan Liu; Zhouchen Lin; | In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) that aims to eliminate bad local minima and improve generalization ability. |

128 | Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition | Chi Jin; Tiancheng Jin; Haipeng Luo; Suvrit Sra; Tiancheng Yu; | We propose an efficient algorithm that achieves $\widetilde{O}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ the number of states, $|A|$ the number of actions, and $T$ the number of episodes. |

129 | Kernelized Stein Discrepancy Tests of Goodness-of-fit for Time-to-Event Data | Wenkai Xu; Tamara Fernandez; Nicolas Rivera; Arthur Gretton; | In this paper, we focus on non-parametric Goodness-of-Fit testing procedures based on combining Stein’s method and kernelized discrepancies. |

130 | Efficient Intervention Design for Causal Discovery with Latents | Raghavendra Addanki; Shiva Kasiviswanathan; Andrew McGregor; Cameron Musco; | In particular, we introduce the notion of p-colliders, that are colliders between pairs of nodes arising from a specific type of conditioning in the causal graph, and provide an upper bound on the number of interventions as a function of the maximum number of p-colliders between any two nodes in the causal graph. |

131 | Certified Data Removal from Machine Learning Models | Chuan Guo; Tom Goldstein; Awni Hannun; Laurens van der Maaten; | We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. |

132 | One Size Fits All: Can We Train One Denoiser for All Noise Levels? | Abhiram Gnanasambandam; Stanley Chan; | Why should we allocate the training samples uniformly? Can we have more training samples that are less noisy, and fewer samples that are more noisy? What is the optimal distribution? How do we obtain such an optimal distribution? The goal of this paper is to address these questions. |

133 | GNN-FiLM: Graph Neural Networks with Feature-wise Linear Modulation | Marc Brockschmidt; | This paper presents a new Graph Neural Network (GNN) type using feature-wise linear modulation (FiLM). |

134 | Sparse Gaussian Processes with Spherical Harmonic Features | Vincent Dutordoir; Nicolas Durrande; James Hensman; | We introduce a new class of interdomain variational Gaussian processes (GP) where data is mapped onto the unit hypersphere in order to use spherical harmonic representations. |

135 | Asynchronous Coagent Networks | James Kostas; Chris Nota; Philip Thomas; | In this work, we prove that CPGAs converge to locally optimal policies. |

136 | Adaptive Checkpoint Adjoint Method for Gradient Estimation in Neural ODE | Juntang Zhuang; Nicha Dvornek; Xiaoxiao Li; Sekhar Tatikonda; Xenophon Papademetris; James Duncan; | We propose the Adaptive Checkpoint Adjoint (ACA) method: ACA applies a trajectory checkpoint strategy which records the forward-mode trajectory as the reverse-mode trajectory to guarantee accuracy; ACA deletes redundant components for shallow computation graphs; and ACA supports adaptive solvers. |

137 | Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling | Yao Liu; Pierre-Luc Bacon; Emma Brunskill; | We analyze the variance of the most popular approaches through the viewpoint of conditional Monte Carlo. |

138 | Taylor Expansion Policy Optimization | Yunhao Tang; Michal Valko; Remi Munos; | In this work, we investigate the application of Taylor expansions in reinforcement learning. |

139 | Reinforcement Learning for Integer Programming: Learning to Cut | Yunhao Tang; Shipra Agrawal; Yuri Faenza; | The goal of this work is to show that the performance of those solvers can be greatly enhanced using reinforcement learning (RL). |

140 | Safe Reinforcement Learning in Constrained Markov Decision Processes | Akifumi Wachi; Yanan Sui; | In this paper, we propose an algorithm, SNO-MDP, that explores and optimizes Markov decision processes under unknown safety constraints. |

141 | Layered Sampling for Robust Optimization Problems | Hu Ding; Zixiu Wang; | In this paper, we propose a new variant of coreset technique, {\em layered sampling}, to deal with two fundamental robust optimization problems: {\em $k$-median/means clustering with outliers} and {\em linear regression with outliers}. |

142 | Learning to Encode Position for Transformer with Continuous Dynamical Model | Xuanqing Liu; Hsiang-Fu Yu; Inderjit Dhillon; Cho-Jui Hsieh; | We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models. |

143 | Do RNN and LSTM have Long Memory? | Jingyu Zhao; Feiqing Huang; Jia Lv; Yanjie Duan; Zhen Qin; Guodong Li; Guangjian Tian; | Since the term "long memory" is still not well defined for networks, we propose a new definition of long-memory networks. |

144 | Training Linear Neural Networks: Non-Local Convergence and Complexity Results | Armin Eftekhari; | In this paper, we improve the state of the art in (Bah et al., 2019) by identifying conditions under which gradient flow successfully trains a linear network, in spite of the non-strict saddle points present in the optimization landscape. |

145 | On Validation and Planning of An Optimal Decision Rule with Application in Healthcare Studies | Hengrui Cai; Wenbin Lu; Rui Song; | In this paper, we propose a testing procedure for detecting the existence of an ODR that is better than the naive decision rule under the randomized trials. |

146 | Graph Optimal Transport for Cross-Domain Alignment | Liqun Chen; Zhe Gan; Yu Cheng; Linjie Li; Lawrence Carin; Jingjing Liu; | We propose Graph Optimal Transport (GOT), a principled framework that builds upon recent advances in Optimal Transport (OT). |

147 | Approximation Capabilities of Neural ODEs and Invertible Residual Networks | Han Zhang; Xi Gao; Jacob Unterman; Tomasz Arodz; | We conclude by showing that capping a Neural ODE or an i-ResNet with a single linear layer is sufficient to turn the model into a universal approximator for non-invertible continuous functions. |

148 | Refined bounds for algorithm configuration: The knife-edge of dual class approximability | Nina Balcan; Tuomas Sandholm; Ellen Vitercik; | We investigate a fundamental question about these techniques: how large should the training set be to ensure that a parameter’s average empirical performance over the training set is close to its expected, future performance? |

149 | Teaching with Limited Information on the Learner’s Behaviour | Ferdinando Cicalese; Francisco Sergio de Freitas Filho; Eduardo Laber; Marco Molinaro; | Motivated by the realistic possibility that $h^*$ is not available to the learner, we consider the case where the teacher can only aim at having the learner converge to a best available approximation of $h^*$. |

150 | Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge | Laura Rieger; Chandan Singh; William Murdoch; Bin Yu; | In this paper, we propose contextual decomposition explanation penalization (CDEP), a method which enables practitioners to leverage existing explanation methods to increase the predictive accuracy of a deep learning model. |

151 | DeltaGrad: Rapid retraining of machine learning models | Yinjun Wu; Edgar Dobriban; Susan Davidson; | To address this problem, we propose the DeltaGrad algorithm for rapidly retraining machine learning models based on information cached during the training phase. |

152 | The Cost-free Nature of Optimally Tuning Tikhonov Regularizers and Other Ordered Smoothers | Pierre Bellec; Dana Yang; | We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, of selecting a linear combination of these regularizers that is as good as the best regularizer in the family. |

153 | Approximation Guarantees of Local Search Algorithms via Localizability of Set Functions | Kaito Fujii; | This paper proposes a new framework for providing approximation guarantees of local search algorithms. |

154 | Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent | Yunwen Lei; Yiming Ying; | In this paper, we provide a fine-grained analysis of stability and generalization for SGD by substantially relaxing these assumptions. |

155 | Online Dense Subgraph Discovery via Blurred-Graph Feedback | Yuko Kuroki; Atsushi Miyauchi; Junya Honda; Masashi Sugiyama; | In this paper, we introduce a novel learning problem for dense subgraph discovery in which a learner queries edge subsets rather than only single edges and observes a noisy sum of edge weights in a queried subset. |

156 | LazyIter: A Fast Algorithm for Counting Markov Equivalent DAGs and Designing Experiments | Ali AhmadiTeshnizi; Saber Salehkaleybar; Negar Kiyavash; | We propose a method for efficient iteration over possible MECs given intervention results. |

157 | Perceptual Generative Autoencoders | Zijun Zhang; Ruixiang Zhang; Zongpeng Li; Yoshua Bengio; Liam Paull; | We therefore propose to map both the generated and target distributions to the latent space using the encoder of a standard autoencoder, and train the generator (or decoder) to match the target distribution in the latent space. |

158 | Towards Understanding the Regularization of Adversarial Robustness on Neural Networks | Yuxin Wen; Shuai Li; Kui Jia; | In this work, we study the degradation through the regularization perspective. |

159 | Stochastic Gradient and Langevin Processes | Xiang Cheng; Dong Yin; Peter Bartlett; Michael Jordan; | We prove quantitative convergence rates at which discrete Langevin-like processes converge to the invariant distribution of a related stochastic differential equation. |

160 | ROMA: Multi-Agent Reinforcement Learning with Emergent Roles | Tonghan Wang; Heng Dong; Victor Lesser; Chongjie Zhang; | In this paper, we synergize these two paradigms and propose a role-oriented MARL framework (ROMA). |

161 | Minimax Pareto Fairness: A Multi Objective Perspective | Martin Bertran; Natalia Martinez; Guillermo Sapiro; | In this work we formulate and formally characterize group fairness as a multi-objective optimization problem, where each sensitive group risk is a separate objective. |

162 | Online Pricing with Offline Data: Phase Transition and Inverse Square Law | Jinzhi Bu; David Simchi-Levi; Yunzong Xu; | We study a single-product dynamic pricing problem over a selling horizon of T periods. |

163 | Explicit Gradient Learning for Black-Box Optimization | Elad Sarafian; Mor Sinay; Yoram Louzoun; Noa Agmon; Sarit Kraus; | Here we present a BBO method, termed Explicit Gradient Learning (EGL), that is designed to optimize high-dimensional ill-behaved functions. |

164 | Optimization and Analysis of the pAp@k Metric for Recommender Systems | Gaurush Hiranandani; Warut Vijitbenjaronk; Sanmi Koyejo; Prateek Jain; | In this paper, we analyze the learning-theoretic properties of pAp@k and propose novel surrogates that are consistent under certain data regularity conditions. |

165 | When Explanations Lie: Why Many Modified BP Attributions Fail | Leon Sixt; Maximilian Granz; Tim Landgraf; | We analyze an extensive set of modified BP methods: Deep Taylor Decomposition, Layer-wise Relevance Propagation (LRP), Excitation BP, PatternAttribution, DeepLIFT, Deconv, RectGrad, and Guided BP. We find empirically that the explanations of all mentioned methods, except for DeepLIFT, are independent of the parameters of later layers. We provide theoretical insights for this surprising behavior and also analyze why DeepLIFT does not suffer from this limitation. |

166 | Naive Exploration is Optimal for Online LQR | Max Simchowitz; Dylan Foster; | We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. |

167 | Learning Structured Latent Factors from Dependent Data: A Generative Model Framework from Information-Theoretic Perspective | Ruixiang Zhang; Katsuhiko Ishiguro; Masanori Koyama; | In this paper, we present a novel framework for learning generative models with various underlying structures in the latent space. |

168 | Implicit Generative Modeling for Efficient Exploration | Neale Ratzlaff; Qinxun Bai; Fuxin Li; Wei Xu; | In this work, we introduce an exploration approach based on a novel implicit generative modeling algorithm to estimate a Bayesian uncertainty of the agent’s belief of the environment dynamics. |

169 | Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control | Jie Xu; Yunsheng Tian; Pingchuan Ma; Daniela Rus; Shinjiro Sueda; Wojciech Matusik; | In this work, we propose an efficient evolutionary learning algorithm to find the Pareto set approximation for continuous robot control problems, by extending a state-of-the-art RL algorithm and presenting a novel prediction model to guide the learning process. |

170 | Goodness-of-Fit Tests for Inhomogeneous Random Graphs | Soham Dan; Bhaswar B. Bhattacharya; | In this paper we consider the goodness-of-fit testing problem for large inhomogeneous Erdős-Rényi (IER) random graphs, where given a (known) reference symmetric matrix $Q \in [0, 1]^{n \times n}$ and $m$ independent samples from an IER graph given by an unknown symmetric matrix $P \in [0, 1]^{n \times n}$, the goal is to test the hypothesis $P=Q$ versus $||P-Q|| \geq \varepsilon$, where $||\cdot||$ is some specified norm on symmetric matrices. |

171 | Few-shot Domain Adaptation by Causal Mechanism Transfer | Takeshi Teshima; Issei Sato; Masashi Sugiyama; | We take the structural equations in causal modeling as an example and propose a novel DA method, which is shown to be useful both theoretically and experimentally. |

172 | Adaptive Adversarial Multi-task Representation Learning | Yuren Mao; Weiwei Liu; Xuemin Lin; | Based on the duality, we propose a novel adaptive AMTRL algorithm that improves the performance of original AMTRL methods. |

173 | Streaming Submodular Maximization under a k-Set System Constraint | Ran Haba; Ehsan Kazemi; Moran Feldman; Amin Karbasi; | In this paper, we propose a novel framework that converts streaming algorithms for monotone submodular maximization into streaming algorithms for non-monotone submodular maximization. |

174 | A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton | Risheng Liu; Pan Mu; Xiaoming Yuan; Shangzhi Zeng; Jin Zhang; | To address this critical issue, a new method, named Bi-level Descent Aggregation (BDA), is proposed, aiming to broaden the application horizon of first-order schemes for BLPs. |

175 | Optimal approximation for unconstrained non-submodular minimization | Marwa El Halabi; Stefanie Jegelka; | We show how these relations can be extended to obtain approximation guarantees for minimizing non-submodular functions, characterized by how close the function is to submodular. |

176 | Generating Programmatic Referring Expressions via Program Synthesis | Jiani Huang; Calvin Smith; Osbert Bastani; Rishabh Singh; Aws Albarghouthi; Mayur Naik; | We propose a neurosymbolic program synthesis algorithm that combines a policy neural network with enumerative search to generate such relational programs. |

177 | Nearly Linear Row Sampling Algorithm for Quantile Regression | Yi Li; Ruosong Wang; Lin Yang; Hanrui Zhang; | Our main technical contribution is to show that Lewis weights sampling, which has been used in row sampling algorithms for $\ell_p$ norms, can also be applied in row sampling algorithms for a variety of loss functions. |

178 | On Leveraging Pretrained GANs for Generation with Limited Data | Miaoyun Zhao; Yulai Cong; Lawrence Carin; | To facilitate this, we leverage existing GAN models pretrained on large-scale datasets (like ImageNet) to introduce additional knowledge (which may not exist within the limited data), following the concept of transfer learning. |

179 | More Data Can Expand The Generalization Gap Between Adversarially Robust and Standard Models | Lin Chen; Yifei Min; Mingrui Zhang; Amin Karbasi; | However, we study the training of robust classifiers for both Gaussian and Bernoulli models under $\ell_\infty$ attacks, and we prove that more data may actually increase this gap. |

180 | Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation | Nathan Kallus; Masatoshi Uehara; | We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. |

181 | Statistically Efficient Off-Policy Policy Gradients | Nathan Kallus; Masatoshi Uehara; | In this paper, we consider the efficient estimation of policy gradients from off-policy data, where the estimation is particularly non-trivial. |

182 | Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training | Xuxi Chen; Wuyang Chen; Tianlong Chen; Ye Yuan; Chen Gong; Kewei Chen; Zhangyang Wang; | This motivates us to propose a novel Self-PU learning framework, which seamlessly integrates PU learning and self-training. |

183 | When Does Self-Supervision Help Graph Convolutional Networks? | Yuning You; Tianlong Chen; Zhangyang Wang; Yang Shen; | In this study, we report the first systematic exploration and assessment of incorporating self-supervision into GCNs. |

184 | On Differentially Private Stochastic Convex Optimization with Heavy-tailed Data | Di Wang; Hanshen Xiao; Srinivas Devadas; Jinhui Xu; | In this paper, we consider the problem of designing Differentially Private (DP) algorithms for Stochastic Convex Optimization (SCO) on heavy-tailed data. |

185 | Variance Reduced Coordinate Descent with Acceleration: New Method With a Surprising Application to Finite-Sum Problems | Filip Hanzely; Dmitry Kovalev; Peter Richtarik; | We propose an accelerated version of stochastic variance reduced coordinate descent — ASVRCD. |

186 | Stochastic Subspace Cubic Newton Method | Filip Hanzely; Nikita Doikov; Yurii Nesterov; Peter Richtarik; | In this paper, we propose a new randomized second-order optimization algorithm—Stochastic Subspace Cubic Newton (SSCN)—for minimizing a high dimensional convex function $f$. |

187 | Ready Policy One: World Building Through Active Learning | Philip Ball; Jack Parker-Holder; Aldo Pacchiano; Krzysztof Choromanski; Stephen Roberts; | In this paper we introduce Ready Policy One (RP1), a framework that views MBRL as an active learning problem, where we aim to improve the world model in the fewest samples possible. |

188 | Structural Language Models of Code | Uri Alon; Roy Sadaka; Omer Levy; Eran Yahav; | We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree – structural language modeling (SLM). |

189 | PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization | Jingqing Zhang; Yao Zhao; Mohammad Saleh; Peter Liu; | In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. |

190 | Aggregation of Multiple Knockoffs | Tuan-Binh Nguyen; Jerome-Alexis Chevalier; Bertrand Thirion; Sylvain Arlot; | We develop an extension of the knockoff inference procedure, introduced by Barber & Candes (2015). |

191 | Off-Policy Actor-Critic with Shared Experience Replay | Simon Schmitt; Matteo Hessel; Karen Simonyan; | We investigate the combination of actor-critic reinforcement learning algorithms with a uniform large-scale experience replay and propose solutions for two ensuing challenges: (a) efficient actor-critic learning with experience replay, and (b) the stability of off-policy learning where agents learn from other agents’ behaviour. |

192 | Graph-based Nearest Neighbor Search: From Practice to Theory | Liudmila Prokhorenkova; Aleksandr Shekhovtsov; | In this work, we fill this gap and rigorously analyze the performance of graph-based NNS algorithms, specifically focusing on the low-dimensional (d << log n) regime. |

193 | Policy Teaching via Environment Poisoning: Training-time Adversarial Attacks against Reinforcement Learning | Amin Rakhsha; Goran Radanovic; Rati Devidze; Jerry Zhu; Adish Singla; | We propose an optimization framework for finding an optimal stealthy attack for different measures of attack cost. |

194 | Semismooth Newton Algorithm for Efficient Projections onto $\ell_{1, \infty}$-norm Ball | Dejun Chu; Changshui Zhang; Shiliang Sun; Qing Tao; | In this paper, we propose an efficient algorithm for Euclidean projection onto $\ell_{1, \infty}$-norm ball. |

195 | Influenza Forecasting Framework based on Gaussian Processes | Christoph Zimmer; Reza Yaesoubi; | Here, we propose a new framework based on Gaussian process (GP) for seasonal epidemics forecasting and demonstrate its capability on the CDC reference data on influenza like illness: our framework leads to accurate forecasts with small but reliable uncertainty estimation. |

196 | Unique Properties of Wide Minima in Deep Networks | Rotem Mulayoff; Tomer Michaeli; | In this paper, we characterize the wide minima in linear neural networks trained with a quadratic loss. |

197 | Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making | Chengchun Shi; Runzhe Wan; Rui Song; Wenbin Lu; Ling Leng; | In this paper, we propose a novel Forward-Backward Learning procedure to test MA in sequential decision making. |

198 | LTF: A Label Transformation Framework for Correcting Label Shift | Jiaxian Guo; Mingming Gong; Tongliang Liu; Kun Zhang; Dacheng Tao; | In this paper, we propose an end-to-end Label Transformation Framework (LTF) for correcting label shift, which implicitly models the shift of $P_Y$ and the conditional distribution $P_{X|Y}$ using neural networks. |

199 | Divide, Conquer, and Combine: a New Inference Strategy for Probabilistic Programs with Stochastic Support | Yuan Zhou; Hongseok Yang; Yee Whye Teh; Tom Rainforth; | To address this, we introduce a new inference framework: Divide, Conquer, and Combine, which remains efficient for such models, and show how it can be implemented as an automated and general-purpose PPS inference engine. |

200 | Duality in RKHSs with Infinite Dimensional Outputs: Application to Robust Losses | Pierre Laforgue; Alex Lambert; Luc Brogat-Motte; Florence d’Alche-Buc; | To overcome this limitation, this paper develops a duality approach that allows to solve OVK machines for a wide range of loss functions. |

201 | Causal Effect Estimation and Optimal Dose Suggestions in Mobile Health | Liangyu Zhu; Wenbin Lu; Rui Song; | In this article, we propose novel structural nested models to estimate causal effects of continuous treatments based on mobile health data. |

202 | Towards Understanding the Dynamics of the First-Order Adversaries | Zhun Deng; Hangfeng He; Jiaoyang Huang; Weijie Su; | In this paper, we analyze the dynamics of the maximization step towards understanding the experimentally observed effectiveness of this defense mechanism. |

203 | Interpreting Robust Optimization via Adversarial Influence Functions | Zhun Deng; Cynthia Dwork; Jialiang Wang; Linjun Zhang; | In this paper, inspired by the influence function in robust statistics, we introduce the Adversarial Influence Function (AIF) as a tool to investigate the solution produced by robust optimization. |

204 | Multilinear Latent Conditioning for Generating Unseen Attribute Combinations | Markos Georgopoulos; Grigorios Chrysos; Yannis Panagakis; Maja Pantic; | To this end, we extend the cVAE by introducing a multilinear latent conditioning framework. |

205 | No-Regret Exploration in Goal-Oriented Reinforcement Learning | Jean Tarbouriech; Evrard Garcelon; Michal Valko; Matteo Pirotta; Alessandro Lazaric; | In this paper, we study the general SSP problem with no assumption on its dynamics (some policies may actually never reach the goal). |

206 | OPtions as REsponses: Grounding behavioural hierarchies in multi-agent reinforcement learning | Alexander Vezhnevets; Yuhuai Wu; Maria Eckstein; Rémi Leblond; Joel Z Leibo; | This paper investigates generalisation in multi-agent games, where the generality of the agent can be evaluated by playing against opponents it hasn’t seen during training. |

207 | Feature Noise Induces Loss Discrepancy Across Groups | Fereshte Khani; Percy Liang; | In this work, we point to a more subtle source of loss discrepancy—feature noise. |

208 | Reinforcement Learning for Molecular Design Guided by Quantum Mechanics | Gregor Simm; Robert Pinsler; Jose Miguel Hernandez-Lobato; | To address this, we present a novel RL formulation for molecular design in 3D space, thereby extending the class of molecules that can be built. |

209 | Small-GAN: Speeding up GAN Training using Core-Sets | Samarth Sinha; Han Zhang; Anirudh Goyal; Yoshua Bengio; Hugo Larochelle; Augustus Odena; | Thus, it would be nice if there were some trick by which we could generate batches that were effectively big though small in practice. In this work, we propose such a trick, inspired by the use of Coreset-selection in active learning. |

210 | Conditional gradient methods for stochastically constrained convex minimization | Maria-Luiza Vladarean; Ahmet Alacaoglu; Ya-Ping Hsieh; Volkan Cevher; | We propose two novel conditional gradient-based methods for solving structured stochastic convex optimization problems with a large number of linear constraints. |

211 | Undirected Graphical Models as Approximate Posteriors | Arash Vahdat; Evgeny Andriyash; William Macready; | We develop an efficient method to train undirected approximate posteriors by showing that the gradient of the training objective with respect to the parameters of the undirected posterior can be computed by backpropagation through Markov chain Monte Carlo updates. |

212 | Dynamics of Deep Neural Networks and Neural Tangent Hierarchy | Jiaoyang Huang; Horng-Tzer Yau; | In the current paper, we study the dynamic of the NTK for finite width deep fully-connected neural networks. |

213 | Measuring Non-Expert Comprehension of Machine Learning Fairness Metrics | Debjani Saha; Candice Schumann; Duncan McElfresh; John Dickerson; Michelle Mazurek; Michael Tschantz; | We take initial steps toward bridging this gap between ML researchers and the public, by addressing the question: does a lay audience understand a basic definition of ML fairness? |

214 | Encoding Musical Style with Transformer Autoencoders | Kristy Choi; Curtis Hawthorne; Ian Simon; Monica Dinculescu; Jesse Engel; | In this work, we present the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. |

215 | Min-Max Optimization without Gradients: Convergence and Applications to Black-Box Evasion and Poisoning Attacks | Sijia Liu; Songtao Lu; Xiangyi Chen; Yao Feng; Kaidi Xu; Abdullah Al-Dujaili; Mingyi Hong; Una-May O’Reilly; | In this paper, we study the problem of constrained min-max optimization in a black-box setting, where the desired optimizer cannot access the gradients of the objective function but may query its values. |

216 | ConQUR: Mitigating Delusional Bias in Deep Q-Learning | DiJia Su; Jayden Ooi; Tyler Lu; Dale Schuurmans; Craig Boutilier; | In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. |

217 | Self-Modulating Nonparametric Event-Tensor Factorization | Zheng Wang; Xinqi Chu; Shandian Zhe; | To overcome these limitations, we propose a self-modulating nonparametric Bayesian factorization model. |

218 | Extreme Multi-label Classification from Aggregated Labels | Yanyao Shen; Hsiang-Fu Yu; Sujay Sanghavi; Inderjit Dhillon; | We develop a new and scalable algorithm to impute individual-sample labels from the group labels; this can be paired with any existing XMC method to solve the aggregated label problem. |

219 | Full Law Identification In Graphical Models Of Missing Data: Completeness Results | Razieh Nabi; Rohit Bhattacharya; Ilya Shpitser; | In this paper, we address the longstanding question of the characterization of models that are identifiable within this class of missing data distributions. |

220 | Self-Attentive Associative Memory | Hung Le; Truyen Tran; Svetha Venkatesh; | In this paper, we propose to separate the storage of individual experiences (item memory) and their occurring relationships (relational memory). |

221 | Imputer: Sequence Modelling via Imputation and Dynamic Programming | William Chan; Chitwan Saharia; Geoffrey Hinton; Mohammad Norouzi; Navdeep Jaitly; | This paper presents the Imputer, a neural sequence model that generates output sequences iteratively via imputations. |

222 | Continuously Indexed Domain Adaptation | Hao Wang; Hao He; Dina Katabi; | In this paper, we propose the first method for continuously indexed domain adaptation. |

223 | Evolving Machine Learning Algorithms From Scratch | Esteban Real; Chen Liang; David So; Quoc Le; | Our goal is to show that AutoML can go further: it is possible today to automatically discover complete machine learning algorithms just using basic mathematical operations as building blocks. |

224 | Self-Attentive Hawkes Process | Qiang Zhang; Aldo Lipani; Omer Kirnap; Emine Yilmaz; | This study attempts to fill the gap by designing a self-attentive Hawkes process (SAHP). |

225 | On hyperparameter tuning in general clustering problems | Xinjie Fan; Yuguang Yue; Purnamrita Sarkar; Y. X. Rachel Wang; | In this paper, we provide an overarching framework with provable guarantees for tuning hyperparameters in the above class of problems under two different models. |

226 | Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks | Zhishuai Guo; Mingrui Liu; Zhuoning Yuan; Li Shen; Wei Liu; Tianbao Yang; | In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as a predictive model. |

227 | Adaptive Region-Based Active Learning | Corinna Cortes; Giulia DeSalvo; Claudio Gentile; Mehryar Mohri; Ningshan Zhang; | We present a new active learning algorithm that adaptively partitions the input space into a finite number of regions, and subsequently seeks a distinct predictor for each region, while actively requesting labels. |

228 | Robust Outlier Arm Identification | Yinglun Zhu; Sumeet Katariya; Robert Nowak; | We propose two computationally efficient delta-PAC algorithms for ROAI, which includes the first UCB-style algorithm for outlier detection, and derive upper bounds on their sample complexity. |

229 | Provably Efficient Exploration in Policy Optimization | Qi Cai; Zhuoran Yang; Chi Jin; Zhaoran Wang; | To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. |

230 | Striving for simplicity and performance in off-policy DRL: Output Normalization and Non-Uniform Sampling | Che Wang; Yanqiu Wu; Quan Vuong; Keith Ross; | With this insight, we propose a streamlined algorithm with a simple normalization scheme or with inverted gradients. |

231 | Multidimensional Shape Constraints | Maya Gupta; Erez Louidor; Oleksandr Mangylov; Nobu Morioka; Tamann Narayan; Sen Zhao; | We propose new multi-input shape constraints across four intuitive categories: complements, diminishers, dominance, and unimodality constraints. |

232 | Fast Deterministic CUR Matrix Decomposition with Accuracy Assurance | Yasutoshi Ida; Sekitoshi Kanai; Yasuhiro Fujiwara; Tomoharu Iwata; Koh Takeuchi; Hisashi Kashima; | This paper proposes a fast deterministic CUR matrix decomposition. |

233 | Operation-Aware Soft Channel Pruning using Differentiable Masks | Minsoo Kang; Bohyung Han; | We propose a simple but effective data-driven channel pruning algorithm, which compresses deep neural networks effectively by exploiting the characteristics of operations in a differentiable way. |

234 | Normalized Loss Functions for Deep Learning with Noisy Labels | Xingjun Ma; Hanxun Huang; Yisen Wang; Simone Romano; Sarah Erfani; James Bailey; | In this paper, we theoretically show that, by applying a simple normalization, any loss can be made robust to noisy labels. |

235 | Learning Deep Kernels for Non-Parametric Two-Sample Tests | Feng Liu; Wenkai Xu; Jie Lu; Guangquan Zhang; Arthur Gretton; D.J. Sutherland; | We propose a class of kernel-based two-sample tests, which aim to determine whether two sets of samples are drawn from the same distribution. |

236 | DeBayes: a Bayesian method for debiasing network embeddings | Maarten Buyl; Tijl De Bie; | We thus propose DeBayes: a conceptually elegant Bayesian method that is capable of learning debiased embeddings by using a biased prior. |

237 | Principled learning method for Wasserstein distributionally robust optimization with local perturbations | Yongchan Kwon; Wonyoung Kim; Joong-Ho Won; Myunghee Cho Paik; | In this paper, we propose a minimizer based on a novel approximation theorem and provide the corresponding risk consistency results. |

238 | Low-Variance and Zero-Variance Baselines for Extensive-Form Games | Trevor Davis; Martin Schmid; Michael Bowling; | In this paper, we extend recent work that uses baseline estimates to reduce this variance. |

239 | Converging to Team-Maxmin Equilibria in Zero-Sum Multiplayer Games | Youzhi Zhang; Bo An; | This paper focuses on computing Team-Maxmin Equilibria (TMEs), an important solution concept for zero-sum multiplayer games in which players in a team, sharing the same utility function, play independently against an adversary. |

240 | Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks | Alexander Shevchenko; Marco Mondelli; | In this paper, we shed light on this phenomenon: we show that the combination of stochastic gradient descent (SGD) and over-parameterization makes the landscape of multilayer neural networks approximately connected and thus more favorable to optimization. |

241 | Leveraging Frequency Analysis for Deep Fake Image Recognition | Joel Frank; Thorsten Eisenhofer; Lea Schönherr; Dorothea Kolossa; Thorsten Holz; Asja Fischer; | This paper addresses this shortcoming, and our results reveal that, in frequency space, GAN-generated images exhibit severe artifacts that can be easily identified. |

242 | Tails of Lipschitz Triangular Flows | Priyank Jaini; Ivan Kobyzev; Yaoliang Yu; Marcus Brubaker; | We investigate the ability of popular flow models to capture tail-properties of a target density by studying the increasing triangular maps used in these flow methods acting on a tractable source density. |

243 | Deep Coordination Graphs | Wendelin Boehmer; Vitaly Kurin; Shimon Whiteson; | This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. |

244 | Voice Separation with an Unknown Number of Multiple Speakers | Eliya Nachmani; Yossi Adi; Lior Wolf; | We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. |

245 | Predicting Choice with Set-Dependent Aggregation | Nir Rosenfeld; Kojin Oshiba; Yaron Singer; | Here we propose a learning framework for predicting choice that is accurate, versatile, and theoretically grounded. |

246 | Thompson Sampling Algorithms for Mean-Variance Bandits | Qiuyu Zhu; Vincent Tan; | We develop Thompson Sampling-style algorithms for mean-variance MAB and provide comprehensive regret analyses for Gaussian and Bernoulli bandits with fewer assumptions. |

247 | Differentiable Likelihoods for Fast Inversion of ‘Likelihood-Free’ Dynamical Systems | Hans Kersting; Nicholas Krämer; Martin Schiegg; Christian Daniel; Michael Schober; Philipp Hennig; | To address this shortcoming, we employ Gaussian ODE filtering (a probabilistic numerical method for ODEs) to construct a local Gaussian approximation to the likelihood. |

248 | Debiased Sinkhorn barycenters | Hicham Janati; Marco Cuturi; Alexandre Gramfort; | Here we show how this bias is tightly linked to the reference measure that defines the entropy regularizer, and propose debiased Sinkhorn barycenters that preserve the best of both worlds: fast Sinkhorn-like iterations without entropy smoothing. |

249 | Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime | Stéphane d’Ascoli; Maria Refinetti; Giulio Biroli; Florent Krzakala; | In this work, we develop a quantitative theory for this phenomenon in the so-called lazy learning regime of neural networks, by considering the problem of learning a high-dimensional function with random features regression. |

250 | Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills | Victor Campos; Alexander Trott; Caiming Xiong; Richard Socher; Xavier Giro-i-Nieto; Jordi Torres; | In light of this, we propose *Explore, Discover and Learn* (EDL), an alternative approach to information-theoretic skill discovery. |

251 | Sparsified Linear Programming for Zero-Sum Equilibrium Finding | Brian Zhang; Tuomas Sandholm; | In this paper we present a totally different approach to the problem, which is competitive and often orders of magnitude better than the prior state of the art. |

252 | Extra-gradient with player sampling for faster convergence in n-player games | Samy Jelassi; Carles Domingo-Enrich; Damien Scieur; Arthur Mensch; Joan Bruna; | In this paper, we analyse a new extra-gradient method for Nash equilibrium finding, that performs gradient extrapolations and updates on a random subset of players at each iteration. |

253 | Entropy Minimization In Emergent Languages | Eugene Kharitonov; Rahma Chaabouni; Diane Bouchacourt; Marco Baroni; | We investigate here the information-theoretic complexity of such languages, focusing on the basic two-agent, one-exchange setup. |

254 | Spectral Clustering with Graph Neural Networks for Graph Pooling | Filippo Maria Bianchi; Daniele Grattarola; Cesare Alippi; | In this paper, we propose a graph clustering approach that addresses these limitations of SC. |

255 | VFlow: More Expressive Generative Flows with Variational Data Augmentation | Jianfei Chen; Cheng Lu; Biqi Chenli; Jun Zhu; Tian Tian; | In this work, we study a previously overlooked constraint that all the intermediate representations must have the same dimensionality as the data due to invertibility, limiting the width of the network. |

256 | Fully Parallel Hyperparameter Search: Reshaped Space-Filling | Marie-Liesse Cauwet; Camille Couprie; Julien Dehos; Pauline Luc; Jeremy Rapin; Morgane Riviere; Fabien Teytaud; Olivier Teytaud; Nicolas Usunier; | Consequently, we introduce a new sampling approach based on the reshaping of the search distribution, and we show both theoretically and numerically that it leads to significant gains over random search. |

257 | Discount Factor as a Regularizer in Reinforcement Learning | Ron Amit; Kamil Ciosek; Ron Meir; | It is known that applying RL algorithms with a discount set lower than the evaluation discount factor can act as a regularizer, improving performance in the limited data regime. Yet the exact nature of this regularizer has not been investigated. In this work, we fill in this gap. |

258 | On Learning Sets of Symmetric Elements | Haggai Maron; Or Litany; Gal Chechik; Ethan Fetaya; | In this paper, we present a principled approach to learning sets of general symmetric elements. |

259 | Non-convex Learning via Replica Exchange Stochastic Gradient MCMC | Wei Deng; Qi Feng; Liyao Gao; Faming Liang; Guang Lin; | In this paper, we propose an adaptive replica exchange SG-MCMC (reSG-MCMC) to automatically correct the bias and study the corresponding properties. |

260 | Learning Similarity Metrics for Numerical Simulations | Georg Kohl; Kiwon Um; Nils Thuerey; | We propose a neural network-based approach that computes a stable and generalizing metric (LSiM) to compare data from a variety of numerical simulation sources. |

261 | FR-Train: A mutual information-based approach to fair and robust training | Yuji Roh; Kangwook Lee; Steven Whang; Changho Suh; | To fix this problem, we propose FR-Train, which holistically performs fair and robust model training. |

262 | Real-Time Optimisation for Online Learning in Auctions | Lorenzo Croissant; Marc Abeille; Clément Calauzènes; | In this paper, we provide the first algorithm for online learning of monopoly prices in online auctions whose update is constant in time and memory. |

263 | Graph Random Neural Features for Distance-Preserving Graph Representations | Daniele Zambon; Cesare Alippi; Lorenzo Livi; | We present Graph Random Neural Features (GRNF), a novel embedding method from graph-structured data to real vectors based on a family of graph neural networks. |

264 | Modulating Surrogates for Bayesian Optimization | Erik Bodin; Markus Kaiser; Ieva Kazlauskaite; Zhenwen Dai; Neill Campbell; Carl Henrik Ek; | We address this issue by proposing surrogate models that focus on the well-behaved structure in the objective function, which is informative for search, while ignoring detrimental structure that is challenging to model from few observations. |

265 | Convolutional Kernel Networks for Graph-Structured Data | Dexiong Chen; Laurent Jacob; Julien Mairal; | We introduce a family of multilayer graph kernels and establish new links between graph convolutional neural networks and kernel methods. |

266 | Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: Joint Gradient Estimation and Tracking | Haoran Sun; Songtao Lu; Mingyi Hong; | In particular, we propose an algorithm named D-GET (decentralized gradient estimation and tracking), which jointly performs decentralized gradient estimation (which estimates the local gradient using a subset of local samples) *and* gradient tracking (which tracks the global full gradient using local estimates). |

267 | Proper Network Interpretability Helps Adversarial Robustness in Classification | Akhilan Boopathy; Sijia Liu; Gaoyuan Zhang; Cynthia Liu; Pin-Yu Chen; Shiyu Chang; Luca Daniel; | In this paper, we theoretically show that with a proper measurement of interpretation, it is actually difficult to prevent prediction-evasion adversarial attacks from causing interpretability discrepancy, as confirmed by experiments on MNIST, CIFAR-10 and Restricted ImageNet. |

268 | Generalization Guarantees for Sparse Kernel Approximation with Entropic Optimal Features | Liang Ding; Rui Tuo; Shahin Shahrampour; | In this paper, in lieu of commonly used kernel expansion with respect to $N$ inputs, we develop a novel optimal design maximizing the entropy among kernel features. |

269 | Understanding the Impact of Model Incoherence on Convergence of Incremental SGD with Random Reshuffle | Shaocong Ma; Yi Zhou; | In this work, we introduce model incoherence to characterize the diversity of model characteristics and study its impact on the convergence of SGD with random reshuffle under weak strong convexity. |

270 | Learning Opinions in Social Networks | Vincent Conitzer; Debmalya Panigrahi; Hanrui Zhang; | We study the problem of learning opinions in social networks. |

271 | Latent Variable Modelling with Hyperbolic Normalizing Flows | Joey Bose; Ariella Smofsky; Renjie Liao; Prakash Panangaden; Will Hamilton; | To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. |

272 | StochasticRank: Global Optimization of Scale-Free Discrete Functions | Aleksei Ustimenko; Liudmila Prokhorenkova; | In this paper, we introduce a powerful and efficient framework for the direct optimization of ranking metrics. |

273 | Working Memory Graphs | Ricky Loynd; Roland Fernandez; Asli Celikyilmaz; Adith Swaminathan; Matthew Hausknecht; | We present the Working Memory Graph (WMG), an agent that employs multi-head self-attention to reason over a dynamic set of vectors representing observed and recurrent state. |

274 | Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules | Sarthak Mittal; Alex Lamb; Anirudh Goyal; Vikram Voleti; Murray Shanahan; Guillaume Lajoie; Michael Mozer; Yoshua Bengio; | We explore deep recurrent neural net architectures in which bottom-up and top-down signals are dynamically combined using attention. |

275 | Spread Divergence | Mingtian Zhang; Peter Hayes; Thomas Bird; Raza Habib; David Barber; | We define a spread divergence on modified versions of $p$ and $q$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. |

276 | Optimizing Black-box Metrics with Adaptive Surrogates | Qijia Jiang; Olaoluwa Adigun; Harikrishna Narasimhan; Mahdi Milani Fard; Maya Gupta; | We address the problem of training models with black-box and hard-to-optimize metrics by expressing the metric as a monotonic function of a small number of easy-to-optimize surrogates. |

277 | Domain Adaptive Imitation Learning | Kuno Kim; Yihong Gu; Jiaming Song; Shengjia Zhao; Stefano Ermon; | In this work, we formalize the Domain Adaptive Imitation Learning (DAIL) problem – a unified framework for imitation learning in the presence of viewpoint, embodiment, and/or dynamics mismatch. |

278 | A general recurrent state space framework for modeling neural dynamics during decision-making | David Zoltowski; Jonathan Pillow; Scott Linderman; | Here we propose a general framework for modeling neural activity during decision-making. |

279 | An Imitation Learning Approach for Cache Replacement | Evan Liu; Milad Hashemi; Kevin Swersky; Parthasarathy Ranganathan; Junwhan Ahn; | In contrast, we propose an imitation learning approach to automatically learn cache access patterns by leveraging Belady’s, an oracle policy that computes the optimal eviction decision given the future cache accesses. |

280 | Revisiting Training Strategies and Generalization Performance in Deep Metric Learning | Karsten Roth; Timo Milbich; Samarth Sinha; Prateek Gupta; Bjorn Ommer; Joseph Paul Cohen; | Exploiting these insights, we propose a simple, yet effective, training regularization to reliably boost the performance of ranking-based DML models on various standard benchmark datasets; code and a publicly accessible WandB-repo are available. |

281 | Temporal Phenotyping using Deep Predictive Clustering of Disease Progression | Changhee Lee; Mihaela van der Schaar; | In this paper, we develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest (e.g., adverse events, the onset of comorbidities). |

282 | Countering Language Drift with Seeded Iterated Learning | Yuchen Lu; Soumye Singhal; Florian Strub; Aaron Courville; Olivier Pietquin; | In this paper, we propose a generic approach to counter language drift by using iterated learning. |

283 | Stochastic Gauss-Newton Algorithms for Nonconvex Compositional Optimization | Quoc Tran-Dinh; Nhan Pham; Lam Nguyen; | We develop two new stochastic Gauss-Newton algorithms for solving a class of stochastic non-convex compositional optimization problems frequently arising in practice. |

284 | Strategyproof Mean Estimation from Multiple-Choice Questions | Anson Kahng; Gregory Kehne; Ariel Procaccia; | Given n values possessed by n agents, we study the problem of estimating the mean by truthfully eliciting agents’ answers to multiple-choice questions about their values. |

285 | Sequential Cooperative Bayesian Inference | Junqi Wang; Pei Wang; Patrick Shafto; | We develop novel approaches analyzing consistency, rate of convergence and stability of Sequential Cooperative Bayesian Inference (SCBI). |

286 | Spectral Graph Matching and Regularized Quadratic Relaxations: Algorithm and Theory | Zhou Fan; Cheng Mao; Yihong Wu; Jiaming Xu; | To tackle this task, we propose a spectral method, GRAph Matching by Pairwise eigen-Alignments (GRAMPA), which first constructs a similarity matrix as a weighted sum of outer products between all pairs of eigenvectors of the two graphs, and then outputs a matching by a simple rounding procedure. |

287 | Zeno++: Robust Fully Asynchronous SGD | Cong Xie; Sanmi Koyejo; Indranil Gupta; | We propose Zeno++, a new robust asynchronous Stochastic Gradient Descent (SGD) procedure, intended to tolerate Byzantine failures of workers. |

288 | Network Pruning by Greedy Subnetwork Selection | Mao Ye; Chengyue Gong; Lizhen Nie; Denny Zhou; Adam Klivans; Qiang Liu; | In this work, we study a greedy forward selection approach following the opposite direction, which starts from an empty network, and gradually adds good neurons from the large network. |

289 | Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently | Asaf Cassel; Alon Cohen; Tomer Koren; | We present new efficient algorithms that achieve, perhaps surprisingly, regret that scales only (poly-)logarithmically with the number of steps, in two scenarios: when only the state transition matrix A is unknown, and when only the state-action transition matrix B is unknown and the optimal policy satisfies a certain non-degeneracy condition. |

290 | Hierarchical Verification for Adversarial Robustness | Cong Han Lim; Raquel Urtasun; Ersin Yumer; | We introduce a new framework for the exact pointwise lp robustness verification problem that exploits the layer-wise geometric structure of deep feed-forward networks with rectified linear activations (ReLU networks). |

291 | BINOCULARS for efficient, nonmyopic sequential experimental design | Shali Jiang; Henry Chai; Javier Gonzalez; Roman Garnett; | We present BINOCULARS: Batch-Informed NOnmyopic Choices, Using Long-horizons for Adaptive, Rapid SED, a general framework for deriving efficient, nonmyopic approximations to the optimal experimental policy. |

292 | On the Global Optimality of Model-Agnostic Meta-Learning | Lingxiao Wang; Qi Cai; Zhuoran Yang; Zhaoran Wang; | To bridge such a gap between theory and practice, we characterize the optimality gap of the stationary points attained by MAML for both reinforcement learning and supervised learning, where both the inner- and outer-level problems are solved via first-order optimization methods. |

293 | Breaking the Curse of Many Agents: Provable Mean Embedding $Q$-Iteration for Mean-Field Reinforcement Learning | Lingxiao Wang; Zhuoran Yang; Zhaoran Wang; | In this paper, we exploit the symmetry of agents in MARL. |

294 | Learning with Bounded Instance- and Label-dependent Label Noise | Jiacheng Cheng; Tongliang Liu; Kotagiri Ramamohanarao; Dacheng Tao; | In this paper, we focus on Bounded Instance- and Label-dependent label Noise (BILN), a particular case of ILN where the label noise rates—the probabilities that the true labels of examples flip into the corrupted ones—have an upper bound less than $1$. |

295 | Transparency Promotion with Model-Agnostic Linear Competitors | Hassan Rafique; Tong Wang; Qihang Lin; Arshia Singhani; | We propose a novel type of hybrid model for multi-class classification, which utilizes competing linear models to collaborate with an existing black-box model, promoting transparency in the decision-making process. |

296 | Learning Mixtures of Graphs from Epidemic Cascades | Jessica Hoffmann; Soumya Basu; Surbhi Goel; Constantine Caramanis; | We consider the problem of learning the weighted edges of a balanced mixture of two undirected graphs from epidemic cascades. |

297 | Implicit differentiation of Lasso-type models for hyperparameter optimization | Quentin Bertrand; Quentin Klopfenstein; Mathieu Blondel; Samuel Vaiter; Alexandre Gramfort; Joseph Salmon; | This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. |

298 | Latent Space Factorisation and Manipulation via Matrix Subspace Projection | Xiao Li; Chenghua Lin; Ruizhe Li; Chaozheng Wang; Frank Guerin; | We tackle the problem of disentangling the latent space of an autoencoder in order to separate labelled attribute information from other characteristic information. |

299 | Active World Model Learning in Agent-rich Environments with Progress Curiosity | Kuno Kim; Megumi Sano; Julian De Freitas; Nick Haber; Daniel Yamins; | In this work, we study how to design such a curiosity-driven Active World Model Learning (AWML) system. |

300 | SDE-Net: Equipping Deep Neural Networks with Uncertainty Estimates | Lingkai Kong; Jimeng Sun; Chao Zhang; | We propose a new method for quantifying uncertainties of DNNs from a dynamical system perspective. |

301 | GANs May Have No Nash Equilibria | Farzan Farnia; Asuman Ozdaglar; | In this work, we show through several theoretical and numerical results that indeed GAN zero-sum games may not have any Nash equilibria. |

302 | Gradient Temporal-Difference Learning with Regularized Corrections | Sina Ghiassian; Andrew Patterson; Shivam Garg; Dhawal Gupta; Adam White; Martha White; | In this paper, we introduce a new method called TD with Regularized Corrections (TDRC), that attempts to balance ease of use, soundness, and performance. |

303 | Online mirror descent and dual averaging: keeping pace in the dynamic case | Huang Fang; Victor Sanches Portella; Nick Harvey; Michael Friedlander; | In this paper, we modify the OMD algorithm by a simple technique that we call stabilization. |

304 | Choice Set Optimization Under Discrete Choice Models of Group Decisions | Kiran Tomlinson; Austin Benson; | Here, we use discrete choice modeling to develop an optimization framework of such interventions for several problems of group influence, including maximizing agreement or disagreement and promoting a particular choice. |

305 | Complexity of Finding Stationary Points of Nonconvex Nonsmooth Functions | Jingzhao Zhang; Hongzhou Lin; Stefanie Jegelka; Suvrit Sra; Ali Jadbabaie; | Therefore, we introduce the notion of $(\delta, \epsilon)$-stationarity, a generalization that allows for a point to be within distance $\delta$ of an $\epsilon$-stationary point and reduces to $\epsilon$-stationarity for smooth functions. |

306 | Multi-Agent Routing Value Iteration Network | Quinlan Sykora; Mengye Ren; Raquel Urtasun; | Traditional methods are not designed for realistic environments with sparse connectivity and unknown traffic, and are often slow at runtime; in this paper, we propose a graph neural network based model that is able to perform multi-agent routing in a sparsely connected graph with dynamically changing traffic conditions, outperforming existing methods. |

307 | Adversarial Attacks on Copyright Detection Systems | Parsa Saadatpanah; Ali Shafahi; Tom Goldstein; | This paper discusses how industrial copyright detection tools, which serve a central role on the web, are susceptible to adversarial attacks. |

308 | Differentiating through the Fréchet Mean | Aaron Lou; Isay Katsman; Qingxuan Jiang; Serge Belongie; Ser Nam Lim; Christopher De Sa; | In this paper, we show how to differentiate through the Fréchet mean for arbitrary Riemannian manifolds. |

309 | Online Learning for Active Cache Synchronization | Andrey Kolobov; Sebastien Bubeck; Julian Zimmert; | We present MirrorSync, an online learning algorithm for synchronization bandits, establish an adversarial regret of $O(T^{2/3})$ for it, and show how to make it efficient in practice. |

310 | PoKED: A Semi-Supervised System for Word Sense Disambiguation | Feng Wei; | In this paper, we propose a semi-supervised neural system, Position-wise Orthogonal Knowledge-Enhanced Disambiguator (PoKED), which allows attention-driven, long-range dependency modeling for word sense disambiguation tasks. |

311 | A Finite-Time Analysis of Q-Learning with Neural Network Function Approximation | Pan Xu; Quanquan Gu; | In this paper, we present a finite-time analysis of a neural Q-learning algorithm, where the data are generated from a Markov decision process and the action-value function is approximated by a deep ReLU neural network. |

312 | Understanding and Stabilizing GANs’ Training Dynamics Using Control Theory | Kun Xu; Chongxuan Li; Jun Zhu; Bo Zhang; | To this end, we present a conceptually novel perspective from control theory to directly model the dynamics of GANs in the frequency domain and provide simple yet effective methods to stabilize GAN’s training. |

313 | Scalable Nearest Neighbor Search for Optimal Transport | Arturs Backurs; Yihe Dong; Piotr Indyk; Ilya Razenshteyn; Tal Wagner; | In this work we introduce Flowtree, a fast and accurate approximation algorithm for the Wasserstein-1 distance. |

314 | Supervised learning: no loss no cry | Richard Nock; Aditya Menon; | In this paper, we revisit the SLIsotron algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. |

315 | Label-Noise Robust Domain Adaptation | Xiyu Yu; Tongliang Liu; Mingming Gong; Kun Zhang; Kayhan Batmanghelich; Dacheng Tao; | Focusing on the generalized target shift scenario, where both label distribution $P_Y$ and the class-conditional distribution $P_{X|Y}$ can change, we propose a new Denoising Conditional Invariant Component (DCIC) framework, which provably ensures (1) extracting invariant representations given examples with noisy labels in the source domain and unlabeled examples in the target domain and (2) estimating the label distribution in the target domain with no bias. |

316 | Description Based Text Classification with Reinforcement Learning | Wei Wu; Duo Chai; Qinghong Han; Fei Wu; Jiwei Li; | Inspired by the current trend of formalizing NLP problems as question answering tasks, we propose a new framework for text classification, in which each category label is associated with a category description. |

317 | Bandits for BMO Functions | Tianyu Wang; Cynthia Rudin; | We study the bandit problem where the underlying expected reward is a Bounded Mean Oscillation (BMO) function. |

318 | Cost-effectively Identifying Causal Effect When Only Response Variable Observable | Tian-Zuo Wang; Xi-Zhu Wu; Sheng-Jun Huang; Zhi-Hua Zhou; | In this paper, we propose a novel solution for this challenging task where only the response variable is observable under intervention. |

319 | Learning with Multiple Complementary Labels | Lei Feng; Takuo Kaneko; Bo Han; Gang Niu; Bo An; Masashi Sugiyama; | In this paper, we propose a novel problem setting to allow MCLs for each example and two ways for learning with MCLs. |

320 | Contrastive Multi-View Representation Learning on Graphs | Kaveh Hassani; Amir Hosein Khasahmadi; | We introduce a self-supervised approach for learning node and graph level representations by contrasting structural views of graphs. |

321 | A Chance-Constrained Generative Framework for Sequence Optimization | Xianggen Liu; Jian Peng; Qiang Liu; Sen Song; | In this paper, we formulate the sequence optimization task as a chance-constrained sampling problem. |

322 | dS^2LBI: Exploring Structural Sparsity on Deep Network via Differential Inclusion Paths | Yanwei Fu; Chen Liu; Donghao Li; Xinwei Sun; Jinshan Zeng; Yuan Yao; | In this paper, instead of pruning or distilling over-parameterized models to compressive ones, we propose a new approach based on differential inclusions of inverse scale spaces. |

323 | Sparse Subspace Clustering with Entropy-Norm | Liang Bai; Jiye Liang; | Therefore, in this paper, we provide an explicit theoretical connection between them from the perspective of learning a data similarity matrix. |

324 | On the Generalization Effects of Linear Transformations in Data Augmentation | Sen Wu; Hongyang Zhang; Gregory Valiant; Christopher Re; | In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. |

325 | Sparse Shrunk Additive Models | Hong Chen; Guodong Liu; Heng Huang; | A new method, called sparse shrunk additive models (SSAM), is proposed to explore the structure information among features. |

326 | Unsupervised Discovery of Interpretable Directions in the GAN Latent Space | Andrey Voynov; Artem Babenko; | In this paper, we introduce an unsupervised method to identify interpretable directions in the latent space of a pretrained GAN model. |

327 | DropNet: Reducing Neural Network Complexity via Iterative Pruning | Chong Min John Tan; Mehul Motani; | Inspired by the iterative weight pruning in the Lottery Ticket Hypothesis, we propose DropNet, an iterative pruning method which prunes nodes/filters to reduce network complexity. |

328 | Self-supervised Label Augmentation via Input Transformations | Hankook Lee; Sung Ju Hwang; Jinwoo Shin; | Our main idea is to learn a single unified task with respect to the joint distribution of the original and self-supervised labels, i.e., we augment original labels via self-supervision. |

329 | Mapping natural-language problems to formal-language solutions using structured neural representations | Kezhen Chen; Qiuyuan Huang; Hamid Palangi; Paul Smolensky; Ken Forbus; Jianfeng Gao; | In this paper, we propose a new encoder-decoder model based on a structured neural representation, Tensor Product Representations (TPRs), for generating formal-language solutions from natural-language, called TP-N2F. |

330 | Transformation of ReLU-based recurrent neural networks from discrete-time to continuous-time | Zahra Monfared; Daniel Durstewitz; | Here we show how to perform such a translation from discrete to continuous time for a particular class of ReLU-based RNN. |

331 | Implicit Geometric Regularization for Learning Shapes | Amos Gropp; Lior Yariv; Niv Haim; Matan Atzmon; Yaron Lipman; | In this paper we offer a new paradigm for computing high fidelity implicit neural representations directly from raw data (i.e., point clouds, with or without normal information). |

332 | Influence Diagram Bandits | Tong Yu; Branislav Kveton; Zheng Wen; Ruiyi Zhang; Ole J. Mengshoel; | We propose a novel framework for structured bandits, which we call influence diagram bandit. |

333 | Information Particle Filter Tree: An Online Algorithm for POMDPs with Belief-Based Rewards on Continuous Domains | Johannes Fischer; Ömer Sahin Tas; | In this work we propose a novel online algorithm, Information Particle Filter Tree (IPFT), to solve problems with belief-dependent rewards on continuous domains. |

334 | Convergence Rates of Variational Inference in Sparse Deep Learning | Badr-Eddine Chérief-Abdellatif; | In this paper, we show that variational inference for sparse deep learning retains precisely the same generalization properties as exact Bayesian inference. |

335 | Unsupervised Transfer Learning for Spatiotemporal Predictive Networks | Zhiyu Yao; Yunbo Wang; Mingsheng Long; Jianmin Wang; | Technically, we propose a differentiable framework named transferable memory. |

336 | DINO: Distributed Newton-Type Optimization Method | Rixon Crane; Fred Roosta; | We present a novel communication-efficient Newton-type algorithm for finite-sum optimization over a distributed computing environment. |

337 | Quantum Expectation-Maximization for Gaussian Mixture Models | Alessandro Luongo; Iordanis Kerenidis; Anupam Prakash; | We define a quantum version of Expectation-Maximization (QEM), a fundamental tool in unsupervised machine learning, often used to solve Maximum Likelihood (ML) and Maximum A Posteriori (MAP) estimation problems. |

338 | Consistent Structured Prediction with Max-Min Margin Markov Networks | Alex Nowak; Francis Bach; Alessandro Rudi; | In this paper, we prove consistency and finite sample generalization bounds for $M^4N$ and provide an explicit algorithm to compute the estimator. |

339 | Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions | Prashanth L.A.; Krishna Jagannathan; Ravi Kolla; | We derive concentration bounds for CVaR estimates, considering separately the cases of sub-Gaussian, light-tailed and heavy-tailed distributions. |

340 | Robust Pricing in Dynamic Mechanism Design | Yuan Deng; Sébastien Lahaie; Vahab Mirrokni; | In this paper, we propose robust dynamic mechanism design. |

341 | Nested Subspace Arrangement for Representation of Relational Data | Nozomi Hata; Shizuo Kaji; Akihiro Yoshida; Katsuki Fujisawa; | In this paper, we introduce Nested SubSpace arrangement (NSS arrangement), a comprehensive framework for representation learning. |

342 | Equivariant Neural Rendering | Emilien Dupont; Miguel Bautista Martin; Alex Colburn; Aditya Sankar; Joshua Susskind; Qi Shan; | We propose a framework for learning neural scene representations directly from images, without 3D supervision. |

343 | Bounding the fairness and accuracy of classifiers from population statistics | Sivan Sabato; Elad Yom-Tov; | We propose an efficient and practical procedure for finding the best possible lower bound on the discrepancy of the classifier, given the aggregate statistics, and demonstrate in experiments the empirical tightness of this lower bound, as well as its possible uses on various types of problems, ranging from estimating the quality of voting polls to measuring the effectiveness of patient identification from internet search queries. |

344 | Healing Gaussian Process Experts | Samuel Cohen; Rendani Mbuvha; Tshilidzi Marwala; Marc Deisenroth; | In this paper, we provide a solution to these problems for multiple expert models, including the generalised product of experts and the robust Bayesian committee machine. |

345 | Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles | Dylan Foster; Alexander Rakhlin; | We provide the first universal and optimal reduction from contextual bandits to online regression. |

346 | Simple and Deep Graph Convolutional Networks | Ming Chen; Zhewei Wei; Zengfeng Huang; Bolin Ding; Yaliang Li; | In this paper, we study the problem of designing and analyzing deep graph convolutional networks. |

347 | Projection-free Distributed Online Convex Optimization with $O(\sqrt{T})$ Communication Complexity | Yuanyu Wan; Wei-Wei Tu; Lijun Zhang; | In this paper, we first propose an improved variant of D-OCG, namely D-BOCG, which enjoys an $O(T^{3/4})$ regret bound with only $O(\sqrt{T})$ communication complexity. |

348 | Meta Variance Transfer: Learning to Augment from the Others | Seong-Jin Park; Seungju Han; Ji-won Baek; Insoo Kim; Juhwan Song; Hae Beom Lee; Jae-Joon Han; Sung Ju Hwang; | To alleviate the need of collecting large data and better learn from scarce samples, we propose a novel meta-learning method which learns to transfer factors of variations from one class to another, such that it can improve the classification performance on unseen examples. |

349 | Coresets for Clustering in Graphs of Bounded Treewidth | Daniel Baker; Vladimir Braverman; Lingxiao Huang; Shaofeng H.-C. Jiang; Robert Krauthgamer; Xuan Wu; | The construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of $O(\mathrm{tw}(G))$ on the shattering dimension under any point weights. |

350 | On Breaking Deep Generative Model-based Defenses and Beyond | Yanzhi Chen; Renjie Xie; Zhanxing Zhu; | In this work, we develop a new gradient approximation attack to break these defenses. |

351 | Exploration Through Bias: Revisiting Biased Maximum Likelihood Estimation in Stochastic Multi-Armed Bandits | Xi Liu; Ping-Chun Hsieh; Yu Heng Hung; Anirban Bhattacharya; P. Kumar; | We propose a new family of bandit algorithms, that are formulated in a general way based on the Biased Maximum Likelihood Estimation (BMLE) method originally appearing in the adaptive control literature. |

352 | Bisection-Based Pricing for Repeated Contextual Auctions against Strategic Buyer | Anton Zhiyanov; Alexey Drutsa; | We introduce a novel deterministic learning algorithm that is based on ideas of the Bisection method and has strategic regret upper bound of $O(\log^2 T)$. |

353 | Haar Graph Pooling | Yuguang Wang; Ming Li; Zheng Ma; Guido Montufar; Xiaosheng Zhuang; Yanan Fan; | We propose a new graph pooling operation based on compressive Haar transforms — HaarPooling. |

354 | Explaining Groups of Points in Low-Dimensional Representations | Gregory Plumb; Jonathan Terhorst; Sriram Sankararaman; Ameet Talwalkar; | To solve this problem, we introduce a new type of explanation, a Global Counterfactual Explanation (GCE), and our algorithm, Transitive Global Translations (TGT), for computing GCEs. |

355 | Learning Portable Representations for High-Level Planning | Steven James; Benjamin Rosman; George Konidaris; | We present a framework for autonomously learning a portable representation that describes a collection of low-level continuous environments. |

356 | Adaptive Estimator Selection for Off-Policy Evaluation | Yi Su; Pavithra Srinath; Akshay Krishnamurthy; | We develop a generic data-driven method for estimator selection in off-policy policy evaluation settings. |

357 | Doubly Stochastic Variational Inference for Neural Processes with Hierarchical Latent Variables | Qi Wang; Herke van Hoof; | To address this challenge, we investigate NPs systematically and present a new variant of NP model that we call Doubly Stochastic Variational Neural Process (DSVNP). |

358 | Generative Flows with Matrix Exponential | Changyi Xiao; Ligang Liu; | In this paper, we incorporate matrix exponential into generative flows. |

359 | Composable Sketches for Functions of Frequencies: Beyond the Worst Case | Edith Cohen; Ofir Geri; Rasmus Pagh; | In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. |

360 | Self-concordant analysis of Frank-Wolfe algorithm | Mathias Staudigl; Pavel Dvurechenskii; Shimrit Shtern; Kamil Safin; Petr Ostroukhov; | If the problem can be represented by a local linear minimization oracle, we are the first to propose a FW method with a linear convergence rate without assuming either strong convexity or a Lipschitz continuous gradient. |

361 | Towards non-parametric drift detection via Dynamic Adapting Window Independence Drift Detection (DAWIDD) | Fabian Hinder; André Artelt; Barbara Hammer; | In this paper we present a novel concept drift detection method, Dynamic Adapting Window Independence Drift Detection (DAWIDD), which aims for non-parametric drift detection of diverse drift characteristics. |

362 | Non-Stationary Bandits with Intermediate Observations | Claire Vernade; András György; Timothy Mann; | To model this situation, we introduce the problem of stochastic, non-stationary, delayed bandits with intermediate observations. |

363 | Does label smoothing mitigate label noise? | Michal Lukasik; Srinadh Bhojanapalli; Aditya Menon; Sanjiv Kumar; | In this paper, we study whether label smoothing is also effective as a means of coping with label noise. |

364 | Proving the Lottery Ticket Hypothesis: Pruning is All You Need | Eran Malach; Gilad Yehudai; Shai Shalev-Schwartz; Ohad Shamir; | We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training. |

365 | Linear bandits with Stochastic Delayed Feedback | Claire Vernade; Alexandra Carpentier; Tor Lattimore; Giovanni Zappella; Beyza Ermis; Michael Brueckner; | We formalize this problem as a novel stochastic delayed linear bandit and propose OTFLinUCB and OTFLinTS, two computationally efficient algorithms able to integrate new information as it becomes available and to deal with the permanently censored feedback. |

366 | Time Series Deconfounder: Estimating Treatment Effects over Time in the Presence of Hidden Confounders | Ioana Bica; Ahmed Alaa; Mihaela van der Schaar; | In this paper, we develop the Time Series Deconfounder, a method that leverages the assignment of multiple treatments over time to enable the estimation of treatment effects in the presence of multi-cause hidden confounders. |

367 | Negative Sampling in Semi-Supervised learning | John Chen; Vatsal Shah; Anastasios Kyrillidis; | We introduce Negative Sampling in Semi-Supervised Learning (NS^3L), a simple, fast, easy to tune algorithm for semi-supervised learning (SSL). |

368 | Adaptive Sketching for Fast and Convergent Canonical Polyadic Decomposition | Alex Gittens; Kareem Aggour; Bülent Yener; | This work considers the canonical polyadic decomposition (CPD) of tensors using proximally regularized sketched alternating least squares algorithms. |

369 | Private Counting from Anonymous Messages: Near-Optimal Accuracy with Vanishing Communication Overhead | Badih Ghazi; Ravi Kumar; Pasin Manurangsi; Rasmus Pagh; | In this paper, we obtain practical communication-efficient algorithms in the shuffled DP model for two basic aggregation primitives: 1) binary summation, and 2) histograms over a moderate number of buckets. |

370 | On the Generalization Benefit of Noise in Stochastic Gradient Descent | Samuel Smith; Erich Elsen; Soham De; | In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. |

371 | Momentum-Based Policy Gradient Methods | Feihu Huang; Shangqian Gao; Jian Pei; Heng Huang; | Specifically, we propose a fast important-sampling momentum-based policy gradient (IS-MBPG) method by using the important sampling technique. |

372 | Knowing The What But Not The Where in Bayesian Optimization | Vu Nguyen; Michael Osborne; | In this paper, we consider a new setting in BO in which the knowledge of the optimum output is available. |

373 | Robust Bayesian Classification Using An Optimistic Score Ratio | Viet Anh Nguyen; Nian Si; Jose Blanchet; | We consider the optimistic score ratio for robust Bayesian classification when the class-conditional distribution of the features is not perfectly known. |

374 | Boosted Histogram Transform for Regression | Yuchao Cai; Hanyuan Hang; Hanfang Yang; Zhouchen Lin; | In this paper, we propose a boosting algorithm for regression problems called \textit{boosted histogram transform for regression} (BHTR) based on histogram transforms composed of random rotations, stretchings, and translations. |

375 | Stochastic bandits with arm-dependent delays | Anne Gael Manegueu; Claire Vernade; Alexandra Carpentier; Michal Valko; | Addressing these difficulties, we propose a simple but efficient UCB-based algorithm called PatientBandits, and provide both problem-dependent and problem-independent bounds on the regret, as well as performance lower bounds. |

376 | Projective Preferential Bayesian Optimization | Petrus Mikkola; Milica Todorovic; Jari Järvi; Patrick Rinke; Samuel Kaski; | We propose a new type of Bayesian optimization for learning user preferences in high-dimensional spaces. |

377 | On Relativistic f-Divergences | Alexia Jolicoeur-Martineau; | We introduce the minimum-variance unbiased estimator (MVUE) for Relativistic paired GANs (RpGANs; originally called RGANs, which could cause confusion) and show that it does not perform better. |

378 | A Flexible Framework for Nonparametric Graphical Modeling that Accommodates Machine Learning | Yunhua Xiang; Noah Simon; | In this paper, we instead consider 3 non-parametric measures of conditional dependence. |

379 | The Natural Lottery Ticket Winner: Reinforcement Learning with Ordinary Neural Circuits | Ramin Hasani; Mathias Lechner; Alexander Amini; Daniela Rus; Radu Grosu; | We propose a neural information processing system which is obtained by re-purposing the function of a biological neural circuit model to govern simulated and real-world control tasks. |

380 | Schatten Norms in Matrix Streams: Hello Sparsity, Goodbye Dimension | Aditya Krishnan; Roi Sinoff; Robert Krauthgamer; Vladimir Braverman; | We address this challenge by providing the first algorithms whose space requirement is independent of the matrix dimension, assuming the matrix is doubly-sparse and presented in row-order. |

381 | Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning | Alberto Maria Metelli; Flavio Mazzolini; Lorenzo Bisi; Luca Sabbioni; Marcello Restelli; | In this paper, we introduce the notion of action persistence that consists in the repetition of an action for a fixed number of decision steps, having the effect of modifying the control frequency. |

382 | Minimax Rate for Learning From Pairwise Comparisons in the BTL Model | Julien Hendrickx; Alex Olshevsky; Venkatesh Saligrama; | Our contribution is the determination of the minimax rate up to a constant factor. |

383 | Interferometric Graph Transform: a Deep Unsupervised Graph Representation | Edouard Oyallon; | We propose the Interferometric Graph Transform (IGT), which is a new class of deep unsupervised graph convolutional neural network for building graph representations. |

384 | Stochastic Differential Equations with Variational Wishart Diffusions | Martin Jørgensen; Marc Deisenroth; Hugh Salimbeni; | We present a Bayesian non-parametric way of inferring stochastic differential equations for both regression tasks and continuous-time dynamical modelling. |

385 | What Can Learned Intrinsic Rewards Capture? | Zeyu Zheng; Junhyuk Oh; Matteo Hessel; Zhongwen Xu; Manuel Kroiss; Hado van Hasselt; David Silver; Satinder Singh; | In this paper, we instead consider the proposition that the reward function itself can be a good locus of learned knowledge. |

386 | Random extrapolation for primal-dual coordinate descent | Ahmet Alacaoglu; Olivier Fercoq; Volkan Cevher; | We introduce a randomly extrapolated primal-dual coordinate descent method that automatically adapts to the sparsity of the data matrix as well as the favorable structures of the objective function in optimization. |

387 | Reinforcement Learning with Differential Privacy | Giuseppe Vietri; Borja de Balle Pigem; Steven Wu; Akshay Krishnamurthy; | Motivated by high-stakes decision-making domains like personalized medicine where user information is inherently sensitive, we design privacy preserving exploration policies for episodic reinforcement learning (RL). |

388 | Median Matrix Completion: from Embarrassment to Optimality | Weidong Liu; Xiaojun Mao; Raymond K. W. Wong; | In this paper, we consider matrix completion with absolute deviation loss and obtain an estimator of the median matrix. |

389 | Improved Optimistic Algorithms for Logistic Bandits | Louis Faury; Marc Abeille; Clément Calauzènes; Olivier Fercoq; | In this work, we study the logistic bandit with a focus on the prohibitive dependencies introduced by $\kappa$. |

390 | Learning to Rank Learning Curves | Martin Wistuba; Tejaswini Pedapati; | In this work, we present a new method that saves computational budget by terminating poor configurations early on in the training. |

391 | Model Fusion with Kullback–Leibler Divergence | Sebastian Claici; Mikhail Yurochkin; Soumya Ghosh; Justin Solomon; | We propose a method to fuse posterior distributions learned from heterogeneous datasets. |

392 | Randomization matters. How to defend against strong adversarial attacks | Rafael Pinot; Raphael Ettedgui; Geovani Rizk; Yann Chevaleyre; Jamal Atif; | We tackle this problem by showing that, under mild conditions on the dataset distribution, any deterministic classifier can be outperformed by a randomized one. |

393 | Evolutionary Topology Search for Tensor Network Decomposition | Chao Li; Zhun Sun; | In this paper, we claim that this issue can be practically tackled by evolutionary algorithms in an efficient manner. |

394 | Quadratically Regularized Subgradient Methods for Weakly Convex Optimization with Weakly Convex Constraints | Runchao Ma; Qihang Lin; Tianbao Yang; | This paper proposes a class of subgradient methods for constrained optimization where the objective function and the constraint functions are weakly convex and nonsmooth. |

395 | Scalable and Efficient Comparison-based Search without Features | Daniyar Chumbalov; Lucas Maystre; Matthias Grossglauser; | We propose a new Bayesian comparison-based search algorithm with noisy answers; it has low computational complexity yet is efficient in the number of queries. |

396 | Error-Bounded Correction of Noisy Labels | Songzhu Zheng; Pengxiang Wu; Aman Goswami; Mayank Goswami; Dimitris Metaxas; Chao Chen; | We introduce a novel approach that directly cleans labels in order to train a high quality model. |

397 | Learning with Feature and Distribution Evolvable Streams | Zhen-Yu Zhang; Peng Zhao; Yuan Jiang; Zhi-Hua Zhou; | To address this difficulty, we propose a novel discrepancy measure for evolving feature space and data distribution named the evolving discrepancy, based on which we provide the generalization error analysis. |

398 | On Unbalanced Optimal Transport: An Analysis of Sinkhorn Algorithm | Khiem Pham; Khang Le; Nhat Ho; Tung Pham; Hung Bui; | We provide a computational complexity analysis for the Sinkhorn algorithm that solves the entropic regularized Unbalanced Optimal Transport (UOT) problem between two measures of possibly different masses with at most $n$ components. |

399 | Learning Optimal Tree Models under Beam Search | Jingwei Zhuo; Ziru Xu; Wei Dai; Han Zhu; Han Li; Jian Xu; Kun Gai; | In this paper, we take a first step towards understanding the discrepancy by developing the definition of Bayes optimality and calibration under beam search as general analyzing tools, and prove that neither TDMs nor PLTs are Bayes optimal under beam search. |

400 | Estimating the Number and Effect Sizes of Non-null Hypotheses | Jennifer Brennan; Ramya Korlakai Vinayak; Kevin Jamieson; | We study the problem of estimating the distribution of effect sizes (the mean of the test statistic under the alternate hypothesis) in a multiple testing setting. |

401 | Estimating Model Uncertainty of Neural Network in Sparse Information Form | Jongseok Lee; Matthias Humt; Jianxiang Feng; Rudolph Triebel; | The key insight of our work is that the information matrix, i.e., the inverse of the covariance matrix, tends to be sparse in its spectrum. |

402 | Double-Loop Unadjusted Langevin Algorithm | Paul Rolland; Armin Eftekhari; Ali Kavis; Volkan Cevher; | This work proposes a new annealing step-size schedule for ULA, which allows us to prove new convergence guarantees for sampling from a smooth log-concave distribution that are not covered by existing state-of-the-art guarantees. |

403 | Growing Action Spaces | Gregory Farquhar; Laura Gustafson; Zeming Lin; Shimon Whiteson; Nicolas Usunier; Gabriel Synnaeve; | In this work, we use a curriculum of progressively growing action spaces to accelerate learning. |

404 | Analytic Marching: An Analytic Meshing Solution from Deep Implicit Surface Networks | Jiabao Lei; Kui Jia; | We propose a naturally parallelizable algorithm of analytic marching to exactly recover the mesh captured by a learned MLP. |

405 | Anderson Acceleration of Proximal Gradient Methods | Vien Mai; Mikael Johansson; | This work introduces novel methods for adapting Anderson acceleration to (non-smooth and constrained) proximal gradient algorithms. |

406 | Interpretable, Multidimensional, Multimodal Anomaly Detection with Negative Sampling for Detection of Device Failure | John Sipple; | In this paper we propose a scalable, unsupervised approach for detecting anomalies in the Internet of Things (IoT). |

407 | Certified Robustness to Label-Flipping Attacks via Randomized Smoothing | Elan Rosenfeld; Ezra Winston; Pradeep Ravikumar; Zico Kolter; | In this work, we propose a strategy for building linear classifiers that are certifiably robust against a strong variant of label flipping, where each test example is targeted independently. |

408 | Responsive Safety in Reinforcement Learning | Adam Stooke; Joshua Achiam; Pieter Abbeel; | Lagrangian methods are the most commonly used algorithms for the resulting constrained optimization problem, yet they are known to oscillate and overshoot cost limits, causing constraint-violating behavior during training. |

409 | Deep k-NN for Noisy Labels | Dara Bahri; Heinrich Jiang; Maya Gupta; | In this paper, we provide an empirical study showing that a simple k-nearest neighbor-based filtering approach on the logit layer of a preliminary model can remove mislabeled training data and produce more accurate models than some recently proposed methods. |

410 | Learning the piece-wise constant graph structure of a varying Ising model | Batiste Le Bars; Pierre Humbert; Argyris Kalogeratos; Nicolas Vayatis; | For this purpose, we propose to estimate the neighborhood of each node by maximizing a penalized version of its conditional log-likelihood. |

411 | Stabilizing Transformers for Reinforcement Learning | Emilio Parisotto; Francis Song; Jack Rae; Razvan Pascanu; Caglar Gulcehre; Siddhant Jayakumar; Max Jaderberg; Raphael Lopez Kaufman; Aidan Clark; Seb Noury; Matthew Botvinick; Nicolas Heess; Raia Hadsell; | In this work we demonstrate that the standard transformer architecture is difficult to optimize, which was previously observed in the supervised learning setting but becomes especially pronounced with RL objectives. |

412 | An Explicitly Relational Neural Network Architecture | Murray Shanahan; Kyriacos Nikiforou; Antonia Creswell; Christos Kaplanis; David Barrett; Marta Garnelo; | With a view to bridging the gap between deep learning and symbolic AI, we present a novel end-to-end neural network architecture that learns to form propositional representations with an explicitly relational structure from raw pixel data. |

413 | Harmonic Decompositions of Convolutional Networks | Meyer Scetbon; Zaid Harchaoui; | We present a description of function spaces and smoothness classes associated with convolutional networks from a reproducing kernel Hilbert space viewpoint. |

414 | Discriminative Jackknife: Quantifying Uncertainty in Deep Learning via Higher-Order Influence Functions | Ahmed Alaa; Mihaela van der Schaar; | To this end, this paper develops the discriminative jackknife (DJ), a frequentist procedure that uses higher-order influence functions (HOIFs) of a trained model parameters to construct a jackknife (leave-one-out) estimator of predictive confidence intervals. |

415 | Robust Graph Representation Learning via Neural Sparsification | Cheng Zheng; Bo Zong; Wei Cheng; Dongjin Song; Jingchao Ni; Wenchao Yu; Haifeng Chen; Wei Wang; | In this paper, we present NeuralSparse, a supervised graph sparsification technique that improves generalization power by learning to remove potentially task-irrelevant edges from input graphs. |

416 | Semiparametric Nonlinear Bipartite Graph Representation Learning with Provable Guarantees | Sen Na; Yuwei Luo; Zhuoran Yang; Zhaoran Wang; Mladen Kolar; | To overcome these challenges, we propose a pseudo-likelihood objective based on the rank-order decomposition technique and focus on its local geometry. |

417 | Forecasting sequential data using Consistent Koopman Autoencoders | Omri Azencot; N. Benjamin Erichson; Vanessa Lin; Michael Mahoney; | We propose a novel Consistent Koopman Autoencoder that exploits the forward and backward dynamics to achieve long time predictions. |

418 | Scalable Identification of Partially Observed Systems with Certainty-Equivalent EM | Kunal Menda; Jean de Becdelievre; Jayesh K. Gupta; Ilan Kroo; Mykel Kochenderfer; Zachary Manchester; | This work considers the offline identification of partially observed nonlinear systems. |

419 | Learning to Score Behaviors for Guided Policy Optimization | Aldo Pacchiano; Jack Parker-Holder; Yunhao Tang; Krzysztof Choromanski; Anna Choromanska; Michael Jordan; | We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. |

420 | Improved Communication Cost in Distributed PageRank Computation – A Theoretical Study | Siqiang Luo; | In this paper, we provide a new algorithm that uses asymptotically the same number of communication rounds while significantly improving the bandwidth from $O(\log^{2d+3}{n})$ bits to $O(d\log^3{n})$ bits. |

421 | Learning Autoencoders with Relational Regularization | Hongteng Xu; Dixin Luo; Ricardo Henao; Svati Shah; Lawrence Carin; | We propose a new algorithmic framework for learning autoencoders of data distributions. |

422 | Neural Contextual Bandits with UCB-based Exploration | Dongruo Zhou; Lihong Li; Quanquan Gu; | We propose the NeuralUCB algorithm, which leverages the representation power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) of reward for efficient exploration. |

423 | Super-efficiency of automatic differentiation for functions defined as a minimum | Pierre Ablin; Gabriel Peyré; Thomas Moreau; | In this paper, we study the asymptotic error made by these estimators as a function of the optimization error. |

424 | PowerNorm: Rethinking Batch Normalization in Transformers | Sheng Shen; Zhewei Yao; Amir Gholaminejad; Michael Mahoney; Kurt Keutzer; | In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. |

425 | Invertible generative models for inverse problems: mitigating representation error and dataset bias | Muhammad Asim; Max Daniels; Oscar Leong; Paul Hand; Ali Ahmed; | In this paper, we demonstrate that invertible neural networks, which have zero representation error by design, can be effective natural signal priors for inverse problems such as denoising, compressive sensing, and inpainting. |

426 | Acceleration for Compressed Gradient Descent in Distributed Optimization | Zhize Li; Dmitry Kovalev; Xun Qian; Peter Richtarik; | In this paper, we remedy this situation and propose the first {\em accelerated compressed gradient descent (ACGD)} methods. |

427 | Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-Layer Networks | Mert Pilanci; Tolga Ergen; | We develop exact representations of two-layer neural networks with rectified linear units in terms of a single convex program whose number of variables is polynomial in the number of training samples and the number of hidden neurons. |

428 | Learning Quadratic Games on Networks | Yan Leng; Xiaowen Dong; Junfeng Wu; Alex 'Sandy' Pentland; | In this paper, we propose two novel frameworks for learning, from the observations on individual actions, network games with linear-quadratic payoffs, and in particular the structure of the interaction network. |

429 | Margin-aware Adversarial Domain Adaptation with Optimal Transport | Sofien Dhouib; Ievgen Redko; Carole Lartizien; | In this paper, we propose a new theoretical analysis of unsupervised domain adaptation that relates notions of large margin separation, adversarial learning and optimal transport. |

430 | The Sample Complexity of Best-$k$ Items Selection from Pairwise Comparisons | Wenbo Ren; Jia Liu; Ness Shroff; | In this paper, we study two problems: (i) finding the probably approximately correct (PAC) best-$k$ items and (ii) finding the exact best-$k$ items, both under strong stochastic transitivity and stochastic triangle inequality. |

431 | GraphOpt: Learning Optimization Models of Graph Formation | Rakshit Trivedi; Jiachen Yang; Hongyuan Zha; | In this work, we propose GraphOpt, an end-to-end framework that jointly learns an implicit model of graph structure formation and discovers an underlying optimization mechanism in the form of a latent objective function. |

432 | Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits | Nian Si; Fan Zhang; Zhengyuan Zhou; Jose Blanchet; | In this paper, we lift this assumption and aim to learn a distributionally robust policy with bandit observational data. |

433 | Incremental Sampling Without Replacement for Sequence Models | Kensen Shi; David Bieber; Charles Sutton; | We present an elegant procedure for sampling without replacement from a broad class of randomized programs, including generative neural models that construct outputs sequentially. |

434 | Variable Skipping for Autoregressive Range Density Estimation | Eric Liang; Zongheng Yang; Ion Stoica; Pieter Abbeel; Yan Duan; Peter Chen; | In this paper, we explore a technique for accelerating range density estimation over deep autoregressive models. |

435 | TaskNorm: Rethinking Batch Normalization for Meta-Learning | John Bronskill; Jonathan Gordon; James Requeima; Sebastian Nowozin; Richard Turner; | We evaluate a range of approaches to batch normalization for meta-learning scenarios, and develop a novel approach that we call TaskNorm. |

436 | Scalable Gaussian Process Regression for Kernels with a Non-Stationary Phase | Jan Graßhoff; Alexandra Jankowski; Philipp Rostalski; | This paper investigates an efficient GP framework that extends structured kernel interpolation methods to GPs with a non-stationary phase. |

437 | Transformer Hawkes Process | Simiao Zuo; Haoming Jiang; Zichong Li; Tuo Zhao; Hongyuan Zha; | To address this issue, we propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies and meanwhile enjoys computational efficiency. |

438 | An EM Approach to Non-autoregressive Conditional Sequence Generation | Zhiqing Sun; Yiming Yang; | This paper proposes a new approach that jointly optimizes both AR and NAR models in a unified Expectation-Maximization (EM) framework. |

439 | Variance Reduction in Stochastic Particle-Optimization Sampling | Jianyi Zhang; Yang Zhao; Changyou Chen; | In this paper, we bridge the gap by presenting several variance-reduction techniques for SPOS. |

440 | CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information | Pengyu Cheng; Weituo Hao; Shuyang Dai; Jiachang Liu; Zhe Gan; Lawrence Carin; | In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information. |

441 | State Space Expectation Propagation: Efficient Inference Schemes for Temporal Gaussian Processes | William Wilkinson; Paul Chang; Michael Andersen; Arno Solin; | We formulate expectation propagation (EP), a state-of-the-art method for approximate Bayesian inference, as a nonlinear Kalman smoother, showing that it generalises a wide class of classical smoothing algorithms. |

442 | Training Neural Networks for and by Interpolation | Leonard Berrada; M. Pawan Kumar; Andrew Zisserman; | In this work, we explicitly exploit this interpolation property for the design of a new optimization algorithm for deep learning, which we term Adaptive Learning-rates for Interpolation with Gradients (ALI-G). |

443 | Learning Representations that Support Extrapolation | Taylor Webb; Zachary Dulberg; Steven Frankland; Alexander Petrov; Randall O’Reilly; Jonathan Cohen; | In this paper, we consider the challenge of learning representations that support extrapolation. |

444 | Topic Modeling via Full Dependence Mixtures | Dan Fisher; Mark Kozdoba; Shie Mannor; | In this paper we introduce a new approach to topic modelling that scales to large datasets by using a compact representation of the data and by leveraging the GPU architecture. |

445 | Instance-hiding Schemes for Private Distributed Learning | Yangsibo Huang; Zhao Song; Sanjeev Arora; Kai Li; | The new ideas in the current paper are new variants of mixup with negative as well as positive coefficients, and an extension of sample-wise mixup to the pixel level. |

446 | The Implicit Regularization of Stochastic Gradient Flow for Least Squares | Alnur Ali; Edgar Dobriban; Ryan Tibshirani; | We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression. |

447 | Decentralised Learning with Random Features and Distributed Gradient Descent | Dominic Richards; Patrick Rebeschini; Lorenzo Rosasco; | We present simulations that show how the number of random features, iterations and samples impact predictive performance. |

448 | Hierarchical Generation of Molecular Graphs using Structural Motifs | Wengong Jin; Regina Barzilay; Tommi Jaakkola; | In this paper, we propose a new hierarchical graph encoder-decoder that employs significantly larger and more flexible graph motifs as basic building blocks. |

449 | Composing Molecules with Multiple Property Constraints | Wengong Jin; Regina Barzilay; Tommi Jaakkola; | We propose to offset this complexity by composing molecules from a vocabulary of substructures that we call molecular rationales. |

450 | Data preprocessing to mitigate bias: A maximum entropy based approach | Elisa Celis; Vijay Keswani; Nisheeth Vishnoi; | This paper presents an optimization framework that can be used as a data preprocessing method towards mitigating bias: It can learn distributions over large domains and controllably adjust the representation rates of protected groups and/or achieve target fairness metrics such as statistical parity, yet remains close to the empirical distribution induced by the given dataset. |

451 | On Efficient Low Distortion Ultrametric Embedding | Vincent Cohen-Addad; Karthik C. S.; Guillaume Lagarde; | In this paper, we provide a new algorithm which takes as input a set of points $P$ in $R^d$, and for every $c\ge 1$, runs in time $n^{1+O(1/c^2)}$ to output an ultrametric $\Delta$ such that for any two points $u,v$ in $P$, we have $\Delta(u,v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. |

452 | Global Concavity and Optimization in a Class of Dynamic Discrete Choice Models | Yiding Feng; Ekaterina Khmelnitskaya; Denis Nekipelov; | We show that in an important class of discrete choice models the value function is globally concave in the policy. That means that simple algorithms that do not require fixed point computation, such as the policy gradient algorithm, globally converge to the optimal policy. |

453 | Efficient Policy Learning from Surrogate-Loss Classification Reductions | Andrew Bennett; Nathan Kallus; | In light of this, we instead propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters. |

454 | On Contrastive Learning for Likelihood-free Inference | Conor Durkan; Iain Murray; George Papamakarios; | In this work, we show that both of these approaches can be unified under a general contrastive learning scheme, and clarify how they should be run and compared. |

455 | Obtaining Adjustable Regularization for Free via Iterate Averaging | Jingfeng Wu; Vladimir Braverman; Lin Yang; | In this paper, we establish a complete theory by showing an averaging scheme that provably converts the iterates of SGD on an arbitrary strongly convex and smooth objective function to its regularized counterpart with an adjustable regularization parameter. |

456 | Invariant Risk Minimization Games | Kartik Ahuja; Karthikeyan Shanmugam; Kush Varshney; Amit Dhurandhar; | In this work, we pose such invariant risk minimization as finding the Nash equilibrium of an ensemble game among several environments. |

457 | Video Prediction via Example Guidance | Jingwei Xu; Harry (Huazhe) Xu; Bingbing Ni; Xiaokang Yang; Trevor Darrell; | In this work, we propose a simple yet effective framework that can predict diverse and plausible future states. |

458 | Learning Discrete Structured Representations by Adversarially Maximizing Mutual Information | Karl Stratos; Sam Wiseman; | We propose learning discrete structured representations from unlabeled data by maximizing the mutual information between a structured latent variable and a target variable. |

459 | Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound | Lin Yang; Mengdi Wang; | In this paper, we propose an online RL algorithm, namely the MatrixRL, that leverages ideas from linear bandit to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. |

460 | Frequency Bias in Neural Networks for Input of Non-Uniform Density | Ronen Basri; Meirav Galun; Amnon Geifman; David Jacobs; Yoni Kasten; Shira Kritchman; | As realistic training sets are not drawn from a uniform distribution, we here use the Neural Tangent Kernel (NTK) model to explore the effect of variable density on training dynamics. |

461 | Constrained Markov Decision Processes via Backward Value Functions | Harsh Satija; Philip Amortila; Joelle Pineau; | In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process and provide a new on-policy formulation for solving it. |

462 | Adding seemingly uninformative labels helps in low data regimes | Christos Matsoukas; Albert Bou Hernandez; Yue Liu; Karin Dembrower; Gisele Miranda; Emir Konuk; Johan Fredin Haslum; Athanasios Zouzos; Peter Lindholm; Fredrik Strand; Kevin Smith; | In this work, we consider a task that requires difficult-to-obtain expert annotations: tumor segmentation in mammography images. |

463 | When are Non-Parametric Methods Robust? | Robi Bhattacharjee; Kamalika Chaudhuri; | In this work, we study general non-parametric methods, with a view towards understanding when they are robust to these modifications. |

464 | Learning Calibratable Policies using Programmatic Style-Consistency | Eric Zhan; Albert Tseng; Yisong Yue; Adith Swaminathan; Matthew Hausknecht; | In this paper, we leverage large amounts of raw behavioral data to learn policies that can be calibrated to generate a diverse range of behavior styles (e.g., aggressive versus passive play in sports). |

465 | Momentum Improves Normalized SGD | Ashok Cutkosky; Harsh Mehta; | We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. |

466 | Parameter-free, Dynamic, and Strongly-Adaptive Online Learning | Ashok Cutkosky; | We provide a new online learning algorithm that for the first time combines several disparate notions of adaptivity. |

467 | PENNI: Pruned Kernel Sharing for Efficient CNN Inference | Shiyu Li; Edward Hanson; Hai Li; Yiran Chen; | Based on this observation, we propose PENNI, a CNN model compression framework that is able to achieve model compactness and hardware efficiency simultaneously by (1) implementing kernel sharing in convolution layers via a small number of basis kernels and (2) alternately adjusting bases and coefficients with sparse constraints. |

468 | Optimal transport mapping via input convex neural networks | Ashok Vardhan Makkuva; Amirhossein Taghvaei; Sewoong Oh; Jason Lee; | In this paper, we present a novel and principled approach to learn the optimal transport between two distributions, from samples. |

469 | All in the (Exponential) Family: Information Geometry and Thermodynamic Variational Inference | Rob Brekelmans; Vaden Masrani; Frank Wood; Greg Ver Steeg; Aram Galstyan; | We interpret the geometric mixture curve common to TVO and related path sampling methods using the geometry of exponential families, which allows us to characterize the gap in TVO bounds as a sum of KL divergences along a given path. |

470 | SimGANs: Simulator-Based Generative Adversarial Networks for ECG Synthesis to Improve Deep ECG Classification | Tomer Golany; Kira Radinsky; Daniel Freedman; | We study the problem of heart signal electrocardiogram (ECG) synthesis for improved heartbeat classification. |

471 | Is There a Trade-Off Between Fairness and Accuracy? A Perspective Using Mismatched Hypothesis Testing | Sanghamitra Dutta; Dennis Wei; Hazar Yueksel; Pin-Yu Chen; Sijia Liu; Kush Varshney; | Novel to this work, we examine fair classification through the lens of mismatched hypothesis testing: trying to find a classifier that distinguishes between two ideal distributions when given two mismatched distributions that are biased. |

472 | Convex Calibrated Surrogates for the Multi-Label F-Measure | Mingyuan Zhang; Harish Guruprasad Ramaswamy; Shivani Agarwal; | In this paper, we explore the question of designing convex surrogate losses that are calibrated for the F-measure — specifically, that have the property that minimizing the surrogate loss yields (in the limit of sufficient data) a Bayes optimal multi-label classifier for the F-measure. |

473 | Learning Robot Skills with Temporal Variational Inference | Tanmay Shankar; Abhinav Gupta; | In this paper, we address the discovery of robotic options from demonstrations in an unsupervised manner. |

474 | Adaptive Gradient Descent without Descent | Konstantin Mishchenko; Yura Malitsky; | We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don’t increase the stepsize too fast and 2) don’t overstep the local curvature. |

475 | An end-to-end Differentially Private Latent Dirichlet Allocation Using a Spectral Algorithm | Christopher DeCarolis; Mukul Ram; Seyed Esmaeili; Yu-Xiang Wang; Furong Huang; | We provide an end-to-end differentially private spectral algorithm for learning LDA, based on matrix/tensor decompositions, and establish theoretical guarantees on utility/consistency of the estimated model parameters. |

476 | Dual Mirror Descent for Online Allocation Problems | Haihao Lu; Santiago Balseiro; Vahab Mirrokni; | We consider online allocation problems with concave revenue functions and resource constraints, which are central problems in revenue management and online advertising. |

477 | Optimal Robust Learning of Discrete Distributions from Batches | Ayush Jain; Alon Orlitsky; | We provide the first polynomial-time estimator that is optimal in the number of batches and achieves essentially the best possible estimation accuracy. |

478 | BoXHED: Boosted eXact Hazard Estimator with Dynamic covariates | Xiaochen Wang; Arash Pakbin; Bobak Mortazavi; Hongyu Zhao; Donald Lee; | This paper introduces the software package BoXHED (pronounced 'box-head') for nonparametrically estimating hazard functions via gradient boosting. |

479 | Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift | Alexander Chan; Ahmed Alaa; Zhaozhi Qian; Mihaela van der Schaar; | In this paper, we develop an approximate Bayesian inference scheme based on posterior regularisation, where we use information from unlabelled target data to produce more appropriate uncertainty estimates for ”covariate-shifted” predictions. |

480 | Universal Equivariant Multilayer Perceptrons | Siamak Ravanbakhsh; | Using tools from group theory, this paper proves the universality of a broad class of equivariant MLPs with a single hidden layer. |

481 | Improving generalization by controlling label-noise information in neural network weights | Hrayr Harutyunyan; Kyle Reing; Greg Ver Steeg; Aram Galstyan; | To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. |

482 | DeepMatch: Balancing Deep Covariate Representations for Causal Inference Using Adversarial Training | Nathan Kallus; | We propose a new method based on adversarial training of a weighting and a discriminator network that effectively addresses this methodological gap. |

483 | Bayesian Optimisation over Multiple Continuous and Categorical Inputs | Binxin Ru; Ahsan Alvi; Vu Nguyen; Michael Osborne; Stephen Roberts; | We propose a new approach, Continuous and Categorical Bayesian Optimisation (CoCaBO), which combines the strengths of multi-armed bandits and Bayesian optimisation to select values for both categorical and continuous inputs. |

484 | Generalization and Representational Limits of Graph Neural Networks | Vikas Garg; Stefanie Jegelka; Tommi Jaakkola; | We address two fundamental questions about graph neural networks (GNNs). |

485 | Multi-Precision Policy Enforced Training (MuPPET) : A Precision-Switching Strategy for Quantised Fixed-Point Training of CNNs | Aditya Rajagopal; Diederik Vink; Stylianos Venieris; Christos-Savvas Bouganis; | This work pushes the boundary of quantised training by employing a multilevel optimisation approach that utilises multiple precisions including low-precision fixed-point representations. |

486 | LowFER: Low-rank Bilinear Pooling for Link Prediction | Saadullah Amin; Stalin Varanasi; Katherine Ann Dunfield; Günter Neumann; | In this work, we propose a factorized bilinear pooling model, commonly used in multi-modal learning, for better fusion of entities and relations, leading to an efficient and constraints free model. |

487 | Parameterized Rate-Distortion Stochastic Encoder | Quan Hoang; Trung Le; Dinh Phung; | We propose a novel gradient-based tractable approach for the Blahut-Arimoto (BA) algorithm to compute the rate-distortion function where the BA algorithm is fully parameterized. |

488 | Incidence Networks for Geometric Deep Learning | Marjan Albooyeh; Daniele Bertolini; Siamak Ravanbakhsh; | In this paper, we formalize incidence tensors, analyze their structure, and present the family of equivariant networks that operate on them. |

489 | Energy-Based Processes for Exchangeable Data | Mengjiao Yang; Bo Dai; Hanjun Dai; Dale Schuurmans; | To overcome these limitations, we introduce Energy-Based Processes (EBPs), which extend energy based models to exchangeable data while allowing neural network parameterizations of the energy function. |

490 | Deep Isometric Learning for Visual Recognition | Haozhi Qi; Chong You; Xiaolong Wang; Yi Ma; Jitendra Malik; | This paper shows that deep vanilla ConvNets without normalization or residual structure can also be trained to achieve surprisingly good performance on standard image recognition benchmarks (ImageNet and COCO). |

491 | Second-Order Provable Defenses against Adversarial Attacks | Sahil Singla; Soheil Feizi; | In this paper, we provide computationally-efficient robustness certificates for neural networks with differentiable activation functions in two steps. |

492 | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | Angelos Katharopoulos; Apoorv Vyas; Nikolaos Pappas; Francois Fleuret; | To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $O(N^2)$ to $O(N)$, where $N$ is the sequence length. |

493 | Overfitting in adversarially robust deep learning | Eric Wong; Leslie Rice; Zico Kolter; | In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. |

494 | Rethinking Bias-Variance Trade-off for Generalization of Neural Networks | Zitong Yang; Yaodong Yu; Chong You; Jacob Steinhardt; Yi Ma; | We provide a simple explanation of this by measuring the bias and variance of neural networks: while the bias is {\em monotonically decreasing} as in the classical theory, the variance is {\em unimodal} or bell-shaped: it increases then decreases with the width of the network. |

495 | Boosting for Control of Dynamical Systems | Naman Agarwal; Nataly Brukhim; Elad Hazan; Zhou Lu; | To this end, we propose a framework of boosting for online control. |

496 | Frustratingly Simple Few-Shot Object Detection | Xin Wang; Thomas Huang; Joseph Gonzalez; Trevor Darrell; Fisher Yu; | We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. |

497 | Data-Dependent Differentially Private Parameter Learning for Directed Graphical Models | Amrita Roy Chowdhury; Theodoros Rekatsinas; Somesh Jha; | In this paper, we present an algorithm for differentially-private learning of the parameters of a DGM. |

498 | Adversarial Risk via Optimal Transport and Optimal Couplings | Muni Sreenivas Pydi; Varun Jog; | In this paper, we investigate the optimal adversarial risk and optimal adversarial classifiers from an optimal transport perspective. |

499 | Decoupled Greedy Learning of CNNs | Eugene Belilovsky; Michael Eickenberg; Edouard Oyallon; | In this context, we consider a simpler, but more effective, substitute that uses minimal feedback, which we call Decoupled Greedy Learning (DGL). |

500 | ACFlow: Flow Models for Arbitrary Conditional Likelihoods | Yang Li; Shoaib Akbar; Junier Oliva; | Instead, in this work we develop a model that is capable of yielding all conditional distributions $p(x_u \mid x_o)$ (for arbitrary $x_u$) via tractable conditional likelihoods. |

501 | Can autonomous vehicles identify, recover from, and adapt to distribution shifts? | Angelos Filos; Panagiotis Tigkas; Rowan McAllister; Nicholas Rhinehart; Sergey Levine; Yarin Gal; | In this paper, we introduce an autonomous car novel-scene benchmark, CARNOVEL, to evaluate the robustness of driving agents to a suite of tasks involving distribution shift. |

502 | Leveraging Procedural Generation to Benchmark Reinforcement Learning | Karl Cobbe; Chris Hesse; Jacob Hilton; John Schulman; | We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. |

503 | The Tree Ensemble Layer: Differentiability meets Conditional Computation | Hussein Hazimeh; Natalia Ponomareva; Rahul Mazumder; Zhenyu Tan; Petros Mol; | We aim to combine these advantages by introducing a new layer for neural networks, composed of an ensemble of differentiable decision trees (a.k.a. soft trees). |

504 | Near-Tight Margin-Based Generalization Bounds for Support Vector Machines | Allan Grønlund; Lior Kamma; Kasper Green Larsen; | In this paper, we revisit and improve the classic generalization bounds in terms of margins. |

505 | Error Estimation for Sketched SVD | Miles Lopes; N. Benjamin Erichson; Michael Mahoney; | To overcome these challenges, this paper develops a fully data-driven bootstrap method that numerically estimates the actual error of sketched singular vectors/values. |

506 | Goal-Aware Prediction: Learning to Model What Matters | Suraj Nair; Silvio Savarese; Chelsea Finn; | In this paper, we propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space, resulting in a learning objective that more closely matches the downstream task. |

507 | Combinatorial Pure Exploration for Dueling Bandit | Wei Chen; Yihan Du; Longbo Huang; Haoyu Zhao; | In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. |

508 | Optimal Sequential Maximization: One Interview is Enough! | Moein Falahatgar; Alon Orlitsky; Venkatadheeraj Pichapati; | We derive the first query-optimal sequential algorithm for probabilistic-maximization. |

509 | What can I do here? A Theory of Affordances in Reinforcement Learning | Khimya Khetarpal; Zafarali Ahmed; Gheorghe Comanici; David Abel; Doina Precup; | In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes. |

510 | An end-to-end approach for the verification problem: learning the right distance | Joao Monteiro; Isabela Albuquerque; Jahangir Alam; R Devon Hjelm; Tiago Falk; | In this contribution, we augment the metric learning setting by introducing a parametric pseudo-distance, trained jointly with the encoder. |

511 | Data Valuation using Reinforcement Learning | Jinsung Yoon; Sercan Arik; Tomas Pfister; | We propose Data Valuation using Reinforcement Learning (DVRL), to adaptively learn data values jointly with the predictor model. |

512 | FormulaZero: Distributionally Robust Online Adaptation via Offline Population Synthesis | Aman Sinha; Matthew O’Kelly; Hongrui Zheng; Rahul Mangharam; John Duchi; Russ Tedrake; | This work makes algorithmic contributions to both challenges. First, to generate a realistic, diverse set of opponents, we develop a novel method for self-play based on replica-exchange Markov chain Monte Carlo. Second, we propose a distributionally robust bandit optimization procedure that adaptively adjusts risk aversion relative to uncertainty in beliefs about opponents’ behaviors. |

513 | Latent Bernoulli Autoencoder | Jiri Fajtl; Vasileios Argyriou; Dorothy Monekosso; Paolo Remagnino; | In this work, we pose a question whether it is possible to design and train an autoencoder model in an end-to-end fashion to learn latent representations in multivariate Bernoulli space, and achieve performance comparable with the current state-of-the-art variational methods. |

514 | Learning To Stop While Learning To Predict | Xinshi Chen; Hanjun Dai; Yu Li; Xin Gao; Le Song; | In this paper, we tackle this varying depth problem using a steerable architecture, where a feed-forward deep model and a variational stopping policy are learned together to sequentially determine the optimal number of layers for each input instance. |

515 | Accelerating the diffusion-based ensemble sampling by non-reversible dynamics | Futoshi Futami; Issei Sato; Masashi Sugiyama; | To cope with this problem, we propose a novel ensemble method that uses a non-reversible Markov chain for the interaction, and we present a non-asymptotic theoretical analysis for our method. |

516 | Efficient nonparametric statistical inference on population feature importance using Shapley values | Brian Williamson; Jean Feng; | We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the \textbf{S}hapley \textbf{P}opulation \textbf{V}ariable \textbf{I}mportance \textbf{M}easure (SPVIM). |

517 | Curse of Dimensionality on Randomized Smoothing for Certifiable Robustness | Aounon Kumar; Alexander Levine; Tom Goldstein; Soheil Feizi; | In this work, we show that extending the smoothing technique to defend against other attack models can be challenging, especially in the high-dimensional regime. |

518 | Upper bounds for Model-Free Row-Sparse Principal Component Analysis | Guanyi Wang; Santanu Dey; | We propose a new framework that finds upper (dual) bounds for sparse PCA in polynomial time by solving a convex integer program (IP). |

519 | Explainable k-Means and k-Medians Clustering | Michal Moshkovitz; Sanjoy Dasgupta; Cyrus Rashtchian; Nave Frost; | We study this problem from a theoretical viewpoint, measuring the output quality by the k-means and k-medians objectives. |

520 | Reward-Free Exploration for Reinforcement Learning | Chi Jin; Akshay Krishnamurthy; Max Simchowitz; Tiancheng Yu; | To isolate the challenges of exploration, we propose the following “reward-free RL” framework. |

521 | Parametric Gaussian Process Regressors | Martin Jankowiak; Geoff Pleiss; Jacob Gardner; | In this work we propose two simple methods for scalable GP regression that address this issue and thus yield substantially improved predictive uncertainties. |

522 | p-Norm Flow Diffusion for Local Graph Clustering | Kimon Fountoulakis; Di Wang; Shenghao Yang; | In this work, we draw inspiration from both fields and propose a family of convex optimization formulations based on the idea of diffusion with $p$-norm network flow for $p\in (1,\infty)$. |

523 | Low-Rank Bottleneck in Multi-head Attention Models | Srinadh Bhojanapalli; Chulhee Yun; Ankit Singh Rawat; Sashank Jakkam Reddi; Sanjiv Kumar; | In this paper we identify one of the important factors contributing to the large embedding size requirement. |

524 | LEEP: A New Measure to Evaluate Transferability of Learned Representations | Cuong Nguyen; Tal Hassner; Cedric Archambeau; Matthias Seeger; | We introduce a new measure to evaluate the transferability of representations learned by classifiers. |

525 | The FAST Algorithm for Submodular Maximization | Adam Breuer; Eric Balkanski; Yaron Singer; | In this paper we describe a new parallel algorithm called Fast Adaptive Sequencing Technique (FAST) for maximizing a monotone submodular function under a cardinality constraint k. |

526 | On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation | Jianing Li; Yanyan Lan; Jiafeng Guo; Xueqi Cheng; | In this paper, we try to reveal such relation in a theoretical approach. |

527 | Designing Optimal Dynamic Treatment Regimes: A Causal Reinforcement Learning Approach | Junzhe Zhang; | In particular, we develop two online algorithms that satisfy such regret bounds by exploiting the causal structure underlying the DTR; one is based on the principle of optimism in the face of uncertainty (OFU-DTR), and the other uses the posterior sampling learning (PS-DTR). |

528 | Global Decision-Making via Local Economic Transactions | Michael Chang; Sid Kaushik; S. Matthew Weinberg; Sergey Levine; Thomas Griffiths; | This paper seeks to establish a mechanism for directing a collection of simple, specialized, self-interested agents to solve what traditionally are posed as monolithic single-agent sequential decision problems with a central global objective. |

529 | Retrieval Augmented Language Model Pre-Training | Kelvin Guu; Kenton Lee; Zora Tung; Panupong Pasupat; Mingwei Chang; | To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. |

530 | Variational Label Enhancement | Ning Xu; Yun-Peng Liu; Jun Shu; Xin Geng; | To solve this problem, we consider the label distributions as the latent vectors and infer the label distributions from the logical labels in the training datasets by using variational inference. |

531 | Bandits with Adversarial Scaling | Thodoris Lykouris; Vahab Mirrokni; Renato Leme; | We study "adversarial scaling", a multi-armed bandit model where rewards have a stochastic and an adversarial component. |

532 | Eliminating the Invariance on the Loss Landscape of Linear Autoencoders | Reza Oftadeh; Jiayi Shen; Zhangyang Wang; Dylan Shell; | Here, we prove that our loss function eliminates this issue, i.e., the decoder converges to the exact ordered unnormalized eigenvectors of the sample covariance matrix. |

533 | What is Local Optimality in Nonconvex-Nonconcave Minimax Optimization? | Chi Jin; Praneeth Netrapalli; Michael Jordan; | The main contribution of this paper is to propose a proper mathematical definition of local optimality for this sequential setting—local minimax, as well as to present its properties and existence results. |

534 | Lookahead-Bounded Q-learning | Ibrahim El Shar; Daniel Jiang; | We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning that seeks to improve the performance of standard Q-learning in stochastic environments through the use of “lookahead” upper and lower bounds. |

535 | Learning From Irregularly-Sampled Time Series: A Missing Data Perspective | Steven Cheng-Xian Li; Benjamin Marlin; | In this paper, we consider irregular sampling from the perspective of missing data. |

536 | Evaluating the Performance of Reinforcement Learning Algorithms | Scott Jordan; Yash Chandak; Daniel Cohen; Mengxue Zhang; Philip Thomas; | In this work, we argue that the inconsistency of performance stems from the use of flawed evaluation metrics. |

537 | Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels | Yu-Ting Chou; Gang Niu; Hsuan-Tien Lin; Masashi Sugiyama; | In this paper, we investigate reasons for such overfitting by studying learning with complementary labels. |

538 | Provable Self-Play Algorithms for Competitive Reinforcement Learning | Yu Bai; Chi Jin; | We introduce a self-play algorithm—Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret $\tilde{O}(\sqrt{T})$ after playing $T$ steps of the game. |

539 | Optimizing Long-term Social Welfare in Recommender Systems: A Constrained Matching Approach | Martin Mladenov; Elliot Creager; Omer Ben-Porat; Kevin Swersky; Richard Zemel; Craig Boutilier; | In this work, we explore settings in which content providers cannot remain viable unless they receive a certain level of user engagement. |

540 | Semi-Supervised StyleGAN for Disentanglement Learning | Weili Nie; Tero Karras; Animesh Garg; Shoubhik Debnath; Anjul Patney; Ankit Patel; Anima Anandkumar; | To alleviate these limitations, we design new architectures and loss functions based on StyleGAN (Karras et al., 2019), for semi-supervised high-resolution disentanglement learning. |

541 | The Non-IID Data Quagmire of Decentralized Machine Learning | Kevin Hsieh; Amar Phanishayee; Onur Mutlu; Phillip Gibbons; | Based on these findings, we present SkewScout, a system-level approach that adapts the communication frequency of decentralized learning algorithms to the (skew-induced) accuracy loss between data partitions. |

542 | On the Noisy Gradient Descent that Generalizes as SGD | Jingfeng Wu; Wenqing Hu; Haoyi Xiong; Jun Huan; Vladimir Braverman; Zhanxing Zhu; | In this work we provide negative results by showing that noise in classes different from the SGD noise can also effectively regularize gradient descent. |

543 | Safe screening rules for L0-regression | Alper Atamturk; Andres Gomez; | We give safe screening rules to eliminate variables from regression with L0 regularization or cardinality constraint. |

544 | Single Point Transductive Prediction | Nilesh Tripuraneni; Lester Mackey; | We address this question in the context of linear prediction, showing how techniques from semi-parametric inference can be used transductively to combat regularization bias. |

545 | History-Gradient Aided Batch Size Adaptation for Variance Reduced Algorithms | Kaiyi Ji; Zhe Wang; Bowen Weng; Yi Zhou; Wei Zhang; Yingbin Liang; | In this paper, we propose a novel scheme, which eliminates backtracking line search but still exploits the information along optimization path by adapting the batch size via history stochastic gradients. |

546 | Batch Stationary Distribution Estimation | Junfeng Wen; Bo Dai; Lihong Li; Dale Schuurmans; | We propose a consistent estimator that is based on recovering a correction ratio function over the given data. |

547 | Optimal Statistical Guarantees for Adversarially Robust Gaussian Classification | Chen Dan; Yuting Wei; Pradeep Ravikumar; | In this paper, we provide the first result of the optimal minimax guarantees for the excess risk for adversarially robust classification, under the Gaussian mixture model proposed by Schmidt et al. (2018). |

548 | Generative Adversarial Imitation Learning with Neural Network Parameterization: Global Optimality and Convergence Rate | Yufeng Zhang; Qi Cai; Zhuoran Yang; Zhaoran Wang; | To bridge the gap between practice and theory, we analyze a gradient-based algorithm with alternating updates and establish its sublinear convergence to the globally optimal solution. |

549 | A Game Theoretic Perspective on Model-Based Reinforcement Learning | Aravind Rajeswaran; Igor Mordatch; Vikash Kumar; | We show that stable algorithms for MBRL can be derived by considering a Stackelberg game between the two players. |

550 | (Locally) Differentially Private Combinatorial Semi-Bandits | Xiaoyu Chen; Kai Zheng; Zixin Zhou; Yunchang Yang; Wei Chen; Liwei Wang; | In this paper, we study (locally) differentially private Combinatorial Semi-Bandits (CSB). |

551 | Optimizing for the Future in Non-Stationary MDPs | Yash Chandak; Georgios Theocharous; Shiv Shankar; Martha White; Sridhar Mahadevan; Philip Thomas; | To address this problem, we develop a method that builds upon ideas from both counter-factual reasoning and curve-fitting to proactively search for a good future policy, without ever modeling the underlying non-stationarity. |

552 | Learning Task-Agnostic Embedding of Multiple Black-Box Experts for Multi-Task Model Fusion | Nghia Hoang; Thanh Lam; Bryan Kian Hsiang Low; Patrick Jaillet; | To address this multi-task challenge, we develop a new fusion paradigm that represents each expert as a distribution over a spectrum of predictive prototypes, which are isolated from task-specific information encoded within the prototype distribution. |

553 | Dual-Path Distillation: A Unified Framework to Improve Black-Box Attacks | Yonggang Zhang; Ya Li; Tongliang Liu; Xinmei Tian; | Therefore, we propose a novel framework, dual-path distillation, that utilizes the feedback knowledge not only to craft adversarial examples but also to alter the searching directions to achieve efficient attacks. |

554 | Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data | Lan-Zhe Guo; Zhen-Yu Zhang; Yuan Jiang; Yufeng Li; Zhi-Hua Zhou; | This paper proposes a simple and effective safe deep SSL method to alleviate the performance harm caused by it. |

555 | Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data | Marc Finzi; Samuel Stanton; Pavel Izmailov; Andrew Wilson; | We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. |

556 | Dispersed EM-VAEs for Interpretable Text Generation | Wenxian Shi; Hao Zhou; Ning Miao; Lei Li; | In this paper, we find that mode-collapse is a general problem for VAEs with exponential family mixture priors. |

557 | Deep Graph Random Process for Relational-Thinking-Based Speech Recognition | Huang Hengguan; Fuzhao Xue; Hao Wang; Ye Wang; | We present a framework that models a percept as weak relations between a current utterance and its history. |

558 | Hypernetwork approach to generating point clouds | Przemyslaw Spurek; Sebastian Winczowski; Jacek Tabor; Maciej Zamorski; Maciej Zieba; Tomasz Trzcinski; | In this work, we propose a novel method for generating 3D point clouds that leverages properties of hypernetworks. |

559 | On a projective ensemble approach to two sample test for equality of distributions | Zhimei Li; Yaowu Zhang; | In this work, we propose a robust test for the multivariate two-sample problem through projective ensemble, which is a generalization of the Cramer-von Mises statistic. |

560 | Coresets for Data-efficient Training of Machine Learning Models | Baharan Mirzasoleiman; Jeff Bilmes; Jure Leskovec; | Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. |

561 | Searching to Exploit Memorization Effect in Learning with Noisy Labels | Quanming Yao; Hansi Yang; Bo Han; Gang Niu; James Kwok; | In this paper, motivated by the success of automated machine learning (AutoML), we model this issue as a function approximation problem. |

562 | Randomized Smoothing of All Shapes and Sizes | Greg Yang; Tony Duan; J. Edward Hu; Hadi Salman; Ilya Razenshteyn; Jerry Li; | We propose a novel framework for devising and analyzing randomized smoothing schemes, and validate its effectiveness in practice. |

563 | DeepCoDA: personalized interpretability for compositional health | Thomas Quinn; Dang Nguyen; Santu Rana; Sunil Gupta; Svetha Venkatesh; | We propose the DeepCoDA framework to extend precision health modelling to high-dimensional compositional data, and to provide personalized interpretability through patient-specific weights. |

564 | Private Query Release Assisted by Public Data | Raef Bassily; Albert Cheu; Shay Moran; Aleksandar Nikolov; Jonathan Ullman; Steven Wu; | We study the problem of differentially private query release assisted by public data. |

565 | Adaptive Droplet Routing in Digital Microfluidic Biochips Using Deep Reinforcement Learning | Tung-Che Liang; Zhanwei Zhong; Yaas Bigdeli; Tsung-Yi Ho; Richard Fair; Krishnendu Chakrabarty; | We present and investigate a novel application domain for deep reinforcement learning (RL): droplet routing on digital microfluidic biochips (DMFBs). |

566 | Continuous-time Lower Bounds for Gradient-based Algorithms | Michael Muehlebach; Michael Jordan; | We reduce the multi-dimensional problem to a single dimension, recover well-known lower bounds from the discrete-time setting, and provide insights into why these lower bounds occur. |

567 | A Tree-Structured Decoder for Image-to-Markup Generation | Jianshu Zhang; Jun Du; Yongxin Yang; Yi-Zhe Song; Si Wei; Lirong Dai; | In this work, we first show via a set of toy problems that string decoders struggle to decode tree structures, especially as structural complexity increases. We then propose a tree-structured decoder that specifically aims at generating a tree-structured markup. |

568 | Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning | Aleksei Petrenko; Zhehui Huang; Tushar Kumar; Gaurav Sukhatme; Vladlen Koltun; | In this work we aim to solve this problem by optimizing the efficiency and resource utilization of reinforcement learning algorithms instead of relying on distributed computation. |

569 | Scalable Deep Generative Modeling for Sparse Graphs | Hanjun Dai; Azade Nazi; Yujia Li; Bo Dai; Dale Schuurmans; | Based on this, we develop a novel autoregressive model, named BiGG, that utilizes this sparsity to avoid generating the full adjacency matrix, and importantly reduces the graph generation time complexity to $O((n + m)\log n)$. |

570 | Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning | Qing Li; Siyuan Huang; Yining Hong; Yixin Chen; Ying Nian Wu; Song-Chun Zhu; | In this paper, we address these issues and close the loop of neural-symbolic learning by (1) introducing the grammar model as a symbolic prior to bridge neural perception and symbolic reasoning, and (2) proposing a novel back-search algorithm which mimics the top-down human-like learning procedure to propagate the error through the symbolic reasoning module efficiently. |

571 | NGBoost: Natural Gradient Boosting for Probabilistic Prediction | Tony Duan; Anand Avati; Daisy Ding; Khanh K. Thai; Sanjay Basu; Andrew Ng; Alejandro Schuler; | We present Natural Gradient Boosting (NGBoost), an algorithm for generic probabilistic prediction via gradient boosting. |

572 | Q-value Path Decomposition for Deep Multiagent Reinforcement Learning | Yaodong Yang; Jianye Hao; Guangyong Chen; Hongyao Tang; Yingfeng Chen; Yujing Hu; Changjie Fan; Zhongyu Wei; | In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system’s global Q-values into individual agents’ Q-values. |

573 | Online Learned Continual Compression with Adaptive Quantization Modules | Lucas Caccia; Eugene Belilovsky; Massimo Caccia; Joelle Pineau; | We introduce and study the problem of Online Continual Compression, where one attempts to simultaneously learn to compress and store a representative dataset from a non i.i.d data stream, while only observing each sample once. |

574 | Learning What to Defer for Maximum Independent Sets | Sungsoo Ahn; Younggyo Seo; Jinwoo Shin; | In this paper, we seek to resolve this issue by proposing a novel DRL scheme where the agent adaptively shrinks or stretches the number of stages by learning to defer the determination of the solution at each stage. |

575 | Generalized and Scalable Optimal Sparse Decision Trees | Jimmy Lin; Chudi Zhong; Diane Hu; Cynthia Rudin; Margo Seltzer; | The contribution in this work is to provide a general framework for decision tree optimization that addresses the two significant open problems in the area: treatment of imbalanced data and fully optimizing over continuous variables. |

576 | The Effect of Natural Distribution Shift on Question Answering Models | John Miller; Karl Krauth; Ludwig Schmidt; Benjamin Recht; | Taken together, our results confirm the surprising resilience of the holdout method and emphasize the need to move towards evaluation metrics that incorporate robustness to natural distribution shifts. |

577 | Quantized Decentralized Stochastic Learning over Directed Graphs | Hossein Taheri; Aryan Mokhtari; Hamed Hassani; Ramtin Pedarsani; | To tackle this bottleneck, we propose the quantized decentralized stochastic learning algorithm over directed graphs that is based on the push-sum algorithm in decentralized consensus optimization. |

578 | Semi-Supervised Learning with Normalizing Flows | Pavel Izmailov; Polina Kirichenko; Marc Finzi; Andrew Wilson; | We propose FlowGMM, an end-to-end approach to generative semi-supervised learning with normalizing flows, using a latent Gaussian mixture model. |

579 | Student Specialization in Deep Rectified Networks With Finite Width and Input Dimension | Yuandong Tian; | We consider a deep ReLU / Leaky ReLU student network trained from the output of a fixed teacher network of the same depth, with Stochastic Gradient Descent (SGD). |

580 | Sample Amplification: Increasing Dataset Size even when Learning is Impossible | Brian Axelrod; Shivam Garg; Vatsal Sharan; Gregory Valiant; | Perhaps surprisingly, we show a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly less than what would be necessary to learn distribution D to non-trivial accuracy. |

581 | Alleviating Privacy Attacks via Causal Learning | Shruti Tople; Amit Sharma; Aditya Nori; | Therefore, we propose the use of causal learning approaches where a model learns the causal relationship between the input features and the outcome. |

582 | The Intrinsic Robustness of Stochastic Bandits to Strategic Manipulation | Zhe Feng; David Parkes; Haifeng Xu; | Motivated by economic applications such as recommender systems, we study the behavior of stochastic bandits algorithms under strategic behavior conducted by rational actors, i.e., the arms. |

583 | Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis | Yusuke Tsuzuku; Issei Sato; Masashi Sugiyama; | In this paper, we first provide generalization error bounds using existing normalized flatness measures. Using the analysis, we then propose a novel normalized flatness metric. |

584 | Fiedler Regularization: Learning Neural Networks with Graph Sparsity | Edric Tam; David Dunson; | We introduce a novel regularization approach for deep learning that incorporates and respects the underlying graphical structure of the neural network. |

585 | Online Learning with Imperfect Hints | Aditya Bhaskara; Ashok Cutkosky; Ravi Kumar; Manish Purohit; | In this paper we develop algorithms and nearly matching lower bounds for online learning with imperfect hints. |

586 | Rate-distortion optimization guided autoencoder for isometric embedding in Euclidean latent space | Keizo Kato; Jing Zhou; Tomotake Sasaki; Akira Nakagawa; | In the end, the probability distribution function (PDF) in the real space cannot be estimated from that of the latent space accurately. To overcome this problem, we propose Rate-Distortion Optimization guided autoencoder. |

587 | Optimization from Structured Samples for Coverage Functions | Wei Chen; Xiaoming Sun; Jialin Zhang; Zhijie Zhang; | In this work, to circumvent the impossibility result of OPS, we propose a stronger model called optimization from structured samples (OPSS) for coverage functions, where the data samples encode the structural information of the functions. |

588 | Optimal Randomized First-Order Methods for Least-Squares Problems | Jonathan Lacotte; Mert Pilanci; | We provide an exact asymptotic analysis of the performance of some fast randomized algorithms for solving overdetermined least-squares problems. |

589 | Stochastic Optimization for Non-convex Inf-Projection Problems | Yan Yan; Yi Xu; Lijun Zhang; Wang Xiaoyu; Tianbao Yang; | In this paper, we study a family of non-convex and possibly non-smooth inf-projection minimization problems, where the target objective function is equal to minimization of a joint function over another variable. |

590 | Convex Representation Learning for Generalized Invariance in Semi-Inner-Product Space | Yingyi Ma; Vignesh Ganapathiraman; Yaoliang Yu; Xinhua Zhang; | In this work, we develop a convex representation learning algorithm for a variety of generalized invariances that can be modeled as semi-norms. |

591 | Neural Kernels Without Tangents | Vaishaal Shankar; Alex Fang; Wenshuo Guo; Sara Fridovich-Keil; Jonathan Ragan-Kelley; Ludwig Schmidt; Benjamin Recht; | In particular, using well established feature space tools such as direct sum, averaging, and moment lifting, we present an algebra for creating “compositional” kernels from bags of features. |

592 | Linear Lower Bounds and Conditioning of Differentiable Games | Adam Ibrahim; Waïss Azizian; Gauthier Gidel; Ioannis Mitliagkas; | In this work, we approach the question of fundamental iteration complexity by providing lower bounds to complement the linear (i.e. geometric) upper bounds observed in the literature on a wide class of problems. |

593 | Finite-Time Last-Iterate Convergence for Multi-Agent Learning in Games | Tianyi Lin; Zhengyuan Zhou; Panayotis Mertikopoulos; Michael Jordan; | In this paper, we consider multi-agent learning via online gradient descent in a class of games called $\lambda$-cocoercive games, a fairly broad class of games that admits many Nash equilibria and that properly includes unconstrained strongly monotone games. |

594 | Communication-Efficient Distributed PCA by Riemannian Optimization | Long-Kai Huang; Jialin Pan; | In this paper, we study the leading eigenvector problem in a statistically distributed setting and propose a communication-efficient algorithm based on Riemannian optimization, which trades local computation for global communication. |

595 | Manifold Identification for Ultimately Communication-Efficient Distributed Optimization | Yu-Sheng Li; Wei-Lin Chiang; Ching-pei Lee; | This work proposes a progressive manifold identification approach with sound theoretical justifications to greatly reduce both the communication rounds and the bytes communicated per round for partly smooth regularized problems, which include many large-scale machine learning tasks such as the training of $\ell_1$- and group-LASSO-regularized models. |

596 | When Demands Evolve Larger and Noisier: Learning and Earning in a Growing Environment | Feng Zhu; Zeyu Zheng; | We consider a single-product dynamic pricing problem under a specific non-stationary setting, where the demand grows over time in expectation and possibly gets noisier. |

597 | Being Bayesian about Categorical Probability | Taejong Joo; Uijung Chung; Min-Gwan Seo; | As a Bayesian alternative to the softmax, we consider a random variable of a categorical probability over class labels. |

598 | Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning | Kimin Lee; Younggyo Seo; Seunghyun Lee; Honglak Lee; Jinwoo Shin; | To tackle this problem, we decompose the task of learning a global dynamics model into two stages: (a) learning a context latent vector that captures the local dynamics, then (b) predicting the next state conditioned on it. |

599 | Learning Reasoning Strategies in End-to-End Differentiable Proving | Pasquale Minervini; Tim Rocktäschel; Sebastian Riedel; Edward Grefenstette; Pontus Stenetorp; | We present Conditional Theorem Provers (CTPs), an extension to NTPs that learns an optimal rule selection strategy via gradient-based optimisation. |

600 | Fast and Private Submodular and $k$-Submodular Functions Maximization with Matroid Constraints | Akbar Rafiey; Yuichi Yoshida; | In this paper, we study the problem of maximizing monotone submodular functions subject to matroid constraints in the framework of differential privacy. |

601 | Streaming Coresets for Symmetric Tensor Factorization | Supratim Shit; Anirban Dasgupta; Rachit Chhaya; Jayesh Choudhari; | Given a set of $n$ vectors, each in $\mathbb{R}^d$, we present algorithms to select a sublinear number of these vectors as coreset, while guaranteeing that the CP decomposition of the $p$-moment tensor of the coreset approximates the corresponding decomposition of the $p$-moment tensor computed from the full data. |

602 | How Good is the Bayes Posterior in Deep Neural Networks Really? | Florian Wenzel; Kevin Roth; Bastiaan Veeling; Jakub Swiatkowski; Linh Tran; Stephan Mandt; Jasper Snoek; Tim Salimans; Rodolphe Jenatton; Sebastian Nowozin; | In this work we cast doubt on the current understanding of Bayes posteriors in popular deep neural networks: we demonstrate through careful MCMC sampling that the posterior predictive induced by the Bayes posterior yields systematically worse predictions when compared to simpler methods including point estimates obtained from SGD. |

603 | Optimally Solving Two-Agent Decentralized POMDPs Under One-Sided Information Sharing | Yuxuan Xie; Jilles Dibangoye; Olivier Buffet; | This paper addresses this question for a team of two agents, with one-sided information sharing, i.e., both agents have imperfect information about the state of the world, but only one has access to what the other sees and does. |

604 | Learning Algebraic Multigrid Using Graph Neural Networks | Ilay Luz; Meirav Galun; Haggai Maron; Ronen Basri; Irad Yavneh; | Here we propose a framework for learning AMG prolongation operators for linear systems with sparse symmetric positive (semi-) definite matrices. |

605 | Fractal Gaussian Networks: A sparse random graph model based on Gaussian Multiplicative Chaos | Subhroshekhar Ghosh; Krishna Balasubramanian; Xiaochuan Yang; | We propose a novel stochastic network model, called Fractal Gaussian Network (FGN), that embodies well-defined and analytically tractable fractal structures. |

606 | Structured Policy Iteration for Linear Quadratic Regulator | Youngsuk Park; Ryan Rossi; Zheng Wen; Gang Wu; Handong Zhao; | In this paper, we introduce the Structured Policy Iteration (S-PI) for LQR, a method capable of deriving a structured linear policy. |

607 | T-GD: Transferable GAN-generated Images Detection Framework | Hyeonseong Jeon; Young Oh Bang; Junyaup Kim; Simon Woo; | In this work, we present a robust transferable framework to effectively detect GAN-images, called Transferable GAN-images Detection framework (T-GD). |

608 | Low Bias Low Variance Gradient Estimates for Hierarchical Boolean Stochastic Networks | Adeel Pervez; Taco Cohen; Efstratios Gavves; | To analyze such networks, we introduce the framework of harmonic analysis for Boolean functions to derive an analytic formulation for the bias and variance in the Straight-Through estimator. |

609 | Learning Flat Latent Manifolds with VAEs | Nutan Chen; Alexej Klushyn; Francesco Ferroni; Justin Bayer; Patrick van der Smagt; | We propose an extension to the framework of variational auto-encoders that allows learning flat latent manifolds, where the Euclidean metric is a proxy for the similarity between data points. |

610 | Multi-Task Learning with User Preferences: Gradient Descent with Controlled Ascent in Pareto Optimization | Debabrata Mahapatra; Vaibhav Rajan; | We develop the first gradient-based multi-objective MTL algorithm to address this problem. |

611 | Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources | Yun-Yun Tsai; Pin-Yu Chen; Tsung-Yi Ho; | Motivated by the techniques from adversarial machine learning (ML) that are capable of manipulating the model prediction via data perturbations, in this paper we propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box ML model (e.g., a prediction API or a proprietary software) for solving different ML tasks, especially in the scenario with scarce data and constrained resources. |

612 | On Coresets for Regularized Regression | Rachit Chhaya; Supratim Shit; Anirban Dasgupta; | We propose a modified version of the LASSO problem and obtain for it a coreset of size smaller than the least square regression. |

613 | Budgeted Online Influence Maximization | Pierre Perrault; Zheng Wen; Michal Valko; Jennifer Healey; | We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. |

614 | On the (In)tractability of Computing Normalizing Constants for the Product of Determinantal Point Processes | Naoto Ohsaka; Tatsuya Matsuoka; | We consider the product of determinantal point processes (DPPs), a point process whose probability mass is proportional to the product of principal minors of multiple matrices as a natural, promising generalization of DPPs. |

615 | Monte-Carlo Tree Search as Regularized Policy Optimization | Jean-Bastien Grill; Florent Altché; Yunhao Tang; Thomas Hubert; Michal Valko; Ioannis Antonoglou; Remi Munos; | In this paper, we show that AlphaZero’s search heuristic, along with other common ones, can be interpreted as an approximation to the solution of a specific regularized policy optimization problem. |

616 | On the Expressivity of Neural Networks for Deep Reinforcement Learning | Kefan Dong; Yuping Luo; Tianhe Yu; Chelsea Finn; Tengyu Ma; | We show, theoretically and empirically, that even for one-dimensional continuous state space, there are many MDPs whose optimal Q-functions and policies are much more complex than the dynamics. |

617 | The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks | Jakub Swiatkowski; Kevin Roth; Bastiaan Veeling; Linh Tran; Joshua Dillon; Jasper Snoek; Stephan Mandt; Tim Salimans; Rodolphe Jenatton; Sebastian Nowozin; | For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. |

618 | A Generative Model for Molecular Distance Geometry | Gregor Simm; Jose Miguel Hernandez-Lobato; | We present a probabilistic model that generates such samples for molecules from their graph representations. |

619 | Why bigger is not always better: on finite and infinite neural networks | Laurence Aitchison; | This motivates the introduction of a new class of network: infinite networks with bottlenecks, which inherit the theoretical tractability of infinite networks while at the same time allowing representation learning. |

620 | Data-Efficient Image Recognition with Contrastive Predictive Coding | Olivier Henaff; | We therefore revisit and improve Contrastive Predictive Coding, an unsupervised objective for learning such representations. |

621 | Intrinsic Reward Driven Imitation Learning via Generative Model | Xingrui Yu; Yueming Lyu; Ivor Tsang; | To address this challenge, we propose a novel reward learning module to generate intrinsic reward signals via a generative model. |

622 | Can Increasing Input Dimensionality Improve Deep Reinforcement Learning? | Kei Ota; Tomoaki Oiki; Devesh Jha; Toshisada Mariyama; Daniel Nikovski; | In this paper, we study whether increasing input dimensionality improves the performance and sample efficiency of model-free deep RL algorithms. |

623 | Batch Reinforcement Learning with Hyperparameter Gradients | Byung-Jun Lee; Jongmin Lee; Peter Vrancx; Dongho Kim; Kee-Eung Kim; | Unlike prior work where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that uses a gradient-based optimization of the hyperparameter using held-out data. |

624 | Sub-Goal Trees — a Framework for Goal-Based Reinforcement Learning | Tom Jurgenson; Or Avner; Edward Groshev; Aviv Tamar; | Instead, we propose a new RL framework, derived from a dynamic programming equation for the all pairs shortest path (APSP) problem, which naturally solves goal-directed queries. |

625 | A Geometric Approach to Archetypal Analysis via Sparse Projections | Vinayak Abrol; Pulkit Sharma; | This work presents a computationally efficient greedy AA (GAA) algorithm. |

626 | Sequence Generation with Mixed Representations | Lijun Wu; Shufang Xie; Yingce Xia; Yang Fan; Jian-Huang Lai; Tao Qin; Tie-Yan Liu; | In this work, we propose to leverage the mixed representations from different tokenization methods for sequence generation tasks, in order to boost the model performance with unique characteristics and advantages of individual tokenization methods. |

627 | Agent57: Outperforming the Atari Human Benchmark | Adrià Puigdomenech Badia; Bilal Piot; Steven Kapturowski; Pablo Sprechmann; Oleksandr Vitvitskyi; Zhaohan Guo; Charles Blundell; | We propose Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games. |

628 | RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr | Xingjian Li; Haoyi Xiong; Haozhe An; Dejing Dou; Cheng-Zhong Xu; | In this work, we propose RIFLE – a simple yet effective strategy that deepens backpropagation in transfer learning settings, through periodically ReInitializing the Fully-connected LayEr with random scratch during the fine-tuning procedure. |

629 | Fairwashing explanations with off-manifold detergent | Christopher Anders; Ann-Kathrin Dombrowski; Klaus-robert Mueller; Pan Kessel; Plamen Pasliev; | In this paper, we show both theoretically and experimentally that these hopes are presently unfounded. |

630 | Learning disconnected manifolds: a no GAN’s land | Ugo Tanielian; Thibaut Issenhuth; Elvis Dohmatob; Jeremie Mary; | We formalize this problem by establishing a "no free lunch" theorem for disconnected manifold learning that states an upper bound on the precision of the targeted distribution. |

631 | Sets Clustering | Ibrahim Jubran; Murad Tukan; Alaa Maalouf; Dan Feldman; | We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. |

632 | Variational Autoencoders with Riemannian Brownian Motion Priors | Dimitris Kalatzis; David Eklund; Georgios Arvanitidis; Søren Hauberg; | To counter this, we assume a Riemannian structure over the latent space, which constitutes a more principled geometric view of the latent codes, and replace the standard Gaussian prior with a Riemannian Brownian motion prior. |

633 | Non-separable Non-stationary random fields | Kangrui Wang; Oliver Hamelijnck; Theodoros Damoulas; Mark Steel; | We describe a framework for constructing non-separable non-stationary random fields that is based on an infinite mixture of convolved stochastic processes. |

634 | Nonparametric Score Estimators | Yuhao Zhou; Jiaxin Shi; Jun Zhu; | We provide a unifying view of these estimators under the framework of regularized nonparametric regression. |

635 | A Free-Energy Principle for Representation Learning | Yansong Gao; Pratik Chaudhari; | This paper employs a formal connection of machine learning with thermodynamics to characterize the quality of learnt representations for transfer learning. |

636 | Scalable Differential Privacy with Certified Robustness in Adversarial Learning | Hai Phan; My T. Thai; Han Hu; Ruoming Jin; Tong Sun; Dejing Dou; | In this paper, we aim to develop a scalable algorithm to preserve differential privacy (DP) in adversarial learning for deep neural networks (DNNs), with certified robustness to adversarial examples. |

637 | Variational Inference for Sequential Data with Future Likelihood Estimates | Geon-Hyeong Kim; Youngsoo Jang; Hongseok Yang; Kee-Eung Kim; | To tackle this challenge, we present a novel variational inference algorithm for sequential data, which performs well even when the density from the model is not differentiable, for instance, due to the use of discrete random variables. |

638 | Implicit Learning Dynamics in Stackelberg Games: Equilibria Characterization, Convergence Analysis, and Empirical Study | Tanner Fiez; Benjamin Chasnov; Lillian Ratliff; | We derive novel gradient-based learning dynamics emulating the natural structure of a Stackelberg game using the Implicit Function Theorem and provide convergence analysis for deterministic and stochastic updates for zero-sum and general-sum games. |

639 | Let’s Agree to Agree: Neural Networks Share Classification Order on Real Datasets | Guy Hacohen; Leshem Choshen; Daphna Weinshall; | We report a series of robust empirical observations, whereby deep Neural Networks learn the examples in both the training and test sets in a similar order. |

640 | Quantile Causal Discovery | Natasa Tagasovska; Thibault Vatter; Valérie Chavez-Demoulin; | Based on this theory, we develop Quantile Causal Discovery (QCD), a new method to uncover causal relationships. |

641 | How to Solve Fair k-Center in Massive Data Models | Ashish Chiplunkar; Sagar Kale; Sivaramakrishnan Natarajan Ramamoorthy; | In this work, we design new streaming and distributed algorithms for the fair k-center problem that models fair data summarization. |

642 | Bayesian Learning from Sequential Data using Gaussian Processes with Signature Covariances | Csaba Toth; Harald Oberhauser; | To deal with this, we introduce a sparse variational approach with inducing tensors. |

643 | Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural Network Initialization? | Yaniv Blumenfeld; Dar Gilboa; Daniel Soudry; | This indicates that random, diverse initializations are *not* necessary for training neural networks. |

644 | Dynamic Knapsack Optimization Towards Efficient Multi-Channel Sequential Advertising | Xiaotian Hao; Zhaoqing Peng; Yi Ma; Guan Wang; Junqi Jin; Jianye Hao; Shan Chen; Rongquan Bai; Mingzhou Xie; Miao Xu; Zhenzhe Zheng; Chuan Yu; Han Li; Jian Xu; Kun Gai; | In this paper, we formulate the sequential advertising strategy optimization as a dynamic knapsack problem. |

645 | Stochastically Dominant Distributional Reinforcement Learning | John Martin; Michal Lyskawinski; Xiaohu Li; Brendan Englot; | We describe a new approach for managing aleatoric uncertainty in the Reinforcement Learning paradigm. |

646 | Adversarial Robustness Against the Union of Multiple Threat Models | Pratyush Maini; Eric Wong; Zico Kolter; | In this work, we develop a natural generalization of the standard PGD-based procedure to incorporate multiple threat models into a single attack, by taking the worst-case over all steepest descent directions. |

647 | Student-Teacher Curriculum Learning via Reinforcement Learning: Predicting Hospital Inpatient Admission Location | Rasheed El-Bouri; David Eyre; Peter Watkinson; Tingting Zhu; David Clifton; | In this work we propose a student-teacher network via reinforcement learning to deal with this specific problem. |

648 | Option Discovery in the Absence of Rewards with Manifold Analysis | Amitay Bar; Ronen Talmon; Ron Meir; | In this paper, we present an approach based on spectral graph theory and derive an algorithm that systematically discovers options without access to a specific reward or task assignment. |

649 | Generalisation error in learning with random features and the hidden manifold model | Federica Gerace; Bruno Loureiro; Florent Krzakala; Marc Mezard; Lenka Zdeborova; | We study generalized linear regression and classification for a synthetically generated dataset encompassing different problems of interest, such as learning with random features, neural networks in the lazy training regime, and the hidden manifold model. |

650 | Fast and Consistent Learning of Hidden Markov Models by Incorporating Non-Consecutive Correlations | Robert Mattila; Cristian Rojas; Eric Moulines; Vikram Krishnamurthy; Bo Wahlberg; | In this paper, we propose extending these methods (both pair- and triplet-based) by also including non-consecutive correlations in a way which does not significantly increase the computational cost (which scales linearly with the number of additional lags included). |

651 | Gradient-free Online Learning in Continuous Games with Delayed Rewards | Amélie Héliou; Panayotis Mertikopoulos; Zhengyuan Zhou; | Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. |

652 | Pseudo-Masked Language Models for Unified Language Model Pre-Training | Hangbo Bao; Li Dong; Furu Wei; Wenhui Wang; Nan Yang; Xiaodong Liu; Yu Wang; Jianfeng Gao; Songhao Piao; Ming Zhou; Hsiao-Wuen Hon; | We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). |

653 | Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits | Robert Peharz; Steven Lang; Antonio Vergari; Karl Stelzner; Alejandro Molina; Martin Trapp; Guy Van den Broeck; Kristian Kersting; Zoubin Ghahramani; | In this paper, we propose Einsum Networks (EiNets), a novel implementation design for PCs, improving prior art in several regards. |

654 | Polynomial Tensor Sketch for Element-wise Function of Low-Rank Matrix | Insu Han; Haim Avron; Jinwoo Shin; | To this end, we propose an efficient sketching-based algorithm whose complexity is significantly lower than the number of entries of $A$, i.e., it runs without accessing all entries of $[f(A_{ij})]$ explicitly. |

655 | Inexact Tensor Methods with Dynamic Accuracies | Nikita Doikov; Yurii Nesterov; | In this paper, we study inexact high-order Tensor Methods for solving convex optimization problems with composite objective. |

656 | k-means++: few more steps yield constant approximation | Davin Choo; Christoph Grunau; Julian Portmann; Vaclav Rozhon; | In this paper, we improve their analysis to show that, for any arbitrarily small constant $\varepsilon > 0$, with only $\varepsilon k$ additional local search steps, one can achieve a constant approximation guarantee (with high probability in k), resolving an open problem in their paper. |

657 | Radioactive data: tracing through training | Alexandre Sablayrolles; Matthijs Douze; Cordelia Schmid; Herve Jegou; | We propose a new technique, radioactive data, that makes imperceptible changes to this dataset such that any model trained on it will bear an identifiable mark. |

658 | Doubly robust off-policy evaluation with shrinkage | Yi Su; Maria Dimakopoulou; Akshay Krishnamurthy; Miroslav Dudik; | We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. |

659 | Fast Adaptation to New Environments via Policy-Dynamics Value Functions | Roberta Raileanu; Max Goldstein; Arthur Szlam; Rob Fergus; | We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training. |

660 | Neural Clustering Processes | Ari Pakman; Yueqi Wang; Catalin Mitelut; JinHyung Lee; Liam Paninski; | In this work we introduce deep network architectures trained with labeled samples from any generative model of clustered datasets. |

661 | Topologically Densified Distributions | Christoph Hofer; Florian Graf; Marc Niethammer; Roland Kwitt; | We study regularization in the context of small sample-size learning with over-parametrized neural networks. |

662 | Low-loss connection of weight vectors: distribution-based approaches | Ivan Anokhin; Dmitry Yarotsky; | We describe and compare experimentally a panel of methods used to connect two low-loss points by a low-loss curve on this surface. |

663 | Graph Filtration Learning | Christoph Hofer; Florian Graf; Bastian Rieck; Marc Niethammer; Roland Kwitt; | We propose an approach to learning with graph-structured data in the problem domain of graph classification. |

664 | Differentiable Product Quantization for Learning Compact Embedding Layers | Ting Chen; Lala Li; Yizhou Sun; | In this work, we propose a generic and end-to-end learnable compression framework termed differentiable product quantization (DPQ). |

665 | Scalable Exact Inference in Multi-Output Gaussian Processes | Wessel Bruinsma; Eric Perim Martins; William Tebbutt; Scott Hosking; Arno Solin; Richard Turner; | We propose the use of a sufficient statistic of the data to accelerate inference and learning in MOGPs with orthogonal bases. |

666 | Lower Complexity Bounds for Finite-Sum Convex-Concave Minimax Optimization Problems | Guangzeng Xie; Luo Luo; yijiang lian; Zhihua Zhang; | This paper studies lower complexity bounds for minimax optimization problems whose objective function is the average of $n$ individual smooth convex-concave functions. |

667 | Near-optimal Regret Bounds for Stochastic Shortest Path | Aviv Rosenberg; Alon Cohen; Yishay Mansour; Haim Kaplan; | In this work we remove this dependence on the minimum cost—we give an algorithm that guarantees a regret bound of $\widetilde{O}(B^{3/2} S \sqrt{A K})$, where $B$ is an upper bound on the expected cost of the optimal policy, $S$ is the number of states, $A$ is the number of actions and $K$ is the total number of episodes. |

668 | The Usual Suspects? Reassessing Blame for VAE Posterior Collapse | Bin Dai; Ziyu Wang; David Wipf; | In particular, we prove that even small nonlinear perturbations of affine VAE decoder models can produce such minima, and in deeper models, analogous minima can force the VAE to behave like an aggressive truncation operator, provably discarding information along all latent dimensions in certain circumstances. |

669 | It’s Not What Machines Can Learn, It’s What We Cannot Teach | Gal Yehuda; Moshe Gabel; Assaf Schuster; | In this work we offer a different perspective on this question. |

670 | Guided Learning of Nonconvex Models through Successive Functional Gradient Optimization | Rie Johnson; Tong Zhang; | This paper presents a framework of successive functional gradient optimization for training nonconvex models such as neural networks, where training is driven by mirror descent in a function space. |

671 | A Markov Decision Process Model for Socio-Economic Systems Impacted by Climate Change | Salman Sadiq Shuvo; Yasin Yilmaz; Alan Bush; Mark Hafen; | In this work, we propose a Markov decision process (MDP) formulation for an agent (government) which interacts with the environment (nature and residents) to deal with the impacts of climate change, in particular sea level rise. |

672 | Can Stochastic Zeroth-Order Frank-Wolfe Method Converge Faster for Non-Convex Problems? | Hongchang Gao; Heng Huang; | To address the problem of lacking gradients in many applications, we propose two new stochastic zeroth-order Frank-Wolfe algorithms and theoretically prove that they have a faster convergence rate than existing methods for non-convex problems. |

673 | Distance Metric Learning with Joint Representation Diversification | Xu Chu; Yang Lin; Xiting Wang; Xin Gao; Qi Tong; Hailong Yu; Yasha Wang; | In contrast, we propose not to penalize intra-class distances explicitly and use a Joint Representation Similarity (JRS) regularizer that focuses on penalizing inter-class distributional similarities in a DML framework. |

674 | Meta-Learning with Shared Amortized Variational Inference | Ekaterina Iakovleva; Karteek Alahari; Jakob Verbeek; | In the context of an empirical Bayes model for meta-learning where a subset of model parameters is treated as latent variables, we propose a novel scheme for amortized variational inference. |

675 | Causal Effect Identifiability under Partial-Observability | Sanghack Lee; Elias Bareinboim; | In this paper, we study the causal effect identifiability problem when the available distributions may be associated with different sets of variables, which we refer to as identification under partial-observability. |

676 | Continuous Graph Neural Networks | Louis-Pascal Xhonneux; Meng Qu; Jian Tang; | We propose continuous graph neural networks (CGNN), which generalise existing graph neural networks with discrete dynamics in that they can be viewed as a specific discretisation scheme. |

677 | Restarted Bayesian Online Change-point Detector achieves Optimal Detection Delay | Reda Alami; Odalric-Ambrym Maillard; Raphaël Féraud; | In this paper, we consider the problem of sequential change-point detection where both the change-points and the distributions before and after the change are assumed to be unknown. |

678 | Robust learning with the Hilbert-Schmidt independence criterion | Daniel Greenfeld; Uri Shalit; | We investigate the use of a non-parametric independence measure, the Hilbert-Schmidt Independence Criterion (HSIC), as a loss-function for learning robust regression and classification models. |

679 | Bayesian Experimental Design for Implicit Models by Mutual Information Neural Estimation | Steven Kleinegesse; Michael Gutmann; | In this paper, we propose a new approach to Bayesian experimental design for implicit models that leverages recent advances in neural MI estimation to deal with these issues. |

680 | Fast Differentiable Sorting and Ranking | Mathieu Blondel; Olivier Teboul; Quentin Berthet; Josip Djolonga; | In this paper, we propose the first differentiable sorting and ranking operators with $O(n \log n)$ time and $O(n)$ space complexity. |

681 | Learning for Dose Allocation in Adaptive Clinical Trials with Safety Constraints | Cong Shen; Zhiyang Wang; Sofia Villar; Mihaela van der Schaar; | We present a novel adaptive clinical trial methodology, called Safe Efficacy Exploration Dose Allocation (SEEDA), that aims at maximizing the cumulative efficacies while satisfying the toxicity safety constraint with high probability. |

682 | Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems | Kaixuan Wei; Angelica I Aviles-Rivero; Jingwei Liang; Ying Fu; Carola-Bibiane Schönlieb; Hua Huang; | In this work, we present a tuning-free PnP proximal algorithm, which can automatically determine the internal parameters including the penalty parameter, the denoising strength and the terminal time. |

683 | Consistent Estimators for Learning to Defer to an Expert | Hussein Mozannar; David Sontag; | In this paper we explore how to learn predictors that can either predict or choose to defer the decision to a downstream expert. |

684 | A Graph to Graphs Framework for Retrosynthesis Prediction | Chence Shi; Minkai Xu; Hongyu Guo; Ming Zhang; Jian Tang; | In this paper, we propose a novel template-free approach called G2Gs by transforming a target molecular graph into a set of reactant molecular graphs. |

685 | Fast computation of Nash Equilibria in Imperfect Information Games | Remi Munos; Julien Perolat; Jean-Baptiste Lespiau; Mark Rowland; Bart De Vylder; Marc Lanctot; Finbarr Timbers; Daniel Hennes; Shayegan Omidshafiei; Audrunas Gruslys; Mohammad Gheshlaghi Azar; Edward Lockhart; Karl Tuyls; | We introduce and analyze a class of algorithms, called Mirror Ascent against an Improved Opponent (MAIO), for computing Nash equilibria in two-player zero-sum games, both in normal form and in sequential imperfect information form. |

686 | Invariant Rationalization | Shiyu Chang; Yang Zhang; Mo Yu; Tommi Jaakkola; | Instead, we introduce a game-theoretic invariant rationalization criterion where the rationales are constrained to enable the same predictor to be optimal across different environments. |

687 | Accelerated Stochastic Gradient-free and Projection-free Methods | Feihu Huang; Lue Tao; Songcan Chen; | In the paper, we propose a class of accelerated stochastic gradient-free and projection-free (a.k.a., zeroth-order Frank Wolfe) methods to solve the problem of constrained stochastic and finite-sum nonconvex optimization. |

688 | Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation | Marc Abeille; Alessandro Lazaric; | Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ and cast it into a constrained *extended* LQR problem, where an additional control variable implicitly selects the system dynamics within a confidence interval. |

689 | Implicit Regularization of Random Feature Models | Arthur Jacot; berfin simsek; Francesco Spadaro; Clement Hongler; Franck Gabriel; | We investigate, by means of random matrix theory, the connection between Gaussian RF models and Kernel Ridge Regression (KRR). |

690 | Missing Data Imputation using Optimal Transport | Boris Muzellec; Julie Josse; Claire Boyer; Marco Cuturi; | We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. |

691 | Unsupervised Speech Decomposition via Triple Information Bottleneck | Kaizhi Qian; Yang Zhang; Shiyu Chang; Mark Hasegawa-Johnson; David Cox; | In this paper, we propose SpeechFlow, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks. |

692 | Provable Representation Learning for Imitation Learning via Bi-level Optimization | Sanjeev Arora; Simon Du; Sham Kakade; Yuping Luo; Nikunj Umesh Saunshi; | We formulate representation learning as a bi-level optimization problem where the “outer” optimization tries to learn the joint representation and the “inner” optimization encodes the imitation learning setup and tries to learn task-specific parameters. |

693 | Convergence of a Stochastic Gradient Method with Momentum for Non-Smooth Non-Convex Optimization | Vien Mai; Mikael Johansson; | Our key innovation is the construction of a special Lyapunov function for which the proven complexity can be achieved without any tuning of the momentum parameter. |

694 | XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation | Junjie Hu; Sebastian Ruder; Aditya Siddhant; Graham Neubig; Orhan Firat; Melvin Johnson; | To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. |

695 | Fair k-Centers via Maximum Matching | Matthew Jones; Thy Nguyen; Huy Nguyen; | This paper combines the best parts of each algorithm, by presenting a linear-time algorithm with a guaranteed 3-approximation factor, and provides empirical evidence of both the algorithm’s runtime and effectiveness. |

696 | Efficiently sampling functions from Gaussian process posteriors | James Wilson; Viacheslav Borovitskiy; Alexander Terenin; Peter Mostowsky; Marc Deisenroth; | Building off of this factorization, we propose decoupled sampling, an easy-to-use and general-purpose approach for fast posterior sampling. |

697 | Characterizing Distribution Equivalence and Structure Learning for Cyclic and Acyclic Directed Graphs | AmirEmad Ghassami; Alan Yang; Negar Kiyavash; Kun Zhang; | We propose analytic as well as graphical methods for characterizing the equivalence of two structures. |

698 | Inverse Active Sensing: Modeling and Understanding Timely Decision-Making | Daniel Jarrett; Mihaela van der Schaar; | In this paper, we develop an expressive, unified framework for the general setting of evidence-based decision-making under endogenous, context-dependent time pressure—which requires negotiating (subjective) tradeoffs between accuracy, speediness, and cost of information. |

699 | On Second-Order Group Influence Functions for Black-Box Predictions | Samyadeep Basu; Xuchen You; Soheil Feizi; | In this paper, we address this issue and propose second-order influence functions for identifying influential groups in test-time predictions. |

700 | Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences | Daniel Brown; Scott Niekum; Russell Coleman; Ravi Srinivasan; | We propose a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by first pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference via sampling. |

701 | Randomly Projected Additive Gaussian Processes for Regression | Ian Delbridge; David Bindel; Andrew Wilson; | Surprisingly, we find that as the number of random projections increases, the predictive performance of this approach quickly converges to the performance of a kernel operating on the original full dimensional inputs, over a wide range of data sets, even if we are projecting into a single dimension. |

702 | Attentive Group Equivariant Convolutional Networks | David Romero; Erik Bekkers; Jakub Tomczak; Mark Hoogendoorn; | In this paper, we present attentive group equivariant convolutions, a generalization of the group convolution, in which attention is applied during the course of convolution to accentuate meaningful symmetry combinations and suppress non-plausible, misleading ones. |

703 | Learning Compound Tasks without Task-specific Knowledge via Imitation and Self-supervised Learning | Sang-Hyun Lee; Seung-Woo Seo; | In this paper, we propose an imitation learning method that can learn compound tasks without task-specific knowledge. |

704 | Confidence Sets and Hypothesis Testing in a Likelihood-Free Inference Setting | Niccolo Dalmasso; Rafael Izbicki; Ann Lee; | In this paper, we present ACORE (Approximate Computation via Odds Ratio Estimation), a frequentist approach to LFI that first formulates the classical likelihood ratio test (LRT) as a parametrized classification problem, and then uses the equivalence of tests and confidence sets to build confidence regions for parameters of interest. |

705 | Curvature-corrected learning dynamics in deep neural networks | Dongsung Huh; | We introduce a partially curvature-corrected learning rule, which provides most of the benefit of full curvature correction in terms of convergence speed with superior numerical stability while preserving the core property of gradient descent under block-diagonal approximations. |

706 | Tightening Exploration in Upper Confidence Reinforcement Learning | Hippolyte Bourel; Odalric-Ambrym Maillard; Mohammad Sadegh Talebi; | Motivated by practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and transition distributions for each state-action pair. To further tighten exploration, we introduce an adaptive computation of the support of each transition distribution. |

707 | Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning | Zhaohan Guo; Bernardo Avila Pires; Mohammad Gheshlaghi Azar; Bilal Piot; Florent Altché; Jean-Bastien Grill; Remi Munos; | Here we introduce Predictions of Bootstrapped Latents (PBL), a simple and flexible self-supervised representation learning algorithm for multitask deep RL. |

708 | Discriminative Adversarial Search for Abstractive Summarization | Thomas Scialom; Paul-Alexis Dray; Sylvain Lamprier; Benjamin Piwowarski; Jacopo Staiano; | We introduce a novel approach for sequence decoding, Discriminative Adversarial Search (DAS), which has the desirable properties of alleviating the effects of exposure bias without requiring external metrics. |

709 | A Swiss Army Knife for Minimax Optimal Transport | Sofien Dhouib; Ievgen Redko; Tanguy Kerdoncuff; Rémi Emonet; Marc Sebban; | In this paper, we propose a general formulation of a minimax OT problem that can tackle these restrictions by jointly optimizing the cost matrix and the transport plan, allowing us to define a robust distance between distributions. |

710 | Invariant Causal Prediction for Block MDPs | Clare Lyle; Amy Zhang; Angelos Filos; Shagun Sodhani; Marta Kwiatkowska; Yarin Gal; Doina Precup; Joelle Pineau; | In this work we propose a method for learning state abstractions which generalize to novel observation distributions in the multi-environment RL setting. |

711 | Involutive MCMC: One Way to Derive Them All | Kirill Neklyudov; Max Welling; Evgenii Egorov; Dmitry Vetrov; | Building upon this, we describe a wide range of MCMC algorithms in terms of iMCMC, and formulate a number of "tricks" which one can use as design principles for developing new MCMC algorithms. |

712 | Adversarial Learning Guarantees for Linear Hypotheses and Neural Networks | Pranjal Awasthi; Natalie Frank; Mehryar Mohri; | In order to make progress on this, we focus on the problem of understanding generalization in adversarial settings, via the lens of Rademacher complexity. |

713 | Deep Reinforcement Learning with Smooth Policy | Qianli Shen; Yan Li; Haoming Jiang; Zhaoran Wang; Tuo Zhao; | In this paper, we develop a new training framework — **S**mooth **R**egularized **R**einforcement **L**earning (SR²L), where the policy is trained with smoothness-inducing regularization. |

714 | On the Power of Compressed Sensing with Generative Models | Akshay Kamath; Eric Price; Sushrut Karmalkar; | In this paper, we prove results that (i) establish the difficulty of this task and show that existing bounds are tight, and (ii) demonstrate that the latter task is a generalization of the former. |

715 | Laplacian Regularized Few-Shot Learning | Imtiaz Ziko; Jose Dolz; Eric Granger; Ismail Ben Ayed; | We propose a Laplacian-regularization objective for few-shot tasks, which integrates two types of potentials: (1) unary potentials assigning query samples to the nearest class prototype and (2) pairwise Laplacian potentials encouraging nearby query samples to have consistent predictions. |

716 | Neural Datalog Through Time: Informed Temporal Modeling via Logical Specification | Hongyuan Mei; Guanghui Qin; Minjie Xu; Jason Eisner; | To exploit known structure, we propose using a deductive database to track facts over time, where each fact has a time-varying state: a vector computed by a neural net whose topology is determined by the fact’s provenance and experience. |

717 | Up or Down? Adaptive Rounding for Post-Training Quantization | Markus Nagel; Rana Ali Amjad; Marinus van Baalen; Christos Louizos; Tijmen Blankevoort; | In this paper, we propose AdaRound, a better weight-rounding mechanism for post-training quantization that adapts to the data and the task loss. |

718 | A quantile-based approach for hyperparameter transfer learning | David Salinas; Huibin Shen; Valerio Perrone; | In this work, we introduce a novel approach to achieve transfer learning across different datasets as well as different objectives. |

719 | Inductive Bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters | Subho Banerjee; Saurabh Jha; Zbigniew Kalbarczyk; Ravishankar Iyer; | This paper addresses the challenge in two ways: (i) a domain-driven Bayesian reinforcement learning (RL) model for scheduling, which inherently models the resource dependencies identified from the system architecture; and (ii) a sampling-based technique which allows the computation of gradients of a Bayesian model without performing full probabilistic inference. |

720 | Adversarial Robustness for Code | Pavol Bielik; Martin Vechev; | In this work we address this gap by: (i) developing adversarial attacks for code (a domain with discrete and highly structured inputs), (ii) showing that, similar to other domains, neural models for code are highly vulnerable to adversarial attacks, and (iii) developing a set of novel techniques that enable training robust and accurate models of code. |

721 | The Boomerang Sampler | Joris Bierkens; Sebastiano Grazzi; Kengo Kamatani; Gareth Roberts; | This paper introduces the boomerang sampler as a novel class of continuous-time non-reversible Markov chain Monte Carlo algorithms. |

722 | Weakly-Supervised Disentanglement Without Compromises | Francesco Locatello; Ben Poole; Gunnar Raetsch; Bernhard Schölkopf; Olivier Bachem; Michael Tschannen; | First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. |

723 | Predictive Sampling with Forecasting Autoregressive Models | Auke Wiggers; Emiel Hoogeboom; | In this paper, we introduce the predictive sampling algorithm: a procedure that exploits the fast inference property of ARMs in order to speed up sampling, while keeping the model intact. |

724 | InfoGAN-CR: Disentangling Generative Adversarial Networks with Contrastive Regularizers | Zinan Lin; Kiran Thekumparampil; Giulia Fanti; Sewoong Oh; | We propose an unsupervised model selection scheme based on medoids. |

725 | TrajectoryNet: A Dynamic Optimal Transport Network for Modeling Cellular Dynamics | Alexander Tong; Jessie Huang; Guy Wolf; David van Dijk; Smita Krishnaswamy; | We present {\em TrajectoryNet}, which controls the continuous paths taken between distributions. |

726 | The role of regularization in classification of high-dimensional noisy Gaussian mixture | Francesca Mignacco; Florent Krzakala; Yue Lu; Pierfrancesco Urbani; Lenka Zdeborova; | We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number $n$ of samples and their dimension $d$ go to infinity while their ratio is fixed to $\alpha=n/d$. |

727 | Normalizing Flows on Tori and Spheres | Danilo J. Rezende; George Papamakarios; Sebastien Racaniere; Michael Albergo; Gurtej Kanwar; Phiala Shanahan; Kyle Cranmer; | In this paper, we propose and compare expressive and numerically stable flows on such spaces. |

728 | Structured Linear Contextual Bandits: A Sharp and Geometric Smoothed Analysis | Vidyashankar Sivakumar; Steven Wu; Arindam Banerjee; | In this work, we consider a smoothed setting for structured linear contextual bandits where the adversarial contexts are perturbed by Gaussian noise and the unknown parameter $\theta^*$ has structure, e.g., sparsity, group sparsity, low rank, etc. |

729 | Simple and sharp analysis of k-means|| | Vaclav Rozhon; | We present a truly simple analysis of k-means|| (Bahmani et al., PVLDB 2012) — a distributed variant of the k-means++ algorithm (Arthur and Vassilvitskii, SODA 2007) — and improve its round complexity from O(log (Var X)), where Var X is the variance of the input data set, to O(log (Var X) / log log (Var X)), which we show to be tight. |

730 | Efficient proximal mapping of the path-norm regularizer of shallow networks | Fabian Latorre; Paul Rolland; Shaul Nadav Hallak; Volkan Cevher; | We demonstrate two new important properties of the path-norm regularizer for shallow neural networks. |

731 | Regularized Optimal Transport is Ground Cost Adversarial | François-Pierre Paty; Marco Cuturi; | In this paper, we adopt a more geometrical point of view, and show using Fenchel duality that any convex regularization of OT can be interpreted as ground cost adversarial. |

732 | Automatic Shortcut Removal for Self-Supervised Representation Learning | Matthias Minderer; Olivier Bachem; Neil Houlsby; Michael Tschannen; | Here, we propose a general framework for removing shortcut features automatically. |

733 | Fair Learning with Private Demographic Data | Hussein Mozannar; Mesrob Ohannessian; Nati Srebro; | We give a scheme that allows individuals to release their sensitive information privately while still allowing any downstream entity to learn non-discriminatory predictors. |

734 | Deep Divergence Learning | Kubra Cilingir; Rachel Manzelli; Brian Kulis; | In this paper, we introduce deep Bregman divergences, which are based on learning and parameterizing functional Bregman divergences using neural networks, and which unify and extend these existing lines of work. |

735 | A new regret analysis for Adam-type algorithms | Ahmet Alacaoglu; Yura Malitsky; Panayotis Mertikopoulos; Volkan Cevher; | In this paper, we focus on a theory-practice gap for Adam and its variants (AMSgrad, AdamNC, etc.). |

736 | Accelerated Message Passing for Entropy-Regularized MAP Inference | Jonathan Lee; Aldo Pacchiano; Peter Bartlett; Michael Jordan; | In this paper, we present randomized methods for accelerating these algorithms by leveraging techniques that underlie classical accelerated gradient methods. |

737 | Dissecting Non-Vacuous Generalization Bounds based on the Mean-Field Approximation | Konstantinos Pitas; | We show empirically that this approach gives negligible gains when modelling the posterior as a Gaussian with diagonal covariance—known as the mean-field approximation. |

738 | (Individual) Fairness for k-Clustering | Sepideh Mahabadi; Ali Vakilian; | In this work, we show how to get an approximately optimal such fair $k$-clustering. |

739 | Relaxing Bijectivity Constraints with Continuously Indexed Normalising Flows | Rob Cornish; Anthony Caterini; George Deligiannidis; Arnaud Doucet; | To address this, we propose continuously indexed flows (CIFs), which replace the single bijection used by normalising flows with a continuously indexed family of bijections, and which intuitively allow rerouting mass that would be misplaced by a single bijection. |

740 | Gamification of Pure Exploration for Linear Bandits | Rémy Degenne; Pierre Menard; Xuedong Shang; Michal Valko; | We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits. |

741 | Growing Adaptive Multi-hyperplane Machines | Nemanja Djuric; Zhuang Wang; Slobodan Vucetic; | In this paper we show that this performance gap is not due to limited representability of the MM model, as it can represent arbitrary concepts. |

742 | Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data | Felipe Petroski Such; Aditya Rawal; Joel Lehman; Kenneth Stanley; Jeffrey Clune; | This paper introduces GTNs, discusses their potential, and showcases that they can substantially accelerate learning. |

743 | Structured Prediction with Partial Labelling through the Infimum Loss | Vivien Cabannnes; Francis Bach; Alessandro Rudi; | This paper provides a unified framework based on structured prediction and on the concept of {\em infimum loss} to deal with partial labelling over a wide family of learning problems and loss functions. |

744 | ControlVAE: Controllable Variational Autoencoder | Huajie Shao; Shuochao Yao; Dachun Sun; Aston Zhang; Shengzhong Liu; Dongxin Liu; Jun Wang; Tarek Abdelzaher; | To address these issues, we propose a novel controllable variational autoencoder framework, ControlVAE, that combines a controller, inspired by automatic control theory, with the basic VAE to improve the performance of resulting generative models. |

745 | On Semi-parametric Inference for BART | Veronika Rockova; | In this work, we continue the theoretical investigation of BART initiated recently by Rockova and van der Pas (2017). |

746 | Simple and Scalable Epistemic Uncertainty Estimation Using a Single Deep Deterministic Neural Network | Joost van Amersfoort; Lewis Smith; Yee Whye Teh; Yarin Gal; | We propose a method for training a deterministic deep model that can find and reject out-of-distribution data points at test time with a single forward pass. |

747 | Ordinal Non-negative Matrix Factorization for Recommendation | Olivier Gouvert; Thomas Oberlin; Cedric Fevotte; | We introduce a new non-negative matrix factorization (NMF) method for ordinal data, called OrdNMF. |

748 | NetGAN without GAN: From Random Walks to Low-Rank Approximations | Luca Rendsburg; Holger Heidrich; Ulrike von Luxburg; | In this paper, we investigate the implicit bias of NetGAN. |

749 | On the Iteration Complexity of Hypergradient Computations | Riccardo Grazzi; Saverio Salzo; Massimiliano Pontil; Luca Franceschi; | We present a unified analysis which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity. |

750 | Skew-Fit: State-Covering Self-Supervised Reinforcement Learning | Vitchyr Pong; Murtaza Dalal; Steven Lin; Ashvin Nair; Shikhar Bahl; Sergey Levine; | In this paper, we propose a formal exploration objective for goal-reaching policies that maximizes state coverage. |

751 | Stochastic Optimization for Regularized Wasserstein Estimators | Marin Ballu; Quentin Berthet; Francis Bach; | In this work, we introduce an algorithm to solve a regularized version of this problem of Wasserstein estimators, with a time per step which is sublinear in the natural dimensions of the problem. |

752 | LP-SparseMAP: Differentiable Relaxed Optimization for Sparse Structured Prediction | Vlad Niculae; Andre Filipe Torres Martins; | In this paper, we introduce LP-SparseMAP, an extension of SparseMAP addressing this limitation via a local polytope relaxation. |

753 | Problems with Shapley-value-based explanations as feature importance measures | Indra Kumar; Suresh Venkatasubramanian; Carlos Scheidegger; Sorelle Friedler; | We show that mathematical problems arise when Shapley values are used for feature importance and that the solutions to mitigate these necessarily induce further complexity, such as the need for causal reasoning. |

754 | Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes | Chen-Yu Wei; Mehdi Jafarnia; Haipeng Luo; Hiteshi Sharma; Rahul Jain; | In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). |

755 | Near-linear time Gaussian process optimization with adaptive batching and resparsification | Daniele Calandriello; Luigi Carratino; Alessandro Lazaric; Michal Valko; Lorenzo Rosasco; | In this paper, we introduce BBKB (Batch Budgeted Kernel Bandits), the first no-regret GP optimization algorithm that provably runs in near-linear time and selects candidates in batches. |

756 | Parallel Algorithm for Non-Monotone DR-Submodular Maximization | Alina Ene; Huy Nguyen; | In this work, we give a new parallel algorithm for the problem of maximizing a non-monotone diminishing returns submodular function subject to a cardinality constraint. |

757 | Structure Adaptive Algorithms for Stochastic Bandits | Rémy Degenne; Han Shao; Wouter Koolen; | Our aim is to develop methods that are flexible (in that they easily adapt to different structures), powerful (in that they perform well empirically and/or provably match instance-dependent lower bounds) and efficient (in that the per-round computational burden is small). |

758 | Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks | Blake Bordelon; Abdulkadir Canatar; Cengiz Pehlevan; | We derive analytical expressions for learning curves for kernel regression, and use them to evaluate how the test loss of a trained neural network depends on the number of samples. |

759 | Preference modelling with context-dependent salient features | Amanda Bower; Laura Balzano; | Formalizing this framework, we propose the \textit{salient feature preference model} and prove a sample complexity result for learning the parameters of our model and the underlying ranking with maximum likelihood estimation. |

760 | Infinite attention: NNGP and NTK for deep attention networks | Jiri Hron; Yasaman Bahri; Jascha Sohl-Dickstein; Roman Novak; | We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. |

761 | Fast Learning of Graph Neural Networks with Guaranteed Generalizability: One-hidden-layer Case | Shuai Zhang; Meng Wang; Sijia Liu; Pin-Yu Chen; Jinjun Xiong; | In this paper, we provide a theoretically-grounded generalizability analysis of GNNs with one hidden layer for both regression and binary classification problems. |

762 | Efficient Domain Generalization via Common-Specific Low-Rank Decomposition | Vihari Piratla; Praneeth Netrapalli; Sunita Sarawagi; | We present CSD (Common Specific Decomposition), for this setting, which jointly learns a common component (which generalizes to new domains) and a domain specific component (which overfits on training domains). |

763 | Identifying the Reward Function by Anchor Actions | Sinong Geng; Houssam Nassif; Carlos Manzanares; Max Reppen; Ronnie Sircar; | We propose a reward function estimation framework for inverse reinforcement learning with deep energy-based policies. |

764 | No-Regret and Incentive-Compatible Online Learning | Rupert Freeman; David Pennock; Chara Podimata; Jennifer Wortman Vaughan; | Our goal is twofold. First, we want the learning algorithm to be no-regret with respect to the best fixed expert in hindsight. Second, we want incentive compatibility, a guarantee that each expert’s best strategy is to report his true beliefs about the realization of each event. |

765 | Probing Emergent Semantics in Predictive Agents via Question Answering | Abhishek Das; Federico Carnevale; Hamza Merzic; Laura Rimell; Rosalia Schneider; Josh Abramson; Alden Hung; Arun Ahuja; Stephen Clark; Greg Wayne; Feilx Hill; | We propose question-answering as a general paradigm to decode and understand the representations that such agents develop, applying our method to two recent approaches to predictive modelling – action-conditional CPC (Guo et al., 2018) and SimCore (Gregor et al., 2019). |

766 | Meta-learning with Stochastic Linear Bandits | Leonardo Cella; Alessandro Lazaric; Massimiliano Pontil; | We investigate meta-learning procedures in the setting of stochastic linear bandits tasks. |

767 | A Unified Theory of Decentralized SGD with Changing Topology and Local Updates | Anastasiia Koloskova; Nicolas Loizou; Sadra Boreiri; Martin Jaggi; Sebastian Stich; | In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities. |

768 | AdaScale SGD: A User-Friendly Algorithm for Distributed Training | Tyler Johnson; Pulkit Agrawal; Haijie Gu; Carlos Guestrin; | We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. |

769 | Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning | Dipendra Misra; Mikael Henaff; Akshay Krishnamurthy; John Langford; | We present an algorithm, HOMER, for exploration and reinforcement learning in rich observation environments that are summarizable by an unknown latent state space. |

770 | Logistic Regression for Massive Data with Rare Events | HaiYing Wang; | This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). |

771 | Automated Synthetic-to-Real Generalization | Wuyang Chen; Zhiding Yu; Zhangyang Wang; Anima Anandkumar; | We treat this as a learning without forgetting problem and propose a learning-to-optimize (L2O) method to automate layer-wise learning rates. |

772 | Online Learning with Dependent Stochastic Feedback Graphs | Corinna Cortes; Giulia DeSalvo; Claudio Gentile; Mehryar Mohri; Ningshan Zhang; | We study a challenging scenario where feedback graphs vary stochastically with time and, more importantly, where graphs and losses are dependent. |

773 | Sparse Sinkhorn Attention | Yi Tay; Dara Bahri; Liu Yang; Don Metzler; Da-Cheng Juan; | We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. |

774 | Online Continual Learning from Imbalanced Data | Aristotelis Chrysakis; Marie-Francine Moens; | More importantly, we introduce a new memory population approach, which we call class-balancing reservoir sampling (CBRS). |
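
As a rough illustration of the class-balancing memory idea described in entry 774 (a hypothetical sketch based only on the highlight, not the paper's exact CBRS algorithm): while the buffer has free space, store every incoming sample; once full, an incoming sample from an under-represented class evicts a random sample from the currently largest class, and samples from already well-represented classes compete via a per-class reservoir step.

```python
import random
from collections import defaultdict

class ClassBalancingReservoir:
    """Hypothetical sketch of a class-balancing reservoir buffer.
    Not the paper's exact algorithm; names and the eviction rule
    are assumptions based on the one-sentence highlight."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []                 # stored (x, y) pairs
        self.seen = defaultdict(int)     # items seen so far, per class

    def _largest_class(self):
        counts = defaultdict(int)
        for _, y in self.buffer:
            counts[y] += 1
        return max(counts, key=counts.get), counts

    def add(self, x, y):
        self.seen[y] += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((x, y))   # memory not full: keep everything
            return
        largest, counts = self._largest_class()
        if y != largest and counts[y] < counts[largest]:
            # incoming class is under-represented: evict from largest class
            idxs = [i for i, (_, c) in enumerate(self.buffer) if c == largest]
            self.buffer[random.choice(idxs)] = (x, y)
        else:
            # class already well-represented: per-class reservoir step
            if random.random() < counts[y] / self.seen[y]:
                idxs = [i for i, (_, c) in enumerate(self.buffer) if c == y]
                self.buffer[random.choice(idxs)] = (x, y)
```

On a heavily imbalanced stream, this keeps minority-class samples that plain reservoir sampling would likely discard.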

775 | Differentially Private Set Union | Pankaj Gulhane; Sivakanth Gopi; Janardhan Kulkarni; Judy Hanwen Shen; Milad Shokouhi; Sergey Yekhanin; | We design two new algorithms, one using Laplace noise and the other Gaussian noise, as specific instances of policies satisfying the contractive properties. |

776 | The continuous categorical: a novel simplex-valued exponential family | Elliott Gordon-Rodriguez; Gabriel Loaiza-Ganem; John Cunningham; | We resolve these limitations by introducing a novel exponential family of distributions for modeling simplex-valued data – the continuous categorical, which arises as a nontrivial multivariate generalization of the recently discovered continuous Bernoulli. |

777 | Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation | Yaqi Duan; Zeyu Jia; Mengdi Wang; | This paper studies the statistical theory of batch data reinforcement learning with function approximation. |

778 | Enhanced POET: Open-ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions | Rui Wang; Joel Lehman; Aditya Rawal; Jiale Zhi; Yulun Li; Jeffrey Clune; Kenneth Stanley; | Here we introduce and empirically validate two new innovations to the original algorithm, as well as two external innovations designed to help elucidate its full potential. |

779 | Set Functions for Time Series | Max Horn; Michael Moor; Christian Bock; Bastian Rieck; Karsten Borgwardt; | This paper proposes a novel approach for classifying irregularly-sampled time series with unaligned measurements, focusing on high scalability and data efficiency. |

780 | Individual Calibration with Randomized Forecasting | Shengjia Zhao; Tengyu Ma; Stefano Ermon; | We design a training objective to enforce individual calibration and use it to train randomized regression functions. |

781 | Bayesian Differential Privacy for Machine Learning | Aleksei Triastcyn; Boi Faltings; | We propose Bayesian differential privacy (BDP), which takes into account the data distribution to provide more practical privacy guarantees. |

782 | Causal Modeling for Fairness In Dynamical Systems | Elliot Creager; David Madras; Toniann Pitassi; Richard Zemel; | We discuss causal directed acyclic graphs (DAGs) as a unifying framework for the recent literature on fairness in such dynamical systems. |

783 | Learning General-Purpose Controllers via Locally Communicating Sensorimotor Modules | Wenlong Huang; Igor Mordatch; Deepak Pathak; | We propose a policy expressed as a collection of identical modular neural network components for each of the agent’s actuators. |

784 | Visual Grounding of Learned Physical Models | Yunzhu Li; Toru Lin; Kexin Yi; Daniel Bear; Daniel Yamins; Jiajun Wu; Josh Tenenbaum; Antonio Torralba; | In this work, we present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors. |

785 | Task-Oriented Active Perception and Planning in Environments with Partially Known Semantics | Mahsa Ghasemi; Erdem Bulgur; Ufuk Topcu; | We develop a planning strategy that takes the semantic uncertainties into account and by doing so provides probabilistic guarantees on the task success. |

786 | Test-Time Training for Generalization under Distribution Shifts | Yu Sun; Xiaolong Wang; Zhuang Liu; John Miller; Alexei Efros; Moritz Hardt; | We introduce a general approach, called test-time training, for improving the performance of predictive models when training and test data come from different distributions. |

787 | AutoGAN-Distiller: Searching to Compress Generative Adversarial Networks | Yonggan Fu; Wuyang Chen; Haotao Wang; Haoran Li; Yingyan Lin; Zhangyang Wang; | Inspired by the recent success of AutoML in deep compression, we introduce AutoML to GAN compression and develop an AutoGAN-Distiller (AGD) framework. |

788 | Associative Memory in Iterated Overparameterized Sigmoid Autoencoders | Yibo Jiang; Cengiz Pehlevan; | In this work, we theoretically analyze this behavior for sigmoid networks by leveraging recent developments in deep learning theories, especially the Neural Tangent Kernel (NTK) theory. |

789 | Adaptive Reward-Poisoning Attacks against Reinforcement Learning | Xuezhou Zhang; Yuzhe Ma; Adish Singla; Jerry Zhu; | We categorize such attacks by the infinity-norm constraint on $\delta_t$: We provide a lower threshold below which reward-poisoning attack is infeasible and RL is certified to be safe; we provide a corresponding upper threshold above which the attack is feasible. |

790 | Planning to Explore via Latent Disagreement | Ramanan Sekar; Oleh Rybkin; Kostas Daniilidis; Pieter Abbeel; Danijar Hafner; Deepak Pathak; | This work focuses on task-agnostic exploration, where an agent explores a visual environment without yet knowing the tasks it will later be asked to solve. |

791 | Defense Through Diverse Directions | Christopher Bender; Yang Li; Yifeng Shi; Michael K. Reiter; Junier Oliva; | In this work we develop a novel Bayesian neural network methodology to achieve strong adversarial robustness without the need for online adversarial training. |

792 | Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels | Lu Jiang; Di Huang; Mason Liu; Weilong Yang; | First, we establish the first benchmark of controlled real label noise (obtained from image search). This new benchmark will enable us to study the image search label noise in a controlled setting for the first time. The second contribution is a simple but highly effective method to overcome both synthetic and real noisy labels. |

793 | Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks | David Stutz; Matthias Hein; Bernt Schiele; | Our confidence-calibrated adversarial training (CCAT) tackles this problem by biasing the model towards low confidence predictions on adversarial examples. |

794 | Online Control of the False Coverage Rate and False Sign Rate | Asaf Weinstein; Aaditya Ramdas; | We propose a novel solution to the problem which only requires the scientist to be able to construct a marginal CI at any given level. |

795 | Online Convex Optimization in the Random Order Model | Dan Garber; Gal Korcia; Kfir Levy; | In this work we consider a natural random-order version of the OCO model, in which the adversary can choose the set of loss functions, but does not get to choose the order in which they are supplied to the learner; Instead, they are observed in uniformly random order. |

796 | A Flexible Latent Space Model for Multilayer Networks | Xuefei Zhang; Songkai Xue; Ji Zhu; | This paper proposes a flexible latent space model for multilayer networks for the purpose of capturing such characteristics. |

797 | Estimation of Bounds on Potential Outcomes For Decision Making | Maggie Makar; Fredrik Johansson; John Guttag; David Sontag; | Our theoretical analysis highlights a tradeoff between the complexity of the learning task and the confidence with which the resulting bounds cover the true potential outcomes. Guided by our theoretical findings, we develop an algorithm for learning upper and lower bounds on the potential outcomes under treatment and non-treatment. |

798 | Deep Gaussian Markov Random Fields | Per Sidén; Fredrik Lindsten; | We establish a formal connection between GMRFs and convolutional neural networks (CNNs). |

799 | Generalization Error of Generalized Linear Models in High Dimensions | Melikasadat Emami; Mojtaba Sahraee-Ardakan; Parthe Pandit; Sundeep Rangan; Alyson Fletcher; | We provide a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems. |

800 | Poisson Learning: Graph Based Semi-Supervised Learning At Very Low Label Rates | Jeff Calder; Brendan Cook; Matthew Thorpe; Dejan Slepcev; | We propose a new framework, called Poisson learning, for graph based semi-supervised learning at very low label rates. |

801 | Sequential Transfer in Reinforcement Learning with a Generative Model | Andrea Tirinzoni; Riccardo Poiani; Marcello Restelli; | In this work, we focus on the second objective when the agent has access to a generative model of state-action pairs. |

802 | Finite-Time Convergence in Continuous-Time Optimization | Orlando Romero; mouhacine Benosman; | In this paper, we investigate a Lyapunov-like differential inequality that allows us to establish finite-time stability of a continuous-time state-space dynamical system represented via a multivariate ordinary differential equation or differential inclusion. |

803 | Feature Quantization Improves GAN Training | Yang Zhao; Chunyuan Li; Ping Yu; Jianfeng Gao; Changyou Chen; | In this work, we propose feature quantization (FQ) for the discriminator, to embed both true and fake data samples into a shared discrete space. |

804 | Temporal Logic Point Processes | Shuang Li; Lu Wang; Ruizhi Zhang; Xiaofu Chang; Xuqin Liu; Yao Xie; Yuan Qi; Le Song; | We propose a modeling framework for event data, which excels in small data regime with the ability to incorporate domain knowledge. |

805 | Hallucinative Topological Memory for Zero-Shot Visual Planning | Thanard Kurutach; Kara Liu; Aviv Tamar; Pieter Abbeel; Christine Tung; | Here, instead, we propose a simple VP method that plans directly in image space and displays competitive performance. |

806 | Learning Attentive Meta-Transfer | Jaesik Yoon; Gautam Singh; Sungjin Ahn; | To resolve this, we propose a new attention mechanism, Recurrent Memory Reconstruction (RMR), and demonstrate that providing an imaginary context that is recurrently updated and reconstructed with interaction is crucial in achieving effective attention for meta-transfer learning. |

807 | Optimizing Dynamic Structures with Bayesian Generative Search | Minh Hoang; Carleton Kingsford; | This paper instead proposes \textbf{DTERGENS}, a novel generative search framework that constructs and optimizes a high-performance composite kernel expressions generator. |

808 | Amortized Finite Element Analysis for Fast PDE-Constrained Optimization | Tianju Xue; Alex Beatson; Sigrid Adriaenssens; Ryan Adams; | In this paper we propose amortized finite element analysis (AmorFEA), in which a neural network learns to produce accurate PDE solutions, while preserving many of the advantages of traditional finite element methods. |

809 | Preselection Bandits | Viktor Bengs; Eyke Hüllermeier; | In this paper, we introduce the Preselection Bandit problem, in which the learner preselects a subset of arms (choice alternatives) for a user, which then chooses the final arm from this subset. |

810 | Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates | Yang Liu; Hongyi Guo; | In this work, we introduce a new family of loss functions that we name as peer loss functions, which enables learning from noisy labels that does not require a priori specification of the noise rates. Our approach uses a standard empirical risk minimization (ERM) framework with peer loss functions. |
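
To make the peer-loss idea of entry 810 concrete (a hedged sketch of the commonly described construction, not necessarily the paper's exact formulation): each sample's base loss is penalized by the loss on a randomly paired "peer" prediction and an independently drawn peer label; the `alpha` weight here is an assumption.

```python
import random

def peer_loss(loss_fn, preds, labels, alpha=1.0, rng=random):
    """Sketch of a peer loss: base loss on each (pred, label) pair
    minus the loss on a randomly mismatched peer pair. `alpha` and the
    pairing scheme are illustrative assumptions."""
    n = len(preds)
    total = 0.0
    for i in range(n):
        j = rng.randrange(n)  # random peer prediction index
        k = rng.randrange(n)  # independent random peer label index
        total += loss_fn(preds[i], labels[i]) - alpha * loss_fn(preds[j], labels[k])
    return total / n
```

The subtracted peer term discourages the degenerate strategy of fitting the (noisy) labels blindly, since a constant predictor pays the same loss on mismatched pairs as on true pairs.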

811 | Rank Aggregation from Pairwise Comparisons in the Presence of Adversarial Corruptions | Prathamesh Patil; Arpit Agarwal; Shivani Agarwal; Sanjeev Khanna; | In this paper, we initiate the study of robustness in rank aggregation under the popular Bradley-Terry-Luce (BTL) model for pairwise comparisons. |

812 | Extrapolation for Large-batch Training in Deep Learning | Tao Lin; Lingjing Kong; Sebastian Stich; Martin Jaggi; | To alleviate these drawbacks, we propose to use instead computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. |
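
The extragradient (extrapolation) step mentioned in entry 812 can be sketched in a few lines for plain minimization (a minimal illustration of the classical extragradient update, not the paper's large-batch training recipe): take a trial step, then update from the original point using the gradient evaluated at the trial point.

```python
def extragradient_step(grad, x, lr):
    """One classical extragradient step for minimization.
    `grad` maps a point (list of floats) to its gradient;
    the large-batch specifics of the paper are not reproduced here."""
    # Extrapolate: trial step from the current point
    x_trial = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    # Update from the ORIGINAL point using the trial-point gradient
    g_trial = grad(x_trial)
    return [xi - lr * gi for xi, gi in zip(x, g_trial)]
```

Evaluating the gradient at the look-ahead point is what stabilizes the trajectory relative to plain gradient descent, especially near oscillatory regions.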

813 | VideoOneNet: Bidirectional Convolutional Recurrent OneNet with Trainable Data Steps for Video Processing | Zoltán Milacski; Barnabás Póczos; Andras Lorincz; | In this work, we make two contributions, both facilitating end-to-end learning using backpropagation. |

814 | Bio-Inspired Hashing for Unsupervised Similarity Search | Chaitanya Ryali; John Hopfield; Leopold Grinberg; Dmitry Krotov; | Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. |

815 | MetaFun: Meta-Learning with Iterative Functional Updates | Jin Xu; Jean-Francois Ton; Hyunjik Kim; Adam Kosiorek; Yee Whye Teh; | We develop a functional encoder-decoder approach to supervised meta-learning, where labeled data is encoded into an infinite-dimensional functional representation rather than a finite-dimensional one. |

816 | Learning and Simulation in Generative Structured World Models | Zhixuan Lin; Yi-Fu Wu; Skand Peri; Bofeng Fu; Jindong Jiang; Sungjin Ahn; | In this paper, we introduce Generative Structured World Models (G-SWM). |

817 | Random Hypervolume Scalarizations for Provable Multi-Objective Black Box Optimization | Richard Zhang; Daniel Golovin; | In this paper, we consider multi-objective optimization, where $f(x)$ outputs a vector of possibly competing objectives and the goal is to converge to the Pareto frontier. |

818 | SGD Learns One-Layer Networks in WGANs | Qi Lei; Jason Lee; Alexandros Dimakis; Constantinos Daskalakis; | In this paper, we show that, when the generator is a one-layer network, stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity. |

819 | Implicit Class-Conditioned Domain Alignment for Unsupervised Domain Adaptation | Xiang Jiang; Qicheng Lao; Stan Matwin; Mohammad Havaei; | We present an approach for unsupervised domain adaptation—with a strong focus on practical considerations of within-domain class imbalance and between-domain class distribution shift—from a class-conditioned domain alignment perspective. |

820 | Interference and Generalization in Temporal Difference Learning | Emmanuel Bengio; Joelle Pineau; Doina Precup; | We study the link between generalization and interference in temporal-difference (TD) learning. |

821 | CoMic: Co-Training and Mimicry for Reusable Skills | Leonard Hasenclever; Fabio Pardo; Raia Hadsell; Nicolas Heess; Josh Merel; | We study the problem of learning reusable humanoid skills by imitating motion capture data and co-training with complementary tasks. |

822 | Provably Efficient Model-based Policy Adaptation | Yuda Song; Aditi Mavalankar; Wen Sun; Sicun Gao; | We propose new model-based mechanisms that are able to make online adaptation in unseen target environments, by combining ideas from no-regret online learning and adaptive control. |

823 | Optimizer Benchmarking Needs to Account for Hyperparameter Tuning | Prabhu Teja Sivaprasad; Florian Mai; Thijs Vogels; Martin Jaggi; Francois Fleuret; | In this work, we argue that a fair assessment of optimizers’ performance must take the computational cost of hyperparameter tuning into account, i.e., how easy it is to find good hyperparameter configurations using an automatic hyperparameter search. |

824 | From Local SGD to Local Fixed Point Methods for Federated Learning | Grigory Malinovsky; Dmitry Kovalev; Elnur Gasanov; Laurent Condat; Peter Richtarik; | In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. |

825 | Unraveling Meta-Learning: Understanding Feature Representations for Few-Shot Tasks | Micah Goldblum; Liam Fowl; Renkun Ni; Steven Reich; Valeriia Cherepanova; Tom Goldstein; | We develop a better understanding of the underlying mechanics of meta-learning and the difference between models trained using meta-learning and models which are trained classically. |

826 | Federated Learning with Only Positive Labels | Felix Xinnan Yu; Ankit Singh Rawat; Aditya Menon; Sanjiv Kumar; | To address this problem, we propose a generic framework for training with only positive labels, namely Federated Averaging with Spreadout (FedAwS), where the server imposes a geometric regularizer after each round to encourage classes spread out in the embedding space. |

827 | Causal Inference using Gaussian Processes with Structured Latent Confounders | Sam Witty; Kenta Takatsu; David Jensen; Vikash Mansinghka; | This paper shows how to model latent confounders that have this structure and thereby improve estimates of causal effects. |

828 | T-Basis: a Compact Representation for Neural Networks | Anton Obukhov; Maxim Rakhuba; Menelaos Kanakis; Stamatios Georgoulis; Dengxin Dai; Luc Van Gool; | We introduce T-Basis, a novel concept for a compact representation of a set of tensors, each of an arbitrary shape, which is often seen in Neural Networks. |

829 | Familywise Error Rate Control by Interactive Unmasking | Boyan Duan; Aaditya Ramdas; Larry Wasserman; | We propose a method for multiple hypothesis testing with familywise error rate (FWER) control, called the i-FWER test. |

830 | Learning to Branch for Multi-Task Learning | Pengsheng Guo; Chen-Yu Lee; Daniel Ulbricht; | In this work, we present an automated multi-task learning algorithm that learns where to share or branch within a network, designing an effective network topology that is directly optimized for multiple objectives across tasks. |

831 | Augmenting Continuous Time Bayesian Networks with Clocks | Nicolai Engelmann; Dominik Linzner; Heinz Koeppl; | In this work, we lift its restriction to exponential survival times to arbitrary distributions. |

832 | IPBoost – Non-Convex Boosting via Integer Programming | Sebastian Pokutta; Marc Pfetsch; | In this paper we explore non-convex boosting in classification by means of integer programming and demonstrate real-world practicability of the approach while circumventing shortcomings of convex boosting approaches. |

833 | On Efficient Constructions of Checkpoints | Yu Chen; Zhenming Liu; Bin Ren; Xin Jin; | In this paper, we propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint). |

834 | Feature Selection using Stochastic Gates | Yutaro Yamada; Ofir Lindenbaum; Sahand Negahban; Yuval Kluger; | In this study, we propose a method for feature selection in non-linear function estimation problems. |

835 | How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization | Chris Finlay; Joern-Henrik Jacobsen; Levon Nurbekyan; Adam Oberman; | In this paper, we overcome this apparent difficulty by introducing a theoretically-grounded combination of both optimal transport and stability regularizations which encourage neural ODEs to prefer simpler dynamics out of all the dynamics that solve a problem well. |

836 | Evaluating Lossy Compression Rates of Deep Generative Models | Sicong Huang; Alireza Makhzani; Yanshuai Cao; Roger Grosse; | In this work, we argue that the log-likelihood metric by itself cannot represent all the different performance characteristics of generative models, and propose to use rate distortion curves to evaluate and compare deep generative models. |

837 | Mix-n-Match : Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning | Jize Zhang; Bhavya Kailkhura; T. Yong-Jin Han; | We introduce the following desiderata for uncertainty calibration: (a) accuracy-preserving, (b) data-efficient, and (c) high expressive power. |

838 | Learning Adversarially Robust Representations via Worst-Case Mutual Information Maximization | Sicheng Zhu; Xiao Zhang; David Evans; | We develop a general definition of representation vulnerability that captures the maximum change of mutual information between the input and output distributions, under the worst-case input distribution perturbation. We prove a theorem that establishes a lower bound on the minimum adversarial risk that can be achieved for any downstream classifier based on this definition. |

839 | Stochastic Regret Minimization in Extensive-Form Games | Gabriele Farina; Christian Kroer; Tuomas Sandholm; | In this paper we develop a new framework for developing stochastic regret minimization methods. |

840 | Simultaneous Inference for Massive Data: Distributed Bootstrap | Yang Yu; Shih-Kang Chao; Guang Cheng; | In this paper, we propose a bootstrap method applied to massive data processed distributedly in a large number of machines. |

841 | Stabilizing Differentiable Architecture Search via Perturbation-based Regularization | Xiangning Chen; Cho-Jui Hsieh; | Based on this observation, we propose a perturbation-based regularization, named SmoothDARTS (SDARTS), to smooth the loss landscape and improve the generalizability of DARTS. |

842 | Boosting Frank-Wolfe by Chasing Gradients | Cyrille Combettes; Sebastian Pokutta; | We propose to speed up the Frank-Wolfe algorithm by better aligning the descent direction with that of the negative gradient via a subroutine. |

843 | Concise Explanations of Neural Networks using Adversarial Training | Prasad Chalasani; Jiefeng Chen; Amrita Roy Chowdhury; Xi Wu; Somesh Jha; | Our first contribution is a theoretical exploration of how these two properties (when using IG-based attributions) are related to adversarial training, for a class of 1-layer networks (which includes logistic regression models for binary and multi-class classification); for these networks we show that (a) adversarial training using an $\ell_\infty$-bounded adversary produces models with sparse attribution vectors, and (b) natural model-training while encouraging stable explanations (via an extra term in the loss function), is equivalent to adversarial training. |

844 | Quantum Boosting | Srinivasan Arunachalam; Reevu Maity; | In this paper, we show how quantum techniques can improve the time complexity of classical AdaBoost. |

845 | Information-Theoretic Local Minima Characterization and Regularization | Zhiwei Jia; Hao Su; | Specifically, based on the observed Fisher information we propose a metric both strongly indicative of generalizability of local minima and effectively applied as a practical regularizer. |

846 | Kernel interpolation with continuous volume sampling | Ayoub Belhadji; Rémi Bardenet; Pierre Chainais; | We introduce and analyse continuous volume sampling (VS), the continuous counterpart, for choosing node locations, of a discrete distribution introduced in (Deshpande & Vempala, 2006). |

847 | Efficient Identification in Linear Structural Causal Models with Auxiliary Cutsets | Daniel Kumor; Carlos Cinelli; Elias Bareinboim; | We develop a new polynomial-time algorithm for identification in linear Structural Causal Models that subsumes previous non-exponential identification methods when applied to direct effects, and unifies several disparate approaches to identification in linear systems. |

848 | Partial Trace Regression and Low-Rank Kraus Decomposition | Hachem Kadri; Stephane Ayache; Riikka Huusari; Alain Rakotomamonjy; Liva Ralaivola; | We here introduce a yet more general model, namely the partial trace regression model, a family of linear mappings from matrix-valued inputs to matrix-valued outputs; this model subsumes the trace regression model and thus the linear regression model. |

849 | Constant Curvature Graph Convolutional Networks | Gregor Bachmann; Gary Becigneul; Octavian Ganea; | Here, we bridge this gap by proposing mathematically grounded generalizations of graph convolutional networks (GCN) to (products of) constant curvature spaces. |

850 | Educating Text Autoencoders: Latent Representation Guidance via Denoising | Tianxiao Shen; Jonas Mueller; Regina Barzilay; Tommi Jaakkola; | To remedy this issue, we augment adversarial autoencoders with a denoising objective where original sentences are reconstructed from perturbed versions (referred to as DAAE). We prove that this simple modification guides the latent space geometry of the resulting model by encouraging the encoder to map similar texts to similar latent representations. |

851 | Generalization via Derandomization | Jeffrey Negrea; Daniel Roy; Gintare Karolina Dziugaite; | We propose to study the generalization error of a learned predictor h^ in terms of that of a surrogate (potentially randomized) classifier that is coupled to h^ and designed to trade empirical risk for control of generalization error. |

852 | Inductive Relation Prediction by Subgraph Reasoning | Komal Teru; Etienne Denis; Will Hamilton; | Here, we propose a graph neural network based relation prediction framework, GraIL, that reasons over local subgraph structures and has a strong inductive bias to learn entity-independent relational semantics. |

853 | Logarithmic Regret for Online Control with Adversarial Noise | Dylan Foster; Max Simchowitz; | We propose a novel analysis that combines a new variant of the performance difference lemma with techniques from optimal control, allowing us to reduce online control to online prediction with delayed feedback. |

854 | Multiresolution Tensor Learning for Efficient and Interpretable Spatial Analysis | Jung Yeon Park; Kenneth Carr; Stephan Zheng; Yisong Yue; Rose Yu; | We develop a novel Multiresolution Tensor Learning (MRTL) algorithm for efficiently learning interpretable spatial patterns. |

855 | Customizing ML Predictions for Online Algorithms | Keerti Anand; Rong Ge; Debmalya Panigrahi; | In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? |

856 | Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning | Silviu Pitis; Harris Chan; Stephen Zhao; Bradly Stadie; Jimmy Ba; | We propose to optimize this objective by having the agent pursue past achieved goals in sparsely explored areas of the goal space, which focuses exploration on the frontier of the achievable goal set. |

857 | Recht-Re Noncommutative Arithmetic-Geometric Mean Conjecture is False | Zehua Lai; Lek-Heng Lim; | We will show that the Recht–Re conjecture is false for general n. |

858 | Predictive Multiplicity in Classification | Charles Marx; Flavio Calmon; Berk Ustun; | In this paper, we define predictive multiplicity as the ability of a prediction problem to admit competing models with conflicting predictions. |

859 | Word-Level Speech Recognition With a Letter to Word Encoder | Ronan Collobert; Awni Hannun; Gabriel Synnaeve; | We propose a direct-to-word sequence model which uses a word network to learn word embeddings from letters. |

860 | Reducing Sampling Error in Batch Temporal Difference Learning | Brahma Pavse; Ishan Durugkar; Josiah Hanna; Peter Stone; | To address this limitation, we introduce \textit{policy sampling error corrected}-TD(0) (PSEC-TD(0)). |

861 | Adaptive Sampling for Estimating Probability Distributions | Shubhanshu Shekhar; Tara Javidi; Mohammad Ghavamzadeh; | We consider the problem of allocating a fixed budget of samples to a finite set of discrete distributions to learn them uniformly well (minimizing the maximum error) in terms of four common distance measures: $\ell_2^2$, $\ell_1$, $f$-divergence, and separation distance. |

862 | Adversarial Filters of Dataset Biases | Ronan Le Bras; Swabha Swayamdipta; Chandra Bhagavatula; Rowan Zellers; Matthew Peters; Ashish Sabharwal; Yejin Choi; | We investigate one recently proposed approach, AFLite, which adversarially filters such dataset biases, as a means to mitigate the prevalent overestimation of machine performance. We provide a theoretical understanding for AFLite, by situating it in the generalized framework for optimum bias reduction. |

863 | Black-Box Variational Inference as a Parametric Approximation to Langevin Dynamics | Matthew Hoffman; Yian Ma; | In this paper, we analyze gradient-based MCMC and VI procedures and find theoretical and empirical evidence that these procedures are not as different as one might think. |

864 | Faster Graph Embeddings via Coarsening | Matthew Fahrbach; Gramoz Goranci; Sushant Sachdeva; Richard Peng; Chi Wang; | To address this, we present an efficient graph coarsening approach, based on Schur complements, for computing the embedding of the relevant vertices. |

865 | Efficient non-conjugate Gaussian process factor models for spike count data using polynomial approximations | Stephen Keeley; David Zoltowski; Jonathan Pillow; Spencer Smith; Yiyi Yu; | Here we address this obstacle by introducing a fast, approximate inference method for non-conjugate GPFA models. |

866 | Multigrid Neural Memory | Tri Huynh; Michael Maire; Matthew Walter; | We introduce a radical new approach to endowing neural networks with access to long-term and large-scale memory. |

867 | Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings | Jesse Zhang; Brian Cheung; Chelsea Finn; Sergey Levine; Dinesh Jayaraman; | Building on this intuition, we propose risk-averse domain adaptation (RADA). |

868 | Adversarial Nonnegative Matrix Factorization | Lei Luo; Yanfu Zhang; Heng Huang; | To overcome this limitation, we propose a novel Adversarial NMF (ANMF) approach in which an adversary can exercise some control over the perturbed data generation process. |

869 | Aligned Cross Entropy for Non-Autoregressive Machine Translation | Marjan Ghazvininejad; Vladimir Karpukhin; Luke Zettlemoyer; Omer Levy; | In this paper, we propose aligned cross entropy (AXE) as an alternate loss function for training of non-autoregressive models. |

870 | Model-Agnostic Characterization of Fairness Trade-offs | Joon Kim; Jiahao Chen; Ameet Talwalkar; | We propose a diagnostic to enable practitioners to explore these trade-offs without training a single model. |

871 | A Distributional Framework For Data Valuation | Amirata Ghorbani; Michael Kim; James Zou; | To address these limitations, we propose a novel framework — distributional Shapley — where the value of a point is defined in the context of an underlying data distribution. |

872 | Supervised Quantile Normalization for Low Rank Matrix Factorization | Marco Cuturi; Olivier Teboul; Jonathan Niles-Weed; Jean-Philippe Vert; | We propose in this work to learn these normalization operators jointly with the factorization itself. |

873 | AR-DAE: Towards Unbiased Neural Entropy Gradient Estimation | Jae Hyun Lim; Aaron Courville; Christopher Pal; Chin-Wei Huang; | In this paper, we propose the amortized residual denoising autoencoder (AR-DAE) to approximate the gradient of the log density function, which can be used to estimate the gradient of entropy. |

874 | Bridging the Gap Between f-GANs and Wasserstein GANs | Jiaming Song; Stefano Ermon; | To overcome this limitation, we propose a new training objective where we additionally optimize over a set of importance weights over the generated samples. |

875 | “Other-Play” for Zero-Shot Coordination | Hengyuan Hu; Alexander Peysakhovich; Adam Lerer; Jakob Foerster; | We introduce a novel learning algorithm called other-play (OP), that enhances self-play by looking for more robust strategies. |

876 | Correlation Clustering with Asymmetric Classification Errors | Jafar Jafarov; Sanchit Kalhan; Konstantin Makarychev; Yury Makarychev; | We study the correlation clustering problem under the following assumption: Every "similar" edge $e$ has weight $w_e \in [ \alpha w, w ]$ and every "dissimilar" edge $e$ has weight $w_e \geq \alpha w$ (where $\alpha \leq 1$ and $w > 0$ is a scaling parameter). |

877 | An Optimistic Perspective on Offline Deep Reinforcement Learning | Rishabh Agarwal; Dale Schuurmans; Mohammad Norouzi; | To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. |

878 | Neural Topic Modeling with Continual Lifelong Learning | Pankaj Gupta; Yatin Chaudhary; Thomas Runkler; Hinrich Schuetze; | To address the problem, we propose a lifelong learning framework for neural topic modeling that can continuously process streams of document collections, accumulate topics and guide future topic modeling tasks by knowledge transfer from several sources to better deal with the sparse data. |

879 | Learning and Evaluating Contextual Embedding of Source Code | Aditya Kanade; Petros Maniatis; Gogul Balakrishnan; Kensen Shi; | In this paper, we alleviate this gap by curating a code-understanding benchmark and evaluating a learned contextual embedding of source code. |

880 | Uncertainty quantification for nonconvex tensor completion: Confidence intervals, heteroscedasticity and optimality | Changxiao Cai; H. Vincent Poor; Yuxin Chen; | We study the distribution and uncertainty of nonconvex optimization for noisy tensor completion — the problem of estimating a low-rank tensor given incomplete and corrupted observations of its entries. |

881 | Learning with Good Feature Representations in Bandits and in RL with a Generative Model | Gellért Weisz; Tor Lattimore; Csaba Szepesvari; | Thus, features are useful when the approximation error is small relative to the dimensionality of the features. The idea is applied to stochastic bandits and reinforcement learning with a generative model where the learner has access to d-dimensional linear features that approximate the action-value functions for all policies to an accuracy of ε. |

882 | Angular Visual Hardness | Beidi Chen; Weiyang Liu; Zhiding Yu; Jan Kautz; Anshumali Shrivastava; Animesh Garg; Anima Anandkumar; | In this paper, we propose angular visual hardness (AVH), a score given by the normalized angular distance between the sample feature embedding and the target classifier to measure sample hardness. |

883 | Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling | Will Grathwohl; Kuan-Chieh Wang; Joern-Henrik Jacobsen; David Duvenaud; Richard Zemel; | We present a new method for evaluating and training unnormalized density models. |

884 | Variance Reduction and Quasi-Newton for Particle-Based Variational Inference | Michael Zhu; Chang Liu; Jun Zhu; | In this paper, we find that existing ParVI approaches converge insufficiently fast under sample quality metrics, and we propose a novel variance reduction and quasi-Newton preconditioning framework for all ParVIs, by leveraging the Riemannian structure of the Wasserstein space and advanced Riemannian optimization algorithms. |

885 | Better depth-width trade-offs for neural networks through the lens of dynamical systems | Evangelos Chatziafratis; Ioannis Panageas; Sai Ganesh Nagarajan; | In this work, we strengthen the connection with dynamical systems and we improve the existing width lower bounds along several aspects. |

886 | Stochastic Coordinate Minimization with Progressive Precision for Stochastic Convex Optimization | Sudeep Salgia; Qing Zhao; Sattar Vakili; | A framework based on iterative coordinate minimization (CM) is developed for stochastic convex optimization. |

887 | Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations | Florian Tramer; Jens Behrmann; Nicholas Carlini; Nicolas Papernot; Joern-Henrik Jacobsen; | We demonstrate fundamental tradeoffs between these two types of adversarial examples. |

888 | Learning From Strategic Agents: Accuracy, Improvement, and Causality | Yonadav Shavit; Benjamin Edelman; Brian Axelrod; | As our main contribution, we provide the first algorithms for learning accuracy-optimizing, improvement-optimizing, and causal-precision-optimizing linear regression models directly from data, without prior knowledge of agents’ possible actions. |

889 | Causal Structure Discovery from Distributions Arising from Mixtures of DAGs | Basil Saeed; Snigdha Panigrahi; Caroline Uhler; | Since the mixing variable is latent, we consider causal structure discovery algorithms such as FCI that can deal with latent variables. |

890 | Explainable and Discourse Topic-aware Neural Language Understanding | Yatin Chaudhary; Pankaj Gupta; Hinrich Schuetze; | We present a novel neural composite language model that exploits both the latent and explainable topics along with topical discourse at sentence-level in a joint learning framework of topic and language models. |

891 | Understanding Contrastive Representation Learning through Geometry on the Hypersphere | Tongzhou Wang; Phillip Isola; | In this work, we identify two key properties related to the contrastive loss: (1) alignment (closeness) of features from positive pairs, and (2) uniformity of the induced distribution of the (normalized) features on the hypersphere. |

892 | On Learning Language-Invariant Representations for Universal Machine Translation | Han Zhao; Junjie Hu; Andrej Risteski; | In this paper, we take one step towards better understanding of universal machine translation by first proving an impossibility theorem in the general case. |

893 | Compressive sensing with un-trained neural networks: Gradient descent finds a smooth approximation | Reinhard Heckel; Mahdi Soltanolkotabi; | For signal recovery from a few measurements, however, un-trained convolutional networks have an intriguing self-regularizing property: Even though the network can perfectly fit any image, the network recovers a natural image from few measurements when trained with gradient descent until convergence. In this paper, we demonstrate this property numerically and study it theoretically. |

894 | Representing Unordered Data Using Multiset Automata and Complex Numbers | Justin DeBenedetto; David Chiang; | We propose to represent multisets using complex-weighted multiset automata and show how the multiset representations of certain existing neural architectures can be viewed as special cases of ours. |

895 | Mutual Transfer Learning for Massive Data | Ching-Wei Cheng; Xingye Qiao; Guang Cheng; | In this article, we study a new paradigm called mutual transfer learning where among many heterogeneous data domains, every data domain could potentially be the target of interest, and it could also be a useful source to help the learning in other data domains. |

896 | The Differentiable Cross-Entropy Method | Brandon Amos; Denis Yarats; | We study the Cross-Entropy Method (CEM) for the non-convex optimization of a continuous and parameterized objective function and introduce a differentiable variant that enables us to differentiate the output of CEM with respect to the objective function’s parameters. |

897 | A Sample Complexity Separation between Non-Convex and Convex Meta-Learning | Nikunj Umesh Saunshi; Yi Zhang; Mikhail Khodak; Sanjeev Arora; | This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black-box, specifically at properties of the optimization trajectory. |

898 | On the Convergence of Nesterov’s Accelerated Gradient Method in Stochastic Settings | Mahmoud Assran; Michael Rabbat; | We study Nesterov’s accelerated gradient method in the stochastic approximation setting (unbiased gradients with bounded variance) and the finite sum setting (where randomness is due to sampling mini-batches). |

899 | The Buckley-Osthus model and the block preferential attachment model: statistical analysis and application | Wenpin Tang; Xin Guo; Fengmin Tang; | This paper is concerned with statistical estimation of two preferential attachment models: the Buckley-Osthus model and the block preferential attachment model. |

900 | Representations for Stable Off-Policy Reinforcement Learning | Dibya Ghosh; Marc Bellemare; | In this paper, we formally show that there are indeed nontrivial state representations under which the canonical SARSA algorithm is stable, even when learning off-policy. |

901 | Piecewise Linear Regression via a Difference of Convex Functions | Ali Siahkamari; Aditya Gangrade; Brian Kulis; Venkatesh Saligrama; | We present a new piecewise linear regression methodology that fits a difference of convex functions (DC functions) to the data. |

902 | On the consistency of top-k surrogate losses | Forest Yang; Sanmi Koyejo; | Based on the top-$k$ calibration analysis, we propose a rich class of top-$k$ calibrated Bregman divergence surrogates. |

903 | Collapsed Amortized Variational Inference for Switching Nonlinear Dynamical Systems | Zhe Dong; Bryan Seybold; Kevin Murphy; Hung Bui; | We propose an efficient inference method for switching nonlinear dynamical systems. |

904 | Boosting Deep Neural Network Efficiency with Dual-Module Inference | Liu Liu; Lei Deng; Zhaodong Chen; Yuke Wang; Shuangchen Li; Jingwei Zhang; Yihua Yang; Zhenyu Gu; Yufei Ding; Yuan Xie; | We propose a big-little dual-module inference to dynamically skip unnecessary memory access and computation to speed up DNN inference. |

905 | Time-Consistent Self-Supervision for Semi-Supervised Learning | Tianyi Zhou; Shengjie Wang; Jeff Bilmes; | In this paper, we study the dynamics of neural net outputs in SSL and show that selecting and using first the unlabeled samples with more consistent outputs over the course of training (i.e., "time-consistency") can improve the final test accuracy and save computation. |

906 | Selective Dyna-style Planning Under Limited Model Capacity | Zaheer SM; Samuel Sokota; Erin Talvitie; Martha White; | In this paper, we investigate the idea of using an imperfect model selectively. |

907 | A Pairwise Fair and Community-preserving Approach to k-Center Clustering | Brian Brubach; Darshan Chakrabarti; John Dickerson; Samir Khuller; Aravind Srinivasan; Leonidas Tsepenekas; | To explore the practicality of our fairness goals, we devise an approach for extending existing k-center algorithms to satisfy these fairness constraints. |

908 | How recurrent networks implement contextual processing in sentiment analysis | Niru Maheswaranathan; David Sussillo; | Here, we propose general methods for reverse engineering recurrent neural networks (RNNs) to identify and elucidate contextual processing. |

909 | Smaller, more accurate regression forests using tree alternating optimization | Arman Zharmagambetov; Miguel Carreira-Perpinan; | We instead use the recently proposed Tree Alternating Optimization (TAO) algorithm. This is able to learn an oblique tree, where each decision node tests for a linear combination of features, and which has much higher accuracy than axis-aligned trees. |

910 | Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks | Ahmed T. Elthakeb; Prannoy Pilligundla; FatemehSadat Mireshghallah; Alexander Cloninger; Hadi Esmaeilzadeh; | This paper sets out to harvest these rich intermediate representations for quantization with minimal accuracy loss while significantly reducing the memory footprint and compute intensity of the DNN. |

911 | From Sets to Multisets: Provable Variational Inference for Probabilistic Integer Submodular Models | Aytunc Sahin; Yatao Bian; Joachim Buhmann; Andreas Krause; | We study central properties of this extension and formulate a new probabilistic model which is defined through integer submodular functions. |

912 | Empirical Study of the Benefits of Overparameterization in Learning Latent Variable Models | Rares-Darius Buhai; Yoni Halpern; Yoon Kim; Andrej Risteski; David Sontag; | We discuss benefits to different metrics of success (recovering the parameters of the ground-truth model, held-out log-likelihood), sensitivity to variations of the training algorithm, and behavior as the amount of overparameterization increases. |

913 | Improving the Gating Mechanism of Recurrent Neural Networks | Albert Gu; Caglar Gulcehre; Thomas Paine; Matthew Hoffman; Razvan Pascanu; | We address this problem by deriving two synergistic modifications to the standard gating mechanism that are easy to implement, introduce no additional hyperparameters, and improve learnability of the gates when they are close to saturation. |

914 | Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors | Mike Dusenberry; Ghassen Jerfel; Yeming Wen; Yian Ma; Jasper Snoek; Katherine Heller; Balaji Lakshminarayanan; Dustin Tran; | To tackle this challenge, we propose a rank-1 parameterization of BNNs, where each weight matrix involves only a distribution on a rank-1 subspace. |

915 | Analyzing the effect of neural network architecture on training performance | Karthik Abinav Sankararaman; Soham De; Zheng Xu; W. Ronny Huang; Tom Goldstein; | In this paper we study how neural network architecture affects the speed of training. |

916 | Born-again Tree Ensembles | Thibaut Vidal; Maximilian Schiffer; | Against this background, we study born-again tree ensembles, i.e., the process of constructing a single decision tree of minimum size that reproduces the exact same behavior as a given tree ensemble. |

917 | Accountable Off-Policy Evaluation via a Kernelized Bellman Statistics | Yihao Feng; Tongzheng Ren; Ziyang Tang; Qiang Liu; | In this work, we investigate the statistical properties of the kernel loss, which allows us to find a feasible set that contains the true value function with high probability. |

918 | Improving Transformer Optimization Through Better Initialization | Xiao Shi Huang; Felipe Perez; Jimmy Ba; Maksims Volkovs; | In this work our contributions are two-fold. We first investigate and empirically validate the source of optimization problems in the encoder-decoder Transformer architecture. We then propose a new weight initialization scheme with theoretical justification, which enables training without warmup or layer normalization. |

919 | Learning to Simulate and Design for Structural Engineering | Kai-Hung Chang; Chin-Yi Cheng; | In this work, we propose an end-to-end learning pipeline to solve the size design optimization problem, which is to design the optimal cross-sections for columns and beams, given the design objectives and building code as constraints. |

920 | Few-shot Relation Extraction via Bayesian Meta-learning on Task Graphs | Meng Qu; Tianyu Gao; Louis-Pascal Xhonneux; Jian Tang; | We propose a novel Bayesian meta-learning approach to effectively learn the posterior distributions of the prototype vectors of tasks, where the initial prior of the prototype vectors is parameterized with a graph neural network on the global task graph. |

921 | Optimal Differential Privacy Composition for Exponential Mechanisms | Jinshuo Dong; David Durfee; Ryan Rogers; | We consider precise composition bounds of the overall privacy loss for exponential mechanisms, one of the fundamental classes of mechanisms in DP. |

922 | Scaling up Hybrid Probabilistic Inference with Logical and Arithmetic Constraints via Message Passing | Zhe Zeng; Paolo Morettin; Fanqi Yan; Antonio Vergari; Guy Van den Broeck; | To narrow this gap, we derive a factorized formalism of WMI enabling us to devise a scalable WMI solver based on message passing, MP-WMI. |

923 | Accelerating Large-Scale Inference with Anisotropic Vector Quantization | Ruiqi Guo; Quan Geng; David Simcha; Felix Chern; Philip Sun; Erik Lindgren; Sanjiv Kumar; | Based on the observation that for a given query, the database points that have the largest inner products are more relevant, we develop a family of anisotropic quantization loss functions. |

924 | Convolutional dictionary learning based auto-encoders for natural exponential-family distributions | Bahareh Tolooshams; Andrew Song; Simona Temereanca; Demba Ba; | We introduce a class of auto-encoder neural networks tailored to data from the natural exponential family (e.g., count data). |

925 | Strength from Weakness: Fast Learning Using Weak Supervision | Joshua Robinson; Stefanie Jegelka; Suvrit Sra; | We study generalization properties of weakly supervised learning. |

926 | NADS: Neural Architecture Distribution Search for Uncertainty Awareness | Randy Ardywibowo; Shahin Boluki; Xinyu Gong; Zhangyang Wang; Xiaoning Qian; | To address these problems, we first seek to identify guiding principles for designing uncertainty-aware architectures, by proposing Neural Architecture Distribution Search (NADS). |

927 | Approximating Stacked and Bidirectional Recurrent Architectures with the Delayed Recurrent Neural Network | Javier Turek; Shailee Jain; Vy Vo; Mihai Capota; Alexander Huth; Theodore Willke; | In this work, we explore the delayed-RNN, which is a single-layer RNN that has a delay between the input and output. |

928 | Balancing Competing Objectives with Noisy Data: Score-Based Classifiers for Welfare-Aware Machine Learning | Esther Rolf; Max Simchowitz; Sarah Dean; Lydia T. Liu; Daniel Bjorkegren; Moritz Hardt; Joshua Blumenstock; | In this paper, we study algorithmic policies which explicitly trade off between a private objective (such as profit) and a public objective (such as social welfare). |

929 | Time-aware Large Kernel Convolutions | Vasileios Lioutas; Yuhong Guo; | In this paper, we introduce Time-aware Large Kernel (TaLK) Convolutions, a novel adaptive convolution operation that learns to predict the size of a summation kernel instead of using a fixed-sized kernel matrix. |

930 | Amortised Learning by Wake-Sleep | Li Kevin Wenliang; Theodore Moskovitz; Heishiro Kanagawa; Maneesh Sahani; | Here, we propose an alternative approach that we call amortised learning. Rather than computing an approximation to the posterior over latents, we use a wake-sleep Monte-Carlo strategy to learn a function that directly estimates the maximum-likelihood parameter updates. |

931 | Fair Generative Modeling via Weak Supervision | Kristy Choi; Aditya Grover; Trisha Singh; Rui Shu; Stefano Ermon; | We present a weakly supervised algorithm for overcoming dataset bias for deep generative models. |

932 | Multi-Step Greedy Reinforcement Learning Algorithms | Manan Tomar; Yonathan Efroni; Mohammad Ghavamzadeh; | In this paper, we explore the benefits of multi-step greedy policies in model-free RL when employed using the multi-step Dynamic Programming algorithms: $\kappa$-Policy Iteration ($\kappa$-PI) and $\kappa$-Value Iteration ($\kappa$-VI). |

933 | Linear Mode Connectivity and the Lottery Ticket Hypothesis | Jonathan Frankle; Gintare Karolina Dziugaite; Daniel Roy; Michael Carbin; | We introduce "instability analysis," which assesses whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise. |

934 | Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent | Surbhi Goel; Aravind Gollakota; Zhihan Jin; Sushrut Karmalkar; Adam Klivans; | We give the first superpolynomial lower bounds for learning one-layer neural networks with respect to the Gaussian distribution for a broad class of algorithms. |

935 | Learnable Group Transform For Time-Series | Romain Cosentino; Behnaam Aazhang; | We propose a novel approach to filter bank learning for time-series by considering spectral decompositions of signals defined as a Group Transform. |

936 | Optimistic bounds for multi-output learning | Henry Reeve; Ata Kaban; | We investigate the challenge of multi-output learning, where the goal is to learn a vector-valued function based on a supervised data set. |

937 | Detecting Out-of-Distribution Examples with Gram Matrices | Chandramouli Shama Sastry; Sageev Oore; | In this paper, we propose to detect OOD examples by identifying inconsistencies between activity patterns and predicted class. |

938 | On Variational Learning of Controllable Representations for Text without Supervision | Peng Xu; Jackie Chi Kit Cheung; Yanshuai Cao; | In this work, we find that sequence VAEs trained on text fail to properly decode when the latent codes are manipulated, because the modified codes often land in holes or vacant regions in the aggregated posterior latent space, where the decoding network fails to generalize. |

939 | Model-Based Reinforcement Learning with Value-Targeted Regression | Zeyu Jia; Lin Yang; Csaba Szepesvari; Mengdi Wang; Alex Ayoub; | In this paper we focus on finite-horizon episodic RL where the transition model admits a nonlinear parametrization $P_{\theta}$, a special case of which is the linear parameterization: $P_{\theta} = \sum_{i=1}^{d} (\theta)_{i}P_{i}$. |

940 | Two Routes to Scalable Credit Assignment without Weight Symmetry | Daniel Kunin; Aran Nayebi; Javier Sagastuy-Brena; Surya Ganguli; Jonathan Bloom; Daniel Yamins; | Our analysis indicates the underlying mathematical reason for this instability, allowing us to identify a more robust local learning rule that better transfers without metaparameter tuning. |

941 | Predicting deliberative outcomes | Vikas Garg; Tommi Jaakkola; | We extend structured prediction to deliberative outcomes. |

942 | Black-box Certification and Learning under Adversarial Perturbations | Hassan Ashtiani; Vinayak Pathak; Ruth Urner; | We formally study the problem of classification under adversarial perturbations, both from the learner’s perspective, and from the viewpoint of a third-party who aims at certifying the robustness of a given black-box classifier. |

943 | When deep denoising meets iterative phase retrieval | Yaotian Wang; Xiaohang Sun; Jason Fleischer; | Here, we combine iterative methods from phase retrieval with image statistics from deep denoisers, via regularization-by-denoising. |

944 | The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization | Ben Adlam; Jeffrey Pennington; | We provide a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent. |

945 | A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Anurag Kumar; Vamsi Krishna Ithapu; | In this paper, we propose a sequential self-teaching approach to learn sounds. |

946 | On the Global Convergence Rates of Softmax Policy Gradient Methods | Jincheng Mei; Chenjun Xiao; Csaba Szepesvari; Dale Schuurmans; | We make three contributions toward better understanding policy gradient methods. |

947 | Source Separation with Deep Generative Priors | Vivek Jayaram; John Thickstun; | This paper introduces a Bayesian approach to source separation that uses deep generative models as priors over the components of a mixture of sources, and Langevin dynamics to sample from the posterior distribution of sources given a mixture. |

948 | Non-Autoregressive Neural Text-to-Speech | Kainan Peng; Wei Ping; Zhao Song; Kexin Zhao; | In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. |

949 | Amortized Population Gibbs Samplers with Neural Sufficient Statistics | Hao Wu; Heiko Zimmermann; Eli Sennesh; Tuan Anh Le; Jan-Willem van de Meent; | We develop amortized population Gibbs (APG) samplers, a class of scalable methods that frame structured variational inference as adaptive importance sampling. |

950 | Neural Network Control Policy Verification With Persistent Adversarial Perturbation | Yuh-Shyang Wang; Tsui-Wei Weng; Luca Daniel; | In this paper, we show how to combine recent works on static neural network certification tools with robust control theory to certify a neural network policy in a control loop. |

951 | Circuit-Based Intrinsic Methods to Detect Overfitting | Satrajit Chatterjee; Alan Mishchenko; | We propose a family of intrinsic methods called Counterfactual Simulation (CFS) which analyze the flow of training examples through the model by identifying and perturbing rare patterns. |

952 | Inter-domain Deep Gaussian Processes with RKHS Fourier Features | Tim Rudner; Dino Sejdinovic; Yarin Gal; | We propose Inter-domain Deep Gaussian Processes with RKHS Fourier Features, an extension of shallow inter-domain GPs that combines the advantages of inter-domain and deep Gaussian processes (DGPs) and demonstrate how to leverage existing approximate inference approaches to perform simple and scalable approximate inference on Inter-domain Deep Gaussian Processes. |

953 | Estimating Q(s,s’) with Deterministic Dynamics Gradients | Ashley Edwards; Himanshu Sahni; Rosanne Liu; Jane Hung; Ankit Jain; Rui Wang; Adrien Ecoffet; Thomas Miconi; Charles Isbell; Jason Yosinski; | In this paper, we introduce a novel form of a value function, $Q(s, s’)$, that expresses the utility of transitioning from a state $s$ to a neighboring state $s’$ and then acting optimally thereafter. |

954 | On conditional versus marginal bias in multi-armed bandits | Jaehyeok Shin; Aaditya Ramdas; Alessandro Rinaldo; | In this paper, we characterize the sign of the conditional bias of monotone functions of the rewards, including the sample mean. |

955 | Implicit competitive regularization in GANs | Florian Schaefer; Hongkai Zheng; Anima Anandkumar; | We argue that the performance of GANs is instead due to the implicit competitive regularization (ICR) arising from the simultaneous optimization of generator and discriminator. |

956 | Graph-based, Self-Supervised Program Repair from Diagnostic Feedback | Michihiro Yasunaga; Percy Liang; | Program repair is challenging for two reasons: First, it requires reasoning and tracking symbols across source code and diagnostic feedback. Second, labeled datasets available for program repair are relatively small. In this work, we propose novel solutions to these two challenges. |

957 | Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions | Omer Gottesman; Joseph Futoma; Yao Liu; Sonali Parbhoo; Leo Celi; Emma Brunskill; Finale Doshi-Velez; | In this paper we develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates. |

958 | Communication-Efficient Federated Learning with Sketching | Daniel Rothchild; Ashwinee Panda; Enayat Ullah; Nikita Ivkin; Vladimir Braverman; Joseph Gonzalez; Ion Stoica; Raman Arora; | In this paper we introduce a novel algorithm, called FedSketchedSGD, to overcome these challenges. |

959 | Learning Fair Policies in Multi-Objective (Deep) Reinforcement Learning with Average and Discounted Rewards | Umer Siddique; Paul Weng; Matthieu Zimmer; | In this paper, we formulate this novel RL problem, in which an objective function (generalized Gini index of utility vectors), which encodes a notion of fairness that we formally define, is optimized. |

960 | Robust Black Box Explanations Under Distribution Shift | Himabindu Lakkaraju; Nino Arsov; Osbert Bastani; | In this paper, we propose a novel framework for generating robust explanations of black box models based on adversarial training. |

961 | Distributed Online Optimization over a Heterogeneous Network | Nima Eshraghi; Ben Liang; | To address this issue, we consider a new algorithm termed Distributed Any-Batch Mirror Descent (DABMD), which is based on distributed Mirror Descent but uses a fixed per-round computing time to limit the waiting by fast nodes to receive information updates from slow nodes. |

962 | ECLIPSE: An Extreme-Scale Linear Program Solver for Web-Applications | Kinjal Basu; Amol Ghoting; Rahul Mazumder; Yao Pan; | In this work, we propose a distributed solver that solves a perturbation of the LP problems at scale. |

963 | CURL: Contrastive Unsupervised Representation Learning for Reinforcement Learning | Michael Laskin; Pieter Abbeel; Aravind Srinivas; | To that end, we propose a new model: Contrastive Unsupervised Representation Learning for Reinforcement Learning (CURL). |

964 | Confidence-Aware Learning for Deep Neural Networks | Sangheum Hwang; Jooyoung Moon; Jihyo Kim; Younghak Shin; | In this paper, we propose a method of training deep neural networks with a novel loss function, named Correctness Ranking Loss, which regularizes class probabilities explicitly to be better confidence estimates in terms of ordinal ranking according to confidence. |

965 | Online Bayesian Moment Matching based SAT Solver Heuristics | Haonan Duan; Saeed Nejati; George Trimponias; Pascal Poupart; Vijay Ganesh; | In this paper, we present a Bayesian Moment Matching (BMM) based method aimed at solving the initialization problem in Boolean SAT solvers. |

966 | Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search | Binghong Chen; Chengtao Li; Hanjun Dai; Le Song; | In this paper, we propose Retro*, a neural-based A*-like algorithm that finds high-quality synthetic routes efficiently. |

967 | FedBoost: A Communication-Efficient Algorithm for Federated Learning | Jenny Hamer; Mehryar Mohri; Ananda Theertha Suresh; | In this work, we propose an alternative approach whereby an ensemble of pre-trained base predictors is trained via federated learning. |

968 | Sharp Composition Bounds for Gaussian Differential Privacy via Edgeworth Expansion | Qinqing Zheng; Jinshuo Dong; Qi Long; Weijie Su; | To address this question, we introduce a family of analytical and sharp privacy bounds under composition using the Edgeworth expansion in the framework of the recently proposed $f$-differential privacy. |

969 | Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods | Dan Fu; Mayee Chen; Frederic Sala; Sarah Hooper; Kayvon Fatahalian; Christopher Re; | In this work, we show that, for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution to model parameters, obviating the need for iterative solutions like stochastic gradient descent (SGD). |

970 | Spectral Frank-Wolfe Algorithm: Strict Complementarity and Linear Convergence | Lijun Ding; Yingjie Fei; Qiantong Xu; Chengrun Yang; | We develop a novel variant of the classical Frank-Wolfe algorithm, which we call spectral Frank-Wolfe, for convex optimization over a spectrahedron. |

971 | Deep Molecular Programming: A Natural Implementation of Binary-Weight ReLU Neural Networks | Marko Vasic; Cameron Chalk; Sarfraz Khurshid; David Soloveichik; | We discover a surprisingly tight connection between a popular class of neural networks (Binary-weight ReLU aka BinaryConnect) and a class of coupled chemical reactions that are absolutely robust to reaction rates. |

972 | Generative Pretraining From Pixels | Mark Chen; Alec Radford; Rewon Child; Jeffrey Wu; Heewoo Jun; David Luan; Ilya Sutskever; | Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. |

973 | Inferring DQN structure for high-dimensional continuous control | Andrey Sakryukin; Chedy Raissi; Mohan Kankanhalli; | In this work, we show that the compositional structure of the action modules has a significant impact on model performance. |

974 | Subspace Fitting Meets Regression: The Effects of Supervision and Orthonormality Constraints on Double Descent of Generalization Errors | Yehuda Dar; Paul Mayer; Lorenzo Luzi; Richard Baraniuk; | We study the linear subspace fitting problem in the overparameterized setting, where the estimated subspace can perfectly interpolate the training examples. |

975 | Learning Selection Strategies in Buchberger’s Algorithm | Dylan Peifer; Michael Stillman; Daniel Halpern-Leistner; | We introduce a new approach to Buchberger’s algorithm that uses reinforcement learning agents to perform S-pair selection, a key step in the algorithm. |

976 | Estimating the Error of Randomized Newton Methods: A Bootstrap Approach | Miles Lopes; Jessie X.T. Chen; | Motivated by these difficulties, we develop a bootstrap method for directly estimating the unknown error, which avoids excessive computation and offers greater reliability. |

977 | Spectral Subsampling MCMC for Stationary Time Series | Robert Salomone; Matias Quiroz; Robert Kohn; Mattias Villani; Minh-Ngoc Tran; | We propose a novel technique for speeding up MCMC for time series data by efficient data subsampling in the frequency domain. |

978 | Progressive Identification of True Labels for Partial-Label Learning | Jiaqi Lv; Miao Xu; LEI FENG; Gang Niu; Xin Geng; Masashi Sugiyama; | The goal of this paper is to propose a novel framework of partial-label learning without implicit assumptions on the model or optimization algorithm. |

979 | R2-B2: Recursive Reasoning-Based Bayesian Optimization for No-Regret Learning in Games | Zhongxiang Dai; Yizhou Chen; Bryan Kian Hsiang Low; Patrick Jaillet; Teck-Hua Ho; | This paper presents a recursive reasoning formalism of Bayesian optimization (BO) to model the reasoning process in the interactions between boundedly rational, self-interested agents with unknown, complex, and costly-to-evaluate payoff functions in repeated games, which we call Recursive Reasoning-Based BO (R2-B2). |

980 | Graph Homomorphism Convolution | Hoang Nguyen; Takanori Maehara; | In this paper, we study the graph classification problem from the graph homomorphism perspective. |

981 | Conditional Augmentation for Generative Modeling | Heewoo Jun; Rewon Child; Mark Chen; John Schulman; Aditya Ramesh; Alec Radford; Ilya Sutskever; | We present conditional augmentation (CondAugment), a simple and powerful method of regularizing generative models. |

982 | PDO-eConvs: Partial Differential Operator Based Equivariant Convolutions | Zhengyang Shen; Lingshen He; Zhouchen Lin; Jinwen Ma; | In this work, we deal with this issue from the connection between convolutions and partial differential operators (PDOs). |

983 | Abstraction Mechanisms Predict Generalization in Deep Neural Networks | Alex Gain; Hava Siegelmann; | We approach this problem through the unconventional angle of \textit{cognitive abstraction mechanisms}, drawing inspiration from recent neuroscience work, allowing us to define the Cognitive Neural Activation metric (CNA) for DNNs, which is the correlation between information complexity (entropy) of given input and the concentration of higher activation values in deeper layers of the network. |

984 | Revisiting Fundamentals of Experience Replay | William Fedus; Prajit Ramachandran; Rishabh Agarwal; Yoshua Bengio; Hugo Larochelle; Mark Rowland; Will Dabney; | We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). |

985 | Go Wide, Then Narrow: Efficient Training of Deep Thin Networks | Denny Zhou; Mao Ye; Chen Chen; Mingxing Tan; Tianjian Meng; Xiaodan Song; Quoc Le; Qiang Liu; Dale Schuurmans; | We propose an efficient algorithm to train a very deep and thin network with theoretic guarantee. |

986 | Meta-learning for Mixed Linear Regression | Weihao Kong; Raghav Somani; Zhao Song; Sham Kakade; Sewoong Oh; | To this end, we introduce a novel spectral approach and show that we can efficiently utilize small data tasks with the help of $\tilde\Omega(k^{3/2})$ medium data tasks each with $\tilde\Omega(k^{1/2})$ examples. |

987 | Efficiently Learning Adversarially Robust Halfspaces with Noise | Omar Montasser; Surbhi Goel; Ilias Diakonikolas; Nati Srebro; | We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. |

988 | Bayesian Graph Neural Networks with Adaptive Connection Sampling | Arman Hasanzadeh; Ehsan Hajiramezanali; Shahin Boluki; Nick Duffield; Mingyuan Zhou; Krishna Narayanan; Xiaoning Qian; | We propose a unified framework for adaptive connection sampling in graph neural networks (GNNs) that generalizes existing stochastic regularization methods for training GNNs. |

989 | On the Theoretical Properties of the Network Jackknife | Qiaohui Lin; Robert Lunde; Purnamrita Sarkar; | Under the sparse graphon model, we prove an Efron-Stein-type inequality, showing that the network jackknife leads to conservative estimates of the variance (in expectation) for any network functional that is invariant to node permutation. |

990 | Thompson Sampling via Local Uncertainty | Zhendong Wang; Mingyuan Zhou; | In this paper, we propose a new probabilistic modeling framework for Thompson sampling, where local latent variable uncertainty is used to sample the mean reward. |

991 | Decision Trees for Decision-Making under the Predict-then-Optimize Framework | Adam Elmachtoub; Jason Cheuk Nam Liang; Ryan McNellis; | This natural loss function is known in the literature as the Smart Predict-then-Optimize (SPO) loss, and we propose a tractable methodology called SPO Trees (SPOTs) for training decision trees under this loss. |

992 | Representation Learning via Adversarially-Contrastive Optimal Transport | Anoop Cherian; Shuchin Aeron; | In this paper, we study the problem of learning compact (low-dimensional) representations for sequential data that captures its implicit spatio-temporal cues. |

993 | Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" | Saeed Amizadeh; Hamid Palangi; Oleksandr Polozov; Yichen Huang; Kazuhito Koishida; | To address this, we propose (1) a framework to isolate and evaluate the reasoning aspect of VQA separately from its perception, and (2) a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception. |

994 | Two Simple Ways to Learn Individual Fairness Metric from Data | Debarghya Mukherjee; Mikhail Yurochkin; Moulinath Banerjee; Yuekai Sun; | In this paper, we present two simple algorithms that learn effective fair metrics from a variety of datasets. |

995 | A Simple Framework for Contrastive Learning of Visual Representations | Ting Chen; Simon Kornblith; Mohammad Norouzi; Geoffrey Hinton; | This paper presents a simple framework for contrastive representation learning. |

996 | The Implicit and Explicit Regularization Effects of Dropout | Colin Wei; Sham Kakade; Tengyu Ma; | This work observes that dropout introduces two distinct but entangled regularization effects: an explicit effect which occurs since dropout modifies the expected training objective, and an implicit effect from stochasticity in the dropout gradients. |

997 | Variable-Bitrate Neural Compression via Bayesian Arithmetic Coding | Yibo Yang; Robert Bamler; Stephan Mandt; | Here, we propose a new algorithm for compressing latent representations in deep probabilistic models, such as variational autoencoders, in post-processing. |

998 | Orthogonalized SGD and Nested Architectures for Anytime Neural Networks | Chengcheng Wan; Henry (Hank) Hoffmann; Shan Lu; Michael Maire; | We propose a novel variant of SGD customized for training network architectures that support anytime behavior: such networks produce a series of increasingly accurate outputs over time. |

999 | Evaluating Machine Accuracy on ImageNet | Vaishaal Shankar; Rebecca Roelofs; Horia Mania; Alex Fang; Benjamin Recht; Ludwig Schmidt; | We perform an in-depth evaluation of human accuracy on the ImageNet dataset. |

1000 | Learning to Navigate in Synthetically Accessible Chemical Space Using Reinforcement Learning | Sai Krishna Gottipati; Boris Sattarov; Sufeng Niu; Haoran Wei; Yashaswi Pathak; Shengchao Liu; Simon Blackburn; Karam Thomas; Connor Coley; Jian Tang; Sarath Chandar; Yoshua Bengio; | In this work, we propose a novel reinforcement learning (RL) setup for drug discovery that addresses this challenge by embedding the concept of synthetic accessibility directly into the de novo compound design system. |

1001 | Improved Bounds on Minimax Regret under Logarithmic Loss via Self-Concordance | Blair Bilodeau; Dylan Foster; Daniel Roy; | We present a novel approach to bounding the minimax regret that exploits the self-concordance property of logarithmic loss. |

1002 | Optimization Theory for ReLU Neural Networks Trained with Normalization Layers | Yonatan Dukler; Quanquan Gu; Guido Montufar; | The analysis shows how the introduction of normalization layers changes the optimization landscape and in some settings enables faster convergence as compared with un-normalized neural networks. |

1003 | Improving Molecular Design by Stochastic Iterative Target Augmentation | Kevin Yang; Wengong Jin; Kyle Swanson; Regina Barzilay; Tommi Jaakkola; | In this paper, we propose a surprisingly effective self-training approach for iteratively creating additional molecular targets. |

1004 | Don’t Waste Your Bits! Squeeze Activations and Gradients for Deep Neural Networks via TinyScript | Fangcheng Fu; Yuzheng Hu; Yihan He; Jiawei Jiang; Yingxia Shao; Ce Zhang; Bin Cui; | In this work, we introduce TinyScript, which applies a non-uniform quantization algorithm to both activations and gradients. |

1005 | Robust One-Bit Recovery via ReLU Generative Networks: Near-Optimal Statistical Rate and Global Landscape Analysis | Shuang Qiu; Xiaohan Wei; Zhuoran Yang; | We propose to recover the target $G(x_0)$ by solving an unconstrained empirical risk minimization (ERM). |

1006 | Multi-objective Bayesian Optimization using Pareto-frontier Entropy | Shinya Suzuki; Shion Takeno; Tomoyuki Tamura; Kazuki Shitara; Masayuki Karasuyama; | We propose a novel entropy-based MBO called Pareto-frontier entropy search (PFES), which considers the entropy of the Pareto-frontier, an essential notion of optimality in the multi-objective problem. |

1007 | Closing the convergence gap of SGD without replacement | Shashank Rajput; Anant Gupta; Dimitris Papailiopoulos; | In this paper, we close this gap and show that SGD without replacement achieves a rate of $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ when the sum of the functions is a quadratic, and offer a new lower bound of $\Omega\left(\frac{n}{T^2}\right)$ for strongly convex functions that are sums of smooth functions. |

1008 | Black-Box Methods for Restoring Monotonicity | Evangelia Gergatsouli; Brendan Lucier; Christos Tzamos; | In this work we develop algorithms that are able to restore monotonicity in the parameters of interest. |

1009 | Flexible and Efficient Long-Range Planning Through Curious Exploration | Aidan Curtis; Minjian Xin; Dilip Arumugam; Kevin Feigelis; Daniel Yamins; | Here, we propose the Curious Sample Planner (CSP), which fuses elements of TAMP and DRL by combining a curiosity-guided sampling strategy with imitation learning to accelerate planning. |

1010 | Sparse Convex Optimization via Adaptively Regularized Hard Thresholding | Kyriakos Axiotis; Maxim Sviridenko; | We present a new Adaptively Regularized Hard Thresholding (ARHT) algorithm that makes significant progress on this problem by bringing the bound down to $\gamma=O(\kappa)$, which has been shown to be tight for a general class of algorithms including LASSO, OMP, and IHT. |

1011 | On Thompson Sampling with Langevin Algorithms | Eric Mazumdar; Aldo Pacchiano; Yian Ma; Michael Jordan; Peter Bartlett; | We propose a Markov Chain Monte Carlo (MCMC) method tailored to Thompson sampling to address this issue. |

1012 | Strategic Classification is Causal Modeling in Disguise | John Miller; Smitha Milli; Moritz Hardt; | In this work, we develop a causal framework for strategic adaptation. |

1013 | Multi-fidelity Bayesian Optimization with Max-value Entropy Search and its Parallelization | Shion Takeno; Hitoshi Fukuoka; Yuhki Tsukada; Toshiyuki Koyama; Motoki Shiga; Ichiro Takeuchi; Masayuki Karasuyama; | In this paper, we focus on the information-based approach, which is a popular and empirically successful approach in BO. |

1014 | Domain Aggregation Networks for Multi-Source Domain Adaptation | Junfeng Wen; Russell Greiner; Dale Schuurmans; | In this paper, we develop a finite-sample generalization bound based on domain discrepancy and accordingly propose a theoretically justified optimization procedure. |

1015 | Improving Robustness of Deep-Learning-Based Image Reconstruction | Ankit Raj; Yoram Bresler; Bo Li; | In this paper, we propose to modify the training strategy of end-to-end deep-learning-based inverse problem solvers to improve robustness. |

1016 | Outsourced Bayesian Optimization | Dmitrii Kharkovskii; Zhongxiang Dai; Bryan Kian Hsiang Low; | This paper presents the outsourced-Gaussian process-upper confidence bound (O-GP-UCB) algorithm, which is the first algorithm for privacy-preserving Bayesian optimization (BO) in the outsourced setting with a provable performance guarantee. |

1017 | Learning Near Optimal Policies with Low Inherent Bellman Error | Andrea Zanette; Alessandro Lazaric; Mykel Kochenderfer; Emma Brunskill; | We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. |

1018 | Message Passing Least Squares: A Unified Framework for Fast and Robust Group Synchronization | Yunpeng Shi; Gilad Lerman; | We propose an efficient algorithm for solving robust group synchronization given adversarially corrupted group ratios. |

1019 | Optimal Estimator for Unlabeled Linear Regression | Hang Zhang; Ping Li; | This paper proposes a one-step estimator that is optimal in both the computational and statistical senses. |

1020 | Recovery of sparse signals from a mixture of linear samples | Arya Mazumdar; Soumyabrata Pal; | In this work we address this query complexity problem and provide efficient algorithms that improve on the best previously known results. |

1021 | Recurrent Hierarchical Topic-Guided RNN for Language Generation | Dandan Guo; Bo Chen; Ruiying Lu; Mingyuan Zhou; | To simultaneously capture syntax and global semantics from a text corpus, we propose a new larger-context recurrent neural network (RNN)-based language model, which extracts recurrent hierarchical semantic structure via a dynamic deep topic model to guide natural language generation. |

1022 | Predictive Coding for Locally-Linear Control | Rui Shu; Tung Nguyen; Yinlam Chow; Tuan Pham; Khoat Than; Mohammad Ghavamzadeh; Stefano Ermon; Hung Bui; | In this paper, we propose a novel information-theoretic LCE approach and show theoretically that explicit next-observation prediction can be replaced with predictive coding. |

1023 | Near Input Sparsity Time Kernel Embeddings via Adaptive Sampling | Amir Zandieh; David Woodruff; | To accelerate kernel methods, we propose a near input sparsity time method for sampling the high-dimensional space implicitly defined by a kernel transformation. |

1024 | Near-optimal sample complexity bounds for learning Latent $k$-polytopes and applications to Ad-Mixtures | Chiranjib Bhattacharyya; Ravindran Kannan; | In this paper we show that $O^*(dk/m)$ samples are sufficient to learn each of $k$ topic vectors of LDA, a popular Ad-mixture model, with vocabulary size $d$ and $m\in \Omega(1)$ words per document, to any constant error in $L_1$ norm. |

1025 | Population-Based Black-Box Optimization for Biological Sequence Design | Christof Angermueller; David Belanger; Andreea Gane; Zelda Mariet; David Dohan; Kevin Murphy; Lucy Colwell; D. Sculley; | To improve robustness, we propose population-based optimization (P3BO), which generates batches of sequences by sampling from an ensemble of methods. |

1026 | Emergence of Separable Manifolds in Deep Language Representations | Jonathan Mamou; Hang Le; Miguel del Rio Fernandez; Cory Stephenson; Hanlin Tang; Yoon Kim; SueYeon Chung; | In this work, we utilize mean-field theoretic manifold analysis, a recent technique from computational neuroscience, to analyze the high dimensional geometry of language representations from large-scale contextual embedding models. |

1027 | Stochastic Hamiltonian Gradient Methods for Smooth Games | Nicolas Loizou; Hugo Berard; Alexia Jolicoeur-Martineau; Pascal Vincent; Simon Lacoste-Julien; Ioannis Mitliagkas; | We analyze the stochastic Hamiltonian method and a novel variance-reduced variant of it and provide the first set of last-iterate convergence guarantees for stochastic unbounded bilinear games. |

1028 | Understanding and Estimating the Adaptability of Domain-Invariant Representations | Ching-Yao Chuang; Antonio Torralba; Stefanie Jegelka; | In this work, we aim to better understand and estimate the effect of domain-invariant representations on generalization to the target. |

1029 | Adversarial Mutual Information for Text Generation | Boyuan Pan; Yazheng Yang; Kaizhao Liang; Bhavya Kailkhura; Zhongming Jin; Xian-Sheng Hua; Deng Cai; Bo Li; | In this paper, we propose Adversarial Mutual Information (AMI): a text generation framework which is formed as a novel saddle point (min-max) optimization aiming to identify joint interactions between the source and target. |

1030 | Bidirectional Model-based Policy Optimization | Hang Lai; Jian Shen; Weinan Zhang; Yong Yu; | We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization. |

1031 | Input-Sparsity Low Rank Approximation in Schatten Norm | Yi Li; David Woodruff; | We give the first input-sparsity time algorithms for the rank-$k$ low rank approximation problem in every Schatten norm. |

1032 | Do We Need Zero Training Loss After Achieving Zero Training Error? | Takashi Ishida; Ikko Yamane; Tomoya Sakai; Gang Niu; Masashi Sugiyama; | We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value, which we call the flooding level. |

1033 | Learning and sampling of atomic interventions from observations | Arnab Bhattacharyya; Sutanu Gayen; Saravanan Kandasamy; Ashwin Maran; Vinodchandran N. Variyam; | Our goal is to give algorithms with polynomial time and sample complexity in a non-parametric setting. |

1034 | Understanding and Mitigating the Tradeoff between Robustness and Accuracy | Aditi Raghunathan; Sang Michael Xie; Fanny Yang; John Duchi; Percy Liang; | In this work, we precisely characterize the effect of augmentation on the standard error in linear regression when the optimal linear predictor has zero standard and robust error. |

1035 | Combining Differentiable PDE Solvers and Graph Neural Networks for Fluid Flow Prediction | Filipe de Avila Belbute-Peres; Thomas Economon; Zico Kolter; | In this work, we develop a hybrid (graph) neural network that combines a traditional graph convolutional network with an embedded differentiable fluid dynamics simulator inside the network itself. |

1036 | From ImageNet to Image Classification: Contextualizing Progress on Benchmarks | Dimitris Tsipras; Shibani Santurkar; Logan Engstrom; Andrew Ilyas; Aleksander Madry; | Overall, our results highlight a misalignment between the way we train our models and the task we actually expect them to solve, emphasizing the need for fine-grained evaluation techniques that go beyond average-case accuracy. |

1037 | On Implicit Regularization in $\beta$-VAEs | Abhishek Kumar; Ben Poole; | This analysis uncovers the regularizer implicit in the $\beta$-VAE objective, and leads to an approximation consisting of a deterministic autoencoding objective plus analytic regularizers that depend on the Hessian or Jacobian of the decoding model, unifying VAEs with recent heuristics proposed for training regularized autoencoders. |

1038 | Data Amplification: Instance-Optimal Property Estimation | Yi Hao; Alon Orlitsky; | We present novel linear-time-computable estimators that significantly “amplify” the effective amount of data available. |

1039 | Provable guarantees for decision tree induction: the agnostic setting | Guy Blanc; Jane Lange; Li-Yang Tan; | We give strengthened provable guarantees on the performance of widely employed and empirically successful {\sl top-down decision tree learning heuristics}. |

1040 | Statistical Bias in Dataset Replication | Logan Engstrom; Andrew Ilyas; Shibani Santurkar; Dimitris Tsipras; Jacob Steinhardt; Aleksander Madry; | In this paper, we highlight the importance of statistical modeling in dataset replication: we present unintuitive yet pervasive ways in which statistical bias, when left unmitigated, can skew results. |

1041 | Towards Adaptive Residual Network Training: A Neural-ODE Perspective | Chengyu Dong; Liyuan Liu; Zichao Li; Jingbo Shang; | Illuminated by these derivations, we propose an adaptive training algorithm for residual networks, LipGrow, which automatically increases network depth and accelerates model training. |

1042 | Overparameterization hurts worst-group accuracy with spurious correlations | Shiori Sagawa; Aditi Raghunathan; Pang Wei Koh; Percy Liang; | We show on two image datasets that in contrast to average accuracy, overparameterization hurts worst-group accuracy in the presence of spurious correlations. |

1043 | A Nearly-Linear Time Algorithm for Exact Community Recovery in Stochastic Block Model | Peng Wang; Zirui Zhou; Anthony Man-Cho So; | In this paper, we focus on the problem of exactly recovering the communities in a binary symmetric SBM, where a graph of $n$ vertices is partitioned into two equal-sized communities and the vertices are connected with probability $p = \alpha\log(n)/n$ within communities and $q = \beta\log(n)/n$ across communities for some $\alpha>\beta>0$. |

1044 | Online Multi-Kernel Learning with Graph-Structured Feedback | Pouya M Ghari; Yanning Shen; | Leveraging the random feature approximation, we propose an online scalable multi-kernel learning approach with graph feedback, and prove that the proposed algorithm enjoys sublinear regret. |

1045 | Is Local SGD Better than Minibatch SGD? | Blake Woodworth; Kumar Kshitij Patel; Sebastian Stich; Zhen Dai; Brian Bullins; H. Brendan McMahan; Ohad Shamir; Nati Srebro; | We study local SGD (also known as parallel SGD and federated SGD), a natural and frequently used distributed optimization method. |

1046 | On Lp-norm Robustness of Ensemble Decision Stumps and Trees | Yihan Wang; Huan Zhang; Hongge Chen; Duane Boning; Cho-Jui Hsieh; | In this paper, we study the robustness verification and defense with respect to general $\ell_p$ norm perturbation for ensemble trees and stumps. |

1047 | Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data with RACE | Benjamin Coleman; Anshumali Shrivastava; Richard Baraniuk; | We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. |

1048 | Understanding Self-Training for Gradual Domain Adaptation | Ananya Kumar; Tengyu Ma; Percy Liang; | We consider gradual domain adaptation, where the goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. |

1049 | Concept Bottleneck Models | Pang Wei Koh; Thao Nguyen; Yew Siang Tang; Stephen Mussmann; Emma Pierson; Been Kim; Percy Liang; | We seek to learn models that support interventions on high-level concepts: would the model predict severe arthritis if it thought there was a bone spur in the x-ray? |

1050 | Optimal Bounds between f-Divergences and Integral Probability Metrics | Rohit Agrawal; Thibaut Horel; | In this work, we systematically study the relationship between these two families from the perspective of convex duality. |

1051 | Robustness to Spurious Correlations via Human Annotations | Megha Srivastava; Tatsunori Hashimoto; Percy Liang; | We present a framework for making models robust to spurious correlations by leveraging humans’ common sense knowledge of causality. |

1052 | DROCC: Deep Robust One-Class Classification | Sachin Goyal; Aditi Raghunathan; Moksh Jain; Harsha Vardhan Simhadri; Prateek Jain; | In this work, we propose Deep Robust One Class Classification (DROCC) method that is robust to such a collapse by training the network to distinguish the training points from their perturbations, generated adversarially. |

1053 | Efficiently Solving MDPs with Stochastic Mirror Descent | Yujia Jin; Aaron Sidford; | In this paper we present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. |

1054 | Handling the Positive-Definite Constraint in the Bayesian Learning Rule | Wu Lin; Mark Schmidt; Mohammad Emtiyaz Khan; | In this paper, we fix this issue for the positive-definite constraint by proposing an improved rule that naturally handles the constraint. |

1055 | A simpler approach to accelerated optimization: iterative averaging meets optimism | Pooria Joulani; Anant Raj; András György; Csaba Szepesvari; | In this paper, we show that there is a simpler approach to obtaining accelerated rates: applying generic, well-known optimistic online learning algorithms and using the online average of their predictions to query the (deterministic or stochastic) first-order optimization oracle at each time step. |

1056 | Training Binary Neural Networks using the Bayesian Learning Rule | Xiangming Meng; Roman Bachmann; Mohammad Emtiyaz Khan; | In this paper, we propose such an approach using the Bayesian learning rule. |

1057 | High-dimensional Robust Mean Estimation via Gradient Descent | Yu Cheng; Ilias Diakonikolas; Rong Ge; Mahdi Soltanolkotabi; | In this work, we show that a natural non-convex formulation of the problem can be solved directly by gradient descent. |

1058 | From Chaos to Order: Symmetry and Conservation Laws in Game Dynamics | Sai Ganesh Nagarajan; David Balduzzi; Georgios Piliouras; | In this paper, we present basic mechanism design tools for constructing games with predictable and controllable dynamics. |

1059 | Hierarchically Decoupled Morphological Transfer | Donald Hejna; Lerrel Pinto; Pieter Abbeel; | To this end, we propose a hierarchical decoupling of policies into two parts: an independently learned low-level policy and a transferable high-level policy. |

1060 | Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup | Jang-Hyun Kim; Wonho Choo; Hyun Oh Song; | To this end, we propose Puzzle Mix, a mixup method for explicitly utilizing the saliency information and the underlying statistics of the natural examples. |

1061 | Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers | Zhuohan Li; Eric Wallace; Sheng Shen; Kevin Lin; Kurt Keutzer; Dan Klein; Joseph Gonzalez; | We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. |

1062 | Interpolation between CNNs and ResNets | Zonghan Yang; Yang Liu; Chenglong Bao; Zuoqiang Shi; | In this paper, we present a novel ODE model by adding a damping term. |

1063 | Online metric algorithms with untrusted predictions | Antonios Antoniadis; Christian Coester; Marek Elias; Adam Polak; Bertrand Simon; | In this paper, we propose a prediction setup for Metrical Task Systems (MTS), a broad class of online decision-making problems including, e.g., caching, k-server and convex body chasing. |

1064 | Collaborative Machine Learning with Incentive-Aware Model Rewards | Rachael Hwee Ling Sim; Yehong Zhang; Bryan Kian Hsiang Low; Mun Choon Chan; | This paper proposes to value a party’s contribution based on Shapley value and information gain on model parameters given its data. |

1065 | On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent | Scott Pesme; Aymeric Dieuleveut; Nicolas Flammarion; | In this paper, we show that efficiently detecting this transition and appropriately decreasing the step size can lead to fast convergence rates. |

1066 | Equivariant Flows: exact likelihood generative learning for symmetric densities | Jonas Köhler; Leon Klein; Frank Noe; | We provide a theoretical sufficient criterion showing that the distribution generated by \textit{equivariant} normalizing flows is invariant with respect to these symmetries by design. |

1067 | PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination | Saurabh Goyal; Anamitra Roy Choudhury; Venkatesan Chakaravarthy; Saurabh Raje; Yogish Sabharwal; Ashish Verma; | We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy. |

1068 | Bayesian Sparsification of Deep C-valued Networks |