# Paper Digest: ICML 2020 Highlights

Download ICML-2020-Paper-Digests.pdf – highlights of all ICML 2020 papers. Readers can also browse these highlights on our console, which lets users filter papers by keyword and find related papers and patents.

The International Conference on Machine Learning (ICML) is one of the top machine learning conferences in the world. In 2020 it was held virtually due to the COVID-19 pandemic.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights to quickly get the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn for new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: Paper Digest: ICML 2020 Highlights

No. | Title | Authors | Highlight |
---|---|---|---|

1 | Reverse-engineering deep ReLU networks | David Rolnick; Konrad Kording; | Here, we prove that in fact it is often possible to identify the architecture, weights, and biases of an unknown deep ReLU network by observing only its output. |

2 | My Fair Bandit: Distributed Learning of Max-Min Fairness with Multi-player Bandits | Ilai Bistritz; Tavor Baharav; Amir Leshem; Nicholas Bambos; | We present an algorithm and prove that it is regret optimal up to a log(log T) factor. |

3 | Scalable Differentiable Physics for Learning and Control | Yi-Ling Qiao; Junbang Liang; Vladlen Koltun; Ming Lin; | We develop a scalable framework for differentiable physics that can support a large number of objects and their interactions. |

4 | Generalization to New Actions in Reinforcement Learning | Ayush Jain; Andrew Szot; Joseph Lim; | To approach this problem, we propose a two-stage framework where the agent first infers action representations from acquired action observations and then learns to use these in reinforcement learning with added generalization objectives. |

5 | Randomized Block-Diagonal Preconditioning for Parallel Learning | Celestine Mendler-Dünner; Aurelien Lucchi; | Our main contribution is to demonstrate that the convergence of these methods can significantly be improved by a randomization technique which corresponds to repartitioning coordinates across tasks during the optimization procedure. |

6 | Stochastic Flows and Geometric Optimization on the Orthogonal Group | Krzysztof Choromanski; David Cheikhi; Jared Davis; Valerii Likhosherstov; Achille Nazaret; Achraf Bahamou; Xingyou Song; Mrugank Akarte; Jack Parker-Holder; Jacob Bergquist; YUAN GAO; Aldo Pacchiano; Tamas Sarlos; Adrian Weller; Vikas Sindhwani; | We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group O(d) and naturally reductive homogeneous manifolds obtained from the action of the rotation group SO(d). |

7 | PackIt: A Virtual Environment for Geometric Planning | Ankit Goyal; Jia Deng; | We present PackIt, a virtual environment to evaluate and potentially learn the ability to do geometric planning. We also construct a set of challenging packing tasks using an evolutionary algorithm. |

8 | Soft Threshold Weight Reparameterization for Learnable Sparsity | Aditya Kusupati; Vivek Ramanujan; Raghav Somani; Mitchell Wortsman; Prateek Jain; Sham Kakade; Ali Farhadi; | This work proposes Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. |
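The soft-threshold operator at the heart of STR has a one-line form. A minimal sketch (illustrative only, not the authors' implementation; in the paper the threshold itself is learned per layer):

```python
import numpy as np

def soft_threshold(w, s):
    """Soft-threshold operator: shrink weights toward zero and zero out
    those whose magnitude falls below the threshold s, inducing sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - s, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.6])
print(soft_threshold(w, 0.1))  # small-magnitude weights become exactly zero
```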

9 | Stochastic Latent Residual Video Prediction | Jean-Yves Franceschi; Edouard Delasalles; Mickael Chen; Sylvain Lamprier; Patrick Gallinari; | In this paper, we overcome these difficulties by introducing a novel stochastic temporal model whose dynamics are governed in a latent space by a residual update rule. |

10 | Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise | Umut Simsekli; Lingjiong Zhu; Yee Whye Teh; Mert Gurbuzbalaban; | In this study, we consider a continuous-time variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. |

11 | Context Aware Local Differential Privacy | Jayadev Acharya; Keith Bonawitz; Peter Kairouz; Daniel Ramage; Ziteng Sun; | We propose a context-aware framework for LDP that allows the privacy level to vary across the data domain, enabling system designers to place privacy constraints where they matter without paying the cost where they do not. |

12 | Privately Learning Markov Random Fields | Gautam Kamath; Janardhan Kulkarni; Steven Wu; Huanyu Zhang; | Our learning goals include both structure learning, where we try to estimate the underlying graph structure of the model, as well as the harder goal of parameter learning, in which we additionally estimate the parameter on each edge. |

13 | A Mean Field Analysis Of Deep ResNet And Beyond: Towards Provably Optimization Via Overparameterization From Depth | Yiping Lu; Chao Ma; Yulong Lu; Jianfeng Lu; Lexing Ying; | To understand the success of SGD for training deep neural networks, this work presents a mean-field analysis of deep residual networks, based on a line of works which interpret the continuum limit of the deep residual network as an ordinary differential equation as the network capacity tends to infinity. |

14 | Provable Smoothness Guarantees for Black-Box Variational Inference | Justin Domke; | This paper shows that for location-scale family approximations, if the target is M-Lipschitz smooth, then so is the “energy” part of the variational objective. |

15 | Enhancing Simple Models by Exploiting What They Already Know | Amit Dhurandhar; Karthikeyan Shanmugam; Ronny Luss; | In this paper, we propose a novel method SRatio that can utilize information from high performing complex models (viz. deep neural networks, boosted trees, random forests) to reweight a training dataset for a potentially low performing simple model of much lower complexity such as a decision tree or a shallow network enhancing its performance. |

16 | Fiduciary Bandits | Gal Bahar; Omer Ben-Porat; Kevin Leyton-Brown; Moshe Tennenholtz; | More formally, we introduce a model in which a recommendation system faces an exploration-exploitation tradeoff under the constraint that it can never recommend any action that it knows yields lower reward in expectation than an agent would achieve if it acted alone. |

17 | Training Deep Energy-Based Models with f-Divergence Minimization | Lantao Yu; Yang Song; Jiaming Song; Stefano Ermon; | In this paper, we propose a general variational framework termed f-EBM to train EBMs using any desired f-divergence. |

18 | Progressive Graph Learning for Open-Set Domain Adaptation | Yadan Luo; Zijian Wang; Zi Huang; Mahsa Baktashmotlagh; | More specifically, we introduce an end-to-end Progressive Graph Learning (PGL) framework where a graph neural network with episodic training is integrated to suppress underlying conditional shift and adversarial learning is adopted to close the gap between the source and target distributions. |

19 | Learning De-biased Representations with Biased Representations | Hyojin Bahng; SANGHYUK CHUN; Sangdoo Yun; Jaegul Choo; Seong Joon Oh; | In this work, we propose a novel framework to train a de-biased representation by encouraging it to be *different* from a set of representations that are biased by design. |

20 | Generalized Neural Policies for Relational MDPs | Sankalp Garg; Aniket Bajpai; Mausam; | We present the first neural approach for solving RMDPs, expressed in the probabilistic planning language of RDDL. |

21 | Feature-map-level Online Adversarial Knowledge Distillation | Inseop Chung; SeongUk Park; Kim Jangho; NOJUN KWAK; | Thus in this paper, we propose an online knowledge distillation method that transfers not only the knowledge of the class probabilities but also that of the feature map using the adversarial training framework. |

22 | DRWR: A Differentiable Renderer without Rendering for Unsupervised 3D Structure Learning from Silhouette Images | Zhizhong Han; Chao Chen; Yu-Shen Liu; Matthias Zwicker; | In contrast, here we propose a Differentiable Renderer Without Rendering (DRWR) that omits these steps. |

23 | Towards Accurate Post-training Network Quantization via Bit-Split and Stitching | Peisong Wang; Qiang Chen; Xiangyu He; Jian Cheng; | In this paper, we propose a Bit-Split and Stitching framework for lower-bit post-training quantization with minimal accuracy degradation. |

24 | Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization | Pan Zhou; Xiao-Tong Yuan; | To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems that enjoys provably improved data-size-independent complexity guarantees. |

25 | Reserve Pricing in Repeated Second-Price Auctions with Strategic Bidders | Alexey Drutsa; | We propose a novel algorithm that has strategic regret upper bound of $O(\log\log T)$ for worst-case valuations. |

26 | On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems | Tianyi Lin; Chi Jin; Michael Jordan; | In this paper, we present the complexity results on two-time-scale GDA for solving nonconvex-concave minimax problems, showing that the algorithm can find a stationary point of the function $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ efficiently. |
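The two-time-scale idea is simply that the max-player's step size is much larger than the min-player's. A toy sketch on an assumed illustrative objective f(x, y) = xy - y²/2 (concave in y, with Φ(x) = x²/2 minimized at x = 0; this instance is not from the paper):

```python
# Two-time-scale gradient descent ascent on f(x, y) = x*y - y**2/2.
def grad_x(x, y): return y          # df/dx
def grad_y(x, y): return x - y      # df/dy

x, y = 1.0, -1.0
eta_x, eta_y = 0.01, 0.1            # two time scales: y moves faster than x
for _ in range(5000):
    x -= eta_x * grad_x(x, y)       # descent step for the min-player
    y += eta_y * grad_y(x, y)       # ascent step for the max-player
print(x, y)                         # both drift toward the stationary point (0, 0)
```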

27 | Training Binary Neural Networks through Learning with Noisy Supervision | Kai Han; Yunhe Wang; Yixing Xu; Chunjing Xu; Enhua Wu; Chang Xu; | In contrast to classical hand crafted rules (e.g., hard thresholding) to binarize full-precision neurons, we propose to learn a mapping from full-precision neurons to the target binary ones. |
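The classical hand-crafted baseline that the paper contrasts against can be written in two lines. A sketch (the mean-magnitude scaling is one common convention, not necessarily the paper's):

```python
import numpy as np

def hard_binarize(w):
    """Hard-threshold binarization: replace each weight by its sign,
    scaled by the mean magnitude to preserve the layer's weight scale."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

print(hard_binarize(np.array([0.5, -0.25, 0.75])))
```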

28 | Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization | Geoffrey Negiar; Gideon Dresdner; Alicia Yi-Ting Tsai; Laurent El Ghaoui; Francesco Locatello; Fabian Pedregosa; | We propose a novel Stochastic Frank-Wolfe ($\equiv$ Conditional Gradient) algorithm with fixed batch size tailored to the constrained optimization of a finite sum of smooth objectives. |
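A generic stochastic Frank-Wolfe (conditional gradient) step is easy to sketch; the example below uses an assumed least-squares objective over the probability simplex and the classic 2/(t+2) step size, not the paper's fixed-batch-size variant:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

x = np.ones(5) / 5                                 # start inside the simplex
for t in range(200):
    idx = rng.choice(100, size=10, replace=False)  # stochastic mini-batch
    g = A[idx].T @ (A[idx] @ x - b[idx])           # mini-batch gradient of 0.5*||Ax - b||^2
    v = np.zeros(5)
    v[np.argmin(g)] = 1.0                          # linear minimization oracle: best simplex vertex
    x += 2.0 / (t + 2) * (v - x)                   # convex-combination step keeps x feasible
print(x.sum())                                     # iterates remain on the simplex
```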

29 | Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation | Jian Liang; Dapeng Hu; Jiashi Feng; | In this work we tackle a novel setting where only a trained source model is available and investigate how we can effectively utilize such a model without source data to solve UDA problems. |

30 | Acceleration through spectral density estimation | Fabian Pedregosa; Damien Scieur; | We develop a framework for designing optimal optimization methods in terms of their average-case runtime. |

31 | Graph Structure of Neural Networks | Jiaxuan You; Kaiming He; Jure Leskovec; Saining Xie; | Here we systematically investigate this relationship, via developing a novel graph-based representation of neural networks called relational graph, where computation is specified by rounds of message exchange along the graph structure. |

32 | Optimal Continual Learning has Perfect Memory and is NP-hard | Jeremias Knoblauch; Hisham Husain; Tom Diethe; | Designing CL algorithms that perform reliably and avoid so-called catastrophic forgetting has proven a persistent challenge. The current paper develops a theoretical approach that explains why. |

33 | Clinician-in-the-Loop Decision Making: Reinforcement Learning with Near-Optimal Set-Valued Policies | Shengpu Tang; Aditya Modi; Michael Sjoding; Jenna Wiens; | We propose a model-free, off-policy algorithm based on temporal difference learning and a near-greedy action selection heuristic. |

34 | Computational and Statistical Tradeoffs in Inferring Combinatorial Structures of Ising Model | Ying Jin; Zhaoran Wang; Junwei Lu; | Under the framework of oracle computational model where an algorithm interacts with an oracle that discourses a randomized version of truth, we characterize the computational lower bounds of learning combinatorial structures in polynomial time, under which no algorithms within polynomial-time can distinguish between graphs with and without certain structures. |

35 | On the Number of Linear Regions of Convolutional Neural Networks | Huan Xiong; Lei Huang; Mengyang Yu; Li Liu; Fan Zhu; Ling Shao; | In this paper, we provide several mathematical results needed for studying the linear regions of CNNs, and use them to derive the maximal and average numbers of linear regions for one-layer ReLU CNNs. |

36 | Deep Streaming Label Learning | Zhen Wang; Liu Liu; Dacheng Tao; | In order to fill in these research gaps, we propose a novel deep neural network (DNN) based framework, Deep Streaming Label Learning (DSLL), to classify instances with newly emerged labels effectively. |

37 | From Importance Sampling to Doubly Robust Policy Gradient | Jiawei Huang; Nan Jiang; | Starting from the doubly robust (DR) estimator (Jiang & Li, 2016), we provide a simple derivation of a very general and flexible form of PG, which subsumes the state-of-the-art variance reduction technique (Cheng et al., 2019) as its special case and immediately hints at further variance reduction opportunities overlooked by existing literature. |

38 | Loss Function Search for Face Recognition | Xiaobo Wang; Shuo Wang; Shifeng Zhang; Cheng Chi; Tao Mei; | In this paper, we first analyze that the key to enhance the feature discrimination is actually how to reduce the softmax probability. We then design a unified formulation for the current margin-based softmax losses. |

39 | Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search | Yong Guo; Yaofo Chen; Yin Zheng; Peilin Zhao; Jian Chen; Junzhou Huang; Mingkui Tan; | To alleviate this issue, we propose a curriculum search method that starts from a small search space and gradually incorporates the learned knowledge to guide the search in a large space. |

40 | Automatic Reparameterisation of Probabilistic Programs | Maria Gorinova; Dave Moore; Matthew Hoffman; | This enables new inference algorithms, and we propose two: a simple approach using interleaved sampling and a novel variational formulation that searches over a continuous space of parameterisations. |

41 | Kernel Methods for Cooperative Multi-Agent Learning with Delays | Abhimanyu Dubey; Alex 'Sandy' Pentland; | In this paper, we consider the kernelised contextual bandit problem, where the reward obtained by an agent is an arbitrary linear function of the contexts' images in the related reproducing kernel Hilbert space (RKHS), and a group of agents must cooperate to collectively solve their unique decision problems. |

42 | Robust Multi-Agent Decision-Making with Heavy-Tailed Payoffs | Abhimanyu Dubey; Alex 'Sandy' Pentland; | We propose \textsc{MP-UCB}, a decentralized multi-agent algorithm for the cooperative stochastic bandit that incorporates robust estimation with a message-passing protocol. |

43 | Learning the Valuations of a $k$-demand Agent | Hanrui Zhang; Vincent Conitzer; | We study problems where a learner aims to learn the valuations of an agent by observing which goods he buys under varying price vectors. |

44 | Rigging the Lottery: Making All Tickets Winners | Utku Evci; Trevor Gale; Jacob Menick; Pablo Samuel Castro; Erich Elsen; | In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. |

45 | Active Learning on Attributed Graphs via Graph Cognizant Logistic Regression and Preemptive Query Generation | Florence Regol; Soumyasundar Pal; Yingxue Zhang; Mark Coates; | We propose a novel graph-based active learning algorithm for the task of node classification in attributed graphs. |

46 | Performative Prediction | Juan Perdomo; Tijana Zrnic; Celestine Mendler-Dünner; Moritz Hardt; | We develop a risk minimization framework for performative prediction bringing together concepts from statistics, game theory, and causality. |

47 | On Layer Normalization in the Transformer Architecture | Ruibin Xiong; Yunchang Yang; Di He; Kai Zheng; Shuxin Zheng; Chen Xing; Huishuai Zhang; Yanyan Lan; Liwei Wang; Tie-Yan Liu; | Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. |

48 | The many Shapley values for model explanation | Mukund Sundararajan; Amir Najmi; | In this paper, we use the axiomatic approach to study the differences between some of the many operationalizations of the Shapley value for attribution, and propose a technique called Baseline Shapley (BShap) that is backed by a proper uniqueness result. |

49 | Linear Convergence of Randomized Primal-Dual Coordinate Method for Large-scale Linear Constrained Convex Programming | Daoli Zhu; Lei Zhao; | We propose the randomized primal-dual coordinate (RPDC) method, a randomized coordinate extension of the first-order primal-dual method by Cohen and Zhu, 1984 and Zhao and Zhu, 2019, to solve LCCP. |

50 | New Oracle-Efficient Algorithms for Private Synthetic Data Release | Giuseppe Vietri; Steven Wu; Mark Bun; Thomas Steinke; Grace Tian; | We present three new algorithms for constructing differentially private synthetic data—a sanitized version of a sensitive dataset that approximately preserves the answers to a large collection of statistical queries. |

51 | Oracle Efficient Private Non-Convex Optimization | Seth Neel; Aaron Roth; Giuseppe Vietri; Steven Wu; | This technique augments a given optimization problem (e.g. deriving from an ERM problem) with a random linear term, and then exactly solves it. However, to date, analyses of this approach crucially rely on the convexity and smoothness of the objective function. We give two algorithms that extend this approach substantially. |

52 | Universal Asymptotic Optimality of Polyak Momentum | Damien Scieur; Fabian Pedregosa; | We consider the average-case runtime analysis of algorithms for minimizing quadratic objectives. |

53 | Adversarial Robustness via Runtime Masking and Cleansing | Yi-Hsuan Wu; Chia-Hung Yuan; Shan-Hung (Brandon) Wu; | In this paper, we propose improving the adversarial robustness of a network by leveraging the potentially large test data seen at runtime. |

54 | Implicit Euler Skip Connections: Enhancing Adversarial Robustness via Numerical Stability | Mingjie Li; Lingshen He; Zhouchen Lin; | On this account, we try to address such an issue from the perspective of dynamic system in this work. |

55 | Best Arm Identification for Cascading Bandits in the Fixed Confidence Setting | Zixin Zhong; Wang Chi Cheung; Vincent Tan; | We design and analyze CascadeBAI, an algorithm for finding the best set of K items, also called an arm, within the framework of cascading bandits. |

56 | Robustness to Programmable String Transformations via Augmented Abstract Training | Yuhao Zhang; Aws Albarghouthi; Loris D’Antoni; | To fill this gap, we present a technique to train models that are robust to user-defined string transformations. |

57 | The Complexity of Finding Stationary Points with Stochastic Gradient Descent | Yoel Drori; Ohad Shamir; | We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions. |

58 | Sample Complexity Bounds for 1-bit Compressive Sensing and Binary Stable Embeddings with Generative Priors | Zhaoqiang Liu; Selwyn Gomes; Avtansh Tiwari; Jonathan Scarlett; | Motivated by recent advances in compressive sensing with generative models, where a generative modeling assumption replaces the usual sparsity assumption, we study the problem of 1-bit compressive sensing with generative models. |

59 | Class-Weighted Classification: Trade-offs and Robust Approaches | Ziyu Xu; Chen Dan; Justin Khim; Pradeep Ravikumar; | We consider imbalanced classification, the problem in which a label may have low marginal probability relative to other labels, by weighting losses according to the correct class. |

60 | Neural Architecture Search in a Proxy Validation Loss Landscape | Yanxi Li; Minjing Dong; Yunhe Wang; Chang Xu; | In this paper, we propose to approximate the validation loss landscape by learning a mapping from neural architectures to their corresponding validate losses. |

61 | Almost Tune-Free Variance Reduction | Bingcong Li; Lingda Wang; Georgios B. Giannakis; | This work introduces ‘almost tune-free’ SVRG and SARAH schemes equipped with i) Barzilai-Borwein (BB) step sizes; ii) averaging; and, iii) the inner loop length adjusted to the BB step sizes. |

62 | Uniform Convergence of Rank-weighted Learning | Liu Leqi; Justin Khim; Adarsh Prasad; Pradeep Ravikumar; | In this work, we study a novel notion of L-Risk based on the classical idea of rank-weighted learning. |

63 | Non-autoregressive Translation with Disentangled Context Transformer | Jungo Kasai; James Cross; Marjan Ghazvininejad; Jiatao Gu; | We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts. |

64 | More Information Supervised Probabilistic Deep Face Embedding Learning | Ying Huang; Shangfeng Qiu; Wenwei Zhang; Xianghui Luo; Jinzhuo Wang; | In this paper, we analyse margin based softmax loss in probability view. |

65 | Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism | Wang Chi Cheung; David Simchi-Levi; Ruihao Zhu; | We overcome the challenge by a novel confidence widening technique that incorporates additional optimism. |

66 | Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards | Aadirupa Saha; Pierre Gaillard; Michal Valko; | In this paper, we consider the problem of sleeping bandits with stochastic action sets and adversarial rewards. |

67 | From PAC to Instance-Optimal Sample Complexity in the Plackett-Luce Model | Aadirupa Saha; Aditya Gopalan; | We consider PAC learning a good item from $k$-subsetwise feedback sampled from a Plackett-Luce probability model, with instance-dependent sample complexity performance. |

68 | Reliable Fidelity and Diversity Metrics for Generative Models | Muhammad Ferjad Naeem; Seong Joon Oh; Yunjey Choi; Youngjung Uh; Jaejun Yoo; | In this paper, we show that even the latest version of the precision and recall metrics are not reliable yet. |

69 | Learning Factorized Weight Matrix for Joint Image Filtering | Xiangyu Xu; Yongrui Ma; Wenxiu Sun; | In this work, we propose to learn the weight matrix for joint image filtering. |

70 | Likelihood-free MCMC with Amortized Approximate Ratio Estimators | Joeri Hermans; Volodimir Begy; Gilles Louppe; | This work introduces a novel approach to address the intractability of the likelihood and the marginal model. |

71 | Attacks Which Do Not Kill Training Make Adversarial Learning Stronger | Jingfeng Zhang; Xilie Xu; Bo Han; Gang Niu; Lizhen Cui; Masashi Sugiyama; Mohan Kankanhalli; | In this paper, we raise a fundamental question—do we have to trade off natural generalization for adversarial robustness? |

72 | GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values | Shangtong Zhang; Bo Liu; Shimon Whiteson; | We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. |

73 | Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation | Shangtong Zhang; Bo Liu; Hengshuai Yao; Shimon Whiteson; | We present the first provably convergent two-timescale off-policy actor-critic algorithm (COF-PAC) with function approximation. |

74 | Adversarial Attacks on Probabilistic Autoregressive Forecasting Models | Raphaël Dang-Nhu; Gagandeep Singh; Pavol Bielik; Martin Vechev; | The key technical challenge we address is how to effectively differentiate through the Monte-Carlo estimation of statistics of the output sequence joint distribution. |

75 | Informative Dropout for Robust Representation Learning: A Shape-bias Perspective | Baifeng Shi; Dinghuai Zhang; Qi Dai; Jingdong Wang; Zhanxing Zhu; Yadong Mu; | In this work, we attempt at improving various kinds of robustness universally by alleviating CNN’s texture bias. |

76 | Graph Convolutional Network for Recommendation with Low-pass Collaborative Filters | Wenhui Yu; Zheng Qin; | To address this gap, we leverage the original graph convolution in GCN and propose a Low-pass Collaborative Filter (LCF) to make it applicable to the large graph. |

77 | SoftSort: A Differentiable Continuous Relaxation of the argsort Operator | Sebastian Prillo; Julian Eisenschlos; | In this work we propose a simple continuous relaxation for the argsort operator. |
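The relaxation can be sketched in a few lines of numpy: each row is a softmax over negative distances between the sorted values and the input, yielding a row-stochastic matrix that approaches the true sorting permutation as the temperature goes to zero. (A sketch of the idea; parameter names and the descending-order convention are assumptions here.)

```python
import numpy as np

def softsort(s, tau=0.1):
    """Continuous relaxation of argsort: row i is softmax over
    -|sorted(s)[i] - s[j]| / tau, so each row concentrates on the index
    of the i-th largest element as tau -> 0."""
    s_sorted = np.sort(s)[::-1]                              # descending sort
    logits = -np.abs(s_sorted[:, None] - s[None, :]) / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True))   # stable softmax
    return e / e.sum(axis=1, keepdims=True)

P = softsort(np.array([0.1, 2.0, 1.0]), tau=0.01)
print(np.round(P))  # near-permutation matrix placing the largest value first
```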

78 | Too Relaxed to Be Fair | Michael Lohaus; Michaël Perrot; Ulrike von Luxburg; | We address the problem of classification under fairness constraints. Given a notion of fairness, the goal is to learn a classifier that is not discriminatory against a group of individuals. |

79 | Lorentz Group Equivariant Neural Network for Particle Physics | Alexander Bogatskiy; Brandon Anderson; Jan Offermann; Marwah Roussi; David Miller; Risi Kondor; | We present a neural network architecture that is fully equivariant with respect to transformations under the Lorentz group, a fundamental symmetry of space and time in physics. |

80 | One-shot Distributed Ridge Regression in High Dimensions | Yue Sheng; Edgar Dobriban; | Here we study a fundamental problem in this area: How to do ridge regression in a distributed computing environment? |

81 | Streaming k-Submodular Maximization under Noise subject to Size Constraint | Lan N. Nguyen; My T. Thai; | In this paper, we investigate a more realistic scenario of this problem that (1) obtaining exact evaluation of an objective function is impractical, instead, its noisy version is acquired; and (2) algorithms are required to take only one single pass over dataset, producing solutions in a timely manner. |

82 | Variational Imitation Learning with Diverse-quality Demonstrations | Voot Tangkaratt; Bo Han; Mohammad Emtiyaz Khan; Masashi Sugiyama; | We propose a new method for imitation learning in such scenarios. |

83 | Task Understanding from Confusing Multi-task Data | Xin Su; Yizhou Jiang; Shangqi Guo; Feng Chen; | We propose Confusing Supervised Learning (CSL) that takes these confusing samples and extracts task concepts by differentiating between these samples. |

84 | Cost-effective Interactive Attention Learning with Neural Attention Process | Jay Heo; Junhyeon Park; Hyewon Jeong; Kwang Joon Kim; Juho Lee; Eunho Yang; Sung Ju Hwang; | We propose a novel interactive learning framework which we refer to as Interactive Attention Learning (IAL), in which the human supervisors interactively manipulate the allocated attentions, to correct the model’s behavior by updating the attention-generating network. |

85 | Channel Equilibrium Networks for Learning Deep Representation | Wenqi Shao; Shitao Tang; Xingang Pan; Ping Tan; Xiaogang Wang; Ping Luo; | Unlike prior arts that simply removed the inhibited channels, we propose to “wake them up” during training by designing a novel neural building block, termed Channel Equilibrium (CE) block, which enables channels at the same layer to contribute equally to the learned representation. |

86 | Optimal Non-parametric Learning in Repeated Contextual Auctions with Strategic Buyer | Alexey Drutsa; | We introduce a novel non-parametric learning algorithm that is horizon-independent and has tight strategic regret upper bound of $\Theta(T^{d/(d+1)})$. |

87 | Topological Autoencoders | Michael Moor; Max Horn; Bastian Rieck; Karsten Borgwardt; | We propose a novel approach for preserving topological structures of the input space in latent representations of autoencoders. |

88 | An Accelerated DFO Algorithm for Finite-sum Convex Functions | Yuwen Chen; Antonio Orvieto; Aurelien Lucchi; | In this work, we exploit the finite-sum structure of the objective to design a variance-reduced DFO algorithm that provably yields an accelerated rate of convergence. |

89 | The Shapley Taylor Interaction Index | Mukund Sundararajan; Kedar Dhamdhere; Ashish Agarwal; | We propose a generalization of the Shapley value called Shapley-Taylor index that attributes the model’s prediction to interactions of subsets of features up to some size $k$. |

90 | Privately detecting changes in unknown distributions | Rachel Cummings; Sara Krehbiel; Yuliia Lut; Wanrong Zhang; | This work develops differentially private algorithms for solving the change-point problem when the data distributions are unknown. |

91 | CAUSE: Learning Granger Causality from Event Sequences using Attribution Methods | Wei Zhang; Thomas Panum; Somesh Jha; Prasad Chalasani; David Page; | To address these weaknesses, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a novel framework for the studied task. |

92 | Efficient Continuous Pareto Exploration in Multi-Task Learning | Pingchuan Ma; Tao Du; Wojciech Matusik; | We present a novel, efficient method that generates locally continuous Pareto sets and Pareto fronts, which opens up the possibility of continuous analysis of Pareto optimal solutions in machine learning problems. |

93 | WaveFlow: A Compact Flow-based Model for Raw Audio | Wei Ping; Kainan Peng; Kexin Zhao; Zhao Song; | In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. |

94 | Multi-Agent Determinantal Q-Learning | Yaodong Yang; Ying Wen; Jun Wang; Liheng Chen; Kun Shao; David Mguni; Weinan Zhang; | Though practical, current methods rely on restrictive assumptions to decompose the centralized value function across agents for execution. In this paper, we eliminate this restriction by proposing multi-agent determinantal Q-learning. |

95 | Revisiting Spatial Invariance with Low-Rank Local Connectivity | Gamaleldin Elsayed; Prajit Ramachandran; Jon Shlens; Simon Kornblith; | To test this hypothesis, we design a method to relax the spatial invariance of a network layer in a controlled manner. |

96 | Minimax Weight and Q-Function Learning for Off-Policy Evaluation | Masatoshi Uehara; Jiawei Huang; Nan Jiang; | Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et.al, 2018), (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value-functions in MWL. |

97 | Tensor denoising and completion based on ordinal observations | Chanwoo Lee; Miaoyan Wang; | We propose a multi-linear cumulative link model, develop a rank-constrained M-estimator, and obtain theoretical accuracy guarantees. |

98 | Learning Human Objectives by Evaluating Hypothetical Behavior | Siddharth Reddy; Anca Dragan; Sergey Levine; Shane Legg; Jan Leike; | We propose an algorithm that safely and efficiently learns a model of the user’s reward function by posing ‘what if?’ |

99 | Counterfactual Cross-Validation: Stable Model Selection Procedure for Causal Inference Models | Yuta Saito; Shota Yasui; | We study the model selection problem in *conditional average treatment effect* (CATE) prediction. |

100 | Learning Efficient Multi-agent Communication: An Information Bottleneck Approach | Rundong Wang; Xu He; Runsheng Yu; Wei Qiu; Bo An; Zinovi Rabinovich; | In this paper, we develop an Informative Multi-Agent Communication (IMAC) method to learn efficient communication protocols as well as scheduling. |

101 | MoNet3D: Towards Accurate Monocular 3D Object Localization in Real Time | Xichuan Zhou; Yicong Peng; Chunqiao Long; Fengbo Ren; Cong Shi; | The MoNet3D algorithm is a novel and effective framework that can predict the 3D position of each object in a monocular image and draw a 3D bounding box for each object. |

102 | SIGUA: Forgetting May Make Learning with Noisy Labels More Robust | Bo Han; Gang Niu; Xingrui Yu; Quanming Yao; Miao Xu; Ivor Tsang; Masashi Sugiyama; | In this paper, to relieve this issue, we propose stochastic integrated gradient underweighted ascent (SIGUA): in a mini-batch, we adopt gradient descent on good data as usual, and learning-rate-reduced gradient ascent on bad data; SIGUA is a versatile approach in which data goodness or badness is defined with respect to desired or undesired memorization under a given base learning method. |

103 | Multinomial Logit Bandit with Low Switching Cost | Kefan Dong; Yingkai Li; Qin Zhang; Yuan Zhou; | We present an anytime algorithm (AT-DUCB) with $O(N \log T)$ assortment switches, almost matching the lower bound $\Omega(\frac{N \log T}{ \log \log T})$. |

104 | Deep Reasoning Networks for Unsupervised Pattern De-mixing with Constraint Reasoning | Di Chen; Yiwei Bai; Wenting Zhao; Sebastian Ament; John Gregoire; Carla Gomes; | We introduce Deep Reasoning Networks (DRNets), an end-to-end framework that combines deep learning with constraint reasoning for solving pattern de-mixing problems, typically in an unsupervised or very-weakly-supervised setting. |

105 | Uncertainty-Aware Lookahead Factor Models for Improved Quantitative Investing | Lakshay Chauhan; John Alberg; Zachary Lipton; | We propose lookahead factor models to act upon these predictions, plugging the predicted future fundamentals into traditional factors. |

106 | On the Unreasonable Effectiveness of the Greedy Algorithm: Greedy Adapts to Sharpness | Sebastian Pokutta; Mohit Singh; Alfredo Torrico; | In this work, we define sharpness for submodular functions as a candidate explanation for this phenomenon. |

107 | Stronger and Faster Wasserstein Adversarial Attacks | Kaiwen Wu; Allen Wang; Yaoliang Yu; | We address this gap in two ways: (a) we develop an exact yet efficient projection operator to enable a stronger projected gradient attack; (b) we show for the first time that the conditional gradient method equipped with a suitable linear minimization oracle works extremely fast under Wasserstein constraints. |

108 | Optimizing Multiagent Cooperation via Policy Evolution and Shared Experiences | Somdeb Majumdar; Shauharda Khadka; Santiago Miret; Stephen Mcaleer; Kagan Tumer; | We introduce Multiagent Evolutionary Reinforcement Learning (MERL), a split-level training platform that handles the two objectives separately through two optimization processes. |

109 | Why Are Learned Indexes So Effective? | Paolo Ferragina; Fabrizio Lillo; Giorgio Vinciguerra; | In this paper, we present the first mathematically-grounded answer to this open problem. |

110 | Fast OSCAR and OWL with Safe Screening Rules | Runxue Bao; Bin Gu; Heng Huang; | To address this challenge, we propose the first safe screening rule for the OWL regularized regression, which effectively avoids the updates of the parameters whose coefficients must be zeros. |

111 | Which Tasks Should Be Learned Together in Multi-task Learning? | Trevor Standley; Amir Zamir; Dawn Chen; Leonidas Guibas; Jitendra Malik; Silvio Savarese; | We systematically study task cooperation and competition and propose a framework for assigning tasks to a few neural networks such that cooperating tasks are computed by the same neural network, while competing tasks are computed by different networks. |

112 | Inertial Block Proximal Methods for Non-Convex Non-Smooth Optimization | Hien Le; Nicolas Gillis; Panagiotis Patrinos; | We propose inertial versions of block coordinate descent methods for solving non-convex non-smooth composite optimization problems. |

113 | Adversarial Neural Pruning with Latent Vulnerability Suppression | Divyam Madaan; Jinwoo Shin; Sung Ju Hwang; | In this paper, we conjecture that the leading cause of this adversarial vulnerability is the distortion in the latent feature space, and provide methods to suppress it effectively. |

114 | Lifted Disjoint Paths with Application in Multiple Object Tracking | Andrea Hornakova; Roberto Henschel; Bodo Rosenhahn; Paul Swoboda; | We present an extension to the disjoint paths problem in which additional lifted edges are introduced to provide path connectivity priors. |

115 | Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks | Agustinus Kristiadi; Matthias Hein; Philipp Hennig; | We theoretically analyze approximate Gaussian posterior distributions on the weights of ReLU networks and show that they fix the overconfidence problem. |

116 | SCAFFOLD: Stochastic Controlled Averaging for Federated Learning | Sai Praneeth Reddy Karimireddy; Satyen Kale; Mehryar Mohri; Sashank Jakkam Reddi; Sebastian Stich; Ananda Theertha Suresh; | As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the ‘client drift’. |

117 | Statistically Preconditioned Accelerated Gradient Method for Distributed Optimization | Hadrien Hendrikx; Lin Xiao; Sebastien Bubeck; Francis Bach; Laurent Massoulié; | In order to reduce the number of communications required to reach a given accuracy, we propose a preconditioned accelerated gradient method where the preconditioning is done by solving a local optimization problem over a subsampled dataset at the server. |

118 | Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification | Hui Ye; Zhiyu Chen; Da-Han Wang; Brian Davison; | Our approach fine-tunes the recently released generalized autoregressive pretraining model (XLNet) to learn the dense representation for the input text. We propose the Adaptive Probabilistic Label Cluster (APLC) to approximate the cross entropy loss by exploiting the unbalanced label distribution to form clusters that explicitly reduce the computational time. |

119 | Frequentist Uncertainty in Recurrent Neural Networks via Blockwise Influence Functions | Ahmed Alaa; Mihaela van der Schaar; | Capitalizing on ideas from classical jackknife resampling, we develop a frequentist alternative that: (a) is computationally efficient, (b) does not interfere with model training or compromise its accuracy, (c) applies to any RNN architecture, and (d) provides theoretical coverage guarantees on the estimated uncertainty intervals. |

120 | Disentangling Trainability and Generalization in Deep Neural Networks | Lechao Xiao; Jeffrey Pennington; Samuel Schoenholz; | In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. |

121 | Moniqua: Modulo Quantized Communication in Decentralized SGD | Yucheng Lu; Christopher De Sa; | In this paper we propose Moniqua, a technique that allows decentralized SGD to use quantized communication. |

122 | Expectation Maximization with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation | Amr Mohamed Alexandari; Anshul Kundaje; Avanti Shrikumar; | We show that by combining EM with a type of calibration we call bias-corrected calibration, we outperform both BBSL and RLLS across diverse datasets and distribution shifts. |

123 | Expert Learning through Generalized Inverse Multiobjective Optimization: Models, Insights and Algorithms | Chaosheng Dong; Bo Zeng; | Leveraging these critical insights and connections, we propose two algorithms to solve IMOP through manifold learning and clustering. |

124 | Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures | Mohamed El Amine Seddik; Cosme Louart; Mohamed Tamaazousti; Romain Couillet; | This paper shows that deep learning (DL) representations of data produced by generative adversarial nets (GANs) are random vectors which fall within the class of so-called \textit{concentrated} random vectors. |

125 | Optimizing Data Usage via Differentiable Rewards | Xinyi Wang; Hieu Pham; Paul Michel; Antonios Anastasopoulos; Jaime Carbonell; Graham Neubig; | To efficiently optimize data usage, we propose a reinforcement learning approach called Differentiable Data Selection (DDS). |

126 | Optimistic Policy Optimization with Bandit Feedback | Lior Shani; Yonathan Efroni; Aviv Rosenberg; Shie Mannor; | In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. |

127 | Maximum-and-Concatenation Networks | Xingyu Xie; Hao Kong; Jianlong Wu; Wayne Zhang; Guangcan Liu; Zhouchen Lin; | In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) that aims to eliminate bad local minima and improve generalization ability. |

128 | Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition | Chi Jin; Tiancheng Jin; Haipeng Luo; Suvrit Sra; Tiancheng Yu; | We propose an efficient algorithm that achieves $\widetilde{O}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ the number of states, $|A|$ the number of actions, and $T$ the number of episodes. |

129 | Kernelized Stein Discrepancy Tests of Goodness-of-fit for Time-to-Event Data | Wenkai Xu; Tamara Fernandez; Nicolas Rivera; Arthur Gretton; | In this paper, we focus on non-parametric Goodness-of-Fit testing procedures based on combining Stein’s method and kernelized discrepancies. |

130 | Efficient Intervention Design for Causal Discovery with Latents | Raghavendra Addanki; Shiva Kasiviswanathan; Andrew McGregor; Cameron Musco; | In particular, we introduce the notion of p-colliders, that are colliders between pairs of nodes arising from a specific type of conditioning in the causal graph, and provide an upper bound on the number of interventions as a function of the maximum number of p-colliders between any two nodes in the causal graph. |

131 | Certified Data Removal from Machine Learning Models | Chuan Guo; Tom Goldstein; Awni Hannun; Laurens van der Maaten; | We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. |

132 | One Size Fits All: Can We Train One Denoiser for All Noise Levels? | Abhiram Gnanasambandam; Stanley Chan; | Why should we allocate the training samples uniformly? Can we have more training samples that are less noisy, and fewer samples that are more noisy? What is the optimal distribution? How do we obtain such an optimal distribution? The goal of this paper is to address these questions. |

133 | GNN-FiLM: Graph Neural Networks with Feature-wise Linear Modulation | Marc Brockschmidt; | This paper presents a new Graph Neural Network (GNN) type using feature-wise linear modulation (FiLM). |

134 | Sparse Gaussian Processes with Spherical Harmonic Features | Vincent Dutordoir; Nicolas Durrande; James Hensman; | We introduce a new class of interdomain variational Gaussian processes (GP) where data is mapped onto the unit hypersphere in order to use spherical harmonic representations. |

135 | Asynchronous Coagent Networks | James Kostas; Chris Nota; Philip Thomas; | In this work, we prove that CPGAs converge to locally optimal policies. |

136 | Adaptive Checkpoint Adjoint Method for Gradient Estimation in Neural ODE | Juntang Zhuang; Nicha Dvornek; Xiaoxiao Li; Sekhar Tatikonda; Xenophon Papademetris; James Duncan; | We propose the Adaptive Checkpoint Adjoint (ACA) method: ACA applies a trajectory checkpoint strategy which records the forward-mode trajectory as the reverse-mode trajectory to guarantee accuracy; ACA deletes redundant components for shallow computation graphs; and ACA supports adaptive solvers. |

137 | Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling | Yao Liu; Pierre-Luc Bacon; Emma Brunskill; | We analyze the variance of the most popular approaches through the viewpoint of conditional Monte Carlo. |

138 | Taylor Expansion Policy Optimization | Yunhao Tang; Michal Valko; Remi Munos; | In this work, we investigate the application of Taylor expansions in reinforcement learning. |

139 | Reinforcement Learning for Integer Programming: Learning to Cut | Yunhao Tang; Shipra Agrawal; Yuri Faenza; | The goal of this work is to show that the performance of those solvers can be greatly enhanced using reinforcement learning (RL). |

140 | Safe Reinforcement Learning in Constrained Markov Decision Processes | Akifumi Wachi; Yanan Sui; | In this paper, we propose an algorithm, SNO-MDP, that explores and optimizes Markov decision processes under unknown safety constraints. |

141 | Layered Sampling for Robust Optimization Problems | Hu Ding; Zixiu Wang; | In this paper, we propose a new variant of coreset technique, {\em layered sampling}, to deal with two fundamental robust optimization problems: {\em $k$-median/means clustering with outliers} and {\em linear regression with outliers}. |

142 | Learning to Encode Position for Transformer with Continuous Dynamical Model | Xuanqing Liu; Hsiang-Fu Yu; Inderjit Dhillon; Cho-Jui Hsieh; | We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models. |

143 | Do RNN and LSTM have Long Memory? | Jingyu Zhao; Feiqing Huang; Jia Lv; Yanjie Duan; Zhen Qin; Guodong Li; Guangjian Tian; | Since the term "long memory" is still not well defined for networks, we propose a new definition of long-memory networks. |

144 | Training Linear Neural Networks: Non-Local Convergence and Complexity Results | Armin Eftekhari; | In this paper, we improve the state of the art in (Bah et al., 2019) by identifying conditions under which gradient flow successfully trains a linear network, in spite of the non-strict saddle points present in the optimization landscape. |

145 | On Validation and Planning of An Optimal Decision Rule with Application in Healthcare Studies | Hengrui Cai; Wenbin Lu; Rui Song; | In this paper, we propose a testing procedure for detecting the existence of an ODR that is better than the naive decision rule under the randomized trials. |

146 | Graph Optimal Transport for Cross-Domain Alignment | Liqun Chen; Zhe Gan; Yu Cheng; Linjie Li; Lawrence Carin; Jingjing Liu; | We propose Graph Optimal Transport (GOT), a principled framework that builds upon recent advances in Optimal Transport (OT). |

147 | Approximation Capabilities of Neural ODEs and Invertible Residual Networks | Han Zhang; Xi Gao; Jacob Unterman; Tomasz Arodz; | We conclude by showing that capping a Neural ODE or an i-ResNet with a single linear layer is sufficient to turn the model into a universal approximator for non-invertible continuous functions. |

148 | Refined bounds for algorithm configuration: The knife-edge of dual class approximability | Nina Balcan; Tuomas Sandholm; Ellen Vitercik; | We investigate a fundamental question about these techniques: how large should the training set be to ensure that a parameter’s average empirical performance over the training set is close to its expected, future performance? |

149 | Teaching with Limited Information on the Learner’s Behaviour | Ferdinando Cicalese; Francisco Sergio de Freitas Filho; Eduardo Laber; Marco Molinaro; | Motivated by the realistic possibility that $h^*$ is not available to the learner, we consider the case where the teacher can only aim at having the learner converge to a best available approximation of $h^*$. |

150 | Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge | Laura Rieger; Chandan Singh; William Murdoch; Bin Yu; | In this paper, we propose contextual decomposition explanation penalization (CDEP), a method which enables practitioners to leverage existing explanation methods to increase the predictive accuracy of a deep learning model. |

151 | DeltaGrad: Rapid retraining of machine learning models | Yinjun Wu; Edgar Dobriban; Susan Davidson; | To address this problem, we propose the DeltaGrad algorithm for rapidly retraining machine learning models based on information cached during the training phase. |

152 | The Cost-free Nature of Optimally Tuning Tikhonov Regularizers and Other Ordered Smoothers | Pierre Bellec; Dana Yang; | We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, of selecting a linear combination of these regularizers that is as good as the best regularizer in the family. |

153 | Approximation Guarantees of Local Search Algorithms via Localizability of Set Functions | Kaito Fujii; | This paper proposes a new framework for providing approximation guarantees of local search algorithms. |

154 | Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent | Yunwen Lei; Yiming Ying; | In this paper, we provide a fine-grained analysis of stability and generalization for SGD by substantially relaxing these assumptions. |

155 | Online Dense Subgraph Discovery via Blurred-Graph Feedback | Yuko Kuroki; Atsushi Miyauchi; Junya Honda; Masashi Sugiyama; | In this paper, we introduce a novel learning problem for dense subgraph discovery in which a learner queries edge subsets rather than only single edges and observes a noisy sum of edge weights in a queried subset. |

156 | LazyIter: A Fast Algorithm for Counting Markov Equivalent DAGs and Designing Experiments | Ali AhmadiTeshnizi; Saber Salehkaleybar; Negar Kiyavash; | We propose a method for efficient iteration over possible MECs given intervention results. |

157 | Perceptual Generative Autoencoders | Zijun Zhang; Ruixiang Zhang; Zongpeng Li; Yoshua Bengio; Liam Paull; | We therefore propose to map both the generated and target distributions to the latent space using the encoder of a standard autoencoder, and train the generator (or decoder) to match the target distribution in the latent space. |

158 | Towards Understanding the Regularization of Adversarial Robustness on Neural Networks | Yuxin Wen; Shuai Li; Kui Jia; | In this work, we study the degradation through the regularization perspective. |

159 | Stochastic Gradient and Langevin Processes | Xiang Cheng; Dong Yin; Peter Bartlett; Michael Jordan; | We prove quantitative convergence rates at which discrete Langevin-like processes converge to the invariant distribution of a related stochastic differential equation. |

160 | ROMA: Multi-Agent Reinforcement Learning with Emergent Roles | Tonghan Wang; Heng Dong; Victor Lesser; Chongjie Zhang; | In this paper, we synergize these two paradigms and propose a role-oriented MARL framework (ROMA). |

161 | Minimax Pareto Fairness: A Multi Objective Perspective | Martin Bertran; Natalia Martinez; Guillermo Sapiro; | In this work we formulate and formally characterize group fairness as a multi-objective optimization problem, where each sensitive group risk is a separate objective. |

162 | Online Pricing with Offline Data: Phase Transition and Inverse Square Law | Jinzhi Bu; David Simchi-Levi; Yunzong Xu; | We study a single-product dynamic pricing problem over a selling horizon of T periods. |

163 | Explicit Gradient Learning for Black-Box Optimization | Elad Sarafian; Mor Sinay; Yoram Louzoun; Noa Agmon; Sarit Kraus; | Here we present a BBO method, termed Explicit Gradient Learning (EGL), that is designed to optimize high-dimensional ill-behaved functions. |

164 | Optimization and Analysis of the pAp@k Metric for Recommender Systems | Gaurush Hiranandani; Warut Vijitbenjaronk; Sanmi Koyejo; Prateek Jain; | In this paper, we analyze the learning-theoretic properties of pAp@k and propose novel surrogates that are consistent under certain data regularity conditions. |

165 | When Explanations Lie: Why Many Modified BP Attributions Fail | Leon Sixt; Maximilian Granz; Tim Landgraf; | We analyze an extensive set of modified BP methods: Deep Taylor Decomposition, Layer-wise Relevance Propagation (LRP), Excitation BP, PatternAttribution, DeepLIFT, Deconv, RectGrad, and Guided BP. We find empirically that the explanations of all mentioned methods, except for DeepLIFT, are independent of the parameters of later layers. We provide theoretical insights for this surprising behavior and also analyze why DeepLIFT does not suffer from this limitation. |

166 | Naive Exploration is Optimal for Online LQR | Max Simchowitz; Dylan Foster; | We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. |

167 | Learning Structured Latent Factors from Dependent Data: A Generative Model Framework from Information-Theoretic Perspective | Ruixiang Zhang; Katsuhiko Ishiguro; Masanori Koyama; | In this paper, we present a novel framework for learning generative models with various underlying structures in the latent space. |

168 | Implicit Generative Modeling for Efficient Exploration | Neale Ratzlaff; Qinxun Bai; Fuxin Li; Wei Xu; | In this work, we introduce an exploration approach based on a novel implicit generative modeling algorithm to estimate a Bayesian uncertainty of the agent’s belief of the environment dynamics. |

169 | Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control | Jie Xu; Yunsheng Tian; Pingchuan Ma; Daniela Rus; Shinjiro Sueda; Wojciech Matusik; | In this work, we propose an efficient evolutionary learning algorithm to find the Pareto set approximation for continuous robot control problems, by extending a state-of-the-art RL algorithm and presenting a novel prediction model to guide the learning process. |

170 | Goodness-of-Fit Tests for Inhomogeneous Random Graphs | Soham Dan; Bhaswar B. Bhattacharya; | In this paper we consider the goodness-of-fit testing problem for large inhomogeneous Erdős-Rényi (IER) random graphs, where given a (known) reference symmetric matrix $Q \in [0, 1]^{n \times n}$ and $m$ independent samples from an IER graph given by an unknown symmetric matrix $P \in [0, 1]^{n \times n}$, the goal is to test the hypothesis $P=Q$ versus $||P-Q|| \geq \varepsilon$, where $||\cdot||$ is some specified norm on symmetric matrices. |

171 | Few-shot Domain Adaptation by Causal Mechanism Transfer | Takeshi Teshima; Issei Sato; Masashi Sugiyama; | We take the structural equations in causal modeling as an example and propose a novel DA method, which is shown to be useful both theoretically and experimentally. |

172 | Adaptive Adversarial Multi-task Representation Learning | Yuren Mao; Weiwei Liu; Xuemin Lin; | Based on the duality, we propose a novel adaptive AMTRL algorithm that improves the performance of original AMTRL methods. |

173 | Streaming Submodular Maximization under a k-Set System Constraint | Ran Haba; Ehsan Kazemi; Moran Feldman; Amin Karbasi; | In this paper, we propose a novel framework that converts streaming algorithms for monotone submodular maximization into streaming algorithms for non-monotone submodular maximization. |

174 | A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton | Risheng Liu; Pan Mu; Xiaoming Yuan; Shangzhi Zeng; Jin Zhang; | To address this critical issue, a new method, named Bi-level Descent Aggregation (BDA), is proposed, aiming to broaden the application horizon of first-order schemes for BLPs. |

175 | Optimal approximation for unconstrained non-submodular minimization | Marwa El Halabi; Stefanie Jegelka; | We show how these relations can be extended to obtain approximation guarantees for minimizing non-submodular functions, characterized by how close the function is to submodular. |

176 | Generating Programmatic Referring Expressions via Program Synthesis | Jiani Huang; Calvin Smith; Osbert Bastani; Rishabh Singh; Aws Albarghouthi; Mayur Naik; | We propose a neurosymbolic program synthesis algorithm that combines a policy neural network with enumerative search to generate such relational programs. |

177 | Nearly Linear Row Sampling Algorithm for Quantile Regression | Yi Li; Ruosong Wang; Lin Yang; Hanrui Zhang; | Our main technical contribution is to show that Lewis weights sampling, which has been used in row sampling algorithms for $\ell_p$ norms, can also be applied in row sampling algorithms for a variety of loss functions. |

178 | On Leveraging Pretrained GANs for Generation with Limited Data | Miaoyun Zhao; Yulai Cong; Lawrence Carin; | To facilitate this, we leverage existing GAN models pretrained on large-scale datasets (like ImageNet) to introduce additional knowledge (which may not exist within the limited data), following the concept of transfer learning. |

179 | More Data Can Expand The Generalization Gap Between Adversarially Robust and Standard Models | Lin Chen; Yifei Min; Mingrui Zhang; Amin Karbasi; | However, we study the training of robust classifiers for both Gaussian and Bernoulli models under $\ell_\infty$ attacks, and we prove that more data may actually increase this gap. |

180 | Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation | Nathan Kallus; Masatoshi Uehara; | We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. |

181 | Statistically Efficient Off-Policy Policy Gradients | Nathan Kallus; Masatoshi Uehara; | In this paper, we consider the efficient estimation of policy gradients from off-policy data, where the estimation is particularly non-trivial. |

182 | Self-PU: Self Boosted and Calibrated Positive-Unlabeled Training | Xuxi Chen; Wuyang Chen; Tianlong Chen; Ye Yuan; Chen Gong; Kewei Chen; Zhangyang Wang; | This motivates us to propose a novel Self-PU learning framework, which seamlessly integrates PU learning and self-training. |

183 | When Does Self-Supervision Help Graph Convolutional Networks? | Yuning You; Tianlong Chen; Zhangyang Wang; Yang Shen; | In this study, we report the first systematic exploration and assessment of incorporating self-supervision into GCNs. |

184 | On Differentially Private Stochastic Convex Optimization with Heavy-tailed Data | Di Wang; Hanshen Xiao; Srinivas Devadas; Jinhui Xu; | In this paper, we consider the problem of designing Differentially Private (DP) algorithms for Stochastic Convex Optimization (SCO) on heavy-tailed data. |

185 | Variance Reduced Coordinate Descent with Acceleration: New Method With a Surprising Application to Finite-Sum Problems | Filip Hanzely; Dmitry Kovalev; Peter Richtarik; | We propose an accelerated version of stochastic variance reduced coordinate descent — ASVRCD. |

186 | Stochastic Subspace Cubic Newton Method | Filip Hanzely; Nikita Doikov; Yurii Nesterov; Peter Richtarik; | In this paper, we propose a new randomized second-order optimization algorithm—Stochastic Subspace Cubic Newton (SSCN)—for minimizing a high dimensional convex function $f$. |

187 | Ready Policy One: World Building Through Active Learning | Philip Ball; Jack Parker-Holder; Aldo Pacchiano; Krzysztof Choromanski; Stephen Roberts; | In this paper we introduce Ready Policy One (RP1), a framework that views MBRL as an active learning problem, where we aim to improve the world model in the fewest samples possible. |

188 | Structural Language Models of Code | Uri Alon; Roy Sadaka; Omer Levy; Eran Yahav; | We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree – structural language modeling (SLM). |

189 | PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization | Jingqing Zhang; Yao Zhao; Mohammad Saleh; Peter Liu; | In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. |

190 | Aggregation of Multiple Knockoffs | Tuan-Binh Nguyen; Jerome-Alexis Chevalier; Bertrand Thirion; Sylvain Arlot; | We develop an extension of the knockoff inference procedure, introduced by Barber & Candes (2015). |

191 | Off-Policy Actor-Critic with Shared Experience Replay | Simon Schmitt; Matteo Hessel; Karen Simonyan; | We investigate the combination of actor-critic reinforcement learning algorithms with a uniform large-scale experience replay and propose solutions for two ensuing challenges: (a) efficient actor-critic learning with experience replay, and (b) the stability of off-policy learning where agents learn from other agents’ behaviour. |

192 | Graph-based Nearest Neighbor Search: From Practice to Theory | Liudmila Prokhorenkova; Aleksandr Shekhovtsov; | In this work, we fill this gap and rigorously analyze the performance of graph-based NNS algorithms, specifically focusing on the low-dimensional (d << log n) regime. |

193 | Policy Teaching via Environment Poisoning: Training-time Adversarial Attacks against Reinforcement Learning | Amin Rakhsha; Goran Radanovic; Rati Devidze; Jerry Zhu; Adish Singla; | We propose an optimization framework for finding an optimal stealthy attack for different measures of attack cost. |

194 | Semismooth Newton Algorithm for Efficient Projections onto $\ell_{1, \infty}$-norm Ball | Dejun Chu; Changshui Zhang; Shiliang Sun; Qing Tao; | In this paper, we propose an efficient algorithm for Euclidean projection onto $\ell_{1, \infty}$-norm ball. |

195 | Influenza Forecasting Framework based on Gaussian Processes | Christoph Zimmer; Reza Yaesoubi; | Here, we propose a new framework based on Gaussian process (GP) for seasonal epidemics forecasting and demonstrate its capability on the CDC reference data on influenza like illness: our framework leads to accurate forecasts with small but reliable uncertainty estimation. |

196 | Unique Properties of Wide Minima in Deep Networks | Rotem Mulayoff; Tomer Michaeli; | In this paper, we characterize the wide minima in linear neural networks trained with a quadratic loss. |

197 | Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making | Chengchun Shi; Runzhe Wan; Rui Song; Wenbin Lu; Ling Leng; | In this paper, we propose a novel Forward-Backward Learning procedure to test MA in sequential decision making. |

198 | LTF: A Label Transformation Framework for Correcting Label Shift | Jiaxian Guo; Mingming Gong; Tongliang Liu; Kun Zhang; Dacheng Tao; | In this paper, we propose an end-to-end Label Transformation Framework (LTF) for correcting label shift, which implicitly models the shift of $P_Y$ and the conditional distribution $P_{X|Y}$ using neural networks. |

199 | Divide, Conquer, and Combine: a New Inference Strategy for Probabilistic Programs with Stochastic Support | Yuan Zhou; Hongseok Yang; Yee Whye Teh; Tom Rainforth; | To address this, we introduce a new inference framework: Divide, Conquer, and Combine, which remains efficient for such models, and show how it can be implemented as an automated and general-purpose PPS inference engine. |

200 | Duality in RKHSs with Infinite Dimensional Outputs: Application to Robust Losses | Pierre Laforgue; Alex Lambert; Luc Brogat-Motte; Florence d’Alche-Buc; | To overcome this limitation, this paper develops a duality approach that allows to solve OVK machines for a wide range of loss functions. |

201 | Causal Effect Estimation and Optimal Dose Suggestions in Mobile Health | Liangyu Zhu; Wenbin Lu; Rui Song; | In this article, we propose novel structural nested models to estimate causal effects of continuous treatments based on mobile health data. |

202 | Towards Understanding the Dynamics of the First-Order Adversaries | Zhun Deng; Hangfeng He; Jiaoyang Huang; Weijie Su; | In this paper, we analyze the dynamics of the maximization step towards understanding the experimentally observed effectiveness of this defense mechanism. |

203 | Interpreting Robust Optimization via Adversarial Influence Functions | Zhun Deng; Cynthia Dwork; Jialiang Wang; Linjun Zhang; | In this paper, inspired by the influence function in robust statistics, we introduce the Adversarial Influence Function (AIF) as a tool to investigate the solution produced by robust optimization. |

204 | Multilinear Latent Conditioning for Generating Unseen Attribute Combinations | Markos Georgopoulos; Grigorios Chrysos; Yannis Panagakis; Maja Pantic; | To this end, we extend the cVAE by introducing a multilinear latent conditioning framework. |

205 | No-Regret Exploration in Goal-Oriented Reinforcement Learning | Jean Tarbouriech; Evrard Garcelon; Michal Valko; Matteo Pirotta; Alessandro Lazaric; | In this paper, we study the general SSP problem with no assumption on its dynamics (some policies may actually never reach the goal). |

206 | OPtions as REsponses: Grounding behavioural hierarchies in multi-agent reinforcement learning | Alexander Vezhnevets; Yuhuai Wu; Maria Eckstein; Rémi Leblond; Joel Z Leibo; | This paper investigates generalisation in multi-agent games, where the generality of the agent can be evaluated by playing against opponents it hasn’t seen during training. |

207 | Feature Noise Induces Loss Discrepancy Across Groups | Fereshte Khani; Percy Liang; | In this work, we point to a more subtle source of loss discrepancy—feature noise. |

208 | Reinforcement Learning for Molecular Design Guided by Quantum Mechanics | Gregor Simm; Robert Pinsler; Jose Miguel Hernandez-Lobato; | To address this, we present a novel RL formulation for molecular design in 3D space, thereby extending the class of molecules that can be built. |

209 | Small-GAN: Speeding up GAN Training using Core-Sets | Samarth Sinha; Han Zhang; Anirudh Goyal; Yoshua Bengio; Hugo Larochelle; Augustus Odena; | Thus, it would be nice if there were some trick by which we could generate batches that were effectively big though small in practice. In this work, we propose such a trick, inspired by the use of Coreset-selection in active learning. |

210 | Conditional gradient methods for stochastically constrained convex minimization | Maria-Luiza Vladarean; Ahmet Alacaoglu; Ya-Ping Hsieh; Volkan Cevher; | We propose two novel conditional gradient-based methods for solving structured stochastic convex optimization problems with a large number of linear constraints. |

211 | Undirected Graphical Models as Approximate Posteriors | Arash Vahdat; Evgeny Andriyash; William Macready; | We develop an efficient method to train undirected approximate posteriors by showing that the gradient of the training objective with respect to the parameters of the undirected posterior can be computed by backpropagation through Markov chain Monte Carlo updates. |

212 | Dynamics of Deep Neural Networks and Neural Tangent Hierarchy | Jiaoyang Huang; Horng-Tzer Yau; | In the current paper, we study the dynamic of the NTK for finite width deep fully-connected neural networks. |

213 | Measuring Non-Expert Comprehension of Machine Learning Fairness Metrics | Debjani Saha; Candice Schumann; Duncan McElfresh; John Dickerson; Michelle Mazurek; Michael Tschantz; | We take initial steps toward bridging this gap between ML researchers and the public, by addressing the question: does a lay audience understand a basic definition of ML fairness? |

214 | Encoding Musical Style with Transformer Autoencoders | Kristy Choi; Curtis Hawthorne; Ian Simon; Monica Dinculescu; Jesse Engel; | In this work, we present the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. |

215 | Min-Max Optimization without Gradients: Convergence and Applications to Black-Box Evasion and Poisoning Attacks | Sijia Liu; Songtao Lu; Xiangyi Chen; Yao Feng; Kaidi Xu; Abdullah Al-Dujaili; Mingyi Hong; Una-May O’Reilly; | In this paper, we study the problem of constrained min-max optimization in a black-box setting, where the desired optimizer cannot access the gradients of the objective function but may query its values. |

216 | ConQUR: Mitigating Delusional Bias in Deep Q-Learning | DiJia Su; Jayden Ooi; Tyler Lu; Dale Schuurmans; Craig Boutilier; | In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. |

217 | Self-Modulating Nonparametric Event-Tensor Factorization | Zheng Wang; Xinqi Chu; Shandian Zhe; | To overcome these limitations, we propose a self-modulating nonparametric Bayesian factorization model. |

218 | Extreme Multi-label Classification from Aggregated Labels | Yanyao Shen; Hsiang-Fu Yu; Sujay Sanghavi; Inderjit Dhillon; | We develop a new and scalable algorithm to impute individual-sample labels from the group labels; this can be paired with any existing XMC method to solve the aggregated label problem. |

219 | Full Law Identification In Graphical Models Of Missing Data: Completeness Results | Razieh Nabi; Rohit Bhattacharya; Ilya Shpitser; | In this paper, we address the longstanding question of the characterization of models that are identifiable within this class of missing data distributions. |

220 | Self-Attentive Associative Memory | Hung Le; Truyen Tran; Svetha Venkatesh; | In this paper, we propose to separate the storage of individual experiences (item memory) and their occurring relationships (relational memory). |

221 | Imputer: Sequence Modelling via Imputation and Dynamic Programming | William Chan; Chitwan Saharia; Geoffrey Hinton; Mohammad Norouzi; Navdeep Jaitly; | This paper presents the Imputer, a neural sequence model that generates output sequences iteratively via imputations. |

222 | Continuously Indexed Domain Adaptation | Hao Wang; Hao He; Dina Katabi; | In this paper, we propose the first method for continuously indexed domain adaptation. |

223 | Evolving Machine Learning Algorithms From Scratch | Esteban Real; Chen Liang; David So; Quoc Le; | Our goal is to show that AutoML can go further: it is possible today to automatically discover complete machine learning algorithms just using basic mathematical operations as building blocks. |

224 | Self-Attentive Hawkes Process | Qiang Zhang; Aldo Lipani; Omer Kirnap; Emine Yilmaz; | This study attempts to fill the gap by designing a self-attentive Hawkes process (SAHP). |

225 | On hyperparameter tuning in general clustering problems | Xinjie Fan; Yuguang Yue; Purnamrita Sarkar; Y. X. Rachel Wang; | In this paper, we provide an overarching framework with provable guarantees for tuning hyperparameters in the above class of problems under two different models. |

226 | Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks | Zhishuai Guo; Mingrui Liu; Zhuoning Yuan; Li Shen; Wei Liu; Tianbao Yang; | In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as a predictive model. |

227 | Adaptive Region-Based Active Learning | Corinna Cortes; Giulia DeSalvo; Claudio Gentile; Mehryar Mohri; Ningshan Zhang; | We present a new active learning algorithm that adaptively partitions the input space into a finite number of regions, and subsequently seeks a distinct predictor for each region, while actively requesting labels. |

228 | Robust Outlier Arm Identification | Yinglun Zhu; Sumeet Katariya; Robert Nowak; | We propose two computationally efficient delta-PAC algorithms for ROAI, which includes the first UCB-style algorithm for outlier detection, and derive upper bounds on their sample complexity. |

229 | Provably Efficient Exploration in Policy Optimization | Qi Cai; Zhuoran Yang; Chi Jin; Zhaoran Wang; | To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. |

230 | Striving for simplicity and performance in off-policy DRL: Output Normalization and Non-Uniform Sampling | Che Wang; Yanqiu Wu; Quan Vuong; Keith Ross; | With this insight, we propose a streamlined algorithm with a simple normalization scheme or with inverted gradients. |

231 | Multidimensional Shape Constraints | Maya Gupta; Erez Louidor; Oleksandr Mangylov; Nobu Morioka; Tamann Narayan; Sen Zhao; | We propose new multi-input shape constraints across four intuitive categories: complements, diminishers, dominance, and unimodality constraints. |

232 | Fast Deterministic CUR Matrix Decomposition with Accuracy Assurance | Yasutoshi Ida; Sekitoshi Kanai; Yasuhiro Fujiwara; Tomoharu Iwata; Koh Takeuchi; Hisashi Kashima; | This paper proposes a fast deterministic CUR matrix decomposition. |

233 | Operation-Aware Soft Channel Pruning using Differentiable Masks | Minsoo Kang; Bohyung Han; | We propose a simple but effective data-driven channel pruning algorithm, which compresses deep neural networks effectively by exploiting the characteristics of operations in a differentiable way. |

234 | Normalized Loss Functions for Deep Learning with Noisy Labels | Xingjun Ma; Hanxun Huang; Yisen Wang; Simone Romano; Sarah Erfani; James Bailey; | In this paper, we theoretically show that, by applying a simple normalization, any loss can be made robust to noisy labels. |

235 | Learning Deep Kernels for Non-Parametric Two-Sample Tests | Feng Liu; Wenkai Xu; Jie Lu; Guangquan Zhang; Arthur Gretton; D.J. Sutherland; | We propose a class of kernel-based two-sample tests, which aim to determine whether two sets of samples are drawn from the same distribution. |

236 | DeBayes: a Bayesian method for debiasing network embeddings | Maarten Buyl; Tijl De Bie; | We thus propose DeBayes: a conceptually elegant Bayesian method that is capable of learning debiased embeddings by using a biased prior. |

237 | Principled learning method for Wasserstein distributionally robust optimization with local perturbations | Yongchan Kwon; Wonyoung Kim; Joong-Ho Won; Myunghee Cho Paik; | In this paper, we propose a minimizer based on a novel approximation theorem and provide the corresponding risk consistency results. |

238 | Low-Variance and Zero-Variance Baselines for Extensive-Form Games | Trevor Davis; Martin Schmid; Michael Bowling; | In this paper, we extend recent work that uses baseline estimates to reduce this variance. |

239 | Converging to Team-Maxmin Equilibria in Zero-Sum Multiplayer Games | Youzhi Zhang; Bo An; | This paper focuses on computing Team-Maxmin Equilibria (TMEs), an important solution concept for zero-sum multiplayer games in which players in a team, sharing the same utility function, play independently against an adversary. |

240 | Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks | Alexander Shevchenko; Marco Mondelli; | In this paper, we shed light on this phenomenon: we show that the combination of stochastic gradient descent (SGD) and over-parameterization makes the landscape of multilayer neural networks approximately connected and thus more favorable to optimization. |

241 | Leveraging Frequency Analysis for Deep Fake Image Recognition | Joel Frank; Thorsten Eisenhofer; Lea Schönherr; Dorothea Kolossa; Thorsten Holz; Asja Fischer; | This paper addresses this shortcoming, and our results reveal that, in frequency space, GAN-generated images exhibit severe artifacts that can be easily identified. |

242 | Tails of Lipschitz Triangular Flows | Priyank Jaini; Ivan Kobyzev; Yaoliang Yu; Marcus Brubaker; | We investigate the ability of popular flow models to capture tail-properties of a target density by studying the increasing triangular maps used in these flow methods acting on a tractable source density. |

243 | Deep Coordination Graphs | Wendelin Boehmer; Vitaly Kurin; Shimon Whiteson; | This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. |

244 | Voice Separation with an Unknown Number of Multiple Speakers | Eliya Nachmani; Yossi Adi; Lior Wolf; | We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. |

245 | Predicting Choice with Set-Dependent Aggregation | Nir Rosenfeld; Kojin Oshiba; Yaron Singer; | Here we propose a learning framework for predicting choice that is accurate, versatile, and theoretically grounded. |

246 | Thompson Sampling Algorithms for Mean-Variance Bandits | Qiuyu Zhu; Vincent Tan; | We develop Thompson Sampling-style algorithms for mean-variance MAB and provide comprehensive regret analyses for Gaussian and Bernoulli bandits with fewer assumptions. |

247 | Differentiable Likelihoods for Fast Inversion of ‘Likelihood-Free’ Dynamical Systems | Hans Kersting; Nicholas Krämer; Martin Schiegg; Christian Daniel; Michael Schober; Philipp Hennig; | To address this shortcoming, we employ Gaussian ODE filtering (a probabilistic numerical method for ODEs) to construct a local Gaussian approximation to the likelihood. |

248 | Debiased Sinkhorn barycenters | Hicham Janati; Marco Cuturi; Alexandre Gramfort; | Here we show how this bias is tightly linked to the reference measure that defines the entropy regularizer, and propose debiased Sinkhorn barycenters that preserve the best of both worlds: fast Sinkhorn-like iterations without entropy smoothing. |

249 | Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime | Stéphane d’Ascoli; Maria Refinetti; Giulio Biroli; Florent Krzakala; | In this work, we develop a quantitative theory for this phenomenon in the so-called lazy learning regime of neural networks, by considering the problem of learning a high-dimensional function with random features regression. |

250 | Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills | Victor Campos; Alexander Trott; Caiming Xiong; Richard Socher; Xavier Giro-i-Nieto; Jordi Torres; | In light of this, we propose *Explore, Discover and Learn* (EDL), an alternative approach to information-theoretic skill discovery. |

251 | Sparsified Linear Programming for Zero-Sum Equilibrium Finding | Brian Zhang; Tuomas Sandholm; | In this paper we present a totally different approach to the problem, which is competitive and often orders of magnitude better than the prior state of the art. |

252 | Extra-gradient with player sampling for faster convergence in n-player games | Samy Jelassi; Carles Domingo-Enrich; Damien Scieur; Arthur Mensch; Joan Bruna; | In this paper, we analyse a new extra-gradient method for Nash equilibrium finding, that performs gradient extrapolations and updates on a random subset of players at each iteration. |

253 | Entropy Minimization In Emergent Languages | Eugene Kharitonov; Rahma Chaabouni; Diane Bouchacourt; Marco Baroni; | We investigate here the information-theoretic complexity of such languages, focusing on the basic two-agent, one-exchange setup. |

254 | Spectral Clustering with Graph Neural Networks for Graph Pooling | Filippo Maria Bianchi; Daniele Grattarola; Cesare Alippi; | In this paper, we propose a graph clustering approach that addresses these limitations of SC. |

255 | VFlow: More Expressive Generative Flows with Variational Data Augmentation | Jianfei Chen; Cheng Lu; Biqi Chenli; Jun Zhu; Tian Tian; | In this work, we study a previously overlooked constraint that all the intermediate representations must have the same dimensionality as the data due to invertibility, limiting the width of the network. |

256 | Fully Parallel Hyperparameter Search: Reshaped Space-Filling | Marie-Liesse Cauwet; Camille Couprie; Julien Dehos; Pauline Luc; Jeremy Rapin; Morgane Riviere; Fabien Teytaud; Olivier Teytaud; Nicolas Usunier; | Consequently, we introduce a new sampling approach based on the reshaping of the search distribution, and we show both theoretically and numerically that it leads to significant gains over random search. |

257 | Discount Factor as a Regularizer in Reinforcement Learning | Ron Amit; Kamil Ciosek; Ron Meir; | It is known that applying RL algorithms with a discount set lower than the evaluation discount factor can act as a regularizer, improving performance in the limited data regime. Yet the exact nature of this regularizer has not been investigated. In this work, we fill in this gap. |

258 | On Learning Sets of Symmetric Elements | Haggai Maron; Or Litany; Gal Chechik; Ethan Fetaya; | In this paper, we present a principled approach to learning sets of general symmetric elements. |

259 | Non-convex Learning via Replica Exchange Stochastic Gradient MCMC | Wei Deng; Qi Feng; Liyao Gao; Faming Liang; Guang Lin; | In this paper, we propose an adaptive replica exchange SG-MCMC (reSG-MCMC) to automatically correct the bias and study the corresponding properties. |

260 | Learning Similarity Metrics for Numerical Simulations | Georg Kohl; Kiwon Um; Nils Thuerey; | We propose a neural network-based approach that computes a stable and generalizing metric (LSiM) to compare data from a variety of numerical simulation sources. |

261 | FR-Train: A mutual information-based approach to fair and robust training | Yuji Roh; Kangwook Lee; Steven Whang; Changho Suh; | To fix this problem, we propose FR-Train, which holistically performs fair and robust model training. |

262 | Real-Time Optimisation for Online Learning in Auctions | Lorenzo Croissant; Marc Abeille; Clément Calauzènes; | In this paper, we provide the first algorithm for online learning of monopoly prices in online auctions whose update is constant in time and memory. |

263 | Graph Random Neural Features for Distance-Preserving Graph Representations | Daniele Zambon; Cesare Alippi; Lorenzo Livi; | We present Graph Random Neural Features (GRNF), a novel embedding method from graph-structured data to real vectors based on a family of graph neural networks. |

264 | Modulating Surrogates for Bayesian Optimization | Erik Bodin; Markus Kaiser; Ieva Kazlauskaite; Zhenwen Dai; Neill Campbell; Carl Henrik Ek; | We address this issue by proposing surrogate models that focus on the well-behaved structure in the objective function, which is informative for search, while ignoring detrimental structure that is challenging to model from few observations. |

265 | Convolutional Kernel Networks for Graph-Structured Data | Dexiong Chen; Laurent Jacob; Julien Mairal; | We introduce a family of multilayer graph kernels and establish new links between graph convolutional neural networks and kernel methods. |

266 | Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: Joint Gradient Estimation and Tracking | Haoran Sun; Songtao Lu; Mingyi Hong; | In particular, we propose an algorithm named D-GET (decentralized gradient estimation and tracking), which jointly performs decentralized gradient estimation (which estimates the local gradient using a subset of local samples) *and* gradient tracking (which tracks the global full gradient using local estimates). |

267 | Proper Network Interpretability Helps Adversarial Robustness in Classification | Akhilan Boopathy; Sijia Liu; Gaoyuan Zhang; Cynthia Liu; Pin-Yu Chen; Shiyu Chang; Luca Daniel; | In this paper, we theoretically show that with a proper measurement of interpretation, it is actually difficult to prevent prediction-evasion adversarial attacks from causing interpretability discrepancy, as confirmed by experiments on MNIST, CIFAR-10 and Restricted ImageNet. |

268 | Generalization Guarantees for Sparse Kernel Approximation with Entropic Optimal Features | Liang Ding; Rui Tuo; Shahin Shahrampour; | In this paper, in lieu of commonly used kernel expansion with respect to $N$ inputs, we develop a novel optimal design maximizing the entropy among kernel features. |

269 | Understanding the Impact of Model Incoherence on Convergence of Incremental SGD with Random Reshuffle | Shaocong Ma; Yi Zhou; | In this work, we introduce model incoherence to characterize the diversity of model characteristics and study its impact on the convergence of SGD with random reshuffle under weak strong convexity. |

270 | Learning Opinions in Social Networks | Vincent Conitzer; Debmalya Panigrahi; Hanrui Zhang; | We study the problem of learning opinions in social networks. |

271 | Latent Variable Modelling with Hyperbolic Normalizing Flows | Joey Bose; Ariella Smofsky; Renjie Liao; Prakash Panangaden; Will Hamilton; | To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. |

272 | StochasticRank: Global Optimization of Scale-Free Discrete Functions | Aleksei Ustimenko; Liudmila Prokhorenkova; | In this paper, we introduce a powerful and efficient framework for the direct optimization of ranking metrics. |

273 | Working Memory Graphs | Ricky Loynd; Roland Fernandez; Asli Celikyilmaz; Adith Swaminathan; Matthew Hausknecht; | We present the Working Memory Graph (WMG), an agent that employs multi-head self-attention to reason over a dynamic set of vectors representing observed and recurrent state. |

274 | Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules | Sarthak Mittal; Alex Lamb; Anirudh Goyal; Vikram Voleti; Murray Shanahan; Guillaume Lajoie; Michael Mozer; Yoshua Bengio; | We explore deep recurrent neural net architectures in which bottom-up and top-down signals are dynamically combined using attention. |

275 | Spread Divergence | Mingtian Zhang; Peter Hayes; Thomas Bird; Raza Habib; David Barber; | We define a spread divergence on modified versions of $p$ and $q$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. |

276 | Optimizing Black-box Metrics with Adaptive Surrogates | Qijia Jiang; Olaoluwa Adigun; Harikrishna Narasimhan; Mahdi Milani Fard; Maya Gupta; | We address the problem of training models with black-box and hard-to-optimize metrics by expressing the metric as a monotonic function of a small number of easy-to-optimize surrogates. |

277 | Domain Adaptive Imitation Learning | Kuno Kim; Yihong Gu; Jiaming Song; Shengjia Zhao; Stefano Ermon; | In this work, we formalize the Domain Adaptive Imitation Learning (DAIL) problem – a unified framework for imitation learning in the presence of viewpoint, embodiment, and/or dynamics mismatch. |

278 | A general recurrent state space framework for modeling neural dynamics during decision-making | David Zoltowski; Jonathan Pillow; Scott Linderman; | Here we propose a general framework for modeling neural activity during decision-making. |

279 | An Imitation Learning Approach for Cache Replacement | Evan Liu; Milad Hashemi; Kevin Swersky; Parthasarathy Ranganathan; Junwhan Ahn; | In contrast, we propose an imitation learning approach to automatically learn cache access patterns by leveraging Belady’s, an oracle policy that computes the optimal eviction decision given the future cache accesses. |

280 | Revisiting Training Strategies and Generalization Performance in Deep Metric Learning | Karsten Roth; Timo Milbich; Samarth Sinha; Prateek Gupta; Bjorn Ommer; Joseph Paul Cohen; | Exploiting these insights, we propose a simple, yet effective, training regularization to reliably boost the performance of ranking-based DML models on various standard benchmark datasets; code and a publicly accessible WandB-repo are available. |

281 | Temporal Phenotyping using Deep Predictive Clustering of Disease Progression | Changhee Lee; Mihaela van der Schaar; | In this paper, we develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest (e.g., adverse events, the onset of comorbidities). |

282 | Countering Language Drift with Seeded Iterated Learning | Yuchen Lu; Soumye Singhal; Florian Strub; Aaron Courville; Olivier Pietquin; | In this paper, we propose a generic approach to counter language drift by using iterated learning. |

283 | Stochastic Gauss-Newton Algorithms for Nonconvex Compositional Optimization | Quoc Tran-Dinh; Nhan Pham; Lam Nguyen; | We develop two new stochastic Gauss-Newton algorithms for solving a class of stochastic non-convex compositional optimization problems frequently arising in practice. |

284 | Strategyproof Mean Estimation from Multiple-Choice Questions | Anson Kahng; Gregory Kehne; Ariel Procaccia; | Given n values possessed by n agents, we study the problem of estimating the mean by truthfully eliciting agents’ answers to multiple-choice questions about their values. |

285 | Sequential Cooperative Bayesian Inference | Junqi Wang; Pei Wang; Patrick Shafto; | We develop novel approaches analyzing consistency, rate of convergence and stability of Sequential Cooperative Bayesian Inference (SCBI). |

286 | Spectral Graph Matching and Regularized Quadratic Relaxations: Algorithm and Theory | Zhou Fan; Cheng Mao; Yihong Wu; Jiaming Xu; | To tackle this task, we propose a spectral method, GRAph Matching by Pairwise eigen-Alignments (GRAMPA), which first constructs a similarity matrix as a weighted sum of outer products between all pairs of eigenvectors of the two graphs, and then outputs a matching by a simple rounding procedure. |

287 | Zeno++: Robust Fully Asynchronous SGD | Cong Xie; Sanmi Koyejo; Indranil Gupta; | We propose Zeno++, a new robust asynchronous Stochastic Gradient Descent (SGD) procedure, intended to tolerate Byzantine failures of workers. |

288 | Network Pruning by Greedy Subnetwork Selection | Mao Ye; Chengyue Gong; Lizhen Nie; Denny Zhou; Adam Klivans; Qiang Liu; | In this work, we study a greedy forward selection approach following the opposite direction, which starts from an empty network, and gradually adds good neurons from the large network. |

289 | Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently | Asaf Cassel; Alon Cohen; Tomer Koren; | We present new efficient algorithms that achieve, perhaps surprisingly, regret that scales only (poly-)logarithmically with the number of steps, in two scenarios: when only the state transition matrix A is unknown, and when only the state-action transition matrix B is unknown and the optimal policy satisfies a certain non-degeneracy condition. |

290 | Hierarchical Verification for Adversarial Robustness | Cong Han Lim; Raquel Urtasun; Ersin Yumer; | We introduce a new framework for the exact pointwise lp robustness verification problem that exploits the layer-wise geometric structure of deep feed-forward networks with rectified linear activations (ReLU networks). |

291 | BINOCULARS for efficient, nonmyopic sequential experimental design | Shali Jiang; Henry Chai; Javier Gonzalez; Roman Garnett; | We present BINOCULARS: Batch-Informed NOnmyopic Choices, Using Long-horizons for Adaptive, Rapid SED, a general framework for deriving efficient, nonmyopic approximations to the optimal experimental policy. |

292 | On the Global Optimality of Model-Agnostic Meta-Learning | Lingxiao Wang; Qi Cai; Zhuoran Yang; Zhaoran Wang; | To bridge such a gap between theory and practice, we characterize the optimality gap of the stationary points attained by MAML for both reinforcement learning and supervised learning, where both the inner- and outer-level problems are solved via first-order optimization methods. |

293 | Breaking the Curse of Many Agents: Provable Mean Embedding $Q$-Iteration for Mean-Field Reinforcement Learning | Lingxiao Wang; Zhuoran Yang; Zhaoran Wang; | In this paper, we exploit the symmetry of agents in MARL. |

294 | Learning with Bounded Instance- and Label-dependent Label Noise | Jiacheng Cheng; Tongliang Liu; Kotagiri Ramamohanarao; Dacheng Tao; | In this paper, we focus on Bounded Instance- and Label-dependent label Noise (BILN), a particular case of ILN where the label noise rates—the probabilities that the true labels of examples flip into the corrupted ones—have an upper bound less than $1$. |

295 | Transparency Promotion with Model-Agnostic Linear Competitors | Hassan Rafique; Tong Wang; Qihang Lin; Arshia Singhani; | We propose a novel type of hybrid model for multi-class classification, which utilizes competing linear models to collaborate with an existing black-box model, promoting transparency in the decision-making process. |

296 | Learning Mixtures of Graphs from Epidemic Cascades | Jessica Hoffmann; Soumya Basu; Surbhi Goel; Constantine Caramanis; | We consider the problem of learning the weighted edges of a balanced mixture of two undirected graphs from epidemic cascades. |

297 | Implicit differentiation of Lasso-type models for hyperparameter optimization | Quentin Bertrand; Quentin Klopfenstein; Mathieu Blondel; Samuel Vaiter; Alexandre Gramfort; Joseph Salmon; | This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. |

298 | Latent Space Factorisation and Manipulation via Matrix Subspace Projection | Xiao Li; Chenghua Lin; Ruizhe Li; Chaozheng Wang; Frank Guerin; | We tackle the problem of disentangling the latent space of an autoencoder in order to separate labelled attribute information from other characteristic information. |

299 | Active World Model Learning in Agent-rich Environments with Progress Curiosity | Kuno Kim; Megumi Sano; Julian De Freitas; Nick Haber; Daniel Yamins; | In this work, we study how to design such a curiosity-driven Active World Model Learning (AWML) system. |

300 | SDE-Net: Equipping Deep Neural Networks with Uncertainty Estimates | Lingkai Kong; Jimeng Sun; Chao Zhang; | We propose a new method for quantifying uncertainties of DNNs from a dynamical system perspective. |

301 | GANs May Have No Nash Equilibria | Farzan Farnia; Asuman Ozdaglar; | In this work, we show through several theoretical and numerical results that indeed GAN zero-sum games may not have any Nash equilibria. |

302 | Gradient Temporal-Difference Learning with Regularized Corrections | Sina Ghiassian; Andrew Patterson; Shivam Garg; Dhawal Gupta; Adam White; Martha White; | In this paper, we introduce a new method called TD with Regularized Corrections (TDRC), that attempts to balance ease of use, soundness, and performance. |

303 | Online mirror descent and dual averaging: keeping pace in the dynamic case | Huang Fang; Victor Sanches Portella; Nick Harvey; Michael Friedlander; | In this paper, we modify the OMD algorithm by a simple technique that we call stabilization. |

304 | Choice Set Optimization Under Discrete Choice Models of Group Decisions | Kiran Tomlinson; Austin Benson; | Here, we use discrete choice modeling to develop an optimization framework of such interventions for several problems of group influence, including maximizing agreement or disagreement and promoting a particular choice. |

305 | Complexity of Finding Stationary Points of Nonconvex Nonsmooth Functions | Jingzhao Zhang; Hongzhou Lin; Stefanie Jegelka; Suvrit Sra; Ali Jadbabaie; | Therefore, we introduce the notion of $(\delta, \epsilon)$-stationarity, a generalization that allows for a point to be within distance $\delta$ of an $\epsilon$-stationary point and reduces to $\epsilon$-stationarity for smooth functions. |

306 | Multi-Agent Routing Value Iteration Network | Quinlan Sykora; Mengye Ren; Raquel Urtasun; | Traditional methods are not designed for realistic environments with sparse connectivity and unknown traffic, and are often slow at runtime; in this paper, we propose a graph neural network based model that is able to perform multi-agent routing in a sparsely connected graph with dynamically changing traffic conditions, outperforming existing methods. |

307 | Adversarial Attacks on Copyright Detection Systems | Parsa Saadatpanah; Ali Shafahi; Tom Goldstein; | This paper discusses how industrial copyright detection tools, which serve a central role on the web, are susceptible to adversarial attacks. |

308 | Differentiating through the Fréchet Mean | Aaron Lou; Isay Katsman; Qingxuan Jiang; Serge Belongie; Ser Nam Lim; Christopher De Sa; | In this paper, we show how to differentiate through the Fréchet mean for arbitrary Riemannian manifolds. |

309 | Online Learning for Active Cache Synchronization | Andrey Kolobov; Sebastien Bubeck; Julian Zimmert; | We present MirrorSync, an online learning algorithm for synchronization bandits, establish an adversarial regret of $O(T^{2/3})$ for it, and show how to make it efficient in practice. |

310 | PoKED: A Semi-Supervised System for Word Sense Disambiguation | Feng Wei; | In this paper, we propose a semi-supervised neural system, Position-wise Orthogonal Knowledge-Enhanced Disambiguator (PoKED), which allows attention-driven, long-range dependency modeling for word sense disambiguation tasks. |

311 | A Finite-Time Analysis of Q-Learning with Neural Network Function Approximation | Pan Xu; Quanquan Gu; | In this paper, we present a finite-time analysis of a neural Q-learning algorithm, where the data are generated from a Markov decision process and the action-value function is approximated by a deep ReLU neural network. |

312 | Understanding and Stabilizing GANs’ Training Dynamics Using Control Theory | Kun Xu; Chongxuan Li; Jun Zhu; Bo Zhang; | To this end, we present a conceptually novel perspective from control theory to directly model the dynamics of GANs in the frequency domain and provide simple yet effective methods to stabilize GAN’s training. |

313 | Scalable Nearest Neighbor Search for Optimal Transport | Arturs Backurs; Yihe Dong; Piotr Indyk; Ilya Razenshteyn; Tal Wagner; | In this work we introduce Flowtree, a fast and accurate approximation algorithm for the Wasserstein-1 distance. |

314 | Supervised learning: no loss no cry | Richard Nock; Aditya Menon; | In this paper, we revisit the SLIsotron algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. |

315 | Label-Noise Robust Domain Adaptation | Xiyu Yu; Tongliang Liu; Mingming Gong; Kun Zhang; Kayhan Batmanghelich; Dacheng Tao; | Focusing on the generalized target shift scenario, where both label distribution $P_Y$ and the class-conditional distribution $P_{X|Y}$ can change, we propose a new Denoising Conditional Invariant Component (DCIC) framework, which provably ensures (1) extracting invariant representations given examples with noisy labels in the source domain and unlabeled examples in the target domain and (2) estimating the label distribution in the target domain with no bias. |

316 | Description Based Text Classification with Reinforcement Learning | Wei Wu; Duo Chai; Qinghong Han; Fei Wu; Jiwei Li; | Inspired by the current trend of formalizing NLP problems as question answering tasks, we propose a new framework for text classification, in which each category label is associated with a category description. |

317 | Bandits for BMO Functions | Tianyu Wang; Cynthia Rudin; | We study the bandit problem where the underlying expected reward is a Bounded Mean Oscillation (BMO) function. |

318 | Cost-effectively Identifying Causal Effect When Only Response Variable Observable | Tian-Zuo Wang; Xi-Zhu Wu; Sheng-Jun Huang; Zhi-Hua Zhou; | In this paper, we propose a novel solution for this challenging task where only the response variable is observable under intervention. |

319 | Learning with Multiple Complementary Labels | Lei Feng; Takuo Kaneko; Bo Han; Gang Niu; Bo An; Masashi Sugiyama; | In this paper, we propose a novel problem setting to allow MCLs for each example and two ways for learning with MCLs. |

320 | Contrastive Multi-View Representation Learning on Graphs | Kaveh Hassani; Amir Hosein Khasahmadi; | We introduce a self-supervised approach for learning node and graph level representations by contrasting structural views of graphs. |

321 | A Chance-Constrained Generative Framework for Sequence Optimization | Xianggen Liu; Jian Peng; Qiang Liu; Sen Song; | In this paper, we formulate the sequence optimization task as a chance-constrained sampling problem. |

322 | dS^2LBI: Exploring Structural Sparsity on Deep Network via Differential Inclusion Paths | Yanwei Fu; Chen Liu; Donghao Li; Xinwei Sun; Jinshan Zeng; Yuan Yao; | In this paper, instead of pruning or distilling over-parameterized models to compressive ones, we propose a new approach based on differential inclusions of inverse scale spaces. |

323 | Sparse Subspace Clustering with Entropy-Norm | Liang Bai; Jiye Liang; | Therefore, in this paper, we provide an explicit theoretical connection between them from the perspective of learning a data similarity matrix. |

324 | On the Generalization Effects of Linear Transformations in Data Augmentation | Sen Wu; Hongyang Zhang; Gregory Valiant; Christopher Re; | In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. |

325 | Sparse Shrunk Additive Models | Hong Chen; Guodong Liu; Heng Huang; | A new method, called sparse shrunk additive models (SSAM), is proposed to explore the structure information among features. |

326 | Unsupervised Discovery of Interpretable Directions in the GAN Latent Space | Andrey Voynov; Artem Babenko; | In this paper, we introduce an unsupervised method to identify interpretable directions in the latent space of a pretrained GAN model. |

327 | DropNet: Reducing Neural Network Complexity via Iterative Pruning | Chong Min John Tan; Mehul Motani; | Inspired by the iterative weight pruning in the Lottery Ticket Hypothesis, we propose DropNet, an iterative pruning method which prunes nodes/filters to reduce network complexity. |

328 | Self-supervised Label Augmentation via Input Transformations | Hankook Lee; Sung Ju Hwang; Jinwoo Shin; | Our main idea is to learn a single unified task with respect to the joint distribution of the original and self-supervised labels, i.e., we augment original labels via self-supervision. |

329 | Mapping natural-language problems to formal-language solutions using structured neural representations | Kezhen Chen; Qiuyuan Huang; Hamid Palangi; Paul Smolensky; Ken Forbus; Jianfeng Gao; | In this paper, we propose a new encoder-decoder model based on a structured neural representation, Tensor Product Representations (TPRs), for generating formal-language solutions from natural-language, called TP-N2F. |

330 | Transformation of ReLU-based recurrent neural networks from discrete-time to continuous-time | Zahra Monfared; Daniel Durstewitz; | Here we show how to perform such a translation from discrete to continuous time for a particular class of ReLU-based RNN. |

331 | Implicit Geometric Regularization for Learning Shapes | Amos Gropp; Lior Yariv; Niv Haim; Matan Atzmon; Yaron Lipman; | In this paper we offer a new paradigm for computing high fidelity implicit neural representations directly from raw data (i.e., point clouds, with or without normal information). |

332 | Influence Diagram Bandits | Tong Yu; Branislav Kveton; Zheng Wen; Ruiyi Zhang; Ole J. Mengshoel; | We propose a novel framework for structured bandits, which we call influence diagram bandit. |

333 | Information Particle Filter Tree: An Online Algorithm for POMDPs with Belief-Based Rewards on Continuous Domains | Johannes Fischer; Ömer Sahin Tas; | In this work we propose a novel online algorithm, Information Particle Filter Tree (IPFT), to solve problems with belief-dependent rewards on continuous domains. |

334 | Convergence Rates of Variational Inference in Sparse Deep Learning | Badr-Eddine Chérief-Abdellatif; | In this paper, we show that variational inference for sparse deep learning retains precisely the same generalization properties as exact Bayesian inference. |

335 | Unsupervised Transfer Learning for Spatiotemporal Predictive Networks | Zhiyu Yao; Yunbo Wang; Mingsheng Long; Jianmin Wang; | Technically, we propose a differentiable framework named transferable memory. |

336 | DINO: Distributed Newton-Type Optimization Method | Rixon Crane; Fred Roosta; | We present a novel communication-efficient Newton-type algorithm for finite-sum optimization over a distributed computing environment. |

337 | Quantum Expectation-Maximization for Gaussian Mixture Models | Alessandro Luongo; Iordanis Kerenidis; Anupam Prakash; | We define a quantum version of Expectation-Maximization (QEM), a fundamental tool in unsupervised machine learning, often used to solve Maximum Likelihood (ML) and Maximum A Posteriori (MAP) estimation problems. |

338 | Consistent Structured Prediction with Max-Min Margin Markov Networks | Alex Nowak; Francis Bach; Alessandro Rudi; | In this paper, we prove consistency and finite sample generalization bounds for $M^4N$ and provide an explicit algorithm to compute the estimator. |

339 | Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions | Prashanth L.A.; Krishna Jagannathan; Ravi Kolla; | We derive concentration bounds for CVaR estimates, considering separately the cases of sub-Gaussian, light-tailed and heavy-tailed distributions. |

340 | Robust Pricing in Dynamic Mechanism Design | Yuan Deng; Sébastien Lahaie; Vahab Mirrokni; | In this paper, we propose robust dynamic mechanism design. |

341 | Nested Subspace Arrangement for Representation of Relational Data | Nozomi Hata; Shizuo Kaji; Akihiro Yoshida; Katsuki Fujisawa; | In this paper, we introduce Nested SubSpace arrangement (NSS arrangement), a comprehensive framework for representation learning. |

342 | Equivariant Neural Rendering | Emilien Dupont; Miguel Bautista Martin; Alex Colburn; Aditya Sankar; Joshua Susskind; Qi Shan; | We propose a framework for learning neural scene representations directly from images, without 3D supervision. |

343 | Bounding the fairness and accuracy of classifiers from population statistics | Sivan Sabato; Elad Yom-Tov; | We propose an efficient and practical procedure for finding the best possible lower bound on the discrepancy of the classifier, given the aggregate statistics, and demonstrate in experiments the empirical tightness of this lower bound, as well as its possible uses on various types of problems, ranging from estimating the quality of voting polls to measuring the effectiveness of patient identification from internet search queries. |

344 | Healing Gaussian Process Experts | Samuel Cohen; Rendani Mbuvha; Tshilidzi Marwala; Marc Deisenroth; | In this paper, we provide a solution to these problems for multiple expert models, including the generalised product of experts and the robust Bayesian committee machine. |

345 | Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles | Dylan Foster; Alexander Rakhlin; | We provide the first universal and optimal reduction from contextual bandits to online regression. |

346 | Simple and Deep Graph Convolutional Networks | Ming Chen; Zhewei Wei; Zengfeng Huang; Bolin Ding; Yaliang Li; | In this paper, we study the problem of designing and analyzing deep graph convolutional networks. |

347 | Projection-free Distributed Online Convex Optimization with $O(\sqrt{T})$ Communication Complexity | Yuanyu Wan; Wei-Wei Tu; Lijun Zhang; | In this paper, we first propose an improved variant of D-OCG, namely D-BOCG, which enjoys an $O(T^{3/4})$ regret bound with only $O(\sqrt{T})$ communication complexity. |

348 | Meta Variance Transfer: Learning to Augment from the Others | Seong-Jin Park; Seungju Han; Ji-won Baek; Insoo Kim; Juhwan Song; Hae Beom Lee; Jae-Joon Han; Sung Ju Hwang; | To alleviate the need of collecting large data and better learn from scarce samples, we propose a novel meta-learning method which learns to transfer factors of variations from one class to another, such that it can improve the classification performance on unseen examples. |

349 | Coresets for Clustering in Graphs of Bounded Treewidth | Daniel Baker; Vladimir Braverman; Lingxiao Huang; Shaofeng H.-C. Jiang; Robert Krauthgamer; Xuan Wu; | The construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of $O(\mathrm{tw}(G))$ on the shattering dimension under any point weights. |

350 | On Breaking Deep Generative Model-based Defenses and Beyond | Yanzhi Chen; Renjie Xie; Zhanxing Zhu; | In this work, we develop a new gradient approximation attack to break these defenses. |

351 | Exploration Through Bias: Revisiting Biased Maximum Likelihood Estimation in Stochastic Multi-Armed Bandits | Xi Liu; Ping-Chun Hsieh; Yu Heng Hung; Anirban Bhattacharya; P. Kumar; | We propose a new family of bandit algorithms, that are formulated in a general way based on the Biased Maximum Likelihood Estimation (BMLE) method originally appearing in the adaptive control literature. |

352 | Bisection-Based Pricing for Repeated Contextual Auctions against Strategic Buyer | Anton Zhiyanov; Alexey Drutsa; | We introduce a novel deterministic learning algorithm that is based on ideas of the Bisection method and has strategic regret upper bound of $O(\log^2 T)$. |

353 | Haar Graph Pooling | Yuguang Wang; Ming Li; Zheng Ma; Guido Montufar; Xiaosheng Zhuang; Yanan Fan; | We propose a new graph pooling operation based on compressive Haar transforms — HaarPooling. |

354 | Explaining Groups of Points in Low-Dimensional Representations | Gregory Plumb; Jonathan Terhorst; Sriram Sankararaman; Ameet Talwalkar; | To solve this problem, we introduce a new type of explanation, a Global Counterfactual Explanation (GCE), and our algorithm, Transitive Global Translations (TGT), for computing GCEs. |

355 | Learning Portable Representations for High-Level Planning | Steven James; Benjamin Rosman; George Konidaris; | We present a framework for autonomously learning a portable representation that describes a collection of low-level continuous environments. |

356 | Adaptive Estimator Selection for Off-Policy Evaluation | Yi Su; Pavithra Srinath; Akshay Krishnamurthy; | We develop a generic data-driven method for estimator selection in off-policy policy evaluation settings. |

357 | Doubly Stochastic Variational Inference for Neural Processes with Hierarchical Latent Variables | Qi Wang; Herke van Hoof; | To address this challenge, we investigate NPs systematically and present a new variant of NP model that we call Doubly Stochastic Variational Neural Process (DSVNP). |

358 | Generative Flows with Matrix Exponential | Changyi Xiao; Ligang Liu; | In this paper, we incorporate matrix exponential into generative flows. |

359 | Composable Sketches for Functions of Frequencies: Beyond the Worst Case | Edith Cohen; Ofir Geri; Rasmus Pagh; | In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. |

360 | Self-concordant analysis of Frank-Wolfe algorithm | Mathias Staudigl; Pavel Dvurechenskii; Shimrit Shtern; Kamil Safin; Petr Ostroukhov; | If the problem can be represented by a local linear minimization oracle, we are the first to propose a FW method with a linear convergence rate without assuming either strong convexity or a Lipschitz continuous gradient. |

361 | Towards non-parametric drift detection via Dynamic Adapting Window Independence Drift Detection (DAWIDD) | Fabian Hinder; André Artelt; Barbara Hammer; | In this paper we present a novel concept drift detection method, Dynamic Adapting Window Independence Drift Detection (DAWIDD), which aims for non-parametric drift detection of diverse drift characteristics. |

362 | Non-Stationary Bandits with Intermediate Observations | Claire Vernade; András György; Timothy Mann; | To model this situation, we introduce the problem of stochastic, non-stationary, delayed bandits with intermediate observations. |

363 | Does label smoothing mitigate label noise? | Michal Lukasik; Srinadh Bhojanapalli; Aditya Menon; Sanjiv Kumar; | In this paper, we study whether label smoothing is also effective as a means of coping with label noise. |

364 | Proving the Lottery Ticket Hypothesis: Pruning is All You Need | Eran Malach; Gilad Yehudai; Shai Shalev-Schwartz; Ohad Shamir; | We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training. |

365 | Linear bandits with Stochastic Delayed Feedback | Claire Vernade; Alexandra Carpentier; Tor Lattimore; Giovanni Zappella; Beyza Ermis; Michael Brueckner; | We formalize this problem as a novel stochastic delayed linear bandit and propose OTFLinUCB and OTFLinTS, two computationally efficient algorithms able to integrate new information as it becomes available and to deal with the permanently censored feedback. |

366 | Time Series Deconfounder: Estimating Treatment Effects over Time in the Presence of Hidden Confounders | Ioana Bica; Ahmed Alaa; Mihaela van der Schaar; | In this paper, we develop the Time Series Deconfounder, a method that leverages the assignment of multiple treatments over time to enable the estimation of treatment effects in the presence of multi-cause hidden confounders. |

367 | Negative Sampling in Semi-Supervised learning | John Chen; Vatsal Shah; Anastasios Kyrillidis; | We introduce Negative Sampling in Semi-Supervised Learning (NS^3L), a simple, fast, easy to tune algorithm for semi-supervised learning (SSL). |

368 | Adaptive Sketching for Fast and Convergent Canonical Polyadic Decomposition | Alex Gittens; Kareem Aggour; Bülent Yener; | This work considers the canonical polyadic decomposition (CPD) of tensors using proximally regularized sketched alternating least squares algorithms. |

369 | Private Counting from Anonymous Messages: Near-Optimal Accuracy with Vanishing Communication Overhead | Badih Ghazi; Ravi Kumar; Pasin Manurangsi; Rasmus Pagh; | In this paper, we obtain practical communication-efficient algorithms in the shuffled DP model for two basic aggregation primitives: 1) binary summation, and 2) histograms over a moderate number of buckets. |

370 | On the Generalization Benefit of Noise in Stochastic Gradient Descent | Samuel Smith; Erich Elsen; Soham De; | In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. |

371 | Momentum-Based Policy Gradient Methods | Feihu Huang; Shangqian Gao; Jian Pei; Heng Huang; | Specifically, we propose a fast important-sampling momentum-based policy gradient (IS-MBPG) method by using the important sampling technique. |

372 | Knowing The What But Not The Where in Bayesian Optimization | Vu Nguyen; Michael Osborne; | In this paper, we consider a new setting in BO in which the knowledge of the optimum output is available. |

373 | Robust Bayesian Classification Using An Optimistic Score Ratio | Viet Anh Nguyen; Nian Si; Jose Blanchet; | We consider the optimistic score ratio for robust Bayesian classification when the class-conditional distribution of the features is not perfectly known. |

374 | Boosted Histogram Transform for Regression | Yuchao Cai; Hanyuan Hang; Hanfang Yang; Zhouchen Lin; | In this paper, we propose a boosting algorithm for regression problems called \textit{boosted histogram transform for regression} (BHTR) based on histogram transforms composed of random rotations, stretchings, and translations. |

375 | Stochastic bandits with arm-dependent delays | Anne Gael Manegueu; Claire Vernade; Alexandra Carpentier; Michal Valko; | Addressing these difficulties, we propose a simple but efficient UCB-based algorithm called PatientBandits, and provide both problem-dependent and problem-independent bounds on the regret, as well as performance lower bounds. |

376 | Projective Preferential Bayesian Optimization | Petrus Mikkola; Milica Todorovic; Jari Järvi; Patrick Rinke; Samuel Kaski; | We propose a new type of Bayesian optimization for learning user preferences in high-dimensional spaces. |

377 | On Relativistic f-Divergences | Alexia Jolicoeur-Martineau; | We introduce the minimum-variance unbiased estimator (MVUE) for Relativistic paired GANs (RpGANs; originally called RGANs, which could cause confusion) and show that it does not perform better. |

378 | A Flexible Framework for Nonparametric Graphical Modeling that Accommodates Machine Learning | Yunhua Xiang; Noah Simon; | In this paper, we instead consider 3 non-parametric measures of conditional dependence. |

379 | The Natural Lottery Ticket Winner: Reinforcement Learning with Ordinary Neural Circuits | Ramin Hasani; Mathias Lechner; Alexander Amini; Daniela Rus; Radu Grosu; | We propose a neural information processing system which is obtained by re-purposing the function of a biological neural circuit model to govern simulated and real-world control tasks. |

380 | Schatten Norms in Matrix Streams: Hello Sparsity, Goodbye Dimension | Aditya Krishnan; Roi Sinoff; Robert Krauthgamer; Vladimir Braverman; | We address this challenge by providing the first algorithms whose space requirement is independent of the matrix dimension, assuming the matrix is doubly-sparse and presented in row-order. |

381 | Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning | Alberto Maria Metelli; Flavio Mazzolini; Lorenzo Bisi; Luca Sabbioni; Marcello Restelli; | In this paper, we introduce the notion of action persistence that consists in the repetition of an action for a fixed number of decision steps, having the effect of modifying the control frequency. |

382 | Minimax Rate for Learning From Pairwise Comparisons in the BTL Model | Julien Hendrickx; Alex Olshevsky; Venkatesh Saligrama; | Our contribution is the determination of the minimax rate up to a constant factor. |

383 | Interferometric Graph Transform: a Deep Unsupervised Graph Representation | Edouard Oyallon; | We propose the Interferometric Graph Transform (IGT), which is a new class of deep unsupervised graph convolutional neural network for building graph representations. |

384 | Stochastic Differential Equations with Variational Wishart Diffusions | Martin Jørgensen; Marc Deisenroth; Hugh Salimbeni; | We present a Bayesian non-parametric way of inferring stochastic differential equations for both regression tasks and continuous-time dynamical modelling. |

385 | What Can Learned Intrinsic Rewards Capture? | Zeyu Zheng; Junhyuk Oh; Matteo Hessel; Zhongwen Xu; Manuel Kroiss; Hado van Hasselt; David Silver; Satinder Singh; | In this paper, we instead consider the proposition that the reward function itself can be a good locus of learned knowledge. |

386 | Random extrapolation for primal-dual coordinate descent | Ahmet Alacaoglu; Olivier Fercoq; Volkan Cevher; | We introduce a randomly extrapolated primal-dual coordinate descent method that automatically adapts to the sparsity of the data matrix as well as the favorable structures of the objective function in optimization. |

387 | Reinforcement Learning with Differential Privacy | Giuseppe Vietri; Borja de Balle Pigem; Steven Wu; Akshay Krishnamurthy; | Motivated by high-stakes decision-making domains like personalized medicine where user information is inherently sensitive, we design privacy preserving exploration policies for episodic reinforcement learning (RL). |

388 | Median Matrix Completion: from Embarrassment to Optimality | Weidong Liu; Xiaojun Mao; Raymond K. W. Wong; | In this paper, we consider matrix completion with absolute deviation loss and obtain an estimator of the median matrix. |

389 | Improved Optimistic Algorithms for Logistic Bandits | Louis Faury; Marc Abeille; Clément Calauzènes; Olivier Fercoq; | In this work, we study the logistic bandit with a focus on the prohibitive dependencies introduced by $\kappa$. |

390 | Learning to Rank Learning Curves | Martin Wistuba; Tejaswini Pedapati; | In this work, we present a new method that saves computational budget by terminating poor configurations early on in the training. |

391 | Model Fusion with Kullback–Leibler Divergence | Sebastian Claici; Mikhail Yurochkin; Soumya Ghosh; Justin Solomon; | We propose a method to fuse posterior distributions learned from heterogeneous datasets. |

392 | Randomization matters. How to defend against strong adversarial attacks | Rafael Pinot; Raphael Ettedgui; Geovani Rizk; Yann Chevaleyre; Jamal Atif; | We tackle this problem by showing that, under mild conditions on the dataset distribution, any deterministic classifier can be outperformed by a randomized one. |

393 | Evolutionary Topology Search for Tensor Network Decomposition | Chao Li; Zhun Sun; | In this paper, we claim that this issue can be practically tackled by evolutionary algorithms in an efficient manner. |

394 | Quadratically Regularized Subgradient Methods for Weakly Convex Optimization with Weakly Convex Constraints | Runchao Ma; Qihang Lin; Tianbao Yang; | This paper proposes a class of subgradient methods for constrained optimization where the objective function and the constraint functions are weakly convex and nonsmooth. |

395 | Scalable and Efficient Comparison-based Search without Features | Daniyar Chumbalov; Lucas Maystre; Matthias Grossglauser; | We propose a new Bayesian comparison-based search algorithm with noisy answers; it has low computational complexity yet is efficient in the number of queries. |

396 | Error-Bounded Correction of Noisy Labels | Songzhu Zheng; Pengxiang Wu; Aman Goswami; Mayank Goswami; Dimitris Metaxas; Chao Chen; | We introduce a novel approach that directly cleans labels in order to train a high quality model. |

397 | Learning with Feature and Distribution Evolvable Streams | Zhen-Yu Zhang; Peng Zhao; Yuan Jiang; Zhi-Hua Zhou; | To address this difficulty, we propose a novel discrepancy measure for evolving feature space and data distribution named the evolving discrepancy, based on which we provide the generalization error analysis. |

398 | On Unbalanced Optimal Transport: An Analysis of Sinkhorn Algorithm | Khiem Pham; Khang Le; Nhat Ho; Tung Pham; Hung Bui; | We provide a computational complexity analysis for the Sinkhorn algorithm that solves the entropic regularized Unbalanced Optimal Transport (UOT) problem between two measures of possibly different masses with at most $n$ components. |

399 | Learning Optimal Tree Models under Beam Search | Jingwei Zhuo; Ziru Xu; Wei Dai; Han Zhu; Han Li; Jian Xu; Kun Gai; | In this paper, we take a first step towards understanding the discrepancy by developing the definition of Bayes optimality and calibration under beam search as general analyzing tools, and prove that neither TDMs nor PLTs are Bayes optimal under beam search. |

400 | Estimating the Number and Effect Sizes of Non-null Hypotheses | Jennifer Brennan; Ramya Korlakai Vinayak; Kevin Jamieson; | We study the problem of estimating the distribution of effect sizes (the mean of the test statistic under the alternate hypothesis) in a multiple testing setting. |

401 | Estimating Model Uncertainty of Neural Network in Sparse Information Form | Jongseok Lee; Matthias Humt; Jianxiang Feng; Rudolph Triebel; | The key insight of our work is that the information matrix, i.e., the inverse of the covariance matrix, tends to be sparse in its spectrum. |

402 | Double-Loop Unadjusted Langevin Algorithm | Paul Rolland; Armin Eftekhari; Ali Kavis; Volkan Cevher; | This work proposes a new annealing step-size schedule for ULA, which allows us to prove new convergence guarantees for sampling from a smooth log-concave distribution that are not covered by existing state-of-the-art guarantees. |

403 | Growing Action Spaces | Gregory Farquhar; Laura Gustafson; Zeming Lin; Shimon Whiteson; Nicolas Usunier; Gabriel Synnaeve; | In this work, we use a curriculum of progressively growing action spaces to accelerate learning. |

404 | Analytic Marching: An Analytic Meshing Solution from Deep Implicit Surface Networks | Jiabao Lei; Kui Jia; | We propose a naturally parallelizable algorithm of analytic marching to exactly recover the mesh captured by a learned MLP. |

405 | Anderson Acceleration of Proximal Gradient Methods | Vien Mai; Mikael Johansson; | This work introduces novel methods for adapting Anderson acceleration to (non-smooth and constrained) proximal gradient algorithms. |

406 | Interpretable, Multidimensional, Multimodal Anomaly Detection with Negative Sampling for Detection of Device Failure | John Sipple; | In this paper we propose a scalable, unsupervised approach for detecting anomalies in the Internet of Things (IoT). |

407 | Certified Robustness to Label-Flipping Attacks via Randomized Smoothing | Elan Rosenfeld; Ezra Winston; Pradeep Ravikumar; Zico Kolter; | In this work, we propose a strategy for building linear classifiers that are certifiably robust against a strong variant of label flipping, where each test example is targeted independently. |

408 | Responsive Safety in Reinforcement Learning | Adam Stooke; Joshua Achiam; Pieter Abbeel; | Lagrangian methods are the most commonly used algorithms for the resulting constrained optimization problem, yet they are known to oscillate and overshoot cost limits, causing constraint-violating behavior during training. |

409 | Deep k-NN for Noisy Labels | Dara Bahri; Heinrich Jiang; Maya Gupta; | In this paper, we provide an empirical study showing that a simple k-nearest neighbor-based filtering approach on the logit layer of a preliminary model can remove mislabeled training data and produce more accurate models than some recently proposed methods. |

410 | Learning the piece-wise constant graph structure of a varying Ising model | Batiste Le Bars; Pierre Humbert; Argyris Kalogeratos; Nicolas Vayatis; | For this purpose, we propose to estimate the neighborhood of each node by maximizing a penalized version of its conditional log-likelihood. |

411 | Stabilizing Transformers for Reinforcement Learning | Emilio Parisotto; Francis Song; Jack Rae; Razvan Pascanu; Caglar Gulcehre; Siddhant Jayakumar; Max Jaderberg; Raphael Lopez Kaufman; Aidan Clark; Seb Noury; Matthew Botvinick; Nicolas Heess; Raia Hadsell; | In this work we demonstrate that the standard transformer architecture is difficult to optimize, which was previously observed in the supervised learning setting but becomes especially pronounced with RL objectives. |

412 | An Explicitly Relational Neural Network Architecture | Murray Shanahan; Kyriacos Nikiforou; Antonia Creswell; Christos Kaplanis; David Barrett; Marta Garnelo; | With a view to bridging the gap between deep learning and symbolic AI, we present a novel end-to-end neural network architecture that learns to form propositional representations with an explicitly relational structure from raw pixel data. |

413 | Harmonic Decompositions of Convolutional Networks | Meyer Scetbon; Zaid Harchaoui; | We present a description of function spaces and smoothness classes associated with convolutional networks from a reproducing kernel Hilbert space viewpoint. |

414 | Discriminative Jackknife: Quantifying Uncertainty in Deep Learning via Higher-Order Influence Functions | Ahmed Alaa; Mihaela van der Schaar; | To this end, this paper develops the discriminative jackknife (DJ), a frequentist procedure that uses higher-order influence functions (HOIFs) of a trained model parameters to construct a jackknife (leave-one-out) estimator of predictive confidence intervals. |

415 | Robust Graph Representation Learning via Neural Sparsification | Cheng Zheng; Bo Zong; Wei Cheng; Dongjin Song; Jingchao Ni; Wenchao Yu; Haifeng Chen; Wei Wang; | In this paper, we present NeuralSparse, a supervised graph sparsification technique that improves generalization power by learning to remove potentially task-irrelevant edges from input graphs. |

416 | Semiparametric Nonlinear Bipartite Graph Representation Learning with Provable Guarantees | Sen Na; Yuwei Luo; Zhuoran Yang; Zhaoran Wang; Mladen Kolar; | To overcome these challenges, we propose a pseudo-likelihood objective based on the rank-order decomposition technique and focus on its local geometry. |

417 | Forecasting sequential data using Consistent Koopman Autoencoders | Omri Azencot; N. Benjamin Erichson; Vanessa Lin; Michael Mahoney; | We propose a novel Consistent Koopman Autoencoder that exploits the forward and backward dynamics to achieve long time predictions. |

418 | Scalable Identification of Partially Observed Systems with Certainty-Equivalent EM | Kunal Menda; Jean de Becdelievre; Jayesh K. Gupta; Ilan Kroo; Mykel Kochenderfer; Zachary Manchester; | This work considers the offline identification of partially observed nonlinear systems. |

419 | Learning to Score Behaviors for Guided Policy Optimization | Aldo Pacchiano; Jack Parker-Holder; Yunhao Tang; Krzysztof Choromanski; Anna Choromanska; Michael Jordan; | We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. |

420 | Improved Communication Cost in Distributed PageRank Computation – A Theoretical Study | Siqiang Luo; | In this paper, we provide a new algorithm that uses asymptotically the same number of communication rounds while significantly improving the bandwidth from $O(\log^{2d+3}{n})$ bits to $O(d\log^3{n})$ bits. |

421 | Learning Autoencoders with Relational Regularization | Hongteng Xu; Dixin Luo; Ricardo Henao; Svati Shah; Lawrence Carin; | We propose a new algorithmic framework for learning autoencoders of data distributions. |

422 | Neural Contextual Bandits with UCB-based Exploration | Dongruo Zhou; Lihong Li; Quanquan Gu; | We propose the NeuralUCB algorithm, which leverages the representation power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) of reward for efficient exploration. |

423 | Super-efficiency of automatic differentiation for functions defined as a minimum | Pierre Ablin; Gabriel Peyré; Thomas Moreau; | In this paper, we study the asymptotic error made by these estimators as a function of the optimization error. |

424 | PowerNorm: Rethinking Batch Normalization in Transformers | Sheng Shen; Zhewei Yao; Amir Gholaminejad; Michael Mahoney; Kurt Keutzer; | In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. |

425 | Invertible generative models for inverse problems: mitigating representation error and dataset bias | Muhammad Asim; Max Daniels; Oscar Leong; Paul Hand; Ali Ahmed; | In this paper, we demonstrate that invertible neural networks, which have zero representation error by design, can be effective natural signal priors for inverse problems such as denoising, compressive sensing, and inpainting. |

426 | Acceleration for Compressed Gradient Descent in Distributed Optimization | Zhize Li; Dmitry Kovalev; Xun Qian; Peter Richtarik; | In this paper, we remedy this situation and propose the first {\em accelerated compressed gradient descent (ACGD)} methods. |

427 | Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-Layer Networks | Mert Pilanci; Tolga Ergen; | We develop exact representations of two-layer neural networks with rectified linear units in terms of a single convex program whose number of variables is polynomial in the number of training samples and the number of hidden neurons. |

428 | Learning Quadratic Games on Networks | Yan Leng; Xiaowen Dong; Junfeng Wu; Alex 'Sandy' Pentland; | In this paper, we propose two novel frameworks for learning, from the observations on individual actions, network games with linear-quadratic payoffs, and in particular the structure of the interaction network. |

429 | Margin-aware Adversarial Domain Adaptation with Optimal Transport | Sofien Dhouib; Ievgen Redko; Carole Lartizien; | In this paper, we propose a new theoretical analysis of unsupervised domain adaptation that relates notions of large margin separation, adversarial learning and optimal transport. |

430 | The Sample Complexity of Best-$k$ Items Selection from Pairwise Comparisons | Wenbo Ren; Jia Liu; Ness Shroff; | In this paper, we study two problems: (i) finding the probably approximately correct (PAC) best-$k$ items and (ii) finding the exact best-$k$ items, both under strong stochastic transitivity and stochastic triangle inequality. |

431 | GraphOpt: Learning Optimization Models of Graph Formation | Rakshit Trivedi; Jiachen Yang; Hongyuan Zha; | In this work, we propose GraphOpt, an end-to-end framework that jointly learns an implicit model of graph structure formation and discovers an underlying optimization mechanism in the form of a latent objective function. |

432 | Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits | Nian Si; Fan Zhang; Zhengyuan Zhou; Jose Blanchet; | In this paper, we lift this assumption and aim to learn a distributionally robust policy with bandit observational data. |

433 | Incremental Sampling Without Replacement for Sequence Models | Kensen Shi; David Bieber; Charles Sutton; | We present an elegant procedure for sampling without replacement from a broad class of randomized programs, including generative neural models that construct outputs sequentially. |

434 | Variable Skipping for Autoregressive Range Density Estimation | Eric Liang; Zongheng Yang; Ion Stoica; Pieter Abbeel; Yan Duan; Peter Chen; | In this paper, we explore a technique for accelerating range density estimation over deep autoregressive models. |

435 | TaskNorm: Rethinking Batch Normalization for Meta-Learning | John Bronskill; Jonathan Gordon; James Requeima; Sebastian Nowozin; Richard Turner; | We evaluate a range of approaches to batch normalization for meta-learning scenarios, and develop a novel approach that we call TaskNorm. |

436 | Scalable Gaussian Process Regression for Kernels with a Non-Stationary Phase | Jan Graßhoff; Alexandra Jankowski; Philipp Rostalski; | This paper investigates an efficient GP framework that extends structured kernel interpolation methods to GPs with a non-stationary phase. |

437 | Transformer Hawkes Process | Simiao Zuo; Haoming Jiang; Zichong Li; Tuo Zhao; Hongyuan Zha; | To address this issue, we propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies and meanwhile enjoys computational efficiency. |

438 | An EM Approach to Non-autoregressive Conditional Sequence Generation | Zhiqing Sun; Yiming Yang; | This paper proposes a new approach that jointly optimizes both AR and NAR models in a unified Expectation-Maximization (EM) framework. |

439 | Variance Reduction in Stochastic Particle-Optimization Sampling | Jianyi Zhang; Yang Zhao; Changyou Chen; | In this paper, we bridge the gap by presenting several variance-reduction techniques for SPOS. |

440 | CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information | Pengyu Cheng; Weituo Hao; Shuyang Dai; Jiachang Liu; Zhe Gan; Lawrence Carin; | In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information. |

441 | State Space Expectation Propagation: Efficient Inference Schemes for Temporal Gaussian Processes | William Wilkinson; Paul Chang; Michael Andersen; Arno Solin; | We formulate expectation propagation (EP), a state-of-the-art method for approximate Bayesian inference, as a nonlinear Kalman smoother, showing that it generalises a wide class of classical smoothing algorithms. |

442 | Training Neural Networks for and by Interpolation | Leonard Berrada; M. Pawan Kumar; Andrew Zisserman; | In this work, we explicitly exploit this interpolation property for the design of a new optimization algorithm for deep learning, which we term Adaptive Learning-rates for Interpolation with Gradients (ALI-G). |

443 | Learning Representations that Support Extrapolation | Taylor Webb; Zachary Dulberg; Steven Frankland; Alexander Petrov; Randall O’Reilly; Jonathan Cohen; | In this paper, we consider the challenge of learning representations that support extrapolation. |

444 | Topic Modeling via Full Dependence Mixtures | Dan Fisher; Mark Kozdoba; Shie Mannor; | In this paper we introduce a new approach to topic modelling that scales to large datasets by using a compact representation of the data and by leveraging the GPU architecture. |

445 | Instance-hiding Schemes for Private Distributed Learning | Yangsibo Huang; Zhao Song; Sanjeev Arora; Kai Li; | The new ideas in the current paper are new variants of mixup with negative as well as positive coefficients, and an extension of sample-wise mixup to the pixel level. |

446 | The Implicit Regularization of Stochastic Gradient Flow for Least Squares | Alnur Ali; Edgar Dobriban; Ryan Tibshirani; | We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression. |

447 | Decentralised Learning with Random Features and Distributed Gradient Descent | Dominic Richards; Patrick Rebeschini; Lorenzo Rosasco; | We present simulations that show how the number of random features, iterations and samples impact predictive performance. |

448 | Hierarchical Generation of Molecular Graphs using Structural Motifs | Wengong Jin; Regina Barzilay; Tommi Jaakkola; | In this paper, we propose a new hierarchical graph encoder-decoder that employs significantly larger and more flexible graph motifs as basic building blocks. |

449 | Composing Molecules with Multiple Property Constraints | Wengong Jin; Regina Barzilay; Tommi Jaakkola; | We propose to offset this complexity by composing molecules from a vocabulary of substructures that we call molecular rationales. |

450 | Data preprocessing to mitigate bias: A maximum entropy based approach | Elisa Celis; Vijay Keswani; Nisheeth Vishnoi; | This paper presents an optimization framework that can be used as a data preprocessing method towards mitigating bias: It can learn distributions over large domains and controllably adjust the representation rates of protected groups and/or achieve target fairness metrics such as statistical parity, yet remains close to the empirical distribution induced by the given dataset. |

451 | On Efficient Low Distortion Ultrametric Embedding | Vincent Cohen-Addad; Karthik C. S.; Guillaume Lagarde; | In this paper, we provide a new algorithm which takes as input a set of points $P$ in $R^d$, and for every $c\ge 1$, runs in time $n^{1+O(1/c^2)}$ to output an ultrametric $\Delta$ such that for any two points $u,v$ in $P$, we have $\Delta(u,v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. |

452 | Global Concavity and Optimization in a Class of Dynamic Discrete Choice Models | Yiding Feng; Ekaterina Khmelnitskaya; Denis Nekipelov; | We show that in an important class of discrete choice models the value function is globally concave in the policy. That means that simple algorithms that do not require fixed point computation, such as the policy gradient algorithm, globally converge to the optimal policy. |

453 | Efficient Policy Learning from Surrogate-Loss Classification Reductions | Andrew Bennett; Nathan Kallus; | In light of this, we instead propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters. |

454 | On Contrastive Learning for Likelihood-free Inference | Conor Durkan; Iain Murray; George Papamakarios; | In this work, we show that both of these approaches can be unified under a general contrastive learning scheme, and clarify how they should be run and compared. |

455 | Obtaining Adjustable Regularization for Free via Iterate Averaging | Jingfeng Wu; Vladimir Braverman; Lin Yang; | In this paper, we establish a complete theory by showing an averaging scheme that provably converts the iterates of SGD on an arbitrary strongly convex and smooth objective function to its regularized counterpart with an adjustable regularization parameter. |

456 | Invariant Risk Minimization Games | Kartik Ahuja; Karthikeyan Shanmugam; Kush Varshney; Amit Dhurandhar; | In this work, we pose such invariant risk minimization as finding the Nash equilibrium of an ensemble game among several environments. |

457 | Video Prediction via Example Guidance | Jingwei Xu; Harry (Huazhe) Xu; Bingbing Ni; Xiaokang Yang; Trevor Darrell; | In this work, we propose a simple yet effective framework that can predict diverse and plausible future states. |

458 | Learning Discrete Structured Representations by Adversarially Maximizing Mutual Information | Karl Stratos; Sam Wiseman; | We propose learning discrete structured representations from unlabeled data by maximizing the mutual information between a structured latent variable and a target variable. |

459 | Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound | Lin Yang; Mengdi Wang; | In this paper, we propose an online RL algorithm, namely the MatrixRL, that leverages ideas from linear bandit to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. |

460 | Frequency Bias in Neural Networks for Input of Non-Uniform Density | Ronen Basri; Meirav Galun; Amnon Geifman; David Jacobs; Yoni Kasten; Shira Kritchman; | As realistic training sets are not drawn from a uniform distribution, we here use the Neural Tangent Kernel (NTK) model to explore the effect of variable density on training dynamics. |

461 | Constrained Markov Decision Processes via Backward Value Functions | Harsh Satija; Philip Amortila; Joelle Pineau; | In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process and provide a new on-policy formulation for solving it. |

462 | Adding seemingly uninformative labels helps in low data regimes | Christos Matsoukas; Albert Bou Hernandez; Yue Liu; Karin Dembrower; Gisele Miranda; Emir Konuk; Johan Fredin Haslum; Athanasios Zouzos; Peter Lindholm; Fredrik Strand; Kevin Smith; | In this work, we consider a task that requires difficult-to-obtain expert annotations: tumor segmentation in mammography images. |

463 | When are Non-Parametric Methods Robust? | Robi Bhattacharjee; Kamalika Chaudhuri; | In this work, we study general non-parametric methods, with a view towards understanding when they are robust to these modifications. |

464 | Learning Calibratable Policies using Programmatic Style-Consistency | Eric Zhan; Albert Tseng; Yisong Yue; Adith Swaminathan; Matthew Hausknecht; | In this paper, we leverage large amounts of raw behavioral data to learn policies that can be calibrated to generate a diverse range of behavior styles (e.g., aggressive versus passive play in sports). |

465 | Momentum Improves Normalized SGD | Ashok Cutkosky; Harsh Mehta; | We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. |

466 | Parameter-free, Dynamic, and Strongly-Adaptive Online Learning | Ashok Cutkosky; | We provide a new online learning algorithm that for the first time combines several disparate notions of adaptivity. |

467 | PENNI: Pruned Kernel Sharing for Efficient CNN Inference | Shiyu Li; Edward Hanson; Hai Li; Yiran Chen; | Based on this observation, we propose PENNI, a CNN model compression framework that is able to achieve model compactness and hardware efficiency simultaneously by (1) implementing kernel sharing in convolution layers via a small number of basis kernels and (2) alternately adjusting bases and coefficients with sparse constraints. |

468 | Optimal transport mapping via input convex neural networks | Ashok Vardhan Makkuva; Amirhossein Taghvaei; Sewoong Oh; Jason Lee; | In this paper, we present a novel and principled approach to learn the optimal transport between two distributions, from samples. |

469 | All in the (Exponential) Family: Information Geometry and Thermodynamic Variational Inference | Rob Brekelmans; Vaden Masrani; Frank Wood; Greg Ver Steeg; Aram Galstyan; | We interpret the geometric mixture curve common to TVO and related path sampling methods using the geometry of exponential families, which allows us to characterize the gap in TVO bounds as a sum of KL divergences along a given path. |

470 | SimGANs: Simulator-Based Generative Adversarial Networks for ECG Synthesis to Improve Deep ECG Classification | Tomer Golany; Kira Radinsky; Daniel Freedman; | We study the problem of heart signal electrocardiogram (ECG) synthesis for improved heartbeat classification. |

471 | Is There a Trade-Off Between Fairness and Accuracy? A Perspective Using Mismatched Hypothesis Testing | Sanghamitra Dutta; Dennis Wei; Hazar Yueksel; Pin-Yu Chen; Sijia Liu; Kush Varshney; | Novel to this work, we examine fair classification through the lens of mismatched hypothesis testing: trying to find a classifier that distinguishes between two ideal distributions when given two mismatched distributions that are biased. |

472 | Convex Calibrated Surrogates for the Multi-Label F-Measure | Mingyuan Zhang; Harish Guruprasad Ramaswamy; Shivani Agarwal; | In this paper, we explore the question of designing convex surrogate losses that are calibrated for the F-measure — specifically, that have the property that minimizing the surrogate loss yields (in the limit of sufficient data) a Bayes optimal multi-label classifier for the F-measure. |

473 | Learning Robot Skills with Temporal Variational Inference | Tanmay Shankar; Abhinav Gupta; | In this paper, we address the discovery of robotic options from demonstrations in an unsupervised manner. |

474 | Adaptive Gradient Descent without Descent | Konstantin Mishchenko; Yura Malitsky; | We present a strikingly simple proof that two rules are sufficient to automate gradient descent: 1) don’t increase the stepsize too fast and 2) don’t overstep the local curvature. |

475 | An end-to-end Differentially Private Latent Dirichlet Allocation Using a Spectral Algorithm | Christopher DeCarolis; Mukul Ram; Seyed Esmaeili; Yu-Xiang Wang; Furong Huang; | We provide an end-to-end differentially private spectral algorithm for learning LDA, based on matrix/tensor decompositions, and establish theoretical guarantees on utility/consistency of the estimated model parameters. |

476 | Dual Mirror Descent for Online Allocation Problems | Haihao Lu; Santiago Balseiro; Vahab Mirrokni; | We consider online allocation problems with concave revenue functions and resource constraints, which are central problems in revenue management and online advertising. |

477 | Optimal Robust Learning of Discrete Distributions from Batches | Ayush Jain; Alon Orlitsky; | We provide the first polynomial-time estimator that is optimal in the number of batches and achieves essentially the best possible estimation accuracy. |

478 | BoXHED: Boosted eXact Hazard Estimator with Dynamic covariates | Xiaochen Wang; Arash Pakbin; Bobak Mortazavi; Hongyu Zhao; Donald Lee; | This paper introduces the software package BoXHED (pronounced 'box-head') for nonparametrically estimating hazard functions via gradient boosting. |

479 | Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift | Alexander Chan; Ahmed Alaa; Zhaozhi Qian; Mihaela van der Schaar; | In this paper, we develop an approximate Bayesian inference scheme based on posterior regularisation, where we use information from unlabelled target data to produce more appropriate uncertainty estimates for ”covariate-shifted” predictions. |

480 | Universal Equivariant Multilayer Perceptrons | Siamak Ravanbakhsh; | Using tools from group theory, this paper proves the universality of a broad class of equivariant MLPs with a single hidden layer. |

481 | Improving generalization by controlling label-noise information in neural network weights | Hrayr Harutyunyan; Kyle Reing; Greg Ver Steeg; Aram Galstyan; | To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. |

482 | DeepMatch: Balancing Deep Covariate Representations for Causal Inference Using Adversarial Training | Nathan Kallus; | We propose a new method based on adversarial training of a weighting and a discriminator network that effectively addresses this methodological gap. |

483 | Bayesian Optimisation over Multiple Continuous and Categorical Inputs | Binxin Ru; Ahsan Alvi; Vu Nguyen; Michael Osborne; Stephen Roberts; | We propose a new approach, Continuous and Categorical Bayesian Optimisation (CoCaBO), which combines the strengths of multi-armed bandits and Bayesian optimisation to select values for both categorical and continuous inputs. |

484 | Generalization and Representational Limits of Graph Neural Networks | Vikas Garg; Stefanie Jegelka; Tommi Jaakkola; | We address two fundamental questions about graph neural networks (GNNs). |

485 | Multi-Precision Policy Enforced Training (MuPPET) : A Precision-Switching Strategy for Quantised Fixed-Point Training of CNNs | Aditya Rajagopal; Diederik Vink; Stylianos Venieris; Christos-Savvas Bouganis; | This work pushes the boundary of quantised training by employing a multilevel optimisation approach that utilises multiple precisions including low-precision fixed-point representations. |

486 | LowFER: Low-rank Bilinear Pooling for Link Prediction | Saadullah Amin; Stalin Varanasi; Katherine Ann Dunfield; Günter Neumann; | In this work, we propose a factorized bilinear pooling model, commonly used in multi-modal learning, for better fusion of entities and relations, leading to an efficient and constraints free model. |

487 | Parameterized Rate-Distortion Stochastic Encoder | Quan Hoang; Trung Le; Dinh Phung; | We propose a novel gradient-based tractable approach for the Blahut-Arimoto (BA) algorithm to compute the rate-distortion function where the BA algorithm is fully parameterized. |

488 | Incidence Networks for Geometric Deep Learning | Marjan Albooyeh; Daniele Bertolini; Siamak Ravanbakhsh; | In this paper, we formalize incidence tensors, analyze their structure, and present the family of equivariant networks that operate on them. |

489 | Energy-Based Processes for Exchangeable Data | Mengjiao Yang; Bo Dai; Hanjun Dai; Dale Schuurmans; | To overcome these limitations, we introduce Energy-Based Processes (EBPs), which extend energy based models to exchangeable data while allowing neural network parameterizations of the energy function. |

490 | Deep Isometric Learning for Visual Recognition | Haozhi Qi; Chong You; Xiaolong Wang; Yi Ma; Jitendra Malik; | This paper shows that deep vanilla ConvNets without normalization or residual structure can also be trained to achieve surprisingly good performance on standard image recognition benchmarks (ImageNet and COCO). |

491 | Second-Order Provable Defenses against Adversarial Attacks | Sahil Singla; Soheil Feizi; | In this paper, we provide computationally-efficient robustness certificates for neural networks with differentiable activation functions in two steps. |

492 | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | Angelos Katharopoulos; Apoorv Vyas; Nikolaos Pappas; Francois Fleuret; | To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $O(N^2)$ to $O(N)$, where $N$ is the sequence length. |

493 | Overfitting in adversarially robust deep learning | Eric Wong; Leslie Rice; Zico Kolter; | In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. |

494 | Rethinking Bias-Variance Trade-off for Generalization of Neural Networks | Zitong Yang; Yaodong Yu; Chong You; Jacob Steinhardt; Yi Ma; | We provide a simple explanation of this by measuring the bias and variance of neural networks: while the bias is {\em monotonically decreasing} as in the classical theory, the variance is {\em unimodal} or bell-shaped: it increases then decreases with the width of the network. |

495 | Boosting for Control of Dynamical Systems | Naman Agarwal; Nataly Brukhim; Elad Hazan; Zhou Lu; | To this end, we propose a framework of boosting for online control. |

496 | Frustratingly Simple Few-Shot Object Detection | Xin Wang; Thomas Huang; Joseph Gonzalez; Trevor Darrell; Fisher Yu; | We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. |

497 | Data-Dependent Differentially Private Parameter Learning for Directed Graphical Models | Amrita Roy Chowdhury; Theodoros Rekatsinas; Somesh Jha; | In this paper, we present an algorithm for differentially-private learning of the parameters of a DGM. |

498 | Adversarial Risk via Optimal Transport and Optimal Couplings | Muni Sreenivas Pydi; Varun Jog; | In this paper, we investigate the optimal adversarial risk and optimal adversarial classifiers from an optimal transport perspective. |

499 | Decoupled Greedy Learning of CNNs | Eugene Belilovsky; Michael Eickenberg; Edouard Oyallon; | In this context, we consider a simpler, but more effective, substitute that uses minimal feedback, which we call Decoupled Greedy Learning (DGL). |

500 | ACFlow: Flow Models for Arbitrary Conditional Likelihoods | Yang Li; Shoaib Akbar; Junier Oliva; | Instead, in this work we develop a model that is capable of yielding all conditional distributions $p(x_u \mid x_o)$ (for arbitrary $x_u$) via tractable conditional likelihoods. |

501 | Can autonomous vehicles identify, recover from, and adapt to distribution shifts? | Angelos Filos; Panagiotis Tigkas; Rowan McAllister; Nicholas Rhinehart; Sergey Levine; Yarin Gal; | In this paper, we introduce an autonomous car novel-scene benchmark, CARNOVEL, to evaluate the robustness of driving agents to a suite of tasks involving distribution shift. |

502 | Leveraging Procedural Generation to Benchmark Reinforcement Learning | Karl Cobbe; Chris Hesse; Jacob Hilton; John Schulman; | We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. |

503 | The Tree Ensemble Layer: Differentiability meets Conditional Computation | Hussein Hazimeh; Natalia Ponomareva; Rahul Mazumder; Zhenyu Tan; Petros Mol; | We aim to combine these advantages by introducing a new layer for neural networks, composed of an ensemble of differentiable decision trees (a.k.a. soft trees). |

504 | Near-Tight Margin-Based Generalization Bounds for Support Vector Machines | Allan Grønlund; Lior Kamma; Kasper Green Larsen; | In this paper, we revisit and improve the classic generalization bounds in terms of margins. |

505 | Error Estimation for Sketched SVD | Miles Lopes; N. Benjamin Erichson; Michael Mahoney; | To overcome these challenges, this paper develops a fully data-driven bootstrap method that numerically estimates the actual error of sketched singular vectors/values. |

506 | Goal-Aware Prediction: Learning to Model What Matters | Suraj Nair; Silvio Savarese; Chelsea Finn; | In this paper, we propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space, resulting in a learning objective that more closely matches the downstream task. |

507 | Combinatorial Pure Exploration for Dueling Bandit | Wei Chen; Yihan Du; Longbo Huang; Haoyu Zhao; | In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. |

508 | Optimal Sequential Maximization: One Interview is Enough! | Moein Falahatgar; Alon Orlitsky; Venkatadheeraj Pichapati; | We derive the first query-optimal sequential algorithm for probabilistic-maximization. |

509 | What can I do here? A Theory of Affordances in Reinforcement Learning | Khimya Khetarpal; Zafarali Ahmed; Gheorghe Comanici; David Abel; Doina Precup; | In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes. |

510 | An end-to-end approach for the verification problem: learning the right distance | Joao Monteiro; Isabela Albuquerque; Jahangir Alam; R Devon Hjelm; Tiago Falk; | In this contribution, we augment the metric learning setting by introducing a parametric pseudo-distance, trained jointly with the encoder. |

511 | Data Valuation using Reinforcement Learning | Jinsung Yoon; Sercan Arik; Tomas Pfister; | We propose Data Valuation using Reinforcement Learning (DVRL), to adaptively learn data values jointly with the predictor model. |

512 | FormulaZero: Distributionally Robust Online Adaptation via Offline Population Synthesis | Aman Sinha; Matthew O’Kelly; Hongrui Zheng; Rahul Mangharam; John Duchi; Russ Tedrake; | This work makes algorithmic contributions to both challenges. First, to generate a realistic, diverse set of opponents, we develop a novel method for self-play based on replica-exchange Markov chain Monte Carlo. Second, we propose a distributionally robust bandit optimization procedure that adaptively adjusts risk aversion relative to uncertainty in beliefs about opponents’ behaviors. |

513 | Latent Bernoulli Autoencoder | Jiri Fajtl; Vasileios Argyriou; Dorothy Monekosso; Paolo Remagnino; | In this work, we pose a question whether it is possible to design and train an autoencoder model in an end-to-end fashion to learn latent representations in multivariate Bernoulli space, and achieve performance comparable with the current state-of-the-art variational methods. |

514 | Learning To Stop While Learning To Predict | Xinshi Chen; Hanjun Dai; Yu Li; Xin Gao; Le Song; | In this paper, we tackle this varying depth problem using a steerable architecture, where a feed-forward deep model and a variational stopping policy are learned together to sequentially determine the optimal number of layers for each input instance. |

515 | Accelerating the diffusion-based ensemble sampling by non-reversible dynamics | Futoshi Futami; Issei Sato; Masashi Sugiyama; | To cope with this problem, we propose a novel ensemble method that uses a non-reversible Markov chain for the interaction, and we present a non-asymptotic theoretical analysis for our method. |

516 | Efficient nonparametric statistical inference on population feature importance using Shapley values | Brian Williamson; Jean Feng; | We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the \textbf{S}hapley \textbf{P}opulation \textbf{V}ariable \textbf{I}mportance \textbf{M}easure (SPVIM). |

517 | Curse of Dimensionality on Randomized Smoothing for Certifiable Robustness | Aounon Kumar; Alexander Levine; Tom Goldstein; Soheil Feizi; | In this work, we show that extending the smoothing technique to defend against other attack models can be challenging, especially in the high-dimensional regime. |

518 | Upper bounds for Model-Free Row-Sparse Principal Component Analysis | Guanyi Wang; Santanu Dey; | We propose a new framework that finds upper (dual) bounds for sparse PCA in polynomial time by solving a convex integer program (IP). |

519 | Explainable k-Means and k-Medians Clustering | Michal Moshkovitz; Sanjoy Dasgupta; Cyrus Rashtchian; Nave Frost; | We study this problem from a theoretical viewpoint, measuring the output quality by the k-means and k-medians objectives. |

520 | Reward-Free Exploration for Reinforcement Learning | Chi Jin; Akshay Krishnamurthy; Max Simchowitz; Tiancheng Yu; | To isolate the challenges of exploration, we propose the following “reward-free RL” framework. |

521 | Parametric Gaussian Process Regressors | Martin Jankowiak; Geoff Pleiss; Jacob Gardner; | In this work we propose two simple methods for scalable GP regression that address this issue and thus yield substantially improved predictive uncertainties. |

522 | p-Norm Flow Diffusion for Local Graph Clustering | Kimon Fountoulakis; Di Wang; Shenghao Yang; | In this work, we draw inspiration from both fields and propose a family of convex optimization formulations based on the idea of diffusion with $p$-norm network flow for $p\in (1,\infty)$. |

523 | Low-Rank Bottleneck in Multi-head Attention Models | Srinadh Bhojanapalli; Chulhee Yun; Ankit Singh Rawat; Sashank Jakkam Reddi; Sanjiv Kumar; | In this paper we identify one of the important factors contributing to the large embedding size requirement. |

524 | LEEP: A New Measure to Evaluate Transferability of Learned Representations | Cuong Nguyen; Tal Hassner; Cedric Archambeau; Matthias Seeger; | We introduce a new measure to evaluate the transferability of representations learned by classifiers. |

525 | The FAST Algorithm for Submodular Maximization | Adam Breuer; Eric Balkanski; Yaron Singer; | In this paper we describe a new parallel algorithm called Fast Adaptive Sequencing Technique (FAST) for maximizing a monotone submodular function under a cardinality constraint k. |

526 | On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation | Jianing Li; Yanyan Lan; Jiafeng Guo; Xueqi Cheng; | In this paper, we try to reveal such relation in a theoretical approach. |

527 | Designing Optimal Dynamic Treatment Regimes: A Causal Reinforcement Learning Approach | Junzhe Zhang; | In particular, we develop two online algorithms that satisfy such regret bounds by exploiting the causal structure underlying the DTR; one is based on the principle of optimism in the face of uncertainty (OFU-DTR), and the other uses the posterior sampling learning (PS-DTR). |

528 | Global Decision-Making via Local Economic Transactions | Michael Chang; Sid Kaushik; S. Matthew Weinberg; Sergey Levine; Thomas Griffiths; | This paper seeks to establish a mechanism for directing a collection of simple, specialized, self-interested agents to solve what traditionally are posed as monolithic single-agent sequential decision problems with a central global objective. |

529 | Retrieval Augmented Language Model Pre-Training | Kelvin Guu; Kenton Lee; Zora Tung; Panupong Pasupat; Mingwei Chang; | To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. |

530 | Variational Label Enhancement | Ning Xu; Yun-Peng Liu; Jun Shu; Xin Geng; | To solve this problem, we consider the label distributions as the latent vectors and infer the label distributions from the logical labels in the training datasets by using variational inference. |

531 | Bandits with Adversarial Scaling | Thodoris Lykouris; Vahab Mirrokni; Renato Leme; | We study "adversarial scaling", a multi-armed bandit model where rewards have a stochastic and an adversarial component. |

532 | Eliminating the Invariance on the Loss Landscape of Linear Autoencoders | Reza Oftadeh; Jiayi Shen; Zhangyang Wang; Dylan Shell; | Here, we prove that our loss function eliminates this issue, i.e., the decoder converges to the exact ordered unnormalized eigenvectors of the sample covariance matrix. |

533 | What is Local Optimality in Nonconvex-Nonconcave Minimax Optimization? | Chi Jin; Praneeth Netrapalli; Michael Jordan; | The main contribution of this paper is to propose a proper mathematical definition of local optimality for this sequential setting—local minimax, as well as to present its properties and existence results. |

534 | Lookahead-Bounded Q-learning | Ibrahim El Shar; Daniel Jiang; | We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning that seeks to improve the performance of standard Q-learning in stochastic environments through the use of “lookahead” upper and lower bounds. |

535 | Learning From Irregularly-Sampled Time Series: A Missing Data Perspective | Steven Cheng-Xian Li; Benjamin Marlin; | In this paper, we consider irregular sampling from the perspective of missing data. |

536 | Evaluating the Performance of Reinforcement Learning Algorithms | Scott Jordan; Yash Chandak; Daniel Cohen; Mengxue Zhang; Philip Thomas; | In this work, we argue that the inconsistency of performance stems from the use of flawed evaluation metrics. |

537 | Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels | Yu-Ting Chou; Gang Niu; Hsuan-Tien Lin; Masashi Sugiyama; | In this paper, we investigate reasons for such overfitting by studying learning with complementary labels. |

538 | Provable Self-Play Algorithms for Competitive Reinforcement Learning | Yu Bai; Chi Jin; | We introduce a self-play algorithm—Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret $\tilde{O}(\sqrt{T})$ after playing $T$ steps of the game. |

539 | Optimizing Long-term Social Welfare in Recommender Systems: A Constrained Matching Approach | Martin Mladenov; Elliot Creager; Omer Ben-Porat; Kevin Swersky; Richard Zemel; Craig Boutilier; | In this work, we explore settings in which content providers cannot remain viable unless they receive a certain level of user engagement. |

540 | Semi-Supervised StyleGAN for Disentanglement Learning | Weili Nie; Tero Karras; Animesh Garg; Shoubhik Debnath; Anjul Patney; Ankit Patel; Anima Anandkumar; | To alleviate these limitations, we design new architectures and loss functions based on StyleGAN (Karras et al., 2019), for semi-supervised high-resolution disentanglement learning. |

541 | The Non-IID Data Quagmire of Decentralized Machine Learning | Kevin Hsieh; Amar Phanishayee; Onur Mutlu; Phillip Gibbons; | Based on these findings, we present SkewScout, a system-level approach that adapts the communication frequency of decentralized learning algorithms to the (skew-induced) accuracy loss between data partitions. |

542 | On the Noisy Gradient Descent that Generalizes as SGD | Jingfeng Wu; Wenqing Hu; Haoyi Xiong; Jun Huan; Vladimir Braverman; Zhanxing Zhu; | In this work we provide negative results by showing that noise in classes different from the SGD noise can also effectively regularize gradient descent. |

543 | Safe screening rules for L0-regression | Alper Atamturk; Andres Gomez; | We give safe screening rules to eliminate variables from regression with L0 regularization or cardinality constraint. |

544 | Single Point Transductive Prediction | Nilesh Tripuraneni; Lester Mackey; | We address this question in the context of linear prediction, showing how techniques from semi-parametric inference can be used transductively to combat regularization bias. |

545 | History-Gradient Aided Batch Size Adaptation for Variance Reduced Algorithms | Kaiyi Ji; Zhe Wang; Bowen Weng; Yi Zhou; Wei Zhang; Yingbin Liang; | In this paper, we propose a novel scheme, which eliminates backtracking line search but still exploits the information along optimization path by adapting the batch size via history stochastic gradients. |

546 | Batch Stationary Distribution Estimation | Junfeng Wen; Bo Dai; Lihong Li; Dale Schuurmans; | We propose a consistent estimator that is based on recovering a correction ratio function over the given data. |

547 | Optimal Statistical Guarantees for Adversarially Robust Gaussian Classification | Chen Dan; Yuting Wei; Pradeep Ravikumar; | In this paper, we provide the first result of the optimal minimax guarantees for the excess risk for adversarially robust classification, under the Gaussian mixture model proposed by Schmidt et al. (2018). |

548 | Generative Adversarial Imitation Learning with Neural Network Parameterization: Global Optimality and Convergence Rate | Yufeng Zhang; Qi Cai; Zhuoran Yang; Zhaoran Wang; | To bridge the gap between practice and theory, we analyze a gradient-based algorithm with alternating updates and establish its sublinear convergence to the globally optimal solution. |

549 | A Game Theoretic Perspective on Model-Based Reinforcement Learning | Aravind Rajeswaran; Igor Mordatch; Vikash Kumar; | We show that stable algorithms for MBRL can be derived by considering a Stackelberg game between the two players. |

550 | (Locally) Differentially Private Combinatorial Semi-Bandits | Xiaoyu Chen; Kai Zheng; Zixin Zhou; Yunchang Yang; Wei Chen; Liwei Wang; | In this paper, we study (locally) differentially private Combinatorial Semi-Bandits (CSB). |

551 | Optimizing for the Future in Non-Stationary MDPs | Yash Chandak; Georgios Theocharous; Shiv Shankar; Martha White; Sridhar Mahadevan; Philip Thomas; | To address this problem, we develop a method that builds upon ideas from both counter-factual reasoning and curve-fitting to proactively search for a good future policy, without ever modeling the underlying non-stationarity. |

552 | Learning Task-Agnostic Embedding of Multiple Black-Box Experts for Multi-Task Model Fusion | Nghia Hoang; Thanh Lam; Bryan Kian Hsiang Low; Patrick Jaillet; | To address this multi-task challenge, we develop a new fusion paradigm that represents each expert as a distribution over a spectrum of predictive prototypes, which are isolated from task-specific information encoded within the prototype distribution. |

553 | Dual-Path Distillation: A Unified Framework to Improve Black-Box Attacks | Yonggang Zhang; Ya Li; Tongliang Liu; Xinmei Tian; | Therefore, we propose a novel framework, dual-path distillation, that utilizes the feedback knowledge not only to craft adversarial examples but also to alter the searching directions to achieve efficient attacks. |

554 | Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data | Lan-Zhe Guo; Zhen-Yu Zhang; Yuan Jiang; Yufeng Li; Zhi-Hua Zhou; | This paper proposes a simple and effective safe deep SSL method to alleviate the performance harm caused by it. |

555 | Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data | Marc Finzi; Samuel Stanton; Pavel Izmailov; Andrew Wilson; | We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. |

556 | Dispersed EM-VAEs for Interpretable Text Generation | Wenxian Shi; Hao Zhou; Ning Miao; Lei Li; | In this paper, we find that mode-collapse is a general problem for VAEs with exponential family mixture priors. |

557 | Deep Graph Random Process for Relational-Thinking-Based Speech Recognition | Huang Hengguan; Fuzhao Xue; Hao Wang; Ye Wang; | We present a framework that models a percept as weak relations between a current utterance and its history. |

558 | Hypernetwork approach to generating point clouds | Przemyslaw Spurek; Sebastian Winczowski; Jacek Tabor; Maciej Zamorski; Maciej Zieba; Tomasz Trzcinski; | In this work, we propose a novel method for generating 3D point clouds that leverages properties of hypernetworks. |

559 | On a projective ensemble approach to two sample test for equality of distributions | Zhimei Li; Yaowu Zhang; | In this work, we propose a robust test for the multivariate two-sample problem through projective ensemble, which is a generalization of the Cramer-von Mises statistic. |

560 | Coresets for Data-efficient Training of Machine Learning Models | Baharan Mirzasoleiman; Jeff Bilmes; Jure Leskovec; | Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. |

561 | Searching to Exploit Memorization Effect in Learning with Noisy Labels | Quanming Yao; Hansi Yang; Bo Han; Gang Niu; James Kwok; | In this paper, motivated by the success of automated machine learning (AutoML), we model this issue as a function approximation problem. |

562 | Randomized Smoothing of All Shapes and Sizes | Greg Yang; Tony Duan; J. Edward Hu; Hadi Salman; Ilya Razenshteyn; Jerry Li; | We propose a novel framework for devising and analyzing randomized smoothing schemes, and validate its effectiveness in practice. |

563 | DeepCoDA: personalized interpretability for compositional health | Thomas Quinn; Dang Nguyen; Santu Rana; Sunil Gupta; Svetha Venkatesh; | We propose the DeepCoDA framework to extend precision health modelling to high-dimensional compositional data, and to provide personalized interpretability through patient-specific weights. |

564 | Private Query Release Assisted by Public Data | Raef Bassily; Albert Cheu; Shay Moran; Aleksandar Nikolov; Jonathan Ullman; Steven Wu; | We study the problem of differentially private query release assisted by public data. |

565 | Adaptive Droplet Routing in Digital Microfluidic Biochips Using Deep Reinforcement Learning | Tung-Che Liang; Zhanwei Zhong; Yaas Bigdeli; Tsung-Yi Ho; Richard Fair; Krishnendu Chakrabarty; | We present and investigate a novel application domain for deep reinforcement learning (RL): droplet routing on digital microfluidic biochips (DMFBs). |

566 | Continuous-time Lower Bounds for Gradient-based Algorithms | Michael Muehlebach; Michael Jordan; | We reduce the multi-dimensional problem to a single dimension, recover well-known lower bounds from the discrete-time setting, and provide insights into why these lower bounds occur. |

567 | A Tree-Structured Decoder for Image-to-Markup Generation | Jianshu Zhang; Jun Du; Yongxin Yang; Yi-Zhe Song; Si Wei; Lirong Dai; | In this work, we first show via a set of toy problems that string decoders struggle to decode tree structures, especially as structural complexity increases. We then propose a tree-structured decoder that specifically aims at generating a tree-structured markup. |

568 | Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning | Aleksei Petrenko; Zhehui Huang; Tushar Kumar; Gaurav Sukhatme; Vladlen Koltun; | In this work we aim to solve this problem by optimizing the efficiency and resource utilization of reinforcement learning algorithms instead of relying on distributed computation. |

569 | Scalable Deep Generative Modeling for Sparse Graphs | Hanjun Dai; Azade Nazi; Yujia Li; Bo Dai; Dale Schuurmans; | Based on this, we develop a novel autoregressive model, named BiGG, that utilizes this sparsity to avoid generating the full adjacency matrix, and importantly reduces the graph generation time complexity to $O((n + m)\log n)$. |

570 | Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning | Qing Li; Siyuan Huang; Yining Hong; Yixin Chen; Ying Nian Wu; Song-Chun Zhu; | In this paper, we address these issues and close the loop of neural-symbolic learning by (1) introducing the grammar model as a symbolic prior to bridge neural perception and symbolic reasoning, and (2) proposing a novel back-search algorithm which mimics the top-down human-like learning procedure to propagate the error through the symbolic reasoning module efficiently. |

571 | NGBoost: Natural Gradient Boosting for Probabilistic Prediction | Tony Duan; Anand Avati; Daisy Ding; Khanh K. Thai; Sanjay Basu; Andrew Ng; Alejandro Schuler; | We present Natural Gradient Boosting (NGBoost), an algorithm for generic probabilistic prediction via gradient boosting. |

572 | Q-value Path Decomposition for Deep Multiagent Reinforcement Learning | Yaodong Yang; Jianye Hao; Guangyong Chen; Hongyao Tang; Yingfeng Chen; Yujing Hu; Changjie Fan; Zhongyu Wei; | In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system’s global Q-values into individual agents’ Q-values. |

573 | Online Learned Continual Compression with Adaptive Quantization Modules | Lucas Caccia; Eugene Belilovsky; Massimo Caccia; Joelle Pineau; | We introduce and study the problem of Online Continual Compression, where one attempts to simultaneously learn to compress and store a representative dataset from a non i.i.d data stream, while only observing each sample once. |

574 | Learning What to Defer for Maximum Independent Sets | Sungsoo Ahn; Younggyo Seo; Jinwoo Shin; | In this paper, we seek to resolve this issue by proposing a novel DRL scheme where the agent adaptively shrinks or stretches the number of stages by learning to defer the determination of the solution at each stage. |

575 | Generalized and Scalable Optimal Sparse Decision Trees | Jimmy Lin; Chudi Zhong; Diane Hu; Cynthia Rudin; Margo Seltzer; | The contribution in this work is to provide a general framework for decision tree optimization that addresses the two significant open problems in the area: treatment of imbalanced data and fully optimizing over continuous variables. |

576 | The Effect of Natural Distribution Shift on Question Answering Models | John Miller; Karl Krauth; Ludwig Schmidt; Benjamin Recht; | Taken together, our results confirm the surprising resilience of the holdout method and emphasize the need to move towards evaluation metrics that incorporate robustness to natural distribution shifts. |

577 | Quantized Decentralized Stochastic Learning over Directed Graphs | Hossein Taheri; Aryan Mokhtari; Hamed Hassani; Ramtin Pedarsani; | To tackle this bottleneck, we propose the quantized decentralized stochastic learning algorithm over directed graphs that is based on the push-sum algorithm in decentralized consensus optimization. |

578 | Semi-Supervised Learning with Normalizing Flows | Pavel Izmailov; Polina Kirichenko; Marc Finzi; Andrew Wilson; | We propose FlowGMM, an end-to-end approach to generative semi-supervised learning with normalizing flows, using a latent Gaussian mixture model. |

579 | Student Specialization in Deep Rectified Networks With Finite Width and Input Dimension | Yuandong Tian; | We consider a deep ReLU / Leaky ReLU student network trained from the output of a fixed teacher network of the same depth, with Stochastic Gradient Descent (SGD). |

580 | Sample Amplification: Increasing Dataset Size even when Learning is Impossible | Brian Axelrod; Shivam Garg; Vatsal Sharan; Gregory Valiant; | Perhaps surprisingly, we show a valid amplification procedure exists for both of these settings, even in the regime where the size of the input dataset, n, is significantly less than what would be necessary to learn distribution D to non-trivial accuracy. |

581 | Alleviating Privacy Attacks via Causal Learning | Shruti Tople; Amit Sharma; Aditya Nori; | Therefore, we propose the use of causal learning approaches where a model learns the causal relationship between the input features and the outcome. |

582 | The Intrinsic Robustness of Stochastic Bandits to Strategic Manipulation | Zhe Feng; David Parkes; Haifeng Xu; | Motivated by economic applications such as recommender systems, we study the behavior of stochastic bandits algorithms under strategic behavior conducted by rational actors, i.e., the arms. |

583 | Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis | Yusuke Tsuzuku; Issei Sato; Masashi Sugiyama; | In this paper, we first provide generalization error bounds using existing normalized flatness measures. Using the analysis, we then propose a novel normalized flatness metric. |

584 | Fiedler Regularization: Learning Neural Networks with Graph Sparsity | Edric Tam; David Dunson; | We introduce a novel regularization approach for deep learning that incorporates and respects the underlying graphical structure of the neural network. |

585 | Online Learning with Imperfect Hints | Aditya Bhaskara; Ashok Cutkosky; Ravi Kumar; Manish Purohit; | In this paper we develop algorithms and nearly matching lower bounds for online learning with imperfect hints. |

586 | Rate-distortion optimization guided autoencoder for isometric embedding in Euclidean latent space | Keizo Kato; Jing Zhou; Tomotake Sasaki; Akira Nakagawa; | In the end, the probability distribution function (PDF) in the real space cannot be estimated from that of the latent space accurately. To overcome this problem, we propose Rate-Distortion Optimization guided autoencoder. |

587 | Optimization from Structured Samples for Coverage Functions | Wei Chen; Xiaoming Sun; Jialin Zhang; Zhijie Zhang; | In this work, to circumvent the impossibility result of OPS, we propose a stronger model called optimization from structured samples (OPSS) for coverage functions, where the data samples encode the structural information of the functions. |

588 | Optimal Randomized First-Order Methods for Least-Squares Problems | Jonathan Lacotte; Mert Pilanci; | We provide an exact asymptotic analysis of the performance of some fast randomized algorithms for solving overdetermined least-squares problems. |

589 | Stochastic Optimization for Non-convex Inf-Projection Problems | Yan Yan; Yi Xu; Lijun Zhang; Wang Xiaoyu; Tianbao Yang; | In this paper, we study a family of non-convex and possibly non-smooth inf-projection minimization problems, where the target objective function is equal to minimization of a joint function over another variable. |

590 | Convex Representation Learning for Generalized Invariance in Semi-Inner-Product Space | Yingyi Ma; Vignesh Ganapathiraman; Yaoliang Yu; Xinhua Zhang; | In this work, we develop a convex representation learning algorithm for a variety of generalized invariances that can be modeled as semi-norms. |

591 | Neural Kernels Without Tangents | Vaishaal Shankar; Alex Fang; Wenshuo Guo; Sara Fridovich-Keil; Jonathan Ragan-Kelley; Ludwig Schmidt; Benjamin Recht; | In particular, using well established feature space tools such as direct sum, averaging, and moment lifting, we present an algebra for creating “compositional” kernels from bags of features. |

592 | Linear Lower Bounds and Conditioning of Differentiable Games | Adam Ibrahim; Waïss Azizian; Gauthier Gidel; Ioannis Mitliagkas; | In this work, we approach the question of fundamental iteration complexity by providing lower bounds to complement the linear (i.e. geometric) upper bounds observed in the literature on a wide class of problems. |

593 | Finite-Time Last-Iterate Convergence for Multi-Agent Learning in Games | Tianyi Lin; Zhengyuan Zhou; Panayotis Mertikopoulos; Michael Jordan; | In this paper, we consider multi-agent learning via online gradient descent in a class of games called $\lambda$-cocoercive games, a fairly broad class of games that admits many Nash equilibria and that properly includes unconstrained strongly monotone games. |

594 | Communication-Efficient Distributed PCA by Riemannian Optimization | Long-Kai Huang; Jialin Pan; | In this paper, we study the leading eigenvector problem in a statistically distributed setting and propose a communication-efficient algorithm based on Riemannian optimization, which trades local computation for global communication. |

595 | Manifold Identification for Ultimately Communication-Efficient Distributed Optimization | Yu-Sheng Li; Wei-Lin Chiang; Ching-pei Lee; | This work proposes a progressive manifold identification approach with sound theoretical justifications to greatly reduce both the communication rounds and the bytes communicated per round for partly smooth regularized problems, which include many large-scale machine learning tasks such as the training of $\ell_1$- and group-LASSO-regularized models. |

596 | When Demands Evolve Larger and Noisier: Learning and Earning in a Growing Environment | Feng Zhu; Zeyu Zheng; | We consider a single-product dynamic pricing problem under a specific non-stationary setting, where the demand grows over time in expectation and possibly gets noisier. |

597 | Being Bayesian about Categorical Probability | Taejong Joo; Uijung Chung; Min-Gwan Seo; | As a Bayesian alternative to the softmax, we consider a random variable of a categorical probability over class labels. |

598 | Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning | Kimin Lee; Younggyo Seo; Seunghyun Lee; Honglak Lee; Jinwoo Shin; | To tackle this problem, we decompose the task of learning a global dynamics model into two stages: (a) learning a context latent vector that captures the local dynamics, then (b) predicting the next state conditioned on it. |

599 | Learning Reasoning Strategies in End-to-End Differentiable Proving | Pasquale Minervini; Tim Rocktäschel; Sebastian Riedel; Edward Grefenstette; Pontus Stenetorp; | We present Conditional Theorem Provers (CTPs), an extension to NTPs that learns an optimal rule selection strategy via gradient-based optimisation. |

600 | Fast and Private Submodular and $k$-Submodular Functions Maximization with Matroid Constraints | Akbar Rafiey; Yuichi Yoshida; | In this paper, we study the problem of maximizing monotone submodular functions subject to matroid constraints in the framework of differential privacy. |

601 | Streaming Coresets for Symmetric Tensor Factorization | Supratim Shit; Anirban Dasgupta; Rachit Chhaya; Jayesh Choudhari; | Given a set of $n$ vectors, each in $\mathbb{R}^d$, we present algorithms to select a sublinear number of these vectors as coreset, while guaranteeing that the CP decomposition of the $p$-moment tensor of the coreset approximates the corresponding decomposition of the $p$-moment tensor computed from the full data. |

602 | How Good is the Bayes Posterior in Deep Neural Networks Really? | Florian Wenzel; Kevin Roth; Bastiaan Veeling; Jakub Swiatkowski; Linh Tran; Stephan Mandt; Jasper Snoek; Tim Salimans; Rodolphe Jenatton; Sebastian Nowozin; | In this work we cast doubt on the current understanding of Bayes posteriors in popular deep neural networks: we demonstrate through careful MCMC sampling that the posterior predictive induced by the Bayes posterior yields systematically worse predictions when compared to simpler methods including point estimates obtained from SGD. |

603 | Optimally Solving Two-Agent Decentralized POMDPs Under One-Sided Information Sharing | Yuxuan Xie; Jilles Dibangoye; Olivier Buffet; | This paper addresses this question for a team of two agents, with one-sided information sharing, i.e., both agents have imperfect information about the state of the world, but only one has access to what the other sees and does. |

604 | Learning Algebraic Multigrid Using Graph Neural Networks | Ilay Luz; Meirav Galun; Haggai Maron; Ronen Basri; Irad Yavneh; | Here we propose a framework for learning AMG prolongation operators for linear systems with sparse symmetric positive (semi-) definite matrices. |

605 | Fractal Gaussian Networks: A sparse random graph model based on Gaussian Multiplicative Chaos | Subhroshekhar Ghosh; Krishna Balasubramanian; Xiaochuan Yang; | We propose a novel stochastic network model, called Fractal Gaussian Network (FGN), that embodies well-defined and analytically tractable fractal structures. |

606 | Structured Policy Iteration for Linear Quadratic Regulator | Youngsuk Park; Ryan Rossi; Zheng Wen; Gang Wu; Handong Zhao; | In this paper, we introduce the Structured Policy Iteration (S-PI) for LQR, a method capable of deriving a structured linear policy. |

607 | T-GD: Transferable GAN-generated Images Detection Framework | Hyeonseong Jeon; Young Oh Bang; Junyaup Kim; Simon Woo; | In this work, we present a robust transferable framework to effectively detect GAN-images, called Transferable GAN-images Detection framework (T-GD). |

608 | Low Bias Low Variance Gradient Estimates for Hierarchical Boolean Stochastic Networks | Adeel Pervez; Taco Cohen; Efstratios Gavves; | To analyze such networks, we introduce the framework of harmonic analysis for Boolean functions to derive an analytic formulation for the bias and variance in the Straight-Through estimator. |

609 | Learning Flat Latent Manifolds with VAEs | Nutan Chen; Alexej Klushyn; Francesco Ferroni; Justin Bayer; Patrick van der Smagt; | We propose an extension to the framework of variational auto-encoders that allows learning flat latent manifolds, where the Euclidean metric is a proxy for the similarity between data points. |

610 | Multi-Task Learning with User Preferences: Gradient Descent with Controlled Ascent in Pareto Optimization | Debabrata Mahapatra; Vaibhav Rajan; | We develop the first gradient-based multi-objective MTL algorithm to address this problem. |

611 | Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources | Yun-Yun Tsai; Pin-Yu Chen; Tsung-Yi Ho; | Motivated by the techniques from adversarial machine learning (ML) that are capable of manipulating the model prediction via data perturbations, in this paper we propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box ML model (e.g., a prediction API or a proprietary software) for solving different ML tasks, especially in the scenario with scarce data and constrained resources. |

612 | On Coresets for Regularized Regression | Rachit Chhaya; Supratim Shit; Anirban Dasgupta; | We propose a modified version of the LASSO problem and obtain for it a coreset of size smaller than the least square regression. |

613 | Budgeted Online Influence Maximization | Pierre Perrault; Zheng Wen; Michal Valko; Jennifer Healey; | We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. |

614 | On the (In)tractability of Computing Normalizing Constants for the Product of Determinantal Point Processes | Naoto Ohsaka; Tatsuya Matsuoka; | We consider the product of determinantal point processes (DPPs), a point process whose probability mass is proportional to the product of principal minors of multiple matrices as a natural, promising generalization of DPPs. |

615 | Monte-Carlo Tree Search as Regularized Policy Optimization | Jean-Bastien Grill; Florent Altché; Yunhao Tang; Thomas Hubert; Michal Valko; Ioannis Antonoglou; Remi Munos; | In this paper, we show that AlphaZero’s search heuristic, along with other common ones, can be interpreted as an approximation to the solution of a specific regularized policy optimization problem. |

616 | On the Expressivity of Neural Networks for Deep Reinforcement Learning | Kefan Dong; Yuping Luo; Tianhe Yu; Chelsea Finn; Tengyu Ma; | We show, theoretically and empirically, that even for one-dimensional continuous state space, there are many MDPs whose optimal Q-functions and policies are much more complex than the dynamics. |

617 | The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks | Jakub Swiatkowski; Kevin Roth; Bastiaan Veeling; Linh Tran; Joshua Dillon; Jasper Snoek; Stephan Mandt; Tim Salimans; Rodolphe Jenatton; Sebastian Nowozin; | For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. |

618 | A Generative Model for Molecular Distance Geometry | Gregor Simm; Jose Miguel Hernandez-Lobato; | We present a probabilistic model that generates such samples for molecules from their graph representations. |

619 | Why bigger is not always better: on finite and infinite neural networks | Laurence Aitchison; | This motivates the introduction of a new class of network: infinite networks with bottlenecks, which inherit the theoretical tractability of infinite networks while at the same time allowing representation learning. |

620 | Data-Efficient Image Recognition with Contrastive Predictive Coding | Olivier Henaff; | We therefore revisit and improve Contrastive Predictive Coding, an unsupervised objective for learning such representations. |

621 | Intrinsic Reward Driven Imitation Learning via Generative Model | Xingrui Yu; Yueming Lyu; Ivor Tsang; | To address this challenge, we propose a novel reward learning module to generate intrinsic reward signals via a generative model. |

622 | Can Increasing Input Dimensionality Improve Deep Reinforcement Learning? | Kei Ota; Tomoaki Oiki; Devesh Jha; Toshisada Mariyama; Daniel Nikovski; | In this paper, we study whether increasing input dimensionality improves the performance and sample efficiency of model-free deep RL algorithms. |

623 | Batch Reinforcement Learning with Hyperparameter Gradients | Byung-Jun Lee; Jongmin Lee; Peter Vrancx; Dongho Kim; Kee-Eung Kim; | Unlike prior work where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that uses a gradient-based optimization of the hyperparameter using held-out data. |

624 | Sub-Goal Trees — a Framework for Goal-Based Reinforcement Learning | Tom Jurgenson; Or Avner; Edward Groshev; Aviv Tamar; | Instead, we propose a new RL framework, derived from a dynamic programming equation for the all pairs shortest path (APSP) problem, which naturally solves goal-directed queries. |

625 | A Geometric Approach to Archetypal Analysis via Sparse Projections | Vinayak Abrol; Pulkit Sharma; | This work presents a computationally efficient greedy AA (GAA) algorithm. |

626 | Sequence Generation with Mixed Representations | Lijun Wu; Shufang Xie; Yingce Xia; Yang Fan; Jian-Huang Lai; Tao Qin; Tie-Yan Liu; | In this work, we propose to leverage the mixed representations from different tokenization methods for sequence generation tasks, in order to boost the model performance with unique characteristics and advantages of individual tokenization methods. |

627 | Agent57: Outperforming the Atari Human Benchmark | Adrià Puigdomenech Badia; Bilal Piot; Steven Kapturowski; Pablo Sprechmann; Oleksandr Vitvitskyi; Zhaohan Guo; Charles Blundell; | We propose Agent57, the first deep RL agent that outperforms the standard human benchmark on all 57 Atari games. |

628 | RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr | Xingjian Li; Haoyi Xiong; Haozhe An; Dejing Dou; Cheng-Zhong Xu; | In this work, we propose RIFLE – a simple yet effective strategy that deepens backpropagation in transfer learning settings, through periodically ReInitializing the Fully-connected LayEr with random scratch during the fine-tuning procedure. |

629 | Fairwashing explanations with off-manifold detergent | Christopher Anders; Ann-Kathrin Dombrowski; Klaus-robert Mueller; Pan Kessel; Plamen Pasliev; | In this paper, we show both theoretically and experimentally that these hopes are presently unfounded. |

630 | Learning disconnected manifolds: a no GAN’s land | Ugo Tanielian; Thibaut Issenhuth; Elvis Dohmatob; Jeremie Mary; | We formalize this problem by establishing a "no free lunch" theorem for disconnected manifold learning that states an upper bound on the precision of the targeted distribution. |

631 | Sets Clustering | Ibrahim Jubran; Murad Tukan; Alaa Maalouf; Dan Feldman; | We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. |

632 | Variational Autoencoders with Riemannian Brownian Motion Priors | Dimitris Kalatzis; David Eklund; Georgios Arvanitidis; Søren Hauberg; | To counter this, we assume a Riemannian structure over the latent space, which constitutes a more principled geometric view of the latent codes, and replace the standard Gaussian prior with a Riemannian Brownian motion prior. |

633 | Non-separable Non-stationary random fields | Kangrui Wang; Oliver Hamelijnck; Theodoros Damoulas; Mark Steel; | We describe a framework for constructing non-separable non-stationary random fields that is based on an infinite mixture of convolved stochastic processes. |

634 | Nonparametric Score Estimators | Yuhao Zhou; Jiaxin Shi; Jun Zhu; | We provide a unifying view of these estimators under the framework of regularized nonparametric regression. |

635 | A Free-Energy Principle for Representation Learning | Yansong Gao; Pratik Chaudhari; | This paper employs a formal connection of machine learning with thermodynamics to characterize the quality of learnt representations for transfer learning. |

636 | Scalable Differential Privacy with Certified Robustness in Adversarial Learning | Hai Phan; My T. Thai; Han Hu; Ruoming Jin; Tong Sun; Dejing Dou; | In this paper, we aim to develop a scalable algorithm to preserve differential privacy (DP) in adversarial learning for deep neural networks (DNNs), with certified robustness to adversarial examples. |

637 | Variational Inference for Sequential Data with Future Likelihood Estimates | Geon-Hyeong Kim; Youngsoo Jang; Hongseok Yang; Kee-Eung Kim; | To tackle this challenge, we present a novel variational inference algorithm for sequential data, which performs well even when the density from the model is not differentiable, for instance, due to the use of discrete random variables. |

638 | Implicit Learning Dynamics in Stackelberg Games: Equilibria Characterization, Convergence Analysis, and Empirical Study | Tanner Fiez; Benjamin Chasnov; Lillian Ratliff; | We derive novel gradient-based learning dynamics emulating the natural structure of a Stackelberg game using the Implicit Function Theorem and provide convergence analysis for deterministic and stochastic updates for zero-sum and general-sum games. |

639 | Let’s Agree to Agree: Neural Networks Share Classification Order on Real Datasets | Guy Hacohen; Leshem Choshen; Daphna Weinshall; | We report a series of robust empirical observations, whereby deep Neural Networks learn the examples in both the training and test sets in a similar order. |

640 | Quantile Causal Discovery | Natasa Tagasovska; Thibault Vatter; Valérie Chavez-Demoulin; | Based on this theory, we develop Quantile Causal Discovery (QCD), a new method to uncover causal relationships. |

641 | How to Solve Fair k-Center in Massive Data Models | Ashish Chiplunkar; Sagar Kale; Sivaramakrishnan Natarajan Ramamoorthy; | In this work, we design new streaming and distributed algorithms for the fair k-center problem that models fair data summarization. |

642 | Bayesian Learning from Sequential Data using Gaussian Processes with Signature Covariances | Csaba Toth; Harald Oberhauser; | To deal with this, we introduce a sparse variational approach with inducing tensors. |

643 | Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural Network Initialization? | Yaniv Blumenfeld; Dar Gilboa; Daniel Soudry; | This indicates that random, diverse initializations are *not* necessary for training neural networks. |

644 | Dynamic Knapsack Optimization Towards Efficient Multi-Channel Sequential Advertising | Xiaotian Hao; Zhaoqing Peng; Yi Ma; Guan Wang; Junqi Jin; Jianye Hao; Shan Chen; Rongquan Bai; Mingzhou Xie; Miao Xu; Zhenzhe Zheng; Chuan Yu; Han Li; Jian Xu; Kun Gai; | In this paper, we formulate the sequential advertising strategy optimization as a dynamic knapsack problem. |

645 | Stochastically Dominant Distributional Reinforcement Learning | John Martin; Michal Lyskawinski; Xiaohu Li; Brendan Englot; | We describe a new approach for managing aleatoric uncertainty in the Reinforcement Learning paradigm. |

646 | Adversarial Robustness Against the Union of Multiple Threat Models | Pratyush Maini; Eric Wong; Zico Kolter; | In this work, we develop a natural generalization of the standard PGD-based procedure to incorporate multiple threat models into a single attack, by taking the worst-case over all steepest descent directions. |

647 | Student-Teacher Curriculum Learning via Reinforcement Learning: Predicting Hospital Inpatient Admission Location | Rasheed El-Bouri; David Eyre; Peter Watkinson; Tingting Zhu; David Clifton; | In this work we propose a student-teacher network via reinforcement learning to deal with this specific problem. |

648 | Option Discovery in the Absence of Rewards with Manifold Analysis | Amitay Bar; Ronen Talmon; Ron Meir; | In this paper, we present an approach based on spectral graph theory and derive an algorithm that systematically discovers options without access to a specific reward or task assignment. |

649 | Generalisation error in learning with random features and the hidden manifold model | Federica Gerace; Bruno Loureiro; Florent Krzakala; Marc Mezard; Lenka Zdeborova; | We study generalized linear regression and classification for a synthetically generated dataset encompassing different problems of interest, such as learning with random features, neural networks in the lazy training regime, and the hidden manifold model. |

650 | Fast and Consistent Learning of Hidden Markov Models by Incorporating Non-Consecutive Correlations | Robert Mattila; Cristian Rojas; Eric Moulines; Vikram Krishnamurthy; Bo Wahlberg; | In this paper, we propose extending these methods (both pair- and triplet-based) by also including non-consecutive correlations in a way which does not significantly increase the computational cost (which scales linearly with the number of additional lags included). |

651 | Gradient-free Online Learning in Continuous Games with Delayed Rewards | Amélie Héliou; Panayotis Mertikopoulos; Zhengyuan Zhou; | Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. |

652 | Pseudo-Masked Language Models for Unified Language Model Pre-Training | Hangbo Bao; Li Dong; Furu Wei; Wenhui Wang; Nan Yang; Xiaodong Liu; Yu Wang; Jianfeng Gao; Songhao Piao; Ming Zhou; Hsiao-Wuen Hon; | We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). |

653 | Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits | Robert Peharz; Steven Lang; Antonio Vergari; Karl Stelzner; Alejandro Molina; Martin Trapp; Guy Van den Broeck; Kristian Kersting; Zoubin Ghahramani; | In this paper, we propose Einsum Networks (EiNets), a novel implementation design for PCs, improving prior art in several regards. |

654 | Polynomial Tensor Sketch for Element-wise Function of Low-Rank Matrix | Insu Han; Haim Avron; Jinwoo Shin; | To this end, we propose an efficient sketching-based algorithm whose complexity is significantly lower than the number of entries of $A$, i.e., it runs without accessing all entries of $[f(A_{ij})]$ explicitly. |

655 | Inexact Tensor Methods with Dynamic Accuracies | Nikita Doikov; Yurii Nesterov; | In this paper, we study inexact high-order Tensor Methods for solving convex optimization problems with composite objective. |

656 | k-means++: few more steps yield constant approximation | Davin Choo; Christoph Grunau; Julian Portmann; Vaclav Rozhon; | In this paper, we improve their analysis to show that, for any arbitrarily small constant $\varepsilon > 0$, with only $\varepsilon k$ additional local search steps, one can achieve a constant approximation guarantee (with high probability in k), resolving an open problem in their paper. |

657 | Radioactive data: tracing through training | Alexandre Sablayrolles; Matthijs Douze; Cordelia Schmid; Herve Jegou; | We propose a new technique, radioactive data, that makes imperceptible changes to this dataset such that any model trained on it will bear an identifiable mark. |

658 | Doubly robust off-policy evaluation with shrinkage | Yi Su; Maria Dimakopoulou; Akshay Krishnamurthy; Miroslav Dudik; | We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. |

659 | Fast Adaptation to New Environments via Policy-Dynamics Value Functions | Roberta Raileanu; Max Goldstein; Arthur Szlam; Rob Fergus; | We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training. |

660 | Neural Clustering Processes | Ari Pakman; Yueqi Wang; Catalin Mitelut; JinHyung Lee; Liam Paninski; | In this work we introduce deep network architectures trained with labeled samples from any generative model of clustered datasets. |

661 | Topologically Densified Distributions | Christoph Hofer; Florian Graf; Marc Niethammer; Roland Kwitt; | We study regularization in the context of small sample-size learning with over-parametrized neural networks. |

662 | Low-loss connection of weight vectors: distribution-based approaches | Ivan Anokhin; Dmitry Yarotsky; | We describe and compare experimentally a panel of methods used to connect two low-loss points by a low-loss curve on this surface. |

663 | Graph Filtration Learning | Christoph Hofer; Florian Graf; Bastian Rieck; Marc Niethammer; Roland Kwitt; | We propose an approach to learning with graph-structured data in the problem domain of graph classification. |

664 | Differentiable Product Quantization for Learning Compact Embedding Layers | Ting Chen; Lala Li; Yizhou Sun; | In this work, we propose a generic and end-to-end learnable compression framework termed differentiable product quantization (DPQ). |

665 | Scalable Exact Inference in Multi-Output Gaussian Processes | Wessel Bruinsma; Eric Perim Martins; William Tebbutt; Scott Hosking; Arno Solin; Richard Turner; | We propose the use of a sufficient statistic of the data to accelerate inference and learning in MOGPs with orthogonal bases. |

666 | Lower Complexity Bounds for Finite-Sum Convex-Concave Minimax Optimization Problems | Guangzeng Xie; Luo Luo; yijiang lian; Zhihua Zhang; | This paper studies lower complexity bounds for minimax optimization problems whose objective function is the average of $n$ individual smooth convex-concave functions. |

667 | Near-optimal Regret Bounds for Stochastic Shortest Path | Aviv Rosenberg; Alon Cohen; Yishay Mansour; Haim Kaplan; | In this work we remove this dependence on the minimum cost—we give an algorithm that guarantees a regret bound of $\widetilde{O}(B^{3/2} S \sqrt{A K})$, where $B$ is an upper bound on the expected cost of the optimal policy, $S$ is the number of states, $A$ is the number of actions and $K$ is the total number of episodes. |

668 | The Usual Suspects? Reassessing Blame for VAE Posterior Collapse | Bin Dai; Ziyu Wang; David Wipf; | In particular, we prove that even small nonlinear perturbations of affine VAE decoder models can produce such minima, and in deeper models, analogous minima can force the VAE to behave like an aggressive truncation operator, provably discarding information along all latent dimensions in certain circumstances. |

669 | It’s Not What Machines Can Learn, It’s What We Cannot Teach | Gal Yehuda; Moshe Gabel; Assaf Schuster; | In this work we offer a different perspective on this question. |

670 | Guided Learning of Nonconvex Models through Successive Functional Gradient Optimization | Rie Johnson; Tong Zhang; | This paper presents a framework of successive functional gradient optimization for training nonconvex models such as neural networks, where training is driven by mirror descent in a function space. |

671 | A Markov Decision Process Model for Socio-Economic Systems Impacted by Climate Change | Salman Sadiq Shuvo; Yasin Yilmaz; Alan Bush; Mark Hafen; | In this work, we propose a Markov decision process (MDP) formulation for an agent (government) which interacts with the environment (nature and residents) to deal with the impacts of climate change, in particular sea level rise. |

672 | Can Stochastic Zeroth-Order Frank-Wolfe Method Converge Faster for Non-Convex Problems? | Hongchang Gao; Heng Huang; | To address the problem of lacking gradients in many applications, we propose two new stochastic zeroth-order Frank-Wolfe algorithms and theoretically prove that they have a faster convergence rate than existing methods for non-convex problems. |

673 | Distance Metric Learning with Joint Representation Diversification | Xu Chu; Yang Lin; Xiting Wang; Xin Gao; Qi Tong; Hailong Yu; Yasha Wang; | In contrast, we propose not to penalize intra-class distances explicitly and use a Joint Representation Similarity (JRS) regularizer that focuses on penalizing inter-class distributional similarities in a DML framework. |

674 | Meta-Learning with Shared Amortized Variational Inference | Ekaterina Iakovleva; Karteek Alahari; Jakob Verbeek; | In the context of an empirical Bayes model for meta-learning where a subset of model parameters is treated as latent variables, we propose a novel scheme for amortized variational inference. |

675 | Causal Effect Identifiability under Partial-Observability | Sanghack Lee; Elias Bareinboim; | In this paper, we study the causal effect identifiability problem when the available distributions may be associated with different sets of variables, which we refer to as identification under partial-observability. |

676 | Continuous Graph Neural Networks | Louis-Pascal Xhonneux; Meng Qu; Jian Tang; | We propose continuous graph neural networks (CGNN), which generalise existing graph neural networks with discrete dynamics in that they can be viewed as a specific discretisation scheme. |

677 | Restarted Bayesian Online Change-point Detector achieves Optimal Detection Delay | Reda Alami; Odalric-Ambrym Maillard; Raphaël Féraud; | In this paper, we consider the problem of sequential change-point detection where both the change-points and the distributions before and after the change are assumed to be unknown. |

678 | Robust learning with the Hilbert-Schmidt independence criterion | Daniel Greenfeld; Uri Shalit; | We investigate the use of a non-parametric independence measure, the Hilbert-Schmidt Independence Criterion (HSIC), as a loss-function for learning robust regression and classification models. |

679 | Bayesian Experimental Design for Implicit Models by Mutual Information Neural Estimation | Steven Kleinegesse; Michael Gutmann; | In this paper, we propose a new approach to Bayesian experimental design for implicit models that leverages recent advances in neural MI estimation to deal with these issues. |

680 | Fast Differentiable Sorting and Ranking | Mathieu Blondel; Olivier Teboul; Quentin Berthet; Josip Djolonga; | In this paper, we propose the first differentiable sorting and ranking operators with $O(n \log n)$ time and $O(n)$ space complexity. |

681 | Learning for Dose Allocation in Adaptive Clinical Trials with Safety Constraints | Cong Shen; Zhiyang Wang; Sofia Villar; Mihaela van der Schaar; | We present a novel adaptive clinical trial methodology, called Safe Efficacy Exploration Dose Allocation (SEEDA), that aims at maximizing the cumulative efficacies while satisfying the toxicity safety constraint with high probability. |

682 | Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems | Kaixuan Wei; Angelica I Aviles-Rivero; Jingwei Liang; Ying Fu; Carola-Bibiane Schönlieb; Hua Huang; | In this work, we present a tuning-free PnP proximal algorithm, which can automatically determine the internal parameters including the penalty parameter, the denoising strength and the terminal time. |

683 | Consistent Estimators for Learning to Defer to an Expert | Hussein Mozannar; David Sontag; | In this paper we explore how to learn predictors that can either predict or choose to defer the decision to a downstream expert. |

684 | A Graph to Graphs Framework for Retrosynthesis Prediction | Chence Shi; Minkai Xu; Hongyu Guo; Ming Zhang; Jian Tang; | In this paper, we propose a novel template-free approach called G2Gs by transforming a target molecular graph into a set of reactant molecular graphs. |

685 | Fast computation of Nash Equilibria in Imperfect Information Games | Remi Munos; Julien Perolat; Jean-Baptiste Lespiau; Mark Rowland; Bart De Vylder; Marc Lanctot; Finbarr Timbers; Daniel Hennes; Shayegan Omidshafiei; Audrunas Gruslys; Mohammad Gheshlaghi Azar; Edward Lockhart; Karl Tuyls; | We introduce and analyze a class of algorithms, called Mirror Ascent against an Improved Opponent (MAIO), for computing Nash equilibria in two-player zero-sum games, both in normal form and in sequential imperfect information form. |

686 | Invariant Rationalization | Shiyu Chang; Yang Zhang; Mo Yu; Tommi Jaakkola; | Instead, we introduce a game-theoretic invariant rationalization criterion where the rationales are constrained to enable the same predictor to be optimal across different environments. |

687 | Accelerated Stochastic Gradient-free and Projection-free Methods | Feihu Huang; Lue Tao; Songcan Chen; | In the paper, we propose a class of accelerated stochastic gradient-free and projection-free (a.k.a., zeroth-order Frank Wolfe) methods to solve the problem of constrained stochastic and finite-sum nonconvex optimization. |

688 | Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation | Marc Abeille; Alessandro Lazaric; | Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ and cast it into a constrained *extended* LQR problem, where an additional control variable implicitly selects the system dynamics within a confidence interval. |

689 | Implicit Regularization of Random Feature Models | Arthur Jacot; berfin simsek; Francesco Spadaro; Clement Hongler; Franck Gabriel; | We investigate, by means of random matrix theory, the connection between Gaussian RF models and Kernel Ridge Regression (KRR). |

690 | Missing Data Imputation using Optimal Transport | Boris Muzellec; Julie Josse; Claire Boyer; Marco Cuturi; | We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. |

691 | Unsupervised Speech Decomposition via Triple Information Bottleneck | Kaizhi Qian; Yang Zhang; Shiyu Chang; Mark Hasegawa-Johnson; David Cox; | In this paper, we propose SpeechFlow, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks. |

692 | Provable Representation Learning for Imitation Learning via Bi-level Optimization | Sanjeev Arora; Simon Du; Sham Kakade; Yuping Luo; Nikunj Umesh Saunshi; | We formulate representation learning as a bi-level optimization problem where the “outer” optimization tries to learn the joint representation and the “inner” optimization encodes the imitation learning setup and tries to learn task-specific parameters. |

693 | Convergence of a Stochastic Gradient Method with Momentum for Non-Smooth Non-Convex Optimization | Vien Mai; Mikael Johansson; | Our key innovation is the construction of a special Lyapunov function for which the proven complexity can be achieved without any tuning of the momentum parameter. |

694 | XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation | Junjie Hu; Sebastian Ruder; Aditya Siddhant; Graham Neubig; Orhan Firat; Melvin Johnson; | To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. |

695 | Fair k-Centers via Maximum Matching | Matthew Jones; Thy Nguyen; Huy Nguyen; | This paper combines the best parts of each algorithm, by presenting a linear-time algorithm with a guaranteed 3-approximation factor, and provides empirical evidence of both the algorithm’s runtime and effectiveness. |

696 | Efficiently sampling functions from Gaussian process posteriors | James Wilson; Viacheslav Borovitskiy; Alexander Terenin; Peter Mostowsky; Marc Deisenroth; | Building off of this factorization, we propose decoupled sampling, an easy-to-use and general-purpose approach for fast posterior sampling. |

697 | Characterizing Distribution Equivalence and Structure Learning for Cyclic and Acyclic Directed Graphs | AmirEmad Ghassami; Alan Yang; Negar Kiyavash; Kun Zhang; | We propose analytic as well as graphical methods for characterizing the equivalence of two structures. |

698 | Inverse Active Sensing: Modeling and Understanding Timely Decision-Making | Daniel Jarrett; Mihaela van der Schaar; | In this paper, we develop an expressive, unified framework for the general setting of evidence-based decision-making under endogenous, context-dependent time pressure—which requires negotiating (subjective) tradeoffs between accuracy, speediness, and cost of information. |

699 | On Second-Order Group Influence Functions for Black-Box Predictions | Samyadeep Basu; Xuchen You; Soheil Feizi; | In this paper, we address this issue and propose second-order influence functions for identifying influential groups in test-time predictions. |

700 | Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences | Daniel Brown; Scott Niekum; Russell Coleman; Ravi Srinivasan; | We propose a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by first pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference via sampling. |

701 | Randomly Projected Additive Gaussian Processes for Regression | Ian Delbridge; David Bindel; Andrew Wilson; | Surprisingly, we find that as the number of random projections increases, the predictive performance of this approach quickly converges to the performance of a kernel operating on the original full dimensional inputs, over a wide range of data sets, even if we are projecting into a single dimension. |

702 | Attentive Group Equivariant Convolutional Networks | David Romero; Erik Bekkers; Jakub Tomczak; Mark Hoogendoorn; | In this paper, we present attentive group equivariant convolutions, a generalization of the group convolution, in which attention is applied during the course of convolution to accentuate meaningful symmetry combinations and suppress non-plausible, misleading ones. |

703 | Learning Compound Tasks without Task-specific Knowledge via Imitation and Self-supervised Learning | Sang-Hyun Lee; Seung-Woo Seo; | In this paper, we propose an imitation learning method that can learn compound tasks without task-specific knowledge. |

704 | Confidence Sets and Hypothesis Testing in a Likelihood-Free Inference Setting | Niccolo Dalmasso; Rafael Izbicki; Ann Lee; | In this paper, we present ACORE (Approximate Computation via Odds Ratio Estimation), a frequentist approach to LFI that first formulates the classical likelihood ratio test (LRT) as a parametrized classification problem, and then uses the equivalence of tests and confidence sets to build confidence regions for parameters of interest. |

705 | Curvature-corrected learning dynamics in deep neural networks | Dongsung Huh; | We introduce a partially curvature-corrected learning rule, which provides most of the benefit of full curvature correction in terms of convergence speed with superior numerical stability while preserving the core property of gradient descent under block-diagonal approximations. |

706 | Tightening Exploration in Upper Confidence Reinforcement Learning | Hippolyte Bourel; Odalric-Ambrym Maillard; Mohammad Sadegh Talebi; | Motivated by practical efficiency, we present UCRL3, following the lines of UCRL2, but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and transition distributions for each state-action pair. To further tighten exploration, we introduce an adaptive computation of the support of each transition distribution. |

707 | Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning | Zhaohan Guo; Bernardo Avila Pires; Mohammad Gheshlaghi Azar; Bilal Piot; Florent Altché; Jean-Bastien Grill; Remi Munos; | Here we introduce Predictions of Bootstrapped Latents (PBL), a simple and flexible self-supervised representation learning algorithm for multitask deep RL. |

708 | Discriminative Adversarial Search for Abstractive Summarization | Thomas Scialom; Paul-Alexis Dray; Sylvain Lamprier; Benjamin Piwowarski; Jacopo Staiano; | We introduce a novel approach for sequence decoding, Discriminative Adversarial Search (DAS), which has the desirable properties of alleviating the effects of exposure bias without requiring external metrics. |

709 | A Swiss Army Knife for Minimax Optimal Transport | Sofien Dhouib; Ievgen Redko; Tanguy Kerdoncuff; Rémi Emonet; Marc Sebban; | In this paper, we propose a general formulation of a minimax OT problem that can tackle these restrictions by jointly optimizing the cost matrix and the transport plan, allowing us to define a robust distance between distributions. |

710 | Invariant Causal Prediction for Block MDPs | Clare Lyle; Amy Zhang; Angelos Filos; Shagun Sodhani; Marta Kwiatkowska; Yarin Gal; Doina Precup; Joelle Pineau; | In this work we propose a method for learning state abstractions which generalize to novel observation distributions in the multi-environment RL setting. |

711 | Involutive MCMC: One Way to Derive Them All | Kirill Neklyudov; Max Welling; Evgenii Egorov; Dmitry Vetrov; | Building upon this, we describe a wide range of MCMC algorithms in terms of iMCMC, and formulate a number of "tricks" which one can use as design principles for developing new MCMC algorithms. |

712 | Adversarial Learning Guarantees for Linear Hypotheses and Neural Networks | Pranjal Awasthi; Natalie Frank; Mehryar Mohri; | In order to make progress on this, we focus on the problem of understanding generalization in adversarial settings, via the lens of Rademacher complexity. |

713 | Deep Reinforcement Learning with Smooth Policy | Qianli Shen; Yan Li; Haoming Jiang; Zhaoran Wang; Tuo Zhao; | In this paper, we develop a new training framework — **S**mooth **R**egularized **R**einforcement **L**earning (SR²L), where the policy is trained with smoothness-inducing regularization. |

714 | On the Power of Compressed Sensing with Generative Models | Akshay Kamath; Eric Price; Sushrut Karmalkar; | In this paper, we prove results that (i) establish the difficulty of this task and show that existing bounds are tight, and (ii) demonstrate that the latter task is a generalization of the former. |

715 | Laplacian Regularized Few-Shot Learning | Imtiaz Ziko; Jose Dolz; Eric Granger; Ismail Ben Ayed; | We propose a Laplacian-regularization objective for few-shot tasks, which integrates two types of potentials: (1) unary potentials assigning query samples to the nearest class prototype and (2) pairwise Laplacian potentials encouraging nearby query samples to have consistent predictions. |

716 | Neural Datalog Through Time: Informed Temporal Modeling via Logical Specification | Hongyuan Mei; Guanghui Qin; Minjie Xu; Jason Eisner; | To exploit known structure, we propose using a deductive database to track facts over time, where each fact has a time-varying state: a vector computed by a neural net whose topology is determined by the fact’s provenance and experience. |

717 | Up or Down? Adaptive Rounding for Post-Training Quantization | Markus Nagel; Rana Ali Amjad; Marinus van Baalen; Christos Louizos; Tijmen Blankevoort; | In this paper, we propose AdaRound, a better weight-rounding mechanism for post-training quantization that adapts to the data and the task loss. |

718 | A quantile-based approach for hyperparameter transfer learning | David Salinas; Huibin Shen; Valerio Perrone; | In this work, we introduce a novel approach to achieve transfer learning across different datasets as well as different objectives. |

719 | Inductive Bias-driven Reinforcement Learning For Efficient Schedules in Heterogeneous Clusters | Subho Banerjee; Saurabh Jha; Zbigniew Kalbarczyk; Ravishankar Iyer; | This paper addresses the challenge in two ways: (i) a domain-driven Bayesian reinforcement learning (RL) model for scheduling, which inherently models the resource dependencies identified from the system architecture; and (ii) a sampling-based technique which allows the computation of gradients of a Bayesian model without performing full probabilistic inference. |

720 | Adversarial Robustness for Code | Pavol Bielik; Martin Vechev; | In this work we address this gap by: (i) developing adversarial attacks for code (a domain with discrete and highly structured inputs), (ii) showing that, similar to other domains, neural models for code are highly vulnerable to adversarial attacks, and (iii) developing a set of novel techniques that enable training robust and accurate models of code. |

721 | The Boomerang Sampler | Joris Bierkens; Sebastiano Grazzi; Kengo Kamatani; Gareth Roberts; | This paper introduces the boomerang sampler as a novel class of continuous-time non-reversible Markov chain Monte Carlo algorithms. |

722 | Weakly-Supervised Disentanglement Without Compromises | Francesco Locatello; Ben Poole; Gunnar Raetsch; Bernhard Schölkopf; Olivier Bachem; Michael Tschannen; | First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. |

723 | Predictive Sampling with Forecasting Autoregressive Models | Auke Wiggers; Emiel Hoogeboom; | In this paper, we introduce the predictive sampling algorithm: a procedure that exploits the fast inference property of ARMs in order to speed up sampling, while keeping the model intact. |

724 | InfoGAN-CR: Disentangling Generative Adversarial Networks with Contrastive Regularizers | Zinan Lin; Kiran Thekumparampil; Giulia Fanti; Sewoong Oh; | We propose an unsupervised model selection scheme based on medoids. |

725 | TrajectoryNet: A Dynamic Optimal Transport Network for Modeling Cellular Dynamics | Alexander Tong; Jessie Huang; Guy Wolf; David van Dijk; Smita Krishnaswamy; | We present {\em TrajectoryNet}, which controls the continuous paths taken between distributions. |

726 | The role of regularization in classification of high-dimensional noisy Gaussian mixture | Francesca Mignacco; Florent Krzakala; Yue Lu; Pierfrancesco Urbani; Lenka Zdeborova; | We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number $n$ of samples and their dimension $d$ go to infinity while their ratio is fixed to $\alpha=n/d$. |

727 | Normalizing Flows on Tori and Spheres | Danilo J. Rezende; George Papamakarios; Sebastien Racaniere; Michael Albergo; Gurtej Kanwar; Phiala Shanahan; Kyle Cranmer; | In this paper, we propose and compare expressive and numerically stable flows on such spaces. |

728 | Structured Linear Contextual Bandits: A Sharp and Geometric Smoothed Analysis | Vidyashankar Sivakumar; Steven Wu; Arindam Banerjee; | In this work, we consider a smoothed setting for structured linear contextual bandits where the adversarial contexts are perturbed by Gaussian noise and the unknown parameter $\theta^*$ has structure, e.g., sparsity, group sparsity, low rank, etc. |

729 | Simple and sharp analysis of k-means|| | Vaclav Rozhon; | We present a truly simple analysis of k-means|| (Bahmani et al., PVLDB 2012) — a distributed variant of the k-means++ algorithm (Arthur and Vassilvitskii, SODA 2007) — and improve its round complexity from O(log (Var X)), where Var X is the variance of the input data set, to O(log (Var X) / log log (Var X)), which we show to be tight. |

730 | Efficient proximal mapping of the path-norm regularizer of shallow networks | Fabian Latorre; Paul Rolland; Shaul Nadav Hallak; Volkan Cevher; | We demonstrate two new important properties of the path-norm regularizer for shallow neural networks. |

731 | Regularized Optimal Transport is Ground Cost Adversarial | François-Pierre Paty; Marco Cuturi; | In this paper, we adopt a more geometrical point of view, and show using Fenchel duality that any convex regularization of OT can be interpreted as ground cost adversarial. |

732 | Automatic Shortcut Removal for Self-Supervised Representation Learning | Matthias Minderer; Olivier Bachem; Neil Houlsby; Michael Tschannen; | Here, we propose a general framework for removing shortcut features automatically. |

733 | Fair Learning with Private Demographic Data | Hussein Mozannar; Mesrob Ohannessian; Nati Srebro; | We give a scheme that allows individuals to release their sensitive information privately while still allowing any downstream entity to learn non-discriminatory predictors. |

734 | Deep Divergence Learning | Kubra Cilingir; Rachel Manzelli; Brian Kulis; | In this paper, we introduce deep Bregman divergences, which are based on learning and parameterizing functional Bregman divergences using neural networks, and which unify and extend these existing lines of work. |

735 | A new regret analysis for Adam-type algorithms | Ahmet Alacaoglu; Yura Malitsky; Panayotis Mertikopoulos; Volkan Cevher; | In this paper, we focus on a theory-practice gap for Adam and its variants (AMSgrad, AdamNC, etc.). |

736 | Accelerated Message Passing for Entropy-Regularized MAP Inference | Jonathan Lee; Aldo Pacchiano; Peter Bartlett; Michael Jordan; | In this paper, we present randomized methods for accelerating these algorithms by leveraging techniques that underlie classical accelerated gradient methods. |

737 | Dissecting Non-Vacuous Generalization Bounds based on the Mean-Field Approximation | Konstantinos Pitas; | We show empirically that this approach gives negligible gains when modelling the posterior as a Gaussian with diagonal covariance—known as the mean-field approximation. |

738 | (Individual) Fairness for k-Clustering | Sepideh Mahabadi; Ali Vakilian; | In this work, we show how to get an approximately optimal such fair $k$-clustering. |

739 | Relaxing Bijectivity Constraints with Continuously Indexed Normalising Flows | Rob Cornish; Anthony Caterini; George Deligiannidis; Arnaud Doucet; | To address this, we propose continuously indexed flows (CIFs), which replace the single bijection used by normalising flows with a continuously indexed family of bijections, and which intuitively allow rerouting mass that would be misplaced by a single bijection. |

740 | Gamification of Pure Exploration for Linear Bandits | Rémy Degenne; Pierre Menard; Xuedong Shang; Michal Valko; | We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits. |

741 | Growing Adaptive Multi-hyperplane Machines | Nemanja Djuric; Zhuang Wang; Slobodan Vucetic; | In this paper we show that this performance gap is not due to limited representability of the MM model, as it can represent arbitrary concepts. |

742 | Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data | Felipe Petroski Such; Aditya Rawal; Joel Lehman; Kenneth Stanley; Jeffrey Clune; | This paper introduces GTNs, discusses their potential, and showcases that they can substantially accelerate learning. |

743 | Structured Prediction with Partial Labelling through the Infimum Loss | Vivien Cabannnes; Francis Bach; Alessandro Rudi; | This paper provides a unified framework based on structured prediction and on the concept of {\em infimum loss} to deal with partial labelling over a wide family of learning problems and loss functions. |

744 | ControlVAE: Controllable Variational Autoencoder | Huajie Shao; Shuochao Yao; Dachun Sun; Aston Zhang; Shengzhong Liu; Dongxin Liu; Jun Wang; Tarek Abdelzaher; | To address these issues, we propose a novel controllable variational autoencoder framework, ControlVAE, that combines a controller, inspired by automatic control theory, with the basic VAE to improve the performance of resulting generative models. |

745 | On Semi-parametric Inference for BART | Veronika Rockova; | In this work, we continue the theoretical investigation of BART initiated recently by Rockova and van der Pas (2017). |

746 | Simple and Scalable Epistemic Uncertainty Estimation Using a Single Deep Deterministic Neural Network | Joost van Amersfoort; Lewis Smith; Yee Whye Teh; Yarin Gal; | We propose a method for training a deterministic deep model that can find and reject out-of-distribution data points at test time with a single forward pass. |

747 | Ordinal Non-negative Matrix Factorization for Recommendation | Olivier Gouvert; Thomas Oberlin; Cedric Fevotte; | We introduce a new non-negative matrix factorization (NMF) method for ordinal data, called OrdNMF. |

748 | NetGAN without GAN: From Random Walks to Low-Rank Approximations | Luca Rendsburg; Holger Heidrich; Ulrike von Luxburg; | In this paper, we investigate the implicit bias of NetGAN. |

749 | On the Iteration Complexity of Hypergradient Computations | Riccardo Grazzi; Saverio Salzo; Massimiliano Pontil; Luca Franceschi; | We present a unified analysis which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity. |

750 | Skew-Fit: State-Covering Self-Supervised Reinforcement Learning | Vitchyr Pong; Murtaza Dalal; Steven Lin; Ashvin Nair; Shikhar Bahl; Sergey Levine; | In this paper, we propose a formal exploration objective for goal-reaching policies that maximizes state coverage. |

751 | Stochastic Optimization for Regularized Wasserstein Estimators | Marin Ballu; Quentin Berthet; Francis Bach; | In this work, we introduce an algorithm to solve a regularized version of this problem of Wasserstein estimators, with a time per step which is sublinear in the natural dimensions of the problem. |

752 | LP-SparseMAP: Differentiable Relaxed Optimization for Sparse Structured Prediction | Vlad Niculae; Andre Filipe Torres Martins; | In this paper, we introduce LP-SparseMAP, an extension of SparseMAP addressing this limitation via a local polytope relaxation. |

753 | Problems with Shapley-value-based explanations as feature importance measures | Indra Kumar; Suresh Venkatasubramanian; Carlos Scheidegger; Sorelle Friedler; | We show that mathematical problems arise when Shapley values are used for feature importance and that the solutions to mitigate these necessarily induce further complexity, such as the need for causal reasoning. |

754 | Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes | Chen-Yu Wei; Mehdi Jafarnia; Haipeng Luo; Hiteshi Sharma; Rahul Jain; | In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). |

755 | Near-linear time Gaussian process optimization with adaptive batching and resparsification | Daniele Calandriello; Luigi Carratino; Alessandro Lazaric; Michal Valko; Lorenzo Rosasco; | In this paper, we introduce BBKB (Batch Budgeted Kernel Bandits), the first no-regret GP optimization algorithm that provably runs in near-linear time and selects candidates in batches. |

756 | Parallel Algorithm for Non-Monotone DR-Submodular Maximization | Alina Ene; Huy Nguyen; | In this work, we give a new parallel algorithm for the problem of maximizing a non-monotone diminishing returns submodular function subject to a cardinality constraint. |

757 | Structure Adaptive Algorithms for Stochastic Bandits | Rémy Degenne; Han Shao; Wouter Koolen; | Our aim is to develop methods that are flexible (in that they easily adapt to different structures), powerful (in that they perform well empirically and/or provably match instance-dependent lower bounds) and efficient (in that the per-round computational burden is small). |

758 | Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks | Blake Bordelon; Abdulkadir Canatar; Cengiz Pehlevan; | We derive analytical expressions for learning curves for kernel regression, and use them to evaluate how the test loss of a trained neural network depends on the number of samples. |

759 | Preference modelling with context-dependent salient features | Amanda Bower; Laura Balzano; | Formalizing this framework, we propose the \textit{salient feature preference model} and prove a sample complexity result for learning the parameters of our model and the underlying ranking with maximum likelihood estimation. |

760 | Infinite attention: NNGP and NTK for deep attention networks | Jiri Hron; Yasaman Bahri; Jascha Sohl-Dickstein; Roman Novak; | We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. |

761 | Fast Learning of Graph Neural Networks with Guaranteed Generalizability: One-hidden-layer Case | Shuai Zhang; Meng Wang; Sijia Liu; Pin-Yu Chen; Jinjun Xiong; | In this paper, we provide a theoretically-grounded generalizability analysis of GNNs with one hidden layer for both regression and binary classification problems. |

762 | Efficient Domain Generalization via Common-Specific Low-Rank Decomposition | Vihari Piratla; Praneeth Netrapalli; Sunita Sarawagi; | We present CSD (Common Specific Decomposition), for this setting, which jointly learns a common component (which generalizes to new domains) and a domain specific component (which overfits on training domains). |

763 | Identifying the Reward Function by Anchor Actions | Sinong Geng; Houssam Nassif; Carlos Manzanares; Max Reppen; Ronnie Sircar; | We propose a reward function estimation framework for inverse reinforcement learning with deep energy-based policies. |

764 | No-Regret and Incentive-Compatible Online Learning | Rupert Freeman; David Pennock; Chara Podimata; Jennifer Wortman Vaughan; | Our goal is twofold. First, we want the learning algorithm to be no-regret with respect to the best fixed expert in hindsight. Second, we want incentive compatibility, a guarantee that each expert’s best strategy is to report his true beliefs about the realization of each event. |

765 | Probing Emergent Semantics in Predictive Agents via Question Answering | Abhishek Das; Federico Carnevale; Hamza Merzic; Laura Rimell; Rosalia Schneider; Josh Abramson; Alden Hung; Arun Ahuja; Stephen Clark; Greg Wayne; Feilx Hill; | We propose question-answering as a general paradigm to decode and understand the representations that such agents develop, applying our method to two recent approaches to predictive modelling – action-conditional CPC (Guo et al., 2018) and SimCore (Gregor et al., 2019). |

766 | Meta-learning with Stochastic Linear Bandits | Leonardo Cella; Alessandro Lazaric; Massimiliano Pontil; | We investigate meta-learning procedures in the setting of stochastic linear bandits tasks. |

767 | A Unified Theory of Decentralized SGD with Changing Topology and Local Updates | Anastasiia Koloskova; Nicolas Loizou; Sadra Boreiri; Martin Jaggi; Sebastian Stich; | In this paper we introduce a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities. |

768 | AdaScale SGD: A User-Friendly Algorithm for Distributed Training | Tyler Johnson; Pulkit Agrawal; Haijie Gu; Carlos Guestrin; | We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. |

769 | Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning | Dipendra Misra; Mikael Henaff; Akshay Krishnamurthy; John Langford; | We present an algorithm, HOMER, for exploration and reinforcement learning in rich observation environments that are summarizable by an unknown latent state space. |

770 | Logistic Regression for Massive Data with Rare Events | HaiYing Wang; | This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). |

771 | Automated Synthetic-to-Real Generalization | Wuyang Chen; Zhiding Yu; Zhangyang Wang; Anima Anandkumar; | We treat this as a learning without forgetting problem and propose a learning-to-optimize (L2O) method to automate layer-wise learning rates. |

772 | Online Learning with Dependent Stochastic Feedback Graphs | Corinna Cortes; Giulia DeSalvo; Claudio Gentile; Mehryar Mohri; Ningshan Zhang; | We study a challenging scenario where feedback graphs vary stochastically with time and, more importantly, where graphs and losses are dependent. |

773 | Sparse Sinkhorn Attention | Yi Tay; Dara Bahri; Liu Yang; Don Metzler; Da-Cheng Juan; | We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. |

774 | Online Continual Learning from Imbalanced Data | Aristotelis Chrysakis; Marie-Francine Moens; | More importantly, we introduce a new memory population approach, which we call class-balancing reservoir sampling (CBRS). |
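
As a rough illustration of the class-balancing memory idea described in entry 774 (a hypothetical sketch based only on the highlight, not the paper's exact CBRS algorithm): while the buffer has free space, store every incoming sample; once full, an incoming sample from an under-represented class evicts a random sample from the currently largest class, and samples from already well-represented classes compete via a per-class reservoir step.

```python
import random
from collections import defaultdict

class ClassBalancingReservoir:
    """Hypothetical sketch of a class-balancing reservoir buffer.
    Not the paper's exact algorithm; names and the eviction rule
    are assumptions based on the one-sentence highlight."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []                 # stored (x, y) pairs
        self.seen = defaultdict(int)     # items seen so far, per class

    def _largest_class(self):
        counts = defaultdict(int)
        for _, y in self.buffer:
            counts[y] += 1
        return max(counts, key=counts.get), counts

    def add(self, x, y):
        self.seen[y] += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((x, y))   # memory not full: keep everything
            return
        largest, counts = self._largest_class()
        if y != largest and counts[y] < counts[largest]:
            # incoming class is under-represented: evict from largest class
            idxs = [i for i, (_, c) in enumerate(self.buffer) if c == largest]
            self.buffer[random.choice(idxs)] = (x, y)
        else:
            # class already well-represented: per-class reservoir step
            if random.random() < counts[y] / self.seen[y]:
                idxs = [i for i, (_, c) in enumerate(self.buffer) if c == y]
                self.buffer[random.choice(idxs)] = (x, y)
```

On a heavily imbalanced stream, this keeps minority-class samples that plain reservoir sampling would likely discard.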

775 | Differentially Private Set Union | Pankaj Gulhane; Sivakanth Gopi; Janardhan Kulkarni; Judy Hanwen Shen; Milad Shokouhi; Sergey Yekhanin; | We design two new algorithms, one using Laplace noise and the other Gaussian noise, as specific instances of policies satisfying the contractive properties. |

776 | The continuous categorical: a novel simplex-valued exponential family | Elliott Gordon-Rodriguez; Gabriel Loaiza-Ganem; John Cunningham; | We resolve these limitations by introducing a novel exponential family of distributions for modeling simplex-valued data – the continuous categorical, which arises as a nontrivial multivariate generalization of the recently discovered continuous Bernoulli. |

777 | Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation | Yaqi Duan; Zeyu Jia; Mengdi Wang; | This paper studies the statistical theory of batch data reinforcement learning with function approximation. |

778 | Enhanced POET: Open-ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions | Rui Wang; Joel Lehman; Aditya Rawal; Jiale Zhi; Yulun Li; Jeffrey Clune; Kenneth Stanley; | Here we introduce and empirically validate two new innovations to the original algorithm, as well as two external innovations designed to help elucidate its full potential. |

779 | Set Functions for Time Series | Max Horn; Michael Moor; Christian Bock; Bastian Rieck; Karsten Borgwardt; | This paper proposes a novel approach for classifying irregularly-sampled time series with unaligned measurements, focusing on high scalability and data efficiency. |

780 | Individual Calibration with Randomized Forecasting | Shengjia Zhao; Tengyu Ma; Stefano Ermon; | We design a training objective to enforce individual calibration and use it to train randomized regression functions. |

781 | Bayesian Differential Privacy for Machine Learning | Aleksei Triastcyn; Boi Faltings; | We propose Bayesian differential privacy (BDP), which takes into account the data distribution to provide more practical privacy guarantees. |

782 | Causal Modeling for Fairness In Dynamical Systems | Elliot Creager; David Madras; Toniann Pitassi; Richard Zemel; | We discuss causal directed acyclic graphs (DAGs) as a unifying framework for the recent literature on fairness in such dynamical systems. |

783 | Learning General-Purpose Controllers via Locally Communicating Sensorimotor Modules | Wenlong Huang; Igor Mordatch; Deepak Pathak; | We propose a policy expressed as a collection of identical modular neural network components for each of the agent’s actuators. |

784 | Visual Grounding of Learned Physical Models | Yunzhu Li; Toru Lin; Kexin Yi; Daniel Bear; Daniel Yamins; Jiajun Wu; Josh Tenenbaum; Antonio Torralba; | In this work, we present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors. |

785 | Task-Oriented Active Perception and Planning in Environments with Partially Known Semantics | Mahsa Ghasemi; Erdem Bulgur; Ufuk Topcu; | We develop a planning strategy that takes the semantic uncertainties into account and by doing so provides probabilistic guarantees on the task success. |

786 | Test-Time Training for Generalization under Distribution Shifts | Yu Sun; Xiaolong Wang; Zhuang Liu; John Miller; Alexei Efros; Moritz Hardt; | We introduce a general approach, called test-time training, for improving the performance of predictive models when training and test data come from different distributions. |

787 | AutoGAN-Distiller: Searching to Compress Generative Adversarial Networks | Yonggan Fu; Wuyang Chen; Haotao Wang; Haoran Li; Yingyan Lin; Zhangyang Wang; | Inspired by the recent success of AutoML in deep compression, we introduce AutoML to GAN compression and develop an AutoGAN-Distiller (AGD) framework. |

788 | Associative Memory in Iterated Overparameterized Sigmoid Autoencoders | Yibo Jiang; Cengiz Pehlevan; | In this work, we theoretically analyze this behavior for sigmoid networks by leveraging recent developments in deep learning theories, especially the Neural Tangent Kernel (NTK) theory. |

789 | Adaptive Reward-Poisoning Attacks against Reinforcement Learning | Xuezhou Zhang; Yuzhe Ma; Adish Singla; Jerry Zhu; | We categorize such attacks by the infinity-norm constraint on $\delta_t$: We provide a lower threshold below which reward-poisoning attack is infeasible and RL is certified to be safe; we provide a corresponding upper threshold above which the attack is feasible. |

790 | Planning to Explore via Latent Disagreement | Ramanan Sekar; Oleh Rybkin; Kostas Daniilidis; Pieter Abbeel; Danijar Hafner; Deepak Pathak; | This work focuses on task-agnostic exploration, where an agent explores a visual environment without yet knowing the tasks it will later be asked to solve. |

791 | Defense Through Diverse Directions | Christopher Bender; Yang Li; Yifeng Shi; Michael K. Reiter; Junier Oliva; | In this work we develop a novel Bayesian neural network methodology to achieve strong adversarial robustness without the need for online adversarial training. |

792 | Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels | Lu Jiang; Di Huang; Mason Liu; Weilong Yang; | First, we establish the first benchmark of controlled real label noise (obtained from image search). This new benchmark will enable us to study the image search label noise in a controlled setting for the first time. The second contribution is a simple but highly effective method to overcome both synthetic and real noisy labels. |

793 | Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks | David Stutz; Matthias Hein; Bernt Schiele; | Our confidence-calibrated adversarial training (CCAT) tackles this problem by biasing the model towards low confidence predictions on adversarial examples. |

794 | Online Control of the False Coverage Rate and False Sign Rate | Asaf Weinstein; Aaditya Ramdas; | We propose a novel solution to the problem which only requires the scientist to be able to construct a marginal CI at any given level. |

795 | Online Convex Optimization in the Random Order Model | Dan Garber; Gal Korcia; Kfir Levy; | In this work we consider a natural random-order version of the OCO model, in which the adversary can choose the set of loss functions, but does not get to choose the order in which they are supplied to the learner; Instead, they are observed in uniformly random order. |

796 | A Flexible Latent Space Model for Multilayer Networks | Xuefei Zhang; Songkai Xue; Ji Zhu; | This paper proposes a flexible latent space model for multilayer networks for the purpose of capturing such characteristics. |

797 | Estimation of Bounds on Potential Outcomes For Decision Making | Maggie Makar; Fredrik Johansson; John Guttag; David Sontag; | Our theoretical analysis highlights a tradeoff between the complexity of the learning task and the confidence with which the resulting bounds cover the true potential outcomes. Guided by our theoretical findings, we develop an algorithm for learning upper and lower bounds on the potential outcomes under treatment and non-treatment. |

798 | Deep Gaussian Markov Random Fields | Per Sidén; Fredrik Lindsten; | We establish a formal connection between GMRFs and convolutional neural networks (CNNs). |

799 | Generalization Error of Generalized Linear Models in High Dimensions | Melikasadat Emami; Mojtaba Sahraee-Ardakan; Parthe Pandit; Sundeep Rangan; Alyson Fletcher; | We provide a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems. |

800 | Poisson Learning: Graph Based Semi-Supervised Learning At Very Low Label Rates | Jeff Calder; Brendan Cook; Matthew Thorpe; Dejan Slepcev; | We propose a new framework, called Poisson learning, for graph based semi-supervised learning at very low label rates. |

801 | Sequential Transfer in Reinforcement Learning with a Generative Model | Andrea Tirinzoni; Riccardo Poiani; Marcello Restelli; | In this work, we focus on the second objective when the agent has access to a generative model of state-action pairs. |

802 | Finite-Time Convergence in Continuous-Time Optimization | Orlando Romero; mouhacine Benosman; | In this paper, we investigate a Lyapunov-like differential inequality that allows us to establish finite-time stability of a continuous-time state-space dynamical system represented via a multivariate ordinary differential equation or differential inclusion. |

803 | Feature Quantization Improves GAN Training | Yang Zhao; Chunyuan Li; Ping Yu; Jianfeng Gao; Changyou Chen; | In this work, we propose feature quantization (FQ) for the discriminator, to embed both true and fake data samples into a shared discrete space. |

804 | Temporal Logic Point Processes | Shuang Li; Lu Wang; Ruizhi Zhang; Xiaofu Chang; Xuqin Liu; Yao Xie; Yuan Qi; Le Song; | We propose a modeling framework for event data, which excels in small data regime with the ability to incorporate domain knowledge. |

805 | Hallucinative Topological Memory for Zero-Shot Visual Planning | Thanard Kurutach; Kara Liu; Aviv Tamar; Pieter Abbeel; Christine Tung; | Here, instead, we propose a simple VP method that plans directly in image space and displays competitive performance. |

806 | Learning Attentive Meta-Transfer | Jaesik Yoon; Gautam Singh; Sungjin Ahn; | To resolve this, we propose a new attention mechanism, Recurrent Memory Reconstruction (RMR), and demonstrate that providing an imaginary context that is recurrently updated and reconstructed with interaction is crucial in achieving effective attention for meta-transfer learning. |

807 | Optimizing Dynamic Structures with Bayesian Generative Search | Minh Hoang; Carleton Kingsford; | This paper instead proposes \textbf{DTERGENS}, a novel generative search framework that constructs and optimizes a high-performance composite kernel expressions generator. |

808 | Amortized Finite Element Analysis for Fast PDE-Constrained Optimization | Tianju Xue; Alex Beatson; Sigrid Adriaenssens; Ryan Adams; | In this paper we propose amortized finite element analysis (AmorFEA), in which a neural network learns to produce accurate PDE solutions, while preserving many of the advantages of traditional finite element methods. |

809 | Preselection Bandits | Viktor Bengs; Eyke Hüllermeier; | In this paper, we introduce the Preselection Bandit problem, in which the learner preselects a subset of arms (choice alternatives) for a user, which then chooses the final arm from this subset. |

810 | Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates | Yang Liu; Hongyi Guo; | In this work, we introduce a new family of loss functions that we name as peer loss functions, which enables learning from noisy labels that does not require a priori specification of the noise rates. Our approach uses a standard empirical risk minimization (ERM) framework with peer loss functions. |
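
To make the peer-loss idea of entry 810 concrete (a hedged sketch of the commonly described construction, not necessarily the paper's exact formulation): each sample's base loss is penalized by the loss on a randomly paired "peer" prediction and an independently drawn peer label; the `alpha` weight here is an assumption.

```python
import random

def peer_loss(loss_fn, preds, labels, alpha=1.0, rng=random):
    """Sketch of a peer loss: base loss on each (pred, label) pair
    minus the loss on a randomly mismatched peer pair. `alpha` and the
    pairing scheme are illustrative assumptions."""
    n = len(preds)
    total = 0.0
    for i in range(n):
        j = rng.randrange(n)  # random peer prediction index
        k = rng.randrange(n)  # independent random peer label index
        total += loss_fn(preds[i], labels[i]) - alpha * loss_fn(preds[j], labels[k])
    return total / n
```

The subtracted peer term discourages the degenerate strategy of fitting the (noisy) labels blindly, since a constant predictor pays the same loss on mismatched pairs as on true pairs.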

811 | Rank Aggregation from Pairwise Comparisons in the Presence of Adversarial Corruptions | Prathamesh Patil; Arpit Agarwal; Shivani Agarwal; Sanjeev Khanna; | In this paper, we initiate the study of robustness in rank aggregation under the popular Bradley-Terry-Luce (BTL) model for pairwise comparisons. |

812 | Extrapolation for Large-batch Training in Deep Learning | Tao Lin; Lingjing Kong; Sebastian Stich; Martin Jaggi; | To alleviate these drawbacks, we propose to use instead computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. |
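
The extragradient (extrapolation) step mentioned in entry 812 can be sketched in a few lines for plain minimization (a minimal illustration of the classical extragradient update, not the paper's large-batch training recipe): take a trial step, then update from the original point using the gradient evaluated at the trial point.

```python
def extragradient_step(grad, x, lr):
    """One classical extragradient step for minimization.
    `grad` maps a point (list of floats) to its gradient;
    the large-batch specifics of the paper are not reproduced here."""
    # Extrapolate: trial step from the current point
    x_trial = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    # Update from the ORIGINAL point using the trial-point gradient
    g_trial = grad(x_trial)
    return [xi - lr * gi for xi, gi in zip(x, g_trial)]
```

Evaluating the gradient at the look-ahead point is what stabilizes the trajectory relative to plain gradient descent, especially near oscillatory regions.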

813 | VideoOneNet: Bidirectional Convolutional Recurrent OneNet with Trainable Data Steps for Video Processing | Zoltán Milacski; Barnabás Póczos; Andras Lorincz; | In this work, we make two contributions, both facilitating end-to-end learning using backpropagation. |

814 | Bio-Inspired Hashing for Unsupervised Similarity Search | Chaitanya Ryali; John Hopfield; Leopold Grinberg; Dmitry Krotov; | Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. |

815 | MetaFun: Meta-Learning with Iterative Functional Updates | Jin Xu; Jean-Francois Ton; Hyunjik Kim; Adam Kosiorek; Yee Whye Teh; | We develop a functional encoder-decoder approach to supervised meta-learning, where labeled data is encoded into an infinite-dimensional functional representation rather than a finite-dimensional one. |

816 | Learning and Simulation in Generative Structured World Models | Zhixuan Lin; Yi-Fu Wu; Skand Peri; Bofeng Fu; Jindong Jiang; Sungjin Ahn; | In this paper, we introduce Generative Structured World Models (G-SWM). |

817 | Random Hypervolume Scalarizations for Provable Multi-Objective Black Box Optimization | Richard Zhang; Daniel Golovin; | In this paper, we consider multi-objective optimization, where $f(x)$ outputs a vector of possibly competing objectives and the goal is to converge to the Pareto frontier. |

818 | SGD Learns One-Layer Networks in WGANs | Qi Lei; Jason Lee; Alexandros Dimakis; Constantinos Daskalakis; | In this paper, we show that, when the generator is a one-layer network, stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity. |

819 | Implicit Class-Conditioned Domain Alignment for Unsupervised Domain Adaptation | Xiang Jiang; Qicheng Lao; Stan Matwin; Mohammad Havaei; | We present an approach for unsupervised domain adaptation—with a strong focus on practical considerations of within-domain class imbalance and between-domain class distribution shift—from a class-conditioned domain alignment perspective. |

820 | Interference and Generalization in Temporal Difference Learning | Emmanuel Bengio; Joelle Pineau; Doina Precup; | We study the link between generalization and interference in temporal-difference (TD) learning. |

821 | CoMic: Co-Training and Mimicry for Reusable Skills | Leonard Hasenclever; Fabio Pardo; Raia Hadsell; Nicolas Heess; Josh Merel; | We study the problem of learning reusable humanoid skills by imitating motion capture data and co-training with complementary tasks. |

822 | Provably Efficient Model-based Policy Adaptation | Yuda Song; Aditi Mavalankar; Wen Sun; Sicun Gao; | We propose new model-based mechanisms that are able to make online adaptation in unseen target environments, by combining ideas from no-regret online learning and adaptive control. |

823 | Optimizer Benchmarking Needs to Account for Hyperparameter Tuning | Prabhu Teja Sivaprasad; Florian Mai; Thijs Vogels; Martin Jaggi; Francois Fleuret; | In this work, we argue that a fair assessment of optimizers’ performance must take the computational cost of hyperparameter tuning into account, i.e., how easy it is to find good hyperparameter configurations using an automatic hyperparameter search. |

824 | From Local SGD to Local Fixed Point Methods for Federated Learning | Grigory Malinovsky; Dmitry Kovalev; Elnur Gasanov; Laurent Condat; Peter Richtarik; | In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. |

825 | Unraveling Meta-Learning: Understanding Feature Representations for Few-Shot Tasks | Micah Goldblum; Liam Fowl; Renkun Ni; Steven Reich; Valeriia Cherepanova; Tom Goldstein; | We develop a better understanding of the underlying mechanics of meta-learning and the difference between models trained using meta-learning and models which are trained classically. |

826 | Federated Learning with Only Positive Labels | Felix Xinnan Yu; Ankit Singh Rawat; Aditya Menon; Sanjiv Kumar; | To address this problem, we propose a generic framework for training with only positive labels, namely Federated Averaging with Spreadout (FedAwS), where the server imposes a geometric regularizer after each round to encourage classes spread out in the embedding space. |

827 | Causal Inference using Gaussian Processes with Structured Latent Confounders | Sam Witty; Kenta Takatsu; David Jensen; Vikash Mansinghka; | This paper shows how to model latent confounders that have this structure and thereby improve estimates of causal effects. |

828 | T-Basis: a Compact Representation for Neural Networks | Anton Obukhov; Maxim Rakhuba; Menelaos Kanakis; Stamatios Georgoulis; Dengxin Dai; Luc Van Gool; | We introduce T-Basis, a novel concept for a compact representation of a set of tensors, each of an arbitrary shape, which is often seen in Neural Networks. |

829 | Familywise Error Rate Control by Interactive Unmasking | Boyan Duan; Aaditya Ramdas; Larry Wasserman; | We propose a method for multiple hypothesis testing with familywise error rate (FWER) control, called the i-FWER test. |

830 | Learning to Branch for Multi-Task Learning | Pengsheng Guo; Chen-Yu Lee; Daniel Ulbricht; | In this work, we present an automated multi-task learning algorithm that learns where to share or branch within a network, designing an effective network topology that is directly optimized for multiple objectives across tasks. |

831 | Augmenting Continuous Time Bayesian Networks with Clocks | Nicolai Engelmann; Dominik Linzner; Heinz Koeppl; | In this work, we lift its restriction to exponential survival times to arbitrary distributions. |

832 | IPBoost – Non-Convex Boosting via Integer Programming | Sebastian Pokutta; Marc Pfetsch; | In this paper we explore non-convex boosting in classification by means of integer programming and demonstrate real-world practicability of the approach while circumventing shortcomings of convex boosting approaches. |

833 | On Efficient Constructions of Checkpoints | Yu Chen; Zhenming Liu; Bin Ren; Xin Jin; | In this paper, we propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint). |

834 | Feature Selection using Stochastic Gates | Yutaro Yamada; Ofir Lindenbaum; Sahand Negahban; Yuval Kluger; | In this study, we propose a method for feature selection in non-linear function estimation problems. |

835 | How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization | Chris Finlay; Joern-Henrik Jacobsen; Levon Nurbekyan; Adam Oberman; | In this paper, we overcome this apparent difficulty by introducing a theoretically-grounded combination of both optimal transport and stability regularizations which encourage neural ODEs to prefer simpler dynamics out of all the dynamics that solve a problem well. |

836 | Evaluating Lossy Compression Rates of Deep Generative Models | Sicong Huang; Alireza Makhzani; Yanshuai Cao; Roger Grosse; | In this work, we argue that the log-likelihood metric by itself cannot represent all the different performance characteristics of generative models, and propose to use rate distortion curves to evaluate and compare deep generative models. |

837 | Mix-n-Match : Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning | Jize Zhang; Bhavya Kailkhura; T. Yong-Jin Han; | We introduce the following desiderata for uncertainty calibration: (a) accuracy-preserving, (b) data-efficient, and (c) high expressive power. |

838 | Learning Adversarially Robust Representations via Worst-Case Mutual Information Maximization | Sicheng Zhu; Xiao Zhang; David Evans; | We develop a general definition of representation vulnerability that captures the maximum change of mutual information between the input and output distributions, under the worst-case input distribution perturbation. We prove a theorem that establishes a lower bound on the minimum adversarial risk that can be achieved for any downstream classifier based on this definition. |

839 | Stochastic Regret Minimization in Extensive-Form Games | Gabriele Farina; Christian Kroer; Tuomas Sandholm; | In this paper we develop a new framework for developing stochastic regret minimization methods. |

840 | Simultaneous Inference for Massive Data: Distributed Bootstrap | Yang Yu; Shih-Kang Chao; Guang Cheng; | In this paper, we propose a bootstrap method applied to massive data processed distributedly in a large number of machines. |

841 | Stabilizing Differentiable Architecture Search via Perturbation-based Regularization | Xiangning Chen; Cho-Jui Hsieh; | Based on this observation, we propose a perturbation-based regularization, named SmoothDARTS (SDARTS), to smooth the loss landscape and improve the generalizability of DARTS. |

842 | Boosting Frank-Wolfe by Chasing Gradients | Cyrille Combettes; Sebastian Pokutta; | We propose to speed up the Frank-Wolfe algorithm by better aligning the descent direction with that of the negative gradient via a subroutine. |

843 | Concise Explanations of Neural Networks using Adversarial Training | Prasad Chalasani; Jiefeng Chen; Amrita Roy Chowdhury; Xi Wu; Somesh Jha; | Our first contribution is a theoretical exploration of how these two properties (when using IG-based attributions) are related to adversarial training, for a class of 1-layer networks (which includes logistic regression models for binary and multi-class classification); for these networks we show that (a) adversarial training using an $\ell_\infty$-bounded adversary produces models with sparse attribution vectors, and (b) natural model-training while encouraging stable explanations (via an extra term in the loss function), is equivalent to adversarial training. |

844 | Quantum Boosting | Srinivasan Arunachalam; Reevu Maity; | In this paper, we show how quantum techniques can improve the time complexity of classical AdaBoost. |

845 | Information-Theoretic Local Minima Characterization and Regularization | Zhiwei Jia; Hao Su; | Specifically, based on the observed Fisher information we propose a metric both strongly indicative of generalizability of local minima and effectively applied as a practical regularizer. |

846 | Kernel interpolation with continuous volume sampling | Ayoub Belhadji; Rémi Bardenet; Pierre Chainais; | We introduce and analyse continuous volume sampling (VS), the continuous counterpart, for choosing node locations, of a discrete distribution introduced in (Deshpande & Vempala, 2006). |

847 | Efficient Identification in Linear Structural Causal Models with Auxiliary Cutsets | Daniel Kumor; Carlos Cinelli; Elias Bareinboim; | We develop a new polynomial-time algorithm for identification in linear Structural Causal Models that subsumes previous non-exponential identification methods when applied to direct effects, and unifies several disparate approaches to identification in linear systems. |

848 | Partial Trace Regression and Low-Rank Kraus Decomposition | Hachem Kadri; Stephane Ayache; Riikka Huusari; Alain Rakotomamonjy; Liva Ralaivola; | We here introduce a yet more general model, namely the partial trace regression model, a family of linear mappings from matrix-valued inputs to matrix-valued outputs; this model subsumes the trace regression model and thus the linear regression model. |

849 | Constant Curvature Graph Convolutional Networks | Gregor Bachmann; Gary Becigneul; Octavian Ganea; | Here, we bridge this gap by proposing mathematically grounded generalizations of graph convolutional networks (GCN) to (products of) constant curvature spaces. |

850 | Educating Text Autoencoders: Latent Representation Guidance via Denoising | Tianxiao Shen; Jonas Mueller; Regina Barzilay; Tommi Jaakkola; | To remedy this issue, we augment adversarial autoencoders with a denoising objective where original sentences are reconstructed from perturbed versions (referred to as DAAE). We prove that this simple modification guides the latent space geometry of the resulting model by encouraging the encoder to map similar texts to similar latent representations. |

851 | Generalization via Derandomization | Jeffrey Negrea; Daniel Roy; Gintare Karolina Dziugaite; | We propose to study the generalization error of a learned predictor h^ in terms of that of a surrogate (potentially randomized) classifier that is coupled to h^ and designed to trade empirical risk for control of generalization error. |

852 | Inductive Relation Prediction by Subgraph Reasoning | Komal Teru; Etienne Denis; Will Hamilton; | Here, we propose a graph neural network based relation prediction framework, GraIL, that reasons over local subgraph structures and has a strong inductive bias to learn entity-independent relational semantics. |

853 | Logarithmic Regret for Online Control with Adversarial Noise | Dylan Foster; Max Simchowitz; | We propose a novel analysis that combines a new variant of the performance difference lemma with techniques from optimal control, allowing us to reduce online control to online prediction with delayed feedback. |

854 | Multiresolution Tensor Learning for Efficient and Interpretable Spatial Analysis | Jung Yeon Park; Kenneth Carr; Stephan Zheng; Yisong Yue; Rose Yu; | We develop a novel Multiresolution Tensor Learning (MRTL) algorithm for efficiently learning interpretable spatial patterns. |

855 | Customizing ML Predictions for Online Algorithms | Keerti Anand; Rong Ge; Debmalya Panigrahi; | In this paper, we ask the complementary question: can we redesign ML algorithms to provide better predictions for online algorithms? |

856 | Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning | Silviu Pitis; Harris Chan; Stephen Zhao; Bradly Stadie; Jimmy Ba; | We propose to optimize this objective by having the agent pursue past achieved goals in sparsely explored areas of the goal space, which focuses exploration on the frontier of the achievable goal set. |

857 | Recht-Re Noncommutative Arithmetic-Geometric Mean Conjecture is False | Zehua Lai; Lek-Heng Lim; | We will show that the Recht–Re conjecture is false for general n. |

858 | Predictive Multiplicity in Classification | Charles Marx; Flavio Calmon; Berk Ustun; | In this paper, we define predictive multiplicity as the ability of a prediction problem to admit competing models with conflicting predictions. |

859 | Word-Level Speech Recognition With a Letter to Word Encoder | Ronan Collobert; Awni Hannun; Gabriel Synnaeve; | We propose a direct-to-word sequence model which uses a word network to learn word embeddings from letters. |

860 | Reducing Sampling Error in Batch Temporal Difference Learning | Brahma Pavse; Ishan Durugkar; Josiah Hanna; Peter Stone; | To address this limitation, we introduce \textit{policy sampling error corrected}-TD(0) (PSEC-TD(0)). |

861 | Adaptive Sampling for Estimating Probability Distributions | Shubhanshu Shekhar; Tara Javidi; Mohammad Ghavamzadeh; | We consider the problem of allocating a fixed budget of samples to a finite set of discrete distributions to learn them uniformly well (minimizing the maximum error) in terms of four common distance measures: $\ell_2^2$, $\ell_1$, $f$-divergence, and separation distance. |

862 | Adversarial Filters of Dataset Biases | Ronan Le Bras; Swabha Swayamdipta; Chandra Bhagavatula; Rowan Zellers; Matthew Peters; Ashish Sabharwal; Yejin Choi; | We investigate one recently proposed approach, AFLite, which adversarially filters such dataset biases, as a means to mitigate the prevalent overestimation of machine performance. We provide a theoretical understanding for AFLite, by situating it in the generalized framework for optimum bias reduction. |

863 | Black-Box Variational Inference as a Parametric Approximation to Langevin Dynamics | Matthew Hoffman; Yian Ma; | In this paper, we analyze gradient-based MCMC and VI procedures and find theoretical and empirical evidence that these procedures are not as different as one might think. |

864 | Faster Graph Embeddings via Coarsening | Matthew Fahrbach; Gramoz Goranci; Sushant Sachdeva; Richard Peng; Chi Wang; | To address this, we present an efficient graph coarsening approach, based on Schur complements, for computing the embedding of the relevant vertices. |

865 | Efficient non-conjugate Gaussian process factor models for spike count data using polynomial approximations | Stephen Keeley; David Zoltowski; Jonathan Pillow; Spencer Smith; Yiyi Yu; | Here we address this obstacle by introducing a fast, approximate inference method for non-conjugate GPFA models. |

866 | Multigrid Neural Memory | Tri Huynh; Michael Maire; Matthew Walter; | We introduce a radical new approach to endowing neural networks with access to long-term and large-scale memory. |

867 | Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings | Jesse Zhang; Brian Cheung; Chelsea Finn; Sergey Levine; Dinesh Jayaraman; | Building on this intuition, we propose risk-averse domain adaptation (RADA). |

868 | Adversarial Nonnegative Matrix Factorization | Lei Luo; Yanfu Zhang; Heng Huang; | To overcome this limitation, we propose a novel Adversarial NMF (ANMF) approach in which an adversary can exercise some control over the perturbed data generation process. |

869 | Aligned Cross Entropy for Non-Autoregressive Machine Translation | Marjan Ghazvininejad; Vladimir Karpukhin; Luke Zettlemoyer; Omer Levy; | In this paper, we propose aligned cross entropy (AXE) as an alternate loss function for training of non-autoregressive models. |

870 | Model-Agnostic Characterization of Fairness Trade-offs | Joon Kim; Jiahao Chen; Ameet Talwalkar; | We propose a diagnostic to enable practitioners to explore these trade-offs without training a single model. |

871 | A Distributional Framework For Data Valuation | Amirata Ghorbani; Michael Kim; James Zou; | To address these limitations, we propose a novel framework — distributional Shapley — where the value of a point is defined in the context of an underlying data distribution. |

872 | Supervised Quantile Normalization for Low Rank Matrix Factorization | Marco Cuturi; Olivier Teboul; Jonathan Niles-Weed; Jean-Philippe Vert; | We propose in this work to learn these normalization operators jointly with the factorization itself. |

873 | AR-DAE: Towards Unbiased Neural Entropy Gradient Estimation | Jae Hyun Lim; Aaron Courville; Christopher Pal; Chin-Wei Huang; | In this paper, we propose the amortized residual denoising autoencoder (AR-DAE) to approximate the gradient of the log density function, which can be used to estimate the gradient of entropy. |

874 | Bridging the Gap Between f-GANs and Wasserstein GANs | Jiaming Song; Stefano Ermon; | To overcome this limitation, we propose a new training objective where we additionally optimize over a set of importance weights over the generated samples. |

875 | “Other-Play” for Zero-Shot Coordination | Hengyuan Hu; Alexander Peysakhovich; Adam Lerer; Jakob Foerster; | We introduce a novel learning algorithm called other-play (OP), that enhances self-play by looking for more robust strategies. |

876 | Correlation Clustering with Asymmetric Classification Errors | Jafar Jafarov; Sanchit Kalhan; Konstantin Makarychev; Yury Makarychev; | We study the correlation clustering problem under the following assumption: Every "similar" edge $e$ has weight $w_e \in [ \alpha w, w ]$ and every "dissimilar" edge $e$ has weight $w_e \geq \alpha w$ (where $\alpha \leq 1$ and $w > 0$ is a scaling parameter). |

877 | An Optimistic Perspective on Offline Deep Reinforcement Learning | Rishabh Agarwal; Dale Schuurmans; Mohammad Norouzi; | To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. |

878 | Neural Topic Modeling with Continual Lifelong Learning | Pankaj Gupta; Yatin Chaudhary; Thomas Runkler; Hinrich Schuetze; | To address the problem, we propose a lifelong learning framework for neural topic modeling that can continuously process streams of document collections, accumulate topics and guide future topic modeling tasks by knowledge transfer from several sources to better deal with the sparse data. |

879 | Learning and Evaluating Contextual Embedding of Source Code | Aditya Kanade; Petros Maniatis; Gogul Balakrishnan; Kensen Shi; | In this paper, we alleviate this gap by curating a code-understanding benchmark and evaluating a learned contextual embedding of source code. |

880 | Uncertainty quantification for nonconvex tensor completion: Confidence intervals, heteroscedasticity and optimality | Changxiao Cai; H. Vincent Poor; Yuxin Chen; | We study the distribution and uncertainty of nonconvex optimization for noisy tensor completion — the problem of estimating a low-rank tensor given incomplete and corrupted observations of its entries. |

881 | Learning with Good Feature Representations in Bandits and in RL with a Generative Model | Gellért Weisz; Tor Lattimore; Csaba Szepesvari; | Thus, features are useful when the approximation error is small relative to the dimensionality of the features. The idea is applied to stochastic bandits and reinforcement learning with a generative model where the learner has access to d-dimensional linear features that approximate the action-value functions for all policies to an accuracy of ε. |

882 | Angular Visual Hardness | Beidi Chen; Weiyang Liu; Zhiding Yu; Jan Kautz; Anshumali Shrivastava; Animesh Garg; Anima Anandkumar; | In this paper, we propose angular visual hardness (AVH), a score given by the normalized angular distance between the sample feature embedding and the target classifier to measure sample hardness. |

883 | Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling | Will Grathwohl; Kuan-Chieh Wang; Joern-Henrik Jacobsen; David Duvenaud; Richard Zemel; | We present a new method for evaluating and training unnormalized density models. |

884 | Variance Reduction and Quasi-Newton for Particle-Based Variational Inference | Michael Zhu; Chang Liu; Jun Zhu; | In this paper, we find that existing ParVI approaches converge insufficiently fast under sample quality metrics, and we propose a novel variance reduction and quasi-Newton preconditioning framework for all ParVIs, by leveraging the Riemannian structure of the Wasserstein space and advanced Riemannian optimization algorithms. |

885 | Better depth-width trade-offs for neural networks through the lens of dynamical systems | Evangelos Chatziafratis; Ioannis Panageas; Sai Ganesh Nagarajan; | In this work, we strengthen the connection with dynamical systems and we improve the existing width lower bounds along several aspects. |

886 | Stochastic Coordinate Minimization with Progressive Precision for Stochastic Convex Optimization | Sudeep Salgia; Qing Zhao; Sattar Vakili; | A framework based on iterative coordinate minimization (CM) is developed for stochastic convex optimization. |

887 | Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations | Florian Tramer; Jens Behrmann; Nicholas Carlini; Nicolas Papernot; Joern-Henrik Jacobsen; | We demonstrate fundamental tradeoffs between these two types of adversarial examples. |

888 | Learning From Strategic Agents: Accuracy, Improvement, and Causality | Yonadav Shavit; Benjamin Edelman; Brian Axelrod; | As our main contribution, we provide the first algorithms for learning accuracy-optimizing, improvement-optimizing, and causal-precision-optimizing linear regression models directly from data, without prior knowledge of agents’ possible actions. |

889 | Causal Structure Discovery from Distributions Arising from Mixtures of DAGs | Basil Saeed; Snigdha Panigrahi; Caroline Uhler; | Since the mixing variable is latent, we consider causal structure discovery algorithms such as FCI that can deal with latent variables. |

890 | Explainable and Discourse Topic-aware Neural Language Understanding | Yatin Chaudhary; Pankaj Gupta; Hinrich Schuetze; | We present a novel neural composite language model that exploits both the latent and explainable topics along with topical discourse at sentence-level in a joint learning framework of topic and language models. |

891 | Understanding Contrastive Representation Learning through Geometry on the Hypersphere | Tongzhou Wang; Phillip Isola; | In this work, we identify two key properties related to the contrastive loss: (1) alignment (closeness) of features from positive pairs, and (2) uniformity of the induced distribution of the (normalized) features on the hypersphere. |

892 | On Learning Language-Invariant Representations for Universal Machine Translation | Han Zhao; Junjie Hu; Andrej Risteski; | In this paper, we take one step towards better understanding of universal machine translation by first proving an impossibility theorem in the general case. |

893 | Compressive sensing with un-trained neural networks: Gradient descent finds a smooth approximation | Reinhard Heckel; Mahdi Soltanolkotabi; | For signal recovery from a few measurements, however, un-trained convolutional networks have an intriguing self-regularizing property: Even though the network can perfectly fit any image, the network recovers a natural image from few measurements when trained with gradient descent until convergence. In this paper, we demonstrate this property numerically and study it theoretically. |

894 | Representing Unordered Data Using Multiset Automata and Complex Numbers | Justin DeBenedetto; David Chiang; | We propose to represent multisets using complex-weighted multiset automata and show how the multiset representations of certain existing neural architectures can be viewed as special cases of ours. |

895 | Mutual Transfer Learning for Massive Data | Ching-Wei Cheng; Xingye Qiao; Guang Cheng; | In this article, we study a new paradigm called mutual transfer learning where among many heterogeneous data domains, every data domain could potentially be the target of interest, and it could also be a useful source to help the learning in other data domains. |

896 | The Differentiable Cross-Entropy Method | Brandon Amos; Denis Yarats; | We study the Cross-Entropy Method (CEM) for the non-convex optimization of a continuous and parameterized objective function and introduce a differentiable variant that enables us to differentiate the output of CEM with respect to the objective function’s parameters. |

897 | A Sample Complexity Separation between Non-Convex and Convex Meta-Learning | Nikunj Umesh Saunshi; Yi Zhang; Mikhail Khodak; Sanjeev Arora; | This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black-box, specifically at properties of the optimization trajectory. |

898 | On the Convergence of Nesterov’s Accelerated Gradient Method in Stochastic Settings | Mahmoud Assran; Michael Rabbat; | We study Nesterov’s accelerated gradient method in the stochastic approximation setting (unbiased gradients with bounded variance) and the finite sum setting (where randomness is due to sampling mini-batches). |

899 | The Buckley-Osthus model and the block preferential attachment model: statistical analysis and application | Wenpin Tang; Xin Guo; Fengmin Tang; | This paper is concerned with statistical estimation of two preferential attachment models: the Buckley-Osthus model and the block preferential attachment model. |

900 | Representations for Stable Off-Policy Reinforcement Learning | Dibya Ghosh; Marc Bellemare; | In this paper, we formally show that there are indeed nontrivial state representations under which the canonical SARSA algorithm is stable, even when learning off-policy. |

901 | Piecewise Linear Regression via a Difference of Convex Functions | Ali Siahkamari; Aditya Gangrade; Brian Kulis; Venkatesh Saligrama; | We present a new piecewise linear regression methodology that fits a difference of convex functions (DC functions) to the data. |

902 | On the consistency of top-k surrogate losses | Forest Yang; Sanmi Koyejo; | Based on the top-$k$ calibration analysis, we propose a rich class of top-$k$ calibrated Bregman divergence surrogates. |

903 | Collapsed Amortized Variational Inference for Switching Nonlinear Dynamical Systems | Zhe Dong; Bryan Seybold; Kevin Murphy; Hung Bui; | We propose an efficient inference method for switching nonlinear dynamical systems. |

904 | Boosting Deep Neural Network Efficiency with Dual-Module Inference | Liu Liu; Lei Deng; Zhaodong Chen; Yuke Wang; Shuangchen Li; Jingwei Zhang; Yihua Yang; Zhenyu Gu; Yufei Ding; Yuan Xie; | We propose a big-little dual-module inference to dynamically skip unnecessary memory access and computation to speed up DNN inference. |

905 | Time-Consistent Self-Supervision for Semi-Supervised Learning | Tianyi Zhou; Shengjie Wang; Jeff Bilmes; | In this paper, we study the dynamics of neural net outputs in SSL and show that selecting and using first the unlabeled samples with more consistent outputs over the course of training (i.e., "time-consistency") can improve the final test accuracy and save computation. |

906 | Selective Dyna-style Planning Under Limited Model Capacity | Zaheer SM; Samuel Sokota; Erin Talvitie; Martha White; | In this paper, we investigate the idea of using an imperfect model selectively. |

907 | A Pairwise Fair and Community-preserving Approach to k-Center Clustering | Brian Brubach; Darshan Chakrabarti; John Dickerson; Samir Khuller; Aravind Srinivasan; Leonidas Tsepenekas; | To explore the practicality of our fairness goals, we devise an approach for extending existing k-center algorithms to satisfy these fairness constraints. |

908 | How recurrent networks implement contextual processing in sentiment analysis | Niru Maheswaranathan; David Sussillo; | Here, we propose general methods for reverse engineering recurrent neural networks (RNNs) to identify and elucidate contextual processing. |

909 | Smaller, more accurate regression forests using tree alternating optimization | Arman Zharmagambetov; Miguel Carreira-Perpinan; | We instead use the recently proposed Tree Alternating Optimization (TAO) algorithm. This is able to learn an oblique tree, where each decision node tests for a linear combination of features, and which has much higher accuracy than axis-aligned trees. |

910 | Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks | Ahmed T. Elthakeb; Prannoy Pilligundla; FatemehSadat Mireshghallah; Alexander Cloninger; Hadi Esmaeilzadeh; | This paper sets out to harvest these rich intermediate representations for quantization with minimal accuracy loss while significantly reducing the memory footprint and compute intensity of the DNN. |

911 | From Sets to Multisets: Provable Variational Inference for Probabilistic Integer Submodular Models | Aytunc Sahin; Yatao Bian; Joachim Buhmann; Andreas Krause; | We study central properties of this extension and formulate a new probabilistic model which is defined through integer submodular functions. |

912 | Empirical Study of the Benefits of Overparameterization in Learning Latent Variable Models | Rares-Darius Buhai; Yoni Halpern; Yoon Kim; Andrej Risteski; David Sontag; | We discuss benefits to different metrics of success (recovering the parameters of the ground-truth model, held-out log-likelihood), sensitivity to variations of the training algorithm, and behavior as the amount of overparameterization increases. |

913 | Improving the Gating Mechanism of Recurrent Neural Networks | Albert Gu; Caglar Gulcehre; Thomas Paine; Matthew Hoffman; Razvan Pascanu; | We address this problem by deriving two synergistic modifications to the standard gating mechanism that are easy to implement, introduce no additional hyperparameters, and improve learnability of the gates when they are close to saturation. |

914 | Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors | Mike Dusenberry; Ghassen Jerfel; Yeming Wen; Yian Ma; Jasper Snoek; Katherine Heller; Balaji Lakshminarayanan; Dustin Tran; | To tackle this challenge, we propose a rank-1 parameterization of BNNs, where each weight matrix involves only a distribution on a rank-1 subspace. |

915 | Analyzing the effect of neural network architecture on training performance | Karthik Abinav Sankararaman; Soham De; Zheng Xu; W. Ronny Huang; Tom Goldstein; | In this paper we study how neural network architecture affects the speed of training. |

916 | Born-again Tree Ensembles | Thibaut Vidal; Maximilian Schiffer; | Against this background, we study born-again tree ensembles, i.e., the process of constructing a single decision tree of minimum size that reproduces the exact same behavior as a given tree ensemble. |

917 | Accountable Off-Policy Evaluation via a Kernelized Bellman Statistics | Yihao Feng; Tongzheng Ren; Ziyang Tang; Qiang Liu; | In this work, we investigate the statistical properties of the kernel loss, which allows us to find a feasible set that contains the true value function with high probability. |

918 | Improving Transformer Optimization Through Better Initialization | Xiao Shi Huang; Felipe Perez; Jimmy Ba; Maksims Volkovs; | In this work our contributions are two-fold. We first investigate and empirically validate the source of optimization problems in the encoder-decoder Transformer architecture. We then propose a new weight initialization scheme with theoretical justification, which enables training without warmup or layer normalization. |

919 | Learning to Simulate and Design for Structural Engineering | Kai-Hung Chang; Chin-Yi Cheng; | In this work, we propose an end-to-end learning pipeline to solve the size design optimization problem, which is to design the optimal cross-sections for columns and beams, given the design objectives and building code as constraints. |

920 | Few-shot Relation Extraction via Bayesian Meta-learning on Task Graphs | Meng Qu; Tianyu Gao; Louis-Pascal Xhonneux; Jian Tang; | We propose a novel Bayesian meta-learning approach to effectively learn the posterior distributions of the prototype vectors of tasks, where the initial prior of the prototype vectors is parameterized with a graph neural network on the global task graph. |

921 | Optimal Differential Privacy Composition for Exponential Mechanisms | Jinshuo Dong; David Durfee; Ryan Rogers; | We consider precise composition bounds of the overall privacy loss for exponential mechanisms, one of the fundamental classes of mechanisms in DP. |

922 | Scaling up Hybrid Probabilistic Inference with Logical and Arithmetic Constraints via Message Passing | Zhe Zeng; Paolo Morettin; Fanqi Yan; Antonio Vergari; Guy Van den Broeck; | To narrow this gap, we derive a factorized formalism of WMI enabling us to devise a scalable WMI solver based on message passing, MP-WMI. |

923 | Accelerating Large-Scale Inference with Anisotropic Vector Quantization | Ruiqi Guo; Quan Geng; David Simcha; Felix Chern; Philip Sun; Erik Lindgren; Sanjiv Kumar; | Based on the observation that for a given query, the database points that have the largest inner products are more relevant, we develop a family of anisotropic quantization loss functions. |

924 | Convolutional dictionary learning based auto-encoders for natural exponential-family distributions | Bahareh Tolooshams; Andrew Song; Simona Temereanca; Demba Ba; | We introduce a class of auto-encoder neural networks tailored to data from the natural exponential family (e.g., count data). |

925 | Strength from Weakness: Fast Learning Using Weak Supervision | Joshua Robinson; Stefanie Jegelka; Suvrit Sra; | We study generalization properties of weakly supervised learning. |

926 | NADS: Neural Architecture Distribution Search for Uncertainty Awareness | Randy Ardywibowo; Shahin Boluki; Xinyu Gong; Zhangyang Wang; Xiaoning Qian; | To address these problems, we first seek to identify guiding principles for designing uncertainty-aware architectures, by proposing Neural Architecture Distribution Search (NADS). |

927 | Approximating Stacked and Bidirectional Recurrent Architectures with the Delayed Recurrent Neural Network | Javier Turek; Shailee Jain; Vy Vo; Mihai Capota; Alexander Huth; Theodore Willke; | In this work, we explore the delayed-RNN, which is a single-layer RNN that has a delay between the input and output. |

928 | Balancing Competing Objectives with Noisy Data: Score-Based Classifiers for Welfare-Aware Machine Learning | Esther Rolf; Max Simchowitz; Sarah Dean; Lydia T. Liu; Daniel Bjorkegren; Moritz Hardt; Joshua Blumenstock; | In this paper, we study algorithmic policies which explicitly trade off between a private objective (such as profit) and a public objective (such as social welfare). |

929 | Time-aware Large Kernel Convolutions | Vasileios Lioutas; Yuhong Guo; | In this paper, we introduce Time-aware Large Kernel (TaLK) Convolutions, a novel adaptive convolution operation that learns to predict the size of a summation kernel instead of using a fixed-sized kernel matrix. |

930 | Amortised Learning by Wake-Sleep | Li Kevin Wenliang; Theodore Moskovitz; Heishiro Kanagawa; Maneesh Sahani; | Here, we propose an alternative approach that we call amortised learning. Rather than computing an approximation to the posterior over latents, we use a wake-sleep Monte-Carlo strategy to learn a function that directly estimates the maximum-likelihood parameter updates. |

931 | Fair Generative Modeling via Weak Supervision | Kristy Choi; Aditya Grover; Trisha Singh; Rui Shu; Stefano Ermon; | We present a weakly supervised algorithm for overcoming dataset bias for deep generative models. |

932 | Multi-Step Greedy Reinforcement Learning Algorithms | Manan Tomar; Yonathan Efroni; Mohammad Ghavamzadeh; | In this paper, we explore the benefits of multi-step greedy policies in model-free RL when employed using the multi-step Dynamic Programming algorithms: $\kappa$-Policy Iteration ($\kappa$-PI) and $\kappa$-Value Iteration ($\kappa$-VI). |

933 | Linear Mode Connectivity and the Lottery Ticket Hypothesis | Jonathan Frankle; Gintare Karolina Dziugaite; Daniel Roy; Michael Carbin; | We introduce "instability analysis," which assesses whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise. |

934 | Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent | Surbhi Goel; Aravind Gollakota; Zhihan Jin; Sushrut Karmalkar; Adam Klivans; | We give the first superpolynomial lower bounds for learning one-layer neural networks with respect to the Gaussian distribution for a broad class of algorithms. |

935 | Learnable Group Transform For Time-Series | Romain Cosentino; Behnaam Aazhang; | We propose a novel approach to filter bank learning for time-series by considering spectral decompositions of signals defined as a Group Transform. |

936 | Optimistic bounds for multi-output learning | Henry Reeve; Ata Kaban; | We investigate the challenge of multi-output learning, where the goal is to learn a vector-valued function based on a supervised data set. |

937 | Detecting Out-of-Distribution Examples with Gram Matrices | Chandramouli Shama Sastry; Sageev Oore; | In this paper, we propose to detect OOD examples by identifying inconsistencies between activity patterns and predicted class. |

938 | On Variational Learning of Controllable Representations for Text without Supervision | Peng Xu; Jackie Chi Kit Cheung; Yanshuai Cao; | In this work, we find that sequence VAEs trained on text fail to properly decode when the latent codes are manipulated, because the modified codes often land in holes or vacant regions in the aggregated posterior latent space, where the decoding network fails to generalize. |

939 | Model-Based Reinforcement Learning with Value-Targeted Regression | Zeyu Jia; Lin Yang; Csaba Szepesvari; Mengdi Wang; Alex Ayoub; | In this paper we focus on finite-horizon episodic RL where the transition model admits a nonlinear parametrization $P_{\theta}$, a special case of which is the linear parameterization: $P_{\theta} = \sum_{i=1}^{d} (\theta)_{i}P_{i}$. |

940 | Two Routes to Scalable Credit Assignment without Weight Symmetry | Daniel Kunin; Aran Nayebi; Javier Sagastuy-Brena; Surya Ganguli; Jonathan Bloom; Daniel Yamins; | Our analysis indicates the underlying mathematical reason for this instability, allowing us to identify a more robust local learning rule that better transfers without metaparameter tuning. |

941 | Predicting deliberative outcomes | Vikas Garg; Tommi Jaakkola; | We extend structured prediction to deliberative outcomes. |

942 | Black-box Certification and Learning under Adversarial Perturbations | Hassan Ashtiani; Vinayak Pathak; Ruth Urner; | We formally study the problem of classification under adversarial perturbations, both from the learner’s perspective, and from the viewpoint of a third-party who aims at certifying the robustness of a given black-box classifier. |

943 | When deep denoising meets iterative phase retrieval | Yaotian Wang; Xiaohang Sun; Jason Fleischer; | Here, we combine iterative methods from phase retrieval with image statistics from deep denoisers, via regularization-by-denoising. |

944 | The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization | Ben Adlam; Jeffrey Pennington; | We provide a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent. |

945 | A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition | Anurag Kumar; Vamsi Krishna Ithapu; | In this paper, we propose a sequential self-teaching approach to learn sounds. |

946 | On the Global Convergence Rates of Softmax Policy Gradient Methods | Jincheng Mei; Chenjun Xiao; Csaba Szepesvari; Dale Schuurmans; | We make three contributions toward better understanding policy gradient methods. |

947 | Source Separation with Deep Generative Priors | Vivek Jayaram; John Thickstun; | This paper introduces a Bayesian approach to source separation that uses deep generative models as priors over the components of a mixture of sources, and Langevin dynamics to sample from the posterior distribution of sources given a mixture. |

948 | Non-Autoregressive Neural Text-to-Speech | Kainan Peng; Wei Ping; Zhao Song; Kexin Zhao; | In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. |

949 | Amortized Population Gibbs Samplers with Neural Sufficient Statistics | Hao Wu; Heiko Zimmermann; Eli Sennesh; Tuan Anh Le; Jan-Willem van de Meent; | We develop amortized population Gibbs (APG) samplers, a class of scalable methods that frame structured variational inference as adaptive importance sampling. |

950 | Neural Network Control Policy Verification With Persistent Adversarial Perturbation | Yuh-Shyang Wang; Tsui-Wei Weng; Luca Daniel; | In this paper, we show how to combine recent works on static neural network certification tools with robust control theory to certify a neural network policy in a control loop. |

951 | Circuit-Based Intrinsic Methods to Detect Overfitting | Satrajit Chatterjee; Alan Mishchenko; | We propose a family of intrinsic methods called Counterfactual Simulation (CFS) which analyze the flow of training examples through the model by identifying and perturbing rare patterns. |

952 | Inter-domain Deep Gaussian Processes with RKHS Fourier Features | Tim Rudner; Dino Sejdinovic; Yarin Gal; | We propose Inter-domain Deep Gaussian Processes with RKHS Fourier Features, an extension of shallow inter-domain GPs that combines the advantages of inter-domain and deep Gaussian processes (DGPs) and demonstrate how to leverage existing approximate inference approaches to perform simple and scalable approximate inference on Inter-domain Deep Gaussian Processes. |

953 | Estimating Q(s,s’) with Deterministic Dynamics Gradients | Ashley Edwards; Himanshu Sahni; Rosanne Liu; Jane Hung; Ankit Jain; Rui Wang; Adrien Ecoffet; Thomas Miconi; Charles Isbell; Jason Yosinski; | In this paper, we introduce a novel form of a value function, $Q(s, s’)$, that expresses the utility of transitioning from a state $s$ to a neighboring state $s’$ and then acting optimally thereafter. |

954 | On conditional versus marginal bias in multi-armed bandits | Jaehyeok Shin; Aaditya Ramdas; Alessandro Rinaldo; | In this paper, we characterize the sign of the conditional bias of monotone functions of the rewards, including the sample mean. |

955 | Implicit competitive regularization in GANs | Florian Schaefer; Hongkai Zheng; Anima Anandkumar; | We argue that the performance of GANs is instead due to the implicit competitive regularization (ICR) arising from the simultaneous optimization of generator and discriminator. |

956 | Graph-based, Self-Supervised Program Repair from Diagnostic Feedback | Michihiro Yasunaga; Percy Liang; | Program repair is challenging for two reasons: First, it requires reasoning and tracking symbols across source code and diagnostic feedback. Second, labeled datasets available for program repair are relatively small. In this work, we propose novel solutions to these two challenges. |

957 | Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions | Omer Gottesman; Joseph Futoma; Yao Liu; Sonali Parbhoo; Leo Celi; Emma Brunskill; Finale Doshi-Velez; | In this paper we develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates. |

958 | Communication-Efficient Federated Learning with Sketching | Daniel Rothchild; Ashwinee Panda; Enayat Ullah; Nikita Ivkin; Vladimir Braverman; Joseph Gonzalez; Ion Stoica; Raman Arora; | In this paper we introduce a novel algorithm, called FedSketchedSGD, to overcome these challenges. |

959 | Learning Fair Policies in Multi-Objective (Deep) Reinforcement Learning with Average and Discounted Rewards | Umer Siddique; Paul Weng; Matthieu Zimmer; | In this paper, we formulate this novel RL problem, in which an objective function (generalized Gini index of utility vectors), which encodes a notion of fairness that we formally define, is optimized. |

960 | Robust Black Box Explanations Under Distribution Shift | Himabindu Lakkaraju; Nino Arsov; Osbert Bastani; | In this paper, we propose a novel framework for generating robust explanations of black box models based on adversarial training. |

961 | Distributed Online Optimization over a Heterogeneous Network | Nima Eshraghi; Ben Liang; | To address this issue, we consider a new algorithm termed Distributed Any-Batch Mirror Descent (DABMD), which is based on distributed Mirror Descent but uses a fixed per-round computing time to limit the waiting by fast nodes to receive information updates from slow nodes. |

962 | ECLIPSE: An Extreme-Scale Linear Program Solver for Web-Applications | Kinjal Basu; Amol Ghoting; Rahul Mazumder; Yao Pan; | In this work, we propose a distributed solver that solves a perturbation of the LP problems at scale. |

963 | CURL: Contrastive Unsupervised Representation Learning for Reinforcement Learning | Michael Laskin; Pieter Abbeel; Aravind Srinivas; | To that end, we propose a new model: Contrastive Unsupervised Representation Learning for Reinforcement Learning (CURL). |

964 | Confidence-Aware Learning for Deep Neural Networks | Sangheum Hwang; Jooyoung Moon; Jihyo Kim; Younghak Shin; | In this paper, we propose a method of training deep neural networks with a novel loss function, named Correctness Ranking Loss, which regularizes class probabilities explicitly to be better confidence estimates in terms of ordinal ranking according to confidence. |

965 | Online Bayesian Moment Matching based SAT Solver Heuristics | Haonan Duan; Saeed Nejati; George Trimponias; Pascal Poupart; Vijay Ganesh; | In this paper, we present a Bayesian Moment Matching (BMM) based method aimed at solving the initialization problem in Boolean SAT solvers. |

966 | Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search | Binghong Chen; Chengtao Li; Hanjun Dai; Le Song; | In this paper, we propose Retro*, a neural-based A*-like algorithm that finds high-quality synthetic routes efficiently. |

967 | FedBoost: A Communication-Efficient Algorithm for Federated Learning | Jenny Hamer; Mehryar Mohri; Ananda Theertha Suresh; | In this work, we propose an alternative approach whereby an ensemble of pre-trained base predictors is trained via federated learning. |

968 | Sharp Composition Bounds for Gaussian Differential Privacy via Edgeworth Expansion | Qinqing Zheng; Jinshuo Dong; Qi Long; Weijie Su; | To address this question, we introduce a family of analytical and sharp privacy bounds under composition using the Edgeworth expansion in the framework of the recently proposed $f$-differential privacy. |

969 | Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods | Dan Fu; Mayee Chen; Frederic Sala; Sarah Hooper; Kayvon Fatahalian; Christopher Re; | In this work, we show that, for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution to model parameters, obviating the need for iterative solutions like stochastic gradient descent (SGD). |

970 | Spectral Frank-Wolfe Algorithm: Strict Complementarity and Linear Convergence | Lijun Ding; Yingjie Fei; Qiantong Xu; Chengrun Yang; | We develop a novel variant of the classical Frank-Wolfe algorithm, which we call spectral Frank-Wolfe, for convex optimization over a spectrahedron. |

971 | Deep Molecular Programming: A Natural Implementation of Binary-Weight ReLU Neural Networks | Marko Vasic; Cameron Chalk; Sarfraz Khurshid; David Soloveichik; | We discover a surprisingly tight connection between a popular class of neural networks (Binary-weight ReLU aka BinaryConnect) and a class of coupled chemical reactions that are absolutely robust to reaction rates. |

972 | Generative Pretraining From Pixels | Mark Chen; Alec Radford; Rewon Child; Jeffrey Wu; Heewoo Jun; David Luan; Ilya Sutskever; | Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. |

973 | Inferring DQN structure for high-dimensional continuous control | Andrey Sakryukin; Chedy Raissi; Mohan Kankanhalli; | In this work, we show that the compositional structure of the action modules has a significant impact on model performance. |

974 | Subspace Fitting Meets Regression: The Effects of Supervision and Orthonormality Constraints on Double Descent of Generalization Errors | Yehuda Dar; Paul Mayer; Lorenzo Luzi; Richard Baraniuk; | We study the linear subspace fitting problem in the overparameterized setting, where the estimated subspace can perfectly interpolate the training examples. |

975 | Learning Selection Strategies in Buchberger’s Algorithm | Dylan Peifer; Michael Stillman; Daniel Halpern-Leistner; | We introduce a new approach to Buchberger’s algorithm that uses reinforcement learning agents to perform S-pair selection, a key step in the algorithm. |

976 | Estimating the Error of Randomized Newton Methods: A Bootstrap Approach | Miles Lopes; Jessie X.T. Chen; | Motivated by these difficulties, we develop a bootstrap method for directly estimating the unknown error, which avoids excessive computation and offers greater reliability. |

977 | Spectral Subsampling MCMC for Stationary Time Series | Robert Salomone; Matias Quiroz; Robert Kohn; Mattias Villani; Minh-Ngoc Tran; | We propose a novel technique for speeding up MCMC for time series data by efficient data subsampling in the frequency domain. |

978 | Progressive Identification of True Labels for Partial-Label Learning | Jiaqi Lv; Miao Xu; LEI FENG; Gang Niu; Xin Geng; Masashi Sugiyama; | The goal of this paper is to propose a novel framework of partial-label learning without implicit assumptions on the model or optimization algorithm. |

979 | R2-B2: Recursive Reasoning-Based Bayesian Optimization for No-Regret Learning in Games | Zhongxiang Dai; Yizhou Chen; Bryan Kian Hsiang Low; Patrick Jaillet; Teck-Hua Ho; | This paper presents a recursive reasoning formalism of Bayesian optimization (BO) to model the reasoning process in the interactions between boundedly rational, self-interested agents with unknown, complex, and costly-to-evaluate payoff functions in repeated games, which we call Recursive Reasoning-Based BO (R2-B2). |

980 | Graph Homomorphism Convolution | Hoang Nguyen; Takanori Maehara; | In this paper, we study the graph classification problem from the graph homomorphism perspective. |

981 | Conditional Augmentation for Generative Modeling | Heewoo Jun; Rewon Child; Mark Chen; John Schulman; Aditya Ramesh; Alec Radford; Ilya Sutskever; | We present conditional augmentation (CondAugment), a simple and powerful method of regularizing generative models. |

982 | PDO-eConvs: Partial Differential Operator Based Equivariant Convolutions | Zhengyang Shen; Lingshen He; Zhouchen Lin; Jinwen Ma; | In this work, we deal with this issue from the connection between convolutions and partial differential operators (PDOs). |

983 | Abstraction Mechanisms Predict Generalization in Deep Neural Networks | Alex Gain; Hava Siegelmann; | We approach this problem through the unconventional angle of \textit{cognitive abstraction mechanisms}, drawing inspiration from recent neuroscience work, allowing us to define the Cognitive Neural Activation metric (CNA) for DNNs, which is the correlation between information complexity (entropy) of given input and the concentration of higher activation values in deeper layers of the network. |

984 | Revisiting Fundamentals of Experience Replay | William Fedus; Prajit Ramachandran; Rishabh Agarwal; Yoshua Bengio; Hugo Larochelle; Mark Rowland; Will Dabney; | We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). |

985 | Go Wide, Then Narrow: Efficient Training of Deep Thin Networks | Denny Zhou; Mao Ye; Chen Chen; Mingxing Tan; Tianjian Meng; Xiaodan Song; Quoc Le; Qiang Liu; Dale Schuurmans; | We propose an efficient algorithm to train a very deep and thin network with theoretic guarantee. |

986 | Meta-learning for Mixed Linear Regression | Weihao Kong; Raghav Somani; Zhao Song; Sham Kakade; Sewoong Oh; | To this end, we introduce a novel spectral approach and show that we can efficiently utilize small data tasks with the help of $\tilde\Omega(k^{3/2})$ medium data tasks each with $\tilde\Omega(k^{1/2})$ examples. |

987 | Efficiently Learning Adversarially Robust Halfspaces with Noise | Omar Montasser; Surbhi Goel; Ilias Diakonikolas; Nati Srebro; | We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. |

988 | Bayesian Graph Neural Networks with Adaptive Connection Sampling | Arman Hasanzadeh; Ehsan Hajiramezanali; Shahin Boluki; Nick Duffield; Mingyuan Zhou; Krishna Narayanan; Xiaoning Qian; | We propose a unified framework for adaptive connection sampling in graph neural networks (GNNs) that generalizes existing stochastic regularization methods for training GNNs. |

989 | On the Theoretical Properties of the Network Jackknife | Qiaohui Lin; Robert Lunde; Purnamrita Sarkar; | Under the sparse graphon model, we prove an Efron-Stein-type inequality, showing that the network jackknife leads to conservative estimates of the variance (in expectation) for any network functional that is invariant to node permutation. |

990 | Thompson Sampling via Local Uncertainty | Zhendong Wang; Mingyuan Zhou; | In this paper, we propose a new probabilistic modeling framework for Thompson sampling, where local latent variable uncertainty is used to sample the mean reward. |

991 | Decision Trees for Decision-Making under the Predict-then-Optimize Framework | Adam Elmachtoub; Jason Cheuk Nam Liang; Ryan McNellis; | This natural loss function is known in the literature as the Smart Predict-then-Optimize (SPO) loss, and we propose a tractable methodology called SPO Trees (SPOTs) for training decision trees under this loss. |

992 | Representation Learning via Adversarially-Contrastive Optimal Transport | Anoop Cherian; Shuchin Aeron; | In this paper, we study the problem of learning compact (low-dimensional) representations for sequential data that captures its implicit spatio-temporal cues. |

993 | Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" | Saeed Amizadeh; Hamid Palangi; Oleksandr Polozov; Yichen Huang; Kazuhito Koishida; | To address this, we propose (1) a framework to isolate and evaluate the reasoning aspect of VQA separately from its perception, and (2) a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception. |

994 | Two Simple Ways to Learn Individual Fairness Metric from Data | Debarghya Mukherjee; Mikhail Yurochkin; Moulinath Banerjee; Yuekai Sun; | In this paper, we present two simple algorithms that learn effective fair metrics from a variety of datasets. |

995 | A Simple Framework for Contrastive Learning of Visual Representations | Ting Chen; Simon Kornblith; Mohammad Norouzi; Geoffrey Hinton; | This paper presents a simple framework for contrastive representation learning. |

996 | The Implicit and Explicit Regularization Effects of Dropout | Colin Wei; Sham Kakade; Tengyu Ma; | This work observes that dropout introduces two distinct but entangled regularization effects: an explicit effect which occurs since dropout modifies the expected training objective, and an implicit effect from stochasticity in the dropout gradients. |

997 | Variable-Bitrate Neural Compression via Bayesian Arithmetic Coding | Yibo Yang; Robert Bamler; Stephan Mandt; | Here, we propose a new algorithm for compressing latent representations in deep probabilistic models, such as variational autoencoders, in post-processing. |

998 | Orthogonalized SGD and Nested Architectures for Anytime Neural Networks | Chengcheng Wan; Henry (Hank) Hoffmann; Shan Lu; Michael Maire; | We propose a novel variant of SGD customized for training network architectures that support anytime behavior: such networks produce a series of increasingly accurate outputs over time. |

999 | Evaluating Machine Accuracy on ImageNet | Vaishaal Shankar; Rebecca Roelofs; Horia Mania; Alex Fang; Benjamin Recht; Ludwig Schmidt; | We perform an in-depth evaluation of human accuracy on the ImageNet dataset. |

1000 | Learning to Navigate in Synthetically Accessible Chemical Space Using Reinforcement Learning | Sai Krishna Gottipati; Boris Sattarov; Sufeng Niu; Haoran Wei; Yashaswi Pathak; Shengchao Liu; Simon Blackburn; Karam Thomas; Connor Coley; Jian Tang; Sarath Chandar; Yoshua Bengio; | In this work, we propose a novel reinforcement learning (RL) setup for drug discovery that addresses this challenge by embedding the concept of synthetic accessibility directly into the de novo compound design system. |

1001 | Improved Bounds on Minimax Regret under Logarithmic Loss via Self-Concordance | Blair Bilodeau; Dylan Foster; Daniel Roy; | We present a novel approach to bounding the minimax regret that exploits the self-concordance property of logarithmic loss. |

1002 | Optimization Theory for ReLU Neural Networks Trained with Normalization Layers | Yonatan Dukler; Quanquan Gu; Guido Montufar; | The analysis shows how the introduction of normalization layers changes the optimization landscape and in some settings enables faster convergence as compared with un-normalized neural networks. |

1003 | Improving Molecular Design by Stochastic Iterative Target Augmentation | Kevin Yang; Wengong Jin; Kyle Swanson; Regina Barzilay; Tommi Jaakkola; | In this paper, we propose a surprisingly effective self-training approach for iteratively creating additional molecular targets. |

1004 | Don’t Waste Your Bits! Squeeze Activations and Gradients for Deep Neural Networks via TinyScript | Fangcheng Fu; Yuzheng Hu; Yihan He; Jiawei Jiang; Yingxia Shao; Ce Zhang; Bin Cui; | In this work, we introduce TinyScript, which applies a non-uniform quantization algorithm to both activations and gradients. |

1005 | Robust One-Bit Recovery via ReLU Generative Networks: Near-Optimal Statistical Rate and Global Landscape Analysis | Shuang Qiu; Xiaohan Wei; Zhuoran Yang; | We propose to recover the target $G(x_0)$ by solving an unconstrained empirical risk minimization (ERM). |

1006 | Multi-objective Bayesian Optimization using Pareto-frontier Entropy | Shinya Suzuki; Shion Takeno; Tomoyuki Tamura; Kazuki Shitara; Masayuki Karasuyama; | We propose a novel entropy-based MBO called Pareto-frontier entropy search (PFES), which considers the entropy of the Pareto-frontier, an essential notion of optimality in the multi-objective problem. |

1007 | Closing the convergence gap of SGD without replacement | Shashank Rajput; Anant Gupta; Dimitris Papailiopoulos; | In this paper, we close this gap and show that SGD without replacement achieves a rate of $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^2}{T^3}\right)$ when the sum of the functions is a quadratic, and offer a new lower bound of $\Omega\left(\frac{n}{T^2}\right)$ for strongly convex functions that are sums of smooth functions. |

1008 | Black-Box Methods for Restoring Monotonicity | Evangelia Gergatsouli; Brendan Lucier; Christos Tzamos; | In this work we develop algorithms that are able to restore monotonicity in the parameters of interest. |

1009 | Flexible and Efficient Long-Range Planning Through Curious Exploration | Aidan Curtis; Minjian Xin; Dilip Arumugam; Kevin Feigelis; Daniel Yamins; | Here, we propose the Curious Sample Planner (CSP), which fuses elements of TAMP and DRL by combining a curiosity-guided sampling strategy with imitation learning to accelerate planning. |

1010 | Sparse Convex Optimization via Adaptively Regularized Hard Thresholding | Kyriakos Axiotis; Maxim Sviridenko; | We present a new Adaptively Regularized Hard Thresholding (ARHT) algorithm that makes significant progress on this problem by bringing the bound down to $\gamma=O(\kappa)$, which has been shown to be tight for a general class of algorithms including LASSO, OMP, and IHT. |

1011 | On Thompson Sampling with Langevin Algorithms | Eric Mazumdar; Aldo Pacchiano; Yian Ma; Michael Jordan; Peter Bartlett; | We propose a Markov Chain Monte Carlo (MCMC) method tailored to Thompson sampling to address this issue. |

1012 | Strategic Classification is Causal Modeling in Disguise | John Miller; Smitha Milli; Moritz Hardt; | In this work, we develop a causal framework for strategic adaptation. |

1013 | Multi-fidelity Bayesian Optimization with Max-value Entropy Search and its Parallelization | Shion Takeno; Hitoshi Fukuoka; Yuhki Tsukada; Toshiyuki Koyama; Motoki Shiga; Ichiro Takeuchi; Masayuki Karasuyama; | In this paper, we focus on the information-based approach, which is a popular and empirically successful approach in BO. |

1014 | Domain Aggregation Networks for Multi-Source Domain Adaptation | Junfeng Wen; Russell Greiner; Dale Schuurmans; | In this paper, we develop a finite-sample generalization bound based on domain discrepancy and accordingly propose a theoretically justified optimization procedure. |

1015 | Improving Robustness of Deep-Learning-Based Image Reconstruction | Ankit Raj; Yoram Bresler; Bo Li; | In this paper, we propose to modify the training strategy of end-to-end deep-learning-based inverse problem solvers to improve robustness. |

1016 | Outsourced Bayesian Optimization | Dmitrii Kharkovskii; Zhongxiang Dai; Bryan Kian Hsiang Low; | This paper presents the outsourced-Gaussian process-upper confidence bound (O-GP-UCB) algorithm, which is the first algorithm for privacy-preserving Bayesian optimization (BO) in the outsourced setting with a provable performance guarantee. |

1017 | Learning Near Optimal Policies with Low Inherent Bellman Error | Andrea Zanette; Alessandro Lazaric; Mykel Kochenderfer; Emma Brunskill; | We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. |

1018 | Message Passing Least Squares: A Unified Framework for Fast and Robust Group Synchronization | Yunpeng Shi; Gilad Lerman; | We propose an efficient algorithm for solving robust group synchronization given adversarially corrupted group ratios. |

1019 | Optimal Estimator for Unlabeled Linear Regression | Hang Zhang; Ping Li; | This paper proposes a one-step estimator that is optimal in both the computational and statistical senses. |

1020 | Recovery of sparse signals from a mixture of linear samples | Arya Mazumdar; Soumyabrata Pal; | In this work we address this query complexity problem and provide efficient algorithms that improve on the best previously known results. |

1021 | Recurrent Hierarchical Topic-Guided RNN for Language Generation | Dandan Guo; Bo Chen; Ruiying Lu; Mingyuan Zhou; | To simultaneously capture syntax and global semantics from a text corpus, we propose a new larger-context recurrent neural network (RNN)-based language model, which extracts recurrent hierarchical semantic structure via a dynamic deep topic model to guide natural language generation. |

1022 | Predictive Coding for Locally-Linear Control | Rui Shu; Tung Nguyen; Yinlam Chow; Tuan Pham; Khoat Than; Mohammad Ghavamzadeh; Stefano Ermon; Hung Bui; | In this paper, we propose a novel information-theoretic LCE approach and show theoretically that explicit next-observation prediction can be replaced with predictive coding. |

1023 | Near Input Sparsity Time Kernel Embeddings via Adaptive Sampling | Amir Zandieh; David Woodruff; | To accelerate kernel methods, we propose a near input sparsity time method for sampling the high-dimensional space implicitly defined by a kernel transformation. |

1024 | Near-optimal sample complexity bounds for learning Latent $k$-polytopes and applications to Ad-Mixtures | Chiranjib Bhattacharyya; Ravindran Kannan; | In this paper we show that $O^*(dk/m)$ samples are sufficient to learn each of $k$ topic vectors of LDA, a popular Ad-mixture model, with vocabulary size $d$ and $m\in \Omega(1)$ words per document, to any constant error in $L_1$ norm. |

1025 | Population-Based Black-Box Optimization for Biological Sequence Design | Christof Angermueller; David Belanger; Andreea Gane; Zelda Mariet; David Dohan; Kevin Murphy; Lucy Colwell; D. Sculley; | To improve robustness, we propose population-based optimization (P3BO), which generates batches of sequences by sampling from an ensemble of methods. |

1026 | Emergence of Separable Manifolds in Deep Language Representations | Jonathan Mamou; Hang Le; Miguel del Rio Fernandez; Cory Stephenson; Hanlin Tang; Yoon Kim; SueYeon Chung; | In this work, we utilize mean-field theoretic manifold analysis, a recent technique from computational neuroscience, to analyze the high dimensional geometry of language representations from large-scale contextual embedding models. |

1027 | Stochastic Hamiltonian Gradient Methods for Smooth Games | Nicolas Loizou; Hugo Berard; Alexia Jolicoeur-Martineau; Pascal Vincent; Simon Lacoste-Julien; Ioannis Mitliagkas; | We analyze the stochastic Hamiltonian method and a novel variance-reduced variant of it and provide the first set of last-iterate convergence guarantees for stochastic unbounded bilinear games. |

1028 | Understanding and Estimating the Adaptability of Domain-Invariant Representations | Ching-Yao Chuang; Antonio Torralba; Stefanie Jegelka; | In this work, we aim to better understand and estimate the effect of domain-invariant representations on generalization to the target. |

1029 | Adversarial Mutual Information for Text Generation | Boyuan Pan; Yazheng Yang; Kaizhao Liang; Bhavya Kailkhura; Zhongming Jin; Xian-Sheng Hua; Deng Cai; Bo Li; | In this paper, we propose Adversarial Mutual Information (AMI): a text generation framework which is formed as a novel saddle point (min-max) optimization aiming to identify joint interactions between the source and target. |

1030 | Bidirectional Model-based Policy Optimization | Hang Lai; Jian Shen; Weinan Zhang; Yong Yu; | We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization. |

1031 | Input-Sparsity Low Rank Approximation in Schatten Norm | Yi Li; David Woodruff; | We give the first input-sparsity time algorithms for the rank-$k$ low rank approximation problem in every Schatten norm. |

1032 | Do We Need Zero Training Loss After Achieving Zero Training Error? | Takashi Ishida; Ikko Yamane; Tomoya Sakai; Gang Niu; Masashi Sugiyama; | We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value, which we call the flooding level. |

1033 | Learning and sampling of atomic interventions from observations | Arnab Bhattacharyya; Sutanu Gayen; Saravanan Kandasamy; Ashwin Maran; Vinodchandran N. Variyam; | Our goal is to give algorithms with polynomial time and sample complexity in a non-parametric setting. |

1034 | Understanding and Mitigating the Tradeoff between Robustness and Accuracy | Aditi Raghunathan; Sang Michael Xie; Fanny Yang; John Duchi; Percy Liang; | In this work, we precisely characterize the effect of augmentation on the standard error in linear regression when the optimal linear predictor has zero standard and robust error. |

1035 | Combining Differentiable PDE Solvers and Graph Neural Networks for Fluid Flow Prediction | Filipe de Avila Belbute-Peres; Thomas Economon; Zico Kolter; | In this work, we develop a hybrid (graph) neural network that combines a traditional graph convolutional network with an embedded differentiable fluid dynamics simulator inside the network itself. |

1036 | From ImageNet to Image Classification: Contextualizing Progress on Benchmarks | Dimitris Tsipras; Shibani Santurkar; Logan Engstrom; Andrew Ilyas; Aleksander Madry; | Overall, our results highlight a misalignment between the way we train our models and the task we actually expect them to solve, emphasizing the need for fine-grained evaluation techniques that go beyond average-case accuracy. |

1037 | On Implicit Regularization in $\beta$-VAEs | Abhishek Kumar; Ben Poole; | This analysis uncovers the regularizer implicit in the $\beta$-VAE objective, and leads to an approximation consisting of a deterministic autoencoding objective plus analytic regularizers that depend on the Hessian or Jacobian of the decoding model, unifying VAEs with recent heuristics proposed for training regularized autoencoders. |

1038 | Data Amplification: Instance-Optimal Property Estimation | Yi Hao; Alon Orlitsky; | We present novel linear-time-computable estimators that significantly “amplify” the effective amount of data available. |

1039 | Provable guarantees for decision tree induction: the agnostic setting | Guy Blanc; Jane Lange; Li-Yang Tan; | We give strengthened provable guarantees on the performance of widely employed and empirically successful {\sl top-down decision tree learning heuristics}. |

1040 | Statistical Bias in Dataset Replication | Logan Engstrom; Andrew Ilyas; Shibani Santurkar; Dimitris Tsipras; Jacob Steinhardt; Aleksander Madry; | In this paper, we highlight the importance of statistical modeling in dataset replication: we present unintuitive yet pervasive ways in which statistical bias, when left unmitigated, can skew results. |

1041 | Towards Adaptive Residual Network Training: A Neural-ODE Perspective | Chengyu Dong; Liyuan Liu; Zichao Li; Jingbo Shang; | Illuminated by these derivations, we propose an adaptive training algorithm for residual networks, LipGrow, which automatically increases network depth and accelerates model training. |

1042 | Overparameterization hurts worst-group accuracy with spurious correlations | Shiori Sagawa; Aditi Raghunathan; Pang Wei Koh; Percy Liang; | We show on two image datasets that in contrast to average accuracy, overparameterization hurts worst-group accuracy in the presence of spurious correlations. |

1043 | A Nearly-Linear Time Algorithm for Exact Community Recovery in Stochastic Block Model | Peng Wang; Zirui Zhou; Anthony Man-Cho So; | In this paper, we focus on the problem of exactly recovering the communities in a binary symmetric SBM, where a graph of $n$ vertices is partitioned into two equal-sized communities and the vertices are connected with probability $p = \alpha\log(n)/n$ within communities and $q = \beta\log(n)/n$ across communities for some $\alpha>\beta>0$. |

1044 | Online Multi-Kernel Learning with Graph-Structured Feedback | Pouya M Ghari; Yanning Shen; | Leveraging the random feature approximation, we propose an online scalable multi-kernel learning approach with graph feedback, and prove that the proposed algorithm enjoys sublinear regret. |

1045 | Is Local SGD Better than Minibatch SGD? | Blake Woodworth; Kumar Kshitij Patel; Sebastian Stich; Zhen Dai; Brian Bullins; H. Brendan McMahan; Ohad Shamir; Nati Srebro; | We study local SGD (also known as parallel SGD and federated SGD), a natural and frequently used distributed optimization method. |

1046 | On Lp-norm Robustness of Ensemble Decision Stumps and Trees | Yihan Wang; Huan Zhang; Hongge Chen; Duane Boning; Cho-Jui Hsieh; | In this paper, we study the robustness verification and defense with respect to general $\ell_p$ norm perturbation for ensemble trees and stumps. |

1047 | Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data with RACE | Benjamin Coleman; Anshumali Shrivastava; Richard Baraniuk; | We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. |

1048 | Understanding Self-Training for Gradual Domain Adaptation | Ananya Kumar; Tengyu Ma; Percy Liang; | We consider gradual domain adaptation, where the goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. |

1049 | Concept Bottleneck Models | Pang Wei Koh; Thao Nguyen; Yew Siang Tang; Stephen Mussmann; Emma Pierson; Been Kim; Percy Liang; | We seek to learn models that support interventions on high-level concepts: would the model predict severe arthritis if it thought there was a bone spur in the x-ray? |

1050 | Optimal Bounds between f-Divergences and Integral Probability Metrics | Rohit Agrawal; Thibaut Horel; | In this work, we systematically study the relationship between these two families from the perspective of convex duality. |

1051 | Robustness to Spurious Correlations via Human Annotations | Megha Srivastava; Tatsunori Hashimoto; Percy Liang; | We present a framework for making models robust to spurious correlations by leveraging humans’ common sense knowledge of causality. |

1052 | DROCC: Deep Robust One-Class Classification | Sachin Goyal; Aditi Raghunathan; Moksh Jain; Harsha Vardhan Simhadri; Prateek Jain; | In this work, we propose Deep Robust One Class Classification (DROCC) method that is robust to such a collapse by training the network to distinguish the training points from their perturbations, generated adversarially. |

1053 | Efficiently Solving MDPs with Stochastic Mirror Descent | Yujia Jin; Aaron Sidford; | In this paper we present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. |

1054 | Handling the Positive-Definite Constraint in the Bayesian Learning Rule | Wu Lin; Mark Schmidt; Mohammad Emtiyaz Khan; | In this paper, we fix this issue for the positive-definite constraint by proposing an improved rule that naturally handles the constraint. |

1055 | A simpler approach to accelerated optimization: iterative averaging meets optimism | Pooria Joulani; Anant Raj; András György; Csaba Szepesvari; | In this paper, we show that there is a simpler approach to obtaining accelerated rates: applying generic, well-known optimistic online learning algorithms and using the online average of their predictions to query the (deterministic or stochastic) first-order optimization oracle at each time step. |

1056 | Training Binary Neural Networks using the Bayesian Learning Rule | Xiangming Meng; Roman Bachmann; Mohammad Emtiyaz Khan; | In this paper, we propose such an approach using the Bayesian learning rule. |

1057 | High-dimensional Robust Mean Estimation via Gradient Descent | Yu Cheng; Ilias Diakonikolas; Rong Ge; Mahdi Soltanolkotabi; | In this work, we show that a natural non-convex formulation of the problem can be solved directly by gradient descent. |

1058 | From Chaos to Order: Symmetry and Conservation Laws in Game Dynamics | Sai Ganesh Nagarajan; David Balduzzi; Georgios Piliouras; | In this paper, we present basic mechanism design tools for constructing games with predictable and controllable dynamics. |

1059 | Hierarchically Decoupled Morphological Transfer | Donald Hejna; Lerrel Pinto; Pieter Abbeel; | To this end, we propose a hierarchical decoupling of policies into two parts: an independently learned low-level policy and a transferable high-level policy. |

1060 | Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup | Jang-Hyun Kim; Wonho Choo; Hyun Oh Song; | To this end, we propose Puzzle Mix, a mixup method for explicitly utilizing the saliency information and the underlying statistics of the natural examples. |

1061 | Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers | Zhuohan Li; Eric Wallace; Sheng Shen; Kevin Lin; Kurt Keutzer; Dan Klein; Joseph Gonzalez; | We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. |

1062 | Interpolation between CNNs and ResNets | Zonghan Yang; Yang Liu; Chenglong Bao; Zuoqiang Shi; | In this paper, we present a novel ODE model by adding a damping term. |

1063 | Online metric algorithms with untrusted predictions | Antonios Antoniadis; Christian Coester; Marek Elias; Adam Polak; Bertrand Simon; | In this paper, we propose a prediction setup for Metrical Task Systems (MTS), a broad class of online decision-making problems including, e.g., caching, k-server and convex body chasing. |

1064 | Collaborative Machine Learning with Incentive-Aware Model Rewards | Rachael Hwee Ling Sim; Yehong Zhang; Bryan Kian Hsiang Low; Mun Choon Chan; | This paper proposes to value a party’s contribution based on Shapley value and information gain on model parameters given its data. |

1065 | On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent | Scott Pesme; Aymeric Dieuleveut; Nicolas Flammarion; | In this paper, we show that efficiently detecting this transition and appropriately decreasing the step size can lead to fast convergence rates. |

1066 | Equivariant Flows: exact likelihood generative learning for symmetric densities | Jonas Köhler; Leon Klein; Frank Noe; | We provide a theoretical sufficient criterion showing that the distribution generated by \textit{equivariant} normalizing flows is invariant with respect to these symmetries by design. |

1067 | PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination | Saurabh Goyal; Anamitra Roy Choudhury; Venkatesan Chakaravarthy; Saurabh Raje; Yogish Sabharwal; Ashish Verma; | We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy. |

1068 | Bayesian Sparsification of Deep C-valued Networks |