# Paper Digest: AISTATS 2020 Highlights

Readers can also choose to read this highlight article on our console, which allows users to filter papers by keyword and find related papers.

The International Conference on Artificial Intelligence and Statistics (AISTATS) is an interdisciplinary gathering of researchers at the intersection of computer science, artificial intelligence, machine learning, statistics, and related areas. In 2020, it was held virtually due to the COVID-19 pandemic.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights/summaries to quickly get the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn for new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: AISTATS 2020 Papers

No. | Title | Authors | Highlight |
---|---|---|---|

1 | Linearly Convergent Frank-Wolfe without Line-Search | Fabian Pedregosa, Geoffrey Negiar, Armin Askari, Martin Jaggi | In this paper we propose variants of Away-steps and Pairwise FW that lift both restrictions simultaneously. |

2 | Guarantees of Stochastic Greedy Algorithms for Non-monotone Submodular Maximization | Shinsaku Sakaue | In this paper, we prove that SG (with slight modification) can achieve almost $1/4$-approximation guarantees in expectation in linear time even if objective functions are non-monotone. |

3 | On Maximization of Weakly Modular Functions: Guarantees of Multi-stage Algorithms, Tractability, and Hardness | Shinsaku Sakaue | In this paper, we study cardinality-constrained maximization of {\it weakly modular} functions, whose closeness to being modular is measured by {\it submodularity} and {\it supermodularity ratios}, and reveal what we can and cannot do by using the weak modularity. |

4 | Adaptive Trade-Offs in Off-Policy Learning | Mark Rowland, Will Dabney, Remi Munos | In this paper, we take a unifying view of this space of algorithms, and consider their trade-offs of three fundamental quantities: update variance, fixed-point bias, and contraction rate. |

5 | Conditional Importance Sampling for Off-Policy Learning | Mark Rowland, Anna Harutyunyan, Hado Hasselt, Diana Borsa, Tom Schaul, Remi Munos, Will Dabney | The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. |

6 | Multiplicative Gaussian Particle Filter | Xuan Su, Wee Sun Lee, Zhen Zhang | We propose a new sampling-based approach for approximate inference in filtering problems. |

7 | Stretching the Effectiveness of MLE from Accuracy to Bias for Pairwise Comparisons | Jingyan Wang, Nihar Shah, R Ravi | In this work, we consider one specific type of fairness, which is the notion of bias in statistics. |

8 | Fast and Accurate Ranking Regression | Ilkay Yildiz, Jennifer Dy, Deniz Erdogmus, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, Stratis Ioannidis | Using this equivalence, we propose two spectral algorithms for ranking regression that learn model parameters up to 579 times faster than Newton’s method. |

9 | Tight Analysis of Privacy and Utility Tradeoff in Approximate Differential Privacy | Quan Geng, Wei Ding, Ruiqi Guo, Sanjiv Kumar | We characterize the minimum noise amplitude and power for noise-adding mechanisms in (epsilon, delta)-differential privacy for a single real-valued query function. |

10 | Long- and Short-Term Forecasting for Portfolio Selection with Transaction Costs | Guy Uziel, Ran El-Yaniv | In this paper we focus on the problem of online portfolio selection with transaction costs. |

11 | Nonparametric Sequential Prediction While Deep Learning the Kernel | Guy Uziel | In this paper, we therefore propose a novel algorithm that simultaneously satisfies a short-term goal, to perform as well as the best choice in hindsight of a data-adaptive kernel learned using a deep neural network, and a long-term goal, to achieve the same theoretical asymptotic guarantee. |

12 | Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation | Yuxuan Song, Ning Miao, Hao Zhou, Lantao Yu, Mingxuan Wang, Lei Li | In this paper, we propose $\psi$-MLE, a new training scheme for autoregressive sequence generative models, which is effective and stable when operating at large sample space encountered in text generation. |

13 | A Double Residual Compression Algorithm for Efficient Distributed Learning | Xiaorui Liu, Yao Li, Jiliang Tang, Ming Yan | In this paper, we propose DORE, a DOuble REsidual compression stochastic gradient descent algorithm, that reduces over 95% of the overall communication, greatly mitigating this obstacle. |

14 | Asynchronous Gibbs Sampling | Alexander Terenin, Daniel Simpson, David Draper | We introduce a theoretical framework for analyzing asynchronous Gibbs sampling and other extensions of MCMC that do not possess the Markov property. |

15 | Learning Fair Representations for Kernel Models | Zilong Tan, Samuel Yeom, Matt Fredrikson, Ameet Talwalkar | We leverage the classical Sufficient Dimension Reduction (SDR) framework to construct representations as subspaces of the reproducing kernel Hilbert space (RKHS), whose member functions are guaranteed to satisfy fairness. |

16 | A Nonparametric Off-Policy Policy Gradient | Samuele Tosatto, Joao Carvalho, Hany Abdulsamad, Jan Peters | We address this issue by building on the general sample efficiency of off-policy algorithms. |

17 | Non-Parametric Calibration for Classification | Jonathan Wenger, Hedvig Kjellström, Rudolph Triebel | In this paper, we propose a method that adjusts the confidence estimates of a general classifier such that they approach the probability of classifying correctly. |

18 | Minimax Testing of Identity to a Reference Ergodic Markov Chain | Geoffrey Wolfer, Aryeh Kontorovich | We obtain nearly matching (up to logarithmic factors) upper and lower sample complexity bounds for our notion of distance, which is based on total variation. |

19 | A Linear-time Independence Criterion Based on a Finite Basis Approximation | Longfei Yan, W. Bastiaan Kleijn, Thushara Abhayapala | We propose a novel independence criterion for two random variables with linear-time complexity. |

20 | Minimax Bounds for Structured Prediction Based on Factor Graphs | Kevin Bello, Asish Ghoshal, Jean Honorio | In this work, we provide minimax lower bounds for a class of general factor-graph inference models in the context of structured prediction. That is, we characterize the necessary sample complexity for any conceivable algorithm to achieve learning of general factor-graph predictors. |

21 | On the Convergence of SARAH and Beyond | Bingcong Li, Meng Ma, Georgios B. Giannakis | The main theme of this work is a unifying algorithm, \textbf{L}oop\textbf{L}ess \textbf{S}ARAH (L2S) for problems formulated as summation of n individual loss functions. |

22 | Uncertainty in Neural Networks: Approximately Bayesian Ensembling | Tim Pearce, Felix Leibfried, Alexandra Brintrup | This work proposes one modification to the usual process that we argue does result in approximate Bayesian inference; regularising parameters about values drawn from a distribution which can be set equal to the prior. |

23 | LIBRE: Learning Interpretable Boolean Rule Ensembles | Graziano Mita, Paolo Papotti, Maurizio Filippone, Pietro Michiardi | We present a novel method, LIBRE, to learn an interpretable classifier, which materializes as a set of Boolean rules. |

24 | Marginal Densities, Factor Graph Duality, and High-Temperature Series Expansions | Mehdi Molkaraie | We prove that the marginal densities of a global probability mass function in a primal normal factor graph and the corresponding marginal densities in the dual normal factor graph are related via local mappings. |

25 | Neighborhood Growth Determines Geometric Priors for Relational Representation Learning | Melanie Weber | In this paper, we propose a combinatorial approach to evaluating embeddability, i.e., to decide whether a data set is best represented in Euclidean, Hyperbolic or Spherical space. |

26 | Fair Decisions Despite Imperfect Predictions | Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, Isabel Valera | In this paper, we show that, in this selective labels setting, learning to predict is suboptimal in terms of both fairness and utility. |

27 | A Characterization of Mean Squared Error for Estimator with Bagging | Martin Mihelich, Charles Dognin, Yan Shu, Michael Blot | In this paper, we theoretically investigate how the bagging method can reduce the Mean Squared Error (MSE) when applied on a statistical estimator. |

28 | Uncertainty Quantification for Sparse Deep Learning | Yuexi Wang, Veronika Rockova | This paper takes a step forward in this important direction by taking a Bayesian point of view. |

29 | Minimizing Dynamic Regret and Adaptive Regret Simultaneously | Lijun Zhang, Shiyin Lu, Tianbao Yang | In this paper, we bridge this gap by proposing novel online algorithms that are able to minimize the dynamic regret and adaptive regret simultaneously. |

30 | A Stein Goodness-of-fit Test for Directional Distributions | Wenkai Xu, Takeru Matsuda | In this study, we propose nonparametric goodness-of-fit testing procedures for general directional distributions based on kernel Stein discrepancy. |

31 | Unsupervised Neural Universal Denoiser for Finite-Input General-Output Noisy Channel | Taeeon Park, Taesup Moon | We devise a novel neural network-based universal denoiser for the finite-input, general-output (FIGO) channel. |

32 | Leave-One-Out Cross-Validation for Bayesian Model Comparison in Large Data | Måns Magnusson, Aki Vehtari, Johan Jonasson, Michael Andersen | We propose an efficient method for estimating differences in predictive performance by combining fast approximate LOO surrogates with exact LOO sub-sampling using the difference estimator, and supply proofs with regard to scaling characteristics. |

33 | Robust Importance Weighting for Covariate Shift | Fengpei Li, Henry Lam, Siddharth Prusty | In this paper, we propose and analyze a new estimator that systematically integrates the residuals of NR with KMM reweighting, based on a control-variate perspective. |

34 | Adaptive Online Kernel Sampling for Vertex Classification | Peng Yang, Ping Li | In this paper, we introduce an online kernel sampling (OKS) technique, a new second-order OKL method that slightly improves the bound from $O(d \log(T))$ to $O(r \log(T))$, where $r$ is the rank of the learned data and is usually much smaller than $d$. |

35 | A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning | Nhan Pham, Lam Nguyen, Dzung Phan, Phuong Ha Nguyen, Marten Dijk, Quoc Tran-Dinh | We propose a novel hybrid stochastic policy gradient estimator by combining an unbiased policy gradient estimator, the REINFORCE estimator, with another biased one, an adapted SARAH estimator for policy optimization. |

36 | Stopping criterion for active learning based on deterministic generalization bounds | Hideaki Ishibashi, Hideitsu Hino | In this study, we propose a criterion for automatically stopping active learning. |

37 | Ivy: Instrumental Variable Synthesis for Causal Inference | Zhaobin Kuang, Frederic Sala, Nimit Sohoni, Sen Wu, Aldo Córdova-Palomera, Jared Dunnmon, James Priest, Christopher Re | To relax these assumptions, we propose Ivy, a new method to combine IV candidates that can handle correlated and invalid IV candidates in a robust manner. |

38 | High Dimensional Robust Sparse Regression | Liu Liu, Yanyao Shen, Tianyang Li, Constantine Caramanis | Our main contribution is a robust variant of Iterative Hard Thresholding. |

39 | Nested-Wasserstein Self-Imitation Learning for Sequence Generation | Ruiyi Zhang, Changyou Chen, Zhe Gan, Zheng Wen, Wenlin Wang, Lawrence Carin | To alleviate these issues, we propose the concept of nested-Wasserstein distance for distributional semantic matching. |

40 | Greed Meets Sparsity: Understanding and Improving Greedy Coordinate Descent for Sparse Optimization | Huang Fang, Zhenan Fan, Yifan Sun, Michael Friedlander | We present an improved convergence analysis of GCD for sparse optimization, and a formal analysis of its screening properties. |

41 | Recommendation on a Budget: Column Space Recovery from Partially Observed Entries with Random or Active Sampling | Carolyn Kim, Mohsen Bayati | We analyze alternating minimization for column space recovery of a partially observed, approximately low rank matrix with a growing number of columns and a fixed budget of observations per column. |

42 | Fast Noise Removal for k-Means Clustering | Sungjin Im, Mahshid Montazer Qaem, Benjamin Moseley, Xiaorui Sun, Rudy Zhou | This paper considers k-means clustering in the presence of noise. |

43 | Sketching Transformed Matrices with Applications to Natural Language Processing | Yingyu Liang, Zhao Song, Mengdi Wang, Lin Yang, Xin Yang | In this paper, we first propose a space-efficient sketching algorithm for computing the product of a given small matrix with the transformed matrix. |

44 | Unconditional Coresets for Regularized Loss Minimization | Alireza Samadian, Kirk Pruhs, Benjamin Moseley, Sungjin Im, Ryan Curtin | Our main result is that if the regularizer’s effect does not become negligible as the norm of the hypothesis scales, and as the data scales, then a uniform sample of modest size is with high probability a coreset. |

45 | ASAP: Architecture Search, Anneal and Prune | Asaf Noy, Niv Nayman, Tal Ridnik, Nadav Zamir, Sivan Doveh, Itamar Friedman, Raja Giryes, Lihi Zelnik | In this paper, we propose a differentiable search space that allows the annealing of architecture weights, while gradually pruning inferior operations, thus the search converges to a single output network in a continuous manner. |

46 | Understanding Generalization in Deep Learning via Tensor Methods | Jingling Li, Yanchao Sun, Jiahao Su, Taiji Suzuki, Furong Huang | In this work, we advance the understanding of the relations between the network’s architecture and its generalizability from the compression perspective. |

47 | Accelerating Gradient Boosting Machines | Haihao Lu, Sai Praneeth Karimireddy, Natalia Ponomareva, Vahab Mirrokni | In this work, we propose an Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov’s acceleration techniques into the design of GBM. |

48 | Online Binary Space Partitioning Forests | Xuhui Fan, Bin Li, Scott Sisson | In this paper, we develop an online BSP-Forest framework to address this limitation. |

49 | Sparse Hilbert-Schmidt Independence Criterion Regression | Benjamin Poignard, Makoto Yamada | In this paper, we propose the sparse Hilbert–Schmidt Independence Criterion (SpHSIC) regression, which is a versatile nonlinear feature selection algorithm based on the HSIC and is a continuous optimization variant of the well-known minimum redundancy maximum relevance (mRMR) feature selection algorithm. |

50 | Sharp Thresholds of the Information Cascade Fragility Under a Mismatched Model | Wasim Huleihel, Ofer Shayevitz | Accordingly, in this paper we study a mismatch model where players believe that the revealing probabilities are $\{q_\ell\}_{\ell\in\mathbb{N}}$ when they truly are $\{p_\ell\}_{\ell\in\mathbb{N}}$, and study the effect of this mismatch on information cascades. |

51 | Optimal sampling in unbiased active learning | Henrik Imberg, Johan Jonasson, Marina Axelson-Fisk | We argue that this produces suboptimal predictions and present sampling schemes for unbiased pool-based active learning that minimise the actual prediction error, and demonstrate a better predictive performance than competing methods on a number of benchmark datasets. |

52 | The Area of the Convex Hull of Sampled Curves: a Robust Functional Statistical Depth Measure | Guillaume Staerman, Pavlo Mozharovskyi, Stéphan Clémençon | In this paper, we propose a novel notion of functional depth based on the area of the convex hull of sampled curves, capturing gradual departures from centrality, even beyond the envelope of the data, in a natural fashion. |

53 | Diameter-based Interactive Structure Discovery | Christopher Tosh, Daniel Hsu | We introduce interactive structure discovery, a generic framework that encompasses many interactive learning settings, including active learning, top-k item identification, interactive drug discovery, and others. |

54 | Utility/Privacy Trade-off through the lens of Optimal Transport | Etienne Boursier, Vianney Perchet | Unlike classical solutions that focus on the first point, we consider instead agents that optimize a natural trade-off between both objectives. |

55 | A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case | Maxime Laborde, Adam Oberman | We show that this connection can be extended to the case of stochastic gradients, and develop Lyapunov function based convergence rates proof for Nesterov’s accelerated stochastic gradient descent. |

56 | Interpretable Deep Gaussian Processes with Moments | Chi-Ken Lu, Scott Cheng-Hsin Yang, Xiaoran Hao, Patrick Shafto | We propose an interpretable DGP based on approximating the DGP as a GP by calculating the exact moments, which additionally identifies the heavy-tailed nature of some DGP distributions. |

57 | Approximate Inference in Discrete Distributions with Monte Carlo Tree Search and Value Functions | Lars Buesing, Nicolas Heess, Theophane Weber | In particular, we propose the TreeSample algorithm, an adaptation of Monte Carlo Tree Search to approximate inference. |

58 | Accelerated Bayesian Optimisation through Weight-Prior Tuning | Alistair Shilton, Sunil Gupta, Santu Rana, Pratibha Vellanki, Cheng Li, Svetha Venkatesh, Laurence Park, Alessandra Sutti, David Rubin, Thomas Dorin, Alireza Vahid, Murray Height, Teo Slezak | In this paper we show how such auxiliary data may be used to construct a GP covariance corresponding to a more appropriate weight prior for the objective function. |

59 | Variance Reduction for Evolution Strategies via Structured Control Variates | Yunhao Tang, Krzysztof Choromanski, Alp Kucukelbir | We propose a new method for improving accuracy of the ES algorithms, that as opposed to recent approaches utilizing only Monte Carlo structure of the gradient estimator, takes advantage of the underlying MDP structure to reduce the variance. |

60 | Optimization of Graph Total Variation via Active-Set-based Combinatorial Reconditioning | Zhenzhang Ye, Thomas Möllenhoff, Tao Wu, Daniel Cremers | In this work, we propose a novel adaptive preconditioning strategy for proximal algorithms on this problem class. |

61 | Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization | Kenji Kawaguchi, Haihao Lu | We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. |

62 | A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent | Eduard Gorbunov, Filip Hanzely, Peter Richtarik | In this paper we introduce a unified analysis of a large family of variants of proximal stochastic gradient descent (SGD) which so far have required different intuitions, convergence analyses, have different applications, and which have been developed separately in various communities. |

63 | Entropy Weighted Power k-Means Clustering | Saptarshi Chakraborty, Debolina Paul, Swagatam Das, Jason Xu | This paper addresses these issues by introducing entropy regularization to learn feature relevance while annealing. |

64 | Identifying and Correcting Label Bias in Machine Learning | Heinrich Jiang, Ofir Nachum | In this paper, we provide a mathematical formulation of how this bias can arise. |

65 | AsyncQVI: Asynchronous-Parallel Q-Value Iteration for Discounted Markov Decision Processes with Near-Optimal Sample Complexity | Yibo Zeng, Fei Feng, Wotao Yin | In this paper, we propose AsyncQVI, an asynchronous-parallel Q-value iteration for discounted Markov decision processes whose transition and reward can only be sampled through a generative model. |

66 | Active Community Detection with Maximal Expected Model Change | Dan Kushnir, Benjamin Mirabelli | We present a novel active learning algorithm for community detection on networks. |

67 | RCD: Repetitive causal discovery of linear non-Gaussian acyclic models with latent confounders | Takashi Nicholas Maeda, Shohei Shimizu | This paper proposes a causal functional model-based method called repetitive causal discovery (RCD) to discover the causal structure of observed variables affected by latent confounders. |

68 | A Simple Approach for Non-stationary Linear Bandits | Peng Zhao, Lijun Zhang, Yuan Jiang, Zhi-Hua Zhou | In this paper, we demonstrate that a simple restarted strategy is sufficient to attain the same regret guarantee. |

69 | Distributionally Robust Formulation and Model Selection for the Graphical Lasso | Pedro Cisneros, Alexander Petersen, Sang-Yun Oh | Building on a recent framework for distributionally robust optimization, we consider inverse covariance matrix estimation for multivariate data. |

70 | Efficient Spectrum-Revealing CUR Matrix Decomposition | Cheng Chen, Ming Gu, Zhihua Zhang, Weinan Zhang, Yong Yu | In this paper, we propose a novel CUR algorithm based on truncated LU factorization with an efficient variant of complete pivoting. |

71 | Graph DNA: Deep Neighborhood Aware Graph Encoding for Collaborative Filtering | Liwei Wu, Hsiang-Fu Yu, Nikhil Rao, James Sharpnack, Cho-Jui Hsieh | In this paper, we consider recommender systems with side information in the form of graphs. |

72 | Characterization of Overlap in Observational Studies | Michael Oberst, Fredrik Johansson, Dennis Wei, Tian Gao, Gabriel Brat, David Sontag, Kush Varshney | We formalize overlap estimation as a problem of finding minimum volume sets subject to coverage constraints and reduce this problem to binary classification with Boolean rule classifiers. |

73 | Modular Block-diagonal Curvature Approximations for Feedforward Architectures | Felix Dangel, Stefan Harmeling, Philipp Hennig | We propose a modular extension of backpropagation for the computation of block-diagonal approximations to various curvature matrices of the training objective (in particular, the Hessian, generalized Gauss-Newton, and positive-curvature Hessian). |

74 | A Unified Statistically Efficient Estimation Framework for Unnormalized Models | Masatoshi Uehara, Takafumi Kanamori, Takashi Takenouchi, Takeru Matsuda | In this study, we propose a unified, statistically efficient estimation framework for unnormalized models and several efficient estimators, whose asymptotic variance is the same as the MLE. |

75 | More Powerful Selective Kernel Tests for Feature Selection | Jen Ning Lim, Makoto Yamada, Wittawat Jitkrittum, Yoshikazu Terada, Shigeyuki Matsui, Hidetoshi Shimodaira | We show how recent advances in multiscale bootstrap make this possible and demonstrate our proposal over a range of synthetic and real world experiments. |

76 | Imputation estimators for unnormalized models with missing data | Masatoshi Uehara, Takeru Matsuda, Jae Kwang Kim | We propose estimation methods for such unnormalized models with missing data. |

77 | Wasserstein Style Transfer | Youssef Mroueh | We propose Gaussian optimal transport for image style transfer in an Encoder/Decoder framework. |

78 | Elimination of All Bad Local Minima in Deep Learning | Kenji Kawaguchi, Leslie Kaelbling | In this paper, we theoretically prove that adding one special neuron per output unit eliminates all suboptimal local minima of any deep neural network, for multi-class classification, binary classification, and regression with an arbitrary loss function, under practical assumptions. |

79 | Fully Decentralized Joint Learning of Personalized Models and Collaboration Graphs | Valentina Zantedeschi, Aurélien Bellet, Marc Tommasi | We propose to train personalized models that leverage a collaboration graph describing the relationships between user personal tasks, which we learn jointly with the models. |

80 | Formal Limitations on the Measurement of Mutual Information | David McAllester, Karl Stratos | In this paper, we prove that serious statistical limitations are inherent to any method of measuring mutual information. |

81 | Scalable Feature Selection for (Multitask) Gradient Boosted Trees | Cuize Han, Nikhil Rao, Daria Sorokina, Karthik Subbian | We develop a scalable forward feature selection variant for GBDT, via a novel group testing procedure that works well in high dimensions, and enjoys favorable theoretical performance and computational guarantees. |

82 | Model-Agnostic Counterfactual Explanations for Consequential Decisions | Amir-Hossein Karimi, Gilles Barthe, Borja Balle, Isabel Valera | In contrast, we build on standard theory and tools from formal verification and propose a novel algorithm that solves a sequence of satisfiability problems, where both the distance function (objective) and predictive model (constraints) are represented as logic formulae. |

83 | Obfuscation via Information Density Estimation | Hsiang Hsu, Shahab Asoodeh, Flavio Calmon | In this paper, we propose a framework to identify information-leaking features via information density estimation. |

84 | Linear Dynamics: Clustering without identification | Chloe Hsu, Michaela Hardt, Moritz Hardt | We analyze a computationally efficient and provably convergent algorithm to estimate the eigenvalues of the state-transition matrix in a linear dynamical system. |

85 | Low-rank regularization and solution uniqueness in over-parameterized matrix sensing | Kelly Geyer, Anastasios Kyrillidis, Amir Kalev | In this contribution, we prove that in fact, under certain conditions, the PSD constraint by itself is sufficient to lead to a unique low-rank matrix recovery, without explicit or implicit regularization. |

86 | Robustness for Non-Parametric Classification: A Generic Attack and Defense | Yao-Yuan Yang, Cyrus Rashtchian, Yizhen Wang, Kamalika Chaudhuri | In this work, we take a holistic look at adversarial examples for non-parametric classifiers, including nearest neighbors, decision trees, and random forests. |

87 | Contextual Online False Discovery Rate Control | Shiyun Chen, Shiva Kasiviswanathan | In this paper, we consider a setting where an ordered (possibly infinite) sequence of hypotheses arrives in a stream, and for each hypothesis we observe a p-value along with a set of features specific to that hypothesis. |

88 | Sequential no-Substitution k-Median-Clustering | Tom Hess, Sivan Sabato | We study the sample-based k-median clustering objective under a sequential setting without substitutions. |

89 | Robust Learning from Discriminative Feature Feedback | Sanjoy Dasgupta, Sivan Sabato | In this paper, we introduce a more realistic, *robust* version of the framework, in which the annotator is allowed to make mistakes. |

90 | Hermitian matrices for clustering directed graphs: insights and applications | Mihai Cucuringu, Huan Li, He Sun, Luca Zanetti | To overcome these downsides, we propose a spectral clustering algorithm based on a complex-valued matrix representation of digraphs. |

91 | Kernel Conditional Density Operators | Ingmar Schuster, Mattes Mollenhauer, Stefan Klus, Krikamol Muandet | We introduce a novel conditional density estimation model termed the conditional density operator (CDO). |

92 | Learning Overlapping Representations for the Estimation of Individualized Treatment Effects | Yao Zhang, Alexis Bellot, Mihaela Schaar | Based on these results, we develop a deep kernel regression algorithm and posterior regularization framework that substantially outperforms the state-of-the-art on a variety of benchmarks data sets. |

93 | Additive Tree-Structured Covariance Function for Conditional Parameter Spaces in Bayesian Optimization | Xingchen Ma, Matthew Blaschko | In this work, we generalize the additive assumption to tree-structured functions and propose an additive tree-structured covariance function, showing improved sample-efficiency, wider applicability and greater flexibility. |

94 | Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms | Ping Ma, Xinlian Zhang, Xin Xing, Jingyi Ma, Michael Mahoney | In this article, we develop asymptotic analysis to derive the distribution of RandNLA sampling estimators for the least-squares problem. |

95 | The Fast Loaded Dice Roller: A Near-Optimal Exact Sampler for Discrete Probability Distributions | Feras Saad, Cameron Freer, Martin Rinard, Vikash Mansinghka | This paper introduces a new algorithm for the fundamental problem of generating a random integer from a discrete probability distribution using a source of independent and unbiased random coin flips. |

96 | A Fast Anderson-Chebyshev Acceleration for Nonlinear Optimization | Zhize Li, Jian Li | In this paper, we show that Anderson acceleration with Chebyshev polynomials can achieve the optimal convergence rate $O(\sqrt{\kappa}\ln\frac{1}{\epsilon})$, which improves the previous result $O(\kappa\ln\frac{1}{\epsilon})$ provided by (Toth & Kelley, 2015) for quadratic functions. |

97 | Black Box Submodular Maximization: Discrete and Continuous Settings | Lin Chen, Mingrui Zhang, Hamed Hassani, Amin Karbasi | In this paper, we consider the problem of black box continuous submodular maximization where we only have access to the function values and no information about the derivatives is provided. |

98 | Corruption-Tolerant Gaussian Process Bandit Optimization | Ilija Bogunovic, Andreas Krause, Jonathan Scarlett | We introduce an algorithm Fast-Slow GP-UCB based on Gaussian process methods, randomized selection between two instances labeled ’fast’ (but non-robust) and ’slow’ (but robust), enlarged confidence bounds, and the principle of optimism under uncertainty. |

99 | On the Convergence Theory of Gradient-Based Model-Agnostic Meta-Learning Algorithms | Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar | We start with the MAML method and its first-order approximation (FO-MAML) and highlight the challenges that emerge in their analysis. |

100 | Alternating Minimization Converges Super-Linearly for Mixed Linear Regression | Avishek Ghosh, Ramchandran Kannan | In this paper, we close this gap between theory and practice for the special case of a mixture of $2$ linear regressions. |

101 | Learning Gaussian Graphical Models via Multiplicative Weights | Anamay Chaturvedi, Jonathan Scarlett | In this paper, we adapt a recently proposed algorithm of Klivans and Meka (FOCS, 2017), based on the method of multiplicative weight updates, from the Ising model to the Gaussian model, via non-trivial modifications to both the algorithm and its analysis. |

102 | Mitigating Overfitting in Supervised Classification from Two Unlabeled Datasets: A Consistent Risk Correction Approach | Nan Lu, Tianyi Zhang, Gang Niu, Masashi Sugiyama | Therefore, we propose to wrap the terms that cause a negative empirical risk by certain correction functions. |

103 | Infinitely deep neural networks as diffusion processes | Stefano Peluchetti, Stefano Favaro | We consider parameter distributions that shrink as the number of layers increases in order to recover well-behaved stochastic processes in the limit of infinite depth. |

104 | Stable behaviour of infinitely wide deep neural networks | Stefano Peluchetti, Stefano Favaro, Sandra Fortini | We consider fully connected feed-forward deep neural networks (NNs) where weights and biases are independent and identically distributed as symmetric centered stable distributions. |

105 | Neural Topic Model with Attention for Supervised Learning | Xinyi Wang, Yi Yang | This paper presents Topic Attention Model (TAM), a supervised neural topic model that integrates an attention recurrent neural network (RNN) model. |

106 | Causal Mosaic: Cause-Effect Inference via Nonlinear ICA and Ensemble Method | Pengzhou Wu, Kenji Fukumizu | We address the problem of distinguishing cause from effect in bivariate setting. |

107 | Stochastic Bandits with Delay-Dependent Payoffs | Leonardo Cella, Nicolò Cesa-Bianchi | Motivated by recommendation problems in music streaming platforms, we propose a nonstationary stochastic bandit model in which the expected reward of an arm depends on the number of rounds that have passed since the arm was last pulled. |

108 | Risk Bounds for Learning Multiple Components with Permutation-Invariant Losses | Fabien Lauer | This paper proposes a simple approach to derive efficient error bounds for learning multiple components with sparsity-inducing regularization. |

109 | Balancing Learning Speed and Stability in Policy Gradient via Adaptive Exploration | Matteo Papini, Andrea Battistello, Marcello Restelli | Using tools from the safe PG literature, we design a surrogate objective for the policy variance that captures the effects this parameter has on the learning speed and on the quality of the final solution. |

110 | Independent Subspace Analysis for Unsupervised Learning of Disentangled Representations | Jan Stuehmer, Richard Turner, Sebastian Nowozin | We first show that these modifications, e.g. beta-VAE, simplify the tendency of variational inference to underfit, causing pathological over-pruning and over-orthogonalization of learned components. Second, we propose a complementary approach: to modify the probabilistic model with a structured latent prior. |

111 | A Practical Algorithm for Multiplayer Bandits when Arm Means Vary Among Players | Abbas Mehrabian, Etienne Boursier, Emilie Kaufmann, Vianney Perchet | We consider the challenging heterogeneous setting, in which different arms may have different means for different players, and propose a new and efficient algorithm that combines the idea of leveraging forced collisions for implicit communication and that of performing matching eliminations. |

112 | Regularity as Regularization: Smooth and Strongly Convex Brenier Potentials in Optimal Transport | François-Pierre Paty, Alexandre d'Aspremont, Marco Cuturi | We propose in this work to draw inspiration from this theory and use regularity as a regularization tool. |

113 | On Generalization Bounds of a Family of Recurrent Neural Networks | Minshuo Chen, Xingguo Li, Tuo Zhao | To connect theory and practice, we study the generalization properties of vanilla RNNs as well as their variants, including Minimal Gated Unit (MGU), Long Short Term Memory (LSTM), and Convolutional (Conv) RNNs. |

114 | Simulator Calibration under Covariate Shift with Kernels | Keiichi Kisamori, Motonobu Kanagawa, Keisuke Yamazaki | We propose a novel calibration method for computer simulators, dealing with the problem of covariate shift. Covariate shift is the situation where the input distributions for training and test are different, and is ubiquitous in applications of simulations. |

115 | Convergence Rates of Gradient Descent and MM Algorithms for Bradley-Terry Models | Milan Vojnovic, Se-Young Yun, Kaifang Zhou | We present tight convergence rate bounds for gradient descent and MM algorithms for maximum likelihood (ML) estimation and maximum a posteriori (MAP) estimation, a popular Bayesian inference method, for Bradley-Terry models of ranking data. |

116 | A Locally Adaptive Bayesian Cubature Method | Matthew Fisher, Chris Oates, Catherine Powell, Aretha Teckentrup | The main contributions of this work are twofold: first, we establish that existing BC methods do not possess local adaptivity in the sense of many classical adaptive methods; second, we develop a novel BC method whose behaviour, demonstrated empirically, is analogous to such methods. |

117 | Fast and Bayes-consistent nearest neighbors | Klim Efremenko, Aryeh Kontorovich, Moshe Noivirt | This paper aims at bridging these realms: to reap the advantages of fast evaluation time while maintaining Bayes consistency, and further without sacrificing too much in the risk decay rate. |

118 | Explaining the Explainer: A First Theoretical Analysis of LIME | Damien Garreau, Ulrike Luxburg | In this paper, we provide the first theoretical analysis of LIME. |

119 | A Continuous-time Perspective for Modeling Acceleration in Riemannian Optimization | Foivos Alimisis, Antonio Orvieto, Gary Becigneul, Aurelien Lucchi | We propose a novel second-order ODE as the continuous-time limit of a Riemannian accelerated gradient-based method on a manifold with curvature bounded from below. |

120 | Deep Active Learning: Unified and Principled Method for Query and Training | Changjian Shui, Fan Zhou, Christian Gagné, Boyu Wang | In this paper, we propose a unified and principled method for both the querying and training processes in deep batch active learning. |

121 | Sparse and Low-rank Tensor Estimation via Cubic Sketchings | Botao Hao, Anru R. Zhang, Guang Cheng | In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. |

122 | A nonasymptotic law of iterated logarithm for general M-estimators | Arnak Dalalyan, Nicolas Schreuder, Victor-Emmanuel Brunel | In this paper, we propose the first non-asymptotic ’any-time’ deviation bounds for general M-estimators, where ’any-time’ means that the bound holds with a prescribed probability for every sample size. |

123 | Robust Stackelberg buyers in repeated auctions | Thomas Nedelec, Clement Calauzenes, Vianney Perchet, Noureddine El Karoui | We consider the practical and classical setting where the seller uses an exploration stage to learn the value distributions of the bidders before running a revenue-maximizing auction in an exploitation phase. |

124 | Radial Bayesian Neural Networks: Beyond Discrete Support In Large-Scale Bayesian Deep Learning | Sebastian Farquhar, Michael A. Osborne, Yarin Gal | We propose Radial Bayesian Neural Networks (BNNs): a variational approximate posterior for BNNs which scales well to large models. |

125 | Practical Nonisotropic Monte Carlo Sampling in High Dimensions via Determinantal Point Processes | Krzysztof Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang | We propose a new class of practical structured methods for nonisotropic Monte Carlo (MC) sampling, called DPPMC, designed for high-dimensional nonisotropic distributions where samples are correlated to reduce the variance of the estimator via determinantal point processes. |

126 | Fast and Furious Convergence: Stochastic Second Order Methods under Interpolation | Si Yi Meng, Sharan Vaswani, Issam Hadj Laradji, Mark Schmidt, Simon Lacoste-Julien | We consider stochastic second-order methods for minimizing smooth and strongly-convex functions under an interpolation condition satisfied by over-parameterized models. |

127 | Two-sample Testing Using Deep Learning | Matthias Kirchler, Shahryar Khorasani, Marius Kloft, Christoph Lippert | We propose a two-sample testing procedure based on learned deep neural network representations. |

128 | RATQ: A Universal Fixed-Length Quantizer for Stochastic Optimization | Prathamesh Mayekar, Himanshu Tyagi | We present Rotated Adaptive Tetra-iterated Quantizer (RATQ), a fixed-length quantizer for gradients in first-order stochastic optimization. |

129 | Rep the Set: Neural Networks for Learning Set Representations | Konstantinos Skianis, Giannis Nikolentzos, Stratis Limnios, Michalis Vazirgiannis | In this paper, we present a new neural network architecture, called RepSet, that can handle examples that are represented as sets of vectors. |

130 | A Multiclass Classification Approach to Label Ranking | Robin Vogel, Stéphan Clémençon | This article is devoted to the analysis of this statistical learning problem, halfway between multiclass classification and posterior probability estimation (regression) and referred to as \textit{label ranking} here. |

131 | Conservative Exploration in Reinforcement Learning | Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta | We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning. |

132 | A principled approach for generating adversarial images under non-smooth dissimilarity metrics | Aram-Alexandre Pooladian, Chris Finlay, Tim Hoheisel, Adam Oberman | In this work, we propose an attack methodology not only for cases where the perturbations are measured by Lp norms, but in fact any adversarial dissimilarity metric with a closed proximal form. |

133 | Regularization via Structural Label Smoothing | Weizhi Li, Gautam Dasarathy, Visar Berisha | In this paper, we focus on label smoothing, a form of output distribution regularization that prevents overfitting of a neural network by softening the ground-truth labels in the training data in an attempt to penalize overconfident outputs. |

134 | Communication-Efficient Asynchronous Stochastic Frank-Wolfe over Nuclear-norm Balls | Jiacheng Zhuo, Qi Lei, Alex Dimakis, Constantine Caramanis | In this work, we propose an asynchronous Stochastic Frank Wolfe (SFW-asyn) method, which, for the first time, solves the two problems simultaneously, while successfully maintaining the same convergence rate as the vanilla SFW. |

135 | Linear Convergence of Adaptive Stochastic Gradient Descent | Yuege Xie, Xiaoxia Wu, Rachel Ward | We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak Lojasiewicz (PL) inequality. |

136 | Contextual Combinatorial Volatile Multi-armed Bandit with Adaptive Discretization | Andi Nika, Sepehr Elahi, Cem Tekin | Under the semi-bandit feedback setting and assuming that the contexts lie in a space ${\cal X}$ endowed with the Euclidean norm and that the expected base arm outcomes (expected rewards) are Lipschitz continuous in the contexts (expected base arm outcomes), we propose an algorithm called Adaptive Contextual Combinatorial Upper Confidence Bound (ACC-UCB). |

137 | A Unified Analysis of Extra-gradient and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach | Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil | In this paper we consider solving saddle point problems using two variants of Gradient Descent-Ascent algorithms, Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods. |

138 | Bandit Convex Optimization in Non-stationary Environments | Peng Zhao, Guanghui Wang, Lijun Zhang, Zhi-Hua Zhou | In this paper, we investigate BCO in non-stationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. |

139 | Decentralized Multi-player Multi-armed Bandits with No Collision Information | Chengshuai Shi, Wei Xiong, Cong Shen, Jing Yang | The decentralized stochastic multi-player multi-armed bandit (MP-MAB) problem, where the collision information is not available to the players, is studied in this paper. |

140 | Bayesian Image Classification with Deep Convolutional Gaussian Processes | Vincent Dutordoir, Mark van der Wilk, Artem Artemev, James Hensman | We propose a translation insensitive convolutional kernel, which relaxes the translation invariance constraint imposed by previous convolutional GPs. |

141 | Optimizing Millions of Hyperparameters by Implicit Differentiation | Jonathan Lorraine, Paul Vicol, David Duvenaud | We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. |

142 | A Topology Layer for Machine Learning | Rickard Brüel Gabrielsson, Bradley J. Nelson, Anjan Dwaraknath, Primoz Skraba | We present a differentiable topology layer that computes persistent homology based on level set filtrations and edge-based filtrations. |

143 | Differentiable Feature Selection by Discrete Relaxation | Rishit Sheth, Nicolò Fusi | In this paper, we introduce Differentiable Feature Selection, a gradient-based search algorithm for feature selection. |

144 | Private Protocols for U-Statistics in the Local Model and Beyond | James Bell, Aurélien Bellet, Adria Gascon, Tejas Kulkarni | In this paper, we study the problem of computing $U$-statistics of degree $2$, i.e., quantities that come in the form of averages over pairs of data points, in the local model of differential privacy (LDP). |

145 | Automatic Differentiation of Some First-Order Methods in Parametric Optimization | Sheheryar Mehmood, Peter Ochs | We aim at computing the derivative of the solution to a parametric optimization problem with respect to the involved parameters. |

146 | DYNOTEARS: Structure Learning from Time-Series Data | Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, Bryon Aragam | We revisit the structure learning problem for dynamic Bayesian networks and propose a method that simultaneously estimates contemporaneous (intra-slice) and time-lagged (inter-slice) relationships between variables in a time-series. |

147 | Unsupervised Hierarchy Matching with Optimal Transport over Hyperbolic Spaces | David Alvarez-Melis, Youssef Mroueh, Tommi Jaakkola | In contrast, we approach the problem from a purely geometric perspective: given only a vector-space representation of the items in the two hierarchies, we seek to infer correspondences across them. |

148 | Competing Bandits in Matching Markets | Lydia T. Liu, Horia Mania, Michael Jordan | We propose a statistical learning model in which one side of the market does not have a priori knowledge about its preferences for the other side and is required to learn these from stochastic rewards. |

149 | Revisiting the Landscape of Matrix Factorization | Hossein Valavi, Sulin Liu, Peter Ramadge | We revisit this problem and provide simple, intuitive proofs of a set of extended results for low-rank and general-rank problems. |

150 | Value Preserving State-Action Abstractions | David Abel, Nate Umbanhowar, Khimya Khetarpal, Dilip Arumugam, Doina Precup, Michael Littman | To mitigate this, we here introduce combinations of state abstractions and options that are guaranteed to preserve representation of near-optimal policies. |

151 | GP-VAE: Deep Probabilistic Time Series Imputation | Vincent Fortuin, Dmitry Baranchuk, Gunnar Raetsch, Stephan Mandt | We propose a new deep sequential latent variable model for dimensionality reduction and data imputation. |

152 | Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction | Boyue Li, Shicong Cen, Yuxin Chen, Yuejie Chi | This paper focuses on distributed optimization in the network setting (also known as the decentralized setting), where each agent is only allowed to aggregate information from its neighbors over a graph. |

153 | Optimized Score Transformation for Fair Classification | Dennis Wei, Karthikeyan Natesan Ramamurthy, Flavio Calmon | In the finite sample setting, we propose to approach this solution using a combination of standard probabilistic classifiers and ADMM. |

154 | Variational Autoencoders for Sparse and Overdispersed Discrete Data | He Zhao, Piyush Rai, Lan Du, Wray Buntine, Dinh Phung, Mingyuan Zhou | To address these issues, we develop a VAE-based framework using the negative binomial distribution as the data distribution. |

155 | Spatio-temporal alignments: Optimal transport through space and time | Hicham Janati, Marco Cuturi, Alexandre Gramfort | In this paper, we propose Spatio-Temporal Alignments (STA), a new differentiable formulation of DTW that captures spatial and temporal variability. |

156 | Accelerating Smooth Games by Manipulating Spectral Shapes | Waïss Azizian, Damien Scieur, Ioannis Mitliagkas, Simon Lacoste-Julien, Gauthier Gidel | In this framework, we describe gradient-based methods, such as extragradient, as transformations on the spectral shape. |

157 | Langevin Monte Carlo without smoothness | Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan, Peter Bartlett | In this paper, we remove this limitation, providing polynomial-time convergence guarantees for a variant of LMC in the setting of nonsmooth log-concave distributions. |

158 | EM Converges for a Mixture of Many Linear Regressions | Jeongyeol Kwon, Constantine Caramanis | We study the convergence of the Expectation-Maximization (EM) algorithm for mixtures of linear regressions with an arbitrary number $k$ of components. |

159 | Locally Accelerated Conditional Gradients | Jelena Diakonikolas, Alejandro Carderera, Sebastian Pokutta | To address this issue, we present Locally Accelerated Conditional Gradients – an algorithmic framework that couples accelerated steps with conditional gradient steps to achieve \emph{local} acceleration on smooth strongly convex problems. |

160 | Coping With Simulators That Don't Always Return | Andrew Warrington, Frank Wood, Saeid Naderiparizi | We investigate and address computational inefficiencies that arise from adding process noise to deterministic simulators that fail to return for certain inputs; a property we describe as 'brittle'. |

161 | Post-Estimation Smoothing: A Simple Baseline for Learning with Side Information | Esther Rolf, Michael I. Jordan, Benjamin Recht | We propose a post-estimation smoothing operator as a fast and effective method for incorporating structural index data into prediction. |

162 | Equalized odds postprocessing under imperfect group information | Pranjal Awasthi, Matthäus Kleindessner, Jamie Morgenstern | In this paper, we ask to what extent fairness interventions can be effective even when only imperfect information about the protected attribute is available. |

163 | The True Sample Complexity of Identifying Good Arms | Julian Katz-Samuels, Kevin Jamieson | We consider two multi-armed bandit problems with $n$ arms: \emph{(i)} given an $\epsilon > 0$, identify an arm with mean that is within $\epsilon$ of the largest mean and \emph{(ii)} given a threshold $\mu_0$ and integer $k$, identify $k$ arms with means larger than $\mu_0$. |

164 | Validated Variational Inference via Practical Posterior Error Bounds | Jonathan Huggins, Mikolaj Kasprzak, Trevor Campbell, Tamara Broderick | In this paper, we provide rigorous bounds on the error of posterior mean and uncertainty estimates that arise from full-distribution approximations, as in variational inference. |

165 | A Rule for Gradient Estimator Selection, with an Application to Variational Inference | Tomas Geffner, Justin Domke | Inspired by this principle, we propose a technique to automatically select an estimator when a finite pool of estimators is given. |

166 | Naive Feature Selection: Sparsity in Naive Bayes | Armin Askari, Alexandre d'Aspremont, Laurent El Ghaoui | We propose a sparse version of naive Bayes, which can be used for feature selection. |

167 | Fixed-confidence guarantees for Bayesian best-arm identification | Xuedong Shang, Rianne Heide, Pierre Menard, Emilie Kaufmann, Michal Valko | As our main contribution, we provide the first sample complexity analysis of TTTS and T3C when coupled with a very natural Bayesian stopping rule, for bandits with Gaussian rewards, solving one of the open questions raised by Russo (2016). |

168 | Learning Hierarchical Interactions at Scale: A Convex Optimization Approach | Hussein Hazimeh, Rahul Mazumder | In this paper, we study a convex relaxation which enforces strong hierarchy and develop a highly scalable algorithm based on proximal gradient descent. |

169 | OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits | Niladri Chatterji, Vidya Muthukumar, Peter Bartlett | We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden simple multi-armed bandit structure in which the rewards are independent of the contextual information. |

170 | Optimization Methods for Interpretable Differentiable Decision Trees Applied to Reinforcement Learning | Andrew Silva, Matthew Gombolay, Taylor Killian, Ivan Jimenez, Sung-Hyun Son | We overcome this limitation by allowing for a gradient update over the entire tree that improves sample complexity and affords interpretable policy extraction. |

171 | Sharp Analysis of Expectation-Maximization for Weakly Identifiable Models | Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin Wainwright, Michael Jordan, Bin Yu | We study a class of weakly identifiable location-scale mixture models for which the maximum likelihood estimates based on $n$ i.i.d. samples are known to have lower accuracy than the classical $n^{- \frac{1}{2}}$ error. |

172 | Stochastic Particle-Optimization Sampling and the Non-Asymptotic Convergence Theory | Jianyi Zhang, Ruiyi Zhang, Lawrence Carin, Changyou Chen | Notably, for the first time, we develop non-asymptotic convergence theory for the SPOS framework (related to SVGD), characterizing algorithm convergence in terms of the 1-Wasserstein distance w.r.t. the numbers of particles and iterations. |

173 | Dynamical Systems Theory for Causal Inference with Application to Synthetic Control Methods | Yi Ding, Panos Toulis | In this paper, we adopt results in nonlinear time series analysis for causal inference in dynamical settings. |

174 | RelatIF: Identifying Explanatory Training Samples via Relative Influence | Elnaz Barshan, Marc-Etienne Brunet, Gintare Karolina Dziugaite | In this work, we focus on the use of influence functions to identify relevant training examples that one might hope “explain” the predictions of a machine learning model. |

175 | Ensemble Gaussian Processes with Spectral Features for Online Interactive Learning with Scalability | Qin Lu, Georgios Karanikolas, Yanning Shen, Georgios B. Giannakis | While most GP approaches rely on a single preselected prior, the present work employs a weighted ensemble of GP priors, each having a unique covariance (kernel) belonging to a prescribed kernel dictionary – which leads to a richer space of learning functions. |

176 | Distributionally Robust Bayesian Quadrature Optimization | Thanh Nguyen, Sunil Gupta, Huong Ha, Santu Rana, Svetha Venkatesh | In this work, we study BQO under distributional uncertainty in which the underlying probability distribution is unknown except for a limited set of its i.i.d samples. |

177 | Sparse Orthogonal Variational Inference for Gaussian Processes | Jiaxin Shi, Michalis Titsias, Andriy Mnih | We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. |

178 | The Sylvester Graphical Lasso (SyGlasso) | Yu Wang, Byoungwook Jang, Alfred Hero | This paper introduces the Sylvester graphical lasso (SyGlasso) that captures multiway dependencies present in tensor-valued data. |

179 | Frequentist Regret Bounds for Randomized Least-Squares Value Iteration | Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, Alessandro Lazaric | In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm where exploration is induced by perturbing the least-squares approximation of the action-value function. |

180 | DAve-QN: A Distributed Averaged Quasi-Newton Method with Local Superlinear Convergence Rate | Saeed Soori, Konstantin Mishchenko, Aryan Mokhtari, Maryam Mehri Dehnavi, Mert Gurbuzbalaban | In this paper, we consider distributed algorithms for solving the empirical risk minimization problem under the master/worker communication model. |

181 | Discrete Action On-Policy Learning with Action-Value Critic | Yuguang Yue, Yunhao Tang, Mingzhang Yin, Mingyuan Zhou | To effectively operate in multidimensional discrete action spaces, we construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation. |

182 | Old Dog Learns New Tricks: Randomized UCB for Bandit Problems | Sharan Vaswani, Abbas Mehrabian, Audrey Durand, Branislav Kveton | We propose RandUCB, a bandit strategy that uses theoretically derived confidence intervals similar to upper confidence bound (UCB) algorithms, but akin to Thompson sampling (TS), uses randomization to trade off exploration and exploitation. |

183 | Thompson Sampling for Linearly Constrained Bandits | Vidit Saxena, Joakim Jalden, Joseph Gonzalez | In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. |

184 | Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles | Aditya Modi, Nan Jiang, Ambuj Tewari, Satinder Singh | In this paper, we consider a setting where we have access to an ensemble of pre-trained and possibly inaccurate simulators (models). |

185 | FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization | Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, Ramtin Pedarsani | In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. |

186 | Online Learning Using Only Peer Prediction | Yang Liu, Dave Helmbold | We propose an approach that uses peer prediction and identify conditions where it succeeds. |

187 | Deontological Ethics By Monotonicity Shape Constraints | Serena Wang, Maya Gupta | We propose that in some cases such ethical principles can be incorporated into a machine-learned model by adding shape constraints that constrain the model to respond only positively to relevant inputs. |

188 | On Random Subsampling of Gaussian Process Regression: A Graphon-Based Analysis | Kohei Hayashi, Masaaki Imaizumi, Yuichi Yoshida | In this paper, we study random subsampling of Gaussian process regression, one of the simplest approximation baselines, from a theoretical perspective. |

189 | Randomized Exploration in Generalized Linear Bandits | Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, Craig Boutilier | We study two randomized algorithms for generalized linear bandits. |

190 | Assessing Local Generalization Capability in Deep Models | Huan Wang, Nitish Shirish Keskar, Caiming Xiong, Richard Socher | Guided by the proof, we propose a metric to score the generalization capability of a model, as well as an algorithm that optimizes the perturbed model accordingly. |

191 | Fast Algorithms for Computational Optimal Transport and Wasserstein Barycenter | Wenshuo Guo, Nhat Ho, Michael Jordan | We provide theoretical complexity analysis for new algorithms to compute the optimal transport (OT) distance between two discrete probability distributions, and demonstrate their favorable practical performance compared to state-of-art primal-dual algorithms. |

192 | Adaptive Discretization for Evaluation of Probabilistic Cost Functions | Christoph Zimmer, Danny Driess, Mona Meister, Nguyen-Tuong Duy | In this paper, we propose an approach for evaluating continuous candidate paths by employing an adaptive discretization scheme, with a probabilistic cost function learned from observations. |

193 | Censored Quantile Regression Forest | Alexander Hanbo Li, Jelena Bradic | Based on a local adaptive representation of random forests, we develop its regression adjustment for randomly censored regression quantile models. |

194 | Choosing the Sample with Lowest Loss makes SGD Robust | Vatsal Shah, Xiaoxia Wu, Sujay Sanghavi | In this paper we propose a simple variant of the simple SGD method: in each step, first choose a set of k samples, then from these choose the one with the smallest current loss, and do an SGD-like update with this chosen sample. |

195 | Learning with minibatch Wasserstein: asymptotic and gradient properties | Kilian Fatras, Younes Zine, Rémi Flamary, Remi Gribonval, Nicolas Courty | We propose in this paper an analysis of this practice, whose effects are not well understood so far. |

196 | AMAGOLD: Amortized Metropolis Adjustment for Efficient Stochastic Gradient MCMC | Ruqi Zhang, A. Feder Cooper, Christopher De Sa | To address this tension, we propose a novel second-order SG-MCMC algorithm—AMAGOLD—that infrequently uses Metropolis-Hastings (M-H) corrections to remove bias. |

197 | On casting importance weighted autoencoder to an EM algorithm to learn deep generative models | Dongha Kim, Jaesung Hwang, Yongdai Kim | We propose a new and general approach to learn deep generative models. |

198 | Conditional Linear Regression | Diego Calderon, Brendan Juba, Sirui Li, Zongyi Li, Lisa Ruan | We give a computationally efficient algorithm with theoretical analysis for the conditional linear regression task, which is the joint task of identifying a significant portion of the data distribution, described by a k-DNF, along with a linear predictor on that portion with a small loss. |

199 | Distributionally Robust Bayesian Optimization | Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, Andreas Krause | In this paper, we study such a problem when the distributional shift is measured via the maximum mean discrepancy (MMD). |

200 | On the optimality of kernels for high-dimensional clustering | Leena C Vankadara, Debarghya Ghoshdastidar | We consider the problem of high dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective matches the known information-theoretic limit up to a factor of $\sqrt{2}$. |

201 | Improved Regret Bounds for Projection-free Bandit Convex Optimization | Dan Garber, Ben Kretzu | We present the first such algorithm that attains $O(T^{3/4})$ expected regret using only $O(T)$ overall calls to the linear optimization oracle, in expectation, where $T$ is the number of prediction rounds. |

202 | Variational Autoencoders and Nonlinear ICA: A Unifying Framework | Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, Aapo Hyvarinen | We address this issue by showing that for a broad family of deep latent-variable models, identification of the true joint distribution over observed and latent variables is actually possible up to very simple transformations, thus achieving a principled and powerful form of disentanglement. |

203 | Online Learning with Continuous Variations: Dynamic Regret and Reductions | Ching-An Cheng, Jonathan Lee, Ken Goldberg, Byron Boots | Motivated by this observation, we establish a new setup, called Continuous Online Learning (COL), where the gradient of online loss function changes continuously across rounds with respect to the learner’s decisions. |

204 | An Optimal Algorithm for Bandit Convex Optimization with Strongly-Convex and Smooth Loss | Shinji Ito | Our study resolves these two issues by introducing an algorithm that achieves an optimal regret bound of $\tilde{O}(d \sqrt{T})$ under a mild assumption, without self-concordant barriers. |

205 | A Deep Generative Model for Fragment-Based Molecule Generation | Marco Podda, Davide Bacciu, Alessio Micheli | In this work, we address two limitations of the former: generation of invalid and duplicate molecules. |

206 | Deep Structured Mixtures of Gaussian Processes | Martin Trapp, Robert Peharz, Franz Pernkopf, Carl Edward Rasmussen | In this paper, we introduce deep structured mixtures of GP experts, a stochastic process model which i) allows exact posterior inference, ii) has attractive computational and memory costs, and iii) when used as GP approximation, captures predictive uncertainties consistently better than previous expert-based approximations. |

207 | Noisy-Input Entropy Search for Efficient Robust Bayesian Optimization | Lukas Fröhlich, Edgar Klenske, Julia Vinogradska, Christian Daniel, Melanie Zeilinger | We consider the problem of robust optimization within the well-established Bayesian Optimization (BO) framework. |

208 | Dependent randomized rounding for clustering and partition systems with knapsack constraints | David Harris, Thomas Pensyl, Aravind Srinivasan, Khoa Trinh | We develop new randomized algorithms targeting such problems, and study two in particular: multi-knapsack median and multi-knapsack center. |

209 | Domain-Liftability of Relational Marginal Polytopes | Ondrej Kuzelka, Yuyi Wang | In this paper we study the following two problems: (i) Do domain-liftability results for the partition functions of Markov logic networks (MLNs) carry over to the problem of relational marginal polytope construction? |

210 | Derivative-Free & Order-Robust Optimisation | Haitham Ammar, Victor Gabillon, Rasul Tutunov, Michal Valko | In this paper, we formalise order-robust optimisation as an instance of online learning minimising simple regret, and propose Vroom, a zero’th order optimisation algorithm capable of achieving vanishing regret in non-stationary environments, while recovering favorable rates under stochastic reward-generating processes. |

211 | Stepwise Model Selection for Sequence Prediction via Deep Kernel Learning | Yao Zhang, Daniel Jarrett, Mihaela Schaar | In this paper, we propose a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in this setting. |

212 | Dynamic content based ranking | Seppo Virtanen, Mark Girolami | We introduce a novel state space model for a set of sequentially time-stamped partial rankings of items and textual descriptions for the items. |

213 | Fairness Evaluation in Presence of Biased Noisy Labels | Riccardo Fogliato, Alexandra Chouldechova, Max G'Sell | We propose a sensitivity analysis framework for assessing how assumptions on the noise across groups affect the predictive bias properties of the risk assessment model as a predictor of reoffense. |

214 | Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification | Han Bao, Masashi Sugiyama | In this paper, we consider linear-fractional metrics, which are a family of classification performance metrics that encompasses many standard ones such as the F-measure and Jaccard index, and propose methods to directly maximize performances under those metrics. |

215 | Decentralized gradient methods: does topology matter? | Giovanni Neglia, Chuan Xu, Don Towsley, Gianmarco Calbi | While theoretical results suggest that worker communication topology should have a strong impact on the number of epochs needed to converge, previous experiments have shown the opposite conclusion. This paper sheds light on this apparent contradiction and shows how sparse topologies can lead to faster convergence even in the absence of communication delays. |

216 | Truly Batch Model-Free Inverse Reinforcement Learning about Multiple Intentions | Giorgia Ramponi, Amarildo Likmeta, Alberto Maria Metelli, Andrea Tirinzoni, Marcello Restelli | In this paper, we address the IRL about multiple intentions in a fully model-free and batch setting. |

217 | Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness | Antônio H. Ribeiro, Koen Tiels, Luis A. Aguirre, Thomas Schön | In this paper, we argue that this principle, while powerful, might need some refinement to explain recent developments. |

218 | Accelerated Primal-Dual Algorithms for Distributed Smooth Convex Optimization over Networks | Jinming Xu, Ye Tian, Ying Sun, Gesualdo Scutari | This paper proposes a novel family of primal-dual-based distributed algorithms for smooth, convex, multi-agent optimization over networks that uses only gradient information and gossip communications. |

219 | Stochastic Linear Contextual Bandits with Diverse Contexts | Weiqiang Wu, Jing Yang, Cong Shen | In this paper, we investigate the impact of context diversity on stochastic linear contextual bandits. |

220 | Purifying Interaction Effects with the Functional ANOVA: An Efficient Algorithm for Recovering Identifiable Additive Models | Benjamin Lengerich, Sarah Tan, Chun-Hao Chang, Giles Hooker, Rich Caruana | To compute this decomposition, we present a fast, exact algorithm that transforms any piecewise-constant function (such as a tree-based model) into a purified, canonical representation. |

221 | Balanced Off-Policy Evaluation in General Action Spaces | Arjun Sondhi, David Arbour, Drew Dimmery | In this work we present balanced off-policy evaluation (B-OPE), a generic method for estimating weights which minimize this imbalance. |

222 | Approximate Cross-Validation in High Dimensions with Guarantees | William Stephenson, Tamara Broderick | Approximate Cross-Validation in High Dimensions with Guarantees |

223 | How fine can fine-tuning be? Learning efficient language models | Evani Radiya-Dixit, Xin Wang | In this work, we address these questions by using Bidirectional Encoder Representations from Transformers (BERT) as an example. |

224 | Interpretable Companions for Black-Box Models | Danqing Pan, Tong Wang, Satoshi Hara | We present an interpretable companion model for any pre-trained black-box classifier. |

225 | A PTAS for the Bayesian Thresholding Bandit Problem | Yue Qin, Jian Peng, Yuan Zhou | In this paper, we study the Bayesian thresholding bandit problem (BTBP), where the goal is to adaptively make a budget of $Q$ queries to $n$ stochastic arms and determine the label for each arm (whether its mean reward is closer to $0$ or $1$). |

226 | Learning Rate Adaptation for Differentially Private Learning | Antti Koskela, Antti Honkela | In this paper, we propose a differentially private algorithm for the adaptation of the learning rate for differentially private stochastic gradient descent (SGD) that avoids the need for validation set use. |

227 | Thresholding Graph Bandits with GrAPL | Daniel LeJeune, Gautam Dasarathy, Richard Baraniuk | In this paper, we introduce a new online decision making paradigm that we call Thresholding Graph Bandits. |

228 | Bandit optimisation of functions in the Matérn kernel RKHS | David Janz, David Burt, Javier Gonzalez | We consider the problem of optimising functions in the reproducing kernel Hilbert space (RKHS) of a Matérn kernel with smoothness parameter $\nu$ over the domain $[0,1]^d$ under noisy bandit feedback. |

229 | Hypothesis Testing Interpretations and Renyi Differential Privacy | Borja Balle, Gilles Barthe, Marco Gaboardi, Justin Hsu, Tetsuya Sato | In this paper, we identify some conditions under which a privacy definition given in terms of a statistical divergence satisfies a similar interpretation. |

230 | Lipschitz Continuous Autoencoders in Application to Anomaly Detection | Young-geun Kim, Yongchan Kwon, Hyunwoong Chang, Myunghee Cho Paik | In this work, we formalize current practices, build a theoretical framework of anomaly detection algorithms equipped with an objective function and a hypothesis space, and establish a desirable property of the anomaly detection algorithm, namely, admissibility. |

231 | Private k-Means Clustering with Stability Assumptions | Moshe Shechner, Or Sheffet, Uri Stemmer | In this work we improve upon this line of works on multiple axes. |

232 | Momentum in Reinforcement Learning | Nino Vieillard, Bruno Scherrer, Olivier Pietquin, Matthieu Geist | Specifically, we propose a simple improvement on DQN based on MoVI, and evaluate it on Atari games. |

233 | A Primal-Dual Solver for Large-Scale Tracking-by-Assignment | Stefan Haller, Mangal Prakash, Lisa Hutschenreiter, Tobias Pietzsch, Carsten Rother, Florian Jug, Paul Swoboda, Bogdan Savchynskyy | We propose a fast approximate solver for the combinatorial problem known as tracking-by-assignment, which we apply to cell tracking. |

234 | Precision-Recall Curves Using Information Divergence Frontiers | Josip Djolonga, Mario Lucic, Marco Cuturi, Olivier Bachem, Olivier Bousquet, Sylvain Gelly | In this paper, we present a general evaluation framework for generative models that measures the trade-off between precision and recall using Renyi divergences. |

235 | Computing Tight Differential Privacy Guarantees Using FFT | Antti Koskela, Joonas Jälkö, Antti Honkela | In this paper, we propose a numerical accountant for evaluating the privacy loss for algorithms with continuous one dimensional output. |

236 | Hyperbolic Manifold Regression | Gian Marconi, Carlo Ciliberto, Lorenzo Rosasco | In this work, we consider the task of regression onto hyperbolic space, for which we propose two approaches: a non-parametric kernel method, for which we also prove excess risk bounds, and a parametric deep learning model that is informed by the geodesics of the target space. |

237 | Approximate Inference with Wasserstein Gradient Flows | Charlie Frogner, Tomaso Poggio | We present a novel approximate inference method for diffusion processes, based on the Wasserstein gradient flow formulation of the diffusion. |

238 | Thresholding Bandit Problem with Both Duels and Pulls | Yichong Xu, Xi Chen, Aarti Singh, Artur Dubrawski | This paper provides an algorithm called Rank-Search (RS) for solving TBP-DC by alternating between ranking and binary search. |

239 | GAIT: A Geometric Approach to Information Theory | Jose Gallego Posada, Ankit Vani, Max Schwarzer, Simon Lacoste-Julien | Based on this notion of entropy, we introduce geometry-aware counterparts for several concepts and theorems in information theory. |

240 | On Thompson Sampling for Smoother-than-Lipschitz Bandits | James Grant, David Leslie | We provide the first bounds on the regret of Thompson Sampling for continuum armed bandits under weak conditions on the function class containing the true function and sub-exponential observation noise. |

241 | Safe-Bayesian Generalized Linear Regression | Rianne Heide, Alisa Kirichenko, Peter Grunwald, Nishant Mehta | We show that for generalized linear models (GLMs), $\eta$-generalized Bayes concentrates around the best approximation of the truth within the model for specific $\eta \neq 1$, even under severely misspecified noise, as long as the tails of the true distribution are exponential. |

242 | Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy | Majid Jahani, Xi He, Chenxin Ma, Aryan Mokhtari, Dheevatsa Mudigere, Alejandro Ribeiro, Martin Takac | In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which sample size is gradually increasing to quickly obtain a solution whose empirical loss is under satisfactory statistical accuracy. |

243 | Contextual Constrained Learning for Dose-Finding Clinical Trials | Hyun-Suk Lee, Cong Shen, James Jordon, Mihaela Schaar | In this paper, we propose C3T-Budget, a contextual constrained clinical trial algorithm for dose-finding under both budget and safety constraints. |

244 | Support recovery and sup-norm convergence rates for sparse pivotal estimation | Mathurin Massias, Quentin Bertrand, Alexandre Gramfort, Joseph Salmon | In this work we show minimax sup-norm convergence rates for non-smoothed and smoothed, single-task and multitask square-root Lasso-type estimators. |

245 | Learning Entangled Single-Sample Distributions via Iterative Trimming | Hui Yuan, Yingyu Liang | We study mean estimation and linear regression under general conditions, and analyze a simple and computationally efficient method based on iteratively trimming samples and re-estimating the parameter on the trimmed sample set. |

246 | The Quantile Snapshot Scan: Comparing Quantiles of Spatial Data from Two Snapshots in Time | Travis Moore, Wong Weng-Keen | We introduce the Quantile Snapshot Scan (Qsnap), a spatial scan algorithm which identifies spatial regions that differ the most between two snapshots in time. |

247 | Statistical guarantees for local graph clustering | Wooseok Ha, Kimon Fountoulakis, Michael Mahoney | In this paper, we adopt a statistical perspective on local graph clustering, and we analyze the performance of the l1-regularized PageRank method for the recovery of a single target cluster, given a seed node inside the cluster. |

248 | Learning High-dimensional Gaussian Graphical Models under Total Positivity without Adjustment of Tuning Parameters | Yuhao Wang, Uma Roy, Caroline Uhler | We here propose a new method to estimate the underlying undirected graphical model under MTP2 and show that it is provably consistent in structure recovery without adjusting the tuning parameters. |

249 | On Pruning for Score-Based Bayesian Network Structure Learning | Alvaro Henrique Chaim Correia, James Cussens, Cassio Campos | We derive new non-trivial theoretical upper bounds for the BDeu score that considerably improve on the state-of-the-art. |

250 | Statistical and Computational Rates in Graph Logistic Regression | Quentin Berthet, Nicolai Baldin | We consider the problem of graph logistic regression, based on partial observation of a large network, and on side information associated to its vertices. |

251 | Kernels over Sets of Finite Sets using RKHS Embeddings, with Application to Bayesian (Combinatorial) Optimization | Poompol Buathong, David Ginsbourger, Tipaluck Krityakierne | We focus on kernel methods for set-valued inputs and their application to Bayesian set optimization, notably combinatorial optimization. |

252 | Rk-means: Fast Clustering for Relational Data | Ryan Curtin, Benjamin Moseley, Hung Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich | This paper introduces Rk-means, or relational k-means algorithm, for clustering relational data tuples without having to access the full data matrix. |

253 | Statistical Estimation of the Poincaré constant and Application to Sampling Multimodal Distributions | Loucas Pillaud-Vivien | In this paper, we show both theoretically and experimentally that, given sufficiently many samples of a measure, we can estimate its Poincaré constant. |

254 | Integrals over Gaussians under Linear Domain Constraints | Alexandra Gessner, Oindrila Kanjilal, Philipp Hennig | We present an efficient black-box algorithm that exploits geometry for the estimation of integrals over a small, truncated Gaussian volume, and to simulate therefrom. |

255 | Taxonomy of Dual Block-Coordinate Ascent Methods for Discrete Energy Minimization | Siddharth Tourani, Alexander Shekhovtsov, Carsten Rother, Bogdan Savchynskyy | We consider the maximum-a-posteriori inference problem in discrete graphical models and study solvers based on the dual block-coordinate ascent rule. |

256 | PersLay: A Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures | Mathieu Carriere, Frederic Chazal, Yuichi Ike, Theo Lacombe, Martin Royer, Yuhei Umeda | In this work, we focus on persistence diagrams built on top of graphs. |

257 | MAP Inference for Customized Determinantal Point Processes via Maximum Inner Product Search | Insu Han, Jennifer Gillenwater | In this work, we propose a new MAP algorithm: we show that, by performing a one-time preprocessing step on a basic DPP, it is possible to run an approximate version of the standard greedy MAP approximation algorithm on any customized version of the DPP in time sublinear in the number of items. |

258 | Why Non-myopic Bayesian Optimization is Promising and How Far Should We Look-ahead? A Study via Rollout | Xubo Yue, Raed AL Kontar | In this work we focus on the rollout approximation for solving the intractable DP. |

259 | Robust Optimisation Monte Carlo | Borislav Ikonomov, Michael U. Gutmann | In this paper, we demonstrate an important previously unrecognised failure mode of OMC: It generates strongly overconfident approximations by collapsing regions of similar or near-constant likelihood into a single point. |

260 | Guaranteed Validity for Empirical Approaches to Adaptive Data Analysis | Ryan Rogers, Aaron Roth, Adam Smith, Nathan Srebro, Om Dipakbhai Thakkar, Blake Woodworth | Our main contribution is to design a framework for providing valid, instance-specific confidence intervals for point estimates that can be generated by heuristics. |

261 | Fast Markov chain Monte Carlo algorithms via Lie groups | Steve Huntsman | From basic considerations of the Lie group that preserves a target probability measure, we derive the Barker, Metropolis, and ensemble Markov chain Monte Carlo (MCMC) algorithms, as well as variants of waste-recycling Metropolis-Hastings and an altogether new MCMC algorithm. |

262 | Efficient Planning under Partial Observability with Unnormalized Q Functions and Spectral Learning | Tianyu Li, Bogdan Mazoure, Doina Precup, Guillaume Rabusseau | In this paper, we propose a novel algorithm that incorporates reward information into the representations of the environment to unify these two stages. |

263 | A Tight and Unified Analysis of Gradient-Based Methods for a Whole Spectrum of Differentiable Games | Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, Gauthier Gidel | We provide new analysis of the EG’s local and global convergence properties and use it to get a tighter global convergence rate for OG and CO. |

264 | Doubly Sparse Variational Gaussian Processes | Vincent Adam, Stefanos Eleftheriadis, Artem Artemev, Nicolas Durrande, James Hensman | In this work, we propose to take the best of both worlds: we show that the inducing point framework is still valid for state space models and that it can bring further computational and memory savings. |

265 | Online Convex Optimization with Perturbed Constraints: Optimal Rates against Stronger Benchmarks | Victor Valls, George Iosifidis, Douglas Leith, Leandros Tassiulas | To this end, we present an online primal-dual proximal gradient algorithm that has $O(T^\epsilon \vee T^{1-\epsilon})$ regret and $O(T^\epsilon)$ constraint violation, where $\epsilon \in [0,1)$ is a parameter in the learning rate. |

266 | Persistence Enhanced Graph Neural Network | Qi Zhao, Ze Ye, Chao Chen, Yusu Wang | To fully exploit such structural information in real world graphs, we propose a new network architecture which learns to use persistent homology information to reweight messages passed between graph nodes during convolution. |

267 | Feature relevance quantification in explainable AI: A causal problem | Dominik Janzing, Lenon Minorics, Patrick Bloebaum | We conclude that unconditional rather than conditional expectations provide the right notion of dropping features. |

268 | Neural Decomposition: Functional ANOVA with Variational Autoencoders | Kaspar Märtens, Christopher Yau | In this paper, we focus on characterising the sources of variation in Conditional VAEs. |

269 | BasisVAE: Translation-invariant feature-level clustering with Variational Autoencoders | Kaspar Märtens, Christopher Yau | In this paper, we propose to achieve this through the BasisVAE: a combination of the VAE and a probabilistic clustering prior, which lets us learn a one-hot basis function representation as part of the decoder network. |

270 | How To Backdoor Federated Learning | Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, Vitaly Shmatikov | We show that this makes federated learning vulnerable to a model-poisoning attack that is significantly more powerful than poisoning attacks that target only the training data. |

271 | Exploiting Categorical Structure Using Tree-Based Methods | Brian Lucena | We develop a mathematical framework for representing the structure of categorical variables and show how to generalize decision trees to make use of this structure. |

272 | A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments | Adam Foster, Martin Jankowiak, Matthew O'Meara, Yee Whye Teh, Tom Rainforth | We introduce a fully stochastic gradient based approach to Bayesian optimal experimental design (BOED). |

273 | Mixed Strategies for Robust Optimization of Unknown Objectives | Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, Andreas Krause | We consider robust optimization problems, where the goal is to optimize an unknown objective function against the worst-case realization of an uncertain parameter. |

274 | Functional Gradient Boosting for Learning Residual-like Networks with Statistical Guarantees | Atsushi Nitanda, Taiji Suzuki | In this paper, to resolve these problems, we propose a new functional gradient boosting for learning deep residual-like networks in a layer-wise fashion with its statistical guarantees on multi-class classification tasks. |

275 | Solving Discounted Stochastic Two-Player Games with Near-Optimal Time and Sample Complexity | Aaron Sidford, Mengdi Wang, Lin Yang, Yinyu Ye | In this paper we settle the sampling complexity of solving discounted two-player turn-based zero-sum stochastic games up to polylogarithmic factors. |

276 | Convergence Rates of Smooth Message Passing with Rounding in Entropy-Regularized MAP Inference | Jonathan Lee, Aldo Pacchiano, Michael Jordan | With an appropriately chosen regularization constant, we present a theoretical guarantee on the number of iterations sufficient to recover the true integral MAP solution when the LP is tight and the solution is unique. |

277 | Finite-Time Error Bounds for Biased Stochastic Approximation with Applications to Q-Learning | Gang Wang, Georgios B. Giannakis | Leveraging a \emph{multistep Lyapunov function} that looks ahead to several future updates to accommodate the gradient bias, we prove a general result on the convergence of the iterates, and use it to derive finite-time bounds on the mean-square error in the case of constant stepsizes. |

278 | Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models | Theo Galy-Fajou, Florian Wenzel, Manfred Opper | We propose automated augmented conjugate inference, a new inference method for non-conjugate Gaussian process (GP) models. Our method automatically constructs an auxiliary variable augmentation that renders the GP model conditionally conjugate. |

279 | Bayesian Reinforcement Learning via Deep, Sparse Sampling | Divya Grover, Debabrota Basu, Christos Dimitrakakis | We propose an optimism-free Bayes-adaptive algorithm to induce deeper and sparser exploration with a theoretical bound on its performance relative to the Bayes optimal as well as lower computational complexity. |

280 | Deterministic Decoding for Discrete Data in Variational Autoencoders | Daniil Polykovskiy, Dmitry Vetrov | In this paper, we study a VAE model with a deterministic decoder (DD-VAE) for sequential data that selects the highest-scoring tokens instead of sampling. |

281 | Monotonic Gaussian Process Flows | Ivan Ustyuzhaninov, Ieva Kazlauskaite, Carl Henrik Ek, Neill Campbell | We propose a new framework for imposing monotonicity constraints in a Bayesian non-parametric setting based on numerical solutions of stochastic differential equations. |

282 | Flexible distribution-free conditional predictive bands using density estimators | Rafael Izbicki, Gilson Shimizu, Rafael Stern | We introduce two conformal methods based on conditional density estimators that do not depend on this type of assumption to obtain asymptotic conditional coverage: Dist-split and CD-split. |

283 | Variational Integrator Networks for Physically Structured Embeddings | Steindor Saemundsson, Alexander Terenin, Katja Hofmann, Marc Deisenroth | By leveraging recent work connecting deep neural networks to systems of differential equations, we propose \emph{variational integrator networks}, a class of neural network architectures designed to preserve the geometric structure of physical systems. |

284 | Black-Box Inference for Non-Linear Latent Force Models | Wil Ward, Tom Ryder, Dennis Prangle, Mauricio Alvarez | We compare estimates on systems where the posterior is known, demonstrating the effectiveness of the approximation, and apply the method to problems with non-linear dynamics, multi-output systems, and models with non-Gaussian likelihoods. |

285 | Importance Sampling via Local Sensitivity | Anant Raj, Cameron Musco, Lester Mackey | To overcome both obstacles we introduce \emph{local sensitivity}, which measures data point importance in a ball around some center $x_0$. |

286 | Convergence Analysis of Block Coordinate Algorithms with Determinantal Sampling | Mojmir Mutny, Michal Derezinski, Andreas Krause | However, we show that when the coordinate blocks are sampled with probability proportional to their determinant, the convergence rate depends solely on the eigenvalue distribution of matrix M, and has an analytically tractable form. |

287 | Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection | Vaggos Chatziafratis, Grigory Yaroslavtsev, Euiwoong Lee, Konstantin Makarychev, Sara Ahmadian, Alessandro Epasto, Mohammad Mahdian | Here, for the maximization dual of Dasgupta’s objective (introduced by Moseley-Wang), we present polynomial-time 42.46% approximation algorithms that use Max-Uncut Bisection as a subroutine. |

288 | Laplacian-Regularized Graph Bandits: Algorithms and Theoretical Analysis | Kaige Yang, Laura Toni, Xiaowen Dong | We introduce a novel bandit algorithm where the smoothness prior is imposed via the random-walk graph Laplacian, which leads to a single-user cumulative regret scaling as $\tilde{\mathcal{O}}(\Psi d \sqrt{T})$ with time horizon $T$, feature dimensionality $d$, and the scalar parameter $\Psi \in (0,1)$ that depends on the graph connectivity. |

289 | Enriched mixtures of generalised Gaussian process experts | Charles Gadd, Sara Wade, Alexis Boukouvalas | We focus on alternative mixtures of GP experts, which model the joint distribution of the inputs and targets explicitly. |

290 | Causal Bayesian Optimization | Virginia Aglietti, Xiaoyu Lu, Andrei Paleyes, Javier González | We propose a new algorithm called Causal Bayesian Optimization (CBO). |

291 | Linear predictor on linearly-generated data with missing values: non consistency and solutions | Marine Le Morvan, Nicolas Prost, Julie Josse, Erwan Scornet, Gael Varoquaux | We consider building predictors when the data have missing values. |

292 | A Novel Confidence-Based Algorithm for Structured Bandits | Andrea Tirinzoni, Alessandro Lazaric, Marcello Restelli | We introduce a novel phased algorithm that exploits the given structure to build confidence sets over the parameters of the true bandit problem and rapidly discard all sub-optimal arms. |

293 | Quantitative stability of optimal transport maps and linearization of the 2-Wasserstein space | Quentin Mérigot, Alex Delalande, Frederic Chazal | This work studies an explicit embedding of the set of probability measures into a Hilbert space, defined using optimal transport maps from a reference probability density. |

294 | Bayesian experimental design using regularized determinantal point processes | Michal Derezinski, Feynman Liang, Michael Mahoney | A key novelty is that we offer improved guarantees under the Bayesian framework, where prior knowledge is incorporated into the criteria. |

295 | Non-exchangeable feature allocation models with sublinear growth of the feature sizes | Giuseppe Di Benedetto, Francois Caron, Yee Whye Teh | In this article, we describe a class of non-exchangeable feature allocation models where the number of objects sharing a given feature grows sublinearly, where the rate can be controlled by a tuning parameter. |

296 | Calibrated Prediction with Covariate Shift via Unsupervised Domain Adaptation | Sangdon Park, Osbert Bastani, James Weimer, Insup Lee | We propose an algorithm for calibrating predictions that accounts for the possibility of covariate shift, given labeled examples from the training distribution and unlabeled examples from the real-world distribution. |

297 | Inference of Dynamic Graph Changes for Functional Connectome | Dingjue Ji, Junwei Lu, Yiliang Zhang, Siyuan Gao, Hongyu Zhao | We propose an inferential method to detect the dynamic changes of brain networks based on time-varying graphical models. |

298 | An approximate KLD based experimental design for models with intractable likelihoods | Ziqiao Ao, Jinglai Li | In this work we consider a special type of ED problems where the likelihoods are not available in a closed form. |

299 | Almost-Matching-Exactly for Treatment Effect Estimation under Network Interference | Usaid Awan, Marco Morucci, Vittorio Orlandi, Sudeepa Roy, Cynthia Rudin, Alexander Volfovsky | We propose a matching method that recovers direct treatment effects from randomized experiments where units are connected in an observed network, and units that share edges can potentially influence each others’ outcomes. |

300 | “Bring Your Own Greedy”+Max: Near-Optimal 1/2-Approximations for Submodular Knapsack | Grigory Yaroslavtsev, Samson Zhou, Dmitrii Avdiukhin | Motivated by applications to recommendation systems and other scenarios with query-limited access to vast amounts of data, we propose a new rigorous algorithmic framework for a standard formulation of this problem as a submodular maximization subject to a linear (knapsack) constraint. |

301 | Sample complexity bounds for localized sketching | Rakshith Sharma Srinivasa, Mark Davenport, Justin Romberg | We consider sketched approximate matrix multiplication and ridge regression in the novel setting of localized sketching, where at any given point, only part of the data matrix is available. |

302 | An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays | Julian Zimmert, Yevgeny Seldin | We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. |

303 | Learning Dynamic and Personalized Comorbidity Networks from Event Data using Deep Diffusion Processes | Zhaozhi Qian, Ahmed Alaa, Alexis Bellot, Mihaela Schaar, Jem Rashbass | To this end, we develop deep diffusion processes (DDP) to model ’dynamic comorbidity networks’, i.e., the temporal relationships between comorbid disease onsets expressed through a dynamic graph. |

304 | Tensorized Random Projections | Beheshteh Rakhshan, Guillaume Rabusseau | We introduce a novel random projection technique for efficiently reducing the dimension of very high-dimensional tensors. |

305 | Nonparametric Estimation in the Dynamic Bradley-Terry Model | Heejong Bong, Wanshan Li, Shamindra Shrotriya, Alessandro Rinaldo | We propose a time-varying generalization of the Bradley-Terry model that allows for nonparametric modeling of dynamic global rankings of distinct teams. |

306 | Gaussian-Smoothed Optimal Transport: Metric Structure and Statistical Efficiency | Ziv Goldfeld, Kristjan Greenewald | This work proposes a novel Gaussian-smoothed OT (GOT) framework, that achieves the best of both worlds: preserving the 1-Wasserstein metric structure while alleviating the empirical approximation curse of dimensionality. |

307 | Learning in Gated Neural Networks | Ashok Makkuva, Sewoong Oh, Sreeram Kannan, Pramod Viswanath | In this paper, we perform a careful analysis of the optimization landscape and show that with appropriately designed loss functions, gradient descent can indeed learn the parameters accurately. |

308 | Validation of Approximate Likelihood and Emulator Models for Computationally Intensive Simulations | Niccolo Dalmasso, Ann Lee, Rafael Izbicki, Taylor Pospisil, Ilmun Kim, Chieh-An Lin | Here we propose a statistical framework that can distinguish any arbitrary misspecified model from the target likelihood, and that in addition can identify with statistical confidence the regions of parameter as well as feature space where the fit is inadequate. |

309 | Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training | Fangda Gu, Armin Askari, Laurent El Ghaoui | In this paper, we introduce a new class of lifted models, Fenchel lifted networks, that enjoy the same benefits as previous lifted models, without suffering a degradation in performance over classical networks. |

310 | Adversarial Robustness Guarantees for Classification with Gaussian Processes | Arno Blaas, Andrea Patane, Luca Laurenti, Luca Cardelli, Marta Kwiatkowska, Stephen Roberts | Specifically, given a compact subset of the input space $T\subseteq \mathbb{R}^d$ enclosing a test point $x^*$ and a GPC trained on a dataset $\mathcal{D}$, we aim to compute the minimum and the maximum classification probability for the GPC over all the points in $T$. |

311 | Causal inference in degenerate systems: An impossibility result | Yue Wang, Linbo Wang | In this paper, we characterize a degenerate causal system using multiplicity of Markov boundaries. |

312 | ChemBO: Bayesian Optimization of Small Organic Molecules with Synthesizable Recommendations | Ksenia Korovina, Sailun Xu, Kirthevasan Kandasamy, Willie Neiswanger, Barnabas Poczos, Jeff Schneider, Eric Xing | We describe ChemBO, a Bayesian optimization framework for generating and optimizing organic molecules for desired molecular properties. |

313 | Local Differential Privacy for Sampling | Hisham Husain, Borja Balle, Zac Cranko, Richard Nock | We propose to model this scenario by assuming each individual holds a distribution over the space of data records, and develop novel local DP methods to sample privately from these distributions. |

314 | Learning Sparse Nonparametric DAGs | Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, Eric Xing | Unlike existing approaches that require specific modeling choices, loss functions, or algorithms, we present a completely general framework that can be applied to general nonlinear models (e.g. without additive noise), general differentiable loss functions, and generic black-box optimization routines. |

315 | Minimax Rank-$1$ Matrix Factorization | Venkatesh Saligrama, Alexander Olshevsky, Julien Hendrickx | We propose a method based on least squares in the log-space and show its performance matches the lower bounds that we derive for this problem in the small-perturbation regime, which are related to the spectral gap of a graph representing the revealed entries. |

316 | Context Mover's Distance & Barycenters: Optimal Transport of Contexts for Building Representations | Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, Martin Jaggi | We present a framework for building unsupervised representations of entities and their compositions, where each entity is viewed as a probability distribution rather than a vector embedding. |

317 | Data Generation for Neural Programming by Example | Judith Clymo, Adria Gascon, Brooks Paige, Nathanael Fijalkow, Haik Manukian | In this paper we introduce a novel approach using an SMT solver to synthesize inputs which cover a diverse set of behaviors for a given program. |

318 | An Inverse-free Truncated Rayleigh-Ritz Method for Sparse Generalized Eigenvalue Problem | Yunfeng Cai, Ping Li | In this paper, we focus on the development of a three-stage algorithm named {\em inverse-free truncated Rayleigh-Ritz method} ({\em IFTRR}) to efficiently solve SGEP. |

319 | The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits | Ronshee Chawla, Abishek Sankararaman, Ayalvadi Ganesh, Sanjay Shakkottai | We consider a decentralized multi-agent Multi Armed Bandit (MAB) setup consisting of N agents, solving the same MAB instance to minimize individual cumulative regret. |

320 | Understanding the Effects of Batching in Online Active Learning | Kareem Amin, Corinna Cortes, Giulia DeSalvo, Afshin Rostamizadeh | In this work, we present an analysis for a generic class of batch online AL algorithms, which reveals that the effects of batching are in fact mild and only result in an additional label complexity term that is quasilinear in the batch size. |

321 | Adaptive multi-fidelity optimization with fast learning rates | Côme Fiegel, Victor Gabillon, Michal Valko | This paper studies the problem of optimizing a locally smooth function with a limited budget, where the learner has access to cheaper, lower-fidelity approximations of the function and must trade off their cost against their bias. |

322 | On the interplay between noise and curvature and its effect on optimization and generalization | Valentin Thomas, Fabian Pedregosa, Bart Merriënboer, Pierre-Antoine Manzagol, Yoshua Bengio, Nicolas Le Roux | While most previous works focus on one or the other of these properties, we explore how their interaction affects optimization speed. |

323 | A Reduction from Reinforcement Learning to No-Regret Online Learning | Ching-An Cheng, Remi Tachet Combes, Byron Boots, Geoff Gordon | We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which "any" online algorithm with sublinear regret can generate policies with provable performance guarantees. |

324 | The Implicit Regularization of Ordinary Least Squares Ensembles | Daniel LeJeune, Hamid Javadi, Richard Baraniuk | We study the case of an ensemble of linear predictors, where each individual predictor is fit using ordinary least squares on a random submatrix of the data matrix. |

325 | Adaptive Exploration in Linear Contextual Bandit | Botao Hao, Tor Lattimore, Csaba Szepesvari | We start to bridge the gap by designing an algorithm that is asymptotically optimal and has good finite-time empirical performance. |

326 | A Three Sample Hypothesis Test for Evaluating Generative Models | Casey Meehan, Kamalika Chaudhuri, Sanjoy Dasgupta | In this work, we formalize a form of overfitting that we call {\em{data-copying}} – where the generative model memorizes and outputs training samples or small variations thereof. |

327 | Learning Ising and Potts Models with Latent Variables | Surbhi Goel | We study the problem of learning graphical models with latent variables. |

328 | Learning piecewise Lipschitz functions in changing environments | Dravyansh Sharma, Maria-Florina Balcan, Travis Dick | In this work we provide an $O(\sqrt{sdT\log T}+sT^{1-\beta})$ regret bound for $\beta$-dispersed functions, where $\beta$ roughly quantifies the rate at which discontinuities appear in the utility functions in expectation (typically $\beta\ge1/2$ in problems of practical interest \cite{2019arXiv190409014B,balcan2018dispersion}). |

329 | POPCORN: Partially Observed Prediction Constrained Reinforcement Learning | Joseph Futoma, Michael Hughes, Finale Doshi-Velez | We introduce a new optimization objective that (a) produces both high-performing policies and high-quality generative models, even when some observations are irrelevant for planning, and (b) does so in batch off-policy settings that are typical in healthcare, when only retrospective data is available. |

330 | Optimal Approximation of Doubly Stochastic Matrices | Nikitas Rontsis, Paul Goulart | We consider the least-squares approximation of a matrix C in the set of doubly stochastic matrices with the same sparsity pattern as C. |

331 | The Expressive Power of a Class of Normalizing Flow Models | Zhifeng Kong, Kamalika Chaudhuri | In this work, we study some basic normalizing flows and rigorously establish bounds on their expressive power. |

332 | Screening Data Points in Empirical Risk Minimization via Ellipsoidal Regions and Safe Loss Functions | Grégoire Mialon, Julien Mairal, Alexandre d'Aspremont | We design simple screening tests to automatically discard data samples in empirical risk minimization without losing optimization guarantees. |

333 | An Empirical Study of Stochastic Gradient Descent with Structured Covariance Noise | Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, Jimmy Ba | To address the problem of improving generalization while maintaining optimal convergence in large-batch training, we propose to add covariance noise to the gradients. |

334 | Amortized Inference of Variational Bounds for Learning Noisy-OR | Yiming Yan, Melissa Ailem, Fei Sha | In this paper, we propose Amortized Conjugate Posterior (ACP), a hybrid approach taking advantages of both types of approaches. |

335 | Gain with no Pain: Efficiency of Kernel-PCA by Nyström Sampling | Nicholas Sterge, Bharath Sriperumbudur, Lorenzo Rosasco, Alessandro Rudi | In this paper, we analyze a Nyström based approach to efficient large scale kernel principal component analysis (PCA). |

336 | Logistic regression with peer-group effects via inference in higher-order Ising models | Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas | In this work we study extensions of these models to models with higher-order sufficient statistics, modeling behavior on a social network with peer-group effects. |

337 | An Asymptotic Rate for the LASSO Loss | Cynthia Rush | We study the linear asymptotic regime where the under sampling ratio, n/p, approaches a constant greater than 0 in the limit. |

338 | Constructing a provably adversarially-robust classifier from a high accuracy one | Grzegorz Gluch, Rüdiger Urbanke | In this paper we focus on our conceptual contribution, but we do present two examples to illustrate our framework. |

339 | Distributed, partially collapsed MCMC for Bayesian Nonparametrics | Kumar Avinava Dubey, Michael Zhang, Eric Xing, Sinead Williamson | We exploit the fact that completely random measures, which commonly-used models like the Dirichlet process and the beta-Bernoulli process can be expressed using, are decomposable into independent sub-measures. |

340 | Quantized Frank-Wolfe: Faster Optimization, Lower Communication, and Projection Free | Mingrui Zhang, Lin Chen, Aryan Mokhtari, Hamed Hassani, Amin Karbasi | In this paper, we propose Quantised Frank-Wolfe (QFW), the first projection free and communication-efficient algorithm for solving constrained optimization problems at scale. |

341 | A Farewell to Arms: Sequential Reward Maximization on a Budget with a Giving Up Option | P Sharoff, Nishant Mehta, Ravi Ganti | We consider a sequential decision-making problem where an agent can take one action at a time and each action has a stochastic temporal extent, i.e., a new action cannot be taken until the previous one is finished. |

342 | Prophets, Secretaries, and Maximizing the Probability of Choosing the Best | Hossein Esfandiari, MohammadTaghi Hajiaghayi, Brendan Lucier, Michael Mitzenmacher | Along the way, we show that the best achievable success probability for the random-order case matches that of the i.i.d. case, which is approximately 0.5801, under a “no-superstars” condition that no single distribution is very likely ex ante to generate the maximum value. |

343 | A Wasserstein Minimum Velocity Approach to Learning Unnormalized Models | Ziyu Wang, Shuyu Cheng, Li Yueru, Jun Zhu, Bo Zhang | In this paper, we present a scalable approximation to a general family of learning objectives including score matching, by observing a new connection between these objectives and Wasserstein gradient flows. |

344 | Sharp Asymptotics and Optimal Performance for Inference in Binary Models | Hossein Taheri, Ramtin Pedarsani, Christos Thrampoulidis | We study convex empirical risk minimization for high-dimensional inference in binary models. |

345 | A Theoretical Case Study of Structured Variational Inference for Community Detection | Mingzhang Yin, Y. X. Rachel Wang, Purnamrita Sarkar | In this paper, we study the advantage of structured variational inference in the context of the two-class Stochastic Blockmodel. |

346 | Orthogonal Gradient Descent for Continual Learning | Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li | In this paper, we propose to address this issue from a parameter space perspective and study an approach to restrict the direction of the gradient updates to avoid forgetting previously-learned data. |

347 | Hamiltonian Monte Carlo Swindles | Dan Piponi, Matthew Hoffman, Pavel Sountsov | In this work, we explore a complementary approach to variance reduction based on two classical Monte Carlo ’swindles’: first, running an auxiliary coupled chain targeting a tractable approximation to the target distribution, and using the auxiliary samples as control variates; and second, generating anti-correlated (“antithetic”) samples by running two chains with flipped randomness. |

348 | A single algorithm for both restless and rested rotting bandits | Julien Seznec, Pierre Menard, Alessandro Lazaric, Michal Valko | In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandits, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). |

349 | Adversarial Robustness of Flow-Based Generative Models | Phillip Pope, Yogesh Balaji, Soheil Feizi | In this paper, we study adversarial robustness of flow-based generative models both theoretically (for some simple models) and empirically (for more complex ones). |

350 | The Power of Batching in Multiple Hypothesis Testing | Tijana Zrnic, Daniel Jiang, Aaditya Ramdas, Michael Jordan | To this end, we introduce Batch-BH and Batch-St-BH, algorithms for controlling the FDR when a possibly infinite sequence of batches of hypotheses is tested by repeated application of one of the most widely used offline algorithms, the Benjamini-Hochberg (BH) method or Storey’s improvement of the BH method. |

351 | Adversarial Risk Bounds through Sparsity based Compression | Emilio Balda, Niklas Koep, Arash Behboodi, Rudolf Mathar | In this work, we focus on $\ell_\infty$ attacks with $\ell_\infty$ bounded inputs and prove margin-based bounds. Specifically, we use a compression-based approach that relies on efficiently compressing the set of tunable parameters without distorting the adversarial risk. |

352 | Learning spectrograms with convolutional spectral kernels | Zheyang Shen, Markus Heinonen, Samuel Kaski | We present a principled framework to interpret CSK, as well as other deep probabilistic models, using approximated Fourier transform, yielding a concise representation of input-frequency spectrogram. |

353 | Federated Heavy Hitters Discovery with Differential Privacy | Wennan Zhu, Peter Kairouz, Brendan McMahan, Haicheng Sun, Wei Li | To address these risks, we propose a distributed and privacy-preserving algorithm for discovering the heavy hitters in a population of user-generated data streams. |

354 | Online Batch Decision-Making with High-Dimensional Covariates | Chi-Hua Wang, Guang Cheng | We propose and investigate a class of new algorithms for sequential decision making that interacts with a batch of users simultaneously instead of a user at each decision epoch. |

355 | Sample Complexity of Estimating the Policy Gradient for Nearly Deterministic Dynamical Systems | Osbert Bastani | We propose a theoretical framework for understanding this phenomenon. |

356 | Scalable Gradients for Stochastic Differential Equations | Xuechen Li, Ting-Kam Leonard Wong, Ricky T. Q. Chen, David Duvenaud | We generalize this method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. |

357 | Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models | Xiao Zhang, Jinghui Chen, Quanquan Gu, David Evans | In this work, we assume the underlying data distribution is captured by some conditional generative model, and prove intrinsic robustness bounds for a general class of classifiers, which solves an open problem in Fawzi et al. (2018). |

358 | Uncertainty Quantification for Deep Context-Aware Mobile Activity Recognition and Unknown Context Discovery | Zepeng Huo, Arash PakBin, Xiaohan Chen, Nathan Hurley, Ye Yuan, Xiaoning Qian, Zhangyang Wang, Shuai Huang, Bobak Mortazavi | We develop a context-aware mixture of deep models termed the $\alpha$-$\beta$ network coupled with uncertainty quantification (UQ) based upon maximum entropy to enhance human activity recognition performance. |

359 | Learnable Bernoulli Dropout for Bayesian Deep Learning | Shahin Boluki, Randy Ardywibowo, Siamak Zamani Dadaneh, Mingyuan Zhou, Xiaoning Qian | In this work, we propose learnable Bernoulli dropout (LBD), a new model-agnostic dropout scheme that considers the dropout rates as parameters jointly optimized with other model parameters. |

360 | General Identification of Dynamic Treatment Regimes Under Interference | Eli Sherman, David Arbour, Ilya Shpitser | In this paper we consider the problem of identifying optimal treatment policies in the presence of interference. |

361 | Gaussian Sketching yields a J-L Lemma in RKHS | Samory Kpotufe, Bharath Sriperumbudur | The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $\bm K$ yields an operator whose counterpart in an RKHS $\cal H$, is a \emph{random projection} operator—in the spirit of Johnson-Lindenstrauss (J-L) lemma. |

362 | Wasserstein Smoothing: Certified Robustness against Wasserstein Adversarial Attacks | Alexander Levine, Soheil Feizi | In this work, we propose the first defense with certified robustness against Wasserstein adversarial attacks using randomized smoothing. |

363 | Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning | Ming Yin, Yu-Xiang Wang | In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. |

364 | Learning Dynamic Hierarchical Topic Graph with Graph Convolutional Network for Document Classification | Zhengjue Wang, Chaojie Wang, Hao Zhang, Zhibin Duan, Mingyuan Zhou, Bo Chen | To address these constraints, we integrate a probabilistic deep topic model into graph construction, and propose a novel trainable hierarchical topic graph (HTG), including word-level, hierarchical topic-level and document-level nodes, exhibiting semantic variation from fine-grained to coarse. |

365 | Differentiable Causal Backdoor Discovery | Limor Gultchin, Matt Kusner, Varun Kanade, Ricardo Silva | In this work, we present an algorithm that exploits auxiliary variables, similar to instruments, in order to find an appropriate adjustment by a gradient-based optimization method. |

366 | Stochastic Recursive Variance-Reduced Cubic Regularization Methods | Dongruo Zhou, Quanquan Gu | In this paper, we first present a Stochastic Recursive Variance-Reduced Cubic regularization method (SRVRC) using a recursively updated semi-stochastic gradient and Hessian estimators. |

367 | Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer | Yanshuai Cao, Peng Xu | In this work, we develop a novel regularizer to improve the learning of long-range dependency of sequence data. |

368 | On the Completeness of Causal Discovery in the Presence of Latent Confounding with Tiered Background Knowledge | Bryan Andrews | In this paper, we define tiered background knowledge and show that FCI is sound and complete with the incorporation of this knowledge. |

369 | One Sample Stochastic Frank-Wolfe | Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, Amin Karbasi | The aim of this paper is to bring them back without sacrificing the efficiency. |

370 | Convex Geometry of Two-Layer ReLU Networks: Implicit Autoencoding and Interpretable Models | Tolga Ergen, Mert Pilanci | We develop a convex analytic framework for ReLU neural networks which elucidates the inner workings of hidden neurons and their function space characteristics. |

371 | A Robust Univariate Mean Estimator is All You Need | Adarsh Prasad, Sivaraman Balakrishnan, Pradeep Ravikumar | In such an adversarial setup, we aim to design statistically optimal estimators for flexible non-parametric distribution classes such as distributions with bounded-2k moments and symmetric distributions. |

372 | Patient-Specific Effects of Medication Using Latent Force Models with Gaussian Processes | Li-Fang Cheng, Bianca Dumitrascu, Michael Zhang, Corey Chivers, Michael Draugelis, Kai Li, Barbara Engelhardt | We propose a novel approach that models the effect of interventions as a hybrid Gaussian process composed of a GP capturing patient baseline physiology convolved with a latent force model capturing effects of treatments on specific physiological features. |

373 | Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data | Simao Eduardo, Alfredo Nazabal, Christopher K. I. Williams, Charles Sutton | We introduce the Robust Variational Autoencoder (RVAE), a deep generative model that learns the joint distribution of the clean data while identifying the outlier cells, allowing their imputation (repair). |

374 | Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions | Kamiar Rahnama Rad, Wenda Zhou, Arian Maleki | This paper aims to fill this gap for penalized regression in the generalized linear family. |

375 | A Diversity-aware Model for Majority Vote Ensemble Accuracy | Bob Durrant, Nick Lim | In this paper, we explore the predictive power of several common diversity measures and show, with extensive experiments, that contrary to earlier work finding no clear link between these diversity measures (in isolation) and ensemble accuracy, by using the $\rho$ diversity measure of Sneath and Sokal as an estimator for the dispersion parameter of a Polya-Eggenberger distribution we can predict, independently of the choice of base classifier family, the accuracy of a majority vote classifier ensemble ridiculously well. |

376 | Scaling up Kernel Ridge Regression via Locality Sensitive Hashing | Amir Zandieh, Navid Nouri, Ameya Velingker, Michael Kapralov, Ilya Razenshteyn | In this paper we introduce a simple weighted version of random binning features, and show that the corresponding kernel function generates Gaussian processes of any desired smoothness. |

377 | Ordering-Based Causal Structure Learning in the Presence of Latent Variables | Daniel Bernstein, Basil Saeed, Chandler Squires, Caroline Uhler | Motivated by this result, we propose a greedy algorithm over the space of posets for causal structure discovery in the presence of latent confounders and compare its performance to the current state-of-the-art algorithms FCI and FCI+ on synthetic data. |

378 | Budget Learning via Bracketing | Durmus Alp Emre Acar, Aditya Gangrade, Venkatesh Saligrama | We propose a new formulation for the BL problem via the concept of bracketings. |

379 | Optimal Algorithms for Multiplayer Multi-Armed Bandits | PO-AN WANG, Alexandre Proutiere, Kaito Ariu, Yassir Jedra, Alessio Russo | For this problem, we present DPE1 (Decentralized Parsimonious Exploration), a decentralized algorithm that achieves the same asymptotic regret as that obtained by an optimal centralized algorithm. |

380 | AP-Perf: Incorporating Generic Performance Metrics in Differentiable Learning | Rizal Fathony, Zico Kolter | We propose a method that enables practitioners to conveniently incorporate custom non-decomposable performance metrics into differentiable learning pipelines, notably those based upon neural network architectures. |

381 | Optimal Deterministic Coresets for Ridge Regression | Praneeth Kacham, David Woodruff | We consider the ridge regression problem, for which we are given an $n \times d$ matrix $A$ of examples and a corresponding $n \times d'$ matrix $B$ of labels, as well as a ridge parameter $\lambda \geq 0$, and would like to output an $X' \in \mathbb{R}^{d \times d'}$ for which $\|AX'-B\|_F^2 + \lambda \|X'\|_F^2 \leq (1+\epsilon)\,\mathrm{OPT}$, where $\mathrm{OPT} = \min_{Y \in \mathbb{R}^{d \times d'}} \|AY-B\|_F^2 + \lambda \|Y\|_F^2$. |

382 | Expressiveness and Learning of Hidden Quantum Markov Models | Sandesh Adhikary, Siddarth Srinivasan, Geoff Gordon, Byron Boots | We tackle these problems by showing that HQMMs are a special subclass of the general class of observable operator models (OOMs) that do not suffer from the negative probability problem by design. |

383 | Solving the Robust Matrix Completion Problem via a System of Nonlinear Equations | Yunfeng Cai, Ping Li | We consider the problem of robust matrix completion, which aims to recover a low rank matrix $L_*$ and a sparse matrix $S_*$ from incomplete observations of their sum $M=L_*+S_*\in\mathbb{R}^{m\times n}$. |

384 | Explicit Mean-Square Error Bounds for Monte-Carlo and Linear Stochastic Approximation | Shuhang Chen, Adithya Devraj, Ana Busic, Sean Meyn | This paper concerns error bounds for recursive equations subject to Markovian disturbances. |

385 | Stochastic Neural Network with Kronecker Flow | Chin-Wei Huang, Ahmed Touati, Pascal Vincent, Gintare Karolina Dziugaite, Alexandre Lacoste, Aaron Courville | In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. |

386 | Fair Correlation Clustering | Sara Ahmadian, Alessandro Epasto, Ravi Kumar, Mohammad Mahdian | In this paper, we study correlation clustering under fairness constraints. |

387 | Towards Competitive N-gram Smoothing | Moein Falahatgar, Mesrob Ohannessian, Alon Orlitsky, Venkatadheeraj Pichapati | In the hopes of explaining this performance, we study it through the lens of competitive distribution estimation: the ability to perform as well as an oracle aware of further structure in the data. |

388 | Multi-level Gaussian Graphical Models Conditional on Covariates | Gi Bum Kim, Seyoung Kim | We propose a statistical model called multi-level conditional Gaussian graphical models for modeling multi-level output networks influenced by both individual-level and group-level inputs. |

389 | Semi-Modular Inference: enhanced learning in multi-modular models by tempering the influence of components | Christian Carmona, Geoff Nicholls | Working within an existing coherent loss-based generalisation of Bayesian inference, we show existing Modular/Cut-model inference is coherent, and write down a new family of Semi-Modular Inference (SMI) schemes, indexed by an influence parameter, with Bayesian inference and Cut-models as special cases. |

390 | Invertible Generative Modeling using Linear Rational Splines | Hadi Mohaghegh Dolatabadi, Sarah Erfani, Christopher Leckie | In this paper, we explore using linear rational splines as a replacement for affine transformations used in coupling layers. |

391 | LdSM: Logarithm-depth Streaming Multi-label Decision Trees | Maryam Majzoubi, Anna Choromanska | In this paper we develop the LdSM algorithm for the construction and training of multi-label decision trees, where in every node of the tree we optimize a novel objective function that favors balanced splits, maintains high class purity of children nodes, and allows sending examples to multiple directions but with a penalty that prevents tree over-growth. |

392 | Prior-aware Composition Inference for Spectral Topic Models | Moontae Lee, David Bindel, David Mimno | We propose two novel estimation methods that respect previously unclear prior structures of spectral topic models. |

393 | Variational Optimization on Lie Groups, with Examples of Leading (Generalized) Eigenvalue Problems | Molei Tao, Tomoki Ohsawa | The article considers smooth optimization of functions on Lie groups. |

394 | Best-item Learning in Random Utility Models with Subset Choices | Aadirupa Saha, Aditya Gopalan | We consider the problem of PAC learning the most valuable item from a pool of $n$ items using sequential, adaptively chosen plays of subsets of $k$ items, when, upon playing a subset, the learner receives relative feedback sampled according to a general Random Utility Model (RUM) with independent noise perturbations to the latent item utilities. |

395 | Regularized Autoencoders via Relaxed Injective Probability Flow | Abhishek Kumar, Ben Poole, Kevin Murphy | We propose a generative model based on probability flows that does away with the bijectivity requirement on the model and only assumes injectivity. |

396 | Stochastic Variance-Reduced Algorithms for PCA with Arbitrary Mini-Batch Sizes | Cheolmin Kim, Diego Klabjan | We present two stochastic variance-reduced PCA algorithms and their convergence analyses. |

397 | Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks | Mingchen Li, Mahdi Soltanolkotabi, Samet Oymak | Despite this (over)fitting capacity, in this paper we demonstrate that such overparameterized networks have an intriguing robustness capability: they are surprisingly robust to label noise when first order methods with early stopping are used to train them. |

398 | Scalable Nonparametric Factorization for High-Order Interaction Events | Zhimeng Pan, Zheng Wang, Shandian Zhe | To address these issues, we propose a Bayesian nonparametric factorization model for high-order interaction events, which can flexibly estimate/embed the static, nonlinear relationships and capture various long-term and short-term excitations effects, encoding these effects and their decaying patterns into the latent factors. |

399 | Gaussianization Flows | Chenlin Meng, Yang Song, Jiaming Song, Stefano Ermon | Based on iterative Gaussianization, we propose a new type of normalizing flow models that grants both efficient computation of likelihoods and efficient inversion for sample generation. |

400 | Adaptive, Distribution-Free Prediction Intervals for Deep Networks | Danijel Kivaranovic, Kory D. Johnson, Hannes Leeb | We present methods from the statistics literature that can be used efficiently with neural networks under minimal assumptions with guaranteed performance. |

401 | A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms | Philip Amortila, Doina Precup, Prakash Panangaden, Marc G. Bellemare | We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes. |

402 | Automatic Differentiation of Sketched Regression | Hang Liao, Barak Pearlmutter, Vamsi Potluru, David Woodruff | Sketching for speeding up regression problems involves using a sketching matrix $S$ to quickly find the approximate solution to a linear least squares regression (LLS) problem: given $A$ of size $n \times d$, with $n \gg d$, along with $b$ of size $n \times 1$, we seek a vector $y$ with minimal regression error $\lVert A y - b\rVert_2$. |

403 | Sublinear Optimal Policy Value Estimation in Contextual Bandits | Weihao Kong, Emma Brunskill, Gregory Valiant | We study the problem of estimating the expected reward of the optimal policy in the stochastic disjoint linear bandit setting. |

404 | Budget-Constrained Bandits over General Cost and Reward Distributions | Semih Cayci, Atilla Eryilmaz, R Srikant | In order to achieve tight regret bounds, we propose algorithms that exploit the correlation between the cost and reward of each arm by extracting the common information via linear minimum mean-square error estimation. |

405 | Measuring Mutual Information Between All Pairs of Variables in Subquadratic Complexity | Mohsen Ferdosi, Arash Gholamidavoodi, Hosein Mohimani | In this paper, we consider the problem of finding pairs of variables with high mutual information in sub-quadratic complexity. |

406 | Online Continuous DR-Submodular Maximization with Long-Term Budget Constraints | Omid Sadeghi, Maryam Fazel | In this paper, we study a class of online optimization problems with long-term budget constraints where the objective functions are not necessarily concave (nor convex), but they instead satisfy the Diminishing Returns (DR) property. |

407 | Prediction Focused Topic Models via Feature Selection | Jason Ren, Russell Kunes, Finale Doshi-Velez | We introduce a novel approach, the prediction-focused topic model, that uses the supervisory signal to retain only vocabulary terms that improve, or at least do not hinder, prediction performance. |

408 | Accelerated Factored Gradient Descent for Low-Rank Matrix Factorization | Dongruo Zhou, Yuan Cao, Quanquan Gu | In this paper, we answer this question affirmatively by proposing a novel and practical accelerated factored gradient descent method motivated by Nesterov’s accelerated gradient descent. |

409 | Structured Conditional Continuous Normalizing Flows for Efficient Amortized Inference in Graphical Models | Christian Weilbach, Boyan Beronov, Frank Wood, William Harvey | We exploit minimally faithful inversion of graphical model structures to specify sparse continuous normalizing flows (CNFs) for amortized inference. |

410 | Graph Coarsening with Preserved Spectral Properties | Yu Jin, Andreas Loukas, Joseph JaJa | We show that the proposed spectral distance captures the structural differences in the graph coarsening process. |

411 | A Theoretical and Practical Framework for Regression and Classification from Truncated Samples | Andrew Ilyas, Emmanouil Zampetakis, Constantinos Daskalakis | We present a general framework for regression and classification from samples that are truncated according to the value of the dependent variable. |

412 | Permutation Invariant Graph Generation via Score-Based Generative Modeling | Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, Stefano Ermon | To address this difficulty, we propose a permutation invariant approach to modeling graphs, using the recent framework of score-based generative modeling. |

413 | Finite-Time Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation | Jun Sun, Gang Wang, Georgios B. Giannakis, Qinmin Yang, Zaiyue Yang | In this paper, we provide a finite-time analysis of fully decentralized TD(0) learning under both i.i.d. and Markovian samples, and prove that all local estimates converge linearly to a small neighborhood of the optimum. |

414 | Multi-attribute Bayesian optimization with interactive preference learning | Raul Astudillo, Peter Frazier | We propose a novel multi-attribute Bayesian optimization with preference learning approach. |

415 | On the Sample Complexity of Learning Sum-Product Networks | Ishaq Aden-Ali, Hassan Ashtiani | In this work, we initiate the study of the sample complexity of PAC-learning the set of distributions that correspond to SPNs. |

416 | Tighter Theory for Local SGD on Identical and Heterogeneous Data | Ahmed Khaled Ragab Bayoumi, Konstantin Mishchenko, Peter Richtarik | We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. |

417 | Approximate Cross-validation: Guarantees for Model Assessment and Selection | Ashia Wilson, Maximilian Kasy, Lester Mackey | We address these questions with three main contributions: (i) we provide uniform non-asymptotic, deterministic model assessment guarantees for approximate CV; (ii) we show that (roughly) the same conditions also guarantee model selection performance comparable to CV; (iii) we provide a proximal Newton extension of the approximate CV framework for non-smooth prediction problems and develop improved assessment guarantees for problems such as L1-regularized ERM. |

418 | On Minimax Optimality of GANs for Robust Mean Estimation | Kaiwen Wu, Gavin Weiguang Ding, Ruitong Huang, Yaoliang Yu | In this work, we study the statistical and robust properties of GANs for Gaussian mean estimation under Huber’s contamination model, where an $\epsilon$ proportion of training data may be arbitrarily corrupted. |

419 | Auditing ML Models for Individual Bias and Unfairness | Songkai Xue, Mikhail Yurochkin, Yuekai Sun | We formalize the task as an optimization problem and develop a suite of inferential tools for the optimal value. |

420 | Stein Variational Inference for Discrete Distributions | Jun Han, Fan Ding, Xianglong Liu, Lorenzo Torresani, Jian Peng, Qiang Liu | In this work, we fill this gap by proposing a simple general-purpose framework that transforms discrete distributions into equivalent piecewise continuous distributions, on which we apply gradient-free Stein variational gradient descent to perform efficient approximate inference. |

421 | Revisiting Stochastic Extragradient | Konstantin Mishchenko, Dmitry Kovalev, Egor Shulgin, Peter Richtarik, Yura Malitsky | We fix a fundamental issue in the stochastic extragradient method by providing a new sampling strategy that is motivated by approximating implicit updates. |

422 | A Framework for Sample Efficient Interval Estimation with Control Variates | Shengjia Zhao, Christopher Yeh, Stefano Ermon | We consider the problem of estimating confidence intervals for the mean of a random variable, where the goal is to produce the smallest possible interval for a given number of samples. |

423 | Nonmyopic Gaussian Process Optimization with Macro-Actions | Dmitrii Kharkovskii, Chun Kai Ling, Bryan Kian Hsiang Low | This paper presents a multi-staged approach to nonmyopic adaptive Gaussian process optimization (GPO) for Bayesian optimization (BO) of unknown, highly complex objective functions that, in contrast to existing nonmyopic adaptive BO algorithms, exploits the notion of macro-actions for scaling up to a further lookahead to match up to a larger available budget. |