# Paper Digest: COLT 2019 Highlights

Readers can also choose to read this highlight article on our console, which allows users to filter papers by keyword and find related papers.

The Annual Conference on Learning Theory (COLT) focuses on theoretical aspects of machine learning and related topics.

To help the community quickly catch up on the work presented at this conference, the Paper Digest team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights to quickly grasp the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn for new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: COLT 2019 Papers

No. | Title | Authors | Highlight |
---|---|---|---|

1 | Conference on Learning Theory 2019: Preface | Alina Beygelzimer, Daniel Hsu | Conference on Learning Theory 2019: Preface |

2 | Inference under Information Constraints: Lower Bounds from Chi-Square Contraction | Jayadev Acharya, Clément L. Canonne, Himanshu Tyagi | We propose a unified framework to study such distributed inference problems under local information constraints. |

3 | Learning in Non-convex Games with an Optimization Oracle | Naman Agarwal, Alon Gonen, Elad Hazan | In this paper we show that by slightly strengthening the oracle model, the online and the statistical learning models become computationally equivalent. |

4 | Learning to Prune: Speeding up Repeated Computations | Daniel Alabi, Adam Tauman Kalai, Katrina Ligett, Cameron Musco, Christos Tzamos, Ellen Vitercik | We present an algorithm that learns to maximally prune the search space on repeated computations, thereby reducing runtime while provably outputting the correct solution each period with high probability. |

5 | Towards Testing Monotonicity of Distributions Over General Posets | Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, Anak Yodpinyanee | In this work, we consider the sample complexity required for testing the monotonicity of distributions over partial orders. |

6 | Testing Mixtures of Discrete Distributions | Maryam Aliakbarpour, Ravi Kumar, Ronitt Rubinfeld | In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. |

7 | Normal Approximation for Stochastic Gradient Descent via Non-Asymptotic Rates of Martingale CLT | Andreas Anastasiou, Krishnakumar Balasubramanian, Murat A. Erdogdu | We provide non-asymptotic convergence rates of the Polyak-Ruppert averaged stochastic gradient descent (SGD) to a normal random vector for a class of twice-differentiable test functions. |

8 | Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes | Peter Auer, Pratik Gajane, Ronald Ortner | For this setting, we propose an algorithm called ADSWITCH and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. |

9 | Achieving Optimal Dynamic Regret for Non-stationary Bandits without Prior Information | Peter Auer, Yifang Chen, Pratik Gajane, Chung-Wei Lee, Haipeng Luo, Ronald Ortner, Chen-Yu Wei | This joint extended abstract introduces and compares the results of (Auer et al., 2019) and (Chen et al., 2019), both of which resolve the problem of achieving optimal dynamic regret for non-stationary bandits without prior information on the non-stationarity. |

10 | A Universal Algorithm for Variational Inequalities Adaptive to Smoothness and Noise | Francis Bach, Kfir Y Levy | We present a universal algorithm for these inequalities based on the Mirror-Prox algorithm. |

11 | Learning Two Layer Rectified Neural Networks in Polynomial Time | Ainesh Bakshi, Rajesh Jayaram, David P Woodruff | We consider the following fundamental problem in the study of neural networks: given input examples $x \in \mathbb{R}^d$ and their vector-valued labels, as defined by an underlying generative neural network, recover the weight matrices of this network. |

12 | Private Center Points and Learning of Halfspaces | Amos Beimel, Shay Moran, Kobbi Nissim, Uri Stemmer | We present a private agnostic learner for halfspaces over an arbitrary finite domain $X\subset \mathbb{R}^d$ with sample complexity $\mathsf{poly}(d,2^{\log^*|X|})$. |

13 | Lower bounds for testing graphical models: colorings and antiferromagnetic Ising models | Ivona Bezáková, Antonio Blanca, Zongchen Chen, Daniel Štefankovič, Eric Vigoda | For the ferromagnetic (attractive) Ising model, Daskalakis et al. (2018) presented a polynomial time algorithm for identity testing. |

14 | Approximate Guarantees for Dictionary Learning | Aditya Bhaskara, Wai Ming Tai | The goal of our work is to understand what can be said in the absence of such assumptions. |

15 | The Optimal Approximation Factor in Density Estimation | Olivier Bousquet, Daniel Kane, Shay Moran | We develop two approaches to achieve the optimal approximation factor of $2$: an adaptive one and a static one. |

16 | Sorted Top-k in Rounds | Mark Braverman, Jieming Mao, Yuval Peres | We consider the sorted top-$k$ problem whose goal is to recover the top-$k$ items with the correct order out of $n$ items using pairwise comparisons. |

17 | Multi-armed Bandit Problems with Strategic Arms | Mark Braverman, Jieming Mao, Jon Schneider, S. Matthew Weinberg | Our goal is to design an algorithm for the principal incentivizing these arms to pass on as much of their private rewards as possible. |

18 | Universality of Computational Lower Bounds for Submatrix Detection | Matthew Brennan, Guy Bresler, Wasim Huleihel | Universality of Computational Lower Bounds for Submatrix Detection |

19 | Optimal Average-Case Reductions to Sparse PCA: From Weak Assumptions to Strong Hardness | Matthew Brennan, Guy Bresler | We give a reduction from $\textsc{pc}$ that yields the first full characterization of the computational barrier in the spiked covariance model, providing tight lower bounds at all sparsities $k$. |

20 | Learning rates for Gaussian mixtures under group action | Victor-Emmanuel Brunel | We provide an algebraic description and a geometric interpretation of these facts. |

21 | Near-optimal method for highly smooth convex optimization | Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, Aaron Sidford | We propose a near-optimal method for highly smooth convex optimization. |

22 | Improved Path-length Regret Bounds for Bandits | Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei | We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandit and more generally linear bandit. |

23 | Optimal Learning of Mallows Block Model | Robert Busa-Fekete, Dimitris Fotakis, Balázs Szörényi, Manolis Zampetakis | The Mallows model, introduced in the seminal paper of Mallows (1957), is one of the most fundamental ranking distributions over the symmetric group $S_m$. |

24 | Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret | Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco | In this paper, we introduce BKB (\textit{budgeted kernelized bandit}), a new approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and remarkably no assumption on the input space or covariance of the GP. |

25 | Disagreement-Based Combinatorial Pure Exploration: Sample Complexity Bounds and an Efficient Algorithm | Tongyi Cao, Akshay Krishnamurthy | We design new algorithms for the combinatorial pure exploration problem in the multi-arm bandit framework. |

26 | A Rank-1 Sketch for Matrix Multiplicative Weights | Yair Carmon, John C. Duchi, Aaron Sidford, Kevin Tian | We show that a simple randomized sketch of the matrix multiplicative weight (MMW) update enjoys (in expectation) the same regret bounds as MMW, up to a small constant factor. |

27 | On the Computational Power of Online Gradient Descent | Vaggos Chatziafratis, Tim Roughgarden, Joshua R. Wang | We prove that the evolution of weight vectors in online gradient descent can encode arbitrary polynomial-space computations, even in very simple learning settings. |

28 | Active Regression via Linear-Sample Sparsification | Xue Chen, Eric Price | We present an approach that improves the sample complexity for a variety of curve fitting problems, including active learning for linear regression, polynomial regression, and continuous sparse Fourier transforms. |

29 | A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free | Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei | We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. |

30 | Faster Algorithms for High-Dimensional Robust Covariance Estimation | Yu Cheng, Ilias Diakonikolas, Rong Ge, David P. Woodruff | Our main contribution is to develop faster algorithms for this problem whose running time nearly matches that of computing the empirical covariance. |

31 | Testing Symmetric Markov Chains Without Hitting | Yeshwanth Cherapanamjeri, Peter L. Bartlett | In this paper, we propose an algorithm that avoids this dependence on hitting time, thus enabling efficient testing of Markov chains even in cases where it is infeasible to observe every state in the chain. |

32 | Fast Mean Estimation with Sub-Gaussian Rates | Yeshwanth Cherapanamjeri, Nicolas Flammarion, Peter L. Bartlett | We propose an estimator for the mean of a random vector in $\mathbb{R}^d$ that can be computed in time $O(n^{3.5}+n^2d)$ for $n$ i.i.d. samples and that has error bounds matching the sub-Gaussian case. |

33 | Vortices Instead of Equilibria in MinMax Optimization: Chaos and Butterfly Effects of Online Learning in Zero-Sum Games | Yun Kuen Cheung, Georgios Piliouras | We establish that algorithmic experiments in zero-sum games “fail miserably” to confirm the unique, sharp prediction of maxmin equilibration. |

34 | Pure entropic regularization for metrical task systems | Christian Coester, James R. Lee | We show that on every $n$-point HST metric, there is a randomized online algorithm for metrical task systems (MTS) that is $1$-competitive for service costs and $O(\log n)$-competitive for movement costs. |

35 | A near-optimal algorithm for approximating the John Ellipsoid | Michael B. Cohen, Ben Cousins, Yin Tat Lee, Xin Yang | We develop a simple and efficient algorithm for approximating the John Ellipsoid of a symmetric polytope. |

36 | Artificial Constraints and Hints for Unbounded Online Learning | Ashok Cutkosky | We provide algorithms that guarantee regret $R_T(u)\le \tilde O(G\|u\|^3 + G(\|u\|+1)\sqrt{T})$ or $R_T(u)\le \tilde O(G\|u\|^3T^{1/3} + GT^{1/3}+ G\|u\|\sqrt{T})$ for online convex optimization with $G$-Lipschitz losses for any comparison point $u$ without prior knowledge of either $G$ or $\|u\|$. |

37 | Combining Online Learning Guarantees | Ashok Cutkosky | We show how to take any two parameter-free online learning algorithms with different regret guarantees and obtain a single algorithm whose regret is the minimum of the two base algorithms. |

38 | Learning from Weakly Dependent Data under Dobrushin's Condition | Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti | Statistical learning theory has largely focused on learning and generalization given independent and identically distributed (i.i.d.) samples. |

39 | Space lower bounds for linear prediction in the streaming model | Yuval Dagan, Gil Kur, Ohad Shamir | We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least \emph{quadratic} in the dimension, in a natural streaming setting. |

40 | Computationally and Statistically Efficient Truncated Regression | Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, Manolis Zampetakis | We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variable $y = \vec{w}^{\rm T} \vec{x}+{\varepsilon}$ and its corresponding vector of covariates $\vec{x} \in \mathbb{R}^k$ are only revealed if the dependent variable falls in some subset $S \subseteq \mathbb{R}$; otherwise the existence of the pair $(\vec{x},y)$ is hidden. |

41 | Reconstructing Trees from Traces | Sami Davies, Miklos Z. Racz, Cyrus Rashtchian | We study the problem of learning a node-labeled tree given independent traces from an appropriately defined deletion channel. |

42 | Is your function low dimensional? | Anindya De, Elchanan Mossel, Joe Neeman | In this paper, we study the problem of testing whether a given $n$ variable function $f : \mathbb{R}^n \to \{0,1\}$, is a linear $k$-junta or $\epsilon$-far from all linear $k$-juntas, where the closeness is measured with respect to the Gaussian measure on $\mathbb{R}^n$. |

43 | Computational Limitations in Robust Classification and Win-Win Results | Akshay Degwekar, Preetum Nakkiran, Vinod Vaikuntanathan | In this work, we extend their work in three directions. |

44 | Fast determinantal point processes via distortion-free intermediate sampling | Michal Derezinski | To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time $O\big(\text{number-of-non-zeros}(\mathbf{X})\cdot\log n\big)+\text{poly}(d)$, and (2) a sampling step which runs in $\text{poly}(d)$ time, independent of the number of rows $n$. |

45 | Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression | Michal Derezinski, Kenneth L. Clarkson, Michael W. Mahoney, Manfred K. Warmuth | In the process, we develop a new algorithm for a joint sampling distribution called volume sampling, and we propose a new i.i.d. importance sampling method: inverse score sampling. |

46 | Communication and Memory Efficient Testing of Discrete Distributions | Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, Sankeerth Rao | In both these models, we provide efficient algorithms for uniformity/identity testing (goodness of fit) and closeness testing (two sample testing). |

47 | Testing Identity of Multidimensional Histograms | Ilias Diakonikolas, Daniel M. Kane, John Peebles | We investigate the problem of identity testing for multidimensional histogram distributions. |

48 | Lower Bounds for Parallel and Randomized Convex Optimization | Jelena Diakonikolas, Cristóbal Guzmán | Prior to our work, lower bounds for parallel convex optimization algorithms were only known in a small fraction of the settings considered in this paper, mainly applying to Euclidean ($\ell_2$) and $\ell_\infty$ spaces. |

49 | On the Performance of Thompson Sampling on Logistic Bandits | Shi Dong, Tengyu Ma, Benjamin Van Roy | We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$ and actions $a$ and coefficients $\theta$ are within the $d$-dimensional unit ball. |

50 | Lower Bounds for Locally Private Estimation via Communication Complexity | John Duchi, Ryan Rogers | We develop lower bounds for estimation under local privacy constraints—including differential privacy and its relaxations to approximate or Rényi differential privacy—by showing an equivalence between private estimation and communication-restricted estimation problems. |

51 | Sharp Analysis for Nonconvex SGD Escaping from Saddle Points | Cong Fang, Zhouchen Lin, Tong Zhang | In this paper, we give a sharp analysis for Stochastic Gradient Descent (SGD) and prove that SGD is able to efficiently escape from saddle points and find an $(\epsilon, O(\epsilon^{0.5}))$-approximate second-order stationary point in $\tilde{O}(\epsilon^{-3.5})$ stochastic gradient computations for generic nonconvex optimization problems, when the objective function satisfies gradient-Lipschitz, Hessian-Lipschitz, and dispersive noise assumptions. |

52 | Achieving the Bayes Error Rate in Stochastic Block Model by SDP, Robustly | Yingjie Fei, Yudong Chen | We study the statistical performance of the semidefinite programming (SDP) relaxation approach for clustering under the binary symmetric Stochastic Block Model (SBM). |

53 | High probability generalization bounds for uniformly stable algorithms with nearly optimal rate | Vitaly Feldman, Jan Vondrak | Our proof technique is new and we introduce several analysis tools that might find additional applications. |

54 | Sum-of-squares meets square loss: Fast rates for agnostic tensor completion | Dylan J. Foster, Andrej Risteski | For agnostic learning of third-order tensors with the square loss, we give the first polynomial time algorithm that obtains a “fast” (i.e., $O(1/n)$-type) rate improving over the rate obtained by reduction to matrix completion. |

55 | The Complexity of Making the Gradient Small in Stochastic Convex Optimization | Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth | We give nearly matching upper and lower bounds on the oracle complexity of finding $\epsilon$-stationary points ($\|\nabla F(x)\|\leq\epsilon$) in stochastic convex optimization. |

56 | Statistical Learning with a Nuisance Component | Dylan J. Foster, Vasilis Syrgkanis | We provide excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target model depends on an unknown model that must be estimated from data (a “nuisance model”). |

57 | On the Regret Minimization of Nonconvex Online Gradient Ascent for Online PCA | Dan Garber | In this paper we focus on the problem of Online Principal Component Analysis in the regret minimization framework. |

58 | Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization | Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, César A. Uribe | We propose a new tensor method, which closes the gap between the lower $\Omega\left(\epsilon^{-\frac{2}{3p+1}}\right)$ and upper $O\left(\epsilon^{-\frac{1}{p+1}}\right)$ iteration complexity bounds for this class of optimization problems. |

59 | Near Optimal Methods for Minimizing Convex Functions with Lipschitz $p$-th Derivatives | Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, César A. Uribe, Bo Jiang, Haoyue Wang, Shuzhong Zhang, Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, Aaron Sidford | In this merged paper, we consider the problem of minimizing a convex function with Lipschitz-continuous $p$-th order derivatives. |

60 | Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization | Rong Ge, Zhize Li, Weiyao Wang, Xiang Wang | In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can find an $\epsilon$-second-order stationary point using only $\widetilde{O}(n^{2/3}/\epsilon^2+n/\epsilon^{1.5})$ stochastic gradients. |

61 | Learning Ising Models with Independent Failures | Surbhi Goel, Daniel M. Kane, Adam R. Klivans | We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability $p$. |

62 | Learning Neural Networks with Two Nonlinear Layers in Polynomial Time | Surbhi Goel, Adam R. Klivans | We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). |

63 | When can unlabeled data improve the learning rate? | Christina Göpfert, Shai Ben-David, Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Ruth Urner | Our analysis focuses on improvements in the \emph{minimax} learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones). |

64 | Sampling and Optimization on Convex Sets in Riemannian Manifolds of Non-Negative Curvature | Navin Goyal, Abhishek Shetty | In this paper, we study sampling and convex optimization problems over manifolds of non-negative curvature proving polynomial running time in the dimension and other relevant parameters. |

65 | Better Algorithms for Stochastic Bandits with Adversarial Corruptions | Anupam Gupta, Tomer Koren, Kunal Talwar | We present a new algorithm for this problem whose regret is nearly optimal, substantially improving upon previous work. |

66 | Tight analyses for non-smooth stochastic gradient descent | Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, Sikander Randhawa | We prove that after $T$ steps of stochastic gradient descent, the error of the final iterate is $O(\log(T)/T)$ \emph{with high probability}. |

67 | Reasoning in Bayesian Opinion Exchange Networks Is PSPACE-Hard | Jan Hazla, Ali Jadbabaie, Elchanan Mossel, M. Amin Rahimian | We study the Bayesian model of opinion exchange of fully rational agents arranged on a network. |

68 | How Hard is Robust Mean Estimation? | Samuel B. Hopkins, Jerry Li | In this work we give worst-case complexity-theoretic evidence that improving on the error rates of current polynomial-time algorithms for robust mean estimation may be computationally intractable in natural settings. |

69 | A Robust Spectral Algorithm for Overcomplete Tensor Decomposition | Samuel B. Hopkins, Tselil Schramm, Jonathan Shi | We give a spectral algorithm for decomposing overcomplete order-4 tensors, so long as their components satisfy an algebraic non-degeneracy condition that holds for nearly all (all but an algebraic set of measure $0$) tensors over $(\mathbb{R}^d)^{\otimes 4}$ with rank $n \le d^2$. |

70 | Sample-Optimal Low-Rank Approximation of Distance Matrices | Piotr Indyk, Ali Vakilian, Tal Wagner, David P. Woodruff | In this work we study algorithms for low-rank approximation of distance matrices. |

71 | Making the Last Iterate of SGD Information Theoretically Optimal | Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli | The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of \emph{last point} of SGD as well as GD. |

72 | Accuracy-Memory Tradeoffs and Phase Transitions in Belief Propagation | Vishesh Jain, Frederic Koehler, Jingbo Liu, Elchanan Mossel | We prove a conjecture of Evans, Kenyon, Peres, and Schulman (2000) which states that any bounded memory message passing algorithm is statistically much weaker than Belief Propagation for the reconstruction problem. |

73 | The implicit bias of gradient descent on nonseparable data | Ziwei Ji, Matus Telgarsky | The implicit bias of gradient descent on nonseparable data |

74 | An Optimal High-Order Tensor Method for Convex Optimization | Bo Jiang, Haoyue Wang, Shuzhong Zhang | In this paper, we propose a new high-order tensor algorithm for the general composite case, with the iteration complexity of $O(1/k^{(3d+1)/2})$, which matches the lower bound for $d$-th order methods as established in Nesterov (2018) and Shamir et al. (2018), and hence is optimal. |

75 | Parameter-Free Online Convex Optimization with Sub-Exponential Noise | Kwang-Sung Jun, Francesco Orabona | We consider the problem of unconstrained online convex optimization (OCO) with sub-exponential noise, a strictly more general problem than the standard OCO. |

76 | Sample complexity of partition identification using multi-armed bandits | Sandeep Juneja, Subhashini Krishnasamy | Given a vector of probability distributions, or arms, each of which can be sampled independently, we consider the problem of identifying the partition to which this vector belongs from a finitely partitioned universe of such vector of distributions. |

77 | Privately Learning High-Dimensional Distributions | Gautam Kamath, Jerry Li, Vikrant Singhal, Jonathan Ullman | We present novel, computationally efficient, and differentially private algorithms for two fundamental high-dimensional learning problems: learning a multivariate Gaussian and learning a product distribution over the Boolean hypercube in total variation distance. |

78 | On Communication Complexity of Classification Problems | Daniel Kane, Roi Livni, Shay Moran, Amir Yehudayoff | This work studies distributed learning in the spirit of Yao’s model of communication complexity: consider a two-party setting, where each of the players gets a list of labelled examples and they communicate in order to jointly perform some learning task. |

79 | Non-asymptotic Analysis of Biased Stochastic Approximation Scheme | Belhal Karimi, Blazej Miasojedow, Eric Moulines, Hoi-To Wai | These restrictions are all essentially relaxed in this work. |

80 | Discrepancy, Coresets, and Sketches in Machine Learning | Zohar Karnin, Edo Liberty | We provide general techniques for bounding the class discrepancy of machine learning problems. |

81 | Bandit Principal Component Analysis | Wojciech Kotlowski, Gergely Neu | Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T})$ after $T$ rounds in the worst case. |

82 | Contextual bandits with continuous actions: Smoothing, zooming, and adapting | Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, Chicheng Zhang | We study contextual bandit learning for any competitor policy class and continuous action space. |

83 | Distribution-Dependent Analysis of Gibbs-ERM Principle | Ilja Kuzborskij, Nicolò Cesa-Bianchi, Csaba Szepesvári | In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularized empirical risk with the goal to understand the interplay between the data-generating distribution and learning in large hypothesis spaces. |

84 | Global Convergence of the EM Algorithm for Mixtures of Two Component Linear Regression | Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, Damek Davis | Our analysis reveals that EM exhibits very different behavior in Mixed Linear Regression from its behavior in Gaussian Mixture Models, and hence our proofs require the development of several new ideas. |

85 | An Information-Theoretic Approach to Minimax Regret in Partial Monitoring | Tor Lattimore, Csaba Szepesvári | We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under finite-action partial monitoring with no assumptions on the space of signals or decisions of the adversary. |

86 | Solving Empirical Risk Minimization in the Current Matrix Multiplication Time | Yin Tat Lee, Zhao Song, Qiuyi Zhang | In this paper, we give an algorithm that runs in time $O^*\left( \left( n^{\omega} + n^{2.5 - \alpha/2} + n^{2+ 1/6} \right) \log (n / \delta) \right)$ where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy. |

87 | On Mean Estimation for General Norms with Statistical Queries | Jerry Li, Aleksandar Nikolov, Ilya Razenshteyn, Erik Waingarten | We study the problem of mean estimation for high-dimensional distributions given access to a statistical query oracle. |

88 | Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits | Yingkai Li, Yining Wang, Yuan Zhou | When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $\Omega(\sqrt{dT \log T \log n})$ for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors. |

89 | Sharp Theoretical Analysis for Nonparametric Testing under Random Projection | Meimei Liu, Zuofeng Shang, Guang Cheng | In this paper, we develop computationally efficient nonparametric testing by employing a random projection strategy. |

90 | Combinatorial Algorithms for Optimal Design | Vivek Madan, Mohit Singh, Uthaipon Tantipongpipat, Weijun Xie | In this paper, we bridge this gap and prove approximation guarantees for the local search algorithms for D-optimal design and A-optimal design problems. |

91 | Nonconvex sampling with the Metropolis-adjusted Langevin algorithm | Oren Mangoubi, Nisheeth K Vishnoi | Our main technical contribution is an analysis of the Metropolis acceptance probability of MALA in terms of its “energy-conservation error," and a bound for this error in terms of third- and fourth- order regularity conditions. |

92 | Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance | Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, Alessandro Rudi | We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. |

93 | Planting trees in graphs, and finding them back | Laurent Massoulié, Ludovic Stephan, Don Towsley | In this paper we study the two inference problems of detection and reconstruction in the context of planted structures in sparse Erdős-Rényi random graphs $\mathcal G(n,\lambda/n)$ with fixed average degree $\lambda>0$. |

94 | Uniform concentration and symmetrization for weak interactions | Andreas Maurer, Massimiliano Pontil | The method to derive uniform bounds with Gaussian and Rademacher complexities is extended to the case where the sample average is replaced by a nonlinear statistic. |

95 | Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit | Song Mei, Theodor Misiakiewicz, Andrea Montanari | In this paper we establish stronger and more general approximation guarantees. |

96 | Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem | Nadav Merlis, Shie Mannor | To overcome this problem, we introduce a new smoothness criterion, which we term \emph{Gini-weighted smoothness}, that takes into account both the nonlinearity of the reward and concentration properties of the arms. |

97 | Lipschitz Adaptivity with Multiple Learning Rates in Online Learning | Zakaria Mhammedi, Wouter M Koolen, Tim Van Erven | In the present work we remove this Lipschitz hyperparameter by designing new versions of MetaGrad and Squint that adapt to its optimal value automatically. |

98 | VC Classes are Adversarially Robustly Learnable, but Only Improperly | Omar Montasser, Steve Hanneke, Nathan Srebro | We study the question of learning an adversarially robust predictor. |

99 | Affine Invariant Covariance Estimation for Heavy-Tailed Distributions | Dmitrii M. Ostrovskii, Alessandro Rudi | In this work we provide an estimator for the covariance matrix of a heavy-tailed multivariate distribution. |

100 | Stochastic Gradient Descent Learns State Equations with Nonlinear Activations | Samet Oymak | We study discrete time dynamical systems governed by the state equation $h_{t+1}=\phi(Ah_t+Bu_t)$. |

101 | A Theory of Selective Prediction | Mingda Qiao, Gregory Valiant | Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some $t < n$ and $m \le n - t$, after seeing $t$ observations we predict the average of $x_{t+1}, \ldots, x_{t+m}$. |

102 | Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon | Alexander Rakhlin, Xiyu Zhai | We show that minimum-norm interpolation in the Reproducing Kernel Hilbert Space corresponding to the Laplace kernel is not consistent if input dimension is constant. |

103 | Classification with unknown class-conditional label noise on non-compact feature spaces | Henry Reeve, Ata Kabán | We investigate the problem of classification in the presence of unknown class-conditional label noise in which the labels observed by the learner have been corrupted with some unknown class dependent probability. |

104 | The All-or-Nothing Phenomenon in Sparse Linear Regression | Galen Reeves, Jiaming Xu, Ilias Zadik | We study the problem of recovering a hidden binary $k$-sparse $p$-dimensional vector $\beta$ from $n$ noisy linear observations $Y=X\beta+W$ where $X_{ij}$ are i.i.d. $\mathcal{N}(0,1)$ and $W_i$ are i.i.d. $\mathcal{N}(0,\sigma^2)$. |
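A minimal data generator for the observation model of entry 104, $Y = X\beta + W$ with a binary $k$-sparse $\beta$; the particular dimensions $n$, $p$, $k$ and noise level $\sigma$ below are arbitrary choices for illustration.

```python
import numpy as np

# Sample from the sparse linear regression model Y = X beta + W:
# beta is binary and k-sparse in dimension p, X_ij ~ N(0,1),
# W_i ~ N(0, sigma^2). All sizes here are illustrative.
rng = np.random.default_rng(1)
n, p, k, sigma = 200, 50, 5, 0.1

beta = np.zeros(p)
beta[rng.choice(p, size=k, replace=False)] = 1.0   # binary k-sparse signal
X = rng.standard_normal((n, p))
W = sigma * rng.standard_normal(n)
Y = X @ beta + W
```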

105 | Depth Separations in Neural Networks: What is Actually Being Separated? | Itay Safran, Ronen Eldan, Ohad Shamir | In this paper, we study whether such depth separations might still hold in the natural setting of $\mathcal{O}(1)$-Lipschitz radial functions, when $\epsilon$ does not scale with $d$. |

106 | How do infinite width bounded norm networks look in function space? | Pedro Savarese, Itay Evron, Daniel Soudry, Nathan Srebro | We consider the question of what functions can be captured by ReLU networks with an unbounded number of units (infinite width), but where the overall network Euclidean norm (sum of squares of all weights in the system, except for an unregularized bias term for each unit) is bounded; or equivalently what is the minimal norm required to approximate a given function. |

107 | Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks | Ohad Shamir | We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. |
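The objective family in entry 107 can be instantiated concretely. The sketch below runs plain gradient descent on $f(\prod_{i=1}^{k} w_i)$ for the illustrative choice $f(x) = (x-1)^2$; the depth $k$, step size, and initialization are assumptions of this example, not settings from the paper.

```python
import numpy as np

# Gradient descent on f(prod_i w_i) for scalar parameters w_1..w_k,
# with the illustrative choice f(x) = (x - 1)^2.
def run_gd(k=3, lr=0.01, steps=5000, w0=0.5):
    w = np.full(k, w0)
    for _ in range(steps):
        prod = np.prod(w)
        grad_f = 2.0 * (prod - 1.0)      # f'(x) evaluated at x = prod
        # d/dw_i f(prod) = f'(prod) * prod_{j != i} w_j
        grads = grad_f * prod / w        # valid while no w_i is zero
        w -= lr * grads
    return np.prod(w)
```

With this benign initialization the product converges to the minimizer of $f$; the paper's point is that for other initializations the convergence time can scale exponentially with the depth $k$.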

108 | Learning Linear Dynamical Systems with Semi-Parametric Least Squares | Max Simchowitz, Ross Boczar, Benjamin Recht | We analyze a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, \emph{semi-parametric} noise, an error model studied more broadly in causal statistics and active learning. |

109 | Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning | R. Srikant, Lei Ying | We consider the dynamics of a linear stochastic approximation algorithm driven by Markovian noise, and derive finite-time bounds on the moments of the error, i.e., deviation of the output of the algorithm from the equilibrium point of an associated ordinary differential equation (ODE). |

110 | Robustness of Spectral Methods for Community Detection | Ludovic Stephan, Laurent Massoulié | In the sparse case, where edge probabilities are in $O(1/n)$, we introduce a new spectral method based on the distance matrix $D^{(\ell)}$, where $D^{(\ell)}_{ij} = 1$ iff the graph distance between $i$ and $j$, denoted $d(i, j)$, is equal to $\ell$. |

111 | Maximum Entropy Distributions: Bit Complexity and Stability | Damian Straszak, Nisheeth K. Vishnoi | Here we show that these questions are related and resolve both of them. |

112 | Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression | Arun Sai Suggala, Kush Bhatia, Pradeep Ravikumar, Prateek Jain | We study the problem of robust linear regression with response variable corruptions. |

113 | Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches | Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford | We design new algorithms for RL with a generic model class and analyze their statistical properties. |

114 | Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions | Adrien Taylor, Francis Bach | We provide a novel computer-assisted technique for systematically analyzing first-order methods for optimization. |

115 | The Relative Complexity of Maximum Likelihood Estimation, MAP Estimation, and Sampling | Christopher Tosh, Sanjoy Dasgupta | By way of illustration, we show how hardness results for ML estimation of mixtures of Gaussians and topic models carry over to MAP estimation and approximate sampling under commonly used priors. |

116 | The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint | Stephen Tu, Benjamin Recht | We show that for policy evaluation, a simple model-based plugin method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of state dimension. |

117 | Theoretical guarantees for sampling and inference in generative models with latent diffusions | Belinda Tzen, Maxim Raginsky | We introduce and study a class of probabilistic generative models, where the latent object is a finite-dimensional diffusion process on a finite time interval and the observed variable is drawn conditionally on the terminal point of the diffusion. |

118 | Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds | Santosh Vempala, John Wilmes | We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. |

119 | Estimation of smooth densities in Wasserstein distance | Jonathan Weed, Quentin Berthet | We prove the first minimax rates for estimation of smooth densities for general Wasserstein distances, thereby showing how the curse of dimensionality can be alleviated for sufficiently regular measures. |

120 | Estimating the Mixing Time of Ergodic Markov Chains | Geoffrey Wolfer, Aryeh Kontorovich | Our key insight is to estimate the pseudo-spectral gap instead, which allows us to overcome the loss of self-adjointness and to achieve a polynomial dependence on $d$ and the minimal stationary probability $\pi_\star$. |

121 | Stochastic Approximation of Smooth and Strongly Convex Functions: Beyond the $O(1/T)$ Convergence Rate | Lijun Zhang, Zhi-Hua Zhou | In this paper, we make use of smoothness and strong convexity simultaneously to boost the convergence rate. |

122 | Open Problem: Is Margin Sufficient for Non-Interactive Private Distributed Learning? | Amit Daniely, Vitaly Feldman | Open Problem: Is Margin Sufficient for Non-Interactive Private Distributed Learning? |

123 | Open Problem: How fast can a multiclass test set be overfit? | Vitaly Feldman, Roy Frostig, Moritz Hardt | Open Problem: How fast can a multiclass test set be overfit? |

124 | Open Problem: Do Good Algorithms Necessarily Query Bad Points? | Rong Ge, Prateek Jain, Sham M. Kakade, Rahul Kidambi, Dheeraj M. Nagaraj, Praneeth Netrapalli | Building on these folklore results and some recent developments, this manuscript considers a more subtle question: does any algorithm necessarily (information-theoretically) have to query iterates that are sub-optimal infinitely often? |

125 | Open Problem: Risk of Ruin in Multiarmed Bandits | Filipo S. Perotto, Mathieu Bourgais, Bruno C. Silva, Laurent Vercouter | We formalize a particular class of problems called \textit{survival multiarmed bandits} (S-MAB), which constitutes a modified version of \textit{budgeted multiarmed bandits} (B-MAB) where a true \textit{risk of ruin} must be considered, bringing it closer to \textit{risk-averse multiarmed bandits} (RA-MAB). |

126 | Open Problem: Monotonicity of Learning | Tom Viering, Alexander Mey, Marco Loog | We pose the question to what extent a learning algorithm behaves monotonically in the following sense: does it perform better, in expectation, when adding one instance to the training set? |

127 | Open Problem: The Oracle Complexity of Convex Optimization with Limited Memory | Blake Woodworth, Nathan Srebro | We note that known methods achieving the optimal oracle complexity for first order convex optimization require quadratic memory, and ask whether this is necessary, and more broadly seek to characterize the minimax number of first order queries required to optimize a convex Lipschitz function subject to a memory constraint. |