# Paper Digest: COLT 2017 Highlights

Readers can also choose to read this highlight article on our console, which allows users to filter papers by keyword and find related papers.

The Annual Conference on Learning Theory (COLT) focuses on theoretical aspects of machine learning and related topics.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights / summaries to quickly grasp the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated with new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: COLT 2017 Papers

No. | Title | Authors | Highlight
---|---|---|---

1 | Preface: Conference on Learning Theory (COLT), 2017 | Satyen Kale, Ohad Shamir | Preface: Conference on Learning Theory (COLT), 2017 |

2 | Open Problem: First-Order Regret Bounds for Contextual Bandits | Alekh Agarwal, Akshay Krishnamurthy, John Langford, Haipeng Luo, Robert E. Schapire | We describe two open problems related to first-order regret bounds for contextual bandits. |

3 | Open Problem: Meeting Times for Learning Random Automata | Benjamin Fish, Lev Reyzin | In this note, we propose a method to find faster algorithms for this problem. |

4 | Corralling a Band of Bandit Algorithms | Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire | As examples, we present two main applications. |

5 | Learning with Limited Rounds of Adaptivity: Coin Tossing, Multi-Armed Bandits, and Ranking from Pairwise Comparisons | Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, Sanjeev Khanna | We study the relationship between query complexity and adaptivity in identifying the $k$ most biased coins among a set of $n$ coins with unknown biases. |

6 | Thompson Sampling for the MNL-Bandit | Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, Assaf Zeevi | We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance. |

7 | Homotopy Analysis for Tensor PCA | Anima Anandkumar, Yuan Deng, Rong Ge, Hossein Mobahi | In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. |

8 | Correspondence retrieval | Alexandr Andoni, Daniel Hsu, Kevin Shi, Xiaorui Sun | In the case of independent standard Gaussian measurement vectors, the main algorithm proposed in this work requires $n = d+1$ measurements to correctly return the $k$ unknown points with high probability. |

9 | Efficient PAC Learning from the Crowd | Pranjal Awasthi, Avrim Blum, Nika Haghtalab, Yishay Mansour | In this paper, we show how by interleaving the process of labeling and learning, we can attain computational efficiency with much less overhead in the labeling cost. |

10 | The Price of Selection in Differential Privacy | Mitali Bafna, Jonathan Ullman | In the differentially private top-$k$ selection problem, we are given a dataset $X \in \{\pm 1\}^{n \times d}$, in which each row belongs to an individual and each column corresponds to some binary attribute, and our goal is to find a set of $k \ll d$ columns whose means are approximately as large as possible. |

11 | Computationally Efficient Robust Sparse Estimation in High Dimensions | Sivaraman Balakrishnan, Simon S. Du, Jerry Li, Aarti Singh | We consider the problem of robust estimation of sparse functionals, and provide a computationally and statistically efficient algorithm in the high-dimensional setting. |

12 | Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems | Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, Colin White | Recently, Gupta and Roughgarden introduced the first learning-theoretic framework to rigorously study this problem, using it to analyze classes of greedy heuristics, parameter tuning in gradient descent, and other problems. |

13 | The Sample Complexity of Optimizing a Convex Function | Eric Balkanski, Yaron Singer | In this paper we study optimization from samples of convex functions. |

14 | Efficient Co-Training of Linear Separators under Weak Dependence | Avrim Blum, Yishay Mansour | We develop the first polynomial-time algorithm for co-training of homogeneous linear separators under *weak dependence*, a relaxation of the condition of independence given the label. |

15 | Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo | Nicolas Brosse, Alain Durmus, Éric Moulines, Marcelo Pereyra | This paper presents a detailed theoretical analysis of the Langevin Monte Carlo sampling algorithm recently introduced in Durmus et al. (Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau, 2016) when applied to log-concave probability distributions that are restricted to a convex body $K$. |

16 | Rates of estimation for determinantal point processes | Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet, John Urschel | In this paper, we study the local geometry of the expected log-likelihood function to prove several rates of convergence for the MLE. |

17 | Learning Disjunctions of Predicates | Nader H. Bshouty, Dana Drachsler-Cohen, Martin Vechev, Eran Yahav | We give an algorithm for learning $\mathcal{F}_\vee := \{\vee_{f \in S} f \mid S \subseteq \mathcal{F}\}$ from membership queries. |

18 | Testing Bayesian Networks | Clement L. Canonne, Ilias Diakonikolas, Daniel M. Kane, Alistair Stewart | Our main contribution is the first non-trivial efficient testing algorithms for these problems and corresponding information-theoretic lower bounds. |

19 | Multi-Observation Elicitation | Sebastian Casalaina-Martin, Rafael Frongillo, Tom Morgan, Bo Waggoner | We study loss functions that measure the accuracy of a prediction based on multiple data points simultaneously. |

20 | Algorithmic Chaining and the Role of Partial Feedback in Online Nonparametric Learning | Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, Sébastien Gerchinovitz | For full information feedback and Lipschitz losses, we design the first explicit algorithm achieving the minimax regret rate (up to log factors). |

21 | Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration | Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, Ruosong Wang | We study the combinatorial pure exploration problem Best-Set in a stochastic multi-armed bandit game. We further introduce an even more general problem, formulated in geometric terms. |

22 | Towards Instance Optimal Bounds for Best Arm Identification | Lijie Chen, Jian Li, Mingda Qiao | In this paper, we make significant progress towards a complete resolution of the gap-entropy conjecture. |

23 | Thresholding Based Outlier Robust PCA | Yeshwanth Cherapanamjeri, Prateek Jain, Praneeth Netrapalli | In this work, we provide a novel thresholding based iterative algorithm with per-iteration complexity at most linear in the data size. |

24 | Tight Bounds for Bandit Combinatorial Optimization | Alon Cohen, Tamir Hazan, Tomer Koren | We revisit the study of optimal regret rates in bandit combinatorial optimization—a fundamental framework for sequential decision making under uncertainty that abstracts numerous combinatorial prediction problems. |

25 | Online Learning Without Prior Information | Ashok Cutkosky, Kwabena Boahen | We describe a frontier of new lower bounds on the performance of such algorithms, reflecting a tradeoff between a term that depends on the optimal parameter value and a term that depends on the gradients’ rate of growth. |

26 | Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent | Arnak Dalalyan | In this paper, we revisit the recently established theoretical guarantees for the convergence of the Langevin Monte Carlo algorithm of sampling from a smooth and (strongly) log-concave density. |

27 | Depth Separation for Neural Networks | Amit Daniely | We give a simple proof that shows that poly-size depth two neural networks with (exponentially) bounded weights cannot approximate $f$ whenever $g$ cannot be approximated by a low degree polynomial. |

28 | Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing | Constantinos Daskalakis, Qinxuan Pan | We show that the square Hellinger distance between two Bayesian networks on the same directed graph, $G$, is subadditive with respect to the neighborhoods of $G$. |

29 | Ten Steps of EM Suffice for Mixtures of Two Gaussians | Constantinos Daskalakis, Christos Tzamos, Manolis Zampetakis | We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. |

30 | Learning Multivariate Log-concave Distributions | Ilias Diakonikolas, Daniel M. Kane, Alistair Stewart | We study the problem of estimating multivariate log-concave probability density functions. |

31 | Generalization for Adaptively-chosen Estimators via Stable Median | Vitaly Feldman, Thomas Steinke | We present an algorithm that estimates the expectations of $k$ arbitrary adaptively-chosen real-valued estimators using a number of samples that scales as $\sqrt{k}$. |

32 | Greed Is Good: Near-Optimal Submodular Maximization via Greedy Optimization | Moran Feldman, Christopher Harshaw, Amin Karbasi | In this paper, we show—arguably, surprisingly—that invoking the classical greedy algorithm $O(\sqrt{k})$-times leads to the (currently) fastest deterministic algorithm, called RepeatedGreedy, for maximizing a general submodular function subject to $k$-independent system constraints. |

33 | A General Characterization of the Statistical Query Complexity | Vitaly Feldman | We give applications of our techniques to two open problems in learning theory and to algorithms that are subject to memory and communication constraints. |

34 | Stochastic Composite Least-Squares Regression with Convergence Rate $O(1/n)$ | Nicolas Flammarion, Francis Bach | We study the stochastic dual averaging algorithm with a constant step-size, showing that it leads to a convergence rate of O(1/n) without strong convexity assumptions. |

35 | ZigZag: A New Approach to Adaptive Online Learning | Dylan J. Foster, Alexander Rakhlin, Karthik Sridharan | To obtain such adaptive methods, we introduce novel machinery, and the resulting algorithms are not based on the standard tools of online convex optimization. |

36 | Memoryless Sequences for Differentiable Losses | Rafael Frongillo, Andrew Nobel | In this paper, we ask how changing the loss function used changes the set of memoryless sequences, and in particular, the stochastic attributes they possess. |

37 | Matrix Completion from $O(n)$ Samples in Linear Time | David Gamarnik, Quan Li, Hongyi Zhang | In this paper, we propose a new matrix completion algorithm using a novel sampling scheme based on a union of independent sparse random regular bipartite graphs. |

38 | High Dimensional Regression with Binary Coefficients: Estimating Squared Error and a Phase Transition | David Gamarnik, Ilias Zadik | We consider a sparse linear regression model $Y=Xβ^*+W$ where $X$ is an $n\times p$ matrix with i.i.d. Gaussian entries, $W$ is an $n\times 1$ noise vector with i.i.d. mean-zero Gaussian entries with standard deviation $σ$, and $β^*$ is a $p\times 1$ binary vector with support size (sparsity) $k$. |

39 | Two-Sample Tests for Large Random Graphs Using Network Statistics | Debarghya Ghoshdastidar, Maurilio Gutzeit, Alexandra Carpentier, Ulrike von Luxburg | In this paper, we present a general principle for two-sample hypothesis testing in such scenarios without making any assumption about the network generation process. |

40 | Effective Semisupervised Learning on Manifolds | Amir Globerson, Roi Livni, Shai Shalev-Shwartz | The algorithm we analyse is similar to subspace clustering, and thus our results demonstrate that this method can be used to improve sample complexity. |

41 | Reliably Learning the ReLU in Polynomial Time | Surbhi Goel, Varun Kanade, Adam Klivans, Justin Thaler | We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \mathsf{max}(0, \mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \in \mathbb{S}^{n-1}$. |

42 | Fast Rates for Empirical Risk Minimization of Strict Saddle Problems | Alon Gonen, Shai Shalev-Shwartz | We derive bounds on the sample complexity of empirical risk minimization (ERM) in the context of minimizing non-convex risks that admit the strict saddle property. |

43 | Nearly-tight VC-dimension bounds for piecewise linear neural networks | Nick Harvey, Christopher Liaw, Abbas Mehrabian | We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. |

44 | Submodular Optimization under Noise | Avinatan Hassidim, Yaron Singer | In many applications, however, we do not have access to the submodular function we aim to optimize, but rather to some erroneous or noisy version of it. |

45 | Surprising properties of dropout in deep networks | David P. Helmbold, Philip M. Long | We analyze dropout in deep networks with rectified linear units and the quadratic loss. |

46 | Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes | Lunjia Hu, Ruihan Wu, Tianhong Li, Liwei Wang | In this work we study the quantitative relation between the recursive teaching dimension (RTD) and the VC dimension (VCD) of concept classes of finite sizes. |

47 | A Unified Analysis of Stochastic Optimization Methods Using Jump System Theory and Quadratic Constraints | Bin Hu, Peter Seiler, Anders Rantzer | We make use of the symmetry in the stochastic optimization methods and reduce these LMIs to some equivalent small LMIs whose sizes are at most 3 by 3. |

48 | The Hidden Hubs Problem | Ravindran Kannan, Santosh Vempala | We introduce the following *hidden hubs model* $H(n,k,\sigma_0, \sigma_1)$: the input is an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs); entries in rows outside $S$ are generated from the Gaussian distribution $p_0 = N(0,\sigma_0^2)$, while for each row in $S$, an unknown subset of $k$ of its entries are generated from $p_1 = N(0,\sigma_1^2)$, $\sigma_1>\sigma_0$, and the rest of the entries from $p_0$. |

49 | Predicting with Distributions | Michael Kearns, Zhiwei Steven Wu | We consider a new learning model in which a joint distribution over vector pairs $(x,y)$ is determined by an unknown function $c(x)$ that maps input vectors $x$ not to individual outputs, but to entire *distributions* over output vectors $y$. |

50 | Bandits with Movement Costs and Adaptive Pricing | Tomer Koren, Roi Livni, Yishay Mansour | We extend the model of Multi-Armed Bandit with unit switching cost to incorporate a metric between the actions. |

51 | Sparse Stochastic Bandits | Joon Kwon, Vianney Perchet, Claire Vernade | We here consider the *sparse* case of this classical problem in the sense that only a small number of arms, namely $s |

52 | On the Ability of Neural Nets to Express Distributions | Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, Sanjeev Arora | These models are trained using ideas like variational autoencoders and Generative Adversarial Networks. |

53 | Fundamental limits of symmetric low-rank matrix estimation | Marc Lelarge, Léo Miolane | We consider the high-dimensional inference problem where the signal is a low-rank symmetric matrix which is corrupted by an additive Gaussian noise. |

54 | Robust and Proper Learning for Mixtures of Gaussians via Systems of Polynomial Inequalities | Jerry Li, Ludwig Schmidt | In this paper, we significantly improve this dependence by replacing the $1/ε$ term with $\log 1/ε$, while only increasing the exponent moderately. |

55 | Adaptivity to Noise Parameters in Nonparametric Active Learning | Andrea Locatelli, Alexandra Carpentier, Samory Kpotufe | Our contributions are both statistical and algorithmic: we establish new minimax rates for active learning under common noise conditions. |

56 | Noisy Population Recovery from Unknown Noise | Shachar Lovett, Jiapeng Zhang | In this work, we remove this assumption, and show how to recover the underlying parameters, even when the noise is unknown, in quasi-polynomial time. |

57 | Inapproximability of VC Dimension and Littlestone's Dimension | Pasin Manurangsi, Aviad Rubinstein | We study the complexity of computing the VC Dimension and Littlestone's Dimension. |

58 | A Second-order Look at Stability and Generalization | Andreas Maurer | A Second-order Look at Stability and Generalization |

59 | Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality | Song Mei, Theodor Misiakiewicz, Andrea Montanari, Roberto Imbuzeiro Oliveira | In this paper we study the rank-constrained version of SDPs arising in MaxCut and in $\mathbb Z_2$ and $\rm SO(d)$ synchronization problems. |

60 | Mixing Implies Lower Bounds for Space Bounded Learning | Dana Moshkovitz, Michal Moshkovitz | In this paper we give such a condition. |

61 | Fast rates for online learning in Linearly Solvable Markov Decision Processes | Gergely Neu, Vicenç Gómez | In the current paper, we consider an online setting where the state costs may change arbitrarily between consecutive rounds, and the learner only observes the costs at the end of each respective round. |

62 | Sample complexity of population recovery | Yury Polyanskiy, Ananda Theertha Suresh, Yihong Wu | We consider one of the two polling impediments: in lossy population recovery, a pollee may skip each question with probability $ε$; in noisy population recovery, a pollee may lie on each question with probability $ε$. |

63 | Exact tensor completion with sum-of-squares | Aaron Potechin, David Steurer | We obtain the first polynomial-time algorithm for exact tensor completion that improves over the bound implied by reduction to matrix completion. |

64 | Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis | Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky | The present work provides a nonasymptotic analysis in the context of non-convex learning problems, giving finite-time guarantees for SGLD to find approximate minimizers of both empirical and population risks. |

65 | On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities | Alexander Rakhlin, Karthik Sridharan | We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed \emphregret bounds), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers), and (iii) in-expectation bounds for the supremum. |

66 | Lower Bounds on Regret for Noisy Gaussian Process Bandit Optimization | Jonathan Scarlett, Ilija Bogunovic, Volkan Cevher | In this paper, we consider the problem of sequentially optimizing a black-box function $f$ based on noisy samples and bandit feedback. |

67 | An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits | Yevgeny Seldin, Gábor Lugosi | We present a new strategy for gap estimation in randomized algorithms for multiarmed bandits and combine it with the EXP3++ algorithm of Seldin and Slivkins (2014). |

68 | Fast and robust tensor decomposition with applications to dictionary learning | Tselil Schramm, David Steurer | In this work, we introduce general techniques to capture the guarantees of SOS for worst-case problems. |

69 | The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime | Max Simchowitz, Kevin Jamieson, Benjamin Recht | We propose a novel technique for analyzing adaptive sampling called the Simulator. |

70 | On Learning vs. Refutation | Salil Vadhan | Building on the work of Daniely et al. (STOC 2014, COLT 2016), we study the connection between computationally efficient PAC learning and refutation of constraint satisfaction problems. |

71 | Ignoring Is a Bliss: Learning with Large Noise Through Reweighting-Minimization | Daniel Vainsencher, Shie Mannor, Huan Xu | We propose an approach that iterates between finding a solution with minimal empirical loss and re-weighting the data, reinforcing data points where the previous solution works well. |

72 | Memory and Communication Efficient Distributed Stochastic Optimization with Minibatch Prox | Jialei Wang, Weiran Wang, Nathan Srebro | We present and analyze statistically optimal, communication and memory efficient distributed stochastic optimization algorithms with near-linear speedups (up to $\log$-factors). |

73 | Learning Non-Discriminatory Predictors | Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, Nathan Srebro | We study the problem of learning such a non-discriminatory predictor from a finite training set, both statistically and computationally. |

74 | Empirical Risk Minimization for Stochastic Convex Optimization: $O(1/n)$- and $O(1/n^2)$-type of Risk Bounds | Lijun Zhang, Tianbao Yang, Rong Jin | In this work, we strengthen the realm of ERM for SCO by exploiting smoothness and strong convexity conditions to improve the risk bounds. |

75 | A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics | Yuchen Zhang, Percy Liang, Moses Charikar | We study the Stochastic Gradient Langevin Dynamics (SGLD) algorithm for non-convex optimization. |

76 | Optimal learning via local entropies and sample compression | Nikita Zhivotovskiy | In particular, we provide a new tight PAC bound for the hard-margin SVM, an extended analysis of certain empirical risk minimizers under log-concave distributions, a new variant of an online to batch conversion, and distribution dependent localized bounds in the aggregation framework. |