# Paper Digest: AISTATS 2017 Highlights

Readers can also read this highlight article on our console, which allows users to filter papers by keyword and find related papers.

The International Conference on Artificial Intelligence and Statistics (AISTATS) is an interdisciplinary gathering of researchers at the intersection of computer science, artificial intelligence, machine learning, statistics, and related areas.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights/summaries to quickly get the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to receive updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to receive new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: AISTATS 2017 Papers

No. | Title | Authors | Highlight |
---|---|---|---|

1 | Minimax Gaussian Classification & Clustering | Tianyang Li, Xinyang Yi, Constantine Caramanis, Pradeep Ravikumar | We present minimax bounds for classification and clustering error in the setting where covariates are drawn from a mixture of two isotropic Gaussian distributions. |

2 | Conditions beyond treewidth for tightness of higher-order LP relaxations | Mark Rowland, Aldo Pacchiano, Adrian Weller | We consider binary pairwise models and introduce new methods which allow us to demonstrate refined conditions for tightness of LP relaxations in the Sherali-Adams hierarchy. |

3 | Large-Scale Data-Dependent Kernel Approximation | Catalin Ionescu, Alin Popa, Cristian Sminchisescu | Here we derive an approximate large-scale learning procedure for data-dependent kernels that is efficient and performs well in practice. |

4 | Clustering from Multiple Uncertain Experts | Yale Chang, Junxiang Chen, Michael Cho, Peter Castaldi, Ed Silverman, Jennifer Dy | To model the uncertainty in constraints from different experts, we build a probabilistic model for pairwise constraints through jointly modeling each expert’s accuracy and the mapping from features to latent cluster assignments. |

5 | Online Nonnegative Matrix Factorization with General Divergences | Renbo Zhao, Vincent Tan, Huan Xu | We develop a unified and systematic framework for performing online nonnegative matrix factorization under a wide variety of important divergences. |

6 | ASAGA: Asynchronous Parallel SAGA | Rémi Leblond, Fabian Pedregosa, Simon Lacoste-Julien | We describe ASAGA, an asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. |
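
As background for this row, here is a minimal serial sketch of the SAGA update that ASAGA runs asynchronously, applied to toy least-squares terms. This is the textbook base algorithm only; the data, step size, and function name are illustrative, not taken from the paper.

```python
import numpy as np

def saga_least_squares(A, b, gamma=0.05, epochs=200, seed=0):
    """Serial SAGA on the terms 0.5*(a_i x - b_i)^2: keep a table of the
    last gradient seen for each sample and use it to correct the
    stochastic step (ASAGA parallelises this loop asynchronously)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    table = np.zeros((n, d))          # last stored gradient per sample
    avg = table.mean(axis=0)
    for _ in range(epochs * n):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]  # current gradient of term i
        x -= gamma * (g - table[i] + avg)
        avg += (g - table[i]) / n     # maintain the running mean
        table[i] = g
    return x
```

On a small consistent system this converges to the least-squares solution.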

7 | Lower Bounds on Active Learning for Graphical Model Selection | Jonathan Scarlett, Volkan Cevher | We consider the problem of estimating the underlying graph associated with a Markov random field, with the added twist that the decoding algorithm can iteratively choose which subsets of nodes to sample based on the previous samples, resulting in an active learning setting. |

8 | Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach | Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, Sujay Sanghavi | In this paper, we complement recent findings on the non-convex geometry of the analogous PSD setting [5], and show that matrix factorization does not introduce any spurious local minima, under RIP. |

9 | Sparse Accelerated Exponential Weights | Pierre Gaillard, Olivier Wintenberger | We introduce SAEW, a new procedure that accelerates exponential weights procedures with the slow rate $1/\sqrt{T}$ to procedures achieving the fast rate $1/T$. |
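
For context, the classical exponential-weights update that SAEW builds on can be sketched in a few lines. This is the standard textbook rule, not the accelerated procedure itself; the function name and learning rate are illustrative.

```python
import math

def exp_weights_step(weights, losses, eta):
    """One round of classical exponential weights: multiply each expert's
    weight by exp(-eta * loss) and renormalise to a distribution."""
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(new)
    return [w / z for w in new]
```

Played over $T$ rounds with a suitable $\eta$, this yields the slow $1/\sqrt{T}$ regret rate the highlight mentions.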

10 | On the Learnability of Fully-Connected Neural Networks | Yuchen Zhang, Jason Lee, Martin Wainwright, Michael I. Jordan | In this paper, we characterize the learnability of fully-connected neural networks via both positive and negative results. |

11 | An Information-Theoretic Route from Generalization in Expectation to Generalization in Probability | Ibrahim Alabdulmohsin | In this paper, we answer this question by proving that, while a generalization in expectation does not imply a generalization in probability, a uniform generalization in expectation does imply concentration. |

12 | Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection | Lijie Chen, Jian Li, Mingda Qiao | In this paper, we make progress towards a complete characterization of the instance-wise sample complexity bounds for the Best-k-Arm problem. |

13 | Guaranteed Non-convex Optimization: Submodular Maximization over Continuous Domains | Andrew An Bian, Baharan Mirzasoleiman, Joachim Buhmann, Andreas Krause | Specifically, i) we introduce the weak DR property that gives a unified characterization of submodularity for all set, integer-lattice and continuous functions; ii) for maximizing monotone DR-submodular continuous functions under general down-closed convex constraints, we propose a Frank-Wolfe variant with (1-1/e) approximation guarantee, and sub-linear convergence rate; iii) for maximizing general non-monotone submodular continuous functions subject to box constraints, we propose a DoubleGreedy algorithm with 1/3 approximation guarantee. |

14 | Tensor-Dictionary Learning with Deep Kruskal-Factor Analysis | Andrew Stevens, Yunchen Pu, Yannan Sun, Gregory Spell, Lawrence Carin | A multi-way factor analysis model is introduced for tensor-variate data of any order. |

15 | Consistent and Efficient Nonparametric Different-Feature Selection | Satoshi Hara, Takayuki Katsuki, Hiroki Yanagisawa, Takafumi Ono, Ryo Okamoto, Shigeki Takeuchi | We propose a feature selection method to find features that describe a difference in two probability distributions. |

16 | Annular Augmentation Sampling | Francois Fagan, Jalaj Bhandari, John Cunningham | In this work, we introduce an auxiliary variable MCMC scheme that samples from an annular augmented space, translating to a great circle path around the hypercube of the binary sample space. |

17 | Less than a Single Pass: Stochastically Controlled Stochastic Gradient | Lihua Lei, Michael Jordan | We develop and analyze a procedure for gradient-based optimization that we refer to as stochastically controlled stochastic gradient (SCSG). |

18 | Learning Time Series Detection Models from Temporally Imprecise Labels | Roy Adams, Ben Marlin | In this paper, we consider a new low-quality label learning problem: learning time series detection models from temporally imprecise labels. |

19 | Learning Cost-Effective and Interpretable Treatment Regimes | Himabindu Lakkaraju, Cynthia Rudin | In this work, we aim to automate this task of learning cost-effective, interpretable and actionable treatment regimes. |

20 | Linear Thompson Sampling Revisited | Marc Abeille, Alessandro Lazaric | We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. |

21 | A Sub-Quadratic Exact Medoid Algorithm | James Newling, Francois Fleuret | We present a new algorithm, ‘trimed’, for obtaining the medoid of a set, that is, the element of the set which minimises the mean distance to all other elements. |
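
The definitional $O(n^2)$ computation that ‘trimed’ improves on is easy to state in code. This is only the brute-force baseline implied by the definition, not the paper's sub-quadratic algorithm; the names are illustrative.

```python
def medoid(points, dist):
    """Return the element minimising the sum (equivalently, the mean) of
    distances to all other elements -- the O(n^2) brute-force medoid."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))
```

For example, with absolute difference as the distance, the medoid of `[0, 1, 2, 3, 10]` is `2`.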

22 | Minimax Density Estimation for Growing Dimension | Daniel McDonald | This paper presents minimax rates for density estimation when the data dimension $d$ is allowed to grow with the number of observations $n$ rather than remaining fixed as in previous analyses. |

23 | Estimating Density Ridges by Direct Estimation of Density-Derivative-Ratios | Hiroaki Sasaki, Takafumi Kanamori, Masashi Sugiyama | To overcome these problems, we propose a novel method that directly estimates the ratios without going through density estimation and division. |

24 | Learning Theory for Conditional Risk Minimization | Alexander Zimin, Christoph Lampert | In this work we study the learnability of stochastic processes with respect to the conditional risk, i.e. the existence of a learning algorithm that improves its next-step performance with the amount of observed data. |

25 | Near-optimal Bayesian Active Learning with Correlated and Noisy Tests | Yuxin Chen, Hamed Hassani, Andreas Krause | We propose ECED, a novel, efficient active learning algorithm, and prove strong theoretical guarantees that hold with correlated, noisy tests. |

26 | Learning Nash Equilibrium for General-Sum Markov Games from Batch Data | Julien Perolat, Florian Strub, Bilal Piot, Olivier Pietquin | In this paper, we introduce a new definition of $\epsilon$-Nash equilibrium in MGs which grasps the strategy’s quality for multiplayer games. |

27 | Distance Covariance Analysis | Benjamin Cowley, Joao Semedo, Amin Zandvakili, Matthew Smith, Adam Kohn, Byron Yu | We propose a dimensionality reduction method to identify linear projections that capture interactions between two or more sets of variables. |

28 | Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation | Sohail Bahmani, Justin Romberg | We propose a flexible convex relaxation for the phase retrieval problem that operates in the natural domain of the signal. |

29 | Regret Bounds for Lifelong Learning | Pierre Alquier, The Tien Mai, Massimiliano Pontil | We propose a lifelong learning strategy which refines the underlying data representation used by the within-task algorithm, thereby transferring information from one task to the next. |

30 | Poisson intensity estimation with reproducing kernels | Seth Flaxman, Yee Whye Teh, Dino Sejdinovic | In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. |

31 | Generalized Pseudolikelihood Methods for Inverse Covariance Estimation | Alnur Ali, Kshitij Khare, Sang-Yun Oh, Bala Rajaratnam | We present a fast algorithm as well as screening rules that make computing the PseudoNet estimate over a range of tuning parameters tractable. |

32 | Removing Phase Transitions from Gibbs Measures | Ian Fellows, Mark Handcock | We introduce a modification to the Gibbs distribution that reduces the effects of phase transitions, and with properly chosen hyper-parameters, provably removes all multiphase behavior. |

33 | Performance Bounds for Graphical Record Linkage | Rebecca C. Steorts, Matthew Barnes, Willie Neiswanger | We provide an upper bound using the KL divergence and a lower bound on the minimum probability of misclassifying a latent entity. |

34 | Regret Bounds for Transfer Learning in Bayesian Optimisation | Alistair Shilton, Sunil Gupta, Santu Rana, Svetha Venkatesh | The second algorithm proposes a new way to model the difference between the source and target as a Gaussian process which is then used to adapt the source data. |

35 | Scaling Submodular Maximization via Pruned Submodularity Graphs | Tianyi Zhou, Hua Ouyang, Jeff Bilmes, Yi Chang, Carlos Guestrin | We propose a new random pruning method (called “submodular sparsification (SS)”) to reduce the cost of submodular maximization. |

36 | Localized Lasso for High-Dimensional Regression | Makoto Yamada, Takeuchi Koh, Tomoharu Iwata, John Shawe-Taylor, Samuel Kaski | We introduce the localized Lasso, which learns models that both are interpretable and have a high predictive power in problems with high dimensionality d and small sample size n. |

37 | Encrypted Accelerated Least Squares Regression | Pedro Esperanca, Louis Aslett, Chris Holmes | In this paper we present detailed analysis of coordinate and accelerated gradient descent algorithms which are capable of fitting least squares and penalised ridge regression models, using data encrypted under a fully homomorphic encryption scheme. |

38 | Random Consensus Robust PCA | Daniel Pimentel-Alarcon, Robert Nowak | This paper presents R2PCA, a random consensus method for robust principal component analysis. |

39 | Gray-box Inference for Structured Gaussian Process Models | Pietro Galliani, Amir Dezfouli, Edwin Bonilla, Novi Quadrianto | We develop an automated variational inference method for Bayesian structured prediction problems with Gaussian process (GP) priors and linear-chain likelihoods. |

40 | Frank-Wolfe Algorithms for Saddle Point Problems | Gauthier Gidel, Tony Jebara, Simon Lacoste-Julien | We extend the Frank-Wolfe (FW) optimization algorithm to solve constrained smooth convex-concave saddle point (SP) problems. |
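
For reference, the classical primal Frank-Wolfe iteration that this work extends to saddle-point problems looks as follows on the probability simplex. This is a generic sketch: the step-size rule is the standard $2/(t+2)$, and all names are illustrative.

```python
import numpy as np

def frank_wolfe_simplex(grad, dim, steps=500):
    """Classical Frank-Wolfe over the probability simplex: each step
    calls a linear minimisation oracle, which on the simplex simply
    returns the vertex with the smallest gradient coordinate."""
    x = np.full(dim, 1.0 / dim)
    for t in range(steps):
        s = np.zeros(dim)
        s[np.argmin(grad(x))] = 1.0   # vertex minimising <grad(x), s>
        gamma = 2.0 / (t + 2.0)       # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * s
    return x
```

Because each iterate is a convex combination of simplex vertices, the method is projection-free, which is the appeal the saddle-point extension preserves.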

41 | A Framework for Optimal Matching for Causal Inference | Nathan Kallus | We propose a novel framework for matching estimators for causal effect from observational data that is based on minimizing the dual norm of estimation error when expressed as an operator. |

42 | Quantifying the accuracy of approximate diffusions and Markov chains | Jonathan Huggins, James Zou | With the growth of large-scale datasets, the computational cost associated with simulating these stochastic processes can be considerable, and many algorithms have been proposed to approximate the underlying Markov chain or diffusion. |

43 | Stochastic Rank-1 Bandits | Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, Zheng Wen | We propose a computationally-efficient algorithm for solving our problem, which we call Rank1Elim. |

44 | On the Troll-Trust Model for Edge Sign Prediction in Social Networks | Géraud Le Falher, Nicolo Cesa-Bianchi, Claudio Gentile, Fabio Vitale | We show that these heuristics can be understood, and rigorously analyzed, as approximators to the Bayes optimal classifier for a simple probabilistic model of the edge labels. |

45 | Online Optimization of Smoothed Piecewise Constant Functions | Vincent Cohen-Addad, Varun Kanade | We give algorithms that achieve sublinear regret in the full information and bandit settings. |

46 | Combinatorial Topic Models using Small-Variance Asymptotics | Ke Jiang, Suvrit Sra, Brian Kulis | In contrast, we approach topic modeling via combinatorial optimization, and take a small-variance limit of LDA to derive a new objective function. |

47 | ConvNets with Smooth Adaptive Activation Functions for Regression | Le Hou, Dimitris Samaras, Tahsin Kurc, Yi Gao, Joel Saltz | In this paper, we propose and apply AAFs on CNNs for regression tasks. We empirically evaluated CNNs with SAAFs and achieved state-of-the-art results on age and pose estimation datasets. |

48 | Rapid Mixing Swendsen-Wang Sampler for Stochastic Partitioned Attractive Models | Sejun Park, Yunhun Jang, Andreas Galanis, Jinwoo Shin, Daniel Stefankovic, Eric Vigoda | In this paper, we study the Swendsen-Wang dynamics, a more sophisticated Markov chain designed to overcome bottlenecks that impede the Gibbs sampler. |

49 | Efficient Rank Aggregation via Lehmer Codes | Pan Li, Arya Mazumdar, Olgica Milenkovic | We propose a novel rank aggregation method based on converting permutations into their corresponding Lehmer codes or other subdiagonal images. |
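
The Lehmer-code conversion this highlight refers to is a standard bijection from permutations to subdiagonal sequences; here is a minimal sketch of that conversion alone (the rank-aggregation pipeline built on top of it is the paper's contribution and is not shown).

```python
def lehmer_code(perm):
    """Lehmer code of a 0-indexed permutation: code[i] counts the
    elements to the right of position i that are smaller than perm[i]."""
    n = len(perm)
    return [sum(perm[j] < perm[i] for j in range(i + 1, n))
            for i in range(n)]
```

For example, `lehmer_code([2, 0, 3, 1])` gives `[2, 0, 1, 0]`, and the identity permutation maps to all zeros; the key property exploited for aggregation is that the coordinates can be processed independently.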

50 | Nonlinear ICA of Temporally Dependent Stationary Sources | Aapo Hyvarinen, Hiroshi Morioka | We introduce a nonlinear generative model where the independent sources are assumed to be temporally dependent, non-Gaussian, and stationary, and we observe arbitrarily nonlinear mixtures of them. |

51 | Stochastic Difference of Convex Algorithm and its Application to Training Deep Boltzmann Machines | Atsushi Nitanda, Taiji Suzuki | In this paper, we propose a stochastic variant of DC algorithm and give computational complexities to converge to a stationary point under several situations. |

52 | Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot | Prateek Jain, Chi Jin, Sham Kakade, Praneeth Netrapalli | A key contribution of our work is the general proof technique which we believe should further excite research in understanding deterministic and stochastic variants of simple non-convex gradient descent algorithms with good global convergence rates for other problems in machine learning and numerical linear algebra. |

53 | Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms | Christian Naesseth, Francisco Ruiz, Scott Linderman, David Blei | We propose a new method that lets us leverage reparameterization gradients even when variables are outputs of an acceptance-rejection sampling algorithm. |

54 | Asymptotically exact inference in differentiable generative models | Matthew Graham, Amos Storkey | We present a method for performing efficient MCMC inference in such models when conditioning on observations of the model output. |

55 | Decentralized Collaborative Learning of Personalized Models over Networks | Paul Vanhaesebrouck, Aurélien Bellet, Marc Tommasi | The question addressed in this paper is: how can agents improve upon their locally trained model by communicating with other agents that have similar objectives? |

56 | Contextual Bandits with Latent Confounders: An NMF Approach | Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, Sanjay Shakkottai | This insight enables us to propose an $\epsilon$-greedy NMF-Bandit algorithm that designs a sequence of interventions (selecting specific arms), that achieves a balance between learning this low-dimensional structure and selecting the best arm to minimize regret. |

57 | Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets | Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, Frank Hutter | To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows exploration of preliminary configurations on small subsets, by extrapolating to the full dataset. |

58 | Least-Squares Log-Density Gradient Clustering for Riemannian Manifolds | Mina Ashizawa, Hiroaki Sasaki, Tomoya Sakai, Masashi Sugiyama | In this paper, we combine these ideas and propose a novel mode-seeking algorithm for Riemannian manifolds with direct density-gradient estimation. |

59 | Fast column generation for atomic norm regularization | Marina Vinyes, Guillaume Obozinski | We consider optimization problems that consist in minimizing a quadratic function under an atomic norm regularization or constraint. |

60 | Bayesian Hybrid Matrix Factorisation for Data Integration | Thomas Brouwer, Pietro Lio | We introduce a novel Bayesian hybrid matrix factorisation model (HMF) for data integration, based on combining multiple matrix factorisation methods, that can be used for in- and out-of-matrix prediction of missing values. |

61 | Co-Occurring Directions Sketching for Approximate Matrix Multiply | Youssef Mroueh, Etienne Marcheret, Vaibhava Goel | We introduce co-occurring directions sketching, a deterministic algorithm for approximate matrix product (AMM), in the streaming model. |

62 | Exploration-Exploitation in MDPs with Options | Ronan Fruit, Alessandro Lazaric | In this paper, we derive an upper and lower bound on the regret of a variant of UCRL using options. |

63 | Local Perturb-and-MAP for Structured Prediction | Gedas Bertasius, Qiang Liu, Lorenzo Torresani, Jianbo Shi | In this work, we present a new Local Perturb-and-MAP (locPMAP) framework that replaces the global optimization with a local optimization by exploiting our observed connection between locPMAP and the pseudolikelihood of the original CRF model. |

64 | Gradient Boosting on Stochastic Data Streams | Hanzhang Hu, Wen Sun, Arun Venkatraman, Martial Hebert, Andrew Bagnell | In this work, we investigate the problem of adapting batch gradient boosting for minimizing convex loss functions to the online setting, where the loss at each iteration is i.i.d. sampled from an unknown distribution. |

65 | Online Learning and Blackwell Approachability with Partial Monitoring: Optimal Convergence Rates | Joon Kwon, Vianney Perchet | We construct, for the first time, approachability algorithms with convergence rate of order $O(T^{-1/2})$ when the signal is independent of the decision and of order $O(T^{-1/3})$ in the case of general signals. |

66 | Tensor Decompositions via Two-Mode Higher-Order SVD (HOSVD) | Miaoyan Wang, Yun Song | Here, we present a new method built on Kruskal’s uniqueness theorem to decompose symmetric, nearly orthogonally decomposable tensors. |

67 | Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers | Meelis Kull, Telmo Silva Filho, Peter Flach | In this paper we solve all these problems with a richer class of calibration maps based on the beta distribution. |

68 | Detecting Dependencies in Sparse, Multivariate Databases Using Probabilistic Programming and Non-parametric Bayes | Feras Saad, Vikash Mansinghka | This paper proposes an approach that combines probabilistic programming, information theory, and non-parametric Bayes. |

69 | High-dimensional Time Series Clustering via Cross-Predictability | Dezhi Hong, Quanquan Gu, Kamin Whitehouse | In this paper, we explore a new similarity metric called “cross-predictability”: the degree to which a future value in each time series is predicted by past values of the others. |

70 | Minimax Approach to Variable Fidelity Data Interpolation | Alexey Zaytsev, Evgeny Burnaev | In this paper we obtain minimax interpolation errors for single and variable fidelity scenarios for a multivariate Gaussian process regression. |

71 | Data Driven Resource Allocation for Distributed Learning | Travis Dick, Mu Li, Venkata Krishna Pillutla, Colin White, Nina Balcan, Alex Smola | We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving existing scalable heuristics perform well in natural non worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. |

72 | Learning Nonparametric Forest Graphical Models with Prior Information | Yuancheng Zhu, Zhe Liu, Siqi Sun | We present a framework for incorporating prior information into nonparametric estimation of graphical models. |

73 | Sparse Randomized Partition Trees for Nearest Neighbor Search | Kaushik Sinha, Omid Keivani | Inspired by the fast Johnson-Lindenstrauss transform, in this paper, we propose a sparse version of randomized partition tree where each internal node needs to store only a few non-zero entries, as opposed to all $d$ entries, leading to significant space savings without sacrificing much in terms of nearest neighbor search accuracy. |

74 | Horde of Bandits using Gaussian Markov Random Fields | Sharan Vaswani, Mark Schmidt, Laks Lakshmanan | Despite its effectiveness, the existing GOB model can only be applied to small problems due to its quadratic time-dependence on the number of nodes. |

75 | Random projection design for scalable implicit smoothing of randomly observed stochastic processes | Francois Belletti, Evan Sparks, Alexandre Bayen, Joseph Gonzalez | In this paper we present a novel estimator for cross-covariance of randomly observed time series which unravels the dynamics of an unobserved stochastic process. |

76 | Trading off Rewards and Errors in Multi-Armed Bandits | Akram Erraqabi, Alessandro Lazaric, Michal Valko, Emma Brunskill, Yun-En Liu | In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm whose performance is provably close to the best possible tradeoff strategy. |

77 | Adaptive ADMM with Spectral Penalty Parameter Selection | Zheng Xu, Mario Figueiredo, Tom Goldstein | We tackle this weakness of ADMM by proposing a method that adaptively tunes the penalty parameter to achieve fast convergence. |

78 | The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits | Tor Lattimore, Csaba Szepesvari | We analyse the asymptotic regret and show matching upper and lower bounds on what is achievable. |

79 | Dynamic Collaborative Filtering With Compound Poisson Factorization | Ghassen Jerfel, Mehmet Basbug, Barbara Engelhardt | Here, we propose a new conjugate and numerically stable dynamic matrix factorization (DCPF) based on hierarchical Poisson factorization that models the smoothly drifting latent factors using gamma-Markov chains. |

80 | Rank Aggregation and Prediction with Item Features | Kai-Yang Chiang, Cho-Jui Hsieh, Inderjit Dhillon | Observing that traditional rank aggregation methods disregard features, while models adapted from learning-to-rank task are sensitive to feature noise, we propose a general model to learn a total ranking by balancing between comparisons and feature information jointly. |

81 | Robust and Efficient Computation of Eigenvectors in a Generalized Spectral Method for Constrained Clustering | Chengming Jiang, Huiqing Xie, Zhaojun Bai | In this paper, we provide solutions to these two critical issues. |

82 | Information-theoretic limits of Bayesian network structure learning | Asish Ghoshal, Jean Honorio | In this paper, we study the information-theoretic limits of learning the structure of Bayesian networks (BNs), on discrete as well as continuous random variables, from a finite number of samples. |

83 | Markov Chain Truncation for Doubly-Intractable Inference | Colin Wei, Iain Murray | We demonstrate how to construct unbiased estimates for 1/Z given access to black-box importance sampling estimators for Z. |

84 | Regression Uncertainty on the Grassmannian | Yi Hong, Xiao Yang, Roland Kwitt, Martin Styner, Marc Niethammer | This paper develops an approach to compute confidence intervals for geodesic regression models. |

85 | Attributing Hacks | Ziqi Liu, Alex Smola, Kyle Soska, Yu-Xiang Wang, Qinghua Zheng | In this paper, we describe an algorithm for estimating the provenance of hacks on websites. |

86 | Unsupervised Sequential Sensor Acquisition | Manjesh Hanawal, Csaba Szepesvari, Venkatesh Saligrama | Our objective is to learn strategies for selecting tests to optimize accuracy and costs. |

87 | A Stochastic Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization | Songtao Lu, Mingyi Hong, Zhengdao Wang | In this paper, we consider a stochastic SymNMF problem in which the observation matrix is generated in a random and sequential manner. |

88 | Hierarchically-partitioned Gaussian Process Approximation | Byung-Jun Lee, Jongmin Lee, Kee-Eung Kim | In this paper, we introduce a hierarchical model based on local GP for large-scale datasets, which stacks inducing points over inducing points in layers. |

89 | Scalable Learning of Non-Decomposable Objectives | Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, Gal Elidan | In this work we present a unified framework that, using straightforward building block bounds, allows for highly scalable optimization of a wide range of ranking-based objectives. |

90 | CPSG-MCMC: Clustering-Based Preprocessing method for Stochastic Gradient MCMC | Tianfan Fu, Zhihua Zhang | In this paper, we propose an effective subsampling strategy to reduce the variance based on a failed attempt to do importance sampling. |

91 | Comparison-Based Nearest Neighbor Search | Siavash Haghiri, Debarghya Ghoshdastidar, Ulrike von Luxburg | We focus on a simple yet effective algorithm that recursively splits the space by first selecting two random pivot points and then assigning all other points to the closer of the two (comparison tree). |
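
The recursive comparison-tree construction described in this highlight can be sketched directly from the sentence above. This is a simplified illustration that assumes a distance function is available (the comparison-based setting restricts what the algorithm may query); all names are illustrative.

```python
import random

def build_comparison_tree(points, dist, leaf_size=2, rng=random):
    """Recursively split the set: pick two random pivots and send each
    point to the closer one, as the comparison-tree highlight describes."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    a, b = rng.sample(points, 2)
    left = [p for p in points if dist(p, a) <= dist(p, b)]
    right = [p for p in points if dist(p, a) > dist(p, b)]
    if not left or not right:          # degenerate split; stop here
        return {"leaf": points}
    return {"pivots": (a, b),
            "left": build_comparison_tree(left, dist, leaf_size, rng),
            "right": build_comparison_tree(right, dist, leaf_size, rng)}

def query(tree, q, dist):
    """Route q to a leaf via closer-of-two-pivots comparisons, then
    return the nearest point in that leaf (an approximate neighbour)."""
    while "leaf" not in tree:
        a, b = tree["pivots"]
        tree = tree["left"] if dist(q, a) <= dist(q, b) else tree["right"]
    return min(tree["leaf"], key=lambda p: dist(p, q))
```

Note the answer is approximate: a true nearest neighbour can land on the other side of a split, which is why such trees are usually combined with multiple random trees or backtracking.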

92 | A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe | Francesco Locatello, Rajiv Khanna, Michael Tschannen, Martin Jaggi | In this paper we take a unified view on both classes of methods, leading to the first explicit convergence rates of matching pursuit methods in an optimization sense, for general sets of atoms. |

93 | Faster Coordinate Descent via Adaptive Importance Sampling | Dmytro Perekrestenko, Volkan Cevher, Martin Jaggi | In this work, we introduce new adaptive rules for the random selection of their updates. |

94 | Conjugate-Computation Variational Inference : Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models | Mohammad Khan, Wu Lin | In this paper, we propose a new algorithm called Conjugate-computation Variational Inference (CVI) which brings the best of the two worlds together – it uses conjugate computations for the conjugate terms and employs stochastic gradients for the rest. |

95 | Hit-and-Run for Sampling and Planning in Non-Convex Spaces | Yasin Abbasi-Yadkori, Peter Bartlett, Victor Gabillon, Alan Malek | We propose the Hit-and-Run algorithm for planning and sampling problems in non-convex spaces. |

96 | DP-EM: Differentially Private Expectation Maximization | Mijung Park, James Foulds, Kamalika Chaudhuri, Max Welling | We propose a practical private EM algorithm that overcomes this challenge using two innovations: (1) a novel moment perturbation formulation for differentially private EM (DP-EM), and (2) the use of two recently developed composition methods to bound the privacy “cost” of multiple EM iterations: the moments accountant (MA) and zero-mean concentrated differential privacy (zCDP). |

97 | On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior | Juho Piironen, Aki Vehtari | The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but as shown in this paper, the results can be sensitive to the prior choice for the global shrinkage hyperparameter. |

98 | Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems | Scott Linderman, Matthew Johnson, Andrew Miller, Ryan Adams, David Blei, Liam Paninski | Building on switching linear dynamical systems (SLDS), we develop a model class and Bayesian inference algorithms that not only discover these dynamical units but also, by learning how transition probabilities depend on observations or continuous latent states, explain their switching behavior. |

99 | Efficient Algorithm for Sparse Tensor-variate Gaussian Graphical Models via Gradient Descent | Pan Xu, Tingting Zhang, Quanquan Gu | In order to estimate the precision matrices, we propose a sparsity constrained maximum likelihood estimator. |

100 | Minimax-optimal semi-supervised regression on unknown manifolds | Amit Moscovich, Ariel Jaffe, Boaz Nadler | We consider semi-supervised regression when the predictor variables are drawn from an unknown manifold. |

101 | Improved Strongly Adaptive Online Learning using Coin Betting | Kwang-Sung Jun, Francesco Orabona, Stephen Wright, Rebecca Willett | This paper describes a new parameter-free online learning algorithm for changing environments. |

102 | Black-box Importance Sampling | Qiang Liu, Jason Lee | We address this problem by studying black-box importance sampling methods that calculate importance weights for samples generated from any unknown proposal or black-box mechanism. |

103 | Fairness Constraints: Mechanisms for Fair Classification | Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, Krishna P. Gummadi | In this paper, we introduce a flexible mechanism to design fair classifiers by leveraging a novel intuitive measure of decision boundary (un)fairness. |

104 | Frequency Domain Predictive Modelling with Aggregated Data | Avradeep Bhowmik, Joydeep Ghosh, Oluwasanmi Koyejo | In this manuscript we investigate the problem of predictive linear modelling in the scenario where data is aggregated in a non-uniform manner across targets and features. |

105 | A Unified Computational and Statistical Framework for Nonconvex Low-rank Matrix Estimation | Lingxiao Wang, Xiao Zhang, Quanquan Gu | We propose a unified framework for estimating low-rank matrices through nonconvex optimization based on gradient descent algorithm. |

106 | A New Class of Private Chi-Square Hypothesis Tests | Ryan Rogers, Daniel Kifer | In this paper, we develop new test statistics for hypothesis testing over differentially private data. |

107 | A Learning Theory of Ranking Aggregation | Anna Korba, Stéphan Clémençon, Eric Sibony | This paper develops a statistical learning theory for ranking aggregation in a general probabilistic setting (avoiding any rigid ranking model assumptions), assessing the generalization ability of empirical ranking medians. |

108 | Anomaly Detection in Extreme Regions via Empirical MV-sets on the Sphere | Albert Thomas, Stéphan Clémençon, Alexandre Gramfort, Anne Sabourin | This paper presents an unsupervised algorithm for anomaly detection in extreme regions. |

109 | Structured adaptive and random spinners for fast machine learning computations | Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Francois Fagan, Cedric Gouy-Pailler, Anne Morvan, Nouri Sakr, Tamas Sarlos, Jamal Atif | The proposed framework comes with theoretical guarantees characterizing the capacity of the structured model in reference to its unstructured counterpart and is based on a general theoretical principle that we describe in the paper. |

110 | Complementary Sum Sampling for Likelihood Approximation in Large Scale Classification | Aleksandar Botev, Bowen Zheng, David Barber | We consider training probabilistic classifiers in the case that the number of classes is too large to perform exact normalisation over all classes. |

111 | Learning Optimal Interventions | Jonas Mueller, David Reshef, George Du, Tommi Jaakkola | Our goal is to identify beneficial interventions from observational data. |

112 | A Lower Bound on the Partition Function of Attractive Graphical Models in the Continuous Case | Nicholas Ruozzi | In this work, we use graph covers to extend several such results from the discrete case to the continuous case. |

113 | Scalable Variational Inference for Super Resolution Microscopy | Ruoxi Sun, Evan Archer, Liam Paninski | In this paper we develop new Bayesian image processing methods that extend the reach of super-resolution microscopy even further. |

114 | Linear Convergence of Stochastic Frank Wolfe Variants | Donald Goldfarb, Garud Iyengar, Chaoxu Zhou | In this paper, we show that the Away-step Stochastic Frank-Wolfe (ASFW) and Pairwise Stochastic Frank-Wolfe (PSFW) algorithms converge linearly in expectation. |

115 | Sequential Graph Matching with Sequential Monte Carlo | Seong-Hwan Jun, Samuel W.K. Wong, James Zidek, Alexandre Bouchard-Cote | We develop a novel probabilistic model for graph matchings and develop practical inference methods for supervised and unsupervised learning of the parameters of this model. |

116 | Fast rates with high probability in exp-concave statistical learning | Nishant Mehta | We present an algorithm for the statistical learning setting with a bounded exp-concave loss in d dimensions that obtains excess risk $O(d \log(1/\delta)/n)$ with probability $1 - \delta$. |

117 | Generalization Error of Invariant Classifiers | Jure Sokolic, Raja Giryes, Guillermo Sapiro, Miguel Rodrigues | This paper studies the generalization error of invariant classifiers. |

118 | Learning with Feature Feedback: from Theory to Practice | Stefanos Poulis, Sanjoy Dasgupta | In this paper, we examine a particular type of feature feedback that has been used, with some success, in information retrieval and in computer vision. |

119 | Optimistic Planning for the Stochastic Knapsack Problem | Ciara Pike-Burke, Steffen Grunewalder | We derive and study an optimistic planning algorithm specifically designed for the stochastic knapsack problem. |

120 | Identifying Groups of Strongly Correlated Variables through Smoothed Ordered Weighted $L_1$-norms | Raman Sankaran, Francis Bach, Chiranjib Bhattacharya | In this paper we take a submodular perspective and show that OWL can be posed as the Lovász extension of a suitably defined submodular function. |

121 | Tracking Objects with Higher Order Interactions via Delayed Column Generation | Shaofei Wang, Steffen Wolf, Charless Fowlkes, Julian Yarkony | We present a relaxation of this combinatorial problem that uses a column generation formulation where the pricing problem is solved via dynamic programming to efficiently explore the space of tracks. |

122 | Belief Propagation in Conditional RBMs for Structured Prediction | Wei Ping, Alex Ihler | In this work, we present a matrix-based implementation of belief propagation algorithms on CRBMs, which is easily scalable to tens of thousands of visible and hidden units. |

123 | Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data | Jialei Wang, Jason Lee, Mehrdad Mahdavi, Mladen Kolar, Nati Srebro | In this paper, we study sketching from an optimization point of view. |

124 | Finite-sum Composition Optimization via Variance Reduced Gradient Descent | Xiangru Lian, Mengdi Wang, Ji Liu | In this paper, we consider the finite-sum scenario for composition optimization: $\min_x f(x) := \frac{1}{n} \sum_{i=1}^{n} F_i \left( \frac{1}{m} \sum_{j=1}^{m} G_j(x) \right)$. |

125 | A Fast and Scalable Joint Estimator for Learning Multiple Related Sparse Gaussian Graphical Models | Beilun Wang, Ji Gao, Yanjun Qi | We propose a novel approach, FASJEM, for **fa**st and **s**calable **j**oint structure-**e**stimation of **m**ultiple sGGMs at a large scale. |

126 | Communication-efficient Distributed Sparse Linear Discriminant Analysis | Lu Tian, Quanquan Gu | We propose a communication-efficient distributed estimation method for sparse linear discriminant analysis (LDA) in the high dimensional regime. |

127 | Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage | Alp Yurtsever, Madeleine Udell, Joel Tropp, Volkan Cevher | It presents the first algorithm that uses optimal storage and provably computes a low-rank approximation of a solution. |

128 | Modal-set estimation with an application to clustering | Heinrich Jiang, Samory Kpotufe | We present a procedure that can estimate – with statistical consistency guarantees – any local-maxima of a density, under benign distributional conditions. |

129 | Compressed Least Squares Regression revisited | Martin Slawski | As a fix, we subsequently present a modified analysis with meaningful implications that much better reflects empirical results with simulated and real data. |

130 | Diverse Neural Network Learns True Target Functions | Bo Xie, Yingyu Liang, Le Song | In this paper, we answer these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units have no spurious local minima. |

131 | Local Group Invariant Representations via Orbit Embeddings | Anant Raj, Abhishek Kumar, Youssef Mroueh, Tom Fletcher, Bernhard Schoelkopf | We consider transformations that form a group and propose an approach based on kernel methods to derive local group invariant representations. |

132 | Relativistic Monte Carlo | Xiaoyu Lu, Valerio Perrone, Leonard Hasenclever, Yee Whye Teh, Sebastian Vollmer | In order to alleviate these problems we propose relativistic Hamiltonian Monte Carlo, a version of HMC based on relativistic dynamics that introduces a maximum velocity on particles. |

133 | Thompson Sampling for Linear-Quadratic Control Problems | Marc Abeille, Alessandro Lazaric | We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics is linear and the cost function is quadratic in states and controls. |

134 | Fast Classification with Binary Prototypes | Kai Zhong, Ruiqi Guo, Sanjiv Kumar, Bowei Yan, David Simcha, Inderjit Dhillon | In this work, we propose a new technique for *fast* k-nearest neighbor (k-NN) classification in which the original database is represented via a small set of learned binary prototypes. |

135 | Prediction Performance After Learning in Gaussian Process Regression | Johan Wagberg, Dave Zachariah, Thomas Schon, Petre Stoica | This paper considers the quantification of the prediction performance in Gaussian process regression. |

136 | Communication-Efficient Learning of Deep Networks from Decentralized Data | Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Aguera y Arcas | We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. |

137 | Learning Structured Weight Uncertainty in Bayesian Neural Networks | Shengyang Sun, Changyou Chen, Lawrence Carin | In this paper, we consider the matrix variate Gaussian (MVG) distribution to model structured correlations within the weights of a DNN. |

138 | Signal-based Bayesian Seismic Monitoring | David Moore, Stuart Russell | We formulate this task as Bayesian inference and propose a generative model of seismic events and signals across a network of spatially distributed stations. |

139 | Learning the Network Structure of Heterogeneous Data via Pairwise Exponential Markov Random Fields | Youngsuk Park, David Hallac, Stephen Boyd, Jure Leskovec | Here, we define the pairwise exponential Markov random field (PE-MRF), an approach capable of modeling exponential family distributions in heterogeneous domains. |

140 | Discovering and Exploiting Additive Structure for Bayesian Optimization | Jacob Gardner, Chuan Guo, Kilian Weinberger, Roman Garnett, Roger Grosse | We propose an efficient algorithm based on Metropolis-Hastings sampling and demonstrate its efficacy empirically on synthetic and real-world data sets. |

141 | Lipschitz Density-Ratios, Structured Data, and Data-driven Tuning | Samory Kpotufe | Lipschitz Density-Ratios, Structured Data, and Data-driven Tuning |

142 | Spatial Decompositions for Large Scale SVMs | Philipp Thomann, Ingrid Blaschzyk, Mona Meister, Ingo Steinwart | In this work we investigate a decomposition strategy that learns on small, spatially defined data chunks. |

143 | Inference Compilation and Universal Probabilistic Programming | Tuan Anh Le, Atilim Gunes Baydin, Frank Wood | We introduce a method for using deep neural networks to amortize the cost of inference in models from the family induced by universal probabilistic programming languages, establishing a framework that combines the strengths of probabilistic programming and deep learning methods. |

144 | Active Positive Semidefinite Matrix Completion: Algorithms, Theory and Applications | Aniruddha Bhargava, Ravi Ganti, Rob Nowak | In this paper we provide simple, computationally efficient, active algorithms for completion of symmetric positive semidefinite matrices. |

145 | Information Projection and Approximate Inference for Structured Sparse Variables | Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, Oluwasanmi Koyejo | This manuscript goes beyond classical sparsity by proposing efficient algorithms for approximate inference via information projection that are applicable to any structure on the set of variables that admits enumeration using matroid or knapsack constraints. |

146 | On the Interpretability of Conditional Probability Estimates in the Agnostic Setting | Yihan Gao, Aditya Parameswaran, Jian Peng | In this paper, we define a novel measure for the calibration property together with its empirical counterpart, and prove a uniform convergence result between them. |

147 | Linking Micro Event History to Macro Prediction in Point Process Models | Yichen Wang, Xiaojing Ye, Haomin Zhou, Hongyuan Zha, Le Song | In this paper, we propose a unifying framework with a jump stochastic differential equation model that systematically links the microscopic event data and macroscopic inference, and the theory to approximate its probability distribution. |

148 | Initialization and Coordinate Optimization for Multi-way Matching | Da Tang, Tony Jebara | We propose a coordinate update algorithm that directly optimizes the target objective. |

149 | Optimal Recovery of Tensor Slices | Vivek Farias, Andrew Li | We consider the problem of large scale matrix recovery given side information in the form of additional matrices of conforming dimension. |

150 | Efficient Online Multiclass Prediction on Graphs via Surrogate Losses | Alexander Rakhlin, Karthik Sridharan | We develop computationally efficient algorithms for online multi-class prediction. |

151 | Distribution of Gaussian Process Arc Lengths | Justin Bewsher, Alessandra Tosi, Michael Osborne, Stephen Roberts | We present the first treatment of the arc length of the GP with more than a single output dimension. |

152 | Distributed Adaptive Sampling for Kernel Matrix Approximation | Daniele Calandriello, Alessandro Lazaric, Michal Valko | In this paper, we introduce SQUEAK, a new algorithm for kernel approximation based on RLS sampling that *sequentially* processes the dataset, storing a dictionary which creates accurate kernel matrix approximations with a number of points that only depends on the effective dimension $d_{\mathrm{eff}}(\gamma)$ of the dataset. |

153 | Binary and Multi-Bit Coding for Stable Random Projections | Ping Li | In this paper, we develop an estimation procedure for the $\ell_\alpha$ norm of the signal, where $0 < \alpha \leq 2$, from binary or multi-bit measurements. |

154 | Spectral Methods for Correlated Topic Models | Forough Arabshahi, Anima Anandkumar | In this paper we propose guaranteed spectral methods for learning a broad range of topic models, which generalize the popular Latent Dirichlet Allocation (LDA). |

155 | Label Filters for Large Scale Multilabel Classification | Alexandru Niculescu-Mizil, Ehsan Abbasnejad | To alleviate this problem we propose a two step approach where computationally efficient label filters pre-select a small set of candidate labels before the base multiclass or multilabel classifier is applied. |

156 | Learning from Conditional Distributions via Dual Embeddings | Bo Dai, Niao He, Yunpeng Pan, Byron Boots, Le Song | To address these challenges, we propose a novel approach which employs a new min-max reformulation of the learning from conditional distribution problem. |

157 | Sequential Multiple Hypothesis Testing with Type I Error Control | Alan Malek, Sumeet Katariya, Yinlam Chow, Mohammad Ghavamzadeh | This work studies multiple hypothesis testing in the setting when we obtain data sequentially and may choose when to stop sampling. |

158 | A Maximum Matching Algorithm for Basis Selection in Spectral Learning | Ariadna Quattoni, Xavier Carreras, Matthias Gallé | We present a solution to scale spectral algorithms for learning sequence functions. |

159 | Value-Aware Loss Function for Model-based Reinforcement Learning | Amir-Massoud Farahmand, Andre Barreto, Daniel Nikovski | We introduce a loss function that takes the structure of the value function into account. |

160 | Convergence Rate of Stochastic k-means | Cheng Tang, Claire Monteleoni | We analyze online (Bottou & Bengio, 1994) and mini-batch (Sculley, 2010) k-means variants. |

161 | Automated Inference with Adaptive Batches | Soham De, Abhay Yadav, David Jacobs, Tom Goldstein | We propose alternative “big batch” SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. |

162 | Scalable Convex Multiple Sequence Alignment via Entropy-Regularized Dual Decomposition | Jiong Zhang, Ian En-Hsu Yen, Pradeep Ravikumar, Inderjit Dhillon | In this work, we propose an accelerated dual decomposition algorithm that exploits entropy regularization to induce closed-form solutions for each atomic-norm-constrained subproblem, giving a single-loop algorithm of iteration complexity linear to the problem size (total length of all sequences). |

163 | Robust Causal Estimation in the Large-Sample Limit without Strict Faithfulness | Ioan Gabriel Bucur, Tom Claassen, Tom Heskes | We introduce an alternative approach by replacing strict faithfulness with a prior that reflects the existence of many 'weak' (irrelevant) and 'strong' interactions. |

164 | Learning Graphical Games from Behavioral Data: Sufficient and Necessary Conditions | Asish Ghoshal, Jean Honorio | In this paper we obtain sufficient and necessary conditions on the number of samples required for exact recovery of the pure-strategy Nash equilibria (PSNE) set of a graphical game from noisy observations of joint actions. |

165 | Non-Count Symmetries in Boolean & Multi-Valued Prob. Graphical Models | Ankit Anand, Ritesh Noothigattu, Parag Singla, Mausam | In this paper, we present the first algorithms to compute non-count symmetries in both Boolean-valued and multi-valued domains. |

166 | Greedy Direction Method of Multiplier for MAP Inference of Large Output Domain | Xiangru Huang, Ian En-Hsu Yen, Ruohan Zhang, Qixing Huang, Pradeep Ravikumar, Inderjit Dhillon | In this paper, we introduce an effective MAP inference method for problems with large output domains. |

167 | Scalable Greedy Feature Selection via Weak Submodularity | Rajiv Khanna, Ethan Elenberg, Alex Dimakis, Sahand Negahban, Joydeep Ghosh | In this paper we show that divergent from previously held opinion, submodularity is not required to obtain approximation guarantees for these two algorithms. |