# Paper Digest: AISTATS 2016 Highlights

Readers can also read this highlight article on our console, which lets users filter papers by keyword and find related papers.

The International Conference on Artificial Intelligence and Statistics (AISTATS) is an interdisciplinary gathering of researchers at the intersection of computer science, artificial intelligence, machine learning, statistics, and related areas.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights/summaries to quickly get the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: AISTATS 2016 Papers

No. | Title | Authors | Highlight |
---|---|---|---|

1 | Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures | Mario Lucic, Olivier Bachem, Andreas Krause | We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. |

2 | Revealing Graph Bandits for Maximizing Local Influence | Alexandra Carpentier, Michal Valko | In this paper, we do not assume any knowledge of the graph, but we consider a setting where it can be gradually discovered in a sequential and active way. |

3 | Convex Block-sparse Linear Regression with Expanders – Provably | Anastasios Kyrillidis, Bubacarr Bah, Rouzbeh Hasheminezhad, Quoc Tran Dinh, Luca Baldassarre, Volkan Cevher | Our aim here is to theoretically characterize the performance of convex approaches under such setting. |

4 | C3: Lightweight Incrementalized MCMC for Probabilistic Programs using Continuations and Callsite Caching | Daniel Ritchie, Andreas Stuhlmüller, Noah Goodman | We present a new extension to the lightweight approach, C3, which enables efficient, incrementalized re-execution of MH proposals. |

5 | Clamping Improves TRW and Mean Field Approximations | Adrian Weller, Justin Domke | We explore the value of our methods by empirical analysis and draw lessons to guide practitioners. |

6 | Tightness of LP Relaxations for Almost Balanced Models | Adrian Weller, Mark Rowland, David Sontag | Here we consider binary pairwise models and derive sufficient conditions for guaranteed tightness of (i) the standard LP relaxation on the local polytope LP+LOC, and (ii) the LP relaxation on the triplet-consistent polytope LP+TRI (the next level in the Sherali-Adams hierarchy). |

7 | Control Functionals for Quasi-Monte Carlo Integration | Chris Oates, Mark Girolami | Quasi-Monte Carlo (QMC) methods are being adopted in statistical applications due to the increasingly challenging nature of numerical integrals that are now routinely encountered. |

8 | Probability Inequalities for Kernel Embeddings in Sampling without Replacement | Markus Schneider | In this work we generalize the results of Serfling (1974) to quantify the difference between these two estimates. |

9 | Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking | Nicolas Goix, Anne Sabourin, Stéphan Clémençon | This paper proposes a new algorithm based on multivariate EVT to learn how to rank observations in a high dimensional space with respect to their degree of ‘abnormality’. |

10 | A Robust-Equitable Copula Dependence Measure for Feature Selection | Yale Chang, Yi Li, Adam Ding, Jennifer Dy | In this paper we introduce the concept of robust-equitability and a robust-equitable dependence measure copula correlation (Ccor). |

11 | Random Forest for the Contextual Bandit Problem | Raphaël Féraud, Robin Allesiardo, Tanguy Urvoy, Fabrice Clérot | To address the contextual bandit problem, we propose an online random forest algorithm. |

12 | Inverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics | Michael Herman, Tobias Gindele, Jörg Wagner, Felix Schmitt, Wolfram Burgard | To overcome this, we present a gradient-based IRL approach that simultaneously estimates the system’s dynamics. |

13 | Learning Sparse Additive Models with Interactions in High Dimensions | Hemant Tyagi, Anastasios Kyrillidis, Bernd Gärtner, Andreas Krause | In this work, we consider a generalized SPAM, allowing for second order interaction terms. |

14 | Bipartite Correlation Clustering: Maximizing Agreements | Megasthenis Asteris, Anastasios Kyrillidis, Dimitris Papailiopoulos, Alexandros Dimakis | We present a novel approximation algorithm for k-BCC, a variant of BCC with an upper bound k on the number of clusters. |

15 | Breaking Sticks and Ambiguities with Adaptive Skip-gram | Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, Dmitry Vetrov | In this paper we propose the Adaptive Skip-gram model, a nonparametric Bayesian extension of Skip-gram capable of automatically learning the required number of representations for all words at the desired semantic resolution. |

16 | Top Arm Identification in Multi-Armed Bandits with Batch Arm Pulls | Kwang-Sung Jun, Kevin Jamieson, Robert Nowak, Xiaojin Zhu | We introduce a new multi-armed bandit (MAB) problem in which arms must be sampled in batches, rather than one at a time. |

17 | Limits on Sparse Support Recovery via Linear Sketching with Random Expander Matrices | Jonathan Scarlett, Volkan Cevher | Motivated by applications where the *positions* of the non-zero entries in a sparse vector are of primary interest, we consider the problem of *support recovery* from a linear sketch taking the form $\mathbf{Y} = \mathbf{X}\beta + \mathbf{Z}$. |

18 | Maximum Likelihood for Variance Estimation in High-Dimensional Linear Models | Lee H. Dicker, Murat A. Erdogdu | More broadly, the results in this paper illustrate a strategy for drawing connections between fixed- and random-effects models in high dimensions, which may be useful in other applications. |

19 | Scalable Gaussian Process Classification via Expectation Propagation | Daniel Hernandez-Lobato, Jose Miguel Hernandez-Lobato | As an alternative, we describe here how to train these classifiers efficiently using expectation propagation (EP). |

20 | Precision Matrix Estimation in High Dimensional Gaussian Graphical Models with Faster Rates | Lingxiao Wang, Xiang Ren, Quanquan Gu | In this paper, we present a new estimator for the precision matrix in high dimensional Gaussian graphical models. |

21 | On the Reducibility of Submodular Functions | Jincheng Mei, Hao Zhang, Bao-Liang Lu | In this paper, we study the reducibility of submodular functions, a property that enables us to reduce the solution space of submodular optimization problems without performance loss. |

22 | Accelerated Stochastic Gradient Descent for Minimizing Finite Sums | Atsushi Nitanda | We propose an optimization method for minimizing the finite sums of smooth convex functions. |

23 | Fast Convergence of Online Pairwise Learning Algorithms | Martin Boissier, Siwei Lyu, Yiming Ying, Ding-Xuan Zhou | In this paper, we focus on online learning algorithms for pairwise learning problems without strong convexity, for which all previously known algorithms achieve a convergence rate of $\mathcal{O}(1/\sqrt{T})$ after $T$ iterations. |

24 | Computationally Efficient Bayesian Learning of Gaussian Process State Space Models | Andreas Svensson, Arno Solin, Simo Särkkä, Thomas Schön | We present a procedure for efficient Bayesian learning in Gaussian process state space models, where the representation is formed by projecting the problem onto a set of approximate eigenfunctions derived from the prior covariance structure. |

25 | Generalized Ideal Parent (GIP): Discovering non-Gaussian Hidden Variables | Yaniv Tenzer, Gal Elidan | We propose a novel general purpose approach for discovering hidden variables in flexible non-Gaussian domains using the powerful class of Gaussian copula networks. |

26 | On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes | Alexander G. de G. Matthews, James Hensman, Richard Turner, Zoubin Ghahramani | In this paper we give a substantial generalization of the literature on this topic. |

27 | Non-stochastic Best Arm Identification and Hyperparameter Optimization | Kevin Jamieson, Ameet Talwalkar | Motivated by the task of hyperparameter optimization, we introduce the *non-stochastic best-arm identification* problem. |

28 | A Linearly-Convergent Stochastic L-BFGS Algorithm | Philipp Moritz, Robert Nishihara, Michael Jordan | We propose a new stochastic L-BFGS algorithm and prove a linear convergence rate for strongly convex and smooth functions. |

29 | No Regret Bound for Extreme Bandits | Robert Nishihara, David Lopez-Paz, Leon Bottou | Motivated by the general challenge of sequentially choosing which algorithm to use, we study the more specific task of choosing among distributions to use for random hyperparameter optimization. |

30 | Tensor vs. Matrix Methods: Robust Tensor Decomposition under Block Sparse Perturbations | Anima Anandkumar, Prateek Jain, Yang Shi, U. N. Niranjan | We propose a novel non-convex iterative algorithm with guaranteed recovery. |

31 | Online Learning to Rank with Feedback at the Top | Sougata Chaudhuri, Ambuj Tewari | We develop efficient algorithms for well known losses in the pointwise, pairwise and listwise families. |

32 | Survey Propagation beyond Constraint Satisfaction Problems | Christopher Srinivasa, Siamak Ravanbakhsh, Brendan Frey | We propose an approximation scheme to efficiently extend the application of SP to marginalization in binary pairwise graphical models. |

33 | Score Permutation Based Finite Sample Inference for Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) Models | Balázs Csanád Csáji | Here, we suggest a finite sample approach, called ScoPe, to construct distribution-free confidence regions around the QML estimate, which have exact coverage probabilities even though no additional assumptions about moments are made. |

34 | CRAFT: ClusteR-specific Assorted Feature selecTion | Vikas K. Garg, Cynthia Rudin, Tommi Jaakkola | We present a hierarchical Bayesian framework for clustering with cluster-specific feature selection. |

35 | Time-Varying Gaussian Process Bandit Optimization | Ilija Bogunovic, Jonathan Scarlett, Volkan Cevher | We introduce two natural extensions of the classical Gaussian process upper confidence bound (GP-UCB) algorithm. |

36 | Bayes-Optimal Effort Allocation in Crowdsourcing: Bounds and Index Policies | Weici Hu, Peter Frazier | Following a similar approach to the Lagrangian Relaxation technique in Adelman and Mersereau (2008), we provide a computationally tractable instance-specific upper bound on the value of this Bayes-optimal policy, which can in turn be used to bound the optimality gap of any other sub-optimal policy. |

37 | Bayesian Markov Blanket Estimation | Dinu Kaufmann, Sonali Parbhoo, Aleksander Wieczorek, Sebastian Keller, David Adametz, Volker Roth | This paper considers a Bayesian view for estimating the Markov blanket of a set of query variables, where the set of potential neighbours is large. |

38 | Dreaming More Data: Class-dependent Distributions over Diffeomorphisms for Learned Data Augmentation | Søren Hauberg, Oren Freifeld, Anders Boesen Lindbo Larsen, John Fisher, Lars Hansen | With an eye towards true end-to-end learning, we suggest learning the applied transformations on a per-class basis. |

39 | Unsupervised Ensemble Learning with Dependent Classifiers | Ariel Jaffe, Ethan Fetaya, Boaz Nadler, Tingting Jiang, Yuval Kluger | To this end we introduce a statistical model that allows for dependencies between classifiers. |

40 | Multi-Level Cause-Effect Systems | Krzysztof Chalupka, Frederick Eberhardt, Pietro Perona | We present a domain-general account of causation that applies to settings in which macro-level causal relations between two systems are of interest, but the relevant causal features are poorly understood and have to be aggregated from vast arrays of micro-measurements. |

41 | Deep Kernel Learning | Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing | We introduce scalable deep kernels, which combine the structural properties of deep learning architectures with the non-parametric flexibility of kernel methods. |

42 | Nearly Optimal Classification for Semimetrics | Lee-Ad Gottlieb, Aryeh Kontorovich, Pinhas Nisnevitch | We initiate the rigorous study of classification in semimetric spaces, which are point sets with a distance function that is non-negative and symmetric, but need not satisfy the triangle inequality. |

43 | Latent Point Process Allocation | Chris Lloyd, Tom Gunter, Michael Osborne, Stephen Roberts, Tom Nickson | We introduce a probabilistic model for the factorisation of continuous Poisson process rate functions. |

44 | K2-ABC: Approximate Bayesian Computation with Kernel Embeddings | Mijung Park, Wittawat Jitkrittum, Dino Sejdinovic | In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. |

45 | Bayesian Generalised Ensemble Markov Chain Monte Carlo | Jes Frellsen, Ole Winther, Zoubin Ghahramani, Jesper Ferkinghoff-Borg | BayesGE uses a Bayesian approach to iteratively update the belief about the density of states (distribution of the log likelihood under the prior) for the model, with the dual purpose of enhancing the sampling efficiency and making the estimation of the partition function tractable. |

46 | A Lasso-based Sparse Knowledge Gradient Policy for Sequential Optimal Learning | Yan Li, Han Liu, Warren Powell | We propose a sequential learning policy for noisy discrete global optimization and ranking and selection (R&S) problems with high dimensional sparse belief functions, where there are hundreds or even thousands of features, but only a small portion of these features contain explanatory power. |

47 | Optimal Statistical and Computational Rates for One Bit Matrix Completion | Renkun Ni, Quanquan Gu | We present an estimator based on rank constrained maximum likelihood estimation, and an efficient greedy algorithm to solve it approximately based on an extension of conditional gradient descent. |

48 | PAC-Bayesian Bounds based on the Rényi Divergence | Luc Bégin, Pascal Germain, François Laviolette, Jean-Francis Roy | We propose a simplified proof process for PAC-Bayesian generalization bounds that divides the proof into four successive inequalities, easing the "customization" of PAC-Bayesian theorems. |

49 | Simple and Scalable Constrained Clustering: a Generalized Spectral Method | Mihai Cucuringu, Ioannis Koutis, Sanjay Chawla, Gary Miller, Richard Peng | We present a simple spectral approach to the well-studied constrained clustering problem. |

50 | Geometry Aware Mappings for High Dimensional Sparse Factors | Avradeep Bhowmik, Nathan Liu, Erheng Zhong, Badri Bhaskar, Suju Rajan | In this manuscript we present a novel framework that exploits structural properties of sparse vectors, using the inverted index representation, to significantly reduce the run time computational cost of factorisation models. |

51 | Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree | Chen-Yu Lee, Patrick W. Gallagher, Zhuowen Tu | We seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. |

52 | Rivalry of Two Families of Algorithms for Memory-Restricted Streaming PCA | Chun-Liang Li, Hsuan-Tien Lin, Chi-Jen Lu | In this paper, we analyze the convergence rate of a representative algorithm with decayed learning rate (Oja and Karhunen, 1985) in the first family for the general k>1 case. |

53 | Quantization based Fast Inner Product Search | Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, David Simcha | We propose a quantization based approach for fast approximate Maximum Inner Product Search (MIPS). |

54 | An Improved Convergence Analysis of Cyclic Block Coordinate Descent-type Methods for Strongly Convex Minimization | Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, Mingyi Hong | To bridge this theoretical gap, we propose an improved convergence analysis for the CBCD-type methods. |

55 | Learning Structured Low-Rank Representation via Matrix Factorization | Jie Shen, Ping Li | In this paper, we propose to learn structured LRR by factorizing the nuclear norm regularized matrix, which leads to our proposed non-convex formulation NLRR. |

56 | A PAC RL Algorithm for Episodic POMDPs | Zhaohan Daniel Guo, Shayan Doroudi, Emma Brunskill | We give, to our knowledge, the first partially observable RL algorithm with a polynomial bound on the number of episodes on which the algorithm may not achieve near-optimal performance. |

57 | Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation | Sujith Ravi, Qiming Diao | Traditional graph-based semi-supervised learning (SSL) approaches are not suited for massive data and large label scenarios since they scale linearly with the number of edges $|E|$ and distinct labels $m$. To deal with the large label size problem, recent works propose sketch-based methods to approximate the label distribution per node thereby achieving a space reduction from $O(m)$ to $O(\log m)$, under certain conditions. |

58 | Large-Scale Optimization Algorithms for Sparse Conditional Gaussian Graphical Models | Calvin McCarter, Seyoung Kim | In this paper, we propose a new optimization procedure based on a Newton method that efficiently iterates over two sub-problems, leading to drastic improvement in computation time compared to the previous methods. |

59 | Graph Connectivity in Noisy Sparse Subspace Clustering | Yining Wang, Yu-Xiang Wang, Aarti Singh | In this paper, we investigate the graph connectivity problem for noisy sparse subspace clustering and show that a simple post-processing procedure is capable of delivering consistent clustering under certain “general position” or “restricted eigenvalue” assumptions. |

60 | The Nonparametric Kernel Bayes Smoother | Yu Nishiyama, Amir Afsharinejad, Shunsuke Naruse, Byron Boots, Le Song | We expand upon this work by introducing a smoothing algorithm, the nonparametric kernel Bayes’ smoother (nKB-smoother) which relies on kernel Bayesian inference through the kernel sum rule and kernel Bayes’ rule. |

61 | Universal Models of Multivariate Temporal Point Processes | Asela Gunawardana, Chris Meek | In this paper, we study the expressive power and learnability of Graphical Event Models (GEMs) – the analogue of directed graphical models for multivariate temporal point processes. |

62 | Online Relative Entropy Policy Search using Reproducing Kernel Hilbert Space Embeddings | Zhitang Chen, Pascal Poupart, Yanhui Geng | In this paper, we develop an online policy search algorithm based on a recent state-of-the-art algorithm REPS-RKHS that uses conditional kernel embeddings. |

63 | Relationship between PreTraining and Maximum Likelihood Estimation in Deep Boltzmann Machines | Muneki Yasuda | A pretraining algorithm, which is a layer-by-layer greedy learning algorithm, for a deep Boltzmann machine (DBM) is presented in this paper. |

64 | Enumerating Equivalence Classes of Bayesian Networks using EC Graphs | Eunice Yuh-Jie Chen, Arthur Choi, Adnan Darwiche | We propose a new search space for A* search, called the EC graph, that facilitates the enumeration of equivalence classes, by representing the space of completed, partially directed acyclic graphs. |

65 | Low-Rank and Sparse Structure Pursuit via Alternating Minimization | Quanquan Gu, Zhaoran Wang, Han Liu | In this paper, we present a nonconvex alternating minimization optimization algorithm for low-rank and sparse structure pursuit. |

66 | NuC-MKL: A Convex Approach to Non Linear Multiple Kernel Learning | Eli Meirom, Pavel Kisilev | In this paper, we propose a new non-linear MKL method that utilizes nuclear norm regularization and leads to a convex optimization problem. |

67 | Tractable and Scalable Schatten Quasi-Norm Approximations for Rank Minimization | Fanhua Shang, Yuanyuan Liu, James Cheng | Motivated by the equivalence relation between the trace norm and its bilinear spectral penalty, we define two tractable Schatten norms, i.e. the bi-trace and tri-trace norms, and prove that they are in essence the Schatten-1/2 and 1/3 quasi-norms, respectively. |

68 | Fast Dictionary Learning with a Smoothed Wasserstein Loss | Antoine Rolet, Marco Cuturi, Gabriel Peyré | We consider in this paper the dictionary learning problem when the observations are normalized histograms of features. |

69 | New Resistance Distances with Global Information on Large Graphs | Canh Hao Nguyen, Hiroshi Mamitsuka | We propose new distance functions between nodes for this problem. |

70 | Batch Bayesian Optimization via Local Penalization | Javier Gonzalez, Zhenwen Dai, Philipp Hennig, Neil Lawrence | We investigate this issue and propose a highly effective heuristic based on an estimate of the function’s Lipschitz constant that captures the most important aspect of this interaction–local repulsion–at negligible computational overhead. |

71 | Nonparametric Budgeted Stochastic Gradient Descent | Trung Le, Vu Nguyen, Tu Dinh Nguyen, Dinh Phung | In this paper, we propose the Nonparametric Budgeted Stochastic Gradient Descent that allows the model size to automatically grow with data in a principled way. |

72 | Learning Relationships between Data Obtained Independently | Alexandra Carpentier, Teresa Schlueter | The aim of this paper is to provide a new method for learning the relationships between data that have been obtained independently. |

73 | Fast and Scalable Structural SVM with Slack Rescaling | Heejin Choi, Ofer Meshi, Nathan Srebro | We present an efficient method for training slack-rescaled structural SVM. |

74 | Probabilistic Approximate Least-Squares | Simon Bartels, Philipp Hennig | Leveraging recent results casting elementary linear algebra operations as probabilistic inference, we propose a new approximate method for nonparametric least-squares that affords a probabilistic uncertainty estimate over the error between the approximate and exact least-squares solution (this is not the same as the posterior variance of the associated Gaussian process regressor). |

75 | Approximate Inference Using DC Programming For Collective Graphical Models | Thien Nguyen, Akshat Kumar, Hoong Chuin Lau, Daniel Sheldon | Collective graphical models (CGMs) provide a framework for reasoning about a population of independent and identically distributed individuals when only noisy and aggregate observations are given. |

76 | Sequential Inference for Deep Gaussian Process | Yali Wang, Marcus Brubaker, Brahim Chaib-Draa, Raquel Urtasun | In this paper, we propose an efficient sequential inference framework for DGP, where the data is processed sequentially. |

77 | Variational Tempering | Stephan Mandt, James McInerney, Farhan Abrol, Rajesh Ranganath, David Blei | We therefore introduce variational tempering, a variational algorithm that introduces a temperature latent variable to the model. |

78 | On Convergence of Model Parallel Proximal Gradient Algorithm for Stale Synchronous Parallel System | Yi Zhou, Yaoliang Yu, Wei Dai, Yingbin Liang, Eric Xing | In this work we propose mspg, an extension of the flexible proximal gradient algorithm to the model parallel and stale synchronous setting. |

79 | Scalable MCMC for Mixed Membership Stochastic Blockmodels | Wenzhe Li, Sungjin Ahn, Max Welling | We propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in mixed-membership stochastic blockmodels (MMSB). |

80 | Non-Stationary Gaussian Process Regression with Hamiltonian Monte Carlo | Markus Heinonen, Henrik Mannerström, Juho Rousu, Samuel Kaski, Harri Lähdesmäki | We present a novel approach for non-stationary Gaussian process regression (GPR), where the three key parameters – noise variance, signal variance and lengthscale – can be simultaneously input-dependent. |

81 | A Deep Generative Deconvolutional Image Model | Yunchen Pu, Xin Yuan, Andrew Stevens, Chunyuan Li, Lawrence Carin | A deep generative model is developed for representation and analysis of images, based on a hierarchical convolutional dictionary-learning framework. |

82 | Distributed Multi-Task Learning | Jialei Wang, Mladen Kolar, Nathan Srebro | We present a communication-efficient estimator based on the debiased lasso and show that it is comparable with the optimal centralized method. |

83 | A Fixed-Point Operator for Inference in Variational Bayesian Latent Gaussian Models | Rishit Sheth, Roni Khardon | Recent work proposed a fixed-point (FP) update procedure to optimize the covariance matrix in the variational solution and demonstrated its efficacy in specific models. |

84 | Learning Probabilistic Submodular Diversity Models Via Noise Contrastive Estimation | Sebastian Tschiatschek, Josip Djolonga, Andreas Krause | In this paper, we propose FLID, a novel log-submodular diversity model that scales to large numbers of items and can be efficiently learned using noise contrastive estimation. |

85 | Fast Saddle-Point Algorithm for Generalized Dantzig Selector and FDR Control with Ordered L1-Norm | Sangkyun Lee, Damian Brzyski, Malgorzata Bogdan | In this paper we propose a primal-dual proximal extragradient algorithm to solve the generalized Dantzig selector (GDS) estimation problem, based on a new convex-concave saddle-point (SP) reformulation. |

86 | GLASSES: Relieving The Myopia Of Bayesian Optimisation | Javier Gonzalez, Michael Osborne, Neil Lawrence | We present GLASSES: Global optimisation with Look-Ahead through Stochastic Simulation and Expected-loss Search. |

87 | Stochastic Variational Inference for the HDP-HMM | Aonan Zhang, San Gultekin, John Paisley | In this paper we provide a solution to this problem by deriving a variational inference algorithm for the HDP-HMM, as well as its stochastic extension, for which all parameter updates are in closed form. |

88 | Stochastic Neural Networks with Monotonic Activation Functions | Siamak Ravanbakhsh, Barnabas Poczos, Jeff Schneider, Dale Schuurmans, Russell Greiner | We propose a Laplace approximation that creates a stochastic unit from any smooth monotonic activation function, using only Gaussian noise. |

89 | (Bandit) Convex Optimization with Biased Noisy Gradient Oracles | Xiaowei Hu, Prashanth L.A., András György, Csaba Szepesvari | In this paper we propose a novel framework that replaces the specific gradient estimation methods with an abstract oracle model. |

90 | Variational Gaussian Copula Inference | Shaobo Han, Xuejun Liao, David Dunson, Lawrence Carin | For models with continuous and non-Gaussian hidden variables, we propose a semiparametric and automated variational Gaussian copula approach, in which the parametric Gaussian copula family is able to preserve multivariate posterior dependence, and the nonparametric transformations based on Bernstein polynomials provide ample flexibility in characterizing the univariate marginal posteriors. |

91 | Low-Rank Approximation of Weighted Tree Automata | Guillaume Rabusseau, Borja Balle, Shay Cohen | We describe a technique to minimize weighted tree automata (WTA), a powerful formalism that subsumes probabilistic context-free grammars (PCFGs) and latent-variable PCFGs. |

92 | Accelerating Online Convex Optimization via Adaptive Prediction | Mehryar Mohri, Scott Yang | We present a powerful general framework for designing data-dependent online convex optimization algorithms, building upon and unifying recent techniques in adaptive regularization, optimistic gradient predictions, and problem-dependent randomization. |

93 | Scalable geometric density estimation | Ye Wang, Antonio Canale, David Dunson | We introduce a novel empirical Bayes method that we term geometric density estimation (GEODE) and show that, with mild conditions and among all d-dimensional linear subspaces, the span of the d leading principal axes of the data maximizes the model posterior. |

94 | Model-based Co-clustering for High Dimensional Sparse Data | Aghiles Salah, Nicoleta Rogovschi, Mohamed Nadif | We propose a novel model based on the von Mises-Fisher (vMF) distribution for co-clustering high dimensional sparse matrices. |

95 | DUAL-LOCO: Distributing Statistical Estimation Using Random Projections | Christina Heinze, Brian McWilliams, Nicolai Meinshausen | We present DUAL-LOCO, a communication-efficient algorithm for distributed statistical estimation. |

96 | High Dimensional Bayesian Optimization via Restricted Projection Pursuit Models | Chun-Liang Li, Kirthevasan Kandasamy, Barnabas Poczos, Jeff Schneider | Our generalization provides the benefits of i) greatly increasing the space of functions that can be modeled by our approach, which covers the previous works (Wang et al., 2013; Kandasamy et al., 2015) as special cases, and ii) efficiently handling the learning in a larger model space. |

97 | On the Use of Non-Stationary Strategies for Solving Two-Player Zero-Sum Markov Games | Julien Pérolat, Bilal Piot, Bruno Scherrer, Olivier Pietquin | The main contribution of this paper consists in extending several non-stationary Reinforcement Learning (RL) algorithms and their theoretical guarantees to the case of γ-discounted zero-sum Markov Games (MGs). |

98 | Semi-Supervised Learning with Adaptive Spectral Transform | Hanxiao Liu, Yiming Yang | This paper proposes a novel nonparametric framework for semi-supervised learning and for optimizing the Laplacian spectrum of the data manifold simultaneously. |

99 | Pseudo-Marginal Slice Sampling | Iain Murray, Matthew Graham | We describe a general way to clamp and update the random numbers used in a pseudo-marginal method’s unbiased estimator. |

100 | How to Learn a Graph from Smooth Signals | Vassilis Kalofolias | We propose a framework to learn the graph structure underlying a set of smooth signals. |

101 | Ordered Weighted L1 Regularized Regression with Strongly Correlated Covariates: Theoretical Aspects | Mario Figueiredo, Robert Nowak | This paper studies the ordered weighted L1 (OWL) family of regularizers for sparse linear regression with strongly correlated covariates. |

102 | Pareto Front Identification from Stochastic Bandit Feedback | Peter Auer, Chao-Kai Chiang, Ronald Ortner, Madalina Drugan | We propose a confidence bound algorithm to approximate the Pareto front, and prove problem specific lower and upper bounds, showing that the sample complexity is characterized by some natural geometric properties of the operating points. |

103 | Sketching, Embedding and Dimensionality Reduction in Information Theoretic Spaces | Amirali Abdullah, Ravi Kumar, Andrew McGregor, Sergei Vassilvitskii, Suresh Venkatasubramanian | In this paper we show how to embed information distances like the χ^2 and Jensen-Shannon divergences efficiently in low dimensional spaces while preserving all pairwise distances. |

104 | AdaDelay: Delay Adaptive Distributed Stochastic Optimization | Suvrit Sra, Adams Wei Yu, Mu Li, Alex Smola | We develop distributed stochastic convex optimization algorithms under a delayed gradient model in which server nodes update parameters and worker nodes compute stochastic (sub)gradients. |

105 | Exponential Stochastic Cellular Automata for Massively Parallel Inference | Manzil Zaheer, Michael Wick, Jean-Baptiste Tristan, Alex Smola, Guy Steele | We propose an embarrassingly parallel, memory efficient inference algorithm for latent variable models in which the complete data likelihood is in the exponential family. |

106 | Globally Sparse Probabilistic PCA | Pierre-Alexandre Mattei, Charles Bouveyron, Pierre Latouche | To overcome this drawback, we propose a Bayesian procedure that allows us to obtain several sparse components with the same sparsity pattern. |

107 | Provable Bayesian Inference via Particle Mirror Descent | Bo Dai, Niao He, Hanjun Dai, Le Song | To tackle this challenge, we propose a simple yet provable algorithm, Particle Mirror Descent (PMD), to iteratively approximate the posterior density. |

108 | Unsupervised Feature Selection by Preserving Stochastic Neighbors | Xiaokai Wei, Philip S. Yu | In this paper, we present an effective method, Stochastic Neighbor-preserving Feature Selection (SNFS), for selecting discriminative features in an unsupervised setting. |

109 | Improved Learning Complexity in Combinatorial Pure Exploration Bandits | Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner, Peter Bartlett | We study the problem of combinatorial pure exploration in the stochastic multi-armed bandit problem. |

110 | Scalable Gaussian Processes for Characterizing Multidimensional Change Surfaces | William Herlands, Andrew Wilson, Hannes Nickisch, Seth Flaxman, Daniel Neill, Wilbert Van Panhuis, Eric Xing | We present a scalable Gaussian process model for identifying and characterizing smooth multidimensional changepoints, and automatically learning changes in expressive covariance structure. |

111 | Optimization as Estimation with Gaussian Processes in Bandit Settings | Zi Wang, Bolei Zhou, Stefanie Jegelka | We study an optimization strategy that directly uses an estimate of the argmax of the function. |

112 | A Convex Surrogate Operator for General Non-Modular Loss Functions | Jiaqian Yu, Matthew Blaschko | In this work, a novel generic convex surrogate for general non-modular loss functions is introduced, which provides for the first time a tractable solution for loss functions that are neither supermodular nor submodular. |

113 | Inference for High-dimensional Exponential Family Graphical Models | Jialei Wang, Mladen Kolar | In this paper, we propose a novel estimator for edge parameters in exponential family graphical models. |

114 | Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization | Changyou Chen, David Carlson, Zhe Gan, Chunyuan Li, Lawrence Carin | We explore this relationship by applying simulated annealing to an SG-MCMC algorithm. |

115 | Fitting Spectral Decay with the k-Support Norm | Andrew McDonald, Massimiliano Pontil, Dimitris Stamos | In this paper we generalize the norm to the spectral (k,p)-support norm, whose additional parameter p can be used to tailor the norm to the decay of the spectrum of the underlying model. |

116 | Early Stopping as Nonparametric Variational Inference | David Duvenaud, Dougal Maclaurin, Ryan Adams | We show that unconverged stochastic gradient descent can be interpreted as sampling from a nonparametric approximate posterior distribution. |

117 | Bayesian Nonparametric Kernel-Learning | Junier B. Oliva, Avinava Dubey, Andrew G. Wilson, Barnabas Poczos, Jeff Schneider, Eric P. Xing | In this paper we introduce Bayesian nonparametric kernel-learning (BaNK), a generic, data-driven framework for scalable learning of kernels. |

118 | Tight Variational Bounds via Random Projections and I-Projections | Lun-Kai Hsu, Tudor Achim, Stefano Ermon | To overcome this issue, we introduce a new class of random projections to reduce the dimensionality and hence the complexity of the original model. |

119 | Bethe Learning of Graphical Models via MAP Decoding | Kui Tang, Nicholas Ruozzi, David Belanger, Tony Jebara | We introduce MLE-Struct, a method for learning discrete exponential family models using the Bethe approximation to the partition function. |

120 | Determinantal Regularization for Ensemble Variable Selection | Veronika Rockova, Gemma Moran, Edward George | Motivated by non-parametric variational Bayes strategies, we move beyond this limitation by proposing an ensemble optimization approach to identify a collection of representative posterior modes. |

121 | Scalable and Sound Low-Rank Tensor Learning | Hao Cheng, Yaoliang Yu, Xinhua Zhang, Eric Xing, Dale Schuurmans | To address this problem, we propose directly optimizing the tensor trace norm by approximating its dual spectral norm, and we show that the approximation bounds can be efficiently converted to the original problem via the generalized conditional gradient algorithm. |

122 | Non-negative Matrix Factorization for Discrete Data with Hierarchical Side-Information | Changwei Hu, Piyush Rai, Lawrence Carin | We present a probabilistic framework for efficient non-negative matrix factorization of discrete (count/binary) data with side-information. |

123 | Topic-Based Embeddings for Learning from Large Knowledge Graphs | Changwei Hu, Piyush Rai, Lawrence Carin | We present a scalable probabilistic framework for learning from multi-relational data given in the form of entity-relation-entity triplets, with a potentially massive number of entities and relations (e.g., in multi-relational networks, knowledge bases, etc.). |

124 | Consistently Estimating Markov Chains with Noisy Aggregate Data | Garrett Bernstein, Daniel Sheldon | We address the problem of estimating the parameters of a time-homogeneous Markov chain given only noisy, aggregate data. |

125 | Unwrapping ADMM: Efficient Distributed Computing via Transpose Reduction | Tom Goldstein, Gavin Taylor, Kawika Barabin, Kent Sayre | We propose iterative methods that solve global sub-problems over an entire distributed dataset. |

126 | Improper Deep Kernels | Uri Heinemann, Roi Livni, Elad Eban, Gal Elidan, Amir Globerson | Here we address this difficulty by turning to "improper learning" of neural nets. |

127 | Unbounded Bayesian Optimization via Regularization | Bobak Shahriari, Alexandre Bouchard-Cote, Nando de Freitas | In this work, we modify the standard Bayesian optimization framework in a principled way to allow for unconstrained exploration of the search space. |

128 | Non-Gaussian Component Analysis with Log-Density Gradient Estimation | Hiroaki Sasaki, Gang Niu, Masashi Sugiyama | In this paper, we propose a novel NGCA algorithm based on log-density gradient estimation. |

129 | Online Learning with Noisy Side Observations | Tomáš Kocák, Gergely Neu, Michal Valko | We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. |

130 | Black-Box Policy Search with Probabilistic Programs | Jan-Willem van de Meent, Brooks Paige, David Tolpin, Frank Wood | In this work we show how to represent policies as programs: that is, as stochastic simulators with tunable parameters. |

131 | Efficient Bregman Projections onto the Permutahedron and Related Polytopes | Cong Han Lim, Stephen J. Wright | In summary, this work describes a fast unified approach to this well-known class of problems. |

132 | On Searching for Generalized Instrumental Variables | Benito Zander, Maciej Liskiewicz | We provide fast algorithms for searching and testing restricted cases of GIVs. |

133 | Provable Tensor Methods for Learning Mixtures of Generalized Linear Models | Hanie Sedghi, Majid Janzamin, Anima Anandkumar | In contrast, we present a tensor decomposition method which is guaranteed to correctly recover the parameters. |

134 | Controlling Bias in Adaptive Data Analysis Using Information Theory | Daniel Russo, James Zou | In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias of an arbitrary adaptive analysis process. |

135 | A Column Generation Bound Minimization Approach with PAC-Bayesian Generalization Guarantees | Jean-Francis Roy, Mario Marchand, François Laviolette | In this work, we design a column generation algorithm that we call CqBoost, that optimizes the C-bound and outputs a sparse distribution on a possibly infinite set of voters. |

136 | Graph Sparsification Approaches for Laplacian Smoothing | Veeru Sadhanala, Yu-Xiang Wang, Ryan Tibshirani | Given a statistical estimation problem where regularization is performed according to the structure of a large, dense graph G, we consider fitting the statistical estimate using a *sparsified* surrogate graph **G**, which shares the vertices of G but has far fewer edges, and is thus more tractable to work with computationally. |

137 | Scalable Exemplar Clustering and Facility Location via Augmented Block Coordinate Descent with Column Generation | Ian En-Hsu Yen, Dmitry Malioutov, Abhishek Kumar | In this work, we propose an Augmented-Lagrangian with Block Coordinate Descent (AL-BCD) algorithm that utilizes problem structure to obtain a closed-form solution for each block sub-problem, and exploits a low-rank representation of the dissimilarity matrix to search active columns without computing the entire matrix. |

138 | Robust Covariate Shift Regression | Xiangli Chen, Mathew Monfort, Anqi Liu, Brian D. Ziebart | We propose a robust approach for regression under covariate shift that embraces the uncertainty resulting from sample selection bias by producing regression models that are explicitly robust to it. |

139 | On Lloyd’s Algorithm: New Theoretical Insights for Clustering in Practice | Cheng Tang, Claire Monteleoni | We provide new analyses of Lloyd’s algorithm (1982), commonly known as the k-means clustering algorithm. |

140 | Towards Stability and Optimality in Stochastic Gradient Descent | Panos Toulis, Dustin Tran, Edo Airoldi | To address these two issues we propose an iterative estimation procedure termed averaged implicit SGD (AI-SGD). |

141 | Communication Efficient Distributed Agnostic Boosting | Shang-Tse Chen, Maria-Florina Balcan, Duen Horng Chau | Our main contribution is a general distributed boosting-based procedure for learning an arbitrary concept space, that is simultaneously noise tolerant, communication efficient, and computationally efficient. |

142 | Private Causal Inference | Matt J. Kusner, Yu Sun, Karthik Sridharan, Kilian Q. Weinberger | We study the problem of inferring causality using the current, popular causal inference framework, the additive noise model (ANM) while simultaneously ensuring privacy of the users. |

143 | Parallel Markov Chain Monte Carlo via Spectral Clustering | Guillaume Basse, Aaron Smith, Natesh Pillai | In this paper, we present a parallelization scheme for Markov chain Monte Carlo (MCMC) methods based on spectral clustering of the underlying state space, generalizing earlier work on parallelization of MCMC methods by state space partitioning. |

144 | Efficient Sampling for k-Determinantal Point Processes | Chengtao Li, Stefanie Jegelka, Suvrit Sra | In light of this, we propose a new method for approximate sampling from discrete k-DPPs. |

145 | A Fast and Reliable Policy Improvement Algorithm | Yasin Abbasi-Yadkori, Peter L. Bartlett, Stephen J. Wright | We introduce a simple, efficient method that improves stochastic policies for Markov decision processes. |

146 | Learning Sigmoid Belief Networks via Monte Carlo Expectation Maximization | Zhao Song, Ricardo Henao, David Carlson, Lawrence Carin | We propose using an online Monte Carlo expectation-maximization (MCEM) algorithm to learn the maximum a posteriori (MAP) estimator of the generative model or optimize the variational lower bound of a recognition network. |

147 | Active Learning Algorithms for Graphical Model Selection | Gautam Dasarathy, Aarti Singh, Maria-Florina Balcan, Jong H. Park | We propose a general paradigm for graphical model selection where feedback is used to guide the sampling to high degree vertices, while obtaining only few samples from the ones with the low degrees. |

148 | Streaming Kernel Principal Component Analysis | Mina Ghashami, Daniel J. Perry, Jeff Phillips | Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. |

149 | Back to the Future: Radial Basis Function Networks Revisited | Qichao Que, Mikhail Belkin | In this paper we aim to revisit some of the older approaches to training the RBF networks from a more modern perspective. |

150 | Cut Pursuit: Fast Algorithms to Learn Piecewise Constant Functions | Loic Landrieu, Guillaume Obozinski | We propose working-set/greedy algorithms to efficiently solve problems penalized respectively by the total variation and the Mumford Shah boundary size when the piecewise constant solutions have a small number of level sets. |

151 | Loss Bounds and Time Complexity for Speed Priors | Daniel Filan, Jan Leike, Marcus Hutter | We propose a variant to the original speed prior (Schmidhuber, 2002), and show that our prior can predict sequences drawn from probability measures that are estimable in polynomial time. |

152 | NYTRO: When Subsampling Meets Early Stopping | Raffaello Camoriano, Tomás Angles, Alessandro Rudi, Lorenzo Rosasco | In this paper we ask whether early stopping and subsampling ideas can be combined in a fruitful way. |

153 | Randomization and The Pernicious Effects of Limited Budgets on Auction Experiments | Guillaume W. Basse, Hossein Azari Soufiani, Diane Lambert | This paper shows that if an A/B experiment affects only bids, then the observed treatment effect is an unbiased estimator when all the bidders in the same auction are randomly assigned to A or B but the observed treatment effect can be severely biased otherwise, even in the absence of throttling. |

154 | Spectral M-estimation with Applications to Hidden Markov Models | Dustin Tran, Minjae Kim, Finale Doshi-Velez | In this paper, we apply the framework of M-estimation to develop both a generalized method of moments procedure and a principled method for regularization. |

155 | Chained Gaussian Processes | Alan D. Saul, James Hensman, Aki Vehtari, Neil D. Lawrence | We develop an approximate inference procedure for Chained GPs that is scalable and applicable to any factorized likelihood. |

156 | Multiresolution Matrix Compression | Nedelina Teneva, Pramod Kaushik Mudrakarta, Risi Kondor | In this paper we describe pMMF, a fast parallel MMF algorithm, which can scale to n in the range of millions. |

157 | Supervised Neighborhoods for Distributed Nonparametric Regression | Adam Bloniarz, Ameet Talwalkar, Bin Yu, Christopher Wu | We propose a new method, SILO, for fitting prediction-time local models that uses supervised neighborhoods that adapt to the local shape of the regression surface. |

158 | Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation | Dejiao Zhang, Laura Balzano | In this paper, we propose an adaptive step size scheme that is greedy for the noiseless case, that maximizes the improvement of our metric of convergence at each data index t, and yields an expected improvement for the noisy case. |

159 | Online and Distributed Bayesian Moment Matching for Parameter Learning in Sum-Product Networks | Abdullah Rashwan, Han Zhao, Pascal Poupart | More specifically, we propose a new Bayesian moment matching (BMM) algorithm that operates naturally in an online fashion and that can be easily distributed. |

160 | Mondrian Forests for Large-Scale Regression when Uncertainty Matters | Balaji Lakshminarayanan, Daniel M. Roy, Yee Whye Teh | Through a combination of illustrative examples, real-world large-scale datasets and Bayesian optimization benchmarks, we demonstrate that Mondrian forests outperform approximate GPs on large-scale regression tasks and deliver better-calibrated uncertainty assessments than decision-forest-based methods. |

161 | Online (and Offline) Robust PCA: Novel Algorithms and Performance Guarantees | Jinchun Zhan, Brian Lois, Han Guo, Namrata Vaswani | In this work we develop and study a novel online robust principal components analysis (RPCA) algorithm based on the recently introduced ReProCS framework. |

162 | Parallel Majorization Minimization with Dynamically Restricted Domains for Nonconvex Optimization | Yan Kaganovsky, Ikenna Odinaka, David Carlson, Lawrence Carin | We propose an optimization framework for nonconvex problems based on majorization-minimization that is particularly well-suited for parallel computing. |

163 | Discriminative Structure Learning of Arithmetic Circuits | Amirmohammad Rooshenas, Daniel Lowd | In this paper, we present the first discriminative structure learning algorithm for ACs, DACLearn (Discriminative AC Learner). |

164 | One Scan 1-Bit Compressed Sensing | Ping Li | Based on α-stable random projections with small α, we develop a simple algorithm for compressed sensing (sparse signal recovery) by utilizing only the signs (i.e., 1-bit) of the measurements. |