# Paper Digest: AISTATS 2018 Highlights

Readers can also read this highlight article on our console, which lets users filter papers by keyword and find related papers.

The International Conference on Artificial Intelligence and Statistics (AISTATS) is an interdisciplinary gathering of researchers at the intersection of computer science, artificial intelligence, machine learning, statistics, and related areas.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights / summaries to quickly get the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to **sign up for our free daily paper digest service** to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to get updates on new conference digests.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: AISTATS 2018 Papers

No. | Title | Authors | Highlight |
---|---|---|---|

1 | The Geometry of Random Features | Krzysztof Choromanski, Mark Rowland, Tamas Sarlos, Vikas Sindhwani, Richard Turner, Adrian Weller | We present an in-depth examination of the effectiveness of radial basis function kernel (beyond Gaussian) estimators based on orthogonal random feature maps. |

2 | Gauged Mini-Bucket Elimination for Approximate Inference | Sungsoo Ahn, Michael Chertkov, Jinwoo Shin, Adrian Weller | In this paper, we propose a new gauge-variational approach, termed WMBE-G, which combines gauge transformations with the weighted mini-bucket elimination (WMBE) method. |

3 | A Fast Algorithm for Separated Sparsity via Perturbed Lagrangians | Aleksander Madry, Slobodan Mitrovic, Ludwig Schmidt | In this paper, we make progress in this direction in the context of separated sparsity – a fundamental sparsity notion that captures exclusion constraints in linearly ordered data such as time series. |

4 | An Analysis of Categorical Distributional Reinforcement Learning | Mark Rowland, Marc Bellemare, Will Dabney, Remi Munos, Yee Whye Teh | In this paper, we introduce a framework to analyse CDRL algorithms, establish the importance of the projected distributional Bellman operator in distributional RL, draw fundamental connections between CDRL and the Cramer distance, and give a proof of convergence for sample-based categorical distributional reinforcement learning algorithms. |

5 | Combinatorial Preconditioners for Proximal Algorithms on Graphs | Thomas Möllenhoff, Zhenzhang Ye, Tao Wu, Daniel Cremers | We present a novel preconditioning technique for proximal optimization methods that relies on graph algorithms to construct effective preconditioners. |

6 | Growth-Optimal Portfolio Selection under CVaR Constraints | Guy Uziel, Ran El-Yaniv | We characterize the asymptotically optimal risk-adjusted performance and present an investment strategy whose portfolios are guaranteed to achieve the asymptotic optimal solution while fulfilling the desired risk constraint. |

7 | Accelerated Stochastic Power Iteration | Peng Xu, Bryan He, Christopher De Sa, Ioannis Mitliagkas, Chris Re | In this paper, we study methods to accelerate power iteration in the stochastic setting by adding a momentum term. |

8 | Multi-scale Nystrom Method | Woosang Lim, Rundong Du, Bo Dai, Kyomin Jung, Le Song, Haesun Park | In this paper, we propose Nested Nystrom Method (NNM) which achieves a delicate balance between the approximation accuracy and computational efficiency by exploiting the multilayer structure and multiple compressions. |

9 | Making Tree Ensembles Interpretable: A Bayesian Model Selection Approach | Satoshi Hara, Kohei Hayashi | In this study, we propose a method to make a complex tree ensemble interpretable by simplifying the model. |

10 | Mixed Membership Word Embeddings for Computational Social Science | James Foulds | I propose a probabilistic model-based word embedding method which can recover interpretable embeddings, without big data. |

11 | Fast Threshold Tests for Detecting Discrimination | Emma Pierson, Sam Corbett-Davies, Sharad Goel | To achieve these performance gains, we introduce and analyze a flexible family of probability distributions on the interval [0, 1] – which we call discriminant distributions – that is computationally efficient to work with. |

12 | Iterative Supervised Principal Components | Juho Piironen, Aki Vehtari | To do this, we propose a new dimension reduction technique, called iterative supervised principal components (ISPCs), which combines variable screening and dimension reduction and can be considered as an extension to the existing technique of supervised principal components (SPCs). |

13 | Iterative Spectral Method for Alternative Clustering | Chieh Wu, Stratis Ioannidis, Mario Sznaier, Xiangyu Li, David Kaeli, Jennifer Dy | We propose a novel Iterative Spectral Method (ISM) that greatly improves the scalability of KDAC. |

14 | Can clustering scale sublinearly with its clusters? A variational EM acceleration of GMMs and k-means | Dennis Forster, Jörg Lücke | In this study, we explore whether one iteration of k-means or EM for GMMs can scale sublinearly with C at run-time, while improving the clustering objective remains effective. |

15 | Parallelised Bayesian Optimisation via Thompson Sampling | Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, Barnabas Poczos | We design and analyse variations of the classical Thompson sampling (TS) procedure for Bayesian optimisation (BO) in settings where function evaluations are expensive but can be performed in parallel. |

16 | On the challenges of learning with inference networks on sparse, high-dimensional data | Rahul Krishnan, Dawen Liang, Matthew Hoffman | We propose methods to tackle it via iterative optimization inspired by stochastic variational inference (Hoffman et al., 2013) and improvements in the data representation used for inference. |

17 | Post Selection Inference with Kernels | Makoto Yamada, Yuta Umezu, Kenji Fukumizu, Ichiro Takeuchi | In this paper, we propose a kernel-based post-selection inference (PSI) algorithm that can find a set of statistically significant features from non-linearly related data. |

18 | On how complexity affects the stability of a predictor | Joel Ratsaby | We define the predictor’s complexity to be the amount of uncertainty in detecting that the criterion fails given that it fails. |

19 | On Truly Block Eigensolvers via Riemannian Optimization | Zhiqiang Xu, Xin Gao | We thus propose the concept of generalized k-th gap. |

20 | Layerwise Systematic Scan: Deep Boltzmann Machines and Beyond | Heng Guo, Kaan Kara, Ce Zhang | We show that the Gibbs sampler with a layerwise alternating scan order has its relaxation time (in terms of epochs) no larger than that of a random-update Gibbs sampler (in terms of variable updates). |

21 | IHT dies hard: Provable accelerated Iterative Hard Thresholding | Rajiv Khanna, Anastasios Kyrillidis | We study, both in theory and practice, the use of momentum motions in classic iterative hard thresholding (IHT) methods. |

22 | Finding Global Optima in Nonconvex Stochastic Semidefinite Optimization with Variance Reduction | Jinshan Zeng, Ke Ma, Yuan Yao | In this paper, we show that a stochastic gradient descent method with variance reduction can be adapted to solve the nonconvex reformulation of the original convex problem with global linear convergence, i.e., converging to a global optimum exponentially fast, given a proper initial choice in the restricted strongly convex case. |

23 | Outlier Detection and Robust Estimation in Nonparametric Regression | Dehan Kong, Howard Bondell, Weining Shen | We propose to include a subject-specific mean shift parameter for each data point such that a nonzero parameter will identify its corresponding data point as an outlier. |

24 | Integral Transforms from Finite Data: An Application of Gaussian Process Regression to Fourier Analysis | Luca Ambrogioni, Eric Maris | In this paper, we use Gaussian process regression to estimate the Fourier transform (or any other integral transform) without making these assumptions. |

25 | AdaGeo: Adaptive Geometric Learning for Optimization and Sampling | Gabriele Abbati, Alessandra Tosi, Michael Osborne, Seth Flaxman | In order to overcome these difficulties, we propose AdaGeo, a preconditioning framework for adaptively learning the geometry of the parameter space during optimization or sampling. |

26 | Online Learning with Non-Convex Losses and Non-Stationary Regret | Xiand Gao, Xiaobo Li, Shuzhong Zhang | In this paper, we consider online learning with non-convex loss functions. |

27 | Learning Determinantal Point Processes in Sublinear Time | Christophe Dupuy, Francis Bach | We propose a new class of determinantal point processes (DPPs) which can be manipulated for inference and parameter learning in potentially sublinear time in the number of items. |

28 | Nonlinear Structured Signal Estimation in High Dimensions via Iterative Hard Thresholding | Kaiqing Zhang, Zhuoran Yang, Zhaoran Wang | We study the high-dimensional signal estimation problem with nonlinear measurements, where the signal of interest is either sparse or low-rank. |

29 | Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis | Hiroyuki Kasai, Hiroyuki Sato, Bamdev Mishra | The present paper proposes a Riemannian stochastic quasi-Newton algorithm with variance reduction (R-SQN-VR). |

30 | Online Boosting Algorithms for Multi-label Ranking | Young Hun Jung, Ambuj Tewari | We consider the multi-label ranking approach to multi-label learning. |

31 | Zeroth-Order Online Alternating Direction Method of Multipliers: Convergence Analysis and Applications | Sijia Liu, Jie Chen, Pin-Yu Chen, Alfred Hero | In this paper, we design and analyze a new zeroth-order online algorithm, namely, the zeroth-order online alternating direction method of multipliers (ZOO-ADMM), which enjoys dual advantages of being gradient-free operation and employing the ADMM to accommodate complex structured regularizers. |

32 | High-Dimensional Bayesian Optimization via Additive Models with Overlapping Groups | Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, Volkan Cevher | In this paper, we consider the approach of Kandasamy et al. (2015), in which the high-dimensional function decomposes as a sum of lower-dimensional functions on subsets of the underlying variables. |

33 | Robust Active Label Correction | Jan Kremer, Fei Sha, Christian Igel | We approximate the true label noise by a model that learns the aspects of the noise that are class-conditional (i.e., independent of the input given the observed label). |

34 | Factorial HMMs with Collapsed Gibbs Sampling for Optimizing Long-term HIV Therapy | Amit Gruber, Chen Yanover, Tal El-Hay, Anders Sönnerborg, Vanni Borghi, Francesca Incardona, Yaara Goldschmidt | We present a novel generative model for HIV drug resistance evolution. |

35 | Optimal Submodular Extensions for Marginal Estimation | Pankaj Pansari, Chris Russell, M Pawan Kumar | Submodular extensions of an energy function can be used to efficiently compute approximate marginals via variational inference. |

36 | Semi-Supervised Learning with Competitive Infection Models | Nir Rosenfeld, Amir Globerson | Our goal in this work is to explore alternative mechanisms for propagating labels. |

37 | Discriminative Learning of Prediction Intervals | Nir Rosenfeld, Yishay Mansour, Elad Yom-Tov | In this work we consider the task of constructing prediction intervals in an inductive batch setting. |

38 | Topic Compositional Neural Language Model | Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, Lawrence Carin | We propose a Topic Compositional Neural Language Model (TCNLM), a novel method designed to simultaneously capture both the global semantic meaning and the local word-ordering structure in a document. |

39 | Learning Priors for Invariance | Eric Nalisnick, Padhraic Smyth | In this paper, we address the problem of how to specify an informative prior when the problem of interest is known to exhibit invariance properties. |

40 | Optimal Cooperative Inference | Scott Cheng-Hsin Yang, Yue Yu, Arash Givchi, Pei Wang, Wai Keen Vong, Patrick Shafto | We present such a framework. |

41 | Stochastic Multi-armed Bandits in Constant Space | David Liau, Zhao Song, Eric Price, Ger Yang | We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the win-loss record for all $K$ arms. |

42 | Matrix completability analysis via graph k-connectivity | Dehua Cheng, Natali Ruchansky, Yan Liu | In this paper, we make the observation that even when the observed matrix is too sparse for accurate completion, there may be portions of the data where completion is still possible. |

43 | FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods | Xiang Cheng, Fred Roosta, Stefan Palombo, Peter Bartlett, Michael Mahoney | We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. |

44 | Multi-view Metric Learning in Vector-valued Kernel Spaces | Riikka Huusari, Hachem Kadri, Cécile Capponi | We consider the problem of metric learning for multi-view data and present a novel method for learning within-view as well as between-view metrics in vector-valued kernel spaces, as a way to capture multi-modal structure of the data. |

45 | Gaussian Process Subset Scanning for Anomalous Pattern Detection in Non-iid Data | William Herlands, Edward McFowland, Andrew Wilson, Daniel Neill | We introduce methods for identifying anomalous patterns in non-iid data by combining Gaussian processes with novel log-likelihood ratio statistic and subset scanning techniques. |

46 | Dropout as a Low-Rank Regularizer for Matrix Factorization | Jacopo Cavazza, Pietro Morerio, Benjamin Haeffele, Connor Lane, Vittorio Murino, Rene Vidal | In this paper, we present a theoretical analysis of dropout for MF, where Bernoulli random variables are used to drop columns of the factors. |

47 | A Simple Analysis for Exp-concave Empirical Minimization with Arbitrary Convex Regularizer | Tianbao Yang, Zhe Li, Lijun Zhang | In this paper, we present a simple analysis of fast rates with high probability of empirical minimization for stochastic composite optimization over a finite-dimensional bounded convex set with exponential concave loss functions and an arbitrary convex regularization. |

48 | Independently Interpretable Lasso: A New Regularizer for Sparse Regression with Uncorrelated Variables | Masaaki Takada, Taiji Suzuki, Hironori Fujisawa | In this paper, we propose a new regularization method, “Independently Interpretable Lasso” (IILasso). |

49 | Boosting Variational Inference: an Optimization Perspective | Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, Gunnar Ratsch | In the present work, we study the convergence properties of this approach from a modern optimization viewpoint by establishing connections to the classic Frank-Wolfe algorithm. |

50 | Personalized and Private Peer-to-Peer Machine Learning | Aurélien Bellet, Rachid Guerraoui, Mahsa Taziki, Marc Tommasi | In this paper, we introduce an efficient algorithm to address the above problem in a fully decentralized (peer-to-peer) and asynchronous fashion, with provable convergence rate. |

51 | Tensor Regression Meets Gaussian Processes | Rose Yu, Guangyu Li, Yan Liu | In this paper, we demonstrate interesting connections between the two, especially for multi-way data analysis. |

52 | A Nonconvex Proximal Splitting Algorithm under Moreau-Yosida Regularization | Emanuel Laude, Tao Wu, Daniel Cremers | To overcome this difficulty, in this work we consider a lifted variant of the Moreau-Yosida regularized model and propose a novel multiblock primal-dual algorithm that intrinsically stabilizes the dual block. |

53 | Medoids in Almost-Linear Time via Multi-Armed Bandits | Vivek Bagaria, Govinda Kamath, Vasilis Ntranos, Martin Zhang, David Tse | We present an algorithm Med-dit to compute the medoid with high probability, which uses $O(n\log n)$ distance evaluations. |

54 | Regional Multi-Armed Bandits | Zhiyang Wang, Ruida Zhou, Cong Shen | We propose an efficient algorithm, UCB-g, that solves the regional bandit problem by combining the Upper Confidence Bound (UCB) and greedy principles. |

55 | Nearly second-order optimality of online joint detection and estimation via one-sample update schemes | Yang Cao, Liyan Xie, Yao Xie, Huan Xu | We show that for such problems, detection procedures based on sequential likelihood ratios with simple one-sample update estimates such as online mirror descent are nearly second-order optimal. |

56 | Sum-Product-Quotient Networks | Or Sharir, Amnon Shashua | We present a novel tractable generative model that extends Sum-Product Networks (SPNs) and significantly boosts their power. |

57 | Exploiting Strategy-Space Diversity for Batch Bayesian Optimization | Sunil Gupta, Alistair Shilton, Santu Rana, Svetha Venkatesh | This paper proposes a novel approach to batch Bayesian optimisation using a multi-objective optimisation framework with exploitation and exploration forming two objectives. |

58 | Beating Monte Carlo Integration: a Nonasymptotic Study of Kernel Smoothing Methods | Stephan Clémençon, François Portier | This paper is devoted to the study of a kernel smoothing based competitor built from a sequence of $n\geq 1$ i.i.d. random vectors with arbitrary continuous probability distribution $f(x)dx$, originally proposed in Delyon et al. (2016), from a nonasymptotic perspective. |

59 | Group invariance principles for causal generative models | Michel Besserve, Naji Shajarisales, Bernhard Schölkopf, Dominik Janzing | Our aim in this paper is to propose a group theoretic framework for ICM to unify and generalize these approaches. |

60 | A Provable Algorithm for Learning Interpretable Scoring Systems | Nataliya Sokolovska, Yann Chevaleyre, Jean-Daniel Zucker | In this contribution, we introduce an original methodology to simultaneously learn interpretable binning mapped to a class variable, and the weights associated with these bins contributing to the score. |

61 | Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes | Hyunjik Kim, Yee Whye Teh | We propose Scalable Kernel Composition (SKC), a scalable kernel search algorithm that extends the Automatic Statistician to bigger data sets. |

62 | Efficient Bandit Combinatorial Optimization Algorithm with Zero-suppressed Binary Decision Diagrams | Shinsaku Sakaue, Masakazu Ishihata, Shin-ichi Minato | To avoid dealing with such huge action sets directly, we propose an algorithm that takes advantage of zero-suppressed binary decision diagrams, which encode action sets as compact graphs. |

63 | Transfer Learning on fMRI Datasets | Hejia Zhang, Po-Hsuan Chen, Peter Ramadge | A method is introduced to improve prediction accuracy on a primary fMRI dataset by jointly learning a model using other secondary fMRI datasets. |

64 | An Optimization Approach to Learning Falling Rule Lists | Chaofan Chen, Cynthia Rudin | We propose an optimization approach to learning falling rule lists and "softly" falling rule lists, along with Monte-Carlo search algorithms that use bounds on the optimal solution to prune the search space. |

65 | Catalyst for Gradient-based Nonconvex Optimization | Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, Zaid Harchaoui | We introduce a generic scheme to solve nonconvex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. |

66 | Benefits from Superposed Hawkes Processes | Hongteng Xu, Dixin Luo, Xu Chen, Lawrence Carin | We investigate superposed Hawkes process as an important class of such models, with properties studied in the framework of least squares estimation. |

67 | Nonparametric Preference Completion | Julian Katz-Samuels, Clayton Scott | We propose a k-nearest neighbors-like algorithm and prove that it is consistent. |

68 | Non-parametric estimation of Jensen-Shannon Divergence in Generative Adversarial Network training | Mathieu Sinn, Ambrish Rawat | This work presents a rigorous statistical analysis of GANs providing straight-forward explanations for common training pathologies such as vanishing gradients. |

69 | Efficient and principled score estimation with Nyström kernel exponential families | Dougal Sutherland, Heiko Strathmann, Michael Arbel, Arthur Gretton | We propose a fast method with statistical guarantees for learning an exponential family density model where the natural parameter is in a reproducing kernel Hilbert space, and may be infinite dimensional. |

70 | Symmetric Variational Autoencoder and Connections to Adversarial Learning | Liqun Chen, Shuyang Dai, Yunchen Pu, Erjin Zhou, Chunyuan Li, Qinliang Su, Changyou Chen, Lawrence Carin | Symmetric Variational Autoencoder and Connections to Adversarial Learning |

71 | Few-shot Generative Modelling with Generative Matching Networks | Sergey Bartunov, Dmitry Vetrov | We develop a new generative model called Generative Matching Network which is inspired by the recently proposed matching networks for one-shot learning in discriminative tasks. |

72 | Nonlinear Weighted Finite Automata | Tianyu Li, Guillaume Rabusseau, Doina Precup | Weighted finite automata (WFA) can expressively model functions defined over strings but are inherently linear models. Given the recent successes of non-linear models in machine learning, it is natural to wonder whether extending WFA to the non-linear setting would be beneficial. In this paper, we propose a novel model of neural network based nonlinear WFA model (NL-WFA) along with a learning algorithm. |

73 | Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models | Hugh Salimbeni, Stefanos Eleftheriadis, James Hensman | The natural gradient method has been used effectively in conjugate Gaussian process models, but the non-conjugate case has been largely unexplored. |

74 | Variational inference for the multi-armed contextual bandit | Iñigo Urteaga, Chris Wiggins | We consider contextual multi-armed bandit applications where the true reward distribution is unknown and complex, which we approximate with a mixture model whose parameters are inferred via variational inference. |

75 | Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods | Robert Gower, Nicolas Le Roux, Francis Bach | Our goal is to improve variance reducing stochastic methods through better control variates. |

76 | Subsampling for Ridge Regression via Regularized Volume Sampling | Michal Derezinski, Manfred Warmuth | We propose a new procedure for selecting the subset of vectors, such that the ridge estimator obtained from that subset offers strong statistical guarantees in terms of the mean squared prediction error over the entire dataset of n labeled vectors. |

77 | Scalable Gaussian Processes with Billions of Inducing Inputs via Tensor Train Decomposition | Pavel Izmailov, Alexander Novikov, Dmitry Kropotov | We propose a method (TT-GP) for approximate inference in Gaussian Process (GP) models. |

78 | Batch-Expansion Training: An Efficient Optimization Framework | Michal Derezinski, Dhruv Mahajan, S. Sathiya Keerthi, S. V. N. Vishwanathan, Markus Weimer | We propose Batch-Expansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset. |

79 | Batched Large-scale Bayesian Optimization in High-dimensional Spaces | Zi Wang, Clement Gehring, Pushmeet Kohli, Stefanie Jegelka | In this paper, we propose ensemble Bayesian optimization (EBO) to address three current challenges in BO simultaneously: (1) large-scale observations; (2) high dimensional input spaces; and (3) selections of batch queries that balance quality and diversity. |

80 | Temporally-Reweighted Chinese Restaurant Process Mixtures for Clustering, Imputing, and Forecasting Multivariate Time Series | Feras Saad, Vikash Mansinghka | This article proposes a Bayesian nonparametric method for forecasting, imputation, and clustering in sparsely observed, multivariate time series data. |

81 | Stochastic Three-Composite Convex Minimization with a Linear Operator | Renbo Zhao, Volkan Cevher | We develop a primal-dual convex minimization framework to solve a class of stochastic convex three-composite problem with a linear operator. |

82 | Direct Learning to Rank And Rerank | Cynthia Rudin, Yining Wang | Learning-to-rank techniques have proven to be extremely useful for prioritization problems, where we rank items in order of their estimated probabilities, and dedicate our limited resources to the top-ranked items. |

83 | One-shot Coresets: The Case of k-Clustering | Olivier Bachem, Mario Lucic, Silvio Lattanzi | In this work, we affirmatively answer this question by proposing an efficient algorithm that constructs such one-shot summaries for k-clustering problems while retaining strong theoretical guarantees. |

84 | Random Warping Series: A Random Features Method for Time-Series Embedding | Lingfei Wu, Ian En-Hsu Yen, Jinfeng Yi, Fangli Xu, Qi Lei, Michael Witbrock | In this work, we study a family of alignment-aware positive definite (p.d.) kernels, with its feature embedding given by a distribution of Random Warping Series (RWS). |

85 | Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD | Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, Priya Nagpurkar | In this work we present the first theoretical characterization of the speed-up offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). |

86 | Variational Inference based on Robust Divergences | Futoshi Futami, Issei Sato, Masashi Sugiyama | In this paper, based on Zellner’s optimization and variational formulation of Bayesian inference, we propose an outlier-robust pseudo-Bayesian variational method by replacing the Kullback-Leibler divergence used for data fitting to a robust divergence such as the beta- and gamma-divergences. |

87 | Variational Rejection Sampling | Aditya Grover, Ramki Gummadi, Miguel Lazaro-Gredilla, Dale Schuurmans, Stefano Ermon | We propose a novel rejection sampling step that discards samples from the variational posterior which are assigned low likelihoods by the model. |

88 | Best arm identification in multi-armed bandits with delayed feedback | Aditya Grover, Todor Markov, Peter Attia, Norman Jin, Nicolas Perkins, Bryan Cheong, Michael Chen, Zi Yang, Stephen Harris, William Chueh, Stefano Ermon | In this paper, we propose a generalization of the best arm identification problem in stochastic multi-armed bandits (MAB) to the setting where every pull of an arm is associated with delayed feedbacks. |

89 | A fully adaptive algorithm for pure exploration in linear bandits | Liyuan Xu, Junya Honda, Masashi Sugiyama | We propose the first fully-adaptive algorithm for pure exploration in linear bandits: the task to find the arm with the largest expected reward, which depends on an unknown parameter linearly. |

90 | Contextual Bandits with Stochastic Experts | Rajat Sen, Karthikeyan Shanmugam, Sanjay Shakkottai | We propose upper-confidence bound (UCB) algorithms for this problem, which employ two different importance sampling based estimators for the mean reward for each expert. |

91 | Human Interaction with Recommendation Systems | Sven Schmit, Carlos Riquelme | We propose a simple model where users with heterogeneous preferences arrive over time. |

92 | Community Detection in Hypergraphs: Optimal Statistical Limit and Efficient Algorithms | I Chien, Chung-Yi Lin, I-Hsiang Wang | In this paper, community detection in hypergraphs is explored. |

93 | Smooth and Sparse Optimal Transport | Mathieu Blondel, Vivien Seguy, Antoine Rolet | In this paper, we explore regularizing the primal and dual OT formulations with a strongly convex term, which corresponds to relaxing the dual and primal constraints with smooth approximations. |

94 | Robust Maximization of Non-Submodular Objectives | Ilija Bogunovic, Junyao Zhao, Volkan Cevher | In this work, we present a new algorithm OBLIVIOUS-GREEDY and prove the first constant-factor approximation guarantees for a wider class of non-submodular objectives. |

95 | Cause-Effect Inference by Comparing Regression Errors | Patrick Bloebaum, Dominik Janzing, Takashi Washio, Shohei Shimizu, Bernhard Schoelkopf | We address the problem of inferring the causal relation between two variables by comparing the least-squares errors of the predictions in both possible causal directions. |

96 | Tree-based Bayesian Mixture Model for Competing Risks | Alexis Bellot, Mihaela Schaar | We aim with this setting to provide accurate individual estimates but also interpretable conclusions for use as a clinical decision support tool. |

97 | Actor-Critic Fictitious Play in Simultaneous Move Multistage Games | Julien Perolat, Bilal Piot, Olivier Pietquin | Using an architecture inspired by actor-critic algorithms, we build a stochastic approximation of the fictitious play process. |

98 | Random Subspace with Trees for Feature Selection Under Memory Constraints | Antonio Sutera, Célia Châtel, Gilles Louppe, Louis Wehenkel, Pierre Geurts | In this paper, we consider the problem of feature selection in applications where the memory is not large enough to contain all features. |

99 | Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information | Jakob Runge | Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information |

100 | Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures | Tomi Silander, Janne Leppä-aho, Elias Jääsaari, Teemu Roos | We introduce an information theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). |

101 | Convex Optimization over Intersection of Simple Sets: improved Convergence Rate Guarantees via an Exact Penalty Approach | Achintya Kundu, Francis Bach, Chiranjib Bhattacharya | We consider the problem of minimizing a convex function over the intersection of finitely many simple sets which are easy to project onto. |

102 | Variational Sequential Monte Carlo | Christian Naesseth, Scott Linderman, Rajesh Ranganath, David Blei | In this paper we present a new approximating family of distributions, the variational sequential Monte Carlo (VSMC) family, and show how to optimize it in variational inference. |

103 | Statistically Efficient Estimation for Non-Smooth Probability Densities | Masaaki Imaizumi, Takanori Maehara, Yuichi Yoshida | In this paper, we propose new estimators for non-smooth density functions by employing the notion of Szemeredi partitions from graph theory. |

104 | SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning | Xu Hu, Guillaume Obozinski | We propose an efficient dual augmented Lagrangian formulation to learn conditional random fields (CRF). |

105 | Generalized Concomitant Multi-Task Lasso for Sparse Multimodal Regression | Mathurin Massias, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon | We provide new statistical and computational solutions to perform heteroscedastic regression, with an emphasis on functional brain imaging with magneto- and electroencephalography (M/EEG). |

106 | Gradient Layer: Enhancing the Convergence of Adversarial Training for Generative Models | Atsushi Nitanda, Taiji Suzuki | We propose a new technique that boosts the convergence of training generative adversarial networks. |

107 | Statistical Sparse Online Regression: A Diffusion Approximation Perspective | Jianqing Fan, Wenyan Gong, Chris Junchi Li, Qiang Sun | In this paper, we propose to adopt the diffusion approximation techniques to study online regression. |

108 | Guaranteed Sufficient Decrease for Stochastic Variance Reduced Gradient Optimization | Fanhua Shang, Yuanyuan Liu, Kaiwen Zhou, James Cheng, Kelvin Kai Wing Ng, Yuichi Yoshida | In this paper, we propose a novel sufficient decrease technique for stochastic variance reduced gradient descent methods such as SVRG and SAGA. |

109 | Delayed Sampling and Automatic Rao-Blackwellization of Probabilistic Programs | Lawrence Murray, Daniel Lundén, Jan Kudlicka, David Broman, Thomas Schön | We introduce a dynamic mechanism for the solution of analytically-tractable substructure in probabilistic programs, using conjugate priors and affine transformations to reduce variance in Monte Carlo estimators. |

110 | Learning to Round for Discrete Labeling Problems | Pritish Mohapatra, Jawahar C.V., M Pawan Kumar | We present a novel interpretation of rounding procedures as sampling from a latent variable model, which opens the door to the use of powerful machine learning formulations in their design. |

111 | Approximate Ranking from Pairwise Comparisons | Reinhard Heckel, Max Simchowitz, Kannan Ramchandran, Martin Wainwright | In this paper we consider the problem of finding approximate rankings from pairwise comparisons. |

112 | Semi-Supervised Prediction-Constrained Topic Models | Michael Hughes, Gabriel Hope, Leah Weiner, Thomas McCoy, Roy Perlis, Erik Sudderth, Finale Doshi-Velez | We propose a framework for training supervised latent Dirichlet allocation that balances two goals: faithful generative explanations of high-dimensional data and accurate prediction of associated class labels. |

113 | A Stochastic Differential Equation Framework for Guiding Online User Activities in Closed Loop | Yichen Wang, Evangelos Theodorou, Apurv Verma, Le Song | In this paper, we propose a framework to reformulate point processes into stochastic differential equations, which allows us to extend methods from stochastic optimal control to address the activity guiding problem. |

114 | Accelerated Stochastic Mirror Descent: From Continuous-time Dynamics to Discrete-time Algorithms | Pan Xu, Tianhao Wang, Quanquan Gu | We present a new framework to analyze accelerated stochastic mirror descent through the lens of continuous-time stochastic dynamic systems. |

115 | A Unified Framework for Nonconvex Low-Rank plus Sparse Matrix Recovery | Xiao Zhang, Lingxiao Wang, Quanquan Gu | We propose a unified framework to solve general low-rank plus sparse matrix recovery problems based on matrix factorization, which covers a broad family of objective functions satisfying the restricted strong convexity and smoothness conditions. |

116 | Bayesian Nonparametric Poisson-Process Allocation for Time-Sequence Modeling | Hongyi Ding, Mohammad Khan, Issei Sato, Masashi Sugiyama | In this work, we present the Bayesian nonparametric Poisson process allocation (BaNPPA), a latent-function model for time-sequences, which automatically infers the number of latent functions. |

117 | Factor Analysis on a Graph | Masayuki Karasuyama, Hiroshi Mamitsuka | We propose a Gaussian based analysis which is a combination of graph constrained covariance matrix estimation and factor analysis (FA). |

118 | Crowdclustering with Partition Labels | Junxiang Chen, Yale Chang, Peter Castaldi, Michael Cho, Brian Hobbs, Jennifer Dy | In this paper, we propose a crowdclustering model that directly analyzes partition labels. |

119 | Learning Structural Weight Uncertainty for Sequential Decision-Making | Ruiyi Zhang, Chunyuan Li, Changyou Chen, Lawrence Carin | We propose efficient posterior learning of structural weight uncertainty, within an SVGD framework, by employing matrix variate Gaussian priors on NN parameters. |

120 | Towards Memory-Friendly Deterministic Incremental Gradient Method | Jiahao Xie, Hui Qian, Zebang Shen, Chao Zhang | In this paper, we propose a new deterministic variant of the IG method SVRG that blends a periodically updated full gradient with a component function gradient selected in a cyclic order. |

121 | Optimality of Approximate Inference Algorithms on Stable Instances | Hunter Lang, David Sontag, Aravindan Vijayaraghavan | The goal of this paper is to partially explain the performance of α-expansion and an LP relaxation algorithm on MAP inference in Ferromagnetic Potts models (FPMs). |

122 | Bayesian Approaches to Distribution Regression | Ho Chung Leon Law, Dougal Sutherland, Dino Sejdinovic, Seth Flaxman | We frame our models in a neural network style, allowing for simple MAP inference using backpropagation to learn the parameters, as well as MCMC-based inference which can fully propagate uncertainty. |

123 | Submodularity on Hypergraphs: From Sets to Sequences | Marko Mitrovic, Moran Feldman, Andreas Krause, Amin Karbasi | In this paper, we introduce two new algorithms that provably give constant factor approximations for general graphs and hypergraphs having bounded in or out degrees. |

124 | Provable Estimation of the Number of Blocks in Block Models | Bowei Yan, Purnamrita Sarkar, Xiuyuan Cheng | In this paper, we propose an approach based on semi-definite relaxations, which does not require prior knowledge of model parameters like many existing convex relaxation methods and recovers the number of clusters and the clustering matrix exactly under a broad parameter regime, with probability tending to one. |

125 | Differentially Private Regression with Gaussian Processes | Michael Smith, Mauricio Álvarez, Max Zwiessele, Neil D. Lawrence | We propose a method using GPs to provide differentially private (DP) regression. |

126 | Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems | Sai Praneeth Reddy Karimireddy, Sebastian Stich, Martin Jaggi | In this work, we propose a new framework, Approx Composite Minimization (ACM) that uses approximate update steps to ensure balance between the two operations. |

127 | VAE with a VampPrior | Jakub Tomczak, Max Welling | In this paper, we propose to extend the variational auto-encoder (VAE) framework with a new type of prior which we call "Variational Mixture of Posteriors" prior, or VampPrior for short. |

128 | Structured Factored Inference for Probabilistic Programming | Avi Pfeffer, Brian Ruttenberg, William Kretschmer, Alison O'Connor | We present structured factored inference (SFI), a framework that enables factored inference algorithms to scale to significantly more complex programs. |

129 | A Generic Approach for Escaping Saddle points | Sashank Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, Alex Smola | To tackle this challenge, we introduce a generic framework that minimizes Hessian-based computations while at the same time provably converging to second-order critical points. |

130 | Policy Evaluation and Optimization with Continuous Treatments | Nathan Kallus, Angela Zhou | We study the problem of policy evaluation and learning from batched contextual bandit data when treatments are continuous, going beyond previous work on discrete treatments. |

131 | Multiphase MCMC Sampling for Parameter Inference in Nonlinear Ordinary Differential Equations | Alan Lazarus, Dirk Husmeier, Theodore Papamarkou | This paper presents a multiphase MCMC approach that attempts to close the gap between efficiency and accuracy. |

132 | Why Adaptively Collected Data Have Negative Bias and How to Correct for It | Xinkun Nie, Xiaoying Tian, Jonathan Taylor, James Zou | In this paper, we prove that when the data collection procedure satisfies natural conditions, then sample means of the data have systematic negative biases. |

133 | Sparse Linear Isotonic Models | Sheng Chen, Arindam Banerjee | In this paper, we introduce sparse linear isotonic models (SLIMs) for high-dimensional problems by hybridizing ideas in parametric sparse linear models and AIMs, which enjoy a few appealing advantages over both. |

134 | Robustness of classifiers to uniform $\ell_p$ and Gaussian noise | Jean-Yves Franceschi, Alhussein Fawzi, Omar Fawzi | We study the robustness of classifiers to various kinds of random noise models. |

135 | Nested CRP with Hawkes-Gaussian Processes | Xi Tan, Vinayak Rao, Jennifer Neville | In this paper, we propose a novel nonparametric Bayesian model that incorporates senders and receivers of messages into a hierarchical structure that governs the content and reciprocity of communications. |

136 | Sketching for Kronecker Product Regression and P-splines | Huaian Diao, Zhao Song, Wen Sun, David Woodruff | We take TensorSketch outside of the context of polynomial kernels, and show its utility in applications in which the underlying design matrix is a Kronecker product of smaller matrices. |

137 | Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models | Ardavan Saeedi, Matthew Hoffman, Stephen DiVerdi, Asma Ghandeharioun, Matthew Johnson, Ryan Adams | In this work, we develop a statistical model that meets these objectives. |

138 | Cheap Checking for Cloud Computing: Statistical Analysis via Annotated Data Streams | Chris Hickey, Graham Cormode | Our work aims to provide fast and practical methods to verify analysis of large data sets, where the client’s computation and memory costs are kept to a minimum. |

139 | Minimax Reconstruction Risk of Convolutional Sparse Dictionary Learning | Shashank Singh, Barnabas Poczos, Jian Ma | We compare our results to similar results for IID SDL and verify our theory with synthetic experiments. |

140 | Kernel Conditional Exponential Family | Michael Arbel, Arthur Gretton | An algorithm is provided for learning the generalized natural parameter, and consistency of the estimator is established in the well specified case. |

141 | Linear Stochastic Approximation: How Far Does Constant Step-Size and Iterate Averaging Go? | Chandrashekar Lakshminarayanan, Csaba Szepesvari | In this paper, we study a constant step-size averaged linear stochastic approximation (CALSA) algorithm, and for a given class of problems, we ask whether properties of $i)$ a universal constant step-size and $ii)$ a uniform fast rate of $\frac{C}{t}$ for the mean square-error hold for all instances of the class, where the constant $C>0$ does not depend on the problem instance. |

142 | Stochastic Zeroth-order Optimization in High Dimensions | Yining Wang, Simon Du, Sivaraman Balakrishnan, Aarti Singh | Under sparsity assumptions on the gradients or function values, we present two algorithms: a successive component/feature selection algorithm and a noisy mirror descent algorithm using Lasso gradient estimates, and show that both algorithms have convergence rates that depend only logarithmically on the ambient dimension of the problem. |

143 | Teacher Improves Learning by Selecting a Training Subset | Yuzhe Ma, Robert Nowak, Philippe Rigollet, Xuezhou Zhang, Xiaojin Zhu | For general learners, we provide a mixed-integer nonlinear programming-based algorithm to find a super teaching set. |

144 | Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation | Penporn Koanantakool, Alnur Ali, Ariful Azad, Aydin Buluc, Dmitriy Morozov, Leonid Oliker, Katherine Yelick, Sang-Yun Oh | To address these deficiencies, we introduce HP-CONCORD, a highly scalable optimization method for estimating a sparse inverse covariance matrix based on a regularized pseudolikelihood framework, without assuming Gaussianity. |

145 | Robust Vertex Enumeration for Convex Hulls in High Dimensions | Pranjal Awasthi, Bahman Kalantari, Yikai Zhang | We design a fast and robust algorithm named All Vertex Triangle Algorithm (AVTA) for detecting the vertices of the convex hull of a set of points in high dimensions. |

146 | Fast generalization error bound of deep learning from a kernel perspective | Taiji Suzuki | We show that the optimal width of the internal layers can be determined through the degree of freedom and derive the optimal convergence rate that is faster than $O(1/\sqrt{n})$ rate which has been shown in the existing studies. |

147 | Product Kernel Interpolation for Scalable Gaussian Processes | Jacob Gardner, Geoff Pleiss, Ruihan Wu, Kilian Weinberger, Andrew Wilson | We develop a new technique for MVM based learning that exploits product kernel structure. |

148 | Towards Provable Learning of Polynomial Neural Networks Using Low-Rank Matrix Estimation | Mohammadreza Soltani, Chinmay Hegde | In this context, we propose two novel, non-convex training algorithms which do not need any extra tuning parameters other than the number of hidden neurons. |

149 | Scalable Generalized Dynamic Topic Models | Patrick Jähnichen, Florian Wenzel, Marius Kloft, Stephan Mandt | In this paper, we present several new results around DTMs. |

150 | Bayesian Structure Learning for Dynamic Brain Connectivity | Michael Andersen, Ole Winther, Lars Kai Hansen, Russell Poldrack, Oluwasanmi Koyejo | This manuscript proposes a novel Bayesian model for dynamic brain connectivity. |

151 | Large Scale Empirical Risk Minimization via Truncated Adaptive Newton Method | Mark Eisen, Aryan Mokhtari, Alejandro Ribeiro | This paper proposes a novel adaptive sample size second-order method, which reduces the cost of computing the Hessian by solving a sequence of ERM problems corresponding to a subset of samples and lowers the cost of computing the Hessian inverse using a truncated eigenvalue decomposition. |

152 | Frank-Wolfe Splitting via Augmented Lagrangian Method | Gauthier Gidel, Fabian Pedregosa, Simon Lacoste-Julien | In this work, we develop and analyze the Frank-Wolfe Augmented Lagrangian (FW-AL) algorithm, a method for minimizing a smooth function over convex compact sets related by a “linear consistency” constraint that only requires access to a linear minimization oracle over the individual constraints. |

153 | Learning linear structural equation models in polynomial time and sample complexity | Asish Ghoshal, Jean Honorio | We develop a new algorithm — which is computationally and statistically efficient and works in the high-dimensional regime — for learning linear SEMs from purely observational data with arbitrary noise distribution. |

154 | Convergence diagnostics for stochastic gradient descent with constant learning rate | Jerry Chee, Panos Toulis | In this paper, we develop a statistical diagnostic test to detect such phase transition in the context of stochastic gradient descent with constant learning rate. |

155 | Learning Sparse Polymatrix Games in Polynomial Time and Sample Complexity | Asish Ghoshal, Jean Honorio | We consider the problem of learning sparse polymatrix games from observations of strategic interactions. |

156 | Nonparametric Sharpe Ratio Function Estimation in Heteroscedastic Regression Models via Convex Optimization | Seung-Jean Kim, Johan Lim, Joong-Ho Won | We propose to solve the problem by solving a sequence of finite-dimensional convex programs with increasing dimensions, which can be done globally and efficiently. |

157 | Stochastic algorithms for entropy-regularized optimal transport problems | Brahim Khalil Abid, Robert Gower | In this work we develop a family of fast and practical stochastic algorithms for solving the optimal transport problem with an entropic penalization. |

158 | Plug-in Estimators for Conditional Expectations and Probabilities | Steffen Grunewalder | We study plug-in estimators of conditional expectations and probabilities, and we provide a systematic analysis of their rates of convergence. |

159 | Factorized Recurrent Neural Architectures for Longer Range Dependence | Francois Belletti, Alex Beutel, Sagar Jain, Ed Chi | In this article, we apply the theory of LRD stochastic processes to modern recurrent architectures, such as LSTMs and GRUs, and prove they do not provide LRD under assumptions sufficient for gradients to vanish. |

160 | On the Statistical Efficiency of Compositional Nonparametric Prediction | Yixi Xu, Jean Honorio, Xiao Wang | In this paper, we propose a compositional nonparametric method in which a model is expressed as a labeled binary tree of $2k+1$ nodes, where each node is either a summation, a multiplication, or the application of one of the $q$ basis functions to one of the $p$ covariates. |

161 | Metrics for Deep Generative Models | Nutan Chen, Alexej Klushyn, Richard Kurle, Xueyan Jiang, Justin Bayer, Patrick van der Smagt | The method yields a principled distance measure, provides a tool for visual inspection of deep generative models, and an alternative to linear interpolation in latent space. |

162 | Combinatorial Penalties: Which structures are preserved by convex relaxations? | Marwa El Halabi, Francis Bach, Volkan Cevher | We consider the homogeneous and the non-homogeneous convex relaxations for combinatorial penalty functions defined on support sets. |

163 | Generalized Binary Search For Split-Neighborly Problems | Stephen Mussmann, Percy Liang | In this paper, we introduce a weaker condition, split-neighborly, which requires that for the set of hypotheses two neighbors disagree on, any subset is splittable by some test. |

164 | Intersection-Validation: A Method for Evaluating Structure Learning without Ground Truth | Jussi Viinikka, Ralf Eggeling, Mikko Koivisto | This work introduces a method to compare algorithms’ ability to learn the model structure, assuming no ground truth is given. |

165 | On Statistical Optimality of Variational Bayes | Debdeep Pati, Anirban Bhattacharya, Yun Yang | The article addresses a long-standing open problem on the justification of using variational Bayes methods for parameter estimation. |

166 | Minimax-Optimal Privacy-Preserving Sparse PCA in Distributed Systems | Jason Ge, Zhaoran Wang, Mengdi Wang, Han Liu | This paper proposes a distributed privacy-preserving sparse PCA (DPS-PCA) algorithm that generates a minimax-optimal sparse PCA estimator under differential privacy constraints. |

167 | Online Regression with Partial Information: Generalization and Linear Projection | Shinji Ito, Daisuke Hatano, Hanna Sumita, Akihiro Yabe, Takuro Fukunaga, Naonori Kakimura, Ken-Ichi Kawarabayashi | In this paper, we propose a general setting for the limitation of the available information, where the observed information is determined by a function chosen from a given set of observation functions. |

168 | Learning Generative Models with Sinkhorn Divergences | Aude Genevay, Gabriel Peyre, Marco Cuturi | This paper presents the first tractable method to train large scale generative models using an OT-based loss called Sinkhorn loss which tackles these three issues by relying on two key ideas: (a) entropic smoothing, which turns the original OT loss into a differentiable and more robust quantity that can be computed using Sinkhorn fixed point iterations; (b) algorithmic (automatic) differentiation of these iterations with seamless GPU execution. |

169 | Reparameterizing the Birkhoff Polytope for Variational Permutation Inference | Scott Linderman, Gonzalo Mena, Hal Cooper, Liam Paninski, John Cunningham | Combinatorial optimization algorithms may enable efficient point estimation, but fully Bayesian inference poses a severe challenge in this high-dimensional, discrete space. |

170 | Achieving the time of 1-NN, but the accuracy of k-NN | Lirong Xue, Samory Kpotufe | We propose a simple approach which, given distributed computing resources, can nearly achieve the accuracy of k-NN prediction, while matching (or improving) the faster prediction time of 1-NN. |

171 | Efficient Weight Learning in High-Dimensional Untied MLNs | Khan Mohammad Al Farabi, Somdeb Sarkhel, Deepak Venugopal | In this paper, we present an approach to perform efficient weight learning in MLNs containing high-dimensional, untied formulas. |

172 | Learning with Complex Loss Functions and Constraints | Harikrishna Narasimhan | We develop a general approach for solving constrained classification problems, where the loss and constraints are defined in terms of a general function of the confusion matrix. |

173 | Solving lp-norm regularization with tensor kernels | Saverio Salzo, Lorenzo Rosasco, Johan Suykens | In this paper, we discuss how a suitable family of tensor kernels can be used to efficiently solve nonparametric extensions of lp regularized learning methods. |

174 | Weighted Tensor Decomposition for Learning Latent Variables with Partial Data | Omer Gottesman, Weiwei Pan, Finale Doshi-Velez | In this work, we consider the case in which certain dimensions of the data are not always observed–common in applied settings, where not all measurements may be taken for all observations–resulting in moment estimates of varying quality. |

175 | Multi-objective Contextual Bandit Problem with Similarity Information | Eralp Turgay, Doruk Oner, Cem Tekin | In this paper we propose the multi-objective contextual bandit problem with similarity information. |

176 | Turing: A Language for Flexible Probabilistic Inference | Hong Ge, Kai Xu, Zoubin Ghahramani | In this work, we present a system called Turing for building MCMC algorithms for probabilistic programming inference. |

177 | Fast and Scalable Learning of Sparse Changes in High-Dimensional Gaussian Graphical Model Structure | Beilun Wang, Arshdeep Sekhon, Yanjun Qi | We propose a novel method, DIFFEE, for estimating DIFFerential networks via an Elementary Estimator under a high-dimensional situation. |

178 | Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control | Sanket Kamthe, Marc Deisenroth | To reduce the number of system interactions while simultaneously handling constraints, we propose a model-based RL framework based on probabilistic Model Predictive Control (MPC). |

179 | Approximate Bayesian Computation with Kullback-Leibler Divergence as Data Discrepancy | Bai Jiang | To bypass this difficulty, we adopt a Kullback-Leibler divergence estimator to assess the data discrepancy. |

180 | Practical Bayesian optimization in the presence of outliers | Ruben Martinez-Cantin, Kevin Tee, Michael McCourt | In this paper, we present an empirical evaluation of Bayesian optimization methods in the presence of outliers. |

181 | Competing with Automata-based Expert Sequences | Mehryar Mohri, Scott Yang | We consider a general framework of online learning with expert advice where regret is defined with respect to sequences of experts accepted by a weighted automaton. |

182 | Reducing Crowdsourcing to Graphon Estimation, Statistically | Devavrat Shah, Christina Lee | In this paper, we utilize a statistical reduction from crowdsourcing to graphon estimation to advance the state-of-art for both of these challenges. |

183 | Robust Locally-Linear Controllable Embedding | Ershad Banijamali, Rui Shu, Mohammad Ghavamzadeh, Hung Bui, Ali Ghodsi | In this paper, we present a new model for learning robust locally-linear controllable embedding (RCE). |

184 | Combinatorial Semi-Bandits with Knapsacks | Karthik Abinav Sankararaman, Aleksandrs Slivkins | We define a common generalization, support it with several motivating examples, and design an algorithm for it. |

185 | Structured Optimal Transport | David Alvarez-Melis, Tommi Jaakkola, Stefanie Jegelka | In this work, we develop a nonlinear generalization of (discrete) optimal transport that is able to reflect much additional structure. |

186 | Graphical Models for Non-Negative Data Using Generalized Score Matching | Shiqing Yu, Mathias Drton, Ali Shojaie | In this paper, we give a generalized form of score matching for non-negative data that improves estimation efficiency. |

187 | Asynchronous Doubly Stochastic Group Regularized Learning | Bin Gu, Zhouyuan Huo, Heng Huang | To address this challenging problem, in this paper, we propose a novel asynchronous doubly stochastic proximal gradient algorithm with variance reduction (AsyDSPG+). |

188 | Convergence of Value Aggregation for Imitation Learning | Ching-An Cheng, Byron Boots | In this paper, we debunk the common belief that value aggregation always produces a convergent policy sequence with improving performance. |

189 | Inference in Sparse Graphs with Pairwise Measurements and Side Information | Dylan Foster, Karthik Sridharan, Daniel Reichman | We present new algorithms and a sharp finite-sample analysis for this problem on trees and sparse graphs with poor expansion properties such as hypergrids and ring lattices. |

190 | Parallel and Distributed MCMC via Shepherding Distributions | Arkabandhu Chowdhury, Christopher Jermaine | In this paper, we present a general algorithmic framework for developing easily parallelizable/distributable Markov Chain Monte Carlo (MCMC) algorithms. |

191 | The Power Mean Laplacian for Multilayer Graph Clustering | Pedro Mercado, Antoine Gautier, Francesco Tudisco, Matthias Hein | We introduce in this paper a one-parameter family of matrix power means for merging the Laplacians from different layers and analyze it in expectation in the stochastic block model. |

192 | Adaptive Sampling for Coarse Ranking | Sumeet Katariya, Lalit Jain, Nandana Sengupta, James Evans, Robert Nowak | We propose a computationally efficient PAC algorithm LUCBRank for coarse ranking, and derive an upper bound on its sample complexity. |

193 | Comparison Based Learning from Weak Oracles | Ehsan Kazemi, Lin Chen, Sanjoy Dasgupta, Amin Karbasi | In this paper, we introduce a new weak oracle model, where a non-malicious user responds to a pairwise comparison query only when she is quite sure about the answer. |

194 | The Binary Space Partitioning-Tree Process | Xuhui Fan, Bin Li, Scott Sisson | In this work, we propose a self-consistent Binary Space Partitioning (BSP)-Tree process to generalize the Mondrian process. |

195 | On denoising modulo 1 samples of a function | Mihai Cucuringu, Hemant Tyagi | Given the samples $(x_i,y_i)_{i=1}^{n}$ our goal is to recover smooth, robust estimates of the clean samples $f(x_i) \bmod 1$. |

196 | Scalable Hash-Based Estimation of Divergence Measures | Morteza Noshad, Alfred Hero | We propose a scalable divergence estimation method based on hashing. |

197 | Conditional Gradient Method for Stochastic Submodular Maximization: Closing the Gap | Aryan Mokhtari, Hamed Hassani, Amin Karbasi | In this paper, we study the problem of constrained and stochastic continuous submodular maximization. |

198 | Online Continuous Submodular Maximization | Lin Chen, Hamed Hassani, Amin Karbasi | In this paper, we consider an online optimization process, where the objective functions are not convex (nor concave) but instead belong to a broad class of continuous submodular functions. |

199 | Efficient Bayesian Methods for Counting Processes in Partially Observable Environments | Ferdian Jovan, Jeremy Wyatt, Nick Hawes | We present two tractable approximations, which we combine in a switching filter. |

200 | Matrix-normal models for fMRI analysis | Michael Shvartsman, Narayanan Sundaram, Mikio Aoi, Adam Charles, Theodore Willke, Jonathan Cohen | Our primary theoretical contribution shows how some of these methods can be written as instantiations of the same model, allowing us to generalize them to flexibly modeling structured noise covariances. |

201 | The emergence of spectral universality in deep networks | Jeffrey Pennington, Samuel Schoenholz, Surya Ganguli | To this end, we leverage powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network’s Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth. |

202 | Spectral Algorithms for Computing Fair Support Vector Machines | Mahbod Olfat, Anil Aswani | This paper develops computationally tractable algorithms for designing accurate but fair support vector machines (SVMs). |

203 | Bayesian Multi-label Learning with Sparse Features and Labels, and Label Co-occurrences | He Zhao, Piyush Rai, Lan Du, Wray Buntine | We present a probabilistic, fully Bayesian framework for multi-label learning. |

204 | Nonparametric Bayesian sparse graph linear dynamical systems | Rahi Kalantari, Joydeep Ghosh, Mingyuan Zhou | Nonparametric Bayesian sparse graph linear dynamical systems |

205 | Proximity Variational Inference | Jaan Altosaar, Rajesh Ranganath, David Blei | In this paper, we develop proximity variational inference (PVI). |

206 | Near-Optimal Machine Teaching via Explanatory Teaching Sets | Yuxin Chen, Oisin Mac Aodha, Shihan Su, Pietro Perona, Yisong Yue | In this paper, we propose NOTES, a principled framework for constructing interpretable teaching sets, utilizing explanations to accelerate the teaching process. |

207 | Learning Hidden Quantum Markov Models | Siddarth Srinivasan, Geoff Gordon, Byron Boots | We extend previous work on HQMMs with three contributions: (1) we show how classical hidden Markov models (HMMs) can be simulated on a quantum circuit, (2) we reformulate HQMMs by relaxing the constraints for modeling HMMs on quantum circuits, and (3) we present a learning algorithm to estimate the parameters of an HQMM from data. |

208 | Labeled Graph Clustering via Projected Gradient Descent | Shiau Hong Lim, Gregory Calvez | Inspired by recent advances in non-convex approaches to low-rank recovery problems, we propose an algorithm based on projected gradient descent that enjoys similar provable guarantees as the convex counterpart, but can be orders of magnitude faster. |

209 | Gradient Diversity: a Key Ingredient for Scalable Distributed Learning | Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett | In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. |

210 | HONES: A Fast and Tuning-free Homotopy Method For Online Newton Step | Yuting Ye, Lihua Lei, Cheng Ju | In this article, we develop and analyze a homotopy continuation method, referred to as HONES, for solving the sequential generalized projections in Online Newton Step (Hazan et al., 2006b), as well as the generalized problem known as sequential standard quadratic programming. |

211 | Probability-Revealing Samples | Krzysztof Onak, Xiaorui Sun | We introduce a model in which every sample comes with the information about the probability of selecting it. |

212 | Derivative Free Optimization Via Repeated Classification | Tatsunori Hashimoto, Steve Yadlowsky, John Duchi | We develop a procedure for minimizing a function using $n$ batched function value measurements at each of $T$ rounds by using classifiers to identify a function’s sublevel set. |

213 | Online Ensemble Multi-kernel Learning Adaptive to Non-stationary and Adversarial Environments | Yanning Shen, Tianyi Chen, Georgios Giannakis | Leveraging the random feature approximation and its recent orthogonality-promoting variant, the present contribution develops an online multi-kernel learning scheme to infer the intended nonlinear function ‘on the fly.’ |

214 | A Unified Dynamic Approach to Sparse Model Selection | Chendi Huang, Yuan Yao | In this paper, we introduce a simple iterative regularization path, which follows the dynamics of a sparse Mirror Descent algorithm or a generalization of Linearized Bregman Iterations with nonlinear loss. |

215 | Bootstrapping EM via Power EM and Convergence in the Naive Bayes Model | Costis Daskalakis, Christos Tzamos, Manolis Zampetakis | We study the convergence properties of the Expectation-Maximization algorithm in the Naive Bayes model. |

216 | Dimensionality Reduced $\ell^{0}$-Sparse Subspace Clustering | Yingzhen Yang | In this paper, we present Dimensionality Reduced $\ell^{0}$-Sparse Subspace Clustering (DR-$\ell^{0}$-SSC). |