# Paper Digest: ICML 2015 Highlights

The International Conference on Machine Learning (ICML) is one of the top machine learning conferences in the world. In 2015 it was held in Lille, France.

To help the AI community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights/summaries to quickly get the main idea of each paper.

We thank all authors for writing these interesting papers, and readers for reading our digests. If you do not want to miss any interesting AI paper, you are welcome to **sign up for our free Paper Digest service** to get new paper updates customized to your own interests on a daily basis.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: ICML 2015 Papers

No. | Title | Authors | Highlight
---|---|---|---

1 | Stochastic Optimization with Importance Sampling for Regularized Loss Minimization | Peilin Zhao, Tong Zhang | In this paper we study stochastic optimization, including prox-SMD and prox-SDCA, with importance sampling, which improves the convergence rate by reducing the stochastic variance. |
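The core idea of importance sampling in stochastic optimization can be illustrated with a minimal sketch (not the paper's prox-SMD/prox-SDCA methods, just plain SGD): sample each row with probability proportional to its norm and rescale the gradient by 1/(n·pᵢ) so the update stays unbiased, which reduces the stochastic variance relative to uniform sampling.

```python
import numpy as np

def sgd_importance_sampling(A, b, steps=2000, lr=0.01, seed=0):
    """Minimize (1/2n)||Ax - b||^2 by sampling row i with probability
    proportional to ||a_i|| (a common importance-sampling choice) and
    rescaling the gradient by 1/(n * p_i) to keep it unbiased."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    p = np.linalg.norm(A, axis=1)
    p = p / p.sum()                      # non-uniform sampling distribution
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.choice(n, p=p)
        g = (A[i] @ x - b[i]) * A[i]     # gradient of the i-th loss term
        x -= lr * g / (n * p[i])         # importance weight keeps E[g] exact
    return x

# Toy least-squares problem with a known solution.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = A @ np.array([0.5, -0.25])
x = sgd_importance_sampling(A, b)
```

The data and step size here are illustrative; the rescaling by `1 / (n * p[i])` is the part that makes any sampling distribution yield an unbiased gradient estimate.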

2 | Approval Voting and Incentives in Crowdsourcing | Nihar Shah, Dengyong Zhou, Yuval Peres | In this paper, we address these issues by introducing approval voting to utilize the expertise of workers who have partial knowledge of the true answer, and coupling it with a (“strictly proper”) incentive-compatible compensation mechanism. |

3 | A low variance consistent test of relative dependency | Wacha Bounliphone, Arthur Gretton, Arthur Tenenhaus, Matthew Blaschko | We describe a novel non-parametric statistical hypothesis test of relative dependence between a source variable and two candidate target variables. |

4 | An Aligned Subtree Kernel for Weighted Graphs | Lu Bai, Luca Rossi, Zhihong Zhang, Edwin Hancock | In this paper, we develop a new entropic matching kernel for weighted graphs by aligning depth-based representations. |

5 | Spectral Clustering via the Power Method – Provably | Christos Boutsidis, Prabhanjan Kambadur, Alex Gittens | Specifically, we prove that solving the k-means clustering problem on the approximate eigenvectors obtained via the power method gives an additive-error approximation to solving the k-means problem on the optimal eigenvectors. |
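The pipeline this paper analyzes can be sketched in a few lines: approximate the top-k eigenvectors of the similarity matrix by block power iteration, then cluster the rows of that eigenvector matrix. This is a generic illustration of the power method, not the paper's analysis; the similarity matrix below is made up.

```python
import numpy as np

def power_method_eigvecs(S, k, iters=100, seed=0):
    """Approximate the top-k eigenvectors of a symmetric matrix S by
    subspace (block power) iteration: repeatedly multiply a random
    block by S and re-orthonormalize with QR."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((S.shape[0], k)))
    for _ in range(iters):
        Q, _ = np.linalg.qr(S @ Q)
    return Q

# Two well-separated groups: rows of the top-2 eigenvector matrix are
# identical within a group, so k-means on them recovers the split.
S = np.array([[0., 5, 5, 0, 0, 0],
              [5, 0., 5, 0, 0, 0],
              [5, 5, 0., 0, 0, 0],
              [0, 0, 0, 0., 5, 5],
              [0, 0, 0, 5, 0., 5],
              [0, 0, 0, 5, 5, 0.]])
V = power_method_eigvecs(S, k=2)
```

Running k-means on the rows of `V` (omitted here) would then give the spectral clustering; the paper's contribution is bounding how the power-method approximation affects that final k-means objective.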

6 | Information Geometry and Minimum Description Length Networks | Ke Sun, Jun Wang, Alexandros Kalousis, Stephan Marchand-Maillet | We present a geometric picture, where all these representations are regarded as free points in the space of probability distributions. |

7 | Efficient Training of LDA on a GPU by Mean-for-Mode Estimation | Jean-Baptiste Tristan, Joseph Tassarotti, Guy Steele | We introduce Mean-for-Mode estimation, a variant of an uncollapsed Gibbs sampler that we use to train LDA on a GPU. |

8 | Adaptive Stochastic Alternating Direction Method of Multipliers | Peilin Zhao, Jinwei Yang, Tong Zhang, Ping Li | In this paper, we present a new family of stochastic ADMM algorithms with optimal 2nd-order proximal functions, which produce a new family of adaptive stochastic ADMM methods. |

9 | A Lower Bound for the Optimization of Finite Sums | Alekh Agarwal, Leon Bottou | This paper presents a lower bound for optimizing a finite sum of n functions, where each function is L-smooth and the sum is μ-strongly convex. |

10 | Learning Word Representations with Hierarchical Sparse Coding | Dani Yogatama, Manaal Faruqui, Chris Dyer, Noah Smith | We propose a new method for learning word representations using hierarchical regularization in sparse coding inspired by the linguistic study of word meanings. |

11 | Learning Transferable Features with Deep Adaptation Networks | Mingsheng Long, Yue Cao, Jianmin Wang, Michael Jordan | In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural networks to the domain adaptation scenario. |

12 | Robust partially observable Markov decision process | Takayuki Osogami | Based on the convexity, we design a value-iteration algorithm for finding the robust policy. |

13 | On the Relationship between Sum-Product Networks and Bayesian Networks | Han Zhao, Mazen Melibari, Pascal Poupart | In this paper, we establish some theoretical connections between Sum-Product Networks (SPNs) and Bayesian Networks (BNs). |

14 | Learning from Corrupted Binary Labels via Class-Probability Estimation | Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, Bob Williamson | This paper uses class-probability estimation to study these and other corruption processes belonging to the mutually contaminated distributions framework (Scott et al., 2013), with three conclusions. |

15 | An Explicit Sampling Dependent Spectral Error Bound for Column Subset Selection | Tianbao Yang, Lijun Zhang, Rong Jin, Shenghuo Zhu | In this paper, we consider the problem of column subset selection. |

16 | A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate | Ohad Shamir | We describe and analyze a simple algorithm for principal component analysis and singular value decomposition, VR-PCA, which uses computationally cheap stochastic iterations, yet converges exponentially fast to the optimal solution. |

17 | Attribute Efficient Linear Regression with Distribution-Dependent Sampling | Doron Kukliansky, Ohad Shamir | We develop efficient algorithms for Ridge and Lasso linear regression, which utilize the geometry of the data by a novel distribution-dependent sampling scheme, and have excess risk bounds which are better by a factor of up to O(d/k) than the state-of-the-art, where d is the dimension and k+1 is the number of observed attributes per example. |

18 | Learning Local Invariant Mahalanobis Distances | Ethan Fetaya, Shimon Ullman | In this paper we propose a novel and computationally efficient way to learn a local Mahalanobis metric per datum, and show how we can learn a local invariant metric to any transformation in order to improve performance. |

19 | Finding Linear Structure in Large Datasets with Scalable Canonical Correlation Analysis | Zhuang Ma, Yichao Lu, Dean Foster | In this paper, we tackle the problem of large scale CCA, where classical algorithms, usually requiring the product of two huge matrices and a huge matrix decomposition, are expensive in both computation and storage. |

20 | Abstraction Selection in Model-based Reinforcement Learning | Nan Jiang, Alex Kulesza, Satinder Singh | Existing approaches have theoretical guarantees only under strong assumptions on the domain or asymptotically large amounts of data, but in this paper we propose a simple algorithm based on statistical hypothesis testing that comes with a finite-sample guarantee under assumptions on candidate abstractions. |

21 | Surrogate Functions for Maximizing Precision at the Top | Purushottam Kar, Harikrishna Narasimhan, Prateek Jain | In this paper we make key contributions in these directions. |

22 | Optimizing Non-decomposable Performance Measures: A Tale of Two Classes | Harikrishna Narasimhan, Purushottam Kar, Prateek Jain | In this paper we reveal that for two large families of performance measures that can be expressed as functions of true positive/negative rates, it is indeed possible to implement point stochastic updates. |

23 | Coresets for Nonparametric Estimation – the Case of DP-Means | Olivier Bachem, Mario Lucic, Andreas Krause | We explore the use of coresets – a data summarization technique originating from computational geometry – for this task. |

24 | A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits | Pratik Gajane, Tanguy Urvoy, Fabrice Clérot | We propose a new algorithm called Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) to handle the adversarial utility-based formulation of this problem. |

25 | Functional Subspace Clustering with Application to Time Series | Mohammad Taha Bahadori, David Kale, Yingying Fan, Yan Liu | To address these challenges, we propose a new framework called Functional Subspace Clustering (FSC). |

26 | Accelerated Online Low Rank Tensor Learning for Multivariate Spatiotemporal Streams | Rose Yu, Dehua Cheng, Yan Liu | In this paper, we propose an online accelerated low-rank tensor learning algorithm (ALTO) to solve the problem. |

27 | Atomic Spatial Processes | Sean Jewell, Neil Spencer, Alexandre Bouchard-Côté | We employ techniques from Bayesian non-parametric statistics to develop a process which captures a common characteristic of urban spatial datasets. |

28 | Classification with Low Rank and Missing Data | Elad Hazan, Roi Livni, Yishay Mansour | Nevertheless, using a non-proper formulation we give an efficient agnostic algorithm that classifies as well as the best linear classifier coupled with the best low-dimensional subspace in which the data resides. |

29 | Dynamic Sensing: Better Classification under Acquisition Constraints | Oran Richman, Shie Mannor | In this paper we propose to actively allocate resources to each sample such that resources are used optimally overall. |

30 | A Modified Orthant-Wise Limited Memory Quasi-Newton Method with Convergence Analysis | Pinghua Gong, Jieping Ye | In this paper, we propose a modified Orthant-Wise Limited memory Quasi-Newton (mOWL-QN) algorithm by slightly modifying the OWL-QN algorithm. |

31 | Telling cause from effect in deterministic linear dynamical systems | Naji Shajarisales, Dominik Janzing, Bernhard Schoelkopf, Michel Besserve | Assuming the effect is generated by the cause through a linear system, we propose a new approach based on the hypothesis that nature chooses the “cause” and the “mechanism generating the effect from the cause” independently of each other. |

32 | High Dimensional Bayesian Optimisation and Bandits via Additive Models | Kirthevasan Kandasamy, Jeff Schneider, Barnabas Poczos | In this paper, we identify two key challenges in this endeavour. |

33 | Theory of Dual-sparse Regularized Randomized Reduction | Tianbao Yang, Lijun Zhang, Rong Jin, Shenghuo Zhu | In this paper, we study randomized reduction methods, which reduce high-dimensional features into low-dimensional space by randomized methods (e.g., random projection, random hashing), for large-scale high-dimensional classification. |

34 | Generalization error bounds for learning to rank: Does the length of document lists matter? | Ambuj Tewari, Sougata Chaudhuri | We consider the generalization ability of algorithms for learning to rank at a query level, a problem also called subset ranking. |

35 | PeakSeg: constrained optimal segmentation and supervised penalty learning for peak detection in count data | Toby Hocking, Guillem Rigaill, Guillaume Bourque | We propose PeakSeg, a new constrained maximum likelihood segmentation model for peak detection with an efficient inference algorithm: constrained dynamic programming. |

36 | Mind the duality gap: safer rules for the Lasso | Olivier Fercoq, Alexandre Gramfort, Joseph Salmon | In this paper, we propose new versions of the so-called *safe rules* for the Lasso. |

37 | A General Analysis of the Convergence of ADMM | Robert Nishihara, Laurent Lessard, Ben Recht, Andrew Packard, Michael Jordan | We provide a new proof of the linear convergence of the alternating direction method of multipliers (ADMM) when one of the objective terms is strongly convex. |

38 | Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization | Yuchen Zhang, Xiao Lin | We propose a stochastic primal-dual coordinate method, which alternates between maximizing over one (or more) randomly chosen dual variable and minimizing over the primal variable. |

39 | DiSCO: Distributed Optimization for Self-Concordant Empirical Loss | Yuchen Zhang, Xiao Lin | We propose a new distributed algorithm for empirical risk minimization in machine learning. |

40 | Spectral MLE: Top-K Rank Aggregation from Pairwise Comparisons | Yuxin Chen, Changho Suh | To approach this minimax limit, we propose a nearly linear-time ranking scheme, called Spectral MLE, that returns the indices of the top-K items in accordance with a careful score estimate. |

41 | Paired-Dual Learning for Fast Training of Latent Variable Hinge-Loss MRFs | Stephen Bach, Bert Huang, Jordan Boyd-Graber, Lise Getoor | We introduce paired-dual learning, a framework that greatly speeds up training by using tractable entropy surrogates and avoiding repeated inferences. |

42 | Structural Maxent Models | Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, Umar Syed | We present a new class of density estimation models, Structural Maxent models, with feature functions selected from possibly very complex families. |

43 | A Provable Generalized Tensor Spectral Method for Uniform Hypergraph Partitioning | Debarghya Ghoshdastidar, Ambedkar Dukkipati | In this paper, we develop a unified approach for partitioning uniform hypergraphs by means of a tensor trace optimization problem involving the affinity tensor, and a number of existing higher-order methods turn out to be special cases of the proposed formulation. |

44 | The Benefits of Learning with Strongly Convex Approximate Inference | Ben London, Bert Huang, Lise Getoor | Our insights for the latter suggest a novel counting number optimization framework, which guarantees strong convexity for any given modulus. |

45 | Pushing the Limits of Affine Rank Minimization by Adapting Probabilistic PCA | Bo Xin, David Wipf | Against this backdrop we derive a deceptively simple and parameter-free probabilistic PCA-like algorithm that is capable, over a wide battery of empirical tests, of successful recovery even at the theoretical limit where the number of measurements equals the degrees of freedom in the unknown low-rank matrix. |

46 | Budget Allocation Problem with Multiple Advertisers: A Game Theoretic View | Takanori Maehara, Akihiro Yabe, Ken-ichi Kawarabayashi | By extending the budget allocation problem with a bipartite influence model, we propose a game-theoretic model problem that considers many advertisers. |

47 | Tracking Approximate Solutions of Parameterized Optimization Problems over Multi-Dimensional (Hyper-)Parameter Domains | Katharina Blechschmidt, Joachim Giesen, Soeren Laue | Many machine learning methods are given as parameterized optimization problems. |

48 | Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift | Sergey Ioffe, Christian Szegedy | We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. |
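The normalization this paper introduces is simple enough to sketch directly: per feature, subtract the mini-batch mean and divide by the mini-batch standard deviation, then apply a learned scale and shift. This is a minimal forward pass only (no running statistics or backpropagation); the input data is random for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalize a (batch, features) activation matrix: subtract
    the mini-batch mean, divide by the mini-batch std, then apply the
    learned scale (gamma) and shift (beta) so the layer keeps capacity."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # shifted, scaled inputs
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma=1, beta=0` the output has (approximately) zero mean and unit variance per feature, which is exactly the reduction of internal covariate shift the title refers to.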

49 | Distributed Estimation of Generalized Matrix Rank: Efficient Algorithms and Lower Bounds | Yuchen Zhang, Martin Wainwright, Michael Jordan | In contrast, we propose a randomized algorithm that communicates only O(n) bits. |

50 | Landmarking Manifolds with Gaussian Processes | Dawen Liang, John Paisley | We present an algorithm for finding landmarks along a manifold. |

51 | Markov Mixed Membership Models | Aonan Zhang, John Paisley | We present a Markov mixed membership model (Markov M3) for grouped data that learns a fully connected graph structure among mixing components. |

52 | A Unified Framework for Outlier-Robust PCA-like Algorithms | Wenzhuo Yang, Huan Xu | We propose a unified framework for making a wide range of PCA-like algorithms – including the standard PCA, sparse PCA and non-negative sparse PCA, etc. – robust when facing a constant fraction of arbitrarily corrupted outliers. |

53 | Streaming Sparse Principal Component Analysis | Wenzhuo Yang, Huan Xu | We develop and analyze two memory and computational efficient algorithms called streaming sparse PCA and streaming sparse ECA for analyzing data generated according to the spike model and the elliptical model respectively. |

54 | A Divide and Conquer Framework for Distributed Graph Clustering | Wenzhuo Yang, Huan Xu | In order to improve the scalability of existing graph clustering algorithms, we propose a novel divide and conquer framework for graph clustering, and establish theoretical guarantees of exact recovery of the clusters. |

55 | How Can Deep Rectifier Networks Achieve Linear Separability and Preserve Distances? | Senjian An, Farid Boussaid, Mohammed Bennamoun | This paper investigates how hidden layers of deep rectifier networks are capable of transforming two or more pattern sets to be linearly separable while preserving the distances with a guaranteed degree, and proves the universal classification power of such distance preserving rectifier networks. |

56 | Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning | K. Lakshmanan, Ronald Ortner, Daniil Ryabko | We consider the problem of undiscounted reinforcement learning in continuous state space. |

57 | The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling | Michael Betancourt | In this paper I demonstrate how data subsampling fundamentally compromises the scalability of Hamiltonian Monte Carlo. |

58 | Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets | Dan Garber, Elad Hazan | In this paper we consider the special case of optimization over strongly convex sets, for which we prove that the vanilla FW method converges at a rate of O(1/t^2). |
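For reference, the vanilla Frank-Wolfe method the paper speeds up can be sketched over the probability simplex (an ordinary convex set, so the classical O(1/t) rate applies here, not the paper's faster rate): each step solves a linear subproblem over the feasible set and moves toward its solution.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, d, steps=2000):
    """Vanilla Frank-Wolfe over the probability simplex: each step solves
    the linear subproblem min_s <grad, s>, whose solution is a vertex of
    the simplex, and moves toward it with the standard 2/(t+2) step size."""
    x = np.full(d, 1.0 / d)
    for t in range(steps):
        g = grad_f(x)
        s = np.zeros(d)
        s[np.argmin(g)] = 1.0            # linear minimization oracle
        x += (2.0 / (t + 2)) * (s - x)   # stays a convex combination, so feasible
    return x

# Example: project p onto the simplex, i.e. minimize f(x) = 0.5||x - p||^2.
p = np.array([0.1, 0.5, 0.2])
x = frank_wolfe_simplex(lambda x: x - p, d=3)
```

Because every iterate is a convex combination of simplex vertices, no projection step is ever needed; that projection-free property is what makes FW attractive and the convergence rate the interesting question.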

59 | Ordered Stick-Breaking Prior for Sequential MCMC Inference of Bayesian Nonparametric Models | Mrinal Das, Trapit Bansal, Chiranjib Bhattacharyya | One of the major contributions of this paper is SUMO, an MCMC algorithm, for solving the inference problem arising from applying OSBP to BNP models. |

60 | Online Learning of Eigenvectors | Dan Garber, Elad Hazan, Tengyu Ma | In this paper we present new algorithms that avoid both issues. |

61 | A Unifying Framework of Anytime Sparse Gaussian Process Regression Models with Stochastic Variational Inference for Big Data | Trong Nghia Hoang, Quang Minh Hoang, Bryan Kian Hsiang Low | This paper presents a novel unifying framework of anytime sparse Gaussian process regression (SGPR) models that can produce good predictive performance fast and improve their predictive performance over time. |

62 | Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup | Yufei Ding, Yue Zhao, Xipeng Shen, Madanlal Musuvathi, Todd Mytkowicz | This paper presents Yinyang K-means, a new algorithm for K-means clustering. |

63 | Ordinal Mixed Membership Models | Seppo Virtanen, Mark Girolami | In this work, by way of illustration, we apply the models to a collection of consumer-generated reviews of mobile software applications, where each review contains unstructured text data accompanied with an ordinal rating, and demonstrate that the models infer useful and meaningful recurring patterns of consumer feedback. |

64 | Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network | Seunghoon Hong, Tackgeun You, Suha Kwak, Bohyung Han | We propose an online visual tracking algorithm by learning discriminative saliency map using Convolutional Neural Network (CNN). |

65 | Fast Kronecker Inference in Gaussian Processes with non-Gaussian Likelihoods | Seth Flaxman, Andrew Wilson, Daniel Neill, Hannes Nickisch, Alex Smola | We propose new scalable Kronecker methods for Gaussian processes with non-Gaussian likelihoods, using a Laplace approximation which involves linear conjugate gradients for inference, and a lower bound on the GP marginal likelihood for kernel learning. |

66 | Statistical and Algorithmic Perspectives on Randomized Sketching for Ordinary Least-Squares | Garvesh Raskutti, Michael Mahoney | In this paper, we provide a rigorous comparison of both perspectives leading to insights on how they differ. |

67 | On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence | Nathaniel Korda, Prashanth La | We provide non-asymptotic bounds for the well-known temporal difference learning algorithm TD(0) with linear function approximators. |

68 | Learning Parametric-Output HMMs with Two Aliased States | Roi Weiss, Boaz Nadler | In this paper we focus on parametric-output HMMs, whose output distributions come from a parametric family, and that have exactly two aliased states. |

69 | Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data | Yarin Gal, Yutian Chen, Zoubin Ghahramani | Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. |

70 | Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs | Yarin Gal, Richard Turner | We model the covariance function with a finite Fourier series approximation and treat it as a random variable. |

71 | Ranking from Stochastic Pairwise Preferences: Recovering Condorcet Winners and Tournament Solution Sets at the Top | Arun Rajkumar, Suprovat Ghoshal, Lek-Heng Lim, Shivani Agarwal | In this paper, we consider settings where pairwise preferences can contain cycles. |

72 | Stochastic Dual Coordinate Ascent with Adaptive Probabilities | Dominik Csiba, Zheng Qu, Peter Richtarik | This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving the regularized empirical risk minimization problems. |

73 | Vector-Space Markov Random Fields via Exponential Families | Wesley Tansey, Oscar Hernan Madrid Padilla, Arun Sai Suggala, Pradeep Ravikumar | We present Vector-Space Markov Random Fields (VS-MRFs), a novel class of undirected graphical models where each variable can belong to an arbitrary vector space. |

74 | JUMP-Means: Small-Variance Asymptotics for Markov Jump Processes | Jonathan Huggins, Karthik Narasimhan, Ardavan Saeedi, Vikash Mansinghka | We propose algorithms for each of these formulations, which we call *JUMP-means*. |

75 | Low Rank Approximation using Error Correcting Coding Matrices | Shashanka Ubaru, Arya Mazumdar, Yousef Saad | In this paper, we show how matrices from error correcting codes can be used to find such low rank approximations. |

76 | Off-policy Model-based Learning under Unknown Factored Dynamics | Assaf Hallak, Francois Schnitzler, Timothy Mann, Shie Mannor | To answer this question, we introduce the G-SCOPE algorithm that evaluates a new policy based on data generated by the existing policy. |

77 | Log-Euclidean Metric Learning on Symmetric Positive Definite Manifold with Application to Image Set Classification | Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xianqiu Li, Xilin Chen | To overcome this limitation, we propose a novel metric learning approach to work directly on logarithms of SPD matrices. |

78 | Asymmetric Transfer Learning with Deep Gaussian Processes | Melih Kandemir | We introduce a novel Gaussian process based Bayesian model for asymmetric transfer learning. |

79 | Towards a Lower Sample Complexity for Robust One-bit Compressed Sensing | Rongda Zhu, Quanquan Gu | In this paper, we propose a novel algorithm based on nonconvex sparsity-inducing penalty for one-bit compressed sensing. |

80 | BilBOWA: Fast Bilingual Distributed Representations without Word Alignments | Stephan Gouws, Yoshua Bengio, Greg Corrado | We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally-efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. |

81 | Multi-view Sparse Co-clustering via Proximal Alternating Linearized Minimization | Jiangwen Sun, Jin Lu, Tingyang Xu, Jinbo Bi | We propose a proximal alternating linearized minimization algorithm that simultaneously decomposes multiple data matrices into sparse row and columns vectors. |

82 | Cascading Bandits: Learning to Rank in the Cascade Model | Branislav Kveton, Csaba Szepesvari, Zheng Wen, Azin Ashkan | In this paper, we propose cascading bandits, a learning variant of the cascade model where the objective is to identify K most attractive items. |

83 | Latent Topic Networks: A Versatile Probabilistic Programming Framework for Topic Models | James Foulds, Shachi Kumar, Lise Getoor | In this paper we introduce latent topic networks, a flexible class of richly structured topic models designed to facilitate applied research. |

84 | Random Coordinate Descent Methods for Minimizing Decomposable Submodular Functions | Alina Ene, Huy Nguyen | In this paper, we use random coordinate descent methods to obtain algorithms with faster *linear* convergence rates and cheaper iteration costs. |

85 | Alpha-Beta Divergences Discover Micro and Macro Structures in Data | Karthik Narayan, Ali Punjani, Pieter Abbeel | We study this relationship, theoretically and through an empirical analysis over 10 datasets. |

86 | Fictitious Self-Play in Extensive-Form Games | Johannes Heinrich, Marc Lanctot, David Silver | This paper introduces two variants of fictitious play that are implemented in behavioural strategies of an extensive-form game. |

87 | Counterfactual Risk Minimization: Learning from Logged Bandit Feedback | Adith Swaminathan, Thorsten Joachims | We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. |

88 | The Hedge Algorithm on a Continuum | Walid Krichene, Maximilian Balandat, Claire Tomlin, Alexandre Bayen | We consider an online optimization problem on a subset S of R^n (not necessarily convex), in which a decision maker chooses, at each iteration t, a probability distribution x^(t) over S, and seeks to minimize a cumulative expected loss, where each loss is a Lipschitz function revealed at the end of iteration t. Building on previous work, we propose a generalized Hedge algorithm and show a O(√(t log t)) bound on the regret when the losses are uniformly Lipschitz and S is uniformly fat (a weaker condition than convexity). |

89 | A Linear Dynamical System Model for Text | David Belanger, Sham Kakade | Our learning algorithm is extremely scalable, operating on simple co-occurrence counts for both parameter initialization using the method of moments and subsequent iterations of EM. |

90 | Unsupervised Learning of Video Representations using LSTMs | Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov | We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. |

91 | Message Passing for Collective Graphical Models | Tao Sun, Dan Sheldon, Akshat Kumar | Collective graphical models (CGMs) are a formalism for inference and learning about a population of independent and identically distributed individuals when only noisy aggregate data are available. |

92 | DP-space: Bayesian Nonparametric Subspace Clustering with Small-variance Asymptotics | Yining Wang, Jun Zhu | This paper presents a novel nonparametric Bayesian subspace clustering model that infers both the number of subspaces and the dimension of each subspace from the observed data. |

93 | HawkesTopic: A Joint Model for Network Inference and Topic Modeling from Text-Based Cascades | Xinran He, Theodoros Rekatsinas, James Foulds, Lise Getoor, Yan Liu | In this work, we develop the HawkesTopic model (HTM) for analyzing text-based cascades, such as “retweeting a post” or “publishing a follow-up blog post”. |

94 | MADE: Masked Autoencoder for Distribution Estimation | Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle | We introduce a simple modification for autoencoder neural networks that yields powerful generative models. |

95 | An Online Learning Algorithm for Bilinear Models | Yuanbin Wu, Shiliang Sun | A new online learning algorithm is proposed to train the model parameters. |

96 | Adaptive Belief Propagation | Georgios Papachristoudis, John Fisher | Graphical models are widely used in inference problems. |

97 | Large-scale log-determinant computation through stochastic Chebyshev expansions | Insu Han, Dmitry Malioutov, Jinwoo Shin | We propose a linear-time randomized algorithm to approximate log-determinants for very large-scale positive definite and general non-singular matrices using a stochastic trace approximation, called the Hutchinson method, coupled with Chebyshev polynomial expansions that both rely on efficient matrix-vector multiplications. |

98 | Differentially Private Bayesian Optimization | Matt Kusner, Jacob Gardner, Roman Garnett, Kilian Weinberger | To address this, we introduce methods for releasing the best hyper-parameters and classifier accuracy privately. |

99 | A Nearly-Linear Time Framework for Graph-Structured Sparsity | Chinmay Hegde, Piotr Indyk, Ludwig Schmidt | We introduce a framework for sparsity structures defined via graphs. |

100 | Support Matrix Machines | Luo Luo, Yubo Xie, Zhihua Zhang, Wu-Jun Li | To leverage this kind of structure information, we propose a new classification method that we call support matrix machine (SMM). |

101 | Rademacher Observations, Private Data, and Boosting | Richard Nock, Giorgio Patrini, Arik Friedman | We provide a learning algorithm over rados with boosting-compliant convergence rates on the *logistic loss* (computed over examples). |

102 | From Word Embeddings To Document Distances | Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger | We present the Word Mover’s Distance (WMD), a novel distance function between text documents. |
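The exact Word Mover's Distance solves an optimal-transport linear program between the two documents' word vectors; the paper also uses a cheap "relaxed" lower bound where each word simply travels to its nearest counterpart. A minimal sketch of that relaxation, with made-up 2-D "embeddings" and uniform word weights assumed:

```python
import numpy as np

def relaxed_wmd(X, Y):
    """A cheap lower bound on the Word Mover's Distance: every word
    vector in document X travels to its nearest word vector in
    document Y (uniform word weights assumed). The exact WMD solves an
    optimal-transport LP; this relaxation drops one mass constraint."""
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # pairwise dists
    return D.min(axis=1).mean()   # each word moves to its nearest neighbor

# Toy word embeddings (illustrative only): doc_b is nearly identical to
# doc_a, while doc_c lives in a distant region of the embedding space.
doc_a = np.array([[0.0, 0.0], [1.0, 0.0]])
doc_b = np.array([[0.0, 0.1], [1.0, 0.1]])
doc_c = np.array([[5.0, 5.0], [6.0, 5.0]])
```

Semantically similar documents then get small distances even with no words in common, which is the property that makes WMD useful for document retrieval and classification.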

103 | Bayesian and Empirical Bayesian Forests | Taddy Matthew, Chun-Sheng Chen, Jun Yu, Mitch Wyle | We derive ensembles of decision trees through a nonparametric Bayesian model, allowing us to view such ensembles as samples from a posterior distribution. |

104 | Inferring Graphs from Cascades: A Sparse Recovery Framework | Jean Pouget-Abadie, Thibaut Horel | In this paper, we approach this problem from the sparse recovery perspective. |

105 | Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM | Ching-Pei Lee, Dan Roth | This paper proposes an efficient box-constrained quadratic optimization algorithm for distributedly training linear support vector machines (SVMs) with large data. |

106 | Safe Exploration for Optimization with Gaussian Processes | Yanan Sui, Alkis Gotovos, Joel Burdick, Andreas Krause | We consider sequential decision problems under uncertainty, where we seek to optimize an unknown function from noisy samples. |

107 | The Ladder: A Reliable Leaderboard for Machine Learning Competitions | Avrim Blum, Moritz Hardt | We introduce a natural algorithm called the Ladder and demonstrate that it simultaneously supports strong theoretical guarantees in a fully adaptive model of estimation, withstands practical adversarial attacks, and achieves high utility on real submission files from a Kaggle competition. |

108 | Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE) | Maurizio Filippone, Raphael Engler | This paper proposes an adaptation of the Stochastic Gradient Langevin Dynamics algorithm to draw samples from the posterior distribution over covariance parameters with negligible bias and without the need to compute the marginal likelihood. |

109 | Finding Galaxies in the Shadows of Quasars with Gaussian Processes | Roman Garnett, Shirley Ho, Jeff Schneider | We develop an automated technique for detecting damped Lyman-α absorbers (DLAs) along spectroscopic sightlines to quasi-stellar objects (QSOs or quasars). |

110 | Following the Perturbed Leader for Online Structured Learning | Alon Cohen, Tamir Hazan | To better understand FTPL algorithms for online structured learning, we present a lower bound on the regret for a large and natural class of FTPL algorithms that use logconcave perturbations. |

111 | Reified Context Models | Jacob Steinhardt, Percy Liang | In this work, we introduce a new approach, reified context models, to reconcile this tension. |

112 | Large-Scale Markov Decision Problems with KL Control Cost and its Application to Crowdsourcing | Yasin Abbasi-Yadkori, Peter Bartlett, Xi Chen, Alan Malek | We study average and total cost Markov decision problems with large state spaces. |

113 | Learning Fast-Mixing Models for Structured Prediction | Jacob Steinhardt, Percy Liang | Markov Chain Monte Carlo (MCMC) algorithms are often used for approximate inference inside learning, but their slow mixing can be difficult to diagnose and the resulting approximate gradients can seriously degrade learning. |

114 | A Probabilistic Model for Dirty Multi-task Feature Selection | Daniel Hernandez-Lobato, Jose Miguel Hernandez-Lobato, Zoubin Ghahramani | To account for this, we propose a model for multi-task feature selection based on a robust prior distribution that introduces a set of binary latent variables to identify outlier tasks and outlier features. |

115 | On Deep Multi-View Representation Learning | Weiran Wang, Raman Arora, Karen Livescu, Jeff Bilmes | Previous work on this problem has proposed several techniques based on deep neural networks, typically involving either autoencoder-like networks with a reconstruction objective or paired feedforward networks with a correlation-based objective. |

116 | Learning Program Embeddings to Propagate Feedback on Student Code | Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, Leonidas Guibas | We introduce a neural network method to encode programs as a linear mapping from an embedded precondition space to an embedded postcondition space and propose an algorithm for feedback at scale using these linear maps as features. |

117 | Safe Subspace Screening for Nuclear Norm Regularized Least Squares Problems | Qiang Zhou, Qi Zhao | In this work, we propose a novel method called safe subspace screening (SSS), to improve the efficiency of the solver for nuclear norm regularized least squares problems. |

118 | Efficient Learning in Large-Scale Combinatorial Semi-Bandits | Zheng Wen, Branislav Kveton, Azin Ashkan | In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). |

119 | Swept Approximate Message Passing for Sparse Estimation | Andre Manoel, Florent Krzakala, Eric Tramel, Lenka Zdeborová | We propose a new approach to stabilizing AMP in these contexts by applying AMP updates to individual coefficients rather than in parallel. |

120 | Simple regret for infinitely many armed bandits | Alexandra Carpentier, Michal Valko | In this paper, we propose an algorithm aiming at minimizing the simple regret. |

121 | Exponential Integration for Hamiltonian Monte Carlo | Wei-Lun Chao, Justin Solomon, Dominik Michels, Fei Sha | We consider various ways to derive Gaussian approximations and conduct extensive empirical studies applying the proposed “exponential HMC” to several benchmarked learning problems. |

122 | Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays | Junpei Komiyama, Junya Honda, Hiroshi Nakagawa | In this paper, we propose the multiple-play Thompson sampling (MP-TS) algorithm, an extension of TS to the multiple-play MAB problem, and discuss its regret analysis. |

123 | Faster cover trees | Mike Izbicki, Christian Shelton | This paper makes cover trees even faster. |

124 | Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization | Tyler Johnson, Carlos Guestrin | We propose Blitz, a fast working set algorithm accompanied by useful guarantees. |

125 | Unsupervised Domain Adaptation by Backpropagation | Yaroslav Ganin, Victor Lempitsky | Here, we propose a new approach to domain adaptation in deep architectures that can be trained on large amounts of labeled data from the source domain and large amounts of unlabeled data from the target domain (no labeled target-domain data is necessary). |

126 | Non-Linear Cross-Domain Collaborative Filtering via Hyper-Structure Transfer | Yan-Fu Liu, Cheng-Yu Hsu, Shan-Hung Wu | In this paper, we propose the notion of Hyper-Structure Transfer (HST) that requires the rating matrices to be explained by the projections of some more complex structure, called the hyper-structure, shared by all domains, and thus allows the non-linearly correlated knowledge between domains to be identified and transferred. |

127 | Manifold-valued Dirichlet Processes | Hyunwoo Kim, Jia Xu, Baba Vemuri, Vikas Singh | To address this ‘locality’ problem, we propose a novel nonparametric model which unifies multivariate general linear models (MGLMs) using multiple tangent spaces. |

128 | Multi-Task Learning for Subspace Segmentation | Yu Wang, David Wipf, Qing Ling, Wei Chen, Ian Wassell | Multi-Task Learning for Subspace Segmentation |

129 | Markov Chain Monte Carlo and Variational Inference: Bridging the Gap | Tim Salimans, Diederik Kingma, Max Welling | We describe the theoretical foundations that make this possible and show some promising first results. |

130 | Scalable Model Selection for Large-Scale Factorial Relational Models | Chunchen Liu, Lu Feng, Ryohei Fujimaki, Yusuke Muraoka | For scalable model selection of BMFs, this paper proposes stochastic factorized asymptotic Bayesian (sFAB) inference that combines concepts in two recently-developed techniques: stochastic variational inference (SVI) and FAB inference. |

131 | The Power of Randomization: Distributed Submodular Maximization on Massive Datasets | Rafael Barbosa, Alina Ene, Huy Nguyen, Justin Ward | We consider a distributed, greedy algorithm that combines previous approaches with randomization. |

132 | Dealing with small data: On the generalization of context trees | Ralf Eggeling, Mikko Koivisto, Ivo Grosse | In this work, we investigate to which degree CTs can be generalized to increase statistical efficiency while still keeping the learning computationally feasible. |

133 | Non-Gaussian Discriminative Factor Models via the Max-Margin Rank-Likelihood | Xin Yuan, Ricardo Henao, Ephraim Tsalik, Raymond Langley, Lawrence Carin | A Bayesian model based on the ranks of the data is proposed. |

134 | A Bayesian nonparametric procedure for comparing algorithms | Alessio Benavoli, Giorgio Corani, Francesca Mangili, Marco Zaffalon | We show by simulation that our approach is competitive both in terms of accuracy and speed in identifying the best algorithm. |

135 | Convergence rate of Bayesian tensor estimator and its minimax optimality | Taiji Suzuki | We investigate the statistical convergence rate of a Bayesian low-rank tensor estimator, and derive the minimax optimal rate for learning a low-rank tensor. |

136 | On Identifying Good Options under Combinatorially Structured Feedback in Finite Noisy Environments | Yifan Wu, Andras Gyorgy, Csaba Szepesvari | We consider the problem of identifying a good option out of finite set of options under combinatorially structured, noisy feedback about the quality of the options in a sequential process: In each round, a subset of the options, from an available set of subsets, can be selected to receive noisy information about the quality of the options in the chosen subset. |

137 | Nested Sequential Monte Carlo Methods | Christian Naesseth, Fredrik Lindsten, Thomas Schon | We propose nested sequential Monte Carlo (NSMC), a methodology to sample from sequences of probability distributions, even where the random variables are high-dimensional. |

138 | Sparse Variational Inference for Generalized GP Models | Rishit Sheth, Yuyang Wang, Roni Khardon | This paper develops a variational sparse solution for GPs under general likelihoods by providing a new characterization of the gradients required for inference in terms of individual observation likelihood terms. |

139 | Universal Value Function Approximators | Tom Schaul, Daniel Horgan, Karol Gregor, David Silver | In this paper we introduce universal value function approximators (UVFAs) V(s,g;theta) that generalise not just over states s but also over goals g. |

140 | Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games | Julien Perolat, Bruno Scherrer, Bilal Piot, Olivier Pietquin | This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. |

141 | On Greedy Maximization of Entropy | Dravyansh Sharma, Ashish Kapoor, Amit Deshpande | The main goal of this paper is to explore and answer why the greedy selection does significantly better than the theoretical guarantee of (1 − 1/e). |

142 | Metadata Dependent Mondrian Processes | Yi Wang, Bin Li, Yang Wang, Fang Chen | In this paper, we propose a metadata dependent Mondrian process (MDMP) to incorporate meta information into the stochastic partition process in the product space and the entity allocation process on the resulting block structure. |

143 | Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM | Xiaojun Chang, Yi Yang, Eric Xing, Yaoliang Yu | We aim to detect complex events in long Internet videos that may last for hours. |

144 | Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood | Kohei Hayashi, Shin-ichi Maeda, Ryohei Fujimaki | Factorized information criterion (FIC) is a recently developed approximation technique for the marginal log-likelihood, which provides an automatic model selection framework for a few latent variable models (LVMs) with tractable inference algorithms. |

145 | Double Nyström Method: An Efficient and Accurate Nyström Scheme for Large-Scale Data Sets | Woosang Lim, Minhwan Kim, Haesun Park, Kyomin Jung | In this paper, we present a novel Nyström method that improves both accuracy and efficiency based on a new theoretical analysis. |

146 | The Composition Theorem for Differential Privacy | Peter Kairouz, Sewoong Oh, Pramod Viswanath | In this paper we answer the fundamental question of characterizing the level of privacy degradation as a function of the number of adaptive interactions and the differential privacy levels maintained by the individual queries. |

147 | Convex Formulation for Learning from Positive and Unlabeled Data | Marthinus Du Plessis, Gang Niu, Masashi Sugiyama | In this paper, we discuss a convex formulation for PU classification that can still cancel the bias. |

148 | Threshold Influence Model for Allocating Advertising Budgets | Atsushi Miyauchi, Yuni Iwamasa, Takuro Fukunaga, Naonori Kakimura | We propose a new influence model for allocating budgets to advertising channels. |

149 | Strongly Adaptive Online Learning | Amit Daniely, Alon Gonen, Shai Shalev-Shwartz | We present a reduction that can transform standard low-regret algorithms to strongly adaptive. |

150 | CUR Algorithm for Partially Observed Matrices | Miao Xu, Rong Jin, Zhi-Hua Zhou | In this work, we alleviate this limitation by developing a CUR decomposition algorithm for partially observed matrices. |

151 | A Deterministic Analysis of Noisy Sparse Subspace Clustering for Dimensionality-reduced Data | Yining Wang, Yu-Xiang Wang, Aarti Singh | In this paper, we propose a theoretical framework to analyze a popular optimization-based algorithm, Sparse Subspace Clustering (SSC), when the data dimension is compressed via some random projection algorithms. |

152 | MRA-based Statistical Learning from Incomplete Rankings | Eric Sibony, Stéphan Clémençon, Jérémie Jakubowicz | The goal of this paper is twofold: it develops a rigorous mathematical framework for the problem of learning a ranking model from incomplete rankings and introduces a novel general statistical method to address it. |

153 | Risk and Regret of Hierarchical Bayesian Learners | Jonathan Huggins, Josh Tenenbaum | We present a set of analytical tools for understanding hierarchical priors in both the online and batch learning settings. |

154 | Towards a Learning Theory of Cause-Effect Inference | David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, Iliya Tolstikhin | We pose causal inference as the problem of learning to classify probability distributions. |

155 | DRAW: A Recurrent Neural Network For Image Generation | Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, Daan Wierstra | This paper introduces the Deep Recurrent Attentive Writer (DRAW) architecture for image generation with neural networks. |

156 | Multiview Triplet Embedding: Learning Attributes in Multiple Maps | Ehsan Amid, Antti Ukkonen | In this paper, we consider the problem of uncovering these hidden attributes given a set of relative distance judgments in the form of triplets. |

157 | Distributed Gaussian Processes | Marc Deisenroth, Jun Wei Ng | To scale Gaussian processes (GPs) to large data sets we introduce the robust Bayesian Committee Machine (rBCM), a practical and scalable product-of-experts model for large-scale distributed GP regression. |

158 | Guaranteed Tensor Decomposition: A Moment Approach | Gongguo Tang, Parikshit Shah | To address the computational challenge, we present a hierarchy of semidefinite programs based on sums-of-squares relaxations of the measure optimization problem. |

159 | ℓ_{1,p}-Norm Regularization: Error Bounds and Convergence Rate Analysis of First-Order Methods | Zirui Zhou, Qi Zhang, Anthony Man-Cho So | Motivated by the desire to analyze the convergence rate of first-order methods, we show that for a large class of ℓ_{1,p}-regularized problems, an error bound condition is satisfied when p ∈ [1,2] or p = ∞ but fails to hold for any p ∈ (2,∞). |

160 | Consistent estimation of dynamic and multi-layer block models | Qiuyi Han, Kevin Xu, Edoardo Airoldi | In this paper, we consider the multi-graph SBM, which serves as a foundation for many application settings including dynamic and multi-layer networks. |

161 | On the Rate of Convergence and Error Bounds for LSTD(λ) | Manel Tagorti, Bruno Scherrer | We consider LSTD(λ), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002). |

162 | Variational Inference with Normalizing Flows | Danilo Rezende, Shakir Mohamed | We introduce a new approach for specifying flexible, arbitrarily complex and scalable approximate posterior distributions. |

163 | Controversy in mechanistic modelling with Gaussian processes | Benn Macdonald, Catherine Higham, Dirk Husmeier | In the present article, we offer a new interpretation of the second paradigm, which highlights the underlying assumptions, approximations and limitations. |

164 | Convex Learning of Multiple Tasks and their Structure | Carlo Ciliberto, Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco | Within this framework, we show that tasks and their structure can be efficiently learned considering a convex optimization problem that can be approached by means of block coordinate methods such as alternating minimization and for which we prove convergence to the global minimum. |

165 | K-hyperplane Hinge-Minimax Classifier | Margarita Osadchy, Tamir Hazan, Daniel Keren | We propose an efficient algorithm for training an intersection of a finite number of hyperplanes and demonstrate its effectiveness on real data, including letter and scene recognition. |

166 | Non-Stationary Approximate Modified Policy Iteration | Boris Lesner, Bruno Scherrer | We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. |

167 | Entropy evaluation based on confidence intervals of frequency estimates: Application to the learning of decision trees | Mathieu Serrurier, Henri Prade | We propose a new cumulative entropy function based on confidence intervals on frequency estimates that jointly considers the entropy of the probability distribution and the uncertainty around the estimation of its parameters. |

168 | Geometric Conditions for Subspace-Sparse Recovery | Chong You, Rene Vidal | In this work, we consider the more general case where ξ lies in a low-dimensional subspace spanned by a few columns of Π, which are possibly linearly dependent. |

169 | An Empirical Study of Stochastic Variational Inference Algorithms for the Beta Bernoulli Process | Amar Shah, David Knowles, Zoubin Ghahramani | Deriving several new algorithms, and using synthetic, image and genomic datasets, we investigate whether the understanding gleaned from LDA applies in the setting of sparse latent factor models, specifically beta process factor analysis (BPFA). |

170 | Long Short-Term Memory Over Recursive Structures | Xiaodan Zhu, Parinaz Sobihani, Hongyu Guo | In this paper, we propose to extend it to tree structures, in which a memory cell can reflect the history memories of multiple child cells or multiple descendant cells in a recursive process. |

171 | Weight Uncertainty in Neural Network | Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra | We introduce a new, efficient, principled and backpropagation-compatible algorithm for learning a probability distribution on the weights of a neural network, called Bayes by Backprop. |

172 | Learning Submodular Losses with the Lovasz Hinge | Jiaqian Yu, Matthew Blaschko | In this work, we show that these strategies lead to tight convex surrogates iff the underlying loss function is increasing in the number of incorrect predictions. |

173 | Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection | Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, Hoyt Koepke | We give a simple analysis of the Gauss-Southwell rule showing that—except in extreme cases—its convergence rate is faster than choosing random coordinates. |

174 | Hashing for Distributed Data | Cong Leng, Jiaxiang Wu, Jian Cheng, Xi Zhang, Hanqing Lu | In this paper, we develop a novel hashing model to learn hash functions in a distributed setting. |

175 | Large-scale Distributed Dependent Nonparametric Trees | Zhiting Hu, Ho Qirong, Avinava Dubey, Eric Xing | In this paper, we consider dependent nonparametric trees (DNTs), a powerful infinite model that captures time-evolving hierarchies, and develop a large-scale distributed training system. |

176 | Qualitative Multi-Armed Bandits: A Quantile-Based Approach | Balazs Szorenyi, Robert Busa-Fekete, Paul Weng, Eyke Hüllermeier | For both cases, we propose suitable algorithms and analyze their properties. |

177 | Deep Edge-Aware Filters | Li Xu, Jimmy Ren, Qiong Yan, Renjie Liao, Jiaya Jia | We attempt to learn a large and important family of edge-aware operators from data. |

178 | A Convex Optimization Framework for Bi-Clustering | Shiau Hong Lim, Yudong Chen, Huan Xu | We present a framework for biclustering and clustering where the observations are general labels. |

179 | Is Feature Selection Secure against Training Data Poisoning? | Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, Fabio Roli | In this work, we shed light on this issue by providing a framework to investigate the robustness of popular feature selection methods, including LASSO, ridge regression and the elastic net. |

180 | Predictive Entropy Search for Bayesian Optimization with Unknown Constraints | Jose Miguel Hernandez-Lobato, Michael Gelbart, Matthew Hoffman, Ryan Adams, Zoubin Ghahramani | In this paper, we present a new information-based method called Predictive Entropy Search with Constraints (PESC). |

181 | A Theoretical Analysis of Metric Hypothesis Transfer Learning | Michaël Perrot, Amaury Habrard | We propose an on-average-replace-two-stability model allowing us to prove fast generalization rates when an auxiliary source metric is used to bias the regularizer. |

182 | Generative Moment Matching Networks | Yujia Li, Kevin Swersky, Rich Zemel | We consider the problem of learning deep generative models from data. |

183 | Stay on path: PCA along graph paths | Megasthenis Asteris, Anastasios Kyrillidis, Alex Dimakis, Han-Gyol Yi, Bharath Chandrasekaran | We propose two algorithms to approximate the solution of the constrained quadratic maximization, and recover a component with the desired properties. |

184 | Deep Learning with Limited Numerical Precision | Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan | We study the effect of limited precision data representation and computation on neural network training. |

185 | Safe Screening for Multi-Task Feature Learning with Multiple Data Matrices | Jie Wang, Jieping Ye | In this paper, we propose a novel screening rule—that is based on the dual projection onto convex sets (DPC)—to quickly identify the inactive features—that have zero coefficients in the solution vectors across all tasks. |

186 | Harmonic Exponential Families on Manifolds | Taco Cohen, Max Welling | We define an extremely flexible class of exponential family distributions on manifolds such as the torus, sphere, and rotation groups, and show that for these distributions the gradient of the log-likelihood can be computed efficiently using a non-commutative generalization of the Fast Fourier Transform (FFT). |

187 | Training Deep Convolutional Neural Networks to Play Go | Christopher Clark, Amos Storkey | To solve this problem we introduce a number of novel techniques, including a method of tying weights in the network to ‘hard code’ symmetries that are expected to exist in the target function, and demonstrate in an ablation study they considerably improve performance. |

188 | Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP) | Andrew Wilson, Hannes Nickisch | We introduce a new structured kernel interpolation (SKI) framework, which generalises and unifies inducing point methods for scalable Gaussian processes (GPs). |

189 | Learning Deep Structured Models | Liang-Chieh Chen, Alexander Schwing, Alan Yuille, Raquel Urtasun | The goal of this paper is to combine MRFs with deep learning to estimate complex representations while taking into account the dependencies between the output random variables. |

190 | Community Detection Using Time-Dependent Personalized PageRank | Haim Avron, Lior Horesh | We present an efficient local algorithm for approximating a graph diffusion that generalizes both the celebrated personalized PageRank and its recent competitor/companion – the heat kernel. |

191 | Scalable Variational Inference in Log-supermodular Models | Josip Djolonga, Andreas Krause | We consider the problem of approximate Bayesian inference in log-supermodular models. |

192 | Variational Inference for Gaussian Process Modulated Poisson Processes | Chris Lloyd, Tom Gunter, Michael Osborne, Stephen Roberts | We present the first fully variational Bayesian inference scheme for continuous Gaussian-process-modulated Poisson processes. |

193 | Scalable Deep Poisson Factor Analysis for Topic Modeling | Zhe Gan, Changyou Chen, Ricardo Henao, David Carlson, Lawrence Carin | A new framework for topic modeling is developed, based on deep graphical models, where interactions between topics are inferred through deep latent binary hierarchies. |

194 | Hidden Markov Anomaly Detection | Nico Goernitz, Mikio Braun, Marius Kloft | We introduce a new anomaly detection methodology for data with latent dependency structure. |

195 | Robust Estimation of Transition Matrices in High Dimensional Heavy-tailed Vector Autoregressive Processes | Huitong Qiu, Sheng Xu, Fang Han, Han Liu, Brian Caffo | In this paper, we develop a unified framework for modeling and estimating heavy-tailed VAR processes. |

196 | Convex Calibrated Surrogates for Hierarchical Classification | Harish Ramaswamy, Ambuj Tewari, Shivani Agarwal | In this work, we study the consistency of hierarchical classification algorithms with respect to a natural loss, namely the tree distance metric on the hierarchy tree of class labels, via the usage of calibrated surrogates. |

197 | Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks | Jose Miguel Hernandez-Lobato, Ryan Adams | In this work we present a novel scalable method for learning Bayesian neural networks, called probabilistic backpropagation (PBP). |

198 | Active Nearest Neighbors in Changing Environments | Christopher Berlind, Ruth Urner | We propose a novel nonparametric algorithm, ANDA, that combines an active nearest neighbor querying strategy with nearest neighbor prediction. |

199 | Bipartite Edge Prediction via Transductive Learning over Product Graphs | Hanxiao Liu, Yiming Yang | We propose a new optimization framework to map the two sides of the intrinsic structures onto the manifold structure of the edges via a graph product, and to reduce the original problem to vertex label propagation over the product graph. |

200 | Trust Region Policy Optimization | John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, Philipp Moritz | In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. |

201 | Discovering Temporal Causal Relations from Subsampled Data | Mingming Gong, Kun Zhang, Bernhard Schoelkopf, Dacheng Tao, Philipp Geiger | In this paper, we assume that the time series at the true causal frequency follow the vector autoregressive model. |

202 | Preference Completion: Large-scale Collaborative Ranking from Pairwise Comparisons | Dohyung Park, Joe Neeman, Jin Zhang, Sujay Sanghavi, Inderjit Dhillon | In this paper we consider the collaborative ranking setting: a pool of users each provides a set of pairwise preferences over a small subset of the set of d possible items; from these we need to predict each user’s preferences for items s/he has not yet seen. |

203 | Causal Inference by Identification of Vector Autoregressive Processes with Hidden Components | Philipp Geiger, Kun Zhang, Bernhard Schoelkopf, Mingming Gong, Dominik Janzing | In this paper we take a different approach: We assume that X together with some hidden Z forms a first order vector autoregressive (VAR) process with transition matrix A, and argue why it is more valid to interpret A causally instead of the estimate B̂. |

204 | On Symmetric and Asymmetric LSHs for Inner Product Search | Behnam Neyshabur, Nathan Srebro | We consider the problem of designing locality sensitive hashes (LSH) for inner product similarity, and of the power of asymmetric hashes in this context. |

205 | The Kendall and Mallows Kernels for Permutations | Yunlong Jiao, Jean-Philippe Vert | We show that the widely used Kendall tau correlation coefficient is a positive definite kernel for permutations. |

206 | Bayesian Multiple Target Localization | Purnima Rajan, Weidong Han, Raphael Sznitman, Peter Frazier, Bruno Jedynak | We present an empirical evaluation of this policy on simulated data for the problem of detecting multiple instances of the same object in an image. |

207 | Submodularity in Data Subset Selection and Active Learning | Kai Wei, Rishabh Iyer, Jeff Bilmes | We study the problem of selecting a subset of big data to train a classifier while incurring minimal performance loss. |

208 | Variational Generative Stochastic Networks with Collaborative Shaping | Philip Bachman, Doina Precup | We present empirical results on the MNIST and TFD datasets which show that our approach offers state-of-the-art performance, both quantitatively and from a qualitative point of view. |

209 | Adding vs. Averaging in Distributed Primal-Dual Optimization | Chenxin Ma, Virginia Smith, Martin Jaggi, Michael Jordan, Peter Richtarik, Martin Takac | In this paper, we present a novel generalization of the recent communication-efficient primal-dual framework (COCOA) for distributed optimization. |

210 | Feature-Budgeted Random Forest | Feng Nan, Joseph Wang, Venkatesh Saligrama | We propose a novel random forest algorithm to minimize prediction error for a user-specified average feature acquisition budget. |

211 | Entropic Graph-based Posterior Regularization | Maxwell Libbrecht, Michael Hoffman, Jeff Bilmes, William Noble | We present a three-way alternating optimization algorithm with closed-form updates for performing inference on this joint model and learning its parameters. |

212 | Unsupervised Riemannian Metric Learning for Histograms Using Aitchison Transformations | Tam Le, Marco Cuturi | We consider in this paper the problem of learning a Riemannian metric on the simplex given unlabeled histogram data. |

213 | Low-Rank Matrix Recovery from Row-and-Column Affine Measurements | Or Zuk, Avishai Wagner | We propose a simple algorithm for the problem based on Singular Value Decomposition (SVD) and least-squares (LS), which we term alg. |

214 | Algorithms for the Hard Pre-Image Problem of String Kernels and the General Problem of String Prediction | Sébastien Giguère, Amélie Rolland, François Laviolette, Mario Marchand | For this problem, we propose an upper bound on the prediction function which has low computational complexity and which can be used in a branch and bound search algorithm to obtain optimal solutions. |

215 | A Multitask Point Process Predictive Model | Wenzhao Lian, Ricardo Henao, Vinayak Rao, Joseph Lucas, Lawrence Carin | In this work we propose a multitask point process model, leveraging information from all tasks via a hierarchical Gaussian process (GP). |

216 | A Hybrid Approach for Probabilistic Inference using Random Projections | Michael Zhu, Stefano Ermon | We introduce a new meta-algorithm for probabilistic inference in graphical models based on random projections. |

217 | Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, Yoshua Bengio | Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. |

218 | Learning to Search Better than Your Teacher | Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, John Langford | We provide a new learning to search algorithm, LOLS, which does well relative to the reference policy, but additionally guarantees low regret compared to deviations from the learned policy: a local-optimality guarantee. |

219 | Gated Feedback Recurrent Neural Networks | Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio | In this work, we propose a novel recurrent neural network (RNN) architecture. |

220 | Context-based Unsupervised Data Fusion for Decision Making | Erfan Soltanmohammadi, Mort Naraghi-Pour, Mihaela Schaar | In this paper, we propose an unsupervised joint estimation-detection scheme to estimate the accuracies of the local classifiers as functions of data context and to fuse the local decisions of the classifiers. |

221 | Phrase-based Image Captioning | Remi Lebret, Pedro Pinheiro, Ronan Collobert | In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. |

222 | Celeste: Variational inference for a generative model of astronomical images | Jeffrey Regier, Andrew Miller, Jon McAuliffe, Ryan Adams, Matt Hoffman, Dustin Lang, David Schlegel, Mr Prabhat | We present a new, fully generative model of optical telescope image sets, along with a variational procedure for inference. |

223 | Distributional Rank Aggregation, and an Axiomatic Analysis | Adarsh Prasad, Harsh Pareek, Pradeep Ravikumar | We introduce a variant of this problem we call distributional rank aggregation, where the ranking data is only available via the induced distribution over the set of all permutations. |

224 | Gradient-based Hyperparameter Optimization through Reversible Learning | Dougal Maclaurin, David Duvenaud, Ryan Adams | Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. |

225 | Bimodal Modelling of Source Code and Natural Language | Miltos Allamanis, Daniel Tarlow, Andrew Gordon, Yi Wei | We consider the problem of building probabilistic models that jointly model short natural language utterances and source code snippets. |

226 | Cheap Bandits | Manjesh Hanawal, Venkatesh Saligrama, Michal Valko, Remi Munos | In this paper we propose CheapUCB, an algorithm that matches the regret guarantees of the known algorithms for this setting and at the same time guarantees a linear cost gain over them. |

227 | Subsampling Methods for Persistent Homology | Frederic Chazal, Brittany Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, Larry Wasserman | We study the risk of two estimators and we prove that the subsampling approach carries stable topological information while achieving a great reduction in computational complexity. |

228 | An embarrassingly simple approach to zero-shot learning | Bernardino Romera-Paredes, Philip Torr | In this paper we describe a zero-shot learning approach that can be implemented in just one line of code, yet it is able to outperform state of the art approaches on standard datasets. |

229 | Binary Embedding: Fundamental Limits and Fast Algorithm | Xinyang Yi, Constantine Caramanis, Eric Price | Specifically, for arbitrary N distinct points in \mathbb{S}^{p-1}, our goal is to encode each point using m-dimensional binary strings such that we can reconstruct their geodesic distance up to δ uniform distortion. |

230 | Scalable Bayesian Optimization Using Deep Neural Networks | Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, Ryan Adams | In this work, we explore the use of neural networks as an alternative to GPs to model distributions over functions. |

231 | How Hard is Inference for Structured Prediction? | Amir Globerson, Tim Roughgarden, David Sontag, Cafer Yildirim | The goal of this paper is to develop a theoretical explanation of the empirical effectiveness of heuristic inference algorithms for solving such structured prediction problems. |

232 | Online Time Series Prediction with Missing Data | Oren Anava, Elad Hazan, Assaf Zeevi | We consider the problem of time series prediction in the presence of missing data. |

233 | Proteins, Particles, and Pseudo-Max-Marginals: A Submodular Approach | Jason Pacheco, Erik Sudderth | Motivated by the challenging problem of protein side chain prediction, we extend D-PMP in several key ways to create a generic MAP inference algorithm for loopy models. |

234 | A Fast Variational Approach for Learning Markov Random Field Language Models | Yacine Jernite, Alexander Rush, David Sontag | In this work, we take a step towards overcoming these difficulties. |

235 | Removing systematic errors for exoplanet search via latent causes | Bernhard Schölkopf, David Hogg, Dun Wang, Dan Foreman-Mackey, Dominik Janzing, Carl-Johann Simon-Gabriel, Jonas Peters | We describe a method for removing the effect of confounders in order to reconstruct a latent quantity of interest. |

236 | Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes | Yves-Laurent Kom Samo, Stephen Roberts | In this paper we propose an efficient, scalable non-parametric Gaussian process model for inference on Poisson point processes. |

237 | Correlation Clustering in Data Streams | KookJin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, Anthony Wirth | In this paper, we address the problem of *correlation clustering* in the dynamic data stream model. |

238 | Learning Scale-Free Networks by Dynamic Node Specific Degree Prior | Qingming Tang, Siqi Sun, Jinbo Xu | To this end, this paper proposes a ranking-based method to dynamically estimate the degree of a node, which makes the resultant optimization problem challenging to solve. |

239 | Deep Unsupervised Learning using Nonequilibrium Thermodynamics | Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, Surya Ganguli | Here, we develop an approach that simultaneously achieves both flexibility and tractability. |

240 | Modeling Order in Neural Word Embeddings at Scale | Andrew Trask, David Gilmore, Matthew Russell | We propose a new neural language model incorporating both word order and character order in its embedding. |

241 | Distributed Inference for Dirichlet Process Mixture Models | Hong Ge, Yutian Chen, Moquan Wan, Zoubin Ghahramani | In this paper, we propose an efficient distributed inference algorithm for the DP and the HDP mixture model. |

242 | Compressing Neural Networks with the Hashing Trick | Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, Yixin Chen | We present a novel network architecture, HashedNets, that exploits inherent redundancy in neural networks to achieve drastic reductions in model sizes. |

243 | Intersecting Faces: Non-negative Matrix Factorization With New Guarantees | Rong Ge, James Zou | In this paper, we propose the notion of subset-separable NMF, which substantially generalizes the property of separability. |

244 | Scaling up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix | Roger Grosse, Ruslan Salakhudinov | We present FActorized Natural Gradient (FANG), an approximation to natural gradient descent where the Fisher matrix is approximated with a Gaussian graphical model whose precision matrix can be computed efficiently. |

245 | A Deeper Look at Planning as Learning from Replay | Harm Vanseijen, Rich Sutton | In this paper, we look more deeply at how replay blurs the line between model-based and model-free methods. |

246 | Optimal and Adaptive Algorithms for Online Boosting | Alina Beygelzimer, Satyen Kale, Haipeng Luo | We study online boosting, the task of converting any weak online learner into a strong online learner. |

247 | Global Convergence of Stochastic Gradient Descent for Some Non-convex Matrix Problems | Christopher De Sa, Christopher Re, Kunle Olukotun | In this paper, we exhibit a step size scheme for SGD on a low-rank least-squares problem, and we prove that, under broad sampling conditions, our method converges globally from a random starting point within O(ε^{-1} n \log n) steps with constant probability for constant-rank problems. |

248 | An Empirical Exploration of Recurrent Network Architectures | Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever | In this work, we aim to determine whether the LSTM architecture is optimal or whether much better architectures exist. |

249 | Complete Dictionary Recovery Using Nonconvex Optimization | Ju Sun, Qing Qu, John Wright | We consider the problem of recovering a complete (i.e., square and invertible) dictionary \mathbf{A}_0 from \mathbf{Y} = \mathbf{A}_0 \mathbf{X}_0 with \mathbf{Y} \in \mathbb{R}^{n \times p}. |

250 | Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret | Haitham Bou Ammar, Rasul Tutunov, Eric Eaton | Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. |

251 | PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent | Cho-Jui Hsieh, Hsiang-Fu Yu, Inderjit Dhillon | In this paper, we parallelize the DCD algorithms in LIBLINEAR. |

252 | High Confidence Policy Improvement | Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh | We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameter that requires expert tuning. |

253 | Fixed-point algorithms for learning determinantal point processes | Zelda Mariet, Suvrit Sra | We present experimental results on both real and simulated data to illustrate the numerical performance of our technique. |

254 | Consistent Multiclass Algorithms for Complex Performance Measures | Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, Shivani Agarwal | This paper presents new consistent algorithms for multiclass learning with complex performance measures, defined by arbitrary functions of the confusion matrix. |

255 | Optimizing Neural Networks with Kronecker-factored Approximate Curvature | James Martens, Roger Grosse | We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). |

256 | A Convex Exemplar-based Approach to MAD-Bayes Dirichlet Process Mixture Models | En-Hsu Yen, Xin Lin, Kai Zhong, Pradeep Ravikumar, Inderjit Dhillon | In this paper, we consider the exemplar-based version of MAD-Bayes formulation for DP and Hierarchical DP (HDP) mixture model. |

257 | Multi-instance multi-label learning in the presence of novel class instances | Anh Pham, Raviv Raich, Xiaoli Fern, Jesús Pérez Arriaga | In this paper, this problem is addressed using a discriminative probabilistic model that accounts for novel instances. |

258 | Entropy-Based Concentration Inequalities for Dependent Variables | Liva Ralaivola, Massih-Reza Amini | Along the way, we prove a new Talagrand concentration inequality for fractionally sub-additive functions of dependent variables. |

259 | PU Learning for Matrix Completion | Cho-Jui Hsieh, Nagarajan Natarajan, Inderjit Dhillon | In this paper, we consider the matrix completion problem when the observations are one-bit measurements of some underlying matrix M, and in particular the observed samples consist only of ones and no zeros. |

260 | An Asynchronous Distributed Proximal Gradient Method for Composite Convex Optimization | Necdet Aybat, Zi Wang, Garud Iyengar | We propose a distributed first-order augmented Lagrangian (DFAL) algorithm to minimize the sum of composite convex functions, where each term in the sum is a private cost function belonging to a node, and only nodes connected by an edge can directly communicate with each other. |

261 | Sparse Subspace Clustering with Missing Entries | Congyuan Yang, Daniel Robinson, Rene Vidal | We consider the problem of clustering incomplete data drawn from a union of subspaces. |

262 | Moderated and Drifting Linear Dynamical Systems | Jinyan Guan, Kyle Simek, Ernesto Brau, Clayton Morrison, Emily Butler, Kobus Barnard | This change of focus reduces opportunities for efficient inference, and we propose sampling procedures to learn and fit the models. |

263 | Boosted Categorical Restricted Boltzmann Machine for Computational Prediction of Splice Junctions | Taehoon Lee, Sungroh Yoon | In this paper, we propose a deep belief network-based methodology for computational splice junction prediction. |

264 | Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo | Yu-Xiang Wang, Stephen Fienberg, Alex Smola | We consider the problem of Bayesian learning on sensitive datasets and present two simple but somewhat surprising results that connect Bayesian learning to “differential privacy”, a cryptographic approach to protect individual-level privacy while permitting database-level utility. |

265 | A trust-region method for stochastic variational inference with applications to streaming data | Lucas Theis, Matt Hoffman | We address this problem by replacing the natural gradient step of stochastic varitional inference with a trust-region update. |

266 | Inference in a Partially Observed Queuing Model with Applications in Ecology | Kevin Winner, Garrett Bernstein, Dan Sheldon | The contribution of this paper is to formulate a latent variable model and develop a novel Gibbs sampler based on Markov bases to perform inference using the correct, but intractable, likelihood function. |

267 | Deterministic Independent Component Analysis | Ruitong Huang, Andras Gyorgy, Csaba Szepesvári | We present, for the first time in the literature, consistent, polynomial-time algorithms to recover non-Gaussian source signals and the mixing matrix with a reconstruction error that vanishes at a 1/\sqrt{T} rate using T observations and scales only polynomially with the natural parameters of the problem. |

268 | On the Optimality of Multi-Label Classification under Subset Zero-One Loss for Distributions Satisfying the Composition Property | Maxime Gasse, Alexandre Aussem, Haytham Elghazel | In this paper, we show that the subsets of labels that appear as irreducible factors in the factorization of the conditional distribution of the label set given the input features play a pivotal role for multi-label classification in the context of subset Zero-One loss minimization, as they divide the learning task into simpler independent multi-class problems. |

269 | Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization | Roy Frostig, Rong Ge, Sham Kakade, Aaron Sidford | To achieve this, we establish a framework, based on the classical proximal point algorithm, useful for accelerating recent fast stochastic algorithms in a black-box fashion. |

270 | A New Generalized Error Path Algorithm for Model Selection | Bin Gu, Charles Ling | Recently, various solution path algorithms have been proposed for several important learning algorithms including support vector classification, Lasso, and so on. |