# Paper Digest: NIPS 2015 Highlights

The Conference on Neural Information Processing Systems (NIPS) is one of the top machine learning conferences in the world. In 2015, it is to be held in Montreal, Canada.

To help the AI community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights and summaries to quickly get the main idea of each paper.

We thank all authors for writing these interesting papers, and readers for reading our digests. If you do not want to miss any interesting AI paper, you are welcome to **sign up for our free paper digest service** to get new paper updates customized to your own interests on a daily basis.

Paper Digest Team

team@paperdigest.org

#### TABLE 1: NIPS 2015 Papers

No. | Title | Authors | Highlight |
---|---|---|---|

1 | Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing | Nihar Bhadresh Shah, Dengyong Zhou | To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize workers to answer only the questions that they are sure of and skip the rest. |

2 | Learning with Symmetric Label Noise: The Importance of Being Unhinged | Brendan van Rooyen, Aditya Menon, Robert C. Williamson | In this paper, we propose a convex, classification-calibrated loss and prove that it is SLN-robust. |

3 | Algorithmic Stability and Uniform Generalization | Ibrahim M. Alabdulmohsin | In this paper, we prove that algorithmic stability in the inference process is equivalent to uniform generalization across all parametric loss functions. |

4 | Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models | Theodoros Tsiligkaridis, Keith Forsythe | Motivated by large-sample asymptotics, we propose a novel adaptive low-complexity design for the Dirichlet process concentration parameter and show that the number of classes grows at most at a logarithmic rate. |

5 | Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling | Xiaocheng Shang, Zhanxing Zhu, Benedict Leimkuhler, Amos J. Storkey | In this article, we propose a covariance-controlled adaptive Langevin thermostat that can effectively dissipate parameter-dependent noise while maintaining a desired target distribution. |

6 | Robust Portfolio Optimization | Huitong Qiu, Fang Han, Han Liu, Brian Caffo | We propose a robust portfolio optimization approach based on quantile statistics. |

7 | Logarithmic Time Online Multiclass prediction | Anna E. Choromanska, John Langford | We study the problem of multiclass classification with an extremely large number of classes (k), with the goal of obtaining train and test time complexity logarithmic in the number of classes. |

8 | Planar Ultrametrics for Image Segmentation | Julian E. Yarkony, Charless Fowlkes | We study the problem of hierarchical clustering on planar graphs. |

9 | Expressing an Image Stream with a Sequence of Natural Sentences | Cesc C. Park, Gunhee Kim | We propose an approach for generating a sequence of natural sentences for an image stream. |

10 | Parallel Correlation Clustering on Big Graphs | Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan | We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup. |

11 | Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun | In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. |

12 | Space-Time Local Embeddings | Ke Sun, Jun Wang, Alexandros Kalousis, Stephane Marchand-Maillet | We present basic definitions with interesting counter-intuitions. |

13 | A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements | Qinqing Zheng, John Lafferty | We propose a simple, scalable, and fast gradient descent algorithm to optimize a nonconvex objective for the rank minimization problem and a closely related family of semidefinite programs. |

14 | Smooth Interactive Submodular Set Cover | Bryan D. He, Yisong Yue | In this paper, we propose a new extension, which we call smooth interactive submodular set cover, that allows the target threshold to vary depending on the plausibility of each hypothesis. |

15 | Galileo: Perceiving Physical Object Properties by Integrating a Physics Engine with Deep Learning | Jiajun Wu, Ilker Yildirim, Joseph J. Lim, Bill Freeman, Josh Tenenbaum | We propose a generative model for solving these problems of physical scene understanding from real-world videos and images. |

16 | On the Pseudo-Dimension of Nearly Optimal Auctions | Jamie H. Morgenstern, Tim Roughgarden | We introduce t-level auctions to interpolate between simple auctions, such as welfare maximization with reserve prices, and optimal auctions, thereby balancing the competing demands of expressivity and simplicity. |

17 | Unlocking neural population non-stationarities using hierarchical dynamics models | Mijung Park, Gergo Bohner, Jakob H. Macke | To better understand the nature of co-variability in neural circuits and their impact on cortical information processing, we introduce a hierarchical dynamics model that is able to capture inter-trial modulations in firing rates, as well as neural population dynamics. |

18 | Bayesian Manifold Learning: The Locally Linear Latent Variable Model (LL-LVM) | Mijung Park, Wittawat Jitkrittum, Ahmad Qamar, Zoltan Szabo, Lars Buesing, Maneesh Sahani | We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for non-linear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. |

19 | Color Constancy by Learning to Predict Chromaticity from Luminance | Ayan Chakrabarti | In this paper, we show that the per-pixel color statistics of natural scenes—without any spatial or semantic context—can by themselves be a powerful cue for color constancy. |

20 | Fast and Accurate Inference of Plackett–Luce Models | Lucas Maystre, Matthias Grossglauser | We take advantage of this perspective and formulate a new spectral algorithm that is significantly more accurate than previous ones for the Plackett–Luce model. |

21 | Probabilistic Line Searches for Stochastic Optimization | Maren Mahsereci, Philipp Hennig | Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. |

22 | Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets | Armand Joulin, Tomas Mikolov | In this paper, we discuss the limitations of standard deep learning approaches and show that some of these limitations can be overcome by learning how to grow the complexity of a model in a structured way. |

23 | Where are they looking? | Adria Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba | In this paper, we propose a deep neural network-based approach for gaze-following and a new benchmark dataset for thorough evaluation. |

24 | The Pareto Regret Frontier for Bandits | Tor Lattimore | I show that the price for such unbalanced worst-case regret guarantees is rather high. |

25 | On the Limitation of Spectral Methods: From the Gaussian Hidden Clique Problem to Rank-One Perturbations of Gaussian Tensors | Andrea Montanari, Daniel Reichman, Ofer Zeitouni | We consider the following detection problem: given a realization of a symmetric matrix $X$ of dimension $n$, distinguish between the hypothesis that all upper triangular variables are i.i.d. Gaussian variables with mean $0$ and variance $1$ and the hypothesis that there is a planted principal submatrix $B$ of dimension $L$ for which all upper triangular variables are i.i.d. Gaussians with mean $1$ and variance $1$, whereas all other upper triangular elements of $X$ not in $B$ are i.i.d. Gaussian variables with mean $0$ and variance $1$. |

26 | Measuring Sample Quality with Stein's Method | Jackson Gorham, Lester Mackey | To address these challenges, we introduce a new computable quality measure based on Stein’s method that bounds the discrepancy between sample and target expectations over a large class of test functions. |

27 | Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution | Yan Huang, Wei Wang, Liang Wang | Considering that recurrent neural network (RNN) can model long-term contextual information of temporal sequences well, we propose a bidirectional recurrent convolutional network for efficient multi-frame SR. Different from vanilla RNN, 1) the commonly-used recurrent full connections are replaced with weight-sharing convolutional connections and 2) conditional convolutional connections from previous input layers to current hidden layer are added for enhancing visual-temporal dependency modelling. |

28 | Bounding errors of Expectation-Propagation | Guillaume P. Dehaene, Simon Barthelmé | In this article, we prove that the approximation errors made by EP can be bounded. |

29 | A fast, universal algorithm to learn parametric nonlinear embeddings | Miguel A. Carreira-Perpinan, Max Vladymyrov | Using the method of auxiliary coordinates, we derive a training algorithm that works by alternating steps that train an auxiliary embedding with steps that train the mapping. |

30 | Texture Synthesis Using Convolutional Neural Networks | Leon Gatys, Alexander S. Ecker, Matthias Bethge | Here we introduce a new model of natural textures based on the feature spaces of convolutional neural networks optimised for object recognition. |

31 | Extending Gossip Algorithms to Distributed Estimation of U-statistics | Igor Colin, Aurélien Bellet, Joseph Salmon, Stéphan Clémençon | This paper proposes new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the U-statistic of interest. |

32 | Streaming, Distributed Variational Inference for Bayesian Nonparametrics | Trevor Campbell, Julian Straub, John W. Fisher III, Jonathan P. How | This paper presents a methodology for creating streaming, distributed inference algorithms for Bayesian nonparametric (BNP) models. To address this, the paper develops a combinatorial optimization problem over component correspondences, and provides an efficient solution technique. |

33 | Learning visual biases from human imagination | Carl Vondrick, Hamed Pirsiavash, Aude Oliva, Antonio Torralba | In this paper, we investigate whether we can extract these biases and transfer them into a machine recognition system. We introduce a novel method that, inspired by well-known tools in human psychophysics, estimates the biases that the human visual system might use for recognition, but in computer vision feature spaces. |

34 | Smooth and Strong: MAP Inference with Linear Convergence | Ofer Meshi, Mehrdad Mahdavi, Alex Schwing | Specifically, we introduce strong convexity by adding a quadratic term to the LP relaxation objective. |

35 | Copeland Dueling Bandits | Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, Maarten de Rijke | Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. |

36 | Optimal Ridge Detection using Coverage Risk | Yen-Chi Chen, Christopher R. Genovese, Shirley Ho, Larry Wasserman | We introduce the concept of coverage risk as an error measure for density ridge estimation. The coverage risk generalizes the mean integrated square error to set estimation. We propose two risk estimators for the coverage risk and we show that we can select tuning parameters by minimizing the estimated risk. We study the rate of convergence for coverage risk and prove consistency of the risk estimators. We apply our method to three simulated datasets and to cosmology data. In all the examples, the proposed method successfully recovers the underlying density structure. |

37 | Top-k Multiclass SVM | Maksim Lapin, Matthias Hein, Bernt Schiele | We propose top-k multiclass SVM as a direct method to optimize for top-k performance. |

38 | Policy Evaluation Using the Ω-Return | Philip S. Thomas, Scott Niekum, Georgios Theocharous, George Konidaris | We propose the Ω-return as an alternative to the λ-return currently used by the TD(λ) family of algorithms. |

39 | Orthogonal NMF through Subspace Exploration | Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis | Existing algorithms rely mostly on heuristics, which despite their good empirical performance, lack provable performance guarantees. We present a new ONMF algorithm with provable approximation guarantees. For any constant dimension $k$, we obtain an additive EPTAS without any assumptions on the input. |

40 | Stochastic Online Greedy Learning with Semi-bandit Feedbacks | Tian Lin, Jian Li, Wei Chen | In this paper, we address the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time. |

41 | Deeply Learning the Messages in Message Passing Inference | Guosheng Lin, Chunhua Shen, Ian Reid, Anton van den Hengel | We apply our method to semantic image segmentation and achieve impressive performance, which demonstrates the effectiveness and usefulness of our CNN message learning method. |

42 | Synaptic Sampling: A Bayesian Approach to Neural Network Plasticity and Rewiring | David Kappel, Stefan Habenschuss, Robert Legenstein, Wolfgang Maass | We reexamine in this article the conceptual and mathematical framework for understanding the organization of plasticity in spiking neural networks. |

43 | Accelerated Proximal Gradient Methods for Nonconvex Programming | Huan Li, Zhouchen Lin | To address this issue, we introduce a monitor-corrector step and extend APG for general nonconvex and nonsmooth programs. |

44 | Approximating Sparse PCA from Incomplete Data | Abhisek Kundu, Petros Drineas, Malik Magdon-Ismail | We study how well one can recover sparse principal components of a data matrix using a sketch formed from a few of its elements. |

45 | Nonparametric von Mises Estimators for Entropies, Divergences and Mutual Informations | Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, James M. Robins | We propose and analyse estimators for statistical functionals of one or more distributions under nonparametric assumptions. Our estimators are derived from the von Mises expansion and are based on the theory of influence functions, which appear in the semiparametric statistics literature. We show that estimators based either on data-splitting or a leave-one-out technique enjoy fast rates of convergence and other favorable theoretical properties. We apply this framework to derive estimators for several popular information theoretic quantities, and via empirical evaluation, show the advantage of this approach over existing estimators. |

46 | Column Selection via Adaptive Sampling | Saurabh Paul, Malik Magdon-Ismail, Petros Drineas | We propose a new adaptive sampling algorithm that can be used to improve any relative-error column selection algorithm. |

47 | HONOR: Hybrid Optimization for NOn-convex Regularized problems | Pinghua Gong, Jieping Ye | In this paper, we propose an efficient Hybrid Optimization algorithm for NOn-convex Regularized problems (HONOR). |

48 | 3D Object Proposals for Accurate Object Class Detection | Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G. Berneshawi, Huimin Ma, Sanja Fidler, Raquel Urtasun | The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. |

49 | Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits | Huasen Wu, R. Srikant, Xin Liu, Chong Jiang | We show that the proposed UCB-ALP algorithm achieves logarithmic regret except in certain boundary cases. Further, we design algorithms and obtain similar regret analysis results for more general systems with unknown context distribution or heterogeneous costs. |

50 | Tensorizing Neural Networks | Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, Dmitry P. Vetrov | In this paper we convert the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved. In particular, for the Very Deep VGG networks we report the compression factor of the dense weight matrix of a fully-connected layer up to 200000 times leading to the compression factor of the whole network up to 7 times. |

51 | Parallelizing MCMC with Random Partition Trees | Xiangyu Wang, Fangjian Guo, Katherine A. Heller, David B. Dunson | In this article, we propose a new EP-MCMC algorithm PART that solves these problems. |

52 | A Reduced-Dimension fMRI Shared Response Model | Po-Hsuan (Cameron) Chen, Janice Chen, Yaara Yeshurun, Uri Hasson, James Haxby, Peter J. Ramadge | We develop a shared response model for aggregating multi-subject fMRI data that accounts for different functional topographies among anatomically aligned datasets. |

53 | Spectral Learning of Large Structured HMMs for Comparative Epigenomics | Chicheng Zhang, Jimin Song, Kamalika Chaudhuri, Kevin Chen | We develop a latent variable model and an efficient spectral algorithm motivated by the recent emergence of very large data sets of chromatin marks from multiple human cell types. |

54 | Individual Planning in Infinite-Horizon Multiagent Settings: Inference, Structure and Scalability | Xia Qu, Prashant Doshi | We exploit the graphical model structure specific to I-POMDPs, and present a new approach based on block-coordinate descent for further speed up. |

55 | Estimating Mixture Models via Mixtures of Polynomials | Sida Wang, Arun Tejasvi Chaganty, Percy S. Liang | In this work, we present Polymom, a unifying framework based on the method of moments in which estimation procedures are easily derivable, just as in EM. |

56 | On the Global Linear Convergence of Frank-Wolfe Optimization Variants | Simon Lacoste-Julien, Martin Jaggi | In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that has been successfully applied in practice: FW with away steps, pairwise FW, fully-corrective FW and Wolfe’s minimum norm point algorithm, and prove for the first time that they all enjoy global linear convergence under a weaker condition than strong convexity. |

57 | Deep Knowledge Tracing | Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, Jascha Sohl-Dickstein | Knowledge tracing, where a machine models the knowledge of a student as they interact with coursework, is an established and significantly unsolved problem in computer supported education. In this paper we explore the benefit of using recurrent neural networks to model student learning. This family of models have important advantages over current state of the art methods in that they do not require the explicit encoding of human domain knowledge, and have a far more flexible functional form which can capture substantially more complex student interactions. We show that these neural networks outperform the current state of the art in prediction on real student data, while allowing straightforward interpretation and discovery of structure in the curriculum. These results suggest a promising new line of research for knowledge tracing. |

58 | Rethinking LDA: Moment Matching for Discrete ICA | Anastasia Podosinnikova, Francis Bach, Simon Lacoste-Julien | We consider moment matching techniques for estimation in Latent Dirichlet Allocation (LDA). |

59 | Efficient Compressive Phase Retrieval with Constrained Sensing Vectors | Sohail Bahmani, Justin Romberg | We propose a robust and efficient approach to the problem of compressive phase retrieval in which the goal is to reconstruct a sparse vector from the magnitude of a number of its linear measurements. |

60 | Barrier Frank-Wolfe for Marginal Inference | Rahul G. Krishnan, Simon Lacoste-Julien, David Sontag | We introduce a globally-convergent algorithm for optimizing the tree-reweighted (TRW) variational objective over the marginal polytope. |

61 | Learning Theory and Algorithms for Forecasting Non-stationary Time Series | Vitaly Kuznetsov, Mehryar Mohri | We present data-dependent learning bounds for the general scenario of non-stationary non-mixing stochastic processes. |

62 | Compressive spectral embedding: sidestepping the SVD | Dinesh Ramasamy, Upamanyu Madhow | In this paper, we propose a low-complexity compressive spectral embedding algorithm, which employs random projections and finite order polynomial expansions to compute approximations to SVD-based embedding. |

63 | A Nonconvex Optimization Framework for Low Rank Matrix Estimation | Tuo Zhao, Zhaoran Wang, Han Liu | In this paper, we define the notion of projected oracle divergence based on which we establish sufficient conditions for the success of nonconvex optimization. |

64 | Automatic Variational Inference in Stan | Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, David Blei | We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI); we implement it in Stan (code available), a probabilistic programming system. |

65 | Attention-Based Models for Speech Recognition | Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio | We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. |

66 | Closed-form Estimators for High-dimensional Generalized Linear Models | Eunho Yang, Aurelie C. Lozano, Pradeep K. Ravikumar | We propose a class of closed-form estimators for GLMs under high-dimensional sampling regimes. |

67 | Online F-Measure Optimization | Róbert Busa-Fekete, Balázs Szörényi, Krzysztof Dembczynski, Eyke Hüllermeier | In this paper, we study the problem of F-measure maximization in the setting of online learning. |

68 | Online Rank Elicitation for Plackett-Luce: A Dueling Bandits Approach | Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, Eyke Hüllermeier | We study the problem of online rank elicitation, assuming that rankings of a set of alternatives obey the Plackett-Luce distribution. |

69 | M-Best-Diverse Labelings for Submodular Energies and Beyond | Alexander Kirillov, Dmytro Shlezinger, Dmitry P. Vetrov, Carsten Rother, Bogdan Savchynskyy | In this work we show that the joint inference of $M$ best diverse solutions can be formulated as a submodular energy minimization if the original MAP-inference problem is submodular, hence fast inference techniques can be used. |

70 | Tractable Bayesian Network Structure Learning with Bounded Vertex Cover Number | Janne H. Korhonen, Pekka Parviainen | In this paper, we propose bounded vertex cover number Bayesian networks as an alternative to bounded tree-width networks. |

71 | Learning Large-Scale Poisson DAG Models based on OverDispersion Scoring | Gunwoong Park, Garvesh Raskutti | In this paper, we address the question of identifiability and learning algorithms for large-scale Poisson Directed Acyclic Graphical (DAG) models. |

72 | Training Restricted Boltzmann Machine via the Thouless-Anderson-Palmer free energy | Marylou Gabrie, Eric W. Tramel, Florent Krzakala | We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. |

73 | Character-level Convolutional Networks for Text Classification | Xiang Zhang, Junbo Zhao, Yann LeCun | This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. |

74 | Robust Feature-Sample Linear Discriminant Analysis for Brain Disorders Diagnosis | Ehsan Adeli-Mosabbeb, Kim-Han Thung, Le An, Feng Shi, Dinggang Shen | In this paper, we propose a classification method based on the least-squares formulation of linear discriminant analysis, which simultaneously detects the sample-outliers and feature-noises. |

75 | Black-box optimization of noisy functions with unknown smoothness | Jean-Bastien Grill, Michal Valko, Remi Munos | Our contribution is an adaptive optimization algorithm, POO or parallel optimistic optimization, that is able to deal with this setting. |

76 | Recovering Communities in the General Stochastic Block Model Without Knowing the Parameters | Emmanuel Abbe, Colin Sandon | This paper introduces efficient algorithms that do not require such knowledge and yet achieve the optimal information-theoretic tradeoffs identified in Abbe-Sandon FOCS15. |

77 | Deep learning with Elastic Averaging SGD | Sixin Zhang, Anna E. Choromanska, Yann LeCun | We propose synchronous and asynchronous variants of the new algorithm. |

78 | Monotone k-Submodular Function Maximization with Size Constraints | Naoto Ohsaka, Yuichi Yoshida | A $k$-submodular function is a generalization of a submodular function, where the input consists of $k$ disjoint subsets, instead of a single subset, of the domain. Many machine learning problems, including influence maximization with $k$ kinds of topics and sensor placement with $k$ kinds of sensors, can be naturally modeled as the problem of maximizing monotone $k$-submodular functions. In this paper, we give constant-factor approximation algorithms for maximizing monotone $k$-submodular functions subject to several size constraints. The running time of our algorithms is almost linear in the domain size. We experimentally demonstrate that our algorithms outperform baseline algorithms in terms of the solution quality. |

79 | Active Learning from Weak and Strong Labelers | Chicheng Zhang, Kamalika Chaudhuri | Our goal is to learn a classifier with low error on data labeled by the oracle, while using the weak labeler to reduce the number of label queries made to the oracle. |

80 | On the Optimality of Classifier Chain for Multi-label Classification | Weiwei Liu, Ivor Tsang | Based on our results, we propose a dynamic programming based classifier chain (CC-DP) algorithm to search the globally optimal label order for CC and a greedy classifier chain (CC-Greedy) algorithm to find a locally optimal CC. |

81 | Robust Regression via Hard Thresholding | Kush Bhatia, Prateek Jain, Purushottam Kar | We study the problem of Robust Least Squares Regression (RLSR) where several response variables can be adversarially corrupted. |

82 | Sparse Local Embeddings for Extreme Multi-label Classification | Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, Prateek Jain | We conducted extensive experiments on several real-world as well as benchmark data sets and compare our method against state-of-the-art methods for extreme multi-label classification. |

83 | Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems | Yuxin Chen, Emmanuel Candes | This paper is concerned with finding a solution $x$ to a quadratic system of equations $y_i = |\langle a_i, x \rangle|^2$, $i = 1, 2, \ldots, m$. |

84 | A Framework for Individualizing Predictions of Disease Trajectories by Exploiting Multi-Resolution Structure | Peter Schulam, Suchi Saria | We propose a hierarchical latent variable model that individualizes predictions of disease trajectories. |

85 | Subspace Clustering with Irrelevant Features via Robust Dantzig Selector | Chao Qu, Huan Xu | We propose a method termed “robust Dantzig selector” which can successfully identify the clustering structure even with the presence of irrelevant features. |

86 | Sparse PCA via Bipartite Matchings | Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, Alexandros G. Dimakis | We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with *disjoint* supports that jointly capture the maximum possible variance. Such components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. |

87 | Fast Randomized Kernel Ridge Regression with Statistical Guarantees | Ahmed Alaoui, Michael W. Mahoney | Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of *statistical leverage scores* to the setting of kernel ridge regression, we are able to identify a sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the *effective dimensionality* of the problem. |

88 | Online Learning for Adversaries with Memory: Price of Past Mistakes | Oren Anava, Elad Hazan, Shie Mannor | In this work we extend the notion of learning with memory to the general Online Convex Optimization (OCO) framework, and present two algorithms that attain low regret. |

89 | Convolutional spike-triggered covariance analysis for neural subunit models | Anqi Wu, Il Memming Park, Jonathan W. Pillow | Here we address this problem by forging a theoretical connection between spike-triggered covariance analysis and nonlinear subunit models. |

90 | Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting | Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, Wang-chun Woo | In this paper, we formulate precipitation nowcasting as a spatiotemporal sequence forecasting problem in which both the input and the prediction target are spatiotemporal sequences. |

91 | GAP Safe screening rules for sparse multi-task and multi-class models | Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon | In this paper we derive new safe rules for generalized linear models regularized with L1 and L1/L2 norms. |

92 | Empirical Localization of Homogeneous Divergences on Discrete Sample Spaces | Takashi Takenouchi, Takafumi Kanamori | In this paper, we propose a novel parameter estimator for probabilistic models on discrete space. |

93 | Statistical Model Criticism using Kernel Two Sample Tests | James R. Lloyd, Zoubin Ghahramani | We propose an exploratory approach to statistical model criticism using maximum mean discrepancy (MMD) two sample tests. |

94 | Precision-Recall-Gain Curves: PR Analysis Done Right | Peter Flach, Meelis Kull | We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions — e.g., the area under a PR curve takes the arithmetic mean of precision values whereas the $F_{\beta}$ score applies the harmonic mean. |

95 | A Generalization of Submodular Cover via the Diminishing Return Property on the Integer Lattice | Tasuku Soma, Yuichi Yoshida | We consider a generalization of the submodular cover problem based on the concept of diminishing return property on the integer lattice. |

96 | Bidirectional Recurrent Neural Networks as Generative Models | Mathias Berglund, Tapani Raiko, Mikko Honkala, Leo Kärkkäinen, Akos Vetek, Juha T. Karhunen | We propose two probabilistic interpretations of bidirectional RNNs that can be used to reconstruct missing gaps efficiently. |

97 | Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling | Zheng Qu, Peter Richtarik, Tong Zhang | We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. |

98 | Maximum Likelihood Learning With Arbitrary Treewidth via Fast-Mixing Parameter Sets | Justin Domke | This paper explores an alternative notion of a tractable set, namely a set of “fast-mixing parameters” where Markov chain Monte Carlo (MCMC) inference can be guaranteed to quickly converge to the stationary distribution. |

99 | Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks | Minhyung Cho, Chandra Dhir, Jaehyung Lee | Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks |

100 | Large-scale probabilistic predictors with and without guarantees of validity | Vladimir Vovk, Ivan Petej, Valentina Fedorova | Large-scale probabilistic predictors with and without guarantees of validity |

101 | Shepard Convolutional Neural Networks | Jimmy SJ Ren, Li Xu, Qiong Yan, Wenxiu Sun | In this paper, we draw on Shepard interpolation and design Shepard Convolutional Neural Networks (ShCNN) which efficiently realizes end-to-end trainable TVI operators in the network. |

102 | Matrix Manifold Optimization for Gaussian Mixtures | Reshad Hosseini, Suvrit Sra | To bring our ideas to fruition, we develop a well-tuned Riemannian LBFGS method that proves superior to known competing methods (e.g., Riemannian conjugate gradient). |

103 | Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding | Rie Johnson, Tong Zhang | This paper presents a new semi-supervised framework with convolutional neural networks (CNNs) for text categorization. |

104 | Parallel Recursive Best-First AND/OR Search for Exact MAP Inference in Graphical Models | Akihiro Kishimoto, Radu Marinescu, Adi Botea | We introduce a new parallel shared-memory recursive best-first AND/OR search algorithm, called SPRBFAOO, that explores the search space in a best-first manner while operating with restricted memory. |

105 | Convolutional Neural Networks with Intra-Layer Recurrent Connections for Scene Labeling | Ming Liang, Xiaolin Hu, Bo Zhang | We adopt a deep recurrent convolutional neural network (RCNN) for this task, which was originally proposed for object recognition. |

106 | Bounding the Cost of Search-Based Lifted Inference | David B. Smith, Vibhav G. Gogate | In this paper, we present a principled approach to address this problem. |

107 | Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families | Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton | We propose Kernel Hamiltonian Monte Carlo (KMC), a gradient-free adaptive MCMC algorithm based on Hamiltonian Monte Carlo (HMC). |

108 | Linear Multi-Resource Allocation with Semi-Bandit Feedback | Tor Lattimore, Koby Crammer, Csaba Szepesvari | Our main contribution is the new setting and an algorithm with nearly-optimal regret analysis. |

109 | Unsupervised Learning by Program Synthesis | Kevin Ellis, Armando Solar-Lezama, Josh Tenenbaum | We introduce an unsupervised learning algorithm that combines probabilistic modeling with solver-based techniques for program synthesis. We apply our techniques to both a visual learning domain and a language learning problem, showing that our algorithm can learn many visual concepts from only a few examples and that it can recover some English inflectional morphology. Taken together, these results give both a new approach to unsupervised learning of symbolic compositional structures, and a technique for applying program synthesis tools to noisy data. |

110 | Enforcing balance allows local supervised learning in spiking recurrent networks | Ralph Bourdoukan, Sophie Denève | Using a top-down approach, we show how networks of integrate-and-fire neurons can learn arbitrary linear dynamical systems by feeding back their error as a feed-forward input. |

111 | Fast and Guaranteed Tensor Decomposition via Sketching | Yining Wang, Hsiao-Yu Tung, Alexander J. Smola, Anima Anandkumar | In this paper, we propose fast and randomized tensor CP decomposition algorithms based on sketching. |

112 | Differentially private subspace clustering | Yining Wang, Yu-Xiang Wang, Aarti Singh | In this work, we build on the framework of “differential privacy” and present two provably private subspace clustering algorithms. |

113 | Predtron: A Family of Online Algorithms for General Prediction Problems | Prateek Jain, Nagarajan Natarajan, Ambuj Tewari | We offer a general framework to derive mistake driven online algorithms and associated loss bounds. |

114 | Weighted Theta Functions and Embeddings with Applications to Max-Cut, Clustering and Summarization | Fredrik D. Johansson, Ankani Chattoraj, Chiranjib Bhattacharyya, Devdatt Dubhashi | We introduce a unifying generalization of the Lovász theta function, and the associated geometric embedding, for graphs with weights on both nodes and edges. |

115 | SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk | Guillaume Papa, Stéphan Clémençon, Aurélien Bellet | In this paper, we focus on how to best implement a stochastic approximation approach to solve such risk minimization problems. |

116 | On Top-k Selection in Multi-Armed Bandits and Hidden Bipartite Graphs | Wei Cao, Jian Li, Yufei Tao, Zhize Li | This paper discusses how to efficiently choose from $n$ unknown distributions the $k$ ones whose means are the greatest by a certain metric, up to a small relative error. |

117 | The Brain Uses Reliability of Stimulus Information when Making Perceptual Decisions | Sebastian Bitzer, Stefan Kiebel | We here show that even the basic drift diffusion model, which has frequently been used to explain experimental findings in perceptual decision making, implicitly relies on estimates of stimulus reliability. |

118 | Fast Classification Rates for High-dimensional Gaussian Generative Models | Tianyang Li, Adarsh Prasad, Pradeep K. Ravikumar | We present a novel analysis of the classification error of any linear discriminant approach given conditional Gaussian models. |

119 | Fast Distributed k-Center Clustering with Outliers on Massive Data | Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, Benjamin Moseley | In this work, we consider the widely used k-center clustering problem and its variant used to handle noisy data, k-center with outliers. |

120 | Human Memory Search as Initial-Visit Emitting Random Walk | Kwang-Sung Jun, Jerry Zhu, Timothy T. Rogers, Zhuoran Yang, Ming Yuan | In this paper, we propose the first efficient maximum likelihood estimate (MLE) for INVITE by decomposing the censored output into a series of absorbing random walks. |

121 | Non-convex Statistical Optimization for Sparse Tensor Graphical Model | Wei Sun, Zhaoran Wang, Han Liu, Guang Cheng | We consider the estimation of sparse graphical models that characterize the dependency structure of high-dimensional tensor-valued data. |

122 | Convergence Rates of Active Learning for Maximum Likelihood Estimation | Kamalika Chaudhuri, Sham M. Kakade, Praneeth Netrapalli, Sujay Sanghavi | In this paper, we shift our attention to a more general setting — maximum likelihood estimation. |

123 | Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis | Jimei Yang, Scott E. Reed, Ming-Hsuan Yang, Honglak Lee | In this paper, we propose a novel recurrent convolutional encoder-decoder network that is trained end-to-end on the task of rendering rotated objects starting from a single image. |

124 | Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets | Pascal Vincent, Alexandre de Brébisson, Xavier Bouthillier | In this work we develop an original algorithmic approach that, for a family of loss functions that includes squared error and spherical softmax, can compute the exact loss, gradient update for the output weights, and gradient for backpropagation, all in $O(d^2)$ per example instead of $O(Dd)$, remarkably without ever computing the $D$-dimensional output. |

125 | Backpropagation for Energy-Efficient Neuromorphic Computing | Steve K. Esser, Rathinakumar Appuswamy, Paul Merolla, John V. Arthur, Dharmendra S. Modha | To demonstrate, we trained a sparsely connected network that runs on the TrueNorth chip using the MNIST dataset. |

126 | Alternating Minimization for Regression Problems with Vector-valued Outputs | Prateek Jain, Ambuj Tewari | We provide finite sample upper and lower bounds on the estimation error of OLS and MLE, in two popular models: a) Pooled model, b) Seemingly Unrelated Regression (SUR) model. |

127 | Learning both Weights and Connections for Efficient Neural Network | Song Han, Jeff Pool, John Tran, William Dally | To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections. |

128 | Optimal Rates for Random Fourier Features | Bharath Sriperumbudur, Zoltan Szabo | In this paper, we provide a detailed finite-sample theoretical analysis about the approximation quality of RFFs by (i) establishing optimal (in terms of the RFF dimension, and growing set size) performance guarantees in uniform norm, and (ii) presenting guarantees in $L^r$ ($1 \le r < \infty$) norms. |

129 | The Population Posterior and Bayesian Modeling on Streams | James McInerney, Rajesh Ranganath, David Blei | We develop population variational Bayes, a new approach for using Bayesian modeling to analyze streams of data. |

130 | Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees | François-Xavier Briol, Chris Oates, Mark Girolami, Michael A. Osborne | In this paper, we present the first probabilistic integrator that admits such theoretical treatment, called Frank-Wolfe Bayesian Quadrature (FWBQ). |

131 | Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks | Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer | We propose a curriculum learning strategy to gently change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead. |

132 | Unified View of Matrix Completion under General Structural Constraints | Suriya Gunasekar, Arindam Banerjee, Joydeep Ghosh | In this paper, we present a unified analysis of matrix completion under general low-dimensional structural constraints induced by *any* norm regularization. We consider two estimators for the general problem of structured matrix completion, and provide unified upper bounds on the sample complexity and the estimation error. |

133 | Efficient Output Kernel Learning for Multiple Tasks | Pratik Kumar Jawanpuria, Maksim Lapin, Matthias Hein, Bernt Schiele | Using the theory of positive semidefinite kernels we show in this paper that for a certain class of regularizers on the output kernel, the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem. |

134 | Scalable Adaptation of State Complexity for Nonparametric Hidden Markov Models | Michael C. Hughes, William T. Stephenson, Erik Sudderth | We develop an inference algorithm for the sticky hierarchical Dirichlet process hidden Markov model that scales to big datasets by processing a few sequences at a time yet allows rapid adaptation of the state space cardinality. |

135 | Variational Consensus Monte Carlo | Maxim Rabinovich, Elaine Angelino, Michael I. Jordan | We introduce variational consensus Monte Carlo (VCMC), a variational Bayes algorithm that optimizes over aggregation functions to obtain samples from a distribution that better approximates the target. |

136 | Newton-Stein Method: A Second Order Method for GLMs via Stein's Lemma | Murat A. Erdogdu | We consider the problem of efficiently computing the maximum likelihood estimator in Generalized Linear Models (GLMs) when the number of observations is much larger than the number of coefficients ($n \gg p \gg 1$). |

137 | Practical and Optimal LSH for Angular Distance | Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt | We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. |

138 | Learning to Linearize Under Uncertainty | Ross Goroshin, Michael F. Mathieu, Yann LeCun | In this work we suggest a new architecture and loss for training deep feature hierarchies that linearize the transformations observed in unlabeled natural video sequences. |

139 | Finite-Time Analysis of Projected Langevin Monte Carlo | Sebastien Bubeck, Ronen Eldan, Joseph Lehec | We analyze the projected Langevin Monte Carlo (LMC) algorithm, a close cousin of projected Stochastic Gradient Descent (SGD). |

140 | Deep Visual Analogy-Making | Scott E. Reed, Yi Zhang, Yuting Zhang, Honglak Lee | In this paper we develop a novel deep network trained end-to-end to perform visual analogy making, which is the task of transforming a query image according to an example pair of related images. |

141 | Matrix Completion from Fewer Entries: Spectral Detectability and Rank Estimation | Alaa Saade, Florent Krzakala, Lenka Zdeborová | We propose a spectral algorithm for these two tasks called MaCBetH (for Matrix Completion with the Bethe Hessian). |

142 | Online Learning with Adversarial Delays | Kent Quanrud, Daniel Khashabi | Our main contribution is to show that standard algorithms for online learning already have simple regret bounds in the most general setting of delayed feedback, making adjustments to the analysis and not to the algorithms themselves. |

143 | Multi-Layer Feature Reduction for Tree Structured Group Lasso via Hierarchical Projection | Jie Wang, Jieping Ye | In this paper, we propose a novel Multi-Layer Feature reduction method (MLFre) to quickly identify the inactive nodes (the groups of features with zero coefficients in the solution) hierarchically in a top-down fashion, which are guaranteed to be irrelevant to the response. |

144 | Minimum Weight Perfect Matching via Blossom Belief Propagation | Sung-Soo Ahn, Sejun Park, Michael Chertkov, Jinwoo Shin | In this paper, we develop the first such algorithm, coined Blossom-BP, for solving the minimum weight matching problem over arbitrary graphs. |

145 | Efficient Thompson Sampling for Online Matrix-Factorization Recommendation | Jaya Kawale, Hung H. Bui, Branislav Kveton, Long Tran-Thanh, Sanjay Chawla | Efficient Thompson Sampling for Online Matrix-Factorization Recommendation |

146 | Improved Iteration Complexity Bounds of Cyclic Block Coordinate Descent for Convex Problems | Ruoyu Sun, Mingyi Hong | Improved Iteration Complexity Bounds of Cyclic Block Coordinate Descent for Convex Problems |

147 | Lifted Symmetry Detection and Breaking for MAP Inference | Timothy Kopp, Parag Singla, Henry Kautz | In this work, we extend symmetry breaking to the problem of model finding in weighted and unweighted relational theories, a class of problems that includes MAP inference in Markov Logic and similar statistical-relational languages. |

148 | Evaluating the statistical significance of biclusters | Jason D. Lee, Yuekai Sun, Jonathan E. Taylor | We develop a framework for performing statistical inference on biclusters found by score-based algorithms. |

149 | Discriminative Robust Transformation Learning | Jiaji Huang, Qiang Qiu, Guillermo Sapiro, Robert Calderbank | This paper proposes a framework for learning features that are robust to data variation, which is particularly important when only a limited number of training samples are available. |

150 | Bandits with Unobserved Confounders: A Causal Approach | Elias Bareinboim, Andrew Forney, Judea Pearl | In this paper, we show that formalizing this distinction has conceptual and algorithmic implications to the bandit setting. |

151 | Scalable Semi-Supervised Aggregation of Classifiers | Akshay Balsubramani, Yoav Freund | We present and empirically evaluate an efficient algorithm that learns to aggregate the predictions of an ensemble of binary classifiers. |

152 | Online Learning with Gaussian Payoffs and Side Observations | Yifan Wu, András György, Csaba Szepesvari | We consider a sequential learning problem with Gaussian payoffs and side information: after selecting an action $i$, the learner receives information about the payoff of every action $j$ in the form of Gaussian observations whose mean is the same as the mean payoff, but the variance depends on the pair $(i,j)$ (and may be infinite). |

153 | Private Graphon Estimation for Sparse Graphs | Christian Borgs, Jennifer Chayes, Adam Smith | We design algorithms for fitting a high-dimensional statistical model to a large, sparse network without revealing sensitive information of individual members. |

154 | SubmodBoxes: Near-Optimal Search for a Set of Diverse Object Proposals | Qing Sun, Dhruv Batra | In order to speed up repeated application of B&B, we propose a novel generalization of Minoux’s ‘lazy greedy’ algorithm to the B&B tree. |

155 | Fast Second Order Stochastic Backpropagation for Variational Inference | Kai Fan, Ziteng Wang, Jeff Beck, James Kwok, Katherine A. Heller | We propose a second-order (Hessian or Hessian-free) based optimization method for variational inference inspired by Gaussian backpropagation, and argue that quasi-Newton optimization can be developed as well. |

156 | Randomized Block Krylov Methods for Stronger and Faster Approximate Singular Value Decomposition | Cameron Musco, Christopher Musco | We address this problem for the first time by showing that both Block Krylov Iteration and Simultaneous Iteration give nearly optimal PCA for any matrix. |

157 | Cross-Domain Matching for Bag-of-Words Data via Kernel Embeddings of Latent Distributions | Yuya Yoshikawa, Tomoharu Iwata, Hiroshi Sawada, Takeshi Yamada | We propose a kernel-based method for finding matching between instances across different domains, such as multilingual documents and images with annotations. |

158 | Scalable Inference for Gaussian Process Models with Black-Box Likelihoods | Amir Dezfouli, Edwin V. Bonilla | We propose a sparse method for scalable automated variational inference (AVI) in a large class of models with Gaussian process (GP) priors, multiple latent functions, multiple outputs and non-linear likelihoods. |

159 | Fast Bidirectional Probability Estimation in Markov Models | Siddhartha Banerjee, Peter Lofgren | We develop a new bidirectional algorithm for estimating Markov chain multi-step transition probabilities: given a Markov chain, we want to estimate the probability of hitting a given target state in $\ell$ steps after starting from a given source distribution. |

160 | Probabilistic Variational Bounds for Graphical Models | Qiang Liu, John W. Fisher III, Alexander T. Ihler | We propose a simple Monte Carlo based inference method that augments convex variational bounds by adding importance sampling (IS). |

161 | Linear Response Methods for Accurate Covariance Estimates from Mean Field Variational Bayes | Ryan J. Giordano, Tamara Broderick, Michael I. Jordan | We generalize linear response methods from statistical physics to deliver accurate uncertainty estimates for model variables—both for individual variables and coherently across variables. |

162 | Combinatorial Cascading Bandits | Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari | We propose a UCB-like algorithm for solving our problems, CombCascade; and prove gap-dependent and gap-free upper bounds on its n-step regret. |

163 | Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path | Daniel J. Hsu, Aryeh Kontorovich, Csaba Szepesvari | This article provides the first procedure for computing a fully data-dependent interval that traps the mixing time $t_{mix}$ of a finite reversible ergodic Markov chain at a prescribed confidence level. |

164 | Policy Gradient for Coherent Risk Measures | Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, Shie Mannor | In this work, we extend the policy gradient method to the whole class of coherent risk measures, which is widely accepted in finance and operations research, among other fields. |

165 | Fast Rates for Exp-concave Empirical Risk Minimization | Tomer Koren, Kfir Levy | We consider Empirical Risk Minimization (ERM) in the context of stochastic optimization with exp-concave and smooth losses—a general optimization framework that captures several important learning problems including linear and logistic regression, learning SVMs with the squared hinge-loss, portfolio selection and more. |

166 | Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks | Emily L. Denton, Soumith Chintala, Arthur Szlam, Rob Fergus | In this paper we introduce a generative model capable of producing high quality samples of natural images. |

167 | Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation | Seunghoon Hong, Hyeonwoo Noh, Bohyung Han | We propose a novel deep neural network architecture for semi-supervised semantic segmentation using heterogeneous annotations. |

168 | Equilibrated adaptive learning rates for non-convex optimization | Yann Dauphin, Harm de Vries, Yoshua Bengio | We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. |

169 | BACKSHIFT: Learning causal cyclic graphs from unknown shift interventions | Dominik Rothenhäusler, Christina Heinze, Jonas Peters, Nicolai Meinshausen | We propose a simple method to learn linear causal cyclic models in the presence of latent variables. |

170 | Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach | Yinlam Chow, Aviv Tamar, Shie Mannor, Marco Pavone | In this paper we address the problem of decision making within a Markov decision process (MDP) framework where risk and modeling errors are taken into account. |

171 | Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care | Sorathan Chaturapruek, John C. Duchi, Christopher Ré | We show that asymptotically, completely asynchronous stochastic gradient procedures achieve optimal (even to constant factors) convergence rates for the solution of convex optimization problems under nearly the same conditions required for asymptotic optimality of standard stochastic gradient procedures. |

172 | Lifelong Learning with Non-i.i.d. Tasks | Anastasia Pentina, Christoph H. Lampert | In this work we aim at extending theoretical foundations of lifelong learning. |

173 | Optimal Linear Estimation under Unknown Nonlinear Transform | Xinyang Yi, Zhaoran Wang, Constantine Caramanis, Han Liu | We propose a novel spectral-based estimation procedure and show that we can recover $\beta^*$ in settings (i.e., classes of link function $f$) where previous algorithms fail. |

174 | Learning with Group Invariant Features: A Kernel Perspective. | Youssef Mroueh, Stephen Voinea, Tomaso A. Poggio | We analyze in this paper a random feature map based on a theory of invariance (*I-theory*) introduced in [AnselmiLRMTP13]. |

175 | Regularized EM Algorithms: A Unified Framework and Statistical Guarantees | Xinyang Yi, Constantine Caramanis | We address precisely this setting through a unified treatment using regularization. |

176 | Distributionally Robust Logistic Regression | Soroosh Shafieezadeh Abadeh, Peyman Mohajerin Esfahani, Daniel Kuhn | This paper proposes a distributionally robust approach to logistic regression. |

177 | Adaptive Stochastic Optimization: From Sets to Paths | Zhan Wei Lim, David Hsu, Wee Sun Lee | We describe Recursive Adaptive Coverage (RAC), a new adaptive stochastic optimization algorithm that exploits these conditions, and apply it to two planning tasks under uncertainty. |

178 | Beyond Convexity: Stochastic Quasi-Convex Optimization | Elad Hazan, Kfir Levy, Shai Shalev-Shwartz | In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally-Lipschitz. |

179 | A Tractable Approximation to Optimal Point Process Filtering: Application to Neural Encoding | Yuval Harel, Ron Meir, Manfred Opper | We develop an analytically tractable Bayesian approximation to optimal filtering based on point process observations, which allows us to introduce distributional assumptions about sensory cell properties, that greatly facilitates the analysis of optimal encoding in situations deviating from common assumptions of uniform coding. |

180 | Sum-of-Squares Lower Bounds for Sparse PCA | Tengyu Ma, Avi Wigderson | Specifically, we consider the *Sparse Principal Component Analysis* (Sparse PCA) problem, and the family of *Sum-of-Squares* (SoS, aka Lasserre/Parrilo) convex relaxations. |

181 | Max-Margin Majority Voting for Learning from Crowds | Tian Tian, Jun Zhu | This paper presents max-margin majority voting ($M^3V$) to improve the discriminative ability of majority voting and further presents a Bayesian generalization to incorporate the flexibility of generative methods on modeling noisy observations with worker confusion matrices. |

182 | Learning with Incremental Iterative Regularization | Lorenzo Rosasco, Silvia Villa | Within a statistical learning setting, we propose and study an iterative regularization algorithm for least squares defined by an incremental gradient method. |

183 | Halting in Random Walk Kernels | Mahito Sugiyama, Karsten Borgwardt | We theoretically show that halting may occur in geometric random walk kernels. |

184 | MCMC for Variationally Sparse Gaussian Processes | James Hensman, Alexander G. Matthews, Maurizio Filippone, Zoubin Ghahramani | This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in support of the function but otherwise free-form. |

185 | Less is More: Nyström Computational Regularization | Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco | We study Nyström type subsampling approaches to large scale kernel methods, and prove learning bounds in the statistical learning setting, where random sampling and high probability estimates are considered. |

186 | Infinite Factorial Dynamical Model | Isabel Valera, Francisco Ruiz, Lennart Svensson, Fernando Perez-Cruz | We propose the infinite factorial dynamic model (iFDM), a general Bayesian nonparametric model for source separation. |

187 | Regularization Path of Cross-Validation Error Lower Bounds | Atsushi Shibagaki, Yoshiki Suzuki, Masayuki Karasuyama, Ichiro Takeuchi | Careful tuning of a regularization parameter is indispensable in many machine learning tasks because it has a significant impact on generalization performances. Nevertheless, current practice of regularization parameter tuning is more of an art than a science, e.g., it is hard to tell how many grid-points would be needed in cross-validation (CV) for obtaining a solution with sufficiently small CV error. In this paper we propose a novel framework for computing a lower bound of the CV errors as a function of the regularization parameter, which we call regularization path of CV error lower bounds. The proposed framework can be used for providing a theoretical approximation guarantee on a set of solutions, in the sense of how far the CV error of the current best solution could be from the best possible CV error in the entire range of the regularization parameters. We demonstrate through numerical experiments that a theoretically guaranteed choice of regularization parameter in the above sense is possible with reasonable computational costs. |

188 | Attractor Network Dynamics Enable Preplay and Rapid Path Planning in Maze-like Environments | Dane S. Corneil, Wulfram Gerstner | Here, we show how a particular mapping of space allows for the immediate generation of trajectories between arbitrary start and goal locations in an environment, based only on the mapped representation of the goal. |

189 | Teaching Machines to Read and Comprehend | Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Phil Blunsom | In this work we define a new methodology that resolves this bottleneck and provides large scale supervised reading comprehension data. |

190 | Principal Differences Analysis: Interpretable Characterization of Differences between Distributions | Jonas W. Mueller, Tommi Jaakkola | We introduce principal differences analysis for analyzing differences between high-dimensional distributions. |

191 | When are Kalman-Filter Restless Bandits Indexable? | Christopher R. Dance, Tomi Silander | We study the restless bandit associated with an extremely simple scalar Kalman filter model in discrete time. |

192 | Segregated Graphs and Marginals of Chain Graph Models | Ilya Shpitser | In this paper, we show that special mixed graphs which we call segregated graphs can be associated, via a Markov property, with supermodels of a marginal of chain graphs defined only by conditional independences. |

193 | Efficient Non-greedy Optimization of Decision Trees | Mohammad Norouzi, Maxwell Collins, Matthew A. Johnson, David J. Fleet, Pushmeet Kohli | In this paper, we present an algorithm for optimizing the split functions at all levels of the tree jointly with the leaf parameters, based on a global objective. |

194 | Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process | Ye Wang, David B. Dunson | We solve these issues by proposing a novel Coulomb repulsive process (Corp) for locations of points on the manifold, inspired by physical models of electrostatic interactions among particles. |

195 | Inverse Reinforcement Learning with Locally Consistent Reward Functions | Quoc Phong Nguyen, Bryan Kian Hsiang Low, Patrick Jaillet | This paper presents a novel generalization of the IRL problem that allows each trajectory to be generated by multiple locally consistent reward functions, hence catering to more realistic and complex experts’ behaviors. |

196 | Communication Complexity of Distributed Convex Learning and Optimization | Yossi Arjevani, Ohad Shamir | We study the fundamental limits to communication-efficient distributed methods for convex learning and optimization, under different assumptions on the information available to individual machines, and the types of functions considered. |

197 | End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture | Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, Li Deng | We develop a fully discriminative learning approach for supervised Latent Dirichlet Allocation (LDA) model using Back Propagation (i.e., BP-sLDA), which maximizes the posterior probability of the prediction variable given the input document. |

198 | Subset Selection by Pareto Optimization | Chao Qian, Yang Yu, Zhi-Hua Zhou | In this paper, we propose the POSS approach which employs evolutionary Pareto optimization to find a small-sized subset with good performance. |

199 | On the Accuracy of Self-Normalized Log-Linear Models | Jacob Andreas, Maxim Rabinovich, Michael I. Jordan, Dan Klein | In this paper, we analyze a recently proposed technique known as “self-normalization”, which introduces a regularization term in training to penalize log normalizers for deviating from zero. |

200 | Regret Lower Bound and Optimal Algorithm in Finite Stochastic Partial Monitoring | Junpei Komiyama, Junya Honda, Hiroshi Nakagawa | In this paper, we study partial monitoring with finite actions and stochastic outcomes. |

201 | Is Approval Voting Optimal Given Approval Votes? | Ariel D. Procaccia, Nisarg Shah | We challenge this assertion by proposing a probabilistic framework of noisy voting, and asking whether approval voting yields an alternative that is most likely to be the best alternative, given k-approval votes. |

202 | Regressive Virtual Metric Learning | Michaël Perrot, Amaury Habrard | In this paper, instead of bringing closer examples of the same class and pushing far away examples of different classes, we propose to move the examples with respect to virtual points. |

203 | Analysis of Robust PCA via Local Incoherence | Huishuai Zhang, Yi Zhou, Yingbin Liang | We investigate the robust PCA problem of decomposing an observed matrix into the sum of a low-rank matrix and a sparse error matrix via convex programming Principal Component Pursuit (PCP). |

204 | Learning to Transduce with Unbounded Memory | Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, Phil Blunsom | In this paper we explore the representational power of these models using synthetic grammars designed to exhibit phenomena similar to those found in real transduction problems such as machine translation. |

205 | Max-Margin Deep Generative Models | Chongxuan Li, Jun Zhu, Tianlin Shi, Bo Zhang | This paper presents max-margin deep generative models (mmDGMs), which explore the strongly discriminative principle of max-margin learning to improve the discriminative power of DGMs, while retaining the generative capability. |

206 | Spherical Random Features for Polynomial Kernels | Jeffrey Pennington, Felix Xinnan X. Yu, Sanjiv Kumar | The question we address in this work is: if we know a priori that data is so normalized, can we devise a more compact map? |

207 | Rectified Factor Networks | Djork-Arné Clevert, Andreas Mayr, Thomas Unterthiner, Sepp Hochreiter | We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. |

208 | Learning Bayesian Networks with Thousands of Variables | Mauro Scanagatta, Cassio P. de Campos, Giorgio Corani, Marco Zaffalon | We present a method for learning Bayesian networks from data sets containing thousands of variables without the need for structure constraints. |

209 | Matrix Completion Under Monotonic Single Index Models | Ravi Sastry Ganti, Laura Balzano, Rebecca Willett | We propose a novel matrix completion method that alternates between low-rank matrix estimation and monotonic function estimation to estimate the missing matrix elements. |

210 | Visalogy: Answering Visual Analogy Questions | Fereshteh Sadeghi, C. Lawrence Zitnick, Ali Farhadi | In this paper, we study the problem of answering visual analogy questions. We pose this problem as learning an embedding that encourages pairs of analogous images with similar transformations to be close together using convolutional neural networks with a quadruple Siamese architecture. |

211 | Tree-Guided MCMC Inference for Normalized Random Measure Mixture Models | Juho Lee, Seungjin Choi | In this paper, we present a hybrid inference algorithm for NRMM models, which combines the merits of both MCMC and IBHC. |

212 | Streaming Min-max Hypergraph Partitioning | Dan Alistarh, Jennifer Iglesias, Milan Vojnovic | We consider the problem of partitioning the set of items into a given number of parts such that the maximum number of topics covered by a part of the partition is minimized. |

213 | Collaboratively Learning Preferences from Ordinal Data | Sewoong Oh, Kiran K. Thekumparampil, Jiaming Xu | We present the convex relaxation approach in two contexts of interest: collaborative ranking and bundled choice modeling. |

214 | Biologically Inspired Dynamic Textures for Probing Motion Perception | Jonathan Vacher, Andrew Isaac Meso, Laurent U. Perrinet, Gabriel Peyré | Importantly, we show that this model can equivalently be described as a stochastic partial differential equation. |

215 | Generative Image Modeling Using Spatial LSTMs | Lucas Theis, Matthias Bethge | We here introduce a recurrent image model based on multi-dimensional long short-term memory units which are particularly suited for image modeling due to their spatial structure. |

216 | Robust PCA with compressed data | Wooseok Ha, Rina Foygel Barber | We examine the robust principal component analysis (RPCA) problem under data compression, where the data $Y$ is approximately given by $(L + S)\cdot C$, that is, a low-rank $+$ sparse data matrix that has been compressed to size $n\times m$ (with $m$ substantially smaller than the original dimension $d$) via multiplication with a compression matrix $C$. |

217 | Sampling from Probabilistic Submodular Models | Alkis Gotovos, Hamed Hassani, Andreas Krause | In this paper, we investigate the use of Markov chain Monte Carlo sampling to perform approximate inference in general log-submodular and log-supermodular models. |

218 | COEVOLVE: A Joint Point Process Model for Information Diffusion and Network Co-evolution | Mehrdad Farajtabar, Yichen Wang, Manuel Gomez Rodriguez, Shuang Li, Hongyuan Zha, Le Song | We experimented with both synthetic data and data gathered from Twitter, and show that our model provides a good fit to the data as well as more accurate predictions than alternatives. |

219 | Supervised Learning for Dynamical System Learning | Ahmed Hefny, Carlton Downey, Geoffrey J. Gordon | We demonstrate the effectiveness of our framework by showing examples where nonlinear regression or lasso let us learn better state representations than plain linear regression does; the correctness of these instances follows directly from our general analysis. |

220 | Regret-Based Pruning in Extensive-Form Games | Noam Brown, Tuomas Sandholm | The new algorithm maintains CFR’s convergence guarantees while making iterations significantly faster—even if previously known pruning techniques are used in the comparison. |

221 | Fast Two-Sample Testing with Analytic Representations of Probability Measures | Kacper P. Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, Arthur Gretton | We propose a class of nonparametric two-sample tests with a cost linear in the sample size. |

222 | Learning to Segment Object Candidates | Pedro O. Pinheiro, Ronan Collobert, Piotr Dollar | In this paper, we propose a new way to generate object proposals, introducing an approach based on a discriminative convolutional network. |

223 | GP Kernels for Cross-Spectrum Analysis | Kyle R. Ulrich, David E. Carlson, Kafui Dzirasa, Lawrence Carin | In this paper, we develop a novel covariance kernel for multiple outputs, called the cross-spectral mixture (CSM) kernel. |

224 | Secure Multi-party Differential Privacy | Peter Kairouz, Sewoong Oh, Pramod Viswanath | We study the problem of multi-party interactive function computation under differential privacy. |

225 | Spatial Transformer Networks | Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu | In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. |

226 | Anytime Influence Bounds and the Explosive Behavior of Continuous-Time Diffusion Networks | Kevin Scaman, Rémi Lemonnier, Nicolas Vayatis | We introduce the Laplace Hazard matrix and show that its spectral radius fully characterizes the dynamics of the contagion both in terms of influence and of explosion time. |

227 | Multi-class SVMs: From Tighter Data-Dependent Generalization Bounds to Novel Algorithms | Yunwen Lei, Urun Dogan, Alexander Binder, Marius Kloft | This paper studies the generalization performance of multi-class classification algorithms, for which we obtain, for the first time, a data-dependent generalization error bound with a logarithmic dependence on the class size, substantially improving the state-of-the-art linear dependence in the existing data-dependent generalization analysis. |

228 | High-dimensional neural spike train analysis with generalized count linear dynamical systems | Yuanjun Gao, Lars Busing, Krishna V. Shenoy, John P. Cunningham | We apply our model to data from primate motor cortex and demonstrate performance improvements over state-of-the-art methods, both in capturing the variance structure of the data and in held-out prediction. |

229 | Learning with a Wasserstein Loss | Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, Tomaso A. Poggio | In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. |

230 | b-bit Marginal Regression | Martin Slawski, Ping Li | We consider the problem of sparse signal recovery from $m$ linear measurements quantized to $b$ bits. |

231 | Natural Neural Networks | Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, Koray Kavukcuoglu | We introduce Natural Neural Networks, a novel family of algorithms that speed up convergence by adapting their internal representation during training to improve conditioning of the Fisher matrix. |

232 | Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference | Ted Meeds, Max Welling | We describe an embarrassingly parallel, anytime Monte Carlo method for likelihood-free models. |

233 | Adaptive Primal-Dual Splitting Methods for Statistical Learning and Image Processing | Tom Goldstein, Min Li, Xiaoming Yuan | We propose self-adaptive stepsize rules that automatically tune PDHG parameters for optimal convergence. |

234 | On some provably correct cases of variational inference for topic models | Pranjal Awasthi, Andrej Risteski | We provide the first analysis of instances where variational inference algorithms converge to the global optimum, in the setting of topic models. |

235 | Collaborative Filtering with Graph Information: Consistency and Scalable Methods | Nikhil Rao, Hsiang-Fu Yu, Pradeep K. Ravikumar, Inderjit S. Dhillon | We tackle the problem of matrix completion when pairwise relationships among variables are known, via a graph. |

236 | Combinatorial Bandits Revisited | Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, Marc Lelarge | We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. |

237 | Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning | Shakir Mohamed, Danilo Jimenez Rezende | This paper provides a new approach for scalable optimisation of the mutual information by merging techniques from variational inference and deep learning. |

238 | A Structural Smoothing Framework For Robust Graph Comparison | Pinar Yanardag, S.V.N. Vishwanathan | In this paper, we propose a general smoothing framework for graph kernels by taking *structural similarity* into account, and apply it to derive smoothed variants of popular graph kernels. |

239 | Competitive Distribution Estimation: Why is Good-Turing Good | Alon Orlitsky, Ananda Theertha Suresh | Conversely, we show that any estimator must have a KL divergence $\ge\tilde\Omega(\min(k/n,1/ n^{2/3}))$ over the best estimator for the first comparison, and $\ge\tilde\Omega(\min(k/n,1/\sqrt{n}))$ for the second. |

240 | Efficient Learning by Directed Acyclic Graph For Resource Constrained Prediction | Joseph Wang, Kirill Trapeznikov, Venkatesh Saligrama | Rather than jointly optimizing such a highly coupled and non-convex problem over all decision nodes, we propose an efficient algorithm motivated by dynamic programming. |

241 | A hybrid sampler for Poisson-Kingman mixture models | Maria Lomeli, Stefano Favaro, Yee Whye Teh | We present a novel and compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requirements than previous MCMC schemes. |

242 | An Active Learning Framework using Sparse-Graph Codes for Sparse Polynomials and Graph Sketching | Xiao Li, Kannan Ramchandran | We introduce an active learning framework that is associated with a low query cost and computational runtime. |

243 | Local Smoothness in Variance Reduced Optimization | Daniel Vainsencher, Han Liu, Tong Zhang | We propose a family of non-uniform sampling strategies to provably speed up a class of stochastic optimization algorithms with linear convergence including Stochastic Variance Reduced Gradient (SVRG) and Stochastic Dual Coordinate Ascent (SDCA). |

244 | Saliency, Scale and Information: Towards a Unifying Theory | Shafin Rahman, Neil Bruce | In this paper we present a definition for visual saliency grounded in information theory. |

245 | Fighting Bandits with a New Kind of Smoothness | Jacob D. Abernethy, Chansoo Lee, Ambuj Tewari | In the present work, we provide a new set of analysis tools, using the notion of convex smoothing, to provide several novel algorithms with optimal guarantees. |

246 | Beyond Sub-Gaussian Measurements: High-Dimensional Structured Estimation with Sub-Exponential Designs | Vidyashankar Sivakumar, Arindam Banerjee, Pradeep K. Ravikumar | We consider the problem of high-dimensional structured estimation with norm-regularized estimators, such as Lasso, when the design matrix and noise are drawn from sub-exponential distributions. Existing results only consider sub-Gaussian designs and noise, and both the sample complexity and non-asymptotic estimation error have been shown to depend on the Gaussian width of suitable sets. |

247 | Spectral Norm Regularization of Orthonormal Representations for Graph Transduction | Rakesh Shivanna, Bibaswan K. Chatterjee, Raman Sankaran, Chiranjib Bhattacharyya, Francis Bach | In this paper, we show that orthonormal representations, a class of unit-sphere graph embeddings, are PAC learnable. |

248 | Convolutional Networks on Graphs for Learning Molecular Fingerprints | David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, Ryan P. Adams | We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks. |

249 | Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications | Kai Wei, Rishabh K. Iyer, Shengjie Wang, Wenruo Bai, Jeff A. Bilmes | In the present paper, we bridge this gap, by proposing several new algorithms (including greedy, majorization-minimization, minorization-maximization, and relaxation algorithms) that not only scale to large datasets but that also achieve theoretical approximation guarantees comparable to the state-of-the-art. |

250 | Tractable Learning for Complex Probability Queries | Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche, Guy Van den Broeck | We propose a tractable learner that guarantees efficient inference for a broader class of queries. |

251 | Stop Wasting My Gradients: Practical SVRG | Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, Scott Sallinen | We present and analyze several strategies for improving the performance of stochastic variance-reduced gradient (SVRG) methods. |

252 | Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction | Been Kim, Julie A. Shah, Finale Doshi-Velez | We present the Mind the Gap Model (MGM), an approach for interpretable feature extraction and selection. |

253 | A Normative Theory of Adaptive Dimensionality Reduction in Neural Networks | Cengiz Pehlevan, Dmitri Chklovskii | Here, we derive biologically plausible dimensionality reduction algorithms which adapt the number of output dimensions to the eigenspectrum of the input covariance matrix. |

254 | On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators | Changyou Chen, Nan Ding, Lawrence Carin | In this paper we consider general SG-MCMCs with high-order integrators, and develop theory to analyze finite-time convergence properties and their asymptotic invariant measures. |

255 | Learning structured densities via infinite dimensional exponential families | Siqi Sun, Mladen Kolar, Jinbo Xu | In this paper, we study the problem of estimating the structure of a probabilistic graphical model without assuming a particular parametric model. |

256 | Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question | Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu | In this paper, we present the mQA model, which is able to answer questions about the content of an image. We construct a Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate our mQA model. |

257 | Variance Reduced Stochastic Gradient Descent with Neighbors | Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, Brian McWilliams | Recently, variance reduction techniques such as SVRG and SAGA have been proposed to overcome this weakness. |

258 | Sample Efficient Path Integral Control under Uncertainty | Yunpeng Pan, Evangelos Theodorou, Michail Kontitsis | We present a data-driven stochastic optimal control framework that is derived using the path integral (PI) control approach. |

259 | Stochastic Expectation Propagation | Yingzhen Li, José Miguel Hernández-Lobato, Richard E. Turner | This paper presents an extension to EP, called stochastic expectation propagation (SEP), that maintains a global posterior approximation (like VI) but updates it in a local way (like EP). |

260 | Exactness of Approximate MAP Inference in Continuous MRFs | Nicholas Ruozzi | In this work, we use graph covers to provide necessary and sufficient conditions for continuous MAP relaxations to be tight. |

261 | Scale Up Nonlinear Component Analysis with Doubly Stochastic Gradients | Bo Xie, Yingyu Liang, Le Song | We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets. |

262 | Generalization in Adaptive Data Analysis and Holdout Reuse | Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, Aaron Roth | We give an algorithm that enables the validation of a large number of adaptively chosen hypotheses, while provably avoiding overfitting. |

263 | Market Scoring Rules Act As Opinion Pools For Risk-Averse Agents | Mithun Chakraborty, Sanmay Das | In this paper, we add to a growing body of research aimed at understanding the precise manner in which the price process induced by a MSR incorporates private information from agents who deviate from the assumption of risk-neutrality. |

264 | Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent | Ian En-Hsu Yen, Kai Zhong, Cho-Jui Hsieh, Pradeep K. Ravikumar, Inderjit S. Dhillon | In this paper, we investigate a general LP algorithm based on the combination of Augmented Lagrangian and Coordinate Descent (AL-CD), giving an iteration complexity of $O((\log(1/\epsilon))^2)$ with $O(nnz(A))$ cost per iteration, where $nnz(A)$ is the number of non-zeros in the $m\times n$ constraint matrix $A$, and in practice, one can further reduce cost per iteration to the order of non-zeros in columns (rows) corresponding to the active primal (dual) variables through an active-set strategy. |

265 | Training Very Deep Networks | Rupesh K. Srivastava, Klaus Greff, Jürgen Schmidhuber | Here we introduce a new architecture designed to overcome this. |

266 | Bayesian Active Model Selection with an Application to Automated Audiometry | Jacob Gardner, Gustavo Malkomes, Roman Garnett, Kilian Q. Weinberger, Dennis Barbour, John P. Cunningham | We introduce a novel information-theoretic approach for active model selection and demonstrate its effectiveness in a real-world application. |

267 | Particle Gibbs for Infinite Hidden Markov Models | Nilesh Tripuraneni, Shixiang (Shane) Gu, Hong Ge, Zoubin Ghahramani | In this paper, we present an infinite-state Particle Gibbs (PG) algorithm to resample state trajectories for the iHMM. |

268 | Learning spatiotemporal trajectories from manifold-valued longitudinal data | Jean-Baptiste Schiratti, Stéphanie Allassonnière, Olivier Colliot, Stanley Durrleman | We propose a Bayesian mixed-effects model to learn typical scenarios of changes from longitudinal manifold-valued data, namely repeated measurements of the same objects or individuals at several points in time. |

269 | A Bayesian Framework for Modeling Confidence in Perceptual Decision Making | Koosha Khalvati, Rajesh P. Rao | In this paper, we introduce a Bayesian framework to model confidence in perceptual decision making. |

270 | Path-SGD: Path-Normalized Optimization in Deep Neural Networks | Behnam Neyshabur, Ruslan R. Salakhutdinov, Nati Srebro | We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. |

271 | On the consistency theory of high dimensional variable screening | Xiangyu Wang, Chenlei Leng, David B. Dunson | When the data dimension $p$ is substantially larger than the sample size $n$, variable screening becomes crucial as 1) faster feature selection algorithms are needed; 2) conditions guaranteeing selection consistency might fail to hold. This article studies a class of linear screening methods and establishes consistency theory for this special class. |

272 | End-To-End Memory Networks | Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus | We introduce a neural network with a recurrent attention model over a possibly large external memory. |

273 | Spectral Representations for Convolutional Neural Networks | Oren Rippel, Jasper Snoek, Ryan P. Adams | In this work, we demonstrate that, beyond its advantages for efficient computation, the spectral domain also provides a powerful representation in which to model and train convolutional neural networks (CNNs). |

274 | Online Gradient Boosting | Alina Beygelzimer, Elad Hazan, Satyen Kale, Haipeng Luo | We extend the theory of boosting for regression problems to the online learning setting. |

275 | Deep Temporal Sigmoid Belief Networks for Sequence Modeling | Zhe Gan, Chunyuan Li, Ricardo Henao, David E. Carlson, Lawrence Carin | Scalable learning and inference algorithms are derived by introducing a recognition model that yields fast sampling from the variational posterior. |

276 | Recognizing retinal ganglion cells in the dark | Emile Richard, Georges A. Goetz, E.J. Chichilnisky | Here, we develop automated classifiers for functional identification of retinal ganglion cells, the output neurons of the retina, based solely on recorded voltage patterns on a large scale array. |

277 | A Theory of Decision Making Under Dynamic Context | Michael Shvartsman, Vaibhav Srivastava, Jonathan D. Cohen | In this work, we describe a computational theory of decision making under dynamically shifting context. |

278 | A Gaussian Process Model of Quasar Spectral Energy Distributions | Andrew Miller, Albert Wu, Jeff Regier, Jon McAuliffe, Dustin Lang, Mr. Prabhat, David Schlegel, Ryan P. Adams | We propose a method for combining two sources of astronomical data, spectroscopy and photometry, that carry information about sources of light (e.g., stars, galaxies, and quasars) at extremely different spectral resolutions. |

279 | Hidden Technical Debt in Machine Learning Systems | D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison | This paper argues it is dangerous to think of these quick wins as coming for free. |

280 | Local Causal Discovery of Direct Causes and Effects | Tian Gao, Qiang Ji | We propose a new local causal discovery algorithm, called Causal Markov Blanket (CMB), to identify the direct causes and effects of a target variable based on Markov Blanket Discovery. |

281 | High Dimensional EM Algorithm: Statistical Optimization and Asymptotic Normality | Zhaoran Wang, Quanquan Gu, Yang Ning, Han Liu | In particular, we make two contributions: (i) For parameter estimation, we propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure into parameter estimation. |

282 | Revenue Optimization against Strategic Buyers | Mehryar Mohri, Andres Munoz | We present a revenue optimization algorithm for posted-price auctions when facing a buyer with random valuations who seeks to optimize his $\gamma$-discounted surplus. |

283 | Deep Convolutional Inverse Graphics Network | Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, Josh Tenenbaum | This paper presents the Deep Convolution Inverse Graphics Network (DC-IGN), a model that aims to learn an interpretable representation of images, disentangled with respect to three-dimensional scene structure and viewing transformations such as depth rotations and lighting variations. |

284 | Sparse and Low-Rank Tensor Decomposition | Parikshit Shah, Nikhil Rao, Gongguo Tang | We present an efficient computational algorithm that modifies Leurgans’ algorithm for tensor factorization. |

285 | Minimax Time Series Prediction | Wouter M. Koolen, Alan Malek, Peter L. Bartlett, Yasin Abbasi | We derive the minimax strategy for all problems of this type and show that it can be implemented efficiently. |

286 | Differentially Private Learning of Structured Discrete Distributions | Ilias Diakonikolas, Moritz Hardt, Ludwig Schmidt | Our goal is to design efficient algorithms that simultaneously achieve low error in total variation norm while guaranteeing Differential Privacy to the individuals of the population. We describe a general approach that yields near sample-optimal and computationally efficient differentially private estimators for a wide range of well-studied and natural distribution families. |

287 | Variational Dropout and the Local Reparameterization Trick | Durk P. Kingma, Tim Salimans, Max Welling | Our method allows inference of more flexibly parameterized posteriors; specifically, we propose *variational dropout*, a generalization of Gaussian dropout, but with a more flexibly parameterized posterior, often leading to better generalization. |

288 | Sample Complexity of Learning Mahalanobis Distance Metrics | Nakul Verma, Kristin Branson | In this work we provide PAC-style sample complexity rates for supervised metric learning. |

289 | Learning Wake-Sleep Recurrent Attention Models | Jimmy Ba, Ruslan R. Salakhutdinov, Roger B. Grosse, Brendan J. Frey | Borrowing techniques from the literature on training deep generative models, we present the Wake-Sleep Recurrent Attention Model, a method for training stochastic attention networks which improves posterior inference and which reduces the variability in the stochastic gradients. |

290 | Robust Gaussian Graphical Modeling with the Trimmed Graphical Lasso | Eunho Yang, Aurelie C. Lozano | In this paper, we propose the Trimmed Graphical Lasso for robust estimation of sparse GGMs. |

291 | Testing Closeness With Unequal Sized Samples | Bhaswar Bhattacharya, Gregory Valiant | Specifically, given a target error parameter $\epsilon > 0$, $m_1$ independent draws from an unknown distribution $p$ with discrete support, and $m_2$ draws from an unknown distribution $q$ of discrete support, we describe a test for distinguishing the case that $p=q$ from the case that $||p-q||_1 \geq \epsilon$. |

292 | Estimating Jaccard Index with Missing Observations: A Matrix Calibration Approach | Wenye Li | This paper investigates the problem of estimating a Jaccard index matrix when there are missing observations in data samples. |

293 | Neural Adaptive Sequential Monte Carlo | Shixiang (Shane) Gu, Zoubin Ghahramani, Richard E. Turner | This paper presents a new method for automatically adapting the proposal using an approximation of the Kullback-Leibler divergence between the true posterior and the proposal distribution. |

294 | Local Expectation Gradients for Black Box Variational Inference | Michalis Titsias, Miguel Lázaro-Gredilla | We introduce local expectation gradients, a general purpose stochastic variational inference algorithm for constructing stochastic gradients by sampling from the variational distribution. |

295 | On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants | Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alexander J. Smola | We bridge this gap by presenting a unifying framework that captures many variance reduction techniques. Subsequently, we propose an asynchronous algorithm grounded in our framework, with fast convergence rates. |

296 | NEXT: A System for Real-World Development, Evaluation, and Application of Active Learning | Kevin G. Jamieson, Lalit Jain, Chris Fernandez, Nicholas J. Glattard, Rob Nowak | Active learning methods automatically adapt data collection by selecting the most informative samples in order to accelerate machine learning. |

297 | Super-Resolution Off the Grid | Qingqing Huang, Sham M. Kakade | This work provides an algorithm with the following favorable guarantees:1. |

298 | Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms | Christopher M. De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré | Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild, that uses lower-precision arithmetic. |

299 | The Return of the Gating Network: Combining Generative Models and Discriminative Training in Natural Image Priors | Dan Rosenbaum, Yair Weiss | In this paper we show how to combine the strengths of both approaches by training a discriminative, feed-forward architecture to predict the state of latent variables in a generative model of natural images. |

300 | Pointer Networks | Oriol Vinyals, Meire Fortunato, Navdeep Jaitly | We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. |

301 | Associative Memory via a Sparse Recovery Model | Arya Mazumdar, Ankit Singh Rawat | In this paper, for the first time, we propose a model of associative memory based on sparse recovery of signals. |

302 | Robust Spectral Inference for Joint Stochastic Matrix Factorization | Moontae Lee, David Bindel, David Mimno | Spectral inference provides fast algorithms and provable optimality for latent topic analysis. |

303 | Fast, Provable Algorithms for Isotonic Regression in all L_p-norms | Rasmus Kyng, Anup Rao, Sushant Sachdeva | This paper gives improved algorithms for computing the Isotonic Regression for all weighted $\ell_{p}$-norms with rigorous performance guarantees. |

304 | Adversarial Prediction Games for Multivariate Losses | Hong Wang, Wei Xing, Kaiser Asif, Brian Ziebart | We propose to approximate the training data instead of the loss function by posing multivariate prediction as an adversarial game between a loss-minimizing prediction player and a loss-maximizing evaluation player constrained to match specified properties of training data. |

305 | Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization | Xiangru Lian, Yijun Huang, Yuncheng Li, Ji Liu | We establish an ergodic convergence rate $O(1/\sqrt{K})$ for both algorithms and prove that the linear speedup is achievable if the number of workers is bounded by $\sqrt{K}$ ($K$ is the total number of iterations). |

306 | Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images | Manuel Watter, Jost Springenberg, Joschka Boedecker, Martin Riedmiller | We introduce Embed to Control (E2C), a method for model learning and control of non-linear dynamical systems from raw pixel images. |

307 | Efficient and Parsimonious Agnostic Active Learning | Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, Robert E. Schapire | We develop a new active learning algorithm for the streaming setting satisfying three important properties: 1) It provably works for any classifier representation and classification problem including those with severe noise. |

308 | Softstar: Heuristic-Guided Probabilistic Inference | Mathew Monfort, Brenden M. Lake, Brian Ziebart, Patrick Lucey, Josh Tenenbaum | We propose the Softstar algorithm, a softened heuristic-guided search technique for the maximum entropy inverse optimal control model of sequential behavior. |

309 | Grammar as a Foreign Language | Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton | In this paper we show that the domain agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset, when trained on a large synthetic corpus that was annotated using existing parsers. |

310 | Regularization-Free Estimation in Trace Regression with Symmetric Positive Semidefinite Matrices | Martin Slawski, Ping Li, Matthias Hein | In this paper, we argue that such regularization may no longer be necessary if the underlying matrix is symmetric positive semidefinite (spd) and the design satisfies certain conditions. |

311 | Winner-Take-All Autoencoders | Alireza Makhzani, Brendan J. Frey | In this paper, we propose a winner-take-all method for learning hierarchical sparse representations in an unsupervised fashion. |

312 | Deep Poisson Factor Modeling | Ricardo Henao, Zhe Gan, James Lu, Lawrence Carin | We propose a new deep architecture for topic modeling, based on Poisson Factor Analysis (PFA) modules. |

313 | Bayesian Optimization with Exponential Convergence | Kenji Kawaguchi, Leslie Pack Kaelbling, Tomás Lozano-Pérez | This paper presents a Bayesian optimization method with exponential convergence without the need of auxiliary optimization and without the delta-cover sampling. |

314 | Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning | Christoph Dann, Emma Brunskill | In this paper, we derive an upper PAC bound of order O(|S|²|A|H² log(1/δ)/ɛ²) and a lower PAC bound Ω(|S||A|H² log(1/(δ+c))/ɛ²) (ignoring log-terms) that match up to log-terms and an additional linear dependency on the number of states |S|. |

315 | Learning with Relaxed Supervision | Jacob Steinhardt, Percy S. Liang | In this paper, we develop a rigorous approach to relaxing the supervision, which yields asymptotically consistent parameter estimates despite altering the supervision. |

316 | Subsampled Power Iteration: a Unified Algorithm for Block Models and Planted CSP's | Vitaly Feldman, Will Perkins, Santosh Vempala | We present an algorithm for recovering planted solutions in two well-known models, the stochastic block model and planted constraint satisfaction problems (CSP), via a common generalization in terms of random bipartite graphs. |

317 | Accelerated Mirror Descent in Continuous and Discrete Time | Walid Krichene, Alexandre Bayen, Peter L. Bartlett | Combining the original continuous-time motivation of mirror descent with a recent ODE interpretation of Nesterov’s accelerated method, we propose a family of continuous-time descent dynamics for convex functions with Lipschitz gradients, such that the solution trajectories are guaranteed to converge to the optimum at a $O(1/t^2)$ rate. |

318 | The Human Kernel | Andrew G. Wilson, Christoph Dann, Chris Lucas, Eric P. Xing | Bayesian nonparametric models, such as Gaussian processes, provide a compelling framework for automatic statistical modelling: these models have a high degree of flexibility, and automatically calibrated complexity. In this paper, we create function extrapolation problems and acquire human responses, and then design a kernel learning framework to reverse engineer the inductive biases of human learners across a set of behavioral experiments. |

319 | Action-Conditional Video Prediction using Deep Networks in Atari Games | Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, Satinder Singh | We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. |

320 | A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA | James R. Voss, Mikhail Belkin, Luis Rademacher | We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise. |

321 | Distributed Submodular Cover: Succinctly Summarizing Massive Data | Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, Andreas Krause | In this paper, we formalize this challenge as a submodular cover problem. |

322 | Community Detection via Measure Space Embedding | Mark Kozdoba, Shie Mannor | We present a new algorithm for community detection. |

323 | Basis refinement strategies for linear value function approximation in MDPs | Gheorghe Comanici, Doina Precup, Prakash Panangaden | We provide a theoretical framework for analyzing basis function construction for linear value function approximation in Markov Decision Processes (MDPs). |

324 | Structured Estimation with Atomic Norms: General Bounds and Applications | Sheng Chen, Arindam Banerjee | In this paper, we present general upper bounds for such geometric measures, which only require simple information of the atomic norm under consideration, and we establish tightness of these bounds by providing the corresponding lower bounds. |

325 | A Complete Recipe for Stochastic Gradient MCMC | Yi-An Ma, Tianqi Chen, Emily Fox | In this paper, we provide a general recipe for constructing MCMC samplers, including stochastic gradient versions, based on continuous Markov processes specified via two matrices. |

326 | Bandit Smooth Convex Optimization: Improving the Bias-Variance Tradeoff | Ofer Dekel, Ronen Eldan, Tomer Koren | We present an efficient algorithm for the bandit smooth convex optimization problem that guarantees a regret of $\widetilde{O}(T^{5/8})$. |

327 | Online Prediction at the Limit of Zero Temperature | Mark Herbster, Stephen Pasteris, Shaona Ghosh | We design an online algorithm to classify the vertices of a graph. |

328 | Learning Continuous Control Policies by Stochastic Value Gradients | Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, Yuval Tassa | We present a unified framework for learning continuous control policies using backpropagation. |

329 | Exploring Models and Data for Image Question Answering | Mengye Ren, Ryan Kiros, Richard Zemel | This work aims to address the problem of image-based question-answering (QA) with new models and datasets. |

330 | Efficient and Robust Automated Machine Learning | Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, Frank Hutter | In this work we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters). |

331 | Preconditioned Spectral Descent for Deep Learning | David E. Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, Volkan Cevher | We theoretically formalize our arguments and derive novel preconditioned non-Euclidean algorithms. |

332 | A Recurrent Latent Variable Model for Sequential Data | Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, Yoshua Bengio | In this paper, we explore the inclusion of latent random variables into the hidden state of a recurrent neural network (RNN) by combining the elements of the variational autoencoder. |

333 | Fast Convergence of Regularized Learning in Games | Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, Robert E. Schapire | We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games. |

334 | Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation | Marijn F. Stollenga, Wonmin Byeon, Marcus Liwicki, Jürgen Schmidhuber | Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation |

335 | Reflection, Refraction, and Hamiltonian Monte Carlo | Hadi Mohasel Afshar, Justin Domke | We introduce a modification of the Leapfrog discretization of Hamiltonian dynamics on piecewise continuous energies, where intersections of the trajectory with discontinuities are detected, and the momentum is reflected or refracted to compensate for the change in energy. |

336 | The Consistency of Common Neighbors for Link Prediction in Stochastic Blockmodels | Purnamrita Sarkar, Deepayan Chakrabarti, Peter J. Bickel | The Consistency of Common Neighbors for Link Prediction in Stochastic Blockmodels |

337 | Nearly Optimal Private LASSO | Kunal Talwar, Abhradeep Guha Thakurta, Li Zhang | We present a nearly optimal differentially private version of the well known LASSO estimator. |

338 | Convergence Analysis of Prediction Markets via Randomized Subspace Descent | Rafael Frongillo, Mark D. Reid | We establish convergence rates for RSD and leverage them to prove rates for the two prediction market models above, answering the open questions. |

339 | The Poisson Gamma Belief Network | Mingyuan Zhou, Yulai Cong, Bo Chen | To infer a multilayer representation of high-dimensional count vectors, we propose the Poisson gamma belief network (PGBN) that factorizes each of its layers into the product of a connection weight matrix and the nonnegative real hidden units of the next layer. |

340 | Convergence rates of sub-sampled Newton methods | Murat A. Erdogdu, Andrea Montanari | In this paper, we shift our attention to a more general setting — maximum likelihood estimation. |

341 | No-Regret Learning in Bayesian Games | Jason Hartline, Vasilis Syrgkanis, Eva Tardos | Recent price-of-anarchy analyses of games of complete information suggest that coarse correlated equilibria, which characterize outcomes resulting from no-regret learning dynamics, have near-optimal welfare. |

342 | Statistical Topological Data Analysis – A Kernel Perspective | Roland Kwitt, Stefan Huber, Marc Niethammer, Weili Lin, Ulrich Bauer | Our contribution is to close this gap by proving universality of a variant of the original kernel, and to demonstrate its effective use in two-sample hypothesis testing on synthetic as well as real-world data. |

343 | Semi-supervised Sequence Learning | Andrew M. Dai, Quoc V. Le | We present two approaches to use unlabeled data to improve Sequence Learning with recurrent networks. |

344 | Structured Transforms for Small-Footprint Deep Learning | Vikas Sindhwani, Tara Sainath, Sanjiv Kumar | We propose a unified framework to learn a broad family of structured parameter matrices that are characterized by the notion of low displacement rank. |

345 | Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width | Christopher M. De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré | To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. |

346 | Interpolating Convex and Non-Convex Tensor Decompositions via the Subspace Norm | Qinqing Zheng, Ryota Tomioka | We consider the problem of recovering a low-rank tensor from its noisy observation. |

347 | Sample Complexity Bounds for Iterative Stochastic Policy Optimization | Marin Kobilarov | This paper is concerned with robustness analysis of decision making under uncertainty. |

348 | BinaryConnect: Training Deep Neural Networks with binary weights during propagations | Matthieu Courbariaux, Yoshua Bengio, Jean-Pierre David | We introduce BinaryConnect, a method which consists in training a DNN with binary weights during the forward and backward propagations, while retaining precision of the stored weights in which gradients are accumulated. |

349 | Interactive Control of Diverse Complex Characters with Neural Networks | Igor Mordatch, Kendall Lowrey, Galen Andrew, Zoran Popovic, Emanuel V. Todorov | We present a method for training recurrent neural networks to act as near-optimal feedback controllers. |

350 | Submodular Hamming Metrics | Jennifer A. Gillenwater, Rishabh K. Iyer, Bethany Lusch, Rahul Kidambi, Jeff A. Bilmes | We show that there is a largely unexplored class of functions (positive polymatroids) that can define proper discrete metrics over pairs of binary vectors and that are fairly tractable to optimize over. |

351 | A Universal Primal-Dual Convex Optimization Framework | Alp Yurtsever, Quoc Tran Dinh, Volkan Cevher | We propose a new primal-dual algorithmic framework for a prototypical constrained convex optimization template. |

352 | Learning From Small Samples: An Analysis of Simple Decision Heuristics | Özgür Simsek, Marcus Buckmann | We focus on three families of heuristics: single-cue decision making, lexicographic decision making, and tallying. |

353 | Explore no more: Improved high-probability regret bounds for non-stochastic bandits | Gergely Neu | This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. |

354 | Fast and Memory Optimal Low-Rank Matrix Approximation | Se-Young Yun, Marc Lelarge, Alexandre Proutiere | In this paper, we revisit the problem of constructing a near-optimal rank $k$ approximation of a matrix $M\in [0,1]^{m\times n}$ under the streaming data model where the columns of $M$ are revealed sequentially. |

355 | Learnability of Influence in Networks | Harikrishna Narasimhan, David C. Parkes, Yaron Singer | We establish PAC learnability of influence functions for three common influence models, namely, the Linear Threshold (LT), Independent Cascade (IC) and Voter models, and present concrete sample complexity results in each case. |

356 | Learning Causal Graphs with Small Interventions | Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G. Dimakis, Sriram Vishwanath | We consider the problem of learning causal networks with interventions, when each intervention is limited in size under Pearl’s Structural Equation Model with independent errors (SEM-IE). |

357 | Information-theoretic lower bounds for convex optimization with erroneous oracles | Yaron Singer, Jan Vondrak | We consider the problem of optimizing convex and concave functions with access to an erroneous zeroth-order oracle. |

358 | Fixed-Length Poisson MRF: Adding Dependencies to the Multinomial | David I. Inouye, Pradeep K. Ravikumar, Inderjit S. Dhillon | We propose a novel distribution that generalizes the Multinomial distribution to enable dependencies between dimensions. |

359 | Large-Scale Bayesian Multi-Label Learning via Topic-Based Label Embeddings | Piyush Rai, Changwei Hu, Ricardo Henao, Lawrence Carin | We present a scalable Bayesian multi-label learning model based on learning low-dimensional label embeddings. |

360 | The Self-Normalized Estimator for Counterfactual Learning | Adith Swaminathan, Thorsten Joachims | This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that avoids this problem. In the BLBF setting, the learner does not receive full-information feedback like in supervised learning, but observes feedback only for the actions taken by a historical policy. This makes BLBF algorithms particularly attractive for training online systems (e.g., ad placement, web search, recommendation) using their historical logs. The Counterfactual Risk Minimization (CRM) principle offers a general recipe for designing BLBF algorithms. |

361 | Fast Lifted MAP Inference via Partitioning | Somdeb Sarkhel, Parag Singla, Vibhav G. Gogate | In this paper, we present a novel approach, which cleverly introduces new symmetries at the time of grounding. |

362 | Data Generation as Sequential Decision Making | Philip Bachman, Doina Precup | We formulate data imputation as an MDP and develop models capable of representing effective policies for it. |

363 | On Elicitation Complexity | Rafael Frongillo, Ian Kash | Specifically, what is the minimum number of regression parameters needed to compute the property? Building on previous work, we introduce a new notion of elicitation complexity and lay the foundations for a calculus of elicitation. |

364 | Decomposition Bounds for Marginal MAP | Wei Ping, Qiang Liu, Alexander T. Ihler | In this work, we generalize dual decomposition to a generic powered-sum inference task, which includes marginal MAP, along with pure marginalization and MAP, as special cases. |

365 | Discrete Rényi Classifiers | Meisam Razaviyayn, Farzan Farnia, David Tse | In this work, we consider the problem of designing the optimum classifier based on some estimated low order marginals of (X,Y). |

366 | A class of network models recoverable by spectral clustering | Yali Wan, Marina Meila | Here we show that essentially the same algorithm used for the SBM and for its extension called Degree-Corrected SBM, works on a wider class of Block-Models, which we call Preference Frame Models, with essentially the same guarantees. |

367 | Skip-Thought Vectors | Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, Sanja Fidler | We describe an approach for unsupervised learning of a generic, distributed sentence encoder. |

368 | Rate-Agnostic (Causal) Structure Learning | Sergey Plis, David Danks, Cynthia Freeman, Vince Calhoun | We apply these algorithms to data from simulations. |

369 | Principal Geodesic Analysis for Probability Measures under the Optimal Transport Metric | Vivien Seguy, Marco Cuturi | We consider in this work the space of probability measures $P(X)$ on a Hilbert space $X$ endowed with the 2-Wasserstein metric. |

370 | Consistent Multilabel Classification | Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, Inderjit S. Dhillon | Based on the population-optimal classifier, we propose a computationally efficient and general-purpose plug-in classification algorithm, and prove its consistency with respect to the metric of interest. |

371 | Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions | Amar Shah, Zoubin Ghahramani | We develop *parallel predictive entropy search* (PPES), a novel algorithm for Bayesian optimization of expensive black-box objective functions. |

372 | Cornering Stationary and Restless Mixing Bandits with Remix-UCB | Julien Audiffren, Liva Ralaivola | As we shall see, the bandit problem we tackle requires us to address the exploration/exploitation/independence trade-off, which we do by considering the idea of a *waiting arm* in the new Remix-UCB algorithm, a generalization of Improved-UCB for the problem at hand, that we introduce. |

373 | Semi-Supervised Factored Logistic Regression for High-Dimensional Neuroimaging Data | Danilo Bzdok, Michael Eickenberg, Olivier Grisel, Bertrand Thirion, Gael Varoquaux | We therefore propose to blend representation modelling and task classification into a unified statistical learning problem. |

374 | Gaussian Process Random Fields | David Moore, Stuart J. Russell | We introduce a new approximation for large-scale Gaussian processes, the Gaussian Process Random Field (GPRF), in which local GPs are coupled via pairwise potentials. |

375 | M-Statistic for Kernel Change-Point Detection | Shuang Li, Yao Xie, Hanjun Dai, Le Song | In this paper we propose two related computationally efficient M-statistics for kernel-based change-point detection when the amount of background data is large. |

376 | Adaptive Online Learning | Dylan J. Foster, Alexander Rakhlin, Karthik Sridharan | We propose a general framework for studying adaptive regret bounds in the online learning setting, subsuming model selection and data-dependent bounds. |

377 | A Universal Catalyst for First-Order Optimization | Hongzhou Lin, Julien Mairal, Zaid Harchaoui | We introduce a generic scheme for accelerating first-order optimization methods in the sense of Nesterov, which builds upon a new analysis of the accelerated proximal point algorithm. |

378 | Inference for determinantal point processes without spectral knowledge | Rémi Bardenet, Michalis Titsias | Our main contribution is to derive bounds on the likelihood of a DPP, both for finite and continuous domains. |

379 | Kullback-Leibler Proximal Variational Inference | Mohammad E. Khan, Pierre Baque, François Fleuret, Pascal Fua | We propose a new variational inference method based on the Kullback-Leibler (KL) proximal term. |

380 | Semi-Proximal Mirror-Prox for Nonsmooth Composite Minimization | Niao He, Zaid Harchaoui | We propose a new first-order optimization algorithm to solve high-dimensional non-smooth composite minimization problems. |

381 | LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements | Christos Thrampoulidis, Ehsan Abbasi, Babak Hassibi | In this work, we considerably strengthen these results by obtaining explicit expressions for $\|\hat x-\mu x_0\|_2$, for the regularized Generalized-LASSO, that are asymptotically precise when $m$ and $n$ grow large. |

382 | From random walks to distances on unweighted graphs | Tatsunori Hashimoto, Yi Sun, Tommi Jaakkola | We establish a general correspondence between hitting times of the Brownian motion and analogous hitting times on the graph. |

383 | Bayesian dark knowledge | Anoop Korattikara Balan, Vivek Rathod, Kevin P. Murphy, Max Welling | We describe a method for “distilling” a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. |

384 | Matrix Completion with Noisy Side Information | Kai-Yang Chiang, Cho-Jui Hsieh, Inderjit S. Dhillon | In this paper, we propose a novel model that balances between features and observations simultaneously, enabling us to leverage feature information yet to be robust to feature noise. |

385 | Dependent Multinomial Models Made Easy: Stick-Breaking with the Polya-gamma Augmentation | Scott Linderman, Matthew J. Johnson, Ryan P. Adams | Here, we leverage a logistic stick-breaking representation and recent innovations in Pólya-gamma augmentation to reformulate the multinomial distribution in terms of latent variables with jointly Gaussian likelihoods, enabling us to take advantage of a host of Bayesian inference techniques for Gaussian models with minimal overhead. |

386 | On-the-Job Learning with Bayesian Decision Theory | Keenon Werling, Arun Tejasvi Chaganty, Percy S. Liang, Christopher D. Manning | Our goal is to deploy a high-accuracy system starting with zero training examples. |

387 | Calibrated Structured Prediction | Volodymyr Kuleshov, Percy S. Liang | We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets. |

388 | Learning Structured Output Representation using Deep Conditional Generative Models | Kihyuk Sohn, Honglak Lee, Xinchen Yan | In this work, we develop a scalable deep conditional generative model for structured output variables using Gaussian latent variables. |

389 | Time-Sensitive Recommendation From Recurrent User Activities | Nan Du, Yichen Wang, Niao He, Jimeng Sun, Le Song | To address these questions, we propose a novel framework which connects self-exciting point processes and low-rank models to capture the recurrent temporal patterns in a large collection of user-item consumption pairs. |

390 | Learning Stationary Time Series using Gaussian Processes with Nonparametric Kernels | Felipe Tobar, Thang D. Bui, Richard E. Turner | We introduce the Gaussian Process Convolution Model (GPCM), a two-stage nonparametric generative procedure to model stationary signals as the convolution between a continuous-time white-noise process and a continuous-time linear filter drawn from a Gaussian process. |

391 | A Market Framework for Eliciting Private Data | Bo Waggoner, Rafael Frongillo, Jacob D. Abernethy | We propose a mechanism for purchasing information from a sequence of participants. The participants may simply hold data points they wish to sell, or may have more sophisticated information; either way, they are incentivized to participate as long as they believe their data points are representative or their information will improve the mechanism’s future prediction on a test set. The mechanism, which draws on the principles of prediction markets, has a bounded budget and minimizes generalization error for Bregman divergence loss functions. We then show how to modify this mechanism to preserve the privacy of participants’ information: At any given time, the current prices and predictions of the mechanism reveal almost no information about any one participant, yet in total over all participants, information is accurately aggregated. |

392 | Lifted Inference Rules With Constraints | Happy Mittal, Anuj Mahajan, Vibhav G. Gogate, Parag Singla | Computational complexity of these rules is highly dependent on the choice of the constraint language they operate on and therefore coming up with the right kind of representation is critical to the success of lifted inference. In this paper, we propose a new constraint language, called setineq, which allows subset, equality and inequality constraints, to represent substitutions over the variables in the theory. |

393 | Gradient Estimation Using Stochastic Computation Graphs | John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel | We introduce the formalism of stochastic computation graphs (directed acyclic graphs that include both deterministic functions and conditional probability distributions) and describe how to easily and automatically derive an unbiased estimator of the loss function’s gradient. |

394 | Model-Based Relative Entropy Stochastic Search | Abbas Abdolmaleki, Rudolf Lioutikov, Jan R. Peters, Nuno Lau, Luis Paulo Reis, Gerhard Neumann | To alleviate these problems, we introduce a new surrogate-based stochastic search approach. |

395 | Semi-supervised Learning with Ladder Networks | Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, Tapani Raiko | We combine supervised learning with unsupervised learning in deep neural networks. |

396 | Embedding Inference for Structured Multilabel Prediction | Farzaneh Mirzazadeh, Siamak Ravanbakhsh, Nan Ding, Dale Schuurmans | Rather than using approximate inference or tailoring a specialized inference method for a particular structure—standard responses to the scaling challenge—we propose to embed prediction constraints directly into the learned representation. |

397 | Copula variational inference | Dustin Tran, David Blei, Edo M. Airoldi | We develop a general variational inference method that preserves dependency among the latent variables. |

398 | Recursive Training of 2D-3D Convolutional Networks for Neuronal Boundary Prediction | Kisuk Lee, Aleksandar Zlateski, Vishwanathan Ashwin, H. Sebastian Seung | Here we achieve a substantial gain in accuracy through three innovations. |

399 | A Dual Augmented Block Minimization Framework for Learning with Limited Memory | Ian En-Hsu Yen, Shan-Wei Lin, Shou-De Lin | In this paper, we consider the more general setting of regularized *Empirical Risk Minimization (ERM)* when data cannot fit into memory. |

400 | Optimal Testing for Properties of Distributions | Jayadev Acharya, Constantinos Daskalakis, Gautam Kamath | Nevertheless, even for basic classes of distributions such as monotone, log-concave, unimodal, and monotone hazard rate, the optimal sample complexity is unknown. We provide a general approach via which we obtain sample-optimal and computationally efficient testers for all these distribution families. |

401 | Efficient Learning of Continuous-Time Hidden Markov Models for Disease Progression | Yu-Ying Liu, Shuang Li, Fuxin Li, Le Song, James M. Rehg | In this paper, we present the first complete characterization of efficient EM-based learning methods for CT-HMM models. |

402 | Expectation Particle Belief Propagation | Thibaut Lienart, Yee Whye Teh, Arnaud Doucet | We propose an original particle-based implementation of the Loopy Belief Propagation (LPB) algorithm for pairwise Markov Random Fields (MRF) on a continuous state space. |

403 | Latent Bayesian melding for integrating individual and population models | Mingjun Zhong, Nigel Goddard, Charles Sutton | We propose latent Bayesian melding, which is motivated by averaging the distributions over population statistics of both the individual-level and the population-level models under a logarithmic opinion pool framework. |