# Statistics

## New submissions

[ total of 152 entries: 1-152 ]

### New submissions for Thu, 27 Feb 20

[1]
Title: Fundamental Issues Regarding Uncertainties in Artificial Neural Networks
Comments: 21 pages, 8 Figures, 2 Tables. To be submitted to Pattern Recognition
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Artificial Neural Networks (ANNs) implement a specific form of multi-variate extrapolation and will generate an output for any input pattern, even when there is no similar training pattern. Extrapolations are not necessarily to be trusted, and in order to support safety critical systems, we require such systems to give an indication of the training sample related uncertainty associated with their output. Some readers may think that this is a well known issue which is already covered by the basic principles of pattern recognition. We will explain below how this is not the case and how the conventional (Likelihood estimate of) conditional probability of classification does not correctly assess this uncertainty. We provide a discussion of the standard interpretations of this problem and show how a quantitative approach based upon long standing methods can be practically applied. The methods are illustrated on the task of early diagnosis of dementing diseases using Magnetic Resonance Imaging.

[2]
Title: Smoothing Graphons for Modelling Exchangeable Relational Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Modelling exchangeable relational data can be described by graphon theory. Most Bayesian methods for modelling exchangeable relational data can be attributed to this framework by exploiting different forms of graphons. However, the graphons adopted by existing Bayesian methods are either piecewise-constant functions, which are insufficiently flexible for accurate modelling of the relational data, or are complicated continuous functions, which incur heavy computational costs for inference. In this work, we introduce a smoothing procedure to piecewise-constant graphons to form smoothing graphons, which permit continuous intensity values for describing relations, but without impractically increasing computational costs. In particular, we focus on the Bayesian Stochastic Block Model (SBM) and demonstrate how to adapt the piecewise-constant SBM graphon to the smoothed version. We initially propose the Integrated Smoothing Graphon (ISG) which introduces one smoothing parameter to the SBM graphon to generate continuous relational intensity values. We then develop the Latent Feature Smoothing Graphon (LFSG), which improves on the ISG by introducing auxiliary hidden labels to decompose the calculation of the ISG intensity and enable efficient inference. Experimental results on real-world data sets validate the advantages of applying smoothing strategies to the Stochastic Block Model, demonstrating that smoothing graphons can greatly improve AUC and precision for link prediction without increasing computational complexity.

[3]
Title: Information Directed Sampling for Linear Partial Monitoring
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Partial monitoring is a rich framework for sequential decision making under uncertainty that generalizes many well known bandit models, including linear, combinatorial and dueling bandits. We introduce information directed sampling (IDS) for stochastic partial monitoring with a linear reward and observation structure. IDS achieves adaptive worst-case regret rates that depend on precise observability conditions of the game. Moreover, we prove lower bounds that classify the minimax regret of all finite games into four possible regimes. IDS achieves the optimal rate in all cases up to logarithmic factors, without tuning any hyper-parameters. We further extend our results to the contextual and the kernelized setting, which significantly increases the range of possible applications.

[4]
Title: Classical and Bayesian Analyses of a Mixture of Exponential and Lomax Distributions
Subjects: Methodology (stat.ME)

The exponential and the Lomax distributions are widely used in life testing experiments in mixture models. A mixture model of exponential distribution and Lomax distribution is proposed. Parameters of the proposed model are estimated using classical and Bayesian procedures under type-I right censoring. Expressions for Bayes estimators are derived assuming noninformative (uniform and Jeffreys) priors under symmetric and asymmetric loss functions. Posterior predictive distributions of a future observation are derived and predictive estimates are obtained. Extensive Monte Carlo simulations are carried out to investigate performance of the estimators in terms of sample sizes, censoring times and mixing proportions. The analysis of mixture model is carried out using a data set of lifetime of transmitter receivers. Interesting properties of estimators are observed and discussed.

[5]
Title: Device Heterogeneity in Federated Learning: A Superquantile Approach
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Optimization and Control (math.OC)

We propose a federated learning framework to handle heterogeneous client devices which do not conform to the population data distribution. The approach hinges upon a parameterized superquantile-based objective, where the parameter ranges over levels of conformity. We present an optimization algorithm and establish its convergence to a stationary point. We show how to practically implement it using secure aggregation by interleaving iterations of the usual federated averaging method with device filtering. We conclude with numerical experiments on neural networks as well as linear models on tasks from computer vision and natural language processing.
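
The abstract does not spell out the objective, so the following is only a minimal sketch of the standard superquantile (also known as conditional value-at-risk) of a finite set of per-device losses, assuming equal weights on devices. The function name `superquantile` and the parameter name `theta` are illustrative and not taken from the paper.

```python
def superquantile(losses, theta):
    """Average of the worst (1 - theta) fraction of equally weighted losses.

    This is the textbook superquantile / CVaR of an empirical distribution;
    it is an assumption that the paper's objective has this form.
    """
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    if not losses:
        raise ValueError("losses must be non-empty")

    n = len(losses)
    ordered = sorted(losses, reverse=True)   # worst losses first
    tail_mass = (1.0 - theta) * n            # possibly fractional count

    total = 0.0
    remaining = tail_mass
    for loss in ordered:
        weight = min(1.0, remaining)         # last loss may count partially
        total += weight * loss
        remaining -= weight
        if remaining <= 0.0:
            break
    return total / tail_mass


device_losses = [1.0, 2.0, 3.0, 4.0]
print(superquantile(device_losses, 0.0))   # plain average over all devices
print(superquantile(device_losses, 0.5))   # average over the worst half
```

With `theta = 0` the objective reduces to the usual average loss, and larger values of `theta` put more emphasis on the devices with the highest losses.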

[6]
Title: Enforcing Mean Reversion in State Space Models for Prawn Pond Water Quality Forecasting
Journal-ref: Computers and Electronics in Agriculture, Volume 168, 2020, 105120, ISSN 0168-1699
Subjects: Applications (stat.AP)

The contribution of this study is a novel approach to introduce mean reversion in multi-step-ahead forecasts of state-space models. This approach is demonstrated in a prawn pond water quality forecasting application. The mean reversion constrains forecasts by gradually drawing them to an average of previously observed dynamics. This corrects deviations in forecasts caused by irregularities such as chaotic, non-linear, and stochastic trends. The key features of the approach include (1) it enforces mean reversion, (2) it provides a means to model both short and long-term dynamics, (3) it is able to apply mean reversion to select structural state-space components, and (4) it is simple to implement. Our mean reversion approach is demonstrated on various state-space models and compared with several time-series models on a prawn pond water quality dataset. Results show that mean reversion reduces long-term forecast errors by over 60% to produce the most accurate models in the comparison.
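
The idea of drawing multi-step forecasts gradually towards an average can be illustrated with a deliberately simple sketch. This is not the state-space method of the paper: it assumes a geometric decay of the last deviation from the historical mean, and the function name `mean_reverting_forecast` and the decay parameter `phi` are hypothetical.

```python
def mean_reverting_forecast(history, horizon, phi=0.8):
    """Forecast `horizon` steps ahead, shrinking towards the historical mean.

    Assumption: the deviation of the last observation from the mean decays
    by a factor `phi` (0 <= phi < 1) at every step, as in an AR(1) model.
    """
    if not history:
        raise ValueError("history must be non-empty")
    if not 0.0 <= phi < 1.0:
        raise ValueError("phi must lie in [0, 1)")

    mean = sum(history) / len(history)
    last_deviation = history[-1] - mean
    return [mean + (phi ** h) * last_deviation for h in range(1, horizon + 1)]


# Example with made-up readings: forecasts move from the last value
# towards the mean of the observed series as the horizon grows.
readings = [1.0, 2.0, 3.0, 6.0]
print(mean_reverting_forecast(readings, horizon=3, phi=0.5))
```

Smaller values of `phi` pull long-horizon forecasts towards the mean more quickly, which is the qualitative behaviour the abstract describes.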

[7]
Title: Paired Comparisons Modeling using t-Distribution with Bayesian Analysis
Subjects: Methodology (stat.ME)

A paired comparison analysis is the simplest way to make comparative judgments between objects where objects may be goods, services or skills. For a set of problems, this technique helps to choose the most important problem to solve first and/or provides the solution that will be the most effective. This paper presents the theory of paired comparisons method and contributes to the paired comparisons models by developing a new model based on t-distribution. The developed model is illustrated using a data set of citations among four famous journals of Statistics. Using Bayesian analysis, the journals are ranked as JRSS-B --> Biometrika --> JASA --> Comm. in Stats.

[8]
Title: Extremes of Censored and Uncensored Lifetimes in Survival Data
Comments: 1 figure, 23 pages
Subjects: Statistics Theory (math.ST); Applications (stat.AP)

The i.i.d. censoring model for survival analysis assumes two independent sequences of i.i.d. positive random variables, $(T_i^*)_{1\le i\le n}$ and $(U_i)_{1\le i\le n}$. The data consists of observations on the random sequence $\big(T_i=\min(T_i^*,U_i)\big)_{1\le i\le n}$ together with accompanying censor indicators. Values of $T_i$ with $T_i^*\le U_i$ are said to be uncensored, those with $T_i^*> U_i$ are censored. We assume that the distributions of the $T_i^*$ and $U_i$ are in the domain of attraction of the Gumbel distribution and obtain the asymptotic distributions, as sample size $n\to\infty$, of the maximum values of the censored and uncensored lifetimes in the data, and of statistics related to them. These enable us to examine questions concerning the possible existence of cured individuals in the population.
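
The censoring model described above can be simulated in a few lines. The sketch below assumes exponential lifetimes and censoring times (the exponential distribution lies in the Gumbel domain of attraction); the rates, the seed and the function names are illustrative choices, not values from the paper.

```python
import random


def simulate_censored(n, rate_life=1.0, rate_censor=0.5, seed=0):
    """Draw n pairs (T_i, uncensored_i) with T_i = min(T_i*, U_i).

    Assumption: T_i* ~ Exp(rate_life) and U_i ~ Exp(rate_censor),
    independent of each other.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t_star = rng.expovariate(rate_life)   # true lifetime T_i*
        u = rng.expovariate(rate_censor)      # censoring time U_i
        data.append((min(t_star, u), t_star <= u))
    return data


def largest_lifetimes(data):
    """Return (largest uncensored T_i, largest censored T_i); None if a group is empty."""
    uncensored = [t for t, is_uncensored in data if is_uncensored]
    censored = [t for t, is_uncensored in data if not is_uncensored]
    return (
        max(uncensored) if uncensored else None,
        max(censored) if censored else None,
    )


sample = simulate_censored(200, seed=1)
print(largest_lifetimes(sample))
```

Comparing the two maxima over repeated simulations gives an informal feel for the statistics whose limiting distributions the paper studies.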

[9]
Title: Correspondence Analysis between the Location and the Leading Causes of Death in the United States
Journal-ref: International Journal of Ecological Economics and Statistics, 41(1), 47-54, 2020
Subjects: Applications (stat.AP); Computation (stat.CO)

Correspondence Analysis analyzes two-way or multi-way tables with each row and column becoming a point on a multidimensional graphical map called a biplot. It can be used to extract essential dimensions allowing simplification of the data matrix. This study aims to measure the association between the location and the leading causes of death in the United States of America and to determine the location where a particular disease is highly associated. The research data consists of two variables with 510 data points. Results show that there is a significant association between the location and the leading cause of death in the United States, and 61% of the variance in the model is explained by the first two dimensions.

[10]
Title: An Optimal Statistical and Computational Framework for Generalized Tensor Estimation
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

This paper describes a flexible framework for generalized low-rank tensor estimation problems that includes many important instances arising from applications in computational imaging, genomics, and network analysis. The proposed estimator consists of finding a low-rank tensor fit to the data under generalized parametric models. To overcome the difficulty of non-convexity in these problems, we introduce a unified approach of projected gradient descent that adapts to the underlying low-rank structure. Under mild conditions on the loss function, we establish both an upper bound on statistical error and the linear rate of computational convergence through a general deterministic analysis. Then we further consider a suite of generalized tensor estimation problems, including sub-Gaussian tensor denoising, tensor regression, and Poisson and binomial tensor PCA. We prove that the proposed algorithm achieves the minimax optimal rate of convergence in estimation error. Finally, we demonstrate the superiority of the proposed framework via extensive experiments on both simulated and real data.

[11]
Title: Incorporating Expert Prior Knowledge into Experimental Design via Posterior Sampling
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Scientific experiments are usually expensive due to complex experimental preparation and processing. Experimental design is therefore involved with the task of finding the optimal experimental input that results in the desirable output by using as few experiments as possible. Experimenters can often acquire the knowledge about the location of the global optimum. However, they do not know how to exploit this knowledge to accelerate experimental design. In this paper, we adopt the technique of Bayesian optimization for experimental design since Bayesian optimization has established itself as an efficient tool for optimizing expensive black-box functions. Again, it is unknown how to incorporate the expert prior knowledge about the global optimum into Bayesian optimization process. To address it, we represent the expert knowledge about the global optimum via placing a prior distribution on it and we then derive its posterior distribution. An efficient Bayesian optimization approach has been proposed via posterior sampling on the posterior distribution of the global optimum. We theoretically analyze the convergence of the proposed algorithm and discuss the robustness of incorporating expert prior. We evaluate the efficiency of our algorithm by optimizing synthetic functions and tuning hyperparameters of classifiers along with a real-world experiment on the synthesis of short polymer fiber. The results clearly demonstrate the advantages of our proposed method.

[12]
Title: Scientific versus statistical modelling: a unifying approach
Subjects: Statistics Theory (math.ST)

This paper addresses two fundamental features of quantities modeled and analysed in statistical science, their dimensions (e.g. time) and measurement scales (units). Examples show that subtle issues can arise when dimensions and measurement scales are ignored. Special difficulties arise when the models involve transcendental functions. A transcendental function important in statistics is the logarithm which is used in likelihood calculations and is a singularity in the family of Box-Cox algebraic functions. Yet neither the argument of the logarithm nor its value can have units of measurement. Physical scientists have long recognized that dimension/scale difficulties can be side-stepped by nondimensionalizing the model; after all, models of natural phenomena cannot depend on the units by which they are measured, and the celebrated Buckingham Pi theorem is a consequence. The paper reviews that theorem, recognizing that the statistical invariance principle arose with similar aspirations. However, the potential relationship between the theorem and statistical invariance has not been investigated until very recently. The main result of the paper is an exploration of that link, which leads to an extension of the Pi-theorem that puts it in a stochastic framework and thus quantifies uncertainties in deterministic physical models.

[13]
Title: Adversarial Monte Carlo Meta-Learning of Optimal Prediction Procedures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

We frame the meta-learning of prediction procedures as a search for an optimal strategy in a two-player game. In this game, Nature selects a prior over distributions that generate labeled data consisting of features and an associated outcome, and the Predictor observes data sampled from a distribution drawn from this prior. The Predictor's objective is to learn a function that maps from a new feature to an estimate of the associated outcome. We establish that, under reasonable conditions, the Predictor has an optimal strategy that is equivariant to shifts and rescalings of the outcome and is invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. We introduce a neural network architecture that satisfies these properties. The proposed strategy performs favorably compared to standard practice in both parametric and nonparametric experiments.

[14]
Title: A Balancing Weight Framework for Estimating the Causal Effect of General Treatments
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

In observational studies, weighting methods that directly optimize the balance between treatment and covariates have received much attention lately; however these have mainly focused on binary treatments. Inspired by domain adaptation, we show that such methods can be actually reformulated as specific implementations of a discrepancy minimization problem aimed at tackling a shift of distribution from observational to interventional data. More precisely, we introduce a new framework, Covariate Balance via Discrepancy Minimization (CBDM), that provably encompasses most of the existing balancing weight methods and formally extends them to treatments of arbitrary types (e.g., continuous or multivariate). We establish theoretical guarantees for our framework that both offer generalizations of properties known when the treatment is binary, and give a better grasp on what hyperparameters to choose in non-binary settings. Based on such insights, we propose a particular implementation of CBDM for estimating dose-response curves and demonstrate through experiments its competitive performance relative to other existing approaches for continuous treatments.

[15]
Title: Bayesian Nonparametric Space Partitions: A Survey
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Bayesian nonparametric space partition (BNSP) models provide a variety of strategies for partitioning a $D$-dimensional space into a set of blocks. In this way, data points lying in the same block share certain kinds of homogeneity. BNSP models can be applied to various areas, such as regression/classification trees, random feature construction, relational modeling, etc. In this survey, we investigate the current progress of BNSP research through the following three perspectives: models, which review various strategies for generating the partitions in the space and discuss their theoretical foundation, 'self-consistency'; applications, which cover the current mainstream usages of BNSP models and their potential future practices; and challenges, which identify the current unsolved problems and valuable future research topics. As there has been no comprehensive review of the BNSP literature before, we hope that this survey can induce further exploration and exploitation on this topic.

[16]
Title: Predicting Neural Network Accuracy from Weights
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We study the prediction of the accuracy of a neural network given only its weights with the goal of better understanding network training and performance. To do so, we propose a formal setting which frames this task and connects to previous work in this area. We collect (and release) a large dataset of almost 80k convolutional neural networks trained on four image datasets. We demonstrate that strong predictors of accuracy exist. Moreover, they can achieve good predictions while only using simple statistics of the weights. Surprisingly, these predictors are able to rank networks trained on unobserved datasets or using different architectures.

[17]
Title: Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models
Comments: Accepted at AISTATS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We propose automated augmented conjugate inference, a new inference method for non-conjugate Gaussian process (GP) models. Our method automatically constructs an auxiliary variable augmentation that renders the GP model conditionally conjugate. Building on the conjugate structure of the augmented model, we develop two inference methods. First, a fast and scalable stochastic variational inference method that uses efficient block coordinate ascent updates, which are computed in closed form. Second, an asymptotically correct Gibbs sampler that is useful for small datasets. Our experiments show that our method is up to two orders of magnitude faster and more robust than existing state-of-the-art black-box methods.

[18]
Title: A short note on learning discrete distributions
Comments: This is a review article; its intent is not to provide new results, but instead to gather known (and useful) ones, along with their proofs, in a single convenient location
Subjects: Statistics Theory (math.ST); Probability (math.PR)

The goal of this short note is to provide simple proofs for the "folklore facts" on the sample complexity of learning a discrete probability distribution over a known domain of size $k$ to various distances $\varepsilon$, with error probability $\delta$.
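
As a small illustration of the setting, the sketch below learns a distribution over `k` symbols with the empirical (plug-in) estimator and measures the error in total variation distance. The helper `heuristic_sample_size` uses the well-known order of magnitude $(k + \log(1/\delta))/\varepsilon^2$ for total variation; the constant `c` is an assumption, and none of the function names come from the note itself.

```python
import math
import random
from collections import Counter


def empirical_distribution(samples, k):
    """Plug-in estimate of a distribution over {0, ..., k-1}."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[i] / n for i in range(k)]


def total_variation(p, q):
    """Total variation distance between two distributions on the same domain."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def heuristic_sample_size(k, eps, delta, c=1.0):
    """Order-of-magnitude sample size c * (k + log(1/delta)) / eps^2.

    The constant c is a placeholder, not a value proved in the note.
    """
    return math.ceil(c * (k + math.log(1.0 / delta)) / eps ** 2)


true_p = [0.5, 0.2, 0.1, 0.1, 0.1]
rng = random.Random(1)
n = heuristic_sample_size(k=len(true_p), eps=0.05, delta=0.05)
draws = rng.choices(range(len(true_p)), weights=true_p, k=n)
print(n, total_variation(true_p, empirical_distribution(draws, len(true_p))))
```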

[19]
Title: A Visual Sensitivity Analysis for Parameter-Augmented Ensembles of Curves
Authors: Alejandro Ribes (EDF R&D PERICLES), Joachim Pouderoux, Bertrand Iooss (EDF R&D PRISME, GdR MASCOT-NUM, IMT)
Journal-ref: The Journal of Verification, Validation and Uncertainty Quantification (VVUQ), 2019, 4 (4)
Subjects: Statistics Theory (math.ST)

Engineers and computational scientists often study the behavior of their simulations by repeated solutions with variations in their parameters, which can be for instance boundary values or initial conditions. Through such simulation ensembles, uncertainty in a solution is studied as a function of the various input parameters. Solutions of numerical simulations are often temporal functions, spatial maps or spatio-temporal outputs. The usual way to deal with such complex outputs is to limit the analysis to several probes in the temporal/spatial domain. This leads to smaller and more tractable ensembles of functional outputs (curves) with their associated input parameters: augmented ensembles of curves. This article describes a system for the interactive exploration and analysis of such augmented ensembles. Descriptive statistics on the functional outputs are performed by Principal Component Analysis projection, kernel density estimation and the computation of High Density Regions. This makes possible the calculation of functional quantiles and outliers. Brushing and linking the elements of the system allows in-depth analysis of the ensemble. The system allows for functional descriptive statistics, cluster detection and finally for the realization of a visual sensitivity analysis via cobweb plots. We present two synthetic examples and then validate our approach in an industrial use-case concerning a marine current study using a hydraulic solver.

[20]
Title: A Comparative Study of Machine Learning Models for Predicting the State of Reactive Mixing
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Accurate predictions of reactive mixing are critical for many Earth and environmental science problems. To investigate mixing dynamics over time under different scenarios, a high-fidelity, finite-element-based numerical model is built to solve the fast, irreversible bimolecular reaction-diffusion equations to simulate a range of reactive-mixing scenarios. A total of 2,315 simulations are performed using different sets of model input parameters comprising various spatial scales of vortex structures in the velocity field, time-scales associated with velocity oscillations, the perturbation parameter for the vortex-based velocity, anisotropic dispersion contrast, and molecular diffusion. Outputs comprise concentration profiles of the reactants and products. The inputs and outputs of these simulations are concatenated into feature and label matrices, respectively, to train 20 different machine learning (ML) emulators to approximate system behavior. The 20 ML emulators, based on linear methods, Bayesian methods, ensemble learning methods, and multilayer perceptron (MLP), are compared to assess these models. The ML emulators are specifically trained to classify the state of mixing and predict three quantities of interest (QoIs) characterizing species production, decay, and degree of mixing. Linear classifiers and regressors fail to reproduce the QoIs; however, ensemble methods (classifiers and regressors) and the MLP accurately classify the state of reactive mixing and the QoIs. Among ensemble methods, random forest and decision-tree-based AdaBoost faithfully predict the QoIs. At run time, trained ML emulators are $\approx10^5$ times faster than the high-fidelity numerical simulations. Speed and accuracy of the ensemble and MLP models facilitate uncertainty quantification, which usually requires thousands of model runs, to estimate the uncertainty bounds on the QoIs.

[21]
Title: A general framework for ensemble distribution distillation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Ensembles of neural networks have been shown to give better performance than single networks, both in terms of predictions and uncertainty estimation. Additionally, ensembles allow the uncertainty to be decomposed into aleatoric (data) and epistemic (model) components, giving a more complete picture of the predictive uncertainty. Ensemble distillation is the process of compressing an ensemble into a single model, often resulting in a leaner model that still outperforms the individual ensemble members. Unfortunately, standard distillation erases the natural uncertainty decomposition of the ensemble. We present a general framework for distilling both regression and classification ensembles in a way that preserves the decomposition. We demonstrate the desired behaviour of our framework and show that its predictive performance is on par with standard distillation.
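
For a classification ensemble, one common way to split predictive uncertainty is the entropy-based decomposition: the entropy of the averaged prediction (total) equals the average entropy of the members (aleatoric) plus their mutual-information gap (epistemic). The sketch below assumes this particular decomposition, which the abstract does not specify, and the function names are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(member_probs):
    """Return (total, aleatoric, epistemic) for one input.

    member_probs: list of class-probability vectors, one per ensemble member.
    total     = entropy of the ensemble-averaged prediction
    aleatoric = average entropy of the individual members
    epistemic = total - aleatoric (disagreement between members)
    """
    m = len(member_probs)
    n_classes = len(member_probs[0])
    mean_probs = [sum(p[c] for p in member_probs) / m for c in range(n_classes)]
    total = entropy(mean_probs)
    aleatoric = sum(entropy(p) for p in member_probs) / m
    return total, aleatoric, total - aleatoric


# Members that agree: all uncertainty is aleatoric.
print(decompose_uncertainty([[0.7, 0.3], [0.7, 0.3]]))
# Confident members that disagree: all uncertainty is epistemic.
print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
```

Averaging the members into a single probability vector keeps only the total term, which is why standard distillation loses the split between the two components.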

[22]
Title: ICE-BeeM: Identifiable Conditional Energy-Based Deep Models
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Despite the growing popularity of energy-based models, their identifiability properties are not well-understood. In this paper we establish sufficient conditions under which a large family of conditional energy-based models is identifiable in function space, up to a simple transformation. Our results build on recent developments in the theory of nonlinear ICA, showing that the latent representations in certain families of deep latent-variable models are identifiable. We extend these results to a very broad family of conditional energy-based models. In this family, the energy function is simply the dot-product between two feature extractors, one for the dependent variable, and one for the conditioning variable. We show that under mild conditions, the features are unique up to scaling and permutation. Second, we propose the framework of independently modulated component analysis (IMCA), a new form of nonlinear ICA where the independence assumption is relaxed. Importantly, we show that our energy-based model can be used for the estimation of the components: the features learned are a simple and often trivial transformation of the latents.

[23]
Title: Towards new cross-validation-based estimators for Gaussian process regression: efficient adjoint computation of gradients
Authors: Sébastien Petit (L2S, GdR MASCOT-NUM), Julien Bect (L2S, GdR MASCOT-NUM), Sébastien da Veiga (GdR MASCOT-NUM), Paul Feliot, Emmanuel Vazquez (L2S, GdR MASCOT-NUM)
Subjects: Computation (stat.CO); Machine Learning (stat.ML)

We consider the problem of estimating the parameters of the covariance function of a Gaussian process by cross-validation. We suggest using new cross-validation criteria derived from the literature of scoring rules. We also provide an efficient method for computing the gradient of a cross-validation criterion. To the best of our knowledge, our method is more efficient than what has been proposed in the literature so far. It makes it possible to lower the complexity of jointly evaluating leave-one-out criteria and their gradients.

[24]
Title: The role of regularization in classification of high-dimensional noisy Gaussian mixture
Comments: 8 pages + appendix, 6 figures
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Statistics Theory (math.ST)

We consider a high-dimensional mixture of two Gaussians in the noisy regime where even an oracle knowing the centers of the clusters misclassifies a small but finite fraction of the points. We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number $n$ of samples and their dimension $d$ go to infinity while their ratio is fixed to $\alpha= n/d$. We discuss surprising effects of the regularization, which in some cases allows Bayes-optimal performance to be reached. We also illustrate the interpolation peak at low regularization, and analyze the role of the respective sizes of the two clusters.

[25]
Title: Aggregated hold out for sparse linear regression with a robust loss function
Authors: Guillaume Maillard (CELESTE, LM-Orsay)
Subjects: Statistics Theory (math.ST)

Sparse linear regression methods generally have a free hyperparameter which controls the amount of sparsity, and is subject to a bias-variance tradeoff. This article considers the use of Aggregated hold-out to aggregate over values of this hyperparameter, in the context of linear regression with the Huber loss function. Aggregated hold-out (Agghoo) is a procedure which averages estimators selected by hold-out (cross-validation with a single split). In the theoretical part of the article, it is proved that Agghoo satisfies a non-asymptotic oracle inequality when it is applied to sparse estimators which are parametrized by their zero-norm. In particular, this includes a variant of the Lasso introduced by Zou, Hastie and Tibshirani. Simulations are used to compare Agghoo with cross-validation. They show that Agghoo performs better than CV when the intrinsic dimension is high and when there are confounders correlated with the predictive covariates.

[26]
Title: Revisiting Ensembles in an Adversarial Context: Improving Natural Accuracy
Comments: 5 pages, accepted to ICLR 2020 Workshop on Towards Trustworthy ML: Rethinking Security and Privacy for ML
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

A necessary characteristic for the deployment of deep learning models in real world applications is resistance to small adversarial perturbations while maintaining accuracy on non-malicious inputs. While robust training provides models that exhibit better adversarial accuracy than standard models, there is still a significant gap in natural accuracy between robust and non-robust models which we aim to bridge. We consider a number of ensemble methods designed to mitigate this performance difference. Our key insight is that models trained to withstand small attacks, when ensembled, can often withstand significantly larger attacks, and this concept can in turn be leveraged to optimize natural accuracy. We consider two schemes, one that combines predictions from several randomly initialized robust models, and the other that fuses features from robust and standard models.

[27]
Title: Hierarchical clustering with discrete latent variable models and the integrated classification likelihood
Subjects: Computation (stat.CO)

In this paper, we introduce a two-step methodology to extract a hierarchical clustering. This methodology considers the integrated classification likelihood criterion as an objective function, and applies to any discrete latent variable model (DLVM) where this quantity is tractable. The first step of the methodology involves maximizing the criterion with respect to the state of the discrete latent variables, with uninformative priors. To that end we propose a new hybrid algorithm based on greedy local searches as well as a genetic algorithm which allows the joint inference of the number $K$ of clusters and of the clusters themselves. The second step of the methodology is based on a bottom-up greedy procedure to extract a hierarchy of clusters from this natural partition. In a Bayesian context, this is achieved by considering the Dirichlet cluster proportion prior parameter $\alpha$ as a regularisation term controlling the granularity of the clustering. This second step allows the exploration of the clustering at coarser scales and the ordering of the clusters, an important output for the visual representations of the clustering results. The clustering results obtained with the proposed approach, on simulated as well as real settings, are compared with existing strategies and are shown to be particularly relevant. This work is implemented in the R package greed.

[28]
Title: Stagewise Enlargement of Batch Size for SGD-based Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can more easily lead to a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between the initialization and the optimum of the model parameter. Then, based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. More specifically, SEBS adopts a multi-stage scheme, and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD which decreases the learning rate by stage, SEBS can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for SGD, momentum SGD and AdaGrad. Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.
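As a rough illustration of the geometric enlargement described in this abstract, the sketch below builds a per-stage batch-size schedule. The function name `stagewise_batch_sizes`, the initial batch size, and the growth factor are assumptions made for illustration; the abstract does not state the paper's actual constants or stage lengths.

```python
def stagewise_batch_sizes(initial_batch_size, growth_factor, num_stages):
    """Return one batch size per stage, growing geometrically.

    Hypothetical helper: stage s uses initial_batch_size * growth_factor**s.
    """
    if initial_batch_size < 1 or growth_factor < 1 or num_stages < 0:
        raise ValueError("need batch size >= 1, growth factor >= 1, stages >= 0")
    return [initial_batch_size * growth_factor ** s for s in range(num_stages)]


# Illustrative values only (not taken from the paper).
schedule = stagewise_batch_sizes(initial_batch_size=16, growth_factor=2, num_stages=4)
print(schedule)
```

By contrast, the classical stagewise scheme mentioned in the abstract would keep the batch size fixed and decrease the learning rate from stage to stage.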

[29]
Title: Liquid Scorecards
Subjects: Other Statistics (stat.OT); Methodology (stat.ME)

Traditional credit scorecards are generalized additive models (GAMs) with step functions as the component functions. The shapes of the step functions may be constrained in order to satisfy the PILE (Palatability, Interpretability, Legal, Explain-ability) constraints. Before 2003, FICO used Linear Programming to find the traditional scorecard that approximately maximizes divergence subject to the PILE constraints. In this paper, I introduce the Liquid Scorecard, which allows the component functions to be, at least partially, smooth curves. I use Quadratic Programming and B-Spline theory to find the Liquid Scorecard that exactly maximizes divergence subject to the PILE constraints. FICO uses aspects of this technology to develop the famous FICO Credit Score.

[30]
Title: Off-Policy Evaluation and Learning for External Validity under a Covariate Shift
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)

We consider the evaluation and training of a new policy for the evaluation data by using the historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although the standard OPE and OPL assume the same distribution of covariate between the historical and evaluation data, there often exists a problem of a covariate shift, i.e., the distribution of the covariate of the historical data is different from that of the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. Then, we propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using an estimator of the density ratio between the distributions of the historical and evaluation data. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.

[31]
Title: Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Discrete Distributions
Authors: Yi Hao, Alon Orlitsky
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)

The profile of a sample is the multiset of its symbol frequencies. We show that for samples of discrete distributions, profile entropy is a fundamental measure unifying the concepts of estimation, inference, and compression. Specifically, profile entropy a) determines the speed of estimating the distribution relative to the best natural estimator; b) characterizes the rate of inferring all symmetric properties compared with the best estimator over any label-invariant distribution collection; c) serves as the limit of profile compression, for which we derive optimal near-linear-time block and sequential algorithms. To further our understanding of profile entropy, we investigate its attributes, provide algorithms for approximating its value, and determine its magnitude for numerous structural distribution families.
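The definition in the first sentence of this abstract is easy to make concrete. The sketch below computes the profile of a sample in two equivalent forms; the function names are hypothetical, and the code illustrates the profile only, not profile entropy or the paper's algorithms.

```python
from collections import Counter


def profile(sample):
    """Profile as a sorted list: one frequency per distinct symbol."""
    symbol_counts = Counter(sample)
    return sorted(symbol_counts.values())


def profile_as_multiplicities(sample):
    """Same profile as {frequency: number of symbols with that frequency}."""
    return dict(Counter(Counter(sample).values()))


# 'a' occurs 5 times, 'b' and 'r' twice each, 'c' and 'd' once each.
print(profile("abracadabra"))
print(profile_as_multiplicities("abracadabra"))
```

Because the profile discards symbol labels, relabeling the symbols of a sample leaves its profile unchanged.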

### Cross-lists for Thu, 27 Feb 20

[32]  arXiv:1811.05375 (cross-list from cs.CY) [pdf, ps, other]
Title: Comparison of Feature Extraction Methods and Predictors for Income Inference
Comments: Argentine Symposium on Big Data (AGRANDA), September 5, 2017
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)

Patterns of mobile phone communications, coupled with the information of the social network graph and financial behavior, allow us to make inferences of users' socio-economic attributes such as their income level. We present here several methods to extract features from mobile phone usage (calls and messages), and compare different combinations of supervised machine learning techniques and sets of features used as input for the inference of users' income. Our experimental results show that the Bayesian method based on the communication graph outperforms standard machine learning algorithms using node-based features.

[33]  arXiv:1812.01077 (cross-list from cs.SI) [pdf, other]
Title: Brief survey of Mobility Analyses based on Mobile Phone Datasets
Comments: Workshop on Urban Computing and Society. Petropolis, RJ, Brazil. Nov 28, 2018
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)

This is a brief survey of the research performed by Grandata Labs in collaboration with numerous academic groups around the world on the topic of human mobility. A driving theme in these projects is to use and improve Data Science techniques to understand mobility, as it can be observed through the lens of mobile phone datasets. We describe applications of mobility analyses for urban planning, prediction of data traffic usage, building delay tolerant networks, generating epidemiologic risk maps and measuring the predictability of human mobility.

[34]  arXiv:2002.10846 (cross-list from math.PR) [pdf, ps, other]
Title: A CLT in Stein's distance for generalized Wishart matrices and higher order tensors
Authors: Dan Mikulincer
Subjects: Probability (math.PR); Statistics Theory (math.ST)

We study the convergence along the central limit theorem for sums of independent tensor powers, $\frac{1}{\sqrt{d}}\sum\limits_{i=1}^d X_i^{\otimes p}$. We focus on the high-dimensional regime where $X_i \in \mathbb{R}^n$ and $n$ may scale with $d$. Our main result is a proposed threshold for convergence. Specifically, we show that, under some regularity assumption, if $n^{2p-1}\gg d$, then the normalized sum converges to a Gaussian. The results apply, among others, to symmetric uniform log-concave measures and to product measures. This generalizes several results found in the literature.
Our main technique is a novel application of optimal transport to Stein's method which accounts for the low dimensional structure which is inherent in $X_i^{\otimes p}$.

[35]  arXiv:2002.11104 (cross-list from cs.SI) [pdf, ps, other]
Title: An Information Diffusion Approach to Rumor Propagation and Identification on Twitter
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Machine Learning (stat.ML)

With the increasing use of online social networks as a source of news and information, the propensity for a rumor to disseminate widely and quickly poses a great concern, especially in disaster situations where users do not have enough time to fact-check posts before making the informed decision to react to a post that appears to be credible. In this study, we explore the propagation pattern of rumors on Twitter by exploring the dynamics of microscopic-level misinformation spread, based on the latent message and user interaction attributes. We perform supervised learning for feature selection and prediction. Experimental results with real-world data sets give the models' prediction accuracy at about 90% for the diffusion of both True and False topics. Our findings confirm that rumor cascades run deeper and that rumor masked as news, and messages that incite fear, will diffuse faster than other messages. We show that the models for True and False message propagation differ significantly, both in the prediction parameters and in the message features that govern the diffusion. Finally, we show that the diffusion pattern is an important metric in identifying the credibility of a tweet.

[36]  arXiv:2002.11137 (cross-list from cs.LG) [pdf, other]
Title: Dynamic Incentive-aware Learning: Robust Pricing in Contextual Auctions
Comments: Accepted for publication in Operations Research Journal (An earlier version of this paper was accepted to NeurIPS 2019.)
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)

Motivated by pricing in ad exchange markets, we consider the problem of robust learning of reserve prices against strategic buyers in repeated contextual second-price auctions. Buyers' valuations for an item depend on the context that describes the item. However, the seller is not aware of the relationship between the context and buyers' valuations, i.e., buyers' preferences. The seller's goal is to design a learning policy to set reserve prices via observing the past sales data, and her objective is to minimize her regret for revenue, where the regret is computed against a clairvoyant policy that knows buyers' heterogeneous preferences. Given the seller's goal, utility-maximizing buyers have the incentive to bid untruthfully in order to manipulate the seller's learning policy. We propose learning policies that are robust to such strategic behavior. These policies use the outcomes of the auctions, rather than the submitted bids, to estimate the preferences while controlling the long-term effect of the outcome of each auction on the future reserve prices. When the market noise distribution is known to the seller, we propose a policy called Contextual Robust Pricing (CORP) that achieves a T-period regret of $O(d\log(Td) \log (T))$, where $d$ is the dimension of the contextual information. When the market noise distribution is unknown to the seller, we propose two policies whose regrets are sublinear in $T$.
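For readers unfamiliar with the auction format, the sketch below shows how a reserve price determines the seller's revenue in a single second-price auction. It illustrates the standard mechanism only, not the CORP policy or any learning rule from the paper; the function name and the example bids are assumptions.

```python
def second_price_revenue(bids, reserve_price):
    """Seller revenue in one second-price auction with a reserve price.

    The item sells only if the highest bid meets the reserve; the winner
    then pays the larger of the second-highest bid and the reserve.
    """
    if not bids:
        return 0.0
    ordered = sorted(bids, reverse=True)
    highest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0.0
    if highest < reserve_price:
        return 0.0  # no sale: nobody meets the reserve
    return float(max(second, reserve_price))


# Illustrative bids: the reserve can raise revenue (4 vs. 3) or kill the sale.
for reserve in (2.0, 4.0, 6.0):
    print(reserve, second_price_revenue([5.0, 3.0], reserve))
```

The example shows why the reserve matters to the seller: set too low, it adds nothing beyond the second-highest bid; set too high, the item goes unsold.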

[37]  arXiv:2002.11151 (cross-list from cs.LG) [pdf, other]
Title: TxSim: Modeling Training of Deep Neural Networks on Resistive Crossbar Systems
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)

Resistive crossbars have attracted significant interest in the design of Deep Neural Network (DNN) accelerators due to their ability to natively execute massively parallel vector-matrix multiplications within dense memory arrays. However, crossbar-based computations face a major challenge due to a variety of device and circuit-level non-idealities, which manifest as errors in the vector-matrix multiplications and eventually degrade DNN accuracy. To address this challenge, there is a need for tools that can model the functional impact of non-idealities on DNN training and inference. Existing efforts towards this goal are either limited to inference, or are too slow to be used for large-scale DNN training. We propose TxSim, a fast and customizable modeling framework to functionally evaluate DNN training on crossbar-based hardware considering the impact of non-idealities. The key features of TxSim that differentiate it from prior efforts are: (i) It comprehensively models non-idealities during all training operations (forward propagation, backward propagation, and weight update) and (ii) it achieves computational efficiency by mapping crossbar evaluations to well-optimized BLAS routines and incorporates speedup techniques to further reduce simulation time with minimal impact on accuracy. TxSim achieves orders-of-magnitude improvement in simulation speed over prior works, and thereby makes it feasible to evaluate training of large-scale DNNs on crossbars. Our experiments using TxSim reveal that the accuracy degradation in DNN training due to non-idealities can be substantial (3%-10%) for large-scale DNNs, underscoring the need for further research in mitigation techniques. We also analyze the impact of various device and circuit-level parameters and the associated non-idealities to provide key insights that can guide the design of crossbar-based DNN training accelerators.

[38]  arXiv:2002.11167 (cross-list from physics.ao-ph) [pdf, other]
Title: Data-driven super-parameterization using deep learning: Experimentation with multi-scale Lorenz 96 systems and transfer-learning
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph); Machine Learning (stat.ML)

To make weather/climate modeling computationally affordable, small-scale processes are usually represented in terms of the large-scale, explicitly-resolved processes using physics-based or semi-empirical parameterization schemes. Another approach, computationally more demanding but often more accurate, is super-parameterization (SP), which involves integrating the equations of small-scale processes on high-resolution grids embedded within the low-resolution grids of large-scale processes. Recently, studies have used machine learning (ML) to develop data-driven parameterization (DD-P) schemes. Here, we propose a new approach, data-driven SP (DD-SP), in which the equations of the small-scale processes are integrated data-drivenly using ML methods such as recurrent neural networks. Employing multi-scale Lorenz 96 systems as testbed, we compare the cost and accuracy (in terms of both short-term prediction and long-term statistics) of parameterized low-resolution (LR), SP, DD-P, and DD-SP models. We show that with the same computational cost, DD-SP substantially outperforms LR, and is better than DD-P, particularly when scale separation is lacking. DD-SP is much cheaper than SP, yet its accuracy is the same in reproducing long-term statistics and often comparable in short-term forecasting. We also investigate generalization, finding that when models trained on data from one system are applied to a system with different forcing (e.g., more chaotic), the models often do not generalize, particularly when the short-term prediction accuracy is examined. But we show that transfer-learning, which involves re-training the data-driven model with a small amount of data from the new system, significantly improves generalization. Potential applications of DD-SP and transfer-learning in climate/weather modeling and the expected challenges are discussed.

[39]  arXiv:2002.11172 (cross-list from cs.LG) [pdf, other]
Title: A Sample Complexity Separation between Non-Convex and Convex Meta-Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for convex models. This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black-box, specifically at properties of the optimization trajectory. We construct a simple meta-learning instance that captures the problem of one-dimensional subspace learning. For the convex formulation of linear regression on this instance, we show that the new task sample complexity of any initialization-based meta-learning algorithm is $\Omega(d)$, where $d$ is the input dimension. In contrast, for the non-convex formulation of a two layer linear network on the same instance, we show that both Reptile and multi-task representation learning can have new task sample complexity of $\mathcal{O}(1)$, demonstrating a separation from convex meta-learning. Crucially, analyses of the training dynamics of these methods reveal that they can meta-learn the correct subspace onto which the data should be projected.

[40]  arXiv:2002.11184 (cross-list from q-bio.PE) [pdf, other]
Title: The Moran Genealogy Process
Subjects: Populations and Evolution (q-bio.PE); Probability (math.PR); Applications (stat.AP)

We give a novel representation of the Moran Genealogy Process, a continuous-time Markov process on the space of size-$n$ genealogies with the demography of the classical Moran process. We derive the generator and unique stationary distribution of the process and establish its uniform ergodicity. In particular, we show that any initial distribution converges exponentially to the probability measure identical to that of the Kingman coalescent. We go on to show that one-time sampling projects this stationary distribution onto a smaller-size version of itself. Next, we extend the Moran genealogy process to include sampling through time. This allows us to define the Sampled Moran Genealogy Process, another Markov process on the space of genealogies. We derive exact conditional and unconditional probability distributions for this process under the assumption of stationarity, and an expression for the likelihood of any sequence of genealogies it generates. This leads to some interesting observations pertinent to existing phylodynamic methods in the literature.

[41]  arXiv:2002.11187 (cross-list from cs.LG) [pdf, other]
Title: Reliable Estimation of Kullback-Leibler Divergence by Controlling Discriminator Complexity in the Reproducing Kernel Hilbert Space
Comments: 10 pages, 3 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Several scalable methods to compute the Kullback-Leibler (KL) divergence between two distributions using their samples have been proposed and applied in large-scale machine learning models. While they have been found to be unstable, the theoretical root cause of the problem is not clear. In this paper, we study in detail a generative adversarial network based approach that uses a neural network discriminator to estimate KL divergence. We argue that, in such a case, high fluctuations in the estimates are a consequence of not controlling the complexity of the discriminator function space. We provide a theoretical underpinning and remedy for this problem through the following contributions. First, we construct a discriminator in the Reproducing Kernel Hilbert Space (RKHS). This enables us to leverage sample complexity and mean embedding to theoretically relate the error probability bound of the KL estimates to the complexity of the neural-net discriminator. Based on this theory, we then present a scalable way to control the complexity of the discriminator for a consistent estimation of KL divergence. We support both our proposed theory and method to control the complexity of the RKHS discriminator in controlled experiments.
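The idea behind discriminator-based KL estimation can be sketched on a small discrete example. Assuming the two distributions are sampled equally often, the ideal discriminator is $D^*(x) = p(x)/(p(x)+q(x))$, so its logit equals $\log(p(x)/q(x))$ and averaging the logit under $p$ gives the KL divergence. The code below checks this identity with known probabilities; it does not implement the paper's RKHS discriminator or its complexity control, and all names and numbers are illustrative assumptions.

```python
import math


def kl_direct(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def optimal_discriminator(p, q):
    """Ideal discriminator output p/(p+q), assuming equal sampling of p and q."""
    return [pi / (pi + qi) for pi, qi in zip(p, q)]


def kl_from_discriminator(p, discriminator):
    """Average the discriminator logit log(D/(1-D)) under p."""
    return sum(pi * math.log(d / (1.0 - d)) for pi, d in zip(p, discriminator) if pi > 0)


# Illustrative distributions over three symbols (assumed, not from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(kl_direct(p, q))
print(kl_from_discriminator(p, optimal_discriminator(p, q)))
```

In practice the discriminator is learned from samples rather than computed from known probabilities, which is where the instability discussed in the abstract can arise.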

[42]  arXiv:2002.11192 (cross-list from q-bio.NC) [pdf, other]
Title: End-to-End Models for the Analysis of System 1 and System 2 Interactions based on Eye-Tracking Data
Comments: 11 pages, 2 figures, 1 table
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

While theories postulating a dual cognitive system take hold, quantitative confirmations are still needed to understand and identify interactions between the two systems or conflict events. Eye movements are among the most direct markers of the individual attentive load and may serve as an important proxy of information. In this work we propose a computational method, within a modified visual version of the well-known Stroop test, for the identification of different tasks and potential conflict events between the two systems through the collection and processing of data related to eye movements. A statistical analysis shows that the selected variables can characterize the variation of attentive load within different scenarios. Moreover, we show that Machine Learning techniques make it possible to distinguish between different tasks with good classification accuracy and to investigate the gaze dynamics in more depth.

[43]  arXiv:2002.11215 (cross-list from cs.LG) [pdf, other]
Title: EmbPred30: Assessing 30-days Readmission for Diabetic Patients using Categorical Embeddings
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Hospital readmission is a crucial healthcare quality measure that helps in determining the level of quality of care that a hospital offers to a patient and has proven to be immensely expensive. It is estimated that more than $25 billion are spent yearly due to readmission of diabetic patients in the USA. This paper benchmarks existing models and proposes a new embedding based state-of-the-art deep neural network (DNN). The model can identify whether a hospitalized diabetic patient will be readmitted within 30 days or not with an accuracy of 95.2% and Area Under the Receiver Operating Characteristics (AUROC) of 97.4% on data collected from 130 US hospitals between 1999-2008. The results are encouraging with patients having changes in medication while admitted having a high chance of getting readmitted. Identifying prospective patients for readmission could help the hospital systems in improving their inpatient care, thereby saving them from unnecessary expenditures.

[44]  arXiv:2002.11219 (cross-list from cs.LG) [pdf, ps, other]
Title: Convex Geometry and Duality of Over-parameterized Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We develop a convex analytic framework for ReLU neural networks which elucidates the inner workings of hidden neurons and their function space characteristics. We show that neural networks with rectified linear units act as convex regularizers, where simple solutions are encouraged via extreme points of a certain convex set. For one dimensional regression and classification, as well as rank-one data matrices, we prove that finite two-layer ReLU networks with norm regularization yield linear spline interpolation. We characterize the classification decision regions in terms of a closed form kernel matrix and minimum L1 norm solutions. This is in contrast to Neural Tangent Kernel which is unable to explain neural network predictions with finitely many neurons.
Our convex geometric description also provides intuitive explanations of hidden neurons as auto-encoders. In higher dimensions, we show that the training problem for two-layer networks can be cast as a convex optimization problem with infinitely many constraints. We then provide a family of convex relaxations to approximate the solution, and a cutting-plane algorithm to improve the relaxations. We derive conditions for the exactness of the relaxations and provide simple closed form formulas for the optimal neural network weights in certain cases. We also establish a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing. Extensive experimental results show that the proposed approach yields interpretable and accurate models.

[45]  arXiv:2002.11226 (cross-list from cs.LG) [pdf, other]
Title: Deep Learning and Statistical Models for Time-Critical Pedestrian Behaviour Prediction
Journal-ref: In: Gedeon T., Wong K., Lee M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1142. Springer, Cham
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

The time it takes for a classifier to make an accurate prediction can be crucial in many behaviour recognition problems. For example, an autonomous vehicle should detect hazardous pedestrian behaviour early enough for it to take appropriate measures. In this context, we compare the switching linear dynamical system (SLDS) and a three-layered bi-directional long short-term memory (LSTM) neural network, which are applied to infer pedestrian behaviour from motion tracks. We show that, though the neural network model achieves an accuracy of 80%, it requires long sequences to achieve this (100 samples or more). The SLDS has a lower accuracy of 74%, but it achieves this result with short sequences (10 samples).
To our knowledge, such a comparison on sequence length has not been considered in the literature before. The results provide a key intuition of the suitability of the models in time-critical problems.

[46]  arXiv:2002.11246 (cross-list from cs.LG) [pdf, other]
Title: Supervised Categorical Metric Learning with Schatten p-Norms
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Metric learning has been successful in learning new metrics adapted to numerical datasets. However, its development on categorical data still needs further exploration. In this paper, we propose a method, called CPML for categorical projected metric learning, that tries to efficiently (i.e., with less computational time and better prediction accuracy) address the problem of metric learning in categorical data. We make use of the Value Distance Metric to represent our data and propose new distances based on this representation. We then show how to efficiently learn new metrics. We also generalize several previous regularizers through the Schatten $p$-norm and provide a generalization bound for it that complements the standard generalization bound for metric learning. Experimental results show that our method provides

[47]  arXiv:2002.11304 (cross-list from cs.LG) [pdf, other]
Title: PaDGAN: A Generative Adversarial Network for Performance Augmented Diverse Designs
Authors: Wei Chen, Faez Ahmed
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Deep generative models are proven to be a useful tool for automatic design synthesis and design space exploration. When applied in engineering design, existing generative models face two challenges: 1) generated designs lack diversity and do not cover all areas of the design space and 2) it is difficult to explicitly improve the overall performance or quality of generated designs without excluding low-quality designs from the dataset, which may impair the performance of the trained model due to reduced training sample size.
In this paper, we simultaneously address these challenges by proposing a new Determinantal Point Processes based loss function for probabilistic modeling of diversity and quality. With this new loss function, we develop a variant of the Generative Adversarial Network, named "Performance Augmented Diverse Generative Adversarial Network" or PaDGAN, which can generate novel high-quality designs with good coverage of the design space. We demonstrate that PaDGAN can generate diverse and high-quality designs on both synthetic and real-world examples and compare PaDGAN against other models such as the vanilla GAN and the BezierGAN. Unlike typical generative models that usually generate new designs by interpolating within the boundary of training data, we show that PaDGAN expands the design space boundary towards high-quality regions. The proposed method is broadly applicable to many tasks including design space exploration, design optimization, and creative solution recommendation.

[48]  arXiv:2002.11318 (cross-list from cs.LG) [pdf, other]
Title: Invariance vs. Robustness of Neural Networks
Comments: Preliminary version presented in ICML 2018 Workshop on "Towards learning with limited labels: Equivariance, Invariance, and Beyond" as "Understanding Adversarial Robustness of Symmetric Networks"
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

We study the performance of neural network models on random geometric transformations and adversarial perturbations. Invariance means that the model's prediction remains unchanged when a geometric transformation is applied to an input. Adversarial robustness means that the model's prediction remains unchanged after small adversarial perturbations of an input. In this paper, we show a quantitative trade-off between rotation invariance and robustness.
We empirically study the following two cases: (a) change in adversarial robustness as we improve only the invariance of equivariant models via training augmentation, (b) change in invariance as we improve only the adversarial robustness using adversarial training. We observe that the rotation invariance of equivariant models (StdCNNs and GCNNs) improves by training augmentation with progressively larger random rotations but while doing so, their adversarial robustness drops progressively, and very significantly on MNIST. We take adversarially trained LeNet and ResNet models which have good $L_\infty$ adversarial robustness on MNIST and CIFAR-10, respectively, and observe that adversarial training with progressively larger perturbations results in a progressive drop in their rotation invariance profiles. Similar to the trade-off between accuracy and robustness known in previous work, we give a theoretical justification for the invariance vs. robustness trade-off observed in our experiments.

[49]  arXiv:2002.11323 (cross-list from cs.LG) [pdf, other]
Title: Convergence to Second-Order Stationarity for Non-negative Matrix Factorization: Provably and Concurrently
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Non-negative matrix factorization (NMF) is a fundamental non-convex optimization problem with numerous applications in Machine Learning (music analysis, document clustering, speech-source separation etc.). Despite having received extensive study, it is poorly understood whether or not there exist natural algorithms that can provably converge to a local minimum. Part of the reason is because the objective is heavily symmetric and its gradient is not Lipschitz. In this paper we define a multiplicative weight update type dynamics (modification of the seminal Lee-Seung algorithm) that runs concurrently and provably avoids saddle points (first order stationary points that are not second order).
Our techniques combine tools from dynamical systems such as stability and exploit the geometry of the NMF objective by reducing the standard NMF formulation over the non-negative orthant to a new formulation over (a scaled) simplex. An important advantage of our method is the use of concurrent updates, which permits implementations in parallel computing environments.

[50]  arXiv:2002.11328 (cross-list from cs.LG) [pdf, other]
Title: Rethinking Bias-Variance Trade-off for Generalization of Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The classical bias-variance trade-off predicts that bias decreases and variance increases with model complexity, leading to a U-shaped risk curve. Recent work calls this into question for neural networks and other over-parameterized models, for which it is often observed that larger models generalize better. We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is unimodal or bell-shaped: it increases then decreases with the width of the network. We vary the network architecture, loss function, and choice of dataset and confirm that variance unimodality occurs robustly for all models we considered. The risk curve is the sum of the bias and variance curves and displays different qualitative shapes depending on the relative scale of bias and variance, with the double descent curve observed in recent literature as a special case. We corroborate these empirical results with a theoretical analysis of two-layer linear networks with random first layer. Finally, evaluation on out-of-distribution data shows that most of the drop in accuracy comes from increased bias while variance increases by a relatively small amount. Moreover, we find that deeper models decrease bias and increase variance for both in-distribution and out-of-distribution data.
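The statement that the risk is the sum of the bias and variance terms can be checked numerically at a single test point. The sketch below assumes squared loss and a noiseless target, in which case the risk splits into squared bias plus variance; the function name and the example predictions are illustrative assumptions, not the paper's measurement procedure.

```python
def bias_variance_at_point(predictions, target):
    """Squared bias, variance, and squared-error risk at one test point.

    `predictions` holds the outputs of models trained on different
    training sets (hypothetical values in the example below).
    """
    n = len(predictions)
    mean_prediction = sum(predictions) / n
    bias_squared = (mean_prediction - target) ** 2
    variance = sum((p - mean_prediction) ** 2 for p in predictions) / n
    risk = sum((p - target) ** 2 for p in predictions) / n
    return bias_squared, variance, risk


bias_squared, variance, risk = bias_variance_at_point([1.0, 2.0, 3.0], target=1.0)
print(bias_squared, variance, risk)
```

For this example the squared bias is 1, the variance is 2/3, and the risk is 5/3, matching the decomposition.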
[51] arXiv:2002.11332 (cross-list from cs.LG) [pdf, ps, other]
Title: Structured Linear Contextual Bandits: A Sharp and Geometric Smoothed Analysis
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Bandit learning algorithms typically involve the balance of exploration and exploitation. However, in many practical applications, worst-case scenarios needing systematic exploration are seldom encountered. In this work, we consider a smoothed setting for structured linear contextual bandits where the adversarial contexts are perturbed by Gaussian noise and the unknown parameter $\theta^*$ has structure, e.g., sparsity, group sparsity, low rank, etc. We propose simple greedy algorithms for both the single- and multi-parameter (i.e., a different parameter for each context) settings and provide a unified regret analysis for $\theta^*$ with any assumed structure. The regret bounds are expressed in terms of geometric quantities such as Gaussian widths associated with the structure of $\theta^*$. We also obtain sharper regret bounds compared to earlier work for the unstructured $\theta^*$ setting as a consequence of our improved analysis. We show there is implicit exploration in the smoothed setting where a simple greedy algorithm works.

[52] arXiv:2002.11361 (cross-list from cs.LG) [pdf, other]
Title: Understanding Self-Training for Gradual Domain Adaptation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Machine learning systems must adapt to data distributions that evolve over time, in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces. We consider gradual domain adaptation, where the goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain.
We prove the first non-vacuous upper bound on the error of self-training with gradual shifts, under settings where directly adapting to the target domain can result in unbounded error. The theoretical analysis leads to algorithmic insights, highlighting that regularization and label sharpening are essential even when we have infinite data, and suggesting that self-training works particularly well for shifts with small Wasserstein-infinity distance. Leveraging the gradual shift structure leads to higher accuracies on a rotating MNIST dataset and a realistic Portraits dataset.

[53] arXiv:2002.11369 (cross-list from cs.LG) [pdf, other]
Title: Lipschitz standardization for robust multivariate learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Current trends in machine learning rely on out-of-the-box gradient-based approaches. With the aim of mitigating numerical errors and improving the convergence of the learning process, a common empirical practice is to standardize or normalize the data. However, there is a lack of theoretical analysis regarding why and when these methods result in an improvement of the learning process. In this work, we first study these methods in the context of black-box variational inference, specifically analyzing the effect that scaling the data has on the smoothness of the optimization landscape. Our analysis shows that no general rule applies in order to decide which of the existing data scaling methods will improve the learning process, or even whether any of them will. Second, we highlight the issues that arise when dealing with multivariate data, due to the discrepancy in smoothness of the likelihood functions for different variables, and the inability to scale discrete data. Finally, we propose a novel Lipschitz standardization, and its extension for discrete data, which overcomes the aforementioned limitations.
Specifically, as backed by our experiments, Lipschitz standardization i) favors a fairer learning across different variables in the data; and ii) results in faster and more accurate learning.

[54] arXiv:2002.11410 (cross-list from math.OC) [pdf, other]
Title: Efficient algorithms for multivariate shape-constrained convex regression problems
Subjects: Optimization and Control (math.OC); Machine Learning (stat.ML)

The shape-constrained convex regression problem deals with fitting a convex function to the observed data, where additional constraints are imposed, such as component-wise monotonicity and uniform Lipschitz continuity. This paper provides a comprehensive mechanism for computing the least squares estimator of a multivariate shape-constrained convex regression function in $\mathbb{R}^d$. We prove that the least squares estimator is computable via solving a constrained convex quadratic programming (QP) problem with $(n+1)d$ variables and at least $n(n-1)$ linear inequality constraints, where $n$ is the number of data points. For solving the generally very large-scale convex QP, we design two efficient algorithms: one is the symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM), and the other is the proximal augmented Lagrangian method (pALM) with the subproblems solved by the semismooth Newton method (SSN). Comprehensive numerical experiments, including those in the pricing of basket options and estimation of production functions in economics, demonstrate that both of our proposed algorithms outperform the state-of-the-art algorithm. The pALM is more efficient than the sGS-ADMM but the latter has the advantage of being simpler to implement.
[55] arXiv:2002.11416 (cross-list from eess.SP) [pdf, ps, other]
Title: Analytical Equations based Prediction Approach for PM2.5 using Artificial Neural Network
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Particulate matter pollution is one of the deadliest types of air pollution worldwide due to its significant impacts on the global environment and human health. Particulate Matter (PM2.5) is one of the important particulate pollutants used to measure the Air Quality Index (AQI). The conventional instruments used by air quality monitoring stations to monitor PM2.5 are costly, bulky, time-consuming, and power-hungry. Furthermore, due to limited data availability and non-scalability, these stations cannot provide high spatial and temporal resolution in real time. To overcome the disadvantages of the existing methodology, this article presents an analytical-equations-based prediction approach for PM2.5 using an Artificial Neural Network (ANN). Since the derived analytical equations for the prediction can be computed using a Wireless Sensor Node (WSN) or a low-cost processing tool, it demonstrates the usefulness of the proposed approach. Moreover, a study of the correlation between PM2.5 and other pollutants is performed to select the appropriate predictors. A large authenticated data set from the Central Pollution Control Board (CPCB) online station, India, is used for the proposed approach. The RMSE and coefficient of determination ($R^2$) obtained for the proposed prediction approach using eight predictors are 1.7973 µg/m³ and 0.9986, respectively. With three predictors, the proposed approach yields an RMSE of 7.5372 µg/m³ and an $R^2$ of 0.9708. Therefore, the results demonstrate that the proposed approach is one of the promising approaches for monitoring PM2.5 without power-hungry gas sensors and bulky analyzers.
[56] arXiv:2002.11423 (cross-list from cs.LG) [pdf, other]
Title: NeuralSens: Sensitivity Analysis of Neural Networks
Comments: 28 pages, 12 figures, submitted to Journal of Statistical Software (JSS)
Subjects: Machine Learning (cs.LG); Mathematical Software (cs.MS); Machine Learning (stat.ML)

Neural networks are important tools for data-intensive analysis and are commonly applied to model non-linear relationships between dependent and independent variables. However, neural networks are usually seen as "black boxes" that offer minimal information about how the input variables are used to predict the response in a fitted model. This article describes the NeuralSens package that can be used to perform sensitivity analysis of neural networks using the partial derivatives method. Functions in the package can be used to obtain the sensitivities of the output with respect to the input variables, evaluate variable importance based on sensitivity measures and characterize relationships between input and output variables. Methods to calculate sensitivities are provided for objects from common neural network packages in R, including neuralnet, nnet, RSNNS, h2o, neural, forecast and caret. The article presents an overview of the techniques for obtaining information from neural network models, a theoretical foundation of how the partial derivatives of the output with respect to the inputs of a multi-layer perceptron model are calculated, a description of the package structure and functions, and applied examples to compare NeuralSens functions with analogous functions from other available R packages.
[57] arXiv:2002.11429 (cross-list from cs.LG) [pdf, other]
Title: PHS: A Toolbox for Parellel Hyperparameter Search
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We introduce an open source Python framework named PHS - Parallel Hyperparameter Search to enable hyperparameter optimization on numerous compute instances of any arbitrary Python function. This is achieved with minimal modifications inside the target function. Possible applications appear in expensive-to-evaluate numerical computations which strongly depend on hyperparameters, such as machine learning. Bayesian optimization is chosen as a sample-efficient method to propose the next query set of parameters.

[58] arXiv:2002.11431 (cross-list from cs.CY) [pdf, ps, other]
Title: Simpler handling of clinical concepts in R with clinconcept
Authors: Robert C. Free
Subjects: Computers and Society (cs.CY); Applications (stat.AP)

Routinely collected data in electronic healthcare records are often underpinned by clinical concept dictionaries. Increasingly, data sets from these sources are being made available and used for research purposes, but without additional tooling it can be difficult to work effectively with these dictionaries due to their design, size and complex nature. In an effort to improve this situation the clinconcept package was created to provide a straightforward way for researchers to build, manage and interrogate databases containing commonly used clinical concept dictionaries. This article describes the rationale behind the package, how to install and use it, and how it can be extended to support other data sources.

[59] arXiv:2002.11436 (cross-list from cs.LG) [pdf, ps, other]
Title: Nonlinear classifiers for ranking problems based on kernelized SVM
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Many classification problems focus on maximizing the performance only on the samples with the highest relevance instead of all samples.
As an example, we can mention ranking problems, accuracy at the top or search engines where only the top few queries matter. In our previous work, we derived a general framework including several classes of these linear classification problems. In this paper, we extend the framework to nonlinear classifiers. Utilizing a similarity to SVM, we dualize the problems, add kernels and propose a componentwise dual ascent method. This allows us to perform one iteration in less than 20 milliseconds on relatively large datasets such as FashionMNIST.

[60] arXiv:2002.11440 (cross-list from cs.LG) [pdf, other]
Title: Non-Asymptotic Bounds for Zeroth-Order Stochastic Optimization
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We consider the problem of optimizing an objective function with and without convexity in a simulation-optimization context, where only stochastic zeroth-order information is available. We consider two techniques for estimating gradient/Hessian, namely simultaneous perturbation (SP) and Gaussian smoothing (GS). We introduce an optimization oracle to capture a setting where the function measurements have an estimation error that can be controlled. Our oracle is appealing in several practical contexts where the objective has to be estimated from i.i.d. samples, and increasing the number of samples reduces the estimation error. In the stochastic non-convex optimization context, we analyze the zeroth-order variant of the randomized stochastic gradient (RSG) and quasi-Newton (RSQN) algorithms with a biased gradient/Hessian oracle, and with its variant involving an estimation error component. In particular, we provide non-asymptotic bounds on the performance of both algorithms, and our results provide a guideline for choosing the batch size for estimation, so that the overall error bound matches with the one obtained when there is no estimation error.
Next, in the stochastic convex optimization setting, we provide non-asymptotic bounds that hold in expectation for the last iterate of a stochastic gradient descent (SGD) algorithm, and our bound for the GS variant of SGD matches the bound for SGD with unbiased gradient information. We perform simulation experiments on synthetic as well as real-world datasets, and the empirical results validate the theoretical findings.

[61] arXiv:2002.11442 (cross-list from cs.LG) [pdf, other]
Title: DeBayes: a Bayesian method for debiasing network embeddings
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

As machine learning algorithms are increasingly deployed for high-impact automated decision making, ethical and increasingly also legal standards demand that they treat all individuals fairly, without discrimination based on their age, gender, race or other sensitive traits. In recent years much progress has been made on ensuring fairness and reducing bias in standard machine learning settings. Yet, for network embedding, with applications in vulnerable domains ranging from social network analysis to recommender systems, current options remain limited both in number and performance. We thus propose DeBayes: a conceptually elegant Bayesian method that is capable of learning debiased embeddings by using a biased prior. Our experiments show that these representations can then be used to perform link prediction that is significantly more fair in terms of popular metrics such as demographic parity and equalized opportunity.

[62] arXiv:2002.11477 (cross-list from cs.CV) [pdf, other]
Title: Learning a Directional Soft Lane Affordance Model for Road Scenes Using Self-Supervision
Comments: Submitted to IEEE IV 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Humans navigate complex environments in an organized yet flexible manner, adapting to the context and implicit social rules.
Understanding these naturally learned patterns of behavior is essential for applications such as autonomous vehicles. However, algorithmically defining these implicit rules of human behavior remains difficult. This work proposes a novel self-supervised method for training a probabilistic network model to estimate the regions humans are most likely to drive in as well as a multimodal representation of the inferred direction of travel at each point. The model is trained on individual human trajectories conditioned on a representation of the driving environment. The model is shown to successfully generalize to new road scenes, demonstrating potential for real-world application as a prior for socially acceptable driving behavior in challenging or ambiguous scenarios which are poorly handled by explicit traffic rules.

[63] arXiv:2002.11498 (cross-list from eess.SP) [pdf, ps, other]
Title: Multi-frequency calibration for DOA estimation with distributed sensors
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)

In this work, we investigate direction finding in the presence of sensor gain uncertainties and directional perturbations for sensor array processing in a multi-frequency scenario. Specifically, we adopt a distributed optimization scheme in which coherence models are incorporated and local agents exchange information only between connected nodes in the network, i.e., without a fusion center. Numerical simulations highlight the advantages of the proposed parallel iterative technique in terms of statistical and computational efficiency.

[64] arXiv:2002.11501 (cross-list from cs.LG) [pdf, other]
Title: Dual Graph Representation Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Graph representation learning embeds nodes in large graphs as low-dimensional vectors and is of great benefit to many downstream applications.
Most embedding frameworks, however, are inherently transductive and unable to generalize to unseen nodes or learn representations across different graphs. Although inductive approaches can generalize to unseen nodes, they neglect different contexts of nodes and cannot learn node embeddings dually. In this paper, we present a context-aware unsupervised dual encoding framework, CADE, to generate representations of nodes by combining real-time neighborhoods with neighbor-attentioned representation, and preserving extra memory of known nodes. We show that our approach is effective by comparing it to state-of-the-art methods.

[65] arXiv:2002.11505 (cross-list from cs.DC) [pdf, other]
Title: Relaxed Scheduling for Scalable Belief Propagation
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning. Consequently, there has been considerable effort invested into developing efficient parallel variants of classic machine learning algorithms. However, despite the wealth of knowledge on parallelization, some classic machine learning algorithms often prove hard to parallelize efficiently while maintaining convergence. In this paper, we focus on efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular on the fundamental belief propagation algorithm. We address the challenge of efficiently parallelizing this classic paradigm by showing how to leverage scalable relaxed schedulers in this context. We present an extensive empirical study, showing that our approach outperforms previous parallel belief propagation implementations both in terms of scalability and in terms of wall-clock convergence time, on a range of practical applications.
[66] arXiv:2002.11519 (cross-list from cs.LG) [pdf, ps, other]
Title: Decidability of Sample Complexity of PAC Learning in finite setting
Authors: Alberto Gandolfi
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)

In this short note we observe that the sample complexity of PAC machine learning of various concepts, including learning the maximum (EMX), can be exactly determined when the support of the probability measures considered as models satisfies an a priori bound. This result contrasts with the recently discovered undecidability of EMX within ZFC for finitely supported probabilities (with no a priori bound). Unfortunately, the decision procedure is, at present, at least doubly exponential in the number of points times the uniform bound on the support size.

[67] arXiv:2002.11545 (cross-list from cs.LG) [pdf, other]
Title: A Survey towards Federated Semi-supervised Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

The success of Artificial Intelligence (AI) should be largely attributed to the accessibility of abundant data. However, this is not exactly the case in reality, where it is common for developers in industry to face insufficient, incomplete and isolated data. Consequently, federated learning was proposed to alleviate such challenges by allowing multiple parties to collaboratively build machine learning models without explicitly sharing their data, while preserving data privacy. However, existing federated learning algorithms mainly focus on cases where either the data do not require explicit labeling or all data are labeled. Yet in reality, we are often confronted with the case that labeling data itself is costly and there is no sufficient supply of labeled data. While such issues are commonly solved by semi-supervised learning, to the best of our knowledge, no existing effort has been devoted to federated semi-supervised learning.
In this survey, we briefly summarize prevalent semi-supervised algorithms and give a brief outlook on federated semi-supervised learning, including possible methodologies, settings and challenges.

[68] arXiv:2002.11565 (cross-list from cs.LG) [pdf, other]
Title: Randomization matters. How to defend against strong adversarial attacks
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Is there a classifier that ensures optimal robustness against all adversarial attacks? This paper answers this question by adopting a game-theoretic point of view. We show that adversarial attacks and defenses form an infinite zero-sum game where classical results (e.g. Sion's theorem) do not apply. We demonstrate the non-existence of a Nash equilibrium in our game when the classifier and the Adversary are both deterministic, hence giving a negative answer to the above question in the deterministic regime. Nonetheless, the question remains open in the randomized regime. We tackle this problem by showing that, under mild conditions on the dataset distribution, any deterministic classifier can be outperformed by a randomized one. This gives arguments for using randomization, and leads us to a new algorithm for building randomized classifiers that are robust to strong adversarial attacks. Empirical results validate our theoretical analysis, and show that our defense method considerably outperforms Adversarial Training against state-of-the-art attacks.

[69] arXiv:2002.11569 (cross-list from cs.LG) [pdf, other]
Title: Overfitting in adversarially robust deep learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

It is common practice in deep learning to use overparameterized networks and train for as long as possible; there are numerous studies that show, both theoretically and empirically, that such practices surprisingly do not unduly harm the generalization performance of the classifier.
In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. We find that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and ImageNet) and perturbation models ($\ell_\infty$ and $\ell_2$). Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting. Finally, we study several classical and modern deep learning remedies for overfitting, including regularization and data augmentation, and find that no approach in isolation improves significantly upon the gains achieved by early stopping. All code for reproducing the experiments as well as pretrained model weights and training logs can be found at https://github.com/locuslab/robust_overfitting.

[70] arXiv:2002.11570 (cross-list from eess.SY) [pdf, other]
Title: Calculations of System Adequacy Considering Heat Transition Pathways
Comments: Submitted to PMAPS 2020
Subjects: Systems and Control (eess.SY); Applications (stat.AP)

The decarbonisation of heat in developed economies represents a significant challenge, with increased penetration of electrical heating technologies potentially leading to unprecedented increases in peak electricity demand. This work considers a method to evaluate the impact of rapid electrification of heat by utilising historic gas demand data.
The work is intended to provide a data-driven complement to popular generative heat demand models, with a particular aim of informing regulators and actors in capacity markets as to how policy changes could impact on medium-term system adequacy metrics (up to five years ahead). Results from a GB case study show that the representation of heat demand using scaled gas demand profiles more than doubles the rate at which 1-in-20 system peaks grow, when compared to the use of scaled electricity demand profiles. Low end-use system efficiency, in terms of aggregate coefficient of performance and demand side response capabilities, is shown to potentially lead to a trebling of electrical demand-temperature sensitivity following five years of heat demand growth.

[71] arXiv:2002.11576 (cross-list from cs.LG) [pdf, other]
Title: NestedVAE: Isolating Common Factors via Weak Supervision
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Fair and unbiased machine learning is an important and active field of research, as decision processes are increasingly driven by models that learn from data. Unfortunately, any biases present in the data may be learned by the model, thereby inappropriately transferring that bias into the decision making process. We identify the connection between the task of bias reduction and that of isolating factors common between domains whilst encouraging domain specific invariance. To isolate the common factors we combine the theory of deep latent variable models with information bottleneck theory for scenarios whereby data may be naturally paired across domains and no additional supervision is required. The result is the Nested Variational AutoEncoder (NestedVAE). Two outer VAEs with shared weights attempt to reconstruct the input and infer a latent space, whilst a nested VAE attempts to reconstruct the latent representation of one image from the latent representation of its paired image.
In so doing, the nested VAE isolates the common latent factors/causes and becomes invariant to unwanted factors that are not shared between paired images. We also propose a new metric to provide a balanced method of evaluating consistency and classifier performance across domains which we refer to as the Adjusted Parity metric. An evaluation of NestedVAE on both domain and attribute invariance, change detection, and learning common factors for the prediction of biological sex demonstrates that NestedVAE significantly outperforms alternative methods.

[72] arXiv:2002.11589 (cross-list from cs.LG) [pdf, other]
Title: Recommendation on a Budget: Column Space Recovery from Partially Observed Entries with Random or Active Sampling
Authors: C. Kim, M. Bayati
Comments: A shorter version is accepted to AISTATS
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)

We analyze alternating minimization for column space recovery of a partially observed, approximately low rank matrix with a growing number of columns and a fixed budget of observations per column. In this work, we prove that if the budget is greater than the rank of the matrix, column space recovery succeeds: as the number of columns grows, the estimate from alternating minimization converges to the true column space with probability tending to one. From our proof techniques, we naturally formulate an active sampling strategy for choosing entries of a column that is theoretically and empirically (on synthetic and real data) better than the commonly studied uniformly random sampling strategy.

[73] arXiv:2002.11599 (cross-list from cs.IT) [pdf, other]
Title: Minimax Optimal Estimation of KL Divergence for Continuous Distributions
Subjects: Information Theory (cs.IT); Machine Learning (stat.ML)

Estimating Kullback-Leibler divergence from independent and identically distributed samples is an important problem in various domains.
One simple and effective estimator is based on the k nearest neighbor distances between these samples. In this paper, we analyze the convergence rates of the bias and variance of this estimator. Furthermore, we derive a lower bound of the minimax mean square error and show that the kNN method is asymptotically rate optimal.

[74] arXiv:2002.11603 (cross-list from cs.LG) [pdf, other]
Title: Differentially Private Mean Embeddings with Random Features (DP-MERF) for Simple & Practical Synthetic Data Generation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We present a differentially private data generation paradigm using random feature representations of kernel mean embeddings when comparing the distribution of true data with that of synthetic data. We exploit the random feature representations for two important benefits. First, we require a very low privacy cost for training deep generative models. This is because unlike kernel-based distance metrics that require computing the kernel matrix on all pairs of true and synthetic data points, we can detach the data-dependent term from the term solely dependent on synthetic data. Hence, we need to perturb the data-dependent term only once and can then use it until the end of the generator training. Second, we can obtain an analytic sensitivity of the kernel mean embedding as the random features are norm bounded by construction. This removes the necessity of hyperparameter search for a clipping norm to handle the unknown sensitivity of an encoder network when dealing with high-dimensional data. We provide several variants of our algorithm, differentially private mean embeddings with random features (DP-MERF), to generate (a) heterogeneous tabular data, (b) input features and corresponding labels jointly, and (c) high-dimensional data. Our algorithm achieves better privacy-utility trade-offs than existing methods tested on several datasets.
[75] arXiv:2002.11609 (cross-list from cs.CV) [pdf, other]
Title: ARMA Nets: Expanding Receptive Field for Dense Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Global information is essential for dense prediction problems, whose goal is to compute a discrete or continuous label for each pixel in the images. Traditional convolutional layers in neural networks, originally designed for image classification, are restrictive in these problems since their receptive fields are limited by the filter size. In this work, we propose the autoregressive moving-average (ARMA) layer, a novel module in neural networks to allow explicit dependencies of output neurons, which significantly expands the receptive field with minimal extra parameters. We show experimentally that the effective receptive field of neural networks with ARMA layers expands as autoregressive coefficients become larger. In addition, we demonstrate that neural networks with ARMA layers substantially improve the performance of challenging pixel-level video prediction tasks as our model enlarges the effective receptive field.

[76] arXiv:2002.11611 (cross-list from cs.LG) [pdf, other]
Title: Online Learning in Contextual Bandits using Gated Linear Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We introduce a new and completely online contextual bandit algorithm called Gated Linear Contextual Bandits (GLCB). This algorithm is based on Gated Linear Networks (GLNs), a recently introduced deep learning architecture with properties well-suited to the online setting. Leveraging data-dependent gating properties of the GLN we are able to estimate prediction uncertainty with effectively zero algorithmic overhead.
We empirically evaluate GLCB compared to 9 state-of-the-art algorithms that leverage deep neural networks, on a standard benchmark suite of discrete and continuous contextual bandit problems. GLCB obtains median first-place despite being the only online method, and we further support these results with a theoretical study of its convergence properties. [77] arXiv:2002.11613 (cross-list from cs.LG) [pdf, other] Title: The Differentially Private Lottery Ticket Mechanism Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) We propose the differentially private lottery ticket mechanism (DPLTM). An end-to-end differentially private training paradigm based on the lottery ticket hypothesis. Using "high-quality winners", selected via our custom score function, DPLTM significantly improves the privacy-utility trade-off over the state-of-the-art. We show that DPLTM converges faster, allowing for early stopping with reduced privacy budget consumption. We further show that the tickets from DPLTM are transferable across datasets, domains, and architectures. Our extensive evaluation on several public datasets provides evidence to our claims. [78] arXiv:2002.11618 (cross-list from cs.CY) [pdf, other] Title: Better coverage, better outcomes? Mapping mobile network data to official statistics using satellite imagery and radio propagation modelling Authors: Till Koebe Subjects: Computers and Society (cs.CY); Computation (stat.CO); Methodology (stat.ME) Mobile sensing data has become a popular data source for geo-spatial analysis, however, mapping it accurately to other sources of information such as statistical data remains a challenge. Popular mapping approaches such as point allocation or voronoi tessellation provide only crude approximations of the mobile network coverage as they do not consider holes, overlaps and within-cell heterogeneity. More elaborate mapping schemes often require additional proprietary data operators are highly reluctant to share. 
In this paper, I use human settlement information extracted from publicly available satellite imagery in combination with stochastic radio propagation modelling techniques to account for that. I investigate in a simulation study and a real-world application on unemployment estimates in Senegal whether better coverage approximations lead to better outcome predictions. The good news is: it does not have to be complicated. [79] arXiv:2002.11621 (cross-list from cs.CY) [pdf, ps, other] Title: Algorithms for Fair Team Formation in Online Labour Marketplaces Comments: Accepted at "FATES 2019 : 1st Workshop on Fairness, Accountability, Transparency, Ethics, and Society on the Web" (this http URL) Journal-ref: "Companion Proceedings of The 2019 World Wide Web Conference", 2019, pages 484-490 Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML) As freelancing work keeps on growing almost everywhere due to a sharp decrease in communication costs and to the widespread of Internet-based labour marketplaces (e.g., guru.com, feelancer.com, mturk.com, upwork.com), many researchers and practitioners have started exploring the benefits of outsourcing and crowdsourcing. Since employers often use these platforms to find a group of workers to complete a specific task, researchers have focused their efforts on the study of team formation and matching algorithms and on the design of effective incentive schemes. Nevertheless, just recently, several concerns have been raised on possibly unfair biases introduced through the algorithms used to carry out these selection and matching procedures. For this reason, researchers have started studying the fairness of algorithms related to these online marketplaces, looking for intelligent ways to overcome the algorithmic bias that frequently arises. 
Broadly speaking, the aim is to guarantee that, for example, the process of hiring workers through the use of machine learning and algorithmic data analysis tools does not discriminate, even unintentionally, on grounds of nationality or gender. In this short paper, we define the Fair Team Formation problem in the following way: given an online labour marketplace where each worker possesses one or more skills, and where all workers are divided into two or more not overlapping classes (for examples, men and women), we want to design an algorithm that is able to find a team with all the skills needed to complete a given task, and that has the same number of people from all classes. We provide inapproximability results for the Fair Team Formation problem together with four algorithms for the problem itself. We also tested the effectiveness of our algorithmic solutions by performing experiments using real data from an online labor marketplace. [80] arXiv:2002.11631 (cross-list from cs.CY) [pdf, other] Title: CausalML: Python Package for Causal Machine Learning Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML) CausalML is a Python implementation of algorithms related to causal inference and machine learning. Algorithms combining causal inference and machine learning have been a trending topic in recent years. This package tries to bridge the gap between theoretical work on methodology and practical applications by making a collection of methods in this field available in Python. This paper introduces the key concepts, scope, and use cases of this package. 
[81]  arXiv:2002.11637 (cross-list from cs.LG) [pdf, other]
Title: Learning Navigation Costs from Demonstration in Partially Observable Environments
Comments: 6 pages, 5 figures
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)

This paper focuses on inverse reinforcement learning (IRL) to enable safe and efficient autonomous navigation in unknown partially observable environments. The objective is to infer a cost function that explains expert-demonstrated navigation behavior while relying only on the observations and state-control trajectory used by the expert. We develop a cost function representation composed of two parts: a probabilistic occupancy encoder, with recurrent dependence on the observation sequence, and a cost encoder, defined over the occupancy features. The representation parameters are optimized by differentiating the error between demonstrated controls and a control policy computed from the cost encoder. Such differentiation is typically computed by dynamic programming through the value function over the whole state space. We observe that this is inefficient in large partially observable environments because most states are unexplored. Instead, we rely on a closed-form subgradient of the cost-to-go obtained only over a subset of promising states via an efficient motion-planning algorithm such as A* or RRT. Our experiments show that our model exceeds the accuracy of baseline IRL algorithms in robot navigation tasks, while substantially improving the efficiency of training and test-time inference.
[82]  arXiv:2002.11650 (cross-list from cs.LG) [pdf, ps, other]
Title: Corrupted Multidimensional Binary Search: Learning in the Presence of Irrational Agents
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); General Economics (econ.GN); Machine Learning (stat.ML)

Standard game-theoretic formulations for settings like contextual pricing and security games assume that agents act in accordance with a specific behavioral model. In practice, however, some agents may not subscribe to the dominant behavioral model or may act in ways that are arbitrarily inconsistent. Existing algorithms heavily depend on the model being (approximately) accurate for all agents and have poor performance in the presence of even a few such arbitrarily irrational agents. \emph{How do we design learning algorithms that are robust to the presence of arbitrarily irrational agents?} We address this question for a number of canonical game-theoretic applications by designing a robust algorithm for the fundamental problem of multidimensional binary search. The performance of our algorithm degrades gracefully with the number of corrupted rounds, which correspond to irrational agents and need not be known in advance. As binary search is the key primitive in algorithms for contextual pricing, Stackelberg Security Games, and other game-theoretic applications, we immediately obtain robust algorithms for these settings. Our techniques draw inspiration from learning theory, game theory, high-dimensional geometry, and convex analysis, and may be of independent algorithmic interest.
[83]  arXiv:2002.11651 (cross-list from cs.LG) [pdf, other]
Title: Fair Learning with Private Demographic Data
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)

Sensitive attributes such as race are rarely available to learners in real world settings as their collection is often restricted by laws and regulations. We give a scheme that allows individuals to release their sensitive information privately while still allowing any downstream entity to learn non-discriminatory predictors. We show how to adapt non-discriminatory learners to work with privatized protected attributes giving theoretical guarantees on performance. Finally, we highlight how the methodology could apply to learning fair predictors in settings where protected attributes are only available for a subset of the data.

[84]  arXiv:2002.11661 (cross-list from cs.DS) [pdf, other]
Title: Compact Representation of Uncertainty in Hierarchical Clustering
Comments: 21 pages, 5 figures
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)

Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. When multiple hierarchical clusterings of the data are possible, it is useful to represent uncertainty in the clustering through various probabilistic quantities. Existing approaches represent uncertainty for a range of models; however, they only provide approximate inference. This paper presents dynamic-programming algorithms and proofs for exact inference in hierarchical clustering. We are able to compute the partition function, MAP hierarchical clustering, and marginal probabilities of sub-hierarchies and clusters. Our method supports a wide range of hierarchical models and only requires a cluster compatibility function. Rather than scaling with the number of hierarchical clusterings of $n$ elements ($\omega(n n! / 2^{n-1})$), our approach runs in time and space proportional to the significantly smaller powerset of $n$. Despite still being large, these algorithms enable exact inference in small-data applications and are also interesting from a theoretical perspective. We demonstrate the utility of our method and compare its performance with respect to existing approximate methods.

[85]  arXiv:2002.11675 (cross-list from cs.DB) [pdf, other]
Title: Workload Prediction of Business Processes -- An Approach Based on Process Mining and Recurrent Neural Networks
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Recent advances in the interconnectedness and digitization of industrial machines, known as Industry 4.0, pave the way for new analytical techniques. Indeed, the availability and the richness of production-related data enable new data-driven methods. In this paper, we propose a process mining approach augmented with artificial intelligence that (1) reconstructs the historical workload of a company and (2) predicts the workload using neural networks. Our method relies on logs, representing the history of business processes related to manufacturing. These logs are used to quantify the supply and demand and are fed into a recurrent neural network model to predict customer orders. The corresponding activities to fulfill these orders are then sampled from history with a replay mechanism, based on criteria such as trace frequency and activity similarity. An evaluation and illustration of the method is performed on the administrative processes of Heraeus Materials SA. The workload prediction on a one-year test set achieves an MAPE score of 19% for a one-week forecast. The case study suggests a reasonable accuracy and confirms that a good understanding of the historical workload combined with articulated predictions is of great help for supporting management decisions and can decrease costs with better resource planning on a medium-term level.

[86]  arXiv:2002.11684 (cross-list from cs.LG) [pdf, other]
Title: Provable Meta-Learning of Linear Representations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Meta-learning, or learning-to-learn, seeks to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments. Representation learning---a key tool for performing meta-learning---learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical underpinnings of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression---in which multiple linear regression models share a common, low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms to address the dual challenges of (1) learning a common set of features from multiple, related tasks, and (2) transferring this knowledge to new, unseen tasks. Both are central to the general problem of meta-learning. Finally, we complement these results by providing information-theoretic lower bounds on the sample complexity of learning these linear features, showing that our algorithms are optimal up to logarithmic factors.
[87]  arXiv:2002.11686 (cross-list from cs.CR) [pdf]
Title: IoT Device Identification Using Deep Learning
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Machine Learning (stat.ML)

The growing use of IoT devices in organizations has increased the number of attack vectors available to attackers due to the less secure nature of the devices. The widely adopted bring your own device (BYOD) policy, which allows an employee to bring any IoT device into the workplace and attach it to an organization's network, also increases the risk of attacks. In order to address this threat, organizations often implement security policies in which only the connection of white-listed IoT devices is permitted. To monitor adherence to such policies and protect their networks, organizations must be able to identify the IoT devices connected to their networks and, more specifically, to identify connected IoT devices that are not on the white-list (unknown devices). In this study, we applied deep learning on network traffic to automatically identify IoT devices connected to the network. In contrast to previous work, our approach does not require that complex feature engineering be applied on the network traffic, since we represent the communication behavior of IoT devices using small images built from the IoT devices' network traffic payloads. In our experiments, we trained a multiclass classifier on a publicly available dataset, successfully identifying 10 different IoT devices and the traffic of smartphones and computers, with over 99% accuracy. We also trained multiclass classifiers to detect unauthorized IoT devices connected to the network, achieving over 99% overall average detection accuracy.
[88]  arXiv:2002.11701 (cross-list from cs.LG) [pdf, other]
Title: CLARA: Clinical Report Auto-completion
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)

Generating clinical reports from raw recordings such as X-rays and electroencephalogram (EEG) is an essential and routine task for doctors. However, it is often time-consuming to write accurate and detailed reports. Most existing methods try to generate whole reports from the raw input with limited success because 1) generated reports often contain errors that need manual review and correction, 2) they do not save time when doctors want to write additional information into the report, and 3) the generated reports are not customized based on individual doctors' preference. We propose {\it CL}inic{\it A}l {\it R}eport {\it A}uto-completion (CLARA), an interactive method that generates reports in a sentence by sentence fashion based on doctors' anchor words and partially completed sentences. CLARA searches for the most relevant sentences from existing reports as the template for the current report. The retrieved sentences are sequentially modified by combining with the input feature representations to create the final report. In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 for EEG reports for sentence-level generation, which is up to 35% improvement over the best baseline. Also via our qualitative evaluation, CLARA is shown to produce reports which have a significantly higher level of approval by doctors in a user study (3.74 out of 5 for CLARA vs 2.52 out of 5 for the baseline).
[89]  arXiv:2002.11708 (cross-list from cs.LG) [pdf, other]
Title: Generalized Hindsight for Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO); Machine Learning (stat.ML)

One of the key reasons for the high sample complexity in reinforcement learning (RL) is the inability to transfer knowledge from one task to another. In standard multi-task RL settings, low-reward data collected while trying to solve one task provides little to no signal for solving that particular task and is hence effectively wasted. However, we argue that this data, which is uninformative for one task, is likely a rich source of information for other tasks. To leverage this insight and efficiently reuse data, we present Generalized Hindsight: an approximate inverse reinforcement learning technique for relabeling behaviors with the right tasks. Intuitively, given a behavior generated under one task, Generalized Hindsight returns a different task that the behavior is better suited for. Then, the behavior is relabeled with this new task before being used by an off-policy RL optimizer. Compared to standard relabeling techniques, Generalized Hindsight provides a substantially more efficient reuse of samples, which we empirically demonstrate on a suite of multi-task navigation and manipulation tasks. Videos and code can be accessed here: https://sites.google.com/view/generalized-hindsight.
### Replacements for Thu, 27 Feb 20

[90]  arXiv:1610.01697 (replaced) [pdf, ps, other]
Title: Central Limit Theory for Combined Cross-Section and Time Series
Comments: arXiv admin note: substantial text overlap with arXiv:1507.04415
Subjects: Methodology (stat.ME)
[91]  arXiv:1805.02136 (replaced) [pdf, other]
Title: Private Sequential Learning
Comments: Accepted for presentation at COLT 2018
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[92]  arXiv:1808.02195 (replaced) [pdf, ps, other]
Title: Fisher information matrix for single molecules with stochastic trajectories
Journal-ref: SIAM Journal on Imaging Sciences, 2020, Vol. 13, No. 1 : pp. 234-264
Subjects: Quantitative Methods (q-bio.QM); Biological Physics (physics.bio-ph); Applications (stat.AP)
[93]  arXiv:1808.03201 (replaced) [pdf, other]
Title: An optimal design for hierarchical generalized group testing
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)
[94]  arXiv:1810.05546 (replaced) [pdf, other]
Title: Uncertainty in Neural Networks: Approximately Bayesian Ensembling
Comments: Please cite as published in AISTATS 2020
Journal-ref: The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[95]  arXiv:1810.07371 (replaced) [pdf, other]
Title: Simple Regret Minimization for Contextual Bandits
Comments: The first two authors contributed equally
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[96]  arXiv:1810.11755 (replaced) [pdf, other]
Title: Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[97]  arXiv:1811.07073 (replaced) [pdf, other]
Title: Semi-Supervised Semantic Image Segmentation with Self-correcting Networks
Comments: Accepted to CVPR 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[98]  arXiv:1901.07329 (replaced) [pdf, ps, other]
Title: The autofeat Python Library for Automated Feature Engineering and Selection
Comments: ECMLPKDD 2019 Workshop on Automating Data Science (ADS)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[99]  arXiv:1904.06963 (replaced) [pdf, other]
Title: The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[100]  arXiv:1904.07701 (replaced) [pdf, other]
Title: Multiple kernel learning for integrative consensus clustering of 'omic datasets
Comments: Manuscript: 22 pages, 7 figures. Supplement: 18 pages, 12 figures. This version contains new real data applications. For associated R code, see this https URL and this https URL
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
[101]  arXiv:1905.00699 (replaced) [pdf, ps, other]
Title: Long-tailed distributions of inter-event times as mixtures of exponential distributions
Comments: 2 figures, 4 tables, SI and code are available here: this https URL
Journal-ref: Royal Society Open Science, 7, 191643 (2020)
Subjects: Physics and Society (physics.soc-ph); Applications (stat.AP)
[102]  arXiv:1905.03297 (replaced) [pdf, other]
Title: Interpretable Subgroup Discovery in Treatment Effect Estimation with Application to Opioid Prescribing Guidelines
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[103]  arXiv:1905.11926 (replaced) [pdf, other]
Title: Network Deconvolution
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
[104]  arXiv:1906.00642 (replaced) [pdf, other]
Title: A Variational Approach for Learning from Positive and Unlabeled Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[105]  arXiv:1906.02768 (replaced) [pdf, other]
Title: Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP
Comments: ICLR 2020
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
[106]  arXiv:1906.11286 (replaced) [pdf, other]
Title: A Story of Two Streams: Reinforcement Learning Models from Human Behavior and Neuropsychiatry
Comments: Published in AAMAS 2020 as a full paper. This article supersedes our work arXiv:1706.02897 into RL setting and extends extensively into RL games, cognitive modeling, and gambling tasks in lifelong learning setting
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
[107]  arXiv:1906.11641 (replaced) [pdf, ps, other]
Title: A global approach for learning sparse Ising models
Comments: 15 pages, 4 figures. arXiv admin note: text overlap with arXiv:1902.04728 by other authors
Journal-ref: Mathematics and Computers in Simulation (2020)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
[108]  arXiv:1907.04809 (replaced) [pdf, other]
Title: Variational Autoencoders and Nonlinear ICA: A Unifying Framework
Comments: Accepted for publication at AISTATS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[109]  arXiv:1907.12363 (replaced) [pdf, other]
Title: A comparison of Deep Learning performances with other machine learning algorithms on credit scoring unbalanced data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[110]  arXiv:1908.07558 (replaced) [pdf, other]
Title: Transferring Robustness for Graph Neural Network Against Poisoning Attacks
Comments: Accepted by WSDM 2020. Code and data: this https URL
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
[111]  arXiv:1909.02669 (replaced) [pdf, other]
Title: Covariate Selection for Generalizing Experimental Results: Application to Large-Scale Development Program in Uganda
Subjects: Methodology (stat.ME)
[112]  arXiv:1909.04421 (replaced) [pdf, other]
Title: Privacy-Preserving Bandits
Comments: 13 pages, 7 figures
Journal-ref: In Proceedings of the 3rd Conference on Machine Learning and Systems (MLSys 2020)
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
[113]  arXiv:1909.05237 (replaced) [pdf, ps, other]
Title: Functional Principal Component Analysis as a Versatile Technique to Understand and Predict the Electric Consumption Patterns
Comments: Accepted for publication on Sustainable Energy, Grids and Networks (Elsevier)
Subjects: Systems and Control (eess.SY); Applications (stat.AP)
[114]  arXiv:1909.07698 (replaced) [pdf, other]
Title: Compositional uncertainty in deep Gaussian processes
Comments: 17 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[115]  arXiv:1909.09902 (replaced) [pdf, other]
Title: Deep Reinforcement Learning with Modulated Hebbian plus Q Network Architecture
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[116]  arXiv:1909.12064 (replaced) [pdf, other]
Title: Set Functions for Time Series
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[117]  arXiv:1910.01847 (replaced) [pdf, other]
Title: Unbiased CVR Prediction from Biased Conversions in Display Advertising
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[118]  arXiv:1910.02497 (replaced) [pdf, other]
Title: mfEGRA: Multifidelity Efficient Global Reliability Analysis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Computation (stat.CO)
[119]  arXiv:1910.04462 (replaced) [pdf, other]
Title: Fast Tree Variants of Gromov-Wasserstein
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[120]  arXiv:1910.04483 (replaced) [pdf, other]
Title: Tree-Wasserstein Barycenter for Large-Scale Multilevel Clustering and Scalable Bayes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[121]  arXiv:1910.09055 (replaced) [pdf, other]
Title: Image recognition from raw labels collected without annotators
Comments: Version changelog: Added content on ImageNet related experiments; Re-structured the document to incorporate the new content
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[122]  arXiv:1910.11369 (replaced) [pdf, other]
Title: Structured Prediction with Projection Oracles
Authors: Mathieu Blondel
Comments: In proceedings of NeurIPS 2019 (v2: minor modifications in Appendix A)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[123]  arXiv:1910.12179 (replaced) [pdf, other]
Title: BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning
Comments: 22 pages(13 pages for appendix); added new experimental results
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[124]  arXiv:1911.01413 (replaced) [pdf, ps, other]
Title: Sub-Optimal Local Minima Exist for Almost All Over-parameterized Neural Networks
Comments: 31 pages. Minor adjustments on some notations and wordings. An early version was submitted to Optimization Online on October 4
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[125]  arXiv:1911.09162 (replaced) [pdf, other]
Title: Deep Active Learning: Unified and Principled Method for Query and Training
Comments: AISTATS 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[126]  arXiv:1912.02765 (replaced) [pdf, other]
Title: On the Sample Complexity of Learning Sum-Product Networks
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
[127]  arXiv:1912.04261 (replaced) [pdf, other]
Title: A time resolved clustering method revealing longterm structures and their short-term internal dynamics
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[128]  arXiv:1912.05901 (replaced) [pdf, other]
Title: Adaptive Bayesian Reticulum
Comments: 23 pages, 8 figures, 2 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[129]  arXiv:1912.11398 (replaced) [pdf, ps, other]
Title: An error bound for Lasso and Group Lasso in high dimensions
Authors: Antoine Dedieu
Comments: arXiv admin note: text overlap with arXiv:1910.08880
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
[130]  arXiv:2001.00074 (replaced) [pdf, other]
Title: Combining interdependent climate model outputs in CMIP5: A spatial Bayesian approach
Subjects: Applications (stat.AP)
[131]  arXiv:2001.02323 (replaced) [pdf, other]
Title: On Thompson Sampling for Smoother-than-Lipschitz Bandits
Comments: Accepted to AISTATS 2020. 26 pages, 2 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[132]  arXiv:2001.06057 (replaced) [pdf, other]
Title: Increasing the robustness of DNNs against image corruptions by playing the Game of Noise
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[133]  arXiv:2001.09849 (replaced) [pdf, other]
Title: Exploiting Unsupervised Inputs for Accurate Few-Shot Classification
Comments: Fix typo, update parameters for 5 shot, add link towards code; Change format, add graph visu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[134]  arXiv:2001.11114 (replaced) [pdf, ps, other]
Title: Multi-Marginal Optimal Transport Defines a Generalized Metric
Authors: Liang Mi, José Bento
Subjects: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Functional Analysis (math.FA); Machine Learning (stat.ML)
[135]  arXiv:2002.02081 (replaced) [pdf, other]
Title: Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[136]  arXiv:2002.02579 (replaced) [pdf, other]
Title: Estimating Optimal Treatment Rules with an Instrumental Variable: A Partial Identification Learning Approach
Authors: Hongming Pu, Bo Zhang
Subjects: Methodology (stat.ME)
[137]  arXiv:2002.03495 (replaced) [pdf, ps, other]
Title: A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[138]  arXiv:2002.03549 (replaced) [pdf, other]
Title: Adversarial TCAV -- Robust and Effective Interpretation of Intermediate Layers in Neural Networks
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[139]  arXiv:2002.03860 (replaced) [pdf, other]
Title: Missing Data Imputation using Optimal Transport
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[140]  arXiv:2002.04019 (replaced) [pdf, other]
Title: Be Like Water: Robustness to Extraneous Variables Via Adaptive Feature Normalization
Comments: Aakash and Sreyas contributed equally
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[141]  arXiv:2002.04764 (replaced) [pdf, other]
Title: Capsules with Inverted Dot-Product Attention Routing
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[142]  arXiv:2002.05059 (replaced) [pdf, other]
Title: Goldilocks Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[143]  arXiv:2002.07284 (replaced) [pdf, other]
Title: Sharp Asymptotics and Optimal Performance for Inference in Binary Models
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)
[144]  arXiv:2002.09547 (replaced) [pdf, other]
Title: Stochastic Normalizing Flows
Comments: 17 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[145]  arXiv:2002.09954 (replaced) [pdf, other]
Title: Near-linear Time Gaussian Process Optimization with Adaptive Batching and Resparsification
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[146]  arXiv:2002.10043 (replaced) [pdf, other]
Title: Complete Dictionary Learning via $\ell_p$-norm Maximization
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)
[147]  arXiv:2002.10060 (replaced) [pdf, other]
Title: Handling the Positive-Definite Constraint in the Bayesian Learning Rule
Comments: Corrected some typos
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[148]  arXiv:2002.10211 (replaced) [pdf, other]
Title: Mnemonics Training: Multi-Class Incremental Learning without Forgetting
Comments: To appear in CVPR 2020. The camera-ready version with supplementary experiment results will come on 23rd March. Code will come soon at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[149]  arXiv:2002.10241 (replaced) [pdf, other]
Title: Multi-objective Consensus Clustering Framework for Flight Search Recommendation
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
[150]  arXiv:2002.10539 (replaced) [pdf, other]
Title: Efficient Rollout Strategies for Bayesian Optimization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[151]  arXiv:2002.10774 (replaced) [pdf, other]
Title: Counterfactual fairness: removing direct effects through regularization
Comments: 10 pages, 4 figures
Subjects: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[152]  arXiv:2002.11052 (replaced) [pdf, other]
Title: Relevant-features based Auxiliary Cells for Energy Efficient Detection of Natural Errors
Comments: 16 pages, 3 figures, 6 tables
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)