Statistics
New submissions
New submissions for Thu, 27 Feb 20
 [1] arXiv:2002.11152 [pdf, ps, other]

Title: Fundamental Issues Regarding Uncertainties in Artificial Neural Networks
Comments: 21 pages, 8 Figures, 2 Tables. To be submitted to Pattern Recognition
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Artificial Neural Networks (ANNs) implement a specific form of multivariate extrapolation and will generate an output for any input pattern, even when there is no similar training pattern. Extrapolations are not necessarily to be trusted, and in order to support safety-critical systems, we require such systems to give an indication of the training-sample-related uncertainty associated with their output. Some readers may think that this is a well-known issue which is already covered by the basic principles of pattern recognition. We will explain below how this is not the case and how the conventional (likelihood estimate of) conditional probability of classification does not correctly assess this uncertainty. We provide a discussion of the standard interpretations of this problem and show how a quantitative approach based upon long-standing methods can be practically applied. The methods are illustrated on the task of early diagnosis of dementing diseases using Magnetic Resonance Imaging.
 [2] arXiv:2002.11159 [pdf, other]

Title: Smoothing Graphons for Modelling Exchangeable Relational Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Modelling exchangeable relational data can be described by graphon theory. Most Bayesian methods for modelling exchangeable relational data can be attributed to this framework by exploiting different forms of graphons. However, the graphons adopted by existing Bayesian methods are either piecewise-constant functions, which are insufficiently flexible for accurate modelling of the relational data, or are complicated continuous functions, which incur heavy computational costs for inference. In this work, we introduce a smoothing procedure to piecewise-constant graphons to form smoothing graphons, which permit continuous intensity values for describing relations, but without impractically increasing computational costs. In particular, we focus on the Bayesian Stochastic Block Model (SBM) and demonstrate how to adapt the piecewise-constant SBM graphon to the smoothed version. We initially propose the Integrated Smoothing Graphon (ISG) which introduces one smoothing parameter to the SBM graphon to generate continuous relational intensity values. We then develop the Latent Feature Smoothing Graphon (LFSG), which improves on the ISG by introducing auxiliary hidden labels to decompose the calculation of the ISG intensity and enable efficient inference. Experimental results on real-world data sets validate the advantages of applying smoothing strategies to the Stochastic Block Model, demonstrating that smoothing graphons can greatly improve AUC and precision for link prediction without increasing computational complexity.
 [3] arXiv:2002.11182 [pdf, other]

Title: Information Directed Sampling for Linear Partial Monitoring
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Partial monitoring is a rich framework for sequential decision making under uncertainty that generalizes many well-known bandit models, including linear, combinatorial and dueling bandits. We introduce information directed sampling (IDS) for stochastic partial monitoring with a linear reward and observation structure. IDS achieves adaptive worst-case regret rates that depend on precise observability conditions of the game. Moreover, we prove lower bounds that classify the minimax regret of all finite games into four possible regimes. IDS achieves the optimal rate in all cases up to logarithmic factors, without tuning any hyperparameters. We further extend our results to the contextual and the kernelized setting, which significantly increases the range of possible applications.
 [4] arXiv:2002.11204 [pdf]

Title: Classical and Bayesian Analyses of a Mixture of Exponential and Lomax Distributions
Subjects: Methodology (stat.ME)
The exponential and the Lomax distributions are widely used in life testing experiments in mixture models. A mixture model of the exponential distribution and the Lomax distribution is proposed. Parameters of the proposed model are estimated using classical and Bayesian procedures under type-I right censoring. Expressions for Bayes estimators are derived assuming noninformative (uniform and Jeffreys) priors under symmetric and asymmetric loss functions. Posterior predictive distributions of a future observation are derived and predictive estimates are obtained. Extensive Monte Carlo simulations are carried out to investigate the performance of the estimators in terms of sample sizes, censoring times and mixing proportions. The analysis of the mixture model is carried out using a data set of lifetimes of transmitter receivers. Interesting properties of the estimators are observed and discussed.
 [5] arXiv:2002.11223 [pdf, other]

Title: Device Heterogeneity in Federated Learning: A Superquantile Approach
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Optimization and Control (math.OC)
We propose a federated learning framework to handle heterogeneous client devices which do not conform to the population data distribution. The approach hinges upon a parameterized superquantile-based objective, where the parameter ranges over levels of conformity. We present an optimization algorithm and establish its convergence to a stationary point. We show how to practically implement it using secure aggregation by interleaving iterations of the usual federated averaging method with device filtering. We conclude with numerical experiments on neural networks as well as linear models on tasks from computer vision and natural language processing.
 [6] arXiv:2002.11228 [pdf, other]

Title: Enforcing Mean Reversion in State Space Models for Prawn Pond Water Quality Forecasting
Journal-ref: Computers and Electronics in Agriculture, Volume 168, 2020, 105120, ISSN 0168-1699
Subjects: Applications (stat.AP)
The contribution of this study is a novel approach to introduce mean reversion in multi-step-ahead forecasts of state-space models. This approach is demonstrated in a prawn pond water quality forecasting application. The mean reversion constrains forecasts by gradually drawing them to an average of previously observed dynamics. This corrects deviations in forecasts caused by irregularities such as chaotic, nonlinear, and stochastic trends. The key features of the approach include (1) it enforces mean reversion, (2) it provides a means to model both short- and long-term dynamics, (3) it is able to apply mean reversion to select structural state-space components, and (4) it is simple to implement. Our mean reversion approach is demonstrated on various state-space models and compared with several time-series models on a prawn pond water quality dataset. Results show that mean reversion reduces long-term forecast errors by over 60% to produce the most accurate models in the comparison.
 [7] arXiv:2002.11236 [pdf, other]

Title: Paired Comparisons Modeling using t-Distribution with Bayesian Analysis
Subjects: Methodology (stat.ME)
A paired comparison analysis is the simplest way to make comparative judgments between objects, where objects may be goods, services or skills. For a set of problems, this technique helps to choose the most important problem to solve first and/or provides the solution that will be the most effective. This paper presents the theory of the paired comparisons method and contributes to paired comparisons models by developing a new model based on the t-distribution. The developed model is illustrated using a data set of citations among four famous journals of Statistics. Using Bayesian analysis, the journals are ranked as JRSSB > Biometrika > JASA > Comm. in Stats.
 [8] arXiv:2002.11239 [pdf, ps, other]

Title: Extremes of Censored and Uncensored Lifetimes in Survival Data
Comments: 1 figure, 23 pages
Subjects: Statistics Theory (math.ST); Applications (stat.AP)
The i.i.d. censoring model for survival analysis assumes two independent sequences of i.i.d. positive random variables, $(T_i^*)_{1\le i\le n}$ and $(U_i)_{1\le i\le n}$. The data consists of observations on the random sequence $\big(T_i=\min(T_i^*,U_i)\big)$ together with accompanying censor indicators. Values of $T_i$ with $T_i^*\le U_i$ are said to be uncensored, those with $T_i^*> U_i$ are censored. We assume that the distributions of the $T_i^*$ and $U_i$ are in the domain of attraction of the Gumbel distribution and obtain the asymptotic distributions, as sample size $n\to\infty$, of the maximum values of the censored and uncensored lifetimes in the data, and of statistics related to them. These enable us to examine questions concerning the possible existence of cured individuals in the population.
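As an illustration of the censoring model in this abstract, the following minimal sketch simulates pairs $(T_i, \delta_i)$ with $T_i=\min(T_i^*,U_i)$ and reads off the largest uncensored and largest censored observations. The exponential distributions (which lie in the Gumbel domain of attraction), the rates, and the function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import random

def simulate_censored(n, lifetime_rate=1.0, censor_rate=0.5, seed=0):
    """Draw n pairs (T_i, uncensored_i) with T_i = min(T_i^*, U_i).

    Exponential lifetimes and censoring times are an illustrative assumption.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t_star = rng.expovariate(lifetime_rate)  # latent lifetime T_i^*
        u = rng.expovariate(censor_rate)         # censoring time U_i
        data.append((min(t_star, u), t_star <= u))
    return data

def extremes(data):
    """Return (largest uncensored time, largest censored time); None if a group is empty."""
    uncensored = [t for t, is_uncensored in data if is_uncensored]
    censored = [t for t, is_uncensored in data if not is_uncensored]
    return (max(uncensored) if uncensored else None,
            max(censored) if censored else None)

sample = simulate_censored(1000)
max_uncensored, max_censored = extremes(sample)
```

Comparing the two maxima is only meant to show which quantities the abstract refers to; the asymptotic results themselves are in the paper.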
 [9] arXiv:2002.11243 [pdf]

Title: Correspondence Analysis between the Location and the Leading Causes of Death in the United States
Journal-ref: International Journal of Ecological Economics and Statistics, 41(1), 47-54, 2020
Subjects: Applications (stat.AP); Computation (stat.CO)
Correspondence Analysis analyzes two-way or multi-way tables, with each row and column becoming a point on a multidimensional graphical map called a biplot. It can be used to extract essential dimensions, allowing simplification of the data matrix. This study aims to measure the association between the location and the leading causes of death in the United States of America and to determine the location with which a particular disease is highly associated. The research data consists of two variables with 510 data points. Results show that there is a significant association between the location and the leading cause of death in the United States, and 61% of the variance in the model is explained by the first two dimensions.
 [10] arXiv:2002.11255 [pdf, other]

Title: An Optimal Statistical and Computational Framework for Generalized Tensor Estimation
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
This paper describes a flexible framework for generalized low-rank tensor estimation problems that includes many important instances arising from applications in computational imaging, genomics, and network analysis. The proposed estimator consists of finding a low-rank tensor fit to the data under generalized parametric models. To overcome the difficulty of non-convexity in these problems, we introduce a unified approach of projected gradient descent that adapts to the underlying low-rank structure. Under mild conditions on the loss function, we establish both an upper bound on statistical error and the linear rate of computational convergence through a general deterministic analysis. Then we further consider a suite of generalized tensor estimation problems, including sub-Gaussian tensor denoising, tensor regression, and Poisson and binomial tensor PCA. We prove that the proposed algorithm achieves the minimax optimal rate of convergence in estimation error. Finally, we demonstrate the superiority of the proposed framework via extensive experiments on both simulated and real data.
 [11] arXiv:2002.11256 [pdf, ps, other]

Title: Incorporating Expert Prior Knowledge into Experimental Design via Posterior Sampling
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Scientific experiments are usually expensive due to complex experimental preparation and processing. Experimental design is therefore concerned with the task of finding the optimal experimental input that results in the desirable output by using as few experiments as possible. Experimenters can often acquire knowledge about the location of the global optimum. However, they do not know how to exploit this knowledge to accelerate experimental design. In this paper, we adopt the technique of Bayesian optimization for experimental design, since Bayesian optimization has established itself as an efficient tool for optimizing expensive black-box functions. However, it is unknown how to incorporate the expert prior knowledge about the global optimum into the Bayesian optimization process. To address this, we represent the expert knowledge about the global optimum by placing a prior distribution on it, and we then derive its posterior distribution. An efficient Bayesian optimization approach is proposed via posterior sampling on the posterior distribution of the global optimum. We theoretically analyze the convergence of the proposed algorithm and discuss the robustness of incorporating the expert prior. We evaluate the efficiency of our algorithm by optimizing synthetic functions and tuning hyperparameters of classifiers, along with a real-world experiment on the synthesis of short polymer fiber. The results clearly demonstrate the advantages of our proposed method.
 [12] arXiv:2002.11259 [pdf, ps, other]

Title: Scientific versus statistical modelling: a unifying approach
Subjects: Statistics Theory (math.ST)
This paper addresses two fundamental features of quantities modeled and analysed in statistical science, their dimensions (e.g. time) and measurement scales (units). Examples show that subtle issues can arise when dimensions and measurement scales are ignored. Special difficulties arise when the models involve transcendental functions. A transcendental function important in statistics is the logarithm which is used in likelihood calculations and is a singularity in the family of Box-Cox algebraic functions. Yet neither the argument of the logarithm nor its value can have units of measurement. Physical scientists have long recognized that dimension/scale difficulties can be sidestepped by nondimensionalizing the model; after all, models of natural phenomena cannot depend on the units by which they are measured, and the celebrated Buckingham Pi theorem is a consequence. The paper reviews that theorem, recognizing that the statistical invariance principle arose with similar aspirations. However, the potential relationship between the theorem and statistical invariance has not been investigated until very recently. The main result of the paper is an exploration of that link, which leads to an extension of the Pi-theorem that puts it in a stochastic framework and thus quantifies uncertainties in deterministic physical models.
 [13] arXiv:2002.11275 [pdf, other]

Title: Adversarial Monte Carlo Meta-Learning of Optimal Prediction Procedures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
We frame the meta-learning of prediction procedures as a search for an optimal strategy in a two-player game. In this game, Nature selects a prior over distributions that generate labeled data consisting of features and an associated outcome, and the Predictor observes data sampled from a distribution drawn from this prior. The Predictor's objective is to learn a function that maps from a new feature to an estimate of the associated outcome. We establish that, under reasonable conditions, the Predictor has an optimal strategy that is equivariant to shifts and rescalings of the outcome and is invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. We introduce a neural network architecture that satisfies these properties. The proposed strategy performs favorably compared to standard practice in both parametric and nonparametric experiments.
 [14] arXiv:2002.11276 [pdf, other]

Title: A Balancing Weight Framework for Estimating the Causal Effect of General Treatments
Authors: Guillaume Martinet
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
In observational studies, weighting methods that directly optimize the balance between treatment and covariates have received much attention lately; however, these have mainly focused on binary treatments. Inspired by domain adaptation, we show that such methods can actually be reformulated as specific implementations of a discrepancy minimization problem aimed at tackling a shift of distribution from observational to interventional data. More precisely, we introduce a new framework, Covariate Balance via Discrepancy Minimization (CBDM), that provably encompasses most of the existing balancing weight methods and formally extends them to treatments of arbitrary types (e.g., continuous or multivariate). We establish theoretical guarantees for our framework that both offer generalizations of properties known when the treatment is binary, and give a better grasp on what hyperparameters to choose in non-binary settings. Based on such insights, we propose a particular implementation of CBDM for estimating dose-response curves and demonstrate through experiments its competitive performance relative to other existing approaches for continuous treatments.
 [15] arXiv:2002.11394 [pdf, other]

Title: Bayesian Nonparametric Space Partitions: A Survey
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian nonparametric space partition (BNSP) models provide a variety of strategies for partitioning a $D$-dimensional space into a set of blocks, such that data points lying in the same block share certain kinds of homogeneity. BNSP models can be applied to various areas, such as regression/classification trees, random feature construction, relational modeling, etc. In this survey, we investigate the current progress of BNSP research from the following three perspectives: models, which reviews various strategies for generating the partitions in the space and discusses their theoretical foundation, 'self-consistency'; applications, which covers the current mainstream usages of BNSP models and their potential future practices; and challenges, which identifies the current unsolved problems and valuable future research topics. As there has been no comprehensive review of the BNSP literature before, we hope that this survey can induce further exploration and exploitation on this topic.
 [16] arXiv:2002.11448 [pdf, other]

Title: Predicting Neural Network Accuracy from Weights
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We study the prediction of the accuracy of a neural network given only its weights with the goal of better understanding network training and performance. To do so, we propose a formal setting which frames this task and connects to previous work in this area. We collect (and release) a large dataset of almost 80k convolutional neural networks trained on four image datasets. We demonstrate that strong predictors of accuracy exist. Moreover, they can achieve good predictions while only using simple statistics of the weights. Surprisingly, these predictors are able to rank networks trained on unobserved datasets or using different architectures.
 [17] arXiv:2002.11451 [pdf, other]

Title: Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models
Comments: Accepted at AISTATS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We propose automated augmented conjugate inference, a new inference method for non-conjugate Gaussian process (GP) models. Our method automatically constructs an auxiliary variable augmentation that renders the GP model conditionally conjugate. Building on the conjugate structure of the augmented model, we develop two inference methods. First, a fast and scalable stochastic variational inference method that uses efficient block coordinate ascent updates, which are computed in closed form. Second, an asymptotically correct Gibbs sampler that is useful for small datasets. Our experiments show that our method is up to two orders of magnitude faster and more robust than existing state-of-the-art black-box methods.
 [18] arXiv:2002.11457 [pdf, ps, other]

Title: A short note on learning discrete distributions
Authors: Clément L. Canonne
Comments: This is a review article; its intent is not to provide new results, but instead to gather known (and useful) ones, along with their proofs, in a single convenient location
Subjects: Statistics Theory (math.ST); Probability (math.PR)
The goal of this short note is to provide simple proofs for the "folklore facts" on the sample complexity of learning a discrete probability distribution over a known domain of size $k$ to various distances $\varepsilon$, with error probability $\delta$.
 [19] arXiv:2002.11475 [pdf]

Title: A Visual Sensitivity Analysis for Parameter-Augmented Ensembles of Curves
Authors: Alejandro Ribes (EDF R&D PERICLES), Joachim Pouderoux, Bertrand Iooss (EDF R&D PRISME, GdR MASCOT-NUM, IMT)
Journal-ref: The Journal of Verification, Validation and Uncertainty Quantification (VVUQ), 2019, 4 (4)
Subjects: Statistics Theory (math.ST)
Engineers and computational scientists often study the behavior of their simulations by repeated solutions with variations in their parameters, which can be for instance boundary values or initial conditions. Through such simulation ensembles, uncertainty in a solution is studied as a function of the various input parameters. Solutions of numerical simulations are often temporal functions, spatial maps or spatio-temporal outputs. The usual way to deal with such complex outputs is to limit the analysis to several probes in the temporal/spatial domain. This leads to smaller and more tractable ensembles of functional outputs (curves) with their associated input parameters: augmented ensembles of curves. This article describes a system for the interactive exploration and analysis of such augmented ensembles. Descriptive statistics on the functional outputs are performed by Principal Component Analysis projection, kernel density estimation and the computation of High Density Regions. This makes possible the calculation of functional quantiles and outliers. Brushing and linking the elements of the system allows in-depth analysis of the ensemble. The system allows for functional descriptive statistics, cluster detection and finally for the realization of a visual sensitivity analysis via cobweb plots. We present two synthetic examples and then validate our approach in an industrial use case concerning a marine current study using a hydraulic solver.
 [20] arXiv:2002.11511 [pdf, other]

Title: A Comparative Study of Machine Learning Models for Predicting the State of Reactive Mixing
Comments: 31 pages
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Accurate predictions of reactive mixing are critical for many Earth and environmental science problems. To investigate mixing dynamics over time under different scenarios, a high-fidelity, finite-element-based numerical model is built to solve the fast, irreversible bimolecular reaction-diffusion equations to simulate a range of reactive-mixing scenarios. A total of 2,315 simulations are performed using different sets of model input parameters comprising various spatial scales of vortex structures in the velocity field, time-scales associated with velocity oscillations, the perturbation parameter for the vortex-based velocity, anisotropic dispersion contrast, and molecular diffusion. Outputs comprise concentration profiles of the reactants and products. The inputs and outputs of these simulations are concatenated into feature and label matrices, respectively, to train 20 different machine learning (ML) emulators to approximate system behavior. The 20 ML emulators, based on linear methods, Bayesian methods, ensemble learning methods, and the multilayer perceptron (MLP), are compared against one another. The ML emulators are specifically trained to classify the state of mixing and predict three quantities of interest (QoIs) characterizing species production, decay, and degree of mixing. Linear classifiers and regressors fail to reproduce the QoIs; however, ensemble methods (classifiers and regressors) and the MLP accurately classify the state of reactive mixing and predict the QoIs. Among ensemble methods, random forest and decision-tree-based AdaBoost faithfully predict the QoIs. At run time, trained ML emulators are $\approx10^5$ times faster than the high-fidelity numerical simulations. The speed and accuracy of the ensemble and MLP models facilitate uncertainty quantification, which usually requires thousands of model runs, to estimate the uncertainty bounds on the QoIs.
 [21] arXiv:2002.11531 [pdf, ps, other]

Title: A general framework for ensemble distribution distillation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Ensembles of neural networks have been shown to give better performance than single networks, both in terms of predictions and uncertainty estimation. Additionally, ensembles allow the uncertainty to be decomposed into aleatoric (data) and epistemic (model) components, giving a more complete picture of the predictive uncertainty. Ensemble distillation is the process of compressing an ensemble into a single model, often resulting in a leaner model that still outperforms the individual ensemble members. Unfortunately, standard distillation erases the natural uncertainty decomposition of the ensemble. We present a general framework for distilling both regression and classification ensembles in a way that preserves the decomposition. We demonstrate the desired behaviour of our framework and show that its predictive performance is on par with standard distillation.
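To make the aleatoric/epistemic split mentioned in this abstract concrete, here is a minimal sketch for a classification ensemble. The abstract does not say which decomposition the paper uses; the entropy-based one below (total predictive entropy = average member entropy + disagreement between members) is a common choice and is an assumption here, as are the function names.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def decompose_uncertainty(member_probs):
    """Split ensemble uncertainty for one input.

    member_probs: one class-probability vector per ensemble member.
    Returns (total, aleatoric, epistemic) where
      total     = entropy of the averaged prediction,
      aleatoric = average entropy of the individual members,
      epistemic = total - aleatoric (mutual information, never negative).
    """
    m = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(p[j] for p in member_probs) / m for j in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in member_probs) / m
    return total, aleatoric, total - aleatoric

# Members agree but are unsure: uncertainty is all aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Members are individually certain but disagree: uncertainty is all epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Averaging the members into a single prediction keeps only the first return value, which is why a distillation target that retains the spread across members is needed to preserve the split.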
 [22] arXiv:2002.11537 [pdf, other]

Title: ICE-BeeM: Identifiable Conditional Energy-Based Deep Models
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Despite the growing popularity of energy-based models, their identifiability properties are not well understood. In this paper we establish sufficient conditions under which a large family of conditional energy-based models is identifiable in function space, up to a simple transformation. Our results build on recent developments in the theory of nonlinear ICA, showing that the latent representations in certain families of deep latent-variable models are identifiable. We extend these results to a very broad family of conditional energy-based models. In this family, the energy function is simply the dot-product between two feature extractors, one for the dependent variable, and one for the conditioning variable. We show that under mild conditions, the features are unique up to scaling and permutation. Second, we propose the framework of independently modulated component analysis (IMCA), a new form of nonlinear ICA where the independence assumption is relaxed. Importantly, we show that our energy-based model can be used for the estimation of the components: the features learned are a simple and often trivial transformation of the latents.
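The dot-product energy described in this abstract can be sketched as follows. The two feature extractors `f` and `g` are hypothetical hand-written functions standing in for neural networks, the sign convention p(x | y) proportional to exp(-energy) is an assumption, and normalizing over a small finite set of candidate values is a simplification made so the example can be run.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def energy(x, y, f, g):
    """Energy as the dot product of features of x and features of y."""
    return dot(f(x), g(y))

def conditional_probs(xs, y, f, g):
    """p(x | y) proportional to exp(-energy(x, y)) over a finite candidate set xs."""
    weights = [math.exp(-energy(x, y, f, g)) for x in xs]
    z = sum(weights)  # normalizing constant for this value of y
    return [w / z for w in weights]

# Hypothetical feature extractors (assumptions, not the paper's networks).
def f(x):
    return [x, x * x]

def g(y):
    return [-y, 1.0]

probs = conditional_probs([-1.0, 0.0, 1.0], y=1.0, f=f, g=g)
```

With these choices the energy at y = 1 is x*x - x, so x = 0 and x = 1 receive equal probability and x = -1 receives less.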
 [23] arXiv:2002.11543 [pdf, ps, other]

Title: Towards new cross-validation-based estimators for Gaussian process regression: efficient adjoint computation of gradients
Authors: Sébastien Petit (L2S, GdR MASCOT-NUM), Julien Bect (L2S, GdR MASCOT-NUM), Sébastien da Veiga (GdR MASCOT-NUM), Paul Feliot, Emmanuel Vazquez (L2S, GdR MASCOT-NUM)
Subjects: Computation (stat.CO); Machine Learning (stat.ML)
We consider the problem of estimating the parameters of the covariance function of a Gaussian process by cross-validation. We suggest using new cross-validation criteria derived from the literature of scoring rules. We also provide an efficient method for computing the gradient of a cross-validation criterion. To the best of our knowledge, our method is more efficient than what has been proposed in the literature so far. It makes it possible to lower the complexity of jointly evaluating leave-one-out criteria and their gradients.
 [24] arXiv:2002.11544 [pdf, other]

Title: The role of regularization in classification of high-dimensional noisy Gaussian mixture
Comments: 8 pages + appendix, 6 figures
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Statistics Theory (math.ST)
We consider a high-dimensional mixture of two Gaussians in the noisy regime where even an oracle knowing the centers of the clusters misclassifies a small but finite fraction of the points. We provide a rigorous analysis of the generalization error of regularized convex classifiers, including ridge, hinge and logistic regression, in the high-dimensional limit where the number $n$ of samples and their dimension $d$ go to infinity while their ratio is fixed to $\alpha= n/d$. We discuss surprising effects of the regularization, which in some cases allows one to reach the Bayes-optimal performance. We also illustrate the interpolation peak at low regularization, and analyze the role of the respective sizes of the two clusters.
 [25] arXiv:2002.11553 [pdf, ps, other]

Title: Aggregated hold out for sparse linear regression with a robust loss function
Authors: Guillaume Maillard (CELESTE, LM-Orsay)
Subjects: Statistics Theory (math.ST)
Sparse linear regression methods generally have a free hyperparameter which controls the amount of sparsity, and is subject to a bias-variance tradeoff. This article considers the use of Aggregated hold-out to aggregate over values of this hyperparameter, in the context of linear regression with the Huber loss function. Aggregated hold-out (Agghoo) is a procedure which averages estimators selected by hold-out (cross-validation with a single split). In the theoretical part of the article, it is proved that Agghoo satisfies a non-asymptotic oracle inequality when it is applied to sparse estimators which are parametrized by their zero-norm. In particular, this includes a variant of the Lasso introduced by Zou, Hastie and Tibshirani. Simulations are used to compare Agghoo with cross-validation. They show that Agghoo performs better than CV when the intrinsic dimension is high and when there are confounders correlated with the predictive covariates.
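The procedure named in this abstract, averaging estimators that were each selected by hold-out on a different split, can be sketched on a toy problem. Everything specific below is an assumption made to keep the example short and self-contained: a one-dimensional shrunken least-squares slope stands in for the sparse estimators studied in the paper, and the candidate grid `lams`, the 80/20 split and the number of splits are arbitrary. The Huber loss is used only to score candidates on the validation part.

```python
import random

def huber(r, delta=1.0):
    """Huber loss of a residual r."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def fit_slope(xs, ys, lam):
    """Toy estimator: shrunken least-squares slope through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def hold_out_select(xs, ys, lams, rng, train_frac=0.8):
    """One hold-out round: fit each candidate on the training part,
    keep the slope with the smallest Huber risk on the validation part."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    train, val = idx[:cut], idx[cut:]
    best = None
    for lam in lams:
        b = fit_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        risk = sum(huber(ys[i] - b * xs[i]) for i in val) / len(val)
        if best is None or risk < best[0]:
            best = (risk, b)
    return best[1]

def agghoo_slope(xs, ys, lams, n_splits=10, seed=0):
    """Average the slopes selected by hold-out over several random splits."""
    rng = random.Random(seed)
    slopes = [hold_out_select(xs, ys, lams, rng) for _ in range(n_splits)]
    return sum(slopes) / len(slopes)

data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + data_rng.gauss(0, 0.1) for x in xs]
slope = agghoo_slope(xs, ys, lams=[0.0, 0.1, 1.0, 10.0])
```

Because the toy predictors are linear in the slope, averaging the selected slopes is the same as averaging the selected predictors, which is the aggregation step the abstract describes.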
 [26] arXiv:2002.11572 [pdf, ps, other]

Title: Revisiting Ensembles in an Adversarial Context: Improving Natural Accuracy
Comments: 5 pages, accepted to ICLR 2020 Workshop on Towards Trustworthy ML: Rethinking Security and Privacy for ML
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
A necessary characteristic for the deployment of deep learning models in real-world applications is resistance to small adversarial perturbations while maintaining accuracy on non-malicious inputs. While robust training provides models that exhibit better adversarial accuracy than standard models, there is still a significant gap in natural accuracy between robust and non-robust models, which we aim to bridge. We consider a number of ensemble methods designed to mitigate this performance difference. Our key insight is that models trained to withstand small attacks, when ensembled, can often withstand significantly larger attacks, and this concept can in turn be leveraged to optimize natural accuracy. We consider two schemes, one that combines predictions from several randomly initialized robust models, and the other that fuses features from robust and standard models.
 [27] arXiv:2002.11577 [pdf, other]

Title: Hierarchical clustering with discrete latent variable models and the integrated classification likelihood
Subjects: Computation (stat.CO)
In this paper, we introduce a two-step methodology to extract a hierarchical clustering. This methodology considers the integrated classification likelihood criterion as an objective function, and applies to any discrete latent variable model (DLVM) where this quantity is tractable. The first step of the methodology involves maximizing the criterion with respect to the states of the discrete latent variables with uninformative priors. To that end we propose a new hybrid algorithm based on greedy local searches as well as a genetic algorithm, which allows the joint inference of the number $K$ of clusters and of the clusters themselves. The second step of the methodology is based on a bottom-up greedy procedure to extract a hierarchy of clusters from this natural partition. In a Bayesian context, this is achieved by considering the Dirichlet cluster proportion prior parameter $\alpha$ as a regularisation term controlling the granularity of the clustering. This second step allows the exploration of the clustering at coarser scales and the ordering of the clusters, an important output for the visual representations of the clustering results. The clustering results obtained with the proposed approach, on simulated as well as real settings, are compared with existing strategies and are shown to be particularly relevant. This work is implemented in the R package greed.
 [28] arXiv:2002.11601 [pdf, other]

Title: Stagewise Enlargement of Batch Size for SGD-based Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can more easily lead to a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between initialization and optimum of the model parameter. Then based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. More specifically, SEBS adopts a multi-stage scheme, and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD which decreases the learning rate by stage, SEBS can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for SGD, momentum SGD and AdaGrad. Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.
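As a rough illustration of the stagewise idea described in this abstract, the following Python sketch builds a geometric batch-size schedule and counts the parameter updates per stage. It is not the authors' implementation: the function names, the growth factor of 2, the initial batch size of 32, and the fixed budget of 4096 samples per stage are all assumptions chosen for the example.

```python
import math


def stagewise_batch_sizes(initial_batch_size, growth_factor, num_stages):
    """Return one batch size per stage, enlarged geometrically.

    All three arguments are hypothetical knobs for this sketch; the
    abstract only states that the batch size grows geometrically by stage.
    """
    if initial_batch_size < 1 or num_stages < 1 or growth_factor < 1:
        raise ValueError("arguments must be positive, growth_factor >= 1")
    sizes = []
    size = initial_batch_size
    for _ in range(num_stages):
        sizes.append(size)
        size *= growth_factor  # geometric enlargement between stages
    return sizes


def updates_per_stage(samples_per_stage, batch_sizes):
    """Number of parameter updates needed to process a fixed number of
    samples in each stage (assumed budget, not taken from the paper)."""
    return [math.ceil(samples_per_stage / b) for b in batch_sizes]


sizes = stagewise_batch_sizes(32, 2, 4)
print(sizes)                            # [32, 64, 128, 256]
print(updates_per_stage(4096, sizes))   # [128, 64, 32, 16]
```

With a fixed sample budget per stage, each doubling of the batch size halves the number of updates in that stage, which is the effect the abstract refers to when it says larger batches yield fewer parameter updates.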
 [29] arXiv:2002.11610 [pdf]

Title: Liquid Scorecards
Authors: Bruce Hoadley
Subjects: Other Statistics (stat.OT); Methodology (stat.ME)
Traditional credit scorecards are generalized additive models (GAMs) with step functions as the component functions. The shapes of the step functions may be constrained in order to satisfy the PILE (Palatability, Interpretability, Legal, Explainability) constraints. Before 2003, FICO used Linear Programming to find the traditional scorecard that approximately maximizes divergence subject to the PILE constraints. In this paper, I introduce the Liquid Scorecard, which allows the component functions to be, at least partially, smooth curves. I use Quadratic Programming and B-Spline theory to find the Liquid Scorecard that exactly maximizes divergence subject to the PILE constraints. FICO uses aspects of this technology to develop the famous FICO Credit Score.
 [30] arXiv:2002.11642 [pdf, ps, other]

Title: Off-Policy Evaluation and Learning for External Validity under a Covariate Shift
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
We consider the evaluation and training of a new policy for the evaluation data by using the historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although the standard OPE and OPL assume the same distribution of covariates between the historical and evaluation data, there often exists a problem of a covariate shift, i.e., the distribution of the covariates of the historical data is different from that of the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. Then, we propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using an estimator of the density ratio between the distributions of the historical and evaluation data. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.
 [31] arXiv:2002.11665 [pdf, ps, other]

Title: Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Discrete Distributions
Comments: 56 pages
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
The profile of a sample is the multiset of its symbol frequencies. We show that for samples of discrete distributions, profile entropy is a fundamental measure unifying the concepts of estimation, inference, and compression. Specifically, profile entropy a) determines the speed of estimating the distribution relative to the best natural estimator; b) characterizes the rate of inferring all symmetric properties compared with the best estimator over any label-invariant distribution collection; c) serves as the limit of profile compression, for which we derive optimal near-linear-time block and sequential algorithms. To further our understanding of profile entropy, we investigate its attributes, provide algorithms for approximating its value, and determine its magnitude for numerous structural distribution families.
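To make the definition in the first sentence of this abstract concrete, here is a small Python sketch that computes the profile of a sample. The function name `profile` and the choice to represent the multiset as a mapping from frequency to the number of symbols with that frequency are assumptions for illustration; the sketch does not compute profile entropy itself.

```python
from collections import Counter


def profile(sample):
    """Return the profile of a sample: the multiset of symbol frequencies.

    The multiset is encoded as {frequency: number of distinct symbols
    that occur exactly that many times}, which discards symbol labels.
    """
    symbol_counts = Counter(sample)                # symbol -> frequency
    multiplicities = Counter(symbol_counts.values())  # frequency -> how many symbols
    return dict(sorted(multiplicities.items()))


# 'a' occurs 5 times, 'b' and 'r' twice each, 'c' and 'd' once each.
print(profile("abracadabra"))   # {1: 2, 2: 2, 5: 1}

# Relabeling the symbols leaves the profile unchanged.
print(profile("abracadabra") == profile("xyzxwxvxyzx"))   # True
```

The second check shows why the profile is the natural statistic for label-invariant (symmetric) properties: it depends only on how often symbols occur, not on which symbols they are.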
Cross-lists for Thu, 27 Feb 20
 [32] arXiv:1811.05375 (cross-list from cs.CY) [pdf, ps, other]

Title: Comparison of Feature Extraction Methods and Predictors for Income Inference
Comments: Argentine Symposium on Big Data (AGRANDA), September 5, 2017
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Patterns of mobile phone communications, coupled with the information of the social network graph and financial behavior, allow us to make inferences of users' socio-economic attributes such as their income level. We present here several methods to extract features from mobile phone usage (calls and messages), and compare different combinations of supervised machine learning techniques and sets of features used as input for the inference of users' income. Our experimental results show that the Bayesian method based on the communication graph outperforms standard machine learning algorithms using node-based features.
 [33] arXiv:1812.01077 (cross-list from cs.SI) [pdf, other]

Title: Brief survey of Mobility Analyses based on Mobile Phone Datasets
Comments: Workshop on Urban Computing and Society. Petropolis, RJ, Brazil. Nov 28, 2018
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
This is a brief survey of the research performed by Grandata Labs in collaboration with numerous academic groups around the world on the topic of human mobility. A driving theme in these projects is to use and improve Data Science techniques to understand mobility, as it can be observed through the lens of mobile phone datasets. We describe applications of mobility analyses for urban planning, prediction of data traffic usage, building delay tolerant networks, generating epidemiologic risk maps and measuring the predictability of human mobility.
 [34] arXiv:2002.10846 (cross-list from math.PR) [pdf, ps, other]

Title: A CLT in Stein's distance for generalized Wishart matrices and higher order tensors
Authors: Dan Mikulincer
Comments: 22 pages
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We study the convergence along the central limit theorem for sums of independent tensor powers, $\frac{1}{\sqrt{d}}\sum\limits_{i=1}^d X_i^{\otimes p}$. We focus on the high-dimensional regime where $X_i \in \mathbb{R}^n$ and $n$ may scale with $d$. Our main result is a proposed threshold for convergence. Specifically, we show that, under some regularity assumption, if $n^{2p-1}\gg d$, then the normalized sum converges to a Gaussian. The results apply, among others, to symmetric uniform log-concave measures and to product measures. This generalizes several results found in the literature.
Our main technique is a novel application of optimal transport to Stein's method which accounts for the low dimensional structure which is inherent in $X_i^{\otimes p}$.
 [35] arXiv:2002.11104 (cross-list from cs.SI) [pdf, ps, other]

Title: An Information Diffusion Approach to Rumor Propagation and Identification on Twitter
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Machine Learning (stat.ML)
With the increasing use of online social networks as a source of news and information, the propensity for a rumor to disseminate widely and quickly poses a great concern, especially in disaster situations where users do not have enough time to fact-check posts before making the informed decision to react to a post that appears to be credible. In this study, we explore the propagation pattern of rumors on Twitter by exploring the dynamics of microscopic-level misinformation spread, based on the latent message and user interaction attributes. We perform supervised learning for feature selection and prediction. Experimental results with real-world data sets give the models' prediction accuracy at about 90% for the diffusion of both True and False topics. Our findings confirm that rumor cascades run deeper and that rumor masked as news, and messages that incite fear, will diffuse faster than other messages. We show that the models for True and False message propagation differ significantly, both in the prediction parameters and in the message features that govern the diffusion. Finally, we show that the diffusion pattern is an important metric in identifying the credibility of a tweet.
 [36] arXiv:2002.11137 (cross-list from cs.LG) [pdf, other]

Title: Dynamic Incentive-aware Learning: Robust Pricing in Contextual Auctions
Comments: Accepted for publication in Operations Research Journal (An earlier version of this paper was accepted to NeurIPS 2019.)
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Motivated by pricing in ad exchange markets, we consider the problem of robust learning of reserve prices against strategic buyers in repeated contextual second-price auctions. Buyers' valuations for an item depend on the context that describes the item. However, the seller is not aware of the relationship between the context and buyers' valuations, i.e., buyers' preferences. The seller's goal is to design a learning policy to set reserve prices via observing the past sales data, and her objective is to minimize her regret for revenue, where the regret is computed against a clairvoyant policy that knows buyers' heterogeneous preferences. Given the seller's goal, utility-maximizing buyers have the incentive to bid untruthfully in order to manipulate the seller's learning policy. We propose learning policies that are robust to such strategic behavior. These policies use the outcomes of the auctions, rather than the submitted bids, to estimate the preferences while controlling the long-term effect of the outcome of each auction on the future reserve prices. When the market noise distribution is known to the seller, we propose a policy called Contextual Robust Pricing (CORP) that achieves a $T$-period regret of $O(d\log(Td) \log (T))$, where $d$ is the dimension of the contextual information. When the market noise distribution is unknown to the seller, we propose two policies whose regrets are sublinear in $T$.
 [37] arXiv:2002.11151 (cross-list from cs.LG) [pdf, other]

Title: TxSim: Modeling Training of Deep Neural Networks on Resistive Crossbar Systems
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Resistive crossbars have attracted significant interest in the design of Deep Neural Network (DNN) accelerators due to their ability to natively execute massively parallel vector-matrix multiplications within dense memory arrays. However, crossbar-based computations face a major challenge due to a variety of device and circuit-level non-idealities, which manifest as errors in the vector-matrix multiplications and eventually degrade DNN accuracy. To address this challenge, there is a need for tools that can model the functional impact of non-idealities on DNN training and inference. Existing efforts towards this goal are either limited to inference, or are too slow to be used for large-scale DNN training. We propose TxSim, a fast and customizable modeling framework to functionally evaluate DNN training on crossbar-based hardware considering the impact of non-idealities. The key features of TxSim that differentiate it from prior efforts are: (i) it comprehensively models non-idealities during all training operations (forward propagation, backward propagation, and weight update) and (ii) it achieves computational efficiency by mapping crossbar evaluations to well-optimized BLAS routines and incorporates speedup techniques to further reduce simulation time with minimal impact on accuracy. TxSim achieves orders-of-magnitude improvement in simulation speed over prior works, and thereby makes it feasible to evaluate training of large-scale DNNs on crossbars. Our experiments using TxSim reveal that the accuracy degradation in DNN training due to non-idealities can be substantial (3%-10%) for large-scale DNNs, underscoring the need for further research in mitigation techniques. We also analyze the impact of various device and circuit-level parameters and the associated non-idealities to provide key insights that can guide the design of crossbar-based DNN training accelerators.
 [38] arXiv:2002.11167 (cross-list from physics.ao-ph) [pdf, other]

Title: Data-driven super-parameterization using deep learning: Experimentation with multi-scale Lorenz 96 systems and transfer-learning
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph); Machine Learning (stat.ML)
To make weather/climate modeling computationally affordable, small-scale processes are usually represented in terms of the large-scale, explicitly-resolved processes using physics-based or semi-empirical parameterization schemes. Another approach, computationally more demanding but often more accurate, is super-parameterization (SP), which involves integrating the equations of small-scale processes on high-resolution grids embedded within the low-resolution grids of large-scale processes. Recently, studies have used machine learning (ML) to develop data-driven parameterization (DDP) schemes. Here, we propose a new approach, data-driven SP (DDSP), in which the equations of the small-scale processes are integrated data-drivenly using ML methods such as recurrent neural networks. Employing multi-scale Lorenz 96 systems as testbed, we compare the cost and accuracy (in terms of both short-term prediction and long-term statistics) of parameterized low-resolution (LR), SP, DDP, and DDSP models. We show that with the same computational cost, DDSP substantially outperforms LR, and is better than DDP, particularly when scale separation is lacking. DDSP is much cheaper than SP, yet its accuracy is the same in reproducing long-term statistics and often comparable in short-term forecasting. We also investigate generalization, finding that when models trained on data from one system are applied to a system with different forcing (e.g., more chaotic), the models often do not generalize, particularly when the short-term prediction accuracy is examined. But we show that transfer-learning, which involves re-training the data-driven model with a small amount of data from the new system, significantly improves generalization. Potential applications of DDSP and transfer-learning in climate/weather modeling and the expected challenges are discussed.
 [39] arXiv:2002.11172 (cross-list from cs.LG) [pdf, other]

Title: A Sample Complexity Separation between Non-Convex and Convex Meta-Learning
Comments: 34 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for convex models. This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black-box, specifically at properties of the optimization trajectory. We construct a simple meta-learning instance that captures the problem of one-dimensional subspace learning. For the convex formulation of linear regression on this instance, we show that the new task sample complexity of any initialization-based meta-learning algorithm is $\Omega(d)$, where $d$ is the input dimension. In contrast, for the non-convex formulation of a two-layer linear network on the same instance, we show that both Reptile and multi-task representation learning can have new task sample complexity of $\mathcal{O}(1)$, demonstrating a separation from convex meta-learning. Crucially, analyses of the training dynamics of these methods reveal that they can meta-learn the correct subspace onto which the data should be projected.
 [40] arXiv:2002.11184 (cross-list from q-bio.PE) [pdf, other]

Title: The Moran Genealogy Process
Subjects: Populations and Evolution (q-bio.PE); Probability (math.PR); Applications (stat.AP)
We give a novel representation of the Moran Genealogy Process, a continuous-time Markov process on the space of size-$n$ genealogies with the demography of the classical Moran process. We derive the generator and unique stationary distribution of the process and establish its uniform ergodicity. In particular, we show that any initial distribution converges exponentially to the probability measure identical to that of the Kingman coalescent. We go on to show that one-time sampling projects this stationary distribution onto a smaller-size version of itself. Next, we extend the Moran genealogy process to include sampling through time. This allows us to define the Sampled Moran Genealogy Process, another Markov process on the space of genealogies. We derive exact conditional and unconditional probability distributions for this process under the assumption of stationarity, and an expression for the likelihood of any sequence of genealogies it generates. This leads to some interesting observations pertinent to existing phylodynamic methods in the literature.
 [41] arXiv:2002.11187 (cross-list from cs.LG) [pdf, other]

Title: Reliable Estimation of Kullback-Leibler Divergence by Controlling Discriminator Complexity in the Reproducing Kernel Hilbert Space
Comments: 10 pages, 3 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Several scalable methods to compute the Kullback-Leibler (KL) divergence between two distributions using their samples have been proposed and applied in large-scale machine learning models. While they have been found to be unstable, the theoretical root cause of the problem is not clear. In this paper, we study in detail a generative adversarial network based approach that uses a neural network discriminator to estimate KL divergence. We argue that, in such a case, high fluctuations in the estimates are a consequence of not controlling the complexity of the discriminator function space. We provide a theoretical underpinning and remedy for this problem through the following contributions. First, we construct a discriminator in the Reproducing Kernel Hilbert Space (RKHS). This enables us to leverage sample complexity and mean embedding to theoretically relate the error probability bound of the KL estimates to the complexity of the neural-net discriminator. Based on this theory, we then present a scalable way to control the complexity of the discriminator for a consistent estimation of KL divergence. We support both our proposed theory and method to control the complexity of the RKHS discriminator in controlled experiments.
 [42] arXiv:2002.11192 (cross-list from q-bio.NC) [pdf, other]

Title: End-to-End Models for the Analysis of System 1 and System 2 Interactions based on Eye-Tracking Data
Authors: Alessandro Rossi, Sara Ermini, Dario Bernabini, Dario Zanca, Marino Todisco, Alessandro Genovese, Antonio Rizzo
Comments: 11 pages, 2 figures, 1 table
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
While theories postulating a dual cognitive system take hold, quantitative confirmations are still needed to understand and identify interactions between the two systems or conflict events. Eye movements are among the most direct markers of the individual attentive load and may serve as an important proxy of information. In this work we propose a computational method, within a modified visual version of the well-known Stroop test, for the identification of different tasks and potential conflict events between the two systems through the collection and processing of data related to eye movements. A statistical analysis shows that the selected variables can characterize the variation of attentive load within different scenarios. Moreover, we show that Machine Learning techniques make it possible to distinguish between different tasks with a good classification accuracy and to investigate the gaze dynamics in more depth.
 [43] arXiv:2002.11215 (cross-list from cs.LG) [pdf, other]

Title: EmbPred30: Assessing 30-days Readmission for Diabetic Patients using Categorical Embeddings
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Hospital readmission is a crucial healthcare quality measure that helps in determining the level of quality of care that a hospital offers to a patient and has proven to be immensely expensive. It is estimated that more than $25 billion is spent yearly due to readmission of diabetic patients in the USA. This paper benchmarks existing models and proposes a new embedding-based state-of-the-art deep neural network (DNN). The model can identify whether a hospitalized diabetic patient will be readmitted within 30 days or not with an accuracy of 95.2% and Area Under the Receiver Operating Characteristics (AUROC) of 97.4% on data collected from 130 US hospitals between 1999 and 2008. The results are encouraging, with patients having changes in medication while admitted having a high chance of getting readmitted. Identifying prospective patients for readmission could help the hospital systems in improving their inpatient care, thereby saving them from unnecessary expenditures.
 [44] arXiv:2002.11219 (cross-list from cs.LG) [pdf, ps, other]

Title: Convex Geometry and Duality of Over-parameterized Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We develop a convex analytic framework for ReLU neural networks which elucidates the inner workings of hidden neurons and their function space characteristics. We show that neural networks with rectified linear units act as convex regularizers, where simple solutions are encouraged via extreme points of a certain convex set. For one dimensional regression and classification, as well as rank-one data matrices, we prove that finite two-layer ReLU networks with norm regularization yield linear spline interpolation. We characterize the classification decision regions in terms of a closed form kernel matrix and minimum L1 norm solutions. This is in contrast to Neural Tangent Kernel which is unable to explain neural network predictions with finitely many neurons. Our convex geometric description also provides intuitive explanations of hidden neurons as auto-encoders. In higher dimensions, we show that the training problem for two-layer networks can be cast as a convex optimization problem with infinitely many constraints. We then provide a family of convex relaxations to approximate the solution, and a cutting-plane algorithm to improve the relaxations. We derive conditions for the exactness of the relaxations and provide simple closed form formulas for the optimal neural network weights in certain cases. We also establish a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing. Extensive experimental results show that the proposed approach yields interpretable and accurate models.
 [45] arXiv:2002.11226 (cross-list from cs.LG) [pdf, other]

Title: Deep Learning and Statistical Models for Time-Critical Pedestrian Behaviour Prediction
Journal-ref: In: Gedeon T., Wong K., Lee M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1142. Springer, Cham
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
The time it takes for a classifier to make an accurate prediction can be crucial in many behaviour recognition problems. For example, an autonomous vehicle should detect hazardous pedestrian behaviour early enough for it to take appropriate measures. In this context, we compare the switching linear dynamical system (SLDS) and a three-layered bi-directional long short-term memory (LSTM) neural network, which are applied to infer pedestrian behaviour from motion tracks. We show that, though the neural network model achieves an accuracy of 80%, it requires long sequences to achieve this (100 samples or more). The SLDS has a lower accuracy of 74%, but it achieves this result with short sequences (10 samples). To our knowledge, such a comparison on sequence length has not been considered in the literature before. The results provide a key intuition of the suitability of the models in time-critical problems.
 [46] arXiv:2002.11246 (cross-list from cs.LG) [pdf, other]

Title: Supervised Categorical Metric Learning with Schatten p-Norms
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Metric learning has been successful in learning new metrics adapted to numerical datasets. However, its development on categorical data still needs further exploration. In this paper, we propose a method, called CPML for categorical projected metric learning, that tries to efficiently (i.e. less computational time and better prediction accuracy) address the problem of metric learning in categorical data. We make use of the Value Distance Metric to represent our data and propose new distances based on this representation. We then show how to efficiently learn new metrics. We also generalize several previous regularizers through the Schatten $p$-norm and provide a generalization bound for it that complements the standard generalization bound for metric learning. Experimental results show that our method provides
 [47] arXiv:2002.11304 (cross-list from cs.LG) [pdf, other]

Title: PaDGAN: A Generative Adversarial Network for Performance Augmented Diverse Designs
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Deep generative models are proven to be a useful tool for automatic design synthesis and design space exploration. When applied in engineering design, existing generative models face two challenges: 1) generated designs lack diversity and do not cover all areas of the design space and 2) it is difficult to explicitly improve the overall performance or quality of generated designs without excluding low-quality designs from the dataset, which may impair the performance of the trained model due to reduced training sample size. In this paper, we simultaneously address these challenges by proposing a new Determinantal Point Processes based loss function for probabilistic modeling of diversity and quality. With this new loss function, we develop a variant of the Generative Adversarial Network, named "Performance Augmented Diverse Generative Adversarial Network" or PaDGAN, which can generate novel high-quality designs with good coverage of the design space. We demonstrate that PaDGAN can generate diverse and high-quality designs on both synthetic and real-world examples and compare PaDGAN against other models such as the vanilla GAN and the BezierGAN. Unlike typical generative models that usually generate new designs by interpolating within the boundary of training data, we show that PaDGAN expands the design space boundary towards high-quality regions. The proposed method is broadly applicable to many tasks including design space exploration, design optimization, and creative solution recommendation.
 [48] arXiv:2002.11318 (cross-list from cs.LG) [pdf, other]

Title: Invariance vs. Robustness of Neural Networks
Comments: Preliminary version presented in ICML 2018 Workshop on "Towards learning with limited labels: Equivariance, Invariance, and Beyond" as "Understanding Adversarial Robustness of Symmetric Networks"
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
We study the performance of neural network models on random geometric transformations and adversarial perturbations. Invariance means that the model's prediction remains unchanged when a geometric transformation is applied to an input. Adversarial robustness means that the model's prediction remains unchanged after small adversarial perturbations of an input. In this paper, we show a quantitative trade-off between rotation invariance and robustness. We empirically study the following two cases: (a) change in adversarial robustness as we improve only the invariance of equivariant models via training augmentation, (b) change in invariance as we improve only the adversarial robustness using adversarial training. We observe that the rotation invariance of equivariant models (StdCNNs and GCNNs) improves by training augmentation with progressively larger random rotations but while doing so, their adversarial robustness drops progressively, and very significantly on MNIST. We take adversarially trained LeNet and ResNet models which have good $L_\infty$ adversarial robustness on MNIST and CIFAR-10, respectively, and observe that adversarial training with progressively larger perturbations results in a progressive drop in their rotation invariance profiles. Similar to the trade-off between accuracy and robustness known in previous work, we give a theoretical justification for the invariance vs. robustness trade-off observed in our experiments.
 [49] arXiv:2002.11323 (cross-list from cs.LG) [pdf, other]

Title: Convergence to Second-Order Stationarity for Non-negative Matrix Factorization: Provably and Concurrently
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Non-negative matrix factorization (NMF) is a fundamental non-convex optimization problem with numerous applications in Machine Learning (music analysis, document clustering, speech-source separation, etc.). Despite having received extensive study, it is poorly understood whether or not there exist natural algorithms that can provably converge to a local minimum. Part of the reason is that the objective is heavily symmetric and its gradient is not Lipschitz. In this paper we define a multiplicative weight update type dynamics (modification of the seminal Lee-Seung algorithm) that runs concurrently and provably avoids saddle points (first order stationary points that are not second order). Our techniques combine tools from dynamical systems such as stability and exploit the geometry of the NMF objective by reducing the standard NMF formulation over the non-negative orthant to a new formulation over (a scaled) simplex. An important advantage of our method is the use of concurrent updates, which permits implementations in parallel computing environments.
 [50] arXiv:2002.11328 (cross-list from cs.LG) [pdf, other]

Title: Rethinking Bias-Variance Trade-off for Generalization of Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The classical bias-variance trade-off predicts that bias decreases and variance increases with model complexity, leading to a U-shaped risk curve. Recent work calls this into question for neural networks and other over-parameterized models, for which it is often observed that larger models generalize better. We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is unimodal or bell-shaped: it increases then decreases with the width of the network. We vary the network architecture, loss function, and choice of dataset and confirm that variance unimodality occurs robustly for all models we considered. The risk curve is the sum of the bias and variance curves and displays different qualitative shapes depending on the relative scale of bias and variance, with the double descent curve observed in recent literature as a special case. We corroborate these empirical results with a theoretical analysis of two-layer linear networks with random first layer. Finally, evaluation on out-of-distribution data shows that most of the drop in accuracy comes from increased bias while variance increases by a relatively small amount. Moreover, we find that deeper models decrease bias and increase variance for both in-distribution and out-of-distribution data.
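The statement that the risk is the sum of the bias and variance terms can be checked numerically for squared loss at a single test point. The sketch below is only an illustration under assumptions: the four predictions stand in for models trained on different hypothetical training sets, the target is taken as noiseless, and none of the numbers come from the paper.

```python
import math
from statistics import fmean


def decompose_squared_risk(predictions, target):
    """Return (risk, squared_bias, variance) for squared loss at one point.

    predictions: outputs of models trained on different training sets
                 (made-up values in this example).
    target:      the noiseless ground-truth value at the test point.
    """
    mean_prediction = fmean(predictions)
    risk = fmean((p - target) ** 2 for p in predictions)
    squared_bias = (mean_prediction - target) ** 2
    # Population variance of the predictions around their own mean.
    variance = fmean((p - mean_prediction) ** 2 for p in predictions)
    return risk, squared_bias, variance


risk, squared_bias, variance = decompose_squared_risk([0.8, 1.1, 1.4, 0.7], 1.2)
# The identity risk = bias^2 + variance holds up to floating-point error.
print(math.isclose(risk, squared_bias + variance))   # True
```

In this toy case the squared bias is about 0.04 and the variance about 0.075, so the variance dominates the risk; the abstract's point is that which term dominates changes with network width.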
 [51] arXiv:2002.11332 (cross-list from cs.LG) [pdf, ps, other]

Title: Structured Linear Contextual Bandits: A Sharp and Geometric Smoothed Analysis
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Bandit learning algorithms typically involve the balance of exploration and exploitation. However, in many practical applications, worst-case scenarios needing systematic exploration are seldom encountered. In this work, we consider a smoothed setting for structured linear contextual bandits where the adversarial contexts are perturbed by Gaussian noise and the unknown parameter $\theta^*$ has structure, e.g., sparsity, group sparsity, low rank, etc. We propose simple greedy algorithms for both the single- and multi-parameter (i.e., different parameter for each context) settings and provide a unified regret analysis for $\theta^*$ with any assumed structure. The regret bounds are expressed in terms of geometric quantities such as Gaussian widths associated with the structure of $\theta^*$. We also obtain sharper regret bounds compared to earlier work for the unstructured $\theta^*$ setting as a consequence of our improved analysis. We show there is implicit exploration in the smoothed setting where a simple greedy algorithm works.
 [52] arXiv:2002.11361 (cross-list from cs.LG) [pdf, other]

Title: Understanding Self-Training for Gradual Domain Adaptation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Machine learning systems must adapt to data distributions that evolve over time, in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces. We consider gradual domain adaptation, where the goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. We prove the first non-vacuous upper bound on the error of self-training with gradual shifts, under settings where directly adapting to the target domain can result in unbounded error. The theoretical analysis leads to algorithmic insights, highlighting that regularization and label sharpening are essential even when we have infinite data, and suggesting that self-training works particularly well for shifts with small Wasserstein-infinity distance. Leveraging the gradual shift structure leads to higher accuracies on a rotating MNIST dataset and a realistic Portraits dataset.
 [53] arXiv:2002.11369 (cross-list from cs.LG) [pdf, other]

Title: Lipschitz standardization for robust multivariate learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Current trends in machine learning rely on out-of-the-box gradient-based approaches. With the aim of mitigating numerical errors and improving the convergence of the learning process, a common empirical practice is to standardize or normalize the data. However, there is a lack of theoretical analysis regarding why and when these methods result in an improvement of the learning process. In this work, we first study these methods in the context of black-box variational inference, specifically analyzing the effect that scaling the data has on the smoothness of the optimization landscape. Our analysis shows that no general rule determines which of the existing data scaling methods, if any, will improve the learning process. Second, we highlight the issues that arise when dealing with multivariate data, due to the discrepancy in smoothness of the likelihood functions for different variables, and the inability to scale discrete data. Finally, we propose a novel Lipschitz standardization, and its extension for discrete data, which overcomes the aforementioned limitations. Specifically, as backed by our experiments, Lipschitz standardization i) favors a fairer learning across different variables in the data; and ii) results in faster and more accurate learning.
 [54] arXiv:2002.11410 (cross-list from math.OC) [pdf, other]

Title: Efficient algorithms for multivariate shape-constrained convex regression problems
Subjects: Optimization and Control (math.OC); Machine Learning (stat.ML)
The shape-constrained convex regression problem deals with fitting a convex function to the observed data, where additional constraints are imposed, such as component-wise monotonicity and uniform Lipschitz continuity. This paper provides a comprehensive mechanism for computing the least squares estimator of a multivariate shape-constrained convex regression function in $\mathbb{R}^d$. We prove that the least squares estimator is computable via solving a constrained convex quadratic programming (QP) problem with $(n+1)d$ variables and at least $n(n-1)$ linear inequality constraints, where $n$ is the number of data points. For solving the generally very large-scale convex QP, we design two efficient algorithms: one is the symmetric Gauss-Seidel based alternating direction method of multipliers ({\tt sGS-ADMM}), and the other is the proximal augmented Lagrangian method ({\tt pALM}) with the subproblems solved by the semismooth Newton method ({\tt SSN}). Comprehensive numerical experiments, including those in the pricing of basket options and estimation of production functions in economics, demonstrate that both of our proposed algorithms outperform the state-of-the-art algorithm. The {\tt pALM} is more efficient than the {\tt sGS-ADMM} but the latter has the advantage of being simpler to implement.
 [55] arXiv:2002.11416 (cross-list from eess.SP) [pdf, ps, other]

Title: Analytical Equations based Prediction Approach for PM2.5 using Artificial Neural Network
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Particulate matter pollution is one of the deadliest types of air pollution worldwide due to its significant impacts on the global environment and human health. Particulate Matter (PM2.5) is one of the important particulate pollutants used to measure the Air Quality Index (AQI). The conventional instruments used by air quality monitoring stations to monitor PM2.5 are costly, bulky, time-consuming, and power-hungry. Furthermore, due to limited data availability and non-scalability, these stations cannot provide high spatial and temporal resolution in real time. To overcome the disadvantages of the existing methodology, this article presents an analytical-equations-based prediction approach for PM2.5 using an Artificial Neural Network (ANN). Because the derived analytical equations for the prediction can be computed using a Wireless Sensor Node (WSN) or a low-cost processing tool, the proposed approach is practical to deploy. Moreover, a study of the correlation between PM2.5 and other pollutants is performed to select the appropriate predictors. A large authenticated data set from a Central Pollution Control Board (CPCB) online station in India is used for the proposed approach. The RMSE and coefficient of determination (R2) obtained for the proposed prediction approach using eight predictors are 1.7973 ug/m3 and 0.9986 respectively. With three predictors, the proposed approach yields an RMSE of 7.5372 ug/m3 and an R2 of 0.9708. Therefore, the results demonstrate that the proposed approach is a promising one for monitoring PM2.5 without power-hungry gas sensors and bulky analyzers.
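The two reported metrics, RMSE and the coefficient of determination (R2), can be computed as in the sketch below. It uses the usual textbook definitions; the function names and the toy numbers are illustrative assumptions and are unrelated to the CPCB data.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy values only.
obs = [1.0, 2.0, 3.0]
pred = [1.0, 2.0, 4.0]
print(round(rmse(obs, pred), 4), r_squared(obs, pred))  # 0.5774 0.5
```

RMSE carries the unit of the target (ug/m3 in the abstract), whereas R2 is unitless, which is why the two are usually reported together.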
 [56] arXiv:2002.11423 (cross-list from cs.LG) [pdf, other]

Title: NeuralSens: Sensitivity Analysis of Neural Networks
Comments: 28 pages, 12 figures, submitted to Journal of Statistical Software (JSS) this https URL
Subjects: Machine Learning (cs.LG); Mathematical Software (cs.MS); Machine Learning (stat.ML)
Neural networks are important tools for data-intensive analysis and are commonly applied to model nonlinear relationships between dependent and independent variables. However, neural networks are usually seen as "black boxes" that offer minimal information about how the input variables are used to predict the response in a fitted model. This article describes the \pkg{NeuralSens} package that can be used to perform sensitivity analysis of neural networks using the partial derivatives method. Functions in the package can be used to obtain the sensitivities of the output with respect to the input variables, evaluate variable importance based on sensitivity measures and characterize relationships between input and output variables. Methods to calculate sensitivities are provided for objects from common neural network packages in \proglang{R}, including \pkg{neuralnet}, \pkg{nnet}, \pkg{RSNNS}, \pkg{h2o}, \pkg{neural}, \pkg{forecast} and \pkg{caret}. The article presents an overview of the techniques for obtaining information from neural network models, a theoretical foundation of how the partial derivatives of the output with respect to the inputs of a multilayer perceptron model are calculated, a description of the package structure and functions, and applied examples to compare \pkg{NeuralSens} functions with analogous functions from other available \proglang{R} packages.
 [57] arXiv:2002.11429 (cross-list from cs.LG) [pdf, other]

Title: PHS: A Toolbox for Parellel Hyperparameter Search
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We introduce an open-source Python framework named PHS (Parallel Hyperparameter Search) to enable hyperparameter optimization of any arbitrary Python function on numerous compute instances. This is achieved with minimal modifications inside the target function. Possible applications appear in expensive-to-evaluate numerical computations that strongly depend on hyperparameters, such as machine learning. Bayesian optimization is chosen as a sample-efficient method to propose the next query set of parameters.
 [58] arXiv:2002.11431 (cross-list from cs.CY) [pdf, ps, other]

Title: Simpler handling of clinical concepts in R with clinconcept
Authors: Robert C. Free
Subjects: Computers and Society (cs.CY); Applications (stat.AP)
Routinely collected data in electronic healthcare records are often underpinned by clinical concept dictionaries. Increasingly, data sets from these sources are being made available and used for research purposes, but without additional tooling it can be difficult to work effectively with these dictionaries due to their design, size and complex nature. In an effort to improve this situation, the clinconcept package was created to provide a straightforward way for researchers to build, manage and interrogate databases containing commonly used clinical concept dictionaries. This article describes the rationale behind the package, how to install and use it, and how it can be extended to support other data sources.
 [59] arXiv:2002.11436 (cross-list from cs.LG) [pdf, ps, other]

Title: Nonlinear classifiers for ranking problems based on kernelized SVM
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Many classification problems focus on maximizing the performance only on the samples with the highest relevance instead of all samples. As an example, we can mention ranking problems, accuracy at the top or search engines where only the top few queries matter. In our previous work, we derived a general framework including several classes of these linear classification problems. In this paper, we extend the framework to nonlinear classifiers. Utilizing a similarity to SVM, we dualize the problems, add kernels and propose a componentwise dual ascent method. This allows us to perform one iteration in less than 20 milliseconds on relatively large datasets such as FashionMNIST.
 [60] arXiv:2002.11440 (cross-list from cs.LG) [pdf, other]

Title: Non-Asymptotic Bounds for Zeroth-Order Stochastic Optimization
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We consider the problem of optimizing an objective function with and without convexity in a simulation-optimization context, where only stochastic zeroth-order information is available. We consider two techniques for estimating gradient/Hessian, namely simultaneous perturbation (SP) and Gaussian smoothing (GS). We introduce an optimization oracle to capture a setting where the function measurements have an estimation error that can be controlled. Our oracle is appealing in several practical contexts where the objective has to be estimated from i.i.d. samples, and increasing the number of samples reduces the estimation error. In the stochastic non-convex optimization context, we analyze the zeroth-order variant of the randomized stochastic gradient (RSG) and quasi-Newton (RSQN) algorithms with a biased gradient/Hessian oracle, and with its variant involving an estimation error component. In particular, we provide non-asymptotic bounds on the performance of both algorithms, and our results provide a guideline for choosing the batch size for estimation, so that the overall error bound matches with the one obtained when there is no estimation error. Next, in the stochastic convex optimization setting, we provide non-asymptotic bounds that hold in expectation for the last iterate of a stochastic gradient descent (SGD) algorithm, and our bound for the GS variant of SGD matches the bound for SGD with unbiased gradient information. We perform simulation experiments on synthetic as well as real-world datasets, and the empirical results validate the theoretical findings.
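As a rough illustration of Gaussian smoothing (GS), the sketch below estimates a gradient from function values only by averaging forward differences along random Gaussian directions. It assumes noise-free function evaluations, unlike the controlled-error oracle described in the abstract; the function name `gs_gradient` and its parameters `mu` and `num_samples` are hypothetical.

```python
import random


def gs_gradient(f, x, mu=1e-3, num_samples=1000, seed=0):
    """Zeroth-order gradient estimate of f at x via Gaussian smoothing.

    Averages ((f(x + mu * u) - f(x)) / mu) * u over directions u ~ N(0, I).
    """
    rng = random.Random(seed)
    dim = len(x)
    fx = f(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        scale = (f(shifted) - fx) / mu
        for i in range(dim):
            grad[i] += scale * u[i]
    return [g / num_samples for g in grad]


def sphere(x):
    """Toy objective with known gradient 2 * x."""
    return sum(xi * xi for xi in x)


estimate = gs_gradient(sphere, [1.0, -2.0], num_samples=20000)
# The estimate should lie close to the true gradient [2.0, -4.0].
```

The smoothing radius `mu` trades bias against numerical cancellation, and the number of sampled directions controls the variance, which mirrors the batch-size guidance discussed in the abstract.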
 [61] arXiv:2002.11442 (cross-list from cs.LG) [pdf, other]

Title: DeBayes: a Bayesian method for debiasing network embeddings
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
As machine learning algorithms are increasingly deployed for high-impact automated decision making, ethical and increasingly also legal standards demand that they treat all individuals fairly, without discrimination based on their age, gender, race or other sensitive traits. In recent years much progress has been made on ensuring fairness and reducing bias in standard machine learning settings. Yet, for network embedding, with applications in vulnerable domains ranging from social network analysis to recommender systems, current options remain limited both in number and performance. We thus propose DeBayes: a conceptually elegant Bayesian method that is capable of learning debiased embeddings by using a biased prior. Our experiments show that these representations can then be used to perform link prediction that is significantly fairer in terms of popular metrics such as demographic parity and equalized opportunity.
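Demographic parity, one of the metrics named above, compares the rate of positive predictions across groups. The sketch below computes the largest gap in that rate for binary predictions; how groups are defined for link prediction (for example, by the sensitive attributes of the two endpoints) is not specified in the abstract, so the flat `groups` list and the function name are assumptions for illustration.

```python
def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups.

    predictions: 0/1 decisions; groups: group label for each decision.
    A gap of 0 means every group receives positive predictions at the same rate.
    """
    rates = {}
    for label in set(groups):
        members = [p for p, g in zip(predictions, groups) if g == label]
        rates[label] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values())


gap = demographic_parity_gap([1, 1, 0, 0, 1, 0], ["a", "a", "a", "b", "b", "b"])
# Group "a" has rate 2/3 and group "b" has rate 1/3, so the gap is 1/3.
```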
 [62] arXiv:2002.11477 (cross-list from cs.CV) [pdf, other]

Title: Learning a Directional Soft Lane Affordance Model for Road Scenes Using Self-Supervision
Comments: Submitted to IEEE IV 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Humans navigate complex environments in an organized yet flexible manner, adapting to the context and implicit social rules. Understanding these naturally learned patterns of behavior is essential for applications such as autonomous vehicles. However, algorithmically defining these implicit rules of human behavior remains difficult. This work proposes a novel self-supervised method for training a probabilistic network model to estimate the regions humans are most likely to drive in as well as a multimodal representation of the inferred direction of travel at each point. The model is trained on individual human trajectories conditioned on a representation of the driving environment. The model is shown to successfully generalize to new road scenes, demonstrating potential for real-world application as a prior for socially acceptable driving behavior in challenging or ambiguous scenarios which are poorly handled by explicit traffic rules.
 [63] arXiv:2002.11498 (cross-list from eess.SP) [pdf, ps, other]

Title: Multifrequency calibration for DOA estimation with distributed sensors
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
In this work, we investigate direction finding in the presence of sensor gain uncertainties and directional perturbations for sensor array processing in a multifrequency scenario. Specifically, we adopt a distributed optimization scheme in which coherence models are incorporated and local agents exchange information only between connected nodes in the network, i.e., without a fusion center. Numerical simulations highlight the advantages of the proposed parallel iterative technique in terms of statistical and computational efficiency.
 [64] arXiv:2002.11501 (cross-list from cs.LG) [pdf, other]

Title: Dual Graph Representation Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Graph representation learning embeds nodes in large graphs as low-dimensional vectors and is of great benefit to many downstream applications. Most embedding frameworks, however, are inherently transductive and unable to generalize to unseen nodes or learn representations across different graphs. Although inductive approaches can generalize to unseen nodes, they neglect different contexts of nodes and cannot learn node embeddings dually. In this paper, we present a context-aware unsupervised dual encoding framework, \textbf{CADE}, to generate representations of nodes by combining real-time neighborhoods with neighbor-attentioned representation, and preserving extra memory of known nodes. We show that our approach is effective by comparing it to state-of-the-art methods.
 [65] arXiv:2002.11505 (cross-list from cs.DC) [pdf, other]

Title: Relaxed Scheduling for Scalable Belief Propagation
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning. Consequently, there has been considerable effort invested into developing efficient parallel variants of classic machine learning algorithms. However, despite the wealth of knowledge on parallelization, some classic machine learning algorithms often prove hard to parallelize efficiently while maintaining convergence. In this paper, we focus on efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular on the fundamental belief propagation algorithm. We address the challenge of efficiently parallelizing this classic paradigm by showing how to leverage scalable relaxed schedulers in this context. We present an extensive empirical study, showing that our approach outperforms previous parallel belief propagation implementations both in terms of scalability and in terms of wall-clock convergence time, on a range of practical applications.
 [66] arXiv:2002.11519 (cross-list from cs.LG) [pdf, ps, other]

Title: Decidability of Sample Complexity of PAC Learning in finite setting
Authors: Alberto Gandolfi
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
In this short note we observe that the sample complexity of PAC machine learning of various concepts, including learning the maximum (EMX), can be exactly determined when the support of the probability measures considered as models satisfies an a priori bound. This result contrasts with the recently discovered undecidability of EMX within ZFC for finitely supported probabilities (with no a priori bound). Unfortunately, the decision procedure is, at present, at least doubly exponential in the number of points times the uniform bound on the support size.
 [67] arXiv:2002.11545 (cross-list from cs.LG) [pdf, other]

Title: A Survey towards Federated Semisupervised Learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The success of Artificial Intelligence (AI) should be largely attributed to the accessibility of abundant data. However, this is not exactly the case in reality, where it is common for developers in industry to face insufficient, incomplete and isolated data. Consequently, federated learning was proposed to alleviate such challenges by allowing multiple parties to collaboratively build machine learning models without explicitly sharing their data while preserving data privacy. However, existing federated learning algorithms mainly focus on cases where either the data do not require explicit labeling or all data are labeled. Yet in reality, we are often confronted with the case that labeling data is itself costly and there is no sufficient supply of labeled data. While such issues are commonly solved by semisupervised learning, to the best of our knowledge, no existing effort has been devoted to federated semisupervised learning. In this survey, we briefly summarize prevalent semisupervised algorithms and give a brief outlook on federated semisupervised learning, including possible methodologies, settings and challenges.
 [68] arXiv:2002.11565 (cross-list from cs.LG) [pdf, other]

Title: Randomization matters. How to defend against strong adversarial attacks
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Is there a classifier that ensures optimal robustness against all adversarial attacks? This paper answers this question by adopting a game-theoretic point of view. We show that adversarial attacks and defenses form an infinite zero-sum game where classical results (e.g. Sion theorem) do not apply. We demonstrate the non-existence of a Nash equilibrium in our game when the classifier and the Adversary are both deterministic, hence giving a negative answer to the above question in the deterministic regime. Nonetheless, the question remains open in the randomized regime. We tackle this problem by showing that, under mild conditions on the dataset distribution, any deterministic classifier can be outperformed by a randomized one. This gives arguments for using randomization, and leads us to a new algorithm for building randomized classifiers that are robust to strong adversarial attacks. Empirical results validate our theoretical analysis, and show that our defense method considerably outperforms Adversarial Training against state-of-the-art attacks.
 [69] arXiv:2002.11569 (cross-list from cs.LG) [pdf, other]

Title: Overfitting in adversarially robust deep learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
It is common practice in deep learning to use overparameterized networks and train for as long as possible; there are numerous studies that show, both theoretically and empirically, that such practices surprisingly do not unduly harm the generalization performance of the classifier. In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. We find that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and ImageNet) and perturbation models ($\ell_\infty$ and $\ell_2$). Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting. Finally, we study several classical and modern deep learning remedies for overfitting, including regularization and data augmentation, and find that no approach in isolation improves significantly upon the gains achieved by early stopping. All code for reproducing the experiments as well as pretrained model weights and training logs can be found at https://github.com/locuslab/robust_overfitting.
 [70] arXiv:2002.11570 (cross-list from eess.SY) [pdf, other]

Title: Calculations of System Adequacy Considering Heat Transition Pathways
Comments: Submitted to PMAPS 2020
Subjects: Systems and Control (eess.SY); Applications (stat.AP)
The decarbonisation of heat in developed economies represents a significant challenge, with increased penetration of electrical heating technologies potentially leading to unprecedented increases in peak electricity demand. This work considers a method to evaluate the impact of rapid electrification of heat by utilising historic gas demand data. The work is intended to provide a data-driven complement to popular generative heat demand models, with a particular aim of informing regulators and actors in capacity markets as to how policy changes could impact on medium-term system adequacy metrics (up to five years ahead). Results from a GB case study show that the representation of heat demand using scaled gas demand profiles more than doubles the rate at which 1-in-20 system peaks grow, when compared to the use of scaled electricity demand profiles. Low end-use system efficiency, in terms of aggregate coefficient of performance and demand-side response capabilities, is shown to potentially lead to a trebling of electrical demand-temperature sensitivity following five years of heat demand growth.
 [71] arXiv:2002.11576 (cross-list from cs.LG) [pdf, other]

Title: NestedVAE: Isolating Common Factors via Weak Supervision
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Fair and unbiased machine learning is an important and active field of research, as decision processes are increasingly driven by models that learn from data. Unfortunately, any biases present in the data may be learned by the model, thereby inappropriately transferring that bias into the decision-making process. We identify the connection between the task of bias reduction and that of isolating factors common between domains whilst encouraging domain-specific invariance. To isolate the common factors we combine the theory of deep latent variable models with information bottleneck theory for scenarios whereby data may be naturally paired across domains and no additional supervision is required. The result is the Nested Variational AutoEncoder (NestedVAE). Two outer VAEs with shared weights attempt to reconstruct the input and infer a latent space, whilst a nested VAE attempts to reconstruct the latent representation of one image from the latent representation of its paired image. In so doing, the nested VAE isolates the common latent factors/causes and becomes invariant to unwanted factors that are not shared between paired images. We also propose a new metric to provide a balanced method of evaluating consistency and classifier performance across domains, which we refer to as the Adjusted Parity metric. An evaluation of NestedVAE on both domain and attribute invariance, change detection, and learning common factors for the prediction of biological sex demonstrates that NestedVAE significantly outperforms alternative methods.
 [72] arXiv:2002.11589 (cross-list from cs.LG) [pdf, other]

Title: Recommendation on a Budget: Column Space Recovery from Partially Observed Entries with Random or Active Sampling
Comments: A shorter version is accepted to AISTATS
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
We analyze alternating minimization for column space recovery of a partially observed, approximately low-rank matrix with a growing number of columns and a fixed budget of observations per column. In this work, we prove that if the budget is greater than the rank of the matrix, column space recovery succeeds: as the number of columns grows, the estimate from alternating minimization converges to the true column space with probability tending to one. From our proof techniques, we naturally formulate an active sampling strategy for choosing entries of a column that is theoretically and empirically (on synthetic and real data) better than the commonly studied uniformly random sampling strategy.
 [73] arXiv:2002.11599 (cross-list from cs.IT) [pdf, other]

Title: Minimax Optimal Estimation of KL Divergence for Continuous Distributions
Subjects: Information Theory (cs.IT); Machine Learning (stat.ML)
Estimating Kullback-Leibler divergence from independent and identically distributed samples is an important problem in various domains. One simple and effective estimator is based on the k-nearest-neighbor distances between these samples. In this paper, we analyze the convergence rates of the bias and variance of this estimator. Furthermore, we derive a lower bound of the minimax mean square error and show that the kNN method is asymptotically rate optimal.
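A commonly used form of the nearest-neighbor divergence estimator, for samples $x_1,\dots,x_n \sim P$ and $y_1,\dots,y_m \sim Q$ in $\mathbb{R}^d$, is $\hat{D} = \frac{d}{n}\sum_i \log\frac{\nu_k(i)}{\rho_k(i)} + \log\frac{m}{n-1}$, where $\rho_k(i)$ is the distance from $x_i$ to its k-th nearest neighbor among the other $x$ samples and $\nu_k(i)$ is the distance to its k-th nearest neighbor among the $y$ samples. The sketch below implements this for one-dimensional data only; whether the paper analyzes exactly this variant is an assumption, and the function name `knn_kl_1d` is hypothetical.

```python
import math
import random


def knn_kl_1d(xs, ys, k=1):
    """k-nearest-neighbor estimate of KL(P || Q) from 1-D samples xs ~ P, ys ~ Q.

    Assumes continuous data without ties (a zero distance would break the log).
    """
    n, m = len(xs), len(ys)
    total = 0.0
    for i, x in enumerate(xs):
        # Distance to the k-th nearest neighbor within xs, excluding x itself.
        rho = sorted(abs(x - other) for j, other in enumerate(xs) if j != i)[k - 1]
        # Distance to the k-th nearest neighbor within ys.
        nu = sorted(abs(x - y) for y in ys)[k - 1]
        total += math.log(nu / rho)
    # The dimension factor d equals 1 here.
    return total / n + math.log(m / (n - 1))


rng = random.Random(0)
p_samples = [rng.gauss(0.0, 1.0) for _ in range(400)]
q_same = [rng.gauss(0.0, 1.0) for _ in range(400)]
q_shifted = [rng.gauss(3.0, 1.0) for _ in range(400)]
# The first estimate should be near zero; the second should be clearly positive.
print(knn_kl_1d(p_samples, q_same, k=3), knn_kl_1d(p_samples, q_shifted, k=3))
```

This brute-force version costs roughly O(n(n + m) log(n + m)); a spatial index such as a k-d tree would be the usual choice for larger samples or higher dimensions.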
 [74] arXiv:2002.11603 (cross-list from cs.LG) [pdf, other]

Title: Differentially Private Mean Embeddings with Random Features (DP-MERF) for Simple & Practical Synthetic Data Generation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We present a differentially private data generation paradigm using random feature representations of kernel mean embeddings when comparing the distribution of true data with that of synthetic data. We exploit the random feature representations for two important benefits. First, we require a very low privacy cost for training deep generative models. This is because unlike kernel-based distance metrics that require computing the kernel matrix on all pairs of true and synthetic data points, we can detach the data-dependent term from the term solely dependent on synthetic data. Hence, we need to perturb the data-dependent term once-for-all and then use it until the end of the generator training. Second, we can obtain an analytic sensitivity of the kernel mean embedding as the random features are norm bounded by construction. This removes the necessity of hyperparameter search for a clipping norm to handle the unknown sensitivity of an encoder network when dealing with high-dimensional data. We provide several variants of our algorithm, differentially private mean embeddings with random features (DP-MERF), to generate (a) heterogeneous tabular data, (b) input features and corresponding labels jointly, and (c) high-dimensional data. Our algorithm achieves better privacy-utility tradeoffs than existing methods tested on several datasets.
 [75] arXiv:2002.11609 (cross-list from cs.CV) [pdf, other]

Title: ARMA Nets: Expanding Receptive Field for Dense Prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Global information is essential for dense prediction problems, whose goal is to compute a discrete or continuous label for each pixel in the images. Traditional convolutional layers in neural networks, originally designed for image classification, are restrictive in these problems since their receptive fields are limited by the filter size. In this work, we propose the autoregressive moving-average (ARMA) layer, a novel module in neural networks to allow explicit dependencies of output neurons, which significantly expands the receptive field with minimal extra parameters. We show experimentally that the effective receptive field of neural networks with ARMA layers expands as autoregressive coefficients become larger. In addition, we demonstrate that neural networks with ARMA layers substantially improve the performance of challenging pixel-level video prediction tasks as our model enlarges the effective receptive field.
 [76] arXiv:2002.11611 (cross-list from cs.LG) [pdf, other]

Title: Online Learning in Contextual Bandits using Gated Linear Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We introduce a new and completely online contextual bandit algorithm called Gated Linear Contextual Bandits (GLCB). This algorithm is based on Gated Linear Networks (GLNs), a recently introduced deep learning architecture with properties well-suited to the online setting. Leveraging data-dependent gating properties of the GLN we are able to estimate prediction uncertainty with effectively zero algorithmic overhead. We empirically evaluate GLCB compared to 9 state-of-the-art algorithms that leverage deep neural networks, on a standard benchmark suite of discrete and continuous contextual bandit problems. GLCB obtains median first place despite being the only online method, and we further support these results with a theoretical study of its convergence properties.
 [77] arXiv:2002.11613 (cross-list from cs.LG) [pdf, other]

Title: The Differentially Private Lottery Ticket Mechanism
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose the differentially private lottery ticket mechanism (DPLTM), an end-to-end differentially private training paradigm based on the lottery ticket hypothesis. Using "high-quality winners", selected via our custom score function, DPLTM significantly improves the privacy-utility tradeoff over the state-of-the-art. We show that DPLTM converges faster, allowing for early stopping with reduced privacy budget consumption. We further show that the tickets from DPLTM are transferable across datasets, domains, and architectures. Our extensive evaluation on several public datasets provides evidence for our claims.
 [78] arXiv:2002.11618 (cross-list from cs.CY) [pdf, other]

Title: Better coverage, better outcomes? Mapping mobile network data to official statistics using satellite imagery and radio propagation modelling
Authors: Till Koebe
Subjects: Computers and Society (cs.CY); Computation (stat.CO); Methodology (stat.ME)
Mobile sensing data has become a popular data source for geospatial analysis; however, mapping it accurately to other sources of information such as statistical data remains a challenge. Popular mapping approaches such as point allocation or Voronoi tessellation provide only crude approximations of the mobile network coverage as they do not consider holes, overlaps and within-cell heterogeneity. More elaborate mapping schemes often require additional proprietary data that operators are highly reluctant to share. In this paper, I use human settlement information extracted from publicly available satellite imagery in combination with stochastic radio propagation modelling techniques to account for that. I investigate in a simulation study and a real-world application on unemployment estimates in Senegal whether better coverage approximations lead to better outcome predictions. The good news is: it does not have to be complicated.
 [79] arXiv:2002.11621 (cross-list from cs.CY) [pdf, ps, other]

Title: Algorithms for Fair Team Formation in Online Labour Marketplaces
Comments: Accepted at "FATES 2019 : 1st Workshop on Fairness, Accountability, Transparency, Ethics, and Society on the Web" (this http URL)
Journal-ref: "Companion Proceedings of The 2019 World Wide Web Conference", 2019, pages 484-490
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
As freelancing work keeps on growing almost everywhere due to a sharp decrease in communication costs and to the spread of Internet-based labour marketplaces (e.g., guru.com, feelancer.com, mturk.com, upwork.com), many researchers and practitioners have started exploring the benefits of outsourcing and crowdsourcing. Since employers often use these platforms to find a group of workers to complete a specific task, researchers have focused their efforts on the study of team formation and matching algorithms and on the design of effective incentive schemes. Nevertheless, just recently, several concerns have been raised about possibly unfair biases introduced through the algorithms used to carry out these selection and matching procedures. For this reason, researchers have started studying the fairness of algorithms related to these online marketplaces, looking for intelligent ways to overcome the algorithmic bias that frequently arises. Broadly speaking, the aim is to guarantee that, for example, the process of hiring workers through the use of machine learning and algorithmic data analysis tools does not discriminate, even unintentionally, on grounds of nationality or gender. In this short paper, we define the Fair Team Formation problem in the following way: given an online labour marketplace where each worker possesses one or more skills, and where all workers are divided into two or more non-overlapping classes (for example, men and women), we want to design an algorithm that is able to find a team with all the skills needed to complete a given task, and that has the same number of people from all classes. We provide inapproximability results for the Fair Team Formation problem together with four algorithms for the problem itself. We also tested the effectiveness of our algorithmic solutions by performing experiments using real data from an online labor marketplace.
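The problem statement above can be made concrete with a small exhaustive search. This is only a sketch of the problem definition, not one of the four algorithms from the paper; the worker names, skills, class labels, and the choice to return the smallest feasible team are all assumptions made for illustration.

```python
from itertools import combinations
from collections import Counter

def fair_team(workers, required_skills, classes):
    """Smallest team covering required_skills with equal head-count per class, or None."""
    names = sorted(workers)
    # Only team sizes that are multiples of the number of classes can be balanced.
    for size in range(len(classes), len(names) + 1, len(classes)):
        for team in combinations(names, size):
            covered = set().union(*(workers[w]["skills"] for w in team))
            counts = Counter(workers[w]["cls"] for w in team)
            balanced = all(counts[c] == size // len(classes) for c in classes)
            if balanced and required_skills <= covered:
                return team
    return None

# Hypothetical marketplace with two classes, "A" and "B".
workers = {
    "ann": {"cls": "A", "skills": {"python", "design"}},
    "bob": {"cls": "B", "skills": {"python"}},
    "cat": {"cls": "A", "skills": {"sql"}},
    "dan": {"cls": "B", "skills": {"sql", "design"}},
}
team = fair_team(workers, {"python", "sql", "design"}, ["A", "B"])
```

The search is exponential in the number of workers, which is consistent with the abstract's mention of inapproximability results: brute force is only viable on toy instances like this one.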
 [80] arXiv:2002.11631 (cross-list from cs.CY) [pdf, other]

Title: CausalML: Python Package for Causal Machine Learning
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
CausalML is a Python implementation of algorithms related to causal inference and machine learning. Algorithms combining causal inference and machine learning have been a trending topic in recent years. This package tries to bridge the gap between theoretical work on methodology and practical applications by making a collection of methods in this field available in Python. This paper introduces the key concepts, scope, and use cases of this package.
 [81] arXiv:2002.11637 (cross-list from cs.LG) [pdf, other]

Title: Learning Navigation Costs from Demonstration in Partially Observable Environments
Comments: 6 pages, 5 figures
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
This paper focuses on inverse reinforcement learning (IRL) to enable safe and efficient autonomous navigation in unknown partially observable environments. The objective is to infer a cost function that explains expert-demonstrated navigation behavior while relying only on the observations and state-control trajectory used by the expert. We develop a cost function representation composed of two parts: a probabilistic occupancy encoder, with recurrent dependence on the observation sequence, and a cost encoder, defined over the occupancy features. The representation parameters are optimized by differentiating the error between demonstrated controls and a control policy computed from the cost encoder. Such differentiation is typically computed by dynamic programming through the value function over the whole state space. We observe that this is inefficient in large partially observable environments because most states are unexplored. Instead, we rely on a closed-form subgradient of the cost-to-go obtained only over a subset of promising states via an efficient motion-planning algorithm such as A* or RRT. Our experiments show that our model exceeds the accuracy of baseline IRL algorithms in robot navigation tasks, while substantially improving the efficiency of training and test-time inference.
 [82] arXiv:2002.11650 (cross-list from cs.LG) [pdf, ps, other]

Title: Corrupted Multidimensional Binary Search: Learning in the Presence of Irrational Agents
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); General Economics (econ.GN); Machine Learning (stat.ML)
Standard game-theoretic formulations for settings like contextual pricing and security games assume that agents act in accordance with a specific behavioral model. In practice, however, some agents may not subscribe to the dominant behavioral model or may act in ways that are arbitrarily inconsistent. Existing algorithms heavily depend on the model being (approximately) accurate for all agents and have poor performance in the presence of even a few such arbitrarily irrational agents. \emph{How do we design learning algorithms that are robust to the presence of arbitrarily irrational agents?}
We address this question for a number of canonical game-theoretic applications by designing a robust algorithm for the fundamental problem of multidimensional binary search. The performance of our algorithm degrades gracefully with the number of corrupted rounds, which correspond to irrational agents and need not be known in advance. As binary search is the key primitive in algorithms for contextual pricing, Stackelberg Security Games, and other game-theoretic applications, we immediately obtain robust algorithms for these settings.
Our techniques draw inspiration from learning theory, game theory, high-dimensional geometry, and convex analysis, and may be of independent algorithmic interest.
 [83] arXiv:2002.11651 (cross-list from cs.LG) [pdf, other]

Title: Fair Learning with Private Demographic Data
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
Sensitive attributes such as race are rarely available to learners in real-world settings as their collection is often restricted by laws and regulations. We give a scheme that allows individuals to release their sensitive information privately while still allowing any downstream entity to learn non-discriminatory predictors. We show how to adapt non-discriminatory learners to work with privatized protected attributes, giving theoretical guarantees on performance. Finally, we highlight how the methodology could apply to learning fair predictors in settings where protected attributes are only available for a subset of the data.
 [84] arXiv:2002.11661 (cross-list from cs.DS) [pdf, other]

Title: Compact Representation of Uncertainty in Hierarchical Clustering
Authors: Craig S. Greenberg, Sebastian Macaluso, Nicholas Monath, Ji-Ah Lee, Patrick Flaherty, Kyle Cranmer, Andrew McGregor, Andrew McCallum
Comments: 21 pages, 5 figures
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. When multiple hierarchical clusterings of the data are possible, it is useful to represent uncertainty in the clustering through various probabilistic quantities. Existing approaches represent uncertainty for a range of models; however, they only provide approximate inference. This paper presents dynamic-programming algorithms and proofs for exact inference in hierarchical clustering. We are able to compute the partition function, MAP hierarchical clustering, and marginal probabilities of sub-hierarchies and clusters. Our method supports a wide range of hierarchical models and only requires a cluster compatibility function. Rather than scaling with the number of hierarchical clusterings of $n$ elements ($\omega(n n! / 2^{n-1})$), our approach runs in time and space proportional to the significantly smaller powerset of $n$. Despite still being large, these algorithms enable exact inference in small-data applications and are also interesting from a theoretical perspective. We demonstrate the utility of our method and compare its performance with respect to existing approximate methods.
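To give a feel for the gap between the two growth rates mentioned above, the following sketch compares the number of hierarchies with the size of the powerset. It assumes that hierarchies are rooted binary trees over $n$ labelled leaves, for which the count is the double factorial $(2n-3)!!$; that modelling choice and the function names are assumptions, not taken from the abstract.

```python
def num_binary_hierarchies(n):
    """Number of rooted binary trees with n labelled leaves: (2n - 3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n - 3
        count *= k
    return count

def powerset_size(n):
    """Number of subsets of an n-element set."""
    return 2 ** n

# Side-by-side growth for a few small n: (n, number of hierarchies, powerset size).
table = [(n, num_binary_hierarchies(n), powerset_size(n)) for n in (4, 10, 20)]
```

Under this assumption, ten elements already admit 34,459,425 binary hierarchies against 1,024 subsets, which illustrates why an algorithm that scales with the powerset can be exact where enumeration is not feasible.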
 [85] arXiv:2002.11675 (cross-list from cs.DB) [pdf, other]

Title: Workload Prediction of Business Processes - An Approach Based on Process Mining and Recurrent Neural Networks
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent advances in the interconnectedness and digitization of industrial machines, known as Industry 4.0, pave the way for new analytical techniques. Indeed, the availability and the richness of production-related data enable new data-driven methods. In this paper, we propose a process mining approach augmented with artificial intelligence that (1) reconstructs the historical workload of a company and (2) predicts the workload using neural networks. Our method relies on logs, representing the history of business processes related to manufacturing. These logs are used to quantify the supply and demand and are fed into a recurrent neural network model to predict customer orders. The corresponding activities to fulfill these orders are then sampled from history with a replay mechanism, based on criteria such as trace frequency and activity similarity. An evaluation and illustration of the method is performed on the administrative processes of Heraeus Materials SA. The workload prediction on a one-year test set achieves an MAPE score of 19% for a one-week forecast. The case study suggests a reasonable accuracy and confirms that a good understanding of the historical workload combined with articulated predictions is of great help for supporting management decisions and can decrease costs with better resource planning on a medium-term level.
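For readers unfamiliar with the metric quoted above, MAPE (mean absolute percentage error) averages the absolute forecast error relative to the actual value. The sketch below uses the standard definition; the weekly workload figures are made up for illustration and are not from the paper.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    pairs = list(zip(actual, forecast))
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)

# Hypothetical weekly workloads (e.g. number of orders) and their forecasts.
actual_orders = [100, 200, 400]
forecast_orders = [110, 180, 400]
score = mape(actual_orders, forecast_orders)
```

Because each error is divided by the actual value, weeks with small actual workloads can dominate the score, which is worth keeping in mind when reading a single aggregate figure.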
 [86] arXiv:2002.11684 (cross-list from cs.LG) [pdf, other]

Title: Provable Meta-Learning of Linear Representations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Meta-learning, or learning-to-learn, seeks to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments. Representation learning, a key tool for performing meta-learning, learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical underpinnings of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression, in which multiple linear regression models share a common, low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms to address the dual challenges of (1) learning a common set of features from multiple, related tasks, and (2) transferring this knowledge to new, unseen tasks. Both are central to the general problem of meta-learning. Finally, we complement these results by providing information-theoretic lower bounds on the sample complexity of learning these linear features, showing that our algorithms are optimal up to logarithmic factors.
 [87] arXiv:2002.11686 (cross-list from cs.CR) [pdf]

Title: IoT Device Identification Using Deep Learning
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Machine Learning (stat.ML)
The growing use of IoT devices in organizations has increased the number of attack vectors available to attackers due to the less secure nature of the devices. The widely adopted bring your own device (BYOD) policy, which allows an employee to bring any IoT device into the workplace and attach it to an organization's network, also increases the risk of attacks. In order to address this threat, organizations often implement security policies in which only the connection of whitelisted IoT devices is permitted. To monitor adherence to such policies and protect their networks, organizations must be able to identify the IoT devices connected to their networks and, more specifically, to identify connected IoT devices that are not on the whitelist (unknown devices). In this study, we applied deep learning to network traffic to automatically identify IoT devices connected to the network. In contrast to previous work, our approach does not require that complex feature engineering be applied to the network traffic, since we represent the communication behavior of IoT devices using small images built from the IoT devices' network traffic payloads. In our experiments, we trained a multiclass classifier on a publicly available dataset, successfully identifying 10 different IoT devices and the traffic of smartphones and computers, with over 99% accuracy. We also trained multiclass classifiers to detect unauthorized IoT devices connected to the network, achieving over 99% overall average detection accuracy.
 [88] arXiv:2002.11701 (cross-list from cs.LG) [pdf, other]

Title: CLARA: Clinical Report Autocompletion
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Generating clinical reports from raw recordings such as X-rays and electroencephalogram (EEG) is an essential and routine task for doctors. However, it is often time-consuming to write accurate and detailed reports. Most existing methods try to generate the whole report from the raw input with limited success because 1) generated reports often contain errors that need manual review and correction, 2) it does not save time when doctors want to write additional information into the report, and 3) the generated reports are not customized based on individual doctors' preferences. We propose {\it CL}inic{\it A}l {\it R}eport {\it A}utocompletion (CLARA), an interactive method that generates reports in a sentence-by-sentence fashion based on doctors' anchor words and partially completed sentences. CLARA searches for the most relevant sentences from existing reports as the template for the current report. The retrieved sentences are sequentially modified by combining with the input feature representations to create the final report. In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 for EEG reports for sentence-level generation, which is an improvement of up to 35% over the best baseline. In our qualitative evaluation, CLARA is also shown to produce reports which have a significantly higher level of approval by doctors in a user study (3.74 out of 5 for CLARA vs 2.52 out of 5 for the baseline).
 [89] arXiv:2002.11708 (cross-list from cs.LG) [pdf, other]

Title: Generalized Hindsight for Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO); Machine Learning (stat.ML)
One of the key reasons for the high sample complexity in reinforcement learning (RL) is the inability to transfer knowledge from one task to another. In standard multi-task RL settings, low-reward data collected while trying to solve one task provides little to no signal for solving that particular task and is hence effectively wasted. However, we argue that this data, which is uninformative for one task, is likely a rich source of information for other tasks. To leverage this insight and efficiently reuse data, we present Generalized Hindsight: an approximate inverse reinforcement learning technique for relabeling behaviors with the right tasks. Intuitively, given a behavior generated under one task, Generalized Hindsight returns a different task that the behavior is better suited for. Then, the behavior is relabeled with this new task before being used by an off-policy RL optimizer. Compared to standard relabeling techniques, Generalized Hindsight provides a substantially more efficient reuse of samples, which we empirically demonstrate on a suite of multi-task navigation and manipulation tasks. Videos and code can be accessed here: https://sites.google.com/view/generalizedhindsight.
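The relabeling idea can be sketched in its simplest form: score a collected trajectory under every candidate task and keep the task with the highest total reward. This is a deliberately simplified stand-in, not the approximate inverse reinforcement learning procedure from the paper; the 1D navigation setting, the reward function, and all names below are assumptions.

```python
def relabel(trajectory, candidate_tasks, reward_fn):
    """Return the candidate task under which the trajectory earns the highest total reward."""
    def total_return(task):
        return sum(reward_fn(state, task) for state in trajectory)
    return max(candidate_tasks, key=total_return)

# Hypothetical 1D navigation: a task is a goal position, reward is negative distance to it.
def neg_distance(state, goal):
    return -abs(state - goal)

trajectory = [0.0, 1.0, 2.0, 3.0]  # behaviour collected while aiming for goal 10.0
goals = [10.0, 3.0, -5.0]
best_goal = relabel(trajectory, goals, neg_distance)  # picks 3.0 for this data
```

A trajectory that performs poorly for its original goal is thus stored as a good example for the goal it happened to reach, so an off-policy learner can still extract signal from it.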
Replacements for Thu, 27 Feb 20
 [90] arXiv:1610.01697 (replaced) [pdf, ps, other]

Title: Central Limit Theory for Combined Cross-Section and Time Series
Comments: arXiv admin note: substantial text overlap with arXiv:1507.04415
Subjects: Methodology (stat.ME)
 [91] arXiv:1805.02136 (replaced) [pdf, other]

Title: Private Sequential Learning
Comments: Accepted for presentation at COLT 2018
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [92] arXiv:1808.02195 (replaced) [pdf, ps, other]

Title: Fisher information matrix for single molecules with stochastic trajectories
Journal-ref: SIAM Journal on Imaging Sciences, 2020, Vol. 13, No. 1 : pp. 234-264
Subjects: Quantitative Methods (q-bio.QM); Biological Physics (physics.bio-ph); Applications (stat.AP)
 [93] arXiv:1808.03201 (replaced) [pdf, other]

Title: An optimal design for hierarchical generalized group testing
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)
 [94] arXiv:1810.05546 (replaced) [pdf, other]

Title: Uncertainty in Neural Networks: Approximately Bayesian Ensembling
Comments: Please cite as published in AISTATS 2020
Journal-ref: The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [95] arXiv:1810.07371 (replaced) [pdf, other]

Title: Simple Regret Minimization for Contextual Bandits
Comments: The first two authors contributed equally
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [96] arXiv:1810.11755 (replaced) [pdf, other]

Title: Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [97] arXiv:1811.07073 (replaced) [pdf, other]

Title: Semi-Supervised Semantic Image Segmentation with Self-correcting Networks
Comments: Accepted to CVPR 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [98] arXiv:1901.07329 (replaced) [pdf, ps, other]

Title: The autofeat Python Library for Automated Feature Engineering and Selection
Comments: ECML-PKDD 2019 Workshop on Automating Data Science (ADS)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [99] arXiv:1904.06963 (replaced) [pdf, other]

Title: The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [100] arXiv:1904.07701 (replaced) [pdf, other]

Title: Multiple kernel learning for integrative consensus clustering of 'omic datasets
Comments: Manuscript: 22 pages, 7 figures. Supplement: 18 pages, 12 figures. This version contains new real data applications. For associated R code, see this https URL and this https URL
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
 [101] arXiv:1905.00699 (replaced) [pdf, ps, other]

Title: Long-tailed distributions of inter-event times as mixtures of exponential distributions
Comments: 2 figures, 4 tables, SI and code are available here: this https URL
Journal-ref: Royal Society Open Science, 7, 191643 (2020)
Subjects: Physics and Society (physics.soc-ph); Applications (stat.AP)
 [102] arXiv:1905.03297 (replaced) [pdf, other]

Title: Interpretable Subgroup Discovery in Treatment Effect Estimation with Application to Opioid Prescribing Guidelines
Authors: Chirag Nagpal, Dennis Wei, Bhanukiran Vinzamuri, Monica Shekhar, Sara E. Berger, Subhro Das, Kush R. Varshney
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [103] arXiv:1905.11926 (replaced) [pdf, other]

Title: Network Deconvolution
Authors: Chengxi Ye, Matthew Evanusa, Hua He, Anton Mitrokhin, Tom Goldstein, James A. Yorke, Cornelia Fermüller, Yiannis Aloimonos
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
 [104] arXiv:1906.00642 (replaced) [pdf, other]

Title: A Variational Approach for Learning from Positive and Unlabeled Data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [105] arXiv:1906.02768 (replaced) [pdf, other]

Title: Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP
Comments: ICLR 2020
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
 [106] arXiv:1906.11286 (replaced) [pdf, other]

Title: A Story of Two Streams: Reinforcement Learning Models from Human Behavior and Neuropsychiatry
Comments: Published in AAMAS 2020 as a full paper. This article supersedes our work arXiv:1706.02897 into RL setting and extends extensively into RL games, cognitive modeling, and gambling tasks in lifelong learning setting
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
 [107] arXiv:1906.11641 (replaced) [pdf, ps, other]

Title: A global approach for learning sparse Ising models
Authors: Daniela De Canditiis
Comments: 15 pages, 4 figures. arXiv admin note: text overlap with arXiv:1902.04728 by other authors
Journal-ref: Mathematics and Computers in Simulation (2020)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
 [108] arXiv:1907.04809 (replaced) [pdf, other]

Title: Variational Autoencoders and Nonlinear ICA: A Unifying Framework
Comments: Accepted for publication at AISTATS 2020
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [109] arXiv:1907.12363 (replaced) [pdf, other]

Title: A comparison of Deep Learning performances with other machine learning algorithms on credit scoring unbalanced data
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [110] arXiv:1908.07558 (replaced) [pdf, other]

Title: Transferring Robustness for Graph Neural Network Against Poisoning Attacks
Comments: Accepted by WSDM 2020. Code and data: this https URL
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
 [111] arXiv:1909.02669 (replaced) [pdf, other]

Title: Covariate Selection for Generalizing Experimental Results: Application to Large-Scale Development Program in Uganda
Subjects: Methodology (stat.ME)
 [112] arXiv:1909.04421 (replaced) [pdf, other]

Title: Privacy-Preserving Bandits
Comments: 13 pages, 7 figures
Journal-ref: In Proceedings of the 3rd Conference on Machine Learning and Systems (MLSys 2020)
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
 [113] arXiv:1909.05237 (replaced) [pdf, ps, other]

Title: Functional Principal Component Analysis as a Versatile Technique to Understand and Predict the Electric Consumption Patterns
Comments: Accepted for publication on Sustainable Energy, Grids and Networks (Elsevier)
Subjects: Systems and Control (eess.SY); Applications (stat.AP)
 [114] arXiv:1909.07698 (replaced) [pdf, other]

Title: Compositional uncertainty in deep Gaussian processes
Authors: Ivan Ustyuzhaninov, Ieva Kazlauskaite, Markus Kaiser, Erik Bodin, Neill D. F. Campbell, Carl Henrik Ek
Comments: 17 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [115] arXiv:1909.09902 (replaced) [pdf, other]

Title: Deep Reinforcement Learning with Modulated Hebbian plus Q Network Architecture
Authors: Pawel Ladosz, Eseoghene Ben-Iwhiwhu, Jeffrey Dick, Yang Hu, Nicholas Ketz, Soheil Kolouri, Jeffrey L. Krichmar, Praveen Pilly, Andrea Soltoggio
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [116] arXiv:1909.12064 (replaced) [pdf, other]

Title: Set Functions for Time Series
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [117] arXiv:1910.01847 (replaced) [pdf, other]

Title: Unbiased CVR Prediction from Biased Conversions in Display Advertising
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [118] arXiv:1910.02497 (replaced) [pdf, other]

Title: mfEGRA: Multifidelity Efficient Global Reliability Analysis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Computation (stat.CO)
 [119] arXiv:1910.04462 (replaced) [pdf, other]

Title: Fast Tree Variants of Gromov-Wasserstein
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [120] arXiv:1910.04483 (replaced) [pdf, other]

Title: Tree-Wasserstein Barycenter for Large-Scale Multilevel Clustering and Scalable Bayes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [121] arXiv:1910.09055 (replaced) [pdf, other]

Title: Image recognition from raw labels collected without annotators
Comments: Version changelog: Added content on ImageNet related experiments; Restructured the document to incorporate the new content
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [122] arXiv:1910.11369 (replaced) [pdf, other]

Title: Structured Prediction with Projection Oracles
Authors: Mathieu Blondel
Comments: In proceedings of NeurIPS 2019 (v2: minor modifications in Appendix A)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [123] arXiv:1910.12179 (replaced) [pdf, other]

Title: BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning
Comments: 22 pages (13 pages for appendix); added new experimental results
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [124] arXiv:1911.01413 (replaced) [pdf, ps, other]

Title: Sub-Optimal Local Minima Exist for Almost All Over-parameterized Neural Networks
Comments: 31 pages. Minor adjustments on some notations and wordings. An early version was submitted to Optimization Online on October 4
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [125] arXiv:1911.09162 (replaced) [pdf, other]

Title: Deep Active Learning: Unified and Principled Method for Query and Training
Comments: AISTATS 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [126] arXiv:1912.02765 (replaced) [pdf, other]

Title: On the Sample Complexity of Learning Sum-Product Networks
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
 [127] arXiv:1912.04261 (replaced) [pdf, other]

Title: A time resolved clustering method revealing long-term structures and their short-term internal dynamics
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [128] arXiv:1912.05901 (replaced) [pdf, other]

Title: Adaptive Bayesian Reticulum
Comments: 23 pages, 8 figures, 2 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [129] arXiv:1912.11398 (replaced) [pdf, ps, other]

Title: An error bound for Lasso and Group Lasso in high dimensions
Authors: Antoine Dedieu
Comments: arXiv admin note: text overlap with arXiv:1910.08880
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
 [130] arXiv:2001.00074 (replaced) [pdf, other]

Title: Combining interdependent climate model outputs in CMIP5: A spatial Bayesian approach
Subjects: Applications (stat.AP)
 [131] arXiv:2001.02323 (replaced) [pdf, other]

Title: On Thompson Sampling for Smoother-than-Lipschitz Bandits
Comments: Accepted to AISTATS 2020. 26 pages, 2 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [132] arXiv:2001.06057 (replaced) [pdf, other]

Title: Increasing the robustness of DNNs against image corruptions by playing the Game of Noise
Authors: Evgenia Rusak, Lukas Schott, Roland S. Zimmermann, Julian Bitterwolf, Oliver Bringmann, Matthias Bethge, Wieland Brendel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [133] arXiv:2001.09849 (replaced) [pdf, other]

Title: Exploiting Unsupervised Inputs for Accurate Few-Shot Classification
Comments: Fix typo, update parameters for 5 shot, add link towards code; Change format, add graph visu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [134] arXiv:2001.11114 (replaced) [pdf, ps, other]

Title: Multi-Marginal Optimal Transport Defines a Generalized Metric
Subjects: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Functional Analysis (math.FA); Machine Learning (stat.ML)
 [135] arXiv:2002.02081 (replaced) [pdf, other]

Title: Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
 [136] arXiv:2002.02579 (replaced) [pdf, other]

Title: Estimating Optimal Treatment Rules with an Instrumental Variable: A Partial Identification Learning Approach
Subjects: Methodology (stat.ME)
 [137] arXiv:2002.03495 (replaced) [pdf, ps, other]

Title: A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient Descent Escapes From Sharp Minima Exponentially Fast
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [138] arXiv:2002.03549 (replaced) [pdf, other]

Title: Adversarial TCAV -- Robust and Effective Interpretation of Intermediate Layers in Neural Networks
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [139] arXiv:2002.03860 (replaced) [pdf, other]

Title: Missing Data Imputation using Optimal Transport
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [140] arXiv:2002.04019 (replaced) [pdf, other]

Title: Be Like Water: Robustness to Extraneous Variables Via Adaptive Feature Normalization
Comments: Aakash and Sreyas contributed equally
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [141] arXiv:2002.04764 (replaced) [pdf, other]

Title: Capsules with Inverted Dot-Product Attention Routing
Comments: ICLR 2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [142] arXiv:2002.05059 (replaced) [pdf, other]

Title: Goldilocks Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
 [143] arXiv:2002.07284 (replaced) [pdf, other]

Title: Sharp Asymptotics and Optimal Performance for Inference in Binary Models
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)
 [144] arXiv:2002.09547 (replaced) [pdf, other]

Title: Stochastic Normalizing Flows
Comments: 17 pages, 4 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [145] arXiv:2002.09954 (replaced) [pdf, other]

Title: Near-linear Time Gaussian Process Optimization with Adaptive Batching and Resparsification
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [146] arXiv:2002.10043 (replaced) [pdf, other]

Title: Complete Dictionary Learning via $\ell_p$-norm Maximization
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP); Machine Learning (stat.ML)
 [147] arXiv:2002.10060 (replaced) [pdf, other]

Title: Handling the Positive-Definite Constraint in the Bayesian Learning Rule
Comments: Corrected some typos
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
 [148] arXiv:2002.10211 (replaced) [pdf, other]

Title: Mnemonics Training: Multi-Class Incremental Learning without Forgetting
Comments: To appear in CVPR 2020. The camera-ready version with supplementary experiment results will come on 23rd March. Code will come soon at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
 [149] arXiv:2002.10241 (replaced) [pdf, other]

Title: Multi-objective Consensus Clustering Framework for Flight Search Recommendation
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
 [150] arXiv:2002.10539 (replaced) [pdf, other]

Title: Efficient Rollout Strategies for Bayesian Optimization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [151] arXiv:2002.10774 (replaced) [pdf, other]

Title: Counterfactual fairness: removing direct effects through regularization
Comments: 10 pages, 4 figures
Subjects: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
 [152] arXiv:2002.11052 (replaced) [pdf, other]

Title: Relevant-features based Auxiliary Cells for Energy Efficient Detection of Natural Errors
Comments: 16 pages, 3 figures, 6 tables
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)