Statistics Seminar – Spring 2021

Schedule for Spring 2021

The Statistics Seminar has migrated to Zoom for the Spring 2021 semester.

Seminars are on Mondays
Time: 1:00pm – 2:00pm

An archive of past seminars is also available.



Ming Yuan (Columbia)

Title: Complexity of High Dimensional Sparse Functions

Abstract: We investigate optimal algorithms for estimating a general high dimensional smooth and sparse function from the perspective of information-based complexity. Our algorithms and analyses reveal several interesting characteristics for these tasks. In particular, our results illustrate the potential value of experiment design for high dimensional problems.



University Holiday


Philippe Naveau (LSCE-CNRS France)

Title: Detecting changes in multivariate extremes from climatological time series

Abstract: Many effects of climate change seem to be reflected not in the mean temperatures, precipitation or other environmental variables, but rather in the frequency and severity of the extreme events in the distributional tails. The most serious climate-related disasters are caused by compound events that result from an unfortunate combination of several variables. Detecting changes in size or frequency of such compound events requires a statistical methodology that efficiently uses the largest observations in the sample.

We propose a simple, non-parametric test that decides whether two multivariate distributions exhibit the same tail behavior. The test is based on the relative entropy, namely the Kullback–Leibler divergence, between exceedances over a high threshold of the two multivariate random vectors. We study the properties of the test and further explore its effectiveness for finite sample sizes.

Our main application is the analysis of daily heavy rainfall time series in France (1976–2015). Our goal in this application is to detect whether the multivariate extremal dependence structure of heavy rainfall changes across seasons and regions.

This is joint work with Sebastian Engelke (University of Geneva) and Chen Zhou (Erasmus University Rotterdam).


Marco Avella (Columbia)

Title: Differentially private inference via robust statistics and noisy optimization

Abstract: Over the last two decades, differential privacy has emerged as a promising rigorous paradigm for the study of data privacy. It assumes a framework where a trusted curator holds the data of individuals in a database. The goal of differential privacy is to simultaneously protect individual data while allowing statistical analysis of the database as a whole. In this talk, we will discuss how intuitive ideas from robust statistics provide a natural framework for understanding differentially private statistical inference. This will be illustrated with two general approaches for performing parametric inference with differential privacy guarantees. In the first, we obtain differentially private estimators based on bounded-influence M-estimators. The key idea is to leverage the influence function of these estimators to calibrate a noise term added to them in order to ensure privacy. We show how a similar construction yields differentially private test statistics analogous to the Wald, score and likelihood ratio tests. Our second approach uses noisy optimization to obtain differentially private M-estimators and confidence intervals. In particular, we propose noisy gradient descent and noisy Newton algorithms that provably output optimal estimators that are asymptotically normally distributed. We demonstrate the good small-sample empirical performance of our methods in simulations and real data examples.
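As a toy sketch of the noisy-optimization idea (a generic gradient-clipping construction with an illustrative noise scale, not the speaker's algorithm or a calibrated privacy budget), consider privately estimating a one-dimensional mean by gradient descent on the squared loss:

```python
import random

def dp_noisy_gradient_descent(data, steps=200, lr=0.1, clip=1.0,
                              noise_scale=0.1, seed=0):
    """Estimate a 1-d mean by gradient descent on the squared loss,
    with per-example gradient clipping and Gaussian noise added at
    each step. Clipping bounds each example's influence, which is
    what makes the added noise effective for privacy."""
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    for _ in range(steps):
        # per-example gradient of 0.5*(theta - x)^2 is (theta - x); clip it
        grad = sum(max(-clip, min(clip, theta - x)) for x in data) / n
        grad += rng.gauss(0.0, noise_scale * clip / n)  # privacy noise
        theta -= lr * grad
    return theta
```

With bounded per-example influence, the Gaussian noise masks any single individual's contribution while the average gradient still drives the iterate toward the sample mean.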



Ivan Diaz (Weill Cornell Medicine)

Title: Flexible methods for mediation analysis with discrete and continuous exposures


Abstract: Mediation analysis in causal inference has traditionally focused on binary exposures and deterministic interventions, and a decomposition of the average treatment effect in terms of direct and indirect effects. We present an analogous decomposition of the population intervention effect, defined through stochastic interventions on the exposure. Population intervention effects provide a generalized framework in which a variety of interesting causal contrasts can be defined, including effects for continuous and categorical exposures. We show that identification of direct and indirect effects for the population intervention effect requires weaker assumptions than its average treatment effect counterpart, under the assumption of no mediator-outcome confounders affected by exposure. In particular, identification of direct effects is guaranteed in experiments that randomize the exposure and the mediator. We propose various estimators of the direct and indirect effects, including substitution, re-weighted, and efficient estimators based on flexible regression techniques, allowing for multivariate mediators. Our efficient estimator is asymptotically linear under a condition requiring n^1/4-consistency of certain regression functions. We perform a simulation study in which we assess the finite-sample properties of our proposed estimators. We present the results of an illustrative study where we assess the effect of participation in a sports team on BMI among children, using mediators such as exercise habits, daily consumption of snacks, and overweight status.


Espen Bernton (Columbia)

“Entropic Optimal Transport: Convergence and Geometry”

Abstract:  Over the last three decades, optimal transport theory has flourished due to its connections with geometry, analysis, probability theory, and other fields in mathematics. Following computational advances which have enabled high-dimensional applications, a renewed interest comes from statistics and machine learning. Popularized by Cuturi (2013), entropic regularization is a key computational approach in such settings. We study the convergence of optimizers of the regularized problem to optimal transport plans. We describe the convergence in the form of a large deviations principle quantifying the local exponential convergence rate. The exact rate function is determined in a general setting and linked to the Kantorovich potentials of optimal transport.
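The entropic regularization popularized by Cuturi (2013) is typically solved with Sinkhorn's alternating-scaling iterations; here is a minimal sketch for discrete measures (function and variable names are ours):

```python
import math

def sinkhorn(mu, nu, cost, eps=0.1, iters=500):
    """Entropy-regularized optimal transport between discrete measures
    mu and nu with cost matrix `cost`: alternately rescale the Gibbs
    kernel K = exp(-C/eps) so the plan's marginals match mu and nu."""
    n, m = len(mu), len(nu)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # the regularized transport plan: pi_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

As the regularization parameter eps shrinks, these regularized plans concentrate on optimal transport plans, and the talk's large deviations principle quantifies the exponential rate of that concentration.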


Claudia Klüppelberg (TU Munich)

“Learning max-linear Bayesian trees under measurement errors”

Abstract: Causal inference for extremes aims to discover cause and effect relations between large observed values of random variables. Over the last years, a number of methods have been proposed for solving the Hidden River Problem, i.e., the causal discovery of a river network from its extremes. These papers have been tested on the Danube dataset, which consists of daily river flow maxima at 31 stations along the Danube. We provide a new and simple algorithm to solve the Hidden River Problem under the assumption that the river network is a root-directed tree, which outperforms existing methods. Our method returns a directed graph and achieves almost perfect recovery on the Danube as well as on new data from the Colorado River. Our algorithm QTree relies on qualitative aspects of the max-linear Bayesian network model to score each potential edge independently, and then applies the Chu–Liu–Edmonds algorithm to return the best root-directed spanning tree. A naïve implementation of QTree runs in time O(n d^2), where n is the number of observations and d is the number of nodes. Furthermore, QTree maximizes the information available from missing data, since at each step it only utilizes the data projected onto two coordinates.

This is joint work in progress with Johannes Buck and Ngoc Tran.
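A hypothetical sketch of the structural step (the scores and names are illustrative, and this toy version covers only the first phase of Chu–Liu–Edmonds; the full algorithm additionally contracts any cycles that arise):

```python
def best_parent_tree(scores, root):
    """Given per-edge scores scores[(u, v)] for a directed edge u -> v,
    pick the highest-scoring parent for every non-root node. When the
    resulting parent map is acyclic it is already the best root-directed
    spanning tree; otherwise Chu-Liu-Edmonds would contract cycles."""
    nodes = {u for u, v in scores} | {v for u, v in scores}
    parent = {}
    for v in nodes:
        if v == root:
            continue
        candidates = [(s, u) for (u, w), s in scores.items() if w == v]
        if candidates:
            parent[v] = max(candidates)[1]  # parent with the largest score
    return parent
```

In QTree the edge scores themselves come from the max-linear model applied to two-coordinate projections of the data; here they are simply assumed as input.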



Spring Break – No Seminar



James Sharpnack (UC Davis)

“Universal consistency of nearest neighbor matching”

Abstract: When data is partially missing at random, matching methods and importance weighting are often used to estimate moments of the unobserved population. In this paper, we study 1-nearest neighbor (1NN) importance weighting (or 1NN matching), which estimates moments by replacing missing data with the complete data that is the nearest neighbor in the non-missing covariate space. We define an empirical measure, the 1NN measure, and show that it is weakly consistent for the measure of the missing data. This occurs without smoothness assumptions on moments of the conditional distribution of the partially missing variables given the non-missing variables, hence the universal consistency. The main idea behind this result is that the 1NN measure is performing inverse probability weighting in the limit. We study applications to missing data and mitigating the impact of covariate shift in prediction tasks.
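A minimal sketch of 1NN matching as described (function names are ours, and Euclidean distance on the covariates is an assumption):

```python
import math

def one_nn_impute(complete, missing_covariates):
    """1-nearest-neighbour matching: each unit with a missing outcome is
    assigned the outcome of the complete-case unit whose covariates are
    closest in Euclidean distance.
    complete: list of (covariates, outcome) pairs;
    missing_covariates: list of covariate tuples with missing outcomes."""
    imputed = []
    for x in missing_covariates:
        _, y = min(complete, key=lambda cz: math.dist(cz[0], x))
        imputed.append(y)
    return imputed

def one_nn_mean(complete, missing_covariates):
    """Estimate the population mean outcome, filling missing outcomes by 1NN."""
    ys = [y for _, y in complete] + one_nn_impute(complete, missing_covariates)
    return sum(ys) / len(ys)
```

Because each complete case is reused once per missing unit it matches, the resulting empirical measure implicitly reweights the complete cases, which is the sense in which 1NN matching performs inverse probability weighting in the limit.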


Ian McKeague (Columbia)

“Fallacies of selection in canonical correlation analysis”
Abstract:  Selection bias arises when the effects of selection of variables or models on subsequent statistical analyses are ignored, i.e., failure to take into account “double dipping” of the data.  Eighty years ago, Hotelling drew attention to this issue, and in recent years there has been a concerted effort to address the problem, giving rise to the nascent field of post-selection inference. This talk discusses the post-selection inference problem in the context of canonical correlation analysis.  The challenge is to adjust for the selection of subsets of variables having linear combinations with maximal sample correlation.  To this end, we construct a stabilized one-step estimator of the square-root of Pillai’s trace maximized over subsets of variables of pre-specified cardinality. This estimator is shown to be consistent for its target parameter and asymptotically normal provided the dimensions of the variables do not grow too quickly with sample size. We also develop a greedy search algorithm to accurately compute the estimator, leading to a computationally tractable omnibus test for the global null hypothesis that there are no linear relationships between any subsets of variables having the pre-specified cardinality. The talk is based on joint work with Xin (Henry) Zhang.


Mayya Zhilova (Georgia Tech)

“Accuracy of the bootstrap and the normal approximation in a high-dimensional framework”

Abstract: In this talk we will address the problem of establishing higher-order accuracy of bootstrap procedures and (non-)normal approximations in a multivariate or high-dimensional setting. This topic is important for numerous problems in statistical inference and applications concerned with confidence estimation and hypothesis testing that involve a growing dimension of an unknown parameter or high-dimensional random data. In particular, we will focus on higher-order expansions for the uniform distance over the set of all Euclidean balls. The new results improve on the accuracy of the normal approximation in existing Berry–Esseen inequalities under very general conditions. The established approximation bounds make it possible to track the dependence of error terms on the dimension and the sample size in an explicit way. We also show optimality of these results in the case of symmetrically distributed random summands. The talk will include an overview of statistical problems where the new results lead to improvements in the accuracy of estimation procedures.


Peter Aronow (Yale)

“Foundations of Design-based Inference under Interference.”

Abstract: In the design-based framework, the sole source of stochasticity is the randomized administration of some intervention on a finite population. This talk derives this framework beginning from a completely unrestricted model that accommodates both modeled and unmodeled interference between units. I discuss key assumptions (including Rubin’s SUTVA), estimation theory, and some open questions in the field.
The presentation is based on work in progress.

Mark Podolskij (University of Luxembourg)

“Semiparametric estimation for a certain class of McKean-Vlasov SDEs”


Abstract: In this talk we will study semiparametric estimation of certain McKean-Vlasov SDEs. This type of stochastic process naturally appears as the limit of high dimensional particle systems, and such processes nowadays find numerous applications in physics, economics and biology, among other fields. The focus is usually on probabilistic and numerical analysis, while statistical methods have received much less attention in the literature. In this presentation we aim at semiparametric estimation of the drift parameter, which is a superposition of parametric polynomial and trigonometric functions and an unknown L^1-function. We obtain convergence rates for all unknown ingredients of the model and show minimax optimality for the non-parametric part.

(joint work with Denis Belomestny and Vytaute Pilipauskaite)

John Aston (University of Cambridge)

“Functional Data in Constrained Spaces.”

Abstract: Functional Data Analysis is concerned with the statistical analysis of data which are curves or surfaces. There has been considerable progress made in this area over the last 20-30 years, but most of this work has focused on 1-dimensional curves living in a standard space such as the space of square integrable functions. However, many real data applications, such as those from linguistics and neuroimaging, involve considerable constraints on the data, which need not be a simple curve. In this talk, we will look at several different types of constrained functional data. We will examine the role of positive definiteness in linguistics and show that this can be used to study ancient languages. We will also look at 2-d manifolds embedded in 3 dimensions, such as the cortical surface of the brain. We’ll see that some current applications, such as functional connectivity, require both properties simultaneously, and we’ll suggest methods for understanding the data in such cases.

[Joint work with Eardi Lila, Davide Pigoli, Shahin Tavakoli and John Coleman]