Statistics Seminar – Fall 2020

Schedule for Fall 2020

The Statistics Seminar has migrated to Zoom for the Fall 2020 semester.

Seminars are on Mondays
Time: 1:00pm – 2:00pm
Zoom Link:


For an archive of past seminars, please click here.



Liam Paninski (Columbia)

Title: Some open directions in neural data science

Abstract: Neuroscience has rapidly moved into the realm of science fiction in the last few years (lasers! genetic engineering! glowing brains! mind-reading! memory writing!), and with these advances has come an array of challenging and interesting data science problems. This talk will be an informal tour of a few of these problems, with an emphasis on open research directions.



Ashwin Pananjady (UC Berkeley)

Title: Flexible models for learning from people: Statistics meets computation
Abstract: A plethora of latent variable models are used to learn from data generated by people. Specific examples include the Bradley–Terry–Luce and multinomial logit models for comparison and choice data, the Dawid–Skene model for crowdsourced question answering, and the Rasch model for categorical data that arises in psychometric analysis. In this talk, I will present a class of “permutation-based” models that borrows from the literature on sociology and economics and significantly generalizes classical approaches in these contexts, thereby improving their robustness to mis-specification. The talk will focus on the mathematical statistics of fitting these models, and describe a methodological toolbox that is inspired by considerations of adaptation as well as computation. These considerations highlight connections between the theory of adaptation in nonparametric statistics and conjectures in average-case computational complexity. The talk will present vignettes from two papers, one jointly with Cheng Mao and Martin Wainwright, and another jointly with Richard Samworth.



Jose Luis Montiel Olea (Columbia)
Title: Dropout Training is Distributionally Robust Optimal
Abstract: Dropout training is an increasingly popular estimation method in machine learning that minimizes some given loss function (e.g., the negative expected log-likelihood), but averaged over nested submodels chosen at random.
This paper shows that dropout training in Generalized Linear Models is the minimax solution of a two-player, zero-sum game where an adversarial nature corrupts a statistician’s covariates using a multiplicative nonparametric errors-in-variables model. In this game, known as a Distributionally Robust Optimization problem, nature’s least favorable distribution is dropout noise, where nature independently deletes entries of the covariate vector with some fixed probability $\delta$. Our decision-theoretic analysis shows that dropout training, the statistician’s minimax strategy in the game, indeed provides out-of-sample expected loss guarantees for distributions that arise from multiplicative perturbations of in-sample data.

This paper also provides a novel, parallelizable, Unbiased Multi-Level Monte Carlo algorithm to speed up the implementation of dropout training. Our algorithm has a much smaller computational cost compared to the naive implementation of dropout, provided the number of data points is much smaller than the dimension of the covariate vector.

This is joint work with José Blanchet, Yang Kang, Viet Nguyen, and Xuhui Zhang. 


Bhaswar Bhattacharya (U Penn)

Title: Detection Thresholds for Non-Parametric Tests Based on Geometric Graphs: The Curious Case of Dimension 8

Abstract: Two of the fundamental problems in non-parametric statistical inference are goodness-of-fit and two-sample testing. These two problems have been extensively studied and several multivariate tests have been proposed over the last thirty years, many of which are based on geometric graphs. These include, among several others, the celebrated Friedman-Rafsky two-sample test based on the minimal spanning tree and the K-nearest neighbor graphs, and the Bickel-Breiman spacings tests for goodness-of-fit. These tests are asymptotically distribution-free, universally consistent, and computationally efficient (both in sample size and in dimension), making them particularly attractive for modern statistical applications.

In this talk, we will derive the detection thresholds and limiting local power of these tests, thus providing a way to compare and justify the performance of these tests in various applications. Several interesting properties emerge, such as a curious phase transition in dimension 8, and a remarkable blessing of dimensionality in detecting scale changes.


Stefan Wager (Stanford)

Title: Synthetic Difference in Differences

Abstract: We present a new estimator for causal effects with panel data that builds on insights behind the widely used difference in differences and synthetic control methods. We find, both theoretically and in empirical studies, that this “synthetic difference in differences” estimator has desirable robustness properties relative to both difference in differences and synthetic controls, and that it performs well in settings where either of these conventional estimators is commonly used in practice. We also study the asymptotic behavior of the proposed estimator in a low-rank confounding model, and articulate conditions for consistency and asymptotic normality.

This is joint work with Dmitry Arkhangelsky, Susan Athey, David Hirshberg, and Guido Imbens.



Yuting Wei (CMU)

Title: Reliable hypothesis testing paradigms in high dimensions


Abstract: Modern scientific discovery and decision making require the development of trustworthy and informative inferential procedures, which are particularly challenging when coping with high-dimensional data. This talk presents two vignettes on the theme of reliable high-dimensional inference.

The first vignette considers performing inference based on the Lasso estimator when the number of covariates is of the same order as, or larger than, the number of observations. Classical asymptotic statistics theory fails for two fundamental reasons: (1) the regularized risk is non-smooth; (2) the discrepancy between the estimator and the true parameter vector cannot be neglected. We pin down the distribution of the Lasso, as well as its debiased version, under a broad class of Gaussian correlated designs with non-singular covariance structure. Our findings suggest that a careful degree-of-freedom correction is crucial for computing valid confidence intervals in this challenging regime.

The second vignette investigates the Model-X knockoffs framework, a general procedure that can leverage any feature importance measure to produce a variable selection algorithm. Model-X knockoffs rely on the construction of synthetic random variables and are, therefore, random. We propose a method for derandomizing, and hence stabilizing, Model-X knockoffs. By aggregating the selection results across multiple runs of the knockoffs algorithm, our method provides stable decisions without compromising statistical power. Our approach, when applied to the multi-stage GWAS of prostate cancer, reports locations on the genome that have been replicated in other studies.

The first vignette is based on joint work with Michael Celentano and Andrea Montanari, whereas the second one is based on joint work with Zhimei Ren and Emmanuel Candes.

Bio: Yuting Wei is currently an assistant professor in the Statistics and Data Science department at Carnegie Mellon University. Prior to that, she was a Stein Fellow at Stanford University, and she received her Ph.D. in statistics from the University of California, Berkeley, working with Martin Wainwright and Aditya Guntuboyina. She was the recipient of the 2018 Erich L. Lehmann Citation from the Berkeley statistics department for her Ph.D. dissertation in theoretical statistics. Her research interests include high-dimensional and non-parametric statistics, statistical machine learning, and reinforcement learning.


Alexander Aue (UC Davis)

Title: Random matrix theory aids statistical inference in high dimensions


Abstract: The first part of the talk is on bootstrapping spectral statistics in high dimensions. Spectral statistics play a central role in many multivariate testing problems. It is therefore of interest to approximate the distribution of functions of the eigenvalues of sample covariance matrices. Although bootstrap methods are an established approach to approximating the laws of spectral statistics in low-dimensional problems, these methods are relatively unexplored in the high-dimensional setting. The aim of this talk is to focus on linear spectral statistics (LSS) as a class of “prototype statistics” for developing a new bootstrap method in the high-dimensional setting. In essence, the method originates from the parametric bootstrap, and is motivated by the notion that, in high dimensions, it is difficult to obtain a non-parametric approximation to the full data-generating distribution. From a practical standpoint, the method is easy to use, and allows the user to circumvent the difficulties of complex asymptotic formulas for LSS. In addition to proving the consistency of the proposed method, I will discuss encouraging empirical results in a variety of settings. Lastly, and perhaps most interestingly, simulations indicate that the method can be applied successfully to statistics outside the class of LSS, such as the largest sample eigenvalue and others.

The second part of the talk briefly highlights two-sample tests in high dimensions by discussing a ridge-regularized generalization of Hotelling’s $T^2$. The main novelty of this work is in devising a method for selecting the regularization parameter based on the idea of maximizing power within a class of local alternatives. The performance of the proposed test procedures will be illustrated through an application to a breast cancer data set where the goal is to detect the pathways with different DNA copy number alterations across breast cancer subtypes.


Academic Holiday – No Seminar



Sebastian Engelke (University of Geneva)


Mathias Drton (TU Munich)