Statistics Seminar Series

Schedule for Spring 2024

Seminars are on Mondays
Time: 4:00pm - 5:00pm

Location: Room 903 SSW, 1255 Amsterdam Avenue



 Michael Celentano (Miller Fellow in the Statistics Department at the University of California, Berkeley)
Title: Debiasing in the inconsistency regime

Abstract: In this talk, I will discuss semi-parametric estimation when nuisance parameters cannot be estimated consistently, focusing in particular on the estimation of average treatment effects, conditional correlations, and linear effects under high-dimensional GLM specifications. In this challenging regime, even standard doubly-robust estimators can be inconsistent. I describe novel approaches which enjoy consistency guarantees for low-dimensional target parameters even though standard approaches fail. For some target parameters, these guarantees can also be used for inference. Finally, I will provide my perspective on the broader implications of this work for designing methods which are less sensitive to biases from high-dimensional prediction models.
Bio: Michael Celentano is a Miller Fellow in the Statistics Department at the University of California, Berkeley. He received his PhD in Statistics from Stanford University in 2021, where he was advised by Andrea Montanari. Most of his work focuses on the high-dimensional asymptotics of regression, classification, and matrix estimation problems.


*Friday 1/26/24

Time: *12:30pm

Location: Room 903 SSW

Ying Jin (Stanford)

Title: Model-free selective inference: from calibrated uncertainty to trusted decisions

Abstract: AI has shown great potential in accelerating decision-making and scientific discovery pipelines such as drug discovery, marketing, and healthcare. In many applications, predictions from black-box models are used to shortlist candidates whose unknown outcomes satisfy a desired property, e.g., drugs with high binding affinities to a disease target. To ensure the reliability of high-stakes decisions, uncertainty quantification tools such as conformal prediction have been increasingly adopted to understand the variability in black-box predictions. However, we find that the on-average guarantee of conformal prediction can be insufficient for its deployment in decision making which usually has a selective nature. 

In this talk, I will introduce a model-free selective inference framework that allows one to select reliable decisions with the assistance of any black-box prediction model. Our framework identifies candidates whose unobserved outcomes exceed user-specified values while controlling the average proportion of falsely selected units (FDR), without any modeling assumptions. Leveraging a set of exchangeable training data, our method constructs conformal p-values that quantify the confidence in large outcomes; it then determines a data-dependent threshold for the p-values as a criterion for drawing confident decisions. In addition, I will discuss new ideas to further deal with covariate shifts between training and new samples. We show that in several drug discovery tasks, our methods narrow down the drug candidates to a manageable set of promising ones while controlling the proportion of false discoveries. In a causal inference dataset, our methods identify students who benefit from an educational intervention, providing new insights into causal effects.


Tianhao Wang (Yale)

Title: Algorithm Dynamics in Modern Statistical Learning: Universality and Implicit Regularization

Abstract: Modern statistical learning is characterized by the high-dimensional nature of data and over-parameterization of models. In this regime, analyzing the dynamics of the algorithms used is challenging but crucial for understanding the performance of learned models. This talk will present recent results on the dynamics of two pivotal algorithms: Approximate Message Passing (AMP) and Stochastic Gradient Descent (SGD). Specifically, AMP refers to a class of iterative algorithms for solving large-scale statistical problems, whose dynamics admit asymptotically a simple but exact description known as state evolution. We will demonstrate the universality of AMP's state evolution over large classes of random matrices, and provide illustrative examples of applications of our universality results. Secondly, for SGD, a workhorse for training deep neural networks, we will introduce a novel mathematical framework for analyzing its implicit regularization. This is essential for SGD's ability to find solutions with strong generalization performance, particularly in the case of over-parameterization. Our framework offers a general method to characterize the implicit regularization induced by gradient noise. Finally, in the context of underdetermined linear regression, we will show that both AMP and SGD can provably achieve sparse recovery, yet they do so from markedly different perspectives.

Bio: Tianhao Wang is a final-year Ph.D. student in the Department of Statistics and Data Science at Yale University, advised by Prof. Zhou Fan. His research focuses on the mathematical foundations of statistics and machine learning.


*Wednesday 1/31/24

Time: *12:30pm

Location: Room 903 SSW

Sifan Liu (Stanford)

Title: An Exact Sampler for Inference after Polyhedral Selection

Abstract: The exploratory and interactive nature of modern data analysis often introduces selection bias, posing challenges for traditional statistical inference methods. A common strategy to address this bias is to condition on the selection event. However, this often results in a conditional distribution that is intractable and requires Markov chain Monte Carlo (MCMC) sampling for inference. Notably, some of the most widely used selection algorithms yield selection events that can be characterized as polyhedra, such as the lasso for variable selection and the epsilon-greedy algorithm for multi-armed bandit problems. This talk will present a method that is tailored for conducting inference following polyhedral selection. The method transforms the variables constrained within a polyhedron into variables within a unit cube, allowing for exact sampling. Compared to MCMC, the proposed method offers superior speed and accuracy, providing a practical and efficient approach for conditional selective inference. Additionally, it facilitates the computation of the selection-adjusted maximum likelihood estimator, enabling MLE-based inference. Numerical results demonstrate the enhanced performance of the proposed method compared to alternative approaches for selective inference.

Bio: Sifan Liu is a fifth-year Ph.D. student in the Department of Statistics at Stanford University. Her research interests are focused on selective inference and statistical computation.


Chris Harshaw (MIT)

Title: Algorithm Design for Randomized Experiments


Abstract: Randomized experiments are one of the most reliable causal inference methods and are used in a variety of disciplines, from clinical medicine and public policy to economics and corporate A/B testing. Experiments in these disciplines provide empirical evidence which drives some of the most important decisions in our society: Which drugs are prescribed? Which social programs are implemented? Which corporate strategies are used? Technological advances in measurements and intervention -- including high dimensional data, network data, and mobile devices -- offer exciting opportunities to design new experiments to investigate a broader set of causal questions. In these more complex settings, standard experimental designs (e.g. independent assignment of treatment) are far from optimal. Designing experiments which yield the most precise estimates of causal effects in these complex settings is not only a statistical problem, but also an algorithmic one.

In this talk, I will present my recent work on designing algorithms for randomized experiments. I will begin by presenting Clip-OGD, a new algorithmic experimental design for adaptive sequential experiments. We show that under the Clip-OGD design, the variance of an adaptive version of the Horvitz-Thompson estimator converges to the optimal non-adaptive variance, resolving a 70-year-old problem posed by Robbins in 1952. Our results are facilitated by drawing connections to regret minimization in online convex optimization. Time permitting, I will describe a new unifying framework for investigating causal effects under interference, where treatment given to one subject can affect the outcomes of other subjects. Finally, I will conclude by highlighting open problems and reflecting on future work in these directions.


Bio: Christopher Harshaw is a FODSI postdoc at MIT and UC Berkeley. He received his PhD from Yale University where he was advised by Dan Spielman and Amin Karbasi. His research lies at the interface of causal inference, machine learning, and algorithm design, with a particular focus on the design and analysis of randomized experiments. His work has appeared in the Journal of the American Statistical Association, Electronic Journal of Statistics, ICML, and NeurIPS, and won the Best Paper Award at the NeurIPS 2022 workshop CML4Impact.


*Wednesday 2/7/24

Time: *12:00pm

Location: Room 903 SSW

Enric Boix-Adsera (MIT)

Title: Beyond the black box: characterizing and improving how neural networks learn

Abstract: The predominant paradigm in deep learning practice treats neural networks as "black boxes". This leads to economic and environmental costs as brute-force scaling remains the performance driver, and to safety issues as robust reasoning and alignment remain challenging. My research opens up the neural network black box with mathematical and statistical analyses of how networks learn, and yields engineering insights that improve the efficiency and transparency of these models. In this talk I will present characterizations of (1) how large language models can learn to reason with abstract symbols, and (2) how hierarchical structure in data guides deep learning, and will conclude with (3) new tools to distill trained neural networks into lightweight and transparent models.

Bio: Enric Boix-Adsera is a PhD candidate at MIT, under the supervision of Guy Bresler and Philippe Rigollet. His PhD research has been supported by an NSF Graduate Research Fellowship, a Siebel Fellowship, and an Apple AI/ML fellowship.


Jonathan Niles-Weed (NYU)

Title: Learning Matchings, Maps, and Trajectories

Abstract: This talk will survey some recent advances in the statistical theory of optimal transport. Optimal transport considers the geometrical properties of transformations of probability distributions, making it a suitable framework for many applications in generative modeling, causal inference, and the sciences. We will study estimators for this problem, characterizing their finite-sample behavior and obtaining distributional limits suitable for practical inference. Additionally, we will explore structural assumptions that improve the statistical and computational performance of these estimators in high dimensions.


Speaker: Jiashun Jin (Carnegie Mellon University)

Title: The Statistics Triangle

Abstract: In his Fisher’s Lecture in 1996, Efron suggested that there is a philosophical triangle in statistics with “Bayesian”, “Fisherian”, and “Frequentist” being the three vertices, and most of the statistical methods can be viewed as a convex linear combination of the three philosophies. We collected and cleaned a data set consisting of the citation and BibTeX (e.g., title, abstract, author information) data of 83,331 papers published in 36 journals in statistics and related fields, spanning 41 years. Using the data set, we constructed 21 co-citation networks, each for a time window between 1990 and 2015. We propose a dynamic Degree-Corrected Mixed-Membership (dynamic-DCMM) model, where we model the research interests of an author by a low-dimensional weight vector (called the network memberships) that evolves slowly over time. We propose dynamic-SCORE as a new approach to estimating the memberships. We discover a triangle in the spectral domain which we call the Statistical Triangle, and use it to visualize the research trajectories of individual authors. We interpret the three vertices of the triangle as the three primary research areas in statistics: “Bayes”, “Biostatistics,” and “Nonparametrics”. The Statistical Triangle further splits into 15 sub-regions, which we interpret as the 15 representative sub-areas in statistics. These results provide useful insights into the research trends and behavior of statisticians.

Bio: Jiashun Jin is a Professor of Statistics and Data Science at Carnegie Mellon University. He is interested in statistical machine learning, social networks, genomics and genetics, and neuroscience. His primary research interest is analyzing big data with sparse and weak signals. He has been developing methods appropriate for such settings, including large-scale testing, classification, clustering, variable selection, and, more recently, network analysis and low-rank matrix recovery. Jiashun received the NSF CAREER award and the IMS Tweedie Award and is an elected IMS Fellow. He delivered an IMS Medallion Lecture (2015), the IMS Tweedie Lecture (2009), and other plenary or keynote lectures.


Speaker: Annie Qu (UC Irvine)

Title: A Model-Agnostic Graph Neural Network for Integrating Local and Global Information

Abstract: Graph neural networks (GNNs) have achieved promising performance in a variety of graph-focused tasks. Despite their success, two major limitations of existing GNNs are their limited capability to learn various-order representations and to provide interpretability for such deep learning-based black-box models. To tackle these issues, we propose a novel Model-agnostic Graph Neural Network (MaGNet) framework. The proposed framework is able to extract knowledge from high-order neighbors, sequentially integrate information of various orders, and offer explanations for the learned model by identifying influential compact graph structures. In particular, MaGNet consists of two components: an estimation model for the latent representation of complex relationships under graph topology, and an interpretation model that identifies influential nodes, edges, and important node features. Theoretically, we establish the generalization error bound for MaGNet via empirical Rademacher complexity and showcase its power to represent the layer-wise neighborhood mixing. We conduct comprehensive numerical studies using both simulated data and a real-world case study on investigating the neural mechanisms of the rat hippocampus, demonstrating that the performance of MaGNet is competitive with state-of-the-art methods.

Bio: Annie Qu is Chancellor’s Professor, Department of Statistics, University of California, Irvine. She received her Ph.D. in Statistics from the Pennsylvania State University in 1998. Qu’s research focuses on solving fundamental issues regarding structured and unstructured large-scale data and developing cutting-edge statistical methods and theory in machine learning and algorithms for personalized medicine, text mining, recommender systems, medical imaging data, and network data analyses for complex heterogeneous data. The newly developed methods can extract essential and relevant information from large volumes of intensively collected data, such as mobile health data. Her research impacts many fields, including biomedical studies, genomic research, public health research, and social and political sciences. Before joining UC Irvine, Dr. Qu was a Data Science Founder Professor of Statistics and the Director of the Illinois Statistics Office at the University of Illinois at Urbana-Champaign. She was awarded the Brad and Karen Smith Professorial Scholar by the College of LAS at UIUC and was a recipient of the NSF CAREER award from 2004 to 2009. She is a Fellow of the Institute of Mathematical Statistics (IMS), the American Statistical Association, and the American Association for the Advancement of Science. She is also a recipient of the IMS Medallion Award and Lecturer in 2024. She serves as Journal of the American Statistical Association Theory and Methods Co-Editor from 2023 to 2025 and as IMS Program Secretary from 2021 to 2027.

Qu Lab website:


Speaker: Raaz Dwivedi (Cornell)

Title: Integrating Double Robustness into Causal Latent Factor Models

Abstract: There is a growing literature on latent factor models with panel data, where multiple measurements across various units under multiple treatments are available. These models are compatible with both observed and unobserved confounding (external variables affecting the treatment and the outcome simultaneously), making them a popular choice for estimating treatment effects. Standard approaches are based on outcome imputation, including nearest neighbors for individual treatment effects (ITE) and generic matrix completion for average treatment effects (ATE). These rely, respectively, on unit similarities or low-rank structure in the outcome matrix, and are consequently known to perform poorly with diverse units or a non-low-rank outcome matrix.

To tackle these challenges, we integrate double robustness principles with factor models, introducing estimators designed to overcome them. First, we propose a doubly robust nearest neighbor approach for ITE, achieving consistent estimates in the presence of either similar measurements or similar units, and enhanced error/confidence intervals in the presence of both. Next, we introduce a doubly robust matrix completion strategy for ATE despite unobserved confounding, which ensures consistency with either a low-rank propensity matrix or a low-rank outcome matrix, and offers superior error rates and confidence intervals when both matrices are low-rank.

Bio: Raaz Dwivedi joined the Department of Operations Research and Information Engineering and Cornell Tech at Cornell University as an Assistant Professor in Jan 2024. Prior to that, he visited Cornell ORIE in Fall 2023, spent two years as a FODSI postdoctoral fellow at Harvard and MIT LIDS, and spent a summer at Microsoft Research New England. He completed his Ph.D. in EECS at UC Berkeley in 2021 and his bachelor's in EE at IIT Bombay in 2014. His research builds statistically and computationally efficient strategies for personalized decision-making with theory and methods spanning the areas of causal inference, reinforcement learning, and distribution compression. He has won a best student paper award for work on optimal compression, teaching awards at Harvard and UC Berkeley, and the President of India Gold Medal at IIT Bombay.