Statistics Seminar – Spring 2025

Schedule for Spring 2025

Seminars are on Mondays
Time: 4:10 pm – 5:00 pm

Location: Room 903 SSW, 1255 Amsterdam Avenue



Speaker:  Professor Ilias Zadik

Title: Characterizing the power of MCMC methods for sparse estimation

Abstract: Markov Chain Monte Carlo (MCMC) and local-search methods have been extensively used in the practice of statistics for many decades now. However, their exact theoretical performance has remained strikingly elusive, even for simple parametric estimation tasks. This is in stark contrast to other classes of estimators, such as low-degree polynomials or spectral methods, where a much deeper theoretical understanding has been achieved in recent years.

In this talk we are going to discuss several recent results that characterize the performance of a large class of (low-temperature) MCMC methods for a series of canonical estimation models, such as sparse tensor PCA and sparse linear regression. This characterization reveals that in some models (e.g., sparse regression) MCMC methods achieve the performance of the conjecturally optimal polynomial-time estimators, but in other cases (e.g., sparse PCA) they significantly underperform. If time permits, we are going to discuss how one can boost MCMC methods for sparse PCA to achieve polynomial-time optimality by appropriately extending the parameter space.

Bio:  Ilias Zadik is an Assistant Professor of Statistics and Data Science at Yale University. His research mainly focuses on the mathematical theory of statistics and its many connections with other fields such as computer science, probability theory, and statistical physics. His primary area of interest is the study of “computational-statistical trade-offs,” where the goal is to understand whether computational bottlenecks are unavoidable in modern statistical models or a limitation of currently used techniques. Prior to Yale, he held postdoctoral positions at MIT and NYU. He received his PhD from MIT in 2019. 


Speaker:  Professor Johan Ugander (Stanford University)

Title: Counterfactual Evaluation of Peer-Review Assignment Policies

Abstract: Peer-review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignment, in order to mitigate fraud, as a valuable opportunity to evaluate counterfactual assignment policies. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce methods for partial identification based on monotonicity and smoothness assumptions. We apply our methods to peer-review data from two computer science venues, including a major conference with over 8000 submissions and over 3000 reviewers. Joint work with Martin Saveski, Steven Jecmen, Samir Khan, and Nihar Shah.

Bio: Johan Ugander is an Associate Professor at Stanford University in the Department of Management Science & Engineering, within the School of Engineering. His research develops algorithmic and statistical frameworks for analyzing social networks, social systems, and other large-scale social and behavioral data. Prior to joining the Stanford faculty, he was a postdoctoral researcher at Microsoft Research Redmond from 2014 to 2015 and held an affiliation with the Facebook Data Science team from 2010 to 2014. He obtained his Ph.D. in Applied Mathematics from Cornell University in 2014. His awards include an NSF CAREER Award, a Young Investigator Award from the Army Research Office (ARO), three Best Paper Awards (2012 ACM WebSci Best Paper, 2013 ACM WSDM Best Student Paper, 2020 AAAI ICWSM Best Paper), and the 2016 Eugene L. Grant Undergraduate Teaching Award from the Department of Management Science & Engineering.


Speaker: Dylan Foster (Microsoft Research)

Title: Is Behavior Cloning All You Need? Revisiting the Role of Horizon and Interaction in Imitation Learning

Abstract: Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision making task by learning from demonstrations, and has been widely applied to robotics, autonomous driving, and autoregressive language generation. The simplest approach to IL, behavior cloning (BC), is thought to incur sample complexity with unfavorable quadratic dependence on the problem horizon, motivating a variety of different online algorithms that attain improved linear horizon dependence under stronger assumptions on the data and the learner’s access to the expert.

In this talk, we revisit the apparent gap between offline and online IL from a learning-theoretic perspective, with a focus on general policy classes up to and including deep neural networks. Through a new analysis of behavior cloning with the logarithmic loss, we will show that it is possible to achieve horizon-independent sample complexity in offline IL whenever (i) the range of the cumulative payoffs is controlled, and (ii) an appropriate notion of supervised learning complexity for the policy class is controlled. When specialized to stationary policies, this implies that the gap between offline and online IL is smaller than previously thought. We will then discuss implications of this result and investigate the extent to which it bears out empirically.

Bio: Dylan Foster is a principal researcher at Microsoft Research, New York. Previously, he was a postdoctoral fellow at MIT and received his PhD in computer science from Cornell University, advised by Karthik Sridharan. His research focuses on problems at the intersection of machine learning and decision making. He has received several awards for his work, including the best paper award at COLT (2019) and best student paper award at COLT (2018, 2019).



Speaker: Professor Linda Valeri (Columbia University)

Title:  Causal inference in non-stationary time series from mHealth studies in Psychiatry

Abstract:  The adoption of digital technologies in Psychiatry holds promise for the evaluation of personalized causal effects to better inform behavioral treatment decisions in a patient population that displays substantial diversity in symptomatology even within the same diagnostic category.

In this presentation I will discuss challenges in estimating the individual causal effect of mobile communication social network size on the negative mood of bipolar and schizophrenia patients enrolled in a cohort study that is part of the Intensive Longitudinal Health Behavior Network. The first challenge is missing data, potentially dependent on participant health status, and the second challenge is non-stationarity of the time series, where the treatment effect may change over time. To address these challenges, we propose a Monte Carlo EM (MCEM) algorithm for the state space model to properly handle missing data in non-stationary multivariate time series. We also propose a set of novel causal estimands for (potentially non-stationary) multivariate time series in N-of-1 studies to systematically summarize how time-varying exposures affect outcomes in the short and long term, and we derive their identification via the g-formula in the presence of exposure-covariate and outcome-covariate feedback.

Bio:  Linda Valeri is an assistant professor of Biostatistics at the Columbia University Mailman School of Public Health and adjunct assistant professor of Epidemiology at the Harvard T.H. Chan School of Public Health. She is an expert biostatistician specializing in causal inference, with a focus on biostatistical methodology and statistical learning. She received her doctoral degree in Biostatistics from Harvard University. Her research encompasses causal mediation analysis, measurement error, missing data, and the integration of data from multiple sources, such as smartphone and wearable devices, life-course cohort studies, and electronic medical records, in diverse populations. Dr. Valeri has developed widely utilized open-access computational tools for causal inference, benefiting scientists across biomedical and social sciences. As PI of a career development award from the National Institute of Mental Health, of an R01 research grant from the National Institute on Aging, and of an R21 from the National Institute of Environmental Health Sciences, she collaborates with interdisciplinary teams to advance our understanding of mental health across the life-course, environmental determinants of health, and health disparities, contributing to informed policy-making.

She is active as associate editor for the International Journal of Biostatistics and as Statistical Editor for JAMA Psychiatry and JAMA Network Open. Dr. Valeri is also enthusiastic in her service to her local community, serving as the Biostatistics faculty representative on the Columbia Mailman School of Public Health Faculty Steering Committee, as well as to the regional and national community of statisticians and biostatisticians, serving as a member of the Regional Advisory Board of the Eastern North American Region of the International Biometric Society and as elected Council of Sections Representative for the ASA Mental Health Statistics Section.


Speaker:  Professor Stephen Bates (MIT)





Speaker:  Professor Reese Pathak (UC Berkeley)





Speaker:  Professor Zhou Fan (Yale University)







Speaker:  Professor Victor Panaretos (EPFL)

Title:  Positive-Definite Extensions and Continuum Graphical Models

Abstract:  We discuss the problem of positive-definite continuation: extending a partially specified covariance kernel from a subdomain Ω of the unit square [0,1]^2 to a covariance kernel on the entire unit square [0,1]^2. For a broad class of subdomains Ω, we obtain a complete picture. Namely, we demonstrate that a canonical extension always exists and can be explicitly constructed. We characterise all possible extensions as suitable perturbations of the canonical extension, and determine necessary and sufficient conditions for a unique extension to exist. We then re-interpret the canonical extension as a graphical model on the associated Gaussian process. We show that this leads to a valid and operational definition for arbitrarily (e.g. uncountably infinitely) indexed Gaussian processes, based directly on the covariance kernel, and describe how this allows for nonparametric estimation of the underlying Markov structure. Based on joint work with K.G. Waghmare (ETH Zürich).

Bio:  Victor M. Panaretos is Professor of Mathematical Statistics at EPFL. He received his PhD in 2007 from UC Berkeley, advised by David Brillinger. Upon graduation he was appointed Assistant Professor at EPFL’s Mathematics Institute, where he rose through the ranks to Full Professor, also serving as Institute Director. His work studies the interplay between geometrical, functional, and nonparametric statistics. He received the Erich Lehmann Award and an ERC Starting Grant Award. He is an Elected Member of the ISI, a Fellow of the IMS, was the Bernoulli Society Forum Lecturer in 2019, and an IMS Medallion Lecturer in 2025. He has served on the Editorial Boards of the Annals of Statistics, the Annals of Applied Statistics, Biometrika, JASA (Theory & Methods), and EJS. He is currently serving as President of the Bernoulli Society for Mathematical Statistics and Probability.


Speaker:  Professor Stephane Guerrier (University of Geneva)







Speaker:  Professor Oliver Feng (University of Bath)





Speaker:  Professor Ali Shojaie (University of Washington)

Title: A Unified Framework for Semiparametrically Efficient Semi-Supervised Learning

Abstract:  We consider statistical inference under a semi-supervised setting with access to both labeled and unlabeled datasets and ask the question: under what circumstances, and by how much, can incorporating the unlabeled dataset improve upon inference using the labeled data? To answer this question, we investigate semi-supervised learning through the lens of semiparametric efficiency theory. We characterize the efficiency lower bound under the semi-supervised setting for an arbitrary inferential problem, and show that incorporating unlabeled data can potentially improve efficiency if the parameter is not well-specified. We then propose two types of semi-supervised estimators: a safe estimator that imposes minimal assumptions, is simple to compute, and is guaranteed to be at least as efficient as the initial supervised estimator; and an efficient estimator, which (under stronger assumptions) achieves the semiparametric efficiency bound. Our findings unify existing semiparametric efficiency results for particular special cases, and extend these results to a much more general class of problems. Moreover, we show that our estimators can flexibly incorporate predicted outcomes arising from “black-box” machine learning models, and thereby achieve the same goal as prediction-powered inference (PPI), but with superior theoretical guarantees. We also provide a complete understanding of the theoretical basis for the existing set of PPI methods. Finally, we apply the theoretical framework developed to derive and analyze efficient semi-supervised estimators in a number of settings, including M-estimation, U-statistics, and average treatment effect estimation, and demonstrate the performance of the proposed estimators in simulation.

Bio:  Ali Shojaie is Norm Breslow Endowed Professor of Biostatistics and Statistics at the University of Washington (UW). He is Associate Chair of the Department of Biostatistics, Founding Director of the Summer Institute for Statistics in Big Data (SISBID) at the University of Washington, and Lead of the Data Management and Statistics (DMS) Core of the UW Alzheimer’s Disease Research Center (ADRC). Dr. Shojaie’s research lies at the intersection of statistical machine learning, statistical network analysis, and applications in biology and social sciences. He is an elected Fellow of the American Statistical Association (ASA) and the Institute of Mathematical Statistics (IMS) and recipient of the 2022 Leo Breiman Award from the ASA Section on Statistical Learning and Data Science (SLDS).








Speaker: Professor Tengyuan Liang (University of Chicago)

Title:  Denoising Diffusions with Optimal Transport

Abstract:  Adding noise is easy; what about denoising? Diffusion is easy; what about reverting a diffusion? Diffusion-based generative models aim to denoise a Langevin diffusion chain, moving from a log-concave equilibrium measure ν, say isotropic Gaussian, back to a complex, possibly non-log-concave initial measure µ. The score function performs denoising, going backward in time, predicting the conditional mean of the past location given the current. We show that score denoising is the optimal backward map in transportation cost. What is its localization uncertainty? We show that the curvature function determines this localization uncertainty, measured as the conditional variance of the past location given the current. We study in this paper the effectiveness of the diffuse-then-denoise process: the contraction of the forward diffusion chain, offset by the possible expansion of the backward denoising chain, governs the denoising difficulty. For any initial measure µ, we prove that this offset net contraction at time t is characterized by the curvature complexity of a smoothed µ at a specific signal-to-noise ratio (SNR) scale r(t). We discover that the multi-scale curvature complexity collectively determines the difficulty of the denoising chain. Our multi-scale complexity quantifies a fine-grained notion of average-case curvature instead of the worst-case. Curiously, it depends on an integrated tail function, measuring the relative mass of locations with positive curvature versus those with negative curvature; denoising at a specific SNR scale is easy if such an integrated tail is light. We conclude with several non-log-concave examples to demonstrate how the multi-scale complexity probes the bottleneck SNR for the diffuse-then-denoise process.



Speaker:  Professor Philippe Rigollet (MIT)


