Statistics Seminar Series


Schedule for Spring 2022

All talks are available online, via Zoom. Select talks take place in hybrid mode. In-person participation is only available to Columbia affiliates with building access.

Seminars are on Mondays
Time: 4:00pm - 5:00pm



Time: 2:00pm - 3:00pm


Nikolaos Ignatiadis (Stanford)

Title: Nonparametric Empirical Bayes Inference

Abstract: In an empirical Bayes analysis, we use data from repeated sampling to imitate inferences made by an oracle Bayesian with extensive knowledge of the data-generating distribution. Existing results provide a comprehensive characterization of when and why empirical Bayes point estimates accurately recover oracle Bayes behavior. In the first part of this talk, we construct flexible and practical nonparametric confidence intervals that provide asymptotic frequentist coverage of empirical Bayes estimands, such as the posterior mean and the local false sign rate. From a methodological perspective we build upon results on affine minimax estimation, and our coverage statements hold even when estimands are only partially identified or when empirical Bayes point estimates converge very slowly. In the second part of the talk, we build on these results to study randomization-based inference for treatment effects in the regression discontinuity design under a model where the running variable has exogenous measurement error.

Join Zoom:

Meeting ID: 938 8259 9598

Passcode: 609650



Time: 2:00pm - 3:00pm

Mark Sellke (Stanford)

Title: Geometric Aspects of Optimization, Old and New

Abstract: High-dimensional optimization plays a crucial role in modern statistics and machine learning. I will present recent progress on two problems in this area with connections to sequential decision making (part one), and optimization of random non-convex functions (part two).

The first part focuses on convex body chasing, which models convex optimization in a sequential and non-stationary setting. I will explain a solution to this problem based on the Steiner point from 19th century geometry. The resulting guarantee is exactly optimal in some sense and closes an exponential gap.
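As a small illustration of the Steiner point mentioned above, the sketch below computes it for a convex polygon in the plane. It relies on the standard fact that, for a convex polygon, the Steiner point is the average of the vertices weighted by their exterior angles divided by 2π. The function name `steiner_point` and the example polygons are assumptions made for illustration; this is not the convex body chasing algorithm from the talk.

```python
import math


def steiner_point(vertices):
    """Steiner point of a convex polygon with counter-clockwise vertices.

    Each vertex is weighted by its exterior (turning) angle divided by 2*pi;
    for a convex polygon these weights are nonnegative and sum to one.
    """
    n = len(vertices)
    sx = sy = 0.0
    for i in range(n):
        px, py = vertices[i - 1]          # previous vertex
        cx, cy = vertices[i]              # current vertex
        nx, ny = vertices[(i + 1) % n]    # next vertex
        # Directions of the incoming and outgoing edges at this vertex.
        angle_in = math.atan2(cy - py, cx - px)
        angle_out = math.atan2(ny - cy, nx - cx)
        # Exterior angle, wrapped into [0, 2*pi).
        turn = (angle_out - angle_in) % (2 * math.pi)
        weight = turn / (2 * math.pi)
        sx += weight * cx
        sy += weight * cy
    return sx, sy


# Hypothetical example polygons.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
triangle = [(0, 0), (1, 0), (0, 1)]
print(steiner_point(square))
print(steiner_point(triangle))
```

For the square the weights are equal, so the Steiner point coincides with the center. For the right triangle the two acute vertices receive larger weights than the right-angle vertex.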

The second part concerns optimization of mean-field spin glass Hamiltonians, a family of random non-convex functions. These functions have been studied in probability and physics for decades, and their optimization is related to problems such as clustering and spiked tensor estimation. We will see that a natural class of optimization algorithms "gets stuck" at an algorithmic threshold related to geometric properties of the landscape. This class of algorithms includes general gradient-based methods on dimension-free time scales. In particular, we determine when such algorithms can reach asymptotic optimality.

Join Zoom

Meeting ID: 927 0725 0874
Passcode: 891180



Time: 10:00am - 11:00am

Bianca Dumitrascu (University of Cambridge)

Title: Statistical Machine learning for genetics and health: multi-modality, interpretability, mechanism

Abstract: Genomic and medical data are available at unprecedented scales. This is due, in part, to improvements and developments in data collection, high throughput sequencing, and imaging technologies. How can we extract lower dimensional representations from these high dimensional data in a way that retains fundamental biological properties across different scales? Three main challenges arise in this context: how to aggregate information across different experimental modalities, how to enforce that such representations are interpretable, and how to leverage prior dynamical knowledge to provide new insight into mechanism. I will present my work on developing statistical machine learning models and algorithms to answer this question and address these challenges. First, I will present a generative model for learning representations that jointly model information from gene expression and tissue morphology in a population setting. Then, I will describe a method for making multi-modal representations interpretable using a label-aware compressive classification approach for gene panel selection in single cell data. Finally, I will discuss inference methods for models which encode mechanistic assumptions, a need that arises naturally in gene regulatory networks, predator-prey systems, and electronic health care records. Throughout this work, recent advances in machine learning and statistics are harnessed to bridge two worlds -- the world of real, messy biological data and that of methodology and computation. This talk describes the importance of domain knowledge and data-centric modeling in motivating new statistical venues and introduces new ideas that touch upon improving experimental design in biomedical contexts.

Join Zoom

Meeting ID: 910 3199 4066
Passcode: 630430



Time: 2:00pm - 3:00pm

Lihua Lei (Stanford)

Title: What Can Conformal Inference Offer To Statistics?

Abstract: Valid uncertainty quantification is crucial for high-stakes decision-making. Conformal inference provides a powerful framework that can wrap around any black-box prediction algorithm, like random forests or deep neural networks, and generate prediction intervals with distribution-free coverage guarantees. In this talk, I will describe how conformal inference can be adapted to handle more complicated inferential tasks in statistics.

I will mainly focus on two important statistical problems: counterfactual inference and time-to-event analysis. In practice, the former can be used as a building block to infer individual treatment effects, and the latter can be applied for individual risk assessment. Unlike standard prediction problems, the predictive targets are only partially observable owing to selection and censoring. When the missing data mechanism is known, as in randomized experiments, our conformal inference-based approaches achieve the desired coverage in finite samples without any assumption on the conditional distribution of the outcomes or the accuracy of the predictive algorithm; when the missing data mechanism is unknown, they satisfy a doubly robust guarantee of coverage. We demonstrate on both simulated and real datasets that conformal inference-based methods provide more reliable uncertainty quantification than other popular methods, which suffer from a substantial coverage deficit even in simple models. I will also briefly mention my work on adapting and generalizing conformal inference to other statistical problems, including election, outlier detection, and risk-calibrated predictions.
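To make the idea of wrapping conformal inference around a black-box predictor concrete, here is a minimal sketch of standard split conformal prediction with absolute residuals as scores. It covers only the basic fully observed setting, not the counterfactual or censored-data extensions described in the abstract. The predictor `predict`, the simulated data, and all function names are assumptions chosen for illustration.

```python
import math
import random


def split_conformal_interval(cal_x, cal_y, predict, x_new, alpha=0.1):
    """Prediction interval for x_new from a held-out calibration set."""
    # Conformity scores: absolute residuals of the black-box predictor.
    scores = sorted(abs(y - predict(x)) for x, y in zip(cal_x, cal_y))
    n = len(scores)
    # Rank of the order statistic that gives the finite-sample guarantee.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: the interval is trivial.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    center = predict(x_new)
    return (center - q, center + q)


def predict(x):
    # Hypothetical black-box predictor, deliberately misspecified.
    return 1.5 * x


rng = random.Random(0)


def draw(m):
    # Simulated data: y = 2x + Gaussian noise (an assumption for the demo).
    xs = [rng.uniform(0, 1) for _ in range(m)]
    ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


cal_x, cal_y = draw(500)
test_x, test_y = draw(2000)

covered = 0
for x, y in zip(test_x, test_y):
    lower, upper = split_conformal_interval(cal_x, cal_y, predict, x)
    covered += lower <= y <= upper
coverage = covered / len(test_x)
print(f"empirical coverage: {coverage:.3f}")
```

Even though the predictor is wrong, the empirical coverage on fresh data should be close to the nominal level, because the guarantee only requires exchangeability of calibration and test points.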

Join Zoom

Meeting ID: 952 1468 8689
Passcode: 698690


Time: 1:00pm - 2:00pm

Georgia Papadogeorgou (University of Florida)

Title: Causal inference with spatio-temporal data. 

Abstract: Many causal processes have spatial and temporal dimensions. Yet the classic causal inference framework is not directly applicable when the treatment and outcome variables are generated by spatio-temporal processes with an infinite number of possible event locations. We extend the potential outcomes framework to these settings by formulating the treatment point process as a stochastic intervention. Our causal estimands include the expected number of outcome events in a specified area under a particular stochastic treatment assignment strategy. We develop methodology that allows for arbitrary patterns of spatial spillover and temporal carryover effects. Using martingale theory, we show that the proposed estimator is consistent and asymptotically normal as the number of time periods increases, even when the propensity score is estimated. We propose a sensitivity analysis for the possible existence of unmeasured confounders, and extend it to the Hajek estimator. We use the proposed methods to estimate the effects of American airstrikes on insurgent violence in Iraq from February 2007 to July 2008. We find that increasing the average number of daily airstrikes for up to one month results in more insurgent attacks across Iraq and within Baghdad. We also find evidence that airstrikes can displace attacks from Baghdad to new locations up to 400 kilometers away.

Join Zoom Meeting

Meeting ID: 999 3036 3471
Passcode: 578539



Time: 1:00pm - 2:00pm

Colin Fogarty (MIT)

Title: Unifying Modes of Inference for Randomized Experiments

Abstract: Competing approaches to inference in randomized experiments differ primarily in (1) which notion of "no treatment effect" is being tested; and (2) whether or not a superpopulation model is posited for the potential outcomes. Recommended hypothesis tests in a given paradigm may be invalid even asymptotically when applied in other frameworks, creating the risk of misinterpretation by practitioners when a given method is deployed. For a large class of test statistics common in practice, we develop a general framework for ensuring validity across competing modes of inference. To do this, we employ permutation tests based upon prepivoted test statistics, wherein a test statistic is first transformed by a suitably constructed cumulative distribution function and its permutation distribution is then enumerated. In essence, the approach uses the permutation distribution of a p-value for a large-sample test known to be valid under the null hypothesis of no average treatment effect as a reference distribution. The framework readily accommodates regression-adjusted estimators of average treatment effects, and the corresponding tests are never less powerful asymptotically than a test based upon the unadjusted estimator and maintain asymptotic validity even if the regression model is misspecified. The tests retain finite-sample exactness under stricter definitions of no treatment effect such as Fisher’s sharp null by virtue of being permutation tests, and validity across different superpopulation models can be ensured through the choice of the estimated CDF used when prepivoting.
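For readers unfamiliar with permutation tests, the sketch below shows the plain Monte Carlo version for a completely randomized experiment, using the absolute difference in means as the test statistic. It illustrates the reference distribution under Fisher's sharp null only; it does not implement the prepivoting step described in the abstract. The function name and the example data are assumptions.

```python
import random


def permutation_p_value(treated, control, n_perm=5000, seed=0):
    """Monte Carlo permutation p-value for the absolute difference in means."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)

    hits = 0
    for _ in range(n_perm):
        # Re-randomize treatment labels by shuffling the pooled outcomes.
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if stat >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical outcomes with a large treatment effect.
treated = [10, 11, 12, 13, 14, 15]
control = [0, 1, 2, 3, 4, 5]
print(permutation_p_value(treated, control))
```

The test is exact under the sharp null by construction, but, as the abstract notes, a test built on an unstudentized statistic like this one need not remain valid for the weaker null of no average treatment effect.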

Join Zoom Meeting

Meeting ID: 910 8753 8334
Passcode: 135699



Time: 10:00am - 11:00am

Alisa Knizel (University of Chicago)

Title: Interacting particle systems and beyond

Abstract: I will discuss my work related to the KPZ equation and random matrices. The Kardar-Parisi-Zhang (KPZ) equation is a fundamental stochastic partial differential equation. It models stochastic interface growth that we can witness regularly in everyday life. Random matrix theory is a very active area of research with applications ranging from physics to data science and finance. A natural question is the study of the behavior of linear statistics of the eigenvalues as the size of the matrix grows.

Join Zoom

Meeting ID: 938 3836 1437
Passcode: 852687



Time: 1:00pm - 2:00pm

Eliza O'Reilly (Caltech)

Title: Stochastic and Convex Geometry for the Analysis of Complex Data

Abstract: Many modern problems in data science aim to efficiently and accurately extract important features and make predictions from high dimensional and large data sets. While there are many empirically successful methods to achieve these goals, large gaps between theory and practice remain.  A geometric viewpoint is often useful to address these challenges as it provides a unifying perspective of structure in data, complexity of statistical models, and tractability of computational methods.  As a consequence, an understanding of problem geometry leads both to new insights on existing methods as well as new models and algorithms that address drawbacks in existing methodology.

In this talk, I will present recent progress on two problems where the relevant model can be viewed as the projection of a lifted formulation with a simple stochastic or convex geometric description. In particular, I will first describe how the theory of stationary random tessellations in stochastic geometry can address computational and theoretical challenges of random decision forests with non-axis-aligned splits. Second, I will present a new approach to convex regression that returns non-polyhedral convex estimators compatible with semidefinite programming. These works open a number of future research directions at the intersection of stochastic and convex geometry, statistical learning theory, and optimization.  

Join Zoom

Meeting ID: 918 5170 0572
Passcode: 838239



Time: 1:00pm - 2:00pm

Anish Agarwal (MIT)

Title: Causal Inference for Socio-Economic and Engineering Systems


Abstract: What will happen to Y if we do A?

A variety of meaningful socio-economic and engineering questions can be formulated this way. To name a few: What will happen to a patient’s health if they are given a new therapy? What will happen to a country’s economy if policy-makers legislate a new tax? What will happen to a company’s revenue if a new discount is introduced? What will happen to a data center’s latency if a new congestion control protocol is used? In this talk, we will explore how to answer such counterfactual questions using observational data---which is increasingly available due to digitization and pervasive sensors---and/or very limited experimental data. The two key challenges in doing so are: (i) counterfactual prediction in the presence of latent confounders; (ii) estimation with modern datasets which are high-dimensional, noisy, and sparse.

Towards this goal, the key framework we introduce connects causal inference with tensor completion, a very active area of research across a variety of fields. In particular, we show how to represent the various potential outcomes (i.e., counterfactuals) of interest through an order-3 tensor. The key theoretical results presented are: (i) formal identification results establishing under what missingness patterns, latent confounding, and structure on the tensor recovery of unobserved potential outcomes is possible; (ii) novel estimators that recover these unobserved potential outcomes, with proofs that they are finite-sample consistent and asymptotically normal.

The efficacy of the proposed estimators is shown on high-impact real-world applications. These include working with: (i) TauRx Therapeutics to propose novel clinical trial designs to reduce the number of patients recruited for a trial and to correct for bias from patient dropouts. (ii) Uber Technologies on evaluating the impact of certain driver engagement policies without having to run an A/B test. (iii) U.S. and Indian policy-makers to evaluate the impact of mobility restrictions on COVID-19 mortality outcomes. (iv) The Poverty Action Lab (J-PAL) at MIT to make personalized policy recommendations to improve childhood immunization rates across different villages in Haryana, India.

Finally, we discuss connections between causal inference, tensor completion, and offline reinforcement learning. 

Join Zoom 
Meeting ID: 940 3480 5546
Passcode: 984426


No Seminar


No Seminar


Kaizheng Wang (Columbia IEOR)

Title: Adaptive and robust multi-task learning

Abstract: In this talk, we study the multi-task learning problem that aims to simultaneously analyze multiple datasets collected from different sources and learn one model for each of them. We propose a family of adaptive methods that automatically utilize possible similarities among those tasks while carefully handling their differences. We derive optimal statistical guarantees for the methods and prove their robustness against outlier tasks. Numerical experiments on synthetic and real datasets demonstrate the efficacies of our new methods. The talk is based on joint work with Yaqi Duan.


Florentina Bunea (Cornell)

Title: Surprises in topic model estimation and new Wasserstein document-distance  calculations


Abstract: Topic models have been and continue to be an important modeling tool for an ensemble of independent multinomial samples with shared commonality. Although applications of topic models span many disciplines, the jargon used to define them stems from text analysis. In keeping with the standard terminology, one has access to a corpus of n independent documents, each utilizing words from a given dictionary of size p. One draws N words from each document and records their respective count, thereby representing the corpus as a collection of n samples from independent, p-dimensional, multinomial distributions, each having a different, document specific, true word probability vector Π. The topic model assumption is that each Π is a mixture of K discrete distributions, that are common to the corpus, with document specific mixture weights. The corpus is assumed to cover K topics, that are not directly observable, and each of the K mixture components corresponds to conditional probabilities of words, given a topic. The vector of the K mixture weights, per document, is viewed as a document specific topic distribution T, and is thus expected to be sparse, as most documents will only cover a few of the K topics of the corpus.

Despite the large body of work on learning topic models, the estimation of sparse topic distributions, of unknown sparsity, especially when the mixture components are not known, and are estimated from the same corpus, is not well understood and will be the focus of this talk. We provide estimators of T, with sharp theoretical guarantees, valid in many practically relevant situations, including the scenario p >> N (short documents, sparse data) and unknown K. Moreover, the results are valid when dimensions p and K are allowed to grow with the sample sizes N and n.

When the mixture components are known, we propose MLE estimation of the sparse vector T, the analysis of which has been open until now. The surprising result, and a remarkable property of the MLE in these models, is that, under appropriate conditions, and without further regularization, it can be exactly sparse, and contain the true zero pattern of the target. When the mixture components are not known, we exhibit computationally fast and rate optimal estimators for them, and propose a quasi-MLE estimator of T, shown to retain the properties of the MLE. The practical implication of our sharp, finite-sample rate analyses of the MLE and quasi-MLE is that having short documents can be compensated for, in terms of estimation precision, by having a large corpus.

Our main application is to the estimation of 1-Wasserstein distances between document generating distributions. We propose, estimate and analyze new 1-Wasserstein distances between alternative probabilistic document representations, at the word and topic level, respectively. The effectiveness of the proposed 1-Wasserstein distances, and contrast with the more commonly used WMD between empirical frequency estimates, is illustrated by an analysis of an IMDB movie reviews data set.



*This seminar will be online only.

Tracy Ke (Harvard)

Title: Using SVD for Topic Modeling

Abstract: The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool of dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the corpus matrix, and has an advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate in the case of long and moderately long documents, and it improves the rates of existing methods in the case of short documents. The analysis requires a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems.

We apply our method to a corpus of abstracts of statistical papers (including 83K papers in 36 statistics-related journals). We study the topic trending, topic interests of individual authors, topic ranking, and the impact of topics on citations that a paper receives.


Charles Doss (UMN)

Title: A nonparametric doubly robust test for a continuous treatment effect

Abstract: Much of the literature on evaluating the significance of a (causal) treatment effect based on observational data has been confined to discrete treatments. These methods are not applicable to drawing inference for a continuous treatment, which arises in many important applications.  To adjust for confounders when evaluating a continuous treatment, existing inference methods often rely on discretizing the treatment or using (possibly misspecified) parametric models for the effect curve.  We develop a nonparametric and doubly robust procedure for making inference on the continuous treatment effect curve.  Using empirical process techniques for local U- and V-processes, we establish the test statistic's asymptotic distribution. Furthermore, we propose a wild bootstrap procedure for implementing the test in practice.  We illustrate the new method via simulations and a study of a constructed dataset relating the effect of nurse staffing hours on hospital performance.

3/14/22 Spring Break

Yacine Aït-Sahalia (Princeton)

Title: How and When are High-Frequency Prices Predictable? 

Abstract: This paper studies the predictability of ultra high-frequency stock returns and durations to relevant events, using machine learning methods. We find that, contrary to low frequency and long horizon returns, where predictability is rare and inconsistent, predictability in high frequency returns and durations is large, systematic and pervasive. We identify the relevant predictors and examine what determines the variation in predictability across different stocks and market environments. Next, we compute how the predictability varies with the timeliness of the data on a scale of milliseconds. Finally, we determine the value of getting a (short-lived and imperfect) peek at the incoming order flow, an ability that is often attributed to the fastest high frequency traders, in terms of improving the predictability.

(joint work with Jianqing Fan, Lirong Xue and Yifeng Zhou)


Stephan Mandt (UC Irvine)

Title: From probabilistic forecasting to neural data compression and back: a latent variable perspective. 
Abstract: The past few years have seen deep generative models mature into promising applications. Two of these applications include neural data compression and forecasting high-dimensional time series, including video. I will begin by reviewing the basic ideas behind neural data compression and show how advances in approximate Bayesian inference and generative modeling can significantly improve the compression performance of existing models. Finally, I show how neural video codecs can inspire probabilistic forecasting, leading to probabilistic sequence prediction methods with high potential for data-driven weather prediction.  
Stephan Mandt is an Assistant Professor of Computer Science and Statistics at the University of California, Irvine. From 2016 until 2018, he was a Senior Researcher and Head of the statistical machine learning group at Disney Research, first in Pittsburgh and later in Los Angeles. He held previous postdoctoral positions at Columbia University and Princeton University. Stephan holds a Ph.D. in Theoretical Physics from the University of Cologne, where he received the German National Merit Scholarship. Furthermore, he is a Kavli Fellow of the U.S. National Academy of Sciences, an NSF CAREER Awardee, a member of the ELLIS Society, and a former visiting researcher at Google Brain. Stephan regularly serves as an Area Chair, Action Editor, or Editorial Board member for NeurIPS, ICML, AAAI, ICLR, TMLR, and JMLR. His research is currently supported by NSF, DARPA, DOE, Disney, Intel, and Qualcomm.


*This seminar will be online only.

Tselil Schramm (Stanford)

Title: Testing thresholds in high-dimensional random geometric graphs
Abstract: We study random geometric graphs in the high-dimensional setting, where the dimension grows to infinity with the number of vertices. Our central question is: given such a graph, when is it possible to identify the underlying geometry? As the dimension grows relative to the number of vertices, the edges in the graph become increasingly independent, and the underlying geometry becomes less apparent. In this talk I will present recent progress on this question: we show that in sparse geometric random graphs on the unit sphere, if the dimension is at least polylogarithmic in the number of vertices, then the graphs are statistically indistinguishable from Erdős-Rényi graphs, and the underlying geometry disappears.
Based on joint work with Siqi Liu, Sidhanth Mohanty, and Elizabeth Yang.

Lionel Riou-Durand (Warwick)

Title: Metropolis Adjusted Langevin Trajectories: a robust alternative to Hamiltonian Monte-Carlo
Abstract: Sampling approximations for high dimensional statistical models often rely on so-called gradient-based MCMC algorithms. It is now well established that these samplers scale better with the dimension than other state of the art MCMC samplers, but are also more sensitive to tuning [5]. Among these, Hamiltonian Monte Carlo is a widely used sampling method shown to achieve gold standard d^{1/4} scaling with respect to the dimension [1]. However, it is also known that its efficiency is quite sensitive to the choice of integration time, see e.g. [4][2]. This problem is related to periodicity in the autocorrelations induced by the deterministic trajectories of Hamiltonian dynamics. To tackle this issue, we develop a robust alternative to HMC built upon Langevin diffusions (namely Metropolis Adjusted Langevin Trajectories, or MALT), inducing randomness in the trajectories through a continuous refreshment of the velocities. We study the optimal scaling problem for MALT and recover the d^{1/4} scaling of HMC proven in [1] without additional assumptions. Furthermore, we highlight the fact that autocorrelations for MALT can be controlled by a uniform and monotone bound thanks to the randomness induced in the trajectories, so that MALT achieves robustness to tuning. Finally, we compare our approach to Randomized HMC ([2][3]) and establish quantitative contraction rates for the 2-Wasserstein distance that support the choice of Langevin dynamics.
This is joint work with Jure Vogrinc (University of Warwick).

Nicolas Garcia Trillos (University of Wisconsin)

Title: The multimarginal optimal transport formulation of adversarial multiclass classification.

Abstract: Adversarial training is a framework widely used by machine learning practitioners to enforce robustness of learning models. Despite the development of several computational strategies for adversarial training and some theoretical development in the broader distributionally robust optimization literature, there are still several theoretical questions about adversarial training that remain relatively unexplored. In this talk, I will discuss an equivalence between adversarial training in the context of non-parametric multiclass classification problems and multimarginal optimal transport problems. This is another analytical interpretation of adversarial training that expands recently studied connections to perimeter minimization problems. One of the implications of the connection discussed during the talk is computational: to solve a certain adversarial problem, we may as well solve a multimarginal optimal transport problem. We will discuss many of the nuances of this interpretation and of its computational consequences. This is joint work with my student Jakwang Kim (UW) and my colleague Matt Jacobs (Purdue).


Rajarshi Mukherjee (Harvard)

Title: On PC Adjustments for High Dimensional Association Studies

Abstract: We consider the effect of Principal Component (PC) adjustments while inferring the effects of variables on outcomes. This is motivated by the EIGENSTRAT procedure in genetic association studies where one performs PC adjustment to account for population stratification. We consider simple statistical models to obtain an asymptotically precise understanding of when such PC adjustments are supposed to work. We also verify these results through extensive numerical experiments. These results are based on joint work with Sohom Bhattacharya (Stanford University) and Rounak Dey (Harvard T.H. Chan School of Public Health).


Jason Xu (Duke University)

Title: Likelihood-based Inference for Stochastic Epidemic Models via Data Augmentation   
Abstract: Stochastic epidemic models such as the Susceptible-Infectious-Removed (SIR) model are widely used to model the spread of disease at the population level, but fitting these models to data presents significant challenges. In particular, computing the marginal likelihood of such stochastic processes conditioned on observed endpoints is a notoriously difficult task. As a result, exact likelihood inference is typically considered intractable in the presence of missing data, and practitioners often resort to simulation methods or approximations. We discuss some recent contributions that enable direct inference using the likelihood of observed data, focusing on a perspective that makes use of latent variables to explore configurations of the missing data within a Bayesian framework. Motivated both by count data from large outbreaks and high-resolution contact data from mobile health studies, we show how a data-augmented MCMC approach successfully learns the interpretable epidemic parameters and scales to handle large realistic data settings efficiently.
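As background for the stochastic SIR model named in the abstract, the following sketch simulates one complete epidemic path with the standard Gillespie algorithm. It shows forward simulation only, which produces the kind of event history that is usually unobserved in practice; it is not the data-augmented MCMC method of the talk. The function name `simulate_sir` and the parameter values are assumptions for illustration.

```python
import random


def simulate_sir(n, i0, beta, gamma, seed=0):
    """Simulate a stochastic SIR epidemic in a closed population of size n.

    beta is the per-contact infection rate and gamma the removal rate.
    Returns the full event history as (time, S, I, R) tuples.
    """
    rng = random.Random(seed)
    s, i, r = n - i0, i0, 0
    t = 0.0
    path = [(t, s, i, r)]
    while i > 0:
        infection_rate = beta * s * i / n
        removal_rate = gamma * i
        total = infection_rate + removal_rate
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total)
        # Choose the event type in proportion to its rate.
        if rng.random() < infection_rate / total:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        path.append((t, s, i, r))
    return path


# Hypothetical parameter values.
path = simulate_sir(n=200, i0=5, beta=1.5, gamma=1.0, seed=1)
final_time, final_s, final_i, final_r = path[-1]
print(f"epidemic ended at t={final_time:.2f} with {final_r} removed")
```

In typical data only partial summaries of such a path are recorded, for example case counts at discrete times, which is why the likelihood of the observed data requires integrating over the missing event times.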