Student Seminar Series

Schedule for Spring 2022

Attention: The Student Seminar will be in hybrid mode this semester. Most talks and events will be held in person, and people can also join via Zoom. In-person participation is only available to Columbia affiliates with building access.

Seminars are on Wednesdays (January is Zoom only)
Time: 12:00 - 1:00pm

Location: Room 1025, 1255 Amsterdam Avenue

Zoom Link

Meeting ID: 918 6027 2586
Passcode: 225317

Contacts: Arnab Auddy, Ye Tian

Information for speakers: For information about schedule, directions, equipment, reimbursement, and hotels, please click here.


Sumit Mukherjee (Columbia Stats)

Title: Generalized birthday problem for January 19

Abstract: Suppose there are $n$ students in a class. But assume that not everybody is friends with everyone else, and there is a graph which determines the friendship structure. What is the chance that there are two friends in this class, both with birthdays on January 19? More generally, given a simple labeled graph $G_n$ on $n$ vertices, color each vertex with one of $c=c_n$ colors chosen uniformly at random, independent from other vertices. We study the question: what is the number of monochromatic edges of color 1?

As it turns out, the limiting distribution has three parts, the first and second of which are quadratic and linear functions of a homogeneous Poisson point process, and the third component is an independent Poisson. In fact, we show that any distribution limit must belong to the closure of this class of random variables. As an application, we characterize exactly when the limiting distribution is a Poisson random variable.
This talk is based on joint work with Bhaswar Bhattacharya and Somabha Mukherjee.
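
As a rough illustration of the setup in this abstract (not taken from the talk), the sketch below simulates the number of monochromatic edges of color 1 under uniform random colorings. The cycle graph, the values of `n` and `c`, and the function names are all assumptions chosen for the example. Since each edge has both endpoints colored 1 with probability $1/c^2$, the average count should be close to the number of edges divided by $c^2$.

```python
import random


def count_color1_edges(edges, colors):
    """Count edges whose two endpoints both received color 1."""
    return sum(1 for u, v in edges if colors[u] == 1 and colors[v] == 1)


def simulate(edges, n, c, trials, seed=0):
    """Draw `trials` uniform random colorings of n vertices with c colors
    and return the color-1 monochromatic edge count for each draw."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        # Each vertex gets a color in {1, ..., c}, independently and uniformly.
        colors = [rng.randint(1, c) for _ in range(n)]
        counts.append(count_color1_edges(edges, colors))
    return counts


# Hypothetical friendship graph: a cycle on n vertices (an assumption,
# not a graph discussed in the abstract).
n, c = 200, 10
edges = [(i, (i + 1) % n) for i in range(n)]
counts = simulate(edges, n, c, trials=2000)

# The empirical mean should be near len(edges) / c**2.
print(sum(counts) / len(counts), len(edges) / c**2)
```

With $c = 365$, color 1 plays the role of a fixed birthday such as January 19.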


Charles Margossian (Columbia Stats)
Title: Nested Rhat: assessing convergence of Markov chain Monte Carlo when running many short chains
Abstract: When using Markov chain Monte Carlo (MCMC) algorithms, we can increase the number of samples either by running longer chains or by running more chains. Practitioners often prefer the first approach because chains need an initial warmup phase to (i) forget their initial states and (ii) tune the sampling algorithms to reduce the Markov chain's autocorrelation. However, highly parallel hardware accelerators such as GPUs may allow us to run many chains in parallel almost as quickly as a single chain. In this talk, I'll explore the possibility of using MCMC with many chains and a short sampling phase with, for instance, a single iteration. I'll demonstrate shortcomings in the popular convergence diagnostic $\hat R$, and present a useful generalization termed the nested $\hat R$. Furthermore, I will show how the short chain regime gives us a principled approach to find the number of chains, as well as the optimal length of the sampling and warmup phases, sampling parameters which are otherwise tuned using heuristics or trial and error. The talk is based on a preprint available on arXiv.

Gemma Moran (Columbia DSI)

Title: Identifiable deep generative models via sparse decoding
Abstract: We develop the sparse VAE for unsupervised representation learning on high-dimensional data. The sparse VAE learns a set of latent factors (representations) which summarize the associations in the observed data features. The underlying model is sparse in that each observed feature (i.e., each dimension of the data) depends on a small subset of the latent factors. As examples, in ratings data each movie is only described by a few genres; in text data each word is only applicable to a few topics; in genomics, each gene is active in only a few biological processes. We prove such sparse deep generative models are identifiable: with infinite data, the true model parameters can be learned. (In contrast, most deep generative models are not identifiable.) We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller heldout reconstruction error than related methods.

Inbar Seroussi (Weizmann Institute)
Title: How well can we generalize in high dimensions?
Abstract: Modern learning algorithms such as deep neural networks operate in regimes that defy traditional statistical learning theory. Neural network architectures often contain more parameters than training samples. Despite their huge complexity, the generalization error achieved on real data is small. In this talk, we aim to study the generalization properties of algorithms in high dimensions. We first show that algorithms in high dimensions require a small bias for good generalization. We show that this is indeed the case for deep neural networks in the over-parametrized regime. We then provide lower bounds on the generalization error in various settings for any algorithm. We calculate such bounds using random matrix theory (RMT). We will review the connection between deep neural networks and RMT and existing results. These bounds are particularly useful when the analytic evaluation of standard performance bounds is not possible due to the complexity and nonlinearity of the model. The bounds can serve as a benchmark for testing performance and optimizing the design of actual learning algorithms. (Joint work with Ofer Zeitouni)

Henry Lam (Columbia IEOR)
Title: A Cheap Bootstrap Method for Fast Inference

Abstract: The bootstrap is a versatile method for statistical inference, but when applied to large-scale or simulation-based models, it could face substantial computation demand from repeated data resampling and model refitting. We present a bootstrap methodology that uses minimal computation, namely with a resample effort as low as one Monte Carlo replication, while maintaining desirable statistical guarantees. We describe how this methodology can be used for fast inference across different estimation problems, and its generalizations to handling uncertainties in machine learning and simulation analysis.


Linda Valeri (Columbia Biostats)

Title: A multistate approach for mediation analysis in the presence of semi-competing risks with application in cancer survival disparities

Abstract: We provide novel definitions and identifiability conditions for causal estimands that involve stochastic interventions on non-terminal time-to-events that lie on the pathway between an exposure and a terminal time-to-event outcome. Causal contrasts are estimated in continuous time within a multistate modeling framework accounting for semi-competing risks and analytic formulae for the estimators of the causal contrasts are developed. We employ this novel methodology to investigate the role of delaying treatment uptake in explaining racial disparities in cancer survival in a cohort study of colon cancer patients.


Joshua Glaser (Columbia Neuroscience)

Title: Interpretable Machine Learning for Systems Neuroscience
Abstract: Despite the sharp rise in machine learning's popularity, our ability to use standard machine learning tools to understand the brain remains limited. A critical impediment is that it is often difficult to understand the inner workings of these models. In my talk, I will discuss three projects in which we incorporate knowledge about neuroscience to make machine learning models more interpretable. In the first, to investigate how spinal motor neurons are controlled across tasks, we have developed latent variable models with biologically plausible activation functions. Our results suggest that, contrary to canonical theories, spinal motor neurons are flexibly controlled to meet task demands. In the second project, we have extended switching autoregressive models to explain the activity of multiple brain regions. Across several datasets, these models have allowed us to investigate inter-region interactions that change over time. In the final project, we have developed a sparse dimensionality reduction technique that yields more interpretable representations than principal components analysis. By incorporating neuroscience knowledge into statistical machine learning tools, we can create models that lead to greater biological insight.

Carsten Chong (Columbia Stats)
Title: Mixed Semimartingales: Volatility Estimation in the Presence of Rough Noise

Abstract: We consider the problem of estimating volatility based on high-frequency data when the observed price process is a continuous Itô semimartingale contaminated by microstructure noise. Assuming that the noise process is compatible across different sampling frequencies, we argue that it typically has a similar local behavior to fractional Brownian motion. For the resulting class of processes, which we call mixed semimartingales, we derive consistent estimators and asymptotic confidence intervals for the roughness parameter of the noise and the integrated price and noise volatilities, in all cases where these quantities are identifiable. Our model can explain key features of recent stock price data, most notably divergence rates in volatility signature plots that vary considerably over time and between assets.

Charles Margossian, Elliott Gordon Rodriguez (both from Columbia Stats)


s. gwynn sturdevant (Harvard Business School)

Title: Delivering data differently
Abstract: Human-computer interaction relies on mouse/touchpad, keyboard, and screen, but tools have recently been developed that engage sound, smell, touch, muscular resistance, voice dialogue, balance, and multiple senses at once. How might these improvements impact upon the practice of statistics and data science? People with low vision may be better able to grasp and explore data. More generally, methods developed to enable this have the potential to allow sighted people to use more senses and become better analysts. We would like to adapt some of the wide range of available computer and sensory input/output technologies to transform data science workflows. Here is a vision of what this synthesis might accomplish.

Title: Job search experience/academia career planning

Graduating students of our Ph.D. program, who were on the academic job market this year, will share their experiences about the job search.


Kristen Gore (Six Sigma Black Belt at HP Inc.)

Title: The key role of data scientists in Industry 4.0
Abstract: This Wednesday's seminar talk will be given by Dr. Kristen Gore.  Dr. Gore is an alumna of the Columbia statistics department and has been a senior data scientist at HP Inc. since 2014.  She became the statistical data strategy lead of the global Print Microfluidics Technology and Operations organization in 2020.  In her talk she will discuss the rapidly expanding role of data scientists in driving key decisions in the tech sector.  Additionally, Gore will detail how advancements in data science have been foundational to the fourth industrial revolution--also known as Industry 4.0.


*Location: Room 903 SSW

*Time: 1:00 - 2:00pm

Dave Blei (Columbia Stats, CS)
Title: The Blessings of Multiple Causes [*]

Abstract: Causal inference from observational data is a vital problem, but it comes with strong assumptions. Most methods require that we observe all confounders, variables that affect both the causal variables and the outcome variables. But whether we have observed all confounders is a famously untestable assumption. In this talk I will describe the deconfounder, a way to do causal inference under different assumptions than classical methods require.

How does the deconfounder work? While traditional causal methods measure the effect of a single cause on an outcome, many modern scientific studies involve multiple causes, different variables whose effects are simultaneously of interest. The deconfounder uses the correlation among multiple causes as evidence for unmeasured confounders, combining unsupervised machine learning and predictive model checking to perform causal inference.

In this talk I will describe the deconfounder methodology and discuss the theoretical requirements for the deconfounder to provide unbiased causal estimates. I will touch on some of the academic debates surrounding the deconfounder, and demonstrate the deconfounder on real-world data and simulation studies.

This is joint work with Yixin Wang.

Cindy Rush (Columbia Stats)
Title: Characterizing the Type 1-Type 2 Error Trade-off for SLOPE

Abstract: Sorted L1 regularization has been incorporated into many methods for solving high-dimensional statistical estimation problems, including the SLOPE estimator in linear regression. In this talk, we study how this relatively new regularization technique improves variable selection by characterizing the optimal SLOPE trade-off between the false discovery proportion (FDP) and true positive proportion (TPP) or, equivalently, between measures of type I and type II error. Additionally, we show that on any problem instance, SLOPE with a certain regularization sequence outperforms the Lasso, in the sense of having a smaller FDP, larger TPP and smaller L2 estimation risk simultaneously. Our proofs are based on a novel technique that reduces a variational calculus problem to a class of infinite-dimensional convex optimization problems and a very recent result from approximate message passing (AMP) theory. With SLOPE being a particular example, we discuss these results in the context of a general program for systematically deriving exact expressions for the asymptotic risk of estimators that are solutions to a broad class of convex optimization problems via AMP. Collaborators on this work include Zhiqi Bu, Jason Klusowski, and Weijie Su, as well as Oliver Feng, Ramji Venkataramanan, and Richard Samworth.

Title: Student body elections
Abstract: We will elect our next student representative, student seminar organizers, and ASGC representative.