Student Seminar Series

Schedule for Spring 2024

Seminars are on Wednesdays 

Time: 12:00 - 1:00pm

Location: Room 903, 1255 Amsterdam Avenue

Contacts: Wribhu Banik, Seunghyun Lee, Anirban Nath


Speakers: Samory Kpotufe & Bodhi Sen (Columbia Stats)

Title: TBA

Abstract: Samory and Bodhi will both be talking about their current research interests and what it's like to do research with them.

Dates: 1/31 and 2/7



Speaker: Dan Lacker (Columbia IEOR)

Title: The (projected) Langevin dynamics: sampling, optimal transport, and variational inference

Abstract: This is a talk in two parts. The first part will survey a classical picture of the Langevin diffusion, with a focus on its applications to sampling and optimization. The second part will discuss my recent work on one or two (as time permits) analogous diffusion dynamics, which are designed to sample from probability measures arising in (1) entropic optimal transport and (2) mean field variational inference.
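For readers new to the topic, the classical Langevin diffusion is discretized in practice as the unadjusted Langevin algorithm (ULA). The sketch below is not from the talk; the Gaussian target and step size are illustrative choices only.

```python
import numpy as np

def ula_sample(grad_log_p, x0, step=0.01, n_steps=20000, burn_in=2000, seed=0):
    """Unadjusted Langevin algorithm: x_{k+1} = x_k + h*grad log p(x_k) + sqrt(2h)*xi_k."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for k in range(n_steps):
        x = x + step * grad_log_p(x) + np.sqrt(2 * step) * rng.standard_normal()
        if k >= burn_in:
            samples.append(x)
    return np.array(samples)

# Target: standard normal, so grad log p(x) = -x.
samples = ula_sample(lambda x: -x, x0=0.0)
print(samples.mean(), samples.var())
```

For this target the chain is an AR(1) process with stationary variance 1/(1 − h/2), so a small step size h keeps the discretization bias negligible.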





Speaker: Genevera Allen (Rice)

Title: Statistical Machine Learning for Scientific Discovery

Abstract: In this talk, I will give an overview of my research program, which develops new statistical machine learning techniques to help scientists make reproducible and reliable data-driven discoveries from large and complex data.

The first part will focus on an example of my research motivated by neuroscience: understanding how large populations of neurons communicate in the brain at rest, in response to stimuli, or to produce behavior is a fundamental open question in neuroscience. Many approach this by estimating intrinsic functional neuronal connectivity using probabilistic graphical models, but major statistical and computational hurdles remain for graph learning from new large-scale calcium imaging technologies. I will highlight a new graph learning strategy my group has developed, which we call Graph Quilting, that addresses a critical unsolved challenge: graph learning from the partial covariances that result from non-simultaneously recorded neurons.

The second part will focus on an example of my research in interpretable machine learning: Feature importance inference has been a long-standing statistical problem that helps promote scientific discoveries. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods that can be applied to any statistical or machine learning model.  I will highlight a new approach to feature occlusion or leave-one-covariate-out (LOCO) inference that leverages minipatch ensemble learning to increase statistical power and improve computational efficiency without making any limiting assumptions on the model or data distribution.   
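As a rough illustration of LOCO-style feature occlusion, the stripped-down sketch below scores each feature by how much in-sample least-squares error rises when it is left out; it is not the minipatch ensemble procedure from the talk, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
y = 3.0 * X[:, 0] + rng.standard_normal(n)  # only feature 0 is informative

def mse_fit(X, y):
    # Least-squares fit; return the in-sample mean squared error.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ beta) ** 2)

full_mse = mse_fit(X, y)
# LOCO score: error increase from occluding (deleting) each feature.
loco = [mse_fit(np.delete(X, j, axis=1), y) - full_mse for j in range(d)]
print(loco)  # feature 0 should dominate
```

The full method additionally uses held-out evaluation and minipatch ensembles to obtain valid inference rather than raw scores.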

Finally, I will conclude by highlighting current and future research directions in my group related to modern multivariate analysis, graphical models, ensemble learning, machine learning interpretability and fairness, and applications in neuroscience and genomics. 


Speaker: Dr. Simon Tavare (Columbia Stats and Bioscience)

Title: Cancer by the Numbers

Abstract: After a brief overview of the history of cancer, I will illustrate how the mathematical sciences can contribute to cancer research through a number of examples. Cancer development is characterized by occurrences of genomic alterations ranging in extent and impact, and the complex interdependence between these genomic events shapes the selection landscape. Stochastic modeling can help evaluate the role of each mutational process during tumor progression, but existing frameworks only capture certain aspects of tumorigenesis.  I will outline CINner, a stochastic framework for modeling genomic diversity and selection during tumor evolution. The main advantage of CINner is its flexibility to incorporate many genomic events that directly impact cellular fitness, from driver gene mutations to copy number alterations (CNAs) including focal amplifications and deletions, mis-segregations, and whole-genome duplication. CINner raises a number of difficult statistical inference problems due to the lack of a feasible way to compute likelihoods. I will outline a new approach to approximate Bayesian computation – ABC – that exploits distributional random forests. I will give some examples of how this ABC-DRF methodology works in practice, and try to convince you that ABC has really come of age.
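For context, the simplest member of the ABC family the talk builds on is rejection ABC; the sketch below uses an illustrative Gaussian toy model, not the CINner simulator or the distributional-random-forest variant from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
observed = rng.normal(2.0, 1.0, size=100)  # "data" with unknown mean (truth: 2)
s_obs = observed.mean()

# Rejection ABC: draw theta from the prior, simulate data, and keep theta
# whenever the simulated summary statistic is close to the observed one.
accepted = []
for _ in range(20000):
    theta = rng.uniform(-5, 5)
    sim = rng.normal(theta, 1.0, size=100)
    if abs(sim.mean() - s_obs) < 0.1:
        accepted.append(theta)

posterior_mean = np.mean(accepted)
print(len(accepted), posterior_mean)
```

ABC-DRF replaces the hard accept/reject step with a learned weighting of simulations, which is what makes likelihood-free inference feasible for complex models like CINner.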





Speaker: Zhongyuan Lyu (Columbia DSI)

Title: Optimal Clustering of Multi-layer Networks

Abstract: We study the fundamental limits of network clustering when a multi-layer network is observed. Under the mixture multi-layer stochastic block model (MMSBM), we show that the minimax optimal network clustering error rate takes an exponential form and is characterized by the Rényi-1/2 divergence between the edge probability distributions of the component networks. We propose a novel two-stage network clustering method consisting of a tensor-based initialization and a one-step refinement via a likelihood-based Lloyd's algorithm. The proposed algorithm achieves the minimax optimal network clustering error rate and allows extreme network sparsity under the MMSBM. We also extend our methodology and analysis framework to study the minimax optimal clustering error rate for mixtures of discrete distributions, including Binomial, Poisson, and multi-layer Poisson networks.







Speaker: David Blei (Columbia Stats & CS)

Title: Hierarchical Causal Models

Abstract: Analyzing nested data with hierarchical models is a staple of Bayesian statistics, but causal modeling remains largely focused on "flat" models. In this talk, we will explore how to think about nested data in causal models, and we will consider the advantages of nested data over aggregate data (such as data means) for causal inference. We show that disaggregating your data---replacing a flat causal model with a hierarchical causal model---can provide new opportunities for identification and estimation. As examples, we will study how to identify and estimate causal effects under unmeasured confounders, interference, and instruments.


This is joint work with Eli Weinstein.



Speaker: Alberto Gonzalez Sanz (Columbia Stats)

Title: Weak Limits of Entropy Regularized Optimal Transport; Potentials, Plans, and Divergences

Abstract: In this seminar, we will learn how to obtain the asymptotic distribution of both the potentials and the couplings of entropy-regularized optimal transport for compactly supported probabilities in $\mathbb{R}^d$. We first provide a central limit theorem for the Sinkhorn potentials---the solutions of the dual problem---as a Gaussian process in the space of continuous functions. Then we obtain the weak limits of the couplings---the solutions of the primal problem---evaluated on integrable functions, proving a conjecture of Harchaoui, Liu, and Pal. In both cases, the limit is a real Gaussian random variable. Finally, we consider the weak limit of the entropic Sinkhorn divergence under either hypothesis $H_0:\ {\rm P}={\rm Q}$ or $H_1:\ {\rm P}\neq{\rm Q}$. Under $H_0$ the limit is a quadratic form applied to a Gaussian process in a Sobolev space, while under $H_1$ the limit is Gaussian.
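The Sinkhorn potentials and couplings in the abstract solve the entropic optimal transport problem, which is computed in practice by Sinkhorn's fixed-point iterations; a small numerical sketch (with an arbitrary random cost matrix, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 5, 0.5
a = np.full(n, 1.0 / n)          # source marginal
b = np.full(n, 1.0 / n)          # target marginal
C = rng.random((n, n))           # cost matrix

K = np.exp(-C / eps)             # Gibbs kernel
u = np.ones(n)
for _ in range(1000):            # Sinkhorn fixed-point iterations
    v = b / (K.T @ u)
    u = a / (K @ v)

P = u[:, None] * K * v[None, :]  # entropic optimal coupling
print(P.sum(axis=1), P.sum(axis=0))  # both marginals ≈ a and b
```

The scaling vectors u and v encode the (exponentiated) Sinkhorn potentials whose fluctuations the talk analyzes.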



Speaker: Yixin Wang (Michigan Stats)

Title: Harnessing Geometric Signatures in Causal Representation Learning

Abstract: Causal representation learning aims to extract high-level latent causal factors from low-level sensory data. Traditional methods often identify these factors by assuming they are statistically independent. In practice, however, the factors are often correlated, causally connected, or arbitrarily dependent.

In this talk, we explore how one might identify such dependent latent factors from data, whether from passive observations, interventional experiments, or multi-domain datasets. The key observation is that, despite correlations, the causal connections (or lack thereof) among latent factors leave geometric signatures in their support: the range of values each factor can take.

Leveraging these signatures, we show that observational data alone can identify the latent factors up to coordinate transformations if they bear no causal links. When causal connections do exist, interventional data can provide geometric clues sufficient for identification. In the most general case of arbitrary dependencies, multi-domain data can separate stable factors from unstable ones. Taken together, these results showcase the unique power of geometric signatures in causal representation learning.

This is joint work with Kartik Ahuja, Yoshua Bengio, Michael Jordan, Divyat Mahajan, and Amin Mansouri.




Speaker: Cindy Rush (Columbia Stats)

Title: The out-of-sample prediction error of the square-root lasso and related estimators

Abstract: We study the classical problem of predicting an outcome variable, Y, using a linear combination of a d-dimensional covariate vector, X. We are interested in linear predictors whose coefficients solve: inf_β (E[(Y − ⟨β, X⟩)^r])^{1/r} + λ‖β‖, where r > 1 and λ > 0 is a regularization parameter. We provide conditions under which linear predictors based on these estimators minimize the worst-case prediction error over a ball of distributions determined by a type of max-sliced Wasserstein metric. A detailed analysis of the statistical properties of this metric yields a simple recommendation for the choice of regularization parameter. The suggested order of λ, after a suitable normalization of the covariates, is typically d/n, up to logarithmic factors. Our recommendation is computationally straightforward to implement, pivotal, has provable out-of-sample performance guarantees, and does not rely on sparsity assumptions about the true data-generating process. This is joint work with Jose Montiel Olea, Amilcar Velez, and Johannes Wiesel.
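For the r = 2 case with an ℓ1 penalty, the criterion is the square-root lasso; the one-dimensional toy sketch below shows how the regularization parameter shrinks the fitted coefficient. All numbers are illustrative, and the talk's recommendation concerns the multivariate setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
y = 2.0 * x + rng.standard_normal(n)  # true coefficient: 2

def objective(beta, lam):
    # r = 2 case: root-mean-squared error plus an l1 penalty (square-root lasso).
    return np.sqrt(np.mean((y - beta * x) ** 2)) + lam * abs(beta)

# Grid search over beta for a few regularization levels.
grid = np.linspace(-4, 4, 8001)
fits = {lam: grid[np.argmin([objective(b, lam) for b in grid])]
        for lam in (0.0, 0.3, 2.0)}
print(fits)  # shrinks toward 0 as lam grows
```

With no penalty the fit is essentially least squares; a moderate λ shrinks the coefficient, and a large λ sets it exactly to zero.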


Speaker: Ye Tian

Title: Robust Unsupervised Multi-task Learning on Gaussian Mixture Models


Abstract: Unsupervised learning is widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to improve learning performance over single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that not only effectively utilizes unknown similarities between related tasks but is also robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure achieves a minimax optimal rate of convergence for the excess mis-clustering error in a wide range of regimes. Our algorithm can be seen as a generalization of existing federated EM algorithms, and our non-asymptotic theory illustrates the empirical success of those algorithms.
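For reference, the single-task building block the procedure generalizes is EM for a GMM; below is a minimal sketch for a symmetric two-component mixture with known unit variances and equal weights (simulated data, not the multi-task or robust version from the talk).

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated components with unit variance.
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])

mu = np.array([-0.5, 0.5])      # crude initialization
for _ in range(50):
    # E-step: posterior responsibility of component 1 for each point
    # (equal weights and unit variances assumed, so priors cancel).
    d0 = np.exp(-0.5 * (data - mu[0]) ** 2)
    d1 = np.exp(-0.5 * (data - mu[1]) ** 2)
    r = d1 / (d0 + d1)
    # M-step: update the means by responsibility-weighted averages.
    mu = np.array([np.sum((1 - r) * data) / np.sum(1 - r),
                   np.sum(r * data) / np.sum(r)])
print(mu)  # ≈ [-2, 2]
```

The multi-task procedure runs coupled EM updates across tasks, pooling information when parameters look similar while guarding against outlier tasks.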
Speaker: Seunghyun Lee

Title: Inference on Gaussian mixture models with dependent labels

Abstract: Gaussian mixture models are widely used to model data generated from multiple latent sources. Despite their popularity, most theoretical research assumes that the labels are independent and identically distributed. In this work, we focus on the spherical two-component Gaussian mixture model with dependent labels that follow an Ising model, a popular quadratic interaction model that allows network dependence. Under the assumption that the underlying network is dense enough and known, we show that there exists a phase transition in the optimal limiting variance. Specifically, when the dependence is below a threshold, the optimal estimator is exactly the same as in the independent case. For strong enough dependence, however, estimation becomes easier, and we propose an optimal estimator based on a variational approximation of the likelihood. In both cases, there is no information-computation gap, and our estimators are tractable.

Speaker: Yisha Yao (Columbia University)

Title: Test of Independence in High-dimensional Contingency Tables

Abstract: High-dimensional (HD) contingency tables are a common data format in the contemporary sciences. The table size d_1 × d_2 can be much larger than the sample size n. The cells of an HD contingency table are often unbalanced, and a diverging number of cells may be empty. In this scenario, Pearson's chi-squared test can fail because the validity of approximating the test statistic by \chi^2_{(d_1-1)(d_2-1)} is unclear and nontrivial. In this paper, we provide necessary and sufficient conditions for the chi-squared and normal approximations of Pearson's chi-squared statistic (as well as the likelihood ratio and Hellinger statistics) in high-dimensional settings. Compared with traditional guidelines for the applicability of the chi-squared approximation, these necessary and sufficient conditions relax the sample size requirement and allow empty cells to dominate the table, provided that a diverging number of cells with more than a single count are expected in any deterministic fractional subset of cells. We also adjust the degrees of freedom in the chi-squared approximation, which improves its accuracy. Finally, a modified version of Pearson's chi-squared statistic is shown to be more robust.
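For concreteness, the statistic under study is the classical Pearson chi-squared statistic for an r × c table; the minimal computation below is illustrative only, not the high-dimensional adjustment from the talk.

```python
import numpy as np

def pearson_chi2(table):
    """Pearson's chi-squared statistic and degrees of freedom for an r x c table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return stat, dof

# A perfectly balanced table: the statistic is exactly 0.
stat, dof = pearson_chi2([[10, 10], [10, 10]])
print(stat, dof)  # 0.0, 1
```

The high-dimensional question is precisely when this statistic is still well approximated by \chi^2_{dof} once the table grows and empty cells proliferate.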




Title: Student body elections

Abstract: We will elect our next student representative, student seminar organizers, and ASGC representative.



