Schedule for Spring 2024
Seminars are on Wednesdays
Time: 12:00 – 1:00pm
Location: Room 903, 1255 Amsterdam Avenue
Contacts: Wribhu Banik, Seunghyun Lee, Anirban Nath
1/24/2024 
Speakers: Samory Kpotufe & Bodhi Sen (Columbia Stats)
Title: TBA
Abstract: Samory and Bodhi will both be talking about their current research interests and what it's like to do research with them.
1/31, 2/7 
NO SEMINAR 
2/14/2024 
Speaker: Dan Lacker (Columbia IEOR)
Title: The (projected) Langevin dynamics: sampling, optimal transport, and variational inference
Abstract: This is a talk in two parts. The first part will survey a classical picture of the Langevin diffusion, with a focus on its applications to sampling and optimization. The second part will discuss my recent work on one or two (as time permits) analogous diffusion dynamics, which are designed to sample from probability measures arising in (1) entropic optimal transport and (2) mean field variational inference.
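As background for the classical sampling picture, here is a minimal sketch of the unadjusted Langevin algorithm (ULA) targeting a standard Gaussian. This is a generic textbook illustration, not the speaker's projected dynamics; the target, step size, and iteration counts are all illustrative choices.

```python
import numpy as np

# Background sketch (not the speaker's projected dynamics): the unadjusted
# Langevin algorithm (ULA) targeting N(0, 1). Each step moves along the score
# of the target and adds Gaussian noise scaled by sqrt(2 * step).

def grad_log_pi(x):
    return -x  # score of the standard Gaussian target

def ula_sample(n_steps=20000, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x, out = 0.0, []
    for _ in range(n_steps):
        x = x + step * grad_log_pi(x) + np.sqrt(2.0 * step) * rng.standard_normal()
        out.append(x)
    return np.array(out[n_steps // 2:])  # discard the first half as burn-in

samples = ula_sample()
# The empirical mean and variance approach 0 and 1 (up to O(step) bias).
```

The O(step) discretization bias is the reason Metropolis-adjusted variants exist; for this well-conditioned target, plain ULA is already accurate.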

2/21/2024 
Speaker: Genevera Allen (Rice)
Title: Statistical Machine Learning for Scientific Discovery
Abstract: In this talk, I will give an overview of my research program, which develops new statistical machine learning techniques to help scientists make reproducible and reliable data-driven discoveries from large and complex data. The first part will focus on an example of my research motivated by neuroscience: How large populations of neurons communicate in the brain at rest, in response to stimuli, or to produce behavior are fundamental open questions in neuroscience. Many approach this by estimating the intrinsic functional neuronal connectivity using probabilistic graphical models, but there remain major statistical and computational hurdles to graph learning from new large-scale calcium imaging technologies. I will highlight a new graph learning strategy my group has developed to address a critical unsolved neuroscience challenge that we call Graph Quilting, or graph learning from partial covariances resulting from non-simultaneously recorded neurons. The second part will focus on an example of my research in interpretable machine learning: Feature importance inference has been a longstanding statistical problem that helps promote scientific discoveries. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods that can be applied to any statistical or machine learning model. I will highlight a new approach to feature occlusion or leave-one-covariate-out (LOCO) inference that leverages minipatch ensemble learning to increase statistical power and improve computational efficiency without making any limiting assumptions on the model or data distribution.
Finally, I will conclude by highlighting current and future research directions in my group related to modern multivariate analysis, graphical models, ensemble learning, machine learning interpretability and fairness, and applications in neuroscience and genomics.

2/28/2024 
Speaker: Dr. Simon Tavare (Columbia Stats and Bioscience)
Title: Cancer by the Numbers
Abstract: After a brief overview of the history of cancer, I will illustrate how the mathematical sciences can contribute to cancer research through a number of examples. Cancer development is characterized by occurrences of genomic alterations ranging in extent and impact, and the complex interdependence between these genomic events shapes the selection landscape. Stochastic modeling can help evaluate the role of each mutational process during tumor progression, but existing frameworks only capture certain aspects of tumorigenesis. I will outline CINner, a stochastic framework for modeling genomic diversity and selection during tumor evolution. The main advantage of CINner is its flexibility to incorporate many genomic events that directly impact cellular fitness, from driver gene mutations to copy number alterations (CNAs), including focal amplifications and deletions, missegregations, and whole-genome duplication. CINner raises a number of difficult statistical inference problems due to the lack of a feasible way to compute likelihoods. I will outline a new approach to approximate Bayesian computation (ABC) that exploits distributional random forests. I will give some examples of how this ABC-DRF methodology works in practice, and try to convince you that ABC has really come of age.
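For readers unfamiliar with ABC, here is a minimal rejection-ABC sketch for a toy Gaussian-mean problem. The prior, tolerance, and summary statistic are my illustrative choices; the talk's ABC-DRF approach replaces the crude accept/reject step with distributional random forests.

```python
import numpy as np

# Toy rejection ABC: infer the mean of a Gaussian when we pretend the
# likelihood is unavailable, using the sample mean as the summary statistic.
# Prior, tolerance, and sample sizes are illustrative.

def abc_reject(obs_summary, n_draws=20000, tol=0.05, seed=0):
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)                   # draw from the prior
        sim = rng.normal(theta, 1.0, size=100).mean()    # simulate + summarize
        if abs(sim - obs_summary) < tol:                 # keep close matches
            accepted.append(theta)
    return np.array(accepted)

post = abc_reject(obs_summary=1.0)
# Accepted draws approximate the posterior and concentrate near 1.0.
```

The inefficiency is visible here: only about 1% of draws are accepted, which is exactly the kind of problem that likelihood-free refinements such as random-forest-based ABC aim to mitigate.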

3/6/2024 
Speaker: Zhongyuan Lyu (Columbia DSI)
Title: Optimal Clustering of Multilayer Networks
Abstract: We study the fundamental limit of clustering networks when a multilayer network is present. Under the mixture multilayer stochastic block model (MMSBM), we show that the minimax optimal network clustering error rate takes an exponential form and is characterized by the Rényi divergence of order 1/2 between the edge probability distributions of the component networks. We propose a novel two-stage network clustering method including a tensor-based initialization and a one-step refinement procedure by a likelihood-based Lloyd's algorithm. Our proposed algorithm achieves the minimax optimal network clustering error rate and allows extreme network sparsity under MMSBM. We also extend our methodology and analysis framework to study the minimax optimal clustering error rate for mixtures of discrete distributions, including Binomial, Poisson, and multilayer Poisson networks.
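The refinement step builds on Lloyd's algorithm. As background, here is the classical k-means version on toy Euclidean data; the talk's variant is likelihood-based and operates on networks, and the data, initialization, and iteration count below are all illustrative.

```python
import numpy as np

# Background sketch: classical Lloyd's algorithm for k-means on toy 2-D data.
# The talk's refinement is a likelihood-based variant for multilayer networks.

def lloyd(X, centers, n_iter=20):
    centers = centers.copy()
    for _ in range(n_iter):
        # assignment step: nearest center in squared Euclidean distance
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # update step: recompute each center as its cluster mean
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(centers.shape[0])])
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])
init = np.array([[-1.0, -1.0], [1.0, 1.0]])
labels, centers = lloyd(X, init)
# In this well-separated case the two planted clusters are recovered exactly.
```

The two-stage structure in the talk mirrors this pattern: a rough initialization (tensor-based there, hand-picked here) followed by iterative refinement.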

3/13/2024 
NO SEMINAR

3/20/2024 
Speaker: David Blei (Columbia Stats & CS)
Title: Hierarchical Causal Models
Abstract: Analyzing nested data with hierarchical models is a staple of Bayesian statistics, but causal modeling remains largely focused on "flat" models. In this talk, we will explore how to think about nested data in causal models, and we will consider the advantages of nested data over aggregate data (such as data means) for causal inference. We show that disaggregating your data (replacing a flat causal model with a hierarchical causal model) can provide new opportunities for identification and estimation. As examples, we will study how to identify and estimate causal effects under unmeasured confounders, interference, and instruments.

3/27/2024 
Speaker: Alberto Gonzalez Sanz (Columbia Stats)
Title: Weak Limits of Entropy Regularized Optimal Transport: Potentials, Plans, and Divergences
Abstract: In this seminar, we will learn how to obtain the asymptotic distribution of both potentials and couplings of entropic regularized optimal transport for compactly supported probabilities in $\mathbb{R}^d$. We first provide the central limit theorem of the Sinkhorn potentials (the solutions of the dual problem) as a Gaussian process in the space of continuous functions. Then we obtain the weak limits of the couplings (the solutions of the primal problem) evaluated on integrable functions, proving a conjecture of Harchaoui, Z., Liu, L. and Pal, S. In both cases, their limit is a real Gaussian random variable. Finally, we consider the weak limit of the entropic Sinkhorn divergence under both assumptions $H_0:\ {\rm P}={\rm Q}$ or $H_1:\ {\rm P}\neq{\rm Q}$. Under $H_0$ the limit is a quadratic form applied to a Gaussian process in a Sobolev space, while under $H_1$, the limit is Gaussian.
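As background on the entropic optimal transport problem itself, here is a minimal Sinkhorn iteration for two discrete measures with uniform weights. The cost, regularization level, and sizes are illustrative; the talk concerns the asymptotic distribution of these objects, not their computation.

```python
import numpy as np

# Background sketch: Sinkhorn iterations for entropic OT between two discrete
# uniform measures. Alternating updates of the dual scalings u, v enforce the
# row and column marginals of the coupling pi = diag(u) K diag(v).

def sinkhorn(C, eps=0.1, n_iter=2000):
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # target marginals
    K = np.exp(-C / eps)                             # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]               # entropic coupling

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
C = (x[:, None] - y[None, :]) ** 2                   # squared-distance cost
pi = sinkhorn(C)
# Both marginals of pi match the uniform weights 1/5 after convergence.
```

The dual scalings u and v are exponentials of the Sinkhorn potentials whose Gaussian fluctuations the first part of the talk characterizes.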

4/3/2024 
Speaker: Yixin Wang (Michigan Stats)
Title: Harnessing Geometric Signatures in Causal Representation Learning
Abstract: Causal representation learning aims to extract high-level latent causal factors from low-level sensory data. Traditional methods often identify these factors by assuming they are statistically independent. In practice, however, the factors are often correlated, causally connected, or arbitrarily dependent.

4/10/2024 
Speaker: Cindy Rush (Columbia Stats)
Title: The out-of-sample prediction error of the square-root lasso and related estimators
Abstract: We study the classical problem of predicting an outcome variable, Y, using a linear combination of a d-dimensional covariate vector, X. We are interested in linear predictors whose coefficients solve: inf_β (E[|Y − ⟨β, X⟩|^r])^(1/r) + λ‖β‖, where r > 1 and λ > 0 is a regularization parameter. We provide conditions under which linear predictors based on these estimators minimize the worst-case prediction error over a ball of distributions determined by a type of max-sliced Wasserstein metric. A detailed analysis of the statistical properties of this metric yields a simple recommendation for the choice of regularization parameter. The suggested order of λ, after a suitable normalization of the covariates, is typically √(d/n), up to logarithmic factors. Our recommendation is computationally straightforward to implement, pivotal, has provable out-of-sample performance guarantees, and does not rely on sparsity assumptions about the true data-generating process. This is joint work with Jose Montiel Olea, Amilcar Velez, and Johannes Wiesel.
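To make the objective concrete, here is a toy empirical version of the r = 2 case (the square-root lasso) with the ℓ1 norm and λ of the suggested order. The synthetic data, the generic derivative-free optimizer, and the exact constant in λ are my illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch of the r = 2 objective (square-root lasso) with the l1 norm:
# minimize  sqrt(mean((y - X @ b)^2)) + lam * ||b||_1  over b.
# Data and optimizer are illustrative; lam follows the suggested sqrt(d/n) order.

def sqrt_lasso(X, y, lam):
    d = X.shape[1]
    obj = lambda b: np.sqrt(np.mean((y - X @ b) ** 2)) + lam * np.abs(b).sum()
    return minimize(obj, np.zeros(d), method="Powell").x

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))                 # covariates already normalized
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
beta_hat = sqrt_lasso(X, y, lam=np.sqrt(d / n))
# beta_hat recovers the two active coefficients up to mild shrinkage.
```

Note the pivotal character of the estimator: unlike the ordinary lasso, the residual scale appears inside the square root, so λ does not need to be tuned to the noise level.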
4/17/2024 
Speaker: Ye Tian
Title: Robust Unsupervised Multi-task Learning on Gaussian Mixture Models
Abstract: Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that not only can effectively utilize unknown similarity between related tasks but is also robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure is shown to achieve a minimax optimal rate of convergence for the excess misclustering error, in a wide range of regimes. Our algorithm can be seen as a generalization of existing federated EM algorithms, and our non-asymptotic theory illustrates the empirical success of those algorithms.
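As background for the multi-task procedure, here is the single-task EM iteration for the symmetric two-component spherical mixture 0.5·N(θ, 1) + 0.5·N(−θ, 1) in one dimension. The true θ, sample size, and initialization are illustrative; the talk's method is a robust multi-task extension of updates like this.

```python
import numpy as np

# Background sketch: single-task EM for 0.5*N(theta, 1) + 0.5*N(-theta, 1).
# E-step: posterior probability that a point came from the +theta component.
# M-step: posterior-weighted signed average, the closed-form update for theta.

def em_symmetric_gmm(x, theta0=0.5, n_iter=100):
    theta = theta0
    for _ in range(n_iter):
        gamma = 1.0 / (1.0 + np.exp(-2.0 * theta * x))  # E-step: P(label=+1 | x)
        theta = np.mean((2.0 * gamma - 1.0) * x)        # M-step
    return theta

rng = np.random.default_rng(0)
labels = rng.choice([-1.0, 1.0], size=2000)
x = 2.0 * labels + rng.normal(size=2000)
theta_hat = em_symmetric_gmm(x)
# theta_hat is close to the true separation parameter of 2.
```

Running many such EM updates across tasks and sharing information between them, while guarding against adversarial tasks, is the setting studied in the talk.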
Speaker: Seunghyun Lee
Title: Inference on Gaussian mixture models with dependent labels
Abstract: Gaussian mixture models are widely used to model data generated from multiple latent sources. Despite their popularity, most theoretical research assumes that the labels are independent and identically distributed. In this paper, we focus on the spherical two-component Gaussian mixture model with dependent labels, which follow an Ising model. The Ising model is a popular quadratic interaction model that allows network dependence. Under the assumption that the underlying network is dense enough and known, we show there exists a phase transition in terms of the optimal limiting variance. Specifically, when the dependence is smaller than a threshold, the optimal estimator is exactly the same as in the independent case. However, for strong enough dependence, estimation becomes easier, and we propose an optimal estimator based on a variational approximation of the likelihood. In both cases, there is no information-computation gap, and our estimators are tractable.

4/24/2024 
Speaker: Yisha Yao (Columbia University)
Title: Test of Independence in High-dimensional Contingency Tables
Abstract: High-dimensional (HD) contingency tables are a common data format in contemporary sciences. The table size d_1 × d_2 could be much larger than the sample size n. The cells in an HD contingency table are often unbalanced, and there are a diverging number of empty cells. In this scenario, Pearson's chi-squared test would fail because the validity of approximating the test statistic by \chi^2_{(d_1-1)(d_2-1)} is unclear and nontrivial. In this paper, we provide necessary and sufficient conditions for the chi-squared and normal approximations of Pearson's chi-squared statistic (as well as the likelihood ratio and Hellinger statistics) in high-dimensional settings. Compared with traditional guidelines for the applicability of the chi-squared approximation, the necessary and sufficient conditions relax the sample size requirement and allow empty cells to dominate the table, provided that a diverging number of cells with more than a single count are expected in any deterministic fractional subset of cells. We also adjust the degrees of freedom in the chi-squared approximation, which improves its accuracy. A modified version of Pearson's chi-squared statistic is shown to be more robust than the
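For reference, here is the classical Pearson chi-squared statistic for an r × c table, with the textbook (d_1 − 1)(d_2 − 1) degrees of freedom whose adequacy in high dimensions the talk revisits. The toy table is illustrative.

```python
import numpy as np

# Reference sketch: Pearson's chi-squared statistic for independence in an
# r x c contingency table, X^2 = sum (O - E)^2 / E, where E is the product
# of the row and column marginals divided by the total count.

def pearson_chi2(table):
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return stat, dof

stat, dof = pearson_chi2([[10, 20], [20, 40]])  # rows exactly proportional
# Proportional rows give statistic 0 with (2-1)*(2-1) = 1 degree of freedom.
```

When many expected counts are tiny or zero, the division by E in this formula is exactly where the classical approximation breaks down, motivating the conditions and adjustments in the talk.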

5/1/2024 
RESCHEDULED FOR THURSDAY, MAY 9TH
Title: Student body elections
Abstract: We will elect our next student representative, student seminar organizers, and ASGC representative.

Thursday 5/9/2024 
Title: Student body elections 