Student Seminar Series

Schedule for Fall 2021

Attention: The Student Seminar will be in hybrid mode this semester. Most talks and events will be held in person, and people can also join via Zoom. In-person participation is only available to Columbia affiliates with building access.

Seminars are on Wednesdays
Time: 12:00 - 1:00pm

Contacts: Arnab Auddy, Ye Tian

Information for speakers: For information about the schedule, directions, equipment, reimbursement, and hotels, please click here.

Welcome to the New Academic Year & First/Second-Year Students Orientation
Abstract: First- and second-year students are welcome to introduce themselves and discuss any questions about life, the department, study, research, future plans, etc. with their fellow students. Our student representative Collin will lead the seminar and talk about exciting social events in our department.


Summer Intern Workshop

Abstract: Ph.D. students will talk about their summer internships at different companies.

Yuqi Gu (Columbia Stats)

Title: "Deep, Graphical, and Identifiable Modeling of Latent Structures and Their Application"

Abstract: This talk covers some of my recent work on latent variable models, including in particular the Bayesian Pyramids: identifiable deep discrete latent structure models for discrete data. Multivariate discrete data are routinely collected in biomedical and social sciences. It is of great importance to build interpretable parsimonious models that perform dimension reduction and uncover meaningful latent structures from such discrete data. Identifiability is a fundamental requirement for valid modeling and inference, yet is challenging to address when there are complex latent structures. We propose a class of identifiable deep discrete latent structure models termed Bayesian pyramids. Theoretically, we establish the identifiability of Bayesian pyramids by developing transparent conditions on the pyramid-shaped multilayer latent graph. Methodologically, we focus on the two-latent-layer model and propose a Bayesian shrinkage estimation approach. Simulation results corroborate the identifiability and estimability of the model parameters. Application of the methodology to DNA nucleotide sequence data uncovers useful discrete latent features that are highly predictive of sequence types. The proposed framework provides a recipe for interpretable unsupervised learning of discrete data, and can be a useful alternative to popular machine learning methods.

Along the way, I will also talk about the interesting connections between my research and tensor decompositions, mixed membership models, and deep generative models.

Join URL:

Meeting ID: 918 6027 2586

Passcode: 225317



Suyash Gupta (Stanford)

Title: Stability and reliability under distributional shifts

Abstract: While the traditional viewpoint in statistics and machine learning assumes training and testing samples come from the same population, practice belies this fiction. Hence, statistical knowledge from one study may not generalize to another, for example, when a statistical quantity changes drastically under distributional shifts. This motivates us to propose a measure of (in)stability that quantifies the distributional (in)stability of a statistical quantity. We next discuss how such knowledge can inform data collection for improved estimation of statistical quantities under shifted distributions. Towards the end, I will present some of the challenges posed by distributional shifts in making reliable predictions using conformal inference and discuss methods to address them.

Yinqiu He (Columbia)
Title: Adaptive High-Dimensional Testing by U-Statistics
Abstract: In modern scientific research that investigates large-scale data, researchers often start with questions regarding the global properties of a large set of units. For instance, are a group of related genes in the same functional pathway jointly associated with a trait of interest? Such questions can be formulated as high-dimensional hypothesis testing problems that globally examine a large number of parameters in a high-dimensional joint distribution. Examples include testing mean vectors, covariance matrices and regression coefficients. To efficiently extract informative scientific knowledge from abundant data, statistical power is one major concern in statistical inference.

In this talk, I will introduce a new adaptive testing framework that can maintain high statistical power against a wide range of alternative hypotheses. The proposed framework is based on a family of U-statistics that are constructed to capture the information in different directions in high-dimensional spaces. For a broad class of problems, we establish high-dimensional asymptotic theory for the U-statistics. Then we develop adaptive testing procedures that are statistically powerful across different scenarios. 


Yi-Hsuan Lee (Educational Testing Service)

Title: Application of Statistics in Educational Measurement

Abstract: Educational measurement is a field that uses educational tests to obtain observations from test takers in order to make inferences about their knowledge and skills. Measurement theory has been developed to provide the foundation for evaluating educational tests and their uses and interpretations, including fundamental concepts, considerations, and various statistical models and techniques to quantify those concepts and considerations. In this presentation, I will provide a brief introduction to the key concepts and some considerations in measurement theory. I will also discuss how statistical methods may be applied to assess the quality of educational tests and to infer the knowledge and skills of interest to educational researchers and practitioners, with some focus on more recent challenges and opportunities in the field.


Marco Avella (Columbia Stats)

Title: Spectral learning of multivariate extremes

Abstract: We propose a spectral clustering algorithm for analyzing the dependence structure of multivariate extremes. More specifically, we focus on the asymptotic dependence of multivariate extremes characterized by the angular or spectral measure in extreme value theory. Our work studies the theoretical performance of spectral clustering based on a random k-nearest neighbor graph constructed from an extremal sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. In particular, we derive the asymptotic distribution of extremes arising from a linear factor model and prove that, under certain conditions, spectral clustering can consistently identify the clusters of extremes arising in this model. Leveraging this result, we propose a simple consistent estimation strategy for learning the angular measure. Our theoretical findings are complemented with numerical experiments illustrating the finite sample performance of our methods.

This is joint work with Richard Davis (Columbia) and Gennady Samorodnitsky (Cornell).

Daniel Hsu (Columbia CS)
Title: On the approximation power of two-layer networks of random ReLUs

Abstract: How well can depth-two ReLU networks with random bottom-level weights represent simple functions? We give near-matching upper and lower bounds for $L_2$-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms. Our positive results employ tools from harmonic analysis and ridgelet representation theory, while our lower bounds are based on (robust versions of) dimensionality arguments. Joint work with Clayton Sanford, Rocco Servedio, and Emmanouil-Vasileios Vlatakis-Gkaragkounis.

Morgane Austern (Harvard)

Title: Some recent results for model evaluation

Abstract: Estimating and evaluating the generalization capabilities of an estimator is a fundamental task of statistical inference. In this talk we are interested in better understanding how well the cross-validated risk estimates the risk, and in improving finite sample generalization bounds.

In the first part of this talk, we study the cross-validation method, a ubiquitous method for risk estimation, and establish its asymptotic properties for a large class of models and with an arbitrary number of folds. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds for the cross-validated risk, which enable us to compute asymptotically accurate confidence intervals. Using our results, we study the statistical speed-up offered by cross-validation compared to a train-test split procedure. We reveal some surprising behavior of the cross-validated risk and establish the statistically optimal choice for the number of folds.

In the second part of this talk, we remark that while concentration inequalities are fundamental tools to obtain finite sample generalization guarantees, those bounds are known to be loose. Moreover, while limit theorems provide asymptotically tight bounds, those tools are not valid for finite samples. Motivated by this observation, we propose a new method for deriving concentration inequalities that is both valid in finite samples and asymptotically optimal. We demonstrate that the bounds obtained improve on classical concentration inequalities such as the Bernstein or Azuma inequality.




Hongseok Namkoong (Columbia Business School, DRO)

Title: Scalable Sensitivity Analysis Using Modern Prediction Methods
Abstract: When experimentation is expensive or risky, previously collected observational data can be used to evaluate the causal effect of a decision if observed decisions depend only on observed variables. However, this assumption is frequently violated due to unobserved confounders that simultaneously impact decisions and their outcomes. We develop methods for assessing the robustness of observational studies by developing worst-case bounds under unobserved confounding. First, we derive a loss minimization method for estimating worst-case bounds on the personalized treatment effect (CATE). Our approach is scalable and allows the flexible use of black-box machine learning methods. We then propose a related sensitivity analysis for the average treatment effect (ATE) and develop a semiparametric framework that extends/bounds the popular augmented inverse propensity weighted (AIPW) estimator for the ATE. By virtue of satisfying a key orthogonality property, our estimator enjoys central limit rates even when ML-based estimates of nuisance parameters converge more slowly. On real and simulated data, our scalable methods allow analyzing the sensitivity of observational studies in practical finite sample regimes.

No Seminar


Nabarun Deb (Columbia Stats)


Alejandra Quintos Lima (Columbia Stats)