Student Seminar Series

Schedule for Fall 2023

Seminars are on Wednesdays 

Time: 12:00–1:00 pm

Location: Room 903, 1255 Amsterdam Avenue

Contacts: Wribhu Banik, Seunghyun Lee, Anirban Nath


New Students Welcome and Introductions


Summer Intern Workshop

Stats Ph.D. students Ye Tian, Navid Ardeshir, Richard Groenewald, and Zhen Huang will share their experiences from their summer internships.


Kaizheng Wang (Columbia IEOR)


Title: Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift


Abstract: We develop and analyze a principled approach to kernel ridge regression under covariate shift. The goal is to learn a regression function with small mean squared error over a target distribution, based on unlabeled data from the target distribution and labeled data that may have a different feature distribution. We propose to split the labeled data into two subsets and conduct kernel ridge regression on them separately to obtain a collection of candidate models and an imputation model. We use the latter to fill in the missing labels and then select the best candidate model accordingly. Our non-asymptotic excess risk bounds show that in quite general scenarios, our estimator adapts to the structure of the target distribution and the covariate shift. It achieves the minimax optimal error rate up to a logarithmic factor. The use of pseudo-labels in model selection does not have major negative impacts.
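The procedure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the data, kernel choice, and regularization grid are all invented for the example, and scikit-learn's KernelRidge stands in for the kernel ridge regression estimators.

```python
# Hedged sketch of the pseudo-labeling model-selection idea:
# split labeled data, fit candidate models and an imputation model,
# pseudo-label the unlabeled target data, then select a candidate.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Toy source-distribution labeled data; target covariates have shifted support.
X_labeled = rng.uniform(-1, 1, size=(200, 1))
y_labeled = np.sin(3 * X_labeled[:, 0]) + 0.1 * rng.standard_normal(200)
X_target = rng.uniform(0, 1, size=(100, 1))  # unlabeled, covariate-shifted

# Step 1: split the labeled data into two subsets.
half = len(X_labeled) // 2
X_fit, y_fit = X_labeled[:half], y_labeled[:half]
X_imp, y_imp = X_labeled[half:], y_labeled[half:]

# Step 2: candidate models on one subset (illustrative regularization grid),
# and an imputation model on the other.
candidates = [KernelRidge(alpha=a, kernel="rbf").fit(X_fit, y_fit)
              for a in (1e-3, 1e-2, 1e-1, 1.0)]
imputer = KernelRidge(alpha=1e-2, kernel="rbf").fit(X_imp, y_imp)

# Step 3: fill in the missing labels on the target data.
pseudo_y = imputer.predict(X_target)

# Step 4: select the candidate with the smallest MSE against the pseudo-labels.
scores = [np.mean((m.predict(X_target) - pseudo_y) ** 2) for m in candidates]
best = candidates[int(np.argmin(scores))]
print("selected alpha:", best.alpha)
```

The selection step evaluates error over the target covariates, which is the point of the method: the chosen regularization level reflects the shifted distribution rather than the source one.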


Yongchan Kwon (Columbia)


Title: Data Valuation: Shapley Value and Beyond


Abstract: Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many existing Shapley-based data valuation methods have shown promising results in various downstream tasks. However, they are well known to be computationally challenging because they require training a large number of models. As a result, applying them to large datasets has been recognized as infeasible. In this talk, we introduce Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Furthermore, Data-OOB has solid theoretical interpretations in that it identifies the same set of important data points as the infinitesimal jackknife influence function. We conduct comprehensive experiments on various classification datasets with sample sizes in the thousands. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding sets of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.
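The out-of-bag idea behind this talk can be illustrated with a short sketch. This is not the speaker's Data-OOB code: the dataset, the depth-3 decision trees, and the number of bootstrap rounds are all invented for the example. It only shows the core mechanism, namely scoring each point by the average accuracy of the weak learners that did not see it during training.

```python
# Hedged sketch of out-of-bag data valuation: a point's value is the
# fraction of out-of-bag weak learners that classify it correctly,
# so mislabeled points tend to receive low values.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, B = 300, 50  # sample size, number of weak learners (illustrative)

# Toy binary classification data with a few deliberately flipped labels.
X = rng.standard_normal((n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flipped = rng.choice(n, size=15, replace=False)
y[flipped] = 1 - y[flipped]

correct = np.zeros(n)  # out-of-bag correct predictions per point
counts = np.zeros(n)   # number of times each point was out-of-bag

for _ in range(B):
    boot = rng.integers(0, n, size=n)        # bootstrap sample indices
    oob = np.setdiff1d(np.arange(n), boot)   # points not in this sample
    tree = DecisionTreeClassifier(max_depth=3).fit(X[boot], y[boot])
    correct[oob] += (tree.predict(X[oob]) == y[oob])
    counts[oob] += 1

values = correct / np.maximum(counts, 1)     # per-point value in [0, 1]

clean = np.setdiff1d(np.arange(n), flipped)
print("mean value, clean vs flipped:", values[clean].mean(), values[flipped].mean())
```

Because every weak learner is trained anyway when fitting the bagging model, the valuation comes essentially for free, which is the source of the method's scalability claim.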


Two Sigma


Chenyang Zhong (Columbia)



Eren Kizildag (Columbia)