Statistics Seminar – Spring 2019

Schedule for Spring 2019

Seminars are on Mondays
Time: 4:10pm – 5:00pm
Location: Room 903, 1255 Amsterdam Avenue

Tea and coffee will be served before the seminar at 3:30 PM in the 10th Floor Lounge SSW

A cheese and wine reception will follow the seminar at 5:10 PM in the 10th Floor Lounge SSW




*Time: 12 noon to 1:00 pm

*Location: Room 1025 SSW

Alex Young (Big Data Institute, University of Oxford)

“Disentangling nature and nurture using genomic family data”

Abstract: Heritability measures the relative contribution of genetic inheritance (nature) and environment (nurture) to trait variation. Estimation of heritability is especially challenging when genetic and environmental effects are correlated, such as when indirect genetic effects from relatives are present. An indirect genetic effect on a proband (phenotyped individual) is the effect of a genetic variant on the proband through the proband’s environment. Examples of indirect genetic effects include effects from parents to offspring, which occur when parental nurturing behaviours are influenced by parents’ genes. I show that indirect genetic effects from parents to offspring and between siblings are substantial for educational attainment. I show that, when indirect genetic effects from relatives are substantial, existing methods for estimating heritability can be severely biased. To remedy this and other problems in heritability estimation, such as population stratification, I introduce a novel method for estimating heritability: relatedness disequilibrium regression (RDR). RDR removes environmental bias by exploiting variation in relatedness due to random Mendelian segregations in the probands’ parents. We show mathematically and in simulations that RDR estimates heritability with negligible bias due to environment in almost all scenarios. I report results from applying RDR to a sample of 54,888 Icelanders with both parents genotyped to estimate the heritability of 14 traits, including height (55.4%, S.E. 4.4%), body mass index (BMI) (28.9%, S.E. 6.3%), and educational attainment (17.0%, S.E. 9.4%), finding evidence for substantial overestimation from other methods. Furthermore, without genotype data on close relatives of the proband – such as used by RDR – the results show that it is impossible to remove the bias due to indirect genetic effects and to completely remove the confounding due to population stratification. 
I outline a research program for building methods that take advantage of the unique properties of genomic family data to disentangle nature, nurture, and environment in order to build a rich understanding of the causes of social and health inequalities. 



Time: 12 noon

Location: Room 1025 SSW

Chengchun Shi (NC State)

“On Statistical Learning for Individualized Decision Making with Complex Data”

In this talk, I will present my research on individualized decision making with modern complex data. In precision medicine, individualizing the treatment decision rule can capture patients’ heterogeneous responses to treatment. In finance, individualizing the investment decision rule can improve individuals’ financial well-being. In a ride-sharing company, individualizing the order dispatching strategy can increase revenue and customer satisfaction. With the fast development of new technology, modern datasets often contain massive numbers of observations and high-dimensional covariates, and are characterized by some degree of heterogeneity.

The talk is divided into two parts. In the first part, I will focus on data heterogeneity and introduce a new maximin-projection learning method for recommending an overall individualized decision rule based on observed data from different populations with heterogeneity in optimal individualized decision making. In the second part, I will briefly summarize the statistical learning methods I’ve developed for individualized decision making with complex data and discuss my future research directions.



Time: 12 noon to 1:00 pm

Location: Room 903 SSW


Jingshen Wang (Michigan)

Title: Inference on Treatment Effects after Model Selection


Inferring cause-effect relationships between variables is of primary importance in many sciences. In this talk, I will discuss two approaches for making valid inference on treatment effects when a large number of covariates are present. The first approach is to perform model selection and then to deliver inference based on the selected model. If the inference is made ignoring the randomness of the model selection process, then there could be severe biases in estimating the parameters of interest. While the estimation bias in an under-fitted model is well understood, I will address a lesser-known bias that arises from an over-fitted model. The over-fitting bias can be eliminated through data splitting at the cost of statistical efficiency, and I will propose a repeated data splitting approach to mitigate the efficiency loss. The second approach concerns the existing methods for debiased inference. I will show that the debiasing approach is an extension of OLS to high dimensions, and that a careful bias analysis leads to an improvement to further control the bias. The comparison between these two approaches provides insights into their intrinsic bias-variance trade-off, and I will show that the debiasing approach may lose efficiency in observational studies.

This is joint work with Xuming He and Gongjun Xu.






Time: 4:10 pm

Simon Mak (Georgia Tech)

“Support points – a new way to reduce big and high-dimensional data”

Abstract: This talk presents a new method for reducing big and high-dimensional data into a smaller dataset, called support points (SPs). In an era where data is plentiful but downstream analysis is oftentimes expensive, SPs can be used to tackle many big data challenges in statistics, engineering and machine learning. SPs have two key advantages over existing methods. First, SPs provide optimal and model-free reduction of big data for a broad range of downstream analyses. Second, SPs can be efficiently computed via parallelized difference-of-convex optimization; this allows us to reduce millions of data points to a representative dataset in mere seconds. SPs also enjoy appealing theoretical guarantees, including distributional convergence and improved reduction over random sampling and clustering-based methods. The effectiveness of SPs is then demonstrated in two real-world applications, the first for reducing long Markov Chain Monte Carlo (MCMC) chains for rocket engine design, and the second for data reduction in computationally intensive predictive modeling.



Jingshu Wang (Stanford/Penn)

“Data Denoising for Single-cell RNA sequencing”


Single-cell RNA sequencing (scRNA-seq) measures gene expression levels in individual cells. It is a ground-breaking advance over microarrays and bulk RNA sequencing and is reshaping the field of biology. Though the technology is exciting, scRNA-seq data are very noisy, often too noisy for signal detection and robust analysis. In the talk, I will discuss how we perform data denoising by learning across similar genes and borrowing information from external public datasets to improve the quality of downstream analysis.

Specifically, I will discuss how we set up the model by decomposing the randomness of scRNA-seq data into three components: the structured shared variation across genes, biological “noise”, and technical noise, based on the current understanding of the stochasticity in DNA transcription. I will emphasize one key challenge in each component and our contributions. I will show how we make proper assumptions on the technical noise and introduce a key feature, transfer learning, in our denoising method SAVER-X. SAVER-X uses a deep autoencoder neural network coupled with empirical Bayes shrinkage to extract transferable gene expression features across datasets under different settings and learn from external data as prior information. I will show that SAVER-X can successfully transfer information from mouse to human cells and can guard against bias. I’ll also briefly discuss our ongoing work on post-denoising inference for scRNA-seq.


Jonathan Weed (MIT)

Title: Large-scale Optimal Transport: Statistics and Computation

Abstract: Optimal transport is a concept from probability which has recently seen an explosion of interest in machine learning and statistics as a tool for analyzing high-dimensional data. However, the key obstacle in using optimal transport in practice has been its high statistical and computational cost. In this talk, we show how exploiting different notions of structure can lead to better statistical rates—beating the curse of dimensionality—and state-of-the-art algorithms.



Time: 4:10 pm

Pragya Sur (Stanford)

“A modern maximum-likelihood approach for high-dimensional logistic regression”

Abstract: Logistic regression is arguably the most widely used and studied non-linear model in statistics. Statistical inference based on classical maximum-likelihood theory is ubiquitous in this context. This theory hinges on well-known fundamental results: (1) the maximum-likelihood estimate (MLE) is asymptotically unbiased and normally distributed, (2) its variability can be quantified via the inverse Fisher information, and (3) the likelihood-ratio test (LRT) statistic is asymptotically chi-squared distributed. In this talk, I will show that in the common modern setting where the number of features and the sample size are both large and comparable, classical results are far from accurate. In fact, (1) the MLE is biased, (2) its variability is far greater than classical results predict, and (3) the LRT statistic is not chi-squared distributed. Consequently, p-values obtained based on classical theory are completely invalid in high dimensions. In turn, I will propose a new theory that characterizes the asymptotic behavior of both the MLE and the LRT under some assumptions on the covariate distribution, in a high-dimensional setting. Empirical evidence demonstrates that this asymptotic theory provides accurate inference in finite samples. Practical implementation of these results necessitates the estimation of a single scalar, the overall signal strength, and I will propose a procedure for estimating this parameter precisely. This is based on joint work with Emmanuel Candes and Yuxin Chen.


Thomas Nagler (Technical University of Munich)

“Copula-based regression”

Copulas are models for the dependence in a random vector and allow one to build multivariate models with arbitrary one-dimensional margins. Recently, researchers have started to apply copulas to statistical learning problems such as regression or classification. We propose a unified framework for the analysis of such approaches by defining the estimators as solutions of copula-based estimating equations. We present general results on their asymptotic behavior and the validity of the bootstrap. The conditions are broad enough to cover most regression-type problems as well as parametric and nonparametric estimators of the copula. The versatility of the method is illustrated with numerical examples and a possible extension to missing data problems.


Michael Hudgens (UNC)

Title: Causal Inference in the Presence of Interference

Abstract: A fundamental assumption usually made in causal inference is that of no interference between individuals (or units), i.e., the potential outcomes of one individual are assumed to be unaffected by the treatment assignment of other individuals. However, in many settings, this assumption obviously does not hold. For example, in infectious diseases, whether one person becomes infected may depend on who else in the population is vaccinated. In this talk we will discuss recent approaches to assessing treatment effects in the presence of interference.


Andrew Gelman (Department of Statistics and Department of Political Science, Columbia University)

“We’ve Got More Than One Model: Evaluating, comparing, and extending Bayesian predictions”

Methods in statistics and data science are often framed as solutions to particular problems, in which a particular model or method is applied to a dataset. But good practice typically requires multiplicity, in two dimensions: fitting many different models to better understand a single dataset, and applying a method to a series of different but related problems. To understand and make appropriate inferences from real-world data analysis, we should account for the set of models we might fit, and for the set of problems to which we would apply a method. This is known as the reference set in frequentist statistics or the prior distribution in Bayesian statistics. We shall discuss recent research of ours that addresses these issues, involving the following statistical ideas: Type M errors, the multiverse, weakly informative priors, Bayesian stacking and cross-validation, simulation-based model checking, divide-and-conquer algorithms, and validation of approximate computations. We will also discuss how this work is motivated by applications in political science, pharmacology, and other areas.

Lucas Janson (Harvard)

Should We Model X in High-Dimensional Inference?

Abstract: For answering questions about the relationship between a response variable Y and a set of explanatory variables X, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the distribution of X instead, especially when X is high-dimensional. I will review my recent methodological work on knockoffs and the conditional randomization test, and explain how the model-X framework endows them with desirable properties like finite-sample error control, power, modeling flexibility, and robustness. At the end, I will introduce some very recent (arXiv’ed last week) breakthroughs on model-X methods for high-dimensional inference, as well as some challenges and interesting directions for the future in this area.

3/18/19 Spring Break – No Seminar

Xi Chen (NYU)

“Quantile Regression for big data with small memory”

Abstract: In this talk, we discuss the inference problem of quantile regression for a large sample size n under a limited memory constraint, where the memory can only store a small batch of data of size m. A popular approach, the naive divide-and-conquer method, only works when n=o(m^2) and is computationally expensive. This talk proposes a novel inference approach and establishes an asymptotic normality result that achieves the same efficiency as the quantile regression estimator computed on all the data. Essentially, our method allows an arbitrarily large sample size n relative to the memory size m. Our method can also be applied to quantile regression in a distributed computing environment (e.g., in a large-scale sensor network) or for real-time streaming data. This is joint work with Weidong Liu and Yichen Zhang.

Bio: Xi Chen is an assistant professor at the Stern School of Business at New York University. Before that, he was a postdoc in the group of Prof. Michael Jordan at UC Berkeley. He obtained his Ph.D. from the Machine Learning Department at Carnegie Mellon University.

He studies high-dimensional statistics, multi-armed bandits, and stochastic optimization. He has received the NSF CAREER Award, the Simons-Berkeley Research Fellowship, the Google Faculty Award, the Adobe Data Science Award, and the Bloomberg Research Award, and was featured in the 2017 Forbes list of “30 Under 30 in Science”.




Weijie Su (UPenn)




Dan Simpson (University of Toronto)


Tailen Hsing (University of Michigan)

5/6/19 Last day of classes