Statistics Seminar – Spring 2019

Schedule for Spring 2019

Seminars are on Mondays
Time: 4:10pm – 5:00pm
Location: Room 903, 1255 Amsterdam Avenue

Tea and Coffee will be served before the seminar at 3:30 PM, 10th Floor Lounge SSW

Cheese and Wine reception will follow the seminar at 5:10 PM in the 10th Floor Lounge SSW

For an archive of past seminars, please click here.



*Time: 12 noon to 1:00 pm

*Location: Room 1025 SSW

Alex Young (Big Data Institute, University of Oxford)

“Disentangling nature and nurture using genomic family data”

Abstract: Heritability measures the relative contribution of genetic inheritance (nature) and environment (nurture) to trait variation. Estimation of heritability is especially challenging when genetic and environmental effects are correlated, such as when indirect genetic effects from relatives are present. An indirect genetic effect on a proband (phenotyped individual) is the effect of a genetic variant on the proband through the proband’s environment. Examples of indirect genetic effects include effects from parents to offspring, which occur when parental nurturing behaviours are influenced by parents’ genes. I show that indirect genetic effects from parents to offspring and between siblings are substantial for educational attainment. I show that, when indirect genetic effects from relatives are substantial, existing methods for estimating heritability can be severely biased. To remedy this and other problems in heritability estimation, such as population stratification, I introduce a novel method for estimating heritability: relatedness disequilibrium regression (RDR). RDR removes environmental bias by exploiting variation in relatedness due to random Mendelian segregations in the probands’ parents. We show mathematically and in simulations that RDR estimates heritability with negligible bias due to environment in almost all scenarios. I report results from applying RDR to a sample of 54,888 Icelanders with both parents genotyped to estimate the heritability of 14 traits, including height (55.4%, S.E. 4.4%), body mass index (BMI) (28.9%, S.E. 6.3%), and educational attainment (17.0%, S.E. 9.4%), finding evidence for substantial overestimation from other methods. Furthermore, without genotype data on close relatives of the proband – such as used by RDR – the results show that it is impossible to remove the bias due to indirect genetic effects and to completely remove the confounding due to population stratification. 
I outline a research program for building methods that take advantage of the unique properties of genomic family data to disentangle nature, nurture, and environment in order to build a rich understanding of the causes of social and health inequalities. 



Time: 12 noon Location: 1025 SSW

Chengchun Shi (NC State)

“On Statistical Learning for Individualized Decision Making with Complex Data.”

In this talk, I will present my research on individualized decision making with modern complex data. In precision medicine, individualizing the treatment decision rule can capture patients’ heterogeneous responses to treatment. In finance, individualizing the investment decision rule can improve an individual’s financial well-being. In a ride-sharing company, individualizing the order dispatching strategy can increase revenue and customer satisfaction. With the fast development of new technology, modern datasets often contain massive numbers of observations and high-dimensional covariates, and are characterized by some degree of heterogeneity.

The talk is divided into two parts. In the first part, I will focus on data heterogeneity and introduce a new maximin-projection learning method for recommending an overall individualized decision rule based on observed data from different populations with heterogeneity in optimal individualized decision making. In the second part, I will briefly summarize the statistical learning methods I’ve developed for individualized decision making with complex data and discuss my future research directions.



Time: 12 noon to 1:00 pm Location: 903 SSW


Jingshen Wang (Michigan)

Title: Inference on Treatment Effects after Model Selection


Inferring cause-effect relationships between variables is of primary importance in many sciences. In this talk, I will discuss two approaches for making valid inference on treatment effects when a large number of covariates are present. The first approach is to perform model selection and then to deliver inference based on the selected model. If the inference is made ignoring the randomness of the model selection process, then there could be severe biases in estimating the parameters of interest. While the estimation bias in an under-fitted model is well understood, I will address a lesser known bias that arises from an over-fitted model. The over-fitting bias can be eliminated through data splitting at the cost of statistical efficiency, and I will propose a repeated data splitting approach to mitigate the efficiency loss. The second approach concerns the existing methods for debiased inference. I will show that the debiasing approach is an extension of OLS to high dimensions, and that a careful bias analysis leads to an improvement to further control the bias. The comparison between these two approaches provides insights into their intrinsic bias-variance trade-off, and I will show that the debiasing approach may lose efficiency in observational studies.

This is joint work with Xuming He and Gongjun Xu.






Time: 4:10 pm

Simon Mak (Georgia Tech)

“Support points – a new way to reduce big and high-dimensional data”

Abstract: This talk presents a new method for reducing big and high-dimensional data into a smaller dataset, called support points (SPs). In an era where data is plentiful but downstream analysis is oftentimes expensive, SPs can be used to tackle many big data challenges in statistics, engineering and machine learning. SPs have two key advantages over existing methods. First, SPs provide optimal and model-free reduction of big data for a broad range of downstream analyses. Second, SPs can be efficiently computed via parallelized difference-of-convex optimization; this allows us to reduce millions of data points to a representative dataset in mere seconds. SPs also enjoy appealing theoretical guarantees, including distributional convergence and improved reduction over random sampling and clustering-based methods. The effectiveness of SPs is then demonstrated in two real-world applications, the first for reducing long Markov Chain Monte Carlo (MCMC) chains for rocket engine design, and the second for data reduction in computationally intensive predictive modeling.



Jingshu Wang (Stanford/Penn)

“Data Denoising for Single-cell RNA sequencing”


Single-cell RNA sequencing (scRNA-seq) measures gene expression levels in individual cells. It is a ground-breaking advance over microarrays and bulk RNA sequencing and is reshaping the field of biology. Though the technology is exciting, scRNA-seq data are often too noisy for signal detection and robust analysis. In the talk, I will discuss how we perform data denoising by learning across similar genes and borrowing information from external public datasets to improve the quality of downstream analysis.

Specifically, I will discuss how we set up the model by decomposing the randomness of scRNA-seq data into three components: the structured shared variation across genes, biological “noise”, and technical noise, based on the current understanding of stochasticity in DNA transcription. I will emphasize one key challenge in each component and our contributions. I will show how we make proper assumptions on the technical noise and introduce a key feature, transfer learning, in our denoising method SAVER-X. SAVER-X uses a deep autoencoder neural network coupled with Empirical Bayes shrinkage to extract transferable gene expression features across datasets under different settings and learn from external data as prior information. I will show that SAVER-X can successfully transfer information from mouse to human cells and can guard against bias. I’ll also briefly discuss our ongoing work on post-denoising inference for scRNA-seq.


Jonathan Weed (MIT)

Title: Large-scale Optimal Transport: Statistics and Computation

Abstract: Optimal transport is a concept from probability which has recently seen an explosion of interest in machine learning and statistics as a tool for analyzing high-dimensional data. However, the key obstacle in using optimal transport in practice has been its high statistical and computational cost. In this talk, we show how exploiting different notions of structure can lead to better statistical rates—beating the curse of dimensionality—and state-of-the-art algorithms.



Time: 4:10 pm

Pragya Sur (Stanford)

“A modern maximum-likelihood approach for high-dimensional logistic regression”

Abstract: Logistic regression is arguably the most widely used and studied non-linear model in statistics. Statistical inference based on classical maximum-likelihood theory is ubiquitous in this context. This theory hinges on well-known fundamental results: (1) the maximum-likelihood estimate (MLE) is asymptotically unbiased and normally distributed, (2) its variability can be quantified via the inverse Fisher information, and (3) the likelihood-ratio test (LRT) statistic is asymptotically chi-squared distributed. In this talk, I will show that in the common modern setting where the number of features and the sample size are both large and comparable, classical results are far from accurate. In fact, (1) the MLE is biased, (2) its variability is far greater than classical results predict, and (3) the LRT statistic is not chi-squared distributed. Consequently, p-values obtained based on classical theory are completely invalid in high dimensions. In turn, I will propose a new theory that characterizes the asymptotic behavior of both the MLE and the LRT under some assumptions on the covariate distribution, in a high-dimensional setting. Empirical evidence demonstrates that this asymptotic theory provides accurate inference in finite samples. Practical implementation of these results necessitates the estimation of a single scalar, the overall signal strength, and I will propose a procedure for estimating this parameter precisely. This is based on joint work with Emmanuel Candes and Yuxin Chen.


Thomas Nagler (Technical University of Munich)

“Copula-based regression”

Copulas are models for the dependence in a random vector and allow one to build multivariate models with arbitrary one-dimensional margins. Recently, researchers have started to apply copulas to statistical learning problems such as regression or classification. We propose a unified framework for the analysis of such approaches by defining the estimators as solutions of copula-based estimating equations. We present general results on their asymptotic behavior and the validity of the bootstrap. The conditions are broad enough to cover most regression-type problems as well as parametric and nonparametric estimators of the copula. The versatility of the method is illustrated with numerical examples and a possible extension to missing data problems.


Michael Hudgens (UNC)

Title: Causal Inference in the Presence of Interference

Abstract: A fundamental assumption usually made in causal inference is that of no interference between individuals (or units), i.e., the potential outcomes of one individual are assumed to be unaffected by the treatment assignment of other individuals. However, in many settings, this assumption obviously does not hold. For example, in infectious diseases, whether one person becomes infected may depend on who else in the population is vaccinated. In this talk we will discuss recent approaches to assessing treatment effects in the presence of interference.


Andrew Gelman (Department of Statistics and Department of Political Science, Columbia University)

“We’ve Got More Than One Model: Evaluating, comparing, and extending Bayesian predictions”

Methods in statistics and data science are often framed as solutions to particular problems, in which a particular model or method is applied to a dataset. But good practice typically requires multiplicity, in two dimensions: fitting many different models to better understand a single dataset, and applying a method to a series of different but related problems. To understand and make appropriate inferences from real-world data analysis, we should account for the set of models we might fit, and for the set of problems to which we would apply a method. This is known as the reference set in frequentist statistics or the prior distribution in Bayesian statistics. We shall discuss recent research of ours that addresses these issues, involving the following statistical ideas: Type M errors, the multiverse, weakly informative priors, Bayesian stacking and cross-validation, simulation-based model checking, divide-and-conquer algorithms, and validation of approximate computations. We will also discuss how this work is motivated by applications in political science, pharmacology, and other areas.

Lucas Janson (Harvard)

Should We Model X in High-Dimensional Inference?

Abstract: For answering questions about the relationship between a response variable Y and a set of explanatory variables X, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the distribution of X instead, especially when X is high-dimensional. I will review my recent methodological work on knockoffs and the conditional randomization test, and explain how the model-X framework endows them with desirable properties like finite-sample error control, power, modeling flexibility, and robustness. At the end, I will introduce some very recent (arXiv’ed last week) breakthroughs on model-X methods for high-dimensional inference, as well as some challenges and interesting directions for the future in this area.

3/18/19 Spring Break – No Seminar

Xi Chen (NYU)

“Quantile Regression for big data with small memory”

Abstract: In this talk, we discuss the inference problem of quantile regression for a large sample size n under a limited memory constraint, where the memory can only store a small batch of data of size m. A popular approach, the naive divide-and-conquer method, only works when n=o(m^2) and is computationally expensive. This talk proposes a novel inference approach and establishes an asymptotic normality result that achieves the same efficiency as the quantile regression estimator computed on all the data. Essentially, our method allows an arbitrarily large sample size n relative to the memory size m. Our method can also be applied to quantile regression in a distributed computing environment (e.g., in a large-scale sensor network) or for real-time streaming data. This is joint work with Weidong Liu and Yichen Zhang.
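As a rough illustration of the naive divide-and-conquer idea mentioned in the abstract, the sketch below handles only the simplest special case, an intercept-only quantile regression (that is, estimating a single quantile). The data are read in batches of size m, a quantile is computed within each batch, and the batch estimates are averaged. The function names `batch_quantile` and `naive_dc_quantile`, the batch size, the treatment of a trailing partial batch, and the simulated data are all illustrative assumptions. This is not the speaker's proposed method.

```python
import math
import random


def batch_quantile(batch, tau):
    """Empirical tau-quantile of one batch (order-statistic definition)."""
    ordered = sorted(batch)
    # Index of the smallest order statistic whose empirical CDF is >= tau.
    k = max(1, math.ceil(tau * len(ordered)))
    return ordered[k - 1]


def naive_dc_quantile(stream, tau, m):
    """Average per-batch quantiles, holding at most m observations in memory."""
    batch, total, n_batches = [], 0.0, 0
    for x in stream:
        batch.append(x)
        if len(batch) == m:
            total += batch_quantile(batch, tau)
            n_batches += 1
            batch = []  # release the batch before reading the next one
    if batch:
        # Assumption: a trailing partial batch is weighted like a full one.
        total += batch_quantile(batch, tau)
        n_batches += 1
    if n_batches == 0:
        raise ValueError("stream is empty")
    return total / n_batches


rng = random.Random(0)
n, m = 100_000, 1_000
# A generator, so the full sample is never stored at once.
data = (rng.gauss(0.0, 1.0) for _ in range(n))
median_estimate = naive_dc_quantile(data, tau=0.5, m=m)
print(f"naive divide-and-conquer median estimate: {median_estimate:.4f}")
```

Each batch quantile carries a small bias that averaging does not remove, which gives some intuition for why the naive approach needs the number of batches to stay small relative to the batch size.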

Bio: Xi Chen is an assistant professor at Stern School of Business at New York University. Before that, he was a Postdoc in the group of Prof. Michael Jordan at UC Berkeley. He obtained his Ph.D. from the Machine Learning Department at Carnegie Mellon University.

He studies high-dimensional statistics, multi-armed bandits, and stochastic optimization. He has received the NSF CAREER Award, the Simons-Berkeley Research Fellowship, a Google Faculty Award, an Adobe Data Science Award, and a Bloomberg research award, and was featured in the 2017 Forbes list of “30 Under 30 in Science”.


Yajun Mei (Georgia Institute of Technology)

“Multi-Armed Bandit Techniques for Online Monitoring High-Dimensional Streaming Data”


We investigate the problem of online monitoring of high-dimensional streaming data in resource-constrained environments, where one has limited capacity in data acquisition, transmission or processing, and needs to decide which local components or features of the high-dimensional streaming data to observe at each time step so as to detect potential anomalies rapidly.

In the first part of this talk, we provide an overview of the classical sequential change-point detection problem for real-valued data streams, as well as classical multi-armed bandit algorithms.

In the second part of the talk, we present our latest research on efficient scalable schemes for online monitoring of high-dimensional data, as well as the corresponding multi-armed bandit versions. Both asymptotic analysis and numerical simulations demonstrate the usefulness of our proposed approach in the context of online monitoring of large-scale data streams in resource-constrained environments.


Weijie Su (UPenn)

“Gaussian Differential Privacy”

Abstract: Privacy-preserving data analysis has been put on a firm mathematical foundation since the introduction of differential privacy (DP) in 2006, with successful recent deployments in Chrome and iOS. However, this framework in its present form fails to precisely characterize how privacy degrades in a series of privacy breaches and how privacy amplifies with subsampling, thereby creating a major bottleneck for further extending this privacy definition. In this talk, we propose a hypothesis-testing-based framework for private data analysis, including Gaussian differential privacy (GDP) as a central example. This new framework gives a statistically interpretable assessment of privacy loss and includes many existing privacy notions as examples. In addition, we develop a suite of tools, including a central limit theorem, for accurately evaluating the privacy loss under composition or with subsampling in this privacy framework. Finally, we analyze the privacy of stochastic gradient descent in this new framework. This is joint work with Jinshuo Dong and Aaron Roth.
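For readers unfamiliar with the hypothesis-testing view, a minimal sketch of the Gaussian trade-off function may help. It assumes the definition used in the Gaussian differential privacy literature, G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), which is the smallest type II error achievable at type I error alpha when testing N(0, 1) against N(mu, 1). It also assumes the composition rule under which mechanisms that are mu_1-GDP and mu_2-GDP compose to sqrt(mu_1^2 + mu_2^2)-GDP. Neither formula is stated in the abstract itself, and the function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def gaussian_tradeoff(alpha, mu):
    """Type II error of the optimal test of N(0,1) vs N(mu,1) at level alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


def compose_gdp(mus):
    """Privacy parameter after composing mechanisms with the given mu values."""
    return math.sqrt(sum(mu * mu for mu in mus))


# A larger mu means the two hypotheses are easier to tell apart (less privacy),
# so the attainable type II error at a fixed type I error shrinks.
for mu in (0.5, 1.0, 2.0):
    print(mu, round(gaussian_tradeoff(0.05, mu), 4))

# Ten releases that are each 0.3-GDP, composed under the assumed rule.
print(round(compose_gdp([0.3] * 10), 4))
```

With mu = 0 the two hypotheses coincide and the trade-off reduces to 1 - alpha, the perfect-privacy baseline.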


Dr. Pierre Bellec (Rutgers)

“Second Order Stein, SURE for SURE and degrees-of-freedom adjustments in high-dimensional statistics”
Stein’s formula states that a random variable of the form $z^\top f(z) - \operatorname{div} f(z)$ is mean-zero for all functions $f$ with integrable gradient. Here, $\operatorname{div} f$ is the divergence of the function $f$ and $z$ is a standard normal vector. We develop Second-Order Stein formulae for statistical inference with high-dimensional data. In the simplest form, the Second-Order Stein formula characterizes the variance of $z^\top f(z) - \operatorname{div} f(z)$. It also implies bounds on the variance of a function $f(z)$ of a standard normal vector, and these bounds are of a different nature than the classical Poincaré or log-Sobolev inequalities.

The motivation that led to the above probabilistic results was the study of degrees-of-freedom adjustments in high-dimensional inference problems. A well-understood degrees-of-freedom adjustment appears in Stein’s Unbiased Risk Estimate (SURE) to construct an unbiased estimate of the mean square risk of almost any estimator $\hat \mu$; here the divergence of $\hat \mu$ plays the role of the degrees-of-freedom of the estimator. A first application of the Second-Order Stein formula is an Unbiased Risk Estimate of the risk of SURE itself (SURE for SURE): a simple unbiased estimate provides information about the squared distance between SURE and the squared estimation error of $\hat \mu$.

A novel analysis reveals that degrees-of-freedom adjustments play a major role in de-biasing methodologies to construct confidence intervals in high dimensions. We will see that in sparse linear regression, for the Lasso with Gaussian designs, existing de-biasing schemes need to be modified with an adjustment that accounts for the degrees-of-freedom of the Lasso. This degrees-of-freedom adjustment is necessary for statistical efficiency in the regime $s \ggg n^{2/3}$.
Joint work with Cun-Hui Zhang (Rutgers), related papers are and
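
The first-order Stein formula quoted in the abstract can be checked numerically. The sketch below is only a Monte Carlo sanity check under assumed choices: the test function f applies tanh to each coordinate (so its divergence is the sum of 1 - tanh^2 over coordinates), and the dimension, sample size, and function names are illustrative. It does not implement the Second-Order Stein formula or any result from the talk.

```python
import math
import random


def f(z):
    """Assumed test function: tanh applied to each coordinate."""
    return [math.tanh(zi) for zi in z]


def div_f(z):
    """Divergence of f: sum of d/dz_i tanh(z_i) = 1 - tanh(z_i)^2."""
    return sum(1.0 - math.tanh(zi) ** 2 for zi in z)


def stein_statistic(z):
    """The quantity z'f(z) - div f(z) from Stein's formula."""
    return sum(zi * fi for zi, fi in zip(z, f(z))) - div_f(z)


def monte_carlo_mean(dim, n_samples, seed=0):
    """Average the Stein statistic over standard normal draws of z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += stein_statistic(z)
    return total / n_samples


estimate = monte_carlo_mean(dim=3, n_samples=100_000)
print(f"Monte Carlo mean of z'f(z) - div f(z): {estimate:.4f}")
```

The printed average should be close to zero up to Monte Carlo error, while individual draws of the statistic are far from zero; the spread of those draws is the variance that the abstract says the second-order formula characterizes.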

Dan Simpson (University of Toronto)

“Sometimes all we have left are pictures and fear”

Data is getting weirder. Statistical models and techniques are more complex than they have ever been. No one understands what code does. But at the same time, statistical tools are being used by a wider range of people than at any time in the past. And they are not just using our well-trodden, classical tools. They are working at the bleeding edge of what is possible. With this in mind, this talk will look at how much we can trust our tools. Do we ever really compute the thing we think we do? Can we ever be sure our code worked? Are there ways that it’s not safe to use the output? While “reproducibility” may be the watchword of the new scientific era, if we also want to ensure safety maybe all we have to lean on are pictures and fear.


Tailen Hsing (University of Michigan)

“Space-Time Data, Intrinsic Stationarity and Functional Models”

Abstract. The topic of functional time series has received some attention recently. This is timely as many applications involving space-time data can benefit from the functional-data perspective. In this talk, I will start off with the Argo data, which have fascinating features and are highly relevant for climate research. I will then turn to some extensions of stationarity in the context of functional data. The first is to adapt the notion of intrinsic random functions in spatial statistics, due to Matheron, to functional data. Such processes are stationary after suitable differencing, where the resulting stationary covariance is referred to as generalized covariance. A Bochner-type representation of the generalized covariance as well as preliminary results on inference will be presented. The second extension considers intrinsic stationarity in a local sense, viewed from the perspective of so-called tangent processes. Motivations of this work can be found from studying the multifractional Brownian motion.


John D. Storey (Princeton)

Title: High-dimensional models of genomic variation

Abstract: One of the most important goals of modern human genetics is to accurately model genome-wide genetic variation among individuals, as it plays a fundamental role in disease gene mapping and characterizing the evolutionary history of human populations. Current studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. A key problem is how to formulate and estimate probabilistic models of observed genotypes that account for complex population structure and relatedness. In this talk, I will present high-dimensional models of genome-wide genetic variation that we have developed. I will motivate this work through the problem of identifying associations between human traits and genetic variation, but I will also emphasize the broader applications of this work, including other areas of biology where our approaches may be applied or extended.