Schedule for Spring 2019
Seminars are on Mondays
Time: 4:10pm – 5:00pm
Location: Room 903, 1255 Amsterdam Avenue
Tea and Coffee will be served before the seminar at 3:30 PM, 10th Floor Lounge SSW
Cheese and Wine reception will follow the seminar at 5:10 PM in the 10th Floor Lounge SSW
For an archive of past seminars, please click here.
1/18/2019 *Friday *Time: 12 noon to 1:00 pm *Location: Room 1025 SSW 
Alex Young (Big Data Institute, University of Oxford) “Disentangling nature and nurture using genomic family data” Abstract: Heritability measures the relative contribution of genetic inheritance (nature) and environment (nurture) to trait variation. Estimation of heritability is especially challenging when genetic and environmental effects are correlated, such as when indirect genetic effects from relatives are present. An indirect genetic effect on a proband (phenotyped individual) is the effect of a genetic variant on the proband through the proband’s environment. Examples of indirect genetic effects include effects from parents to offspring, which occur when parental nurturing behaviours are influenced by parents’ genes. I show that indirect genetic effects from parents to offspring and between siblings are substantial for educational attainment. I show that, when indirect genetic effects from relatives are substantial, existing methods for estimating heritability can be severely biased. To remedy this and other problems in heritability estimation, such as population stratification, I introduce a novel method for estimating heritability: relatedness disequilibrium regression (RDR). RDR removes environmental bias by exploiting variation in relatedness due to random Mendelian segregations in the probands’ parents. We show mathematically and in simulations that RDR estimates heritability with negligible bias due to environment in almost all scenarios. I report results from applying RDR to a sample of 54,888 Icelanders with both parents genotyped to estimate the heritability of 14 traits, including height (55.4%, S.E. 4.4%), body mass index (BMI) (28.9%, S.E. 6.3%), and educational attainment (17.0%, S.E. 9.4%), finding evidence for substantial overestimation from other methods. 
Furthermore, without genotype data on close relatives of the proband – such as used by RDR – the results show that it is impossible to remove the bias due to indirect genetic effects and to completely remove the confounding due to population stratification. I outline a research program for building methods that take advantage of the unique properties of genomic family data to disentangle nature, nurture, and environment in order to build a rich understanding of the causes of social and health inequalities. 
1/22/19 *Tuesday Time: 12 noon Location: 1025 SSW 
Chengchun Shi (NC State) “On Statistical Learning for Individualized Decision Making with Complex Data.” In this talk, I will present my research on individualized decision making with modern complex data. In precision medicine, individualizing the treatment decision rule can capture patients’ heterogeneous responses to treatment. In finance, individualizing the investment decision rule can improve an individual’s financial well-being. In a ridesharing company, individualizing the order dispatching strategy can increase revenue and customer satisfaction. With the fast development of new technology, modern datasets often consist of massive numbers of observations and high-dimensional covariates, and are characterized by some degree of heterogeneity. The talk is divided into two parts. In the first part, I will focus on data heterogeneity and introduce a new maximin-projection learning method for recommending an overall individualized decision rule based on observed data from different populations with heterogeneity in optimal individualized decision making. In the second part, I will briefly summarize the statistical learning methods I’ve developed for individualized decision making with complex data and discuss my future research directions. 
1/25/19 *Friday Time: 12 noon to 1:00 pm Location: 903 SSW

Jingshen Wang (Michigan) Title: Inference on Treatment Effects after Model Selection Abstract: Inferring cause-effect relationships between variables is of primary importance in many sciences. In this talk, I will discuss two approaches for making valid inference on treatment effects when a large number of covariates are present. The first approach is to perform model selection and then to deliver inference based on the selected model. If the inference is made ignoring the randomness of the model selection process, then there could be severe biases in estimating the parameters of interest. While the estimation bias in an underfitted model is well understood, I will address a lesser known bias that arises from an overfitted model. The overfitting bias can be eliminated through data splitting at the cost of statistical efficiency, and I will propose a repeated data splitting approach to mitigate the efficiency loss. The second approach concerns the existing methods for debiased inference. I will show that the debiasing approach is an extension of OLS to high dimensions, and that a careful bias analysis leads to an improvement to further control the bias. The comparison between these two approaches provides insights into their intrinsic bias-variance tradeoff, and I will show that the debiasing approach may lose efficiency in observational studies. This is joint work with Xuming He and Gongjun Xu.

1/28/19 

1/31/19 *Thursday Time: 4:10 pm 
Simon Mak (Georgia Tech) “Support points – a new way to reduce big and high-dimensional data” Abstract: This talk presents a new method for reducing big and high-dimensional data into a smaller dataset, called support points (SPs). In an era where data is plentiful but downstream analysis is oftentimes expensive, SPs can be used to tackle many big data challenges in statistics, engineering and machine learning. SPs have two key advantages over existing methods. First, SPs provide optimal and model-free reduction of big data for a broad range of downstream analyses. Second, SPs can be efficiently computed via parallelized difference-of-convex optimization; this allows us to reduce millions of data points to a representative dataset in mere seconds. SPs also enjoy appealing theoretical guarantees, including distributional convergence and improved reduction over random sampling and clustering-based methods. The effectiveness of SPs is then demonstrated in two real-world applications, the first for reducing long Markov Chain Monte Carlo (MCMC) chains for rocket engine design, and the second for data reduction in computationally intensive predictive modeling.

2/4/19 
Jingshu Wang (Stanford/Penn) “Data Denoising for Single-cell RNA sequencing” Abstract: Single-cell RNA sequencing (scRNA-seq) measures gene expression levels in individual cells; it is a groundbreaking advance over microarrays and bulk RNA sequencing and is reshaping the field of biology. Though the technology is exciting, scRNA-seq data is very noisy, often too noisy for signal detection and robust analysis. In the talk, I will discuss how we perform data denoising by learning across similar genes and borrowing information from external public datasets to improve the quality of downstream analysis. Specifically, I will discuss how we set up the model by decomposing the randomness of scRNA-seq data into three components, the structured shared variations across genes, biological “noise” and technical noise, based on current understanding of the stochasticity in DNA transcription. I will emphasize one key challenge in each component and our contributions. I will show how we make proper assumptions on the technical noise and introduce a key feature, transfer learning, in our denoising method SAVER-X. SAVER-X uses a deep autoencoder neural network coupled with Empirical Bayes shrinkage to extract transferable gene expression features across datasets under different settings and learn from external data as prior information. I will show that SAVER-X can successfully transfer information from mouse to human cells and can guard against bias. I’ll also briefly discuss our ongoing work on post-denoising inference for scRNA-seq. 
2/11/19 
Jonathan Weed (MIT) Title: Large-scale Optimal Transport: Statistics and Computation Abstract: Optimal transport is a concept from probability which has recently seen an explosion of interest in machine learning and statistics as a tool for analyzing high-dimensional data. However, the key obstacle in using optimal transport in practice has been its high statistical and computational cost. In this talk, we show how exploiting different notions of structure can lead to better statistical rates, beating the curse of dimensionality, and to state-of-the-art algorithms. 
2/14/19 *Thursday Time: 4:10 pm 
Pragya Sur (Stanford) “A modern maximum-likelihood approach for high-dimensional logistic regression” Abstract: Logistic regression is arguably the most widely used and studied non-linear model in statistics. Statistical inference based on classical maximum-likelihood theory is ubiquitous in this context. This theory hinges on well-known fundamental results: (1) the maximum-likelihood estimate (MLE) is asymptotically unbiased and normally distributed, (2) its variability can be quantified via the inverse Fisher information, and (3) the likelihood-ratio test (LRT) statistic is asymptotically chi-squared. In this talk, I will show that in the common modern setting where the number of features and the sample size are both large and comparable, classical results are far from accurate. In fact, (1) the MLE is biased, (2) its variability is far greater than classical results predict, and (3) the LRT statistic is not distributed as a chi-square. Consequently, p-values obtained based on classical theory are completely invalid in high dimensions. In turn, I will propose a new theory that characterizes the asymptotic behavior of both the MLE and the LRT under some assumptions on the covariate distribution, in a high-dimensional setting. Empirical evidence demonstrates that this asymptotic theory provides accurate inference in finite samples. Practical implementation of these results necessitates the estimation of a single scalar, the overall signal strength, and I will propose a procedure for estimating this parameter precisely. This is based on joint work with Emmanuel Candes and Yuxin Chen. 
2/18/19 
Thomas Nagler (Technical University of Munich) “Copula-based regression” Copulas are models for the dependence in a random vector and allow one to build multivariate models with arbitrary one-dimensional margins. Recently, researchers started to apply copulas to statistical learning problems such as regression or classification. We propose a unified framework for the analysis of such approaches by defining the estimators as solutions of copula-based estimating equations. We present general results on their asymptotic behavior and the validity of the bootstrap. The conditions are broad enough to cover most regression-type problems as well as parametric and nonparametric estimators of the copula. The versatility of the method is illustrated with numerical examples and a possible extension to missing data problems. 
2/25/19 
Michael Hudgens (UNC) Title: Causal Inference in the Presence of Interference Abstract: A fundamental assumption usually made in causal inference is that of no interference between individuals (or units), i.e., the potential outcomes of one individual are assumed to be unaffected by the treatment assignment of other individuals. However, in many settings, this assumption obviously does not hold. For example, in infectious diseases, whether one person becomes infected may depend on who else in the population is vaccinated. In this talk we will discuss recent approaches to assessing treatment effects in the presence of interference. 
3/4/19 
Andrew Gelman (Department of Statistics and Department of Political Science, Columbia University) “We’ve Got More Than One Model: Evaluating, comparing, and extending Bayesian predictions” Methods in statistics and data science are often framed as solutions to particular problems, in which a particular model or method is applied to a dataset. But good practice typically requires multiplicity, in two dimensions: fitting many different models to better understand a single dataset, and applying a method to a series of different but related problems. To understand and make appropriate inferences from real-world data analysis, we should account for the set of models we might fit, and for the set of problems to which we would apply a method. This is known as the reference set in frequentist statistics or the prior distribution in Bayesian statistics. We shall discuss recent research of ours that addresses these issues, involving the following statistical ideas: Type M errors, the multiverse, weakly informative priors, Bayesian stacking and cross-validation, simulation-based model checking, divide-and-conquer algorithms, and validation of approximate computations. We will also discuss how this work is motivated by applications in political science, pharmacology, and other areas. 
3/11/19 
Lucas Janson (Harvard)
Should We Model X in High-Dimensional Inference? Abstract: For answering questions about the relationship between a response variable Y and a set of explanatory variables X, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the distribution of X instead, especially when X is high-dimensional. I will review my recent methodological work on knockoffs and the conditional randomization test, and explain how the model-X framework endows them with desirable properties like finite-sample error control, power, modeling flexibility, and robustness. At the end, I will introduce some very recent (arXiv’ed last week) breakthroughs on model-X methods for high-dimensional inference, as well as some challenges and interesting directions for the future in this area. 
3/18/19  Spring Break – No Seminar 
3/25/19 
Xi Chen (NYU) “Quantile Regression for big data with small memory” Abstract: In this talk, we discuss the inference problem of quantile regression for a large sample size n but under a limited memory constraint, where the memory can only store a small batch of data of size m. A popular approach, the naive divide-and-conquer method, only works when n=o(m^2) and is computationally expensive. This talk proposes a novel inference approach and establishes an asymptotic normality result that achieves the same efficiency as the quantile regression estimator computed on all the data. Essentially, our method allows an arbitrarily large sample size n relative to the memory size m. Our method can also be applied to quantile regression in a distributed computing environment (e.g., in a large-scale sensor network) or for real-time streaming data. This is joint work with Weidong Liu and Yichen Zhang. Bio: Xi Chen is an assistant professor at the Stern School of Business at New York University. Before that, he was a postdoc in the group of Prof. Michael Jordan at UC Berkeley. He obtained his Ph.D. from the Machine Learning Department at Carnegie Mellon University. He studies high-dimensional statistics, multi-armed bandits, and stochastic optimization. He has received the NSF CAREER Award, Simons-Berkeley Research Fellowship, Google Faculty Award, Adobe Data Science Award, and Bloomberg research award, and was featured in the 2017 Forbes list of “30 Under 30 in Science”. 
4/1/19 
Yajun Mei (Georgia Institute of Technology) “Multi-Armed Bandit Techniques for Online Monitoring High-Dimensional Streaming Data” Abstract: We investigate the problem of online monitoring of high-dimensional streaming data in resource-constrained environments, where one has limited capacity in data acquisition, transmission or processing, and needs to decide which local components or features of the high-dimensional streaming data to observe at each time step so as to detect potential anomalies rapidly. In the first part of this talk, we provide an overview of the classical sequential change-point detection problem for real-valued data streams, as well as classical multi-armed bandit algorithms. In the second part of the talk, we present our latest research on efficient scalable schemes for online monitoring of high-dimensional data, as well as the corresponding multi-armed bandit versions. Both asymptotic analysis and numerical simulations demonstrate the usefulness of our proposed approach in the context of online monitoring of large-scale data streams in resource-constrained environments. 
4/8/19 
Weijie Su (UPenn) “Gaussian Differential Privacy” Abstract: Privacy-preserving data analysis has been put on a firm mathematical foundation since the introduction of differential privacy (DP) in 2006, with successful deployment in Chrome and iOS lately. However, this framework in its present form fails to precisely characterize how privacy degrades in a series of privacy breaches and how privacy amplifies with subsampling, thereby creating a major bottleneck for further extending this privacy definition. In this talk, we propose a hypothesis-testing-based framework for private data analysis, including Gaussian differential privacy (GDP) as a central example. This new framework gives a statistically interpretable assessment of privacy loss and includes many existing privacy notions as examples. In addition, we develop a suite of tools, including a central limit theorem, for accurately evaluating the privacy loss under composition or with subsampling in this privacy framework. Finally, we analyze the privacy of stochastic gradient descent in this new framework. This is joint work with Jinshuo Dong and Aaron Roth. 
4/15/19 
Dr. Pierre Bellec (Rutgers) “Second Order Stein, SURE for SURE and degrees-of-freedom adjustments in high-dimensional statistics”
Abstract:
Stein’s formula states that a random variable of the form $z^\top f(z) - \operatorname{div} f(z)$ is mean-zero for all functions $f$ with integrable gradient. Here, $\operatorname{div} f$ is the divergence
of the function $f$ and $z$ is a standard normal vector. We develop Second-Order Stein formulae for statistical inference with high-dimensional
data. In the simplest form, the Second-Order Stein formula characterizes the variance of $z^\top f(z) - \operatorname{div} f(z)$. It also implies bounds on the variance
of a function $f(z)$ of a standard normal vector and these bounds are of a different nature than the classical Poincaré or log-Sobolev inequalities.
The motivation that led to the above probabilistic results was the study of degrees-of-freedom adjustments in high-dimensional inference problems. A
well-understood degrees-of-freedom adjustment appears in Stein’s Unbiased Risk Estimate (SURE) to construct an unbiased estimate of the mean square risk of
almost any estimator $\hat \mu$; here the divergence of $\hat \mu$ plays the role of the degrees-of-freedom of the estimator.
A first application of the Second-Order Stein formula is an Unbiased Risk Estimate of the risk of SURE itself (SURE for SURE): a simple unbiased estimate provides
information about the squared distance between SURE and the squared estimation error of $\hat \mu$.
A novel analysis reveals that degrees-of-freedom adjustments play a major role in debiasing methodologies to construct confidence intervals in high dimension.
We will see that in sparse linear regression for the Lasso with Gaussian designs, existing debiasing schemes need to be modified with an adjustment that
accounts for the degrees-of-freedom of the Lasso. This degrees-of-freedom adjustment is necessary for statistical efficiency in the regime $s \ggg n^{2/3}$.
Joint work with Cun-Hui Zhang (Rutgers), related papers are https://arxiv.org/abs/1811.

4/22/19 
Dan Simpson (University of Toronto) “Sometimes all we have left are pictures and fear” Abstract: 
4/29/19 
Tailen Hsing (University of Michigan) “Space-Time Data, Intrinsic Stationarity and Functional Models” Abstract: The topic of functional time series has received some attention recently. This is timely as many applications involving space-time data can benefit from the functional-data perspective. In this talk, I will start off with the Argo data, which have fascinating features and are highly relevant for climate research. I will then turn to some extensions of stationarity in the context of functional data. The first is to adapt the notion of intrinsic random functions in spatial statistics, due to Matheron, to functional data. Such processes are stationary after suitable differencing, where the resulting stationary covariance is referred to as generalized covariance. A Bochner-type representation of the generalized covariance as well as preliminary results on inference will be presented. The second extension considers intrinsic stationarity in a local sense, viewed from the perspective of so-called tangent processes. Motivations of this work can be found from studying the multifractional Brownian motion. 
5/6/19 
John D. Storey (Princeton) Title: High-dimensional models of genomic variation Abstract: One of the most important goals of modern human genetics is to accurately model genome-wide genetic variation among individuals, as it plays a fundamental role in disease gene mapping and characterizing the evolutionary history of human populations. Current studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. A key problem is how to formulate and estimate probabilistic models of observed genotypes that account for complex population structure and relatedness. In this talk, I will present high-dimensional models of genome-wide genetic variation that we have developed. I will motivate this work through the problem of identifying associations between human traits and genetic variation, but I will also emphasize the broader applications of this work, including other areas of biology where our approaches may be applied or extended. 