Statistics Seminar – Spring 2019

Schedule for Spring 2019

Seminars are on Mondays
Time: 4:10pm – 5:00pm
Location: Room 903, 1255 Amsterdam Avenue

Tea and Coffee will be served before the seminar at 3:30 PM in the 10th Floor Lounge SSW

Cheese and Wine reception will follow the seminar at 5:10 PM in the 10th Floor Lounge SSW

For an archive of past seminars, please click here.



*Time: 12 noon to 1:00 pm

*Location: Room 1025 SSW

Alex Young (Big Data Institute, University of Oxford)

“Disentangling nature and nurture using genomic family data”

Abstract: Heritability measures the relative contribution of genetic inheritance (nature) and environment (nurture) to trait variation. Estimation of heritability is especially challenging when genetic and environmental effects are correlated, such as when indirect genetic effects from relatives are present. An indirect genetic effect on a proband (phenotyped individual) is the effect of a genetic variant on the proband through the proband’s environment. Examples of indirect genetic effects include effects from parents to offspring, which occur when parental nurturing behaviours are influenced by parents’ genes. I show that indirect genetic effects from parents to offspring and between siblings are substantial for educational attainment. I show that, when indirect genetic effects from relatives are substantial, existing methods for estimating heritability can be severely biased. To remedy this and other problems in heritability estimation, such as population stratification, I introduce a novel method for estimating heritability: relatedness disequilibrium regression (RDR). RDR removes environmental bias by exploiting variation in relatedness due to random Mendelian segregations in the probands’ parents. We show mathematically and in simulations that RDR estimates heritability with negligible bias due to environment in almost all scenarios. I report results from applying RDR to a sample of 54,888 Icelanders with both parents genotyped to estimate the heritability of 14 traits, including height (55.4%, S.E. 4.4%), body mass index (BMI) (28.9%, S.E. 6.3%), and educational attainment (17.0%, S.E. 9.4%), finding evidence for substantial overestimation from other methods. Furthermore, without genotype data on close relatives of the proband – such as used by RDR – the results show that it is impossible to remove the bias due to indirect genetic effects and to completely remove the confounding due to population stratification. 
I outline a research program for building methods that take advantage of the unique properties of genomic family data to disentangle nature, nurture, and environment in order to build a rich understanding of the causes of social and health inequalities. 



Time and Location: TBA

Chengchun Shi (NC State)

“On Statistical Learning for Individualized Decision Making with Complex Data.”

Abstract: In this talk, I will present my research on individualized decision making with modern complex data. In precision medicine, individualizing the treatment decision rule can capture patients’ heterogeneous responses to treatment. In finance, individualizing the investment decision rule can improve an individual’s financial well-being. In a ride-sharing company, individualizing the order dispatching strategy can increase revenue and customer satisfaction. With the fast development of new technology, modern datasets often contain massive numbers of observations and high-dimensional covariates, and are characterized by some degree of heterogeneity.

The talk is divided into two parts. In the first part, I will focus on data heterogeneity and introduce a new maximin-projection learning method for recommending an overall individualized decision rule based on observed data from different populations with heterogeneity in optimal individualized decision making. In the second part, I will briefly summarize the statistical learning methods I’ve developed for individualized decision making with complex data and discuss my future research directions.



Time: 12 noon to 1:00 pm

Location: 903 SSW


Jingshen Wang (Michigan)





Time: 4:10 pm

Simon Mak (Georgia Tech)

“Support points – a new way to reduce big and high-dimensional data”

Abstract: This talk presents a new method for reducing big and high-dimensional data into a smaller dataset, called support points (SPs). In an era where data is plentiful but downstream analysis is oftentimes expensive, SPs can be used to tackle many big data challenges in statistics, engineering and machine learning. SPs have two key advantages over existing methods. First, SPs provide optimal and model-free reduction of big data for a broad range of downstream analyses. Second, SPs can be efficiently computed via parallelized difference-of-convex optimization; this allows us to reduce millions of data points to a representative dataset in mere seconds. SPs also enjoy appealing theoretical guarantees, including distributional convergence and improved reduction over random sampling and clustering-based methods. The effectiveness of SPs is then demonstrated in two real-world applications, the first for reducing long Markov Chain Monte Carlo (MCMC) chains for rocket engine design, and the second for data reduction in computationally intensive predictive modeling.



Jingshu Wang (Stanford/Penn)



Jonathan Weed (MIT)



Time: 4:10 pm

Pragya Sur (Stanford)

“A modern maximum-likelihood approach for high-dimensional logistic regression”

Abstract: Logistic regression is arguably the most widely used and studied non-linear model in statistics. Statistical inference based on classical maximum-likelihood theory is ubiquitous in this context. This theory hinges on well-known fundamental results: (1) the maximum-likelihood estimate (MLE) is asymptotically unbiased and normally distributed, (2) its variability can be quantified via the inverse Fisher information, and (3) the likelihood-ratio test (LRT) statistic is asymptotically Chi-squared distributed. In this talk, I will show that in the common modern setting where the number of features and the sample size are both large and comparable, classical results are far from accurate. In fact, (1) the MLE is biased, (2) its variability is far greater than classical results predict, and (3) the LRT statistic is not Chi-squared distributed. Consequently, p-values obtained based on classical theory are completely invalid in high dimensions. In turn, I will propose a new theory that characterizes the asymptotic behavior of both the MLE and the LRT under some assumptions on the covariate distribution, in a high-dimensional setting. Empirical evidence demonstrates that this asymptotic theory provides accurate inference in finite samples. Practical implementation of these results necessitates the estimation of a single scalar, the overall signal strength, and I will propose a procedure for estimating this parameter precisely. This is based on joint work with Emmanuel Candes and Yuxin Chen.




3/18/19 Spring Break – No Seminar