Statistics Seminar Series – Spring 2018

Schedule for Spring 2018

Seminars are on Mondays
Time: 4:10pm – 5:00pm
Location: Room 903, 1255 Amsterdam Avenue

Tea and coffee will be served before the seminar at 3:30 PM in the 10th Floor Lounge, SSW.

A cheese and wine reception will follow the seminar at 5:10 PM in the 10th Floor Lounge, SSW.

An archive of past seminars is available.


1:00 PM

1025 SSW

Julia Fukuyama (Stanford)



903 SSW

Guillaume Basse (Harvard University) 

“Testing for two-stage experiments in the presence of interference”

Abstract: Many important causal questions concern interactions between units, also known as interference. Examples include interactions between individuals in households, students in schools, and firms in markets. Standard analyses that ignore interference can often break down in this setting: estimators can be badly biased, while classical randomization tests can be invalid. In this talk, I present recent results on testing for two-stage experiments, which are powerful designs for assessing interference. In these designs, whole clusters (e.g., households, schools, or graph partitions) are assigned to treatment or control; then units within each treated cluster are randomly assigned to treatment or control. I demonstrate how to construct powerful tests for non-sharp null hypotheses and use these results to analyze a two-stage randomized trial evaluating an intervention to reduce student absenteeism in the School District of Philadelphia. I discuss some extensions to more general forms of interference, as well as some current challenges.
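For readers unfamiliar with the design, the two-stage assignment described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the speaker's code; the function and cluster names are hypothetical.

```python
import random

def two_stage_assignment(clusters, p_cluster=0.5, p_unit=0.5, seed=0):
    """Two-stage design: whole clusters are first assigned to treatment
    or control; units within each treated cluster are then randomized."""
    rng = random.Random(seed)
    cluster_z, unit_z = {}, {}
    for cluster, units in clusters.items():
        cluster_z[cluster] = rng.random() < p_cluster
        for u in units:
            # Units in control clusters can never receive treatment.
            unit_z[u] = cluster_z[cluster] and (rng.random() < p_unit)
    return cluster_z, unit_z

households = {"h1": ["a", "b"], "h2": ["c", "d"], "h3": ["e", "f"]}
cluster_z, unit_z = two_stage_assignment(households)
```

The two layers of randomization are exactly what allow spillover and direct effects to be separated in these designs.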


Samory Kpotufe (Princeton)

“Modal-Set Estimation using kNN graphs, and Applications to Clustering”


Estimating the mode or modal-sets (i.e., extrema points or surfaces) of an unknown density from a sample is a basic problem in data analysis.

Such estimation is relevant to other problems such as clustering and outlier detection, or can simply serve to identify low-dimensional structures in high-dimensional data (e.g., point-cloud data from medical imaging, astronomy, etc.).

Theoretical work on mode-estimation has largely concentrated on understanding its statistical difficulty, while less attention has been given to implementable procedures. Thus, theoretical estimators, which are often statistically optimal, are for the most part hard to implement. Furthermore for more general modal-sets (general extrema of any dimension and shape) much less is known, although various existing procedures (e.g. for manifold-denoising or density-ridge estimation) have similar practical aim.

I’ll present two related contributions of independent interest: (1) practical estimators of modal-sets – based on particular subgraphs of a k-NN graph – which attain minimax-optimal rates under surprisingly general distributional conditions; (2) high-probability finite sample rates for k-NN density estimation which is at the heart of our analysis.
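To give a flavor of the k-NN density estimation mentioned in point (2), here is a minimal sketch (my own illustration, not the speaker's implementation): the density at a point x is estimated as k divided by n times the volume of the ball whose radius is the distance from x to its k-th nearest sample point.

```python
import numpy as np
from math import gamma, pi

def knn_density(X, x, k):
    """k-NN density estimate at x: k / (n * volume of the d-dimensional
    ball with radius equal to the distance from x to its k-th nearest sample)."""
    n, d = X.shape
    r_k = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]
    ball_volume = pi ** (d / 2) / gamma(d / 2 + 1) * r_k ** d
    return k / (n * ball_volume)

# For points drawn uniformly on [0, 1], the estimate should be close to 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(2000, 1))
estimate = knn_density(X, np.array([0.5]), k=50)
```

The k-NN graph procedures in the talk build on exactly this kind of local density surrogate.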

Finally, I’ll discuss recent successful work towards the deployment of these modal-sets estimators for various clustering applications.

Much of the talk is based on a series of works with collaborators S. Dasgupta, K. Chaudhuri, U. von Luxburg, and Heinrich Jiang.

SHORT-BIO: Samory Kpotufe is an Assistant Professor at ORFE, Princeton University. He works in Statistical Machine Learning Theory, with an emphasis on exploiting low-dimensional structure in high-dimensional data.


*1:10 PM

Roy Lederman (Princeton)

“Inverse Problems and Unsupervised Learning with applications to Cryo-Electron Microscopy.”

Abstract: Cryo-Electron Microscopy (cryo-EM) is an imaging technology that is revolutionizing structural biology; the Nobel Prize in Chemistry 2017 was recently awarded to Jacques Dubochet, Joachim Frank and Richard Henderson “for developing cryo-electron microscopy for the high-resolution structure determination of biomolecules in solution”. Cryo-electron microscopes produce a large number of very noisy two-dimensional projection images of individual frozen molecules.

Unlike related methods, such as computed tomography (CT), the viewing direction of each image is unknown. The unknown directions, together with extreme levels of noise and additional technical factors, make the determination of the structure of molecules challenging. While other methods for structure determination, such as x-ray crystallography and nuclear magnetic resonance (NMR), measure ensembles of molecules, cryo-electron microscopes produce images of individual molecules. Therefore, cryo-EM could potentially be used to study mixtures of different conformations of molecules. Indeed, current algorithms have been very successful at analyzing homogeneous samples, and can recover some distinct conformations mixed in solutions, but the determination of multiple conformations, and in particular continuums of similar conformations (continuous heterogeneity), remains one of the open problems in cryo-EM. I will discuss a one-dimensional discrete model problem, Heterogeneous Multireference Alignment, which captures many of the properties of the cryo-EM problem. I will then discuss different components which we are introducing in order to address the problem of continuous heterogeneity in cryo-EM: 1. “hyper-molecules,” the mathematical formulation of truly continuously heterogeneous molecules; 2. computational and numerical tools for expressing associated priors; and 3. Bayesian algorithms for inverse problems with an unsupervised-learning component for recovering such hyper-molecules in cryo-EM.



*Room 903

Zhou Fan (Stanford)



Snigdha Panigrahi (Stanford)

“An approximation-based approach for randomized conditional inference with an application in eQTL studies”

Abstract: The goal of eQTL studies (expression quantitative trait loci studies) is to identify and quantify the effects of genetic variants on the expression of genes in different biological contexts. The outcome variables (typically on the order of 20,000) are molecular measurements of gene expression, and the predictors are genotypes (typically on the order of 1,000,000). The identification of regulatory variants in eQTL studies (Consortium et al. (2015); Ongen et al. (2015); Consortium et al. (2017)) proceeds in a hierarchical fashion: the first stage of such a selection identifies promising genes, followed by a search for potential functional variants in a neighborhood around these discovered genes. Once promising candidates for functional variants have been detected, the logical next step is to estimate their effect sizes. However, obtaining samples corresponding to a human tissue remains a costly endeavor. As a result, eQTL studies continue to be based on relatively small sample sizes, with this limitation particularly serious for tissues such as brain and liver, the organs of most immediate medical relevance. Naive estimates that ignore the genome-wide selection preceding inference can lead to misleading conclusions about the magnitudes of the true underlying associations. Due to the scarcity of biological samples, the problem of reliable effect size estimation is often deferred to future studies and is therefore inadequately addressed in the eQTL research community.

In this talk, I will discuss a principled approach that allows the geneticist to use the available dataset both for discoveries and for follow-up estimation of the associated effect sizes, adjusted for the considerable amount of prior mining. Motivated by the goal of reporting these effect sizes as consistent point estimates and intervals with target coverage, my methods are modeled along the conditional approach to selective inference introduced in Lee et al. (2016). The proposed procedure is based on a randomized hierarchical strategy that reflects state-of-the-art investigations and introduces the use of randomness, instead of data splitting, to optimize the use of available data. I will describe the computational bottleneck in performing randomized conditional inference. To overcome these hurdles, I will describe a novel set of techniques that have higher inferential power than the prior selective inference work of Lee et al. (2016).



*4:00 PM

Evita Xenia Nestoridi (Princeton)

“Studying cutoff for card shuffling”

In cryptography, in casinos, and in playing board games, people often ask the same question: how is a person (not a computer) supposed to shuffle a deck of cards? We will discuss two specific card shuffles, the random-to-random shuffle and the Bernoulli–Laplace model. The former is a famous shuffle, though typically not recommended for actual shuffling purposes. The latter is the first Markov chain introduced by Markov, which came from cryptography and is utilized by many casinos. Both card shuffles exhibit a sudden transition from an unmixed state to a random one, referred to as ‘cutoff’. In joint work with M. Bernstein, we prove that the random-to-random shuffle exhibits cutoff at (3/4) n log n, answering a conjecture of P. Diaconis. In joint works with A. Eskenazis and G. White, we also prove cutoff for the Bernoulli–Laplace model, using a combination of algebraic and probabilistic techniques.
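As a concrete illustration of the Bernoulli–Laplace model (a sketch of my own, not taken from the talk): two urns each hold n balls, with all n red balls starting in urn I; at each step one ball is drawn uniformly from each urn and the two are swapped.

```python
import random

def bernoulli_laplace(n, steps, seed=0):
    """Track the number of red balls in urn I for the Bernoulli-Laplace
    urn chain; cutoff concerns how abruptly this count approaches its
    (hypergeometric) stationary distribution."""
    rng = random.Random(seed)
    red_in_I = n          # start fully unmixed: all red balls in urn I
    history = [red_in_I]
    for _ in range(steps):
        took_red_from_I = rng.random() < red_in_I / n
        took_red_from_II = rng.random() < (n - red_in_I) / n
        # Swap the two drawn balls; the count changes by at most one.
        red_in_I += int(took_red_from_II) - int(took_red_from_I)
        history.append(red_in_I)
    return history
```

Plotting a distance to stationarity against the step count for chains like this is how the abrupt cutoff transition is usually visualized.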


Miaoyan Wang (UC Berkeley)

“Beyond matrices: theory, methods, and applications of higher-order tensors.”


Recently, tensors of order 3 or greater, known as higher-order tensors, have attracted increased attention in modern scientific contexts including genomics, proteomics, and brain imaging. A common paradigm in tensor-based algorithms advocates unfolding tensors into matrices and applying classical matrix methods. In this talk, I will consider all possible unfoldings of a tensor into lower order tensors and present general inequalities between their operator norms. I will then present a new tensor decomposition algorithm built on these inequalities and Kruskal’s uniqueness theorem. This tensor decomposition algorithm provably handles a greater level of noise compared to previous methods while achieving high estimation accuracy. A perturbation analysis is provided, establishing a higher-order analogue of Wedin’s perturbation theorem. In light of these theoretical successes, I will discuss a novel tensor approach to three-way clustering for multi-tissue multi-individual gene expression studies. I’ll show that the tensor method deciphers multi-way transcriptome specificity in finer detail than previously possible via our analysis of the GTEx RNA-seq data.
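To make the unfoldings concrete, the following sketch (my own illustration, not the speaker's code) computes the operator norm of each mode-k matrix unfolding of a 3-way tensor; for a rank-one tensor a∘b∘c, every unfolding has operator norm ‖a‖‖b‖‖c‖.

```python
import numpy as np

def unfolding_norms(T):
    """Operator (spectral) norm of each mode-k matrix unfolding of T."""
    return {
        mode: np.linalg.norm(
            np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1), ord=2
        )
        for mode in range(T.ndim)
    }

a, b, c = np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.0, 4.0])
T = np.einsum("i,j,k->ijk", a, b, c)   # rank-one 2 x 2 x 2 tensor
norms = unfolding_norms(T)
```

The inequalities in the talk compare norms like these across all ways of flattening a higher-order tensor, not just the three mode-k unfoldings shown here.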

*2/9/18 Laurence Aitchison (CBL / Cambridge)

Scott Linderman (Columbia)

“Bayesian Methods for Discovering Structure in Neural and Behavioral Data”

New recording technologies are transforming neuroscience, allowing us to precisely quantify neural activity, sensory stimuli, and natural behavior. How can we discover simplifying structure in these high-dimensional data and relate these domains to one another? I will present my work on developing Bayesian statistical methods to answer this question. First, I will develop state space models to study global brain states and recurrent dynamics in the neural activity of C. elegans. In doing so, I will draw on prior knowledge and theory to build interpretable models. When our initial models fall short, I will show how we criticize and revise them by inserting flexible components, like artificial neural networks, at judiciously chosen locations. Next, I will discuss the Bayesian inference algorithms I have developed to fit such models at the scales required by modern neuroscience. The key to efficient inference will be augmentation schemes and approximate methods that exploit the structure of the model. This example is illustrative of a broader framework for harnessing recent advances in machine learning, statistics, and neuroscience. Prior knowledge and theory provide the starting point for interpretable models, machine learning techniques lend additional flexibility where needed, and new Bayesian inference algorithms provide the means to fit these models and discover structure in neural and behavioral data.

Aaditya Ramdas (UC Berkeley)

“Interactive algorithms for multiple hypothesis testing”

Abstract: Data science is at a crossroads. Each year, thousands of new data scientists are entering science and technology, after a broad training in a variety of fields. Modern data science is often exploratory in nature, with datasets being collected and dissected in an interactive manner. Classical guarantees that accompany many statistical methods are often invalidated by their non-standard interactive use, resulting in an underestimated risk of falsely discovering correlations or patterns. It is a pressing challenge to upgrade existing tools, or create new ones, that are robust to a human in the loop.

In this talk, I will describe two new advances that enable some amount of interactivity while testing multiple hypotheses, and control the resulting selection bias. I will first introduce a new framework, STAR, that uses partial masking to divide the available information into two parts, one for selecting a set of potential discoveries, and the other for inference on the selected set. I will then show that it is possible to flip the traditional roles of the algorithm and the scientist, allowing the scientist to make post-hoc decisions after seeing the realization of an algorithm on the data. The theoretical basis for both advances is founded in the theory of martingales: in the first, the user defines the martingale and associated filtration interactively, and in the second, we move from optional stopping to optional spotting by proving uniform concentration bounds on relevant martingales.

This talk will feature joint work with Will Fithian, Lihua Lei and Eugene Katsevich, but I will also briefly mention work with Rina Barber, Jianbo Chen, Kevin Jamieson, Michael Jordan, Max Rabinovich, Martin Wainwright, Fanny Yang and Tijana Zrnic.

Bio: Aaditya Ramdas is a postdoctoral researcher in Statistics and EECS at UC Berkeley, advised by Michael Jordan and Martin Wainwright. He finished his PhD in Statistics and Machine Learning at CMU, advised by Larry Wasserman and Aarti Singh, winning the Best Thesis Award in Statistics. A lot of his research focuses on modern aspects of reproducibility in science and technology, involving statistical testing and false discovery rate control in static and dynamic settings.


Daniela Witten (University of Washington)

“Are clusterings of multiple data views independent?”

In recent years, it has become increasingly commonplace for biologists to collect more than one type of measurement on a single set of observations. For instance, a researcher might profile clinical attributes, protein expression, and DNA sequence on a single set of individuals. On the basis of such multiple-view data, many authors have considered the task of determining whether there are subgroups, or clusters, among the observations. I will instead consider a more nuanced question: are the sets of clusters from each data view related or independent? In order to answer this question, I will propose a mixture model for multiple data views. I will use this model to develop a pseudo likelihood ratio test for whether the clusterings of the observations in two data views are independent. Furthermore, I will establish a connection between our proposed test statistic and a mutual information statistic considered by several previous authors. I will demonstrate the performance of the proposed approach in an application to multiple-view clinical test and proteomic data sets from the Pioneer 100 Wellness study.
This is joint work with Lucy Gao (University of Washington) and Jacob Bien (University of Southern California).


Jay Kadane (CMU)


 Kosuke Imai (Princeton)

“Causal Inference with Interference and Noncompliance in Two-Stage Randomized Experiments”

In many social science experiments, subjects often interact with each other, and as a result one unit's treatment influences the outcome of another unit. Over the last decade, significant progress has been made towards causal inference in the presence of such interference among units. Researchers have shown that two-stage randomization of treatment assignment enables the identification of average spillover and direct effects. However, much of the literature has assumed perfect compliance with treatment assignment. In this paper, we establish the nonparametric identification of the average complier direct effect (CADE) and average complier spillover effect (CASE) in two-stage randomized experiments with interference and noncompliance. In particular, we consider the spillover effect of the encouragement on the treatment as well as the spillover effect of the treatment on the outcome. We then propose a consistent Neyman-type estimator of the CADE and derive its variance under the stratified interference assumption. We then derive the exact relationship between the proposed Neyman-type estimator and the two-stage least squares estimator. The proposed methodology is motivated by and applied to the randomized evaluation of the Indian health insurance program. (Joint work with Zhichao Jiang.)




Jelena Bradic (UCSD)




Zijian Guo (Rutgers)


Vladas Pipiras (UNC)

Title: Sampling and inversion methods in several “big data” applications

Abstract: In a number of modern data applications, sampling serves as a means of reducing data collection, storage and processing costs. Sampled data is then often used to make inference about “population” quantities of interest. The motivating examples considered in the talk are, first, data packet sampling in Internet traffic and inference of the distribution of packet flow sizes and, second, node or edge sampling in directed graphs and inference of the node in-degree distribution. Several approaches to these inference problems are considered, after formulating them in terms of statistical inversion, including an asymptotic approach based on power laws. The availability of various sampling schemes, their (dis)advantages and comparison are also discussed.
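To illustrate the kind of statistical inversion involved (a toy sketch of my own, not from the talk): if each packet is kept independently with probability p, observed flow sizes are binomially thinned versions of the true sizes, and the total packet count can be recovered in expectation by dividing the observed total by p.

```python
import random

def sample_packets(flow_sizes, p, seed=0):
    """Bernoulli thinning: keep each packet of each flow with probability p."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(size)) for size in flow_sizes]

def estimate_total_packets(observed_sizes, p):
    """Invert the thinning in expectation: E[observed total] = p * true total."""
    return sum(observed_sizes) / p

flows = [10] * 200                      # 200 flows of 10 packets each
observed = sample_packets(flows, p=0.1)
estimate = estimate_total_packets(observed, p=0.1)
```

Recovering the full flow-size distribution, rather than just its total, is the harder inversion problem the talk addresses; many small flows are missed entirely under sampling, which is what makes the problem ill-posed.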


Murad Taqqu (Boston University)


Linxi Liu (Columbia)

“Selecting variables in a classification problem by controlling the false discovery rate”

Abstract: Classification problems appear in many applications, where a categorical or binary response is related to a group of potential explanatory variables. Selecting a subset of important variables is one of the major tasks. For example, in genetics, researchers want to identify a subgroup of single-nucleotide polymorphisms (SNPs) associated with a certain type of disease in order to further study the disease mechanism. Generalized linear models (GLMs) are widely used tools in these cases. However, there are still several open problems about how to perform controlled variable selection for GLMs, especially in the high-dimensional setting. In this talk, I will introduce a variable selection approach for the probit regression model. Built on the knockoffs framework (Barber and Candes 2015), our procedure starts by constructing a group of knockoff variables geometrically and then calculates the test statistics based on a Bayesian model. Our approach can achieve false discovery rate (FDR) control asymptotically, without normal distributional assumptions on the regression matrix. We conduct a range of numerical experiments to demonstrate the FDR control and the power of the proposed method. It has also been applied to a genetic study of schizophrenia.
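For context, the data-dependent threshold at the heart of the knockoff filter of Barber and Candes (2015), which this work builds on, can be written in a few lines (an illustrative sketch with made-up statistics, not the speaker's code): a large positive statistic W_j is evidence for variable j, and the threshold is the smallest t at which the estimated false discovery proportion falls below the target level q.

```python
def knockoff_threshold(W, q):
    """Knockoff+ threshold: the smallest t among the nonzero |W_j| such
    that (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q."""
    for t in sorted({abs(w) for w in W if w != 0}):
        fdp_hat = (1 + sum(w <= -t for w in W)) / max(1, sum(w >= t for w in W))
        if fdp_hat <= q:
            return t
    return float("inf")

W = [3.1, 2.4, -0.5, 1.8, 2.9, -1.2, 4.0, 2.2, 0.7, 3.5]
threshold = knockoff_threshold(W, q=0.2)
selected = [j for j, w in enumerate(W) if w >= threshold]
```

The geometric knockoff construction and Bayesian test statistics described in the abstract replace the ingredients that feed into W; the thresholding step itself is unchanged.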



Galen Reeves (Duke)

“Information theory and high-dimensional statistical inference”

Abstract: How does one quantify the fundamental and computational limits of high-dimensional inference problems? Much of the theoretical work in statistics has focused on scaling regimes in which the uncertainty about the unknown parameters converges to zero as the amount of data increases. In this talk, I will describe a different approach that instead focuses on settings where the number of observations is commensurate with the number of unknowns. Building upon ideas from information theory and statistical physics, the objectives are (1) obtaining succinct formulas for the performance of optimal methods; and (2) delineating between problem regimes in which this performance can or cannot be obtained using computationally efficient methods. The primary focus will be on the standard linear model with random design matrices. I will also discuss some recent progress on generalized linear models and multilayer networks.
Bio: Galen Reeves joined the faculty at Duke University in Fall 2013, and is currently an Assistant Professor with a joint appointment in the Department of Electrical & Computer Engineering and the Department of Statistical Science. He completed his PhD in Electrical Engineering and Computer Sciences at the University of California, Berkeley in 2011, and he was a postdoctoral associate in the Departments of Statistics at Stanford University from 2011 to 2013. He received the NSF Career Award in 2018.