Schedule for Spring 2018
Seminars are on Mondays
Time: 4:10pm – 5:00pm
Location: Room 903, 1255 Amsterdam Avenue
Tea and coffee will be served before the seminar at 3:30 PM in the 10th Floor Lounge, SSW.
A cheese and wine reception will follow the seminar at 5:10 PM in the 10th Floor Lounge, SSW.
An archive of past seminars is available.
1/17/18 1:00 PM 1025 SSW 
Julia Fukuyama (Stanford) 
1/19/18 4:10 PM 903 SSW
Guillaume Basse (Harvard University) “Testing for two-stage experiments in the presence of interference” Abstract: Many important causal questions concern interactions between units, also known as interference. Examples include interactions between individuals in households, students in schools, and firms in markets. Standard analyses that ignore interference can often break down in this setting: estimators can be badly biased, while classical randomization tests can be invalid. In this talk, I present recent results on testing for two-stage experiments, which are powerful designs for assessing interference. In these designs, whole clusters (e.g., households, schools, or graph partitions) are assigned to treatment or control; then units within each treated cluster are randomly assigned to treatment or control. I demonstrate how to construct powerful tests for non-sharp null hypotheses and use these results to analyze a two-stage randomized trial evaluating an intervention to reduce student absenteeism in the School District of Philadelphia. I discuss some extensions to more general forms of interference, as well as some current challenges.
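The two-stage design described in the abstract is easy to state in code. Below is a minimal, hypothetical sketch (the function name and household data are illustrative, not from the talk): clusters are randomized first, and units are individually randomized only inside treated clusters.

```python
import random

def two_stage_assignment(clusters, p_cluster=0.5, p_unit=0.5, seed=0):
    """First stage: assign whole clusters to treatment or control.
    Second stage: randomize units only within treated clusters."""
    rng = random.Random(seed)
    cluster_z, unit_z = {}, {}
    for cluster, units in clusters.items():
        cluster_z[cluster] = rng.random() < p_cluster
        for u in units:
            # Units in control clusters are never individually treated.
            unit_z[u] = cluster_z[cluster] and (rng.random() < p_unit)
    return cluster_z, unit_z

households = {"h1": ["a", "b"], "h2": ["c", "d"], "h3": ["e", "f"]}
cz, uz = two_stage_assignment(households, seed=1)
```

The nesting is what makes the design informative about interference: comparing untreated units in treated clusters against units in control clusters isolates spillover.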
1/22/18 
Samory Kpotufe (Princeton) “Modal-Set Estimation using kNN graphs, and Applications to Clustering” Abstract: Estimating the mode or modal-sets (i.e., extrema points or surfaces) of an unknown density from a sample is a basic problem in data analysis. Such estimation is relevant to other problems such as clustering and outlier detection, or can simply serve to identify low-dimensional structures in high-dimensional data (e.g., point-cloud data from medical imaging, astronomy, etc.). Theoretical work on mode estimation has largely concentrated on understanding its statistical difficulty, while less attention has been given to implementable procedures. Thus, theoretical estimators, which are often statistically optimal, are for the most part hard to implement. Furthermore, for more general modal-sets (general extrema of any dimension and shape) much less is known, although various existing procedures (e.g., for manifold denoising or density-ridge estimation) have similar practical aims. I’ll present two related contributions of independent interest: (1) practical estimators of modal-sets, based on particular subgraphs of a kNN graph, which attain minimax-optimal rates under surprisingly general distributional conditions; (2) high-probability finite-sample rates for kNN density estimation, which is at the heart of our analysis. Finally, I’ll discuss recent successful work towards the deployment of these modal-set estimators for various clustering applications. Much of the talk is based on a series of works with collaborators S. Dasgupta, K. Chaudhuri, U. von Luxburg, and Heinrich Jiang. Short bio: Samory Kpotufe is an Assistant Professor at ORFE, Princeton University. He works in statistical machine learning theory, with an emphasis on exploiting low-dimensional structure in high-dimensional data.
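As a rough illustration of the kNN density estimation mentioned above (a generic textbook-style sketch, not the talk's estimator): in one dimension, the kNN density estimate at x is k divided by n times the length of the smallest symmetric interval around x that reaches the k-th nearest sample point; a crude mode estimate then takes the sample point where this estimate is largest.

```python
def knn_density_1d(x, sample, k):
    """kNN density estimate in 1-D: f_hat(x) = k / (n * 2 * r_k(x)),
    where r_k(x) is the distance from x to its k-th nearest sample point."""
    dists = sorted(abs(x - s) for s in sample)
    r_k = dists[k - 1]
    return k / (len(sample) * 2 * r_k)

sample = [0.0, 0.1, 0.2, 0.25, 0.3, 1.0, 2.0]
# Crude mode estimate: the sample point with the highest leave-one-out density.
mode = max(sample, key=lambda s: knn_density_1d(s, [t for t in sample if t != s], 3))
```

The estimate peaks inside the dense cluster near 0.2 and is small near the isolated points, which is the intuition behind reading modes off a kNN graph.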
1/23/18 1:10 PM
Roy Lederman (Princeton) “Inverse Problems and Unsupervised Learning with applications to Cryo-Electron Microscopy” Abstract: Cryo-electron microscopy (cryo-EM) is an imaging technology that is revolutionizing structural biology; the Nobel Prize in Chemistry 2017 was recently awarded to Jacques Dubochet, Joachim Frank and Richard Henderson “for developing cryo-electron microscopy for the high-resolution structure determination of biomolecules in solution”. Cryo-electron microscopes produce a large number of very noisy two-dimensional projection images of individual frozen molecules. Unlike related methods, such as computed tomography (CT), the viewing direction of each image is unknown. The unknown directions, together with extreme levels of noise and additional technical factors, make the determination of the structure of molecules challenging. While other methods for structure determination, such as X-ray crystallography and nuclear magnetic resonance (NMR), measure ensembles of molecules, cryo-electron microscopes produce images of individual molecules. Therefore, cryo-EM could potentially be used to study mixtures of different conformations of molecules. Indeed, current algorithms have been very successful at analyzing homogeneous samples, and can recover some distinct conformations mixed in solutions, but the determination of multiple conformations, and in particular continuums of similar conformations (continuous heterogeneity), remains one of the open problems in cryo-EM. I will discuss a one-dimensional discrete model problem, Heterogeneous Multireference Alignment, which captures many of the properties of the cryo-EM problem. I will then discuss different components which we are introducing in order to address the problem of continuous heterogeneity in cryo-EM: 1. “hyper-molecules,” the mathematical formulation of truly continuously heterogeneous molecules; 2. computational and numerical tools for expressing associated priors; and 3. Bayesian algorithms for inverse problems with an unsupervised-learning component for recovering such hyper-molecules in cryo-EM.
1/26/18 4:10 PM Room 903
Zhou Fan (Stanford)

1/29/18 
Snigdha Panigrahi (Stanford) “An approximation-based approach for randomized conditional inference with an application in eQTLs” Abstract: The goal of eQTLs (expression quantitative trait loci studies) is to identify and quantify how genetic variants regulate the expression of genes in different biological contexts. The outcome variables (typically on the order of 20,000) are molecular measurements of gene expression, and the predictors are genotypes (typically on the order of 1,000,000). The identification of regulatory variants in eQTLs (Consortium et al. (2015); Ongen et al. (2015); Consortium et al. (2017)) proceeds in a hierarchical fashion: the first stage of such a selection identifies promising genes, followed by a search for potential functional variants in a neighborhood around these discovered genes. Once promising candidates for functional variants have been detected, the logical next step is to attempt to estimate their effect sizes. However, obtaining samples corresponding to a human tissue remains a costly endeavor. As a result, eQTL studies continue to be based on relatively small sample sizes, with this limitation particularly serious for tissues such as brain and liver, the organs of most immediate medical relevance. Naive estimates that ignore the genome-wide selection preceding inference can lead to misleading conclusions about the magnitudes of the true underlying associations. Due to the scarcity of biological samples, the problem of reliable effect-size estimation is often deferred to future studies and is therefore inadequately addressed in the eQTL research community. In this talk, I will discuss a principled approach that allows the geneticist to use the available dataset both for discoveries and for follow-up estimation of the associated effect sizes, adjusted for the considerable amount of prior mining.
Motivated to report these effect sizes as consistent point estimates and intervals with target coverage, my methods are modeled along the conditional approach to selective inference introduced in Lee et al. (2016). The proposed procedure is based on a randomized hierarchical strategy that reflects state-of-the-art investigations and introduces the use of randomness instead of data splitting to optimize the use of available data. I will describe the computational bottleneck in performing randomized conditional inference. To overcome these hurdles, I will describe a novel set of techniques that have higher inferential power than the prior selective-inference work in Lee et al. (2016).
2/1/18 (Thursday) 4:00 PM
Evita Xenia Nestoridi (Princeton) “Studying cutoff for card shuffling” In cryptography, in casinos, and in playing board games, people often ask the same question: how is a person (not a computer) supposed to shuffle cards?
2/5/18 
Miaoyan Wang (UC Berkeley) “Beyond matrices: theory, methods, and applications of higher-order tensors” Abstract: Recently, tensors of order 3 or greater, known as higher-order tensors, have attracted increased attention in modern scientific contexts including genomics, proteomics, and brain imaging. A common paradigm in tensor-based algorithms advocates unfolding tensors into matrices and applying classical matrix methods. In this talk, I will consider all possible unfoldings of a tensor into lower-order tensors and present general inequalities between their operator norms. I will then present a new tensor decomposition algorithm built on these inequalities and Kruskal’s uniqueness theorem. This tensor decomposition algorithm provably handles a greater level of noise compared to previous methods while achieving high estimation accuracy. A perturbation analysis is provided, establishing a higher-order analogue of Wedin’s perturbation theorem. In light of these theoretical successes, I will discuss a novel tensor approach to three-way clustering for multi-tissue, multi-individual gene expression studies. I’ll show that the tensor method deciphers multi-way transcriptome specificity in finer detail than previously possible via our analysis of the GTEx RNA-seq data.
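The unfoldings (matricizations) that such inequalities compare can be sketched as follows; this is a generic illustration, and the column ordering is just one convention, not necessarily the one used in the talk.

```python
def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor T (nested lists, shape I x J x K):
    rows index the chosen mode, columns enumerate the remaining two modes."""
    I, J, K = len(T), len(T[0]), len(T[0][0])
    if mode == 0:
        return [[T[i][j][k] for j in range(J) for k in range(K)] for i in range(I)]
    if mode == 1:
        return [[T[i][j][k] for i in range(I) for k in range(K)] for j in range(J)]
    return [[T[i][j][k] for i in range(I) for j in range(J)] for k in range(K)]

T = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # a 2 x 2 x 2 tensor
M0, M1, M2 = unfold(T, 0), unfold(T, 1), unfold(T, 2)
```

Every unfolding rearranges the same entries, so the Frobenius norm is identical across modes, while the operator (spectral) norms generally differ; inequalities of the kind described in the abstract bound how far apart they can be.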
2/9/18 
Laurence Aitchison (CBL, Cambridge) 
2/12/18 
Scott Linderman (Columbia) “Bayesian Methods for Discovering Structure in Neural and Behavioral Data”
Abstract:
New recording technologies are transforming neuroscience, allowing us to precisely quantify neural activity, sensory stimuli, and natural behavior. How can we discover simplifying structure in these high-dimensional data and relate these domains to one another? I will present my work on developing Bayesian statistical methods to answer this question. First, I will develop state space models to study global brain states and recurrent dynamics in the neural activity of C. elegans. In doing so, I will draw on prior knowledge and theory to build interpretable models. When our initial models fall short, I will show how we criticize and revise them by inserting flexible components, like artificial neural networks, at judiciously chosen locations. Next, I will discuss the Bayesian inference algorithms I have developed to fit such models at the scales required by modern neuroscience. The key to efficient inference will be augmentation schemes and approximate methods that exploit the structure of the model. This example is illustrative of a broader framework for harnessing recent advances in machine learning, statistics, and neuroscience. Prior knowledge and theory provide the starting point for interpretable models, machine learning techniques lend additional flexibility where needed, and new Bayesian inference algorithms provide the means to fit these models and discover structure in neural and behavioral data.

2/16/18 
Aaditya Ramdas (UC Berkeley) “Interactive algorithms for multiple hypothesis testing” Abstract: Data science is at a crossroads. Each year, thousands of new data scientists are entering science and technology, after broad training in a variety of fields. Modern data science is often exploratory in nature, with datasets being collected and dissected in an interactive manner. Classical guarantees that accompany many statistical methods are often invalidated by their nonstandard interactive use, resulting in an underestimated risk of falsely discovering correlations or patterns. It is a pressing challenge to upgrade existing tools, or create new ones, that are robust to involving a human in the loop. In this talk, I will describe two new advances that enable some amount of interactivity while testing multiple hypotheses, and control the resulting selection bias. I will first introduce a new framework, STAR, that uses partial masking to divide the available information into two parts, one for selecting a set of potential discoveries, and the other for inference on the selected set. I will then show that it is possible to flip the traditional roles of the algorithm and the scientist, allowing the scientist to make post-hoc decisions after seeing the realization of an algorithm on the data. The theoretical basis for both advances is founded in the theory of martingales: in the first, the user defines the martingale and associated filtration interactively, and in the second, we move from optional stopping to optional spotting by proving uniform concentration bounds on relevant martingales. This talk will feature joint work with Will Fithian, Lihua Lei and Eugene Katsevich, but I will also briefly mention work with Rina Barber, Jianbo Chen, Kevin Jamieson, Michael Jordan, Max Rabinovich, Martin Wainwright, Fanny Yang and Tijana Zrnic. Bio: Aaditya Ramdas (www.cs.berkeley.edu/~a
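The partial-masking idea can be sketched generically (an illustrative toy in the spirit of masking-based methods, not STAR's actual procedure): each p-value p is split into a masked value g(p) = min(p, 1 - p), which the analyst may see while interactively selecting hypotheses, and a hidden bit indicating whether p is small or large, which is reserved for the final inference.

```python
def mask_p_value(p):
    """Split p into (masked value, hidden bit). The masked value
    g(p) = min(p, 1 - p) is visible during interactive selection; the
    hidden bit b(p) = 1{p <= 0.5} is revealed only at inference time."""
    return min(p, 1.0 - p), p <= 0.5

pvals = [0.01, 0.2, 0.96, 0.5]
masked = [mask_p_value(p)[0] for p in pvals]
hidden = [mask_p_value(p)[1] for p in pvals]
```

Because small and large p-values produce the same masked value, selection based only on the masked values cannot peek at the evidence carried by the hidden bits, which is what keeps the subsequent test valid despite the interaction.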
2/19/18 
Daniela Witten (University of Washington) “Are clusterings of multiple data views independent?” 
2/26/18 
Jay Kadane (CMU) 
3/5/18 
Kosuke Imai (Princeton) “Causal Inference with Interference and Noncompliance in Two-Stage Randomized Experiments” In many social science experiments, subjects often interact with each other, and as a result one unit’s treatment influences the outcome of another unit. Over the last decade, significant progress has been made towards causal inference in the presence of such interference among units. Researchers have shown that two-stage randomization of treatment assignment enables the identification of average spillover and direct effects. However, much of the literature has assumed perfect compliance with treatment assignment. In this paper, we establish the nonparametric identification of the average complier direct effect (CADE) and average complier spillover effect (CASE) in two-stage randomized experiments with interference and noncompliance. In particular, we consider the spillover effect of the encouragement on the treatment as well as the spillover effect of the treatment on the outcome. We then propose a consistent Neyman-type estimator of the CADE and derive its variance under the stratified interference assumption. We then derive the exact relationship between the proposed Neyman-type estimator and the two-stage least squares estimator. The proposed methodology is motivated by and applied to the randomized evaluation of the Indian health insurance program. (Joint work with Zhichao Jiang) 
3/12/18 

3/19/18 
Jelena Bradic (UCSD) 
3/26/18 

4/2/18 
Zijian Guo (Rutgers) 
4/9/18 
Vladas Pipiras (UNC) “Sampling and inversion methods in several ‘big data’ applications” Abstract: In a number of modern data applications, sampling serves as a means of reducing data collection, storage and processing costs. Sampled data is then often used to make inference about “population” quantities of interest. The motivating examples considered in the talk are, first, data packet sampling in Internet traffic and inference of the distribution of packet flow sizes and, second, node or edge sampling in directed graphs and inference of the node in-degree distribution. Several approaches to these inference problems are considered, after formulating them in terms of statistical inversion, including an asymptotic approach based on power laws. The availability of various sampling schemes, their (dis)advantages and comparison are also discussed. 
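The packet-sampling example admits a simple illustration of statistical inversion (a toy sketch under Bernoulli packet sampling; the sampling schemes and estimators in the talk may differ): if each packet is kept independently with probability p, an observed flow size is Binomial(n, p), and dividing the observed mean by p inverts the sampling for the mean flow size.

```python
import random

def sample_flows(flow_sizes, p, seed=0):
    """Thin each flow: keep every packet independently with probability p,
    so an observed count is Binomial(true flow size, p)."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for n in flow_sizes]

def invert_mean(observed, p):
    """Statistical inversion for the mean: E[observed] = p * E[size],
    so mean(observed) / p is unbiased for the true mean flow size."""
    return sum(observed) / (len(observed) * p)

true_sizes = [10] * 500 + [100] * 500   # true mean flow size = 55
observed = sample_flows(true_sizes, p=0.1, seed=42)
estimate = invert_mean(observed, 0.1)
```

Recovering the full flow-size distribution (rather than its mean) is the harder inversion problem the abstract refers to, since many true distributions can produce similar thinned observations.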
4/16/18 
Murad Taqqu (Boston University) 
4/23/18 
Linxi Liu (Columbia) “Selecting variables in a classification problem by controlling the false discovery rate” Abstract: Classification problems appear in many applications, where a categorical or binary response is related to a group of potential explanatory variables. Selecting a subset

4/30/18 
Galen Reeves (Duke) “Information theory and high-dimensional statistical inference”
Abstract: How does one quantify the fundamental and computational limits of high-dimensional inference problems? Much of the theoretical work in statistics has focused on scaling regimes in which the uncertainty about the unknown parameters converges to zero as the amount of data increases. In this talk, I will describe a different approach that instead focuses on settings where the number of observations is commensurate with the number of unknowns. Building upon ideas from information theory and statistical physics, the objectives are (1) obtaining succinct formulas for the performance of optimal methods; and (2) delineating between problem regimes in which this performance can or cannot be obtained using computationally efficient methods. The primary focus will be on the standard linear model with random design matrices. I will also discuss some recent progress on generalized linear models and multilayer networks.
Bio: Galen Reeves joined the faculty at Duke University in Fall 2013, and is currently an Assistant Professor with a joint appointment in the Department of Electrical & Computer Engineering and the Department of Statistical Science. He completed his PhD in Electrical Engineering and Computer Sciences at the University of California, Berkeley in 2011, and he was a postdoctoral associate in the Department of Statistics at Stanford University from 2011 to 2013. He received the NSF CAREER Award in 2018.

5/7/18 

5/14/18

