Schedule for Spring 2016
Information for speakers: For information about the schedule, directions, equipment, reimbursement, and hotel, please click here.
Jan 20, 2016
Holger Rootzén, Chalmers University of Technology
“Taming black swans with statistics”
A black swan is a metaphor for unforeseen catastrophes. I’ll discuss how statistics can – and sometimes cannot – control black swans. Statistics – sometimes – protects banks against bankruptcy. Huge dams are dimensioned using statistics – but how should one account for climate change? Webcams and statistics – if well used – make driving safer. Statistics reduces the threat of pandemics. Methods from statistical quality science can help financial institutions manage risks.
Feb 03, 2016
Pierre Bellec, CREST (Centre de Recherche en Economie et Statistique)
“Almost parametric rates and adaptive confidence sets in shape constrained problems”
We study an estimation problem over a known convex polyhedron. This polyhedron represents the shape constraint, and the goal is to recover an unknown parameter from this polyhedron which is obscured by Gaussian noise. The talk focuses on the particular situation where the unknown parameter belongs to a low-dimensional face of the polyhedron. Is it easier to estimate the true parameter in this case? How does the face dimension impact the rate of estimation? Is it possible to construct confidence sets that adapt to the face dimension?
We will first answer these questions for the polyhedral cone of nondecreasing sequences, which corresponds to univariate isotonic regression. If the true parameter belongs to a k-dimensional face, the Least-Squares estimator converges with an almost parametric rate of order k/n — and this rate is optimal in a minimax sense. This still holds if the true parameter is merely close to a k-dimensional face; in that case the result takes the form of oracle inequalities or regret bounds.
Thus the Least-Squares estimator automatically adapts to the unknown dimension k. Then, a natural problem is to estimate k, or construct data-driven confidence sets that contain the true parameter with high probability. Ideally, the expected diameter of this confidence set should be of the same order as the optimal rate. We will see that the construction of such confidence sets is possible adaptively, without the knowledge of the unknown dimension k.
Finally, we will present the probabilistic arguments used to derive these results and another polyhedron for which a similar phenomenon appears.
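For readers unfamiliar with the setting of this talk: the Least-Squares estimator over the cone of nondecreasing sequences can be computed with the classical pool-adjacent-violators algorithm (PAVA). The sketch below is ours, not the speaker's (the function name `isotonic_lse` is made up); it illustrates how the fitted sequence is piecewise constant, with the number of constant pieces equal to the dimension k of the face the estimator lands on.

```python
import numpy as np

def isotonic_lse(y):
    """Least-squares projection of y onto the cone of nondecreasing
    sequences, computed by the pool-adjacent-violators algorithm."""
    # Each block stores [sum, count]; whenever two adjacent blocks have
    # decreasing means, they are merged, which averages their entries.
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    # Expand the block means back into a full-length fitted sequence.
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

# Fitted values: 1, 2.5, 2.5, 4 — three constant pieces, i.e. the
# estimator lands on a 3-dimensional face of the cone.
fit = isotonic_lse([1.0, 3.0, 2.0, 4.0])
```

The piecewise-constant structure is exactly what allows the estimator to adapt automatically to the unknown face dimension k discussed in the abstract.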
Feb 10, 2016
Yunxiao Chen, Columbia Statistics
“Factor Model Meets Graphical Model: A New Approach to Analyzing Multivariate Binary Data”
Latent factor models are widely used in analyzing the latent structure of psychological assessments, where the responses to assessment items are assumed to be driven by a small number of latent factors. These latent factors are regarded as subjects’ underlying characteristics measured by the assessment. However, a small number of latent factors is typically not able to capture all the variation in the data, due to the existence of many small item groups sharing a common stimulus. That is, there is always additional dependence that is not attributable to the latent factors. In this talk, we will provide a novel modeling framework that takes this additional dependence structure into account by coupling latent factor models with graphical models for the analysis of multivariate binary data. Under the proposed model, a small number of latent factors drives the global dependence among the observed variables and a sparse graphical structure characterizes the remaining dependence. The graphical structure is assumed to be unknown and can be efficiently estimated from the data. The model is applied to study the latent structure of personality using a dataset based on Eysenck’s Personality Questionnaire.
Feb 17, 2016
Rohit Kumar Patra, Columbia Statistics
“Efficient Estimation in Single Index Models”
We consider estimation and inference in a single index regression model with an unknown link function. In contrast to the standard approach of using kernel methods, we consider the estimation of the link function under two different kinds of constraints, namely smoothness constraints and convexity (shape) constraints. Under smoothness constraints, we use smoothing splines to estimate the link function. We develop a method to compute the penalized least squares estimators (PLSE) of the parametric and the nonparametric components given i.i.d. data. Under a convexity constraint on the link function, we develop least squares estimators (LSE) for the unknown quantities. We prove the consistency and find the rates of convergence of both the PLSE and the LSE. We establish the root-n rate of convergence and the asymptotic efficiency of the PLSE and the LSE of the parametric component under mild assumptions. We illustrate and validate the method through experiments on simulated and real data.
Feb 24, 2016
Eugene Wu, Columbia University Data Science & Computer Science
“Closing the loop on data analysis and more”
The rapid democratization of data has placed its access and analysis in the hands of the entire population. While the tools for rapid and large scale data processing have continued to reduce the time to compute analysis results, the techniques to help users better and more easily visualize their data and understand what their results mean are still lacking. In this talk, I will discuss previous and ongoing work on data and result explanation, as well as newer projects to push the ease and scalability of building large scale visualization systems.
March 2, 2016
Yash Kanoria, Columbia Business School
“Know Your Customer: Multi-Armed Bandits with Capacity Constraints”
We consider a service system with heterogeneous servers and clients. Server types are known and there is a fixed capacity of servers of each type. Clients arrive over time, with types initially unknown and drawn from some distribution. Each client sequentially brings N jobs before leaving. The system operator assigns each job to some server type, resulting in a payoff whose distribution depends on the client and server types.
Maximizing the rate of payoff accumulation requires balancing three goals: (i) earning immediate payoffs; (ii) learning client types to increase future payoffs; and (iii) satisfying the capacity constraints. We construct a policy that is provably near optimal (it minimizes regret to leading order as N grows large). Our policy has an appealingly simple three-phase structure: a short type-“guessing” phase, a type-“confirmation” phase that balances payoffs with learning, and finally an “exploitation” phase that focuses on payoffs. Crucially, our approach employs the shadow prices of the capacity constraints in the assignment problem with known types as “externality prices” on the servers’ capacity.
This is joint work with Ramesh Johari and Vijay Kamble.
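To convey the flavor of the three-phase structure, here is a deliberately simplified single-client sketch: a short learning phase over server types followed by exploitation of the empirically best one. This is not the speakers' policy — it omits the capacity constraints and the shadow-price externality charges that are central to the talk — and all names are ours.

```python
import random

def explore_then_exploit(payoff, n_servers, N, explore_rounds=10, seed=0):
    """Toy single-client analogue of a phased bandit policy: sample
    each server type a few times, then commit to the best one.
    The talk's policy additionally prices server capacity via the
    shadow prices of the known-types assignment problem."""
    rng = random.Random(seed)
    sums = [0.0] * n_servers
    counts = [0] * n_servers
    total, t = 0.0, 0
    # Learning phase: try each server type explore_rounds times.
    for s in range(n_servers):
        for _ in range(explore_rounds):
            if t >= N:
                break
            r = payoff(s, rng)
            sums[s] += r
            counts[s] += 1
            total += r
            t += 1
    # Exploitation phase: commit to the empirically best server type.
    best = max(range(n_servers), key=lambda s: sums[s] / max(counts[s], 1))
    while t < N:
        total += payoff(best, rng)
        t += 1
    return total, best

# Hypothetical Bernoulli payoffs: server type 1 is best for this client.
means = [0.3, 0.8, 0.5]
total, best = explore_then_exploit(
    lambda s, rng: 1.0 if rng.random() < means[s] else 0.0,
    n_servers=3, N=1000)
```

The regret of such naive phasing is driven by how long the learning phase lasts; the policy in the talk instead balances payoff and learning within its "confirmation" phase to minimize regret to leading order as N grows.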
March 23, 2016
Georg Hahn, Columbia Statistics
“Monte Carlo based Multiple Testing”
We are concerned with a situation in which we would like to test multiple hypotheses with tests whose p-values cannot be computed explicitly but can be approximated using Monte Carlo simulation. This scenario occurs widely in practice. We are interested in obtaining the same rejections and non-rejections as the ones obtained if the p-values for all hypotheses had been available. This talk presents MMCTest, an algorithm giving the same classification as the one based on the unknown p-values with pre-specified probability. The idea of MMCTest is then generalized into a framework to test multiple hypotheses by providing a generic algorithm for a general multiple testing procedure. We establish conditions which guarantee that the rejections and non-rejections obtained through Monte Carlo simulations are identical to the ones obtained with the p-values. Our framework is applicable to a general class of step-up and step-down procedures which includes many established multiple testing corrections such as the ones of Bonferroni, Holm, Sidak, Hochberg or Benjamini-Hochberg. Moreover, we show how to use our framework to improve algorithms available in the literature in such a way as to yield theoretical guarantees on their results. These modifications can easily be implemented in practice and lead to a particular way of reporting multiple testing results as three sets, together with an error bound on their correctness; we demonstrate this approach on a real biological dataset.
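The flavor of the three-set report can be illustrated with a much cruder construction than MMCTest: estimate each p-value by Monte Carlo, attach a simultaneous Hoeffding confidence band, and classify each hypothesis against a Bonferroni threshold as rejected, not rejected, or undecided. The function and the specific bound below are our illustration, not the talk's algorithm.

```python
import math

def classify(exceed_counts, n_mc, m, alpha=0.05, delta=0.01):
    """Three-set classification of m hypotheses under Bonferroni,
    from Monte Carlo p-value estimates (a crude sketch, not MMCTest).

    exceed_counts[i] = number of the n_mc simulated test statistics
    at least as extreme as the observed one for hypothesis i."""
    thresh = alpha / m  # Bonferroni threshold
    # Hoeffding + union bound: all m true p-values lie within eps of
    # their estimates simultaneously with probability >= 1 - delta.
    eps = math.sqrt(math.log(2 * m / delta) / (2 * n_mc))
    rejected, not_rejected, undecided = [], [], []
    for i, c in enumerate(exceed_counts):
        p_hat = c / n_mc
        if p_hat + eps < thresh:
            rejected.append(i)       # certainly below the threshold
        elif p_hat - eps > thresh:
            not_rejected.append(i)   # certainly above the threshold
        else:
            undecided.append(i)      # needs more simulations
    return rejected, not_rejected, undecided

# 3 hypotheses, 100,000 simulations each: the middle estimate (0.05)
# is clearly above 0.05/3, the first (0.001) clearly below, and the
# third (0.016) is too close to the threshold to resolve.
r, nr, u = classify([100, 5000, 1600], n_mc=100000, m=3)
```

With probability at least 1 − delta the rejected and not-rejected sets agree with the classification based on the exact p-values; the undecided set shrinks as n_mc grows, which is the behavior MMCTest achieves adaptively.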
March 30, 2016
Bailey K Fosdick, Department of Statistics, Colorado State University
“Categorical Data Fusion Using Auxiliary Information”
In data fusion, analysts seek to combine information from two databases comprised of disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences. We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people’s preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion. This is joint work with Maria DeYoreo and Jerome Reiter.
April 13, 2016
Zhaoran Wang, Princeton ORFE
“Probing the Pareto Frontier of Computational-Statistical Tradeoffs”
In this talk, we discuss the fundamental tradeoffs between computational efficiency and statistical accuracy in big data. Based on an oracle computational model, we introduce a systematic hypothesis-free approach for developing minimax lower bounds under computational budget constraints. This approach mirrors the classical Le Cam’s method, and draws explicit connections between algorithmic complexity and geometric structures of parameter spaces. Based on this approach, we sharply characterize the computational-statistical phase transitions that arise in structural normal mean detection, combinatorial detection in correlation graphs and Markov random fields, as well as sparse principal component analysis. Moreover, we resolve several open questions on the computational barriers arising in sparse mixture models, sparse phase retrieval, and tensor component analysis.
April 20, 2016
Richard De Veaux, Department of Mathematics and Statistics, Williams College
“From the Classroom to the Boardroom – How do we get there?”
Big data is everywhere. Companies, governments and the media can’t seem to get enough of the data deluge and the tsunami of data that is about to overwhelm us and/or make us a smarter planet. But, is there any connection between the typical content of an introductory statistics course and the statistical analyses that are performed throughout the world every day? How close is this version of Statistics to real world practice?
Most courses in Statistics now start with exploratory data analysis, move on to probability and then inference. If time permits, they include models, often ending at simple regression. Unfortunately, the student is left with the impression that Statistics is a collection of tools that can help understand one variable at a time, if only the right formula can be recalled, rather than a way of thinking about and modeling the complex world in which we live. Maybe we’re teaching the course backward. We’ll describe an approach that starts the course with models, making full use of students’ intuition about the world and exposing them early to the power of statistical models.