Statistics Seminar Series – Fall 2015

Schedule for Fall 2015

Seminars are on Mondays
Time: 4:10pm – 5:00pm
Location: Room 903, 1255 Amsterdam Avenue

Tea and Coffee will be served before the seminar at 3:30 PM, Room 1025

Cheese and Wine will be served after the seminar at 5:10 PM, Room 1025

For an archive of past seminars, please click here.


Prof. Marc Suchard, UCLA

Title: Sex, lies and self-reported counts: general birth-death processes to model self-reported count of sexual behavior

Abstract: Surveys often ask respondents to report non-negative counts, but respondents may misremember or round to a nearby multiple of 5 or 10. The error inherent in this heaping can bias estimation. To avoid bias, we propose a novel reporting distribution arising from a general birth-death process whose underlying parameters are readily interpretable as rates of misremembering and rounding. The process accommodates a variety of heaping grids and allows for quasi-heaping to values nearly but not equal to heaping multiples. Inference using this stochastic process requires novel, efficient techniques to compute finite-time transition probabilities for arbitrary birth-death processes that we provide through Laplace transforms and a continued fraction representation. We present a Bayesian hierarchical model for longitudinal samples with covariates to infer both the unobserved true distribution of counts and the parameters that control the heaping process. Finally, we apply our methods to longitudinal self-reported counts of sex partners in a study of high-risk behavior in HIV-positive youth.
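
To make the heaping idea in the abstract concrete, here is a minimal sketch in Python. It is not the birth-death reporting distribution from the talk: it assumes a much simpler mechanism in which each respondent rounds to the nearest multiple of 5 with a fixed probability. The uniform true counts and the names heap, share_on_grid, and p_round are hypothetical choices made for illustration only.

```python
import random

def heap(count, p_round, grid=5, rng=random):
    """Report `count` exactly, or (with probability p_round) the nearest multiple of `grid`."""
    if rng.random() < p_round:
        # Integer rounding to the grid; exact ties (possible for even grids) go up.
        return grid * ((count + grid // 2) // grid)
    return count

def share_on_grid(counts, grid=5):
    """Fraction of counts that fall exactly on a multiple of `grid`."""
    return sum(c % grid == 0 for c in counts) / len(counts)

# Assumed toy data: true counts uniform on 0..30, half of respondents round.
rng = random.Random(0)
true_counts = [rng.randint(0, 30) for _ in range(10_000)]
reported = [heap(c, p_round=0.5, rng=rng) for c in true_counts]

# The reported counts pile up on multiples of 5 far more than the true counts do.
print(share_on_grid(true_counts), share_on_grid(reported))
```

Comparing the two printed shares shows the spikes at multiples of 5 that heaping adds to the reported distribution; ignoring them is the source of the bias that the abstract describes.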


James Fowler, University of California, San Diego

“Social Networks and Health: From Observation to Experimentation to Intervention”

From Framingham to Facebook, we have used a variety of social networks to measure, analyze, and change the effect of social networks on health. In this talk I will discuss a number of papers using different methods to better understand how networks function and what we can do to use them to make people healthier.


Le Bao, Pennsylvania State University

“Estimating HIV Epidemics for Sub-National Areas”

As the global HIV pandemic enters its fourth decade, increasing numbers of surveillance sites have been established, allowing countries to look into the epidemics at a finer scale, e.g. at sub-national levels. Currently, the epidemic models have been applied independently to the sub-national areas within countries. However, the availability and quality of the data vary widely, which leads to biased and unreliable estimates for areas with very little data. We propose to overcome this issue by introducing the dependence of the parameters across areas in a mixture model. The joint distribution of the parameters in multiple areas can be approximated directly from the results of independent fits without needing to refit the data or unpack the software. As a result, the mixture model has better predictive ability than the independent model, as shown in examples from multiple countries in Sub-Saharan Africa.


Larry Brown, University of Pennsylvania

“Mallows Cp for Realistic Out-of-sample Prediction”

Mallows’ Cp is a frequently used tool for variable selection in linear models. (For the original discussion see Mallows (1973), building on Mallows (1964, 1966).) In practice it may be used in conjunction with forward stepwise selection or all-subsets selection, or some other selection scheme. It can be derived and interpreted as an estimate of (normalized) predictive squared error in a very special situation. Two key features of that situation are: 1) The observed covariate variables and the covariates for the predictive population are “not to be regarded as being sampled randomly from some population, but rather are taken as fixed design variables” (Mallows (1973)); and 2) The observations in the sample and in the predictive universe follow a homoscedastic linear model. Assumption 1) does not accord with most of the common statistical settings in which Cp is employed, and assumption 2) is often undesirably optimistic in practical settings.

We derive an easily computed variant of Mallows’ expression that does not rely on either of these assumptions. The new variant, denoted as , estimates the predictive squared error for future observations drawn from the same population as that which provided the observed statistical sample. The candidate estimators are linear estimators based on selected variables, but there are virtually no assumptions on the true sampling distribution.

Use of this variant will be demonstrated via simulations in a simple regression setting that enables easy visualization and also exact computation of some relevant quantities. For a more practical demonstration we also apply the methodology to variable selection in a data set involving criminal sentencing.

This is joint research of the Wharton Linear Models Research Group (“WLMRG”). For background (but not itself) see the survey paper, “Models as approximations: how random predictors and model violations invalidate classical inference in regression”, Buja, A., Berk, R. A., Brown, L. D., George, E., Pitkin, E., Traskin, M., Zhang, K., and Zhao, L., (2015), arXiv:1404.1578.
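
For reference, the classical statistic that the abstract takes as its starting point is usually written as below. This is the standard textbook form, not the new variant from the talk, and the symbols are notation introduced here rather than taken from the abstract: RSS_p is the residual sum of squares of a candidate submodel with p fitted coefficients, \hat{\sigma}^2 is an estimate of the error variance (commonly taken from the full model), and n is the sample size.

```latex
C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p
```

Under the fixed-design, homoscedastic assumptions listed in the abstract, a submodel with negligible bias has E[C_p] approximately equal to p, which is why values of C_p close to p are read as a sign of an adequate submodel.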


Jonathan Taylor, Stanford

Title: Selective inference and variations on data splitting


We consider inference after model selection in linear regression problems, specifically after fitting the LASSO. A classical approach to this problem is data splitting, using some randomly chosen portion of the data to choose the model and the remaining data for inference in the form of confidence intervals and hypothesis tests. Viewing this problem in the framework of selective inference of Fithian et al., we describe a few other randomized algorithms with similar guarantees to data splitting, at least in the parametric setting (Tian and Taylor). Time permitting, we describe analogous results from (Tian and Taylor) for arbitrary statistical functionals obeying a CLT in the classical fixed dimensional setting and inference after choosing a tuning parameter by cross-validation.

This is based on joint work with many, though most specifically Will Fithian, Dennis Sun and Xiaoying Tian.


(Tian and Taylor). Selective inference with a randomized response.

(Fithian et al). Optimal inference after model selection.


Irini Moustaki, London School of Economics

“Modelling item non-response in cross-sectional multivariate data and drop out in longitudinal multivariate data: a latent variable approach”

Sample surveys collect information on a number of variables for a randomly selected number of respondents. Among other things, the aim is often to measure some underlying trait(s) of the respondents through their responses to a set of questions and that is often achieved by fitting a latent variable model.

Surveys are either cross-sectional or longitudinal, and missingness occurs in both. Cross-sectional surveys often suffer from item non-response, whereas longitudinal surveys suffer from both drop out and item non-response. A latent variable approach is adopted for handling non-ignorable item non-response and drop out. Various model specifications are proposed to model the missing data mechanism together with the measurement and structural model. The model for the missing data mechanism will serve two purposes: first, to characterize the item non-response/drop out as ignorable or non-ignorable and consequently to study the patterns of missingness/drop out and the characteristics of non-respondents; and second, to study through a sensitivity analysis the effect that a misspecified model for the missing data mechanism might have on the structural part of the model.

The models proposed will be applied to real data from the European Social Survey and the British Household Panel Survey.


Aki Vehtari, Aalto University

“Bayesian data analysis with Gaussian processes”

Gaussian processes (GPs) provide a way to set priors on function space, allowing flexible modeling of non-linearities and interactions. I provide a brief introduction to GPs and show applied data analysis examples where GP-based models could provide a significant advantage compared to previously used models. I also discuss methods for approximate Bayesian inference for GPs.


Jean Jacod, Université Paris VI

“Efficient estimation of integrated volatility in presence of noise, infinite variation jumps, and with irregular sampling”

We revisit the question of estimating the integrated volatility for a discretely observed Itô semimartingale, in the presence of microstructure noise, plus jumps with a high degree of activity, and when the sampling is irregular and possibly random. We use a mixture of the pre-averaging method (to eliminate noise) and of the empirical characteristic function method, which has been shown to be efficient (after proper de-biasing) even when the jump activity is bigger than 1, in contrast with most other methods.

This talk is a presentation of a joint work with Viktor Todorov, from Northwestern University.


*Time: 12noon – 12:50 pm

Room: 903 SSW

Ming Yuan (University of Wisconsin)

“Distance Shrinkage and Euclidean Embedding via Regularized Kernel Estimation”

Abstract: Although recovering a Euclidean distance matrix from noisy observations is a common problem in practice, how well this could be done remains largely unknown. To fill this void, we study a simple distance matrix estimate based upon the so-called regularized kernel estimate. We show that such an estimate can be characterized as simply applying a constant amount of shrinkage to all observed pairwise distances. This fact allows us to establish risk bounds for the estimate, implying that the true distances can be estimated consistently in an average sense as the number of objects increases.

In addition, such a characterization suggests an efficient algorithm to compute the distance matrix estimator, as an alternative to the usual second order cone programming known not to scale well for large problems. Numerical experiments and an application in visualizing the diversity of Vpu protein sequences from a recent HIV-1 study further demonstrate the practical merits of the proposed method.


Prof. Holger Rootzén from Chalmers University of Technology

“Error distributions for approximations of stochastic integrals and discrete hedging.”

The total error stemming from discrete hedging in a Black-Scholes option model is the error in discrete approximations to stochastic integrals. In this talk we derive joint convergence of the approximation error for several stochastic integrals with respect to local Brownian semimartingales, for non-equidistant and random grids. The conditions needed for convergence are that the Lebesgue integrals of the integrands tend uniformly to zero, and that the squared variation and covariation processes converge. We also provide tools which simplify checking these conditions and which extend the range of the results. This is used to prove an explicit limit theorem for random grid approximations of integrals based on solutions of multidimensional SDEs, and to find ways to “design” and optimize the distribution of the approximation error. We briefly discuss how the results can be used to find optimal strategies for discrete hedging.

Joint work with Carl Lindberg.
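
As a small illustration of the approximation error that the abstract studies, the sketch below simulates the simplest textbook case: the left-point sum for the integral of W dW over [0, T] on an equidistant grid, where W is a standard Brownian motion and Itô's formula gives the exact value (W_T^2 - T)/2. This is only a toy example under those assumptions, not the random-grid, multidimensional setting of the talk, and the function names are hypothetical.

```python
import math
import random

def left_point_error(n_steps, horizon=1.0, rng=random):
    """Exact value minus left-point sum for int_0^T W dW on an equidistant grid."""
    dt = horizon / n_steps
    w = 0.0        # current Brownian value W_{t_i}
    approx = 0.0   # running left-point sum of W_{t_i} * (W_{t_{i+1}} - W_{t_i})
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        approx += w * dw
        w += dw
    exact = 0.5 * (w * w - horizon)  # Ito's formula: (W_T^2 - T) / 2
    return exact - approx

def rms_error(n_steps, n_paths=2000, seed=0):
    """Root-mean-square approximation error over independently simulated paths."""
    rng = random.Random(seed)
    errors = [left_point_error(n_steps, rng=rng) for _ in range(n_paths)]
    return math.sqrt(sum(e * e for e in errors) / n_paths)

for n in (10, 100, 1000):
    print(n, rms_error(n))
```

In this special case the error equals half the difference between the realized sum of squared increments and T, so its standard deviation is T / sqrt(2n). The printed values should therefore shrink roughly like one over the square root of the number of grid points.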


Harrison Zhou, Yale University

“A General Framework for Bayes Structured Linear Models”


High dimensional statistics deals with the challenge of extracting structured information from complex model settings. Compared with the great number of frequentist methodologies, there are rather few theoretically optimal Bayes methods that can deal with very general high dimensional models. In contrast, Bayes methods have been extensively studied in various nonparametric settings and rate optimal posterior contraction results have been established.

This paper provides a unified approach to both Bayes high dimensional statistics and Bayes nonparametrics in a general framework of structured linear models. With the proposed two-step model selection prior, we prove a general theorem of posterior contraction under an abstract setting. The main theorem can be used to derive new results on optimal posterior contraction under many complex model settings, including the stochastic block model, graphon estimation and dictionary learning. It can also be used to re-derive optimal posterior contraction for problems such as sparse linear regression and nonparametric aggregation, which improves upon previous Bayes results for these problems in the literature.

The key to the success lies in a novel two-step prior distribution. The prior on the parameters is an elliptical Laplace distribution that is capable of modeling signals with large magnitude, and the prior on the models involves an important correction factor that compensates for the effect of the normalizing constant of the elliptical Laplace distribution.

This is joint work with Chao Gao and Aad van der Vaart.


*Time: 12noon – 12:50 pm

Room: 903 SSW

Arthur Gretton (University College London)

“Nonparametric Bayesian inference using kernel distribution embeddings”

A method is presented for approximate Bayesian inference, where explicit models for the prior and likelihood are unknown (or difficult to compute), but sampling from these distributions is possible. The method expresses the prior as an element in a reproducing kernel Hilbert space, and the likelihood as a family of such elements, indexed by the conditioning variable. These distribution embeddings may be computed directly from training samples. Kernelized Bayesian inference can be applied to any domain where kernels are defined, including distributions on strings, graphs, and groups, as well as to complex, non-Gaussian continuous distributions. An empirical comparison with approximate Bayesian computation (ABC) shows that better performance can be obtained in high dimensions. Finally, the approach is applied to camera angle recovery from captured images, showing better performance than the extended Kalman filter.


Matt Taddy, University of Chicago

“Big Data and Bayesian Nonparametrics”

Big Data is often characterized by large sample sizes, high dimensions, and strange variable distributions. For example, an e-commerce website has tens to hundreds of millions of observations weekly on a huge number of variables with density spikes at zero and elsewhere and very fat tails. These properties, big and strange, beg for nonparametric analysis. We revisit a flavor of distribution-free Bayesian nonparametrics that approximates the data generating process (DGP) with a multinomial sampling model. This model then serves as the basis for analysis of statistics (functionals of the DGP) that are useful for decision making regardless of the true DGP. The ideas will be illustrated in the indexing of treatment effect heterogeneity onto user characteristics in digital experiments, and in analysis of decision trees employed in fraud prediction. The result is a framework for scalable nonparametric Bayesian decision making on massive data.


*Time: 12noon – 12:50 pm

Room: 903 SSW

Hui Zou, University of Minnesota

“A High Dimensional Bayesian Change-Point Regression Model”

High-dimensional change-point regression is severely underdeveloped, probably because the popular sparse penalized regression methods lose their appeal under the change-point scenario. Motivated by the analysis of Minnesota House Price Index data, we propose a Bayesian framework for fitting changing linear regression models in high-dimensional settings. Using segment-specific shrinkage and diffusion priors, we deliver full posterior inference for the change points and simultaneously obtain posterior probabilities of variable selection in each segment via an efficient MCMC algorithm. We apply our method to Minnesota House Price Index data, and our analysis suggests an interesting transition from the hedonic price model to the macroeconomic effects model.


Neil Shephard, from Harvard University

“Continuous time analysis of fleeting discrete price moves”

This paper proposes a novel model of financial prices where: (i) prices are discrete; (ii) prices change in continuous time; (iii) a high proportion of price changes are reversed in a fraction of a second. Our model is analytically tractable and directly formulated in terms of the calendar time and price impact curve. The resulting cadlag price process is a piecewise constant semi-martingale with finite activity, finite variation and no Brownian motion component. We use moment-based estimation to fit four high frequency futures data sets and demonstrate the descriptive power of our proposed model. This model is able to describe the observed dynamics of price changes over three different orders of magnitude of time intervals.

Gennady Samorodnitsky, Cornell University

“Ridges and valleys in the high excursion sets of Gaussian random fields”


It is well known that normal random variables do not like taking large values. Therefore, a continuous Gaussian random field on a compact set does not like exceeding a large level. If it does exceed a large level at some point, it tends to go back below the level a short distance away from that point. One, therefore, does not expect the excursion set above a high level for such a field to possess any interesting structure. Nonetheless, if we want to know how likely two points in such an excursion set are to be connected by a path (“a ridge”) in the excursion set, how do we figure that out? If we know that a ridge in the excursion set exists (e.g. the field is above a high level on the surface of a sphere), how likely is there to be also a valley (e.g. the field going below a fraction of the level somewhere inside that sphere)?

We use the large deviation approach. Some surprising results (and pictures) are obtained.

(Joint work with R. Adler and E. Moldavskaya)