Yajuan Si's Research

Research areas:

Bayesian statistics, latent variable models, high-dimensional data, categorical data analysis, multilevel regression and poststratification, missing data imputation, complex survey weighting methods, causal inference and data confidentiality protection

My current research interests lie in developing robust Bayesian approaches that efficiently handle selection bias and nonresponse in high-dimensional data, and in advocating model-based adjustments for complex surveys. Real-life sample survey data differ systematically from their target populations because of selection bias and nonresponse. The standard approaches to accounting for discrepancies between sample and population are imputation and weighting, but they suffer from various problems. My goal is to turn these real-world problems into technical problems and then develop robust statistical solutions using high-dimensional probability models.

Bayesian Nonparametric Weighted Sampling Inference

Survey weighting adjusts for known or expected differences between sample and population. Weights are constructed on design or poststratification variables that are predictors of inclusion probability. In this paper, we assume that the only information we have about the weighting procedure is the values of the weights in the sample. We propose a hierarchical Bayesian approach in which we model the weights of the nonsampled units in the population and simultaneously include them as predictors in a nonparametric Gaussian process regression to yield valid inference for the underlying finite population and capture the uncertainty induced by sampling and the unobserved outcomes. We use simulation studies to evaluate the performance of our procedure and compare it to the classical design-based estimator. We apply our method to data from two ongoing social surveys: the American Community Survey and the Fragile Families and Child Wellbeing Study. Our studies find the Bayesian nonparametric finite population estimator to be more robust than the classical design-based estimator without loss in efficiency.
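
As a point of reference, a classical design-based estimate of a population mean can be sketched as a weighted sample mean (the Hajek form). Treating this as the comparison estimator is an assumption made here for illustration; the function name and the toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def weighted_mean(y, w):
    """Hajek-type estimate of a population mean: sum(w * y) / sum(w)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * y) / np.sum(w))

# Hypothetical sample: outcomes and the survey weights attached to them.
y = [1.0, 2.0, 3.0, 4.0]
w = [10.0, 10.0, 30.0, 50.0]

# Units with larger weights stand in for more of the population,
# so they pull the estimate toward their outcome values.
estimate = weighted_mean(y, w)
```

This estimator uses only the weights observed in the sample, which is the same information the model-based approach described above starts from.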

Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys

In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian, joint modeling approach to multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions. The approach automatically models complex dependencies while being computationally expedient. The Dirichlet process prior distributions enable analysts to avoid fixing the number of mixture components at an arbitrary number. We illustrate repeated sampling properties of the approach using simulated data. We apply the methodology to impute missing background data in the 2007 Trends in International Mathematics and Science Study.
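
The generative side of a Dirichlet process mixture of multinomials can be sketched with a truncated stick-breaking construction. This is a minimal illustration under stated assumptions, not the paper's sampler: the truncation level, the concentration parameter `alpha`, the flat Dirichlet priors, and all names below are hypothetical choices.

```python
import numpy as np

def stick_breaking_weights(alpha, num_components, rng):
    """Draw mixture weights from a truncated stick-breaking prior."""
    v = rng.beta(1.0, alpha, size=num_components)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
num_components = 20   # truncation level (assumed)
num_variables = 3     # number of categorical survey items (assumed)
num_levels = 4        # categories per item (assumed equal across items)

pi = stick_breaking_weights(alpha=1.0, num_components=num_components, rng=rng)

# Category probabilities for each (component, variable) pair.
phi = rng.dirichlet(np.ones(num_levels), size=(num_components, num_variables))

# Generate one record: pick a latent class, then draw each item
# independently given that class.
z = rng.choice(num_components, p=pi)
record = [int(rng.choice(num_levels, p=phi[z, j])) for j in range(num_variables)]
```

Because items are independent only within a latent class, mixing over classes induces dependence among the items marginally, and the stick-breaking prior lets the data determine how many classes carry appreciable weight.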

Handling Attrition in Longitudinal Studies: The Case for Refreshment Samples

Panel studies typically suffer from attrition, which reduces sample size and can result in biased inferences. It is impossible to know whether or not the attrition causes bias from the observed panel data alone. Refreshment samples—new, randomly sampled respondents given the questionnaire at the same time as a subsequent wave of the panel—offer information that can be used to diagnose and adjust for bias due to attrition. We review and bolster the case for the use of refreshment samples in panel studies. We include examples of both a fully Bayesian approach for analyzing the concatenated panel and refreshment data, and a multiple imputation approach for analyzing only the original panel. For the latter, we document a positive bias in the usual multiple imputation variance estimator. We present models appropriate for three waves and two refreshment samples, including nonterminal attrition.
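
For context, the usual multiple imputation variance estimator is taken here to mean Rubin's combining rules; that reading is an assumption. A minimal sketch, with a hypothetical function name and toy inputs:

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine m completed-data analyses with Rubin's rules.

    Returns the pooled point estimate and the total variance
    T = U_bar + (1 + 1/m) * B.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()            # pooled point estimate
    u_bar = u.mean()            # average within-imputation variance
    b = q.var(ddof=1)           # between-imputation variance
    total_var = u_bar + (1.0 + 1.0 / m) * b
    return float(q_bar), float(total_var)

# Hypothetical results from three imputed data sets.
point, total_var = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The positive bias noted above concerns this kind of total variance when only the original panel is analyzed after imputation.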

Semi-parametric Selection Models for Potentially Non-ignorable Attrition in Panel Studies with Refreshment Samples

Large-scale panel studies typically suffer from attrition. When the reason for attrition is systematically related to the missing values, ignoring the attrition in models for panel data can result in biased inferences. Unfortunately, panel data alone cannot inform the extent of bias due to attrition. As a consequence, analysts using the panel data alone must make strong and untestable assumptions about the missing data mechanism. Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during the later waves of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by non-ignorable attrition while reducing reliance on strong assumptions about the attrition process. We present a Bayesian approach to handle attrition in two-wave panels with one refreshment sample and many categorical survey variables. The approach includes (i) an additive non-ignorable selection model for the attrition process, and (ii) a Dirichlet process mixture of multinomial distributions for the categorical survey variables. We present MCMC algorithms for sampling from the posterior distribution of model parameters and missing data. We apply the model to correct attrition bias in an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study.
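
A small simulation can illustrate why a refreshment sample is informative when attrition is non-ignorable. This is a toy sketch, not the selection model from the paper: the binary outcome, the retention probabilities, and the sample sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Wave-2 binary outcome for the original panel (true prevalence 0.5).
y2_panel = rng.binomial(1, 0.5, size=n)

# Non-ignorable attrition: the chance of staying in the panel
# depends on the wave-2 value itself.
p_stay = np.where(y2_panel == 1, 0.8, 0.4)
stayed = rng.random(n) < p_stay

# Refreshment sample: a fresh random draw at wave 2, with no attrition.
y2_refresh = rng.binomial(1, 0.5, size=n)

# Completers over-represent y2 = 1 (about 2/3 in expectation),
# while the refreshment sample stays close to the true 0.5.
completer_mean = y2_panel[stayed].mean()
refresh_mean = y2_refresh.mean()
```

The gap between the two means is the kind of evidence that the panel alone cannot provide, since the wave-2 values of those who dropped out are never observed.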

Bayesian Joint Latent Pattern Mixture Models in Panel Study with Refreshment Samples

The attrition effect cannot be estimated from panel data alone without untestable assumptions about the missing data mechanism. Refreshment samples offer an extra data source that can be utilized to estimate the attrition effect while reducing reliance on strong assumptions about the missing data mechanism. We propose Bayesian joint latent pattern mixture models, in which attrition and the survey variables are modeled jointly via latent classes, to handle attrition and item nonresponse simultaneously under multiple imputation in a two-wave panel with one refreshment sample when the variables involved are categorical and high-dimensional. Simulation studies show that the new approach can outperform traditional latent pattern mixture models that assume conditional independence given the latent classes, an assumption that results in biased inference even with more classes. We apply the fully Bayesian procedure to an election panel study.

Bayesian Poststratification and Deep Interaction

The proposed approach will provide a general framework that allows us to poststratify on more variables, and thus many more poststratification cells, than is possible using classical methods. The basic principle in both classical and Bayesian sampling inferences is to include in the analysis all variables that affect the respondents’ inclusion probabilities. Small cells lend credence to the implicit assumption of ignorable nonresponse, in the sense that the respondents are a random sample of the subpopulation in each poststratum. There is existing work on Bayesian poststratification, but difficulties remain when adjusting for a large number of background variables and deep interactions.
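
The basic poststratification step can be sketched as a population-weighted average of cell-level estimates. This is a generic illustration rather than the proposed framework; the function name, cell estimates, and population counts below are hypothetical.

```python
import numpy as np

def poststratify(cell_estimates, cell_pop_counts):
    """Population estimate: sum_j N_j * theta_j / sum_j N_j."""
    theta = np.asarray(cell_estimates, dtype=float)
    n_pop = np.asarray(cell_pop_counts, dtype=float)
    return float(np.sum(n_pop * theta) / np.sum(n_pop))

# Hypothetical cells: an estimated mean in each poststratum and the
# known population count for that poststratum.
cell_means = [0.2, 0.5, 0.9]
cell_counts = [5000, 3000, 2000]

estimate = poststratify(cell_means, cell_counts)
```

With many background variables the number of cells grows multiplicatively and many cells hold few or no respondents, which is why the cell-level estimates would typically come from a model rather than from raw cell means.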