Dr. Xinyi Xu, Department of Statistics
Rank at time of award: Assistant Professor
Abstract
Model selection is a common and important problem in both statistics and population research.
For example, in health disparity research, health expenditure for different subpopulations (e.g.,
racial groups) is often an outcome of great interest. The distribution of such an outcome is
usually skewed with heavy tails. In many situations, a simple transformation, such as the
logarithmic transformation, is not sufficient. How should we decide which probability distributions
can best describe these outcomes (or their transformations)? Also, when using data from large
public surveys to predict the risk of a certain disease, investigators may need to select among
hundreds of potential predictors to build a regression model. How should we decide which
predictors to include?
A traditional approach to model selection is to use significance tests and p-values. When
building linear regression models in the presence of many potential predictors, it is common to
first fit the full model, test whether each regression coefficient is 0, remove the predictors
whose p-values are larger than a pre-specified significance level, and then refit the resulting
reduced model. However, as shown in Raftery (1995), when the sample size is large or the data
contain many independent variables, this method can be very misleading and tends to find strong
evidence for effects that do not exist. Moreover, sometimes more than one model seems
reasonable, and the different models lead to different answers to the questions of interest. Such
examples have been observed in educational stratification (Kass and Raftery 1995) and
epidemiology (Raftery 1993). In this situation, selecting a single model ignores the uncertainty
about the model form and thus results in underestimation of the estimation and prediction errors.
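To make the traditional procedure concrete, the following is a minimal sketch of backward elimination by p-value for a linear model, written in Python with statsmodels; the simulated data, the column names x1, x2, x3, and the 0.05 cutoff are illustrative assumptions rather than part of the original study.

    # Minimal sketch of p-value-based backward elimination (illustrative only).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def backward_eliminate(y, X, alpha=0.05):
        """Repeatedly drop the predictor with the largest p-value above alpha,
        refitting the reduced model each time."""
        predictors = list(X.columns)
        while predictors:
            fit = sm.OLS(y, sm.add_constant(X[predictors])).fit()
            pvals = fit.pvalues.drop("const")      # ignore the intercept
            worst = pvals.idxmax()
            if pvals[worst] <= alpha:              # all remaining terms "significant"
                return fit, predictors
            predictors.remove(worst)               # drop the weakest predictor and refit
        return None, []

    # Illustrative use on simulated data; x3 has no true effect.
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
    y = 1.0 + 2.0 * X["x1"] - 1.5 * X["x2"] + rng.normal(size=200)
    fit, kept = backward_eliminate(y, X)
    print("retained predictors:", kept)

In this toy example, x3 typically is dropped first; the significance-level cutoff is exactly the kind of rule that, as noted above, Raftery (1995) shows can be misleading when the sample size is large or the predictors are numerous.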