Dr. Xinyi Xu, Department of Statistics. *Rank at time of award: Assistant Professor*

Abstract

Model selection is a common and important problem in both statistics and population research.

For example, in health disparity research, health expenditure for different subpopulations (e.g.,

racial groups) is often an outcome of great interest. The distribution of such an outcome is

usually skewed with heavy tails. In many situations, a simple transformation, such as the

logarithm, is not enough. How should we decide which probability distributions

can best describe these outcomes (or their transformations)? Also, when using data from large

public surveys to predict the risk of a certain disease, investigators may need to select among

hundreds of potential predictors to build a regression model. How should we decide which

predictors to include?

A traditional approach for model selection is to use significance tests and p-values. For

building linear regression models in the presence of many potential predictors, it is common to

first fit the full model, test whether the regression coefficients are 0, remove the predictors for

which the p-values are larger than a pre-specified significance level, and then refit the resulting

reduced model. However, as shown in Raftery (1995), when the sample size is large or the data

contain many independent variables, this method can be very misleading and tends to find strong

evidence for effects that do not exist. Moreover, sometimes more than one model seems

reasonable, and the different models lead to different answers to the questions of interest. Such

examples have been observed in educational stratification (Kass and Raftery 1995) and

epidemiology (Raftery 1993). In this situation, selecting a single model ignores uncertainty

about the model form and thus leads to underestimation of the estimation/prediction error.
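The backward-elimination procedure described above can be illustrated with a small simulation. The sketch below is generic and not taken from the original work: it fits ordinary least squares by hand, computes two-sided t-test p-values for the slopes, and repeatedly drops the least significant predictor until every remaining p-value falls below the chosen significance level (the function names and the synthetic data are invented for illustration).

```python
import numpy as np
from scipy import stats

def slope_pvalues(X, y):
    """OLS fit of y on X (with intercept); returns two-sided t-test
    p-values for the slope coefficients only."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    df = n - Z.shape[1]
    sigma2 = resid @ resid / df                            # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))  # coefficient SEs
    pvals = 2 * stats.t.sf(np.abs(beta / se), df)
    return pvals[1:]                                       # drop the intercept

def backward_eliminate(X, y, names, alpha=0.05):
    """Fit the full model, drop the predictor with the largest p-value
    above alpha, refit, and repeat until all survivors are 'significant'."""
    keep = list(range(X.shape[1]))
    while keep:
        p = slope_pvalues(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        del keep[worst]
    return [names[j] for j in keep]

# Synthetic data: only the first two of ten predictors have real effects.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
selected = backward_eliminate(X, y, [f"x{j}" for j in range(p)])
print(selected)
```

With strong true effects and a moderate sample size this usually recovers the two real predictors, but rerunning it with hundreds of candidate predictors shows the problem the abstract describes: at a fixed significance level, a fraction of the pure-noise variables will routinely survive elimination and appear "significant."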