Missing data are nearly always a problem in research, and missing values represent a serious threat to the validity of inferences drawn from findings. Increasingly, social science researchers are turning to multiple imputation to handle missing data. Multiple imputation, in which missing values are replaced by values repeatedly drawn from conditional probability distributions, is an appropriate method for handling missing data when values are not missing completely at random. However, use of this method requires developing an imputation model from the observed data. This is typically a rigorous and time-consuming process. To encourage wider adoption of multiple imputation in social work research, a simple framework for designing imputation models is presented. The framework and its ability to generate unbiased estimates are demonstrated in a simulation study.
KEY WORDS: missing data; multiple imputation; nonresponse
Missing data are ubiquitous in social research, and missingness or nonresponse can represent a threat to the validity of inferences because of undue effects on efficiency, power, and parameter bias (Shadish, Cook, & Campbell, 2002). Social work researchers are now addressing missing data in a more rigorous manner. Recently, Saunders et al. (2006) and Choi, Golder, Gillmore, and Morrison (2005) described important data imputation methods and dispelled misunderstandings regarding popular imputation methods, such as mean substitution.
Recent advances in analytic methods, such as multiple imputation (MI), are taking hold in social work research. With MI, missing values are replaced with values repeatedly drawn from simulated conditional probability distributions (Schafer, 1997), thus creating multiple versions of the data set. Each version of the data set is analyzed according to the data analysis model, and the multiple results are combined into point estimates (Rubin, 1996). A critical task in MI is to devise an imputation model (Allison, 2002) or missing data model (Graham, Olchowski, & Gilreath, 2007), which involves specifying the measures that are putatively associated with the missing values. Although this process adds steps to the analysis, the specification of an imputation model and the creation of multiple data sets can produce less-biased estimates in the presence of missing data across a wide variety of data analysis techniques (Schafer, 1997).
Besides MI, there are many other methods for addressing missing data (Schafer, 1999; Schafer & Graham, 2002). An equally rigorous method known as direct or full information maximum likelihood (FIML) estimation can produce unbiased estimates and correct standard errors in the presence of missing data. When the number of imputations is sufficiently large, identical missing data models will produce the same estimates under MI and FIML (Graham et al., 2007). Unlike MI, FIML is limited to maximum likelihood analytic techniques and the missing data model must be included in the analysis model. Although we focus on MI, the steps we describe for developing an imputation model are equally appropriate for use in the missing data model for FIML. On the basis of the MI literature, this article describes a framework for developing an imputation model for use with any free or commercial software package that performs MI.
BRIEF REVIEW OF MISSING DATA CONCEPTS
Generally, both MI and a broad range of missing data issues have received ample attention in the applied literature (Graham et al., 2007). We briefly discuss the three types of distributions that describe the randomness of nonresponse given that this property has consequences for the development of an imputation model. For a discussion of general missing data concepts that are not critical to understanding our discussion of imputation model development, we refer readers to Schafer and Graham (2002).
Distribution of Nonresponse
The probability distribution of nonresponse--more frequently referred to as the missing data or nonresponse mechanism (Rubin, 1976)--is both an important factor in the decision to impute with MI and a context for the development of an imputation model. Technical definitions are available in Rubin (1976) and Schafer (1997); for less technical definitions see Saunders et al. (2006), Allison (2002), and von Hippel (2004). Schafer and Graham's (2002) work provides helpful diagrams and promotes use of the term distribution over that of mechanism because of the latter's misleading implication of an underlying process. The distribution describes the randomness of missing data in the context of observed data (that is, the extent to which being missing is systematically related to observed or missing values).
Completely Random. As described by Rubin (1976), data may be missing completely at random (MCAR). In MCAR, the probability of nonresponse is independent of both the observed and the missing values. When nonresponse is MCAR, deletion--also known as complete case analysis--is a potential strategy given that the randomness of the missing values does not induce bias, although efficiency and power will be negatively affected (Schafer & Graham, 2002). However, deletion is the easiest strategy to implement and is regarded as desirable if the assumption of MCAR is supported. MCAR can be tested with a likelihood ratio chi-square test of the null MCAR hypothesis that compares the means of the observed data for the pattern of nonresponse observed on each variable (for a detailed explanation, see Little, 1988).
Systematic. Nonresponse may be systematic; that is, it may be conditioned on observed or missing values. In the worst-case scenario, nonresponse is associated with the missing values themselves, which is known variously as missing not at random (MNAR) (Schafer & Graham, 2002), not missing at random (NMAR) (von Hippel, 2004), or nonignorable (Allison, 2002). This type of nonresponse can be a problem because there is no widely available, theoretically grounded strategy for handling MNAR missing data (see Schafer & Graham, 2002). A more desirable distribution occurs when nonresponse is randomly distributed over the unobserved data but is associated with the observed data. This type of nonresponse is known as missing at random (MAR) and is also referred to as ignorable (Allison, 2002). When the data are MAR, techniques used to create the multiple imputations eliminate missing data bias (Schafer, 1999).
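The three distributions of nonresponse can be illustrated with a small simulation. The following is a minimal sketch, not a procedure taken from the literature cited here: the variable names, the coefficients, and the logistic form of the nonresponse probabilities are all hypothetical and chosen only for illustration. It generates an always-observed variable x and a partially missing variable y, deletes values of y under each distribution, and reports the mean of y after deletion alongside a mean that conditions on x.

```python
import numpy as np


def logistic(z):
    """Map real values to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def simulate_nonresponse(n=100_000, seed=0):
    """Return (deletion mean, x-adjusted mean) of y for each distribution.

    All coefficients are hypothetical; the true mean of y is 0.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                       # always observed
    y = 0.8 * x + rng.normal(scale=0.6, size=n)  # subject to nonresponse

    prob_missing = {
        "MCAR": np.full(n, 0.3),           # unrelated to x or y
        "MAR": logistic(-1.0 + 1.5 * x),   # depends only on the observed x
        "MNAR": logistic(-1.0 + 1.5 * y),  # depends on the value that goes missing
    }

    results = {}
    for label, p in prob_missing.items():
        observed = rng.random(n) >= p
        # Complete case analysis: drop every case with a missing y.
        deletion_mean = y[observed].mean()
        # Condition on x: fit y on x using complete cases only,
        # then average the predictions over all cases.
        slope, intercept = np.polyfit(x[observed], y[observed], 1)
        adjusted_mean = (intercept + slope * x).mean()
        results[label] = (deletion_mean, adjusted_mean)
    return results


if __name__ == "__main__":
    for label, (deletion_mean, adjusted_mean) in simulate_nonresponse().items():
        print(f"{label}: deletion={deletion_mean:+.3f}, adjusted for x={adjusted_mean:+.3f}")
```

Under these assumptions, deletion should recover the true mean only in the MCAR case. Conditioning on x should recover it in the MAR case as well, which is the sense in which MAR nonresponse is ignorable once the observed data are taken into account. Neither approach is expected to remove the bias in the MNAR case.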
With MI, missing values are replaced with values repeatedly drawn from conditional probability distributions by using a simulation method called Markov chain Monte Carlo (MCMC). As noted previously, this process yields multiple versions of a data set, and the same analysis (for example, a linear regression) is conducted on each version, generating multiple estimates for each parameter. By using a set of rules that adjusts the standard errors for the uncertainty of the imputed values, the multiple estimates are combined to obtain a single parameter estimate (Rubin, 1987). This process adds a step to data analysis because it requires the analyst to specify an imputation model before data analysis.
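The combining step can be sketched in a few lines. The example below is a minimal illustration of the standard combining rules attributed to Rubin (1987), assuming the analyst already has one point estimate and one squared standard error per imputed data set. The function name pool_rubin and the input values are hypothetical.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Combine per-imputation results into one estimate and standard error.

    estimates: one point estimate per imputed data set
    variances: the squared standard error of each estimate
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    if m < 2 or u.size != m:
        raise ValueError("need matching estimates and variances from at least two imputations")

    pooled_estimate = q.mean()        # average of the m point estimates
    within = u.mean()                 # average within-imputation variance
    between = q.var(ddof=1)           # variance of the estimates across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, np.sqrt(total)


if __name__ == "__main__":
    # Hypothetical regression coefficients from five imputed data sets.
    coefs = [0.52, 0.47, 0.55, 0.50, 0.49]
    squared_ses = [0.010, 0.011, 0.009, 0.010, 0.012]
    estimate, se = pool_rubin(coefs, squared_ses)
    print(f"pooled estimate={estimate:.3f}, pooled SE={se:.3f}")
```

The between-imputation term is what adjusts the standard error for the uncertainty of the imputed values: the more the estimates disagree across the imputed data sets, the larger the pooled standard error becomes relative to the average standard error from any single data set.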
MI can produce unbiased estimates when the fraction of missing information approaches 90% (Graham et al., 2007). This does not imply that 90%...