Heteroscedasticity and grouped data regression.

Author: Caudill, Steven B.
  1. Introduction

    Of the numerous problems arising in attempts to empirically model an economic phenomenon using micro level survey data, perhaps the least attention has been given to those attributable to a "grouped dependent variable." In this case, the dependent variable is categorical in nature, having known category boundaries, and of interval strength. For many surveys, data on individual incomes, home values, length of time in residence, etc. fall into this class of measures, and attempts to econometrically estimate models explaining their behavior encounter more difficulties than is commonly realized. When confronted with grouped data on the dependent variable, analysts typically assign each observation in a particular category the midpoint value of that category, perhaps after the end points of that category have been transformed (e.g., to conform to a Pareto, lognormal, or some other distribution), and then estimate the parameters of the model by ordinary least squares (OLS) regression. Alternatively, the analyst might attempt to account for the categorical nature of the dependent variable by employing a maximum likelihood estimation technique such as probit or logit. Unfortunately, both of these approaches result in unsatisfactory parameter estimates. OLS on the category midpoints produces inconsistent estimates.(1) Even the qualitative dependent variable maximum likelihood techniques such as n-chotomous probit produce inefficient estimates, since they ignore the information provided by the known values of the category boundaries.

    About a decade ago, Mark Stewart [13] developed a maximum likelihood model that would allow consistent and asymptotically efficient parameter estimation when the dependent variable is grouped. His procedure has come to be known as "grouped data regression." Mute testimony to the lack of attention paid to this important work is given by the dominance, until very recently, of the two inappropriate techniques described above in the relevant empirical literature.(2)

    Due to the recent rapid rise in popularity of the grouped data regression model, however, some of its limitations should be examined. One limitation that has not been considered is that the model is strictly applicable only if the theoretical disturbance term of the underlying model is homoscedastic. This drawback is particularly disturbing since Yatchew and Griliches [14] have shown that, for maximum likelihood models having likelihood functions very similar to that of grouped data regression, heteroscedasticity leads not only to inefficient estimates (as in OLS) but also to biased and inconsistent estimates. The object of this paper is to develop a method by which the grouped data regression model can be extended to the heteroscedastic case.

    In what follows, we begin by outlining Stewart's homoscedastic model as a useful point of departure. We then suggest a method by which multiplicative heteroscedasticity can be incorporated into Stewart's estimation structure. Finally, we illustrate our suggested procedures with an application to housing demand. We conclude with a brief summary.

  2. Grouped Data Regression

    Suppose we posit the following behavioral model

    $$y^* = X\beta + \varepsilon \qquad (1)$$

    where $y^*$ is an $(n \times 1)$ vector of implicit observations on the dependent variable, $X$ is an $(n \times k)$ matrix of observations on the $k$ independent variables in the model, $\beta$ is a $(k \times 1)$ vector of unknown coefficients to be estimated, and $\varepsilon$ is an $(n \times 1)$ vector of stochastic disturbances, each element $\varepsilon_i$ of which is assumed i.i.d. $N(0, \sigma^2)$. We say that $y^*$ is a vector of "implicit observations" because in this conceptual framework, $y^*$ is not directly observable. If it were observable, then each (cardinally measurable) $y_i^*$ would be independently normally distributed with mean $x_i\beta$ and constant variance $\sigma^2$, as implied by our assumptions on $\varepsilon$. Rather, all we are able to observe is $y_i$, the category, with known end point values, within which $y_i^*$ falls. More precisely, if the real number line were partitioned into $J$ mutually exclusive and exhaustive categories with boundaries $A_j$ $(j = 0, \ldots, J)$, then we observe $y_i = j$ if

    $$A_{j-1} < y_i^* \le A_j. \qquad (2)$$
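
    The observation rule can be illustrated with a minimal Python sketch. It assumes the rule reads $A_{j-1} < y_i^* \le A_j$, with $A_0 = -\infty$ and $A_J = +\infty$ so that the categories are exhaustive. The function name `observe_category`, the boundary values, and the coefficient values are all hypothetical and chosen only for illustration; none of them come from the paper.

```python
import bisect
import random


def observe_category(y_star, boundaries):
    """Return the category j with boundaries[j-1] < y_star <= boundaries[j].

    `boundaries` is the sorted list [A_0, A_1, ..., A_J]. Using
    A_0 = -inf and A_J = +inf is an assumption made here so that the
    J categories cover the whole real line.
    """
    # bisect_left returns the first index j with boundaries[j] >= y_star,
    # which is the j satisfying A_{j-1} < y_star <= A_j.
    return bisect.bisect_left(boundaries, y_star)


# Hypothetical category boundaries (J = 4 categories).
boundaries = [float("-inf"), 10.0, 20.0, 30.0, float("inf")]

# Hypothetical latent model y* = 5 + 2x + e, with e ~ N(0, 3^2).
random.seed(0)
x = [random.uniform(0.0, 10.0) for _ in range(5)]
y_star = [5.0 + 2.0 * xi + random.gauss(0.0, 3.0) for xi in x]

# The analyst sees only the category index y_i, never y_i* itself.
y = [observe_category(v, boundaries) for v in y_star]
```

    Only `y` and the known `boundaries` would be available to the analyst; `y_star` is kept here purely to show what the grouping discards.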

    It is important to emphasize that the observed $y_i$ are only of ordinal strength, but that the category boundaries $\{A_j\}$ are known cardinal numbers. Our problem within this framework is to obtain consistent and asymptotically efficient estimates of the unknown parameters, $\beta$ and $\sigma^2$, of the model. One approach to obtaining such estimates is the method of maximum likelihood.

    Based on the assumptions above, the probability that $y_i = j$, i.e., the probability that $y_i^*$ falls in the $j$th category, is given by

    $$\begin{aligned}
    P(y_i = j) &= P(A_{j-1} < y_i^* \le A_j) \\
    &= P\{(A_{j-1} - x_i\beta)/\sigma < \varepsilon_i/\sigma \le (A_j - x_i\beta)/\sigma\} \\
    &= F[(A_j - x_i\beta)/\sigma] - F[(A_{j-1} - x_i\beta)/\sigma] \qquad (3)
    \end{aligned}$$

    ...
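
    The category probability in (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: $F$ is taken to be the standard normal distribution function (consistent with the $N(0, \sigma^2)$ disturbances), the outer boundaries are taken to be $\pm\infty$, and the log-likelihood is formed as the sum of log category probabilities, which is the usual construction for independent observations. The names `norm_cdf`, `category_prob`, and `log_likelihood`, and all numeric values, are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal CDF (assumed form of F); handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_prob(j, xb, sigma, boundaries):
    """P(y_i = j) = F[(A_j - x_i b)/sigma] - F[(A_{j-1} - x_i b)/sigma]."""
    upper = norm_cdf((boundaries[j] - xb) / sigma)
    lower = norm_cdf((boundaries[j - 1] - xb) / sigma)
    return upper - lower


def log_likelihood(beta, sigma, X, y, boundaries):
    """Sum of log category probabilities over independent observations.

    Assumption: this is the standard ML objective implied by (3);
    the excerpt is truncated before it states the likelihood itself.
    """
    total = 0.0
    for x_i, j in zip(X, y):
        xb = sum(b * v for b, v in zip(beta, x_i))  # x_i * beta
        total += math.log(category_prob(j, xb, sigma, boundaries))
    return total


# Hypothetical boundaries A_0..A_4 and a single fitted mean x_i*beta = 15.
boundaries = [float("-inf"), 10.0, 20.0, 30.0, float("inf")]
probs = [category_prob(j, 15.0, 5.0, boundaries)
         for j in range(1, len(boundaries))]
```

    Because the terms telescope across adjacent categories, the entries of `probs` sum to one up to floating-point rounding. Maximizing `log_likelihood` over `beta` and `sigma` with a numerical optimizer would give estimates for the homoscedastic model outlined here.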
