Binary dependent variables

Oded Bizan
Jun 3, 2016
3 min read

A variable that can have only two possible values is called a binary, or dichotomous, variable. When a modeler seeks to characterize the relationship between a binary dependent variable and a set of dependent variables, the modeler typically considers three alternatives:

Linear regression model;
PROBIT; and
LOGIT

The linear regression model is a natural tool for linking a dependent variable and a set of independent variables. However, when the dependent variable is a binary variable using a linear regression model is inappropriate. At a technical level, when the dependent variable is binary the two key assumptions of the linear regression model are violated: the errors are not homoscedastic and are not normally distributed. Violation of the homoscedasticity assumption results in biased standard errors for the coefficients, and in that the coefficients might not be the most precise in terms of variance. The coefficients themselves, however, remain unbiased. Violation of the normality assumption results in inaccurate inference affecting model specification and testing. On a practical level, the linear regression model can produce predictions that are either negative or larger than 1 which is inconsistent with the definition of the dependent variable.

So we’re left with two alternatives: PROBIT and LOGIT.

The two models differ in their assumptions about the distribution of the error term. In the PROBIT model the cumulative distribution function is the cumulative normal distribution whereas in the LOGIT model it is the logistic distribution (which is why it is often called a logistic regression model). Because the two distributions are very close to each other, except at the tails, they are unlikely to yield very different results (qualitatively), unless the sample is very large in which case there are going to be more observations at the tails.

While the models yield very similar results , often the LOGIT model is chosen over the PROBIT model because it is computationally easier (although it is hardly a factor at the current state of computing power) and because it has a few analytical properties that make it easier to work with:

One of the most convenient features of the LOGIT model is that by take the exponent of the estimated coefficients you obtain adjusted odd ratios which have a very intuitive interpretation (you get the exact odd ratio when there is only one independent variable). By contrast, coefficients of the PROBIT model are much harder to interpret.
Both LOGIT and PROBIT are estimated using maximum likelihood estimation. Typically in maximum likelihood estimation the standard errors, and consequently, the confidence intervals and p-values, are large sample approximations. Obviously, these are approximations may not be very accurate when the sample size is small. The LOGIT model, unlike the PROBIT, can be estimated exactly so that the exact standard errors, confidence intervals, and p-values can be obtained which make it more reliable for small samples.
When the binary data is clustered in some way, for example when the data is a panel of individual that are followed over time, and the modeler is interested in accounting for the possibility that there are between cluster (individual) differences but is not willing to restrict the distribution of the differences or their link to predictor variables. Only with the LOGIT model the modeler can estimate a conditional logistic regression (conditional on cluster size) and the between cluster effects will cancel out.

There are several analytical reasons for using the LOGIT model. However, it is important to remember that analytical advantages do not guarantee that the LOGIT model would provide a more realistic representation of the modeled environment. That said, unless there is a good empirical justification to prefer another model, the way to go with a binary dependent variable is the logistic regression model.

Our next post will include a practical introduction to the logistic regression model focusing on providing guidelines to the appropriate use of the mode and for testing its performance.