The logistic regression model
A logistic regression model, commonly referred to as LOGIT, links a dichotomous (binary) dependent variable to one or more independent variables. Logistic regression is common in the modeling of events such as bankruptcy and credit default, and is often used to predict the probability of such events or, equivalently, as a scoring technique.
This post is a practical introduction to this model and will focus on providing guidelines for the appropriate use of the model and for testing its appropriateness and performance.
Formally, a multivariate binomial logistic regression model is:
log[ Pi / (1-Pi) ] = a + bXi
where Pi = Prob(Yi=1|Xi) is the conditional probability that unit of observation i (e.g. an individual or a firm) will experience the event Yi=1 as opposed to the alternative event Yi=0;
Xi is a vector of independent variables for observation i;
a and the vector b are the regression parameters to be estimated.
The logistic regression model is estimated using Maximum Likelihood.
To obtain the predicted probability of the event (for example, default) for unit i, the predicted value of the linear index, Zi = a + bXi (that is, the predicted log odds), is transformed as follows:
Pi = 1 / [1 + exp(-Zi) ]
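A minimal sketch of this transformation, using the standard inverse-logit form 1 / (1 + exp(-z)) where z = a + bX is the linear index. The function names and the example coefficient values are made up for illustration only.

```python
import math

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map a linear index z = a + b*x to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the algebraically equivalent form to avoid overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(a, b, x):
    """Predicted probability for one observation.

    a is the intercept, b a list of slope coefficients and x the list of
    independent variables for that observation (hypothetical inputs).
    """
    z = a + sum(bj * xj for bj, xj in zip(b, x))
    return inverse_logit(z)

# Hypothetical coefficients: intercept -1.0 and a single slope of 0.5.
p = predict_probability(-1.0, [0.5], [2.0])  # linear index is 0, so p is 0.5
```

Applying logit to the resulting probability recovers the linear index, which is a quick way to check that the two functions are inverses of each other.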
Before estimating a LOGIT model we need to make sure that the dependent variable is indeed binary. We also need to make sure that the independent variables are not highly correlated with one another. In the linear regression model, high correlation between independent variables is referred to as multicollinearity. The LOGIT model is not linear, and therefore the implications of multicollinearity differ from those in the linear regression model. But because the model links the log of the odds to the independent variables linearly, the easiest way to assess the presence of multicollinearity is the same as in the linear regression model: calculating Variance Inflation Factors (VIFs). A common rule of thumb treats a VIF of 10 or above as a strong indicator of multicollinearity.
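As a rough sketch of the VIF calculation: the VIF of each independent variable equals the corresponding diagonal element of the inverse of the correlation matrix of the independent variables. The helper names and the toy data below are assumptions for illustration; a statistical package would normally do this for you.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(u * v for u, v in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(k)] for i in range(k)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect collinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def variance_inflation_factors(columns):
    """One VIF per independent variable (columns exclude the intercept)."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Toy data: x2 is almost exactly twice x1, so both VIFs are far above 10.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
vifs = variance_inflation_factors([x1, x2])
```

With only two independent variables the result reduces to 1 / (1 - r^2), where r is their correlation, which makes the toy example easy to verify by hand.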
With a binary dependent variable and no severe multicollinearity, the proposed model can be estimated using maximum likelihood.
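To give a feel for what maximum likelihood estimation does here, the sketch below fits a model with an intercept and a single independent variable using Newton-Raphson steps on the log likelihood. The function names, the toy data and the iteration cap are assumptions made for this example; it is not a substitute for a statistical package.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, b, xs, ys):
    """Bernoulli log likelihood of intercept a and slope b."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a + b * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logit(xs, ys, iterations=25, tol=1e-10):
    """Newton-Raphson for a one-variable logistic regression."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b

# Toy data in which the outcomes overlap, so the estimates are finite.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
a_hat, b_hat = fit_logit(xs, ys)
```

If the outcomes were perfectly separated by the independent variable, the likelihood would have no finite maximum and the iterations would not settle, which is one reason to inspect the data before fitting.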
Post estimation diagnostics
Once the model is estimated, a variety of diagnostic tests should be run to assess its performance. The diagnostic tests for a logistic regression model are different from those for the linear regression model (but are similar to those used for the PROBIT model).
The first, most basic, question to ask is whether the proposed model is better than the null model. The null model is an alternative model that includes no independent variables, only an intercept. In other words, it is a restricted version of the proposed model in which the coefficients of all the proposed independent variables are set to zero. The null model is estimated and then compared to the proposed model using a likelihood ratio test: twice the difference between the two log likelihoods follows a Chi-squared distribution with degrees of freedom equal to the number of restricted coefficients. Assuming the restrictions are rejected based on this test, we turn to assessing the model’s goodness-of-fit.
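The comparison with the null model can be sketched as a likelihood ratio test: twice the difference between the two log likelihoods is compared with a chi-squared distribution whose degrees of freedom equal the number of coefficients set to zero. The function names are assumptions for this example, and chi2_sf is a small hand-written helper for integer degrees of freedom rather than a library routine.

```python
import math

def chi2_sf(x, df):
    """Right-tail probability of a chi-squared variable, integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then step up two df at a time.
    if df % 2 == 0:
        k, tail = 2, math.exp(-half)
    else:
        k, tail = 1, math.erfc(math.sqrt(half))
    while k < df:
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail

def null_log_likelihood(ys):
    """Log likelihood of the intercept-only model (needs both outcomes present)."""
    n = len(ys)
    n1 = sum(ys)
    n0 = n - n1
    return n1 * math.log(n1 / n) + n0 * math.log(n0 / n)

def likelihood_ratio_test(ll_model, ll_null, n_restrictions):
    """Return the test statistic and its p-value."""
    stat = 2.0 * (ll_model - ll_null)
    return stat, chi2_sf(stat, n_restrictions)

# Hypothetical log likelihoods for a model with two independent variables.
stat, p_value = likelihood_ratio_test(-40.0, -50.0, 2)
```

A small p-value rejects the restrictions, meaning the proposed model fits significantly better than the intercept-only model.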
Goodness-of-fit for the logistic regression model can be measured in three different ways. The first and simplest measure of goodness-of-fit is Pseudo R-squared. This statistic is interpreted much like the familiar R-squared for the linear regression model. That is, it takes on values between 0 and 1, and a value closer to 1 indicates better goodness-of-fit. In the case of the logistic regression model, there are various approaches to calculating a pseudo R-squared, but the most commonly used variation is McFadden’s Pseudo R-squared, which is calculated as 1 minus the ratio of the log likelihood of the proposed model to the log likelihood of a model with only an intercept.
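McFadden’s Pseudo R-squared is simple enough to compute by hand, as the sketch below shows. The function names are assumptions, and the predicted probabilities in the example are made-up numbers rather than the output of an actual fit.

```python
import math

def model_log_likelihood(ys, probs):
    """Log likelihood given observed outcomes and predicted probabilities."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
               for y, p in zip(ys, probs))

def null_log_likelihood(ys):
    """Log likelihood of the intercept-only model, which predicts the
    sample share of ones for every observation."""
    share = sum(ys) / len(ys)
    return model_log_likelihood(ys, [share] * len(ys))

def mcfadden_r2(ys, probs):
    """1 minus the ratio of the model log likelihood to the null one."""
    return 1.0 - model_log_likelihood(ys, probs) / null_log_likelihood(ys)

# Hypothetical outcomes and predicted probabilities.
ys = [0, 0, 1, 1]
probs = [0.1, 0.2, 0.8, 0.9]
r2 = mcfadden_r2(ys, probs)
```

When the predicted probabilities are no better than the sample share of ones, the two log likelihoods coincide and the statistic is 0.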
A better approach for measuring goodness-of-fit for the LOGIT model is the Hosmer-Lemeshow test. The test is based on dividing the sample into g groups according to their predicted probabilities and comparing the observed number of events in each group with the number the model predicts. The number of groups, g, is selected so that g is larger than the number of covariates plus 1. When the model is correctly specified, the Hosmer-Lemeshow test statistic follows a chi-squared distribution with g-2 degrees of freedom. That is, given the fitted model, the p-value can be calculated as the right-hand tail probability of the corresponding chi-squared distribution using the calculated test statistic. If the p-value is small, the fitted model fits the data poorly.
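A minimal sketch of the Hosmer-Lemeshow calculation, assuming equal-sized groups formed by sorting on the predicted probability. The function names and the default of 10 groups (a common convention) are assumptions, and chi2_sf is a small hand-written helper for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Right-tail probability of a chi-squared variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        k, tail = 2, math.exp(-half)
    else:
        k, tail = 1, math.erfc(math.sqrt(half))
    while k < df:
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail

def hosmer_lemeshow(ys, probs, groups=10):
    """Return the test statistic and p-value (g - 2 degrees of freedom)."""
    if groups < 3:
        raise ValueError("need at least 3 groups for g - 2 >= 1")
    pairs = sorted(zip(probs, ys))  # order observations by predicted probability
    n = len(pairs)
    stat = 0.0
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)  # events the model predicts
        observed = sum(y for _, y in chunk)  # events actually seen
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return stat, chi2_sf(stat, groups - 2)

# Made-up example: three groups of four observations each.
probs = [0.25] * 4 + [0.5] * 4 + [0.75] * 4
ys = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
stat, p_value = hosmer_lemeshow(ys, probs, groups=3)
```

In this made-up example the observed and predicted event counts match in every group, so the statistic is 0 and there is no evidence of poor fit.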
Another approach to assessing goodness-of-fit is based on the Receiver Operating Characteristic (ROC) curve, which is a graphical representation of the performance of any binary classifying method. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 minus specificity) across all classification thresholds. The closer the curve follows the upper-left border of the ROC space, the more accurate the model. The further the curve is above the diagonal, the better the model is in comparison to random predictions. Equivalently, the larger the area under the curve (AUC), the more accurate the model: an AUC of 0.5 corresponds to random predictions and an AUC of 1 to perfect classification.
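The ROC curve and the AUC can be computed directly from the outcomes and the predicted probabilities. The sketch below sweeps the classification threshold from high to low and integrates the curve with the trapezoid rule; the function names and example data are assumptions for illustration.

```python
def roc_points(ys, probs):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(ys)
    negatives = len(ys) - positives
    pairs = sorted(zip(probs, ys), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Observations tied at the same probability enter together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up outcomes and predicted probabilities.
ys = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
auc = area_under_curve(roc_points(ys, probs))
```

For these four observations the AUC is 0.75: of the four possible (event, non-event) pairs, three are ranked correctly by the predicted probabilities.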
Beyond goodness-of-fit of the model, the significance of each estimated coefficient should be tested and the modeler should test for the presence of influential observations. Much like for the linear regression model, there are several techniques to identify influential observations in the logistic regression model. These techniques are based on Pearson residuals, standardized Pearson residuals and another type of residual called the Deviance residual. Pearson residuals measure the difference between the observed and fitted values relative to the estimated standard deviation of the observed value. Deviance residuals measure the disagreement between the maxima of the observed and fitted log likelihood functions (recall that a logistic regression model is estimated through maximum likelihood estimation, which is equivalent to minimizing the sum of the squared deviance residuals). Another measure for influential observations is called Leverage. These measures are typically plotted against the predicted probabilities or by observation. Influential observations, if identified, should be further investigated to understand what makes them stand out.
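For a binary outcome, the Pearson and deviance residuals have simple closed forms, sketched below. The function names and the example values are assumptions; standardized residuals and leverage additionally require the hat matrix and are left out of this sketch.

```python
import math

def pearson_residual(y, p):
    """(observed - fitted) scaled by the Bernoulli standard deviation."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Signed square root of the observation's contribution to the deviance."""
    log_lik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    sign = 1.0 if y >= p else -1.0
    return sign * math.sqrt(-2.0 * log_lik)

# Made-up outcomes and predicted probabilities.
ys = [1, 0, 1, 0]
probs = [0.8, 0.8, 0.3, 0.1]
pearson = [pearson_residual(y, p) for y, p in zip(ys, probs)]
deviance = [deviance_residual(y, p) for y, p in zip(ys, probs)]

# The squared deviance residuals add up to minus twice the log likelihood.
total_deviance = sum(d * d for d in deviance)
```

Observations with large residuals in absolute value, such as the second one here (an outcome of 0 despite a predicted probability of 0.8), are the ones worth a closer look.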
How to interpret estimation results
The interpretation of the estimated coefficients of the logistic regression model differs from the interpretation of coefficients in the linear regression model. Each slope coefficient represents the change in the log of the odds associated with a one-unit change in the corresponding independent variable, holding the other variables constant. More intuitively, the exponent of a slope coefficient is the factor by which the odds are multiplied for a one-unit change in that variable; this factor is the odds ratio. Recall that the odds are the probability of an event divided by the probability of the event not occurring. Also recall that when the odds equal 1, the event is as likely to occur as not, that is, there is a 50-50 chance the event will occur. If the exponent of a coefficient equals 2, then a one-unit increase in the associated independent variable doubles the odds of the event. Note that doubling the odds is not the same as doubling the probability: starting from odds of 1 (a probability of 0.5), doubled odds of 2 correspond to a probability of about 0.67.
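A short sketch of this interpretation: the exponent of a coefficient multiplies the odds, and the effect on the probability depends on where the observation starts. The function names and the coefficient value (the natural log of 2) are chosen purely for illustration.

```python
import math

def odds(p):
    """Probability of the event divided by the probability of no event."""
    return p / (1.0 - p)

def probability_from_odds(o):
    return o / (1.0 + o)

def shifted_probability(p, b, delta=1.0):
    """Probability after the independent variable changes by delta,
    given starting probability p and slope coefficient b."""
    return probability_from_odds(odds(p) * math.exp(b * delta))

b = math.log(2.0)  # hypothetical coefficient whose exponent equals 2

from_half = shifted_probability(0.5, b)   # odds go from 1 to 2
from_tenth = shifted_probability(0.1, b)  # odds go from 1/9 to 2/9
```

Starting from a probability of 0.5 the new probability is 2/3, while starting from 0.1 it is 2/11 (about 0.18): the odds double in both cases, but the probability does not.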