# The Linear Regression Model

The workhorse of regression analysis, and one of the most widely used techniques in the data analysis world, is the Linear Regression Model. The most fundamental technique used to estimate the Linear Regression Model is called Ordinary Least Squares, or OLS. A deep understanding of this technique is crucial for reliable modeling, but unfortunately such understanding is less common than it should be.

First, some terminology. A linear regression model describes the relationship between a dependent variable and one or more independent variables. When a single independent variable is used to explain the variance in a single dependent variable, the model is known as simple linear regression; when multiple independent variables are used to explain the variance in a single dependent variable, the model is called multiple linear regression; and when the variance of multiple dependent variables is explained using multiple independent variables, the model is known as multivariate linear regression. The purpose of this post is to deepen the understanding of simple and multiple linear regression models.

Simple and multiple linear regression models link a dependent variable to one or more independent variables through a parameter for each independent variable, plus an error term that accounts for all the potential influences for which no information is available. Given this structure, the goal is to find the parameters that best describe the relationship between the dependent variable and the independent variables given the data at hand. There are infinitely many combinations of parameters that could describe this relationship. The importance of the OLS technique stems from the fact that, under a set of assumptions known as the Normal Assumptions, it yields the Best Linear Unbiased Estimates ("BLUE") of the parameters. That is, under these assumptions the OLS estimates are not only unbiased, they are also the most accurate linear estimates because they have the lowest variance.
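As a concrete illustration, the simple (one-regressor) case has a well-known closed form: the OLS slope is the covariance of x and y divided by the variance of x, and the intercept follows from the sample means. A minimal sketch in plain Python, with made-up data:

```python
# Minimal OLS sketch for simple linear regression, using the closed-form
# estimates b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x).
# The data below are made up for illustration.

def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly y = 2x
b0, b1 = ols_fit(x, y)          # slope comes out at 1.95, intercept at 0.15
```

Any other line through these points has a larger sum of squared residuals; the BLUE property says that, under the Normal Assumptions, no other linear unbiased estimator beats these estimates on variance either.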

Sounds good. Not only can we describe the relationship, we can find the best description. But, and this is a big "but", the Normal Assumptions have to hold. So it is essential to understand these assumptions well. Best-practice data analysis will always include some exploratory analysis to validate the assumptions before performing linear regression. Most statistical packages provide diagnostic tools – statistical tests and plots – aimed at validating these assumptions.

The Ordinary Least Squares technique relies on the following assumptions:

1. The relationship between the dependent and independent variables is linear
2. The errors are statistically independent of one another (no serial correlation)
3. The variance of the errors is constant (homoscedasticity)
4. The errors are normally distributed

## The relationship between the dependent and independent variables is linear

This assumption states that a linear relationship exists between the dependent and the independent variable(s). It implies that the impact of the independent variables is additive and that the variables do not interact with one another.

Validation tools: A number of diagnostic plots can help validate the assumption of linearity. First, in the scatter plot of the observed vs. the predicted values, the data points should be symmetrically distributed along the diagonal line. Second, in the plot of the residuals vs. the predicted values, the points should be randomly scattered around a horizontal line. Third, in the plots of the residuals vs. the individual independent variables there should be no pattern. Finally, added-variable plots should be generated for each independent variable; these should all exhibit random scatter around the regression line.
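The residuals-vs-fitted inspection can also be mimicked numerically: if the true relationship is curved but a straight line is fitted, the residuals will correlate strongly with a curvature term such as x². A small sketch with made-up, deliberately quadratic data:

```python
# If y is truly quadratic in x but we fit a straight line, the leftover
# pattern shows up as a near-perfect correlation between the residuals
# and x^2. The data are made up for illustration.

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sa * sb)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]                  # the true relationship is quadratic
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx                          # straight-line OLS fit
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
r = pearson(resid, [xi ** 2 for xi in x])  # near 1: a clear leftover pattern
```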

What can be done if the assumption is violated? Add more variables: investigate all factors that may impact the dependent variable, based both on theory and on empirical circumstances, making sure that all the variables that explain variation in the dependent variable are included. Variables that capture changes in empirical circumstances may have to be created. For example, if the dependent variable measures sales of some good, flags that capture mergers and acquisitions of competing sellers will help capture sudden changes in sales. Data transformation: sometimes the relationship between the dependent variable and the independent variables is non-linear, but it is possible to transform the variables such that the relationship between the transformed variables is linear.
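To make the transformation idea concrete, here is a sketch with made-up data generated from y = 3·exp(0.5x): the raw relationship is non-linear, but regressing log(y) on x yields a perfectly linear fit whose slope is the growth rate:

```python
import math

# The raw relationship y = 3 * exp(0.5 * x) is non-linear, but
# log(y) = log(3) + 0.5 * x is exactly linear in x, so OLS on the
# transformed data recovers the parameters. Data are made up.

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [3.0 * math.exp(0.5 * xi) for xi in x]

log_y = [math.log(yi) for yi in y]
n = len(x)
mx, ml = sum(x) / n, sum(log_y) / n
slope = sum((a - mx) * (b - ml) for a, b in zip(x, log_y)) \
    / sum((a - mx) ** 2 for a in x)
intercept = ml - slope * mx   # slope recovers 0.5, intercept recovers log(3)
```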

## The errors are statistically independent of one another (no serial correlation)

When the errors are serially correlated, the error of one observation depends on the error of another. As a result, changes in the dependent variable cannot be explained by the current independent variables alone. Serial correlation is common in time series data because different independent variables take different amounts of time to affect the dependent variable, so their impact carries over into subsequent periods. While less common, serial correlation can occur even in cross-sectional data.

When serial correlation is present, the consistency (and unbiasedness) of the OLS estimators is not affected, but their efficiency is. With positive serial correlation, the OLS estimates of the standard errors will be smaller than the true standard errors, which leads to the conclusion that the parameter estimates are more precise than they actually are. As a result, there will be a tendency to reject the null hypothesis when it should not be rejected.

Validation tools: The most basic statistical test for serial correlation is the Durbin-Watson test. In the context of time series data, the Durbin-Watson test can only detect first-order serial correlation – that is, correlation between the current error and the error in the previous time period. More general tests, such as the Breusch-Godfrey test, can detect higher-order serial correlation. For time series data, autocorrelations and partial autocorrelations can be plotted, along with the Q statistic, to detect serial correlation.
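The Durbin-Watson statistic itself is simple to compute from the residual series: DW = Σ(eₜ − eₜ₋₁)² / Σeₜ². Values near 2 suggest no first-order serial correlation, values near 0 suggest positive correlation, and values near 4 suggest negative correlation. A sketch on two made-up residual series:

```python
# Durbin-Watson statistic on a residual series. The two residual
# series below are made up to show the two extremes.

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # positive serial correlation
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # negative serial correlation
dw_pos = durbin_watson(trending)      # well below 2 (near 0)
dw_neg = durbin_watson(alternating)   # well above 2 (near 4)
```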

For cross-sectional data, the errors should not be correlated with one another regardless of how they are sorted. If sorting the data in any way yields better results, then a linear regression might not be the best modeling approach.

What can be done if the assumption is violated? The first step is to confirm that the relationship is indeed linear, which can be done by sorting and then plotting the residuals. If the residuals grow systematically, the relationship could be exponential or quadratic. If the autocorrelation appears periodic, incorporating seasonality may resolve the problem. If none of these is the issue, the model can be estimated with robust standard errors, also known as heteroskedasticity- and autocorrelation-consistent ("HAC") standard errors.

## The variance of the errors is constant (homoscedasticity)

The errors in the linear regression model are assumed to share a single, constant variance. When this assumption is violated – a condition called heteroscedasticity – the variance of the errors varies systematically across observations. When the errors are heteroscedastic, the OLS estimates are still consistent but are not efficient. That is, the parameter estimates are unbiased but their standard errors are biased, which can cause inference problems.

Validation tools: Heteroscedasticity can often be detected visually by inspecting plots of the residuals vs. the predicted values or the residuals over time. Any pattern in the data would indicate that the residuals are a function of the predicted values or time and would suggest a violation of the homoscedasticity assumption. Figure 1 illustrates homoscedastic residuals while Figure 2 illustrates heteroscedastic residuals - the residuals become more dispersed as the fitted values increase.

Figure 1: Homoscedasticity

Figure 2: Heteroscedasticity

Beyond visual diagnostic tools, formal statistical tests for heteroscedasticity are available for different situations. If heteroscedasticity is suspected but there appears to be no specific explanation for its presence, White's test can be used. When the dependent variable shows substantial variation in magnitude, which may lead to differences in measurement error, the Goldfeld-Quandt test is more appropriate. If the variation in the errors is associated with some variable(s), then the Breusch-Pagan test should be used.
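To illustrate the idea behind the Breusch-Pagan test (a simplified one-regressor sketch, not a full implementation): regress the squared residuals on the independent variable and form the LM statistic n·R², which is approximately chi-squared with one degree of freedom under homoscedasticity. With made-up residuals whose spread grows with x, the statistic far exceeds the 5% critical value of about 3.84:

```python
# Simplified one-regressor Breusch-Pagan sketch: LM = n * R^2 from the
# auxiliary regression of the squared residuals on x. The residuals
# below are made up, with a spread that grows with x.

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return (sxy ** 2) / (sxx * syy)

x = [float(i) for i in range(1, 21)]
resid = [(-1) ** i * 0.1 * xi for i, xi in enumerate(x)]  # spread grows with x
sq = [e ** 2 for e in resid]
lm = len(x) * r_squared(x, sq)   # far above the 5% critical value of ~3.84
```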

What can be done if the assumption is violated? As in the case of serial correlation, the model can be estimated with robust standard errors. As an alternative to the OLS model, a more general model – appropriately named Generalized Least Squares ("GLS") – can be estimated. In this model heteroscedasticity is explicitly accounted for, resulting in more efficient (smaller-variance) estimates than in the OLS model.
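A minimal sketch of weighted least squares, the special case of GLS used for heteroscedasticity, with made-up data: each observation is weighted by the inverse of its error variance, so observations known to be noisy carry almost no weight in the fit:

```python
# Weighted least squares for one regressor: weight each observation by
# the inverse of its error variance. Data and weights are made up.

def wls_fit(x, y, w):
    """Return (intercept, slope) minimizing the weighted sum of squares."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw     # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw     # weighted mean of y
    num = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    slope = num / den
    return yw - slope * xw, slope

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 100.0]    # last observation is wildly noisy...
w = [1.0, 1.0, 1.0, 1e-6]     # ...so it gets almost no weight
b0, b1 = wls_fit(x, y, w)     # fit is close to y = 2x, ignoring the outlier
```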

## The errors are normally distributed

The calculation of standard errors, confidence intervals, and significance tests for the estimates are all based on the assumption of normally distributed errors. If the errors are not normally distributed, all these calculations are inaccurate. Normality is typically violated because the distribution of either the dependent variable or the independent variables is significantly non-normal. Normality can also be violated when the linearity assumption is violated.

Validation tools: Tests for normality compare the distribution of the regression residuals to the normal distribution. Statistical tests such as the Shapiro-Wilk or Shapiro-Francia tests are commonly used, as are simple plotting techniques such as a histogram of the residuals or the more informative q-q plot. The following plots illustrate a situation where the residuals are normal (Figures 3a and 3b) and another where they are not (Figures 4a and 4b).
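A common numerical complement to these plots is a Jarque-Bera-style statistic, which combines the skewness and kurtosis of the residuals; values near 0 are consistent with normality, large values are not. A sketch in plain Python on two made-up residual series:

```python
# Jarque-Bera-style normality check: JB = n/6 * (S^2 + (K - 3)^2 / 4),
# where S is the sample skewness and K the sample kurtosis (K = 3 for a
# normal distribution). The two residual series below are made up.

def jarque_bera(resid):
    n = len(resid)
    m = sum(resid) / n
    m2 = sum((e - m) ** 2 for e in resid) / n
    m3 = sum((e - m) ** 3 for e in resid) / n
    m4 = sum((e - m) ** 4 for e in resid) / n
    s = m3 / m2 ** 1.5    # skewness
    k = m4 / m2 ** 2      # kurtosis
    return n / 6 * (s ** 2 + (k - 3) ** 2 / 4)

symmetric = [-1.5, -0.5, 0.0, 0.5, 1.5] * 6   # roughly symmetric, n = 30
skewed = [0.0] * 29 + [30.0]                  # one extreme outlier, n = 30
jb_symmetric = jarque_bera(symmetric)         # small: consistent with normality
jb_skewed = jarque_bera(skewed)               # huge: clearly non-normal
```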

Figure 3a: Normal residuals (Histogram) Figure 3b: Normal residuals (q-q plot)

Figure 4a: Non-normal residuals (Histogram) Figure 4b: Non-normal residuals (q-q plot)

What can be done if the assumption is violated? Neither the dependent variable nor the independent variables have to be normally distributed – only the errors have to be normally distributed. However, if the distribution of any of these variables substantially deviates from normality (for example, when the empirical values are extremely asymmetric or when the dependent variable takes on only a few values) the distribution of the residuals is directly affected (because of the additive assumption). In the first case, transforming the skewed variable may be helpful. In the case of a limited dependent variable, the linear regression model is not appropriate and other approaches must be adopted.

Non-normality may also arise when the data comprise two or more distinct subsets with distinct statistical characteristics. In this case, the linear regression model may work for each subset separately, or else a different modeling approach is needed.

Finally, non-normality of the residuals may be an artifact of a small number of disproportionately large errors. In this case, the data associated with these errors have to be investigated thoroughly. Such values can simply be the result of a data entry error, or they can represent an actual event. Investigating the nature of the event can lead to the introduction of additional variables that directly account for it. This is particularly important when the underlying cause may occur again.

Even when all the Normal Assumptions hold, estimating the Linear Regression Model can involve dealing with other problems, such as outliers or multicollinearity, rendering its results inaccurate. More about such problems in future posts.