When we learn regression, we are taught to examine the residuals after fitting the model. We do this in order to validate the assumptions required for the least-squares method to produce an optimal solution. I am writing this blog so that analysts will consider an additional residual chart, or charts, as part of their normal validation of a regression: a plot of the residuals vs. each predictor variable.
Quick review of regression residual analysis
For the least-squares method to solve a regression, a few specific assumptions must be met. For most Lean Six Sigma practitioners, this is the first analysis method they learn where the assumptions cannot be tested prior to the analysis. Instead, the assumptions are verified against the answer of the regression, based on the characteristics of the residual error. The residual error is the difference between each actual data value and the value predicted from that observation's predictors, or x-values.
The assumptions can be simplified as such: the residuals are NID(0, σ²). This can be translated to mean that the residuals must be independent and normally distributed, with a mean of zero and a constant variance across the entire range of y.
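As a minimal sketch of this definition (made-up data and NumPy's least-squares solver, not any particular statistics package), the residuals are simply the actual values minus the fitted values:

```python
import numpy as np

# Hypothetical data: y depends linearly on x, plus normal noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)

# Ordinary least squares with a design matrix of [intercept, x].
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual error: actual data value minus predicted value.
fitted = X @ beta
residuals = y - fitted

# With an intercept in the model the residuals always average to
# (numerically) zero; the rest of NID(0, sigma^2) -- independence,
# normality, constant variance -- must be checked with residual plots.
```

Note that the zero-mean part of the assumption is satisfied automatically by the fit; it is the other three properties that residual analysis has to verify.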
Testing for residual independence
The independence of the residuals implies a few different concepts. One is that no predicted value has a bias in its residual; a violation occurs when certain ranges of the regression output are biased toward positive residuals or negative residuals. This generally shows up as a curve or wave form in a plot of the residuals (y) vs. the predicted values (x).
A second independence concept is time correlation of the residuals. In this case the residuals may be biased toward more positive or more negative values during a period of the data collection. This is commonly found when a process's start-up performance is not like its standard running performance, producing a series of high or low residual values.

Autocorrelated data, primarily in y but sometimes in the x-values as well, can also impact residual independence. Autocorrelation is present when sequential data points are not independent because each data point is always close to the value of the prior data point, as we find in the stock market, rapid temperature measurements, and more. Think about it: if you measure the room temperature every five minutes and the current temperature is 74 degrees F, what will the temperature be in five minutes? It will be within a degree or two of 74. This is different than if we measured the temperature once a day or once a shift; in that case, the temperature would be a random value within the range of what the room is experiencing. These types of non-independence are observed with a plot of the residuals in the time order of the actual data.
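Alongside the time-order plot, a quick numeric check for this kind of dependence is the Durbin-Watson statistic. A sketch with simulated residuals, where the dependent series is a hypothetical AR(1) process like the five-minute temperature example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Independent residuals vs. a hypothetical autocorrelated (AR(1)) series.
independent = rng.normal(size=n)
autocorrelated = np.empty(n)
autocorrelated[0] = rng.normal()
for t in range(1, n):
    autocorrelated[t] = 0.9 * autocorrelated[t - 1] + rng.normal()

def durbin_watson(resid):
    # Near 2 for independent residuals; toward 0 for strong positive
    # autocorrelation; toward 4 for strong negative autocorrelation.
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
```

Here `durbin_watson(independent)` lands near 2, while `durbin_watson(autocorrelated)` drops well below it, flagging the dependence.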
Testing for residual normality
Normality is generally a simple test using a probability plot (sometimes a histogram). When the residuals are normally distributed, the solving algorithm will properly fit to the middle of the residuals. If the residuals are bimodal or highly skewed, the best predicted model is not an optimal fit. This is generally corrected by adding more terms to the model, such as squared terms or interactions. Be aware that you can achieve normal-looking residuals and still fail one of the other residual evaluations.
Another cause of this can be a predicted value that is truly non-normal, such as the time it takes to complete a task. This type of data will periodically have high residuals when a random long-time event occurs. In this case you would generally transform the y value and perform the regression again.
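A probability plot is graphical, but the same check can be automated. A sketch using SciPy's Shapiro-Wilk normality test on simulated residuals, with a skewed series standing in for the long task-time case:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Residuals that satisfy the normality assumption vs. residuals that are
# centered but highly skewed (like occasional long task-completion times).
normal_resid = rng.normal(size=300)
skewed_resid = rng.exponential(size=300) - 1.0

# Shapiro-Wilk test: a small p-value says "reject normality".
_, p_normal = stats.shapiro(normal_resid)
_, p_skewed = stats.shapiro(skewed_resid)
```

The normal series passes (large p-value) while the skewed series fails decisively, which is the signal to add model terms or transform y.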
Testing for constant variance
This characteristic is typically evaluated with a plot of the residuals vs. the predicted values. The constant variance assumption is considered met if there is no pattern in the spread of the residuals across the range of the predicted values. One common violation is shaped like a megaphone, where the width of the residuals changes linearly with the predicted value. In this case the variance is proportional to the mean.
The other general shape is like the bell of a trumpet, where the width of the residuals increases faster than linearly, flaring out to a very large width as the predicted values grow. In this case the variance is proportional to the mean squared.
In both of these cases you would generally transform the y value and then perform the regression again.
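A sketch of why the transform works, using simulated data with multiplicative noise (the trumpet-bell case, variance proportional to the mean squared). A log transform of y stabilizes the residual spread here; a square-root transform plays the same role in the megaphone case:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1, 100, 1000)

# Multiplicative noise: residual spread grows in proportion to the mean
# (variance proportional to the mean squared, the trumpet-bell shape).
y = x * np.exp(rng.normal(0, 0.2, size=x.size))

resid_raw = y - x                  # residuals against the known true mean
resid_log = np.log(y) - np.log(x)  # residuals after a log transform of y

def spread_ratio(resid):
    # Residual std. dev. over the upper half of the x range
    # divided by the std. dev. over the lower half.
    half = resid.size // 2
    return resid[half:].std() / resid[:half].std()

# spread_ratio(resid_raw) is well above 1 (non-constant variance);
# spread_ratio(resid_log) is close to 1 (variance stabilized).
```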
Unappreciated assumption in regression
If you dive into the least-squares method for solving for the regression coefficients, you will discover that all of the uncertainty is assumed to be in the output value, the value predicted by the model. There is no allowance for uncertainty in the values of the predictors, or x-values, used in the regression. This is an issue we understand and manage quite well when performing a design of experiments but do not seem to consider when running regressions. What this means is that the analysis assumes every x-value used in a regression is exactly known. When this is not true, it is very difficult to understand how that uncertainty will affect the regression results. This is the issue that led me to write about the residuals vs. predictor plots.
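A quick simulation of the consequence (made-up data, not from any real study): when the recorded x-values carry measurement error, the least-squares slope is biased toward zero, an effect known as attenuation:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# True relationship: y = 1 + 2x, with modest noise in y.
x_true = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x_true + rng.normal(0, 1, n)

# The analyst, however, only sees x measured with error.
x_observed = x_true + rng.normal(0, 2, n)

def ols_slope(x, y):
    # Simple-regression slope: cov(x, y) / var(x).
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_exact_x = ols_slope(x_true, y)      # recovers roughly 2.0
slope_noisy_x = ols_slope(x_observed, y)  # attenuated toward zero
```

The fit itself gives no warning that this has happened, which is exactly why the quality of the x-values deserves its own scrutiny.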
Plotting residuals vs. the predictor variables
Examining these residual plots, residuals vs. predictors, was taught to me during my Master's in Statistics program. I was very surprised, when I began working as a statistician and in the Lean Six Sigma genre, that no one really performed this test. These plots have allowed me to recognize problems with my data that the normal residual charts did not identify.
Identifying extreme values of the predictor variables
With the majority of the residual focus on the residual values and their corresponding fitted values, which are all related to the regression output, there is very little emphasis on the quality of the x-values. As stated above, the analysis assumes that all of the x-values are known exactly. If you use a really good statistics software package to perform your regressions, you have a chance to identify problems with a predictor, because the software will flag residuals that have both a high residual value and a high influence. It is the high-influence, or high-leverage, condition that may indicate problems with a specific x-value. The specific issue I have found with these plots is the existence of a single x-value that is outside the range of the typical x-values. It has usually been caused by a missing decimal point or a data-recording error such as a transposition of digits. In each case, the offending x-value was shown to be significant in a multiple regression due to the single errant point, and then non-significant when the errant point was corrected.
It was a plot of the residuals vs. the predictor values where I found the extreme x-value causing the issue. When these extreme values occur, they also tend to have a near-zero residual, so they are hidden in a typical residual analysis. The residuals vs. predictor plot will show most of the values at one side of the chart, with one or two values separated along the x-axis. If you find this condition, you must evaluate that observation and determine whether the x-value is a real value or an errant value. If it is errant, correct it or remove the entire observation. If it is real, then you must run the process to obtain more values near the identified x-value, so that you get an accurate estimate of the mean and variance of the process in that range.
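A sketch of the situation with made-up data: one x-value with a shifted decimal point becomes a high-leverage point, drags the fitted slope away from the truth, and would show up as an isolated point on the x-axis of a residuals-vs.-predictor plot. The leverage values below are the diagonal of the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 31

# True relationship y = 3 + 2x with x roughly in 1..10, but the last
# observation's x (5.2) is recorded with a shifted decimal as 52.
x_true = rng.uniform(1, 10, n)
x_true[-1] = 5.2
y = 3.0 + 2.0 * x_true + rng.normal(0, 1, n)
x_recorded = x_true.copy()
x_recorded[-1] = 52.0              # the data-entry error

X = np.column_stack([np.ones(n), x_recorded])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Leverage: diagonal of the hat matrix X (X'X)^-1 X'.
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Refit with the corrected x-value.
X_fixed = np.column_stack([np.ones(n), x_true])
beta_fixed, *_ = np.linalg.lstsq(X_fixed, y, rcond=None)
```

The errant point carries almost all of the leverage and flattens the estimated slope; correcting the single value restores a slope near the true 2.0.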
Identifying x-values that affect the variance
If one of the x-value predictors in a multiple regression influences the variance, even if it does not affect the mean, you may not find it in the traditional residual plots. Unless the variance-influencing x-value also has a strong effect on the mean, its influence will be hidden by the x-values that strongly affect the mean. Finding these variance-influencing x-values may be one of the most important discoveries of your entire analysis, because factors that influence variability are difficult to detect.
You identify these x-values when you see one of the patterns discussed above for the constant-variance case, except that the pattern appears when the residuals are plotted vs. the x-value. If the x-value in question is not a strong predictor of the mean, the variance effect may never be detected by the standard residual analysis methods. These x-values will not be identified by any means other than plotting the residuals vs. the x-value predictors.
If you find one of these conditions, then you would want to keep the process in the region of lower variance, if possible.
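A sketch with simulated data of how such a factor can be found: x1 drives the mean, a second predictor x2 drives only the spread, and correlating the absolute residuals with each predictor exposes x2 even though the mean model barely notices it:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

x1 = rng.uniform(0, 10, n)          # affects the mean
x2 = rng.uniform(0, 10, n)          # affects only the variance
noise_sd = 0.2 + 0.3 * x2           # residual spread grows with x2
y = 5.0 + 3.0 * x1 + rng.normal(0, 1, n) * noise_sd

# Fit the usual mean model and take residuals.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Correlate the residual magnitude with each predictor: a strong
# correlation flags a variance-influencing x-value.
r1 = np.corrcoef(x1, np.abs(resid))[0, 1]   # near zero
r2 = np.corrcoef(x2, np.abs(resid))[0, 1]   # clearly positive
```

This numeric check is just an automated stand-in for eyeballing the megaphone shape in the residuals-vs.-x2 plot.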
Residual analysis is key to accepting a regression output as optimal. The standard residual analysis is very important but may not be fully sufficient. Adding plots of the residuals vs. the x-value predictors provides an additional dimension to your analysis that may protect you from making a bad conclusion (an extreme x-value) or from missing a key process effect (variance driven by an x-value). These benefits are big, and the additional effort is nearly zero; you just ask the software to include these charts.
Now, if it was not clear, you should plot the residuals vs. the x-value predictors for every term considered for the multiple regression. The factors that are not significant for the mean may still be causing errant residual results that you should evaluate before considering the analysis complete.
An example
I modified a data set from Integrated Enterprise Excellence V3 (Example 28.04) that we use in our training classes. I took column C, which is the most significant predictor in the original data set, and shifted the decimal point for observation 11 to create factor C2, as shown in bold.
Here are the regression results from the original data:
When I replaced C with C2 in the regression, here is how the results changed:
Note that C2 is now non-significant, and only A remains significant.
This is the standard residual plot:
This all looks fine, but now look at the residual plots by each predictor.
You can clearly see the error in the C2 x-value: one data point is separated from the rest of the values. If the residuals vs. x-value predictors had not been plotted, you would have assumed that there were very few significant predictors in this regression, with possibly only A being significant. One error in an x-value for C hid the fact that both C and A were very significant.
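The effect is easy to reproduce with synthetic data (this is a made-up analog, not the actual Example 28.04 data set): y depends on both A and C, one C value gets a shifted decimal point to create C2, and the damaged predictor's significance collapses while A survives:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 30

# Synthetic analog of the example: y depends on both A and C.
A = rng.uniform(0, 10, n)
C = rng.uniform(0, 10, n)
C[10] = 5.2                          # observation 11's true value
y = 2.0 + 1.0 * A + 1.0 * C + rng.normal(0, 2, n)

C2 = C.copy()
C2[10] = 52.0                        # decimal point shifted in recording

def slope_p_values(predictors, y):
    # Two-sided t-test p-values for each predictor's OLS coefficient.
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    t = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t), dof)[1:]   # drop the intercept

p_A_good, p_C_good = slope_p_values([A, C], y)
p_A_bad, p_C_bad = slope_p_values([A, C2], y)
```

With the correct C, both predictors are strongly significant; the single errant value makes C2 look far weaker while A stays significant, mirroring what happened in the training data set.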
Think about it.