Nonlinear Regression for an LSS Black Belt

It has been a fun week to be a Lean Six Sigma Master Black Belt coach.  I received an email from a past Black Belt student who was facing a regression that did not seem to work using the methods taught in class.  He asked how to perform a nonlinear regression.

My first thought was “Great, I have not had a chance to perform a nonlinear regression since graduate school,” but the euphoria soon wore off as I realized that while nonlinear regression was the request, it was probably not what the student really needed.

What is linear regression?

Let’s start with what a linear regression is before we jump into nonlinear regression.  The word “linear” does not mean a straight line, or even a line.  It indicates that the model is a linear equation: a sum of terms, separated by + and - signs, where each term is a coefficient multiplied by a function of the predictor variables.  Those functions can take many forms: x, x^2, ln(x), x*z, 1/x, and many more.  The key point is that the model is linear in the coefficients; no coefficient ever appears as an exponent.  For example, y = b0 + b1*ln(x) is a linear model, while y = b0*x^b1 is not, because the coefficient b1 sits in the exponent (the ^ symbol indicates the exponent).

Now that was the non-mathematics explanation.  Using mathematical notation, a linear equation can be written as the product of two arrays, one holding the predictors and the other holding the coefficients.  In the equation below, X is the matrix of all the predictor values, Y is the vector of the actual values measured from the process output, B is the vector of coefficients for the X values, and e is the vector of residuals, one per observation.  It is the e vector that we analyze for the adequacy of the regression.

Y = XB + e

Linear regression in matrix form

The actual solution of a linear regression comes from solving the following equation for the coefficient vector B,

B = (X'X)^-1 X'Y, where X' is the transpose of X

Solving for the coefficients

where the optimal solution is the one that minimizes the sum of the squared residuals.

  • That is why you also hear it called Least Squares Regression 
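
If you want to see this solution in action, here is a minimal sketch in Python with NumPy.  The data are simulated purely for illustration, not taken from any real process.

    import numpy as np

    rng = np.random.default_rng(42)

    # Simulated process data: 50 observations of one predictor.
    x = rng.uniform(0, 10, size=50)
    y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=50)

    # X matrix: a column of ones for the intercept, plus the predictor.
    X = np.column_stack([np.ones_like(x), x])

    # Normal equation B = (X'X)^-1 X'Y, solved without an explicit inverse.
    B = np.linalg.solve(X.T @ X, X.T @ y)
    print("coefficients:", B)  # should land near [3.0, 2.0]

    # The residual vector e, whose squared sum the solution minimizes.
    e = y - X @ B
    print("sum of squared residuals:", e @ e)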

For reasons beyond the scope of this post, if you perform a linear regression as shown above and minimize the squared residuals, you get the best answer that can be provided, assuming the residuals are normally distributed with a mean of zero and every observation is independent.  Well, that is not quite the whole story; there is one more requirement that is not always discussed: all predictor values need to be known exactly.  Dealing with uncertainty in your predictor values is a more complex topic than this one, so we will ignore it, as everyone does.

This comes from the Wikipedia article on linear regression, if you want more information.

What is nonlinear regression?

Nonlinear regression deals with models that cannot be written in the linear form above.  The most common case is a model where a coefficient appears as an exponent or inside another function, such as y = a*e^(b*x) or y = a*x^b, although there are other ways to create a nonlinear model.  Since the model does not fit the form used in linear regression, there is no closed-form solution for the best-fit coefficients.  To solve a nonlinear regression you use an iterative solving algorithm: the coefficients are varied in a pattern so that the sum of squared residuals is minimized.  The computer algorithm (or software package) adjusts all of the coefficient values using one of many solving rules until the sum of the squared residuals reaches a minimum.
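
As a concrete illustration, here is a minimal sketch using scipy.optimize.curve_fit, assuming an exponential model y = a*e^(b*x).  The data are again simulated for the example.

    import numpy as np
    from scipy.optimize import curve_fit

    def model(x, a, b):
        # The nonlinear model: the coefficient b sits in the exponent.
        return a * np.exp(b * x)

    rng = np.random.default_rng(7)
    x = np.linspace(0, 2, 40)
    y = model(x, 2.0, 1.5) + rng.normal(0, 0.2, size=40)

    # curve_fit iteratively adjusts a and b, starting from the guess p0,
    # until the sum of squared residuals stops decreasing.
    params, cov = curve_fit(model, x, y, p0=[1.0, 1.0])
    print("fitted a, b:", params)  # should land near [2.0, 1.5]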

This iterative method would also work fine for a linear regression if you were not worried about computational effort and solving time.  Complex nonlinear equations can take minutes to hours to solve, although most solving algorithms will stop after a fixed time or a fixed number of iterations if they are not converging on a minimum reasonably fast.  This is to protect you from an endless computation that never finds a minimum.

If you happened to take a calculus class in your past, this is equivalent to finding the coefficients where the derivative of the sum of squared residuals is equal to zero and the function is at a minimum.  In those classes they warned us that the answer provided may be a local minimum rather than the absolute minimum.  The same rule applies to nonlinear regression: the provided answer might not be the best answer.
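
One practical defense, continuing the curve_fit sketch above, is to restart the solver from several different starting guesses and keep the answer with the smallest sum of squared residuals; the guess values here are arbitrary.

    # Multi-start search over a grid of starting guesses (a0, b0).
    best = None
    for a0 in (0.5, 1.0, 5.0):
        for b0 in (-1.0, 0.5, 2.0):
            try:
                params, _ = curve_fit(model, x, y, p0=[a0, b0], maxfev=5000)
            except RuntimeError:
                continue  # this starting point failed to converge
            sse = np.sum((y - model(x, *params)) ** 2)
            if best is None or sse < best[0]:
                best = (sse, params)

    if best is not None:
        print("best coefficients found:", best[1])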

What to do if standard linear regression does not work

Going back to the beginning of this post and the student who asked for help with nonlinear regression: I needed to ask why he wanted to perform a nonlinear regression.  The quick answer was that the response is believed to be exponential and the residuals from a standard regression look horrible.  Ugly residuals can come from many causes related to both the predictors and the response variable.

Non-normal residuals

This is the case when the histogram and/or the probability plot of the residuals does not look normally distributed, violating one of the requirements for a linear regression to provide the optimal solution.  Three general causes for this are:

  1. You have outliers or bad data in your regression.  It could be an error in recording a predictor (x) or the result (y).  In these cases you either remove or fix the bad data values and then run the regression again.
  2. The residuals have a random distribution that is non-normal.  This does occur at times, really!  In this case you transform the original response (y) values based on the shape or distribution of the residuals.  The most common response transformations are the natural log, ln(y), the square root of the response, and the inverse of the response.  Less common are squares or cubes of the response.  These are common outputs of the Box-Cox transformation, so that can be a good tool for identifying candidate transformations (see the sketch after this list).  But as always in Lean Six Sigma, you should only transform because it makes sense, not just because you can.
  3. The residuals look somewhat random but have more of an S shape.  This can arise in many ways, but it is generally caused by the regression data being generated by multiple processes that have different relationships between the x’s and the y’s.  In this case you need to separate the groups and solve each one separately.
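
Here is a minimal sketch of these checks using scipy.stats, with simulated data standing in for your own regression residuals: the histogram and normal probability plot flag non-normality, and a Box-Cox fit on the response suggests a candidate lambda.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.uniform(1, 10, size=80)
    y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, size=80))  # skewed response

    # Fit a straight line and collect the residuals.
    b1, b0 = np.polyfit(x, y, 1)  # polyfit returns highest power first
    resid = y - (b0 + b1 * x)

    # Histogram and normal probability plot of the residuals.
    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    axes[0].hist(resid, bins=15)
    stats.probplot(resid, plot=axes[1])
    plt.show()

    # Box-Cox suggests a lambda; a value near 0 points to ln(y).
    y_trans, lam = stats.boxcox(y)
    print("suggested lambda:", lam)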

Non-constant, non-random residuals

Non-constant and non-random residuals show up as patterns in the plot of the residuals vs. the predicted values, as well as in the plots of the residuals vs. the predictor values.  Both types of plots are very good at identifying problems with the regression model that may look like nonlinear regression issues.

  1. One case is when a predictor is used only as a main effect, such as x, and the residual plot has a curved shape: the low predicted values all have positive residuals, the mid-range values have negative residuals, and the high predicted values have positive residuals again, forming a bowl or a parabola that opens upward (or the opposite, with the signs reversed).  This is usually caused by a key predictor needing to be squared or square-rooted (x^2 or sqrt(x)).  You can often tell which predictor to manipulate by looking at the residual vs. predictor plots, as shown in the sketch after this list.
  2. Another, less common residual plot is found when the spread of the residuals increases as the predicted or predictor value increases, so the graphic is narrow on the left and wide on the right.  Generally the residuals are also symmetric around the residual = 0 line.  These patterns are treated with a transformation of the response variable (y) based on the shape of the residual plot.  The table below shows some of the guidelines for transformations.
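
Here is a minimal sketch of the first fix, with made-up data: a response that is really quadratic in x is fit first with x alone, then with x^2 added as a second linear term, and the drop in the sum of squared residuals shows the improvement.

    import numpy as np

    rng = np.random.default_rng(11)
    x = rng.uniform(0, 10, size=60)
    y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 2.0, size=60)

    # First-order model: its residuals would show the bowl shape.
    X1 = np.column_stack([np.ones_like(x), x])
    B1, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print("SSE with x only:", np.sum((y - X1 @ B1) ** 2))

    # Add x^2 as another column; the model is still linear in the coefficients.
    X2 = np.column_stack([np.ones_like(x), x, x**2])
    B2, *_ = np.linalg.lstsq(X2, y, rcond=None)
    print("SSE with x and x^2:", np.sum((y - X2 @ B2) ** 2))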

 

[Data transformation table: the relationship between sigma and the mean, the matching lambda, and the response transformation]

Table 9.3 from Volume III, Integrated Enterprise Excellence by Forrest Breyfogle, 2008.

To read the table: in the second row, where sigma is proportional to the mean squared, the residual plot will look like a trumpet bell, starting very tight on the left and flaring out wide on the right.  For this case you would transform the response (y) with lambda = -1, which is the inverse function (1/y).

The last example is probably the one most often found: the standard deviation is proportional to the mean.  This relationship has the shape of a cheerleader's megaphone; the spread of the residuals increases linearly, so you could virtually draw a straight line to bound them.  In this case you would transform the response variable (y) with the natural log function, ln(y).

In each case you transform the response into a new column and then run the same regression model a second time, this time using the transformed response.  If the residual plots now look well behaved, you can trust the p-values from the regression to identify the significant predictors.  To use the regression equation to predict a new response, you apply the inverse of the response transformation.
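
A minimal sketch of that workflow, again with simulated data: regress ln(y) on the same predictors, then use exp() to invert the transformation when predicting a new response.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(1, 10, size=70)
    y = np.exp(1.0 + 0.4 * x + rng.normal(0, 0.15, size=70))

    # Run the same regression, but on the transformed response ln(y).
    X = np.column_stack([np.ones_like(x), x])
    B, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

    # To predict a new response, invert the transformation with exp().
    x_new = 5.0
    y_pred = np.exp(B[0] + B[1] * x_new)
    print("predicted response at x = 5:", y_pred)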

Summary

Nonlinear regression is a reasonable tool for solving complex relationships.  However, most Lean Six Sigma applications that do not solve with standard linear regression probably do not need nonlinear regression; these problems are usually handled by adding new predictors that are transforms of the main-effect predictors, such as squaring the values or taking the natural log of the values.  These methods correct for non-random residual plots, and they are quite common because many of our processes are not just a function of the main-effect values; they are known to have relationships with the square, square root, or natural log of the main effects.

The second option, which is used less frequently, is the transformation of the response values.  These are called variance stabilization transformations because they address a residual pattern that violates the assumption that the standard error of prediction (the regression error term) is constant across the entire sample space.  In plain terms, there is more variation at some values of an x than at others.  The most common transformations are the inverse and the natural log of the response.

Now you have read the statistician’s answer to these questions.  In my personal experience, the transformed predictors described in the first summary paragraph are common and are often needed to reach a solution you can use in a project or in a process model; you should learn how to do this.  The variance stabilization transformations in the second paragraph are interesting but not really necessary in typical Lean Six Sigma work.  When I have needed to transform the response to do things correctly, I have found the same basic results without the transformation.  If your goal is to predict the average value, you can generally ignore these transformations.  If the goal is to estimate both the mean and the variability at all points, you do need the transformation.

I hope this answered some questions for you.