STEPWISE MULTIPLE LINEAR REGRESSION ANALYSIS
REQUIREMENTS: Regression is used to test the effects of n independent (predictor) variables on a single dependent (criterion) variable. Regression tests the deviation about the means, and all variables must be interval scaled. Computationally, regression analysis may be conducted using either a raw data matrix (respondents by variables) or a correlation matrix.

Regression analysis measures the degree of influence of the independent variables on a dependent variable. In the case of a single independent variable, the dependent variable can be predicted from the independent variable by the simple equation:

$$y = a + bx$$

where a is a constant (the intercept) and b is the slope. This can be extended to several independent variables as follows:

$$y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_n x_n$$

It should be noted that, whether for a single variable or for multiple variables, the relationship predicted is always linear.

A Graphical Explanation of Bi-Variate (2 Variable) Regression Analysis

A simple way to approximate a regression equation for a single independent variable is to plot the relationship between the variables. First, plot the dependent variable against the independent variable. This type of plot is called a scatter diagram. Next, identify the straight line that represents the trend through the mid-point of the data. This line must be the one with the `best fit'.

The regression line identifies the trend or relationship between the independent and dependent variables. The relationship, once identified, is used to predict values of the dependent variable for specific values of the independent variable. This predicted relationship is always in the form of a linear trend.

The table below gives a set of values for an independent (X) and a dependent (Y) variable.
┌───┬────┬────┬────┬────┬────┬────┬────┬─────┬────┬────┐
│ X │ 39 │ 43 │ 21 │ 64 │ 57 │ 47 │ 28 │  75 │ 34 │ 52 │
├───┼────┼────┼────┼────┼────┼────┼────┼─────┼────┼────┤
│ Y │ 68 │ 82 │ 56 │ 86 │ 97 │ 94 │ 77 │ 103 │ 59 │ 79 │
└───┴────┴────┴────┴────┴────┴────┴────┴─────┴────┴────┘

The scatter plot of the variables is given below:

[Figure: scatter plot of Y against X]
Regression analysis is used to develop an accurate mathematical formulation of the regression line. The line of best fit is defined as the line that minimizes the sum of squared deviations of the data points from the line. The regression line is therefore also referred to as the least squares line.

In the case of multi-variable regression, the analysis is a sequence of multiple linear regression equations developed in a stepwise manner. At each step of the sequence, one variable is added to the regression equation. The variable added is the one that makes the greatest reduction in the error sum of squares of the sample data. Equivalently, it is the variable that, when added, provides the greatest increase in the F value. Variables that do not have a significant correlation with the dependent variable are those whose addition does not increase the F value, and they are not featured in the regression equation.

Mathematical Computation of the Regression Coefficients

I. With One Independent Variable:

The computation of the regression coefficients for the case of a single independent variable is given below. The slope (regression coefficient) of the least squares line is given by b, where

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
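The intercept of the line is given by a, where

$$a = \bar{y} - b\bar{x}$$

with $\bar{x}$ and $\bar{y}$ the means of the independent and dependent variables. In terms of the sums of the data, this is computed as follows:

$$a = \frac{\sum y - b\sum x}{n}$$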
The Residual: The residual is the difference between the actual and predicted values of the dependent variable. The standard error of the estimate is the standard deviation of the residuals. It can be calculated as follows:

$$s_e = \sqrt{\frac{\sum \left(y - \hat{y}\right)^2}{n - 2}}$$

where $\hat{y} = a + bx$ is the predicted value and n is the number of observations.
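As an illustration, the residuals and the standard error of the estimate can be computed for the X and Y values tabulated earlier. This is a minimal Python sketch, not a definitive implementation: it assumes n - 2 degrees of freedom in the denominator (the usual convention with one predictor), and the function names `fit_line` and `standard_error` are illustrative.

```python
import math

# X and Y values from the table earlier in this section.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [68, 82, 56, 86, 97, 94, 77, 103, 59, 79]


def fit_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b


def standard_error(x, y, a, b):
    """Standard deviation of the residuals, assuming n - 2 degrees of freedom."""
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(r * r for r in residuals)  # error sum of squares
    return math.sqrt(sse / (len(x) - 2))


a, b = fit_line(x, y)
print(f"a = {a:.6f}, b = {b:.6f}")
print(f"standard error of the estimate = {standard_error(x, y, a, b):.4f}")
```

Each residual is the vertical distance between an observed Y and the fitted line, so a smaller standard error means the points lie closer to the line.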
┌───────────────────────────────────────────────┐
│ A Numerical Example: One Independent Variable │
└───────────────────────────────────────────────┘

Let us use the data that produced the above graphical representation of a regression analysis.

╔═════╤═══════╤═══════╤════════╤════════╤════════╗
║SL.No│   y   │   x   │   xy   │   y²   │   x²   ║
╟─────┼───────┼───────┼────────┼────────┼────────╢
║  1  │    68 │    39 │   2652 │   4624 │   1521 ║
║  2  │    82 │    43 │   3526 │   6724 │   1849 ║
║  3  │    56 │    21 │   1176 │   3136 │    441 ║
║  4  │    86 │    64 │   5504 │   7396 │   4096 ║
║  5  │    97 │    57 │   5529 │   9409 │   3249 ║
║  6  │    94 │    47 │   4418 │   8836 │   2209 ║
║  7  │    77 │    28 │   2156 │   5929 │    784 ║
║  8  │   103 │    75 │   7725 │  10609 │   5625 ║
║  9  │    59 │    34 │   2006 │   3481 │   1156 ║
║ 10  │    79 │    52 │   4108 │   6241 │   2704 ║
╟─────┼───────┼───────┼────────┼────────┼────────╢
║SUM  │   801 │   460 │  38800 │  66385 │  23634 ║
║AVG. │  80.1 │    46 │   3880 │ 6638.5 │ 2363.4 ║
╚═════╧═══════╧═══════╧════════╧════════╧════════╝
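The column sums in the table above, and the least squares coefficients that follow from them, can be reproduced with a short script. This is a minimal Python sketch assuming the usual sum-based formulas for the slope and intercept; the helper name `column_sums` and the dictionary keys are illustrative.

```python
# y and x columns from the worked table.
y = [68, 82, 56, 86, 97, 94, 77, 103, 59, 79]
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]


def column_sums(x, y):
    """Return the sums of the y, x, xy, y^2 and x^2 columns."""
    return {
        "y": sum(y),
        "x": sum(x),
        "xy": sum(xi * yi for xi, yi in zip(x, y)),
        "y2": sum(yi * yi for yi in y),
        "x2": sum(xi * xi for xi in x),
    }


n = len(x)
sums = column_sums(x, y)
for name, total in sums.items():
    print(f"{name:>2}: sum = {total}, avg = {total / n}")

# Slope and intercept from the column sums.
slope = (n * sums["xy"] - sums["x"] * sums["y"]) / (n * sums["x2"] - sums["x"] ** 2)
intercept = (sums["y"] - slope * sums["x"]) / n
print(f"slope = {slope:.6f}, intercept = {intercept:.6f}")
```

Working from the column sums mirrors the hand computation, which makes it easy to check each intermediate figure against the table.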
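Therefore, the slope is given by:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(38800) - (460)(801)}{10(23634) - 460^2} = \frac{19540}{24740} = 0.789814$$

and the intercept is given by:

$$a = \bar{y} - b\bar{x} = 80.1 - 0.789814 \times 46 = 43.768553$$

Hence the line of best fit is given by:

$$Y = 43.768553 + 0.789814\,X$$

As an alternative method of deriving the regression equation, a spreadsheet can be used. The line for the single variable regression was also derived using an Excel spreadsheet. The output from Excel for the above data set is given below:

[Figure: Excel regression output]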
A Numerical Example: Multiple Regression

The computation of the regression coefficients for one or more independent variables involves matrix computations. A brief result is given below.

Let X be the data matrix of the predictor (independent) variables, Y the data vector of the criterion (dependent) variable, and b the vector of regression coefficients, including the constant. The vector of regression coefficients is computed as

$$b = \left(X^{T}X\right)^{-1}X^{T}Y$$

╔════════════╤══════╤═══════╤════════╗
║ Y │ X0 │ X1 │ X2 ║
╟────────────┼──────