Multiple linear regression analysis is a method used to determine the influence of multiple independent variables on a single dependent variable. Similar to simple linear regression, the relationship between variables in multiple linear regression is linear. This means that changes in independent variables (X) will be followed by proportional changes in the dependent variable (Y). The main difference between these two methods lies in the number of independent variables: in multiple linear regression, there is more than one independent variable. The equation for the multiple linear regression model is expressed as:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βnXn + ε
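Outside SmartstatXL, the same model can be sketched in a few lines of Python. The data below are synthetic and the coefficient values are arbitrary, chosen only to show that ordinary least squares recovers them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X1 = rng.uniform(4, 7, n)          # e.g. a pH-like predictor (hypothetical)
X2 = rng.uniform(5, 30, n)         # e.g. an available-P-like predictor (hypothetical)
Y = 10 + 3.0 * X1 - 0.5 * X2 + rng.normal(0, 2, n)

# Design matrix with an intercept column; beta minimizes ||Y - X @ beta||^2
X = np.column_stack([np.ones(n), X1, X2])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta.round(2))               # roughly [10, 3, -0.5]
```

The fitted vector `beta` contains β₀, β₁, and β₂ in the order of the design-matrix columns.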
Key features of multiple linear regression analysis with SmartstatXL include:
- Handling of missing data.
- Regression Methods: Enter, Stepwise, Forward Selection, Backward Selection, and Forward Information Criteria.
- Regression Diagnostics:
  - Normality Test, Heteroskedasticity Test, and Residual Plot.
  - Box-Cox Transformation.
- Automatic identification and replacement of outlier data.
- Automatic data transformation.
- Output includes:
  - Regression Equation.
  - Regression Statistics/Fit Quality: R², Adjusted R², Correlation Coefficient, AIC, AICc, BIC, RMSE, MAE, MPE, MAPE, and sMAPE.
  - Coefficient Estimation: Coefficient Value, Standard Error, t-statistics, p-value, Upper/Lower Confidence Limits, and VIF.
  - Analysis of Variance: Sequential and Partial.
  - Graphs: 2D and 3D graphs for Response Surface, as well as Optimization (Maximum and Minimum).
Case Example
There is a study about the effects of fertilizers and compost on various soil chemical properties, nutrient uptake, and yield. Here is a snippet of data from that study:
In this case example, suppose we want to understand the relationships among responses: which response variables affect Total Dry Weight (g/plant)? Thus, several response variables are used as predictors (variables X: pH H2O to Al Saturation (%)), with Total Dry Weight (g/plant) as the response (variable Y).
Steps for Multiple Linear Regression Analysis
- Activate the worksheet (Sheet) to be analyzed.
- Place the cursor on the dataset (for creating a dataset, see Data Preparation method).
- If the active cell is not on the dataset, SmartstatXL will automatically attempt to identify the dataset.
- Activate the SmartstatXL Tab.
- Click on the Regression > Multiple Linear Regression menu.
- SmartstatXL will display a dialog box to confirm whether the dataset is correct (usually the dataset is selected automatically and correctly).
- If it is correct, click the Next button.
- A Regression Analysis Dialog Box will appear next. Select the Factor Variables (Independent) and one or more Response Variables (Dependent). The factor variables to be selected depend on the type of regression analysis. In this case example, we determine:
- Regression equation model: Y = β₀ + β₁X₁ + β₂X₂ + ... + βnXn
- Regression Type: Multiple Linear Regression
- Predictor Variables: pH H2O to Al Saturation (%)
- Response Variable: Total Dry Weight (g/plant)
- Regression Method: Stepwise
For more details, refer to the following dialog box view:
- Press the "Next" button
- Select regression output as shown in the following view:
- Press the OK button to generate the output in the Output Sheet
Analysis Results
Analysis Information: type of regression used, regression method, response, and predictors.
Multiple linear regression is a statistical analysis technique used to predict a response variable based on two or more predictor variables. In this case example, the response variable in question is "TOTAL DRY WEIGHT (g/plant)" and the predictor variables include pH H2O, Available P (mg kg-1), Al Saturation (%), and so on.
From the analysis results using the stepwise method, we can see that there are three steps performed to select the best regression model.
Interpretation:
- Step 1:
- The predictor variable "Al Saturation (%)" is the first variable to enter the regression model and makes a significant contribution in explaining the variability of the response variable (R² = 0.233).
- The coefficient for "Al Saturation (%)" is -7.464, meaning that for every 1% increase in Al Saturation, the total dry weight is estimated to decrease by 7.464 g/plant, with all other variables held constant.
- The P-Value for "Al Saturation (%)" is 0.000, indicating that this variable is significant in explaining the variability of "TOTAL DRY WEIGHT".
- Step 2:
- The variable "pH H2O" is added to the model, increasing R² to 0.364. This means that the combination of "Al Saturation (%)" and "pH H2O" explains 36.4% of the variability in "TOTAL DRY WEIGHT".
- Step 3:
- The variable "Available P (mg kg-1)" is added to the model, increasing R² to 0.422. This means that these three variables (Al Saturation (%), pH H2O, and Available P) explain 42.2% of the variability in "TOTAL DRY WEIGHT".
Among these three steps, step 3 is chosen as the best model because it has the highest R² value (42.2%) and the lowest AICc and BIC values of the three. Of the 7 predictor variables offered to the model, only Al Saturation, pH H2O, and Available P have a significant effect on the response (Total Dry Weight (g/plant)).
Conclusion: By using the stepwise regression method, the best model for predicting "TOTAL DRY WEIGHT" involves three predictor variables, namely "Al Saturation (%)", "pH H2O", and "Available P (mg kg-1)". This model is able to explain 42.2% of the variability in total dry weight. The variables "Al Saturation (%)" and "pH H2O" have a negative relationship with total dry weight, while "Available P (mg kg-1)" has a positive relationship.
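SmartstatXL's stepwise procedure is built in, but the underlying idea can be sketched as a greedy forward-selection loop. This is a simplified illustration on synthetic data; the actual stepwise method also re-tests variables already in the model and can remove them:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (X must include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())

def forward_stepwise(predictors, y, min_gain=0.01):
    """Greedy forward selection: at each step add the predictor that raises
    R^2 the most; stop when the gain falls below min_gain."""
    n = len(y)
    selected, current, best_r2 = [], np.ones((n, 1)), 0.0
    remaining = dict(predictors)
    while remaining:
        gains = {name: r_squared(np.column_stack([current, col]), y)
                 for name, col in remaining.items()}
        name = max(gains, key=gains.get)
        if gains[name] - best_r2 < min_gain:
            break
        best_r2 = gains[name]
        selected.append(name)
        current = np.column_stack([current, remaining.pop(name)])
    return selected, best_r2

# Synthetic example: only x1 and x3 actually drive y
rng = np.random.default_rng(1)
n = 80
cols = {f"x{i}": rng.normal(size=n) for i in range(1, 5)}
y = 2.0 * cols["x1"] - 1.5 * cols["x3"] + rng.normal(0, 0.5, n)
selected, r2 = forward_stepwise(cols, y)
print(selected)   # the two informative predictors are selected, x2/x4 are not
```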
Regression Equation
Interpretation:
- Regression Equation:
- The obtained regression equation shows the relationship between "TOTAL DRY WEIGHT" (Y) and three predictor variables: "pH H2O", "Available P (mg kg-1)", and "Al Saturation (%)". This equation can be interpreted as follows:
- Intercept 148.8103: If all predictor variables are zero, then the estimated average "TOTAL DRY WEIGHT" is 148.8103 g/plant.
- Coefficient -20.2959 for "pH H2O": For every 1 unit increase in pH H2O, the total dry weight is estimated to decrease by 20.2959 g/plant, with all other variables held constant.
- Coefficient 1.4821 for "Available P (mg kg-1)": For every 1 mg/kg increase in Available P, the total dry weight is estimated to increase by 1.4821 g/plant, with all other variables held constant.
- Coefficient -5.737 for "Al Saturation (%)": For every 1% increase in Al Saturation, the total dry weight is estimated to decrease by 5.737 g/plant, with all other variables held constant.
- R² and r Values:
- The R² (coefficient of determination) of 0.422 indicates that the regression model explains about 42.2% of the variability in "TOTAL DRY WEIGHT". Meanwhile, the correlation coefficient (r) of 0.650 indicates that there is a moderate positive relationship between the chosen predictor variables and the response.
- Adjusted R²:
- The adjusted R² of 0.391 indicates that after considering the number of predictor variables in the model, this model explains about 39.1% of the variability in "TOTAL DRY WEIGHT".
- F-test:
- The F-value of 13.635 with a significance of 0.00 indicates that the overall regression model is significant. This means that at least one of the predictor variables has a significant relationship with "TOTAL DRY WEIGHT".
Writing in Scientific Articles:
In research on the effects of fertilizers and compost on various soil chemical properties, nutrient uptake, and crop yield, it was found that the variables "pH H2O", "Available P (mg kg-1)", and "Al Saturation (%)" significantly influence "TOTAL DRY WEIGHT" with the following regression equation:
\[ \text{Y} = 148.8103 - 20.2959 \times \text{pH H2O} + 1.4821 \times \text{Available P (mg kg}^{-1}\text{)} - 5.737 \times \text{Al Saturation (%)} \]
This model explains about 42.2% of the variability in "TOTAL DRY WEIGHT" (\(R^2 = 0.422\)) and indicates a moderate positive relationship (\(r = 0.650\)). After considering the number of predictor variables, the adjusted \(R^2\) is 0.391. The F-test indicates that this model is significant with \(F = 13.635\) and \(p\text{-value} < 0.05\).
Thus, while other variables may also influence "TOTAL DRY WEIGHT", these three variables have a significant relationship and provide important information on how these variables affect crop yield.
Model Goodness of Fit
A number of statistics related to the model's goodness-of-fit are provided. The following is the interpretation of each statistic:
- r (Correlation Coefficient): An r-value of 0.6497 indicates a moderate positive relationship between the selected predictor variables and the response. The closer r is to 1, the stronger the relationship.
- R² (Coefficient of Determination): An R² of 0.4221 means the regression model explains approximately 42.21% of the variability in the response data.
- Adjusted R²: The adjusted R² of 0.3911 takes into account the number of predictor variables in the model and provides a more realistic estimate of how well the model might perform on a new dataset. This model explains approximately 39.11% of the "TOTAL DRY WEIGHT" variability after adjustment.
- AIC, AICc, and BIC: These are information criteria used to compare the relative quality of statistical models. Lower values indicate a better model. AICc is a correction of AIC that considers sample size, and is generally more accurate if the sample size is relatively small.
- RMSE (Root Mean Square Error): An RMSE of 12.4314 indicates the average error between observed and predicted values by the model. The lower the RMSE value, the better the model.
- MAE (Mean Absolute Error): An MAE of 9.6155 is the average of the absolute errors between observed and predicted values. It provides a measure of how large errors can be expected from the model's predictions.
- MPE (Mean Percentage Error): An MPE of -0.0874 measures the bias in the model's predictions. A negative value indicates that the model tends to predict higher responses than actual.
- MAPE (Mean Absolute Percentage Error): A MAPE of 0.2379 indicates the average relative error between observed and predicted values. It shows that the average error in the model's predictions is about 23.79% of the actual value.
- sMAPE (symmetric Mean Absolute Percentage Error): An sMAPE of 0.2060 is the symmetric version of MAPE and is often used because it has more desirable properties in certain situations compared to MAPE. The average error in the model's predictions is about 20.60% of the actual value.
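These statistics can be reproduced from their common definitions. The observed and predicted values below are hypothetical, and SmartstatXL's exact conventions (e.g. reporting proportions versus percentages) may differ slightly:

```python
import numpy as np

y = np.array([40.0, 55.0, 30.0, 62.0, 48.0])       # hypothetical observations
yhat = np.array([42.5, 50.0, 33.0, 60.0, 51.0])    # hypothetical predictions

err = y - yhat
rmse = np.sqrt(np.mean(err ** 2))                  # root mean square error
mae = np.mean(np.abs(err))                         # mean absolute error
mpe = np.mean(err / y)                             # signed: negative => over-prediction
mape = np.mean(np.abs(err) / np.abs(y))            # mean absolute percentage error
smape = np.mean(np.abs(err) / ((np.abs(y) + np.abs(yhat)) / 2))

print(round(rmse, 4), round(mae, 4), round(mape, 4), round(smape, 4))
```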
Writing in Scientific or Academic Articles:
In the conducted regression analysis, the obtained model shows adequate goodness-of-fit. The correlation coefficient of 0.6497 indicates a moderate positive relationship between the predictor variables and "TOTAL DRY WEIGHT." This model can explain about 42.21% of the variability in the response data, with an adjusted R² of 39.11%. The average prediction error of this model, based on RMSE, is 12.4314 g/plant. Furthermore, the average relative error based on MAPE is about 23.79%, while sMAPE indicates an average error of about 20.60%. These overall statistics provide a picture that the regression model has adequate accuracy in predicting "TOTAL DRY WEIGHT" based on the selected predictor variables.
Coefficient Estimation, Optimization (Maximum/Minimum Values), and Regression Variance Analysis
Here is the interpretation of the coefficient estimation table:
- Intercept:
- The intercept coefficient is 148.810. This indicates that if all predictor variables are zero, then the estimated average "TOTAL DRY WEIGHT" is 148.810 g/plant.
- With a T-statistic of 4.011 and a p-value of 0.000 (indicating significance at the 1% level), this intercept is statistically significant.
- The 95% confidence interval for this intercept ranges from 74.493 to 223.127.
- pH H2O:
- The coefficient for "pH H2O" is -20.296, meaning that for every 1-unit increase in pH H2O, the "TOTAL DRY WEIGHT" is expected to decrease by 20.296 g/plant, holding all other variables constant.
- With a T-statistic of -3.500 and a p-value of 0.001 (indicating significance at the 1% level), this variable is significant in explaining the variability in "TOTAL DRY WEIGHT."
- The 95% confidence interval for this coefficient ranges from -31.913 to -8.679.
- The Variance Inflation Factor (VIF) for this variable is 1.016, indicating no multicollinearity issues with other variables in the model.
- and so on
Variance Inflation Factor (VIF) is a statistic used to measure the extent to which the variability of a predictor variable can be explained by other predictor variables in a regression model.
Interpretation of VIF:
- VIF = 1: No multicollinearity between that predictor variable and other predictor variables.
- 1 < VIF < 5: Typically considered an acceptable level of multicollinearity.
- 5 ≤ VIF < 10: A strong indication of multicollinearity that may require further action.
- VIF ≥ 10: Generally considered strong evidence of multicollinearity. A VIF at or above this level indicates that the variable is largely explained by the other predictor variables in the model and may need to be considered for removal.
Why is VIF Important?
Multicollinearity can cause several issues in regression analysis, such as:
- Unstable regression coefficients (small changes in the data can cause large changes in coefficient estimates).
- Inflated standard errors, which can result in coefficients not being significant even though the variable is truly important.
- Difficulty in determining which variables contribute most to the response variability due to interdependence.
Therefore, if VIF is high, consider:
- Removing some correlated variables from the model.
- Combining correlated variables into a single variable using techniques such as Principal Component Analysis.
- Increasing the sample size (in some cases, multicollinearity may be caused by a small sample size).
It's important to remember that VIF only measures multicollinearity between predictor variables and does not provide information about the relationship between predictor variables and the response variable.
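The VIF itself is straightforward to compute: regress each predictor on the remaining predictors and take 1/(1 − R²). A minimal sketch on synthetic data, where one predictor is deliberately made nearly collinear with another:

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column):
    regress each predictor on the others and return 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1 - (resid @ resid) / (((xj - xj.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# x3 is nearly a copy of x1, so both should show a high VIF; x2 stays near 1
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(0, 0.1, n)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```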
Writing in Scientific or Academic Articles:
The results of the regression analysis indicate that the variables "pH H2O," "Available P (mg kg-1)," and "Al Saturation (%)" have a significant impact on "TOTAL DRY WEIGHT." Specifically:
- For every 1-unit increase in pH H2O, "TOTAL DRY WEIGHT" is estimated to decrease by 20.296 g/plant (p-value = 0.001, 95% CI: -31.913 to -8.679).
- For every 1 mg/kg increase in Available P, "TOTAL DRY WEIGHT" is estimated to increase by 1.482 g/plant (p-value = 0.021, 95% CI: 0.230 to 2.734).
- For every 1% increase in Al Saturation, "TOTAL DRY WEIGHT" is estimated to decrease by 5.737 g/plant (p-value = 0.001, 95% CI: -9.027 to -2.447).
VIF analysis shows no significant signs of multicollinearity among the predictor variables, with all VIF values below the general threshold of 10. This indicates that the obtained regression model has good reliability in predicting "TOTAL DRY WEIGHT" based on the selected variables.
Optimization Table for Achieving Maximum and Minimum Response Values
Here is the interpretation for the table:
Maximum Value:
"TOTAL DRY WEIGHT" reaches a maximum value of 84.986 g/plant under the following conditions:
- pH H2O of 5.070.
- Available P (mg kg-1) of 26.365.
- Al Saturation (%) of 0.000.
From this interpretation, we can say that the optimal soil conditions for achieving maximum dry weight are when the soil has a pH of around 5.070, phosphorus availability of about 26.365 mg/kg, and no aluminum saturation.
Minimum Value:
"TOTAL DRY WEIGHT" reaches a minimum value of 9.433 g/plant under the following conditions:
- pH H2O of 6.590.
- Available P (mg kg-1) of 12.204.
- Al Saturation (%) of 4.134.
From this interpretation, we can say that the least favorable conditions, where the dry weight reaches its minimum, occur when the soil has a pH of around 6.590, phosphorus availability of about 12.204 mg/kg, and aluminum saturation of 4.134%.
Thus, this information is highly useful for farmers or researchers in optimizing soil conditions to achieve maximum crop yield.
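Because the fitted model is purely linear, its maximum and minimum over the predictor ranges always occur at corners of those ranges. Taking the optimum conditions reported above as the (assumed) endpoints of each predictor's range, the table's values can be reproduced directly from the regression coefficients:

```python
import numpy as np
from itertools import product

# Coefficients from the fitted equation in this example; the (min, max)
# bounds are assumed from the reported optimum conditions.
intercept = 148.8103
coefs = np.array([-20.2959, 1.4821, -5.737])   # pH H2O, Available P, Al Saturation
bounds = [(5.07, 6.59), (12.204, 26.365), (0.0, 4.134)]

# Evaluate the model at all 2^3 corners of the bounding box
corners = np.array(list(product(*bounds)))
values = intercept + corners @ coefs
print("max:", values.max().round(3), "at", corners[values.argmax()])
print("min:", values.min().round(3), "at", corners[values.argmin()])
```

The maximum (≈84.986 g/plant) and minimum (≈9.433 g/plant) match the optimization table, confirming that the extremes sit at the corners of the predictor ranges.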
Analysis of Variance (ANOVA)
The following is an example interpretation of the ANOVA table for the variable "TOTAL DRY WEIGHT (g/plant)":
Regression:
- The regression model has 3 degrees of freedom (DF), indicating the presence of three predictor variables.
- The Sum of Squares (SS) for the model is 6321.2740, representing the total variability explained by the model.
- The Mean Square (MS) for the model (SS divided by DF) is 2107.0913.
- The F-value for the model is 13.635 with a p-value of 0.000, indicating that the regression model as a whole is significant at the 1% level.
The analysis of variance reveals that the regression model involving pH H2O, Available P (mg kg-1), and Al Saturation (%) is significant in explaining the variability in "TOTAL DRY WEIGHT" with an F-value of 13.635 (p-value < 0.001). Specifically, pH H2O (F-value = 12.248, p-value = 0.001), Available P (mg kg-1) (F-value = 5.625, p-value = 0.021), and Al Saturation (%) (F-value = 12.199, p-value = 0.001) all make significant contributions to the model. This underscores the importance of these three variables in influencing "TOTAL DRY WEIGHT."
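The overall F-test in the ANOVA table follows from partitioning the total sum of squares into regression and residual components. A generic sketch on synthetic data (not the study's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 60, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(0, 1.0, n)

Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta

ss_total = ((y - y.mean()) ** 2).sum()
ss_res = resid @ resid
ss_reg = ss_total - ss_res                 # variability explained by the model

df_reg, df_res = p, n - p - 1
ms_reg, ms_res = ss_reg / df_reg, ss_res / df_res
F = ms_reg / ms_res                        # overall model F-statistic
p_value = stats.f.sf(F, df_reg, df_res)
print(round(F, 3), p_value)
```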
Regression Assumption Tests
There are two sets of tests to check the regression assumptions: Heteroscedasticity Test and a series of Normality Tests for residual distribution.
Heteroscedasticity Test
The Breusch–Pagan–Godfrey test is used to detect heteroscedasticity in the regression model. Heteroscedasticity occurs when the variability of the errors (residuals) is not constant across levels of the predictor.
With a χ² value of 2.093 and a p-value of 0.553 (greater than 0.05), we fail to reject the null hypothesis. This indicates that there is no strong evidence supporting the presence of heteroscedasticity. In other words, our data meet the assumption of homoscedasticity.
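The Breusch–Pagan statistic can be computed by hand as n·R² from an auxiliary regression of the squared residuals on the predictors. The sketch below uses synthetic homoscedastic data, so a large p-value is expected:

```python
import numpy as np
from scipy import stats

def breusch_pagan(X, resid):
    """Breusch-Pagan LM test: regress squared residuals on the predictors;
    LM = n * R^2 is compared to chi^2 with (number of predictors) df."""
    n = len(resid)
    Xd = np.column_stack([np.ones(n), X])
    u2 = resid ** 2
    beta, *_ = np.linalg.lstsq(Xd, u2, rcond=None)
    fitted = Xd @ beta
    r2 = 1 - ((u2 - fitted) ** 2).sum() / (((u2 - u2.mean()) ** 2).sum())
    lm = n * r2
    return lm, stats.chi2.sf(lm, X.shape[1])

# Homoscedastic synthetic data: constant error variance by construction
rng = np.random.default_rng(4)
n = 120
X = rng.normal(size=(n, 3))
y = 1 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1, n)
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
lm, p = breusch_pagan(X, y - Xd @ beta)
print(round(lm, 3), round(p, 3))
```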
Normality Tests
Normality Tests are used to evaluate whether the residuals from the regression model are normally distributed.
- Shapiro-Wilk's Test: With a statistic of 0.979 and a p-value of 0.393, we fail to reject the null hypothesis that residuals are normally distributed.
- Anderson Darling Test: With a statistic of 0.355 and a p-value of 0.459, we fail to reject the null hypothesis that residuals are normally distributed.
- D'Agostino Pearson Test: With a statistic of 1.266 and a p-value of 0.531, we fail to reject the null hypothesis that residuals are normally distributed.
- Lilliefors Test: With a statistic of 0.069 and a p-value greater than 0.20, we fail to reject the null hypothesis that residuals are normally distributed.
- Kolmogorov-Smirnov Test: With a statistic of 0.069 and a p-value greater than 0.20, we fail to reject the null hypothesis that residuals are normally distributed.
From all the above Normality Tests, we find no evidence supporting the violation of the normality assumption. Therefore, we can conclude that the residuals from the regression model are normally distributed.
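Several of these tests are available in scipy.stats; here is a sketch applied to synthetic stand-in residuals. Note that a plain Kolmogorov–Smirnov test against a normal with estimated mean and standard deviation is what the Lilliefors correction adjusts:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
resid = rng.normal(0, 1, 80)      # stand-in for model residuals

w_stat, w_p = stats.shapiro(resid)                  # Shapiro-Wilk
k2_stat, k2_p = stats.normaltest(resid)             # D'Agostino-Pearson
ks = stats.kstest(resid, "norm",
                  args=(resid.mean(), resid.std(ddof=1)))  # KS, fitted normal
ad = stats.anderson(resid, dist="norm")             # Anderson-Darling

# Anderson-Darling is judged against critical values (index 2 = 5% level)
print(round(w_p, 3), round(k2_p, 3), round(ks.pvalue, 3),
      ad.statistic < ad.critical_values[2])
```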
Conclusion:
Based on the results of the Breusch–Pagan–Godfrey test and the Normality Tests, our regression model satisfies two crucial assumptions, namely, homoscedasticity and residual normality. This indicates that our regression model is valid and the estimates generated can be relied upon.
Residual Plots
Beyond formal tests, the assumption of normality can also be visually assessed using the accompanying residual plots. The examination can be conducted using the Normal Probability Plot (Normal P-Plot), Histogram, and Residual vs. Predicted Plot.
- Normal P-Plot for Residuals:
- The Normal Probability Plot compares the ordered residuals with the values expected under a normal distribution. Ideally, the points on this plot should follow a straight diagonal line. If the points deviate from the diagonal line, this might indicate departures from normality.
- The fact that the points closely follow the straight diagonal line suggests that the residuals are approximately normally distributed across most of the value range. This is a good sign and indicates that the assumption of residual normality is largely met. However, the presence of points deviating from the diagonal line at both ends indicates some deviation from normality in the tails of the distribution.
- Although there are some deviations from normality, depending on the context and objective of the analysis, these deviations may not be significant. However, if the analysis is highly sensitive to the normality assumption, alternative techniques or transformations may need to be considered to address these deviations.
- Histogram for Residuals:
- The histogram should display a distribution that approximates a bell shape (normal distribution). Deviations from this shape (e.g., a skewed or long-tailed distribution) can indicate a violation of the normality assumption.
- Residual vs. Predicted:
- To check for homoscedasticity, the points on this plot should scatter randomly around a horizontal line at 0 without any specific pattern. If a specific pattern is observed, such as a funnel shape or curve, this could indicate heteroscedasticity or other violations of the regression assumptions.
Given that all formal tests suggest that the residuals are normally distributed (as all p-values are greater than 0.05), minor deviations observed in the Normal P-Plot are likely not a significant concern.
In practice, regression analysis is often quite tolerant of minor violations of the normality assumption, especially if the sample size is sufficiently large. Therefore, even though there are some points deviating from the diagonal line on the Normal P-Plot, if formal tests indicate normality and no other significant violations of assumptions are observed, the regression model may be considered sufficiently valid for analytical purposes.
Box-Cox Transformation and Residual Analysis
1. Box-Cox Transformation
The Box-Cox transformation is used to make the response variable (dependent variable) in linear regression conform to the assumptions of regression analysis, such as normality, homoscedasticity, and a linear relationship with the independent variables. By changing the scale of the response variable, this transformation aims to improve model fit and satisfy regression assumptions.
In this case:
- The λ (Lambda) value from the Box-Cox transformation is 1.440.
- The suggested transformation is "No Transformation: Y1", meaning that no transformation of the response variable is required. This aligns with the mathematical representation of the Box-Cox transformation, where a λ equal to (or statistically indistinguishable from) 1 means no transformation is needed.
Considering that the tests for heteroscedasticity and normality indicate that all regression assumptions are met, the advice from Box-Cox transformation to avoid transformation is sensible. If the model already meets the assumptions without the need for transformation, it is better to leave it unchanged.
Conclusion: Although the Box-Cox transformation is often used to rectify violations of regression assumptions, in this case, since no violations were found and the advice from Box-Cox is to avoid transformation, we can proceed with the model without the need for transformation. The model already satisfies the required regression assumptions and is considered valid.
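The λ estimate itself can be illustrated with scipy: for right-skewed (log-normal) data the estimated λ is pulled toward 0, i.e. toward a log-like transformation. The data here are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = np.exp(rng.normal(0, 0.8, 200))   # right-skewed, strictly positive data

lam = stats.boxcox_normmax(y)         # estimate the Box-Cox lambda
y_t = stats.boxcox(y, lmbda=lam)      # apply the transformation
print(round(float(lam), 2))           # near 0, i.e. close to a log transform
print(round(float(stats.skew(y)), 2), round(float(stats.skew(y_t)), 2))
```

After transformation, the skewness drops sharply, which is exactly the behavior the Box-Cox procedure exploits when recommending (or not recommending) a transformation.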
2. Residual Analysis
Based on the information presented in the residual analysis table, the table provides insights into how well the model predicts the actual data and identifies potential issues like outliers or extreme values.
- pH H2O, Available P (mg kg-1), Al Saturation (%): These are the values of the predictor variables.
- TOTAL DRY WEIGHT (g/plant): Actual response value.
- Predicted: The value predicted by the regression model based on predictor variables.
- Residual: The difference between the actual response value and the predicted value. Large residuals indicate that the model does not predict the data well.
- Leverage: Measures how far the predictor value is from the average of other predictor values. Observations with high leverage may be extreme data points in the predictor space.
- Studentized Residual and Studentized Deleted Residual: These are standardized residuals. Large absolute values of these residuals indicate potential outliers.
- Cook's Distance: Measures the influence of each observation on all predicted values. Observations with high Cook's distance may strongly influence the model.
- DFITS: This statistic measures how much the prediction for an observation would change if that observation were deleted. Large DFITS values indicate that the observation is different from what is expected based on the model.
- Diagnostic: This column indicates whether an observation is considered an outlier or extreme.
From this table, the model's quality can be assessed, and additional steps such as removing outliers or applying a Box-Cox transformation can be considered to improve the model.
If the model shows symptoms of violating the regression assumptions, identifying and dealing with outliers or extreme values can help improve it; a Box-Cox transformation is worth considering especially when the Box-Cox analysis itself suggests one.
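The diagnostics in this table can all be derived from the hat matrix H = X(XᵀX)⁻¹Xᵀ. A sketch on synthetic data with one planted outlier; note that DFFITS conventionally uses the deleted (externally studentized) residual, while the version below uses the internal one for brevity:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x = rng.uniform(0, 10, n)
y = 5 + 2 * x + rng.normal(0, 1, n)
y[10] += 12                               # plant an outlier in the response

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
lev = np.diag(H)                          # leverage
mse = resid @ resid / (n - p)
stud = resid / np.sqrt(mse * (1 - lev))   # internally studentized residuals
cooks = stud ** 2 * lev / (p * (1 - lev)) # Cook's distance
dffits = stud * np.sqrt(lev / (1 - lev))  # DFFITS-style influence measure

print("largest |studentized residual| at index", int(np.argmax(np.abs(stud))))
```

The planted observation stands out on both the studentized residual and Cook's distance, which is exactly how the Diagnostic column flags outliers and influential points.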
Conclusion
From the multiple linear regression analysis conducted, a regression model was successfully developed to predict "TOTAL DRY WEIGHT (g/plant)" from the variables pH H2O, Available P (mg kg-1), and Al Saturation (%). This model can explain approximately 42.2% of the variability in TOTAL DRY WEIGHT. Moreover, residual analysis confirms that the model meets the basic regression assumptions, i.e., homoscedasticity and residual normality; consistent with these checks, the Box-Cox analysis indicates that no transformation is necessary. Furthermore, some observations considered outliers or extreme values have been identified and need to be taken into account in further analysis.
Writing Results and Discussion in Academic Work
In this study, a regression model has been successfully developed to understand the relationship between "TOTAL DRY WEIGHT (g/plant)" and the predictor variables pH H2O, Available P (mg kg-1), and Al Saturation (%). The resulting regression equation is:
Y = 148.8103 - 20.2959 × pH H2O + 1.4821 × Available P - 5.737 × Al Saturation
This model demonstrates that about 42.2% of the variability in "TOTAL DRY WEIGHT" can be explained by the predictor variables. This is indicated by the coefficient of determination (R2) which is 0.422.
Regression Assumption Checks through residual analysis indicate that the model meets the basic regression assumptions. The Box-Cox analysis likewise indicates that no transformation of the response is necessary.
In residual analysis, several observations were identified as outliers or have extreme values. This needs to be considered in further interpretation of the model's results and in the practical application of this regression model.