In previous tutorials, we have discussed various types of regression analyses, such as Simple Linear Regression, Multiple Linear Regression, Polynomial Regression, and Response Surface Regression. Each type of regression has specific criteria for variable selection and model formation. For example, simple linear regression only allows one independent variable, whereas multiple and polynomial regressions do not consider interactions between predictors.
However, in some situations, you may wish to create a more flexible model with different combinations of variables and interactions. This is where the Custom Regression Model in SmartstatXL comes in handy. With this model, you have the freedom to choose which predictor variables to include or exclude from the regression equation and define interactions between variables.
Example of a custom regression equation:
Y = Custom Regression
With this approach, you can tailor the regression model according to your analytical needs.
Case Example
There is a study on the effects of fertilizer and compost on several soil chemical properties, nutrient absorption, and yield. Below is a snippet of data from the study:
In this case example, let's assume we want to model the relationship between fertilizer dosage and compost dosage with Total Dry Weight (g/plant) and CEC (Cation Exchange Capacity). The method used is similar to the second-order response surface regression model, except that the quadratic variable for compost is not included in the model. The equation used has the following form:
Regression equation model Y = β₀ + β₁X₁ + β₂X₁² + β₃X₂ + β₄X₁X₂
Where: Y = Total Dry Weight (g/plant) and CEC, X₁ = Fertilizer, and X₂ = Compost
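To make the structure of this equation concrete, the following is a minimal sketch of how such a custom model could be fitted by ordinary least squares outside SmartstatXL. It assumes NumPy is available; the dose grid, the coefficients in `true_beta`, and the function name `design_matrix` are hypothetical and serve only to show that the five terms (and no X₂² term) are recovered.

```python
import numpy as np

def design_matrix(x1, x2):
    """Build the columns 1, X1, X1^2, X2, X1*X2 (deliberately no X2^2 column)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x1**2, x2, x1 * x2])

# Hypothetical dose grid and coefficients, NOT the data from the study.
fert, comp = np.meshgrid([0.0, 25.0, 50.0, 75.0], [0.0, 10.0, 20.0])
fert, comp = fert.ravel(), comp.ravel()
true_beta = np.array([40.0, 1.0, -0.01, 0.05, 0.002])

X = design_matrix(fert, comp)
y = X @ true_beta  # noise-free response, so the fit should recover true_beta

# Ordinary least squares estimate of beta0..beta4
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 4))
```

Dropping or adding a column in `design_matrix` is the code equivalent of including or excluding a term in the custom model.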
Steps for Custom Regression Analysis
- Activate the worksheet (Sheet) to be analyzed.
- Place the cursor on the dataset (for dataset creation, see Data Preparation).
- If the active cell is not on the dataset, SmartstatXL will automatically try to identify the dataset.
- Activate the SmartstatXL Tab
- Click Menu Regression > Custom Regression.
- SmartstatXL will display a dialog box to confirm if the dataset is correct (usually the dataset is automatically selected correctly).
- If correct, click Next Button
- Next, the Regression Analysis Dialog Box will appear. Select the Factor Variables (Independent) and one or more Response Variables (Dependent). The chosen factor variables depend on the type of regression analysis.
- Regression equation model: Y = β₀ + β₁X₁ + β₂X₁² + β₃X₂ + β₄X₁X₂
- Type of Regression: Custom Regression
- Order: 2
- Predictor Variables: Fertilizer Dosage and Compost Dosage
- Response Variables: Total Dry Weight (g/plant) and CEC
More details can be seen in the following dialog box view:
- Press the "Next" button
- Select the regression output as shown in the following view:
Suppose we want to create the following regression model:
Y = β₀ + β₁X₁ + β₂X₁² + β₃X₂ + β₄X₁X₂
In the listbox of variables to be included in the model, select all variables except Compost²:
- Press the OK button to generate the output in the Output Sheet
Analysis Results
Analysis Information: Type of Regression Used, Regression Method, Response and Predictors
Regression Equation
Interpretation and Discussion:
- Regression Equation: The regression equation obtained from the analysis is:
- Y = 37.3592 + 0.9789 × Fertilizer Dose − 0.0107 × Fertilizer Dose² + 0.02 × Compost Dose + 0.0035 × Fertilizer Dose × Compost Dose
- Where Y is the total dry weight (g/plant).
- Regression Coefficients:
- The constant is 37.3592. If both the fertilizer and compost doses are zero, the expected total dry weight is 37.3592 g/plant.
- The coefficient for fertilizer dose is 0.9789. Because the model also contains the squared term and the interaction, this is the change in total dry weight (g/plant) per unit of fertilizer only when both doses are zero; at other doses the slope is 0.9789 − 2 × 0.0107 × Fertilizer Dose + 0.0035 × Compost Dose.
- The coefficient for fertilizer dose² is −0.0107. The negative sign means the response curve is concave: the gain in total dry weight per unit of fertilizer shrinks as the dose increases and eventually turns negative.
- The coefficient for compost dose is 0.02, which is the change in total dry weight (g/plant) per unit of compost when the fertilizer dose is zero; at other fertilizer doses the slope is 0.02 + 0.0035 × Fertilizer Dose.
- The interaction coefficient between fertilizer and compost doses is 0.0035: each additional unit of compost raises the fertilizer slope by 0.0035, and vice versa. Whether this effect is statistically significant is assessed in the Estimation of Regression Coefficients section.
- Coefficient of Determination (R²):
- The R² value is 0.329, which means that this regression model can explain approximately 32.9% of the total variation in plant dry weight. Meanwhile, the adjusted R² is 0.280, accounting for the number of predictors in the model and providing a slightly more conservative estimate of how well the model explains the variation.
- F-Test and Significance:
- The F-statistic for this model is 6.735 with a significance (Sig) value of 0.000. Since the Sig value is less than 0.05, the overall regression model is significant, meaning that at least one of the model terms (fertilizer dose, fertilizer dose², compost dose, or their interaction) has a significant effect on the total dry weight of the plant.
- Conclusion:
- The overall regression model is significant, so at least one of its terms is related to the total dry weight of the plant; which terms are responsible is assessed in the Estimation of Regression Coefficients section. The model explains only about 32.9% of the total variation in plant dry weight, so going forward researchers may need to consider other factors that could affect plant dry weight to improve the model's accuracy.
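The interpretation of the coefficients above can be checked numerically. The sketch below uses the rounded coefficients from the reported equation; the function names and the doses at which the slopes are evaluated are illustrative choices, not SmartstatXL output.

```python
# Rounded coefficients taken from the reported regression equation.
B0, B1, B2, B3, B4 = 37.3592, 0.9789, -0.0107, 0.02, 0.0035

def predict(fert, comp):
    """Predicted total dry weight (g/plant) for given fertilizer and compost doses."""
    return B0 + B1 * fert + B2 * fert**2 + B3 * comp + B4 * fert * comp

def slope_fertilizer(fert, comp):
    """dY/dFertilizer: depends on both doses because of the X1^2 and X1*X2 terms."""
    return B1 + 2 * B2 * fert + B4 * comp

def slope_compost(fert):
    """dY/dCompost: depends on the fertilizer dose through the interaction term."""
    return B3 + B4 * fert

# Illustrative doses only: the fertilizer slope shrinks as the dose grows.
for fert in (0, 25, 50):
    print(fert, round(slope_fertilizer(fert, 0), 4), round(slope_compost(fert), 4))
```

The fertilizer slope equals the coefficient 0.9789 only at zero doses, which is why the "per unit increase" reading of a single coefficient does not hold across the whole dose range.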
Model Goodness of Fit
Interpretation and Discussion:
- Correlation Coefficient (r): The multiple correlation coefficient is 0.5734 (the square root of R²), indicating a moderate association between the observed total dry weight and the values predicted from fertilizer dose and compost dose.
- Coefficient of Determination (R²): The R² value is 0.3288, meaning the regression model accounts for approximately 32.88% of the total variation in plant dry weight.
- Adjusted Coefficient of Determination (Adjusted R²): The adjusted R² value is 0.2800, providing a more conservative estimate of how well the model explains the variation, considering the number of predictors in the model.
- Information Criteria:
- AIC (Akaike Information Criterion): The AIC value is 317.2693. AIC is used to compare the goodness of fit of regression models. Models with a lower AIC are considered better.
- AICc (Akaike Information Criterion with Correction): The AICc value is 318.3804. Like AIC, AICc is used for model comparison, but it adds a correction for small sample sizes relative to the number of parameters.
- BIC (Bayesian Information Criterion): The BIC value is 327.7411. Like AIC, BIC is also used to compare models but imposes a larger penalty for models with more parameters.
- Error Measures:
- RMSE (Root Mean Square Error): The RMSE value is 13.5188, the square root of the mean squared error. It expresses the typical size of the difference between predicted and actual values in the units of the response (g/plant).
- MAE (Mean Absolute Error): The MAE value is 10.1628, indicating the average absolute error between the predicted and actual values.
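- MPE (Mean Percentage Error): The MPE value is -0.1143 (-11.43%), the average signed percentage error. Its sign shows the direction of the bias, which depends on how the error is defined: with the common definition (actual − predicted) / actual, a negative value means that the predictions tend to exceed the actual values in relative terms.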
- MAPE (Mean Absolute Percentage Error): The MAPE value is 0.2710 or 27.10%, indicating the average absolute percentage error.
- sMAPE (Symmetric Mean Absolute Percentage Error): The sMAPE value is 0.2209 or 22.09%, providing a more symmetric measure of error compared to MAPE.
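The error measures listed above can be computed as in the following sketch. The definitions are common textbook conventions and are assumptions here, since the exact formulas used by SmartstatXL are not stated; in particular the sign convention of MPE and the denominator of sMAPE vary between sources. The `actual` and `predicted` lists are made-up numbers, not data from the study.

```python
import math

def error_measures(actual, predicted, n_params=0):
    """Return RMSE, MAE, MPE, MAPE and sMAPE (as fractions, not percentages).

    n_params > 0 divides the squared error by n - n_params (residual
    degrees of freedom) instead of n when computing RMSE.
    """
    n = len(actual)
    resid = [a - p for a, p in zip(actual, predicted)]  # actual minus predicted
    sse = sum(e * e for e in resid)
    return {
        "rmse": math.sqrt(sse / (n - n_params)),
        "mae": sum(abs(e) for e in resid) / n,
        "mpe": sum(e / a for e, a in zip(resid, actual)) / n,
        "mape": sum(abs(e / a) for e, a in zip(resid, actual)) / n,
        "smape": sum(2 * abs(e) / (abs(a) + abs(p))
                     for e, a, p in zip(resid, actual, predicted)) / n,
    }

# Made-up values for illustration only.
actual = [50.0, 60.0, 40.0, 80.0]
predicted = [55.0, 54.0, 44.0, 76.0]
measures = error_measures(actual, predicted)
print(measures)

# The reported RMSE is consistent with dividing by the error degrees of freedom:
rmse_reported = math.sqrt(10051.7299 / 55)  # error SS and DF from the ANOVA table
print(round(rmse_reported, 4))
```

The last two lines suggest that the reported RMSE of 13.5188 is the square root of the error mean square (182.7587), that is, the residual standard error.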
Conclusion:
The regression model has a moderate goodness of fit in explaining the variation in plant dry weight based on fertilizer and compost doses. Although this model has some errors in prediction, error measures like RMSE, MAE, and MAPE provide insights into the model's accuracy in predicting plant dry weight. For future research, it may be necessary to consider other factors or make adjustments to the model to improve the fit.
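The information criteria reported above can be reproduced from the error sum of squares in the ANOVA table. The sketch below uses the least-squares forms AIC = n·ln(SSE/n) + 2k and BIC = n·ln(SSE/n) + k·ln(n); that SmartstatXL uses exactly these forms is an assumption, although they match the reported values closely. The sample size n = 60 is inferred from the degrees of freedom (4 + 55 + 1).

```python
import math

n = 60            # observations, inferred from DF: 4 (regression) + 55 (error) + 1
k = 5             # estimated coefficients, including the intercept
sse = 10051.7299  # error sum of squares from the ANOVA table

# Least-squares forms of the criteria (assumed, see the note above).
aic = n * math.log(sse / n) + 2 * k
aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
bic = n * math.log(sse / n) + k * math.log(n)

print(round(aic, 4), round(aicc, 4), round(bic, 4))
```

Only differences in these values between candidate models fitted to the same data are meaningful; a single AIC or BIC value on its own says nothing about fit quality.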
Estimation of Regression Coefficients
Interpretation and Discussion:
- Intercept:
- The Intercept coefficient is 37.359 with a standard error of 5.351.
- The T-value for the intercept is 6.982 with a p-value of 0.000, indicating that the intercept is significant at the 1% level.
- The 95% confidence interval for the intercept ranges from 26.636 to 48.082.
- Fertilizer Dose:
- The coefficient for Fertilizer Dose is 0.979 with a standard error of 0.245.
- The T-value for the fertilizer dose is 3.989 with a p-value of 0.000, indicating that the fertilizer dose is significant at the 1% level in affecting the total dry weight.
- The 95% confidence interval for the fertilizer dose ranges from 0.487 to 1.471.
- The VIF (Variance Inflation Factor) for the fertilizer dose is 14.252, indicating a fairly high level of multicollinearity. Generally, a VIF above 10 indicates the presence of multicollinearity.
- Fertilizer Dose²:
- The coefficient for Fertilizer Dose² is -0.011 with a standard error of 0.003.
- The T-value for the squared fertilizer dose is -3.540 with a p-value of 0.001, indicating that the squared fertilizer dose is significant at the 1% level.
- The 95% confidence interval ranges from -0.017 to -0.005.
- The VIF for the squared fertilizer dose is 12.252, also indicating the presence of multicollinearity.
- Compost Dose:
- The coefficient for Compost Dose is 0.020 with a standard error of 0.413.
- The T-value for the compost dose is 0.048 with a p-value of 0.962, indicating that the compost dose is not significant in affecting the total dry weight.
- The 95% confidence interval ranges from -0.808 to 0.848.
- The VIF for the compost dose is 2.800, indicating no significant multicollinearity.
- Interaction between Fertilizer Dose and Compost Dose:
- The interaction coefficient between fertilizer dose and compost dose is 0.004 with a standard error of 0.009.
- The T-value for this interaction is approximately 0.385 (the square root of the ANOVA F-value of 0.148) with a p-value of 0.702, indicating that the interaction is not significant in affecting the total dry weight.
- The 95% confidence interval ranges from -0.014 to 0.022.
- The VIF for the interaction is 2.100, also indicating no significant multicollinearity.
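The T-values and confidence intervals in the list above follow from each coefficient and its standard error. The sketch below recomputes them from the rounded table values; the critical value 2.004 for t(0.975, 55 degrees of freedom) is taken from a t-table and hard-coded, and the dictionary `terms` is an illustrative structure. Terms whose standard errors are rounded to a single significant digit are left out because the ratio would be too imprecise.

```python
T_CRIT = 2.004  # approximate t(0.975, df=55), hard-coded from a t-table

# (coefficient, standard error) as reported, rounded to three decimals
terms = {
    "Intercept": (37.359, 5.351),
    "Fertilizer Dose": (0.979, 0.245),
    "Compost Dose": (0.020, 0.413),
}

def t_and_ci(coef, se, t_crit=T_CRIT):
    """Return (t-value, lower 95% limit, upper 95% limit)."""
    return coef / se, coef - t_crit * se, coef + t_crit * se

results = {name: t_and_ci(coef, se) for name, (coef, se) in terms.items()}
for name, (t, lo, hi) in results.items():
    print(f"{name}: t = {t:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Small differences from the reported values are expected because the inputs are rounded. An interval that contains zero, as for compost dose, corresponds to a non-significant term at the 5% level.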
Conclusion:
Both fertilizer dose and its squared term significantly affect plant dry weight. The compost dose and the interaction between fertilizer and compost doses, however, do not show a significant effect. The high VIF values belong to fertilizer dose and fertilizer dose², which is expected because a variable and its square are strongly correlated; this inflates their standard errors and may warrant further investigation.
3D Regression Graph and Optimization
Optimization of Fertilizer and Compost Dose for Maximizing Total Dry Weight:
The aim of optimization is to identify the doses of fertilizer and compost that maximize the yield, in this case the total dry weight of the plant. Mathematically, the maximum or minimum of a function is found by differentiation: the points where the first partial derivatives are all zero are stationary points, which may be maxima, minima, or saddle points, and the boundaries of the dose range must be checked as well. SmartstatXL performs this optimization automatically.
SmartstatXL reports that the maximum predicted total dry weight is reached at a fertilizer dose of 48.983 and a compost dose of 20.000. Conversely, if neither fertilizer nor compost is administered (both 0.000), the predicted total dry weight is 37.359 g/plant, the intercept of the model, which SmartstatXL reports as the minimum value of the fitted model.
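The reported optimum can be approximated by hand from the rounded coefficients. In this model compost enters only linearly and through the interaction, so dY/dCompost = 0.02 + 0.0035 × Fertilizer is positive for every non-negative fertilizer dose and the best compost dose lies at the upper end of the range. The sketch below assumes that this upper end is 20 (the reported optimum; the actual tested range is not stated here) and then solves dY/dFertilizer = 0.

```python
# Rounded coefficients from the reported regression equation.
B0, B1, B2, B3, B4 = 37.3592, 0.9789, -0.0107, 0.02, 0.0035
COMPOST_MAX = 20.0  # ASSUMED upper limit of the compost doses

def predict(fert, comp):
    """Predicted total dry weight (g/plant)."""
    return B0 + B1 * fert + B2 * fert**2 + B3 * comp + B4 * fert * comp

def best_fertilizer(comp):
    """Solve dY/dF = B1 + 2*B2*F + B4*comp = 0 for F (a maximum since B2 < 0)."""
    return -(B1 + B4 * comp) / (2 * B2)

fert_opt = best_fertilizer(COMPOST_MAX)
y_opt = predict(fert_opt, COMPOST_MAX)
print(round(fert_opt, 3), round(y_opt, 3))
```

With rounded coefficients the fertilizer optimum comes out near 49.0, close to the 48.983 reported by SmartstatXL; the small gap is presumably due to rounding of the coefficients.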
Analysis of Variance in Regression
Interpretation and Discussion of Analysis of Variance:
- Regression:
- With 4 degrees of freedom (DF), the regression has a sum of squares (SS) of 4923.7600 and a mean square (MS) of 1230.9400.
- The F-value for regression is 6.735 with a p-value of 0.000. Since the F-value is greater than the critical values F-0.05 (2.540) and F-0.01 (3.681), the regression model is significant at the 1% level.
- Fertilizer Dose:
- With a DF of 1, the fertilizer dose has a sum of squares of 2907.9798.
- The F-value for the fertilizer dose is 15.912 with a p-value of 0.000. This indicates that the fertilizer dose significantly affects the total dry weight at the 1% level.
- Fertilizer Dose²:
- With a DF of 1, the square of the fertilizer dose has a sum of squares of 2290.3082.
- The F-value for the square of the fertilizer dose is 12.532 with a p-value of 0.001. This indicates that the square of the fertilizer dose also significantly affects the total dry weight at the 1% level.
- Compost Dose:
- With a DF of 1, the compost dose has a sum of squares of 0.4282.
- The F-value for the compost dose is 0.002 with a p-value of 0.962. This indicates that the compost dose does not significantly affect the total dry weight.
- Interaction between Fertilizer and Compost Doses:
- With a DF of 1, the interaction between the fertilizer and compost doses has a sum of squares of 27.0991.
- The F-value for this interaction is 0.148 with a p-value of 0.702. This indicates that the interaction between the fertilizer and compost doses does not significantly affect the total dry weight.
- Error:
- With a DF of 55, the error has a sum of squares of 10051.7299 and a mean square (MS) of 182.7587.
- Model Deviation (lack of fit):
- With a DF of 15, the model deviation has a sum of squares of 6689.3165 and a mean square of 445.9544. The F-value for model deviation is 5.305 with a p-value of 0.000, so the lack of fit is significant at the 1% level. This means the chosen equation does not fully describe how the mean dry weight changes across the dose combinations, even though the overall regression is significant.
- Pure Error:
- With a DF of 40, the pure error has a sum of squares of 3362.4133 and a mean square of 84.0603.
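The quantities in the ANOVA table are linked by simple arithmetic, which the following sketch retraces from the reported sums of squares and degrees of freedom. The variable names are illustrative.

```python
# Sums of squares (SS) and degrees of freedom (DF) from the ANOVA table.
ss_reg, df_reg = 4923.7600, 4      # regression
ss_err, df_err = 10051.7299, 55    # error
ss_lof, df_lof = 6689.3165, 15     # model deviation (lack of fit)
ss_pe, df_pe = 3362.4133, 40       # pure error

ms_reg = ss_reg / df_reg           # mean square = SS / DF
ms_err = ss_err / df_err
f_model = ms_reg / ms_err          # overall F-test of the regression

# Lack of fit is tested against pure error, not against the total error.
f_lof = (ss_lof / df_lof) / (ss_pe / df_pe)

# R-squared and adjusted R-squared from the same table.
n = df_reg + df_err + 1
r2 = ss_reg / (ss_reg + ss_err)
adj_r2 = 1 - (1 - r2) * (n - 1) / df_err

print(round(f_model, 3), round(f_lof, 3), round(r2, 4), round(adj_r2, 4))
```

The error sum of squares splits into lack of fit plus pure error (6689.3165 + 3362.4133), which is possible because the design contains replicated dose combinations.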
Conclusion:
- The fertilizer dose and the square of the fertilizer dose significantly affect the total dry weight at the 1% level.
- The compost dose and the interaction between the fertilizer and compost doses do not significantly affect the total dry weight.
Assumption Checks
Interpretation and Discussion:
Test for Homoscedasticity:
- The calculated χ² value is 9.008 with a p-value of 0.061.
- Because the p-value exceeds the standard significance threshold of 0.05 (0.061 > 0.05), we do not reject the null hypothesis of constant variance. The test therefore gives no significant evidence of heteroscedasticity, although the p-value is close to the threshold; the variance of the regression residuals can be treated as constant across the values of the independent variables.
- Therefore, the assumption of homoscedasticity for the regression model is not violated.
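The p-value of the χ² statistic can be checked with the standard library alone. The sketch below assumes 4 degrees of freedom, one per non-intercept term in the model, as in a Breusch-Pagan-type test; the source does not name the test or its degrees of freedom, so this is an assumption. For an even number of degrees of freedom the upper-tail probability of the χ² distribution has a closed form, which the hypothetical helper `chi2_sf_even_df` implements.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))

p_value = chi2_sf_even_df(9.008, 4)  # df = 4 is an assumption
print(round(p_value, 3))
```

Under this assumption the computed p-value agrees with the reported 0.061, which sits just above the 0.05 threshold.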
Test for Normality
- Shapiro-Wilk: With a statistic of 0.981 and a p-value of 0.456, the residual distribution is considered normal because 0.456 > 0.05.
- Anderson-Darling: With a statistic of 0.384 and a p-value of 0.395, the residual distribution is considered normal.
- D'Agostino-Pearson: With a statistic of 0.340 and a p-value of 0.844, the residual distribution is considered normal.
- Lilliefors: With a statistic of 0.069 and a p-value greater than 0.20, the residual distribution is considered normal.
- Kolmogorov-Smirnov: With a statistic of 0.069 and a p-value greater than 0.20, the residual distribution is considered normal.
All the Normality Tests indicate that the residuals are normally distributed. Therefore, the assumption of normal distribution of regression errors (residuals) is not violated.
Conclusion:
- The assumption of homoscedasticity for the regression model is met.
- The assumption of normal distribution of regression errors (residuals) is also met.
- Therefore, the proposed regression model satisfies the basic regression assumptions and can be used for further analysis.
Residual Plot
In addition to formal tests, visual examination of the assumption of normality can also be performed using the accompanying residual plot. Checks can be carried out using the Normal Probability Plot (Normal P-Plot), Histogram, and Residual vs. Predicted Plot.
- Normal P-Plot for Residuals:
- A Normal Probability Plot compares the ordered residuals with the values expected under a normal distribution. Ideally, the points on this plot should follow a straight diagonal line. If the points deviate from the diagonal line, this may indicate a departure from normality.
- The fact that the points closely follow a straight diagonal line suggests that the residuals are approximately normally distributed across most of the range of values. This is a good sign and indicates that the assumption of residual normality is largely met.
- Histogram for Residuals:
- The histogram should display a distribution that closely resembles a bell curve (normal distribution). Deviations from this shape (e.g., a skewed or long-tailed distribution) may indicate a violation of the normality assumption.
- Residuals vs Predicted:
- To check for homoscedasticity, the points on this plot should be randomly dispersed around a horizontal line at 0 without any specific pattern. If a particular pattern is observed, such as a funnel shape or a curve, this could indicate heteroscedasticity or other violations of regression assumptions.
Box-Cox Transformation and Residual Analysis
The Box-Cox transformation is used to make non-normally distributed data approximately normal and to stabilize the variance. The transformation parameter estimated by the Box-Cox method is λ (lambda), which in this case is 1.599. Rounded to the nearest conventional power (λ = 2), this corresponds to the "Square Transformation", that is, Y². Applying it may help to further stabilize the residual variance.
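As a small illustration of how an estimated λ maps to a named transformation, the sketch below implements the standard Box-Cox formula and rounds λ to the nearest value in a list of conventional powers. The list `CONVENIENT` and the rounding rule are common textbook practice and an assumption about how SmartstatXL arrives at its suggestion.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y: (y^lam - 1) / lam, or ln(y) if lam = 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    return (y**lam - 1) / lam

# Conventional powers and their usual names (assumed list).
CONVENIENT = {
    -1: "reciprocal",
    -0.5: "reciprocal square root",
    0: "log",
    0.5: "square root",
    1: "no transformation",
    2: "square",
}

def nearest_convenient(lam):
    """Return the conventional power closest to the estimated lambda."""
    return min(CONVENIENT, key=lambda c: abs(c - lam))

lam_hat = 1.599  # value reported in the text
power = nearest_convenient(lam_hat)
print(power, CONVENIENT[power])
```

For λ = 1.599 the nearest conventional power is 2, which matches the square transformation suggested in the text.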
Interpretation of Residual Values and Outlier Data Examination:
Examination of residuals and outlier data is crucial to ensure that the regression model meets its assumptions.
- Fertilizer Dose and Compost Dose: These are the doses of fertilizer and compost given in the experiment.
- TOTAL DRY WEIGHT (g/plant): This is the actual result obtained from the experiment.
- Predicted: This is the value predicted by the regression model based on the fertilizer and compost doses.
- Residual: This is the difference between the actual result and the value predicted by the model.
- Leverage: Leverage values indicate how far an observation's predictor values are from the mean of the predictors. Observations with high leverage can have a strong influence on the fitted model.
- Studentized Residual: The residual divided by an estimate of its standard deviation. Values far from 0 (for example, beyond ±2) may indicate the potential for outliers.
- Studentized Deleted Residual: Similar to studentized residuals, but with the influence of the data point itself removed. Values far from 0 may also indicate the potential for outliers.
- Cook's Distance: Measures the influence of a particular data point on the entire model. High values may indicate potential outliers with a significant impact on the model.
- DFITS: Similar to Cook's Distance, it measures the influence of a particular data point, but on a different scale.
- Diagnostic: Combines various metrics to provide an overall view of potential outliers.
From the residual analysis data, there appear to be some data points with high residual, studentized residual, and studentized deleted residual values, which may indicate the presence of potential outliers. For example, in Fertilizer Dose 0, Compost Dose 10:
- The observed total dry weight is 67.6 g/plant, with a predicted value of 37.5591 g/plant, resulting in a residual of 30.0409.
- The Studentized Residual and Studentized Deleted Residual are 2.2960 and 2.3926, respectively.
- Based on the Studentized Residual value greater than 2, this observation can be considered an outlier.
However, before making any decisions, it is important to further investigate these values and consider the experimental context.
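The diagnostic values for the example observation are linked by standard formulas, which the sketch below retraces. It takes n = 60 and p = 5 coefficients from the degrees of freedom of the model and uses the reported RMSE of 13.5188 as the residual standard error. The leverage is backed out from the reported studentized residual rather than read from the output, so it is a derived, approximate value.

```python
import math

n, p = 60, 5          # observations and estimated coefficients (from the DF)
residual = 30.0409    # observed minus predicted, from the example above
s = 13.5188           # residual standard error (reported RMSE)
r = 2.2960            # reported studentized residual

# Studentized residual: r = residual / (s * sqrt(1 - h)); solve for the leverage h.
leverage = 1 - (residual / (s * r)) ** 2

# Studentized deleted residual from r, without refitting the model.
t_deleted = r * math.sqrt((n - p - 1) / (n - p - r**2))

print(round(leverage, 3), round(t_deleted, 4))
```

The computed deleted residual agrees with the reported 2.3926, and the implied leverage is below the average leverage p/n, so this observation stands out because of its large residual rather than an unusual dose combination.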
Conclusion
- Regression Model:
- The regression model developed to predict the total dry weight (g/plant) based on the fertilizer and compost doses shows moderate goodness-of-fit with an R² value of 0.3288.
- Fertilizer dose and fertilizer dose squared significantly affect the total dry weight at the 1% significance level, while the compost dose and the interaction between fertilizer and compost doses do not significantly affect the total dry weight.
- Optimization of Total Dry Weight:
- To achieve the maximum total dry weight in plants, it is recommended to administer a fertilizer dose of 48.983 and a compost dose of 20.000.
- Regression Assumption Checks:
- The assumption of homoscedasticity for the regression model is satisfied.
- The assumption of normal distribution of the regression errors (residuals) is also met.
- Box-Cox Transformation:
- Based on the Box-Cox transformation, it is suggested to use a square transformation for the data to meet the assumption of homoscedasticity.
- Outlier Data Examination:
- Some observations are identified as outlier data based on high Studentized Residual values.
Reporting Results and Discussion in Scientific Works
In this study, regression analysis was conducted to explore the influence of fertilizer and compost doses on the total dry weight of plants. Based on the analysis results, it was found that fertilizer dose and fertilizer dose squared have a significant impact on the total dry weight at the 1% significance level. However, compost dose and the interaction between fertilizer and compost doses do not exert a significant influence.
The developed regression model shows moderate goodness-of-fit, with a coefficient of determination (R²) of 0.3288. This indicates that the model can explain approximately 32.88% of the total variation in the plant's dry weight.
Furthermore, to optimize the total dry weight of plants, it is suggested to administer a fertilizer dose of 48.983 and a compost dose of 20.000. The administration of doses with this combination is expected to yield the maximum dry weight in plants.
Regression Assumption Checks show that the data meet the assumption of homoscedasticity and normal distribution of regression errors. However, based on the Box-Cox transformation, it is suggested to use square transformation for the data to better satisfy the assumption of homoscedasticity.
Lastly, outlier data examination reveals the presence of some observations that could potentially be considered outliers. This should be taken into account in further analysis or in the field application of the model.