SmartstatXL is an Excel Add-In that allows users to perform regression analysis, a method of modeling the relationship between independent and dependent variables. One type of regression that can be analyzed with SmartstatXL is Simple Linear Regression.
Simple linear regression analysis is used as a statistical inference tool to determine the influence of an independent variable on a dependent variable. The equation form is as follows:
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
In this analysis, the relationship between the variables is linear, meaning that a change in variable X will be followed by a proportional change in variable Y. In contrast, in a non-linear relationship, a change in variable X does not necessarily result in a proportional change in variable Y.
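The least-squares estimates of \( \beta_0 \) and \( \beta_1 \) have a closed-form solution. As an illustration only (this is not SmartstatXL's internal code), a minimal Python sketch:

```python
# A minimal sketch of ordinary least squares for Y = b0 + b1*X + e,
# using the closed-form solution. SmartstatXL performs this estimation
# internally; this is purely illustrative.

def fit_simple_linear(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx        # slope estimate
    b0 = my - b1 * mx     # intercept estimate
    return b0, b1

# Toy data generated from y = 2 + 3x (no noise), so the fit is exact.
b0, b1 = fit_simple_linear([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
print(b0, b1)  # → 2.0 3.0
```

Because the toy data lie exactly on a line, the fitted coefficients recover the generating values; with real data, residual error remains and is quantified by the fit statistics discussed later.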
Key features of simple linear regression analysis with SmartstatXL include:
- Missing data calculation.
- Regression Diagnostics: Normality Test, Heteroskedasticity Test, Residual Plot, and Box-Cox Transformation.
- Automatic identification and replacement of outlier data.
- Automatic data transformation.
- Output includes:
  - Regression Equation.
  - Regression Statistics/Goodness of Fit: \( R^2 \), Adjusted \( R^2 \), Correlation Coefficient, AIC, AICc, BIC, RMSE, MAE, MPE, MAPE, and sMAPE.
  - Coefficient Estimates: Coefficient Value, Standard Error, t-Statistic, p-Value, Upper/Lower Confidence Limits, and VIF.
  - Analysis of Variance: Sequential and Partial.
  - Graphs: 2D and 3D plots for Response Surface, as well as Optimization (Maximum and Minimum).
Case Example
A study has been conducted on the characteristics of various soil properties based on two types of parent materials, Andesitic and Basaltic. Below are the data for several soil properties on these two parent materials:
In this case example, we can create a simple linear regression model to examine the influence of soil depth (Depth_m) on these soil properties.
Steps for Simple Linear Regression Analysis
- Activate the worksheet (Sheet) to be analyzed.
- Place the cursor on the dataset (for creating the dataset, see Data Preparation guide).
- If the active cell is not on the dataset, SmartstatXL will automatically try to identify the dataset.
- Activate the SmartstatXL Tab
- Click on the Menu Regression > Simple Linear Regression.
- SmartstatXL will display a dialog box to confirm that the dataset is correct (usually the dataset is selected correctly automatically).
- If it is correct, click the Next button.
- A Regression Analysis dialog box will appear. Choose the Factor Variable (Independent) and one or more Response Variables (Dependent). The choice of factor variable depends on the type of regression analysis.
- Simple Linear Regression Model: \[ Y = \beta_0 + \beta_1 X \]
- Type of Regression: "Simple Linear Regression"
- Response Variables: "Sand" through "Permeability"
- Factor Variable: "Depth_m (cm)"
For more details, see the following dialog box:
- Press the "Next" button
- Select the regression output as shown below:
- Press the OK button to generate the output in the Output Sheet
Analysis Results
Analysis Information: type of regression used, regression method, response, and predictor.
There is a violation of the Regression Assumptions for the Permeability parameter (see the Regression Assumption Check section): the residuals are not normally distributed. SmartstatXL successfully found a solution through a square-root transformation.
The interpretation of the simple linear regression analysis results can be outlined as follows:
- Regression Equation: The regression equation \(Y = 37.6776 - 0.2771 \times \text{Depth_m}\) indicates the relationship between soil depth (\(\text{Depth_m}\)) and sand content (SAND) in the soil. The coefficient -0.2771 indicates that for every one-unit increase in soil depth (cm), the sand content will decrease by 0.2771 units.
- Coefficient of Determination (\(R^2\)): The \(R^2\) value of 0.463 indicates that soil depth can explain about 46.3% of the variation in sand content. The remainder (53.7%) is explained by other factors not included in this model.
- Correlation Coefficient (r): The correlation coefficient \(r\) value of 0.680 indicates that there is a moderate positive linear relationship between soil depth and sand content.
- Adjusted \(R^2\): The adjusted \(R^2\) of 0.445 gives a more honest picture of how well the model predicts the response, because it penalizes the number of predictors. In this case, the adjusted \(R^2\) is only slightly lower than the original \(R^2\), as expected for a single-predictor model, and remains consistent with soil depth being a meaningful predictor of sand content.
- F-Test: The F-value of 25.842 with a significance level (Sig) of 0.00 indicates that the overall regression model is significant in predicting sand content based on soil depth.
When writing in a scientific article, it can be written in a formal style:
The results of the simple linear regression analysis show a significant relationship between soil depth (Depth_m) and sand content (SAND). The Regression Equation obtained is \(Y=37.6776 - 0.2771 \times \text{Depth_m}\). This indicates that for every one cm increase in soil depth, the sand content in the soil will decrease by 0.2771 units.
The Coefficient of Determination (\(R^2\)) of 0.463 indicates that soil depth can explain about 46.3% of the variation in sand content. Meanwhile, the correlation coefficient \(r\) of 0.680 shows a moderate positive linear relationship between the two variables.
The F-test gives a value of 25.842 with a p-value of 0.000 (p < 0.01), indicating that the constructed regression model is reliable in predicting sand content based on soil depth.
Model Goodness of Fit
This section reports several statistical measures of regression goodness-of-fit, also referred to as model fit or model adequacy.
Here is the interpretation of the regression goodness-of-fit results:
- Correlation Coefficient (r): With a correlation value of \( r = 0.6803 \), it indicates a moderate positive linear relationship between soil depth and sand content.
- Coefficient of Determination (\( R^2 \)): The \( R^2 = 0.4628 \) indicates that soil depth explains about 46.28% of the variation in sand content. The remainder (53.72%) is explained by other factors not included in this model.
- Adjusted \( R^2 \): The adjusted \( R^2 \) of 0.4449 is more accurate in describing how well the model predicts the response in the sample, especially if there are multiple predictors.
- AIC (Akaike Information Criterion): An AIC of 154.7716 is a measure of the model's goodness-of-fit. The lower the AIC value, the better the model. AIC takes into account both the model fit and the number of parameters in the model.
- AICc (Akaike Information Criterion with a correction for small sample sizes): An AICc of 155.1854 is a correction from AIC for smaller samples. It is usually used when the sample size is small relative to the number of parameters.
- BIC (Bayesian Information Criterion): A BIC of 157.7031 is also a measure of model fit, but with a larger penalty for models with more parameters. Similar to AIC, the lower the BIC value, the better the model.
- RMSE (Root Mean Square Error): An RMSE of 10.8925 is the square root of the average squared error between the values predicted by the model and the actual values.
- MAE (Mean Absolute Error): An MAE of 9.0949 is the average of the absolute errors between the predicted values and the actual values.
- MPE (Mean Percentage Error): An MPE of -0.3403 measures the average error in percentage between the predicted values and the actual values.
- MAPE (Mean Absolute Percentage Error): A MAPE of 0.6664 is the average absolute error in percentage between the predicted values and the actual values.
- sMAPE (symmetric Mean Absolute Percentage Error): An sMAPE of 0.4960 is another measure of prediction error in percentage that considers both overestimation and underestimation errors equally.
In this context, it's important to evaluate how well the model is performing and to consider the most relevant goodness-of-fit measures relative to the original objectives. This information can help decide whether the model is adequate or if further adjustments are needed.
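To make these measures concrete, here is a hedged Python sketch computing several of them from observed values and model predictions. Note that AIC is defined only up to an additive constant that differs between software packages, so the absolute AIC value from this sketch need not match SmartstatXL's table; only differences between models fitted to the same data are comparable.

```python
import math

def fit_stats(y, y_hat, k=2):
    """Common goodness-of-fit measures from observed values and predictions.
    k = number of estimated mean parameters (intercept + slope here)."""
    n = len(y)
    resid = [yi - fi for yi, fi in zip(y, y_hat)]
    sse = sum(e * e for e in resid)                 # sum of squared errors
    my = sum(y) / n
    sst = sum((yi - my) ** 2 for yi in y)           # total sum of squares
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    rmse = math.sqrt(sse / n)
    mae = sum(abs(e) for e in resid) / n
    mape = sum(abs(e / yi) for e, yi in zip(resid, y)) / n
    # One common AIC formulation for least-squares fits; the additive
    # constant varies by software, so compare models, not absolute values.
    aic = n * math.log(sse / n) + 2 * (k + 1)
    return {"R2": r2, "AdjR2": adj_r2, "RMSE": rmse,
            "MAE": mae, "MAPE": mape, "AIC": aic}

# Toy example with hypothetical values, not the soil dataset.
stats = fit_stats([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
print(round(stats["R2"], 4))  # → 0.95
```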
Regression Coefficients
Below is the interpretation of the Simple Linear Regression analysis results:
- Intercept:
- Coefficient: The coefficient for the intercept is 37.678. This indicates that when the soil depth (\( \text{Depth_m} \)) is 0 cm, the expected sand content is approximately 37.678 units.
- Standard Error: The standard error of the intercept coefficient is 3.686.
- T-Statistic and P-Value: With a t-statistic of 10.222 and a p-value of 0.000 (less than 0.01), the intercept is significant at the 1% significance level.
- 95% Confidence Interval: We can be 95% confident that the true value of the intercept lies between 30.150 and 45.205.
- Depth_m (cm):
- Coefficient: The coefficient for \( \text{Depth_m} \) is -0.277. This indicates that for each 1 cm increase in soil depth, the expected sand content will decrease by approximately 0.277 units.
- Standard Error: The standard error of the \( \text{Depth_m} \) coefficient is 0.055.
- T-Statistic and P-Value: With a t-statistic of -5.084 and a p-value of 0.000 (less than 0.01), the soil depth variable is significant in predicting sand content at the 1% significance level.
- 95% Confidence Interval: We can be 95% confident that the decrease in sand content for each 1 cm increase in soil depth lies between 0.388 units and 0.166 units.
- VIF (Variance Inflation Factor): The VIF for \( \text{Depth_m} \) is 1.000. In a simple linear regression with a single predictor, the VIF is always exactly 1, so there is no multicollinearity issue by construction.
From the analysis above, we can conclude that soil depth has a significant and negative relationship with sand content. Furthermore, this regression model appears to have no issues with multicollinearity.
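The standard errors and t-statistics above follow from standard formulas for simple linear regression. A hedged sketch (using hypothetical toy data, and omitting p-values, which would additionally require the t distribution's CDF):

```python
import math

def coef_inference(x, y):
    """Estimates, standard errors, and t statistics for the intercept and
    slope of a simple linear regression. Illustrative sketch only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)                               # residual mean square
    se_b1 = math.sqrt(mse / sxx)                      # SE of the slope
    se_b0 = math.sqrt(mse * (1 / n + mx ** 2 / sxx))  # SE of the intercept
    return (b0, se_b0, b0 / se_b0), (b1, se_b1, b1 / se_b1)

# Classic textbook-style toy data (not the soil dataset).
(b0, se0, t0), (b1, se1, t1) = coef_inference(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
print(round(b1, 4), round(t1, 4))  # → 0.6 2.1213
```

A coefficient is judged significant when its t statistic is large in absolute value relative to the t distribution with n - 2 degrees of freedom, which is exactly the comparison behind the p-values in the table above.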
Regression Graph and Analysis of Variance (ANOVA)
The Analysis of Variance (ANOVA) results measure how much of the variation in the response variable (in this case, SAND) can be explained by the independent variable (soil depth). Here is the interpretation:
- Regression:
- Degrees of Freedom (DF): The degrees of freedom for the regression is 1, indicating there is one independent variable in the model.
- Sum of Squares (SS): The sum of squares for the regression is 3066.1184, which measures the total variation in SAND explained by soil depth.
- Mean Square (MS): The mean square for the regression is 3066.1184 (SS divided by DF; with one degree of freedom it equals the regression SS). It serves as the numerator of the F-statistic.
- F-Statistic and P-Value: The F-statistic for the regression is 25.842 with a p-value of 0.000. Since the p-value is less than 0.01, the soil depth variable is significant in predicting sand content at the 1% significance level.
- Depth_m (cm): The interpretation for the Depth_m variable is similar to the interpretation for the regression, as there is only one independent variable in this model.
- Error:
- Degrees of Freedom (DF): The degrees of freedom for the error is 30.
- Sum of Squares (SS): The sum of squares for the error is 3559.4086, which measures the total variation in SAND not explained by soil depth.
- Mean Square (MS): The mean square for the error is 118.6470 (SS divided by DF). This measures the residual variability or model prediction error.
- Total:
- Degrees of Freedom (DF): The total degrees of freedom is 31.
- Sum of Squares (SS): The total variation in SAND is 6625.5270.
In conclusion, the regression model that includes soil depth as an independent variable is significant in predicting sand content. With an F-statistic of 25.842, which is much greater than the critical F-values at the 5% (4.171) and 1% (7.562) significance levels, we can reject the null hypothesis stating that the independent variable has no effect on the response variable.
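The ANOVA decomposition above rests on the identity SS(Total) = SS(Regression) + SS(Error). A hedged sketch using hypothetical toy data (not the soil dataset):

```python
def anova_simple(x, y):
    """Sequential ANOVA for a simple linear regression: partitions the
    total sum of squares into regression and error components and forms
    the F statistic. Illustrative sketch only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_error = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = ss_total - ss_error
    ms_reg, ms_err = ss_reg / 1, ss_error / (n - 2)   # DF = 1 and n - 2
    return ss_reg, ss_error, ss_total, ms_reg / ms_err

ss_reg, ss_err, ss_tot, f = anova_simple(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
print(round(f, 4))  # → 4.5
```

In the soil example the same arithmetic reproduces the table: 3066.1184 + 3559.4086 = 6625.5270, and F = 3066.1184 / 118.6470 ≈ 25.842.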
Regression Assumption Checks
In regression analysis, it is important to check the underlying assumptions to ensure that the results obtained are valid and reliable, particularly the normality test and heteroskedasticity test. Here is the interpretation of the results from the Regression Assumption Checks.
Homoscedasticity Test
The Breusch–Pagan–Godfrey test is used to detect heteroscedasticity in the regression model. The hypotheses for this test are as follows:
- Null Hypothesis (H0): There is no heteroscedasticity in the model, i.e., the variability of the regression errors (residuals) is constant across the range of the independent variable.
- Alternative Hypothesis (H1): There is heteroscedasticity in the model, i.e., the variability of the regression errors (residuals) is not constant across the range of the independent variable.
In other words:
- If we reject H0 (e.g., if the p-value is less than 0.05), this indicates evidence supporting the presence of heteroscedasticity in the model.
- If we fail to reject H0 (e.g., if the p-value is greater than 0.05), this indicates no evidence supporting the presence of heteroscedasticity in the model, thus the assumption of homoscedasticity is considered met.
From the calculations, we obtain:
- The χ²-Statistic is 3.260.
- The p-value obtained from the BPG test is 0.071.
Interpretation:
Because the p-value (0.071) is greater than 0.05, we fail to reject the null hypothesis that states that the residual variance is the same at all levels of soil depth (homoscedasticity). In other words, there is insufficient evidence to claim that there is heteroscedasticity in the data.
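For intuition, the test can be sketched in Python. This sketch uses the Koenker (studentized) LM form of the Breusch-Pagan test — regress the squared residuals on X and take LM = n · R² of that auxiliary regression — which may differ slightly from the exact BPG statistic SmartstatXL computes. The toy residuals below are hypothetical.

```python
import math

def breusch_pagan(x, resid):
    """Breusch-Pagan LM test, Koenker's studentized form: regress the
    squared residuals on X; LM = n * R^2 of the auxiliary regression.
    The p-value uses the chi-square(1) upper tail via erfc."""
    n = len(x)
    e2 = [e * e for e in resid]
    mx, m2 = sum(x) / n, sum(e2) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (gi - m2) for xi, gi in zip(x, e2)) / sxx
    b0 = m2 - b1 * mx
    sse = sum((gi - (b0 + b1 * xi)) ** 2 for xi, gi in zip(x, e2))
    sst = sum((gi - m2) ** 2 for gi in e2)
    lm = n * (1 - sse / sst)
    p_value = math.erfc(math.sqrt(lm / 2))  # P(chi2_1 > lm)
    return lm, p_value

# Hypothetical residuals whose spread grows with x.
lm, p = breusch_pagan([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 2.0, -2.0])
print(round(lm, 2))  # → 3.2
```

As in the interpretation above, a p-value above 0.05 means we fail to reject H0 and treat the homoscedasticity assumption as met.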
Normality Test
Normality tests examine whether the residuals (errors) from the regression model are normally distributed, which is one of the key assumptions in linear regression. Here are the hypotheses for the Normality Test:
- Null Hypothesis (H0): Residuals are normally distributed.
- Alternative Hypothesis (H1): Residuals are not normally distributed.
In other words:
- If we reject H0 (e.g., if the p-value is less than 0.05), this indicates evidence supporting the claim that residuals are not normally distributed, thus violating the normality assumption.
- If we fail to reject H0 (e.g., if the p-value is greater than 0.05), this indicates no evidence supporting the claim that residuals are not normally distributed, thus the normality assumption is considered met.
From the calculations, we obtain:
- Shapiro-Wilk: With a statistic of 0.965 and a p-value of 0.371, the data shows no violation of the normality assumption.
- Anderson-Darling: With a statistic of 0.359 and a p-value of 0.452, the data shows no violation of the normality assumption.
- D'Agostino-Pearson: With a statistic of 1.597 and a p-value of 0.450, the data shows no violation of the normality assumption.
- Lilliefors & Kolmogorov-Smirnov: Both tests show a statistic of 0.118 and a p-value greater than 0.20, confirming no violation of the normality assumption.
Interpretation:
Since all the p-values from the above normality tests are greater than 0.05, we fail to reject the null hypothesis stating that residuals are normally distributed. This means the assumption of normality is met for this data.
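The same reject/fail-to-reject logic can be illustrated with a moment-based normality check. The sketch below implements the Jarque-Bera test, which is not among SmartstatXL's listed tests but is easy to compute by hand from the residuals' skewness and kurtosis; the toy residuals are hypothetical.

```python
import math

def jarque_bera(resid):
    """Jarque-Bera normality test: combines sample skewness and excess
    kurtosis; under H0 (normal residuals) JB ~ chi-square(2)."""
    n = len(resid)
    m = sum(resid) / n
    m2 = sum((e - m) ** 2 for e in resid) / n
    m3 = sum((e - m) ** 3 for e in resid) / n
    m4 = sum((e - m) ** 4 for e in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)   # chi2(2) upper tail
    return jb, p_value

# Symmetric hypothetical residuals: skewness 0, so JB stays small.
jb, p = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(p > 0.05)  # → True (fail to reject normality)
```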
In conclusion, based on the Regression Assumption Checks, the regression model appears to meet the assumptions of homoscedasticity and normality, which means the regression model is sufficiently valid and reliable for further analysis.
Residual Plots
In addition to formal tests, the assumption of normality can also be visually inspected using the included residual plots. Checks can be performed using the Normal Probability Plot (Normal P-Plot), Histogram, and Residual vs. Predicted plots.
- Normal P-Plot for Residuals:
- The Normal Probability Plot compares the ordered residuals with the quantiles expected under a normal distribution. Ideally, points on this plot should follow a straight diagonal line. If points deviate from the diagonal line, this may indicate deviations from normality.
- The fact that the points closely follow the straight diagonal line suggests that the residuals are approximately normally distributed over most of the value range. This is a good sign and indicates that the assumption of residual normality is largely met. However, the presence of points deviating from the diagonal at both ends indicates deviations from normality in the tails of the distribution.
- Although there are some deviations from normality, depending on the context and purpose of the analysis, these deviations may not be significant. However, if our analysis is highly sensitive to the assumption of normality, we may need to consider transformation techniques or other methods to address these deviations.
- Histogram for Residuals:
- The histogram should show a distribution that approximates a bell shape (normal distribution). Deviations from this shape (e.g., skewed or long-tailed distribution) may indicate violations of the normality assumption.
- Residual vs Predicted:
- To check for homoscedasticity, points on this plot should be randomly scattered around the horizontal line at 0 without any specific pattern. If a specific pattern is observed, such as a funnel shape or a curve, this may indicate heteroscedasticity or other violations of the regression assumptions.
Considering that all formal tests indicate that residuals are normally distributed (as all p-values are greater than 0.05), minor deviations in the Normal P-Plot are likely not a major issue.
In practice, regression analysis is often quite tolerant of minor violations of the normality assumption, especially if the sample size is sufficiently large. Therefore, even if there are some points deviating from the diagonal line in the Normal P-Plot, if formal tests indicate normality and we do not observe significant violations of other assumptions, the regression model may be considered sufficiently valid for analytical purposes.
Box-Cox Transformation and Residual Analysis
Box-Cox Transformation is often recommended for addressing violated assumptions in regression analysis, particularly normality and homoscedasticity. However, if the analysis already shows that all regression assumptions are met (as in this case example), transformation might not be necessary.
Several points to consider:
- Not Always Necessary: Even if Box-Cox suggests a transformation, it doesn't mean we must do it. This suggestion becomes more relevant when there are explicit violations of regression assumptions.
- Interpretation: Data transformation can affect the interpretation of regression coefficients. For instance, when both the response and the predictor are log-transformed (a log-log model), the coefficients become elasticities, which measure the percentage change in the response variable for each one-percent change in the predictor variable.
- Further Testing: If one decides to follow Box-Cox's advice and transform the data, the regression model should be rerun, and all regression assumptions should be checked for the transformed dataset.
- Context: Always consider the context and research objectives. If assumptions are already met, introducing transformations may be unnecessary and could complicate interpretation.
In conclusion, if we have already checked and ensured that the regression assumptions are met, a Box-Cox transformation may not be necessary. However, if interested in seeing how the transformation impacts the model, it can be tried and compared with the original model.
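For reference, the Box-Cox family itself is a one-parameter power transform. A minimal sketch (the lambda value would normally be chosen by maximizing the profile likelihood, which is omitted here):

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform: (y^lam - 1) / lam, with the natural log
    as the limiting case lam = 0. Requires strictly positive y."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

# lam = 0.5 is a shifted, scaled square-root transform -- the same family
# as the square-root remedy SmartstatXL applied to the Permeability residuals.
print(box_cox([1.0, 4.0, 9.0], 0.5))  # → [0.0, 2.0, 4.0]
```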
Based on these diagnostic results, here are example interpretations for the first 3 rows:
- Residual: The residual is the difference between the observed value (SAND) and the value predicted by the model. Large residuals may indicate that our model does not predict certain observations well.
- For the first observation (depth 8.5 cm), the positive residual of 13.2979 suggests that the model predicts a lower SAND value than what was actually observed.
- For the second observation (depth 24 cm), the negative residual of -2.1768 indicates that the model predicts a higher SAND value than what was actually observed.
- For the third observation (depth 37 cm), the positive residual of 14.6158 again suggests that the model predicts a lower SAND value than what was actually observed.
- Leverage: Leverage values measure how extreme or unusual predictor (X) values are relative to other predictor values. High leverage can indicate extreme X values.
- The first observation has a leverage of 0.0918, which is relatively high compared to the other observations (the average leverage for this dataset is p/n = 2/32 ≈ 0.063). This suggests that a depth of 8.5 cm may be a somewhat unusual value in the data.
- Studentized Residual: This is a standardized version of the residual, allowing us to identify outliers in the response (Y). Absolute values of large studentized residuals (often > 2 or 3) may indicate outliers.
- The first and third observations have studentized residuals greater than 1 in absolute value. These are below the usual cutoff of 2, so they are not clear-cut outliers, but they are worth monitoring.
- Cook's Distance: Measures the influence of each observation on all regression estimates. High Cook's Distance values may indicate high-influence observations.
- The first observation has a Cook's Distance of 0.0829, which might indicate that this observation has a relatively large influence on the model.
- DFITS: Similar to Cook's Distance, DFITS is a measure of the influence of each observation. High DFITS values (often > 2) may indicate high-influence observations.
- The first observation has a DFITS of 0.4118, indicating potential influence on the model.
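The diagnostic quantities in this table have simple closed forms for a one-predictor model. A hedged sketch using hypothetical toy data; note it uses internally studentized residuals and a DFFITS-style measure built from them, whereas SmartstatXL may use the externally studentized variants, so values can differ slightly.

```python
import math

def influence_diagnostics(x, y):
    """Per-observation leverage, internally studentized residual, Cook's
    distance, and a DFFITS-style measure for a simple linear regression
    (p = 2 estimated parameters). Illustrative sketch only."""
    n, p = len(x), 2
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - mx) ** 2 / sxx       # leverage
        r = e / math.sqrt(mse * (1 - h))       # internally studentized resid
        cook = r * r * h / (p * (1 - h))       # Cook's distance
        dffits = r * math.sqrt(h / (1 - h))    # DFFITS-style influence
        rows.append((h, r, cook, dffits))
    return rows

# Hypothetical toy data; leverages always sum to p (here 2).
rows = influence_diagnostics([1.0, 2.0, 3.0, 4.0, 5.0],
                             [2.0, 4.0, 5.0, 4.0, 5.0])
print(round(sum(h for h, _, _, _ in rows), 4))  # → 2.0
```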
Conclusion
Based on the analysis conducted:
- Regression Model: The obtained regression equation indicates a negative relationship between soil depth (Depth_m) and sand content (SAND). This means that as soil depth increases, the sand content is expected to decrease.
- Goodness-of-Fit Test: The coefficient of determination \(R^2\) shows that approximately 46.28% of the variation in sand content can be explained by the model. However, the Adjusted \(R^2\) indicates that, after accounting for the degrees of freedom, about 44.49% of the variation is explained.
- Assumption Tests: No significant violations of regression assumptions were found. The Breusch–Pagan–Godfrey test confirms that the model meets the homoscedasticity assumption. Furthermore, various Normality Tests affirm that the residuals from the model are normally distributed.
- Data Diagnostics: While the model overall appears sound, there are some observations that may act as outliers or have high influence on the model, as indicated by residual analysis and leverage values.
Reporting Results and Discussion in Academic Work
In this study, we evaluated the relationship between soil depth and sand content using simple linear regression. The analysis indicates a negative relationship between soil depth and sand content. As soil depth increases, sand content is expected to decrease. The developed regression model successfully explains about 44.49% of the variation in sand content.
As a step for model validation, we examined several key assumptions underlying regression analysis. First, we ensured that the variability of the regression errors (residuals) is constant across the range of the independent variable, a concept known as homoscedasticity. Through the Breusch–Pagan–Godfrey test, this assumption was confirmed. Next, we examined the distribution of the residuals, which ideally should be normally distributed. Through a series of Normality Tests, including Shapiro-Wilk, Anderson-Darling, D'Agostino-Pearson, Lilliefors, and Kolmogorov-Smirnov, we confirmed that the model's residuals are normally distributed.
However, despite the model meeting the basic assumptions, additional diagnostic analysis indicated some observations that may act as outliers or have high influence. This underscores the importance of always conducting a thorough examination of the regression model, even when it initially appears to meet all criteria.
Thus, these findings provide valuable insights into the relationship between soil depth and sand content. However, it should be remembered that interpretation should be done cautiously, especially if it will be applied in different contexts or used for predictive purposes.