Handling Missing Data and Data Transformation with SmartstatXL

Details: Written by Ade Setiawan; Category: Experimental Design; 13 Jul

When dealing with data analysis, challenges such as missing data or data not meeting certain assumptions often arise. SmartstatXL, with its advanced capabilities, offers solutions to overcome these issues. Moreover, the handling of missing data and data transformations with SmartstatXL can be applied to all types of experimental designs, ensuring that your experimental data analysis remains accurate and relevant.

Handling Missing Data

Encountering missing data within a dataset is a common challenge in data analysis. However, with SmartstatXL, you need not worry. If there is missing data, you simply leave the cell empty in your dataset. SmartstatXL is equipped with sophisticated algorithms to estimate the missing values based on the available information in the dataset. These algorithms can replace one or more missing data points with the most likely values, based on existing data patterns, ensuring the integrity and accuracy of your analysis are maintained.

Steps for Analysis of Variance (Anova) for Missing Data:

Make sure the worksheet (Sheet) you wish to analyze is active.
Place the cursor on the Dataset. (For information on creating a Dataset, please refer to the 'Data Preparation' guide).
In the example of missing data, suppose data for treatment 3Dok1 Repetition 1 and 3Dok4 Repetition 3 are missing. Leave those cells empty (do not fill with blank spaces, to ensure it is truly empty, click on the cell and then press the Del button).
Proceed with the analysis steps as usual.
Click the CRD Menu.
SmartstatXL will display a dialog box to confirm whether the Dataset is correct or not (usually the cell address for the Dataset is automatically selected correctly).
After confirming the Dataset is correct, press the Next Button
The following Anova - Single Factor CRD Dialog Box will appear:
Select Factor and Response as shown in the picture above!
Click the Next Button to proceed to the Anova Stage-2 Dialog Box
The following Stage 2 Dialog Box will appear:
In the Transformation and Outlier Data section, check the "Table: Prediction, Residual, Outlier, Transformation" checkbox
Press the Next button.
In the Anova Stage 3 Dialog Box, press the "OK" button.

Below is the new data table after missing data has been replaced with new values obtained using missing data calculations, marked in blue

New Data Table

The table shows a comparison between the original data and the new data after corrections for missing and outlier data.

3Dok1: The data originally missing (Missing) now has a value of 31.1730 based on missing data calculations.
3Dok4: Similarly to 3Dok1, the missing data has been replaced with a value of 16.0240.

This demonstrates that SmartstatXL has effectively handled missing and outlier data, allowing for more accurate and reliable analysis.

Automatic Data Transformation

Fulfilling the assumptions of ANOVA is key to obtaining valid analysis results. Two critical assumptions that must be met are the homogeneity of variance and normal distribution. SmartstatXL makes it easy for you to check both of these assumptions:

Variance homogeneity test using Levene's Test or Bartlett's Test. If the significance value is less than 0.05, it indicates that the residual variance of the treatments is not homogeneous, and your data may require transformation.
Normality Test is performed using several statistical tests. If the p-Value is less than 0.05, it indicates that your data is not normally distributed and may require transformation.

Here is an example of both tests:

If the significance value in the Variance Homogeneity Test Table is less than 0.05, it means the residual variance of the treatments is not homogeneous, and this indicates a violation of ANOVA assumptions (the F-value is marked in red).

If the p-Value in the Normality Test Table is less than 0.05, it indicates that the residuals are not normally distributed, and this also indicates a violation of ANOVA assumptions.

SmartstatXL not only checks these assumptions but also offers an automatic solution. If your data does not meet the ANOVA assumptions, SmartstatXL will find a suitable transformation for your data. If a suitable transformation is not found, the program will attempt to identify and correct outlier data. As a last resort, if other solutions fail, SmartstatXL will try to remove outlier data from the previous transformations, ensuring your analysis remains relevant and accurate.

In the following example, the analysis is performed on a Randomized Block Split Plot experiment from three locations.

Steps for Automatic Data Transformation

Ensure the worksheet (Sheet) you wish to analyze is active.
Place the cursor on the Dataset. (For information on creating a Dataset, please refer to the 'Data Preparation' guide).
Proceed with the analysis steps as usual.
Click the Menu RBD Split Plot > Main Plot: RBD.
SmartstatXL will display a dialog box to confirm whether the Dataset is correct or not (usually the cell address for the Dataset is automatically selected correctly).
After confirming the Dataset is correct, press the Next Button
The following Anova - RBD Split Plot Dialog Box will appear:
Select Factor and Response as shown in the picture above!
Click the Next Button
The following Stage 2 Dialog Box will appear:
In the Transformation and Outlier Data section, check the "Find Best Transformation" checkbox.
Minimum transformation score: 12
Check the box "Replace outlier data"
Replace outlier data with: Missing data calculations
Level: 1
There are two types of tests to check variance homogeneity, namely Levene and Bartlett. You can select the data source to be used in the homogeneity test, whether based on residual data or raw data.
To access this option, please click on the 'Advanced Options...' button. In this example, the type of test used is Levene and the homogeneity of variance test (HOV) is performed based on residual data, although the default option is raw data

If set, press the OK button.
Press the Next button.
In the Anova Stage 3 Dialog Box, press the "OK" button.

Analysis Results

Analysis Information

In the Table of Contents of the Analysis of Variance Output, some responses apparently violate ANOVA assumptions.

Rice Yield Location 1 and Location 2: Violate ANOVA Assumptions!
- If an assumption is violated, a [ x ] in red will be displayed in the table of contents
- If a suitable data transformation is found to meet the assumptions, this information will be included right in the next table of contents entry, marked with a [»] in blue.
Rice Yield Location 3: Meets ANOVA Assumptions.
- If all assumptions are met, a [ ✓ ] in green will be displayed in the table of contents

Rice Yield Location 1:

ANOVA Assumption Checks

Levene's Test for Homogeneity of Variance

F-Statistic: 4.23
P-Value: 0.002
Degrees of Freedom (DF): DF₁ = 11, DF₂ = 24

Levene's Test shows that the p-value (0.002) is less than 0.05, thus we reject the null hypothesis (H0) stating that the variances of all groups are the same. This indicates a violation of the homogeneity of variance assumption in ANOVA.

Normality Test

Shapiro-Wilk's: P-Value = 0.423
Anderson Darling: P-Value = 0.418
D'Agostino Pearson: P-Value = 0.140
Liliefors: P-Value > 0.20
Kolmogorov-Smirnov: P-Value > 0.20

All p-values are greater than 0.05, which means we fail to reject the null hypothesis (H0) stating that the residuals are normally distributed. Therefore, the normality assumption is met.

Discussion

Variance Homogeneity

In the context of ANOVA, violation of the variance homogeneity assumption can be a serious issue. Considering that this violation might be caused by outliers, the first step is to identify and examine these outliers. If the outliers are indeed the source of the violation, they should be corrected or removed from the analysis, depending on the reason for their existence.

Normality

Since all the normality tests indicate that the data tends to be normal, this is good news and we do not need to worry about this assumption being violated.

Outliers as Potential Sources

If outliers are identified as the source of assumption violations, there are several ways to address them:

Remove or Correct the Data: If the outlier is the result of an error, then the best option is to remove or correct it.
Data Transformation: This method can help in reducing the effects of outliers and make the data more homogeneous.
Sensitivity Analysis: Perform analysis with and without outliers to see the extent to which they affect the results.

After addressing the outliers, it's important to re-run the assumption tests to ensure that the data now meets the necessary assumptions for split-plot ANOVA.

Conclusion

Violation of the homogeneity of variance assumption in ANOVA is a serious issue that needs to be addressed carefully. Identifying and dealing with outliers can be a crucial step in ensuring the validity of ANOVA analysis. After addressing the outliers, assumption tests should be re-run to ensure that the assumptions are now met.

Handling of Outliers by SmartstatXL

Although SmartstatXL was unable to find a suitable transformation to meet ANOVA assumptions, this data issue can be resolved by correcting the existing outliers.

The following image shows the data after outlier correction has been made:

In this way, you can ensure that your data better fits the ANOVA assumptions.

Here is the outcome of the ANOVA assumption checks after outlier correction:

Based on the results of both tests—the homogeneity test and the normality test—it is evident that both ANOVA assumptions are now met. This demonstrates the effectiveness of outlier correction in meeting the prerequisites of ANOVA analysis.

Rice Yield Location 2:

ANOVA Assumption Checks:

Before Transformation

The case is the same as in Location 1, the homogeneity test is not met. In this case, SmartstatXL was able to find a suitable transformation, namely the Log or Ln transformation:

[ » ] Found a solution: Rice Yield Location 2 (Log Transformation: Log(Y) or LN(Y))

After Transformation:

After the transformation, both ANOVA assumptions are now met.

Last Updated: 10 September 2023

Sidebar Menu

Main Menu EN

Analyzing Problematic Data with SmartstatXL: Handling Missing Data and Data Transformation