Sidebar Menu

Factor Analysis

Factor Analysis is a multivariate statistical technique used to identify latent variables, or factors, underlying a set of observed variables. Its primary purpose is to reduce the dimensionality of data and find structure in the relations among the variables. In other words, it helps us understand the structure of variables and condenses the information in a large number of variables into a smaller number of factors or components.

Factor Analysis plays a crucial role in research, especially in studies involving many variables. The technique helps researchers reduce the number of variables and simplify analysis. In addition, Factor Analysis aids in understanding the underlying structure of data and in constructing conceptual models for further research. In practice, it is often used in customer satisfaction surveys, psychological research, marketing research, and many other fields. In all these cases, Factor Analysis helps extract useful and meaningful information from large amounts of data.

There are two types of factor analysis:

  • Exploratory Factor Analysis (EFA): used when we have no prior assumptions or hypotheses about the relationships among variables. Its purpose is to explore the structure or models that may exist within the data.
  • Confirmatory Factor Analysis (CFA): used when we do have hypotheses about the relationships among variables and want to test or confirm them.

Case Example

BFI (dataset based on personality assessment project)

This dataset consists of 25 items assessing personal characteristics, sourced from the International Personality Item Pool (ipip.ori.org), which is an integral part of the Synthetic Aperture Personality Assessment (SAPA) project. The dataset contains responses from 2800 individuals and is intended as a demonstration dataset for scale construction, factor analysis, and analysis based on Item Response Theory. It also includes three demographic variables: gender, education level, and age. In this case study, only the 2436 subjects with complete data were used; incomplete records were discarded.

The items evaluated in this study are designed based on five proposed factors: Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness.

The item data were collected on a 6-point response scale: 1 (Very Inaccurate), 2 (Somewhat Inaccurate), 3 (Slightly Inaccurate), 4 (Slightly Accurate), 5 (Somewhat Accurate), and 6 (Very Accurate). Data collection was carried out as part of the Synthetic Aperture Personality Assessment (SAPA) project, which can be accessed at https://www.sapa-project.org/.

Factor Analysis Dataset

Source: https://vincentarelbundock.github.io/Rdatasets/datasets.html

Steps for Factor Analysis:

  1. Activate the worksheet (Sheet) that will be analyzed.
  2. Place the cursor on the Dataset (to create a Dataset, see how to Prepare Data).
  3. If the active cell is not on the Dataset, SmartstatXL will automatically try to determine the Dataset.
  4. Activate the SmartstatXL Tab.
  5. Click the Multivariate > Factor Analysis Menu.
    Factor Analysis Menu
  6. SmartstatXL will display a dialog box to confirm that the Dataset is correct (usually the cell address of the Dataset is selected correctly by default).
    Dataset Source Dialog Box
  7. If it's correct, click the Next button.
  8. Next, the Factor Analysis Dialog Box will appear:
    Factor Analysis Dialog Box
  9. Select the Variables, the Analysis Method and Extraction Method, and, optionally, a label for the biplot. In this case study, we specify:
    • Variable: A1, A2, ..., O4, O5
    • Analysis Method: Correlation
    • Extraction Method: Based on Eigenvalue. At the end, we will also try extraction based on a fixed number of factors (5 factors)
    • Rotation: Varimax
    • Label: Education (Optional, used for Biplot)

    For more details, see the following dialog box view:
    Factor Analysis Dialog Box

    Analysis Method

    Analysis based on covariance is typically used when the unit of the observed variables is the same or when the absolute scale of these variables is important. Conversely, analysis based on correlation is used when the unit of the observed variables is different or when we are only interested in the relationship between variables, not their absolute differences.
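This equivalence can be checked directly: correlation-based analysis is the same as covariance-based analysis applied to standardized (z-scored) variables. A minimal numpy sketch on simulated, hypothetical data (SmartstatXL performs the equivalent computation internally):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 observations of 4 variables measured on very different scales
X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 0.1, 5.0])

# Covariance-based analysis uses the raw covariance matrix (scale-dependent)
cov = np.cov(X, rowvar=False)

# Correlation-based analysis is equivalent to the covariance of
# standardized (z-scored) variables, which removes the scale differences
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(X, rowvar=False)

print(np.allclose(np.cov(Z, rowvar=False), corr))  # True
```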

    Extraction Method

    In Principal Component Analysis (PCA), the extraction process of the components generally involves finding the eigenvectors and eigenvalues of the covariance or correlation matrix. These eigenvalues represent how far the variance is explained by each principal component, and the eigenvectors indicate the contribution of each original variable to that principal component.

    Here are two common approaches in determining how many components or factors to extract:

    • Based on Eigenvalue: This approach, also known as the "one eigenvalue rule" or "Kaiser criterion", suggests that we should only keep principal components that have eigenvalues greater than 1. This is based on the idea that a principal component should explain more variance than the average original variable.
    • Based on Fixed Number of Components: This approach involves determining the number of components or factors to be retained based on prior knowledge or the objective of the analysis. For instance, if the goal of the analysis is to reduce the data dimensions to two or three for visualization purposes, then we might choose to retain only two or three principal components.

    Both of these approaches have advantages and disadvantages. The eigenvalue-based approach is a commonly used general rule, but it may not always yield the most appropriate number of components for the analysis objective. The fixed number of components approach may require more knowledge about the data and the analysis objective. Therefore, the best approach to choose might depend on the context of the analysis.
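The two extraction rules above can be sketched in a few lines of numpy, using simulated (hypothetical) data with two built-in latent factors:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulate 300 observations of 6 variables driven by 2 latent factors
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + rng.normal(scale=0.8, size=(300, 6))

R = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(R)[::-1]  # sorted in descending order

# Kaiser criterion: keep only components whose eigenvalue exceeds 1
n_kaiser = int(np.sum(eigenvalues > 1))

# Fixed-number extraction: keep a predetermined count, e.g. 2
n_fixed = 2

print(eigenvalues.round(3))
print("Kaiser rule keeps:", n_kaiser, "components; fixed rule keeps:", n_fixed)
```

Note that the eigenvalues of a correlation matrix always sum to the number of variables, which is why an eigenvalue above 1 means "explains more than one average variable's worth of variance".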

    Varimax Rotation

    Varimax rotation is an orthogonal rotation technique that aims to maximize the variance of the squared factor loadings within each factor; in other words, it tries to maximize the number of variables that have high loadings on one factor and low loadings on the other factors.

  10. Press the "Output" tab
  11. Select the Factor Analysis output as shown below by pressing the Select All button:
    Factor Analysis Output Selection Dialog Box
  12. Press the OK button to create its output in the Output Sheet.

Analysis Results

Factor Analysis Information.

Factor Analysis Output

Kaiser-Meyer-Olkin (KMO) Test and Bartlett's Test of Sphericity

This factor analysis result has undergone pre-analysis testing stages using the Kaiser-Meyer-Olkin (KMO) test and Bartlett's Test of Sphericity.

The Kaiser-Meyer-Olkin (KMO) value is 0.849. KMO values range between 0 and 1. A value closer to 1 indicates that the pattern of correlations between variables is good enough for factor analysis. In this case, a KMO of 0.849 indicates that the dataset is suitable for factor analysis.

The Bartlett's Test is used to test the null hypothesis that the correlation matrix in the population is an identity matrix, which would indicate that the variables are uncorrelated in the population. If the p-value (Significance, Sig.) is less than 0.05, then the null hypothesis is rejected, and it can be concluded that the variables are correlated enough for factor analysis. In this case, the Sig. value is 0.000, so the null hypothesis is rejected, and the factor analysis can continue because the variables have an adequate level of correlation.

So, based on the results of the KMO test and Bartlett's Test, it can be concluded that the dataset is very suitable for factor analysis.
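For readers who want to reproduce these diagnostics outside SmartstatXL, both statistics can be computed from the correlation matrix. The sketch below uses the standard formulas (KMO from partial correlations via the inverse correlation matrix; Bartlett's chi-square from the determinant of R) on simulated, hypothetical data:

```python
import numpy as np
from scipy.stats import chi2

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy from a correlation matrix R."""
    inv = np.linalg.inv(R)
    # Partial correlations derived from the inverse correlation matrix
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    mask = ~np.eye(R.shape[0], dtype=bool)        # off-diagonal entries only
    r2, p2 = (R[mask] ** 2).sum(), (partial[mask] ** 2).sum()
    return r2 / (r2 + p2)

def bartlett_sphericity(R, n):
    """Bartlett's test that R is an identity matrix (n = sample size)."""
    p = R.shape[0]
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)

# Hypothetical example: 4 variables sharing one common factor
rng = np.random.default_rng(1)
f = rng.normal(size=(500, 1))
X = f + rng.normal(scale=0.7, size=(500, 4))
R = np.corrcoef(X, rowvar=False)
print(round(kmo(R), 3), bartlett_sphericity(R, n=500))
```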

Correlation Matrix Table

Correlation Matrix Table

Reproduced Correlation Matrix Table

Correlation Matrix Error Table

These tables show various information about the relationship between variables in the dataset:

Correlation Matrix Table

This correlation matrix shows the relationship between each pair of variables. Its values range from -1 to 1, where -1 means perfect negative correlation, 1 means perfect positive correlation, and 0 means no correlation.

Reproduced Correlation Matrix Table

This table presents estimates of the original correlation matrix reconstructed from the extracted principal components. Each number is the correlation implied by the retained components.

Correlation Matrix Error Table

This table shows the difference between the original and reproduced correlation matrix, which can tell us how well the model can reproduce the original correlations. Lower values indicate that our model is doing a good job of reproducing the original correlations. In this case, the errors seem relatively small, indicating that the model is doing a fairly good job.
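The reproduced correlations and their residuals follow directly from the retained loadings: the reproduced matrix is L Lᵀ, and the error matrix is its difference from the original correlation matrix. A numpy sketch on simulated, hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical data: 5 variables driven by 2 latent factors plus noise
x = rng.normal(size=(400, 2)) @ rng.normal(size=(2, 5)) \
    + rng.normal(scale=0.5, size=(400, 5))
R = np.corrcoef(x, rowvar=False)

# PCA-style loadings: eigenvectors scaled by sqrt(eigenvalue)
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
k = 2
L = vecs[:, :k] * np.sqrt(vals[:k])

# Reproduced correlation matrix and residual (error) matrix
R_hat = L @ L.T
residual = R - R_hat
mask = ~np.eye(5, dtype=bool)
print("mean absolute off-diagonal residual:",
      np.abs(residual[mask]).mean().round(4))
```

Small off-diagonal residuals, as here, indicate that the retained components reproduce the observed correlations well.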

Eigenvalues and Screeplot

Eigenvalues and Screeplot

The table above shows the amount of variation in the data explained by each factor. This variation is measured in units called eigenvalues. An eigenvalue is the amount of variance that can be explained by a certain factor. The larger the eigenvalue, the more variance can be explained by that factor.

The components in the table indicate the number of factors being analyzed. Each factor has an eigenvalue, a proportion of explained variation, and a cumulative proportion of explained variation. For example, the first factor has an eigenvalue of 5.134. This means the first factor can explain 20.5% of the total variance in the dataset (proportion). Looking at the second factor, this factor has an eigenvalue of 2.752 and explains an additional 11.0% of the total variance. So, cumulatively, the first two factors can explain 31.5% of the total variance in the data.

This process is done for each subsequent factor. Thus, we can see that the 25 factors analyzed explain 100% of the variation in the data (as indicated by the cumulative proportion on factor 25).

Typically, we only retain factors that have an eigenvalue greater than 1 (known as the Kaiser rule), because such factors are considered to be able to explain a significant part of the variance. In this case, there are 6 factors with an eigenvalue greater than 1, which means these 6 factors are considered important and explain 58% of the total variance.

Eigenvector Table

Eigenvector Table

The eigenvector table presented here is the result of Principal Component Analysis (PCA) applied to the BFI dataset. The values in this table indicate the weights assigned to each variable in constructing the main components or factors.

Remembering the concept of the five-factor personality model used in this study (Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness), we can try to interpret these results. Bear in mind that this interpretation is quite subjective and depends on the research context:

  • Principal Component 1 (PC1): This component seems to have a significant negative correlation with variables A1 (Agreeableness), E1 (Extraversion), E2 (Extraversion), N1 (Neuroticism), N2 (Neuroticism), N4 (Neuroticism), and N5 (Neuroticism). In other words, individuals with high scores on PC1 tend to have low scores on these variables. Meanwhile, significant positive scores are found on variables A2, A3, A4, A5 (Agreeableness), C1, C2, C3 (Conscientiousness), E3, E4, E5 (Extraversion), and O3 (Openness). This means that individuals with high scores on PC1 tend to have high scores on these variables.
  • Principal Component 2 (PC2): This component appears to be most related to the Neuroticism factor (N1, N2, N3, N4, and N5) with a strong positive relationship, meaning individuals with high scores on PC2 tend to score high in Neuroticism. Additionally, there is a significant negative correlation with E1 (Extraversion), meaning individuals with high scores on PC2 tend to score low on E1.
  • Principal Component 3 (PC3): This component seems to have a strong positive correlation with C1, C2, and C3 (Conscientiousness) and a negative correlation with O2 and O5 (Openness), A2, A3, and A5 (Agreeableness), and E4 (Extraversion).
  • Principal Component 4 (PC4): This component has a strong positive correlation with O5 and O2 (Openness) and C3 (Conscientiousness), and a negative correlation with O1 and O3 (Openness).
  • Principal Component 5 (PC5): This component has a strong positive correlation with A2, A3, A4, A5 (Agreeableness) and E1 and E2 (Extraversion), and a negative correlation with A1 (Agreeableness).
  • Principal Component 6 (PC6): This component has a strong positive correlation with A1 (Agreeableness), C4, and C5 (Conscientiousness), and O1, O2, and O5 (Openness).

Each of these principal components represents a linear combination of the original variables and can be interpreted as a latent factor underlying the dataset. In this context, these factors may include personality dimensions not directly captured by the scale used in this study. A specific interpretation of what each principal component actually represents would require a deeper understanding of the research context and possibly additional analysis.

Factor Loading

Factor Loading

This table shows the factor loading values and communalities for each variable on each factor. Factor loading is the correlation between the original variable and the factor. Large factor loadings (either positive or negative) indicate that the variable contributes significantly to that factor.

Factor interpretation is done by looking at which factor loadings are highest for each factor. For example, on Factor 1, variables A5, A3, E4, and A2 have the highest factor loadings, while on Factor 2, variables N1, N2, and N3 have the highest factor loadings. This indicates that Factor 1 and Factor 2 are closely related to these variables respectively.

Meanwhile, the communality for each variable is the sum of squared factor loadings for that variable and represents the proportion of variance in the variable that can be explained by the factors. For instance, the communality for variable A1 is 0.658, meaning that 65.8% of the variance in A1 can be explained by the six identified factors.

At the bottom of the table, "Expl. Variance" shows the amount of variation explained by each factor, and "% Variance" and "% Cum. Variance" show the percentage of variation explained by each factor and its cumulative total. For example, Factor 1 explains 20.5% of the variation, and Factors 1 and 2 together explain 31.5% of the variation.

These values indicate that this factor model is quite good at explaining the variation in these variables, although there is still some variance unexplained. The total variance explained by this model is 58.012%.

Ultimately, this interpretation needs to be reviewed in the context of our research to ensure that it makes sense and is useful in the study. Additionally, these factor loading values can also help us decide whether to retain or eliminate certain variables or factors in further analysis, depending on the research question and study design.
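The quantities in the factor loading table follow directly from the loading matrix: communality is the row sum of squared loadings, and explained variance is the column sum. A small numpy sketch with illustrative, hypothetical loading values (not the BFI results):

```python
import numpy as np

# Hypothetical 4-variable, 2-factor loading matrix (illustrative values)
L = np.array([
    [0.80, 0.10],
    [0.75, 0.05],
    [0.10, 0.70],
    [0.05, 0.65],
])

# Communality: sum of squared loadings across factors, per variable
communalities = (L ** 2).sum(axis=1)

# Explained variance per factor: column sum of squared loadings;
# dividing by the number of variables gives the percentage of total variance
expl_variance = (L ** 2).sum(axis=0)
pct_variance = 100 * expl_variance / L.shape[0]

print(communalities.round(3))  # variable 1: 0.80^2 + 0.10^2 = 0.65
print(pct_variance.round(1))
```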

Factor Loading Rotation

Before rotation, the generated factors may be difficult to interpret, because many variables tend to have moderate loadings spread across several factors. This is common in factor analysis and is not a problem. However, to make interpretation easier, we usually rotate the factors.

Rotation does not change anything about the original variables or the amount of variance explained by the factors, but it changes the factor loading of variables on each factor. The aim is to achieve a "simple structure", where each variable has a high factor loading on one factor and low on the others. There are many rotation methods, but the Varimax method is one of the most commonly used. After rotation (in this case, with the Varimax method), the factor structure is typically easier to interpret. High factor loading for a variable on a factor indicates that the variable is highly related to that factor. Low factor loading indicates the opposite.

It should be noted that with an orthogonal rotation such as Varimax, the factors remain uncorrelated (orthogonal) after rotation, which often makes interpretation easier. Nevertheless, it is important to remember that oblique rotation methods (for example, Promax) produce factors that are allowed to correlate.
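Varimax rotation itself is a small iterative algorithm: Kaiser's criterion is maximized by repeated SVD steps on the loading matrix. A compact numpy sketch using an illustrative, hypothetical loading matrix; because the rotation matrix is orthogonal, the communalities are unchanged by rotation:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Varimax rotation of a loading matrix L (p variables x k factors)."""
    p, k = L.shape
    T = np.eye(k)                 # accumulated orthogonal rotation matrix
    var = 0.0
    for _ in range(max_iter):
        rotated = L @ T
        # SVD step of Kaiser's varimax algorithm
        u, s, vt = np.linalg.svd(
            L.T @ (rotated**3 - rotated * (rotated**2).sum(axis=0) / p)
        )
        T = u @ vt
        new_var = s.sum()
        if new_var - var < tol:   # criterion stopped improving
            break
        var = new_var
    return L @ T, T

# Illustrative unrotated loadings: every variable loads on both factors
L = np.array([[0.6, 0.6], [0.7, 0.5], [0.5, -0.6], [0.6, -0.7]])
L_rot, T = varimax(L)

# Orthogonal rotation leaves each variable's communality unchanged
print(np.allclose((L**2).sum(axis=1), (L_rot**2).sum(axis=1)))  # True
```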

Here is the table of factor loadings after rotation and sorting.

Factor Loading Rotation

This table shows the factor loading values and communalities for each variable. Based on the table of factor analysis results, we can make several interpretations:

  • Factor 1 appears to represent the Neuroticism variables (N1 to N5). This can be seen from the high factor loading values on variables N1 to N5. Therefore, we could name this factor "Neuroticism".
  • Factor 2 appears to represent the Conscientiousness variables (C1 to C5). The highest factor loading values are on variables C2, C4, C3, C1, and C5. Therefore, we could name this factor "Conscientiousness".
  • Factor 3 appears to represent the Agreeableness variables (A1 to A5). The highest factor loading values are on variables A2, A3, A1, A5, and A4. Therefore, we could name this factor "Agreeableness".
  • Factor 4 appears to represent the Extraversion variables (E1 to E5). The highest factor loading values are on variables E1, E2, E4, and E5. Therefore, we could name this factor "Extraversion".
  • Factor 5 appears to represent most of the Openness variables (O1 to O5) and a bit of Extraversion (E3). The highest factor loading values are on variables O1, O3, and E3. Therefore, we could name this factor "Openness".
  • Factor 6 appears to represent variables O5 and O2, but its factor loading values are lower than the other factors. Therefore, the interpretation of this factor may be less clear or this factor may be less important in the context of this dataset.

Overall, this table indicates that this factor model is successful in finding a logical factor structure based on previous studies on personality, and its findings align with the proposed Big Five structure, although there are still some variables that do not align.

In the following discussion, an analysis will be conducted by modifying the extraction model based on the predetermined number of factors, that is, 5 Factors.

Loading Plot and Biplot

Loading Plot

Loading Plot

A loading plot is a visualization of the factor loadings we discussed earlier. It shows how much each original variable contributes to each factor.

On the horizontal axis, we have Factor 1 which explains 20.54% of the total variance, and on the vertical axis, we have Factor 2 which explains 11.01% of the total variance.

Biplot

Biplot

A biplot is an extension of the loading plot, where observational data are also displayed in a lower dimensional space. Biplots are used for visualization of multivariate data, where both the original variables (in vector form) and individual observations are displayed. In this case, in addition to the loading for each variable (same as displayed in the loading plot), the spread of observational data is also displayed. Each point on the plot represents an observation, and the relative position of these points reflects the relationship between these observations in the lower dimensional space generated by Factor Analysis.

A biplot can help us understand how observations are related to each other based on the original variables, and how these variables contribute to the main components.

Factor Score Coefficient Table

Factor Score Coefficient Table

This Factor Score Coefficient Table shows how the scores for each factor (in this case, Factor 1 to Factor 6) are calculated from the original variables. The values in the table are regression coefficients: they indicate how much a factor score is expected to change per unit change in an original variable, assuming all other variables remain constant. In short, the table summarizes how the original variables contribute to each factor and how the factor scores are computed from them.

The values in this table are the weights that indicate how much each variable contributes to the underlying factor. These weights are used to calculate factor scores for each individual in the sample based on their responses to the inventory items.

The interpretation of factor score coefficients is similar to the interpretation of eigenvectors in PCA. However, in the context of factor analysis, these weights are often interpreted alongside the loadings of variables on the underlying factor. For example, variables with high loadings on a factor contribute strongly to that factor's score and therefore tend to represent the same concept or construct.

Here is the interpretation of some factors based on the table above:

  • Factor 1: This factor seems to be most associated with the variables N1, N2, N3, N4, and N5 (Neuroticism), with strong positive weights, meaning individuals with high scores on this factor tend to have high scores in Neuroticism.
  • Factor 2: This factor seems to be most associated with the variables C1, C2, and C3 (Conscientiousness), with strong positive weights, and with C4 and C5, with strong negative weights. This means that individuals with high scores on this factor tend to score high on some aspects of Conscientiousness and low on others.
  • Factor 3: This factor seems to have a positive correlation with A2 and A3 (Agreeableness), and a negative correlation with A1 (Agreeableness) and O1 (Openness).
  • Factor 4: This factor seems to have a positive correlation with E1 and E2 (Extraversion) and O4 (Openness), and a negative correlation with E4 and E5 (Extraversion).
  • Factor 5: This factor seems to have a positive correlation with E3 (Extraversion) and O1, O3, and O4 (Openness).
  • Factor 6: This factor seems to have a strong positive correlation with O2 and O5 (Openness) and A1 (Agreeableness).

Each of these factors represents a different latent dimension of personality captured by the inventory items. Specific interpretation of what is actually represented by each factor would require a deeper understanding of the research context and possibly further analysis.

Component Score Table

Component Score Table

This table displays the factor scores of each individual on each factor. These scores are computed from the factor score coefficients (weights) calculated earlier. Each row in the table represents an individual in the sample, and each column represents a factor. The factor scores represent the final result of the factor analysis: they indicate the extent to which each respondent or observation in the sample relates to each factor generated by the analysis.

In the context of personality factor analysis, these factor scores can be interpreted as a measure of each respondent along the underlying personality dimensions. For example, respondents with high scores on factor 1 may have high levels of neuroticism, and respondents with low scores on factor 1 may have low levels of neuroticism (this assumption is based on the interpretation of the factor score coefficients from the previous question).

For instance, the first respondent has a score of -0.334 on factor 1, a score of -1.458 on factor 2, and so forth. If we follow the interpretation of the factor score coefficients from the previous question, this respondent may have relatively low levels of neuroticism (based on the negative score on factor 1) and also low levels of conscientiousness (based on the negative score on factor 2). In practice, these factor scores can be used to see how individuals place themselves on the various personality dimensions identified by the factor analysis. Furthermore, these scores could be used to link these personality dimensions with other variables, such as behavior, outcomes, or demographic variables.
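Under the commonly used regression (Thurstone) method, the score coefficients are W = R⁻¹L and the individual factor scores are ZW, where Z is the standardized data. A numpy sketch on simulated, hypothetical data (the actual coefficients depend on the extraction and rotation used by the software):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 200 respondents, 6 items, 2 latent factors
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6)) \
    + rng.normal(scale=0.6, size=(200, 6))

# Standardize the data and compute the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

# Loadings for k = 2 components (PCA-style extraction)
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
L = vecs[:, order[:2]] * np.sqrt(vals[order[:2]])

# Regression (Thurstone) method: score coefficients W = R^-1 L,
# then individual factor scores = Z W
W = np.linalg.solve(R, L)
scores = Z @ W

print(scores.shape)  # one row per respondent, one column per factor
```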

CFA Results (extraction based on a predetermined number of factors, i.e., 5 Factors)

Factor analysis is a statistical technique used to identify latent structures, or factors, that can explain correlations among variables. There are two main types of factor analysis, namely exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).

In exploratory factor analysis (EFA), the goal is to explore the structure of the data and determine the number of factors or principal components in the data. One commonly used method in EFA is extraction based on eigenvalues, where factors or components with eigenvalues greater than one are considered significant and retained. Eigenvalues here represent how much of the variance in the data is explained by that factor or component. Therefore, with this method, the number of factors or components is not predetermined but selected based on the analysis results.

In contrast, in confirmatory factor analysis (CFA), the goal is to test whether a hypothesized or previously proposed factor structure fits the data. In other words, we have a specific theory or model about the factor structure, and we want to test that model using the data. In this context, the number of factors or components is predetermined, based on the proposed model or theory. In this case, the proposed model has five factors or components.

So, essentially, the process we're doing is starting from EFA, where we explore the data and determine that there are five factors or components that are significant based on eigenvalues. We then switch to CFA, where we test the model with these five factors, and get results that show that the model is fairly good at explaining the data.

In the following dialog box, we select the extraction option based on a fixed number of factors (the 5 factors determined previously).

Dialog Box: extraction model based on the fixed number of factors

Factor Loadings

Factor Loadings

The factor loading scores in this table can be interpreted as the correlation between the item and the corresponding factor. Higher values indicate a stronger correlation. For example, variable N1 has a factor loading score of 0.806 on factor 1, meaning variable N1 has a very strong correlation with factor 1. Based on the provided information, factor 1 may refer to "Neuroticism".

Communality is the proportion of variance in each variable that can be explained by the found factors. For example, the communality for variable N1 is 0.710, meaning about 71% of the variance in N1 can be explained by the found factors.

Based on the varimax rotated table above, it can be seen that the variables of each personality factor (Neuroticism, Extraversion, Conscientiousness, Agreeableness, and Openness) tend to have higher factor loading scores on the corresponding factors. For example:

  • Factor 1 has high loading scores for variables N1-N5, which are indicators of Neuroticism. This means factor 1 might represent Neuroticism.
  • Factor 2 has high loading scores for variables E2, E4, E1, E3, and E5, which are indicators of Extraversion.
  • Factor 3 has high loading scores for variables C2, C4, C3, C1, and C5, which are indicators of Conscientiousness.
  • Factor 4 has high loading scores for variables A2, A3, A1, A5, and A4, which are indicators of Agreeableness.
  • Factor 5 has high loading scores for variables O5, O3, O2, O1, and O4, which are indicators of Openness.

In essence, the result of this varimax rotation shows that the extracted factors have a clear interpretation and are consistent with the expected five dimensions of personality (Neuroticism, Extraversion, Conscientiousness, Agreeableness, and Openness). Variables related to the same dimension tend to have high loading scores on the same factor.

Expl. Variance shows how much total variance is explained by each factor, % Variance shows the percentage of variance explained by each factor, and % Cum. Variance shows the accumulated percentage of variance explained by the factors. For example, factor 1 explains about 12.738% of the total variance and factor 2 explains an additional 12.400%, so these two factors together explain about 25.139% of the total variance. Overall, the five factors explain about 53.718% of the total variance.

Conclusion

In this tutorial, here are some key points that can be concluded:

  • Factor analysis is a statistical technique used to identify latent structure behind a set of variables, with the goal of reducing data dimensionality. In this case, factor analysis was used to identify the factor structure of personality assessment items based on the Big Five Personality Traits model.
  • In factor analysis, there are two different extraction methods: extraction based on eigenvalues and extraction based on a fixed number of factors. In this case, both were used, which reflects the difference between Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA).
  • The results of the factor analysis before rotation and after rotation (with varimax) have been given and interpreted. Factor rotation facilitates the interpretation of results, by maximizing factor loading on a particular factor and minimizing factor loading on other factors.
  • The results of factor analysis after rotation show a sensible and consistent factor structure with the proposed Big Five Personality Traits model, although there are some variables that are not entirely consistent.
  • In this context, Principal Component Analysis (PCA) was also used as an extraction method, which involves finding the eigenvectors and eigenvalues of the covariance or correlation matrix.

Overall, factor analysis has been successfully used to identify the factor structure of personality assessment items and provides valuable insights into the relationship between variables.