Cluster analysis is a statistical technique used to group objects or cases into several groups based on the information found in the data describing the objects and their relationships. The main purpose of cluster analysis is to identify structure or segments in the data.
One of the most popular and simplest methods in cluster analysis is the K-Means algorithm. It is called "K-Means" because each cluster is summarized by the average (mean) of its members, and the 'K' refers to the number of clusters to be formed from the data.
In general, the working process of the K-Means algorithm is as follows:
1. Determine the desired number of clusters (K).
2. Randomly choose the initial cluster centers.
3. Calculate the distance of each observation to every cluster center.
4. Assign each observation to the cluster with the nearest center.
5. Recalculate each cluster center as the mean of its members.
6. Repeat steps 3 to 5 until the cluster centers no longer change or the maximum number of iterations is reached.
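The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular software package; the function name, the optional `init` argument, and the convergence test are our own:

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: start from given centroids, or pick k random observations
    if init is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every observation to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each observation to its nearest centroid
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Passing `init` corresponds to the "initial cluster" option used later in this tutorial; leaving it out corresponds to a purely random start.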
This method is effective on large datasets and can produce compact clusters. However, K-Means also has limitations, such as sensitivity to the initial choice of centroids and difficulty with clusters that are not spherical or that have different variances. It is therefore important to understand the data and its characteristics before choosing this method for cluster analysis.
In the analysis that we will conduct, we will explain in more detail about the K-Means method, its steps, and how to interpret the results obtained. Furthermore, we will also discuss several techniques to overcome the limitations of the K-Means algorithm.
Case Example
In this cluster analysis tutorial, we will use the Iris dataset, a dataset often used in scientific studies and well known in the machine learning literature. It consists of 150 samples from three species of iris: Iris setosa, Iris virginica, and Iris versicolor. For each sample, four features are measured: the length and width of the sepal (the outer part of the flower) and the length and width of the petal (the inner part).
For this tutorial we will not use the full dataset; instead we will draw a random sample of 35 of the 150 observations to keep the example manageable and speed up the learning process.
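Drawing such a subsample takes one line of NumPy. A sketch (the seed here is arbitrary and will not reproduce the exact 35 rows used in this tutorial):

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed, for reproducibility only
# Draw 35 distinct row indices out of the 150 observations, kept in order
sample_idx = np.sort(rng.choice(150, size=35, replace=False))
```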
The aim of this tutorial is to apply the K-Means algorithm to this Iris dataset and interpret the results obtained. Through this tutorial, we will also discuss how K-Means works, how to use it, and what challenges we may face in this process.
By understanding the working principles of the K-Means algorithm and how to apply it to real datasets like Iris, we hope to gain a deeper understanding of cluster analysis and how to use it in our research or data science projects. Next, we will start with the first stage of our analysis, which is preparing and understanding the data we will use.
Iris Dataset: Originally published at UCI Machine Learning Repository
This small dataset, first published in 1936, is often used for testing machine learning algorithms and visualizations. It is a classification dataset containing three classes of 50 instances each, where each class refers to a species of iris plant: Setosa, Versicolor, and Virginica. Each row of the table represents an iris flower, recording its species and the dimensions of its botanical parts: sepal length, sepal width, petal length, and petal width (in centimeters).
| No | Sepal length | Sepal width | Petal length | Petal width | Species |
|----|--------------|-------------|--------------|-------------|---------|
| 11 | 5.4 | 3.7 | 1.5 | 0.2 | Setosa |
| 14 | 4.3 | 3 | 1.1 | 0.1 | Setosa |
| 20 | 5.1 | 3.8 | 1.5 | 0.3 | Setosa |
| 24 | 5.1 | 3.3 | 1.7 | 0.5 | Setosa |
| 26 | 5 | 3 | 1.6 | 0.2 | Setosa |
| 27 | 5 | 3.4 | 1.6 | 0.4 | Setosa |
| 28 | 5.2 | 3.5 | 1.5 | 0.2 | Setosa |
| 31 | 4.8 | 3.1 | 1.6 | 0.2 | Setosa |
| 39 | 4.4 | 3 | 1.3 | 0.2 | Setosa |
| 44 | 5 | 3.5 | 1.6 | 0.6 | Setosa |
| 47 | 5.1 | 3.8 | 1.6 | 0.2 | Setosa |
| 54 | 5.5 | 2.3 | 4 | 1.3 | Versicolor |
| 62 | 5.9 | 3 | 4.2 | 1.5 | Versicolor |
| 67 | 5.6 | 3 | 4.5 | 1.5 | Versicolor |
| 73 | 6.3 | 2.5 | 4.9 | 1.5 | Versicolor |
| 81 | 5.5 | 2.4 | 3.8 | 1.1 | Versicolor |
| 83 | 5.8 | 2.7 | 3.9 | 1.2 | Versicolor |
| 86 | 6 | 3.4 | 4.5 | 1.6 | Versicolor |
| 93 | 5.8 | 2.6 | 4 | 1.2 | Versicolor |
| 96 | 5.7 | 3 | 4.2 | 1.2 | Versicolor |
| 97 | 5.7 | 2.9 | 4.2 | 1.3 | Versicolor |
| 100 | 5.7 | 2.8 | 4.1 | 1.3 | Versicolor |
| 104 | 6.3 | 2.9 | 5.6 | 1.8 | Virginica |
| 107 | 4.9 | 2.5 | 4.5 | 1.7 | Virginica |
| 112 | 6.4 | 2.7 | 5.3 | 1.9 | Virginica |
| 115 | 5.8 | 2.8 | 5.1 | 2.4 | Virginica |
| 121 | 6.9 | 3.2 | 5.7 | 2.3 | Virginica |
| 124 | 6.3 | 2.7 | 4.9 | 1.8 | Virginica |
| 127 | 6.2 | 2.8 | 4.8 | 1.8 | Virginica |
| 131 | 7.4 | 2.8 | 6.1 | 1.9 | Virginica |
| 138 | 6.4 | 3.1 | 5.5 | 1.8 | Virginica |
| 140 | 6.9 | 3.1 | 5.4 | 2.1 | Virginica |
| 145 | 6.7 | 3.3 | 5.7 | 2.5 | Virginica |
| 146 | 6.7 | 3 | 5.2 | 2.3 | Virginica |
| 148 | 6.5 | 3 | 5.2 | 2 | Virginica |
Author: R.A. Fisher (1936)
Source: UCI Machine Learning Repository
K-Means Analysis Steps:
- Activate the worksheet (Sheet) to be analyzed.
- Place the cursor on the Dataset (to create a Dataset, see the Data Preparation method).
- If the active cell (Active Cell) is not on the Dataset, SmartstatXL will automatically try to determine the Dataset.
- Activate the SmartstatXL Tab
- Click the Menu Multivariate > K-Means Analysis.
- SmartstatXL will display a dialog box to ensure whether the Dataset is correct or not (usually the cell address of the Dataset is automatically selected correctly).
- If it is correct, click the Next button.
- Next, the K-Means Analysis Dialog Box will appear:
- Select Variable, Cluster Determination, and Output. In this case example, we determine:
- Variable: Sepal length, Sepal width, Petal length, and Petal width
- Grouping: Initial Cluster
- Observation Label: Species
More details can be seen in the following dialog box display:
Grouping
There are two grouping options in cluster analysis using K-Means:
- Classifying Based on the Number of Clusters: This option is suitable when we have no prior reference for grouping the data. In our context, because we already know the dataset consists of three species, we could simply set the number of clusters to 3.
- Using Initial Cluster: In our dataset, there is a 'Species' column that can be used as a reference for grouping the Iris dataset.
In this tutorial, we will set the grouping based on the initial cluster, which is the three Iris species.
- Select the K-Means Analysis output as shown above.
- Press the OK button to create the output in the Output Sheet.
Analysis Results
K-Means Information
Based on the analysis results using the K-Means method, the initial information we obtained is as follows:
Analysis Method: K-Means
The variables used in this analysis include: Sepal Length, Sepal Width, Petal Length, and Petal Width. These four variables are used to form clusters in the analysis.
For the initial cluster, we use species information from the dataset. Cluster 1 refers to the Setosa species, Cluster 2 refers to the Versicolor species, and Cluster 3 refers to the Virginica species.
This is the initial stage of our analysis. The next step will involve the use of the K-Means method to group data based on the four predetermined variables. Then, we will see how each sample is grouped and compare it with the initial cluster based on species.
Initial Cluster Centroid Table
The displayed output illustrates the initial centroid value for each cluster. In the context of the K-Means algorithm, the centroid is the center of a cluster. This value is calculated based on the average of each dimension in the cluster.
Here is the interpretation of the output:
- Cluster 1, which refers to the Setosa species, has a centroid with Sepal length value of 4.945, Sepal width of 3.373, Petal length of 1.509, and Petal width of 0.282. These values represent the average characteristics of the Setosa Iris species in our dataset.
- Cluster 2, which refers to the Versicolor species, has a centroid with Sepal length value of 5.773, Sepal width of 2.782, Petal length of 4.209, and Petal width of 1.336. These values represent the average characteristics of the Versicolor Iris species in our dataset.
- Cluster 3, which refers to the Virginica species, has a centroid with Sepal length value of 6.415, Sepal width of 2.915, Petal length of 5.308, and Petal width of 2.023. These values represent the average characteristics of the Virginica Iris species in our dataset.
By looking at these centroid values, we can get an initial understanding of the characteristics of each cluster. For example, cluster 1 (Setosa) has a shorter Sepal length and Petal length compared to other clusters, while cluster 3 (Virginica) has a longer Sepal length, Petal length, and Petal width compared to other clusters.
It should be noted that this is just the starting point of our analysis. Next, we will run the K-Means algorithm, and these centroids may change as the iteration progresses.
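Because the initial clusters are defined by species, each initial centroid is simply the per-species mean of the four features. This can be checked directly against the 35 rows listed earlier (the arrays below repeat the sample data verbatim; the variable names are our own):

```python
import numpy as np

# The 35 sampled observations: [sepal length, sepal width, petal length, petal width]
setosa = np.array([
    [5.4, 3.7, 1.5, 0.2], [4.3, 3.0, 1.1, 0.1], [5.1, 3.8, 1.5, 0.3],
    [5.1, 3.3, 1.7, 0.5], [5.0, 3.0, 1.6, 0.2], [5.0, 3.4, 1.6, 0.4],
    [5.2, 3.5, 1.5, 0.2], [4.8, 3.1, 1.6, 0.2], [4.4, 3.0, 1.3, 0.2],
    [5.0, 3.5, 1.6, 0.6], [5.1, 3.8, 1.6, 0.2],
])
versicolor = np.array([
    [5.5, 2.3, 4.0, 1.3], [5.9, 3.0, 4.2, 1.5], [5.6, 3.0, 4.5, 1.5],
    [6.3, 2.5, 4.9, 1.5], [5.5, 2.4, 3.8, 1.1], [5.8, 2.7, 3.9, 1.2],
    [6.0, 3.4, 4.5, 1.6], [5.8, 2.6, 4.0, 1.2], [5.7, 3.0, 4.2, 1.2],
    [5.7, 2.9, 4.2, 1.3], [5.7, 2.8, 4.1, 1.3],
])
virginica = np.array([
    [6.3, 2.9, 5.6, 1.8], [4.9, 2.5, 4.5, 1.7], [6.4, 2.7, 5.3, 1.9],
    [5.8, 2.8, 5.1, 2.4], [6.9, 3.2, 5.7, 2.3], [6.3, 2.7, 4.9, 1.8],
    [6.2, 2.8, 4.8, 1.8], [7.4, 2.8, 6.1, 1.9], [6.4, 3.1, 5.5, 1.8],
    [6.9, 3.1, 5.4, 2.1], [6.7, 3.3, 5.7, 2.5], [6.7, 3.0, 5.2, 2.3],
    [6.5, 3.0, 5.2, 2.0],
])
# Initial centroid of each cluster = per-species mean of each feature
init_centroids = np.array([g.mean(axis=0) for g in (setosa, versicolor, virginica)])
print(init_centroids.round(3))  # matches the Initial Cluster Centroid Table
```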
Final Cluster Centroid Table
The output shows the final centroid values after the K-Means algorithm is finished running. These centroid values represent the center point of each cluster at the end of the grouping process.
Here is the interpretation of the output:
- Cluster 1, which refers to the Setosa species, has a final centroid with Sepal length value of 4.945, Sepal width of 3.373, Petal length of 1.509, and Petal width of 0.282. This indicates that the average characteristics of the cluster representing the Setosa species are relatively stable and do not undergo significant changes from the initial centroid to the final centroid.
- Cluster 2, which refers to the Versicolor species, has a final centroid with Sepal length value of 5.645, Sepal width of 2.782, Petal length of 4.173, and Petal width of 1.355. Compared to the initial centroid, we see a slight shift in the average characteristics of this cluster, particularly in the Sepal length and Petal width values.
- Cluster 3, which refers to the Virginica species, has a final centroid with Sepal length value of 6.523, Sepal width of 2.915, Petal length of 5.338, and Petal width of 2.008. Like cluster 2, there is a slight change in the average characteristics of this cluster, mainly in the Sepal length and Petal length values.
The change in centroid values from the beginning to the end of this process shows how the K-Means algorithm works in adjusting and updating clusters based on data characteristics. These final centroids now represent the characteristics of the clusters after the K-Means process is complete, and can be used to understand our data structure and make further predictions or analysis.
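The size of each centroid's shift can be read straight off the two tables. A quick check (centroid values copied from the initial and final tables above):

```python
import numpy as np

# Rows: cluster 1 (Setosa), cluster 2 (Versicolor), cluster 3 (Virginica)
initial = np.array([
    [4.945, 3.373, 1.509, 0.282],
    [5.773, 2.782, 4.209, 1.336],
    [6.415, 2.915, 5.308, 2.023],
])
final = np.array([
    [4.945, 3.373, 1.509, 0.282],
    [5.645, 2.782, 4.173, 1.355],
    [6.523, 2.915, 5.338, 2.008],
])
# Euclidean distance each centroid moved between start and finish
shift = np.linalg.norm(final - initial, axis=1)
print(shift.round(3))  # cluster 1 did not move; clusters 2 and 3 moved slightly
```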
Distance Between Cluster Centroids
The displayed output is the distance between the centroids of each cluster. In the context of the K-Means algorithm, the distance between these centroids can provide an idea of how "far" each cluster is from each other in our data dimension space.
Here is the interpretation:
- Cluster 1, which refers to the Setosa species, has a distance of 0 to itself (as a reference), a distance of 3.014 to cluster 2 (Versicolor), and a distance of 4.510 to cluster 3 (Virginica). Cluster 1 therefore lies closer to cluster 2 than to cluster 3.
- Cluster 2, which refers to the Versicolor species, has a distance of 0 to itself, a distance of 3.014 to cluster 1 (Setosa), and a distance of 1.604 to cluster 3 (Virginica). Cluster 2 therefore lies much closer to cluster 3 than to cluster 1.
- Cluster 3, which refers to the Virginica species, has a distance of 0 to itself, a distance of 4.510 to cluster 1 (Setosa), and a distance of 1.604 to cluster 2 (Versicolor). Cluster 3 therefore lies closer to cluster 2 than to cluster 1.
Understanding the distance between these centroids is important as it can provide a sense of how well our grouping is. If the distance between clusters is large enough, it means each cluster is clearly separated and not overlapping, which typically indicates good clustering. On the other hand, if the distance between clusters is small, there may be ambiguity or overlap between clusters, indicating that our model might need to be adjusted or improved.
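These figures can be reproduced as pairwise Euclidean distances between the final centroids (values copied from the final centroid table above; small third-decimal differences can occur because the table values are rounded):

```python
import numpy as np

# Final cluster centroids (Setosa, Versicolor, Virginica) from the table above
centroids = np.array([
    [4.945, 3.373, 1.509, 0.282],
    [5.645, 2.782, 4.173, 1.355],
    [6.523, 2.915, 5.338, 2.008],
])
# 3x3 matrix of pairwise Euclidean distances between the centroids
D = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
print(D.round(2))
```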
Cluster Membership
The displayed output provides information on the membership of each observation in a cluster, as well as the distance of each observation to the centroid of its respective cluster.
- Observations 1 to 11 are classified into cluster 1, which refers to the Setosa species. The distance of each of these observations to the centroid of cluster 1 ranges from 0.161 to 0.869. This classification is accurate and corresponds to the original species of these observations.
- Observations 12 to 22 correspond to the Versicolor species; most are assigned to cluster 2, at distances to its centroid ranging from 0.108 to 0.928. There is one classification error here: observation 15, although a Versicolor, is assigned to cluster 3 (Virginica).
- Observations 23 to 35 correspond to the Virginica species; most are assigned to cluster 3, at distances to its centroid ranging from 0.164 to 1.172. Here too there is one classification error: observation 24, although a Virginica, is assigned to cluster 2 (Versicolor).
These classification errors could be caused by a variety of factors, including the natural variability in the data or limitations of the K-Means algorithm itself. Nevertheless, the overall classification results are quite good, with most observations classified into the correct cluster. This shows that the K-Means algorithm is quite effective at identifying cluster structure in this Iris data.
However, it should be noted that while K-Means can provide good results in many cases, it is not a perfect approach and may not always succeed in correctly identifying clusters, especially in cases where clusters are not spherical or when clusters have very different densities. Therefore, it is always important to understand the limitations of this algorithm and consider using other techniques or adjusting the algorithm if necessary.
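The two misclassifications can be verified by hand: each observation simply ends up nearer the "wrong" final centroid. A check using the feature values and final centroids from the tables above:

```python
import numpy as np

# Final cluster centroids (Setosa, Versicolor, Virginica)
centroids = np.array([
    [4.945, 3.373, 1.509, 0.282],
    [5.645, 2.782, 4.173, 1.355],
    [6.523, 2.915, 5.338, 2.008],
])
obs15 = np.array([6.3, 2.5, 4.9, 1.5])  # a Versicolor sample
obs24 = np.array([4.9, 2.5, 4.5, 1.7])  # a Virginica sample
# Euclidean distance from each observation to each of the three centroids
d15 = np.linalg.norm(centroids - obs15, axis=1)
d24 = np.linalg.norm(centroids - obs24, axis=1)
print(d15.argmin() + 1)  # nearest centroid is cluster 3, not cluster 2
print(d24.argmin() + 1)  # nearest centroid is cluster 2, not cluster 3
```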
Summary of Observations
This output provides information about the number of observations in each cluster.
Here is its interpretation:
- Cluster 1, referring to the Setosa species, has 11 observations. This means that out of the total observations, 11 of them are classified into cluster 1.
- Cluster 2, referring to the Versicolor species, also has 11 observations. This means that out of the total observations, 11 of them are classified into cluster 2.
- Cluster 3, referring to the Virginica species, has 13 observations. This means that out of the total observations, 13 of them are classified into cluster 3.
Overall, the distribution of observations among the clusters is quite balanced, with cluster 3 holding slightly more observations than the other two. These sizes mirror the composition of the 35-observation sample itself (11 Setosa, 11 Versicolor, and 13 Virginica), since the two misclassified observations swap between clusters 2 and 3 and leave the counts unchanged.
However, based on the previous interpretation, we also know that there are some observations that were not correctly classified. Observation 15 should be in cluster 2 (Versicolor), but was classified into cluster 3 (Virginica). Conversely, observation 24 should be in cluster 3 (Virginica), but was classified into cluster 2 (Versicolor). This illustrates that although the K-Means algorithm is quite effective, there is still room for improvement in this classification.
Conclusion
Here are the conclusions from the cluster analysis using the K-Means method on the Iris dataset:
- The K-Means method is quite effective in identifying and differentiating the three Iris species based on features such as sepal length and width, as well as petal length and width.
- In this analysis, K-Means was successful in dividing the 35 Iris samples into three clusters with a fairly balanced distribution: 11 observations in cluster 1 (Setosa), 11 observations in cluster 2 (Versicolor), and 13 observations in cluster 3 (Virginica).
- However, there were a few classification errors: observation 15, which should be in cluster 2 (Versicolor), ended up in cluster 3 (Virginica), and observation 24, which should be in cluster 3 (Virginica), ended up in cluster 2 (Versicolor). This shows that while the K-Means algorithm is fairly effective, there is still room for improvement.
- Classification errors can be caused by various factors, including natural variability in the data or the limitations of the K-Means algorithm itself.
- To overcome these limitations, researchers need to consider the use of other techniques or adjustments to the algorithm if necessary. For instance, they can use other clustering techniques such as DBSCAN or Hierarchical Clustering, or perform parameter tuning on the K-Means algorithm.
Overall, this analysis provides valuable insights into how K-Means can be used to identify and classify Iris species based on their features. Despite some limitations, K-Means remains a useful and efficient method in cluster analysis.