Cluster analysis is a statistical technique for grouping similar objects into clusters based on the features those objects possess. It is widely used in fields such as computer science, marketing, and biology.
Agglomerative Hierarchical Clustering (AHC) is a type of hierarchical clustering method commonly used in cluster analysis. The term "agglomerative" refers to the way this method works: it starts with each data point as a separate cluster and then repeatedly combines (agglomerates) the most similar clusters until only one large cluster is left.
Here are the basic steps in Agglomerative Hierarchical Clustering:
- Start with each data point as its own cluster.
- Find the two most similar clusters and combine them into one. The distance between clusters can be measured in various ways; the most common metrics are Euclidean distance, Manhattan distance, and cosine distance.
- Repeat the second step until all data is merged into one cluster.
The advantage of Agglomerative Hierarchical Clustering is that the resulting hierarchical structure can easily be displayed in the form of a dendrogram, which shows how clusters are combined and the distance between clusters at each stage of the merging process. This dendrogram can help analysts understand the data structure and determine the optimal number of clusters. However, Agglomerative Hierarchical Clustering also has some weaknesses. First, this method tends to be computationally expensive, especially for large datasets. Second, it does not have a way to "undo" a previous merging step if it proves to be suboptimal at a later stage, which can potentially produce less accurate clusters.
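As an illustration of these steps (outside of SmartstatXL), here is a minimal sketch in Python using the SciPy library: it builds the merge hierarchy for a handful of made-up two-dimensional points and draws the corresponding dendrogram. The data values and the library choice are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five made-up points in a 2-D feature space
X = np.array([[1.0, 1.0],
              [1.2, 0.9],
              [5.0, 5.1],
              [5.2, 4.9],
              [9.0, 9.0]])

# Agglomerative clustering: start with each point as its own cluster,
# then repeatedly merge the two closest clusters (complete linkage,
# Euclidean distance) until a single cluster remains.
Z = linkage(X, method="complete", metric="euclidean")

# Each row of Z records one merge: the two clusters joined, the distance
# at which they were joined, and the size of the new cluster.
print(Z)

# The dendrogram visualizes the full merge history.
dendrogram(Z)
plt.show()
```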
Case Example
Here is one of the case examples often used in cluster analysis: the Iris dataset. This is a multivariate dataset introduced by the statistician and biologist Ronald Fisher in his 1936 article, and it is very popular in various fields, including Machine Learning and Statistics.
The Iris dataset contains 150 samples from three species of iris (Iris setosa, Iris virginica, and Iris versicolor). Four features are measured from each sample: sepal length and width, and petal length and width. In this case example, only a random sample of 35 of the 150 observations is used, to simplify and speed up the learning process.
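For readers who want to follow along outside of SmartstatXL, the sketch below shows one possible way to draw such a 35-sample subset in Python, using scikit-learn's bundled copy of the Iris data. The seed and the resulting row indices are arbitrary assumptions and will generally differ from the 35 rows listed in the table below.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()                      # 150 samples, 4 features
rng = np.random.default_rng(seed=0)     # arbitrary seed for reproducibility
idx = np.sort(rng.choice(len(iris.data), size=35, replace=False))

X = iris.data[idx]                              # sepal/petal length and width
species = iris.target_names[iris.target[idx]]   # setosa / versicolor / virginica

print(X.shape)       # (35, 4)
print(species[:5])
```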
Iris Dataset: Originally published at UCI Machine Learning Repository
The Iris dataset, a small dataset from 1936, is often used for testing machine learning algorithms and visualizations. It is a classification dataset containing three classes of 50 instances each, where each class refers to a type of iris plant: Setosa, Versicolor, and Virginica. Each row of the table represents an iris flower, including its species and the dimensions of its botanical parts: sepal length, sepal width, petal length, and petal width (in centimeters).
No | Sepal length | Sepal width | Petal length | Petal width | Species |
11 | 5.4 | 3.7 | 1.5 | 0.2 | Setosa |
14 | 4.3 | 3 | 1.1 | 0.1 | Setosa |
20 | 5.1 | 3.8 | 1.5 | 0.3 | Setosa |
24 | 5.1 | 3.3 | 1.7 | 0.5 | Setosa |
26 | 5 | 3 | 1.6 | 0.2 | Setosa |
27 | 5 | 3.4 | 1.6 | 0.4 | Setosa |
28 | 5.2 | 3.5 | 1.5 | 0.2 | Setosa |
31 | 4.8 | 3.1 | 1.6 | 0.2 | Setosa |
39 | 4.4 | 3 | 1.3 | 0.2 | Setosa |
44 | 5 | 3.5 | 1.6 | 0.6 | Setosa |
47 | 5.1 | 3.8 | 1.6 | 0.2 | Setosa |
54 | 5.5 | 2.3 | 4 | 1.3 | Versicolor |
62 | 5.9 | 3 | 4.2 | 1.5 | Versicolor |
67 | 5.6 | 3 | 4.5 | 1.5 | Versicolor |
73 | 6.3 | 2.5 | 4.9 | 1.5 | Versicolor |
81 | 5.5 | 2.4 | 3.8 | 1.1 | Versicolor |
83 | 5.8 | 2.7 | 3.9 | 1.2 | Versicolor |
86 | 6 | 3.4 | 4.5 | 1.6 | Versicolor |
93 | 5.8 | 2.6 | 4 | 1.2 | Versicolor |
96 | 5.7 | 3 | 4.2 | 1.2 | Versicolor |
97 | 5.7 | 2.9 | 4.2 | 1.3 | Versicolor |
100 | 5.7 | 2.8 | 4.1 | 1.3 | Versicolor |
104 | 6.3 | 2.9 | 5.6 | 1.8 | Virginica |
107 | 4.9 | 2.5 | 4.5 | 1.7 | Virginica |
112 | 6.4 | 2.7 | 5.3 | 1.9 | Virginica |
115 | 5.8 | 2.8 | 5.1 | 2.4 | Virginica |
121 | 6.9 | 3.2 | 5.7 | 2.3 | Virginica |
124 | 6.3 | 2.7 | 4.9 | 1.8 | Virginica |
127 | 6.2 | 2.8 | 4.8 | 1.8 | Virginica |
131 | 7.4 | 2.8 | 6.1 | 1.9 | Virginica |
138 | 6.4 | 3.1 | 5.5 | 1.8 | Virginica |
140 | 6.9 | 3.1 | 5.4 | 2.1 | Virginica |
145 | 6.7 | 3.3 | 5.7 | 2.5 | Virginica |
146 | 6.7 | 3 | 5.2 | 2.3 | Virginica |
148 | 6.5 | 3 | 5.2 | 2 | Virginica |
Author: R.A. Fisher (1936)
Source: UCI Machine Learning Repository
Steps in AHC Analysis:
- Activate the worksheet (Sheet) to be analyzed.
- Place the cursor on the Dataset (to create a Dataset, see how to Prepare Data).
- If the active cell (Active Cell) is not on the Dataset, SmartstatXL will automatically try to determine the Dataset.
- Activate the SmartstatXL Tab.
- Click the Multivariate menu > Agglomerative Hierarchical Clustering.
- SmartstatXL will display a dialog box to confirm whether the Dataset is correct or not (usually the cell address of the Dataset is correctly selected automatically).
- If it is correct, click the Next button.
- The AHC Analysis Dialog Box will appear:
- Select the Variables, Method, and Output. In this case study, we specify:
- Variables: Sepal length, Sepal width, Petal length, and Petal width
- Linkage Rule: Complete
- Distance Measure: Euclidean Distance
- Cluster Based on: Select Observation (rows)
- Observation Label: Species
More details can be seen in the following dialog box display:
Linkage Rule
The linkage method determines how the distance between clusters is calculated, and there are several types that are commonly used:
- Single Linkage (Nearest Neighbor): Calculates the distance between two clusters based on the closest point between the two clusters. Tends to produce long, thin clusters and may be susceptible to noise and outliers.
- Complete Linkage (Farthest Neighbor): Calculates the distance between two clusters based on the farthest point between the two clusters. Tends to produce more compact, rounded clusters.
- Average Linkage: Calculates the distance between two clusters as the average distance between every point in one cluster to every point in the other cluster. More resistant to noise and outliers than single or complete linkage.
- Centroid Linkage: Calculates the distance between two clusters based on the distance between their centers of gravity (centroids). Susceptible to "inversions", where a later merge can occur at a smaller distance than an earlier one, making the dendrogram harder to read.
- Ward’s Linkage: Merges two clusters that result in the smallest increase in the total within-cluster variance after the merge. Produces generally compact and uniform clusters.
- Median Linkage: Similar to Centroid Linkage, but when two clusters are merged, the new cluster's representative point is taken as the midpoint between the centroids of the two merged clusters, giving both clusters equal weight regardless of size.
- Weighted Linkage (WPGMA): Calculates the distance between two clusters as the average distance between every point in one cluster and every point in the other, just like Average Linkage. The difference is that each cluster in the pair receives the same weight, no matter how many points it contains.
The choice of linkage method can significantly impact the results of Agglomerative Hierarchical Clustering, so it is important to choose the method most suitable for your data and analysis objectives.
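To make the effect of the linkage rule concrete, the sketch below runs the same data through several of the rules listed above, using SciPy's hierarchical clustering (an assumption; the article itself uses SmartstatXL). Note that in SciPy the 'centroid', 'median', and 'ward' rules are only defined for Euclidean distance.

```python
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import load_iris

X = load_iris().data   # 150 x 4 matrix of iris measurements

# The linkage rules described above map onto SciPy's method names.
for method in ["single", "complete", "average", "weighted",
               "centroid", "median", "ward"]:
    Z = linkage(X, method=method, metric="euclidean")
    # The last row of Z is the final merge; its distance shows how the
    # chosen rule stretches or compresses the cluster hierarchy.
    print(f"{method:9s} final merge distance = {Z[-1, 2]:.3f}")
```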
Distance Measure
Here is a summary of all the distance measures used in Agglomerative Hierarchical Clustering:
- Euclidean Distance: The "straight line" distance between two points in n-dimensional space. The most direct and intuitive method of distance measurement.
- Squared Euclidean Distance: A variation of Euclidean distance where the distance between two points is squared. Places progressively greater weight on pairs of points that are farther apart, which also makes it more sensitive to outliers.
- Normalized Euclidean Distance: The Euclidean distance between two points after the data has been normalized so that all variables have a mean of 0 and a standard deviation of 1. A good choice when your data have different units or when you want each variable to have the same weight in distance calculation.
- Manhattan Distance: The total absolute distance between two points across all dimensions, calculated as the distance traversed when moving along a city grid.
- Minkowski Distance: A generalization of Euclidean and Manhattan distances. By choosing different values of the parameter p in the Minkowski formula (p = 1 gives Manhattan distance, p = 2 gives Euclidean distance), both of these metrics can be produced.
- Cosine Distance: Instead of measuring the distance between two points, it measures the angle between two vectors. Suitable for use when the orientation of the vectors is more important than the absolute length of the vectors.
- Correlation Distance: Measures distance based on the correlation between two vectors. Suitable for datasets where you want to know the correlation between variables, not the actual distance.
- Chebyshev Distance: Measures the distance between two vectors as the maximum absolute difference along any single dimension. Works well for data with many features, where Euclidean distance may become too sensitive to outliers.
- Custom Distance: You can create your own distance metric based on your specific domain knowledge.
- Jaccard Distance: This distance is calculated by comparing the differences and similarities between two sets, often used for categorical or binary data.
- Mahalanobis Distance: This distance is calculated by considering the correlation between variables and the scale of variables. A good metric to use if the data has different units and there is a significant correlation between variables.
The choice of distance metric in Agglomerative Hierarchical Clustering depends on the nature of the data and the objective of the analysis.
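As a rough illustration of how these metrics differ in practice, the sketch below computes a few of them for the same pair of samples using SciPy's pdist (an assumption; SmartstatXL performs the equivalent calculation internally).

```python
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_iris

X = load_iris().data[:10]   # a small slice keeps the matrices readable

# A few of the metrics listed above. 'cityblock' is Manhattan distance;
# 'minkowski' defaults to p = 2 (pass p=1, 3, ... for other orders).
# pdist also supports 'sqeuclidean', 'seuclidean', 'correlation',
# 'jaccard', 'mahalanobis', and others.
for metric in ["euclidean", "cityblock", "minkowski", "cosine", "chebyshev"]:
    D = squareform(pdist(X, metric=metric))   # condensed -> full square matrix
    print(f"{metric:10s} distance between samples 0 and 1: {D[0, 1]:.4f}")
```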
- Select the AHC Analysis output as shown above.
- Press the OK button to generate the output in the Output Sheet.
Analysis Results
AHC Information
Agglomerative Hierarchical Clustering (AHC) is a cluster analysis method that seeks to group data into a hierarchical or tree-like structure. AHC starts by considering each data point as a separate cluster. Then, it combines the most similar clusters in successive steps until all data points are in one large cluster. There are several methods for determining the proximity between clusters, known as "linkage" methods.
In this case, we are using the 'Complete Linkage' method. Complete linkage minimizes the maximum distance between observations from cluster pairs. This means that, for each pair of clusters, the distance is taken between their two farthest-apart observations, and at each merging stage the pair of clusters with the smallest such maximum distance is merged. This is different from other methods such as 'Single Linkage', which looks at the minimum distance between observations from cluster pairs, and 'Average Linkage', which looks at the average distance between observations from cluster pairs.
To calculate the distance between data points, we use 'Euclidean Distance'. This is a very common and intuitive method of calculating distance: the distance between two points is the straight-line distance between them, as you would measure with a ruler if you could draw a straight line between those points.
The variables used in this analysis are 'Sepal length', 'Sepal width', 'Petal length', and 'Petal width'. In other words, we are trying to group iris species based on these four measurements. It is expected that different species will have different sizes for at least some of these variables, and therefore we can identify species based on their size.
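To make the complete-linkage rule itself tangible, the sketch below computes the cluster-to-cluster distance by hand for two small made-up clusters of iris-like measurements; the values are chosen only for illustration and are not taken from the article's dataset.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters of iris-like measurements (4 features each)
cluster_a = np.array([[5.1, 3.5, 1.4, 0.2],
                      [4.9, 3.0, 1.4, 0.2]])
cluster_b = np.array([[6.3, 2.9, 5.6, 1.8],
                      [5.8, 2.7, 5.1, 1.9]])

# All pairwise Euclidean distances between points of the two clusters
pairwise = cdist(cluster_a, cluster_b, metric="euclidean")

# Complete linkage: cluster distance = the LARGEST pairwise distance.
# For contrast, single linkage uses the smallest, average linkage the mean.
complete = pairwise.max()
single, average = pairwise.min(), pairwise.mean()

print(f"complete = {complete:.3f}, single = {single:.3f}, average = {average:.3f}")
```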
Distance Matrix Table
The above matrix is a distance matrix that depicts the Euclidean distance between each pair of Iris flowers in the dataset.
- Within-group distance: The values along the diagonal are zero, which corresponds to the fact that the distance between a data point and itself is zero. For the same species, other values show the distance between pairs of flowers. For example, the distance between the first and second Setosa flowers is 1.367, while the distance between the first and third Setosa flowers is 0.332. These values can provide insight into how diverse flowers are within the same species.
- Between-group distance: You can also see the distance between two different species. For example, the distance between the first Setosa flower and the first Versicolor flower is 3.071. In this matrix, it seems that the distance between different species is always greater than the distance within the same species. This indicates that these two species can be well separated based on the features you have (i.e. Sepal length, Sepal width, Petal length, and Petal width).
- Insights about data structure: In general, you can use this matrix to understand how close each pair of flowers is in the 4-dimensional feature space. However, keep in mind that this matrix only shows distances and does not provide a direct picture of how the data points are arranged in space.
Depending on the results of the cluster analysis, this matrix can help us understand how well the clusters summarize the data structure. For instance, if we have a cluster that is entirely composed of Setosa flowers and a cluster that is entirely composed of Versicolor flowers, then this could be an indication that clustering works well since there seems to be a clear distance between these two species.
In hierarchical clustering analysis (AHC), the distance matrix plays a significant role. In this matrix, each element (i, j) is the distance between sample i and j in the dataset. This matrix is usually symmetric and all its elements are non-negative.
For the Iris dataset, assuming that the analysis is done on 4 numeric features (sepal length, sepal width, petal length, petal width), the distance matrix would be of dimension 35x35 (since only 35 samples are used, out of 150 samples of the Iris dataset). The values in the matrix represent the distance between that pair of data points. The AHC algorithm then uses this information to gradually group together the samples that are closest to each other, starting with those that have the smallest distance.
The values in this matrix are used to determine how samples are grouped together in the hierarchy. For example, if the distance between samples 1 and 2 is very small, they are most likely to be grouped in the same cluster at an early stage of the clustering process. However, it is important to note that this interpretation highly depends on the distance metric and the clustering method used. For instance, Euclidean distance is a commonly used distance metric, but there are other metrics such as Manhattan distance or Minkowski distance. Similarly, there are various clustering methods in AHC, like single linkage, complete linkage, average linkage, and Ward's method, all of which can influence the final clustering result.
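A hedged sketch of how such a 35 x 35 distance matrix can be computed in Python is shown below; the random subsample drawn here is an assumption and will not match the exact 35 rows listed in the table above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_iris

iris = load_iris()
rng = np.random.default_rng(seed=0)                    # arbitrary seed
idx = np.sort(rng.choice(150, size=35, replace=False)) # 35 of the 150 samples
X = iris.data[idx]

# Pairwise Euclidean distances between all 35 samples, as a square matrix
D = squareform(pdist(X, metric="euclidean"))

print(D.shape)                        # (35, 35)
print(np.allclose(np.diag(D), 0.0))   # True: each sample is at distance 0 from itself
print(np.allclose(D, D.T))            # True: the matrix is symmetric
```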
Amalgamation Schedule
The table above is the output from the amalgamation process in Hierarchical Clustering analysis. Each row represents a merging step in the clustering process, and there is some information we can derive from this table.
- Linkage Distance: This is the distance between the two clusters that are merged in that step. This distance is calculated based on the chosen linkage method (for example single, complete, average, or Ward's). This distance value indicates how similar or close the two clusters are considered to be.
- Setosa, Versicolor, Virginica: These are the Iris species labels representing each data point (or cluster) that is merged in that step. In this case, the species labels indicate the origin of the data point before it was merged.
For example, the first row shows that two data points from the Setosa species were merged at a distance of 0.141. Subsequently, in the row with a linkage distance of 0.245, we see that three data points with Versicolor labels were merged into one cluster. In the last row, all data points have been merged into a single cluster at a linkage distance of 6.155. At this stage there are no more mergers left to perform (all data points are part of a single cluster), and the clustering process ends.
By understanding this table, we can trace the clustering process and see how clusters are formed and evolve as the linkage distance increases. Interpreting results like this can help us understand the structure and relationships within our data, and also assist in choosing the correct cut-off point in the dendrogram to determine the final number of clusters.
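SciPy's linkage matrix plays the same role as this amalgamation schedule; the sketch below shows how each merging step can be read off it. This is an equivalent open-source computation on the full dataset, so the linkage distances will not match the values quoted from the SmartstatXL output above.

```python
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method="complete", metric="euclidean")

# Each row of Z describes one merging step:
#   columns 0, 1 : the indices of the two clusters being merged
#   column 2     : the linkage distance at which they are merged
#   column 3     : number of original observations in the new cluster
for step, (a, b, dist, size) in enumerate(Z[:5], start=1):
    print(f"step {step}: merge clusters {int(a)} and {int(b)} "
          f"at distance {dist:.3f} (new size {int(size)})")

# The very last row merges everything into a single cluster
print(f"final merge distance: {Z[-1, 2]:.3f}")
```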
Dendrogram
The dendrogram reflects a pattern commonly found in cluster analysis using the Iris dataset, and this is very useful in demonstrating the strengths and limitations of hierarchical clustering methods. Based on this dendrogram we can conclude:
- Setosa has been well separated into the first cluster. This reflects the fact that, in feature space, Setosa tends to be quite distinct from the other two species. Therefore, in most clustering methods, Setosa is often well separated.
- Versicolor is mostly grouped into the second cluster. However, there is one instance that is misclassified as Virginica. This could reflect a few things. For example, that Versicolor data point may be near the decision boundary between the Versicolor and Virginica clusters, and therefore more resembles Virginica in terms of its features than other Versicolors. Or, there may be measurement errors or natural variations in the data making that data point appear more like Virginica.
- Virginica is mostly in the third cluster, but there is one instance of Versicolor included in it. As explained previously, this might be because that data point is near the boundary between clusters or due to variation in the data.
In general, this interpretation shows that the clustering method used is quite good at identifying the natural structure within the data, although there are a few mistakes. The fact that most instances are correctly grouped indicates that there are some significant feature differences between the species that allow for effective clustering. The clustering errors that exist indicate that there is some overlap in features between Versicolor and Virginica, making them harder to perfectly separate. This is common in real-world cluster analysis.
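The sketch below reproduces this kind of three-cluster assessment in Python by cutting the dendrogram into three groups and cross-tabulating them against the true species. It runs on the full 150-row dataset rather than the 35-sample subset, so the exact counts may differ from the dendrogram discussed above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris

iris = load_iris()
Z = linkage(iris.data, method="complete", metric="euclidean")

# Cut the dendrogram so that exactly three clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")

# Cross-tabulate the clusters against the true species to spot overlaps
for cluster_id in np.unique(labels):
    names, counts = np.unique(iris.target_names[iris.target[labels == cluster_id]],
                              return_counts=True)
    print(f"cluster {cluster_id}: {dict(zip(names, counts))}")
```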
Conclusion
From the results of the Agglomerative Hierarchical Clustering (AHC) analysis conducted on the iris dataset, several important findings were obtained. Using the four measurement variables (Sepal Length, Sepal Width, Petal Length, and Petal Width) together with the Complete Linkage method and Euclidean Distance, several interpretations can be drawn.
- Firstly, the AHC method effectively identifies three main clusters closely related to the three iris species present in the dataset. This indicates that the chosen four measurements have sufficient power to differentiate between these species.
- Secondly, although this method is quite effective, there are still some errors in the grouping. In the Versicolor cluster, there is one misclassified Virginica species, and in the Virginica cluster, there is one misclassified Versicolor species. This may indicate that there is some overlap between these species, or it could also be due to variability in the measurements within one species.
- Overall, the AHC method demonstrates its effectiveness in classifying iris species based on the four measurement variables. Despite a few errors, it has generally been successful in revealing the cluster structure present in the data. Given these results, we can proceed with further analysis or try other approaches to see whether the clustering performance can be improved.