A Brief Tutorial

K-means is a popular unsupervised algorithm for clustering tasks. Despite its popularity, it can be difficult to use in some contexts because the number of clusters (or k) must be chosen before the algorithm is run.
Two quantitative methods to address this issue are the elbow plot and the silhouette score. Some authors regard the elbow plot as “coarse” and recommend that data scientists use the silhouette score [1]. Although general advice is useful in many situations, it is best to evaluate problems on a case-by-case basis to determine what is best for the data.
The aim of this article is to provide a tutorial on how to implement k-means clustering using an elbow plot and the silhouette score and to compare their performance.
A Google Colab notebook containing the code reviewed in this article can be accessed via the following link:
https://colab.research.google.com/drive/1saGoBHa4nb8QjdSpJhhYfgpPp3YCbteU?usp=sharing
The Seeds dataset was originally published in a study by Charytanowicz et al. [2] and can be accessed via the following link: https://archive.ics.uci.edu/dataset/236/seeds
The dataset comprises 210 entries and eight variables. One column contains information about a seed’s variety (i.e., 1, 2, or 3) and seven columns contain information about the geometric properties of the seeds. The properties include (a) area, (b) perimeter, (c) compactness, (d) kernel length, (e) kernel width, (f) asymmetry coefficient, and (g) kernel groove length.
Before constructing the models, we’ll need to conduct an exploratory data analysis (EDA) to make sure we understand the data.
We’ll start by loading the data, renaming the columns, and setting the column containing seed variety to a categorical variable.
import pandas as pd

url = 'https://raw.githubusercontent.com/CJTAYL/USL/main/seeds_dataset.txt'

# Load data into a pandas dataframe
df = pd.read_csv(url, delim_whitespace=True, header=None)
# Rename columns
df.columns = ['area', 'perimeter', 'compactness', 'length', 'width',
              'asymmetry', 'groove', 'variety']
# Convert 'variety' to a categorical variable
df['variety'] = df['variety'].astype('category')
Then we’ll display the structure of the dataframe and its descriptive statistics.
df.info()
df.describe(include='all')
Fortunately, there are no missing data (which is rare when dealing with real-world data), so we can continue exploring the data.
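If you prefer an explicit check rather than reading it off df.info(), a minimal sketch using the dataframe loaded above is shown below; every column should report zero missing values.
# Count missing values in each column (all counts should be zero)
print(df.isnull().sum())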
An imbalanced dataset can affect the quality of the clusters, so let’s check how many instances we have from each variety of seed.
df['variety'].value_counts()
1 70
2 70
3 70
Name: variety, dtype: int64
Based on the output of the code, we can see that we’re working with a balanced dataset. Specifically, the dataset contains 70 seeds from each group.
A useful visualization during an EDA is the histogram, since it can be used to determine the distribution of the data and detect the presence of skew. Since there are three varieties of seeds in the dataset, it may be helpful to plot the distribution of each numeric variable grouped by variety.
import matplotlib.pyplot as plt
import seaborn as sns

# Set the theme of the plots
sns.set_style('whitegrid')
# Identify the categorical variable
categorical_column = 'variety'
# Identify the numeric variables
numeric_columns = df.select_dtypes(include=['float64']).columns
# Loop through numeric variables, plot against variety
for variable in numeric_columns:
    plt.figure(figsize=(8, 4))  # Set the size of the plots
    ax = sns.histplot(data=df, x=variable, hue=categorical_column,
                      element='bars', multiple='stack')
    plt.xlabel(f'{variable.capitalize()}')
    plt.title(f'Distribution of {variable.capitalize()}'
              f' grouped by {categorical_column.capitalize()}')
    legend = ax.get_legend()
    legend.set_title(categorical_column.capitalize())
    plt.show()
From this plot, we can see there is some skewness in the data. To provide a more precise measure of skewness, we can use the skew() method.
df.skew(numeric_only=True)
area 0.399889
perimeter 0.386573
compactness -0.537954
length 0.525482
width 0.134378
asymmetry 0.401667
groove 0.561897
dtype: float64
Although there is some skewness in the data, none of the individual values appear to be extremely high (i.e., absolute values greater than 1); therefore, a transformation is not necessary at this time.
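For completeness, if the skew had been more pronounced, a transformation could be applied before modeling. The snippet below is an illustrative sketch only (it is not needed for the Seeds data) and assumes the skewed columns are right-skewed and non-negative, which is the situation a log transform addresses.
import numpy as np

# Identify columns whose absolute skewness exceeds 1 (none do in this dataset)
skew_values = df.skew(numeric_only=True)
skewed_cols = skew_values[skew_values.abs() > 1].index

# Apply a log transform to reduce right skew; log1p handles zero values safely
df[skewed_cols] = np.log1p(df[skewed_cols])

# Re-check skewness after the transformation
print(df.skew(numeric_only=True))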
Correlated features can affect the k-means algorithm, so we’ll generate a heat map of correlations to determine whether the features in the dataset are associated.
# Create correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Set the size of the visualization
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            square=True, linewidths=0.5, cbar_kws={'shrink': 0.5})
plt.title('Correlation Matrix Heat Map')
plt.show()
There are strong (0.60 ≤ |r| < 0.80) and very strong (0.80 ≤ |r| ≤ 1.00) correlations between some of the variables; however, the principal component analysis (PCA) we’ll conduct will address this issue.
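To see exactly which pairs of features fall into those bands, we can filter the correlation matrix computed above; a brief sketch:
import numpy as np

# Keep only the upper triangle so each pair of features appears once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# List the feature pairs with |r| >= 0.60
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs.abs() >= 0.60]
print(strong_pairs)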
Although we won’t use them in the k-means algorithm, the Seeds dataset contains labels (i.e., the ‘variety’ column). This information will be useful when we evaluate the performance of the implementations, so we’ll set it aside for now.
# Set aside ground truth for calculation of ARI
ground_truth = df['variety']
Before entering the data into the k-means algorithm, we’ll need to scale the data.
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Scale the data and drop the ground truth labels
ct = ColumnTransformer([
    ('scale', StandardScaler(), numeric_columns)
], remainder='drop')
df_scaled = ct.fit_transform(df)
# Create dataframe with scaled data
df_scaled = pd.DataFrame(df_scaled, columns=numeric_columns.tolist())
After scaling the data, we’ll conduct PCA to reduce the dimensions of the data and address the correlated variables we identified earlier.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # Account for 95% of the variance
reduced_features = pca.fit_transform(df_scaled)
explained_variances = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variances)
# Round the cumulative variance values to 2 digits
cumulative_variance = [round(num, 2) for num in cumulative_variance]
print(f'Cumulative Variance: {cumulative_variance}')
Cumulative Variance: [0.72, 0.89, 0.99]
The output of the code indicates that one dimension accounts for 72% of the variance, two dimensions account for 89% of the variance, and three dimensions account for 99% of the variance. To confirm the correct number of dimensions was retained, use the code below.
print(f'Number of components retained: {reduced_features.shape[1]}')
Number of components retained: 3
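If a visual check of the explained variance is preferred, the cumulative values can also be plotted; a brief sketch using the variables computed above:
# Plot the cumulative explained variance against the number of components
plt.figure(figsize=(8, 4))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
plt.title('Cumulative Explained Variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance')
plt.xticks(range(1, len(cumulative_variance) + 1))
plt.legend()
plt.show()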
Now the data are ready to be input into the k-means algorithm. We’re going to examine two implementations of the algorithm: one informed by an elbow plot and another informed by the silhouette score.
To generate an elbow plot, use the code snippet below:
from sklearn.cluster import KMeans

inertia = []
K_range = range(1, 6)
# Calculate inertia for the range of k
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=0, n_init='auto')
    kmeans.fit(reduced_features)
    inertia.append(kmeans.inertia_)
plt.figure(figsize=(10, 8))
plt.plot(K_range, inertia, marker='o')
plt.title('Elbow Plot')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.xticks(K_range)
plt.show()
The number of clusters is displayed on the x-axis and the inertia is displayed on the y-axis. Inertia refers to the sum of squared distances of samples to their nearest cluster center. Basically, it’s a measure of how close the data points are to the mean of their cluster (i.e., the centroid). When inertia is low, the clusters are denser and more clearly defined.
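To make that definition concrete, inertia can be recomputed by hand from a fitted model. The sketch below assumes it is run right after the loop above, so kmeans refers to the last fit (k = 5); the manual value should match kmeans.inertia_ up to floating-point error.
# Recompute inertia manually: sum of squared distances to the assigned centroid
centroids = kmeans.cluster_centers_   # centroids of the last fitted model
labels = kmeans.labels_               # index of the nearest centroid for each point
manual_inertia = np.sum((reduced_features - centroids[labels]) ** 2)

print(f'Manual inertia: {manual_inertia}')
print(f'kmeans.inertia_: {kmeans.inertia_}')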
When interpreting an elbow plot, look for the section of the line that resembles an elbow. In this case, the elbow is at three. When k = 1, the inertia will be large, and it will then gradually decrease as k increases.
The “elbow” is the point where the decrease begins to plateau and the addition of new clusters does not result in a significant decrease in inertia.
Based on this elbow plot, the value of k should be three. Using an elbow plot has been described as more of an art than a science, which is why it has been called “coarse”.
To implement the k-means algorithm when k = 3, we’ll run the following code.
k = 3  # Set the value of k equal to three

kmeans = KMeans(n_clusters=k, random_state=2, n_init='auto')
clusters = kmeans.fit_predict(reduced_features)
# Create dataframe for clusters
cluster_assignments = pd.DataFrame({'symbol': df.index,
                                    'cluster': clusters})
# Sort values by cluster
sorted_assignments = cluster_assignments.sort_values(by='cluster')
# Convert assignments to the same scale as 'variety'
sorted_assignments['cluster'] = [num + 1 for num in sorted_assignments['cluster']]
# Convert 'cluster' to category type
sorted_assignments['cluster'] = sorted_assignments['cluster'].astype('category')
The code below can be used to visualize the output of k-means clustering informed by the elbow plot.
from mpl_toolkits.mplot3d import Axes3D

plt.figure(figsize=(15, 8))
ax = plt.axes(projection='3d')  # Set up a 3D projection
# Colors for each cluster
colours = ['blue', 'orange', 'green']
# Plot each cluster in 3D
for i, color in enumerate(colours):
    # Only select data points that belong to the current cluster
    ix = np.where(clusters == i)
    ax.scatter(reduced_features[ix, 0], reduced_features[ix, 1],
               reduced_features[ix, 2], c=[color], label=f'Cluster {i+1}',
               s=60, alpha=0.8, edgecolor='w')
# Plotting the centroids in 3D
centroids = kmeans.cluster_centers_
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], marker='+',
           s=100, alpha=0.4, linewidths=3, color='red', zorder=10,
           label='Centroids')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('K-Means Clusters Informed by Elbow Plot')
ax.view_init(elev=20, azim=20) # Change viewing angle to make all axes visible
# Display the legend
ax.legend()
plt.show()
Since the data were reduced to three dimensions, they’re plotted on a 3D plot. To gain additional information about the clusters, we can use countplot from the Seaborn package.
plt.figure(figsize=(10, 8))

ax = sns.countplot(data=sorted_assignments, x='cluster', hue='cluster',
                   palette=colours)
plt.title('Cluster Distribution')
plt.ylabel('Count')
plt.xlabel('Cluster')
legend = ax.get_legend()
legend.set_title('Cluster')
plt.show()
Earlier, we determined that each group comprised 70 seeds. The data displayed in this plot indicate that k-means informed by the elbow plot may have performed moderately well, since the count for each cluster is around 70; however, there are better ways to evaluate performance.
To provide a more precise measure of how well the algorithm performed, we’ll use three metrics: (a) the Davies-Bouldin Index, (b) the Calinski-Harabasz Index, and (c) the Adjusted Rand Index. We’ll discuss how to interpret them in the Results and Analysis section, but the following code snippet can be used to calculate their values.
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score, adjusted_rand_score

# Calculate metrics
davies_bouldin = davies_bouldin_score(reduced_features, kmeans.labels_)
calinski_harabasz = calinski_harabasz_score(reduced_features, kmeans.labels_)
adj_rand = adjusted_rand_score(ground_truth, kmeans.labels_)

print(f'Davies-Bouldin Index: {davies_bouldin}')
print(f'Calinski-Harabasz Index: {calinski_harabasz}')
print(f'Adjusted Rand Index: {adj_rand}')
Davies-Bouldin Index: 0.891967185123475
Calinski-Harabasz Index: 259.83668751473334
Adjusted Rand Index: 0.7730246875577171
A silhouette score is the mean silhouette coefficient over all instances. The values can range from -1 to 1, with:
- 1 indicating an instance is well inside its cluster
- 0 indicating an instance is close to its cluster’s boundary
- -1 indicating the instance may have been assigned to the wrong cluster
When interpreting the silhouette score, we should select the number of clusters with the highest score.
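To see those per-instance coefficients directly, scikit-learn provides silhouette_samples, which returns one value per data point. The brief sketch below fits a separate k = 3 model purely for illustration; the mean of the per-instance values equals the silhouette score.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Fit a k = 3 model and inspect the individual silhouette coefficients
km = KMeans(n_clusters=3, random_state=0, n_init='auto')
labels = km.fit_predict(reduced_features)

coefficients = silhouette_samples(reduced_features, labels)
print(f'Minimum coefficient: {coefficients.min():.2f}')
print(f'Mean coefficient (the silhouette score): {coefficients.mean():.2f}')
print(f'Maximum coefficient: {coefficients.max():.2f}')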
To generate a plot of silhouette scores for multiple values of k, we can use the following code.
from sklearn.metrics import silhouette_score

silhouette_scores = []
K_range = range(2, 6)
# Calculate Silhouette Coefficient for range of k
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=1, n_init='auto')
    cluster_labels = kmeans.fit_predict(reduced_features)
    silhouette_avg = silhouette_score(reduced_features, cluster_labels)
    silhouette_scores.append(silhouette_avg)
plt.figure(figsize=(10, 8))
plt.plot(K_range, silhouette_scores, marker='o')
plt.title('Silhouette Coefficient')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Coefficient')
plt.ylim(0, 0.5) # Modify based on data
plt.xticks(K_range)
plt.show()
The data indicate that k should equal two.
Using this information, we can implement the k-means algorithm again.
k = 2  # Set k to the value with the highest silhouette score

kmeans = KMeans(n_clusters=k, random_state=4, n_init='auto')
clusters = kmeans.fit_predict(reduced_features)
cluster_assignments2 = pd.DataFrame({'symbol': df.index,
                                     'cluster': clusters})
sorted_assignments2 = cluster_assignments2.sort_values(by='cluster')
# Convert assignments to same scale as 'variety'
sorted_assignments2['cluster'] = [num + 1 for num in sorted_assignments2['cluster']]
sorted_assignments2['cluster'] = sorted_assignments2['cluster'].astype('category')
To generate a plot of the algorithm when k = 2, we can use the code presented below.
plt.figure(figsize=(15, 8))
ax = plt.axes(projection='3d')  # Set up a 3D projection

# Colors for each cluster
colours = ['blue', 'orange']
# Plot each cluster in 3D
for i, color in enumerate(colours):
    # Only select data points that belong to the current cluster
    ix = np.where(clusters == i)
    ax.scatter(reduced_features[ix, 0], reduced_features[ix, 1],
               reduced_features[ix, 2], c=[color], label=f'Cluster {i+1}',
               s=60, alpha=0.8, edgecolor='w')
# Plotting the centroids in 3D
centroids = kmeans.cluster_centers_
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], marker='+',
           s=100, alpha=0.4, linewidths=3, color='red', zorder=10,
           label='Centroids')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('K-Means Clusters Informed by Silhouette Score')
ax.view_init(elev=20, azim=20) # Change viewing angle to make all axes visible
# Display the legend
ax.legend()
plt.show()
Similar to the k-means implementation informed by the elbow plot, additional information can be gleaned using countplot from Seaborn, as shown in the sketch below.
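This reuses the earlier countplot code, swapping in the second set of cluster assignments (sorted_assignments2) and the two-color palette defined above.
plt.figure(figsize=(10, 8))

ax = sns.countplot(data=sorted_assignments2, x='cluster', hue='cluster',
                   palette=colours)
plt.title('Cluster Distribution')
plt.ylabel('Count')
plt.xlabel('Cluster')
legend = ax.get_legend()
legend.set_title('Cluster')
plt.show()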
Based on our understanding of the dataset (i.e., it includes three varieties of seeds with 70 samples from each category), an initial reading of the plot may suggest that the implementation informed by the silhouette score didn’t perform as well on the clustering task; however, we cannot use this plot in isolation to make a determination.
To provide a more robust and detailed comparison of the implementations, we’ll calculate the three metrics that were used for the implementation informed by the elbow plot.
# Calculate metrics
ss_davies_bouldin = davies_bouldin_score(reduced_features, kmeans.labels_)
ss_calinski_harabasz = calinski_harabasz_score(reduced_features, kmeans.labels_)
ss_adj_rand = adjusted_rand_score(ground_truth, kmeans.labels_)

print(f'Davies-Bouldin Index: {ss_davies_bouldin}')
print(f'Calinski-Harabasz Index: {ss_calinski_harabasz}')
print(f'Adjusted Rand Index: {ss_adj_rand}')
Davies-Bouldin Index: 0.7947218992989975
Calinski-Harabasz Index: 262.8372675890969
Adjusted Rand Index: 0.5074767556450577
To compare the results from both implementations, we can create a dataframe and display it as a table.
from tabulate import tabulate

metrics = ['Davies-Bouldin Index', 'Calinski-Harabasz Index', 'Adjusted Rand Index']
elbow_plot = [davies_bouldin, calinski_harabasz, adj_rand]
silh_score = [ss_davies_bouldin, ss_calinski_harabasz, ss_adj_rand]
interpretation = ['SS', 'SS', 'EP']
scores_df = pd.DataFrame(zip(metrics, elbow_plot, silh_score, interpretation),
                         columns=['Metric', 'Elbow Plot', 'Silhouette Score',
                                  'Favors'])
# Convert DataFrame to a table
print(tabulate(scores_df, headers='keys', tablefmt='fancy_grid', colalign='left'))
The metrics used to compare the implementations of k-means clustering include internal metrics (e.g., Davies-Bouldin, Calinski-Harabasz), which do not use ground truth labels, and external metrics (e.g., Adjusted Rand Index), which do. A brief description of the three metrics is provided below.
- Davies-Bouldin Index (DBI): The DBI captures the trade-off between cluster compactness and the gap between clusters. Lower values of DBI indicate there are tighter clusters with more separation between clusters [3].
- Calinski-Harabasz Index (CHI): The CHI measures cluster density and distance between clusters. Higher values indicate that clusters are dense and well-separated [4].
- Adjusted Rand Index (ARI): The ARI measures agreement between cluster labels and the ground truth. The values of the ARI range from -1 to 1. A score of 1 indicates perfect agreement between labels and ground truth; a score of 0 indicates random assignments; and a score of -1 indicates a worse-than-random assignment [5]. A short example follows this list.
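One practical point about the ARI worth illustrating: it measures agreement between groupings, so the arbitrary numbering of the clusters does not matter. A small sketch with made-up labels:
from sklearn.metrics import adjusted_rand_score

truth = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# Same grouping as the truth, but with the cluster IDs permuted: ARI = 1.0
relabeled = [3, 3, 3, 1, 1, 1, 2, 2, 2]
print(adjusted_rand_score(truth, relabeled))

# A grouping that cuts across every truth class: ARI is low (here negative)
mixed = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(adjusted_rand_score(truth, mixed))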
When comparing the two implementations, we observed that k-means informed by the silhouette score performed best on the two internal metrics, indicating more compact and well-separated clusters. However, k-means informed by the elbow plot performed best on the external metric (i.e., ARI), indicating better alignment with the ground truth labels.
Ultimately, the best-performing implementation will be determined by the task. If the task requires clusters that are cohesive and well separated, then internal metrics (e.g., DBI, CHI) would be more relevant. If the task requires the clusters to align with the ground truth labels, then external metrics, like the ARI, may be more relevant.
The aim of this project was to provide a comparison between k-means clustering informed by an elbow plot and by the silhouette score, and since there wasn’t a defined task beyond a pure comparison, we cannot provide a definitive answer as to which implementation is better.
Although the absence of a definitive conclusion may be frustrating, it highlights the importance of considering multiple metrics when comparing machine learning models and remaining focused on the project’s objectives.
Thanks for taking the time to read this article. If you have any feedback or questions, please leave a comment.
[1] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2021), O’Reilly.
[2] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. Kowalski, S. Łukasik, & S. Zak, Complete Gradient Clustering Algorithm for Features Analysis of X-Ray Images (2010), Advances in Intelligent and Soft Computing, https://doi.org/10.1007/978-3-642-13105-9_2
[3] D. L. Davies, D. W. Bouldin, A Cluster Separation Measure (1979), IEEE Transactions on Pattern Analysis and Machine Intelligence, https://doi.org/10.1109/TPAMI.1979.4766909
[4] T. Caliński, J. Harabasz, A Dendrite Method for Cluster Analysis (1974), Communications in Statistics, https://doi.org/10.1080/03610927408827101
[5] N. X. Vinh, J. Epps, J. Bailey, Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance (2010), Journal of Machine Learning Research, https://www.jmlr.org/papers/volume11/vinh10a/vinh10a.pdf