External Evaluation

External evaluation for clustering algorithms is a method of evaluating the performance of a clustering algorithm by comparing its results to a known set of class labels or ground truth. In other words, the algorithm's clusters are compared to a set of pre-existing labels created by experts or based on domain knowledge.

Most commonly used external metrics

The Rand Index (RI) measures the similarity between two clusterings (partitions) of the same data and is often used as an external evaluation metric in clustering. It is the fraction of pairs of data points on which the predicted and true clusterings agree: pairs assigned to the same cluster in both, plus pairs assigned to different clusters in both, divided by the total number of data point pairs.

The Rand Index is calculated as follows:

Let n be the total number of data points;
Let a be the number of pairs of data points assigned to the same cluster in both the predicted and true clusterings;
Let b be the number of pairs of data points assigned to different clusters in both the predicted and true clusterings.

The Rand Index is then given by RI = 2*(a+b) / (n*(n-1)), that is, the number of agreeing pairs (a+b) divided by the total number of pairs n*(n-1)/2.
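
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation scikit-learn uses: the helper name rand_index and the two small label lists are made up for this example.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand Index computed directly from the pair counts a and b (hypothetical helper)."""
    n = len(labels_true)
    a = 0  # pairs placed in the same cluster in both clusterings
    b = 0  # pairs placed in different clusters in both clusterings
    for i, j in combinations(range(n), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1
        elif not same_true and not same_pred:
            b += 1
    return 2 * (a + b) / (n * (n - 1))

# Toy labels invented for illustration
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

ri = rand_index(labels_true, labels_pred)
print(round(ri, 3))  # a = 2, b = 8, 15 pairs in total -> 0.667
```

On the same inputs, this should agree with rand_score from sklearn.metrics, which is what the example that follows relies on.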


from sklearn.metrics import rand_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

# Create a row of three subplots, one per dataset
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 5)

# Circles dataset: two concentric rings
X_circles, y = make_circles(n_samples=500, factor=0.2)
# Fit K-means with two clusters and predict a label for every point
clustering = KMeans(n_clusters=2).fit(X_circles)
predicted_circles = clustering.predict(X_circles)
# Plot the clusters and show the RI for the circles dataset in the title
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering.labels_, cmap='tab20b')
axes[0].set_title('RI is: ' + str(round(rand_score(y, predicted_circles), 3)))

# Blobs dataset: two well-separated groups of points
X_blobs, y = make_blobs(n_samples=500, centers=2)
clustering = KMeans(n_clusters=2).fit(X_blobs)
predicted_blobs = clustering.predict(X_blobs)
# Plot the clusters and show the RI for the blobs dataset in the title
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering.labels_, cmap='tab20b')
axes[1].set_title('RI is: ' + str(round(rand_score(y, predicted_blobs), 3)))

# Moons dataset: two interleaving half-circles
X_moons, y = make_moons(n_samples=500)
clustering = KMeans(n_clusters=2).fit(X_moons)
predicted_moons = clustering.predict(X_moons)
# Plot the clusters and show the RI for the moons dataset in the title
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering.labels_, cmap='tab20b')
axes[2].set_title('RI is: ' + str(round(rand_score(y, predicted_moons), 3)))

plt.show()

The Rand Index ranges from 0 to 1, where 0 indicates that the two clusterings do not agree on any pair of data points, and 1 indicates that the two clusterings are identical (up to a renaming of the clusters).
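
Both ends of this range can be checked on tiny hand-made labelings. The sketch below uses rand_score from scikit-learn; the label lists are invented for illustration only.

```python
from sklearn.metrics import rand_score

# Same partition, only the cluster names are swapped: every pair agrees
ri_relabeled = rand_score([0, 0, 1, 1], [1, 1, 0, 0])

# True labels put all points together, the prediction puts each point alone:
# no pair is treated the same way by both clusterings
ri_disjoint = rand_score([0, 0, 0], [0, 1, 2])

print(ri_relabeled, ri_disjoint)
```

The first call illustrates that the metric compares groupings rather than label values, so the predicted cluster numbers do not need to match the class numbers.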

Mutual Information (MI) measures the amount of information shared by the predicted and true clusterings and is based on the concept of entropy. We will not cover how this metric is calculated, as that is outside the scope of this beginner-level course.


from sklearn.metrics import mutual_info_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

# Create a row of three subplots, one per dataset
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 5)

# Circles dataset: two concentric rings
X_circles, y = make_circles(n_samples=500, factor=0.2)
# Fit K-means with two clusters and predict a label for every point
clustering = KMeans(n_clusters=2).fit(X_circles)
predicted_circles = clustering.predict(X_circles)
# Plot the clusters and show the MI for the circles dataset in the title
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering.labels_, cmap='tab20b')
axes[0].set_title('MI is: ' + str(round(mutual_info_score(y, predicted_circles), 3)))

# Blobs dataset: two well-separated groups of points
X_blobs, y = make_blobs(n_samples=500, centers=2)
clustering = KMeans(n_clusters=2).fit(X_blobs)
predicted_blobs = clustering.predict(X_blobs)
# Plot the clusters and show the MI for the blobs dataset in the title
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering.labels_, cmap='tab20b')
axes[1].set_title('MI is: ' + str(round(mutual_info_score(y, predicted_blobs), 3)))

# Moons dataset: two interleaving half-circles
X_moons, y = make_moons(n_samples=500)
clustering = KMeans(n_clusters=2).fit(X_moons)
predicted_moons = clustering.predict(X_moons)
# Plot the clusters and show the MI for the moons dataset in the title
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering.labels_, cmap='tab20b')
axes[2].set_title('MI is: ' + str(round(mutual_info_score(y, predicted_moons), 3)))

plt.show()

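Unlike the Rand Index, the value returned by mutual_info_score is not scaled to the range from 0 to 1. It is measured in nats: 0 indicates that the predicted clustering shares no information with the true clustering, and the maximum, reached when the predicted clustering is identical to the true clustering, equals the entropy of the true labels (ln 2 ≈ 0.693 for two equally sized classes). Furthermore, in the examples above, this metric flags bad clustering more clearly than the Rand Index: a clustering that is no better than random scores close to 0, while its Rand Index stays near 0.5 when there are two balanced clusters.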
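
To get a feel for the scale of this metric, the sketch below evaluates mutual_info_score on hand-made labelings with two equally sized classes. It also calls normalized_mutual_info_score, a scikit-learn function that this lesson does not otherwise use, to show a variant rescaled to the range from 0 to 1. The label lists are invented for illustration.

```python
import math
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

labels_true = [0, 0, 1, 1]
identical = [1, 1, 0, 0]   # same partition, cluster names swapped
unrelated = [0, 1, 0, 1]   # every cluster mixes both classes equally

mi_identical = mutual_info_score(labels_true, identical)
mi_unrelated = mutual_info_score(labels_true, unrelated)
nmi_identical = normalized_mutual_info_score(labels_true, identical)

print(round(mi_identical, 3), round(math.log(2), 3))  # both are ln 2
print(round(mi_unrelated, 3), round(nmi_identical, 3))
```

With the unnormalized score, a perfect clustering of two balanced classes reaches ln 2 rather than 1, so raw MI values are easiest to compare between clusterings of the same dataset.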

Homogeneity measures the degree to which each cluster contains only data points that belong to a single class or category. It is computed from conditional entropy. As with mutual information, we will not cover how this metric is calculated.


from sklearn.metrics import homogeneity_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

# Create a row of three subplots, one per dataset
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 5)

# Circles dataset: two concentric rings
X_circles, y = make_circles(n_samples=500, factor=0.2)
# Fit K-means with two clusters and predict a label for every point
clustering = KMeans(n_clusters=2).fit(X_circles)
predicted_circles = clustering.predict(X_circles)
# Plot the clusters and show the homogeneity for the circles dataset in the title
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering.labels_, cmap='tab20b')
axes[0].set_title('Homogeneity is: ' + str(round(homogeneity_score(y, predicted_circles), 3)))

# Blobs dataset: two well-separated groups of points
X_blobs, y = make_blobs(n_samples=500, centers=2)
clustering = KMeans(n_clusters=2).fit(X_blobs)
predicted_blobs = clustering.predict(X_blobs)
# Plot the clusters and show the homogeneity for the blobs dataset in the title
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering.labels_, cmap='tab20b')
axes[1].set_title('Homogeneity is: ' + str(round(homogeneity_score(y, predicted_blobs), 3)))

# Moons dataset: two interleaving half-circles
X_moons, y = make_moons(n_samples=500)
clustering = KMeans(n_clusters=2).fit(X_moons)
predicted_moons = clustering.predict(X_moons)
# Plot the clusters and show the homogeneity for the moons dataset in the title
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering.labels_, cmap='tab20b')
axes[2].set_title('Homogeneity is: ' + str(round(homogeneity_score(y, predicted_moons), 3)))

plt.show()

A clustering solution is considered highly homogeneous if each cluster contains data points from only one true class or category.
In other words, homogeneity measures how pure the clusters are with respect to the true classes. The homogeneity score ranges from 0 to 1, with 1 indicating perfect homogeneity.
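
A small sketch of how the score behaves at its two extremes, using homogeneity_score on hand-made labelings (the label lists are invented for illustration):

```python
from sklearn.metrics import homogeneity_score

labels_true = [0, 0, 1, 1]

pure = [1, 1, 0, 0]        # each cluster holds points from one class only
mixed = [0, 1, 0, 1]       # each cluster holds one point from each class
over_split = [0, 1, 2, 3]  # every point in its own cluster

h_pure = homogeneity_score(labels_true, pure)
h_mixed = homogeneity_score(labels_true, mixed)
h_over_split = homogeneity_score(labels_true, over_split)

print(round(h_pure, 3), round(h_mixed, 3), round(h_over_split, 3))
```

Here the pure and over-split labelings both score 1, because the metric only asks whether each cluster is pure, not how many clusters there are. That is worth keeping in mind when the number of clusters is not fixed to the number of classes, as it is in the K-means examples here.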

In the examples above, homogeneity separates good and bad clusterings most clearly among the metrics considered: it is close to 1 for a good clustering and close to 0 for a bad one.
