Setting Parameters: Linkage

In the previous chapter, we mentioned additional important parameters of AgglomerativeClustering. Let's look at one of them: linkage.

There are 4 main types of linkage:

'single' - the proximity between two clusters is the proximity between their two closest objects. (This is the default method of scipy's linkage function, which is used for the dendrograms below.)
'complete' - the proximity between two clusters is the proximity between their two most distant objects.
'average' - the proximity between two clusters is the arithmetic mean of the proximities over all pairs of points, one taken from each cluster.
'ward' - merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares; it is also called Minimal Increase of Sum-of-Squares (MISSQ). You can read the explanation of this in the documentation. [default parameter value of AgglomerativeClustering]
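To make the definitions above concrete, here is a minimal sketch that computes the first three proximities by hand for two tiny made-up clusters. The arrays cluster_a and cluster_b and the helper sum_of_squares are illustrative names, not part of the lesson's data or of any library. For ward, the sketch computes the increase in the within-cluster sum of squares that merging the two clusters would cause.

```python
import numpy as np

# Two tiny hypothetical clusters on a line (illustrative values only)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All Euclidean distances between a point of A and a point of B
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

single_d = pairwise.min()     # distance between the two closest objects
complete_d = pairwise.max()   # distance between the two most distant objects
average_d = pairwise.mean()   # mean over all cross-cluster pairs


def sum_of_squares(points):
    """Sum of squared distances from each point to the cluster centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()


# Increase in the within-cluster sum of squares caused by merging A and B
merged = np.vstack([cluster_a, cluster_b])
ward_increase = (sum_of_squares(merged)
                 - sum_of_squares(cluster_a)
                 - sum_of_squares(cluster_b))

print(single_d, complete_d, average_d, ward_increase)
```

For these four points the values are 2, 5, 3.5 and 12.25, so the same pair of clusters looks "close" or "far" depending on the linkage.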

So, which value of the linkage parameter should you use? As before, there is no single correct answer. Note that single linkage is the fastest method and performs well on non-globular data, but it is not robust: it is sensitive to noise and outliers. Ward linkage is the most resistant to noise in data (and resembles the K-Means algorithm in that it works with cluster centroids and within-cluster variance), but takes more time to perform. The two remaining linkages, complete and average, work fine with cleanly separated globular clusters but have mixed results otherwise.
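One simple way to experiment is to run the same data through every linkage and compare the resulting clusters. The sketch below assumes synthetic stand-in data (two well-separated blobs, generated here only for illustration) and uses scipy's linkage and fcluster functions; the variable names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in data: two well-separated blobs of 20 points each
rng = np.random.default_rng(0)
blob_1 = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
blob_2 = rng.normal(loc=20.0, scale=0.5, size=(20, 2))
points = np.vstack([blob_1, blob_2])

sizes_by_method = {}
for method in ("single", "complete", "average", "ward"):
    merges = linkage(points, method=method)
    # Cut the tree so that at most 2 flat clusters remain
    labels = fcluster(merges, t=2, criterion="maxclust")
    # Labels start at 1, so drop the empty count for label 0
    sizes_by_method[method] = sorted(np.bincount(labels)[1:].tolist())
    print(method, sizes_by_method[method])
```

On clean data like this all four linkages should agree; the differences described above tend to appear once the data are noisy or non-globular.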

You can always experiment and compare the dendrograms for different linkages. But be careful when reading a dendrogram built with ward linkage, as it behaves differently than for the other linkages.
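One way this difference can show up is in the scale of the vertical axis. This is an interpretation of the warning above, sketched on synthetic data: ward merge heights come from a variance-based criterion that grows with cluster size, so they are not directly comparable to the heights produced by the other linkages.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Synthetic stand-in data: two well-separated blobs of 20 points each
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=20.0, scale=0.5, size=(20, 2)),
])

# Column 2 of the linkage matrix holds the height of each merge;
# the last row is the final (top) merge of the dendrogram
top_average = linkage(points, method="average")[-1, 2]
top_ward = linkage(points, method="ward")[-1, 2]

print(f"top merge height, average: {top_average:.2f}")
print(f"top merge height, ward:    {top_ward:.2f}")
```

A cut height that looks reasonable on an 'average' dendrogram therefore should not be reused as-is on a 'ward' dendrogram of the same data.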

For example, let's compare two dendrograms for the same data (the scatter plot of which is shown below): the first with 'single' linkage and the second with 'average'.


# Import the libraries
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Read the data
data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/138ab9ad-aa37-4310-873f-0f62abafb038/model_data1.csv')

# Compute the distances at which clusters are merged (the linkage matrix)
dist_single = linkage(data, method='single')

# Build the dendrogram
dendrogram(dist_single)
plt.title("method = 'single'")
plt.show()

If you replace lines 9-15 of the code above (from the "Compute the distances" comment to plt.show()) with the code below, you will get a different result.


# Compute the distances at which clusters are merged (the linkage matrix)
dist_average = linkage(data, method='average')

# Build the dendrogram
dendrogram(dist_average)
plt.title("method = 'average'")
plt.show()

You can see that the dendrogram function even chose 3 colors in the first case and 2 in the second. But remember that the final choice is up to you, since the line heights represent the dissimilarities between observations and clusters. On the second plot there is a significant height on the right side, and that is enough to consider two separate clusters there.

So, how should you choose the number of clusters based on a dendrogram? Imagine that you draw a horizontal line at some level. The number of vertical lines it intersects is the number of clusters at that level. But how do you choose the height at which to draw the line? A common practice is to draw it where all the intersected branches have a 'significant' distance between them. For example, on the dendrogram above it makes sense to draw such a line at a height of 2-4, but not below, since there would be many intersections quite close to each other.
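The horizontal-line idea can also be expressed in code. The sketch below is a minimal illustration on synthetic data, not the lesson's dataset; cut_height is a hypothetical value chosen for this data only. It counts the clusters at the cut in two ways: from the merge heights directly, and with scipy's fcluster function.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in data: two well-separated blobs of 15 points each
rng = np.random.default_rng(2)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(15, 2)),
    rng.normal(loc=20.0, scale=0.5, size=(15, 2)),
])

merges = linkage(points, method="average")

# Hypothetical cut height, chosen for this synthetic data only
cut_height = 10.0

# Merges higher than the cut have not happened yet at that level,
# so the number of clusters is (merges above the cut) + 1
n_crossed = int((merges[:, 2] > cut_height).sum()) + 1

# fcluster with criterion="distance" assigns flat cluster labels for that cut
labels = fcluster(merges, t=cut_height, criterion="distance")
n_clusters = len(np.unique(labels))

print(n_crossed, n_clusters)
```

Both counts should match, and the labels array can then be used to color a scatter plot or to compare against another linkage.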

Task


For the data from the last chapter, build the dendrogram using complete linkage. Follow these steps:

Import the dendrogram and linkage functions from scipy.cluster.hierarchy.
Compute the distances in data using 'complete' linkage. Save the result in the dist_complete variable.
Build the dendrogram for dist_complete. Do not forget to display the plot after creating the dendrogram.

Compare the result with the dendrogram for the same data built with single linkage.
