Learn Setting Parameters: Linkage | Hierarchical Clustering
Cluster Analysis in Python

Setting Parameters: Linkage

In the previous chapter, we mentioned several other important parameters of the AgglomerativeClustering class. Let's take a closer look at one of them: linkage.

There are 4 main types of linkage:

  • 'single' - the proximity between two clusters is the proximity between their two closest objects. [default value of method in scipy's linkage function; AgglomerativeClustering itself defaults to 'ward']
  • 'complete' - the proximity between two clusters is the proximity between their two most distant objects.
  • 'average' - the proximity between two clusters is the arithmetic mean of the proximities over all pairs of points, taking one point from each cluster.
  • 'ward' - also called Minimal Increase of Sum-of-Squares (MISSQ): the clusters whose merge increases the total within-cluster sum of squares the least are joined. You can read the full explanation in the documentation.
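The four definitions above can be checked by hand on a toy example. The sketch below uses two tiny made-up clusters (not the lesson's data) and computes each proximity directly with NumPy and scipy's cdist, rather than through AgglomerativeClustering.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny, made-up clusters on a line (illustrative data, not from the lesson)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances, taking one point from each cluster
pairwise = cdist(cluster_a, cluster_b)

single_proximity = pairwise.min()    # distance between the two closest objects
complete_proximity = pairwise.max()  # distance between the two most distant objects
average_proximity = pairwise.mean()  # mean over all cross-cluster pairs


def sum_of_squares(points):
    """Sum of squared distances from each point to the centroid of the points."""
    return ((points - points.mean(axis=0)) ** 2).sum()


# Ward criterion: how much the within-cluster sum of squares grows after the merge
merged = np.vstack([cluster_a, cluster_b])
ward_increase = sum_of_squares(merged) - sum_of_squares(cluster_a) - sum_of_squares(cluster_b)

print(single_proximity, complete_proximity, average_proximity, ward_increase)
# prints: 3.0 6.0 4.5 20.25
```

Note that the Ward value is an increase in squared error, not a distance, so it is not directly comparable with the other three numbers.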

So, which value of the linkage parameter should you use? As before, there is no single correct answer. Note that single linkage is the fastest method but is not robust, i.e. it is sensitive to noise and outliers; it does, however, perform well on non-globular data. Ward linkage is the most resistant to noise in the data (and is similar to the K-Means algorithm in that it minimizes the within-cluster variance around centroids) but takes more time to compute. The two remaining linkages, complete and average, work fine with cleanly separated globular clusters but give mixed results otherwise.
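To see the point about non-globular data in practice, here is a minimal sketch on made-up data: two noise-free concentric rings. It uses scipy's linkage and fcluster instead of AgglomerativeClustering, and the helper ring_purity is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative non-globular data: two concentric rings (made up for this sketch)
angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
inner_ring = np.column_stack([np.cos(angles), np.sin(angles)])
outer_ring = 3 * inner_ring
points = np.vstack([inner_ring, outer_ring])
true_labels = np.repeat([0, 1], 50)


def ring_purity(method):
    """Cut the tree into 2 clusters and return the share of points grouped with their own ring."""
    labels = fcluster(linkage(points, method=method), t=2, criterion='maxclust')
    # Cluster ids are arbitrary, so score both possible matchings and keep the better one
    agreement = np.mean((labels - 1) == true_labels)
    return max(agreement, 1 - agreement)


for method in ['single', 'complete', 'average', 'ward']:
    print(method, ring_purity(method))
```

On this data single linkage recovers both rings exactly, because the gap between the rings is larger than the spacing between neighbouring points on each ring. The scores for the other methods are worth comparing yourself, and adding a few noisy points between the rings is a quick way to test the robustness claim for single linkage.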

You can always experiment and compare the dendrograms for different linkages. But be careful when reading a dendrogram built with ward distances, as it behaves differently than for the other linkages.
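One visible difference is the height scale. The sketch below is an illustration on made-up data, not an exhaustive account of how Ward differs: it prints the height of the final merge for each method, so you can see that Ward heights are not on the same scale as the point-to-point distances used by the other linkages.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative data (made up): two tight 3x3 grids of points, far apart
grid = np.array([[x, y] for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)])
points = np.vstack([grid, grid + 10.0])

# Column 2 of the linkage matrix holds the height of each merge;
# the last row is the final merge that joins everything into one cluster
top_heights = {
    method: linkage(points, method=method)[-1, 2]
    for method in ['single', 'complete', 'average', 'ward']
}

for method, height in top_heights.items():
    print(method, round(float(height), 2))
```

For this data the Ward height of the last merge is well above the largest distance between any two points, because scipy's Ward height grows with the sizes of the clusters being merged. A cut height that works for an 'average' dendrogram therefore cannot be reused as-is for a 'ward' one.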

For example, let's compare two dendrograms for the same data (its scatter plot is shown below): the first built with 'single' linkage and the second with 'average'.

    # Import the libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    # Read the data
    data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/138ab9ad-aa37-4310-873f-0f62abafb038/model_data1.csv')

    # Compute the distances
    dist_single = linkage(data, method='single')

    # Build the dendrogram
    dendrogram(dist_single)
    plt.title("method = 'single'")
    plt.show()

But if you replace lines 9-15 (everything from the "# Compute the distances" comment onward) with the code below, you will get the following result.

    # Compute the distances
    dist_average = linkage(data, method='average')

    # Build the dendrogram
    dendrogram(dist_average)
    plt.title("method = 'average'")
    plt.show()

You can see that the dendrogram function even chose 3 colors in the first case and 2 in the second. But remember that the final choice is up to you, since the line heights represent the dissimilarities between observations and clusters. On the second plot there is a significant height on the right side, which is enough to consider two separate clusters there.

So, how should you choose the number of clusters based on a dendrogram? Imagine that you draw a horizontal line at some level. The number of vertical lines it intersects is the number of clusters at this level.

But how do you choose the height at which to draw the line? A common practice is to place it where the merges it separates are a 'significant' distance apart, i.e. inside a tall gap with no merges. For example, on the dendrogram above it makes sense to draw such a line at a height of 2-4, but not lower, since there would be many intersections quite close to each other.
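The horizontal-line rule can also be applied in code. The sketch below uses made-up data and a hypothetical cut_height value; it relies on scipy's fcluster with criterion='distance' to cut the tree, and on the color_threshold parameter of dendrogram to control where the coloring changes.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Illustrative data (made up): two tight 3x3 grids of points, far apart
grid = np.array([[x, y] for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)])
points = np.vstack([grid, grid + 10.0])

dist_average = linkage(points, method='average')

# Hypothetical height of the horizontal line, chosen inside the tall gap
cut_height = 5.0

# Cut the tree: observations joined below cut_height share a cluster label
labels = fcluster(dist_average, t=cut_height, criterion='distance')
n_clusters = len(np.unique(labels))

# Same count via the "intersections" rule: every merge above the line
# contributes one extra vertical line crossing it
n_intersections = int(np.sum(dist_average[:, 2] > cut_height)) + 1

print(n_clusters, n_intersections)
# prints: 2 2

# The same threshold can drive the coloring; no_plot=True only computes the layout
tree = dendrogram(dist_average, color_threshold=cut_height, no_plot=True)
```

Dropping no_plot=True and calling plt.show() afterwards draws the dendrogram with the branches below cut_height colored per cluster.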

Task


For the data from the last chapter, build a dendrogram using complete linkage. Follow these steps:

  1. Import the dendrogram and linkage functions from scipy.cluster.hierarchy.
  2. Compute the distances in data using 'complete' linkage. Save the result in the dist_complete variable.
  3. Build the dendrogram for dist_complete. Do not forget to display the plot after creating the dendrogram.

Compare the result with the dendrogram for the same data built using single linkage.
