Setting Parameters: Linkage | Hierarchical Clustering
Cluster Analysis in Python

Setting Parameters: Linkage

In the previous chapter, we mentioned several additional important parameters of AgglomerativeClustering. Let's look at one of them: linkage.

There are 4 main types of linkage:

  • 'single' - the proximity between two clusters is the proximity between their two closest objects. [default value of method in SciPy's linkage function; AgglomerativeClustering itself defaults to 'ward']
  • 'complete' - the proximity between two clusters is the proximity between their two most distant objects.
  • 'average' - the proximity between two clusters is the arithmetic mean of the proximities between all pairs of objects, one taken from each cluster.
  • 'ward' - also called Minimal Increase of Sum-of-Squares (MISSQ): the two clusters merged are those whose union gives the smallest increase in the within-cluster sum of squares. You can read the full explanation in the documentation.
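The first three definitions can be checked by hand on a tiny example. The sketch below uses two made-up clusters (illustrative data, not the course dataset) and Euclidean distance as the proximity measure; both choices are assumptions made only for this illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny hypothetical clusters lying on a line (illustrative data only)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# Pairwise Euclidean distances: rows are points of cluster_a,
# columns are points of cluster_b
pairwise = cdist(cluster_a, cluster_b)

single_dist = pairwise.min()     # 'single': the two closest objects
complete_dist = pairwise.max()   # 'complete': the two most distant objects
average_dist = pairwise.mean()   # 'average': mean over all cross-cluster pairs

print(single_dist, complete_dist, average_dist)  # 3.0 6.0 4.5
```

The same two clusters are 3, 6, or 4.5 units apart depending on the linkage, which is why different linkages can merge clusters in a different order.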

So, which value of the linkage parameter should you use? As before, there is no single correct answer. Note, however, that single linkage is the fastest method but is not robust: it is sensitive to outliers. It performs well on non-globular data. Ward linkage is the most resistant to noise in the data (and, like K-Means, it works with cluster centroids), but it takes more time to compute. The two remaining linkages work well with cleanly separated globular clusters but give mixed results otherwise.
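To make the link between ward linkage and K-Means more concrete: both work with the within-cluster sum of squared distances to the centroid. The sketch below computes the increase in that quantity when two clusters are merged, first directly and then with the standard closed form based on the two centroids. The helper name sse and the data are hypothetical and used only for illustration.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

# Two small hypothetical clusters (illustrative data only)
a = np.array([[0.0, 0.0], [2.0, 0.0]])
b = np.array([[5.0, 0.0], [7.0, 0.0]])
merged = np.vstack([a, b])

# Increase in the total within-cluster sum of squares caused by the merge
increase = sse(merged) - sse(a) - sse(b)

# Closed form: n_a * n_b / (n_a + n_b) * squared distance between centroids
n_a, n_b = len(a), len(b)
centroid_gap_sq = float(((a.mean(axis=0) - b.mean(axis=0)) ** 2).sum())
closed_form = n_a * n_b / (n_a + n_b) * centroid_gap_sq

print(increase, closed_form)  # 25.0 25.0
```

At each step, ward linkage merges the pair of clusters for which this increase is smallest.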

You can always experiment and compare the dendrograms for different linkages. But be careful when reading a dendrogram built with ward distances, as it behaves differently than for the other linkages.

For example, let's compare two dendrograms for the same data (the scatter plot of which is shown below): the first with 'single' linkage and the second with 'average'.

```python
# Import the libraries
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Read the data
data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/138ab9ad-aa37-4310-873f-0f62abafb038/model_data1.csv')

# Compute the distances
dist_single = linkage(data, method = 'single')

# Build the dendrogram
dendrogram(dist_single)
plt.title("method = 'single'")
plt.show()
```

If you replace lines 9-15 with the code below, you will get the following result.

```python
# Compute the distances
dist_average = linkage(data, method = 'average')

# Build the dendrogram
dendrogram(dist_average)
plt.title("method = 'average'")
plt.show()
```

You can see that the dendrogram function used 3 colors in the first case and 2 in the second. But remember that the final choice is up to you, since the line heights represent the dissimilarities between observations and clusters. On the second plot there is a significant height on the right side, and that is enough to consider two separate clusters there.

So, how should you choose the number of clusters based on a dendrogram? Imagine that you draw a horizontal line at some level. The number of vertical lines it intersects is the number of clusters at this level. But how do you choose the height at which to draw the line? A common practice is to draw it where all the intersections have a 'significant' distance between them. For example, on the dendrogram above it makes sense to draw such a line at a height of 2-4, but not lower, since there would be many intersections quite close to each other.
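The horizontal-line rule can also be applied in code. The sketch below is a minimal illustration on made-up data: it uses SciPy's fcluster with criterion='distance' to cut the tree at a chosen height. The toy points and the value of cut_height are assumptions for this example and are not taken from the course dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two tight groups that are far apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Same kind of call as in the lesson, here with 'average' linkage
dist_average = linkage(points, method='average')

# Height of the imaginary horizontal line (assumed value, chosen by eye)
cut_height = 5.0

# Cut the tree: observations joined below cut_height share a label
labels = fcluster(dist_average, t=cut_height, criterion='distance')
n_clusters = len(set(labels))

# Equivalent count for this linkage: merges above the line, plus one
n_from_heights = int((dist_average[:, 2] > cut_height).sum()) + 1

print(n_clusters, n_from_heights)  # 2 2
```

The third column of the array returned by linkage holds the merge heights, so it is a convenient place to look for a large gap before picking the cut.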

Task

For the data from the last chapter, build a dendrogram using complete linkage. Follow these steps:

  1. Import the dendrogram and linkage functions from scipy.cluster.hierarchy.
  2. Compute the distances in data using 'complete' linkage. Save the result in the dist_complete variable.
  3. Build the dendrogram for dist_complete. Do not forget to display the plot after creating the dendrogram.

Compare the result with the dendrogram for the same data built with single linkage.
