Cluster Analysis in Python
Setting Parameters: Linkage
In the previous chapter, we mentioned several additional important parameters of the AgglomerativeClustering class. Let's take a closer look at one of them: linkage.
There are 4 main types of linkage:
- 'single': the proximity between two clusters is the proximity between their two closest objects. [default value of the method parameter in scipy's linkage function; note that scikit-learn's AgglomerativeClustering defaults to 'ward' instead]
- 'complete': the proximity between two clusters is the proximity between their two most distant objects.
- 'average': the proximity between two clusters is the arithmetic mean of the proximities between all pairs of points, taking one point from each cluster.
- 'ward': also called Minimal Increase of Sum-of-Squares (MISSQ). At each step it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares. You can read the full explanation in the documentation.
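The four definitions above can be checked by hand on a toy example. The sketch below uses two made-up one-dimensional clusters (the values are purely illustrative, not taken from the course data) and computes each proximity directly with NumPy and SciPy rather than through the linkage function.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny, made-up clusters on a line (illustrative values only)
cluster_a = np.array([[0.0], [1.0]])
cluster_b = np.array([[3.0], [5.0]])

# Euclidean distance for every pair (one point from A, one point from B)
pairwise = cdist(cluster_a, cluster_b)

d_single = pairwise.min()     # closest pair
d_complete = pairwise.max()   # most distant pair
d_average = pairwise.mean()   # mean over all cross-cluster pairs


def sum_of_squares(points):
    """Sum of squared deviations of the points from their centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()


# Ward criterion: how much the within-cluster sum of squares grows
# if the two clusters are merged into one
merged = np.vstack([cluster_a, cluster_b])
ward_increase = (
    sum_of_squares(merged)
    - sum_of_squares(cluster_a)
    - sum_of_squares(cluster_b)
)

print("single:", d_single)
print("complete:", d_complete)
print("average:", d_average)
print("ward (increase in sum of squares):", ward_increase)
```

Note that the Ward value here is the raw increase in the sum of squares; library implementations may report merge heights on a different scale.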
So, which value of the linkage parameter should you use? As before, there is no single correct answer. Note, however, that single linkage is the fastest method but is not robust: it is sensitive to outliers and noise. On the other hand, it performs well on non-globular data. Ward linkage is the most resistant to noise in the data (and is similar to the K-Means algorithm in that it favors compact clusters around centroids), but it takes more time to compute. The two remaining linkages work fine with cleanly separated globular clusters but have mixed results otherwise.
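To illustrate the point about non-globular data, here is a minimal sketch on synthetic data: two noise-free concentric rings generated by a hypothetical helper, ring. This is not the course dataset, and the helper names are assumptions made for the example. Single linkage should recover the two rings because the gap between them is larger than the spacing between neighbouring points on each ring; the Ward result is printed only for comparison.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def ring(radius, n_points):
    """Evenly spaced points on a circle (hypothetical helper for this demo)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])


# Two concentric rings: clearly separated, but not globular
points = np.vstack([ring(1.0, 60), ring(3.0, 60)])
true_labels = np.repeat([0, 1], 60)


def same_partition(labels, reference):
    """True if both labelings group the points identically (label names may differ)."""
    pairs = set(zip(labels, reference))
    return len(pairs) == len(set(labels)) == len(set(reference))


results = {}
for method in ("single", "ward"):
    merges = linkage(points, method=method)
    # Ask for a flat clustering with at most 2 clusters
    labels = fcluster(merges, t=2, criterion="maxclust")
    results[method] = same_partition(labels, true_labels)
    print(method, "recovers the two rings:", results[method])
```

With noisy data the picture can change: a few stray points between the rings may be enough for single linkage to chain the two rings together.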
You can always experiment and compare the dendrograms for different linkages. But be careful when reading a dendrogram built with ward distances, as it behaves differently than for other linkages.
For example, let's compare two dendrograms for the same data (shown in the scatter plot below): the first with 'single' linkage and the second with 'average'.
```python
# Import the libraries
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Read the data
data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/138ab9ad-aa37-4310-873f-0f62abafb038/model_data1.csv')

# Compute the distances
dist_single = linkage(data, method='single')

# Build the dendrogram
dendrogram(dist_single)
plt.title("method = 'single'")
plt.show()
```
But if you replace the last part of the code (everything from the "Compute the distances" comment onward) with the code below, you will get a different result.
```python
# Compute the distances
dist_average = linkage(data, method='average')

# Build the dendrogram
dendrogram(dist_average)
plt.title("method = 'average'")
plt.show()
```
You can see that the dendrogram function even chose 3 colors in the first case and 2 in the second. But remember that the final choice is up to you, since the line heights represent the dissimilarities between observations and clusters. On the second plot there is a large jump in height on the right side, which is enough to consider two separate clusters there.

So, how should you choose the number of clusters based on a dendrogram? Imagine that you draw a horizontal line at some level. The number of vertical lines it intersects is the number of clusters at that level.

But how do you choose the height at which to draw the line? A common practice is to place the line where all the intersections have a 'significant' distance between them. For example, on the dendrogram above it makes sense to draw the line at a height of 2-4, but not lower, since there would be many intersections quite close to each other.
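The horizontal-line rule can also be applied in code. The sketch below is an illustration on made-up data (two small grids of points placed far apart, not the course dataset). It cuts the tree at a chosen height with fcluster, then applies one possible heuristic for picking that height automatically: looking for the widest gap between consecutive merge heights. The variable names and the cut height of 5.0 are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: two 3x3 grids of points, far apart from each other
grid = np.array([[x, y] for x in range(3) for y in range(3)], dtype=float)
points = np.vstack([grid, grid + 10.0])

merges = linkage(points, method="average")
heights = merges[:, 2]  # merge heights, in increasing order

# "Draw a horizontal line" at a chosen height and count the clusters below it
cut_height = 5.0
labels = fcluster(merges, t=cut_height, criterion="distance")
n_clusters = len(np.unique(labels))
print("clusters at height", cut_height, ":", n_clusters)

# Heuristic: find the widest gap between consecutive merge heights
# and cut inside it. After merge number i (0-based) there are
# len(points) - (i + 1) clusters left.
gaps = np.diff(heights)
widest = int(np.argmax(gaps))
suggested_k = len(points) - (widest + 1)
print("clusters suggested by the widest gap:", suggested_k)
```

The widest-gap rule is only a starting point; as noted above, the final choice of the number of clusters remains a judgment call.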
Task
For the data from the last chapter, build the dendrogram using complete linkage. Follow these steps:
- Import the dendrogram and linkage functions from scipy.cluster.hierarchy.
- Compute the distances in data using 'complete' linkage. Save the result in the dist_complete variable.
- Build the dendrogram for dist_complete. Do not forget to display the plot after creating the dendrogram.
Compare this dendrogram with the one for the same data built using single linkage.