Rand Index | Hierarchical Clustering
Cluster Analysis in Python

Rand Index

You may notice that the dendrograms look slightly different. Using single linkage, it is hard to justify 4 clusters because the merge heights on the right side are too small, while complete linkage suggests that 4 clusters are reasonable. We can experiment and compare the dendrograms, but what about the clustering results themselves? Can we compare them?

The answer is yes, we can. Two clustering results can be compared using the Rand Index, which measures the agreement between the labels predicted by two models and returns a number between 0 and 1: 1.0 stands for a perfect match, while 0.0 stands for a complete lack of similarity between the predicted labels.
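To make the definition concrete, here is a minimal from-scratch sketch of what the Rand index computes: for every pair of points, it checks whether the two labelings agree (the pair is placed together in both, or apart in both), and returns the fraction of agreeing pairs. This is an illustrative implementation, not the one sklearn uses internally.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Pair-counting Rand index: fraction of point pairs on which the
    two labelings agree (together in both, or apart in both)."""
    assert len(labels_a) == len(labels_b)
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:  # both together or both apart
            agree += 1
    return agree / len(pairs)

# Identical clusterings up to a renaming of the labels score 1.0
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Note that the index only cares about which points end up together, not about the label values themselves, which is why two models can score 1.0 even if they number their clusters differently.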

In Python, the Rand index is available as the rand_score() function from the sklearn.metrics module. It takes exactly two arguments: the labels predicted by each of the two models. Naturally, both lists/arrays must be the same size.

For example, we can compare the results of clustering well-separated data (scatter plot below) using two different linkages: 'single' and 'ward'. To choose the linkage, set the linkage='...' parameter of AgglomerativeClustering().

# Import the libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import rand_score
from sklearn.cluster import AgglomerativeClustering

# Read the data
data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/138ab9ad-aa37-4310-873f-0f62abafb038/model_data1.csv')

# Create the models
model_single = AgglomerativeClustering(n_clusters=3, linkage='single')
model_ward = AgglomerativeClustering(n_clusters=3, linkage='ward')

# Fit and predict the labels
labels_single = model_single.fit_predict(data)
labels_ward = model_ward.fit_predict(data)

# Compute the Rand index
rand_index = rand_score(labels_single, labels_ward)
print(f"The rand index for single and ward linkages models is {rand_index}")

This code will output the following message:

This means that both linkages lead to identical clustering results. This is expected, since the points are divided into 3 clearly separated clusters.

But what about the data we used in the previous chapters? Let's find out how similar the models with different linkages will be.

Task

Let's figure out how close the results will be for the data from the two previous chapters if we split it into 4 clusters. The scatter plot is below.

Plot

Follow the next steps:

  1. Import rand_score and AgglomerativeClustering from sklearn.metrics and sklearn.cluster respectively.
  2. Create two AgglomerativeClustering objects:
  • model_single with 4 clusters and 'single' linkage.
  • model_complete with 4 clusters and 'complete' linkage.
  3. Fit the data to each model and predict the labels:
  • labels_single for the labels predicted by the model_single model.
  • labels_complete for the labels predicted by the model_complete model.
  4. Compute the rand score using labels_single and labels_complete.
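The steps above can be sketched as follows. Since the course's CSV is not reproduced here, this sketch substitutes synthetic stand-in data generated with make_blobs, so the resulting score will differ from the one you get on the actual exercise data.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import rand_score

# Stand-in data: 4 synthetic blobs (the real task uses the course CSV)
data, _ = make_blobs(n_samples=200, centers=4, random_state=42)

# Two models with 4 clusters and different linkages
model_single = AgglomerativeClustering(n_clusters=4, linkage='single')
model_complete = AgglomerativeClustering(n_clusters=4, linkage='complete')

# Fit each model and predict the labels
labels_single = model_single.fit_predict(data)
labels_complete = model_complete.fit_predict(data)

# Compute the Rand score between the two label assignments
score = rand_score(labels_single, labels_complete)
print(f"Rand index for single vs. complete linkage: {score}")
```

How close the score gets to 1.0 depends on how well-separated the data is: single linkage tends to chain through nearby points, so on overlapping clusters it can disagree noticeably with complete linkage.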




Section 3. Chapter 4