Cluster Analysis in Python
What is a K-Means Algorithm?
The scatter plot you built in the last chapter was informative, wasn't it? You probably saw three clear groups, which we will call clusters. How can we use ML algorithms to find such groups automatically?
The first method we will consider is K-Means. How does it work? First, you set the number of clusters you would like to explore. Let this number be N. The algorithm then chooses N random points (not necessarily data points) as initial cluster centers and assigns each data point to the cluster whose center is nearest. Next, the mean of the points in each cluster is computed and becomes the new center of that cluster. These two steps repeat until the points stay in the same clusters from one iteration to the next. Neither step can increase the within-cluster variance (the sum of squared distances from the points to their cluster center), so the algorithm settles at a local minimum of it.
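The loop described above can be sketched in plain NumPy. This is only an illustrative toy, not the scikit-learn implementation: the function name `kmeans_sketch`, its parameters, and the synthetic three-blob data are all assumptions made for this example, and the initial centers are drawn from the data points for simplicity.

```python
import numpy as np


def kmeans_sketch(points, n_clusters, n_iter=100, seed=0):
    """Toy K-Means (hypothetical helper): returns centers, labels, inertia history."""
    rng = np.random.default_rng(seed)
    # Assumption: start from N distinct data points picked at random.
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centers = points[start].astype(float)
    labels = None
    history = []
    for _ in range(n_iter):
        # Assignment step: distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Within-cluster sum of squared distances for this assignment.
        nearest = dists[np.arange(len(points)), new_labels]
        history.append(float((nearest ** 2).sum()))
        # Stop when no point changed its cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each center to the mean of its points.
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers, labels, history


# Synthetic 2-D data: three blobs of 50 points each (made up for this sketch).
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

centers, labels, history = kmeans_sketch(points, n_clusters=3)
print(centers.round(1))  # final cluster centers
print(history)           # sum of squared distances, one value per iteration
```

The values in `history` never increase from one iteration to the next, which is the sense in which the within-cluster variance is minimized. Because the starting centers are random, a different `seed` may end in a different (possibly worse) local minimum.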
We will use the KMeans class from the sklearn.cluster module. To implement the algorithm, follow these steps:
- Create a KMeans model and assign it to a variable.
- Compute K-Means clustering using the .fit() method of the KMeans object, passing the data set as the argument.
- Predict the labels with the fitted model by calling the .predict() method of the KMeans object, again passing the data set as the argument.
- (Optional) Visualize the result of clustering.
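Before turning to real data, the first three steps can be tried on a small synthetic data set that needs no download. This is a minimal sketch under stated assumptions: the `toy` DataFrame and its three blobs are made up for illustration, and `n_init=10` and `random_state=0` are set here only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic 2-D data (an assumption for this sketch): three blobs of 50 points.
rng = np.random.default_rng(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
toy = pd.DataFrame(
    np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in blob_centers]),
    columns=["x", "y"],
)

# Step 1: create the model and assign it to a variable.
model = KMeans(n_clusters=3, n_init=10, random_state=0)

# Step 2: compute the clustering.
model.fit(toy)

# Step 3: predict a cluster label for every row.
labels = model.predict(toy)

# The fitted model also exposes the cluster centers and the
# within-cluster sum of squared distances.
print(model.cluster_centers_)
print(model.inertia_)
```

The label numbers themselves (0, 1, 2) are arbitrary: they identify clusters but carry no order, so the same blob may receive a different number under a different `random_state`.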
For example, imagine that we have 2-D data, with the respective scatter plot below.
# Import the libraries
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data
data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/138ab9ad-aa37-4310-873f-0f62abafb038/train_data1.csv')

# Create model
model = KMeans(n_clusters=2)

# Fit the data to model
model.fit(data)

# Predict the labels for data using model
prediction = model.predict(data)

# Add new column to DataFrame
data['prediction'] = prediction

# Visualize the result
sns.scatterplot(x='x', y='y', hue='prediction', data=data)
plt.show()
The result for the script above is below.
Note how we add a new column to the data DataFrame to make plotting with seaborn easier. Now it's your turn! Complete the following task using the same steps.
- Import the KMeans class from sklearn.cluster.
- Create a KMeans model with the n_clusters parameter set to 3. Assign it to model.
- Compute K-Means clustering for data using the .fit() method of model.
- Predict the labels for data using the .predict() method of model. Assign the result to the prediction variable.
- Add a new column 'prediction' to data with the values of the prediction variable (created in the previous step).
- Visualize the results. Build a scatter plot using the seaborn library, passing the 'x' column as the x parameter, the 'y' column as the y parameter, and the 'prediction' column as the hue parameter. Do not forget to call .show() on plt.