What is a K-Means Algorithm?
That was an informative scatter plot you built in the last chapter, wasn't it? You probably saw three clear groups, which we will call clusters. How can we use ML algorithms to find such groups automatically?
The first method we will consider is K-Means. How does it work? First, you set the number of clusters you would like to find. Let this number be N. The algorithm then chooses N random points (not necessarily data points) as initial cluster centers and assigns each data point to the cluster whose center is closest. Next, the mean of the points in each cluster is computed and becomes the new center, and the assignment step is repeated. This continues until the points stop changing clusters between iterations, at which point the variance of the points within each cluster has reached a (local) minimum.
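The loop described above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, the synthetic data, and the choice to start from randomly picked data points are all assumptions made for this example.

```python
import numpy as np


def kmeans_sketch(points, n_clusters, n_iter=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not sklearn's code)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: use randomly picked data points as the initial centers.
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centers = points[start]
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest center (Euclidean distance).
        diffs = points[:, None, :] - centers[None, :, :]
        distances = np.linalg.norm(diffs, axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once no point changes its cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers


# Synthetic data for illustration: two groups of 50 points each.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2))
group_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

labels, centers = kmeans_sketch(points, n_clusters=2)
print(centers)
```

Because the starting centers are random, a different `seed` can lead to a different result, which is one reason library implementations usually run several initializations and keep the best one.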
We will use the `KMeans` class from the `sklearn.cluster` module. To implement the algorithm, follow these steps:
- Create a `KMeans` model and assign it to a variable.
- Compute K-Means clustering using the `.fit()` method of the `KMeans` object with the data set as a parameter.
- Predict the labels using the fitted model by applying the `.predict()` method of the `KMeans` object with the data set as a parameter.
- (optional) Visualize the result of clustering.
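Before turning to a real data set, the first three steps can be tried on a handful of hand-written points. This is only a sketch: the toy array `points` is made up, and `n_init` and `random_state` are set explicitly (they are not part of the steps above) so that repeated runs behave the same way.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two visibly separate groups of 2-D points.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

# Step 1: create the model and assign it to a variable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Step 2: compute the clustering.
model.fit(points)

# Step 3: predict a cluster label for every point.
labels = model.predict(points)

print(labels)
print(model.cluster_centers_)
```

The label numbers themselves are arbitrary: what matters is which points share a label, not whether a group is called 0 or 1.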
For example, imagine that we have 2-D data, shown in the scatter plot below.
```python
# Import the libraries
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data
data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/138ab9ad-aa37-4310-873f-0f62abafb038/train_data1.csv')

# Create model
model = KMeans(n_clusters=2)

# Fit the data to model
model.fit(data)

# Predict the labels for data using model
prediction = model.predict(data)

# Add new column to DataFrame
data['prediction'] = prediction

# Visualize the result
sns.scatterplot(x='x', y='y', hue='prediction', data=data)
plt.show()
```
The result of the script above is shown below.
Note how we add a new column to the `data` DataFrame to make plotting with seaborn easier. Now it's your turn! Try to complete the following task by following the same steps.
- Import the `KMeans` class from `sklearn.cluster`.
- Create a `KMeans` model with the `n_clusters` parameter set to `3`. Assign it to `model`.
- Compute K-Means clustering for `data` using the `.fit()` method of `model`.
- Predict the labels for `data` using the `.predict()` method of `model`. Assign the result to the `prediction` variable.
- Add a new column `'prediction'` to `data` with the values of the `prediction` variable (created in the previous step).
- Visualize the results. Build a scatter plot using the `seaborn` library, passing the `'x'` column as the `x` parameter, the `'y'` column as the `y` parameter, and the `'prediction'` column as the `hue` parameter. Do not forget to call `.show()` on `plt`.