Cluster Analysis
Implementing on Dummy Dataset
You will now walk through a practical example of applying K-means clustering. To do this, you'll use a dummy dataset. Dummy datasets are artificially generated datasets that are often used for demonstration and learning purposes. They allow us to control the characteristics of the data and clearly observe how algorithms like K-means perform.
Dummy Dataset
For this demonstration, we will create a dummy dataset using the make_blobs() function. This function is excellent for generating clusters of data points in a visually clear and controllable way. We will generate data with the following characteristics:
- Number of samples: we will create a dataset with 300 data points;
- Number of centers: we will set the number of true clusters to 4. This means the dummy data is designed to have four distinct groups;
- Cluster standard deviation: we will control the spread of data points within each cluster, setting it to 0.60 for relatively compact clusters;
- Random state: we will use a fixed random_state for reproducibility, ensuring that the data generation is consistent each time you run the code.
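The data generation described above can be sketched as follows. This is a minimal example, assuming scikit-learn's make_blobs; the seed value 42 and the variable names X and y_true are illustrative choices, since the lesson only states that the random state is fixed.

```python
import numpy as np
from sklearn.datasets import make_blobs

# 300 samples around 4 centers, each cluster with standard deviation 0.60.
# random_state=42 is an assumed seed; any fixed integer gives reproducible data.
X, y_true = make_blobs(
    n_samples=300,
    centers=4,
    cluster_std=0.60,
    random_state=42,
)

print(X.shape)            # (300, 2): make_blobs produces two features by default
print(np.unique(y_true))  # [0 1 2 3]: the four generating clusters
```

Here y_true holds the cluster each point was generated from. K-means never sees these labels; they are only useful for checking the result afterwards.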
K-Means Implementation
With this dummy data created, we will then apply the K-means algorithm. We will explore how K-means attempts to partition this data into clusters based on the principles you learned in previous chapters.
In Python, K-means can be initialized and trained with scikit-learn's KMeans class: create the estimator with the desired number of clusters, then fit it to the data.
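A minimal sketch of initializing and training K-means, assuming scikit-learn's KMeans estimator. The seed 42, the explicit n_init=10, and the variable names are assumptions made for this illustration, not values taken from the lesson.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Recreate the dummy data so this snippet runs on its own (seed is assumed).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Initialize K-means with K=4; n_init=10 restarts the algorithm from ten
# different centroid seeds and keeps the run with the lowest WSS.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)

# Train the model and get the cluster index assigned to each point.
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # (4, 2): one 2-D centroid per cluster
print(kmeans.inertia_)                # WSS of the fitted model
```

The fitted model exposes the centroids in cluster_centers_ and the within-cluster sum of squares in inertia_, which is the quantity used by the elbow method.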
To determine the optimal number of clusters for this data, we will employ the methods discussed in the previous chapters:
- WSS method: we will calculate the within-cluster sum of squares (WSS) for different values of K and analyze the elbow plot to identify a potential optimal K;
- Silhouette score method: we will compute the silhouette score for different values of K and examine the silhouette plot and average silhouette scores to find the K that maximizes cluster quality.
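Both methods can be computed in a single loop over candidate values of K. The sketch below assumes scikit-learn's KMeans and silhouette_score; the range K = 2 to 10 and the seed are illustrative choices (the silhouette score is undefined for K = 1, so the loop starts at 2).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Recreate the dummy data so this snippet runs on its own (seed is assumed).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

k_values = list(range(2, 11))  # assumed candidate range for K
wss = []                       # within-cluster sum of squares per K
silhouette_avgs = []           # average silhouette score per K

for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(model.inertia_)
    silhouette_avgs.append(silhouette_score(X, model.labels_))

# The K with the highest average silhouette score is one candidate optimum.
best_k = k_values[silhouette_avgs.index(max(silhouette_avgs))]

for k, w, s in zip(k_values, wss, silhouette_avgs):
    print(f"K={k:2d}  WSS={w:9.2f}  silhouette={s:.3f}")
print("K with the highest average silhouette:", best_k)
```

WSS keeps shrinking as K grows, so it is read by looking for the elbow where the decrease flattens, whereas the silhouette score is read by looking for its maximum.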
Finally, visualizations will play a crucial role in our implementation. We will visualize:
- The dummy data itself, to see the inherent cluster structure;
- The WSS plot, to identify the elbow point;
- The silhouette plot, to assess cluster quality for different K values;
- The final K-means clusters overlaid on the dummy data, to visually verify the clustering results and the chosen optimal K.
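The four views listed above could be drawn with matplotlib roughly as follows. This is a sketch under stated assumptions: the K range, the seed, the output filename kmeans_overview.png, and the use of the average silhouette score per K (rather than a per-sample silhouette plot) are simplifications chosen for this example.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Fit K-means for each candidate K and record both quality measures.
k_values = list(range(2, 11))
models = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) for k in k_values]
wss = [m.inertia_ for m in models]
silhouette_avgs = [silhouette_score(X, m.labels_) for m in models]

# Pick the model with the highest average silhouette score as the final one.
final_model = models[silhouette_avgs.index(max(silhouette_avgs))]
final_labels = final_model.labels_
centers = final_model.cluster_centers_

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Raw dummy data.
axes[0, 0].scatter(X[:, 0], X[:, 1], s=15)
axes[0, 0].set_title("Dummy data")

# 2. WSS (elbow) plot.
axes[0, 1].plot(k_values, wss, marker="o")
axes[0, 1].set(title="WSS vs. K", xlabel="K", ylabel="WSS")

# 3. Average silhouette score per K.
axes[1, 0].plot(k_values, silhouette_avgs, marker="o")
axes[1, 0].set(title="Average silhouette vs. K", xlabel="K", ylabel="Silhouette")

# 4. Final clusters with centroids overlaid on the data.
axes[1, 1].scatter(X[:, 0], X[:, 1], c=final_labels, s=15, cmap="viridis")
axes[1, 1].scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=100)
axes[1, 1].set_title(f"K-means clusters (K={final_model.n_clusters})")

fig.tight_layout()
fig.savefig("kmeans_overview.png")  # hypothetical filename; use plt.show() interactively
```

Placing the four panels side by side makes it easy to check whether the elbow, the silhouette maximum, and the visible structure in the scatter plot agree on the same K.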