Cluster Analysis
What is the Gaussian Distribution?
The Gaussian distribution is defined by two key factors:
- Mean: this is the average value and represents the center of the distribution. Most of the data is concentrated near this value;
- Standard deviation: this shows how spread out the data is. A smaller standard deviation means the data is tightly clustered around the mean, while a larger one indicates more spread.
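To make the role of these two parameters concrete, here is a minimal sketch that evaluates the standard Gaussian density, f(x) = exp(-(x - mean)^2 / (2 * std^2)) / (std * sqrt(2 * pi)), by hand. The function name `gaussian_pdf` is an illustrative assumption, not something from the course material.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# The mean sets where the peak sits; the standard deviation sets its height and width.
for std in (0.5, 1.0, 2.0):
    peak = gaussian_pdf(0.0, mean=0.0, std=std)
    print(f"std={std}: density at the mean = {peak:.4f}")
```

The smaller the standard deviation, the taller and narrower the peak at the mean, which matches the "tightly clustered" description above.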
The shape of the Gaussian distribution has some important characteristics:
- It is symmetric around the mean, meaning the left and right sides are mirror images;
- About 68% of the data falls within 1 standard deviation from the mean, 95% within 2, and 99.7% within 3.
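The percentages above can be checked numerically. The sketch below uses the identity that, for a normal variable, the probability of landing within k standard deviations of the mean equals erf(k / sqrt(2)). The helper name `coverage` is an assumption chosen for this illustration.

```python
import math

def coverage(k):
    """Probability that a normally distributed value lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.2%}")
```

The result does not depend on the particular mean or standard deviation, which is why the rule applies to every Gaussian distribution.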
This distribution is essential because many real-world measurements are approximately normally distributed, and it serves as the foundation for Gaussian mixture models, a flexible approach to solving complex clustering problems.
Here is code that fits a normal distribution to a small data set (e.g., [2, 5, 3, 6, 10, -5]) and plots it:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Given data
data = [2, 5, 3, 6, 10, -5]

# Calculate mean and standard deviation
# (np.std uses the population formula, ddof=0, by default)
mean = np.mean(data)
std = np.std(data)

# Generate x values spanning 4 standard deviations on each side of the mean
x = np.linspace(mean - 4 * std, mean + 4 * std, 1000)

# Calculate the normal distribution values
y = norm.pdf(x, mean, std)

# Plot the normal distribution
plt.plot(x, y, label=f"Normal Distribution (mean={mean:.2f}, std={std:.2f})", color='blue')

# Plot the data points as green dots on the x-axis
plt.scatter(data, np.zeros_like(data), color='green', label='Data Points', zorder=5)

# Show the labels defined above
plt.legend()
plt.grid(True)

# Display the plot
plt.show()
1. What is the key characteristic of the Gaussian distribution?
2. Which factor determines the center of a Gaussian distribution?