Advanced Probability Theory
General population. Samples. Population parameters.
The general population is the complete set of objects we want to study, described by how the quantity of interest is distributed across it. For instance, the heights of adult men in the United States are spread around a mean of about 70 inches with a standard deviation of about 3 inches. So, if we took a group of men in the USA, their heights would follow this pattern.
A sample is a small group we use to understand the bigger picture of the general population. For example, if we want to know the heights of men in the USA, we might measure the heights of a few men from different places. These measured heights are our samples.
```python
import numpy as np

# Specify parameters of the general population
mean = 70
std = 3

# Specify the number of samples to generate
size = 10

# Generate samples
samples = np.random.normal(mean, std, size)
print('Samples are: ', samples)
```
Thus each sample is essentially a random variable with a distribution given by the general population.
In the example above, we first set the general population type and parameters, then generated the corresponding samples. In real tasks of analytics and data science, we usually need to solve the inverse problem: we have samples generated from some general population, and we must determine from which particular population these samples were generated.
To do this, we follow these steps:
Step 1. First, determine whether we are dealing with a discrete or a continuous general population;
Step 2. Estimate what type of distribution the data belongs to. This can be done with visualization: for discrete data we build a frequency polygon, and for continuous data a histogram. We can then assume that the data follows the distribution whose PMF/PDF most closely resembles the frequency polygon/histogram;
```python
import numpy as np
import matplotlib.pyplot as plt

# Generate 1000 samples from a continuous normal distribution (mean 70, standard deviation 3)
samples_cont = np.random.normal(70, 3, 1000)

# Generate 500 samples from a discrete distribution
colors = ['Red', 'Blue', 'Green', 'Black', 'White']
samples_disc = np.random.choice(colors, size=500, p=[0.3, 0.2, 0.15, 0.15, 0.2])

# Create the figure and subplots
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Plot the histogram on the first subplot (density=True scales the total area to 1)
axes[0].hist(samples_cont, bins=20, alpha=0.5, color='blue', density=True)
axes[0].set_xlabel('Values')
axes[0].set_ylabel('Density')
axes[0].set_title('Histogram of Continuous Variable')

# Calculate the empirical probabilities.
# np.unique returns the categories in sorted order, so we plot against
# the labels it returns to keep each probability paired with its color.
values, counts = np.unique(samples_disc, return_counts=True)
probs = counts / len(samples_disc)

# Plot the frequency polygon on the second subplot
axes[1].plot(values, probs, marker='o', linestyle='--')
axes[1].set_title('Frequency Polygon')
axes[1].set_xlabel('Color')
axes[1].set_ylabel('Estimated Probability')

# Adjust the layout and display the plot
plt.tight_layout()
plt.show()
```
Step 3. As we mentioned in previous chapters, visualization alone is not enough to determine the type of distribution accurately. Therefore, after visualization, statistical tests are usually applied to check more formally whether the data is consistent with a particular general population;
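As an illustration of such a formal check, here is a minimal sketch of one possible normality test, the Jarque-Bera test, written with NumPy only. The choice of test is an assumption made for this example and is not necessarily the one used elsewhere in the course. The helper name `jarque_bera_pvalue` is hypothetical, and the p-value relies on a large-sample approximation.

```python
import numpy as np

def jarque_bera_pvalue(x):
    """Return the Jarque-Bera statistic and its asymptotic p-value (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)                # second central moment
    skew = np.mean(centered ** 3) / m2 ** 1.5  # sample skewness
    kurt = np.mean(centered ** 4) / m2 ** 2    # sample kurtosis (3 for a normal law)
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Under normality, JB is approximately chi-square with 2 degrees of freedom,
    # and the survival function of that distribution is exp(-x / 2).
    p_value = np.exp(-jb / 2.0)
    return jb, p_value

rng = np.random.default_rng(0)
normal_samples = rng.normal(70, 3, 1000)  # drawn from a normal population
exp_samples = rng.exponential(3, 1000)    # drawn from a clearly non-normal population

jb_norm, p_norm = jarque_bera_pvalue(normal_samples)
jb_exp, p_exp = jarque_bera_pvalue(exp_samples)

print(f'Normal data:      JB = {jb_norm:.2f}, p-value = {p_norm:.4f}')
print(f'Exponential data: JB = {jb_exp:.2f}, p-value = {p_exp:.4f}')
```

A small p-value (for example, below 0.05) suggests that the data is unlikely to come from a normal population. A large p-value only means that the test found no evidence against normality. SciPy provides a ready-made version of this test as `scipy.stats.jarque_bera`.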
Step 4. After you have determined the type of distribution, you need to estimate the parameters of this distribution. For example, if you assume from the histogram that the data is distributed normally, then you need to estimate the mean value and the variance; if you assume that the data is distributed exponentially, then you need to determine the lambda parameter, and so on. In addition to point estimation of parameters, confidence intervals are also constructed for the corresponding parameters.
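To make the fourth step more concrete, the following sketch computes point estimates for a normal and an exponential population, plus an approximate 95% confidence interval for the mean. It assumes a sample large enough for the normal approximation to hold, which is why it uses the constant 1.96. The sample sizes, the seed, and the true parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Samples assumed to come from a normal population
samples = rng.normal(70, 3, 200)
n = samples.size

# Point estimates of the mean and the variance
mean_hat = samples.mean()
var_hat = samples.var(ddof=1)  # ddof=1 gives the unbiased sample variance

# Approximate 95% confidence interval for the mean (normal approximation)
std_err = np.sqrt(var_hat / n)
z = 1.96  # approximate 97.5% quantile of the standard normal distribution
ci_low = mean_hat - z * std_err
ci_high = mean_hat + z * std_err

# Samples assumed to come from an exponential population with lambda = 0.5
exp_samples = rng.exponential(scale=1 / 0.5, size=200)
lambda_hat = 1 / exp_samples.mean()  # the mean of an exponential law is 1 / lambda

print(f'Estimated mean: {mean_hat:.2f}')
print(f'Estimated variance: {var_hat:.2f}')
print(f'Approximate 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})')
print(f'Estimated lambda: {lambda_hat:.3f}')
```

The point estimates will differ from the true values (70, 9, and 0.5) from run to run. The confidence interval gives a range that, under the stated assumptions, covers the true mean in roughly 95% of repeated experiments.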
In this section, we will focus on the fourth step in more detail and consider how to estimate the parameters of the general population and how to determine how good the estimates are.