Histogram
Histograms represent the frequency or probability distribution of a variable by using vertical bins of equal width, often referred to as bars.
The pyplot module provides the hist() function to create histograms. The only required parameter is the data (x), which can be an array or a sequence of arrays. If multiple arrays are passed, each is shown in a different color.
import pandas as pd
import matplotlib.pyplot as plt

# Loading the dataset with the average yearly temperatures in Boston and Seattle
url = 'https://staging-content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Creating a histogram
plt.hist(weather_df['Seattle'])
plt.show()
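The multiple-array case mentioned above can be sketched as follows. The two arrays are synthetic stand-ins generated with NumPy (the names city_a and city_b and their values are assumptions for illustration, not the course dataset), so the snippet runs without downloading anything.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for two temperature series (assumed values, not the course data)
rng = np.random.default_rng(seed=0)
city_a = rng.normal(loc=52, scale=1.5, size=26)
city_b = rng.normal(loc=51, scale=1.0, size=26)

# Passing a list of arrays draws one set of bars per array, each in its own color.
# All arrays share the same bin edges, computed over the combined data range.
counts, edges, patches = plt.hist([city_a, city_b], label=['City A', 'City B'])
plt.legend()
plt.show()
```

Here counts holds one row of bin counts per input array, while edges is a single shared array of bin boundaries.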
Intervals and Height
A Series object containing the average yearly temperatures in Seattle was passed to the hist() function. By default, the data is divided into 10 equal-width intervals ranging from the minimum to the maximum value. However, only 9 bars are visible because the second interval contains no data points.
By default, the height of each bin equals the frequency of the values in its interval, that is, the number of data points that fall within it.
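To see how the intervals and heights come about, the binning can be reproduced with NumPy's histogram function, which computes counts and bin edges without drawing anything. This is a minimal sketch on made-up values (the temps array is hypothetical and chosen so that the second interval is empty, mirroring the situation described above).

```python
import numpy as np

# Hypothetical yearly averages (made-up values, not the course dataset)
temps = np.array([50, 52, 53, 54, 55, 56, 57, 58, 59, 60], dtype=float)

# bins=10 splits the range [min, max] into 10 equal-width intervals
counts, edges = np.histogram(temps, bins=10)

# Each interval is half-open, except the last one, which includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")

# Intervals with a count of zero produce no visible bar
visible_bars = int(np.count_nonzero(counts))
print("visible bars:", visible_bars)
```

With these values there are still 10 intervals, but only 9 of them have a non-zero height.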
Number of Bins
Another important, though optional, parameter is bins, which takes the number of bins (an integer), a sequence of numbers specifying the bin edges, or a string naming a binning strategy. Most of the time, passing the number of bins is sufficient.
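The three accepted forms of bins can be compared by looking at the bin edges each one produces. The sketch below uses NumPy's histogram_bin_edges function on a made-up sample (the data array is hypothetical); hist() accepts the same three forms for its bins parameter.

```python
import numpy as np

# Hypothetical sample (made-up values)
data = np.array([1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 7.0, 8.0, 9.5, 10.0])

# 1. Integer: that many equal-width bins between the minimum and maximum
edges_from_count = np.histogram_bin_edges(data, bins=4)

# 2. Sequence: explicit bin edges, which may be unevenly spaced
edges_from_list = np.histogram_bin_edges(data, bins=[0, 2, 5, 10])

# 3. String: the name of a binning rule that NumPy evaluates on the data
edges_from_rule = np.histogram_bin_edges(data, bins="sturges")

print(edges_from_count)
print(edges_from_list)
print(edges_from_rule)
```

An integer n always yields n + 1 edges, while an explicit sequence is used as given.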
There are several methods for determining the number, and therefore the width, of histogram bins. In this example, we'll use Sturges' formula, which calculates the number of bins from the sample size:

$$k = 1 + \lfloor \log_2 n \rfloor$$

Here, $k$ is the number of bins and $n$ is the size of the data array.
Additional methods for bin calculation are also available.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

url = 'https://staging-content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Specifying the number of bins
plt.hist(weather_df['Seattle'], bins=1 + int(np.log2(len(weather_df))))
plt.show()
The number of rows in the DataFrame is 26 (the size of the Series), so the resulting number of bins is 5.
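The arithmetic behind that result can be checked step by step. This small sketch uses the same expression as the code above, 1 + int(log2(n)), with n fixed to the 26 rows stated in the text rather than loaded from the dataset.

```python
import math

n = 26  # number of rows, as stated in the text

log_term = math.log2(n)       # base-2 logarithm of 26, between 4 and 5
num_bins = 1 + int(log_term)  # int() drops the fractional part, leaving 4

print(num_bins)
```

Because int() truncates rather than rounds, any sample size from 16 to 31 gives the same number of bins.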
Probability Density Approximation
To view an approximation of the probability density, set the density parameter to True in the hist() function.
Now, each bin's height is calculated using:

$$\text{height} = \frac{m}{n \cdot w}$$

where:
- $n$ is the total number of values in the dataset;
- $m$ is the number of values in the bin;
- $w$ is the width of the bin.
This ensures that the total area under the histogram is 1, which matches the key property of a probability density function (PDF).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

url = 'https://staging-content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Making a histogram a probability density function approximation
plt.hist(weather_df['Seattle'], bins=1 + int(np.log2(len(weather_df))), density=True)
plt.show()
This provides an approximation of the probability density function for the temperature data.
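The density scaling can be verified numerically: dividing each bin's count by the total count times the bin width should reproduce the density heights, and the bar areas should sum to 1. The sketch below assumes a synthetic standard normal sample and 8 bins (both chosen only for illustration) and uses NumPy's histogram function in place of a plot.

```python
import numpy as np

# Synthetic sample (assumed for illustration, not the course dataset)
rng = np.random.default_rng(seed=42)
sample = rng.normal(size=200)

# Raw counts and bin edges
counts, edges = np.histogram(sample, bins=8)
widths = np.diff(edges)

# Manual scaling: count in the bin / (total count * bin width)
manual_heights = counts / (counts.sum() * widths)

# The same heights, as computed with density=True
density_heights, _ = np.histogram(sample, bins=8, density=True)

# Each bar's area is height * width; together the bars should cover an area of 1
total_area = float(np.sum(density_heights * widths))
print(total_area)
```

Note that individual heights can exceed 1 when bins are narrow; only the total area is constrained.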
If you want to explore more of the hist() parameters, you can refer to the hist() documentation.
Create an approximation of a probability density function using a sample from the standard normal distribution:
- Use the correct function for creating a histogram.
- Use normal_sample as the data for the histogram.
- Specify the number of bins as the second argument using Sturges' formula.
- Make the histogram an approximation of a probability density function by correctly specifying the rightmost argument.