Learn Histogram | More Statistical Plots
Ultimate Visualization with Python

1. Matplotlib Introduction
2. Creating Commonly Used Plots
3. Plots Customization
4. More Statistical Plots
5. Plotting with Seaborn

Histogram

Let’s start with a histogram. A histogram represents the frequency or probability distribution of a variable (an approximation of its distribution) using vertical bars of equal width, called bins.

The pyplot module provides the hist function for creating a histogram. Its first and only required parameter is the data (called x), which can be either an array or a sequence of arrays. If a sequence of arrays is passed, the bins for each array are drawn in a different color. Here is a simple example:

import pandas as pd
import matplotlib.pyplot as plt

url = 'https://staging-content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'

# Loading the dataset with the average yearly temperatures in Boston and Seattle
weather_df = pd.read_csv(url, index_col=0)

# Creating a histogram
plt.hist(weather_df['Seattle'])
plt.show()
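
The example above passes a single array. As a rough sketch of the other case described in the text, the snippet below passes a list of two arrays to hist. The data is synthetic and the names city_a and city_b are assumptions made for illustration, not part of the course dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (an assumption for illustration, not the course dataset)
rng = np.random.default_rng(seed=0)
city_a = rng.normal(loc=52, scale=1.0, size=26)
city_b = rng.normal(loc=50, scale=1.5, size=26)

# Passing a list of arrays draws one set of bars per array, each in its own color
counts, edges, patches = plt.hist([city_a, city_b], label=['City A', 'City B'])
plt.legend()
plt.show()
```

Both arrays share the same bin edges, so their bars can be compared interval by interval.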

Intervals and Height

We passed a Series object containing the average yearly temperatures in Seattle to the hist() function. By default, the sample is divided into 10 equal-width intervals spanning the minimum to the maximum value. Only 9 bars are visible, however, because no values fall into the second interval.

By default, the height of each bin equals the frequency of the values in its interval (the number of values that fall into it).
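
To inspect the intervals and heights as numbers rather than reading them off the plot, one option is NumPy's histogram function, which also defaults to 10 equal-width bins. The sample below is hypothetical and only stands in for the temperature data.

```python
import numpy as np

# Hypothetical sample standing in for the Seattle temperatures
sample = np.array([50.0, 50.2, 51.2, 51.7, 52.3, 52.6, 53.1, 53.4, 54.2, 55.0])

# With no bins argument, np.histogram uses 10 equal-width intervals
counts, edges = np.histogram(sample)

print(counts)  # number of values that fall into each interval
print(edges)   # 11 edges running from the minimum to the maximum value
```

An interval with a count of zero has zero height, which is why a plot can show fewer bars than there are intervals.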

Number of Bins

Another important, yet optional, parameter is bins, which takes the number of bins (an integer), a sequence of numbers specifying the bin edges, or a string naming a binning strategy. Most of the time, passing the number of bins is enough.
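
The three forms of the bins argument can be compared without plotting. The sketch below uses NumPy's histogram_bin_edges on a hypothetical sample to show which edges each form produces; the sample values and the chosen edges are assumptions for illustration.

```python
import numpy as np

# Hypothetical sample used only for illustration
sample = np.array([50.0, 50.2, 51.2, 51.7, 52.3, 52.6, 53.1, 53.4, 54.2, 55.0])

# 1. An integer: that many equal-width bins between the minimum and maximum
edges_from_int = np.histogram_bin_edges(sample, bins=5)

# 2. A sequence: explicit bin edges
edges_from_seq = np.histogram_bin_edges(sample, bins=[50, 52, 53, 55])

# 3. A string: a named strategy that picks the number of bins from the data
edges_from_str = np.histogram_bin_edges(sample, bins='sturges')

print(edges_from_int)
print(edges_from_seq)
print(edges_from_str)
```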

There are several methods for determining the number of bins, but here we will use Sturges' formula (written in Python): bins = 1 + int(np.log2(n)), where n is the sample size (the size of the array).

Let’s see it in action:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

url = 'https://staging-content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Specifying the number of bins
plt.hist(weather_df['Seattle'], bins=1 + int(np.log2(len(weather_df))))
plt.show()

The DataFrame has 26 rows (the size of the Series). Since log2(26) ≈ 4.7 is truncated to 4 by int(), the resulting number of bins is 5.
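
The arithmetic can be checked with a small helper. The function name sturges_bins is hypothetical; it simply wraps the expression used in the text.

```python
import numpy as np

def sturges_bins(n):
    """Number of bins from the formula in the text: 1 + int(log2(n))."""
    return 1 + int(np.log2(n))

# For the 26-row sample: log2(26) is about 4.7, truncated to 4, plus 1
print(sturges_bins(26))
```

Because the formula grows with the logarithm of the sample size, even a sample of 1000 values yields only a modest number of bins.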

Probability Density Approximation

That’s all fine, but what if we want to look at an approximation of the probability density? All we need to do is set the density parameter to True.

Now the height of each bin is the count of the values in its interval divided by the product of the total number of values (the sample size) and the bin width. As a result, the areas of the bins sum to 1, which is exactly what we need from a probability density function.

Let’s now modify our example:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

url = 'https://staging-content-media-cdn.codefinity.com/courses/47339f29-4722-4e72-a0d4-6112c70ff738/weather_data.csv'
weather_df = pd.read_csv(url, index_col=0)

# Making a histogram a probability density function approximation
plt.hist(weather_df['Seattle'], bins=1 + int(np.log2(len(weather_df))), density=True)
plt.show()

Now we have an approximation of the probability density function for our temperature data.
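
The height formula described above (count divided by sample size times bin width) can be verified numerically. This is a minimal sketch on a hypothetical sample, using NumPy's histogram function rather than the plot itself.

```python
import numpy as np

# Hypothetical sample used only for illustration
sample = np.array([50.0, 50.2, 51.2, 51.7, 52.3, 52.6, 53.1, 53.4, 54.2, 55.0])

# Raw counts and bin edges
counts, edges = np.histogram(sample, bins=4)
widths = np.diff(edges)

# Height = count / (sample size * bin width)
manual_density = counts / (counts.sum() * widths)

# The same heights computed by NumPy with density=True
density, _ = np.histogram(sample, bins=4, density=True)

# The bar areas (height * width) should sum to 1
total_area = np.sum(density * widths)
print(manual_density, density, total_area)
```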

If you want to explore more about the hist() function parameters, you can refer to its documentation.

Task

Your task is to create an approximation of a probability density function using a sample from the standard normal distribution:

  1. Use the correct function for creating a histogram.
  2. Use normal_sample as the data for the histogram.
  3. Specify the number of bins as the second argument using the Sturges' formula.
  4. Make the histogram an approximation of a probability density function via correctly specifying the rightmost argument.
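
One possible way to approach these steps is sketched below. It assumes normal_sample is a NumPy array; since the exercise environment is not shown here, the sample is generated with a hypothetical size of 1000 and a fixed seed.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: the exercise supplies normal_sample; it is generated here for illustration
rng = np.random.default_rng(seed=42)
normal_sample = rng.standard_normal(1000)

# Number of bins from the formula used in this chapter
n_bins = 1 + int(np.log2(len(normal_sample)))

# bins is passed as the second positional argument, density as the last keyword
heights, edges, _ = plt.hist(normal_sample, n_bins, density=True)
plt.show()
```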

Section 4. Chapter 1