Replace Numerical Missing Data with Values | Data Cleaning
Preprocessing Data

1. Data Exploration
2. Data Cleaning
3. Data Validation
4. Normalization & Standardization
5. Data Encoding

Replace Numerical Missing Data with Values

In the previous chapter, you dropped all the records containing NaN. If everything was done correctly, the shape of the new DataFrame was (183, 12): we lost 708 rows out of 891!

To avoid losing data because of NaNs, you can replace the missing values instead of dropping them. The right approach depends on how much data is missing and how it is distributed (in one column or several, evenly or not, and so on).

The popular approaches to dealing with NaNs in numerical data are:

  • replace with the mean value: suitable for normally distributed data with a small number of NaNs.
  • replace with the mode value: suitable for exponential distributions with a small number of NaNs.
  • replace with the max or min value: suitable if you are sure there are no outliers that may affect the result.
  • replace with a constant value: for example, 0 or 1, if the only possible values are 0 and 1.
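The strategies above can be sketched on a toy column. This is a minimal illustration, assuming pandas and NumPy are installed; the `ages` Series and its values are made up for the example and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 22.0, np.nan, 54.0])

# Each strategy computes one fill value from the non-missing entries
fill_values = {
    "mean": ages.mean(),
    "mode": ages.mode().iloc[0],  # mode() returns a Series; take the first mode
    "min": ages.min(),
    "max": ages.max(),
    "const": 0,
}

# fillna() returns a new Series; the original `ages` is left unchanged
filled = {name: ages.fillna(value) for name, value in fill_values.items()}

for name, series in filled.items():
    print(name, series.tolist())
```

Comparing the printed lists side by side shows how strongly the choice of fill value can shift the column, which is why the distribution of the data matters.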

To replace the NaNs, you can use fillna():

    data.fillna(some_val)  # replaces NaN with some_val
    # or
    data.fillna(some_val, inplace=True)  # changes the data in place
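As a small illustration of the call above, fillna() also accepts a dict that maps column names to fill values, so each column can get its own replacement. The DataFrame here is a made-up example, with column names borrowed from the Titanic data for familiarity.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with one NaN in each column
data = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, np.nan],
})

# Without inplace=True, fillna() returns a new DataFrame
filled = data.fillna({"Age": data["Age"].mean(), "Fare": 0.0})

print(filled)
print(data.isna().sum().sum())  # the original still has its NaNs
```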

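You can also use replace(old_val, new_val), which replaces not only NaNs but any other value:

    import numpy as np

    data['Age'].replace(np.nan, 0)  # returns a copy with NaN replaced by 0
    # or assign the result back to keep the change; in recent pandas versions,
    # inplace=True on a selected column may not update the DataFrame
    data['Age'] = data['Age'].replace(np.nan, 0)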
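Because replace() works on any value, it can also turn placeholder values into NaN before imputing. A minimal sketch, assuming a hypothetical column where -1 was used to mark an unknown age:

```python
import numpy as np
import pandas as pd

# Hypothetical column: -1 marks an unknown age
ages = pd.Series([22.0, -1.0, 35.0, np.nan])

# Step 1: convert the sentinel to NaN so all missing values look alike
cleaned = ages.replace(-1.0, np.nan)

# Step 2: replace every NaN with one value (here, the mean of the known ages)
imputed = cleaned.replace(np.nan, cleaned.mean())

print(imputed.tolist())
```

Converting the sentinel first matters: if -1 stayed in the column, it would drag the mean down and distort the fill value.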

Do you remember that the Titanic dataset contains missing values in the Age column? Instead of dropping rows, let's think about how to replace the NaNs and keep the data.

Task

If the share of NaNs is low enough, replace them with a value. But which one? Do the following:

  1. Calculate the share of missing values in the Age column. Round this value to 2 decimal places.
  2. Build a histogram of the Age distribution. Use matplotlib.pyplot and its hist() function.
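One possible way to approach both steps, shown on made-up data rather than the real Titanic file. The `data` DataFrame below is hypothetical, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data
data = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0]})

# isna() gives booleans; their mean is the fraction of missing values
nan_share = round(data["Age"].isna().mean(), 2)
print(nan_share)

# NaN values cannot be binned, so drop them before plotting
counts, bin_edges, _ = plt.hist(data["Age"].dropna(), bins=5)
plt.xlabel("Age")
plt.ylabel("Count")
plt.savefig("age_hist.png")  # or plt.show() in an interactive session
```

The number of bins is an arbitrary choice here; on the real column, a larger value usually gives a clearer picture of the shape.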

You should now have a histogram of the Age distribution similar to the expected one.

If so, move on to the next chapter to deal with the NaNs.

Section 2. Chapter 3