Work with NaNs | Data Cleaning
Preprocessing Data

Work with NaNs

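To check whether a value is NaN, use isna(). Called on a whole dataframe or on a single column, the method returns an object of the same shape that holds True where the value is NaN and False otherwise. For a single cell, pass the value to pd.isna(), which returns one True or False.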

    print(data.isna())
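
As a small illustration of the three levels mentioned above, the sketch below uses a hand-made dataframe. The name `toy` and its values are invented for this example and are not taken from the Titanic data. A single cell is a plain scalar, so the sketch checks it with the top-level `pd.isna()` function rather than with a method call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [None, "C85", None],
})

frame_mask = toy.isna()                    # boolean dataframe, same shape as toy
column_mask = toy["Age"].isna()            # boolean Series for one column
cell_is_nan = pd.isna(toy.loc[1, "Age"])   # single boolean for one cell

print(frame_mask)
print(column_mask.tolist())
print(cell_is_nan)
```

Both `np.nan` and `None` are reported as missing here, which is why the Cabin column is flagged in two of its three rows.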

It is usually more informative to count the NaNs in each column. Because True counts as 1 and False as 0, chaining sum() after isna() gives the number of missing values per column:

    import pandas as pd
    import numpy as np

    # Load the Titanic dataset used throughout this chapter
    data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/10db3746-c8ff-4c55-9ac3-4affa0b65c16/titanic.csv')

    # Count the missing values in every column
    print(data.isna().sum())

If you run the code above (reset the editor and paste the code into it), you should see the following output:

    PassengerId      0
    Survived         0
    Pclass           0
    Name             0
    Sex              0
    Age            177
    SibSp            0
    Parch            0
    Ticket           0
    Fare             0
    Cabin          687
    Embarked         2
    dtype: int64

The Embarked column has only 2 NaNs, which is negligible for almost 900 records, but look at Cabin: 687 values are missing, more than 75% of its entries. Age is also missing 177 values. Gaps of this size have to be handled in some way.
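
To turn counts like these into percentages, one option is to take the mean of the boolean mask instead of its sum. The sketch below is a minimal illustration on made-up data; `toy`, `missing_count`, `missing_share` and `summary` are hypothetical names chosen for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "Q"],
    "Cabin": [None, None, "C85", None],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

missing_count = toy.isna().sum()    # number of NaNs per column
missing_share = toy.isna().mean()   # mean of booleans = fraction of NaNs

# Put both views side by side, worst column first
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("share", ascending=False))
```

The same two lines should work on the Titanic dataframe by replacing `toy` with `data`.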

Drop NaNs

The simplest way to deal with missing data is to drop the records (rows) that contain it, using the dropna() method. By default it removes every row that has at least one NaN. Note that dropna() does not change the dataframe it is called on; it returns a new one. To modify the dataframe itself, pass inplace=True:

    clean_data = data.dropna()  # data is not modified; clean_data contains no NaNs
    data.dropna(inplace=True)   # data itself is modified
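
The difference between the returned copy and the original can be checked by comparing shapes. The sketch below does this on a made-up dataframe (`toy` and its values are hypothetical). It also uses the `subset` parameter of dropna(), which the lesson text does not cover and is shown here only as an assumption about what may be useful when a single column matters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": ["C85", "C123", None, "E46"],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})

# Returns a new dataframe without the rows that hold any NaN; toy is unchanged
clean = toy.dropna()
print(toy.shape, clean.shape)

# Only rows with a NaN in Age are dropped; NaNs in other columns are kept
age_known = toy.dropna(subset=["Age"])
print(age_known.shape)
```

Restricting the check to one column keeps more rows, at the cost of leaving NaNs in the columns that were not listed.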

Task

Apply dropna() to the dataframe data. Then check the shape of the dataframe after the modification and compare it with the original shape (before the modification).

The expected shape of the new dataframe is (183, 12).
