Label Encoding | Data Encoding
Preprocessing Data

1. Data Exploration
2. Data Cleaning
3. Data Validation
4. Normalization & Standardization
5. Data Encoding

Label Encoding

Label Encoding is the process of converting non-numerical (categorical) values into integer codes, one code per distinct category. The encoded values are machine-readable, and it is then up to the machine learning algorithm how it handles those labels. It is an important preprocessing step for structured datasets in supervised learning.
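As a minimal sketch of the idea, assume a small made-up list of categories (the `colors` values are purely illustrative and do not come from the dataset). Each distinct value gets the next free integer in order of first appearance:

```python
# Hypothetical category values, used only for illustration.
colors = ["red", "green", "blue", "green", "red"]

# Assign each distinct value an integer code in order of first appearance.
codes = {}
for value in colors:
    if value not in codes:
        codes[value] = len(codes)

# Replace every value with its code.
encoded = [codes[value] for value in colors]

print(codes)    # {'red': 0, 'green': 1, 'blue': 2}
print(encoded)  # [0, 1, 2, 1, 0]
```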

Before encoding, inspect the data you are working with. For example, let's explore the non-numerical feature Cabin, which contains the following values:

print(data['Cabin'].unique())

[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33' 'B30' 'C52' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110' 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49' 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77' 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106' 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91' 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34' 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79' 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68' 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58' 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90' 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6' 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50' 'B42' 'C148']
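Besides listing the unique values, it can help to count them and to measure the share of missing entries. The sketch below assumes a tiny hand-made Series (`sample` is hypothetical and only borrows a few cabin labels from the output above); on the real column the same calls would be applied to `data['Cabin']`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, not the course dataset.
sample = pd.Series([np.nan, 'C85', np.nan, 'C123', 'E46', np.nan, 'C85', np.nan])

# Number of distinct non-missing values (nunique skips NaN by default).
n_unique = sample.nunique()

# Fraction of missing values: the mean of a boolean mask.
missing_share = float(sample.isna().mean())

print(n_unique)       # 3
print(missing_share)  # 0.5
```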

nan corresponds to the passengers with no cabin recorded, and we already know that they make up about 77% of the dataset. The other passengers had at least one cabin; to be precise:

# Count the cabins per passenger (0 if the value is missing)
cabin_count = data['Cabin'].apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
print(cabin_count.value_counts())

cabins  count
0       687
1       180
2       16
3       6
4       2

So only about 2.7% of passengers had more than one cabin.
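Both percentages quoted in this chapter can be checked directly from the counts printed above. The snippet below simply re-enters those counts by hand (the `cabin_counts` dictionary is an illustrative helper, not part of the course code):

```python
# Counts copied from the value_counts() output: number of cabins -> passengers.
cabin_counts = {0: 687, 1: 180, 2: 16, 3: 6, 4: 2}

total = sum(cabin_counts.values())

# Share of passengers with no cabin recorded.
no_cabin_share = cabin_counts[0] / total

# Share of passengers with more than one cabin.
multi_cabin_share = sum(n for cabins, n in cabin_counts.items() if cabins > 1) / total

print(total)                       # 891
print(f"{no_cabin_share:.1%}")     # 77.1%
print(f"{multi_cabin_share:.1%}")  # 2.7%
```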

We'll transform the data to reduce the number of unique values in the Cabin column without losing important information. Suppose that the letter refers to the floor or location on the ship and is much more important than the number. Since the share of passengers with multiple cabins is small, we'll also assume that all of a passenger's cabins share the same letter.

In summary, we'll keep only the cabin's letter, and for passengers with NaN we'll replace the value with Z.

# Keep only the first letter of the cabin; use 'Z' for missing values
data['Cabin'] = data['Cabin'].apply(lambda x: 'Z' if pd.isna(x) else x[:1])
print(data['Cabin'].value_counts())

cabin  count
Z      687
C      59
B      47
D      33
E      32
A      15
F      13
G      4
T      1
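The same transformation can also be written with the pandas vectorized string accessor instead of a lambda. The sketch below assumes a small hand-picked sample of cabin values from the list shown earlier (the `sample` Series is hypothetical, not the course dataset) and checks that both approaches agree:

```python
import numpy as np
import pandas as pd

# Hypothetical sample built from a few of the cabin values listed earlier.
sample = pd.Series([np.nan, 'C85', 'C23 C25 C27', 'F G73', 'T', 'D'])

# Approach used in this chapter: element-wise lambda.
letters_apply = sample.apply(lambda x: 'Z' if pd.isna(x) else x[:1])

# Vectorized alternative: take the first character, then fill missing values.
letters_str = sample.str[0].fillna('Z')

print(letters_apply.tolist())  # ['Z', 'C', 'C', 'F', 'T', 'D']
print(letters_str.tolist())    # ['Z', 'C', 'C', 'F', 'T', 'D']
```

Note that a value such as 'F G73' is reduced to 'F', which is where the assumption about multiple cabins sharing one letter comes in.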

An ML model cannot work with string values directly, so the idea is to create a mapping and convert these values to numbers. We can do it manually by building a mapping and transforming the column data:

# Give each unique value an integer code in order of first appearance
mapping = {val: i for i, val in enumerate(data['Cabin'].unique())}
print(mapping)
# Replace every value with its code
encoded_data = pd.DataFrame([mapping[val] for val in data['Cabin']])
print(encoded_data.tail())
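Once a mapping exists, pandas can apply it with `Series.map`, and `pd.factorize` produces the same first-appearance codes in one call. This is a sketch on a short made-up Series of cabin letters (`cabin` is hypothetical, not the course dataset), offered as an alternative to the manual list comprehension rather than as the course's own solution:

```python
import pandas as pd

# Hypothetical sample of already-reduced cabin letters.
cabin = pd.Series(['Z', 'C', 'Z', 'E', 'G', 'C'])

# Manual mapping: unique values numbered in order of first appearance.
mapping = {value: code for code, value in enumerate(cabin.unique())}

# Series.map looks up each value in the mapping and keeps the original index.
encoded = cabin.map(mapping)

# pd.factorize returns the integer codes and the unique values they refer to.
codes, uniques = pd.factorize(cabin)

print(mapping)           # {'Z': 0, 'C': 1, 'E': 2, 'G': 3}
print(encoded.tolist())  # [0, 1, 0, 2, 3, 1]
print(codes.tolist())    # [0, 1, 0, 2, 3, 1]
```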

Task

Apply Label Encoding to the Embarked column by creating a mapping. Modify the data in place.

Section 5. Chapter 1