Preprocessing Data
Label Encoding
Label Encoding is the process of converting non-numerical (categorical) values into numeric codes, one code per category, so that the data is in a machine-readable form. How those numeric labels are then used depends on the machine learning algorithm. It is an important preprocessing step for structured datasets in supervised learning.
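As a minimal sketch of the idea, the toy example below assigns an integer code to each distinct string in a small list. The list `colors` and its values are made up for illustration and are not taken from the dataset.

```python
# Toy categorical values (hypothetical, for illustration only)
colors = ['red', 'green', 'red', 'blue', 'green']

# Assign one integer per distinct value, in sorted order for reproducibility
codes = {value: code for code, value in enumerate(sorted(set(colors)))}

# Replace every value with its integer code
encoded = [codes[value] for value in colors]

print(codes)    # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # [2, 1, 2, 0, 1]
```

Sorting the distinct values is only one possible convention; any one-to-one assignment of integers to categories is a valid label encoding.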
Before encoding, inspect the data you are working with. For example, let's explore the non-numerical feature Cabin, which contains the following values:
print(data['Cabin'].unique())
[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33' 'B30' 'C52' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110' 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49' 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77' 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106' 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91' 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34' 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79' 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68' 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58' 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90' 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6' 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50' 'B42' 'C148']
nan corresponds to the passengers with no recorded cabin, and we already know that they make up about 77% of the dataset. The other passengers had at least one cabin; to be precise:
# Number of cabins per passenger: 0 for NaN, otherwise the number of space-separated entries
cabin_count = data['Cabin'].apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
print(cabin_count.value_counts())
cabins | count |
0 | 687 |
1 | 180 |
2 | 16 |
3 | 6 |
4 | 2 |
So the share of passengers with more than one cabin is about 2.7% (24 of 891).
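The percentages quoted above can be checked directly from the counts in the table. A small sketch, with the counts copied by hand into a hypothetical `counts` dictionary rather than read from `data`:

```python
# Passengers by number of cabins, copied by hand from the table above
counts = {0: 687, 1: 180, 2: 16, 3: 6, 4: 2}

total = sum(counts.values())
no_cabin = counts[0]
multiple = sum(n for cabins, n in counts.items() if cabins > 1)

print(total)                             # 891
print(round(100 * no_cabin / total, 1))  # 77.1
print(round(100 * multiple / total, 1))  # 2.7
```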
We'll transform the data to reduce the number of unique values in the Cabin column without losing important information. Suppose that the letter refers to the deck or location on the ship and that it is much more important than the number. Since the number of passengers with multiple cabins is small, we'll assume that all of a passenger's cabins share the same letter (the few exceptions, such as 'F G73', are reduced to their first letter).
In summary, we'll keep only the cabin's letter and replace NaN with Z for passengers without a cabin.
# Keep the first letter of the cabin string; use 'Z' when the cabin is missing
data['Cabin'] = data['Cabin'].apply(lambda x: 'Z' if pd.isna(x) else x[:1])
print(data['Cabin'].value_counts())
cabin | count |
Z | 687 |
C | 59 |
B | 47 |
D | 33 |
E | 32 |
A | 15 |
F | 13 |
G | 4 |
T | 1 |
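The same reduction can also be written with pandas string methods instead of `apply`. The sketch below runs on a small hand-made Series (`cabins` is a hypothetical stand-in for `data['Cabin']`) and assumes, as above, that only the first letter matters:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw Cabin column
cabins = pd.Series([np.nan, 'C85', 'C23 C25 C27', 'F G73', 'T'])

# Take the first character of each string; missing values stay NaN
# after .str[0] and are then replaced with 'Z'
letters = cabins.str[0].fillna('Z')

print(letters.tolist())  # ['Z', 'C', 'C', 'F', 'T']
```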
Most ML models cannot work with string values directly, so the idea is to build a mapping from each category to a number and use it to transform the column. We can do this manually:
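# Assign each unique cabin letter an integer, in order of first appearance
mapping = {val: i for i, val in enumerate(data['Cabin'].unique())}
print(mapping)
# Look up the code for every row and collect the results in a new DataFrame
encoded_data = pd.DataFrame([mapping[val] for val in data['Cabin']])
print(encoded_data.tail())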
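For comparison, a mapping like this can also be applied with `Series.map`, and inverting the dictionary recovers the original letters. The sketch below uses a small hand-made Series rather than the real column, and sorts the categories so that the codes do not depend on row order; both choices are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample of already-reduced cabin letters
cabin = pd.Series(['Z', 'C', 'Z', 'E', 'C'])

# Build the mapping from the sorted unique values: {'C': 0, 'E': 1, 'Z': 2}
mapping = {letter: code for code, letter in enumerate(sorted(cabin.unique()))}

# Encode with Series.map, then decode with the inverted mapping
encoded = cabin.map(mapping)
inverse = {code: letter for letter, code in mapping.items()}
decoded = encoded.map(inverse)

print(encoded.tolist())                   # [2, 0, 2, 1, 0]
print(decoded.tolist() == cabin.tolist()) # True
```

Keeping the mapping dictionary around is what makes the encoding reversible and lets the same codes be reused on new data.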
Task
Apply Label Encoding to the Embarked column by creating a mapping. Modify the data in place.