Label Encoding

Label Encoding is the process of converting non-numerical (categorical) values into numerical labels, so that each distinct category is represented by a number and the data takes a machine-readable form. The machine learning algorithm then decides how to use those labels. It is an important preprocessing step for structured datasets in supervised learning.

Before encoding, inspect the data you are working with. For example, let's explore the non-numerical feature Cabin, which contains the following values:


print(data['Cabin'].unique())

[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33' 'B30' 'C52' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110' 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49' 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77' 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106' 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91' 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34' 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79' 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68' 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58' 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90' 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6' 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50' 'B42' 'C148']

nan marks the passengers who had no cabin, and we already know that they make up about 77% of all passengers. The other passengers had at least one cabin; to be precise:


cabin_count = data['Cabin'].apply(lambda x : 0 if pd.isna(x)
                     else len(x.split(' ')))
print(cabin_count.value_counts())

cabins	count
0	687
1	180
2	16
3	6
4	2

So passengers with more than one cabin make up about 2.7% of the dataset.
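Both percentages quoted above can be re-derived from the counts in the table. This is a minimal sketch that rebuilds the counts by hand as a `pd.Series` instead of loading the dataset; the variable names (`counts`, `no_cabin_share`, `multi_cabin_share`) are introduced here for illustration only.

```python
import pandas as pd

# Cabin counts copied from the value_counts() output above:
# index = number of cabins per passenger, value = number of passengers.
counts = pd.Series({0: 687, 1: 180, 2: 16, 3: 6, 4: 2})

total = counts.sum()                                         # all passengers
no_cabin_share = counts[0] / total                           # no cabin at all
multi_cabin_share = counts[counts.index > 1].sum() / total   # more than one cabin

print(f"no cabin: {no_cabin_share:.1%}")
print(f"more than one cabin: {multi_cabin_share:.1%}")
```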

We'll try to transform the data to reduce the number of unique values in the Cabin column without losing important information. Suppose that the letter refers to the floor or location on the ship, and that it is much more important than the number. Since the number of passengers with multiple cabins is small, we'll assume that all of their cabins share the same letter.
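The same-letter assumption can be checked directly. The sketch below uses a handful of multi-cabin values copied from the `unique()` output above rather than the full dataset, and the helper name `cabin_letters` is hypothetical.

```python
# A few multi-cabin values copied from the unique() output above.
samples = ['C23 C25 C27', 'B57 B59 B63 B66', 'D10 D12', 'F G73', 'F E69']

def cabin_letters(value):
    """Return the set of first letters of every cabin in the string."""
    return {cabin[0] for cabin in value.split(' ')}

for value in samples:
    letters = cabin_letters(value)
    status = 'same letter' if len(letters) == 1 else 'mixed letters'
    print(f"{value!r}: {sorted(letters)} -> {status}")
```

Values such as 'F G73' mix letters, so the assumption is only an approximation; keeping the first character of the string gives `F` for those rows.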

In summary, we'll keep only the cabin's letter, and for passengers with NaN we'll replace the value with Z.


data['Cabin'] = data['Cabin'].apply(lambda x : 'Z' if pd.isna(x) else x[:1])
print(data['Cabin'].value_counts())

cabin	count
Z	687
C	59
B	47
D	33
E	32
A	15
F	13
G	4
T	1

An ML model cannot process string values directly, so the idea is to create a mapping and convert these values to numbers. We can do this manually by creating a mapping and transforming the column data:


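# Assign each unique Cabin letter an integer in order of first appearance
mapping = {val: i for i, val in enumerate(data['Cabin'].unique())}
print(mapping)
# Replace every letter with its integer code
encoded_data = pd.DataFrame([mapping[val] for val in data['Cabin']])
print(encoded_data.tail())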
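The same integer codes can also be obtained without building the dictionary by hand. As a hedged sketch on a small toy column (illustrative values, not the course dataset), `pd.factorize` numbers the values in order of first appearance, which matches the manual mapping above, and the returned `uniques` can be used to decode the labels again.

```python
import pandas as pd

# Toy column standing in for the cleaned Cabin letters (illustrative values only).
cabin = pd.Series(['Z', 'C', 'Z', 'E', 'C', 'B'])

# Manual mapping, as in the text: each unique value gets its position index.
mapping = {val: i for i, val in enumerate(cabin.unique())}
manual_codes = cabin.map(mapping)

# pd.factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(cabin)

print(mapping)
print(list(codes))

# Decoding: index back into uniques to recover the original letters.
decoded = uniques[codes]
print(list(decoded))
```

Keeping `mapping` (or `uniques`) around matters in practice, because it is the only record of which number stands for which category.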

Task

Apply Label Encoding to the Embarked column by creating a mapping. Modify the data in place.

