
Label Encoding

Label Encoding converts non-numerical (categorical) values into numerical codes, one integer per category, so that machine learning algorithms can process them. How a model interprets those codes depends on the algorithm. It is an important preprocessing step for structured datasets in supervised learning.
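As a minimal sketch of the idea, assuming a small made-up list of category labels (the list `values` below is illustrative and not taken from the dataset), each distinct label receives one integer code, and the mapping can be inverted to recover the original labels:

```python
# Illustrative labels only; not read from the lesson's dataset
values = ['C', 'Z', 'B', 'Z', 'C']

# One integer per distinct label, assigned in sorted order
mapping = {label: code for code, label in enumerate(sorted(set(values)))}

# Encode the labels, then decode them back with the inverse mapping
encoded = [mapping[v] for v in values]
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[c] for c in encoded]

print(mapping)
print(encoded)
print(decoded)
```

The codes carry no meaning beyond identifying a category; the order in which they are assigned is a choice of this sketch.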

Before encoding, inspect the data you are working with. For example, let's explore the non-numerical feature Cabin, which contains the following values:


print(data['Cabin'].unique())

[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33' 'B30' 'C52' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110' 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49' 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77' 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106' 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91' 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34' 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79' 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68' 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58' 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90' 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6' 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50' 'B42' 'C148']

nan corresponds to passengers with no recorded cabin, and we already know that they make up about 77% of the data. The other passengers had at least one cabin. To be precise:


cabin_count = data['Cabin'].apply(lambda x : 0 if pd.isna(x)
                     else len(x.split(' ')))
print(cabin_count.value_counts())

cabins	count
0	687
1	180
2	16
3	6
4	2

So, passengers with more than one cabin make up only 2.7% of the data.
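The percentages quoted above follow directly from the counts in the table. A short sketch of the arithmetic, with the counts copied by hand into a dictionary (the variable names are illustrative):

```python
# Number of passengers by how many cabins they had (from the table above)
cabin_counts = {0: 687, 1: 180, 2: 16, 3: 6, 4: 2}

total = sum(cabin_counts.values())

# Share of passengers without a cabin
no_cabin_share = cabin_counts[0] / total

# Share of passengers with more than one cabin
multi_cabin = sum(n for cabins, n in cabin_counts.items() if cabins > 1)
multi_cabin_share = multi_cabin / total

print(total)
print(f"{no_cabin_share:.1%}")
print(f"{multi_cabin_share:.1%}")
```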

We'll transform the data to reduce the number of unique values in the Cabin column without losing important information. Suppose that the letter refers to the floor or location on the ship and is much more important than the number. Since the share of passengers with multiple cabins is small, we'll assume that all of a passenger's cabins have the same letter.
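The same-letter assumption can be checked before relying on it. The sketch below uses a handful of values copied from the output above rather than the full column, and the helper `deck_letters` is a hypothetical name introduced for illustration. It lists the entries whose cabins start with different letters:

```python
import pandas as pd

# A few values copied from the unique() output above, plus a missing one
sample = pd.Series(['C23 C25 C27', 'F G73', 'B58 B60', 'F E69', 'C85', None])

def deck_letters(cabin):
    """Return the set of first letters among a passenger's cabins."""
    if pd.isna(cabin):
        return set()
    return {token[0] for token in cabin.split()}

letters = sample.apply(deck_letters)

# Entries whose cabins do not all share one letter
mixed = sample[letters.apply(len) > 1]
print(mixed.tolist())
```

Values such as 'F G73' and 'F E69' do not fit the assumption exactly; taking only the first character, as done next, assigns them to F. With so few such rows, this simplification seems acceptable for the purposes of the lesson.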

In summary, we'll keep only the cabin's letter and replace NaN with Z for passengers without a cabin.


data['Cabin'] = data['Cabin'].apply(lambda x : 'Z' if pd.isna(x) else x[:1])
print(data['Cabin'].value_counts())

cabin	count
Z	687
C	59
B	47
D	33
E	32
A	15
F	13
G	4
T	1

An ML model cannot work with string values directly, so the idea is to create a mapping and convert these values to numbers. We can do this manually by creating a mapping and transforming the column data:


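# Assign an integer code to each unique cabin letter, in order of first appearance
mapping = {letter: code for code, letter in enumerate(data['Cabin'].unique())}
print(mapping)
# Build a new DataFrame that holds the encoded values
encoded_data = pd.DataFrame([mapping[val] for val in data['Cabin']])
print(encoded_data.tail())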

Task


Apply Label Encoding to the Embarked column by creating a mapping. Modify the data in place.
