Learn Label Encoding | Data Encoding

Section 5. Chapter 1

Label Encoding is the process of converting non-numerical (categorical) values into numerical codes, so that each distinct category is represented by an integer. This puts the data into a form that machine learning algorithms can read; how those integer labels are then interpreted depends on the algorithm. It is an important preprocessing step for structured datasets in supervised learning.
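As a minimal illustration of the idea, the sketch below encodes a small made-up list of colors with pandas' `Series.factorize`, which assigns integer codes in order of first appearance. The `colors` data is hypothetical and is not part of the dataset used in this chapter.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
colors = pd.Series(['red', 'green', 'red', 'blue'])

# factorize returns the integer codes and the unique values they refer to
codes, uniques = colors.factorize()

print(codes.tolist())    # [0, 1, 0, 2]
print(list(uniques))     # ['red', 'green', 'blue']
```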

Before encoding, inspect the data you are working with. For example, let's explore the non-numerical feature Cabin, which contains the following values:


print(data['Cabin'].unique())

[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33' 'B30' 'C52' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110' 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49' 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77' 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106' 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91' 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34' 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79' 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68' 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58' 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90' 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6' 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50' 'B42' 'C148']

nan corresponds to the passengers who had no cabin recorded, and we already know that they make up about 77% of the dataset. The other passengers had at least one cabin; to be precise:


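# Count cabins per passenger: 0 for a missing value, otherwise the number of space-separated entries
cabin_count = data['Cabin'].apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
print(cabin_count.value_counts())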

cabins	count
0	687
1	180
2	16
3	6
4	2

So the share of passengers with more than one cabin is about 2.7%.
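Both percentages quoted in this chapter follow directly from the counts in the output above. The short check below re-enters those counts by hand (the `counts` Series is just a copy of that output, not a new computation on the dataset).

```python
import pandas as pd

# Counts copied from the value_counts() output above (cabins -> passengers)
counts = pd.Series({0: 687, 1: 180, 2: 16, 3: 6, 4: 2})

total = counts.sum()                                        # 891 passengers
share_no_cabin = counts[0] / total * 100                    # no cabin recorded
share_multi = counts[counts.index > 1].sum() / total * 100  # more than one cabin

print(round(share_no_cabin, 1))   # 77.1
print(round(share_multi, 1))      # 2.7
```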

We'll try to transform the data to reduce the number of unique values in the Cabin column without losing important information. Suppose that the letter refers to the floor or location on the ship, and that it matters much more than the numerical part. Since the number of passengers with multiple cabins is small, we'll assume that all of a passenger's cabins share the same letter.

In summary, we'll keep only the cabin's letter, and for passengers with NaN we'll replace the value with Z.


data['Cabin'] = data['Cabin'].apply(lambda x : 'Z' if pd.isna(x) else x[:1])
print(data['Cabin'].value_counts())

cabin	count
Z	687
C	59
B	47
D	33
E	32
A	15
F	13
G	4
T	1
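The same transformation can also be written with vectorized string methods instead of `apply`. The sketch below uses a small hypothetical Series (`cabins`) in place of the real column, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Cabin values, including a missing one
cabins = pd.Series([np.nan, 'C85', 'C23 C25 C27', 'F G73'])

# Fill missing values with 'Z', then keep only the first character
letters = cabins.fillna('Z').str[0]

print(letters.tolist())  # ['Z', 'C', 'C', 'F']
```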

Most ML models cannot work with string values directly, so the idea is to create a mapping that replaces these values with numbers. We can do this manually by building the mapping and transforming the column data:


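# Assign each unique cabin letter an integer, in order of first appearance
mapping = {val: i for i, val in enumerate(data['Cabin'].unique())}
print(mapping)
# Replace every letter with its integer code
encoded_data = pd.DataFrame([mapping[val] for val in data['Cabin']])
print(encoded_data.tail())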
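A mapping like this can also be applied with `Series.map`, and inverting the dictionary gives a way to decode the integers back into letters. The following is a sketch on a hypothetical toy column (`cabin`), not the chapter's dataset; `inverse_mapping` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy column standing in for data['Cabin']
cabin = pd.Series(['Z', 'C', 'Z', 'B'])

# Integer codes follow the order in which values first appear
mapping = {val: i for i, val in enumerate(cabin.unique())}
encoded = cabin.map(mapping)

# Invert the dictionary to recover the original letters
inverse_mapping = {code: val for val, code in mapping.items()}
decoded = encoded.map(inverse_mapping)

print(mapping)             # {'Z': 0, 'C': 1, 'B': 2}
print(encoded.tolist())    # [0, 1, 0, 2]
print(decoded.tolist())    # ['Z', 'C', 'Z', 'B']
```

Keeping the inverse mapping around is useful whenever encoded values need to be reported back in their original form.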

Task

Apply Label Encoding to the Embarked column by creating a mapping. Modify the data in place.
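One possible way to approach this task is sketched below on a hypothetical toy frame. It assumes the Embarked column has no missing values left (any NaN would need to be filled or dropped first), and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the course DataFrame
data = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

# Build the mapping from the unique values, in order of first appearance
mapping = {val: i for i, val in enumerate(data['Embarked'].unique())}

# Overwrite the column with its integer codes
data['Embarked'] = data['Embarked'].map(mapping)

print(mapping)                      # {'S': 0, 'C': 1, 'Q': 2}
print(data['Embarked'].tolist())    # [0, 1, 0, 2]
```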
