One Hot Encoding
One-hot encoding is another preprocessing technique that is applied before training. You already know Label Encoding, which transforms a categorical column like this:
| Embarked | Label |
|---|---|
| Q | 3 |
| S | 2 |
| S | 2 |
| S | 2 |
| C | 1 |
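As a minimal sketch, the table above can be reproduced with a plain dictionary mapping in pandas. The `label_map` dictionary is a hypothetical mapping chosen only to match the numbers in the table; scikit-learn's LabelEncoder would instead assign 0...n-1 in sorted order of the categories.

```python
import pandas as pd

# The Embarked column from the table above
embarked = pd.Series(["Q", "S", "S", "S", "C"], name="Embarked")

# Hypothetical mapping, chosen to match the labels shown in the table
label_map = {"C": 1, "S": 2, "Q": 3}

# Replace every category with its integer label
labels = embarked.map(label_map)
print(labels.tolist())
```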
So that the model only has to process 0 and 1 values, a one-hot encoder transforms the same column into a matrix:
| Embarked | C | S | Q |
|---|---|---|---|
| Q | 0 | 0 | 1 |
| S | 0 | 1 | 0 |
| S | 0 | 1 | 0 |
| S | 0 | 1 | 0 |
| C | 1 | 0 | 0 |
A 1 means that the row's Embarked value matches the column name (for example, Queenstown matches Q), and a 0 means that it does not. Instead of storing a single value in the range 0...n-1, we create n columns filled with 0s and 1s.
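The Embarked example can be sketched with scikit-learn's OneHotEncoder as follows. Passing `categories` explicitly is an assumption made here only to get the same C, S, Q column order as the table; by default the encoder sorts the categories alphabetically (C, Q, S).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The Embarked column from the table above
data = pd.DataFrame({"Embarked": ["Q", "S", "S", "S", "C"]})

# Fix the column order to C, S, Q (assumption: only to match the table)
encoder = OneHotEncoder(categories=[["C", "S", "Q"]])

# fit_transform returns a sparse matrix, so convert it to a dense array
encoded = encoder.fit_transform(data[["Embarked"]]).toarray()

# Wrap the result in a DataFrame with readable column names
one_hot = pd.DataFrame(
    encoded, columns=encoder.categories_[0], index=data.index
).astype(int)
print(one_hot)
```

Each row contains exactly one 1, located in the column of its category.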
One-hot encoding is also useful when a cell contains multiple values. For example, suppose your dataset contains sentences, and each sentence is labeled with a list of emotions. This format is not convenient to work with, so we transform it:
| emotion | anger | joy | love | neutral | sad |
|---|---|---|---|---|---|
| sad, neutral | 0 | 0 | 0 | 1 | 1 |
| love | 0 | 0 | 1 | 0 | 0 |
| love, joy | 0 | 1 | 1 | 0 | 0 |
| anger | 1 | 0 | 0 | 0 | 0 |
| neutral | 0 | 0 | 0 | 1 | 0 |
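One possible way to build such a multi-value table is the pandas method `Series.str.get_dummies`, sketched below. Note that this is a pandas shortcut rather than OneHotEncoder, and the `", "` separator is an assumption about how the labels are stored in the cell.

```python
import pandas as pd

# The emotion column from the table above: several labels per cell
emotions = pd.Series(
    ["sad, neutral", "love", "love, joy", "anger", "neutral"], name="emotion"
)

# Split every cell on ", " and create one 0/1 column per distinct label
multi_hot = emotions.str.get_dummies(sep=", ")
print(multi_hot)
```

Unlike in the single-value case, a row can contain more than one 1 here.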
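We will use OneHotEncoder to create new features from the categorical columns of our dataset.
Handle NaNs before encoding: depending on the scikit-learn version, OneHotEncoder either raises an error on missing values or treats them as a separate category, so you have to preprocess them first.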
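A minimal sketch of such NaN preprocessing with pandas is shown below. The tiny DataFrame is made-up illustration data, not the real dataset; the strategy (drop rows with a missing Embarked, fill a missing Age with the mean) is the one used in the task of this chapter.

```python
import numpy as np
import pandas as pd

# Made-up illustration data with missing values in both columns
data = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Drop the rows where the categorical value is missing
data = data.dropna(subset=["Embarked"])

# Replace the missing numerical values with the column mean
data["Age"] = data["Age"].fillna(data["Age"].mean())

# Reset the index so that later joins align row by row
data = data.reset_index(drop=True)
print(data)
```

Resetting the index after dropping rows matters if the encoded features are later joined back by position.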
The common syntax is as follows:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# data is loaded already
# num_cols and cat_cols are created already
encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[cat_cols]).toarray()
# keep the original index so that join aligns the rows correctly
new_data = pd.DataFrame(encoded, index=data.index)
# join the new features to the dataset, but remove the categorical features
data = data[num_cols].join(new_data)
```
Apply One Hot Encoding to the dataset.
- Load the dataset.
- Process the NaNs: drop the rows where Embarked is missing, and replace the missing Age values with the mean.
- Transform the Cabin data as in the previous chapter (apply the Label Encoding).
- Create the variable cat_cols to store the categorical features: Sex, Cabin, and Embarked.
- Create a OneHotEncoder and store the transformed data in new_data.
- Remove cat_cols from the dataframe, but add new_data.
- Check the sample.