One Hot Encoding
One-hot encoding is another preprocessing technique applied before training. You already know label encoding, which transforms a column like this:
| Embarked | Label |
|---|---|
| Q | 3 |
| S | 2 |
| S | 2 |
| S | 2 |
| C | 1 |
So that the model only has to process 0 and 1 values, the one-hot encoder transforms the column into a matrix:
| Embarked | C | S | Q |
|---|---|---|---|
| Q | 0 | 0 | 1 |
| S | 0 | 1 | 0 |
| S | 0 | 1 | 0 |
| S | 0 | 1 | 0 |
| C | 1 | 0 | 0 |
A 1 means that the value of embark_town matches the column name (for example, Queenstown matches Q), and a 0 means that it does not. Instead of storing one of n values in the range 0...n-1, we create n columns filled with 0s and 1s.
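The transformation can be reproduced on a toy column. This is a minimal sketch, assuming a hypothetical five-row `Embarked` column that mirrors the table; note that scikit-learn sorts the categories, so the columns come out as C, Q, S rather than C, S, Q.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column mirroring the table in the text
data = pd.DataFrame({"Embarked": ["Q", "S", "S", "S", "C"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense
encoded = encoder.fit_transform(data[["Embarked"]]).toarray()

# Categories are stored in sorted order: C, Q, S
categories = list(encoder.categories_[0])
one_hot = pd.DataFrame(encoded, columns=categories, index=data.index)
print(one_hot)
```

Each row contains exactly one 1, in the column of its original category.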
One-hot encoding is also useful when a cell contains multiple values. For example, suppose your dataset contains sentences, each labeled with a list of emotions. That is not a convenient format to work with, so we transform it:
| emotion | anger | joy | love | neutral | sad |
|---|---|---|---|---|---|
| sad, neutral | 0 | 0 | 0 | 1 | 1 |
| love | 0 | 0 | 1 | 0 | 0 |
| love, joy | 0 | 1 | 1 | 0 | 0 |
| anger | 1 | 0 | 0 | 0 | 0 |
| neutral | 0 | 0 | 0 | 1 | 0 |
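One possible way to build such a table is sketched below. It assumes the labels are stored as comma-separated strings and uses the pandas method `Series.str.get_dummies` rather than OneHotEncoder, which expects a single category per cell; the `emotion` column is a hypothetical stand-in for the dataset described above.

```python
import pandas as pd

# Hypothetical labels, stored as comma-separated strings
emotions = pd.Series(
    ["sad, neutral", "love", "love, joy", "anger", "neutral"],
    name="emotion",
)

# Split each cell on ", " and create one 0/1 column per distinct label
multi_hot = emotions.str.get_dummies(sep=", ")
print(multi_hot)
```

Unlike the single-value case, a row here may contain several 1s, one for each label attached to the sentence.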
We will use OneHotEncoder to create new features for the categorical columns of our dataset.
OneHotEncoder cannot process NaNs, so you have to preprocess them first.
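A minimal sketch of such preprocessing, assuming a hypothetical toy frame in which rows with a missing categorical value are dropped and missing numeric values are replaced with the column mean (other strategies are equally valid):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
data = pd.DataFrame({
    "Embarked": ["Q", "S", np.nan, "C"],
    "Age": [22.0, np.nan, 30.0, 38.0],
})

# Drop the rows where the categorical value is missing
data = data.dropna(subset=["Embarked"])

# Replace missing numeric values with the mean of the remaining rows
data = data.assign(Age=data["Age"].fillna(data["Age"].mean()))
print(data)
```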
The common syntax is as follows:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# data is loaded already
# num_cols and cat_cols are created already
encoder = OneHotEncoder()
new_data = pd.DataFrame(
    encoder.fit_transform(data[cat_cols]).toarray(),
    index=data.index,  # keep row labels aligned for the join below
)

# join new features to the dataset, but remove categorical features
data = data[num_cols].join(new_data)
```
Apply one-hot encoding to the dataset.
- Load the dataset.
- Process the NaNs: drop the rows where `Embarked` is missing, and replace missing `Age` values with the mean.
- Transform the `Cabin` data as in the previous chapter (apply label encoding).
- Create the variable `cat_cols` to store the categorical features `Sex`, `Cabin`, and `Embarked`.
- Create a OneHotEncoder and store the transformed data in `new_data`.
- Remove the `cat_cols` from the dataframe, but add `new_data`.
- Check the sample.