One Hot Encoding

One-hot encoding is another preprocessing technique that is applied to the data before training. You already know label encoding, which maps each category to an integer like this:

Embarked		Label
Q		3
S		2
S	->	2
S		2
C		1
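As a minimal sketch of the label encoding step, assuming scikit-learn's LabelEncoder is the encoder meant here: it numbers the categories from 0 in sorted order, so the exact integers differ from the illustrative table above, but the idea (one integer per category) is the same. The list `embarked` is a toy stand-in for the real column.

```python
from sklearn.preprocessing import LabelEncoder

# toy stand-in for the Embarked column from the table above
embarked = ["Q", "S", "S", "S", "C"]

encoder = LabelEncoder()
labels = encoder.fit_transform(embarked)

# classes_ holds the sorted unique values; a value's code is its position there
print(list(encoder.classes_))
print(list(labels))
```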

So that the model only has to process 0 and 1 values, a one-hot encoder transforms the column into a matrix:

Embarked		C	S	Q
Q		0	0	1
S		0	1	0
S	->	0	1	0
S		0	1	0
C		1	0	0

A 1 means that the value of embark_town matches the column name (for example, Queenstown matches Q), and a 0 means that it does not. Instead of storing one column with n possible values in the range 0...n-1, we create n columns filled with 0s and 1s.
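The rule described above can be reproduced by hand, which may help to see what the encoder does. This is only an illustrative sketch in plain Python; the names `embarked`, `categories`, and `one_hot` are made up for the example, and the column order is copied from the table.

```python
# toy stand-in for the Embarked column
embarked = ["Q", "S", "S", "S", "C"]
# column order as in the table above
categories = ["C", "S", "Q"]

# one row per value: 1 where the value equals the column's category, else 0
one_hot = [
    [1 if value == category else 0 for category in categories]
    for value in embarked
]

for value, row in zip(embarked, one_hot):
    print(value, row)
```

Every row contains exactly one 1, because each passenger has exactly one port of embarkation.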

One-hot encoding is also useful when a cell contains multiple values. For example, suppose your dataset contains sentences, each labeled with a list of emotions. That format is not convenient to work with, so we transform it:

emotion		anger	joy	love	neutral	sad
sad, neutral		0	0	0	1	1
love		0	0	1	0	0
love, joy	->	0	1	1	0	0
anger		1	0	0	0	0
neutral		0	0	0	1	0
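One possible way to build such a table, assuming the labels are stored as comma-separated strings in a pandas Series, is `Series.str.get_dummies`. This is a sketch rather than the course's own method; the Series `emotion` simply repeats the toy rows from the table.

```python
import pandas as pd

# toy labels, one comma-separated string per sentence
emotion = pd.Series(["sad, neutral", "love", "love, joy", "anger", "neutral"])

# split every cell on ", " and create one 0/1 column per distinct label
encoded = emotion.str.get_dummies(sep=", ")
print(encoded)
```

Unlike the Embarked example, a row here can contain more than one 1, because a sentence may carry several emotions at once.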

We will use OneHotEncoder to create new features for the categorical columns of our dataset.

In older scikit-learn versions OneHotEncoder cannot process NaNs (newer versions treat NaN as a separate category), so handle missing values first.
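A minimal sketch of that preprocessing with pandas, using a made-up toy frame instead of the real dataset: rows with a missing category are dropped, and missing numeric values are filled with the column mean. Which strategy suits which column is a modelling choice, not a rule.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing value in each column
data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# drop rows where the categorical column is missing
data = data.dropna(subset=["Embarked"])

# fill missing numeric values with the mean of the remaining rows
data["Age"] = data["Age"].fillna(data["Age"].mean())

print(data)
```

Note that dropping rows leaves a gap in the index (here the label 2 disappears), which matters later when new columns are joined back by index.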

Typical usage looks like this:


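import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# data is loaded already
# num_cols and cat_cols are created already

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix, so convert it to a dense array
encoded = encoder.fit_transform(data[cat_cols]).toarray()

# reuse data.index so the join below stays aligned even if rows were dropped;
# get_feature_names_out (scikit-learn 1.0+) gives readable column names
new_data = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(cat_cols),
    index=data.index,
)

# join new features to the dataset, but remove categorical features
data = data[num_cols].join(new_data)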
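To try this workflow end to end, here is a self-contained sketch on a hypothetical toy frame (the values and the `Age`/`Embarked` columns are made up for illustration). The index deliberately has a gap, as it would after dropping rows with NaNs, and the encoded frame is built with `index=data.index` so that `join` matches the rows correctly. Column names are assembled from the encoder's `categories_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy frame whose index has a gap, as happens after dropping rows
data = pd.DataFrame(
    {"Age": [22.0, 38.0, 35.0], "Embarked": ["S", "C", "Q"]},
    index=[0, 1, 3],
)
num_cols = ["Age"]
cat_cols = ["Embarked"]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[cat_cols]).toarray()

# categories_ lists the learned values per input column, in output order
columns = [
    f"{col}_{value}"
    for col, values in zip(cat_cols, encoder.categories_)
    for value in values
]

# keep the original index so join() matches rows by label
new_data = pd.DataFrame(encoded, columns=columns, index=data.index)
result = data[num_cols].join(new_data)
print(result)
```

Without the `index=data.index` argument, the new frame would get the default index 0, 1, 2, and the row labeled 3 would end up with NaNs after the join.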

Task

Apply one-hot encoding to the dataset.

Load the dataset.
Process the NaNs: drop the rows where Embarked is missing, and replace missing Age values with the mean.
Transform the Cabin data as in the previous chapter (apply label encoding).
Create the variable cat_cols to store the categorical features: Sex, Cabin, and Embarked.
Create a OneHotEncoder and store the transformed data in new_data.
Remove the cat_cols columns from the dataframe and add new_data.
Check a sample of the result.

Section 5. Chapter 2
