Preprocessing Data
One Hot Encoding
One-hot encoding is another preprocessing technique applied before training. You already know label encoding, which replaces each category with an integer:
Embarked | Label
Q | 3
S | 2
S | 2
S | 2
C | 1
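As a quick illustration, the mapping in the table above can be reproduced with a plain dictionary. This is only a sketch: the `label_map` name and the toy `embarked` series are assumptions made for the example, and scikit-learn's LabelEncoder would number the categories from 0 in sorted order rather than use these particular codes.

```python
import pandas as pd

# Toy column copied from the table (not the real dataset)
embarked = pd.Series(["Q", "S", "S", "S", "C"], name="Embarked")

# Hypothetical mapping that reproduces the codes shown in the table
label_map = {"C": 1, "S": 2, "Q": 3}

# Replace every category with its integer code
labels = embarked.map(label_map)
print(labels.tolist())
```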
So that the model works only with 0 and 1 values, a one-hot encoder transforms the column into a matrix:
Embarked | C | S | Q
Q | 0 | 0 | 1
S | 0 | 1 | 0
S | 0 | 1 | 0
S | 0 | 1 | 0
C | 1 | 0 | 0
A 1 means that the passenger's embark_town matches the column name (for example, Queenstown matches Q), and a 0 means that it does not. Instead of storing a single value in the range 0...n-1, we create n columns filled with 0s and 1s.
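To make the mechanics concrete, here is a minimal sketch in plain Python that builds the matrix by hand. The `one_hot` helper is a hypothetical name introduced for this example; it is not part of any library, and the column order is simply the one used in the table.

```python
def one_hot(values, categories):
    """Return one row per value with a 1 in the matching category column."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]

embarked = ["Q", "S", "S", "S", "C"]   # toy column from the table
categories = ["C", "S", "Q"]           # column order used in the table

matrix = one_hot(embarked, categories)
for value, row in zip(embarked, matrix):
    print(value, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.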
One-hot encoding is also useful when a cell contains multiple values. For example, suppose your dataset contains sentences, each labeled with a list of emotions. That format is not convenient to work with, so we transform it:
emotion | anger | joy | love | neutral | sad
sad, neutral | 0 | 0 | 0 | 1 | 1
love | 0 | 0 | 1 | 0 | 0
love, joy | 0 | 1 | 1 | 0 | 0
anger | 1 | 0 | 0 | 0 | 0
neutral | 0 | 0 | 0 | 1 | 0
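One possible way to get this kind of table is the pandas string method `str.get_dummies`, which splits each cell on a separator. This sketch uses pandas rather than OneHotEncoder, and it assumes the labels are stored as strings separated by a comma and a space, as in the table. Note that a row may contain more than one 1 here.

```python
import pandas as pd

# Toy labels copied from the table; each cell may hold several emotions
emotion = pd.Series(["sad, neutral", "love", "love, joy", "anger", "neutral"])

# Split every cell on ", " and create one 0/1 column per distinct emotion
encoded = emotion.str.get_dummies(sep=", ")
print(encoded)
```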
We will use OneHotEncoder to create new features for the categorical columns of our dataset.
Handle NaNs before encoding: older scikit-learn versions raise an error on them, and newer versions treat NaN as a category of its own, which is rarely what you want.
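A minimal sketch of such preprocessing on a toy frame is shown here. The values are made up for the example, and the choice of dropping rows for one column and filling the mean for another is just one reasonable strategy.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are made up for the example)
data = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Drop rows where Embarked is missing
data = data.dropna(subset=["Embarked"])
# Replace missing ages with the mean of the remaining known ages
data["Age"] = data["Age"].fillna(data["Age"].mean())

print(data)
```

Dropping rows leaves a gap in the index, which matters later when the encoded columns are joined back to the dataframe.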
The typical usage of OneHotEncoder looks like this:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# data is loaded already
# num_cols and cat_cols are created already
encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[cat_cols]).toarray()
# keep the original index so that join aligns rows even after dropping NaNs
new_data = pd.DataFrame(encoded, index=data.index)
# join new features to the dataset, but remove categorical features
data = data[num_cols].join(new_data)
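The snippet above leaves the new columns with numeric names. As a hedged sketch of one way to give them readable names, the example below builds labels from the encoder's fitted `categories_` attribute. The toy dataframe, its values, and the `new_cols` name are assumptions for illustration. The index is passed explicitly so that `join` aligns rows even if the index has gaps.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the already-loaded dataset (values are made up)
data = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
}, index=[0, 1, 3, 4])  # gap in the index, as after dropping a row

num_cols = ["Age"]
cat_cols = ["Sex", "Embarked"]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[cat_cols]).toarray()

# Build readable names such as "Sex_female" from the fitted categories
new_cols = [f"{col}_{cat}"
            for col, cats in zip(cat_cols, encoder.categories_)
            for cat in cats]

new_data = pd.DataFrame(encoded, columns=new_cols, index=data.index)
data = data[num_cols].join(new_data)
print(data)
```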
Task
Apply one-hot encoding to the dataset.
- Load the dataset.
- Process the NaNs: drop the rows where Embarked is missing, and replace missing Age values with the mean.
- Transform the Cabin data as in the previous chapter (apply the Label Encoding).
- Create the variable cat_cols to store the categorical features: Sex, Cabin, and Embarked.
- Create a OneHotEncoder and store the transformed data in new_data.
- Remove the cat_cols from the dataframe, but add new_data.
- Check the sample.