Preprocessing Data
Data Binning
Sometimes it makes sense to bin the data into multiple buckets, for example to divide it into classes or to build a histogram. Data binning applies to continuous numerical data.
We won't dive deep into it and will just explore two pandas functions: cut and qcut.
- cut: divides the data into bins of equal width, or into bins whose edges you define yourself.
import pandas as pd

# create 4 equal-width bins
cut_data = pd.cut(data['Age'], bins=4)
# set explicit bin edges: (20, 40] and (40, 60]; values outside them become NaN
cut_data = pd.cut(data['Age'], bins=[20, 40, 60])
cut_data is a pd.Series that holds, for each age value, the interval that value falls into. This is how it looks after binning into 4 categories:
Age | Range |
22.0 | (20.315, 40.21] |
38.0 | (20.315, 40.21] |
2.0 | (0.34, 20.315] |
54.0 | (40.21, 60.105] |
66.0 | (60.105, 80.0] |
You can see that these intervals have equal width: about 20 years each. The column has the category dtype.
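To make the equal-width behavior concrete, here is a minimal sketch on a small made-up sample (the `ages` values are an assumption for illustration, not the Titanic data). It checks the bin widths and the dtype, and shows what happens to values outside explicitly given edges.

```python
import pandas as pd

# Hypothetical sample ages, made up for illustration
ages = pd.Series([2.0, 22.0, 38.0, 54.0, 66.0, 80.0], name="Age")

# Four equal-width bins spanning the observed min..max
cut_data = pd.cut(ages, bins=4)

# Each category is a pd.Interval; .length gives the bin width.
# pandas extends the lowest edge slightly so the minimum is included,
# which makes the first bin marginally wider than the others.
widths = [interval.length for interval in cut_data.cat.categories]

print(cut_data.dtype)
print(widths)

# With explicit edges, values outside (20, 60] get no bin and become NaN
edge_data = pd.cut(ages, bins=[20, 40, 60])
print(edge_data.isna().sum())
```

Printing `cut_data.value_counts(sort=False)` is a quick way to see how many values landed in each interval.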
The distribution of values across the intervals follows the distribution of the Age column, as in the image:
- qcut, or quantile-based cut: divides the data into equal-sized categories, so that each interval contains approximately the same number of entries.
# create 4 bins with (approximately) equal counts
qcut_data = pd.qcut(data['Age'], q=4)
# set explicit quantile edges (three bins)
qcut_data = pd.qcut(data['Age'], q=[0, 1/3, 2/3, 1])
Let's divide the data into 4 categories using qcut. qcut_data looks like the second column:
Age | Range |
22.0 | (21.0, 28.5] |
38.0 | (28.5, 38.0] |
2.0 | (0.419, 21.0] |
54.0 | (38.0, 80.0] |
66.0 | (38.0, 80.0] |
You can see that these intervals have different widths: for example, the last one spans more than 40 years, while the second spans only 7.5 years. This means that the number of people aged between 21 and 28.5 in the 'titanic' dataset is roughly equal to the number of people aged between 38 and 80. Look at the image:
The heights of the bars are almost equal because qcut chooses the bin edges that way. The interval widths, however, are generally not equal (although they can be). That's the main difference between the cut and qcut functions.
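The contrast is easiest to see on skewed data. The sketch below uses a made-up right-skewed sample (the `values` series is an assumption for illustration) and counts how many entries each function puts into each of 4 bins.

```python
import pandas as pd

# Hypothetical right-skewed sample, made up for illustration
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 100])

# Equal-width bins: most values pile up in the first bin
cut_counts = pd.cut(values, bins=4).value_counts(sort=False)

# Quantile-based bins: edges are chosen so the counts come out equal
qcut_counts = pd.qcut(values, q=4).value_counts(sort=False)

print(cut_counts)
print(qcut_counts)
```

With cut the bin widths match but the counts are very uneven; with qcut the counts match but the last interval is much wider than the others.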
Task
Bin the Fare data into three categories: cheap, medium, and expensive.
- Divide it using cut, then check the created intervals and analyze them.
- Then divide it using qcut. Check the intervals again.
Do not forget to confirm that you did the binning correctly for the function you used.
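One possible way to approach the check, sketched on made-up numbers rather than the real Fare column (the `fares` values and the `labels` list are assumptions for illustration). Both cut and qcut accept a `labels` argument that names the bins in order.

```python
import pandas as pd

# Hypothetical fares, made up for illustration (not the Titanic 'Fare' column)
fares = pd.Series(
    [7.25, 8.05, 13.0, 26.0, 30.0, 71.28, 151.55, 263.0, 512.33], name="Fare"
)
labels = ["cheap", "medium", "expensive"]

# Same three labels, two different binning rules
cut_fares = pd.cut(fares, bins=3, labels=labels)
qcut_fares = pd.qcut(fares, q=3, labels=labels)

# Check for cut: without labels, the categories are the intervals themselves,
# and their widths should be (almost) equal
cut_widths = [iv.length for iv in pd.cut(fares, bins=3).cat.categories]

# Check for qcut: the counts per category should be (almost) equal
cut_counts = cut_fares.value_counts(sort=False)
qcut_counts = qcut_fares.value_counts(sort=False)

print(cut_widths)
print(cut_counts)
print(qcut_counts)
```

For cut the thing to verify is the interval widths, and for qcut it is the number of entries per category.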