Data Binning

Sometimes it makes sense to bin data into several buckets, for example to divide it into classes or to build a histogram. Binning is applied to continuous numerical data.

We won't go deep into the topic and will just explore two pandas functions, cut and qcut:

cut: divides the range of values into bins, either a given number of equal-width bins or bins defined by explicit edges.


# create 4 equal-width bins
cut_data = pd.cut(data['Age'], bins=4)
# use explicit bin edges: two bins, (20, 40] and (40, 60]; ages outside them become NaN
cut_data = pd.cut(data['Age'], bins=[20, 40, 60])

cut_data is a pd.Series that holds, for each age value, the interval that value falls into. This is how it looks after binning into 4 categories:

Age	Range
22.0	(20.315, 40.21]
38.0	(20.315, 40.21]
2.0	(0.34, 20.315]
54.0	(40.21, 60.105]
66.0	(60.105, 80.0]

You can see that these intervals are equal-sized, about 20 years each. The column has the category dtype.
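To make the two ways of calling cut concrete, here is a minimal sketch. It assumes a small made-up Series of ages (the variable `ages` is hypothetical and stands in for data['Age']), so the exact interval edges differ from the table above.

```python
import pandas as pd

# Hypothetical sample of ages; it stands in for data['Age'] from the lesson.
ages = pd.Series([2.0, 22.0, 38.0, 54.0, 66.0, 80.0], name="Age")

# An integer asks for that many equal-width bins over the observed min..max range.
equal_width = pd.cut(ages, bins=4)

# A list gives explicit edges: [20, 40, 60] defines two bins, (20, 40] and (40, 60].
# Values that fall outside the edges (here 2, 66 and 80) become NaN.
custom = pd.cut(ages, bins=[20, 40, 60])

print(equal_width.dtype)            # the result has the category dtype
print(equal_width.cat.categories)   # the four intervals that were created
print(custom.isna().sum())          # how many ages fell outside 20..60
```

Note that a list of three edges produces two bins, not three, which is easy to overlook when setting the edges by hand.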

Because the bins have equal width, a bar chart of the binned values has the same shape as a 4-bin histogram of the Age column.

qcut, or quantile-based cut, divides the data into categories of equal size: each interval contains roughly the same number of entries.


# create 4 quantile-based bins (quartiles)
qcut_data = pd.qcut(data['Age'], q=4)
# use explicit quantile edges: three bins split at the 1/3 and 2/3 quantiles
qcut_data = pd.qcut(data['Age'], q=[0, 1/3, 2/3, 1])

Let's divide the data into 4 categories using qcut. The resulting qcut_data looks like the Range column:

Age	Range
22.0	(21.0, 28.5]
38.0	(28.5, 38.0]
2.0	(0.419, 21.0]
54.0	(38.0, 80.0]
66.0	(38.0, 80.0]

You can see that these intervals have different sizes: for example, the last one spans more than 40 years, while the second spans only 7.5 years. It means that the number of people aged between 21 and 28.5 in the 'titanic' dataset is roughly equal to the number of people aged between 38 and 80.

In a plot of the qcut result, the heights of the bins are almost equal because qcut builds the bins that way. The interval widths, however, are generally not equal (although they can be). That's the main difference between the cut and qcut functions.
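The difference can also be checked numerically by counting the entries in each bin. The sketch below uses a small, deliberately skewed sample (the values in `ages` are made up for illustration and are not taken from the Titanic dataset).

```python
import pandas as pd

# Hypothetical skewed sample: many small values and a few large ones.
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 20, 30, 50, 80], name="Age")

by_width = pd.cut(ages, bins=4)   # equal-width intervals
by_count = pd.qcut(ages, q=4)     # quantile-based intervals

# Entries per bin: very uneven for cut, even for qcut.
print(by_width.value_counts(sort=False))
print(by_count.value_counts(sort=False))
```

With this sample, cut puts most of the values into its first interval, while qcut spreads them evenly at the cost of intervals with very different widths.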

Task

Bin the Fare data into three categories: cheap, medium, and expensive.

Divide it using cut, then check and analyze the created intervals.
Then divide it using qcut and check the intervals again.

Do not forget to confirm that you did the binning correctly for the function you used.
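One possible way to approach the task is sketched below. It is not the official solution: the `fares` values are invented stand-ins for data['Fare'], and the use of the labels and retbins parameters is just one convenient choice for naming the categories and inspecting the edges.

```python
import pandas as pd

# Hypothetical fares; in the lesson these would come from data['Fare'].
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 263.0, 512.33], name="Fare")
names = ["cheap", "medium", "expensive"]

# labels= replaces the interval objects with readable category names.
fare_cut = pd.cut(fares, bins=3, labels=names)
fare_qcut = pd.qcut(fares, q=3, labels=names)

# retbins=True also returns the bin edges, which helps when checking the intervals.
_, cut_edges = pd.cut(fares, bins=3, retbins=True)
_, qcut_edges = pd.qcut(fares, q=3, retbins=True)

print(cut_edges)
print(fare_cut.value_counts(sort=False))
print(qcut_edges)
print(fare_qcut.value_counts(sort=False))
```

To confirm the binning, compare the two outputs: the cut edges should be evenly spaced, and the qcut counts should be close to each other.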

