Data Binning | Data Validation
Preprocessing Data

Data Binning

Sometimes it makes sense to bin data into buckets, for example to divide it into classes or to build a histogram. Binning is applied to continuous numerical data.

We won't go deep here; we'll explore just two pandas functions, cut and qcut:

  • cut: divides the data into bins that you define, either as a number of equal-width bins or as explicit bin edges.
    import pandas as pd

    # Create 4 equal-width bins spanning the range of Age
    cut_data = pd.cut(data['Age'], bins=4)
    # Or pass explicit bin edges: (20, 40] and (40, 60]
    cut_data = pd.cut(data['Age'], bins=[20, 40, 60])

cut_data is a pd.Series that holds, for each age, the interval the value falls into. Here is how it looks after binning into 4 categories:

Age     Range
22.0    (20.315, 40.21]
38.0    (20.315, 40.21]
2.0     (0.34, 20.315]
54.0    (40.21, 60.105]
66.0    (60.105, 80.0]

You can see that these intervals have equal width, about 20 years each. (The lowest edge is pushed slightly below the minimum age so that the minimum is included.) The column has the category dtype.
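Both properties can be checked directly. The sketch below is only an illustration: the `ages` values are made up for the example and are not the lesson's dataset.

```python
import pandas as pd

# Hypothetical sample ages (an assumption for illustration only).
ages = pd.Series([22.0, 38.0, 2.0, 54.0, 66.0, 80.0, 0.42], name="Age")

cut_data = pd.cut(ages, bins=4)

# The result has the categorical dtype.
print(cut_data.dtype)  # category

# The categories are Interval objects; compare their widths.
intervals = cut_data.cat.categories
widths = [round(iv.length, 3) for iv in intervals]
print(list(intervals))
print(widths)
```

The first width comes out slightly larger than the others because pandas lowers the first edge a little so the minimum value is not left out.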

Because the bins have equal width, the number of values in each interval follows the distribution of the Age column, just like a histogram with 4 bins, as shown in the image:
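One way to see the link between cut and a histogram is to compare the per-bin counts with `numpy.histogram`. This is a sketch on made-up ages (an assumption, not the lesson's data). The counts agree here because no value sits exactly on an interior bin edge; cut closes intervals on the right while `numpy.histogram` closes them on the left, so edge values could land in different bins.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages (an assumption for illustration only).
ages = pd.Series([2, 5, 22, 25, 30, 38, 54, 66, 80], name="Age")

# Count how many values fall into each of the 4 equal-width bins.
cut_counts = pd.cut(ages, bins=4).value_counts().sort_index()

# A 4-bin histogram over the same range.
hist_counts, edges = np.histogram(ages, bins=4)

print(list(cut_counts))   # [2, 4, 1, 2]
print(list(hist_counts))  # [2, 4, 1, 2]
```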

  • qcut, or quantile-based cut, divides the data into bins of equal size: each interval contains approximately the same number of entries.
    import pandas as pd

    # Create 4 quantile-based bins (quartiles)
    qcut_data = pd.qcut(data['Age'], q=4)
    # Or pass explicit quantiles: three bins split at 1/3 and 2/3
    qcut_data = pd.qcut(data['Age'], q=[0, 1/3, 2/3, 1])

Let's divide the data into 4 categories using qcut. The result, qcut_data, looks like the Range column:

Age     Range
22.0    (21.0, 28.5]
38.0    (28.5, 38.0]
2.0     (0.419, 21.0]
54.0    (38.0, 80.0]
66.0    (38.0, 80.0]

You can see that these intervals have different widths: the last one spans 42 years, while the second spans only 7.5 years. This means that the number of people aged between 21 and 28.5 in the 'titanic' dataset is approximately equal to the number of people aged between 38 and 80. Look at the image:

The heights of the bins are almost equal because qcut chooses the bin edges that way. The interval widths, however, are generally unequal (though they can coincide for evenly spread data). That's the main difference between the cut and qcut functions.
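The difference is easiest to see side by side. The following sketch uses a small, deliberately skewed set of values (made up for this example, not taken from the lesson's data) and bins it with both functions.

```python
import pandas as pd

# Hypothetical skewed sample (an assumption for illustration only).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 10, 20, 40, 80])

cut_bins = pd.cut(values, bins=4)   # equal-width intervals
qcut_bins = pd.qcut(values, q=4)    # equal-count intervals

cut_counts = cut_bins.value_counts().sort_index()
qcut_counts = qcut_bins.value_counts().sort_index()

# Interval widths for each method.
cut_widths = [round(iv.length, 2) for iv in cut_bins.cat.categories]
qcut_widths = [round(iv.length, 2) for iv in qcut_bins.cat.categories]

print(cut_counts)    # most values pile up in the first interval
print(qcut_counts)   # 3 values in every interval
print(cut_widths)
print(qcut_widths)
```

With cut, the widths are (nearly) identical but the counts are very uneven; with qcut, the counts match but the widths differ.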

Task

Bin the Fare data into three categories: cheap, medium, and expensive.

  1. Divide it using cut, then inspect and analyze the resulting intervals.
  2. Then divide it using qcut and inspect the intervals again.

Do not forget to confirm that the binning is correct for the function you used.
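As a hint rather than a solution, here is a sketch of naming the bins and checking the result on a made-up `prices` series (an assumption; it is not the Fare column). Both functions accept a `labels` argument, and `value_counts` shows how many entries landed in each category.

```python
import pandas as pd

# Hypothetical prices (an assumption for illustration only).
prices = pd.Series([5, 7, 8, 10, 12, 15, 30, 60, 100])
names = ["cheap", "medium", "expensive"]

# Equal-width bins: the counts per category may be very uneven.
cut_labels = pd.cut(prices, bins=3, labels=names)
cut_counts = cut_labels.value_counts().sort_index()

# Quantile-based bins: the counts per category should be about equal.
qcut_labels = pd.qcut(prices, q=3, labels=names)
qcut_counts = qcut_labels.value_counts().sort_index()

print(cut_counts)
print(qcut_counts)
```

For cut, the check is that the interval widths match; for qcut, that the counts per category are close to each other.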

Section 3. Chapter 2