Sometimes it makes sense to bin the data into multiple buckets, for example to divide it into classes or to build a histogram. Data binning is applied to *continuous numerical data*.

We won't go deep into the topic here and will explore just two pandas functions, `cut` and `qcut`:

 - **cut**: divides the data into predefined bins, usually of equal width.

```python
# create 4 equal-width bins
cut_data = pd.cut(data['Age'], bins=4)
# use explicitly defined bin edges: (20, 40] and (40, 60]
cut_data = pd.cut(data['Age'], bins=[20, 40, 60])
```

`cut_data` is a `pd.Series` that contains, for each age value, the interval that the value falls into. This is how it looks after binning into 4 categories:

|Age|Range|
|-|-|
|22.0|(20.315, 40.21]|
|38.0|(20.315, 40.21]|
|2.0|(0.34, 20.315]|
|54.0|(40.21, 60.105]|
|66.0|(60.105, 80.0]|

You can see that these intervals have equal width: about 20 years each. The column has the `category` dtype.
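The properties described above can be checked with a minimal sketch. The `ages` Series below is a small hypothetical sample (not the course dataset), chosen so that its minimum and maximum are 0.42 and 80.0.

```python
import pandas as pd

# Hypothetical sample ages (an assumption, not the real dataset)
ages = pd.Series([22.0, 38.0, 2.0, 54.0, 66.0, 0.42, 80.0], name="Age")

# An integer `bins` splits the range [min, max] into equal-width intervals
cut_data = pd.cut(ages, bins=4)
print(cut_data.dtype)  # category

# Interval widths are all close to 20; the first one is slightly wider
# because pandas extends the lowest edge a little to include the minimum
widths = [interval.length for interval in cut_data.cat.categories]
print(widths)

# With explicit edges, values outside the edges become NaN
fixed_cut = pd.cut(ages, bins=[20, 40, 60])
print(fixed_cut.isna().sum())  # 4 of the 7 sample ages fall outside (20, 60]
```

Note that explicit edges do not have to cover the whole range of the data, so it is worth checking for `NaN` values after binning this way.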

The distribution of values across these intervals matches the distribution of the `Age` column, as in the image:

 - **qcut**, or quantile-based cut, divides the data into categories of equal size: each interval contains approximately the same number of entries.

```python
# create 4 quantile-based bins (quartiles)
qcut_data = pd.qcut(data['Age'], q=4)
# use explicitly defined quantiles
qcut_data = pd.qcut(data['Age'], q=[0, 1/3, 2/3, 1])
```

Let's divide the data into 4 categories using `qcut`. The second column shows what `qcut_data` looks like:

|Age|Range|
|-|-|
|22.0|(21.0, 28.5]|
|38.0|(28.5, 38.0]|
|2.0|(0.419, 21.0]|
|54.0|(38.0, 80.0]|
|66.0|(38.0, 80.0]|

You can see that these intervals have different widths: for example, the last one spans more than 40 years, while the second spans only 7.5 years. This means that the number of people aged between 21 and 28.5 in the `titanic` dataset is approximately equal to the number of people aged between 38 and 80. Look at the image:

The heights of the bars are almost equal because `qcut` chooses the bin edges to make them so. The interval widths, however, are generally unequal (although they can happen to be equal). That's the main difference between the `cut` and `qcut` functions.
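To make the difference concrete, here is a small sketch that bins the same values with both functions and counts the entries per bin. The `ages` sample is hypothetical (20 distinct, right-skewed values), not the course dataset.

```python
import pandas as pd

# Hypothetical, right-skewed sample of 20 distinct ages (an assumption)
ages = pd.Series(
    [1, 3, 5, 18, 20, 22, 24, 25, 27, 28,
     30, 31, 33, 35, 40, 45, 52, 60, 71, 80],
    name="Age",
)

# Equal-width bins: the counts per bin follow the shape of the data
cut_counts = pd.cut(ages, bins=4).value_counts().sort_index()

# Quantile-based bins: widths vary, counts per bin are (nearly) equal
qcut_data = pd.qcut(ages, q=4)
qcut_counts = qcut_data.value_counts().sort_index()
qcut_widths = [interval.length for interval in qcut_data.cat.categories]

print(cut_counts.tolist())   # uneven counts
print(qcut_counts.tolist())  # 5 entries in every bin for this sample
print(qcut_widths)           # unequal interval widths
```

On real data with many repeated values, `qcut` counts may be only approximately equal, since identical values always land in the same bin.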

