Data Binning | Data Validation
Preprocessing Data
Data Binning

Sometimes it makes sense to bin data into several buckets, for example to divide it into classes or to build a histogram. Binning is applied to continuous numerical data.

We won't dive deep into the topic; we will explore just two pandas functions, cut and qcut:

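  • cut: divides the data into predefined bins, usually of equal width.

    import pandas as pd

    # data: the DataFrame that holds the 'Age' column
    # Create 4 equal-width bins
    cut_data = pd.cut(data['Age'], bins=4)
    # Use explicit bin edges: (20, 40] and (40, 60]; ages outside them become NaN
    cut_data = pd.cut(data['Age'], bins=[20, 40, 60])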

cut_data is a pd.Series that holds, for each age value, the interval the value falls into. This is how it looks after binning into 4 categories:

Age     Range
22.0    (20.315, 40.21]
38.0    (20.315, 40.21]
2.0     (0.34, 20.315]
54.0    (40.21, 60.105]
66.0    (60.105, 80.0]

You can see that these intervals have equal width: about 20 years each. The column has the category dtype.
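The behavior described above can be checked on a small example. This is a minimal sketch, assuming a handful of made-up ages in place of data['Age']; the variable names ages, widths, and cut_fixed are illustrative and not part of the lesson.

```python
import pandas as pd

# Hypothetical sample ages standing in for data['Age']
ages = pd.Series([22.0, 38.0, 2.0, 54.0, 66.0, 80.0, 0.42], name="Age")

# Four equal-width bins spanning the observed range of the data
cut_data = pd.cut(ages, bins=4)

# The result is categorical; its categories are the intervals
print(cut_data.dtype)
widths = list(cut_data.cat.categories.length)
print(widths)

# Explicit edges: values outside (20, 60] are not assigned to any bin
cut_fixed = pd.cut(ages, bins=[20, 40, 60])
print(cut_fixed.isna().sum())
```

The first interval comes out slightly wider than the others because pandas extends the lowest edge a little so that the minimum value is included. That is why the widths in the table are "about" 20 years rather than exactly equal.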

The number of values in each interval follows the distribution of the Age column, just as the bars of a four-bin histogram of Age would.
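One way to see this is to count the values per interval and compare the counts with a histogram built on the same number of bins. The sketch below uses made-up ages (an assumption, not the lesson's dataset) chosen so that no value sits exactly on an inner bin edge, where the two functions treat boundaries differently.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; none of them lies on an inner bin edge (21, 41, 61)
ages = pd.Series([1.0, 5.0, 10.0, 22.0, 25.0, 30.0, 35.0, 38.0,
                  45.0, 55.0, 62.0, 81.0], name="Age")

# Count how many values fall into each of the 4 equal-width intervals
cut_data = pd.cut(ages, bins=4)
counts_per_interval = cut_data.value_counts().sort_index()
print(counts_per_interval)

# A 4-bin histogram of the same values gives the same counts
hist_counts, hist_edges = np.histogram(ages, bins=4)
print(hist_counts)
```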

  • qcut, or quantile-based cut, divides the data into categories of roughly equal size: each interval contains approximately the same number of entries.

    # Create 4 quantile-based bins (quartiles)
    qcut_data = pd.qcut(data['Age'], q=4)
    # Use explicit quantiles: three bins, each with about a third of the entries
    qcut_data = pd.qcut(data['Age'], q=[0, 1/3, 2/3, 1])

Let's divide the data into 4 categories using qcut. The second column below shows qcut_data:

Age     Range
22.0    (21.0, 28.5]
38.0    (28.5, 38.0]
2.0     (0.419, 21.0]
54.0    (38.0, 80.0]
66.0    (38.0, 80.0]

You can see that these intervals have different widths: for example, the last one spans more than 40 years, while the second spans only 7.5 years. This means that the number of people aged between 21 and 28.5 in the 'titanic' dataset is roughly equal to the number of people aged between 38 and 80.

Plotted as a bar chart, the bins have almost equal heights, because qcut chooses the edges to balance the counts. The interval widths, in contrast, are generally unequal (although they can happen to be equal). That is the main difference between the cut and qcut functions.
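The contrast between the two functions can be sketched side by side. The ages below are made up for illustration (an assumption, not the 'titanic' data); the point is only to compare counts and widths for the same input.

```python
import pandas as pd

# Hypothetical ages standing in for data['Age']
ages = pd.Series([2, 4, 9, 15, 21, 22, 24, 26,
                  28, 30, 33, 38, 45, 54, 66, 80], dtype=float, name="Age")

qcut_data = pd.qcut(ages, q=4)    # edges placed at the quartiles
cut_data = pd.cut(ages, bins=4)   # edges evenly spaced over the range

# qcut balances the counts; cut does not
qcut_counts = qcut_data.value_counts().sort_index()
cut_counts = cut_data.value_counts().sort_index()
print(qcut_counts)
print(cut_counts)

# qcut intervals have different widths
qcut_widths = list(qcut_data.cat.categories.length)
print(qcut_widths)
```

With this sample, qcut places four values in every interval, while cut leaves most of the values in the two lower intervals.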

Task

Bin the Fare data into three categories: cheap, medium, and expensive.

  1. Divide it using cut, then check and analyze the resulting intervals.
  2. Then divide it using qcut and check the intervals again.

Do not forget to verify that the binning is correct for the function you used.
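As a starting point, the two steps might look like the sketch below. It is written under assumptions: the fares are made-up values rather than the real Fare column, and the names fares, labels, cut_edges, and qcut_edges are illustrative. The labels and retbins parameters of cut and qcut are used to name the categories and to return the bin edges for inspection.

```python
import pandas as pd

# Hypothetical fares standing in for data['Fare']
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.3, 71.3, 151.6, 512.3],
                  name="Fare")
labels = ["cheap", "medium", "expensive"]

# Step 1: three equal-width bins; retbins=True also returns the edges
fare_cut, cut_edges = pd.cut(fares, bins=3, labels=labels, retbins=True)
print(cut_edges)
print(fare_cut.value_counts())

# Step 2: three quantile-based bins with about equal counts
fare_qcut, qcut_edges = pd.qcut(fares, q=3, labels=labels, retbins=True)
print(qcut_edges)
print(fare_qcut.value_counts())
```

On skewed data like this sample, cut puts almost every fare into the "cheap" category, while qcut assigns three fares to each category. Comparing the counts and edges in this way is one option for confirming that each function behaved as expected.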

Section 3. Chapter 2