Summary
This chapter explains how to convert categorical variables into binary dummy features using pandas’ get_dummies function.

General domain of usage
Data preprocessing for machine learning

In this video, you will learn how to manage categorical variables in pandas using the Titanic dataset. Discover what categorical variables are and why they matter in data preprocessing. See how the pandas `pd.get_dummies()` function transforms columns like `'Sex'` and `'Embarked'` into dummy variables, making them suitable for analysis and machine learning. Follow along with practical examples as you convert these columns and interpret the resulting data, understanding how each category is represented by a new column with values of `1` or `0`. By the end, you will know how to efficiently handle categorical data using pandas and apply these techniques to your own datasets.

Now you will work with a data set that contains no missing values. The `NaN` values in the `'Age'` column were replaced with the **mean** of the column, and the `NaN` value in the `'Fare'` column was deleted.
So now it's time to learn how to manage categorical variables. A categorical variable takes one of a limited set of categories. For instance, the column `'Sex'` contains `'male'` and `'female'`, and the column `'Embarked'` contains `'Q'`, `'S'`, and `'C'`.
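
To see which categories a column holds, a quick look at its distinct values is often enough. The sketch below uses a tiny hand-made frame called `toy` (its names and values are made up for illustration and are not taken from the Titanic file) together with the standard `unique()` and `nunique()` methods.

```python
import pandas as pd

# Hypothetical mini data set standing in for the Titanic data
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "Sex": ["female", "male", "male", "female", "female"],
    "Embarked": ["S", "C", "S", "Q", "S"],
})

# Distinct categories found in the column, sorted for readability
categories = sorted(toy["Embarked"].unique())
print(categories)

# How many different categories the column has
n_categories = toy["Embarked"].nunique()
print(n_categories)
```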

**What should we do to calculate the number of values in each category or to find out information on them?**

You already know `.loc[]`, `.isin()`, `.between()`, and many other functions, but pandas offers a more convenient way to do this: the `pd.get_dummies()` function. As an example, we will apply it to the column `'Embarked'`. Look at the implementation and the result (we will output the names of 5 random passengers and the new columns that we created).

```python
import pandas as pd
data = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/4bf24830-59ba-4418-969b-aaf8117d522e/titanic3.csv', index_col = 0)
data = pd.get_dummies(data, columns = ['Embarked'])
print(data[['Name', 'Embarked_C', 'Embarked_Q', 'Embarked_S']].sample(5))
```

Because `.sample(5)` picks rows at random, each run prints a different set of **five randomly selected rows**, each showing a passenger's name and the three new columns.



**Explanation:**

The function split the `'Embarked'` column into three columns, `'Embarked_C'`, `'Embarked_Q'`, and `'Embarked_S'`, one for each of the three categories. Each passenger has exactly one category in the original `'Embarked'` column. For each passenger, the function writes `1` in the column that matches their port of embarkation and `0` in the other two, so every row has `1` in exactly one of the new columns.
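
The "exactly one `1` per row" property can be checked on a small example. This is a minimal sketch on a made-up frame `toy` (not the Titanic data). It passes `dtype=int` as an assumption about the desired output: recent pandas versions return `True`/`False` by default, and `dtype=int` forces `1`/`0` as described above.

```python
import pandas as pd

# Hypothetical mini data set; names and ports are illustrative only
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Embarked": ["S", "C", "Q", "S"],
})

# dtype=int asks for 1/0 values instead of the boolean default
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)

dummy_cols = ["Embarked_C", "Embarked_Q", "Embarked_S"]

# Each passenger belongs to exactly one category, so every row sums to 1
row_totals = encoded[dummy_cols].sum(axis=1)
print(row_totals.tolist())
```

Note that the original `'Embarked'` column no longer appears in `encoded`; the three dummy columns take its place.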

```python
pd.get_dummies(data, columns = ['Embarked'])
```
- `pd.get_dummies()` - this function converts **categorical** variables into **dummy** ones (`1` or `0`);
- `data` - the data frame that you want to transform;
- `columns = ['Embarked']` - the columns holding the categorical variables that you want to turn into dummy ones. Note that the column names **must** be passed as a list, even when there is only one.
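
The list requirement also means several columns can be encoded in one call. The sketch below shows this on a made-up frame `toy`, and then passes a bare string to `columns` to illustrate the error pandas is expected to raise in that case. The `dtype=int` argument is an optional choice made here to get `1`/`0` values.

```python
import pandas as pd

# Hypothetical mini data set with two categorical columns
toy = pd.DataFrame({
    "Sex": ["female", "male", "male"],
    "Embarked": ["S", "C", "S"],
})

# Several columns can be encoded at once by listing them
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())

# A bare string instead of a list is rejected
try:
    pd.get_dummies(toy, columns="Embarked")
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)
```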

```python
import contextlib
import io
import unittest

import pandas as pd


def _dynamic_test(test_case, condition, success_msg, failure_msg):
    # Rename the test so the runner reports the message as its title
    if condition:
        test_case._testMethodName = success_msg
        test_case.assertTrue(True, success_msg)
    else:
        test_case._testMethodName = failure_msg
        test_case.fail(failure_msg)


class TestDummyVariables(unittest.TestCase):
    def test_dummies_created_and_sums_correct(self):
        """
        1. Check that 'Sex' column was converted to dummy variables and sums are correct.
        """
        import user_code

        # reference dataset
        url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/4bf24830-59ba-4418-969b-aaf8117d522e/titanic3.csv"
        df_ref = pd.read_csv(url, index_col=0)
        dummies_ref = pd.get_dummies(df_ref, columns=["Sex"])
        ref_sum_male = dummies_ref["Sex_male"].sum()
        ref_sum_female = dummies_ref["Sex_female"].sum()

        # user result
        self.assertTrue(hasattr(user_code, "data"), "Variable 'data' not found.")
        df_user = user_code.data

        # check dummy columns exist and sums match
        condition = (
            "Sex_male" in df_user.columns
            and "Sex_female" in df_user.columns
            and abs(df_user["Sex_male"].sum() - ref_sum_male) < 1e-9
            and abs(df_user["Sex_female"].sum() - ref_sum_female) < 1e-9
        )

        _dynamic_test(
            self,
            condition,
            "The dummy variables 'Sex_male' and 'Sex_female' were created correctly and their sums are accurate.",
            "The dummy variable transformation or the calculated sums are incorrect."
        )


class TestOutput(unittest.TestCase):
    def test_output_print(self):
        """
        2. Check that both sums are printed in the output.
        """
        import user_code

        # redirect_stdout restores sys.stdout even if printing raises
        captured_output = io.StringIO()
        with contextlib.redirect_stdout(captured_output):
            print(user_code.sex_male, user_code.sex_female)

        output_text = captured_output.getvalue().strip()
        # make sure both values appear in output
        parts = output_text.split()
        condition = len(parts) >= 2
        _dynamic_test(
            self,
            condition,
            "The sums of dummy variables are printed correctly.",
            "The output is missing or incorrect. Ensure you print both values: sex_male and sex_female."
        )


if __name__ == "__main__":
    unittest.main()
```

test_code.py
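
The tests above look for a `data` frame with `'Sex_male'` and `'Sex_female'` columns, plus two variables named `sex_male` and `sex_female`. A minimal sketch of code shaped that way follows. It assumes a small made-up frame in place of the Titanic CSV so that it runs offline; in the exercise itself, `data` would presumably be loaded with `pd.read_csv(...)` as shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame (illustrative values only)
data = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "Sex": ["female", "male", "male", "female", "male"],
})

# Replace 'Sex' with the dummy columns 'Sex_female' and 'Sex_male'
data = pd.get_dummies(data, columns=["Sex"])

# Summing a dummy column counts the rows that belong to that category
sex_male = data["Sex_male"].sum()
sex_female = data["Sex_female"].sum()
print(sex_male, sex_female)
```

Summing works whether the dummy columns hold `1`/`0` or `True`/`False`, because `True` counts as `1`.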


Managing Categorical Variables
