**Data augmentation** is an important step in training machine learning models. It increases the size of the training sample by modifying existing data. Generating “synthetic” data in this way is useful when real-world data is difficult to obtain, insufficient, or sensitive.

This method is used when there is not enough data to train a machine learning model. A lack of data means that the dataset may not be representative of the underlying population or phenomenon being studied. The sample should be large enough to provide sufficient statistical power to detect meaningful relationships or differences; the required size depends on factors such as the complexity of the analysis, the variability of the data, and the desired level of precision. Synthetic data can supplement real-world data and provide additional training examples.

The `pandas` library can be used to create synthetic data with a specific structure or format. Here's an example of how to use `pandas` to create a synthetic dataset:


```python
import pandas as pd
import numpy as np

# Create a sample dataset with 10 rows
dataset = pd.DataFrame({'A': np.random.rand(10),
                        'B': np.random.choice(['male', 'female'], 10),
                        'C': np.random.randint(1, 100, 10)})

# Generate synthetic data by appending a random 50% sample of the rows;
# ignore_index=True gives the result a fresh 0..n-1 index
synthetic_data = pd.concat([dataset, dataset.sample(frac=0.5)], ignore_index=True)
print(synthetic_data)
```

We use the `pd.concat()` function to concatenate the original DataFrame with a randomly sampled subset of its rows. Setting the `frac` parameter to 0.5 samples 50% of the rows (5 of 10), and appending them to the original increases the size of the DataFrame by half, from 10 to 15 rows. Note that the appended rows are exact copies of existing rows, so this adds training examples but no new values.
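Because resampling only repeats rows that already exist, a common variation is to perturb the sampled rows slightly so that the added examples are not exact copies. The sketch below is one possible way to do this, not a method prescribed by this lesson: the helper `augment_with_noise` and its parameters `noise_scale` and `seed` are hypothetical names introduced for illustration, and it assumes that small Gaussian noise on the numeric column `A` is acceptable for the task at hand.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible
rng = np.random.default_rng(42)

# Same structure as the dataset in the example above
dataset = pd.DataFrame({
    'A': rng.random(10),
    'B': rng.choice(['male', 'female'], 10),
    'C': rng.integers(1, 100, 10),
})


def augment_with_noise(df, numeric_cols, frac=0.5, noise_scale=0.05, seed=0):
    """Hypothetical helper: append a jittered random sample of df to df.

    Each column in numeric_cols receives Gaussian noise whose standard
    deviation is noise_scale times that column's standard deviation.
    Other columns are copied unchanged.
    """
    sampled = df.sample(frac=frac, random_state=seed).copy()
    noise_rng = np.random.default_rng(seed)
    for col in numeric_cols:
        std = df[col].std()
        noise = noise_rng.normal(0.0, noise_scale * std, size=len(sampled))
        sampled[col] = sampled[col] + noise
    # ignore_index=True renumbers the rows so index labels stay unique
    return pd.concat([df, sampled], ignore_index=True)


augmented = augment_with_noise(dataset, numeric_cols=['A'])
print(augmented.shape)  # (15, 3): 10 original rows plus 5 jittered rows
```

Only the float column `A` is perturbed here. Adding noise to an integer column such as `C` would turn it into floats, and categorical columns such as `B` need a different strategy, so which columns to jitter is a choice that depends on the data.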


