Learn Oversampling Techniques | Sampling Techniques for Large Data

Oversampling addresses imbalanced datasets, in which one class (the minority class) has significantly fewer samples than the others. Increasing the representation of the minority class helps machine learning models learn from all classes more effectively, which often improves predictive performance on the minority class and can lead to fairer outcomes. Its main benefit is a more balanced class distribution, which reduces a model's bias toward the majority class. Oversampling also has pitfalls. If you simply duplicate existing samples, your model may overfit: it learns patterns that are too specific to the duplicated rows and fails to generalize to new data. Oversampling also enlarges the dataset, which can lengthen training and increase computational cost.


import pandas as pd

# Create a sample DataFrame with an imbalanced target
data = {
    "feature1": [1, 2, 3, 4, 5, 6, 7],
    "target":   ["A", "A", "A", "A", "B", "B", "B"]
}
df = pd.DataFrame(data)

# Count original class distribution
print("Original class distribution:")
print(df["target"].value_counts())

# Oversample minority class "B" to match majority class "A"
majority_count = df["target"].value_counts().max()
minority_class = df["target"].value_counts().idxmin()

# Get all minority class rows
minority_rows = df[df["target"] == minority_class]

# Calculate how many samples to add
samples_to_add = majority_count - len(minority_rows)

# Sample with replacement from minority class
oversampled_minority = minority_rows.sample(n=samples_to_add, replace=True, random_state=42)

# Concatenate original data with new samples
df_oversampled = pd.concat([df, oversampled_minority], ignore_index=True)

# Show new class distribution
print("\nClass distribution after oversampling:")
print(df_oversampled["target"].value_counts())
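
The example above balances a single minority class. As a minimal sketch, the same idea can be wrapped in a helper that raises every smaller class to the size of the largest one and then counts how many rows are exact copies, which makes the duplication risk described earlier visible. The function name `oversample_to_majority` and the three-class data are hypothetical, introduced only for illustration.

```python
import pandas as pd


def oversample_to_majority(df, target_col, random_state=42):
    """Randomly oversample every class up to the size of the largest class.

    Hypothetical helper for illustration; not part of pandas.
    """
    counts = df[target_col].value_counts()
    majority_count = counts.max()

    parts = [df]
    for label, count in counts.items():
        shortfall = majority_count - count
        if shortfall > 0:
            # Draw the missing rows with replacement from this class only
            extra = df[df[target_col] == label].sample(
                n=shortfall, replace=True, random_state=random_state
            )
            parts.append(extra)

    return pd.concat(parts, ignore_index=True)


# Hypothetical three-class data: 5 x "A", 2 x "B", 1 x "C"
example = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 6, 7, 8],
    "target": ["A", "A", "A", "A", "A", "B", "B", "C"],
})

balanced = oversample_to_majority(example, "target")
print(balanced["target"].value_counts())

# Every added row is an exact copy of an existing row
n_duplicates = int(balanced.duplicated().sum())
print("Exact duplicate rows:", n_duplicates)
```

Here each class ends up with 5 rows, and 7 of the 15 rows are exact duplicates. A model can memorize such copies, so a high duplicate share is a signal to watch for overfitting.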

Section 2. Chapter 2