Undersampling Techniques
When dealing with large datasets that are imbalanced, you often encounter situations where one class (the majority class) greatly outnumbers another (the minority class). This imbalance can make it difficult for models to learn meaningful patterns about the minority class, leading to poor predictive performance. Undersampling is a technique used to address this issue by reducing the number of samples in the majority class so that the dataset becomes more balanced.
Consider undersampling when the majority class dominates to the point that the model effectively ignores the minority class, or when the dataset is too large for practical processing. It works best on very large datasets, where discarding some majority-class samples loses little information. It is less suitable when the dataset is already small, or when the majority class contains rare but important examples that random sampling might throw away.
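Before undersampling, it can help to quantify how severe the imbalance actually is, for example as the ratio of the largest to the smallest class count. A minimal sketch (the variable names are illustrative, not part of any library API):

```python
import pandas as pd

# Same 16-vs-4 split used in the example below
df = pd.DataFrame({"class": ["majority"] * 16 + ["minority"] * 4})

counts = df["class"].value_counts()
imbalance_ratio = counts.max() / counts.min()  # 16 / 4 = 4.0
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio near 1 means the classes are already balanced and undersampling is unnecessary; the higher the ratio, the more a naive model gains by simply predicting the majority class.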
```python
import pandas as pd

# Create a sample imbalanced dataset
data = {
    "feature": range(20),
    "class": ["majority"] * 16 + ["minority"] * 4
}
df = pd.DataFrame(data)

# Count the number of samples in each class
class_counts = df["class"].value_counts()
minority_count = class_counts["minority"]

# Randomly sample from the majority class to match the minority class count
majority_sample = df[df["class"] == "majority"].sample(n=minority_count, random_state=42)
minority_sample = df[df["class"] == "minority"]

# Combine samples to get a balanced dataset
balanced_df = pd.concat([majority_sample, minority_sample])

print("Original class distribution:")
print(df["class"].value_counts())
print("\nBalanced class distribution after undersampling:")
print(balanced_df["class"].value_counts())
```
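The same idea generalizes beyond two classes: downsample every class to the size of the smallest one. A minimal sketch (the `undersample` helper and its signature are illustrative, not from a library):

```python
import pandas as pd

def undersample(df, label_col, random_state=42):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

# Three classes with 10, 5, and 3 samples each
df = pd.DataFrame({
    "feature": range(18),
    "class": ["a"] * 10 + ["b"] * 5 + ["c"] * 3,
})

balanced = undersample(df, "class")
print(balanced["class"].value_counts())  # each class reduced to 3 samples
```

Fixing `random_state` keeps the sampling reproducible across runs, which matters when you want to compare models trained on the same balanced subset.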