Learn Undersampling Techniques | Sampling Techniques for Large Data
Large Data Handling

Undersampling Techniques

When dealing with large datasets that are imbalanced, you often encounter situations where one class (the majority class) greatly outnumbers another (the minority class). This imbalance can make it difficult for models to learn meaningful patterns about the minority class, leading to poor predictive performance. Undersampling is a technique used to address this issue by reducing the number of samples in the majority class so that the dataset becomes more balanced.

Consider undersampling when your dataset is too large to process practically or when the majority class dominates so heavily that the model ignores the minority class. It works best when the dataset is very large and you can discard some majority class samples without losing important information. It is less suitable when the dataset is already small or when the majority class contains rare but important examples, because random sampling may drop exactly those examples.
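Before undersampling, it can help to quantify how much data balancing would throw away. The following is a minimal sketch, not part of the original lesson: the helper name `undersampling_cost` and the keys it returns are hypothetical, and it assumes the labels are held in a pandas Series.

```python
import pandas as pd


def undersampling_cost(labels: pd.Series) -> dict:
    """Summarize what fully balancing by undersampling would discard.

    Hypothetical helper for illustration; assumes every class is cut
    down to the size of the smallest class.
    """
    counts = labels.value_counts()
    minority_count = int(counts.min())
    total = int(counts.sum())

    # After full balancing, each class keeps exactly minority_count rows.
    rows_kept = minority_count * len(counts)
    rows_dropped = total - rows_kept

    return {
        "imbalance_ratio": int(counts.max()) / minority_count,
        "rows_kept": rows_kept,
        "rows_dropped": rows_dropped,
        "fraction_dropped": rows_dropped / total,
    }


# Toy labels: 16 majority samples and 4 minority samples.
labels = pd.Series(["majority"] * 16 + ["minority"] * 4)
summary = undersampling_cost(labels)
print(summary)
```

For these toy labels the summary reports an imbalance ratio of 4.0, with 12 of the 20 rows discarded. Whether a loss of that size is acceptable depends on the dataset; the sketch only makes the trade-off visible.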

```python
import pandas as pd

# Create a sample imbalanced dataset
data = {
    "feature": range(20),
    "class": ["majority"] * 16 + ["minority"] * 4
}
df = pd.DataFrame(data)

# Count the number of samples in each class
class_counts = df["class"].value_counts()
minority_count = class_counts["minority"]

# Randomly sample from the majority class to match the minority class count
majority_sample = df[df["class"] == "majority"].sample(n=minority_count, random_state=42)
minority_sample = df[df["class"] == "minority"]

# Combine samples to get a balanced dataset
balanced_df = pd.concat([majority_sample, minority_sample])

print("Original class distribution:")
print(df["class"].value_counts())
print("\nBalanced class distribution after undersampling:")
print(balanced_df["class"].value_counts())
```
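The same idea can be wrapped in a reusable function that keeps more than one majority sample per minority sample, which discards less data than a full 1:1 balance. This is a sketch under stated assumptions rather than the lesson's own method: the function name `undersample` and its `ratio` parameter are hypothetical, and the smallest class is treated as the minority.

```python
import pandas as pd


def undersample(df, label_col, ratio=1.0, random_state=None):
    """Randomly undersample every class larger than ratio * minority size.

    Hypothetical helper: ratio=1.0 gives a fully balanced dataset,
    ratio=2.0 keeps up to two rows per minority row in each larger class.
    """
    if ratio < 1:
        raise ValueError("ratio must be at least 1")

    counts = df[label_col].value_counts()
    target = int(round(int(counts.min()) * ratio))

    parts = []
    for _, group in df.groupby(label_col):
        if len(group) <= target:
            # Small enough already (includes the minority class): keep all rows.
            parts.append(group)
        else:
            # Draw a random subset without replacement.
            parts.append(group.sample(n=target, random_state=random_state))

    # Shuffle so the classes are not stored in contiguous blocks.
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


df = pd.DataFrame({
    "feature": range(20),
    "class": ["majority"] * 16 + ["minority"] * 4,
})
reduced = undersample(df, "class", ratio=2.0, random_state=42)
print(reduced["class"].value_counts())
```

With `ratio=2.0` the toy dataset keeps 8 majority rows and all 4 minority rows. Shuffling the result is a design choice that avoids feeding a model one class at a time if the rows are later consumed in order.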
What is a potential risk of undersampling?
