Learn Oversampling Techniques | Sampling Techniques for Large Data

Oversampling addresses imbalanced datasets, in which one class (the minority class) has significantly fewer samples than the others. Increasing the representation of the minority class balances the class distribution, so algorithms are less likely to be biased toward the majority class. This helps machine learning models learn from all classes more effectively, which often results in better predictive performance and fairer outcomes. Oversampling also has pitfalls. If you simply duplicate existing samples, your model may overfit: it learns patterns that are too specific to the repeated rows and fails to generalize to new data. Oversampling also increases the size of your dataset, which may lead to longer training times and higher computational demands.


import pandas as pd

# Create a sample DataFrame with an imbalanced target
data = {
    "feature1": [1, 2, 3, 4, 5, 6, 7],
    "target":   ["A", "A", "A", "A", "B", "B", "B"]
}
df = pd.DataFrame(data)

# Count original class distribution
print("Original class distribution:")
print(df["target"].value_counts())

# Oversample minority class "B" to match majority class "A"
majority_count = df["target"].value_counts().max()
minority_class = df["target"].value_counts().idxmin()

# Get all minority class rows
minority_rows = df[df["target"] == minority_class]

# Calculate how many samples to add
samples_to_add = majority_count - len(minority_rows)

# Sample with replacement from minority class
oversampled_minority = minority_rows.sample(n=samples_to_add, replace=True, random_state=42)

# Concatenate original data with new samples
df_oversampled = pd.concat([df, oversampled_minority], ignore_index=True)

# Show new class distribution
print("\nClass distribution after oversampling:")
print(df_oversampled["target"].value_counts())
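
The same idea extends to datasets with more than two classes. The sketch below is one possible generalization, not part of the original lesson: the helper name `oversample_to_majority`, its parameters, and the `df_multi` sample data are assumptions introduced for illustration. It tops up every smaller class to the size of the largest one, then counts exact duplicate rows as a rough indicator of the overfitting risk described above.

```python
import pandas as pd


def oversample_to_majority(df, target_col, random_state=42):
    """Hypothetical helper: randomly duplicate rows of each smaller class
    until every class has as many rows as the largest class."""
    counts = df[target_col].value_counts()
    majority_count = counts.max()

    parts = [df]  # keep all original rows
    for label, count in counts.items():
        deficit = majority_count - count
        if deficit > 0:
            class_rows = df[df[target_col] == label]
            # Sample with replacement so a class can grow past its original size
            parts.append(
                class_rows.sample(n=deficit, replace=True, random_state=random_state)
            )
    return pd.concat(parts, ignore_index=True)


# Illustrative data (assumed): three classes with 5, 2, and 1 rows
df_multi = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 6, 7, 8],
    "target": ["A", "A", "A", "A", "A", "B", "B", "C"],
})

balanced = oversample_to_majority(df_multi, "target")
print(balanced["target"].value_counts())

# Rows that exactly repeat an earlier row; every added row is such a copy
print("Duplicated rows:", balanced.duplicated().sum())
```

Here every class ends up with 5 rows, and all 7 added rows are exact copies of existing ones. A high share of duplicates like this is a hint that plain random oversampling may encourage overfitting on the minority classes.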

Section 2. Chapter 2
