Feature Transformation and Extraction
Many real-world datasets contain features with skewed distributions, which can reduce the effectiveness of machine learning models. You can apply mathematical transformations to reduce skewness and improve data quality. Two common methods are:
- Logarithmic transformation: reduces strong positive skewness by applying
log(x); - Square root transformation: moderates smaller degrees of skewness using
sqrt(x).
These methods help make feature distributions more normal-like and enhance model performance.
123456789101112131415161718192021222324252627282930import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # Load the Titanic dataset df = sns.load_dataset('titanic') fare = df['fare'] # Apply log transformation (add 1 to handle zeros) fare_log = np.log(fare + 1) # Create side-by-side histogram comparison fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Original fare axes[0].hist(fare, bins=50, color='skyblue', edgecolor='black', alpha=0.7) axes[0].set_xlabel('Fare ($)', fontsize=12) axes[0].set_ylabel('Frequency', fontsize=12) axes[0].set_title('Original Fare Distribution', fontsize=14, fontweight='bold') axes[0].grid(True, alpha=0.3) # Log-transformed fare axes[1].hist(fare_log, bins=50, color='lightcoral', edgecolor='black', alpha=0.7) axes[1].set_xlabel('Log(Fare + 1)', fontsize=12) axes[1].set_ylabel('Frequency', fontsize=12) axes[1].set_title('Log-Transformed Fare Distribution', fontsize=14, fontweight='bold') axes[1].grid(True, alpha=0.3) plt.tight_layout()
Feature extraction is the process of creating new features from raw data to improve the performance of machine learning models.
It helps by making important information more explicit, reducing noise, and sometimes lowering the dimensionality of the data. Effective feature extraction can lead to better predictions and more interpretable models.
1234567891011import seaborn as sns import pandas as pd # Load the Titanic dataset df = sns.load_dataset('titanic') # Create a new feature: family_size = sibsp + parch + 1 df['family_size'] = df['sibsp'] + df['parch'] + 1 # Show the first few rows with the new feature print(df[['sibsp', 'parch', 'family_size']].head())
Thanks for your feedback!
Ask AI
Ask AI
Ask anything or try one of the suggested questions to begin our chat
Awesome!
Completion rate improved to 8.33
Feature Transformation and Extraction
Swipe to show menu
Many real-world datasets contain features with skewed distributions, which can reduce the effectiveness of machine learning models. You can apply mathematical transformations to reduce skewness and improve data quality. Two common methods are:
- Logarithmic transformation: reduces strong positive skewness by applying
log(x); - Square root transformation: moderates smaller degrees of skewness using
sqrt(x).
These methods help make feature distributions more normal-like and enhance model performance.
123456789101112131415161718192021222324252627282930import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # Load the Titanic dataset df = sns.load_dataset('titanic') fare = df['fare'] # Apply log transformation (add 1 to handle zeros) fare_log = np.log(fare + 1) # Create side-by-side histogram comparison fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Original fare axes[0].hist(fare, bins=50, color='skyblue', edgecolor='black', alpha=0.7) axes[0].set_xlabel('Fare ($)', fontsize=12) axes[0].set_ylabel('Frequency', fontsize=12) axes[0].set_title('Original Fare Distribution', fontsize=14, fontweight='bold') axes[0].grid(True, alpha=0.3) # Log-transformed fare axes[1].hist(fare_log, bins=50, color='lightcoral', edgecolor='black', alpha=0.7) axes[1].set_xlabel('Log(Fare + 1)', fontsize=12) axes[1].set_ylabel('Frequency', fontsize=12) axes[1].set_title('Log-Transformed Fare Distribution', fontsize=14, fontweight='bold') axes[1].grid(True, alpha=0.3) plt.tight_layout()
Feature extraction is the process of creating new features from raw data to improve the performance of machine learning models.
It helps by making important information more explicit, reducing noise, and sometimes lowering the dimensionality of the data. Effective feature extraction can lead to better predictions and more interpretable models.
1234567891011import seaborn as sns import pandas as pd # Load the Titanic dataset df = sns.load_dataset('titanic') # Create a new feature: family_size = sibsp + parch + 1 df['family_size'] = df['sibsp'] + df['parch'] + 1 # Show the first few rows with the new feature print(df[['sibsp', 'parch', 'family_size']].head())
Thanks for your feedback!