
Feature Transformation and Extraction

Many real-world datasets contain features with skewed distributions, which can reduce the effectiveness of machine learning models. You can apply mathematical transformations to reduce skewness and improve data quality. Two common methods are:

  • Logarithmic transformation: reduces strong positive skewness by applying log(x);
  • Square root transformation: moderates smaller degrees of skewness using sqrt(x).

These methods help make feature distributions more normal-like and enhance model performance.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')
fare = df['fare']

# Apply log transformation (add 1 to handle zeros)
fare_log = np.log(fare + 1)

# Create side-by-side histogram comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original fare
axes[0].hist(fare, bins=50, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Fare ($)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Original Fare Distribution', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Log-transformed fare
axes[1].hist(fare_log, bins=50, color='lightcoral', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Log(Fare + 1)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Log-Transformed Fare Distribution', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
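
The square root transformation works the same way, and you can quantify how much each transformation reduces skewness. The sketch below is an illustrative addition using the same Titanic fare column; it compares skewness before and after each transformation with pandas' skew() method (values closer to 0 indicate a more symmetric distribution).

import numpy as np
import seaborn as sns

# Load the Titanic dataset and take the fare column
df = sns.load_dataset('titanic')
fare = df['fare']

# Apply both transformations (add 1 before the log to handle zero fares)
fare_log = np.log(fare + 1)
fare_sqrt = np.sqrt(fare)

# Compare skewness: closer to 0 means a more symmetric distribution
print(f"Original skewness:         {fare.skew():.2f}")
print(f"Log-transformed skewness:  {fare_log.skew():.2f}")
print(f"Sqrt-transformed skewness: {fare_sqrt.skew():.2f}")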
Definition

Feature extraction is the process of creating new features from raw data to improve the performance of machine learning models.

It helps by making important information more explicit, reducing noise, and sometimes lowering the dimensionality of the data. Effective feature extraction can lead to better predictions and more interpretable models.

import seaborn as sns
import pandas as pd

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Create a new feature: family_size = sibsp + parch + 1
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Show the first few rows with the new feature
print(df[['sibsp', 'parch', 'family_size']].head())
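
You can derive more than one feature from the same columns. As an illustrative sketch (the fare_per_person name is an example introduced here, not a column in the original dataset), you can combine the family size with the fare to estimate how much was paid per person:

import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Reuse the family_size idea and derive an illustrative fare_per_person feature
df['family_size'] = df['sibsp'] + df['parch'] + 1
df['fare_per_person'] = df['fare'] / df['family_size']

# Inspect the new features
print(df[['fare', 'family_size', 'fare_per_person']].head())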
Question

Which transformation would be most appropriate for a variable with strong positive skewness and only positive values?

Select the correct answer
