Feature Selection Basics
Feature selection improves model performance by keeping only the most relevant features, reducing complexity, and helping prevent overfitting caused by irrelevant or redundant data.
Feature selection is the process of choosing a subset of input variables (features) from your data that are most relevant to the predictive modeling problem.
Feature selection methods include manual review and automated techniques. In classification tasks, use statistical tests to score features and select those most strongly related to the target variable.
The most popular feature selection methods fall into three categories:
- Filter methods: Select features based on statistical measures, such as correlation coefficients or univariate tests, independently of any machine learning model;
- Wrapper methods: Use a predictive model to evaluate different combinations of features, such as with recursive feature elimination (RFE), and select the subset that yields the best model performance;
- Embedded methods: Perform feature selection as part of the model training process, like Lasso regularization, which automatically removes less important features by shrinking their coefficients to zero.
Each method balances trade-offs between computational cost, interpretability, and predictive power.
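Before the worked example (which uses a filter method), here is a minimal sketch of what the wrapper and embedded approaches can look like in scikit-learn. It uses the built-in breast cancer dataset purely for illustration, RFE with logistic regression as the wrapper, and L1-penalized logistic regression as a stand-in for Lasso-style embedded selection; the parameter values (n_features_to_select=5, C=0.1) are arbitrary choices for the sketch, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load a built-in dataset and standardize it so both models behave well
data = load_breast_cancer(as_frame=True)
X = pd.DataFrame(StandardScaler().fit_transform(data.data), columns=data.data.columns)
y = data.target

# Wrapper method: RFE repeatedly fits the model and drops the weakest feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print("RFE keeps:", list(X.columns[rfe.support_]))

# Embedded method: L1 regularization (the classification analogue of Lasso)
# shrinks the coefficients of less useful features to exactly zero
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
l1_model.fit(X, y)
print("L1 keeps:", list(X.columns[np.abs(l1_model.coef_[0]) > 1e-6]))
```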
```python
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder

# Load Titanic dataset
train = sns.load_dataset('titanic')

# Select numeric and categorical columns (excluding target)
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
X = train[features].copy()
y = train['survived']

# Encode categorical features
X['sex'] = LabelEncoder().fit_transform(X['sex'].astype(str))
X['embarked'] = LabelEncoder().fit_transform(X['embarked'].astype(str))

# Handle missing values by filling with median (for simplicity)
X = X.fillna(X.median(numeric_only=True))

# Select top 5 features based on ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

selected_features = X.columns[selector.get_support()]
print("Selected features:", list(selected_features))
```
In this example, you use SelectKBest from scikit-learn with the f_classif scoring function to select the five most relevant features (pclass, sex, parch, fare, and embarked) from the Titanic dataset. This method evaluates each feature individually using ANOVA F-values and selects those with the highest scores. It is effective for classification tasks because it focuses on features that best separate the target classes.
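To see why those particular features were kept, you can inspect the per-feature F-scores that the fitted selector stores in its scores_ attribute. The short snippet below assumes the selector and X from the example above are still in memory.

```python
import pandas as pd

# Rank all candidate features by their ANOVA F-score
# (assumes `selector` and `X` from the previous example are already defined and fitted)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
```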
Selecting too many features, especially irrelevant ones, can lead to overfitting, where your model performs well on training data but poorly on new, unseen data. Careful feature selection helps to reduce this risk and leads to more robust models.
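One simple way to check this on the Titanic example is to compare cross-validated accuracy before and after selection. The sketch below assumes X, y, and X_new from the earlier code are available, and uses logistic regression only as a convenient baseline model.

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Compare mean cross-validated accuracy on all features vs. the selected subset
# (assumes X, y, and X_new from the earlier example are available)
model = LogisticRegression(max_iter=1000)
print("All features:     ", cross_val_score(model, X, y, cv=5).mean().round(3))
print("Selected features:", cross_val_score(model, X_new, y, cv=5).mean().round(3))
```

Similar scores with fewer features suggest the dropped columns were adding little beyond noise.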
Feature selection is not only about improving accuracy; it also makes your models faster and easier to interpret. By focusing only on the most important features, you simplify your models and reduce the chance of learning noise from the data.