Feature Selection and Regularization Techniques

Filter Methods: SelectKBest

Filter methods are a family of feature selection techniques that evaluate the relevance of each feature independently of any predictive model. They use statistical measures to score each feature based on its relationship with the target variable. Univariate feature selection is a type of filter method in which each feature is evaluated individually with a univariate statistical test, making it a fast, scalable way to reduce the dimensionality of a dataset before modeling.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
import pandas as pd

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=8, noise=0.2, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)

# Select top 3 features using f_regression (ANOVA F-value)
selector_f = SelectKBest(score_func=f_regression, k=3)
X_new_f = selector_f.fit_transform(X, y)
selected_features_f = [feature for feature, mask in zip(feature_names, selector_f.get_support()) if mask]

# Select top 3 features using mutual_info_regression
selector_mi = SelectKBest(score_func=mutual_info_regression, k=3)
X_new_mi = selector_mi.fit_transform(X, y)
selected_features_mi = [feature for feature, mask in zip(feature_names, selector_mi.get_support()) if mask]

print("Top 3 features by f_regression:", selected_features_f)
print("Top 3 features by mutual_info_regression:", selected_features_mi)
print("f_regression scores:", selector_f.scores_)
print("mutual_info_regression scores:", selector_mi.scores_)

When you use SelectKBest, each feature receives a score based on its statistical relationship with the target variable. For regression, f_regression computes the ANOVA F-value for each feature, measuring linear dependency, while mutual_info_regression estimates mutual information, which captures any dependency, not just linear. Higher scores indicate features that are more relevant for predicting the target. After fitting, you can inspect the .scores_ attribute to see every feature's score and rank them, as sketched below. You then typically keep the top k features with the highest scores, as shown above, and use them for further modeling. This process quickly identifies and retains only the most informative features, reducing noise and improving model efficiency.
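
As a minimal sketch of that inspection step (reusing the selector_f and feature_names objects from the snippet above; the ranking variable name is just illustrative), you can pair every feature with its score and selection flag, then sort:

import pandas as pd

# Pair each feature name with its F-score and selection flag,
# then sort so the highest-scoring features appear first.
ranking = pd.DataFrame({
    "feature": feature_names,
    "f_score": selector_f.scores_,
    "selected": selector_f.get_support(),
}).sort_values("f_score", ascending=False)
print(ranking)

If the selected features will feed a downstream model, a common pattern is to wrap SelectKBest in a scikit-learn Pipeline so that selection is refit on the training portion of each cross-validation split, avoiding leakage from the held-out folds. Below is a sketch on the same synthetic data, with a plain LinearRegression standing in for whatever model you actually use; the step names are arbitrary:

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Selection happens inside the pipeline, so each CV fold
# picks its own top-3 features from the training data only.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=3)),
    ("model", LinearRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Cross-validated R^2:", scores.mean())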
