Filter Methods: SelectKBest
Filter methods are a family of feature selection techniques that evaluate the relevance of each feature independently from the predictive model. These methods use statistical measures to score each feature based on its relationship with the target variable. Univariate feature selection is a type of filter method where each feature is evaluated individually using a univariate statistical test, making it a fast and scalable approach when you need to quickly reduce the dimensionality of your dataset before modeling.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
import pandas as pd

# Generate a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=8, noise=0.2, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)

# Select top 3 features using f_regression (ANOVA F-value)
selector_f = SelectKBest(score_func=f_regression, k=3)
X_new_f = selector_f.fit_transform(X, y)
selected_features_f = [feature for feature, mask in zip(feature_names, selector_f.get_support()) if mask]

# Select top 3 features using mutual_info_regression
selector_mi = SelectKBest(score_func=mutual_info_regression, k=3)
X_new_mi = selector_mi.fit_transform(X, y)
selected_features_mi = [feature for feature, mask in zip(feature_names, selector_mi.get_support()) if mask]

print("Top 3 features by f_regression:", selected_features_f)
print("Top 3 features by mutual_info_regression:", selected_features_mi)
print("f_regression scores:", selector_f.scores_)
print("mutual_info_regression scores:", selector_mi.scores_)
When you use SelectKBest, each feature receives a score based on its statistical relationship with the target variable. For regression, f_regression computes the ANOVA F-value for each feature, measuring linear dependency, while mutual_info_regression estimates the mutual information, capturing any dependency (not just linear). Higher scores indicate features that are more relevant for predicting the target. After fitting, you can inspect the .scores_ attribute to see the ranking of all features. You typically select the top k features with the highest scores, as shown above, and use them for further modeling. This process helps quickly identify and retain only the most informative features, reducing noise and improving model efficiency.
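Two follow-up steps are worth sketching: viewing the full ranking of features rather than just the top k, and wiring the selector into a modeling pipeline so that selection happens inside cross-validation instead of on the full dataset. The snippet below is a minimal sketch that reuses X, y, feature_names, and the fitted selectors from the example above; the LinearRegression model and 5-fold cross-validation are illustrative choices, not part of the original example.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pair each feature with its scores from both tests and sort by F-value,
# so the full ranking (not just the selected top k) is visible at a glance
ranking = pd.DataFrame({
    "feature": feature_names,
    "f_score": selector_f.scores_,
    "mutual_info": selector_mi.scores_,
}).sort_values("f_score", ascending=False)
print(ranking)

# Embed SelectKBest in a Pipeline so the selector is refit on each
# training fold; the model and CV settings here are illustrative
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=3)),
    ("model", LinearRegression()),
])
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print("Mean CV R^2 using the top 3 features:", cv_scores.mean())

Performing selection inside the pipeline matters because scoring features on the full dataset before cross-validation leaks information about the held-out folds into the selection step, which can make validation scores optimistic.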