Feature Selection and Regularization Techniques

Wrapper and Embedded Methods: RFE and SelectFromModel

Understanding feature selection is crucial to building robust and interpretable machine learning models. Two important categories of feature selection techniques are wrapper methods and embedded methods. Wrapper methods, such as Recursive Feature Elimination (RFE), use a predictive model to evaluate combinations of features and select the best subset based on model performance. In contrast, embedded methods incorporate feature selection as part of the model training process itself — SelectFromModel with Lasso regression is a common example. The main difference is that wrapper methods repeatedly train models on different subsets of features, while embedded methods select features based on the internal model attributes, such as coefficients or feature importances, as they are learned.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.preprocessing import StandardScaler
import numpy as np

# --- Load dataset (NumPy for speed) ---
data = fetch_california_housing()
X = data.data  # shape (20640, 8)
y = data.target
feature_names = np.array(data.feature_names)

# --- RFE (fewer refits via step=2) ---
lr = LinearRegression()
rfe = RFE(estimator=lr, n_features_to_select=5, step=2)
rfe.fit(X, y)
rfe_features = feature_names[rfe.support_]

# --- Lasso-based selection (scale first for faster convergence) ---
scaler = StandardScaler()
Xs = scaler.fit_transform(X)
lasso = Lasso(alpha=0.1, random_state=42, max_iter=1000)  # scaled => converges fast
lasso.fit(Xs, y)
sfm = SelectFromModel(lasso, prefit=True)
lasso_features = feature_names[sfm.get_support()]

# --- Compare ---
overlap = set(rfe_features) & set(lasso_features)
print("RFE selected features:", list(rfe_features))
print("SelectFromModel (Lasso) selected features:", list(lasso_features))
print("Overlap between RFE and Lasso-selected features:", list(overlap))
```

Both wrapper and embedded methods have distinct advantages and limitations. Wrapper methods like RFE are often more flexible because they can work with any model and can optimize for the specific predictive task. However, they are computationally expensive, especially with large datasets or many features, since they require fitting the model multiple times. Embedded methods such as SelectFromModel with Lasso are typically faster and scale better because feature selection happens during model training. However, their effectiveness depends on the model's assumptions; for instance, Lasso may arbitrarily select one feature among several highly correlated ones, potentially missing important predictors. As you saw in the code, the features selected by RFE and SelectFromModel with Lasso can overlap, but may also differ due to these underlying mechanisms.
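One practical lever with SelectFromModel is that you can control how many features survive, rather than relying on the default coefficient threshold. As a small sketch on synthetic data (the `make_regression` setup, `alpha=0.5`, and `max_features=3` are illustrative choices, not values from the lesson), `max_features` combined with `threshold=-np.inf` keeps exactly the top-ranked features by absolute coefficient:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 4 of which are informative
X, y = make_regression(n_samples=500, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.5).fit(Xs, y)

# threshold=-np.inf disables the importance cutoff, so max_features
# alone decides: keep the 3 features with the largest |coefficients|.
sfm = SelectFromModel(lasso, prefit=True, max_features=3, threshold=-np.inf)
mask = sfm.get_support()
print("Kept", mask.sum(), "of", len(mask), "features")
```

This is useful when downstream constraints (interpretability, deployment cost) dictate a fixed feature budget, since Lasso's own sparsity level depends on `alpha` and is harder to pin down exactly.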

Note

Multicollinearity — when two or more features are highly correlated — can impact feature selection. In such cases, methods like Lasso may select one correlated feature and ignore others, which can make interpretation tricky and sometimes lead to instability in the selected feature set.
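You can see this instability directly with a small synthetic sketch (the data-generating setup below is illustrative): two nearly identical columns share the same signal, and Lasso typically routes almost all of the weight onto just one of them.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)                   # independent feature
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print("Coefficients:", lasso.coef_)
# Typically one of the first two coefficients ends up (near) zero:
# the shared signal is assigned to a single member of the correlated pair.
```

Which member of the pair survives can flip with small perturbations of the data, which is exactly the interpretation hazard the note describes. Ridge regression, by contrast, would tend to split the weight between the correlated columns.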


Which statements accurately describe the differences between wrapper and embedded feature selection methods, including RFE and SelectFromModel with Lasso?

