Feature Selection and Regularization Techniques

Building Pipelines with Scaling and Feature Selection

Scaling your features before applying feature selection or regularization is crucial for reliable model performance. Many feature selection methods, such as those based on statistical tests or model coefficients, are sensitive to the scale of the data. Similarly, regularization techniques like Ridge and Lasso penalize large weights, so features with larger numeric ranges can dominate the penalty, leading to biased or misleading results. Standardizing your data ensures that each feature contributes equally to the analysis, making the selection and regularization processes fair and effective.
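
To see the scale issue concretely, here is a minimal sketch (an illustration added for this point, not part of the lesson's pipeline) that fits Ridge once on the raw California housing features and once on standardized ones. On raw data, a coefficient's magnitude mostly reflects the feature's units and range, so a feature measured on a large scale can receive a tiny coefficient regardless of how informative it is; after standardization, coefficient magnitudes become comparable across features.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X, y = housing.data, housing.target

# Ridge on raw features: coefficient sizes mostly mirror feature units and ranges
raw_model = Ridge(alpha=1.0).fit(X, y)

# Ridge on standardized features: coefficients are comparable across features
X_scaled = StandardScaler().fit_transform(X)
scaled_model = Ridge(alpha=1.0).fit(X_scaled, y)

for name, raw_c, scaled_c in zip(housing.feature_names, raw_model.coef_, scaled_model.coef_):
    print(f"{name:>12}: raw={raw_c: .4f}  scaled={scaled_c: .4f}")

Because raw coefficients mix units with importance, the Ridge penalty effectively punishes some features harder than others; scaling first removes that distortion. The lesson's full pipeline below chains scaling, selection, and regularization into a single object.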

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Load the California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build the pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=5)),
    ("regressor", Ridge(alpha=1.0))
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Evaluate on the test set
score = pipeline.score(X_test, y_test)
print(f"Test R^2 score: {score:.3f}")

When you run the pipeline above, your data flows through several transformation steps in a specific order. First, the StandardScaler standardizes each feature so they all have mean zero and unit variance. This step is essential because it prevents features with larger scales from dominating the selection or penalization process. Next, SelectKBest applies a univariate statistical test (f_regression) to each scaled feature, selecting only the top five features that have the strongest relationship with the target variable. Finally, the Ridge regressor is trained on this reduced set of scaled features. By combining these steps in a pipeline, you ensure that the same transformations are applied consistently during both training and prediction, reducing the risk of data leakage and improving reproducibility.
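
After fitting, you can also inspect the pipeline to verify which five features SelectKBest kept. The snippet below is a small addition (not in the original lesson code) and assumes the pipeline and housing objects from the example above are already defined and fitted:

# Assumes `pipeline` has been fitted on the training data as shown above
select = pipeline.named_steps["select"]
mask = select.get_support()  # boolean mask over the original features
selected = [name for name, keep in zip(housing.feature_names, mask) if keep]
print("Selected features:", selected)
print("F-scores:", select.scores_.round(1))

Checking the mask this way is a quick sanity check that the selection step behaves as expected before you trust the test score.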


Why is it important to scale your data before applying feature selection or regularization in a pipeline?


