Feature Engineering with PySpark

Scaling and Normalizing Numerical Features


Numeric columns in the flights dataset have very different ranges – DISTANCE spans hundreds to thousands, while DEPARTURE_DELAY is typically under 100. Many ML algorithms are sensitive to scale: a column with large values will dominate the model unless you normalize the features first.
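To make the scale problem concrete, here is a minimal plain-Python sketch (no Spark needed). The three flights and their values are made up for illustration, and the units (minutes for the delay, miles for the distance) are an assumption. It measures Euclidean distance between flights using the unscaled columns:

```python
import math

# Hypothetical flights as (DEPARTURE_DELAY, DISTANCE); values are invented
# for illustration, with assumed units of minutes and miles.
flight_a = (5.0, 500.0)
flight_b = (95.0, 520.0)   # very different delay, similar distance
flight_c = (6.0, 900.0)    # similar delay, very different distance


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


dist_ab = euclidean(flight_a, flight_b)
dist_ac = euclidean(flight_a, flight_c)

# The DISTANCE gap dominates: A looks far closer to B than to C,
# even though A and B differ by 90 in DEPARTURE_DELAY.
print(f"A-B: {dist_ab:.1f}, A-C: {dist_ac:.1f}")
```

Any algorithm that relies on such distances would treat the delay difference as almost irrelevant until the columns are brought to comparable scales.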

StandardScaler

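StandardScaler standardizes each feature to have mean 0 and standard deviation 1 when both withMean=True and withStd=True are set (by default, PySpark only divides by the standard deviation and does not center the data). It requires the input to be a vector column, so you first assemble the numeric columns: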

```python
import urllib.request

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("Scaling") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"])

# Assembling numeric columns into a single vector
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"],
    outputCol="FEATURES_RAW"
)
flights_df = assembler.transform(flights_df)

# Standardizing to mean=0, std=1
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_SCALED",
                        withMean=True, withStd=True)
scaler_model = scaler.fit(flights_df)
flights_df = scaler_model.transform(flights_df)

flights_df.select("FEATURES_RAW", "FEATURES_SCALED").show(3, truncate=False)
```
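The arithmetic behind standardization can be sketched in plain Python for a single column. This is an illustration of the formula (value minus mean, divided by standard deviation), not Spark's implementation. The helper name `standardize` and the sample values are hypothetical, and the sketch uses the sample standard deviation (dividing by n - 1).

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std, divides by n - 1
    if std == 0:
        # A constant column carries no spread; this sketch maps it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical DISTANCE values, chosen only for illustration
distances = [200.0, 400.0, 600.0, 800.0, 1000.0]
scaled = standardize(distances)

print([round(v, 4) for v in scaled])
print(round(statistics.mean(scaled), 10), round(statistics.stdev(scaled), 10))
```

After the transformation the column has mean 0 and standard deviation 1, whatever its original units were.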

MinMaxScaler

MinMaxScaler rescales each feature to a fixed range, by default [0, 1]:

```python
from pyspark.ml.feature import MinMaxScaler

min_max_scaler = MinMaxScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_MINMAX")
flights_df = min_max_scaler.fit(flights_df).transform(flights_df)

flights_df.select("FEATURES_RAW", "FEATURES_MINMAX").show(3, truncate=False)
```
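As a rough illustration of the min-max formula for a single column, the following plain-Python sketch may help. The helper `min_max_scale`, its parameter names, and the sample delays are hypothetical. The handling of a constant column (mapping it to the midpoint of the target range) is this sketch's choice for avoiding a division by zero.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no range to stretch, so use the midpoint of the target range.
        return [0.5 * (new_min + new_max) for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# Hypothetical DEPARTURE_DELAY values, chosen only for illustration
delays = [-5.0, 0.0, 10.0, 45.0]

rescaled = min_max_scale(delays)                 # default target range [0, 1]
symmetric = min_max_scale(delays, -1.0, 1.0)     # custom target range [-1, 1]

print(rescaled)
print(symmetric)
```

The smallest value always lands on the lower bound and the largest on the upper bound; everything else keeps its relative position in between.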

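Use StandardScaler when your algorithm works best with centered features of comparable variance (e.g. regularized linear regression, SVMs); note that standardizing does not make a feature normally distributed. Use MinMaxScaler when you need values bounded in a specific range (e.g. inputs to neural networks).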
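The difference between the two outputs shows up when both formulas are applied to the same column. The sketch below uses plain Python and made-up delay values with one large outlier; it only illustrates the arithmetic and is not Spark code.

```python
import statistics

# Hypothetical delays with one large outlier (values invented for illustration)
delays = [0.0, 2.0, 4.0, 6.0, 8.0, 300.0]

# Standardization: centered on 0, but values are not confined to a fixed interval
mean = statistics.mean(delays)
std = statistics.stdev(delays)
z_scores = [(v - mean) / std for v in delays]

# Min-max scaling: every value lands inside [0, 1]
lo, hi = min(delays), max(delays)
min_max = [(v - lo) / (hi - lo) for v in delays]

print([round(v, 3) for v in z_scores])
print([round(v, 3) for v in min_max])
```

The standardized outlier ends up above 1, while the min-max output stays within [0, 1]. In this example the outlier also squeezes the remaining min-max values close to 0, which is worth checking before relying on a bounded range.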


What does StandardScaler do to each feature?
