Feature Engineering with PySpark

Scaling and Normalizing Numerical Features


Numeric columns in the flights dataset have very different ranges: DISTANCE spans hundreds to thousands, while DEPARTURE_DELAY is typically under 100. Many ML algorithms are sensitive to scale, especially those based on distances or gradient descent: a column with large values will dominate the model unless you bring the features onto a comparable scale first.
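To make the scale problem concrete, here is a minimal sketch in plain Python (no Spark needed). The three flights and the per-column ranges are made-up values chosen for illustration, not rows from the flights dataset. It compares Euclidean distances before and after dividing each column by an assumed range.

```python
import math

# Hypothetical flights as (DEPARTURE_DELAY, DISTANCE); values are invented.
flight_a = (5.0, 500.0)
flight_b = (90.0, 520.0)  # very different delay, similar distance
flight_c = (6.0, 900.0)   # similar delay, very different distance


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# On the raw values, the DISTANCE column decides which flight is "nearest".
raw_ab = euclidean(flight_a, flight_b)
raw_ac = euclidean(flight_a, flight_c)

# Assumed column ranges (hypothetical): delay spans 100, distance spans 4000.
ranges = (100.0, 4000.0)


def rescale(p):
    """Divide each component by its assumed column range."""
    return tuple(value / r for value, r in zip(p, ranges))


scaled_ab = euclidean(rescale(flight_a), rescale(flight_b))
scaled_ac = euclidean(rescale(flight_a), rescale(flight_c))

print(f"raw:    a-b = {raw_ab:.2f}, a-c = {raw_ac:.2f}")
print(f"scaled: a-b = {scaled_ab:.3f}, a-c = {scaled_ac:.3f}")
```

On the raw values, flight b looks closer to flight a even though their delays differ widely, because the gap in DISTANCE outweighs everything else. After rescaling, the ordering flips and the delay difference becomes visible.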

StandardScaler

StandardScaler standardizes each feature to have mean 0 and standard deviation 1. It requires the input to be a vector column, so you first assemble the numeric columns:

```python
import urllib.request

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("Scaling") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"])

# Assembling numeric columns into a single vector
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"],
    outputCol="FEATURES_RAW"
)
flights_df = assembler.transform(flights_df)

# Standardizing to mean=0, std=1
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_SCALED",
                        withMean=True, withStd=True)
scaler_model = scaler.fit(flights_df)
flights_df = scaler_model.transform(flights_df)

flights_df.select("FEATURES_RAW", "FEATURES_SCALED").show(3, truncate=False)
```
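The arithmetic behind standardization can be sketched in plain Python. The column values below are hypothetical, and the snippet only illustrates the formula z = (x - mean) / std; it is not Spark's implementation. As far as the Spark documentation describes it, the scaler uses the corrected sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
import statistics

# Hypothetical DISTANCE values, invented for illustration.
distances = [300.0, 500.0, 700.0, 1100.0, 2400.0]

mean = statistics.mean(distances)
std = statistics.stdev(distances)  # sample standard deviation (n - 1)

# z-score: subtract the column mean, then divide by the column std.
standardized = [(x - mean) / std for x in distances]

print([round(z, 3) for z in standardized])
print(round(statistics.mean(standardized), 6), round(statistics.stdev(standardized), 6))
```

In PySpark, `withMean` defaults to `False`, so the scaler only divides by the standard deviation unless centering is requested explicitly, as the example above does with `withMean=True`.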

MinMaxScaler

MinMaxScaler rescales each feature to a fixed range, by default [0, 1]:

```python
from pyspark.ml.feature import MinMaxScaler

min_max_scaler = MinMaxScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_MINMAX")
flights_df = min_max_scaler.fit(flights_df).transform(flights_df)

flights_df.select("FEATURES_RAW", "FEATURES_MINMAX").show(3, truncate=False)
```
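As a rough illustration of the rescaling formula, the sketch below applies min-max scaling to a single hypothetical column in plain Python. The helper `min_max_scale` and the sample values are assumptions made for this example. The handling of a constant column (mapping it to the midpoint of the target range) follows what the Spark documentation describes for the case where the column's minimum and maximum are equal.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale values linearly so min(values) -> lower and max(values) -> upper."""
    col_min, col_max = min(values), max(values)
    if col_max == col_min:
        # Constant column: no spread to rescale, so use the midpoint of the range.
        return [0.5 * (lower + upper) for _ in values]
    return [
        (x - col_min) / (col_max - col_min) * (upper - lower) + lower
        for x in values
    ]


# Hypothetical DEPARTURE_DELAY values, invented for illustration.
delays = [0.0, 10.0, 25.0, 100.0]

scaled_default = min_max_scale(delays)           # target range [0, 1]
scaled_custom = min_max_scale(delays, -1.0, 1.0)  # target range [-1, 1]
scaled_constant = min_max_scale([7.0, 7.0, 7.0])

print(scaled_default)
print(scaled_custom)
print(scaled_constant)
```

In PySpark, the target range is controlled by the `min` and `max` parameters of `MinMaxScaler`, which default to 0.0 and 1.0.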

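Use StandardScaler when your algorithm works best with centered features of comparable variance (e.g. regularized linear regression, SVMs). Standardization changes the location and spread of a feature, not the shape of its distribution, so it does not make skewed data normal. Use MinMaxScaler when you need values bounded in a specific range (e.g. inputs to neural networks). Keep in mind that the bounds come from the observed minimum and maximum, so a single extreme value can compress the rest of the column.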


What does StandardScaler do to each feature?
