Feature Engineering with PySpark

Scaling and Normalizing Numerical Features


Numeric columns in the flights dataset have very different ranges – DISTANCE spans hundreds to thousands, while DEPARTURE_DELAY is typically under 100. Many ML algorithms are sensitive to scale: a column with large values can dominate distance calculations and gradient updates unless you bring the features onto a comparable scale first.
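
To make the scale problem concrete, here is a minimal plain-Python sketch rather than anything from the flights dataset itself. The three flights and their values are made up for illustration, and Euclidean distance stands in for any distance-based algorithm:

```python
import math

# Hypothetical flights as (DEPARTURE_DELAY, DISTANCE); the values are invented.
flight_a = (5.0, 1000.0)
flight_b = (95.0, 1000.0)   # same distance as A, very different delay
flight_c = (5.0, 1200.0)    # same delay as A, moderately longer distance

def euclidean(p, q):
    """Straight-line distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

dist_ab = euclidean(flight_a, flight_b)  # driven only by the delay gap
dist_ac = euclidean(flight_a, flight_c)  # driven only by the distance gap

print(dist_ab, dist_ac)
```

On the raw values, flight C ends up farther from A than flight B does, even though B's delay is nearly the whole typical delay range away. The larger DISTANCE scale decides the outcome.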

StandardScaler

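StandardScaler standardizes each feature to have mean 0 and standard deviation 1 when both withMean=True and withStd=True are set (by default withMean is False, so only the scaling to unit standard deviation is applied). It requires the input to be a vector column, so you first assemble the numeric columns: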

import urllib.request

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("Scaling") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"])

# Assembling numeric columns into a single vector
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"],
    outputCol="FEATURES_RAW"
)
flights_df = assembler.transform(flights_df)

# Standardizing to mean=0, std=1
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_SCALED", withMean=True, withStd=True)
scaler_model = scaler.fit(flights_df)
flights_df = scaler_model.transform(flights_df)

flights_df.select("FEATURES_RAW", "FEATURES_SCALED").show(3, truncate=False)
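
The arithmetic behind standardization can be sketched in plain Python for a single column. This is an illustration, not Spark's implementation: the helper name standardize and the sample distances are hypothetical. It uses the sample standard deviation (dividing by n - 1), which matches the corrected sample standard deviation that Spark documents for StandardScaler.

```python
import statistics

def standardize(values):
    """Hypothetical helper: subtract the mean, divide by the sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    if std == 0:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Made-up DISTANCE values, not taken from flights.csv.
distances = [200.0, 400.0, 600.0, 800.0, 1000.0]
scaled = standardize(distances)

print(scaled)
print(statistics.mean(scaled), statistics.stdev(scaled))
```

After the transformation the column has mean 0 and standard deviation 1 (up to floating-point error), which is what FEATURES_SCALED holds per feature.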

MinMaxScaler

MinMaxScaler rescales each feature to a fixed range, by default [0, 1]:

from pyspark.ml.feature import MinMaxScaler

min_max_scaler = MinMaxScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_MINMAX")
flights_df = min_max_scaler.fit(flights_df).transform(flights_df)

flights_df.select("FEATURES_RAW", "FEATURES_MINMAX").show(3, truncate=False)
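
The rescaling itself is a linear map from the observed [min, max] of a column onto the target range. A minimal plain-Python sketch for one column follows. The helper name min_max_scale, its lower and upper parameters, and the sample delays are hypothetical. The constant-column branch follows the midpoint rule described in Spark's MinMaxScaler documentation.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Hypothetical helper: map values linearly onto [lower, upper]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: use the midpoint of the target range.
        return [0.5 * (lower + upper) for _ in values]
    return [(v - lo) / (hi - lo) * (upper - lower) + lower for v in values]

# Made-up DEPARTURE_DELAY values, not taken from flights.csv.
delays = [-5.0, 0.0, 10.0, 45.0]
scaled = min_max_scale(delays)

print(scaled)
```

The smallest value maps to the lower bound and the largest to the upper bound, and everything else keeps its relative position in between.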

Use StandardScaler when your algorithm works best with centered features of comparable variance (e.g. regularized linear regression, SVMs). Use MinMaxScaler when you need values bounded in a specific range (e.g. neural networks).


What does StandardScaler do to each feature?
