Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Aprenda Scaling and Normalizing Numerical Features | Section
Feature Engineering with PySpark

Scaling and Normalizing Numerical Features

Deslize para mostrar o menu

Numeric columns in the flights dataset have very different ranges – DISTANCE spans hundreds to thousands, while DEPARTURE_DELAY is typically under 100. Many ML algorithms are sensitive to scale: a column with large values will dominate the model unless you normalize the features first.

StandardScaler

StandardScaler standardizes each feature to have mean 0 and standard deviation 1. It requires the input to be a vector column, so you first assemble the numeric columns:

123456789101112131415161718192021222324252627282930
import urllib.request from pyspark.sql import SparkSession from pyspark.ml.feature import VectorAssembler, StandardScaler urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("Scaling") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"]) # Assembling numeric columns into a single vector assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"], outputCol="FEATURES_RAW" ) flights_df = assembler.transform(flights_df) # Standardizing to mean=0, std=1 scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_SCALED", withMean=True, withStd=True) scaler_model = scaler.fit(flights_df) flights_df = scaler_model.transform(flights_df) flights_df.select("FEATURES_RAW", "FEATURES_SCALED").show(3, truncate=False)

MinMaxScaler

MinMaxScaler rescales each feature to a fixed range, by default [0, 1]:

123456
from pyspark.ml.feature import MinMaxScaler min_max_scaler = MinMaxScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_MINMAX") flights_df = min_max_scaler.fit(flights_df).transform(flights_df) flights_df.select("FEATURES_RAW", "FEATURES_MINMAX").show(3, truncate=False)

Use StandardScaler when your algorithm assumes normally distributed features (e.g. linear regression, SVMs). Use MinMaxScaler when you need values bounded in a specific range (e.g. neural networks).

question mark

What does StandardScaler do to each feature?

Selecione a resposta correta

Tudo estava claro?

Como podemos melhorá-lo?

Obrigado pelo seu feedback!

Seção 1. Capítulo 3

Pergunte à IA

expand

Pergunte à IA

ChatGPT

Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo

Seção 1. Capítulo 3
some-alt