Scaling and Normalizing Numerical Features
Numeric columns in the flights dataset have very different ranges: DISTANCE spans hundreds to thousands, while DEPARTURE_DELAY is typically under 100. Many ML algorithms are sensitive to scale, particularly those based on distances or gradient descent: a column with large values can dominate the model unless you rescale the features first.
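To make the scale problem concrete, here is a minimal sketch in plain Python (no Spark) that compares Euclidean distances between three flights. The flight values are invented for illustration and are not rows from the dataset; `euclidean` is a hypothetical helper.

```python
import math

# Hypothetical flights as (DEPARTURE_DELAY, DISTANCE); values are made up.
flight_a = (5.0, 300.0)
flight_b = (95.0, 320.0)  # very different delay, similar distance
flight_c = (5.0, 900.0)   # identical delay, different distance

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

dist_ab = euclidean(flight_a, flight_b)
dist_ac = euclidean(flight_a, flight_c)

# The DISTANCE difference outweighs the delay difference on raw values.
print(f"a-b: {dist_ab:.1f}, a-c: {dist_ac:.1f}")
```

On raw values, flight_c looks far more different from flight_a than flight_b does, even though flight_b's delay differs by almost the whole typical delay range. A distance-based model would see mostly DISTANCE.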
StandardScaler
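StandardScaler standardizes each feature to have mean 0 and standard deviation 1. In PySpark, centering is off by default (withMean=False), so the example sets withMean=True explicitly. The scaler requires its input to be a vector column, so you first assemble the numeric columns: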
```python
import urllib.request
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("Scaling") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"])

# Assembling numeric columns into a single vector
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE"],
    outputCol="FEATURES_RAW"
)
flights_df = assembler.transform(flights_df)

# Standardizing to mean=0, std=1
scaler = StandardScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_SCALED", withMean=True, withStd=True)
scaler_model = scaler.fit(flights_df)
flights_df = scaler_model.transform(flights_df)

flights_df.select("FEATURES_RAW", "FEATURES_SCALED").show(3, truncate=False)
```
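For intuition, the per-feature arithmetic behind standardization can be sketched in plain Python. This is not Spark's implementation, only the formula z = (x - mean) / std applied to one column. The `standardize` helper and the sample values are assumptions made for illustration. The sketch uses the sample (n - 1) standard deviation, which to my understanding matches what Spark's StandardScaler documents.

```python
import statistics

def standardize(values):
    """Return (x - mean) / std for each x.

    Uses the sample (n - 1) standard deviation and assumes the
    values are not all equal (std would be 0 otherwise).
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(x - mean) / std for x in values]

# Hypothetical DISTANCE values, not taken from the dataset.
distances = [200.0, 400.0, 600.0, 800.0, 1000.0]
scaled = standardize(distances)

print([round(z, 3) for z in scaled])
```

After the transformation the column has mean 0 and sample standard deviation 1, regardless of its original units.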
MinMaxScaler
MinMaxScaler rescales each feature to a fixed range, by default [0, 1]:
```python
from pyspark.ml.feature import MinMaxScaler

min_max_scaler = MinMaxScaler(inputCol="FEATURES_RAW", outputCol="FEATURES_MINMAX")
flights_df = min_max_scaler.fit(flights_df).transform(flights_df)

flights_df.select("FEATURES_RAW", "FEATURES_MINMAX").show(3, truncate=False)
```
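The rescaling itself is a linear map from the observed [min, max] of a column onto the target range. The sketch below shows that arithmetic in plain Python for a single column. `min_max_scale`, its parameter names, and the sample delays are hypothetical. The handling of a constant column (mapping to the midpoint of the target range) is an assumption modeled on Spark's documented behavior.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale, so use the midpoint
        # of the target range (assumption, see lead-in).
        return [0.5 * (new_min + new_max)] * len(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(x - lo) * scale + new_min for x in values]

# Hypothetical DEPARTURE_DELAY values, not taken from the dataset.
delays = [-5.0, 0.0, 10.0, 45.0]

unit = min_max_scale(delays)                    # default range [0, 1]
symmetric = min_max_scale(delays, -1.0, 1.0)    # custom range [-1, 1]

print(unit)
print(symmetric)
```

In PySpark, the target range is controlled by the `min` and `max` parameters of MinMaxScaler, which default to 0.0 and 1.0.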
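Use StandardScaler when your algorithm is sensitive to feature scale or works best with centered features (e.g. regularized linear regression, SVMs); note that standardization changes the scale of a feature, not the shape of its distribution. Use MinMaxScaler when you need values bounded in a specific range (e.g. inputs to neural networks); because its range is set by the observed minimum and maximum, a single extreme value compresses all the others.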
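One way to see the trade-off is to apply both transformations to the same column when it contains one extreme value. The sketch below is plain Python with invented numbers. It illustrates the arithmetic only, not Spark's implementation.

```python
import statistics

# Hypothetical column with one extreme value.
values = [10.0, 20.0, 30.0, 40.0, 1000.0]

# Min-max scaling to [0, 1].
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Standardization with the sample (n - 1) standard deviation.
mean = statistics.mean(values)
std = statistics.stdev(values)
standardized = [(x - mean) / std for x in values]

print([round(v, 3) for v in min_max])
print([round(v, 3) for v in standardized])
```

The min-max output stays inside [0, 1], but the four ordinary values are squeezed close to 0. The standardized output is not bounded, so the extreme value lands above 1.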