Handling Categorical Variables: StringIndexer and OneHotEncoder
Most ML models require numeric input. Columns like AIRLINE and ORIGIN_AIRPORT are strings – you need to convert them before training. PySpark's StringIndexer and OneHotEncoder handle this in two steps.
StringIndexer
StringIndexer maps each unique string to a numeric index. By default, labels are ordered by descending frequency, so the most common value gets index 0:
```python
import urllib.request
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("CategoricalEncoding") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Fitting the indexer on training data and transforming
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
flights_df = indexer.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_IDX").distinct().orderBy("AIRLINE_IDX").show()
```
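To make the frequency ordering concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's implementation: the helper `frequency_index` and the sample airline codes are made up for illustration, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # Indices are stored as floats, mirroring the double-typed column StringIndexer produces.
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample values, not taken from flights.csv
airlines = ["AA", "DL", "AA", "UA", "DL", "AA"]
mapping = frequency_index(airlines)
print(mapping)
```

Here "AA" appears three times, "DL" twice, and "UA" once, so they receive indices 0, 1, and 2 in that order.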
OneHotEncoder
Integer indices imply an ordering that does not exist – the model might interpret airline index 3 as "greater than" airline index 1. OneHotEncoder removes this bias by converting each index into a sparse binary vector:
```python
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCol="AIRLINE_IDX", outputCol="AIRLINE_VEC")
flights_df = encoder.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_IDX", "AIRLINE_VEC").show(5)
```
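The following plain-Python sketch may help show why one-hot vectors avoid the false ordering. It is only an illustration, not Spark's code: `one_hot` and `distance` are hypothetical helpers, and the vectors are dense lists, whereas Spark stores them sparsely. The `drop_last` flag mimics OneHotEncoder's `dropLast` parameter, which defaults to `True` and leaves the last category as an all-zero vector.

```python
import math


def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot vector for an integer category index."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:  # with drop_last, the last category stays all zeros
        vec[index] = 1.0
    return vec


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# As raw indices, categories 1 and 3 look farther apart than categories 0 and 1.
print(abs(3 - 1), abs(1 - 0))

# As full one-hot vectors, every pair of distinct categories is equally far apart.
vectors = [one_hot(i, 4, drop_last=False) for i in range(4)]
print(distance(vectors[1], vectors[3]), distance(vectors[0], vectors[1]))

# With drop_last=True, three categories need only two slots.
for idx in range(3):
    print(idx, one_hot(idx, 3))
```

With `drop_last=True`, the all-zero vector still identifies the last category uniquely, which is why one slot can be dropped without losing information.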
Encoding Multiple Columns at Once
Both transformers also accept inputCols and outputCols, so several columns can be encoded in a single pass:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Drop columns left over from the single-column examples so the output names do not clash
flights_df = flights_df.drop("AIRLINE_IDX", "AIRLINE_VEC")

# Indexing multiple columns in one step
indexer = StringIndexer(
    inputCols=["AIRLINE", "ORIGIN_AIRPORT"],
    outputCols=["AIRLINE_IDX", "ORIGIN_IDX"]
)
flights_df = indexer.fit(flights_df).transform(flights_df)

encoder = OneHotEncoder(
    inputCols=["AIRLINE_IDX", "ORIGIN_IDX"],
    outputCols=["AIRLINE_VEC", "ORIGIN_VEC"]
)
flights_df = encoder.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_VEC", "ORIGIN_AIRPORT", "ORIGIN_VEC").show(5)
```