Handling Categorical Variables: StringIndexer and OneHotEncoder

Most ML models require numeric input. Columns like AIRLINE and ORIGIN_AIRPORT are strings – you need to convert them before training. PySpark's StringIndexer and OneHotEncoder handle this in two steps.

StringIndexer

StringIndexer maps each unique string to an integer index, ordered by frequency – the most common value gets index 0:

import urllib.request
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("CategoricalEncoding") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Fit the indexer on the full DataFrame, then add the AIRLINE_IDX column
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
flights_df = indexer.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_IDX").distinct().orderBy("AIRLINE_IDX").show()
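
To make the ordering rule concrete, here is a minimal pure-Python sketch of frequency-based indexing. It is not Spark's implementation: the helper name `frequency_index` and the sample airline codes are made up for illustration, and the alphabetical tie-break is an assumption based on how the Spark documentation describes the default `frequencyDesc` order.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent first.

    Hypothetical helper for illustration only, not part of PySpark.
    Labels with equal counts are ordered alphabetically.
    """
    counts = Counter(values)
    # Sort by descending count, then alphabetically for equal counts
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer emits doubles, so the sketch uses floats as well
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample: "WN" appears three times, "AA" twice, "DL" once
airlines = ["WN", "AA", "WN", "DL", "AA", "WN"]
mapping = frequency_index(airlines)
print(mapping)  # {'WN': 0.0, 'AA': 1.0, 'DL': 2.0}
```

The most common label ends up at index 0, which is the same pattern you should see in the AIRLINE_IDX column above.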

OneHotEncoder

Integer indices imply an ordering that does not exist – the model might interpret airline index 3 as "greater than" airline index 1. OneHotEncoder removes this bias by converting each index into a sparse binary vector:

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCol="AIRLINE_IDX", outputCol="AIRLINE_VEC")
flights_df = encoder.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_IDX", "AIRLINE_VEC").show(5)
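
The vectors printed by show() can look surprising at first, so the following pure-Python sketch illustrates the idea with dense lists. It is only an illustration under stated assumptions: `one_hot` is a hypothetical helper, not a PySpark function, and it mirrors OneHotEncoder's default `dropLast=True`, where the vector has one slot fewer than the number of categories and the last index is encoded as all zeros. Spark itself stores these vectors in sparse form.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    Hypothetical helper for illustration only, not part of PySpark.
    With drop_last=True the last category maps to an all-zero vector.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    i = int(index)
    # The dropped (last) category has no slot, so nothing is set for it
    if i < size:
        vec[i] = 1.0
    return vec

# Three categories with the default drop_last=True give vectors of length 2
for idx in (0.0, 1.0, 2.0):
    print(idx, one_hot(idx, num_categories=3))
# 0.0 [1.0, 0.0]
# 1.0 [0.0, 1.0]
# 2.0 [0.0, 0.0]
```

No slot is "larger" than another, so the artificial ordering from the integer index disappears. Dropping the last category avoids a redundant column, since its value is implied when all other slots are zero.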

Encoding Multiple Columns at Once

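In Spark 3.0 and later, both StringIndexer and OneHotEncoder accept inputCols and outputCols, so several columns can be encoded in a single pass:
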
from pyspark.ml.feature import StringIndexer, OneHotEncoder

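# Drop the single-column outputs created earlier so the output names are free
# (drop is a no-op for columns that do not exist)
flights_df = flights_df.drop("AIRLINE_IDX", "AIRLINE_VEC")

# Indexing multiple columns in one step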
indexer = StringIndexer(
    inputCols=["AIRLINE", "ORIGIN_AIRPORT"],
    outputCols=["AIRLINE_IDX", "ORIGIN_IDX"]
)
flights_df = indexer.fit(flights_df).transform(flights_df)

encoder = OneHotEncoder(
    inputCols=["AIRLINE_IDX", "ORIGIN_IDX"],
    outputCols=["AIRLINE_VEC", "ORIGIN_VEC"]
)
flights_df = encoder.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_VEC", "ORIGIN_AIRPORT", "ORIGIN_VEC").show(5)
