Feature Engineering with PySpark

Handling Categorical Variables: StringIndexer and OneHotEncoder


Most ML models require numeric input. Columns like AIRLINE and ORIGIN_AIRPORT are strings – you need to convert them before training. PySpark's StringIndexer and OneHotEncoder handle this in two steps.

StringIndexer

StringIndexer maps each unique string to an integer index, ordered by frequency – the most common value gets index 0:

```python
import urllib.request

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("CategoricalEncoding") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Fitting the indexer on training data and transforming
indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX")
flights_df = indexer.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_IDX").distinct().orderBy("AIRLINE_IDX").show()
```
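To make the frequency ordering concrete, here is a minimal plain-Python sketch of the kind of mapping StringIndexer produces with its default ordering. It is an illustration only, not Spark's implementation: the helper name frequency_index and the sample airline codes are hypothetical, and the alphabetical tie-break assumes the behavior documented for Spark 3.0 and later.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer writes its indices as doubles, so floats are used here too.
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical airline codes, not taken from flights.csv.
airlines = ["WN", "DL", "WN", "AA", "DL", "WN"]
mapping = frequency_index(airlines)
print(mapping)  # {'WN': 0.0, 'DL': 1.0, 'AA': 2.0}
```

Because the mapping depends on the counts in the data the indexer was fitted on, fitting on a different subset can assign different indices to the same airline.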

OneHotEncoder

Integer indices imply an ordering that does not exist – the model might interpret airline index 3 as "greater than" airline index 1. OneHotEncoder removes this bias by converting each index into a sparse binary vector:

```python
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCol="AIRLINE_IDX", outputCol="AIRLINE_VEC")
flights_df = encoder.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_IDX", "AIRLINE_VEC").show(5)
```
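The AIRLINE_VEC values print in sparse form, as (size,[indices],[values]). The plain-Python sketch below mimics that layout to show what each index turns into. It is an approximation for illustration, not Spark's code: the helper one_hot is hypothetical, and it assumes OneHotEncoder's default dropLast=True, under which the vector has one slot fewer than the number of categories and the last category is encoded as all zeros.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return (size, indices, values), mirroring how a sparse vector prints."""
    size = num_categories - 1 if drop_last else num_categories
    idx = int(index)
    if not 0 <= idx < num_categories:
        raise ValueError(f"index {index} is outside 0..{num_categories - 1}")
    if idx < size:
        # A single 1.0 at the position of the category index.
        return (size, [idx], [1.0])
    # With drop_last, the final category has no slot and is all zeros.
    return (size, [], [])


# Four hypothetical categories with indices 0.0 to 3.0.
for i in range(4):
    print(i, one_hot(float(i), 4))
```

Dropping the last slot keeps the encoded columns from summing to one for every row, which matters for linear models. Passing dropLast=False to OneHotEncoder keeps a slot for every category instead.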

Encoding Multiple Columns at Once

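In Spark 3.0 and later, both StringIndexer and OneHotEncoder accept inputCols and outputCols, so several columns can be encoded in one pass:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Drop columns created in the earlier steps, if present, so the output names do not collide
flights_df = flights_df.drop("AIRLINE_IDX", "AIRLINE_VEC")

# Indexing multiple columns in one step
indexer = StringIndexer(
    inputCols=["AIRLINE", "ORIGIN_AIRPORT"],
    outputCols=["AIRLINE_IDX", "ORIGIN_IDX"]
)
flights_df = indexer.fit(flights_df).transform(flights_df)

# Encoding both index columns in one step
encoder = OneHotEncoder(
    inputCols=["AIRLINE_IDX", "ORIGIN_IDX"],
    outputCols=["AIRLINE_VEC", "ORIGIN_VEC"]
)
flights_df = encoder.fit(flights_df).transform(flights_df)

flights_df.select("AIRLINE", "AIRLINE_VEC", "ORIGIN_AIRPORT", "ORIGIN_VEC").show(5)
```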

Why use OneHotEncoder after StringIndexer?
