Assembling Features with VectorAssembler
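Almost every MLlib algorithm expects its input features in a single vector column. This chapter names that column FEATURES; MLlib estimators look for a column named features by default, so pass featuresCol="FEATURES" when training on it. VectorAssembler combines multiple numeric and vector columns into one vector, stored as dense or sparse depending on which representation is more compact.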
Basic Usage
    import urllib.request

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, floor
    from pyspark.ml.feature import VectorAssembler

    # Download the flights dataset
    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("VectorAssembler") \
        .master("local[*]") \
        .getOrCreate()

    # Load the CSV and replace nulls in the numeric columns with 0
    flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
        .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

    # Derive two extra numeric features
    flights_df = flights_df \
        .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
        .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

    # Combine the five numeric columns into one vector column
    assembler = VectorAssembler(
        inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"],
        outputCol="FEATURES"
    )

    assembled_df = assembler.transform(flights_df)
    assembled_df.select("DEPARTURE_DELAY", "DISTANCE", "FEATURES").show(5, truncate=False)
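The two derived columns depend on how the dataset encodes time. The plain-Python sketch below illustrates the arithmetic only, under two assumptions the lesson does not state: that SCHEDULED_DEPARTURE is an HHMM integer (for example 1435 for 14:35) and that DAY_OF_WEEK runs from 1 to 7 with 6 and 7 on the weekend. The helper names and sample values are hypothetical.

```python
def departure_hour(scheduled_departure: int) -> int:
    """Mirror floor(col / 100): keep the hour part of an HHMM integer (assumed encoding)."""
    return scheduled_departure // 100


def is_weekend(day_of_week: int) -> int:
    """Mirror (col >= 6).cast("integer"): 1 for days 6 and 7, otherwise 0."""
    return int(day_of_week >= 6)


# Made-up (SCHEDULED_DEPARTURE, DAY_OF_WEEK) pairs, not taken from flights.csv
samples = [(5, 1), (1435, 5), (2359, 6), (900, 7)]
derived = [(departure_hour(t), is_weekend(d)) for t, d in samples]

for (t, d), (hour, weekend) in zip(samples, derived):
    print(f"SCHEDULED_DEPARTURE={t:>4} DAY_OF_WEEK={d} -> hour={hour:>2} weekend={weekend}")
```

If the real columns use a different encoding, the two withColumn expressions would need to change accordingly.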
Handling Nulls in VectorAssembler
By default, VectorAssembler raises an error when it encounters a null or NaN value in any input column. You can change this behavior with the handleInvalid parameter:
    assembler = VectorAssembler(
        inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"],
        outputCol="FEATURES",
        handleInvalid="skip"  # Options: "error" (default), "skip", "keep"
    )
"error"– raises an exception on null or NaN values;"skip"– drops rows with invalid values;"keep"– replaces invalid values with 0 in the output vector.
Combining Scalar and Vector Inputs
VectorAssembler can mix scalar columns and existing vector columns in a single step:
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    # Adding an encoded airline vector
    flights_df = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX").fit(flights_df).transform(flights_df)
    flights_df = OneHotEncoder(inputCol="AIRLINE_IDX", outputCol="AIRLINE_VEC").fit(flights_df).transform(flights_df)

    # Combining scalar columns and the airline vector
    assembler = VectorAssembler(
        inputCols=["DEPARTURE_DELAY", "DISTANCE", "DEPARTURE_HOUR", "AIRLINE_VEC"],
        outputCol="FEATURES"
    )

    assembled_df = assembler.transform(flights_df)
    assembled_df.select("FEATURES").show(3, truncate=False)
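The assembled vector is the concatenation of the inputs in inputCols order, so its width is the number of scalar columns plus the size of each vector column. The plain-Python sketch below models that layout. It assumes OneHotEncoder's default dropLast=True, which encodes n categories in n - 1 slots; one_hot and assemble are hypothetical helpers, and the row values and the count of four airlines are made up.

```python
def one_hot(index, num_categories, drop_last=True):
    """Model a one-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate scalars and vectors, in order, into one flat vector."""
    flat = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            flat.extend(float(x) for x in part)
        else:
            flat.append(float(part))
    return flat


# Made-up row: DEPARTURE_DELAY, DISTANCE, DEPARTURE_HOUR, then airline index 1 of 4
airline_vec = one_hot(index=1, num_categories=4)
features = assemble(-3.0, 1448.0, 14, airline_vec)
print(features)       # three scalars followed by three one-hot slots
print(len(features))
```

In Spark itself, show() may print such a vector in sparse form (size, indices, values) when most entries are zero, but the logical layout is the same.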