Assembling Features with VectorAssembler
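Almost every MLlib algorithm expects its input features in a single vector column. Estimators look for a column named features by default; this chapter names it FEATURES, so a downstream estimator must be pointed at it with featuresCol="FEATURES". VectorAssembler builds that column by combining multiple numeric and vector columns into one dense or sparse vector.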
Basic Usage
import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml.feature import VectorAssembler

# Download the flights dataset to the working directory
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("VectorAssembler") \
    .master("local[*]") \
    .getOrCreate()

# Load the CSV and replace nulls in the numeric columns with 0
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# Derive the departure hour from the HHMM value and flag days 6 and 7 as weekend
flights_df = flights_df \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

# Combine the five numeric columns into a single vector column
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"],
    outputCol="FEATURES"
)

assembled_df = assembler.transform(flights_df)
assembled_df.select("DEPARTURE_DELAY", "DISTANCE", "FEATURES").show(5, truncate=False)
Handling Nulls in VectorAssembler
By default, VectorAssembler raises an error if any input column contains null or NaN values. You can change this behavior with the handleInvalid parameter:
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"],
    outputCol="FEATURES",
    handleInvalid="skip"  # Options: "error" (default), "skip", "keep"
)

"error": raises an exception on null or NaN values;
"skip": drops rows that contain invalid values;
"keep": keeps the row and writes NaN in place of each invalid value in the output vector.
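To make the three modes concrete without starting a Spark session, here is a plain-Python sketch that mimics the behavior on a small list of rows. It is only a model of the semantics, not Spark's implementation: assemble_rows and is_invalid are hypothetical helpers, and the row values are invented.

```python
import math


def is_invalid(value):
    """A value counts as invalid if it is None (null) or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def assemble_rows(rows, handle_invalid="error"):
    """Hypothetical stand-in for assembling tuples of numbers into vectors."""
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError(f"unknown mode: {handle_invalid}")

    assembled = []
    for row in rows:
        if any(is_invalid(v) for v in row):
            if handle_invalid == "error":
                raise ValueError(f"invalid value in row {row}")
            if handle_invalid == "skip":
                continue  # drop the whole row
        # Clean rows, and invalid rows under "keep": nulls become NaN
        assembled.append([float("nan") if v is None else float(v) for v in row])
    return assembled


rows = [(12.0, 500.0), (None, 750.0), (3.0, 1200.0)]

skipped = assemble_rows(rows, "skip")  # two rows survive
kept = assemble_rows(rows, "keep")     # three rows, NaN in the second one
```

The practical difference is that "skip" silently shrinks the dataset, while "keep" preserves every row but passes NaN on to whatever consumes the vector.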
Combining Scalar and Vector Inputs
VectorAssembler can mix scalar columns and existing vector columns in a single step:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Adding an encoded airline vector
flights_df = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX").fit(flights_df).transform(flights_df)
flights_df = OneHotEncoder(inputCol="AIRLINE_IDX", outputCol="AIRLINE_VEC").fit(flights_df).transform(flights_df)

# Combining scalar columns and the airline vector
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "DEPARTURE_HOUR", "AIRLINE_VEC"],
    outputCol="FEATURES"
)

assembled_df = assembler.transform(flights_df)
assembled_df.select("FEATURES").show(3, truncate=False)
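The layout of the resulting vector follows directly from the order of inputCols: each scalar contributes one entry and each vector column contributes all of its entries. The plain-Python sketch below models that concatenation. The helpers one_hot and assemble are hypothetical, the row values are invented, and four airlines are assumed. The drop_last flag mirrors OneHotEncoder's default dropLast=True, under which the last category is encoded as all zeros.

```python
def one_hot(index, num_categories, drop_last=True):
    """Model a one-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(row, input_cols):
    """Concatenate scalars and vectors in the order given by input_cols."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)  # vector: all entries
        else:
            features.append(float(value))             # scalar: one entry
    return features


row = {
    "DEPARTURE_DELAY": 7,
    "DISTANCE": 950,
    "DEPARTURE_HOUR": 14,
    "AIRLINE_VEC": one_hot(1, 4),  # airline index 1 out of 4 assumed airlines
}

features = assemble(row, ["DEPARTURE_DELAY", "DISTANCE", "DEPARTURE_HOUR", "AIRLINE_VEC"])
print(features)  # [7.0, 950.0, 14.0, 0.0, 1.0, 0.0]
```

Here three scalars plus a three-entry airline vector give six features, so the assembled width grows with the number of categories in every encoded column.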
Section 1. Chapter 11