Assembling Features with VectorAssembler
Deslize para mostrar o menu
Almost every MLlib algorithm expects a single vector column called FEATURES as input. VectorAssembler combines multiple numeric and vector columns into one dense or sparse vector.
Basic Usage
1234567891011121314151617181920212223242526272829import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, floor, when from pyspark.ml.feature import VectorAssembler urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("VectorAssembler") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"]) flights_df = flights_df \ .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \ .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer")) assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"], outputCol="FEATURES" ) assembled_df = assembler.transform(flights_df) assembled_df.select("DEPARTURE_DELAY", "DISTANCE", "FEATURES").show(5, truncate=False)
Handling Nulls in VectorAssembler
By default VectorAssembler raises an error if any input column contains nulls. You can control this with handleInvalid:
12345assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"], outputCol="FEATURES", handleInvalid="skip" # Options: "error" (default), "skip", "keep" )
"error"– raises an exception on null or NaN values;"skip"– drops rows with invalid values;"keep"– replaces invalid values with 0 in the output vector.
Combining Scalar and Vector Inputs
VectorAssembler can mix scalar columns and existing vector columns in a single step:
12345678910111213from pyspark.ml.feature import StringIndexer, OneHotEncoder # Adding an encoded airline vector flights_df = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX").fit(flights_df).transform(flights_df) flights_df = OneHotEncoder(inputCol="AIRLINE_IDX", outputCol="AIRLINE_VEC").fit(flights_df).transform(flights_df) # Combining scalar columns and the airline vector assembler = VectorAssembler( inputCols=["DEPARTURE_DELAY", "DISTANCE", "DEPARTURE_HOUR", "AIRLINE_VEC"], outputCol="FEATURES" ) assembled_df = assembler.transform(flights_df) assembled_df.select("FEATURES").show(3, truncate=False)
Tudo estava claro?
Obrigado pelo seu feedback!
Seção 1. Capítulo 11
Pergunte à IA
Pergunte à IA
Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo
Seção 1. Capítulo 11