Assembling Features with VectorAssembler
Almost every MLlib algorithm expects its input features in a single vector column (named features by default; this lesson names it FEATURES). VectorAssembler combines multiple numeric and vector columns into that one column, stored as either a dense or a sparse vector.
Basic Usage
```python
import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor
from pyspark.ml.feature import VectorAssembler

# Download the flights dataset used throughout the lesson
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("VectorAssembler") \
    .master("local[*]") \
    .getOrCreate()

# Fill nulls in the numeric columns so the assembler does not fail on them
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

# Derive two extra numeric features: departure hour (from HHMM) and a weekend flag
flights_df = flights_df \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

# Combine the five numeric columns into a single vector column
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"],
    outputCol="FEATURES"
)

assembled_df = assembler.transform(flights_df)
assembled_df.select("DEPARTURE_DELAY", "DISTANCE", "FEATURES").show(5, truncate=False)
```
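When the assembled column is printed, some rows may appear in dense form, such as `[a,b,c]`, and others in sparse form, such as `(size,[indices],[values])`. The pure-Python sketch below imitates that choice without using Spark. It rests on an assumption: the rule "use sparse when 1.5 * (nnz + 1) < size" is taken from the heuristic in Spark's vector compression and may differ between versions. The function name `assembled_repr` and the sample rows are hypothetical and are not values from flights.csv.

```python
def assembled_repr(values):
    """Format one row of features roughly the way Spark prints ML vectors.

    Assumption: sparse storage is chosen when 1.5 * (nnz + 1) < size,
    where nnz is the number of non-zero entries. Treat this threshold as
    illustrative rather than as a guaranteed Spark contract.
    """
    size = len(values)
    nonzero = [(i, float(v)) for i, v in enumerate(values) if v != 0]
    if 1.5 * (len(nonzero) + 1.0) < size:
        indices = ",".join(str(i) for i, _ in nonzero)
        vals = ",".join(str(v) for _, v in nonzero)
        return f"({size},[{indices}],[{vals}])"
    return "[" + ",".join(str(float(v)) for v in values) + "]"


# Column order: DEPARTURE_DELAY, DISTANCE, SCHEDULED_TIME, DEPARTURE_HOUR, IS_WEEKEND
mostly_filled = [-11.0, 1448.0, 205.0, 0.0, 0.0]  # three non-zeros: stays dense
mostly_zero = [0.0, 1448.0, 0.0, 0.0, 1.0]        # two non-zeros: becomes sparse

print(assembled_repr(mostly_filled))  # [-11.0,1448.0,205.0,0.0,0.0]
print(assembled_repr(mostly_zero))    # (5,[1,4],[1448.0,1.0])
```

Both notations describe the same kind of five-element vector; only the storage differs, so downstream estimators treat them identically.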
Handling Nulls in VectorAssembler
By default, VectorAssembler raises an error when any input column contains a null or NaN value. The handleInvalid parameter controls this behavior:
```python
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"],
    outputCol="FEATURES",
    handleInvalid="skip"  # Options: "error" (default), "skip", "keep"
)
```
"error": raises an exception on null or NaN values;
"skip": drops rows that contain invalid values;
"keep": keeps the row and writes NaN into the output vector for each invalid value (it does not substitute 0).
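The three modes can be modelled in plain Python to make the difference concrete. This is a sketch of the semantics only, not Spark code: `assemble_rows` and `is_invalid` are hypothetical helpers, and the two sample rows are invented. Spark's documentation describes "keep" as returning NaN for the invalid entries, so the sketch uses NaN rather than 0.

```python
import math


def is_invalid(value):
    """Treat None (a SQL null) and NaN as invalid input."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def assemble_rows(rows, handle_invalid="error"):
    """Pure-Python model of the handleInvalid modes of VectorAssembler."""
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError(f"unknown mode: {handle_invalid}")
    assembled = []
    for row in rows:
        if any(is_invalid(v) for v in row):
            if handle_invalid == "error":
                raise ValueError(f"invalid value in row {row}")
            if handle_invalid == "skip":
                continue  # the whole row is dropped
            # "keep": the row survives and invalid slots become NaN
            row = [float("nan") if is_invalid(v) else v for v in row]
        assembled.append([float(v) for v in row])
    return assembled


# Hypothetical (DEPARTURE_DELAY, DISTANCE) rows; the second has a null delay
rows = [[5.0, 300.0], [None, 450.0]]

print(assemble_rows(rows, "skip"))  # one row remains
print(assemble_rows(rows, "keep"))  # two rows, the second starts with nan
```

Because "keep" lets NaN through, a later stage such as an Imputer or an explicit fillna is usually still needed before training, whereas "skip" silently shrinks the dataset.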
Combining Scalar and Vector Inputs
VectorAssembler can mix scalar columns and existing vector columns in a single step:
```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Adding an encoded airline vector
flights_df = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX").fit(flights_df).transform(flights_df)
flights_df = OneHotEncoder(inputCol="AIRLINE_IDX", outputCol="AIRLINE_VEC").fit(flights_df).transform(flights_df)

# Combining scalar columns and the airline vector
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "DEPARTURE_HOUR", "AIRLINE_VEC"],
    outputCol="FEATURES"
)

assembled_df = assembler.transform(flights_df)
assembled_df.select("FEATURES").show(3, truncate=False)
```
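To see how long the resulting FEATURES vector is, it can help to model the concatenation in plain Python. The sketch below is not Spark code: `one_hot` and `assemble` are hypothetical helpers, the airline count of 4 is invented, and the sketch assumes OneHotEncoder's default dropLast=True, under which k categories occupy k - 1 slots.

```python
def one_hot(index, num_categories, drop_last=True):
    """Model a one-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(*inputs):
    """Concatenate scalars and vectors, in order, into one flat feature list."""
    flat = []
    for item in inputs:
        if isinstance(item, (list, tuple)):
            flat.extend(float(x) for x in item)  # a vector contributes all its slots
        else:
            flat.append(float(item))             # a scalar contributes one slot
    return flat


# Hypothetical row: DEPARTURE_DELAY, DISTANCE, DEPARTURE_HOUR, then AIRLINE_VEC
airline_vec = one_hot(index=2, num_categories=4)  # [0.0, 0.0, 1.0]
features = assemble(-3.0, 1448.0, 9, airline_vec)

print(features)       # [-3.0, 1448.0, 9.0, 0.0, 0.0, 1.0]
print(len(features))  # 6 = 3 scalars + (4 - 1) one-hot slots
```

The order of inputCols fixes the slot layout, which matters when reading model coefficients or feature importances back against column names.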
Section 1. Chapter 11