Feature Engineering with PySpark

Assembling Features with VectorAssembler


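Almost every MLlib algorithm expects its input features in a single vector column. Estimators look for a column named features by default; this chapter names the column FEATURES, so that name has to be passed to the estimator through featuresCol. VectorAssembler concatenates multiple numeric, boolean, and vector columns, in the order listed in inputCols, into one vector column whose rows may be stored as dense or sparse vectors.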

Basic Usage

import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor, when
from pyspark.ml.feature import VectorAssembler

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("VectorAssembler") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE", "SCHEDULED_TIME"])

flights_df = flights_df \
    .withColumn("DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")) \
    .withColumn("IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer"))

assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"],
    outputCol="FEATURES"
)

assembled_df = assembler.transform(flights_df)
assembled_df.select("DEPARTURE_DELAY", "DISTANCE", "FEATURES").show(5, truncate=False)

Handling Nulls in VectorAssembler

By default VectorAssembler raises an error if any input column contains nulls. You can control this with handleInvalid:

assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "SCHEDULED_TIME", "DEPARTURE_HOUR", "IS_WEEKEND"],
    outputCol="FEATURES",
    handleInvalid="skip"  # Options: "error" (default), "skip", "keep"
)
  • "error" – raises an exception on null or NaN values (the default);
  • "skip" – drops rows that contain a null or NaN value in any input column;
  • "keep" – keeps the row and writes NaN into the output vector in place of each invalid value.

Combining Scalar and Vector Inputs

VectorAssembler can mix scalar columns and existing vector columns in a single step:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Adding an encoded airline vector
flights_df = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX").fit(flights_df).transform(flights_df)
flights_df = OneHotEncoder(inputCol="AIRLINE_IDX", outputCol="AIRLINE_VEC").fit(flights_df).transform(flights_df)

# Combining scalar columns and the airline vector
assembler = VectorAssembler(
    inputCols=["DEPARTURE_DELAY", "DISTANCE", "DEPARTURE_HOUR", "AIRLINE_VEC"],
    outputCol="FEATURES"
)

assembled_df = assembler.transform(flights_df)
assembled_df.select("FEATURES").show(3, truncate=False)

What happens by default when VectorAssembler encounters a null value in an input column?

Section 1. Chapter 11
