Feature Engineering with PySpark

Creating New Features from Existing Columns

Raw columns often contain more information than they appear to. Extracting that information into new columns, called derived features, often improves model performance, sometimes more than adding new data sources would.

Extracting Departure Hour and Time of Day

SCHEDULED_DEPARTURE is stored as an integer in HHMM format. Dividing by 100 and taking the floor drops the minutes (for example, 1435 becomes 14), and the resulting hour can then be classified into time-of-day buckets:

import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor, when

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("DerivedFeatures") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Extracting departure hour
flights_df = flights_df.withColumn(
    "DEPARTURE_HOUR",
    floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")
)

# Classifying into time-of-day buckets
flights_df = flights_df.withColumn(
    "TIME_OF_DAY",
    when(col("DEPARTURE_HOUR") < 6, "night")
    .when(col("DEPARTURE_HOUR") < 12, "morning")
    .when(col("DEPARTURE_HOUR") < 18, "afternoon")
    .otherwise("evening")
)

flights_df.select("SCHEDULED_DEPARTURE", "DEPARTURE_HOUR", "TIME_OF_DAY").show(5)
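
To sanity-check the bucket boundaries without starting a Spark session, the same arithmetic can be mirrored in plain Python. This is a minimal sketch, not part of the lesson's code: `departure_hour` and `time_of_day` are hypothetical helper names, and returning `None` for a missing value is an assumption made here. In the Spark version above, a null DEPARTURE_HOUR makes every `when` condition null, so such a row falls through to `otherwise("evening")`, which is worth checking if the column can contain nulls.

```python
from typing import Optional


def departure_hour(scheduled_departure: Optional[int]) -> Optional[int]:
    """Hour part of an HHMM integer, e.g. 1435 -> 14 (hypothetical helper)."""
    if scheduled_departure is None:
        return None
    return scheduled_departure // 100


def time_of_day(hour: Optional[int]) -> Optional[str]:
    """Same thresholds as the when/otherwise chain: <6, <12, <18, else evening."""
    if hour is None:
        return None  # assumption: keep missing values missing
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


# Print values on both sides of each bucket boundary
for hhmm in (5, 559, 600, 1159, 1200, 1759, 1800, 2359):
    hour = departure_hour(hhmm)
    print(hhmm, hour, time_of_day(hour))
```

The boundary values show that each threshold is exclusive on the upper side: 559 is still "night", while 600 is already "morning".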

Computing Total Delay and Delay Ratio

# Total delay as a combination of departure and arrival delay
flights_df = flights_df.withColumn(
    "TOTAL_DELAY",
    col("DEPARTURE_DELAY") + col("ARRIVAL_DELAY")
)

# Delay as a fraction of scheduled flight time – how badly delayed relative to duration
flights_df = flights_df.withColumn(
    "DELAY_RATIO",
    (col("ARRIVAL_DELAY") / col("SCHEDULED_TIME")).cast("double")
)

flights_df.select("DEPARTURE_DELAY", "ARRIVAL_DELAY", "SCHEDULED_TIME", "TOTAL_DELAY", "DELAY_RATIO").show(5)
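
A worked example makes the two formulas concrete: with a departure delay of 10, an arrival delay of 25, and a scheduled time of 100 (all in the same unit), TOTAL_DELAY is 35 and DELAY_RATIO is 25 / 100 = 0.25. The plain-Python sketch below mirrors that arithmetic and makes one edge case explicit: when SCHEDULED_TIME is missing or zero, the ratio is undefined. The helper names and the choice to return `None` are assumptions for illustration. How Spark itself treats division by zero depends on its ANSI SQL setting, so it is worth checking on your version.

```python
from typing import Optional


def total_delay(departure_delay: float, arrival_delay: float) -> float:
    """Sum of departure and arrival delay (hypothetical helper)."""
    return departure_delay + arrival_delay


def delay_ratio(arrival_delay: float,
                scheduled_time: Optional[float]) -> Optional[float]:
    """Arrival delay as a fraction of scheduled time.

    Returns None when the scheduled time is missing or zero
    (an assumption made for this sketch).
    """
    if scheduled_time is None or scheduled_time == 0:
        return None
    return arrival_delay / scheduled_time


print(total_delay(10, 25))    # 35
print(delay_ratio(25, 100))   # 0.25
print(delay_ratio(-10, 100))  # -0.1 (arrived early)
print(delay_ratio(25, 0))     # None
```

Note that a negative ratio is a legitimate value here: it means the flight arrived ahead of schedule.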

Is Weekend Flag

# Adding a binary flag for weekend flights (DAY_OF_WEEK: 6=Saturday, 7=Sunday)
flights_df = flights_df.withColumn(
    "IS_WEEKEND",
    (col("DAY_OF_WEEK") >= 6).cast("integer")
)

flights_df.select("DAY_OF_WEEK", "IS_WEEKEND").distinct().orderBy("DAY_OF_WEEK").show()
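
The flag is a comparison cast to an integer, which a plain-Python sketch can mirror. It assumes the numbering given in the code comment above (6 = Saturday, 7 = Sunday) and, by extension, that values run from 1 to 7. `is_weekend` is a hypothetical helper name.

```python
def is_weekend(day_of_week: int) -> int:
    """1 for Saturday (6) or Sunday (7), else 0; assumes days are numbered 1-7."""
    return int(day_of_week >= 6)


# Map every day number to its flag
flags = {day: is_weekend(day) for day in range(1, 8)}
print(flags)  # {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 1}
```

If the data used a different convention (for example 0 = Sunday), the `>= 6` threshold would silently mislabel days, so inspecting the distinct DAY_OF_WEEK values first, as the `show()` call above does, is a cheap safeguard.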
What is the main purpose of creating derived features from existing columns?

Section 1. Chapter 7