Feature Engineering with PySpark

Creating New Features from Existing Columns

Raw columns often contain more information than they appear to. Extracting that information into new columns, called derived features, can improve model performance, sometimes more than adding new data sources does.

Extracting Departure Hour and Time of Day

SCHEDULED_DEPARTURE is stored as an integer in HHMM format, so 1430 means 14:30. Dividing by 100 and rounding down gives the hour, which you can then classify into time-of-day buckets:

import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor, when

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("DerivedFeatures") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Extracting departure hour
flights_df = flights_df.withColumn(
    "DEPARTURE_HOUR",
    floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")
)

# Classifying into time-of-day buckets
flights_df = flights_df.withColumn(
    "TIME_OF_DAY",
    when(col("DEPARTURE_HOUR") < 6, "night")
    .when(col("DEPARTURE_HOUR") < 12, "morning")
    .when(col("DEPARTURE_HOUR") < 18, "afternoon")
    .otherwise("evening")
)

flights_df.select("SCHEDULED_DEPARTURE", "DEPARTURE_HOUR", "TIME_OF_DAY").show(5)
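The bucket boundaries are easy to get wrong by one hour, so it can help to check them outside Spark first. The sketch below mirrors the hour extraction and the four buckets in plain Python. The helper names `departure_hour` and `time_of_day` are hypothetical (they are not part of PySpark or the course code), and the sample values are made up for illustration rather than taken from flights.csv.

```python
def departure_hour(hhmm: int) -> int:
    """Return the hour part of an integer time in HHMM format."""
    # Integer division drops the two minute digits, like floor(col / 100).
    return hhmm // 100


def time_of_day(hour: int) -> str:
    """Map an hour to the same four buckets used in the Spark expression."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


# Illustrative values only, chosen to sit on either side of each boundary.
for hhmm in (5, 559, 600, 1159, 1200, 1759, 1800, 2359):
    hour = departure_hour(hhmm)
    print(f"{hhmm:04d} -> hour {hour:2d} -> {time_of_day(hour)}")
```

Note that a time shortly after midnight, such as 00:05, is the integer 5 in this format, which still maps to hour 0.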

Computing Total Delay and Delay Ratio

# Total delay as a combination of departure and arrival delay
flights_df = flights_df.withColumn(
    "TOTAL_DELAY",
    col("DEPARTURE_DELAY") + col("ARRIVAL_DELAY")
)

# Delay as a fraction of scheduled flight time – how badly delayed relative to duration
flights_df = flights_df.withColumn(
    "DELAY_RATIO",
    (col("ARRIVAL_DELAY") / col("SCHEDULED_TIME")).cast("double")
)

flights_df.select("DEPARTURE_DELAY", "ARRIVAL_DELAY", "SCHEDULED_TIME", "TOTAL_DELAY", "DELAY_RATIO").show(5)
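DELAY_RATIO divides by SCHEDULED_TIME, so rows where that column is missing or zero deserve some thought. In Spark, division by zero yields null when ANSI mode is off and raises an error when `spark.sql.ansi.enabled` is true. One possible guard is to leave the ratio undefined for such rows. The plain-Python sketch below shows that rule with a hypothetical helper `delay_ratio`. Whether this is the right treatment for this dataset is an assumption, and the sample rows are made up.

```python
from typing import Optional


def delay_ratio(arrival_delay: Optional[float],
                scheduled_time: Optional[float]) -> Optional[float]:
    """Arrival delay as a fraction of scheduled flight time.

    Returns None when the ratio is undefined (missing inputs or a
    non-positive scheduled time).
    """
    if arrival_delay is None or scheduled_time is None:
        return None
    if scheduled_time <= 0:
        return None
    return arrival_delay / scheduled_time


# Made-up rows: (ARRIVAL_DELAY, SCHEDULED_TIME) in minutes.
rows = [(30, 120), (-10, 200), (15, 0), (20, None)]
for arrival, scheduled in rows:
    print(arrival, scheduled, delay_ratio(arrival, scheduled))

# A possible PySpark counterpart (an assumption, not the course's code):
# when(col("SCHEDULED_TIME") > 0,
#      col("ARRIVAL_DELAY") / col("SCHEDULED_TIME"))
# Without .otherwise(...), rows that fail the condition become null.
```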

Is Weekend Flag

# Adding a binary flag for weekend flights (DAY_OF_WEEK: 6=Saturday, 7=Sunday)
flights_df = flights_df.withColumn(
    "IS_WEEKEND",
    (col("DAY_OF_WEEK") >= 6).cast("integer")
)

flights_df.select("DAY_OF_WEEK", "IS_WEEKEND").distinct().orderBy("DAY_OF_WEEK").show()
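The comparison `>= 6` relies on DAY_OF_WEEK being numbered 1 to 7 with Saturday as 6 and Sunday as 7, as the code comment states. A plain-Python mirror of the flag makes that assumption explicit and rejects values outside the expected range. The helper name `is_weekend` is hypothetical.

```python
def is_weekend(day_of_week: int) -> int:
    """Return 1 for Saturday (6) or Sunday (7), otherwise 0."""
    # Fail loudly if the column uses a different numbering than assumed.
    if not 1 <= day_of_week <= 7:
        raise ValueError(f"DAY_OF_WEEK out of range: {day_of_week}")
    return int(day_of_week >= 6)


for day in range(1, 8):
    print(day, is_weekend(day))
```

In Spark, `col("DAY_OF_WEEK").isin(6, 7)` expresses the same condition and does not depend on 6 and 7 being the largest values in the column.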
What is the main purpose of creating derived features from existing columns?

Section 1. Chapter 7
