Feature Engineering with PySpark

Creating New Features from Existing Columns

Raw columns often contain more information than they appear to. Extracting that information into new columns – derived features – frequently improves model performance more than adding new data sources.

Extracting Departure Hour and Time of Day

SCHEDULED_DEPARTURE is stored as an integer in HHMM format, so 1435 means 14:35. Dividing by 100 and rounding down gives the hour, which you can then classify into time-of-day buckets:

import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor, when

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("DerivedFeatures") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Extracting departure hour
flights_df = flights_df.withColumn(
    "DEPARTURE_HOUR",
    floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")
)

# Classifying into time-of-day buckets
flights_df = flights_df.withColumn(
    "TIME_OF_DAY",
    when(col("DEPARTURE_HOUR") < 6, "night")
    .when(col("DEPARTURE_HOUR") < 12, "morning")
    .when(col("DEPARTURE_HOUR") < 18, "afternoon")
    .otherwise("evening")
)

flights_df.select("SCHEDULED_DEPARTURE", "DEPARTURE_HOUR", "TIME_OF_DAY").show(5)
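
To sanity-check the bucket boundaries without starting a Spark session, the same arithmetic can be mirrored in plain Python. This is only a minimal sketch and not part of the course code: the helper names departure_hour and time_of_day are hypothetical, the sample times are made up, and it assumes HHMM values between 0 and 2359.

```python
def departure_hour(hhmm: int) -> int:
    """Mirror floor(SCHEDULED_DEPARTURE / 100), e.g. 1435 gives 14."""
    return hhmm // 100


def time_of_day(hour: int) -> str:
    """Mirror the when/otherwise chain: the first matching branch wins."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


# Hypothetical sample times just below and at each cutoff
samples = [5, 559, 600, 1159, 1200, 1759, 1800, 2359]
buckets = {t: time_of_day(departure_hour(t)) for t in samples}
print(buckets)
```

Because each branch is checked in order, a departure at 0600 falls into "morning" rather than "night", and anything from 1800 onward lands in "evening".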

Computing Total Delay and Delay Ratio

# Total delay as a combination of departure and arrival delay
flights_df = flights_df.withColumn(
    "TOTAL_DELAY",
    col("DEPARTURE_DELAY") + col("ARRIVAL_DELAY")
)

# Delay as a fraction of scheduled flight time – how badly delayed relative to duration
flights_df = flights_df.withColumn(
    "DELAY_RATIO",
    (col("ARRIVAL_DELAY") / col("SCHEDULED_TIME")).cast("double")
)

flights_df.select("DEPARTURE_DELAY", "ARRIVAL_DELAY", "SCHEDULED_TIME", "TOTAL_DELAY", "DELAY_RATIO").show(5)
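
One edge case worth checking in the ratio is a missing or zero SCHEDULED_TIME. Depending on the Spark version and the spark.sql.ansi.enabled setting, dividing by zero either yields null or raises an error, so an explicit guard can make the intent clearer. The sketch below mirrors such a guard in plain Python. The delay_ratio helper is hypothetical, and the Spark expression in the comment is one possible equivalent, not the course's method.

```python
from typing import Optional


def delay_ratio(arrival_delay: Optional[float],
                scheduled_time: Optional[float]) -> Optional[float]:
    """Return arrival delay as a fraction of scheduled time, or None if undefined."""
    # A possible Spark equivalent (when() without otherwise() yields null):
    #   when(col("SCHEDULED_TIME") > 0,
    #        col("ARRIVAL_DELAY") / col("SCHEDULED_TIME"))
    if arrival_delay is None or scheduled_time is None or scheduled_time <= 0:
        return None
    return arrival_delay / scheduled_time


print(delay_ratio(30, 120))   # 30 minutes late on a 120-minute flight
print(delay_ratio(-10, 200))  # early arrivals give a negative ratio
print(delay_ratio(15, 0))     # undefined ratio, so None
```

Keeping the undefined case as a missing value, rather than filling it with 0, avoids presenting a flight with no scheduled duration as perfectly on time.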

Adding a Weekend Flag

# Adding a binary flag for weekend flights (DAY_OF_WEEK: 6=Saturday, 7=Sunday)
flights_df = flights_df.withColumn(
    "IS_WEEKEND",
    (col("DAY_OF_WEEK") >= 6).cast("integer")
)

flights_df.select("DAY_OF_WEEK", "IS_WEEKEND").distinct().orderBy("DAY_OF_WEEK").show()
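
The flag depends on the dataset's numbering, where 6 is Saturday and 7 is Sunday. As a rough sketch, the comparison can be mirrored in plain Python to see the expected flag for each day. The is_weekend helper is hypothetical, and the assumption that values run from 1 to 7 with 1 as Monday is inferred from the code comment rather than stated in the lesson.

```python
def is_weekend(day_of_week: int) -> int:
    """Mirror (col("DAY_OF_WEEK") >= 6).cast("integer")."""
    return int(day_of_week >= 6)


# Assumed encoding: 1=Monday ... 6=Saturday, 7=Sunday
flags = {day: is_weekend(day) for day in range(1, 8)}
print(flags)
```

The same comparison would not carry over to a column produced by Spark's built-in dayofweek function, which numbers Sunday as 1 and Saturday as 7.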
What is the main purpose of creating derived features from existing columns?

Section 1. Chapter 7