Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
学ぶ Creating New Features from Existing Columns | Section
Feature Engineering with PySpark

Creating New Features from Existing Columns

メニューを表示するにはスワイプしてください

Raw columns often contain more information than they appear to. Extracting that information into new columns – derived features – frequently improves model performance more than adding new data sources.

Extracting Departure Hour and Time of Day

SCHEDULED_DEPARTURE is stored as an integer in HHMM format. You can extract the hour and classify it into time-of-day buckets:

123456789101112131415161718192021222324252627282930313233
import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, floor, when urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("DerivedFeatures") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"]) # Extracting departure hour flights_df = flights_df.withColumn( "DEPARTURE_HOUR", floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer") ) # Classifying into time-of-day buckets flights_df = flights_df.withColumn( "TIME_OF_DAY", when(col("DEPARTURE_HOUR") < 6, "night") .when(col("DEPARTURE_HOUR") < 12, "morning") .when(col("DEPARTURE_HOUR") < 18, "afternoon") .otherwise("evening") ) flights_df.select("SCHEDULED_DEPARTURE", "DEPARTURE_HOUR", "TIME_OF_DAY").show(5)

Computing Total Delay and Delay Ratio

12345678910111213
# Total delay as a combination of departure and arrival delay flights_df = flights_df.withColumn( "TOTAL_DELAY", col("DEPARTURE_DELAY") + col("ARRIVAL_DELAY") ) # Delay as a fraction of scheduled flight time – how badly delayed relative to duration flights_df = flights_df.withColumn( "DELAY_RATIO", (col("ARRIVAL_DELAY") / col("SCHEDULED_TIME")).cast("double") ) flights_df.select("DEPARTURE_DELAY", "ARRIVAL_DELAY", "SCHEDULED_TIME", "TOTAL_DELAY", "DELAY_RATIO").show(5)

Is Weekend Flag

1234567
# Adding a binary flag for weekend flights (DAY_OF_WEEK: 6=Saturday, 7=Sunday) flights_df = flights_df.withColumn( "IS_WEEKEND", (col("DAY_OF_WEEK") >= 6).cast("integer") ) flights_df.select("DAY_OF_WEEK", "IS_WEEKEND").distinct().orderBy("DAY_OF_WEEK").show()
question mark

What is the main purpose of creating derived features from existing columns?

正しい答えを選んでください

すべて明確でしたか?

どのように改善できますか?

フィードバックありがとうございます!

セクション 1.  7

AIに質問する

expand

AIに質問する

ChatGPT

何でも質問するか、提案された質問の1つを試してチャットを始めてください

セクション 1.  7
some-alt