Feature Engineering with PySpark

Creating New Features from Existing Columns


Raw columns often contain more information than they appear to. Extracting that information into new columns – derived features – frequently improves model performance more than adding new data sources.

Extracting Departure Hour and Time of Day

SCHEDULED_DEPARTURE is stored as an integer in HHMM format (for example, 1430 means 14:30). Dividing by 100 and taking the floor gives the hour, which you can then classify into time-of-day buckets:

    import urllib.request

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, floor, when

    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("DerivedFeatures") \
        .master("local[*]") \
        .getOrCreate()

    flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
        .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

    # Extracting departure hour
    flights_df = flights_df.withColumn(
        "DEPARTURE_HOUR",
        floor(col("SCHEDULED_DEPARTURE") / 100).cast("integer")
    )

    # Classifying into time-of-day buckets
    flights_df = flights_df.withColumn(
        "TIME_OF_DAY",
        when(col("DEPARTURE_HOUR") < 6, "night")
        .when(col("DEPARTURE_HOUR") < 12, "morning")
        .when(col("DEPARTURE_HOUR") < 18, "afternoon")
        .otherwise("evening")
    )

    flights_df.select("SCHEDULED_DEPARTURE", "DEPARTURE_HOUR", "TIME_OF_DAY").show(5)
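To sanity-check the bucket boundaries without starting a Spark session, the same arithmetic can be mirrored in plain Python. This is only a sketch: the helper names `departure_hour` and `time_of_day` are hypothetical, and the sample HHMM values are made up for illustration rather than taken from flights.csv.

```python
# Plain-Python mirror of the Spark column expressions, for checking the logic only.
# departure_hour and time_of_day are hypothetical helper names.

def departure_hour(hhmm: int) -> int:
    """Return the hour part of a non-negative HHMM integer, e.g. 1430 -> 14."""
    return hhmm // 100


def time_of_day(hour: int) -> str:
    """Apply the same thresholds as the when/otherwise chain."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


# Made-up sample values chosen to sit on each bucket boundary.
samples = [5, 559, 600, 1159, 1200, 1430, 1800, 2359]
labels = {hhmm: time_of_day(departure_hour(hhmm)) for hhmm in samples}

for hhmm, label in labels.items():
    print(f"{hhmm:04d} -> hour {departure_hour(hhmm):2d} -> {label}")
```

One difference to keep in mind: in Spark, a null DEPARTURE_HOUR makes every `when` condition evaluate to null, so such rows fall through to `otherwise` and are labelled "evening". If SCHEDULED_DEPARTURE can be missing, it may be worth handling that case explicitly.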

Computing Total Delay and Delay Ratio

    # Total delay as a combination of departure and arrival delay
    flights_df = flights_df.withColumn(
        "TOTAL_DELAY",
        col("DEPARTURE_DELAY") + col("ARRIVAL_DELAY")
    )

    # Delay as a fraction of scheduled flight time – how badly delayed relative to duration
    flights_df = flights_df.withColumn(
        "DELAY_RATIO",
        (col("ARRIVAL_DELAY") / col("SCHEDULED_TIME")).cast("double")
    )

    flights_df.select("DEPARTURE_DELAY", "ARRIVAL_DELAY", "SCHEDULED_TIME", "TOTAL_DELAY", "DELAY_RATIO").show(5)
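The ratio is only meaningful when SCHEDULED_TIME is present and positive. The sketch below works through the arithmetic in plain Python and shows one possible guard. The helper names `total_delay` and `delay_ratio` are hypothetical, and the rows are made-up values, not data from flights.csv.

```python
from typing import Optional

# Plain-Python mirror of the two derived columns, with a guard on the divisor.
# total_delay and delay_ratio are hypothetical helper names.

def total_delay(departure_delay: float, arrival_delay: float) -> float:
    """Sum of departure and arrival delay, as in TOTAL_DELAY."""
    return departure_delay + arrival_delay


def delay_ratio(arrival_delay: float, scheduled_time: Optional[float]) -> Optional[float]:
    """Arrival delay divided by scheduled time, or None when the divisor is unusable."""
    if scheduled_time is None or scheduled_time <= 0:
        return None
    return arrival_delay / scheduled_time


# Made-up rows: (DEPARTURE_DELAY, ARRIVAL_DELAY, SCHEDULED_TIME)
rows = [(10, 25, 100), (0, -5, 50), (15, 30, 0), (5, 5, None)]
results = [(total_delay(d, a), delay_ratio(a, s)) for d, a, s in rows]

for row, (total, ratio) in zip(rows, results):
    print(row, "-> TOTAL_DELAY:", total, "DELAY_RATIO:", ratio)
```

In PySpark, a comparable guard could be written as `when(col("SCHEDULED_TIME") > 0, col("ARRIVAL_DELAY") / col("SCHEDULED_TIME"))`. With no `otherwise`, the remaining rows are left null. Without a guard, division by zero yields null or raises an error depending on whether ANSI mode is enabled in your Spark configuration.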

Is Weekend Flag

    # Adding a binary flag for weekend flights (DAY_OF_WEEK: 6=Saturday, 7=Sunday)
    flights_df = flights_df.withColumn(
        "IS_WEEKEND",
        (col("DAY_OF_WEEK") >= 6).cast("integer")
    )

    flights_df.select("DAY_OF_WEEK", "IS_WEEKEND").distinct().orderBy("DAY_OF_WEEK").show()
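As a quick check of what the distinct DAY_OF_WEEK / IS_WEEKEND pairs should look like, the flag can be reproduced in plain Python. This sketch assumes the encoding given in the comment above (1 through 7, with 6=Saturday and 7=Sunday). The helper name `is_weekend` is hypothetical.

```python
# Plain-Python mirror of (col("DAY_OF_WEEK") >= 6).cast("integer").
# is_weekend is a hypothetical helper name; days are assumed to run 1..7.

def is_weekend(day_of_week: int) -> int:
    """Return 1 for Saturday (6) or Sunday (7), otherwise 0."""
    return int(day_of_week >= 6)


flags = {day: is_weekend(day) for day in range(1, 8)}

for day, flag in flags.items():
    print(day, flag)
```

If the dataset used a different encoding (for example 0 through 6 with Sunday first), the threshold comparison would mark the wrong days, so it is worth confirming the encoding before relying on the flag.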
What is the main purpose of creating derived features from existing columns?
