Feature Engineering with PySpark

Handling Datetime Features


The flights dataset stores date information across three separate integer columns: YEAR, MONTH, and DAY. Combining them into a proper date and extracting meaningful components is a common feature engineering step.

Building a Date Column

    import urllib.request
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, concat_ws, lpad, to_date, dayofweek, month, quarter, dayofyear

    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("DatetimeFeatures") \
        .master("local[*]") \
        .getOrCreate()

    flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
        .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

    # Combining YEAR, MONTH, DAY into a DateType column
    flights_df = flights_df.withColumn(
        "FLIGHT_DATE",
        to_date(
            concat_ws("-",
                col("YEAR").cast("string"),
                lpad(col("MONTH").cast("string"), 2, "0"),
                lpad(col("DAY").cast("string"), 2, "0")
            ),
            "yyyy-MM-dd"
        )
    )

    flights_df.select("YEAR", "MONTH", "DAY", "FLIGHT_DATE").show(5)
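
To make the per-row logic of that expression concrete, here is a minimal plain-Python sketch of the same idea: zero-pad the parts, join them with hyphens, and parse the result as yyyy-MM-dd. The helper name build_flight_date is hypothetical and belongs neither to the lesson nor to PySpark. Returning None for an impossible date is meant to resemble how to_date yields null for unparseable input when Spark's ANSI mode is off, which is an assumption about your Spark configuration.

```python
from datetime import date, datetime
from typing import Optional


def build_flight_date(year: int, month: int, day: int) -> Optional[date]:
    """Hypothetical helper: combine integer parts into a date, or None if invalid."""
    # Zero-pad month and day to two characters, like lpad(..., 2, "0")
    text = "-".join([str(year), str(month).zfill(2), str(day).zfill(2)])
    try:
        # "%Y-%m-%d" is the strptime counterpart of the "yyyy-MM-dd" pattern
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        # Impossible dates such as February 30 cannot be parsed
        return None


print(build_flight_date(2015, 1, 5))   # a valid date
print(build_flight_date(2015, 2, 30))  # an impossible date
```

Checking for missing values in FLIGHT_DATE after the conversion is a cheap way to spot rows whose YEAR, MONTH, and DAY do not form a real calendar date.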

Extracting Date Components

Once you have a proper DateType column, PySpark provides built-in functions to extract components:

    # Extracting useful date components as separate feature columns
    flights_df = flights_df \
        .withColumn("MONTH_NUM", month(col("FLIGHT_DATE"))) \
        .withColumn("QUARTER", quarter(col("FLIGHT_DATE"))) \
        .withColumn("DAY_OF_YEAR", dayofyear(col("FLIGHT_DATE")))

    flights_df.select("FLIGHT_DATE", "MONTH_NUM", "QUARTER", "DAY_OF_YEAR").show(5)
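
As a reference for what these functions return, the following plain-Python sketch computes the same components for a single date. The helper name date_components is hypothetical. It also covers dayofweek, which the lesson imports but does not use, because its numbering is easy to get wrong: Spark counts Sunday as 1 and Saturday as 7, whereas Python's isoweekday() counts Monday as 1.

```python
from datetime import date


def date_components(d: date) -> dict:
    """Hypothetical helper mirroring month, quarter, dayofyear and dayofweek."""
    return {
        "MONTH_NUM": d.month,
        # Quarters group months as 1-3, 4-6, 7-9, 10-12
        "QUARTER": (d.month - 1) // 3 + 1,
        "DAY_OF_YEAR": d.timetuple().tm_yday,
        # Spark's dayofweek counts 1 = Sunday ... 7 = Saturday,
        # while isoweekday() counts 1 = Monday ... 7 = Sunday
        "DAY_OF_WEEK": d.isoweekday() % 7 + 1,
    }


# 2015-01-01 was a Thursday, so DAY_OF_WEEK comes out as 5
print(date_components(date(2015, 1, 1)))
```

Comparing a few rows of the Spark output against a small reference like this can help catch off-by-one mistakes before the features reach a model.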

Cyclic Encoding for Month and Hour

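Month and hour are cyclic: December is adjacent to January, and 23:00 is adjacent to 00:00. Encoding them as raw integers loses this property, because 12 and 1 end up eleven units apart even though the months are neighbors. Sine and cosine encoding preserves it by placing each value on a circle. The example encodes month; hour follows the same pattern with a period of 24 instead of 12: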

    import pyspark.sql.functions as F
    import math

    # Encoding month cyclically so that month 12 is close to month 1
    flights_df = flights_df \
        .withColumn("MONTH_SIN", F.sin(2 * math.pi * col("MONTH_NUM") / 12)) \
        .withColumn("MONTH_COS", F.cos(2 * math.pi * col("MONTH_NUM") / 12))

    flights_df.select("MONTH_NUM", "MONTH_SIN", "MONTH_COS").distinct().orderBy("MONTH_NUM").show()
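
To see why the sine and cosine pair restores closeness, the plain-Python sketch below measures the distance between encoded values. The helper names cyclic_encode and encoded_distance are hypothetical, and the hour example assumes an integer hour from 0 to 23; the lesson does not say which column of the flights dataset would supply it.

```python
import math


def cyclic_encode(value: float, period: float) -> tuple:
    """Hypothetical helper: map a cyclic value to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encoded_distance(a: float, b: float, period: float) -> float:
    """Euclidean distance between two values after cyclic encoding."""
    return math.dist(cyclic_encode(a, period), cyclic_encode(b, period))


# Months use a period of 12: December to January is as close as January to February
print(encoded_distance(12, 1, 12), encoded_distance(1, 2, 12))

# Hours use a period of 24: 23:00 to 00:00 is as close as 00:00 to 01:00
print(encoded_distance(23, 0, 24), encoded_distance(0, 1, 24))
```

Both columns are needed together: sine alone cannot tell apart values that mirror each other on the circle, such as months 1 and 5, which share the same sine.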
Section 1. Chapter 8
