Feature Engineering with PySpark

Handling Datetime Features

The flights dataset stores date information across three separate integer columns – YEAR, MONTH, and DAY. Combining them into a proper date and extracting meaningful components is a common feature engineering step.

Building a Date Column

import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, lpad, to_date, dayofweek, month, quarter, dayofyear

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("DatetimeFeatures") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Combining YEAR, MONTH, DAY into a DateType column
flights_df = flights_df.withColumn(
    "FLIGHT_DATE",
    to_date(
        concat_ws("-",
            col("YEAR").cast("string"),
            lpad(col("MONTH").cast("string"), 2, "0"),
            lpad(col("DAY").cast("string"), 2, "0")
        ),
        "yyyy-MM-dd"
    )
)

flights_df.select("YEAR", "MONTH", "DAY", "FLIGHT_DATE").show(5)
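To make the padding step easier to follow, here is a minimal plain-Python sketch of what the `concat_ws` and `lpad` calls do for a single row. It does not use Spark. The helper names `build_date_string` and `parse_flight_date` are hypothetical, and the sample values are made up for illustration.

```python
from datetime import date, datetime


def build_date_string(year: int, month: int, day: int) -> str:
    """Mimic concat_ws("-", YEAR, lpad(MONTH, 2, "0"), lpad(DAY, 2, "0"))."""
    # :02d left-pads single-digit months and days with a zero, like lpad(..., 2, "0")
    return f"{year}-{month:02d}-{day:02d}"


def parse_flight_date(year: int, month: int, day: int) -> date:
    """Parse the padded string into a date, similar in spirit to to_date(..., "yyyy-MM-dd")."""
    return datetime.strptime(build_date_string(year, month, day), "%Y-%m-%d").date()


# Illustrative values only, not taken from the dataset
print(build_date_string(2015, 1, 5))   # 2015-01-05
print(parse_flight_date(2015, 1, 5))   # 2015-01-05
```

The padding is what makes every string match the fixed-width `yyyy-MM-dd` pattern, regardless of whether the month or day has one digit or two.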

Extracting Date Components

Once you have a proper DateType column, PySpark provides built-in functions to extract components:

# Extracting useful date components as separate feature columns
flights_df = flights_df \
    .withColumn("MONTH_NUM", month(col("FLIGHT_DATE"))) \
    .withColumn("QUARTER", quarter(col("FLIGHT_DATE"))) \
    .withColumn("DAY_OF_YEAR", dayofyear(col("FLIGHT_DATE")))

flights_df.select("FLIGHT_DATE", "MONTH_NUM", "QUARTER", "DAY_OF_YEAR").show(5)
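As a sanity check on what these columns contain, the same components can be computed for a single date in plain Python. This is only a sketch: `date_components` is a hypothetical helper, and the `DAY_OF_WEEK` entry is an addition that mirrors Spark's `dayofweek` convention (1 = Sunday through 7 = Saturday), which the lesson imports but does not use.

```python
from datetime import date


def date_components(d: date) -> dict:
    """Compute the same components as month, quarter, dayofyear and dayofweek in Spark."""
    return {
        "MONTH_NUM": d.month,                    # 1..12
        "QUARTER": (d.month - 1) // 3 + 1,       # 1..4
        "DAY_OF_YEAR": d.timetuple().tm_yday,    # 1..365 (366 in leap years)
        # isoweekday() is 1 = Monday .. 7 = Sunday; shift to Spark's 1 = Sunday .. 7 = Saturday
        "DAY_OF_WEEK": d.isoweekday() % 7 + 1,
    }


# 1 January 2015 was a Thursday (illustrative date)
print(date_components(date(2015, 1, 1)))
# {'MONTH_NUM': 1, 'QUARTER': 1, 'DAY_OF_YEAR': 1, 'DAY_OF_WEEK': 5}
```

In the Spark pipeline itself, the equivalent column would presumably be added with `.withColumn("DAY_OF_WEEK", dayofweek(col("FLIGHT_DATE")))`.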

Cyclic Encoding for Month and Hour

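Month and hour are cyclic: December is adjacent to January, and 23:00 is adjacent to 00:00. Raw integers lose this property, because 12 and 1 are 11 units apart even though the months are neighbors. Sine and cosine encoding preserves it by mapping each value onto a circle, so that for a period P (12 for months, 24 for hours) the value v becomes the pair (sin(2πv/P), cos(2πv/P)):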

import pyspark.sql.functions as F
import math

# Encoding month cyclically so that month 12 is close to month 1
flights_df = flights_df \
    .withColumn("MONTH_SIN", F.sin(2 * math.pi * col("MONTH_NUM") / 12)) \
    .withColumn("MONTH_COS", F.cos(2 * math.pi * col("MONTH_NUM") / 12))

flights_df.select("MONTH_NUM", "MONTH_SIN", "MONTH_COS").distinct().orderBy("MONTH_NUM").show()
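To check that the encoding really places December next to January, the distances between encoded points can be compared in plain Python. This is a minimal sketch without Spark. `cyclic_encode` and `distance` are hypothetical helpers, and the hour values are plain illustrative numbers, since the lesson does not show which column holds the hour.

```python
import math


def cyclic_encode(value: float, period: float) -> tuple:
    """Map a cyclic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a: tuple, b: tuple) -> float:
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


dec, jan, feb = (cyclic_encode(m, 12) for m in (12, 1, 2))

# Adjacent months are equally far apart, including the December -> January wrap-around
print(round(distance(dec, jan), 4))  # 0.5176
print(math.isclose(distance(dec, jan), distance(jan, feb)))  # True

# The same idea for hours uses a period of 24: 23:00 and 00:00 end up as neighbors
h23, h0, h1 = (cyclic_encode(h, 24) for h in (23, 0, 1))
print(math.isclose(distance(h23, h0), distance(h0, h1)))  # True
```

Both the sine and the cosine column are needed: either one alone maps two different months to the same value (for example, sine gives months 1 and 5 the same result), while the pair identifies each position on the circle uniquely.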

Why is cyclic encoding useful for month and hour features?
