Feature Engineering with PySpark

Handling Datetime Features

The flights dataset stores date information across three separate integer columns – YEAR, MONTH, and DAY. Combining them into a proper date and extracting meaningful components is a common feature engineering step.

Building a Date Column

import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, lpad, to_date, dayofweek, month, quarter, dayofyear

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("DatetimeFeatures") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
    .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Combining YEAR, MONTH, DAY into a DateType column
flights_df = flights_df.withColumn(
    "FLIGHT_DATE",
    to_date(
        concat_ws("-",
            col("YEAR").cast("string"),
            lpad(col("MONTH").cast("string"), 2, "0"),
            lpad(col("DAY").cast("string"), 2, "0")
        ),
        "yyyy-MM-dd"
    )
)

flights_df.select("YEAR", "MONTH", "DAY", "FLIGHT_DATE").show(5)
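To make the string-building step concrete, here is a minimal plain-Python sketch of the same logic for a single row: zero-pad the month and day to two digits, join the parts with hyphens, and parse the result with the counterpart of the yyyy-MM-dd pattern. The helper name build_flight_date is hypothetical and is not part of the lesson's code; it only illustrates what the concat_ws, lpad, and to_date calls do together.

```python
from datetime import date, datetime


def build_flight_date(year: int, month: int, day: int) -> date:
    """Mirror the concat_ws + lpad + to_date steps for a single row."""
    # lpad(..., 2, "0") turns "1" into "01"; str.zfill(2) does the same here.
    text = "-".join([str(year), str(month).zfill(2), str(day).zfill(2)])
    # "%Y-%m-%d" is the strptime counterpart of the "yyyy-MM-dd" pattern.
    return datetime.strptime(text, "%Y-%m-%d").date()


print(build_flight_date(2015, 1, 7))  # 2015-01-07
```

In this plain-Python version, an impossible combination such as February 30 raises a ValueError, which is a reminder that integer columns can hold values that do not form a valid calendar date.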

Extracting Date Components

Once you have a proper DateType column, PySpark provides built-in functions to extract components:

# Extracting useful date components as separate feature columns
flights_df = flights_df \
    .withColumn("MONTH_NUM", month(col("FLIGHT_DATE"))) \
    .withColumn("QUARTER", quarter(col("FLIGHT_DATE"))) \
    .withColumn("DAY_OF_YEAR", dayofyear(col("FLIGHT_DATE")))

flights_df.select("FLIGHT_DATE", "MONTH_NUM", "QUARTER", "DAY_OF_YEAR").show(5)
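As a way to spot-check the extracted values, the following plain-Python sketch computes the same components for a single date. The function name date_components is hypothetical. It also includes a day-of-week value, since dayofweek is imported in the lesson but not used above; the mapping assumes Spark's convention of numbering Sunday as 1 and Saturday as 7.

```python
from datetime import date


def date_components(d: date) -> dict:
    """Plain-Python counterparts of month, quarter, dayofyear and dayofweek."""
    return {
        "MONTH_NUM": d.month,
        # Months 1-3 fall in quarter 1, 4-6 in quarter 2, and so on.
        "QUARTER": (d.month - 1) // 3 + 1,
        "DAY_OF_YEAR": d.timetuple().tm_yday,
        # isoweekday() numbers Monday as 1 and Sunday as 7; shift it so
        # that Sunday becomes 1 and Saturday becomes 7.
        "DAY_OF_WEEK": d.isoweekday() % 7 + 1,
    }


# 31 December 2015 was a Thursday, the 365th day of a non-leap year.
print(date_components(date(2015, 12, 31)))
```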

Cyclic Encoding for Month and Hour

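Month and hour are cyclic: December is adjacent to January, and 23:00 is adjacent to 00:00. Encoded as raw integers, December (12) and January (1) sit 11 units apart, so a model treats them as distant. Sine and cosine encoding maps each value onto a circle, which preserves the adjacency: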

import pyspark.sql.functions as F
import math

# Encoding month cyclically so that month 12 is close to month 1
flights_df = flights_df \
    .withColumn("MONTH_SIN", F.sin(2 * math.pi * col("MONTH_NUM") / 12)) \
    .withColumn("MONTH_COS", F.cos(2 * math.pi * col("MONTH_NUM") / 12))

flights_df.select("MONTH_NUM", "MONTH_SIN", "MONTH_COS").distinct().orderBy("MONTH_NUM").show()
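The effect of the encoding can be checked numerically. The sketch below uses plain Python rather than PySpark and a hypothetical helper, cyclic_encode, to compare distances between encoded points. It applies the same formula to hours with a period of 24, on the assumption that hours are numbered 0 to 23.

```python
import math
from typing import Tuple


def cyclic_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a cyclic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Months use a period of 12, as in the MONTH_SIN / MONTH_COS columns.
dec, jan, jun = (cyclic_encode(m, 12) for m in (12, 1, 6))

# December and January are neighbours on the circle (about 0.518 apart),
# while December and June sit on opposite sides (about 2.0 apart).
print(round(math.dist(dec, jan), 3))
print(round(math.dist(dec, jun), 3))

# Hours use a period of 24 (assumed range 0-23): 23:00 and 00:00 end up close.
late, midnight, noon = (cyclic_encode(h, 24) for h in (23, 0, 12))
print(round(math.dist(late, midnight), 3))
print(round(math.dist(late, noon), 3))
```

Both the sine and the cosine are needed: either one alone maps two different months to the same value (for example, sine gives the same result for months 1 and 5), while the pair identifies each position on the circle uniquely.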

Why is cyclic encoding useful for month and hour features?
