Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Lære Handling Datetime Features | Section
Feature Engineering with PySpark

Handling Datetime Features

Stryg for at vise menuen

The flights dataset stores date information across three separate integer columns – YEAR, MONTH, and DAY. Combining them into a proper date and extracting meaningful components is a common feature engineering step.

Building a Date Column

12345678910111213141516171819202122232425262728293031
import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, concat_ws, lpad, to_date, dayofweek, month, quarter, dayofyear urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("DatetimeFeatures") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \ .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"]) # Combining YEAR, MONTH, DAY into a DateType column flights_df = flights_df.withColumn( "FLIGHT_DATE", to_date( concat_ws("-", col("YEAR").cast("string"), lpad(col("MONTH").cast("string"), 2, "0"), lpad(col("DAY").cast("string"), 2, "0") ), "yyyy-MM-dd" ) ) flights_df.select("YEAR", "MONTH", "DAY", "FLIGHT_DATE").show(5)

Extracting Date Components

Once you have a proper DateType column, PySpark provides built-in functions to extract components:

1234567
# Extracting useful date components as separate feature columns flights_df = flights_df \ .withColumn("MONTH_NUM", month(col("FLIGHT_DATE"))) \ .withColumn("QUARTER", quarter(col("FLIGHT_DATE"))) \ .withColumn("DAY_OF_YEAR", dayofyear(col("FLIGHT_DATE"))) flights_df.select("FLIGHT_DATE", "MONTH_NUM", "QUARTER", "DAY_OF_YEAR").show(5)

Cyclic Encoding for Month and Hour

Month and hour are cyclic – December is close to January, and 23:00 is close to 00:00. Encoding them as raw integers loses this property. Sine and cosine encoding preserves it:

123456789
import pyspark.sql.functions as F import math # Encoding month cyclically so that month 12 is close to month 1 flights_df = flights_df \ .withColumn("MONTH_SIN", F.sin(2 * math.pi * col("MONTH_NUM") / 12)) \ .withColumn("MONTH_COS", F.cos(2 * math.pi * col("MONTH_NUM") / 12)) flights_df.select("MONTH_NUM", "MONTH_SIN", "MONTH_COS").distinct().orderBy("MONTH_NUM").show()
question mark

Why is cyclic encoding useful for month and hour features?

Vælg det korrekte svar

Var alt klart?

Hvordan kan vi forbedre det?

Tak for dine kommentarer!

Sektion 1. Kapitel 8

Spørg AI

expand

Spørg AI

ChatGPT

Spørg om hvad som helst eller prøv et af de foreslåede spørgsmål for at starte vores chat

Sektion 1. Kapitel 8
some-alt