Handling Datetime Features
The flights dataset stores date information across three separate integer columns – YEAR, MONTH, and DAY. Combining them into a proper date and extracting meaningful components is a common feature engineering step.
Building a Date Column
    import urllib.request
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, concat_ws, lpad, to_date, dayofweek, month, quarter, dayofyear

    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("DatetimeFeatures") \
        .master("local[*]") \
        .getOrCreate()

    flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) \
        .fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

    # Combining YEAR, MONTH, DAY into a DateType column
    flights_df = flights_df.withColumn(
        "FLIGHT_DATE",
        to_date(
            concat_ws("-",
                col("YEAR").cast("string"),
                lpad(col("MONTH").cast("string"), 2, "0"),
                lpad(col("DAY").cast("string"), 2, "0")
            ),
            "yyyy-MM-dd"
        )
    )

    flights_df.select("YEAR", "MONTH", "DAY", "FLIGHT_DATE").show(5)
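The padding-and-parsing logic can be checked without a Spark session. The following is a minimal plain-Python sketch, not the Spark implementation: `build_date` is a hypothetical helper that mirrors the `lpad` and `to_date` steps with the standard library. It returns `None` for impossible dates, which is assumed here to match how `to_date` typically yields null when Spark's ANSI mode is disabled; check this against your own Spark configuration.

```python
from datetime import date, datetime
from typing import Optional


def build_date(year: int, month: int, day: int) -> Optional[date]:
    """Hypothetical helper: mirror the lpad + to_date steps in plain Python."""
    # Zero-pad month and day to two digits, like lpad(..., 2, "0")
    text = f"{year}-{month:02d}-{day:02d}"
    try:
        # "%Y-%m-%d" plays the role of the "yyyy-MM-dd" pattern
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        # Impossible dates (e.g. February 30) become None
        return None


print(build_date(2015, 1, 5))    # 2015-01-05
print(build_date(2015, 2, 30))   # None
```

A quick check like this is useful before running the Spark job, because a date that fails to parse would otherwise surface later as a missing value in every derived feature.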
Extracting Date Components
Once you have a proper DateType column, you can extract its components with PySpark's built-in functions:
    # Extracting useful date components as separate feature columns
    flights_df = flights_df \
        .withColumn("MONTH_NUM", month(col("FLIGHT_DATE"))) \
        .withColumn("QUARTER", quarter(col("FLIGHT_DATE"))) \
        .withColumn("DAY_OF_YEAR", dayofyear(col("FLIGHT_DATE")))

    flights_df.select("FLIGHT_DATE", "MONTH_NUM", "QUARTER", "DAY_OF_YEAR").show(5)
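To make the meaning of each component concrete, here is a plain-Python sketch of the same calculations. `date_components` is a hypothetical helper, not part of PySpark. It also includes a day-of-week value, which the lesson imports (`dayofweek`) but does not extract; the conversion assumes Spark's convention of numbering Sunday as 1 and Saturday as 7.

```python
from datetime import date


def date_components(d: date) -> dict:
    """Hypothetical helper: plain-Python analogues of the Spark date functions."""
    return {
        "MONTH_NUM": d.month,                   # 1..12, like month()
        "QUARTER": (d.month - 1) // 3 + 1,      # 1..4, like quarter()
        "DAY_OF_YEAR": d.timetuple().tm_yday,   # 1..366, like dayofyear()
        # Spark's dayofweek() numbers Sunday as 1 and Saturday as 7,
        # while isoweekday() numbers Monday as 1 and Sunday as 7.
        "DAY_OF_WEEK": d.isoweekday() % 7 + 1,
    }


# 2015-01-01 was a Thursday
print(date_components(date(2015, 1, 1)))
# {'MONTH_NUM': 1, 'QUARTER': 1, 'DAY_OF_YEAR': 1, 'DAY_OF_WEEK': 5}
```

Keeping the weekday convention explicit matters when features computed in Spark are later compared with values computed in pandas or plain Python, which number the days differently.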
Cyclic Encoding for Month and Hour
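Month and hour are cyclic – December is adjacent to January, and 23:00 is adjacent to 00:00. Raw integers lose this property: as numbers, 12 and 1 are 11 apart. Sine and cosine encoding preserves it by mapping each value to a point on a circle, so that neighbouring values stay close together: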
    import pyspark.sql.functions as F
    import math

    # Encoding month cyclically so that month 12 is close to month 1
    flights_df = flights_df \
        .withColumn("MONTH_SIN", F.sin(2 * math.pi * col("MONTH_NUM") / 12)) \
        .withColumn("MONTH_COS", F.cos(2 * math.pi * col("MONTH_NUM") / 12))

    flights_df.select("MONTH_NUM", "MONTH_SIN", "MONTH_COS").distinct().orderBy("MONTH_NUM").show()
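The claim that this encoding keeps December close to January can be verified numerically. The sketch below uses only the standard library; `cyclic_encode` and `distance` are hypothetical helpers introduced for illustration. The same formula is applied to hours with a period of 24, on the assumption that an hour-of-day value (0 to 23) is available, which the lesson's code does not derive.

```python
import math


def cyclic_encode(value: float, period: float) -> tuple:
    """Map a cyclic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a: tuple, b: tuple) -> float:
    """Euclidean distance between two encoded points."""
    return math.dist(a, b)


dec, jan, feb, jul = (cyclic_encode(m, 12) for m in (12, 1, 2, 7))
print(round(distance(dec, jan), 4))   # 0.5176
print(round(distance(jan, feb), 4))   # 0.5176
print(round(distance(jan, jul), 4))   # 2.0

# Hours use the same formula with a period of 24
late, midnight = cyclic_encode(23, 24), cyclic_encode(0, 24)
print(round(distance(late, midnight), 4))  # 0.2611
```

December to January is exactly as far as January to February, while months half a year apart sit on opposite sides of the circle. Both the sine and the cosine column are needed: either one alone maps two different months to the same value.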