Data Processing with PySpark

Casting and Converting Data Types

inferSchema=True picks a sensible type for most columns, but it sometimes gets types wrong, and you may also need to convert columns for downstream processing. PySpark provides explicit casting and conversion functions for both cases.

Checking and Casting Types

```python
import urllib.request

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, FloatType, StringType

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("TypeCasting") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Checking current types
flights_df.printSchema()

# Casting DEPARTURE_DELAY from float to integer
flights_df = flights_df.withColumn(
    "DEPARTURE_DELAY",
    col("DEPARTURE_DELAY").cast(IntegerType())
)

# Casting FLIGHT_NUMBER to string
flights_df = flights_df.withColumn(
    "FLIGHT_NUMBER",
    col("FLIGHT_NUMBER").cast(StringType())
)

flights_df.select("DEPARTURE_DELAY", "FLIGHT_NUMBER").printSchema()
```

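If a value cannot be cast, for example the string "ABC" to IntegerType, PySpark replaces it with null instead of raising an error. This is the behavior when ANSI mode (spark.sql.ansi.enabled) is off, which is the default before Spark 4.0; with ANSI mode on, an invalid cast raises an error instead.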

Building Datetime Columns

The flights dataset stores date components as separate integers. You can combine them into a proper date column:

```python
# Continues from the previous snippet (uses flights_df)
from pyspark.sql.functions import col, lpad, concat_ws, to_date
from pyspark.sql.types import StringType

# Constructing a date string and converting to DateType
flights_df = flights_df.withColumn(
    "FLIGHT_DATE",
    to_date(
        concat_ws(
            "-",
            col("YEAR").cast(StringType()),
            lpad(col("MONTH").cast(StringType()), 2, "0"),
            lpad(col("DAY").cast(StringType()), 2, "0")
        ),
        "yyyy-MM-dd"
    )
)

flights_df.select("YEAR", "MONTH", "DAY", "FLIGHT_DATE").show(5)
```

lpad left-pads single-digit months and days with "0" so the format matches yyyy-MM-dd.


What happens when PySpark cannot cast a value to the target type?

Section 1. Chapter 3