Impara Handling Missing Values and Duplicates

Scorri per mostrare il menu

Real-world datasets are rarely clean. The flights dataset contains nulls in delay columns for cancelled flights, and potentially duplicate rows from data pipeline issues. Handling these correctly before analysis prevents skewed results.

Detecting Nulls


              123456789101112131415161718192021
            
import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("MissingDuplicates") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Counting nulls per column
flights_df.select([
    count(when(col(c).isNull(), c)).alias(c)
    for c in ["DEPARTURE_DELAY", "ARRIVAL_DELAY", "CANCELLATION_REASON", "AIR_TIME"]
]).show()

Dropping Nulls


              1234
            
# Dropping rows where ARRIVAL_DELAY is null
clean_df = flights_df.dropna(subset=["ARRIVAL_DELAY"])

print(f"Before: {flights_df.count()}, After: {clean_df.count()}")

subset limits dropping to specific columns – without it, any row with a null in any column is removed, which is often too aggressive.

Filling Nulls

For delay columns, nulls typically mean the flight was cancelled – filling with 0 would be misleading. A better approach is to fill with a sentinel value or keep them as nulls and filter before aggregation:


              12345
            
# Filling nulls in delay columns with 0 (use with care)
filled_df = flights_df.fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Filling CANCELLATION_REASON nulls with a placeholder string
filled_df = filled_df.fillna("N/A", subset=["CANCELLATION_REASON"])

Removing Duplicates


              1234567
            
# Removing fully duplicate rows
deduped_df = flights_df.dropDuplicates()

# Removing duplicates based on key columns only
deduped_df = flights_df.dropDuplicates(["AIRLINE", "FLIGHT_NUMBER", "YEAR", "MONTH", "DAY"])

print(f"Before: {flights_df.count()}, After: {deduped_df.count()}")

Tutto è chiaro?

Grazie per i tuoi commenti!

Sezione 1. Capitolo 2

Chieda ad AI

Chieda pure quello che desidera o provi una delle domande suggerite per iniziare la nostra conversazione

Sezione 1. Capitolo 2