Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Lære Handling Missing Values and Duplicates | Section
Data Processing with PySpark

Handling Missing Values and Duplicates

Sveip for å vise menyen

Real-world datasets are rarely clean. The flights dataset contains nulls in delay columns for cancelled flights, and potentially duplicate rows from data pipeline issues. Handling these correctly before analysis prevents skewed results.

Detecting Nulls

123456789101112131415161718192021
import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col, count, when urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("MissingDuplicates") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) # Counting nulls per column flights_df.select([ count(when(col(c).isNull(), c)).alias(c) for c in ["DEPARTURE_DELAY", "ARRIVAL_DELAY", "CANCELLATION_REASON", "AIR_TIME"] ]).show()

Dropping Nulls

1234
# Dropping rows where ARRIVAL_DELAY is null clean_df = flights_df.dropna(subset=["ARRIVAL_DELAY"]) print(f"Before: {flights_df.count()}, After: {clean_df.count()}")

subset limits dropping to specific columns – without it, any row with a null in any column is removed, which is often too aggressive.

Filling Nulls

For delay columns, nulls typically mean the flight was cancelled – filling with 0 would be misleading. A better approach is to fill with a sentinel value or keep them as nulls and filter before aggregation:

12345
# Filling nulls in delay columns with 0 (use with care) filled_df = flights_df.fillna(0, subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"]) # Filling CANCELLATION_REASON nulls with a placeholder string filled_df = filled_df.fillna("N/A", subset=["CANCELLATION_REASON"])

Removing Duplicates

1234567
# Removing fully duplicate rows deduped_df = flights_df.dropDuplicates() # Removing duplicates based on key columns only deduped_df = flights_df.dropDuplicates(["AIRLINE", "FLIGHT_NUMBER", "YEAR", "MONTH", "DAY"]) print(f"Before: {flights_df.count()}, After: {deduped_df.count()}")
question mark

What does dropna(subset=["ARRIVAL_DELAY"]) do?

Velg det helt riktige svaret

Alt var klart?

Hvordan kan vi forbedre det?

Takk for tilbakemeldingene dine!

Seksjon 1. Kapittel 2

Spør AI

expand

Spør AI

ChatGPT

Spør om hva du vil, eller prøv ett av de foreslåtte spørsmålene for å starte chatten vår

Seksjon 1. Kapittel 2
some-alt