Joins in PySpark: Inner, Left, and Anti
Glissez pour afficher le menu
Joins combine two DataFrames on a common key. PySpark supports all standard join types – the most useful for analytical work are inner, left, and anti.
Setting Up Two DataFrames
To demonstrate joins, create a small airline names DataFrame alongside the flights data:
123456789101112131415161718192021222324import urllib.request from pyspark.sql import SparkSession from pyspark.sql import Row urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("Joins") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) # Small reference DataFrame with full airline names airlines_df = spark.createDataFrame([ Row(IATA="AA", NAME="American Airlines"), Row(IATA="DL", NAME="Delta Air Lines"), Row(IATA="UA", NAME="United Airlines"), Row(IATA="WN", NAME="Southwest Airlines"), Row(IATA="AS", NAME="Alaska Airlines"), ])
Inner Join
Returns only rows where the key exists in both DataFrames:
123# Joining flights with airline names inner_df = flights_df.join(airlines_df, flights_df["AIRLINE"] == airlines_df["IATA"], "inner") inner_df.select("AIRLINE", "NAME", "ORIGIN_AIRPORT", "ARRIVAL_DELAY").show(5)
Left Join
Returns all rows from the left DataFrame, with nulls for unmatched rows from the right:
12left_df = flights_df.join(airlines_df, flights_df["AIRLINE"] == airlines_df["IATA"], "left") left_df.select("AIRLINE", "NAME", "ORIGIN_AIRPORT").show(5)
Anti Join
Returns rows from the left DataFrame that have no match in the right – useful for finding orphaned records:
123# Flights whose airline code is not in the reference table unmatched_df = flights_df.join(airlines_df, flights_df["AIRLINE"] == airlines_df["IATA"], "left_anti") print(unmatched_df.select("AIRLINE").distinct().collect())
Tout était clair ?
Merci pour vos commentaires !
Section 1. Chapitre 7
Demandez à l'IA
Demandez à l'IA
Posez n'importe quelle question ou essayez l'une des questions suggérées pour commencer notre discussion
Section 1. Chapitre 7