Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Oppiskele Joins in PySpark: Inner, Left, and Anti | Section
Data Processing with PySpark

Joins in PySpark: Inner, Left, and Anti

Pyyhkäise näyttääksesi valikon

Joins combine two DataFrames on a common key. PySpark supports all standard join types – the most useful for analytical work are inner, left, and anti.

Setting Up Two DataFrames

To demonstrate joins, create a small airline names DataFrame alongside the flights data:

123456789101112131415161718192021222324
import urllib.request from pyspark.sql import SparkSession from pyspark.sql import Row urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("Joins") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) # Small reference DataFrame with full airline names airlines_df = spark.createDataFrame([ Row(IATA="AA", NAME="American Airlines"), Row(IATA="DL", NAME="Delta Air Lines"), Row(IATA="UA", NAME="United Airlines"), Row(IATA="WN", NAME="Southwest Airlines"), Row(IATA="AS", NAME="Alaska Airlines"), ])

Inner Join

Returns only rows where the key exists in both DataFrames:

123
# Joining flights with airline names inner_df = flights_df.join(airlines_df, flights_df["AIRLINE"] == airlines_df["IATA"], "inner") inner_df.select("AIRLINE", "NAME", "ORIGIN_AIRPORT", "ARRIVAL_DELAY").show(5)

Left Join

Returns all rows from the left DataFrame, with nulls for unmatched rows from the right:

12
left_df = flights_df.join(airlines_df, flights_df["AIRLINE"] == airlines_df["IATA"], "left") left_df.select("AIRLINE", "NAME", "ORIGIN_AIRPORT").show(5)

Anti Join

Returns rows from the left DataFrame that have no match in the right – useful for finding orphaned records:

123
# Flights whose airline code is not in the reference table unmatched_df = flights_df.join(airlines_df, flights_df["AIRLINE"] == airlines_df["IATA"], "left_anti") print(unmatched_df.select("AIRLINE").distinct().collect())
question mark

What does a left anti join return?

Valitse oikea vastaus

Oliko kaikki selvää?

Miten voimme parantaa sitä?

Kiitos palautteestasi!

Osio 1. Luku 7

Kysy tekoälyä

expand

Kysy tekoälyä

ChatGPT

Kysy mitä tahansa tai kokeile jotakin ehdotetuista kysymyksistä aloittaaksesi keskustelumme

Osio 1. Luku 7
some-alt