Sorting and Aggregating Data

Sorting and aggregation underpin most analytical queries. In PySpark, orderBy() and groupBy() correspond to SQL ORDER BY and GROUP BY: the semantics are the same, expressed through the DataFrame API instead of a query string.

Sorting


import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("SortAggregate") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Show the 5 flights with the largest arrival delay (descending sort)
flights_df.select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "ARRIVAL_DELAY") \
    .orderBy(col("ARRIVAL_DELAY").desc()) \
    .show(5)
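
One detail worth keeping in mind: for a descending sort, Spark places null values last by default, so flights without a recorded ARRIVAL_DELAY do not crowd out the top rows (Column.desc_nulls_first() changes that). The sketch below is a plain-Python model of that ordering, not PySpark code; the sample rows and the helper name order_by_desc_nulls_last are made up for illustration.

```python
# Hypothetical sample rows standing in for flights_df; None models a null delay.
rows = [
    {"AIRLINE": "AA", "ARRIVAL_DELAY": 12},
    {"AIRLINE": "DL", "ARRIVAL_DELAY": None},
    {"AIRLINE": "UA", "ARRIVAL_DELAY": 95},
    {"AIRLINE": "AA", "ARRIVAL_DELAY": -4},
]


def order_by_desc_nulls_last(rows, column):
    """Model of orderBy(col(column).desc()): largest value first, nulls at the end."""
    non_null = [r for r in rows if r[column] is not None]
    nulls = [r for r in rows if r[column] is None]
    return sorted(non_null, key=lambda r: r[column], reverse=True) + nulls


# Taking the first two rows is analogous to .show(2) on the sorted DataFrame.
for row in order_by_desc_nulls_last(rows, "ARRIVAL_DELAY")[:2]:
    print(row["AIRLINE"], row["ARRIVAL_DELAY"])
```

The row with the missing delay ends up at the bottom of the full result rather than at the top.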

Aggregating with `groupBy`


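# Note: importing max and round by name shadows Python's built-ins in this scope
from pyspark.sql.functions import avg, count, max, round

# Flight count, average arrival delay, and maximum arrival delay per airline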
flights_df.groupBy("AIRLINE") \
    .agg(
        count("*").alias("TOTAL_FLIGHTS"),
        round(avg("ARRIVAL_DELAY"), 2).alias("AVG_DELAY"),
        max("ARRIVAL_DELAY").alias("MAX_DELAY")
    ) \
    .orderBy(col("AVG_DELAY").desc()) \
    .show()

agg() computes several aggregations in a single groupBy pass, which is more efficient than running a separate groupBy for each metric and combining the results afterwards.
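
To make the single-pass idea concrete, here is a plain-Python sketch of what such an aggregation computes: one loop over the rows, with a small accumulator per airline. It is a conceptual model under stated assumptions, not how Spark is implemented; the sample data and the function name group_and_aggregate are hypothetical. It also reflects two SQL rules that apply in Spark: count("*") counts every row, while avg and max skip nulls.

```python
from collections import defaultdict

# Hypothetical (AIRLINE, ARRIVAL_DELAY) rows; None models a null delay.
rows = [
    ("AA", 10.0), ("AA", None), ("AA", 35.5),
    ("DL", -3.0), ("DL", 8.0),
]


def group_and_aggregate(rows):
    """One pass over the rows, tracking all metrics per airline at once."""
    # Accumulator per key: [row count, non-null count, sum of delays, max delay]
    state = defaultdict(lambda: [0, 0, 0.0, None])
    for airline, delay in rows:
        acc = state[airline]
        acc[0] += 1                      # count("*") includes rows with nulls
        if delay is not None:            # avg and max ignore nulls
            acc[1] += 1
            acc[2] += delay
            acc[3] = delay if acc[3] is None else max(acc[3], delay)

    result = {}
    for airline, (total, non_null, delay_sum, max_delay) in state.items():
        # Python's round can differ from Spark's round on exact ties.
        avg_delay = round(delay_sum / non_null, 2) if non_null else None
        result[airline] = {
            "TOTAL_FLIGHTS": total,
            "AVG_DELAY": avg_delay,
            "MAX_DELAY": max_delay,
        }
    return result


for airline, metrics in group_and_aggregate(rows).items():
    print(airline, metrics)
```

Note that for "AA" the average is taken over two delays even though three flights are counted.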

Filtering After Aggregation

To filter on an aggregated value, call filter() on the result of agg(), where the aliased aggregate columns exist. This is the equivalent of SQL HAVING:


# Airlines with more than 5000 flights and average delay above 10 minutes
flights_df.groupBy("AIRLINE") \
    .agg(
        count("*").alias("TOTAL_FLIGHTS"),
        round(avg("ARRIVAL_DELAY"), 2).alias("AVG_DELAY")
    ) \
    .filter((col("TOTAL_FLIGHTS") > 5000) & (col("AVG_DELAY") > 10)) \
    .orderBy(col("AVG_DELAY").desc()) \
    .show()
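
The filter above works on one row per airline, not on individual flights. Each comparison is wrapped in parentheses because Python's & operator binds more tightly than >. As a rough plain-Python model of the HAVING step (the aggregated figures and the helper name having are invented for illustration, not taken from flights.csv):

```python
# Hypothetical output of the groupBy/agg step: one row per airline.
aggregated = [
    {"AIRLINE": "AA", "TOTAL_FLIGHTS": 8900, "AVG_DELAY": 12.4},
    {"AIRLINE": "DL", "TOTAL_FLIGHTS": 12000, "AVG_DELAY": 4.1},
    {"AIRLINE": "NK", "TOTAL_FLIGHTS": 3100, "AVG_DELAY": 15.9},
    {"AIRLINE": "UA", "TOTAL_FLIGHTS": 7600, "AVG_DELAY": 10.0},
]


def having(rows, min_flights, min_avg_delay):
    """Keep groups that pass both thresholds, sorted by average delay descending."""
    kept = [
        r for r in rows
        if r["TOTAL_FLIGHTS"] > min_flights and r["AVG_DELAY"] > min_avg_delay
    ]
    return sorted(kept, key=lambda r: r["AVG_DELAY"], reverse=True)


for row in having(aggregated, 5000, 10):
    print(row["AIRLINE"], row["TOTAL_FLIGHTS"], row["AVG_DELAY"])
```

With these made-up numbers only "AA" survives: "NK" has too few flights, "DL" is too punctual, and "UA" sits exactly at 10, which the strict > comparison excludes.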

Section 1. Chapter 10
