Aprenda Sorting and Aggregating Data

Deslize para mostrar o menu

Sorting and aggregation are the foundation of any analytical query. In PySpark they map directly to SQL ORDER BY and GROUP BY, with the same semantics but a DataFrame API.

Sorting


              1234567891011121314151617181920
            
import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("SortAggregate") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Sorting by arrival delay descending
flights_df.select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "ARRIVAL_DELAY") \
    .orderBy(col("ARRIVAL_DELAY").desc()) \
    .show(5)

Aggregating with `groupBy`


              1234567891011
            
from pyspark.sql.functions import avg, count, max, round

# Average arrival delay per airline
flights_df.groupBy("AIRLINE") \
    .agg(
        count("*").alias("TOTAL_FLIGHTS"),
        round(avg("ARRIVAL_DELAY"), 2).alias("AVG_DELAY"),
        max("ARRIVAL_DELAY").alias("MAX_DELAY")
    ) \
    .orderBy(col("AVG_DELAY").desc()) \
    .show()

agg() lets you compute multiple aggregations in a single groupBy pass – more efficient than chaining separate operations.

Filtering After Aggregation

To filter on an aggregated value, use filter() after groupBy() – equivalent to SQL HAVING:


              123456789
            
# Airlines with more than 5000 flights and average delay above 10 minutes
flights_df.groupBy("AIRLINE") \
    .agg(
        count("*").alias("TOTAL_FLIGHTS"),
        round(avg("ARRIVAL_DELAY"), 2).alias("AVG_DELAY")
    ) \
    .filter((col("TOTAL_FLIGHTS") > 5000) & (col("AVG_DELAY") > 10)) \
    .orderBy(col("AVG_DELAY").desc()) \
    .show()

Tudo estava claro?

Obrigado pelo seu feedback!

Seção 1. Capítulo 10

Pergunte à IA

Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo

Seção 1. Capítulo 10

Sorting and Aggregating Data

Sorting

Aggregating with groupBy

Filtering After Aggregation

Aggregating with `groupBy`