Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Learn Sorting and Aggregating Data | Section
Introduction to PySpark

Sorting and Aggregating Data

Swipe to show menu

Sorting and aggregation are the foundation of any analytical query. In PySpark they map directly to SQL ORDER BY and GROUP BY, with the same semantics but a DataFrame API.

Sorting

1234567891011121314151617181920
import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("SortAggregate") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) # Sorting by arrival delay descending flights_df.select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "ARRIVAL_DELAY") \ .orderBy(col("ARRIVAL_DELAY").desc()) \ .show(5)

Aggregating with groupBy

1234567891011
from pyspark.sql.functions import avg, count, max, round # Average arrival delay per airline flights_df.groupBy("AIRLINE") \ .agg( count("*").alias("TOTAL_FLIGHTS"), round(avg("ARRIVAL_DELAY"), 2).alias("AVG_DELAY"), max("ARRIVAL_DELAY").alias("MAX_DELAY") ) \ .orderBy(col("AVG_DELAY").desc()) \ .show()

agg() lets you compute multiple aggregations in a single groupBy pass – more efficient than chaining separate operations.

Filtering After Aggregation

To filter on an aggregated value, use filter() after groupBy() – equivalent to SQL HAVING:

123456789
# Airlines with more than 5000 flights and average delay above 10 minutes flights_df.groupBy("AIRLINE") \ .agg( count("*").alias("TOTAL_FLIGHTS"), round(avg("ARRIVAL_DELAY"), 2).alias("AVG_DELAY") ) \ .filter((col("TOTAL_FLIGHTS") > 5000) & (col("AVG_DELAY") > 10)) \ .orderBy(col("AVG_DELAY").desc()) \ .show()
question mark

What is the correct way to filter on an aggregated value in PySpark?

Select the correct answer

Everything was clear?

How can we improve it?

Thanks for your feedback!

Section 1. Chapter 10

Ask AI

expand

Ask AI

ChatGPT

Ask anything or try one of the suggested questions to begin our chat

Section 1. Chapter 10
some-alt