Sorting and Aggregating Data
Stryg for at vise menuen
Sorting and aggregation are the foundation of any analytical query. In PySpark they map directly to SQL ORDER BY and GROUP BY, with the same semantics but a DataFrame API.
Sorting
1234567891011121314151617181920import urllib.request from pyspark.sql import SparkSession from pyspark.sql.functions import col urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("SortAggregate") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) # Sorting by arrival delay descending flights_df.select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "ARRIVAL_DELAY") \ .orderBy(col("ARRIVAL_DELAY").desc()) \ .show(5)
Aggregating with groupBy
1234567891011from pyspark.sql.functions import avg, count, max, round # Average arrival delay per airline flights_df.groupBy("AIRLINE") \ .agg( count("*").alias("TOTAL_FLIGHTS"), round(avg("ARRIVAL_DELAY"), 2).alias("AVG_DELAY"), max("ARRIVAL_DELAY").alias("MAX_DELAY") ) \ .orderBy(col("AVG_DELAY").desc()) \ .show()
agg() lets you compute multiple aggregations in a single groupBy pass – more efficient than chaining separate operations.
Filtering After Aggregation
To filter on an aggregated value, use filter() after groupBy() – equivalent to SQL HAVING:
123456789# Airlines with more than 5000 flights and average delay above 10 minutes flights_df.groupBy("AIRLINE") \ .agg( count("*").alias("TOTAL_FLIGHTS"), round(avg("ARRIVAL_DELAY"), 2).alias("AVG_DELAY") ) \ .filter((col("TOTAL_FLIGHTS") > 5000) & (col("AVG_DELAY") > 10)) \ .orderBy(col("AVG_DELAY").desc()) \ .show()
Var alt klart?
Tak for dine kommentarer!
Sektion 1. Kapitel 10
Spørg AI
Spørg AI
Spørg om hvad som helst eller prøv et af de foreslåede spørgsmål for at starte vores chat
Sektion 1. Kapitel 10