Aggregation and Grouping in Python
Aggregation is the process of summarizing multiple rows of data into a single meaningful value, such as a sum, average, or count. Grouping allows you to apply these summaries across specific categories, such as finding the total sales for each distinct region.
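To build intuition before touching Spark, here is what "group, then aggregate" means in plain Python. The sales list below is hypothetical sample data, invented just for this sketch:

```python
from collections import defaultdict

# Hypothetical rows: (region, sale amount)
sales = [("East", 100), ("West", 80), ("East", 50)]

# Group rows by region and sum the amounts within each group
totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount

# totals now holds one summary value per category:
# {"East": 150, "West": 80}
```

Spark's groupBy() and agg() do conceptually the same thing, but distributed across many machines and many millions of rows.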
Rarely do you need to look at millions of individual rows of raw data. Usually, you want the big picture: totals, averages, or counts per category. In Spark, you achieve this by combining two powerful methods: groupBy() and agg().
The Basic groupBy Pattern
To summarize data by a specific category, you first use the groupBy() method. This tells Spark to gather all rows that share the same value (like "cut" in the diamonds table) into a group. However, grouping by itself produces no result; groupBy() returns a GroupedData object, and you must follow it with an aggregation to get a DataFrame back.
# Group by cut and count how many rows are in each
count_df = df.groupBy("cut").count()
display(count_df)
Performing Math with sum(), avg(), and max()
Once you have grouped your data, you can apply mathematical functions to your numeric columns. For example, if you wanted the total profit per item type in a sales table, you would use .sum(). Here, we total the x column for each cut.
# Total of the x column per cut category
total_x = df.groupBy("cut").sum("x")
display(total_x)
Notice that Spark automatically renames the column to sum(x). In the next chapter, we will learn how to make these names look more professional.
The agg() Method for Multiple Metrics
If you need to calculate more than one thing at a time—for example, both the average profit and the maximum revenue for each region—you use the .agg() (aggregate) method. This is the professional standard for building complex summaries.
from pyspark.sql import functions as F
# Calculate multiple metrics at once
summary_df = df.groupBy("cut").agg(
F.sum("x"),
F.avg("y"),
F.max("z")
)
display(summary_df)
We import pyspark.sql.functions as F to access aggregation functions such as sum, avg, and max.
Grouping by Multiple Columns
You aren't limited to grouping by just one category. You can pass multiple columns to see data at a more granular level, such as the total of x for every color within each cut.
multi_group_df = df.groupBy("cut", "color").sum("x")
display(multi_group_df)
Sorting the Results
Aggregated data is often easier to read when sorted. You can chain the .orderBy() method at the end of your aggregation to see your top-performing categories at the top of the list.
# Show the categories with the largest sum(x) first
sorted_df = summary_df.orderBy("sum(x)", ascending=False)
display(sorted_df)
1. Which method must you call BEFORE applying a sum() or avg() if you want the results broken down by category?
2. What is the benefit of using the .agg() method instead of just .sum()?