Learn Aggregation and Grouping in Python | Working with Data
Databricks Fundamentals: A Beginner's Guide

Aggregation and Grouping in Python


Definition

Aggregation is the process of summarizing multiple rows of data into a single meaningful value, such as a sum, average, or count. Grouping allows you to apply these summaries across specific categories, such as finding the total sales for each distinct region.

Rarely do you need to look at millions of individual rows of raw data. Usually, you want the "big picture": totals, averages, or counts per category. In Spark, you achieve this by combining two powerful methods: groupBy() and agg().

The Basic groupBy Pattern

To summarize data by a specific category, you first use the groupBy() method. This tells Spark to gather all rows that share the same value (like "cut" in the diamonds table) into a group. However, grouping by itself doesn't do anything; you must follow it with an aggregation.

# Group by cut and count how many rows are in each
count_df = df.groupBy("cut").count()

display(count_df)

Performing Math with sum(), avg(), and max()

Once you have grouped your data, you can apply mathematical functions to your numeric columns. For example, to see the total of the x column per cut, you would use .sum().

# Total of column x per cut category
total_x = df.groupBy("cut").sum("x")

display(total_x)

Notice that Spark automatically renames the column to sum(x). In the next chapter, we will learn how to make these names look more professional.

The agg() Method for Multiple Metrics

If you need to calculate more than one metric at a time, for example both the average of one column and the maximum of another for each group, you use the .agg() (aggregate) method. This is the professional standard for building complex summaries.

from pyspark.sql import functions as F

# Calculate multiple metrics at once
summary_df = df.groupBy("cut").agg(
    F.sum("x"),
    F.avg("y"),
    F.max("z")
)

display(summary_df)
Note

We import pyspark.sql.functions as F to access these powerful mathematical tools.

Grouping by Multiple Columns

You aren't limited to grouping by just one category. You can pass multiple columns to see data at a more granular level, such as the total of x for every color within every cut.

multi_group_df = df.groupBy("cut", "color").sum("x")

display(multi_group_df)

Sorting the Results

Aggregated data is often easier to read when sorted. You can chain the .orderBy() method at the end of your aggregation to see your top-performing categories at the top of the list.

# Show the categories with the largest totals first
sorted_df = summary_df.orderBy("sum(x)", ascending=False)

display(sorted_df)

1. Which method must you call BEFORE applying a sum() or avg() if you want the results broken down by category?

2. What is the benefit of using the .agg() method instead of just .sum()?


Section 4. Chapter 6
