Aggregation and Grouping in Python
Aggregation is the process of summarizing multiple rows of data into a single meaningful value, such as a sum, average, or count. Grouping allows you to apply these summaries across specific categories, such as finding the total sales for each distinct region.
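To build intuition before touching Spark, here is what "group, then aggregate" means in plain Python. The sales list below is hypothetical sample data, invented just for this sketch:

```python
from collections import defaultdict

# Hypothetical rows: (region, sale amount)
sales = [("East", 100), ("West", 80), ("East", 50)]

# Group rows by region and sum the amounts within each group
totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount

# totals now holds one summary value per category:
# {"East": 150, "West": 80}
```

Spark's groupBy() and agg() do conceptually the same thing, but distributed across many machines and many millions of rows.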
Rarely do you need to look at millions of individual rows of raw data. Usually, you want the big picture: totals, averages, or counts per category. In Spark, you achieve this by combining two powerful methods: groupBy() and agg().
The Basic groupBy Pattern
To summarize data by a specific category, you first use the groupBy() method. This tells Spark to gather all rows that share the same value (like "cut" in the diamonds table) into a group. However, grouping by itself produces no result; groupBy() returns a GroupedData object, and you must follow it with an aggregation to get a DataFrame back.
# Group by cut and count how many rows are in each
count_df = df.groupBy("cut").count()
display(count_df)
Performing Math with sum(), avg(), and max()
Once you have grouped your data, you can apply mathematical functions to your numeric columns. For example, if you wanted the total profit per item type in a sales table, you would use .sum(). Here, we total the x column for each cut.
# Total of the x column per cut category
total_x = df.groupBy("cut").sum("x")
display(total_x)
Notice that Spark automatically renames the column to sum(x). In the next chapter, we will learn how to make these names look more professional.
The agg() Method for Multiple Metrics
If you need to calculate more than one thing at a time—for example, both the average profit and the maximum revenue for each region—you use the .agg() (aggregate) method. This is the professional standard for building complex summaries.
from pyspark.sql import functions as F
# Calculate multiple metrics at once
summary_df = df.groupBy("cut").agg(
F.sum("x"),
F.avg("y"),
F.max("z")
)
display(summary_df)
We import pyspark.sql.functions as F to access aggregation functions such as sum, avg, and max.
Grouping by Multiple Columns
You aren't limited to grouping by just one category. You can pass multiple columns to see data at a more granular level, such as the total of x for every color within each cut.
multi_group_df = df.groupBy("cut", "color").sum("x")
display(multi_group_df)
Sorting the Results
Aggregated data is often easier to read when sorted. You can chain the .orderBy() method at the end of your aggregation to see your top-performing categories at the top of the list.
# Show the categories with the largest sum(x) first
sorted_df = summary_df.orderBy("sum(x)", ascending=False)
display(sorted_df)
1. Which method must you call BEFORE applying a sum() or avg() if you want the results broken down by category?
2. What is the benefit of using the .agg() method instead of just .sum()?