Challenge: Your First Data Cleaning Task
A data pipeline typically follows a pattern: Load, Clean, Transform, and Save. In this mini-project, we apply these steps to a specific dataset to move from raw data to a clean, aggregated summary ready for business reporting.
To conclude Section 4, let's put everything you've learned into practice. While you can apply these steps to any dataset of your choice, we will demonstrate the solution using the popular Diamonds dataset. This dataset contains the prices and attributes (cut, color, clarity, etc.) of nearly 54,000 diamonds.
You can follow along with the video and do the task yourself using the same dataset (widely available on Kaggle as diamonds.csv), or you can select any dataset of your choice and apply the same principles by setting a similar goal.
Ideally, pause the video (or stop reading) after the objective and try the project on your own, no matter how comfortable you feel with your knowledge of Databricks. Even a rough attempt will be far more beneficial than simply reading through the solution. If you are short on time, continue with the solution now, but do revisit this as an exercise later.
The Objective
Your goal is to identify the average price and average carats for "Premium" cut diamonds, grouped by their color, and save only the high-value results (where the average price is above $4,000) to a new table.
Step 1: Load and Inspect
First, load the data and check the schema to ensure the prices and carats are recognized as numeric values.
# Assuming the diamonds CSV was uploaded to a Volume
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/main/default/my_volume/diamonds.csv")
df.printSchema()
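Beyond the schema, a quick preview helps confirm the values themselves look sensible. A minimal sanity check, reusing the df loaded above:
# Preview the first few rows to confirm the load worked
display(df.limit(5))
# Confirm price and carat were inferred as numeric types, not strings
print(df.schema["price"].dataType)
print(df.schema["carat"].dataType)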
Step 2: Filter and Select
You don't need all the columns, and you only care about "Premium" diamonds for this specific task:
# Filter for Premium cut and select relevant columns
premium_df = df.filter(df.cut == "Premium").select("color", "price", "carat")
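A row count after filtering is a cheap way to confirm the filter actually matched something. A small sketch using the premium_df defined above:
# Check how many rows survived the Premium filter
print(premium_df.count())
# Peek at a few rows to confirm the columns you kept
display(premium_df.limit(5))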
Step 3: Aggregate and Group
Now you can calculate the averages. Use the pyspark.sql.functions library to ensure the math is accurate:
from pyspark.sql import functions as F
# Group by color and calculate averages
summary_df = premium_df.groupBy("color").agg(
    F.avg("price").alias("avg_price"),
    F.avg("carat").alias("avg_carat")
)
.alias() simply renames a column in your query result — like a temporary nickname. It doesn't change the actual table, just how the column appears in the output.
Think of it like AS in SQL — SELECT price AS price_in_usd.
Useful when column names are long, unclear, or when you're creating a calculated column and need to give it a readable name.
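For illustration, here is the same renaming done in PySpark; the name price_in_usd is just a hypothetical label, not something the dataset requires:
from pyspark.sql import functions as F
# Rename "price" to "price_in_usd" in this query's output only
renamed_df = df.select(F.col("price").alias("price_in_usd"))
renamed_df.printSchema()  # shows price_in_usd; the underlying df is unchanged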
Step 4: Final Filter and Sort
You only want to keep the high-value categories where the average price exceeds $4,000. You will also sort the results so the most expensive average is at the top.
final_df = summary_df.filter(F.col("avg_price") > 4000).orderBy("avg_price", ascending=False)
display(final_df)
You may not have encountered the F.col() function yet.
F.col() is how you reference a column by name in PySpark. The F is just the alias for pyspark.sql.functions — imported at the top of your notebook like this:
from pyspark.sql import functions as F
df.select(F.col("price"))
It's equivalent to just typing the column name as a string in many cases, but F.col() is preferred because it lets you chain operations directly on it:
F.col("price") * 1.1
F.col("cut").alias("diamond_cut")
Think of F.col("price") as saying "give me the price column as an object I can actually do things with."
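Putting both uses together, here is a short sketch based on the diamonds columns above; the 10% markup is purely hypothetical, just to show chaining:
from pyspark.sql import functions as F
# Chain an arithmetic operation and a rename in a single select
adjusted_df = df.select(
    F.col("cut").alias("diamond_cut"),
    (F.col("price") * 1.1).alias("price_with_markup")  # hypothetical 10% markup
)
display(adjusted_df)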
Step 5: Save the Result
Finally, commit this cleaned "High-Value Premium Diamonds" report to your Catalog so the analytics team can use it.
final_df.write.mode("overwrite").saveAsTable("main.default.high_value_premium_diamonds")
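To verify the write succeeded, you can read the table straight back from the Catalog, assuming the same three-level name used above:
# Read the saved table back and inspect it
check_df = spark.table("main.default.high_value_premium_diamonds")
display(check_df)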
Review Questions
1. In Step 3, why did you use the .alias() method within the aggregation?
2. If you want to apply this project to your own dataset, which part of the code must you change first?