Writing the Processed Data to a Table
Writing data is the process of moving a DataFrame from the cluster's temporary memory into permanent storage in the Catalog. By using the saveAsTable() method, you ensure that your cleaned and aggregated results are preserved and accessible to other users and tools.
Everything you have done so far has been "in-memory." If you were to turn off your cluster right now, your transformed DataFrames would disappear. To make your work permanent, you must write the data back to the Lakehouse. In Databricks, the standard way to do this is by saving your DataFrame as a Delta Table.
The saveAsTable() Syntax
To save your work, you chain the write method to your DataFrame. The most direct approach is:
# Save the 'summary_df' we created earlier as a permanent table
summary_df.write.mode("overwrite").saveAsTable("workspace.default.diamonds_summary")
- write: accesses the DataFrame writer interface;
- mode("overwrite"): This tells Databricks what to do if a table with that name already exists. "Overwrite" replaces the old data with the new data. Other options include "append" (to add new rows to the end of the existing table);
- saveAsTable: specifies the three-part name (
catalog.schema.table) where the data will be stored.
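For example, if a daily job should add new rows rather than replace the whole table, the call might look like this (a minimal sketch; daily_summary_df is a hypothetical DataFrame standing in for that day's results):
# Hypothetical daily job: add new rows to the existing table instead of replacing it
daily_summary_df.write.mode("append").saveAsTable("workspace.default.diamonds_summary")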
Delta Lake: The Default Format
When you use saveAsTable, Databricks automatically saves the data in the Delta format. As we discussed in Section 1, Delta Lake provides reliability. It ensures that even if the cluster crashes in the middle of a "write" operation, your table won't be corrupted. It also allows for "Time Travel," meaning you can look back at previous versions of the table if you make a mistake.
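To give a rough idea of Time Travel, the sketch below inspects the table's history and reads an earlier snapshot by version number (the table name and version are assumptions based on the example above):
# Inspect the table's change history (one row per write)
spark.sql("DESCRIBE HISTORY workspace.default.diamonds_summary").show()

# Read the table as it looked at version 0 (the first write)
old_df = spark.read.option("versionAsOf", 0).table("workspace.default.diamonds_summary")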
Verifying the Write in the Catalog
Once the command finishes running, you should verify that the data has landed correctly:
- Navigate to the Catalog tab in the left-hand sidebar;
- Drill down into the workspace catalog and the default schema;
- Look for your new table name (e.g., diamonds_summary);
- You can click on the table to see its schema, sample data, and metadata, such as when it was created and who created it.
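You can also confirm the write programmatically. A small sketch, assuming the table name used earlier in this section:
# Check that the table is now registered in the Catalog
print(spark.catalog.tableExists("workspace.default.diamonds_summary"))

# Peek at the first few rows of the saved table
spark.table("workspace.default.diamonds_summary").show(5)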
Reading Your Saved Table
Once a table is in the Catalog, any authorized user can access it without needing your notebook. They can simply run a SQL query or use spark.table() to load it into their own environment:
# In a new notebook, anyone can now access your processed data
new_df = spark.table("workspace.default.diamonds_summary")
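The SQL route works just as well. As a quick sketch (same assumed table name), the query can be run from Python with spark.sql():
# Equivalent access through SQL
result_df = spark.sql("SELECT * FROM workspace.default.diamonds_summary LIMIT 10")
result_df.show()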
Best Practice: Clean Up
After saving your final results to a permanent table, it is a professional habit to terminate your cluster or at least "Clear State." Since your data is now safely stored in the Catalog, you no longer need to keep the temporary DataFrames taking up space in the cluster's RAM.
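If you want to free memory without terminating the cluster, the calls below are a minimal sketch (assuming summary_df was cached earlier; adjust to your own DataFrames):
# Drop all cached tables/DataFrames for this Spark session
spark.catalog.clearCache()

# Or unpersist a specific DataFrame you no longer need
summary_df.unpersist()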
1. Which "mode" should you use if you want to replace an existing table with brand-new data from your DataFrame?
2. What is the primary benefit of saving a DataFrame using saveAsTable()?