Databricks Fundamentals: A Beginner's Guide

Writing the Processed Data to a Table


Definition

Writing data is the process of moving a DataFrame from the cluster's temporary memory into permanent storage in the Catalog. By using the saveAsTable() method, you ensure that your cleaned and aggregated results are preserved and accessible to other users and tools.

Everything you have done so far has been "in-memory." If you were to turn off your cluster right now, your transformed DataFrames would disappear. To make your work permanent, you must write the data back to the Lakehouse. In Databricks, the standard way to do this is by saving your DataFrame as a Delta Table.

The saveAsTable() Syntax

To save your work, you chain the write method to your DataFrame. The most direct approach is:

# Save the 'summary_df' we created earlier as a permanent table
summary_df.write.mode("overwrite").saveAsTable("workspace.default.diamonds_summary")
  • write: accesses the DataFrame writer interface;
  • mode("overwrite"): tells Databricks what to do if a table with that name already exists. "overwrite" replaces the old data with the new, while "append" adds the new rows to the end of the existing table instead;
  • saveAsTable(): takes the three-part name (catalog.schema.table) where the data will be stored.

Delta Lake: The Default Format

When you use saveAsTable, Databricks automatically saves the data in the Delta format. As we discussed in Section 1, Delta Lake provides reliability. It ensures that even if the cluster crashes in the middle of a "write" operation, your table won't be corrupted. It also allows for "Time Travel," meaning you can look back at previous versions of the table if you make a mistake.

Verifying the Write in the Catalog

Once the command finishes running, you should verify that the data has landed correctly:

  • Navigate to the Catalog tab in the left-hand sidebar;
  • Drill down into the workspace catalog and the default schema;
  • Look for your new table name (e.g., diamonds_summary);
  • You can click on the table to see its schema, sample data, and metadata, such as when it was created and who created it.

Reading Your Saved Table

Once a table is in the Catalog, any authorized user can access it without needing your notebook. They can simply run a SQL query or use spark.table() to load it into their own environment:

# In a new notebook, anyone can now access your processed data
new_df = spark.table("workspace.default.diamonds_summary")

Best Practice: Clean Up

After saving your final results to a permanent table, it is good professional practice to terminate your cluster or at least clear the notebook's state. Your data is now safely stored in the Catalog, so there is no need to keep temporary DataFrames occupying the cluster's RAM.

1. Which "mode" should you use if you want to replace an existing table with brand-new data from your DataFrame?

2. What is the primary benefit of saving a DataFrame using saveAsTable()?


Section 4. Chapter 8

Ask AI

expand

Ask AI

ChatGPT

Ask anything or try one of the suggested questions to begin our chat

Section 4. Chapter 8
some-alt