Loading Data from a File into a DataFrame | Working with Data
Databricks Fundamentals: A Beginner's Guide


Definition

The spark.read object is the entry point for reading external data into a Spark DataFrame. It supports various file formats, including CSV, JSON, Parquet, and Delta, and allows you to define how Spark should interpret the files.

In Section 2, you uploaded a CSV file to the Databricks environment. Now, you will learn how to "lift" that file from storage and bring it into the cluster's memory as a DataFrame using Python. This is the first step in almost every data engineering pipeline.

The spark.read Syntax

To load a file, we use a specific chain of commands. The basic structure looks like this:

df = spark.read.format("csv").option("header", "true").load("path/to/file")
  • format: tells Spark the type of file (csv, json, parquet);
  • option("header", "true"): tells Spark to use the first row of the file as column names;
  • load: loads the file from the specified location within Databricks.

Inferring the Schema

By default, Spark treats every column in a CSV as a string (text). To make the data more useful, we add another option: .option("inferSchema", "true"). When this is enabled, Spark scans the data (at the cost of an extra pass over the file) and automatically identifies which columns are integers, decimals, or booleans. This saves you the manual work of defining data types yourself.
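Conceptually, schema inference samples the values in each column and picks the narrowest type that fits all of them. The pure-Python sketch below illustrates that idea; it is not Spark's actual implementation, and the sample rows are made up:

```python
def infer_type(values):
    """Return 'boolean', 'integer', 'double', or 'string' for one column."""
    def fits(value, kind):
        if kind == "boolean":
            return value.lower() in ("true", "false")
        try:
            if kind == "integer":
                int(value)
            elif kind == "double":
                float(value)
            return True
        except ValueError:
            return False

    # Try types from narrowest to widest; fall back to string.
    for kind in ("boolean", "integer", "double"):
        if all(fits(v, kind) for v in values):
            return kind
    return "string"  # Spark's default when nothing narrower fits


# Hypothetical CSV rows: order_id, amount, is_refunded
rows = [["1", "19.99", "true"], ["2", "5.00", "false"]]
columns = list(zip(*rows))  # transpose rows into columns
print([infer_type(col) for col in columns])
# → ['integer', 'double', 'boolean']
```

Without inferSchema, every one of these columns would come back as a string.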

Locating Your File Path

To read a file, you need its path. In the Catalog or Workspace tab, you can locate your uploaded file, click the three dots (ellipsis) next to it, and select "Copy path". In modern Databricks, if you uploaded the file via the Data Ingestion UI as we did in Chapter 2.6, the data is already saved as a table, which we can read using:

df = spark.read.table("main.default.sample_sales_records")

However, if you are reading the raw file directly from a Volume, you would use the file path:

df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/main/default/my_volume/sales_data.csv")

Verifying the Load

After running the load command, it is best practice to verify the data. Immediately follow your read command with: display(df)

Because Spark evaluates lazily, this is also the moment the file is actually read. The output confirms that the data loaded correctly, the headers are in the right place, and the data types look accurate. At this stage, the data is available through the DataFrame df, and you are ready to start transforming it.
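The kind of checks you perform by eye with display(df) can be illustrated without a cluster. The snippet below uses Python's built-in csv module on a made-up, in-memory file (the column names and values are hypothetical, not from the lesson's dataset):

```python
import csv
import io

# A made-up CSV standing in for the uploaded file.
raw = "order_id,amount\n1,19.99\n2,5.00\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# The same sanity checks you run visually after display(df):
assert list(rows[0].keys()) == ["order_id", "amount"]  # headers in place
assert len(rows) == 2                                  # expected row count
print(rows[0]["amount"])
# → 19.99
```

Note that csv.DictReader, like Spark without inferSchema, leaves every value as a string.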

1. Why should you use the .option("inferSchema", "true") setting when reading a CSV?

2. Which command is used to bring an existing table from the Catalog into a Python DataFrame?



Section 4, Chapter 2
