Loading Data from a File into a DataFrame | Working with Data
Databricks Fundamentals: A Beginner's Guide


Definition

The spark.read object is the entry point for reading external data into a Spark DataFrame. It supports various file formats, including CSV, JSON, Parquet, and Delta, and allows you to define how Spark should interpret the files.

In Section 2, you uploaded a CSV file to the Databricks environment. Now, you will learn how to "lift" that file from storage and bring it into the cluster's memory as a DataFrame using Python. This is the first step in almost every data engineering pipeline.

The spark.read Syntax

To load a file, we use a specific chain of commands. The basic structure looks like this:

df = spark.read.format("csv").option("header", "true").load("path/to/file")
  • format: tells Spark the file type (csv, json, parquet);
  • option("header", "true"): tells Spark to use the first row of the file as column names;
  • load: reads the file from the specified location within Databricks.

Inferring the Schema

By default, Spark assumes every column in a CSV is a string (text). To make our data more useful, we add another option: .option("inferSchema", "true"). When this is enabled, Spark takes a quick look at the data and automatically identifies which columns are integers, decimals, or booleans. This saves you the manual work of defining data types yourself.

Locating Your File Path

To read a file, you need its path. In the Catalog or Workspace tab, you can locate your uploaded file, click the three dots (ellipsis) next to it, and select "Copy path". In modern Databricks, if you uploaded the file via the Data Ingestion UI as we did in Chapter 2.6, the data is already saved as a table, which we can read using:

df = spark.read.table("main.default.sample_sales_records")

However, if you are reading the raw file directly from a Volume, you would use the file path:

df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/main/default/my_volume/sales_data.csv")

Verifying the Load

After running the load command, it is best practice to verify the data. You should immediately follow your read command with: display(df)

This confirms that the data has been loaded into the cluster's memory correctly, the headers are in the right place, and the data types look accurate. At this stage, the data is sitting in a temporary object called df, and you are ready to start transforming it.

1. Why should you use the .option("inferSchema", "true") setting when reading a CSV?

2. Which command is used to bring an existing table from the Catalog into a Python DataFrame?



Section 4. Chapter 2

