Loading Data from a File into a DataFrame | Working with Data
Databricks Fundamentals: A Beginner's Guide


Definition

The spark.read object is the entry point for reading external data into a Spark DataFrame. It supports various file formats, including CSV, JSON, Parquet, and Delta, and allows you to define how Spark should interpret the files.

In Section 2, you uploaded a CSV file to the Databricks environment. Now, you will learn how to "lift" that file from storage and bring it into the cluster's memory as a DataFrame using Python. This is the first step in almost every data engineering pipeline.

The spark.read Syntax

To load a file, we use a specific chain of commands. The basic structure looks like this:

df = spark.read.format("csv").option("header", "true").load("path/to/file")
  • format: tells Spark the file type (csv, json, parquet);
  • option("header", "true"): tells Spark to use the first row of the file as column names;
  • load: reads the file from the specified location within Databricks.

Inferring the Schema

By default, Spark assumes every column in a CSV is a string (text). To make our data more useful, we add another option: .option("inferSchema", "true"). When this is enabled, Spark takes a quick look at the data and automatically identifies which columns are integers, decimals, or booleans. This saves you the manual work of defining data types yourself.

Locating Your File Path

To read a file, you need its path. In the Catalog or Workspace tab, you can locate your uploaded file, click the three dots (ellipsis) next to it, and select "Copy path". In modern Databricks, if you uploaded the file via the Data Ingestion UI as we did in Chapter 2.6, the data is already saved as a table, which we can read using:

df = spark.read.table("main.default.sample_sales_records")

However, if you are reading the raw file directly from a Volume, you would use the file path:

df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/main/default/my_volume/sales_data.csv")

Verifying the Load

After running the load command, it is best practice to verify the data. You should immediately follow your read command with: display(df)

This confirms that the data has been loaded into the cluster's memory correctly, the headers are in the right place, and the data types look accurate. At this stage, the data is sitting in a temporary object called df, and you are ready to start transforming it.

1. Why should you use the .option("inferSchema", "true") setting when reading a CSV?

2. Which command is used to bring an existing table from the Catalog into a Python DataFrame?



Section 4. Chapter 2

