Basic DataFrame Exploration | Working with Data
Databricks Fundamentals: A Beginner's Guide



Definition

DataFrame exploration is the process of inspecting the structure, data types, and content of a DataFrame. Commands like printSchema() and display() are the primary tools used to validate that data has been loaded correctly before starting an analysis.

Once you have loaded your data into a DataFrame, you cannot simply assume it is perfect. You must inspect it to understand what you are working with. In this chapter, you will use a handful of essential Python commands to "look under the hood" of the sales_records DataFrame.

Inspecting the Structure: printSchema()

The first thing a data professional does with a new DataFrame is check the Schema. The schema is the blueprint of your data—it tells you the name of every column and the type of data it holds (Integer, String, Double, etc.).

In a new cell, run:

df.printSchema()

The output will be a tree-style list. This is where you verify that "Total_Revenue" is a numeric type (like double) and not just a piece of text. If a column you expected to be a number is listed as a string, you know you need to fix the data types before performing calculations.

Inspecting the Content: display()

While printSchema() shows you the structure, display() shows you the actual data. As we discussed in Section 3, display() is a powerful Databricks-specific function.

Run:

display(df)

This renders the first 10,000 rows of your DataFrame in an interactive grid. This is your chance to spot "dirty" data, such as missing values (shown as null) or inconsistent formatting in the "Region" or "Item_Type" columns.

Quick Statistics: describe() and summary()

If you want to see the "math" behind your columns without writing complex queries, you can use the describe() command:

display(df.describe())

This returns a table showing the Count, Mean, Standard Deviation, Min, and Max for every numeric column. It is the fastest way to check for outliers — for example, if your "Min" price is a negative number, you know there is an error in your source data.

Counting Rows: count()

To know the scale of your dataset, use the count() method:

print(df.count())

This returns a single integer representing the total number of rows. It is useful for verifying that you haven't lost any data during the loading process.

Viewing Column Names

Finally, if you just need a quick list of the column names to copy-paste into another function, use:

print(df.columns)

This returns a simple Python list of all headers, which is very helpful when your DataFrame has dozens of columns and you can't remember the exact spelling of one.

1. Which command should you use to see the "blueprint" of your DataFrame, including all column names and data types?

2. What is the purpose of running display(df.describe())?
