Basic DataFrame Exploration | Working with Data
Databricks Fundamentals: A Beginner's Guide



Definition

DataFrame exploration is the process of inspecting the structure, data types, and content of a DataFrame. Commands like printSchema() and display() are the primary tools used to validate that data has been loaded correctly before starting an analysis.

Once you have loaded your data into a DataFrame, you cannot simply assume it is perfect. You must inspect it to understand what you are working with. In this chapter, you will use a handful of essential Python commands to "look under the hood" of the sales_records DataFrame.

Inspecting the Structure: printSchema()

The first thing a data professional does with a new DataFrame is check the Schema. The schema is the blueprint of your data—it tells you the name of every column and the type of data it holds (Integer, String, Double, etc.).

In a new cell, run:

df.printSchema()

The output will be a tree-style list. This is where you verify that "Total_Revenue" is a numeric type (like double) and not just a piece of text. If a column you expected to be a number is listed as a string, you know you need to fix the data types before performing calculations.

Inspecting the Content: display()

While printSchema() shows you the structure, display() shows you the actual data. As we discussed in Section 3, display() is a powerful Databricks-specific function.

Run:

display(df)

This renders the first 10,000 rows of your DataFrame in an interactive grid. This is your chance to spot "dirty" data, such as missing values (shown as null) or inconsistent formatting in the "Region" or "Item_Type" columns.

Quick Statistics: describe() and summary()

If you want to see the "math" behind your columns without writing complex queries, you can use the describe() command:

display(df.describe())

This returns a table showing the Count, Mean, Standard Deviation, Min, and Max for every numeric column. It is the fastest way to check for outliers — for example, if your "Min" price is a negative number, you know there is an error in your source data.

Counting Rows: count()

To know the scale of your dataset, use the count() method:

print(df.count())

This returns a single integer representing the total number of rows. It is useful for verifying that you haven't lost any data during the loading process.

Viewing Column Names

Finally, if you just need a quick list of the column names to copy-paste into another function, use:

print(df.columns)

This returns a simple Python list of all headers, which is very helpful when your DataFrame has dozens of columns and you can't remember the exact spelling of one.

1. Which command should you use to see the "blueprint" of your DataFrame, including all column names and data types?

2. What is the purpose of running display(df.describe())?
