Basic DataFrame Exploration
DataFrame exploration is the process of inspecting the structure, data types, and content of a DataFrame. Commands like printSchema() and display() are the primary tools used to validate that data has been loaded correctly before starting an analysis.
Once you have loaded your data into a DataFrame, you cannot simply assume it is perfect. You must inspect it to understand what you are working with. In this chapter, you will use several essential Python commands to "look under the hood" of the sales_records DataFrame, which the commands below refer to as df.
Inspecting the Structure: printSchema()
The first thing a data professional does with a new DataFrame is check the schema. The schema is the blueprint of your data: it tells you the name of every column and the type of data it holds (integer, string, double, etc.).
In a new cell, run:
df.printSchema()
The output is a tree-style list with one line per column, showing the column name, its data type, and whether it can contain nulls. This is where you verify that "Total_Revenue" is a numeric type (like double) and not a string. If a column you expected to be a number is listed as a string, you know you need to fix the data types before performing calculations.
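When a DataFrame has many columns, reading the tree by eye gets tedious. The sketch below shows one way to automate the same check. It works on the list of (name, type) pairs that df.dtypes returns in PySpark. The helper name non_numeric_columns, the sample data, and the "Units_Sold" column are made up for illustration.

```python
# Type names Spark reports in df.dtypes for numeric columns.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(type_name):
    """Return True if a df.dtypes type name describes a numeric column."""
    # Decimal types carry precision and scale, e.g. "decimal(10,2)".
    return type_name in NUMERIC_TYPES or type_name.startswith("decimal")


def non_numeric_columns(dtypes, expected_numeric):
    """List the expected-numeric columns that are missing or not numeric.

    dtypes: (column_name, type_name) pairs, the shape of df.dtypes.
    expected_numeric: names of columns that should hold numbers.
    """
    type_by_name = dict(dtypes)
    return [
        name
        for name in expected_numeric
        if not is_numeric(type_by_name.get(name, "missing"))
    ]


# Hypothetical stand-in for df.dtypes on a sales DataFrame.
sample_dtypes = [
    ("Region", "string"),
    ("Units_Sold", "int"),
    ("Total_Revenue", "string"),
]

problems = non_numeric_columns(sample_dtypes, ["Units_Sold", "Total_Revenue"])
print(problems)
```

In a notebook, you would pass df.dtypes in place of sample_dtypes. Any column the helper returns needs a type fix before you run calculations on it.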
Inspecting the Content: display()
While printSchema() shows you the structure, display() shows you the actual data. As we discussed in Section 3, display() is a powerful Databricks-specific function.
Run:
display(df)
This renders up to the first 10,000 rows of your DataFrame in an interactive grid (the exact limit can vary with your Databricks environment). This is your chance to spot "dirty" data, such as missing values (shown as null) or inconsistent formatting in the "Region" or "Item_Type" columns.
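Scanning the grid works for a first look, but inconsistent spellings are easy to miss. As a rough sketch, the helper below groups values that differ only by letter case or surrounding whitespace. The function name formatting_variants and the sample values are hypothetical. It takes a plain Python list so it can run anywhere.

```python
from collections import defaultdict


def formatting_variants(values):
    """Group values that differ only by case or surrounding whitespace."""
    groups = defaultdict(set)
    for value in values:
        if value is None:
            continue  # nulls are a separate problem; skip them here
        groups[value.strip().lower()].add(value)
    # Keep only the normalized keys that appear in more than one spelling.
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical distinct values from a "Region" column.
sample_regions = ["Europe", "europe ", "Asia", None, "ASIA", "Africa"]

variants = formatting_variants(sample_regions)
for normalized, spellings in variants.items():
    print(normalized, "->", spellings)
```

In a notebook, one way to build the input list is [row["Region"] for row in df.select("Region").distinct().collect()]. Collecting distinct values is only sensible for columns with a small number of categories.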
Quick Statistics: describe() and summary()
If you want to see the "math" behind your columns without writing complex queries, you can use the describe() command:
display(df.describe())
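This returns a table showing the count, mean, standard deviation, min, and max for every numeric and string column. For string columns, the mean and standard deviation are null, and min and max follow alphabetical order. It is the fastest way to check for outliers. For example, if the min of a price column is a negative number, you know there is an error in your source data. The summary() command works the same way and, by default, adds the 25%, 50%, and 75% percentiles to the table.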
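The negative-minimum check can also be scripted. The sketch below assumes the describe() output has been collected into a list of dictionaries, for example with [row.asDict() for row in df.describe().collect()]. In that output, every statistic is a string and the first column is named summary. The helper name negative_minimums, the "Unit_Price" column, and the sample numbers are invented for illustration.

```python
def negative_minimums(describe_rows):
    """Return {column: min_value} for columns whose minimum is below zero.

    describe_rows: one dict per statistic, keyed by column name, with a
    "summary" key naming the statistic (count, mean, stddev, min, max).
    """
    min_row = next(row for row in describe_rows if row["summary"] == "min")
    flagged = {}
    for column, value in min_row.items():
        if column == "summary" or value is None:
            continue
        try:
            number = float(value)  # describe() reports statistics as strings
        except ValueError:
            continue  # string columns report an alphabetical minimum
        if number < 0:
            flagged[column] = number
    return flagged


# Hypothetical stand-in for the collected describe() output.
sample_describe = [
    {"summary": "count", "Region": "100", "Unit_Price": "100"},
    {"summary": "min", "Region": "Asia", "Unit_Price": "-9.5"},
    {"summary": "max", "Region": "Europe", "Unit_Price": "668.27"},
]

suspicious = negative_minimums(sample_describe)
print(suspicious)
```

An empty result means no numeric column has a negative minimum. It does not rule out other kinds of outliers, such as an implausibly large max.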
Counting Rows: count()
To know the scale of your dataset, use the count() method:
print(df.count())
This returns a single integer representing the total number of rows. It is useful for verifying that you haven't lost any data during the loading process.
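To verify that nothing was lost, you need a number to compare against. One possible approach, sketched here under the assumption that the source is a CSV file small enough to read as text, is to count the records independently with Python's csv module. The helper name expected_row_count and the sample data are illustrative.

```python
import csv
import io


def expected_row_count(csv_text, has_header=True):
    """Count the data records in CSV text, ignoring blank lines."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = sum(1 for row in reader if row)
    # Subtract the header line only when there is at least one line.
    return rows - 1 if has_header and rows else rows


# Hypothetical source file contents.
sample_csv = "Region,Total_Revenue\nEurope,100.5\nAsia,200.0\n"

expected = expected_row_count(sample_csv)
loaded = 2  # stand-in for df.count() in a notebook

print("expected:", expected, "loaded:", loaded, "match:", expected == loaded)
```

If the two numbers differ, common causes to look into are malformed rows dropped by the reader or a header line that was counted as data.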
Viewing Column Names
Finally, if you just need a quick list of the column names to copy-paste into another function, use:
print(df.columns)
This returns a simple Python list of all headers, which is very helpful when your DataFrame has dozens of columns and you can't remember the exact spelling of one.
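Because df.columns is an ordinary Python list, standard-library tools work on it directly. As a small sketch, the helper below uses difflib to suggest the closest column names for a half-remembered spelling. The function name suggest_column and the sample column list are hypothetical.

```python
import difflib


def suggest_column(columns, guess, limit=3):
    """Return up to `limit` column names that closely match `guess`."""
    # Compare in lowercase so "total revenue" can match "Total_Revenue".
    lowered = {name.lower(): name for name in columns}
    matches = difflib.get_close_matches(
        guess.lower(), list(lowered), n=limit, cutoff=0.6
    )
    return [lowered[match] for match in matches]


# Hypothetical stand-in for df.columns.
sample_columns = ["Region", "Item_Type", "Units_Sold", "Total_Revenue"]

print(suggest_column(sample_columns, "total revenue"))
print(suggest_column(sample_columns, "itemtype"))
```

In a notebook, pass df.columns as the first argument. The cutoff of 0.6 is difflib's default similarity threshold. Raise it if the suggestions are too loose.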
1. Which command should you use to see the "blueprint" of your DataFrame, including all column names and data types?
2. What is the purpose of running display(df.describe())?