Learn Filtering and Selecting Data with Python | Working with Data
Databricks Fundamentals: A Beginner's Guide

Filtering and Selecting Data with Python


Definition

Selecting and filtering are the most common operations in data manipulation. Selecting lets you pick specific columns of interest, while filtering narrows down the rows based on specific criteria or conditions.

In most real-world scenarios, you don't need to work with every single column or row in a dataset. Large tables can have hundreds of columns and millions of rows, most of which might be irrelevant to your specific analysis. In this chapter, you will learn how to "carve out" exactly the data you need using the select() and filter() methods.

Selecting Specific Columns

The select() method allows you to create a new DataFrame that only contains the columns you choose. This reduces the amount of memory your cluster uses and makes your results much easier to read.

# Select only two specific columns
selected_df = df.select("carat", "price")

display(selected_df)

Notice that Spark doesn't change the original df. Instead, it creates a new one called selected_df. If you wanted to select all columns except one, or perform a calculation during the selection, you would use more advanced syntax, but for basic tasks, passing the column names as strings is the standard approach.

Filtering Rows with Conditions

The filter() method (or its alias where()) acts as a sieve for your data. You provide a condition, and Spark only keeps the rows where that condition is true.

# Filter for rows where the carat equals 0.23
filtered_df = df.filter(df.carat == 0.23)

display(filtered_df)

You can use standard comparison operators like == (equals), != (not equals), > (greater than), and < (less than).

Combining Multiple Filters

Often, you need to apply more than one rule at a time. To do this, you can chain filters together or combine conditions with the logical operators & (and) and | (or).

good_carat_df = df.filter((df.carat == 0.23) & (df.cut == "Good"))

display(good_carat_df)
Note

When combining filters with & or |, always wrap each individual condition in parentheses. This ensures Spark evaluates the logic correctly.

Selecting and Filtering in One Step

Because Spark uses a "fluent" API, you can chain these commands together in a single line of code. This is a very common pattern in professional data engineering:

# Select carat and price, then keep only rows where price > 500
high_price_df = df.select("carat", "price").filter(df.price > 500)

display(high_price_df)

Checking Your Work

After every selection or filter, it is a good habit to run a count(). If you start with 10,000 rows and end up with 0 after a filter, your condition might be too strict or might contain a typo in the string values.

1. Which method would you use if you want to pick only 3 columns out of a 50-column table?

2. In Python, what is the correct way to filter for rows where the "Total_Profit" is greater than 1000?


Section 4. Chapter 4
