Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Oppiskele Selecting, Filtering, and Renaming Columns | Section
Introduction to PySpark

Selecting, Filtering, and Renaming Columns

Pyyhkäise näyttääksesi valikon

The three most common DataFrame operations are selecting the columns you need, filtering rows by condition, and renaming columns for clarity. These map directly to SQL SELECT, WHERE, and AS.

Selecting Columns

1234567891011121314151617
import urllib.request from pyspark.sql import SparkSession urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("SelectFilterRename") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) # Selecting specific columns flights_df.select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "DEPARTURE_DELAY").show(5)

Filtering Rows

1234567891011
from pyspark.sql.functions import col # Filtering flights with arrival delay over 60 minutes delayed_df = flights_df.filter(col("ARRIVAL_DELAY") > 60) # Combining conditions long_delayed_df = flights_df.filter( (col("ARRIVAL_DELAY") > 60) & (col("DISTANCE") > 1000) ) long_delayed_df.select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "ARRIVAL_DELAY", "DISTANCE").show(5)

Use col() from pyspark.sql.functions for conditions – it is more explicit and avoids ambiguity when working with multiple DataFrames.

Renaming Columns

1234567
# Renaming columns for readability renamed_df = flights_df \ .withColumnRenamed("ORIGIN_AIRPORT", "FROM") \ .withColumnRenamed("DESTINATION_AIRPORT", "TO") \ .withColumnRenamed("ARRIVAL_DELAY", "DELAY_MINUTES") renamed_df.select("AIRLINE", "FROM", "TO", "DELAY_MINUTES").show(5)

Adding a Derived Column

123456789
from pyspark.sql.functions import col # Adding a boolean column indicating whether the flight was significantly delayed flights_df = flights_df.withColumn( "IS_DELAYED", col("ARRIVAL_DELAY") > 15 ) flights_df.select("AIRLINE", "ARRIVAL_DELAY", "IS_DELAYED").show(5)
question mark

Which function is used for row-level conditions in PySpark?

Valitse oikea vastaus

Oliko kaikki selvää?

Miten voimme parantaa sitä?

Kiitos palautteestasi!

Osio 1. Luku 9

Kysy tekoälyä

expand

Kysy tekoälyä

ChatGPT

Kysy mitä tahansa tai kokeile jotakin ehdotetuista kysymyksistä aloittaaksesi keskustelumme

Osio 1. Luku 9
some-alt