Running SQL Queries on DataFrames
Sveip for å vise menyen
With a temporary view registered, you can write full SQL – subqueries, aggregations, window functions, and CASE WHEN expressions – exactly as you would in a relational database.
Aggregations and Grouping
123456789101112131415161718192021222324252627import urllib.request from pyspark.sql import SparkSession urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("SQLQueries") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) flights_df.createOrReplaceTempView("flights") # Flights per airline with average delay – only airlines with more than 5000 flights spark.sql(""" SELECT AIRLINE, COUNT(*) AS TOTAL_FLIGHTS, ROUND(AVG(ARRIVAL_DELAY), 2) AS AVG_DELAY FROM flights GROUP BY AIRLINE HAVING COUNT(*) > 5000 ORDER BY AVG_DELAY DESC """).show()
CASE WHEN
123456789101112131415# Classifying flights by delay severity spark.sql(""" SELECT AIRLINE, ARRIVAL_DELAY, CASE WHEN ARRIVAL_DELAY <= 0 THEN 'On time' WHEN ARRIVAL_DELAY <= 15 THEN 'Minor delay' WHEN ARRIVAL_DELAY <= 60 THEN 'Moderate delay' ELSE 'Major delay' END AS DELAY_CATEGORY FROM flights WHERE ARRIVAL_DELAY IS NOT NULL LIMIT 20 """).show()
Subqueries
12345678910# Finding airlines whose average delay exceeds the overall average spark.sql(""" SELECT AIRLINE, ROUND(AVG(ARRIVAL_DELAY), 2) AS AVG_DELAY FROM flights GROUP BY AIRLINE HAVING AVG(ARRIVAL_DELAY) > ( SELECT AVG(ARRIVAL_DELAY) FROM flights ) ORDER BY AVG_DELAY DESC """).show()
Run this and experiment with adding a WHERE CANCELLED = 0 clause to exclude cancelled flights from the delay averages.
Alt var klart?
Takk for tilbakemeldingene dine!
Seksjon 1. Kapittel 6
Spør AI
Spør AI
Spør om hva du vil, eller prøv ett av de foreslåtte spørsmålene for å starte chatten vår
Seksjon 1. Kapittel 6