Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Learn Running SQL Queries on DataFrames | Section
Data Processing with PySpark

Running SQL Queries on DataFrames

Swipe to show menu

With a temporary view registered, you can write full SQL – subqueries, aggregations, window functions, and CASE WHEN expressions – exactly as you would in a relational database.

Aggregations and Grouping

123456789101112131415161718192021222324252627
import urllib.request from pyspark.sql import SparkSession urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("SQLQueries") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) flights_df.createOrReplaceTempView("flights") # Flights per airline with average delay – only airlines with more than 5000 flights spark.sql(""" SELECT AIRLINE, COUNT(*) AS TOTAL_FLIGHTS, ROUND(AVG(ARRIVAL_DELAY), 2) AS AVG_DELAY FROM flights GROUP BY AIRLINE HAVING COUNT(*) > 5000 ORDER BY AVG_DELAY DESC """).show()

CASE WHEN

123456789101112131415
# Classifying flights by delay severity spark.sql(""" SELECT AIRLINE, ARRIVAL_DELAY, CASE WHEN ARRIVAL_DELAY <= 0 THEN 'On time' WHEN ARRIVAL_DELAY <= 15 THEN 'Minor delay' WHEN ARRIVAL_DELAY <= 60 THEN 'Moderate delay' ELSE 'Major delay' END AS DELAY_CATEGORY FROM flights WHERE ARRIVAL_DELAY IS NOT NULL LIMIT 20 """).show()

Subqueries

12345678910
# Finding airlines whose average delay exceeds the overall average spark.sql(""" SELECT AIRLINE, ROUND(AVG(ARRIVAL_DELAY), 2) AS AVG_DELAY FROM flights GROUP BY AIRLINE HAVING AVG(ARRIVAL_DELAY) > ( SELECT AVG(ARRIVAL_DELAY) FROM flights ) ORDER BY AVG_DELAY DESC """).show()

Run this and experiment with adding a WHERE CANCELLED = 0 clause to exclude cancelled flights from the delay averages.

question mark

Which SQL clause filters on an aggregated value?

Select the correct answer

Everything was clear?

How can we improve it?

Thanks for your feedback!

Section 1. Chapter 6

Ask AI

expand

Ask AI

ChatGPT

Ask anything or try one of the suggested questions to begin our chat

Section 1. Chapter 6
some-alt