Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Aprenda Running SQL Queries on DataFrames | Section
Data Processing with PySpark

Running SQL Queries on DataFrames

Deslize para mostrar o menu

With a temporary view registered, you can write full SQL – subqueries, aggregations, window functions, and CASE WHEN expressions – exactly as you would in a relational database.

Aggregations and Grouping

123456789101112131415161718192021222324252627
import urllib.request from pyspark.sql import SparkSession urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("SQLQueries") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) flights_df.createOrReplaceTempView("flights") # Flights per airline with average delay – only airlines with more than 5000 flights spark.sql(""" SELECT AIRLINE, COUNT(*) AS TOTAL_FLIGHTS, ROUND(AVG(ARRIVAL_DELAY), 2) AS AVG_DELAY FROM flights GROUP BY AIRLINE HAVING COUNT(*) > 5000 ORDER BY AVG_DELAY DESC """).show()

CASE WHEN

123456789101112131415
# Classifying flights by delay severity spark.sql(""" SELECT AIRLINE, ARRIVAL_DELAY, CASE WHEN ARRIVAL_DELAY <= 0 THEN 'On time' WHEN ARRIVAL_DELAY <= 15 THEN 'Minor delay' WHEN ARRIVAL_DELAY <= 60 THEN 'Moderate delay' ELSE 'Major delay' END AS DELAY_CATEGORY FROM flights WHERE ARRIVAL_DELAY IS NOT NULL LIMIT 20 """).show()

Subqueries

12345678910
# Finding airlines whose average delay exceeds the overall average spark.sql(""" SELECT AIRLINE, ROUND(AVG(ARRIVAL_DELAY), 2) AS AVG_DELAY FROM flights GROUP BY AIRLINE HAVING AVG(ARRIVAL_DELAY) > ( SELECT AVG(ARRIVAL_DELAY) FROM flights ) ORDER BY AVG_DELAY DESC """).show()

Run this and experiment with adding a WHERE CANCELLED = 0 clause to exclude cancelled flights from the delay averages.

question mark

Which SQL clause filters on an aggregated value?

Selecione a resposta correta

Tudo estava claro?

Como podemos melhorá-lo?

Obrigado pelo seu feedback!

Seção 1. Capítulo 6

Pergunte à IA

expand

Pergunte à IA

ChatGPT

Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo

Seção 1. Capítulo 6
some-alt