Running SQL Queries on DataFrames
Deslize para mostrar o menu
With a temporary view registered, you can write full SQL – subqueries, aggregations, window functions, and CASE WHEN expressions – exactly as you would in a relational database.
Aggregations and Grouping
123456789101112131415161718192021222324252627import urllib.request from pyspark.sql import SparkSession urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("SQLQueries") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) flights_df.createOrReplaceTempView("flights") # Flights per airline with average delay – only airlines with more than 5000 flights spark.sql(""" SELECT AIRLINE, COUNT(*) AS TOTAL_FLIGHTS, ROUND(AVG(ARRIVAL_DELAY), 2) AS AVG_DELAY FROM flights GROUP BY AIRLINE HAVING COUNT(*) > 5000 ORDER BY AVG_DELAY DESC """).show()
CASE WHEN
123456789101112131415# Classifying flights by delay severity spark.sql(""" SELECT AIRLINE, ARRIVAL_DELAY, CASE WHEN ARRIVAL_DELAY <= 0 THEN 'On time' WHEN ARRIVAL_DELAY <= 15 THEN 'Minor delay' WHEN ARRIVAL_DELAY <= 60 THEN 'Moderate delay' ELSE 'Major delay' END AS DELAY_CATEGORY FROM flights WHERE ARRIVAL_DELAY IS NOT NULL LIMIT 20 """).show()
Subqueries
12345678910# Finding airlines whose average delay exceeds the overall average spark.sql(""" SELECT AIRLINE, ROUND(AVG(ARRIVAL_DELAY), 2) AS AVG_DELAY FROM flights GROUP BY AIRLINE HAVING AVG(ARRIVAL_DELAY) > ( SELECT AVG(ARRIVAL_DELAY) FROM flights ) ORDER BY AVG_DELAY DESC """).show()
Run this and experiment with adding a WHERE CANCELLED = 0 clause to exclude cancelled flights from the delay averages.
Tudo estava claro?
Obrigado pelo seu feedback!
Seção 1. Capítulo 6
Pergunte à IA
Pergunte à IA
Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo
Seção 1. Capítulo 6