Introduction to Spark SQL
メニューを表示するにはスワイプしてください
Spark SQL lets you query DataFrames using standard SQL syntax. If you are comfortable with SQL, this is often the fastest way to express complex analytical queries – and the results are identical to using the DataFrame API.
Registering a Temporary View
To run SQL against a DataFrame, register it as a temporary view. The view exists only for the duration of the SparkSession – it is not persisted to disk.
1234567891011121314151617import urllib.request from pyspark.sql import SparkSession urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("SparkSQL") \ .master("local[*]") \ .getOrCreate() flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) # Registering the DataFrame as a temporary SQL view flights_df.createOrReplaceTempView("flights")
Running SQL Queries
12345678910# Selecting columns with a condition result = spark.sql(""" SELECT AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT, ARRIVAL_DELAY FROM flights WHERE ARRIVAL_DELAY > 60 ORDER BY ARRIVAL_DELAY DESC LIMIT 10 """) result.show()
spark.sql() returns a regular DataFrame – you can chain further DataFrame operations on the result.
SQL vs DataFrame API
Both approaches produce the same execution plan. Choose whichever is more readable for the task:
1234567# SQL spark.sql(""" SELECT AIRLINE, COUNT(*) AS TOTAL_FLIGHTS, AVG(ARRIVAL_DELAY) AS AVG_DELAY FROM flights GROUP BY AIRLINE ORDER BY AVG_DELAY DESC """).show()
1234567# Equivalent DataFrame API from pyspark.sql.functions import avg, count, col flights_df.groupBy("AIRLINE") \ .agg(count("*").alias("TOTAL_FLIGHTS"), avg("ARRIVAL_DELAY").alias("AVG_DELAY")) \ .orderBy(col("AVG_DELAY").desc()) \ .show()
すべて明確でしたか?
フィードバックありがとうございます!
セクション 1. 章 5
AIに質問する
AIに質問する
何でも質問するか、提案された質問の1つを試してチャットを始めてください
セクション 1. 章 5