Loading and Inspecting Data with DataFrames
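Before any analysis, you need to understand what your data looks like: its shape, types, and basic statistics. PySpark provides several methods for this. Some, such as printSchema() and show(), only read the schema or a handful of rows, while others, such as count() and describe(), scan the full dataset.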
Reading CSV with Options
inferSchema=True works for small datasets but requires an extra pass over the data. For large files, defining the schema explicitly is faster:
    import urllib.request
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType

    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("InspectingData") \
        .master("local[*]") \
        .getOrCreate()

    schema = StructType([
        StructField("YEAR", IntegerType(), True),
        StructField("MONTH", IntegerType(), True),
        StructField("DAY", IntegerType(), True),
        StructField("DAY_OF_WEEK", IntegerType(), True),
        StructField("AIRLINE", StringType(), True),
        StructField("FLIGHT_NUMBER", IntegerType(), True),
        StructField("ORIGIN_AIRPORT", StringType(), True),
        StructField("DESTINATION_AIRPORT", StringType(), True),
        StructField("DEPARTURE_DELAY", FloatType(), True),
        StructField("ARRIVAL_DELAY", FloatType(), True),
        StructField("DISTANCE", IntegerType(), True),
    ])

    flights_df = spark.read.csv("flights.csv", header=True, schema=schema)
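Note that, by default, Spark applies a user-supplied CSV schema by position rather than by header name, so the fields must be listed in the same order as the columns in the file. Columns beyond the end of the schema are dropped. When you only need a subset of a wide dataset, describe the columns in file order and then select() the ones you want.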
Inspecting the DataFrame
    # Printing column names and types
    flights_df.printSchema()

    # Counting rows
    print(flights_df.count())

    # Previewing data
    flights_df.show(5, truncate=False)

    # Summary statistics
    flights_df.describe("DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE").show()
truncate=False prevents Spark from cutting off long string values in show() output.
Checking for Nulls
    from pyspark.sql.functions import col, count, when

    # Counting null values per column
    flights_df.select([
        count(when(col(c).isNull(), c)).alias(c)
        for c in flights_df.columns
    ]).show()