Loading and Inspecting Data with DataFrames

Before any analysis, you need to understand what your data looks like: its shape, types, and basic statistics. PySpark provides several methods for this. Some, such as printSchema(), only read the DataFrame's metadata, while others, such as count() and describe(), run a job over the data.

Reading CSV with Options

inferSchema=True works for small datasets but requires an extra pass over the data. For large files, defining the schema explicitly is faster:


import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("InspectingData") \
    .master("local[*]") \
    .getOrCreate()

schema = StructType([
    StructField("YEAR", IntegerType(), True),
    StructField("MONTH", IntegerType(), True),
    StructField("DAY", IntegerType(), True),
    StructField("DAY_OF_WEEK", IntegerType(), True),
    StructField("AIRLINE", StringType(), True),
    StructField("FLIGHT_NUMBER", IntegerType(), True),
    StructField("ORIGIN_AIRPORT", StringType(), True),
    StructField("DESTINATION_AIRPORT", StringType(), True),
    StructField("DEPARTURE_DELAY", FloatType(), True),
    StructField("ARRIVAL_DELAY", FloatType(), True),
    StructField("DISTANCE", IntegerType(), True),
])

flights_df = spark.read.csv("flights.csv", header=True, schema=schema)

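The CSV reader applies a user-supplied schema by position, not by column name, so the fields must be listed in the same order as the columns in the file. In the default PERMISSIVE mode, trailing columns beyond the end of the schema are dropped. To keep an arbitrary subset of a wide dataset, declare the columns in file order and then select() the ones you need.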

Inspecting the DataFrame


# Printing column names and types
flights_df.printSchema()

# Counting rows
print(flights_df.count())

# Previewing data
flights_df.show(5, truncate=False)

# Summary statistics
flights_df.describe("DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE").show()

truncate=False prevents Spark from cutting off long string values in show() output.

Checking for Nulls


from pyspark.sql.functions import col, count, when

# Counting null values per column
flights_df.select([
    count(when(col(c).isNull(), c)).alias(c)
    for c in flights_df.columns
]).show()

Section 1. Chapter 8
