Introduction to PySpark

Loading and Inspecting Data with DataFrames

Before any analysis, you need to understand what your data looks like: its shape, types, and basic statistics. PySpark provides several methods for this. Some, such as printSchema(), only read metadata, while count() and describe() scan the full dataset.

Reading CSV with Options

inferSchema=True works for small datasets but requires an extra pass over the data. For large files, defining the schema explicitly is faster:

    import urllib.request
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType

    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("InspectingData") \
        .master("local[*]") \
        .getOrCreate()

    schema = StructType([
        StructField("YEAR", IntegerType(), True),
        StructField("MONTH", IntegerType(), True),
        StructField("DAY", IntegerType(), True),
        StructField("DAY_OF_WEEK", IntegerType(), True),
        StructField("AIRLINE", StringType(), True),
        StructField("FLIGHT_NUMBER", IntegerType(), True),
        StructField("ORIGIN_AIRPORT", StringType(), True),
        StructField("DESTINATION_AIRPORT", StringType(), True),
        StructField("DEPARTURE_DELAY", FloatType(), True),
        StructField("ARRIVAL_DELAY", FloatType(), True),
        StructField("DISTANCE", IntegerType(), True),
    ])

    flights_df = spark.read.csv("flights.csv", header=True, schema=schema)

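With CSV, Spark applies the schema by column position rather than by header name, so the fields must be listed in the same order as the file's columns. Columns that come after the last schema field are left out of the DataFrame, which helps when you only need the leading part of a wide dataset. To pick arbitrary columns, load them and then narrow the result with select().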

Inspecting the DataFrame

    # Printing column names and types
    flights_df.printSchema()

    # Counting rows
    print(flights_df.count())

    # Previewing data
    flights_df.show(5, truncate=False)

    # Summary statistics
    flights_df.describe("DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE").show()

By default, show() cuts off string values longer than 20 characters; truncate=False prints them in full.

Checking for Nulls

    from pyspark.sql.functions import col, count, when

    # Counting null values per column
    flights_df.select([
        count(when(col(c).isNull(), c)).alias(c)
        for c in flights_df.columns
    ]).show()

What does inferSchema=True do?

Section 1. Chapter 8
