Introducing DataFrames
A DataFrame is a distributed collection of data organized into named columns. Conceptually it resembles a pandas DataFrame or a SQL table, but it is processed in parallel across a cluster. DataFrames are the primary abstraction for structured data in PySpark and the tool you will use for the vast majority of real-world tasks.
DataFrames vs RDDs
RDDs give you full flexibility but no structure: Spark treats each element as an opaque Python object. DataFrames add a schema, so every column has a name and a type. This lets Spark's Catalyst query optimizer rewrite your operations into a more efficient execution plan before anything runs.
For structured data like the flights dataset, DataFrames are faster, more readable, and require less code than RDDs.
Loading a DataFrame
```python
import urllib.request
from pyspark.sql import SparkSession

# Downloading the dataset locally
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("DataFramesIntro") \
    .master("local[*]") \
    .getOrCreate()

flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Inspecting the schema
flights_df.printSchema()

# Checking dimensions
print(f"Rows: {flights_df.count()}, Columns: {len(flights_df.columns)}")
```
Previewing Data
```python
# Showing the first 5 rows
flights_df.show(5)

# Listing all column names
print(flights_df.columns)

# Basic statistics for numeric columns
flights_df.describe("DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE").show()
```
describe() returns count, mean, standard deviation, min, and max – a quick sanity check before any analysis.
Run the lesson code locally and compare the printSchema() output with the column list to verify that numeric columns like DEPARTURE_DELAY were inferred as numeric types (such as integer or double) rather than as strings.