Introduction to PySpark

Introducing DataFrames

A DataFrame is a distributed collection of data organized into named columns. Conceptually it resembles a pandas DataFrame or a SQL table, but it is partitioned and processed in parallel across a cluster. DataFrames are the primary abstraction for structured data in PySpark and the tool you will use for the vast majority of real-world tasks.

DataFrames vs RDDs

RDDs give you full flexibility but no structure: Spark treats each element as an opaque Python object. DataFrames add a schema, so every column has a name and a type. Because Spark knows the schema and sees each operation as a declarative expression, its Catalyst query optimizer can rewrite the query plan before execution, for example by pruning unused columns or applying filters earlier.

For structured data like the flights dataset, DataFrames are typically faster, more readable, and require less code than RDDs.

Loading a DataFrame

    import urllib.request
    from pyspark.sql import SparkSession

    # Downloading the dataset locally
    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("DataFramesIntro") \
        .master("local[*]") \
        .getOrCreate()

    flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

    # Inspecting the schema
    flights_df.printSchema()

    # Checking dimensions
    print(f"Rows: {flights_df.count()}, Columns: {len(flights_df.columns)}")

Previewing Data

    # Showing the first 5 rows
    flights_df.show(5)

    # Listing all column names
    print(flights_df.columns)

    # Basic statistics for numeric columns
    flights_df.describe("DEPARTURE_DELAY", "ARRIVAL_DELAY", "DISTANCE").show()

describe() returns count, mean, standard deviation, min, and max for each requested column, which makes it a quick sanity check before any analysis.

Run this locally and compare the printSchema() output with the column list to verify that numeric columns like DEPARTURE_DELAY were inferred as numeric types (integer or double) rather than as strings.

What is the main advantage of DataFrames over RDDs?

Section 1. Chapter 7
