Understanding RDDs

Before DataFrames existed, Spark's primary abstraction was the Resilient Distributed Dataset (RDD). Understanding RDDs helps you grasp how Spark works under the hood – DataFrames are built on top of them.

What Is an RDD

An RDD is an immutable, distributed collection of objects. "Resilient" means Spark can reconstruct lost partitions automatically by replaying the lineage of transformations. "Distributed" means the data is split across the cluster and processed in parallel.
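The idea of rebuilding a lost partition from lineage can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `compute_partition`, the sample partitions, and the two lambda steps are all hypothetical names and values chosen for illustration.

```python
# Toy model of lineage-based recovery (plain Python, not Spark's implementation).

def compute_partition(source_partition, lineage):
    """Rebuild one partition by applying every recorded step in order."""
    data = list(source_partition)
    for step in lineage:
        data = [step(x) for x in data]
    return data

# Source data split into three partitions (hypothetical values).
source_partitions = [[1, 2], [3, 4], [5, 6]]

# The lineage: the ordered transformations that produced the current data.
lineage = [lambda x: x + 1, lambda x: x * 10]

# Materialize every partition once.
computed = {i: compute_partition(p, lineage) for i, p in enumerate(source_partitions)}

# Simulate losing partition 1, e.g. because the machine holding it failed.
lost = computed.pop(1)

# Only the lost partition is recomputed, from its source data and the lineage.
computed[1] = compute_partition(source_partitions[1], lineage)
print(computed[1])
```

The point of the sketch is that no copy of the lost result is needed: the source partition plus the recorded steps are enough to reproduce it, and the other partitions are left alone.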

Three key properties:

Immutable: you never modify an RDD in place. Every transformation produces a new RDD;
Lazy: transformations are not executed until an action is called;
Partitioned: data is divided into chunks called partitions, and each partition is processed by one task on an executor.
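The first two properties can be illustrated with a small single-machine stand-in. `ToyRDD` below is a hypothetical class written for this sketch; it only mimics the shape of `map`, `filter`, and `collect`, and it does not distribute or partition anything.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # original input, never mutated
        self._steps = tuple(steps)    # recorded transformations, not yet run

    def map(self, fn):
        # Immutable: return a new object instead of changing this one.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Lazy: the recorded steps run only when an action is called.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every time the transformation actually runs

def double(x):
    calls.append(x)
    return x * 2

base = ToyRDD([1, 2, 3, 4, 5])
large_doubles = base.map(double).filter(lambda x: x > 4)

calls_before_action = list(calls)   # still empty: nothing has executed
result = large_doubles.collect()    # the action triggers the work
print(result)
print(base.collect())               # the original data is unchanged
```

Here `double` is not called when `map` is chained, only when `collect()` runs, and `base` still yields its original elements afterwards.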

RDDs vs DataFrames

If you already know pandas, you might wonder why RDDs matter. The short answer is that DataFrames are the right tool for structured data – but RDDs give you lower-level control when you need to work with unstructured data or custom Python objects.
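As a rough illustration of the "custom Python objects" case, the sketch below turns free-form text lines into instances of a small class. The `LogEvent` class, the `parse_line` function, and the `LEVEL: message` format are all assumptions made up for this example; the code itself is plain Python and does not need Spark to run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical record type for one parsed line."""
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEvent]:
    """Parse 'LEVEL: message'; return None for lines that do not match."""
    level, sep, message = line.partition(":")
    if not sep or not level.strip().isupper():
        return None
    return LogEvent(level.strip(), message.strip())


# Made-up input mixing well-formed and unstructured lines.
raw_lines = [
    "INFO: job started",
    "free-form text with no structure",
    "ERROR: disk full: /tmp",
]

events = [event for event in map(parse_line, raw_lines) if event is not None]
print(events)
```

With an RDD of text lines, a function like this could in principle be applied through `map()` and the `None` results dropped with `filter()`, which is the kind of row-by-row, arbitrary-object processing that a tabular DataFrame schema does not express as directly.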

Creating an RDD


import urllib.request
from pyspark.sql import SparkSession

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("RDDDemo") \
    .master("local[*]") \
    .getOrCreate()

# Creating an RDD from a list
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(numbers_rdd.collect())

# Creating an RDD from a text file
lines_rdd = spark.sparkContext.textFile("flights.csv")
print(lines_rdd.first())
print(f"Total lines: {lines_rdd.count()}")

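parallelize() distributes a local Python collection across partitions. textFile() reads a file and returns an RDD with one element per line. Because the file is treated as plain text, the CSV header comes back as the first element and is included in the line count.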
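To make "across partitions" and "one element per line" concrete, here is a plain-Python sketch. `split_into_partitions` is a hypothetical helper that cuts a list into contiguous slices; it is an illustration of the idea, and Spark's actual slicing may differ in detail. The CSV text is made-up sample data, not the contents of flights.csv.

```python
def split_into_partitions(items, num_partitions):
    """Cut a list into contiguous, roughly equal slices (illustrative only)."""
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# Five elements spread over two partitions.
partitions = split_into_partitions([1, 2, 3, 4, 5], 2)
print(partitions)

# One element per line, header included (hypothetical sample text).
csv_text = "year,month,day\n2013,1,1\n2013,1,2\n"
lines = csv_text.splitlines()
print(lines[0])
print(len(lines))
```

On a real RDD, `numbers_rdd.getNumPartitions()` reports how many partitions Spark chose, and `numbers_rdd.glom().collect()` returns the elements grouped by partition, which is a quick way to inspect the actual layout.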

Section 1. Chapter 4
