Understanding RDDs
Before DataFrames existed, Spark's primary abstraction was the Resilient Distributed Dataset (RDD). Understanding RDDs helps you grasp how Spark works under the hood – DataFrames are built on top of them.
What Is an RDD
An RDD is an immutable, distributed collection of objects. "Resilient" means Spark can reconstruct lost partitions automatically by replaying the lineage of transformations. "Distributed" means the data is split across the cluster and processed in parallel.
Three key properties:
- Immutable: you never modify an RDD in place. Every transformation produces a new RDD;
- Lazy: transformations are not executed until an action is called;
- Partitioned: data is divided into chunks called partitions, and each partition is processed by a single task on an executor.
RDDs vs DataFrames
If you already know pandas, you might wonder why RDDs matter. The short answer is that DataFrames are the right tool for structured data – but RDDs give you lower-level control when you need to work with unstructured data or custom Python objects.
Creating an RDD
import urllib.request
from pyspark.sql import SparkSession

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("RDDDemo") \
    .master("local[*]") \
    .getOrCreate()

# Creating an RDD from a list
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(numbers_rdd.collect())

# Creating an RDD from a text file
lines_rdd = spark.sparkContext.textFile("flights.csv")
print(lines_rdd.first())
print(f"Total lines: {lines_rdd.count()}")
parallelize() distributes a local Python collection across partitions. textFile() reads a file and returns one RDD element per line.