Understanding RDDs
Before DataFrames existed, Spark's primary abstraction was the Resilient Distributed Dataset (RDD). Understanding RDDs helps you grasp how Spark works under the hood – DataFrames are built on top of them.
What Is an RDD
An RDD is an immutable, distributed collection of objects. "Resilient" means Spark can reconstruct lost partitions automatically by replaying the lineage of transformations. "Distributed" means the data is split across the cluster and processed in parallel.
Three key properties:
- Immutable: you never modify an RDD in place. Every transformation produces a new RDD;
- Lazy: transformations are not executed until an action is called;
- Partitioned: data is divided into chunks called partitions, and each partition is processed by a single task on an executor.
RDDs vs DataFrames
If you already know pandas, you might wonder why RDDs matter. The short answer is that DataFrames are the right tool for structured data – but RDDs give you lower-level control when you need to work with unstructured data or custom Python objects.
Creating an RDD
```python
import urllib.request
from pyspark.sql import SparkSession

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("RDDDemo") \
    .master("local[*]") \
    .getOrCreate()

# Creating an RDD from a list
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(numbers_rdd.collect())

# Creating an RDD from a text file
lines_rdd = spark.sparkContext.textFile("flights.csv")
print(lines_rdd.first())
print(f"Total lines: {lines_rdd.count()}")
```
parallelize() distributes a local Python collection across partitions. textFile() reads a file and returns one RDD element per line.