Introduction to PySpark

Understanding RDDs

Before DataFrames existed, Spark's primary abstraction was the Resilient Distributed Dataset (RDD). Understanding RDDs helps you grasp how Spark works under the hood – DataFrames are built on top of them.

What Is an RDD

An RDD is an immutable, distributed collection of objects. "Resilient" means Spark can reconstruct lost partitions automatically by replaying the lineage of transformations. "Distributed" means the data is split across the cluster and processed in parallel.
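To make the idea of lineage concrete, here is a toy model in plain Python. It is only a sketch of the concept, not how Spark is implemented: the ToyRDD class and its compute_partition method are hypothetical names invented for this illustration. The point it shows is that a dataset that remembers its source and its list of transformations can rebuild any single partition on demand.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions  # list of lists, never mutated
        self._lineage = tuple(lineage)    # functions to replay, in order

    def map(self, fn):
        # Return a new ToyRDD with a longer lineage; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (fn,))

    def compute_partition(self, index):
        # Rebuild one partition from the source by replaying the lineage.
        data = list(self._source[index])
        for fn in self._lineage:
            data = [fn(x) for x in data]
        return data

    def collect(self):
        # Compute every partition and concatenate the results.
        result = []
        for i in range(len(self._source)):
            result.extend(self.compute_partition(i))
        return result


base = ToyRDD([[1, 2], [3, 4]])
derived = base.map(lambda x: x * 10).map(lambda x: x + 1)

all_values = derived.collect()
# Pretend partition 1 was lost: replay the lineage for that partition only.
recovered = derived.compute_partition(1)

print(all_values)  # [11, 21, 31, 41]
print(recovered)   # [31, 41]
```

Only the lost partition is recomputed, and the source data is never modified along the way.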

Three key properties:

  • Immutable: you never modify an RDD in place. Every transformation produces a new RDD;
  • Lazy: transformations are not executed until an action is called;
  • Partitioned: data is divided into partitions, and each partition is processed by a single task on an executor.

RDDs vs DataFrames

If you already know pandas, you might wonder why RDDs matter. The short answer is that DataFrames are the right tool for structured data – but RDDs give you lower-level control when you need to work with unstructured data or custom Python objects.

Creating an RDD

```python
import urllib.request
from pyspark.sql import SparkSession

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("RDDDemo") \
    .master("local[*]") \
    .getOrCreate()

# Creating an RDD from a list
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(numbers_rdd.collect())

# Creating an RDD from a text file
lines_rdd = spark.sparkContext.textFile("flights.csv")
print(lines_rdd.first())
print(f"Total lines: {lines_rdd.count()}")
```

parallelize() distributes a local Python collection across partitions. textFile() reads a file and returns one RDD element per line.

What does Resilient mean in RDD?
