Introduction to PySpark

Understanding RDDs


Before DataFrames existed, Spark's primary abstraction was the Resilient Distributed Dataset (RDD). Understanding RDDs helps you grasp how Spark works under the hood – DataFrames are built on top of them.

What Is an RDD

An RDD is an immutable, distributed collection of objects. "Resilient" means Spark can reconstruct lost partitions automatically by replaying the lineage of transformations. "Distributed" means the data is split across the cluster and processed in parallel.
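To make "replaying the lineage" more concrete, here is a minimal plain-Python sketch. It is a toy analogy, not Spark's actual implementation: the names `source_partitions`, `lineage`, and `compute_partition` are hypothetical, and the data values are made up for illustration.

```python
from functools import reduce

# Source data, already split into two partitions (hypothetical values).
source_partitions = [[1, 2, 3], [4, 5, 6]]

# The lineage: an ordered record of the transformations applied to the source.
lineage = [
    lambda part: [x * 10 for x in part],       # similar in spirit to map
    lambda part: [x for x in part if x > 10],  # similar in spirit to filter
]


def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    return reduce(lambda part, step: step(part), lineage, source_partitions[index])


# Materialized results, as if each one were held by a different worker.
computed = {i: compute_partition(i) for i in range(len(source_partitions))}

# Simulate losing the partition held by a failed worker...
del computed[1]

# ...and recover only that partition by replaying the lineage.
computed[1] = compute_partition(1)
print(computed)  # {0: [20, 30], 1: [40, 50, 60]}
```

The point of the sketch is that nothing needs to be backed up: as long as the source data and the recorded steps are available, any lost partition can be recomputed on its own.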

Three key properties:

  • Immutable: you never modify an RDD in place. Every transformation produces a new RDD;
  • Lazy: transformations are not executed until an action is called;
  • Partitioned: data is divided into chunks called partitions, and each partition is processed by a single task on an executor.
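The three properties can be sketched with a small plain-Python class. This is only an analogy under simplifying assumptions: `LazyDataset` is a hypothetical stand-in, not a PySpark class, and it runs in a single process rather than on a cluster.

```python
class LazyDataset:
    """Toy stand-in for an RDD: holds partitions plus pending transformations."""

    def __init__(self, partitions, steps=()):
        self._partitions = partitions  # partitioned: a list of chunks
        self._steps = tuple(steps)     # transformations recorded, not yet run

    def map(self, fn):
        # Immutable: return a NEW dataset and leave this one untouched.
        return LazyDataset(self._partitions, self._steps + (fn,))

    def collect(self):
        # The action: only now are the recorded steps actually executed.
        result = []
        for part in self._partitions:
            for item in part:
                for fn in self._steps:
                    item = fn(item)
                result.append(item)
        return result


calls = []  # records every time the transformation function really runs


def double(x):
    calls.append(x)
    return x * 2


base = LazyDataset([[1, 2], [3, 4]])
doubled = base.map(double)

print(len(calls))         # 0, because map() only recorded the step
print(doubled.collect())  # [2, 4, 6, 8]
print(base.collect())     # [1, 2, 3, 4], the original is unchanged
print(len(calls))         # 4, the function ran once per element
```

In real Spark each partition would be handled by a separate task, possibly on a different machine; the nested loop in `collect()` here just visits them one after another.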

RDDs vs DataFrames

If you already know pandas, you might wonder why RDDs matter. The short answer is that DataFrames are the right tool for structured data – but RDDs give you lower-level control when you need to work with unstructured data or custom Python objects.

Creating an RDD

    import urllib.request
    from pyspark.sql import SparkSession

    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("RDDDemo") \
        .master("local[*]") \
        .getOrCreate()

    # Creating an RDD from a list
    numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
    print(numbers_rdd.collect())

    # Creating an RDD from a text file
    lines_rdd = spark.sparkContext.textFile("flights.csv")
    print(lines_rdd.first())
    print(f"Total lines: {lines_rdd.count()}")

parallelize() distributes a local Python collection across partitions. textFile() reads a file and returns one RDD element per line.
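If it helps to picture what these two calls do to the data, the following plain-Python sketch imitates the idea. It is a simplification, not Spark's code: `split_into_partitions` is a hypothetical helper, the slicing rule is one reasonable choice rather than a guarantee about Spark's behavior, and the CSV text is invented sample data.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into contiguous, roughly equal slices."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# Rough analogy for parallelize(): one local list becomes several partitions.
print(split_into_partitions([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4, 5]]

# Rough analogy for textFile(): each line of the file becomes one element.
sample_text = "id,origin,dest\n1,JFK,LAX\n2,SFO,ORD\n"  # hypothetical content
lines = sample_text.splitlines()
print(lines[0])    # id,origin,dest
print(len(lines))  # 3
```

In PySpark itself, parallelize() accepts an optional numSlices argument to control the number of partitions, and getNumPartitions() on the resulting RDD reports how many were created.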


What does Resilient mean in RDD?
