Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Lernen Creating and Transforming RDDs | Section
Introduction to PySpark

Creating and Transforming RDDs

Swipe um das Menü anzuzeigen

You can create an RDD from an existing DataFrame, from a Python collection, or directly from a file. Once you have one, you chain transformations to shape the data before triggering an action.

Creating RDDs

1234567891011121314151617181920
import urllib.request from pyspark.sql import SparkSession # Downloading the dataset locally urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("RDDTransformations") \ .master("local[*]") \ .getOrCreate() # From a Python list numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) # From a DataFrame (dataset already downloaded in previous chapter) flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) flights_rdd = flights_df.rdd

Common Transformations

123456789101112
# map – applying a function to every element distances_rdd = flights_rdd.map(lambda row: row["DISTANCE"]) # filter – keeping only elements that match a condition long_flights_rdd = flights_rdd.filter(lambda row: row["DISTANCE"] > 1000) # flatMap – like map but flattens the result # Splitting airline codes into individual characters (illustrative) chars_rdd = flights_rdd.flatMap(lambda row: list(row["AIRLINE"])) # distinct – removing duplicates unique_airlines_rdd = flights_rdd.map(lambda row: row["AIRLINE"]).distinct()

Chaining Transformations and Triggering Actions

12345678
# Finding unique airlines operating flights longer than 1000 miles result = flights_rdd \ .filter(lambda row: row["DISTANCE"] > 1000) \ .map(lambda row: row["AIRLINE"]) \ .distinct() \ .collect() print(sorted(result))

collect() returns all results to the driver as a Python list. Use it only when the result is small – calling collect() on a multi-gigabyte RDD will crash the driver.

For large results, use take(n) to retrieve only the first n elements, or write the output to disk.

question mark

What does collect() do?

Wählen Sie die richtige Antwort aus

War alles klar?

Wie können wir es verbessern?

Danke für Ihr Feedback!

Abschnitt 1. Kapitel 5

Fragen Sie AI

expand

Fragen Sie AI

ChatGPT

Fragen Sie alles oder probieren Sie eine der vorgeschlagenen Fragen, um unser Gespräch zu beginnen

Abschnitt 1. Kapitel 5
some-alt