Introduction to PySpark

Creating and Transforming RDDs


You can create an RDD from an existing DataFrame, from a Python collection, or directly from a file. Once you have one, you chain transformations to shape the data before triggering an action.

Creating RDDs

import urllib.request
from pyspark.sql import SparkSession

# Downloading the dataset locally
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("RDDTransformations") \
    .master("local[*]") \
    .getOrCreate()

# From a Python list
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# From a DataFrame (dataset already downloaded in previous chapter)
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)
flights_rdd = flights_df.rdd

Common Transformations

# map – applying a function to every element
distances_rdd = flights_rdd.map(lambda row: row["DISTANCE"])

# filter – keeping only elements that match a condition
long_flights_rdd = flights_rdd.filter(lambda row: row["DISTANCE"] > 1000)

# flatMap – like map but flattens the result
# Splitting airline codes into individual characters (illustrative)
chars_rdd = flights_rdd.flatMap(lambda row: list(row["AIRLINE"]))

# distinct – removing duplicates
unique_airlines_rdd = flights_rdd.map(lambda row: row["AIRLINE"]).distinct()

Chaining Transformations and Triggering Actions

# Finding unique airlines operating flights longer than 1000 miles
result = flights_rdd \
    .filter(lambda row: row["DISTANCE"] > 1000) \
    .map(lambda row: row["AIRLINE"]) \
    .distinct() \
    .collect()

print(sorted(result))

collect() returns all results to the driver as a Python list. Use it only when the result is small: calling collect() on a multi-gigabyte RDD can exhaust the driver's memory and crash it.

For large results, use take(n) to retrieve only the first n elements, or write the output to disk.

