Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Lære Creating and Transforming RDDs | Section
Introduction to PySpark

Creating and Transforming RDDs

Sveip for å vise menyen

You can create an RDD from an existing DataFrame, from a Python collection, or directly from a file. Once you have one, you chain transformations to shape the data before triggering an action.

Creating RDDs

1234567891011121314151617181920
import urllib.request from pyspark.sql import SparkSession # Downloading the dataset locally urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("RDDTransformations") \ .master("local[*]") \ .getOrCreate() # From a Python list numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) # From a DataFrame (dataset already downloaded in previous chapter) flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) flights_rdd = flights_df.rdd

Common Transformations

123456789101112
# map – applying a function to every element distances_rdd = flights_rdd.map(lambda row: row["DISTANCE"]) # filter – keeping only elements that match a condition long_flights_rdd = flights_rdd.filter(lambda row: row["DISTANCE"] > 1000) # flatMap – like map but flattens the result # Splitting airline codes into individual characters (illustrative) chars_rdd = flights_rdd.flatMap(lambda row: list(row["AIRLINE"])) # distinct – removing duplicates unique_airlines_rdd = flights_rdd.map(lambda row: row["AIRLINE"]).distinct()

Chaining Transformations and Triggering Actions

12345678
# Finding unique airlines operating flights longer than 1000 miles result = flights_rdd \ .filter(lambda row: row["DISTANCE"] > 1000) \ .map(lambda row: row["AIRLINE"]) \ .distinct() \ .collect() print(sorted(result))

collect() returns all results to the driver as a Python list. Use it only when the result is small – calling collect() on a multi-gigabyte RDD will crash the driver.

For large results, use take(n) to retrieve only the first n elements, or write the output to disk.

question mark

What does collect() do?

Velg det helt riktige svaret

Alt var klart?

Hvordan kan vi forbedre det?

Takk for tilbakemeldingene dine!

Seksjon 1. Kapittel 5

Spør AI

expand

Spør AI

ChatGPT

Spør om hva du vil, eller prøv ett av de foreslåtte spørsmålene for å starte chatten vår

Seksjon 1. Kapittel 5
some-alt