Lära Creating and Transforming RDDs

Svep för att visa menyn

You can create an RDD from an existing DataFrame, from a Python collection, or directly from a file. Once you have one, you chain transformations to shape the data before triggering an action.

Creating RDDs


              1234567891011121314151617181920
            
import urllib.request
from pyspark.sql import SparkSession

# Downloading the dataset locally
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("RDDTransformations") \
    .master("local[*]") \
    .getOrCreate()

# From a Python list
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# From a DataFrame (dataset already downloaded in previous chapter)
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)
flights_rdd = flights_df.rdd

Common Transformations


              123456789101112
            
# map – applying a function to every element
distances_rdd = flights_rdd.map(lambda row: row["DISTANCE"])

# filter – keeping only elements that match a condition
long_flights_rdd = flights_rdd.filter(lambda row: row["DISTANCE"] > 1000)

# flatMap – like map but flattens the result
# Splitting airline codes into individual characters (illustrative)
chars_rdd = flights_rdd.flatMap(lambda row: list(row["AIRLINE"]))

# distinct – removing duplicates
unique_airlines_rdd = flights_rdd.map(lambda row: row["AIRLINE"]).distinct()

Chaining Transformations and Triggering Actions


              12345678
            
# Finding unique airlines operating flights longer than 1000 miles
result = flights_rdd \
    .filter(lambda row: row["DISTANCE"] > 1000) \
    .map(lambda row: row["AIRLINE"]) \
    .distinct() \
    .collect()

print(sorted(result))

collect() returns all results to the driver as a Python list. Use it only when the result is small – calling collect() on a multi-gigabyte RDD will crash the driver.

For large results, use take(n) to retrieve only the first n elements, or write the output to disk.

Var allt tydligt?

Tack för dina kommentarer!

Avsnitt 1. Kapitel 5

Fråga AI

Fråga vad du vill eller prova någon av de föreslagna frågorna för att starta vårt samtal

Avsnitt 1. Kapitel 5