Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Вивчайте Creating and Transforming RDDs | Section
Introduction to PySpark

Creating and Transforming RDDs

Свайпніть щоб показати меню

You can create an RDD from an existing DataFrame, from a Python collection, or directly from a file. Once you have one, you chain transformations to shape the data before triggering an action.

Creating RDDs

1234567891011121314151617181920
import urllib.request from pyspark.sql import SparkSession # Downloading the dataset locally urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("RDDTransformations") \ .master("local[*]") \ .getOrCreate() # From a Python list numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) # From a DataFrame (dataset already downloaded in previous chapter) flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True) flights_rdd = flights_df.rdd

Common Transformations

123456789101112
# map – applying a function to every element distances_rdd = flights_rdd.map(lambda row: row["DISTANCE"]) # filter – keeping only elements that match a condition long_flights_rdd = flights_rdd.filter(lambda row: row["DISTANCE"] > 1000) # flatMap – like map but flattens the result # Splitting airline codes into individual characters (illustrative) chars_rdd = flights_rdd.flatMap(lambda row: list(row["AIRLINE"])) # distinct – removing duplicates unique_airlines_rdd = flights_rdd.map(lambda row: row["AIRLINE"]).distinct()

Chaining Transformations and Triggering Actions

12345678
# Finding unique airlines operating flights longer than 1000 miles result = flights_rdd \ .filter(lambda row: row["DISTANCE"] > 1000) \ .map(lambda row: row["AIRLINE"]) \ .distinct() \ .collect() print(sorted(result))

collect() returns all results to the driver as a Python list. Use it only when the result is small – calling collect() on a multi-gigabyte RDD will crash the driver.

For large results, use take(n) to retrieve only the first n elements, or write the output to disk.

question mark

What does collect() do?

Виберіть правильну відповідь

Все було зрозуміло?

Як ми можемо покращити це?

Дякуємо за ваш відгук!

Секція 1. Розділ 5

Запитати АІ

expand

Запитати АІ

ChatGPT

Запитайте про що завгодно або спробуйте одне із запропонованих запитань, щоб почати наш чат

Секція 1. Розділ 5
some-alt