Apprendre How Spark Works: Distributed Computing Explained

Glissez pour afficher le menu

When you run a PySpark job, Spark does not process your data on one machine. It splits the work across a cluster. Understanding this architecture helps you write more efficient code and interpret performance issues.

The Driver and Executors

Every Spark application has two types of processes:

Driver: the process running your Python code. It builds an execution plan and coordinates the work;
Executors: worker processes that run on cluster nodes. They receive tasks from the driver, process partitions of data, and return results.

Your PySpark script always runs on the driver. The actual data processing happens on the executors.

Partitions

Spark splits a dataset into partitions – chunks of rows that can be processed independently in parallel. Each executor handles one or more partitions at a time. More partitions allow more parallelism, but too many small partitions add overhead.

When you load a CSV file, Spark automatically partitions it. You can check and adjust partitioning:


              12345678910111213141516
            
import urllib.request
from pyspark.sql import SparkSession

# Downloading the file locally first
urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("FlightsAnalysis") \
    .master("local[*]") \
    .getOrCreate()

df = spark.read.csv("flights.csv", header=True, inferSchema=True)
df.show(5)

Note

SparkSession is covered in detail in the next chapter – for now, treat this as the standard boilerplate needed to run any PySpark code.

Lazy Evaluation

Spark does not execute transformations immediately. When you call filter(), select(), or groupBy(), Spark records the operation in an execution plan but does nothing yet. Execution happens only when you call an action – such as show(), count(), or write().

This allows Spark to optimize the full plan before running anything, often combining or reordering steps for efficiency.


              12345
            
# Building the execution plan without triggering computation
filtered = df.filter(df["ARRIVAL_DELAY"] > 60).select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT")

# Triggering computation with show()
filtered.show(5)

Tout était clair ?

Merci pour vos commentaires !

Section 1. Chapitre 2

Demandez à l'IA

Posez n'importe quelle question ou essayez l'une des questions suggérées pour commencer notre discussion

Section 1. Chapitre 2