How Spark Works: Distributed Computing Explained
Glissez pour afficher le menu
When you run a PySpark job, Spark does not process your data on one machine. It splits the work across a cluster. Understanding this architecture helps you write more efficient code and interpret performance issues.
The Driver and Executors
Every Spark application has two types of processes:
- Driver: the process running your Python code. It builds an execution plan and coordinates the work;
- Executors: worker processes that run on cluster nodes. They receive tasks from the driver, process partitions of data, and return results.
Your PySpark script always runs on the driver. The actual data processing happens on the executors.
Partitions
Spark splits a dataset into partitions – chunks of rows that can be processed independently in parallel. Each executor handles one or more partitions at a time. More partitions allow more parallelism, but too many small partitions add overhead.
When you load a CSV file, Spark automatically partitions it. You can check and adjust partitioning:
12345678910111213141516import urllib.request from pyspark.sql import SparkSession # Downloading the file locally first urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("FlightsAnalysis") \ .master("local[*]") \ .getOrCreate() df = spark.read.csv("flights.csv", header=True, inferSchema=True) df.show(5)
SparkSession is covered in detail in the next chapter – for now, treat this as the standard boilerplate needed to run any PySpark code.
Lazy Evaluation
Spark does not execute transformations immediately. When you call filter(), select(), or groupBy(), Spark records the operation in an execution plan but does nothing yet. Execution happens only when you call an action – such as show(), count(), or write().
This allows Spark to optimize the full plan before running anything, often combining or reordering steps for efficiency.
12345# Building the execution plan without triggering computation filtered = df.filter(df["ARRIVAL_DELAY"] > 60).select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT") # Triggering computation with show() filtered.show(5)
Merci pour vos commentaires !
Demandez à l'IA
Demandez à l'IA
Posez n'importe quelle question ou essayez l'une des questions suggérées pour commencer notre discussion