Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Impara How Spark Works: Distributed Computing Explained | Section
Introduction to PySpark

How Spark Works: Distributed Computing Explained

Scorri per mostrare il menu

When you run a PySpark job, Spark does not process your data on one machine. It splits the work across a cluster. Understanding this architecture helps you write more efficient code and interpret performance issues.

The Driver and Executors

Every Spark application has two types of processes:

  • Driver: the process running your Python code. It builds an execution plan and coordinates the work;
  • Executors: worker processes that run on cluster nodes. They receive tasks from the driver, process partitions of data, and return results.

Your PySpark script always runs on the driver. The actual data processing happens on the executors.

Partitions

Spark splits a dataset into partitions – chunks of rows that can be processed independently in parallel. Each executor handles one or more partitions at a time. More partitions allow more parallelism, but too many small partitions add overhead.

When you load a CSV file, Spark automatically partitions it. You can check and adjust partitioning:

12345678910111213141516
import urllib.request from pyspark.sql import SparkSession # Downloading the file locally first urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("FlightsAnalysis") \ .master("local[*]") \ .getOrCreate() df = spark.read.csv("flights.csv", header=True, inferSchema=True) df.show(5)
Note
Note

SparkSession is covered in detail in the next chapter – for now, treat this as the standard boilerplate needed to run any PySpark code.

Lazy Evaluation

Spark does not execute transformations immediately. When you call filter(), select(), or groupBy(), Spark records the operation in an execution plan but does nothing yet. Execution happens only when you call an action – such as show(), count(), or write().

This allows Spark to optimize the full plan before running anything, often combining or reordering steps for efficiency.

12345
# Building the execution plan without triggering computation filtered = df.filter(df["ARRIVAL_DELAY"] > 60).select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT") # Triggering computation with show() filtered.show(5)
question mark

What triggers execution in Spark?

Seleziona la risposta corretta

Tutto è chiaro?

Come possiamo migliorarlo?

Grazie per i tuoi commenti!

Sezione 1. Capitolo 2

Chieda ad AI

expand

Chieda ad AI

ChatGPT

Chieda pure quello che desidera o provi una delle domande suggerite per iniziare la nostra conversazione

Sezione 1. Capitolo 2
some-alt