Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Aprende How Spark Works: Distributed Computing Explained | Section
Introduction to PySpark

How Spark Works: Distributed Computing Explained

Desliza para mostrar el menú

When you run a PySpark job, Spark does not process your data on one machine. It splits the work across a cluster. Understanding this architecture helps you write more efficient code and interpret performance issues.

The Driver and Executors

Every Spark application has two types of processes:

  • Driver: the process running your Python code. It builds an execution plan and coordinates the work;
  • Executors: worker processes that run on cluster nodes. They receive tasks from the driver, process partitions of data, and return results.

Your PySpark script always runs on the driver. The actual data processing happens on the executors.

Partitions

Spark splits a dataset into partitions – chunks of rows that can be processed independently in parallel. Each executor handles one or more partitions at a time. More partitions allow more parallelism, but too many small partitions add overhead.

When you load a CSV file, Spark automatically partitions it. You can check and adjust partitioning:

12345678910111213141516
import urllib.request from pyspark.sql import SparkSession # Downloading the file locally first urllib.request.urlretrieve( "https://staging-content-media-cdn.codefinity.com/courses/aa80ac56-0d50-49e8-9231-2c2374cd3e9d/flights.csv", "flights.csv" ) spark = SparkSession.builder \ .appName("FlightsAnalysis") \ .master("local[*]") \ .getOrCreate() df = spark.read.csv("flights.csv", header=True, inferSchema=True) df.show(5)
Note
Note

SparkSession is covered in detail in the next chapter – for now, treat this as the standard boilerplate needed to run any PySpark code.

Lazy Evaluation

Spark does not execute transformations immediately. When you call filter(), select(), or groupBy(), Spark records the operation in an execution plan but does nothing yet. Execution happens only when you call an action – such as show(), count(), or write().

This allows Spark to optimize the full plan before running anything, often combining or reordering steps for efficiency.

12345
# Building the execution plan without triggering computation filtered = df.filter(df["ARRIVAL_DELAY"] > 60).select("AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT") # Triggering computation with show() filtered.show(5)
question mark

What triggers execution in Spark?

Selecciona la respuesta correcta

¿Todo estuvo claro?

¿Cómo podemos mejorarlo?

¡Gracias por tus comentarios!

Sección 1. Capítulo 2

Pregunte a AI

expand

Pregunte a AI

ChatGPT

Pregunte lo que quiera o pruebe una de las preguntas sugeridas para comenzar nuestra charla

Sección 1. Capítulo 2
some-alt