Introduction to PySpark

What Is Big Data and Why PySpark?

Big Data refers to datasets too large or complex to be processed efficiently with traditional tools like pandas or SQL on a single machine. The scale is typically described by three dimensions:

  • Volume: data measured in terabytes or petabytes rather than gigabytes;
  • Velocity: data generated continuously and at high speed (sensor streams, transaction logs);
  • Variety: structured tables, semi-structured JSON, unstructured text and media – often mixed together.

pandas loads an entire dataset into the memory of a single machine, so once your data no longer fits in RAM it stops being a practical option. You need a framework built for distributed computing.
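
A back-of-the-envelope estimate makes the RAM limit concrete. The sketch below assumes a purely numeric table stored as 8-byte values with no extra overhead; the row count, column count, and the 16 GB machine are hypothetical figures chosen only for illustration.

```python
# Rough memory estimate for a purely numeric table.
# Assumptions (illustrative only): 8 bytes per value, no index or object overhead.

BYTES_PER_VALUE = 8  # e.g. a 64-bit float or integer


def estimate_size_gb(n_rows: int, n_cols: int) -> float:
    """Return the approximate in-memory size of the table in gigabytes (10**9 bytes)."""
    return n_rows * n_cols * BYTES_PER_VALUE / 1e9


# Hypothetical dataset: 2 billion rows, 20 numeric columns
size_gb = estimate_size_gb(2_000_000_000, 20)
ram_gb = 16  # hypothetical laptop memory

print(f"Estimated size: {size_gb:.0f} GB")
print(f"Fits in {ram_gb} GB of RAM: {size_gb <= ram_gb}")
```

Real tables with strings or nested data usually take more space than this estimate, so the gap between dataset size and single-machine memory tends to be even wider in practice.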

Where PySpark Fits In

Apache Spark is an open-source distributed computing engine designed to process large datasets across a cluster of machines. PySpark is its Python API – it lets you write Spark jobs in Python while Spark handles distributing the work across nodes.

Compared to pandas, PySpark:

  • processes data in parallel across many machines instead of sequentially on one;
  • handles datasets that far exceed the memory of any single machine;
  • provides a high-level DataFrame API that will feel familiar if you know pandas, although the method names and execution model differ because it is built for distributed workloads.
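
The idea behind parallel processing can be sketched in plain Python: split the data into partitions, process each partition independently, then combine the partial results. This is a conceptual illustration only, not how Spark is implemented. Worker threads stand in for cluster nodes, and the helper names `split_into_partitions` and `partial_sum` are hypothetical.

```python
# Conceptual sketch of split -> process in parallel -> combine.
# This is NOT Spark: worker threads stand in for cluster nodes.
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, n_partitions):
    """Split a non-empty list into roughly equal contiguous chunks."""
    chunk = -(-len(data) // n_partitions)  # ceiling division
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]


def partial_sum(partition):
    """Work done independently on one partition."""
    return sum(partition)


values = list(range(1, 1001))  # hypothetical dataset: the integers 1..1000
partitions = split_into_partitions(values, 4)

# Each partition is handed to a separate worker
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum, partitions))

total = sum(partial_results)  # combine step
print(partial_results)
print(total)
```

In a real cluster the partitions live on different machines, and Spark takes care of scheduling the work and moving partial results between nodes.
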

```python
from pyspark.sql import SparkSession

# Creating a local SparkSession – the entry point to any Spark application
spark = SparkSession.builder \
    .appName("BigDataIntro") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)
```
Note

SparkSession is covered in detail in the third chapter – for now, treat this as the standard boilerplate needed to run any PySpark code.

local[*] tells Spark to run locally using all available CPU cores – useful for development and learning before deploying to a real cluster.

What problem does PySpark solve?

Section 1. Chapter 1
