Summary  
This chapter covers how to initialize and use a Python API for a distributed computing engine, demonstrating session creation and parallel DataFrame operations across multiple cores. It explains the core abstractions that enable processing datasets larger than a single machine’s memory.

General domain of usage  
Big data processing

**Big Data** refers to datasets too large or complex to be processed efficiently with traditional single-machine tools such as pandas or a conventional SQL database. It is typically characterized by three dimensions, often called the "three Vs":

- **Volume**: data measured in terabytes or petabytes rather than gigabytes;
- **Velocity**: data generated continuously and at high speed (sensor streams, transaction logs);
- **Variety**: structured tables, semi-structured JSON, unstructured text and media – often mixed together.

pandas loads data into the memory of a single machine, so once your dataset no longer fits in RAM it becomes impractical. At that point you need a framework built for distributed computing.
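To make the "fits in RAM" question concrete, here is a back-of-the-envelope sketch. The dataset shape, the 8 bytes per value (a 64-bit number), and the 16 GB of RAM are illustrative assumptions, and the estimate ignores any per-object or index overhead.

```python
def estimate_size_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size in gigabytes (10**9 bytes), ignoring overhead."""
    return n_rows * n_cols * bytes_per_value / 1e9


# Hypothetical dataset: 1 billion rows, 20 numeric columns
size_gb = estimate_size_gb(1_000_000_000, 20)
ram_gb = 16  # assumed RAM of a single development machine

print(f"Estimated size: {size_gb:.0f} GB, available RAM: {ram_gb} GB")
print("Fits in memory:", size_gb < ram_gb)
```

With these assumed numbers the raw values alone are about ten times larger than the available memory, which is the situation distributed frameworks are designed for.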



## Where PySpark Fits In

**Apache Spark** is an open-source distributed computing engine designed to process large datasets across a cluster of machines. **PySpark** is its Python API – it lets you write Spark jobs in Python while Spark handles distributing the work across nodes.

Compared to pandas, PySpark:

- processes data in parallel across many machines instead of sequentially on one;
- handles datasets that far exceed the memory of any single machine;
- provides a high-level DataFrame API that will feel familiar to pandas users, although it is not identical (for example, Spark evaluates transformations lazily).



```python
from pyspark.sql import SparkSession

# Creating a local SparkSession – the entry point to any Spark application
spark = SparkSession.builder \
    .appName("BigDataIntro") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)
```

`SparkSession` is covered in detail in the third chapter. For now, treat this as the standard boilerplate needed to run any PySpark code.

**Note:** `local[*]` tells Spark to run locally using all available CPU cores, which is useful for development and learning before deploying to a real cluster.
