Mastering Big Data with PySpark
Data Partitioning
Now that you know what distributed systems are and why they matter, it's time to ask a deeper question: how do they actually work? Sure, adding more machines might sound as easy as plugging in a few network cables — but in reality, there's a lot more going on under the hood. To turn a bunch of computers into a cohesive and efficient system, you have to start with partitioning the data.
Data partitioning is the process of dividing a large dataset or table into smaller, independent segments — called partitions — that can be stored and processed separately, often across multiple nodes.
Partitioning isn't exclusive to distributed systems. Many modern databases also use partitioning on a single machine — often across multiple drives — to improve query performance and I/O efficiency. But in distributed systems, partitioning is critical for scalability and parallel processing. It allows massive datasets to be spread across nodes and computed in chunks, making complex workloads manageable.
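To make this concrete, here is a minimal PySpark sketch (the dataset and partition counts are illustrative) that distributes a range of numbers across several partitions and inspects the resulting chunks:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partition-demo").getOrCreate()

# Distribute one million numbers across 8 partitions.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())          # -> 8

# glom() collects each partition into a list, so we can see the chunk sizes.
print(rdd.glom().map(len).collect())   # eight roughly equal chunks of 125000
```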
There are two primary types of partitioning:
Horizontal Partitioning (Sharding)
In horizontal partitioning, also known as sharding, each partition contains a subset of rows from the dataset. Imagine a user table with millions of records. Instead of storing it all on one server, you split it into shards, perhaps by region, by user ID range, or by a hash function. Each shard follows the same schema but holds only part of the data. Together, they still represent the full dataset.
Sharding is widely used in distributed databases like MongoDB, Cassandra, and Google Spanner.
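As a plain-Python sketch of the hash approach (the shard count and user IDs are made up), note that the same ID always maps to the same shard and keys spread roughly evenly:

```python
NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    # Real systems use a deterministic hash (e.g. MurmurHash);
    # Python's built-in hash() of an int is stable enough for a demo.
    return hash(user_id) % NUM_SHARDS

# Assign a handful of users to shards; together the shards
# still represent the full dataset.
shards: dict[int, list[int]] = {}
for uid in [101, 202, 303, 404, 505, 606]:
    shards.setdefault(shard_for(uid), []).append(uid)

print(shards)   # -> {1: [101, 505], 2: [202, 606], 3: [303], 0: [404]}
```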
Vertical Partitioning
In vertical partitioning, the dataset is split by columns instead of rows. Each partition holds a subset of attributes. This is especially useful in wide tables, where only a few columns are accessed frequently. Separating the "hot" and "cold" columns improves query speed and reduces data scanned per operation.
A similar principle underlies columnar file formats like Parquet and ORC, which are optimized for analytical workloads.
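A short PySpark sketch of this idea (the table, column names, and path are hypothetical): write a wide table to Parquet, then read back only the hot columns, so the cold ones are never scanned:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A hypothetical wide table with frequently used ("hot") and rarely
# used ("cold") columns.
df = spark.createDataFrame(
    [(1, "alice", "2024-01-01", "long profile text ..."),
     (2, "bob",   "2024-02-15", "another long blob ...")],
    ["user_id", "name", "signup_date", "bio"],
)
df.write.mode("overwrite").parquet("/tmp/users_parquet")

# Parquet stores data column by column, so selecting only the hot
# columns means the cold ones (signup_date, bio) are never read from disk.
hot = spark.read.parquet("/tmp/users_parquet").select("user_id", "name")
hot.show()
```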
Partitioning is essential for parallel computation. In systems like Apache Spark and Hadoop, data is divided into partitions, each of which can be processed by a separate worker. A well-designed partitioning strategy leads to better resource utilization and faster execution.
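For instance, in this minimal PySpark sketch (the partition count and function name are illustrative), mapPartitions runs a function once per partition, which is exactly how separate workers each process their own chunk:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

def count_rows(partition):
    # Runs once per partition, on whichever worker holds that chunk.
    yield sum(1 for _ in partition)

rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
print(rdd.mapPartitions(count_rows).collect())   # -> [25, 25, 25, 25]
```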