Common File Formats

Storing data isn't just about saving information — it's about doing so in a way that keeps systems fast, costs low, and pipelines flexible. Simple formats like CSV and JSON are easy to use and widely supported, but they often struggle under the demands of large-scale processing. As data grows in volume and complexity, the choice of file format becomes increasingly important. From human-readable text files to highly optimized binary structures, formats like CSV, XML, JSON, Avro, Parquet, and ORC each bring unique advantages to the table.

Comparison Criteria

To better evaluate the strengths and limitations of each format, it's helpful to define a clear set of comparison criteria:

  • Human readability
    Can the data be easily viewed and understood in a standard text editor? Readable formats are convenient for debugging and small-scale tasks but often sacrifice performance and structure.
  • Row or column orientation
    Data can be stored in a row-oriented or column-oriented layout:
    • Row-based formats store data record-by-record, which is efficient for transactional workloads or writing individual rows;
    • Column-based formats store data field-by-field, making them ideal for analytical queries and compression.
  • Schema & schema evolution support
    Some formats allow you to define and embed a schema, and even support schema evolution — the ability to add, remove, or rename fields without breaking existing files. This is crucial for long-lived pipelines where data structures change over time.
  • Compression efficiency
    Storage space and I/O speed are heavily influenced by how well a format compresses data. Columnar formats typically offer stronger compression due to repeated values across columns and built-in encoding strategies like dictionary or run-length encoding.
  • Splittability
    A format is considered splittable if it allows large files to be processed in parallel by multiple workers. This is especially important in distributed systems like Hadoop or Spark. Some compression methods, like gzip, prevent splitting unless the format was designed to work around this limitation (see the sketch just after this list).
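
To make splittability concrete, here is a minimal PySpark sketch comparing how many input partitions Spark creates for the same CSV stored uncompressed versus gzip-compressed. The file names are hypothetical, and exact partition counts depend on file size and cluster settings.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittability-demo").getOrCreate()

# An uncompressed CSV can be split into many input partitions...
plain = spark.read.option("header", True).csv("events.csv")
# ...but a gzipped CSV must be decompressed sequentially by one task.
gzipped = spark.read.option("header", True).csv("events.csv.gz")

print(plain.rdd.getNumPartitions())    # typically > 1 for a large file
print(gzipped.rdd.getNumPartitions())  # 1, regardless of file size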

The Formats

CSV (Comma-Separated Values)

CSV is a plain-text format where each line represents a row and columns are separated by commas. It's often the first format people encounter when working with tabular data.

id,name,age
1,Alice,34
2,Bob,29
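
A minimal PySpark sketch for loading this sample, assuming it is saved as people.csv. Because CSV carries no data types, the schema must either be inferred (an extra pass over the data) or declared explicitly:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-demo").getOrCreate()

# Without inferSchema, every column would be read as a string.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("people.csv"))
df.printSchema()  # id: int, name: string, age: int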

Pros:

  • Human-readable;
  • Universally supported.

Cons:

  • No schema or data types;
  • Does not support nested structures;
  • Poor compression;
  • Fails to scale well for large datasets, especially when compressed with gzip (which makes it non-splittable).

Best used for: small datasets, quick data exports, data exchange with non-technical users.

XML (eXtensible Markup Language)

XML uses nested tags to represent complex, hierarchical data structures. It was widely adopted in enterprise systems before JSON became dominant.

<people>
  <person>
    <id>1</id>
    <name>Alice</name>
    <age>34</age>
  </person>
  <person>
    <id>2</id>
    <name>Bob</name>
    <age>29</age>
  </person>
</people>
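
Older Spark versions have no built-in XML reader; it is typically added via the external spark-xml package (recent Spark releases ship an xml data source natively). A sketch assuming the package is available and the sample above is saved as people.xml:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xml-demo").getOrCreate()

# rowTag names the element that represents a single record.
df = (spark.read.format("xml")
      .option("rowTag", "person")
      .load("people.xml"))
df.show()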

Pros:

  • Self-describing and highly structured;
  • Supports validation using XSD schemas;
  • Works well for deeply nested data.

Cons:

  • Verbose and inefficient for large-scale processing;
  • Slow to parse;
  • Difficult to split for parallel reads;
  • Low compression efficiency.

Best used for: data exchange in enterprise systems, configuration files, legacy workflows.

JSON (JavaScript Object Notation)

JSON is a lightweight, text-based format widely used in APIs and modern data pipelines. It supports nesting and flexible schemas, making it ideal for representing semi-structured data.

[
  { "id": 1, "name": "Alice", "age": 34 },
  { "id": 2, "name": "Bob", "age": 29 }
]
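
One practical detail: Spark's JSON reader expects newline-delimited JSON by default, so reading a pretty-printed array like the one above requires the multiLine option, which gives up splittability. A minimal sketch, assuming the file is saved as people.json:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-demo").getOrCreate()

# multiLine=True parses each file as one JSON document; the default
# (one object per line, NDJSON) keeps the input splittable.
df = spark.read.option("multiLine", True).json("people.json")
df.show()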

Pros:

  • Human-readable;
  • Supports arrays and nested objects;
  • Widely used in modern applications and web services.

Cons:

  • No built-in schema enforcement;
  • Slower to parse at scale;
  • Hard to split for parallel processing unless stored as newline-delimited JSON (NDJSON).

Best used for: APIs, logging systems, and quick prototyping.

Apache Avro

Avro stores data in a row-oriented binary format, where each file begins with a JSON-encoded schema followed by a series of data blocks. Each block contains multiple records encoded according to the schema, and blocks are separated by sync markers, which allow the file to be split and processed in parallel.
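
Avro support ships as a separate Spark module, so the spark-avro package must be on the classpath (for example via --packages org.apache.spark:spark-avro_2.12:<your-spark-version>). A sketch with hypothetical paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 34), (2, "Bob", 29)], ["id", "name", "age"])

# The schema travels inside the file, so readers need no side channel.
df.write.format("avro").mode("overwrite").save("people_avro")
spark.read.format("avro").load("people_avro").printSchema()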

Pros:

  • Excellent for write-heavy workloads;
  • Supports rich data types and schema evolution;
  • Splittable via sync markers;
  • Good integration with Kafka and Hive.

Cons:

  • Not human-readable;
  • Less efficient for analytical queries compared to columnar formats.

Best used for: streaming pipelines, data ingestion, long-term storage where schema may change over time.

Apache Parquet

Parquet stores data in a column-oriented format, grouping values by column rather than by row. Each file is split into row groups, and within each group, data is further divided into column chunks and compressed pages, enabling fast scans of specific columns and reducing I/O.
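
Parquet is Spark's default file format, so no extra packages are needed. A minimal sketch (paths hypothetical) showing that selecting a subset of columns only touches those column chunks on disk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 34), (2, "Bob", 29)], ["id", "name", "age"])
df.write.mode("overwrite").parquet("people_parquet")

# Column pruning: only the 'name' column chunks are read, and filters
# can skip whole row groups using the min/max stats in the footer.
spark.read.parquet("people_parquet").select("name").show()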

Pros:

  • High compression efficiency;
  • Supports complex types and schema evolution;
  • Widely supported by big data tools.

Cons:

  • Not human-readable;
  • Frequent small writes or updates are very inefficient.

Best used for: analytical queries, data lakes, and massive datasets where performance and efficiency matter.

Apache ORC (Optimized Row Columnar)

ORC also follows a columnar layout, but enhances it with stripes — large blocks that contain column data, indexes, and metadata. Each stripe includes lightweight indexes, such as min/max values and Bloom filters, which accelerate filtering.
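
ORC is also supported natively in Spark. A sketch (hypothetical paths) showing a write and a filtered read, where predicate pushdown can skip stripes whose min/max indexes rule out the filter:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 34), (2, "Bob", 29)], ["id", "name", "age"])
df.write.mode("overwrite").orc("people_orc")

# The filter is pushed down to the reader; stripes whose statistics
# show no rows with age > 30 are skipped without being decompressed.
spark.read.orc("people_orc").filter("age > 30").show()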

Pros:

  • Excellent compression;
  • Built-in indexes at the stripe level;
  • Supports complex types and schema evolution;
  • Great performance with Hive and Spark.

Cons:

  • Binary format;
  • Slightly less portable than Parquet.

Best used for: large-scale analytical queries in Hive and Spark.

Comparison Table

Format  | Human-readable | Orientation | Schema & evolution       | Compression | Splittable
--------|----------------|-------------|--------------------------|-------------|---------------------
CSV     | Yes            | Row         | None                     | Poor        | Yes (not if gzipped)
XML     | Yes            | Row         | XSD validation           | Low         | Difficult
JSON    | Yes            | Row         | Flexible, not enforced   | Poor        | Only as NDJSON
Avro    | No (binary)    | Row         | Yes (embedded, evolves)  | Good        | Yes (sync markers)
Parquet | No (binary)    | Column      | Yes                      | High        | Yes (row groups)
ORC     | No (binary)    | Column      | Yes                      | Excellent   | Yes (stripes)

Quick Recap

  • JSON has become the de facto standard for web APIs and log files, thanks to its lightweight syntax and support for nested objects.
  • Parquet is best suited for analytical workloads and column-based queries, offering excellent compression and selective reads.
  • CSV is a simple, human-readable text format ideal for quick exports of tabular data.
  • ORC is optimal in Hive/Spark-heavy environments for read-intensive analytics, leveraging stripe-based storage, built-in indexes, and high compression.
  • Avro is ideal for streaming data or schema-evolving ingestion pipelines, thanks to its embedded schema and support for schema evolution.
  • XML is a tag-based format mostly used in legacy enterprise systems, enabling hierarchical data exchange with strict schema validation.
