Introduction to Big Data with Apache Spark in Python
RDD
What is an RDD?
A Resilient Distributed Dataset (RDD) is Spark's core data abstraction: an immutable, distributed collection of elements that can be processed in parallel. RDDs are fault-tolerant and efficient, making them well-suited for large-scale data processing tasks.
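For example, here is a minimal sketch of creating RDDs, assuming a local Spark installation (`data.txt` is a hypothetical path):

```python
from pyspark import SparkContext

# Assumed local setup for this sketch
sc = SparkContext("local[*]", "rdd-intro")

# An RDD can be created from an in-memory Python collection...
numbers = sc.parallelize([1, 2, 3, 4, 5])

# ...or from an external data source ("data.txt" is a hypothetical path)
lines = sc.textFile("data.txt")

sc.stop()
```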
Here’s an in-depth look at RDDs, their features, and their usage:
Key Features of RDDs
Here are the most important features of RDDs:
- In-Memory Computing - RDDs can be cached in memory, which significantly speeds up computations, especially for iterative algorithms that reuse the same data (see the sketch after this list);
- Fault Tolerance - RDDs track their lineage, the sequence of operations that created them, which lets Spark recompute lost partitions after node failures;
- Lazy Evaluation - RDD transformations are not executed immediately; they are recorded in the lineage graph and computed only when an action is called;
- Distributed Processing - RDDs are split into partitions that are processed in parallel across the nodes of a cluster, allowing Spark to scale horizontally and handle large datasets by adding more nodes.
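A minimal sketch of in-memory caching and lazy evaluation, assuming a local SparkContext:

```python
from pyspark import SparkContext

# Assumed local setup for this sketch
sc = SparkContext("local[*]", "rdd-features-demo")

# Split a small dataset into 4 partitions for parallel processing
numbers = sc.parallelize(range(1, 101), numSlices=4)

# map() is lazy: nothing runs yet, Spark only records the lineage
squares = numbers.map(lambda x: x * x)

# cache() marks the RDD to be kept in memory after its first computation
squares.cache()

print(squares.count())  # first action: triggers computation and caching
print(squares.sum())    # second action: served from the in-memory cache

sc.stop()
```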
RDD Operations
RDD operations fall into two categories: transformations and actions.
Transformations
Transformations create a new RDD from an existing one. They are lazy, meaning they don't execute until an action is performed.
- Examples:
- map(func): applies the function func to each element of the RDD, returning a new RDD;
- filter(func): filters elements based on the function func, returning a new RDD with elements that pass the condition;
- flatMap(func): similar to map, but the function can return multiple output elements for each input element.
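A minimal sketch of these transformations in action, again assuming a local SparkContext:

```python
from pyspark import SparkContext

# Assumed local setup for this sketch
sc = SparkContext("local[*]", "rdd-transformations-demo")

lines = sc.parallelize(["big data", "apache spark", "spark rdd"])

# map(): applies a function to every element, one output per input
upper = lines.map(lambda s: s.upper())

# filter(): keeps only the elements that satisfy the condition
with_spark = lines.filter(lambda s: "spark" in s)

# flatMap(): each input element may produce zero or more outputs
words = lines.flatMap(lambda s: s.split(" "))

# The transformations above are lazy; collect() is the action that runs them
print(words.collect())  # ['big', 'data', 'apache', 'spark', 'spark', 'rdd']

sc.stop()
```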
Actions
Actions trigger the execution of all recorded transformations and return a result to the driver program or write data to storage.
- Examples:
- count(): returns the number of elements in the RDD;
- collect(): retrieves all elements of the RDD to the driver program as a Python list;
- saveAsTextFile(path): writes the RDD elements as text files under the specified path, one file per partition.
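A minimal sketch of these actions, assuming a local SparkContext (`output_dir` is a hypothetical path and must not already exist):

```python
from pyspark import SparkContext

# Assumed local setup for this sketch
sc = SparkContext("local[*]", "rdd-actions-demo")

rdd = sc.parallelize([1, 2, 3, 4, 5])

# count(): returns the number of elements
print(rdd.count())    # 5

# collect(): brings all elements back to the driver as a Python list
print(rdd.collect())  # [1, 2, 3, 4, 5]

# saveAsTextFile(): writes one text file per partition under the path;
# "output_dir" is a hypothetical path and must not already exist
rdd.saveAsTextFile("output_dir")

sc.stop()
```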