Introduction to Big Data with Apache Spark in Python
Big Data Processing
In Big Data, there are many ways to process extremely large datasets.
Here are the most important ones:
- Clustered Computing - involves multiple interconnected nodes working as a unified system to improve performance and reliability;
- Parallel Computing - breaks down a task into smaller subtasks executed simultaneously across multiple processors or cores to speed up processing;
- Distributed Computing - multiple independent nodes (computers in a network) work together in parallel toward a common goal;
- Batch Processing - processes large volumes of data in discrete chunks (called batches), typically at scheduled intervals;
- Real-Time Processing - analyzes data as it is generated with minimal latency for immediate insights and responses.
Let’s look at each of them in more detail:
Clustered Computing
Clustered computing involves connecting multiple computers (nodes) together to work as a single system.
This setup allows for enhanced performance and reliability by leveraging the combined computational power and storage resources of all the nodes.
Parallel Computing
Parallel computing involves breaking down a computational task into smaller subtasks that are executed simultaneously across multiple processors or cores.
This approach speeds up processing by dividing the workload and executing parts concurrently.
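The split, execute, and combine pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) are hypothetical, and a thread pool is used to keep the example short and portable. In CPython, threads show the structure of the approach, while CPU-bound work would typically use a process pool (for example `ProcessPoolExecutor`) to get a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """Subtask: works on one chunk, independently of the others."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(numbers, workers=4):
    """Split the task, run the subtasks concurrently, combine the partial results."""
    chunks = split_into_chunks(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)
```

Each chunk is processed without knowing about the others, which is what makes the subtasks safe to run at the same time.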
Distributed Computing
Distributed computing involves multiple computers (or nodes) working together over a network to achieve a common goal.
Unlike clustered computing, distributed computing does not require the nodes to be physically close to each other or located in a single place.
Batch Processing
Batch processing refers to processing large volumes of data in discrete chunks or batches at scheduled intervals.
It involves collecting and storing data, processing it as a batch, and then producing results.
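The collect, process, and produce cycle described above might look like the following minimal sketch. The data (a list of order amounts) and the helper names `make_batches` and `process_batch` are assumptions made for illustration; a real batch job would usually read stored data from files or a database and run on a schedule.

```python
def make_batches(records, batch_size):
    """Yield consecutive batches of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def process_batch(batch):
    """Hypothetical per-batch job: total the order amounts in one batch."""
    return sum(batch)


# Data is collected and stored first, then processed one batch at a time.
collected_orders = [20, 35, 10, 50, 5, 40, 15]
batch_totals = [process_batch(b) for b in make_batches(collected_orders, batch_size=3)]

# Results are produced only after the batches have been processed.
grand_total = sum(batch_totals)
print(batch_totals, grand_total)
```

Note that no result is available until a whole batch has been processed, which is the main trade-off compared with processing each record as it arrives.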
Real-Time Processing
Real-time processing involves analyzing and processing data as it is generated or received, with minimal latency.
The goal is to provide immediate insights or responses based on the most current data.
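As a rough illustration of handling data as it arrives, the sketch below updates a running average after every incoming event instead of waiting for the full dataset. The names `running_average` and `sensor_stream` are hypothetical, and the hard-coded readings stand in for a live source such as a socket or a message queue.

```python
def running_average(events):
    """Yield the updated average after each event arrives."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


def sensor_stream():
    """Hypothetical event source; a real one would produce readings over time."""
    for reading in [10.0, 20.0, 30.0, 40.0]:
        yield reading


# An up-to-date result is available immediately after every event.
for average in running_average(sensor_stream()):
    print(average)
```

Because both functions are generators, each reading is processed as soon as it is produced, and only the count and the running total are kept in memory.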