Big Data Processing | Big Data Basics
Introduction to Big Data with Apache Spark in Python

Big Data Processing

There are several established approaches to processing extremely large datasets.

Here are the most important of them:

  • Clustered Computing - multiple interconnected nodes work as a unified system to improve performance and reliability;
  • Parallel Computing - a task is broken down into smaller subtasks that are executed simultaneously on multiple processors or cores to speed up processing;
  • Distributed Computing - multiple independent nodes (computers in a network) work together over the network toward a common goal;
  • Batch Processing - large volumes of data are collected and processed in discrete chunks (called batches), typically at scheduled intervals;
  • Real-Time Processing - data is analyzed as it is generated, with minimal latency, for immediate insights and responses.

Let’s look at each of them in more detail:

Clustered Computing

Clustered computing involves connecting multiple computers (nodes) together to work as a single system.

This setup allows for enhanced performance and reliability by leveraging the combined computational power and storage resources of all the nodes.

Parallel Computing

Parallel computing involves breaking down a computational task into smaller subtasks that are executed simultaneously across multiple processors or cores.

This approach speeds up processing by dividing the workload and executing parts concurrently.
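
The split-and-combine pattern described above can be sketched in a few lines of Python. This is only an illustration: the function names (split, sum_of_squares, parallel_sum_of_squares) are made up for this example, and a thread pool is used so that the snippet runs anywhere. In CPython, threads do not speed up CPU-bound work because of the global interpreter lock; swapping in ProcessPoolExecutor, which exposes the same interface, would spread the subtasks across multiple cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split(numbers, n_parts):
    """Split a list into at most n_parts slices of roughly equal size."""
    size = max(1, -(-len(numbers) // n_parts))  # ceiling division, at least 1
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def sum_of_squares(chunk):
    """The subtask: each worker handles one slice independently."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(numbers, n_workers=4):
    """Divide the workload, run the subtasks concurrently, combine the results."""
    chunks = split(numbers, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


print(parallel_sum_of_squares(list(range(10)), n_workers=3))
```

The key point is that each slice is processed without reference to the others, so the partial results can be computed in any order and combined at the end.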

Distributed Computing

Distributed computing involves multiple computers (or nodes) working together over a network to achieve a common goal.

Unlike clustered computing, distributed computing does not necessarily rely on the nodes being physically close or connected in a single location.
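
To make the idea of nodes cooperating on a common goal more concrete, here is a small single-process simulation of a distributed word count. It is a sketch under stated assumptions: the "nodes" are ordinary function calls rather than real machines, no network is involved, and all names (partition, node_count_words, distributed_word_count) are hypothetical.

```python
from collections import Counter
from functools import reduce


def partition(lines, n_nodes):
    """Assign lines to nodes round-robin, one partition per node."""
    return [lines[i::n_nodes] for i in range(n_nodes)]


def node_count_words(lines):
    """Work done locally on one node: count the words in its own partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, n_nodes=3):
    """Simulate the flow: partition the data, compute locally, merge the results."""
    partitions = partition(lines, n_nodes)
    # In a real system, each call below would run on a different machine.
    partial_counts = [node_count_words(p) for p in partitions]
    return reduce(lambda a, b: a + b, partial_counts, Counter())


lines = ["big data", "big spark", "data data"]
print(distributed_word_count(lines, n_nodes=2))
```

Only the small partial counts need to be sent back and merged, which is why this pattern works even when the nodes are far apart.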

Batch Processing

Batch processing refers to processing large volumes of data in discrete chunks or batches at scheduled intervals.

It involves collecting and storing data, processing it as a batch, and then producing results.
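
The collect, process, and produce cycle can be illustrated with a minimal sketch. The helper names (make_batches, process_batch, run_batch_job) and the per-batch summary are assumptions chosen for this example; a real batch job would typically be triggered by a scheduler and read from storage.

```python
from itertools import islice


def make_batches(records, batch_size):
    """Yield successive lists of up to batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Process one whole batch at once and return a summary of it."""
    return {"count": len(batch), "total": sum(batch)}


def run_batch_job(records, batch_size):
    """Process already collected records batch by batch and return all results."""
    return [process_batch(batch) for batch in make_batches(records, batch_size)]


for result in run_batch_job([1, 2, 3, 4, 5], batch_size=2):
    print(result)
```

Results only become available after a batch has been fully processed, which is the main trade-off compared with real-time processing.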

Real-Time Processing

Real-time processing involves analyzing and processing data as it is generated or received, with minimal latency.

The goal is to provide immediate insights or responses based on the most current data.
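
As a contrast with batch processing, the sketch below updates its result every time a new value arrives instead of waiting for a full dataset. It is a simplified illustration: the function name running_average is hypothetical, and a plain Python list stands in for a live event source such as a sensor feed or a message queue.

```python
def running_average(events):
    """Yield the up-to-date average after each incoming event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        # The result is available immediately, before the next event arrives.
        yield total / count


for average in running_average([2, 4, 6]):
    print(average)
```

Because only the running count and total are kept in memory, the computation can continue indefinitely on an unbounded stream.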
