Structure of Spark | Spark Basics
Introduction to Big Data with Apache Spark in Python
Structure of Spark

Apache Spark comprises several key components, each designed for a different kind of data processing task:

  • Spark Core - the foundation of Spark, providing basic functionalities like task scheduling, memory management, and fault tolerance;
  • Spark SQL - enables querying of structured data using SQL or DataFrame APIs and integrates with various data sources (e.g., Hive, Avro, Parquet);
  • Spark Streaming - processes real-time data streams by splitting them into small batches (micro-batch processing);
  • MLlib - a library of scalable machine learning algorithms, including classification, regression, clustering, and collaborative filtering;
  • GraphX - provides a framework for graph processing and analytics, supporting operations like graph traversal and computation.
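
To make the first two components concrete, here is a minimal sketch that uses Spark Core (through the RDD API) and Spark SQL (through a DataFrame and a SQL query) in one small PySpark program. It assumes PySpark and a Java runtime are installed and runs Spark in local mode; the function name `run_components_demo`, the application name, and the sample data are made up for illustration.

```python
def run_components_demo(master="local[*]"):
    """Run a tiny job that touches Spark Core (RDD API) and Spark SQL."""
    # Imported lazily so the function can be defined without PySpark installed.
    from pyspark.sql import SparkSession

    # "local[*]" runs Spark inside this process, using all available cores.
    spark = (
        SparkSession.builder
        .master(master)
        .appName("components-demo")  # hypothetical application name
        .getOrCreate()
    )
    try:
        # Spark Core: distribute a Python range as an RDD and aggregate it.
        rdd = spark.sparkContext.parallelize(range(1, 6))
        sum_of_squares = rdd.map(lambda x: x * x).sum()

        # Spark SQL: build a DataFrame, register it as a view, query it with SQL.
        people = spark.createDataFrame(
            [("Ann", 34), ("Bob", 19), ("Cid", 52)], ["name", "age"]
        )
        people.createOrReplaceTempView("people")
        adults = spark.sql(
            "SELECT name FROM people WHERE age >= 21 ORDER BY name"
        )
        adult_names = [row["name"] for row in adults.collect()]

        return sum_of_squares, adult_names
    finally:
        # Always release the resources held by the session.
        spark.stop()
```

Calling `run_components_demo()` on a machine with PySpark installed should return the sum of squares of 1 through 5 together with the names of the people aged 21 or older. Both parts run on the same engine; only the API differs.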

Spark operates on a cluster architecture that consists of a master node and worker nodes.

The master node manages job scheduling and resource allocation, while worker nodes execute tasks and store data.

Basic Workflow:

  1. Submit Application - users submit Spark applications to a cluster manager (e.g., YARN, Mesos, or Kubernetes);
  2. Job Scheduling - the cluster manager allocates resources on the worker nodes, and the application's driver program splits the job into tasks and distributes them across those workers;
  3. Task Execution - worker nodes execute the tasks, performing computations on data stored in memory or on disk;
  4. Result Collection - the results of the tasks are collected and returned to the user.
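
The four steps above can be imitated with a toy model in plain Python. This is not Spark and uses no Spark API: threads stand in for worker nodes, and the names `split_into_partitions`, `run_task`, and `run_job` are invented for this sketch. It only shows the shape of the workflow: split the data, run one task per partition in parallel, then collect the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split data into contiguous chunks of nearly equal size (one per task)."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """Work done by one worker on one partition: a partial sum of squares."""
    return sum(x * x for x in partition)


def run_job(data, num_workers=3):
    # Steps 1-2: the submitted job is split into one task per partition.
    partitions = split_into_partitions(data, num_workers)
    # Step 3: workers execute tasks in parallel
    # (threads here; separate machines in a real cluster).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(run_task, partitions))
    # Step 4: partial results are collected and combined for the user.
    return partial_results, sum(partial_results)


partials, total = run_job(list(range(1, 11)), num_workers=3)
print(partials, total)  # prints: [30, 110, 245] 385
```

In real Spark the same pattern holds, with the addition that lost tasks can be re-run on another worker, which is part of the fault tolerance provided by Spark Core.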
