Introduction to Big Data with Apache Spark in Python
Structure of Spark
Apache Spark comprises several key components, each designed to handle different types of data processing tasks:
- Spark Core - the foundation of Spark, providing basic functionalities like task scheduling, memory management, and fault tolerance;
- Spark SQL - enables querying of structured data using SQL or DataFrame APIs and integrates with various data sources (e.g., Hive, Avro, Parquet);
- Spark Streaming - enables processing of real-time data streams using micro-batch processing;
- MLlib - a library of scalable machine learning algorithms, including classification, regression, clustering, and collaborative filtering;
- GraphX - provides a framework for graph processing and analytics, supporting operations like graph traversal and computation.
Spark operates on a cluster architecture that consists of a master node and worker nodes.
The master node manages job scheduling and resource allocation, while worker nodes execute tasks and store data.
Basic Workflow:
- Submit Application - users submit Spark applications to a cluster manager (e.g., YARN, Mesos, or Kubernetes);
- Job Scheduling - the cluster manager schedules the job and distributes tasks across worker nodes;
- Task Execution - worker nodes execute the tasks, performing computations on data stored in memory or on disk;
- Result Collection - the results of the tasks are collected and returned to the user.