Introduction to Big Data with Apache Spark in Python

Big Data Processing

There are several approaches to processing extremely large datasets in Big Data.

The most important ones are:

Clustered Computing - connects multiple nodes so that they work as a unified system, improving performance and reliability;
Parallel Computing - breaks a task down into smaller subtasks that run simultaneously on multiple processors or cores to speed up processing;
Distributed Computing - multiple independent nodes (computers in a network) work together over the network toward a common goal;
Batch Processing - processes large volumes of data in discrete chunks (called batches), typically at scheduled intervals;
Real-Time Processing - analyzes data as it is generated, with minimal latency, for immediate insights and responses.

Let’s look at each of them in more detail:

Clustered Computing

Clustered computing involves connecting multiple computers (nodes) together to work as a single system.

This setup allows for enhanced performance and reliability by leveraging the combined computational power and storage resources of all the nodes.

Parallel Computing

Parallel computing involves breaking down a computational task into smaller subtasks that are executed simultaneously across multiple processors or cores.

This approach speeds up processing by dividing the workload and executing parts concurrently.
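The split, execute, and combine pattern described above can be sketched in a few lines of Python. This is only an illustration: the function names (sum_of_squares, split, parallel_sum_of_squares) and the choice of four workers are assumptions made for the example, not part of the course material. A thread pool is used to keep the sketch portable; in CPython, CPU-bound work would usually need concurrent.futures.ProcessPoolExecutor to actually run on several cores at once.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Subtask: process one piece of the workload."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Divide the data into at most `parts` roughly equal chunks."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, workers=4):
    """Run the subtasks concurrently, then combine the partial results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)


print(parallel_sum_of_squares(list(range(10))))
```

The result is the same as computing the sum in a single loop; only the way the work is scheduled changes.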

Distributed Computing

Distributed computing involves multiple computers (or nodes) working together over a network to achieve a common goal.

Unlike clustered computing, distributed computing does not necessarily rely on the nodes being physically close or connected in a single location.
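To make the idea of nodes cooperating toward a common goal more concrete, here is a small single-process simulation, not a real distributed system. Each hypothetical node counts words in its own local data, and only the partial counts are combined at the end. The node names, the sample lines, and the helper functions are all invented for this sketch.

```python
from collections import Counter


def count_words_on_node(lines):
    """Work done locally on one node: count words in its own data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(partitions):
    """Combine the partial counts returned by every node."""
    total = Counter()
    for lines in partitions.values():
        total.update(count_words_on_node(lines))
    return total


# Hypothetical data spread over two nodes (simulated in one process).
partitions = {
    "node-1": ["big data", "spark"],
    "node-2": ["Spark processes big data"],
}

print(distributed_word_count(partitions))
```

In a real system the per-node step would run on separate machines and the partial results would travel over the network; here everything runs in one process purely for illustration.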

Batch Processing

Batch processing refers to processing large volumes of data in discrete chunks or batches at scheduled intervals.

It involves collecting and storing data, processing it as a batch, and then producing results.
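The collect, process, and produce cycle can be sketched as follows. This is a minimal example under stated assumptions: the records are plain numbers that have already been collected, the batch size of 3 is arbitrary, and the helper names (batches, process_batch, run_batch_job) are hypothetical.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield the collected records in consecutive batches."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Process one whole batch and produce a summary for it."""
    return {"count": len(batch), "total": sum(batch)}


def run_batch_job(records, batch_size=3):
    """Process all stored records batch by batch."""
    return [process_batch(batch) for batch in batches(records, batch_size)]


for summary in run_batch_job([1, 2, 3, 4, 5, 6, 7]):
    print(summary)
```

Results become available only after a whole batch has been processed, which is the main difference from real-time processing.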

Real-Time Processing

Real-time processing involves analyzing and processing data as it is generated or received, with minimal latency.

The goal is to provide immediate insights or responses based on the most current data.
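As a rough illustration, the generator below updates a running average every time a new event arrives instead of waiting for a complete dataset. The event values and the name running_average are assumptions made for this sketch; a production system would read events from a live source such as a message queue.

```python
def running_average(events):
    """Yield an up-to-date average after every incoming event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# Simulated stream of sensor readings arriving one at a time.
for average in running_average([10, 20, 30]):
    print(average)
```

Each incoming value immediately produces a fresh result, so the output always reflects the most current data.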
