Introduction to Big Data with Apache Spark in Python

Spark DataFrame and Columns

In Apache Spark, DataFrames and Columns are central concepts for handling and processing structured and semi-structured data.

Here’s a detailed overview.

Spark DataFrame

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in Pandas, except that its data is distributed across a cluster.

Creation

DataFrames can be created in several ways, including:

From existing RDDs;
From structured data files;
From a table in a database.

For example, a DataFrame can be created from an existing RDD.

A DataFrame can also be read from a JSON file.

Finally, a DataFrame can be loaded from a table in a database.

Operations with DataFrame

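The most important DataFrame operations fall into two groups: transformations, which lazily define a new DataFrame, and actions, which trigger the computation and return a result.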

Transformations

select - returns a new DataFrame containing only the specified columns.

filter - keeps only the rows that satisfy a given condition.

groupBy - groups rows that share a value in one or more columns so that aggregations can be computed per group.

join - combines two DataFrames based on a common column.

Actions

show - prints the first rows of the DataFrame (20 by default).

collect - returns the entire dataset to the driver as a list of Row objects, so it should only be used when the result fits in the driver's memory.

count - returns the number of rows in the DataFrame.

Spark Column

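Columns are used to refer to the data in specific fields of a DataFrame and to manipulate and transform it.

Just as a Spark DataFrame resembles a Pandas DataFrame, a Column is conceptually similar to a Series object in the Pandas library, although a Column is an expression that Spark evaluates lazily rather than a container of values.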

Working with Columns

A column can be selected from a DataFrame by name.

New columns can be created from expressions over existing columns.

New columns can also be created with user-defined functions (UDFs).

Columns can also be renamed.
