Spark DataFrame and Columns | Spark SQL
Introduction to Big Data with Apache Spark in Python
Spark DataFrame and Columns

In Apache Spark, DataFrames and Columns are central concepts for handling and processing structured and semi-structured data.

Here’s a detailed overview.

Spark DataFrame

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in pandas, with the difference that its data is partitioned across a cluster.

Creation

DataFrames can be created in several ways, including:

  • From existing RDDs;
  • From structured data files;
  • From a table in a database.

For example, a DataFrame can be created from an existing RDD by passing the RDD to createDataFrame together with the column names.

A DataFrame can also be read from a structured data file, such as a JSON file.

Finally, a DataFrame can be loaded from a table in a database, typically over JDBC.

Operations with DataFrame

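The most commonly used DataFrame operations fall into two groups: transformations, which are lazy and return a new DataFrame, and actions, which trigger the computation and return a result.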

Transformations

  • select - selects specific columns;
  • filter - keeps only the rows that satisfy a condition;
  • groupBy - groups rows by the values of one or more columns so that they can be aggregated;
  • join - combines two DataFrames based on a common column or a join condition.

Actions

  • show - prints the first rows of the DataFrame (20 by default);
  • collect - returns the entire dataset to the driver as a list of rows;
  • count - returns the number of rows in the DataFrame.

Spark Column

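A Column refers to the data in a specific field of a DataFrame and is used to manipulate and transform that data.

Just as a Spark DataFrame resembles a pandas DataFrame, a Column is conceptually similar to a pandas Series. Unlike a Series, however, a Column is an expression that is evaluated lazily rather than a container of values.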

Working with Columns

We can select a column from a DataFrame by attribute access, by indexing with the column name, or with the col function.

We can create new columns from expressions over existing columns with withColumn.

New columns can also be computed with user-defined functions (UDFs).

We can also rename columns with withColumnRenamed.
