Introduction to Big Data with Apache Spark in Python
Spark DataFrame and Columns
In Apache Spark, DataFrames and Columns are central concepts for handling and processing structured and semi-structured data.
Here’s a detailed overview.
Spark DataFrame
A Spark DataFrame is conceptually equivalent to a table in a relational database or a DataFrame in pandas, but it is additionally distributed across a cluster.
Creation
DataFrames can be created in several ways, including:
- From existing RDDs;
- From structured data files;
- From a table in a database.
For example, here we create a DataFrame from an existing RDD:
Here, from a JSON file:
Finally, from a table in a database:
Operations with DataFrame
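The most important DataFrame operations in Spark fall into two groups: transformations, which are lazy and return a new DataFrame, and actions, which trigger the computation and return a result.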
Transformations
select
- used to select specific columns:
filter
- keeps only the rows that satisfy a condition on column values:
groupBy
- groups rows by the values of one or more columns so that an aggregation can be computed per group:
join
- combines two DataFrames based on a common column:
Actions
show
- prints the first rows of the DataFrame (20 by default):
collect
- retrieves the entire dataset to the driver as a list of Row objects:
count
- returns the number of rows in the DataFrame:
Spark Column
Columns are used to refer to the data contained in specific fields of the DataFrame and to manipulate and transform it.
As with the DataFrame, a Column is conceptually similar to a Series object in the pandas library.
Working with Columns
We can select a column from a DataFrame using the following expression:
We can create new columns, for example, using expressions:
Or using user-defined functions:
We can also rename columns: