Introduction to Big Data with Apache Spark in Python
Spark DataFrame and Columns
In Apache Spark, DataFrames and Columns are central concepts for handling and processing structured and semi-structured data.
Here's a detailed overview.
Spark DataFrame
A Spark DataFrame is conceptually equivalent to a table in a relational database or a DataFrame in Pandas, but, in addition, it is distributed across a cluster.
Creation
DataFrames can be created in several ways, including:
- From existing RDDs;
- From structured data files;
- From a table in a database.
For example, here is how to create a DataFrame from an existing RDD:
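A minimal sketch, assuming spark is an active SparkSession and the sample rows are purely illustrative:

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("example").getOrCreate()

# Build an RDD of Row objects, then convert it into a DataFrame
rdd = spark.sparkContext.parallelize([
    Row(name="Alice", age=34),
    Row(name="Bob", age=45),
])
df = spark.createDataFrame(rdd)
```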
Here, from a JSON file:
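A sketch; the file path is a placeholder:

```python
# Read a JSON file into a DataFrame
df = spark.read.json("data/people.json")
```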
Finally, from a table in a database:
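A sketch using the JDBC data source; the connection URL, table name, and credentials are placeholders:

```python
# Load a database table over JDBC into a DataFrame
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "people")
      .option("user", "username")
      .option("password", "password")
      .load())
```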
Operations with DataFrame
Here are the most important operations on DataFrames in Spark.
Transformations
select
- used to select specific columns:
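For instance, keeping two illustrative columns:

```python
# Keep only the "name" and "age" columns
df.select("name", "age")
```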
filter
- keeps only the rows that satisfy a given condition:
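For instance, assuming df has an age column:

```python
# Keep only the rows where age is greater than 30
df.filter(df.age > 30)
```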
groupBy
- used to aggregate data based on a column value:
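For instance, grouping by a hypothetical department column:

```python
# Group rows by department and compute an aggregate per group
df.groupBy("department").count()
```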
join
- combines two DataFrames based on a common column:
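For instance, assuming two DataFrames, employees and departments, that share a dept_id column:

```python
# Inner join on the common dept_id column
employees.join(departments, on="dept_id", how="inner")
```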
Actions
show
- displays the first few rows of the DataFrame:
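For instance, printing the first five rows:

```python
# Print the first 5 rows to the console
df.show(5)
```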
collect
- retrieves the entire dataset to the driver as a list of rows:
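For instance (this pulls the whole dataset to the driver, so it should be used with care on large data):

```python
# Bring the whole DataFrame to the driver as a list of Row objects
rows = df.collect()
```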
count
- counts the number of rows in the DataFrame:
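For instance:

```python
# Number of rows in the DataFrame
total = df.count()
```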
Spark Column
Columns are used to refer to the data contained in specific fields of a DataFrame and to manipulate and transform it.
As with DataFrames, a Column is conceptually similar to a Series object in the Pandas library.
Working with Columns
We can select a column from a DataFrame using an expression like the following:
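For instance, referring to an illustrative age column:

```python
# Refer to the "age" column of the DataFrame
age_column = df["age"]   # bracket notation
age_column = df.age      # attribute notation
```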
We can create new columns, for example, using expressions on existing columns:
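A minimal sketch, deriving a new column from an assumed age column:

```python
from pyspark.sql.functions import col

# Add a new column computed from an existing one
df = df.withColumn("age_plus_one", col("age") + 1)
```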
Or using user-defined functions:
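A sketch, assuming a string name column:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap a plain Python function as a Spark UDF and apply it to a column
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
df = df.withColumn("name_upper", to_upper(df["name"]))
```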
We can also rename columns:
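For instance, renaming an illustrative age column:

```python
# Rename the "age" column to "years"
df = df.withColumnRenamed("age", "years")
```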