Introduction to Big Data with Apache Spark in Python
Spark DataFrame and Columns
In Apache Spark, DataFrames and Columns are central concepts for handling and processing structured and semi-structured data.
Here’s a detailed overview.
Spark DataFrame
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in pandas, but it is distributed across a cluster.
Creation
DataFrames can be created in several ways, including:
- From existing RDDs;
- From structured data files;
- From a table in a database.
For example, here we create a DataFrame from an existing RDD:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
data = [("Alice", 29), ("Bob", 31), ("Cathy", 25)]
# Parallelize the local list into an RDD, then convert it to a DataFrame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, ["Name", "Age"])
Here, from a JSON file:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.json("path/to/jsonfile")
Finally, from a table in a database:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/mydb",
    driver="com.mysql.jdbc.Driver",
    dbtable="mytable",
    user="myuser",
    password="mypassword").load()
Operations with DataFrame
Here are the most important DataFrame operations in Spark.
Transformations
select
- used to select specific columns:
df.select("Name", "Age").show()
filter
- filters rows based on a condition:
df.filter(df.Age > 30).show()
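Conditions can be combined with & and |, or written as SQL-style strings; a small sketch using the example DataFrame above:
# Keep rows where Age is strictly between 25 and 31
df.filter((df.Age > 25) & (df.Age < 31)).show()
# Equivalent SQL-style string expression
df.filter("Age > 25 AND Age < 31").show()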
groupBy
- groups rows by a column value so that aggregations can be applied:
df.groupBy("Age").count().show()
join
- combines two DataFrames based on a common column:
df1.join(df2, df1.id == df2.id).show()
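df1 and df2 above are placeholders; here is a minimal self-contained sketch with assumed column names:
# Hypothetical DataFrames used only to illustrate the join
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "Name"])
df2 = spark.createDataFrame([(1, "IT"), (2, "HR")], ["id", "Department"])
# Inner join on the common "id" column
df1.join(df2, df1.id == df2.id).show()
# Passing the column name instead keeps a single "id" column in the result
df1.join(df2, "id").show()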
Actions
show
- displays the first rows of the DataFrame (20 by default):
df.show()
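show accepts the number of rows to display, and truncate=False prints full column values:
# Show the first 5 rows without truncating long values
df.show(5, truncate=False)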
collect
- retrieves the entire dataset to the driver as a list of Row objects:
df.collect()
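Because collect brings all the data to the driver, it should only be used for results that fit in memory; the returned Row objects can be accessed by field name:
# Iterate over the collected rows on the driver
for row in df.collect():
    print(row.Name, row.Age)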
count
- counts the number of rows in the DataFrame:
df.count()
Spark Column
Columns are used to refer to the data contained in specific fields of a DataFrame and to manipulate and transform it. Just as a DataFrame is conceptually similar to a pandas DataFrame, a Column is conceptually similar to a pandas Series.
Working with Columns
We can select a column from a DataFrame using the following expression:
df.select(df["Name"]).show()
We can create new columns, for example, using expressions:
from pyspark.sql.functions import col
df.withColumn("AgePlusOne", col("Age") + 1).show()
Or using user-defined functions:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def to_upper(name):
    return name.upper()

# Register the Python function as a UDF that returns a string
upper_udf = udf(to_upper, StringType())
df.withColumn("NameUpper", upper_udf(col("Name"))).show()
We can also rename columns:
df.withColumnRenamed("Name", "FullName").show()