Introduction to Big Data with Apache Spark in Python
Spark DataFrame and Columns
In Apache Spark, DataFrames and Columns are central concepts for handling and processing structured and semi-structured data.
Here’s a detailed overview.
Spark DataFrame
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in pandas, but it is distributed across a cluster.
Creation
DataFrames can be created in several ways, including:
- From existing RDDs;
- From structured data files;
- From a table in a database.
For example, here we create a DataFrame from an existing RDD:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
data = [("Alice", 29), ("Bob", 31), ("Cathy", 25)]
# Parallelize the local list into an RDD, then convert it to a DataFrame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, ["Name", "Age"])
Here, from a JSON file:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.json("path/to/jsonfile")
Finally, from a table in a database:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/mydb",
    driver="com.mysql.jdbc.Driver",
    dbtable="mytable",
    user="myuser",
    password="mypassword").load()
Operations with DataFrame
Here are the most important DataFrame operations in Spark.
Transformations
select
- used to select specific columns:
df.select("Name", "Age").show()
filter
- filters rows based on a condition:
df.filter(df.Age > 30).show()
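Conditions can be combined with & and |, or written as SQL-style strings; a small sketch using the example DataFrame above:
# Keep rows where Age is strictly between 25 and 31
df.filter((df.Age > 25) & (df.Age < 31)).show()
# Equivalent SQL-style string expression
df.filter("Age > 25 AND Age < 31").show()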
groupBy
- groups rows by a column value so that aggregations can be applied:
df.groupBy("Age").count().show()
join
- combines two DataFrames based on a common column:
df1.join(df2, df1.id == df2.id).show()
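df1 and df2 above are placeholders; here is a minimal self-contained sketch with assumed column names:
# Hypothetical DataFrames used only to illustrate the join
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "Name"])
df2 = spark.createDataFrame([(1, "IT"), (2, "HR")], ["id", "Department"])
# Inner join on the common "id" column
df1.join(df2, df1.id == df2.id).show()
# Passing the column name instead keeps a single "id" column in the result
df1.join(df2, "id").show()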
Actions
show
- displays the first rows of the DataFrame (20 by default):
df.show()
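show accepts the number of rows to display, and truncate=False prints full column values:
# Show the first 5 rows without truncating long values
df.show(5, truncate=False)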
collect
- retrieves the entire dataset to the driver as a list of Row objects:
df.collect()
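Because collect brings all the data to the driver, it should only be used for results that fit in memory; the returned Row objects can be accessed by field name:
# Iterate over the collected rows on the driver
for row in df.collect():
    print(row.Name, row.Age)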
count
- counts the number of rows in the DataFrame:
df.count()
Spark Column
Columns are used to refer to the data contained in specific fields of a DataFrame and to manipulate and transform it. Just as a DataFrame is conceptually similar to a pandas DataFrame, a Column is conceptually similar to a pandas Series.
Working with Columns
We can select a column from a DataFrame using the following expression:
df.select(df["Name"]).show()
We can create new columns, for example, using expressions:
from pyspark.sql.functions import col
df.withColumn("AgePlusOne", col("Age") + 1).show()
Or using user-defined functions:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def to_upper(name):
    return name.upper()

# Register the Python function as a UDF that returns a string
upper_udf = udf(to_upper, StringType())
df.withColumn("NameUpper", upper_udf(col("Name"))).show()
We can also rename columns:
df.withColumnRenamed("Name", "FullName").show()