Reading and Writing Data: CSV, JSON, and Parquet

PySpark can read and write multiple file formats. The three most common in data engineering are CSV, JSON, and Parquet. Each has different trade-offs in terms of readability, size, and query performance.

CSV

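CSV is human-readable but slow to process: every value is stored as text, so Spark has to parse it and either infer the column types (inferSchema=True, which costs an extra pass over the data) or apply a schema that you supply.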


import urllib.request
from pyspark.sql import SparkSession

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/811b6b9d-bc0c-477c-9828-ba6b18bb63dc/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("FileFormats") \
    .master("local[*]") \
    .getOrCreate()

# Reading CSV
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Writing CSV
flights_df.write.csv("output_flights", header=True, mode="overwrite")

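mode="overwrite" replaces the output directory if it already exists. The other modes are "append" (add new files to the existing output), "ignore" (skip the write silently if the output already exists), and "error" (the default, also spelled "errorifexists", which raises an exception if the output already exists). Note that the path names a directory: Spark writes one part file per partition into it rather than a single CSV file.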
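
The cost of parsing CSV can be seen without Spark. The following is a minimal sketch using Python's standard csv module, not PySpark; the sample rows and column names are made up for illustration and are not taken from flights.csv.

```python
import csv
import io

# Hypothetical sample data; these column names are assumptions for illustration.
raw = "flight,distance,delayed\nAA100,2475,true\nUA200,337,false\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Every field comes back as a string, including the numeric and boolean ones.
print(rows[0])

# Casting is left to the reader. This is the per-column work that schema
# inference or an explicit schema has to do for a CSV file.
typed = [
    {
        "flight": r["flight"],
        "distance": int(r["distance"]),
        "delayed": r["delayed"] == "true",
    }
    for r in rows
]
print(typed[0])
```

In PySpark, inferSchema=True performs this detection by scanning the data. Passing a schema through the schema parameter of spark.read.csv avoids that scan.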

JSON

JSON supports nested structures (objects and arrays) that CSV cannot represent. By default, PySpark reads and writes the JSON Lines layout, with one complete JSON object per line:


# Writing to JSON
flights_df.write.json("output_flights_json", mode="overwrite")

# Reading back
flights_json_df = spark.read.json("output_flights_json")
flights_json_df.printSchema()
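
To make the "one object per line" layout concrete, here is a small sketch using only Python's standard json module. The records and the nested route field are hypothetical and serve only to show the file layout.

```python
import json

# Hypothetical records; the nested "route" object is something a flat
# CSV row cannot hold directly.
records = [
    {"flight": "AA100", "route": {"origin": "JFK", "dest": "LAX"}},
    {"flight": "UA200", "route": {"origin": "SFO", "dest": "SEA"}},
]

# One complete JSON object per line, with no enclosing array.
text = "\n".join(json.dumps(r) for r in records)
print(text)

# Each line parses on its own, so a reader can split the input by line
# and process the pieces independently.
parsed = [json.loads(line) for line in text.splitlines()]
```

A pretty-printed JSON document that spans several lines does not match this layout. Spark's JSON reader has a multiLine option for such files.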

Parquet

Parquet is a columnar binary format and a de facto standard for large-scale data processing. It stores data column by column, so queries that touch only a few columns can skip reading the rest. It also compresses well, because the values within a column share one type, and it stores the schema in the file itself.


# Writing to Parquet
flights_df.write.parquet("output_flights_parquet", mode="overwrite")

# Reading back – no need to specify schema, it is stored in the file
flights_parquet_df = spark.read.parquet("output_flights_parquet")
flights_parquet_df.printSchema()
flights_parquet_df.show(5)
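
The benefit of storing data column by column can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Parquet encodes data on disk, and the rows and column names are made up.

```python
# Hypothetical rows; names and values are assumptions for illustration.
rows = [
    {"flight": "AA100", "distance": 2475, "carrier": "AA"},
    {"flight": "UA200", "distance": 337, "carrier": "UA"},
    {"flight": "DL300", "distance": 760, "carrier": "DL"},
]

# Row-oriented layout (CSV, JSON Lines): totalling one field means
# visiting every full record.
total_from_rows = sum(r["distance"] for r in rows)

# Column-oriented layout (the idea behind Parquet): the values of each
# column are stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# A query on "distance" touches only that one list; "flight" and
# "carrier" are never read.
total_from_columns = sum(columns["distance"])

print(total_from_rows, total_from_columns)
```

In PySpark the same effect comes from selecting only the columns you need after spark.read.parquet, which lets Spark skip the others when reading.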
