Reading and Writing Data: CSV, JSON, and Parquet

PySpark can read and write multiple file formats. The three most common in data engineering are CSV, JSON, and Parquet. Each has different trade-offs in terms of readability, size, and query performance.

CSV

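CSV is human-readable but slow to load: Spark must parse every value from text, and column types must either be inferred with inferSchema=True, which costs an extra pass over the data, or declared with an explicit schema.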


import urllib.request
from pyspark.sql import SparkSession

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/811b6b9d-bc0c-477c-9828-ba6b18bb63dc/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("FileFormats") \
    .master("local[*]") \
    .getOrCreate()

# Reading CSV
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Writing CSV
flights_df.write.csv("output_flights", header=True, mode="overwrite")

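mode="overwrite" replaces the output directory if it already exists. The other options are "append" (add new files to the existing output), "ignore" (do nothing if the output already exists), and "error" (the default, also spelled "errorifexists"), which raises an exception if the output already exists. Note that the output path is a directory of part files, not a single CSV file.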

JSON

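JSON supports nested structures that CSV cannot represent. By default, PySpark expects one JSON object per line (the JSON Lines layout); a file that holds a single pretty-printed document or array needs multiLine=True when reading: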


# Writing to JSON
flights_df.write.json("output_flights_json", mode="overwrite")

# Reading back
flights_json_df = spark.read.json("output_flights_json")
flights_json_df.printSchema()

Parquet

Parquet is a columnar binary format – the standard for large-scale data processing. It stores data column by column, which means queries that touch only a few columns skip the rest entirely. It also compresses well and preserves the schema.


# Writing to Parquet
flights_df.write.parquet("output_flights_parquet", mode="overwrite")

# Reading back – no need to specify schema, it is stored in the file
flights_parquet_df = spark.read.parquet("output_flights_parquet")
flights_parquet_df.printSchema()
flights_parquet_df.show(5)
