Data Processing with PySpark

Reading and Writing Data: CSV, JSON, and Parquet

PySpark can read and write multiple file formats. The three most common in data engineering are CSV, JSON, and Parquet. Each has different trade-offs in terms of readability, size, and query performance.

CSV

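CSV is human-readable but slow to load. It is plain text, so Spark must parse every value from characters, and column types must either be inferred (inferSchema=True, which costs an extra pass over the data) or supplied and cast explicitly. Without either, every column is read as a string.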

```python
import urllib.request
from pyspark.sql import SparkSession

urllib.request.urlretrieve(
    "https://staging-content-media-cdn.codefinity.com/courses/811b6b9d-bc0c-477c-9828-ba6b18bb63dc/flights.csv",
    "flights.csv"
)

spark = SparkSession.builder \
    .appName("FileFormats") \
    .master("local[*]") \
    .getOrCreate()

# Reading CSV
flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Writing CSV
flights_df.write.csv("output_flights", header=True, mode="overwrite")
```

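mode="overwrite" replaces the output directory if it already exists. The other options are "append" (add the new files to the existing output), "ignore" (silently skip the write if the output already exists), and "error" (the default, also accepted as "errorifexists"), which raises an error if the output already exists.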
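To make the parsing cost concrete, here is a minimal plain-Python sketch (not Spark code) of what any CSV reader has to deal with: every field arrives as text and has to be converted. The sample data and the column names flight_id, distance_km, and delayed are hypothetical and are not taken from flights.csv.

```python
import csv
import io

# Hypothetical sample data; column names are illustrative only.
raw = "flight_id,distance_km,delayed\n1,350.5,true\n2,1200.0,false\n"

# csv.DictReader yields one mapping per row, and every value is a string.
rows = list(csv.DictReader(io.StringIO(raw)))


def cast_row(row):
    """Convert the text fields of one row into typed Python values."""
    return {
        "flight_id": int(row["flight_id"]),
        "distance_km": float(row["distance_km"]),
        "delayed": row["delayed"] == "true",
    }


typed_rows = [cast_row(r) for r in rows]

print(rows[0])        # all values are still strings here
print(typed_rows[0])  # values converted to int, float, and bool
```

In PySpark the same concern shows up as the choice between inferSchema=True and passing an explicit schema through the schema argument of spark.read.csv. Supplying the schema lets Spark skip the inference pass.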

JSON

JSON supports nested structures that CSV cannot represent. PySpark reads one JSON object per line by default:

```python
# Writing to JSON
flights_df.write.json("output_flights_json", mode="overwrite")

# Reading back
flights_json_df = spark.read.json("output_flights_json")
flights_json_df.printSchema()
```
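The "one JSON object per line" layout is often called JSON Lines. The following plain-Python sketch (again not Spark code) shows what such a file looks like and why each line can be parsed independently. The records and the nested route field are hypothetical examples.

```python
import json

# Hypothetical records; the nested "route" object is the kind of
# structure a flat CSV row cannot represent directly.
records = [
    {"flight_id": 1, "route": {"origin": "HEL", "dest": "ARN"}},
    {"flight_id": 2, "route": {"origin": "HEL", "dest": "CPH"}},
]

# JSON Lines: one complete JSON object per line, with no enclosing array.
json_lines = "\n".join(json.dumps(record) for record in records)

# Each line is valid JSON on its own, so lines can be parsed independently.
parsed = [json.loads(line) for line in json_lines.splitlines()]

print(json_lines)
print(parsed[0]["route"]["origin"])
```

If a file instead holds a single JSON document spread over several lines, such as a pretty-printed array, spark.read.json accepts multiLine=True to read it.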

Parquet

Parquet is a columnar binary format – the standard for large-scale data processing. It stores data column by column, which means queries that touch only a few columns skip the rest entirely. It also compresses well and preserves the schema.

```python
# Writing to Parquet
flights_df.write.parquet("output_flights_parquet", mode="overwrite")

# Reading back – no need to specify schema, it is stored in the file
flights_parquet_df = spark.read.parquet("output_flights_parquet")
flights_parquet_df.printSchema()
flights_parquet_df.show(5)
```
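The benefit of storing data column by column can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Parquet's actual file layout, which also involves row groups, encodings, and compression. The table and its column names are hypothetical.

```python
# Hypothetical table in row layout: one record per row, as in CSV or JSON.
row_layout = [
    {"flight_id": 1, "carrier": "AY", "distance_km": 350.5},
    {"flight_id": 2, "carrier": "SK", "distance_km": 1200.0},
    {"flight_id": 3, "carrier": "AY", "distance_km": 780.0},
]

# The same table in column layout: one list of values per column.
column_layout = {
    name: [row[name] for row in row_layout] for name in row_layout[0]
}

# Row layout: summing one field still means visiting every full record.
total_from_rows = sum(row["distance_km"] for row in row_layout)

# Column layout: only the one column that the query needs is touched.
total_from_columns = sum(column_layout["distance_km"])

print(total_from_rows, total_from_columns)
print(column_layout["carrier"])
```

In the column layout, values of the same type and often the same content (such as the repeated carrier codes) sit next to each other, which is part of the reason columnar files tend to compress well.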

Why is Parquet preferred over CSV for large-scale processing?
