Reading and Writing Data: CSV, JSON, and Parquet
PySpark can read and write multiple file formats. The three most common in data engineering are CSV, JSON, and Parquet. Each has different trade-offs in terms of readability, size, and query performance.
CSV
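CSV is human-readable but slow to process: Spark must parse every value from text, and because the file carries no type information, column types must either be inferred (inferSchema=True, which costs an extra pass over the data) or declared with an explicit schema.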
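As a rough illustration of why types are an issue, the sketch below uses only Python's standard csv module, not Spark. The sample data and column names are hypothetical and are not taken from flights.csv. It shows that every field arrives as a string and that types have to be applied afterwards from a schema known in advance.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
raw = "flight,distance,delayed\nAA100,733,true\nUA200,1024,false\n"

# A CSV reader yields every field as a string, whatever it looks like.
rows = list(csv.DictReader(io.StringIO(raw)))

# Types have to be recovered afterwards, either by guessing (inference)
# or by applying a schema that is known in advance, as done here.
schema = {"flight": str, "distance": int, "delayed": lambda s: s == "true"}
typed_rows = [
    {name: cast(row[name]) for name, cast in schema.items()}
    for row in rows
]

print(rows[0])
print(typed_rows[0])
```

In Spark the same choice appears as inferSchema=True versus passing an explicit schema argument to spark.read.csv.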
    import urllib.request
    from pyspark.sql import SparkSession

    urllib.request.urlretrieve(
        "https://staging-content-media-cdn.codefinity.com/courses/811b6b9d-bc0c-477c-9828-ba6b18bb63dc/flights.csv",
        "flights.csv"
    )

    spark = SparkSession.builder \
        .appName("FileFormats") \
        .master("local[*]") \
        .getOrCreate()

    # Reading CSV
    flights_df = spark.read.csv("flights.csv", header=True, inferSchema=True)

    # Writing CSV
    flights_df.write.csv("output_flights", header=True, mode="overwrite")
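mode="overwrite" replaces the output directory if it already exists. The other modes are "append" (add new files to the existing output), "ignore" (silently skip the write if the path exists), and "error" (the default, also spelled "errorifexists", which raises an exception if the path exists). Note that Spark writes a directory of part files, not a single CSV file.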
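The four modes can be mimicked in plain Python to make their behavior concrete. This is only a sketch of the semantics, not how Spark implements them: write_dir is a hypothetical helper, the part file names are made up, and FileExistsError stands in for the exception type Spark itself raises.

```python
import tempfile
from pathlib import Path


def write_dir(path: Path, part_name: str, text: str, mode: str = "error") -> bool:
    """Hypothetical helper that mimics the four save modes on a directory.

    Returns True if data was written, False if the write was skipped.
    """
    if path.exists():
        if mode == "error":
            raise FileExistsError(f"{path} already exists")
        if mode == "ignore":
            return False  # leave the existing output untouched
        if mode == "overwrite":
            for old in list(path.iterdir()):
                old.unlink()  # drop the previous output files
        elif mode != "append":
            raise ValueError(f"unknown mode: {mode}")
    path.mkdir(exist_ok=True)
    (path / part_name).write_text(text)
    return True


out = Path(tempfile.mkdtemp()) / "output_flights"
write_dir(out, "part-0.csv", "a\n")                    # path is new, so the write succeeds
write_dir(out, "part-1.csv", "b\n", mode="append")     # adds a second file
skipped = not write_dir(out, "part-2.csv", "c\n", mode="ignore")
write_dir(out, "part-3.csv", "d\n", mode="overwrite")  # replaces everything written so far
remaining = sorted(p.name for p in out.iterdir())
print(skipped, remaining)
```

Calling write_dir on the same path again with the default mode would raise, which is why the Spark examples in this lesson pass mode="overwrite" so they can be re-run.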
JSON
JSON supports nested structures (objects and arrays) that CSV cannot represent. By default, PySpark expects the JSON Lines layout, with one complete JSON object per line:
    # Writing to JSON
    flights_df.write.json("output_flights_json", mode="overwrite")

    # Reading back
    flights_json_df = spark.read.json("output_flights_json")
    flights_json_df.printSchema()
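To show what one JSON object per line looks like, and how nesting fits into it, here is a small sketch using only Python's standard json module. The records and field names are hypothetical and do not come from flights.csv.

```python
import json

# Hypothetical records; the field names are illustrative only.
records = [
    {"flight": "AA100", "route": {"origin": "JFK", "dest": "LAX"}, "delays": [5, 12]},
    {"flight": "UA200", "route": {"origin": "ORD", "dest": "SFO"}, "delays": []},
]

# JSON Lines: each record is serialized onto exactly one line.
json_lines = "\n".join(json.dumps(record) for record in records)

# Every line parses on its own, so a reader can split a file at line boundaries.
parsed = [json.loads(line) for line in json_lines.splitlines()]

print(json_lines)
print(parsed[0]["route"]["dest"])
```

A file that holds a single pretty-printed JSON document spread over many lines does not follow this layout. For that case the Spark JSON reader accepts multiLine=True.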
Parquet
Parquet is a columnar binary format and a de facto standard for large-scale data processing. It stores data column by column, so queries that touch only a few columns can skip the rest entirely. It also compresses well, because values of the same type sit next to each other, and it stores the schema (column names and types) in the file itself.
    # Writing to Parquet
    flights_df.write.parquet("output_flights_parquet", mode="overwrite")

    # Reading back – no need to specify schema, it is stored in the file
    flights_parquet_df = spark.read.parquet("output_flights_parquet")
    flights_parquet_df.printSchema()
    flights_parquet_df.show(5)
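The difference between row-oriented and column-oriented storage can be sketched in plain Python. This is a conceptual illustration only, not the actual Parquet file layout, and the rows and column names are hypothetical.

```python
# Hypothetical rows; the column names are illustrative only.
rows = [
    {"flight": "AA100", "origin": "JFK", "distance": 2475},
    {"flight": "UA200", "origin": "ORD", "distance": 1846},
    {"flight": "DL300", "origin": "ATL", "distance": 1946},
]

# Row-oriented layout (CSV, JSON): the values of one record sit together.
row_layout = [tuple(r.values()) for r in rows]

# Column-oriented layout (the idea behind Parquet): the values of one column sit together.
column_layout = {name: [r[name] for r in rows] for name in rows[0]}

# Averaging one column in the row layout walks through every record in full.
avg_from_rows = sum(record[2] for record in row_layout) / len(row_layout)

# The column layout reads a single list and never touches the other columns.
distances = column_layout["distance"]
avg_from_columns = sum(distances) / len(distances)

print(avg_from_rows, avg_from_columns)
```

Both layouts give the same answer. The difference is how much data has to be read to get it, which is why selecting only the needed columns from a Parquet-backed DataFrame tends to be cheap.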