Splitting Data into Chunks
Handling large datasets that cannot fit into memory all at once requires a different approach than simply loading the entire file. When you try to load a massive CSV file into pandas with the regular read_csv function, you may run into memory errors or significant slowdowns. To avoid this, you can split the data into smaller, more manageable chunks and process each one independently. This technique is especially useful in scenarios such as:
- Analyzing large log files;
- Processing data exports from databases;
- Working with time-series data collected over long periods.
Splitting data into chunks lets you process only a small part of the dataset at a time, which keeps your memory usage low and allows you to work efficiently even on modest hardware. For example, if you need to calculate statistics or filter rows from a file with millions of records, reading in chunks means you can process each part and, if needed, aggregate results as you go. This approach is also helpful when you want to stream data into a machine learning pipeline or perform incremental data cleaning.
```python
import pandas as pd

# Assume 'large_file.csv' is a very large CSV file
url = "https://staging-content-media-cdn.codefinity.com/b8f3c268-0e60-4ff0-a3ea-f145595033d8/section1/large_file.csv"

chunk_size = 100  # Number of rows per chunk

# Reading a local file works the same way: pass its path instead of a URL
for chunk in pd.read_csv(url, chunksize=chunk_size):
    # Count rows in this chunk
    print("Chunk has", len(chunk), "rows")
```
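When the goal is to compute statistics or filter rows rather than just count them, you can keep running totals that are updated as each chunk arrives and combine them at the end. The sketch below assumes the CSV contains a numeric column named value; that column name is a placeholder used only for illustration and should be replaced with a real column from your dataset.

```python
import pandas as pd

url = "https://staging-content-media-cdn.codefinity.com/b8f3c268-0e60-4ff0-a3ea-f145595033d8/section1/large_file.csv"
chunk_size = 100

total_rows = 0   # running count of rows that pass the filter
total_sum = 0.0  # running sum of the 'value' column (assumed name)

for chunk in pd.read_csv(url, chunksize=chunk_size):
    # Filter each chunk independently, then update the running totals
    filtered = chunk[chunk["value"] > 0]
    total_rows += len(filtered)
    total_sum += filtered["value"].sum()

# Combine the partial results into a final statistic
mean_value = total_sum / total_rows if total_rows else float("nan")
print("Rows kept:", total_rows, "- mean value:", mean_value)
```

Because only one chunk is held in memory at a time, peak memory usage stays roughly proportional to chunk_size rather than to the size of the whole file.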