Handling Large Files | Optimizing with Python's Built-in Features
Optimization Techniques in Python
Handling Large Files

Processing large files efficiently in Python is essential when working with datasets too big to fit in memory. Python provides tools like open() and map(), which allow you to process files lazily, saving memory and improving performance.

What Are Iterators?

Before proceeding to the open() function, we should first understand what an iterator is. An iterator is an object that represents a stream of data, allowing you to access one item at a time. Iterators implement two methods:

  • __iter__(): returns the iterator object itself;
  • __next__(): returns the next item in the stream and raises a StopIteration exception when no items are left.
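As a quick illustration of this protocol, here is a minimal custom iterator (the class name CountUpTo is just for this sketch):

```python
class CountUpTo:
    """Iterator that yields integers from 1 up to a limit."""

    def __init__(self, limit):
        self.limit = limit
        self.current = 0

    def __iter__(self):
        # An iterator returns itself from __iter__()
        return self

    def __next__(self):
        if self.current >= self.limit:
            # Signal that the stream is exhausted
            raise StopIteration
        self.current += 1
        return self.current


print(list(CountUpTo(3)))  # [1, 2, 3]
```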

Let's say we have an iterator named iterator_object. We can iterate over it using a usual for loop:
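For instance, assuming iterator_object is created here with the built-in iter() (any iterator works the same way):

```python
iterator_object = iter(["a", "b", "c"])

# A regular for loop consumes the iterator one item at a time
for item in iterator_object:
    print(item)
```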

In fact, under the hood, the following happens (the next() function internally calls the __next__() method of the iterator):
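A sketch of that equivalent loop, written out by hand with the same iterator_object:

```python
iterator_object = iter(["a", "b", "c"])

while True:
    try:
        # next() internally calls iterator_object.__next__()
        item = next(iterator_object)
    except StopIteration:
        # No items left; exit the loop
        break
    print(item)
```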

Unlike standard collections, iterators are characterized by lazy evaluation, meaning they generate or fetch data only when required, rather than loading everything into memory at once. This approach makes them highly memory-efficient, particularly when working with large datasets.

File Objects as Iterators

Python's open() function returns a file object, which is an iterator. This allows you to:

  • Iterate over a file line by line using a for loop;
  • Read one line at a time into memory, making it suitable for large files (as long as individual lines fit in memory).

For example, suppose a log file with 1,000,000 lines contains both INFO and ERROR messages. We can count the ERROR occurrences by iterating through the file line by line, which works even if the file cannot fit entirely in memory (as would be the case if the log grew much larger).

```python
# Generate a sample log file with one million lines
log_lines = [
    f"INFO: Log entry {i}" if i % 100 != 0 else f"ERROR: Critical issue {i}"
    for i in range(1, 1000001)
]
with open("large_log.txt", "w") as log_file:
    log_file.write("\n".join(log_lines))

# Process the file to count error entries
error_count = 0
for line in open("large_log.txt"):
    if "ERROR" in line:
        error_count += 1
print(f"Total error entries: {error_count}")
```

Transforming File Lines with map()

As mentioned in the previous chapter, map() returns an iterator, applying a transformation function lazily to each line in a file. Similar to file objects, map() processes data one item at a time without loading everything into memory, making it an efficient option for handling large files.

For example, let's create a file containing 1,000,000 email addresses, some of which include uppercase letters. Our goal is to convert every email to lowercase and save the normalized results in a new file (normalized_emails.txt). We'll use map() to do this, keeping the script memory-efficient and suitable for processing even larger files.

```python
# Create a file with mixed-case email addresses
email_lines = [
    "John.Doe@example.com",
    "Jane.SMITH@domain.org",
    "BOB.brown@anotherexample.net",
    "ALICE.williams@sample.com",
] * 250000  # Repeat to simulate a large file

with open("email_list.txt", "w") as email_file:
    email_file.write("\n".join(email_lines))

# Process the file to standardize email addresses (convert to lowercase)
with open("email_list.txt") as input_file, \
        open("normalized_emails.txt", "w") as output_file:
    # map() lowercases each line lazily, one line at a time
    lowercase_emails = map(str.lower, input_file)
    for email in lowercase_emails:
        output_file.write(email)

print('Done')
```
1. You need to convert all email addresses in a file to lowercase and save them to a new file **without loading everything into memory**. Which approach is most efficient?
2. Which of the following statements about file objects in Python is correct?


Section 3, Chapter 2