Optimization Techniques in Python

Handling Large Files

Processing large files efficiently is essential when working with datasets too big to fit in memory. Python provides tools like open() and map() that let you process a file lazily, one line at a time, so memory use stays small no matter how large the file is.

What Are Iterators?

Before proceeding with the open() function, we should first understand what an iterator is. An iterator is an object that represents a stream of data, allowing you to access one item at a time. Iterators implement two methods:

__iter__(): returns the iterator object itself;
__next__(): returns the next item in the stream and raises a StopIteration exception when no items are left.
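The two methods above can be sketched with a small hand-written class. Countdown is a hypothetical name chosen only for illustration, and this is a minimal sketch rather than the only way to build an iterator:

```python
class Countdown:
    """Hypothetical iterator that yields start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__()
        return self

    def __next__(self):
        # Signal the end of the stream once the counter is exhausted
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


countdown_values = list(Countdown(3))
print(countdown_values)  # [3, 2, 1]
```

Because the class keeps only the current counter, it never holds the whole sequence in memory.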

Let's say we have an iterator named iterator_object. We can iterate over it using a usual for loop:

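As a minimal sketch, assuming iterator_object is created by calling iter() on a small illustrative list, such a loop could look like this:

```python
# Assumption: the iterator is built from a short list just for demonstration
iterator_object = iter([10, 20, 30])

seen_items = []
for item in iterator_object:
    seen_items.append(item)
    print(item)
```

Once the loop finishes, the iterator is exhausted, so looping over it a second time would produce no items.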

In fact, under the hood, the for loop repeatedly calls next() on the iterator (the next() function internally calls the iterator's __next__() method) and stops when StopIteration is raised:

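A roughly equivalent manual loop might be sketched as follows, again assuming iterator_object comes from iter() on an illustrative list:

```python
# Assumption: the iterator is built from a short list just for demonstration
iterator_object = iter(["first", "second", "third"])

collected = []
while True:
    try:
        # next() calls iterator_object.__next__() internally
        item = next(iterator_object)
    except StopIteration:
        # No items are left, so leave the loop
        break
    collected.append(item)
    print(item)
```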

Unlike standard collections, iterators are characterized by lazy evaluation, meaning they generate or fetch data only when required, rather than loading everything into memory at once. This approach makes them highly memory-efficient, particularly when working with large datasets.
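To make the memory difference concrete, the sketch below compares a list with a generator expression, which is one kind of iterator. The variable names are illustrative, and sys.getsizeof() reports only the size of the container object itself, so the real gap is larger than the numbers suggest:

```python
import sys

# The list stores references to all one million results at once
numbers_list = [n * n for n in range(1_000_000)]

# The generator expression produces each result only when asked
numbers_lazy = (n * n for n in range(1_000_000))

list_size = sys.getsizeof(numbers_list)
lazy_size = sys.getsizeof(numbers_lazy)
print(f"list: {list_size} bytes, generator: {lazy_size} bytes")

# Values are computed on demand, one at a time
first_value = next(numbers_lazy)
second_value = next(numbers_lazy)
```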

File Objects as Iterators

The open() function returns a file object, which is an iterator. This allows you to:

Iterate over a file line by line using a for loop;
Read one line at a time into memory, making it suitable for large files (as long as individual lines fit in memory).
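The following sketch checks these properties on a small temporary file. The file contents and variable names are assumptions made for the demonstration:

```python
import os
import tempfile

# Hypothetical sample file created only for this demonstration
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

with open(sample_path) as sample_file:
    # A file object is its own iterator
    is_own_iterator = iter(sample_file) is sample_file
    # next() reads exactly one line
    first_line = next(sample_file)
    # A for loop (or comprehension) continues from the current position
    remaining_lines = [line for line in sample_file]

os.remove(sample_path)
print(is_own_iterator, repr(first_line), remaining_lines)
```

Note that each line keeps its trailing newline character, which matters when comparing or writing lines back out.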

For example, if a log file with 1,000,000 lines includes both INFO and ERROR messages, we can count the ERROR occurrences by iterating through the file line by line. This works even when the file cannot fit entirely in memory, which would be the case if many more log entries were added to it.


# Create a sample log file in which every 100th entry is an ERROR
log_lines = [f"INFO: Log entry {i}" if i % 100 != 0 else f"ERROR: Critical issue {i}" for i in range(1, 1000001)]
with open("large_log.txt", "w") as log_file:
    log_file.write("\n".join(log_lines))

# Process the file line by line to count error entries
error_count = 0
with open("large_log.txt") as log_file:  # the with block closes the file when done
    for line in log_file:
        if "ERROR" in line:
            error_count += 1

print(f"Total error entries: {error_count}")
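The same count can also be expressed with sum() and a generator expression, which still reads one line at a time. The sketch below is one possible variant, using a small temporary file (sample_log.txt is a hypothetical name) that is written line by line, so the sample data is never held in a list either:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file: every 100th entry is an ERROR
    sample_path = os.path.join(tmp_dir, "sample_log.txt")
    with open(sample_path, "w") as log_file:
        for i in range(1, 1001):
            prefix = "ERROR: Critical issue" if i % 100 == 0 else "INFO: Log entry"
            log_file.write(f"{prefix} {i}\n")

    # Count matching lines lazily, without building a list of lines
    with open(sample_path) as log_file:
        error_count = sum(1 for line in log_file if "ERROR" in line)

print(f"Total error entries: {error_count}")
```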

Transforming File Lines with map()

As mentioned in the previous chapter, map() returns an iterator that applies a transformation function lazily to each item of an iterable, such as the lines of a file. Like file objects, map() processes data one item at a time without loading everything into memory, making it an efficient option for handling large files.
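The laziness of map() can be observed directly. In this sketch, normalize and the calls list are hypothetical helpers used only to record when the function actually runs:

```python
calls = []


def normalize(text):
    # Record each call so we can see when map() does its work
    calls.append(text)
    return text.strip().lower()


lazy_result = map(normalize, ["Alpha\n", "BETA\n", "Gamma\n"])
calls_before = len(calls)  # map() has not called normalize() yet

first_item = next(lazy_result)
calls_after = len(calls)  # exactly one item has been processed

print(calls_before, first_item, calls_after)
```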

For example, let's create a file containing 1,000,000 email addresses, some of which include uppercase letters. Our goal is to convert all the emails to lowercase and save the normalized results in a new file (normalized_emails.txt). We'll use map() to achieve this, so the script reads and writes one line at a time and remains suitable for even larger files.


# Create a file with mixed-case email addresses
email_lines = [
    "John.Doe@example.com",
    "Jane.SMITH@domain.org",
    "BOB.brown@anotherexample.net",
    "ALICE.williams@sample.com"
] * 250000  # Repeat to simulate a large file

with open("email_list.txt", "w") as email_file:
    email_file.write("\n".join(email_lines))

# Process the file to standardize email addresses (convert to lowercase)
with open("email_list.txt") as input_file, open("normalized_emails.txt", "w") as output_file:
    # Use map() to convert each email to lowercase
    lowercase_emails = map(str.lower, input_file)
    for email in lowercase_emails:
        output_file.write(email)
      
    # Print the last email to verify the results
    print(email)

print('Done')
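As a possible variation, the explicit loop can be replaced by passing the map() iterator to writelines(), which accepts any iterable of strings and consumes it item by item. The sketch below uses a small temporary directory and two sample addresses as stand-ins for the large file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical small input file standing in for the large email list
    input_path = os.path.join(tmp_dir, "email_list.txt")
    output_path = os.path.join(tmp_dir, "normalized_emails.txt")
    with open(input_path, "w") as email_file:
        email_file.write("John.Doe@example.com\nJane.SMITH@domain.org\n")

    # writelines() pulls lines from the lazy map() iterator one at a time
    with open(input_path) as input_file, open(output_path, "w") as output_file:
        output_file.writelines(map(str.lower, input_file))

    # Read the small result back to verify it
    with open(output_path) as result_file:
        normalized = result_file.read().splitlines()

print(normalized)
```

Since each line read from the input already ends with a newline, writelines() does not need to add one.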

1. You need to convert all email addresses in a file to lowercase and save them to a new file without loading everything into memory. Which approach is most efficient?

2. Which of the following statements about file objects is correct?

