Blog Post #125: The Problem with Huge Lists: A Memory Deep Dive

In our recent posts on comprehensions (Posts #120 through #124), we celebrated them as a powerful way to create new lists. The pattern of building a full list of results in memory is simple and works wonderfully for most everyday tasks with a few hundred or even a few thousand items.

But what happens when “everyday” becomes “enormous”? What if you need to work with a sequence containing not a thousand, but ten million items? Suddenly, this simple approach can bring your computer to its knees.

In this post, we’ll explore the performance and memory problems associated with creating very large lists, motivating the need for a more efficient solution.

How Lists Store Data in Memory

When you create a list, Python allocates a block of memory and fills it with every item (more precisely, with a reference to every item) right then and there. If you build a list of one million integers, Python must find enough free memory to hold all one million of those integer objects, plus the list that keeps track of them, before your program can proceed to the next line.

Think of it like this: to bake a million cookies, this approach requires you to first gather and mix all one million cookies’ worth of dough in a single, gigantic bowl before you can even start baking the first cookie.
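
To put rough numbers on this, we can ask Python how big these objects actually are. The sketch below is only an illustration; the exact figures depend on your Python version and platform, and sys.getsizeof reports the size of the list object itself (its internal array of references), not the integer objects it refers to.

import sys

# A small Python integer is a full object, typically around 28 bytes on a 64-bit CPython.
print(sys.getsizeof(0))

# The list stores a reference (usually 8 bytes) to every one of its one million integers,
# so the list object alone already occupies several megabytes.
numbers = list(range(1_000_000))
print(f"List object: {sys.getsizeof(numbers) / (1024 * 1024):.1f} MB (references only)")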

A Memory-Hungry Example

Let’s try to create a list containing the squares of the first 10 million integers. This will demonstrate the upfront cost of building a large list.

(Warning: Running the following code may be slow and consume a significant amount of your computer’s RAM. You can use a smaller number like 1_000_000 to see the effect without a long wait.)

# The underscore in 10_000_000 is just for readability; Python ignores it.
print("Creating a list of 10 million squared numbers...")

large_list = [i**2 for i in range(10_000_000)]

print("List created successfully.")
print(f"The list contains {len(large_list)} items.")

When you run this code, you’ll notice two things:

  1. A significant delay: Your program will pause for several seconds before printing the “List created successfully” message.
  2. High memory usage: If you were to watch your system’s activity monitor, you would see a large spike in memory usage (potentially hundreds of megabytes) as Python allocates space for and calculates all 10 million numbers. A sketch that measures both effects from inside Python follows this list.
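
If you’d like to confirm these two observations from inside Python rather than from a system monitor, one option is the built-in time and tracemalloc modules. This is a rough sketch rather than a careful benchmark, and it uses one million items so the run stays short.

import time
import tracemalloc

tracemalloc.start()              # start tracking memory allocations
start = time.perf_counter()      # start the clock

squares = [i**2 for i in range(1_000_000)]   # eagerly build the whole list

elapsed = time.perf_counter() - start
current, peak = tracemalloc.get_traced_memory()   # bytes currently allocated, and the peak
tracemalloc.stop()

print(f"Built {len(squares):,} squares in {elapsed:.2f} seconds.")
print(f"Peak traced memory: {peak / (1024 * 1024):.1f} MB")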

The Problem: Eager Evaluation

This approach of building the entire collection at once is called eager evaluation. The entire list is built and stored in memory eagerly, or right away, before any of its items are actually used.

This is inefficient for two reasons:

  • Upfront Memory Cost: You pay the full memory price for the entire list, even if you only need to look at one item at a time in a for loop.
  • Upfront Time Cost: You have to wait for the entire list to be built before you can start processing even the very first item (see the sketch after this list).
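
To make the upfront cost concrete, here is a small sketch of a loop that only ever needs a handful of results. The full list is still built and stored before the loop touches its first item; the numbers here are purely illustrative.

# The entire list is created here, paying the full time and memory cost upfront...
squares = [i**2 for i in range(10_000_000)]

# ...even though this loop only ever looks at the first three items.
for value in squares[:3]:
    print(value)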

A Better Way: The Idea of Lazy Evaluation

But what if we don’t need all 10 million numbers at the same time? What if we just want to loop through them one by one, perform a calculation on each, and then discard it?

What if, instead, we could create a “smarter” object that knows how to generate the numbers from 0 to 10 million, but only produces them one at a time, and only when we ask for them?

This is the concept of lazy evaluation. Instead of preparing everything upfront, we generate each value only when it’s needed. This would use almost no memory, as we’d only ever be holding one number in memory at any given moment.
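
We’ll meet Python’s general-purpose tool for this in the next post, but you have already used a lazy object without realizing it: in Python 3, range doesn’t build all of its numbers upfront; it hands them out one at a time as a loop asks for them. The comparison below is a rough sketch of the difference in footprint (the exact byte counts depend on your Python build).

import sys

lazy_range = range(10_000_000)         # lazy: no numbers have been produced yet
eager_list = list(range(10_000_000))   # eager: all 10 million numbers exist right now

# The range object stays tiny no matter how many numbers it can produce,
# while the list grows with the number of items it actually holds.
print(f"range object: {sys.getsizeof(lazy_range)} bytes")
print(f"list object:  {sys.getsizeof(eager_list) / (1024 * 1024):.0f} MB (plus the integers themselves)")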

What’s Next?

Eagerly creating large lists in memory is inefficient, consuming significant time and RAM. For processing massive datasets, we need a “lazy” approach that generates values on demand.

This “smarter,” lazy object exists in Python, and it’s called a generator. In Post #126, we’ll introduce generators and see how they provide a memory-efficient solution to the problem of working with large sequences.

Author

Debjeet Bhowmik

Experienced Cloud & DevOps Engineer with hands-on experience in AWS, GCP, Terraform, Ansible, ELK, Docker, Git, GitLab, Python, PowerShell, Shell, and theoretical knowledge of Azure, Kubernetes & Jenkins. In my free time, I write blogs on ckdbtech.com.
