Petter Holt Juliussen • Mail | Mastodon | GitHub | Letterboxd

for later reference.

Generators

2019-04-04

Python generators are a simple way of creating iterators. All the overhead associated with building an iterator (implementing a class with __iter__() and __next__(), keeping track of internal states etc.) are automatically handled by generators in Python.

Generator functions

A generator is a function that returns an object (iterator) which we can iterate over (one value at a time).

Example

# Build and return a list
def firstn(n):
    num, nums = 0, []
    while num < n:
        nums.append(num)
        num += 1
    return nums

sum_of_first_n = sum(firstn(1000000))

The code is quite simple and straightforward, but it builds the full list in memory (all n "10 megabyte" integers). One solution is to resort to the generator pattern and implement an iterable object:

# Using the generator pattern (an iterable)
class firstn(object):
    def __init__(self, n):
        self.n = n
        self.num, self.nums = 0, []

    def __iter__(self):
        return self

    # Python 3 compatibility
    def __next__(self):
        return self.next()

    def next(self):
        if self.num < self.n:
            cur, self.num = self.num, self.num+1
            return cur
        else:
            raise StopIteration()

sum_of_first_n = sum(firstn(1000000))

This will perform as we expect, but there is a lot of boilerplate. Python provides generator functions as a convenient shortcut to building iterators:

# A generator that yields items instead of returning a list
def firstn(n):
    num = 0
    while num < n:
        yield num
        num += 1

sum_of_first_n = sum(firstn(1000000))

It is very similar to the implementation that built a list in memory, but has the memory usage characteristic of the iterator implementation.

Generator expressions

Generator expressions provide an additional shortcut to build generators out of expressions similar to that of list comprehensions. A list comprehension can also be turned into a generator expression by replacing the square brackets with parentheses.

Example

# List comprehension
doubles = [2 * n for n in range(50)]

# Same as the list comprehension above
doubles = list(2 * n for n in range(50))

By allowing generator expressions, we don't have to write a generator function if we do not need the list. If only list comprehensions were available, and we needed to lazily build a set of items to be processed, we will have to write a generator function.

Performance imporovements

The performance improvement from the use of generators is the result of the lazy (on demand) generation of values, which translates to lower memory usage. Furthermore, we do not need to wait until all the elements have been generated before we start to use them. This is similar to the benefits provided by iterators, but the generator makes building iterators easy.

Example

Both range and xrange represent a range of numbers, and have the same function signature, but range returns a list while xrange returns a generator (at least in concept; the implementation may differ).

# Using a non-generator (Python 2.x only)
sum_of_first_n = sum(range(1000000))

# Using a generator
sum_of_first_n = sum(xrange(1000000))

When we use range we build a 1,000,000 element list in memory and then find its sum. This is a waste, considering that we use these 1,000,000 elements just to compute the sum.

On the other hand, when we use xrange, we do not incur the cost of building a 1,000,000 element list in memory. The generator created by xrange will generate each number, which sum will consume to accumulate the sum.

Note: A generator will provide performance benefits only if we do not intend to use that set of generated values more than once.

# Note: Python 2.x only
s = sum(xrange(1000000))
p = product(xrange(1000000))

The same expensive process is performed twice. In cases like this, building a list in memory might be worth it:

# Note: Python 2.x only
nums = list(xrange(1000000))
s = sum(nums)
p = product(nums)

Examples

# The for loop will generate each i (i.e. 1,2,3,4,5, ...), add it to total,  and throw it away
# before the next i is generated.  This is opposed to iterating through range(...), which creates
# a potentially massive list and then iterates through it.
total = 0
for i in irange(1000000):
   total += i
# Generators can be composed

# square is a generator
square = (i * i for i in irange(1000000))

# Add the squares
total = 0
for i in square:
   total += i