Generator Pitfalls

2012-02-08 19:34

I’m a big fan of Python generators. With seemingly little effort, they often allow us to greatly reduce memory overhead. They do so by completely eliminating intermediate results of list manipulations. Along with the functions defined in the itertools package, generators also introduce basic lazy computation capabilities into Python.
This combination of higher-order functions and generators is truly remarkable. Why? Because compared to ordinary loops, it gains us both speed and readability at the same time. That’s quite an achievement; typically, optimizations sacrifice one for the other.

All this goodness, however, comes with a few strings attached. There are some ways in which we can be bitten by improperly used generators, and I think it’s helpful to know about them. So today, I’ll try to outline some of those traps.

Generators are lazy

You don’t say, eh? That’s one of the main reasons we are so keen on using them, after all! But what I’m implying is the following: if generator-based code is to actually do anything, there must be a point where all this laziness “bottoms out”. Otherwise, you are just building expression thunks and not really evaluating anything.

How might such circumstances occur, though? One case involves using generators for their side effects – a rather questionable practice, mind you:

    import itertools

    def process_kvp(key, value=None):
        print key,
        if value: print "=", value

    itertools.starmap(process_kvp, some_dict.iteritems())

The starmap call does not print anything here, because it just constructs a generator. Only when this generator is iterated over will the dictionary elements be fed to the process_kvp function. Such iteration would of course require a loop (or the consume recipe from the itertools docs, sketched below), so we might as well do away with the generator altogether and just stick with plain old for:

    for kv in some_dict.iteritems():
        process_kvp(*kv)
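
For the record, the consume recipe mentioned above boils down to feeding the iterator into a zero-length deque, which runs through it at C speed while discarding every element. Here’s a sketch, lightly adapted from the itertools documentation:

    import collections
    from itertools import islice

    def consume(iterator, n=None):
        # Advance the iterator n steps ahead; if n is None, consume it entirely.
        if n is None:
            # Feed the entire iterator into a zero-length deque.
            collections.deque(iterator, maxlen=0)
        else:
            # Advance to the empty slice starting at position n.
            next(islice(iterator, n, n), None)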

In real code the judgement isn’t so obvious, of course. We’ll most likely receive the generator from some external function – one that probably refers to its result only as an iterable. As usual in such cases, we must not assume anything beyond what is explicitly stated. Specifically, we cannot expect that we have any kind of existing, tangible collection with actual elements. An iterable can very well be just a generator, whose items are yet to be evaluated. This fact can impact performance by moving some potentially heavy processing into unexpected places, resulting in e.g. executing database queries while rendering a website’s template.
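
To make this concrete, here is a minimal sketch – with slow_query and get_users being purely hypothetical stand-ins for real data-access code – showing how the expensive part slips past the point where we’d expect it to happen:

    import time

    def slow_query(user_id):
        time.sleep(1)  # stand-in for an actual database roundtrip
        return {'id': user_id}

    def get_users():
        # The "queries" below do NOT run when get_users() is called...
        return (slow_query(user_id) for user_id in [1, 2, 3])

    users = get_users()   # returns instantly: nothing evaluated yet
    for user in users:    # ...they run here, one per loop iteration
        print user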

Generators are truthy

The above point was concerned mostly with performance, but what about correctness? You may think it’s not hard to conform to the iterables’ contract, which should in turn guarantee that they don’t backfire on us with unexpected behavior. Yet how many times have you found yourself reading (or writing!) code such as this:

    import logging

    def do_stuff():
        items = get_stuff_to_do()  # returns an iterable
        if not items:
            logging.debug("No stuff to do!")
            return 0
        # ...do something with 'items'...

It’s small and may look harmless, but it still manages to make an ungrounded (and thus potentially false) assumption about an iterable. The problem lies in the if condition, meant to check whether we have any items to do stuff with. It would work correctly if items were a list, dict, deque or any other actual collection. But for generators, it will not work that way – even though they are still undoubtedly iterable!

Why? Because generators are not collections; they are just suppliers. We can only tell them to return their next value, but we cannot peek inside to see if the value is really there. As such, generators are not susceptible to boolean coercion in the same way that collections are; it’s not possible to check their “emptiness”. They behave like regular Python objects and are always truthy, even if it’s “obvious” they won’t yield any value when asked:

    >>> g = (x for x in [])
    >>> if g: print "Truthy!"
    Truthy!

Going back to the previous sample, we can see that the if block won’t be executed when get_stuff_to_do returns an “empty” generator. The consequences of this may vary from barely noticeable to disastrous, depending on what the rest of the do_stuff function looks like. Nevertheless, that code will run with one of its invariants violated: a fertile ground for all kinds of unintended effects.
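
If we genuinely need an emptiness check that works for arbitrary iterables, one option is to try pulling the first element and then reattach it with itertools.chain. A minimal sketch (the peek helper below is our own invention, not a standard function):

    import itertools

    def peek(iterable):
        # Return (first_item, restored_iterable), or None if it's empty.
        iterator = iter(iterable)
        try:
            first = next(iterator)
        except StopIteration:
            return None
        return first, itertools.chain([first], iterator)

    print peek(x for x in [])   # None: correctly detected as empty
    first, rest = peek([1, 2, 3])
    print first, list(rest)     # 1 [1, 2, 3]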

Generators are once-off

An intuitive, informal understanding of the term ‘iterable’ is likely to include one more unjustified assumption: that it can be iterated over, and over – i.e. multiple times. Again, this is very much true if we’re dealing with a collection, but generators simply don’t carry enough information to repeat the sequence they yield. In other words, they cannot be rewound: once we go through a generator, it’s stuck in its final state, not usable for anything more.
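
A quick interpreter session makes this tangible: the second pass over a generator yields nothing, while a list could be traversed any number of times:

    >>> g = (x * x for x in range(3))
    >>> list(g)
    [0, 1, 4]
    >>> list(g)   # already exhausted
    []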

Just like with the previous caveats, failure to account for this can very well go unnoticed – at least until we spot some weird behavior it results in. Continuing our superficial example from the preceding paragraph, let’s pretend the rest of the do_stuff function requires going over items at least twice. It could be, for example, an iterable of objects in a game or physics simulation; objects that can potentially move really fast and thus require more sophisticated collision detection (based on e.g. intersection of segments):

    new_positions = calculate_displacement(items, delta_time)
    collision_pairs, rest = detect_collisions(items, new_positions)
    collided = compute_collision_response(collision_pairs)
    for item in itertools.chain(collided, rest):
        item.draw()

Even assuming the incredible coincidence of getting all the required math right (;-)), we wouldn’t see any action whatsoever if items is a generator. The reason is simple: when calculate_displacement goes through items once (vigorously applying Euler integration, most likely), it exhausts the generator. For any subsequent traversal (like the one in detect_collisions), the generator will appear empty. In this particular snippet, the likely result is a blank screen, which hopefully is enough of a hint to figure out what’s going on :P
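
The straightforward remedy is to materialize the sequence up front whenever we know it will be traversed more than once. Reusing the hypothetical functions from the snippet above:

    items = list(items)  # force evaluation once, before any traversal

    new_positions = calculate_displacement(items, delta_time)
    collision_pairs, rest = detect_collisions(items, new_positions)
    # ...the rest proceeds as before...

itertools.tee can also split one iterator into several independent ones, but it buffers the shared elements internally; for complete, repeated passes like these, a plain list is usually simpler and cheaper.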

Generators are not lists

An overarching conclusion of the above-mentioned pitfalls is rather evident, and seemingly contrary to the statement from the beginning. Indeed, generators may not be a drop-in replacement for lists (or other collections) if we rely on their “listy” behavior. And unless memory overhead proves troublesome, it’s also not worth injecting them into older code that already uses lists.

For new code, however, sticking with generators right off the bat has the numerous benefits I mentioned at the start. What it requires, though, is shedding some habits that might have formed over time spent manipulating lists. I think I’ve managed to pinpoint the most common ways in which those habits result in incorrect code. Incidentally, they all originate from accidental, unfounded expectations towards iterables in general. That’s no coincidence: generators simply happen to be the “purest” of iterables, supporting only the bare minimum of required operations.
