I’m a big fan of Python generators. With seemingly little effort, they often allow us to greatly reduce memory overhead, by completely eliminating intermediate results of list manipulations. Along with the functions defined in the itertools module, generators also introduce basic lazy computation capabilities into Python.
This combination of higher-order functions and generators is truly remarkable. Why? Because compared to ordinary loops, it gains us both speed and readability at the same time. That’s quite an achievement; typically, optimizations sacrifice one for the other.
All this goodness, however, comes with a few strings attached. There are some ways in which we can be bitten by improperly used generators, and I think it’s helpful to know about them. So today, I’ll try to outline some of those traps.
You don’t say, eh? That’s one of the main reasons we are so keen on using them, after all! But what I’m implying is the following: if generator-based code is to actually do anything, there must be a point where all this laziness “bottoms out”. Otherwise, you are just building expression thunks and not really evaluating anything.
How might such circumstances occur, though? One case involves using generators for their side effects – a rather questionable practice, mind you:
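For example, a sketch along these lines (the dictionary and the body of process_kvp are made up for illustration):

from itertools import starmap

def process_kvp(key, value):
    print "%s -> %s" % (key, value)

d = {'foo': 1, 'bar': 2}
starmap(process_kvp, d.iteritems())  # prints nothing!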
The starmap call does not print anything here, because it merely constructs a generator. Only when this generator is iterated over will the dictionary elements be fed to the process_kvp function. Such iteration would of course require a loop (or the consume recipe from the itertools docs), so we might as well do away with the generator altogether and just stick with a plain old for loop:
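With the same hypothetical process_kvp and dictionary as above, that is simply:

for key, value in d.iteritems():
    process_kvp(key, value)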
In real code the judgement isn’t so obvious, of course. We’ll most likely receive the generator from some external function – one that probably refers to its result only as an iterable. As usual in such cases, we must not assume anything beyond what is explicitly stated. Specifically, we cannot expect to have any kind of existing, tangible collection with actual elements. An iterable can very well be just a generator, whose items are yet to be evaluated. This fact can impact performance by moving some potentially heavy processing into unexpected places, resulting in e.g. executing database queries while rendering a website’s template.
The above point was concerned mostly with performance, but what about correctness? You may think it’s not hard to conform to iterables’ contract, which should in turn guarantee that they don’t backfire on us with unexpected behavior. Yet how many times have you found yourself reading (or writing!) code such as this:
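Presumably something along these lines, with get_stuff_to_do being a hypothetical supplier of work items:

import logging

def do_stuff():
    items = get_stuff_to_do()  # returns some iterable
    if not items:
        logging.debug("Nothing to do!")
        return
    # ... proceed with processing items ...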
It’s small and may look harmless, but it still manages to make an ungrounded (thus potentially false) assumption about an iterable. The problem lies in the if condition, meant to check whether we have any items to do stuff with. It would work correctly if items were a list, dict, deque or any other actual collection. But for generators, it will not work that way – even though they are still undoubtedly iterable!
Why? Because generators are not collections; they are just suppliers. We can only ask them for their next value, but we cannot peek inside to see whether that value is actually there. As such, generators are not susceptible to boolean coercion in the same way that collections are; it’s not possible to check their “emptiness”. They behave like regular Python objects and are always truthy, even if it’s “obvious” they won’t yield any value when asked:
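A quick session in the interactive interpreter demonstrates this:

>>> empty = (x for x in [])
>>> bool(empty)
True
>>> list(empty)
[]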
Going back to the previous sample, we can see that the if block won’t be executed when get_stuff_to_do returns an “empty” generator. The consequences of this fact may vary from barely noticeable to disastrous, depending on what the rest of the do_stuff function looks like. Nevertheless, that code will run with one of its invariants violated: a fertile ground for any kind of unintended effects.
An intuitive, informal understanding of the term ‘iterable’ is likely to include one more unjustified assumption: that it can be iterated over, and over – i.e. multiple times. Again, this is very much true if we’re dealing with a collection, but generators simply don’t carry enough information to repeat the sequence they yield. In other words, they cannot be rewound: once we go through a generator, it’s stuck in its final state, not usable for anything more.
Just like with the previous caveats, failure to account for this can very well go unnoticed – at least until we spot some weird behavior it results in. Continuing our superficial example from the preceding paragraph, let’s pretend the rest of the do_stuff function requires going over items at least twice. It could be, for example, an iterable of objects in a game or physics simulation; objects that can potentially move really fast and thus require more sophisticated collision detection (based on e.g. intersection of segments):
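A sketch of how that might look; update is a made-up name, while the two helpers are the ones discussed below:

def update(items, delta_t):
    # first pass: move every object according to its velocity
    calculate_displacement(items, delta_t)
    # second pass: test the segments just travelled for intersections
    detect_collisions(items)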
Even assuming the incredible coincidence of getting all the required math right (;-)), we wouldn’t see any action whatsoever if items is a generator. The reason is simple: when calculate_displacement goes through items once (vigorously applying Euler integration, most likely), it fully expends the generator. For any subsequent traversal (like the one in detect_collisions), the generator will appear empty. In this particular snippet, that will most likely result in a blank screen, which hopefully is enough of a hint to figure out what’s going on :P
An overarching conclusion of the above-mentioned pitfalls is rather evident, and seemingly contrary to the statement from the beginning. Indeed, generators may not be a drop-in replacement for lists (or other collections) if we rely very much on their “listy” behavior. And unless memory overhead proves troublesome, it’s also not worth injecting them into older code that already uses lists.
For new code, however, sticking with generators right off the bat has the numerous benefits I mentioned at the start. What it requires, though, is evicting some habits that might have emerged after we spent some time manipulating lists. I think I managed to pinpoint the most common ways in which those habits result in incorrect code. Incidentally, they all originate from accidental, unfounded expectations towards iterables in general. That’s no coincidence: generators simply happen to be the “purest” of iterables, supporting only the bare minimum of required operations.
Recently I had to solve a simple but very practical coding puzzle. The task was to check whether an IPv4 address (given in traditional dot notation) is within a specified network, described as a CIDR block.
You may notice that this is almost as ubiquitous as a programming problem can get. Implementations of its simplified version are executed literally millions (if not billions) of times per second, for every IP packet at every junction of this immensely vast Internet. Yet when it comes to doing such a thing in Python, one is actually left without much help from the standard library and must simply Do It Yourself.
It’s not particularly hard, of course. But before jumping to a solution, let’s look at how we expect our function to behave:
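Something like the following, assuming we name the function ip_in_network (as in the C version later on):

>>> ip_in_network('10.0.0.1', '10.0.0.0/8')
True
>>> ip_in_network('192.168.1.42', '192.168.1.0/24')
True
>>> ip_in_network('172.16.5.4', '10.0.0.0/8')
False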
Pretty obvious, isn’t it? Now, if you recall how packet routing works under the hood, you might remember that there are some bitwise operations involved. They are necessary to determine whether a specific IP address can be found within a given network. However low-level this may sound, there is really no escape from doing it this way. Yes, even in Python :P
It goes deeper, though. Most of the interface of Python’s socket module is actually carbon-copied from POSIX socket.h and related headers, down to the exact names of particular functions. As a result, solving our task in C isn’t very different from doing it in Python. I’d even argue that the C version is clearer and easier to follow. This is how it could look:
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>
bool ip_in_network(const char* addr, const char* net) {
    struct in_addr ip_addr;
    if (!inet_aton(addr, &ip_addr)) return false;

    /* split "address/mask_len" into its two parts */
    char network[32];
    strncpy(network, net, sizeof(network) - 1);
    network[sizeof(network) - 1] = '\0';
    char* slash = strstr(network, "/");
    if (!slash) return false;
    int mask_len = atoi(slash + 1);
    if (mask_len < 0 || mask_len > 32) return false;
    *slash = '\0';

    struct in_addr net_addr;
    if (!inet_aton(network, &net_addr)) return false;

    /* compare the network prefixes in host byte order */
    unsigned ip_bits = ntohl(ip_addr.s_addr);
    unsigned net_bits = ntohl(net_addr.s_addr);
    unsigned netmask = mask_len ? ~0u << (32 - mask_len) : 0;
    return (ip_bits & netmask) == (net_bits & netmask);
}
A similar thing done in Python obviously requires less scaffolding, but it also introduces its own intricacies via the struct module (needed for unpacking bytes). All in all, it seems like there is not much to be gained here from Python’s high level of abstraction.
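For reference, here’s a Python sketch that mirrors the C version above (it assumes a well-formed CIDR string; socket.error is raised for malformed addresses):

import socket
import struct

def ip_in_network(addr, net):
    network, _, mask_len = net.partition('/')
    mask_len = int(mask_len)
    if not 0 <= mask_len <= 32:
        return False

    # inet_aton yields 4 bytes in network order;
    # '!I' unpacks them as a big-endian unsigned int
    ip_bits, = struct.unpack('!I', socket.inet_aton(addr))
    net_bits, = struct.unpack('!I', socket.inet_aton(network))

    netmask = (~0 << (32 - mask_len)) & 0xFFFFFFFF
    return ip_bits & netmask == net_bits & netmask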
And that’s perfectly OK: no language is a silver bullet. Sometimes we need to do things the quirky way.
It is quite likely you are familiar with the Wat talk by Gary Bernhardt. It has been sweeping through the Internet, giving a good laugh to pretty much anyone who watches it. It certainly did to me!
The speaker makes fun of the Ruby and JavaScript languages (although mostly the latter, really), showing totally unexpected and baffling results of some seemingly trivial operations – like adding two arrays. It turns out that in JavaScript, the result is an empty string. (And the reasons for it provoke an even bigger “wat”.)
After watching the talk about five times (it hardly gets old), I started to wonder whether it is only those two languages that exhibit similarly confusing behavior… The answer is of course “No”, and that should be glaringly obvious to anyone who knows at least a bit of C++ ;) But beating that horse would be way too easy, so I’d rather try something more ambitious.
Hence I ventured forth to search for “wat” in Python 2.x. The journey wasn’t short enough to stop at the mere addition operator, but nevertheless – and despite me being nowhere near a Python expert – I managed to find some candidates rather quickly.
I strove to keep with the original spirit of Gary’s talk, so I only included those quirks that can be easily shown in interactive interpreter. The final result consists of three of them, arranged in the order of increasing puzzlement. They are given without explanation or rationale, hopefully to encourage some thought beyond amusement :)
Behold, then, the Wat of Python!
Python dictionaries have an inconspicuous method named setdefault. Asking for its description, we’ll be presented with a rather terse interpretation:
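In Python 2, help(dict.setdefault) says roughly this single line:

setdefault(...)
    D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D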
While it might not be immediately obvious what we could use this method for, we can actually find quite a lot of applications if we pay a little attention. The main advantage of setdefault seems to lie in the elimination of ifs; a feat we can feel quite smug about. As an example, consider a function which groups a list of key-value pairs (with possible duplicate keys) into a dictionary of lists:
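Without setdefault, it could look more or less like this (group_pairs is a made-up name):

def group_pairs(pairs):
    result = {}
    for key, value in pairs:
        if key in result:
            result[key].append(value)
        else:
            result[key] = [value]
    return result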
This is something we could do when parsing the query string of a URL, if there weren’t a readily available API for that:
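In Python 2, that API lives in the urlparse module:

>>> from urlparse import parse_qs
>>> parse_qs('a=1&b=2&a=3')
{'a': ['1', '3'], 'b': ['2']}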
So, with setdefault we can get the same job done more succinctly, as shown below. It would seem that it is a nifty little function, and something we should keep in mind as part of our toolbox. Let’s remember it and move on, shall we?…
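The condensed version:

def group_pairs(pairs):
    result = {}
    for key, value in pairs:
        result.setdefault(key, []).append(value)
    return result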
Actually, no. setdefault is really not a good piece of dict‘s interface, and the main reason to remember it is mostly for its caveats. Indeed, there are quite a few of them – enough to shrink the space of possible applications to a rather tiny size. As a result, we should be cautious whenever we see (or write) setdefault in our own code.
Here’s why.
Unless you were living under an enormous, media-impervious rock for at least a few weeks, you must have heard of the recent ruckus about laws concerned with so-called intellectual property. It was mostly due to the infamous SOPA and PIPA bills that were to pass in the US, and the temporary “blackout” of various big websites whose owners voiced their protest against those bills. But we also have something of a local spin-off, relevant to the EU and specifically Poland: an outcry against ACTA, a similar albeit slightly less ridiculous piece of law, which (as of this writing) is due to be signed in a few days.
Even though we do not really know yet how those issues are going to be resolved, there are already a few important lessons to be learned here. And I think that the biggest one is not actually concerned with the merits of the laws in question. Instead, it draws attention to the problem of communication between the IT industry and the general public.
It never was, really. If anything, recent events served as a good wake-up call, reminding us that the Internet and all the related technological infrastructure is something we take for granted way too often. While I might be exaggerating a bit, I don’t think it’s very far-fetched to say that for the average person, the Internet is pretty much magic. You pop up “the Internet app”, type in whatever you are looking for, and a few moments later, voilà: you just got it, by the sheer magic of intertubes. Assuming, of course, that there is actually something specific you have in mind; otherwise, you can always look at what your friends have “shared”, and from there start your clicky journey through the nether. It’s awesome, virtually boundless, and it just works… right?
So far, the consequences of such technical ignorance were also mostly technical, surfacing as security issues, loss of data, malware spread and so on. But that’s alright! We’re dabbling in the arcane and invoking supernatural powers, so it’s no wonder we sometimes accidentally summon a few annoying daemons. Should that happen, we can always call an exorcist in the form of a friendly geek-next-door, or (at worst) tech support.
Now, however, failure to grasp the fundamental nature of the Internet can result in much more dramatic backlash. See, this technical stuff underneath is not just plumbing that can be safely ignored. The foundational, idealistic principles of the ’net – decentralization, freedom, knowledge, progress, innovation, flexibility, and so on – are woven into its very fabric. It is by exposure to those “boring details” that these ideas may influence and reshape minds, helping to do away with flawed and outdated notions – including, for example, the 19th century’s concept of intellectual property. I cannot even fathom how someone with a reasonable understanding of how software, the Internet and IT work could conceive something as outrageous as those infamous IP laws. It just doesn’t compute.
Yet it was conceived, put into words, formed into a legal document and officially proposed – more than once, in fact. Obviously, such things don’t happen by themselves, especially when powerful interest groups are at play. And that’s precisely the reason why we have various public institutions, from parliaments to international organizations: to weed out blatantly bad solutions, and sometimes even let the good ones pass through.
At least, that’s the theory. It’s naive to postulate virtues among politicians (i.e. those willingly aspiring for power), so we make them cling to the one value they’ll always embrace, for it is needed to maintain a ruling position: popularity. This should roughly translate to caring for the same things the voters care for. Roughly, because there will always be some disconnect due to effects of scale, perception, biases and a multitude of other factors.
Still, this is pretty much how it works. In order to push an agenda we are vitally interested in (or obstruct one we are strongly against), we need to gain enough publicity. We need to make people share our values, and care about the same things we care about. That’s how we set the appropriate causal chain in motion, eventually leading to the fulfillment of our goals.
And this is where the IT industry has failed miserably. It’s the reason why we’re frantically looking for support, rolling out our biggest cannons, hitting the news with blackouts of absolutely crucial Internet services and DDoS attacks on government sites. We have failed to make society share our values in an elegant, gradual and systematic manner, so now we need to resort to shock therapy to compensate for this negligence. Actually, in some cases we have been actively making things worse, professing the exact opposites of the values mentioned before: centralization, limitation, monocultures, stagnation and the rehashing of old concepts.
Now we are just reaping what we have sown.
As the Python docs state, the with statement is used to encapsulate common patterns of try–except–finally constructs. It is most often associated with managing external resources in order to ensure that they are properly freed, released, closed, disconnected from, or otherwise cleaned up. While at times it is sufficient to just write the finally block directly, repeated occurrences ask for using this language goodness more consciously, including writing our own context managers for specialized needs.
Those managers – basically with-enabled helper objects – are strikingly similar to the small local objects involved in the RAII technique from C++. The acronym expands to Resource Acquisition Is Initialization, further emphasizing the resource management part of this pattern. But let us not be constrained by that notion: the usage space of with is much wider.
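As a quick illustration, here’s a minimal, made-up context manager that has nothing to do with resources – it merely times a block of code:

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # executed upon entering the with block
    start = time.time()
    try:
        yield
    finally:
        # executed upon leaving the block, even via an exception
        print "%s took %.3fs" % (label, time.time() - start)

with timed("counting"):
    sum(xrange(10 ** 7))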
On contemporary websites and web applications, it is an extremely common task to display a list of items on a page. In any reasonable framework and/or templating engine, this can be accomplished rather trivially. Here’s how it could look in Jinja or Django templates:
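Something along these lines, assuming an items list in the template context (the snippet is valid in both engines):

<ul>
{% for item in items %}
    <li>{{ item }}</li>
{% endfor %}
</ul>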
But it’s usually not too long before our list grows big and it becomes unfeasible to render it all on the server. Or maybe, albeit less likely, we want it to be updated in real time, without having to reload the whole page. Either case requires incorporating some JavaScript code that talks to the server and obtains the next portion of data whenever it’s needed.
Obviously, that data has to be rendered as well. One option is to do it on the server side, serving actual HTML directly to JS. An arguably better solution is to respond with JSON or a similar representation of our items. This is conceptually simpler, feels less messy, and is potentially reusable as part of the website’s external API. There is just one drawback: it forces rendering to be done in JavaScript as well.