Python dictionaries have an inconspicuous method named
setdefault. Asking for its description, we’ll be presented with rather terse interpretation:
While it might not be immediately obvious what we could use this method for, we can actually find quite a lot of applications if we only pay a little attention. The main advantage of
setdefault seems to lie in elimination of
ifs; a feat we can feel quite smug about. As an example, consider a function which groups list of key-value pairs (with possible duplicate keys) into a dictionary of lists:
This is something which we could do when parsing query string of an URL, if there weren’t a readily available API for that:
setdefault we can get the same job done more succinctly. It would seem that it is a nifty little function, and something we should keep in mind as part of our toolbox. Let’s remember it and move on, shall we?…
setdefault is really not a good piece of
dict‘s interface, and the main reason we should remember about it is mostly for its caveats. Indeed, there are quite a few of them, enough to shrink the space of possible applications to rather tiny size. As a result, we should be cautious whenever we see (or write)
setdefault in our own code.
A small and obvious, yet important detail: the name of
setdefault begins with
set. To an untrained eye, it is therefore evident that the method is a kind of setter, and that we can expect a certain behavior to occur if we choose to use it. Specifically, we would assume the method will actually set something, not happily ignore everything we pass to it and do nothing.
But this is exactly what
setdefault does, in vast majority of cases! For any pair of arguments
v, the call:
is almost guaranteed to not do anything. It’s only the very first invocation for given
k that may actually have any effect – unless, of course, the
k is already present in
d. Provided we do not remove it, any subsequent call will not really set anything, and this is hardly a behavior we would expect from a
setdefault do, then? Given its name, a more reasonable functionality could revolve around the concept of “default value” for given key, independent from its actual value. It would be returned by
 operator if
k were not in the dictionary, instead of
KeyError being raised. And any call to
setdefault(k, ...) would overwrite the
k‘s default value, whether or not it currently has an actual value. This way,
setdefault would deliver on its promise of behaving like a setter.
Of course, this whole notion of “default value” is mighty confusing and doesn’t seem useful at all. Even if it was, it could be easily emulated by two
dicts working together. In fact, this would be a preferred way of handling such arcane use case; there is simply no room for it in standard library.
As described above,
setdefault looks like a deeply flawed version of something which strives to be a setter. But at the same time, it is almost equally eager to pretend to be a getter, for it is often its return value that we are really interested in.
Take another look at the following line in
Notice how we have a
setX function which does not only return some value, but definitely a non-trivial one: list! (We are
appending to it, after all.) It is unclear where this complex value really came from, unless we are aware that
setdefault is actually a variant of
dict.get method. In fact, it has exactly the same interface as
get – including the return value, which is either an object stored under given key, or a specific default value. Yes, you’ve heard that correctly: on the surface,
setdefault is really the same as
get! The only difference is in side effects, which the latter does not produce.
group_assoc_list we are vitally interested in those side effects. After all, they build up our dictionary of lists: the very structure this function is about. This might not always be the case, though. Quite often, we will be concerned only with value for particular key, dictionary itself being just incidental.
And this brings us to the next point.
There is this simple yet powerful – and extremely general – optimization technique, called memoization. The idea behind it is to introduce a layer of “persistent caching” in front of computationally expensive operation. As a result, such operation is performed only once for particular set of input data, with result saved into cache for cheap future access. When we ask for it, the cache is always polled first; if (and only if) result is not found there, it is computed from scratch and then cached (memoized).
What it has to do with dictionaries and
setdefault, though?… Well, the common way for implementing memoization is nothing else but a dictionary. Suppose that we have a following expensive function (unary one, for the sake of simplicity):
It looks like an easy task to slap a memoization facility over it. And if we indeed use a dictionary, this also seems like a great application of
setdefault, for it does the exact operation we want to perform on our cache: check value for given key, and optionally store one. Overall, it could look a lot like this:
Hopefully we have profiled our code before and after this supposed optimization, because what we are about to find is no performance gain whatsoever! This is easily explicable by the fact that Python is not a lazy language, as it requires all function arguments to be evaluated prior to the call. What it means is that expesive
_computation(arg) has to be calculated always, whether or not
arg is present in the
_cache. It is only afterwards when
setdefault performs its check, and that’s already too late for it to have any positive impact on performance.
So despite a seemingly perfect interface,
setdefault is actually ill-suited for memoization or caching. Changing that would require it to be promoted into full-fledged language mechanism, similar to
if-else ternary expressions. This is both extremely unlikely and totally unnecessary, given the narrow scope of resulting construct. As for caching, there are plenty of (correct) alternatives, such as the following:
So we have ruled out caching, but what about our original example? While
setdefault may indeed suffer from very poor naming, there is little doubt that in
group_assoc_list it is the best tool for the job. Or even more generally,
setdefault still looks very useful if we want to make sure that specific keys exist in the dictionary – as in this code:
But often it’s just a matter of looking at things from different perspective; we can easily stumble on alternative (and arguably better) phrasing:
Finally, even a seemingly ideal solution might be superseded by a more tailored one. Starting from Python 2.5, we have access to the
collections.defaultdict utility class – one that readily provides dictionary with defaults. If we use it instead of the ordinary
dict, we can pretend like it had all the values already inserted;
defaultdict will just create them on the fly.
Here’s how our
group_assoc_list function benefits from using this class:
defaultdict is still a pretty much
dict, so users of above function are unlikely to be surprised by unfamiliar interface.
Phew, that was a long writeup, and a summary is in order.
As I’ve shown, the
setdefault method is full of surprises and inconsistencies, causing noticeable confusion. Yet it often creeps up in code written by experienced pythonistas. Sometimes the reasons for it look legitimate, even in face of everything I’ve said here.
But my statement from the beginning still stands. If you encounter
setdefault, raise your alertness – it is might be indirect indicator of code smell. If you find yourself using
setdefault, consider thrice whether it’s really the best solution. And be especially mindful of the erroneous caching pattern that I gave a lengthy description of.
Maybe i don’t see real harmfulness of this method. Usage is so damn obvious and described in documentation.
Of course, all of your points may be valid when your code is hamful and written without reading docs.
If we follow this path, all code written in and based on imaginary documentation should be considered harmful.
For me, “But it is documented!” looks like a fully general counterargument, validating arbitrary atrocious design decisions – up to the mind-boggling examples from recently famous WAT presentation. You have to set a boundary somewhere, so why not opt for API that does not have high chances to confuse casual reader of code where it is used in?
You omited one important example – one that this function was designed for:
It really helps readability, because you can focus on the logic, not the argument handling boilterplate. This would probably be obsolete, if dict.update() returned itself, i.e:
Although, that usually would involve a lot more unnecessary copying.