Python dictionaries have an inconspicuous method named setdefault. Asking for its description, we’ll be presented with a rather terse explanation:
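The exact wording depends on the Python version, so rather than quote it, here is a small sketch that prints whatever docstring the running interpreter ships and then checks the two behaviors it describes. The dictionary d and its contents are made up for illustration.

```python
# Print the built-in description; the text differs between Python versions.
print(dict.setdefault.__doc__)

d = {"a": 1}

# Key already present: the stored value is returned, nothing is changed.
existing = d.setdefault("a", 100)

# Key missing: the default is inserted and then returned.
inserted = d.setdefault("b", 2)
```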
While it might not be immediately obvious what we could use this method for, we can actually find quite a lot of applications if we pay a little attention. The main advantage of setdefault seems to lie in the elimination of ifs, a feat we can feel quite smug about. As an example, consider a function which groups a list of key-value pairs (with possibly duplicate keys) into a dictionary of lists:
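A minimal sketch of such a function, written twice for comparison: once with an explicit if, once with setdefault. The name group_assoc_list is the one used later in the text; the with_if variant, the parameter name pairs and the sample data are assumptions made for illustration.

```python
def group_assoc_list_with_if(pairs):
    """Group (key, value) pairs into a dict of lists, using an explicit if."""
    result = {}
    for key, value in pairs:
        if key not in result:
            result[key] = []
        result[key].append(value)
    return result


def group_assoc_list(pairs):
    """Same grouping, with setdefault replacing the if."""
    result = {}
    for key, value in pairs:
        # Returns the list stored under key, inserting a new one if needed.
        result.setdefault(key, []).append(value)
    return result


pairs = [("a", 1), ("b", 2), ("a", 3)]
grouped = group_assoc_list(pairs)  # {'a': [1, 3], 'b': [2]}
```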
This is something which we could do when parsing the query string of a URL, if there weren’t a readily available API for that:
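For reference, the standard library function that does this is parse_qs. In Python 3 it lives in urllib.parse (in Python 2 it was in the urlparse module). The query string below is a made-up example.

```python
from urllib.parse import parse_qs

query = "tag=python&page=2&tag=dict"

# Repeated keys are collected into a list, just like in our grouping function.
params = parse_qs(query)  # {'tag': ['python', 'dict'], 'page': ['2']}
```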
So, with setdefault we can get the same job done more succinctly. It would seem that it is a nifty little function, and something we should keep in mind as part of our toolbox. Let’s remember it and move on, shall we?…
Actually, no. setdefault is really not a good piece of dict’s interface, and the main reason we should remember it is for its caveats. Indeed, there are quite a few of them, enough to shrink the space of possible applications to a rather tiny size. As a result, we should be cautious whenever we see (or write) setdefault in our own code.
Here’s why.
A small and obvious, yet important detail: the name of setdefault begins with set. To an untrained eye, it is therefore evident that the method is a kind of setter, and that we can expect a certain behavior to occur if we choose to use it. Specifically, we would assume the method will actually set something, not happily ignore everything we pass to it and do nothing.
But this is exactly what setdefault does, in the vast majority of cases! For any pair of arguments k and v, the call d.setdefault(k, v) is almost guaranteed to not do anything. It’s only the very first invocation for a given k that may actually have any effect, and even that one does nothing if k is already present in d. Provided we do not remove it, any subsequent call will not really set anything, and this is hardly the behavior we would expect from a setX method.
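A quick sketch of this behavior, with made-up keys and values:

```python
d = {}
d.setdefault("k", 1)  # first call: "k" is missing, so 1 is stored
d.setdefault("k", 2)  # "k" is present: the call is silently ignored
d.setdefault("k", 3)  # ignored as well
after_calls = dict(d)  # {'k': 1}

# If the key was there from the start, even the first call does nothing.
present = {"k": "original"}
present.setdefault("k", "new")
```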
What should setdefault do, then? Given its name, a more reasonable functionality could revolve around the concept of a “default value” for a given key, independent of its actual value. It would be returned by the [] operator if k were not in the dictionary, instead of a KeyError being raised. And any call to setdefault(k, ...) would overwrite k’s default value, whether or not it currently has an actual value. This way, setdefault would deliver on its promise of behaving like a setter.
Of course, this whole notion of a “default value” is mighty confusing and doesn’t seem useful at all. Even if it were, it could be easily emulated by two dicts working together. In fact, this would be the preferred way of handling such an arcane use case; there is simply no room for it in the standard library.
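To make the “two dicts working together” idea concrete, here is a hypothetical sketch. The class DictWithDefaults and its attribute names are invented for illustration and are not part of any library.

```python
class DictWithDefaults:
    """Hypothetical sketch: per-key defaults kept apart from actual values."""

    def __init__(self):
        self._values = {}
        self._defaults = {}

    def setdefault(self, key, default):
        # A true setter: always overwrites the key's default value.
        self._defaults[key] = default

    def __setitem__(self, key, value):
        self._values[key] = value

    def __getitem__(self, key):
        if key in self._values:
            return self._values[key]
        # Raises KeyError when the key has neither a value nor a default.
        return self._defaults[key]


d = DictWithDefaults()
d.setdefault("color", "red")
first = d["color"]   # 'red': no actual value, so the default is returned
d.setdefault("color", "blue")
second = d["color"]  # 'blue': the default was overwritten
d["color"] = "green"
d.setdefault("color", "black")
third = d["color"]   # 'green': the actual value wins over any default
```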
As described above, setdefault looks like a deeply flawed version of something which strives to be a setter. But at the same time, it is almost equally eager to pretend to be a getter, for it is often its return value that we are really interested in.
Take another look at the following line in the group_assoc_list function:
Notice how we have a setX function which not only returns some value, but definitely a non-trivial one: a list! (We are appending to it, after all.) It is unclear where this complex value really came from, unless we are aware that setdefault is actually a variant of the dict.get method. In fact, it has exactly the same interface as get, including the return value, which is either the object stored under the given key, or a specific default value. Yes, you’ve heard that correctly: on the surface, setdefault is really the same as get! The only difference is in side effects, which the latter does not produce.
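The similarity, and the one difference, can be seen in a short sketch (the key "x" and the variable names are arbitrary):

```python
d = {}

# get returns the default but leaves the dictionary untouched.
from_get = d.get("x", [])
after_get = dict(d)  # {}

# setdefault returns the default too, but also stores it under the key.
from_setdefault = d.setdefault("x", [])

# The returned list is the very object now held by the dictionary.
from_setdefault.append(1)
```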
Granted, in group_assoc_list we are vitally interested in those side effects. After all, they build up our dictionary of lists: the very structure this function is about. This might not always be the case, though. Quite often, we will be concerned only with the value for a particular key, the dictionary itself being just incidental.
And this brings us to the next point.
There is this simple yet powerful (and extremely general) optimization technique called memoization. The idea behind it is to introduce a layer of “persistent caching” in front of a computationally expensive operation. As a result, such an operation is performed only once for a particular set of input data, with the result saved into the cache for cheap future access. When we ask for it, the cache is always polled first; if (and only if) the result is not found there, it is computed from scratch and then cached (memoized).
What does it have to do with dictionaries and setdefault, though?… Well, a common way of implementing memoization is nothing else but a dictionary. Suppose that we have the following expensive function (a unary one, for the sake of simplicity):
It looks like an easy task to slap a memoization facility over it. And if we indeed use a dictionary, this also seems like a great application of setdefault, for it does the exact operation we want to perform on our cache: check the value for a given key, and optionally store one. Overall, it could look a lot like this:
Hopefully we have profiled our code before and after this supposed optimization, because what we are about to find is no performance gain whatsoever! This is easily explained by the fact that Python is not a lazy language: it requires all function arguments to be evaluated prior to the call. This means that the expensive _computation(arg) always has to be calculated, whether or not arg is present in the _cache. It is only afterwards that setdefault performs its check, and that’s already too late for it to have any positive impact on performance.
So despite a seemingly perfect interface, setdefault is actually ill-suited for memoization or caching. Changing that would require it to be promoted into a full-fledged language mechanism, similar to if-else ternary expressions. This is both extremely unlikely and totally unnecessary, given the narrow scope of the resulting construct. As for caching, there are plenty of (correct) alternatives, such as the following:
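Two possible sketches: an explicit membership check, and functools.lru_cache (available in Python 3.2 and later). As before, the squaring stand-in, the call_log list and the function names other than _computation and _cache are assumptions for illustration.

```python
from functools import lru_cache

call_log = []  # records every time the "expensive" function really runs


def _computation(arg):
    call_log.append(arg)
    return arg * arg  # stand-in for something genuinely expensive


_cache = {}


def computation(arg):
    # Check first; compute only on a cache miss.
    if arg not in _cache:
        _cache[arg] = _computation(arg)
    return _cache[arg]


@lru_cache(maxsize=None)
def computation_lru(arg):
    # The decorator keeps its own cache keyed by the arguments.
    return _computation(arg)


results = [computation(4) for _ in range(3)]
results_lru = [computation_lru(5) for _ in range(3)]
```

Here call_log ends up with just two entries, one per distinct argument.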
So we have ruled out caching, but what about our original example? While setdefault may indeed suffer from very poor naming, there is little doubt that in group_assoc_list it is the best tool for the job. Or even more generally, setdefault still looks very useful if we want to make sure that specific keys exist in the dictionary, as in this code:
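A sketch of what such code might look like; the options dictionary and its keys are hypothetical.

```python
options = {"port": 8080}

# Make sure each key exists, without clobbering values already supplied.
options.setdefault("host", "localhost")
options.setdefault("port", 80)
options.setdefault("debug", False)
```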
But often it’s just a matter of looking at things from a different perspective; we can easily stumble on an alternative (and arguably better) phrasing:
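One possible rephrasing, offered as a sketch rather than the only option: start from a dictionary of defaults and let the supplied values override it. The dictionaries and their keys are hypothetical.

```python
defaults = {"host": "localhost", "port": 80, "debug": False}
options = {"port": 8080}

# Copy the defaults, then overlay whatever was actually supplied.
merged = dict(defaults)
merged.update(options)
```

Note that this builds a new dictionary instead of mutating options in place, which may or may not be what the calling code needs.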
Finally, even a seemingly ideal solution might be superseded by a more tailored one. Starting from Python 2.5, we have access to the collections.defaultdict utility class, one that readily provides a dictionary with defaults. If we use it instead of the ordinary dict, we can pretend it has all the values already inserted; defaultdict will just create them on the fly.
Here’s how our group_assoc_list function benefits from using this class:
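A minimal sketch of that version; the parameter name pairs and the sample data are assumptions.

```python
from collections import defaultdict


def group_assoc_list(pairs):
    """Group (key, value) pairs into a dictionary of lists."""
    result = defaultdict(list)  # missing keys start out as empty lists
    for key, value in pairs:
        result[key].append(value)
    return result


grouped = group_assoc_list([("a", 1), ("b", 2), ("a", 3)])
```

One thing to keep in mind: merely reading a missing key with [] from a defaultdict inserts it, so callers who want a lookup without side effects should use get or the in operator.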
Note that defaultdict is still pretty much a dict, so users of the above function are unlikely to be surprised by an unfamiliar interface.
Phew, that was a long writeup, and a summary is in order.
As I’ve shown, the setdefault method is full of surprises and inconsistencies, causing noticeable confusion. Yet it often crops up in code written by experienced pythonistas. Sometimes the reasons for it look legitimate, even in the face of everything I’ve said here.
But my statement from the beginning still stands. If you encounter setdefault, raise your alertness: it might be an indirect indicator of a code smell. If you find yourself using setdefault, consider thrice whether it’s really the best solution. And be especially mindful of the erroneous caching pattern that I gave a lengthy description of.
Maybe I don’t see the real harmfulness of this method. Its usage is so damn obvious and described in the documentation.
Of course, all of your points may be valid when your code is harmful and written without reading the docs.
If we follow this path, all code written in and based on imaginary documentation should be considered harmful.
For me, “But it is documented!” looks like a fully general counterargument, validating arbitrarily atrocious design decisions, up to the mind-boggling examples from the recently famous WAT presentation. You have to set a boundary somewhere, so why not opt for an API that is unlikely to confuse a casual reader of the code it is used in?
You omitted one important example, the one that this function was designed for:
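A sketch of the kind of use presumably meant here, filling in defaults for keyword arguments. The function connect and its option names are hypothetical.

```python
def connect(host, **kwargs):
    # Fill in defaults only for options the caller did not supply.
    kwargs.setdefault("port", 80)
    kwargs.setdefault("timeout", 30)
    # A real function would go on to use the options; here we just return them.
    return host, kwargs


host, options = connect("example.org", timeout=5)
```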
It really helps readability, because you can focus on the logic, not the argument handling boilerplate. This would probably be obsolete if dict.update() returned itself, i.e.:
Although, that usually would involve a lot more unnecessary copying.