Python’s setdefault Considered Harmful

2012-01-28 19:26

Python dictionaries have an inconspicuous method named setdefault. Asking for its description, we’ll be presented with rather terse interpretation:

  1. >>> help(dict.setdefault)
  2.  
  3. setdefault(...)
  4.     D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D

While it might not be immediately obvious what we could use this method for, we can actually find quite a lot of applications if we only pay a little attention. The main advantage of setdefault seems to lie in elimination of ifs; a feat we can feel quite smug about. As an example, consider a function which groups list of key-value pairs (with possible duplicate keys) into a dictionary of lists:

  1. def group_assoc_list_(a_list): # with if
  2.     res = {}
  3.     for k, v in a_list:
  4.         if k in res:
  5.             res[k].append(v)
  6.         else:
  7.             res[k] = [v]
  8.     return res
  9.  
  10. def group_assoc_list(a_list): # with setdefault
  11.     res = {}
  12.     for k, v in a_list:
  13.         res.setdefault(k, []).append(v)
  14.     return res

This is something which we could do when parsing query string of an URL, if there weren’t a readily available API for that:

  1. def parse_query_string(qs):
  2.     return [arg.split('=') for arg in qs.split('&')]
  3.  
  4. >>> group_assoc_list(parse_query_string('a=1&b=2&c=3&b=4&a=5'))
  5. {'a': ['1', '5'], 'b': ['2', '4'], 'c': ['3']}

So, with setdefault we can get the same job done more succinctly. It would seem that it is a nifty little function, and something we should keep in mind as part of our toolbox. Let’s remember it and move on, shall we?…

Actually, no. setdefault is really not a good piece of dict‘s interface, and the main reason we should remember about it is mostly for its caveats. Indeed, there are quite a few of them, enough to shrink the space of possible applications to rather tiny size. As a result, we should be cautious whenever we see (or write) setdefault in our own code.

Here’s why.

This is not the setter you are looking for

A small and obvious, yet important detail: the name of setdefault begins with set. To an untrained eye, it is therefore evident that the method is a kind of setter, and that we can expect a certain behavior to occur if we choose to use it. Specifically, we would assume the method will actually set something, not happily ignore everything we pass to it and do nothing.

But this is exactly what setdefault does, in vast majority of cases! For any pair of arguments k and v, the call:

  1. d.setdefault(k, v)

is almost guaranteed to not do anything. It’s only the very first invocation for given k that may actually have any effect – unless, of course, the k is already present in d. Provided we do not remove it, any subsequent call will not really set anything, and this is hardly a behavior we would expect from a setX method.

What should setdefault do, then? Given its name, a more reasonable functionality could revolve around the concept of “default value” for given key, independent from its actual value. It would be returned by [] operator if k were not in the dictionary, instead of KeyError being raised. And any call to setdefault(k, ...) would overwrite the k‘s default value, whether or not it currently has an actual value. This way, setdefault would deliver on its promise of behaving like a setter.
Of course, this whole notion of “default value” is mighty confusing and doesn’t seem useful at all. Even if it was, it could be easily emulated by two dicts working together. In fact, this would be a preferred way of handling such arcane use case; there is simply no room for it in standard library.

Are we really getting it?

As described above, setdefault looks like a deeply flawed version of something which strives to be a setter. But at the same time, it is almost equally eager to pretend to be a getter, for it is often its return value that we are really interested in.

Take another look at the following line in group_assoc_list function:

  1. res.setdefault(k, []).append(v)

Notice how we have a setX function which does not only return some value, but definitely a non-trivial one: list! (We are appending to it, after all.) It is unclear where this complex value really came from, unless we are aware that setdefault is actually a variant of dict.get method. In fact, it has exactly the same interface as get – including the return value, which is either an object stored under given key, or a specific default value. Yes, you’ve heard that correctly: on the surface, setdefault is really the same as get! The only difference is in side effects, which the latter does not produce.

Granted, in group_assoc_list we are vitally interested in those side effects. After all, they build up our dictionary of lists: the very structure this function is about. This might not always be the case, though. Quite often, we will be concerned only with value for particular key, dictionary itself being just incidental.
And this brings us to the next point.

Memoize this!

There is this simple yet powerful – and extremely general – optimization technique, called memoization. The idea behind it is to introduce a layer of “persistent caching” in front of computationally expensive operation. As a result, such operation is performed only once for particular set of input data, with result saved into cache for cheap future access. When we ask for it, the cache is always polled first; if (and only if) result is not found there, it is computed from scratch and then cached (memoized).

What it has to do with dictionaries and setdefault, though?… Well, the common way for implementing memoization is nothing else but a dictionary. Suppose that we have a following expensive function (unary one, for the sake of simplicity):

  1. def computation(arg):
  2.     # (expensive calculations here)

It looks like an easy task to slap a memoization facility over it. And if we indeed use a dictionary, this also seems like a great application of setdefault, for it does the exact operation we want to perform on our cache: check value for given key, and optionally store one. Overall, it could look a lot like this:

  1. _cache = {}
  2.  
  3. def computation(arg):
  4.     return _cache.setdefault(arg, _computation(arg))
  5.  
  6. def _computation(arg):
  7.     # (expensive calculation here)

Hopefully we have profiled our code before and after this supposed optimization, because what we are about to find is no performance gain whatsoever! This is easily explicable by the fact that Python is not a lazy language, as it requires all function arguments to be evaluated prior to the call. What it means is that expesive _computation(arg) has to be calculated always, whether or not arg is present in the _cache. It is only afterwards when setdefault performs its check, and that’s already too late for it to have any positive impact on performance.

So despite a seemingly perfect interface, setdefault is actually ill-suited for memoization or caching. Changing that would require it to be promoted into full-fledged language mechanism, similar to if-else ternary expressions. This is both extremely unlikely and totally unnecessary, given the narrow scope of resulting construct. As for caching, there are plenty of (correct) alternatives, such as the following:

  1. def computation(arg):
  2.     try:
  3.         return _cache[arg]
  4.     except KeyError:
  5.         _cache[arg] = val = _computation(arg)
  6.         return val

The last stand

So we have ruled out caching, but what about our original example? While setdefault may indeed suffer from very poor naming, there is little doubt that in group_assoc_list it is the best tool for the job. Or even more generally, setdefault still looks very useful if we want to make sure that specific keys exist in the dictionary – as in this code:

  1. DEFAULT_CONFIG = { ... }
  2.  
  3. def read_config(config_file):
  4.     cfg = read_config_from_file(config_file)
  5.     for k, v in DEFAULT_CONFIG.iteritems():
  6.         cfg.setdefault(k, v)
  7.     return cfg

But often it’s just a matter of looking at things from different perspective; we can easily stumble on alternative (and arguably better) phrasing:

  1. def read_config(config_file):
  2.     cfg = DEFAULT_CONFIG.copy()
  3.     cfg.update(read_config_from_file(config_file))
  4.     return cfg

Finally, even a seemingly ideal solution might be superseded by a more tailored one. Starting from Python 2.5, we have access to the collections.defaultdict utility class – one that readily provides dictionary with defaults. If we use it instead of the ordinary dict, we can pretend like it had all the values already inserted; defaultdict will just create them on the fly.
Here’s how our group_assoc_list function benefits from using this class:

  1. from collections import defaultdict
  2. def group_assoc_list(a_list):
  3.     res = defaultdict(list)
  4.     for k, v in a_list:
  5.         res[k].append(v)
  6.     return res

Note that defaultdict is still a pretty much dict, so users of above function are unlikely to be surprised by unfamiliar interface.

Final words: be careful

Phew, that was a long writeup, and a summary is in order.

As I’ve shown, the setdefault method is full of surprises and inconsistencies, causing noticeable confusion. Yet it often creeps up in code written by experienced pythonistas. Sometimes the reasons for it look legitimate, even in face of everything I’ve said here.
But my statement from the beginning still stands. If you encounter setdefault, raise your alertness – it is might be indirect indicator of code smell. If you find yourself using setdefault, consider thrice whether it’s really the best solution. And be especially mindful of the erroneous caching pattern that I gave a lengthy description of.



3 comments for post “Python’s setdefault Considered Harmful”.
  1. ED:
    January 29th, 2012 o 4:24

    Maybe i don’t see real harmfulness of this method. Usage is so damn obvious and described in documentation.
    Of course, all of your points may be valid when your code is hamful and written without reading docs.
    If we follow this path, all code written in and based on imaginary documentation should be considered harmful.

  2. Xion:
    January 29th, 2012 o 16:03

    For me, “But it is documented!” looks like a fully general counterargument, validating arbitrary atrocious design decisions – up to the mind-boggling examples from recently famous WAT presentation. You have to set a boundary somewhere, so why not opt for API that does not have high chances to confuse casual reader of code where it is used in?

  3. LQC:
    March 10th, 2012 o 15:46

    You omited one important example – one that this function was designed for:

    1. def do_something(**kwargs):
    2.     kwargs.setdefault("option", True) # Set a default value for option
    3.     # do something
    4.     return continue_chain(**kwargs)

    It really helps readability, because you can focus on the logic, not the argument handling boilterplate. This would probably be obsolete, if dict.update() returned itself, i.e:

    1. kwargs = {"option": True}.update(kwargs)

    Although, that usually would involve a lot more unnecessary copying.

Comments are disabled.
 


© 2017 Karol Kuczmarski "Xion". Layout by Urszulka. Powered by WordPress with QuickLaTeX.com.