Python Optimization 101

2012-07-14 21:17

It’s pretty much assumed that if you’re writing Python, you are not really concerned with the performance and speed of your code, provided it gets the job done in sufficiently timely manner. The benefits of using such a high level language usually outweigh the cons, so we’re at ease with sacrificing some of the speed in exchange for other qualities. The feasibility of this trade is always relative, though, and depends entirely on the tasks at hand. Sometimes the ‘sufficiently fast’ bar might be hung quite high up.

But while some attitudes are clearly beyond Python’s reach – like real-time software on embedded systems – it doesn’t mean it’s impossible to write efficient code. More importantly, it’s almost always possible to write more efficient code than we currently have; the nimble domain of optimization has its subdivision dedicated specifically to Python. And quite surprisingly, performance-tuning at this high level of abstraction often proves to be even more challenging than squeezing nanoseconds out of bare metal.

So today, we’re going to look at some basic principles of optimization and good practices targeted at writing efficient Python code.

Tools of the trade

Before jumping into specific advice, it’s essential to briefly mention few standard modules that are indispensable when doing any kind of optimization work.

The first one is timeit, a simple utility for measuring execution time of snippets of Python code. Using timeit is often one of the easiest way to confirm (or refute) our suspicions about insufficient performance of particular piece of code. timeit helps us in straightforward way: by executing the statement in questions many times and showing average, as well as cumulative, time it has taken.

As for more detailed analysis, the profile and cProfile modules can be used to gain insight on CPU time consumed by different parts of our code. Profiling a statement will yield us some vital data about number of times that any particular function was called, how much time a single call takes on average and how big is the function’s impact on overall execution time. These are the essential information for identifying bottlenecks and therefore ensuring that our optimizations are correctly targeted.

Then there is the dis module: Python’s disassembler. This nifty tool allows us to inspect instructions that the interpreter is executing, with handy names translated from actual binary bytecode. Compared to actual assemblers or even code executed on Java VM, Python’s bytecode is pretty trivial to analyze:

  1. >>> dis.dis(lambda x: x + 1)
  2.   1           0 LOAD_FAST                0 (x)
  3.               3 LOAD_CONST               1 (1)
  4.               6 BINARY_ADD          
  5.               7 RETURN_VALUE

Getting familiar with it proves to be very useful, though, as eliminating slow instructions in favor of more efficient ones is a fundamental (and effective) optimization technique.

Local is better than global

Starting on the specifics, let’s look at a simple problem. What is the fastest way of converting int or float number into string?… Here’s several obvious (and some non-obvious) candidates:

  1. >>> str(42)
  2. >>> "%d" % 42
  3. >>> %s" % 42
  4. >>> "{0}".format(42)
  5. >>> (42).__str__()
  6. >>> int.__str__(42)

It might be weird to even consider anything besides explicit typecast to str. However, one could think that invoking Python internals directly – which __str__ method seems to be – may be a net performance gain. But including various string formatting routines seems like going way too far. Between parsing the format string, processing arguments and putting the resulting string together, they just have to be significantly slower than anything else… right? Well, let’s check it:

  1. >>> from timeit import Timer
  2. >>> Timer("str(42)").timeit()
  3. 0.4153470993041992
  4. >>> Timer("'%d' % 42").timeit()
  5. 0.04658198356628418

Whoa, that was unexpected! Not only formatting is faster, but it beats the “obvious” solution by order of magnitude. Such great discrepancy must indicate that we missed some important, hidden factor which also contributes significantly to overall performance (or lack thereof).

This unknown factor is the cost of lookup for non-local symbols such as str. Python interpreter stores those in various hash tables which are roughly equivalent to dictionaries (dicts) in the language itself. While certainly not slow in the absolute sense, these data structures incur noticeable overhead on ubiquitous operation that we wouldn’t normally consider as potential bottleneck. But alas, it’s there, and this can easily explain why string formatting ("%d" % 42) wins our little speed contest: it has no symbol lookup at all.

Unfortunately, programming with only operators and inline constants might be a little too constraining ;) There are other ways to avoid expensive lookups, though, and the easiest one is to use local names as much as possible. Names declared locally – be it variables, functions or even classes – are stored within current stack frame, so the interpreter can access them via offset relative to stack pointer (equivalent of EBP from x86 assembly). No strings have to be hashed, too, resulting in performance characteristics that are similar to indexing a list.

To make it more obvious while looking for performance tweaks, the opcodes for local lookups in dis() end in _FAST, .e.g LOAD_FAST. The slower, hash-based lookups are named LOAD_GLOBAL, LOAD_ATTR, STORE_ATTR, and so on. When looking for some easy optimizations, you should definitely examine your tight loops and try to eliminate lookups which aren’t of the *_FAST type. Often this means caching globals – including built-in functions – under local names through what seems like a series of redundant assignments:

  1. from math import sqrt
  3. def nth_prime(n):
  4.     if n < 3:
  5.         return &#91;2, 3, 5][n]
  7.     _xrange = xrange ; _int = int ; _sqrt = sqrt
  8.     i = 3 ; p = 7
  9.     while True:
  10.         for x in _xrange(2, _int(_sqrt(p)) + 1):
  11.             if p % x == 0:
  12.                 break
  13.         else:
  14.             if i == n:
  15.                 return p
  16.             i += 1
  17.         p += 1&#91;/python]
  18. A more sophisticated but uglier technique involves function arguments (which are also <code>*_FAST</code>) and their default values. This eliminates the cost of assignments at the expense of polluting function signature with implementation details:
  19. [python]def nth_prime(n, xrange=xrange, int=int, sqrt=sqrt):
  20.     # ...

It works because default values are stored within the function object, leading to infamous gotcha if they’re mutable. Here, however, we can leverage them to achieve performance benefits.

Loop smartly

Since we have indirectly touched the topic of optimizing loops, let’s tarry here a little bit longer. In compiled languages such as C or C++, a well thought-out loop is often the most efficient solution, compared e.g. to using higher-order functions like std::transform. Loops don’t introduce much overhead (unlike repeated function calls) and are prone to various compiler optimizations: extracting constant expressions, reducing number of operations per cycle, and so on.

But, Python is not C. If we have some experience with the latter, we must be careful as to not apply its best practices automatically when writing Python code. This, for example, is a not the most efficient way of populating a list:

  1. res = []
  2. i = 0
  3. while i < n:
  4.     res.append(foo(i))
  5.     i += 1&#91;/python]
  6. This one might be a little bit faster:
  7. &#91;python]res = []
  8. for i in xrange(n):
  9.     res.append(foo(i))&#91;/python]
  10. Both, however, cannot really compare to one of these:
  11. &#91;python]res = [foo(i) for i in xrange(n)]
  12. res = map(foo, xrange(n))&#91;/python]
  13. That's because loops are not so wonderful if we're writing them in interpreted, managed language. Every instruction is essentially a round trip to underlying "C code", including something as trivial as incrementing the loop counter. To take advantage of this fact, the logical course of action is to somehow make the <em>whole loop</em> execute within bounds of the interpreter. Incidentally, that is exactly what the last two versions are doing.
  15. <h3>Thou shall not call functions in vain</h3>
  16. Do note, though, that <code>foo</code> in the examples above is better <em>not</em> be taken as to mean an actual Python function, but more like an arbitrary expression. Function calls in Python are costly; in this aspect, the language is not at all different from its faster, compiler-aided siblings.
  18. Specific remarks are applicable to dealing with functions arguments, though. In Python, we have many different ways of handling them, both on caller and callee side. Some of them involve <em>packing</em> and <em>unpacking</em> arguments to and from tuples or dictionaries. Normally, these operations are seamless and invisible to the programmer, but they nevertheless require some additional effort and time.
  20. That expenditure is avoidable if we take care to maintain consistency between function definition and invocation. Is it a simple subroutine with constant number of arguments? Call it as <code>foo(x, y)</code> rather than <code>foo(*args)</code>. You want callers to optionally supply additional parameters, and you know exactly what those might be? Spell them out explicitly as <code>def bar(a, b=3, c='forb'):</code> instead of <code>def bar(a, **kwargs):</code>.
  22. Also: avoid (over)use of decorators. I say this reluctantly, because I consider decorators as one of the most beautiful and powerful language constructs (not only in Python), but they aren't exactly cheap. Look at the no-op decorator:
  23. [python]import functools
  25. def noop(func):
  26.    @functools.wraps(func)
  27.    def wrapped(*args, **kwargs):
  28.        return func(*args, **kwargs)
  29.    return wrapped

and observe what happens when we apply it to a simple function (without *args or **kwargs):

  1. @noop
  2. def next_multiple(of, after):
  3.     x = after + 1
  4.     while x % of != 0:
  5.         x += 1
  6.     return x
  8. next_multiple(of=7, after=nth_prime(12))  # 42!

Except for unlikely case when caller does it on its own, Python interpreter has to first pack actual arguments into args tuple and/or kwargs dictionary, because wrapped function require them in this form. But after it’s done with them and ready to call original func, those arguments have to be unpacked back into separate values.
The cost of both operations is of course negligible for decorators like @memcached or @login_required. However, if we’re just logging calls, checking argument types or doing anything with similarly small overhead, the added cost of (un)packing might count as significant part of this “bonus”. It’s unlikely to be a deal breaker on its own, but small things such as these tend to pile up over time.

If all fails…

This is Python Optimization 101, so we’ve obviously not fully explored the realm of possibilities, but I must nevertheless mention two “last resort” or “nuclear” options when it comes to optimizing performance. Mostly because they are very tempting by promising wonderful results, or viewed as low hanging fruit that yields noticeable improvements without much effort.

For many, the obvious solution to performance problems is to go down the abstraction ladder and reimplement critical code in low level language. For Python, that language is almost always C because their mutual interoperability is truly marvelous. Remember, however, that including C bits in your Python project will significantly increase its complexity, if only for the sole fact of adding one more language to the mix. Coding in C also demands somewhat different approach and skill set than writing Python, which may affect overall maintainability of your project in the long term.
Finally, you may want to be really sure there are no easier alternatives available. This is especially applicable for any kind of number crunching in presence of reliable and efficient libraries such as NumPy.

The other universal option is attempting to do blanket optimization by switching to PyPy. Benchmarks show performance gains to average around 5x, but they vary quite substantially between particular applications. PyPy has also some compatibility issues, especially when it comes to libraries with C extensions.
The general point is that while performance traits of our program running on PyPy will most likely be better, they are also bound to be different. While not very likely, we can address our most glaring bottlenecks and, simultaneously, create new ones because of PyPy-related trade-offs. In the end, we might still have to perform more targeted optimizations that we wished to avoid.

Further reading

Surprisingly, the ‘net isn’t exactly full of tips and techniques of Python optimization. Aside from somewhat dated guide on official wiki, you can find some helpful bits on StackOverflow but not very much beyond that.

One exception is a very insightful case study published on Dropbox’ tech blog. It describes taking rather simple algorithm (HTML escaping) through series of optimization attempts, recording results and drawing invaluable conclusions. I highly recommend it to all Python coders.

If you know any other articles the delve on the topic of optimization, be sure to mention them in comments!

Tags: , , ,
Author: Xion, posted under Computer Science & IT »

2 comments for post “Python Optimization 101”.
  1. Kos:
    July 22nd, 2012 o 23:13

    Imagination challenge: A tool that would make the integration of Python and C less a trouble in terms of maintenance. :)

    Cython is an interesting direction.

Comments are disabled.

© 2017 Karol Kuczmarski "Xion". Layout by Urszulka. Powered by WordPress with