It’s pretty much assumed that if you’re writing Python, you are not really concerned with the performance and speed of your code, provided it gets the job done in a sufficiently timely manner. The benefits of using such a high-level language usually outweigh the cons, so we’re at ease with sacrificing some speed in exchange for other qualities. The feasibility of this trade is always relative, though, and depends entirely on the task at hand. Sometimes the ‘sufficiently fast’ bar may be set quite high.
But while some altitudes are clearly beyond Python’s reach – like real-time software on embedded systems – that doesn’t mean it’s impossible to write efficient code. More importantly, it’s almost always possible to write more efficient code than we currently have; the nimble domain of optimization has a subdivision dedicated specifically to Python. And quite surprisingly, performance tuning at this high level of abstraction often proves even more challenging than squeezing nanoseconds out of bare metal.
So today, we’re going to look at some basic principles of optimization and good practices targeted at writing efficient Python code.
Before jumping into specific advice, it’s essential to briefly mention a few standard modules that are indispensable when doing any kind of optimization work.
The first one is timeit, a simple utility for measuring the execution time of snippets of Python code. Using timeit is often one of the easiest ways to confirm (or refute) our suspicions about the insufficient performance of a particular piece of code. It helps us in a straightforward way: by executing the statement in question many times and showing the average, as well as cumulative, time it has taken.
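As a sketch of typical usage (the statement timed here is just an illustrative placeholder, not an example from the article):

```python
import timeit

# Run the snippet many times and report both the cumulative time
# and the average time per execution.
number = 10_000
total = timeit.timeit('"-".join(str(n) for n in range(100))', number=number)

print("cumulative: %.3fs" % total)
print("average:    %.9fs" % (total / number))
```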
As for more detailed analysis, the profile and cProfile modules can be used to gain insight into the CPU time consumed by different parts of our code. Profiling a statement yields some vital data: the number of times any particular function was called, how much time a single call takes on average, and how big the function’s impact is on overall execution time. This is essential information for identifying bottlenecks and thus ensuring that our optimizations are correctly targeted.
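A minimal profiling session might look like this (the `slow_sum` workload is made up for illustration):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive workload to give the profiler something to see.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Dump the per-function statistics (call counts, cumulative and
# per-call times) into a string instead of stdout.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```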
Then there is the dis module: Python’s disassembler. This nifty tool allows us to inspect the instructions that the interpreter is executing, with handy names translated from the actual binary bytecode. Compared to actual assembly, or even code executed on the Java VM, Python’s bytecode is pretty trivial to analyze:
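For example, disassembling a trivial function (the exact opcode names vary a little between Python versions):

```python
import dis
import io

def add(a, b):
    return a + b

# dis.dis() prints one interpreter instruction per line; local
# variable accesses show up as LOAD_FAST instructions.
buffer = io.StringIO()
dis.dis(add, file=buffer)
listing = buffer.getvalue()
print(listing)
```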
Getting familiar with it proves to be very useful, though, as eliminating slow instructions in favor of more efficient ones is a fundamental (and effective) optimization technique.
Starting on the specifics, let’s look at a simple problem. What is the fastest way of converting a
float number into a string?… Here are several obvious (and some non-obvious) candidates:
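The original listing hasn’t survived in this copy; a plausible set of candidates might look like this:

```python
x = 42.0

s1 = str(x)           # the "obvious" explicit typecast
s2 = x.__str__()      # invoking the internals directly
s3 = "%f" % x         # old-style string formatting
s4 = "{}".format(x)   # str.format, for completeness

print(s1, s2, s3, s4)
```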
It might seem weird to even consider anything besides the explicit typecast to
str. However, one could think that invoking Python’s internals directly – which calling the
__str__ method seems to be – may be a net performance gain. But including the various string formatting routines seems like going way too far. Between parsing the format string, processing arguments and putting the resulting string together, they just have to be significantly slower than anything else… right? Well, let’s check it:
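The measurements themselves are machine-dependent, so rather than quote numbers, here is a sketch of how such a contest could be run with timeit:

```python
import timeit

candidates = ['str(x)', 'x.__str__()', '"%f" % x', '"{}".format(x)']
timings = {}
for stmt in candidates:
    # One million runs each; absolute numbers will differ between
    # machines and Python versions.
    timings[stmt] = timeit.timeit(stmt, setup="x = 42.0", number=1_000_000)

for stmt, t in sorted(timings.items(), key=lambda item: item[1]):
    print("%-16s %.3fs" % (stmt, t))
```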
Whoa, that was unexpected! Not only is formatting faster, it beats the “obvious” solution by an order of magnitude. Such a great discrepancy must indicate that we missed some important, hidden factor which also contributes significantly to overall performance (or lack thereof).
This unknown factor is the cost of looking up non-local symbols such as
str. The Python interpreter stores those in various hash tables which are roughly equivalent to dictionaries (
dicts) in the language itself. While certainly not slow in the absolute sense, these data structures incur noticeable overhead on a ubiquitous operation that we wouldn’t normally consider a potential bottleneck. But alas, it’s there, and it easily explains why string formatting (
"%d" % 42) wins our little speed contest: it involves no symbol lookup at all.
Unfortunately, programming with only operators and inline constants might be a little too constraining ;) There are other ways to avoid expensive lookups, though, and the easiest one is to use local names as much as possible. Names declared locally – be it variables, functions or even classes – are stored within the current stack frame, so the interpreter can access them via an offset, much like a compiled program accesses locals relative to the frame pointer (
EBP in x86 assembly). No strings have to be hashed, either, resulting in performance characteristics similar to indexing a list.
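We can see the difference directly in the disassembly. A small sketch (the function names are made up; opcode names may differ slightly across versions):

```python
import dis
import io

def use_global():
    return len("spam")          # `len` is a global here

def use_local(len=len):
    return len("spam")          # `len` is now a local name

def disassemble(func):
    buffer = io.StringIO()
    dis.dis(func, file=buffer)
    return buffer.getvalue()

print(disassemble(use_global))  # contains a LOAD_GLOBAL instruction
print(disassemble(use_local))   # the lookup becomes LOAD_FAST
```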
To make them easy to spot while looking for performance tweaks, the opcodes for local lookups shown by
dis end in
_FAST: LOAD_FAST, STORE_FAST and so on. The slower, hash-based lookups bear names like
LOAD_GLOBAL or LOAD_ATTR. When looking for some easy optimizations, you should definitely examine your tight loops and try to eliminate lookups which aren’t of the
*_FAST variety. Often this means caching globals – including built-in functions – under local names, through what looks like a series of redundant default arguments:
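A minimal sketch of the trick (`total_length` and `_len` are illustrative names):

```python
# Caching the built-in `len` as a default argument value. Inside the
# loop the lookup is then a cheap local one (LOAD_FAST) instead of a
# global/built-in one.
def total_length(strings, _len=len):
    total = 0
    for s in strings:
        total += _len(s)
    return total

print(total_length(["spam", "eggs"]))  # → 8
```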
It works because default argument values are stored within the function object – the source of an infamous gotcha when they’re mutable. Here, however, we can leverage that to achieve performance benefits.
Since we have indirectly touched on the topic of optimizing loops, let’s tarry here a little longer. In compiled languages such as C or C++, a well-thought-out loop is often the most efficient solution, compared e.g. to using higher-order functions like
std::transform. Loops don’t introduce much overhead (unlike repeated function calls) and are amenable to various compiler optimizations: hoisting constant expressions, reducing the number of operations per cycle, and so on.
But Python is not C. If we have some experience with the latter, we must be careful not to apply its best practices automatically when writing Python code. This, for example, is not the most efficient way of populating a list:
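The offending snippet isn’t preserved in this copy; the usual culprit is an explicit append loop, with a list comprehension as the idiomatic (and typically faster) alternative:

```python
# C-style: an explicit loop with a repeated method call per element.
def squares_loop(n):
    result = []
    for i in range(n):
        result.append(i * i)
    return result

# Pythonic: a list comprehension, which avoids the attribute lookup
# and function call on every iteration.
def squares_comprehension(n):
    return [i * i for i in range(n)]

assert squares_loop(10) == squares_comprehension(10)
```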
and observe what happens when we apply it to a simple function (without
Except in the unlikely case when the caller does it on its own, the Python interpreter has to first pack the actual arguments into the
args tuple and/or
kwargs dictionary, because the
wrapped function requires them in this form. But after it’s done with them and is ready to call the original
func, those arguments have to be unpacked back into separate values.
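A minimal pass-through decorator makes the round trip visible (the `log_calls` name is just for illustration):

```python
import functools

def log_calls(func):
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        # At this point the interpreter has packed the positional
        # arguments into the `args` tuple and the keyword arguments
        # into the `kwargs` dict...
        print("calling %s" % func.__name__)
        # ...and here they are unpacked again into separate values.
        return func(*args, **kwargs)
    return wrapped

@log_calls
def add(a, b):
    return a + b

result = add(1, b=2)  # prints "calling add"; result == 3
```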
The cost of both operations is of course negligible for decorators like
@login_required. However, if we’re just logging calls, checking argument types or doing anything else with similarly small overhead, the added cost of (un)packing may constitute a significant part of the total. It’s unlikely to be a deal breaker on its own, but small things such as these tend to pile up over time.
This is Python Optimization 101, so we’ve obviously not explored the full realm of possibilities. I must nevertheless mention two “last resort”, or “nuclear”, options when it comes to optimizing performance – mostly because they are very tempting, promising wonderful results, or are viewed as low-hanging fruit that yields noticeable improvements without much effort.
For many, the obvious solution to performance problems is to go down the abstraction ladder and reimplement critical code in a low-level language. For Python, that language is almost always C, because their mutual interoperability is truly marvelous. Remember, however, that including C bits in your Python project will significantly increase its complexity, if only for the sole fact of adding one more language to the mix. Coding in C also demands a somewhat different approach and skill set than writing Python, which may affect the overall maintainability of your project in the long term.
Finally, you may want to be really sure there are no easier alternatives available. This is especially applicable to any kind of number crunching, given the presence of reliable and efficient libraries such as NumPy.
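For instance, summing a million squares with a vectorized NumPy expression (assuming NumPy is installed) instead of a pure-Python loop:

```python
import numpy as np

n = 1_000_000

# Pure-Python version: one interpreted iteration per element.
python_result = sum(i * i for i in range(n))

# NumPy version: the same arithmetic, but the loop runs in compiled code.
values = np.arange(n, dtype=np.int64)
numpy_result = int((values * values).sum())

assert python_result == numpy_result
```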
The other universal option is attempting blanket optimization by switching to PyPy. Benchmarks show performance gains averaging around 5x, but they vary quite substantially between particular applications. PyPy also has some compatibility issues, especially when it comes to libraries with C extensions.
The general point is that while the performance traits of our program running on PyPy will most likely be better, they are also bound to be different. It’s not very likely, but we might fix our most glaring bottlenecks and simultaneously create new ones because of PyPy-related trade-offs. In the end, we might still have to perform the targeted optimizations we had wished to avoid.
Surprisingly, the ‘net isn’t exactly full of tips and techniques for Python optimization. Aside from a somewhat dated guide on the official wiki, you can find some helpful bits on StackOverflow, but not much beyond that.
One exception is a very insightful case study published on Dropbox’s tech blog. It describes taking a rather simple algorithm (HTML escaping) through a series of optimization attempts, recording the results and drawing invaluable conclusions. I highly recommend it to all Python coders.
If you know any other articles that delve into the topic of optimization, be sure to mention them in the comments!