There is a particular way many programmers tend to think about “algorithms”. I put the word in quotes because it’s a really just a small subset of actual algorithms that they refer to by this term.
In short, “algorithms” are perceived as self-contained boxes, classic procedures that take data in and spit results out. They are blocks, or packages, stowed in a corner of some bigger program, coming out only to perform they designated function. Clearly separated from the “usual” code, they may even occur only as libraries, both standard language libraries or third party extensions.
What’s important is being out of sight, out of mind. You are not supposed to delve into them. You can use them, or add new ones – sometimes even implementing them yourself! – but once you’re done, you get back to “real” code. Normal code, typical code, usual code; code that doesn’t have those “algorithms”.
Amusing as it is, this mode of thinking may actually be spurred by reading works such as the classic Introduction to Algorithms by Cormen et al. Since the subject is intrinsically quite heavy and full of intricate theory and math, it’s very tempting to skirt its surface. Sure, you need to know about most of the important algorithms, but you don’t really have to know them. It’s perfectly acceptable to skip the details of their implementation.
And why not? You can always go back and look it all up! Most of the important stuff is already implemented anyway. When was the last time you had to roll out your own
sort() function?… Great programmers reuse, haven’t you heard? We should all stand on the shoulders of giants!
Well, that’s lovely. But I’ve got some bad news. There are no “algorithms”. You cannot compartmentalize programs to have demarcated “difficult” parts. You can try, of course, but it’s futile. They won’t comply, because the pesky reality itself will inevitably tear down any walls you choose to surround the “algorithms” with.
Let’s say you have your isolated, tricky procedure. Over the lifetime of a program that contains it, any of the following is very likely to happen:
Many, many other situations could be added to this list, but the point should nevertheless be clear. It doesn’t matter how tightly you’re going to shut the tricky procedure off the rest of the code, the ongoing evolution of the latter will eventually effect changes on it. After a while, you won’t be dealing with a single algorithm, but a whole subsystem with multiple tendrils touching other parts of your program.
At this point, you can’t pretend it can be confined to a dark, forgotten place anymore. Now it has to be taken into account when dealing with the “normal” code on daily basis.
And it’s nothing to fear of, because it simply is just a normal code.
What is your first association evoked by a mention of the Java programming language? If you said “slow”, I hope all is well back there in the 90s! But if you said “verbose” instead, you would be very much in the right. Its wordiness is almost proverbial by now, with only some small improvements – like lambdas – emerging at the distant horizon.
Sounds good? Sure it does. The only problem here is that most of those optimistic claims are patently false. Good real-world code written in high level, dynamic language does not necessarily end up being much shorter. Quite often, it is either on par or even longer than the equivalent source in Java et al.
It happens for several reasons, most of which are not immediately obvious looking only at a Hello World-like samples.
It is said Python is pretty much pseudocode that happens to be syntactically strict enough to parse and run. While usually meant as a compliment, this observation has an uglier counterbalance. Being a pseudocode, the raw Python source is woefully incomplete when it comes to providing necessary information for future maintainers. As a result, it gets out of hand rather quickly.
To rein it, we need to put that information back somehow. Here’s where the documentation and commenting part steps in. In case of Python, there is even a syntactical distinction on the language level between both. The former takes the form of docstrings attached to function and class definitions. They are meant to fill in the gaps those definitions always leave open, including such “trivialities” as what kind of arguments the function actually takes, what it returns, and what exceptions it may raise.
Even when you don’t write expansive prose with frequent usage examples and such, an adequate amount of docstrings will still take noticeable space. The reward you’ll get for your efforts won’t be impressive, too: you’ll just add the information you’d include anyway if you were writing the code in a static language.
The catch here is that you won’t even get the automatic assurance that you’ve added the information correctly… which brings us immediately to the next point.
There are two types of programmers: those who write tests, and those who will be writing tests. Undoubtedly, the best way to have someone join the first group is to make them write serious code in any interpreted, dynamic language.
In those languages, a comprehensive set of tests is probably the only automated and unambiguous way to ensure your code is satisfying even some basic invariants. Whole classes of errors that elsewhere would be eliminated by the compiler can slip completely undetected if the particular execution path is not exercised regularly. This includes trying to invoke a non-existent method; to pass an unexpected argument to a function; call a “function” which is not callable (or conversely, not call a function when it should have been); and many others. To my knowledge, none of these are normally detected by the various tools for static analysis (linting).
So, you write tests. You write so many tests, in fact, that they easily outmatch the very code they’re testing – if not necessarily in effort, then very likely in quantity. There’s no middle ground here: either you blanket your code with good tests coverage, or you can expect it to break in all kinds of silly ways.
Documentation, comments and tests take time and space. But at least the code in high level language can be short and sweet. Beautiful, elegant, crisp and concise, without any kind of unnecessary cruft!
Just why are you staring at me like that when I say I used a metaclass here?…
See, the problem with powerful and highly expressive languages is that they are dangerous. The old adage of power and responsibility does very much apply here. Without a proper harness, you will one day find out that:
Unless you can guarantee you’ll always have only real Perl/Python/Ruby rockstars on board, it’s necessary to tame that wild creativity at least a little bit. The inevitable side effect is that your code will sometimes have to be longer, at least compared to what that smart but mystifying technique would yield.
Let me introduce you to the following two important functions:
I don’t believe, of course, that this is the first time you may have encountered them. Nor that they are even half as complicated as the definitions above would suggest.
But although they appear rather awkward, what I wanted for those formulations to highlight is one particular way of interpreting the min and max functions as they are used in code. You may think it’s easy enough to read them quite literally (“get me the smallest/largest value”), and in many cases you are absolutely correct:
Often though, we don’t want to get the extreme value from a set, list, vector or other collection of unspecified size. Instead, we call the min/max functions on a few arguments that are known and “hard-coded” upfront. In many cases, this is mostly done to spare us from introducing a verbose
if statement, or a more cryptic ternary operator (
Or is it? I was recently surprised when discussing some very simple programming exercise on one of the IRC channels I frequent. Someone pointed out how my proposed solution is quite unusual with its (over)use of the
max function. The task goes somewhere along these lines:
You are presented with a device that has n counters (c0, …, cn-1) and n+1 buttons (b0, …, bn). Every counter ci is incremented whenever a corresponding button bi is pressed. Additionally, pressing the extra button bn sets all the counters to the maximum value displayed on any of them (e.g. [1, 4, 3] → [4, 4, 4]).
Find a way to compute the final state of all counters given a sequence of k buttons presses.
Just by reading the above description and implementing it in the most straightforward way, we can trivially arrive at an algorithm which solves the problem in O(nk) time. And since we need to go through the sequence of button presses at least once, the lower bound for complexity of any other solution is therefore O(k).
My version was a simple improvement over the obvious one, scoring O(n+k) at the cost of maintaining a negligible amount of extra data:
The noteworthy application of
max function was responsible for updating one of those values:
If you look closely, you will notice how the first argument serves as a reference point, while only the second one is the actual ‘input’. What this invocation is saying is essentially “make
max_state at least as big as it was before – and possibly bigger”.
Hardly a groundbreaking insight, eh? This approach, however, allows to rapidly parse even complicated applications of
max. If you encounter, for example, an inlined version of a “clamping” function that is written in the following manner:
I’m pretty sure it will take you at least a few moments to grok what it’s doing. The
min compound is puzzling, and the meaningful arguments (a and b) are disconnected from the function names. Had it had more nesting, you might even have needed to *gasp* count the parentheses.
But we know it’s all about trimming a number to the <a; b> interval. Why not express it in a way that readily highlights this fact?
max(a, ...) means “I want the result to be at least a“.
min(b, ...) means “I want the result to be at most b“.
Making both thresholds into first arguments to their respective functions is just one of those nice small things that very subtly, almost invisibly, will make your code easier to work with.
Have you ever heard the term “architecture” applied to the realm of software? Pretty sure you have. Our discipline, at least on the surface level, seems to borrow a lot of nomenclature from fields related to the construction of buildings. We talk about designs, frameworks, foundations, sometimes even scaffolds or facades.
And, of course, there is this whole convention of referring to programmers as “engineers”. Guess it simply sounds too good as a job title to let it go!
But frivolous as it may be, there is something deeply wrong about drawing analogies between erecting real structures and hacking virtual ones. It’s not even about attributing undue significance to the “mere” activity of sitting behind a screen and hammering at the keyboard; at this point, software development is probably at least as important as civil engineering – perhaps even more so.
The problem is that thinking of developing software as analogous to constructing buildings promotes flawed, unsustainable approaches. What is even more insidious is that they appear perfectly reasonable at first, mature even. It seems intuitively right, after all, to think of software project as something akin to construction project. In both cases you need to satisfy certain requirements, accommodate for interactions with external environment, and fit everything within time and/or budget constrains.
If it seems so right, then why it is so wrong? Well, because the similarity is only apparent, and thus largely superficial.
How common it is to begin construction of a building before its blueprint is finalized? Although few shovels might have tilled the ground here and there, construction engineers generally do not begin laying foundations until the structure’s design is ready. Before even the first bricks form a nascent wall, you can have pretty good idea of how the final thing will look like.
Can you say the same about software project? Certain methodologies (*cough* waterfall) claim to achieve almost the same degree of robustness and predictability, but most of us know how patently ridiculous they are. Not only is the final shape of system unknown when you start writing it, but it is practically unknowable. The design only reveals itself while you are already building, and this is something I’m sure would make the blood of traditional engineers run cold.
There are reasons for those unknowns, however. While real world engineering deals with real world limitations, bits and pixels are often completely unconstrained by those. More than by climate conditions, size of electromagnetic spectrum or the speed of light, software systems are limited by other, existing systems and processes: not necessarily software ones, but often unstable and changing.
As a result, dealing with them can get arbitrarily complex, not to mention that it transitively affects future systems interacting with us.
In comparison, the ground you build upon is quite literally rock solid.
Thankfully, there is one aspect of programming that makes all those obstacles somewhat surmountable. Unlike skeletons of skyscrapers, made of reinforced concrete, code is always modifiable. Regardless of how far you’ve progressed with current implementation, you can change and redesign it. Sure, there is typically a requirement to support outside clients through existing API but rarely, if ever, you are limited in how you’re gonna support them.
Frankly, this is something that I often see people struggling to come to terms with. Maybe because of those construction analogies I’ve mentioned, the structure of a program is often perceived as rigid and almost constant.
You are not allowed to tamper with it. You can only extend it in certain, predefined ways. And you better get it right the first time, as there is no going back. It’s almost as if the code was literally set in stone.
This, however, is a massive misconception. Code is meant to be poked and prodded; extended and removed; modified and reverted; and most importantly, hacked on and refactored. The notion of “technical debt” is a beautiful concept that captures all those mechanics. You don’t want to be in debt which is too large, because the interest rates (maintenance costs) will hinder you in the long run. But you also don’t want to have too small credit, or you won’t be able to make new investments (features) and expand your “business”.
Lately, I was re-evaluating Google App Engine – the cloud computing platform – to see how feasible it would be for one pet project I’ve had in mind. It was pleasantly surprising overall, as the platform improved quite a lot while I wasn’t looking, since about a year and a half ago. Mostly interested in the Python part, I noticed that version 2.7 is now standard, lots of libraries are available out of the box, and it’s possible to use to pretty much any web framework you’d like to, such as Flask or Django.
Still, there are some quirks. App Engine SDK, for example, is a self-contained bundle with a bunch of Python packages that make it possible to run the development server with your app on your local machine. You don’t really “install” it into your Python interpreter, though.
Same goes for any additional, third party libraries your app may need. They must all be deployed along with it, as there is no setup.py or requirements.txt to specify your dependencies in. If you’re used to how e.g. Heroku handles dependencies, GAE’s way will undoubtedly be quite a letdown.
Good news are: you can still make it work sanely. By that I mean using virtualenv for development rather than your global, system-level interpreter, and keeping the code of any third party libraries out of your project’s repository. You may not get quite the same experience of
pip install and
pip freeze > requirements.txt but well… it’s close enough :)
So you have an application that requires some external libraries. Few of them are provided by App Engine itself, and you will be able to access them after you specify your requirement in app.yaml. Many times, however, you will want to tap into broader open source ecosystem, just like you’d like with any other Python app.
There is a way, fortunately, to include external libraries to go with your application without them cluttering your repository. Since the de facto standard for publishing code on the ‘net is to push it to a public Git repository, we can use Git submodules to “symlink” to those repositories. Our own Git repo won’t store any of their actual content, but only a list of URLs; the .gitmodules file.
If you held your breath at the mere mention of Git submodules, don’t panic. They get a lot of flak, that’s true, and many of their claimed shortcomings are quite genuine. All of them apply to the scenario where a main repo uses submodules to reuse shared subproject that is modified in conjunction with the main one.
As you have probably noticed, this is totally different than the setting we’re discussing here. When including an external dependency, the fact that Git submodule points to specific commit in the other repo is a feature, not bug. It’s the exact same reason why we should always put version numbers in requirements.txt: upgrading a third party library must never be accidental, or you risk breaking your code through unexpected API or behavior changes.
So, how to do it – use Git submodules, that is? You substitute
pip install with
git submodule add:
This will establish reference between the repo under given URL and a directory path inside your project, fetching the repo’s content in the process. But as you will quickly notice in
$ git status, that content won’t become part of the working directory.
After all this talk about being explicit with your libraries’ version, you probably also want to checkout a correct release:
Otherwise, you will work off whatever the current HEAD happened to be, exactly how
pip install flask would give you whatever is the newest release in PyPI.
Working alone from a single machine, this would set you up for the time being. For starting somewhere else, though, you need equivalent of
pip install -r requirements.txt, i.e. a way to fetch all your libraries at once. Here’s where
git submodule update comes handy:
It will both setup your freshly cloned repo to use submodules specified in .gitmodules files, as well as pull the submodules’ content.
There’s much more to Git submodules, of course, so if you want to gain much more thorough insight into them than this short overview, I recommend having a look at the Git book. And as with most things,
$ man git submodule is always helpful.
With dependencies seemingly in place, you might be quite disappointed trying to, you know, use them:
The reason for that is simple, though: the libraries are physically there on your disk, but they are not in your virtualenv’s
$PYTHONPATH, so Python has no idea where to import them from. There are ways to solve this problem that I could ramble for a while about, but I will just go ahead and demonstrate a ready-made shell script which handles it all :)
You might need to tweak it, e.g. if your GAE SDK installation path is different than /opt/google_appengine, but otherwise it should be pretty straightforward. One caveat, though: the script should be re-run after adding a brand new library, as described in previous section:
As an added bonus, you will get
appcfg binaries inside your virtualenv’s
./bin, so you may remove App Engine’s SDK directory from your regular
Setup of a local development environment generally ends here – you should be now ready to run your app through
dev_appserver. What’s still missing is making your bundled libraries work with remote Python on actual App Engine instance. Sadly, there is no virtualenv in the cloud.
Instead, we need to revert to the glorified
sys.path hacks. Before importing anything, we extend the actual PYTHONPATH so that it covers our third party libraries. If their directory layout is just like shown in the first section (lib/ root with subdirs for different libraries), the following shim will suffice to correctly bootstrap the import mechanics:
Place this in the root of your project’s source tree (outside the main Python package) and point the app.yaml to it:
With this, you may now deploy your app and see whether it works correctly. If you encounter problems, I recommend taking a look at Flask on App Engine Project Template. Even if you intend to use different web framework, the example code should be largely applicable.
Callbacks have numerous problems, though, out of which the most severe one is probably the phenomenon of “marching to the right”:
When using the (still) common style of providing a callback as the last argument to a function initiating an asynchronous operation, you get this annoying result of ever-increasing indentation as you chain those operations together. It feels like the language itself is telling you that it was not designed for such a complex stuff… Coincidence? ;)
But it gets worse. Operations may fail somewhere along the way, which is something you’d probably like to know about. Depending on the conventions your project, framework or environment uses, this could mean additional boilerplate inside the callbacks to distinguish success from error cases. This is typical in Node.js, where first argument of callback represents the error, if any:
Alternatively, you may be asked to provide the error handler separately; an “errback”, as it’s sometimes called. Splitting the code into small parts is great and everything, but here it means you’ll have two functions as arguments:
As part of a language, Python obviously has
import statements. They allow us to divide the code into different modules and packages:
What is lesser known fact is that it also has an
__import__ function. This function retains all functionality of the
import statement, but has some additional features and slightly different use cases. With it, for example, you can import a module whose name you only know at “runtime”:
This comes handy in various types of general dispatchers, factory functions, plugin systems, and so forth. Returned from
__import__ function is always a module object (even in cases when
fromlist argument is used), so often a
getattr is needed to extract a specific symbol from it.
Quite surprisingly, I have discovered that
__import__ function may very well be useful also when you do know the desired module name. Reason is that
import statement is sometimes unwieldy. It has similar problem as global variables (i.e.
global statements) and inner function
definitions (as opposed to
lambdas): it makes the code stretch unnecessarily in the vertical dimension.
This can be considered a waste if you only need to access one specific thing from one specific module. Using
__import__ function, you can golf the import and the usage into a single statement. Here’s an example, coming straight from my own project recursely:
Incidentally, the other uses of literal
__import__ described can be conveniently replaced thanks to that small library :)
Another issue with
import statement is that it introduces symbols into the global (or local) namespace. Most of the time, this is precisely what we want. Occasionally, though, a sole fact of loading the module is enough.
A canonical example of the latter case is web application with request handlers scattered between different Python files, or even packages. All those files have to be imported if the handlers are to be added to framework’s routing table; but beyond that, we have no business with them.
As a result, the
introduces an unused symbol – here, it is
handlers. Many linting tools will be eager to point this fact out, which is not really that helpful. There is sometimes an option to disable the warning on per line basis, but some checkers (e.g. pep8.py) don’t offer this functionality.
Universal solution? Use
__import__ function, of course:
The module is still loaded just fine, but since we’re ignoring the return value, no stray variables are created. As a added bonus, the
__import__ call also looks very different, signifying its special purpose.
Actually, there is one more benefit of this trick, also coming from fooling-the-tools department. Many Python IDEs, like Eclipse/Pydev, are able to automatically insert necessary imports and organize them in groups, effectively providing a neat, Java-like experience. What is not so neat is that they often insist on putting every
import statement somewhere near the beginning of the file. preceding any other definition, variable, class or function.
In a scenario described above, this behavior may actually cause problems. When the handlers’ module gets imported, it may need to refer back to the application object; this is exactly the case in the Flask framework, for example. If that object happens to be defined in the module importing
handlers, we’ll have a circular import error because the application object has not yet been defined. It would have been defined, however, if the statement:
hasn’t been touched by the IDE when it wanted to be helpful and organize our imports. All imports, as it turns out.
Fortunately, mechanisms like that tend to be easy to fool. Per answers to StackOverflow question I’ve once asked, it is a matter breaking the textual pattern that the algorithm searches our code for. As you may have guessed by now, one of the ways of achieving that goal is to shed the
import statement in favor of