People’s coding styles tend to evolve and change over time. One particular habit I seem to have picked up is to sprinkle the code liberally with numerous
TODO markers. I wish I could say it’s clear a sign of my ever-present dissatisfaction with imperfect solutions, but I suspect I simply adopted it while working at the current company :)
In any case,
FIXMEs &c.) are not actually something to scoff at, not too much at least. They are certainly better than the alternative, which is to commit shady code without explanation, rationale, or ideas for improvement. With
TODOs, you are at least making the technical debt apparent and explicit, thus increasing the likelihood you’ll eventually come around to pay it off.
When that glorious day comes, though, it would be nice to get a quick overview of the code’s shortcomings, so that you can decide what to work on first. Getting a list of
TODOs scattered over many files sounds like a great task for
grep, and a relatively simple one at that. But somehow, every time I wanted to do that I ended up spending some non-negligible time just working out the details of
grep‘s syntax and flags.
Thus, the logical course of action would be craft a simple script which would relieve me from doing that ever again. However, when I got around to writing it, I quickly realized the task it’s not actually that simple. In fact, it’s totally impossible to do it with just
grep, since it would require matching a regex against multiple subsequent lines of input . Standard GNU
grep doesn’t support that at all.
Well, at this point I should’ve probably taken the hint and realize it’s not exactly the best idea to use a shell script here. But hey, not everything has to be written in Python, right? :) So I rolled up my sleeves and after a fair amount of googling (and stack-overflowing), I unleashed a horror that I hereby present:
For best result, it is necessary to have
pcregrep installed, which is an extended version of
grep that supports the full spectrum of Perl-compatible regular expressions. On most popular Linux distros,
pcregrep is just one
apt-get install away.
Every computer program expands until it can read e-mail – or so they say. But many applications need not to read, but to send e-mails; web apps or web services are probably the most prominent examples. If you happen to develop them, you may sometimes want a local, dummy SMTP server just for testing this functionality. It doesn’t even have to send anything (it must not, actually), but it should allow you to see what would be sent if the app worked in a production environment.
By far the easiest way to setup such a server involves, quite surprisingly, Python. There is a standard library module called
smtpd, which is built exactly for this purpose. Amusingly, you don’t even have to write any code that uses it; you can invoke it straight from the command line:
This will start a server that listens on port 8025 and dumps every message “sent” through it to the standard output. A custom port is chosen because on *nix systems, only the ports above 1024 are accessible to an ordinary user. For the standard SMTP port 25, you need to start the server as root:
While it’s more typing, it frees you from having to change the SMTP port number inside your application’s code.
If you plan to use
smtpd more extensively, though, you may want to look at the small runner script I’ve prepared. By default, it tries to listen on port 25, but you can supply a port number as its sole argument.
Lately, I was re-evaluating Google App Engine – the cloud computing platform – to see how feasible it would be for one pet project I’ve had in mind. It was pleasantly surprising overall, as the platform improved quite a lot while I wasn’t looking, since about a year and a half ago. Mostly interested in the Python part, I noticed that version 2.7 is now standard, lots of libraries are available out of the box, and it’s possible to use to pretty much any web framework you’d like to, such as Flask or Django.
Still, there are some quirks. App Engine SDK, for example, is a self-contained bundle with a bunch of Python packages that make it possible to run the development server with your app on your local machine. You don’t really “install” it into your Python interpreter, though.
Same goes for any additional, third party libraries your app may need. They must all be deployed along with it, as there is no setup.py or requirements.txt to specify your dependencies in. If you’re used to how e.g. Heroku handles dependencies, GAE’s way will undoubtedly be quite a letdown.
Good news are: you can still make it work sanely. By that I mean using virtualenv for development rather than your global, system-level interpreter, and keeping the code of any third party libraries out of your project’s repository. You may not get quite the same experience of
pip install and
pip freeze > requirements.txt but well… it’s close enough :)
So you have an application that requires some external libraries. Few of them are provided by App Engine itself, and you will be able to access them after you specify your requirement in app.yaml. Many times, however, you will want to tap into broader open source ecosystem, just like you’d like with any other Python app.
There is a way, fortunately, to include external libraries to go with your application without them cluttering your repository. Since the de facto standard for publishing code on the ‘net is to push it to a public Git repository, we can use Git submodules to “symlink” to those repositories. Our own Git repo won’t store any of their actual content, but only a list of URLs; the .gitmodules file.
If you held your breath at the mere mention of Git submodules, don’t panic. They get a lot of flak, that’s true, and many of their claimed shortcomings are quite genuine. All of them apply to the scenario where a main repo uses submodules to reuse shared subproject that is modified in conjunction with the main one.
As you have probably noticed, this is totally different than the setting we’re discussing here. When including an external dependency, the fact that Git submodule points to specific commit in the other repo is a feature, not bug. It’s the exact same reason why we should always put version numbers in requirements.txt: upgrading a third party library must never be accidental, or you risk breaking your code through unexpected API or behavior changes.
So, how to do it – use Git submodules, that is? You substitute
pip install with
git submodule add:
This will establish reference between the repo under given URL and a directory path inside your project, fetching the repo’s content in the process. But as you will quickly notice in
$ git status, that content won’t become part of the working directory.
After all this talk about being explicit with your libraries’ version, you probably also want to checkout a correct release:
Otherwise, you will work off whatever the current HEAD happened to be, exactly how
pip install flask would give you whatever is the newest release in PyPI.
Working alone from a single machine, this would set you up for the time being. For starting somewhere else, though, you need equivalent of
pip install -r requirements.txt, i.e. a way to fetch all your libraries at once. Here’s where
git submodule update comes handy:
It will both setup your freshly cloned repo to use submodules specified in .gitmodules files, as well as pull the submodules’ content.
There’s much more to Git submodules, of course, so if you want to gain much more thorough insight into them than this short overview, I recommend having a look at the Git book. And as with most things,
$ man git submodule is always helpful.
With dependencies seemingly in place, you might be quite disappointed trying to, you know, use them:
The reason for that is simple, though: the libraries are physically there on your disk, but they are not in your virtualenv’s
$PYTHONPATH, so Python has no idea where to import them from. There are ways to solve this problem that I could ramble for a while about, but I will just go ahead and demonstrate a ready-made shell script which handles it all :)
You might need to tweak it, e.g. if your GAE SDK installation path is different than /opt/google_appengine, but otherwise it should be pretty straightforward. One caveat, though: the script should be re-run after adding a brand new library, as described in previous section:
As an added bonus, you will get
appcfg binaries inside your virtualenv’s
./bin, so you may remove App Engine’s SDK directory from your regular
Setup of a local development environment generally ends here – you should be now ready to run your app through
dev_appserver. What’s still missing is making your bundled libraries work with remote Python on actual App Engine instance. Sadly, there is no virtualenv in the cloud.
Instead, we need to revert to the glorified
sys.path hacks. Before importing anything, we extend the actual PYTHONPATH so that it covers our third party libraries. If their directory layout is just like shown in the first section (lib/ root with subdirs for different libraries), the following shim will suffice to correctly bootstrap the import mechanics:
Place this in the root of your project’s source tree (outside the main Python package) and point the app.yaml to it:
With this, you may now deploy your app and see whether it works correctly. If you encounter problems, I recommend taking a look at Flask on App Engine Project Template. Even if you intend to use different web framework, the example code should be largely applicable.
Those cute little text formats are all the rage now, especially Markdown and reStructuredText. You can write them pretty easily, without lots of markup boilerplate that HTML entails, so they increasingly rule the READMEs of various open source projects. Both GitHub and Bitbucket support them perfectly well for this purpose.
They are obviously not WYSIWYG, though, so you may want to check the formatting first, before showing your prose to the rest of the world. There are some standard HTML generators for each of those text formats, but it would be more convenient to have one tool to rule them all…
And so there is Pandoc. That nifty utility is capable of converting from many input text formats into a vast variety of output formats, including HTML and PDF. It can be installed very easily on most systems:
and downloadable installers are available for OS X and Windows.
pandoc is a well behaved command line program, so it can accept both files and standard input, enabling it to be used in shell pipelines. Unfortunately, browser executables are not so cooperative – they generally want files, not just streams of data. We can circumvent that with the use of
/tmp directory and a little bit of shell scripting:
This should work
chrome, and likely with any other browser.
Next step? Maybe deploy a file system watcher, hook it up with some browser instrumentation, and make the page reload whenever a change in the source is detected. Non-trivial, but definitely doable :)
is an example of hashbang. It’s a very neat Unix concept: when placed at the beginning of a script, the line starting with
# (hash) and
! (bang) indicates an interpreter that should be chosen when running the script as an executable. Often used for shells (
#!/bin/zsh…), it also works for many regular programming languages, like Ruby, Python or Perl. Some of them may not even use
# as comment character but still allow for hashbangs, simply by ignoring such a first line. Funnily enough, this is just enough to fully “support” them, as the choice of interpreter is done at the system level.
Sadly, though, the only portable way to write a hashbang is to follow it with absolute path to an executable, which makes it problematic for pretty much anything other than /bin/*sh.
Take Python as an example. On many Linuxes it will be under /usr/bin/python, but that’s hardly a standard. What about /usr/local/bin/python? ~/bin/python?… Heck, one Python I use is under /usr/local/Cellar/python/2.7.3/bin – that’s installed by Homebrew on OS X, a perfectly valid Unix! And I haven’t even mentioned virtualenv…
This madness is typically solved by a standard tool called env, located under /usr/bin on anything at least somewhat *nixy:
env looks up the correct executable for its argument, relying on the
PATH environmental variable (hence its name). Thanks to env, we can solve all of the problems signaled above, and any similar woes for many other languages. That’s because by the very definition, running Python file with the above hashbang is equivalent to passing it directly to the interpreter:
Now, what if you wanted to also include some flags in interpreter invocation? For
python, for example, you can add
-O to turn on some basic optimizations. The seemingly obvious solution is to include them in hashbang:
Although this may very well work, it puts us again into “not really portable” land. Thankfully, there is a very ingenious (but, sadly, quite Python-specific) trick that lets us add arguments and be confident that our program will run pretty much anywhere.
Here’s how it looks like:
Understandably, it may not be immediately obvious how does it work. Let’s dismantle the pieces one by one, so we can see how do they all fit together – down not just to every quotation sign, but also to every space.
Many are the quirks of shell scripting. Most are related to confusing syntax, but some come from certain surprising semantics of Bash as a language, as well as the way scripts are executed.
Consider, for example, that you’d like to list files that are within certain size range. This is something you cannot do with ls alone. And while there’s certainly some awk incantation that makes it trivial, let’s assume you’re a rare kind of scripter who actually likes their hacks readable:
So you use an explicit
while loop, obtain the file size using stat and compare it to given bounds using a straightforward
if statement. Pretty simple code that shouldn’t cause any troubles later on… right?
But as your needs grow, you find that you also want to count how many files fall within your range, and how many do not. Given that you have an explicit
if, it appears like a simple addition (in quite literal sense):
Why it doesn’t work, then? Because clearly this is not the output we’re looking for (ls_between is our script here):
It seems that neither matches nor misses are counted properly, even though it’s clear from the printed list that everything is fine with our
if statement and loop. Wherein lies the problem?
Often I advocate using Python for various automation tasks. It’s easy and powerful, especially when you consider how many great libraries – both standard and third party – are available at your fingertips. If asked, I could definitely share few anecdotes on how some .py script saved me a lot of hassle.
So I was a bit surprised to encounter a non-trivial problem where using Python seemed like an overkill. What I needed to do was to parse some text documents; extract specific bits of information from them; download several files through HTTP based on that; unzip them and place their content in designated directory.
Nothing too fancy. Rather simple stuff.
But then I realized that doing all this in Python would result in something like a screen and a half of terse code, full of tedious minutiae.
The parsing part alone would be a triply nested loop, with the first two layers taken by
os.walk boilerplate. Next, there would be the joys of
urllib2; heaven forbid it turns out I need some headers, cookies or authentication. Finally, I would have to wrap my head around the
zipfile module. Oh cool, seems like some
StringIO glue might be needed, too!
Granted, I would probably use glob2 for walking the file system, and definitely employ requests for HTTP work. And thus my little script would have external dependencies; isn’t that making it a full-blown program?…
Hey, I didn’t sign up for this! It was supposed to be simple. Why do I need to reimplement
curl, anyway? Can’t I just…