Often I advocate using Python for various automation tasks. It’s easy and powerful, especially when you consider how many great libraries – both standard and third party – are available at your fingertips. If asked, I could definitely share few anecdotes on how some .py script saved me a lot of hassle.
So I was a bit surprised to encounter a non-trivial problem where using Python seemed like an overkill. What I needed to do was to parse some text documents; extract specific bits of information from them; download several files through HTTP based on that; unzip them and place their content in designated directory.
Nothing too fancy. Rather simple stuff.
But then I realized that doing all this in Python would result in something like a screen and a half of terse code, full of tedious minutiae.
The parsing part alone would be a triply nested loop, with the first two layers taken by os.walk
boilerplate. Next, there would be the joys of urllib2
; heaven forbid it turns out I need some headers, cookies or authentication. Finally, I would have to wrap my head around the zipfile
module. Oh cool, seems like some StringIO
glue might be needed, too!
Granted, I would probably use glob2 for walking the file system, and definitely employ requests for HTTP work. And thus my little script would have external dependencies; isn’t that making it a full-blown program?…
Hey, I didn’t sign up for this! It was supposed to be simple. Why do I need to reimplement grep
and curl
, anyway? Can’t I just…
…oh wait.
Reading program’s command line and doing something with the arguments is the main purpose of most small (or bigger) utilities. Those are often written in Python – because of how easy and fast this is – so there should be a way to parse the command line in Python, too.
And in fact there are quite a few of them, all from the standard library. But the argparse module is most likely the best of them all, equally for its flexibility and power, as well as the sole fact of not being deprecated yet ;-)
For that matter, I have already used it several times, not only in Python. Today I want to present a summary of few useful techniques and solutions that I learned along the way, mostly by braving the not-so-friendly documentation of argparse. Given I’m not likely to do unusual stuff here, they should also address quite common, albeit less trivial use cases.
Following the convention of every operating system imaginable, argparse has positional arguments and flags. Flags are denoted by one or two dashes preceding the name or its one-letter abbreviation:
Normally in argparse, flags take arguments that are later stored in the result object. This would be helpful for parsing something like the -m
(message) flag in the git commit
example above.
Not every flag needs to behave like that, though. In the last ln
example, the -s
does not take any arguments. Instead, it alters the program behavior by its mere presence: with it, ln
creates a symbolic link instead of “hard” link. So in a sense, the flag is boolean. We would like to handle it as such.
In argparse, this is possible by setting the appropriate action=
in the add_argument
method:
Depending on what’s more logical for your program, you can reverse the logic to 'store_false'
and default=True
, of course.
If your program takes one entity as an argument and does something specific with it, users will often expect it to work with multiple entities too. You can observe it first hand with pip
:
or any version control application:
There is no reason to ignore this expectation and it’s pretty easy to satisfy in argparse. Again, there is an action=
for that:
and it’s sufficient for flags. Here the object returned by parse_args
will get foo
attribute with the list of arguments from all occurrences of --foo
.
For positionals, it’s a little bit trickier because by default, they are meant to appear exactly once. This can be changed using nargs=
:
The value of '+'
is probably the most useful here, as it requires for the argument to be present at least once. Just like for flags, the result will be a list of all its occurrences, so you can iterate or map
over it easily.
Less typically, you may want to have a positional argument which can be supplied or not (an optional one). Although it is possible with the API outlined above, I wouldn’t recommend it: you will have to deal with unnecessary 0-or-1-element list and you won’t get proper error checking at the argparse level.
The correct solution involves nargs=
, too, but with a dedicated '?'
value:
As you may guess, default=
allows you to specify the value in parse_args
result should the argument be omitted.
Once you set up your ArgumentParser
, you will (hopefully) want to test it. Lucky for you, this can be done easily without every touching the actual command line. Simply pass your arguments (as a list) to parse_args
and it will use it instead of sys.argv
:
With this you can easily write some nice unit tests for your parser – which you should do, obviously. What you should not do, however, is abusing this feature to call your program’s code from itself:
Just don’t.
There are, of course, many other interesting features and applications of argparse that you will find useful. I can especially recommend that you get to know about:
git
or pip
)--help
output, or for mutual exclusion (e.g. --verbose
and --quiet
option)Equipped with this knowledge, you should be able to write beautiful and easy to use command line tools. Please do so :)
The upcoming release of Windows 8 is stirring up the tech world, mostly because of some controversial decisions Microsoft has made regarding the new version of their flagship OS.
Were it a few years ago, I would likely participate in various discussions about this, springing up on message boards I typically frequent. Probably I would have also tested the development preview which was made available several months before. Most likely, I would have clear opinion about viability of the new Metro UI, WinRT apps’ platform or the overall push in the general direction of HTML5.
I would… But I do not. Actually, I couldn’t care less about the whole thing.
It’s not only because I touch my Windows installation once in a blue moon, almost exclusively for gaming. No, that’s mostly because it’s 2012, and Windows still doesn’t support some crucial functionality you could expect from modern operating system.
Unfortunately, most users don’t expect it because they never knew any better. Therefore I will try to highlight some features from alternative operating systems (typically Linux or OSX) that I think Windows can be rightfully scorned for lacking.
In Unix-like systems, files and directories with names starting from ‘.’ (dot) are “hidden”: they don’t appear when listed by ls and are not shown by default in graphical file browsers. It doesn’t seem very clear why there is such a mechanism, especially when we have extensive chmod permissions and attributes which are not tied to filename. Actually, that’s one of distinguishing features of Unix/Linux: there are neither .exe nor .app files, just chmod +x.
But, here it is: name-based visibility control for files and directories. Why such a thing was ever implemented in the first place? Well, it turns out it was purely by accident:
Long ago, as the design of the Unix file system was being worked out, the entries . and .. appeared, to make navigation easier. (…) When one typed ls, however, these entries appeared, so either Ken [Thompson] or Dennis [Ritchie] added a simple test to the program. It was in assembler then, but the code in question was equivalent to something like this:
if (name[0] == '.') continue;
That test was meant to filter out . and .. only. Unintentionally, though, it ruled out a much bigger class of names: all that start with a dot, because that’s what it actually checked for. Back then it probably seemed like an innocuous detail. Fast-forward a couple of decades and it’s a de facto standard for storing program-specific data inside user’s home path. They can grow quite numerous over time, too:
That’s over 100 entries, a vast majority of my home directory. It’s neither elegant nor efficient to have that much of app-specific cruft inside the most important place in the filesystem. And even if GUI applications tend to collectively use a single ~/.config directory, the tradition to clutter the root $HOME path is strong enough to persist for foreseeable future.
Heed this as a warning. In the event your software becomes a basis for many derived solutions, future programmers will exploit every corner case of every piece of logic you have written. It doesn’t really matter what you wanted to put into your code, but only what you actually did.
Looks like using Linux is really bound to slowly – but steadily – improve your commandline-fu. As evidence, today I wanted to share a little piece of shell acolyte’s magic that I managed to craft without very big trouble. It’s about counting lines in files – code lines in code files, to be specific.
For a single file, getting the number of text rows is very simple:
Although the name wc
comes from “word count”, the -l
switch changes its mode of operation into counting rows. The flexibility of this little program doesn’t end here; for example, it can also accept piped input (as stdin):
as well as multiple files:
or even wildcards, such as wc -l *.file
. With these we could rather easily count the number of lines of code in our project:
Unfortunately, the exact interpretation of **/*
wildcard seems to vary between shells. In zsh it works as shown above, but in bash I had it omit files from current directory. While it might make some sense here (as it would give a total without setup script and tests), I’m sure it won’t be the case all projects.
And so we need something smarter.
Parametry przekazywane programom podczas uruchomienia mają najczęściej formę kilku przełączników (switches), po których opcjonalnie występują właściwe dla nich argumenty. Zwyczajowo rzeczone przełączniki są poprzedzone myślnikiem (styl uniksowy) lub slashem (styl MS-DOS), jak w przykładzie poniżej:
gcc -c main.c -o main.o -pedantic
Tak skonstruowany wiersz poleceń jest przekazywany przez runtime do aplikacji C/C++ jako tablica ciągów znaków, powstałych już po podzieleniu go na części według spacji i ew. cudzysłowów. Tradycyjnie pierwszy parametr (
argc
) funkcji main
określa liczbę tych części, zaś drugi (argv
) jest wspomnianą tablicą je zawierającą. Warto pamiętać, że – jeśli interesują nas tylko argumenty wywołania – jest to tablica indeksowana od jedynki, gdyż argv[0]
zawiera zawsze samo polecenie uruchamiające program (w powyższym przykładzie byłby to ciąg "gcc"
). Ostatnim elementem jest z kolei argv[argc - 1]
, zatem w sumie tablica zawiera argc - 1
znaczących parametrów programu. Wreszcie, gwarantowane jest, iż argv[argc]
jest zawsze pustym wskaźnikiem, jeśli do czegokolwiek mogłoby się to przydać.
Bezpośrednia interpretacja wiersza poleceń, który używa wspomnianych na początku przełączników, może być jednak nieco kłopotliwa. Wydaje mi się, że lepiej jest przetworzyć ją do bardziej znośnej postaci słownika (mapy), pozwalającego odwoływać się do switchy i ich argumentów po nazwie. Można to zrobić choćby w poniższy sposób:
#include
const char SWITCH_CHAR = ‘-‘; // lub ‘/’
const CommandLine ParseCommandLine(int argc, const char* argv[])
{
CommandLine cl;
for (int i = 1; i < argc; )
if (*(argv[i]) == SWITCH_CHAR)
{
std::vector
p.reserve (argc – i);
int j;
for (j = i + 1;
j < argc && (*(argv[j]) != SWITCH_CHAR || strstr(argv[j], " "));
++j)
p.push_back (argv[j]);
cl.insert (std::make_pair(argv[i] + 1, p));
i = j;
}
else
++i;
return cl;
}[/cpp]
W rezultacie otrzymamy mapę, w której wyszukiwanie (std::map::find
) pozwoli nam określić, czy dany przełącznik był podany, zaś indeksowanie (operator []
) da nam dostęp do jego parametrów w postaci wektora.