Posts tagged ‘Python’

Unpacking Ad-hoc Dictionaries

2013-01-27 8:04

Today I’d like to present something what I consider rather obvious. Normally I don’t do that, but I’ve had one aspiring Pythonist whom I helped with the trick below, and he marveled at the apparent cleverness of this solution. So I thought it might be useful for someone else, too.

Here’s the deal. In Python, functions can be invoked with keyword arguments, so that argument name appears in the function call. Many good APIs use that feature extensively; database libraries known as ORMs are one typical example:

  1. user = session.query(User).filter_by(email="joe@example.com").first()

In this call to filter_by() we pass the email argument as a keyword. Its value is then used to construct an SQL query that contains a filter on email column in the WHERE clause. By adding more arguments, we can introduce more filters, linked together with the AND operator.

Suppose, though, that we don’t know the column name beforehand. We just have it stored in some variable, maybe because the query is part of authentication procedure and we support different means for it: e-mail, Facebook user ID, Twitter handle, etc.
However, the keyword arguments in function call must always be written as literal Python identifiers. Which means that we would need to “eval” them somehow, i.e. compute dynamically.

How? Probably best is to construct an ad-hoc dictionary and unpack it with ** operator:

  1. def get_user(column_name, column_value):
  2.     return session.query(User).filter(**{column_name: column_value}).first()

That’s it. It may not be obvious at first, because normally we only unpack dictionaries that were carefully crafted as local variables, or received as kwargs parameters:

  1. def some_function(one_arg, **kwargs):
  2.     kwargs['foo'] = 'bar'
  3.     some_other_function(**kwargs)

But ** works on any dictionary. We are thus perfectly allowed to create one and then unpack it immediately. It doesn’t make much sense in most cases, but this is one of the two instances when it does.

The other situation arises when we know the argument name while writing code, but we cannot use it directly. Python reserves many short, common words with plethora of meanings (computer-scientific or otherwise), so this is not exactly a rare occurrence. You may encounter it when building URLs in Flask:

  1. login_url = url_for('login', **{'as': test_user_id})

or parsing HTML with BeautifulSoup:

  1. comment_spans = comments_table.find_all('span', **{'class': 'comment'})

Strictly speaking, this technique allows you to have completely arbitrary argument names which are not even words. Special handling would be required on both ends of function call, though.

Tags: , ,
Author: Xion, posted under Computer Science & IT » 2 comments

At Least Python Got Equality Right

2013-01-03 23:13

I’m still flabbergasted after going through the analysis of PHP == operator, posted by my infosec friend Gynvael Coldwind. Up until recently, I knew two things about PHP: (1) it tries to be weakly typed and (2) it is not a stellar example of language design. Now I can confidently assert a third one…

It’s completely insane.

However, pondering the specific case of equality checks, I realized it’s not actually uncommon for programming languages to confuse the heck out of developers with their single, double or even triple “equals”. Among the popular ones, it seems to be a rule rather than exception.

Just consider that:

  • JavaScript has both == and ===, exactly like PHP does. And the former is just slightly less crazy than its PHP counterpart. For both languages, it just seems like a weak typing failure.
  • In C and C++, you may easily use = (assignment) in lieu of == (equality), because the former is perfectly allowed inside conditions for if, while or for statements.
  • Java is famously counterintuitive when it comes to comparing strings, requiring to use String.equals method rather than == (like in case of other fundamental data types). Many, many programmers have been bitten by that. (The fact that under certain conditions you can compare strings char-by-char with == doesn’t exactly help either).
  • C# complicates stuff even more by allowing to override Equals and overload == operator. It also introduces ReferenceEquals which usually works like ==, except when the latter is overloaded. Oh, and it also has two different kinds of types (value and reference types) which by default compare in two different ways… Joy!

The list could likely go on and include most of the mainstream languages but one of them would be curiously absent: Python.

You see, Python got the == operator right:

  • It tests for equality only, not identity (also known as “reference equality”). For that there is a separate is operator.
  • All basic objects – not only strings, but also lists or dictionaries (hash tables) – compare by value. Hence e.g. two lists are equal if they contain equal elements in the same order, whether or not they are the same objects.
  • Implicit conversions are applied judiciously. Different types of numbers (int, long, float) compare to each other just fine, but there is clear distinction between 42 (number) and "42" (string).
  • You can overload == but there are no magical tricks that instantly turn your class into wannabe fundamental type (like in C#). If you really want value semantics, you need to write that yourself.

In retrospect, all of this looks like basic sanity. Getting it right two decades ago, however… That’s work of genius, streak of luck – or likely both.

Tags: , , , , ,
Author: Xion, posted under Programming » 1 comment

Alternative @property Syntax

2012-12-19 22:04

As you probably know very well, in Python you can add properties to your classes. They behave like instance fields syntactically, but under the hood they call accessor functions whenever you want to get or set the property value:

  1. import os
  2.  
  3. class Directory(object):
  4.     """Simple class representing a directory in the file system."""
  5.     def __init__(self, path):
  6.         self.path = path
  7.  
  8.     @property
  9.     def parent(self):
  10.         """Parent directory."""
  11.         return Directory(os.path.join(self.path, os.pardir))
  12.  
  13. # usage: no () after .parent
  14. home_dir = Directory('/home/xion')
  15. root_dir = home_dir.parent.parent

Often – like in the example above – properties are read-only, providing only the getter method. It’s very easy to define them, too: just stick a @property decorator above method definition and you’re good to go.

iGet, iSet

Occasionally though, you will want to define a read-write property. (Or read-delete, but those are very rare). One function won’t cut it, since you need a setter in addition to getter. The canonical way Python docs recommend in such a case (at least since 2.6) is to use the @property.setter decorator:

  1. class TracedObject(object):
  2.     """Object that tracks changes to its properties."""
  3.     def __init__(self):
  4.         self.changed = False
  5.  
  6.     @property
  7.     def x(self):
  8.         return self._x
  9.  
  10.     @x.setter
  11.     def x(self, value):
  12.         self._x = value
  13.         self.changed = True

Besides that I find it ugly to split a single property between two methods, this approach will annoy many static code analyzers (including PEP8 checker) due to redefinition of x. Warnings like that are very useful in general, so we certainly don’t want to turn them off completely just to define a property or two.

So if our analyzer doesn’t support line-based warning suppression (like, again, pep8), we may want to look for a different solution.

Tags: , ,
Author: Xion, posted under Programming » 2 comments

Command Line Parsing in Python: Tips & Tricks

2012-12-13 21:49

Reading program’s command line and doing something with the arguments is the main purpose of most small (or bigger) utilities. Those are often written in Python – because of how easy and fast this is – so there should be a way to parse the command line in Python, too.
And in fact there are quite a few of them, all from the standard library. But the argparse module is most likely the best of them all, equally for its flexibility and power, as well as the sole fact of not being deprecated yet ;-)

For that matter, I have already used it several times, not only in Python. Today I want to present a summary of few useful techniques and solutions that I learned along the way, mostly by braving the not-so-friendly documentation of argparse. Given I’m not likely to do unusual stuff here, they should also address quite common, albeit less trivial use cases.

Boolean flags

Following the convention of every operating system imaginable, argparse has positional arguments and flags. Flags are denoted by one or two dashes preceding the name or its one-letter abbreviation:

  1. $ git commit -m "Fix stuff"
  2. $ hg bisect --bad 42
  3. $ ln -s ~/node_modules/foobar-0.0.1/bin/foobar ~/bin/foobar

Normally in argparse, flags take arguments that are later stored in the result object. This would be helpful for parsing something like the -m (message) flag in the git commit example above.
Not every flag needs to behave like that, though. In the last ln example, the -s does not take any arguments. Instead, it alters the program behavior by its mere presence: with it, ln creates a symbolic link instead of “hard” link. So in a sense, the flag is boolean. We would like to handle it as such.

In argparse, this is possible by setting the appropriate action= in the add_argument method:

  1. parser.add_argument("--symbolic", "-s", action='store_true', default=False)

Depending on what’s more logical for your program, you can reverse the logic to 'store_false' and default=True, of course.

Multiple positional arguments

If your program takes one entity as an argument and does something specific with it, users will often expect it to work with multiple entities too. You can observe it first hand with pip:

  1. $ pip install Flask
  2. $ pip install Flask WTForms SQLAlchemy celery pytz

or any version control application:

  1. $ git add README
  2. $ git add foo.h foo.c Makefile

There is no reason to ignore this expectation and it’s pretty easy to satisfy in argparse. Again, there is an action= for that:

  1. parser.add_argument("--foo", action='append')

and it’s sufficient for flags. Here the object returned by parse_args will get foo attribute with the list of arguments from all occurrences of --foo.

For positionals, it’s a little bit trickier because by default, they are meant to appear exactly once. This can be changed using nargs=:

  1. parser.add_argument("files", nargs='+')

The value of '+' is probably the most useful here, as it requires for the argument to be present at least once. Just like for flags, the result will be a list of all its occurrences, so you can iterate or map over it easily.

Optional positional arguments

Less typically, you may want to have a positional argument which can be supplied or not (an optional one). Although it is possible with the API outlined above, I wouldn’t recommend it: you will have to deal with unnecessary 0-or-1-element list and you won’t get proper error checking at the argparse level.

The correct solution involves nargs=, too, but with a dedicated '?' value:

  1. parser.add_argument("cache_dir", nargs='?', default='/tmp')

As you may guess, default= allows you to specify the value in parse_args result should the argument be omitted.

Testing

Once you set up your ArgumentParser, you will (hopefully) want to test it. Lucky for you, this can be done easily without every touching the actual command line. Simply pass your arguments (as a list) to parse_args and it will use it instead of sys.argv:

  1. >>> parser.parse_args(['-foo', 'bar'])
  2. Namespace(foo='bar')

With this you can easily write some nice unit tests for your parser – which you should do, obviously. What you should not do, however, is abusing this feature to call your program’s code from itself:

  1. def main(argv=sys.argv):
  2.     args = parse.parse_args(argv)
  3.     # ...
  4.  
  5. # later, somewhere deep inside...
  6. main(['other_function', 'one_argument', '--key', 'value'])

Just don’t.

Read more

There are, of course, many other interesting features and applications of argparse that you will find useful. I can especially recommend that you get to know about:

  • subparsers, a way to divide your complex tool into several internal commands (like git or pip)
  • argument groups for organizing your arguments into functional groups for better --help output, or for mutual exclusion (e.g. --verbose and --quiet option)
  • help text formatting, handy for more elaborate descriptions that need their whitespace preserved

Equipped with this knowledge, you should be able to write beautiful and easy to use command line tools. Please do so :)

Tags: ,
Author: Xion, posted under Programming » Comments Off on Command Line Parsing in Python: Tips & Tricks

Don’t Write Classes

2012-08-29 11:17

On this year’s PyCon US, there was a talk with rather (thought-)provoking title Stop Writing Classes. The speaker might not be the most charismatic one you’ve listened to, but his point is important, even if very simple. Whenever you have class with a constructor and just one other method, you could probably do better by turning it into a single function instead.

Examples given in the presentation were in Python, of course, but the whole advice is pretty generic. It can be applied with equal success even to languages that are object-oriented to the extreme (like Java): just replace ‘function’ with ‘static method’. However, if we are talking about Python, there are many more situations where we can replace classes with functions. Often this will result in simpler code with less nesting levels.

Let’s see a few examples.

Inheriting for custom __init__

Sometimes we want to construct many similar objects that differ only slightly in a way their constructors are invoked. A rather simple example would be a urllib2.Request with some custom HTTP headers included:

  1. import urllib2
  2.  
  3. class CustomRequest(urllib2.Request):
  4.     def __init__(self, url, data=None, headers=None, *args, **kwargs):
  5.         headers = headers or {}
  6.         headers.setdefault('User-Agent', 'MyAwesomeApplication')
  7.         super(CustomRequest, self).__init__(url, data, headers, *args, **kwargs)
  8.  
  9. # usage
  10. request = urllib2.urlopen(CustomRequest("http://www.google.com")).read()

That works, but it’s unnecessarily complex without adding any notable benefits. It’s unlikely that we ever want to perform an isinstance check to distinguish between CustomRequest and the original Request, which is the main “perk” of using class-based approach.

Indeed, we could do just as well with a function:

  1. def CustomRequest(url, data=None, headers=None, *args, **kwargs):
  2.     headers = headers or {}
  3.     headers.setdefault('User-Agent', 'MyAwesomeApplication')
  4.     return urllib2.Request(url, data, headers, *args, **kwargs)

Note how usage doesn’t even change, thanks to Python handling classes like any other callables. Also, notice the reduced amount of underscores ;)

Patching-in methods

Even if the method we want to override is not __init__, it might still make sense to not do it through inheritance. Python allows to add or replace methods of specific objects simply by assigning them to some attribute. This is commonly referred to as monkey patching and it enables to more or less transparently change behavior of most objects once they have been created:

  1. import logging
  2. import functools
  3.  
  4. def log_method_calls(obj, method_name):
  5.     """Wrap the given method of ``obj`` object with log calls."""
  6.     method = getattr(obj, method_name)
  7.  
  8.     @functools.wraps(method)
  9.     def wrapped_method(*args, **kwargs):
  10.         logging.debug("Calling %r.%s with args=%s, kwargs=%s",
  11.                       obj, method_name, args, kwargs)
  12.         result = method(*args, **kwargs)
  13.         logging.debug("Call to %r.%s returned %r",
  14.                       obj, method_name, result)
  15.         return result
  16.  
  17.     setattr(obj, method_name, wrapped_method)
  18.     return obj

You will likely say that this look more hackish than using inheritance and/or decorators, and you’ll be correct. In some cases, though, this might be a right thing. If the solution for the moment is indeed a bit hacky, “disguising” it into seemingly more mature and idiomatic form is unwarranted pretension. Sometimes a hack is fine as long as you are honest about it.

Plain old data objects

Coming to Python from a more strict language, like C++ or Java, you may be tempted to construct types such as this:

  1. /*
  2.  * Represents a MIME content type, e.g. text/html or image/png.
  3.  */
  4. class ContentType {
  5. private:
  6.     std:string _major, _minor;
  7. public:
  8.     ContentType(const std::string& major, const std::string& minor)
  9.         : _major(major), _minor(minor) { }
  10.     ContentType(const std::string& spec) {
  11.         std::vector<string> parts;
  12.         boost::split(parts, spec, boost::is_any_of("/"));
  13.         _major = parts.at(0); _minor = parts.at(1);
  14.     }
  15.  
  16.     const std::string& Major() const { return _major; }
  17.     const std::string& Minor() const { return _minor; }
  18.  
  19.     operator std::string () const {
  20.         return Major() + "/" + Minor();
  21.     }
  22. };
  23.  
  24. // usage
  25. ContentType plainText("text/plain");
  26. sendMail("Hello", "Hello world!", plainText);

An idea is to encapsulate some common piece of data and pass it along in uniform way. In compiled, statically typed languages this is a good way to make the type checker work for us to eliminate certain kind of bugs and errors. If we declare a function to take ContentType, we can be sure we won’t get anything else. As a result, once we convert the initial string (like "application/json") into an object somewhere at the edge of the system, the rest of it can be simpler: it doesn’t have to bother with strings anymore.

But in dynamically typed, interpreted languages you can’t really extract such benefits because there is no compiler you can instruct to do your bookkeeping. Although you are perfectly allowed to write analogous classes:

  1. class ContentType(object):
  2.     """Represents a MIME content type, e.g. text/html or image/png."""
  3.     def __init__(self, major, minor=None):
  4.         if minor is None:
  5.             self._major, self._minor = major.split('/')
  6.         else:
  7.             self._major = major
  8.             self._minor = minor
  9.  
  10.     major = property(lambda self: self._major)
  11.     minor = property(lambda self: self._minor)
  12.  
  13.     def __str__(self):
  14.         return '%s/%s' % (self.major, self.minor)

there is no real benefit in doing so. Since you cannot be bulletproof-sure that a function will only receive objects of your type, a better solution (some would say “more pythonic”) is to keep the data in original form, or a simple form that is immediately usable. In this particular case a raw string will probably do best, although a tuple ("text", "html") – or better yet, namedtuple – may be more convenient in some applications.

So indeed…

…stop writing classes. Not literally all of them, of course, but always be on the lookout for alternatives. More often than not, they tend to make code (and life) simpler and easier.

Tags: , ,
Author: Xion, posted under Programming » 3 comments

Python Optimization 101

2012-07-14 21:17

It’s pretty much assumed that if you’re writing Python, you are not really concerned with the performance and speed of your code, provided it gets the job done in sufficiently timely manner. The benefits of using such a high level language usually outweigh the cons, so we’re at ease with sacrificing some of the speed in exchange for other qualities. The feasibility of this trade is always relative, though, and depends entirely on the tasks at hand. Sometimes the ‘sufficiently fast’ bar might be hung quite high up.

But while some attitudes are clearly beyond Python’s reach – like real-time software on embedded systems – it doesn’t mean it’s impossible to write efficient code. More importantly, it’s almost always possible to write more efficient code than we currently have; the nimble domain of optimization has its subdivision dedicated specifically to Python. And quite surprisingly, performance-tuning at this high level of abstraction often proves to be even more challenging than squeezing nanoseconds out of bare metal.

So today, we’re going to look at some basic principles of optimization and good practices targeted at writing efficient Python code.

Tools of the trade

Before jumping into specific advice, it’s essential to briefly mention few standard modules that are indispensable when doing any kind of optimization work.

The first one is timeit, a simple utility for measuring execution time of snippets of Python code. Using timeit is often one of the easiest way to confirm (or refute) our suspicions about insufficient performance of particular piece of code. timeit helps us in straightforward way: by executing the statement in questions many times and showing average, as well as cumulative, time it has taken.

As for more detailed analysis, the profile and cProfile modules can be used to gain insight on CPU time consumed by different parts of our code. Profiling a statement will yield us some vital data about number of times that any particular function was called, how much time a single call takes on average and how big is the function’s impact on overall execution time. These are the essential information for identifying bottlenecks and therefore ensuring that our optimizations are correctly targeted.

Then there is the dis module: Python’s disassembler. This nifty tool allows us to inspect instructions that the interpreter is executing, with handy names translated from actual binary bytecode. Compared to actual assemblers or even code executed on Java VM, Python’s bytecode is pretty trivial to analyze:

  1. >>> dis.dis(lambda x: x + 1)
  2.   1           0 LOAD_FAST                0 (x)
  3.               3 LOAD_CONST               1 (1)
  4.               6 BINARY_ADD          
  5.               7 RETURN_VALUE

Getting familiar with it proves to be very useful, though, as eliminating slow instructions in favor of more efficient ones is a fundamental (and effective) optimization technique.

Tags: , , ,
Author: Xion, posted under Computer Science & IT » 2 comments

Dictionary with Missing Values

2012-06-29 22:36

When working with dictionaries in Python, or any equivalent data structure in some other language, it is quite important to remember the difference between a key which is not present and a key that maps to None (null) value. We often tend to blur the distinction by using dict.get:

  1. thingie = some_dict.get('key')
  2. if thingie:
  3.     do_stuff_with(thingie)

We do that because more often that not, None and other falsy values (such as empty strings) are not interesting on their own, so we may as well lump them together with the “no value at all” case.

There are some situations, however, where these variants shall be treated separately. One of them is building a dictionary of keyword arguments that are subsequently ‘unpacked’ through the **kwargs construct. Consider, for example, this code:

  1. import re
  2. from bs4 import BeautifulSoup
  3.  
  4. def find_images_by_text(html, text, in_alt=True, in_title=False):
  5.     """Search for <img> tags 'matching' given text in HTML document."""
  6.     regex = re.compile(".*%s.*" % re.escape(text), re.IGNORE_CASE)
  7.     attrs = {}
  8.     if in_alt:
  9.         attrs['alt'] = regex
  10.     if in_title:
  11.         attrs['title'] = regex
  12.     return BeautifulSoup(html).find_all('img', **attrs)

With a key mapping to None, we’re calling the function with argument explicitly set to None. Without the key present, we’re not passing the argument at all, allowing it to assume its default value.

But adding or not adding a key to dictionary is somewhat more cumbersome than mapping it to some value or None. The latter can be done with conditional expression (x if cond else None), together with many other keys and value at once. The former requires an if statement, as shown above.
Would it be convenient if we had a special “missing” value that could be used like None, but caused the key to not be added to dictionary at all? If we had it, we could (for example) rewrite parts of the previous function that currently contain if branches:

  1. attrs = {'alt': regex if in_alt else missing,
  2.          'title': regex if in_title else missing}

It shouldn’t be surprising that we could totally introduce such a value and extend dict to support this functionality – after all, it’s Python we’re talking about :) Patching the dict class itself is of course impossible, but we can inherit it and come up with something like the following piece:

  1. class MissingDict(dict):
  2.     def __init__(self, **kwargs):
  3.         kwargs = dict((k, v) for (k, v) in kwargs.iteritems()
  4.                       if v is not missing)
  5.         dict.__init__(self, **kwargs)
  6.  
  7. missing = object()

The magical missing object is only a marker here, used to filter out keys that we want to ignore.

With this class at hand, some dictionary manipulations become a bit shorter:

  1. def find_images_by_text(html, text, in_alt=True, in_title=False):
  2.     regex = re.compile(".*%s.*" % re.escape(text), re.IGNORE_CASE)
  3.     return BeautifulSoup(html).find_all('img', **MissingDict(
  4.         alt=regex if in_alt else missing,
  5.         title=regex if in_title else missing))

We could take this idea further and add support for missing not only initialization, but also other dictionary operations – most notably the __setitem__ assignments. This gist shows how it could be done.

Tags: , ,
Author: Xion, posted under Computer Science & IT » Comments Off on Dictionary with Missing Values
 


© 2017 Karol Kuczmarski "Xion". Layout by Urszulka. Powered by WordPress with QuickLaTeX.com.