…for fun and profit!
I’m still kind of amazed at how malleable the Python language is. It’s no small feat to allow for messing with classes before they are created, but it turns out to be pretty commonplace now. My latest frontier of pythonic hackery is import hooks, and today I’d like to write something about them. I believe this will come in handy for at least a few pythonistas because the topic seems to be rather scarcely covered on the ‘net.
As you can easily deduce, the name ‘import hook’ indicates something related to Python’s import mechanism. More specifically, import hooks are about injecting our custom logic directly into Python’s importing routines. Before delving into details, though, let’s review how imports are handled by default.
As far as we are concerned, the process seems to be pretty simple. When the Python interpreter encounters an import statement, it looks up the list of directories stored inside sys.path. This list is populated at startup and usually contains entries inserted by external libraries or the operating system, as well as some standard directories (e.g. dist-packages). These directories are searched in order and in greedy fashion: if one of them contains the desired package/module, it’s picked immediately and the whole process stops right there.
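To see this search order on a concrete installation, a small sketch like the one below can help. It assumes Python 3.4 or newer, where importlib.util.find_spec is available; json is used only as a convenient module that is certain to exist.

```python
import importlib.util
import sys

# Entries are consulted in exactly this order; the first match wins.
for position, entry in enumerate(sys.path):
    print(position, entry)

# For a top-level name, find_spec reports where the module would be
# loaded from without actually importing it.
spec = importlib.util.find_spec("json")
print(spec.name, spec.origin)
```

The printed list will differ from machine to machine, which is precisely why the order of entries matters.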
Should we run out of places to look, an ImportError is raised. Because this is an exception we can catch, it’s possible to try multiple imports before giving up:
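A minimal sketch of such a fallback chain might look as follows. The choice of simplejson as the alternative is an assumption made for illustration; any library exposing the same interface would do.

```python
try:
    import json  # part of the standard library since Python 2.6
except ImportError:
    try:
        import simplejson as json  # third-party package with the same API
    except ImportError:
        raise ImportError("no JSON library available")

# Whichever import succeeded, the rest of the code only sees the name "json".
print(json.dumps({"portable": True}))
```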
While this is extremely ugly boilerplate, it serves to greatly increase the portability of our application or package. Fortunately, there is only a handful of worthwhile libraries that we may need to handle this way; json is the most prominent example.
__path__
What I presented above as Python’s import flow is sufficient as a description for most purposes but far from complete. It omits a few crucial places where we can tweak things to our needs.
First is the __path__ attribute, which can be defined in a package’s __init__.py file. You can think of it as a local extension to the sys.path list that only works for submodules of this particular package. In other words, it contains directories that should be searched when a package’s submodule is being imported. By default it only has the __init__.py’s directory but it can be extended to contain different paths as well.
A typical use case here is splitting a single “logical” package between several “physical” packages that are distributed separately, typically as different PyPI distributions. For example, let’s say we have a foo package with foo.server and foo.client as subpackages. They are registered in PyPI as separate distributions (foo-server and foo-client, for instance) and a user can have either or both of them installed at the same time. For this setup to work correctly, we need to modify foo.__path__ so that it points to foo.server’s directory and foo.client’s directory, depending on whether they are present or not. While this task sounds exceedingly complex, it is actually very easy thanks to the standard pkgutil module. All we need to do is to put the following two lines into the foo/__init__.py file:
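The two lines are most likely the extend_path idiom from the pkgutil documentation. The sketch below assumes exactly that, and wraps the idiom in a throwaway demonstration: it builds two temporary directories standing in for the installed foo-server and foo-client distributions (the helper name and directory layout are made up for this example) and shows that both subpackages become importable.

```python
import importlib
import os
import sys
import tempfile

# The two lines that go into every copy of foo/__init__.py.
INIT_SOURCE = (
    "from pkgutil import extend_path\n"
    "__path__ = extend_path(__path__, __name__)\n"
)

def make_distribution(root, subpackage):
    """Create <root>/foo/<subpackage>/ with the two-line foo/__init__.py."""
    package_dir = os.path.join(root, "foo")
    subpackage_dir = os.path.join(package_dir, subpackage)
    os.makedirs(subpackage_dir)
    with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
        handle.write(INIT_SOURCE)
    with open(os.path.join(subpackage_dir, "__init__.py"), "w") as handle:
        handle.write("NAME = %r\n" % subpackage)

base = tempfile.mkdtemp()
for subpackage in ("server", "client"):
    root = os.path.join(base, "foo-" + subpackage)
    make_distribution(root, subpackage)
    sys.path.insert(0, root)  # pretend the distribution is installed

importlib.invalidate_caches()
server = importlib.import_module("foo.server")
client = importlib.import_module("foo.client")
print(server.NAME, client.NAME)
print(sys.modules["foo"].__path__)
```

Without the extend_path call, only the copy of foo found first on sys.path would be visible, and importing the other subpackage would fail.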
There is much more to __path__ manipulation than this simple trick, of course. If you are interested, I recommend reading an issue of Python Module of the Week devoted solely to pkgutil.
sys.meta_path and sys.path_hooks
Moving on, let’s focus on the parts of the import process that let you do the truly amazing things. Here I’m talking about stuff like pulling modules directly from ZIP files or remote repositories, or just creating them dynamically based on, say, WSDL descriptions of Web services, symbols exported by DLLs, REST APIs, command line tools and their arguments… pretty much anything you can think of (and your imagination is likely better than mine). I’m also referring to “aggressive” interoperability between independent modules: when one package can adjust or expand its functionality when it detects that another one has been imported. Finally, I’m also talking about security-enhanced Python sandboxes that intercept import requests and can deny access to certain modules or alter their functionality on the fly.
All of these (and possibly much more) can be achieved through the usage of import hooks. There are two distinct types of them, usually referred to as meta hooks (defined in sys.meta_path) and path hooks (defined in sys.path_hooks). Although they are invoked at slightly different stages of the import flow, they are both built upon the same two concepts: that of a module finder and a module loader.
A module finder is simply an object which implements one specific method: find_module(fullname, path=None).
It receives the fully qualified name of the module to be imported, along with the path where it’s supposed to be found, and the method is expected to do one of three things: raise an exception, which aborts the import altogether; return None, meaning that the given module cannot be found by this particular finder (it can still be found during the next stages of the import flow, either by some other custom finder or just the standard Python import mechanism); or return a loader object that will carry out the actual import.
The last case is of course the most interesting one, as it gracefully leads us to the concept of a module loader. This one is an object that implements the load_module(fullname) method.
Again, the fullname parameter is a fully qualified name of the module that we want to import. The return value should be a module object: the final result of the whole importing process. Note that this could be something that was already imported; for such “duplicate” imports the loader should simply return the existing module:
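Here is a minimal sketch of a finder and loader pair that follows these rules, including the sys.modules check for duplicate imports. All names in it (GreetingFinder, GreetingLoader, greeting_demo) are invented for the example. The pair is driven by hand instead of being installed as a hook, because Python 3.12 and later no longer call these PEP 302 methods and expect find_spec and exec_module instead.

```python
import sys
import types

class GreetingLoader(object):
    """Loader that builds a tiny module from scratch."""

    def load_module(self, fullname):
        # A repeated import must hand back the module that already exists.
        if fullname in sys.modules:
            return sys.modules[fullname]
        module = types.ModuleType(fullname)
        module.__loader__ = self
        module.greet = lambda: "hello from %s" % fullname
        # PEP 302 asks loaders to register the new module in sys.modules.
        sys.modules[fullname] = module
        return module

class GreetingFinder(object):
    """Finder that only recognizes one hypothetical module name."""

    def find_module(self, fullname, path=None):
        if fullname == "greeting_demo":
            return GreetingLoader()
        return None  # not ours; let other finders try

finder = GreetingFinder()
loader = finder.find_module("greeting_demo")
first = loader.load_module("greeting_demo")
second = loader.load_module("greeting_demo")
print(first.greet())
print(first is second)
```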
If anything goes wrong at this stage, the loader should raise an exception (typically just an ImportError).
Here’s where most of the theory ends, as conveniently described in PEP 302. In practice both finder and loader can be the same object and the find_module method can simply return self. As an example, consider this simple hook which is intended to block some specific modules from being imported at all:
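A sketch of such a blocker is shown below. The class name and error message are illustrative. The find_module and load_module methods follow the PEP 302 protocol described above, while the extra find_spec method is an addition for current interpreters: Python 3.4 and later ask for find_spec first, and 3.12 stopped falling back to find_module. The blocked name httplib matches the text; on Python 3 that module is called http.client, so here the hook merely gets to reject the name before the default machinery does.

```python
import sys

class ImportBlocker(object):
    """Meta hook refusing to import any of the listed modules."""

    def __init__(self, *names):
        self.blocked = set(names)

    def find_module(self, fullname, path=None):
        # Finder and loader are the same object, so return self.
        return self if fullname in self.blocked else None

    def load_module(self, fullname):
        raise ImportError("importing %s is blocked" % fullname)

    def find_spec(self, fullname, path=None, target=None):
        # Bridge for Python 3.4+: reuse the PEP 302 methods above.
        loader = self.find_module(fullname, path)
        if loader is not None:
            loader.load_module(fullname)  # always raises ImportError
        return None

blocker = ImportBlocker("httplib")
sys.meta_path.insert(0, blocker)

message = None
try:
    import httplib
except ImportError as error:
    message = str(error)
print(message)
```

Keep in mind that modules already present in sys.modules are served from that cache and never reach the hook, so a blocker has to be installed before the first import of the module it guards.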
Once installed in sys.meta_path, it will intercept every attempt to import a new module and check whether its name is on our list. This applies to indirect imports as well: if we attempt to use the Python Requests library (import requests), the import will also fail, as requests internally uses urllib3, which in turn uses the restricted httplib package.
A hook that is a total blocker doesn’t seem very useful, so let’s try something slightly different. Rather than refusing to import a particular module, we’ll proceed normally and issue a warning instead. Such a hook can help detect when deprecated Python modules are introduced to the project:
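One way to sketch this on a current interpreter is shown below. It swaps in a simpler technique than delegating to the imp module: the hook implements find_spec (assuming Python 3.4 or newer), emits the warning, and returns None so that the regular finders carry on with the import. The class name is made up, and colorsys is only a stand-in for a module we might consider deprecated.

```python
import sys
import warnings

class DeprecationWarner(object):
    """Meta hook that warns about selected modules but loads nothing itself."""

    def __init__(self, *names):
        self.deprecated = set(names)

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.deprecated:
            warnings.warn("module %s is deprecated" % fullname,
                          DeprecationWarning, stacklevel=2)
        # Returning None hands the import over to the remaining finders.
        return None

warner = DeprecationWarner("colorsys")  # stand-in for a deprecated module
sys.meta_path.insert(0, warner)

# Cached modules never reach the hook, so make sure this import is a fresh one.
sys.modules.pop("colorsys", None)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    import colorsys

print([str(item.message) for item in caught])
```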
In order to access the normal importing mechanism, we can use the imp module. Its functions find_module and load_module are roughly the equivalents of our import hook’s methods with the same names. But imp offers much more, as it also contains functions capable of creating modules from various inputs (e.g. load_source, load_compiled) or even creating them completely from scratch (new_module). Note that imp has been deprecated since Python 3.4 and was removed in Python 3.12; importlib is its replacement.
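On interpreters where imp is no longer available, roughly the same effects can be had with types and importlib.util. The sketch below assumes Python 3.5 or newer; the module names and the temporary file are invented for the example.

```python
import importlib.util
import os
import tempfile
import types

# Counterpart of imp.new_module: an empty module created from scratch.
scratch = types.ModuleType("scratch")
scratch.answer = 42

# Rough counterpart of imp.load_source: execute a source file as a module.
path = os.path.join(tempfile.mkdtemp(), "plugin_demo.py")
with open(path, "w") as handle:
    handle.write("VALUE = 'loaded from source'\n")

spec = importlib.util.spec_from_file_location("plugin_demo", path)
plugin = importlib.util.module_from_spec(spec)
spec.loader.exec_module(plugin)

print(scratch.answer, plugin.VALUE)
```

Neither module is registered in sys.modules here; a real loader would do that itself.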
While all this is surely very interesting, we may doubt whether import hooks actually have any notable applications at all. There is surely potential for some really impressive things, including importing Python modules straight from remote URLs (security concerns notwithstanding). In my case, though, I had an actual need that import hooks seemed to satisfy best.
There is this great pytz package for supporting date and time calculations involving timezones. In general, it is really shaky ground to tread upon, where issues related to daylight saving time are among the easier ones to deal with. For the most part, pytz helps navigate the obstacles in an elegant manner, but there is one thing where it falls short.
The underlying timezone database has apparently no notion of usable, “generic” timezones: ones based solely on offset from GMT rather than precise location, and without any DST. That’s bad because generic timezones are useful in web development as a temporary choice for new users, based on automatic detection through the timezone offset obtained with JavaScript.
But the keyword here is ‘usable’. As a matter of fact, pytz has Etc/GMT+X timezones. However, due to some obscure, decades-old compatibility imperative they are the exact opposite of what we would expect them to be: their offsets are effectively negated. It means that Etc/GMT+2, for example, doesn’t refer to normal time in eastern Europe (or DST in central/western Europe) but to a timezone on the other side of the Prime Meridian, which is almost unused except as DST for a few South American countries and Greenland. It goes without saying that this is completely and utterly insane.
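To make the sign flip concrete, here is a small sketch of a helper that maps an intuitive offset in whole hours to the matching Etc/ zone name. The function etc_zone_name is hypothetical and not part of pytz; it only encodes the naming rule described above.

```python
def etc_zone_name(offset_hours):
    """Return the Etc/ zone whose real UTC offset equals offset_hours.

    The sign in the zone name is the opposite of the real offset,
    so UTC+2 is called Etc/GMT-2.
    """
    if offset_hours == 0:
        return "Etc/GMT"
    sign = "-" if offset_hours > 0 else "+"
    return "Etc/GMT%s%d" % (sign, abs(offset_hours))

print(etc_zone_name(2))   # eastern European standard time
print(etc_zone_name(-3))
```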
In cases like this there are usually two solutions. You can put a thin layer with the appropriate fix in front of the (already perfect) library interface and make sure that no one uses the library directly, which is of course impossible. Or you can fork it and make the necessary changes, but then you’ll have to maintain the fork, manually pulling changes from upstream whenever the original timezone database is updated (which is every few months). Neither is satisfying; couldn’t we just use the library as it is, but somehow patch it on the fly, just before it’s used?…
Why, of course: hello import hooks! Using a relatively simple module finder and loader, we can easily achieve the desired effect and transparently expand the pytz library to include more useful generic timezones. The full code can be seen in this gist and it isn’t even long or complex.
That would be it for today’s write-up. If you want to learn more about the intricacies of Python’s import hooks (such as the meaning of sys.path_hooks, for example), the canonical source will of course be the appropriate PEP. Beyond that, there isn’t really any wealth of information to point at: some blog posts here and there, with this one being probably the most useful.