Hacking Python Imports

2012-05-06 19:05

…for fun and profit!

I’m still kind of amazed at how malleable the Python language is. It’s no small feat to allow for messing with classes before they are created, but it turns out to be pretty commonplace now. My latest frontier of pythonic hackery is import hooks, and today I’d like to write something about them. I believe this will come in handy for at least a few pythonistas, because the topic seems to be rather scarcely covered on the ‘net.

Importing: a simplistic view

As you can easily deduce, the name ‘import hook’ indicates something related to Python’s mechanism of imports. More specifically, import hooks are about injecting our custom logic directly into Python’s importing routines. Before delving into details, though, let’s review how imports are handled by default.

As far as we are concerned, the process seems to be pretty simple. When the Python interpreter encounters an import statement, it looks up the list of directories stored inside sys.path. This list is populated at startup and usually contains entries inserted by external libraries or the operating system, as well as some standard directories (e.g. dist-packages). These directories are searched in order and in greedy fashion: if one of them contains the desired package/module, it’s picked immediately and the whole process stops right there.
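
To see this lookup in action, here is a minimal sketch, assuming Python 3.4 or newer (where importlib.util.find_spec is available). It lists the search path and asks the import machinery where a module would come from, without actually importing it. The name no_such_module_hopefully is made up for illustration.

```python
import importlib.util
import sys

# The directories (and possibly zip archives) searched, in order.
for entry in sys.path:
    print(entry)

# Ask where the standard json package would be loaded from.
json_spec = importlib.util.find_spec("json")
print(json_spec.origin)

# A module that cannot be found anywhere yields None instead of a spec.
missing_spec = importlib.util.find_spec("no_such_module_hopefully")
print(missing_spec)
```

The printed paths will differ between installations, which is exactly why the same import statement can resolve to different files on different machines.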

Should we run out of places to look, an ImportError is raised. Because this is an exception we can catch, it’s possible to try multiple imports before giving up:

    try:
        # Python 2.6+ and 3.x ship json in the standard library
        import json
    except ImportError:
        try:
            # Python 2.5 and below need the external simplejson package
            import simplejson as json
        except ImportError:
            try:
                # some older versions of Django have this
                from django.utils import simplejson as json
            except ImportError:
                raise ImportError("MyAwesomeLibrary requires a JSON package!")

While this is extremely ugly boilerplate, it serves to greatly increase the portability of our application or package. Fortunately, there is only a handful of worthwhile libraries that we may need to handle this way; json is the most prominent example.
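
If the nesting bothers you, the same fallback chain can be flattened with importlib.import_module. This is only a sketch; import_first is a hypothetical helper, not something the standard library provides.

```python
import importlib


def import_first(*names):
    """Return the first module from `names` that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate
    raise ImportError(
        "none of these modules could be imported: %s" % ", ".join(names))


# Same order of preference as the nested try/except version.
json = import_first("json", "simplejson", "django.utils.simplejson")
print(json.dumps({"works": True}))
```

The trade-off is that the module name is bound dynamically, so static analysis tools will no longer see a plain import statement.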

More details: about __path__

What I presented above as Python’s import flow is a sufficient description for most purposes, but it is far from complete. It omits a few crucial places where we can tweak things to our needs.

First is the __path__ attribute, which can be defined in a package’s __init__.py file. You can think of it as a local extension to the sys.path list that only works for submodules of this particular package. In other words, it contains the directories that should be searched when a submodule of the package is being imported. By default it only has the directory containing __init__.py, but it can be extended to contain other paths as well.

A typical use case here is splitting a single “logical” package between several “physical” packages, distributed separately – typically as different PyPI distributions. For example, let’s say we have a foo package with foo.server and foo.client as subpackages. They are registered in PyPI as separate distributions (foo-server and foo-client, for instance) and the user can have either or both of them installed at the same time. For this setup to work correctly, we need to modify foo.__path__ so that it points to the directory of foo.server, the directory of foo.client, or both, depending on which of them are present. While this task sounds exceedingly complex, it is actually very easy thanks to the standard pkgutil module. All we need to do is put the following two lines into the foo/__init__.py file of each distribution:

    import pkgutil
    __path__ = pkgutil.extend_path(__path__, __name__)

There is much more to __path__ manipulation than this simple trick, of course. If you are interested, I recommend reading an issue of Python Module of the Week devoted solely to pkgutil.
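
To watch extend_path at work, the following sketch builds two throwaway “distributions” in a temporary directory and imports submodules from both. It assumes Python 3; the package name foo_demo, the make_distribution helper and the ROLE constants are all made up for this demonstration.

```python
import importlib
import os
import sys
import tempfile

# The two lines every copy of foo_demo/__init__.py contains.
INIT = "import pkgutil\n__path__ = pkgutil.extend_path(__path__, __name__)\n"


def make_distribution(root, dist_name, submodule, body):
    """Create <root>/<dist_name>/foo_demo/ with __init__.py and one submodule."""
    pkg_dir = os.path.join(root, dist_name, "foo_demo")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write(INIT)
    with open(os.path.join(pkg_dir, submodule + ".py"), "w") as f:
        f.write(body)
    return os.path.join(root, dist_name)


root = tempfile.mkdtemp()
sys.path.append(make_distribution(root, "foo-server", "server", "ROLE = 'server'\n"))
sys.path.append(make_distribution(root, "foo-client", "client", "ROLE = 'client'\n"))
importlib.invalidate_caches()  # make sure the new directories are noticed

import foo_demo.server
import foo_demo.client

# Both physical directories now serve the one logical package.
print(foo_demo.__path__)
print(foo_demo.server.ROLE, foo_demo.client.ROLE)
```

Without the two pkgutil lines, only the first foo_demo directory found on sys.path would be used and the second import would fail.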

Actual hooks: sys.meta_path and sys.path_hooks

Moving on, let’s focus on the parts of the import process that let you do truly amazing things. Here I’m talking about stuff like pulling modules directly from Zip files or remote repositories, or just creating them dynamically based on, say, WSDL descriptions of Web services, symbols exported by DLLs, REST APIs, command line tools and their arguments… pretty much anything you can think of (and your imagination is likely better than mine). I’m also referring to “aggressive” interoperability between independent modules: when one package can adjust or expand its functionality when it detects that another one has been imported. Finally, I’m also talking about security-enhanced Python sandboxes that intercept import requests and can deny access to certain modules or alter their functionality on the fly.

All of these (and possibly much more) can be achieved through the usage of import hooks. There are two distinct types of them, usually referred to as meta hooks (defined in sys.meta_path) and path hooks (defined in sys.path_hooks). Although they are invoked at slightly different stages of the import flow, they are both built upon the same two concepts: that of a module finder and a module loader.

A module finder is simply an object which implements one specific method – find_module:

    finder.find_module(fullname, path=None)

It receives the fully qualified name of the module to be imported, along with the path where it’s supposed to be found (None for top-level modules, the parent package’s __path__ for submodules). The method is expected to do one of three things:

  • Raises an exception, aborting the import process completely.
  • Returns None, meaning that given module cannot be found by this particular finder. It can still be found during the next stages of import flow, either by some other custom finder or just the standard Python import mechanism.
  • Returns a loader object that is capable of actually loading the module.

The last case is of course the most interesting one, as it gracefully leads us to the concept of module loader. This one is an object that implements the load_module method:

    loader.load_module(fullname)

Again, the fullname parameter is the fully qualified name of the module that we want to import. The return value should be a module object – the final result of the whole importing process. Note that this could be something that was already imported; for such “duplicate” imports the loader should simply return the existing module:

    def load_module(self, fullname):
        # sys must be imported at the top of the hook's module
        if fullname in sys.modules:
            return sys.modules[fullname]
        # otherwise, do the loading magic

If anything goes wrong at this stage, the loader should raise an exception (typically just an ImportError).
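
To make the finder and loader roles concrete, here is a minimal sketch of a hook that creates modules out of plain dictionaries. It assumes Python 3.4 or newer, where the methods described above have successors: find_spec instead of find_module, and create_module plus exec_module instead of load_module. A single object plays both roles here. DictImporter and virtual_config are hypothetical names.

```python
import importlib.abc
import importlib.util
import sys


class DictImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader in one object; modules are built from dicts."""

    def __init__(self, modules):
        self.modules = modules  # {fullname: {attribute name: value}}

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.modules:
            # "I can load this one, and I am the loader."
            return importlib.util.spec_from_loader(fullname, self)
        return None  # not ours; let other finders try

    def create_module(self, spec):
        return None  # None requests the default, empty module object

    def exec_module(self, module):
        # Populate the fresh module instead of executing source code.
        module.__dict__.update(self.modules[module.__name__])


sys.meta_path.append(DictImporter({"virtual_config": {"DEBUG": True, "PORT": 8080}}))

import virtual_config
print(virtual_config.PORT)
```

With this protocol the import machinery itself takes care of sys.modules, so the “already imported” check from load_module is not needed in the loader.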

Writing your own importer

Here’s where most of the theory ends, as conveniently described in PEP 302. In practice both finder and loader can be the same object and the find_module method can simply return self. As an example, consider this simple hook which is intended to block some specific modules from being imported at all:

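    import sys

    class ImportBlocker(object):
        """Meta hook that refuses to import the given modules."""

        def __init__(self, *args):
            self.module_names = args

        def find_module(self, fullname, path=None):
            # Claim only the blocked modules; None defers to other importers.
            if fullname in self.module_names:
                return self
            return None

        def load_module(self, name):
            raise ImportError("%s is blocked and cannot be imported" % name)

    # Insert at the front rather than overwriting the list,
    # so that hooks installed by other code keep working.
    sys.meta_path.insert(0, ImportBlocker('httplib'))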

Once installed in sys.meta_path, it will intercept every attempt to import a new module and check whether its name exists on our list. This applies to indirect imports as well: if we attempt to use the Python Requests library:

    import requests

then it will also fail, as requests internally uses urllib3, which in turn uses the blocked httplib module.
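
The find_module and load_module methods belong to the original PEP 302 protocol. On Python 3.4 and newer, meta hooks are expected to implement find_spec instead, so a rough equivalent of the blocker might look like the sketch below. Raising the error straight from the finder is my simplification (the first of the three outcomes listed earlier), and http.client stands in for httplib, which is its Python 3 name.

```python
import importlib.abc
import sys


class ImportBlocker(importlib.abc.MetaPathFinder):
    """Meta hook that aborts the import of the given modules."""

    def __init__(self, *module_names):
        self.module_names = set(module_names)

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.module_names:
            raise ImportError("%s is blocked and cannot be imported" % fullname)
        return None  # everything else is imported normally


blocker = ImportBlocker("http.client")
# Hooks only see modules that are not imported yet, so drop any cached copy.
saved = sys.modules.pop("http.client", None)
sys.meta_path.insert(0, blocker)
try:
    import http.client
    blocked = False
except ImportError as error:
    blocked = True
    print(error)
finally:
    # Undo the demo so the rest of the program is unaffected.
    sys.meta_path.remove(blocker)
    if saved is not None:
        sys.modules["http.client"] = saved
```

Note that a hook like this cannot do anything about modules that were imported before it was installed, which matters if you intend to use it for sandboxing.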

A hook that is a total blocker doesn’t seem very useful, so let’s try something slightly different. Rather than refusing to import a particular module, we’ll proceed normally and issue a warning instead. Such a hook can help detect when deprecated Python modules are introduced to the project:

    import imp
    import logging
    import sys

    class WarnOnImport(object):
        def __init__(self, *args):
            self.module_names = args

        def find_module(self, fullname, path=None):
            if fullname in self.module_names:
                # Remember where to look; None means a top-level module.
                self.path = path
                return self
            return None

        def load_module(self, name):
            if name in sys.modules:
                return sys.modules[name]
            # Delegate to the standard machinery (top-level modules only).
            file_, pathname, description = imp.find_module(name, self.path)
            try:
                module = imp.load_module(name, file_, pathname, description)
            finally:
                if file_:
                    file_.close()  # imp leaves closing the file to the caller
            sys.modules[name] = module

            logging.warning("Imported deprecated module %s", name)
            return module

    sys.meta_path.insert(0, WarnOnImport('getopt', 'optparse'))  # etc.

In order to access the normal importing mechanism, we can use the imp module. Its functions find_module and load_module are roughly the equivalents of our import hook’s methods with the same names. But imp offers much more, as it also contains functions capable of creating modules from various inputs (e.g. load_source, load_compiled) or even creating them completely from scratch (new_module).
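
The imp module has been deprecated since Python 3.4 and was removed in Python 3.12, so the hook above only runs on older interpreters. On modern Python a warning hook does not need to load anything by itself: it can log from find_spec and return None, which lets the regular finders carry on. A minimal sketch under that assumption follows; note that the warning now fires just before the import rather than after it, and that the seen list is my addition for inspecting what was caught.

```python
import importlib.abc
import logging
import sys


class WarnOnImport(importlib.abc.MetaPathFinder):
    """Meta hook that logs a warning and then steps out of the way."""

    def __init__(self, *module_names):
        self.module_names = set(module_names)
        self.seen = []  # names we have warned about, in order

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.module_names:
            self.seen.append(fullname)
            logging.warning("Importing deprecated module %s", fullname)
        return None  # never claim the module; default finders import it


hook = WarnOnImport("getopt", "optparse")
sys.meta_path.insert(0, hook)

sys.modules.pop("getopt", None)  # force a fresh import for the demo
import getopt

sys.meta_path.remove(hook)  # uninstall once the demo is done
```

Because the hook never returns a spec, it stays out of the loading business entirely and cannot break the modules it reports on.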

What gives?

While all this is surely very interesting, we may doubt whether import hooks actually have any notable applications at all. There is surely potential for some really impressive things, including importing Python modules straight from remote URLs (security concerns notwithstanding). In my case, though, I had an actual need that import hooks seemed to satisfy best.

There is this great pytz package for supporting date and time calculations involving timezones. In general, it is really shaky ground to tread upon, where issues related to daylight saving time are among the easier ones to deal with. For the most part, pytz helps navigate the obstacles in an elegant manner, but there is one thing where it falls short.

The underlying timezone database has apparently no notion of usable, “generic” timezones – ones based solely on offset from GMT rather than precise location, and without any DST. That’s bad because generic timezones are useful in web development as a temporary choice for new users, based on automatic detection through the timezone offset obtained with JavaScript.

But the keyword here is ‘usable’. As a matter of fact, pytz has Etc/GMT+X timezones. However, due to some obscure, decades-old compatibility imperative they are the exact opposite of what we would expect them to be: their offsets are effectively negated. It means that Etc/GMT+2, for example, doesn’t refer to normal time in eastern Europe (or DST in central/western Europe) but to a timezone on the other side of the Prime Meridian, two hours behind GMT, which is almost unused except as DST for a few South American countries and Greenland. It goes without saying that this is completely and utterly insane.
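
The inversion is easy to get wrong, so here is a small sketch of the arithmetic. It relies on two facts: JavaScript’s Date.prototype.getTimezoneOffset() returns minutes and is positive for zones behind UTC, and the Etc/GMT names run from Etc/GMT-14 to Etc/GMT+12 in whole hours. The function name etc_zone_for_js_offset is hypothetical.

```python
def etc_zone_for_js_offset(js_offset_minutes):
    """Map a JavaScript getTimezoneOffset() value to an Etc/GMT zone name."""
    if js_offset_minutes % 60 != 0:
        raise ValueError("Etc/GMT zones only cover whole-hour offsets")
    hours = js_offset_minutes // 60
    if not -14 <= hours <= 12:
        raise ValueError("offset outside the range of Etc/GMT zones")
    if hours == 0:
        return "Etc/GMT"
    # Both conventions count "behind UTC" as positive, so the sign carries
    # over unchanged, however wrong the resulting name may look.
    return "Etc/GMT%+d" % hours


# A browser in eastern Europe (UTC+2) reports -120 minutes.
print(etc_zone_for_js_offset(-120))
# A browser three hours behind UTC reports 180 minutes.
print(etc_zone_for_js_offset(180))
```

In other words, the zone that is two hours ahead of GMT is the one called Etc/GMT-2.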

In cases like this there are usually two solutions. You can put a thin layer with the appropriate fix in front of the (already perfect) library interface and make sure that no one uses the library directly – and enforcing that is of course impossible. Or you can fork it and make the necessary changes – but then you’ll have to maintain the fork, manually pulling changes from upstream whenever the original timezone database is updated (which is every few months). Neither is satisfying; couldn’t we just use the library as it is, but somehow patch it on the fly, just before it’s used?…

Why, of course – hello import hooks! Using a relatively simple module finder and loader, we can easily achieve the desired effect and transparently expand the pytz library to include more useful generic timezones. The full code can be seen in this gist and it isn’t even long or complex.
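
The general shape of such a patch-on-import hook can be sketched as follows. This is not the code from the gist: it is a generic illustration for Python 3.4 or newer, with hypothetical names (PatchingFinder, PatchingLoader, add_marker) and the standard colorsys module standing in for pytz so that the example runs without extra dependencies.

```python
import importlib.abc
import importlib.util
import sys


class PatchingLoader(importlib.abc.Loader):
    """Wraps the real loader and runs a patch function after loading."""

    def __init__(self, wrapped, patch):
        self.wrapped = wrapped
        self.patch = patch

    def create_module(self, spec):
        return self.wrapped.create_module(spec)

    def exec_module(self, module):
        self.wrapped.exec_module(module)  # the normal import happens here
        self.patch(module)                # then we adjust the result


class PatchingFinder(importlib.abc.MetaPathFinder):
    def __init__(self, patches):
        self.patches = patches  # {fullname: callable taking the module}
        self._busy = False      # guards against finding ourselves

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in self.patches or self._busy:
            return None
        self._busy = True
        try:
            # Let the remaining finders locate the real module.
            spec = importlib.util.find_spec(fullname)
        finally:
            self._busy = False
        if spec is None or spec.loader is None:
            return None
        spec.loader = PatchingLoader(spec.loader, self.patches[fullname])
        return spec


def add_marker(module):
    module.PATCHED = True  # a real patch would add or replace attributes


sys.modules.pop("colorsys", None)  # force a fresh import for the demo
finder = PatchingFinder({"colorsys": add_marker})
sys.meta_path.insert(0, finder)
import colorsys
sys.meta_path.remove(finder)
print(colorsys.PATCHED)
```

The simple _busy flag is not thread-safe, which is acceptable for a sketch but would need a lock in code that imports from several threads.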

Further reading

That would be it for today’s write-up. If you want to learn more about the intricacies of Python’s import hooks – such as the meaning of sys.path_hooks, for example – the canonical source will of course be the appropriate PEP. Beyond that, there isn’t really any wealth of information to point at: some blog posts here and there, with this one being probably the most useful.

Author: Xion, posted under Programming

© 2014 Karol Kuczmarski "Xion".