Getting to the bottom of the Python import system

Mon Jan 16 2023, E. W. Ayers

What happens when you type import foo.bar.baz in Python? The answer is really complicated! Read this if you've ever found yourself asking:

The module and packaging system is the worst part of Python. I just want to understand how it works so that I can get on with programming.

Resources:

The unit of distribution

Suppose that you want to use the same code in multiple places. You have a few options:

Glossary

A Python module is a Python object with type ModuleType. Every module has a __name__ attribute. A module is just a Python object like, e.g., a class instance, and it has an attribute dictionary that you can get and set from. This dictionary contains all of the members of the module.

A package is a module with a __path__ attribute. The idea is that a package is a module that can contain other modules. If a module m is a member of a package p, then m.__package__ == p.__name__ (note that __package__ stores the package's name, not the package object itself). Note that packages defined in the above sense are not the units of distribution.
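These definitions can be checked directly with stdlib modules: math is a plain (extension) module, email is a package. The my_extra attribute below is made up for the demo:

```python
import types
import math       # an extension module, not a package
import email      # a package: a directory with an __init__.py
import email.message

assert isinstance(math, types.ModuleType)
assert not hasattr(math, "__path__")   # plain module: no __path__
assert hasattr(email, "__path__")      # package: has __path__

# __package__ holds the *name* of the containing package (a string)
assert email.message.__package__ == email.__name__ == "email"

# modules are ordinary objects: attributes can be set freely
math.my_extra = 42
assert math.my_extra == 42
```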

distutils is a module in the standard library for building and installing packages. It is deprecated (and removed in Python 3.12).

A wheel is a bundled-up, ready-to-install built distribution of a Python package (a .whl file). It contrasts with a source distribution or 'sdist', which has to be built before it can be installed.

Twine is a third-party utility for publishing packages on PyPI.

Here are the serious tools:

Defining terms

The special thing about modules is that they can be imported. This happens when you run an import m or from m import x statement in a Python interpreter.

The foo.bar.baz string in import foo.bar.baz as x is not a file path, and it is not the same as the normal point.x attribute-access operator in Python. It's best just to think of it as its own thing: a dot-separated list of strings. It's quite similar to the 'path' in a URL such as https://docs.python.org/3.11/library/site.html, where it's helpful to think of it as a way of addressing the 'file' 3.11/library/site.html, but in reality arbitrary code can be responsible for resolving this path into the content that is loaded.

When the interpreter runs import foo.bar.baz, it tries to turn this string into a Python module object in two steps:

  1. resolving the string to a ModuleSpec object.
  2. loading the module to actually create a Python module object.

Any modules that are imported are cached in a dictionary called sys.modules, so if you load the same module twice, it will not create two copies of the module. Both of these steps are completely extensible and you can write your own functions for resolving and loading python modules however you want.
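The caching behaviour is easy to see: importing the same module twice yields one shared object, and the cache lives in sys.modules.

```python
import sys
import json
import json as second_import

# both names are bound to one cached module object
assert second_import is json
assert sys.modules["json"] is json
```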

However, in 99% of cases the resolving and loading steps are doing the following thing:

  1. Start with the module name "foo.bar.baz"
  2. Make sure parent modules foo and foo.bar are imported.
  3. If there is a parent module, set paths = foo.bar.__path__; otherwise use paths = sys.path. The paths are file directories that the import system should look in to find modules. E.g. for me, numpy.__path__ == ['~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/numpy'].
  4. The system looks in each of the paths directories for either baz.py or baz/__init__.py.
  5. If it finds one of those, it returns a ModuleSpec whose loader is a SourceFileLoader.
  6. In the case of __init__.py, the module is a package (i.e. the module's __path__ attribute is set to the directory of the file)

How does Python decide what goes in sys.path? It's controlled by the site module.

Now that we have resolved the module to a ModuleSpec object, we are ready to load it. In the usual case the loader is a SourceFileLoader, and this is what it does:

  1. Compile the file, producing Python bytecode.
  2. Create a new module object.
  3. Add the module to sys.modules (this happens before execution, so that circular imports can find the partially-initialised module).
  4. Run the file; all the resulting globals are added to the module's attribute dictionary.
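These loading steps can be sketched by hand, compiling a made-up source string into a module object (the names demo_module and double are invented for the demo):

```python
import sys
import types

# a made-up module source, compiled and executed by hand
source = "x = 3\ndef double(n):\n    return 2 * n\n"

code = compile(source, "<demo>", "exec")   # 1. compile to bytecode
mod = types.ModuleType("demo_module")      # 2. create a fresh module object
sys.modules["demo_module"] = mod           # cache it in sys.modules
exec(code, mod.__dict__)                   # 3. run; globals land in the attribute dict

assert mod.x == 3
assert mod.double(mod.x) == 6

# future imports now hit the sys.modules cache
import demo_module
assert demo_module is mod
```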

Other things that are modules

Not all modules are Python source files or directories with __init__.py; by default, the following things are also modules:

Relative imports

...

__main__

...

Not to be confused with __main__.py.

Namespace packages

Namespace packages are packages that do not correspond to a single file directory containing an __init__.py file. A namespace package is a Python module whose __path__ attribute is a list of directories.

Python makes a namespace package whenever it traverses a directory that contains Python modules but no __init__.py file.
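A minimal sketch of this behaviour, building a namespace package split across two throwaway directories (the names nspkg_demo, a and b are made up):

```python
import os
import sys
import tempfile

# two independent directories, each holding half of the namespace package
base = tempfile.mkdtemp()
d1 = os.path.join(base, "one")
d2 = os.path.join(base, "two")
os.makedirs(os.path.join(d1, "nspkg_demo"))
os.makedirs(os.path.join(d2, "nspkg_demo"))
with open(os.path.join(d1, "nspkg_demo", "a.py"), "w") as f:
    f.write("VALUE = 1\n")
with open(os.path.join(d2, "nspkg_demo", "b.py"), "w") as f:
    f.write("VALUE = 2\n")

# note: neither directory contains an __init__.py
sys.path[:0] = [d1, d2]
import nspkg_demo

# the namespace package's __path__ spans both directories
assert len(list(nspkg_demo.__path__)) == 2
from nspkg_demo import a, b
assert (a.VALUE, b.VALUE) == (1, 2)
```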

Pip and venv

Venvs

When you make a virtual environment with python -m venv .venv.

pip install -e

Gotchas

a/
  b/
    c.py
  d/
    e.py
    __init__.py
  f.py

You can't just import files

Often you will be working in directory a and try to run import b.c, but this might fail. Annoyingly, it can fail when you run a Python script but not when you try it in the interactive interpreter. That's because a/ isn't in sys.path in the first case: the interactive interpreter puts the current working directory on sys.path, whereas python path/to/script.py puts the script's own directory there instead.

Gory details and extensibility


The complexity comes from:

Recommended reading is Chapter 5 of the Python Language Reference.

Quick reference

In most cases, a module is a Python source file in a directory, but this is not always true.

What happens when you import?

We'll come back to relative imports. When you type import foo.bar.baz as x, this is syntactic sugar for x = importlib.import_module('foo.bar.baz'). If we were to reimplement import_module, it would look something like this:

  1. Check the sys.modules cache to see if it's already there.
  2. Resolve the module by calling importlib.util.find_spec(name), to return a thing called a ModuleSpec. A module spec is a load of metadata about the module and a Loader object that decides how the module object is created and initialised.
  3. Create the module object using the given Loader object
  4. Add metadata attributes like __name__ to the module
  5. Add it to sys.modules
  6. Initialise the module.
  7. Return the module

In pseudo-python:

def import_module(name: str):
    # if the module is already loaded in the
    # sys.modules cache, just return that
    if name in sys.modules:
        m = sys.modules.get(name)
        assert m is not None
        return m
    # resolve the module name
    spec: ModuleSpec = importlib.util.find_spec(name)
    if spec is None:
        # we couldn't find a module with that name
        raise ModuleNotFoundError(name)
    # create the module
    module = spec.loader.create_module(spec)
    # add metadata attributes to the module:
    # i.e. __name__, __spec__, __package__, __file__, ...
    _init_module_attrs(spec, module)
    sys.modules[name] = module
    # initialise the module
    spec.loader.exec_module(module)
    return module

Caveats:

What is importlib.util.find_spec doing?

How this works is really complicated. The basic task is to take a module name and spit out a ModuleSpec, which is all of the information needed to load a module into the python runtime.

Summary

Let's start by stating the usual path that find_spec takes:

  1. Start with the module name "foo.bar.baz"
  2. Make sure parent modules foo and foo.bar are imported.
  3. If there is a parent module, set paths = foo.bar.__path__; otherwise use sys.path. The paths are directories that the import system should look in to find modules. E.g. for me, numpy.__path__ == ['~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/numpy']. sys.path contains, among other things, your site-packages directory and the paths of any folders you have run pip install -e on.
  4. The system looks in each of the paths directories for either baz.py or baz/__init__.py.
  5. If it finds one of those, it returns a ModuleSpec whose loader is a SourceFileLoader.
  6. In the case of __init__.py, the module is a package (i.e. the module's __path__ attribute is set to the directory of the file)
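You can watch this happen with importlib.util.find_spec on the stdlib json package (a package resolves to its __init__.py, while a plain submodule like json.decoder has no search locations):

```python
import importlib.util
from importlib.machinery import SourceFileLoader

# json is a stdlib package: its spec points at json/__init__.py
spec = importlib.util.find_spec("json")
assert spec is not None and spec.name == "json"
assert spec.origin.endswith("__init__.py")
assert isinstance(spec.loader, SourceFileLoader)
assert spec.submodule_search_locations is not None  # it's a package

# json.decoder is a plain submodule: no search locations
sub = importlib.util.find_spec("json.decoder")
assert sub.origin.endswith("decoder.py")
assert sub.submodule_search_locations is None
```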

Longer Summary

  1. Start with the module name "foo.bar.baz"
  2. Make sure parent modules foo and foo.bar are imported.
  3. If there is a parent module set paths = foo.bar.__path__ or sys.path otherwise.
  4. For each 'meta finder' in sys.meta_path, run find_spec("foo.bar.baz", paths).
  5. Usually, this falls through to the last finder in the sys.meta_path list called PathFinder.
  6. PathFinder runs hook(p).find_spec("foo.bar.baz") for each p in paths and each hook in sys.path_hooks, and returns the first result that doesn't throw an ImportError or return None.
  7. Usually, this falls through to a FileFinder(p).find_spec('foo.bar.baz') which does the following.
  8. Get the tail module name: "baz". We succeed if any of the following exist in the directory p: baz.py, baz/__init__.py, or a bare directory baz/ (a 'namespace package'; we'll come back to this case).
  9. A ModuleSpec is returned with the loader being a SourceFileLoader. If the extension above was .pyc then a SourcelessFileLoader is used.

The Gory Details

There is a list of MetaPathFinder objects living in sys.meta_path. You can modify sys.meta_path to include your own entries. A MetaPathFinder has one method, find_spec, that returns a module spec given a module name and an optional list of file paths to look in to find the module.

importlib.util.find_spec will run through all of the finders in sys.meta_path, making sure that parent packages (i.e. modules with a __path__ attribute) are imported first. If there is a parent module (e.g. foo is the parent package of foo.bar), foo.__path__ is passed as the path argument to the finder. The pseudocode for this is below.

def find_spec(name, paths=None, target=None):
    parts = name.split('.')
    # parts == ["foo", "bar", "baz"]
    if len(parts) > 1:
        parent_name = ".".join(parts[:-1])  # 'foo.bar'
        parent_module = import_module(parent_name)
        paths = parent_module.__path__
        # the __path__ field on parent_module is
        # a list of file paths that are used to resolve the module
    for finder in sys.meta_path:
        spec = finder.find_spec(name, paths)
        if spec is not None:
            return spec
    return None

There are lots of MetaPathFinders in sys.meta_path that do various things, and libraries like to add their own too. The main, fallback finder is called PathFinder (source) and essentially does the following (+ caching + error handling + legacy + 'namespaces'):

class PathFinder:
    @classmethod
    def find_spec(cls, fullname, paths=None):
        if paths is None:
            paths = sys.path
        for path in paths:
            # find the first hook that doesn't throw
            finder = None
            for hook in sys.path_hooks:
                try:
                    finder = hook(path)
                    break
                except ImportError:
                    continue
            if finder is None:
                continue
            spec = finder.find_spec(fullname)
            if spec is None:
                continue
            return spec
        return None

So, there is a list of functions called sys.path_hooks of type List[Callable[[str], PathEntryFinder]], where each returned PathEntryFinder is yet another abstract class that you have to call find_spec on, this time with no path argument.

In sys.path_hooks, the default two 'path hooks' are a zip importer and a FileFinder (source). FileFinder is the main one. A FileFinder is initialised with a path : str, the directory that the finder is in charge of searching, together with a list of extension suffixes (".py", ".pyc", ...) and corresponding loaders (SourceFileLoader, SourcelessFileLoader). For each suffix x, FileFinder looks for a file p/baz.x or p/baz/__init__.x and returns a ModuleSpec with the relevant loader.
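You can drive a FileFinder by hand. Here is a sketch that points one at the stdlib json package's own directory and asks it to resolve the decoder submodule:

```python
import os
import json
from importlib.machinery import FileFinder, SourceFileLoader

# point a FileFinder at the stdlib json package's own directory
directory = os.path.dirname(json.__file__)
finder = FileFinder(directory, (SourceFileLoader, [".py"]))

# looks for decoder.py or decoder/__init__.py in that directory
spec = finder.find_spec("decoder")
assert spec is not None
assert spec.origin.endswith("decoder.py")
assert isinstance(spec.loader, SourceFileLoader)

# no matching file means no spec
assert finder.find_spec("does_not_exist") is None
```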

How to extend find_spec?

So, if you want to extend the module loading system with your own stuff, you can:
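As one sketch of this: register a MetaPathFinder of your own at the front of sys.meta_path. The module name hello_virtual and its greeting attribute are invented for the demo; the finder serves that one module out of thin air instead of from a file.

```python
import sys
from importlib.abc import Loader, MetaPathFinder
from importlib.machinery import ModuleSpec

class HelloLoader(Loader):
    def create_module(self, spec):
        return None  # None means "use the default module creation"

    def exec_module(self, module):
        module.greeting = "hello"

class HelloFinder(MetaPathFinder):
    def find_spec(self, fullname, path=None, target=None):
        if fullname == "hello_virtual":
            return ModuleSpec(fullname, HelloLoader())
        return None  # decline: let the other finders handle it

sys.meta_path.insert(0, HelloFinder())

import hello_virtual
assert hello_virtual.greeting == "hello"
```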

Why is this so complicated?

  1. Caching: Each of the stages I outlined above also has a caching stage. Additionally, you need mechanisms to invalidate the cache so you can do live-reload operations.
  2. Legacy: there used to just be one finder class called Finder, but this wasn't good enough because you need to be able to use different finders for different cases, so an extra layer of meta-finders was added to find the finders.
  3. Nitpicky edge cases:
    • namespace modules
    • packages
    • loading modules from non-python source
    • loading modules direct from archives
    • lots of different places where packages can be stored: environments, conda, the internet etc.

How does the import system decide to add __path__?

Given any module, you can make it a package by simply adding a __path__ attribute. However, if your module comes from an __init__.py file, the import system automatically sets __path__ to the directory containing that file.
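A small sketch of the first point: a hand-made module (the name manual_pkg is invented) becomes a package just by giving it a __path__, after which submodules can be imported from that directory.

```python
import os
import sys
import tempfile
import types

# an ordinary module object, created by hand
pkg = types.ModuleType("manual_pkg")

# a directory holding a submodule source file
base = tempfile.mkdtemp()
with open(os.path.join(base, "child.py"), "w") as f:
    f.write("VALUE = 7\n")

# adding __path__ is what makes the module behave as a package
pkg.__path__ = [base]
sys.modules["manual_pkg"] = pkg

import manual_pkg.child
assert manual_pkg.child.VALUE == 7
```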

What about relative imports?

A relative import is an import where the module name starts with a dot: for example, from . import foo or from .foo import x (note that import .foo is actually a syntax error; relative imports must use the from form). To resolve it, you take the current module m that is running the import; you take the containing package's name m.__package__ (caveats); and you prepend that to .foo and do an absolute import.

If there are multiple dots, as in from ..foo import x, you repeat the parent-finding process once for each additional dot.
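This resolution can be seen with a throwaway package (the names relpkg_demo, helper and user are made up): a relative import inside user.py resolves via user.__package__.

```python
import os
import sys
import tempfile

# a throwaway package whose modules import each other relatively
base = tempfile.mkdtemp()
pkg = os.path.join(base, "relpkg_demo")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "helper.py"), "w") as f:
    f.write("VALUE = 42\n")
with open(os.path.join(pkg, "user.py"), "w") as f:
    # '.helper' is resolved using user.__package__ == 'relpkg_demo'
    f.write("from .helper import VALUE\n")

sys.path.insert(0, base)
from relpkg_demo import user
assert user.VALUE == 42
assert user.__package__ == "relpkg_demo"
```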

This definition of relative imports sucks, because it means that your Python files need to be inside a package in order to import from each other. The shortcut way to do this is to just add __init__.py files everywhere.

I recommend never using relative imports except inside of __init__.py files. It's just not worth it.

What are namespace packages?

A namespace package is a Python package that doesn't have an associated __init__.py module. The idea is that you can split a package across multiple directories. See this Stack Overflow answer for more detail. Adding namespace packages complicates the logic for find_spec.

Sadly, __main__.

When you execute a Python file with python foo.py, the given file is not loaded as the module foo. Instead, it is loaded as a special module called __main__. The main problem that this causes is that it breaks relative imports, since the __main__ module's __package__ attribute is not set to a useful package name. The usual recommendation seems to be that you should just avoid using relative imports.
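A quick check of the renaming: ask a fresh interpreter what a directly-executed program calls itself.

```python
import subprocess
import sys

# run a trivial program directly and print its module name
result = subprocess.run(
    [sys.executable, "-c", "print(__name__)"],
    capture_output=True,
    text=True,
)
# the program is loaded as __main__, whatever its source is
assert result.stdout.strip() == "__main__"
```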

Importing resources

[todo] this section is still under construction [todo]

Another cool thing that you can do with the Python import system is 'import' files that are not Python files. You can import data files or executable binaries.

Usually, if you want to get a file from a Python script you will call open('path/to/file'), but this assumes that you know where the file is on disk. By 'importing' files, you can ensure that the files are present wherever your Python package is called from, even if it is downloaded from PyPI.

There are two sites that told me this existed:

I'll stick with the example given in 'importlib-resources'. We have this folder structure:

mypkg/
  __init__.py
  resource.txt
  foo.py

Now in foo.py I can write:

from importlib.resources import files
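A self-contained sketch of how that import is used (the package mypkg_demo and its resource.txt are built on the fly here, standing in for the mypkg layout above): files() returns a Traversable rooted at the package directory, and joinpath/read_text fetch the resource.

```python
import os
import sys
import tempfile
from importlib.resources import files

# build a throwaway package on disk standing in for mypkg
base = tempfile.mkdtemp()
pkg_dir = os.path.join(base, "mypkg_demo")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "resource.txt"), "w") as f:
    f.write("hello resource")
sys.path.insert(0, base)

# files() resolves the package and returns a Traversable
text = files("mypkg_demo").joinpath("resource.txt").read_text()
assert text == "hello resource"
```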

Module resolution failures that always get me

[todo] DON'T SET THE PYTHONPATH ENVIRONMENT VARIABLE. IF YOU HAVE TO DO THIS YOU HAVE FAILED AS A PERSON.

Relative imports are doom

It's possible to end up importing the same module twice with different args.

Don't trust your IDE to do module resolution

Basic importing from a directory is broken

Suppose our working directory looks like this:

asdf/
  b.py # ← from asdf.c import X; Y = 5; print(X)
  c.py # ← X = 4
a.py # ← from asdf.c import X; print(X)

If I run python asdf/b.py, it will refuse to resolve c.py (no module named asdf). If I run python a.py, it will be ok!

One answer is to replace the import in b.py with from c import X. Then you can run python asdf/b.py and it's ok. But now, if we add a line from asdf.b import Y to a.py, we will get "no module named c".

I can't see how this is anything other than a flaw in Python. There is no way to import between the directories that doesn't break.

I usually get around this by making the root project folder a package with a pyproject.toml, and then running pip install -e .. But it's so miserable that I have to do that.