Advanced knowledge about Python (>=3.3) package structure and import model

A package is a group of reusable modules organized in one or a hierarchy of folders. Although modules themselves are already reusable without being bundled in a package, a package structure allows the code to be published and used by other programmers. This blog post addresses some advanced package development issues which are not present in module development.

Import search path

To use a package or another module, one needs to import it by specifying its name. Python will look for this name in a series of directories specified in the sys.path variable, which is accessible after import sys.

In default, sys.path includes the following directories in order:

The home directory of the program
PYTHONPATH directories (if set)
Standard library directories
The contents of any .pth files (if present)
The site-packages home of third-party extension

The home directory depends on how you run the code. If you are running a program, it is the directory containing the program’s top-level script file. If you are working interactively, it is the current working directory. A complete exposition of these directories is available in Mark Lutz’s Learning Python (5th edition, P679–680).

It is worth mentioning that Python searches names only in these immediate directories, not any subdirectories.

What to import

In Python, there are three different importable entities, and they all share the same syntax:

Module. A module is a file, either located in a package or not.
Package. A package is a folder containing a top-level __init__.py file. It generally contains one or more modules and/or other packages. In this post, I refer to them as submodules and subpackages, respectively.
Namespace package. The namespace package is available only since Python 3.3. It is one or more folders sharing the same name and containing no top-level __init__.py files.

In this post, I will focus on modules and packages. As for namespace packages, they are not a requirement for most programmers. Only the team leader of a large, loosely coupled team needs to master the concept of namespace packages.

Unless specified, all packages in this post refer to regular packages, not including the namespace packages.

Package structure

There are two types of package structures (single-level and multi-level) along with a non-package structure.

The non-package structure is just a single, self-contained file:

module.py

It often includes a section of unit test code of the form:

if __name__ == '__main__':

Here is the single-level structure:

Package
    |---__init__.py
    |---_submodule1.py
    |---_submodule2.py
    ...

Here is the multi-level structure:

Package
    |---__init__.py
    |---subpackage1
        |---__init__.py
        |---_submodule1-1.py
        ...
    |---subpackage2
        |---__init__.py
        |---_submodule2-1.py
        ...
    ...

All subpackages need to have a __init__.py file, just like the top-level package.

File known as `init.py`

This file is a must since Python 3.3 for a folder to be a package. Folders without this file are regarded as potential namespace packages.

This file has the following four functionalities:

Package declaration. It declares a package. A package has priority over a module with the same name in the same directory during the import process. That is, __init__.py works as a safeguard to ensure that this very package, not anything else, is imported.
Package initialization. This file is automatically executed when the package is imported. Therefore, it is the ideal place to create data files, open connections to databases, and set RNG seeds.
Namespace initialization. Names are generally defined in each submodule of the package. However, there exist some names having no apparent association with any submodules. These names can be defined in __init__.py. In addition, during an interactive IPython session, the <tab>-autocompletion can identify only these names when only the package (no submodules) is imported.
Exporting names. It has a module attribute known as __all__, which is a list of strings. This attribute declares all names exported when the package is imported by from * statement. Of course, all the names in __all__ need to be defined or imported from elsewhere for them to be importable.

Absolute versus relative import

While relative import refers to the import statement using relative paths, absolute import refers to the import statement using absolute paths. Both methods of import refer to package import, not standalone module import.

While relative import makes the package more robust against the migration of the importee, absolute import makes the package more robust against the migration of the importer. In addition, relative import is shorter to type.

There is an important change about relative import from Python 2.X to Python 3.X. In 2.X, an undotted import is relative-then-absolute. In 3.X, an undotted import is absolute-only.

The full setups can be summarized in the following table:

	2.X	3.X
`import xxx`	relative-then-absolute	absolute-only
`from xxx import yyy`	relative-then-absolute	absolute-only
`from .xxx import yyy`	relative-only	relative-only

To adopt the 3.X behavior in a 2.X module, one needs to run from __future__ import absolute_import at the beginning of the module (probably after the docstring). This technique also works in an interactive session.

Absolute import is valid in all kinds of modules, whereas relative import is invalid in programs or modules imported not as a portion of a package. For instance, when it is imported by a program in the same directory, it is imported as a regular module, not a submodule or a package.

From the table, we can see that Python 3.X distinguishes relative import from absolute import and shows no ambiguities. Therefore, the programmers have to choose between these two import methods. Skillful programmers will use relative import for weight lifting modules and absolute import for the test modules. This convention is the one adopted by scikit-learn.

Package versus program

The above revolution in Python 3.X is not without consequences. The dotted import not only makes the module no longer able to serve as both a package submodule and a program but also requires programs calling them to use absolute import.

It explains why the test modules use absolute import. On the one hand, absolute import allows the weight lifting source code to use relative import. Also, when relative import is used by the importee, the importer cannot use module import. On the other hand, absolute import allows each test module to be run as an individual program. That is, you can run individual tests separately rather than invoke the whole test suite every time.

The rule of thumb is that if a module is intended as both a package module and a program, it should use absolute import. For package-only modules, the convention is to use relative import. For program-only modules, the convention is to use absolute import and preferably also put them outside the package.

Publish the package

python setup.py sdist bdist_wheel
twine check dist/*
twine upload dist/*

Written on April 1, 2021