Appendix: More About Packaging and Installation

In the previous section we gave instructions for installing a particular Python working environment, with the desired libraries. Here we will give a little background information about packaging systems in general, followed by recommendations for installation in a different environment: the UH-HPC cluster.

Packages and Environments

Modern software is built in layers of collections of components. Low layers include the computer's built-in BIOS (Basic Input-Output System), whose modern successor is UEFI (Unified Extensible Firmware Interface), and the boot loader, a mini-operating system that loads and starts the real operating system, a higher layer. At the operating system level (Windows, OS X, Linux, etc.; yes, there are many more) things start to get fuzzy, because each operating system has some core functions plus many options for adding additional functions. Especially in the Linux world, the root user may add or remove such functions, which are organized in packages that may contain one or more libraries, one or more executable programs, data files, and configuration files. Moving up yet another level, we have user space: the sorts of programs ordinary people use to do things on the computer. Again, the vast selection of optional programs (browsers, compilers, interpreted languages, editors, graphical display programs; the list is endless) is generally divided into packages containing the same sorts of things as operating system packages.

A principle of working with computers and programming is “Don’t Repeat Yourself” (DRY), so packages are often designed such that each one provides a bit of functionality that can be used in many other packages, which then depend on it. Installing one high-level package might require the installation of ten lower-level packages, each of which might in turn depend on more packages, and so on. As if that weren’t bad enough, different high-level packages might depend on different versions of a given lower-level package, such as a library that renders letters and numbers as arrays of dots on the screen or on a printer. Chaos! Gridlock!

The solution to problems with programs always seems to be more programs, and in this case it is a class of programs called package managers, which are designed specifically to manage the installation of software, including the ability to upgrade to newer versions, switch back to earlier versions, or uninstall a package entirely. Most common Linux distributions today are based on either Debian package management (e.g., Debian, Ubuntu, Mint) or the RPM system (e.g., Red Hat, CentOS, OpenSUSE). Apple does things very differently, but several Linux-like package management systems have been developed to facilitate installing much of the software found in Linux distributions in a similar fashion. The one that I use and recommend is Homebrew, which we introduced in Installing a Python working environment with UH software. The big difference between installing packages that come with a Linux distribution and installing on OS X with Homebrew is that the Linux installations are always done with root privileges (usually using sudo) and land in standard system locations, while Homebrew puts everything in the /usr/local directory and its subdirectories.

In principle, we could stop here; it is possible to install Python and a good set of Python libraries using the appropriate Linux package manager, or Homebrew on OS X. Libraries and programs that are not available, or for which one needs a more recent version, can be installed manually by the user, either in the user's directory tree, or using sudo in system locations (normally in the /usr/local tree, to keep them segregated from system-supplied packages). (The modern Python utility specifically for installing Python libraries is pip; more about that below.) In practice, however, this doesn't work well for many use cases. It can become hard to keep track of what is supplied by the system and what has been installed manually, and at any given time one can have only one version of each manually installed package. That is a problem when a software ecosystem is developing as fast as the Python science stack.
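For example, this is what the two manual variants look like with pip (the package name here is a placeholder):

python3 -m pip install --user SomePackage    # user's directory tree, under ~/.local
sudo python3 -m pip install SomePackage      # system-wide, typically under /usr/local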

Python developers built tools to get around this problem: virtual environments, supported by the standard venv library module. It allows one to build, and switch among, any number of environments in which different packages and versions are installed with pip. Problem solved? No. For the scientific software stack, with its plethora of compiled libraries, venv plus pip just doesn't work very well.
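For completeness, here is a minimal sketch of the venv-plus-pip workflow (the environment path is arbitrary):

python3 -m venv ~/venvs/science        # create a virtual environment
source ~/venvs/science/bin/activate    # activate it
python -m pip install numpy            # install packages into it with pip
deactivate                             # leave the environment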

The solution: yes, more programs! This time the central program is conda, which is both a package manager and a manager of its own very powerful and easy-to-use virtual environments. For an early and enthusiastic explanation see Jake VanderPlas's conda myths blog post. Conda is written in Python, but it is a general-purpose, multi-platform (Windows, Mac, Linux) package manager, unlike any that came before it. It works with its own package format, package installation method, and package repositories. It was developed and is still maintained as Open Source by a company, Anaconda, which also provides a distribution (a startup installer, a set of packages, and a repo with additional packages), also called "Anaconda". Conda has been so successful that another company has recently developed an alternative implementation, Mamba, in a bid to improve performance with large repos. It's too early to say whether it will become widely used.

Starting with the Anaconda distribution is quick and easy, supplying access to most of the packages we would want to use, but it is not necessarily the best long-run approach. What I and many others recommend instead is using Miniconda to set up a minimal base environment. One then creates a working environment with the desired set of packages. Any number of additional working environments may be created to test new versions, host different sorts of functionality, etc. The primary function of the base environment is to support conda itself; nothing other than conda and its dependencies needs to be installed or upgraded there.
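Once such a setup is in place, routine use might look like this (environment names are whatever you have created):

conda env list                # shows base plus any working environments
conda update -n base conda    # the only maintenance base normally needs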

Conda Channels and conda-forge

Collections of conda-compatible packages are held in repositories called “channels”. Regardless of whether one starts with Anaconda or Miniconda, there is a defaults channel from which Anaconda packages are served. Because this is designed to meet the needs of the Anaconda company and its customers, it is relatively stable. Although it holds a very large number of packages, it can’t include everything that conda users might want. Therefore a fully community-driven channel called conda-forge has been developed to provide more rapid updating and to include a broader selection of packages. For example, the GSW-Python package for seawater equation of state calculations is available in conda-forge but not in defaults. For the most part, packages from these two channels should be compatible with each other (although in the early days there were some problems), but I prefer to stick with conda-forge whenever possible.
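Even without changing any configuration, a package can be installed from a specific channel; for GSW-Python (whose conda-forge package name is gsw) that would look like:

conda install -c conda-forge gsw    # one-off install from the conda-forge channel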

Pip and conda

A basic recommendation is: install from conda-forge all packages that are available there; for packages that are not, use pip to install either from PyPI or from a source code repository, which might be a clone on your own machine. If using pip, check first to see what dependencies your package has, and install them with conda if they are available that way. When using pip, be sure you are in the appropriate activated conda environment. Conda recognizes pip-installed packages, so using pip in this way works well.
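A sketch of that workflow (the environment and package names here are placeholders):

conda activate myenv                # work inside the right environment
conda install requests              # a dependency available from conda-forge
python -m pip install somepackage   # not on conda-forge, so use pip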

Installation for a user of the UH-HPC cluster

Clusters are set up and managed centrally, so setting up and using the conda package manager and conda-installed packages is a little different from what it would be on a user's own Linux box or Mac. There are three types of difference:

  1. The user has no ability to install system-level packages. Instead, the administrators install packages together with modules that the user can load to put those packages on the user’s path, thereby making them accessible.

  2. The user cannot write to their own .bashrc file, even though the standard Linux permissions show the file as user-writable; writing is blocked at a different level.

  3. Some libraries must be compiled specifically for HPC use, so packages from conda-forge using the Message Passing Interface (MPI) might not work correctly. See this note on MPI in conda-forge. I don’t know how this applies to the UH-HPC.

The UH-HPC admins have created a module giving access to a base Anaconda environment at the system level, meaning the user cannot change anything in that environment. It is possible, however, to use that as a base environment for the creation of new working environments in the user's directory tree (specifically, hidden in the ~/.conda/ subdirectory). This still has some disadvantages compared to using a normal Miniconda installation: primarily, that conda itself (which must live in the base environment) cannot be updated, and that the newer method of activating and deactivating environments with conda is unavailable.
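A sketch of that approach (the module name here is a placeholder; check module avail for the real one):

module load lang/anaconda             # puts the system conda on the PATH
conda create -n mywork python=3.8     # environment lands in ~/.conda/envs/
source activate mywork                # the older activation method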

A naive attempt to install Miniconda fails because .bashrc is not writable. But there is a reasonably clean and simple workaround: use zsh instead of bash when working with conda or in conda environments.

We will now illustrate how to set up a Miniconda installation in a user's home directory. We are assuming the directory is freshly prepared for a new user; if not, at least check ~/.bash_profile to see that nothing Python-related has been put on the PATH, either directly or via module loading. If .conda/ and/or .condarc are present, it would be advisable to rename or delete them.
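A quick way to check for such leftovers:

grep -i -E 'conda|python' ~/.bash_profile    # anything Python-related on the PATH?
ls -d ~/.conda ~/.condarc                    # leftover conda files?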

Starting in either a login node or a compute node, make zsh available by appending this line to the .bash_profile file:

module load tools/zsh
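One way to do that from the command line is:

echo 'module load tools/zsh' >> ~/.bash_profile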

At this point there is no .zshrc file, so make one that starts with the following line:

. /opt/ohpc/admin/lmod/lmod/init/zsh # Needed to run modules in zsh.

Now log out completely, log back in, and start an interactive session on a compute node. Download the Miniconda installer:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Run it like this, but at the end, answer “N” to the question about shell initialization:

bash Miniconda3-latest-Linux-x86_64.sh

Now tell it to initialize for zsh:

~/miniconda3/condabin/conda init zsh

Switch into zsh by executing:

zsh

and you should be in your fresh base environment. Now would be a good time to tell conda to prefer conda-forge. Execute:

conda config --add channels conda-forge
conda config --set channel_priority strict
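You can verify the result; these commands just print the current settings:

conda config --show channels
conda config --show channel_priority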

Make your first working environment with some basics, and switch to it:

conda create -n py38 python=3.8 matplotlib scipy ipython
conda activate py38
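A quick sanity check that the new environment works:

python -c "import matplotlib, scipy; print(scipy.__version__)"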

Of course, at any time you can use conda to add more packages, use pip to install packages that are not available from conda-forge, make and populate new environments, etc. Just be careful to activate whichever environment you want to work with.

Now, completely log out, and log back in to a login node. You can start an interactive compute session directly with zsh, or start first with bash and then execute zsh in the resulting terminal to switch. An example of the first case would be:

srun -I30 -p sandbox -N 1 -c 1 --mem=6G -t 0-01:00:00 --pty zsh
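An example of the second case, with the same srun options, would be:

srun -I30 -p sandbox -N 1 -c 1 --mem=6G -t 0-01:00:00 --pty bash
zsh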

Your terminal should now be in a compute node, running zsh, with the base environment activated. Don’t forget to activate your desired working environment.

If you don’t want zsh to come up with the base activated you can use one more conda configuration option:

conda config --set auto_activate_base false

Like the other conda config commands, this is actually writing to the ~/.condarc file.
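After the commands above, that file should contain something like the following (exact contents may vary with the conda version):

channels:
  - conda-forge
  - defaults
channel_priority: strict
auto_activate_base: false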

Note that you can activate any environment from scratch; conda will be on your path in zsh even when you have not activated an environment, so you can directly:

conda activate py38

for example, from your zsh shell, regardless of whether any other environment is active.