The awesome people at fast.ai started the 2020 iteration (aka ‘part1-v4’) of their wildly popular Deep Learning Part I course earlier this month, running it entirely online because of covid-19 social (distancing) responsibility.
The course is using the brand new fastai v2 library (fastai2, currently in pre-release) along with PyTorch, and makes a start in covering the content of their upcoming book.
Installation of the fastai v2 library can be pretty straightforward using conda and pip. It is also well-supported on various cloud GPU platforms such as Paperspace and Colab. However, as with many other cutting-edge deep learning software stacks that typically involve quite frequent updates and changes (for bugfixes, performance enhancements, etc.), it can be a challenge to get everything set up in a multi-user HPC environment without the risk of affecting other users’ software packages needed for production work.
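On a single-user machine, the install is usually a one-liner. As a rough sketch only (the exact package names and conda channels here are my assumptions and may well have changed since, so do check the fastai repo/docs for the current instructions):

# fastai v1 via conda
conda install -c pytorch -c fastai fastai
# fastai v2 pre-release via pip (assuming it is published on PyPI as fastai2 at the time of writing)
pip install fastai2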
Containerisation technology presents a possible solution to these challenges, by enabling self-contained (hah!) containers that can be built and deployed with all their internally consistent dependencies, without affecting other parts of the host system or other containers. Docker is arguably the most well-known container system right now, but it might not necessarily be the best for a multi-user HPC environment used for project and production work (rather than experimentation), as it can be difficult to ensure that the correct user/group permissions on the host system are replicated and honoured in Docker containers. There also seems to be a potential risk of undesired privilege escalation to root access, due to the way that the Docker daemon works, which is again a problem for a multi-user production HPC.
My quick search suggested that a different container system, Singularity, might be better suited for my use case. The article here helpfully describes some of the problems with Docker defaults that can be solved by Singularity. Even though I do not have sudo permission on the multi-user HPC, I can build Singularity containers with fastai2 on a different machine where I do have sudo, e.g. a cheap and cheerful small cloud instance. And when I (and/or others) run the container on the HPC, it will natively support NVIDIA’s CUDA GPU compute for deep learning, honour user/group permissions and filesystem access on the HPC, and will not break or interfere with other software stacks on the HPC used by other users (e.g. finite element analysis with MPI, or GPU-enabled CFD with a different CUDA version). This gives me the flexibility to experiment and tinker with the latest development version of fastai2 (or other deep learning packages) without having sudo on the HPC, and to prepare and share Singularity containers with functioning fastai2 installations, while retaining the rigidity and stability needed for existing software with potentially conflicting dependencies and project-based user security permissions on the HPC.
I have not been using Singularity containers for very long yet, but I will try to describe the steps I took to build a Singularity container with an editable install (i.e. one linked to an update-able Git repository) of fastai2.
Installing Singularity
Firstly, Singularity will need to be installed by the sysadmin on the HPC, by just following the installation guide. If a separate machine/instance is used to build the Singularity containers (as in my case), then Singularity needs to be installed there too, and root permission is needed for the container build.
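For reference, a source build of Singularity 3.x looks roughly like the sketch below (Ubuntu/Debian package names, and the Go toolchain and release version/URL are assumptions on my part); the official installation guide is the authoritative source for the current release:

# build dependencies (Ubuntu/Debian package names)
sudo apt-get update && sudo apt-get install -y build-essential libssl-dev uuid-dev \
    libgpgme11-dev squashfs-tools libseccomp-dev pkg-config git wget cryptsetup
# Singularity 3.x is written in Go, so install a Go toolchain first (see the guide)
# then download and compile a release (version and URL assumed; check the releases page)
export VERSION=3.5.3
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz
tar -xzf singularity-${VERSION}.tar.gz && cd singularity
./mconfig && make -C builddir && sudo make -C builddir install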
Creating Singularity def file
Next, a Singularity definition file (similar to Docker’s Dockerfile) is created, to have all the steps needed to build the container with the software (fastai2 in this example) and its dependencies (e.g. fastai v1 library, fastcore, etc.), plus any ancillaries (e.g. Jupyter Notebook).
Singularity containers can be bootstrapped from Docker images (which are more popular and widely available), and so in the def file I start with NVIDIA’s own Docker image containing CUDA:
BootStrap: docker
From: nvidia/cuda
Then, define the environment variables that will be set at runtime (i.e. when the container is used):
%environment
export LANG=C.UTF-8
export PATH=$PATH:/opt/conda/bin
export PYTHON_VERSION=3.7
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
The next bit contains the steps that will be used to install fastai2 and its dependencies, within the %post section of the def file. Again, start by defining the same environment variables, which are also used at build-time (as opposed to runtime, mentioned above):
%post
export LANG=C.UTF-8
export PATH=$PATH:/opt/conda/bin
export PYTHON_VERSION=3.7
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
Then, install the software and tools needed to set up fastai2 later on. The default OS in NVIDIA’s CUDA Docker image is Ubuntu, and so apt-get is used for this step. I also update pip, and install miniconda, as conda will be used in the next step.
apt-get -y update
apt-get -y install --no-install-recommends build-essential ca-certificates \
git vim zip unzip curl python3-pip python3-setuptools graphviz
apt-get clean
rm -rf /var/lib/apt/lists/*
pip3 install --upgrade pip
curl -o ~/miniconda.sh -O \
https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
&& chmod +x ~/miniconda.sh \
&& ~/miniconda.sh -b -p /opt/conda \
&& rm ~/miniconda.sh \
&& conda install conda-build
Next, go ahead and use conda to install the fastai v1 library, and while we are at it, also install Jupyter Notebook and its extensions:
conda update conda && conda install -c pytorch -c fastai fastai \
&& conda install jupyter notebook \
&& conda install -c conda-forge jupyter_contrib_nbextensions
As I am going to do an editable pip install of both fastai2 and the fastcore dependency, I git clone the two repositories. Note that they are cloned into a shared filepath that also exists on the HPC host system, so that I can git pull updates to the repositories on the HPC host in future, and all the user(s) running the fastai2 Singularity container will automatically pick up the latest updates to the editable install:
mkdir -p /data/shared
cd /data/shared && git clone https://github.com/fastai/fastai2 \
&& git clone https://github.com/fastai/fastcore
Then, run the editable pip installs, which fastai currently recommends as “probably the best approach at the moment, since fastai v2 is under heavy development”:
cd /data/shared/fastcore && python3.7 -m pip install -e ".[dev]"
cd /data/shared/fastai2 && python3.7 -m pip install -e ".[dev]"
As a final setup step, install some other libraries and packages used in the part1-v4 fastai course:
conda install pyarrow
python3.7 -m pip install graphviz ipywidgets matplotlib "nbdev>=0.2.12" \
pandas scikit_learn azure-cognitiveservices-search-imagesearch sentencepiece
With that, all the necessary installs and setup should be there. I then add the ‘start script’ that will be executed when the Singularity container is started. In this case:
- Start the Jupyter Notebook server
- Make it accessible to other computers/IP (firewalled to internal network only, in our case)
- Have the server listen on a non-default port, 9999 (the Jupyter default is 8888)
- Give it a password hash for access (in this case, the hash corresponds to password fastai)
- Make it start in the shared filepath on the HPC host system where I cloned the fastai2 and fastcore repositories. This is also where I have other shared files needed (e.g. the part1-v4 course material)
%startscript
jupyter notebook --ip=0.0.0.0 --port=9999 --no-browser \
--NotebookApp.password='sha1:a60ff295d0b9:506732d050d4f50bfac9b6d6f37ea6b86348f4ed' \
--log-level=WARN --notebook-dir=/data/shared/ &
Finish the def file by adding some basic labels and descriptions:
%labels
ABOUT container for fastai2 (dev editable install) with jupyter notebook on startup (port 9999), for March 2020 fastai course
AUTHOR Yijin Lee
The complete example def file explained above can be found here.
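To help see how the pieces fit together, the overall shape of the def file is simply those sections one after another; condensed here as an outline (the linked file has the full contents):

BootStrap: docker
From: nvidia/cuda

%environment
    # runtime environment variables (as above)

%post
    # build-time steps: apt-get packages, miniconda, conda/pip installs, git clones (as above)

%startscript
    # start the Jupyter Notebook server on port 9999 (as above)

%labels
    # ABOUT / AUTHOR metadata (as above)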
Building the Singularity container
With the def file, I can now build the Singularity container to get the resulting container sif file. I needed sudo or root permission for this, and so I used a cheap AWS instance (t2.small) instead of the HPC environment (where I only have basic user permissions). My AWS instance only has limited space on its / root device, so I set an environment variable for Singularity to use a different AWS block device as the temp directory (or else the build would fail):
root@aws-t2:~# export TMPDIR=/blockdevice/tmp
root@aws-t2:~# ls
fastai2.def
root@aws-t2:~# singularity build fastai2.sif fastai2.def
Once the Singularity build completes, the requested sif file will have been created. It is quite a big file, at around 5.0GB, but I only really needed to build and transfer it once, since it will contain an editable (and thus update-able) install of fastai2:
root@aws-t2:~# ls -lh
-rw-r--r-- 1 root root 1.9K Mar 25 12:00 fastai2.def
-rwxr-xr-x 1 root root 5.0G Mar 25 12:30 fastai2.sif
The sif file can then be copied/transferred to the HPC environment for actual use.
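Any normal file-transfer method works for this; for example, something like the following with scp (the hostname and destination path here are just illustrative, matching the paths used later on):

root@aws-t2:~# scp fastai2.sif ylee@hpc01:/data/shared/singularity/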
Running the Singularity container
As I want to use an NVIDIA GPU for deep learning compute, the HPC where I run the fastai2 Singularity container needs to have the correct NVIDIA GPU drivers installed (by the sysadmin). Note that the only hard requirement is the driver itself; CUDA and the other dependencies are already self-contained in our Singularity sif file, all with the correct versions. I can check the NVIDIA GPU status by running nvidia-smi:
[ylee@hpc01 shared]$ nvidia-smi
Thu Mar 25 13:00:00 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:37:00.0 Off | 0 |
| N/A 53C P0 30W / 250W | 14MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2468 G Xorg 14MiB |
+-----------------------------------------------------------------------------+
As mentioned above, I will start the Singularity container from the shared filepath where the necessary files (e.g. the fastai2 and fastcore repositories, part1-v4 course material, etc.) reside. In my case, this is /data/shared, and my sif file is in /data/shared/singularity:
[ylee@hpc01 shared]$ pwd
/data/shared
[ylee@hpc01 shared]$ singularity instance start --nv ./singularity/fastai2.sif fastai2
INFO: instance started successfully
[ylee@hpc01 shared]$ singularity instance list
INSTANCE NAME PID IMAGE
fastai2 13579 /data/shared/singularity/fastai2.sif
The --nv flag above is for Singularity to be able to leverage the NVIDIA GPU.
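A quick note on filesystem access: by default, Singularity bind-mounts only a handful of host paths into the container (typically $HOME, /tmp and the current working directory), so a shared path like /data/shared is visible either because of that, or because the sysadmin has added it to the default bind paths. If it were not covered, it could be bound explicitly at start time, e.g. something like:

[ylee@hpc01 shared]$ singularity instance start --nv --bind /data/shared:/data/shared \
    ./singularity/fastai2.sif fastai2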
Because of the ‘startscript’ defined in the def file, there should now be a Jupyter Notebook server running and listening on port 9999:
[ylee@hpc01 shared]$ netstat -plunt
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:9999 0.0.0.0:* LISTEN 13579/python
I can thus point a web browser to the IP at port 9999, and enter the password (defined as fastai in the hash within our def file) to access Jupyter Notebook.
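(In case it is useful: one way to generate such a password hash is with the passwd helper shipped with the classic Jupyter Notebook package, run inside the container. The exact hash algorithm and module location depend on the notebook version installed, so treat this as a sketch.)

# prints a hash of the form sha1:<salt>:<digest> on older notebook versions
python3.7 -c "from notebook.auth import passwd; print(passwd('fastai'))"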
I can also run a shell within the Singularity container instance, to start interactive Python directly, without going via Jupyter Notebook:
[ylee@hpc01 shared]$ singularity shell instance://fastai2
Singularity fastai2.sif:/data/shared> python3.7
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
From within Python, I can also quickly confirm that fastai2 is indeed installed, and CUDA compute is available for PyTorch:
>>> from fastai2.vision.all import *
>>> torch.cuda.is_available()
True
And, because the container has an editable pip install of fastai2 residing on the HPC host system, I can git pull or git checkout a specific fastai2 commit from the HPC, and all users of the Singularity container will then ‘get’ the corresponding fastai2 version. For example, starting with a slightly older version (0.0.14):
>>> import fastai2
>>> fastai2.__version__
...
'0.0.14'
>>> exit()
I can exit from the Singularity instance shell to get back to the HPC host system, while leaving the container running. I then change the fastai2 version (e.g. update to the latest via git pull), and the change will be ‘live’ back in the Singularity instance shell:
Singularity fastai2.sif:/data/shared> exit
exit
[ylee@hpc01 shared]$ cd fastai2
[ylee@hpc01 fastai2]$ git pull
.
.
.
[ylee@hpc01 fastai2]$ cd ..
[ylee@hpc01 shared]$ singularity shell instance://fastai2
Singularity fastai2.sif:/data/shared> python3.7
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fastai2
>>> fastai2.__version__
'0.0.16'
All the shared filesystem files (e.g. ipynb notebooks) can be accessed from within the container, retaining the original user/group permissions, without having to do/set anything for Singularity. When done, just stop the running container:
[ylee@hpc01 shared]$ singularity instance stop fastai2
Killing fastai2 instance of /data/shared/singularity/fastai2.sif (PID=13579) (Timeout)
Summary
Without getting sudo or root permission on a production HPC cluster, I can define and build a Singularity container on a separate cheap cloud instance (where root is available), which can have a pip editable install of fastai2.
The resulting container sif file can be used on the HPC cluster, have native access to GPU CUDA compute, easily retain user/group permissions in the multi-user HPC environment, and have all the necessary software stack dependencies (except NVIDIA GPU driver, which must be present on the HPC host system) without messing up or interfering with other software stacks or environments on the HPC host system.
The editable install residing on the HPC host filesystem means that I can easily upgrade/change the version of fastai2 via git, and users of the Singularity container get the corresponding version changes ‘live’. This allows a ‘balance’: flexibility to experiment with software stacks in a multi-user production HPC environment with native user/group permissions, while reducing the risk of messing things up for everyone (e.g. via the undesired root privilege escalation that can happen in Docker). It also means that other users can all (re)use the same container with the same versions of the software stack, e.g. for a fastai study group.
My Singularity example def file explained above can be found here. And please do join us for lively discussions on the fastai forums.