R on HPC clusters

Author

Marie-Hélène Burle

This section will show you how to use R once you have logged in to a remote cluster via SSH.

Modules

On the Alliance clusters, a number of utilities are available right away (e.g. Bash utilities, git, tmux, various text editors). Before you can use more specialized software, however, you have to load the module corresponding to the version of your choice, as well as any potential dependencies.

The cluster setup for this course has everything loaded, so this step is not necessary today, but it is an important one to learn.
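You can see which modules are currently loaded in your session with:

module list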

R

First, of course, we need an R module.

To see which versions of R are available on a cluster, run:

module spider r

To see the dependencies of a particular version (e.g. r/4.1.2), run:

module spider r/4.1.2

This shows us that we need StdEnv/2020 to load r/4.1.2.

C compiler

If you plan on installing any R package, you will also need a C compiler.

In theory, one could use the proprietary Intel compiler, which is loaded by default on the Alliance clusters, but it is recommended to replace it with GCC: R packages can be compiled by any C compiler (including Clang and LLVM), but the default GCC compiler is the best way to avoid headaches.

Your turn:

  • How can you check which gcc versions are available on our training cluster?
  • What are the dependencies required by gcc/9.3.0?

Loading the modules

Once you know which modules you need, you can load them. The order is important: the dependencies (here StdEnv/2020) must be listed before the modules which depend on them.

module load StdEnv/2020 gcc/9.3.0 r/4.1.2
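If you use the same set of modules regularly, Lmod (the module system on the Alliance clusters) lets you save them as a named collection and restore them in later sessions. A quick sketch, with an arbitrary collection name:

module save r_env      # save the currently loaded modules under a name
module restore r_env   # reload that set in a future session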

Installing R packages

For this course, all packages have already been installed in a communal library. You thus don’t have to install anything.

To install a package, launch the interactive R console with:

R

In the R console, run:

install.packages("<package_name>", repos = "<url-cran-mirror>")
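For instance, to install the data.table package (an arbitrary choice here) from the cloud CRAN mirror, which automatically redirects to a nearby server:

install.packages("data.table", repos = "https://cloud.r-project.org/")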

The first time you install a package, R will ask (in two successive prompts) whether you want to create and use a personal library in your home directory. Answer yes to both. Your packages will then install under ~/.

Some packages require additional modules to be loaded before they can be installed; others need additional R packages as dependencies. In either case, you will get explicit error messages. Adding the argument dependencies = TRUE helps in the second case, but you will still have to install packages manually from time to time.
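For example (again with an arbitrary package):

install.packages("ggplot2", dependencies = TRUE, repos = "https://cloud.r-project.org/")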

To leave the R console, press <Ctrl+D>.

Running R jobs

There are two types of jobs that can be launched on an Alliance cluster: interactive jobs and batch jobs. We will practice both and discuss their respective merits and when to use which.

For this course, I purposefully built a rather small cluster (10 nodes, each with 4 CPUs and 30 GB of memory) to give a tangible illustration of the constraints of resource sharing.

Interactive jobs

While it is fine to run R on the login node when you install packages, you must start a SLURM job before running any heavy computation.

To run R interactively, you should launch an salloc session.

Example:

salloc --time=1:10:00 --mem-per-cpu=3700M --ntasks=8

This takes you to a compute node where you can now launch R to run computations:

R
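Once in the R session, you can sanity-check your allocation (a quick sketch; SLURM_NTASKS is an environment variable that SLURM sets inside jobs):

Sys.getenv("SLURM_NTASKS")   # number of tasks in this allocation
parallel::detectCores()      # all CPUs on the node, not necessarily all yours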

Running R in an salloc session, however, leads to the same inefficient use of resources as running an RStudio server: all the resources you requested are blocked for you while your job is running, whether you are making use of them (running heavy computations) or not (thinking, typing code, or running computations that use only a fraction of the requested resources).

Interactive jobs are thus best kept to develop code.

Scripts

To run an R script called <your_script>.R, you first need to write a job script:

Example:

<your_job>.sh
#!/bin/bash
#SBATCH --account=def-<your_account>
#SBATCH --time=15
#SBATCH --mem-per-cpu=3000M
#SBATCH --cpus-per-task=4
#SBATCH --job-name="<your_job>"
module load StdEnv/2020 gcc/9.3.0 r/4.1.2
Rscript <your_script>.R   # Note that R scripts are run with the command `Rscript`
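The content of <your_script>.R is entirely up to you. As a minimal sketch, here is a hypothetical toy computation sized to the 4 CPUs requested above:

library(parallel)
# Use the CPUs allocated by SLURM rather than all cores on the node
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
results <- mclapply(1:1000, function(i) i^2, mc.cores = ncores)
cat("Computed", length(results), "squares on", ncores, "cores\n")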

Then launch your job with:

sbatch <your_job>.sh

You can monitor your job with sq (an alias for squeue -u $USER $@).
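Once a job has completed, you can also check how efficiently it used its allocation with the seff utility (available on the Alliance clusters), replacing <jobid> with your job's ID:

seff <jobid>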

Batch jobs are the best approach to run parallel computations, particularly when they require a lot of hardware: the requested resources are used for actual computation for the whole duration of the job, which will save you lots of waiting time (on the Alliance clusters) or money (on commercial clusters).