Data on HPC clusters
So far, we have played with randomly created data. In your work, you will often need to work with real-world data.
How do you move it to the cluster? Where should you store it?
It’s time to talk about data on HPC clusters.
Transferring data to/from the cluster
Secure Copy Protocol
Secure Copy Protocol (SCP) allows you to copy files over the Secure Shell Protocol (SSH) with the scp utility. scp follows a syntax similar to that of the cp command.
Note that you need to run it from your local machine (not from the cluster).
Copy from your machine to the cluster
# Copy a local file to your home directory on the cluster
scp /local/path/file username@hostname:
# Copy a local file to some path on the cluster
scp /local/path/file username@hostname:/remote/path
Copy from the cluster to your machine
# Copy a file from the cluster to some path on your machine
scp username@hostname:/remote/path/file /local/path
# Copy a file from the cluster to your current location on your machine
scp username@hostname:/remote/path/file .
You can also use wildcards to transfer multiple files:
# Copy all the Bash scripts from your cluster home dir to some local path
scp username@hostname:*.sh /local/path
Copying directories
To copy a directory, you need to add the -r (recursive) flag:
scp -r /local/path/folder username@hostname:/remote/path
Copying for Windows users
MobaXterm users (on Windows) can copy files by dragging them between the local and remote machines in the GUI. Alternatively, they can use the download and upload buttons.
Secure File Transfer Protocol
The Secure File Transfer Protocol (SFTP) is more sophisticated and allows additional operations in an interactive shell. The sftp command provided by OpenSSH and other packages launches an SFTP client:
sftp username@hostname
Look at your prompt: your usual Bash/Zsh prompt has been replaced with sftp>.
From this prompt, you can access a number of SFTP commands. Type help for a list:
sftp> help
Available commands:
bye Quit sftp
cd path Change remote directory to 'path'
chgrp [-h] grp path Change group of file 'path' to 'grp'
chmod [-h] mode path Change permissions of file 'path' to 'mode'
chown [-h] own path Change owner of file 'path' to 'own'
copy oldpath newpath Copy remote file
cp oldpath newpath Copy remote file
df [-hi] [path] Display statistics for current directory or
filesystem containing 'path'
exit Quit sftp
get [-afpR] remote [local] Download file
help Display this help text
lcd path Change local directory to 'path'
lls [ls-options [path]] Display local directory listing
lmkdir path Create local directory
ln [-s] oldpath newpath Link remote file (-s for symlink)
lpwd Print local working directory
ls [-1afhlnrSt] [path] Display remote directory listing
lumask umask Set local umask to 'umask'
mkdir path Create remote directory
progress Toggle display of progress meter
put [-afpR] local [remote] Upload file
pwd Display remote working directory
quit Quit sftp
reget [-fpR] remote [local] Resume download file
rename oldpath newpath Rename remote file
reput [-fpR] local [remote] Resume upload file
rm path Delete remote file
rmdir path Remove remote directory
symlink oldpath newpath Symlink remote file
version Show SFTP version
!command Execute 'command' in local shell
! Escape to local shell
? Synonym for help
As this list shows, you have access to a number of classic Unix commands such as cd, pwd, ls, etc. These commands will be executed on the remote machine.
In addition, there are a number of commands of the form l<command>, where the "l" stands for "local". These commands will be executed on your local machine.
For instance, ls will list the files in your current directory on the remote machine, while lls ("local ls") will list the files in your current directory on your computer.
This means that you are now able to navigate two file systems at once: your local machine and the remote machine.
Here are a few examples:
sftp> pwd # print remote working directory
sftp> lpwd # print local working directory
sftp> ls # list files in remote working directory
sftp> lls # list files in local working directory
sftp> cd # change the remote directory
sftp> lcd # change the local directory
sftp> put local_file # upload a file
sftp> get remote_file # download a file
Copying directories
To upload or download directories, you first need to create them in the destination, then copy the contents with the -r (recursive) flag.
If you have a local directory called dir and you want to copy it to the cluster, you need to run:
sftp> mkdir dir # First create the directory
sftp> put -r dir # Then copy the content
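Downloading a directory works the same way in the other direction: first create the local directory, then copy the contents with get -r. For instance, with a remote directory hypothetically called results:
sftp> lmkdir results # First create the local directory
sftp> get -r results # Then copy the content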
To terminate the session, press <Ctrl+D>.
Syncing
If, instead of occasionally copying files between your machine and the cluster, you want to keep a directory in sync on both machines, you might want to use rsync instead. You can look at the Alliance wiki page on rsync for complete instructions.
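As a minimal sketch (the paths and host are placeholders, and you may want different options), a one-way sync of a local directory to the cluster could look like:
# Sync a local directory to the cluster (archive mode, compressed, with progress)
rsync -avz --progress /local/path/dir/ username@hostname:/remote/path/dir/
Note that the trailing slash on the source matters: with it, rsync copies the contents of dir rather than the directory itself.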
Heavy transfers
While the methods covered above work very well for limited amounts of data, if you need to make large transfers, you should use Globus instead, following the instructions in the Alliance wiki page on this service.
Windows line endings
On modern Mac operating systems and on Linux, lines in files are terminated with a newline (\n). On Windows, they are terminated with a carriage return followed by a newline (\r\n).
When you transfer files between Windows and Linux (the cluster uses Linux), this creates a mismatch. Most modern software handles this correctly, but you may occasionally run into problems.
The solution is to convert a file from Windows line endings to Unix line endings with:
dos2unix file
To convert a file back to Windows line endings, run:
unix2dos file
File management
The Alliance clusters are designed to handle large files very well. They are, however, slowed by the presence of many small files. It is thus important to know how to handle large collections of files by archiving them with tools such as tar and dar.
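As a quick illustration (the archive and directory names below are placeholders), a directory containing many small files can be bundled into a single compressed archive before being moved or stored, and extracted again when needed:
# Bundle a directory of many small files into a single compressed archive
tar -czf dataset.tar.gz dataset/
# Extract it again later
tar -xzf dataset.tar.gz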
Where to store data
Supercomputers have several filesystems and you should familiarize yourself with the quotas and policies of the clusters you use.
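On the Alliance clusters, the diskusage_report command (assuming it is available on the cluster you use) prints your current usage and quotas on these filesystems:
diskusage_report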
All filesystems are mounted on all nodes, so you can access the data on any network storage from any node (e.g. something in /home, /project, or /scratch will be accessible from any login node or compute node).
A temporary folder gets created directly on each compute node while a job is running. For workloads with heavy I/O or involving many files, it is worth copying your data to that folder as part of the job. In that case, make sure to copy the results back to network storage before the end of the job.
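As a sketch of what this can look like in a Slurm job script (the paths and time below are placeholders; $SLURM_TMPDIR is the environment variable that points to the node-local temporary folder on the Alliance clusters):
#!/bin/bash
#SBATCH --time=01:00:00
# Copy the input data to the node-local temporary folder (placeholder paths)
cp ~/scratch/input_data.tar $SLURM_TMPDIR/
cd $SLURM_TMPDIR
tar -xf input_data.tar
# ... run your computation here, reading and writing inside $SLURM_TMPDIR ...
# Copy the results back to network storage before the job ends
cp results.out ~/scratch/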