Rockfish - JHU

Introduction

Rockfish is a community-shared cluster at Johns Hopkins University. It follows the "condominium model" with three main integrated units. The first unit is funded by a National Science Foundation (NSF) Major Research Instrumentation grant (#1920103) and other major grants such as DURIP/DoD; the second unit contains medium-size condos (schools' condos); and the third is the collection of condos purchased by individual research groups. All three units share a common base infrastructure, and resources are shared by all users. Rockfish provides resources and tools to integrate traditional High Performance Computing (HPC) with Data Intensive Computing and Machine Learning (ML). As a multi-purpose resource for all fields of science, it provides High Performance and Data Intensive Computing services to Johns Hopkins University, Morgan State University, and ACCESS researchers as a Level 2 Service Provider.

Rockfish's compute nodes each have two 24-core Intel Xeon Cascade Lake 6248R processors (3.0 GHz base frequency) and a 1 TB NVMe local drive. The regular and GPU nodes have 192 GB of DDR4 memory, whereas the large memory nodes have 1.5 TB of DDR4 memory. The GPU nodes also have 4 Nvidia A100 GPUs.

Figure 1. Rockfish System

ACCESS hostname: login.rockfish.jhu.edu

Account Administration

A proposal through the ACCESS Resource Allocation request System (XRAS) is required for a research or startup allocation. See ACCESS Allocations for more information about the different types of allocations.

Configuring Your Account

Rockfish uses the bash shell by default. Submit an ACCESS support ticket to request a different shell.

Modules

The Rockfish cluster uses Lmod (a Lua-based modules system, version 8.3, developed at TACC) to dynamically manage users' shell environments. "module" commands set, modify, or delete environment variables in support of scientific applications, allowing users to select a particular version of an application or a combination of packages.

The "ml available" command will display (i) the applications that have been compiled using GNU compilers, (ii) external applications like matlab, abaqus, which are independent of the compiler used and (iii) a set of core modules. Likewise, if the Intel compilers are loaded "ml avail" will display applications that are compiled using the Intel compilers.

A set of modules is loaded by default at login time, including Slurm, gcc/9.3, and openmpi/3.1. We strongly recommend that users utilize this combination of modules whenever possible for best performance. In addition, several scientific applications are built with dependencies on other modules; users will see an on-screen message if this is the case. For more information, type:

login1$ ml spider application/version

For example, if you have the gcc/9.3.0 module loaded and try to load intel-mpi, you will get:

Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "intel-mpi" Try: "module spider intel-mpi" to see how to load the module(s).

The "ml available" command will also display a letter after the module indicating where it is:

L(oaded), D(efault), g(gpu), c(ontainer)

Table 1. Useful Modules Commands

| COMMAND | ALIAS / SHORTCUT | DESCRIPTION |
|---|---|---|
| module list | ml | List modules currently loaded |
| module avail | ml av | List all scientific applications with different versions |
| module show modulename | ml show modulename | Show the environment variables and settings in the module file |
| module load modulename | ml modulename | Load modules |
| module unload modulename | mu modulename | Unload the application or module |
| module spider modulename | ml spider modulename | Show available versions for modulename |
| module save modulename | ml save modulename | Save current modules into a session (default) or named session |
| module swap modulename | ml modulename | Automatically swap versions of modules |
| module help | ml help | Show additional information about the scientific application |

System Architecture

Rockfish has three types of compute nodes: "regular memory" (standard) compute nodes (192 GB), large memory nodes (1,524 GB), and GPU nodes with 4 Nvidia A100 GPUs each. All compute nodes have access to three GPFS file sets. Rockfish nodes and storage have Mellanox HDR100 connectivity with a 1.5:1 topology. Rockfish is managed using the Bright Computing cluster management software and the Slurm workload manager for job scheduling.

Compute Nodes

Table 2. Compute Node Specifications

| | REGULAR MEMORY COMPUTE NODES | LARGE MEMORY NODES | GPU NODES |
|---|---|---|---|
| MODEL | Lenovo SD530, Intel Xeon Gold Cascade Lake 6248R | Lenovo SR630, Intel Xeon Gold Cascade Lake 6248R | Lenovo SR670, Intel Xeon Gold Cascade Lake 6248R |
| TOTAL CORES PER NODE | 48 | 48 | 48 |
| NUMBER OF NODES | 368 | 10 | 10 |
| CLOCK RATE | 3.0 GHz | 3.0 GHz | 3.0 GHz |
| RAM | 192 GB | 1,524 GB | 192 GB |
| TOTAL NUMBER OF CORES | 17,664 | 480 | 480 |
| GPUS | N/A | N/A | 4 Nvidia A100 GPUs (40 GB) PCIe |
| TOTAL NUMBER OF GPUS | N/A | N/A | 40 |
| LOCAL STORAGE | 1 TB NVMe | 1 TB NVMe | 1 TB NVMe |

Login Nodes

Rockfish's three login nodes (login01-03) are physical nodes with architecture and features similar to the regular memory compute nodes. Please use the gateway (login.rockfish.jhu.edu) to connect to Rockfish.

Data Transfer Nodes (DTNs)

These nodes can be used to transfer data to the Rockfish cluster using secure copy (scp), Globus, or any other utility such as FileZilla. The endpoint for Globus is "Rockfish User Data". The DTNs are "rfdtn1.rockfish.jhu.edu" and "rfdtn2.rockfish.jhu.edu". These nodes mount all file systems.

Systems Software Environment

Table 3. Systems Software Environment

| SOFTWARE FUNCTION | DESCRIPTION |
|---|---|
| CLUSTER MANAGEMENT | Bright Cluster Management |
| FILE SYSTEM MANAGEMENT | Xcat/Confluent |
| OPERATING SYSTEM | CentOS 8.2 |
| FILE SYSTEMS | GPFS, ZFS |
| SCHEDULER AND RESOURCE MANAGEMENT | Slurm |
| USER ENVIRONMENT | Lua modules |
| COMPILERS | Intel, GNU, PGI |
| MESSAGE PASSING | Intel MPI, OpenMPI, MVAPICH |

File Systems

Table 4. Rockfish File Systems

| FILE SYSTEM | QUOTA | FILE RETENTION | BACKUP | FEATURES |
|---|---|---|---|---|
| $HOME | 50 GB | No file deletion policy | Backed up to an off-site location | NVMe file system |
| $SCRATCH4 | 10 TB (combined with scratch16) | 30-day retention; files that have not been accessed for 30 days are moved to the /data file system | No | Optimized for small files; 4 MB block size |
| $SCRATCH16 | Same as above | Same as above | No | Optimized for large files; 16 MB block size |
| data | 10 TB | No deletion policy, but quota driven | Optional | GPFS file set, lower performance |


Accessing the System

Rockfish is accessible only to those users and research groups that have been awarded a Rockfish-specific allocation. ACCESS users may connect to Rockfish using an SSH client using SSH keys for authentication; password-based authentication is not supported.

Users must generate and install their own SSH keys. For help with either of these, see the Generating SSH Keys and/or Uploading Your Public Key pages.

After you have uploaded your public key, you should be able to connect to the Rockfish system using an SSH client. For example, from a computer running Linux, MacOS, Windows Subsystem for Linux, or Windows PowerShell, you may connect to Rockfish by opening a Terminal (or PowerShell) and entering:
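    ssh your-userid@login.rockfish.jhu.edu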

or
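    ssh -l your-userid login.rockfish.jhu.edu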

Third-party SSH clients that provide a GUI (e.g., Bitvise, MobaXterm, PuTTY) may also be used to connect to Rockfish.

"login" is a gateway server that will authenticate credentials and then connect the user to one of three physical login nodes (identical to regular compute nodes). Hostname: login.rockfish.jhu.edu (gateway)

Citizenship

You share Rockfish with thousands of other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb:

  • Don't run jobs on the login nodes. Login nodes are used by hundreds of users to monitor their jobs, submit jobs, edit and manipulate files, and in some cases compile codes. We strongly request that users abstain from running jobs on login nodes. Sometimes users may want to run quick jobs to check that input files are correct or that scientific applications are working properly. If this is the case, make sure this activity does not take more than a few minutes, or better yet, request an interactive session ("interact") to fully test your codes.

  • Don't stress the file systems. Do not perform activities that may impact the file systems (and the login nodes), for example running rsync or copying large or many files from one file system to another. Please use Globus or the data transfer nodes (rfdtn1, rfdtn2) to copy large amounts of data.

  • When submitting a help-desk ticket, be as informative as possible.

Login Node Activities

  • Request an interactive session ("interact -usage" shows the available options)

  • Compile codes, for example run "make". Be careful if you are running commands with multiple processes. "make -j 4" may be fine but "make -j 20" may impact other users.

  • Check jobs using the "sqme" command

  • Edit files, scripts, manipulate files

  • Submit jobs

  • Check output files

What is NOT allowed:

  • Running executables, e.g., "./a.out"

  • Running multiple rsync sessions or copying large numbers of files

Managing Files

Transferring your Files

  1. scp: Secure copy commands can be used when transferring small amounts of data. We strongly encourage you to use the data transfer nodes instead of the gateway.

    scp [-r] file-name userid@rfdtn1.rockfish.jhu.edu:/path/to/file/dir

  2. rsync: An alternative to scp is "rsync". This command is useful when copying files between file systems or in/out of Rockfish. rsync can also be used to keep file systems in sync as new files are created or modified. For example:
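    rsync -avz /path/to/local/dir userid@rfdtn1.rockfish.jhu.edu:/path/to/destination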

    or
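    rsync -avz userid@rfdtn1.rockfish.jhu.edu:/path/to/source /path/to/local/dir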

  3. Globus: We strongly recommend the use of our managed endpoints via Globus. Rockfish's Globus endpoint is "Rockfish User Data".

Sharing Files with Collaborators

Users are strongly encouraged to use Globus features to share files with internal or external collaborators.

Software

Rockfish provides a broad application base managed by Lua modules. The most commonly used packages in bioinformatics, molecular dynamics, quantum chemistry, structural mechanics, and genomics are available ("ml avail"). Rockfish also supports Singularity containers.

Installed Software

Rockfish uses Lua modules. Type "ml avail" to list all the scientific applications that are installed and available via modules.

  • "module" (or "ml") : displays a list of installed applications and corresponding versions.

  • "ml spider APP1" : displays all information on package APP1 (if it is installed)

  • "ml help APP1" : displays any additional information on this scientific application.

Building Software

Users may want to install scientific applications that are used only by themselves or by their group in their HOME directories. They can then create a private module.

  1. Create a directory to install the application: "mkdir -p $HOME/code/APP1"

  2. Install the application following the instructions (README or INSTALL files).

  3. Create a directory in your HOME directory to create a module file: "mkdir $HOME/modulefiles/APP1"

  4. Create a ".lua" file that adds the application path to your $PATH environment variable and all other requirements (lib or include files).

  5. Load the module as "ml own; ml APP1"
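A minimal sketch of steps 3-5 for a hypothetical package APP1 (the install path, module directory, version number, and the "own" meta-module are assumptions; adjust them to your setup):

```bash
# step 3: directory for the private module file
mkdir -p $HOME/modulefiles/APP1

# step 4: minimal Lua module file for a hypothetical version 1.0
cat > $HOME/modulefiles/APP1/1.0.lua <<'EOF'
local base = pathJoin(os.getenv("HOME"), "code/APP1")
prepend_path("PATH", pathJoin(base, "bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(base, "lib"))
EOF

# step 5: make the private module tree visible, then load the module
# ("module use" is a generic alternative to the site-provided "ml own")
module use $HOME/modulefiles
ml APP1/1.0
```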

Compilers and Recommendations

The Rockfish cluster provides three different compilers for the compute nodes: GNU, Intel, and PGI. There are also MPI libraries (OpenMPI, Intel MPI, and MVAPICH2). Most applications have been built using GNU compilers version 9.3.0. Users should evaluate which compiler gives the best performance for their applications.

The Intel compilers and Intel MPI libraries can be loaded by executing the following command:
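    ml intel intel-mpi   # module names/versions may differ; check "ml avail"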

A standard command to compile a Fortran or C code will look like the following (add as many flags as needed; file and program names are placeholders):
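    ifort -O2 -o my_app my_code.f90
    mpiicc -O2 -o my_mpi_app my_mpi_code.c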

For GNU compilers you may want to use this sequence:
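    ml gcc/9.3.0 openmpi/3.1
    gfortran -O2 -o my_app my_code.f90
    mpicc -O2 -o my_mpi_app my_mpi_code.c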

Running Jobs

Job Accounting

Rockfish allocations are made in core-hours. The recommended method for estimating your resource needs for an allocation request is to perform benchmark runs. The core-hours used for a job are calculated by multiplying the number of processor cores used by the wall-clock duration in hours. Rockfish core-hour calculations should assume that all jobs will run in the regular queue.

For example: if you request one core on one node for an hour your allocation will be charged one core-hour. If you request 24 cores on one node, and the job runs for one hour, your account will be charged 24 core-hours. For parallel jobs, compute nodes are dedicated to the job. If you request 2 compute nodes and the job runs for one hour, your allocation will be charged 96 core-hours.

Job accounting is independent of the number of processes you run on compute nodes. You can request 2 cores for your job for one hour. If you run only one process, your allocation will be charged for 2 core-hours.

Accessing the Compute Nodes

  • Batch jobs: Jobs can be submitted to the scheduler by writing a script and submitting it via the "sbatch" command:
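    sbatch script-file-name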

    where script-file-name is a file that contains a set of keywords used by the scheduler to set variables and the parameters for the job. It also contains a set of Linux commands to be executed. See Job Scripts below.

  • Interactive sessions: Users may need to connect to a compute node in interactive mode by using an internal script called "interact". Running "interact -usage" will provide examples and a list of parameters. For example:
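    # flags are illustrative; run "interact -usage" for the exact option names
    interact -p defq -n 1 -t 02:00:00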

    This will request an interactive session on the defq queue with one core for 2 hours.

    Alternatively users can use the full command:
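    # one possible equivalent using standard Slurm options
    srun -p defq --ntasks=12 --time=120 --mem-per-cpu=4G --pty bash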

    This command will request an interactive session with 12 cores for 120 minutes and 48GB memory for the job (4GB per core).

  • ssh from a login node directly to a compute node: Users may ssh to a compute node where their jobs are running to check or monitor the status of their jobs. This type of connection should last only a few minutes.

Slurm Job Scheduler

Rockfish uses Slurm (Simple Linux Utility for Resource Management) to manage resource scheduling and job submission. Slurm is an open source application with active developers and a growing user community, and it has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing; that is, "interactive" use of login nodes for job processing is not allowed. Users who need to interact with their codes while they are running can request an interactive session using the script "interact", which submits a request to the queuing system that allows interactive access to a node.

Slurm uses "partitions" to divide types of jobs (partitions are called queues on other schedulers). Rockfish defines a few partitions that will allow sequential/shared computing and parallel (dedicated or exclusive nodes), GPU jobs and large memory jobs. The default partition is "defq".

Queues on Rockfish

Queue limits are subject to change. Rockfish will use partitions and resources associated with them to create different types of allocations.

Regular memory allocations allow the use of all the regular compute nodes (currently the defq partition). All jobs submitted to the defq partition are charged against this allocation.

Large memory (LM) allocations will allow the use of the large memory nodes. If a user submits a job to this partition then the LM allocation is charged by default.

Likewise, there is a GPU partition that allows the use of the GPU nodes.

Table 5. Rockfish Production Queues

| QUEUE NAME | MAX NODES PER JOB (ASSOC'D CORES)* | MAX DURATION | MAX NUMBER OF CORES (RUNNING) | MAX NUMBER RUNNING + QUEUED | CHARGE RATE (PER NODE-HOUR) |
|---|---|---|---|---|---|
| defq | 368 nodes, 48 cores per node | 72 hours | 4,800 | 9,600 | 1 Service Unit (SU) |
| bigmem | 10 nodes (1,524 GB per node) | 48 hours | 144 | 288 | 1 SU |
| a100 | 10 nodes, 192 GB RAM, 4 Nvidia A100 per node | 48 hours | 144 | 288 | 1 SU |

Job Management

Users can monitor their jobs with the "squeue" command. In this example, user test345 is running two jobs: JobID 31559 is a parallel job using 4 nodes, and JobID 31560 is a large memory job running on node bigmem01.
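    squeue -u test345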

Users can also invoke the "sqme" script to monitor their jobs.

To cancel a job, use the "scancel" command followed by the job ID. For example, "scancel 31560" will cancel the large memory job for user test345 in the example above.

Sample Job Scripts

The following scripts are examples for different workflows. Users can modify them according to the resources needed to run their applications.

MPI Jobs

This job will run on 5 nodes with 48 processes/cores each, for a total of 240 MPI processes.
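A minimal sketch of such a script, assuming the default GNU/OpenMPI toolchain; the executable ./my_mpi_app is a placeholder:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --partition=defq
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00

# default toolchain; versions may differ, check "ml avail"
ml gcc/9.3.0 openmpi/3.1

# launch 240 MPI processes (5 nodes x 48 tasks per node)
mpirun ./my_mpi_app
```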

OpenMP/Threaded Jobs

This script will run a small job that creates 8 threads. It will use the default time of 1:00:00 (one hour).
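A minimal sketch, with the executable ./my_omp_app as a placeholder:

```bash
#!/bin/bash
#SBATCH --job-name=openmp_job
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# no --time line, so the default wall time (1:00:00) applies

ml gcc/9.3.0

# one process running 8 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_omp_app
```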

Hybrid (MPI + OpenMP)

This script will run a hybrid job (Gromacs) on two nodes; each node will run 8 MPI processes, each with 6 threads.
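A minimal sketch; the Gromacs module name/version and the input name my_system are placeholders (check "ml spider gromacs"):

```bash
#!/bin/bash
#SBATCH --job-name=hybrid_gromacs
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --time=01:00:00

# module names/versions are placeholders
ml gcc/9.3.0 openmpi/3.1 gromacs

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# 16 MPI ranks (2 nodes x 8), each rank running 6 OpenMP threads
mpirun gmx_mpi mdrun -ntomp $OMP_NUM_THREADS -deffnm my_system
```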

GNU parallel

This sample will run 48 serial jobs on one node using GNU parallel. This job directs output to the local scratch file system.
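A minimal sketch; the module name, the executable ./serial_app, the input naming, and the local scratch path /tmp are assumptions:

```bash
#!/bin/bash
#SBATCH --job-name=gnu_parallel
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --time=01:00:00

ml parallel   # module name is an assumption; check "ml avail"

# write output to the node-local NVMe drive, then copy it back
localdir=/tmp/$SLURM_JOB_ID
mkdir -p $localdir

# run up to 48 serial tasks at a time, one per input file
parallel -j $SLURM_NTASKS "./serial_app {} > $localdir/{/.}.out" ::: input_*.dat

cp $localdir/*.out $SLURM_SUBMIT_DIR/
rm -rf $localdir
```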

Parametric / Array / HTC jobs

This script is an example of running a set of 5,000 jobs, with only 480 jobs running at a time. The input files are in a directory ($workdir). A temporary directory ($tmpdir) will be created in "scratch" where each job runs. At the end of each run the temporary directory is deleted.
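A minimal sketch; the paths, the executable ./my_app, and the input naming scheme are placeholders, and $SCRATCH4 is assumed to point at your scratch directory:

```bash
#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --partition=defq
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --array=1-5000%480    # 5,000 tasks, at most 480 running at once

# directory with the input files (placeholder path)
workdir=$HOME/param_inputs

# per-task temporary directory in scratch, removed when the task finishes
tmpdir=$SCRATCH4/tmp_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
mkdir -p $tmpdir
cd $tmpdir

cp $workdir/input_${SLURM_ARRAY_TASK_ID}.dat .
$workdir/my_app input_${SLURM_ARRAY_TASK_ID}.dat > result_${SLURM_ARRAY_TASK_ID}.out

cp result_${SLURM_ARRAY_TASK_ID}.out $workdir/
cd $workdir && rm -rf $tmpdir
```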

Bigmem (LM) Jobs

This script will run a job that needs large amounts of memory. Users need a special resource allocation (bigmem). It will use the default time of 1:00:00 (one hour).
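A minimal sketch; the account name and the executable ./my_bigmem_app are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=bigmem_job
#SBATCH --partition=bigmem
#SBATCH --account=my_alloc_bigmem   # placeholder; use your large-memory allocation
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --mem=1000G                 # adjust to what the job actually needs
# no --time line, so the default wall time (1:00:00) applies

ml gcc/9.3.0

./my_bigmem_app
```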

GPU Jobs (a100 partition)

This script will run a job that uses all 4 Nvidia A100 GPUs. Users need a special resource allocation (gpu). It will use the default time of 1:00:00 (one hour).
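A minimal sketch; the account name, module names, and the executable ./my_gpu_app are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=a100
#SBATCH --account=my_alloc_gpu    # placeholder; use your GPU allocation
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --gres=gpu:4              # request all 4 A100s on the node
# no --time line, so the default wall time (1:00:00) applies

# module names/versions are placeholders; check "ml avail"
ml gcc/9.3.0 cuda

./my_gpu_app
```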

Help

Please visit the ACCESS help desk for important contact information. When submitting a support ticket, please include:

  • a complete description of the problem with accompanying screenshots if applicable

  • any paths to job scripts or input/output files

  • the name of the login node, if you are having problems while on a login node