Rockfish - JHU
Introduction
Rockfish is a community-shared cluster at Johns Hopkins University. It follows the "condominium model" with three main integrated units. The first unit is funded by a National Science Foundation (NSF) Major Research Instrumentation grant (#1920103) and other major grants such as DURIP/DoD, the second unit contains medium-size condos (Schools' condos), and the last unit is the collection of condos purchased by individual research groups. All three units share a common base infrastructure, and resources are shared by all users. Rockfish provides resources and tools to integrate traditional High Performance Computing (HPC) with Data Intensive Computing and Machine Learning (ML). As a multi-purpose resource for all fields of science, it provides High Performance and Data Intensive Computing services to Johns Hopkins University, Morgan State University and ACCESS researchers as a Level 2 Service Provider.
Rockfish's compute nodes have two 24-core Intel Xeon Cascade Lake 6248R processors (3.0 GHz base frequency) and a 1 TB NVMe local drive. The regular and GPU nodes have 192 GB of DDR4 memory, whereas the large memory nodes have 1.5 TB of DDR4 memory. The GPU nodes also have 4 Nvidia A100 GPUs.
Figure 1. Rockfish System

ACCESS hostname: login.rockfish.jhu.edu
Account Administration
A proposal submitted through the ACCESS Resource Allocation request System (XRAS) is required for a research or startup allocation. See ACCESS Allocations for more information about the different types of allocations.
Configuring Your Account
Rockfish uses the bash shell by default. Submit an ACCESS support ticket to request a different shell.
Modules
The Rockfish cluster uses Lmod (Lua modules, version 8.3, developed at TACC) to dynamically manage users' shell environments. "module" commands set, modify, or delete environment variables in support of scientific applications, allowing users to select a particular version of an application or a combination of packages.

The "ml avail" command will display (i) the applications that have been compiled using GNU compilers, (ii) external applications such as MATLAB and ABAQUS, which are independent of the compiler used, and (iii) a set of core modules. Likewise, if the Intel compilers are loaded, "ml avail" will display the applications compiled with the Intel compilers.
A set of modules is loaded by default at login time, including Slurm, gcc/9.3 and openmpi/3.1. We strongly recommend that users utilize this combination of modules whenever possible for best performance. In addition, several scientific applications are built with dependencies on other modules; users will see a message on the screen if this is the case. For more information type:
login1$ ml spider application/version
For example, if you have the gcc/9.3.0 module loaded and try to load intel-mpi, you will get:
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "intel-mpi"
Try: "module spider intel-mpi" to see how to load the module(s).
The "ml available
" command will also display a letter after the module indicating where it is:
L(oaded), D(efault), g(gpu), c(ontainer)
Table 1. Useful Modules Commands

| COMMAND | ALIAS / SHORTCUT | DESCRIPTION |
|---|---|---|
| module list | ml | List modules currently loaded |
| module avail | ml avail | List all scientific applications with different versions |
| module show modulename | ml show modulename | Show the environment variables and settings in the module file |
| module load modulename | ml modulename | Load modules |
| module unload modulename | ml -modulename | Unload the application or module |
| module spider modulename | ml spider modulename | Show available versions for modulename |
| module save [name] | ml save [name] | Save current modules into a session (default) or named session |
| module swap mod1 mod2 | ml swap mod1 mod2 | Automatically swap versions of modules |
| module help modulename | ml help modulename | Show additional information about the scientific application |
System Architecture
Rockfish has three types of compute nodes: "regular memory" or standard compute nodes (192GB), large memory nodes (1524GB) and GPU nodes with 4 Nvidia A100 GPUs each. All compute nodes have access to three GPFS file sets. Rockfish nodes and storage have Mellanox HDR100 connectivity with a 1.5:1 topology. Rockfish is managed using the Bright Computing cluster management software and the Slurm workload manager for job scheduling.
Compute Nodes
Table 2. Compute Node Specifications

| | REGULAR MEMORY COMPUTE NODES | LARGE MEMORY NODES | GPU NODES |
|---|---|---|---|
| MODEL | Lenovo SD530 | Lenovo SR630 | Lenovo SR670 |
| TOTAL CORES PER NODE | 48 | 48 | 48 |
| NUMBER OF NODES | 368 | 10 | 10 |
| CLOCK RATE | 3.0 GHz | 3.0 GHz | 3.0 GHz |
| RAM | 192 GB | 1524 GB | 192 GB |
| TOTAL NUMBER OF CORES | 17,664 | 480 | 480 |
| GPUS | N/A | N/A | 4 Nvidia A100 GPUs (40 GB) PCIe |
| TOTAL NUMBER OF GPUS | N/A | N/A | 40 |
| LOCAL STORAGE | 1 TB NVMe | 1 TB NVMe | 1 TB NVMe |
Login Nodes
Rockfish's three login nodes (login01-03) are physical nodes with architecture and features similar to the regular memory compute nodes. Please use the gateway to connect to Rockfish.
Data Transfer Nodes (DTNs)
These nodes can be used to transfer data to and from the Rockfish cluster using secure copy, Globus, or any other utility like FileZilla. The Globus endpoint is "Rockfish User Data". The DTNs are "rfdtn1.rockfish.jhu.edu" and "rfdtn2.rockfish.jhu.edu". All file systems are mounted and available on these nodes.
Systems Software Environment
Table 3. Systems Software Environment

| SOFTWARE FUNCTION | DESCRIPTION |
|---|---|
| CLUSTER MANAGEMENT | Bright Cluster Management |
| FILE SYSTEM MANAGEMENT | Xcat/Confluent |
| OPERATING SYSTEM | CentOS 8.2 |
| FILE SYSTEMS | GPFS, ZFS |
| SCHEDULER AND RESOURCE MANAGEMENT | Slurm |
| USER ENVIRONMENT | Lua modules |
| COMPILERS | Intel, GNU, PGI |
| MESSAGE PASSING | Intel MPI, OpenMPI, MVAPICH |
File Systems
Table 4. Rockfish File Systems

| FILE SYSTEM | QUOTA | FILE RETENTION | BACKUP | FEATURES |
|---|---|---|---|---|
| HOME | 50GB | No file deletion policy | Backed up to an off-site location | NVMe file system |
| scratch4 | 10TB (combined with scratch16) | 30-day retention; files that have not been accessed for 30 days will be moved | No | Optimized for small files; 4MB block size |
| scratch16 | Same as above | Same as above | No | 16MB block size |
| data | 10TB | No deletion policy, but quota driven | Optional | GPFS file set, lower performance |
Accessing the System
Rockfish is accessible only to those users and research groups that have been awarded a Rockfish-specific allocation. ACCESS users may connect to Rockfish using an SSH client using SSH keys for authentication; password-based authentication is not supported.
Users must generate and install their own SSH keys. For help with either of these, see the Generating SSH Keys and/or Uploading Your Public Key pages.
After you have uploaded your public key, you should be able to connect to the Rockfish system using an SSH client. For example, from a computer running Linux, macOS, Windows Subsystem for Linux, or Windows PowerShell, you may connect to Rockfish by opening a Terminal (or PowerShell) and entering:
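A minimal sketch; `your_username` and the key file path are placeholders for your own username and SSH key:

```bash
# Connect through the Rockfish gateway
ssh your_username@login.rockfish.jhu.edu

# Or point ssh at your private key explicitly if it is not in the default location
ssh -i ~/.ssh/id_rsa your_username@login.rockfish.jhu.edu
```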
Third-party SSH clients that provide a GUI (e.g., Bitvise, MobaXterm, PuTTY) may also be used to connect to Rockfish.
"login
" is a gateway server that will authenticate credentials and then connect the user to one of three physical login nodes (identical to regular compute nodes). Hostname: login.rockfish.jhu.edu
(gateway)
Citizenship
You share Rockfish with thousands of other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb:
Don't run jobs on the login nodes. Login nodes are used by hundreds of users to monitor jobs, submit jobs, edit and manipulate files, and in some cases compile codes. We strongly request that users abstain from running jobs on login nodes. Sometimes users may want to run quick jobs to check that input files are correct or that scientific applications are working properly. If this is the case, make sure this activity does not take more than a few minutes, or, even better, request an interactive session (interact) to fully test your codes.
Don't stress the file systems. Do not perform activities that may impact the file systems (and the login nodes), for example running rsync or copying many or large files from one file system to another. Please use Globus or the data transfer nodes (rfdtn1) to copy large amounts of data.

When submitting a help-desk ticket, be as informative as possible.
Login Node Activities
What IS allowed:

- Request an interactive session ("interact -usage")
- Compile codes, for example run "make". Be careful if you are running commands with multiple processes: "make -j 4" may be fine but "make -j 20" may impact other users.
- Check jobs with the "sqme" command
- Edit files and scripts, manipulate files
- Submit jobs
- Check output files

What is NOT allowed:

- Running executables, e.g. "./a.out"
- Multiple rsync sessions or copying a large number of files
Managing Files
Transferring your Files
- scp: Secure copy can be used when transferring small amounts of data. We strongly encourage you to use the data transfer nodes instead of the gateway: scp [-r] file-name userid@rfdtn1.rockfish.jhu.edu:/path/to/file/dir
- rsync: An alternative to scp is "rsync". This command is useful when copying files between file systems or in/out of Rockfish. rsync can also be used to sync file systems as new files are created or modified (see the sketch after this list).
- Globus: We strongly recommend the use of our managed endpoints via Globus. Rockfish's Globus endpoint is "Rockfish User Data".
Sharing Files with Collaborators
Users are strongly encouraged to use Globus features to share files with internal or external collaborators.
Software
Rockfish provides a broad application base managed by Lua modules. The most commonly used packages in bioinformatics, molecular dynamics, quantum chemistry, structural mechanics, and genomics are available ("ml avail"). Rockfish also supports Singularity containers.
Installed Software
Rockfish uses Lua modules. Type "ml avail" to list all the scientific applications that are installed and available via modules.

- "module" (or "ml"): displays a list of installed applications and corresponding versions.
- "ml spider APP1": displays all information on package APP1 (if it is installed).
- "ml help APP1": displays any additional information on this scientific application.
Building Software
Users may want to install scientific applications, used only by the user or by the group, in their HOME directories. Users can then create a private module (see the sketch after this list):

1. Create a directory to install the application: "mkdir -p $HOME/code/APP1"
2. Install the application following its instructions (README or INSTALL files).
3. Create a directory in your HOME directory for the module file: "mkdir $HOME/modulefiles/APP1"
4. Create a ".lua" file that adds the application path to your $PATH environment variable, along with any other requirements (lib or include files).
5. Load the module: "ml own; ml APP1"
Compilers and recommendations
The Rockfish cluster provides three different compiler suites for the compute nodes: GNU, Intel and PGI. There are also MPI libraries (OpenMPI, Intel MPI and MVAPICH2). Most applications have been built using GNU compilers version 9.3.0. Users should evaluate which compiler gives the best performance for their applications.
The Intel compilers and Intel MPI libraries can be loaded by executing the following command:
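A sketch; the module names are an assumption based on the intel-mpi module mentioned earlier, so check "ml spider intel" and "ml spider intel-mpi" for the exact names and versions available:

```bash
# Load the Intel compiler and Intel MPI modules (names/versions assumed)
ml intel intel-mpi
```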
A standard command to compile a Fortran or C code will look like the following (add as many flags as needed):
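A sketch with the Intel toolchain loaded; the source file and executable names are placeholders:

```bash
# Intel compilers (serial code)
ifort -O2 -o my_app.x my_app.f90
icc   -O2 -o my_app.x my_app.c

# Intel MPI compiler wrappers (parallel code)
mpiifort -O2 -o my_mpi_app.x my_mpi_app.f90
mpiicc   -O2 -o my_mpi_app.x my_mpi_app.c
```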
For GNU compilers you may want to use this sequence:
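A sketch using the default GNU toolchain listed earlier; module versions may differ, so check "ml avail":

```bash
# Load the GNU compiler and OpenMPI modules (the defaults at login)
ml gcc/9.3.0 openmpi/3.1

# Compile with the GNU compilers or the OpenMPI wrappers; names and flags are placeholders
gfortran -O2 -o my_app.x my_app.f90
gcc      -O2 -o my_app.x my_app.c
mpif90   -O2 -o my_mpi_app.x my_mpi_app.f90
mpicc    -O2 -o my_mpi_app.x my_mpi_app.c
```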
Running Jobs
Job Accounting
Rockfish allocations are made in core-hours. The recommended method for estimating your resource needs for an allocation request is to perform benchmark runs. The core-hours used for a job are calculated by multiplying the number of processor cores used by the wall-clock duration in hours. Rockfish core-hour calculations should assume that all jobs will run in the regular queue.
For example: if you request one core on one node for an hour your allocation will be charged one core-hour. If you request 24 cores on one node, and the job runs for one hour, your account will be charged 24 core-hours. For parallel jobs, compute nodes are dedicated to the job. If you request 2 compute nodes and the job runs for one hour, your allocation will be charged 96 core-hours.
Job accounting is independent of the number of processes you run on compute nodes. You can request 2 cores for your job for one hour. If you run only one process, your allocation will be charged for 2 core-hours.
Accessing the Compute Nodes
Batch jobs: Jobs can be submitted to the scheduler by writing a script and submitting it via the "sbatch" command:
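For example (script-file-name is the name of your job script, described below):

```bash
# Submit the job script; Slurm replies with "Submitted batch job <jobid>"
sbatch script-file-name
```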
where script-file-name is a file that contains a set of keywords used by the scheduler to set variables and parameters for the job, as well as the Linux commands to be executed. See Job Scripts below.

Interactive sessions: Users may need to connect to a compute node in interactive mode by using an internal script called "interact". Running "interact -usage" will provide examples and a list of parameters. For example, the command sketched below will request an interactive session on the defq queue with one core for 2 hours.
Alternatively, users can use the full Slurm command:
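One possible equivalent using standard Slurm options (a sketch matching the resources described in the next sentence):

```bash
# 12 cores for 120 minutes with 4GB of memory per core (48GB total), then an interactive shell
srun --partition=defq --ntasks=1 --cpus-per-task=12 --time=120 --mem-per-cpu=4GB --pty bash
```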
This command will request an interactive session with 12 cores for 120 minutes and 48GB memory for the job (4GB per core).
ssh from a login node directly to a compute node: users may ssh to a compute node where their jobs are running to check or monitor the status of their jobs. This connection will last a few minutes.
Slurm Job Scheduler
Rockfish uses Slurm (Simple Linux Utility for Resource Management) to manage resource scheduling and job submission. Slurm is an open-source application with active developers and a growing user community, and it has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing; that is, "interactive" use of login nodes for job processing is not allowed. Users who need to interact with their codes while they are running can request an interactive session using the "interact" script, which submits a request to the queuing system that allows interactive access to a node.
Slurm uses "partitions" to divide types of jobs (partitions are called queues on other schedulers). Rockfish defines a few partitions that will allow sequential/shared computing and parallel (dedicated or exclusive nodes), GPU jobs and large memory jobs. The default partition is "defq
".
Queues on Rockfish
Queue limits are subject to change. Rockfish will use partitions and resources associated with them to create different types of allocations.
Regular memory allocations will allow the use of all the regular compute nodes (currently the defq partition). All jobs submitted to the defq partition will be charged against this allocation.
Large memory (LM) allocations will allow the use of the large memory nodes. If a user submits a job to this partition then the LM allocation is charged by default.
Likewise, there is a GPU partition that will allow the use of the GPU nodes.
Table 5. Rockfish Production Queues

| QUEUE NAME | MAX NODES PER JOB (ASSOC'D CORES)* | MAX DURATION | MAX NUMBER OF CORES (RUNNING) | MAX NUMBER RUNNING + QUEUED | CHARGE RATE (PER NODE-HOUR) |
|---|---|---|---|---|---|
| defq | 368 nodes, 48 cores per node | 72 hours | 4800 | 9600 | 1 Service Unit (SU) |
| bigmem | 10 nodes (1524GB per node) | 48 hours | 144 | 288 | 1 SU |
| a100 | 10 nodes, 192GB RAM, 4 Nvidia A100 | 48 hours | 144 | 288 | 1 SU |
Job Management
Users can monitor their jobs with the "squeue" command. For example, user test345 might have two jobs in the queue: JobID 31559, a parallel job using 4 nodes, and JobID 31560, a large memory job running on node bigmem01.

Users can also invoke a script, "sqme", to monitor their jobs.

To cancel a job, use the "scancel" command followed by the jobid. For example, "scancel 31560" will cancel the large memory job for user test345 in the example above.
Sample Job Scripts
The following scripts are examples for different workflows. Users can modify them according to the resources needed to run their applications.
MPI Jobs
This job will run on 5 nodes each with 48 processes/cores. Total 240 MPI processes.
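A sketch of such a script; the module versions and the executable name my_mpi_app are placeholders to adjust to your application:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --partition=defq
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00

# Default toolchain listed earlier in this guide; versions may differ
ml gcc/9.3.0 openmpi/3.1

# 5 nodes x 48 tasks = 240 MPI processes
mpirun ./my_mpi_app
```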
OpenMP/Threaded Jobs
This script will run a small job that creates 8 threads. It will use the default time of 1:00:00 (one hour).
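A sketch; the executable name my_omp_app is a placeholder:

```bash
#!/bin/bash
#SBATCH --job-name=omp_job
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# No --time line: the job uses the default walltime of 1:00:00

ml gcc/9.3.0

# Run one process with 8 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_omp_app
```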
Hybrid (MPI + OpenMP)
This script will run a hybrid job (Gromacs) on two nodes; each node will have 8 MPI processes, each with 6 threads.
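A sketch; the gromacs module name and the input file topol.tpr are assumptions, so check "ml spider gromacs" for the exact module:

```bash
#!/bin/bash
#SBATCH --job-name=gromacs_hybrid
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --time=02:00:00

# Gromacs module name is an assumption; check "ml spider gromacs"
ml gcc/9.3.0 openmpi/3.1 gromacs

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# 16 MPI ranks (8 per node), 6 OpenMP threads each
mpirun gmx_mpi mdrun -ntomp $OMP_NUM_THREADS -s topol.tpr
```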
GNU parallel
This sample will run 48 serial jobs on one node using GNU parallel. This job directs output to the local scratch file system.
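A sketch; the parallel module name, the local-scratch path, and the serial_app/input file names are assumptions to adjust to your workflow:

```bash
#!/bin/bash
#SBATCH --job-name=gnu_parallel
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --time=01:00:00

# Module name for GNU parallel is an assumption; check "ml spider parallel"
ml parallel

# Work in the node-local NVMe scratch; the path is an assumption
local_dir=/tmp/$SLURM_JOB_ID
mkdir -p $local_dir

# Run 48 serial tasks at a time; ./serial_app and the input_*.dat files are placeholders
parallel -j $SLURM_NTASKS "./serial_app {} > $local_dir/{/.}.out" ::: input_*.dat

# Copy results back before the node-local space is cleaned up
cp -r $local_dir $SLURM_SUBMIT_DIR/
rm -rf $local_dir
```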
Parametric / Array / HTC jobs
This script is an example of how to run a set of 5,000 jobs, with only 480 jobs running at a time. The input files are in a directory ($workdir). A temporary directory ($tmpdir) will be created in "scratch", where each job runs. At the end of each run the temporary directory is deleted.
Bigmem (LM) Jobs
This script will run a job that needs large amounts of memory. Users need a special resource allocation (bigmem). It will use the default time 1:00:00 (one hour).
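A sketch; the partition name follows the queue table above, and the memory request and executable name are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=bigmem_job
#SBATCH --partition=bigmem        # requires a large-memory (bigmem) allocation
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --mem=1000G               # illustrative; request what the application actually needs
# No --time line: the job uses the default walltime of 1:00:00

ml gcc/9.3.0

# ./large_memory_app is a placeholder for the user's application
./large_memory_app
```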
GPU Jobs (a100 partition)
This script will run a job that uses all 4 Nvidia A100 GPUs. Users need a special resource allocation (gpu). It will use the default time 1:00:00 (one hour).
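A sketch; the cuda module name and the executable my_gpu_app are assumptions, so check "ml avail cuda" for the exact module:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=a100          # requires a GPU allocation
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --gres=gpu:4              # all 4 Nvidia A100 GPUs on the node
# No --time line: the job uses the default walltime of 1:00:00

# CUDA module name/version is an assumption; check "ml avail cuda"
ml cuda

# ./my_gpu_app is a placeholder for the user's GPU-enabled application
./my_gpu_app
```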
Help
Please visit the ACCESS help desk for important contact information. When submitting a support ticket, please include:
- a complete description of the problem, with accompanying screenshots if applicable
- any paths to job scripts or input/output files
- the name of the login node, if you are having problems while on a login node