Rockfish - JHU

Introduction

Rockfish is a community-shared cluster at Johns Hopkins University. It follows the "condominium model" with three main integrated units. The first unit is funded by a National Science Foundation (NSF) Major Research Instrumentation grant (#1920103) and other major grants such as DURIP/DoD; the second unit contains medium-size condos (schools' condos); and the third is the collection of condos purchased by individual research groups. All three units share a common base infrastructure, and resources are shared by all users. Rockfish provides resources and tools to integrate traditional High Performance Computing (HPC) with Data Intensive Computing and Machine Learning (ML). As a multi-purpose resource for all fields of science, it provides High Performance and Data Intensive Computing services to Johns Hopkins University, Morgan State University, and ACCESS researchers as a Level 2 Service Provider.

Rockfish's compute nodes each have two 24-core Intel Xeon Cascade Lake 6248R processors (3.0 GHz base frequency) and a 1 TB NVMe local drive. The regular and GPU nodes have 192 GB of DDR4 memory, whereas the large memory nodes have 1.5 TB of DDR4 memory. The GPU nodes also have 4 Nvidia A100 GPUs.

Figure 1. Rockfish System

ACCESS hostname: login.rockfish.jhu.edu

Account Administration

A proposal through the ACCESS Resource Allocation request System (XRAS) is required for a research or startup allocation. See ACCESS Allocations for more information about the different types of allocations.

Configuring Your Account

Rockfish uses the bash shell by default. Submit an ACCESS support ticket to request a different shell.

Modules

The Rockfish cluster uses Lmod (a Lua-based modules system, version 8.3, developed at TACC) to dynamically manage users' shell environments. "module" commands set, modify, or delete environment variables in support of scientific applications, allowing users to select a particular version of an application or a combination of packages.

The "ml available" command will display (i) the applications that have been compiled using GNU compilers, (ii) external applications like matlab, abaqus, which are independent of the compiler used and (iii) a set of core modules. Likewise, if the Intel compilers are loaded "ml avail" will display applications that are compiled using the Intel compilers.

A set of modules is loaded by default at login time, including Slurm, gcc/9.3, and openmpi/3.1. We strongly recommend that users utilize this combination of modules whenever possible for best performance. In addition, several scientific applications are built with dependencies on other modules; users will see an on-screen message if this is the case. For more information, type:

login1$ ml spider application/version

For example, if you have the gcc/9.3.0 module loaded and try to load intel-mpi, you will get:

Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "intel-mpi" Try: "module spider intel-mpi" to see how to load the module(s).

The "ml available" command will also display a letter after the module indicating where it is:

L(oaded), D(efault), g(gpu), c(ontainer)

Table 1. Useful Modules Commands

| COMMAND | ALIAS / SHORTCUT | DESCRIPTION |
|---|---|---|
| module list | ml | List modules currently loaded |
| module avail | ml av | List all scientific applications with different versions |
| module show modulename | ml show modulename | Show the environment variables and settings in the module file |
| module load modulename | ml modulename | Load modules |
| module unload modulename | mu modulename | Unload the application or module |
| module spider modulename | ml spider modulename | Show available versions for modulename |
| module save modulename | ml save modulename | Save current modules into a session (default) or named session |
| module swap modulename | ml modulename | Automatically swap versions of modules |
| module help | ml help | Show additional information about the scientific application |

System Architecture

Rockfish has three types of compute nodes: "regular memory" (standard) compute nodes (192 GB), large memory nodes (1,524 GB), and GPU nodes with 4 Nvidia A100 GPUs each. All compute nodes have access to three GPFS file sets. Rockfish nodes and storage have Mellanox HDR100 connectivity with a 1.5:1 topology. Rockfish is managed using the Bright Computing cluster management software and the Slurm workload manager for job scheduling.

Compute Nodes

Table 2. Compute Node Specifications

| | REGULAR MEMORY COMPUTE NODES | LARGE MEMORY NODES | GPU NODES |
|---|---|---|---|
| MODEL | Lenovo SD530, Intel Xeon Gold Cascade Lake 6248R | Lenovo SR630, Intel Xeon Gold Cascade Lake 6248R | Lenovo SR670, Intel Xeon Gold Cascade Lake 6248R |
| TOTAL CORES PER NODE | 48 | 48 | 48 |
| NUMBER OF NODES | 368 | 10 | 10 |
| CLOCK RATE | 3.0 GHz | 3.0 GHz | 3.0 GHz |
| RAM | 192 GB | 1,524 GB | 192 GB |
| TOTAL NUMBER OF CORES | 17,664 | 480 | 480 |
| GPUS | N/A | N/A | 4 Nvidia A100 GPUs (40 GB) PCIe |
| TOTAL NUMBER OF GPUS | N/A | N/A | 40 |
| LOCAL STORAGE | 1 TB NVMe | 1 TB NVMe | 1 TB NVMe |

Login Nodes

Rockfish's three login nodes (login01-03) are physical nodes with architecture and features similar to the regular memory compute nodes. Please use the gateway (login.rockfish.jhu.edu) to connect to Rockfish.

Data Transfer Nodes (DTNs)

These nodes can be used to transfer data to the Rockfish cluster using secure copy (scp), Globus, or any other utility such as FileZilla. The endpoint for Globus is "Rockfish User Data". The DTNs are "rfdtn1.rockfish.jhu.edu" and "rfdtn2.rockfish.jhu.edu". These nodes mount all file systems.

Systems Software Environment

Table 3. Systems Software Environment

| SOFTWARE FUNCTION | DESCRIPTION |
|---|---|
| CLUSTER MANAGEMENT | Bright Cluster Management |
| FILE SYSTEM MANAGEMENT | Xcat/Confluent |
| OPERATING SYSTEM | CentOS 8.2 |
| FILE SYSTEMS | GPFS, ZFS |
| SCHEDULER AND RESOURCE MANAGEMENT | Slurm |
| USER ENVIRONMENT | Lua modules |
| COMPILERS | Intel, GNU, PGI |
| MESSAGE PASSING | Intel MPI, OpenMPI, MVAPICH |

File Systems

Table 4. Rockfish File Systems

| FILE SYSTEM | QUOTA | FILE RETENTION | BACKUP | FEATURES |
|---|---|---|---|---|
| $HOME | 50 GB | No file deletion policy | Backed up to an off-site location | NVMe file system |
| $SCRATCH4 | 10 TB (combined with scratch16) | 30-day retention; files that have not been accessed for 30 days are moved to the /data file system | No | Optimized for small files; 4 MB block size |
| $SCRATCH16 | Same as above | Same as above | No | Optimized for large files; 16 MB block size |
| data | 10 TB | No deletion policy, but quota driven | Optional | GPFS file set, lower performance |


Accessing the System

Rockfish is accessible only to those users and research groups that have been awarded a Rockfish-specific allocation. ACCESS users may connect to Rockfish using an SSH client using SSH keys for authentication; password-based authentication is not supported.

Users must generate and install their own SSH keys. For help with either of these, see the Generating SSH Keys and/or Uploading Your Public Key pages.

After you have uploaded your public key, you should be able to connect to the Rockfish system using an SSH client. For example, from a computer running Linux, MacOS, Windows Subsystem for Linux, or Windows PowerShell, you may connect to Rockfish by opening a Terminal (or PowerShell) and entering:
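    ssh your-userid@login.rockfish.jhu.edu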

or
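    ssh -l your-userid login.rockfish.jhu.edu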

Third-party SSH clients that provide a GUI (e.g., Bitvise, MobaXterm, PuTTY) may also be used to connect to Rockfish.

"login" is a gateway server that will authenticate credentials and then connect the user to one of three physical login nodes (identical to regular compute nodes). Hostname: login.rockfish.jhu.edu (gateway)

Citizenship

You share Rockfish with thousands of other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb:

  • Don't run jobs on the login nodes. Login nodes are used by hundreds of users to monitor their jobs, submit jobs, edit and manipulate files, and in some cases compile codes. We strongly request that users abstain from running jobs on login nodes. Sometimes users may want to run quick jobs to check that input files are correct or that scientific applications are working properly. If this is the case, make sure this activity does not take more than a few minutes, or better yet, request an interactive session ("interact") to fully test your codes.

  • Don't stress the file systems. Do not perform activities that may impact the file systems (and the login nodes), for example running rsync or copying large or many files from one file system to another. Please use Globus or the data transfer nodes (rfdtn1, rfdtn2) to copy large amounts of data.

  • When submitting a help-desk ticket, be as informative as possible.

Login Node Activities

  • Request an interactive session ("interact -usage" shows the available options)

  • Compile codes, for example run "make". Be careful if you are running commands with multiple processes. "make -j 4" may be fine but "make -j 20" may impact other users.

  • Check jobs using the "sqme" command

  • Edit files, scripts, manipulate files

  • Submit jobs

  • Check output files

What is NOT allowed:

  • Running executables, e.g., "./a.out"

  • Running multiple rsync sessions or copying large numbers of files

Managing Files

Transferring your Files

  1. scp: Secure copy commands can be used when transferring small amounts of data. We strongly encourage you to use the data transfer nodes instead of the gateway.

    scp [-r] file-name userid@rfdtn1.rockfish.jhu.edu:/path/to/file/dir

  2. rsync: An alternative to scp is "rsync". This command is useful when copying files between file systems or in/out of Rockfish. rsync can also be used to keep file systems in sync as new files are created or modified. For example:
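    rsync -avz /path/to/local/dir userid@rfdtn1.rockfish.jhu.edu:/path/to/destination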

    or
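    rsync -avz userid@rfdtn1.rockfish.jhu.edu:/path/to/source /path/to/local/dir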

  3. Globus: We strongly recommend the use of our managed endpoints via Globus. Rockfish's Globus endpoint is "Rockfish User Data".

Sharing Files with Collaborators

Users are strongly encouraged to use Globus features to share files with internal or external collaborators.

Software

Rockfish provides a broad application base managed by Lua modules. The most commonly used packages in bioinformatics, molecular dynamics, quantum chemistry, structural mechanics, and genomics are available ("ml avail"). Rockfish also supports Singularity containers.

Installed Software

Rockfish uses Lua modules. Type "ml avail" to list all the scientific applications that are installed and available via modules.

  • "module" (or "ml") : displays a list of installed applications and corresponding versions.

  • "ml spider APP1" : displays all information on package APP1 (if it is installed)

  • "ml help APP1" : displays any additional information on this scientific application.

Building Software

Users may want to install scientific applications that are used only by themselves or by their group in their HOME directories. They can then create a private module.

  1. Create a directory to install the application: "mkdir -p $HOME/code/APP1"

  2. Install the application following the instructions (README or INSTALL files).

  3. Create a directory in your HOME directory to create a module file: "mkdir $HOME/modulefiles/APP1"

  4. Create a ".lua" file that adds the application path to your $PATH environment variable and all other requirements (lib or include files).

  5. Load the module as "ml own; ml APP1"
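A minimal sketch of steps 3-5 for a hypothetical package APP1 (the install path, module directory, version number, and the "own" meta-module are assumptions; adjust them to your setup):

```bash
# step 3: directory for the private module file
mkdir -p $HOME/modulefiles/APP1

# step 4: minimal Lua module file for a hypothetical version 1.0
cat > $HOME/modulefiles/APP1/1.0.lua <<'EOF'
local base = pathJoin(os.getenv("HOME"), "code/APP1")
prepend_path("PATH", pathJoin(base, "bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(base, "lib"))
EOF

# step 5: make the private module tree visible, then load the module
# ("module use" is a generic alternative to the site-provided "ml own")
module use $HOME/modulefiles
ml APP1/1.0
```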

Compilers and Recommendations

The Rockfish cluster provides three different compilers for the compute nodes: GNU, Intel, and PGI. There are also MPI libraries (OpenMPI, Intel MPI, and MVAPICH2). Most applications have been built using GNU compilers version 9.3.0. Users should evaluate which compiler gives the best performance for their applications.

The Intel compilers and Intel MPI libraries can be loaded by executing the following command:
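    ml intel intel-mpi   # module names/versions may differ; check "ml avail"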

A standard command to compile a Fortran or C code will look like the following (add as many flags as needed; file and program names are placeholders):
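    ifort -O2 -o my_app my_code.f90
    mpiicc -O2 -o my_mpi_app my_mpi_code.c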

For GNU compilers you may want to use this sequence:
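    ml gcc/9.3.0 openmpi/3.1
    gfortran -O2 -o my_app my_code.f90
    mpicc -O2 -o my_mpi_app my_mpi_code.c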

Running Jobs

Job Accounting

Rockfish allocations are made in core-hours. The recommended method for estimating your resource needs for an allocation request is to perform benchmark runs. The core-hours used for a job are calculated by multiplying the number of processor cores used by the wall-clock duration in hours. Rockfish core-hour calculations should assume that all jobs will run in the regular queue.

For example: if you request one core on one node for an hour your allocation will be charged one core-hour. If you request 24 cores on one node, and the job runs for one hour, your account will be charged 24 core-hours. For parallel jobs, compute nodes are dedicated to the job. If you request 2 compute nodes and the job runs for one hour, your allocation will be charged 96 core-hours.

Job accounting is independent of the number of processes you run on compute nodes. You can request 2 cores for your job for one hour. If you run only one process, your allocation will be charged for 2 core-hours.

Accessing the Compute Nodes

  • Batch jobs: Jobs can be submitted to the scheduler by writing a script and submitting it via the "sbatch" command:
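    sbatch script-file-name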

    where script-file-name is a file that contains a set of keywords used by the scheduler to set variables and the parameters for the job. It also contains a set of Linux commands to be executed. See Job Scripts below.

  • Interactive sessions: Users may need to connect to a compute node in interactive mode by using an internal script called "interact". Running "interact -usage" will provide examples and a list of parameters. For example:
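    # flags are illustrative; run "interact -usage" for the exact option names
    interact -p defq -n 1 -t 02:00:00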

    This will request an interactive session on the defq queue with one core for 2 hours.

    Alternatively users can use the full command:
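    # one possible equivalent using standard Slurm options
    srun -p defq --ntasks=12 --time=120 --mem-per-cpu=4G --pty bash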

    This command will request an interactive session with 12 cores for 120 minutes and 48GB memory for the job (4GB per core).

  • ssh from a login node directly to a compute node: Users may ssh to a compute node where their jobs are running to check or monitor the status of their jobs. This type of connection should last only a few minutes.

Slurm Job Scheduler

Rockfish uses Slurm (Simple Linux Utility for Resource Management) to manage resource scheduling and job submission. Slurm is an open source application with active developers and a growing user community, and it has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing; that is, "interactive" use of login nodes for job processing is not allowed. Users who need to interact with their codes while they are running can request an interactive session using the script "interact", which submits a request to the queuing system that allows interactive access to a node.

Slurm uses "partitions" to divide types of jobs (partitions are called queues on other schedulers). Rockfish defines a few partitions that will allow sequential/shared computing and parallel (dedicated or exclusive nodes), GPU jobs and large memory jobs. The default partition is "defq".

Queues on Rockfish

Queue limits are subject to change. Rockfish will use partitions and resources associated with them to create different types of allocations.

Regular memory allocations allow the use of all the regular compute nodes (currently the defq partition). All jobs submitted to the defq partition are charged against this allocation.

Large memory (LM) allocations will allow the use of the large memory nodes. If a user submits a job to this partition then the LM allocation is charged by default.

Likewise, there is a GPU partition that allows the use of the GPU nodes.

Table 5. Rockfish Production Queues

| QUEUE NAME | MAX NODES PER JOB (ASSOC'D CORES)* | MAX DURATION | MAX NUMBER OF CORES (RUNNING) | MAX NUMBER RUNNING + QUEUED | CHARGE RATE (PER NODE-HOUR) |
|---|---|---|---|---|---|
| defq | 368 nodes, 48 cores per node | 72 hours | 4,800 | 9,600 | 1 Service Unit (SU) |
| bigmem | 10 nodes (1,524 GB per node) | 48 hours | 144 | 288 | 1 SU |
| a100 | 10 nodes, 192 GB RAM, 4 Nvidia A100 per node | 48 hours | 144 | 288 | 1 SU |

Job Management

Users can monitor their jobs with the "squeue" command. In this example, user test345 is running two jobs: JobID 31559 is a parallel job using 4 nodes, and JobID 31560 is a large memory job running on node bigmem01.
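    squeue -u test345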

Users can also invoke the "sqme" script to monitor their jobs.

To cancel a job, use the "scancel" command followed by the job ID. For example, "scancel 31560" will cancel the large memory job for user test345 in the example above.

Sample Job Scripts

The following scripts are examples for different workflows. Users can modify them according to the resources needed to run their applications.

MPI Jobs

This job will run on 5 nodes with 48 processes/cores each, for a total of 240 MPI processes.
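A minimal sketch of such a script, assuming the default GNU/OpenMPI toolchain; the executable ./my_mpi_app is a placeholder:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --partition=defq
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00

# default toolchain; versions may differ, check "ml avail"
ml gcc/9.3.0 openmpi/3.1

# launch 240 MPI processes (5 nodes x 48 tasks per node)
mpirun ./my_mpi_app
```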

OpenMP/Threaded Jobs

This script will run a small job that creates 8 threads. It will use the default time of 1:00:00 (one hour).
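A minimal sketch, with the executable ./my_omp_app as a placeholder:

```bash
#!/bin/bash
#SBATCH --job-name=openmp_job
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# no --time line, so the default wall time (1:00:00) applies

ml gcc/9.3.0

# one process running 8 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_omp_app
```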

Hybrid (MPI + OpenMP)

This script will run a hybrid job (Gromacs) on two nodes; each node will run 8 MPI processes, each with 6 threads.
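A minimal sketch; the Gromacs module name/version and the input name my_system are placeholders (check "ml spider gromacs"):

```bash
#!/bin/bash
#SBATCH --job-name=hybrid_gromacs
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --time=01:00:00

# module names/versions are placeholders
ml gcc/9.3.0 openmpi/3.1 gromacs

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# 16 MPI ranks (2 nodes x 8), each rank running 6 OpenMP threads
mpirun gmx_mpi mdrun -ntomp $OMP_NUM_THREADS -deffnm my_system
```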

GNU parallel

This sample will run 48 serial jobs on one node using GNU parallel. This job directs output to the local scratch file system.
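A minimal sketch; the module name, the executable ./serial_app, the input naming, and the local scratch path /tmp are assumptions:

```bash
#!/bin/bash
#SBATCH --job-name=gnu_parallel
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --time=01:00:00

ml parallel   # module name is an assumption; check "ml avail"

# write output to the node-local NVMe drive, then copy it back
localdir=/tmp/$SLURM_JOB_ID
mkdir -p $localdir

# run up to 48 serial tasks at a time, one per input file
parallel -j $SLURM_NTASKS "./serial_app {} > $localdir/{/.}.out" ::: input_*.dat

cp $localdir/*.out $SLURM_SUBMIT_DIR/
rm -rf $localdir
```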

Parametric / Array / HTC jobs

This script is an example of running a set of 5,000 jobs, with only 480 jobs running at a time. The input files are in a directory ($workdir). A temporary directory ($tmpdir) will be created in "scratch" where each job runs. At the end of each run the temporary directory is deleted.
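A minimal sketch; the paths, the executable ./my_app, and the input naming scheme are placeholders, and $SCRATCH4 is assumed to point at your scratch directory:

```bash
#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --partition=defq
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --array=1-5000%480    # 5,000 tasks, at most 480 running at once

# directory with the input files (placeholder path)
workdir=$HOME/param_inputs

# per-task temporary directory in scratch, removed when the task finishes
tmpdir=$SCRATCH4/tmp_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
mkdir -p $tmpdir
cd $tmpdir

cp $workdir/input_${SLURM_ARRAY_TASK_ID}.dat .
$workdir/my_app input_${SLURM_ARRAY_TASK_ID}.dat > result_${SLURM_ARRAY_TASK_ID}.out

cp result_${SLURM_ARRAY_TASK_ID}.out $workdir/
cd $workdir && rm -rf $tmpdir
```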

Bigmem (LM) Jobs

This script will run a job that needs large amounts of memory. Users need a special resource allocation (bigmem). It will use the default time of 1:00:00 (one hour).
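A minimal sketch; the account name and the executable ./my_bigmem_app are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=bigmem_job
#SBATCH --partition=bigmem
#SBATCH --account=my_alloc_bigmem   # placeholder; use your large-memory allocation
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --mem=1000G                 # adjust to what the job actually needs
# no --time line, so the default wall time (1:00:00) applies

ml gcc/9.3.0

./my_bigmem_app
```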

GPU Jobs (a100 partition)

This script will run a job that uses all 4 Nvidia A100 GPUs. Users need a special resource allocation (gpu). It will use the default time of 1:00:00 (one hour).
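A minimal sketch; the account name, module names, and the executable ./my_gpu_app are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=a100
#SBATCH --account=my_alloc_gpu    # placeholder; use your GPU allocation
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --gres=gpu:4              # request all 4 A100s on the node
# no --time line, so the default wall time (1:00:00) applies

# module names/versions are placeholders; check "ml avail"
ml gcc/9.3.0 cuda

./my_gpu_app
```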

Help

Please visit the ACCESS help desk for important contact information. When submitting a support ticket, please include:

  • a complete description of the problem with accompanying screenshots if applicable

  • any paths to job scripts or input/output files

  • the name of the login node, if you are having problems while on a login node