DARWIN - Delaware

Getting started on DARWIN

DARWIN (Delaware Advanced Research Workforce and Innovation Network) is a big data and high performance computing system designed to catalyze Delaware research and education, funded by a $1.4 million grant from the National Science Foundation (NSF). This award established the DARWIN computing system as an XSEDE Level 2 Service Provider in Delaware, contributing 20% of DARWIN's resources to XSEDE (Extreme Science and Engineering Discovery Environment), which transitioned to ACCESS (Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support) as of September 1, 2022. DARWIN has 105 compute nodes with a total of 6,672 cores, 22 GPUs, 100 TB of memory, and 1.2 PB of disk storage. See compute nodes and storage for complete details on the architecture.

Figure 1. Fish-eye front view of DARWIN in the computing center

Configuration

The DARWIN cluster is being set up to be very similar to the existing Caviness cluster and will be familiar to those currently using Caviness. However, DARWIN is an NSF-funded HPC resource available via a committee-reviewed allocation request process similar to ACCESS allocations.

An HPC system always has one or more public-facing systems known as login nodes. The login nodes are supplemented by many compute nodes which are connected by a private network. One or more head nodes run programs that manage and facilitate the functioning of the cluster. (In some clusters, the head node functionality is present on the login nodes.) Each compute node typically has several multi-core processors that share memory. Finally, all the nodes share one or more filesystems over a high-speed network.

Figure 2. DARWIN Configuration

Login nodes

Login (head) nodes are the gateway into the cluster and are shared by all cluster users. Their computing environment is a full standard variant of Linux configured for scientific applications. This includes command documentation (man pages), scripting tools, compiler suites, debugging/profiling tools, and application software. In addition, the login nodes have several tools to help you move files between the HPC filesystems and your local machine, other clusters, and web-based services.

Login nodes should be used to set up and submit job workflows and to compile programs. You should generally use compute nodes to run or debug application software or your own executables.

If your work requires highly interactive graphics and animations, these are best done on your local workstation rather than on the cluster. Use the cluster to generate files containing the graphics information, and download them from the HPC system to your local system for visualization.

When you use SSH to connect to darwin.hpc.udel.edu your computer will choose one of the login (head) nodes at random. The default command line prompt clearly indicates to which login node you have connected: for example, [bjones@login00.darwin ~]$ is shown for account bjones when connected to login node login00.darwin.hpc.udel.edu.

Only use SSH to connect to a specific login node if you have existing processes present on it, for example, if you used the screen or tmux utility to preserve your session after logout.

Compute nodes

There are many compute nodes with different configurations. Each node consists of multi-core processors (CPUs), memory, and local disk space. Nodes can have different OS versions or OS configurations, but this document assumes all the compute nodes have the same OS and almost the same configuration. Some nodes may have more cores, more memory, GPUs, or more disk.

All compute nodes are now available and configured for use. Each compute node has 64 cores, so the compute resources available are the following:

Compute Node       | Number of Nodes | Node Names           | Total Cores | Memory Per Node         | Total Memory | Total GPUs
Standard           | 48              | r1n00 - r1n47        | 3,072       | 512 GiB                 | 24 TiB       |
Large Memory       | 32              | r2l00 - r2l31        | 2,048       | 1,024 GiB               | 32 TiB       |
Extra-Large Memory | 11              | r2x00 - r2x10        | 704         | 2,048 GiB               | 22 TiB       |
nVidia-T4          | 9               | r1t00 - r1t07, r2t08 | 576         | 512 GiB                 | 4.5 TiB      | 9
nVidia-V100        | 3               | r2v00 - r2v02        | 144         | 768 GiB                 | 2.25 TiB     | 12
AMD-MI50           | 1               | r2m00                | 64          | 512 GiB                 | 0.5 TiB      | 1
Extended Memory    | 1               | r2e00                | 64          | 1,024 GiB + 2.73 TiB 1) | 3.73 TiB     |
Total              | 105             |                      | 6,672       |                         | 88.98 TiB    | 22

The standard Linux installation on the compute nodes is configured to support just the running of your jobs, particularly parallel jobs. For example, there are no man pages on the compute nodes. All compute nodes will have full development headers and libraries.

Commercial applications, and normally your own programs, will use a layer of abstraction called a programming model. Consult the cluster-specific documentation for advanced techniques to take advantage of the low-level architecture.

Storage

Home filesystem

Each DARWIN user receives a home directory (/home/<uid>) that will remain the same during and after the early access period. This storage is slower and has a 20 GiB quota. It should be used for personal software installs and shell configuration files.

Lustre high-performance filesystem

Lustre is designed to use parallel I/O techniques to reduce file-access time. The Lustre filesystems in use at UD are composed of many physical disks using RAID technologies to give resilience, data integrity, and parallelism at multiple levels. There is approximately 1.1 PiB of Lustre storage available on DARWIN. It uses high-bandwidth interconnects such as Mellanox HDR100. Lustre should be used for storing input files, supporting data files, work files, and output files associated with computational tasks run on the cluster.

  • Each allocation will be assigned a workgroup storage in the Lustre directory (/lustre/«workgroup»/).

  • Each workgroup storage will have a users directory (/lustre/«workgroup»/users/«uid») for each user of the workgroup to be used as a personal directory for running jobs and storing larger amounts of data.

  • Each workgroup storage will have a software and VALET directory (/lustre/«workgroup»/sw/ and /lustre/«workgroup»/sw/valet) to allow users of the workgroup to install software and create VALET package files that need to be shared by others in the workgroup.

  • There will be a quota limit set based on the amount of storage approved for your allocation for the workgroup storage.

While all filesystems on the DARWIN cluster utilize hardware redundancies to protect data, there is no backup or replication and no recovery available for the home or Lustre filesystems.

Local filesystems

Each node has an internal, locally connected disk. Its capacity is measured in terabytes. Each compute node on DARWIN has a 1.75 TiB SSD local scratch filesystem disk. Part of the local disk is used for system tasks such as memory management, which might include cache memory and virtual memory. The remainder of the disk is ideal for applications that need a moderate amount of scratch storage for the duration of a job's run. That portion is referred to as the node scratch filesystem.

Each node scratch filesystem disk is only accessible by the node in which it is physically installed. The job scheduling system creates a temporary directory associated with each running job on this filesystem. When your job terminates, the job scheduler automatically erases that directory and its contents.
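
As a minimal sketch of how a batch job might use this per-job directory (the partition name, file names, and program name are placeholders; Slurm exposes the directory path via $TMPDIR, as described later in this guide):

#!/bin/bash
#SBATCH --partition=standard
#SBATCH --ntasks=1

# Stage input into the per-job scratch directory, run there, then copy
# results back before the scheduler erases the directory at job end.
cp "$WORKDIR_USERS/project/input.dat" "$TMPDIR/"
cd "$TMPDIR"
"$WORKDIR_USERS/project/my_program" input.dat > output.log
cp output.log "$WORKDIR_USERS/project/"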

Software

There will not be a full set of software during early access and testing, but we will be continually installing and updating software. Installation priority will go to compilers, system libraries, and highly utilized software packages. Please DO let us know if there are packages that you would like to use on DARWIN, as that will help us prioritize user needs, but understand that we may not be able to install software requests in a timely manner.

Users will be able to compile and install software packages in their home or workgroup directories. There will be very limited support for helping with user-compiled installs or debugging during early access. Please reference basic software building and management to get started with software installations utilizing VALET (versus Modules), as suggested and used by IT RCI staff on our HPC systems.

Please review the following documents if you are planning to compile and install your own software:

  • High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7002 Series Processors guide for getting started tuning AMD 2nd Gen EPYC™ Processor based systems for HPC workloads. This is not an all-inclusive guide and some items may have similar, but different, names in specific OEM systems (e.g. OEM-specific BIOS settings). Every HPC workload varies in its performance characteristics. While this guide is a good starting point, you are encouraged to perform your own performance testing for additional tuning. This guide also provides suggestions on which items should be the focus of additional, application-specific tuning (November 2020).

  • HPC Tuning Guide for AMD EPYC™ Processors guide intended for vendors, system integrators, resellers, system managers and developers who are interested in EPYC system configuration details. There is also a discussion on the AMD EPYC software development environment, and we include four appendices on how to install and run the HPL, HPCG, DGEMM, and STREAM benchmarks. The results produced are ‘good' but are not necessarily exhaustively tested across a variety of compilers with their optimization flags (December 2018).

  • AMD EPYC™ 7xx2-series Processors Compiler Options Quick Reference Guide, however we do not have the AOCC compiler (with Flang - Fortran Front-End) installed on DARWIN.

Scheduler

DARWIN uses the Slurm scheduler, like Caviness; Slurm is the most common scheduler among ACCESS resources. Slurm on DARWIN is configured as fair-share, with each user given equal shares to access the HPC resources currently available on DARWIN.

Queues (Partitions)

Partitions have been created to align with allocation requests moving forward, based on different node types. There is no default partition, and you must specify exactly one partition at a time. It is not possible to specify multiple partitions in Slurm to span different node types.

Run Jobs

In order to schedule any job (interactively or batch) on the DARWIN cluster, you must set your workgroup to define your cluster group. Each research group has been assigned a unique workgroup. Each research group should have received this information in a welcome email. For example,

# workgroup -g it_css

will enter the workgroup for it_css. You will know if you are in your workgroup based on the change in your bash prompt. See the following example for user bjones

[bjones@login00.darwin ~]$ workgroup -g it_css
[(it_css:bjones)@login00.darwin ~]$ printenv USER HOME WORKDIR WORKGROUP WORKDIR_USER
bjones
/home/1201
/lustre/it_css
it_css
/lustre/it_css/users/1201
[(it_css:bjones)@login00.darwin ~]$

Now we can use salloc or sbatch, as long as a partition is specified, to submit an interactive or batch job, respectively. See the DARWIN Run Jobs, Schedule Jobs and Managing Jobs wiki pages for more help with Slurm, including how to specify resources and check on the status of your jobs.
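
For example, a minimal sketch (the partition name standard and the script name myjob.qs are placeholders; use a partition your allocation can access):

# interactive session on a compute node
salloc --partition=standard --ntasks=1

# batch submission of a job script
sbatch --partition=standard myjob.qs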

All resulting executables (created via your own compilation) and other applications (commercial or open-source) should only be run on the compute nodes.

It is a good idea to periodically check in /opt/shared/templates/slurm/ for updated or new templates to use as job scripts to run generic or specific applications designed to provide the best performance on DARWIN.

Help

ACCESS allocations

To report a problem or provide feedback, submit a help desk ticket on the ACCESS Portal and complete the form, selecting darwin.udel.xsede.org as the system and providing your problem details in the description field, to route your question more quickly to the research support team. Provide enough details (including full paths of batch script files, log files, or important input/output files) that our consultants can begin to work on your problem without having to ask you basic initial questions.

Ask or tell the HPC community

hpc-ask is a Google group established to stimulate interactions within UD's broader HPC community and is based on members helping members. This is a great venue to post a question about HPC, start a discussion, or share an upcoming event with the community. Anyone may request membership. Messages are sent as a daily summary to all group members. This list is archived, public, and searchable by anyone.

Publication and Grant Writing Resources

Please refer to the NSF award information when preparing a proposal or requesting allocations on DARWIN. We require all allocation recipients to acknowledge their allocation awards using the following standard text: “This research was supported in part through the use of DARWIN computing system: DARWIN – A Resource for Computational and Data-intensive Research at the University of Delaware and in the Delaware Region, Rudolf Eigenmann, Benjamin E. Bagozzi, Arthi Jayaraman, William Totten, and Cathy H. Wu, University of Delaware, 2021, URL: https://udspace.udel.edu/handle/19716/29071″

ACCESS Allocations

A PI may request allocations on DARWIN via ACCESS. See the ACCESS Allocations page for details on how to do so. If an allocation on DARWIN is granted, the PI may use the ACCESS Allocations portal to add or remove accounts for an active allocation on DARWIN, as long as the person to be added has an ACCESS user portal account. If the person doesn't have an ACCESS user portal account, then they need to visit the ACCESS User Registration to create one. The person will need to share their ACCESS user portal username with the PI in order to be added. Please keep in mind it may take up to 10 business days to process an account request on DARWIN for ACCESS users.

Accounts

An ACCESS username will be assigned having the form xsedeu«uid», where uid is a unique, 4-digit numerical identifier assigned to you. An email with the subject [darwin-users] New DARWIN ACCESS (XSEDE) account information will be sent to the ACCESS user once their account is ready on DARWIN. Please keep in mind it may take up to 10 business days to process an account request on DARWIN for ACCESS users. Passwords are not set for ACCESS accounts on DARWIN, so you must set a password using the password reset web application at https://idp.hpc.udel.edu/access-password-reset/.

The application starts by directing the client to the CILogon authentication system where the “ACCESS CI (XSEDE)” provider should be selected. If successful (and the client has an account on DARWIN), the application next asks for an email address to which a verification email should be sent; the form is pre-populated with the email address on-record on DARWIN for the client's account. The client has 15 minutes to follow the link in that email message to choose a new password. The form displays information regarding the desired length and qualifications of a valid password. If the new password is acceptable, the client's DARWIN password is set and SSH access via password should become available immediately.

ACCESS users on DARWIN can use the password reset web application to reset a forgotten password, too.

See connecting to DARWIN for more details.

For example,

$ hpc-user-info -a xsedeu1201
full-name = Student Training
last-name = Student Training
home-directory = /home/1201
email-address = bjones@udel.edu
clusters = DARWIN

COMMAND                   | FUNCTION
hpc-user-info -a username | Display info about a user
hpc-user-info -h          | Display complete syntax

Groups

The allocation groups of which you are a member determine which computing nodes, job queues, and storage resources you may use. Each group has a unique descriptive group name (gname). There are two categories of group names: class and workgroup.

The class category: All users belong to the group named everyone.

The workgroup category: Each workgroup has a unique group name (e.g., xg-tra180011) assigned for each allocation. The PI and users are members of that allocation group (workgroup). To see the usernames of all members of the workgroup, type the hpc-group-info -a allocation_workgroup command.

Use the groups command to see all of your groups. The example below is for user xsedeu1201.
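
A sketch of what this might look like (the workgroup name shown is illustrative):

[xsedeu1201@login00.darwin ~]$ groups
everyone xg-tra180011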

For example, the command below will display the complete information about the workgroup xg-tra180011 and its members.
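
Using the syntax shown above, the command would be (workgroup name as in this example):

hpc-group-info -a xg-tra180011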

The output of this command shows the workgroup description (the PI), along with every member of the workgroup and their account information (username, full name, email address).

Connecting to DARWIN

Secure Shell program (SSH)

Use a secure shell program/client (SSH) to connect to the cluster and a secure file transfer program to move files to and from the cluster.

There are many suitable secure clients for Windows, Mac OS X, and UNIX/Linux. We recommend MobaXterm or PuTTY and Xming for Windows users. Macintosh and UNIX/Linux users can use their pre-installed SSH and X11 software. (Newer versions of Mac OS X may not have a current version of X11 installed. See the Apple web site for X11 installation instructions.)

IT strongly recommends that you configure your clients as described in the online X-windows (X11) and SSH documents (Windows / Linux/MacOSX). If you need help generating or uploading your SSH keys, please see the Managing SSH Keys page for ACCESS recommendations on how to do so.

Your HPC home directory has a .ssh directory. Do not manually erase or modify the files that were initially created by the system. They facilitate communication between the login (head) node and the compute nodes. Only use standard ssh commands to add keys to the files in the .ssh directory.

Please refer to the Windows and Mac/Linux related sections for specific details on using the command line on your local computer.

Logging on to DARWIN

You need a DARWIN account to access the login node.

To learn about launching GUI applications on DARWIN, refer to Schedule Jobs page.

ACCESS users with an allocation award on DARWIN will not be able to login until their password is set by using the password reset web application at https://idp.hpc.udel.edu/access-password-reset/.

The application starts by directing the client to the CILogon authentication system where the “ACCESS CI (XSEDE)” provider should be selected. If successful (and the client has an account on DARWIN), the application next asks for an email address to which a verification email should be sent; the form is pre-populated with the email address on-record on DARWIN for the client's account. The client has 15 minutes to follow the link in that email message to choose a new password. The form displays information regarding the desired length and qualifications of a valid password. If the new password is acceptable, the client's DARWIN password is set and SSH access via password should become available immediately.

ACCESS users on DARWIN can use the password reset web application to reset a forgotten password, too.

Once a password has been set, you may login to DARWIN by using:
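
For instance, a typical invocation might look like the following sketch (hostname as described earlier in this document):

ssh xsedeuXXXX@darwin.hpc.udel.edu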

or, if you need to use X-Windows requiring X11 forwarding (e.g., for a Jupyter Notebook or applications that generate graphical output), then use
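
for example (a sketch; the -Y option enables trusted X11 forwarding):

ssh -Y xsedeuXXXX@darwin.hpc.udel.edu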

where XXXX is your unique uid. The standard methods documented for adding a public key on DARWIN will only work once a password has been set for your ACCESS DARWIN account using the password reset web application. If you need help setting up SSH, please see the Generating SSH Keys page and/or the Uploading Your SSH Key page.

Once you are logged into DARWIN, your account is configured as a member of an allocation workgroup which determines access to your HPC resources on DARWIN. Setting your allocation workgroup is required in order to submit jobs to the DARWIN cluster. For example, the bjones account is a member of the it_css workgroup. To start a shell in the it_css workgroup, type:
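
As shown earlier in this document:

workgroup -g it_css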

Consult the following pages for detailed instructions for using DARWIN.

File Systems

Home

The 13.5 TiB file system uses 960 GiB enterprise class SSD drives in a triple-parity RAID configuration for high reliability and availability. The file system is accessible to all nodes via IPoIB on the 100 Gbit/s InfiniBand network.

Storage

Each user has 20 GB of disk storage reserved for personal use on the home file system. Users' home directories are in /home (e.g., /home/1005), and the directory name is put in the environment variable $HOME at login.

High-Performance Lustre

Lustre is designed to use parallel I/O techniques to reduce file-access time. The Lustre file systems in use at UD are composed of many physical disks using RAID technologies to give resilience, data integrity, and parallelism at multiple levels. There is approximately 1.1 PiB of Lustre storage available on DARWIN. It uses high-bandwidth interconnects such as Mellanox HDR100. Lustre should be used for storing input files, supporting data files, work files, and output files associated with computational tasks run on the cluster.

Consult All About Lustre for more detailed information.

Workgroup Storage

Allocation workgroup storage is available on a high-performance Lustre-based file system having almost 1.1 PB of usable space. Users should have a basic understanding of the concepts of Lustre to take full advantage of this file system. The default stripe count is set to 1, and the default striping is a single stripe distributed across all available OSTs on Lustre. See Lustre Best Practices from NASA.

Each allocation will have at least 1 TiB of shared (workgroup) storage in the /lustre/ directory, identified by the «allocation_workgroup» (e.g., /lustre/it_css) and accessible by all users in the allocation workgroup. It is referred to as your workgroup directory ($WORKDIR) if the allocation workgroup has been set.

Each user in the allocation workgroup will have a /lustre/«workgroup»/users/«uid» directory to be used as a personal workgroup storage directory for running jobs, storing larger amounts of data, input files, supporting data files, work files, output files and source code. It can be referred to as $WORKDIR_USERS, if the allocation workgroup has been set.

Each allocation will also have a /lustre/«workgroup»/sw directory to allow users to install software to be shared with the allocation workgroup. It can be referred to as $WORKDIR_SW, if the allocation workgroup has been set. In addition, a /lustre/«workgroup»/sw/valet directory is also provided to store VALET package files to be shared with the allocation workgroup.

Please see workgroup for complete details on environment variables.

Note: A full file system inhibits use for everyone, preventing jobs from running.

Local/Node File System

Temporary Storage

Each compute node has its own 2 TB local hard drive, part of which is needed for time-critical tasks such as managing virtual memory. The system usage of the local disk is kept as small as possible to allow some local disk for your applications running on the node.

Quotas and Usage

To help users maintain awareness of quotas and their usage on the /home file system, the command my_quotas is available to display a list of all quota-controlled file systems on which the user has storage space.

For example, the following shows the amount of storage available and in-use for user bjones in workgroup it_css for their home and workgroup directory.

Home

Each user's home directory has a hard quota limit of 20 GB. To check usage, use

The example below displays the usage for the home directory (/home/1201) for the account bjones as 7.2 GB used out of 20 GB, which matches the example provided above by the my_quotas command.

Workgroup

All of Lustre is available for allocation workgroup storage. To check Lustre usage for all users, use df -h /lustre.

The example below shows 25 TB is in use out of 954 TB of usable Lustre storage.
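
A sketch of what that output might look like (the filesystem source field is illustrative; the usage figures are those cited above):

$ df -h /lustre
Filesystem               Size  Used  Avail Use% Mounted on
<lustre-servers>:/darwin 954T   25T  929T   3%  /lustre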

To see your allocation workgroup usage, please use the my_quotas command. Again, the following example shows the amount of storage available and in-use for user bjones in allocation workgroup it_css for their home and allocation workgroup directories.

Node

The node temporary storage is mounted on /tmp for all nodes. There is no quota, and if you exceed the physical size of the disk you will get disk failure messages. To check the usage of your disk, use the df -h command on the compute node where your job is running.

We strongly recommend that you refer to the node scratch by using the environment variable $TMPDIR, which is defined by Slurm when using salloc, srun, or sbatch.
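
A quick way to check it from within a job, as a sketch (the device name in the output below is illustrative; the usage figures are those described next):

$ df -h $TMPDIR
Filesystem      Size  Used  Avail Use% Mounted on
/dev/sda2       1.8T   41M  1.8T    1% /tmp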

The output shows size, used, and available space in M, G, or T units. In this example the node (r1n00) has a 2 TB disk with only 41 MB used, so 1.8 TB is available for your job.

There is a physical disk installed on each node that is used for time-critical tasks, such as swapping memory. Most of the compute nodes are configured with a 2 TB disk; however, the /tmp file system will never have the full capacity of the disk, since larger-memory nodes need to use more of the disk for swap space.

Recovering Files

While all file systems on the DARWIN cluster utilize hardware redundancies to protect data, there is no backup or replication and no recovery available for the home or Lustre file systems. All backups are the responsibility of the user. DARWIN's systems administrators are not liable for any lost data.

Usage Recommendations

Home directory: Use your home directory to store private files. Application software you use will often store its configuration, history, and cache files in your home directory. Generally, keep this directory uncluttered and use it for files needed to configure your environment. For example, add symbolic links in your home directory that point to files in any of the other directories.
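
A sketch of such a link (the target path is illustrative):

ln -s $WORKDIR_USERS/project/fuelcell ~/fuelcell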

Workgroup directory: Use the personal allocation workgroup directory for running jobs and for storing larger amounts of data, input files, supporting data files, work files, output files, and source code in $WORKDIR_USERS as an extension of your home directory. It is also appropriate to use the software allocation workgroup directory to build applications for everyone in your allocation group in $WORKDIR_SW, as well as to create a VALET package for your fellow researchers to access applications you want to share in $WORKDIR_SW/valet.

Node scratch directory: Use the node scratch directory for temporary files. The job scheduler software (Slurm) creates a temporary directory in /tmp specifically for each job's temporary files. This is done on each node assigned to the job. When the job is complete, the subdirectory and its contents are deleted. This process automatically frees up the local scratch storage that others may need. Files in node scratch directories are not available to the head node, or other compute nodes.

Transferring Files

Be careful about modifications you make to your startup files (e.g. .bash*). Commands that produce output such as VALET or workgroup commands may cause your file transfer command or application to fail. Log into the cluster with ssh to check what happens during login, and modify your startup files accordingly to remove any commands which are producing output and try again. See computing environment startup and logout scripts for help.

Common Clients

You can move data to and from the cluster using the following supported clients:

Table. Command-line Clients

sftp   | Recommended for interactive, command-line use.
scp    | Recommended for batch script use.
rsync  | Most appropriate for synchronizing the file directories of two systems when only a small fraction of the files have been changed since the last synchronization.
Rclone | A command-line program to sync files and directories to and from popular cloud storage services.

If you prefer a non-command-line interface, then consult this table for GUI clients.

Table. Graphical-User-Interface Clients

WinSCP    | Windows only
Fetch     | Mac OS X only
FileZilla | Windows, Mac OS X, UNIX, Linux
Cyberduck | Windows, Mac OS X (command-line version for Linux)

For Windows clients: if you edit files on Windows desktops and then transfer them back to the cluster, you may find that your file becomes "corrupt" during the file transfer process. The symptoms are very subtle because the file appears to be okay, but in fact it contains CRLF line terminators. This causes problems when reading the file on a Linux cluster and generates very strange errors. For example, a file used for submitting a batch job, such as submit.qs, that you have used before and know is correct will no longer work; or an input file used for ABAQUS, like tissue.inp, that has worked many times before produces an error like Abaqus Error: Command line option "input" must have a value.

Use the "file" utility to check for CRLF line terminators and dos2unix to fix it, like this below

Copying Files to the Cluster

To copy a file over an SSH connection from a Mac/UNIX/Linux system to any of the cluster's file systems, type the generic command:

scp [options] local_filename HPC_username@HPC_hostname:HPC_filename

Begin the HPC_filename with a "/" to indicate the full path name. Otherwise the name is relative to your home directory on the HPC cluster.

Use scp -r to copy an entire directory, for example:
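
(a sketch; the username, hostname, and paths follow the examples used elsewhere in this document)

scp -r fuelcell bjones@darwin.hpc.udel.edu:/lustre/it_css/users/1201/projects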

copies the fuelcell directory in your local current working directory into the /lustre/it_css/users/1201/projects directory on the DARWIN cluster. The /lustre/it_css/users/1201/projects directory on the DARWIN cluster must exist, and bjones must have write access to it.

Copying files from the cluster

To copy a file over an SSH connection to a Mac/UNIX/Linux system from any of the cluster's files systems type the generic command:

scp [options] HPC_username@HPC_hostname:HPC_filename local_filename

Begin the HPC_filename with a "/" to indicate the full path name. Otherwise, the name is relative to your home directory.

Use scp -r to copy the entire directory. For example:
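
(a sketch; note the final period, as mentioned below)

scp -r bjones@darwin.hpc.udel.edu:/lustre/it_css/users/1201/projects/fuelcell .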

will copy the directory fuelcell on the DARWIN cluster into a new fuelcell directory in your local system's current working directory. (Note the final period in the command.)

Copying Files Between Clusters

You can use GUI applications to transfer small files to and from your PC as a way to transfer between clusters; however, this is highly inefficient for large files due to multiple transfers and slower disk speeds, and you also do not benefit from the arcfour cipher.

The command-line tools work the same on any UNIX cluster. To copy a file over an SSH connection, first log on to the first cluster and then use the scp command to copy files from it to the second cluster. Use the generic commands:

ssh [options] HPC_username1@HPC_hostname1
scp [options] HPC_filename1 HPC_username2@HPC_hostname2:HPC_filename2

Login to HPC_hostname1 and in the scp command begin both HPC_filename1 and HPC_filename2 with a "/" to indicate the full path name. The clusters will most likely have different full path names.

Use ssh -A to enable agent forwarding and scp -r to copy the entire directory.1) For example:
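
(a sketch; the Farber hostname and the source path on Farber are assumptions)

ssh -A bjones@farber.hpc.udel.edu
scp -r fuelcell bjones@darwin.hpc.udel.edu:/lustre/it_css/users/1201/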

will copy the directory fuelcell from Farber to a new fuelcell directory on DARWIN.

1) If you are using PuTTY, skip the ssh step and connect to the cluster you want to copy from.

Application Development

There are three 64-bit compiler suites on DARWIN with Fortran, C and C++:

PGI   | Portland Compiler Suite
INTEL | Parallel Studio XE
GCC   | GNU Compiler Collection

In addition, multiple versions of OpenJDK are available for compiling java applications on the login node.

DARWIN is based on AMD EPYC processors; please review the following documents if you are planning to compile and install your own software:

  • High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7002 Series Processors guide for getting started tuning AMD 2nd Gen EPYC™ Processor based systems for HPC workloads. This is not an all-inclusive guide and some items may have similar, but different, names in specific OEM systems (e.g. OEM-specific BIOS settings). Every HPC workload varies in its performance characteristics. While this guide is a good starting point, you are encouraged to perform your own performance testing for additional tuning. This guide also provides suggestions on which items should be the focus of additional, application-specific tuning (November 2020).

  • HPC Tuning Guide for AMD EPYC™ Processors guide intended for vendors, system integrators, resellers, system managers and developers who are interested in EPYC system configuration details. There is also a discussion on the AMD EPYC software development environment, and we include four appendices on how to install and run the HPL, HPCG, DGEMM, and STREAM benchmarks. The results produced are ‘good' but are not necessarily exhaustively tested across a variety of compilers with their optimization flags (December 2018).

  • AMD EPYC™ 7xx2-series Processors Compiler Options Quick Reference Guide, however we do not have the AOCC compiler (with Flang - Fortran Front-End) installed on DARWIN.

Computing Environment

UNIX Shell

The UNIX shell is the interface to the UNIX operating system. The HPC cluster allows use of the enhanced Bourne shell bash, the enhanced C shell tcsh, and the enhanced Korn shell zsh. IT will primarily support bash, the default shell.

For most Linux systems, the sh shell is the bash shell and the csh shell is the tcsh shell. The remainder of this document will use only bash commands.

Environment Variables

Environment variables store dynamic system values that affect the user environment. For example, the PATH environment variable tells the operating system where to look for executables. Many UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and applications with graphical user interfaces, often look at environment variables for information they need to function. The man pages for these programs typically have an ENVIRONMENT VARIABLES section with a list of variable names which tells how the program uses the values.

This is why we encourage you to use VALET to modify your environment rather than explicitly setting environment variables.

In bash, a variable must be exported to be used as an environment variable. By convention, environment variables are all uppercase. You can display a list of currently set environment variables by typing
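
For example, using the printenv command seen earlier in this document:

printenv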

The "echo" and "export" commands will display and set environment variables.

COMMAND                 | RESULTS
echo $varName           | Display a specific environment variable
export varName=varValue | Set an environment variable to a value

You can set and display specific environment variables by typing, for example:
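
(a sketch consistent with the description below)

export FFLAGS='-g -Wall'
echo $FFLAGS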

The variable FFLAGS will have the value -g -Wall in the shell and exported to programs run from this shell.

Spaces are important. Do not put spaces around the equal sign. If the value has spaces, enclose the value in quotes.

If you see instructions that refer to the setenv command, replace it with the export bash command. Make sure you use an equal sign, with no spaces. The csh setenv command uses spaces instead of an equal sign.

Startup and Logout Scripts

All UNIX systems set up a default environment and provide users with the ability to execute additional UNIX commands to alter the environment. These commands are automatically sourced (executed) by your shell and define the normal and environmental variables, command aliases, and functions you need. Additionally, there is a final system-wide startup file that automatically makes global environment changes that IT sets for all users.

You can modify the default environment by adding lines at the end of the ~/.bash_profile file and the ~/.bashrc file. These modifications affect shells started on the login node and the compute nodes. In general, we recommend that you not modify these files, especially when software documentation refers to changing the PATH environment variable; instead, use VALET to load the software.

  • The ~/.bash_profile file's commands are executed once at login. Add commands to this file to set your login environment and to run startup programs.

  • The ~/.bashrc file's commands are executed by each new shell you start (spawn). Add lines to this file to create aliases and bash functions. Commands such as "xterm" and "workgroup" automatically start a new shell and execute commands in the ~/.bashrc file. The "salloc" command starts a shell on a compute node and will execute the ~/.bashrc file from your home directory, but it does not execute the commands in the ~/.bash_profile file.

You may modify the IT-supplied ~/.bash_udit file to be able to use several IT-supplied aliases (commands) and environment settings related to your workgroup and work directory. Edit .bash_udit and follow the directions in the file to activate these options. This is the ONLY way you should set your default workgroup at login. DO NOT add the workgroup command to your .bashrc or .bash_profile, as this will likely prevent you from logging in and will cause file transfer programs like WinSCP, sftp, or Fetch to break.

Exiting the login session or typing the "logout" command executes your ~/.bash_logout file and terminates your session. Add commands to ~/.bash_logout that you want to execute at logout.

To restore the .bash_profile, .bashrc, .bash_udit and .bash_logout files in your home directory to their original state, type from the login node:
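
A sketch of what this restore might look like, assuming the template copies live under the /opt/shared/templates/homedir/ directory mentioned later in this section:

cp /opt/shared/templates/homedir/.bash_profile ~/
cp /opt/shared/templates/homedir/.bashrc ~/
cp /opt/shared/templates/homedir/.bash_udit ~/
cp /opt/shared/templates/homedir/.bash_logout ~/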

Where to put startup commands: You can put bash commands in either ~/.bashrc or ~/.bash_profile. Again we do not recommend modifying these files unless you really know what you are doing. Here are general suggestions:

  • Even if you have favorite commands from other systems, start by using the supplied files and only modify .bash_udit for customization.

  • Add essential commands that you fully understand, and keep it simple. Quoting rules can be complicated.

  • Do not depend on the order of command execution. Do not assume your environment, set in .bash_profile, will be available when the commands in .bashrc are executed.

  • Do not include commands that spawn new shells, such as workgroup.

  • Be very careful of commands that may produce output. If you must, only execute them after a test to make sure there is a terminal to receive the output. Keep in mind using any commands that produce output may break other applications like file transfer (sftp, scp, WinSCP).

  • Do not include VALET commands as they produce output and will be a part of every job submitted which could cause conflicts with other applications you are trying to run in your job script.

  • Keep a session open on the cluster, so when you make a change that prevents you from logging on you can reverse the last change, or copy the original files from /opt/shared/templates/homedir/ to start over.

Using workgroup and Directories

There are some key environment variables that are set for you, and are important for your work on any cluster. They are used to find directories for your projects. These environment variables are set on initial connection to a cluster, and will be changed if you

  • set your workgroup (allocation group name) with the "workgroup" command,

  • change to your project directory with the "cd" command,

  • connect to the compute node resources with the "salloc" (or "sbatch") command, specifying a single partition to which your allocation workgroup has access based on the resources requested for your allocation.

Connecting to Login Node

The system's initialization scripts set the values of some environment variables to help use the file systems.

VARIABLE | VALUE      | DESCRIPTION
HOSTNAME | hostname   | Host name
USER     | login_name | Login name
HOME     | /home/uid  | Your home directory

The initialization scripts also set the standard prompt with your login name and a shortened host name. For example, if your hostname is darwin.hpc.udel.edu and your login_name is bjones, then the standard prompt will be

[bjones@login00.darwin ~]$

Clusters may be configured to have multiple login nodes, with one common name for connecting. For example, on the DARWIN cluster, the hostname may be set to login00 or login01, and the standard prompt and window title bar will indicate which login node on darwin you are using.

Setting Workgroup

To use the compute node resources for a particular allocation group (workgroup), you need to use the "workgroup" command.

For example,
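
(the same command shown earlier in this document)

workgroup -g it_css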

starts a new shell for the workgroup it_css, and sets the environment variables:

VARIABLE     | EXAMPLE VALUE            | DESCRIPTION
WORKDIR      | /lustre/it_css           | Allocation workgroup directory (not writeable)
WORKGROUP    | it_css                   | Current allocation workgroup name
WORKDIR_USER | /lustre/it_css/users/uid | Allocation workgroup user directory
WORKDIR_SW   | /lustre/it_css/sw        | Allocation workgroup software directory

Use specific environment variables such as $WORKDIR_USERS when referring to your allocation workgroup user directory and $WORKDIR_SW when referring to your allocation workgroup software directory. This will improve portability.

It is always important to be aware of your current allocation workgroup name. The standard prompt includes the allocation workgroup name, added to your username and host. You must have an allocation workgroup name in your prompt to use that allocation group's compute node resources to submit jobs using sbatch or salloc. An example prompt after the "workgroup" command,

[(it_css:bjones)@login01.darwin ~]$

Changing Directory

When you first connect to the login node, all your commands are executed from your home directory (~). Most of your work will be done in your allocation workgroup directory. The "workgroup" command has an option to start you in the allocation workgroup directory. For example,

will spawn a new shell in the workgroup directory for it_css.

You can always use the cd bash command.

For example,
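
A sketch of four equivalent ways to reach the same project directory (directory names follow the example paths used in this document; the first assumes your current directory is your workgroup user directory):

cd project/fuelcell
cd $WORKDIR/users/1201/project/fuelcell
cd $WORKDIR_USERS/project/fuelcell
cd /lustre/it_css/users/1201/project/fuelcell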

The first uses a path name relative to the current working directory (implied ./). The second, third, and fourth use the full path ($WORKDIR and $WORKDIR_USERS always begin with a /). In all cases the directory is changed, and the $PWD environment variable is set:

VARIABLE | EXAMPLE VALUE                              | DESCRIPTION
PWD      | /lustre/it_css/users/1201/project/fuelcell | Print (current) working directory

It is always important to be aware of your current working directory. The standard prompt ends with the basename of PWD. In these two examples the basename is the same, 1201, but the standard bash PROMPT_COMMAND, which is executed every time you change directories, will put the full path of your current working directory in your window title. For example,

bjones@login00.darwin:/lustre/it_css/users/1201

Connecting to a Compute Node

To run a job on the compute nodes, you must submit your job script using sbatch or start an interactive session using salloc. In both cases, you will be connected to one of your allocation compute nodes based on the partition (queue) specified with a clean environment. Do not rely on the environment you set on the login node. The variables USER, HOME, WORKGROUP, WORKDIR, WORKDIR_USERS and PWD are all set on the compute node to match the ones you had on the login node, but two variables are set to node-specific values:

VARIABLE | EXAMPLE VALUE | DESCRIPTION
HOSTNAME | r1n00         | compute node name
TMPDIR   | /tmp          | temporary disk space

An empty directory is created by the SLURM job scheduler that is associated with your job and defined as TMPDIR. This is a safe place to store temporary files that will not interfere with other jobs and tasks you or other members of your group may be executing. This directory is automatically emptied on normal termination of your job. This way the usage on the node scratch file system will not grow over time.

Before submitting jobs you must first use the "workgroup" command. Type workgroup -h for additional information. Both "sbatch" and "salloc" will start in the same project directory you set on the login node and will require a single partition to be specified to be able to submit a batch or interactive session.

Using VALET

The UD-developed VALET system facilitates your use of compilers, libraries, programming tools and application software. It provides a uniform mechanism for setting up a package's required UNIX environment. VALET is a recursive acronym for VALET Automates Linux Environment Tasks. It provides functionality similar to the Modules package used at other HPC sites.

VALET commands set the basic environment for software. This may include the PATH, MANPATH, INFOPATH, LDPATH, LIBPATH and LD_LIBRARY_PATH environment variables, compiler flags, software directory locations, and license paths. This reduces the need for you to set them or update them yourself when changes are made to system and application software. For example, you might find several versions for a single package name, such as Mathematica/8 and Mathematica/8.0.4. You can even apply VALET commands to packages that you install or alter its actions by customizing VALET's configuration files. Type man valet for instructions or see the VALET software documentation for complete details.

The table below shows the basic informational commands for VALET. In subsequent sections, VALET commands are illustrated in the contexts of application development (e.g., compiling, using libraries) and running IT-installed applications.

COMMAND                | FUNCTION
vpkg_help              | VALET help.
vpkg_list              | List the packages that have VALET configuration files.
vpkg_versions pkgid    | List versions available for a single package.
vpkg_info pkgid        | Show information for a single package (or package version).
vpkg_require pkgid     | Configure the environment for one or more VALET packages.
vpkg_devrequire pkgid  | Configure the environment for one or more VALET packages, including software development variables such as CPPFLAGS and LDFLAGS.
vpkg_rollback # or all | Each time VALET changes the environment, it makes a snapshot of your environment to which it can return. vpkg_rollback attempts to restore the UNIX environment to its previous state. You can specify a number (#) to revert one or more prior changes to the environment, or all to remove all changes.
vpkg_history           | List the versioned packages that have been added to the environment.
man valet              | Complete documentation of VALET commands.

Programming Environment

Programming Models

There are two memory models for computing: distributed-memory and shared-memory. In the former, the message passing interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the latter, open multiprocessing (OMP) programming techniques are employed for multiple threads (light weight processes) to access memory in a common address space. When your job spans several compute nodes, you must use an MPI model.

Distributed memory systems use single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms. In the SPMD paradigm, each processor loads the same program image and executes and operates on data in its own address space (different data). It is the usual mechanism for MPI code: a single executable is available on each node (through a globally accessible file system such as $WORKDIR), and launched on each node (through the MPI wrapper command, mpirun).

The shared-memory programming model is used on Symmetric Multi-Processor (SMP) nodes, such as a single typical DARWIN compute node (64 cores, 512 GiB memory). The programming paradigm for this memory model is called Parallel Vector Processing (PVP) or Shared-Memory Parallel Programming (SMPP). The former name is derived from the fact that vectorizable loops are often employed as the primary structure for parallelization. The main point of SMPP computing is that all of the processors in the same node share data in a single memory subsystem. There is no need for explicit messaging between processors as with MPI coding.

The SMPP paradigm employs compiler directives (as pragmas in C/C++ and special comments in Fortran) or explicit threading calls (e.g. with Pthreads). The majority of science codes now use OpenMP directives that are understood by most vendor compilers, as well as the GNU compilers.

In cluster systems that have SMP nodes and a high-speed interconnect between them, programmers often treat all CPUs within the cluster as having their own local memory. On a node, an MPI executable is launched on each processor core and runs within a separate address space. In this way, all processor cores appear as a set of distributed memory machines, even though each node has processor cores that share a single memory subsystem.

Clusters with SMPs sometimes employ hybrid programming to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.

Compiling Code

Fortran, C, C++, Java, and MATLAB programs should be compiled on the login node; however, if compiles are lengthy or require extensive resources, you may need to schedule a job for compilation using salloc or sbatch, which will be billed to your allocation. All resulting executables should only be run on the compute nodes.

The Compiler Suites

There are three 64-bit compiler suites that IT generally installs and supports: PGI CDK (Portland Group Inc.'s Cluster Development Kit), Intel Composer XE 2011, and GNU. In addition, IT has installed OpenJDK (Open Java Development Kit), which must only be used on the compute nodes. (Type vpkg_info openjdk for more information on OpenJDK.)

The PGI compilers exploit special features of AMD processors. If you use open-source compilers, we recommend the GNU collection.

You can use a VALET vpkg_require command to set the UNIX environment for the compiler suite you want to use. After you issue the corresponding vpkg_require command, the compiler path and supporting environment variables will be defined.

A general command for basic source code compilation is:

<compiler> <compiler_flags> <source_code_filename> -o <executable_filename>

For each compiler suite, the table below displays the compiler name, a link to documentation describing the compiler flags, and the appropriate filename extension for the source code file. The executable will be named a.out unless you use the -o <executable_filename> option.

To view the compiler option flags, their syntax, and a terse explanation, execute a compiler command with the -help option. Alternatively, read the compiler's man pages.

PGI
VALET COMMAND: vpkg_require pgi
REFERENCE MANUALS: C, Fortran
USER GUIDES: C, Fortran

COMPILER  | LANGUAGE        | COMMON FILENAME EXTENSIONS
pgfortran | F90, F95, F2003 | .f, .for, .f90, .f95
pgf77     | F77             | .f
pgCC      | C++             | .C, .cc
pgcc      | C               | .c

INTEL
VALET COMMAND: vpkg_require intel
REFERENCE MANUALS: C, Fortran
USER GUIDES: C, Fortran

COMPILER | LANGUAGE      | COMMON FILENAME EXTENSIONS
ifort    | F77, F90, F95 | .f, .for, .f90, .f95
icpc     | C++           | .C, .c, .cc, .cpp, .cxx, .c++, .i, .ii
icc      | C             | .c

GCC
VALET COMMAND: vpkg_require gcc
REFERENCE MANUALS: C, Fortran
USER GUIDES: C, Fortran

COMPILER      | LANGUAGE      | COMMON FILENAME EXTENSIONS
gfortran, f95 | F77, F90, F95 | .f, .f90, .f95
g++           | C++           | .C, .c, .cc, .cpp, .cxx, .c++, .i, .ii
gcc           | C             | .c

Compiling Serial Programs

This section uses the PGI compiler suite to illustrate simple Fortran and C compiler commands that create an executable. For each compiler suite, you must first set the UNIX environment so the compilers and libraries are available to you. VALET commands provide a simple way to do this.

The examples below show the compile and link steps in a single command. These illustrations use source code files named fdriver.f90 (Fortran 90) or cdriver.c (C). They all use the -o option to produce an executable named 'driver.' The optional -fpic PGI compiler flag generates position-independent code and creates smaller executables. You might also use code optimization option flags such as -fast after debugging your program.

You can use the -c option instead to create a .o object file that you would later link to other object files to create the executable.

Some people use the UNIX make command to compile source code. There are many good online tutorials on the basics of using make. Also available is a cross-platform makefile generator, cmake. You can set the UNIX environment for cmake by typing the vpkg_require cmake command.

Using the PGI suite to illustrate:

First use a VALET command to set the environment:
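
For the PGI suite (package name as listed in the compiler table above):

vpkg_require pgi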

Then use that compiler suite's commands:

Fortran 90 example:
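
(a sketch, using the file names and flags described above)

pgfortran -fpic fdriver.f90 -o driver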

C example:
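
(similarly, a sketch for C)

pgcc -fpic cdriver.c -o driver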

Compiling Parallel Programs with OpenMP

If your program only uses OpenMP directives, has no message passing, and your target is a single SMP node, you should add the OpenMP compiler flag to the serial compiler flags.

Instead of using OpenMP directives in your program, you can add an OpenMP-based library. You will still need the OpenMP compiler flag when you use the library.

COMPILER SUITE | OPENMP COMPILER FLAG
PGI            | -mp
Open64         | -mp
Intel          | -openmp
Intel-2016     | -qopenmp
GCC            | -fopenmp
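
For example, a sketch using the GNU suite (source and executable names are placeholders):

vpkg_require gcc
gcc -fopenmp omp_driver.c -o omp_driver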

Compiling Parallel Programs with MPI

MPI Implementations

In the distributed-memory model, the message passing interface (MPI) allows programs to communicate between processors that use their own node's memory address space. It is the most commonly used library and runtime environment for building and executing distributed-memory applications on clusters of computers.

OpenMPI is the most desirable MPI implementation to use. It is the only one that works for job suspension, checkpointing, and task migration to other processors. These capabilities are needed to enable opportunistic use of idle nodes as well as to configure short-term and long-term queues.

Some software comes packaged with other MPI implementations that IT cannot change. In those cases, their VALET configuration files use the bundled MPI implementation. However, we recommend that you use OpenMPI whenever you need an MPI implementation.

MPI Compiler Wrappers

The OpenMPI implementation provides compiler wrappers for C, C++, Fortran 77, 90, and 95. These compiler wrappers add MPI support to the actual compiler suites by passing additional information to the compiler. You simply use the MPI compiler wrapper in place of the compiler name you would normally use.

The compiler suite that's used depends on your UNIX environment settings. Use VALET commands to simultaneously set your environment to use the OpenMPI implementation and to select a particular compiler suite. The commands for the four compiler suites are:

(Type vpkg_versions openmpi to see if newer versions are available.)

The vpkg_require command selects the MPI and compiler suite combination, and then you may use the compiler wrapper commands repeatedly. The wrapper name depends only on the language used, not the compiler suite you choose: mpicc (C), mpicxx or mpic++ (C++), mpif77 (Fortran 77), and mpif90 (Fortran 90 and 95).

Fortran example:
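
(a sketch, using the wrapper and the file names from the serial examples)

mpif90 fdriver.f90 -o driver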

C example:
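
(similarly, for C)

mpicc cdriver.c -o driver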

You may use other compiler flags listed in each compiler suite's documentation.

To modify the options used by the MPI wrapper commands, consult the FAQ section of the OpenMPI web site.

Programming Libraries

IT installs high-quality math and utility libraries that are used by many applications. These libraries provide highly optimized math packages and functions. To determine which compilers IT used to prepare a library version, use the vpkg_versions VALET command.

Here is a representative sample of installed libraries. Use the vpkg_list command to see the most current list of libraries.

Open-source libraries

  • ATLAS: Automatically Tuned Linear Algebra Software (portable)

  • FFTW: Discrete Fast Fourier Transform library

  • BLAS and LAPACK at TACC: Enhanced BLAS routines from the Texas Advanced Computing Center (TACC)

  • HDF4 and HDF5: Hierarchical Data Format suite (file formats and libraries for storing and organizing large, numerical data collections)

  • HYPRE: High-performance preconditioners for linear system solvers (from LLNL)

  • LAPACK: Linear algebra routines

  • Matplotlib: Python-based 2D publication-quality plotting library

  • netCDF: network Common Data Form for creation, access and sharing of array-oriented scientific data

  • ScaLAPACK - Scalable LAPACK: Subset of LAPACK routines redesigned for distributed memory MIMD parallel computers using MPI

  • VTK – Visualization ToolKit: A platform for 3D computer graphics and visualization

Commercial Libraries

  • IMSL: RogueWave's mathematical and statistical libraries

  • MKL: Intel's Math Kernel Library

  • NAG: Numerical Algorithms Group's numerical libraries

The libraries are optimized for a given cluster architecture. Note that the calling sequences of some of the commercial library routines differ from their open-source counterparts.

Using Libraries

This section shows you how to link your program with libraries you or your colleagues have created or with centrally installed libraries such as ACML or FFTW. The examples introduce special environment variables (FFLAGS, CFLAGS, CPPFLAGS and LDFLAGS) whose use simplifies a command's complexity. The VALET commands vpkg_require and vpkg_devrequire can easily define the working environment for your compiler suite choice.

Joint use of VALET and these environment variables will also prepare your UNIX environment to support your use of make for program development. VALET will accommodate using one or several libraries, and you can extend its functionality for software you develop or install.

Intel Compiler Suite

You should use Intel MKL — it's a highly-optimized BLAS/LAPACK library.

If you use the Intel compilers, you can add -mkl to your link command, e.g.
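  ifort -o myprog myprog.f90 -mkl=sequential   # serial MKL (program and file names hypothetical)
  ifort -o myprog myprog.f90 -mkl=parallel     # threaded MKL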

The former uses the serial library, the latter uses the threaded library that respects the OpenMP runtime environment of the job for multithreaded BLAS/LAPACK execution.

If you're not using the Intel compilers, you'll need to generate the appropriate compiler directives using Intel's online tool.

Please use "dynamic linking" since that allows MKL to adjust the underlying kernel functions at runtime according to the hardware on which you're running. If you use static linking, you're tied to the lowest common hardware model available and you will usually not see as good performance.

You'll need to load a version of Intel into the environment before compiling/building and also at runtime using VALET such as
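  vpkg_require intel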

Among other things, this will set MKLROOT in the environment to the appropriate path, which the link advisor references. The MKL version (year) matches that of the compiler version (year).

To determine the available versions of Intel installed use
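  vpkg_versions intel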

PGI Compiler Suite

Fortran Examples illustrated with the PGI compiler suite

Reviewing the basic compilation command

The general command for compiling source code:

«compiler» «compiler_flags» «source_code_filename» -o «executable_filename»

For example:
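  pgf90 fdriver.f90 -o fdriver

(This sketch uses the PGI Fortran 90 compiler, pgf90, and the fdriver.f90 source file referenced below.)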

Using user-supplied libraries

To compile fdriver.f90 and link it to a shared F90 library named libfstat.so stored in $HOME/lib, add the library location and the library name (fstat) to the command:
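  pgf90 fdriver.f90 -o fdriver -L$HOME/lib -lfstat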

The -L option flag is for the shared library directory's name; the -l flag is for the specific library name.

You can simplify this compiler command by creating and exporting two special environment variables. FFLAGS represents a set of Fortran compiler option flags; LDFLAGS represents the location and choice of your library.
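For example (a sketch; the actual flag values are up to you):

  export FFLAGS='-fast'
  export LDFLAGS="-L$HOME/lib -lfstat"
  pgf90 $FFLAGS fdriver.f90 -o fdriver $LDFLAGS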

Extending this further, you might have several libraries in one or more locations. In that case, list all of the '-l' flags in the LDLIBS statement, for example,

and all of the '-L' flags in the LDFLAGS statement. (The order in which the '-L' directories appear in LDFLAGS determines the search order.)

Using centrally supplied libraries (ACML, MKL, FFTW, etc.)

This extends the previous section's example by illustrating how to use VALET's vpkg_devrequire command to locate and link a centrally supplied library such as AMD's Core Math Library, ACML. Several releases (versions) of a library may be installed, and some may have been compiled with several compiler suites.

To view your choices, use VALET's vpkg_versions command:
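  vpkg_versions acml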

The example below uses the acml/5.0.0-pgi-fma4 version, the single-threaded, ACML 5.0.0 FMA4 library compiled with the PGI 11 compilers. Since that version depends on the PGI 11 compiler suite,
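  vpkg_devrequire acml/5.0.0-pgi-fma4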

jointly sets the UNIX environment for both ACML and the PGI compiler suite. Therefore, you should not also issue a vpkg_require pgi command.

Unlike vpkg_require, vpkg_devrequire also modifies key environment variables including LDFLAGS.

Putting it all together, the complete example using the library named acml is:
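A sketch of the complete sequence (the compiler flag shown is illustrative):

  vpkg_devrequire acml/5.0.0-pgi-fma4
  export FFLAGS='-fast'
  pgf90 $FFLAGS fdriver.f90 -o fdriver $LDFLAGS -lacml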

Note that $LDFLAGS must be in the compile statement but does not need an explicit export command here. The vpkg_devrequire command above defined and exported LDFLAGS and its value.

Using user-supplied libraries and centrally supplied libraries together

This final example illustrates how to use your fstat and fpoly libraries (both in $HOME/lib) with the acml5.0.0 library:

Remember that the library search order depends on the order of the LDFLAGS libraries.

C Examples illustrated with the PGI compiler suite

Reviewing the basic compilation command

The general command for compiling source code:

«compiler» «compiler_flags» «source_code_filename» -o «executable_filename»

For example,
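  pgcc cdriver.c -o cdriver

(This sketch uses the PGI C compiler, pgcc, and the cdriver.c source file referenced below.)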

Using user-supplied libraries

To compile cdriver.c and link it to a shared C library named libcstat.so stored in $HOME/lib and include header files in $HOME/inc, add the library location and the library name (cstat) to the command.
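  pgcc -I$HOME/inc cdriver.c -o cdriver -L$HOME/lib -lcstat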

The -I option flag is for the include library's location; the -L flag is for the shared library directory's name; and the -l flag is for the specific library name.

You can simplify this compiler command by creating and exporting three special environment variables. CFLAGS represents a set of C compiler option flags; CPPFLAGS represents preprocessor flags such as include-file locations; and LDFLAGS represents the location and choice of your shared library.

Extending this further, you might have several libraries in one or more locations. In that case, list all of the '-l' flags in the LDLIBS statement, for example,

and all of the '-L' flags in the LDFLAGS statement. (The order in which the '-L' directories appear in LDFLAGS determines the search order.)

Using centrally supplied libraries (ACML, MKL, FFTW, etc.)

This extends the previous section's example by illustrating how to use VALET's vpkg_devrequire command to locate and link a system-supplied library, such as AMD's Core Math Library, ACML. Several releases (versions) of a library may be installed, and some may have been compiled with several compiler suites.

To view your choices, use VALET's vpkg_versions command:

The example below uses the acml/5.0.0-pgi-fma4 version, the single-threaded ACML 5.0.0 FMA4 library compiled with the PGI 11 compilers. Since that version depends on the PGI 11 compiler suite,
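  vpkg_devrequire acml/5.0.0-pgi-fma4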

jointly sets the UNIX environment for both ACML and the PGI compiler suite. Therefore, you should not also issue a vpkg_require pgi command.

Unlike vpkg_require, vpkg_devrequire also modifies key environment variables including LDFLAGS and CPPFLAGS.

Putting it all together, the complete example using the library named acml, is:

Note that $CPPFLAGS and $LDFLAGS must appear in the compile statement even though no explicit export CPPFLAGS or export LDFLAGS statements appear above. The vpkg_devrequire command above defined and exported CPPFLAGS and LDFLAGS and their values.

Using user-supplied libraries and centrally supplied libraries together

The final example illustrates how to use your cstat and cpoly libraries (both in $HOME/lib) with the acml library:

Remember that the library search order depends on the order of the LDFLAGS libraries.

Running Applications

The Slurm workload manager (job scheduling system) is used to manage and control the resources available to computational tasks. The job scheduler considers each job's resource requests (memory, disk space, processor cores) and executes it as those resources become available. As a cluster workload manager, Slurm has three key functions: (1) It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. (2) It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. (3) It arbitrates contention for resources by managing a queue of pending work.

Without a job scheduler, a cluster user would need to manually search for the resources required by his or her job, perhaps by randomly logging-in to nodes and checking for other users' programs already executing thereon. The user would have to "sign-out" the nodes he or she wishes to use in order to notify the other cluster users of resource availability1). A computer will perform this kind of chore more quickly and efficiently than a human can, and with far greater sophistication.

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Documentation for the current version of Slurm is provided by SchedMD: see the SchedMD Slurm Documentation.

When migrating from one scheduler to another, such as from GridEngine to Slurm, you may find it helpful to refer to SchedMD's rosetta showing equivalent commands across various schedulers, as well as their two-page command/option summary.

It is a good idea to periodically check in /opt/shared/templates/slurm/ for updated or new templates to use as job scripts to run generic or specific applications designed to provide the best performance on DARWIN.

Need help? See Introduction to Slurm in UD's HPC community cluster environment.

  1. Historically, this is actually how some clusters were managed!

What is a Job?

In this context, a job consists of:

  • a sequence of commands to be executed

  • a list of resource requirements and other properties affecting scheduling of the job

  • a set of environment variables

For an interactive job, the user manually types the sequence of commands once the job is eligible for execution. If the necessary resources for the job are not immediately available, then the user must wait; when resources are available, the user must be present at his/her computer in order to type the commands. Since the job scheduler does not care about the time of day, this could happen anytime, day or night.

By comparison, a batch job does not require the user be awake and at his or her computer: the sequence of commands is saved to a file, and that file is given to the job scheduler. A file containing a sequence of shell commands is also known as a script, so in order to run batch jobs a user must become familiar with shell scripting. The benefits of using batch jobs are significant:

  • a job script can be reused (versus repeatedly having to type the same sequence of commands for each job)

  • when resources are granted to the job it will execute immediately (day or night), yielding increased job throughput

An individual's increased job throughput is good for all users of the cluster!

All resulting executables (created via your own compilation) and other applications (commercial or open-source) should only be run on the compute nodes.

Queues

At its most basic, a queue represents a collection of computing entities (call them nodes) on which jobs can be executed. Each queue has properties that restrict what jobs are eligible to execute within it: a queue may not accept interactive jobs; a queue may place an upper limit on how long the job will be allowed to execute or how much memory it can use; or specific users may be granted or denied permission to execute jobs in a queue.

Slurm uses partitions to embody a common set of properties: which nodes are included, general system state, and constraints such as job size limits, job time limits, and which users are permitted to use them. A partition can be considered a job queue representing a collection of computing entities. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted. Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation. The term queue will most often imply a partition.

When submitting a job to Slurm, a user must first set their workgroup and then explicitly request a single partition as part of the job submission. Doing so places that partition's resource restrictions (e.g., maximum execution time) on the job, even if they are not appropriate.

Slurm

The Slurm workload manager is used to manage and control the computing resources for all jobs submitted to a cluster. This includes load balancing, reconciling requests for memory and processor cores with availability of those resources, suspending and restarting jobs, and managing jobs with different priorities.

In order to schedule any job (interactively or batch) on a cluster, you must set your workgroup to define your allocation workgroup and explicitly request a single partition.

Runtime Environment

Generally, your runtime environment (path, environment variables, etc.) should be the same as your compile-time environment. Usually, the best way to achieve this is to put the relevant VALET commands in shell scripts. You can reuse common sets of commands by storing them in a shell script file that can be sourced from within other shell script files.

If you are writing an executable script that does not have the -l option on the bash command, and you want to include VALET commands in your script, then you should include the line:
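A commonly used line for this purpose (the exact path is an assumption here; consult the cluster's VALET documentation if it does not work) is:

  source /etc/profile.d/valet.sh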

You do not need this command when you:

  1. type commands, or source the command file,

  2. include lines in the file to be submitted with sbatch.

Getting Help

Slurm includes man pages for all of the commands that will be reviewed in this document. When logged-in to a cluster, type
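  man squeue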

to learn more about a Slurm command (in this case, squeue). Most commands will also respond to the --help command-line option to provide a succinct usage summary:

Job Accounting on DARWIN

Accounting for jobs on DARWIN varies with the type of node used within a given allocation type. There are two types of allocations:

  1. Compute - for CPU based nodes with 512 GiB, 1024 GiB, or 2048 GiB of RAM

  2. GPU - for GPU based nodes with NVIDIA Tesla T4, NVIDIA Tesla V100, or AMD Radeon Instinct MI50 GPUs

For all allocations and node types, usage is defined in terms of a Service Unit (SU). The definition of an SU varies with the type of node being used.

IMPORTANT: When a job is submitted, the SUs will be calculated and pre-debited based on the resources requested thereby putting a hold on and deducting the SUs from the allocation credit for your project/workgroup. However, once the job completes the amount of SUs debited will be based on the actual time used. Keep in mind that if you request 20 cores and your job really only takes advantage of 10 cores, then the job will still be billed based on the requested 20 cores. And specifying a time limit of 2 days versus 2 hours may prevent others in your project/workgroup from running jobs as those SUs will be unavailable until the job completes. On the other hand, if you do not request enough resources and your job fails (i.e. did not provide enough time, enough cores, etc.), you will still be billed for those SUs. See Scheduling Jobs Command options for help with specifying resources.

Moral of the story: Request only the resources needed for your job. Over or under requesting resources results in wasting your allocation credits for everyone in your project/workgroup.

Compute Allocations

A Compute allocation on DARWIN can be used on any of the four compute node types. Each compute node has 64 cores but the amount of memory varies by node type. The available resources for each node type are below:

COMPUTE NODE       | NUMBER OF NODES | TOTAL CORES | MEMORY PER NODE        | TOTAL MEMORY
Standard           | 48              | 3,072       | 512 GiB                | 24 TiB
Large Memory       | 32              | 2,048       | 1024 GiB               | 32 TiB
Extra-Large Memory | 11              | 704         | 2,048 GiB              | 22 TiB
Extended Memory    | 1               | 64          | 1024 GiB + 2.73 TiB 1) | 3.73 TiB
Total              | 92              | 5,888       |                        | 81.73 TiB

1) 1024 GiB of system memory and 2.73 TiB of swap on high-speed Intel Optane NVMe storage

A Service Unit (SU) on compute nodes corresponds to the use of one compute core for one hour. The number of SUs charged for a job is based on the fraction of total cores or fraction of total memory the job requests, whichever is larger. This results in the following SU conversions:

COMPUTE NODE       | SU CONVERSION
Standard           | 1 unit = 1 core + 8 GiB RAM for one hour
Large Memory       | 1 unit = 1 core + 16 GiB RAM for one hour
Extra-Large Memory | 1 unit = 1 core + 32 GiB RAM for one hour
Extended Memory    | 64 units = 64 cores + 1024 GiB RAM + 2.73 TiB swap for one hour 1)

1) always billed as the entire node

See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:

NODE TYPE    | CORES | MEMORY                | SUS BILLED PER HOUR
Standard     | 1     | 1 GiB to 8 GiB        | 1 SU
Standard     | 1     | 504 GiB to 512 GiB    | 64 SUs 1)
Standard     | 64    | 1 GiB to 512 GiB      | 64 SUs
Standard     | 2     | 1 GiB to 16 GiB       | 2 SUs
Large Memory | 2     | > 32 GiB and ≤ 48 GiB | 3 SUs 2)

1) 512 GiB RAM on a standard node is equivalent to using all 64 cores, so you are charged as if you used 64 cores

2) RAM usage exceeds what is available with 2 cores on a large memory node, so you are charged as if you used 3 cores

Note that these are estimates based on nominal memory. Actual charges are based on available memory which will be lower than nominal memory due to the memory requirements for the OS and system daemons.

GPU Allocations

A GPU allocation on DARWIN can be used on any of the three GPU node types. The NVIDIA-T4 and AMD MI50 nodes have 64 cores each, while the NVIDIA-V100 nodes have 48 cores each. The available resources for each node type are below:

GPU NODE    | NUMBER OF NODES | TOTAL CORES | MEMORY PER NODE | TOTAL MEMORY | TOTAL GPUS
nvidia-T4   | 9               | 576         | 512 GiB         | 4.5 TiB      | 9
nvidia-V100 | 3               | 144         | 768 GiB         | 2.25 TiB     | 12
AMD-MI50    | 1               | 64          | 512 GiB         | 0.5 TiB      | 1
Total       | 13              | 784         |                 | 7.25 TiB     | 22

A Service Unit (SU) on GPU nodes corresponds to the use of one GPU device for one hour. The number of SUs charged for a job is based on the fraction of total GPUs, fraction of total cores, or fraction of total memory the job requests, whichever is larger. Because the NVIDIA T4 and AMD MI50 nodes only have 1 GPU each, you have access to all available memory and cores for 1 SU. The NVIDIA V100 nodes have 4 GPUs each, so the available memory and cores per GPU is 1/4 of the total available on a node. This results in the following SU conversions:

GPU NODE    | SU CONVERSION
nvidia-T4   | 1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour
AMD-MI50    | 1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour
nvidia-V100 | 1 unit = 1 GPU + 12 cores + 192 GiB RAM for one hour

See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:

NODE TYPE   | GPUS | CORES    | MEMORY                  | SUS BILLED PER HOUR
nvidia-T4   | 1    | 1 to 64  | 1 GiB to 512 GiB        | 1 SU
nvidia-T4   | 2    | 2 to 128 | 2 GiB to 1024 GiB       | 2 SUs
AMD-MI50    | 1    | 1 to 64  | 1 GiB to 512 GiB        | 1 SU
nvidia-V100 | 1    | 1 to 12  | 1 GiB to 192 GiB        | 1 SU
nvidia-V100 | 2    | 1 to 24  | 1 GiB to 384 GiB        | 2 SUs
nvidia-V100 | 1    | 25 to 48 | 1 GiB to 192 GiB        | 2 SUs 1)
nvidia-V100 | 1    | 1 to 24  | > 192 GiB and ≤ 384 GiB | 2 SUs 2)

1) billed as if you were using 2 GPUs due to the proportion of CPU cores used

2) billed as if you were using 2 GPUs due to the proportion of memory used

Note that these are estimates based on nominal memory. Actual charges are based on available memory which will be lower than nominal memory due to the memory requirements for the OS and system daemons.

The idle Partition

Jobs that execute in the idle partition do not result in charges against your allocation(s). If your jobs can support checkpointing, the idle partition will enable you to continue your research even if you exhaust your allocation(s). However, jobs submitted to the other partitions which do get charged against allocations will take priority and may cause idle partition jobs to be preempted.

Since jobs in the idle partition do not result in charges you will not see them in the output of the sproject command documented below. You can still use standard Slurm commands to check the status of those jobs.

Monitoring Allocation Usage

The sproject Command

UD IT has created the sproject command to allow various queries against allocations (UD and ACCESS) on DARWIN. You can see the help documentation for sproject by running sproject -h or sproject --help. The -h/--help flag also works for any of the subcommands: sproject allocations -h, sproject projects -h, sproject jobs -h, or sproject failures -h.

For all sproject commands you can specify an output format of table, csv, or json using the --format <output-format> or -f <output-format> options.

sproject allocations

The allocations subcommand shows information for resource allocations granted to projects/workgroups on DARWIN to which you are a member. To see a specific workgroup's allocations, use the -g <workgroup> option as in this example for workgroup it_css:

The --detail flag will show additional information reflecting the credits, running + completed job charges, debits, and balance of each allocation:

The --by-user flag is helpful to see detailed allocation usage broken out by project user:

sproject projects

The projects subcommand shows information (such as the project id, group id, name, and creation date) for projects/workgroups on DARWIN to which you are a member. To see a specific project/workgroup, use the -g <workgroup> option as in this example for workgroup it_css:

Adding the --detail flag will also show each allocation associated with the project.

sproject jobs

The jobs subcommand shows information (such as the Slurm job id, owner, and amount charged) for individual jobs billed against resource allocations for projects/workgroups on DARWIN to which you are a member. Various options are available for sorting and filtering, use sproject jobs -h for complete details. To see jobs associated with a specific project/workgroup, use the -g <workgroup> option as in this example for workgroup it_css:

Jobs that complete execution will be displayed with a status of completed and the actual billable amount used by the job. At the top and bottom of each hour, completed jobs are resolved into per-user debits and disappear from the jobs listing (see the sproject allocations section above for the display of resource allocation credits, debits, and pre-debits).

sproject failures

The failures subcommand shows information (such as the Slurm job id, owner, and amount charged) for all jobs that failed to execute due to insufficient allocation balance on resource allocations for projects/workgroups on DARWIN to which you are a member. Various options are available for sorting and filtering, use sproject failures -h for complete details. To see failures associated with jobs run as a specific project/workgroup, use the -g <workgroup> option as in this example for workgroup it_css:

Adding the --detail flag provides further information such as the owner, amount, and creation date.

ACCESS Allocations

For ACCESS allocations on DARWIN, you may use the ACCESS user portal to check allocation usage; however, keep in mind that the sproject command available on DARWIN provides the most up-to-date allocation usage information, since the ACCESS Portal is only updated nightly.

Storage Allocations

Every DARWIN Compute or GPU allocation has a storage allocation associated with it on the DARWIN Lustre file system. These allocations are measured in tebibytes and the default amount is 1 TiB. There are no SUs deducted from your allocation for the space you use, but you will be limited to a storage quota based on your awarded allocation.

Each project/workgroup has a folder associated with it referred to as workgroup storage. Every file in that folder will count against that project/workgroup's allocated quota for their workgroup storage.

You can use the my_quotas command to check storage usage.

Queues/Partitions

The DARWIN cluster has several partitions (queues) available to specify when running jobs. These partitions correspond to the various node types available in the cluster:

PARTITION NAME | DESCRIPTION                                                                                   | NODE NAMES
standard       | Contains all 48 standard memory nodes (64 cores, 512 GiB memory per node)                    | r1n00 - r1n47
large-mem      | Contains all 32 large memory nodes (64 cores, 1024 GiB memory per node)                      | r2l00 - r2l31
xlarge-mem     | Contains all 11 extra-large memory nodes (64 cores, 2048 GiB memory per node)                | r2x00 - r2x10
extended-mem   | Contains the single extended memory node (64 cores, 1024 GiB memory + 2.73 TiB NVMe swap)    | r2e00
gpu-t4         | Contains all 9 NVIDIA Tesla T4 GPU nodes (64 cores, 512 GiB memory, 1 T4 GPU per node)       | r1t00 - r1t07, r2t08
gpu-v100       | Contains all 3 NVIDIA Tesla V100 GPU nodes (48 cores, 768 GiB memory, 4 V100 GPUs per node)  | r2v00 - r2v02
gpu-mi50       | Contains the single AMD Radeon Instinct MI50 GPU node (64 cores, 512 GiB memory, 1 MI50 GPU) | r2m00
idle           | Contains all nodes in the cluster; jobs in this partition can be preempted but are not charged against your allocation |

Requirements

All partitions on DARWIN have two requirements for submitting jobs:

  1. You must set an allocation workgroup prior to submitting a job by using the workgroup command (e.g., workgroup -g it_nss). This ensures jobs are billed against the correct account in Slurm.

  2. You must explicitly request a single partition in your job submission using --partition or -p.

Defaults and Limits

All partitions on DARWIN except idle have the following defaults:

  • Default run time of 30 minutes

  • Default resources of 1 node, 1 CPU, and 1 GiB memory

  • Default no preemption

All partitions on DARWIN except idle have the following limits:

  • Maximum run time of 7 days

  • Maximum of 400 jobs per user per partition

The idle partition has the same defaults and limits as above with the following differences:

  • Preemption is enabled for all jobs

  • Maximum of 320 jobs per user

  • Maximum of 640 CPUs per user (across all jobs in the partition)

Maximum Requestable Memory

Each type of node (and thus, partition) has a limited amount of memory available for jobs. A small amount of memory must be subtracted from the nominal size listed in the table above for the node's operating system and Slurm. The remainder is the upper limit requestable by jobs, summarized by partition below:

Partition Name | Maximum (by node) | Maximum (by core)
standard       | --mem=499712M     | --mem-per-cpu=7808M
large-mem      | --mem=999424M     | --mem-per-cpu=15616M
xlarge-mem     | --mem=2031616M    | --mem-per-cpu=31744M
extended-mem   | --mem=999424M     | --mem-per-cpu=15616M
gpu-t4         | --mem=491520M     | --mem-per-cpu=7680M
gpu-v100       | --mem=737280M     | --mem-per-cpu=15360M
gpu-mi50       | --mem=491520M     | --mem-per-cpu=7680M
gpu-mi100      | --mem=491520M     | --mem-per-cpu=7680M

The extended-mem Partition

Because access to the swap cannot be limited via Slurm, the extended-mem partition is configured to run all jobs in exclusive user mode. This means only a single user can be on the node at a time, but that user can run one or more jobs on the node. All jobs on the node will have access to the full amount of swap available, so care must be taken in usage of swap when running multiple jobs.

The GPU Partitions

Jobs that will run in one of the GPU partitions must request GPU resources using ONE of the following flags:

FLAG                      | DESCRIPTION
--gpus=<count>            | <count> GPUs total for the job, regardless of node count
--gpus-per-node=<count>   | <count> GPUs are required on each node allocated to the job
--gpus-per-socket=<count> | <count> GPUs are required on each socket allocated to the job
--gpus-per-task=<count>   | <count> GPUs are required for each task in the job

If you do not specify one of these flags, your job will not be permitted to run in the GPU partitions.

On DARWIN the --gres flag should NOT be used to request GPU resources. The GPU type will be inferred from the partition to which the job is submitted if not specified.

The idle Partition

The idle partition contains all nodes in the cluster. Jobs submitted to the idle partition can be preempted when the resources are required for jobs submitted to the other partitions. Your job should support checkpointing to effectively use the idle partition and avoid lost work.

Be aware that implementing checkpointing is highly dependent on the nature of your job and the ability of your code or software to handle interruptions and restarts. For this reason, we can only provide limited support of the idle partition.

Jobs in the idle partition that have been running for less than 10 minutes are not considered for preemption by Slurm. Additionally, there is a 5 minute grace period between the delivery of the initial preemption signal (SIGCONT+SIGTERM) and the end of the job (SIGCONT+SIGTERM+SIGKILL). This means jobs in the idle partition will have a minimum of 15 minutes of execution time once started. Jobs submitted using the --requeue flag automatically return to the queue to be rescheduled once resources are available again.

Jobs that execute in the idle partition do not result in charges against your allocation(s). However, they do accumulate resource usage for the sake of scheduling priority to ensure fair access to this partition. If your jobs can support checkpointing, the idle partition will enable you to continue your research even if you exhaust your allocation(s).

Requesting a Specific Resource Type

Since the idle partition contains all nodes in the cluster, you will need to request a specific GPU type if your job needs GPU resources. The three GPU types are:

TYPE       | DESCRIPTION
tesla_t4   | NVIDIA Tesla T4
tesla_v100 | NVIDIA Tesla V100
amd_mi50   | AMD Radeon Instinct MI50

To request a specific GPU type while using the idle partition, include the --gpus=<type>:<count> flag with your job submission. For example, --gpus=tesla_t4:4 would request 4 NVIDIA Tesla T4 GPUs.

Scheduling Jobs

In order to schedule any job (interactively or batch) on a cluster, you must set your allocation workgroup to define your allocation group and explicitly specify a single partition that corresponds to the compute resources granted for the allocation. For example,
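  workgroup -g it_css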

will set the allocation workgroup to it_css for account bjones which is reflected in the prompt change showing the allocation workgroup.

[(it_css:bjones)@login00 ~]$

When you submit a job, you must explicitly request a single partition that corresponds to the compute resources granted for the allocation. Keep in mind that job scheduling is very complex: a job is not necessarily considered for execution immediately upon submission. On each scheduling cycle, Slurm only considers the next N pending jobs for execution, so the more jobs users submit, the longer your job may have to wait to be considered. To that point, all users should be good citizens: do not over-submit, be patient, and do not kill and resubmit jobs in an attempt to increase your priority.

Need help? See Introduction to Slurm in UD's HPC community cluster environment.

IMPORTANT: When a job is submitted, the SUs will be calculated and pre-debited based on the resources requested thereby putting a hold on and deducting the SUs from the allocation credit for your project/workgroup. However, once the job completes the amount of SUs debited will be based on the actual time used. Keep in mind that if you request 20 cores and your job really only takes advantage of 10 cores, then the job will still be billed based on the requested 20 cores. And specifying a time limit of 2 days versus 2 hours may prevent others in your project/workgroup from running jobs as those SUs will be unavailable until the job completes. On the other hand, if you do not request enough resources and your job fails (i.e. did not provide enough time, enough cores, etc.), you will still be billed for those SUs. See Scheduling Jobs Command options for help with specifying resources and Job Accounting for details on SU calculations.

Moral of the story: Request only the resources needed for your job. Over or under requesting resources results in wasting your allocation credits for everyone in your project/workgroup.

Interactive jobs: An interactive job is billed the SUs associated with the full wall time of its execution, not just the CPU time accrued during its duration. For example, if you leave an interactive job running for 2 hours and execute code for 2 minutes, your allocation will be billed for 2 hours of time, not 2 minutes. Please review job accounting to determine the SU associated with each type of resource requested (compute, gpu) and the SUs billed per hour.

Interactive Jobs (salloc)

All interactive jobs should be scheduled to run on the compute nodes, not the login/head node.

An interactive session (job) can often be made non-interactive (batch job) by putting the input in a file, using the redirection symbols < and >, and making the entire command a line in a job script file:
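For example (a sketch with hypothetical program and file names):

  ./my_program < input.txt > output.txt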

The resulting script can then be scheduled as a batch job.

Starting an Interactive Session

Remember that you must set your workgroup (to define your allocation workgroup) and specify a partition before submitting any job, and this includes starting an interactive session. Then use the Slurm command salloc on the login (head) node. Slurm will look for a node with a free scheduling slot (processor core) and a sufficiently light load, and then assign your session to it. If no such node becomes available, your salloc request will eventually time out.

Type:
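  salloc --partition=standard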

to start a remote interactive shell on a node in the standard partition.

Type:

to open a shell on the login node itself and execute a series of srun commands against that allocation. For example,

sets up the salloc interactive session with 2 nodes on the standard partition. Now each use of srun is run on all compute nodes inside the interactive session and represents a job step.

Type:
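  exit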

to terminate the interactive shell and release the scheduling slot(s).

All the above commands work only when the user has already set a workgroup and specified a partition. If you do not specify a workgroup or --partition, you will get an error similar to this:

or

There is no way to avoid running the workgroup command and specifying a partition before submitting a job or requesting an interactive session.

Nodes for Interactive Sessions

You may use the login (head) node for interactive program development including Fortran, C, and C++ program editing and compiling since you won't be billed for usage on the login node. However you will need to use Slurm (salloc) to start interactive shells to utilize your allocation compute node resources for testing or running applications.

Batch Jobs (sbatch)

A batch job is a command to be executed now or at any time in the future. Batch jobs are encapsulated as a shell script (which will be called a job script). This job script is simply a bash script that contains special comment lines providing flags to Slurm that influence the job's submission and scheduling. Both the srun and salloc commands attempt to execute remote commands immediately; if resources are not available, they will not return until resources become available or the user cancels them (by means of <Ctrl>-C).

Slurm provides the sbatch command for scheduling batch jobs:

COMMAND                                | ACTION
sbatch command_line_options job_script | Submit job with script command in the file job_script

For example,
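  sbatch myproject.qs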

This file myproject.qs will contain bash shell commands and SBATCH statements that include SBATCH options and resource specifications. The SBATCH statements begin with #SBATCH and are special directives that tell Slurm options such as partition, number of cores, how much time, how much memory, etc. to use for your job.

We strongly recommend that you use a job script patterned after the prototypes in /opt/shared/templates by using one of our templates, and that you save your job script files within a $WORKDIR (private work) directory. There are README.md files in each subdirectory that explain the use of these templates.

Reusable job scripts help you maintain a consistent batch environment across runs.

Slurm Environment Variables

In every batch session, Slurm sets environment variables that are useful within job scripts. Here are some common examples. The rest can be found online in the Slurm documentation.

ENVIRONMENT VARIABLE | CONTAINS
HOSTNAME             | Name of the execution (compute) node
SLURM_JOB_ID         | Batch job id assigned by Slurm
SLURM_JOB_NAME       | Name you assigned to the batch job
SLURM_JOB_NUM_NODES  | Number of nodes allocated to the job
SLURM_CPUS_PER_TASK  | Number of CPUs requested per task; only set if the --cpus-per-task option is specified for a threaded job
SLURM_ARRAY_TASK_ID  | Task id of an array job sub-task (see Array Jobs)
TMPDIR               | Name of a directory on the (compute) node scratch filesystem

When Slurm assigns one of your job's tasks to a particular node, it creates a temporary work directory on that node's 2 TB local scratch disk. And when the task assigned to that node is finished, Slurm removes the directory and its contents. The form of the directory name is

For example, after typing salloc --partition=standard on the head node, an interactive job 46 ($SLURM_JOB_ID) is allocated on node r1n00

and now we are ready to use our interactive session on node r1n00 and the temporary node scratch directory for this interactive job.

See Filesystems and Computing environment for more information about the node scratch filesystem and using environment variables.

Command Options

The table below lists sbatch's common options.

Slurm tries to satisfy all of the resource-management options you specify in a job script or as sbatch command-line options.

OPTIONS                      | DESCRIPTION
--job-name=<string>          | descriptive name for the job
--comment=<string>           | alternate description of the job (more verbose than job name)
--partition=<partition-name> | execute the command in this partition
--nodes=<#>                  | execute the command on this many distinct nodes
--ntasks=<#>                 | execute this many copies of the command
--ntasks-per-node=<#>        | execute this many copies of the command on each distinct node
--cpus-per-task=<#>          | each copy of the command should have this many CPU cores allocated to it
--mem=<#>                    | total amount of real memory to allocate to the job
--mem-per-cpu=<#>            | amount of memory to allocate to each CPU core allocated to the job
--exclusive                  | node(s) allocated to the job must have no other jobs running on them
--exclusive=user             | node(s) allocated to the job must have no jobs from other users running on them (the submitting user's own jobs may share the node)
--time=<time-spec>           | indicates a maximum wall time limit for the job

Time

If no --time=<time-spec> option is specified, then the default time allocated is 30 minutes.

The <time-spec> can be of the following formats:

  • <#> - minutes

  • <#>:<#> - minutes and seconds

  • <#>:<#>:<#> - hours, minutes, and seconds

  • <#>-<#> - days and hours

  • <#>-<#>:<#> - days, hours, and minutes

  • <#>-<#>:<#>:<#> - days, hours, minutes, and seconds

Thus, specifying --time=4 indicates a wall time limit of four minutes and --time=4-0 indicates four days.

IMPORTANT: Make sure the wall time is specified in one of the formats above. One of the most frequently seen errors is jobs terminating about 1 minute after starting (with a "TIMEOUT" error message).
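  --time=1 00:00:00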

doesn't look that different from
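  --time=1-00:00:00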

In the above case, the former is interpreted as 1 minute with a trailing second argument of "00:00:00", while the latter equals 1 day.

CPU Cores

The number of CPU cores associated with a job, and the scheme by which they are allocated on nodes, can be controlled loosely or strictly by the flags mentioned above. Omitting all such flags implies the default of a single task on a single node, meaning 1 CPU core will be allocated for your job.

Always associate tasks with the number of copies of a program, and cpus-per-task with the number of threads each copy of the program may use. While tasks can be distributed across multiple nodes, the cores indicated by cpus-per-task must all be present on the same node. Thus, programs parallelized with OpenMP directives would primarily be submitted using the --cpus-per-task flag, while MPI programs would use the --ntasks or --ntasks-per-node flag. Programs capable of hybrid MPI execution would use a combination of the two.

For example, putting the lines
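  #SBATCH --time=1:00:00
  #SBATCH --ntasks=4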

in the job script tells Slurm to set a hard limit of 1 hour of run time for the job and to allocate 4 tasks, each mapped to a single processor core.

Memory

When reserving memory for your job with the --mem or --mem-per-cpu option, the value is interpreted as MB if no units are specified; otherwise use the suffix k|M|G|T to denote kibibytes, mebibytes, gibibytes, or tebibytes. By default, if no memory specification is provided, Slurm will allocate 1G per core for your job. For example, specifying

--mem=8G

tells Slurm to reserve 8 gibibytes of memory for your job. However, specifying the following two options

--mem-per-cpu=8G --ntasks=4

tells Slurm to allocate 8 gibibytes of memory per core, for a total of 32 gibibytes of memory for your job.

 

Kibibytes, mebibytes, gibibytes, and tebibytes are defined as powers of 1024, whereas kilobytes, megabytes, gigabytes, and terabytes are defined as powers of 1000.

Specifying the correct node type and amount of memory is important because your allocation is billed based on the Service Unit (SU) and each SU varies with the type of node and memory being used. If you specify a node type with a larger amount of memory then you are charged accordingly even if you don't use it.

The Maximum Requestable Memory table above provides the usable memory values available for each type of node currently available on DARWIN.

The extended memory node is accessible by specifying the extended-mem partition and the --exclusive option. This allows only one user on the node at a time; that user may run multiple jobs on the node at once, sharing all of the available swap space, but no other user can be on the node during that time.

VERY IMPORTANT: Keep in mind that not all memory can be reserved for a node due to a small amount required for system use. As a result, the maximum amount of memory that can be specified is based on what Slurm shows as available. For example, the baseline nodes in DARWIN show a memory size of 488 GiB versus the 512 GiB of physical memory present in them. This means if you try to specify the full amount of memory (i.e. 512G) for the standard partition, then the job will be rejected. This will work if you specify a different partition with more memory. For example,
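For instance, a request along these lines would be rejected on standard but accepted on large-mem (a sketch; the script name is hypothetical):

  sbatch --partition=large-mem --mem=512G myproject.qs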

For GPU nodes, you must also specify one of the GPU resource option flags, otherwise your job will not be permitted to run in the GPU partitions.

Exclusive Access

If a job is submitted with the --exclusive option, the allocated nodes cannot be shared with other running jobs.

A job running on a node with --exclusive will block any other jobs from making use of resources on that host. To make sure your program is using all the cores on a node when specifying the exclusive option, include the --ntasks option inside the job script, e.g., --ntasks=64 on a 64-core DARWIN node.

Job script example:
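A minimal sketch of such a job script (partition, time, and program name are illustrative):

  #!/bin/bash -l
  #SBATCH --job-name=exclusive_test
  #SBATCH --partition=standard
  #SBATCH --exclusive
  #SBATCH --ntasks=64
  #SBATCH --time=01:00:00

  ./my_program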

The exclusive option works in two different ways in Slurm on DARWIN: by specifying --exclusive or by specifying --exclusive=user when submitting a job. In the first case, the job is allocated all of the resources available on the node regardless of its requirements; however, the job will only use the number of CPUs specified by the --ntasks option. In the second case, specifying =user means multiple jobs are allowed at the same time on the same node, but only jobs submitted by the user who has been assigned exclusive access.

GPU Nodes

Jobs that will run in one of the GPU partitions must request GPU resources using ONE of the following flags:

FLAG                      | DESCRIPTION
--gpus=<count>            | <count> GPUs total for the job, regardless of node count
--gpus-per-node=<count>   | <count> GPUs are required on each node allocated to the job
--gpus-per-socket=<count> | <count> GPUs are required on each socket allocated to the job
--gpus-per-task=<count>   | <count> GPUs are required for each task in the job

If you do not specify one of these flags, your job will not be permitted to run in the GPU partitions.

On DARWIN the --gres flag should NOT be used to request GPU resources. The GPU type will be inferred from the partition to which the job is submitted if not specified.

After setting your workgroup, GPU nodes can be requested through an interactive session using salloc or through batch submission using sbatch, with an appropriate partition name and one of the GPU resource flags above.

Submitting an Interactive Job

In Slurm, interactive jobs are submitted to the job scheduler first by using the salloc command specifying a partition after being in your workgroup:

Dissecting this output, we see that:

  1. the job was assigned a numerical job identifier or job id of 56

  2. the job is assigned to the standard partition with job resources tracked and billed against the allocation group (workgroup), it_css

  3. the job is executing on compute node r1n00

  4. no cores, memory or time are specified, so the defaults defined for all partitions are applied: 1 core, 1 GiB of memory, and 30 minutes

  5. the final line is a shell prompt, running on r1n00 and waiting for commands to be typed

One can specify any of the options applicable to sbatch in the table above when running the salloc command, if appropriate.

Here are more examples to reflect on:

What is not apparent from the text for the two examples:

  • the shell prompts on compute nodes r1n00 and r2l00 represent different node types due to the partition specified, and each has as its working directory the directory in which the salloc command was typed by user bjones in workgroup it_css; the first example starts in the home directory '~' and the second in the workgroup user directory, most likely /lustre/it_css/users/1201

  • memory specified as 400G or 500G is the total amount of memory needed for the job, which requires specifying the appropriate partition (standard or large-mem) given the maximum memory available on a particular node type

  • if resources had not been immediately available to this job, the text would have "hung" at "waiting for interactive job to be scheduled …" and later resumed with the message about its being successfully scheduled

By default, salloc will start a remote interactive shell on a node in the cluster. The alternative use is to open a shell on the login node itself and execute a series of srun commands against that allocation:

The command can have arguments presented to it:

The srun command accepts the same commonly-used options discussed above for sbatch and salloc.

Remember that typing exit relinquishes the interactive job, as well as any other interactive jobs started within it.

Now another set of examples specifying two nodes and executing a series of srun commands against that allocation:
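A sketch of what such a session might look like (node names and output will vary):

  salloc --nodes=2 --partition=standard
  srun hostname
  srun date
  exit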

Each use of srun inside the salloc session represents a job step and is applied to both nodes. The first use of srun is job step zero (0), the second job step 1, etc. When referring to a specific job step, the syntax is <job-id>.<job-step>. The Slurm accounting and billing mechanisms retain usage data for each job step as well as an aggregate for the entire job.

In order to dedicate (reserve) an entire node to run your programs only, one might want to use the --exclusive option. For more details, read about exclusive access.

Naming your Job

It can be confusing if a user has many interactive jobs submitted at one time. Taking a moment to name each interactive job according to its purpose may save the user a lot of effort later:
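For example (a sketch; the job name is arbitrary):

  salloc --partition=standard --job-name=data-cleanup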

The name provided with the --job-name command-line option will be assigned to the interactive session/job that the user started, rather than the default name interact. See Managing Jobs on the sidebar for general information about commands in Slurm to manage all your jobs on DARWIN.

Launching GUI Applications (VNC for X11 Applications)

Please review using VNC for X11 Applications as an alternative to X11 Forwarding.

Launching GUI Applications (X11 Forwarding)

We can launch GUI applications on the DARWIN using X-forwarding technique. However, there are some pre-requisites required in order to launch GUI applications using X-forwarding.

For Windows OS, Xming is an X11 display server which must be installed and running on Windows (Windows XP and later), and a PuTTY session must be configured with X11 before launching GUI applications on DARWIN. For help on configuring a PuTTY session with X11, see the X-Windows (X11) and SSH document for Windows desktop use.

For macOS, the SSH connection has to be started with the -Y argument (ssh -Y darwin.hpc.udel.edu), and XQuartz, an X11 display server, must be installed and running.

Once an SSH connection is established using X11 (and an X11 display server, Xming or XQuartz, is running), below are the steps to follow to test the session.

Type:

 

Type:

Check if the current session is being run with X11 using xdpyinfo | grep display and the name of the display should match the output above.

If the current session is not being run with X11 then you will likely get an error. Below is an example of an error when Xming was not running for a Windows PuTTY session:

Once we confirm the session is properly configured with X11 forwarding, now we are ready to launch a GUI application on the compute node.

Type:
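  salloc --partition=standard --x11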

This will launch an interactive job on one of the compute nodes in the standard partition, in this case r1n02, with default options of one cpu (core), 1 GB of memory and 30 minutes time.

Now the compute node and environment will be ready to launch any program that has a GUI (Graphical User Interface) and be displayed on your local computer display.

 

The X11 protocol was never meant to handle graphically intensive operations (in terms of bitmaps/textures), especially over a wireless network. In general, significant latency will be noticed while running GUI applications using X11 on Linux/Unix systems, and they are basically unusable over a wireless network.

Additionally, the --x11 argument can be augmented in this fashion --x11=[batch|first|last|all] to the following effects:

  • --x11=first This is the default, and provides X11 forwarding to the first of the compute hosts allocated.

  • --x11=last This provides X11 forwarding to the last of the compute hosts allocated.

  • --x11=all This provides X11 forwarding from all allocated compute hosts, which can be quite resource heavy and is an extremely rare use-case.

  • --x11=batch This supports use in a batch job submission, and will provide X11 forwarding to the first node allocated to a batch job.

These options can be used and further tested using the above display OR $DISPLAY commands.

Submitting the Job

Batch jobs are submitted to the job scheduler using the sbatch command:

Notice that the job name defaults to being the name of the job script; as discussed in the previous section, a job name can also be explicitly provided

job_script_02.qs

Specifying Options on the Command Line

It has already been demonstrated that command-line options to the sbatch command can be embedded in a job script. Likewise, the options can be specified on the command line. For example:
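  sbatch --output=output%j.txt job_script_02.qs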

The --output option was provided in the job script and on the command line itself: Slurm will honor options from the command line in preference to those embedded in the script. Thus, in this case the "output%j.txt" provided on the command line overrode the "my_job_op%h.txt" from the job script.

The sbatch command has many options available, all of which are documented in its man page. A few of the often-used options will be discussed here.

Default Options

There are several default options that are automatically added to every sbatch by Slurm as well as default resource requirements supplied, however an explanation of each is beyond the scope of this section. Providing an alternate value for any of these arguments – in the job script or on the sbatch command line – overrides the default value.

Email Notifications

Since batch jobs can run unattended, the user may want to be notified of status changes for a job: when the job begins executing; when the job finishes; or if the job was killed. Slurm will deliver such notifications (as emails) to a job's owner if the owner requests them using the --mail-user option:

OPTION                           | DESCRIPTION
--mail-user=<email-address>      | deliver state-change notification emails to this address
--mail-type=<state>{,<state>...} | deliver notification emails when the job enters the state(s) indicated
--requeue                        | if this job is preempted by a higher-priority job, automatically resubmit it to execute again using the same parameters and job script

Consult the man page for the sbatch command for a deeper discussion of each of the --mail-type states. Valid state names are NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT_50, TIME_LIMIT_80, TIME_LIMIT_90, TIME_LIMIT, ARRAY_TASKS. The time limit states with numbers indicate a percentage of the full runtime: so enabling TIME_LIMIT_50 will see an email notification being delivered once 50% of the job's maximum runtime has elapsed.
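For example, the following directives (with a hypothetical email address) request notifications at job start and on failure:

  #SBATCH --mail-user=bjones@udel.edu
  #SBATCH --mail-type=BEGIN,FAIL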

Handling System Signals aka Checkpointing

Generally, there are two possible cases in which jobs are killed: (1) preemption, and (2) the walltime configured within the job script has elapsed. Checkpointing can be used to intercept and handle the system signals in each of these cases to write out a restart file, perform cleanup or backup operations, or carry out any other tasks before the job gets killed. Of course, this depends on whether or not the application or software you are using is checkpoint enabled.

 

Please review the comments provided in the Slurm job script templates available in /opt/shared/templates, which demonstrate ways to trap these signals.

"TERM" is the most common system signal that is triggered in both the above cases. However, there is a working logic behind the preemption of job which works as below.

When a job gets submitted to a workgroup-specific partition and resources are tied up by jobs in the idle partition, the jobs in the idle partition will be preempted to make way. Slurm sends a preemption signal to the job (SIGCONT followed by SIGTERM), waits for a grace period (5 minutes), then signals again (SIGCONT followed by SIGTERM) before killing it (SIGKILL). However, if the job can simply be re-run as-is, the user can submit with --requeue to indicate that an idle job that was preempted should be rerun on the idle partition (possibly restarting immediately on different nodes; otherwise it will need to wait for resources to become available).

For example, using the logic provided in one of the Slurm job script templates, one can catch these signals during preemption and handle them by performing cleanup or backing up the job results, as follows.

To catch signals asynchronously in Bash, you have to run commands in the background and "wait" for them to complete. This is why the templates include a shell function named UD_EXEC when you set UD_JOB_EXIT_FN to a trap function name.

If you implement restart logic at the start of the script, then you can avoid signal handling entirely by using the --requeue option with sbatch. This option tells Slurm that when the job is preempted, it should automatically be moved back into the queue to execute again.

Job Output

Equally as important as executing the job is capturing any output produced by the job. By default, all of the output (stdout and stderr) is sent to a single file; that output file is named according to the formula

For the weather-processing example above, the output would be found in

 

In the job script itself it is often counterproductive to redirect a constituent command's output to a file. Allowing all stdout/stderr output to be directed to the file Slurm provides automatically gives a degree of "versioning" of all runs of the job by way of the -[job id] suffix on the output file's name.

The name of the output file can be overridden using the --output command-line option to sbatch. The argument to this option is the name of the file, possibly containing special characters that will be replaced by the job id, job name, etc. See the sbatch man page for a complete description.

By default, stdout and stderr are directed to the same file. To redirect the error output to a separate file, use the --error option; stderr is then directed to a file named according to the pattern you provide.
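For instance, the following options split the two streams (the filenames are illustrative; %j expands to the job id):

    #SBATCH --output=myjob-%j.out
    #SBATCH --error=myjob-%j.err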

Array Jobs

An array job essentially runs the same job script many times as a set of repeated tasks. For each task, the environment variable SLURM_ARRAY_TASK_ID is set to a unique value, and that value provides input to the job submission script.

The %A_%a construct in the output and error file names is used to generate unique files based on the master job ID (%A) and the array task ID (%a). In this fashion, each array task writes to its own output and error file.

Example: #SBATCH --output=arrayJob_%A_%a.out

 

The SLURM_ARRAY_TASK_ID is the key to making array jobs useful. Use it in your bash script, or pass it as a parameter so your program can decide how to complete the assigned task.

For example, the SLURM_ARRAY_TASK_ID sequence values of 2, 4, 6, …, 5000 might be passed as an initial data value to 2500 repetitions of a simulation model. Alternatively, each iteration (task) of a job might use a different data file with filenames of data$SLURM_ARRAY_TASK_ID (i.e., data1, data2, data3, …, data2000).

The general form of the SBATCH option is:

--array=<start_value>-<stop_value>:<step_size>

For example, to run index values 1 through 7 with a step size of 2, the option would be:

--array=1-7:2

A comma-separated list may also be given to run specific index values, for example 1, 2, 5, 19, and 27:

--array=1,2,5,19,27
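Putting this together, a minimal array job script might look like the sketch below (my_program and the data file naming are hypothetical placeholders):

    #!/bin/bash -l
    #SBATCH --job-name=arrayJob
    #SBATCH --output=arrayJob_%A_%a.out
    #SBATCH --array=1-7:2

    # Each task (1, 3, 5, 7) processes its own input file, e.g. data1, data3, data5, data7.
    ./my_program "data${SLURM_ARRAY_TASK_ID}"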

Chaining Jobs

If you have multiple jobs and want other job(s) to run automatically after the execution of another job, you can use chaining. When you chain jobs, remember to check the status of the previous job to determine if it completed successfully. This will prevent the system from flooding the scheduler with failed jobs. Here is a simple chaining example with three job scripts: doThing1.qs, doThing2.qs and doThing3.qs.

The running of a job can be held until a particular job completes. This can be done so as not to "hog" resources, or because the output of one job is needed as input for the second. Job dependencies are used to defer the start of a job until the specified dependencies have been satisfied. They are specified with the --dependency option to sbatch in the format shown below.

The --dependency portion of the sbatch man page lists the flags that are used to implement chained jobs; "type" in the format below indicates the flag used to establish the dependency.
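The general form (see the sbatch man page for the complete syntax; job_script.qs is a placeholder) is:

    sbatch --dependency=<type>:<job_id>[:<job_id>...] job_script.qs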

The following do1.qs script does 3 important things:

  • It first sleeps for 30 seconds. This gives us time to start dependent jobs.

  • It does an ls of a non-existent file, which produces a non-zero exit code.

  • It runs the "hello world" program phostname.

do1.qs
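The original script is not reproduced here; a minimal sketch consistent with the description above (the missing filename is a hypothetical placeholder) might be:

    #!/bin/bash -l
    #SBATCH --job-name=do1

    # Sleep long enough to allow dependent jobs to be submitted.
    sleep 30

    # List a file that does not exist; this command exits with a non-zero status.
    ls file_that_does_not_exist.txt

    # Run the "hello world" program.
    ./phostname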

The same script can be submitted multiple times to demonstrate the dependency option; the afterok and afterany flags are used to establish the dependencies.
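For example, the submissions might look like this sketch (the job IDs shown are the ones referenced below):

    sbatch do1.qs                                 # submitted as job 36805
    sbatch --dependency=afterany:36805 do1.qs     # job 36806
    sbatch --dependency=afterok:36805 do1.qs      # job 36807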

Job 36806 will only start after the initial run (36805) has finished execution, irrespective of its exit status; this is implemented using the afterany flag in the sbatch command. In the other case, job 36807 will start only after the first run (36805) finishes successfully (runs to completion with an exit code of zero).

The result of the "ls" command will not affect the overall status of the job, so it might not always be sufficient to just use afterok when chaining jobs. The alternative is to manually check the error status of individual commands within a script: the error status of a command is held in the variable $?. This can be checked, and the script can then be forced to exit. For example, we can add the following after the failing command.
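A check along these lines (a sketch; the exact line used in the original example may differ) would do it:

    if [ $? -ne 0 ]; then
        echo "Command failed with a non-zero exit status; aborting job"
        exit 1
    fi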

do1.qs

Now, job 36807 will not run after submission, because the initial run (36805) will exit with a non-zero status due to the if condition included in the above script.

This is how chained jobs can be implemented using the --dependency option.

Threads

Programs that use OpenMP or some other form of thread parallelism should use the "threads" parallel environment. This environment logically limits jobs to run on a single node only, which in turn limits the maximum number of workers to be the CPU core count for a node.

For more details, please look at the job script template /opt/shared/templates/slurm/generic/thread.qs.
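As a sketch of the key options (not the full template; my_openmp_program is a placeholder), a threaded job confines itself to one node and matches the OpenMP thread count to the allocated cores:

    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8

    # Use exactly the cores Slurm allocated to this task.
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_openmp_program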

MPI

It is the user's responsibility to set up the MPI environment before running the actual MPI job. The job script template found in /opt/shared/templates/slurm/generic/mpi/mpi.qs will set up your job requiring a generic MPI parallel environment. This parallel environment spans multiple nodes and allocates workers by "filling up" one node before moving on. Slurm looks for the --ntasks-per-node option to restrict the allocations per node as part of the filling-up strategy; if it is not specified, then the default filling-up behavior proceeds. When a job starts, an MPI "machines" file is automatically manufactured and placed in the job's temporary directory at ${TMPDIR}/machines. This file should be copied to a job's working directory or passed directly to the mpirun/mpiexec command used to execute the MPI program.
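For instance, a generic MPI job might combine the options and launcher along these lines (a sketch only; my_mpi_program is a placeholder, and the template handles the environment setup):

    #SBATCH --ntasks=40
    #SBATCH --ntasks-per-node=20

    # Pass the machines file described above to a generic MPI launcher.
    mpirun -n $SLURM_NTASKS -machinefile ${TMPDIR}/machines ./my_mpi_program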

 

Software that uses MPI but is not started using mpirun or mpiexec will often have arguments or environment variables which can be set to indicate on which hosts the job should run or what file to consult for that list. Please consult software manuals and online support resources before contacting UD-IT for help determining how to pass this information to the program.

Submitting a Parallel Job

As with choosing the parallel environment in Grid Engine, choosing the appropriate number of tasks, threads, and CPUs required for the job is an important step in Slurm. A lot of information has been documented as comments in the template job scripts for your better understanding. In addition, below are a few Slurm arguments that carry more weight when running a parallel job.

| Options | Description |
|---------|-------------|
| --nodes=<#> | execute the command on this many distinct nodes |
| --ntasks=<#> | execute this many copies of the command |
| --ntasks-per-node=<#> | execute this many copies of the command on each distinct node |
| --cpus-per-task=<#> | each copy of the command should have this many CPU cores allocated to it |
| --mem=<#> | total amount of real memory to allocate to the job |
| --mem-per-cpu=<#> | amount of memory to allocate to each CPU core allocated to the job |

A clear understanding of the differences between these arguments is necessary to work effectively with parallel jobs.

Using the --nodes option together with --ntasks-per-node is equivalent to specifying --ntasks, since the number of nodes multiplied by the number of tasks per node gives the total number of tasks into which the problem has been divided.

When a parallel job executes, the following environment variables will be set by Slurm:

| Variable | Description |
|----------|-------------|
| SLURM_CPUS_PER_TASK | The number of slots granted to the job. OpenMP jobs should assign the value of $SLURM_CPUS_PER_TASK to the OMP_NUM_THREADS environment variable, for example. |
| SLURM_JOB_NODELIST | List of nodes allocated to the job. |
| SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. |

The mechanism by which you can spread your job across nodes is a bit more complex. If your MPI job wants N CPUs and you are willing to have as few as M of them running per node, then the maximum node count is ⌈N/M⌉ (N divided by M, rounded up).

Order is significant, so if N=20 and you are willing to run 6 or more tasks per node, then use:
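The original example is not reproduced here; a sketch consistent with ⌈20/6⌉ = 4 nodes would be:

    #SBATCH --nodes=4
    #SBATCH --ntasks=20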

Do not rely on the output of scontrol show job or squeue with regard to the node count while the job is pending; it will not be accurate. Only once the job is scheduled will it show the actual value.

For example,

The scheduler found 5 nodes with 80 free CPUs (1@r00n17, 35@r01n03, 35@r01n12, 8@r01n16, 1@r01n50):

Job Templates

Detailed information pertaining to individual kinds of parallel jobs is provided by UD IT in a collection of job template scripts on a per-cluster basis under the /opt/shared/templates/slurm/generic directory. For example, on DARWIN this directory looks like:

The directory layout is self-explanatory: script templates for MPI jobs can be found in the mpi directory (Open MPI in openmpi, generic MPI in generic, and MPICH in mpich, all under mpi); a template for serial jobs is serial.qs, and threads.qs should be used for OpenMP jobs. These scripts are heavily documented to aid in your choice of appropriate templates and are updated as we uncover best practices and performance issues. Please copy a script template for new projects rather than potentially using an older version from a previous project. See DARWIN Slurm Job Script Templates for more details.

Need help? See Introduction to Slurm in UD's HPC community cluster environment.

Array Jobs

Hearkening back to the text-processing example cited above, the analysis of each of the 100 files could be performed by submitting 100 separate jobs to Slurm, each modified to work on a different file. Using an array job helps to automate this task: each sub-task of the array job gets assigned a unique integer identifier. Each sub-task can find its identifier in the SLURM_ARRAY_TASK_ID environment variable.

Consider the following:
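The original example is not reproduced here; based on the description that follows, it would be along the lines of:

    #SBATCH --array=1-4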

Four sub-tasks are executed, numbered from 1 through 4. The starting index must be greater than zero, and the ending index must be greater than or equal to the starting index. The step size going from one index to the next defaults to one, but can be any positive integer. A step size is appended to the sub-task range, as in 2-20:2 (proceed from 2 up to 20 in steps of 2: 2, 4, 6, …, 20).

 

The default job array size limits for Slurm are used on DARWIN to avoid oversubscribing the scheduler node's own resource limits (causing scheduling to become sluggish or even unresponsive).

Partitioning Job Data

There are essentially two methods for partitioning input data for array jobs. Both methods make use of the sub-task identifier in locating the input for a particular sub-task.

If the 100 novels were in files with names fitting the pattern novel_<sub-task-id>.txt, then the analysis could be performed with the following job script, gerund_array.qs:
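A sketch of what such a script could contain (gerund_count is the analysis command referenced in the text; everything else is illustrative):

    #!/bin/bash -l
    #SBATCH --job-name=gerund_count
    #SBATCH --array=1-100

    # Each sub-task analyzes one novel and writes its own result file.
    ./gerund_count "novel_${SLURM_ARRAY_TASK_ID}.txt" > "gerund_count_${SLURM_ARRAY_TASK_ID}"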

When complete, the job will produce 100 files named gerund_count_<sub-task-id>, where the sub-task-id relates each result file to its input file.

An alternate method of organizing the chaos associated with large array jobs is to partition the data into directories: the sub-task identifier is not applied to the filenames but is instead used to set the working directory for each sub-task. With this kind of logic, the job script gerund_arrays.qs looks like:
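A sketch of the directory-based variant (the per-directory input filename novel.txt is a hypothetical placeholder):

    #!/bin/bash -l
    #SBATCH --job-name=gerund_count
    #SBATCH --array=1-100

    # Work inside the directory that matches this sub-task's identifier.
    cd "${SLURM_ARRAY_TASK_ID}" || exit 1
    ../gerund_count novel.txt > gerund_count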

When complete, each directory will have a file named gerund_count containing the output of the gerund_count command.

Using an Index File

The partitioning scheme can be as complex as the user desires. If the directories were not named "1" through "100" but instead used the name of the novel contained within, an index file could be created containing the directory names, one per line:

The job submission script gerund_array.qs might then look like:
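A sketch of how the index file might be consulted (paths and the input filename are illustrative; the sed usage matches the description below):

    #!/bin/bash -l
    #SBATCH --job-name=gerund_count
    #SBATCH --array=1-100

    # Pick the directory name on line $SLURM_ARRAY_TASK_ID of index.txt.
    NOVEL_DIR="$(sed -n "${SLURM_ARRAY_TASK_ID}p" index.txt)"
    cd "$NOVEL_DIR" || exit 1
    ../gerund_count novel.txt > gerund_count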

The sed command selects a single line of the index.txt file; for sub-task 1 the first line is selected, sub-task 2 the second line, etc.

| Node type | Slurm selection options | RealMemory (MiB) | RealMemory (GiB) |
|-----------|-------------------------|------------------|------------------|
| Standard / 512 GiB | --partition=standard | 499712 | 488 |
| Large Memory / 1 TiB | --partition=large-mem | 999424 | 976 |
| Extra-Large Memory / 2 TiB | --partition=xlarge-mem | 2031616 | 1984 |
| nVidia-T4 / 512 GiB | --partition=gpu-t4 | 499712 | 488 |
| nVidia-V100 / 768 GiB | --partition=gpu-v100 | 737280 | 720 |
| amd-MI50 / 512 GiB | --partition=gpu-mi50 | 499712 | 488 |
| Extended Memory / 3.73 TiB | --partition=extended-mem --exclusive | 999424 | 976 |

Managing Jobs on DARWIN

Once a user is able to submit jobs to the cluster (interactive or batch), the user will from time to time want to know what those jobs are doing. Is the job waiting in a queue for resources to become available, or is it executing? How long has the job been executing? How much CPU time or memory has the job consumed? Users can query Slurm for job information using the squeue command while the job is still active in Slurm. The squeue command has a variety of command-line options available to customize and filter what information it displays; discussing all of them is beyond the scope of this document. Use the squeue --help or man squeue commands on the login node to view a complete description of the available options.

With no options provided, squeue defaults to displaying a list of all jobs currently active in Slurm, submitted by all users of the cluster. This includes jobs that are waiting in a queue, jobs that are executing, and jobs that are in an error state. The list is presented in a tabular format, with the following columns:

| Column | Description |
|--------|-------------|
| JOBID | Numerical identifier assigned when the job was submitted |
| PARTITION | The partition to which the job is assigned |
| NAME | The name assigned to the job |
| USER | The owner of the job |
| ST | Current state of the job (see next table) |
| TIME | Either the time the job was submitted or the time the job began execution, depending on its state |
| NODES | The number of nodes assigned to the job |
| NODELIST(Reason) | The list of nodes on which the job is running, or the reason the job is in its current state (other than running) |

The different states in which a job may exist are enumerated by the following codes:

| State code | Description |
|------------|-------------|
| CA | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated |
| CG | Job is in the process of completing. Some processes on some nodes may still be active |
| F | Job terminated with a non-zero exit code or other failure condition |
| PD | Job is awaiting resource allocation |
| PR | Job terminated due to preemption |
| R | Job currently has an allocation and is running |
| RD | Job is held |
| RQ | Completing job is being requeued |
| RS | Job is about to change size |
| S | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs |
| TO | Job terminated upon reaching its time limit |

There are many other possible job state codes, which can be found in the official Slurm documentation.

Checking Job Status

Deleting a Job

Use the scancel «job_id» command to remove pending and running jobs from the queue.

For example, to delete job 28000
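The invocation is simply:

    scancel 28000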

Available Resources

UD IT Status Commands

Software Installation on DARWIN

First, see if your software is already installed by using vpkg_list on DARWIN. Then check the Software documentation (organized alphabetically on the sidebar) to see if there are any specific instructions provided for its use and/or installation on DARWIN.

The HPC team has a set of standards and technology used to reduce complexity and bring consistency to the process of software installation and management.

Software is generally built, installed, and accessed using the VALET system (not Modules, but similar) developed by Dr. Jeffrey Frey. VALET provides the ability to modify your environment without editing startup files such as .bashrc and .bash_profile; keeping the login environment clean in this way helps prevent jobs from failing.

This process of basic software installation and management is described in Software Management and Workgroup Directory.

With that said, you may need to use other methodologies for installing your software, such as a Python virtualenv or a container technology like Singularity, along with creating a VALET package to properly add it to your environment.

Workgroup Software Installs

Each workgroup receives its own dedicated storage space. If a workgroup uses software that IT does not maintain, then it is likely that software is present within that storage space. Oftentimes each individual workgroup member maintains his or her own copy; this may be necessary if frequent redimensioning and recompiling is part of the job workflow.

This document assumes that there is software necessary to a workgroup's productivity that does not require per-job recompilation. While the scheme presented here works best when individuals do not require their own copies of the software, it can be extended to include that situation as well.

For the sake of examples, the workgroup name used in this document will be it_nss.

Directory Structure

After using the workgroup command (either explicitly or within your login files), the WORKDIR, WORKDIR_USERS and WORKDIR_SW environment variables are set based on the allocation workgroup directory. Create a directory to contain all workgroup-specific software:
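A sketch of the commands (the permissions shown are an assumption consistent with the description that follows: setgid and group-writable so any workgroup member can add to the directory):

    mkdir -p "$WORKDIR_SW"
    chmod 2775 "$WORKDIR_SW"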

The privileges assigned to the sw directory allow any member of the workgroup to create directories/files inside it.

To streamline VALET usage, a directory named valet has been created inside sw to contain any VALET package definition files representing the software you've installed. The VALET software will automatically check this directory if present:
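Again a sketch, with the same assumed permissions:

    mkdir -p "$WORKDIR_SW/valet"
    chmod 2775 "$WORKDIR_SW/valet"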

Again, the directory is made writeable by all workgroup members.

The IT-managed software organized under /opt/shared is a good example of how individual software packages will be installed inside $WORKDIR_SW. Each package gets its own subdirectory, named in lowercase. Individual versions of that software package are installed in subdirectories of that:

You may have seen an attic directory in some of the packages in /opt/shared – it is used to hold the original source/installer components for a software package.

"Buildmeister"

Some workgroups may elect to have one individual maintain their software; call him or her the buildmeister. A workgroup may have several buildmeisters who each maintain some subset of the software. At the opposite extreme, every group member acts as buildmeister for his or her own software. However your workgroup decides to divide that responsibility, it is best to leave package version directories and their contents writable ONLY by the buildmeister for that package (or version of a package).

Building from Source

Let's say that version 2.2 of "SuperCFD" has just been released and I've logged in and downloaded the Unix/Linux source code to my desktop. I copy the source package to DARWIN using scp (or sftp) into the remote directory /lustre/it_nss/sw/supercfd/attic. Note that this is the value of $WORKDIR_SW plus the paths created in the previous section.

On DARWIN, I change to the SuperCFD software directory and prepare for the build of version 2.2:
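A sketch of the preparation (the directory names follow the layout described above):

    cd "$WORKDIR_SW/supercfd"
    mkdir -p 2.2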

With the source code uploaded to the attic directory (or you could use wget «URL» to download into the attic directory), I can unpack inside the version 2.2 directory I just created:
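A sketch, assuming the source bundle is named supercfd-2.2.tar.bz2 (the name is illustrative):

    cd "$WORKDIR_SW/supercfd/2.2"
    tar -xjf ../attic/supercfd-2.2.tar.bz2
    mv supercfd-2.2 src    # rename the unpacked directory to src, as described below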

In this case, the source was a bzip-compressed tar file, so it was unpacked using the tar command. Other software might be packed as a gzip-compressed tar file or a ZIP file, so the actual command will vary.

The authors of SuperCFD organize their source code along the GNU autoconf guidelines. The source code directory was renamed from supercfd-2.2 to src; when running ./configure the install prefix will be set to $WORKDIR_SW/supercfd/2.2 which will then have e.g. bin, etc, share, lib directories accompanying the src directory. For any software package that uses autoconf or CMake, this is the organizational strategy IT has adopted.

Building for Autoconf/CMake

The SuperCFD package requires Open MPI 1.6 and NetCDF:
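A sketch of loading them with VALET (the exact package identifiers and versions available on DARWIN may differ):

    vpkg_require openmpi/1.6
    vpkg_devrequire netcdf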

The vpkg_devrequire command is used for NetCDF because it will set the CPPFLAGS and LDFLAGS environment variables to include the appropriate -I and -L arguments, respectively. For source code that is configured for build using GNU autoconf or CMake this is usually good enough.

Some autoconf or CMake packages may require that the actual location of required libraries be specified. In such cases, use vpkg_devrequire as above and examine the value of CPPFLAGS or LDFLAGS e.g. using echo $CPPFLAGS.

Some software packages will require the buildmeister to pass arguments to the ./configure script or set environment variables. Keeping track of what parameters went into a build can be difficult. It's often a good idea to make note of the commands involved in a text file:

setup.sh
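The original file is not reproduced here; a sketch of what it might record (all names and flags are illustrative):

    #!/bin/bash -l
    # Record of how SuperCFD 2.2 was configured and built.
    vpkg_require openmpi/1.6
    vpkg_devrequire netcdf

    cd "$WORKDIR_SW/supercfd/2.2/src"
    ./configure --prefix="$WORKDIR_SW/supercfd/2.2"
    make
    make install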

In the future the nature of the build is easily determined by looking at this file, versus examining build logs, etc. The process with CMake would be similar to what is outlined above.

After the software is built (e.g. using make and possibly a make check or make test) and installed (probably using make install) the software is ready for use:
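Once a VALET package definition for it has been added under $WORKDIR_SW/valet, loading it would look something like this (the package id is illustrative):

    vpkg_require supercfd/2.2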

UDBUILD Software Deployment

Each software package installed and deployed on the DARWIN cluster has its own method for compiling and installing. To manage this process, the HPC team has a set of standards and technology used to reduce complexity and bring consistency to the process.

Software is built, installed, and accessed using the VALET system developed by Dr. Jeffrey Frey. IT developed a set of software helper functions which can be accessed through VALET by importing the udbuild vpkg.

This page describes the file system layout used by IT, and the anatomy of the udbuild file used to build and deploy software. Throughout this process, it is helpful to have an understanding of how to use VALET to add and remove software packages from your environment.

File System

Software is deployed to /opt/shared. The udbuild system defaults to this /opt/shared location. However, this can be changed by setting the UDBUILD_HOME environment variable before initializing the udbuild environment. A good value for this environment variable for workgroup software is $WORKDIR/sw, and for a personal software installation is $HOME/sw. Refer to workgroup software installs for help setting up your directories for workgroup storage.

Beneath this directory should be an attic sub-directory for downloaded software bundles, optionally an add-ons directory for software with optional add-ons, and one sub-directory for each package installed. These sub-directories should always be in all lower-case letters. One more layer down should be a directory for each version of the software installed. It is important to understand that on a complex cluster like DARWIN, the same release of a software package may have multiple installations due to various compiler and dependency package requirements. These directories are the software installation roots.

Underneath the installation root should be a directory called src, which is the unpacked source bundle. Next to src should be any bin, lib, share, etc. directories necessary for the final deployment.

An illustrated example of the software directory structure is as follows:

  • opt
    • shared
      • atlas
        • 3.10.3
        • 3.10.3-intel
        • attic
          • udbuild - build and install script for atlas
      • python
        • 2.7.8
        • 3.2.5
        • add-ons
          • python2.7.15
            • mpi
              • 20180613
          • python3.2.5
        • attic
          • udbuild - build and install script for python

Building

When building software, the base directory structure (including the attic directory) should be created by you before proceeding further. You should download the software source bundle into attic. Then, unpack the software bundle and rename the directory to src as above. This provides consistency in finding the source bundle and the udbuild file.

Examples of builds are provided below (after the udbuild function documentation).

udbuild functions

init_udbuildenv

This function initializes the udbuild environment. It ensures that you have the required PKGNAME and VERSION environment variables defined and that you do not have VALET packages loaded before udbuild in your VALET history (these might affect your build), sets compiler variables like CC, FC, etc., and finally sets your PREFIX and VERSION variables based on its command line. The following command-line options affect init_udbuildenv:

  • none - This is equivalent to not supplying any parameters

  • python-addon - Ensure a python VALET package is loaded, and set PREFIX appropriately for that python version's add-on installation path

  • r-addon - Ensure an R VALET package is loaded, and set PREFIX appropriately for that R version's add-on installation path

  • node-addon - Ensure a Node.js VALET package is loaded, and set PREFIX appropriately for that Node.js version's add-on installation path

  • Any other arguments are treated as the names of VALET packages which are loaded and added to the VERSION environment variable.

After all of this, your PREFIX variable will be set to ${UDBUILD_HOME}/${PKGNAME}/${VERSION} (for example, /opt/shared/cdo/1.6.4).

debug

Drop into a debug shell. If the debug shell is exited cleanly, the udbuild script will continue from that point. This is a useful routine when creating a udbuild script: you may want to build the script based on documentation and run a debug function between the configure and make steps to verify the environment looks sane. After the first successful compile, you can remove the debug line.

download

The download function takes one mandatory and one optional argument. The mandatory first argument is the URL to download. The resulting file will be named after the last part of the URL unless the optional second argument is specified, in which case the resulting file will be named after the second argument.

If a file with the same name already exists, the download exits successfully without doing anything. If you wish to re-download the archive, delete or rename the existing one.

unpack

The unpack function takes one mandatory and one optional argument. The mandatory first argument is the name of an archive file (tar.gz, tar.bz2, tar.xz, zip, etc.) to extract. The archive will be unpacked into a directory named src under the install prefix (the versioned installation directory) unless the optional second argument is specified, in which case that name is used in place of src. This directory, and its parents if necessary, will be created prior to extraction. Source archives using the tar format customarily have a single top-level directory entry which contains the package name and version; this is automatically stripped from the extracted archive. After completing the extraction, the unpack function places the udbuild script into the newly created directory to prepare for the configure and make steps.

If the src (or alternately specified) directory exists, then the archive is not extracted over it. In this case, the function returns successfully without doing anything. If you wish to force a new extraction, remove or rename the existing src directory.

create_valet_template

Create a YAML-based VALET package file template and place it in the attic directory if one can be found, or in the same directory as the udbuild script otherwise. This template is helpful, but usually cannot simply be copied into place. For example, it only knows about the version of the software it is installing, and copying the file blindly would remove entries for all other versions. Furthermore, the "dependencies" entry is filled with all loaded VALET packages, even those that are dependencies of dependencies and do not need to be explicitly listed.

valet

This function takes either the name of a package (e.g. openmpi) or a package name/version pair (e.g. openmpi/1.8.2) and returns true if there is a VALET package loaded to satisfy this dependency, and false otherwise.

This function can be used along with any other shell constructs, such as if...else...fi, to modify the behaviour of a build.

version

This function takes a string and validates that it exists as a complete entry in the VERSION string (i.e., it is at the start or end of the string or is bounded by hyphens).

This function can be used along with any other shell constructs, such as if...else...fi, to modify the behavior of a build.

package

This function takes a string and validates that it exists as part of the final package name, which may include features. This is useful for matching features which are specified to configure a software build but do not show up in the version string or require a VALET package to be available; the string "threads" for OpenMP is a good example of using this feature.

ifvalet

This function is shorthand for if valet "$1"; then shift; eval "$@"; fi to make udbuild scripts simpler to read and code.

ifversion

This function is shorthand for if version "$1"; then shift; eval "$@"; fi to make udbuild scripts simpler to read and code.

ifpackage

This function is shorthand for if package "$1"; then shift; eval "$@"; fi to make udbuild scripts simpler to read and code.

udbuildcapture

Put all screen output into a capture file. The main purpose of this is to log the questions answered during an interactive install, to document what choices were made.

udbuildmon

This script is helpful to be run during the install phase of a build, for example:

It will record all open-for-write and mkdir system calls in a file named udbuildmon.log. You can use this log file to verify the build did not write any files to unknown locations. This function should not be necessary with cmake builds, as they normally store this information in an install_manifest.txt file.

apath

Append a path to a variable and ensure that variable is exported. The required first argument is the environment variable name, all remaining arguments are paths to append to the end of the environment variable. A colon (:) character is used as the delimiter, as is standard in path environment variables.

ppath

Prepend a path to a variable, similar to apath, but instead of adding the path to the end, add it to the beginning. Arguments are the same as for apath.

rpath

Remove a path from an environment variable. The required first argument is the environment variable name. All remaining arguments are removed from the environment variable. If an entry exists multiple times, all instances are removed.

aflag

Append a flag to an environment variable. The required first argument is the environment variable name; all remaining arguments are added to the environment variable. The aflag function works in two contexts, depending on the type of the first argument. If it is an already-defined bash array variable, then the remaining arguments are added as new elements of the array. In all other cases, the remaining arguments are appended to the string using a space character as the delimiter.

Using a bash array has the advantage of allowing flags which contain whitespace characters. If this is a requirement, the following steps should be undertaken:
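A sketch consistent with the function signature described above (the variable name and flag values are illustrative):

    # Declare the variable as a bash array before adding flags containing whitespace.
    declare -a CONFIG_FLAGS
    aflag CONFIG_FLAGS "--with-label=first value" "--with-other=second value"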

pflag

Prepend flags to an environment variable. This is the same as the aflag function, but it puts its arguments at the beginning of the variable. Its arguments are identical.

rflag

Remove a flag from an environment variable. This works similarly to rpath and also supports bash arrays.

udexpect

This is a wrapper around the TCL expect utility to simplify the process of answering questions for interactive builds. This function accepts an expect script as STDIN (the normal method is via HERE-DOC) and provides all the basics of running expect. Some standard responses are provided to simplify the process:

  • enter - Send a carriage return as if the user pressed their "Enter" key

  • yes - Send the string "yes" as if the user typed "yes" and pressed "Enter"

  • no - Send "no" and press "Enter"

  • y - Send "y" and press "Enter"

  • n - Send "n" and press "Enter"

  • respond text - Send text and press "Enter"

  • keypress c - Send the character c and DO NOT press "Enter"

  • user - Prompt the person at the keyboard for a response, send it, and press "Enter"

makeflags_set

Update a file (presumably a Makefile, specified as the first argument) which uses the syntax "key=value", setting the key named by the second argument to the value given by the third argument. This is a simple helper function that makes it easy to edit basic information in a Makefile.
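For instance (a hypothetical invocation consistent with the description above):

    # Set CC=gcc in src/Makefile
    makeflags_set src/Makefile CC gcc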

makeflags_prepend

Update a file similar to makeflags_set, except prepend the third argument to the existing value, instead of replacing it.

makeflags_append

Update a file similar to makeflags_set, except append the third argument to the existing value, instead of replacing it.

udbuild script examples

simple

In this example, an easy-to-install software package called cmake is built and installed. It has no software dependencies and uses the standard configure, make, make install procedure used by very many open-source software packages.

To prepare for this build, you would want to create the following directories:
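A sketch of the preparation, assuming the default UDBUILD_HOME of /opt/shared:

    mkdir -p /opt/shared/cmake/attic
    # download the source bundle into attic; the versioned directory and its src
    # sub-directory are populated by the download/unpack steps in the udbuild script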

udbuild
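The actual script is not reproduced here. A heavily simplified sketch, based only on the functions documented above (the version number, download URL, and the assumption that the udbuild helpers are loaded with vpkg_require udbuild are all illustrative):

    #!/bin/bash -l
    vpkg_require udbuild            # assumption: loads the udbuild helper functions

    PKGNAME=cmake
    VERSION=3.16.2                  # illustrative version
    init_udbuildenv

    download "https://example.org/cmake-${VERSION}.tar.gz"   # hypothetical URL
    unpack "cmake-${VERSION}.tar.gz"

    cd "${PREFIX}/src"
    ./configure --prefix="${PREFIX}"
    make
    make install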

It is imperative to start udbuild scripts with the line #!/bin/bash -l, because this instructs bash to set up the VALET system.

medium

udbuild

In this example, we use vpkg_devrequire to specify additional dependencies needed to build the cdo package. PREFIX, however, will still be set to /opt/shared/cdo/1.6.4.

complex

udbuild

In this more complicated example, we still need dependencies, but this time one of them will affect the PREFIX variable. The Intel64 compiler will be used, and PREFIX will be set to /opt/shared/hdf4/4.2.10-intel64.

Furthermore, specific CFLAGS changes will be made for this compiler. This example also illustrates how the VERSION string can be used: here, we would set additional flags for the ./configure script if the VERSION string were set to 4.2.10-sansnetcdf. These options allow one build file to build multiple versions of a package with only minor changes near the top of the script (namely to the VERSION variable and the init_udbuildenv command line).

Another interesting thing we do here is make sure the installation is as complete as possible. HDF4 does not support shared object files for its Fortran libraries, so first we build the shared objects which are possible, then we enable Fortran and ensure the full complement of .a archive files is present.

python

udbuild-cdf

The python example above is used to install the cdf and related modules as an add-on for python version 3.6.5. It is also considered a complex example, since it uses the python-addon option of init_udbuildenv to initialize the udbuild environment and ensure a python VALET package is loaded, sets PREFIX appropriately for that python version's add-on installation path, and uses the udbuild pip_install function to call pip with the correct PREFIX. To use the python add-on module bundle, you can set up a VALET package. Below is an example VALET YAML package created from the create_valet_template function. It should be trimmed, as szip, hdf4, and hdf5 are all dependencies of netcdf. The dependencies section should look like the vpkg_devrequire calls of the udbuild script.

python-cdf.vpkg_yaml

System Status

There is no need for an opt-in node status notification service on DARWIN because there are no specific nodes assigned to a research workgroup partition for its jobs. All nodes of a particular node type will be available to a workgroup partition if the research group invested in (purchased) that particular node type. For example, if a research group purchased 1 standard node, 1 large memory node with 512 GB, and 1 GPU node, then the workgroup partition for this research group would consist of all standard nodes, all nodes with 512 GB, and all GPU nodes as the available resources to be allocated for its jobs.

Machine Information: UD IT HPC maintains DARWIN machine information, including a database of node attributes, milestones, offline nodes, and nodes disabled for maintenance.

Ganglia Cluster Monitoring: Cluster monitoring for DARWIN uses Ganglia to monitor its hardware components.