FASTER - TAMU

Introduction

Fostering Accelerated Scientific Transformations, Education, and Research (FASTER) is an NSF MRI-funded cluster (award number 2019129) that offers state-of-the-art CPUs, GPUs, and NVMe (Non-Volatile Memory Express) based storage in a composable environment. The supercomputer uses an innovative composable software-hardware approach that lets a researcher attach GPUs to CPUs depending on their workflow. Unlike traditional cluster architectures, each rack on FASTER hosts a stack of GPUs that are shared with the CPU-hosting nodes. Using a standard Slurm script, a researcher can add up to 10 GPUs to their CPU-node request. The machine will specifically help researchers whose workflows benefit from simultaneous access to several GPUs, surpassing the accelerator limits imposed on conventional supercomputers.

Figure 1. The FASTER composable cluster hosted by Texas A&M High Performance Research Computing

Account Administration

Setting up Your Account

The computer systems will be available for use free of charge to researchers through ACCESS. Access to and use of such systems is permitted only for academic research and instructional activity. All researchers are responsible for knowing and following our policies.

Allocation Information

Allocations are made by granting Service Units (SUs) to Principal Investigators (PIs) for their projects. SUs are based on node-hours, with an additional factor applied for each GPU used. SUs are consumed on the computing resources by the users associated with a PI's project. Researchers can apply for an allocation via the XRAS process.

Configuring Your Account

The default shell for FASTER (all nodes) is the bash shell. Edit your environment in the startup file, ".bash_profile", in your home directory. This file is read and executed when you log in.
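
For example, a researcher might add a few lines like the following to ".bash_profile" (the module name and version are illustrative; load whatever you routinely use):

# load frequently used modules at every login (module name/version illustrative)
module load GCC/11.2.0
# put a personal bin directory on the search path
export PATH=$HOME/bin:$PATH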

System Architecture

FASTER is a 184-node Intel cluster from Dell with an InfiniBand HDR-100 interconnect. NVIDIA A100 GPUs, A10 GPUs, A30 GPUs, A40 GPUs and T4 GPUs are distributed and composable via Liqid PCIe fabrics. All nodes are based on the Intel Ice Lake processor and have 256 GB of memory.

Compute Nodes

Table X. Compute Node Specifications

Processor Type: Intel Xeon 8352Y (Ice Lake), 2.20 GHz
Compute Nodes: 184
Sockets per Node: 2
Cores per Socket: 32
Cores per Node: 64
Hardware Threads per Core: 2
Hardware Threads per Node: 128
Clock Rate: 2.20 GHz (3.40 GHz max turbo frequency)
RAM: 256 GB DDR4-3200
Cache: 48 MB L3
Local Storage: 3.84 TB local disk

Login Nodes

Table X. Login Node Specifications

Number of Nodes: 4
Processor Type: Intel Xeon 8352Y (Ice Lake)
Cores per Node: 64
Memory per Node: 256 GB

Specialized Nodes

GPUs can be added to compute nodes on the fly by using the "gres" option in a Slurm script. A researcher can request up to 10 GPUs to create these CPU-GPU nodes. The following GPUs are composable onto the compute nodes; a minimal request example follows the list.

  • 200 T4 16GB GPUs

  • 40 A100 40GB GPUs

  • 8 A10 24GB GPUs

  • 4 A30 24GB GPUs

  • 8 A40 48GB GPUs
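
As a minimal sketch, the following Slurm directives would compose two A100 GPUs onto a requested node (the GPU type, count, and queue name are illustrative; see the composable settings and queue tables below for valid combinations):

#SBATCH --partition=gpu          # GPU-capable queue
#SBATCH --gres=gpu:a100:2        # compose two NVIDIA A100s onto the node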

Data Transfer Nodes

FASTER has two data transfer nodes that can be used to transfer data to FASTER via the Globus Connect web interface or the Globus command line. Globus Connect Server v5.4 is installed on the data transfer nodes. One data transfer node is dedicated to ACCESS users, and its collection is listed as "ACCESS TAMU FASTER".

Network

The FASTER system uses Mellanox HDR 100 InfiniBand interconnects.

File Systems

Each researcher has access to a home directory, scratch, and project space for their files. The scratch and project spaces are intended for active projects and are not backed up. The $HOME, $SCRATCH, and $PROJECT file systems are hosted on DDN Lustre storage with 5 PB of usable capacity and up to 20 GB/s of bandwidth. Researchers can purchase space on /scratch by submitting a help-desk ticket.

Table X. FASTER File Systems

FILE SYSTEM | QUOTA | PURPOSE | BACKUP
$HOME (/home/userid) | 10 GB / 10,000 files | Home directories for small software, scripts, compiling, editing | Yes
$SCRATCH (/scratch/user/userid) | 1 TB / 250,000 files | Intended for job activity and temporary storage | No
$PROJECT (/scratch/group/projectid) | 5 TB / 500,000 files | Not purged while the allocation is active; removed 90 days after allocation expiration | No

The "showquota" command can be used by a researcher to check their disk usage and file quotas on the different filesystems

$ showquota
Your current disk quotas are:
Disk                         Disk Usage   Limit   File Usage    Limit
/home/userid                       1.4G   10.0G         3661    10000
/scratch/user/userid             117.6G    1.0T        24226   250000
/scratch/group/projectid         510.5G    5.0T       128523   500000

Accessing the System

FASTER is accessible via the web using the FASTER ACCESS Portal, which is an instance of Open OnDemand. Use your ACCESS ID or other CILogon credentials.

Please visit the Texas A&M FASTER Documentation for additional login instructions.

Code of Conduct

The FASTER environment is shared with hundreds of other researchers. Researchers should ensure that their activity does not adversely impact the system and the research community that depends on it.

  • DO NOT run jobs or intensive computations on the login nodes.

  • Contact the FASTER team for jobs that need to run outside the bounds of regular wall times.

  • Do not stress the scheduler with thousands of simultaneous job submissions.

  • To facilitate a faster response to help tickets, please include details such as the Job ID, the time of the incident, the path to your job script, and the location of your files.

File Management

Transferring your Files

Globus Connect is recommended for moving files to and from the FASTER cluster. You may also use the standard "scp", "sftp", or "rsync" utilities on the FASTER login node to transfer files.
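
For example, the following commands, run from a researcher's local machine, copy a file to and synchronize a directory with the FASTER scratch space (the hostname, paths, and file names are illustrative; see the FASTER documentation for the exact login address):

scp mydata.dat NetID@faster.hprc.tamu.edu:/scratch/user/NetID/
rsync -av results/ NetID@faster.hprc.tamu.edu:/scratch/user/NetID/results/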

Sharing Files with Collaborators

Researchers can use Globus Connect to transfer files to OneDrive and other applications. Submit a help-desk ticket to request shared file spaces if needed.

Software

Common compiler tools like Intel and GCC are available on FASTER.

EasyBuild

Software is preferably built and installed on FASTER using the EasyBuild system.

Researchers can request assistance from the Texas A&M HPRC helpdesk to build software as well.

Compiler Toolchains

EasyBuild relies on compiler toolchains. A compiler toolchain is a module consisting of a set of compilers and libraries put together for some specific desired functionality. A popular example is the foss toolchain series, which consists of versions of the GCC compiler suite, OpenMPI, BLAS, LAPACK, and FFTW that enable software to be compiled and used for serial as well as shared- and distributed-memory parallel applications. An intel toolchain series with the same range of functionality is also available; the foss and intel series are the most commonly used toolchains. Either of these toolchains can be extended for use with GPUs by the simple addition of a CUDA module.

Compiler toolchains vary across time as well as across compiler types. There is typically a new toolchain release for each major new release of a compiler. For example, the foss-2021b toolchain includes the GCC 11.2.0 compiler, while the foss-2021a toolchain includes the GCC 10.3.0 compiler. The same is true for the Intel compiler releases, although with its oneAPI consolidation program Intel is presently releasing two sets of C/C++/Fortran compilers simultaneously that will eventually be merged into a single set. Another toolchain series of note is the NVHPC series released by NVIDIA, which has absorbed the former Portland Group compiler set and is being steadily modified to increase performance on NVIDIA GPUs.
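
As a quick sketch, a researcher could locate and load a toolchain release like this (the version shown is illustrative; use "module spider" to see what is actually installed):

[NetID@faster ~]$ module spider foss
[NetID@faster ~]$ module load foss/2021b
[NetID@faster ~]$ gcc --version    # reports the GCC release bundled with the toolchain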

Application Optimization

Approaches to maximizing performance range from the relatively easy specification of optimal compiler flags to the significantly more complex and difficult instrumentation of source code with OpenMP, MPI, and CUDA constructs that allow tasks to be performed in parallel. The optimal compiler flags for a given application can be as simple as -fast for the Intel compilers, or a series of many obscure and seldom-used flags found by the developer to optimize their application. These flags are typically codified in the software-compiling parts - e.g. make, CMake, etc. - of the package infrastructure, which EasyBuild typically uses without modification to build and install the module. Questions about such things are best addressed to the original developer of the package.

Available Software

Search for already installed software on FASTER using the Modules system.

Modules System

The Modules system organizes the multitude of packages we have installed on our clusters so that they can be easily maintained and used. Any software you would like to use on FASTER should be accessed through the Modules system.

No modules are loaded by default. The main command necessary for using software is the "module load" command. To load a module, use the following command:

[NetID@faster ~]$ module load packageName

The packageName specification in the "module load" command is case sensitive and it should include a specific version. To find the full name of the module you want to load, use the following command:

[NetID@faster ~]$ module spider packageName

To see a list of available modules, use the "mla" wrapper script:
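
[NetID@faster ~]$ mla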

Launching Applications

The Slurm batch processing system is used on FASTER to ensure the convenient and fair use of the shared resources. See the HPRC wiki for more details on submitting jobs with the Slurm batch scheduler: https://hprc.tamu.edu/wiki/FASTER:Batch

For running user-compiled code, we assume a preferred compiler toolchain module has been loaded.

Running OpenMP code

To run OpenMP code, researchers need to set the number of threads that OpenMP regions can use. The following snippet shows a basic example of how to set the number of threads and execute the program (my_omp_prog.x).
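
(The thread count shown is illustrative.)

export OMP_NUM_THREADS=8    # threads available to OpenMP parallel regions
./my_omp_prog.x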

Running MPI code

To run MPI code, researchers should use an MPI launcher. The following snippet shows a basic example of how to launch an MPI program using mpirun. In this case, we launch 8 copies of my_mpi_prog.x.
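
mpirun -np 8 ./my_mpi_prog.x    # start 8 MPI processes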

Running Hybrid MPI/OpenMP code

For code using both MPI and OpenMP, researchers will need to launch the code as a regular MPI program and set the number of threads for the OpenMP regions. The following snippet shows a simple example.
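
export OMP_NUM_THREADS=4           # threads per MPI process
mpirun -np 8 ./my_hybrid_prog.x    # start 8 MPI processes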

In the above example we launch 8 copies of my_hybrid_prog.x, which means there are 8 processes running. Assuming my_hybrid_prog.x has parallel OpenMP regions, every process can use up to 4 threads, so the total number of cores used in this case is 32.

Running Jobs

Job Accounting

FASTER allocations are made in Service Units (SUs). A service unit corresponds to one hour of wall-clock time. Jobs must request whole nodes. A compute node is charged at 64 SUs per hour; in addition, each T4 GPU composed onto a node is charged 64 SUs per hour, and each A100/A40/A10/A30 GPU is charged 128 SUs per hour.
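
For example, a one-hour job on a single compute node with two A100 GPUs composed onto it would be charged 64 + 2 × 128 = 320 SUs.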

NODE TYPE | SUS CHARGED PER HOUR (WALL CLOCK)
Compute node | 64
Adding a T4 accelerator | 64
Adding an A100/A40/A10/A30 accelerator | 128

Accessing the Compute Nodes

Jobs are submitted to compute nodes using the Slurm scheduler with the following command:
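
[NetID@faster ~]$ sbatch MyJob.slurm

Here "MyJob.slurm" is a placeholder for the researcher's own job script.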

The tamubatch Utility

The "tamubatch" utility is an automatic batch job script that submits jobs without the need to write a batch script. The researcher includes the executable commands in a text file, and tamubatch automatically annotates the text file and submits it as a job to the cluster. tamubatch uses default values for the job parameters, and accepts flags to control job parameters.

Visit the tamubatch wiki page for more information.

The tamulauncher Utility

The "tamulauncher" utility provides a convenient way to run a large number of serial or multithreaded commands without the need to submit individual jobs or a Slurm Job array. tamulauncher concurrently executes on a text file containing all the commands that need to be run. The number of concurrently executed commands depends on the batch scheduler. In interactive mode, tamulauncher is run interactively; the number of concurrently executed commands is limited to at most 8. There is no need to load any module before using tamulauncher. It is preferred over Job Arrays to submit a large number (thousands) of individual jobs, especially when the run times of the commands are relatively short.

See the tamulauncher wiki page for more information.

Slurm Job Scheduler

FASTER employs the Slurm job scheduler. The resource supports most common Slurm features; some of the prominent ones are described in the tables below.

Table X. Basic Slurm Environment Variables

VARIABLE | USAGE | DESCRIPTION
Job ID | $SLURM_JOBID | Batch job ID assigned by Slurm.
Job Name | $SLURM_JOB_NAME | The name of the job.
Queue | $SLURM_JOB_PARTITION | The name of the queue the job is dispatched from.
Submit Directory | $SLURM_SUBMIT_DIR | The directory the job was submitted from.
Temporary Directory | $TMPDIR | A directory assigned locally on the compute node for the job, located at /work/job.$SLURM_JOBID. Use of $TMPDIR is recommended for jobs that use many small temporary files.

On FASTER, GPUs are requested using the "gres" resource flag in a Slurm script. The following resources (GPUs) can currently be requested via Slurm.

Table X. Composable Settings

1 node: 10x A100 (--gres=gpu:a100:10)
1 node: 6x A100 (--gres=gpu:a100:6)
1 node: 4x A100 (--gres=gpu:a100:4)
11 nodes: 4x T4 (--gres=gpu:tesla_t4:4)
2 nodes: 8x T4 (--gres=gpu:tesla_t4:8)
1 node: 4x A10 (--gres=gpu:a10:4)
2 nodes: 2x A30 (--gres=gpu:a30:2)
2 nodes: 2x A40 (--gres=gpu:a40:2)
1 node: 4x A40 (--gres=gpu:a40:4)

For example, "1 node: 10x A100 (--gres=gpu:a100:10)" means one compute node can be composed with 10 NVIDIA A100s, requested using the --gres=gpu:a100:10 Slurm directive.

Partitions (Queues)

Table X. FASTER Production Queues

QUEUE NAME | MAX NODES PER JOB (ASSOC'D CORES)* | MAX GPUS | MAX DURATION | MAX JOBS IN QUEUE* | CHARGE RATE (PER NODE-HOUR)
development | 1 node (64 cores)* | 10 | 1 hr | 1* | 64 Service Units (SUs) + GPUs used
CPU | 128 nodes (8,192 cores)* | 0 | 48 hrs | 50* | 64 Service Units (SUs)
GPU | 128 nodes (8,192 cores)* | 10 | 48 hrs | 50* | 64 Service Units (SUs) + GPUs used

Job Management

Jobs are submitted via the Slurm scheduler using the "sbatch" command. After a job has been submitted, you may want to check on its progress or cancel it. Below is a list of the most commonly used job monitoring and control commands.

FUNCTION | COMMAND | EXAMPLE
Submit a job | sbatch [script_file] | sbatch FileName.job
Cancel/Kill a job | scancel [job_id] | scancel 101204
Check status of a single job | squeue -j [job_id] | squeue -j 101204
Check status of all jobs for a user | squeue -u [user_name] | squeue -u someuser
Check CPU and memory efficiency for a job | seff [job_id] | seff 101204

Here is an example of the output the "seff" command provides for a finished job:
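
The sketch below is representative; the exact fields and values depend on the job.

Job ID: 101204
Cluster: faster
User/Group: someuser/somegroup
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 2-03:12:00
CPU Efficiency: 80.00% of 2-16:00:00 core-walltime
Job Wall-clock time: 01:00:00
Memory Utilized: 40.00 GB
Memory Efficiency: 16.00% of 250.00 GB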

Interactive Computing

Researchers can run interactive jobs on FASTER using the TAMU Open OnDemand portal. TAMU OnDemand is a web platform through which users can access HPRC clusters and services with a web browser (Chrome, Firefox, IE, and Safari). All active researchers have access to TAMU OnDemand. To access the portal, researchers should login at the address: https://portal.hprc.tamu.edu.

Sample Job Scripts

The following scripts show how researchers can submit jobs on the FASTER cluster. All scripts are meant for full node utilization, i.e. using all 64 cores and all available memory. Researchers should update their account numbers and email address prior to job submission.

For MPI, OpenMP, and hybrid jobs, researchers are directed to use the appropriate executable lines from the examples in the Launching Applications section above.

CPU Only

Single Node, Single Core (Serial)
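
A minimal sketch of a serial job script is shown below. The wall time, job name, account number, email address, and executable name are placeholders to be replaced with the researcher's own values.

#!/bin/bash
#SBATCH --job-name=serial_example     # job name shown in the queue
#SBATCH --time=01:00:00               # requested wall-clock time (hh:mm:ss)
#SBATCH --nodes=1                     # request one node
#SBATCH --ntasks=1                    # run a single task (one core)
#SBATCH --account=ACCOUNTNUMBER       # replace with your account number
#SBATCH --mail-type=ALL               # email on job begin, end, and failure
#SBATCH --mail-user=netid@tamu.edu    # replace with your email address

./my_serial_prog.x                    # replace with your executable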

Single Node, Multiple Core

Multiple Node, Multiple Core

CPU & GPU

The following examples demonstrate how a researcher can submit jobs using single and multiple GPUs with the "gres" flag in a Slurm script. The "gpu" queue is specified in these scripts.

Single Node, Single Core
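
A minimal sketch for a single-core job with one composed GPU is shown below; the GPU type and count, wall time, account number, email address, and executable name are placeholders.

#!/bin/bash
#SBATCH --job-name=gpu_example        # job name shown in the queue
#SBATCH --time=01:00:00               # requested wall-clock time (hh:mm:ss)
#SBATCH --nodes=1                     # request one node
#SBATCH --ntasks=1                    # run a single task (one core)
#SBATCH --partition=gpu               # the "gpu" queue
#SBATCH --gres=gpu:a100:1             # compose one NVIDIA A100 onto the node (type/count illustrative)
#SBATCH --account=ACCOUNTNUMBER       # replace with your account number
#SBATCH --mail-user=netid@tamu.edu    # replace with your email address

./my_gpu_prog.x                       # replace with your executable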

Single Node, Multiple Core

Multiple Node, Multiple Core

Visualization

Researchers can remotely visualize data by launching a VNC job through the TAMU OnDemand web portal. After logging in, you will be taken to the portal's homepage; at the top, select 'Interactive Apps' and then 'VNC'. Fill in the appropriate job parameters and then launch the job.

Running applications with a graphical user interface (GUI) on FASTER can be done through X11 forwarding. Applications that require OpenGL 3D rendering will experience significant delays, since large amounts of graphics data need to be sent over the network to be rendered on your local machine. An alternative way of running such applications is remote visualization, an approach that uses VNC and VirtualGL to run graphical applications remotely.

Containers

Containers are supported through the Singularity runtime engine. The singularity executable is available on compute nodes, but not on login nodes. Container workloads tend to be too intense for the shared login nodes.

Example: Pulling a container from a registry:
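
A minimal sketch (the image name and tag are illustrative; run this on a compute node, since singularity is not available on the login nodes):

singularity pull ubuntu.sif docker://ubuntu:22.04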

Example: Executing a command within a container:
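
Continuing the sketch above, this runs a single command inside the pulled image:

singularity exec ubuntu.sif cat /etc/os-release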

Researchers can learn more about the Singularity runtime on its documentation site: SingularityCE Documentation Hub

Containers also are supported through the Charliecloud runtime engine. The Charliecloud executables are available through the module system.

Researchers can learn more about the Charliecloud runtime on its documentation site: Charliecloud documentation

Help

Contact us via email at help@hprc.tamu.edu.

To facilitate a faster response, please include details such as the Job ID, the time of the incident, path to your job-script, and the location of your files.
