1 Introduction
2 Account Administration
3 System Architecture
4 Accessing the System
5 Code of Conduct
6 File Management
7 Software
8 Launching Applications
9 Running Jobs
10 Sample Job Scripts
11 Visualization
12 Containers
13 Help
14 References

Introduction

Fostering Accelerated Scientific Transformations, Education, and Research (FASTER) is a NSF-MRI-funded cluster (award number 2019129) that offers state of the art CPUs, GPUs, and NVMe (Non-Volatile MemoryExpress) based storage in a composable environment. The supercomputer uses an innovative composable software-hardware approach to let a researcher attach GPUs to CPUs depending on their workflow. Unlike traditional cluster architectures, each rack on FASTER hosts a stack of GPUs that are shared with the CPU-hosting nodes. Using a standard Slurm script, a researcher can choose to add 10 GPUs to their CPU-node request. The machine will specifically help researchers using workflows that can benefit from simultaneous access to several GPUs, surpassing accelerator limits imposed on conventional supercomputers.

Figure 1. The FASTER composable cluster hosted by Texas A&M High Performance Research Computing

Account Administration

Setting up Your Account

The computer systems will be available for use free-of-charge to researchers through ACCESS. Access to and use of such systems is permitted only for academic research and instructional activity. All researchers are responsible for knowing and following our policies.

Allocation Information

Allocations are made by granting Service Units (SUs) for projects to Principal Investigators (PIs). SUs are based on node hours and a factor is applied for each GPU used. SUs are consumed on the computing resources by users associated with projects by PIs. Researchers can apply for an allocation via the XRAS process.

Configuring Your Account

The default shell for FASTER (all nodes) is the bash shell. Edit your environment in the startup file, ".bash_profile" in your home directory. This file is read and executed when you login.

System Architecture

FASTER is a 184-node Intel cluster from Dell with an InfiniBand HDR-100 interconnect. NVIDIA A100 GPUs, A10 GPUs, A30 GPUs, A40 GPUs and T4 GPUs are distributed and composable via Liqid PCIe fabrics. All nodes are based on the Intel Ice Lake processor and have 256 GB of memory.

Compute Nodes

Table X. Compute Node Specifications

Login Nodes

Table X. Login Node Specifications

PROCESSOR TYPE	Intel Xeon 8352Y (Ice Lake) 2.20 GHz
COMPUTE NODES:	184
SOCKETS PER NODE:	2
CORES PER SOCKET:	32
CORES PER NODE:	64
HARDWARE THREADS PER CORE:	2
HARDWARE THREADS PER NODE:	128
CLOCK RATE:	2.20GHz (3.40 GHz Max Turbo Frequency)
RAM:	256 GB DDR4-3200
CACHE:	48 MB L3
LOCAL STORAGE:	3.84 TB local disk

Specialized Nodes

GPUs can be added to compute nodes on the fly by using the "gres" option in a Slurm script. A researcher can request up to 10 GPUs to create these CPU-GPU nodes. The following GPUs will be composable to the compute nodes.

200 T4 16GB GPUs
40 A100 40GB GPUs
8 A10 24GB GPUs
4 A30 24GB GPUs
8 A40 48GB GPUs

Data Transfer Nodes

FASTER has two data transfer nodes that can be used to transfer data to FASTER via Globus Connect web interface or Globus command line. Globus Connect Server v5.4 is installed on the data transfer nodes. One data transfer node is dedicated to ACCESS users and its collection is listed as ACCESS TAMU FASTER.

Network

The FASTER system uses Mellanox HDR 100 InfiniBand interconnects.

File Systems

Each researcher has access to a home directory, scratch, and project space for their files. The scratch and project space is intended for active projects and is not backed up. The $HOME, $SCRATCH, and $PROJECT file systems are hosted on DDN Lustre storage with 5 PB of usable capacity and up to 20 GB/s bandwidth. Researchers can purchase space on /scratch by submitting a help-desk ticket.

Table X. FASTER File Systems

NUMBER OF NODES	4
PROCESSOR TYPE	Intel Xeon 8352Y (Ice Lake)
CORES PER NODE	64
MEMORY PER NODE	256 GB

FILE SYSTEM	QUOTA	PURPOSE	BACKUP

FILE SYSTEM	QUOTA	PURPOSE	BACKUP
`$HOME` `/home/userid`	10GB/10,000 files	Home directories for small software, scripts, compiling, editing.	Yes
`$SCRATCH` `/scratch/user/userid`	1TB/250,000 files	Intended for job activity and temporary storage	No
`$PROJECT` `/scratch/group/projectid`	5TB/500,000 files	Not purged while allocation is active. Removed 90 days after allocation expiration	No

The "showquota" command can be used by a researcher to check their disk usage and file quotas on the different filesystems

$ showquota
Your current disk quotas are:
Disk                    Disk Usage     Limit   File Usage      Limit
/home/userid                1.4G     10.0G         3661        10000
/scratch/user/userid        117.6G      1.0T        24226     250000
/scratch/group/projectid    510.5G      5.0T       128523     500000

Accessing the System

FASTER is accessible via the web using the FASTER ACCESS Portal, which is an instance of Open OnDemand. Use your ACCESS ID or other CILogon credentials.

Please visit the Texas A&M FASTER Documentation for additional login instructions.

Code of Conduct

The FASTER environment is shared with hundreds of other researchers. Researchers should ensure that their activity does not does not adversely impact the system and the research community on it.

DO NOT run jobs or intensive computations on the login nodes.
Contact the FASTER team for jobs that run outside the bounds of regular wall times
Don't stress the scheduler with thousands of simultaneous job submissions
To facilitate a faster response to help tickets, please include details such as the Job ID, the time of the incident, path to your job-script, and the location of your files

File Management

Transferring your Files

Globus Connect is recommended for moving files to and from the FASTER cluster. You may also use the standard "scp", "sftp", or "rsync" utilities on the FASTER login node to transfer files.

Sharing Files with Collaborators

Researchers can use Globus Connect to transfer files to One Drive and other applications. Submit a help desk to request shared file spaces if needed.

Software

Common compiler tools like Intel and GCC are available on FASTER.

Easy Build

Software is preferably built and installed on FASTER using the EasyBuild system.

Researchers can request assistance from the Texas A&M HPRC helpdesk to build software as well.

Compiler Toolchains

EasyBuild relies on compiler toolchains. A compiler toolchain is a module consisting of a set of compilers and libraries put together for some specific desired functionality. A popular example is the foss toolchain series that consists of versions of the GCC compiler suite, OpenMPI, BLAS, LAPACK and FFTW that enable software to be compiled and used for serial as well as shared- and distributed-memory parallel applications. An intel toolchain series with the same range of functionality that joins the foss toolchains as the most commonly used toolchains is also available. Either of these toolchains can be extended for use with GPUs via the simple addition of a CUDA module.

Compiler toolchains vary across time as well as across compiler types. There is typically a new compiler chain release for each major new release of a compiler. For example, the foss-2021b chain includes the GCC 11.2.0 compiler, while the foss-2021a chain includes the GCC 10.3.0 compiler. The same is true for the Intel compiler releases, although with their oneAPI consolidation program Intel is presently simultaneously releasing two sets of C/C++/Fortran compilers that will eventually be merged into a single set. Another compiler chain series of note is the NVPHC series released by NVIDIA that has absorbed the former Portland Group compiler set and is being steadily modified to increase performance on NVIDIA GPUs.

Application Optimization

Performance maximization may be achieved by the relatively easy specification of optimal compiler flags to the significantly more complex and difficult instrumentation of source codes with OpenMP, MPI and CUDA commands to allow parallel tasks to be performed. The optimal compiler flags for a given application can be as simple as -fast for the Intel compilers or a series of many obscure and seldom-used flags found by the developer to optimize their application. These flags are typically codified in the software compiling parts - e.g. make, CMake, etc. - of the package infrastructure, which is typically used with no modifications by EasyBuild to build and install the module. Questions about such things are best addressed to the original developer of the package.

Available Software

Search for already installed software on FASTER using the Modules system.

Modules System

The Modules system organizes the multitude of packages we have installed on our clusters so that they can be easily maintained and used. Any software you would like to use on FASTER should use the Modules system.

No modules are loaded by default. The main command necessary for using software is the "module load" command. To load a module, use the following command:

[NetID@faster ~]$ module load packageName

The packageName specification in the "module load" command is case sensitive and it should include a specific version. To find the full name of the module you want to load, use the following command:

[NetID@faster ~]$ module spider packageName

To see a list of available modules, use the "mla" wrapper script:

[NetID@faster ~]$ mla

Launching Applications

The Slurm batch processing system is used on FASTER to ensure the convenient and fair use of the shared resources. See more job submitting details using the Slurm batch scheduler on our the wiki. https://hprc.tamu.edu/wiki/FASTER:Batch

For running user compiled code, we assume a preferred compiler toolchain module has been loaded.

Running OpenMP code

To run OpenMP code, researchers need to set the number of threads that OpenMP regions can use. The following snippet shows a basic example on how to set the number of threads and execute the program (my_omp_prog.x)

[NetID@faster1 ~]$ export OMP_NUM_THREADS=8
[NetID@faster1 ~]$ ./my_omp_prog.x

Running MPI code

To run MPI code, researchers should use an mpi launcher. The following snippet shows a basic example on how to launch an mpi program using mpirun. In this case we launch 8 copies of my_mpi_prog.x)

[NetID@faster1 ~]$ mpirun -np 8 ./my_mpi_prog.x

Running Hybrid MPI/OpenMP code

For code using both MPI and OpenMP, researchers will need to launch the code as a regular MPI program and set the number of threads for the OpenMP regions. The following snippet show a simple example

[NetID@faster1 ~]$ export OMP_NUM_THREADS=4
[NetID@faster1 ~]$ mpirun -np 8 ./my_hybrid_prog.x

In the above example we launch 8 copies of my_hybrid_prog.x. That means there are 8 processes running. Assuming my_hybrid_prog.x has parallel OpenMP regions, every process can use up to 4 threads. The total number of cores used in this case is 24.

Running Jobs

Job Accounting

FASTER allocations are made in Service Units (SUs). A service unit is one hour of wall clock time. Jobs must request whole nodes. Researchers will be charged at the rate of 64 SUs per hour / GPU for each T4 composed on a node. Each A100/A40/A10/A30 GPU accelerator will be charged 128 SUs per hour / GPU.

NODE TYPE	SUS CHARGED PER HOUR (WALL CLOCK)

NODE TYPE	SUS CHARGED PER HOUR (WALL CLOCK)
Compute node	64
Adding a T4 accelerator	64
Adding an A100/A40/A10/A30 accelerator	128

Accessing the Compute Nodes

Jobs are submitted to compute nodes using the Slurm scheduler with the following command:

[NetID@faster1 ~]$ sbatch MyJob.slurm
Submitted batch job 3606

The `tamubatch` Utility

The "tamubatch" utility is an automatic batch job script that submits jobs without the need to write a batch script. The researcher includes the executable commands in a text file, and tamubatch automatically annotates the text file and submits it as a job to the cluster. tamubatch uses default values for the job parameters, and accepts flags to control job parameters.

Visit the tamubatch wiki page for more information.

The `tamulauncher` Utility

The "tamulauncher" utility provides a convenient way to run a large number of serial or multithreaded commands without the need to submit individual jobs or a Slurm Job array. tamulauncher concurrently executes on a text file containing all the commands that need to be run. The number of concurrently executed commands depends on the batch scheduler. In interactive mode, tamulauncher is run interactively; the number of concurrently executed commands is limited to at most 8. There is no need to load any module before using tamulauncher. It is preferred over Job Arrays to submit a large number (thousands) of individual jobs, especially when the run times of the commands are relatively short.

See the tamulauncher wiki page for more information.

Slurm Job Scheduler

FASTER employs the Slurm job scheduler. The resource supports most common features. Some of the prominent ones are described in Table **.

Table X. Basic Slurm Environment Variables

VARIABLE	USAGE	DESCRIPTION

VARIABLE	USAGE	DESCRIPTION
Job ID	`$SLURM_JOBID`	Batch job ID assigned by Slurm.
Job Name	`$SLURM_JOB_NAME`	The name of the Job.
Queue	`$SLURM_JOB_PARTITION`	The name of the queue the job is dispatched from.
Submit Directory	`$SLURM_SUBMIT_DIR`	The directory the job was submitted from.
Temporary Directory	`$TMPDIR`	This is a directory assigned locally on the compute node for the job located at `/work/job.$SLURM_JOBID`. Use of `$TMPDIR` is recommended for jobs that use many small temporary files.

On FASTER, GPUs are requested using the "gres" resource flag in a Slurm script. The following resources (GPUs) can be currently requested via Slurm.

Table X. Composable Settings

Partitions (Queues)

Table. X FASTER Production Queues

1 node: 10x A100 (--gres=gpu:a100:10) = a compute nodes with 10 NVIDIA A100s which can be requested using the --gres=gpu:a100:10 slurm directive

1 node: 6x A100 (--gres=gpu:a100:6)

1 node: 4x A100 (--gres=gpu:a100:4)

11 nodes: 4x T4 (--gres=gpu:tesla_t4:4)

2 nodes: 8x T4 (--gres=gpu:tesla_t4:8)

1 node: 4x A10 (--gres=gpu:a10:4)

2 nodes: 2x A30 (--gres=gpu:a30:2)

2 nodes: 2x A40 (--gres=gpu:a40:2)

1 nodes: 4x A40 (--gres=gpu:a40:4)

Job Management

Jobs are submitted via the Slurm scheduler using the "sbatch" command. After a job has been submitted, you may want to check on its progress or cancel it. Below is a list of the most used job monitoring and control commands for jobs.

QUEUE NAME	MAX NODES PER JOB (ASSOC'D CORES)*	MAX GPUS	MAX DURATION	MAX JOBS IN QUEUE*	CHARGE RATE (PER NODE-HOUR)

QUEUE NAME	MAX NODES PER JOB (ASSOC'D CORES)*	MAX GPUS	MAX DURATION	MAX JOBS IN QUEUE*	CHARGE RATE (PER NODE-HOUR)
development	1 nodes (64 cores)*	10	1 hr	1*	64 Service Unit (SU) + GPUs used
CPU	128 nodes (8,192 cores)*	0	48 hrs	50*	64 Service Unit (SU)
GPU	128 nodes (8,192 cores)*	10	48 hrs	50*	64 Service Unit (SU) + GPUs used

FUNCTION	COMMAND	EXAMPLE

FUNCTION	COMMAND	EXAMPLE
Submit a job	`sbatch [script_file]`	`sbatch FileName.job`
Cancel/Kill a job	`scancel [job_id]`	`scancel 101204`
Check status of a single job	`scancel [job_id]`	`squeue -j 101204`
Check status of all jobs for a user	`squeue -u [user_name]`	`squeue -u someuser`
Check CPU and memory efficiency for a job	`seff [job_id]`	`seff 101204`

Here is an example of the seff command provides for a finished job:

% seff 12345678
Job ID: 12345678
Cluster: faster
User/Group: username/groupname
State: COMPLETED (exit code 0)
Nodes: 16
Cores per node: 28
CPU Utilized: 1-17:05:54
CPU Efficiency: 94.63% of 1-19:25:52 core-walltime
Job Wall-clock time: 00:05:49
Memory Utilized: 310.96 GB (estimated maximum)
Memory Efficiency: 34.70% of 896.00 GB (56.00 GB/node)

Interactive Computing

Researchers can run interactive jobs on FASTER using the TAMU Open OnDemand portal. TAMU OnDemand is a web platform through which users can access HPRC clusters and services with a web browser (Chrome, Firefox, IE, and Safari). All active researchers have access to TAMU OnDemand. To access the portal, researchers should login at the address: https://portal.hprc.tamu.edu.

Sample Job Scripts

The following scripts show how researchers can submit jobs on the FASTER cluster. All scripts are meant for full node utilization, i.e. using all 64 cores and all available memory. Researchers should update their account numbers and email address prior to job submission.

For MPI, OpenMP and hybrid jobs researchers are directed to use the appropriate executable lines in the above examples.

CPU Only

Single Node, Single Core (Serial)

#!/bin/bash

##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example_SNSC_CPU  #Set the job name to "JobExample1"
#SBATCH --time=01:30:00              #Set the wall clock limit to 1hr 30min
#SBATCH --ntasks=1                   #Request 1 task
#SBATCH --mem=2560M                  #Request 2560MB (2.5GB) per node
#SBATCH --output=Example_SNSC_CPU.%j #Redirect stdout/err to file
#SBATCH --partition=cpu              #Specify partition to submit job to

##OPTIONAL JOB SPECIFICATIONS
##SBATCH --account=123456            #Set billing account to 123456
##SBATCH --mail-type=ALL             #Send email on all job events
##SBATCH --mail-user=email_address   #Send all emails to email_address

#First Executable Line

Single Node, Multiple Core

#!/bin/bash

##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example_SNMC_CPU  #Set the job name to Example_SNMC_CPU
#SBATCH --time=01:30:00              #Set the wall clock limit to 1hr 30min
#SBATCH --nodes=1                    #Request 1 node
#SBATCH --ntasks-per-node=64         #Request 64 tasks/cores per node
#SBATCH --mem=248M                   #Request 248G (248GB) per node
#SBATCH --output=Example_SNMC_CPU.%j #Redirect stdout/err to file
#SBATCH --partition=cpu              #Specify partition to submit job to

##OPTIONAL JOB SPECIFICATIONS
##SBATCH --account=123456            #Set billing account to 123456
##SBATCH --mail-type=ALL             #Send email on all job events
##SBATCH --mail-user=email_address   #Send all emails to email_address

#First Executable Line

Multiple Node, Multiple Core

#!/bin/bash

##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example_MNMC_CPU  #Set the job name to Example_MNMC_CPU
#SBATCH --time=01:30:00              #Set the wall clock limit to 1hr 30min
#SBATCH --nodes=2                    #Request 2 nodes
#SBATCH --ntasks-per-node=64         #Request 64 tasks/cores per node
#SBATCH --mem=248G                   #Request 248G (248GB) per node
#SBATCH --output=Example_MNMC_CPU.%j #Redirect stdout/err to file
#SBATCH --partition=cpu              #Specify partition to submit job to

##OPTIONAL JOB SPECIFICATIONS
##SBATCH --account=123456            #Set billing account to 123456
##SBATCH --mail-type=ALL             #Send email on all job events
##SBATCH --mail-user=email_address   #Send all emails to email_address

#First Executable Line

CPU & GPU

The following example demonstrate how a researcher can submit jobs on single and multiple GPUs using the "gres" flag in a Slurm script. The "gpu" queue is specified in these scripts.

Single Node, Single Core

#!/bin/bash

##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example_SNSC_GPU  #Set the job name to Example_SNSC_GPU
#SBATCH --time=01:30:00              #Set the wall clock limit to 1hr 30min
#SBATCH --ntasks=1                   #Request 1 task
#SBATCH --mem=248G                   #Request 248G (248GB) per node
#SBATCH --output=Example_SNSC_GPU.%j #Redirect stdout/err to file
#SBATCH --partition=gpu              #Specify partition to submit job to
#SBATCH --gres=gpu:a100:1            #Specify GPU(s) per node, 1 A100 GPU

##OPTIONAL JOB SPECIFICATIONS
##SBATCH --account=123456            
#Set billing account to 123456
##SBATCH --mail-type=ALL             #Send email on all job events
##SBATCH --mail-user=email_address   #Send all emails to email_address

#First Executable Line

Single Node, Multiple Core

#!/bin/bash

##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example_SNMC_GPU  #Set the job name to Example_SNMC_GPU
#SBATCH --time=01:30:00              #Set the wall clock limit to 1hr 30min
#SBATCH --nodes=1                    #Request 1 nodes
#SBATCH --ntasks-per-node=64         #Request 64 tasks/cores per node
#SBATCH --mem=248G                   #Request 248G (248GB) per node
#SBATCH --output=Example_SNMC_GPU.%j #Redirect stdout/err to file
#SBATCH --partition=gpu              #Specify partition to submit job to
#SBATCH --gres=gpu:a100:10           #Specify GPU(s) per node, 10 t4 GPU

##OPTIONAL JOB SPECIFICATIONS
##SBATCH --account=123456            #Set billing account to 123456
##SBATCH --mail-type=ALL             #Send email on all job events
##SBATCH --mail-user=email_address   #Send all emails to email_address

#First Executable Line

Multiple Node, Multiple Core

#!/bin/bash

##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example_MNMC_GPU  #Set the job name to Example_MNMC_GPU
#SBATCH --time=01:30:00              #Set the wall clock limit to 1hr 30min
#SBATCH --nodes=2                    #Request 2 nodes
#SBATCH --ntasks-per-node=64         #Request 64 tasks/cores per node
#SBATCH --mem=248G                   #Request 248G (248GB) per node
#SBATCH --output=Example_MNMC_GPU.%j #Redirect stdout/err to file
#SBATCH --partition=gpu              #Specify partition to submit job to
#SBATCH --gres=gpu:a100:1            #Specify GPU(s) per node, 1 A100 gpu

##OPTIONAL JOB SPECIFICATIONS
##SBATCH --account=123456            #Set billing account to 123456
##SBATCH --mail-type=ALL             #Send email on all job events
##SBATCH --mail-user=email_address   #Send all emails to email_address

#First Executable Line

Visualization

Researchers can remotely visualize data by launching a VNC job through the TAMU OnDemand web portal. You will be taken to the portal's homepage, then at the top, select 'Interactive Apps' and then 'VNC'. Fill in the appropriate job parameters and then launch the job.

Running applications with graphic user interface (GUI) on FASTER can be done through X11 forwarding. Applications that require OpenGL 3D rendering will experience big delays since large amounts of graphic data need to be sent over the network to be rendered on your local machine. An alternative way of running such applications is through remote visualization, an approach that utilizes VNC and VirtualGL to run graphic applications remotely.

Containers

Containers are supported through the Singularity runtime engine. The singularity executable is available on compute nodes, but not on login nodes. Container workloads tend to be too intense for the shared login nodes.

Example: Pulling a container from a registry:

srun --ntasks=1 --mem=2560M --time=01:00:00 --pty bash -i
cd $SCRATCH
export SINGULARITY_CACHEDIR=$SCRATCH/.singularity
module load WebProxy
singularity pull hello-world.sif docker://hello-world

Example: Executing a command within a container:

srun --ntasks=1 --mem=2560M --time=01:00:00 --pty bash -i
cd $SCRATCH
singularity exec

Researchers can learn more about Singularity runtime on their own documentation site: SingularityCE Documentation Hub

Containers also are supported through the Charliecloud runtime engine. The Charliecloud executables are available through the module system.

module load Charliecloud
ch-image pull hello docker://hello-world
ch-convert hello hello.sqfs

Researchers can learn more about Charliecloud runtime on their own documentation site: https://hpc.github.io/charliecloud/index.html

Help

Contact us via email at help@hprc.tamu.edu.

To facilitate a faster response, please include details such as the Job ID, the time of the incident, path to your job-script, and the location of your files.

References

Introduction to FASTER
FASTER Quick Start Guide
tamubatch man page
tamulauncher man page

FASTER - TAMU

The tamubatch Utility

The tamulauncher Utility

The `tamubatch` Utility

The `tamulauncher` Utility