FASTER - TAMU
Introduction
Fostering Accelerated Scientific Transformations, Education, and Research (FASTER) is an NSF MRI-funded cluster (award number 2019129) that offers state-of-the-art CPUs, GPUs, and NVMe (Non-Volatile Memory Express) based storage in a composable environment. The supercomputer uses an innovative composable software-hardware approach to let a researcher attach GPUs to CPUs depending on their workflow. Unlike traditional cluster architectures, each rack on FASTER hosts a stack of GPUs that are shared with the CPU-hosting nodes. Using a standard Slurm script, a researcher can choose to add up to 10 GPUs to their CPU-node request. The machine will specifically help researchers using workflows that can benefit from simultaneous access to several GPUs, surpassing accelerator limits imposed on conventional supercomputers.
Figure 1. The FASTER composable cluster hosted by Texas A&M High Performance Research Computing
Account Administration
Setting up Your Account
The computer systems will be available for use free-of-charge to researchers through ACCESS. Access to and use of such systems is permitted only for academic research and instructional activity. All researchers are responsible for knowing and following our policies.
Allocation Information
Allocations are made by granting Service Units (SUs) for projects to Principal Investigators (PIs). SUs are based on node hours and a factor is applied for each GPU used. SUs are consumed on the computing resources by users associated with projects by PIs. Researchers can apply for an allocation via the XRAS process.
Configuring Your Account
The default shell for FASTER (all nodes) is the bash shell. Edit your environment in the startup file ".bash_profile" in your home directory. This file is read and executed when you log in.
System Architecture
FASTER is a 184-node Intel cluster from Dell with an InfiniBand HDR-100 interconnect. NVIDIA A100 GPUs, A10 GPUs, A30 GPUs, A40 GPUs and T4 GPUs are distributed and composable via Liqid PCIe fabrics. All nodes are based on the Intel Ice Lake processor and have 256 GB of memory.
Compute Nodes
Table 1. Compute Node Specifications
PROCESSOR TYPE: | Intel Xeon 8352Y (Ice Lake) |
---|---|
COMPUTE NODES: | 184 |
SOCKETS PER NODE: | 2 |
CORES PER SOCKET: | 32 |
CORES PER NODE: | 64 |
HARDWARE THREADS PER CORE: | 2 |
HARDWARE THREADS PER NODE: | 128 |
CLOCK RATE: | 2.20 GHz (3.40 GHz Max Turbo Frequency) |
RAM: | 256 GB DDR4-3200 |
CACHE: | 48 MB L3 |
LOCAL STORAGE: | 3.84 TB local disk |
Login Nodes
Table 2. Login Node Specifications
NUMBER OF NODES | 4 |
---|---|
PROCESSOR TYPE | Intel Xeon 8352Y (Ice Lake) |
CORES PER NODE | 64 |
MEMORY PER NODE | 256 GB |
Specialized Nodes
GPUs can be added to compute nodes on the fly by using the "gres" option in a Slurm script. A researcher can request up to 10 GPUs to create these CPU-GPU nodes. The following GPUs can be composed onto the compute nodes (see the example directive after the list).
200 T4 16GB GPUs
40 A100 40GB GPUs
8 A10 24GB GPUs
4 A30 24GB GPUs
8 A40 48GB GPUs
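As a sketch, a job that needs four A100 GPUs composed onto its node could include a directive like the following in its batch script (the GPU type and count are illustrative; see Table 5 for the currently composable settings):
#SBATCH --gres=gpu:a100:4    # compose four A100 GPUs onto the allocated compute node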
Data Transfer Nodes
FASTER has two data transfer nodes that can be used to transfer data to FASTER via the Globus Connect web interface or the Globus command line. Globus Connect Server v5.4 is installed on the data transfer nodes. One data transfer node is dedicated to ACCESS users, and its collection is listed as ACCESS TAMU FASTER.
Network
The FASTER system uses Mellanox HDR 100 InfiniBand interconnects.
File Systems
Each researcher has access to a home directory, scratch, and project space for their files. The scratch and project space is intended for active projects and is not backed up. The $HOME, $SCRATCH, and $PROJECT file systems are hosted on DDN Lustre storage with 5 PB of usable capacity and up to 20 GB/s bandwidth. Researchers can purchase space on /scratch by submitting a help-desk ticket.
Table 3. FASTER File Systems
FILE SYSTEM | QUOTA | PURPOSE | BACKUP |
---|---|---|---|
$HOME | 10 GB / 10,000 files | Home directories for small software, scripts, compiling, editing. | Yes |
$SCRATCH | 1 TB / 250,000 files | Intended for job activity and temporary storage. | No |
$PROJECT | 5 TB / 500,000 files | Not purged while the allocation is active; removed 90 days after allocation expiration. | No |
The "showquota
" command can be used by a researcher to check their disk usage and file quotas on the different filesystems
$ showquota
Your current disk quotas are:
Disk Disk Usage Limit File Usage Limit
/home/userid 1.4G 10.0G 3661 10000
/scratch/user/userid 117.6G 1.0T 24226 250000
/scratch/group/projectid 510.5G 5.0T 128523 500000
Accessing the System
FASTER is accessible via the web using the FASTER ACCESS Portal, which is an instance of Open OnDemand. Use your ACCESS ID or other CILogon credentials.
Please visit the Texas A&M FASTER Documentation for additional login instructions.
Code of Conduct
The FASTER environment is shared with hundreds of other researchers. Researchers should ensure that their activity does not adversely impact the system and the research community on it.
DO NOT run jobs or intensive computations on the login nodes.
Contact the FASTER team for jobs that run outside the bounds of regular wall times.
Don't stress the scheduler with thousands of simultaneous job submissions.
To facilitate a faster response to help tickets, please include details such as the Job ID, the time of the incident, the path to your job script, and the location of your files.
File Management
Transferring your Files
Globus Connect is recommended for moving files to and from the FASTER cluster. You may also use the standard "scp", "sftp", or "rsync" utilities on the FASTER login node to transfer files.
Sharing Files with Collaborators
Researchers can use Globus Connect to transfer files to OneDrive and other applications. Submit a help-desk ticket to request shared file spaces if needed.
Software
Common compiler tools like Intel and GCC are available on FASTER.
EasyBuild
Software is preferably built and installed on FASTER using the EasyBuild system.
Researchers can request assistance from the Texas A&M HPRC helpdesk to build software as well.
Compiler Toolchains
EasyBuild relies on compiler toolchains. A compiler toolchain is a module consisting of a set of compilers and libraries put together for some specific desired functionality. A popular example is the foss toolchain series, which consists of versions of the GCC compiler suite, OpenMPI, BLAS, LAPACK, and FFTW that enable software to be compiled and used for serial as well as shared- and distributed-memory parallel applications. An intel toolchain series with the same range of functionality is also available; together, the foss and intel series are the most commonly used toolchains. Either of these toolchains can be extended for use with GPUs via the simple addition of a CUDA module.
Compiler toolchains vary across time as well as across compiler types. There is typically a new toolchain release for each major new release of a compiler. For example, the foss-2021b chain includes the GCC 11.2.0 compiler, while the foss-2021a chain includes the GCC 10.3.0 compiler. The same is true for the Intel compiler releases, although with their oneAPI consolidation program Intel is presently releasing two sets of C/C++/Fortran compilers simultaneously that will eventually be merged into a single set. Another toolchain series of note is the NVHPC series released by NVIDIA, which has absorbed the former Portland Group compiler set and is being steadily modified to increase performance on NVIDIA GPUs.
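For instance, a researcher can load a toolchain module before compiling. The sketch below assumes a foss/2021b module is installed on FASTER; use module spider to confirm the exact module name and version:
[NetID@faster ~]$ module spider foss
[NetID@faster ~]$ module load foss/2021b
[NetID@faster ~]$ gcc --version    # the foss-2021b toolchain provides GCC 11.2.0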
Application Optimization
Approaches to performance maximization range from the relatively easy specification of optimal compiler flags to the significantly more complex and difficult instrumentation of source code with OpenMP, MPI, and CUDA constructs that allow parallel tasks to be performed. The optimal compiler flags for a given application can be as simple as -fast for the Intel compilers, or a series of many obscure and seldom-used flags found by the developer to optimize their application. These flags are typically codified in the software compiling parts - e.g. make, CMake, etc. - of the package infrastructure, which is typically used with no modifications by EasyBuild to build and install the module. Questions about such things are best addressed to the original developer of the package.
Available Software
Search for already installed software on FASTER using the Modules system.
Modules System
The Modules system organizes the multitude of packages we have installed on our clusters so that they can be easily maintained and used. Any software you would like to use on FASTER should use the Modules system.
No modules are loaded by default. The main command necessary for using software is the "module load" command. To load a module, use the following command:
[NetID@faster ~]$ module load packageName
The packageName specification in the "module load" command is case sensitive, and it should include a specific version. To find the full name of the module you want to load, use the following command:
[NetID@faster ~]$ module spider packageName
To see a list of available modules, use the "mla" wrapper script:
Launching Applications
The Slurm batch processing system is used on FASTER to ensure the convenient and fair use of the shared resources. See the wiki for more details on submitting jobs with the Slurm batch scheduler: https://hprc.tamu.edu/wiki/FASTER:Batch
For running user-compiled code, we assume a preferred compiler toolchain module has been loaded.
Running OpenMP code
To run OpenMP code, researchers need to set the number of threads that OpenMP regions can use. The following snippet shows a basic example of how to set the number of threads and execute the program (my_omp_prog.x):
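# a minimal sketch; the thread count of 8 is illustrative
export OMP_NUM_THREADS=8
./my_omp_prog.x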
Running MPI code
To run MPI code, researchers should use an MPI launcher. The following snippet shows a basic example of how to launch an MPI program using mpirun; in this case we launch 8 copies of my_mpi_prog.x:
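# a minimal sketch; launch 8 copies of the MPI program
mpirun -np 8 ./my_mpi_prog.x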
Running Hybrid MPI/OpenMP code
For code using both MPI and OpenMP, researchers will need to launch the code as a regular MPI program and set the number of threads for the OpenMP regions. The following snippet shows a simple example:
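# a minimal sketch; each MPI process may use up to 4 OpenMP threads
export OMP_NUM_THREADS=4
mpirun -np 8 ./my_hybrid_prog.x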
In the above example we launch 8 copies of my_hybrid_prog.x. That means there are 8 processes running. Assuming my_hybrid_prog.x has parallel OpenMP regions, every process can use up to 4 threads. The total number of cores used in this case is 32.
Running Jobs
Job Accounting
FASTER allocations are made in Service Units (SUs). A service unit is one hour of wall clock time. Jobs must request whole nodes and are charged 64 SUs per node per hour. In addition, each T4 GPU composed onto a node is charged 64 SUs per hour, and each A100/A40/A10/A30 GPU is charged 128 SUs per hour.
NODE TYPE | SUS CHARGED PER HOUR (WALL CLOCK) |
---|---|
Compute node | 64 |
Adding a T4 accelerator | 64 |
Adding an A100/A40/A10/A30 accelerator | 128 |
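For example, under these rates a job that runs for 2 hours on one compute node with two A100 GPUs composed onto it would be charged 2 × (64 + 2 × 128) = 640 SUs.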
Accessing the Compute Nodes
Jobs are submitted to compute nodes using the Slurm scheduler with the following command:
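[NetID@faster ~]$ sbatch MyJob.slurm    # MyJob.slurm is a placeholder for your batch script file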
The tamubatch Utility
The "tamubatch" utility is an automatic batch job script that submits jobs without the need to write a batch script. The researcher includes the executable commands in a text file, and tamubatch automatically annotates the text file and submits it as a job to the cluster. tamubatch uses default values for the job parameters and accepts flags to control job parameters.
Visit the tamubatch wiki page for more information.
The tamulauncher Utility
The "tamulauncher" utility provides a convenient way to run a large number of serial or multithreaded commands without the need to submit individual jobs or a Slurm job array. tamulauncher concurrently executes the commands listed in a text file; the number of concurrently executed commands depends on the batch scheduler. tamulauncher can also be run interactively, in which case the number of concurrently executed commands is limited to at most 8. There is no need to load any module before using tamulauncher. It is preferred over job arrays for submitting a large number (thousands) of individual jobs, especially when the run times of the commands are relatively short.
See the tamulauncher wiki page for more information.
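A minimal usage sketch, assuming the commands to run are listed one per line in a text file (the file name commands.in is illustrative):
tamulauncher commands.in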
Slurm Job Scheduler
FASTER employs the Slurm job scheduler. The resource supports most common Slurm features; some of the prominent environment variables are described in Table 4.
Table 4. Basic Slurm Environment Variables
VARIABLE | USAGE | DESCRIPTION |
---|---|---|
Job ID | $SLURM_JOBID | Batch job ID assigned by Slurm. |
Job Name | $SLURM_JOB_NAME | The name of the job. |
Queue | $SLURM_JOB_PARTITION | The name of the queue (partition) the job is dispatched from. |
Submit Directory | $SLURM_SUBMIT_DIR | The directory the job was submitted from. |
Temporary Directory | $TMPDIR | A directory assigned locally on the compute node for the job. |
On FASTER, GPUs are requested using the "gres" resource flag in a Slurm script. The following resources (GPUs) can currently be requested via Slurm.
Table 5. Composable Settings
1 node: 10x A100 (--gres=gpu:a100:10) |
1 node: 6x A100 (--gres=gpu:a100:6) |
1 node: 4x A100 (--gres=gpu:a100:4) |
11 nodes: 4x T4 (--gres=gpu:tesla_t4:4) |
2 nodes: 8x T4 (--gres=gpu:tesla_t4:8) |
1 node: 4x A10 (--gres=gpu:a10:4) |
2 nodes: 2x A30 (--gres=gpu:a30:2) |
2 nodes: 2x A40 (--gres=gpu:a40:2) |
1 node: 4x A40 (--gres=gpu:a40:4) |
For example, the first entry describes a compute node with 10 NVIDIA A100s, which can be requested with the --gres=gpu:a100:10 Slurm directive.
Partitions (Queues)
Table 6. FASTER Production Queues
QUEUE NAME | MAX NODES PER JOB | MAX GPUS | MAX DURATION | MAX JOBS IN QUEUE* | CHARGE RATE |
---|---|---|---|---|---|
development | 1 node | 10 | 1 hr | 1* | 64 Service Units (SUs) + GPUs used |
cpu | 128 nodes | 0 | 48 hrs | 50* | 64 Service Units (SUs) |
gpu | 128 nodes | 10 | 48 hrs | 50* | 64 Service Units (SUs) + GPUs used |
Job Management
Jobs are submitted via the Slurm scheduler using the "sbatch" command. After a job has been submitted, you may want to check on its progress or cancel it. Below is a list of the most used job monitoring and control commands.
FUNCTION | COMMAND | EXAMPLE |
---|---|---|
Submit a job | sbatch [script_file] | sbatch MyJob.slurm |
Cancel/Kill a job | scancel [job_id] | scancel 123456 |
Check status of a single job | squeue --job [job_id] | squeue --job 123456 |
Check status of all jobs for a user | squeue -u [user_name] | squeue -u UserNetID |
Check CPU and memory efficiency for a job | seff [job_id] | seff 123456 |
The output of the seff command summarizes the CPU and memory efficiency of a finished job.
Interactive Computing
Researchers can run interactive jobs on FASTER using the TAMU Open OnDemand portal. TAMU OnDemand is a web platform through which users can access HPRC clusters and services with a web browser (Chrome, Firefox, IE, and Safari). All active researchers have access to TAMU OnDemand. To access the portal, researchers should log in at https://portal.hprc.tamu.edu.
Sample Job Scripts
The following scripts show how researchers can submit jobs on the FASTER cluster. All scripts are meant for full node utilization, i.e. using all 64 cores and all available memory. Researchers should update their account numbers and email address prior to job submission.
For MPI, OpenMP and hybrid jobs researchers are directed to use the appropriate executable lines in the above examples.
CPU Only
Single Node, Single Core (Serial)
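A minimal sketch of a serial job script; the job name, wall time, memory, output file, account number, e-mail address, and executable are illustrative placeholders:
#!/bin/bash
#SBATCH --job-name=serial_job          # illustrative job name
#SBATCH --time=01:00:00                # requested wall time (HH:MM:SS)
#SBATCH --nodes=1                      # request one whole node
#SBATCH --ntasks-per-node=1            # a single task on the node
#SBATCH --mem=248G                     # memory per node (assumed value leaving room for the OS)
#SBATCH --output=serial_job.%j         # stdout/stderr file; %j expands to the job ID
#SBATCH --account=ACCOUNTNUMBER        # replace with your allocation account number
#SBATCH --mail-type=ALL                # e-mail on job begin, end, and failure
#SBATCH --mail-user=user@example.edu   # replace with your e-mail address

./my_serial_prog.x                     # replace with your serial executable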
Single Node, Multiple Core
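A sketch of a single-node, multi-core job that uses all 64 cores of a node through OpenMP; values are illustrative:
#!/bin/bash
#SBATCH --job-name=omp_job             # illustrative job name
#SBATCH --time=01:00:00                # requested wall time (HH:MM:SS)
#SBATCH --nodes=1                      # one whole node
#SBATCH --ntasks-per-node=1            # a single task...
#SBATCH --cpus-per-task=64             # ...using all 64 cores of the node
#SBATCH --mem=248G                     # memory per node (assumed value)
#SBATCH --output=omp_job.%j
#SBATCH --account=ACCOUNTNUMBER        # replace with your allocation account number
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.edu   # replace with your e-mail address

export OMP_NUM_THREADS=64              # one OpenMP thread per core
./my_omp_prog.x                        # replace with your executable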
Multiple Node, Multiple Core
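A sketch of a multi-node MPI job across two whole nodes (128 cores in total); values are illustrative:
#!/bin/bash
#SBATCH --job-name=mpi_job             # illustrative job name
#SBATCH --time=01:00:00                # requested wall time (HH:MM:SS)
#SBATCH --nodes=2                      # two whole nodes
#SBATCH --ntasks-per-node=64           # one MPI rank per core
#SBATCH --mem=248G                     # memory per node (assumed value)
#SBATCH --output=mpi_job.%j
#SBATCH --account=ACCOUNTNUMBER        # replace with your allocation account number
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.edu   # replace with your e-mail address

mpirun -np 128 ./my_mpi_prog.x         # 2 nodes x 64 cores = 128 ranks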
CPU & GPU
The following examples demonstrate how a researcher can submit jobs on single and multiple GPUs using the "gres" flag in a Slurm script. The "gpu" queue is specified in these scripts.
Single Node, Single Core
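A sketch of a single-core job with one GPU composed onto the node; the GPU type, count, and other values are illustrative:
#!/bin/bash
#SBATCH --job-name=gpu_serial_job      # illustrative job name
#SBATCH --time=01:00:00                # requested wall time (HH:MM:SS)
#SBATCH --nodes=1                      # one whole node
#SBATCH --ntasks-per-node=1            # a single task on the node
#SBATCH --mem=248G                     # memory per node (assumed value)
#SBATCH --partition=gpu                # the gpu queue
#SBATCH --gres=gpu:a100:1              # compose one A100 onto the node (type and count are illustrative)
#SBATCH --output=gpu_serial_job.%j
#SBATCH --account=ACCOUNTNUMBER        # replace with your allocation account number
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.edu   # replace with your e-mail address

./my_gpu_prog.x                        # replace with your GPU-enabled executable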
Single Node, Multiple Core
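A sketch of a single-node job using all 64 cores with GPUs composed onto the node; values are illustrative:
#!/bin/bash
#SBATCH --job-name=gpu_omp_job         # illustrative job name
#SBATCH --time=01:00:00                # requested wall time (HH:MM:SS)
#SBATCH --nodes=1                      # one whole node
#SBATCH --ntasks-per-node=1            # a single task...
#SBATCH --cpus-per-task=64             # ...using all 64 cores of the node
#SBATCH --mem=248G                     # memory per node (assumed value)
#SBATCH --partition=gpu                # the gpu queue
#SBATCH --gres=gpu:a100:4              # compose four A100s onto the node (type and count are illustrative)
#SBATCH --output=gpu_omp_job.%j
#SBATCH --account=ACCOUNTNUMBER        # replace with your allocation account number
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.edu   # replace with your e-mail address

export OMP_NUM_THREADS=64              # one OpenMP thread per core
./my_gpu_prog.x                        # replace with your GPU-enabled executable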
Multiple Node, Multiple Core
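A sketch of a multi-node MPI job with GPUs composed onto each node; values are illustrative:
#!/bin/bash
#SBATCH --job-name=gpu_mpi_job         # illustrative job name
#SBATCH --time=01:00:00                # requested wall time (HH:MM:SS)
#SBATCH --nodes=2                      # two whole nodes
#SBATCH --ntasks-per-node=64           # one MPI rank per core
#SBATCH --mem=248G                     # memory per node (assumed value)
#SBATCH --partition=gpu                # the gpu queue
#SBATCH --gres=gpu:a100:4              # four A100s composed onto each node (type and count are illustrative)
#SBATCH --output=gpu_mpi_job.%j
#SBATCH --account=ACCOUNTNUMBER        # replace with your allocation account number
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.edu   # replace with your e-mail address

mpirun -np 128 ./my_gpu_mpi_prog.x     # 2 nodes x 64 cores = 128 ranks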
Visualization
Researchers can remotely visualize data by launching a VNC job through the TAMU OnDemand web portal. After logging in, you will be taken to the portal's homepage; at the top, select 'Interactive Apps' and then 'VNC'. Fill in the appropriate job parameters and then launch the job.
Running applications with graphic user interface (GUI) on FASTER can be done through X11 forwarding. Applications that require OpenGL 3D rendering will experience big delays since large amounts of graphic data need to be sent over the network to be rendered on your local machine. An alternative way of running such applications is through remote visualization, an approach that utilizes VNC and VirtualGL to run graphic applications remotely.
Containers
Containers are supported through the Singularity runtime engine. The singularity executable is available on compute nodes, but not on login nodes. Container workloads tend to be too intense for the shared login nodes.
Example: Pulling a container from a registry:
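# a minimal sketch; the ubuntu:22.04 image is illustrative
singularity pull docker://ubuntu:22.04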
Example: Executing a command within a container:
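# a minimal sketch; runs one command inside the image pulled above
singularity exec ubuntu_22.04.sif cat /etc/os-release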
Researchers can learn more about the Singularity runtime on its documentation site: SingularityCE Documentation Hub
Containers also are supported through the Charliecloud runtime engine. The Charliecloud executables are available through the module system.
Researchers can learn more about the Charliecloud runtime on its documentation site: Charliecloud Documentation
Help
Contact us via email at help@hprc.tamu.edu.
To facilitate a faster response, please include details such as the Job ID, the time of the incident, path to your job-script, and the location of your files.
References
tamubatch man page
tamulauncher man page