Stampede-2 - TACC

Notices

  • Stampede2 has deployed 224 Intel "Ice Lake" (ICX) compute nodes, replacing 448 KNL compute nodes. Each ICX node has 80 cores on 2 sockets (40 cores/socket). Hyperthreading is enabled: there are two hardware threads per core, for a total of 80 x 2 = 160 hardware threads per node. See ICX Compute Node specifications, new ICX job scripts, and the new icx-normal queue for more information. (03/09/22)

  • All users: refer to updated Remote Desktop Access instructions. (07/20/2021)

  • All users: read Managing I/O on TACC Resources. TACC Staff have put forth new file system and job submission guidelines. (01/09/20)

  • The Intel 18 compiler has replaced Intel 17 as the default compiler on Stampede2. The Intel 17 compiler and software stack are still available to those who load the appropriate modules explicitly. See Intel 18 to Become New Default Compiler on Stampede2 for more information. (02/26/19)

  • In order to balance queue wait times, the charge rate for all KNL queues has been adjusted to 0.8 SUs per node-hour. The charge rate for the SKX queues remains at 1 SU. (01/14/19)

  • Stampede2's Knights Landing (KNL) compute nodes each have 68 cores, and each core has 4 hardware threads. But it may not be a good idea to use all 272 hardware threads simultaneously, and it's certainly not the first thing you should try. In most cases it's best to specify no more than 64-68 MPI tasks or independent processes per node, and 1-2 threads/core. See Best Known Practices… for more information.

  • Stampede2's Skylake (SKX) compute nodes each have 48 cores on two sockets (24 cores/socket). Hyperthreading is enabled: there are two hardware threads per core, for a total of 48 x 2 = 96 hardware threads per node. See Table 2 for more information. Note that SKX nodes have their own queues.

Figure 1. Stampede2 System

Introduction

Stampede2, generously funded by the National Science Foundation (NSF) through award ACI-1540931, is one of the flagship supercomputers at the Texas Advanced Computing Center (TACC), University of Texas at Austin. Stampede2 entered full production in Fall 2017 as an 18-petaflop national resource that builds on the successes of the original Stampede system it replaces. The first phase of the Stampede2 rollout featured the second generation of processors based on Intel's Many Integrated Core (MIC) architecture. Stampede2's 4,200 Knights Landing (KNL) nodes represent a radical break with the first-generation Knights Corner (KNC) MIC coprocessor. Unlike the legacy KNC, a Stampede2 KNL is not a coprocessor: each 68-core KNL is a stand-alone, self-booting processor that is the sole processor in its node. Phase 2 added a total of 1,736 Intel Xeon Skylake (SKX) nodes to Stampede2. The final phase of Stampede2 replaced 448 KNL nodes with 224 Ice Lake nodes.

System Overview

KNL Compute Nodes

Stampede2 hosts 4,200 KNL compute nodes, including 504 KNL nodes that were formerly configured as a Stampede1 sub-system.

Each of Stampede2's KNL nodes includes 96GB of traditional DDR4 Random Access Memory (RAM). They also feature an additional 16GB of high bandwidth, on-package memory known as Multi-Channel Dynamic Random Access Memory (MCDRAM) that is up to four times faster than DDR4. The KNL's memory is configurable in two important ways: there are BIOS settings that determine at boot time the processor's memory mode and cluster mode. The processor's memory mode determines whether the fast MCDRAM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. The cluster mode determines the mechanisms for achieving cache coherency, which in turn determines latency: roughly speaking, this mode specifies the degree to which some memory addresses are "closer" to some cores than to others. See "Programming and Performance: KNL" below for a top-level description of these and other available memory and cluster modes.

Table 1. Stampede2 KNL Compute Node Specifications

Model: Intel Xeon Phi 7250 ("Knights Landing")
Total cores per KNL node: 68 cores on a single socket
Hardware threads per core: 4
Hardware threads per node: 68 x 4 = 272
Clock rate: 1.4GHz
RAM: 96GB DDR4 plus 16GB high-speed MCDRAM. Configurable in two important ways; see "Programming and Performance: KNL" for more info.
Cache: 32KB L1 data cache per core; 1MB L2 per two-core tile. In default config, MCDRAM operates as 16GB direct-mapped L3.
Local storage: All but 504 KNL nodes have a 107GB /tmp partition on a 200GB Solid State Drive (SSD). The 504 KNLs originally installed as the Stampede1 KNL sub-system each have a 32GB /tmp partition on 112GB SSDs. The latter nodes currently make up the development, long and flat-quadrant queues. Size of /tmp partitions as of 24 Apr 2018.

SKX Compute Nodes

Stampede2 hosts 1,736 SKX compute nodes.

Table 2. Stampede2 SKX Compute Node Specifications

Model: Intel Xeon Platinum 8160 ("Skylake")
Total cores per SKX node: 48 cores on two sockets (24 cores/socket)
Hardware threads per core: 2
Hardware threads per node: 48 x 2 = 96
Clock rate: 2.1GHz nominal (1.4-3.7GHz depending on instruction set and number of active cores)
RAM: 192GB (2.67GHz) DDR4
Cache: 32KB L1 data cache per core; 1MB L2 per core; 33MB L3 per socket. Each socket can cache up to 57MB (sum of L2 and L3 capacity).
Local storage: 144GB /tmp partition on a 200GB SSD. Size of /tmp partition as of 14 Nov 2017.

ICX Compute Nodes

Stampede2 hosts 224 ICX compute nodes.

Table 2a. Stampede2 ICX Compute Node Specifications

Model: Intel Xeon Platinum 8380 ("Ice Lake")
Total cores per ICX node: 80 cores on two sockets (40 cores/socket)
Hardware threads per core: 2
Hardware threads per node: 80 x 2 = 160
Clock rate: 2.3GHz nominal (3.4GHz max frequency depending on instruction set and number of active cores)
RAM: 256GB (3.2GHz) DDR4
Cache: 48KB L1 data cache per core; 1.25MB L2 per core; 60MB L3 per socket. Each socket can cache up to 110MB (sum of L2 and L3 capacity).
Local storage: 342GB /tmp partition

Login Nodes

The Stampede2 login nodes, upgraded at the start of Phase 2, are Intel Xeon Gold 6132 (SKX) nodes, each with 28 cores on two sockets (14 cores/socket). They replace the decommissioned Broadwell login nodes used during Phase 1.

Network

The interconnect is a 100Gb/sec Intel Omni-Path (OPA) network with a fat tree topology employing six core switches. There is one leaf switch for each 28-node half rack, each with 20 leaf-to-core uplinks (28/20 oversubscription).

File Systems Introduction

Stampede2 mounts three shared Lustre file systems on which each user has corresponding account-specific directories $HOME, $WORK, and $SCRATCH. Each file system is available from all Stampede2 nodes; the Stockyard-hosted work file system is available on most other TACC HPC systems as well. See Navigating the Shared File Systems for detailed information as well as the Good Citizenship file system guidelines.

Table 3. Stampede2 File Systems

FILE SYSTEM: $HOME
QUOTA: 10GB, 200,000 files
KEY FEATURES:
  • Not intended for parallel or high-intensity file operations.
  • Backed up regularly.
  • Overall capacity ~1PB. Two Meta-Data Servers (MDS), four Object Storage Targets (OSTs).
  • Defaults: 1 stripe, 1MB stripe size.
  • Not purged.

FILE SYSTEM: $WORK
QUOTA: 1TB, 3,000,000 files across all TACC systems, regardless of where on the file system the files reside.
KEY FEATURES:
  • Not intended for high-intensity file operations or jobs involving very large files.
  • On the Global Shared File System that is mounted on most TACC systems.
  • See Stockyard system description for more information.
  • Defaults: 1 stripe, 1MB stripe size.
  • Not backed up.
  • Not purged.

FILE SYSTEM: $SCRATCH
QUOTA: no quota
KEY FEATURES:
  • Overall capacity ~30PB. Four MDSs, 66 OSTs.
  • Defaults: 1 stripe, 1MB stripe size.
  • Not backed up.
  • Files are subject to purge if access time* is more than 10 days old.

Scratch File System Purge Policy

The $SCRATCH file system, as its name indicates, is a temporary storage space. Files that have not been accessed* in ten days are subject to purge. Deliberately modifying file access time (using any method, tool, or program) for the purpose of circumventing purge policies is prohibited.

*The operating system updates a file's access time when that file is modified on a login or compute node or any time that file is read. Reading or executing a file/script will update the access time. Use the "ls -ul" command to view access times.
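As a rough sketch of how you might spot files that are approaching the purge window, the function below uses find's "-atime" test to list regular files whose access time is older than a threshold. The function name and the seven-day threshold are illustrative choices, not TACC tooling. Walking a very large directory tree generates metadata traffic of its own, so point it at one project directory rather than all of $SCRATCH.

```shell
# Hypothetical helper (not a TACC utility): list regular files whose access
# time is older than a threshold, so you can copy them elsewhere before a purge.
# find's "-atime +N" matches files last accessed more than N full days ago
# (fractional days are discarded, so +6 means seven days or more).
stale_files() {
    find "$1" -type f -atime +"${2:-6}"
}

# Example (myproject is a placeholder directory name):
#   stale_files "$SCRATCH/myproject" 6
```

Listing files this way only reads their metadata; it does not change their access times.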

Accessing the System

Access to all TACC systems now requires Multi-Factor Authentication (MFA). You can create an MFA pairing on the TACC User Portal. After login on the portal, go to your account profile (Home->Account Profile), then click the "Manage" button under "Multi-Factor Authentication" on the right side of the page. See Multi-Factor Authentication at TACC for further information.

Secure Shell (SSH)

The "ssh" command (SSH protocol) is the standard way to connect to Stampede2. SSH also includes support for the file transfer utilities scp and sftp. Wikipedia is a good source of information on SSH; the ACCESS Manage SSH Keys pages also provide detailed, ACCESS-specific help.

SSH is available within Linux and from the Terminal app in macOS. If you are using Windows, you will need an SSH client that supports the SSH-2 protocol: e.g. Bitvise, OpenSSH, MobaXterm, PuTTY, or SecureCRT. Initiate a session using the ssh command or the equivalent; from the Linux command line the launch command looks like this:

localhost$ ssh myusername@stampede2.tacc.utexas.edu

The above command will rotate connections across all available login nodes and route your connection to one of them. To connect to a specific login node, use its full domain name:

localhost$ ssh myusername@login2.stampede2.tacc.utexas.edu

To connect with X11 support on Stampede2 (usually required for applications with graphical user interfaces), use the "-X" or "-Y" switch:

localhost$ ssh -X myusername@stampede2.tacc.utexas.edu

Use your TACC password, not your ACCESS password, for direct logins to TACC resources. You can change your TACC password through the TACC User Portal. Log into the portal, then select "Change Password" under the "HOME" tab. If you've forgotten your password, go to the TACC User Portal home page and select "Password Reset" under the Home tab.

To report a connection problem, execute the ssh command with the "-vvv" option and include the verbose output when submitting a help ticket.

Do not run the "ssh-keygen" command on Stampede2. This command will create and configure a key pair that will interfere with the execution of job scripts in the batch system. If you do this by mistake, you can recover by renaming or deleting the .ssh directory located in your home directory; the system will automatically generate a new one for you when you next log into Stampede2.

  1. execute "mv .ssh dot.ssh.old"

  2. log out

  3. log into Stampede2 again

After logging in again the system will generate a properly configured key pair.

Using Stampede2

Stampede2 nodes run Red Hat Enterprise Linux 7. Regardless of your research workflow, you'll need to master Linux basics and a Linux-based text editor (e.g. emacs, nano, gedit, or vi/vim) to use the system properly. This user guide does not address these topics, however. There are numerous resources in a variety of formats that are available to help you learn Linux, including some listed on the TACC and ACCESS Support sites. If you encounter a term or concept in this user guide that is new to you, a quick internet search should help you resolve the matter quickly.

Configuring Your Account

Linux Shell

The default login shell for your user account is Bash. To determine your current login shell, execute:

$ echo $SHELL

If you'd like to change your login shell to csh, sh, tcsh, or zsh, submit a ticket through the TACC or ACCESS Support portal. The "chsh" ("change shell") command will not work on TACC systems.

When you start a shell on Stampede2, system-level startup files initialize your account-level environment and aliases before the system sources your own user-level startup scripts. You can use these startup scripts to customize your shell by defining your own environment variables, aliases, and functions. These scripts (e.g. .profile and .bashrc) are generally hidden files: so-called dotfiles that begin with a period, visible when you execute: "ls -a".

Before editing your startup files, however, it's worth taking the time to understand the basics of how your shell manages startup. Bash startup behavior is very different from the simpler csh behavior, for example. The Bash startup sequence varies depending on how you start the shell (e.g. using ssh to open a login shell, executing the "bash" command to begin an interactive shell, or launching a script to start a non-interactive shell). Moreover, Bash does not automatically source your .bashrc when you start a login shell by using ssh to connect to a node. Unless you have specialized needs, however, this is undoubtedly more flexibility than you want: you will probably want your environment to be the same regardless of how you start the shell. The easiest way to achieve this is to execute "source ~/.bashrc" from your ".profile", then put all your customizations in ".bashrc". The system-generated default startup scripts demonstrate this approach. We recommend that you use these default files as templates.

For more information see the Bash Users' Startup Files: Quick Start Guide and other online resources that explain shell startup. To recover the originals that appear in a newly created account, execute "/usr/local/startup_scripts/install_default_scripts".

Environment Variables

Your environment includes the environment variables and functions defined in your current shell: those initialized by the system, those you define or modify in your account-level startup scripts, and those defined or modified by the modules that you load to configure your software environment. Be sure to distinguish between an environment variable's name (e.g. HISTSIZE) and its value ($HISTSIZE). Understand as well that a sub-shell (e.g. a script) inherits environment variables from its parent, but does not inherit ordinary shell variables or aliases. Use export (in Bash) or setenv (in csh) to define an environment variable.
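The difference between an ordinary shell variable and an exported environment variable is easy to see with a short experiment. This is a minimal sketch; the variable names plain_var and EXPORTED_VAR are made up for illustration.

```shell
# An ordinary shell variable stays in the current shell;
# an exported variable is passed on to child processes.
plain_var="visible only in this shell"
export EXPORTED_VAR="visible in child shells"

# The single quotes stop the parent shell from expanding the variables,
# so the child shell prints only what it actually inherited.
sh -c 'echo "plain_var=[${plain_var:-}] EXPORTED_VAR=[${EXPORTED_VAR:-}]"'
# prints: plain_var=[] EXPORTED_VAR=[visible in child shells]
```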

Execute the "env" command to see the environment variables that define the way your shell and child shells behave.

Pipe the results of env into grep to focus on specific environment variables. For example, to see all environment variables that contain the string GIT (in all caps), execute:

$ env | grep GIT

The environment variables PATH and LD_LIBRARY_PATH are especially important. PATH is a colon-separated list of directory paths that determines where the system looks for your executables. LD_LIBRARY_PATH is a similar list that determines where the system looks for shared libraries.
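Because PATH is searched from left to right, the order of its entries matters. The sketch below prints the entries one per line and prepends a directory; the function name path_entries and the directory $HOME/bin are only examples.

```shell
# Print each entry of a colon-separated list on its own line, in search order.
# With no argument, the current PATH is used.
path_entries() {
    printf '%s\n' "${1:-$PATH}" | tr ':' '\n'
}

# Prepend a directory so that its executables are found first.
PATH="$HOME/bin:$PATH"
export PATH

path_entries    # the first line printed is now $HOME/bin
```

The same function works for LD_LIBRARY_PATH: path_entries "$LD_LIBRARY_PATH".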

Account-Level Diagnostics

TACC's sanitytool module loads an account-level diagnostic package that detects common account-level issues and often walks you through the fixes. You should certainly run the package's sanitycheck utility when you encounter unexpected behavior. You may also want to run sanitycheck periodically as preventive maintenance. To run sanitytool's account-level diagnostics, execute the following commands:

login1$ module load sanitytool
login1$ sanitycheck

Execute "module help sanitytool" for more information.

Accessing the Compute Nodes

You connect to Stampede2 through one of four "front-end" login nodes. The login nodes are shared resources: at any given time, there are many users logged into each of these login nodes, each preparing to access the "back-end" compute nodes (Figure 2. Login and Compute Nodes). What you do on the login nodes affects other users directly because you are competing for the same memory and processing power. This is the reason you should not run your applications on the login nodes or otherwise abuse them. Think of the login nodes as a prep area where you can manage files and compile code before accessing the compute nodes to perform research computations. See Good Citizenship for more information.

You can use your command-line prompt, or the "hostname" command, to tell you whether you are on a login node or a compute node. The default prompt, or any custom prompt containing "\h", displays the short form of the hostname (e.g. c401-064). The hostname for a Stampede2 login node begins with the string "login" (e.g. login2.stampede2.tacc.utexas.edu), while compute node hostnames begin with the character "c" (e.g. c401-064.stampede2.tacc.utexas.edu). Note that the default prompts on the compute nodes include the node type (knl, skx or icx) as well. The environment variable TACC_NODE_TYPE, defined only on the compute nodes, also displays the node type. The simplified prompts in the User Guide examples are shorter than Stampede2's actual default prompts.
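A script can apply the same hostname convention to decide where it is running. The function below is a minimal sketch with a hypothetical name (node_kind); it relies only on the "login" and "c" prefixes described above, so its answer is meaningful only on Stampede2.

```shell
# Classify a hostname using the naming convention described above:
# login nodes start with "login", compute nodes start with "c".
node_kind() {
    case "$1" in
        login*) echo "login" ;;
        c*)     echo "compute" ;;
        *)      echo "unknown" ;;
    esac
}

node_kind "login2.stampede2.tacc.utexas.edu"     # prints: login
node_kind "c401-064.stampede2.tacc.utexas.edu"   # prints: compute
node_kind "$(uname -n)"                          # the machine you are on
```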

While some workflows, tools, and applications hide the details, there are three basic ways to access the compute nodes:

  1. Submit a batch job using the sbatch command. This directs the scheduler to run the job unattended when there are resources available. Until your batch job begins it will wait in a queue. You do not need to remain connected while the job is waiting or executing. See Running Jobs for more information. Note that the scheduler does not start jobs on a first come, first served basis; it juggles many variables to keep the machine busy while balancing the competing needs of all users. The best way to minimize wait time is to request only the resources you really need: the scheduler will have an easier time finding a slot for the two hours you need than for the 48 hours you unnecessarily request.

  2. Begin an interactive session using idev or srun. This will log you into a compute node and give you a command prompt there, where you can issue commands and run code as if you were doing so on your personal machine. An interactive session is a great way to develop, test, and debug code. When you request an interactive session, the scheduler submits a job on your behalf. You will need to remain logged in until the interactive session begins.

  3. Begin an interactive session using ssh to connect to a compute node on which you are already running a job. This is a good way to open a second window into a node so that you can monitor a job while it runs.

Be sure to request computing resources that are consistent with the type of application(s) you are running:

  • A serial (non-parallel) application can only make use of a single core on a single node, and will only see that node's memory.

  • A threaded program (e.g. one that uses OpenMP) employs a shared memory programming model and is also restricted to a single node, but the program's individual threads can run on multiple cores on that node.

  • An MPI (Message Passing Interface) program can exploit the distributed computing power of multiple nodes: it launches multiple copies of its executable (MPI tasks, each assigned unique IDs called ranks) that can communicate with each other across the network. The tasks on a given node, however, can only directly access the memory on that node. Depending on the program's memory requirements, it may not be possible to run a task on every core of every node assigned to your job. If it appears that your MPI job is running out of memory, try launching it with fewer tasks per node to increase the amount of memory available to individual tasks.

  • A popular type of parameter sweep (sometimes called high throughput computing) involves submitting a job that simultaneously runs many copies of one serial or threaded application, each with its own input parameters ("Single Program Multiple Data", or SPMD). The "launcher" tool is designed to make it easy to submit this type of job. For more information:

    $ module load launcher
    $ module help launcher
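The memory trade-off for MPI jobs described above is simple arithmetic: the fewer tasks you place on a node, the larger each task's share of that node's RAM. The sketch below computes that share from the node RAM figures in the compute node specification tables. The function name is hypothetical, and the result is an upper bound because the operating system also uses some memory.

```shell
# Upper bound on memory per MPI task: node RAM divided by tasks per node.
# Arguments: <node RAM in GB> <tasks per node>
mem_per_task_gb() {
    awk -v ram="$1" -v tasks="$2" 'BEGIN { printf "%.1f\n", ram / tasks }'
}

mem_per_task_gb 192 48   # SKX, one task per core       -> 4.0
mem_per_task_gb 192 24   # SKX, one task per two cores  -> 8.0
mem_per_task_gb 96 68    # KNL, one task per core       -> 1.4
```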
Figure 2. Login and compute nodes

Using Modules to Manage your Environment

Lmod, a module system developed and maintained at TACC, makes it easy to manage your environment so you have access to the software packages and versions that you need to conduct your research. This is especially important on a system like Stampede2 that serves thousands of users with an enormous range of needs. Loading a module amounts to choosing a specific package from among available alternatives:

$ module load intel          # load the default Intel compiler
$ module load intel/17.0.4   # load a specific version of Intel compiler

A module does its job by defining or modifying environment variables (and sometimes aliases and functions). For example, a module may prepend appropriate paths to $PATH and $LD_LIBRARY_PATH so that the system can find the executables and libraries associated with a given software package. The module creates the illusion that the system is installing software for your personal use. Unloading a module reverses these changes and creates the illusion that the system just uninstalled the software:

$ module load ddt     # defines DDT-related env vars; modifies others
$ module unload ddt   # undoes changes made by load

The module system does more, however. When you load a given module, the module system can automatically replace or deactivate modules to ensure the packages you have loaded are compatible with each other. In the example below, the module system automatically unloads one compiler when you load another, and replaces Intel-compatible versions of IMPI and PETSc with versions compatible with gcc:

$ module load intel   # load default version of Intel compiler
$ module load petsc   # load default version of PETSc
$ module load gcc     # change compiler

Lmod is automatically replacing "intel/17.0.4" with "gcc/7.1.0".

Due to MODULEPATH changes, the following have been reloaded:
  1) impi/17.0.3     2) petsc/3.7

On Stampede2, modules generally adhere to a TACC naming convention when defining environment variables that are helpful for building and running software. For example, the "papi" module defines TACC_PAPI_BIN (the path to PAPI executables), TACC_PAPI_LIB (the path to PAPI libraries), TACC_PAPI_INC (the path to PAPI include files), and TACC_PAPI_DIR (top-level PAPI directory). After loading a module, here are some easy ways to observe its effects:

$ module show papi     # see what this module does to your environment
$ env | grep PAPI      # see env vars that contain the string PAPI
$ env | grep -i papi   # case-insensitive search for 'papi' in environment
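One practical use of these variables is to build compiler flags without hard-coding installation paths. The helper below is a hypothetical sketch, not part of the module system: it assumes the papi module has set TACC_PAPI_INC and TACC_PAPI_LIB as described above and that you link against the PAPI library with -lpapi.

```shell
# Hypothetical helper: turn the papi module's variables into compiler flags.
# TACC_PAPI_INC and TACC_PAPI_LIB are expected to be set by "module load papi".
papi_build_flags() {
    if [ -z "${TACC_PAPI_INC:-}" ] || [ -z "${TACC_PAPI_LIB:-}" ]; then
        echo "TACC_PAPI_INC or TACC_PAPI_LIB is not set; load the papi module first" >&2
        return 1
    fi
    printf '%s\n' "-I$TACC_PAPI_INC -L$TACC_PAPI_LIB -lpapi"
}

# Typical use after loading the module (mycode.c is a placeholder):
#   icc mycode.c $(papi_build_flags) -o mycode
```

Because the flags come from the environment, the same build command keeps working when you switch to a different version of the module.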

To see the modules you currently have loaded:

$ module list

To see all modules that you can load right now because they are compatible with the currently loaded modules:

$ module avail

To see all installed modules, even if they are not currently available because they are incompatible with your currently loaded modules:

$ module spider # list all modules, even those not available to load

To filter your search:

$ module spider slep             # all modules with names containing 'slep'
$ module spider sundials/2.5.0   # additional details on a specific module

Among other things, the latter command will tell you which modules you need to load before the module is available to load. You might also search for modules that are tagged with a keyword related to your needs (though your success here depends on the diligence of the module writers). For example:

$ module keyword performance

You can save a collection of modules as a personal default collection that will load every time you log into Stampede2. To do so, load the modules you want in your collection, then execute:

$ module save # save the currently loaded collection of modules

Two commands make it easy to return to a known, reproducible state:

$ module reset     # load the system default collection of modules
$ module restore   # load your personal default collection of modules

On TACC systems, the command "module reset" is equivalent to "module purge; module load TACC". It's a safer, easier way to get to a known baseline state than issuing the two commands separately.

Help text is available for both individual modules and the module system itself:

$ module help swr   # show help text for software package swr
$ module help       # show help text for the module system itself

See Lmod's online documentation for more extensive documentation. The online documentation addresses the basics in more detail, but also covers several topics beyond the scope of the help text (e.g. writing and using your own module files).

It's safe to execute module commands in job scripts. In fact, this is a good way to write self-documenting, portable job scripts that produce reproducible results. If you use "module save" to define a personal default module collection, it's rarely necessary to execute module commands in shell startup scripts, and it can be tricky to do so safely. If you do wish to put module commands in your startup scripts, see Stampede2's default startup scripts for a safe way to do so.

Citizenship

You share Stampede2 with many other users, sometimes hundreds at a time, and what you do on the system affects them. All users must follow a set of good practices that limit activities which may degrade the system for others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it.

TACC staff has developed the following guidelines for good citizenship on Stampede2. Please familiarize yourself especially with the first two mandates. The next sections discuss best practices for limiting I/O activity and file transfers. Finally, we provide job submission tips to help minimize wait times in the queues.

Do Not Run Jobs on the Login Nodes

Stampede2's few login nodes are shared among all users. Dozens (sometimes hundreds) of users may be logged on at one time accessing the file systems. Think of the login nodes as a prep area where users may edit and manage files, compile code, issue transfers, and submit and track batch jobs. The login nodes provide an interface to the "back-end" compute nodes.

The compute nodes are where actual computations occur and where research is done. Hundreds of jobs may be running on all compute nodes, with hundreds more queued up to run. All batch jobs and executables, as well as development and debugging sessions, must be run on the compute nodes. To access compute nodes on TACC resources, one must either submit a job to a batch queue or initiate an interactive session using the idev utility.

A single user running computationally expensive or disk-intensive tasks will negatively impact performance for other users. Running jobs on the login nodes is one of the fastest routes to account suspension. Instead, run on the compute nodes via an interactive session (e.g., via idev) or by submitting a batch job.

Do not run jobs or perform intensive computational activity on the login nodes or the shared file systems.
Your account may be suspended and you will lose access to the queues if your jobs are impacting other users.

Dos & Don'ts on the Login Nodes

  • Do not run research applications on the login nodes; this includes frameworks like MATLAB and R, as well as computationally or I/O intensive Python scripts. If you need interactive access, use the idev utility or Slurm's srun to schedule one or more compute nodes.

    DO THIS: Start an interactive session on a compute node and run Matlab.

    login1$ idev
    nid00181$ matlab

    DO NOT DO THIS: Run Matlab or other software packages on a login node

    login1$ matlab
  • Do not launch too many simultaneous processes; while it's fine to compile on a login node, a command like "make -j 16" (which compiles on 16 cores) may impact other users.

    DO THIS: build and submit a batch job. All batch jobs run on the compute nodes.

    login1$ make mytarget
    login1$ sbatch myjobscript

    DO NOT DO THIS: Invoke multiple build sessions.

    login1$ make -j 12

    DO NOT DO THIS: Run an executable on a login node.

    login1$ ./myprogram
  • That script you wrote to poll job status should probably do so once every few minutes rather than several times a second.
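A polite polling loop might look like the sketch below. It assumes Slurm's squeue with the -h (no header) and -j (job ID) options, and that squeue typically prints nothing for a job that has left the queue. The function name and the five-minute default are illustrative.

```shell
# Sketch: wait for a Slurm job to leave the queue, checking at a polite interval.
# "squeue -h -j ID" prints a line while the job is pending or running; empty
# output (also the case when squeue is unavailable) ends the loop.
wait_for_job() {
    job_id=$1
    interval=${2:-300}    # seconds between checks; default is five minutes
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $job_id is no longer in the queue"
}

# Example (123456 is a placeholder job ID):
#   wait_for_job 123456 300
```

The sleep between checks is the important part: one query every few minutes is enough to notice that a job has finished.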

Do Not Stress the Shared File Systems

The TACC Global Shared File System, Stockyard, is mounted on most TACC HPC resources as the /work ($WORK) directory. This file system is accessible to all TACC users, and therefore experiences a lot of I/O activity (reading and writing to disk, opening and closing files) as users run their jobs, read and generate data including intermediate and checkpointing files. As TACC adds more users, the stress on the $WORK file system is increasing to the extent that TACC staff is now recommending new job submission guidelines in order to reduce stress and I/O on Stockyard.

TACC staff now recommends that you run your jobs out of the $SCRATCH file system instead of the global $WORK file system.

To run your jobs out of $SCRATCH:

  • Copy or move all job input files to $SCRATCH

  • Make sure your job script directs all output to $SCRATCH

  • Once your job is finished, move your output files to $WORK to avoid any data purges.

Compute nodes should not reference $WORK unless it's to stage data in/out only before/after jobs.
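The stage-in, run, stage-out pattern above can be sketched with one small copy helper. All directory names here (myproject, input, run1, results) are placeholders, and copy_tree is a hypothetical function, not a TACC utility.

```shell
# Copy the contents of one directory into another, creating the target if needed.
# Usage: copy_tree <from dir> <to dir>
copy_tree() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# Before the job: stage inputs from $WORK to a run directory on $SCRATCH.
#   copy_tree "$WORK/myproject/input" "$SCRATCH/myproject/run1"
# In the job script: run from the scratch copy so all job I/O lands on $SCRATCH.
#   cd "$SCRATCH/myproject/run1"
# After the job: copy only the results you need to keep back to $WORK.
#   copy_tree "$SCRATCH/myproject/run1/results" "$WORK/myproject/results-run1"
```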

Consider that $HOME and $WORK are for storage and keeping track of important items. Actual job activity, reading and writing to disk, should be offloaded to your resource's $SCRATCH file system (see the File System Usage Recommendations table). You can start a job from anywhere, but the actual work of the job should occur only on the $SCRATCH partition. You can save original items to $HOME or $WORK so that you can copy them over to $SCRATCH if you need to re-generate results.

More File System Tips

  • Don't run jobs in your $HOME directory. The $HOME file system is for routine file management, not parallel jobs.

  • Watch all your file system quotas. If you're near your quota in $WORK and your job is repeatedly trying (and failing) to write to $WORK, you will stress that file system. If you're near your quota in $HOME, jobs run on any file system may fail, because all jobs write some data to the hidden $HOME/.slurm directory.

  • Avoid storing many small files in a single directory, and avoid workflows that require many small files. A few hundred files in a single directory is probably fine; tens of thousands is almost certainly too many. If you must use many small files, group them in separate directories of manageable size.

  • TACC resources, with a few exceptions, mount three file systems: /home, /work and /scratch. Please follow each file system's recommended usage.

File System Usage Recommendations

FILE SYSTEM: $HOME
BEST STORAGE PRACTICES: cron jobs; small scripts; environment settings
BEST ACTIVITIES: compiling, editing

FILE SYSTEM: $WORK
BEST STORAGE PRACTICES: software installations; original datasets that can't be reproduced; job scripts and templates
BEST ACTIVITIES: staging datasets

FILE SYSTEM: $SCRATCH
BEST STORAGE PRACTICES: temporary storage; I/O files; job files; temporary datasets
BEST ACTIVITIES: all job I/O activity; see TACC's Scratch File System Purge Policy.

Limit Input/Output (I/O) Activity

In addition to the file system tips above, it's important that your jobs limit all I/O activity. This section focuses on ways to avoid causing problems on each resource's shared file systems.

  • Limit I/O intensive sessions (lots of reads and writes to disk, rapidly opening or closing many files)

  • Avoid opening and closing files repeatedly in tight loops. Every open/close operation on the file system requires interaction with the MetaData Service (MDS). The MDS acts as a gatekeeper for access to files on Lustre's parallel file system. Overloading the MDS will affect other users on the system. If possible, open files once at the beginning of your program/workflow, then close them at the end.

  • Don't get greedy. If you know or suspect your workflow is I/O intensive, don't submit a pile of simultaneous jobs. Writing restart/snapshot files can stress the file system; avoid doing so too frequently. Also, use the hdf5 or netcdf libraries to generate a single restart file in parallel, rather than generating files from each process separately.

If you know your jobs will require significant I/O, please submit a support ticket and an HPC consultant will work with you. See also Managing I/O on TACC Resources for additional information.
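As a minimal sketch of the "open once" advice above, the following shell fragment contrasts two loops that write the same lines. The directory, file names, and loop count are placeholders chosen for illustration. In the first loop the ">>" redirection reopens the file on every iteration; in the second, the redirection applies to the whole loop, so the file is opened and closed once.

```shell
#!/bin/bash
# Hypothetical output location; on a TACC system this would live under $SCRATCH.
outdir=$(mktemp -d)

# Avoid: ">>" inside the loop opens and closes the file on every iteration.
for i in $(seq 1 100); do
    echo "step $i" >> "$outdir/slow.txt"
done

# Prefer: redirect the whole loop, so the file is opened once and closed once.
for i in $(seq 1 100); do
    echo "step $i"
done > "$outdir/fast.txt"

# Both approaches produce identical content.
cmp "$outdir/slow.txt" "$outdir/fast.txt" && echo "outputs match"
```

The same principle applies in compiled codes and scripting languages: move the open and close calls outside the loop.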

File Transfer Guidelines

In order to not stress both internal and external networks, be mindful of the following guidelines:

  • When creating or transferring large files to Stockyard ($WORK) or the $SCRATCH file systems, be sure to stripe the receiving directories appropriately. See Striping Large Files for more information.

  • Avoid too many simultaneous file transfers. You share the network bandwidth with other users; don't use more than your fair share. Two or three concurrent scp sessions is probably fine. Twenty is probably not.

  • Avoid recursive file transfers, especially those involving many small files. Create a tar archive before transfers. This is especially true when transferring files to or from Ranch.

Job Submission Tips

  • Request only the resources you need. Make sure your job scripts request only the resources that are needed for that job. Don't ask for more time or more nodes than you really need. The scheduler will have an easier time finding a slot for a job requesting 2 nodes for 2 hours than for a job requesting 4 nodes for 24 hours. This means shorter queue wait times for you and everybody else.

  • Test your submission scripts. Start small: make sure everything works on 2 nodes before you try 20. Work out submission bugs and kinks with 5-minute jobs that won't wait long in the queue and involve short, simple substitutes for your real workload: simple test problems; hello world codes; one-liners like ibrun hostname; or an ldd on your executable.

  • Respect memory limits and other system constraints. If your application needs more memory than is available, your job will fail, and may leave nodes in unusable states. Use TACC's Remora tool to monitor your application's needs.
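The "start small" advice above might translate into a test script along the following lines. This is only a sketch: the job name and output file name are placeholders, the choice of the development queue is an assumption, and your project may require further options such as an allocation name. The "#SBATCH" options shown (-J, -o, -p, -N, -n, -t) are standard Slurm options.

```shell
#!/bin/bash
#SBATCH -J smoketest           # job name (placeholder)
#SBATCH -o smoketest.o%j       # output file; %j expands to the job ID
#SBATCH -p development         # queue; assumed here to suit short tests
#SBATCH -N 2                   # start small: two nodes
#SBATCH -n 4                   # total tasks across both nodes
#SBATCH -t 00:05:00            # five-minute limit keeps the queue wait short

# Use ibrun when it is available (TACC compute nodes). Fall back to a plain
# call elsewhere so the script can be checked on a machine without it.
if command -v ibrun >/dev/null 2>&1; then
    launcher="ibrun"
else
    launcher=""
fi

# One-liner stand-in for the real workload: report where each task runs.
$launcher hostname
```

Once a script like this runs cleanly, swap in the real executable and scale up the node count and time limit in steps.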

Managing Your Files

Stampede2 mounts three Lustre file systems that are shared across all nodes: the home, work, and scratch file systems. Stampede2's startup mechanisms define corresponding account-level environment variables $HOME, $SCRATCH, and $WORK that store the paths to directories that you own on each of these file systems. Consult the Stampede2 File Systems table for the basic characteristics of these file systems, File Operations: I/O Performance for advice on performance issues, and Good Citizenship for tips on file system etiquette.

Navigating the Shared File Systems

Stampede2's /home and /scratch file systems are mounted only on Stampede2, but the work file system mounted on Stampede2 is the Global Shared File System hosted on Stockyard. Stockyard is the same work file system that is currently available on Frontera, Lonestar6, and several other TACC resources.

The $STOCKYARD environment variable points to the highest-level directory that you own on the Global Shared File System. The definition of the $STOCKYARD environment variable is of course account-specific, but you will see the same value on all TACC systems that provide access to the Global Shared File System. This directory is an excellent place to store files you want to access regularly from multiple TACC resources.

Your account-specific $WORK environment variable varies from system to system and is a sub-directory of $STOCKYARD (Figure 3). The sub-directory name corresponds to the associated TACC resource. The $WORK environment variable on Stampede2 points to the $STOCKYARD/stampede2 subdirectory, a convenient location for files you use and jobs you run on Stampede2. Remember, however, that all subdirectories contained in your $STOCKYARD directory are available to you from any system that mounts the file system. If you have accounts on both Stampede2 and Frontera, for example, the $STOCKYARD/stampede2 directory is available from your Frontera account, and $STOCKYARD/frontera is available from your Stampede2 account.

Your quota and reported usage on the Global Shared File System reflect all files that you own on Stockyard, regardless of their actual location on the file system.

See the example for fictitious user bjones in the figure below. All directories are accessible from all systems; however, a given sub-directory (e.g. lonestar6, frontera) will exist only if you have an allocation on that system.

Figure 3. Account-level directories on the work file system (Global Shared File System hosted on Stockyard). Example for fictitious user bjones. All directories usable from all systems. Sub-directories (e.g. lonestar6, frontera) exist only when you have allocations on the associated system.

Note that resource-specific sub-directories of $STOCKYARD are nothing more than convenient ways to manage your resource-specific files. You have access to any such sub-directory from any TACC resource. If you are logged into Stampede2, for example, executing the alias cdw (equivalent to "cd $WORK") will take you to the resource-specific sub-directory $STOCKYARD/stampede2. But you can access this directory from other TACC systems as well by executing "cd $STOCKYARD/stampede2". These commands allow you to share files across TACC systems. In fact, several convenient account-level aliases make it even easier to navigate across the directories you own in the shared file systems:

Table 4. Built-in Account Level Aliases

ALIAS | COMMAND
cd or cdh | cd $HOME
cdw | cd $WORK
cds | cd $SCRATCH
cdy or cdg | cd $STOCKYARD

Striping Large Files

Stampede2's Lustre file systems look and act like a single logical hard disk, but are actually sophisticated integrated systems involving many physical drives (dozens of physical drives for $HOME, hundreds for $WORK and $SCRATCH).

Lustre can stripe (distribute) large files over several physical disks, making it possible to deliver the high performance needed to service input/output (I/O) requests from hundreds of users across thousands of nodes. Object Storage Targets (OSTs) manage the file system's spinning disks: a file with 16 stripes, for example, is distributed across 16 OSTs. One designated Meta-Data Server (MDS) tracks the OSTs assigned to a file, as well as the file's descriptive data.

Before transferring large files to Stampede2, or creating them there, be sure to set an appropriate default stripe count on the receiving directory.

To avoid exceeding your fair share of any given OST, a good rule of thumb is to allow at least one stripe for each 100GB in the file. For example, to set the default stripe count on the current directory to 30 (a plausible stripe count for a directory receiving a file approaching 3TB in size), execute:

$ lfs setstripe -c 30 $PWD

Note that an "lfs setstripe" command always sets both stripe count and stripe size, even if you explicitly specify only one or the other. Since the example above does not explicitly specify stripe size, the command will set the stripe size on the directory to Stampede2's system default (1MB). In general there's no need to customize stripe size when creating or transferring files.
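The rule of thumb above (at least one stripe per 100GB) can be turned into a small helper. The sketch below is illustrative only: the function name stripes_for_gb and the 3000GB example size are assumptions, and the script prints the "lfs setstripe" command instead of running it so that you can review the value first.

```shell
#!/bin/bash
# Suggest a stripe count from a file size in GB, using the rule of thumb of
# at least one stripe per 100GB (rounded up, and never less than 1).
stripes_for_gb() {
    local size_gb=$1
    local count=$(( (size_gb + 99) / 100 ))
    if [ "$count" -lt 1 ]; then
        count=1
    fi
    echo "$count"
}

size_gb=3000                          # hypothetical file of about 3TB
count=$(stripes_for_gb "$size_gb")

# Print the command rather than running it; review it before use.
echo "lfs setstripe -c $count \$PWD"
```

For the 3000GB example this prints a command with a stripe count of 30, matching the example above.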

Remember that it's not possible to change the striping on a file that already exists. Moreover, the "mv" command has no effect on a file's striping if the source and destination directories are on the same file system. You can, of course, use the "cp" command to create a second copy with different striping; to do so, copy the file to a directory with the intended stripe parameters.

You can check the stripe count of a file using the "lfs getstripe" command:

$ lfs getstripe myfile

Transferring Files

Transfer Using scp

You can transfer files between Stampede2 and Linux-based systems using either scp or rsync. Both scp and rsync are available in the Mac Terminal app. Windows SSH clients typically include scp-based file transfer capabilities.

The Linux scp (secure copy) utility is a component of the OpenSSH suite. Assuming your Stampede2 username is bjones, a simple scp transfer that pushes a file named "myfile" from your local Linux system to Stampede2 $HOME would look like this:

localhost$ scp ./myfile bjones@stampede2.tacc.utexas.edu: # note colon after net address

You can use wildcards, but you need to be careful about when and where you want wildcard expansion to occur. For example, to push all files ending in ".txt" from the current directory on your local machine to /work/01234/bjones/stampede2 on Stampede2:

localhost$ scp *.txt bjones@stampede2.tacc.utexas.edu:/work/01234/bjones/stampede2

To delay wildcard expansion until reaching Stampede2, use a backslash ("\") as an escape character before the wildcard. For example, to pull all files ending in ".txt" from /work/01234/bjones/stampede2 on Stampede2 to the current directory on your local system:

localhost$ scp bjones@stampede2.tacc.utexas.edu:/work/01234/bjones/stampede2/\*.txt .

You can of course use shell or environment variables in your calls to scp. For example:

localhost$ destdir="/work/01234/bjones/stampede2/data"
localhost$ scp ./myfile bjones@stampede2.tacc.utexas.edu:$destdir

You can also issue scp commands on your local client that use Stampede2 environment variables like $HOME, $WORK, and $SCRATCH. To do so, use a backslash ("\") as an escape character before the "$"; this ensures that expansion occurs after establishing the connection to Stampede2:

localhost$ scp ./myfile bjones@stampede2.tacc.utexas.edu:\$WORK/data # Note backslash
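To see what the backslash changes, you can print the argument that scp would receive without performing any transfer. In the sketch below the local value assigned to WORK is made up for illustration. The first string shows the local shell expanding $WORK too early, the second uses the backslash as above, and the third uses single quotes, which suppress local expansion in the same way.

```shell
#!/bin/bash
# Hypothetical local value; on your own machine $WORK is usually unset or
# points somewhere unrelated to Stampede2.
WORK="/local/value"

unescaped="bjones@stampede2.tacc.utexas.edu:$WORK/data"     # expanded locally (wrong target)
escaped="bjones@stampede2.tacc.utexas.edu:\$WORK/data"      # "$WORK" passed through literally
quoted='bjones@stampede2.tacc.utexas.edu:$WORK/data'        # single quotes: same effect

printf '%s\n' "$unescaped" "$escaped" "$quoted"
```

The second and third lines of output are identical and still contain the literal text "$WORK", which is what allows the expansion to happen on Stampede2.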

Avoid using scp for recursive ("-r") transfers of directories that contain nested directories of many small files:

localhost$ scp -r ./mydata bjones@stampede2.tacc.utexas.edu:\$WORK # DON'T DO THIS

Instead, use tar to create an archive of the directory, then transfer the directory as a single file:

localhost$ tar cvf ./mydata.tar mydata                                  # create archive
localhost$ scp ./mydata.tar bjones@stampede2.tacc.utexas.edu:\$WORK     # transfer archive

Transfer Using rsync

The rsync (remote synchronization) utility is a great way to synchronize files that you maintain on more than one system: when you transfer files using rsync, the utility copies only the changed portions of individual files. As a result, rsync is especially efficient when you only need to update a small fraction of a large dataset. The basic syntax is similar to scp:

localhost$ rsync mybigfile bjones@stampede2.tacc.utexas.edu:\$WORK/data
localhost$ rsync -avtr mybigdir bjones@stampede2.tacc.utexas.edu:\$WORK/data

The options on the second transfer are typical and appropriate when synching a directory: this is a recursive update ("-r") with verbose ("-v") feedback; the synchronization preserves time stamps ("-t") as well as symbolic links and other meta-data ("-a"). Because rsync only transfers changes, recursive updates with rsync may be less demanding than an equivalent recursive transfer with scp.

See Striping Large Files for additional important advice about striping the receiving directory when transferring or creating large files on TACC systems.

As detailed in the Citizenship section above, it is important to monitor your quotas on the $HOME and $WORK file systems, and limit the number of simultaneous transfers. Remember also that $STOCKYARD (and your $WORK directory on each TACC resource) is available from several other TACC systems: there's no need for scp when both the source and destination involve sub-directories of $STOCKYARD. See Managing Your Files for more information about transfers on $STOCKYARD.

Transfer Using Globus

Globus is another way for ACCESS users to transfer data between ACCESS sites; see Globus at ACCESS and Data Transfer and Management for more information. You can also use Globus if you're affiliated with an institution like the University of Texas that provides access to CILogon.

Sharing Files with Collaborators

If you wish to share files and data with collaborators in your project, see Sharing Project Files on TACC Systems for step-by-step instructions. Project managers or delegates can use Unix group permissions and commands to create read-only or read-write shared workspaces that function as data repositories and provide a common work area to all project members.

Building Software

The phrase "building software" is a common way to describe the process of producing a machine-readable executable file from source files written in C, Fortran, or some other programming language. In its simplest form, building software involves a simple, one-line call or short shell script that invokes a compiler. More typically, the process leverages the power of makefiles, so you can change a line or two in the source code, then rebuild in a systematic way only the components affected by the change. Increasingly, however, the build process is a sophisticated multi-step automated workflow managed by a special framework like autotools or cmake, intended to achieve a repeatable, maintainable, portable mechanism for installing software across a wide range of target platforms.

Basics of Building Software

This section of the user guide does nothing more than introduce the big ideas with simple one-line examples. You will undoubtedly want to explore these concepts more deeply using online resources. You will quickly outgrow the examples here. We recommend that you master the basics of makefiles as quickly as possible: even the simplest computational research project will benefit enormously from the power and flexibility of a makefile-based build process.

Intel Compilers

Intel is the recommended and default compiler suite on Stampede2. Each Intel module also gives you direct access to mkl without loading an mkl module; see Intel MKL for more information. Here are simple examples that use the Intel compiler to build an executable from source code:

$ icc mycode.c                     # C source file; executable a.out
$ icc main.c calc.c analyze.c      # multiple source files
$ icc mycode.c -o myexe            # C source file; executable myexe
$ icpc mycode.cpp -o myexe         # C++ source file
$ ifort mycode.f90 -o myexe        # Fortran90 source file

Compiling a code that uses OpenMP would look like this:

$ icc -qopenmp mycode.c -o myexe # OpenMP

See the published Intel documentation, available both online and in ${TACC_INTEL_DIR}/documentation, for information on optimization flags and other Intel compiler options.

GNU Compilers

The GNU Project maintains a number of high quality compilers, including a compiler for C (gcc), C++ (g++), and Fortran (gfortran). The gcc compiler is the foundation underneath all three, and the term "gcc" often means the suite of these three GNU compilers.

Load a gcc module to access a recent version of the GNU compiler suite. Avoid using the GNU compilers that are available without a gcc module — those will be older versions based on the "system gcc" that comes as part of the Linux distribution.

Here are simple examples that use the GNU compilers to produce an executable from source code:

$ gcc mycode.c                       # C source file; executable a.out
$ gcc mycode.c -o myexe              # C source file; executable myexe
$ g++ mycode.cpp -o myexe            # C++ source file
$ gfortran mycode.f90 -o myexe       # Fortran90 source file
$ gcc -fopenmp mycode.c -o myexe     # OpenMP; GNU flag is different than Intel

Note that some compiler options are the same for both Intel and GNU (e.g. "-o"), while others are different (e.g. "-qopenmp" vs "-fopenmp"). Many options are available in one compiler suite but not the other. See the online GNU documentation for information on optimization flags and other GNU compiler options.

Compiling and Linking as Separate Steps

Building an executable requires two separate steps: (1) compiling (generating a binary object file associated with each source file); and (2) linking (combining those object files into a single executable file that also specifies the libraries that executable needs). The examples in the previous section accomplish these two steps in a single call to the compiler. When building more sophisticated applications or libraries, however, it is often necessary or helpful to accomplish these two steps separately.

Use the "-c" ("compile") flag to produce object files from source files:

$ icc -c main.c calc.c results.c

Barring errors, this command will produce object files main.o, calc.o, and results.o. Syntax for other Intel and GNU compilers is similar.

You can now link the object files to produce an executable file:

$ icc main.o calc.o results.o -o myexe

The compiler calls a linker utility (usually /bin/ld) to accomplish this task. Again, syntax for other compilers is similar.

Include and Library Paths

Software often depends on pre-compiled binaries called libraries. When this is true, compiling usually requires using the "-I" option to specify paths to so-called header or include files that define interfaces to the procedures and data in those libraries. Similarly, linking often requires using the "-L" option to specify paths to the libraries themselves. Typical compile and link lines might look like this:

$ icc -c main.c -I${WORK}/mylib/inc -I${TACC_HDF5_INC}                               # compile
$ icc main.o -o myexe -L${WORK}/mylib/lib -L${TACC_HDF5_LIB} -lmylib -lhdf5          # link

On Stampede2, both the hdf5 and phdf5 modules define the environment variables $TACC_HDF5_INC and $TACC_HDF5_LIB. Other module files define similar environment variables; see Using Modules to Manage Your Environment for more information.

The details of the linking process vary, and order sometimes matters. Much depends on the type of library: static (.a suffix; library's binary code becomes part of executable image at link time) versus dynamically-linked shared (.so suffix; library's binary code is not part of executable; it's located and loaded into memory at run time). The link line can use rpath to store in the executable an explicit path to a shared library. In general, however, the LD_LIBRARY_PATH environment variable specifies the search path for dynamic libraries. For software installed at the system-level, TACC's modules generally modify LD_LIBRARY_PATH automatically. To see whether and how an executable named "myexe" resolves dependencies on dynamically linked libraries, execute "ldd myexe".

A separate section below addresses the Intel Math Kernel Library (MKL).

Compiling and Linking MPI Programs

Intel MPI (module impi) and MVAPICH2 (module mvapich2) are the two MPI libraries available on Stampede2. After loading an impi or mvapich2 module, compile and/or link using an mpi wrapper (mpicc, mpicxx, mpif90) in place of the compiler:

$ mpicc mycode.c -o myexe        # C source, full build
$ mpicc -c mycode.c              # C source, compile without linking
$ mpicxx mycode.cpp -o myexe     # C++ source, full build
$ mpif90 mycode.f90 -o myexe     # Fortran source, full build

These wrappers call the compiler with the options, include paths, and libraries necessary to produce an MPI executable using the MPI module you're using. To see the effect of a given wrapper, call it with the "-show" option:

$ mpicc -show # Show compile line generated by call to mpicc; similarly for other wrappers

Building Third-Party Software in Your Own Account

You're welcome to download third-party research software and install it in your own account. In most cases you'll want to download the source code and build the software so it's compatible with the Stampede2 software environment. You can't use yum or any other installation process that requires elevated privileges, but this is almost never necessary. The key is to specify an installation directory for which you have write permissions. Details vary; you should consult the package's documentation and be prepared to experiment. When using the famous three-step autotools build process, the standard approach is to use configure's "--prefix" option to specify a non-default, user-owned installation directory at the time you execute configure (some packages instead accept a PREFIX variable when you execute make):

$ export INSTALLDIR=$WORK/apps/t3pio
$ ./configure --prefix=$INSTALLDIR
$ make
$ make install

Other languages, frameworks, and build systems generally have equivalent mechanisms for installing software in user space. In most cases a web search like "Python Linux install local" will get you the information you need.

In Python, a local install will resemble one of the following examples:

$ pip install netCDF4 --user                   # install netCDF4 package to $HOME/.local
$ python setup.py install --user               # install to $HOME/.local
$ pip install netCDF4 --prefix=$INSTALLDIR     # custom location; add to PYTHONPATH

Similarly in R:

$ module load Rstats                # load TACC's default R
$ R                                 # launch R
> install.packages('devtools')      # R will prompt for install location

You may, of course, need to customize the build process in other ways. It's likely, for example, that you'll need to edit a makefile or other build artifacts to specify Stampede2-specific include and library paths or other compiler settings. A good way to proceed is to write a shell script that implements the entire process: definitions of environment variables, module commands, and calls to the build utilities. Include echo statements with appropriate diagnostics. Run the script until you encounter an error. Research and fix the current problem. Document your experience in the script itself, including dead-ends, alternatives, and lessons learned. Re-run the script to get to the next error, then repeat until done. When you're finished, you'll have a repeatable process that you can archive until it's time to update the software or move to a new machine.
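One possible skeleton for such a script is sketched below. The helper name run_step, the log file location, and the stand-in commands are all hypothetical; replace the stand-ins with the module commands and build steps your package actually needs.

```shell
#!/bin/bash
# Sketch of a repeatable build script. All names below are placeholders.
set -u                                               # treat unset variables as errors

LOGFILE="${LOGFILE:-${TMPDIR:-/tmp}/build.log}"      # hypothetical log location

# Run one build step: echo a diagnostic, log the command's output, and
# report which step failed so you know where to resume.
run_step() {
    local label=$1
    shift
    echo "=== $label: $*" | tee -a "$LOGFILE"
    if ! "$@" >> "$LOGFILE" 2>&1; then
        echo "FAILED at step: $label (see $LOGFILE)" >&2
        return 1
    fi
}

# Stand-in steps so the skeleton runs anywhere. A real script might use:
#   run_step "configure" ./configure --prefix="$INSTALLDIR"
#   run_step "build"     make
#   run_step "install"   make install
# Keep notes on dead-ends and lessons learned here, as comments.
run_step "check shell" bash --version || exit 1
run_step "show directory" pwd || exit 1
echo "build script finished"
```

Because each step stops the script on failure and names itself in the log, re-running after a fix takes you directly to the next problem.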

If you wish to share a software package with collaborators, you may need to modify file permissions. See Sharing Files with Collaborators for more information.

Intel Math Kernel Library (MKL)

The Intel Math Kernel Library (MKL) is a collection of highly optimized functions implementing some of the most important mathematical kernels used in computational science, including standardized interfaces to:

  • BLAS (Basic Linear Algebra Subroutines), a collection of low-level matrix and vector operations like matrix-matrix multiplication

  • LAPACK (Linear Algebra PACKage), which includes higher-level linear algebra algorithms like Gaussian Elimination

  • FFT (Fast Fourier Transform), including interfaces based on FFTW (Fastest Fourier Transform in the West)

  • ScaLAPACK (Scalable LAPACK), BLACS (Basic Linear Algebra Communication Subprograms), Cluster FFT, and other functionality that provide block-based distributed memory (multi-node) versions of selected LAPACK, BLAS, and FFT algorithms

  • Vector Mathematics (VM) functions that implement highly optimized and vectorized versions of special functions like sine and square root.

MKL with Intel C, C++, and Fortran Compilers

There is no MKL module for the Intel compilers because you don't need one: the Intel compilers have built-in support for MKL. Unless you have specialized needs, there is no need to specify include paths and libraries explicitly. Instead, using MKL with the Intel modules requires nothing more than compiling and linking with the "-mkl" option; e.g.

$ icc -mkl mycode.c
$ ifort -mkl mycode.f90

The "-mkl" switch is an abbreviated form of "-mkl=parallel", which links your code to the threaded version of MKL. To link to the unthreaded version, use "-mkl=sequential". A third option, "-mkl=cluster", which also links to the unthreaded libraries, is necessary and appropriate only when using ScaLAPACK or other distributed memory packages. For additional information, including advanced linking options, see the MKL documentation and Intel MKL Link Line Advisor.

MKL with GNU C, C++, and Fortran Compilers

When using a GNU compiler, load the MKL module before compiling or running your code, then specify explicitly the MKL libraries, library paths, and include paths your application needs. Consult the Intel MKL Link Line Advisor for details. A typical compile/link process on a TACC system will look like this:

$ module load gcc
$ module load mkl                         # available/needed only for GNU compilers
$ gcc -fopenmp -I$MKLROOT/include         \
      -Wl,-L${MKLROOT}/lib/intel64        \
      -lmkl_intel_lp64 -lmkl_core         \
      -lmkl_gnu_thread -lpthread          \
      -lm -ldl mycode.c

For your convenience the mkl module file also provides alternative TACC-defined variables like $TACC_MKL_INCLUDE (equivalent to $MKLROOT/include). Execute "module help mkl" for more information.

Using MKL as BLAS/LAPACK with Third-Party Software

When your third-party software requires BLAS or LAPACK, you can use MKL to supply this functionality. Replace generic instructions that include link options like "-lblas" or "-llapack" with the simpler MKL approach described above. There is no need to download and install alternatives like OpenBLAS.

Using MKL as BLAS/LAPACK with TACC's MATLAB, Python, and R Modules

TACC's MATLAB, Python, and R modules all use threaded (parallel) MKL as their underlying BLAS/LAPACK library. This means that even serial codes written in MATLAB, Python, or R may benefit from MKL's thread-based parallelism. This requires no action on your part other than specifying an appropriate max thread count for MKL; see the section below for more information.

Controlling Threading in MKL

Any code that calls MKL functions can potentially benefit from MKL's thread-based parallelism; this is true even if your code is not otherwise a parallel application. If you are linking to the threaded MKL (using "-mkl", "-mkl=parallel", or the equivalent explicit link line), you need only specify an appropriate value for the max number of threads available to MKL. You can do this with either of the two environment variables MKL_NUM_THREADS or OMP_NUM_THREADS. The environment variable MKL_NUM_THREADS specifies the max number of threads available to each instance of MKL, and has no effect on non-MKL code. If MKL_NUM_THREADS is undefined, MKL uses OMP_NUM_THREADS to determine the max number of threads available to MKL functions. In either case, MKL will attempt to choose an optimal thread count less than or equal to the specified value. Note that OMP_NUM_THREADS defaults to 1 on TACC systems; if you use the default value you will get no thread-based parallelism from MKL.

If you are running a single serial, unthreaded application (or an unthreaded MPI code involving a single MPI task per node) it is usually best to give MKL as much flexibility as possible by setting the max thread count to the total number of hardware threads on the node (272 on KNL, 96 on SKX, 160 on ICX). Of course things are more complicated if you are running more than one process on a node: e.g. multiple serial processes, threaded applications, hybrid MPI-threaded applications, or pure MPI codes running more than one MPI rank per node. See Accelerate Fast Math with Intel® oneAPI Math Kernel Library and related Intel resources for examples of how to manage threading when calling MKL from multiple processes.
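For the simple case of one serial, unthreaded process per node, setting the thread count might look like the sketch below. The function name hw_threads_for and the node-type labels are illustrative assumptions; the thread counts are the per-node figures quoted above.

```shell
#!/bin/bash
# Map a Stampede2 node type to its hardware thread count. The function name
# and the labels "knl", "skx", and "icx" are illustrative.
hw_threads_for() {
    case "$1" in
        knl) echo 272 ;;
        skx) echo 96 ;;
        icx) echo 160 ;;
        *)   echo "unknown node type: $1" >&2; return 1 ;;
    esac
}

node_type="skx"                                  # set to match the nodes you requested
MKL_NUM_THREADS=$(hw_threads_for "$node_type")
export MKL_NUM_THREADS
echo "MKL_NUM_THREADS=$MKL_NUM_THREADS"

# ./myexe                                        # then launch the serial, unthreaded code
```

Setting MKL_NUM_THREADS instead of OMP_NUM_THREADS leaves the thread count of any non-MKL OpenMP code unchanged.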

Using ScaLAPACK, Cluster FFT, and Other MKL Cluster Capabilities

See "Working with the Intel Math Kernel Library Cluster Software" and "Intel MKL Link Line Advisor" for information on linking to the MKL cluster components.

Building for Performance on Stampede2

Compiler

When building software on Stampede2, we recommend using the most recent Intel compiler and Intel MPI library available on Stampede2. The most recent versions may be newer than the defaults. Execute "module spider intel" and "module spider impi" to see what's installed. When loading these modules you may need to specify version numbers explicitly (e.g. "module load intel/18.0.0" and "module load impi/18.0.0").

Architecture-Specific Flags

To compile for KNL only, include "-xMIC-AVX512" as a build option. The "-x" switch allows you to specify a target architecture, while MIC-AVX512 is the KNL-specific subset of Intel's Advanced Vector Extensions 512-bit instruction set. Besides all other appropriate compiler options, you should also consider specifying an optimization level using the "-O" flag:

$ icc -xMIC-AVX512 -O3 mycode.c -o myexe # will run only on KNL

Similarly, to build for SKX or ICX, specify the CORE-AVX512 instruction set, which is native to SKX and ICX:

$ ifort -xCORE-AVX512 -O3 mycode.f90 -o myexe # will run on SKX or ICX

Because Stampede2 has more than one kind of compute node, however, we recommend a more flexible approach when building with the Intel compiler: use CPU dispatch to build a multi-architecture ("fat") binary that contains alternate code paths with optimized vector code for each type of Stampede2 node. To produce a multi-architecture binary for Stampede2, build with the following options:

-xCORE-AVX2 -axCORE-AVX512,MIC-AVX512

These particular choices allow you to build on any Stampede2 node (KNL, SKX and ICX nodes), and use CPU dispatch to produce a multi-architecture binary. We recommend that you specify these flags in both the compile and link steps. Specify an optimization level (e.g. "-O3") along with any other appropriate compiler switches:

$ icc -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 -O3 mycode.c -o myexe

The "-x" option is the target base architecture (instruction set). The base instruction set must run on all targeted processors. Here we specify CORE-AVX2, which is native for older Broadwell processors and supported on all KNL, SKX and ICX nodes. This option allows configure scripts and similar build systems to run test executables on any Stampede2 login or compute node. The "-ax" option is a comma-separated list of alternate instruction sets: CORE-AVX512 for SKX and ICX, and MIC-AVX512 for KNL.

Now that we have replaced the original Broadwell login nodes with newer Skylake login nodes, "-xCORE-AVX2" remains a reasonable (though conservative) base option. Another plausible, more aggressive base option is "-xCOMMON-AVX512", which is a subset of AVX512 that runs on all KNL, SKX and ICX nodes.

It's best to avoid building with "-xHost" (a flag that means "optimize for the architecture on which I'm compiling now"). Using "-xHost" on a SKX login node, for example, will result in a binary that won't run on KNL.
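If you are unsure which instruction-set family the node you are logged into belongs to, the CPU flags reported by Linux give a hint. The sketch below is not a TACC tool: the function name classify_flags is made up, and it relies on the assumption that the avx512er flag appears only on KNL while avx512bw appears on SKX and ICX.

```shell
#!/bin/bash
# Classify a space-separated CPU flag list by the Intel "-x" target it matches.
# Assumption: avx512er marks KNL (MIC-AVX512); avx512bw marks SKX/ICX (CORE-AVX512).
classify_flags() {
    case " $1 " in
        *" avx512er "*) echo "MIC-AVX512" ;;
        *" avx512bw "*) echo "CORE-AVX512" ;;
        *" avx2 "*)     echo "CORE-AVX2" ;;
        *)              echo "unknown" ;;
    esac
}

# Report on the current node when /proc/cpuinfo is readable (Linux only).
if [ -r /proc/cpuinfo ]; then
    flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 || true)
    echo "this node: $(classify_flags "$flags")"
fi
```

A result of CORE-AVX512 on a login node is a reminder that a "-xHost" build made there would not run on KNL.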

Don't skip the "-x" flag in a multi-architecture build: the default is the very old SSE2 (Pentium 4) instruction set. Don't create a multi-architecture build with a base option of either "-xMIC-AVX512" (native on KNL) or "-xCORE-AVX512" (native on SKX/ICX); there are no meaningful, compatible alternate ("-ax") instruction sets:

$ icc -xCORE-AVX512 -axMIC-AVX512 -O3 mycode.c -o myexe # NO! Base incompatible with alternate

On Stampede2, the module files for newer Intel compilers (Intel 18.0.0 and later) define the environment variable TACC_VEC_FLAGS that stores the recommended architecture flags described above. This can simplify your builds:

$ echo $TACC_VEC_FLAGS                         # env variable available only for intel/18.0.0 and later
-xCORE-AVX2 -axCORE-AVX512,MIC-AVX512
$ icc $TACC_VEC_FLAGS -O3 mycode.c -o myexe

Simplicity is a major advantage of this multi-architecture approach: it allows you to build and run anywhere on Stampede2, and performance is generally comparable to single-architecture builds. There are some trade-offs to consider, however. This approach will take a little longer to compile than single-architecture builds, and will produce a larger binary. In some cases, you might also pay a small performance penalty over single-architecture approaches. For more information see the Intel documentation.

For information on the performance implications of your choice of build flags, see the sections on Programming and Performance for KNL, SKX, and ICX respectively.

If you use GNU compilers, see GNU x86 Options for information regarding support for KNL, SKX and ICX. Note that GNU compilers do not support multi-architecture binaries.

Running Jobs on the Stampede2 Compute Nodes

Job Accounting

Like all TACC systems, Stampede2's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour). For any given job, the total cost in SUs is the number of node-hours consumed (nodes used multiplied by wall clock hours), adjusted by any charges or discounts for the use of specialized queues, e.g. Frontera's flex queue, Stampede2's development queue, and Longhorn's v100 queue. The queue charge rates are determined by the supply and demand for that particular queue or type of node used and are subject to change.

Stampede2 SUs billed = (# nodes) x (job duration in wall clock hours) x (charge rate per node-hour)
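As a worked illustration of this formula, the sketch below estimates the charge for two notional jobs using the charge rates listed in Table 5. The helper name su_cost is hypothetical, and the node counts and durations are made-up examples.

```shell
# Estimate SUs billed: nodes x wall-clock hours x charge rate per node-hour.
# su_cost is a hypothetical helper; awk handles the fractional arithmetic.
su_cost() {
    awk -v n="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", n * h * r }'
}

su_cost 4 2.5 0.8     # 4 KNL nodes for 2.5 hours at 0.8 SU per node-hour
su_cost 16 12 1.67    # 16 ICX nodes for 12 hours at 1.67 SU per node-hour
```

The two calls print 8.00 and 320.64 respectively.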

The Slurm scheduler tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. If your job finishes early and exits properly, Slurm will release the nodes back into the pool of available nodes. Your job will only be charged for as long as you are using the nodes.

TACC does not implement node-sharing on any compute resource. Each Stampede2 node can be assigned to only one user at a time; hence a complete node is dedicated to a user's job and accrues wall-clock time for all the node's cores whether or not all cores are used.

Tip: Your queue wait times will be shorter if you request only the time you need: the scheduler will have a much easier time finding a slot for the 2 hours you really need than, say, for the 12 hours requested in your job script.

Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.

To display a summary of your TACC project balances and disk quotas at any time, execute:

login1$ /usr/local/etc/taccinfo     # Generally more current than balances displayed on the portals.

Slurm Job Scheduler

Stampede2's job scheduler is the Slurm Workload Manager. Slurm commands enable you to submit, manage, monitor, and control your jobs.

Slurm Partitions (Queues)

Currently available queues include those in Stampede2 Production Queues. See KNL Compute Nodes, SKX Compute Nodes, Memory Modes, and Cluster Modes for more information on node types.

Table 5. Stampede2 Production Queues

QUEUE NAME    | NODE TYPE          | MAX NODES PER JOB (ASSOC'D CORES)* | MAX DURATION | MAX JOBS IN QUEUE* | CHARGE RATE (PER NODE-HOUR)
development   | KNL cache-quadrant | 16 nodes (1,088 cores)*            | 2 hrs        | 1*                 | 0.8 Service Unit (SU)
normal        | KNL cache-quadrant | 256 nodes (17,408 cores)*          | 48 hrs       | 50*                | 0.8 SU
large**       | KNL cache-quadrant | 2048 nodes (139,264 cores)*        | 48 hrs       | 5*                 | 0.8 SU
long          | KNL cache-quadrant | 32 nodes (2,176 cores)*            | 120 hrs      | 2*                 | 0.8 SU
flat-quadrant | KNL flat-quadrant  | 32 nodes (2,176 cores)*            | 48 hrs       | 5*                 | 0.8 SU
skx-dev       | SKX                | 4 nodes (192 cores)*               | 2 hrs        | 1*                 | 1 SU
skx-normal    | SKX                | 128 nodes (6,144 cores)*           | 48 hrs       | 20*                | 1 SU
skx-large**   | SKX                | 868 nodes (41,664 cores)*          | 48 hrs       | 3*                 | 1 SU
icx-normal    | ICX                | 40 nodes (3,200 cores)*            | 48 hrs       | 20*                | 1.67 SU

* Queue status as of March 7, 2022.

Queues and limits are subject to change without notice. Execute "qlimits" on Stampede2 for real-time information regarding limits on available queues. See Monitoring Jobs and Queues for additional information.

** To request more nodes than are available in the normal queue, submit a consulting (help desk) ticket through the TACC or the ACCESS Support portal. Include in your request reasonable evidence of your readiness to run under the conditions you're requesting. In most cases this should include your own strong or weak scaling results from Stampede2.

*** For non-hybrid memory-cluster modes or other special requirements, submit a ticket through the TACC or the ACCESS Support portal.

Submitting Batch Jobs with sbatch

Use Slurm's "sbatch" command to submit a batch job to one of the Stampede2 queues:

login1$ sbatch myjobscript

Here "myjobscript" is the name of a text file containing #SBATCH directives and shell commands that describe the particulars of the job you are submitting. The details of your job script's contents depend on the type of job you intend to run.

In your job script you (1) use #SBATCH directives to request computing resources (e.g. 10 nodes for 2 hrs); and then (2) use shell commands to specify what work you're going to do once your job begins. There are many possibilities: you might elect to launch a single application, or you might want to accomplish several steps in a workflow. You may even choose to launch more than one application at the same time. The details will vary, but your own job script will probably include at least one launch line that is a variation of one of the examples described here.

Job Scripts

KNL Serial Job in Normal Queue

SKX Serial Job in Normal Queue

N/A

KNL MPI Job in Normal Queue
SKX MPI Job in Normal Queue
ICX MPI Job in Normal Queue
KNL OpenMP Job in Normal Queue
SKX OpenMP Job in Normal Queue
ICX OpenMP Job in Normal Queue
KNL Hybrid Job in Normal Queue
SKX Hybrid Job in Normal Queue
ICX Hybrid Job in Normal Queue

Your job will run in the environment it inherits at submission time; this environment includes the modules you have loaded and the current working directory. In most cases you should run your application(s) after loading the same modules that you used to build them. You can of course use your job submission script to modify this environment by defining new environment variables; changing the values of existing environment variables; loading or unloading modules; changing directory; or specifying relative or absolute paths to files. Do not use the Slurm "--export" option to manage your job's environment: doing so can interfere with the way the system propagates the inherited environment.

The Common sbatch Options table below describes some of the most common sbatch command options. Slurm directives begin with "#SBATCH"; most have a short form (e.g. "-N") and a long form (e.g. "--nodes"). You can pass options to sbatch using either the command line or job script; most users find that the job script is the easier approach. The first line of your job script must specify the interpreter that will parse non-Slurm commands; in most cases "#!/bin/bash" or "#!/bin/csh" is the right choice. Avoid "#!/bin/sh" (its startup behavior can lead to subtle problems on Stampede2), and do not include comments or any other characters on this first line. All #SBATCH directives must precede all shell commands. Note also that certain #SBATCH options or combinations of options are mandatory, while others are not available on Stampede2.

Table 6. Common sbatch Options

OPTION | ARGUMENT | COMMENTS
-p | queue_name | Submits to queue (partition) designated by queue_name
-J | job_name | Job name
-N | total_nodes | Required. Define the resources you need by specifying either: (1) "-N" and "-n"; or (2) "-N" and "--ntasks-per-node".
-n | total_tasks | This is the total number of MPI tasks in this job. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as "-N".
--ntasks-per-node or --tasks-per-node | tasks_per_node | This is the number of MPI tasks per node. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1.
-t | hh:mm:ss | Required. Wall clock time for job.
--mail-user= | email_address | Specify the email address to use for notifications. Use with the --mail-type= flag below.
--mail-type= | begin, end, fail, or all | Specify when user notifications are to be sent (one option per line).
-o | output_file | Direct job standard output to output_file (without the -e option, error output also goes to this file)
-e | error_file | Direct job error output to error_file
-d or --dependency= | afterok:jobid | Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes
-A | projectnumber | Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects.
-a or --array | N/A | Not available. Use the launcher module for parameter sweeps and other collections of related serial jobs.
--mem | N/A | Not available. If you attempt to use this option, the scheduler will not accept your job.
--export= | N/A | Avoid this option on Stampede2. Using it is rarely necessary and can interfere with the way the system propagates your environment.

By default, Slurm writes all console output to a file named "slurm-%j.out", where %j is the numerical job ID. To specify a different filename use the "-o" option. To save stdout (standard out) and stderr (standard error) to separate files, specify both "-o" and "-e".
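Putting these options together, a job script for a small MPI run might look like the sketch below. The job name, output file names, node and task counts, and time limit are placeholders chosen for illustration, not recommendations. The block writes the script to a file named "myjobscript" so that it can be inspected before it is submitted.

```shell
# Write a minimal example job script to "myjobscript" (all values are placeholders).
cat > myjobscript <<'EOF'
#!/bin/bash
#SBATCH -J myjob              # job name
#SBATCH -o myjob.o%j          # stdout file (%j expands to the job ID)
#SBATCH -e myjob.e%j          # stderr file
#SBATCH -p normal             # queue (partition)
#SBATCH -N 2                  # total nodes
#SBATCH -n 128                # total MPI tasks (64 per node here)
#SBATCH -t 01:30:00           # wall clock limit (hh:mm:ss)

# Shell commands start only after the last #SBATCH directive.
module list
pwd
date

ibrun ./myprogram             # launch the MPI executable
EOF
```

The generated file can then be submitted with "sbatch myjobscript".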

Launching Applications

The primary purpose of your job script is to launch your research application. How you do so depends on several factors, especially (1) the type of application (e.g. MPI, OpenMP, serial), and (2) what you're trying to accomplish (e.g. launch a single instance, complete several steps in a workflow, run several applications simultaneously within the same job). While there are many possibilities, your own job script will probably include a launch line that is a variation of one of the examples described in this section:

Launching One Serial Application

To launch a serial application, simply call the executable. Specify the path to the executable in either the PATH environment variable or in the call to the executable itself:

myprogram                       # executable in a directory listed in $PATH
$WORK/apps/myprov/myprogram     # explicit full path to executable
./myprogram                     # executable in current directory
./myprogram -m -k 6 input1      # executable with notional input options

Launching One Multi-Threaded Application

Launch a threaded application the same way. Be sure to specify the number of threads. Note that the default OpenMP thread count is 1.

export OMP_NUM_THREADS=68     # 68 total OpenMP threads (1 per KNL core)
./myprogram

Launching One MPI Application

To launch an MPI application, use the TACC-specific MPI launcher "ibrun", which is a Stampede2-aware replacement for generic MPI launchers like mpirun and mpiexec. In most cases the only arguments you need are the name of your executable followed by any arguments your executable needs. When you call ibrun without other arguments, your Slurm #SBATCH directives will determine the number of ranks (MPI tasks) and number of nodes on which your program runs.

#SBATCH -N 5
#SBATCH -n 200
ibrun ./myprogram     # ibrun uses the #SBATCH directives to properly allocate nodes and tasks

To use ibrun interactively, say within an idev session, you can specify:

login1$ idev -N 2 -n 80
c123-456$ ibrun ./myprogram     # ibrun uses idev's arguments to properly allocate nodes and tasks

Launching One Hybrid (MPI+Threads) Application

When launching a single application you generally don't need to worry about affinity: both Intel MPI and MVAPICH2 will distribute and pin tasks and threads in a sensible way.

export OMP_NUM_THREADS=8     # 8 OpenMP threads per MPI rank
ibrun ./myprogram            # use ibrun instead of mpirun or mpiexec

As a practical guideline, the product of $OMP_NUM_THREADS and the maximum number of MPI processes per node should not be greater than the total number of cores available per node (KNL nodes have 68 cores, SKX nodes have 48 cores, ICX nodes have 80 cores).
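This guideline is easy to check before submitting a job. The sketch below is one way to do so; the helper name fits_on_node and the sample task and thread counts are hypothetical.

```shell
# Check that MPI tasks per node x OpenMP threads per task fits in a node's cores.
# fits_on_node is a hypothetical helper; its arguments are tasks per node,
# threads per task, and cores per node (68 for KNL, 48 for SKX, 80 for ICX).
fits_on_node() {
    tasks=$1; threads=$2; cores=$3
    if [ $((tasks * threads)) -le "$cores" ]; then
        echo "ok: $((tasks * threads)) of $cores cores used"
    else
        echo "oversubscribed: $((tasks * threads)) threads on $cores cores"
        return 1
    fi
}

fits_on_node 6 8 48             # SKX: 6 ranks x 8 threads fills the node exactly
fits_on_node 16 8 68 || true    # KNL: 128 threads exceed the 68 cores
```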

More Than One Serial Application in the Same Job

TACC's "launcher" utility provides an easy way to launch more than one serial application in a single job. This is a great way to engage in a popular form of High Throughput Computing: running parameter sweeps (one serial application against many different input datasets) on several nodes simultaneously. The launcher utility will execute your specified list of independent serial commands, distributing the tasks evenly, pinning them to specific cores, and scheduling them to keep cores busy. Execute "module load launcher" followed by "module help launcher" for more information.
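The list of commands is typically a plain text file with one independent serial command per line. As a rough sketch, a parameter sweep over the notional "-k" option from the serial example above could be generated as follows. The file name, parameter values, and output file names are all hypothetical; "module help launcher" describes how to point the launcher at such a file.

```shell
# Generate a command list for a hypothetical parameter sweep:
# one independent serial command per line.
jobfile=sweep_commands.txt
: > "$jobfile"                      # create or truncate the file
for k in 2 4 6 8; do
    echo "./myprogram -m -k $k input1 > output_k$k.log" >> "$jobfile"
done

cat "$jobfile"                      # show the generated commands
```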

MPI Applications One at a Time

To run one MPI application after another (or any sequence of commands one at a time), simply list them in your job script in the order in which you'd like them to execute. When one application/command completes, the next one will begin.

module load git
module list
./preprocess.sh
ibrun ./myprogram input1     # runs after preprocess.sh completes
ibrun ./myprogram input2     # runs after previous MPI app completes

More than One MPI Application Running Concurrently

To run more than one MPI application simultaneously in the same job, you need to do several things:

  • use ampersands to launch each instance in the background;

  • include a wait command to pause the job script until the background tasks complete;

  • use the ibrun "-n" and "-o" switches to specify task counts and hostlist offsets respectively; and