Stampede2 - TACC

Notices

  • Stampede2 has deployed 224 Intel "Ice Lake" (ICX) compute nodes, replacing 448 KNL compute nodes. Each ICX node has 80 cores on two sockets (40 cores/socket). Hyperthreading is enabled: there are two hardware threads per core, for a total of 80 x 2 = 160 hardware threads per node. See ICX Compute Node specifications, new ICX job scripts, and the new icx-normal queue for more information. (03/09/22)

  • All users: refer to updated Remote Desktop Access instructions. (07/20/2021)

  • All users: read Managing I/O on TACC Resources. TACC Staff have put forth new file system and job submission guidelines. (01/09/20)

  • The Intel 18 compiler has replaced Intel 17 as the default compiler on Stampede2. The Intel 17 compiler and software stack are still available to those who load the appropriate modules explicitly. See Intel 18 to Become New Default Compiler on Stampede2 for more information. (02/26/19)

  • In order to balance queue wait times, the charge rate for all KNL queues has been adjusted to 0.8 SUs per node-hour. The charge rate for the SKX queues remains at 1 SU. (01/14/19)

  • Stampede2's Knights Landing (KNL) compute nodes each have 68 cores, and each core has 4 hardware threads. But it may not be a good idea to use all 272 hardware threads simultaneously, and it's certainly not the first thing you should try. In most cases it's best to specify no more than 64-68 MPI tasks or independent processes per node, and 1-2 threads/core. See Best Known Practices… for more information.

  • Stampede2's Skylake (SKX) compute nodes each have 48 cores on two sockets (24 cores/socket). Hyperthreading is enabled: there are two hardware threads per core, for a total of 48 x 2 = 96 hardware threads per node. See Table 2 for more information. Note that SKX nodes have their own queues.

Figure 1. Stampede2 System

Introduction

Stampede2, generously funded by the National Science Foundation (NSF) through award ACI-1540931, is one of the flagship supercomputers at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. Stampede2 entered full production in Fall 2017 as an 18-petaflop national resource that builds on the successes of the original Stampede system it replaces. The first phase of the Stampede2 rollout featured the second generation of processors based on Intel's Many Integrated Core (MIC) architecture. Stampede2's 4,200 Knights Landing (KNL) nodes represent a radical break with the first-generation Knights Corner (KNC) MIC coprocessor. Unlike the legacy KNC, a Stampede2 KNL is not a coprocessor: each 68-core KNL is a stand-alone, self-booting processor that is the sole processor in its node. Phase 2 added 1,736 Intel Xeon Skylake (SKX) nodes to Stampede2. The final phase of Stampede2 features the replacement of 448 KNL nodes with 224 Ice Lake (ICX) nodes.

System Overview

KNL Compute Nodes

Stampede2 hosts 4,200 KNL compute nodes, including 504 KNL nodes that were formerly configured as a Stampede1 sub-system.

Each of Stampede2's KNL nodes includes 96GB of traditional DDR4 Random Access Memory (RAM). They also feature an additional 16GB of high bandwidth, on-package memory known as Multi-Channel Dynamic Random Access Memory (MCDRAM) that is up to four times faster than DDR4. The KNL's memory is configurable in two important ways: there are BIOS settings that determine at boot time the processor's memory mode and cluster mode. The processor's memory mode determines whether the fast MCDRAM operates as RAM, as direct-mapped L3 cache, or as a mixture of the two. The cluster mode determines the mechanisms for achieving cache coherency, which in turn determines latency: roughly speaking, this mode specifies the degree to which some memory addresses are "closer" to some cores than to others. See "Programming and Performance: KNL" below for a top-level description of these and other available memory and cluster modes.

Table 1. Stampede2 KNL Compute Node Specifications

Model: 

Intel Xeon Phi 7250 ("Knights Landing")

Total cores per KNL node: 

68 cores on a single socket

Hardware threads per core: 

4

Hardware threads per node: 

68 x 4 = 272

Clock rate: 

1.4GHz

RAM: 

96GB DDR4 plus 16GB high-speed MCDRAM. Configurable in two important ways; see "Programming and Performance: KNL" for more info.

Cache: 

32KB L1 data cache per core; 1MB L2 per two-core tile. In default config, MCDRAM operates as 16GB direct-mapped L3.

Local storage: 

All but 504 KNL nodes have a 107GB /tmp partition on a 200GB Solid State Drive (SSD). The 504 KNLs originally installed as the Stampede1 KNL sub-system each have a 32GB /tmp partition on 112GB SSDs. The latter nodes currently make up the development, long and flat-quadrant queues. Size of /tmp partitions as of 24 Apr 2018.

SKX Compute Nodes

Stampede2 hosts 1,736 SKX compute nodes.

Table 2. Stampede2 SKX Compute Node Specifications

Model: 

Intel Xeon Platinum 8160 ("Skylake")

Total cores per SKX node: 

48 cores on two sockets (24 cores/socket)

Hardware threads per core: 

2

Hardware threads per node: 

48 x 2 = 96

Clock rate: 

2.1GHz nominal (1.4-3.7GHz depending on instruction set and number of active cores)

RAM: 

192GB (2.67GHz) DDR4

Cache: 

32KB L1 data cache per core; 1MB L2 per core; 33MB L3 per socket. Each socket can cache up to 57MB (sum of L2 and L3 capacity).

Local storage: 

144GB /tmp partition on a 200GB SSD. Size of /tmp partition as of 14 Nov 2017.

ICX Compute Nodes

Stampede2 hosts 224 ICX compute nodes.

Table 2a. Stampede2 ICX Compute Node Specifications

Model: 

Intel Xeon Platinum 8380 ("Ice Lake")

Total cores per ICX node: 

80 cores on two sockets (40 cores/socket)

Hardware threads per core: 

2

Hardware threads per node: 

80 x 2 = 160

Clock rate: 

2.3 GHz nominal (3.4GHz max frequency depending on instruction set and number of active cores)

RAM: 

256GB (3.2 GHz) DDR4

Cache: 

48KB L1 data cache per core; 1.25 MB L2 per core; 60 MB L3 per socket. Each socket can cache up to 110 MB (sum of L2 and L3 capacity)

Local storage: 

342 GB /tmp partition

Login Nodes

The Stampede2 login nodes, upgraded at the start of Phase 2, are Intel Xeon Gold 6132 (SKX) nodes, each with 28 cores on two sockets (14 cores/socket). They replace the decommissioned Broadwell login nodes used during Phase 1.

Network

The interconnect is a 100Gb/sec Intel Omni-Path (OPA) network with a fat tree topology employing six core switches. There is one leaf switch for each 28-node half rack, each with 20 leaf-to-core uplinks (28/20 oversubscription).

File Systems Introduction

Stampede2 mounts three shared Lustre file systems on which each user has corresponding account-specific directories $HOME, $WORK, and $SCRATCH. Each file system is available from all Stampede2 nodes; the Stockyard-hosted work file system is available on most other TACC HPC systems as well. See Navigating the Shared File Systems for detailed information as well as the Good Citizenship file system guidelines.

Table 3. Stampede2 File Systems

FILE SYSTEM

QUOTA

KEY FEATURES

$HOME

10GB, 200,000 files

Not intended for parallel or high-intensity file operations.
Backed up regularly.
Overall capacity ~1PB. Two Meta-Data Servers (MDS), four Object Storage Targets (OSTs).
Defaults: 1 stripe, 1MB stripe size.
Not purged.

$WORK

1TB, 3,000,000 files across all TACC systems,
regardless of where on the file system the files reside.

Not intended for high-intensity file operations or jobs involving very large files.
On the Global Shared File System that is mounted on most TACC systems.
See Stockyard system description for more information.
Defaults: 1 stripe, 1MB stripe size
Not backed up.
Not purged.

$SCRATCH

no quota

Overall capacity ~30PB. Four MDSs, 66 OSTs.
Defaults: 1 stripe, 1MB stripe size.
Not backed up.
Files are subject to purge if access time* is more than 10 days old.

Scratch File System Purge Policy

The $SCRATCH file system, as its name indicates, is a temporary storage space. Files that have not been accessed* in ten days are subject to purge. Deliberately modifying file access time (using any method, tool, or program) for the purpose of circumventing purge policies is prohibited.

*The operating system updates a file's access time whenever the file is read, executed, or modified on a login or compute node. Use the "ls -ul" command to view access times.

Accessing the System

Access to all TACC systems now requires Multi-Factor Authentication (MFA). You can create an MFA pairing on the TACC User Portal. After logging into the portal, go to your account profile (Home->Account Profile), then click the "Manage" button under "Multi-Factor Authentication" on the right side of the page. See Multi-Factor Authentication at TACC for further information.

Secure Shell (SSH)

The "ssh" command (SSH protocol) is the standard way to connect to Stampede2. SSH also includes support for the file transfer utilities scp and sftp. Wikipedia is a good source of information on SSH; the ACCESS Manage SSH Keys pages also provide detailed, ACCESS-specific help.

SSH is available within Linux and from the Terminal app in macOS. If you are using Windows, you will need an SSH client that supports the SSH-2 protocol, e.g. Bitvise, OpenSSH, MobaXterm, PuTTY, or SecureCRT. Initiate a session using the ssh command or the equivalent; from the Linux command line the launch command looks like this:

localhost$ ssh myusername@stampede2.tacc.utexas.edu

The above command will rotate connections across all available login nodes and route your connection to one of them. To connect to a specific login node, use its full domain name:

localhost$ ssh myusername@login2.stampede2.tacc.utexas.edu

To connect with X11 support on Stampede2 (usually required for applications with graphical user interfaces), use the "-X" or "-Y" switch:

localhost$ ssh -X myusername@stampede2.tacc.utexas.edu

Use your TACC password, not your ACCESS password, for direct logins to TACC resources. You can change your TACC password through the TACC User Portal. Log into the portal, then select "Change Password" under the "HOME" tab. If you've forgotten your password, go to the TACC User Portal home page and select "Password Reset" under the Home tab.

To report a connection problem, execute the ssh command with the "-vvv" option and include the verbose output when submitting a help ticket.

Do not run the "ssh-keygen" command on Stampede2. This command will create and configure a key pair that will interfere with the execution of job scripts in the batch system. If you do this by mistake, you can recover by renaming or deleting the .ssh directory located in your home directory; the system will automatically generate a new one for you when you next log into Stampede2.

  1. execute "mv .ssh dot.ssh.old"

  2. log out

  3. log into Stampede2 again

After logging in again the system will generate a properly configured key pair.

Using Stampede2

Stampede2 nodes run Red Hat Enterprise Linux 7. Regardless of your research workflow, you'll need to master Linux basics and a Linux-based text editor (e.g. emacs, nano, gedit, or vi/vim) to use the system properly. This user guide does not address these topics, however. There are numerous resources in a variety of formats available to help you learn Linux, including some listed on the TACC and ACCESS Support sites. If you encounter a term or concept in this user guide that is new to you, a quick internet search should help you resolve the matter quickly.

Configuring Your Account

Linux Shell

The default login shell for your user account is Bash. To determine your current login shell, execute:
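
login1$ echo $SHELL    # prints the path of your login shell, e.g. /bin/bash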

If you'd like to change your login shell to csh, sh, tcsh, or zsh, submit a ticket through the TACC or ACCESS Support portal. The "chsh" ("change shell") command will not work on TACC systems.

When you start a shell on Stampede2, system-level startup files initialize your account-level environment and aliases before the system sources your own user-level startup scripts. You can use these startup scripts to customize your shell by defining your own environment variables, aliases, and functions. These scripts (e.g. .profile and .bashrc) are generally hidden files: so-called dotfiles that begin with a period, visible when you execute: "ls -a".

Before editing your startup files, however, it's worth taking the time to understand the basics of how your shell manages startup. Bash startup behavior is very different from the simpler csh behavior, for example. The Bash startup sequence varies depending on how you start the shell (e.g. using ssh to open a login shell, executing the "bash" command to begin an interactive shell, or launching a script to start a non-interactive shell). Moreover, Bash does not automatically source your .bashrc when you start a login shell by using ssh to connect to a node. Unless you have specialized needs, however, this is undoubtedly more flexibility than you want: you will probably want your environment to be the same regardless of how you start the shell. The easiest way to achieve this is to execute "source ~/.bashrc" from your ".profile", then put all your customizations in ".bashrc". The system-generated default startup scripts demonstrate this approach. We recommend that you use these default files as templates.

For more information see the Bash Users' Startup Files: Quick Start Guide and other online resources that explain shell startup. To recover the originals that appear in a newly created account, execute "/usr/local/startup_scripts/install_default_scripts".

Environment Variables

Your environment includes the environment variables and functions defined in your current shell: those initialized by the system, those you define or modify in your account-level startup scripts, and those defined or modified by the modules that you load to configure your software environment. Be sure to distinguish between an environment variable's name (e.g. HISTSIZE) and its value ($HISTSIZE). Understand as well that a sub-shell (e.g. a script) inherits environment variables from its parent, but does not inherit ordinary shell variables or aliases. Use export (in Bash) or setenv (in csh) to define an environment variable.

Execute the "env" command to see the environment variables that define the way your shell and child shells behave.

Pipe the results of env into grep to focus on specific environment variables. For example, to see all environment variables that contain the string GIT (in all caps), execute:
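
login1$ env | grep GIT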

The environment variables PATH and LD_LIBRARY_PATH are especially important. PATH is a colon-separated list of directory paths that determines where the system looks for your executables. LD_LIBRARY_PATH is a similar list that determines where the system looks for shared libraries.

Account-Level Diagnostics

TACC's sanitytool module loads an account-level diagnostic package that detects common account-level issues and often walks you through the fixes. You should certainly run the package's sanitycheck utility when you encounter unexpected behavior. You may also want to run sanitycheck periodically as preventive maintenance. To run sanitytool's account-level diagnostics, execute the following commands:
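
login1$ module load sanitytool   # load the diagnostic package
login1$ sanitycheck              # run the account-level diagnostics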

Execute "module help sanitytool" for more information.

Accessing the Compute Nodes

You connect to Stampede2 through one of four "front-end" login nodes. The login nodes are shared resources: at any given time, there are many users logged into each of these login nodes, each preparing to access the "back-end" compute nodes (Figure 2. Login and Compute Nodes). What you do on the login nodes affects other users directly because you are competing for the same memory and processing power. This is the reason you should not run your applications on the login nodes or otherwise abuse them. Think of the login nodes as a prep area where you can manage files and compile code before accessing the compute nodes to perform research computations. See Good Citizenship for more information.

You can use your command-line prompt, or the "hostname" command, to tell you whether you are on a login node or a compute node. The default prompt, or any custom prompt containing "\h", displays the short form of the hostname (e.g. c401-064). The hostname for a Stampede2 login node begins with the string "login" (e.g. login2.stampede2.tacc.utexas.edu), while compute node hostnames begin with the character "c" (e.g. c401-064.stampede2.tacc.utexas.edu). Note that the default prompts on the compute nodes include the node type (knl, skx or icx) as well. The environment variable TACC_NODE_TYPE, defined only on the compute nodes, also displays the node type. The simplified prompts in the User Guide examples are shorter than Stampede2's actual default prompts.

While some workflows, tools, and applications hide the details, there are three basic ways to access the compute nodes:

  1. Submit a batch job using the sbatch command. This directs the scheduler to run the job unattended when there are resources available. Until your batch job begins it will wait in a queue. You do not need to remain connected while the job is waiting or executing. See Running Jobs for more information. Note that the scheduler does not start jobs on a first come, first served basis; it juggles many variables to keep the machine busy while balancing the competing needs of all users. The best way to minimize wait time is to request only the resources you really need: the scheduler will have an easier time finding a slot for the two hours you need than for the 48 hours you unnecessarily request.

  2. Begin an interactive session using idev or srun. This will log you into a compute node and give you a command prompt there, where you can issue commands and run code as if you were doing so on your personal machine. An interactive session is a great way to develop, test, and debug code. When you request an interactive session, the scheduler submits a job on your behalf. You will need to remain logged in until the interactive session begins.

  3. Begin an interactive session using ssh to connect to a compute node on which you are already running a job. This is a good way to open a second window into a node so that you can monitor a job while it runs.
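
For example, a minimal sketch of each approach (the queue name, idev options, job script name, and compute-node hostname are illustrative only):

login1$ sbatch myjobscript               # 1. submit a batch job to the scheduler
login1$ idev -p development -N 1 -m 60   # 2. request a 60-minute interactive session on one node
login1$ ssh c401-064                     # 3. open a second window into a node already running your job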

Be sure to request computing resources that are consistent with the type of application(s) you are running:

  • A serial (non-parallel) application can only make use of a single core on a single node, and will only see that node's memory.

  • A threaded program (e.g. one that uses OpenMP) employs a shared memory programming model and is also restricted to a single node, but the program's individual threads can run on multiple cores on that node.

  • An MPI (Message Passing Interface) program can exploit the distributed computing power of multiple nodes: it launches multiple copies of its executable (MPI tasks, each assigned unique IDs called ranks) that can communicate with each other across the network. The tasks on a given node, however, can only directly access the memory on that node. Depending on the program's memory requirements, it may not be possible to run a task on every core of every node assigned to your job. If it appears that your MPI job is running out of memory, try launching it with fewer tasks per node to increase the amount of memory available to individual tasks.

  • A popular type of parameter sweep (sometimes called high throughput computing) involves submitting a job that simultaneously runs many copies of one serial or threaded application, each with its own input parameters ("Single Program Multiple Data", or SPMD). The "launcher" tool is designed to make it easy to submit this type of job. For more information:
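
For example, the launcher module's own help text is a good starting point (standard Lmod commands; the module name is taken from the tool's name above):

login1$ module load launcher
login1$ module help launcher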

Figure 2. Login and compute nodes

 

Using Modules to Manage your Environment

Lmod, a module system developed and maintained at TACC, makes it easy to manage your environment so you have access to the software packages and versions that you need to conduct your research. This is especially important on a system like Stampede2 that serves thousands of users with an enormous range of needs. Loading a module amounts to choosing a specific package from among available alternatives:
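
login1$ module load intel          # load the default version of the Intel compiler
login1$ module load intel/18.0.2   # or choose a specific version (version number is illustrative)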

A module does its job by defining or modifying environment variables (and sometimes aliases and functions). For example, a module may prepend appropriate paths to $PATH and $LD_LIBRARY_PATH so that the system can find the executables and libraries associated with a given software package. The module creates the illusion that the system is installing software for your personal use. Unloading a module reverses these changes and creates the illusion that the system just uninstalled the software:
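
login1$ module load ddt     # package name is just an example; loading defines paths and other variables
login1$ module unload ddt   # unloading undoes the changes that the load made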

The module system does more, however. When you load a given module, the module system can automatically replace or deactivate modules to ensure the packages you have loaded are compatible with each other. In the example below, the module system automatically unloads one compiler when you load another, and replaces Intel-compatible versions of IMPI and PETSc with versions compatible with gcc:
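
login1$ module load intel impi petsc   # Intel compiler with Intel-compatible IMPI and PETSc
login1$ module load gcc                # Lmod unloads intel and swaps in gcc-compatible IMPI and PETSc builds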

On Stampede2, modules generally adhere to a TACC naming convention when defining environment variables that are helpful for building and running software. For example, the "papi" module defines TACC_PAPI_BIN (the path to PAPI executables), TACC_PAPI_LIB (the path to PAPI libraries), TACC_PAPI_INC (the path to PAPI include files), and TACC_PAPI_DIR (top-level PAPI directory). After loading a module, here are some easy ways to observe its effects:
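
login1$ module show papi    # list the environment changes the papi module makes
login1$ env | grep PAPI     # display the TACC_PAPI_* variables now defined in your environment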

To see the modules you currently have loaded:
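
login1$ module list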

To see all modules that you can load right now because they are compatible with the currently loaded modules:
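
login1$ module avail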

To see all installed modules, even if they are not currently available because they are incompatible with your currently loaded modules:
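
login1$ module spider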

To filter your search:
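
login1$ module spider slepc          # all modules whose names contain "slepc" (package name and version below are illustrative)
login1$ module spider slepc/3.7.4    # details on a specific version, including any modules you must load first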

Among other things, the latter command will tell you which modules you need to load before the module is available to load. You might also search for modules that are tagged with a keyword related to your needs (though your success here depends on the diligence of the module writers). For example:
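
login1$ module keyword performance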

You can save a collection of modules as a personal default collection that will load every time you log into Stampede2. To do so, load the modules you want in your collection, then execute:
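
login1$ module save    # save the currently loaded modules as your personal default collection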

Two commands make it easy to return to a known, reproducible state:
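
login1$ module reset      # return to the system default module collection
login1$ module restore    # load your personal default collection, if you have saved one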

On TACC systems, the command "module reset" is equivalent to "module purge; module load TACC". It's a safer, easier way to get to a known baseline state than issuing the two commands separately.

Help text is available for both individual modules and the module system itself:
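
login1$ module help papi   # help text for an individual module (papi is just an example)
login1$ module help        # help text for the module system itself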

See Lmod's online documentation for more extensive documentation. The online documentation addresses the basics in more detail, but also covers several topics beyond the scope of the help text (e.g. writing and using your own module files).

It's safe to execute module commands in job scripts. In fact, this is a good way to write self-documenting, portable job scripts that produce reproducible results. If you use "module save" to define a personal default module collection, it's rarely necessary to execute module commands in shell startup scripts, and it can be tricky to do so safely. If you do wish to put module commands in your startup scripts, see Stampede2's default startup scripts for a safe way to do so.

Citizenship

You share Stampede2 with many, sometimes hundreds, of other users, and what you do on the system affects others. All users must follow a set of good practices which entail limiting activities that may impact the system for other users. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it.

TACC staff has developed the following guidelines for good citizenship on Stampede2. Please familiarize yourself especially with the first two mandates. The sections that follow discuss best practices for limiting and minimizing I/O activity and file transfers, and conclude with job submission tips to help minimize wait times in the queues.

Do Not Run Jobs on the Login Nodes

Stampede2's few login nodes are shared among all users. Dozens (sometimes hundreds) of users may be logged on at one time, all accessing the same file systems. Think of the login nodes as a prep area where users may edit and manage files, compile code, issue file transfers, and submit and track batch jobs. The login nodes provide an interface to the "back-end" compute nodes.

The compute nodes are where actual computations occur and where research is done. Hundreds of jobs may be running on all compute nodes, with hundreds more queued up to run. All batch jobs and executables, as well as development and debugging sessions, must be run on the compute nodes. To access compute nodes on TACC resources, one must either submit a job to a batch queue or initiate an interactive session using the idev utility.

A single user running computationally expensive or disk-intensive tasks will negatively impact performance for other users. Running jobs on the login nodes is one of the fastest routes to account suspension. Instead, run on the compute nodes via an interactive session (e.g., idev) or by submitting a batch job.

Do not run jobs or perform intensive computational activity on the login nodes or the shared file systems.
Your account may be suspended and you will lose access to the queues if your jobs are impacting other users.

Dos & Don'ts on the Login Nodes

  • Do not run research applications on the login nodes; this includes frameworks like MATLAB and R, as well as computationally or I/O intensive Python scripts. If you need interactive access, use the idev utility or Slurm's srun to schedule one or more compute nodes.

    DO THIS: Start an interactive session on a compute node and run MATLAB there (see the example after this list).

    DO NOT DO THIS: Run Matlab or other software packages on a login node

  • Do not launch too many simultaneous processes; while it's fine to compile on a login node, a command like "make -j 16" (which compiles on 16 cores) may impact other users.

    DO THIS: build and submit a batch job. All batch jobs run on the compute nodes.

    DO NOT DO THIS: Invoke multiple build sessions.

    DO NOT DO THIS: Run an executable on a login node.

  • That script you wrote to poll job status should probably do so once every few minutes rather than several times a second.
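
For example, a minimal sketch of the "do this" patterns above (the compute-node hostname and job script name are illustrative, and MATLAB may require loading a module first):

login1$ idev                   # start an interactive session on a compute node
c123-456$ module load matlab   # on the compute node, load the software you need (if a module is available)
c123-456$ matlab               # run MATLAB there rather than on the login node
login1$ sbatch myjobscript     # or build and submit a batch job; all batch jobs run on the compute nodes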

Do Not Stress the Shared File Systems

The TACC Global Shared File System, Stockyard, is mounted on most TACC HPC resources as the /work ($WORK) directory. This file system is accessible to all TACC users and therefore experiences a great deal of I/O activity (reading and writing to disk, opening and closing files) as users run their jobs and read and generate data, including intermediate and checkpoint files. As TACC adds more users, the stress on the $WORK file system has increased to the point that TACC staff now recommend new job submission guidelines to reduce the load on Stockyard.

TACC staff now recommends that you run your jobs out of the $SCRATCH file system instead of the global $WORK file system.

To run your jobs out of $SCRATCH (a short example follows this list):

  • Copy or move all job input files to $SCRATCH

  • Make sure your job script directs all output to $SCRATCH

  • Once your job is finished, move your output files to $WORK to avoid any data purges.
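
A minimal sketch of this staging pattern (directory, file, and script names are illustrative):

login1$ cd $SCRATCH                        # work in $SCRATCH, not $WORK or $HOME
login1$ cp $WORK/myproject/input.dat .     # stage input files into $SCRATCH before the job
login1$ sbatch myjobscript                 # the job script should read and write only under $SCRATCH
login1$ cp results.dat $WORK/myproject/    # when the job finishes, copy results back to $WORK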

Compute nodes should not reference $WORK unless it's to stage data in/out only before/after jobs.

Consider that $HOME and $WORK are for storage and keeping track of important items. Actual job activity, reading and writing to disk, should be offloaded to your resource's $SCRATCH file system (see the File System Usage Recommendations table below). You can start a job from anywhere, but the actual work of the job should occur only on the $SCRATCH partition. You can save original items to $HOME or $WORK so that you can copy them over to $SCRATCH if you need to re-generate results.

More File System Tips

  • Don't run jobs in your $HOME directory. The $HOME file system is for routine file management, not parallel jobs.

  • Watch all your file system quotas. If you're near your quota in $WORK and your job is repeatedly trying (and failing) to write to $WORK, you will stress that file system. If you're near your quota in $HOME, jobs run on any file system may fail, because all jobs write some data to the hidden $HOME/.slurm directory.

  • Avoid storing many small files in a single directory, and avoid workflows that require many small files. A few hundred files in a single directory is probably fine; tens of thousands is almost certainly too many. If you must use many small files, group them in separate directories of manageable size.

  • TACC resources, with a few exceptions, mount three file systems: /home, /work and /scratch. Please follow each file system's recommended usage.

File System Usage Recommendations

FILE SYSTEM

BEST STORAGE PRACTICES

BEST ACTIVITIES
