OOKAMI - Stony Brook

Introduction

Ookami is a computer technology testbed supported by the National Science Foundation under grant OAC 1927880. It provides researchers with access to the A64FX processor, developed by RIKEN and Fujitsu for the Japanese path to exascale computing and currently deployed in Japan's fastest computer, Fugaku. Ookami is the first such computer outside of Japan. By focusing on crucial architectural details, the ARM-based, multi-core, 512-bit SIMD-vector processor with ultrahigh-bandwidth memory promises to retain familiar and successful programming models while achieving very high performance for a wide range of applications. It supports a wide range of data types and enables both HPC and big data applications.

The Ookami HPE (formerly Cray) Apollo 80 system has 174 A64FX compute nodes, each with 32GB of high-bandwidth memory and a 512GB SSD. This amounts to about 1.5 million node-hours per year. A high-performance Lustre filesystem provides about 800 TB of storage.

To let users explore current computer technologies and contrast their performance and programmability with the A64FX, Ookami also includes:

  • 1 node with dual-socket AMD Milan (64 cores), 512 GB memory, and 2 NVIDIA V100 GPUs

  • 2 nodes with dual-socket ThunderX2 (64 cores), each with 256 GB memory

  • 1 node with dual-socket Intel Skylake (36 cores) with 192 GB memory

Ookami is an ACCESS Level 2 service provider; since October 2022, 90% of its resources have been allocated via ACCESS.

Account Administration

Obtaining an Account

As an ACCESS computing resource, Ookami is accessible to ACCESS users who receive an allocation on the system. To obtain an account, submit a request through the ACCESS Allocation Request System. If you need help submitting an allocation request, please open a support ticket.

Once your allocation is granted, the Ookami team will contact you via email with detailed instructions for your account. If you have any questions, please contact ookami_computer@stonybrook.edu.

Logging In

Please see the Logging Into RP Resources page, which describes the procedures for logging into the different ACCESS Resource Providers (RPs). Ookami supports the SSH (Secure Shell) mechanism for logging in, which uses SSH keys. If you need help creating or uploading your SSH keys, please see the Managing SSH Public Keys page for detailed information on how to do so.

Configuring Your Account

  • The default shell is bash.

  • Default environment variables are set via the .bashrc and .bash_profile files.

  • The Environment Modules system (v4.4.0) is used to control environment variables needed to access software.

  • Users may see a list of available modules using the "module avail" command and can load a particular module with "module load name/version", as shown below.

    • For example, module load gcc/11.2.0 loads GCC v11.2.0
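A typical module workflow looks like this (standard Environment Modules commands; the GCC version follows the example above and may differ on the system):

module avail                # list modules available on this node
module load gcc/11.2.0      # load GCC v11.2.0
module list                 # show currently loaded modules
module unload gcc/11.2.0    # unload the module when finished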

System Architecture

Compute Nodes

There are 174 Fujitsu A64FX compute nodes available, each with 32GB of high-bandwidth memory (users may use up to 27GB; the remaining 5GB is reserved for the OS) and an additional 512GB SSD. In addition, 2 A64FX nodes are available as debug nodes (fj-debug1 and fj-debug2).

Table 1. Compute Node Specifications

MODEL                      Fujitsu A64FX model 700
NUMBER OF NODES            174 compute nodes available via Slurm;
                           2 debug nodes (fj-debug1 and fj-debug2) available via ssh from the login nodes
TOTAL CORES PER NODE       48
HARDWARE THREADS PER CORE  1 thread per core
HARDWARE THREADS PER NODE  48 cores per node x 1 thread per core = 48 threads per node
CLOCK RATE                 1.8GHz
RAM                        32GB @ 1TB/s
CACHE                      64KB L1 cache per core (256-byte cache line);
                           8MB L2 cache shared between all cores (256-byte cache line);
                           no L3 cache

Login Nodes

There are two login nodes (login1, login2). These are dual-socket ThunderX2 nodes (64 cores), each with 256 GB memory. Connections to Ookami are distributed round-robin between login1 and login2. The login nodes provide an external interface to the Ookami computing cluster. They are intended for preparing submission scripts for the batch queue, submitting and monitoring jobs in the batch queue, analyzing results, and moving data. They are NOT appropriate for running computational jobs.

Specialized Nodes

There are two dedicated debug nodes (fj-debug1, fj-debug2), which are the same Fujitsu A64FX nodes as the compute nodes. Users can ssh from the login nodes to the debug nodes without allocating them (ssh fj-debug1 or ssh fj-debug2). These nodes are dedicated to compiling, debugging, and testing. Note that they are shared between users at all times.

There is one node with dual-socket Intel Skylake (36 cores) and 192 GB memory (fj-skylake), and one node with dual-socket AMD Milan (64 cores), 512 GB memory, and 2 NVIDIA V100 GPUs (fj-epyc). Since there is only one node of each type, they are intended mainly for testing.

Network

Networking at Stony Brook University is provided by a redundant 100 Gigabit connection to Internet2, connecting to ESNET, as well as Amazon, Microsoft, & Google, at 100 Gigabits/sec at 32 Avenue of the Americas in New York City. The CEWIT Data Center and the campus network operations center (NOC) are connected via single-mode fiber, with 2 pairs of links providing redundant connectivity at 2x 100 Gigabits/sec. Cybersecurity is provided through a layered approach, including a high-availability pair of firewalls that can support up to 200 Gigabits/sec. The SBU data network is funded through campus funds.

File Systems

Ookami uses the Lustre parallel file system. The total available storage is around 800 TB.

Table 2. Ookami File Systems

FILE SYSTEM   QUOTA      DETAILS
$HOME         30GB       Backed up; not shareable; never cleared
$SCRATCH      30TB       Not backed up; not shareable; cleared monthly
$PROJECT_DIR  Up to 8TB  Available per request; backed up; shareable; cleared per request

Accessing the System

You may access the Ookami login nodes using the command line from any modern workstation via secure shell (SSH).

Linux and MacOS

On Linux or macOS, simply open your favorite terminal program and SSH to the Ookami login node with X11 forwarding enabled by issuing the command:

ssh -X NetID@login.ookami.stonybrook.edu

Windows

MobaXterm Home Edition may be freely downloaded and installed by Ookami users, as long as multiple individuals are not using the same installation. MobaXterm comes with its own X server, so no additional utilities are required to enable X11 tunneling. Log in to Ookami by clicking the "New Session" button and providing the hostname (login.ookami.stonybrook.edu) and your username.

DUO Authentication

When you attempt to access the login node by following the above methods, you will receive a notification on your DUO-enrolled device. To finish logging in, please view the DUO notification and approve the login attempt by selecting the green check mark.

If you have not already set up DUO, please refer to our FAQ page on enrolling in DUO first.

DUO_PASSCODE

You can make the DUO authentication process slightly quicker by using the DUO_PASSCODE environment variable. It lets you pre-select the type of DUO authentication you want instead of selecting it manually every time. For example, if you always want a DUO push to your phone, set DUO_PASSCODE to push and you won't have to type '1' every time you log in. This variable can also fix some issues with SCP/SFTP and other file-transfer software.

Here are the possible values for the DUO_PASSCODE variable:

  • push: Push a login request to your device.

  • phone: Authenticate via phone callback.

  • sms: Get a new batch of SMS passcodes. Your login attempt fails; log in again with one of your new passcodes.

  • A numeric passcode: Log in using a passcode, either generated with Duo Mobile, sent via SMS, generated by your hardware token, or provided by an administrator.

You can also add a number to the end of these factor names if you have more than one device registered. For example, push2 will send a login request to your second phone, phone3 will call your third phone, etc.

You can set the DUO_PASSCODE variable by appending a line to your Ookami ~/.bashrc like so:

echo 'export DUO_PASSCODE=push' >> ~/.bashrc

If this does not work, please check the caveat on our DUO and LD_LIBRARY_PATH page. You may need to change the order of commands in your .bashrc file.

Additionally, please do not set DUO_PASSCODE to sms in your .bashrc, or you will be unable to log in to Ookami unless you connect through the VPN (see "VPN Access" below). The sms method sends you one-time codes, but you must then set DUO_PASSCODE to one of those codes, which you cannot do if the variable is set in your .bashrc on Ookami. Instead, set it on the client side, for example in your MobaXterm session configuration.

On Mac and Linux, you can modify your ~/.ssh/config file to include this setting:

Host *.ookami.stonybrook.edu
    SendEnv DUO_PASSCODE

And then set DUO_PASSCODE from your terminal before you log in:
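export DUO_PASSCODE=push    # or phone, sms, or a numeric passcode
ssh -X NetID@login.ookami.stonybrook.edu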

When reporting a problem to the help desk, please execute the ssh command with the -vvv option and include the verbose output in your problem description.

Citizenship

You share Ookami with a lot of other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb.

  • Don't run jobs on the login nodes

  • Don't run huge jobs on the debug nodes

  • You can use the login and debug nodes for compiling, but remember that these are shared nodes, so don't use all cores

  • Don't stress the filesystem

  • Use the Slack channel or ticketing system for your questions

  • Include important details in your ticket (e.g., which modules you are using)

Managing and Transferring Files

Transferring Your Files

DUO Authentication

Just like when you log in to Ookami, transferring files will also require Two-Factor Authentication via DUO. However, some file transferring software will initiate many separate connections to Ookami, which can generate lots of DUO pushes and lock your DUO account. If this happens, you'll receive an email saying that your account is locked, and you must reply to confirm that it should be unlocked. To avoid this, connect to Stony Brook's VPN before making any connections to Ookami. Please see DoIT's VPN Homepage for more information on requesting a VPN account and setting up a connection. You'll need to authenticate once with DUO to connect to the VPN, but after this, all connections you make to Ookami through the VPN will bypass DUO.

If you're on campus and connected to WolfieNet-Secure, however, you won't be able to connect to Stony Brook's VPN. One solution is to use WolfieNet-Guest to connect to the VPN. But if you want the added security and speed of WolfieNet-Secure, you'll have to use a method of transferring files that won't cause DUO to spam you with authentication requests. See our recommendations below.

If you experience any problems that involve endless hanging, lost connections, and/or lack of DUO pushes while attempting to transfer files, try setting the DUO_PASSCODE variable in your ~/.bashrc. Our Logging In FAQ page has more information about this variable and how to set it. You can also set a default DUO device and action when logging in non-interactively (i.e. using sftp, scp, or similar software) by visiting Stony Brook's DUO self service portal.

Windows

MOBAXTERM

We recommend using MobaXterm for transferring files to and from a Windows machine. After starting MobaXterm, select "New Session" and choose "SFTP".

After providing your login information, you will see your local file system on the left, while your home directory on the cluster is shown on the right.

To transfer files back and forth, simply navigate to the appropriate local and remote directories on each side of the screen, and then drag and drop.

You may notice that when using MobaXterm for an SSH session to Ookami, a small sidebar will appear on the left side of your screen with an SFTP browser. We recommend that you do not use this browser unless you are using a VPN. Without a VPN, it will send you DUO pushes every time you upload or download a file, but creating a full SFTP session as described above will not.

WINSCP

Another program that allows you to transfer files to and from Ookami using a graphical user interface is WinSCP, which is freely available for download. To start an SFTP session, choose the SFTP file protocol and provide the hostname (login.ookami.stonybrook.edu) and your username.

You must use DUO to authenticate, unless you are connected to Stony Brook's VPN. Once connected, ensure you keep the Transfer Settings set to default to avoid having to authenticate with DUO again. You may then transfer files by clicking and dragging from your local machine's directory to a directory on Ookami.

MacOS/Linux

MacOS/Linux users who access Ookami via a terminal program can transfer files back and forth using either the sftp or scp command line tools; example commands are shown below. Please note that without Stony Brook's VPN, every individual scp command you run will require DUO authentication. For this reason, we recommend using scp for one-time transfers of large files or directories.

One way to minimize the number of DUO authentications needed is to copy multiple files with one scp command:
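scp file1.txt file2.txt NetID@login.ookami.stonybrook.edu:~/    # file names and destination are illustrative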

or using wildcards to transfer all files of a particular type (in this case, .txt files):
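scp *.txt NetID@login.ookami.stonybrook.edu:~/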

If you need to make frequent small transfers back and forth as you work, we recommend keeping an sftp session open rather than running several separate scp commands.

Sharing Files with Collaborators

You can request a shared project directory (up to 8TB). This will be located in /lustre/projects/group-name.

Building Software

There are four compilers available for A64FX: Fujitsu (module fujitsu/compiler/version), Cray (module CPE/version), Arm (module arm-modules/version), and GCC (module gcc/version). These can be loaded via the corresponding modules. You can compile natively either on the compute nodes (via Slurm) or on the debug nodes. The login nodes (ThunderX2) can be used for cross compiling.
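For example, a minimal native build with GCC on a debug node might look like this (the module version follows the earlier example; check module avail for currently installed versions):

module load gcc/11.2.0
gcc -O3 -mcpu=a64fx -fopenmp hello.c -o hello    # flags from Table 3 below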

Table 3. Compilers and Flags for A64FX

FLAGS

FEATURE                        CRAY        ARM                             GCC                    FUJITSU
Optimization                   -O3         -O3 or -Ofast                   -O3 or -Ofast          -Kfast
Vectorization                  -h vector3  -mcpu=a64fx -armpl              -mcpu=a64fx            -KSVE
Vectorization report           -h msgs     -Rpass=loop-vectorize           -fopt-info-vec         -Koptmsg=2 -Nlst=t (creates a *.lst file with optimization information)
Report on missed optimization  -h negmsgs  -Rpass-analysis=loop-vectorize  -fopt-info-vec-missed
OpenMP                         -h omp      -fopenmp                        -fopenmp               -Kopenmp
Debugging                      -G 2        -ggdb                           -ggdb                  -g
Large memory                   -h pic      -mcmodel=large                  -mcmodel=large         -mcmodel=large

COMPILER COMMANDS

LANGUAGE  CRAY  ARM         GCC       FUJITSU
C         cc    armclang    gcc       fcc
C++       CC    armclang++  g++       FCC
Fortran   ftn   armflang    gfortran  frt

General note: most codes build out of the box, but getting good performance on A64FX usually requires more work. The compiler makes a huge difference in performance. In general, Cray and Fujitsu deliver the best performance. Arm delivers competitive performance and fully supports current language standards. GCC optimizes for SVE and A64FX and sometimes generates the best performance, but it cannot optimize math functions, which for most codes leads to a large performance loss.

For the other node types and GPUs, the Intel and NVIDIA compilers are also available.

Software

All installed software is available via modules. The module avail command shows all available modules. Note that this command generally shows only those modules that are available for the architecture of the node you are currently on (e.g., Intel modules are only available on fj-skylake). The only exception is the login nodes, where all modules are listed regardless of architecture, organized into folders (e.g., aarch64, x86_64) that indicate on which nodes they are available. This lets users quickly check which modules are available without connecting to a specific node.

Software requiring a license can be used by groups who provide a valid license. If you have a license for specific software and want to use it on Ookami, please submit a ticket.

Running Jobs

Job Accounting

Ookami's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour). All partitions have the same charge rate.

Ookami SUs billed = (# nodes) x (job duration in wall clock hours)
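For example, a job that runs on 4 nodes for 2.5 hours of wall clock time is charged 4 x 2.5 = 10 SUs.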

The Slurm scheduler tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. If your job finishes early and exits properly, Slurm will release the nodes back into the pool of available nodes. Your job will only be charged for as long as you are using the nodes.

Ookami does not implement node-sharing on any compute resource. Each node can be assigned to only one user at a time; hence a complete node is dedicated to a user's job and accrues wall-clock time for all the node's cores whether or not all cores are used.

Tip: your queue wait times will be shorter if you request only the time you need: the scheduler will have a much easier time finding a slot for the 2 hours you really need than for, say, the 12 hours requested in your job script.

Job Scheduler

Ookami's job scheduler is the Slurm Workload Manager (https://www.schedmd.com/). Slurm commands enable you to submit, manage, monitor, and control your jobs.

Accessing the Compute Nodes

You connect to Ookami through one of two login nodes. The login nodes are shared resources: at any given time, there are many users logged into each of these login nodes, each preparing to access the compute nodes. What you do on the login nodes affects other users directly because you are competing for the same memory and processing power. This is the reason you should not run your applications on the login nodes or otherwise abuse them. Think of the login nodes as a prep area where you can manage files and compile code before accessing the compute nodes to perform research computations.

You can use the hostname command to tell whether you are on a login node or a compute node. The login nodes are named login1 and login2. The A64FX nodes are named fj-debug1, fj-debug2, and fjXXX, where XXX is the number of the node (ranging from 001 to 174).

Interactive Session

The Slurm scheduler allows running an interactive shell on the compute nodes. The slurm module is loaded by default; if you unloaded it, you can always reload it via:
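module load slurm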

To enter an interactive session, use the srun command with the --pty option. At a minimum, request a node and a pseudo-terminal, for example:
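srun -N 1 -p short --pty bash    # minimal sketch: one node in the short queue, interactive bash shell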

You can pass the same additional options to srun as you would in your Slurm job script files. Some useful options are:
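  • -N / --nodes: number of nodes

  • -n / --ntasks: total number of tasks

  • --ntasks-per-node: number of tasks per node

  • -t / --time: wall clock time limit

  • -p / --partition: the queue to submit to (see Table 4)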

For an interactive job using 1 node and 24 tasks per node with a 4-hour run time in the short queue, this would look like:
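srun -N 1 --ntasks-per-node=24 -t 4:00:00 -p short --pty bash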

Upon initializing the interactive shell, you will be taken away from the login node.

All of your environment variables from the login node will be copied to your interactive shell (just as when you submit a job). This means all of your modules will still be loaded and you will remain in the same working directory as before. You can immediately run your program for testing. All contents sent to stdout will be printed directly to the terminal unless otherwise directed.

Batch Jobs

Job submission is handled via the Slurm workload manager. Example submission scripts can be found in the section "Sample Job Scripts".

SSH From a Login Node Directly to a Compute Node

This is possible once you have allocated a node, either via an interactive job or via a batch job. You can then simply ssh to the node using ssh fjXXX, where XXX is the node you wish to connect to. You can find your allocated node(s) using squeue -u $USER.

Be sure to request computing resources that are consistent with the type of application(s) you are running:

  • A serial (non-parallel) application can only make use of a single core on a single node, and will only see that node's memory.

  • A threaded program (e.g. one that uses OpenMP) employs a shared memory programming model and is also restricted to a single node, but the program's individual threads can run on multiple cores on that node.

  • An MPI (Message Passing Interface) program can exploit the distributed computing power of multiple nodes: it launches multiple copies of its executable (MPI tasks, each assigned unique IDs called ranks) that can communicate with each other across the network. The tasks on a given node, however, can only directly access the memory on that node. Depending on the program's memory requirements, it may not be possible to run a task on every core of every node assigned to your job. If it appears that your MPI job is running out of memory, try launching it with fewer tasks per node to increase the amount of memory available to individual tasks.

Table 4. Partitions (Queues)

QUEUE NAME  NODE TYPE  MIN NODES PER JOB  MAX NODES PER JOB*  MAX DURATION  CHARGE RATE (PER NODE-HOUR)
short       A64FX      1                  32                  4 hours       1 SU (Service Unit)
medium      A64FX      8                  40                  12 hours      1 SU
large       A64FX      24                 80                  8 hours       1 SU
long        A64FX      1                  8                   2 days        1 SU
extended    A64FX      1                  2                   7 days        1 SU
all-nodes   A64FX      81                 174                 4 hours       1 SU

* The maximum number of nodes a user can use at the same time is 120 (the all-nodes queue is excluded from this restriction to allow for full-system runs).

The maximum number of nodes a user can use at the same time in the extended queue is 20.

Sample Job Scripts

Example Serial Job Script

Example serial "Hello World" job script using 1 node and 1 core:
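A minimal sketch (job name, output file, time limit, and partition are illustrative):

#!/bin/bash
#SBATCH --job-name=hello_serial
#SBATCH --output=hello_serial.%j.out
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 00:10:00
#SBATCH -p short

echo "Hello World from $(hostname)"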

Example MPI Job Script

Example MPI "Hello World" job script using 4 nodes and 48 cores:
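A minimal sketch, using 12 MPI ranks on each of the 4 nodes (48 ranks in total); the MPI module name and executable are illustrative:

#!/bin/bash
#SBATCH --job-name=hello_mpi
#SBATCH --output=hello_mpi.%j.out
#SBATCH -N 4
#SBATCH --ntasks-per-node=12
#SBATCH -t 00:10:00
#SBATCH -p short

module load openmpi    # illustrative; load an MPI module available on Ookami
mpirun ./hello_mpi     # or srun ./hello_mpi, depending on the MPI stack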

Example OpenMP Job Script

Example OpenMP "Hello World" job script using 1 node and 48 cores:
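A minimal sketch; the executable is illustrative (e.g. compiled with the -fopenmp flag from Table 3):

#!/bin/bash
#SBATCH --job-name=hello_omp
#SBATCH --output=hello_omp.%j.out
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --cpus-per-task=48
#SBATCH -t 00:10:00
#SBATCH -p short

export OMP_NUM_THREADS=48
./hello_omp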

Example Hybrid Job Script

Example Hybrid "Hello World" job script using 4 nodes, 4 MPI ranks per node and 12 OpenMP threads per rank:
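A minimal sketch (4 ranks x 12 threads = 48 cores per node); the MPI module name and executable are illustrative:

#!/bin/bash
#SBATCH --job-name=hello_hybrid
#SBATCH --output=hello_hybrid.%j.out
#SBATCH -N 4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH -t 00:10:00
#SBATCH -p short

export OMP_NUM_THREADS=12
module load openmpi      # illustrative; load an MPI module available on Ookami
mpirun ./hello_hybrid    # or srun ./hello_hybrid, depending on the MPI stack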

Example Array Job Script

Example Array "Hello World" job script using an array of 5 jobs:
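A minimal sketch (time limit and partition are illustrative):

#!/bin/bash
#SBATCH --job-name=hello_array
#SBATCH --output=hello_array.%A_%a.out
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 00:10:00
#SBATCH -p short
#SBATCH --array=1-5

echo "Hello World from array task ${SLURM_ARRAY_TASK_ID}"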

Job Management

Monitoring Queue Status:

squeue lists all jobs

squeue -u $USER lists all your jobs

squeue -u $USER --start predicts the starting time of your queued jobs

Monitoring Job Status:

When monitoring jobs via squeue, their status is listed next to them. The most common status codes are:

Table 5. Job Status Codes

JOB STATE        DESCRIPTION
PD (Pending)     The job is waiting in a queue for allocation of resources
R (Running)      The job is currently allocated to a node and running
CG (Completing)  The job is finishing but some processes are still active

Containers

The Singularity container platform is available on Ookami. Users can run Singularity containers by loading the singularity/3.7.1 module.
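For example (the container image is illustrative):

module load singularity/3.7.1
singularity pull docker://ubuntu:22.04                   # download an image and convert it to a .sif file
singularity exec ubuntu_22.04.sif cat /etc/os-release    # run a command inside the container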

Protected Data

Presently, the Ookami cluster has NOT been approved for HIPAA data or any data associated with privacy or liability concerns. Consequently, use of this system to process ePHI or other data that falls under the purview of HIPAA and privacy guidelines is in violation of the act.

Help

You can get help via the Ookami Slack channel, the ticketing system, or by joining the virtual office hours (Tuesdays 10 am to noon EST, Thursdays 2 to 4 pm EST).

References

Further information can be found on the Ookami website.