ACES (Texas A&M)

Apply for an Account

If you do not already have one, first apply for an ACCESS account.

https://identity.access-ci.org/new-user

The next step is to submit an ACES application at:

https://forms.gle/mnY4dxxH3D1A5j5N8

After your ACES application is approved, you will receive an email with further instructions.

SSH Access

SSH access for XSEDE/ACCESS users

As of August 31st, login to ACES Phase I for XSEDE/ACCESS users has transitioned from the XSEDE SSO hub to the FASTER jump host:

ssh -J fasterusername@faster-jump.hprc.tamu.edu:8822 fasterusername@login.faster.hprc.tamu.edu
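If you log in frequently, you can optionally store the jump-host settings in your local ~/.ssh/config so that a plain ssh invocation works. This is a convenience sketch, not an HPRC-documented setup: the Host alias faster is an arbitrary name, and fasterusername stands for your own username.

cat >> ~/.ssh/config <<'EOF'
Host faster
    HostName login.faster.hprc.tamu.edu
    User fasterusername
    ProxyJump fasterusername@faster-jump.hprc.tamu.edu:8822
EOF
ssh faster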

 

Hardware Summary

Component | Quantity | Description
Graphcore IPU | 16 | 16 Colossus GC200 IPUs and dual AMD Rome CPU server on a 100 GbE RoCE fabric
Intel FPGA PAC D5005 | 2 | FPGA SoC with Intel Stratix 10 SX FPGAs, 64-bit quad-core Arm Cortex-A53 processors, and 32 GB DDR4
Intel Optane SSDs | 8 | 3 TB of Intel Optane SSDs addressable as memory using MemVerge Memory Machine


Graphcore IPUs

From one of the FASTER login nodes, ssh into the poplar1 system:

[username@faster ~]$ ssh poplar1

Set up the Poplar SDK environment

In this step, set up several environment variables to use the Graphcore tools and Poplar graph programming framework.

[username@poplar1 ~]$ source /opt/gc/poplar/poplar_sdk-ubuntu_18_04-[ver]/poplar-ubuntu_18_04-[ver]/enable.sh
[username@poplar1 ~]$ source /opt/gc/poplar/poplar_sdk-ubuntu_18_04-[ver]/popart-ubuntu_18_04-[ver]/enable.sh

[ver] indicates the version number of the package.

Example commands with an existing version on FASTER:

source /opt/gc/poplar/poplar_sdk-ubuntu_18_04-2.5.1+1001-64add8f33d/poplar-ubuntu_18_04-2.5.0+4748-e94d646535/enable.sh
source /opt/gc/poplar/poplar_sdk-ubuntu_18_04-2.5.1+1001-64add8f33d/popart-ubuntu_18_04-2.5.1+4748-e94d646535/enable.sh
Also create a local cache directory for compiled executables and point the Poplar and PopTorch caches at it (the commented-out lines optionally enable verbose logging):

mkdir -p /localdata/$USER/tmp
export TF_POPLAR_FLAGS=--executable_cache_path=/localdata/$USER/tmp
export POPTORCH_CACHE_DIR=/localdata/$USER/tmp
# export POPLAR_LOG_LEVEL=INFO
# export POPLIBS_LOG_LEVEL=INFO

Set up framework environments for the IPU

PyTorch (Poptorch)

Set up PyTorch (Poptorch)

The local home directory is small (300 GB total). You can store large files in /localdata/username (or use the localdata symlink in your home directory); /localdata has 3.5 TB available.

[username@poplar1 ~]$ cd localdata
[username@poplar1 localdata]$ virtualenv -p python3 poptorch_test
[username@poplar1 localdata]$ source poptorch_test/bin/activate
[username@poplar1 localdata]$ python -m pip install -U pip
[username@poplar1 localdata]$ python -m pip install <sdk_path>/poptorch_x.x.x.whl

For <sdk_path>/poptorch_x.x.x.whl, you can use /opt/gc/poplar/poplar_sdk-ubuntu_18_04-2.5.1+1001-64add8f33d/poptorch-2.5.0+62288_0f4af0bf32_ubuntu_18_04-cp36-cp36m-linux_x86_64.whl, which exists on FASTER.
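For example, with the SDK release above:

python -m pip install /opt/gc/poplar/poplar_sdk-ubuntu_18_04-2.5.1+1001-64add8f33d/poptorch-2.5.0+62288_0f4af0bf32_ubuntu_18_04-cp36-cp36m-linux_x86_64.whl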

Clone a copy of the Graphcore tutorials repository and change the directory to mnist
[username@poplar1 localdata]$ git clone https://github.com/graphcore/tutorials.git
[username@poplar1 localdata]$ cd tutorials/simple_applications/pytorch/mnist/
Install the dependencies and run the model
[username@poplar1 mnist]$ pip install -r requirements.txt
[username@poplar1 mnist]$ python mnist_poptorch.py
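To confirm that the IPUs are visible and to watch their utilization while the model runs, you can use Graphcore's gc-monitor utility; this sketch assumes the Graphcore device tools are on your PATH after sourcing the SDK enable scripts:

gc-monitor              # one-shot listing of IPUs and the processes attached to them
watch -n 5 gc-monitor   # refresh the listing every 5 seconds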

TensorFlow 1

Set up TensorFlow 1 for IPU

The local home directory is small (300 GB total). You can store large files in /localdata/NetID (or use the localdata symlink in your home directory); /localdata has 3.5 TB available.

[username@poplar1 ~]$ cd localdata
[username@poplar1 localdata]$ virtualenv venv_tf1 -p python3.6
[username@poplar1 localdata]$ source venv_tf1/bin/activate
[username@poplar1 localdata]$ python -m pip install <sdk_path>/tensorflow_x.x.x.whl

For <sdk_path>/tensorflow_x.x.x.whl, you can use /opt/gc/poplar/poplar_sdk-ubuntu_18_04-2.5.1+1001-64add8f33d/tensorflow-1.15.5+gc2.5.1+193128+c9005c133f4+amd_znver1-cp36-cp36m-linux_x86_64.whl, which exists on FASTER.
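For example, with the SDK release above:

python -m pip install /opt/gc/poplar/poplar_sdk-ubuntu_18_04-2.5.1+1001-64add8f33d/tensorflow-1.15.5+gc2.5.1+193128+c9005c133f4+amd_znver1-cp36-cp36m-linux_x86_64.whl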

Clone a copy of the Graphcore tutorials repository and change the directory to mnist
[username@poplar1 localdata]$ git clone https://github.com/graphcore/tutorials.git
[username@poplar1 localdata]$ cd tutorials/simple_applications/tensorflow/mnist/
Run the model
[username@poplar1 mnist]$ python mnist.py

TensorFlow 2

Set up TensorFlow 2 for IPU

The local home directory is small (300 GB total). You can store large files in /localdata/NetID (or use the localdata symlink in your home directory); /localdata has 3.5 TB available.

[username@poplar1 ~]$ cd localdata
[username@poplar1 localdata]$ virtualenv venv_tf2 -p python3.6
[username@poplar1 localdata]$ source venv_tf2/bin/activate
[username@poplar1 localdata]$ python -m pip install <sdk_path>/tensorflow_x.x.x.whl

For <sdk_path>/tensorflow_x.x.x.whl, you can use /opt/gc/poplar/poplar_sdk-ubuntu_18_04-2.5.1+1001-64add8f33d/tensorflow-2.5.2+gc2.5.1+193132+4673d3afb3b+amd_znver1-cp36-cp36m-linux_x86_64.whl, which exists on FASTER.
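For example, with the SDK release above:

python -m pip install /opt/gc/poplar/poplar_sdk-ubuntu_18_04-2.5.1+1001-64add8f33d/tensorflow-2.5.2+gc2.5.1+193132+4673d3afb3b+amd_znver1-cp36-cp36m-linux_x86_64.whl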

Clone a copy of the Graphcore tutorials repository and change the directory to tensorflow2/keras/completed_demos
[username@poplar1 localdata]$ git clone https://github.com/graphcore/tutorials.git
[username@poplar1 localdata]$ cd tutorials/tutorials/tensorflow2/keras/completed_demos/
Run the model
[username@poplar1 completed_demos]$ python completed_demo_ipu.py

 

Graphcore Documentation can be found at https://docs.graphcore.ai/en/latest/

Liqid PCIe Card with Intel Optane SSDs

Submit a standard batch job or interactive job to the memverge partition:

srun --partition=memverge --time=24:00:00 --pty bash

Sample job file:

#!/bin/bash
##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example           #Set the job name to Example
#SBATCH --time=24:00:00              #Set the wall clock limit to 24 hrs
#SBATCH --nodes=1                    #Request 1 node
#SBATCH --ntasks-per-node=64         #Request 64 tasks/cores per node
#SBATCH --mem=248G                   #Request 248 GB per node
#SBATCH --output=Example.%j          #Redirect stdout/err to file
#SBATCH --partition=memverge         #Specify the MemVerge partition

#lines required to set up the environment for your code

# add the mm command in front of your executable to run with Memory Machine
mm executable
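To submit the job file, save it under any name (example_mm.slurm below is a hypothetical filename) and use the standard Slurm commands:

sbatch example_mm.slurm   # submit the job; Slurm prints the assigned job ID
squeue -u $USER           # check the status of your queued and running jobs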

Sample job file to run with Singularity:

#!/bin/bash
##NECESSARY JOB SPECIFICATIONS
#SBATCH --job-name=Example           #Set the job name to Example
#SBATCH --time=24:00:00              #Set the wall clock limit to 24 hrs
#SBATCH --nodes=1                    #Request 1 node
#SBATCH --ntasks-per-node=64         #Request 64 tasks/cores per node
#SBATCH --mem=248G                   #Request 248 GB per node
#SBATCH --output=Example.%j          #Redirect stdout/err to file
#SBATCH --partition=memverge         #Specify the MemVerge partition

# Required directories and libraries for MemVerge Memory Machine
export SINGULARITY_BIND='/var/log/memverge,/etc/memverge,/opt/memverge,/var/memverge'
for lib in \
    libblkid.so.1 \
    libcrypto.so.1.1 \
    libc.so.6 \
    libdaxctl.so.1 \
    libdl.so.2 \
    libgcc_s.so.1 \
    libkmod.so.2 \
    liblzma.so.5 \
    libmount.so.1 \
    libm.so.6 \
    libndctl.so.6 \
    libpcre2-8.so.0 \
    libprotobuf-c.so.1 \
    libpthread.so.0 \
    librt.so.1 \
    libselinux.so.1 \
    libssl.so.1.1 \
    libstdc++.so.6 \
    libudev.so.1 \
    libuuid.so.1 \
    libz.so.1 \
; do export SINGULARITY_BIND=$SINGULARITY_BIND,/lib64/$lib:/lib/$lib ; done

# run your Singularity container command, including the mm command for MemVerge Memory Machine
singularity exec filename.sif mm executable

Intel FPGA PAC D5005

The FPGA nodes support both an older OpenCL development workflow and a newer Intel oneAPI workflow.

Access

To access the Intel FPGA PAC D5005, submit an interactive job to the FPGA partition from one of the FASTER login nodes:

srun --partition=fpga --time=24:00:00 --pty bash

Getting Started

Once the session starts, you need to load the environment variables to access and interact with the FPGA on the node:

$ source /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 4.2.46(2)-release
   args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: intelfpgadpcpp -- latest
:: intelpython -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vpl -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

Telemetry for the FPGA can be viewed using the 'fpgainfo' command:

$ fpgainfo
FPGA information utility

Usage:
        fpgainfo [-h] [-B <bus>] [-D <device>] [-F <function>] [-S <socket-id>]
                 {errors,power,temp,fme,port,bmc}

                -h,--help           Print this help
                -B,--bus            Set target bus number
                -D,--device         Set target device number
                -F,--function       Set target function number
                -S,--socket-id      Set target socket number

Subcommands:
Print and clear errors
        fpgainfo errors [-h] [-c] {all,fme,port}
                -h,--help           Print this help
                -c,--clear          Clear all errors
                --force             Retry clearing errors 64 times to clear certain error conditions

Print power metrics
        fpgainfo power [-h]
                -h,--help           Print this help

Print thermal metrics
        fpgainfo temp [-h]
                -h,--help           Print this help

Print FME information
        fpgainfo fme [-h]
                -h,--help           Print this help

Print accelerator port information
        fpgainfo port [-h]
                -h,--help           Print this help

Print all Board Management Controller sensor values
        fpgainfo bmc [-h]
                -h,--help           Print this help

For continuous monitoring, use this command in conjunction with the 'watch' command.
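For example, to refresh the thermal metrics every two seconds:

$ watch -n 2 fpgainfo temp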

To run a status check on the FPGA, run:

$ aocl diagnose

This will display information about the libraries and initialization status of the FPGA device.

If the device shows as "Uninitialized", it can be initialized with a standard image with:

$ aocl initialize acl0 pac_s10
aocl initialize: Running initialize from /opt/intel/oneapi/intelfpgadpcpp/latest/board/intel_s10sx_pac/linux64/libexec
Program succeed.

The FPGA device must be initialized with the image that matches the compilation target of a binary, i.e., if a binary is compiled for "pac_s10", the board must be initialized with the "pac_s10" standard image before running. There are two image options for the "aocl initialize" command:

Name | Description
pac_s10 | Standard Intel FPGA PAC D5005 (Intel Stratix 10 SX) without unified shared memory (USM) support.
pac_s10_usm | Standard Intel FPGA PAC D5005 (Intel Stratix 10 SX) with unified shared memory (USM) support. The device must be initialized with this image if a binary using USM will be run on the FPGA device.

More information regarding unified shared memory can be found in the Unified Shared Memory section of the DPC++ Reference documentation.

If the node has multiple FPGA devices, they can be viewed with:

$ aocl list-devices
--------------------------------------------------------------------
Device Name: acl0
BSP Install Location: /opt/intel/oneapi/intelfpgadpcpp/latest/board/intel_s10sx_pac
Vendor: Intel Corp

Physical Dev Name    Status    Information
pac_ed00000          Passed    Intel PAC Platform (pac_ed00000)
                               PCIe 29:00.0
                               USM not supported

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

The user can then target the correct device when running their code or initializing the device.

Example

oneAPI Samples

The README.md in each directory contains information for compiling and running.

$ git clone https://github.com/oneapi-src/oneAPI-samples.git
$ cd oneAPI-samples/DirectProgramming/DPC++FPGA/Tutorials
Example 1: fpga_compile

Navigate to the "fpga_compile" example under "GettingStarted" within the oneAPI samples repository:

$ cd GettingStarted
$ cd fpga_compile

Create a build directory for configuration files:

$ mkdir build
$ cd build

Configure the program to compile for the Intel S10 SX PAC (Intel PAC D5005):

$ cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10

Once the configuration completes, there will be several make options available:


Compilation Types

Command | Device Image Type | Compilation Duration | Description
make fpga_emu | FPGA Emulator | Seconds | Compile for emulation (compiles quickly, targets emulated FPGA device). Allows the user to validate a design, but does not represent actual performance of code on hardware.
make report | Optimization Report | Minutes | Generate the optimization report. The FPGA device code is partially compiled for hardware. The compiler generates an optimization report that describes the structures generated on the FPGA, identifies performance bottlenecks, and estimates resource utilization.
make fpga | FPGA Hardware | Hours | Compile for FPGA hardware (takes longer to compile, targets FPGA device). Compiles the actual bitstream for running the program on hardware.

The recommended workflow is to compile a program for emulation before compiling it for hardware execution. Emulation does not compile the program to run on the FPGA itself, but on the CPU via a virtual FPGA emulation device; this lets a user validate the correctness of a design while benefiting from the short compile times of CPU compilation. The optimization report then helps the user improve different aspects of the design before moving on to hardware compilation.
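A typical pass through this workflow for the fpga_compile example looks like the following; the binary names follow the sample's naming convention and may differ between releases:

$ make fpga_emu               # seconds: build for the CPU-based emulator
$ ./fpga_compile.fpga_emu     # validate functional correctness on the emulator
$ make report                 # minutes: generate the optimization report
$ make fpga                   # hours: build the actual FPGA bitstream
$ ./fpga_compile.fpga         # run on the FPGA hardware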

Example 2: buffered_host_streaming

Navigate to the "buffered_host_streaming" example under "DesignPatterns" within the oneAPI samples repository:

$ cd DesignPatterns
$ cd buffered_host_streaming

Create a build directory for configuration files:

$ mkdir build
$ cd build

Configure the program to compile for the Intel S10 SX PAC (Intel PAC D5005):

$ cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10_usm -DUSM_HOST_ALLOCATIONS_ENABLED=1

Note that the value of FPGA_BOARD is the USM variant of the FPGA device. As in Example 1, there are three make targets to choose from: fpga_emu, report, and fpga. Once compilation finishes, ensure that the board is initialized with the correct standard image:

$ aocl initialize acl0 pac_s10_usm
$ aocl list-devices
--------------------------------------------------------------------
Device Name: acl0
BSP Install Location: /opt/intel/oneapi/intelfpgadpcpp/latest/board/intel_s10sx_pac
Vendor: Intel Corp

Physical Dev Name    Status    Information
pac_ed00000          Passed    Intel PAC Platform (pac_ed00000)
                               PCIe 29:00.0
                               USM supported

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Example output:

# buffered_host_streaming.fpga
$ ./buffered_host_streaming.fpga
Repetitions:      200
Buffers:          2
Buffer Count:     524288
Iterations:       4
Total Threads:    64

Running the roofline analysis
Producer (32 threads)
  Time:       1.1101 ms
  Throughput: 30226.1777 MB/s
Consumer (32 threads)
  Time:       1.0272 ms
  Throughput: 32667.0989 MB/s
Producer & Consumer (32 threads, each)
  Time:       3.4327 ms
  Throughput: 9774.9486 MB/s
Kernel
  Time:       3.5139 ms
  Throughput: 9549.1001 MB/s

Maximum Design Throughput: 9549.1001 MB/s
The FPGA kernel limits the performance of the design
Done the roofline analysis

Running the full design without API
  Average latency without API: 4.3190 ms
  Average processing time without API: 749.3281 ms
  Average throughput without API: 8955.8717 MB/s

Running the full design with API
  Average latency with API: 4.6629 ms
  Average processing time with API: 1005.6579 ms
  Average throughput with API: 6673.1306 MB/s
PASSED

Example output if the incorrect standard board image is programmed:

$ aocl initialize acl0 pac_s10   # incorrect board image; binaries compiled with pac_s10_usm
$ ./buffered_host_streaming.fpga
Repetitions:      200
Buffers:          2
Buffer Count:     524288
Iterations:       4
Total Threads:    64
ERROR: The selected device does not support USM host allocations
terminate called without an active exception
Aborted (core dumped)
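Reinitializing the device with the matching USM image and rerunning the binary resolves the error:

$ aocl initialize acl0 pac_s10_usm
$ ./buffered_host_streaming.fpga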

Resources

Resource | Description
FPGA Optimization Guide for Intel® oneAPI Toolkits | Provides guidance on leveraging the functionalities of SYCL* to optimize a design.
Intel® FPGA Programmable Acceleration Card D5005 Data Sheet | Shows electrical, mechanical, compliance, and other key specifications, helping data center operators and system integrators properly deploy the Intel® FPGA PAC into their servers. Also documents the FPGA power envelope, connectivity speeds to memory, and network connectivity, so that accelerator function unit (AFU) developers can properly design and test their IP.
Intel® FPGA Training | Set of labs for using FPGAs with oneAPI through the Intel® DevCloud.
Intel® Quartus® Prime Pro Edition User Guide: Scripting | Detailed guide for running Quartus programs on the command line.
Intel® Stratix® 10 FPGAs & SoC FPGA | Assorted documentation for the Stratix 10 FPGA family, including pinouts and device schematics.
Intel® Stratix® 10 FPGA Developer Center | Provides various resources to complete an Intel® FPGA design on the Stratix 10 architecture.
Why is FPGA Compilation Different? | Describes differences between CPU, GPU, and FPGA program compilation.

Support

Please report any issues encountered on the FPGAs to help@hprc.tamu.edu, and include information about actions taken and/or commands run prior to the error so the HPRC team may reproduce and resolve the issue.