ACCESS Pegasus

Overview

Pegasus is a workflow management system that enables you to run computational workflows across ACCESS resources, seamlessly orchestrating jobs and data movement at different resource providers. At this point, Pegasus on ACCESS is mainly used for high-throughput computing (HTC) workloads, that is, jobs that fit on a single compute node (single-core, multi-core, or single-node MPI jobs).

Pegasus is being used in production to execute scientific workflows in several different disciplines including astronomy, gravitational-wave physics, bioinformatics, earthquake engineering, helio-seismology, limnology, machine learning, and molecular dynamics, among others. Pegasus provides the necessary abstractions for scientists to create workflows and allows for transparent execution of these workflows on a range of computing platforms. More information can be found on the Pegasus website or in the Pegasus user guide.

When using Pegasus, the first step is to define an experiment as a workflow using one of the provided Python, Java, or R APIs. A popular choice is the Python API from inside a Jupyter notebook, using predefined workflows that can be easily modified. A user defines their workflow in terms of each compute job: that is, the executable that will be run, the input files required, and the output files produced. Pegasus automatically infers dependencies between jobs based on the input and output files used for each job. Finally, each job may itself be a sub-workflow, allowing users to organize larger workflows on the order of hundreds, thousands, or even millions of tasks.
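
As an illustration, here is a minimal sketch of a single-job workflow using the Pegasus Python API; the executable name, file names, and arguments are made up for this example, and a complete workflow would also register the executable in a transformation catalog:

from Pegasus.api import Workflow, Job, File

wf = Workflow("example-workflow")

# Hypothetical input and output files for the single compute job
input_file = File("input.txt")
output_file = File("output.txt")

# One compute job: the executable to run, its arguments, inputs, and outputs
job = Job("my-analysis") \
        .add_args("-i", input_file, "-o", output_file) \
        .add_inputs(input_file) \
        .add_outputs(output_file)

wf.add_jobs(job)

# Pegasus infers dependencies between jobs from the declared input and output files
wf.write("workflow.yml")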

Once the workflow is defined, Pegasus compiles this abstract workflow into an executable workflow specific to the execution environment the user is targeting. This is referred to as the planning phase. Because Pegasus workflows are abstract, they are also portable: they can be planned again to run on a different execution environment. During this planning phase, Pegasus draws on a wealth of research in graph algorithms to perform optimizations that improve the workflow's reliability and scalability.
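
For example, continuing the sketch above, the workflow can be planned (and optionally submitted) directly from Python; the call below is illustrative and the exact options depend on your setup:

# Plan the abstract workflow for the target execution environment,
# submit it, and wait for it to finish (illustrative options)
wf.plan(submit=True).wait()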

Pegasus is built on top of HTCondor and heavily utilizes HTCondor DAGMan as its execution engine. For ACCESS, an HTCondor pool is dynamically created as an overlay across ACCESS resources; users can thus submit workflows at a central location and have the jobs execute at one or more ACCESS resource providers.

Figure: The HTCondor pool is created as an overlay across one or more ACCESS resource providers.

Logging In / Jupyter

To get started, use a web browser and log in to https://access.pegasus.isi.edu using your ACCESS credentials.

Allocation Optional for Tutorial Workflows

Typically, running workflows with ACCESS Pegasus requires users to link their own allocations. However, the initial notebooks in this guide are pre-configured to run on a modest resource bundled with ACCESS Pegasus. As you progress to more complex sample workflows, you will need to use your own allocation.

If you prefer to run the workflow using your own allocation, you can provision as described next.

Creating Workflows

Looking at examples of solutions that have already been implemented can be very helpful. With that in mind, we have created a collection of sample workflows that can be conveniently explored using our web-based Jupyter notebooks.

The examples can be found in your $HOME directory, under the ACCESS-Pegasus-Examples/ directory.

In Jupyter, navigate to the example you are interested in, and step through the notebook. Once the workflow is submitted, you have to add compute resources with HTCondor Annex.

The first few notebooks are set up as a self-guided introduction to Pegasus. The final example is a complete workflow focused on automating the variant calling process, adapted from the Data Carpentry lesson on Data Wrangling and Processing for Genomics. This particular workflow downloads SRA data, aligns it to the E. coli REL606 reference genome, and identifies any differences between the reads and the genome. Additionally, it performs variant calling in order to track changes in the population over time.

For a full description of how to create workflows, please see the Pegasus user guide.

Configuring Resources

IU Jetstream2

Jetstream2 is configured differently from other resource providers (for those, see HTCondor Annex below). The procedure for Jetstream2 involves starting up a special VM instance and providing it with your username and token. To get started, first start a shell on https://access.pegasus.isi.edu. In the shell, run the following two commands and note down the results (an example is shown after the list):

  1. whoami - this will give your username.

  2. cat ~/.condor/pilot.token - this will output your token. Note that it is probably longer than your terminal is wide, but it is really one long line without any line breaks.
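
For example, the output will look roughly like this (the username and token shown are placeholders, and the token is truncated):

$ whoami
alice123
$ cat ~/.condor/pilot.token
eyJhbGciOiJFUzI1NiIs...   (placeholder, truncated - the real token is one long line)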

Once you have those pieces of information, use your web browser to go to Jetstream2 Exosphere. If you haven't already done so, add your allocation. Now click the Create button in the top right and select to start a new instance. Click By Image, search for ACCESS-Pegasus-Worker, and click Create Instance. You can now pick a size; base this on your workload. For example, if you have high-memory tasks, start an instance with more memory.

An important step here is to select Show for Advanced Options. In the Boot Script, find the runcmd: section, and right before it, add a new section using the values from the earlier shell session:

bootcmd:
  - /opt/ACCESS-Pegasus-Jetstream2/bin/vm-conf [USERNAME] [TOKEN]

An example of what this can look like (with a placeholder username and a truncated token) is:
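
# placeholder username and truncated token shown below - substitute your own values from the shell session
bootcmd:
  - /opt/ACCESS-Pegasus-Jetstream2/bin/vm-conf alice123 eyJhbGciOiJFUzI1NiIs...
runcmd:
  # ... existing runcmd entries remain unchanged ...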

Click Create. The instance will now start and register back with ACCESS Pegasus. After a few minutes, go back to the shell on ACCESS Pegasus and run condor_status; once your instance shows up, it is ready for work.

Once you are done running your workload, the instance will go away automatically. It will shut down after 30 minutes of not seeing any new jobs.

HTCondor Annex

Resources (except Jetstream2) can be brought in with the HTCondor Annex tool, by sending pilot jobs (also called glideins) to the ACCESS resource providers. These pilots have the following properties:

  • A pilot can run multiple user jobs - it stays active until no more user jobs are available or until end of life has been reached, whichever comes first.

  • A pilot is partitionable - job slots will be created dynamically based on the resource requirements of the user jobs. This means you can fit multiple user jobs on a compute node at the same time (see the sketch after this list for how to declare those requirements).

  • A pilot will only run jobs for the user who started it.
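
For example, one way to declare per-job resource requirements is with Pegasus profiles in the Python API; this is a sketch assuming the Pegasus 5 add_pegasus_profile helper, and the values are illustrative:

# Ask for 4 cores and 4096 MB of memory for this job, so the pilot
# can carve out a matching slot for it (illustrative values)
job.add_pegasus_profile(cores=4, memory=4096)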

As part of setting up an annex, you need to know your local username on the various resources. The easiest way to figure that out is to navigate to your ACCESS Profile on the Allocations page. At the bottom of the page, you will see a table titled “Resource Provider Site Usernames”.

You have to have an allocation at the resource provider you want to use. The resources we currently support are:

PSC Bridges2
  • Nickname: bridges2
  • OnDemand instance: Log in (but passwords/keys have to be registered - see the user guide)
  • Project list command: projects
  • Queues (tested): RM

Purdue Anvil
  • Nickname: anvil
  • OnDemand instance: Log in
  • Project list command: mybalance
  • Queues (tested): standard

SDSC Expanse
  • Nickname: expanse
  • OnDemand instance: Log in
  • Project list command: module load sdsc; expanse-client user -p
  • Queues (tested): compute

Setting Up SSH Keys and Config

ACCESS resource providers have slightly different policies for logging in to the resources. We recommend that you create a separate key for HTCondor Annex, and set up a ~/.ssh/config file specifying the remote usernames and which SSH key to use. Log in to https://access.pegasus.isi.edu and start an interactive shell. Create a new SSH key:

$ ssh-keygen -f ~/.ssh/annex

Then open an editor and create ~/.ssh/config. You will have to specify the username you have been assigned at each resource:

Host anvil.rcac.purdue.edu *.anvil.rcac.purdue.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex

Host bridges2.psc.edu *.bridges2.psc.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex

Host expanse.sdsc.edu *.expanse.sdsc.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex

Determining Project ID and Queue

To start an annex, you need the project identifier at the particular resource provider. Note that this might not be the same as your ACCESS allocation ID. Log in to the resource provider via the Open OnDemand instance in the table above, and run the resource-provider-specific command to determine the ID. You can also use this login to authorize the SSH key from the previous step. For example, to get set up on Anvil, log in to https://ondemand.anvil.rcac.purdue.edu and start an interactive shell. In that shell, first run mybalance:

$ mybalance
Allocation     Type   SU Limit    SU Usage    SU Usage    SU Balance
Account                           (account)   (user)
=============  ====   ==========  ==========  ==========  ==========
abc12345       CPU    100000.0    0.0         0.0         100000.0

Take note of the allocation account name; you will need it when starting the annex.

Installing the SSH key

Install the ~/.ssh/annex.pub key from access.pegasus.isi.edu into ~/.ssh/authorized_keys on the resource:

$ nano ~/.ssh/authorized_keys

Copy the contents of ~/.ssh/annex.pub into the file (make sure it is the .pub one).
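
As an illustration, one way to do this is to display the public key in the shell on access.pegasus.isi.edu, paste it into the editor on the resource, and make sure the permissions are strict enough for SSH to accept it (the key shown is a placeholder):

$ cat ~/.ssh/annex.pub         # run on access.pegasus.isi.edu and copy the single output line
ssh-ed25519 AAAAC3Nza...placeholder... user@access.pegasus.isi.edu

$ chmod 700 ~/.ssh             # back on the resource, ensure permissions are strict
$ chmod 600 ~/.ssh/authorized_keys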

Provisioning Resources

You can create an annex with the annex create command. There is also an annex add command for when you already have an annex running and want to add more resources. You have to specify your allocation, and the last part of the command is the queue and resource. Note that $USER should be left as-is in the command - the shell will substitute the correct value there.

$ htcondor annex create --nodes 1 --lifetime 86400 --project PROJECT_ID $USER QUEUE@RESOURCE

For example, if you want to run on Anvil, using the standard queue and your project id is abc1234, the command would be:

$ htcondor annex create --nodes 1 --lifetime 86400 --project abc1234 $USER standard@anvil

The command will ask you to authenticate. For some resource providers, the ssh key will be enough. Some might require a two-factor login:

Duo two-factor login for user

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-1234
 2. Phone call to XXX-XXX-1234

Passcode or option (1-2): 1
Thank you.

Monitoring

The status of your annex can be displayed with the annex status command:

$ htcondor annex status $USER

The command will provide an overview of resources, and how long they will be available:

$ htcondor annex status $USER
Annex 'bob' is established. Its oldest established request is about
0.06 hours old and will retire in 0.94 hours.

You requested 2 nodes for this annex, of which 1 are in an established annex.
There are 128 CPUs in the established annex, of which 4 are busy.
3 jobs must run on this annex, and 3 currently are.

You requested resources for this annex 1 times; 0 are pending,
1 comprise the established annex, and 0 have retired.

Another tool to show your resources is condor_status. This will show the “slots” available, but note that these are partitionable, i.e., they are created dynamically based on the size of your jobs. Example:

$ condor_status -const "AnnexName == \"$USER\""
Name                                OpSys   Arch    State      Activity  LoadAv  Mem     ActvtyTime

slot1@a666.anvil.rcac.purdue.edu    LINUX   X86_64  Unclaimed  Idle      0.000   248310  0+00:05:00
slot1_2@a666.anvil.rcac.purdue.edu  LINUX   X86_64  Claimed    Busy      0.000   3072    0+00:04:04
slot1_3@a666.anvil.rcac.purdue.edu  LINUX   X86_64  Claimed    Busy      0.020   3072    0+00:04:02
slot1_4@a666.anvil.rcac.purdue.edu  LINUX   X86_64  Claimed    Busy      0.020   3072    0+00:04:04

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill  Drain

 X86_64/LINUX     4      0        3          1        0           0         0      0
        Total     4      0        3          1        0           0         0      0

Frequently Asked Questions

Can I use Pegasus for my HPC workloads?

HPC jobs require a Pegasus install at the resource provider. This will be explored later in the ACCESS Pegasus pilot; please let us know if you have an interest in this and what your requirements are.

Need Help?

Several support channels are available when you need help with ACCESS Pegasus:

  • Open an ACCESS ticket. Tag it workflows so that the ticket gets routed quickly.

  • We encourage you to join the Slack Workspace, as it is an ongoing, open forum for all Pegasus users to share ideas and experiences and to talk through issues with the Pegasus development team. Please ask for an invite by trying to join pegasus-users.slack.com in the Slack app, or send an email to pegasus-support@isi.edu and request an invite.

  • pegasus-users@isi.edu is an open discussion list. You can subscribe here.