ACCESS Pegasus

Overview

Pegasus is a workflow management system that enables you to run computational workflows across ACCESS resources, seamlessly orchestrating jobs and data movement at different resource providers. At this point, Pegasus on ACCESS is mainly used for high-throughput computing (HTC) workloads, that is, jobs that fit on a single compute node (single-core, multi-core, or single-node MPI jobs).

Pegasus is being used in production to execute scientific workflows in several different disciplines, including astronomy, gravitational-wave physics, bioinformatics, earthquake engineering, helioseismology, limnology, machine learning, and molecular dynamics, among others. Pegasus provides the necessary abstractions for scientists to create workflows and allows for transparent execution of these workflows on a range of computing platforms. More information can be found on the Pegasus website or in the Pegasus user guide.

When using Pegasus, the first step is to define an experiment as a workflow using one of the provided Python, Java, or R APIs. A popular choice is the Python API from inside a Jupyter notebook, starting from predefined workflows that can be easily modified. A user defines their workflow in terms of each compute job: the executable that will be run, the input files required, and the output files produced. Pegasus automatically infers dependencies between jobs based on the input and output files used by each job. Finally, each job may itself be a sub-workflow, allowing users to organize larger workflows on the order of hundreds, thousands, or even millions of tasks.
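As a concrete illustration, a minimal workflow definition with the Python API might look like the following sketch. The executable names ("preprocess", "analyze") and file names are hypothetical placeholders, and in a full workflow the executables would also be registered in a transformation catalog:

# Minimal sketch of a two-job Pegasus workflow (hypothetical executables)
from Pegasus.api import Workflow, Job, File

wf = Workflow("example-workflow")

raw = File("raw.txt")          # input provided by the user
cleaned = File("cleaned.txt")  # intermediate file
result = File("result.txt")    # final output

preprocess = Job("preprocess")\
    .add_args("-i", raw, "-o", cleaned)\
    .add_inputs(raw)\
    .add_outputs(cleaned)

analyze = Job("analyze")\
    .add_args("-i", cleaned, "-o", result)\
    .add_inputs(cleaned)\
    .add_outputs(result)

# Pegasus infers the preprocess -> analyze dependency from cleaned.txt
wf.add_jobs(preprocess, analyze)
wf.write()  # writes the abstract workflow to workflow.yml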

Once the workflow is defined, Pegasus compiles this abstract workflow into an executable workflow specific to the execution environment the user is targeting. This is referred to as the planning phase. Because Pegasus workflows are abstract, they are also portable: they can be planned again to run on a different execution environment. By utilizing the wealth of research done in graph algorithms, Pegasus can perform optimizations during this planning phase to improve the workflow's reliability and scalability.
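Planning and submission can also be driven from the Python API. The sketch below continues from the workflow object wf defined above and uses commonly seen site names ("condorpool", "local") as assumptions; the exact arguments depend on how your site catalog is configured, and the ACCESS Pegasus example notebooks set this up for you:

# Plan the abstract workflow and hand the result to HTCondor DAGMan
wf.plan(
    dir="submit",            # directory for the executable workflow
    sites=["condorpool"],    # execution site(s) to plan for (assumed name)
    output_sites=["local"],  # where final outputs are staged (assumed name)
    submit=True,             # submit immediately after planning
)

wf.wait()  # optionally block until the workflow finishes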

Pegasus is built on top of HTCondor and heavily utilizes HTCondor DAGMan as its execution engine. For ACCESS, an HTCondor pool is dynamically created as an overlay across ACCESS resources, so users can submit workflows at a central location and have the jobs execute at one or more ACCESS resource providers.

Figure: The HTCondor pool is created as an overlay across one or more ACCESS resource providers.

Logging In / Jupyter

To get started, use a web browser and log in to https://access.pegasus.isi.edu with your ACCESS credentials.

Example Workflows

Some example workflows can be found on GitHub. Start a shell on pegasus.access-ci.org (Clusters > Shell), and check out the repository:

$ git clone https://github.com/pegasus-isi/ACCESS-Pegasus-Examples.git

In Jupyter, navigate to the example you are interested in, and step through the notebook. Once the workflow is submitted, you have to add compute resources with HTCondor Annex.

HTCondor Pool / Annex

At this point you should have some idle jobs in the queue. They are idle because there are no resources yet for them to execute on. Resources can be brought in with the HTCondor Annex tool by sending pilot jobs (also called glideins) to the ACCESS resource providers. These pilots have the following properties:

  • A pilot can run multiple user jobs - it stays active until no more user jobs are available or until its end of life has been reached, whichever comes first.

  • A pilot is partitionable - job slots will be created dynamically based on the resource requirements of the user jobs, so multiple user jobs can fit on a compute node at the same time (see the Python sketch after this list).

  • A pilot will only run jobs for the user who started it.
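As referenced above, here is a hedged sketch of how a job from the earlier Python example could advertise its resource requirements, so that a slot of a matching size is carved out of a pilot. The profile keys are standard HTCondor request attributes; the specific values are only examples:

# Declare per-job resource requirements as HTCondor profiles (example values)
from Pegasus.api import Namespace

analyze.add_profiles(Namespace.CONDOR, key="request_cpus", value="4")
analyze.add_profiles(Namespace.CONDOR, key="request_memory", value="8 GB")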

You must have an allocation at the resource provider you want to use. The resources we currently support are:

PSC Bridges2
  • Nickname: bridges2
  • OnDemand instance: log in via the Bridges2 OnDemand portal (passwords/keys have to be registered first - see the Bridges2 user guide)
  • Project list command: projects
  • Queues (tested): RM

Purdue Anvil
  • Nickname: anvil
  • OnDemand instance: log in via the Anvil OnDemand portal (https://ondemand.anvil.rcac.purdue.edu)
  • Project list command: mybalance
  • Queues (tested): standard

SDSC Expanse
  • Nickname: expanse
  • OnDemand instance: log in via the Expanse OnDemand portal
  • Project list command: module load sdsc; expanse-client user
  • Queues (tested): compute

Setting Up SSH Keys and Config

ACCESS resource providers have slightly different policies for logging in to the resources. We recommend that you create a separate key for HTCondor Annex and set up a ~/.ssh/config file containing remote usernames and which SSH key to use. Log in to https://access.pegasus.isi.edu and start an interactive shell. Create a new SSH key:

$ ssh-keygen -f ~/.ssh/annex

Then open an editor and create ~/.ssh/config. You will have to specify the username you have been assigned at each resource:

Host anvil.rcac.purdue.edu *.anvil.rcac.purdue.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex

Host bridges2.psc.edu *.bridges2.psc.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex

Host expanse.sdsc.edu *.expanse.sdsc.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex

Determining Project ID and Queue, Installing SSH key

To start an annex, you need the project identifier at the particular resource provider. Note that this might not be the same as your ACCESS allocation ID. You have to log in to the resource provider, via the Open OnDemand instances listed above, and run a resource-provider-specific command to determine the ID. You can also use this login to authorize the SSH key from the previous step. For example, to get set up on Anvil, log in to https://ondemand.anvil.rcac.purdue.edu and start an interactive shell. In that shell, first run mybalance:

$ mybalance
Allocation     Type  SU Limit    SU Usage    SU Usage    SU Balance
Account                          (account)   (user)
=============  ====  ==========  ==========  ==========  ==========
abc12345       CPU   100000.0    0.0         0.0         100000.0

Take note of the allocation account name; you will need it when starting the annex.

Then install the ~/.ssh/annex.pub key from access.pegasus.isi.edu by adding it to ~/.ssh/authorized_keys:

$ nano ~/.ssh/authorized_keys

Copy the contents from ~/.ssh/annex.pub (make sure it is the .pub one).

Starting an Annex

You can create an annex with the annex create command. There is also an annex add command for when you already have an annex running and want to add more resources. You have to specify your allocation, and the last part of the command specifies the queue and resource:

$ htcondor annex create --nodes 1 --lifetime 86400 --project PROJECT_ID $USER QUEUE@RESOURCE

For example, if you want to run on Anvil using the standard queue and your project ID is abc1234, the command would be:

$ htcondor annex create --nodes 1 --lifetime 86400 --project abc1234 $USER standard@anvil

The command will ask you to authenticate. For some resource providers, the SSH key will be enough; others might require a two-factor login:

Duo two-factor login for user

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-1234
 2. Phone call to XXX-XXX-1234

Passcode or option (1-2): 1
Thank you.

Monitoring an Annex

The status of your annex can be displayed with the annex status command:

$ htcondor annex status $USER

The command will provide an overview of resources, and how long they will be available:

$ htcondor annex status $USER
Annex 'bob' is established. Its oldest established request is about 0.06 hours
old and will retire in 0.94 hours. You requested 2 nodes for this annex, of
which 1 are in an established annex. There are 128 CPUs in the established
annex, of which 4 are busy. 3 jobs must run on this annex, and 3 currently are.
You requested resources for this annex 1 times; 0 are pending, 1 comprise the
established annex, and 0 have retired.

Another tool to show your resources is condor_status. This will show the available "slots", but note that these are partitionable, i.e., they can be dynamically created based on the size of your jobs. Example:

$ condor_status -const "AnnexName == \"$USER\""
Name                                OpSys  Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@a666.anvil.rcac.purdue.edu    LINUX  X86_64 Unclaimed Idle     0.000  248310 0+00:05:00
slot1_2@a666.anvil.rcac.purdue.edu  LINUX  X86_64 Claimed   Busy     0.000    3072 0+00:04:04
slot1_3@a666.anvil.rcac.purdue.edu  LINUX  X86_64 Claimed   Busy     0.020    3072 0+00:04:02
slot1_4@a666.anvil.rcac.purdue.edu  LINUX  X86_64 Claimed   Busy     0.020    3072 0+00:04:04

              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain

 X86_64/LINUX     4     0       3         1       0          0        0     0

        Total     4     0       3         1       0          0        0     0

Frequently Asked Questions

Can I use Pegasus for my HPC workloads?

HPC jobs require a Pegasus installation at the resource provider. This will be explored later in the ACCESS Pegasus pilot; please let us know if you are interested in this and what your requirements are.