Pegasus is a workflow management system that enables you to run computational workflows across ACCESS resources. It orchestrates jobs and data movements across different resource providers. At this point, Pegasus on ACCESS is mainly used for high throughput computing (HTC) workloads, meaning jobs that fit on a single compute node (single core, multicore, or single node MPI jobs).
Pegasus is used in production to execute scientific workflows in many disciplines, including astronomy, gravitational-wave physics, bioinformatics, earthquake engineering, helioseismology, limnology, machine learning, and molecular dynamics. Pegasus provides the necessary abstractions for scientists to create workflows and allows for transparent execution of these workflows on a range of computing platforms. More information can be found on the Pegasus website or in the Pegasus user guide.
When using Pegasus, the first step is to define an experiment as a workflow using one of the provided Python, Java, or R APIs. A popular choice is the Python API from inside a Jupyter Notebook, using predefined workflows that can be easily modified. A user defines their workflow in terms of each compute job: the executable that will be run, the input files required, and the output files produced. Pegasus automatically infers dependencies between jobs based on the input and output files used for each job. Finally, each job may itself be a sub-workflow, allowing users to organize larger workflows on the order of hundreds, thousands, or even millions of tasks.
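The idea of inferring dependencies from input and output files can be illustrated with a small, self-contained sketch. This is not the Pegasus API or its implementation; the job names, file names, and the infer_dependencies helper are hypothetical and exist only to show the principle: a job depends on whichever job produces one of its input files.

```python
def infer_dependencies(jobs):
    """Return {job: set(parent jobs)} by matching inputs to the jobs that produce them."""
    # Map every output file to the job that creates it.
    producer = {}
    for name, spec in jobs.items():
        for filename in spec["outputs"]:
            producer[filename] = name

    # A job depends on the producer of each of its inputs, if there is one.
    deps = {name: set() for name in jobs}
    for name, spec in jobs.items():
        for filename in spec["inputs"]:
            parent = producer.get(filename)
            if parent is not None and parent != name:
                deps[name].add(parent)
    return deps


# Hypothetical four-job workflow: split a file, process both parts, merge.
jobs = {
    "split": {"inputs": ["input.txt"], "outputs": ["part.a", "part.b"]},
    "count_a": {"inputs": ["part.a"], "outputs": ["count.a"]},
    "count_b": {"inputs": ["part.b"], "outputs": ["count.b"]},
    "merge": {"inputs": ["count.a", "count.b"], "outputs": ["total.txt"]},
}

deps = infer_dependencies(jobs)
for job in jobs:
    print(job, "depends on", sorted(deps[job]))
```

Files with no producer, such as input.txt here, are treated as external inputs to the workflow, so the split job has no parents.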
Once the workflow is defined, Pegasus compiles this abstract workflow into an executable workflow, specific to the execution environment that the user is targeting. This is referred to as the planning phase. Because Pegasus workflows are abstract, they are also portable: they can be planned again to run on a different execution environment. During the planning phase, Pegasus draws on research in graph algorithms to optimize the workflow for reliability and scalability.
Pegasus is built on top of HTCondor, and heavily utilizes HTCondor DAGMan as its execution engine. For ACCESS, an HTCondor pool is dynamically created as an overlay across ACCESS resources. Users can thus submit workflows at a central location and have the jobs execute at one or more ACCESS resource providers.
The HTCondor pool is created as an overlay across one or more ACCESS resource providers
In Jupyter, navigate to the example you are interested in, and step through the notebook. Once the workflow is submitted, you have to add compute resources with HTCondor Annex.
HTCondor Pool / Annex
At this point you should have some idle jobs in the queue. They are idle because there are no resources yet to execute on. Resources can be brought in with the HTCondor Annex tool, by sending pilot jobs (also called glideins) to the ACCESS resource providers. These pilots have the following properties:
A pilot can run multiple user jobs - it stays active until no more user jobs are available or until end of life has been reached, whichever comes first.
A pilot is partitionable - job slots will dynamically be created based on the resource requirements in the user jobs. This means you can fit multiple user jobs on a compute node at the same time.
A pilot will only run jobs for the user who started it.
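The partitionable behaviour described above can be sketched in a few lines. This is a simplified model, not HTCondor code; the PartitionableSlot class, the node size, and the job requests are hypothetical values chosen for illustration. Each accepted job carves a dynamic slot out of the remaining CPUs and memory, and a job that does not fit is left idle.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionableSlot:
    """Toy model of one pilot: a pool of CPUs and memory that jobs carve slots from."""
    cpus: int
    memory_mb: int
    dynamic_slots: list = field(default_factory=list)

    def try_claim(self, job_name, request_cpus, request_memory_mb):
        # Reject the job if the remaining resources cannot satisfy the request.
        if request_cpus > self.cpus or request_memory_mb > self.memory_mb:
            return False
        # Otherwise carve out a dynamic slot sized to the request.
        self.cpus -= request_cpus
        self.memory_mb -= request_memory_mb
        self.dynamic_slots.append((job_name, request_cpus, request_memory_mb))
        return True


# Hypothetical node: 128 CPUs and 257526 MB of memory.
pilot = PartitionableSlot(cpus=128, memory_mb=257526)

# Three small jobs fit side by side; the oversized job stays idle.
requests = [("job1", 1, 3072), ("job2", 1, 3072), ("job3", 1, 3072), ("big", 256, 1024)]
results = {name: pilot.try_claim(name, cpus, mem) for name, cpus, mem in requests}

print(results)
print("remaining:", pilot.cpus, "CPUs,", pilot.memory_mb, "MB")
```

Sizing each job's CPU and memory request accurately matters in this model: over-requesting reduces how many jobs can share a node.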
You have to have an allocation at the resource provider you want to use. The resources we currently support are:
ACCESS resource providers have slightly different policies for logging in to the resources. We recommend that you create a separate key for HTCondor Annex, and set up a ~/.ssh/config file containing remote usernames and which ssh key to use. Log in to https://access.pegasus.isi.edu and start an interactive shell. Create a new ssh key:
$ ssh-keygen -f ~/.ssh/annex
Then open an editor and create ~/.ssh/config. You will have to specify the username you have been assigned for each resource.
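As a rough illustration of what one entry in ~/.ssh/config might look like, the sketch below renders a Host stanza as text. The ssh_config_entry helper is hypothetical, and both the host name anvil.rcac.purdue.edu and the username x-exampleuser are assumptions for the example; use the login host and username your resource provider actually assigned. User, IdentityFile, and IdentitiesOnly are standard OpenSSH client options.

```python
def ssh_config_entry(host, user, identity_file="~/.ssh/annex"):
    """Render one Host stanza for ~/.ssh/config as a string."""
    return "\n".join([
        f"Host {host}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
        # Only offer the key named above, not every key the agent holds.
        "    IdentitiesOnly yes",
        "",
    ])


# Assumed host and placeholder username; replace with your own values.
entry = ssh_config_entry("anvil.rcac.purdue.edu", "x-exampleuser")
print(entry)
```

The printed text can be pasted into ~/.ssh/config, with one stanza per resource provider you plan to use.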
Determining Project ID and Queue, Installing SSH key
To start an annex, you need the project identifier at the particular resource provider. Note that this might not be the same as your ACCESS allocation ID. You have to log in to the resource provider, via the Open OnDemand instances in the table above, and run a resource provider specific command to determine the ID. You can also use this login to authorize the ssh key from the previous step. For example, to get set up on Anvil, log in to https://ondemand.anvil.rcac.purdue.edu and start an interactive shell. In that shell, first run mybalance:
Allocation     Type  SU Limit    SU Usage    SU Usage    SU Balance
Account                          (account)   (user)
=============  ====  ==========  ==========  ==========  ==========
abc12345       CPU   100000.0    0.0         0.0         100000.0
Take note of the allocation account name; you will need it when starting the annex.
To authorize the ssh key, append the contents of ~/.ssh/annex.pub (make sure it is the .pub one, not the private key) to ~/.ssh/authorized_keys on the resource provider.
Starting an Annex
You can create an annex with the annex create command. There is also an annex add command for adding more resources once you have an annex running. You have to specify your allocation; the last part of the command is the queue and resource.
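The shape of such a command can be sketched by assembling its argument list. Everything specific here is an assumption to be checked against htcondor annex create --help on the access point: the --nodes, --lifetime, and --project flags, the wholenode queue name, and the annex_create_command helper itself. The annex name bob and project abc12345 are example values.

```python
import shlex


def annex_create_command(annex_name, queue, resource, project, nodes=1, lifetime_s=3600):
    """Build an argument list for creating an annex (flags are assumptions, see above)."""
    return [
        "htcondor", "annex", "create",
        "--nodes", str(nodes),
        "--lifetime", str(lifetime_s),   # requested lifetime in seconds
        "--project", project,            # allocation account at the resource provider
        annex_name,
        f"{queue}@{resource}",           # last part: queue and resource
    ]


cmd = annex_create_command("bob", "wholenode", "anvil", "abc12345")
print(shlex.join(cmd))
```

Building the command as a list and quoting it with shlex.join avoids shell-quoting mistakes if an annex or project name ever contains unusual characters.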
The command will ask you to authenticate. For some resource providers, the ssh key will be enough. Some might require a two-factor login:
Duo two-factor login for user
Enter a passcode or select one of the following options:
1. Duo Push to XXX-XXX-1234
2. Phone call to XXX-XXX-1234
Passcode or option (1-2): 1
Monitoring an Annex
The status of your annex can be displayed with the annex status command:
$ htcondor annex status $USER
The command will provide an overview of resources, and how long they will be available:
$ htcondor annex status $USER
Annex 'bob' is established.
Its oldest established request is about 0.06 hours old and will retire in 0.94 hours.
You requested 2 nodes for this annex, of which 1 are in an established annex.
There are 128 CPUs in the established annex, of which 4 are busy.
3 jobs must run on this annex, and 3 currently are.
You requested resources for this annex 1 times; 0 are pending, 1 comprise the established annex, and 0 have retired.
Another tool to show your resources is condor_status. This will show the "slots" available, but note that these are partitionable; that is, they are created dynamically based on the size of your jobs. Example:
$ condor_status -const "AnnexName == \"$USER\""
Name                            OpSys  Arch    State      Activity  LoadAv  Mem     ActvtyTime
email@example.com               LINUX  X86_64  Unclaimed  Idle      0.000   248310  0+00:05:00
firstname.lastname@example.org  LINUX  X86_64  Claimed    Busy      0.000   3072    0+00:04:04
email@example.com               LINUX  X86_64  Claimed    Busy      0.020   3072    0+00:04:02
firstname.lastname@example.org  LINUX  X86_64  Claimed    Busy      0.020   3072    0+00:04:04

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill  Drain
X86_64/LINUX      4      0        3          1        0           0         0      0
Total             4      0        3          1        0           0         0      0
Frequently Asked Questions
Can I use Pegasus for my HPC workloads?
HPC jobs require a Pegasus install at the resource provider. This will be explored later in the ACCESS Pegasus pilot. If you are interested, please let us know, and tell us what your requirements are.