ACCESS Pegasus Workflows

Overview

Pegasus is a workflow management system that enables you to run computational workflows across ACCESS resources, seamlessly orchestrating jobs and data movement at different resource providers. At this point, Pegasus on ACCESS is mainly used for high throughput computing (HTC) workloads, that is, jobs that fit on a single compute node (single core, multicore, or single node MPI jobs).

[Figure: Example workflow from the SoyKB project]

Pegasus is used in production to execute scientific workflows in several different disciplines, including astronomy, gravitational-wave physics, bioinformatics, earthquake engineering, helioseismology, limnology, machine learning, and molecular dynamics, among others. Pegasus provides the necessary abstractions for scientists to create workflows and allows for transparent execution of these workflows on a range of computing platforms. More information can be found on the Pegasus website or in the Pegasus user guide.

When using Pegasus, the first step is to define an experiment as a workflow using one of the provided Python, Java, or R APIs. A popular choice is the Python API from inside a Jupyter notebook, using predefined workflows that can be easily modified. A user defines their workflow in terms of each compute job: the executable that will be run, the input files required, and the output files produced. Pegasus automatically infers dependencies between jobs based on the input and output files used by each job. Finally, each job may itself be a sub-workflow, allowing users to organize larger workflows on the order of hundreds, thousands, or even millions of tasks.
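As a minimal sketch of what such a definition looks like with the Python API (the executable and file names below are placeholders, and the corresponding executables would also need transformation catalog entries, omitted here):

from Pegasus.api import *

wf = Workflow("example-workflow")

# files flowing between jobs
raw = File("input.txt")
cleaned = File("cleaned.txt")
result = File("result.txt")

# each job names its executable, arguments, inputs, and outputs
preprocess = Job("preprocess").add_args("-i", raw, "-o", cleaned).add_inputs(raw).add_outputs(cleaned)
analyze = Job("analyze").add_args("-i", cleaned, "-o", result).add_inputs(cleaned).add_outputs(result)

# Pegasus infers that analyze depends on preprocess from the shared cleaned.txt file
wf.add_jobs(preprocess, analyze)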

Once the workflow is defined, Pegasus compiles this abstract workflow into an executable workflow specific to the execution environment the user is targeting. This is referred to as the planning phase. Because Pegasus workflows are abstract, they are also portable: they can be planned again to run on a different execution environment. During the planning phase, Pegasus draws on a wealth of research in graph algorithms to optimize the workflow for reliability and scalability.
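Continuing the sketch above, planning (and optionally submitting) the workflow is a single call in the Python API; the site names used here are assumptions that depend on your site catalog:

# plan the abstract workflow for the condorpool site and submit it
wf.plan(sites=["condorpool"], output_sites=["local"], submit=True)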

Pegasus is built on top of HTCondor and heavily utilizes HTCondor DAGMan as its execution engine. For ACCESS, an HTCondor pool is dynamically created as an overlay across ACCESS resources, so users can submit workflows at a central location and have the jobs execute at one or more ACCESS resource providers.

[Figure: Overview of Pegasus on ACCESS]

Logging In / Jupyter

To get started, use a web browser and log in to https://pegasus.access-ci.org using your ACCESS credentials.

Allocation Optional for Tutorial Workflows

Typically, using ACCESS Pegasus to run workflows requires users to link their own allocations. However, the initial notebooks in this guide are pre-configured to run on a modest resource pool bundled with ACCESS Pegasus. As you progress to more complex sample workflows, you will need to use your own allocation.

Creating Workflows

Looking at examples of solutions that have already been implemented can be very helpful. With that in mind, we have created a collection of sample workflows that can be conveniently explored using our web-based Jupyter notebooks.

The examples can be found in your $HOME directory under the ACCESS-Pegasus-Examples/ directory.

In Jupyter, navigate to the example you are interested in and step through the notebook. Once the workflow is submitted, you will need to add compute resources with HTCondor Annex (see Job Routing / Resource Provisioning below).

The first few notebooks are set up as a self-guided introduction to Pegasus. The final example is a complete workflow focused on automating the variant calling process, adapted from the Data Carpentry lesson on Data Wrangling and Processing for Genomics. This workflow downloads and aligns SRA data to the E. coli REL606 reference genome, identifies any differences between the reads and the genome, and performs variant calling in order to track changes in the population over time.

For a full description of how to create workflows, please see the Pegasus user guide.

Job Routing / Resource Provisioning

ACCESS Pegasus enables jobs to flow to a set of different resources, some of which are always available and some of which have to be explicitly provisioned by users when they are needed. Note that by default a job will try to go anywhere it can; you may have to exclude resources if there are places you do not want your jobs to go.

The following figure shows an overview of the available resources, followed by a more detailed discussion of each.

TestPool

The TestPool consists of a small number of cores, available for anyone to use at any time, even without an allocation. These are meant to be used for jobs with quick turnaround time, such as tutorials, development, and debugging.

You can see the state of the TestPool by running:

condor_status -const 'TestPool =?= True'

If you do not want your jobs to run on the TestPool, please add TestPool =!= True to your job requirements.
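If you are building your workflow with the Python API, one way to express this (mirroring the profile examples later on this page; the condor requirements profile key and the condorpool site name are assumptions that should match your configuration) is:

# keep jobs off the TestPool by adding a clause to the HTCondor requirements
props.add_site_profile("condorpool", "condor", "requirements", "TestPool =!= True")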

Cloud

Adding cloud resources, using your own allocation, is done by starting a provided VM image, and injecting a provided token for authentication. The VMs join the pool and start running jobs. When there are no more jobs, the VMs shut themselves down.

More details on how to provide cloud resources

HTCondor Annex

ACCESS HPC Resources can be brought in with the HTCondor Annex tool, by sending pilot jobs (also called glideins) to the clusters. The pilots will run under your ACCESS allocation, and have the following properties:

  • A pilot can run multiple user jobs - it stays active until no more user jobs are available or until its end of life has been reached, whichever comes first.

  • A pilot is partitionable - job slots will dynamically be created based on the resource requirements in the user jobs. This means you can fit multiple user jobs on a compute node at the same time.

  • A pilot will only run jobs for the user who started it.

Annexes can be named, and jobs can be configured to only go to certain named annexes. By default, annexes are named after your username.
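As a sketch of how this could look in the Python API, assuming your pool matches jobs to annexes via the TargetAnnexName job attribute (check the Annex documentation linked below for the exact attribute and annex name to use; "my-annex" here is a placeholder):

# route jobs only to the annex named "my-annex" (hypothetical name)
props.add_site_profile("condorpool", "condor", "+TargetAnnexName", "\"my-annex\"")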

More details on how to use the HTCondor Annex

OSPool

The OSPool is always connected to ACCESS Pegasus, but requires jobs to have an OSG project name specified. If you have an ACCESS allocation on OSG, you can use the “TG-NNNNNN” allocation id as the project name. Or, if you have an OSG-assigned project name, you may use that. You can specify the project name in your workflow like this:

props.add_site_profile("condorpool", "condor", "+ProjectName", "\"TG-NNNNNN\"")

Also note that the OSPool uses a different approach to containers. Instead of using Pegasus’ built-in container execution, create non-container jobs with a property specifying the container to use:

props.add_site_profile("condorpool", "condor", "+SingularityImage", "\"/cvmfs/singularity.opensciencegrid.org/htc/rocky:8\"")

More information about containers on the OSPool can be found in the OSG documentation.

More details on how to use the OSPool

Frequently Asked Questions

Can I use Pegasus for my HPC workloads?

HPC jobs require a Pegasus install at the resource provider. This will be explored later in the ACCESS Pegasus pilot; please let us know if you have an interest in this and what your requirements are.

Need Help?

Several support channels are available when you need help with ACCESS Pegasus:

  • Open an ACCESS ticket. Tag it with workflows so that the ticket gets routed quickly.

  • We encourage you to join the Slack workspace, as it is an ongoing, open forum for all Pegasus users to share ideas, experiences, and talk through issues with the Pegasus development team. Ask for an invite by trying to join pegasus-users.slack.com in the Slack app, or send an email to pegasus-support@isi.edu and request an invite.

  • pegasus-users@isi.edu is an open discussion list. You can subscribe here.