HTCondor Annex

HPC resources can temporarily be made available to your workflows with the HTCondor Annex tool, which sends pilot jobs (also called glideins) to the ACCESS resource providers. These pilots have the following properties:

  • A pilot can run multiple user jobs - it stays active until no more user jobs are available or until it reaches its end of life, whichever comes first.

  • A pilot is partitionable - job slots are created dynamically based on the resource requirements of the user jobs. This means you can fit multiple user jobs on a compute node at the same time.

  • A pilot will only run jobs for the user who started it.
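As a rough illustration of what "partitionable" means in practice, the sketch below estimates how many identical jobs could share one pilot at the same time. The node size (128 CPUs, 257526 MB of memory) and the per-job request (1 CPU, 3072 MB) are assumed values for illustration only, and the helper name `jobs_per_node` is hypothetical. The estimate ignores disk and any other resources HTCondor takes into account.

```shell
#!/bin/sh
# Estimate how many identical jobs fit on one partitionable pilot.
# All numbers passed in are assumptions for illustration; real nodes differ.
# Usage: jobs_per_node NODE_CPUS NODE_MEM_MB JOB_CPUS JOB_MEM_MB
jobs_per_node() {
    node_cpus=$1; node_mem_mb=$2; job_cpus=$3; job_mem_mb=$4
    by_cpu=$(( node_cpus / job_cpus ))
    by_mem=$(( node_mem_mb / job_mem_mb ))
    # The scarcer resource limits the number of concurrent jobs.
    if [ "$by_cpu" -lt "$by_mem" ]; then echo "$by_cpu"; else echo "$by_mem"; fi
}

jobs_per_node 128 257526 1 3072   # prints 83: memory, not CPU count, is the limit
```

In this example memory runs out before the CPUs do, so requesting less memory per job would let more jobs share the node.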

As part of setting up an annex, you need to know your local username on each of the resources. The easiest way to find it is to navigate to your ACCESS Profile on the Allocations page. At the bottom of that page, you will see a table titled “Resource Provider Site Usernames”.

You must have an allocation at the resource provider you want to use. The resources we currently support are:

| Resource | Nickname | OnDemand Instance | Project list command | Queues (tested) |
|---|---|---|---|---|
| PSC Bridges2 | bridges2 | Log in (but passwords/keys have to be registered - see user guide) | projects | RM |
| Purdue Anvil | anvil | Log in | mybalance | standard |
| SDSC Expanse | expanse | Log in | module load sdsc; expanse-client user -p | compute |

Setting Up SSH Keys and Config

ACCESS resource providers have slightly different policies for logging in to their resources. We recommend that you create a separate key for HTCondor Annex and set up a ~/.ssh/config file that specifies your remote usernames and which ssh key to use. Log in to https://access.pegasus.isi.edu and start an interactive shell. Create a new ssh key:

$ ssh-keygen -f ~/.ssh/annex

Then open an editor and create ~/.ssh/config. You will have to specify the username you have been assigned for each resource:

Host anvil.rcac.purdue.edu *.anvil.rcac.purdue.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex

Host bridges2.psc.edu *.bridges2.psc.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex

Host expanse.sdsc.edu *.expanse.sdsc.edu
    User MYUSERNAME
    IdentityFile ~/.ssh/annex
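Because each resource may assign you a different username, it can help to generate the stanzas rather than type them by hand. The following is a minimal sketch, not part of the Annex tooling: the function names `annex_ssh_stanza` and `annex_ssh_config` and the usernames `user_a`, `user_b`, `user_c` are hypothetical placeholders. It only prints the configuration so you can review it first.

```shell
#!/bin/sh
# Print one ~/.ssh/config stanza.
# $1 = login hostname, $2 = your username at that resource provider
annex_ssh_stanza() {
    printf 'Host %s *.%s\n    User %s\n    IdentityFile ~/.ssh/annex\n\n' "$1" "$1" "$2"
}

# Print stanzas for the three supported resources.
# Usage: annex_ssh_config ANVIL_USER BRIDGES2_USER EXPANSE_USER
annex_ssh_config() {
    annex_ssh_stanza anvil.rcac.purdue.edu "$1"
    annex_ssh_stanza bridges2.psc.edu "$2"
    annex_ssh_stanza expanse.sdsc.edu "$3"
}

# Placeholder usernames; replace them with the ones from your ACCESS profile.
annex_ssh_config user_a user_b user_c
```

Once the output looks right, it could be appended with `annex_ssh_config ... >> ~/.ssh/config`. Running `chmod 600 ~/.ssh/config` afterwards is a sensible precaution, since ssh rejects a config file that other users can write to.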

Determining Project ID and Queue

To start an annex, you need the project identifier used at the particular resource provider. Note that this might not be the same as your ACCESS allocation ID. To determine it, log in to the resource provider via the Open OnDemand instances in the table above and run the resource-provider-specific command. You can also use this login to authorize the ssh key from the previous step. For example, to get set up on Anvil, log in to https://ondemand.anvil.rcac.purdue.edu and start an interactive shell. In that shell, first run mybalance:

$ mybalance
Allocation     Type  SU Limit    SU Usage    SU Usage    SU Balance
Account                          (account)   (user)
=============  ====  ==========  ==========  ==========  ==========
abc12345       CPU   100000.0    0.0         0.0         100000.0

Take note of the allocation account name; you will need it when starting the annex.
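If you have several allocations, picking the account names out of the output by eye gets tedious. The sketch below extracts them with awk, assuming the output keeps the layout shown in the example (a line of `=` characters followed by one row per allocation). The function name `mybalance_accounts` is hypothetical, and the sample text simply repeats the example output.

```shell
#!/bin/sh
# Extract allocation account names from mybalance-style output on stdin.
# Assumes data rows follow the "====" separator line, as in the example.
mybalance_accounts() {
    awk 'seen && NF { print $1 } /^=+/ { seen = 1 }'
}

# Sample input copied from the example output.
sample='Allocation     Type  SU Limit    SU Usage    SU Usage    SU Balance
Account                          (account)   (user)
=============  ====  ==========  ==========  ==========  ==========
abc12345       CPU   100000.0    0.0         0.0         100000.0'

printf '%s\n' "$sample" | mybalance_accounts   # prints abc12345
```

On Anvil itself, the equivalent would be `mybalance | mybalance_accounts`.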

Installing the SSH key

Install the ~/.ssh/annex.pub key from access.pegasus.isi.edu into ~/.ssh/authorized_keys on the resource:

$ nano ~/.ssh/authorized_keys

Paste in the contents of ~/.ssh/annex.pub (make sure it is the .pub file, not the private key).
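As an alternative to pasting in an editor, the key can be appended from the command line. This is a minimal sketch that assumes you have already copied annex.pub to the resource; the function name `install_pubkey` is hypothetical. It skips the append if the key is already present and tightens permissions, since sshd commonly ignores an authorized_keys file that others can write to.

```shell
#!/bin/sh
# Append a public key to an authorized_keys file unless it is already there.
# Usage: install_pubkey PUBKEY_FILE AUTHORIZED_KEYS_FILE
install_pubkey() {
    pubkey_file=$1; auth_file=$2
    auth_dir=$(dirname "$auth_file")
    mkdir -p "$auth_dir" && chmod 700 "$auth_dir"
    touch "$auth_file" && chmod 600 "$auth_file"
    # -x: whole-line match, -F: fixed string, -f: read the pattern from the key file
    grep -qxFf "$pubkey_file" "$auth_file" || cat "$pubkey_file" >> "$auth_file"
}

# Example call on the resource (assumes annex.pub was copied to your home directory):
# install_pubkey ~/annex.pub ~/.ssh/authorized_keys
```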

Provisioning Resources

You can create an annex with the annex create command. There is also an annex add command for adding more resources once you have an annex running. You have to specify your allocation, and the last part of the command is the queue and resource. Note that $USER should be left alone in the command - the shell will substitute the correct value there.

$ htcondor annex create --nodes 1 --lifetime 86400 --project PROJECT_ID $USER QUEUE@RESOURCE

For example, if you want to run on Anvil, using the standard queue and your project id is abc1234, the command would be:

$ htcondor annex create --nodes 1 --lifetime 86400 --project abc1234 $USER standard@anvil
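The value 86400 in these commands corresponds to 24 hours if --lifetime is given in seconds, which is the assumption made here. The sketch below builds the command from a lifetime in hours and prints it without running it, so you can check it before submitting. The function name `annex_create_cmd` is hypothetical.

```shell
#!/bin/sh
# Print (not run) an annex create command for a lifetime given in hours.
# Usage: annex_create_cmd NODES HOURS PROJECT_ID QUEUE RESOURCE
annex_create_cmd() {
    nodes=$1; hours=$2; project=$3; queue=$4; resource=$5
    lifetime=$(( hours * 3600 ))   # assumes --lifetime is in seconds
    # \$USER is escaped so it is printed literally, as in the commands above.
    echo "htcondor annex create --nodes $nodes --lifetime $lifetime --project $project \$USER $queue@$resource"
}

annex_create_cmd 1 24 abc1234 standard anvil
# prints: htcondor annex create --nodes 1 --lifetime 86400 --project abc1234 $USER standard@anvil
```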

The command will ask you to authenticate. For some resource providers, the ssh key will be enough. Some might require a two-factor login:

Duo two-factor login for user

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-1234
 2. Phone call to XXX-XXX-1234

Passcode or option (1-2): 1
Thank you.

Monitoring

The status of your annex can be displayed with the annex status command:

$ htcondor annex status $USER

The command will provide an overview of resources, and how long they will be available:

$ htcondor annex status $USER
Annex 'bob' is established.
Its oldest established request is about 0.06 hours old and will retire in 0.94 hours.
You requested 2 nodes for this annex, of which 1 are in an established annex.
There are 128 CPUs in the established annex, of which 4 are busy.
3 jobs must run on this annex, and 3 currently are.
You requested resources for this annex 1 times; 0 are pending, 1 comprise the established annex, and 0 have retired.

Another tool for showing your resources is condor_status. It lists the available “slots”, but note that these are partitionable, i.e. they are created dynamically based on the size of your jobs. Example:

$ condor_status -const "AnnexName == \"$USER\""
Name                                OpSys  Arch    State      Activity  LoadAv  Mem     ActvtyTime

slot1@a666.anvil.rcac.purdue.edu    LINUX  X86_64  Unclaimed  Idle      0.000   248310  0+00:05:00
slot1_2@a666.anvil.rcac.purdue.edu  LINUX  X86_64  Claimed    Busy      0.000   3072    0+00:04:04
slot1_3@a666.anvil.rcac.purdue.edu  LINUX  X86_64  Claimed    Busy      0.020   3072    0+00:04:02
slot1_4@a666.anvil.rcac.purdue.edu  LINUX  X86_64  Claimed    Busy      0.020   3072    0+00:04:04

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill  Drain
X86_64/LINUX  4      0      3        1          0        0           0         0
Total         4      0      3        1          0        0           0         0
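For a quick summary of a longer listing, the slot lines can be tallied by state. The sketch below assumes the default column layout of the example, where State is the fourth column and slot names start with "slot"; the function name `slot_states` is hypothetical, and the sample rows repeat the example output.

```shell
#!/bin/sh
# Count slots per State from default condor_status output on stdin.
# Assumes State is the fourth column and slot names start with "slot".
slot_states() {
    awk '/^slot/ { n[$4]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample rows copied from the example output.
sample_status='slot1@a666.anvil.rcac.purdue.edu    LINUX  X86_64  Unclaimed  Idle  0.000  248310  0+00:05:00
slot1_2@a666.anvil.rcac.purdue.edu  LINUX  X86_64  Claimed    Busy  0.000  3072    0+00:04:04
slot1_3@a666.anvil.rcac.purdue.edu  LINUX  X86_64  Claimed    Busy  0.020  3072    0+00:04:02
slot1_4@a666.anvil.rcac.purdue.edu  LINUX  X86_64  Claimed    Busy  0.020  3072    0+00:04:04'

printf '%s\n' "$sample_status" | slot_states
# prints:
# Claimed 3
# Unclaimed 1
```

Against a live annex, the same function could be fed from `condor_status -const "AnnexName == \"$USER\"" | slot_states`. The Unclaimed slot1 line is the partitionable parent slot, which holds the resources not yet carved out for jobs.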
