PBS Job Chains and Dependencies

from: https://docs.loni.org/wiki/PBS_Job_Chains_and_Dependencies

 

Deprecated: please see new documentation site.

Quite often, a single simulation requires multiple long runs which must be processed in sequence. One method for creating a sequence of batch jobs is to have each job script execute the “qsub” or “llsubmit” command to submit its successor. We strongly discourage such recursive, or “self-submitting,” scripts because for some jobs this kind of chaining does not work: when a job hits the time limit, the batch system kills it and the command to submit the subsequent job is never processed.

LoadLeveler and PBS both allow users to move the logic for chaining out of the script and into the scheduler. The LoadLeveler feature is discussed in the article LoadLeveler Job Chains and Dependencies.

In PBS, you can use the “qsub -W depend=…” option to create dependencies between jobs.

qsub -W depend=afterok:<Job-ID> <QSUB SCRIPT>

Here, the batch script <QSUB SCRIPT> is submitted right away, but the resulting job is held until the job <Job-ID> has completed successfully. Useful options to “depend=…” are

  • afterok:<Job-ID> Job is scheduled only if the job <Job-ID> completes successfully (exits without errors).
  • afternotok:<Job-ID> Job is scheduled only if the job <Job-ID> exits with errors.
  • afterany:<Job-ID> Job is scheduled once the job <Job-ID> exits, with or without errors.
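
As an illustration of how these options can be combined, the sketch below submits one job followed by two dependents: one that runs only on success and one that runs only on failure. The function name submit_with_branches and the three script names are hypothetical and not part of PBS; only “qsub” and the “-W depend=…” syntax come from the text above.

```shell
#!/bin/bash

# submit_with_branches MAIN ON_SUCCESS ON_FAILURE
# Hypothetical helper: submits MAIN, then one follow-up job for each outcome.
# Prints the three job IDs, one per line.
submit_with_branches() {
    local main_id
    # qsub prints the new job's ID on standard output; capture it.
    main_id=$(qsub "$1") || return 1
    echo "$main_id"
    # Runs only if MAIN exits without errors.
    qsub -W depend="afterok:${main_id}" "$2" || return 1
    # Runs only if MAIN exits with errors.
    qsub -W depend="afternotok:${main_id}" "$3" || return 1
}
```

A call might look like "submit_with_branches main.pbs postprocess.pbs cleanup.pbs" (all three file names are placeholders). Only one of the two follow-up jobs is expected to run; the other has a dependency that can never be met.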

One way to simplify this process is to write multiple batch scripts (job1.pbs, job2.pbs, job3.pbs, etc.) and submit them using the following script:

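#!/bin/bash

# Submit the first job and capture the job ID that qsub prints.
FIRST=$(qsub job1.pbs)
echo "$FIRST"
# Each later job is held until the previous one exits, with or without errors.
SECOND=$(qsub -W depend=afterany:"$FIRST" job2.pbs)
echo "$SECOND"
THIRD=$(qsub -W depend=afterany:"$SECOND" job3.pbs)
echo "$THIRD"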

Modify the script according to the number of chained jobs required. The job <$FIRST> will be placed in the queue, while the jobs <$SECOND> and <$THIRD> will be placed in the queue with the “Not Queued” (NQ) flag in Batch Hold. When <$FIRST> is completed, the NQ flag on <$SECOND> will be replaced with the “Queued” (Q) flag and that job will be moved to the active queue; <$THIRD> is released in the same way once <$SECOND> is completed.
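
For longer chains, the three-job script above can be generalized with a loop. The following is a minimal sketch, not an official LONI tool: the function name submit_chain and its argument order are assumptions made for illustration. It takes a dependency type (afterok, afternotok, or afterany) followed by any number of batch scripts, and makes each job depend on the one submitted just before it.

```shell
#!/bin/bash

# submit_chain DEP_TYPE SCRIPT [SCRIPT ...]
# Hypothetical helper: submits the scripts in order, each one depending
# on the previous job. Prints every job ID as it is assigned.
submit_chain() {
    local dep_type="$1"
    shift
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first job has no dependency.
            jobid=$(qsub "$script") || return 1
        else
            # Later jobs wait on the job submitted just before them.
            jobid=$(qsub -W depend="${dep_type}:${prev}" "$script") || return 1
        fi
        echo "$jobid"
        prev="$jobid"
    done
}
```

With this function defined, "submit_chain afterany job1.pbs job2.pbs job3.pbs" would issue the same three qsub commands as the script above. The function stops at the first failed submission so that no job is chained to a missing job ID.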

A few words of caution: if you list the dependency as “afterok” and the job exits with errors, or as “afternotok” and the job exits without errors, then your subsequent jobs will be killed due to “dependency not met”.
