Architecture

SyQADA Process

syqada auto automates the manual process of batch management that was used in SyQADA 0.9.8, as follows (a shell sketch of the loop appears after these steps):

# create the task directory and build METADATA files for each task
# run *syqada manage --fix done failed* on each step in turn to see if it has completed.

The *--fix done failed* option prepares the batch for resumption if the task directory was incompletely managed (probably because SyQADA terminated before all jobs were completed).

Then, for each step (in order) that syqada manage determines is not complete:

# run *syqada batch* on the task.

If syqada batch succeeds:

# run *syqada manage* to confirm that the step is complete.
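
A minimal shell sketch of that loop follows. It assumes, hypothetically, that each subcommand takes the task's batchroot as a positional argument and that syqada manage exits nonzero when a step is incomplete; the real SyQADA command lines may differ.

#!/bin/bash
# Sketch only: the batchroot names (01-Align, 02-CallVariants) are
# invented, and the argument and exit-code conventions are assumptions,
# not the documented SyQADA interface.
for task in 01-Align 02-CallVariants; do
    # prepare the batch for resumption after an interrupted run
    syqada manage "$task" --fix done failed
    # run the batch only if the step is not already complete
    if ! syqada manage "$task"; then
        syqada batch "$task"
        # confirm that the step is now complete
        syqada manage "$task" || exit 1
    fi
done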

Batch Generation

syqada batch relies on a METADATA file to create the jobs for its batch. This file is normally created by syqada auto, but it is relatively easy to create or edit manually if one wishes.

The BatchRunner uses the METADATA file to build scripts (usually one for each sample) suitable for submission to the cluster, with appropriate parameters set for the duration and memory demands of the job as specified in the task file. Each script is made executable, so that it can equally well be run locally on a standard Unix machine (note that paths in the configuration file may be correct for one machine but not for another).
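
For illustration, a generated script might resemble the sketch below. The PBS directives, paths, sample name, and filter command are all hypothetical; the actual script depends on the script template, the task file, and the chosen interface.

#!/bin/bash
# Hypothetical generated job script for one sample. The PBS header
# encodes the duration and memory demands taken from the task file;
# under the LOCAL interface the same script simply runs as an
# executable.
#PBS -N 07-FilterExome.sample01
#PBS -l walltime=00:15:00
#PBS -l mem=4gb

cd "${PBS_O_WORKDIR:-.}"
samtools view -b -L exome_targets.bed sample01.bam > output/sample01.filtered.bam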

See How to Build a Script Template for an approximate description of what a script template and a task file look like.

Batch Generation Gotchas

Once syqada auto has run, each task's METADATA file is fixed, and any errors in the .task file have been propagated to it. Therefore, if you are developing a new protocol or modifying the behavior of an old one, correcting an error in the .task file must be accompanied either by removing and regenerating all of the METADATA files, or by editing the METADATA file in question. The former is preferable, because it improves reproducibility.
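
For example, a sketch of the regenerate approach, assuming the METADATA files sit at the top of each task's batchroot directly under the current run directory:

# Remove every task's METADATA file so that the next run of
# syqada auto rebuilds them from the corrected .task file.
# The glob assumes batchroots live directly under the run directory.
rm */METADATA
syqada auto   # rerun with your usual arguments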

Underneath the Hood

SyQADA was built in pieces so that each piece could be functional without waiting for the whole system to work. There are some warts in code and functionality as a result, but the benefit is that the system was beaten pretty hard during development, so the older the behavior, the more reliable it is. As a consequence, the JobBatch, JobGenerator, BatchRunner, batch_tool, Automator, and Tool are all free-standing Python applications that will still, hypothetically, work by themselves if invoked from the command line. The only ones that still make real sense to use are the BatchRunner, the batch_tool, the Automator, and the Tool, which are invoked from the SyQADA driver as syqada batch, syqada manage, syqada auto, and syqada tools, respectively.

The functional flow is as follows:

The Automator (syqada auto) examines the batchroot of each task in turn, creating the directory if it does not exist. If no METADATA file exists in the batchroot, that file is created from the protocol and config files. The Automator then uses the batch_tool on each batchroot to determine the current state of the batch so as to decide whether to invoke the BatchRunner on that task.

The BatchRunner (syqada batch) creates a JobBatch object to manage the task. For a new task, it sets up the batchroot directory and uses the JobGenerator to generate new jobs from the task file and the template that the task file identifies. New jobs are placed in the task's PENDING directory.

Thereafter, the BatchRunner gives control to the existing JobBatch object to manage the jobs in the task. The JobBatch submits each job (up to some limit) to the queue_manager (LOCAL, PBS, or LSF), which manages job execution. Each submitted job is moved to the task's RUNNING directory. When the JobBatch and queue_manager determine that a job is complete, it is moved to the task's DONE directory. Failures are moved to ERROR. The intermediate states QUEUED and STUCK can be created dynamically to record those statuses.
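
Schematically, a task's batchroot therefore looks something like this (QUEUED and STUCK appear only when those states arise):

batchroot/
    METADATA      parameters used to generate the batch
    PENDING/      newly generated jobs awaiting submission
    RUNNING/      jobs submitted to the queue_manager
    DONE/         jobs that completed successfully
    ERROR/        jobs that failed
    QUEUED/       created dynamically for queued jobs
    STUCK/        created dynamically for stuck jobs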

When the JobBatch is complete, control is returned to the BatchRunner, which in turn relinquishes control to the Automator, which then decides whether to terminate with an error or run the next step.

Historical Invocation

This example is provided merely as an indication of the organic growth of SyQADA. Before about version 0.9.4, one created a matching batchrunner.sh for each script template. This continued to be supported, although mostly unused, throughout version 0.9; version 0.9.8 supported it by creating a METADATA file and then proceeding in the same way as version 1.0. (Note that namepatterns containing glob symbols needed to be quoted, since they were passed through the shell.) Here is an example of a batchrunner.sh that might have worked with the script template described in How to Build a Script Template:

syqada batch \
  --project $PROJECT \
  --configuration control/$PROJECT.config \
  --sample_file control/$PROJECT.samples \
  --inputdir 05-PicardSortSam/output \
  --rundir . \
  --interface LOCAL \
  --task $VARIANT_CALLING/pipelines/seqdata/alignment/07-FilterExome.sh \
  --gb_memory=4 \
  --jobestimate=15:00 \
  --namepattern '_*.bam' \
  --step $@
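
Presumably such a script was invoked with the step identifier as its argument, since the script's command-line arguments are forwarded to --step via $@; the exact convention likely varied by project.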

Environment

One special variable is drawn from the program's shell environment. Its name, use, and default follow:

SYQADAPATH
    A colon-separated list of directories in which to look for workflow tasks and templates.
    It defaults to the parent of the bin directory in which the running
    SyQADA executable is found.

The following environment variables are frequently used in the standard workflows:

PROJECT
    The name of the project, for interpolation into filenames, chiefly the
    tumor-normal and pairs files used for somatic variant detection.

TEAMROOT
    The root of a directory tree that contains a number of executable bins as well
    as the standard genomic reference files in use by the team.

SYQADA
    The path to the version of SyQADA currently in use.
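
For example, a session using the standard workflows might set them as follows (all values are hypothetical):

# Hypothetical values; substitute your own installation paths.
export SYQADAPATH=/home/me/workflows:/opt/syqada/share
export PROJECT=myproject
export TEAMROOT=/data/team
export SYQADA=/opt/syqada/1.0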