Running an Existing Workflow

To run an existing workflow, you need to create two files, a config file and a sample file, and to copy a protocol file from an existing workflow.

Let us call our project MYPROJECT. To follow convention, and therefore use the easiest invocation, these files should be placed in your control directory and named MYPROJECT.config, MYPROJECT.samples, and MYPROJECT.protocol:

MYPROJECT.config

This file contains colon-separated attribute-value pairs on separate lines that specify the paths for programs and reference files required by the workflow.

A possible start is to copy from workflows/example/Example.config, which contains specifications for most of the default programs and reference files. Most of the required reference data are nested inside the TEAM_ROOT directory.

The directory containing the initial data should be designated as the sourcedata attribute of the config file.
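
For illustration only, a minimal config might look something like the following. The sourcedata attribute is the one named above; the other attribute names and all of the paths are hypothetical placeholders, so copy the real names from workflows/example/Example.config:

sourcedata: /path/to/MYPROJECT/rawdata
bwa: /path/to/bin/bwa
reference: /path/to/TEAM_ROOT/reference/genome.fasta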

MYPROJECT.samples
This file contains one sample name per line. These names normally serve as the prefix of the files in the sourcedata directory, for instance, SAMPLE123.bam or S001_001.fastq. Best practice is to create de-identified sample names if you have been given medical record numbers (MRNs), to reduce the chances of MRNs getting propagated into result tables or charts. If given a directory of files named with MRNs, I like to create a second directory that uses hard links named with de-identified sample names, as sketched below. Of course, you need a manifest that maps between the two sets of names.
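
For illustration, a samples file is nothing more than a list of names, one per line, such as these (hypothetical) ones:

SAMPLE001
SAMPLE002
SAMPLE003

and the hard-link trick described above amounts to something like the following, where the MRN-named file and directory names are invented:

>>> mkdir deidentified
>>> ln mrn_data/1234567.bam deidentified/SAMPLE001.bam
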
MYPROJECT.protocol
This file contains task specifications or references to files containing task specifications. You should only need to copy the protocol from the appropriate subdirectory of the workflows/control directory (see Packaged Workflows for descriptions) into your control directory. If you need to construct a new protocol file instead, that is explained in Building a New Workflow.
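
For example, if the workflow you want lives in a subdirectory named SOMEWORKFLOW (both the subdirectory and the protocol filename here are placeholders), the copy amounts to:

>>> cp workflows/control/SOMEWORKFLOW/SOMEWORKFLOW.protocol MYPROJECT.protocol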

Assuming you have followed convention, you can then execute the workflow automatically by running:

>>> syqada auto

or, if you prefer lots of typing, or wish to place these files in a non-standard location, you can run:

>>> syqada auto --configuration path/MYPROJECT.config \
              --sample_file path/MYPROJECT.samples \
              --protocol path/MYPROJECT.protocol \
              --project MYPROJECT \
              --notifications path/afilename.txt

If all is correctly configured, SyQADA will first create a series of processing directories numbered sequentially and named for the tasks in your protocol, something like:

01-bwa-align
02-bwa-sampe

Each of these directories will now contain a METADATA file that contains everything necessary for an invocation of syqada batch to create the job scripts for the task and then execute them. (You may optionally use syqada auto --init to cause SyQADA to create the directories and their metadata, and then run each step manually with syqada batch, as sketched below.)
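
As a sketch of that manual alternative (the batch directory name matches the example above, and the exact syqada batch arguments you need may differ; see Manual syqada tools below):

>>> syqada auto --init
>>> syqada batch 01-bwa-align --step run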

Then SyQADA will attempt to run each step in turn, creating the appropriate number of jobs for your sample data, then submitting them (see Job Submission below). SyQADA will then monitor the jobs for completion, submitting more jobs as necessary to complete the batch. SyQADA reports progress roughly every 5 minutes, or earlier when jobs finish, indicating percent of jobs complete, as well as the number of jobs that have completed or failed. Upon completion of each step, SyQADA reports a message such as:

H00:00:16.242 Batch completed, 10 successes
H00:00:16.242 Example batch 01-example finished

You may monitor progress of the workflow by viewing the SyQADA status page for your project:

http://d1prphaplotype1.mdanderson.org:8080/RISPROJECTS/syqada-status/MYPROJECT.txt

Manual syqada tools

If your workflow stops with the message:

H04:12:12.918 INFO Kadara_FEPilot batch 0804-mutect stopped

then there is some problem that you will have to correct. In addition to syqada auto, there are two other useful commands:

syqada manage
Corrects problems with SyQADA’s management of jobs.
syqada batch
Runs a single batch, optionally moving failed job scripts back to the PENDING directory and saving the failed logs for later examination.

They are used roughly as follows (of course the batch directory will probably be different):

>>> syqada manage DIRECTORY

will show the current state of the task that is running in DIRECTORY. Usage of syqada manage is discussed in syqada manage.

If SyQADA terminates early for some reason and syqada manage reports unmanaged files, you can run:

>>> syqada manage DIRECTORY --fix done failed

and completed or failed jobs that were not managed automatically will be moved into their appropriate status directories.

To move the failed job scripts back to the PENDING directory and rerun them in a single invocation, use:

>>> syqada batch DIRECTORY --step repend run

Alternatively, you can invoke without the run step, thus:

>>> syqada batch DIRECTORY --step repend

and then resume automatic mode with:

>>> syqada auto

There are obviously many ways in which a step in the workflow can go wrong. Please see the short Troubleshooting Guide.

You may find it convenient to identify your project to your (bash) environment:

>>> export PROJECT=MYPROJECT

so that you can refer to it as $PROJECT. Maybe not. This used to be a requirement for running SyQADA, but it is no longer strictly necessary, because syqada itself sets a PROJECT variable based on the --project parameter if given, or else on the protocol name. There are generic uses of $PROJECT in protocol and task files in the workflow repository, especially in tumor-normal workflows. Environment variables are discussed in Environment.
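
As a hypothetical illustration of such a generic reference, a config attribute could use the variable instead of a literal project name (the attribute value here is invented):

sourcedata: /path/to/projects/$PROJECT/rawdata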

Job Submission

If you are running on one of the cluster nodes, job submission is accomplished by effectively running this shell command for PBS:

>>> cat job-script | qsub

or this command for LSF:

>>> cat job-script | bsub

The SyQADA JobGenerator takes care of constructing a script that includes all the necessary incantations to run the job in the right environment on the cluster. For very short jobs, adding the interface = LOCAL option in the job task will cause SyQADA to spawn the job on the local machine.
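
A sketch of what that might look like inside a task specification, assuming the task file accepts simple name = value settings as the option syntax suggests (the rest of the task's lines are omitted):

interface = LOCAL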

All jobs run on machines without cluster access are spawned locally. When SyQADA is running in LOCAL mode, killing the SyQADA monitor kills any child jobs that the monitor has spawned, leaving them in an invalid state: they remain in the RUNNING queue even though they completed with a (forced) failure. Those jobs need to be cleaned up manually (stay tuned, there is a feature ticket to fix this) by moving the scripts from RUNNING to ERROR and then running:

>>> syqada batch BATCHNAME --step repend

before resuming.
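
Put together, cleaning up after killing a LOCAL run might look roughly like this; the batch directory name is illustrative, and the assumption that the status queues are subdirectories of the batch directory follows from the script lifecycle described below:

>>> mv 01-bwa-align/RUNNING/* 01-bwa-align/ERROR/
>>> syqada batch 01-bwa-align --step repend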

The script itself is used by SyQADA to indicate the progress of the job through the system. It starts in the PENDING directory, is moved to RUNNING, then possibly to QUEUED or STUCK, and to DONE upon completion or ERROR upon failure. Because the script is the object submitted to the cluster, you may examine it, edit it, and resubmit it if you wish (editing and resubmission can make keeping track of reproducibility a nightmare, however, and is only recommended for debugging a new workflow).

Architecture consistency headaches

Hypothetically, one can start a workflow on the cluster, run one or more tasks, stop, and then switch to run the rest of the workflow on a local machine. However, although SyQADA eliminates the need to worry about whether you are on the cluster or not, one of the biggest remaining hassles is that several significant executables have different versions on the cluster and on local machines, and so the configuration file must be changed to reflect this. Since one of the guiding principles of SyQADA is that the configuration file should be immutable once the workflow has started, this is a nasty problem for which there is no graceful solution except to avoid switching horses in midstream.

I commonly run sequence analysis on the cluster and then switch to run variant annotation and analysis on the haplotypes, but the reason to do so (historically, vtools was not available on the cluster) is no longer as compelling as it was. On the other hand, vtools does not do well with parallel execution that it does not itself control, so running it on the cluster is more or less a misuse of the head node resource (the use of max_bolus = 1 can address this, however).