Directory Structure of a SyQADA Workflow

A SyQADA workflow is performed from a single working directory. Initially, the directory contains only a control directory, which itself contains several files, at least PROJECT.config, PROJECT.protocol, and PROJECT.samples. It is possible to run SyQADA without following this convention, but it costs little to arrange things this way, and there is no test suite coverage for doing otherwise, so things might break in subtle ways. I believe it is worth reducing the intellectual overhead of working in bioinformatics by having a convention for where to look for the control files.

Workflow Structure Definition

When SyQADA reads the protocol file and any task files referred to in it, it creates a series of directories, one per task, in the working directory (several perfectly useful SyQADA workflows have only one task). To me, it makes sense to discuss the structure of a single task before examining the larger directory.

Directory Structure of a Task

SyQADA creates its task directories for a workflow in the working directory from which it is invoked. It therefore expects that all subsequent invocations on that workflow will occur in the same working directory. SyQADA creates and expects a directory named ‘NNNN-taskname’ where NNNN is a four-digit number with leading zeroes, and taskname is a description derived from the task definition specified by the workflow protocol. The directories are numbered with leading zeroes so that workflows with 10 (or 100?) or more steps will show in sequence in a directory listing. Such a directory has these contents (shown in order of use, not of ls listing, with descriptive notes to the right):

01-first-step/
                METADATA # The file that defines how jobs are generated
                PENDING/ # The directory into which job scripts are written
                RUNNING/ # The directory into which jobs are moved as they start
                DONE/    # The directory into which jobs are moved as they complete
                ERROR/   # The directory into which jobs are moved if they fail
                LOGS/    # The directory into which standard error and output are written
                project-tmp/ # holds intermediate outputs that are removed at end of step
                workflows/ # a cache of the task files, and templates named in the protocol
                jobbatch.log # a log file of task progress
                job-statistics.txt # execution statistics of each job in the task
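
The effect of the leading zeroes on directory listings can be illustrated with a minimal Python sketch. The helper name task_dirname and its width parameter are hypothetical, not part of SyQADA; the width is left adjustable because the prefix is described above as four digits while the listing shows two.

```python
def task_dirname(index, taskname, width=2):
    """Build a zero-padded task directory name (hypothetical helper)."""
    return f"{index:0{width}d}-{taskname}"

padded = [task_dirname(i, "step") for i in (1, 2, 10)]
unpadded = [f"{i}-step" for i in (1, 2, 10)]

# Sorting mimics the order of a plain directory listing.
print(sorted(padded))    # ['01-step', '02-step', '10-step']
print(sorted(unpadded))  # ['1-step', '10-step', '2-step']
```

Without padding, step 10 sorts ahead of step 2, which is the situation the numbering convention avoids.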

A job script is a bash script named for the task, the sample, and the genomic split, if one is specified. The sample and genomic split together form an identifier that is used to name the job’s output and control files (if the jobgeneration option “summary” is specified, then the project name is used in place of the sample and split). The job for a sample named XXX, for example, is initially placed in the PENDING directory:

01-first-step/PENDING/first-step-runner-XXX.sh

Upon submission, the job script is moved into the RUNNING directory with the job id appended to its name, and it remains there while the job is processed:

01-first-step/RUNNING/first-step-runner-XXX.sh%1234

On a cluster, the job can often be queued before it runs. If SyQADA detects this, the fact is recorded by creating a QUEUED directory and placing the job script there:

01-first-step/QUEUED/first-step-runner-XXX.sh%1234

A queued cluster job is moved back to RUNNING when SyQADA detects that it has been assigned to a compute node.

Upon completion of the job, the LOGS directory will contain four files per completed job, each named consistently with the job script. The output directory will contain:

01-first-step/DONE/first-step-runner-XXX.sh%1234
01-first-step/LOGS/
                     XXX.begun
                     XXX.err
                     XXX.out
                     XXX.done

The file XXX.begun contains in its first line a timestamp of when the job actually began executing; the rest of the file is a list of the values of environment variables at the time the job was executed.

The files XXX.err and XXX.out contain, respectively, the contents of standard error and standard output of the job.

The file XXX.done contains a timestamp of when the payload of the job actually completed.

Should the job terminate with a non-zero return code, the error is recorded by moving the job to ERROR, and creating an XXX.failed file with the timestamp of the failure instead of XXX.done:

01-first-step/ERROR/first-step-runner-XXX.sh%1234
01-first-step/LOGS/XXX.failed
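
Because a job's state is encoded by which subdirectory holds its script, the state can be recovered by inspecting the task directory. The following is a minimal sketch of that idea, not SyQADA's own code; the function names split_job_name and job_state are hypothetical, and the only assumptions are the directory names and the `%jobid` suffix shown above.

```python
from pathlib import Path

# State directories as described in this section.
STATES = ("PENDING", "QUEUED", "RUNNING", "DONE", "ERROR")

def split_job_name(filename):
    """Split 'script.sh%1234' into ('script.sh', '1234'); id is None if absent."""
    script, sep, job_id = filename.partition("%")
    return script, (job_id if sep else None)

def job_state(task_dir, script_name):
    """Return (state, job_id) for a job script, or (None, None) if not found."""
    for state in STATES:
        state_dir = Path(task_dir) / state
        if not state_dir.is_dir():
            continue  # e.g. QUEUED exists only once a job has been queued
        for entry in state_dir.iterdir():
            script, job_id = split_job_name(entry.name)
            if script == script_name:
                return state, job_id
    return None, None
```

For example, `job_state("01-first-step", "first-step-runner-XXX.sh")` would report the ERROR state and job id 1234 for the failed job shown above.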

Nota Bene: Not all programs conform to the venerable Unix convention of putting error messages in standard error, so the .err file may be empty even when there is an error. You may need to look into the .out file to determine errors. This occurs with MuTect(?) among other frequently used programs.

The general behavior is as follows.

If the split option was used, there will be multiple job files per sample, depending on the specification in the split. The general form is:

* split = MM-NN[,MM-NN][,name1][,name2]

where MM-NN defines a numeric range. Multiple ranges are permitted. In addition, the names of non-numeric chromosomes can be used, like ‘X’, ‘XY’, ‘M’, ‘MT’, or ‘P’.

These special labels relate to the human genome. The gatk options render a grouping of chromosomes that is approximately balanced by total length:

* split = autosome
* split = autosome+x
* split = chromosome
* split = chromosome+m
* split = gatk
* split = gatk-autosome
* split = gatk-autosome+x

The files created for sample XXX are named in the obvious way, XXX-chr1 to XXX-chr22, XXX-chrX, or XXX-chrY respectively. The specially named chromosomes get names in the same way, like XXX-chrXY.

In each file, the terms {chromosome} and {region} are substituted with the value of the split for that file.
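
The expansion of an explicit split specification into per-chromosome job identifiers can be sketched as follows. This is an illustration under stated assumptions, not SyQADA's implementation: the function names expand_split and job_identifiers are hypothetical, and the special labels (autosome, gatk, and so on) are deliberately not handled, since their exact groupings are not given here.

```python
def expand_split(spec):
    """Expand 'MM-NN[,MM-NN][,name1][,name2]' into chromosome labels."""
    labels = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a stray comma
        first, dash, last = part.partition("-")
        if dash and first.isdigit() and last.isdigit():
            # Numeric range, inclusive at both ends.
            labels.extend(str(n) for n in range(int(first), int(last) + 1))
        else:
            labels.append(part)  # a named chromosome such as X or MT
    return labels

def job_identifiers(sample, spec):
    """Name one job identifier per chromosome, in the XXX-chrN style above."""
    return [f"{sample}-chr{label}" for label in expand_split(spec)]

print(job_identifiers("XXX", "1-3,X"))
# ['XXX-chr1', 'XXX-chr2', 'XXX-chr3', 'XXX-chrX']
```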

Structure of a Workflow

As a matter of course, SyQADA works in a single working directory. It expects a sourcedata directory to be specified. In the basic tutorial, this sourcedata directory is in the working directory; however, our convention is to place it in the project’s base directory.

Our conventional directory structure looks like this:

MYPROJECT/
         artifacts/
         sourcedata/
                    batchA/
         working/find-loh/
                          control/

In the structure thus named, you would first cd to MYPROJECT/working/find-loh. SyQADA will normally expect a config file, a sample file, and a protocol file in the control directory. Deviating from that convention is possible, but may lead to confusion, since you risk losing track of the invocation parameters of the deviation. Furthermore, it is not tested, so you also risk syqada stumbling over itself and spraying a dirty text stream all over your screen, or worse, configuring a job contrary to your expectations.

A workflow is basically a numerically ordered set of directories numbered from 01 (see Replication Directory Structure for the numbering differences with the replication option; nested protocols have their own numbering, described elsewhere in this documentation). As SyQADA runs, it constructs the numbered directories that will contain the control files and output of each task, as described above in Directory Structure of a Task. The result may look like this:

MYPROJECT/
         working/find-loh/
                          01-phase-samples
                          02-combine-variants
                          03-haplohseq
                          control/
                          workflows/
                          project-tmp/

The workflows directory is a cache of the tasks and templates referred to in the protocol file. This caching behavior, new to syqada 2.1, is discussed in Improved Reproducibility (New in 2.1).

The project-tmp directory is created to store temporary files needed only within a single task step; these are removed whenever a step is completed successfully.

The METADATA file in each directory defines the input directory for that task. You can write these files yourself, but syqada automatically creates the METADATA file for each step based on the protocol file and other initial conditions. The first task expects its input in a sourcedata directory defined in the project configuration file. Subsequent tasks expect their inputs from the output directory of the preceding task. An exception to this rule is that if the preceding task name begins with QC, the input directory is found in the most recent non-QC step (a second consecutive QC step takes its input from the preceding QC step). One can also change a step’s input directory using an inputdir specification in the protocol task definition for that step (see Generated Terms (inputdir) for details on that as well as added_input).
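
The default input rule, including the QC exception, can be sketched in Python. This reflects one reading of the paragraph above rather than SyQADA's actual logic: the function name input_sources is hypothetical, tasks are identified by name rather than by directory, explicit inputdir overrides are ignored, and falling back to sourcedata when no earlier non-QC step exists is an assumption.

```python
def is_qc(taskname):
    """A QC step is one whose task name begins with 'QC'."""
    return taskname.startswith("QC")

def input_sources(tasknames, sourcedata="sourcedata"):
    """Return, for each task in order, the step (or sourcedata) it reads from."""
    sources = []
    for i, name in enumerate(tasknames):
        if i == 0:
            sources.append(sourcedata)  # first task reads the source data
            continue
        prev = tasknames[i - 1]
        if not is_qc(prev) or is_qc(name):
            # Ordinary case, or a QC step directly following another QC step.
            sources.append(prev)
            continue
        # Preceding step is QC: fall back to the most recent non-QC step.
        earlier = [t for t in tasknames[:i] if not is_qc(t)]
        sources.append(earlier[-1] if earlier else sourcedata)  # assumed fallback
    return sources

print(input_sources(["align", "QC-align", "QC-depth", "call"]))
# ['sourcedata', 'align', 'QC-align', 'align']
```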

The Sample File

The sample file, normally named PROJECT.samples, contains information about the samples in the project, one per line. It suffices that the file contain one name per line; however, a tab-separated “phenotype” file can also be used as the samples file. If a “phenotype” file is used, it must have a header line, and one of the terms in that line must be one of the following, or some upper-case variant of one of these:

sample, sample_id, sampleid, sample_identifier, sample_name, samplename

More than one of these terms may appear in the header, but the leftmost such term found is the column that will be used to populate the {sample} key in a task template. Any other column name can be used to populate a task template by prefacing it with the string sample:. Thus, for the following sample file contents:

sample_id Sample_Name eyes SAMPLE
Sample123 Fred        blue value-inaccessible

The following line from a template file:

echo {sample}({sample:sample_id}) has name {sample:Sample_Name} (that's {sample:sample_name}) and {sample:eyes} eyes.

will be transformed into:

echo Sample123(Sample123) has name Fred (that's Fred) and blue eyes.

Note that, because the term ‘sample’ is handled specially, the fourth column of the file above, headed by ‘SAMPLE’, is ignored with a warning, so that {sample} remains unambiguous.
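
The column selection and substitution described in this section can be reproduced with a short Python sketch. It is an approximation for illustration only: the function names read_samples and expand are hypothetical, column lookup is assumed to be case-insensitive (as the {sample:sample_name} example suggests), only the tab-separated form with a header is handled, and the extra ‘SAMPLE’ column is silently dropped rather than reported with a warning.

```python
import csv
import io
import re

# Header terms recognized as the sample column (from the list above).
SAMPLE_TERMS = {"sample", "sample_id", "sampleid", "sample_identifier",
                "sample_name", "samplename"}

def read_samples(text):
    """Return a list of (sample_id, columns) from tab-separated text with a header."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = [h.strip().lower() for h in rows[0]]
    # The leftmost recognized term is the column that feeds {sample}.
    id_col = next(i for i, h in enumerate(header) if h in SAMPLE_TERMS)
    samples = []
    for row in rows[1:]:
        columns = {}
        for i, (name, value) in enumerate(zip(header, row)):
            if name == "sample" and i != id_col:
                continue  # ambiguous with {sample}; ignored in this sketch
            columns[name] = value.strip()
        samples.append((row[id_col].strip(), columns))
    return samples

def expand(template, sample_id, columns):
    """Substitute {sample} and {sample:column} terms in a template line."""
    def repl(match):
        key = match.group(1)
        return sample_id if key is None else columns[key.lower()]
    return re.sub(r"\{sample(?::([^{}]+))?\}", repl, template)

text = "sample_id\tSample_Name\teyes\tSAMPLE\nSample123\tFred\tblue\tvalue-inaccessible\n"
template = ("echo {sample}({sample:sample_id}) has name {sample:Sample_Name} "
            "(that's {sample:sample_name}) and {sample:eyes} eyes.")
for sample_id, columns in read_samples(text):
    print(expand(template, sample_id, columns))
# echo Sample123(Sample123) has name Fred (that's Fred) and blue eyes.
```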

Workflows with Replication

Replication workflows are designed to do experimental comparison of different parameters to a program or programs over the same set of samples. They differ in structure from standard workflows by including replication identities in each directory name.

See Replication for more details.