How to Build a Script Template¶
Script templates are typically built from working invocations, with the SyQADA- and configuration-dependent terms then extracted into standard terms enclosed in braces. Here is a simple example. The #!/bin/bash on the first line is optional, serving as a reminder that the script will eventually be run as a shell script; it will be supplanted by another #!/bin/bash during job generation:
#!/bin/bash
{bedtools} intersect \
-wa \
-header \
-abam {filename} \
-b {capturefile} \
> {output_prefix}.bam
The terms bedtools and capturefile are expected to be defined in the config file and should be specific as to full path and version; failure to specify full paths to executables and reference files can jeopardize your ability to reproduce your workflow. The capturefile depends on the capture technology used for the sequencing experiment. The terms filename and output_prefix are standard terms produced by the SyQADA job generator based on the project inputs.
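For illustration, the corresponding config file entries might look like the following (the paths and version shown here are hypothetical; substitute the actual locations on your system):
bedtools : /usr/local/packages/bedtools/2.27.1/bin/bedtools
capturefile : /data/references/capture/SureSelect_V6_covered.bed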
Improved Reproducibility (New in 2.1)¶
The following options to improve reproducibility have been added to syqada:
--strictness [PROGRAMPATHS | PATH | CACHING=[strict | ignore ] | LAZY ] (default PROGRAMPATHS,CACHING=strict)
--protocol_caching [strict | ignore | diffs | status | force ] (default strict)
--compatibility 2.0 (omitted by default to obtain current behavior)
>>> Reproducibility of Script Templates
The strictness options PROGRAMPATHS, PATH, and LAZY affect the environment of the running job script. To improve the reproducibility of job script templates, starting in 2.1 the default behavior of syqada is to eliminate the following Unix environment variables from the execution environment of a script that is submitted for execution:
PYTHONPATH
PERL5LIB
LD_LIBRARY_PATH
Scripts that rely on these variables must specify them in the template. The conventional (my) way of doing this is through template parameters. I also use Anaconda python and specify the precise conda environment, so this is what I commonly place in my .config file (occasionally, such as for SparCC, I must override these values in an individual task or protocol file):
python3_setup : source activate py3.6
python3path : $TEAM_ROOT/lib/python/zip/labtools-3.2.2.zip
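As a minimal sketch of how such parameters might then be consumed, a template that needs python could restore the environment explicitly (the analysis script name and output suffix here are hypothetical):
#!/bin/bash
# activate the configured conda environment
{python3_setup}
# make the configured library available to the script
export PYTHONPATH={python3path}
# hypothetical analysis step; substitute the real program invocation
python my_analysis.py {filename} > {output_prefix}.results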
The default template environment behavior in 2.1 is:
--strictness PROGRAMPATHS
To be even stricter, you can specify --strictness PROGRAMPATHS,PATH, which will also eliminate the PATH variable in the script execution environment.
To override this behavior and preserve the less scrupulous 2.0 behavior, invoke syqada with --strictness LAZY or, equivalently, --compatibility 2.0. (Nota bene: this also affects the protocol caching described next. To allow inheritance of the current environment but still enforce protocol caching, use --strictness CACHING=strict.)
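For example, assuming an otherwise typical invocation (the '...' here stands in for the rest of your usual syqada arguments, which depend on your project), the stricter environment handling could be requested like this:
syqada ... --strictness PROGRAMPATHS,PATH
and the old behavior recovered with:
syqada ... --compatibility 2.0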
>>> Caching to Improve Reproducibility of Workflow Versions
Often, in order to repeat workflows on new data, a protocol calls a particular task file and/or a particular template from a repository. We have occasionally returned to examine the results of a workflow on some sourcedata or other and attempted to re-run a particular step or steps, only to find that the task file or template called by reference had been altered in the repository, causing the new invocation to fail, either because the task file's parameters had changed or because the template's program invocation had changed.
In a first attempt to reduce the opportunity for such failures, syqada 2.1 now by default caches the task files, templates, and nested protocol files that are referred to in the protocol file.
Protocol Caching¶
The default protocol caching behavior in 2.1 is:
--protocol_caching strict
The default behavior can be altered in several ways. The global ones:
--strictness LAZY or --compatibility 2.0
preserve all old lax behaviors.
To alter the caching behavior alone, --protocol_caching ignore can be used; this will avoid caching anything while still leaving the template script environment strictness in place (although if a workflows directory, cache or otherwise, is present, it will still be used).
If changes have been made to the reference workflow repository that you wish to apply to the cache, --protocol_caching force will cause the cache to be updated from the repository.
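As an illustration (again, '...' stands in for the rest of your usual syqada arguments), the cache could be refreshed from the repository with:
syqada ... --protocol_caching force
The other values listed above (diffs, status) are passed in the same way.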
Special Considerations¶
Several terms are special to specific kinds of analysis.
The terms groupid, forward, and reverse are only used for the bwa sampe step (task 02) of the Illumina alignment workflow. They are used to resolve the paired read files generated by Illumina sequencing.
The terms tumorname and normalname are used for tumor-normal comparisons such as during somatic variation detection using Mutect. In this case only, the term sample is filled from the name given to the tumor-normal pair in the first column of the tumor-normal file, which is a tab-separated file with the format:
individual normal_sample tumor_sample
on each line.
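For example, a tumor-normal file describing two hypothetical individuals might look like this (columns separated by tabs):
patient01	patient01_normal	patient01_tumor
patient02	patient02_normal	patient02_tumor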
INLINE templates¶
(New in syqada 2.0-delta) A simple one-line command can be expressed as an INLINE template. For example:
template = INLINE wc -c {inputdir}/{sample}.name > {output_prefix}.chars
This example comes from the basic tutorial protocol named Example.reference.
Error reporting¶
As an aid to understanding configuration errors, SyQADA parses the template to look for terms found within braces that are not defined in the config file, the protocol file, or a task file referenced by the protocol file. If any are found, an error similar to this is reported:
Keys not found in the configured task: ['missing_term1', 'missing_term2']
Generated Terms¶
These are the terms that are defined by SyQADA during job generation. They can be used as appropriate for a given script.
- check_error
- This is used for actions that have multiple command invocations within them. It is replaced by standard shell conditional code that will write a timestamp to the .failed file if the previous step returned a non-zero error code, and proceed if it returned OK. (See the sketch following this list.)
- chromosome
- Substituted with the generated chromosome during splitting of a job for locus-based parallelism. See region, which is preferred in most cases.
- filename
- The generated input filename for each created invocation script. If the invocation script uses the --merge option (or the --summary option, which implicitly does a --merge), then the {filename} term must stand alone with no other braced terms on the same line.
- files_per_job
- The number of output files that each job should generate. This is normally unnecessary, because SyQADA checks to make sure that all successful jobs produce the same number of outputs, but it can be used to validate the number of output files for summary jobs, because a single job has nothing to compare to. It could also be used to allow files_per_job = 0 for those occasional cases where the output name does not match the sample name, but I would think that a QC step would be a better idea.
- forward
- Forward readname for forward-reverse read matching.
- groupid
- Group name for forward-reverse read matching.
- inputdir
Location of the directory or directories in which to find the input for this step. The inputdir for any task is the output directory of the preceding step. If inputdir is specified in the task description or protocol, the default is overridden with the new value, which may be either an existing directory or a task identifier. A step whose task identifier begins with ‘QC’ is ignored when determining which is the preceding step. (A step whose task identifier begins with ‘QC’ will itself use the preceding step’s output as input, whether or not the preceding step’s identifier begins with ‘QC’.) The job generator then uses the sample name and the namepattern to build an input or a set of inputs using shell globbing syntax. The special specification:
inputdir = sourcedata
can be used to repeat the use of the sourcedata directory for input during a later step.
Additional input directories may be specified in the task description or protocol using added_input, which can be a comma-separated list of existing directories or task identifiers. They will be referenced by index as {added_input=1}, {added_input=2}, etc. {added_input} is the equivalent of {added_input=1}. added_input does not contribute to the definition of filename, so templates must use constructions like {added_input=1}/{sample}.suffix to make reference to specific files. The term sourcedata may be used to specify the sourcedata directory.
- iteration
- If {iteration} is specified in the template, a number of jobs corresponding to the value of the term iterations will be generated, populating the value of iteration in each job with the corresponding iteration number. To trigger this behavior, the term iterations must appear in the task definition (or in the parameters for that TASKDEF in the protocol). iterations can be either a number or a comma-separated set of numbers, and behaves a bit like the inputs to the python range() function, i.e., from,to,step. If the value of iterations is instead a string (e.g., override), then the value is taken from the value of iterations in the config file. The claim is that this simplifies the construction of workflows like the SparCC runner, which uses the value of iterations to generate that number of permutations, and then later uses iteration to iterate over those permutations. As to simplicity, your opinion may vary, but it did avoid the duplication of the parameter from the config file into the protocol/task file.
- logdir
- Location of LOGS directory, used occasionally when a program creates its own log output other than stdout/stderr.
- mergefile
- Obsolete term formerly used to indicate filling multiple filenames as input, replaced by filename.
- normalname
- The normal input for a tumor/normal pair defined by a --tumor_normal option. The individual from whom the pair was taken is available as sample. This term obsolesces the term normalfile.
- output_prefix
- Shorthand for {outputdir}/{root}{region}
- outputdir
- Location of the directory in which to place the output for this step; the path is created by the job generator.
- project
- The project name provided as METADATA, most often used during summary tasks.
- region
- Substituted with the generated locus range during splitting of a job for locus-based parallelism.
- reverse
- Reverse readname for forward-reverse read matching.
- root
- The rootname (with trailing dotted suffix removed, equivalent to the shell syntax $blah:r) of an output file, to which an appropriate suffix is usually then added.
- sample
- The sample name taken from the sample_file, normally PROJECT.samples. In the case of the tumor_normal option, the individual from whom the pair was taken is available as sample.
- sample_file
- The file, normally PROJECT.samples, containing information about the samples in the project, one sample per line. See The Sample File for details.
- tmpdir
- Location of the standard temporary directory for the project, often used as a value for java.io.tmpdir, but available for any similar purpose. An attempt is made to clean up its contents at the end of a successful batch.
- touch_output
- This is used for actions that produce no output, such as the import and annotation steps of vtools. It is replaced by a command that will create a {root}.complete file in the {outputdir}, so that SyQADA can conclude that the step completed successfully. (See Hacks – Egregious)
- tumorname
- The tumor input for a tumor/normal pair defined by a --tumor_normal option. The individual from whom the pair was taken is available as sample. This term obsolesces the term tumorfile.
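To illustrate {check_error}, here is a minimal sketch of a two-command template (the sort and index invocations and the samtools config term are hypothetical; any sequence of commands would do). Per the description above, the conditional code substituted for {check_error} writes a timestamp to the .failed file if the first command returns a non-zero exit code, and proceeds to the second command only if it succeeded:
{samtools} sort {filename} -o {output_prefix}.sorted.bam
{check_error}
{samtools} index {output_prefix}.sorted.bam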