.. _script-templates:

How to Build a Script Template
==============================

Script templates are typically built from working invocations, with the
SyQADA- and configuration-dependent values then extracted into standard terms
enclosed in braces. Here is a simple example. The `#!/bin/bash` on the first
line is optional, as a reminder that the script will eventually be run as a
shell script; it will be supplanted by another #!/bin/bash during job
generation::

    #!/bin/bash
    {bedtools} intersect \
        -wa \
        -header \
        -abam {filename} \
        -b {capturefile} \
        > {output_prefix}.bam

The terms `bedtools` and `capturefile` are expected to be defined in the
config file, and should be specific as to full path and version. *Failure to
specify full paths to executables and reference files can jeopardize your
ability to reproduce your workflow.* The `capturefile` in particular depends
on the capture technology used for the sequencing experiment. The terms
`filename` and `output_prefix` are standard terms produced by the SyQADA job
generator based on the project inputs.

.. _reproducibility_2_1:

Improved Reproducibility (New in 2.1)
--------------------------------------

The following options to improve reproducibility have been added to syqada::

    --strictness [PROGRAMPATHS | PATH | CACHING=[strict | ignore] | LAZY]
                          (default PROGRAMPATHS,CACHING=strict)
    --protocol_caching [strict | ignore | diffs | status | force]
                          (default strict)
    --compatibility 2.0   (omitted by default to obtain current behavior)

Reproducibility of Script Templates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The strictness options PROGRAMPATHS, PATH, and LAZY affect the environment of
the running job script. In order to improve the reproducibility of job script
templates, starting in 2.1, the default behavior of syqada is to eliminate
the following Unix environment variables from the execution environment of
any script that is submitted for execution::

    PYTHONPATH
    PERL5LIB
    LD_LIBRARY_PATH

Scripts that rely on these variables must specify them in the template. The
conventional (my) way of doing this is through template parameters. I also
use Anaconda python and specify the precise conda environment, so this is
what I commonly place in my .config file (occasionally, such as for SparCC, I
must override these values in an individual task or protocol file)::

    python3_setup : source activate py3.6
    python3path   : $TEAM_ROOT/lib/python/zip/labtools-3.2.2.zip
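A template that relies on this Python environment can then expand those
parameters before invoking its program. The following is only a sketch: the
script name run_analysis.py is a placeholder, and the point is simply that
{python3_setup} and {python3path} come from the config file while {filename}
and {output_prefix} are generated by SyQADA::

    #!/bin/bash
    {python3_setup}
    export PYTHONPATH={python3path}
    python run_analysis.py {filename} > {output_prefix}.txt

Because the values come from the config file, the same template can be reused
with a different Python setup by changing the config alone.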
The default template environment behavior in 2.1 is::

    --strictness PROGRAMPATHS

To be even stricter, you can specify --strictness PROGRAMPATHS,PATH, which
will also eliminate the PATH variable from the script execution environment.
To override this behavior and preserve the less scrupulous 2.0 behavior,
invoke syqada with --strictness LAZY or, equivalently, --compatibility 2.0
(Nota Bene: this also affects the protocol caching described next. To allow
inheritance of the current environment but still enforce protocol caching,
use --strictness CACHING=strict).

Caching to Improve Reproducibility of Workflow Versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Often, in order to repeat workflows on new data, a protocol calls a
particular task file and/or a particular template from a repository. It has
been our experience, on occasion, to return to the results of a workflow on
some sourcedata or other and attempt to re-run a particular step or steps,
only to find that the task file or template called by reference has been
altered in the repository, causing the new invocation to fail, either because
the task file changed its parameters or because the template's program
invocation changed. In a first attempt to reduce the opportunity for such
failures, syqada 2.1 now by default caches the task files, templates, and
nested protocol files that are referred to in the protocol file.

Protocol Caching
----------------

The default protocol caching behavior in 2.1 is::

    --protocol_caching strict

The default behavior can be altered in several ways. Either of the global
options --strictness LAZY or --compatibility 2.0 preserves all of the old lax
behaviors. To alter the caching behavior alone, --protocol_caching ignore can
be used; this avoids caching anything, while still leaving the template
script environment strictness in place (although if a workflows directory,
cached or otherwise, is present, it will still be used). If changes have been
made to the reference workflow repository that you wish to apply to the
cache, --protocol_caching force will cause the cache to be updated from the
repository.

Special Considerations
----------------------

Several terms are special to specific kinds of analysis.

The terms `groupid`, `forward`, and `reverse` are only used for the `bwa
sampe` step (task 02) of the Illumina alignment workflow. They are used to
resolve the paired read files generated by Illumina sequencing.

The terms `tumorname` and `normalname` are used for tumor-normal comparisons,
such as somatic variation detection using Mutect. In this case only, the term
`sample` is filled from the name given to the tumor-normal pair in the first
column of the tumor-normal file, which is a tab-separated file with the
format::

    individual    normal_sample    tumor_sample

on each line.
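For example, a tumor-normal file describing two individuals might look like
this (the individual and sample names here are purely illustrative)::

    patient01    patient01_normal    patient01_tumor
    patient02    patient02_normal    patient02_tumor

For such a pair, {sample} is filled with the name in the first column
(patient01 or patient02), while {normalname} and {tumorname} identify the
corresponding normal and tumor inputs.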
INLINE templates
----------------

(New in syqada 2.0-delta.) A simple one-line command can be expressed as an
INLINE template. For example::

    template = INLINE wc -c {inputdir}/{sample}.name > {output_prefix}.chars

This example comes from the basic tutorial protocol named Example.reference.

Error reporting
---------------

As an aid to understanding configuration errors, SyQADA parses the template
to look for terms within braces that are not defined in the config file, the
protocol file, or a task file referenced by the protocol file. If any are
found, an error similar to this is reported::

    Keys not found in the configured task: ['missing_term1', 'missing_term2']
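As a sketch, suppose a template contained the line below, but neither the
config file, the protocol file, nor the task file defined the term
`capture_bed` (a deliberately undefined, illustrative name)::

    -b {capture_bed} \

Job generation would then report capture_bed among the keys not found,
pointing directly at the missing configuration entry.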
.. _terms:

Generated Terms
---------------

These are the terms that are defined by SyQADA during job generation. They
can be used as appropriate for a given script.

check_error
    This is used for actions that have multiple command invocations within
    them. It is replaced by standard shell conditional code that will write a
    timestamp to the `.failed` file if the previous step returned a non-zero
    error code, and proceed if it returned OK (see the sketch after this
    list).

chromosome
    Substituted with the generated chromosome during splitting of a job for
    locus-based parallelism. See `region`, which is preferred in most cases.

filename
    The generated input filename for each created invocation script. If the
    invocation script uses the --merge option (or the --summary option, which
    implicitly does a --merge), then the {filename} term must stand alone,
    with no other braced terms on the same line.

files_per_job
    The number of output files that each job should generate. This is
    normally unnecessary, because SyQADA checks to make sure that all
    successful jobs produce the same number of outputs, but it can be used to
    validate the number of output files for summary jobs, because a single
    job has nothing to compare to. It could also be used to allow
    `files_per_job = 0` for those occasional cases where the output name does
    not match the sample name, but I would think that a QC step would be a
    better idea.

forward
    Forward readname for forward-reverse read matching.

groupid
    Group name for forward-reverse read matching.

.. _inputdir:

inputdir
    Location of the directory or directories in which to find the input for
    this step. The *inputdir* for any task is the output directory of the
    preceding step. If *inputdir* is specified in the task description or
    protocol, the default is overridden with the new value, which may be
    either an existing directory or a task identifier. A step whose task
    identifier begins with 'QC' is ignored when determining which is the
    preceding step. (A step whose task identifier begins with 'QC' will use
    the preceding step's output as input, whether or not that preceding
    step's identifier begins with 'QC'.) The job generator then uses the
    sample name and the namepattern to build an input or a set of inputs
    using shell globbing syntax.

    The special specification::

        inputdir = sourcedata

    can be used to repeat the use of the sourcedata directory for input
    during a later step.

    Additional input directories may be specified in the task description or
    protocol using *added_input*, which can be a comma-separated list of
    existing directories or task identifiers. They will be referenced by
    index as {added_input=1}, {added_input=2}, etc. {added_input} is the
    equivalent of {added_input=1}. *added_input* does not contribute to the
    definition of *filename*, so templates must use constructions like
    {added_input=1}/{sample}.suffix to make reference to specific files.

    The term *sourcedata* may be used to specify the sourcedata directory.

iteration
    If *{iteration}* is specified in the template, a number of jobs
    corresponding to the value of the term *iterations* will be generated,
    populating the value of *iteration* in each job with the corresponding
    iteration number. To trigger this behavior, the term *iterations* must
    appear in the task definition (or in the parameters for that TASKDEF in
    the protocol). *iterations* can be either a number or a comma-separated
    set of numbers, which behaves a bit like the arguments to the Python
    range() function (from,to,step). If the value of *iterations* is instead
    a string (e.g., override), then the value is taken from the value of
    *iterations* in the config file. The claim is that this simplifies the
    construction of workflows like the SparCC runner, which uses the value of
    *iterations* to generate that number of permutations, and then later uses
    *iteration* to iterate over those permutations. As to simplicity, your
    opinion may vary, but it did avoid the duplication of the parameter from
    the config file into the protocol/task file.

logdir
    Location of the LOGS directory, used occasionally when a program creates
    its own log output other than stdout/stderr.

mergefile
    Obsolete term formerly used to indicate filling multiple filenames as
    input; replaced by *filename*.

normalname
    The normal input for a tumor/normal pair defined by a --tumor_normal
    option. The individual from whom the pair was taken is available as
    *sample*. This term obsolesces the term *normalfile*.

output_prefix
    Shorthand for {outputdir}/{root}{region}.

outputdir
    Location of the directory in which to place all the output for this step;
    the path is created by the job generator.

project
    The project name provided as METADATA, most often used during summary
    tasks.

region
    Substituted with the generated locus range during splitting of a job for
    locus-based parallelism.

reverse
    Reverse readname for forward-reverse read matching.

root
    The rootname (with trailing dotted suffix removed, equivalent to the
    shell syntax $blah:r) of an output file, to which an appropriate suffix
    is usually then added.

sample
    The sample name taken from the sample_file, normally PROJECT.samples. In
    the case of the tumor_normal option, the individual from whom the pair
    was taken is available as *sample*.

sample_file
    The file, normally PROJECT.samples, containing information about the
    samples in the project, one sample per line. See :ref:`sample_file` for
    details.

tmpdir
    Location of the standard temporary directory for the project, often used
    as a value for java.io.tmpdir, but available for any similar purpose. An
    attempt is made to clean up its contents at the end of a successful
    batch.

touch_output
    This is used for actions that produce no output, such as the import and
    annotation steps of vtools. It is replaced by a command that will create
    a {root}.complete file in the {outputdir}, so that SyQADA can conclude
    that the step completed successfully. (See :ref:`hacks`)

tumorname
    The tumor input for a tumor/normal pair defined by a --tumor_normal
    option. The individual from whom the pair was taken is available as
    *sample*. This term obsolesces the term *tumorfile*.
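To show how several of these generated terms combine, here is a sketch of a
template that runs two commands in sequence, guarded by {check_error}. It is
illustrative rather than part of any shipped workflow, and it assumes that
the config file defines a `samtools` term, by analogy with the `bedtools`
term at the top of this page::

    #!/bin/bash
    {samtools} sort -T {tmpdir}/{sample} -o {output_prefix}.sorted.bam {filename}
    {check_error}
    {samtools} index {output_prefix}.sorted.bam

If the sort returns a non-zero exit code, the shell code substituted for
{check_error} writes a timestamp to the `.failed` file instead of proceeding
to the index step; otherwise both commands run and their outputs land under
{outputdir}.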