.. _script-templates:

How to Build a Script Template
==============================

Script templates are typically built from working invocations, with the
SyQADA- and configuration-dependent values then extracted into standard terms
enclosed in braces. Here is a simple example. The `#!/bin/bash` on the first
line is optional, as a reminder that the script will eventually be run as a
shell script; it will be supplanted by another #!/bin/bash during job
generation::

    #!/bin/bash
    {bedtools} intersect \
        -wa \
        -header \
        -abam {filename} \
        -b {capturefile} \
        > {output_prefix}.bam

The terms `bedtools` and `capturefile` are expected to be defined in the
config file, and should be specific as to full path and version. *Failure to
specify full paths to executables and reference files can jeopardize your
ability to reproduce your workflow.* The `capturefile` in particular depends
on the capture technology used for the sequencing experiment. The terms
`filename` and `output_prefix` are standard terms produced by the SyQADA job
generator based on the project inputs.

.. _reproducibility_2_1:

Improved Reproducibility (New in 2.1)
--------------------------------------

The following options to improve reproducibility have been added to syqada::

    --strictness [PROGRAMPATHS | PATH | CACHING=[strict | ignore] | LAZY]
                          (default PROGRAMPATHS,CACHING=strict)
    --protocol_caching [strict | ignore | diffs | status | force]
                          (default strict)
    --compatibility 2.0   (omitted by default to obtain current behavior)

Reproducibility of Script Templates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The strictness options PROGRAMPATHS, PATH, and LAZY affect the environment of
the running job script. In order to improve the reproducibility of job script
templates, starting in 2.1, the default behavior of syqada is to eliminate
the following Unix environment variables from the execution environment of
any script that is submitted for execution::

    PYTHONPATH
    PERL5LIB
    LD_LIBRARY_PATH

Scripts that rely on these variables must specify them in the template. The
conventional (my) way of doing this is through template parameters. I also
use Anaconda python and specify the precise conda environment, so this is
what I commonly place in my .config file (occasionally, such as for SparCC, I
must override these values in an individual task or protocol file)::

    python3_setup : source activate py3.6
    python3path   : $TEAM_ROOT/lib/python/zip/labtools-3.2.2.zip
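A template that relies on this Python environment can then expand those
parameters before invoking its program. The following is only a sketch: the
script name run_analysis.py is a placeholder, and the point is simply that
{python3_setup} and {python3path} come from the config file while {filename}
and {output_prefix} are generated by SyQADA::

    #!/bin/bash
    {python3_setup}
    export PYTHONPATH={python3path}
    python run_analysis.py {filename} > {output_prefix}.txt

Because the values come from the config file, the same template can be reused
with a different Python setup by changing the config alone.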
The default template environment behavior in 2.1 is::

    --strictness PROGRAMPATHS

To be even stricter, you can specify --strictness PROGRAMPATHS,PATH, which
will also eliminate the PATH variable from the script execution environment.
To override this behavior and preserve the less scrupulous 2.0 behavior,
invoke syqada with --strictness LAZY or, equivalently, --compatibility 2.0
(Nota Bene: this also affects the protocol caching described next. To allow
inheritance of the current environment but still enforce protocol caching,
use --strictness CACHING=strict).

Caching to Improve Reproducibility of Workflow Versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Often, in order to repeat workflows on new data, a protocol calls a
particular task file and/or a particular template from a repository. It has
been our experience, on occasion, to return to the results of a workflow on
some sourcedata or other and attempt to re-run a particular step or steps,
only to find that the task file or template called by reference has been
altered in the repository, causing the new invocation to fail, either because
the task file changed its parameters or because the template's program
invocation changed. In a first attempt to reduce the opportunity for such
failures, syqada 2.1 now by default caches the task files, templates, and
nested protocol files that are referred to in the protocol file.

Protocol Caching
----------------

The default protocol caching behavior in 2.1 is::

    --protocol_caching strict

The default behavior can be altered in several ways. Either of the global
options --strictness LAZY or --compatibility 2.0 preserves all of the old lax
behaviors. To alter the caching behavior alone, --protocol_caching ignore can
be used; this avoids caching anything, while still leaving the template
script environment strictness in place (although if a workflows directory,
cached or otherwise, is present, it will still be used). If changes have been
made to the reference workflow repository that you wish to apply to the
cache, --protocol_caching force will cause the cache to be updated from the
repository.

Special Considerations
----------------------

Several terms are special to specific kinds of analysis.

The terms `groupid`, `forward`, and `reverse` are only used for the `bwa
sampe` step (task 02) of the Illumina alignment workflow. They are used to
resolve the paired read files generated by Illumina sequencing.

The terms `tumorname` and `normalname` are used for tumor-normal comparisons,
such as somatic variation detection using Mutect. In this case only, the term
`sample` is filled from the name given to the tumor-normal pair in the first
column of the tumor-normal file, which is a tab-separated file with the
format::

    individual    normal_sample    tumor_sample

on each line.
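For example, a tumor-normal file describing two individuals might look like
this (the individual and sample names here are purely illustrative)::

    patient01    patient01_normal    patient01_tumor
    patient02    patient02_normal    patient02_tumor

For such a pair, {sample} is filled with the name in the first column
(patient01 or patient02), while {normalname} and {tumorname} identify the
corresponding normal and tumor inputs.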
INLINE templates
----------------

(New in syqada 2.0-delta.) A simple one-line command can be expressed as an
INLINE template. For example::

    template = INLINE wc -c {inputdir}/{sample}.name > {output_prefix}.chars

This example comes from the basic tutorial protocol named Example.reference.

Error reporting
---------------

As an aid to understanding configuration errors, SyQADA parses the template
to look for terms within braces that are not defined in the config file, the
protocol file, or a task file referenced by the protocol file. If any are
found, an error similar to this is reported::

    Keys not found in the configured task: ['missing_term1', 'missing_term2']
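As a sketch, suppose a template contained the line below, but neither the
config file, the protocol file, nor the task file defined the term
`capture_bed` (a deliberately undefined, illustrative name)::

    -b {capture_bed} \

Job generation would then report capture_bed among the keys not found,
pointing directly at the missing configuration entry.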
.. _terms:

Generated Terms
---------------

These are the terms that are defined by SyQADA during job generation. They
can be used as appropriate for a given script.

check_error
    This is used for actions that have multiple command invocations within
    them. It is replaced by standard shell conditional code that will write a
    timestamp to the `.failed` file if the previous step returned a non-zero
    error code, and proceed if it returned OK (see the sketch after this
    list).

chromosome
    Substituted with the generated chromosome during splitting of a job for
    locus-based parallelism. See `region`, which is preferred in most cases.

filename
    The generated input filename for each created invocation script. If the
    invocation script uses the --merge option (or the --summary option, which
    implicitly does a --merge), then the {filename} term must stand alone,
    with no other braced terms on the same line.

files_per_job
    The number of output files that each job should generate. This is
    normally unnecessary, because SyQADA checks to make sure that all
    successful jobs produce the same number of outputs, but it can be used to
    validate the number of output files for summary jobs, because a single
    job has nothing to compare to. It could also be used to allow
    `files_per_job = 0` for those occasional cases where the output name does
    not match the sample name, but I would think that a QC step would be a
    better idea.

forward
    Forward readname for forward-reverse read matching.

groupid
    Group name for forward-reverse read matching.

.. _inputdir:

inputdir
    Location of the directory or directories in which to find the input for
    this step. The *inputdir* for any task is the output directory of the
    preceding step. If *inputdir* is specified in the task description or
    protocol, the default is overridden with the new value, which may be
    either an existing directory or a task identifier. A step whose task
    identifier begins with 'QC' is ignored when determining which is the
    preceding step. (A step whose task identifier begins with 'QC' will use
    the preceding step's output as input, whether or not that preceding
    step's identifier begins with 'QC'.) The job generator then uses the
    sample name and the namepattern to build an input or a set of inputs
    using shell globbing syntax.

    The special specification::

        inputdir = sourcedata

    can be used to repeat the use of the sourcedata directory for input
    during a later step.

    Additional input directories may be specified in the task description or
    protocol using *added_input*, which can be a comma-separated list of
    existing directories or task identifiers. They will be referenced by
    index as {added_input=1}, {added_input=2}, etc. {added_input} is the
    equivalent of {added_input=1}. *added_input* does not contribute to the
    definition of *filename*, so templates must use constructions like
    {added_input=1}/{sample}.suffix to make reference to specific files.

    The term *sourcedata* may be used to specify the sourcedata directory.

iteration
    If *{iteration}* is specified in the template, a number of jobs
    corresponding to the value of the term *iterations* will be generated,
    populating the value of *iteration* in each job with the corresponding
    iteration number. To trigger this behavior, the term *iterations* must
    appear in the task definition (or in the parameters for that TASKDEF in
    the protocol). *iterations* can be either a number or a comma-separated
    set of numbers, which behaves a bit like the arguments to the Python
    range() function (from,to,step). If the value of *iterations* is instead
    a string (e.g., override), then the value is taken from the value of
    *iterations* in the config file. The claim is that this simplifies the
    construction of workflows like the SparCC runner, which uses the value of
    *iterations* to generate that number of permutations, and then later uses
    *iteration* to iterate over those permutations. As to simplicity, your
    opinion may vary, but it did avoid the duplication of the parameter from
    the config file into the protocol/task file.

logdir
    Location of the LOGS directory, used occasionally when a program creates
    its own log output other than stdout/stderr.

mergefile
    Obsolete term formerly used to indicate filling multiple filenames as
    input; replaced by *filename*.

normalname
    The normal input for a tumor/normal pair defined by a --tumor_normal
    option. The individual from whom the pair was taken is available as
    *sample*. This term obsolesces the term *normalfile*.

output_prefix
    Shorthand for {outputdir}/{root}{region}.

outputdir
    Location of the directory in which to place all the output for this step;
    the path is created by the job generator.

project
    The project name provided as METADATA, most often used during summary
    tasks.

region
    Substituted with the generated locus range during splitting of a job for
    locus-based parallelism.

reverse
    Reverse readname for forward-reverse read matching.

root
    The rootname (with trailing dotted suffix removed, equivalent to the
    shell syntax $blah:r) of an output file, to which an appropriate suffix
    is usually then added.

sample
    The sample name taken from the sample_file, normally PROJECT.samples. In
    the case of the tumor_normal option, the individual from whom the pair
    was taken is available as *sample*.

sample_file
    The file, normally PROJECT.samples, containing information about the
    samples in the project, one sample per line. See :ref:`sample_file` for
    details.

tmpdir
    Location of the standard temporary directory for the project, often used
    as a value for java.io.tmpdir, but available for any similar purpose. An
    attempt is made to clean up its contents at the end of a successful
    batch.

touch_output
    This is used for actions that produce no output, such as the import and
    annotation steps of vtools. It is replaced by a command that will create
    a {root}.complete file in the {outputdir}, so that SyQADA can conclude
    that the step completed successfully. (See :ref:`hacks`)

tumorname
    The tumor input for a tumor/normal pair defined by a --tumor_normal
    option. The individual from whom the pair was taken is available as
    *sample*. This term obsolesces the term *tumorfile*.
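To show how several of these generated terms combine, here is a sketch of a
template that runs two commands in sequence, guarded by {check_error}. It is
illustrative rather than part of any shipped workflow, and it assumes that
the config file defines a `samtools` term, by analogy with the `bedtools`
term at the top of this page::

    #!/bin/bash
    {samtools} sort -T {tmpdir}/{sample} -o {output_prefix}.sorted.bam {filename}
    {check_error}
    {samtools} index {output_prefix}.sorted.bam

If the sort returns a non-zero exit code, the shell code substituted for
{check_error} writes a timestamp to the `.failed` file instead of proceeding
to the index step; otherwise both commands run and their outputs land under
{outputdir}.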