Building a New Workflow

To build a new workflow, you will need to do three things: find or create script templates that perform the tasks you need; find or construct task specifications that define any special behaviors needed when invoking those templates; and create a protocol file that enumerates, in execution order, either the task specifications themselves or the task files that define them.

Constructing a new template is discussed in How to Build a Script Template.

Constructing a new task specification

A task specification requires the following items:

TASKDEF = name reference

The identification of the task. The word TASKDEF in uppercase is required. The reference can either be the word INLINE, in which case the task definition is provided in-line in the protocol file, or a relative reference to another file that contains a task definition. The name portion is concatenated with a 2-digit sequence number to assign a name to the batchroot of the task. It is also used as a reference name for specifying inputdir or added_input (see inputdir under Generated Terms).
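For example (the task name and file name are invented for illustration), a protocol whose first task is declared as:

TASKDEF = align workflows/.../align.task

would produce a batchroot named 01-align, and later tasks could refer to it by the name align.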

template = template-reference

A file-system reference to the script template that should be used for this task. A template-reference with a leading / is treated as an absolute path. A template-reference without the leading / is resolved first against the local directory (a common usage is to place custom templates in the control directory) and then against the paths found in the SYQADAPATH environment variable. By default, SYQADAPATH is set to the root of the deployment directory of the running syqada (the directory containing the bin directory in which the syqada executable is located). Thus, template-references in the standard workflows usually specify paths beginning with ‘workflows’ to identify the $SYQADA/workflows directory.
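For instance (both paths are invented), either of these is legal:

template = /home/me/custom-templates/count.template
template = workflows/example/templates/count.template

The first is used verbatim; the second is sought first in the local directory and then under each path in SYQADAPATH.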

jobestimate = DD:HH:MM:SS

An estimate of how long each job will take. This matters for cluster queue selection and control, since a job that exceeds its jobestimate (walltime) will be killed without mercy or trace (See PBS Cluster Gamesmanship). See Tuning Job Estimates for details of how to use Big-O notation for tuning estimates.
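For example, a four-hour estimate in this format is:

jobestimate = 0:04:00:00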

gb_memory = NN

Amount of memory required by the executables in the task. This is important for resource allocation on the clusters, and for determining bolus size on a local queue manager. 32 is the maximum for the normal queues of the PBS cluster, and 16 is the minimum for all queues of the LSF cluster. No job in the current (1.0) set of workflows requires more than 32. Porting SyQADA to run on a different cluster should not be difficult, but several of the assumptions addressed by the PBS and LSF queue manager modules may have leaked into the higher-level modules, so beware.

The standard workflow templates that make use of java now use the task gb_memory attribute to set the maximum heap size. This is slightly suboptimal in that the heap is most of the memory allocated by java, but not quite all.


processors = NN

NN is a number from 1 to 24 (the current maximum per node on both clusters), or the value NA, in which case the number computed from gb_memory (gb_memory/1.33, so that a full 32 GB request maps to a full 24-processor node) will be used (PBS and LOCAL) or the value 1 used on LSF. The greater of the specified processors value and the value computed from the memory request will be used. See PBS Cluster Gamesmanship for a discussion of the memory/processor tradeoff. Processor number is the weakest point of the LSF manager, since it was not abstracted gracefully during the first iteration of cluster management.

Processor allocation is going to be the sorest point about standardizing workflows for running on either cluster.

namepattern = [bash-regex | ignore]

A filename pattern in the glob syntax of the bash shell (despite the name bash-regex, this is shell globbing rather than a true regular expression) that describes which files should be selected when the pattern is appended to the string formed by formatting {inputdir}/{sample} for each sample. For example, if inputdir is /project/sourcedata, sample is Sample1, and bash-regex is ‘/*.fastq’, the list of files matching /project/sourcedata/Sample1/*.fastq will be returned. The special value ignore indicates that the namepattern is unused for the task spec.
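Assembled into a complete (if trivial) specification, the required items might look like this sketch, which borrows the INLINE echo idiom from the examples below (the task name and command are invented):

----------------------------------------
TASKDEF = greet INLINE
template = INLINE echo greetings from {sample}
jobestimate = 0:00:10:00
gb_memory = 4
processors = 1
namepattern = ignore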

There are numerous other specifications that can be added, many of which are task-specific. In fact, every term found enclosed in braces in a script template must be populated, or syqada will report the undefined term (this is a particularly ugly error message, because job generation is nested deep in the bowels of syqada). Many of these terms will be filled either from the config file or by syqada itself (see Generated Terms for a comprehensive list). However, the template can also contain special terms that allow parameterization of aspects of the computation, or, for instance, replication or iteration. These must be filled from attributes in the task file or protocol file. Terms specified in the protocol file override those in the task file.

It is permissible in the protocol to define a template parameter using the name of another template parameter or parameters (circular references produce protocol warnings). For example, consider the task definition fragment given below, with an INLINE template:

----------------------------------------
TASKDEF = fragment INLINE
template = INLINE echo {parameter1}
...
parameter1 = Life, the Universe, and Everything = {parameter2}*{parameter3}
parameter2 = 6
parameter3 = 9

syqada will render the template as:

Life, the Universe, and Everything = 6*9

I recommend you not go crazy with this feature.

If a template parameter is defined with the special character ~ (tilde), it is rendered as an empty string. This permits templates to include flags that may be chosen or added in the protocol specification. As an example, consider these task file and protocol file fragments:

workflows/.../use-empty.task
----------------------------------------
TASKDEF = fragment INLINE
template = INLINE rsync -rtl {rsync_options} {source} {dest}
...
rsync_options = ~

control/protocol-using-that-task.protocol
----------------------------------------
TASKDEF = HeartOfGold workflows/.../use-empty.task
source = Earth/Dent,ArthurDent
dest = Magrathea

By default, the task will leave the source, Earth/Dent,ArthurDent, in place. Should a Vogon constructor fleet run the task with this definition added in the protocol:

rsync_options = --remove-source-files

the source, Dent,ArthurDent, would be removed from Earth.

The section syqada validate discusses the syqada commands for identifying missing parameters and for validating that protocols, task files, and task templates are correct.

Special command languages

Certain programs with command languages that include braces can be convenient for one-line invocations in scripts. Since 1.1-RC2, templates that contain invocations of perl, awk, or groovy can include braces as well without triggering the ugly “Keys not found” complaint; because implementing this correctly involved eliminating code rather than adding code, I believe this is robust. However, I reserve the right to have ignored an obvious usage of braces in one of these languages that breaks syqada.
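As an illustration (the file name is invented), a template line like the following no longer triggers the complaint; the awk braces are left alone while {inputdir} and {sample} are still substituted:

template = INLINE awk '{print $1}' {inputdir}/{sample}/counts.txt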

Environment Variables and strictness

There are several commonly used Unix environment variables that, if not specified correctly, can lead to obnoxiously subtle task failures or, worse, apparent successes that produce incorrect results. For this reason, in the interest of improving reproducibility, SyQADA now removes the following variables from the task execution environment: PYTHONPATH, PERL5LIB, and LD_LIBRARY_PATH. If a task needs them, they should be set in the template, with their values supplied as template parameters specified in the config or protocol file. At this instant, the PATH variable is inherited due to an oversight. In the near future, we anticipate resetting PATH to the short path provided in a standard Unix /etc/profile.
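For example, a task that requires a particular PERL5LIB can declare it explicitly (a sketch; the task name, parameter name, and path are invented):

----------------------------------------
TASKDEF = annotate INLINE
template = INLINE PERL5LIB={perl5lib} perl annotate.pl {inputdir}/{sample}
...
perl5lib = /our/pinned/perl/libs

This makes the dependency visible in the task definition rather than letting it leak in from the invoking shell.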

Job submission control

There are two special specifications that may be added to control job submission. The first of these governs the rate at which jobs are submitted:

max_bolus = NN

This is a hard limit on the number of jobs that can be submitted at once, and can be used as max_bolus = 1 to force sequential execution if that is for some reason necessary. A case in point is vtools sample imports, which run a risk of concurrency violations if run in parallel. Because memory allocation on the HAPS is tricky, it can also be useful if you are running there.
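For example, to serialize a hypothetical vtools import task (the task name and path are invented):

TASKDEF = import-samples workflows/.../vtools-import.task
max_bolus = 1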

The second specification applies to cluster execution only (if present, it is ignored in non-clustered execution). It allows the use of named special-purpose queues. This requires an addition to the protocol preamble (the lines between the protocol version header and the first TASKDEF):

valid_queues = queuename1,queuename2

When this line is present in the preamble, a TASKDEF can include the line:

queue = queuename1 (or queuename2)

and syqada will submit the job to that queue.
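Put together, a minimal sketch (the queue and task names are invented):

Protocol Version 1.0
valid_queues = highmem
TASKDEF = big-assembly workflows/.../assembly.task
queue = highmem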

Protocol Construction

A protocol file can be constructed in two ways: either as a single file, with all tasks specified in-line in the order they should be performed, or as a file that contains references to the task files in that order. In either case, the protocol file can contain in-line overrides of the attributes specified in a task file, so that an individual protocol can be tailored for data volume or other special handling. This is illustrated in the example reference protocol, workflows/example/control/Example.reference:

Protocol Version 1.0
TASKDEF = count-characters workflows/example/control/01.count-characters.task
TASKDEF = demonstrate-failure-handling workflows/example/control/02.demonstrate-failure-handling.task
jobestimate = 1:23
options:
  special_setting = testing
# this just illustrates a comment.
# It's here to exhaust the testing of in-line task overrides.
# Blank lines are also OK

TASKDEF = QC_indexer  workflows/example/control/02.QC-count-outputs.task
added_input = 01,01-count-characters/output
TASKDEF = report-all workflows/example/control/03.report-all.task
added_input = QC_indexer

The task reference must be of the form:

TASKDEF = name [INLINE | path-to-taskfile]

A task whose name begins with the letters QC is assumed to be a QC step and is considered extraneous to the actual payload of the workflow. In this case, syqada arranges for the following task to look to the last task before the QC step to find its inputs, as sketched below. A second QC step in a row will look to the previous QC step for its inputs.
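Schematically (all names invented):

TASKDEF = align workflows/.../align.task
TASKDEF = QC_alignments workflows/.../qc-alignments.task
TASKDEF = call-variants workflows/.../call-variants.task

Here call-variants takes its inputs from align, not from QC_alignments.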

Protocol Nesting

It is also permitted (provided that you add the term protocol_nesting to the preamble of the protocol, just to indicate you were serious) to nest a protocol using a PROTOCOLREF, of the form:

PROTOCOLREF = path-to-protocolfile.protocol

or:

PROTOCOLREF = protocol-reference-name path-to-protocolfile.protocol

This inserts the tasks specified in the PROTOCOLREF directly in the existing workflow, using task directory naming that sorts the included task directories in correct sequential order.

Individual terms in the nested protocol can be overridden as follows. Assuming the included protocol includes this TASKDEF:

TASKDEF = included-step-id path-to-task-definition

Then in the wrapping protocol file, this sequence:

PROTOCOLREF = nested-id path-to-protocolfile
included-step-id.parameterA = new-parameter-value
*.infinite_improbability = everywhere-at-once

will assign the value new-parameter-value to parameterA, and the value everywhere-at-once to the term infinite_improbability in every step of the included protocol.

Given the nesting above, subsequent TASKDEFs in the wrapper protocol can identify tasks from the nested protocol uniquely thus:

added_input = nested-id.included-step-id
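For instance, if a nested protocol defines a TASKDEF named align, a wrapper might combine the pieces above like this (all names invented):

PROTOCOLREF = mapping workflows/.../mapping.protocol
align.parameterA = new-parameter-value
TASKDEF = coverage workflows/.../coverage.task
added_input = mapping.align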

The features tutorial demonstrates the use of nesting if you wish to see a working example.

At this instant, nothing prevents you from pointing a PROTOCOLREF at a protocol that itself includes a PROTOCOLREF, but I don’t plan on testing that, because that way lies madness. We do make sure that you didn’t create a loop in your reference string, but that was necessary for the single inclusion and easy to implement in the general case, so that’s all the help you get.

PBS Cluster Gamesmanship

The following notes apply to the PBS cluster. The LSF cluster usage is still too new for me to grok the gamesmanship. And I probably need to remove this if it’s going to get distributed to the Center. Stay tuned...

job duration

  syqada attempts to select the most appropriate *and* desirable queue for the batch, based
  on the jobestimate and on the number of jobs needed for the task. In particular, there is a
  constraint of 500 processors per user on the short queue, so a batch that would otherwise
  qualify for the short queue but exceeds that threshold will be submitted instead to the
  medium queue so that more jobs can be run simultaneously. Like any optimization, this can
  have pathological behavior if, for instance, the cluster is saturated so that all jobs get
  queued before running. Further improvement of the optimization will be based on a cost-benefit
  analysis of the perceived benefit.

  Because of the way the cluster allocates jobs, the only way to have a high likelihood (note,
  I do not use the word, `guarantee`) of a job that takes walltime X finishing before X
  expires is to request all 24 processors. Overallocating is not only antisocial; since
  cluster quotas are calculated as

  (number of processors) X walltime

  it can also be a costly inroad on your quarterly quota (a 10-hour job requesting all 24
  processors charges 240 processor-hours, where the same job requesting 4 processors charges
  only 40). However, the Catch-22 of job duration is
  that as the cluster approaches saturation, walltimes tend to increase due to both CPU competition
  and increased storage access latencies, so a walltime that is accurate and adequate for
  an unsaturated cluster is not necessarily appropriate for a loaded cluster. Your mileage
  can and will vary.

memory and processor usage

  For simplicity's sake, the current implementation of syqada is incapable of requesting more than
  one node per job; permitting that will require a modest coding change, or else a manual step
  after running :code:`syqada batch --step init` that globally changes the string nodes=1 appropriately.

  I am inclined to recommend requesting memory in powers of 2. I don't have any proof that this is
  a good idea, and it may entirely be coincidence, but it has seemed that I have more transient
  failures when requesting 10 or 12 GB of memory rather than 8, 16, or 32.

memory and processor usage (local shared machines)

 syqada does a perfectly nice job of managing computations on the HAPS, with a couple of
 caveats. Number of processors per job is a poor guideline (I have rarely seen the CPUs come
 close to saturation), so syqada tries to use memory allocation to choose the number of jobs to
 submit at once.  Inadequate gb_memory assignment can lead to transient job failures with
 particularly ugly dumps to the syqada console. Adding a max_bolus attribute to the job
 (either in the METADATA or by command line to *syqada batch*) is an alternative way to
 address this.  Java processes seem to be the chief memory hogs for which I have seen this (JLOH
 is advertised as a memory hog, but does not really seem to be one in normal usage).

future work

  syqada is now recording good performance numbers that should allow optimization of memory
  allocation and improved estimation of job duration. That will eventually show itself in the
  standard workflow task files.

Tuning Job Estimates

Frustration with managing long-running jobs that fail because the jobestimate parameter made no allowance for the effect of data size has caused me to add a brand new misfeature, using computational complexity representations to specify walltime and memory usage estimates. The syntax and some examples for their use can be found at Expressing Computational Complexity in Job Estimates.