Tutorial on Special Features
This tutorial is designed to illustrate several features that have been found to be useful. Their use does significantly complicate the life of the analyst using them, so you may wish to adopt them with caution. In particular, using replication is drinking absinthe with Baudelaire, and protocol nesting is a Russian fairy tale. Combine them, and you’re in a story by E.T.A. Hoffmann.
If you thought A Simple, Stupid Workflow was simple and stupid, you have another horror waiting for you here.
Running syqada tutorial and selecting the Features tutorial will populate your tutorial directory with a control directory and the protocol, config, and samples files.
Start by examining control/Features.protocol. You’ll see that it is version 1.4, which allows the use of all the “advanced” features in this tutorial (versioning of protocols has persisted since the transition from SyQADA-0 to SyQADA-1.0, although empirical evidence suggests that it’s a Bridge Too Far, so it should probably be eliminated).
If you execute
syqada describe
in your tutorial directory, syqada assumes that you are invoking the protocol found in the control directory, so you will see something like this:
Using these additional valid queues: 'testing'
Protocol control/Features.protocol
Description: Protocol for tutorial demonstration of advanced features:
replication, iteration, nesting, complexity, quality assurance
Preamble
Protocol nesting is True
Valid queues: testing
Replicands: ['parameterA', 'parameterB', 'parameterC', 'parameterD']
Tasks
01-simple-replication
Description: (Obviously,) scatter is parameterA,parameterD
02-partial-aggregation
Description: parameterD is gathered; parameterA is still
scattered
03-another-replication
Description: parameterB is added to scatter with parameterA
03-QC-step1
Description: scatter is still parameterA,parameterB
03-QC-step-repeat
Description: This tests passing inputs per-replication, a common
condition of replicating an existing protocol
04-aggregation
Description: Both scattered parameters are explicitly gathered
0501-spin-date
Description: Iterate copies of a file containing iteration count
and date of execution. 100 copies by default, should be
overwritten as 10 by the tutorial protocol.
0502-stats
Description: Summarize the output of the spin-date task.
06-summary
Description: Demonstrate use of nested protocol output. We can
use a nested task in added_input.
The preamble of the protocol includes two comment lines that provide a description. The next three lines define the parameters to be replicated. These lines must appear in the protocol before the first TASKDEF. The protocol describes five tasks plus two inherited from a nested protocol. The template of the first of them is:
echo PA{parameterA}, PD{parameterD} > {output_prefix}.out
which, of course, just creates an output file for each sample listing the parameters for that replication. The third one simply greps a particular value from the first output set. Note the echo command, which causes the script to succeed whether grep finds anything or not:
grep PB4 {inputdir}/{sample}.out > {output_prefix}.out
echo ignore the error code
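Why the trailing echo matters: a shell script exits with the status of the last command it runs, and grep returns a non-zero status when it finds no match. The sketch below is not part of the tutorial files. The function names and the sample text are made up for illustration, and it assumes that the job script is judged by its exit status and does not run under `set -e`.

```shell
#!/bin/sh
# Stand-in for the template: grep finds nothing, then echo runs last.
with_echo() {
    (
        set +e   # the trick relies on errexit being off; under `set -e` the script would stop at grep
        printf 'PA1, PB3\n' | grep PB4
        echo ignore the error code
    )
}

# The same search without the trailing echo.
without_echo() {
    (
        set +e
        printf 'PA1, PB3\n' | grep PB4
    )
}

with_echo
echo "with echo: exit status $?"
without_echo || echo "without echo: exit status $?"
```

The first call reports status 0 even though grep matched nothing; the second reports grep's own status of 1.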
The config file contains a dummy source directory, because this protocol does not require data. The samples file identifies two samples.
When you run:
syqada auto --init
you should see a series of directories, with parameterA (abbreviated pa1) and parameterD (abbreviated pa4) varying over two values each:
01-simple-replication-pa1_1-pa4_16@~dummy_
01-simple-replication-pa1_1-pa4_7
01-simple-replication-pa1_2-pa4_16@~dummy_
01-simple-replication-pa1_2-pa4_7
Note the directory names that include the @ and _ characters, which are substitutions for the colon and the slash, made to prevent clashes with the Unix conventions for hostname specification and the path separator.
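To make the naming concrete, here is a minimal sketch of how the four directory names above could be produced. It is not syqada's code. The helper names are hypothetical, and the second parameterD value, 16:~dummy/, is my guess at the original, obtained by undoing the substitution in the directory name.

```shell
#!/bin/sh
# Hypothetical helper: make a parameter value safe for a directory name
# by replacing ':' with '@' and '/' with '_'.
encode_value() {
    printf '%s\n' "$1" | tr ':/' '@_'
}

# Hypothetical helper: print one batch directory name per combination
# of parameterA (pa1) and parameterD (pa4) values.
list_batch_dirs() {
    task=$1
    for a in 1 2; do
        for d in 7 '16:~dummy/'; do   # second value is an assumption
            printf '%s-pa1_%s-pa4_%s\n' "$task" \
                "$(encode_value "$a")" "$(encode_value "$d")"
        done
    done
}

list_batch_dirs 01-simple-replication
```

The sketch prints the same four names as the listing, though not necessarily in the order a directory listing would show them.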
Replication Directory Structure
For the example replication, with two parameters varying over two values in the first and third steps, one parameter varying over two values in the second partial aggregation step, and the aggregation step, these batch directories will be created when the whole protocol is complete:
01-simple-replication-pa1_1-pa4_16@~dummy_
01-simple-replication-pa1_1-pa4_7
01-simple-replication-pa1_2-pa4_16@~dummy_
01-simple-replication-pa1_2-pa4_7
02-partial-aggregation-pa1_1
02-partial-aggregation-pa1_2
03-another-replication-pa1_1-pa2_3
03-another-replication-pa1_1-pa2_4
03-another-replication-pa1_2-pa2_3
03-another-replication-pa1_2-pa2_4
03-QC-step1-pa1_1-pa2_3
03-QC-step1-pa1_1-pa2_4
03-QC-step1-pa1_2-pa2_3
03-QC-step1-pa1_2-pa2_4
04-aggregation
After these, Features.protocol uses a PROTOCOLREF to include a two-step nested protocol, which demonstrates the use of iteration, followed by a final step that demonstrates referencing one of the nested tasks:
0501-spin-date
0502-stats
06-summary
The name of a replication directory may be parsed to identify the values of the parameters used in each replication. The parameter names are abbreviated to three characters, made unique by numbering the third character if necessary. (Don’t bother to test the system by using 10 parameter names that share the first two characters; it will break, and you’re going to have an excessive-compute problem anyway. That is absolutely a YAGNI (Glossary) beyond the scope of our development mandate.)
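A minimal sketch of such parsing follows. It is not something syqada provides. The function name is hypothetical, and it assumes that fields are separated by hyphens, that a replicate field is a three-character abbreviation followed by an underscore and the value, and that neither values nor task names contain anything that breaks that pattern.

```shell
#!/bin/sh
# Hypothetical helper: print "abbreviation=value" for each replicate field
# found in a batch directory name. The body runs in a subshell so that the
# IFS and globbing changes do not leak into the caller.
parse_replicate_dir() (
    IFS=-
    set -f                      # no filename globbing while splitting
    for field in $1; do
        case $field in
            # three characters, an underscore, then a non-empty value
            ???_?*) printf '%s=%s\n' "${field%%_*}" "${field#*_}" ;;
        esac
    done
)

parse_replicate_dir '03-another-replication-pa1_2-pa2_4'
parse_replicate_dir '01-simple-replication-pa1_1-pa4_16@~dummy_'
```

A value that itself contains a hyphen, or a task name with a field shaped like abc_x, would confuse this sketch, so treat it as a starting point only.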
Each replicate directory contains a METADATA file that includes replicate information, e.g.:
replicate:
parameterA = 1
parameterD = 8
Also look at control/Features.replication, which now contains the value sets you defined plus a map of the replicate numbers to the permutations of the parameters. This is not as useful right now as it might be if it included the abbreviated names of the parameters. It is unused by syqada, but you might devise a way to take advantage of it in an elaborate aggregation or reporting step.
You can now run:
syqada auto --project Features
This simple workflow should have no difficulty running to completion (but, pending a bug-fix, it will: you will need to respond to prompts with a “y” to get through the replicates in the first task, possibly after repeating syqada auto if it stops).
The gather steps, steps 02 and 04, as well as step 03, which inherits the scatter remaining in step 02, use shell brace-expansion patterns (strictly speaking, not regular expressions) that list all the output directories of the previous step to formulate their inputdir. For example, here is a fragment of the job runner for step 04:
#!/bin/bash
...
(echo PA1: ; grep -c PA1 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output,03-another-replication-pa1_2-pa2_3/output,03-another-replication-pa1_2-pa2_4/output}/*.out) > 04-aggregation/output/Features.aggregate
(echo PB4: ; grep -c PB4 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output,03-another-replication-pa1_2-pa2_3/output,03-another-replication-pa1_2-pa2_4/output}/*.out) >> 04-aggregation/output/Features.aggregate
(echo PA1: ; grep -c PA1 {02-partial-aggregation-pa1_1/output,02-partial-aggregation-pa1_2/output}/*.aggregate) >> 04-aggregation/output/Features.aggregate
echo Ignore the error code
...
As you can see, it generates some pretty ghastly-long command-lines within the shell script, but it does what you need done, and I, at least, wouldn’t want to write them myself.
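The {dir1,dir2,...}/*.out construct in that fragment is bash brace expansion combined with a filename glob: bash first writes out one pattern per directory, then matches each pattern against the files present. The sketch below imitates this in a scratch directory. The function name, the sample file s1.out, and the file contents are invented for illustration, and the grep is run through bash explicitly because brace expansion is not available in every sh.

```shell
#!/bin/sh
# Hypothetical demo: count matches across two replicate output directories
# using a single brace-expanded pattern.
demo_gather() {
    tmp=$(mktemp -d) || return 1
    mkdir -p "$tmp/03-another-replication-pa1_1-pa2_3/output" \
             "$tmp/03-another-replication-pa1_1-pa2_4/output"
    # Made-up outputs standing in for real task results.
    echo 'PA1, PB3' > "$tmp/03-another-replication-pa1_1-pa2_3/output/s1.out"
    echo 'PA1, PB4' > "$tmp/03-another-replication-pa1_1-pa2_4/output/s1.out"
    # One pattern covers both directories; grep -c prints file:count per file.
    (cd "$tmp" && bash -c 'grep -c PB4 {03-another-replication-pa1_1-pa2_3/output,03-another-replication-pa1_1-pa2_4/output}/*.out')
    rm -rf "$tmp"
}

demo_gather
```

With multiple input files, grep -c reports a count for each file, which is why the aggregate file in the tutorial contains one line per replicate output.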
Feel free to examine the resulting structure, metadata, and scripts.
Note that each replicate of the QC step knows how to identify its single predecessor. I have no idea what would happen if you aggregated during a QC step. Exercise left to the reader. We will make no attempt to find out until the use case arrives at our door.