syqada manage
>>> syqada manage *batchroot*
runs batch_tool, which will show the current state of the batch in the given task directory, including any batch management problems. A batch management problem is a circumstance usually caused by premature termination of SyQADA while it is managing the execution of a batch. In this case, the progress of jobs currently in the PBS queue obviously cannot be recorded, and repair work will become necessary. An explanation of what syqada manage does is found in How syqada manage Works.
If there are batch management problems, they may be shown in detail by adding the --detail option with one or more of the additional arguments done, failed, or output. [I haven’t used this in a long time, and there are no unit/integration tests for it, so it may behave oddly.]
If batch management problems do exist, they can be fixed by adding the --fix option with one or more of the same additional arguments.
Note that jobs left in running (or queued or stuck) that are terminated forcibly after the SyQADA batch manager has quit running will not be recognized by syqada manage and must be moved by hand. Since these jobs need to be re-run, the tactic I use is:
mv *batchroot*/running/* *batchroot*/error
syqada batch *batchroot* --step repend
Moving them to batchroot/error and then using --step repend, instead of simply moving them to batchroot/pending, allows SyQADA to trim off the old process id, which would otherwise confuse SyQADA upon restart.
syqada batch XXX --step rerun
is supposed to address this, but a bug frequently crops up, so the method above is more reliable.
Examples of output
A batch that has completed successfully will produce results that look like:
> syqada manage task-directory/
0.9.9
checking control directories... ............................................
checking logs... ...........................................................
syqada-0.9.9: task 0102-varscan
jobs 88, queues pending 0, running 0, done 88, error 0
, , begun 88, done 88, failed 0, outputs 88
88 of 88 required jobs completed.
batch completed
Obviously, there are other conditions. A batch in progress will produce results that look like:
> syqada manage task-directory/
0.9.9
checking control directories... ............................................
checking logs... .......
syqada-0.9.9: task 0802-varscan
jobs 88, queues pending 81, running 0, done 7, error 0
, , begun 7, done 7, failed 0, outputs 7
batch can resume
A batch that has failed will produce results that look like:
> syqada manage 01-phase-samples
0.9.8.3
Checking control directories... .........................................
.........................................................................
Checking logs... .......................................
syqada-0.9.8.3: Task 01-phase-samples
Jobs 2948, Queues PENDING 2615, RUNNING 0, DONE 0, ERROR 333
, , begun 333, done 0, failed 333, outputs 0
Batch in error
In certain cases, because of the timing with which SyQADA manages completing jobs, you may see a message saying “batch needs curation” with a description of discrepancies. This is harmless as long as syqada (auto or batch) is still running. If SyQADA has terminated and you see the message “batch needs curation,” you can run with the --details parameter to get more information. For example:
> syqada manage task-directory/ --details done
to show you exactly which jobs marked done have not been properly managed. The done argument is not the only option; other arguments to the details parameter are explained on the batch_tool.py page.
To curate the batch after syqada batch has terminated, you can run with the --fix parameter. For example:
> syqada manage task-directory/ --fix done failed output
This will curate all jobs that:
* were still in state running but had indicated that they had completed (done)
* were still in state running but had indicated that they had had an error (failed)
* were in state done but did not have the same number of outputs as other completed jobs (output)
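The three categories can be pictured as simple set comparisons. The sketch below is illustrative only and is not SyQADA's implementation: the function name `curation_candidates` and its in-memory inputs are hypothetical stand-ins for what batch_tool reads from the task directory, and using the most common output count as the expected count is an assumption.

```python
from collections import Counter


def curation_candidates(running, done_queue, done_logs, failed_logs, output_counts):
    """Group jobs by the --fix category that would apply to them.

    All arguments are hypothetical in-memory stand-ins:
    running, done_queue    -- sets of job names in those queue states
    done_logs, failed_logs -- sets of job names that logged completion or failure
    output_counts          -- dict mapping job name to its number of output files
    """
    # Jobs still listed as running although they reported completion.
    done = running & done_logs
    # Jobs still listed as running although they reported an error.
    failed = running & failed_logs
    # Assumption: the expected output count is the most common one among done jobs.
    counts = Counter(output_counts.get(job, 0) for job in done_queue)
    expected = counts.most_common(1)[0][0] if counts else 0
    output = {job for job in done_queue if output_counts.get(job, 0) != expected}
    return {"done": done, "failed": failed, "output": output}
```

Each key of the returned dictionary corresponds to one of the arguments accepted by the fix parameter, so an empty set under every key would suggest there is nothing to curate.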
How syqada manage Works
syqada manage does the following things:
* Determines from the task definition how many jobs it should expect for this task.
* Tabulates and counts the contents of the PENDING, RUNNING, ERROR, and DONE directories (these should have only job scripts in them).
* Tabulates and counts the contents of the LOGS directory, expecting one .begun, one .out, and one .err file, and either a .done file or a .failed file for each job.
* If an error state is detected, the error classification routine below is invoked.
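A rough sketch of the log tabulation step follows. It is not taken from the SyQADA source. It assumes, purely for illustration, that each log file is named `<job>.<suffix>`, and the helper names `tabulate_logs` and `incomplete_jobs` are hypothetical.

```python
from collections import defaultdict

REQUIRED = {"begun", "out", "err"}   # one of each expected per job
TERMINAL = {"done", "failed"}        # exactly one of these expected per finished job


def tabulate_logs(filenames):
    """Map each job name to the set of log suffixes found for it."""
    by_job = defaultdict(set)
    for name in filenames:
        # Assumed naming scheme: everything before the last dot is the job name.
        job, _, suffix = name.rpartition(".")
        if job and suffix in REQUIRED | TERMINAL:
            by_job[job].add(suffix)
    return dict(by_job)


def incomplete_jobs(by_job):
    """Return jobs whose logs do not yet match the finished pattern."""
    problems = {}
    for job, suffixes in by_job.items():
        missing = REQUIRED - suffixes
        terminal = suffixes & TERMINAL
        if missing or len(terminal) != 1:
            problems[job] = {"missing": missing, "terminal": terminal}
    return problems
```

Working from a plain list of file names keeps the sketch independent of the real directory layout, which is not described here.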
These conditions must be met to declare a task complete:
- All its job scripts must reside in the DONE directory.
- There must be a .done file in the LOGS for each run-suffix.
- There must be at least one file in the output directory matching the run-suffix of each script.
- All jobs must have the same number of outputs in the output directory.
- As a by-product of the counting method, syqada is likely to have an ungracious response if any of the log files is missing.
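The completion conditions above can be expressed as a handful of checks. This is a minimal sketch under stated assumptions, not the code syqada manage runs: `task_complete` is a hypothetical name, and its arguments are in-memory summaries that a caller would have to build from the DONE, LOGS, and output directories.

```python
def task_complete(expected_jobs, done_scripts, done_logs, outputs):
    """Return (complete, reasons) for the completion conditions.

    expected_jobs -- set of run-suffixes the task should produce
    done_scripts  -- set of run-suffixes whose job script is in DONE
    done_logs     -- set of run-suffixes that have a .done log
    outputs       -- dict mapping run-suffix to its list of output files
    """
    reasons = []
    if done_scripts != expected_jobs:
        reasons.append("not all job scripts are in DONE")
    if not expected_jobs <= done_logs:
        reasons.append("missing .done log")
    counts = {suffix: len(outputs.get(suffix, [])) for suffix in expected_jobs}
    if any(n == 0 for n in counts.values()):
        reasons.append("job with no output")
    if len(set(counts.values())) > 1:
        reasons.append("jobs differ in number of outputs")
    return (not reasons, reasons)
```

Returning the list of reasons alongside the boolean makes it easier to see which condition failed, in the spirit of the discrepancy descriptions that syqada manage prints.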
Error classification
SyQADA error classification begins with the reasonable assumption that error outputs with the same number of lines probably have the same cause, and likely differ only by sample name. It verifies this by comparing all error outputs (the python set object makes this trivial) and counting the unique sets of output, both as generated and with the sample names removed. It then categorizes the results, describes them, and selects an instance of each class of error message for display.
This has proved immensely useful because of the time it saves in error resolution, often as much as the rest of syqada put together.
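The grouping idea can be sketched in a few lines. The code below is an illustration of the approach described in this section, not SyQADA's actual routine: the function name `classify_errors`, the `<SAMPLE>` placeholder, and the assumption that the sample name appears literally in the error text are all hypothetical.

```python
from collections import defaultdict


def classify_errors(error_outputs):
    """Group error outputs that are identical once the sample name is removed.

    error_outputs -- dict mapping sample name to the text of its error output
    Returns a list of (samples, example_text) pairs, largest class first.
    """
    classes = defaultdict(list)
    for sample, text in error_outputs.items():
        # Key on line count plus the text with the sample name masked out.
        masked = text.replace(sample, "<SAMPLE>")
        key = (len(text.splitlines()), masked)
        classes[key].append(sample)
    # One representative message per class, taken from its first sample.
    result = [(sorted(samples), error_outputs[sorted(samples)[0]])
              for samples in classes.values()]
    result.sort(key=lambda item: (-len(item[0]), item[0]))
    return result
```

With three samples, two of which fail to open their own input file and one of which runs out of memory, this would report two classes, so only two messages need to be read instead of three.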