BatchRunner.py

Prepare a batch of jobs using JobGenerator and then run them using JobBatch.

Invocation looks something like:

>>> syqada batch batchrootdirectory --step init run

Three behaviors can currently be specified via the --step parameter: init, run, and repend.

init

builds a JobBatch, including directories and METADATA; creates jobs based on the sample names, the task, and a filename specification; and places them in the PENDING queue. Loosely, the INPUTDIR attribute is concatenated in turn with each sample name and the NAMEPATTERN attribute to form a search string used to find the input files for each sample's job (see the sketch after this list).

If invoked without init, BatchRunner reloads the JobBatch from its METADATA file, and then performs either repend or run (or both).

run
simply runs a JobBatch’s main process loop.
repend
looks for failed jobs (those found in the ERROR directory) and assumes they should be re-run. It moves them to the PENDING queue and tells the JobBatch to run its process loop. It also moves old entries in the LOGS directory to LOGS/badlogs, but does not (yet) clean up the garbage output.
max_bolus=N
overrides the maximum bolus size found in the batch’s METADATA, handy for situations where the memory/processor estimates were incorrect for a LOCAL batch.
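
By way of illustration, here is a minimal sketch of the search-string construction described under init. The function name and the path separator are assumptions for this example, not SYQADA internals:

    import glob

    def find_sample_inputs(inputdir, sample, namepattern):
        # INPUTDIR, the sample name, and NAMEPATTERN are concatenated
        # in turn to form a Unix shell glob for that sample's inputs,
        # e.g. find_sample_inputs("/data/project", "SAMPLE01", "*.bam")
        pattern = f"{inputdir}/{sample}{namepattern}"
        return sorted(glob.glob(pattern))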

The following options of the jobgeneration attribute of the METADATA file are implemented by BatchRunner:

generate
causes a job to be created for each sample, regardless of whether appropriately named files exist in the filesystem. An example of usage is workflows/haplohseq/download, where the sample names are known in advance but the data must be fetched from a remote site. It has also proved useful in other cases, such as the haploh pipeline.
summary

causes the glob pattern for input filenames to be created by combining {inputdir} and the namepattern, omitting the sample name(s). This generates a single summary job named for the project (merge is applied internally).

summary can be used in combination with generate.

merge
causes all files named for a particular sample and namepattern to be passed and matched against the term ‘{filename}’, which must stand by itself with no other braced terms on the same line in the job script template.
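
For illustration, a minimal sketch of how summary and merge affect the glob and the template substitution. The helper names, the path separator, and the exact substitution mechanics are assumptions:

    def build_glob(inputdir, namepattern, sample=None):
        # summary omits the sample name, yielding one project-wide glob;
        # otherwise the sample name is interposed, as under "init" above.
        if sample is None:
            return f"{inputdir}/{namepattern}"
        return f"{inputdir}/{sample}{namepattern}"

    def apply_merge(template_line, filenames):
        # merge: all matching filenames are substituted together for the
        # lone braced term '{filename}' in the job script template line.
        return template_line.replace("{filename}", " ".join(filenames))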

The following METADATA option is also implemented by BatchRunner:

tumor_normal
gives the name of a file containing tumor/normal pairs. Each line should be a triple: individual-identifier, normal-sample, tumor-sample. These are then concatenated with the inputdir and namepattern to create glob patterns. Matching files are then passed to the script and matched against the terms ‘{normalname}’ and ‘{tumorname}’ respectively. The individual-identifier is passed to the script as {sample}.
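
For illustration only, a sketch of reading such a pairs file; the field separator (whitespace here) and the function name are assumptions, and the real parsing lives in SYQADA:

    def read_pairs(path):
        # Each line is a triple: individual-identifier, normal-sample,
        # tumor-sample (whitespace-separated by assumption).
        pairs = []
        with open(path) as handle:
            for line in handle:
                if not line.strip():
                    continue
                individual, normal, tumor = line.split()
                pairs.append((individual, normal, tumor))
        return pairs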

Other options are discussed in JobGenerator.

Developer Documentation Only Below This Point:

class BatchRunner.BatchRunner(batchroot, args, protocol=None, rpt=None)

Manage a JobBatch.

Given a batch directory, expect a METADATA file there; build or reload a JobBatch, create the jobs necessary to run the task on the input data, and let the JobBatch manage the jobs.

Creating the jobs is delegated to JobGenerator, and details are found there. Running the batch is delegated to JobBatch.

generate_jobs(merge, jobdict, splitrange, overlap, iterations, filenames=None, globstring=None)

Hand over to JobGenerator the task of creating shell scripts.

globstring: string used as a Unix shell glob to find matching filenames

merge: --merge implies lumping all matching filenames together (also implied by --summary); otherwise, one file at a time is fed to the generator

jobdict: a dict of sample-related terms (1.1 and above); otherwise, just a string used as the jobname, retrofitting the project name as the jobname

init_JobBatch(resume=False)

Initialize a JobBatch, either from METADATA or from configuration supplied as arguments to BatchRunner.

If this is a new JobBatch, run the JobGenerator to create the shell scripts to be run.

manage_job_creation(queuespec)

Evaluate parameters to determine how the jobs should be split, and then delegate the job creation to JobGenerator.

prepare_configuration(configuration, task, sample_file)

Read the configuration file and build a dictionary for use throughout the batch. Substitute environment variables specified in the usual bash/csh form (leading dollar signs) with their environment values, or report an error.
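
A minimal sketch of the substitution step, assuming bash/csh-style references such as $HOME or ${HOME}; the regular expression and the error-reporting style are assumptions:

    import os
    import re

    _ENVVAR = re.compile(r"\$\{?(\w+)\}?")

    def substitute_env(value):
        # Replace each $NAME or ${NAME} with its environment value,
        # or report an error if the variable is unset.
        def lookup(match):
            name = match.group(1)
            if name not in os.environ:
                raise ValueError(f"environment variable ${name} is not set")
            return os.environ[name]
        return _ENVVAR.sub(lookup, value)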

reduce_inodes(tmpdir)

After completion, irrespective of success, reduce duplicate files to common inodes.
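
A minimal sketch of one way to do this, assuming duplicates are identified by content hash and collapsed with hard links; the detection strategy is an assumption:

    import hashlib
    import os

    def reduce_inodes(tmpdir):
        seen = {}  # content digest -> first path with that content
        for root, _dirs, files in os.walk(tmpdir):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as handle:
                    digest = hashlib.sha256(handle.read()).hexdigest()
                if digest in seen:
                    # Replace the duplicate with a hard link so both
                    # names share a single inode.
                    os.unlink(path)
                    os.link(seen[digest], path)
                else:
                    seen[digest] = path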

repend(rerun=False)

Instruct the JobBatch to move scripts that failed back to PENDING so that they may be re-run.

run(parallel=False)

Execute the JobBatch main loop.

validate_queue()

Check to make sure that a specially chosen queue is permissible.

Note that this assumes that self.b, the JobBatch, has been created (i.e., that init_JobBatch() has started).

HACK: It also sets a flag that allows the (PBS) queue_manager to determine whether to set the special (PBS) queue variable.

BatchRunner.LEGAL_STEPS = ['init', 'repend', 'rerun', 'run', 'auto']
BatchRunner.choose_max_bolus(args, metadata)

Select max_bolus based on arguments and metadata. Return an int.
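
A minimal sketch of the selection logic, assuming a value supplied in the arguments overrides the one recorded in METADATA; the attribute names and the fallback value are assumptions:

    def choose_max_bolus(args, metadata):
        # A max_bolus given on the command line wins over METADATA.
        if getattr(args, "max_bolus", None) is not None:
            return int(args.max_bolus)
        return int(metadata.get("max_bolus", 1))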

BatchRunner.common_parsing(parser)

Arguments that should be shared with any object (the Automator) that invokes this class.

BatchRunner.main(input=None, parser=None, auto=None, resume=None, parallel=False, stderr=sys.stderr, rpt=None)

This main() has parameters that are only used by Automator during its run process. That, I feel, is far better than effectively duplicating this in Automator as BatchRunner_main(), which is how it was done until 2.1.

BatchRunner.parse(input=None, parser=None, stream=sys.stderr)

Create and return all of the necessary information to get the JobBatch running. Complain about semantic validation issues.