Glossary
- batch
- All the jobs that need to be performed for a particular task or step of a protocol
- batch runner
- BatchRunner.py is the program invoked by SyQADA to manage job generation and batch execution.
- batchroot
- The directory in which the jobs for a task will be managed. This directory contains a METADATA file as well as a series of directories, PENDING, RUNNING, DONE, ERROR, and LOGS, which record the state of the batch. Additional directories QUEUED and STUCK are created if those conditions are detected by SyQADA. SyQADA commands batch, manage, and tools all require the specification of batchroot as their first argument in order to work correctly.
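The state directories named above can be inspected with a few lines of Python. This is only a sketch, not SyQADA code: it assumes, purely for illustration, that each job is represented by one file in the directory for its current state. The function name summarize and the file names S1.sh and S2.sh are hypothetical.

```python
import tempfile
from pathlib import Path

# State directories listed in the batchroot entry (LOGS holds output, not jobs).
STATES = ["PENDING", "RUNNING", "DONE", "ERROR"]

def summarize(batchroot):
    """Count entries in each state directory that exists under batchroot."""
    root = Path(batchroot)
    return {state: len(list((root / state).iterdir()))
            for state in STATES if (root / state).is_dir()}

# Build a throwaway batchroot so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for state in STATES + ["LOGS"]:
        (root / state).mkdir()
    (root / "PENDING" / "S1.sh").touch()   # hypothetical job file
    (root / "DONE" / "S2.sh").touch()      # hypothetical job file
    summary = summarize(root)
    print(summary)
```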
- big-O notation
- Optional representation of computational complexity used in specification of jobestimate and gb_memory, if you wish to try to improve the accuracy of walltime estimates. See Tuning Job Estimates.
- config file
- A file that contains key-value pairs that will be used to populate values in the script templates during job generation. Most of the values are paths to software or reference files. This file should remain unaltered during the lifetime of the pipeline (presumably one might add to it, but one should not change an existing setting, such as the version of a piece of software, after using that software in the pipeline). An example config file is found in workflows/control/Example.config.
- environment
- The standard Unix dictionary of variable names and their mapped
values that is provided to a Unix process when it starts. In the
bash shell, the environment can be displayed by running the
env
command. Environment variables are set in the bash shell and passed to child processes by executing the commandexport VARNAME=value
They are used by prepending the variable name with a dollar sign. Thus, given the command above, the commandecho $VARNAME
would printvalue
. There are a few environment variables that are used in the default config. It is very fond of TEAM_ROOT, because that is relatively well fixed for the Scheet Lab on a given machine. One should avoid environment variables in general, though, because they usually become so well ingrained in the user’s sense of environment that problems they cause are devilishly hard to debug. - gather
- An attribute of a step in a replicate workflow that specifies which replicates from previous steps will be summarized.
- interface
- A subclass of queue manager used for a particular task execution. See queue_manager, below.
- job
- A single instance of process execution suitable for submission to the queue_manager. For most workflows, there will be one job generated for each sample for each task. For CPU-intensive tasks, the split option can generate data parallelism by chromosome or region. For summary tasks, the jobgeneration = summary option can generate a single job that collates all results of the previous step for reporting or analysis. For tumor-normal tasks, there is one job per tumor-normal pair.
- job generation
- The act of creating the necessary execution scripts for a step of the pipeline, performed by JobGenerator.py (but usually invoked by BatchRunner.py behind the scenes). Job generation involves reading the config file, a METADATA file, a script template, and a sample file, and producing executable scripts that can be run on the local CPU or submitted to the cluster.
- jobgeneration
- A configuration term found in the protocol file or task file describing how SyQADA should generate jobs. Possible values are generate, merge, and summary. The special value irregular is used to indicate that the number of jobs generated can vary from sample to sample.
- parameter, template parameter
- A term wrapped in braces in a template (e.g., {parameter1}). The parameter and braces will be substituted with a value defined in either the config file, the task file, or the protocol file for the step. Such a definition can itself embed another term in braces; however, circular references are not permitted and reference chains are discouraged.
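The substitution described above can be sketched in Python. This is a minimal illustration, not SyQADA's actual JobGenerator code: the function name expand, the regular expression, and all keys and paths are hypothetical. It follows embedded references and rejects circular ones.

```python
import re

# A parameter is a word wrapped in braces, e.g. {parameter1}.
PARAM = re.compile(r"\{(\w+)\}")

def expand(text, definitions, _active=()):
    """Replace each {name} in text with its definition, recursively."""
    def substitute(match):
        name = match.group(1)
        if name in _active:
            chain = " -> ".join(_active + (name,))
            raise ValueError("circular reference: " + chain)
        if name not in definitions:
            raise KeyError(name)
        # A definition may itself embed another braced term.
        return expand(definitions[name], definitions, _active + (name,))
    return PARAM.sub(substitute, text)

definitions = {
    "team_root": "/data/team",                  # hypothetical value
    "reference": "{team_root}/ref/genome.fa",   # one-link reference chain
    "sample": "S1",
}
command = expand("align --ref {reference} {sample}.fastq", definitions)
print(command)
```

Tracking the names currently being expanded is what allows a circular definition to be reported as an error instead of recursing without end.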
- pipeline
- An automated sequence of processes requiring no human intervention.
A brief manifesto on word usage: it seems that everyone in the bioinformatics world refers to a sequence of bioinformatic steps to analyze a set of data as a “pipeline.” Because of my computer science upbringing, my definition of a “pipeline” is an automated process, whereas a sequence of steps in which a user executes each step and studies the results before proceeding to the next step is a “workflow.” However, to communicate with everyone else in bioinformatics, I often equate them. You may see either term used in this documentation.
- protocol
- A list of tasks in the order they should be performed, along with specifications of the variable parameters. This can either be a file containing the tasks and all their specifications, or a file that lists the taskfiles (.task) themselves, as well as possible parameter choices on a per-step basis. A workflow is governed by a protocol; through metonymy, protocol and workflow often stand for each other in this document.
- QC step
- A quality control step, that is, a step in a workflow that measures some aspect of the performance of a previous step, whose output is therefore not directly part of the workflow’s production. The task identifier for such a step should be prefaced with QC (as in QC1-coverage) so that SyQADA will set the subsequent task’s inputdir appropriately.
- queue_manager
- The controller that knows how to submit jobs for execution, check their statuses, and manage their output. There are currently three queue_managers: PBS, which runs jobs on the Nautilus cluster; LSF, which runs jobs on the Shark cluster; and LOCAL, which runs jobs on the local host. Since version 0.9.8, SyQADA identifies and chooses the cluster manager when running on a cluster node, and the LOCAL manager when not. Specifying “interface = LOCAL” in the METADATA forces use of the LOCAL manager even when on the cluster.
- replicate
- A single iteration of the pipeline generated by placing one value each of any of one or more designated interpolant terms into the METADATA file of that iteration. There is no limit on the number of replicate parameters or their values, but if the simple product of the number of values is high, it is likely to tax the capacity of the system and the patience of both the invoker and the cluster system administrators.
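To see how quickly the simple product grows, the replicates can be enumerated with itertools.product. This is a rough sketch rather than SyQADA's implementation, and the parameter names and values are hypothetical.

```python
from itertools import product

# Hypothetical replicate parameters and their candidate values.
replicate_params = {
    "min_quality": [10, 20, 30],
    "window": [50, 100],
    "caller": ["A", "B"],
}

names = list(replicate_params)
# One dictionary per replicate: one value chosen for each parameter.
replicates = [dict(zip(names, combo))
              for combo in product(*replicate_params.values())]

print(len(replicates))   # 3 * 2 * 2 = 12 replicates
print(replicates[0])
```

Adding a fourth parameter with five values would raise the count to 60, and each replicate runs its own iteration of the pipeline.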
- sample
- The name of some biological tissue or other bodily substance that has been extracted and measured to yield a file containing data that is to be analyzed in a workflow. Although this is often named for an individual, care should be used to distinguish the usage, because one individual can provide many samples. The obvious example is a tumor-normal pair of samples from one individual.
- sample file
- A file containing a series of sample names, one to a line, that will be used to find data on which to perform a workflow. The term sample_file is available for use in script templates. Starting with version 1.1, the sample file may be a tab-delimited file containing other columns that provide phenotypic or other sample-specific attributes.
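A tab-delimited sample file of this kind might be read as follows. This is a sketch under stated assumptions, not SyQADA's own parser: it assumes the sample name is the first column and that there is no header row, and the function name read_samples and the file contents are hypothetical.

```python
import csv
import io

# Hypothetical sample file: a name, then optional attribute columns.
sample_text = "S1\ttumor\tF\nS2\tnormal\tM\nS3\n"

def read_samples(handle):
    """Return {sample_name: [attributes...]} from a tab-delimited file."""
    samples = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        samples[row[0].strip()] = row[1:]
    return samples

samples = read_samples(io.StringIO(sample_text))
print(list(samples))
print(samples["S1"])
```

A plain one-name-per-line file is handled by the same code, since a line with no tabs yields a single-column row with an empty attribute list.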
- scatter
- A replication step that defines which parameters will be “scattered” across the task to create individual replicates.
- script template
- A file that represents a Unix bash shell script with certain standard terms surrounded in braces (viz., {braces}) indicating that they will be replaced by values found in the configuration file or computed by the JobGenerator during SyQADA initialization.
- stderr, stdout
- The historical abbreviations for standard error and standard output, the two standard output streams of a Unix process. The stderr and stdout for each job are captured in LOGS/SAMPLENAME.err and LOGS/SAMPLENAME.out.
- task
- One step of a workflow, defined by a .task file that specifies a script template (usually in the workflows directory) with a name in the form task_name.template. .task files usually carry numeric prefixes to help provide a guide to their ordering, but a protocol file is a better guide to the proper ordering of the tasks in a workflow. For historical reasons, in SyQADA this is also called a batch.
- task identifier
- The term assigned in the TASKDEF of a protocol step to identify its output directory for use as input in non-immediate subsequent steps. The format is:
    TASKDEF my_index_name path-to-task-definition.task
  which permits later reference using either
    inputdir = my_index_name
  or
    added_input = my_index_name
- workflow
- A sequence of processes performed by a system. Many workflows are computerized, so that some steps are performed by machines and some steps, usually quality control, are performed by humans. A pipeline is a workflow performed entirely by machine. As of 0.9.8, SyQADA will run workflows as pipelines up to the point of first error, at which point they morph back into workflows.
- workflows directory
- The directory under the SYQADA home that contains several nested workflow subdirectories, each of which contains the script templates for the tasks in that workflow. See Packaged Workflows.
- working directory
- The directory in which SyQADA runs and creates its task subdirectories. A typical example is $PROJECT/working/alignment.
- YAGNI
- Ya Ain’t Gonna Need It: The “extreme programming” design philosophy that governs SyQADA development. Only those features that are identified as necessary should be designed and implemented. For example, SyQADA makes no provision for workflows with conditional paths, because the cases where that is appropriate don’t seem to occur in our workflows, it would be difficult to implement, and it would make workflow specification more difficult than it already is.