Preface to the Tutorials

If you are in a real hurry, jump straight to the three Example Tutorials, but you are encouraged to read the preface below.

The reproducibility of large-scale, complex analyses is one of the paramount problems of bioinformatics. This is a non-trivial engineering problem that must be addressed to perform high-quality research. The System for Quality-Assured Data Analysis (SyQADA), a workflow automation system described here, seeks to address reproducibility while imposing the smallest possible learning curve on the user (Keep It Simple, Stupid). Unlike most other workflow systems, SyQADA’s only dependencies are a Unix operating system that provides the bash shell and a standard installation of Python 3.

A SyQADA workflow is simply a list of task definitions. To create a new workflow, a user must write bash script templates that use a simple syntax for specifying parameters that will be substituted with input and output filenames, sample names, and other values that can vary with each invocation of the script. However, SyQADA comes bundled with common next-generation sequencing (NGS) analysis pipelines, including those for sequence alignment, coverage profiling, variant calling, mutation detection, copy number profiling, and variant annotation/reporting.

SyQADA relies on the Unix filesystem to record its progress and allow users to understand that progress easily. This means that simply using tools like ls, cat, and grep can tell you a lot about the execution of a SyQADA workflow. The more comfortable you are with the Unix filesystem the happier you’ll be (this is a fundamental truth of modern computing, irrespective of whether you use SyQADA).

As a matter of course, SyQADA works in a single project directory, expecting a specified sourcedata directory in some other location related to the same project. The way I have been configuring things, the directory structure looks about like this:

MYPROJECT/
         sourcedata/batchA
         working/example/
                        control
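
The skeleton above could be prepared by hand with something like the following sketch. MYPROJECT, batchA, and example are simply the placeholder names from the listing, and the control directory is left out on the assumption that syqada begin creates it when run from inside the working directory.

```shell
# Sketch only: build the project skeleton shown above.
# MYPROJECT, batchA, and example are placeholder names from the listing.
mkdir -p MYPROJECT/sourcedata/batchA
mkdir -p MYPROJECT/working/example

# control/ is not created by hand here; the assumption is that
# `syqada begin`, run inside MYPROJECT/working/example, populates it.
find MYPROJECT -type d | sort
```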

The available workflows are found in the SyQADA installation directory under workflows. For instance, the tasks for detecting somatic variation are found in:

workflows/seqdata/somaticvariation/tasks

The templates for detecting somatic variation are found in:

workflows/seqdata/somaticvariation/templates

Protocols are found in the protocols directories of each workflows subdirectory. To select one for a new project, create and cd into a new directory, and execute:

syqada begin

You will see a list of protocols. Selecting one will create a control directory and populate it with the protocol file, a config file containing the terms and paths that need to be defined for the protocol, and a dummy samples file. If you are running a somatic workflow, you’ll need to create a tumor_normal file (if you are, adding a TissueType column to the samples file can be useful for annotations).

It is sometimes useful (particularly with tumor-normal projects) to set the PROJECT environment variable. This would match the prefix of your protocol file. In bash, that would be:

export PROJECT=MYPROJECT

The samples file should contain the names of your samples. These are usually the names of the directories in which the raw data was delivered (if you received fastq files from the sequencing core) or the filenames of your vcf or CEL files.

Example Tutorials

Internally, SyQADA sets the environment variable SYQADA to the syqada-2.2.2 directory. You may wish to set that variable in your own environment, but you will probably at least want to add syqada-2.2.2/bin to your PATH variable to simplify your life (if you’re on our team, the team bash_profile does so).
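
In bash, that setup might look like the sketch below. The install location /opt/syqada-2.2.2 is purely hypothetical; substitute wherever SyQADA was unpacked on your system.

```shell
# Sketch only: /opt/syqada-2.2.2 is a hypothetical install location.
export SYQADA=/opt/syqada-2.2.2

# Put the syqada command on the search path ahead of anything else.
export PATH="$SYQADA/bin:$PATH"
```

Adding these two lines to your own bash profile would make the setting permanent.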

The tutorial workflows are found in:

$SYQADA/workflows/tutorial

The fast road, however, is to create a test directory, cd into it,

>>> mkdir tutor
>>> cd tutor

and then execute

>>>  syqada tutorial
  ( 0) Example              Simple example protocol for tutorial
  ( 1) Features             Protocol for tutorial on special features
  ( 2) HAPLOHSEQ            Protocol for real-world tutorial on the use of hapLOHseq
  Select the number between 0 and 2 corresponding to your choice ...

Select the number corresponding to your choice. The simple, stupid example follows. The other two tutorials are described in Tutorial on Special Features and Real-World Tutorial: hapLOHseq.

A Simple, Stupid Workflow

Now you are ready to run the four steps of a mind-bogglingly pointless workflow that counts the characters in a series of files named for the “sample names” and records them in individual files, tests the lengths of those files to see if they match a given value (8), and then runs a “QC” step (the example is used in the test suite, so it’s useful to include a variety of steps).

At this point, you probably want to look at control/EXAMPLE.reference. In this protocol, you’ll find four tasks:

Protocol Version 1.0
TASKDEF = count-characters workflows/example/control/01.count-characters.task
TASKDEF = demonstrate-failure-handling workflows/example/control/02.demonstrate-failure-handling.task
...
TASKDEF = QC_indexer  workflows/example/control/02.QC-count-outputs.task
...
TASKDEF = report-all workflows/example/control/03.report-all.task
...

Four lousy tasks! Well, it’s a model. The other lines are mostly there so that the test suite can test SyQADA functionality. Although a SyQADA task normally takes its inputs from the output of the preceding task, the added_input lines allow tasks to take additional inputs identified either by the indexing term in the task definition, or by explicit pathname. Note that the first lousy task is so simple it can be expressed as an INLINE template:

template = INLINE wc -c {inputdir}/{sample}.name > {output_prefix}.chars
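
To get a feel for what the substitution does, here is a rough imitation using sed. This is only a sketch and not SyQADA’s own mechanism; the render function and the values passed to it (sourcedata, output/rxia) are made up for illustration, and rxia is one of the tutorial’s sample names.

```shell
# Sketch only: mimic placeholder substitution with sed.
# This is NOT how SyQADA does it internally; the values are hypothetical.
template='wc -c {inputdir}/{sample}.name > {output_prefix}.chars'

render() {
    # $1 = inputdir, $2 = sample, $3 = output_prefix
    printf '%s\n' "$template" |
        sed -e "s|{inputdir}|$1|g" \
            -e "s|{sample}|$2|g" \
            -e "s|{output_prefix}|$3|g"
}

# Print the command line that one sample's script would contain.
render sourcedata rxia output/rxia
# prints: wc -c sourcedata/rxia.name > output/rxia.chars
```

The real scripts generated from templates can be inspected in the status directories of each task once the workflow has run.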

An alternative (the original) representation of the same protocol is control/EXAMPLE.protocol, which specifies each task inline, and uses a separate template file for the first task.

Now, blast ahead and create the structure for the first task of the workflow by executing:

syqada auto --init

The command will have created the first task directory and metadata:

01-count-characters/METADATA

Look at this METADATA file if you wish. It defines the necessary parameters for syqada to create and execute the jobs associated with this first step.

Then repeat the previous command, removing the --init:

syqada auto

You should see a bunch o’ output, which should take 20 seconds or so to finish. SyQADA will report the submission and successful completion of all jobs for the first step and the start of the second step, culminating in something quite a bit like this:

H00:00:11.645 syqada-1.1-beta: Task 02-demonstrate-failure-handling 10 of 11 required jobs completed. Batch in error
H00:00:11.646 Batch 02-demonstrate-failure-handling result: 1 - Batch in error: stop
H00:00:11.646 Batch problem: Batch in error
Checking control directories... ...........
Checking logs... ......................
********************************************************************************
         rxia:   15 stderr,     1 stdout 02-demonstrate-failure-handling/ERROR/demonstrate-failure-handling-runner-rxia.sh%10648
********************************************************************************
-----------------------------------  stderr  -----------------------------------
--------------------------------------------------------------------------------

This demonstrates a job failure that might occur due to a
...(11)...
syqada batch 02-demonstrate-failure-handling --step repend run
will accomplish the same thing in one step
--------------------------------------------------------------------------------
********************************************************************************

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
** syqada believes this is an intentional error for the tutorial.
** For more information, run:
**     syqada errors --help EXAMPLE_MESSAGE
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/

Automator terminated task 02-demonstrate-failure-handling with an error.

This will indicate, as you can see, that the second step, 02-demonstrate-failure-handling, failed, and it shows the truncated standard error output for the failed job. Note that syqada errors makes an attempt to recognize and provide help with the problem. In this case, SyQADA is able to guess the nature of the error, so it tells us that we can get more information by running:

syqada errors --help EXAMPLE_MESSAGE

Doing so provides a more detailed description of the problem.

Before we start diagnosing the failure, for practice, run:

syqada manage 01-count-characters

and look at the results. Then:

ls 01-count-characters

just to look around. Poke around there for a while; look at the LOGS and output directories, and at the DONE directory.

Now, if you look into the second directory, you’ll see ten scripts in the DONE directory and one in the ERROR directory, as well as ten files named LOGS/*.done and one named LOGS/rxia.failed.

So let’s debug the error (SyQADA told us, but one is not always so lucky). I generally start by confirming the task status:

syqada manage 02-demonstrate-failure-handling

which produces this:

1.0
Jobs 11, Queues  PENDING 0,  RUNNING 0,  DONE 10,  ERROR 1
               ,           ,  begun 11,  done 10,  failed 1, outputs 10
Batch in error

The line beginning with Jobs reports the total number of jobs and the contents of each status directory. The line below it reports the number of files with each suffix in the LOGS directory, and the number of jobs with output.
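
Because these counts come straight from the filesystem, you can approximate them yourself. The sketch below is not part of SyQADA; count_files and task_status are hypothetical helper names, and the .begun suffix is an assumption inferred from the “begun” column (only .done and .failed are named in this tutorial).

```shell
# Sketch only: approximate the counts reported by `syqada manage`
# using nothing but the filesystem. Not part of SyQADA.
count_files() {
    # Count regular files directly inside directory $1 matching glob $2.
    find "$1" -mindepth 1 -maxdepth 1 -type f -name "$2" 2>/dev/null |
        wc -l | tr -d ' '
}

task_status() {
    task=$1
    # One count per status directory.
    for queue in PENDING RUNNING DONE ERROR; do
        printf '%s %s\n' "$queue" "$(count_files "$task/$queue" '*')"
    done
    # One count per log suffix (.begun is an assumed suffix).
    for suffix in begun done failed; do
        printf '%s %s\n' "$suffix" "$(count_files "$task/LOGS" "*.$suffix")"
    done
}

# Only meaningful from the directory that holds the task directories.
if [ -d 02-demonstrate-failure-handling ]; then
    task_status 02-demonstrate-failure-handling
fi
```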

Since there’s only one error, it’s easy to find by looking for the .failed file in LOGS. Start with

ls 02-demonstrate-failure-handling/LOGS/*.failed

to find the guilty party, and then cat the standard error output for that job:

cat 02-demonstrate-failure-handling/LOGS/rxia.err

You’ll see this:

This demonstrates a job failure that might occur due to a
system configuration issue. In this case, the data had
a length of 8 and there was no file in the working directory named no-length-based-name-bias.

If you execute
 touch no-length-based-name-bias
and then run
 syqada batch 02-demonstrate-failure-handling --step repend
it should report that the job has been moved back to PENDING, so that
it can be run again. Then
 syqada batch 02-demonstrate-failure-handling --step run
will cause it to run without error.
 syqada batch 02-demonstrate-failure-handling --step repend run
will accomplish the same thing in one step

With a workflow running real software, of course, the program probably won’t tell you what to do next (GATK programs give very helpful messages, but the norm is simply to report the error or, worse, to fail unceremoniously). You’ll have to figure out an error message that can range from “file not found” to an obscure stack dump. With some programs, you’ll discover that the standard error output is empty and that you have to look at LOGS/*.out instead. You may need to examine the files in the output directory to see whether the program generated any output at all, which can help determine the cause of the error. Since figuring out the cause of a failure is the hardest task in computing, SyQADA tries to make it as simple as it can be by standardizing the location of outputs, and by parsing the stderr file to detect some common failures generated by configuration errors and suggest possible causes. See the Troubleshooting Guide for some tips on how to sort out failures.
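
When there are several failures, the by-hand routine described above can be wrapped in a small loop. This is a sketch, not a SyQADA command; show_failures is a hypothetical helper, and it assumes that a job called NAME leaves NAME.failed, NAME.err, and NAME.out in LOGS, as rxia does in this tutorial.

```shell
# Sketch only: for each failed job in a task directory, print its stderr
# log, or its stdout log when stderr is empty. Not part of SyQADA.
show_failures() {
    task=$1
    for marker in "$task"/LOGS/*.failed; do
        [ -e "$marker" ] || continue          # the glob matched nothing
        name=$(basename "$marker" .failed)
        if [ -s "$task/LOGS/$name.err" ]; then
            log="$task/LOGS/$name.err"        # stderr has content
        else
            log="$task/LOGS/$name.out"        # fall back to stdout
        fi
        echo "== $name ($log)"
        if [ -f "$log" ]; then
            cat "$log"
        fi
    done
}

# Only meaningful from the directory that holds the task directories.
if [ -d 02-demonstrate-failure-handling ]; then
    show_failures 02-demonstrate-failure-handling
fi
```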

Using the syqada errors command

You might examine the LOGS directory by hand just to get the feel of it, but of course, when syqada auto terminated, it reported:

Checking control directories... ...........
Checking logs... ......................
   1 error with  15 lines of stderr output
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Errors %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
********************************************************************************
         rxia:   15 stderr,     1 stdout 02-demonstrate-failure-handling/ERROR/demonstrate-failure-handling-runner-rxia.sh%22389
********************************************************************************
-----------------------------------  stderr  -----------------------------------
--------------------------------------------------------------------------------

This demonstrates a job failure that might occur due to a
...(11)...
syqada batch 02-demonstrate-failure-handling --step repend run
will accomplish the same thing in one step
--------------------------------------------------------------------------------
********************************************************************************

This is basically the equivalent of running

% syqada errors 02-demonstrate-failure-handling 5

The ...(11)... in the output indicates that 11 lines of standard error output have been elided. To get syqada to show you the whole file, re-run it like so:

% syqada errors 02-demonstrate-failure-handling 15

This asks SyQADA to print up to 15 lines of a single error.
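
If the ...(N)... marker is unfamiliar, the following awk sketch imitates the idea: keep a few lines from each end of a file and replace the middle with a count. It is not SyQADA’s own algorithm, and the elide function and its “lines to keep at each end” argument are assumptions made for illustration; how the numeric argument of syqada errors maps onto what is shown is not specified here.

```shell
# Sketch only: elide the middle of a file, keeping $2 lines at each end.
# This imitates the ...(N)... marker; it is not SyQADA's own algorithm.
elide() {
    awk -v keep="$2" '
        { line[NR] = $0 }                     # remember every line
        END {
            if (NR <= 2 * keep) {             # short file: print it whole
                for (i = 1; i <= NR; i++) print line[i]
                exit
            }
            for (i = 1; i <= keep; i++) print line[i]
            printf "...(%d)...\n", NR - 2 * keep
            for (i = NR - keep + 1; i <= NR; i++) print line[i]
        }' "$1"
}

# Only meaningful from the directory that holds the task directories.
if [ -f 02-demonstrate-failure-handling/LOGS/rxia.err ]; then
    elide 02-demonstrate-failure-handling/LOGS/rxia.err 2
fi
```

With a 15-line file and two lines kept at each end, the sketch elides 11 lines, which matches the shape of the output shown earlier.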

syqada errors can be even more useful when there are many errors, because it categorizes them by size and uniqueness, which makes it easier to deduce what is wrong. See syqada errors for more information.

Now let’s go ahead and do what we were told in the error message:

% touch no-length-based-name-bias
% syqada batch 02-demonstrate-failure-handling --step repend run

and the error job(s) will be restored to PENDING. If you want, you can check that by repeating the syqada manage command from above. But let’s go ahead and finish the workflow by resuming syqada auto: simply re-run the syqada auto command, the same as before.

This will confirm that task 01 is complete, discover and report the incompleteness of 02, and then ask you if you want to resume. When you respond yes, SyQADA will re-submit the one job that was repended, and then, when it succeeds, continue by finishing task 03 and task 04 (syqada auto simply runs syqada manage to determine whether a task is complete).

If you have not done so already, browse the templates directory and compare those templates with the generated scripts (which are now all in the DONE directories of the various steps) to see how a template is converted into a script.

Task 03 demonstrates the use of the QC designation, and the template also shows two ways of specifying the list of files in an input directory.

Task 04 demonstrates the use of the task specification:

jobgeneration = summary

Only a single job is created, which analyzes all the output of the previous task. Because the immediate previous task is prefixed with QC, task 04 looks back beyond it to task 02 for its input.

In addition, Task 04 illustrates how the added_input parameter is populated by the template. The protocol defines two values for added_input, separated by a comma; examine the output file to see how they were used.

For a tutorial on real-world bioinformatics using the hapLOHseq allelic imbalance detector developed in our lab, see Real-World Tutorial: hapLOHseq. For a tutorial on the use of replication, see Tutorial on Special Features.

Creating a New Workflow

To create a new workflow, you would need to create your own simple task and template definitions for steps in the workflow. Examples of those are provided for this workflow in the example/haplohseq/tasks and example/haplohseq/templates directories.

You need to create:

A protocol file:

Lists sequential steps to be executed by the workflow.

A config file:

Specifies software and data dependencies for the workflows.  These
variables can be referenced in protocol, task and template files so
that workflows can easily be ported to other platforms by simply
modifying config files.

A samples file:

Lists names of samples to process through workflows.  These names
are used as prefixes for intermediate and output files of the
workflow.

And tasks and templates:

These are definition files for steps in the workflow.  Tasks define
resources needed for a step in the workflow to be executed on an LSF
or PBS cluster (or on a local server or desktop).  Tasks also allow
the user to split jobs based on chromosomes.  Template files define
the actual step to be executed.

Running the workflow then generates the following.

Log files:

Log files are generated for each step in the workflow including
logs for console output, errors and a job completion status.

Output files:

Each step of the workflow contains an output directory that contains
the artifacts generated by that step.

Running the command:

syqada begin

will provide a list of existing protocols that you can use to start a project. When you select a number from the list, that protocol will be used to build a skeletal control directory containing protocol, config, and samples files, which you can then edit to prepare for running the protocol.