.. _existing-workflow:

Running an Existing Workflow
============================

To run an existing workflow, you need to create two files (a config file and
a sample file) and copy a protocol file from an existing workflow. Let us
call our project MYPROJECT. To follow convention, and therefore use the
easiest invocation, these files should be placed in your control directory
and named MYPROJECT.config, MYPROJECT.samples, and MYPROJECT.protocol:

MYPROJECT.config
    This file contains colon-separated attribute-value pairs on separate
    lines that specify the paths for programs and reference files required
    by the workflow. A possible start is to copy from
    workflows/example/Example.config, which contains specifications for most
    of the default programs and reference files. Most of the required
    reference data are nested inside the TEAM_ROOT directory. The directory
    containing the initial data should be designated as the sourcedata
    attribute of the config file (a minimal sketch appears below).

MYPROJECT.samples
    This file contains one sample name per line. These names normally serve
    as the prefix of the files in the sourcedata directory, for instance,
    SAMPLE123.bam or S001_001.fastq. Best practice is to create
    de-identified sample names if you have been given medical record
    numbers, to reduce the chance of MRNs being propagated into result
    tables or charts. If given a directory of files named with MRNs, I like
    to create a second directory of hard links named with de-identified
    sample names (see the sketch below). Of course, you need a manifest that
    maps between the two sets of names.

MYPROJECT.protocol
    This file contains task specifications or references to files containing
    task specifications. You should need only to copy the protocol from the
    appropriate subdirectory of the workflows/control directory (see
    :ref:`Workflows` for descriptions) into your control directory. If no
    existing protocol fits, you may need to construct a protocol file; this
    is explained in :ref:`new-workflow`.

Assuming you have followed convention, you can then execute the workflow
automatically by running:

>>> syqada auto

or, if you prefer lots of typing, or wish to place these files in a
non-standard place, you can run:

>>> syqada auto --configuration path/MYPROJECT.config \
                --sample_file path/MYPROJECT.samples \
                --protocol path/MYPROJECT.protocol \
                --project MYPROJECT \
                --notifications path/afilename.txt

If all is correctly configured, SyQADA will first create a series of
processing directories numbered sequentially and named for the tasks in your
protocol, something like::

    01-bwa-align
    02-bwa-sampe

Each of these directories will now contain a METADATA file that contains
everything necessary for an invocation of *syqada batch* to create the job
scripts for the task and then execute them. (You may optionally use *syqada
auto --init* to make SyQADA create the directories and their metadata, and
then run each step manually with *syqada batch*.)

SyQADA will then attempt to run each step in turn, creating the appropriate
number of jobs for your sample data and submitting them (see
:ref:`jobsubmission` below). SyQADA will then monitor the jobs for
completion, submitting more jobs as necessary to complete the batch. SyQADA
reports progress roughly every 5 minutes, or earlier when jobs finish,
indicating the percent of jobs complete, as well as the number of jobs that
have completed or failed.
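Returning to the input files for a moment, here is a minimal sketch of what
MYPROJECT.config might contain. Only the sourcedata attribute and the
colon-separated layout are taken from the description above; the other
attribute names and every path are invented for illustration, so copy
workflows/example/Example.config for the names your workflow actually
expects::

    sourcedata: /path/to/TEAM_ROOT/MYPROJECT/rawdata
    reference: /path/to/TEAM_ROOT/reference/genome.fa
    bwa: /path/to/bin/bwa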
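A matching samples file is just a list of names, one per line::

    S001_001
    S001_002

If the raw files arrived named with MRNs, the hard-link trick mentioned
above might look roughly like this (the file names and directories are
invented for illustration; these are ordinary shell commands, not SyQADA):

>>> mkdir deidentified
>>> ln rawdata/MRN1234567.bam deidentified/S001_001.bam
>>> ln rawdata/MRN7654321.bam deidentified/S001_002.bam

The sourcedata attribute would then point at the deidentified directory, and
the manifest mapping the new names back to the MRNs belongs somewhere that
will not travel with result tables or charts.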
Upon completion of each step, SyQADA reports a message such as::

    H00:00:16.242 Batch completed, 10 successes
    H00:00:16.242 Example batch 01-example finished

You may monitor progress of the workflow by viewing the SyQADA status page
for your project::

    http://d1prphaplotype1.mdanderson.org:8080/RISPROJECTS/syqada-status/MYPROJECT.txt

Manual syqada tools
-------------------

If your workflow stops with a message such as::

    H04:12:12.918 INFO Kadara_FEPilot batch 0804-mutect stopped

then there is some problem that you will have to correct. In addition to
*syqada auto*, there are two other useful commands:

syqada manage
    Corrects problems with SyQADA's management of jobs.

syqada batch
    Runs a single batch, optionally moving failed job scripts back to the
    PENDING directory and saving the failed logs for later examination.

They are used roughly as follows (of course the batch directory will
probably be different). Running:

>>> syqada manage DIRECTORY

will show the current state of the task that is running in DIRECTORY. Usage
of syqada manage is discussed in :ref:`syqada_manage`. If SyQADA terminates
early for some reason and syqada manage reports unmanaged files, you can
run:

>>> syqada manage DIRECTORY --fix done failed

and completed or failed jobs that were not managed automatically will be
moved into their appropriate status directories. You can then repend and
rerun the failed jobs with:

>>> syqada batch DIRECTORY --step repend run

Alternatively, you can invoke without the *run* step, thus:

>>> syqada batch DIRECTORY --step repend

and then resume automatic mode with::

>>> syqada auto

There are obviously many ways in which a step in the workflow can go wrong.
Please see the short :ref:`Troubleshooting`.

You may find it convenient to identify your project to your (bash)
environment:

>>> export PROJECT=MYPROJECT

so that you can refer to it as $PROJECT. Maybe not. This used to be a
requirement for running SyQADA, but it is no longer strictly necessary,
because syqada itself sets a PROJECT variable based on the --project
parameter if given, or else on the protocol name. There are generic uses of
$PROJECT in protocol and task files in the workflow repository, especially
in tumor-normal workflows. Environment variables are discussed in
:ref:`environment`.

.. _jobsubmission:

Job Submission
--------------

If you are running on one of the cluster nodes, job submission is
accomplished by effectively running this shell command for PBS::

>>> cat job-script | qsub

or this command for LSF::

>>> cat job-script | bsub

The SyQADA JobGenerator takes care of constructing a script that includes
all the necessary incantations to run the cluster job in the right
environment for your job.

For very short jobs, adding the *interface = LOCAL* option in the job task
will cause SyQADA to spawn the job on the local machine. All jobs run on
machines without cluster access are spawned locally. When SyQADA is running
in LOCAL mode, killing the SyQADA monitor kills any child jobs that the
monitor has spawned, leaving them in an invalid state: they remain in the
RUNNING queue even though they finished with a (forced) failure. Those jobs
need to be cleaned up manually (stay tuned, there is a feature ticket to fix
this) by moving the scripts from RUNNING to ERROR and then running:

>>> syqada batch BATCHNAME --step repend

before resuming; a sketch of this recovery appears below.

The script itself is used by SyQADA to indicate the progress of the job
through the system. It starts in the PENDING directory, is moved to RUNNING,
then possibly to QUEUED or STUCK, and finally to DONE upon completion or
ERROR upon failure.
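As a concrete illustration of the LOCAL-mode cleanup just described, the
recovery might look roughly like this, assuming the PENDING, RUNNING, and
ERROR status directories live directly under the batch directory and using
01-bwa-align as a stand-in for whichever batch was interrupted:

>>> mv 01-bwa-align/RUNNING/* 01-bwa-align/ERROR/   # directory layout assumed; adjust to your batch
>>> syqada batch 01-bwa-align --step repend
>>> syqada auto

The final *syqada auto* simply resumes automatic mode as described earlier;
if you prefer, run the single batch with *--step repend run* instead.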
Because the script is the object submitted to the cluster, you may examine
it, edit it, and resubmit it if you wish (editing and resubmission can make
keeping track of reproducibility a nightmare, however, and is only
recommended for debugging a new workflow).

Architecture consistency headaches
----------------------------------

Hypothetically, one can start a workflow on the cluster, run one or more
tasks, stop, and then switch to run the rest of the workflow on a local
machine. However, although SyQADA eliminates the need to worry about whether
you are on the cluster or not, one of the biggest remaining hassles is that
several significant executables have different versions on the cluster and
on local machines, so the configuration file must be changed to reflect
this. Since one of the guiding principles of SyQADA is that the
configuration file should be immutable once the workflow has started, this
is a nasty problem for which there is no graceful solution except to avoid
switching horses in midstream. I commonly run sequence analysis on the
cluster and then switch to run variant annotation and analysis on the
haplotypes, but the reason for doing so (historically, vtools was not
available on the cluster) is no longer as compelling as it once was. On the
other hand, vtools does not do well with parallel execution that it does not
control itself, so running it on the cluster is more or less a misuse of the
head node resource (setting max_bolus = 1 can address this, however).
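If you do run such a step on the cluster anyway, the throttle is a single
attribute in the task specification, written here in the same equals-sign
style as the *interface = LOCAL* example above; the rest of the task file is
omitted because its contents depend on the workflow::

    max_bolus = 1

Consult an existing task file in the workflow repository for where this line
actually belongs.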