.. _motivation:

Motivation for SyQADA
=====================

SyQADA is a system of Python libraries and executables
that make it possible to manage analysis projects over large amounts of data.
The average scientist would rather spend time thinking about what
observed data mean than managing those data (see :ref:`research`).
The goal of SyQADA is to make it simple to:

Make processes reproducible::

  Running the identical process on many inputs of the same type

  Providing some measure of confidence upon completion that every step
  of every task was performed on every input

  Re-running the same process with altered parameters

  Duplicating an existing experimental method on new data

Unify process development::

  Running steps of the process either on the cluster or on the HAPS
  (or on a Mac)

  Adapting the workflow of one analysis to support a new analysis

Simplify problem-solving::

  Identifying data problems

  Identifying and re-running jobs that fail because of system problems

  Providing QC reports and visualization as a standard with any project

Support publication of analytical results::

  Documenting what computational steps were performed for a given
  project

  Documenting what versions of software and what parameters were used
  for a given project

  Ultimately, producing MAGE-TAB-like IDF and SDRF output to use in
  publication of the experimental methods

Simplify data management::

  Identifying the path of intermediate data through the workflow

  Standardizing the wrangling of intermediate filenames so common to
  process workflows.

  Removing easily reproducible intermediate data

Assist in time management::

  Providing time estimates for individual steps and whole workflows.

  Supporting the execution of an entire processing workflow with as
  little user intervention as possible.

Philosophical Motivation
------------------------

These ambitious goals are offset by the following desires:

Simple usage::

   Keep the usage of the tool as understandable as possible because
   the inherent complexity of data management and bioinformatics
   analysis does not need further confounding factors.

   With SyQADA, there is little distinction between a regular user and
   a power user. A power user of Unix will have an easier time dealing
   with problems than a novice, but a SyQADA novice may well learn
   most useful SyQADA commands during his first SyQADA
   workflow.

No unjustified generality::

   SyQADA development adheres to the software engineering principle
   YAGNI: if You Ain't Gonna Need It, you should not implement it. It
   is difficult enough to design and execute a good, useful, and
   reliable workflow without complicating it with features and syntax
   that obfuscate the simple; we prefer not to complicate our lives or
   yours unnecessarily.

   All of the workflows that we have developed are simple linear
   processes that require no decision-making based on results of
   analysis. We believe that this is the typical case with research
   workflows. The decision-making is an inherent part of the research,
   and too dependent on "researcher intuition" to be reproducible. The
   workflows need to conform to a repeatable protocol to produce
   sufficient breadth of data to permit one to draw conclusions.

   A common failing of workflow systems is to provide for arbitrary
   complexity simply because it can be done. This leads to systems
   that are unnecessarily hard to use and often confusing. A recent
   demonstration I saw included a graphic display of the workflow
   that completely obfuscated its simple linear nature, simply
   because the tool used to draw the display was a general-purpose
   tool designed to display more complex graphs.

   Similarly, depending on XML for specifications imposes
   unnecessary generality. XML is a wonderfully expressive language
   that requires special expertise to parse, is inherently difficult
   for a human to read, and would unnecessarily complicate the life of
   both the user and the developer. I assure you that neither one of
   us wants that.

Minimal magic::

   The tool should avoid as much as possible making the user dependent
   on it to accomplish the task. Using *syqada manage* makes it much
   easier to determine the state of a task, but it is quite possible
   to understand the task status using only the Unix `ls` command.
   Similarly, if a user wishes, she can use SyQADA to generate the
   jobs and then submit them manually.

   The tool should also avoid making the developer dependent on
   specialized knowledge other than basic programming. Thus, job
   management information is stored in the file system rather than in
   a database (even an embeddable one) so that neither user nor
   developer need consider the use of SQL or NoSQL. SQL, NoSQL, XML,
   and JSON have their places, but adding to the intellectual overhead
   required to use this tool is not one of them.

Apologia
--------

Although, or perhaps because, I have spent a substantial fraction of
my career developing dynamic web-based systems, I have not seen a
need for a point-and-click interface, because I believe that it is
too easy to encounter a situation that requires direct manipulation
of the file system. I think that a web interface that provided the
kind of system access necessary to deal with the problems that might
occur in an embarrassingly parallel clustered computation workflow
would be so general that you would have to provide all the
functionality of a Unix shell to make it usable. Better to avoid
re-inventing the wheel by simply using a secure shell interface.

I believe furthermore that such a web interface would be so flexible
and option-ridden that it would be on the one hand almost unusable,
and on the other inherently impossible to secure against command
injection attacks like that of the incomparable Little Bobby Tables
(http://xkcd.com/327):

.. image:: exploits_of_a_mom.png