.. _motivation:

Motivation for SyQADA
=====================

SyQADA is a system of Python libraries and executables that make it
possible to manage analysis projects over large amounts of data. The
average scientist would rather spend time thinking about what observed
data mean than managing those data (see :ref:`research`).

The goal of SyQADA is to make it simple to:

Make processes reproducible::

    Running the identical process on many inputs of the same type
    Providing some measure of confidence upon completion that every step of every task was performed on every input
    Re-running the same process with altered parameters
    Duplicating an existing experimental method on new data

Unify process development::

    Running steps of the process either on the cluster or on the HAPS (or on a Mac)
    Adapting the workflow of one analysis to support a new analysis

Simplify problem-solving::

    Identifying data problems
    Identifying and re-running jobs that fail because of system problems
    Providing QC reports and visualization as a standard with any project

Support publication of analytical results::

    Documenting what computational steps were performed for a given project
    Documenting what versions of software and what parameters were used for a given project
    Ultimately, producing MAGE-TAB-like IDF and SDRF output to use in publication of the experimental methods

Simplify data management::

    Identifying the path of intermediate data through the workflow
    Standardizing the wrangling of intermediate filenames so common to process workflows
    Removing easily reproducible intermediate data

Assist in time management::

    Providing time estimates for individual steps and whole workflows
    Supporting the execution of an entire processing workflow with as little user intervention as possible


Philosophical Motivation
------------------------

These ambitious goals are tempered by the following desires:

Simple usage::

    Keep the usage of the tool as understandable as possible, because
    the inherent complexity of data management and bioinformatics
    analysis does not need further confounding factors. With SyQADA,
    there is little distinction between a regular user and a power
    user. A power user of Unix will have an easier time dealing with
    problems than a novice, but a SyQADA novice may well learn most
    useful SyQADA commands in the course of a first SyQADA workflow.

No unjustified generality::

    SyQADA development adheres to the software engineering principle
    YAGNI: if You Ain't Gonna Need It, you should not implement it. It
    is difficult enough to design and execute a good, useful, and
    reliable workflow without complicating it with features and syntax
    that obfuscate the simple; we prefer not to complicate our lives
    or yours unnecessarily.

    All of the workflows that we have developed are simple linear
    processes that require no decision-making based on the results of
    analysis. We believe that this is the typical case with research
    workflows. The decision-making is an inherent part of the
    research, and too dependent on "researcher intuition" to be
    reproducible. The workflows need to conform to a repeatable
    protocol to produce sufficient breadth of data to permit one to
    draw conclusions.

    A common failing of workflow systems is to provide for arbitrary
    complexity simply because it can be done. This leads to systems
    that are unnecessarily hard to use and often confusing. A recent
    demonstration I saw included a graphic display of the workflow
    that completely obfuscated its simple linear nature, simply
    because the tool used to draw the display was a general-purpose
    tool designed to display more complex graphs.

    Similarly, depending on XML for specifications imposes
    unnecessary generality.

    XML is a wonderfully expressive language that requires special
    expertise to parse, is inherently difficult for a human to read,
    and would unnecessarily complicate the life of both the user and
    the developer. I assure you that neither one of us wants that.

Minimal magic::

    The tool should avoid, as much as possible, making the user
    dependent on it to accomplish the task. Using *syqada manage*
    makes it much easier to determine the state of a task, but it is
    quite possible to understand the task status using only the Unix
    `ls` command. Similarly, if a user wishes, she can use SyQADA to
    generate the jobs and then submit them manually.

    The tool should also avoid making the developer dependent on
    specialized knowledge other than basic programming. Thus, job
    management information is stored in the file system rather than
    in a database (even an embeddable one), so that neither user nor
    developer need consider the use of SQL or NoSQL. SQL, NoSQL, XML,
    and JSON have their places, but adding to the intellectual
    overhead required to use this tool is not one of them.

Apologia
--------

Although, or perhaps because, I have spent a substantial fraction of
my career developing dynamic web-based systems, I have not seen a need
for a point-and-click interface, because I believe that it is too easy
to encounter a situation that requires direct manipulation of the file
system. A web interface that provided the kind of system access
necessary to deal with the problems that might occur in an
embarrassingly parallel clustered computation workflow would have to
be so general that it would need to provide all the functionality of a
Unix shell to be usable. Better to avoid re-inventing the wheel by
simply using a secure shell interface.
I believe furthermore that such a web interface would be so flexible
and option-ridden that it would be on the one hand almost unusable,
and on the other inherently impossible to secure against command
injection attacks like that of the incomparable Little Bobby Tables
(http://xkcd.com/327):

.. image:: exploits_of_a_mom.png