hapLOH readme file

### OVERVIEW AND SETUP ###
## 1.1 Overview ##
The basic inputs to hapLOH are BAF values, haplotype estimates, and a set of command line arguments to specify 
desired options and parameters of the algorithm.  Any phasing software may be used to generate the haplotype estimates.
See section 2.1 for input options, and section 3.2.1 for guidelines on choosing parameter values.  See file Usage.txt in 
this directory or run haploh --help for a full list of command line options.

## 1.2 License ##
Please see LICENSE.txt in this directory.

## 1.3 Requirements ##
hapLOH is a binary executable that depends on Perl and Python interpreters.
Releases are available for macOS and Linux. hapLOH requires:
  -   Python 3 (tested under 3.5.2; other 3.x versions are expected to work)
  -   Perl 5.8

## 1.4 Installation ##
The software is distributed as a compressed tarball. To install hapLOH with version number VERSION, run
  tar zxf hapLOH-VERSION.tgz
in a directory where you wish to install hapLOH. This will create a directory named hapLOH-VERSION. The executable 
is hapLOH-VERSION/bin/haploh .

A Python 3 interpreter must be installed and available on the environment PATH
for hapLOH to run. If you do not have Python 3, we recommend the
Anaconda Python distribution for its simplicity of installation.

### RUNNING HAPLOH ### 
## 2.1 Input options ##
There are three options for providing input files to hapLOH:

Run one sample using default input file formats:
   haploh --baf BAF --phased PHASED --mean_informative_marker_count N [--hapid --switchprobs SWITCHPROBS]
Input file formats are described in section 2.2.

Run multiple samples using default input file formats:
   haploh --mean_informative_marker_count N --datadir DATADIR --samples SAMPLE SAMPLE [...] [--hapid]
This assumes that DATADIR contains, for each SAMPLE, the files sample-SAMPLE.hapguess and sample-SAMPLE.bafs (plus 
sample-SAMPLE.switchprobs if --hapid is given), in the input file formats described in section 2.2.
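
Before launching a multi-sample run, it can help to confirm that DATADIR holds every expected file. The following is a 
minimal Python sketch and is not part of hapLOH; the function name missing_inputs is hypothetical, and the file names 
follow the sample-SAMPLE.* pattern described above.

```python
import os


def missing_inputs(datadir, samples, hapid=False):
    """Return the expected per-sample input files that are absent from datadir."""
    suffixes = ["hapguess", "bafs"]
    if hapid:
        # --hapid additionally needs a switch-probability file per sample.
        suffixes.append("switchprobs")
    missing = []
    for sample in samples:
        for suffix in suffixes:
            path = os.path.join(datadir, "sample-{}.{}".format(sample, suffix))
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

An empty list means every file the run would look for is present.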

Run one or multiple samples using fastPHASE output and table-format BAF files:
   haploh --fastphase FASTPHASE [--chromosomes CHROMOSOMES] --mean_informative_marker_count N [--hapid]
This assumes that FASTPHASE is the prefix of a set of two or three files:
     a _hapguess_switch.txt file with sample ID lines (haplotype estimate output from fastPHASE)
     a tab-delimited .bafs file, where each column holds the position-ordered BAFs for one SAMPLE, with 
        header 'SAMPLE.B allele freq' for each SAMPLE
     a _switchprobs.txt file (output from fastPHASE with flag '-w'), if the --hapid option is specified
When this completes, in addition to hapLOH output, DESTDIR (the output directory; see '--destdir' below) contains an 
'intermediates' directory containing sample-SAMPLE.hapguess, sample-SAMPLE.bafs, and sample-SAMPLE.switchprobs files 
for each SAMPLE found in both the haplotype and BAF files (see section 3.2.2).

We recommend the use of the '--destdir' option for organizing the output files.

The EXAMPLES directory contains a small script, example_commands.sh, that can be run to demonstrate hapLOH on 
sample data.

For the full set of options, run
   haploh --help
or see Usage.txt.

## 2.2 Input file formats ##
Basic input consists of a pair of files containing the BAF values and the statistically-estimated haplotypes, using the 
formats described below.  Each file should contain data for one individual only.  Markers should be ordered by genomic 
position, and paired files should contain data for the same markers.

  -   The BAF file is a single-line, space-delimited file or a single-column file containing the BAFs.  Missing 
      values may be denoted by '?'; numerical values outside of the range [0,1] will also be considered missing.  
  -   The statistical haplotype file is a two-line, space-delimited file with each line representing a germline haplotype using 
      A/B allele labels. Missing values should be denoted by '?'.
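
The missing-value rules for the BAF file can be illustrated with a short Python sketch. This is not hapLOH's own parser; 
the function name parse_bafs is hypothetical, and the sketch assumes the one-value-per-marker layout described above 
(single line or single column), with no extra columns.

```python
def parse_bafs(text):
    """Parse BAF text into a list of floats, using None for missing values.

    Works for both layouts described above because str.split() breaks on
    spaces and newlines alike.
    """
    bafs = []
    for token in text.split():
        if token == "?":
            bafs.append(None)  # explicit missing-value marker
            continue
        value = float(token)  # raises ValueError on any other non-numeric token
        # Numerical values outside [0, 1] are also treated as missing.
        bafs.append(value if 0.0 <= value <= 1.0 else None)
    return bafs
```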

To estimate the over-represented haplotypes, in addition to the hapLOH inputs above, you will need to supply a file 
describing the switch rates for the phase estimates.

  -   The switch file contains either a single value representing the average switch accuracy of the statistical 
      haplotype estimates, or one line of space-delimited values representing interval-specific switch probabilities. 
      Here an interval is the interval between consecutive informative markers or the interval leading to the first 
      informative marker (for unordered haplotypes, the accuracy at the leading interval should be 0.5).  The number of 
      switch probabilities should equal the number of informative markers.
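
As an illustration of the interval-specific layout, the Python sketch below assembles one space-delimited line from the 
values for the intervals between consecutive informative markers, prepending the leading-interval value. It is not part 
of hapLOH; the function name switchprobs_line is hypothetical, and the default of 0.5 for the leading interval follows 
the note above for unordered haplotypes.

```python
def switchprobs_line(between_marker_probs, leading_prob=0.5):
    """Build the single line of interval-specific switch probabilities.

    between_marker_probs holds one value per interval between consecutive
    informative markers, so the result has one value per informative marker.
    """
    probs = [leading_prob] + list(between_marker_probs)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("switch probability out of range: {}".format(p))
    return " ".join(str(p) for p in probs)
```

With three informative markers there are two between-marker intervals, so the line carries three values in total.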

### USAGE NOTES ###
There are three basic applications for hapLOH: detection in a region of interest, localization, and estimation of the 
over-represented haplotype.  

## 3.1 Detection ##
To perform detection (i.e., testing a specific region for deviation from the null phase concordance rate), you will need to 
select the values from the .switch_enumeration file that correspond to your region of interest.  First determine which 
markers in your dataset are located in the region of interest.  Then use the .informative file to determine which of those 
markers are informative (note that indices are 0-based).  Since the values in the .switch_enumeration file correspond to 
every consecutive pair of informative markers, there is one fewer value than there are informative markers.  Drop the last 
informative marker in the region of interest and select the values corresponding to the remaining informative markers; 
the average of these values is the observed phase concordance rate for the region.
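
The selection steps above can be sketched in Python as follows. This is an illustration rather than hapLOH code: the 
function name region_concordance is hypothetical, the region is given as an inclusive range of 0-based marker indices, 
and the sketch assumes that the k-th value in .switch_enumeration describes the pair formed by the k-th and (k+1)-th 
informative markers. Reading the two files into lists is left out because their exact layout is not described here.

```python
def region_concordance(informative_idx, switch_values, region_start, region_end):
    """Observed phase concordance rate for markers region_start..region_end.

    informative_idx: 0-based marker indices from the .informative file.
    switch_values:   0/1 indicators from the .switch_enumeration file.
    Returns None when the region holds fewer than two informative markers.
    """
    if len(switch_values) != len(informative_idx) - 1:
        raise ValueError("expected one fewer switch value than informative markers")
    # Positions, within the informative list, of the markers inside the region.
    in_region = [k for k, idx in enumerate(informative_idx)
                 if region_start <= idx <= region_end]
    # Drop the last informative marker; each remaining one contributes the
    # value for the pair it forms with the next informative marker.
    selected = [switch_values[k] for k in in_region[:-1]]
    if not selected:
        return None
    return sum(selected) / len(selected)
```

For example, with informative markers at indices 0, 3, 4, 7, 9, 12 and a region covering markers 3 through 9, the 
informative markers in the region are 3, 4, 7, and 9, and three values are averaged.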

The localization HMM is run by default, but can be turned off with '--no-localization' if you only need detection. 

## 3.2 Localization ##
The localization HMM is run by default.  

# 3.2.1 Choosing parameters for the localization HMM #
Two options are available for fitting the HMM -- the 'fixed' mode, in which the transition probabilities are supplied by the 
user and the emission probabilities are estimated via EM, and the 'estimate' mode, in which both transition and emission 
probabilities are estimated.  

For either mode, the user should specify (1) the expected imbalance event size in terms of number of markers 
('--mean_informative_marker_count') and (2) the genome-wide prevalence of events ('--event_prevalence').  These 
values will be used to determine the transition probabilities.  It may be useful to note that since the transition probabilities 
are constant across sites, the resulting event size distribution is geometric with parameter (1/expected event size).  

To translate a genomic size into number of informative markers, consider marker density and the rate of marker 
heterozygosity in the sample.  For example, a Caucasian sample will be expected to have about 33% heterozygosity at 
the Illumina 370K array markers (~100 markers/Mb), corresponding to about 33 informative markers per Mb.  The 
prevalence should reflect the average proportion of markers in each sample expected to be affected by imbalance.
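
The arithmetic above can be sketched in Python. The function names are hypothetical, and the second function assumes 
the geometric distribution is defined on event sizes 1, 2, 3, ... so that its mean equals the expected event size.

```python
def informative_marker_count(size_mb, markers_per_mb, het_rate):
    """Expected number of informative markers in a region of size_mb megabases."""
    return size_mb * markers_per_mb * het_rate


def prob_event_at_least(n_markers, mean_count):
    """P(event spans at least n_markers) under a geometric size distribution.

    Assumes parameter p = 1 / mean_count and support starting at 1, so that
    P(size >= n) = (1 - p) ** (n - 1).
    """
    p = 1.0 / mean_count
    return (1.0 - p) ** (n_markers - 1)


# Figures from the text: ~100 markers/Mb and ~33% heterozygosity.
per_mb = informative_marker_count(1, 100, 0.33)   # about 33 informative markers
five_mb = informative_marker_count(5, 100, 0.33)  # about 165 informative markers
```

Under these figures, a 5 Mb expected event size corresponds to roughly 165 informative markers, which would be a 
candidate value for '--mean_informative_marker_count'.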

For the 'estimate' mode, the user may also supply a 'gamma' value, which may be interpreted as a weight on the 
expected event size.  The lower the weight, the more the observed data affect the transition parameters.  

Generally, the initial emission probabilities ('--initial_alphas' parameter) may be left at the default value without 
affecting the results.

As of hapLOH-1.4, the BAF threshold can be specified in four ways; the last three are new:

0. '--threshold median' is the default; it uses the median of the BAFs in
the --baf file, as in earlier versions.
1. '--threshold mean' uses the mean of the BAFs in the --baf file.
2. '--threshold <float>' uses the value <float> as the BAF threshold.
3. If the --baf file contains two columns, the second is assumed to be a
marker-by-marker threshold value.
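
To make the first three choices concrete, the Python sketch below computes a single threshold value from a list of 
BAFs. It is only an illustration: the function name baf_threshold is hypothetical, and skipping missing values (None) 
before taking the median or mean is an assumption, since the handling of missing values is not stated here.

```python
from statistics import mean, median


def baf_threshold(bafs, mode="median"):
    """Return a BAF threshold for mode 'median', 'mean', or a numeric value."""
    observed = [b for b in bafs if b is not None]  # assumption: ignore missing
    if mode == "median":
        return median(observed)
    if mode == "mean":
        return mean(observed)
    return float(mode)  # a user-supplied fixed threshold
```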

# 3.2.2 Output files #
Output will be placed in DESTDIR.
For invocation using the --baf option, the basename of the BAF input file, with the last dot and everything after it
removed, is used as the prefix for output (e.g., test.baf -> test.switch_enumeration).
For invocation using --samples, the given sample names are used as the prefixes for output.
For invocation using --fastphase, the sample names extracted from the phased input file are used as
the prefixes for output. (With --fastphase, a directory called DESTDIR/intermediates is created, and
sample-NAME.bafs, sample-NAME.hapguess, and sample-NAME.switchprobs are created therein for each sample found.)
NOTE: To simplify coding, a directory called DESTDIR/intermediates is also created for invocation using the --baf option,
and symbolic links named sample-NAME.bafs, sample-NAME.hapguess, and sample-NAME.switchprobs are created therein
for the input files.
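
The prefix rule for the --baf invocation can be mirrored in Python when locating output files from a script. This is a 
sketch of the rule as described above, not hapLOH's own code, and the function name output_prefix is hypothetical.

```python
import os


def output_prefix(baf_path):
    """Basename of the BAF file with the last dot and what follows removed."""
    base = os.path.basename(baf_path)
    stem, _extension = os.path.splitext(base)
    return stem


# Example: where the switch enumeration for test.baf would be expected,
# assuming the run used the output directory "DESTDIR".
expected = os.path.join("DESTDIR", output_prefix("/data/run1/test.baf") + ".switch_enumeration")
```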

*.summary                   gives the number of sites with missing phase or BAF values, and some
                            other summary information including invocation and version
*.informative               0-based indices of the sites that were used to determine phase
                            concordance; useful for mapping results back to genomic regions
*.baf_phased_haplotypes     header indicates number of individuals processed and total number
                            of markers; first and second haplotype lines give over- and
                            under-represented haplotypes as determined by BAF thresholding at
                            informative sites
*.switch_enumeration        phase concordance indicators at each pair of informative markers
                            (0=switch, 1=concordance)
*.postprobs                 output of localization HMM; conditional probability of imbalance
                            (either lower-level or higher-level state imbalance)
*.postprobs_state1          output of localization HMM; conditional probability of lower-level
                            imbalance
*.postprobs_state2          output of localization HMM; conditional probability of higher-level
                            imbalance
*.finalparams               the final parameter values of the localization HMM
*.EM_log                    parameter values of localization HMM at each EM iteration

## 3.3 Estimating the over-represented haplotype ##
Run using option '--hapid'.  

# 3.3.1 Inputs #
The algorithm requires output from the localization HMM, so requesting '--no-localization' with '--hapid' currently produces 
a warning and quits without running.

# 3.3.2 Output files #
*.excesshap_haps            first line indicates over-represented haplotype and second line
                            indicates under-represented haplotype as determined by HMM; note that
                            haplotypes are ordered genome-wide, but order is only meaningful when
                            imbalance exists

### MORE INFORMATION ###
## 4.1 Authors ##
Algorithm by Selina Vattathil and Paul Scheet.
Software by Selina Vattathil and Jerry Fowler.

## 4.2 Reference ##
Selina Vattathil and Paul Scheet (2012). "Haplotype-based profiling of subtle allelic imbalance with SNP arrays".
Genome Research 23:152-158.
doi:10.1101/gr.141374.112

## 4.3 Contact ##
svattathil@utexas.edu