hapLOH readme file

### OVERVIEW AND SETUP ###

## 1.1 Overview ##

The basic inputs to hapLOH are BAF (B allele frequency) values, haplotype
estimates, and a set of command line arguments that specify options and
parameters of the algorithm. Any phasing software may be used to generate the
haplotype estimates. See section 2.1 for input options and section 3.2.1 for
guidelines on choosing parameter values. For a full list of command line
options, see the file Usage.txt in this directory or run

    haploh --help

## 1.2 License ##

Please see LICENSE.txt in this directory.

## 1.3 Requirements ##

hapLOH is a binary executable that depends on Perl and Python interpreters.
Releases are available for MacOS and Linux. hapLOH requires

  - Python 3+ (tested under 3.5.2; other Python 3 releases will probably work)
  - Perl 5.8

## 1.4 Installation ##

The software is distributed as a compressed tarball. To install hapLOH with
version number VERSION, run

    tar zxf hapLOH-VERSION.tgz

in the directory where you wish to install hapLOH. This creates a directory
named hapLOH-VERSION. The executable is hapLOH-VERSION/bin/haploh.

A Python 3 interpreter must be installed and on the environment PATH for
hapLOH to run. If you do not have Python 3, we recommend the Anaconda Python
distribution for its simplicity of installation.

### RUNNING HAPLOH ###

## 2.1 Input options ##

There are three options for providing input files to hapLOH:

Run one sample using default input file formats:

    haploh --baf BAF --phased PHASED --mean_informative_marker_count N [--hapid --switchprobs SWITCHPROBS]

Input file formats are described in section 2.2.

Run multiple samples using default input file formats:

    haploh --mean_informative_marker_count N --datadir DATADIR --samples SAMPLE SAMPLE [...] [--hapid]

This assumes that, for each SAMPLE, DATADIR contains the files
sample-SAMPLE.hapguess and sample-SAMPLE.bafs (and sample-SAMPLE.switchprobs
if --hapid is given), in the input file formats described in section 2.2.

Run one or multiple samples using fastPHASE output and table-format BAF files:

    haploh --fastphase FASTPHASE [--chromosomes CHROMOSOMES] --mean_informative_marker_count N [--hapid]

This assumes that FASTPHASE is the prefix of a set of two or three files:

  - a _hapguess_switch.txt file with sample ID lines (haplotype estimate
    output from fastPHASE)
  - a tab-delimited .bafs file in which each column holds the position-ordered
    BAFs for one SAMPLE, with the header 'SAMPLE.B allele freq' for each
    SAMPLE
  - a _switchprobs.txt file (output from fastPHASE with flag '-w'), if the
    --hapid option is specified

When this completes, in addition to hapLOH output, DESTDIR contains an
'intermediates' directory containing SAMPLE.hapguess, SAMPLE.bafs, and
SAMPLE.switchprobs files for each SAMPLE found in both the haplotype and BAF
files.

We recommend the use of the '--destdir' option for organizing the output
files.

The EXAMPLES directory contains a small script, example_commands.sh, that can
be run to demonstrate hapLOH on sample data.

For the full set of options, run haploh --help or see Usage.txt.

## 2.2 Input file formats ##

Basic input consists of a pair of files containing the BAF values and the
statistically estimated haplotypes, using the formats described below. Each
file should contain data for one individual only. Markers should be ordered by
genomic position, and paired files should contain data for the same markers.

- The BAF file is a single-line, space-delimited file or a single-column file
  containing the BAFs. Missing values may be denoted by '?'; numerical values
  outside the range [0,1] are also treated as missing.
- The statistical haplotype file is a two-line, space-delimited file in which
  each line represents a germline haplotype using A/B allele labels. Missing
  values should be denoted by '?'.

To estimate the over-represented haplotypes, in addition to the hapLOH inputs
above, you will need to supply a file describing the switch rates for the
phase estimates.
- The switch-rate file contains either one value representing the average
  switch accuracy of the statistical haplotype estimates, or one line of
  space-delimited values representing interval-specific switch probabilities.
  Here an interval is the interval between consecutive informative markers,
  or the interval leading to the first informative marker (for unordered
  haplotypes, the accuracy at the leading interval should be 0.5). The number
  of switch probabilities should equal the number of informative markers.

### USAGE NOTES ###

There are three basic applications for hapLOH: detection in a region of
interest, localization, and estimation of the over-represented haplotype.

## 3.1 Detection ##

To perform detection (i.e., testing a specific region for deviation from the
null phase concordance rate), you will need to select the values from the
.switch_enumeration file that correspond to your region of interest. First
determine which markers in your dataset are located in the region of
interest. Then use the .informative file to determine which of those markers
are informative (note that indices are 0-based). Since the values in the
.switch_enumeration file correspond to consecutive pairs of informative
markers, there is one fewer value than the number of informative markers.
Drop the last informative marker in the region of interest and select the
values corresponding to the remaining informative markers -- the average of
these is the observed phase concordance rate for the region.

The localization HMM is run by default, but can be turned off with
'--no-localization' if all you wish is detection.

## 3.2 Localization ##

The procedure is run by default.

# 3.2.1 Choosing parameters for the localization HMM #

Two options are available for fitting the HMM -- the 'fixed' mode, in which
the transition probabilities are supplied by the user and the emission
probabilities are estimated via EM, and the 'estimate' mode, in which both
transition and emission probabilities are estimated.
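
Detection (section 3.1) needs none of these HMM settings, only the
.informative and .switch_enumeration files. The steps described there can be
sketched in Python as follows. This is an illustration, not part of hapLOH:
the function names read_numbers and region_concordance are hypothetical, and
the sketch assumes that both files contain only whitespace-separated integers
with no header line, which you should check against your own output.

```python
from pathlib import Path


def read_numbers(path):
    """Read whitespace-separated integers from a text file.

    Assumption: the file holds only numbers, with no header line.
    """
    return [int(tok) for tok in Path(path).read_text().split()]


def region_concordance(informative, indicators, first_marker, last_marker):
    """Observed phase concordance rate for markers first_marker..last_marker.

    informative: 0-based marker indices from the .informative file, ascending.
    indicators: 0/1 values from the .switch_enumeration file; indicators[k]
        describes the pair (informative[k], informative[k + 1]).
    first_marker, last_marker: inclusive 0-based marker indices of the region.
    Returns None when the region holds fewer than two informative markers.
    """
    if len(indicators) != len(informative) - 1:
        raise ValueError("expected one fewer indicator than informative markers")
    # Ranks (positions within the informative list) of markers in the region.
    ranks = [k for k, marker in enumerate(informative)
             if first_marker <= marker <= last_marker]
    # Drop the last informative marker: its pair extends past the region.
    ranks = ranks[:-1]
    if not ranks:
        return None
    return sum(indicators[k] for k in ranks) / len(ranks)


# Toy data (made up for illustration): six informative markers, five pairs.
informative = [0, 2, 3, 5, 8, 9]
indicators = [1, 0, 1, 1, 0]
# Region covering markers 2..8: informative markers 2, 3, 5, 8; after dropping
# the last one, the pairs starting at 2, 3 and 5 are averaged (2 of 3 concordant).
print(region_concordance(informative, indicators, 2, 8))
```

With real output, the two lists would come from something like
read_numbers('test.informative') and read_numbers('test.switch_enumeration'),
where 'test' stands for whatever output prefix your run produced.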

For either mode, the user should specify (1) the expected imbalance event
size in terms of number of markers ('--mean_informative_marker_count') and
(2) the genome-wide prevalence of events ('--event_prevalence'). These values
are used to determine the transition probabilities. It may be useful to note
that, since the transition probabilities are constant across sites, the
resulting event size distribution is geometric with parameter
(1/expected event size).

To translate a genomic size into a number of informative markers, consider
the marker density and the rate of marker heterozygosity in the sample. For
example, a Caucasian sample is expected to have about 33% heterozygosity at
the Illumina 370K array markers (~100 markers/Mb), corresponding to about 33
informative markers per Mb. The prevalence should reflect the average
proportion of markers in each sample expected to be affected by imbalance.

For the 'estimate' mode, the user may also supply a 'gamma' value, which may
be interpreted as a weight on the expected event size. The lower the weight,
the more the observed data affect the transition parameters.

Generally, the initial emission probabilities ('--initial_alphas' parameter)
may be left at the default value without affecting the results.

In hapLOH-1.4, there are three new options to specify the BAF threshold:

  0. --threshold median is the default; it uses the median of the BAFs in the
     --baf file, as before.
  1. --threshold mean uses the mean of the BAFs in the --baf file.
  2. --threshold VALUE uses VALUE as the BAF threshold.
  3. If the --baf file contains two columns, the second is assumed to be a
     marker-by-marker threshold value.

# 3.2.2 Output files #

Output will be placed in DESTDIR.

For invocation using the --baf option, the basename of the BAF input file is
used as the prefix for output.
(That is, the last dot and what follows it are chopped off; e.g., test.baf
gives the prefix 'test', as in test.switch_enumeration.)

For invocation using --samples, the given sample names are used as the
prefixes for output.

For invocation using --fastphase, the sample names extracted from the phased
input file are used as the prefixes for output. (Using --fastphase, a
directory called DESTDIR/intermediates is created, and sample-NAME.bafs,
sample-NAME.hapguess, and sample-NAME.switchprobs are created therein for
each sample found.)

NOTE: To simplify coding, a directory called DESTDIR/intermediates is also
created for invocation using the --baf option, and symbolic links with the
names sample-NAME.bafs, sample-NAME.hapguess, and sample-NAME.switchprobs are
created therein for the input files.

*.summary
    the number of sites with missing phase or BAF values, and some other
    summary information including invocation and version
*.informative
    0-based indices of the sites that were used to determine phase
    concordance; useful for mapping results back to genomic regions
*.baf_phased_haplotypes
    header indicates number of individuals processed and total number of
    markers; first and second haplotype lines give over- and
    under-represented haplotypes as determined by BAF thresholding at
    informative sites
*.switch_enumeration
    phase concordance indicators at each pair of informative markers
    (0=switch, 1=concordance)
*.postprobs
    output of localization HMM; conditional probability of imbalance (either
    lower-level or higher-level state imbalance)
*.postprobs_state1
    output of localization HMM; conditional probability of lower-level
    imbalance
*.postprobs_state2
    output of localization HMM; conditional probability of higher-level
    imbalance
*.finalparams
    the final parameter values of the localization HMM
*.EM_log
    parameter values of localization HMM at each EM iteration

## 3.3 Estimating the over-represented haplotype ##

Run using option '--hapid'.
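
Because '--hapid' relies on the switch-rate file described in section 2.2, it
may be worth checking that file before a run. The Python sketch below is one
way to do so; it is not part of hapLOH. The function names are hypothetical,
and n_informative (the number of informative markers for the sample) is a
value you must determine yourself, for example from the .informative file of
an earlier run without '--hapid' (that workflow is an assumption).

```python
def parse_switch_line(text):
    """Split the contents of a switch-rate file into floats.

    Assumption: the file holds only whitespace-separated numbers.
    """
    return [float(tok) for tok in text.split()]


def check_switch_values(values, n_informative):
    """Return a list of problems; an empty list means nothing was flagged.

    Section 2.2 allows either a single average value or one value per
    informative marker, so any other count is reported.
    """
    problems = []
    if len(values) not in (1, n_informative):
        problems.append(
            "found %d values; expected 1 or %d" % (len(values), n_informative)
        )
    # Switch accuracies and probabilities must lie in the closed interval [0, 1].
    if any(not 0.0 <= v <= 1.0 for v in values):
        problems.append("all values must lie in [0, 1]")
    return problems


# Made-up example: four informative markers, leading interval set to 0.5
# as suggested for unordered haplotypes.
values = parse_switch_line("0.5 0.97 0.99 0.98")
print(check_switch_values(values, 4))
```

An empty list from check_switch_values only means the count and range look
consistent with section 2.2; it says nothing about whether the values
themselves are appropriate for your phasing.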

# 3.3.1 Inputs #

The algorithm requires output from the localization HMM, so requesting
'--no-localization' with '--hapid' currently produces a warning and quits
without running.

# 3.3.2 Output files #

*.excesshap_haps
    first line indicates over-represented haplotype and second line indicates
    under-represented haplotype as determined by HMM; note that haplotypes
    are ordered genomewide, but order is only meaningful when imbalance
    exists

### MORE INFORMATION ###

## 4.1 Authors ##

Algorithm by Selina Vattathil and Paul Scheet.
Software by Selina Vattathil and Jerry Fowler.

## 4.2 Reference ##

Selina Vattathil and Paul Scheet (2012). "Haplotype-based profiling of subtle
allelic imbalance with SNP arrays". Genome Research 23:152-158.
doi:10.1101/gr.141374.112

## 4.3 Contact ##

svattathil@utexas.edu