WHAMM!: Whole-genome Homozygosity Analysis and Mapping Machina
Latest WHAMM release is v0.14 (03-Nov-2008)

1. Introduction

2. Basic information

3. Download and general notes

4. Command reference table

5. The Model, Usage, and Pipeline

6. Input Files and Formats

7. Output Files and Formats

8. Association Analysis

9. Data Summaries

10. Permutations

11. Integrated Haplotype Score (iHS) Selection Scan

12. Note on Imputed Data

XX. FAQ & Hints


Integrated Haplotype Score (iHS) Overview

The Integrated Haplotype Score (iHS) measures the amount of extended haplotype homozygosity (EHH) at a given SNP along the ancestral allele relative to the derived allele. The measure is typically standardized (mean 0, variance 1) empirically against the distribution of observed iHS scores over a range of SNPs with similar derived allele frequencies. Extended homozygosity for haplotypes carrying a high-frequency derived allele, relative to the ancestral background, is a signature of a positive selective sweep that has not yet reached fixation. A classic example is the lactase region (LCT), where lactase persistence into adulthood is presumed to have been subject to a selective sweep in European (and some African) populations that has not yet gone to fixation.

The measure was introduced by Voight et al. (2006) as a way to build a map of recent positive selection in the human genome.
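The definition above can be illustrated with a minimal Python sketch. It is not WHAMM's or the iHS calculator's implementation: the function names, the in-memory layout (lists of 0/1 alleles, one list per haplotype), the trapezoid integration, and the sign convention ln(iHH_ancestral / iHH_derived) are assumptions chosen to match the description here (EHH integrated until it decays below 0.05).

```python
import math
from collections import Counter


def ehh(haplotypes, core, other):
    """EHH between marker indices core and other (inclusive): the chance that
    two distinct haplotypes from the set are identical across the interval."""
    n = len(haplotypes)
    if n < 2:
        return 0.0
    lo, hi = min(core, other), max(core, other)
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    same_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return same_pairs / (n * (n - 1) // 2)


def ihh(haplotypes, positions, core, cutoff=0.05):
    """Integrate EHH over map distance on both sides of the core marker
    (trapezoid rule), stopping on each side once EHH drops below cutoff."""
    total = 0.0
    for step in (-1, 1):
        prev_ehh, prev_pos = 1.0, positions[core]
        j = core + step
        while 0 <= j < len(positions):
            cur = ehh(haplotypes, core, j)
            total += 0.5 * (prev_ehh + cur) * abs(positions[j] - prev_pos)
            if cur < cutoff:
                break
            prev_ehh, prev_pos = cur, positions[j]
            j += step
    return total


def unstandardized_ihs(haplotypes, positions, core):
    """ln(iHH on ancestral background / iHH on derived background).
    Alleles are coded 0 = ancestral, 1 = derived (assumed convention)."""
    anc = [h for h in haplotypes if h[core] == 0]
    der = [h for h in haplotypes if h[core] == 1]
    ihh_a = ihh(anc, positions, core)
    ihh_d = ihh(der, positions, core)
    if ihh_a <= 0 or ihh_d <= 0:
        return None  # undefined when either background has no signal
    return math.log(ihh_a / ihh_d)
```

Under this convention, a derived allele sitting on unusually long, uniform haplotypes yields a negative score. The sketch ignores the edge warnings and gap penalties that the real calculator reports.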

For additional details, resources, and information about the method as applied to HapMap Phase I and Phase II data, please see the Haplotter website, which makes it easy to browse those results. WHAMM provides machinery to perform a genome-wide selection scan on whole-genome association data; the tools and analytical pipeline for this analysis are described below. WHAMM requires the iHS calculator to be installed in the WHAMM working directory. Information on installing the package can be found here.

iHS Scan: Preparing Data Files

Calculation of the iHS requires formatting of phased data and marker map information into files where the markers are converted into ancestral/derived states by haplotype (.data) and into marker map formats (.info) compatible with the iHS calculator. To convert .phased data files and population genetic .gmap files into these files, use:

WHAMM.pl --sampleinfo mydata.sinfo --ploaderfile mydata.pload --iHS-prep myancstates.txt myrhomap.gmap

myancstates.txt is a file of ancestral states specified for a set of markers, and myrhomap.gmap is a binary map formatted file (.bim) in which the cM genetic map position is replaced with a population genetic recombination map position (rho = 4Nr).

This will output two files: a .data file in which each line represents a phased chromosome, with '0' denoting the ancestral state and '1' the derived state according to the myancstates.txt file, and a .info file which contains the map information. Markers whose ancestral state is not defined are automatically excluded. This step also assumes there is no missing data in the .phased data file. ***Including missing data in the phased data files will have indeterminate consequences that have not been fully tested.***

NOTE The --keep option is also compatible with the --iHS-prep command, allowing you to specify a subset of samples to include in the resultant .data file.

NOTE For speed, you can use the --chr N option, which will only output data files for the Nth chromosome.

NOTE The conversion step assumes that the data contained in the .phased files are in the same strand orientation as the ancestral states file. For ease, the ancestral states file provided on the Haplotter website assumes the +ve strand, and there is some limited error checking for inconsistencies between the .phased data and the ancestral states file (markers that appear to differ in strand are written to an error file).
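The recoding performed by the conversion step can be sketched in Python as follows. This is only an illustration of the logic described above, not WHAMM's code: the function name and the in-memory layout (one allele string per phased chromosome, a list of marker tuples, a dictionary of ancestral alleles) are hypothetical, and the actual .phased, .data, and .info file formats are not reproduced.

```python
def polarize(haplotypes, alleles, ancestral):
    """Recode phased haplotypes as '0' (ancestral) / '1' (derived).

    haplotypes: list of strings, one per phased chromosome, one allele
                character per marker (hypothetical layout)
    alleles:    list of (marker_id, allele_a, allele_b), in marker order
    ancestral:  dict mapping marker_id -> ancestral allele
    Returns (kept_marker_ids, recoded_rows, mismatched_marker_ids).
    """
    keep = []        # (column index, marker id, ancestral allele)
    mismatched = []  # ancestral allele matches neither observed allele
    for col, (marker, a, b) in enumerate(alleles):
        anc = ancestral.get(marker)
        if anc is None:
            continue  # ancestral state undefined: marker is excluded
        if anc in (a, b):
            keep.append((col, marker, anc))
        else:
            mismatched.append(marker)  # possible strand difference
    rows = [
        "".join("0" if hap[col] == anc else "1" for col, _, anc in keep)
        for hap in haplotypes
    ]
    return [m for _, m, _ in keep], rows, mismatched
```

Markers reported as mismatched in this sketch are the kind of case that strand flipping is meant to resolve; the sketch assumes complete data, as the conversion step itself does.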

Strand flipping phased data files

The ancestral states file provided on the Haplotter website is oriented to the +ve strand, though most genotyping platforms have selected SNP assays on both the +ve and -ve strands of the genome. WHAMM can 'flip' alleles and output data to accommodate this.

WHAMM.pl --ploaderfile mydata.pload --flip myfliplist.txt mymapfile.bim

This will output new .phased data and .bim map files in which, for each marker listed in myfliplist.txt, the alleles have been "flipped" (i.e., A->T, C->G, G->C, T->A).
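The flip operation itself is a per-marker base complement. A minimal Python sketch, assuming a hypothetical in-memory layout (one allele string per phased chromosome plus a parallel list of marker IDs) rather than WHAMM's actual file formats:

```python
# Complement table for the four bases: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def flip_markers(haplotypes, marker_ids, flip_list):
    """Complement the allele at every marker named in flip_list.

    haplotypes: list of strings, one allele character per marker
    marker_ids: marker IDs in the same column order as the haplotypes
    flip_list:  iterable of marker IDs to flip
    """
    flip = set(flip_list)
    cols = [i for i, m in enumerate(marker_ids) if m in flip]
    flipped = []
    for hap in haplotypes:
        chars = list(hap)
        for i in cols:
            chars[i] = chars[i].translate(COMPLEMENT)
        flipped.append("".join(chars))
    return flipped
```

Note that A/T and C/G SNPs look the same on both strands, so a flip list for such markers cannot be checked from the alleles alone.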

NOTE For speed, you can use the --chr N option, which will only output data files for the Nth chromosome.

iHS Scan: Calculating the Unstandardized iHS

Once the data files are prepared, WHAMM can calculate the unstandardized iHS on an entire chromosome's worth of data, or on a subset of markers, depending on your requirements. To perform the scan along the entire chromosome, use:

WHAMM.pl --iHS-scan ihs_infofile.info ihs_datafile.data

To focus only on a subset of markers, use:

WHAMM.pl --iHS-subscan ihs_infofile.info ihs_datafile.data rs123 rs567

which will perform the scan from marker rs123 to rs567, inclusive. Alternatively, you can specify by index:

WHAMM.pl --iHS-subscan ihs_infofile.info ihs_datafile.data 1 10

In all cases, these commands will calculate an unstandardized iHS for all markers in the .info file. The output file will contain several columns:

SNP: Marker identifier for the SNP
POS: physical position for the SNP
FREQ_1: derived allele frequency for the SNP
unstd_iHS: unstandardized iHS score
density: marker density in the interval scanned (where EHH < 0.05)
max_gap: size of the maximum gap (in units of rho) in the regions spanned
ngap: number of gaps in the region spanned
Warnings: Warning flags if the calculation runs out of markers before the EHH decays to below 0.05. The flags are:
  • Size_EHH_l, Size_EHH_r: No markers on the left or right side, respectively.
  • Edge_EHH0_l, Edge_EHH0_r: Ran out of markers before the EHH decayed below 0.05 along the ancestral haplotype background, on the left and right side, respectively.
  • Edge_EHH1_l, Edge_EHH1_r: Ran out of markers before the EHH decayed below 0.05 along the derived haplotype background, on the left and right side, respectively.

    NOTE For robustness, the iHS is reported as "NA" for any marker which has a warning flag.

    NOTE In terms of gap penalties, the current program will record gaps >100kb and will penalize the iHS if a gap exceeds 300kb.

    NOTE For parallel computing, the index version of --iHS-subscan is preferable. The range includes both endpoints, i.e., 1 10 will scan markers 1 through 10, inclusive.
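One way to prepare such a parallel run is to split the marker indexes into consecutive, non-overlapping chunks and emit one --iHS-subscan command per chunk. The Python helper below is a hypothetical convenience script, not part of WHAMM; it assumes 1-based marker indexes (as in the "1 10" example) and that the total number of markers in the .info file is already known. How the commands are dispatched, and how the per-chunk outputs are combined, is left to the user.

```python
def subscan_commands(n_markers, chunk_size, info_file, data_file):
    """Build one WHAMM.pl --iHS-subscan command line per index chunk.

    Chunks use 1-based, inclusive start/end indexes and cover
    markers 1..n_markers exactly once.
    """
    commands = []
    for start in range(1, n_markers + 1, chunk_size):
        end = min(start + chunk_size - 1, n_markers)
        commands.append(
            f"WHAMM.pl --iHS-subscan {info_file} {data_file} {start} {end}"
        )
    return commands


if __name__ == "__main__":
    # Example: 25 markers in chunks of 10 (file names are placeholders).
    for cmd in subscan_commands(25, 10, "ihs_infofile.info", "ihs_datafile.data"):
        print(cmd)
```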

    iHS Scan: Standardizing the iHS

    The average sign and magnitude of the unstandardized iHS are known to depend on the derived allele frequency of the SNP under interrogation. Intuitively, this makes sense: new (derived) rare SNPs tend to exist on one or relatively few haplotypes, and because they are young, little recombination will have occurred to place the rare allele on other haplotype backgrounds. In other words, the iHS for low-frequency derived alleles is expected to be well below zero under most demographic scenarios. To make iHS scores comparable across frequency classes, each unstandardized iHS score is normalized by the mean and standard deviation of the iHS scores observed across the genome at SNPs with similar derived allele frequencies. To standardize the iHS results given a pre-calculated standardization file, use:

    WHAMM.pl --iHS-std ihs_stdfile.txt ihs_resfile.ihs

    This will output a new file (*_std.ihs) which replaces the 'unstd_iHS' column with a normalized iHS, i.e. 'std_iHS'.

    The standardization file has n rows (no header). The first two columns define the range of derived allele frequencies for the bin, followed by the mean and standard deviation of the iHS scores for all SNPs within that allele frequency range, e.g.,

    0 0.025 -1.753 0.756
    0.025 0.05 -1.529 0.722
    0.05 0.075 -1.386 0.708
    0.95 0.975 1.282 0.733
    0.975 1 1.540 0.754
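The standardization described above amounts to a bin lookup followed by a z-score. A minimal Python sketch, assuming whitespace-separated columns as in the example rows; the function names are hypothetical, and treating each bin as inclusive of its lower bound (with the last bin also including 1) is an assumption, since the boundary handling is not documented here:

```python
def parse_std_table(text):
    """Parse rows of 'freq_low freq_high mean sd' into tuples of floats."""
    bins = []
    for line in text.strip().splitlines():
        low, high, mean, sd = (float(x) for x in line.split())
        bins.append((low, high, mean, sd))
    return bins


def standardize(unstd_ihs, freq, bins):
    """Return (unstd_ihs - bin mean) / bin sd for the bin containing freq,
    or None if no bin covers that derived allele frequency."""
    for low, high, mean, sd in bins:
        # Assumed convention: low <= freq < high, last bin also includes 1.
        if low <= freq < high or (freq == high == 1.0):
            return (unstd_ihs - mean) / sd
    return None


if __name__ == "__main__":
    table = """
    0 0.025 -1.753 0.756
    0.025 0.05 -1.529 0.722
    """
    bins = parse_std_table(table)
    # A score equal to its bin mean standardizes to zero.
    print(standardize(-1.753, 0.01, bins))
```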

    This document last modified Monday, 02-Apr-2012 17:35:45 EDT