2. Basic information
3. Download and general notes
4. Command reference table
5. The Model, Usage, and Pipeline
Integrated Haplotype Score (iHS) Overview
The integrated haplotype score (iHS) is a measure of the amount of extended haplotype homozygosity (EHH) at a given SNP along the ancestral allele relative to the derived allele. This measure is typically standardized empirically (mean 0, variance 1) against the distribution of observed iHS scores over a range of SNPs with similar derived allele frequencies. Extended homozygosity for haplotypes carrying a high-frequency derived allele, relative to the ancestral background, is a signature of a positive selective sweep that has not yet reached fixation. A classic example is the lactase region (LCT), where lactase persistence into adulthood has presumably been subject to a selective sweep in European (and some African) populations that has not fixed completely. The measure was designed by Voight et al. (2006) to build a map of recent positive selection in the human genome. For additional details, resources, and information about the method as applied to HapMap Phase I and Phase II data, please see the Haplotter website, which makes those results easy to browse.
WHAMM provides some machinery to perform a genome-wide selection scan on whole-genome association data; the tools and analytical pipeline for the analysis are described below. In order to function, WHAMM requires the iHS calculator to be installed in the WHAMM working directory. Information on installing the package can be found here.
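To make the definition concrete, the following is a minimal Python sketch of the unstandardized score as defined by Voight et al. (2006), ln(iHH_A / iHH_D), where iHH is the EHH integrated over map distance until it decays below 0.05. It is not the code of the iHS calculator that WHAMM calls: the function names, the 0/1 in-memory haplotype layout, and the trapezoid integration are assumptions made for illustration, and gap handling and warning flags are left out.

```python
import math
from collections import Counter

def ehh(haps, core, site):
    """EHH: fraction of haplotype pairs identical from `core` to `site`."""
    n = len(haps)
    if n < 2:
        return 0.0
    lo, hi = min(core, site), max(core, site)
    counts = Counter(tuple(h[lo:hi + 1]) for h in haps)
    same_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return same_pairs / (n * (n - 1) // 2)

def ihh(haps, positions, core, cutoff=0.05):
    """Integrate EHH over map distance on both sides of the core SNP
    (trapezoid rule), stopping once EHH falls below `cutoff`."""
    if len(haps) < 2:
        return 0.0
    total = 0.0
    for step in (-1, 1):          # walk left, then right, from the core
        prev, i = 1.0, core + step
        while 0 <= i < len(positions):
            cur = ehh(haps, core, i)
            width = abs(positions[i] - positions[i - step])
            total += 0.5 * (prev + cur) * width
            if cur < cutoff:
                break
            prev, i = cur, i + step
    return total

def unstandardized_ihs(haps, positions, core):
    """ln(iHH_A / iHH_D); haplotypes are lists of 0 (ancestral) / 1 (derived)."""
    anc = [h for h in haps if h[core] == 0]
    der = [h for h in haps if h[core] == 1]
    ihh_a = ihh(anc, positions, core)
    ihh_d = ihh(der, positions, core)
    if ihh_a <= 0 or ihh_d <= 0:
        return None               # score undefined for this SNP
    return math.log(ihh_a / ihh_d)
```

With this sign convention, a derived allele that sits on unusually long haplotypes gives a negative score.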
iHS Scan: Preparing Data Files
Calculating the iHS requires phased data and marker map information to be formatted into two files compatible with the iHS calculator: a .data file, in which the markers are converted into ancestral/derived states by haplotype, and an .info file, which holds the marker map. To convert .phased data files and population genetic .gmap files into these files, use:
WHAMM.pl --sampleinfo mydata.sinfo --ploaderfile mydata.pload --iHS-prep myancstates.txt myrhomap.gmap
Here, myancstates.txt is a file of ancestral states specified for a set of markers, and myrhomap.gmap is a binary map formatted file (.bim) in which the cM genetic map position is replaced with a population genetic recombination map position (rho = 4Nr). This will output two files: a .data file in which each line represents a phased chromosome, with '0' denoting the ancestral state and '1' the derived state according to myancstates.txt, and an .info file that contains the map information. Markers for which the ancestral state is not defined are automatically excluded. This step also assumes that the .phased data file contains no missing data.
***Including missing data in the phased data files will have indeterminate consequences that have not been fully tested.***
NOTE The --keep option is also compatible with the --iHS-prep command, allowing you to specify a subset of samples to include in the resultant .data file.
NOTE For speed, you can use the --chr N option, which will only output data files for the Nth chromosome.
NOTE The conversion step assumes that the data contained in the .phased files are in the same strand orientation as the ancestral states file. For ease, the ancestral states file provided on the Haplotter website assumes the +ve strand, and there is some limited error checking for inconsistencies between the .phased data and the ancestral states file (markers that appear to differ in strand are written to an error file).
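The ancestral/derived recoding described above can be sketched in a few lines of Python. This is an illustration only, not WHAMM's implementation: the function name and the in-memory layout (one list of allele letters per phased chromosome, plus a dictionary of ancestral alleles) are assumptions, and the strand-consistency check mentioned in the last NOTE is omitted.

```python
def encode_haplotypes(haplotypes, marker_ids, ancestral):
    """Recode alleles as 0 (ancestral) or 1 (derived).

    haplotypes: one list of allele letters per phased chromosome
    marker_ids: marker identifiers, in column order
    ancestral:  dict mapping marker id -> ancestral allele letter
    Markers without a defined ancestral state are dropped.
    """
    keep = [i for i, m in enumerate(marker_ids) if m in ancestral]
    kept_ids = [marker_ids[i] for i in keep]
    rows = [
        [0 if hap[i] == ancestral[marker_ids[i]] else 1 for i in keep]
        for hap in haplotypes
    ]
    return kept_ids, rows
```

For example, with markers rs1, rs2, rs3 and ancestral states known only for rs1 and rs3, the rs2 column disappears from the output and every remaining allele becomes 0 or 1.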
Strand flipping phased data files
The ancestral states file provided on the Haplotter website is oriented to the +ve strand, though most genotyping platforms have selected SNP assays on both the +ve and -ve strands of the genome. WHAMM can 'flip' alleles and output data to accommodate this.
WHAMM.pl --ploaderfile mydata.pload --flip myfliplist.txt mymapfile.bim
This will output new .phased data and .bim map files in which, for each marker listed in myfliplist.txt, the alleles have been "flipped" (i.e., A->T, C->G, G->C, T->A).
NOTE For speed, you can use the --chr N option, which will only output data files for the Nth chromosome.
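As a rough illustration of what flipping means, the sketch below complements the alleles of the listed markers in an in-memory haplotype table. The function name and data layout are hypothetical, and the sketch does not read or write WHAMM's .phased or .bim files.

```python
# Complement map for strand flipping: A<->T, C<->G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def flip_alleles(haplotypes, marker_ids, flip_list):
    """Return a copy of `haplotypes` with alleles complemented at every
    marker whose id appears in `flip_list`; other markers are unchanged."""
    flip_cols = {i for i, m in enumerate(marker_ids) if m in flip_list}
    return [
        [COMPLEMENT[a] if i in flip_cols else a for i, a in enumerate(hap)]
        for hap in haplotypes
    ]
```

Flipping the same marker list twice returns the original alleles, which makes for a quick sanity check of a flip list.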
iHS Scan: Calculating the Unstandardized iHS
Assuming you have data files prepared, WHAMM allows you to calculate the unstandardized iHS on an entire chromosome's worth of data, or on a subset of markers, depending on your requirements. To perform the scan along the entire chromosome, use:
WHAMM.pl --iHS-scan ihs_infofile.info ihs_datafile.data
To focus only on a subset of markers, use:
WHAMM.pl --iHS-subscan ihs_infofile.info ihs_datafile.data rs123 rs567
which will perform the scan from marker rs123 to rs567, inclusive. Alternatively, you can specify the markers by index:
WHAMM.pl --iHS-subscan ihs_infofile.info ihs_datafile.data 1 10
In all cases, these commands will calculate an unstandardized iHS for all markers in the .info file. The output file will contain several columns:
SNP: marker identifier for the SNP
POS: physical position for the SNP
FREQ_1: derived allele frequency for the SNP
unstd_iHS: unstandardized iHS score
density: marker density in the interval scanned (where EHH < 0.05)
max_gap: size of the maximum gap (in units of rho) in the regions spanned
ngap: number of gaps in the region spanned
Warnings: Warning flags if the calculation runs out of markers before the EHH decays to below 0.05. The flags are:
iHS Scan: Standardizing the iHS
The average sign and magnitude of the unstandardized iHS are known to depend on the derived allele frequency of the SNP under interrogation. Intuitively, this makes sense: new (derived) rare SNPs tend to exist on one or relatively few haplotypes, and because they are young, little recombination will have occurred to place the rare allele on other haplotype backgrounds. In other words, the iHS for low-frequency derived alleles is expected to be well below zero under most demographic scenarios. To make iHS scores comparable across different frequency classes, each unstandardized iHS score is normalized by the mean and standard deviation of the iHS scores observed across the genome at SNPs with similar derived allele frequencies. To standardize the iHS results given a pre-calculated standardization file, use:
WHAMM.pl --iHS-std ihs_stdfile.txt ihs_resfile.ihs
This will output a new file (*_std.ihs) which replaces the 'unstd_iHS' column with a normalized iHS, i.e. 'std_iHS'. The standardization file has n rows (no header). In each row, the first two columns define the range of derived allele frequencies for the bin, followed by the mean and standard deviation of the iHS score for all SNPs within that allele frequency span, e.g.,
0 0.025 -1.753 0.756
0.025 0.05 -1.529 0.722
0.05 0.075 -1.386 0.708
0.95 0.975 1.282 0.733
0.975 1 1.540 0.754
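The normalization itself is a per-bin z-score. A minimal Python sketch, assuming the four-column layout shown above, is given below. The function names are hypothetical, and the treatment of bin edges (half-open bins, with the upper edge of the last bin included) is an assumption, since the source does not say how WHAMM handles a frequency that falls exactly on a boundary.

```python
def parse_std_table(text):
    """Parse headerless rows of 'low high mean sd' into a list of tuples."""
    bins = []
    for line in text.strip().splitlines():
        low, high, mean, sd = (float(x) for x in line.split())
        bins.append((low, high, mean, sd))
    return bins

def standardize(unstd_ihs, freq, bins):
    """Return (unstd_ihs - mean) / sd for the bin containing `freq`,
    or None if no bin covers that derived allele frequency."""
    last_high = bins[-1][1]
    for low, high, mean, sd in bins:
        # Assumed convention: low <= freq < high, except that the upper
        # edge of the final bin is also included.
        if low <= freq < high or (high == last_high and freq == high):
            return (unstd_ihs - mean) / sd
    return None
```

A SNP whose unstandardized score equals the mean of its frequency bin standardizes to 0, and one that lies a full standard deviation above that mean standardizes to 1.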