WHAMM!: Whole-genome Homozygosity Analysis and Mapping Machina WHAMM...!
Latest WHAMM release is v0.14 (03-Nov-2008)

Whole-Genome Homozygosity Analysis and Mapping Machina

Introduction | Basics | Download | Commands | Usage | Input | Output | Association | Summaries | Permutations | Imputations | FAQ

1. Introduction

2. Basic information

3. Download and general notes

4. Command reference table

5. The Model, Usage, and Pipeline 6. Input Files and Formats 7. Output Files and Formats 8. Association Analysis 9. Data Summaries 10. Permutations 11. Integrated Haplotype Score (iHS) Selection Scan 12. Note on Imputated Data

XX. FAQ & Hints

 

My Header

Insert the model Description Here.

WHAMM Prerequisites -- PHASED genotype data

There are some prerequisites to note before you begin:

  • Make sure you have a cleaned data set (QC, relatedness removed, and otherwise)
  • Run an appropriate phasing method (i.e., BEAGLE).
  • Construct an initial sample info [.sinfo] file from the data. The file has the format:
    # fid FID1 FID1 FID2 FID2 FID3 FID3 ... FIDn FIDn
    # iid IID1 IID1 IID2 IID2 IID3 IID3 ... IIDn IIDn

    where (FIDn, IIDn) form the familyID and individualID identifiers for the given sample n. Note that if you were constructing a header file for the BEAGLE output phased files, this would be that header. As far as I can tell, characters of the variety [A-Za-z0-9_] are tolerated (more could be; I haven't tested everything).

  • Construct a map file which includes cM position for each marker in the data [mycM_mapfile.bim]. I have provided a set of maps for download for several available whole-genome platforms, however they contain ALL SNPs on the array -- users will want to construct a new map file (using this cM information) which contains only their marker list.

  • Construct a file which specifies the locations of your phased data files and the chromosomes to which they refer [.pload]. Note that X, Y, and MT chromosomes will be excluded - and I haven't tested the functionality if you tried to include them. This file has the following form:
    1 /mydir/maybehere/hereiam/beaglephaseout.3.chr1.recode.phased
    2 /mydir/maybehere/hereiam/beaglephaseout.3.chr2.recode.phased
    ...
    22 /mydir/maybehere/hereiam/beaglephaseout.3.chr22.recode.phased

Armed with these files, you should be set to run the machinery.

WHAMM Prerequisites -- UNPHASED genotype data

There are some prerequisites to note before you begin:

  • Make sure you have a cleaned data set (QC, relatedness removed, and otherwise)
  • Construct an initial sample info [.sinfo] file from the data. The file has the format:
    # fid FID1 FID1 FID2 FID2 FID3 FID3 ... FIDn FIDn
    # iid IID1 IID1 IID2 IID2 IID3 IID3 ... IIDn IIDn

    where (FIDn, IIDn) form the familyID and individualID identifiers for the given sample n. Note that if you were constructing a header file for the BEAGLE output phased files, this would be that header. As far as I can tell, characters of the variety [A-Za-z0-9_] are tolerated (more could be; I haven't tested everything).

  • Construct a map file which includes cM position for each marker in the data [mycM_mapfile.bim]. I have provided a set of maps for download for several available whole-genome platforms, however they contain ALL SNPs on the array -- users will want to construct a new map file (using this cM information) which contains only their marker list.

  • Construct a file which specifies the locations of your genotype data files and the chromosomes to which they refer [.pload]. Note that X, Y, and MT chromosomes will be excluded - and I haven't tested the functionality if you tried to include them. The genotype data must be in TRANSPOSED PED file format -- this format can be achieved in PLINK using the --transpose option (see the PLINK web help pages for details). This file has the following form:
    1 /mydir/maybehere/hereiam/mydata.chr1.tped
    2 /mydir/maybehere/hereiam/mydata.chr2.tped
    ...
    22 /mydir/maybehere/hereiam/mydata.chr22.tped

rmed with these files, you should be set to run the machinery.

WHAMM Analysis Pipeline

Regardless if you are using phased haplotypes or unphased genotypes, the analysis pipeline is the same:

a. Create a segment map file [.smap] and some sort of genetic information file, based on haplotype [.hinfo] or genotype [.ginfo] data. [step 1]
b. Using these files, create probabilities of homozygosity for each segment for each individual. [step 2]
c. Using an individual's probabilities, use a forward/backward algorithm to estimate homozygosity parameters, genome-wide 'painting' maps of homozygosity, as well as probability of homozygosity in each segment given the maximum-likelihood estimated parameters. [step 3]
d. Perform association testing for each segment between probability of homozygosity and phenotype [step 4]

The default option in WHAMM is to assume the data has been phased. If you are using genotype data, you can add the argument

--gt

where approriate to override the default behavior. In the pipeline below, I will denote where you should do this with the optional argument specified as:
[--gt]

and
--infofile [whamm.hinfo | whamm.ginfo]

I'm going to assume that you know which pipeline you want to use, and as a result, which files to provide the program (i.e., whamm.ginfo is provided to the --infofile argument with the --gt argument specified).

1. The first step is making the info files [.smap and .hinfo or .ginfo]. To do that, call:
WHAMM.pl [--gt] --samplefile whamm.sinfo --ploaderfile whamm.pload
--make-infofiles hotspots.list mycM_mapfile.bim --out whamm

note that the "hotspots.list" file is the hotspot map available from the hapmap project, and "mycM_mapfile.bim" is a plink binary map file which contains cM positions for each marker included in the data. I have made a copies of the hotspot file for use with WHAMM available download. Note that you will need to construct a more accurate map file with the markers that passed your QC -- the mapfile I have provided include all makers on the specified platform.

At this stage, you will have whamm.smap and whamm.hinfo files (or whamm.ginfo, in the case of genotyped data), ready for the next step. Note: This step takes perhaps ~20 minutes to run on data set of 4000 samples.

2. Now, you need to calculate haplotype probabilities from the data given the haplotype information and segment maps you've created. WHAMM currently will create a new set of files for each individual in the data; there tend to be many samples so I suggest making a new directory (mydir) to hold this information and including this path in the output (--out option).
WHAMM.pl [--gt] --samplefile whamm.sinfo --ploaderfile whamm.pload
--segfile whamm.smap --infofile [whamm.hinfo | whamm.ginfo] --calc-probs --out mydir/whamm_probs

After this run, for every sample, you will have created a homozygosity probability file [.hzprobs] for every individual in the data. Note that in this step, you could take advantage of parallel computing by using the --keep option, which will calculate probabilities for a subset of the data. If I have parallel computing at my disposal, what I typically do is create a master file which lists all the individuals in my data (column 1 is FID, column 2 is IID). Then, use the split command in UNIX to create a number of sub-lists, each of which is a job in my parallel process. These sub-lists are also useful for making concatenated end results output files (see below). This information is used for the next step, which involves applying the forward/backward algorithm and estimating (f,a) parameters for the data.

3. Next, we need to estimate parameters from the data [.params], calculate the forward/backward path for the data [.fbpath], as well as report the hzdosage given the parameters for each segment [.hzdose]. This portion of the routine has been optimized to apply to an entire samples worth of data - it will operate on a single individual's .hzprob file. You ought to be able to automate this with a simple shell or perl script. Again, you will generate a set of files for each individual, so I suggest outputting the results to a directory (mydir) which you can include in the output option (--out). Note here that one can take advantage of parallel processing (if one felt so inclined), as an individual's parameter estimates do not depend on anyone else.
WHAMM.pl --segfile whamm.smap --paint mydir/whamm_probs_fid_iid.hzprobs
--fbprobs --out mydir/hzmap_paint

Once this is completed, you will want to create concatenated sets of the above file sets [.params, .hzdose, .fbpath]. Generally this is as simple as using the cat command in unix, though this can be a bit tricky if you have lots of samples (read: lots of files) to cat. I trust that you can handle this. Future versions will likely create a single set of file(s) which can be easily concatenated (rather than having to deal with 1000's of files).

4. Once you have created concatenated .fbpath, .hzdose, and .params files, you are just about set to do analysis. Currently, I don't have much in the way of summary statistic/plot making incorporated into WHAMM yet, so if you want those things you'll have to come and talk to me. Examples of things: Individual specific chromosome painting plots; proportion of genome coveraged by Homozygosity; enrichment of minor-allele homozygotes in HZ regions. Mostly what you want to do is analyze phenotypes. You will of course need to make a standard phenotype file; either with case/control labels [1 control, 2 case], or as a QTL (which perhaps you have standardized). The most basic implementation: suppose you had a set of cases and controls, and you want to test for association between HZ status and phenotype. You can do this via a Fisher's exact test:
WHAMM.pl --samplefile whamm.sinfo --segfile whamm.smap
--pheno pheno.txt --paintfile whamm.fbpath --fisher whamm_fisher

the result will be an HZ association file [.hzassoc]:

CHR: Chromosome
SEGID: HZ Segment ID
START_POS: Physical Start Position for the Segment
END_POS: Physical End Position for the Segment
CNT_HZEXT: Number of Homozygotes in the extreme (or cases)
CNT_NHZEXT: Number of non-homozygotes in the extreme (or cases)
CNT_HZNEXT: Number of Homozygotes in the rest of the sample (or controls)
CNT_NHZNEXT: Number of non-Homozygotes in the rest of the sample (or controls)

An alternative approach would be to perform a logistic regression, with outcome (phenotype) predicted by HZ status at a given segment. This approach is based on the probability of homozygosity, rather than an absolute call, so this in some sense accounts for the uncertainty there. To do this analysis:

WHAMM.pl --samplefile whamm.sinfo --segfile whamm.smap --pheno pheno.txt --dosefile whamm.hzdose
--logistic b --out whamm_logreg

the result will be a dosage HZ association file [.hzdassoc]:

CHR: Chromosome
SEGID: HZ Segment ID
SPOS: Physical Start Position for the Segment
EPOS: Physical End Position for the Segment
TYPE: test at the segment; HZ is the test of homozygosity (or covariate)
NMISS: number of non-missing data points
BETA: Beta coefficient (logit function)
SE: Standard error on beta (logit)
STAT: Z score for test of the term
PVAL: P-value for the term

Hopefully, this is straightforward.
 

This document last modified Monday, 02-Apr-2012 17:35:45 EDT