Background

A major use of phasing is haplotype estimation of GWAS samples in order to speed up imputation from a large reference panel of haplotypes such as 1000 Genomes. The current recommendation is that GWAS samples are first 'pre-phased' using the most accurate method available. The subsequent imputation step (which involves imputing alleles from one set of haplotypes into another set) is then fast. As new haplotype reference sets become available, imputation can be re-run much more efficiently because the phasing does not need to be repeated. The approach we recommend is:
1. Phase the GWAS samples with SHAPEIT
2. Impute non-typed SNPs into SHAPEIT haplotypes with IMPUTE2.

Step1: Alignment of the SNPs


SNP positions in build 37


The most recent 1000 Genomes haplotypes are defined at SNPs that use build 37 coordinates. You therefore have to make sure that your GWAS SNPs use the same build. If they do not, you can use the UCSC liftOver tool to convert them to build 37 coordinates.
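As a rough illustration of the conversion, the sketch below turns the SNP positions of a PLINK .bim file into the BED-style input that liftOver reads, then calls liftOver if it is installed. It assumes that the GWAS data are on build 36 (hg18), that the chromosome is an autosome coded as a number, and that the chain file hg18ToHg19.over.chain.gz has been downloaded from UCSC. The function name and the output file names are hypothetical.

```shell
# Convert a PLINK .bim file (columns: chr, SNP id, cM, bp, allele1, allele2)
# into BED-style lines for liftOver: chrom, 0-based start, 1-based end, SNP id.
bim_to_liftover_input() {
    awk '{ printf "chr%s\t%d\t%d\t%s\n", $1, $4 - 1, $4, $2 }' "$1"
}

# Only run the conversion when liftOver and the input file are both present.
# Assumption: the GWAS is on build 36 (hg18), hence the hg18ToHg19 chain file.
if command -v liftOver >/dev/null 2>&1 && [ -f chr20.unphased.bim ]; then
    bim_to_liftover_input chr20.unphased.bim > chr20.liftover_in.txt
    # Usage: liftOver oldFile map.chain newFile unMapped
    liftOver chr20.liftover_in.txt hg18ToHg19.over.chain.gz \
        chr20.b37.txt chr20.unmapped.txt
fi
```

SNPs listed in the unmapped file have no build 37 position and would typically be removed from the GWAS dataset before phasing.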


Strand alignment


Strand alignment is a crucial step of prephasing/imputation: it makes sure that the GWAS dataset is well aligned with the reference panel of haplotypes. You can check SNP alignment in two steps with Plink (step1 and step2) or with GTOOL.

You can also check strand alignment using SHAPEIT as described in detail here with this command line:

shapeit -check -B chr20.unphased --input-ref chr20.reference.hap.gz chr20.reference.legend.gz chr20.reference.sample --output-log chr20.alignments

Then, to list all strand alignment problems, just do:

cat chr20.alignments.snp.strand | grep "strand"

To generate a list of positions to flip, just pipe the output of the previous command line into awk '{ print $2 }'.
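Put together, the two commands above can be sketched as a single pipeline that writes the positions to a file. The function name and the output file name chr20.flip.txt are hypothetical, and the sketch relies on the position being in the second column, as stated above.

```shell
# Print the second column of every line reporting a strand problem.
list_flip_positions() {
    grep "strand" "$1" | awk '{ print $2 }'
}

# Write the list only if the SHAPEIT check has produced its report.
if [ -f chr20.alignments.snp.strand ]; then
    list_flip_positions chr20.alignments.snp.strand > chr20.flip.txt
fi
```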

Step2: Phasing the GWAS samples

Once your GWAS dataset is correctly aligned to the reference panel, we strongly recommend phasing each chromosome in a single run instead of splitting it into chunks. This makes the procedure much easier and increases downstream imputation quality. To do so, use the following SHAPEIT command line:

shapeit -B chr20.unphased -M chr20.gmap.gz -O chr20.phased

Notes on clusters

Suppose that you want to prephase your GWAS on a cluster where each node has X CPU cores. In this case, the approach we recommend is:
1. To reserve a complete cluster node for each SHAPEIT job
2. To run each SHAPEIT job with X threads to fully load the CPU-cores of a node


Notes on servers

Suppose that you want to prephase your GWAS on a server with Y CPU cores (usually, Y=8, 12, 16, 32, 48 or 64). In this case, the approach we recommend is:
1. To process the chromosomes in decreasing size order (the biggest one to the smallest one, that is from chr1 to chr22)
2. To use several parallel SHAPEIT jobs (J), each using several threads (T) such that the server is fully loaded (JxT=#CPU-cores)
Specifically, we recommend:

Y (#CPU-cores)   #jobs (J)   #threads (T)
8                1           8
12               1           12
16               2           8
32               4           8
48               4           12
64               8           8
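The table above can be encoded as a small lookup, which may be convenient in a launch script. This is only a sketch: the function name is hypothetical, and the fallback for core counts not listed in the table (one job using all cores) is an assumption rather than a recommendation from this page.

```shell
# Print "J T" (number of jobs, threads per job) for a server with Y CPU cores.
jobs_and_threads() {
    case "$1" in
        8)  echo "1 8"  ;;
        12) echo "1 12" ;;
        16) echo "2 8"  ;;
        32) echo "4 8"  ;;
        48) echo "4 12" ;;
        64) echo "8 8"  ;;
        *)  echo "1 $1" ;;  # assumption: unlisted sizes run a single job on all cores
    esac
}

# Example: a 16-core server runs 2 jobs with 8 threads each.
jobs_and_threads 16
```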

To set up this approach, xargs is a really useful Linux command. To use it, you can proceed in 2 steps:

Step1: Generate all the command lines (1 per job):

for i in $(seq 1 22); do echo "-B chr${i}.unphased -M chr${i}.gmap -O chr${i}.phased --thread T" >> myCommands.txt; done

Step2: Run all the jobs using xargs:

cat myCommands.txt | xargs -PJ -n8 shapeit &

Then, you will have J jobs running in parallel on your server, each using T threads, and your server will be loaded at 100%.
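The two steps can also be sketched as one script in which J and T are shell variables instead of placeholders. The values below are those of the 16-core row of the table, the function name is hypothetical, and the dry run with echo is an addition for checking the generated commands before anything is launched.

```shell
J=2   # number of parallel SHAPEIT jobs (16-core server)
T=8   # threads per job, so that J x T = 16

# Print one line of SHAPEIT arguments per chromosome (8 tokens per line).
make_commands() {
    for i in $(seq 1 22); do
        echo "-B chr${i}.unphased -M chr${i}.gmap -O chr${i}.phased --thread $1"
    done
}

make_commands "$T" > myCommands.txt

# Dry run: print the commands that xargs would start, without running SHAPEIT.
xargs -P "$J" -n 8 echo shapeit < myCommands.txt

# Real run, once the dry run looks right:
# xargs -P "$J" -n 8 shapeit < myCommands.txt &
```

The -n 8 option must match the number of tokens on each line of myCommands.txt, so it has to be changed if options are added to or removed from the SHAPEIT command.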

Step3: Imputation of the GWAS samples

Once SHAPEIT has produced haplotype estimates, you can use IMPUTE2 to impute untyped genotypes using the latest release of the 1000 Genomes haplotypes. For the previous example, you can use this command:

impute2 -use_prephased_g -known_haps_g chr20.phased.haps -h chr20.reference.hap.gz -l chr20.reference.legend.gz -m chr20.gmap.gz -int 35e6 36e6 -Ne 20000 -o chr20.imputed

We strongly advise using the latest IMPUTE2 version, 2.2.2, available here. Several comments on the previous command line:
1. Prephased GWAS haplotypes are specified using -known_haps_g
2. The flag -use_prephased_g is used to set IMPUTE2 in the prephasing mode
3. The option -int 35e6 36e6 is used to specify the region to be imputed.
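If several consecutive regions have to be imputed, the -int option can be varied in a loop. The sketch below only prints the IMPUTE2 command for each window so that the commands can be inspected or submitted separately. The function name, the 1 Mb window size, the 35-38 Mb range and the per-window output names are illustrative assumptions; the other options are copied from the command above.

```shell
# Print one IMPUTE2 command per window between START and END (both in Mb),
# using windows of STEP Mb. Usage: impute_chunk_commands START END STEP
impute_chunk_commands() {
    start=$1; end=$2; step=$3
    while [ "$start" -lt "$end" ]; do
        stop=$((start + step))
        # Do not let the last window run past END.
        if [ "$stop" -gt "$end" ]; then stop=$end; fi
        echo "impute2 -use_prephased_g -known_haps_g chr20.phased.haps -h chr20.reference.hap.gz -l chr20.reference.legend.gz -m chr20.gmap.gz -int ${start}e6 ${stop}e6 -Ne 20000 -o chr20.imputed.${start}_${stop}"
        start=$stop
    done
}

# Example: three 1 Mb windows covering 35 Mb to 38 Mb on chromosome 20.
impute_chunk_commands 35 38 1
```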

X chromosome prephasing and imputation

Step 1: Prephasing using SHAPEIT:

shapeit -B chrX.unphased -M chrX.gmap.gz -O chrX.phased --chrX

Step 2: Imputation using IMPUTE2:

impute2 -chrX -use_prephased_g -known_haps_g chrX.phased.haps -sample_known_haps_g chrX.phased.sample -h chrX.reference.hap.gz -l chrX.reference.legend.gz -m chrX.gmap.gz -o chrX.imputed -int 10e6 11e6

Two comments:
1. You must use the -chrX flag for IMPUTE2 to proceed with X chromosome imputation
2. You must give the SAMPLE file generated by SHAPEIT to IMPUTE2 (with -sample_known_haps_g). This SAMPLE file has a sex column that gives the sex of the GWAS individuals.