GWAS tutorial 

For a short introduction to bacterial GWAS, you may wish to read this review .
This tutorial shows how to use pyseer to perform a GWAS for penicillin resistance using 616 S. pneumoniae genomes collected from Massachusetts. These genomes were first reported here and can be accessed here . One of the earliest GWAS studies in bacteria was performed using this data, and we will try to replicate their results .
The data for this tutorial can be accessed here . Extract the archive:
tar xvf pyseer_tutorial.tar.bz2
To find the following files:
Contents
assemblies.tar.bz2
Archive of genome assemblies.
fsm_file_list.txt
Input to run fsm-lite.
snps.vcf.gz
SNPs mapped against the Spn23F reference.
gene_presence_absence.Rtab
Output from roary run on these genomes.
core_genome_aln.tree
IQ-TREE phylogeny (using -m GTR) from the core genome alignment.
resistances.pheno
Whether an isolate was resistant to penicillin, to be used as the phenotype.
mash_sketch.msh
mash sketch output, from running mash sketch -s 10000 -o mash_sketch *.fa.
Spn23F.fa
23FSpn sequence.
Spn23F.gff
23FSpn sequence and annotation.
6952_7#3.fa
The draft sequence assembly of one isolate in the collection.
6952_7#3.gff
The draft annotation of the isolate.
To run commands with the scripts/ directory you will need to have cloned
the github repository (though other commands can continue to be run using your
conda/pip install.
SNP and COG association with fixed effects model
We will first of all demonstrate using pyseer with the original seer model,
using MDS components as fixed effects to control for the population structure.
We will test the association of SNPs mapped to a reference (provided as a VCF file) and COG
presence/absence (provided as and Rtab file, from running roary on the
annotations).
The first step is to estimate the population structure. We will do this using
a pairwise distance matrix produced using mash. Either create the mash
sketches yourself:
mkdir assemblies
cd assemblies
tar xf ../assemblies.tar.bz2
cd ..
mash sketch -s 10000 -o mash_sketch assemblies/*.fa
or use the pre-computed mash_sketch.msh directly. Next, use these to
calculate distances between all pairs of samples:
mash dist mash_sketch.msh mash_sketch.msh| square_mash > mash.tsv
Alternatively, we could extract patristic distances from a phylogeny:
python scripts/phylogeny_distance.py core_genome_aln.tree > phylogeny_dists.tsv
Let’s perform an MDS and these distances and look at a scree plot to choose the number of
dimensions to retain:
scree_plot_pyseer mash.tsv
There is a drop after about 8 dimensions, so we will use this many. This is
subjective, and you may choose to include many more. This is
a sensitivity/specificity tradeoff – choosing more components is more likely
to reduce false positives from population structure, at the expense of power.
Using more components will also slightly increase computation time.
We can now run the analysis on the COGs:
pyseer --phenotypes resistances.pheno --pres gene_presence_absence.Rtab --distances mash.tsv --save-m mash_mds --max-dimensions 8 > penicillin_COGs.txt
Which prints the following to STDERR:
Read 603 phenotypes
Detected binary phenotype
Structure matrix has dimension (616, 616)
Analysing 603 samples found in both phenotype and structure matrix
10944 loaded variants
4857 filtered variants
6087 tested variants
6087 printed variants
pyseer has automatically matched the sample labels between the inputs, and
only used those which were present in the phenotype file. This has accounted
for the fact that not all of the samples were measured for the current
phenotype. We have used the default filters, so only intermediate frequency
COGs have been considered. The core genome COGs and low frequency COGs are in
the 4857 filtered out. Take a look at the top hits:
sort -g -k4,4 penicillin_COGs.txt | head
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    intercept       PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 notes
group_4276      7.79E-02        1.27E-11        2.70E-21        1.29E+01        7.12E-01        -1.29E+00       -7.01E-01       -2.75E+00       -6.64E+00    -9.02E-01       1.46E+01        -3.83E+00       -6.05E-01       -4.25E+00       high-bse
group_4417      8.96E-02        3.21E-09        4.72E-20        -6.08E+00       6.99E-01        -4.51E-01       -1.12E+00       5.08E-01        -5.61E+00    8.20E-01        8.19E+00        -4.95E-01       -4.53E-01       9.70E-01        bad-chisq
cpsG    1.18E-01        1.34E-16        1.69E-19        3.77E+00        5.25E-01        -1.34E+00       2.49E+00        1.24E-01        -5.19E+00   6.57E-01 1.01E+01        8.38E-02        -3.06E-01       8.48E-01
group_3096      1.18E-01        1.34E-16        1.69E-19        3.77E+00        5.25E-01        -1.34E+00       2.49E+00        1.24E-01        -5.19E+00    6.57E-01        1.01E+01        8.38E-02        -3.06E-01       8.48E-01
group_5738      1.18E-01        1.34E-16        1.69E-19        3.77E+00        5.25E-01        -1.34E+00       2.49E+00        1.24E-01        -5.19E+00    6.57E-01        1.01E+01        8.38E-02        -3.06E-01       8.48E-01
group_8161      1.18E-01        1.34E-16        1.69E-19        3.77E+00        5.25E-01        -1.34E+00       2.49E+00        1.24E-01        -5.19E+00    6.57E-01        1.01E+01        8.38E-02        -3.06E-01       8.48E-01
group_8834      1.18E-01        1.34E-16        1.69E-19        3.77E+00        5.25E-01        -1.34E+00       2.49E+00        1.24E-01        -5.19E+00    6.57E-01        1.01E+01        8.38E-02        -3.06E-01       8.48E-01
mnaA    1.18E-01        1.34E-16        1.69E-19        3.77E+00        5.25E-01        -1.34E+00       2.49E+00        1.24E-01        -5.19E+00   6.57E-01 1.01E+01        8.38E-02        -3.06E-01       8.48E-01
tagA    1.18E-01        1.34E-16        1.69E-19        3.77E+00        5.25E-01        -1.34E+00       2.49E+00        1.24E-01        -5.19E+00   6.57E-01 1.01E+01        8.38E-02        -3.06E-01       8.48E-01
Note that the first two rows have notes high-bse and bad-chisq
respectively. For the former this may represent a high effect size, low
frequency results. For the latter this is likely due to the MAF filter not
being stringent enough. The identical p-values of the other results are as these COGs
appear in exactly the same set of samples.
We will now perform an analysis using the SNPs produced from mapping reads
against the provided reference genome. To speed up the program we will load the
MDS decomposition mash_mds.pkl which was created by the COG analysis above:
pyseer --phenotypes resistances.pheno --




    
vcf snps.vcf.gz --load-m mash_mds.pkl --lineage --print-samples > penicillin_SNPs.txt
This gives similar log messages:
Read 603 phenotypes
Detected binary phenotype
Loaded projection with dimension (603, 269)
Analysing 603 samples found in both phenotype and structure matrix
Writing lineage effects to lineage_effects.txt
198248 loaded variants
81370 filtered variants
116878 tested variants
116700 printed variants
We haven’t specified the number of MDS dimensions to retain, so the default of
10 will be used (anything up to the 269 retained positive eigenvalues could be
chosen). Turning on the test for lineage effects with --lineage uses the
MDS components as the lineage, and writes the lineages most associated with
the phenotype to lineage_effects.txt:
lineage wald_test       p-value
MDS3    10.3041807281   0.0
MDS10   6.61332035523   3.75794950713e-11
MDS5    6.03559150525   1.58381441295e-09
MDS4    2.35736678835   0.0184050574981
MDS6    1.33118701438   0.183127483126
MDS2    1.02523510885   0.305252266
MDS9    0.850386297867  0.39511035157
MDS7    0.780676383001  0.434992854366
MDS1    0.478181602218  0.632520955891
MDS8    0.344928992152  0.730147754076
Variants associated with both the phenotype and MDS3, MDS10 or MDS5 may
therefore be of interest as lineage effects.
The output now includes the lineage each variant is associated with, though not
all variants can be assigned a lineage. --print-samples forces the
inclusion of a comma separated list of samples the variant is present in
k-samples and not present in nk-samples (not shown here for brevity):
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    intercept       PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 PC9      PC10    lineage notes
26_23_G 4.31E-02        3.31E-01        4.42E-01        -4.19E-01       5.49E-01        -9.22E-01       1.84E-01        -6.00E-01       -7.53E+00   8.84E-01 2.05E+01        -1.79E+00       2.69E-01        1.16E-01        -7.52E-01       3.66E+00        MDS1
26_31_G_T       5.64E-02        3.94E-06        1.00E+00        6.78E-01        6.92E-01        -8.90E-01       1.97E-01        -4.13E-01       -7.05E+00    8.63E-01        1.91E+01        -1.33E+00       3.02E-01        9.13E-02        -4.99E-01       3.35E+00        MDS10   bad-chisq
26_83_A_G       4.58E-01        9.88E-04        3.25E-01        4.06E-01        4.13E-01        -1.21E+00       -1.43E-01       -7.84E-01       -7.35E+00    6.13E-01        1.91E+01        -1.19E+00       1.73E-01        6.44E-01        -4.47E-01       3.63E+00        MDS6
26_109_G_A      1.33E-02        1.46E-01        2.10E-14        4.15E+01        7.25E-01        -9.97E-01       9.39E-02        3.33E-02        -9.52E+00    1.72E+00        3.41E+01        1.38E+00        4.43E-01        -1.20E+00       6.82E-02        4.28E+00
26_184_G_A      3.32E-02        1.06E-02        8.49E-01        1.75E-01        9.11E-01        -9.65E-01       1.37E-01        -5.96E-01       -7.42E+00    8.65E-01        1.98E+01        -1.71E+00       3.00E-01        2.78E-01        -6.18E-01       3.63E+00
26_281_C_T      1.01E-01        1.20E-05        3.97E-01        -5.91E-01       6.91E-01        -9.08E-01       1.12E-01        -7.04E-01       -7.24E+00    7.18E-01        2.02E+01        -1.73E+00       4.32E-01        3.50E-01        -6.84E-01       3.69E+00        MDS4
26_293_G_A      1.49E-02        3.50E-01        5.31E-01        7.06E-01        1.07E+00        -9.73E-01       1.29E-01        -6.11E-01       -7.49E+00    9.16E-01        2.03E+01        -1.54E+00       3.02E-01        2.55E-01        -5.93E-01       3.66E+00        MDS6
26_483_G_A      2.37E-01        7.85E-02        1.82E-02        9.16E-01        3.90E-01        -1.32E+00       -2.83E-01       -1.30E+00       -7.28E+00    6.77E-01        1.78E+01        -1.79E+00       2.59E-01        1.10E+00        3.15E-02        3.44E+00        MDS9
26_539_G_A      1.33E-02        1.46E-01        2.10E-14        4.15E+01        7.25E-01        -9.97E-01       9.39E-02        3.33E-02        -9.52E+00    1.72E+00        3.41E+01        1.38E+00        4.43E-01        -1.20E+00       6.82E-02        4.28E+00
This contains co-ordinates and p-values, which can be converted to a .plot
file using the following awk one-liner:
cat <(echo "#CHR SNP BP minLOG10(P) log10(p) r^2") \\
<(paste <(sed '1d' penicillin_SNPs.txt | cut -d "_" -f 2) \\
<(sed '1d' penicillin_SNPs.txt | cut -f 4) | \\
awk '{p = -log($2)/log(10); print "26",".",$1,p,p,"0"}' ) | \\
tr ' ' '\t' > penicillin_snps.plot
If we drag and drop 23FSpn.gff and penicillin_snps.plot files into
phandango you should see
a Manhattan plot similar to this:
The three highest peaks are in the pbp2x, pbp1a and pbp2b genes,
which are the correct loci. There are also flat lines, suggesting
these may be lineage effects from population structure that has not been fully
controlled for. In actual fact, if we inspect the SNPs along these two lines
(p = 2.10E-14 and p = 1.58E-15) we see that all of them are annotated
with the note bad-chisq and are at the lower end of the included minor allele
frequency threshold (1.3% and 1.2% respectively). These are therefore variants
which were underpowered, and the associations are spurious. They should be
filtered out, and we should probably have used a MAF cutoff of at least 2%
given the total number of samples we have. As a rule of thumb, a MAF cutoff
corresponding to a MAC of at least 10 isn’t a bad start. Let’s run it again:
pyseer --phenotypes resistances.pheno --vcf snps.vcf.gz --load-m output/mash_mds.pkl --min-af 0.02 --max-af 0.98 > penicillin_SNPs.txt
Read 603 phenotypes
Detected binary phenotype
Loaded projection with dimension (603, 269)
Analysing 603 samples found in both phenotype and structure matrix
198248 loaded variants
106949 filtered variants
91299 tested variants
91225 printed variants
A lot more low frequency variants have been filtered out this time, and if we
make a plot file our Manhattan plot looks much cleaner:
K-mer association with mixed effects model
We will now use k-mers as a variant to test both short variation as well as
gene presence/absence. This can be done using the steps above replacing the
--vcf




    
 argument with --kmers, which would replicate the results from the
original seer tutorial. For demonstration purposes we will instead use the
other association model available in pyseer, the linear mixed model.
First, count the k-mers from the assemblies:
mkdir -p assemblies
cd assemblies
tar xvf ../assemblies.tar.bz2
fsm-lite -l ../fsm_file_list.txt -s 6 -S 610 -v -t fsm_kmers | gzip -c - > ../fsm_kmers.txt.gz
cd ..
This will require you to have fsm-lite installed
If you do not have the time/resources to do this, you can follow the rest of these steps using the
SNPs as above.
Everything here also applies to unitigs, which can be
called with unitig-counter.
These are generally recommended due to their lower redundancy (and are also therefore
faster) and potentially easier interpretation.
To correct for population structure we must supply pyseer with the kinship
matrix \(K\) using the --similarities argument (or --load-lmm if using
a previous analysis where --save-lmm was used).
We will use the distances from the core genome phylogeny, which
has been midpointed rooted:
python scripts/phylogeny_distance.py --lmm core_genome_aln.tree > phylogeny_K.tsv
Alternatively, we could extract a kinship matrix from the mapped SNPs by calculating \(K = GG^T\)
similarity_pyseer --vcf snps.vcf.gz samples.txt > gg.snps.txt
We can now run pyseer with --lmm. Due to the large number of k-mers we are going to test, we will increase the
number of CPUs used to 8:
pyseer --lmm --phenotypes resistances.pheno --kmers fsm_kmers.txt.gz --similarity phylogeny_K.tsv --output-patterns kmer_patterns.txt --cpu 8 > penicillin_kmers.txt
The heritability \(h^2\) estimated from the kinship matrix \(K\) is printed to STDERR,
and after about 5 hours the results have finished being written:
Read 603 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (616, 616)
Analysing 603 samples found in both phenotype and similarity matrix
h^2 = 0.90
15167239 loaded variants
1042215 filtered variants
14125024 tested variants
14124993 printed variants
The heritability estimate shouldn’t be interpreted as a quantitative measure
for this binary phenotype, but a high heritability is consistent with the mechanism of penicillin
resistance in this species (the sequence can give up to 99% prediction
accuracy of penicillin resistance).
The results look similar, though also include the heritability of each variant
tested:
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    variant_h2      notes
TTTTTTTTTTTT    8.11E-01        1.51E-06        1.05E-01        6.13E-02        3.78E-02        6.60E-02
TTTTTTTTTTTTT   7.08E-01        6.20E-06        4.03E-01        -3.34E-02       3.98E-02        3.41E-02
TTTTTTTTTTTTTT  5.97E-01        6.39E-05        1.81E-01        -4.05E-02       3.03E-02        5.45E-02
TTTTTTTTTTTTTTT 3.55E-01        5.92E-04        7.90E-01        -6.84E-03       2.57E-02        1.09E-02
TTTTTTTTTTTTTTTT        1.48E-01        2.11E-03        7.38E-01        1.13E-02        3.37E-02        1.37E-02
TTTTTTTTTTTTTTTTT       6.47E-02        3.94E-01        4.89E-01        3.11E-02        4.49E-02        2.83E-02
TTTTTTTTTTTTTTTTTT      3.48E-02        2.73E-02        2.59E-01        -6.73E-02       5.96E-02        4.60E-02
TTTTTTTTTTTTTTTTTTT     2.32E-02        2.18E-01        6.96E-01        -2.81E-02       7.19E-02        1.59E-02
TTTTTTTTTTTTTTTTTTTT    1.66E-02        2.58E-01        9.46E-01        -5.63E-03       8.37E-02        2.74E-03
The downstream processing of the k-mer results in penicillin_kmers.txt will be
shown in the next section. Before that, we can determine a significance threshold
using the number of unique k-mer patterns:
python scripts/count_patterns.py kmer_patterns.txt
Patterns:       2627332
Threshold:      1.90E-08
This is over five times lower than the total number of k-mers tested, so stops
us from being hyper-conservative with the multiple testing correction.
We can also create a Q-Q plot to check that p-values are not inflated. We can do that
by using the qq_plot.py script:
python scripts/qq_plot.py penicillin_kmers.txt
which produces the following Q-Q plot:
When interpreting this plot, check that it is well controlled at low p-values and doesn’t
show any large ‘shelves’ symptomatic of poorly controlled confounding population
structure. Although this plot is far above the null (as indeed, there are many
k-mers associated with penicillin resistance), the p-values up to 0.01 are as expected
which is what we’re after.
Interpreting significant k-mers
For the final step we will work with only those k-mers which exceeded the
significance threshold in the mixed model analysis. We will filter these from
the output using a simple awk command:
cat <(head -1 penicillin_kmers.txt) <(awk '$4<1.90E-08 {print $0}' penicillin_kmers.txt) > significant_kmers.txt
There are 5327 significant k-mers.
Mapping to a single reference
Let’s use bwa mem to map these to
the reference provided:
phandango_mapper significant_kmers.txt Spn23F.fa Spn23F_kmers.plot
Read 5327 k-mers
Mapped 2425 k-mers
Not all the k-mers have been mapped, which is usually the case. Note there are 2459
mapping lines in the output, as 34 secondary mappings we included. It is a good idea
to map to range of references to help with an interpretion for all of the significant
k-mers. The k-mer annotation step, described next, also helps cover all k-mers. Let’s
look at the plot file in phandango:
In this view we no longer see all of the Manhattan plot as we have filtered out
the low p-value k-mers. There is generally less noise due to LD/population structure when
compared to our previous result above. There are peaks in the three pbp genes again, with
the strongest results in pbp2x and pbp2b as before. Zooming in:
The whole pbp2x gene is covered by significant k-mers, whereas only a small
part of pbp1a is hit. This could be due to the fact that only some sites
in pbp1a can be variable, only some of the variable sites affect penicllin
resistance, or due to the ability to map k-mers to this region.
Annotating k-mers
We can annotate these k-mers with the genes they are found in, or are near. To
try and map every k-mer we can include a number of different reference
annotations, as well as all the draft annotations of the sequences the k-mers
were counted from. For the purposes of this tutorial we will demonstrate with
a single type of each annotation, but this could be expanded by adding all
the annotated assemblies to the input.
We’ll start by creating a references.txt file listing the annotations we
wish to use (see Annotating k-mers for more information on how to construct
this file):
Spn23F.fa    Spn23F.gff      ref
6952_7#3.fa  6952_7#3.gff    draft
Now run the script. This will iterate down the list of annotations, annotating the k-mers which
haven’t already been mapped to a previous annotation (requires bedtools, bedops and the
pybedtools package):
annotate_hits_pyseer significant_kmers.txt references.txt annotated_kmers.txt
Reference 1
5327 kmers remain
Draft reference 2
2902 kmers remain
If this runs slowly you can split the significant_kmers.txt file into
pieces to parallelise the process.
By default annotate_hits_pyseer will only consider CDS features in the
provided GFF files. If you want to consider other feature types you can use the
--feature-type option (e.g. --feature-type rRNA --feature-type tRNA).
Annotations marked ref can partially match between k-mer and reference
sequence, whereas those marked draft require an exact match. In this case
the single draft didn’t add any matches.
The genes a k-mer is in, as well as the nearest upstream and downstream are added to the
output:
TTTTTTTCTACAATAAAATAGGCTCCATAATATCTATAGTGGATTTACCCACTACAAATATTATAGAACCCGTTTTATTATGGAAAGACTTATTGGACTT    6.47E-02        2.08E-12        2.10E-09        7.97E-01        1.31E-01        2.41E-01        FM211187:252213-252312;FM211187.832;;FM211187.834
TTTTTTTATAGATTTCAGGATCAGCCAAATAGTAATCCG 8.42E-01        1.03E-36        2.99E-10        -4.38E-01       6.83E-02        2.53E-01        FM211187:723388-723417;FM211187.2367;;FM211187.2371
TTTTTTTATAGATTTCAGGATCAGCCAAATAGTAATCCGCCAGCTGGCGTT     8.39E-01        3.38E-35        4.04E-09        -3.95E-01       6.62E-02        2.37E-01        FM211187:1614084-1614122;penA;penA;penA
The output format is contig:position;upstream;in;downstream.
The first line shows the k-mer was mapped to FM211187:252213-252312, the
nearest gene downstream having ID FM211187.832 and upstream having ID FM211187.834.
The third line shows that k-mer overlaps penA – note when a gene= field
is found this is used in preference to the ID= field.
Finally, we can summarise these annotations to create a plot of significant
genes. We will only use genes k-mers are actually in, but if we wanted to we
could also include up/downstream genes by using the --nearby option:
python scripts/summarise_annotations.py annotated_kmers.txt > gene_hits.txt
We’ll use ggplot2 in R to plot these results:
require(ggplot2)
require(ggrepel)
library(ggrepel)
gene_hits = read.table("gene_hits.txt", stringsAsFactors=FALSE, header=TRUE)
ggplot(gene_hits, aes(x=avg_beta, y=maxp, colour=avg_maf, size=hits, label=gene)) +
   geom_point(alpha=0.5) +
   geom_text_repel(aes(size=60), show.legend = FALSE, colour='black') +
   scale_size("Number of k-mers", range=c(1,10)) +
   scale_colour_gradient('Average MAF') +
   theme_bw(base_size=14) +
   ggtitle("Penicillin resistance") +
   xlab("Average effect size") +
   ylab("Maximum -log10(p-value)")
You can customise this however you wish (for example adding the customary italics on gene
names); these commands will produce a plot like this:
The main hits have high p-values and are common, and in this case are covered
by many k-mers. In this case penA (pbp2b) and penX (pbp2x) are the main
hits. Other top genes recR and ddl are adjacent to the pbp genes and are
in LD with them,
creating an artifical association.
The results with large effect sizes (see Effect sizes) and relatively low p-values also have low MAF, and are
probably false positives. This can be seen better by changing the axes: