Running genome-wide ecological association studies with the R package LEA

This short tutorial presents the main steps of a population genomic data set analysis using the R computer package LEA. A first objective of the tutorial is to illustrate the use of the functions snmf and lfmm for running large-scale ancestry analyses and for performing genome scans for local adaptation. A second objective is to provide guidelines for controlling false discovery rates in genome scans for selection.

LEA can handle several classical formats for genotypic matrices. More specifically, the package uses the lfmm and geno formats, and provides functions to convert from other formats such asn ped, vcf, and ancestrymap. Ecological variables must be formatted in the env format. In this tutorial, we process data simulated for 165 diploid individuals genotyped at 500 SNPs.

The SNP data are available as genotype.lfmm.

The environmental data are available as gradient.env.

A LEA session starts with

library(LEA)

## Warning: package 'LEA' was built under R version 3.2.0

Analysis of population structure

Using the genotypic matrix, the first step is to evaluate population genetic structure with the program snmf. The snmf function provides outputs similar to those of the computer program STRUCTURE, but snmf is much faster than STRUCTURE. We start with converting the genotypic matrix from the “lfmm” to the “geno” format

genotype = lfmm2geno("genotype.lfmm")

## 
##  - number of detected individuals:   165
##  - number of detected loci:      500

Then we create an object of class “snmf” using the snmf function. Here we run the program for all \(K\), the number of ancestral populations, from 1 to 12.

obj.snmf = snmf(genotype, K = 1:12, entropy = T, ploidy = 2, project="new")

## The project is saved into :
##  genotype.snmfProject 
## 
## To load the project, use:
##  project = load.snmfProject("genotype.snmfProject")
## 
## To remove the project, use:
##  remove.snmfProject("genotype.snmfProject")

Since the program was called with the entropy option, we can plot the values of the cross-entropy criterion for each \(K\). The value for which the function plateaus or increases is our estimate of \(K\)

plot(obj.snmf)

We observe that the minimum value is around \(K = 6\). We can also visualize a barplot of ancestry coeffcients as follows.

barplot(t(Q(obj.snmf, K = 6)), col = 1:6)

Genome scan for selection using environmental variables

The next step of analysis is to run genome scans for selection using individual environmental variables. For this, we use the environmental data stored in the “gradient.env” file. The file contains a single environmental variable.

The snmf results suggested that we could use around \(K = 6\) factors in lfmm to describe population structure. Note that lfmm does not require an accurate estimate of \(K\) (its goal is to end with well-calibrated \(p\)-values, not to estimate genetic ancestry). So, our next step is run lfmm with 6 latent factors, and we want 5 runs in order to increase the power of the program to detect true associations (Note that more runs would be better). The typical call is as follows.

obj.lfmm = lfmm("genotype.lfmm", "gradient.env", K = 6, rep = 5, project="new")

## The project is saved into :
##  genotype_gradient.lfmmProject 
## 
## To load the project, use:
##  project = load.lfmmProject("genotype_gradient.lfmmProject")
## 
## To remove the project, use:
##  remove.lfmmProject("genotype_gradient.lfmmProject")

Combining scores from multiple runs

Since we ran 5 runs of the MCMC program, we need to combine the results (\(z\)-scores) from the 5 runs. This can be done as follows.

#Record z-scores from the 5 runs in the zs matrix
zs = z.scores(obj.lfmm, K = 6)

#Combine z-scores using the median
zs.median = apply(zs, MARGIN = 1, median)

Computing a correct set of \(p\)-values

The median \(z\)-scores must be recalibrated before computing \(p\)-values. We suggest to start with computing the genomic inflation factor, \(\lambda\), and then dividing the scores by \(\lambda\)

#Compute the GIF
lambda = median(zs.median^2)/qchisq(0.5, df = 1)
lambda

## [1] 0.7652

# compute adjusted p-values from the combined z-scores
adj.p.values = pchisq(zs.median^2/lambda, df = 1, lower = FALSE)

The value of the genomic inflation factor is close 1, which is a good thing. But, the key point is to check the histogram of \(p\)-values. Ideally, this histogram should look flat with a peak close to zero. This ideal shape indicates that the \(p\)-values are correct (i.e., are drawn from a uniform distribution under the null-hypothesis)

#histogram of p-values
hist(adj.p.values, col = "red")

The genomic inflation factor is acknowledged to be overly conservative. So, it might be better to calibrate the \(p\)-values manually, using a lower value. Here, we try \(\lambda\) = .55.

# compute adjusted p-values from the combined z-scores
adj.p.values = pchisq(zs.median^2/.55, df = 1, lower = FALSE)

#histogram of p-values
hist(adj.p.values, col = "green")

This is better!

Now, the histogram is as we expect from the results of a correct testing procedure. We can apply multiple testing correction procedures with confidence.

Control of false discoveries

To correct for multiple testing, we can apply the Benjamini-Hochberg algorithm. This can be done as follows.

## FDR control: Benjamini-Hochberg at level q
## L = number of loci
L = 500
#fdr level q
q = 0.1
w = which(sort(adj.p.values) < q * (1:L)/L)
candidates.bh = order(adj.p.values)[w]

Now, our list of candidate loci is recorded in the R object “candidates.bh”. Note that R contains packages that compute q-values automatically (see library(qvalue) from Bioconductor).

## FDR control: Storey's q-values 
library(qvalue)
plot(qvalue(adj.p.values))
candidates.qv = which(qvalue(adj.p.values, fdr = .1)$signif)

Storey's algorithm is less conservative than the BH algorithm, and the list “candidates.qv” contains a few additional candidates.