This short tutorial illustrates how to analyze population genetic structure, and how to perform genome scans for selection using
the function **snmf** from the R computer package **LEA** (Frichot et al. 2014, Frichot et al. 2015). This tutorial also provides guidelines for controlling the false discovery rate in genome scans for selection.

LEA can handle several classical formats for genotypic matrices. More specifically, the package uses the **lfmm** and **geno** formats, and provides functions to convert data sets in **ped**, **vcf**, or **ancestrymap** format.

The example data set used in the demo is available here: genotype1.lfmm.

A LEA session starts with the following R command

`library(LEA)`

`## Warning: package 'LEA' was built under R version 3.2.0`

Using the genotypic matrix, the first step is to evaluate population genetic structure. The snmf function provides outputs similar to those of the computer program **STRUCTURE**. We start with converting the genotype matrix from the “lfmm” to the “geno” format as follows

`genotype = lfmm2geno("genotype1.lfmm")`

```
##
## - number of detected individuals: 200
## - number of detected loci: 500
```

Then we create an “snmf” object (or project) using the **snmf** function. Here we run the program for \(K= 1-6\) ancestral populations.

`obj.snmf = snmf(genotype, K = 1:6, entropy = T, ploidy = 2, project = "new")`

```
## The project is saved into :
## genotype1.snmfProject
##
## To load the project, use:
## project = load.snmfProject("genotype1.snmfProject")
##
## To remove the project, use:
## remove.snmfProject("genotype1.snmfProject")
```

Since the program was called with the **entropy** option, we can plot the values of the cross-entropy criterion for each \(K\). The value for which the function plateaus or increases is our estimate of \(K\).

`plot(obj.snmf)`

```
## Warning in plot(s$crossEntropy[1, ], ylab = "Minimal Cross-Entropy", xlab
## = "Number of ancestral populations", : bytecode version mismatch; using
## eval
```

We observe that the minumum value is obtained for \(K = 2\). Note that we can visualize the barplot of ancestry coeffcients as follows.

`barplot(t(Q(obj.snmf, K = 2)), col = 2:3, ylab = "Ancestry coefficients", xlab = "Sampled individuals")`

The next step of analysis is to run a genome scan for selection based on the \(F_{\text ST}\) statistic. Here, the \(F_{\text ST}\) statistic is computed
from the ancestral allele frequencies estimated by **snmf**.We define the function **fst** function as follows.

```
fst = function(project,run = 1, K, ploidy = 2){
library(LEA)
l = dim(G(project, K = K, run = run))[1]
q = apply(Q(project, K = K, run = run), MARGIN = 2, mean)
if (ploidy == 2) {
G1.t = G(project, K = K, run = run)[seq(2,l,by = 3),]
G2.t = G(project, K = K, run = run)[seq(3,l,by = 3),]
freq = G1.t/2 + G2.t}
else {
freq = G(project, K = K, run = run)[seq(2,l,by = 2),]}
H.s = apply(freq*(1-freq), MARGIN = 1, FUN = function(x) sum(q*x))
P.t = apply(freq, MARGIN = 1, FUN = function(x) sum(q*x))
H.t = P.t*(1-P.t)
return(1-H.s/H.t)
}
```

Then we compute the \(F_{\text ST}\) statistics as follows

`fst.values = fst(obj.snmf, K=2)`

Next, we convert the \(F_{\text ST}\) values into absolute values of \(z\)-scores.

```
#convert fst values into z-scores (absolute values)
n = dim(Q(obj.snmf, K = 2))[1]
fst.values[fst.values<0] = 0.000001
K = 2
z.scores = sqrt(fst.values*(n-K)/(1-fst.values))
```

The estimated \(z\)-scores must be recalibrated before applying a test for neutrality at each locus. Due to the existence of population structure, the null-hypothesis may be misspecified, and this step is often necessary.

We suggest using an “empirical-null hypothesis” approach, starting with computing a genomic inflation factor, \(\lambda\), and then dividing the scores by \(\lambda\)

.```
#Compute the GIF
K=2
lambda = median(z.scores^2)/qchisq(1/2, df = K-1)
lambda
```

`## [1] 1.784576`

```
# compute adjusted p-values from the combined z-scores
adj.p.values = pchisq(z.scores^2/lambda, df = K-1, lower = FALSE)
```

Then, we can **check the histogram** of \(p\)-values. Ideally, this histogram should have a flat shape with a peak close to zero. Having this shape indicates that the null-hypothesis is correct, *i.e.*, \(p\)-values are sampled from a uniform distribution under the null-hypothesis.

```
#histogram of p-values
hist(adj.p.values, col = "red")
```

The **genomic inflation factor** is acknowledged to be overly **conservative**. So, it might be better to calibrate the \(p\)-values using other methods (eg, using the R package “fdrtool") or following a trial-and-error approach. Here we try \(\lambda = 1.6\).

```
# compute adjusted p-values from the combined z-scores
adj.p.values = pchisq(z.scores^2/1.6, df = 1, lower = FALSE)
#histogram of p-values
hist(adj.p.values, col = "green")
```

The plot looks slightly better.

To correct for multiple testing, we apply the Benjamini-Hochberg algorithm. This can be done as follows.

```
## FDR control: Benjamini-Hochberg at level q
L = 500
q = 0.1
w = which(sort(adj.p.values) < q * (1:L)/L)
candidates = order(adj.p.values)[w]
```

Now, our list of candidate loci is recorded in the R object “candidates”.

```
plot(-log10(adj.p.values), main="Manhattan plot", xlab = "Locus", cex = .7, col = "grey")
points(candidates, -log10(adj.p.values)[candidates], pch = 19, cex = .7, col = "red")
```