LEA webpage

LEA contains three main functions snmf() , lfmm2() and genetic.gap(). For an overview the package, the best introduction is in the package vignette (latest release from Github).

The snmf() function produces results similar to the widely-used computer program STRUCTURE. Users can play with a large panel of graphical functions for displaying pie-charts and interpolated admixture coefficients on geographic maps. For a short introduction on how to run structure-like analyses and importing files from the structure or tess formats, follow the examples presented in this tutorial. Improved graphical functions are implemented in the companion R package tess3r .

The lfmm2() and lfmm() functions implement latent factor mixed models for detecting loci exhibiting association with ecological variables or phenotypic traits. For large number of genotypes, it is strongly recommended to use lfmm2() .

The genetic.gap() function computes genomic offset statistics given current and altered environmental predictors.

The LEA package also implements genome scans for selection in snmf() , based on ancestral population differentiation tests for detecting signatures of local adaptation when no populations are defined a priori.

Tutorial on how to display ancestry coefficients (Q matrix) in geographic maps: Analyzing population genomic data with tess3r .

Tutorial on visualizing geographic predictions of genetic offset measures . Check the Web version .

A brief presentation of statistical methods implemented in LEA for the inference of population structure and local adaptation is available as a PDF presentation here.

The lfmm format can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions. The lfmm format has one row for each individual. Each row contains one value at each loci (separated by spaces or tabulations) corresponding to the number of alleles. The number of alleles corresponds to the number of reference alleles or the number of derived alleles. Missing genotypes are encoded by the value -9 or 9. We provide examples of conversion from the lfmm format to the geno format, and conversion from the STRUCTURE format to the lfmm and geno formats.

Here is an example of a genotypic matrix using the lfmm format with 3 individuals and 4 loci:

1 0 0 1
1 1 9 2
2 0 1 1

R code examples:

Read and write data in the lfmm format

library(LEA)


# "tutorial" contains a genotype matrix (R, 400 SNPs) 
#  and an environmental variable (C) for 50 individuals.
data("tutorial")

# write R in a file called "genotypes.lfmm"
# Create file:	"genotypes.lfmm".
write.lfmm(R, "genotypes.lfmm")

# read the file "genotypes.lfmm".
R = read.lfmm("genotypes.lfmm")

Conversion from the lfmm format to the geno format

library(LEA)


# "tutorial" contains a genotype matrix (R, 400 SNPs) 
#  and an environmental variable (C) for 50 individuals.
data("tutorial")
write.lfmm(R, "genotypes.lfmm")

# Conversion 	from the lfmm format ("genotypes.lfmm") 
#		to the geno format ("genotypes.geno").
# By default, 	the name of the output file is the same name 
#		as the input file with a .geno extension.
# Create file:	"genotypes.geno".
output = lfmm2geno("genotypes.lfmm")

# Conversion 	from the lfmm format ("genotypes.lfmm") 
#		to the geno format with the output file called "plop.geno".
# Create file:	"plop.geno".
output = lfmm2geno("genotypes.lfmm", "plop.geno")

# As force = false and the file "genotypes.geno" already exists,
# nothing happens.
output = lfmm2geno("genotypes.lfmm", force = FALSE)

For converting genotype matrices from the STRUCTURE or the TESS format to the lfmm and geno format, use the struct2geno() function. Follow the explanations given in our introductory document on how to run structure-like analysis using R.

 
### If struct2geno is not installed (older versions) 
##source("http://membres-timc.imag.fr/Olivier.Francois/Conversion.R")
 
### Artificial data with 10 diploid individuals and 10 STR markers
### STRUCTURE file: 'dat.str'

library(LEA)
dat.str  <- matrix(sample(c(100:120,-9), 200, replace = TRUE), 
                   nrow = 10, ncol = 20)
write.table(dat.str, 
            file = "dat.str", 
            col.names = FALSE, 
            row.names = FALSE, 
            quote = FALSE)

### Conversion 
struct2geno("dat.str", ploidy = 2, FORMAT = 1)

### snmf run and barplot
s  <- snmf("dat.str.geno", K = 2, project = "new")
barplot(as.qmatrix(Q(s)), xlab = "Individuals")

The "env" format can be used as an input format for the environmental variables in the lfmm function. An env format file has one row for each individual. Each row contains one value for each environmental variable (separated by spaces or tabulations).

Here is an example of an environmental file using the "env" format with 3 individuals and 2 variable:

0.252477 0.95250639
0.216618 0.10902647
-0.47509 0.07626694

R code examples:

Read and write of an environmental variables in the env format

library(LEA)

# Creation of an environmental matrix C 
# containing 2 environmental variables for 3 individuals.
# C contains one line for each individual and one column for each variable.
C = matrix(runif(6), ncol=2, nrow=3)

# Write C in a file called "tuto.env".
# Create file:	"tuto.env".
write.env(C,"tuto.env")

# Read the file "tuto.env".
C = read.env("tuto.env")

The geno format can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions. The geno format has one row for each locus. Each row contains 1 character per individual: 0 means zero copies of the reference allele. 1 means one copy of the reference allele. 2 means two copies of the reference allele. 9 means missing data.

Here is an example of a genotypic matrix using the geno format with 3 individuals and 4 loci:

R code examples:

Read and write data in the geno format

library(LEA)

# "tutorial" contains a genotype matrix (R, 400 SNPs) 
#  and an environmental variable (C) for 50 individuals.
data("tutorial")

# Write genotypes in a file called "genotypes.geno".
# Create file:  "genotypes.geno".
write.geno(R, "genotypes.geno")

# Read the file "genotypes.geno".
R = read.geno("genotypes.geno")

Conversion from the geno format to the lfmm format

library(LEA)

# Creation of the genotype file: "genotypes.geno"
# 400 SNPs for 50 individuals.
data("tutorial")
write.geno(R, "genotypes.geno")

# Conversion   from the geno format ("genotypes.geno") 
#              to the lfmm format ("genotypes.lfmm").
# By default,  the name of the output file is the same name 
#              as the input file with a .lfmm extension.
# Create file: "genotypes.lfmm".
output = geno2lfmm("genotypes.geno")

# Conversion   from the geno format ("genotypes.geno") 
#              to the lfmm format with the output file called "plop.lfmm".
# Create file: "plop.lfmm".
output = geno2lfmm("genotypes.geno", "plop.lfmm")

# As force = false and the file "genotypes.lfmm" already exists,
# nothing happens.
output = geno2lfmm("genotypes.geno", force = FALSE)

The ped format is largely used in population genetics. The ped format can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions. LEA includes functions to convert from ped to geno and lfmm formats. The ped format has one row for each individual. Each row contains 6 columns of information for each individual, plus two genotype columns for each SNP. Each column must be separated by spaces or tabulations. Genotype format must be either 0ACGT or 01234, where 0 means missing data. The first 6 columns of the genotype file are: 1st column is family ID, 2nd column is sample ID, 3rd and 4th columns are sample IDs of parents, 5th column is gender (male is 1, female is 2), 6th column is case/control status (1 is control, 2 is case), quantitative trait value or population group label.

The ped format is also described here.

Here is an example of a genotypic matrix in the ped format using 3 individuals and 4 loci:

1 SAMPLE0 0 0 2 2 1 2 3 3 1 1 2 1
2 SAMPLE1 0 0 1 2 2 1 1 3 0 4 1 1
3 SAMPLE2 0 0 2 1 2 2 3 3 1 4 1 1

R code examples:

Conversion from the ped format to the lfmm and geno formats

library(LEA)

# Creation of a file called "example.ped"
# with 4 SNPs for 3 individuals.
data("example.ped")
write.table(example.ped,"example.ped",
	col.names = FALSE, row.names = FALSE, quote = FALSE)

# Conversion 	from the ped format ("example.ped") 
#		to the lfmm format ("example.lfmm").
# By default, 	the name of the output file is the same name 
#		as the input file with a .lfmm extension.
# Create file:	"example.lfmm".
output = ped2lfmm("example.ped")

# Conversion 	from the ped format ("example.ped") 
#		to the geno format ("example.geno").
# By default, 	the name of the output file is the same name 
#		as the input file with a .geno extension.
# Create file:	"example.geno".
output = ped2geno("example.ped")

# Conversion 	from the ped format ("example.ped") 
#		to the geno format with the output file called "plop.geno".
# Create file:	"plop.geno".
output = ped2geno("example.ped", "plop.geno")

# As force = false and the file "example.geno" already exists,
# nothing happens.
output = ped2geno("example.ped", force = FALSE)

The ancestrymap format is widely used in population genetics. The ancestrymap format can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions. LEA includes functions to convert ancestrymap data to geno and lfmm formats. The ancestrymap format has one row for each genotype. Each row has 3 columns: the 1st column is the SNP name, the 2nd column is the sample ID, the 3rd column is the number of alleles. Genotypes for a given SNP name are written in consecutive lines. The number of alleles can be the number of reference alleles or the number of derived alleles Missing genotypes are encoded by the value 9.

Here is an example of a genotypic matrix using the ancestrymap format with 3 individuals and 4 loci:

rs0000  SAMPLE0 1
rs0000  SAMPLE1 1
rs0000  SAMPLE2 2
rs1111  SAMPLE0 0
rs1111  SAMPLE1 1
rs1111  SAMPLE2 0
rs2222  SAMPLE0 0
rs2222  SAMPLE1 9
rs2222  SAMPLE2 1
rs3333  SAMPLE0 1
rs3333  SAMPLE1 2
rs3333  SAMPLE3 1

R code examples:

Conversion from the ancestymap format to the lfmm and geno formats

library(LEA)

# Creation of a file called "example.ancestrymap"
# with 4 SNPs for 3 individuals.
data("example.ancestrymap")
write.table(example.ancestrymap,"example.ancestrymap",
col.names = FALSE, row.names = FALSE, quote = FALSE)

# Conversion 	from the ancestrymap format ("example.ancestrymap") 
#		to the lfmm format ("example.lfmm").
# By default,   the name of the output file is the same name
#               as the input file with a .lfmm extension.
# Create file:	"example.lfmm".
output = ancestrymap2lfmm("example.ancestrymap")
                        
# Conversion 	from the ancestrymap format ("example.ancestrymap") 
#		to the geno format ("example.geno").
# By default,   the name of the output file is the same name
#               as the input file with a .geno extension.
# Create file:	"example.geno".
output = ancestrymap2geno("example.ancestrymap")

# Conversion    from the ancestrymap format ("example.ancestrymap") 
#               to the geno format with the output file called "plop.geno".
# Create file:	"plop.geno".
output = ancestrymap2geno("example.ancestrymap", "plop.geno")

# As force = false and the file "example.geno" already exists,
# nothing happens.
output = ancestrymap2geno("example.ancestrymap", force = FALSE)

The vcf format is often used in population genetics. The vcf format can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions. LEA includes functions to convert the vcf format to the geno format. The vcf format is described here.

Here is an example of a genotypic matrix using the vcf format with 3 individuals and 4 loci:

##fileformat=VCFv4.1 
##FORMAT= 
##INFO= 
##INFO= 
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE0 SAMPLE1 SAMPLE2 
1 1001 rs0000 T C 999 . VM=1;SM=100 GT:GM 1/0:1 0/1:2 1/1:3 
1 1002 rs1111 G A 999 . VM=2;SM=101 GT:GM 0/0:6 0/1:7 0/0:8 
1 1003 notres G AA 999 . VM=3;SM=102 GT:GM 0/0:11 ./.:12 0/1:13 
1 1004 rs2222 G A 999 . VM=3;SM=102 GT:GM 0/0:11 . 1/0:13
1 1003 notres GA A 999 . VM=3;SM=102 GT:GM 0/0:11 ./.:12 0/1:13 
1 1005 rs3333 G A 999 . VM=3;SM=102 GT:GM 1/0:11 1/1:12 0/1:13

R code examples:

Conversion from the vcf format to the geno format

library(LEA)

# creation of a file called "example.vcf"
# with 4 SNPs for 3 individuals.
data("example.vcf")
write.table(example.vcf,"example.vcf",col.names =
c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
  "FORMAT", "SAMPLE0", "SAMPLE1", "SAMPLE2"),
row.names = FALSE, quote = FALSE)

# Conversion 	from the vcf format ("example.vcf") 
# 		to the geno format ("example.geno").
# By default,   the name of the output file is the same name
#               as the input file with a .geno extension.
# Create files:	"example.geno",
#		"example.vcfsnp" - SNP information,
#		"example.removed" - removed lines. 
output = vcf2geno("example.vcf")

# Conversion    from the vcf format ("example.vcf")
#               to the geno format with the output file called "plop.geno".
# Create files:	"plop.geno",
#		"plop.vcfsnp" - SNP information,
#		"plop.removed" - removed lines. 
output = vcf2geno("example.vcf", "plop.geno")

# As force = false and the file "example.geno" already exists,
# nothing happens.
output = vcf2geno("example.vcf", force = FALSE)

The function pca performs a principal component analysis of a genotypic matrix using the lfmm, geno, ancestrymap, ped or vcf format. The function computes eigenvalue, eigenvector, and standard deviation for each principal component and the projection of each individual on each component. The function pca returns an object of class "pca-class" containing the output data and the input parameters.

The function tracyWidom performs Tracy-Widom tests on the covariance matrix eigenvalues.

R code examples:
(Caution: the examples have to be executed in the same order as in the tutorial.)

Perform a PCA

library(LEA)

# Creation of the genotype file: "genotypes.lfmm"
# 400 SNPs for 50 individuals.
data("tutorial")
write.lfmm(R, "genotypes.lfmm")

# run of PCA
# Available options, K (the number of PCs calculated), 
#                    center and scale. 
# Creation of   genotypes.pcaProject - the pcaProject object.
#               a directory genotypes.pca containing:
# Create files: genotypes.eigenvalues - eigenvalues,        
#               genotypes.eigenvectors - eigenvectors,
#               genotypes.sdev - standard deviations,
#               genotypes.projections - projections,
# Create a pcaProject object: pc.
pc = pca("genotypes.lfmm", scale = TRUE)

Display information

# Display information about the analysis.

# Summarize the analysis.
summary(pc)

Graphical outputs

par(mfrow=c(2,2))

# Plot eigenvalues.
plot(pc, lwd=5, col="red",xlab=("PCs"),ylab="eigen")

# PC1-PC2 plot.
plot(pc$projections)
# PC3-PC4 plot.
plot(pc$projections[,3:4])

# Plot standard deviations.
plot(pc$sdev)

Perform Tracy-Widom tests

# Perfom Tracy-Widom tests on all eigenvalues.
# create file: tuto.tracyWidom - tracy-widom test information.	
tw = tracy.widom(pc)

# plot the percentage of variance explained by each component
plot(tw$percentage)

# display the p-values for the Tracy-Widom tests. 
tw$pvalues

The snmf function estimates ancestry coefficients using the sparse non-negative matrix factorization. The function returns a project object containing all runs of the snmf program for the input data. It can be useful to perform several runs of snmf for various numbers of ancestral populations (K).

This tutorial describes how to analyze a genotypic dataset, manage an snmf project, run snmf with advanced options, and perform post-treatments of the results.

R code examples:
(Caution: the examples have to be executed in the same order as in the tutorial.)

Example of analyses using snmf

library(LEA)

# Creation of the genotype file: genotypes.geno.
# The data contain 400 SNPs for 50 individuals.
data("tutorial")
write.geno(tutorial.R, "genotypes.geno")

################
# running snmf #
################

project = snmf("genotypes.geno",
                K = 1:10, 
                entropy = TRUE, 
                repetitions = 10,
                project = "new")

# plot cross-entropy criterion of all runs of the project
plot(project, cex = 1.2, col = "lightblue", pch = 19)

Post-treatments

# show the project
show(project)

# summary of the project
summary(project)

# plot cross-entropy criterion of all runs of the project
plot(project, cex = 1.2, col = "lightblue", pch = 19)

# get the cross-entropy of all runs for K = 4
ce = cross.entropy(project, K = 4)

# select the run with the lowest cross-entropy for K = 4
best = which.min(ce)

# display the Q-matrix
Q.matrix <- as.qmatrix(Q(project, K = 4, run = best))
my.colors <- c("tomato", "lightblue", "olivedrab", "gold")

barplot(Q.matrix, 
        border = NA, 
        space = 0, 
        col = my.colors, 
        xlab = "Individuals",
        ylab = "Ancestry proportions", 
        main = "Ancestry matrix") -> bp
        
axis(1, at = 1:nrow(Q.matrix), labels = bp$order, las = 3, cex.axis = .4)


# get the ancestral genotype frequency matrix, G, for the 2nd run for K = 4. 
G.matrix = G(project, K = 4, run = 2)

Advanced snmf options


# Q.input.file: init a run with a given ancestry coefficient matrix Q. 
# Here, it is initialized with the Q matrix from the first run with K=4
project = snmf("genotypes.geno",K=4, 
Q.input.file="./genotypes.snmf/K4/run1/genotypes_r1.4.Q")

# I: init the Q matrix of a run from a smaller run with 500 randomly chosen SNPs. 
project = snmf("genotypes.geno", K = 4, I = 500)

# CPU: run snmf with 2 CPUs. 
project = snmf("genotypes.geno", K = 4, CPU=2)

# percentage: run sNMF and calculate the cross-entropy criterion with 10% of masked
# genotypes, instead of 5% of masked genotypes. 
project = snmf("genotypes.geno", K = 4, entropy= TRUE, percentage = 0.1)

# seed: choose the seed to init the randomization.
project = snmf("genotypes.geno", K = 4, seed=42)

# alpha: choose the regularization parameter.
project = snmf("genotypes.geno", K = 4, alpha = 100)

# tolerance: choose the tolerance parameter.
project = snmf("genotypes.geno", K = 4, tolerance = 0.0001)

Manage an snmf project

# snmf runs are automatically saved into an snmf project directory.
# The name of the snmf project file is the same name as 
# the name of the input file with a .snmfProject extension ("genotypes.snmfProject").
# The name of the snmf project directory is the same name as
# the name of the input file with a .snmf extension ("genotypes.snmf/")
# A unique project includes all runs for each input file.

# An snmf project can be load in a different session.
project = load.snmfProject("genotypes.snmfProject")

# An snmf project can be removed.
# Caution: All files associated with the project will be removed.
remove.snmfProject("genotypes.snmfProject")

The R function lfmm fits Latent Factor Mixed Models to the data.

The lfmm function returns a project object containing all lfmm runs. When performing additional runs, the function enables the project to be included as a parameter to add more runs. Performing several runs for various values of the number of latent factors (K) is recommended.

The LEA package includes functions for post-processing lfmm runs.

R code examples:
(Caution: the examples have to be executed in the same order as in the tutorial.)

Example of analyses using lfmm

library(LEA)

data("tutorial")
# Creation of the genotype file: "genotypes.lfmm"
# 400 SNPs for 50 individuals.
write.lfmm(R, "genotypes.lfmm")
# creation of the environment file, gradient.env.
# It contains 1 environmental variable for 50 individuals.
write.env(C, "gradients.env")

################
# running lfmm #
################

# main options, K: (the number of latent factors), 
#           CPU: the number of CPUs.

# Runs with K = 3 and 5 repetitions.
# runs with 6000 iterations 
# including 3000 iterations for burnin.
# Around 30 seconds per run.
project = lfmm( "genotypes.lfmm", 
                "gradients.env", 
                 K = 3, 
                 repetitions = 5, 
                 project = "new")

# get adjusted p-values using all runs
pv = lfmm.pvalues(project, K = 3)

# Evaluate FDR and POWER
for (alpha in c(.05,.1,.15,.2)) {
    # expected FDR
    print(paste("expected FDR:", alpha))
    L = length(pv$pvalues)
    # Benjamini-Hochberg's mehod for an expected FDR = alpha.
    w = which(sort(pv$pvalues) < alpha * (1:L)/L)
    candidates = order(pv$pvalues)[w]

    # estimated FDR and True Positive Rate
    # The targets SNPs are loci 351 to 400
    Lc = length(candidates)
    estimated.FDR = length(which(candidates <= 350))/Lc
    estimated.TPR = length(which(candidates > 350))/50
    print(paste("FDR:", estimated.FDR, "True Positive Rate:", estimated.TPR))
}

Post-treatments

# show the project
show(project)

# summary of the project
summary(project)

# get the z-scores for the 2nd run for K = 3
z = z.scores(project, K = 3, run = 2)

# get the p-values for the 2nd run for K = 6
p = lfmm.pvalues(project, K = 3, run = 2)

Manage an lfmm project

# All runs of lfmm are 
# automatically saved into an lfmm project directory.
# The name of the lfmm project file is a combination of 
# the name of the genotype file and the environment file 
# with a .lfmmProject extension ("genotypes_gradient.lfmmProject").
# The name of the lfmm project directory is the same name as
# the lfmm project file with a .lfmm extension ("genotypes_gradient.lfmm/")
#  A unique project includes all runs for each input file.

# An lfmm project can be loaded in a new session.
project = load.lfmmProject("genotypes_gradients.lfmmProject")

# An lfmm project can be removed.
# Caution: All the files associated with the project will be removed.
remove.lfmmProject("genotypes_gradients.lfmmProject")