LEA: an R package for Landscape and Ecological Associations studies






LEA contains two main functions, snmf() and lfmm(). For an overview of the package, the best introduction is the package vignette (latest release from GitHub).

A brief PDF presentation of the statistical methods implemented in LEA for inferring population structure and local adaptation is available here.







The lfmm format can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions. An lfmm file has one row per individual. Each row contains one value per locus (separated by spaces or tabs), giving the number of alleles at that locus. The count can refer to either the reference allele or the derived allele. Missing genotypes are encoded by the value -9 or 9. We provide examples of conversion from the lfmm format to the geno format, and conversion from the STRUCTURE format to the lfmm and geno formats.

Here is an example of a genotypic matrix using the lfmm format with 3 individuals and 4 loci:

1 0 0 1
1 1 9 2
2 0 1 1
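
To make the layout concrete, here is a minimal base R sketch that parses the matrix above without calling LEA. It is not LEA's own reader: the variable names are illustrative, and recoding 9 and -9 as NA is an assumption made here for convenience.

```r
# The 3 x 4 example above, held in a string instead of a file (illustration only).
lfmm_text <- "1 0 0 1
1 1 9 2
2 0 1 1"

# One row per individual, one column per locus.
geno <- as.matrix(read.table(text = lfmm_text))
dimnames(geno) <- NULL

# Assumption: recode the missing-data codes (9 and -9) as NA for later R code.
geno[geno %in% c(9, -9)] <- NA

dim(geno)                           # individuals x loci
which(is.na(geno), arr.ind = TRUE)  # position of the missing genotype
```

For real files, the read.lfmm() and write.lfmm() functions shown below are the intended route.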

R code examples:

  • Read and write data in the lfmm format
    library(LEA)
    
    
    # "tutorial" contains a genotype matrix (R, 400 SNPs) 
    #  and an environmental variable (C) for 50 individuals.
    data("tutorial")
    
    # write R in a file called "genotypes.lfmm"
    # Create file:	"genotypes.lfmm".
    write.lfmm(R, "genotypes.lfmm")
    
    # read the file "genotypes.lfmm".
    R = read.lfmm("genotypes.lfmm")

  • Conversion from the lfmm format to the geno format
    library(LEA)
    
    
    # "tutorial" contains a genotype matrix (R, 400 SNPs) 
    #  and an environmental variable (C) for 50 individuals.
    data("tutorial")
    write.lfmm(R, "genotypes.lfmm")
    
    # Conversion 	from the lfmm format ("genotypes.lfmm") 
    #		to the geno format ("genotypes.geno").
    # By default, 	the name of the output file is the same name 
    #		as the input file with a .geno extension.
    # Create file:	"genotypes.geno".
    output = lfmm2geno("genotypes.lfmm")
    
    # Conversion 	from the lfmm format ("genotypes.lfmm") 
    #		to the geno format with the output file called "plop.geno".
    # Create file:	"plop.geno".
    output = lfmm2geno("genotypes.lfmm", "plop.geno")
    
    # As force = FALSE and the file "genotypes.geno" already exists,
    # nothing happens.
    output = lfmm2geno("genotypes.lfmm", force = FALSE)

  • To convert genotype matrices from the STRUCTURE or the TESS format to the lfmm and geno formats, use the struct2geno() function. Follow the explanations given in our introductory document on how to run structure-like analyses using R.

     
    ### If struct2geno is not installed (older versions) 
    ##source("http://membres-timc.imag.fr/Olivier.Francois/Conversion.R")
     
    ### Artificial data with 10 diploid individuals and 10 STR markers
    ### STRUCTURE file: 'dat.str'
    
    library(LEA)
    dat.str  <- matrix(sample(c(100:120,-9), 200, replace = TRUE), 
                       nrow = 10, ncol = 20)
    write.table(dat.str, 
                file = "dat.str", 
                col.names = FALSE, 
                row.names = FALSE, 
                quote = FALSE)
    
    ### Conversion 
    struct2geno("dat.str", ploidy = 2, FORMAT = 1)
    
    ### snmf run and barplot
    ### (K is passed to Q() explicitly; t() puts individuals on the x axis)
    s  <- snmf("dat.str.geno", K = 2, project = "new")
    barplot(t(Q(s, K = 2)), xlab = "Individuals")
     

The "env" format can be used as an input format for the environmental variables in the lfmm function. An env format file has one row for each individual. Each row contains one value for each environmental variable (separated by spaces or tabulations).

Here is an example of an environmental file using the "env" format with 3 individuals and 2 variables:

0.252477 0.95250639
0.216618 0.10902647
-0.47509 0.07626694

R code examples:

  • Read and write environmental variables in the env format
    library(LEA)
    
    # Creation of an environmental matrix C 
    # containing 2 environmental variables for 3 individuals.
    # C contains one line for each individual and one column for each variable.
    C = matrix(runif(6), ncol=2, nrow=3)
    
    # Write C in a file called "tuto.env".
    # Create file:	"tuto.env".
    write.env(C,"tuto.env")
    
    # Read the file "tuto.env".
    C = read.env("tuto.env")

The geno format can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions. The geno format has one row for each locus. Each row contains one character per individual, with no separator: 0, 1 or 2 gives the number of copies of the reference allele, and 9 means missing data.

Here is an example of a genotypic matrix using the geno format with 3 individuals and 4 loci:

112
010
091
121
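
The geno layout is the transpose of the lfmm layout, with the separators removed. The base R sketch below illustrates this relationship on the example above. It is only an illustration with hypothetical variable names, not the code used by geno2lfmm().

```r
# The geno example above: one string per locus, one character per individual.
geno_lines <- c("112", "010", "091", "121")

# Split each line into single characters and convert them to integers.
per_locus <- lapply(strsplit(geno_lines, ""), as.integer)

# Stack loci as rows (geno layout), then transpose to get the lfmm layout.
geno_matrix <- do.call(rbind, per_locus)  # loci x individuals
lfmm_matrix <- t(geno_matrix)             # individuals x loci

lfmm_matrix
```

The result has one row per individual and one column per locus, matching the lfmm example given for the same data.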

R code examples:

  • Read and write data in the geno format
    library(LEA)
    
    # "tutorial" contains a genotype matrix (R, 400 SNPs) 
    #  and an environmental variable (C) for 50 individuals.
    data("tutorial")
    
    # Write genotypes in a file called "genotypes.geno".
    # Create file:  "genotypes.geno".
    write.geno(R, "genotypes.geno")
    
    # Read the file "genotypes.geno".
    R = read.geno("genotypes.geno")

  • Conversion from the geno format to the lfmm format
    library(LEA)
    
    # Creation of the genotype file: "genotypes.geno"
    # 400 SNPs for 50 individuals.
    data("tutorial")
    write.geno(R, "genotypes.geno")
    
    # Conversion   from the geno format ("genotypes.geno") 
    #              to the lfmm format ("genotypes.lfmm").
    # By default,  the name of the output file is the same name 
    #              as the input file with a .lfmm extension.
    # Create file: "genotypes.lfmm".
    output = geno2lfmm("genotypes.geno")
    
    # Conversion   from the geno format ("genotypes.geno") 
    #              to the lfmm format with the output file called "plop.lfmm".
    # Create file: "plop.lfmm".
    output = geno2lfmm("genotypes.geno", "plop.lfmm")
    
    # As force = FALSE and the file "genotypes.lfmm" already exists,
    # nothing happens.
    output = geno2lfmm("genotypes.geno", force = FALSE)

The ped format is widely used in population genetics. It can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions, and LEA includes functions to convert from ped to the geno and lfmm formats. The ped format has one row for each individual. Each row contains 6 columns of information about the individual, followed by two genotype columns for each SNP. Columns must be separated by spaces or tabs. Genotypes must be coded as either 0ACGT or 01234, where 0 means missing data. The first 6 columns are: 1st column, family ID; 2nd column, sample ID; 3rd and 4th columns, sample IDs of the parents; 5th column, sex (male is 1, female is 2); 6th column, case/control status (1 is control, 2 is case), quantitative trait value or population group label.

The ped format is also described here.

Here is an example of a genotypic matrix in the ped format using 3 individuals and 4 loci:

1 SAMPLE0 0 0 2 2 1 2 3 3 1 1 2 1
2 SAMPLE1 0 0 1 2 2 1 1 3 0 4 1 1
3 SAMPLE2 0 0 2 1 2 2 3 3 1 4 1 1
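
As a rough illustration of this column layout, the base R sketch below separates the 6 information columns from the genotype columns of the example above and flags missing genotypes. The column names and the rule "a genotype is missing if either allele is 0" are assumptions made for this sketch. It does not reproduce what ped2lfmm() or ped2geno() do internally.

```r
# The ped example above, held in a string instead of a file (illustration only).
ped_text <- "1 SAMPLE0 0 0 2 2 1 2 3 3 1 1 2 1
2 SAMPLE1 0 0 1 2 2 1 1 3 0 4 1 1
3 SAMPLE2 0 0 2 1 2 2 3 3 1 4 1 1"

ped <- read.table(text = ped_text, stringsAsFactors = FALSE)

# First 6 columns: information about each individual (names are hypothetical).
info <- ped[, 1:6]
names(info) <- c("family", "sample", "parent1", "parent2", "sex", "phenotype")

# Remaining columns: two allele columns per SNP.
alleles <- as.matrix(ped[, -(1:6)])
n_snps <- ncol(alleles) / 2

# Assumption: a genotype is missing when either of its two alleles is coded 0.
first   <- alleles[, seq(1, ncol(alleles), by = 2), drop = FALSE]
second  <- alleles[, seq(2, ncol(alleles), by = 2), drop = FALSE]
missing <- first == 0 | second == 0

n_snps
which(missing, arr.ind = TRUE)  # individual (row) and SNP (column) with missing data
```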

R code examples:

  • Conversion from the ped format to the lfmm and geno formats
    library(LEA)
    
    # Creation of a file called "example.ped"
    # with 4 SNPs for 3 individuals.
    data("example.ped")
    write.table(example.ped,"example.ped",
    	col.names = FALSE, row.names = FALSE, quote = FALSE)
    
    # Conversion 	from the ped format ("example.ped") 
    #		to the lfmm format ("example.lfmm").
    # By default, 	the name of the output file is the same name 
    #		as the input file with a .lfmm extension.
    # Create file:	"example.lfmm".
    output = ped2lfmm("example.ped")
    
    # Conversion 	from the ped format ("example.ped") 
    #		to the geno format ("example.geno").
    # By default, 	the name of the output file is the same name 
    #		as the input file with a .geno extension.
    # Create file:	"example.geno".
    output = ped2geno("example.ped")
    
    # Conversion 	from the ped format ("example.ped") 
    #		to the geno format with the output file called "plop.geno".
    # Create file:	"plop.geno".
    output = ped2geno("example.ped", "plop.geno")
    
    # As force = FALSE and the file "example.geno" already exists,
    # nothing happens.
    output = ped2geno("example.ped", force = FALSE)

The ancestrymap format is widely used in population genetics. It can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions, and LEA includes functions to convert ancestrymap data to the geno and lfmm formats. The ancestrymap format has one row for each genotype. Each row has 3 columns: the 1st column is the SNP name, the 2nd column is the sample ID, and the 3rd column is the number of alleles. Genotypes for a given SNP name are written on consecutive lines. The number of alleles can be the number of reference alleles or the number of derived alleles. Missing genotypes are encoded by the value 9.

Here is an example of a genotypic matrix using the ancestrymap format with 3 individuals and 4 loci:

rs0000  SAMPLE0 1
rs0000  SAMPLE1 1
rs0000  SAMPLE2 2
rs1111  SAMPLE0 0
rs1111  SAMPLE1 1
rs1111  SAMPLE2 0
rs2222  SAMPLE0 0
rs2222  SAMPLE1 9
rs2222  SAMPLE2 1
rs3333  SAMPLE0 1
rs3333  SAMPLE1 2
rs3333  SAMPLE2 1
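
The ancestrymap layout is a "long" table with one line per (SNP, sample) pair. The base R sketch below reshapes data modeled on the example above, with sample labels SAMPLE0 to SAMPLE2, into an individuals-by-loci matrix. Column and variable names are hypothetical, and this is only an illustration of the layout, not the implementation of ancestrymap2lfmm().

```r
# Long-format data modeled on the example above (illustration only).
am_text <- "rs0000 SAMPLE0 1
rs0000 SAMPLE1 1
rs0000 SAMPLE2 2
rs1111 SAMPLE0 0
rs1111 SAMPLE1 1
rs1111 SAMPLE2 0
rs2222 SAMPLE0 0
rs2222 SAMPLE1 9
rs2222 SAMPLE2 1
rs3333 SAMPLE0 1
rs3333 SAMPLE1 2
rs3333 SAMPLE2 1"

am <- read.table(text = am_text,
                 col.names = c("snp", "sample", "count"),
                 stringsAsFactors = FALSE)

# Keep SNPs and samples in their order of first appearance.
snps    <- unique(am$snp)
samples <- unique(am$sample)

# Fill an individuals x loci matrix from the (sample, snp, count) triplets.
wide <- matrix(NA_integer_, nrow = length(samples), ncol = length(snps),
               dimnames = list(samples, snps))
wide[cbind(match(am$sample, samples), match(am$snp, snps))] <- am$count

wide
```

The missing-data code 9 is kept unchanged here, as in the lfmm format.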

R code examples:

  • Conversion from the ancestrymap format to the lfmm and geno formats
    library(LEA)
    
    # Creation of a file called "example.ancestrymap"
    # with 4 SNPs for 3 individuals.
    data("example.ancestrymap")
    write.table(example.ancestrymap,"example.ancestrymap",
    col.names = FALSE, row.names = FALSE, quote = FALSE)
    
    # Conversion 	from the ancestrymap format ("example.ancestrymap") 
    #		to the lfmm format ("example.lfmm").
    # By default,   the name of the output file is the same name
    #               as the input file with a .lfmm extension.
    # Create file:	"example.lfmm".
    output = ancestrymap2lfmm("example.ancestrymap")
                            
    # Conversion 	from the ancestrymap format ("example.ancestrymap") 
    #		to the geno format ("example.geno").
    # By default,   the name of the output file is the same name
    #               as the input file with a .geno extension.
    # Create file:	"example.geno".
    output = ancestrymap2geno("example.ancestrymap")
    
    # Conversion    from the ancestrymap format ("example.ancestrymap") 
    #               to the geno format with the output file called "plop.geno".
    # Create file:	"plop.geno".
    output = ancestrymap2geno("example.ancestrymap", "plop.geno")
    
    # As force = FALSE and the file "example.geno" already exists,
    # nothing happens.
    output = ancestrymap2geno("example.ancestrymap", force = FALSE)

The vcf format is often used in population genetics. The vcf format can be used as an input format for genotypic matrices in the lfmm, snmf and pca functions. LEA includes functions to convert the vcf format to the geno format. The vcf format is described here.

Here is an example of a genotypic matrix using the vcf format with 3 individuals and 4 loci:

##fileformat=VCFv4.1 
##FORMAT= 
##INFO= 
##INFO= 
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE0 SAMPLE1 SAMPLE2 
1 1001 rs0000 T C 999 . VM=1;SM=100 GT:GM 1/0:1 0/1:2 1/1:3 
1 1002 rs1111 G A 999 . VM=2;SM=101 GT:GM 0/0:6 0/1:7 0/0:8 
1 1003 notres G AA 999 . VM=3;SM=102 GT:GM 0/0:11 ./.:12 0/1:13 
1 1004 rs2222 G A 999 . VM=3;SM=102 GT:GM 0/0:11 . 1/0:13
1 1003 notres GA A 999 . VM=3;SM=102 GT:GM 0/0:11 ./.:12 0/1:13 
1 1005 rs3333 G A 999 . VM=3;SM=102 GT:GM 1/0:11 1/1:12 0/1:13

R code examples:

  • Conversion from the vcf format to the geno format
    library(LEA)
    
    # creation of a file called "example.vcf"
    # with 4 SNPs for 3 individuals.
    data("example.vcf")
    write.table(example.vcf,"example.vcf",col.names =
    c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
      "FORMAT", "SAMPLE0", "SAMPLE1", "SAMPLE2"),
    row.names = FALSE, quote = FALSE)
    
    # Conversion 	from the vcf format ("example.vcf") 
    # 		to the geno format ("example.geno").
    # By default,   the name of the output file is the same name
    #               as the input file with a .geno extension.
    # Create files:	"example.geno",
    #		"example.vcfsnp" - SNP information,
    #		"example.removed" - removed lines. 
    output = vcf2geno("example.vcf")
    
    # Conversion    from the vcf format ("example.vcf")
    #               to the geno format with the output file called "plop.geno".
    # Create files:	"plop.geno",
    #		"plop.vcfsnp" - SNP information,
    #		"plop.removed" - removed lines. 
    output = vcf2geno("example.vcf", "plop.geno")
    
    # As force = FALSE and the file "example.geno" already exists,
    # nothing happens.
    output = vcf2geno("example.vcf", force = FALSE)

The function pca performs a principal component analysis of a genotypic matrix in the lfmm, geno, ancestrymap, ped or vcf format. The function computes the eigenvalue, eigenvector and standard deviation of each principal component, as well as the projection of each individual on each component. The function pca returns an object of class "pca-class" containing the output data and the input parameters.

The function tracyWidom performs Tracy-Widom tests on the covariance matrix eigenvalues.
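
To show which quantities are involved, here is a minimal sketch that uses base R's prcomp() on a simulated genotype matrix. It is not LEA's pca() implementation and does not perform the Tracy-Widom test. The simulated data, the choice to center and scale the loci, and the variable names are assumptions made for illustration.

```r
set.seed(1)

# Simulated genotype matrix: 20 individuals x 50 loci, values 0/1/2 (illustration only).
geno <- matrix(rbinom(20 * 50, size = 2, prob = 0.3), nrow = 20)

# Drop monomorphic loci (zero variance) so that scaling is defined.
geno <- geno[, apply(geno, 2, var) > 0, drop = FALSE]

# PCA on centered and scaled loci.
pc <- prcomp(geno, center = TRUE, scale. = TRUE)

eigenvalues  <- pc$sdev^2    # variance carried by each component
eigenvectors <- pc$rotation  # loadings: loci x components
projections  <- pc$x         # individuals x components

# Proportion of variance explained by the first components.
head(eigenvalues / sum(eigenvalues))
```

Because the loci are scaled to unit variance, the eigenvalues sum to the number of loci retained.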

R code examples:
  (Caution: the examples have to be executed in the same order as in the tutorial.)

The snmf function estimates ancestry coefficients using sparse non-negative matrix factorization. The function returns a project object containing all runs of the snmf program for the input data. It can be useful to perform several runs of snmf for various numbers of ancestral populations (K).

This tutorial describes how to analyze a genotypic dataset, manage an snmf project, run snmf with advanced options, and perform post-treatments of the results.

R code examples:
  (Caution: the examples have to be executed in the same order as in the tutorial.)

The R function lfmm fits Latent Factor Mixed Models to the data.

The lfmm function returns a project object containing all lfmm runs. To add runs to an existing project, pass that project to the function as a parameter. Performing several runs for various values of the number of latent factors (K) is recommended.

The LEA package includes functions for post-processing lfmm runs.
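
One calibration step that is often applied to association z-scores is rescaling by a genomic inflation factor before computing p-values. The base R sketch below illustrates the arithmetic on simulated z-scores. It does not call LEA, the z-scores are not lfmm output, and the variable names are hypothetical.

```r
set.seed(42)

# Simulated z-scores for 1000 loci (illustration only, not LEA output);
# sd > 1 mimics test statistics that are inflated relative to N(0, 1).
z <- rnorm(1000, mean = 0, sd = 1.3)

# Genomic inflation factor: median of z^2 over the median of a chi-square(1).
lambda <- median(z^2) / qchisq(0.5, df = 1)

# Recalibrated p-values from the rescaled chi-square statistics.
p_adj <- pchisq(z^2 / lambda, df = 1, lower.tail = FALSE)

lambda
summary(p_adj)
```

After rescaling, the median p-value is close to 0.5, which is what well-calibrated tests produce when most loci are not associated with the environmental variable.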

R code examples:
  (Caution: the examples have to be executed in the same order as in the tutorial.)