LFMM: Latent Factor Mixed Models

A unified framework for inferring latent confounders and gene-environment (and other) associations

Frequently Asked Questions


Is LFMM restricted to environmental data or could we use it with phenotypes?

LFMM can be used with environmental data (exposure) and with phenotypic traits (outcome). When it is used with phenotypic traits, the effect sizes are not directly interpretable, but the p-values always are. The R package "lfmm" contains statistical tests for phenotypic traits and genotype-phenotype associations. The main feature of LFMMs is the (optimal) estimation of confounders. Once the confounders are estimated, the statistical tests are similar to those of other GWAS methods (options such as linear regression, logistic regression, and linear mixed models are available).

Which version should I use?

The R packages lfmm and LEA (GitHub or Bioconductor) are recommended. The GitHub versions are development versions and can change often. Reinstall them regularly to get the latest changes (using the R command devtools::install_github).


What kind of preprocessing is necessary for LFMM runs?

Removing closely related individuals and applying standard GWAS filters are recommended. Filter out monomorphic loci, and generally keep loci with MAF > 5 percent. You cannot expect significant associations with rare variants unless the data set contains thousands of individuals. Some LD pruning might be useful too (but MAF filters also reduce the impact of LD).
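As an illustration of the MAF filter, here is a minimal base R sketch. It assumes a genotype matrix with individuals in rows and loci in columns, coded as allele counts (0, 1, 2 for diploids). The function name filter_maf and the toy data are hypothetical and are not part of lfmm or LEA.

```r
# Hypothetical helper (not part of lfmm or LEA): keep loci whose minor
# allele frequency is strictly above a threshold.
filter_maf <- function(G, ploidy = 2, threshold = 0.05) {
  # G: genotype matrix, individuals in rows, loci in columns,
  #    coded as allele counts between 0 and ploidy (NA allowed).
  freq <- colMeans(G, na.rm = TRUE) / ploidy   # allele frequency per locus
  maf <- pmin(freq, 1 - freq)                  # minor allele frequency
  keep <- !is.na(maf) & maf > threshold        # monomorphic loci have maf = 0
  G[, keep, drop = FALSE]
}

# Toy data for 20 diploid individuals (made-up values).
G <- cbind(
  mono     = rep(0, 20),                 # monomorphic locus, maf = 0
  rare     = c(1, rep(0, 19)),           # maf = 1/40 = 0.025
  moderate = c(rep(1, 4), rep(0, 16)),   # maf = 4/40 = 0.1
  common   = rep(c(0, 1, 2, 1), 5)       # maf = 20/40 = 0.5
)

G_filtered <- filter_maf(G)
colnames(G_filtered)
```

With these toy data, only the "moderate" and "common" loci pass the filter. Removal of related individuals and LD pruning are not covered by this sketch.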

What do I do with missing data?

The LFMM p-values are biased when genotype data are missing. The best strategy is to use a data imputation method. The R package LEA has an imputation procedure (LEA::impute) that imputes missing data based on ancestral allele frequency and LD. It is fast, and it works without a reference genome or a genetic map.

How do I choose K?

Use the value of K suggested by sNMF or by the PCA scree plot. A precise value of K is not required, because final adjustments are made with standard test recalibration methods. Empirical null hypothesis testing works well when K is reasonably chosen (close to the number of genetic groups in the data).
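To make the recalibration step concrete, here is a minimal base R sketch of one common approach, the genomic inflation factor (GIF). It is shown as an example of a standard recalibration, not as a description of what lfmm or LEA do internally. The function name calibrate_gif and the simulated z-scores are hypothetical.

```r
# Hypothetical helper (not part of lfmm or LEA): recalibrate z-scores with
# the genomic inflation factor and return calibrated p-values.
calibrate_gif <- function(z) {
  # z: numeric vector of z-scores, one per locus
  # GIF = median of the squared z-scores divided by the median of a
  # chi-squared distribution with 1 degree of freedom.
  gif <- median(z^2, na.rm = TRUE) / qchisq(0.5, df = 1)
  pvalue <- pchisq(z^2 / gif, df = 1, lower.tail = FALSE)
  list(gif = gif, pvalue = pvalue)
}

# Simulated null z-scores that are deliberately inflated (made-up data).
set.seed(42)
z <- rnorm(10000, sd = 1.3)

res <- calibrate_gif(z)
res$gif              # estimated inflation, expected near 1.3^2 = 1.69
median(res$pvalue)   # close to 0.5 after recalibration
```

A GIF well above 1 suggests that K is too small (the tests are too liberal), and a GIF well below 1 suggests that K is too large. A histogram of the calibrated p-values should look flat, with a peak near zero if there are true associations.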

Input formats

Can I analyse poolseq data with LFMM?

The best strategy is to re-simulate allelic data from the pooled data. For a pool of m individuals and a derived allele frequency f, draw m artificial samples from the beta distribution with parameters a = mf + 1 and b = m(1 - f) + 1. In R, you could use rbeta(m, m * f + 1, m * (1 - f) + 1). This strategy can recover the power lost by analysing pooled data instead of individual data. Note that the LFMM input format tolerates continuous allele frequency data.
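The re-simulation can be sketched in base R as follows. The sketch assumes a matrix of derived allele frequencies with pools in rows and loci in columns. The function name simulate_from_pools, the toy frequencies and the environmental values are hypothetical. Repeating each pool's environmental value for its artificial samples is an assumption about how the result would be paired with a covariate.

```r
# Hypothetical helper (not part of lfmm or LEA): re-simulate individual-level
# allele frequencies from pooled allele frequencies.
simulate_from_pools <- function(freq, m) {
  # freq: matrix of derived allele frequencies, pools in rows, loci in columns
  # m:    number of individuals per pool (one value, or one value per pool)
  stopifnot(is.matrix(freq), all(freq >= 0 & freq <= 1))
  n_loci <- ncol(freq)
  m <- rep_len(m, nrow(freq))
  blocks <- lapply(seq_len(nrow(freq)), function(i) {
    # Repeat each locus frequency m[i] times so that each column of the
    # block holds m[i] draws from Beta(m f + 1, m (1 - f) + 1).
    f <- rep(freq[i, ], each = m[i])
    matrix(rbeta(m[i] * n_loci, m[i] * f + 1, m[i] * (1 - f) + 1),
           nrow = m[i], ncol = n_loci)
  })
  do.call(rbind, blocks)   # artificial samples in rows, loci in columns
}

# Toy example: 2 pools, 3 loci (made-up frequencies and pool sizes).
set.seed(1)
freq <- matrix(c(0.10, 0.50, 0.90,
                 0.30, 0.60, 0.05), nrow = 2, byrow = TRUE)
m <- c(20, 30)
Y <- simulate_from_pools(freq, m)
dim(Y)

# One environmental value per pool, repeated for each artificial sample.
env_pool <- c(12.5, 18.0)
env <- rep(env_pool, times = m)
```

Here Y has one row per artificial sample (20 + 30 = 50) and one column per locus, with continuous values between 0 and 1.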

Other questions

Please send an email to eric.frichot@gmail.com or olivier.francois@imag.fr.