A tutorial for spatial factor analysis (spfa)

Overview

Spatial factor analysis (spfa) incorporates geographic information in a factor analysis by modelling spatial dependencies in an explicit way. In spfa, inference is performed in a matrix factorization framework similar to principal component analysis (PCA).

Like PCA, spfa can be useful for an exploratory data analysis of population genetic structure and for investigating the scale of isolation by distance (IBD). In population genetic analysis, spfa corrects principal components for effects of spatial autocorrelation and IBD.

The spfa method was described in a publication by Frichot et al. (2012): http://journal.frontiersin.org/article/10.3389/fgene.2012.00254/full.

Input and output data

The input data could be any genotypic matrix that would work with a PCA. Our example has SNP genotypes encoded as 0, 1, 2, with individuals as rows. The script requires geographic coordinates for each individual, and a ‘scale’ parameter - for which we suggest to try several values. The scale parameter corresponds the decay of correlation of gene frequencies in space, and it is expressed in units of the average distance between sampling sites. For an average distance equal to \(m\) kilometers and a scale parameter equal to \(\sigma\), correlation disappears at a distance around \(10/ m \sigma\).

The output is a score matrix U, which contains factors corrected for spatial autocorrelation.

Example data

The example reproduces the analysis presented in the Figure 8 of (Frichot et al. 2012). The genetic data consist of SNPs genotyped in Asian populations. In our study, spfa provided evidence of a major discontinuity separating two clusters, one in Central Asia and one in East-Asia. Uyghur and Hazara populations aligned with the two main clusters and were placed in an intermediate position, suggesting genetic admixture from ancestral Central Asian and East-Asian gene pools.

R codes

For running the tutorial, open an R session and source our R script below. The script starts by computing a distance matrix using the geographic coordinates (lon, lat) of individuals. This steps requires the “fields” package to be installed. Remove the comment character below to install “fields” on your computer OS.

source("http://membres-timc.imag.fr/Olivier.Francois/spfa.R")
#install.packages("fields") #remove the comments to install the 'fields' package

Reading the data.

The data consist of 10667 SNP genotypes in a matrix format and 427 individual geographic coordinates encoded as (longitude,latitude) in a decimal format. Individual genotypes are represented as rows in the data matrix.

genotype = read.table("http://membres-timc.imag.fr/Olivier.Francois/genotype_Asia.txt")
coord = read.table("http://membres-timc.imag.fr/Olivier.Francois/coord_Asia.txt")[,1:2]

PCA and spfa plots.

We start by running a basic PCA and displaying a classic PC plot. In this plot, the V shape is the signature of isolation by distance. This is the effect we would like to correct.

pc = prcomp(genotype)
plot(pc$x, col = "blue", pch = 19, cex = .5)

Now, we use the spfa function with a small scale value. The number of factors, \(K\), must be greater than the number of genetic clusters in the data. We use \(K = 5\) factors. The results look like the PC plot.

obj1 = spfa(genotype, coord, K = 5, scale = 0.000001)

plot(obj1$U, main ="scale 1e-6", col = "lightblue", pch = 19, cex = .5)

obj1$mean # the average between-individual distance (km)

## [1] 2981.259

Let us use an intermediate scale value. We can observe that the V shape is removed. Admixed individuals are at the center of the plot. The scale (1e-5) at which IBD disappears corresponds to a distance of around 335.4 km.

obj2 = spfa(genotype, coord, K = 5, scale = 0.00001)
plot(obj2$U, main ="scale 1e-5",  col = "green", pch = 19, cex = .5)

Using a larger scale value, all factors get flattened, and no residual signal is observed. So the interpretable value was in the previous parameter.

obj3 = spfa(genotype, coord, K = 5, scale = 0.0001)
plot(obj3$U,  main ="scale 1e-4",  col = "orange", pch = 19, cex = .5)