Simulating phenotypes and evaluating GWAS methods with naturalgwas

Olivier François

2018-01-15

Summary

Association studies of polygenic traits with genomic loci are notoriously difficult when those studies are conducted at large geographical scales. The difficulty arises as genotype frequencies often vary in geographic space and across distinct environments in a complex fashion. Those large-scale variations are known to yield false positives in association testing approaches. The R package naturalgwas allows researchers to simulate trait association from observed genotypes, modelling gene by environment interactions where environment is derived from a climate database. The R package includes an oracle method correcting false associations by using the true set of confounding variables. The simulated data and the oracle method can be used 1) to compare the ability of association methods to correctly remove confounding factors for a particular data set, 2) to evaluate power to detect causal variants, and 3) to assess the influence of various parameters including the number of causal variants, effect sizes and gene by environment interaction.


Introduction

This vignette presents a short tutorial on how to simulate phenotypes for “natural” organisms, and evaluate genome-wide association studies (GWAS) when field sampling has been performed at a large geographical scale. We illustrate the simulation approach with some phenotypic traits simulated from data of the plant species Arabidopsis thaliana (Atwell et al. 2010). To start with the R package naturalgwas, load the R functions in memory space as follows.

#install.packages("cate")
#devtools::install_github("bcm-uga/lfmm")
library(naturalgwas)
Data files

Running the main simulation function simu_pheno, and running the oracle method requires three files as input to the program: 1) a file encoding individual genotypes, 2) a file with individual geographic coordinates, and 3) a genetic map. Those data must be loaded in the R environment. The simulation could also work without geographic data by specifying a environmental variable. For example, we considered SNP data from the chromosome 5 of European accession/ecotypes of the plant species A. thaliana.

data(A.thaliana)
genotype <- A.thaliana$genotype
chrpos <- A.thaliana$chrpos 

For SNPs, the genotype matrix encodes each individual genotype in a row. Each locus corresponds to a specific column. Genotypes are encoded as 0,1,2 for diploids, and 0,1 for haploids. Those numbers represent the number of reference or derived allele at each particular locus. A. thaliana is a diploid species with very high levels of inbreeding. In our example, 162 genotypes were encoded as haploids considering 53,859 SNPs from chromosome 5 (Atwell et al. 2010). Let’s print some genotypic values for three individuals at the first ten loci. We have a matrix of size 3 times 10 filled with 0 and 1 values.

dim(genotype)
## [1]   162 53859
genotype[sample(162, 3),1:10]
##      V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## [1,]  1  1  1  1  1  1  1  0  1   1
## [2,]  0  0  1  1  1  1  0  0  1   1
## [3,]  0  1  1  1  1  1  1  0  1   1

The coordinate variable is a two-column matrix that contains longitude and latitude for each individual in the sample. Longitude (°E) and latitude (°N) must be encoded in the decimal format. Note that headers should be ignored when loading data into the R program. Let us show the location of the sampling sites in Europe (each point corresponds to an individual in the sample).

library(maps)
coordinates <- A.thaliana$coord
plot(coordinates, 
     pch = 19, cex = .7, col = "green4",
     xlab = "Longitude (°E)", 
     ylab = "Latitude (°N)")
map(add = T, interior = FALSE)

Simulating artificial phenotypes from real genotypes

For simulating artificial phenotypes from real genotypes and accounting for genotype by environment interactions, we need 1) a proxy variable for the enviromnent, 2) a reference set of weakly linked causal loci for which we wish to simulate trait association, 3) and a set of confounder factors typically arising from complex population structure and isolation-by-distance.

  env <- get_climate(coordinates)
  ref.set <- create_refset(chrpos, window = 501)
  confounder <- create_factor(genotype, K = 10)

The above commands create an artificial environmental variable by combining 19 bioclimatic variables extracted from the ‘worldclim’ database, and they take a few minutes to download. The SNP reference set consists of picking one representant SNP in each window of size window. In the third line, 10 confounders are computed from the data. The confounders correspond to the first ten principal components of the genotype matrix using PCA from LEA (Frichot and Francois 2015). The PCA screeplot for the SNP data is displayed.

In addition, the create_factor function estimates the amount of residual variation in the data and an order of magnitude for effect sizes.

  confounder$sigma 
## [1] 81.91204
  confounder$base 
## [1] 93.90588

Simulation of phenotypic traits requires the following arguments: 1) a SNP genotype matrix, 2) a set of confounders created with create_factor, an environmental variable, a reference set of weakly linked SNPs created with create_refset, a number of causal SNPs to be drawn from the reference set, and values for effect size and GxE interaction parameters.

  n.causal <- 6
  eff.size <- 5.0

  sim <- simu_pheno(A.thaliana$genotype, 
                    confounder, 
                    env, 
                    ref.set, ncausal = n.causal, 
                    effect.size = eff.size, gxe = .5)
  
  ss <- sum(apply(genotype[,sim$causal.set], 2, var))
  ss*(eff.size)^2*confounder$base^2/(ss*(eff.size)^2*confounder$base^2 + confounder$sigma^2 + confounder$base^2)
  (ss*(eff.size)^2*confounder$base^2 + confounder$sigma^2)/(ss*(eff.size)^2*confounder$base^2 + confounder$sigma^2 + confounder$base^2)

In this simulation, a polygenic trait is associated with 6 causal variants, sampled from the SNP reference set. Each causal variant is associated with an effect size of 5 units. Gene by environment interactions are modelled by using the env variable created from ‘worldclim’.

  hist(scale(sim$phenotype), main = "Trait value")

To check whether the simulated trait displays any geographic variation, the trait variable can interpolated on a geographic map as follows.

  library(fields)
  fit = fields::Krig(coordinates, scale(sim$phenotype), theta = 10, m = 2)
  surface(fit, extrap = TRUE, xlab = "Longitude", ylab = "Latitude", levels = c(0))
  map(add = TRUE, interior = F)

Running the oracle method

Our next step is to perform a GWAS for the simulated phenotype, and to evaluate the relative power of methods to detect causal variants. To this objective, we use the ‘linear regression’ method as an oracle algorithm. The method is very close to the simulation model, and it becomes an oracle method when the true confounders are provided in arguments.

pv.oracle <- oracle(scale(sim$phenotype), 
                    genotype, 
                    confounder, K = 10)$pv

Now visualize the Manhattan plot for the oracle method. The vertical bars correspond to the causal variants in the simulation.

   plot(-log10(pv.oracle), cex = .4, col = "grey", main = "Manhattan plot (Oracle)")
   points(sim$causal, -log10(pv.oracle)[sim$causal], type = "h", lty = 1, col = "blue")
   abline(-log10(0.05/length(pv.oracle)), 0, lty = 2)

Heritability and mixed linear models

In this section, we used a mixed model approach and the R package rrBLUP to provide an heritability estimate for the simulated trait.

library(rrBLUP)
G <- A.thaliana$genotype
rownames(G) <-  A.thaliana$ecotype.id[,1]
data <- data.frame(y = scale(sim$phenotype), 
                   gid = A.thaliana$ecotype.id[,1])

#predict breeding values and compute h2
kb <- kin.blup(data = data, 
                geno="gid", 
                pheno = "y", 
                K = A.mat(G))

cat("Heritability = ", kb$Vg/(kb$Vg + kb$Ve), "\n")
## Heritability =  0.9366343

To provide a better estimate for the K matrix, we could use a latent factor model as follows.

## fit latent factors using an LFMM
mod.lfmm <- lfmm::lfmm_ridge(Y = A.thaliana$genotype, 
                             X = scale(sim$phenotype), 
                             K = 6)

Kinship <- mod.lfmm$U %*% t(mod.lfmm$U)/nrow(mod.lfmm$U)
rownames(Kinship) <- A.thaliana$ecotype.id[,1]
## predict breeding values and compute h2
kb <- kin.blup(data = data, 
                geno="gid", 
                pheno = "y", 
                K = Kinship)

cat("Heritability = ", kb$Vg/(kb$Vg + kb$Ve), "\n")
## Heritability =  0.03081077

The simulated trait does not exhibit a very high heritability value. Let us perform an association study by using a mixed linear model.

geno <- t(A.thaliana$genotype)
colnames(geno) <- A.thaliana$ecotype.id[,1]
G <- data.frame(marker = 1:nrow(A.thaliana$chrpos),
                A.thaliana$chrpos,
                geno,
                check.names = FALSE)
pheno <- data.frame(line = A.thaliana$ecotype.id[,1], 
                    y = scale(sim$phenotype))

mlm <- GWAS(pheno = pheno, 
            geno = G, 
            K = Kinship,
            n.PC = 5,    
            plot = FALSE)
## [1] "GWAS for trait: y"
## [1] "Variance components estimated. Testing markers."
## Manhattan plot
plot(mlm$y, cex = .4, col = "grey",
     ylab = "-log10(pvalues)",
     main = "Manhattan plot (mlm)" )
points(sim$causal, mlm$y[sim$causal], 
       type = "h", lty = 1, col = "green4")
abline(-log10(0.05/length(pv.oracle)), 0, lty = 2)

The results of a mixed linear model using the LFMM kinskip estimate are very close to the oracle method.

qqplot(-log10(pv.oracle), mlm$y, col = "grey", cex = .4, ylab ="mlm scores")
abline(0,1, col = 'green4')

Let us compare the results of method that estimate confounders at the same time as it computes association with phenotype. The cate method implements latent factor mixed models. Based on a principal component analysis of the genotype data, we evaluate that around \(K = 6\) factors explain population structure, and 6 hidden factors are used in cate. In cate, the pvalues are computed as follows.

  pheno <- scale(sim$phenotype)
  pv.cate <- cate::cate( ~ pheno, 
                   X.data = data.frame(pheno), 
                   Y = genotype, 
                   r = 6, 
                   calibrate = TRUE)$beta.p.value

Now let us visualize the Manhattan plot for the cate method and compare it to the Manhattan plot for the oracle method. The vertical bars correspond to the causal variants in the simulation.

   plot(-log10(pv.cate), cex = .4, col = "grey", main = "Manhattan plot (Latent factor model)" )
   points(sim$causal, -log10(pv.cate)[sim$causal], type = "h", lty = 1, col = "orange" )
    abline(-log10(0.05/length(pv.oracle)), 0, lty = 2)

We eventually compare the results of the latent factor model with the oracle method.

   ## qqplot
   qqplot(-log10(pv.oracle), -log10(pv.cate) , pch = 19, cex = .4, col = "grey" )
   abline( 0, 1, lwd = 2, col = "orange" )

In this situation, the latent factor model and the oracle method performed similarly. Other testing methods could be evaluated similarly (glm, mixed models, etc). We compare the mixed modeling approach and cate below.

   ## qqplot
   qqplot(mlm$y, -log10(pv.cate) , pch = 19, cex = .4, col = "grey" )
   abline( 0, 1, lwd = 2, col = "orange" )

Comparison with LFMM

For comparisons with cate, we also use \(K = 6\) factors in lfmm.

lfmm.mod <- lfmm::lfmm_ridge(Y = genotype, 
                             X = scale(sim$phenotype), 
                             K = 6)

We compute the pvalues as follows

  p <- lfmm::lfmm_test(Y = genotype, 
                       X = scale(sim$phenotype), 
                       lfmm = lfmm.mod, 
                       calibrate = "gif")
  pv.lfmm <- p$calibrated.pvalue

Now let us visualize the Manhattan plots for lfmm and cate. The vertical bars correspond to the causal variants in the simulation.

   plot( -log10(pv.lfmm), cex = .4, col = "grey", main = "Manhattan plot (lfmm)" )
   points( sim$causal, -log10(pv.lfmm)[sim$causal], type = "h", lty = 1, col = "blue" )
   abline(-log10(0.05/length(pv.oracle)), 0, lty = 2)

We eventually compare the results of the latent factor model with the oracle method.

   ## plot
   qqplot(-log10(pv.oracle), -log10(pv.lfmm) , pch = 19, cex = .4, col = "grey" )
   abline( 0, 1, lwd = 2, col = "blue" )

conditional tests LFMM

Based on previous analysis, we understood that only a handful of SNP could be detected with those simulation parameters. Here, we experimented forward conditional tests, in which the top hits are recursively included as covariates in an LFMM estimation/testing approach (12 cycles).

cond.test <- lfmm::forward_test(Y = A.thaliana$genotype,
                                X = scale(sim$phenotype),
                                K = 6,
                                niter = 12)

We compare the results with the truth.

# proposed candidates
cat("Candidate loci:")
## Candidate loci:
sort(cond.test$candidates)
##  [1]  8791 19536 19551 19588 19646 19663 19702 19703 19721 22796 26303
## [12] 34820
# truth
cat("Causal SNPs:")
## Causal SNPs:
sim$causal.set
## [1]  8768 22796 26303 29309 34820 46844

Package reference

References