Warning: this function is in testing. Users are advised to interpret the results with caution.

The importance of recurrently mutated hotspots is widely appreciated in cancer. This tutorial shows how to apply the new sitednds function provided in the latest version of the dNdScv package to estimate dN/dS ratios at single-site level. Sitewise dN/dS estimation has a rich history in comparative genomics (e.g. Massingham and Goldman, 2005) but it has only been used in cancer studies occasionally (e.g. Martincorena et al., 2015). Yet, studying the relative strength of selection at single sites can be valuable, as emphasised by a recent study (Cannataro et al., 2018).

The new sitednds function allows the user to compute maximum-likelihood dN/dS estimates for recurrently mutated sites, as well as p-values against neutrality. Sitewise dN/dS ratios reflect the ratio between the number of observed mutations and the number expected under neutrality, while controlling for trinucleotide rates and for variable mutation rates across genes. In sparse datasets, point estimates for lowly-recurrent sites are likely to be underestimated, but p- and q-values provide a measure of their significance.

An important aspect is that mutation rates can vary considerably across sites, even after correcting for these known mutational biases. sitednds models the observed mutation counts across synonymous sites as following a negative binomial distribution. This effectively controls for Poisson noise in the mutation counts per site and fits a Gamma distribution to the unexplained variation in mutation rate across sites. P-values for site recurrence are calculated using the fitted negative binomial distribution. These p-values should be more conservative and reliable than only considering Poisson variation or non-parametric bootstrapping, but they still rely on the assumption than the Gamma distribution appropriately captures the unexplained variation across sites.

A major limitation is the fact that mapping artefacts and SNP contamination are common problems in cancer genomic datasets, and these tend to lead to recurrent false positive mutation calls. In noisy datasets, the results of sitednds can be dominated by artefacts. Users trying sitednds should be very critical of the results. In the context of cancer genomic studies, a considerable number of synonymous recurrently mutated sites among the significant hits in sitednds most certainly indicates a problem with the variant calling. This is exemplified in this tutorial analysing two real datasets.

Sitewise dN/dS ratios in a cancer dataset

As a small example, in this tutorial we will use public somatic mutation calls from bladder cancers from TCGA. To reduce the risk of false positives and increase the signal to noise ratio, this example will only consider mutations in Cancer Gene Census genes (v81).

library("dndscv")
data("dataset_tcgablca", package="dndscv") # Loading the bladder cancer data
data("cancergenes_cgc81", package="dndscv") # Loading the genes in the Cancer Gene Census (v81)
dndsout = dndscv(mutations, outmats=T, gene_list=known_cancergenes)

The sitednds function takes the output of dndscv as input. In order for the dndsout object to be compatible with sitednds, users must use the “outmats=T” argument in dndscv. After running dndscv, we can evaluate the results at the gene level as explained in the main tutorial of dndscv:

sel_cv = dndsout$sel_cv
print(head(sel_cv, 10), digits = 3) # Printing the top 10 genes by q-value
##           gene_name n_syn n_mis n_non n_spl n_ind wmis_cv wnon_cv wspl_cv
## 564            TP53     5   136    39     5    26 36.1349   101.0   101.0
## 23           ARID1A     6    37    44     6    28  2.8878    34.4    34.4
## 289           KMT2D    10    52    54     9    27  1.7105    18.3    18.3
## 276           KDM6A     4    21    33     9    39  1.8506    24.5    24.5
## 461             RB1     1     8    34    14    18  1.8310    73.0    73.0
## 526           STAG2     7    12    23     2    17  0.8160    12.8    12.8
## 97  CDKN2A.p16INK4a     0    13     3     2     5 15.9320    95.8    95.8
## 96    CDKN2A.p14arf     1    15     0     2     5 17.5753    43.9    43.9
## 332           MLLT3    43     4     0     0     3  0.0614     0.0     0.0
## 156           EP300     2    39    20     1     8  4.7957    17.9    17.9
##     wind_cv  pmis_cv ptrunc_cv pallsubs_cv  pind_cv  qmis_cv qtrunc_cv
## 564  255.74 0.00e+00  0.00e+00    0.00e+00 2.32e-43 0.00e+00  0.00e+00
## 23    39.69 1.38e-03  0.00e+00    0.00e+00 1.68e-24 4.64e-02  0.00e+00
## 289    7.62 6.44e-02  0.00e+00    0.00e+00 8.40e-09 5.67e-01  0.00e+00
## 276   41.97 1.30e-01  0.00e+00    0.00e+00 2.45e-31 6.39e-01  0.00e+00
## 461  112.48 2.93e-01  0.00e+00    0.00e+00 1.02e-25 7.81e-01  0.00e+00
## 526   28.42 6.31e-01  3.02e-13    7.77e-16 5.65e-15 9.47e-01  3.04e-11
## 97    94.31 1.28e-07  2.78e-08    1.93e-11 9.72e-09 9.66e-06  1.68e-06
## 96    98.63 2.26e-09  1.20e-03    8.01e-10 7.80e-09 2.27e-07  2.79e-02
## 332    4.24 2.80e-14  6.69e-04    1.80e-14 4.18e-02 8.44e-12  1.61e-02
## 156    8.24 3.52e-05  3.61e-12    2.37e-11 6.30e-05 1.51e-03  3.11e-10
##     qallsubs_cv pglobal_cv qglobal_cv
## 564    0.00e+00   0.00e+00   0.00e+00
## 23     0.00e+00   0.00e+00   0.00e+00
## 289    0.00e+00   0.00e+00   0.00e+00
## 276    0.00e+00   0.00e+00   0.00e+00
## 461    0.00e+00   0.00e+00   0.00e+00
## 526    7.81e-14   0.00e+00   0.00e+00
## 97     1.29e-09   0.00e+00   0.00e+00
## 96     4.02e-08   2.22e-16   1.67e-14
## 332    1.36e-12   2.70e-14   1.81e-12
## 156    1.39e-09   5.24e-14   3.16e-12

The table above reveals a problem with this dataset. The gene MLLT3 appears as significant in dndscv (i.e. it violates the neutral null model of dN/dS=1), but due to a very large excess of synonymous mutations (notice the high number of synonymous mutations and the very low dN/dS values). We can further confirm that the low dN/dS value in this gene is due to an excess of synonymous mutations and not genuine negative selection by comparing the observed number of synonymous mutations in the gene (43) and the expected number (exp_syn and exp_syn_cv columns below):

print(dndsout$genemuts[dndsout$genemuts$gene_name=="MLLT3",])
##     gene_name n_syn n_mis n_non n_spl  exp_syn  exp_mis   exp_non
## 332     MLLT3    43     4     0     0 1.258753 3.974196 0.2248953
##       exp_spl exp_syn_cv
## 332 0.1507168   3.632818

Thus, MLLT3 is a false positive, most likely due to recurrent artefacts or SNP contamination in the gene. A careful examination of all statistically significant genes in the dataset reveals other likely false positives. As we will see below, this will also affect the sitewise dN/dS analysis.

To run the sitewise dN/dS model on this dataset, we only need to input the dndsout object into the sitednds function. By default, sitednds will calculate sitewise dN/dS ratios and p-values for sites mutated at least two times (use the argument min_recurr to control this). While p-values are only provided for recurrently mutated sites, false discovery adjustment corrects for all possible changes.

hotspots = sitednds(dndsout) # Running sitewise dN/dS
print(hotspots$theta) # Overdispersion (unexplained variation of the mutation rate across sites)
##        MLE    CI95low  CI95_high 
## 0.07261735 0.05468222 0.10069983

You can see that the maximum-likelihood estimate of theta is very low. This reflects considerable variation in the mutation rate across sites, not explained by the trinucleotide context or by the estimated relative mutation rate of the gene. sitednds takes this into account when calculating p-values. If there is large uncertainty in the estimation of theta, users can choose to use the lower bound estimate of theta instead of the maximum-likelihood estimate, when calculating p-values (use the argument theta_option=“conservative” in sitednds).

The main output of sitednds is a table with all hotspots studied, including their position, the gene affected, the aminoacid change induced, the number of times that the mutation was observed, the expected number of mutations at this site by chance under neutrality (mu) and the dN/dS ratio. The table also contains p-values and q-values for the probability of observing that many mutations at the site by chance. Again, please treat these p-values with caution.

print(head(hotspots$recursites,10)) # First 10 lines of the table of hotspots
##    chr       pos ref mut   gene aachange     impact ref3_cod mut3_cod freq
## 1    4   1803568   C   G  FGFR3    S249C   Missense      TCC      TGC   27
## 2   17   7577538   C   T   TP53    R248Q   Missense      CGG      CAG   15
## 3   17  37868208   C   T  ERBB2    S310F   Missense      TCC      TTC   16
## 4   19  45867687   T   C  ERCC2    N238S   Missense      AAC      AGC    9
## 5    9  20414343   A   G  MLLT3    S167S Synonymous      GTA      GCA   11
## 6    9  20414340   G   A  MLLT3    S168S Synonymous      GCA      GTA   12
## 7    3 178936091   G   A PIK3CA    E545K   Missense      TGA      TAA   23
## 8    4 153247289   G   C  FBXW7    R505G   Missense      CCG      CGG    8
## 9   17   7577085   C   T   TP53    E285K   Missense      AGA      AAA   10
## 10   4   1806099   A   G  FGFR3    Y375C   Missense      TAT      TGT    8
##             mu     dnds         pval         qval
## 1  0.004349085 6208.203 7.561905e-37 3.663031e-30
## 2  0.004297050 3490.767 1.034825e-21 2.506377e-15
## 3  0.005381862 2972.949 1.615659e-21 2.608783e-15
## 4  0.001419861 6338.647 3.491422e-18 4.228163e-12
## 5  0.003143547 3499.232 5.292417e-18 5.127356e-12
## 6  0.004582783 2618.496 1.514320e-17 1.222576e-11
## 7  0.022757216 1010.668 2.543084e-17 1.759836e-11
## 8  0.001499587 5334.803 3.117442e-16 1.887634e-10
## 9  0.005642605 1772.231 3.591526e-14 1.933063e-08
## 10 0.002847343 2809.637 4.627725e-14 2.241697e-08

We can choose a significance cutoff (e.g. q-value<0.05) to list the significant hotspots in the dataset:

signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ]
print(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], digits=5)
##      gene aachange     impact freq     dnds       qval
## 1   FGFR3    S249C   Missense   27  6208.20 3.6630e-30
## 2    TP53    R248Q   Missense   15  3490.77 2.5064e-15
## 3   ERBB2    S310F   Missense   16  2972.95 2.6088e-15
## 4   ERCC2    N238S   Missense    9  6338.65 4.2282e-12
## 5   MLLT3    S167S Synonymous   11  3499.23 5.1274e-12
## 6   MLLT3    S168S Synonymous   12  2618.50 1.2226e-11
## 7  PIK3CA    E545K   Missense   23  1010.67 1.7598e-11
## 8   FBXW7    R505G   Missense    8  5334.80 1.8876e-10
## 9    TP53    E285K   Missense   10  1772.23 1.9331e-08
## 10  FGFR3    Y375C   Missense    8  2809.64 2.2417e-08
## 11 PIK3CA    E542K   Missense   17   747.02 7.9963e-08
## 12   TP53    R280T   Missense   10  1453.35 9.1276e-08
## 13  HLA-A     Q78R   Missense    4 13446.24 2.1402e-06
## 14  MLLT3    S166S Synonymous    7  1527.46 1.1661e-05
## 15 NFE2L2     R34G   Missense    5  2050.39 2.0190e-04
## 16   KRAS     G12D   Missense    4  3146.49 5.5540e-04
## 17   NAB2    P211P Synonymous    3  9736.56 5.7996e-04
## 18  FGFR3    G372C   Missense    4  2851.04 7.2814e-04
## 19   HRAS     Q61K   Missense    3  8318.78 8.3064e-04
## 20   TP53    E336*   Nonsense    4  2502.57 1.0945e-03
## 21 PIK3CA   H1047R   Missense    4  2391.66 1.2455e-03
## 22   TP53    S241F   Missense    5  1278.51 1.3452e-03
## 23  MLLT3    S188S Synonymous    5  1091.04 2.7397e-03
## 24   TP53    Q331*   Nonsense    7   531.95 5.6984e-03
## 25  KDM6A    Q562*   Nonsense    8   393.68 1.3510e-02
## 26  ERBB3    V104L   Missense    3  2434.11 2.3575e-02
## 27  MLLT3    S155S Synonymous    4   872.83 4.8176e-02
## 28   TP53    R248W   Missense    4   851.59 4.9561e-02
## 29   TP53    R273C   Missense    4   850.23 4.9561e-02

Careful examination of the significant hotspots reveals many well-known cancer-driver hotspots, including in FGFR3 (e.g. S249, Y375, G372), TP53 (e.g. R248), PIK3CA (e.g. E542, H1047), HRAS (Q61), KRAS (G12), ERBB2 (S310), ERBB3 (V104), etc. Note that the exact aminoacid position affected depends on the exact protein isoform used for annotation (see Ensembl protein IDs in dndsout$annotmuts).

However, the table of significant hotspots also contains a considerable number of likely false positives, including multiple synonymous sites in MLLT3. A proper analysis of these data would require careful reevaluation and improvement of the mutation calls, before repeating this analysis. Significant improvements to somatic mutation calls against recurrent artefacts can be achieved by using an unmatched normal panel and by more stringent filtering of germline SNP contamination.

The TCGA mutation calls used in this example are an old version and it is likely that more recent versions are much less affected by artefacts. However, I decided to use this dataset as an example to highlight the importance of critically examining the results and the impact of recurrent artefacts on driver discovery at gene and site level.

As a final note, users with whole-genome or whole-exome data can run sitednds on all genes. However, given the frequent presence of recurrent artefacts and the sparsity of cancer datasets, the signal-to-noise ratio can be considerably increased by running sitednds on a list of known cancer genes. To do so, I recommend running dndscv on all genes and then running sitednds on a list of genes of interest using the optional gene_list argument in sitednds. Running dndscv on all genes ensures that mutations from all genes are used to estimate the trinucleotide mutation rates, typically increasing their accuracy.

Sitewise dN/dS ratios in normal oesophagus

In a recent study, we sequenced 844 small biopsies of normal oesophageal epithelium from 9 transplant donors to study the extent of mutation and selection in a normal tissue (Martincorena et al., 2018). In this part of the tutorial, we will reanalyse this dataset using dndscv and sitednds. We first run dndscv using the settings from the analysis of normal skin data described in the main dNdScv tutorial.

library("dndscv")
data("dataset_normaloesophagus", package="dndscv") # Loading the mutations in normal oesophagus
mutations = unique(mutations) # Removing duplicate mutations (more conservative)
data("dataset_normalskin_genes", package="dndscv")
dndsout = dndscv(mutations, outmats=T, gene_list=target_genes, max_coding_muts_per_sample=Inf, max_muts_per_gene_per_sample=Inf) # outmats=T is required to run sitednds

We can see the list of genes under positive selection and the global dN/dS values using:

sel_cv = dndsout$sel_cv
print(sel_cv[sel_cv$qglobal_cv<0.05, c(1:6,19)], digits = 3)
##    gene_name n_syn n_mis n_non n_spl n_ind qglobal_cv
## 73      TP53     9   323    50    41    97   0.00e+00
## 48    NOTCH1    29   831   278   134   572   0.00e+00
## 49    NOTCH2     8   225    59    26   127   0.00e+00
## 50    NOTCH3    11   140    93    37    82   0.00e+00
## 27      FAT1    19   127   113    13   166   0.00e+00
## 7     ARID1A     8    28    34     5    68   0.00e+00
## 3      AJUBA     1    22     9     0    30   0.00e+00
## 19      CUL3     7    34     9    10    20   1.33e-13
## 13     CCND1     2    11     9     0     1   1.29e-05
## 55    PIK3CA     2    45     1     0     4   9.14e-04
## 42     KMT2D    21    69    12     4    43   2.14e-03
## 47    NFE2L2     5    27     1     0     8   4.57e-03
## 8      ARID2    12    28     9     2    12   5.13e-03
## 40     KMT2A    15    68     8     2    34   2.73e-02
## 74      TP63     5    29     1     1     6   4.78e-02
print(dndsout$globaldnds, digits = 3)
##      name  mle cilow cihigh
## wmis wmis 2.07  1.90   2.27
## wnon wnon 7.82  6.92   8.83
## wspl wspl 8.31  7.10   9.73
## wtru wtru 7.97  7.14   8.90
## wall wall 2.55  2.34   2.78

To apply the sitednds model, we simply use the following code. Only the top 30 hotspots by q-value are shown, but a total of 133 sites are identified as significant with 5% FDR.

hotspots = sitednds(dndsout) # Running sitewise dN/dS
signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ]
head(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], 30)
##      gene aachange           impact freq      dnds         qval
## 1  NOTCH1    F357S         Missense    5 3279.5143 1.122415e-05
## 2  NOTCH1    N390S         Missense    6 1860.5378 1.122415e-05
## 3  NOTCH1    I471T         Missense    5 1438.3442 3.971835e-04
## 4  NOTCH1    G443D         Missense    5 1162.6781 5.585327e-04
## 5  NOTCH1    G394D         Missense    5 1162.6781 5.585327e-04
## 6  NOTCH1        . Essential_Splice    5 1150.1452 5.585327e-04
## 7  NOTCH1    S385F         Missense    5 1103.8104 5.585327e-04
## 8  NOTCH1    Y550*         Nonsense    4 2014.8063 5.585327e-04
## 9  NOTCH1    N718S         Missense    6  723.4757 5.585327e-04
## 10 NOTCH1    D352Y         Missense    4 1607.8249 1.146733e-03
## 11   TP53    M246V         Missense    5  851.6114 1.235630e-03
## 12 PIK3CA   H1047R         Missense    5  851.6114 1.235630e-03
## 13 NOTCH1    I471F         Missense    4 1091.4795 3.997529e-03
## 14 NOTCH1    G367D         Missense    4  986.9852 4.968849e-03
## 15   TP53    T125T       Synonymous    4  955.4674 4.968849e-03
## 16 NOTCH1    V717V       Synonymous    3 2260.2324 4.968849e-03
## 17 NOTCH1    D469Y         Missense    4  934.8717 4.968849e-03
## 18 NOTCH1    D412Y         Missense    4  934.8717 4.968849e-03
## 19 NOTCH1    W327C         Missense    4  934.8717 4.968849e-03
## 20 NOTCH1        . Essential_Splice    4  920.1162 4.968849e-03
## 21 NOTCH1    N431H         Missense    3 2174.7189 4.968849e-03
## 22 NOTCH1    G384S         Missense    4  861.3480 5.901671e-03
## 23 NOTCH1    C440R         Missense    3 1938.6707 6.244907e-03
## 24 NOTCH1    E455K         Missense    6  388.0714 6.244907e-03
## 25   TP53    L257V         Missense    3 1775.2181 7.617893e-03
## 26   TP53    Y220C         Missense    5  466.5666 9.513535e-03
## 27 NOTCH1    G310R         Missense    6  345.8816 1.015569e-02
## 28   TP53    H193R         Missense    4  681.2891 1.139337e-02
## 29   TP53    R273C         Missense    6  320.6674 1.374732e-02
## 30 NOTCH1   C1238*         Nonsense    3 1366.7624 1.374732e-02

Remarkably, owing to the very large number of mutant clones identified in this study, this analysis finds a large number of statistically-significant sites (n=81 for qval<0.05). Reassuringly, they are all in genes detected under positive selection in the original publication (Martincorena et al, 2018). This comprises 61 sites in NOTCH1, 17 sites in TP53, the well-known PIK3CA hotspot H1047R and one site in FAT1 and TP63.

This analysis also identifies a known driver hotspot in a synonymous site of TP53 (T125T), which is known to affect splicing of TP53. Intriguingly, it also identifies a synonymous site in NOTCH1 (V717V), which deserves careful follow-up analysis. Apart from these two synonymous sites, all other 79 significant hotspots are non-synonymous.

References

  • Martincorena I, et al. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. 171(5):1029-1041. doi:10.1016/j.cell.2017.09.042.
  • Massingham T, Goldman N. (2005) Detecting amino acid sites under positive selection and purifying selection. Genetics. 169(3):1753-62. doi:10.1534/genetics.104.032144.
  • Martincorena I, et al. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. Science. 348(6237):880-6. doi:10.1126/science.aaa6806.
  • Cannataro VL, Gaffney SG, Townsend JP. (2018) Effect Sizes of Somatic Mutations in Cancer. J Natl Cancer Inst. doi:10.1093/jnci/djy168.
  • Martincorena I, Fowler JC, et al. (2018) Somatic mutant clones colonize the human esophagus with age. Science. doi:10.1126/science.aau3879.