Warning: this function is in testing. Users are advised to interpret the results with caution.
The importance of recurrently mutated hotspots is widely appreciated in cancer. This tutorial shows how to apply the new sitednds function provided in the latest version of the dNdScv package to estimate dN/dS ratios at single-site level. Sitewise dN/dS estimation has a rich history in comparative genomics (e.g. Massingham and Goldman, 2005) but it has only been used in cancer studies occasionally (e.g. Martincorena et al., 2015). Yet, studying the relative strength of selection at single sites can be valuable, as emphasised by a recent study (Cannataro et al., 2018).
The new sitednds function allows the user to compute maximum-likelihood dN/dS estimates for recurrently mutated sites, as well as p-values against neutrality. Sitewise dN/dS ratios reflect the ratio between the number of observed mutations and the number expected under neutrality, while controlling for trinucleotide rates and for variable mutation rates across genes. In sparse datasets, point estimates for lowly-recurrent sites are likely to be underestimated, but p- and q-values provide a measure of their significance.
An important aspect is that mutation rates can vary considerably across sites, even after correcting for these known mutational biases. sitednds models the observed mutation counts across synonymous sites as following a negative binomial distribution. This effectively controls for Poisson noise in the mutation counts per site and fits a Gamma distribution to the unexplained variation in mutation rate across sites. P-values for site recurrence are calculated using the fitted negative binomial distribution. These p-values should be more conservative and reliable than only considering Poisson variation or non-parametric bootstrapping, but they still rely on the assumption than the Gamma distribution appropriately captures the unexplained variation across sites.
A major limitation is the fact that mapping artefacts and SNP contamination are common problems in cancer genomic datasets, and these tend to lead to recurrent false positive mutation calls. In noisy datasets, the results of sitednds can be dominated by artefacts. Users trying sitednds should be very critical of the results. In the context of cancer genomic studies, a considerable number of synonymous recurrently mutated sites among the significant hits in sitednds most certainly indicates a problem with the variant calling. This is exemplified in this tutorial analysing two real datasets.
As a small example, in this tutorial we will use public somatic mutation calls from bladder cancers from TCGA. To reduce the risk of false positives and increase the signal to noise ratio, this example will only consider mutations in Cancer Gene Census genes (v81).
library("dndscv")
data("dataset_tcgablca", package="dndscv") # Loading the bladder cancer data
data("cancergenes_cgc81", package="dndscv") # Loading the genes in the Cancer Gene Census (v81)
dndsout = dndscv(mutations, outmats=T, gene_list=known_cancergenes)
The sitednds function takes the output of dndscv as input. In order for the dndsout object to be compatible with sitednds, users must use the “outmats=T” argument in dndscv. After running dndscv, we can evaluate the results at the gene level as explained in the main tutorial of dndscv:
sel_cv = dndsout$sel_cv
print(head(sel_cv, 10), digits = 3) # Printing the top 10 genes by q-value
## gene_name n_syn n_mis n_non n_spl n_ind wmis_cv wnon_cv wspl_cv
## 564 TP53 5 136 39 5 26 36.1349 101.0 101.0
## 23 ARID1A 6 37 44 6 28 2.8878 34.4 34.4
## 289 KMT2D 10 52 54 9 27 1.7105 18.3 18.3
## 276 KDM6A 4 21 33 9 39 1.8506 24.5 24.5
## 461 RB1 1 8 34 14 18 1.8310 73.0 73.0
## 526 STAG2 7 12 23 2 17 0.8160 12.8 12.8
## 97 CDKN2A.p16INK4a 0 13 3 2 5 15.9320 95.8 95.8
## 96 CDKN2A.p14arf 1 15 0 2 5 17.5753 43.9 43.9
## 332 MLLT3 43 4 0 0 3 0.0614 0.0 0.0
## 156 EP300 2 39 20 1 8 4.7957 17.9 17.9
## wind_cv pmis_cv ptrunc_cv pallsubs_cv pind_cv qmis_cv qtrunc_cv
## 564 255.74 0.00e+00 0.00e+00 0.00e+00 2.32e-43 0.00e+00 0.00e+00
## 23 39.69 1.38e-03 0.00e+00 0.00e+00 1.68e-24 4.64e-02 0.00e+00
## 289 7.62 6.44e-02 0.00e+00 0.00e+00 8.40e-09 5.67e-01 0.00e+00
## 276 41.97 1.30e-01 0.00e+00 0.00e+00 2.45e-31 6.39e-01 0.00e+00
## 461 112.48 2.93e-01 0.00e+00 0.00e+00 1.02e-25 7.81e-01 0.00e+00
## 526 28.42 6.31e-01 3.02e-13 7.77e-16 5.65e-15 9.47e-01 3.04e-11
## 97 94.31 1.28e-07 2.78e-08 1.93e-11 9.72e-09 9.66e-06 1.68e-06
## 96 98.63 2.26e-09 1.20e-03 8.01e-10 7.80e-09 2.27e-07 2.79e-02
## 332 4.24 2.80e-14 6.69e-04 1.80e-14 4.18e-02 8.44e-12 1.61e-02
## 156 8.24 3.52e-05 3.61e-12 2.37e-11 6.30e-05 1.51e-03 3.11e-10
## qallsubs_cv pglobal_cv qglobal_cv
## 564 0.00e+00 0.00e+00 0.00e+00
## 23 0.00e+00 0.00e+00 0.00e+00
## 289 0.00e+00 0.00e+00 0.00e+00
## 276 0.00e+00 0.00e+00 0.00e+00
## 461 0.00e+00 0.00e+00 0.00e+00
## 526 7.81e-14 0.00e+00 0.00e+00
## 97 1.29e-09 0.00e+00 0.00e+00
## 96 4.02e-08 2.22e-16 1.67e-14
## 332 1.36e-12 2.70e-14 1.81e-12
## 156 1.39e-09 5.24e-14 3.16e-12
The table above reveals a problem with this dataset. The gene MLLT3 appears as significant in dndscv (i.e. it violates the neutral null model of dN/dS=1), but due to a very large excess of synonymous mutations (notice the high number of synonymous mutations and the very low dN/dS values). We can further confirm that the low dN/dS value in this gene is due to an excess of synonymous mutations and not genuine negative selection by comparing the observed number of synonymous mutations in the gene (43) and the expected number (exp_syn and exp_syn_cv columns below):
print(dndsout$genemuts[dndsout$genemuts$gene_name=="MLLT3",])
## gene_name n_syn n_mis n_non n_spl exp_syn exp_mis exp_non
## 332 MLLT3 43 4 0 0 1.258753 3.974196 0.2248953
## exp_spl exp_syn_cv
## 332 0.1507168 3.632818
Thus, MLLT3 is a false positive, most likely due to recurrent artefacts or SNP contamination in the gene. A careful examination of all statistically significant genes in the dataset reveals other likely false positives. As we will see below, this will also affect the sitewise dN/dS analysis.
To run the sitewise dN/dS model on this dataset, we only need to input the dndsout object into the sitednds function. By default, sitednds will calculate sitewise dN/dS ratios and p-values for sites mutated at least two times (use the argument min_recurr to control this). While p-values are only provided for recurrently mutated sites, false discovery adjustment corrects for all possible changes.
hotspots = sitednds(dndsout) # Running sitewise dN/dS
print(hotspots$theta) # Overdispersion (unexplained variation of the mutation rate across sites)
## MLE CI95low CI95_high
## 0.07261735 0.05468222 0.10069983
You can see that the maximum-likelihood estimate of theta is very low. This reflects considerable variation in the mutation rate across sites, not explained by the trinucleotide context or by the estimated relative mutation rate of the gene. sitednds takes this into account when calculating p-values. If there is large uncertainty in the estimation of theta, users can choose to use the lower bound estimate of theta instead of the maximum-likelihood estimate, when calculating p-values (use the argument theta_option=“conservative” in sitednds).
The main output of sitednds is a table with all hotspots studied, including their position, the gene affected, the aminoacid change induced, the number of times that the mutation was observed, the expected number of mutations at this site by chance under neutrality (mu) and the dN/dS ratio. The table also contains p-values and q-values for the probability of observing that many mutations at the site by chance. Again, please treat these p-values with caution.
print(head(hotspots$recursites,10)) # First 10 lines of the table of hotspots
## chr pos ref mut gene aachange impact ref3_cod mut3_cod freq
## 1 4 1803568 C G FGFR3 S249C Missense TCC TGC 27
## 2 17 7577538 C T TP53 R248Q Missense CGG CAG 15
## 3 17 37868208 C T ERBB2 S310F Missense TCC TTC 16
## 4 19 45867687 T C ERCC2 N238S Missense AAC AGC 9
## 5 9 20414343 A G MLLT3 S167S Synonymous GTA GCA 11
## 6 9 20414340 G A MLLT3 S168S Synonymous GCA GTA 12
## 7 3 178936091 G A PIK3CA E545K Missense TGA TAA 23
## 8 4 153247289 G C FBXW7 R505G Missense CCG CGG 8
## 9 17 7577085 C T TP53 E285K Missense AGA AAA 10
## 10 4 1806099 A G FGFR3 Y375C Missense TAT TGT 8
## mu dnds pval qval
## 1 0.004349085 6208.203 7.561905e-37 3.663031e-30
## 2 0.004297050 3490.767 1.034825e-21 2.506377e-15
## 3 0.005381862 2972.949 1.615659e-21 2.608783e-15
## 4 0.001419861 6338.647 3.491422e-18 4.228163e-12
## 5 0.003143547 3499.232 5.292417e-18 5.127356e-12
## 6 0.004582783 2618.496 1.514320e-17 1.222576e-11
## 7 0.022757216 1010.668 2.543084e-17 1.759836e-11
## 8 0.001499587 5334.803 3.117442e-16 1.887634e-10
## 9 0.005642605 1772.231 3.591526e-14 1.933063e-08
## 10 0.002847343 2809.637 4.627725e-14 2.241697e-08
We can choose a significance cutoff (e.g. q-value<0.05) to list the significant hotspots in the dataset:
signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ]
print(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], digits=5)
## gene aachange impact freq dnds qval
## 1 FGFR3 S249C Missense 27 6208.20 3.6630e-30
## 2 TP53 R248Q Missense 15 3490.77 2.5064e-15
## 3 ERBB2 S310F Missense 16 2972.95 2.6088e-15
## 4 ERCC2 N238S Missense 9 6338.65 4.2282e-12
## 5 MLLT3 S167S Synonymous 11 3499.23 5.1274e-12
## 6 MLLT3 S168S Synonymous 12 2618.50 1.2226e-11
## 7 PIK3CA E545K Missense 23 1010.67 1.7598e-11
## 8 FBXW7 R505G Missense 8 5334.80 1.8876e-10
## 9 TP53 E285K Missense 10 1772.23 1.9331e-08
## 10 FGFR3 Y375C Missense 8 2809.64 2.2417e-08
## 11 PIK3CA E542K Missense 17 747.02 7.9963e-08
## 12 TP53 R280T Missense 10 1453.35 9.1276e-08
## 13 HLA-A Q78R Missense 4 13446.24 2.1402e-06
## 14 MLLT3 S166S Synonymous 7 1527.46 1.1661e-05
## 15 NFE2L2 R34G Missense 5 2050.39 2.0190e-04
## 16 KRAS G12D Missense 4 3146.49 5.5540e-04
## 17 NAB2 P211P Synonymous 3 9736.56 5.7996e-04
## 18 FGFR3 G372C Missense 4 2851.04 7.2814e-04
## 19 HRAS Q61K Missense 3 8318.78 8.3064e-04
## 20 TP53 E336* Nonsense 4 2502.57 1.0945e-03
## 21 PIK3CA H1047R Missense 4 2391.66 1.2455e-03
## 22 TP53 S241F Missense 5 1278.51 1.3452e-03
## 23 MLLT3 S188S Synonymous 5 1091.04 2.7397e-03
## 24 TP53 Q331* Nonsense 7 531.95 5.6984e-03
## 25 KDM6A Q562* Nonsense 8 393.68 1.3510e-02
## 26 ERBB3 V104L Missense 3 2434.11 2.3575e-02
## 27 MLLT3 S155S Synonymous 4 872.83 4.8176e-02
## 28 TP53 R248W Missense 4 851.59 4.9561e-02
## 29 TP53 R273C Missense 4 850.23 4.9561e-02
Careful examination of the significant hotspots reveals many well-known cancer-driver hotspots, including in FGFR3 (e.g. S249, Y375, G372), TP53 (e.g. R248), PIK3CA (e.g. E542, H1047), HRAS (Q61), KRAS (G12), ERBB2 (S310), ERBB3 (V104), etc. Note that the exact aminoacid position affected depends on the exact protein isoform used for annotation (see Ensembl protein IDs in dndsout$annotmuts).
However, the table of significant hotspots also contains a considerable number of likely false positives, including multiple synonymous sites in MLLT3. A proper analysis of these data would require careful reevaluation and improvement of the mutation calls, before repeating this analysis. Significant improvements to somatic mutation calls against recurrent artefacts can be achieved by using an unmatched normal panel and by more stringent filtering of germline SNP contamination.
The TCGA mutation calls used in this example are an old version and it is likely that more recent versions are much less affected by artefacts. However, I decided to use this dataset as an example to highlight the importance of critically examining the results and the impact of recurrent artefacts on driver discovery at gene and site level.
As a final note, users with whole-genome or whole-exome data can run sitednds on all genes. However, given the frequent presence of recurrent artefacts and the sparsity of cancer datasets, the signal-to-noise ratio can be considerably increased by running sitednds on a list of known cancer genes. To do so, I recommend running dndscv on all genes and then running sitednds on a list of genes of interest using the optional gene_list argument in sitednds. Running dndscv on all genes ensures that mutations from all genes are used to estimate the trinucleotide mutation rates, typically increasing their accuracy.
In a recent study, we sequenced 844 small biopsies of normal oesophageal epithelium from 9 transplant donors to study the extent of mutation and selection in a normal tissue (Martincorena et al., 2018). In this part of the tutorial, we will reanalyse this dataset using dndscv and sitednds. We first run dndscv using the settings from the analysis of normal skin data described in the main dNdScv tutorial.
library("dndscv")
data("dataset_normaloesophagus", package="dndscv") # Loading the mutations in normal oesophagus
mutations = unique(mutations) # Removing duplicate mutations (more conservative)
data("dataset_normalskin_genes", package="dndscv")
dndsout = dndscv(mutations, outmats=T, gene_list=target_genes, max_coding_muts_per_sample=Inf, max_muts_per_gene_per_sample=Inf) # outmats=T is required to run sitednds
We can see the list of genes under positive selection and the global dN/dS values using:
sel_cv = dndsout$sel_cv
print(sel_cv[sel_cv$qglobal_cv<0.05, c(1:6,19)], digits = 3)
## gene_name n_syn n_mis n_non n_spl n_ind qglobal_cv
## 73 TP53 9 323 50 41 97 0.00e+00
## 48 NOTCH1 29 831 278 134 572 0.00e+00
## 49 NOTCH2 8 225 59 26 127 0.00e+00
## 50 NOTCH3 11 140 93 37 82 0.00e+00
## 27 FAT1 19 127 113 13 166 0.00e+00
## 7 ARID1A 8 28 34 5 68 0.00e+00
## 3 AJUBA 1 22 9 0 30 0.00e+00
## 19 CUL3 7 34 9 10 20 1.33e-13
## 13 CCND1 2 11 9 0 1 1.29e-05
## 55 PIK3CA 2 45 1 0 4 9.14e-04
## 42 KMT2D 21 69 12 4 43 2.14e-03
## 47 NFE2L2 5 27 1 0 8 4.57e-03
## 8 ARID2 12 28 9 2 12 5.13e-03
## 40 KMT2A 15 68 8 2 34 2.73e-02
## 74 TP63 5 29 1 1 6 4.78e-02
print(dndsout$globaldnds, digits = 3)
## name mle cilow cihigh
## wmis wmis 2.07 1.90 2.27
## wnon wnon 7.82 6.92 8.83
## wspl wspl 8.31 7.10 9.73
## wtru wtru 7.97 7.14 8.90
## wall wall 2.55 2.34 2.78
To apply the sitednds model, we simply use the following code. Only the top 30 hotspots by q-value are shown, but a total of 133 sites are identified as significant with 5% FDR.
hotspots = sitednds(dndsout) # Running sitewise dN/dS
signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ]
head(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], 30)
## gene aachange impact freq dnds qval
## 1 NOTCH1 F357S Missense 5 3279.5143 1.122415e-05
## 2 NOTCH1 N390S Missense 6 1860.5378 1.122415e-05
## 3 NOTCH1 I471T Missense 5 1438.3442 3.971835e-04
## 4 NOTCH1 G443D Missense 5 1162.6781 5.585327e-04
## 5 NOTCH1 G394D Missense 5 1162.6781 5.585327e-04
## 6 NOTCH1 . Essential_Splice 5 1150.1452 5.585327e-04
## 7 NOTCH1 S385F Missense 5 1103.8104 5.585327e-04
## 8 NOTCH1 Y550* Nonsense 4 2014.8063 5.585327e-04
## 9 NOTCH1 N718S Missense 6 723.4757 5.585327e-04
## 10 NOTCH1 D352Y Missense 4 1607.8249 1.146733e-03
## 11 TP53 M246V Missense 5 851.6114 1.235630e-03
## 12 PIK3CA H1047R Missense 5 851.6114 1.235630e-03
## 13 NOTCH1 I471F Missense 4 1091.4795 3.997529e-03
## 14 NOTCH1 G367D Missense 4 986.9852 4.968849e-03
## 15 TP53 T125T Synonymous 4 955.4674 4.968849e-03
## 16 NOTCH1 V717V Synonymous 3 2260.2324 4.968849e-03
## 17 NOTCH1 D469Y Missense 4 934.8717 4.968849e-03
## 18 NOTCH1 D412Y Missense 4 934.8717 4.968849e-03
## 19 NOTCH1 W327C Missense 4 934.8717 4.968849e-03
## 20 NOTCH1 . Essential_Splice 4 920.1162 4.968849e-03
## 21 NOTCH1 N431H Missense 3 2174.7189 4.968849e-03
## 22 NOTCH1 G384S Missense 4 861.3480 5.901671e-03
## 23 NOTCH1 C440R Missense 3 1938.6707 6.244907e-03
## 24 NOTCH1 E455K Missense 6 388.0714 6.244907e-03
## 25 TP53 L257V Missense 3 1775.2181 7.617893e-03
## 26 TP53 Y220C Missense 5 466.5666 9.513535e-03
## 27 NOTCH1 G310R Missense 6 345.8816 1.015569e-02
## 28 TP53 H193R Missense 4 681.2891 1.139337e-02
## 29 TP53 R273C Missense 6 320.6674 1.374732e-02
## 30 NOTCH1 C1238* Nonsense 3 1366.7624 1.374732e-02
Remarkably, owing to the very large number of mutant clones identified in this study, this analysis finds a large number of statistically-significant sites (n=81 for qval<0.05). Reassuringly, they are all in genes detected under positive selection in the original publication (Martincorena et al, 2018). This comprises 61 sites in NOTCH1, 17 sites in TP53, the well-known PIK3CA hotspot H1047R and one site in FAT1 and TP63.
This analysis also identifies a known driver hotspot in a synonymous site of TP53 (T125T), which is known to affect splicing of TP53. Intriguingly, it also identifies a synonymous site in NOTCH1 (V717V), which deserves careful follow-up analysis. Apart from these two synonymous sites, all other 79 significant hotspots are non-synonymous.