The R package pmsignature is developed for efficiently extracting characteristic mutation patterns (mutation signatures) from sets of mutations, typically collected from cancer genome sequencing data.
For extracting mutation signatures, principal component analysis and nonnegative matrix factorization have been popular. Compared to these existing approaches, pmsignature
has the following advantages:
Currently, pmsignature accepts only tab-delimited text files in a specialized format. We will improve the program so that it can also accept VCF files.
Shiraishi et al. Extraction of Latent Probabilistic Mutational Signature in Cancer Genomes, submitted.
For input data, we need mutation feature data for each sample and mutation. Here, mutation features are the elements used to categorize mutations, such as:
Currently, pmsignature accepts the following two formats of tab-delimited text file.
Mutation Position Format
sample1 | chr1 | 100 | A | C |
sample1 | chr1 | 200 | A | T |
sample1 | chr2 | 100 | G | T |
sample2 | chr1 | 300 | T | C |
sample3 | chr3 | 400 | T | C |
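If your mutations are already held in an R data frame, a tab-delimited file in this layout can be produced with base R. The following is a minimal sketch: the column names (sample, chr, pos, ref, alt) are labels assumed here from the example rows, and the file is written without a header line to match the rows shown.

```r
# Column names are illustrative only; the example rows carry no header.
mutations <- data.frame(
  sample = c("sample1", "sample1", "sample1", "sample2", "sample3"),
  chr    = c("chr1", "chr1", "chr2", "chr1", "chr3"),
  pos    = c(100L, 200L, 100L, 300L, 400L),
  ref    = c("A", "A", "G", "T", "T"),
  alt    = c("C", "T", "T", "C", "C"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with no header, no row names and no quoting.
outFile <- tempfile(fileext = ".txt")
write.table(mutations, file = outFile, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read it back to confirm the layout: five columns, one row per mutation.
readBack <- read.delim(
  outFile, header = FALSE,
  colClasses = c("character", "character", "integer", "character", "character")
)
print(readBack)
```

Fixing colClasses on the way back in keeps a column of bases from being converted to another type (for example, a column containing only T being read as logical).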
Mutation Feature Vector Format
1 | 4 | 4 | 4 | 3 | 3 | 2 |
2 | 4 | 3 | 3 | 1 | 1 | 2 |
3 | 4 | 4 | 3 | 2 | 2 | 2 |
4 | 3 | 3 | 2 | 3 | 3 | 1 |
5 | 3 | 4 | 2 | 4 | 4 | 2 |
6 | 4 | 1 | 4 | 2 | 1 | 2 |
3 | 2 | 1 | 1 | 1 | 1 | 2 |
7 | 4 | 2 | 2 | 4 | 3 | 2 |
First, the R packages VariantAnnotation and BSgenome.Hsapiens.UCSC.hg19, on which pmsignature depends, have to be installed. The devtools package is also useful for easy installation.
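# biocLite() has been retired; BiocManager is the current Bioconductor installer.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("VariantAnnotation", "BSgenome.Hsapiens.UCSC.hg19"))
install.packages("devtools")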
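Before installing pmsignature itself, it can help to confirm which of these packages are already available. The following is a small sketch using only base R; the variable names are illustrative.

```r
# Packages named in the installation step above.
deps <- c("VariantAnnotation", "BSgenome.Hsapiens.UCSC.hg19", "devtools")

# requireNamespace() returns TRUE when a package can be loaded and FALSE otherwise.
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missingPkgs <- deps[!installed]

if (length(missingPkgs) > 0) {
  message("Still to install: ", paste(missingPkgs, collapse = ", "))
} else {
  message("All dependencies are available.")
}
```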
The easiest way to install pmsignature is to use the devtools package:
library(devtools)
devtools::install_github("friend1ws/pmsignature")
First, create the input data from your mutation data.
After installing pmsignature, you can find example files for both formats in the directory where pmsignature is installed.
Mutation Position Format
inputFile <- system.file("extdata/Nik_Zainal_2012.mutationPositionFormat.txt", package="pmsignature");
print(inputFile);
Mutation Feature Vector Format
inputFile <- system.file("extdata/Hoang_MFVF.txt", package="pmsignature");
print(inputFile);
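Note that system.file returns an empty string when the file or the package cannot be found, so it is worth checking the result before passing it on. The sketch below previews the first rows of the Mutation Position Format example. It assumes the file has no header line, as in the example rows shown earlier.

```r
inputFile <- system.file("extdata/Nik_Zainal_2012.mutationPositionFormat.txt",
                         package = "pmsignature")

if (nzchar(inputFile)) {
  # Preview the first few mutations (assumes a header-less, tab-delimited file).
  preview <- read.delim(inputFile, header = FALSE, nrows = 5,
                        stringsAsFactors = FALSE)
  print(preview)
} else {
  message("Example file not found; is pmsignature installed?")
}
```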
Type the following commands (inputFile is the path of the data you want to analyze):
Mutation Position Format
G <- readMPFile(inputFile, numBases = 5);
Here, inputFile is the path to the input file, and numBases is the number of bases to consider, counting the central (mutated) base together with its flanking bases (to consider two bases on each of the 5’ and 3’ sides, set numBases = 5). You can format the data for the full model by typing
G <- readMPFile(inputFile, numBases = 5, type = "full");
Also, you can add transcription direction information by typing
G <- readMPFile(inputFile, numBases = 5, trDir = TRUE);
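To make the meaning of numBases concrete, the sketch below converts a number of flanking bases per side into numBases and counts how many distinct mutation patterns exist when every combination is treated as its own category. This is plain counting, not a description of pmsignature's internals. It assumes the common convention of 6 substitution types, and flankToNumBases and numPatterns are hypothetical helper names.

```r
# numBases counts the central base plus the flanking bases on both sides.
flankToNumBases <- function(flankPerSide) {
  2L * as.integer(flankPerSide) + 1L
}

# Distinct patterns: 6 substitution types times 4 choices for each flanking base,
# doubled when the two transcription directions are distinguished.
numPatterns <- function(numBases, trDir = FALSE) {
  n <- 6 * 4^(numBases - 1)
  if (trDir) 2 * n else n
}

flankToNumBases(2)           # 5
numPatterns(3)               # 96
numPatterns(5)               # 1536
numPatterns(5, trDir = TRUE) # 3072
```

The rapid growth with numBases shows why the choice of model type matters once wider sequence context is included.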
Mutation Feature Vector Format
G <- readMFVFile(inputFile, numBases = 5);
To set the number of mutation signatures to 3, type the following command:
Param <- getPMSignature(G, K = 3);
To add a background signature, first obtain the background probability and then perform the estimation. Currently, background data are provided only for the “independent” and “full” models with 3 and 5 flanking bases.
BG_prob <- readBGFile(G);
Param <- getPMSignature(G, K = 3, BG = BG_prob);
By default, the estimation is repeated 10 times with different initial values, and the parameter set with the maximum likelihood is selected. To change the number of repetitions, type
Param <- getPMSignature(G, K = 3, numInit=20);
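Repeating the estimation from several initial values and keeping the best result is a general way to cope with local optima. The toy sketch below illustrates the idea with base R's optim on a made-up one-dimensional objective. It is not pmsignature's estimation code, and loglik and multiStart are hypothetical names.

```r
set.seed(1)

# Toy objective with several local maxima, standing in for a log-likelihood.
loglik <- function(x) sin(3 * x) - 0.1 * (x - 2)^2

# Run the optimizer from numInit random starting points and keep the best fit.
multiStart <- function(numInit = 10) {
  best <- NULL
  for (i in seq_len(numInit)) {
    init <- runif(1, min = -5, max = 5)
    # fnscale = -1 turns optim's default minimization into maximization.
    fit <- optim(init, loglik, method = "BFGS", control = list(fnscale = -1))
    if (is.null(best) || fit$value > best$value) {
      best <- fit
    }
  }
  best
}

best <- multiStart(numInit = 10)
print(best$par)
print(best$value)
```

More starting points raise the chance of reaching the global maximum, at the cost of proportionally more computation.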
You can check the mutation signature by typing
visPMSignature(Param, 1)
visPMSignature(Param, 2)
visPMSignature(Param, 3)