Our cancer-specific algorithm is capable of predicting the functional effects of cancer-associated protein missense mutations by combining
sequence conservation within hidden Markov models (HMMs), representing the alignment of homologous sequences and conserved
protein domains, with cancer-specific "pathogenicity weights", representing the overall tolerance of the corresponding model
to cancer mutations.
For more information, please refer to the following publications:
Shihab HA, Gough J, Cooper DN, Day INM, Gaunt TR. (2013). Predicting the Functional Consequences of Cancer-Associated Amino Acid Substitutions.
Bioinformatics, 29:1504-1510.
Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ, Day INM, Gaunt TR. (2013). Predicting the Functional, Molecular and Phenotypic
Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat., 34:57-65.
Our software accepts one of the following formats (see below for annotating VCF files):
<protein> <substitution>
dbSNP rs identifiers
Here, <protein> is the protein identifier and <substitution> is the amino acid substitution in the conventional one-letter format.
At present, our server accepts SwissProt/TrEMBL, RefSeq and Ensembl protein identifiers, e.g.:
P43026 L441P
or:
rs137854462
It is possible to submit multiple amino acid substitutions as a 'Batch Submission' via our server. Here, all amino acid substitutions for a protein can be
entered on a single line and should be separated by commas, e.g.:
P43026 L441P
ENSP00000325527 N548I,E1073K,C2307S
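As an illustration of the batch format described above, the following is a minimal Python sketch that groups individual substitutions by protein and writes one line per protein. The function name `to_batch_lines` and the validation pattern are assumptions made for this example; they are not part of our server or software.

```python
import re

# Hypothetical check for the one-letter format (e.g. L441P):
# reference residue, position, substituted residue.
SUBSTITUTION = re.compile(r"^[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]$")


def to_batch_lines(pairs):
    """Group (protein, substitution) pairs into one input line per protein."""
    grouped = {}  # dicts keep insertion order, so proteins stay in input order
    for protein, substitution in pairs:
        if not SUBSTITUTION.match(substitution):
            raise ValueError(f"not a one-letter substitution: {substitution!r}")
        grouped.setdefault(protein, []).append(substitution)
    # Protein identifier, a space, then comma-separated substitutions.
    return [f"{protein} {','.join(subs)}" for protein, subs in grouped.items()]


if __name__ == "__main__":
    example = [
        ("P43026", "L441P"),
        ("ENSP00000325527", "N548I"),
        ("ENSP00000325527", "E1073K"),
        ("ENSP00000325527", "C2307S"),
    ]
    print("\n".join(to_batch_lines(example)))
```

Run on the identifiers used in the examples above, this prints the same two lines shown in the batch submission example.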
As described in our paper, our server uses a default prediction threshold of -0.75. Predictions with scores below this threshold indicate that the
mutation is potentially associated with cancer; however, the prediction threshold can be adjusted to suit
your individual needs. For example, if you are interested in minimising the number of false positives in your analysis, then you should opt
for a more conservative threshold, e.g. -3.0; however, if you are interested in capturing a large proportion of cancer-associated mutations (regardless of
the number of false positives), then a less stringent threshold should be selected, e.g. 0.0 or higher. To inform this choice, the specificity and sensitivity
of our software at various prediction thresholds are shown in an interactive graph on this page.
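The effect of moving the threshold can be sketched in a few lines of Python. This is only an illustration of the rule described above (a score below the threshold is called cancer-associated). The function names are assumptions, and the labelled scores are invented for the example; they are not output from our software.

```python
DEFAULT_THRESHOLD = -0.75  # default threshold quoted in the text


def is_cancer_associated(score, threshold=DEFAULT_THRESHOLD):
    """A score below the threshold is predicted to be cancer-associated."""
    return score < threshold


def sensitivity_specificity(labelled_scores, threshold=DEFAULT_THRESHOLD):
    """Return (sensitivity, specificity) for (score, truly_cancer) pairs."""
    tp = fn = tn = fp = 0
    for score, truly_cancer in labelled_scores:
        predicted = is_cancer_associated(score, threshold)
        if truly_cancer:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Invented (score, truly_cancer) pairs, for illustration only.
    toy = [(-4.2, True), (-1.1, True), (-0.3, True),
           (-2.0, False), (0.5, False), (1.8, False)]
    for threshold in (-3.0, DEFAULT_THRESHOLD, 0.0):
        sens, spec = sensitivity_specificity(toy, threshold)
        print(f"threshold {threshold:+.2f}: "
              f"sensitivity {sens:.2f}, specificity {spec:.2f}")
```

On this toy data, the conservative threshold of -3.0 removes the false positive at the cost of missing two of the three true cancer mutations, while 0.0 captures all three but keeps the false positive.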
Unfortunately, due to disk space constraints, we are unable to annotate Variant Call Format (VCF) files on your behalf. However, the consequences of all VCF variants
can be derived using the Ensembl Variant Effect Predictor (VEP).
Once annotated, the following script (available here) parses these annotations and provides a list of protein
consequences that can then be used as input to our server/software.
Additional help on using our script is available by typing the following command:
python parseVCF.py --help
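For readers who want to see what such a parsing step involves, the sketch below extracts missense substitutions from VEP's default tab-delimited output. It is not parseVCF.py and makes no claim about how that script works. It assumes that VEP was run with the --protein option, so that the Extra column carries an ENSP=... entry, and that the columns follow VEP's default order. The function name `protein_consequences` is hypothetical.

```python
def protein_consequences(vep_lines):
    """Collect '<protein> <substitution>' strings from VEP tab-delimited lines."""
    results = []
    for line in vep_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 14:
            continue  # not a complete default-format record
        # Default order: ..., Consequence (7th), ..., Protein_position (10th),
        # Amino_acids (11th), ..., Extra (14th).
        consequence, position, amino_acids, extra = (
            fields[6], fields[9], fields[10], fields[13])
        if "missense_variant" not in consequence.split(","):
            continue
        extra_map = dict(
            item.split("=", 1) for item in extra.split(";") if "=" in item)
        protein = extra_map.get("ENSP")  # only present when --protein was used
        if protein is None or "/" not in amino_acids:
            continue
        ref, alt = amino_acids.split("/", 1)
        entry = f"{protein} {ref}{position}{alt}"
        if entry not in results:
            results.append(entry)
    return results
```

Each returned string has the `<protein> <substitution>` form described earlier and could be pasted into a submission after checking it against your own annotations.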