More than 90% of human genes are reported to be alternatively spliced, which makes transcriptome reconstruction a complex problem. An analysis of splicing complexity by gene size indicates that the number of possible exon combinations can grow exponentially with gene size. Recent studies suggest that the number of transcripts reported to date may be limited by short-read sequencing, because transcripts must be reconstructed algorithmically from short reads. Long-read sequencing can significantly improve the detection of alternative splicing variants, since full-length transcripts do not need to be assembled from short reads.
Our neoantigen prediction pipeline requires only standard, widely used outputs of next-generation sequencing analyses. In the simplest scenario, the pipeline needs only a gene prediction output file in GFF format. It uses the input GFF file to extract ORFs from the predicted isoforms and compares them with a reference protein database, typically UniProt Swiss-Prot. If the data come from third-generation long-read sequencing, the input may consist of the gene prediction file in GFF format and a sequence file.
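To make the ORF extraction step concrete, here is a minimal base R sketch that translates a nucleotide sequence in the three forward reading frames and returns the longest open reading frame. The function names `translate_frame` and `longest_orf` are hypothetical and are not part of neoantigenR; the package's own implementation may differ (for example, it may also scan the reverse strand or require a terminating stop codon).

```r
# Standard genetic code, with codons ordered T, C, A, G at each position.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codon_table <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
names(codon_table) <- paste0(grid$first, grid$second, grid$third)

# Translate one forward reading frame (frame = 1, 2 or 3).
translate_frame <- function(dna, frame) {
  dna <- toupper(dna)
  if (nchar(dna) - frame + 1 < 3) return("")
  starts <- seq(frame, nchar(dna) - 2, by = 3)
  aa <- codon_table[substring(dna, starts, starts + 2)]
  aa[is.na(aa)] <- "X"  # ambiguous codons (e.g. containing N)
  paste(aa, collapse = "")
}

# Longest stretch that starts at M and runs to the next stop codon
# (or to the end of the sequence if no stop codon follows).
longest_orf <- function(dna) {
  proteins <- vapply(1:3, function(f) translate_frame(dna, f), character(1))
  orfs <- unlist(regmatches(proteins, gregexpr("M[^*]*", proteins)))
  if (length(orfs) == 0) return(NA_character_)
  orfs[which.max(nchar(orfs))]
}

longest_orf("CCATGGCTCCGCATAACTAAGG")
## [1] "MAPHN"
```

The toy sequence above is invented for illustration; in practice the input would be an isoform sequence assembled from the exons listed in the GFF file.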
neoantigenR provides several
neoantigenR should be installed as follows:
neoantigenR is user-friendly. Users only need to provide a tab-delimited input data file and give the indexes of the control and case samples.
Read Input Data into R. As described above, our pipeline relies on input generated from the analysis of high-throughput sequencing data, including short-read RNA-Seq and long-read PacBio SMRT sequencing data. Generally, these data can be obtained from existing alignment and gene-calling tools. Here, we outline example preparatory steps for generating these input data.
Raw sequences (FASTQ files) were aligned to the reference genome with Bowtie2 [Ref] to obtain SAM/BAM files. In brief, Bowtie2 (version 2.1) [22] was run with default parameters. The resulting alignment file (in BAM format) was then used as input to the gene-calling tool Cufflinks [Ref] in de novo gene-finding mode (no gene annotation provided) with default parameters. Cufflinks accepts aligned RNA-Seq reads (SAM/BAM format) and assembles the alignments into a parsimonious set of transcripts, producing a transcript annotation file in GFF format.
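The preparatory steps above could be scripted from R roughly as follows. This is only a sketch: the file names and the index prefix are placeholders, the intermediate `samtools sort` step is an assumption (Cufflinks expects coordinate-sorted alignments), and the external tools are assumed to be on the `PATH`. By default the helper `run_step` (a hypothetical name) only prints each command, so the block can be inspected without the tools installed.

```r
# Placeholder file names; replace with your own paths.
index_prefix <- "hg19_index"        # Bowtie2 index prefix (assumed to exist)
fastq_file   <- "sample.fastq"
sam_file     <- "sample.sam"
bam_file     <- "sample.sorted.bam"
out_dir      <- "cufflinks_out"

# Each step is a command followed by its arguments (default parameters).
steps <- list(
  align    = c("bowtie2", "-x", index_prefix, "-U", fastq_file, "-S", sam_file),
  sort     = c("samtools", "sort", "-o", bam_file, sam_file),
  assemble = c("cufflinks", "-o", out_dir, bam_file)
)

# Print the command (dry run) or execute it and stop on failure.
run_step <- function(cmd, dry_run = TRUE) {
  line <- paste(cmd, collapse = " ")
  if (dry_run) {
    message(line)
    return(invisible(line))
  }
  status <- system2(cmd[1], args = cmd[-1])
  if (status != 0) stop("Command failed: ", line)
  invisible(line)
}

command_lines <- vapply(steps, run_step, character(1))
```

Cufflinks writes its assembled transcripts to `transcripts.gtf` inside the output directory, so a conversion to GFF may be needed before the file is passed to the pipeline.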
Samples are normalized to the same library size (i.e. total raw read count) when the experimental replicates were obtained with the same protocol and an equal library size is expected within each experimental condition. neoantigenR applies sum normalization: the counts of all genes in each replicate are summed, and each column is scaled so that all columns have the same total.
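As an illustration of sum normalization, the sketch below rescales every column of a count matrix to a common total. The function name `normalize_library_size` and the choice of the mean library size as the common target are assumptions made for this example; the package may use a different target value.

```r
# Scale each replicate (column) so that all columns share the same total.
normalize_library_size <- function(counts) {
  library_sizes <- colSums(counts)
  target <- mean(library_sizes)  # assumed common library size
  sweep(counts, 2, library_sizes / target, "/")
}

# Toy data: 3 genes (rows) x 2 replicates (columns).
counts <- matrix(c(10, 30, 60,
                   20, 60, 120), ncol = 2)
normalized <- normalize_library_size(counts)
colSums(normalized)
## [1] 150 150
```

Because the second replicate here is exactly twice the first, both columns become identical after scaling, which shows that only the library-size difference was removed.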
source("../R/neoantigenR.R")
## Warning: package 'seqinr' was built under R version 3.3.2
## Loading required package: Biostrings
## Warning: package 'Biostrings' was built under R version 3.3.2
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 3.3.2
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: IRanges
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:seqinr':
##
## translate
## Warning: package 'GenomicRanges' was built under R version 3.3.2
## Loading required package: GenomeInfoDb
## Warning: package 'GenomeInfoDb' was built under R version 3.3.2
## Warning: package 'Gviz' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: rtracklayer
## Warning: package 'rtracklayer' was built under R version 3.3.2
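# Reference protein database (UniProt Swiss-Prot, FASTA)
protein.database.file.name = "../data/swissuniprots.fasta"
# Reference gene annotation (GENCODE v19, GFF3)
reference.gff.file = "../data/gencode.v19.annotation.gff3"
# Predicted gene models from PacBio long reads (GFF)
pacbio.gff = "../data/model.gff"
# Overlaps between the PacBio models and the GENCODE annotation
pacbio.gencode.overlapping.file = "../data/bedtool.intersect.overlaps.txt"
# Folder used for output files
output.folder = "../data/"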
file.exists(pacbio.gff)
## [1] TRUE
print(pacbio.gff)
## [1] "../data/model.gff"
org="hg19"
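Before running the pipeline it can help to confirm that a GFF file parses into the expected nine columns. The sketch below reads a GFF file with base R only; `read_gff` and the column names are hypothetical helpers, not functions of neoantigenR, and the two records written to the temporary file are toy data (the second one is invented).

```r
# The nine standard GFF columns.
gff_columns <- c("seqid", "source", "type", "start", "end",
                 "score", "strand", "phase", "attributes")

# Read a GFF file, skipping '#' comment lines, and check the coordinates.
read_gff <- function(path) {
  gff <- read.delim(path, header = FALSE, comment.char = "#", quote = "",
                    col.names = gff_columns,
                    colClasses = c("character", "character", "character",
                                   "integer", "integer", "character",
                                   "character", "character", "character"))
  stopifnot(all(gff$start <= gff$end))
  gff
}

# Toy file with two exon records.
demo_path <- tempfile(fileext = ".gff")
writeLines(c(
  "##gff-version 3",
  "chr11\tPacBio\texon\t308120\t308438\t.\t+\t.\tID=PB.1583.1",
  "chr11\tPacBio\texon\t307000\t307200\t.\t+\t.\tID=PB.1583.1"
), demo_path)

demo_gff <- read_gff(demo_path)
demo_gff$end - demo_gff$start + 1  # exon lengths
## [1] 319 201
```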
A dataset containing the full-length whole transcriptome from three diverse human tissues (brain, heart, and liver) was directly downloaded from the PacBio official website [Ref].
This dataset is ideal for exploring differential alternative splicing events.
neoantigenR.initialize()
## Warning in dir.create(dataDir): '..\data\\data' already exists
## Reading ../data/model.gff: found 11 rows with classes: character, character, character, integer, integer, character, character, character, character
## Reading ../data/bedtool.intersect.overlaps.txt: found 95 rows with classes: character, character, character, integer, integer, character, character, character, character, character, character, character, integer, integer, character, character, character, character, character
## Reading ../data/gencode.v19.annotation.gff3: found 59 rows with classes: character, character, character, integer, integer, character, character, character, character
## Warning in .Method(..., deparse.level = deparse.level): number of columns
## of result is not a multiple of vector length (arg 1)
neoantigenR.get.Model()
## Warning in .Method(..., deparse.level = deparse.level): number of columns
## of result is not a multiple of vector length (arg 2)
## Warning in .Method(..., deparse.level = deparse.level): number of columns
## of result is not a multiple of vector length (arg 1)
neoantigenR.get.peptides()
## use default substitution matrix
## 57 65 APHNPAPPT
neoantigenR.write()
library(knitr)
## Warning: package 'knitr' was built under R version 3.3.2
kable(head(alignment.indel.detailed.info.table,10), caption = "The top 10 records for neoantigens")
| | Gene | PacBioSequenceID | UniprotProteinID | ProteinStart | ProteinEnd | DNAStart | DNAEnd | IndelSequence | Coverage | Isof.Index | reference.indel.region.seq | indel.region.seq.overlap.ratio | type | uniprotID | IndelSequenceAnchorSeq | IndepPositioninGenome | Chrom | PacBio | region | PacBioStart | PacbioEnd | dot | strand | dot2 | PacbioTranscriptID | ReferenceChrom | Reference_exon_start | Reference_exon_end | dot3 | geneName | strand2 | exontype | geneID | exonID | transcriptID | unkonwn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| indel.seq.detailed.info | IFITM2 | i1HQ_A1254\|c20967/f1p0/700 | >sp\|Q01629\|IFM2_HUMAN Interferon-induced transmembrane protein 2 OS=Homo sapiens GN=IFITM2 PE=1 SV=2 | 57 | 65 | 169 | 195 | APHNPAPPT | 0 | 2 | VPHNPAPPM | 0.78 | mismatch | Q01629 | KEEQEVAMLGAPHNPAPPTSTVIHIRSET | 308288 | chr11 | PacBio | exon | 308120 | 308438 | . | + | . | PB.1583.1 | chr11 | 308231 | 308438 | IFITM2 | . | + | exon | 2 | 32 | 6 | 13 |
In RNA-Seq transcriptome data, it is common to observe thousands of alternative isoform sequences that are novel transcribed regions or variants of existing genes. However, the majority of these isoforms yield significantly truncated proteins, or sequences interrupted by many stop codons, when translated in silico. It is therefore important to survey the fraction of these isoforms that produce peptide sequences highly similar to reference proteins but with meaningful sequence alterations.
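One simple way to quantify how similar a candidate peptide is to the corresponding reference region is the fraction of identical residues at matching positions. The sketch below uses a hypothetical helper, `identity_ratio`, for two peptides of equal length. Applied to the IFITM2 record reported above (APHNPAPPT against the reference VPHNPAPPM) it gives 0.78, which matches the reported overlap ratio, although the package may compute that value in a different way.

```r
# Fraction of positions at which two equal-length peptides are identical.
identity_ratio <- function(candidate, reference) {
  stopifnot(nchar(candidate) == nchar(reference))
  a <- strsplit(candidate, "")[[1]]
  b <- strsplit(reference, "")[[1]]
  mean(a == b)
}

round(identity_ratio("APHNPAPPT", "VPHNPAPPM"), 2)
## [1] 0.78
```

Seven of the nine residues agree here; only the first and last positions differ.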