Contents

1 Introduction to neoantigeR

1.1 Authors and Affiliations

Shaojun Tang#, Subha Madhavan

1. Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, United States of America

2. Department of Oncology, Georgetown University Medical Center, Washington, DC, United States of America

1.2 Abstract

It has been reported that more than 90% of human genes are alternatively spliced, which makes transcriptome reconstruction highly complex. An analysis of splicing complexity by gene size has indicated that the combinatorial potential for exon splicing can increase exponentially. Current studies show that the number of transcripts reported to date may be limited by the transcript reconstruction algorithms that short-read sequencing requires, whereas long-read sequencing can significantly improve the detection of alternative splicing variants because transcripts do not need to be assembled from short reads.

1.3 Introduction

Our neoantigen prediction pipeline requires only standard, widely used data from next-generation sequencing assays. In the simplest nontrivial scenario, the pipeline needs only a gene prediction output file in GFF format. The pipeline uses the input GFF file to extract ORFs from the predicted isoforms and compares them with a reference protein database, typically UniProt Swiss-Prot. If the sequencing data come from third-generation long-read sequencing, the input files may consist of the gene prediction file in GFF format and a sequence file.

neoantigeR provides several

2 Installing neoantigeR

neoantigeR should be installed as follows:
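
The installation command itself is not reproduced in this document. As a minimal sketch, the check below reports which of the packages named in the loading messages of Section 4.1 (seqinr, Biostrings, GenomicRanges, Gviz, rtracklayer) cannot be loaded in the current R library. The helper name `missing_packages` is hypothetical and is not part of neoantigeR.

```r
# Hypothetical helper (not part of neoantigeR): report packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages named in the loading messages shown in Section 4.1.
deps <- c("seqinr", "Biostrings", "GenomicRanges", "Gviz", "rtracklayer")
print(missing_packages(deps))
```

An empty result suggests that the dependencies visible in this document are available; it does not confirm that neoantigeR itself is installed.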

3 Preparing sequencing data for neoantigen characterization by neoantigeR

neoantigeR is very user-friendly. Users only need to provide a tab-delimited input data file and give the indexes of the control and case samples.

3.1 Reading the input data:

Read input data into R. As described above, our pipeline relies on input generated from the analysis of high-throughput parallel sequencing data, including short-read RNA-Seq or long-read PacBio SMRT sequencing data. Generally, these data can be easily obtained from existing alignment and gene calling tools. Here, we outline example preparatory steps to generate these input data.
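
To illustrate what a GFF input looks like once it is in R, here is a minimal base-R sketch. It assumes the standard nine tab-separated GFF columns, which is consistent with the column classes reported in the log in Section 4.2. The coordinates are taken from the example record in Section 4.3, but the attribute strings are made up, and this is not necessarily how neoantigeR reads the file internally.

```r
# Standard GFF column names (a GFF file has no header line).
gff_cols <- c("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")

# Inline stand-in for a GFF file; the attribute column is a made-up example.
gff_text <- paste(
  "chr11\tPacBio\ttranscript\t308120\t308438\t.\t+\t.\tID=PB.1583.1",
  "chr11\tPacBio\texon\t308120\t308438\t.\t+\t.\tParent=PB.1583.1",
  sep = "\n"
)

gff <- read.delim(
  text = gff_text, header = FALSE, col.names = gff_cols, comment.char = "#",
  colClasses = c("character", "character", "character", "integer", "integer",
                 "character", "character", "character", "character")
)

# Keep exon features and compute their lengths (GFF coordinates are 1-based, inclusive).
exons <- gff[gff$type == "exon", ]
exons$length <- exons$end - exons$start + 1
print(exons[, c("seqid", "start", "end", "strand", "length")])
```

To read a real file, replace the `text = gff_text` argument with the file path.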

Original raw sequences (FASTQ files) were aligned to the reference genome using Bowtie2 [Ref] to obtain SAM/BAM files. In brief, Bowtie (version 2.1) [22] was used for alignment with default parameters. The resulting alignment file (in BAM format) was subsequently used as input to the gene calling tool Cufflinks [Ref] in de novo gene-finding mode (no gene annotation provided) with default parameters. Cufflinks accepted the aligned RNA-Seq reads (SAM/BAM format) and assembled the alignments into a parsimonious set of transcripts. A transcript annotation file in GFF format was produced.
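
The preparatory steps above could be scripted from R roughly as follows. This is only a sketch: all file names and the index prefix are hypothetical, the samtools sorting step is an assumption that the text does not mention, and the commands are printed rather than executed.

```r
# Hypothetical file names; replace with real paths.
index_prefix <- "hg19_index"        # Bowtie2 index prefix (assumed to exist)
fastq_file   <- "sample.fastq"
sam_file     <- "sample.sam"
bam_file     <- "sample.sorted.bam"
out_dir      <- "cufflinks_out"

# Each step is a command followed by its arguments, all with default parameters.
steps <- list(
  align    = c("bowtie2", "-x", index_prefix, "-U", fastq_file, "-S", sam_file),
  sort     = c("samtools", "sort", "-o", bam_file, sam_file),  # assumed extra step
  assemble = c("cufflinks", "-o", out_dir, bam_file)           # no annotation given
)

# Print the commands; to run one, pass it to system2(cmd[1], cmd[-1]).
for (cmd in steps) cat(paste(cmd, collapse = " "), "\n")
```

Cufflinks writes its assembled transcripts to `transcripts.gtf` inside the output directory.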

fCI workflow 1.

4 Running neoantigenR

4.1 Provide the input files and set up the environment

The samples are normalized to have the same library size (i.e. total raw read counts) if the experimental replicates were obtained by the same protocol and an equal library size is expected within each experimental condition. neoantigeR applies sum normalization: the read counts of all genes in each replicate are summed, and each column is scaled so that all columns have the same total.
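
A minimal sketch of sum normalization on a toy count matrix follows. The counts are made up, and scaling to the mean library size is an assumption; the text only states that the column totals are made equal. The function name `sum_normalize` is hypothetical.

```r
# Toy count matrix: rows are genes, columns are replicates (values are made up).
counts <- matrix(c(10, 20, 70,
                   30, 60, 210),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("rep1", "rep2")))

# Scale every column to the mean library size so that all column sums become equal.
sum_normalize <- function(x) {
  lib_size <- colSums(x)
  sweep(x, 2, mean(lib_size) / lib_size, `*`)
}

normalized <- sum_normalize(counts)
print(colSums(normalized))
```

Here the library sizes are 100 and 300, so both columns are rescaled to a total of 200.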

    source("../R/neoantigenR.R")
## Warning: package 'seqinr' was built under R version 3.3.2
## Loading required package: Biostrings
## Warning: package 'Biostrings' was built under R version 3.3.2
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 3.3.2
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: IRanges
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:seqinr':
## 
##     translate
## Warning: package 'GenomicRanges' was built under R version 3.3.2
## Loading required package: GenomeInfoDb
## Warning: package 'GenomeInfoDb' was built under R version 3.3.2
## Warning: package 'Gviz' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: rtracklayer
## Warning: package 'rtracklayer' was built under R version 3.3.2
    protein.database.file.name      <- "../data/swissuniprots.fasta"
    reference.gff.file              <- "../data/gencode.v19.annotation.gff3"
    pacbio.gff                      <- "../data/model.gff"
    pacbio.gencode.overlapping.file <- "../data/bedtool.intersect.overlaps.txt"
    output.folder                   <- "../data/"

    file.exists(pacbio.gff)
## [1] TRUE
    print(pacbio.gff)
## [1] "../data/model.gff"
    org <- "hg19"
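
Only one of the input paths is checked with `file.exists` in this section. As a small sketch, all four input files could be checked at once before the analysis starts. The helper name `find.missing.inputs` and the vector `input.files` are hypothetical and not part of neoantigeR.

```r
# Input paths as configured in this section (adjust if your layout differs).
input.files <- c(
  protein.database = "../data/swissuniprots.fasta",
  reference.gff    = "../data/gencode.v19.annotation.gff3",
  pacbio.gff       = "../data/model.gff",
  overlaps         = "../data/bedtool.intersect.overlaps.txt"
)

# Hypothetical helper: return the named paths that do not exist on disk.
find.missing.inputs <- function(paths) paths[!file.exists(paths)]

missing <- find.missing.inputs(input.files)
if (length(missing) > 0) {
  message("Missing input files: ", paste(names(missing), collapse = ", "))
}
```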

4.2 neoantigeR analysis with long read sequencing data

    neoantigenR.initialize()
## Warning in dir.create(dataDir): '..\data\\data' already exists
## Reading ../data/model.gff: found 11 rows with classes: character, character, character, integer, integer, character, character, character, character 
## Reading ../data/bedtool.intersect.overlaps.txt: found 95 rows with classes: character, character, character, integer, integer, character, character, character, character, character, character, character, integer, integer, character, character, character, character, character 
## Reading ../data/gencode.v19.annotation.gff3: found 59 rows with classes: character, character, character, integer, integer, character, character, character, character
## Warning in .Method(..., deparse.level = deparse.level): number of columns
## of result is not a multiple of vector length (arg 1)
    neoantigenR.get.Model()
## Warning in .Method(..., deparse.level = deparse.level): number of columns
## of result is not a multiple of vector length (arg 2)
## Warning in .Method(..., deparse.level = deparse.level): number of columns
## of result is not a multiple of vector length (arg 1)
    neoantigenR.get.peptides()
## use default substitution matrix
## 57   65   APHNPAPPT 
    neoantigenR.write()

4.3 Printing sample neoantigen precursor candidates

 library(knitr)
## Warning: package 'knitr' was built under R version 3.3.2
 kable(head(alignment.indel.detailed.info.table,10), caption = "The top 10 records for neoantigens")
The top 10 records for neoantigens

Record: indel.seq.detailed.info

Column                             Value
Gene                               IFITM2
PacBioSequenceID                   i1HQ_A1254|c20967/f1p0/700
UniprotProteinID                   >sp|Q01629|IFM2_HUMAN Interferon-induced transmembrane protein 2 OS=Homo sapiens GN=IFITM2 PE=1 SV=2
ProteinStart                       57
ProteinEnd                         65
DNAStart                           169
DNAEnd                             195
IndelSequence                      APHNPAPPT
Coverage                           0
Isof.Index                         2
reference.indel.region.seq         VPHNPAPPM
indel.region.seq.overlap.ratio     0.78
type                               mismatch
uniprotID                          Q01629
IndelSequenceAnchorSeq             KEEQEVAMLGAPHNPAPPTSTVIHIRSET
IndepPositioninGenome              308288
Chrom                              chr11
PacBio                             PacBio
region                             exon
PacBioStart                        308120
PacbioEnd                          308438
dot                                .
strand                             +
dot2                               .
PacbioTranscriptID                 PB.1583.1
ReferenceChrom                     chr11
Reference_exon_start               308231
Reference_exon_end                 308438
dot3                               IFITM2
geneName                           .
strand2                            +
exontype                           exon
geneID                             2
exonID                             32
transcriptID                       6
unkonwn                            13
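
The source does not state how the coordinate and ratio columns are computed. The sketch below only shows that the values in the example record are consistent with two simple definitions: position-wise identity between the two peptides, and a 1-based, inclusive conversion from protein to coding-sequence coordinates. Both helper names are hypothetical.

```r
# Values copied from the example record above.
indel.seq     <- "APHNPAPPT"   # IndelSequence
reference.seq <- "VPHNPAPPM"   # reference.indel.region.seq
protein.start <- 57            # ProteinStart
protein.end   <- 65            # ProteinEnd

# Fraction of positions at which two equal-length peptides agree.
positional.identity <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  mean(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
}

# 1-based, inclusive protein coordinates converted to coding-sequence coordinates.
protein.to.dna <- function(start, end) c(start = 3 * (start - 1) + 1, end = 3 * end)

print(round(positional.identity(indel.seq, reference.seq), 2))  # 0.78
print(protein.to.dna(protein.start, protein.end))               # 169 and 195
```

Seven of the nine residues match (7/9 rounds to 0.78), and residues 57 to 65 correspond to nucleotides 169 to 195, in agreement with the indel.region.seq.overlap.ratio, DNAStart and DNAEnd columns.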

5 Theory behind neoantigeR

In RNA-Seq transcriptome data, it is common to observe thousands of alternative isoform sequences that are novel transcribed regions or variants of existing genes. However, the majority of these isoforms result in significantly truncated proteins, or in sequences interrupted by many stop codons, when translated in silico. It is therefore important to survey the fraction of these isoforms that produce peptide sequences highly similar to reference proteins but with meaningful sequence alterations.
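
The idea of in-silico translation interrupted by stop codons can be sketched in base R as follows. This is an illustration under stated assumptions, not the neoantigeR implementation: it uses the standard genetic code, translates a single reading frame, and the function names `translate.dna` and `longest.open.segment` as well as the nucleotide sequence are made up for the example.

```r
# Standard genetic code; codons are enumerated with bases in the order T, C, A, G,
# first base varying slowest and third base fastest.
bases <- c("T", "C", "A", "G")
grid  <- expand.grid(third = bases, second = bases, first = bases,
                     stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)
amino.acids <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
genetic.code <- setNames(amino.acids, codons)

# Translate a DNA string in frame 1; "*" marks a stop codon,
# and unrecognized codons (e.g. containing N) become "X".
translate.dna <- function(dna) {
  dna <- toupper(dna)
  n <- nchar(dna) %/% 3
  if (n == 0) return("")
  starts <- seq(1, by = 3, length.out = n)
  residues <- genetic.code[substring(dna, starts, starts + 2)]
  residues[is.na(residues)] <- "X"
  paste(residues, collapse = "")
}

# Longest stretch of the translation that is not interrupted by a stop codon.
longest.open.segment <- function(protein) {
  segments <- strsplit(protein, "*", fixed = TRUE)[[1]]
  if (length(segments) == 0) return("")
  segments[which.max(nchar(segments))]
}

# Made-up sequence with one in-frame stop codon (TAA).
protein <- translate.dna("ATGGCTCCTCATAATCCTGCTCCTCCTACTTAAGGG")
print(protein)
print(longest.open.segment(protein))
```

A candidate isoform whose longest uninterrupted segment is short relative to the reference protein would fall into the truncated category described above.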