This vignette describes how to run the Keep Me Around (kma) suite to compute intron retention in RNA-Seq experiments.

1 General pipeline

The general pipeline is as follows:

  1. Pre-process to generate intron coordinates and sequences
  2. Quantify transcript expression using eXpress against augmented transcriptome containing introns
  3. Post-process to compute intron retention using R package

TODO: Put a more detailed flowchart here

1.1 Installing

The tool has two required components for intron retention quantification, though installing the post-processing tools (the R package) also installs the Python code, with the exception of its dependencies. The pre-processing step is written in Python and only needs to be run when the annotation changes. The post-processing step is written in R and is where most of your interaction will take place.

1.1.1 Installing the pre-processing tools

If you have installed the R package (see Installing the post-processing tools), then you have successfully installed the pre-processing tools. From R, you can find the path by typing:

system.file("pre-process", package="kma")
## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/pre-process"

The only additional dependencies are the Python packages pyfasta, biopython, and pysam. All three can be installed from PyPI:

pip install pyfasta biopython pysam

1.1.2 Installing the quantification tools

The tools needed for quantification are:

  • Bowtie 2 (read alignment)
  • eXpress (transcript and intron quantification)

Please see the corresponding documentation on their respective web pages.

1.1.3 Installing the post-processing tools

The pre- and post-processing tools are contained in an R package called kma. The development version is always on Github. To install kma and its required R dependencies, run the code below:

required_packages <- c("devtools", "data.table", "reshape2", "dplyr")
install.packages(required_packages)
devtools::install_github("pachterlab/kma")

You can then load the kma R package like you would any other R package:

library("kma")
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1.2 Organization

We recommend organizing your experiments in the following format:

experiment
|____conditionA
| |____conditionA1
| |____conditionA2
| |____conditionA3
|____conditionB
| |____conditionB1
| |____conditionB2
| |____conditionB3

where experiment is the top level for the particular set of experiments, conditionX refers to the condition (e.g. tumor or control), and conditionXY represents a biological replicate. This provides a structured way to keep track of your data as well as an easy way to load it in R. An example can be seen in the worked example section.
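
With this layout, for example, all of the quantification results can later be collected in R with a single glob pattern (a sketch, assuming eXpress output directories named xprs_out, as in the worked example later):

# Sketch: one results.xprs per replicate under experiment/condition/replicate/
xprs_fnames <- Sys.glob("experiment/*/*/xprs_out/results.xprs")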

1.3 Generate intron coordinates (pre-processing)

Note: Pre-processing only has to be run when the gene annotation changes. If you use the same annotation for many experiments, you won’t be running this tool often.

The path to the pre-processing tools can be found in R by typing system.file("pre-process", package="kma"). Open a terminal and put this path in an environment variable:

PRE=/path/to/kma/pre-process

If you are on a Mac, the path might look like: /Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/pre-process

We are now ready to generate the introns. A typical command line call to the pre-processing tools looks like:

python $PRE/generate_introns.py --genome seq.fa --gtf trans.gtf --extend N --out out_dir

Where the inputs are:

  • --genome seq.fa: genome sequence in Multi-FASTA format, where contig names correspond to those in the GTF file.
  • --gtf trans.gtf: annotation file in GTF format.
  • --extend N: Optional, though we strongly recommend setting it. If set, N is the number of bases to overlap into the intron. Note that this should be at most read_length - 1, but we suggest making it smaller.
  • --out out_dir: A directory to write the outputs to. It will be created if it doesn’t already exist.

The following outputs will then be put in out_dir:

  • introns.fa - a FASTA file containing the intron sequences.
  • introns.bed - a BED file with the coordinates used to quantify introns.
  • intron_to_transcripts.txt - a table of intron to transcript relationships. This file is used during the post-processing phase.

1.4 Quantification

Note: Technically, any quantification tool can be used with kma, but currently only eXpress is supported. Please contact me on Github if you’re interested in having another quantification tool supported.

Here, we will discuss quantification. After generate_introns.py has been run, introns.fa should be combined with the full transcript sequences. This can be done using the cat command:

cat trans.fa introns.fa > trans_and_introns.fa

We assume the file name is trans_and_introns.fa in the following sections, but the file name can be anything.

After you’ve done this, the following steps remain:

  1. Create the Bowtie 2 index
  2. Align reads to the augmented transcriptome
  3. Quantify against the augmented transcriptome

1.4.1 Creating the Bowtie 2 index

See the Bowtie 2 manual for more advanced options.

bowtie2-build --offrate 1 trans_and_introns.fa trans_and_introns

This only has to be run once for a given gene annotation; rebuild the index whenever you change the annotation.

1.4.2 Align reads

Once you have a Bowtie 2 index, you can align any number of RNA-Seq experiments to that index. The following arguments are recommended when running Bowtie 2:

bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 -x trans_and_introns \
    -1 left.fastq -2 right.fastq | samtools view -Sb - > hits.bam

If you don’t have many reads, you might consider replacing the -k 200 argument with -a.

1.4.3 Quantify

Once you’ve aligned the reads, you can run eXpress against the alignments. See the eXpress website for additional arguments. The general eXpress call is as follows:

express trans_and_introns.fa hits.bam

If you don’t have many reads, consider running additional batch iterations using the -B argument.

1.5 Computing intron retention (post-processing)

The first step is to load the quantification data into R. The function read_express takes file names, sample names, and condition names. read_express then returns a list with the following members:

  • tpm - a data.frame with TPM of all samples
  • uniq_counts - a data.frame with the number of unique counts of all samples
  • all_data - a list of data.frames from all the eXpress output. Sorted by target_id.
  • sample - a character vector for each sample, describing the sample (e.g. tumor1, tumor2)
  • condition - a character vector for each sample describing the grouping (e.g. tumor)

This data can then be used by newIntronRetention. An example of this can be seen in the worked example.
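
In outline, the post-processing phase is just two calls (the variables here are placeholders; the complete, runnable versions appear in the worked example):

xprs <- read_express(xprs_fnames, sample_names, condition_names)
ir <- newIntronRetention(xprs$tpm, intron_to_trans, xprs$condition, xprs$uniq_counts)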

2 A worked example

We will run through a small example. The data set including reads can be found here. If you are not interested in running the pre-processing and quantification steps yourself, the results from the pipeline are included in the R package:

system.file("example", package="kma")

In this case, you can simply jump to the post-processing phase which runs through using the R package itself.

This example contains the reads, annotation, and sequences for the SF3B1 gene in orthochromatic erythroblasts from GSE53635. The dataset is unrealistically small, so the results aren’t entirely reliable, but it runs through the pipeline very quickly and serves as an example of the complete workflow.

2.1 Pre-processing

Run system.file in R to get the pre-processing directory (see the pre-processing section above for more information):

system.file("pre-process", package = "kma")

Then, store this value in an environment variable in your shell:

PRE=/path/to/kma/pre-process

Extract the downloaded archive, move to the example directory, and run the pre-processing:

tar -xf kma_example.tar.gz
cd kma_example/example
python $PRE/generate_introns.py --genome genome/chr2.fa \
    --gtf annotation/refGene_sf3b1.gtf --extend 25 \
    --out kma_pre-process_out

We chose --extend 25 here because the reads are 51 bases long. An overlap of 25 bases on each side will give us a fair number of junction reads without resulting in spurious alignments. If your read length is longer, feel free to set this value higher.

The output files will then be stored in kma_pre-process_out which you can view using ls kma_pre-process_out:

intron_to_transcripts.txt introns.bed               introns.fa

The introns.fa file can now be merged with the transcriptome sequences in the quantification section.

2.2 Quantification

2.2.1 Merging the transcriptome and intron sequences

Next, we need to merge the transcriptome and intron sequences. This can be done using the command cat:

cat annotation/sf3b1.fa kma_pre-process_out/introns.fa > annotation/sf3b1_with_introns.fa

The file annotation/sf3b1_with_introns.fa now contains the transcriptome sequences and the intron sequences. This FASTA file can now be used to build a Bowtie 2 index for alignment.

2.2.2 Building the Bowtie 2 index

We want to align against the introns and transcripts. Thus, we will build a Bowtie 2 index. See the Bowtie 2 website for more information on the parameters:

bowtie2-build --offrate 1 annotation/sf3b1_with_introns.fa annotation/sf3b1_with_introns

This call will result in many Bowtie 2 index files (*.bt2 files).

2.2.3 Alignment

For alignment, we will only align one sample; the other samples can be aligned similarly, and the process is easy to automate. Please note that the reads we are aligning are single-end. The command for paired-end reads can be found on the Bowtie 2 website.

bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 -x annotation/sf3b1_with_introns \
    -U experiment/ortho/ortho1/*.fastq.gz |
    samtools view -Sb - > experiment/ortho/ortho1/sf3b1.bam

This will result in a BAM file experiment/ortho/ortho1/sf3b1.bam which can then be quantified using eXpress.

2.2.4 Quantification with eXpress

Now that we have alignment files, we can quantify isoform and intron expression with eXpress:

express -o experiment/ortho/ortho1/xprs_out annotation/sf3b1_with_introns.fa \
    experiment/ortho/ortho1/sf3b1.bam

The quantification results will now be in experiment/ortho/ortho1/xprs_out. The other samples can be processed similarly.

2.3 Post-processing

2.3.1 Loading the files

Since we organized our data nicely, we can load the data into R quite easily using Sys.glob:

base_dir <- system.file("example", package="kma")
xprs_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/xprs_out/results.xprs"))
xprs_fnames
## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/example/experiment/ortho/ortho1/xprs_out/results.xprs"
## [2] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/example/experiment/ortho/ortho2/xprs_out/results.xprs"
## [3] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/example/experiment/ortho/ortho3/xprs_out/results.xprs"

Sys.glob performs wildcard expansion on a character vector to find matching file names.

Sample names can be inferred from the file paths by removing the extraneous information using sub and gsub:

sample_names <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", xprs_fnames) %>%
    sub("xprs_out/results.xprs", "", .) %>%
    gsub("/", "", .)
sample_names
## [1] "ortho1" "ortho2" "ortho3"

Since we labelled replicates according to the condition they are a part of, you can also infer the condition names from the sample names:

condition_names <- sub("[0-9]+", "", sample_names)
condition_names
## [1] "ortho" "ortho" "ortho"

You now have three R variables:

  • xprs_fnames - a character vector of file names pointing to the quantification files
  • sample_names - a character vector of identifiers for each sample
  • condition_names - a character vector of identifiers of conditions

You need not use Sys.glob or organize your data the way we have presented, but if you do, manually typing the sample and condition names is unnecessary.
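
If your data are organized differently, the three vectors can simply be written out by hand (the paths and names below are hypothetical):

xprs_fnames <- c("/path/to/tumor1/results.xprs", "/path/to/control1/results.xprs")
sample_names <- c("tumor1", "control1")
condition_names <- c("tumor", "control")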

We can now load all the eXpress data into R:

xprs <- read_express(xprs_fnames, sample_names, condition_names)
## Reading all eXpress data...
## Sorting...
## Extracting column:  tpm 
## Extracting column:  uniq_counts

This results in a named list with the following members:

names(xprs)
## [1] "tpm"         "uniq_counts" "all_data"    "condition"   "sample"

We only need one other file, intron_to_transcripts.txt, which was created during the pre-processing step. We can read it in using read.table or data.table::fread, which is much faster; just be sure to set the data.table = FALSE argument, as seen below:

intron_to_trans <- data.table::fread(file.path(base_dir, "kma_pre-process_out",
    "intron_to_transcripts.txt"), data.table = FALSE)
head(intron_to_trans)
##                     intron    target_id  gene         intron_extension
## 1 chr2:198283312-198283520    NM_012433 SF3B1 chr2:198283287-198283545
## 2 chr2:198266854-198267279    NM_012433 SF3B1 chr2:198266829-198267304
## 3 chr2:198264890-198264975    NM_012433 SF3B1 chr2:198264865-198265000
## 4 chr2:198261052-198262708    NM_012433 SF3B1 chr2:198261027-198262733
## 5 chr2:198270196-198272721    NM_012433 SF3B1 chr2:198270171-198272746
## 6 chr2:198283675-198285151 NM_001005526 SF3B1 chr2:198283650-198285176

To create an IntronRetention object, you can call newIntronRetention with the following command:

ir <- newIntronRetention(xprs$tpm, intron_to_trans, xprs$condition,
    xprs$uniq_counts)
## 'melting' unique counts
## computing denominator
## computing numerator
## computing retention
## 'melting' expression
## joining unique_counts and retention data
## sorting and grouping by (intron, condition)

newIntronRetention takes the following arguments:

  • expression data frame (in this case we are using TPM from eXpress)
  • intron to transcripts relationship table
  • condition character vector
  • Optional: a unique counts vector. This is used for filtering later on

See help(newIntronRetention) for more details on the arguments.

Printing the IntronRetention object results in a short summary:

print(ir)
## IntronRetention object (25 introns)
## ----------------------------------------------
## Samples: ortho1 ortho2 ortho3
## Conditions:  ortho ortho ortho

The data member flat in the IntronRetention object is where most of the operations happen. Advanced users might be interested in performing exploratory analysis directly on this table, which is similar to a denormalized SQL table (often referred to as a “tidy” table in R); a short dplyr sketch follows the output below.

head(ir$flat)
##   sample                   intron numerator denominator  retention
## 1 ortho1 chr2:198257185-198257695  4282.919    448626.8 0.00954673
## 2 ortho2 chr2:198257185-198257695  6981.895    308804.7 0.02260942
## 3 ortho3 chr2:198257185-198257695  5393.122    304077.2 0.01773603
## 4 ortho1 chr2:198257912-198260779 25633.290    469977.2 0.05454156
## 5 ortho2 chr2:198257912-198260779 12321.710    314144.5 0.03922306
## 6 ortho3 chr2:198257912-198260779 10855.100    309539.2 0.03506858
##   condition unique_counts
## 1     ortho            23
## 2     ortho            15
## 3     ortho            15
## 4     ortho          1040
## 5     ortho           200
## 6     ortho           207
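
Since ir$flat is a tidy table and dplyr is already loaded, quick summaries take only a few lines. For instance, this sketch (ours, not part of the package) computes the mean retention of each intron across samples, using the column names shown above:

ir$flat %>%
    group_by(intron) %>%
    summarise(mean_retention = mean(retention))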

2.3.2 Filtering

Filter functions have the following format: filter_name(ir, options), where ir is an IntronRetention object and options are the options for the filter. They return an IntronRetention object with an updated flat member. Filters can also be implemented by the user: each filter adds a column whose name begins with f_ and whose value is TRUE if the intron passes the filter and FALSE otherwise.
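
For example, here is a minimal sketch of a user-defined filter following these conventions (the function name, column name, and default threshold are hypothetical, not part of kma):

# Hypothetical filter: flag introns whose retention is at least min_retention.
# Adds a logical column f_min_retention to ir$flat, per the f_ naming convention.
filter_min_retention <- function(ir, min_retention = 0.01) {
    ir$flat$f_min_retention <- ir$flat$retention >= min_retention
    ir
}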

The filters below remove introns whose denominator has expression below 1 TPM, as well as introns with a PSI of exactly 0 or 1 (due to either no coverage contribution from the intron or coverage solely from the intron). filter_low_frags removes introns that don’t have at least N unique reads:

ir <- ir %>%
    filter_low_tpm(1) %>%
    filter_perfect_psi() %>%
    filter_low_frags(3)
## Not grouped. Grouping.
colnames(ir$flat)
##  [1] "sample"        "intron"        "numerator"     "denominator"  
##  [5] "retention"     "condition"     "unique_counts" "f_low_tpm_1"  
##  [9] "f_perf_psi"    "f_low_count_3"

2.3.2.1 Zero coverage

The zero coverage filter is a special type of filter that has to be computed on the read alignments and corresponding eXpress results.

python $PRE/zeroCoverage.py experiment/ortho/ortho1/xprs_out/results.xprs \
    experiment/ortho/ortho1/sf3b1.bam \
    experiment/ortho/ortho1/zero_coverage.txt

The output is then in experiment/ortho/ortho1/zero_coverage.txt. Zero coverage data can be read in using the function get_batch_intron_zc. Like read_express, it needs file names, sample names, and condition names:

zc_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/zero_coverage.txt"))
zc_samples <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", zc_fnames) %>%
    sub("zero_coverage.txt", "", .) %>%
    gsub("/", "", .)
zc_conditions <- sub("[0-9]+", "", zc_samples)
all_zc <- get_batch_intron_zc(zc_fnames, zc_samples, zc_conditions)
head(all_zc)
## Source: local data frame [6 x 6]
## 
##           intron_extension zc_len      zc_pval      zc_qval sample
## 1 chr2:198257160-198257720    143 9.972254e-05 1.894728e-03 ortho1
## 2 chr2:198257887-198260804    242 3.881734e-41 7.375295e-40 ortho1
## 3 chr2:198261027-198262733    220 2.706177e-07 5.141736e-06 ortho1
## 4 chr2:198262815-198263209     81 2.808659e-01 1.000000e+00 ortho1
## 5 chr2:198263280-198264803    203 8.181478e-05 1.554481e-03 ortho1
## 6 chr2:198264865-198265000     16 6.042884e-01 1.000000e+00 ortho1
## Variables not shown: condition (chr)

The result from get_batch_intron_zc is a data.frame that can be inspected manually if you’d like. To incorporate it into an IntronRetention object, use summarize_zero_coverage:

ir <- summarize_zero_coverage(ir, all_zc)
## Summarize zero coverage across introns and conditions
## Join IntronRetention and zero coverage

This adds a new column to ir$flat called f_zc_*:

colnames(ir$flat)
##  [1] "intron"        "condition"     "sample"        "numerator"    
##  [5] "denominator"   "retention"     "unique_counts" "f_low_tpm_1"  
##  [9] "f_perf_psi"    "f_low_count_3" "f_zc_0.2"

2.3.3 Hypothesis testing

Hypothesis testing aggregates all filters (by taking their intersection), computes the null distribution, and then returns a data frame summarizing the test. Since there is some randomness involved in generating the permutations, you might consider setting a seed before calling retention_test, as we do below:

set.seed(42)
ir_test <- retention_test(ir)
## aggregating filters
## joining filtered data set
## estimating the null for each condition
## computing p-values
head(ir_test)
##   condition                   intron mean_retention var_retention pvalue
## 1     ortho chr2:198257912-198260779    0.042944402  1.051855e-04 0.5877
## 2     ortho chr2:198263305-198264778    0.008521050  1.725107e-05 0.9791
## 3     ortho chr2:198267759-198268308    0.003581595  1.310575e-06 0.9973
## 4     ortho chr2:198274731-198281464    0.050106833  2.162046e-04 0.5532
## 5     ortho chr2:198283312-198283520    0.410303948  1.938278e-02 0.0000
## 6     ortho chr2:198283675-198285151    0.202623085  5.853153e-03 0.0919
##   qvalue
## 1 0.9973
## 2 0.9973
## 3 0.9973
## 4 0.9973
## 5 0.0000
## 6 0.3676

Significant introns can be found using the dplyr function filter:

ir_test %>%
    filter(qvalue <= 0.10) %>%
    select(-c(pvalue))
## Source: local data table [1 x 5]
## 
##   condition                   intron mean_retention var_retention qvalue
## 1     ortho chr2:198283312-198283520      0.4103039    0.01938278      0

In this case we see only one significant intron. Since the dataset is so small, the null distribution is not well estimated and the results shouldn’t be taken too seriously; with a larger dataset (as in a typical experiment), the results are more reliable.

3 Intron retention in terminal erythropoiesis

TODO: This section is coming soon. It will showcase how to make plots as well as some diagnostics one might do on a complete data set.