kma
This vignette describes how to run the Keep Me Around (kma) suite to compute intron retention in RNA-Seq experiments.
The general pipeline is as follows:
TODO: Put a more detailed flowchart here.
There are two required components for performing intron retention quantification. Installing the post-processing tools (see Installing the post-processing tools) also installs the Python pre-processing code, with the exception of its dependencies. The pre-processing step is written in Python and only needs to be run when the annotation changes. The post-processing step is written in R and is where most of your interaction will take place.
If you have installed the R package (see Installing the post-processing tools), then you have successfully installed the pre-processing tools. From R, you can find the path by typing:
system.file("pre-process", package="kma")
## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/pre-process"
The only additional dependencies are the Python packages pyfasta, biopython, and pysam. All packages can be installed via PyPI:
pip install pyfasta biopython pysam
The tools needed for quantification are:

- Bowtie 2
- samtools
- eXpress

Please see the corresponding documentation on their respective web pages.
The pre- and post-processing tools are contained in an R package called kma. The development version is always on GitHub. To install kma and the required R dependencies, run the code below:
required_packages <- c("devtools", "data.table", "reshape2", "dplyr")
install.packages(required_packages)
devtools::install_github("pachterlab/kma")
You can then load the kma R package like you would any other R package:
library("kma")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We recommend organizing your experiments in the following format:
experiment
|____conditionA
| |____conditionA1
| |____conditionA2
| |____conditionA3
|____conditionB
| |____conditionB1
| |____conditionB2
| |____conditionB3
where experiment is the top level for the particular set of experiments, conditionX refers to the condition (e.g. tumor or control), and conditionXY represents a biological replicate. This allows for a structured way to keep track of your data as well as an easy way to load it into R. An example can be seen in the worked example section.
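If you are starting a new experiment from scratch, one way to create this skeleton from within R is sketched below (the condition names and replicate count here are placeholders; adjust them to your experiment):

conditions <- c("conditionA", "conditionB")
n_reps <- 3
for (cond in conditions) {
  for (i in seq_len(n_reps)) {
    # creates e.g. experiment/conditionA/conditionA1
    dir.create(file.path("experiment", cond, paste0(cond, i)), recursive = TRUE)
  }
}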
Note: Pre-processing only has to be run when the gene annotation changes. If you use the same annotation for many experiments, you won’t be running this tool often.
The path to the pre-processing tools can be found in R by typing system.file("pre-process", package="kma"). Open a terminal and put this path in an environment variable:

PRE=/path/to/kma/pre-process

If you are on a Mac, the path might look like: /Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/pre-process
We are now ready to generate the introns. A typical command line call to the pre-processing tools looks like:
python $PRE/generate_introns.py --genome seq.fa --gtf trans.gtf --extend N --out out_dir
Where the inputs are:

- --genome seq.fa: genome sequence in Multi-FASTA format, where contig names correspond to those in the GTF file.
- --gtf trans.gtf: annotation file in GTF format.
- --extend N: optional, though we strongly recommend setting it. If set, N is the number of bases to overlap into the intron. Note that this should be at most read_length - 1, but we suggest making it smaller.
- --out out_dir: a directory to write the outputs to. It will be created if it doesn't already exist.

The following outputs will then be put in out_dir:

- intron_to_transcripts.txt
- introns.bed
- introns.fa
Note: Technically any quantification tool can be used with kma, but currently only eXpress is supported. Please contact me on GitHub if you're interested in having another quantification tool supported.
Here, we will discuss quantification. After generate_introns.py is run, introns.fa should be combined with the full transcript sequences. This can be done using the command cat:
cat trans.fa introns.fa > trans_and_introns.fa
We assume the file name is trans_and_introns.fa in the following sections, but the file name can be anything.

After you've done this, this section requires the following steps:

1. Build a Bowtie 2 index from trans_and_introns.fa.
2. Align the reads to the index with Bowtie 2.
3. Quantify the alignments with eXpress.
See the Bowtie 2 manual for more advanced options.
bowtie2-build --offrate 1 trans_and_introns.fa trans_and_introns
This only has to be run once per annotation; you need to rebuild the index only when you change the gene annotation.
Once you have a Bowtie 2 index, you can align any number of RNA-Seq experiments to that index. The following arguments are recommended when running Bowtie 2:
bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 -x trans_and_introns \
-1 left.fastq -2 right.fastq | samtools view -Sb - > hits.bam
If you don't have many reads, you might consider replacing the -k 200 argument with -a.
Once you’ve aligned the reads, you can run eXpress against the alignments. See the eXpress website for additional arguments. The general eXpress call is as follows:
express trans_and_introns.fa hits.bam
If you don't have many reads, consider running additional batch iterations using the -B argument.
The first step is to load the quantification data into R. The function read_express takes a list of file names, sample names, and condition names. read_express then returns a list with the following attributes:
- tpm - a data.frame with the TPM of all samples
- uniq_counts - a data.frame with the number of unique counts of all samples
- all_data - a list of data.frames from all the eXpress output, sorted by target_id
- sample - a character vector for each sample, describing the sample (e.g. tumor1, tumor2)
- condition - a character vector for each sample, describing the grouping (e.g. tumor)

This data can then be used by newIntronRetention. A minimal sketch of a read_express call follows; a full example can be seen in the worked example.
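The names below are hypothetical (real paths appear in the worked example); this is only meant to show the shape of a read_express call:

xprs_fnames <- c("experiment/tumor/tumor1/xprs_out/results.xprs",
                 "experiment/tumor/tumor2/xprs_out/results.xprs")
sample_names <- c("tumor1", "tumor2")
condition_names <- c("tumor", "tumor")
xprs <- read_express(xprs_fnames, sample_names, condition_names)
names(xprs)  # "tpm" "uniq_counts" "all_data" "condition" "sample"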
We will run through a small example. The data set, including reads, can be found here. If you are not interested in running the pre-processing and quantification steps yourself, the results from the pipeline already exist in the R package:
system.file("example", package="kma")
In this case, you can simply jump to the post-processing phase, which works entirely within the R package itself.
This example contains the reads, annotation, and sequences for the SF3B1 gene in orthochromatic erythroblasts from GSE53635. This dataset is unrealistically small, so the results aren't entirely reliable, but it can be run through the pipeline very quickly and serves as an example of how to use it.
Run system.file in R to get the pre-processing directory (see the pre-processing section above for more information):
system.file("pre-process", package = "kma")
Then, store this value in an environment variable in your shell:
PRE=/path/to/kma/pre-process
Extract the data from the downloaded archive, move to the example directory, and run the pre-processing:
tar -xf kma_example.tar.gz
cd kma_example/example
python $PRE/generate_introns.py --genome genome/chr2.fa \
--gtf annotation/refGene_sf3b1.gtf --extend 25 \
--out kma_pre-process_out
We chose --extend 25 here because the reads are 51 bases long. An overlap of 25 bases on each side will give us a fair number of junction reads without resulting in spurious alignments. If your read length is longer, feel free to set this value higher.
The output files will then be stored in kma_pre-process_out, which you can view using ls kma_pre-process_out:
intron_to_transcripts.txt introns.bed introns.fa
The introns.fa file can now be merged with the transcriptome sequences, as in the quantification section.

Next, we need to merge the transcriptome and intron sequences. This can be done using the command cat:
cat annotation/sf3b1.fa kma_pre-process_out/introns.fa > annotation/sf3b1_with_introns.fa
The file annotation/sf3b1_with_introns.fa now contains the transcriptome sequences and the intron sequences. This FASTA file can now be used to build a Bowtie 2 index for alignment.
We want to align against the introns and transcripts. Thus, we will build a Bowtie 2 index. See the Bowtie 2 website for more information on the parameters:
bowtie2-build --offrate 1 annotation/sf3b1_with_introns.fa annotation/sf3b1_with_introns
This call will result in many Bowtie 2 index files (*.bt2 files).

For alignment, we will only align one sample, since the other samples can be aligned similarly and the process can be automated (a sketch automating alignment and quantification across samples appears after the quantification step below). Please note that the reads we are aligning are single-end. The command for paired-end reads can be found on the Bowtie 2 website.
bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 -x annotation/sf3b1_with_introns \
-U experiment/ortho/ortho1/*.fastq.gz |
samtools view -Sb - > experiment/ortho/ortho1/sf3b1.bam
This will result in a BAM file, experiment/ortho/ortho1/sf3b1.bam, which can then be quantified using eXpress.
Now that we have alignment files, we can quantify isoform and intron expression with eXpress:
express -o experiment/ortho/ortho1/xprs_out annotation/sf3b1_with_introns.fa \
experiment/ortho/ortho1/sf3b1.bam
The quantification results will now be in experiment/ortho/ortho1/xprs_out. The other samples can be processed similarly.
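Since alignment and quantification must be repeated for every sample, one way to automate both steps from R is sketched below. This is only a sketch: it assumes bowtie2, samtools, and express are on your PATH and that you run it from kma_example/example with the directory layout above; a shell script or Makefile works equally well.

samples <- c("ortho1", "ortho2", "ortho3")
for (s in samples) {
  d <- file.path("experiment", "ortho", s)
  # align the single-end reads against the transcriptome + introns index
  align <- paste("bowtie2 -k 200 --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4",
                 "-x annotation/sf3b1_with_introns -U", file.path(d, "*.fastq.gz"),
                 "| samtools view -Sb - >", file.path(d, "sf3b1.bam"))
  system(align)
  # quantify the resulting alignments with eXpress
  quant <- paste("express -o", file.path(d, "xprs_out"),
                 "annotation/sf3b1_with_introns.fa", file.path(d, "sf3b1.bam"))
  system(quant)
}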
Since we organized our data nicely, we can load the data into R quite easily using Sys.glob:
base_dir <- system.file("example", package="kma")
xprs_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/xprs_out/results.xprs"))
xprs_fnames
## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/example/experiment/ortho/ortho1/xprs_out/results.xprs"
## [2] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/example/experiment/ortho/ortho2/xprs_out/results.xprs"
## [3] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/kma/example/experiment/ortho/ortho3/xprs_out/results.xprs"
Sys.glob does wildcard expansion on a character vector to find file names.

Sample names can be inferred from the file paths by removing the extraneous information using sub and gsub:
sample_names <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", xprs_fnames) %>%
sub("xprs_out/results.xprs", "", .) %>%
gsub("/", "", .)
sample_names
## [1] "ortho1" "ortho2" "ortho3"
Since we labelled replicates according to the condition they are part of, you can also infer the condition names from the sample names:
condition_names <- sub("[0-9]+", "", sample_names)
condition_names
## [1] "ortho" "ortho" "ortho"
You now have three R variables:

- xprs_fnames - a character vector of file names pointing to the quantification files
- sample_names - a character vector of identifiers for each sample
- condition_names - a character vector of identifiers of conditions

A user need not use Sys.glob or have their data structured the way we have presented, but if they do, manually typing the sample and condition names is unnecessary.
We can now load all the eXpress data into R:
xprs <- read_express(xprs_fnames, sample_names, condition_names)
## Reading all eXpress data...
## Sorting...
## Extracting column: tpm
## Extracting column: uniq_counts
This results in a named list with the following members:
names(xprs)
## [1] "tpm" "uniq_counts" "all_data" "condition" "sample"
We only need one other file, intron_to_transcripts.txt, which was created during the pre-processing step. We can read it in using read.table or data.table::fread, which is much faster; just be sure to use the data.table = FALSE flag as seen below:
intron_to_trans <- data.table::fread(file.path(base_dir, "kma_pre-process_out",
"intron_to_transcripts.txt"), data.table = FALSE)
head(intron_to_trans)
## intron target_id gene intron_extension
## 1 chr2:198283312-198283520 NM_012433 SF3B1 chr2:198283287-198283545
## 2 chr2:198266854-198267279 NM_012433 SF3B1 chr2:198266829-198267304
## 3 chr2:198264890-198264975 NM_012433 SF3B1 chr2:198264865-198265000
## 4 chr2:198261052-198262708 NM_012433 SF3B1 chr2:198261027-198262733
## 5 chr2:198270196-198272721 NM_012433 SF3B1 chr2:198270171-198272746
## 6 chr2:198283675-198285151 NM_001005526 SF3B1 chr2:198283650-198285176
To create an IntronRetention object, you can call newIntronRetention with the following command:
ir <- newIntronRetention(xprs$tpm, intron_to_trans, xprs$condition,
xprs$uniq_counts)
## 'melting' unique counts
## computing denominator
## computing numerator
## computing retention
## 'melting' expression
## joining unique_counts and retention data
## sorting and grouping by (intron, condition)
newIntronRetention takes the following arguments: the expression table (here, the TPMs from read_express), the intron-to-transcript mapping, the condition of each sample, and the unique counts of each sample. See help(newIntronRetention) for more details on the arguments.
Printing the IntronRetention object results in a short summary:
print(ir)
## IntronRetention object (25 introns)
## ----------------------------------------------
## Samples: ortho1 ortho2 ortho3
## Conditions: ortho ortho ortho
The data member flat in the IntronRetention object is where most of the operations happen. Advanced users might be interested in performing direct exploratory analysis on this table, which is similar to a denormalized SQL table (often referred to as a "tidy" table in R).
head(ir$flat)
## sample intron numerator denominator retention
## 1 ortho1 chr2:198257185-198257695 4282.919 448626.8 0.00954673
## 2 ortho2 chr2:198257185-198257695 6981.895 308804.7 0.02260942
## 3 ortho3 chr2:198257185-198257695 5393.122 304077.2 0.01773603
## 4 ortho1 chr2:198257912-198260779 25633.290 469977.2 0.05454156
## 5 ortho2 chr2:198257912-198260779 12321.710 314144.5 0.03922306
## 6 ortho3 chr2:198257912-198260779 10855.100 309539.2 0.03506858
## condition unique_counts
## 1 ortho 23
## 2 ortho 15
## 3 ortho 15
## 4 ortho 1040
## 5 ortho 200
## 6 ortho 207
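For example, a quick exploratory summary of ir$flat (a sketch using dplyr and the column names shown above) computes the mean retention and total unique counts per intron within each condition:

ir$flat %>%
  group_by(condition, intron) %>%
  summarize(mean_retention = mean(retention),
            total_unique = sum(unique_counts)) %>%
  ungroup() %>%
  arrange(desc(mean_retention)) %>%
  head()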
Filter functions have the following format: filter_name(ir, options), where ir is an IntronRetention object and options are options for the filter. They return an IntronRetention object with an updated flat member. Filters can also be implemented by the user: a filter should return TRUE if the intron passes the filter and FALSE otherwise, and the column name for each filter should begin with f_.
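As an illustration of this convention, below is a sketch of a hypothetical user-defined filter (filter_min_numerator is not part of kma) that marks an intron as passing when its intron-level TPM (the numerator column shown above) is at least min_tpm:

filter_min_numerator <- function(ir, min_tpm) {
  # encode the option in the column name, mirroring e.g. f_low_tpm_1
  col_name <- paste0("f_min_numerator_", min_tpm)
  ir$flat[[col_name]] <- ir$flat$numerator >= min_tpm
  ir
}

# usage: ir <- filter_min_numerator(ir, 1)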
The filters below remove introns whose denominator has expression below 1 TPM and introns with a PSI of exactly 0 or 1 (due to either no contribution from intron coverage, or coverage coming solely from the intron). filter_low_frags filters out introns that don't have at least N unique reads:
ir <- ir %>%
filter_low_tpm(1) %>%
filter_perfect_psi() %>%
filter_low_frags(3)
## Not grouped. Grouping.
colnames(ir$flat)
## [1] "sample" "intron" "numerator" "denominator"
## [5] "retention" "condition" "unique_counts" "f_low_tpm_1"
## [9] "f_perf_psi" "f_low_count_3"
The zero coverage filter is a special type of filter that has to be computed from the read alignments and the corresponding eXpress results:
python $PRE/zeroCoverage.py experiment/ortho/ortho1/xprs_out/results.xprs \
experiment/ortho/ortho1/sf3b1.bam \
experiment/ortho/ortho1/zero_coverage.txt
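zeroCoverage.py must be run once per sample. A minimal R sketch of a loop over the worked-example samples (assuming the PRE environment variable is set as in the pre-processing section and that you run it from kma_example/example) is:

samples <- c("ortho1", "ortho2", "ortho3")
for (s in samples) {
  d <- file.path("experiment", "ortho", s)
  system(paste("python", file.path(Sys.getenv("PRE"), "zeroCoverage.py"),
               file.path(d, "xprs_out", "results.xprs"),
               file.path(d, "sf3b1.bam"),
               file.path(d, "zero_coverage.txt")))
}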
The output for each sample is then in its directory (e.g. experiment/ortho/ortho1/zero_coverage.txt). Zero coverage data can be read in using the function get_batch_intron_zc. Like read_express, it needs file names, sample names, and condition names:
zc_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/zero_coverage.txt"))
zc_samples <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", zc_fnames) %>%
sub("zero_coverage.txt", "", .) %>%
gsub("/", "", .)
zc_conditions <- sub("[0-9]+", "", zc_samples)
all_zc <- get_batch_intron_zc(zc_fnames, zc_samples, zc_conditions)
head(all_zc)
## Source: local data frame [6 x 6]
##
## intron_extension zc_len zc_pval zc_qval sample
## 1 chr2:198257160-198257720 143 9.972254e-05 1.894728e-03 ortho1
## 2 chr2:198257887-198260804 242 3.881734e-41 7.375295e-40 ortho1
## 3 chr2:198261027-198262733 220 2.706177e-07 5.141736e-06 ortho1
## 4 chr2:198262815-198263209 81 2.808659e-01 1.000000e+00 ortho1
## 5 chr2:198263280-198264803 203 8.181478e-05 1.554481e-03 ortho1
## 6 chr2:198264865-198265000 16 6.042884e-01 1.000000e+00 ortho1
## Variables not shown: condition (chr)
The result from get_batch_intron_zc is a data.frame that can be inspected manually if you'd like. To incorporate it into an IntronRetention object, use summarize_zero_coverage:
ir <- summarize_zero_coverage(ir, all_zc)
## Summarize zero coverage across introns and conditions
## Join IntronRetention and zero coverage
This adds a new column to ir$flat named f_zc_*:
colnames(ir$flat)
## [1] "intron" "condition" "sample" "numerator"
## [5] "denominator" "retention" "unique_counts" "f_low_tpm_1"
## [9] "f_perf_psi" "f_low_count_3" "f_zc_0.2"
Hypothesis testing aggregates all filters (by taking their intersection), computes the null distribution, and then returns a data frame summarizing the test. Since there is some randomness involved in generating the permutations, you might consider setting a seed before calling retention_test, as we do below:
set.seed(42)
ir_test <- retention_test(ir)
## aggregating filters
## joining filtered data set
## estimating the null for each condition
## computing p-values
head(ir_test)
## condition intron mean_retention var_retention pvalue
## 1 ortho chr2:198257912-198260779 0.042944402 1.051855e-04 0.5877
## 2 ortho chr2:198263305-198264778 0.008521050 1.725107e-05 0.9791
## 3 ortho chr2:198267759-198268308 0.003581595 1.310575e-06 0.9973
## 4 ortho chr2:198274731-198281464 0.050106833 2.162046e-04 0.5532
## 5 ortho chr2:198283312-198283520 0.410303948 1.938278e-02 0.0000
## 6 ortho chr2:198283675-198285151 0.202623085 5.853153e-03 0.0919
## qvalue
## 1 0.9973
## 2 0.9973
## 3 0.9973
## 4 0.9973
## 5 0.0000
## 6 0.3676
Significant introns can be found using the dplyr function filter:
ir_test %>%
filter(qvalue <= 0.10) %>%
select(-c(pvalue))
## Source: local data table [1 x 5]
##
## condition intron mean_retention var_retention qvalue
## 1 ortho chr2:198283312-198283520 0.4103039 0.01938278 0
In this case we only see one significant intron. Since the dataset is so small, the null distribution isn't very reliable and the results shouldn't be taken too seriously; with a larger data set (as in a normal experiment), the results are more dependable.
TODO: This section is coming soon. It will showcase how to make plots as well as some diagnostics one might run on a complete data set.