Package: DASC
Authors: Haidong Yi, Ayush T. Raman
Version: 1.1.0
Compiled date: 2018-05-31
License: MIT + file LICENSE
Prerequisites: NMF, cvxclustr, Biobase, foreach, doParallel

1 Getting started

Before installing the package, please make sure the prerequisite packages described in DESCRIPTION file have been installed successfully.

DASC is an R package, to install the package, start terminal and enter:

On Linux/MacOs

git clone https://https://github.com/HaidYi/DASC.git
R CMD build DASC
R CMD INSTALL DASC_*.tar.gz

2 Introduction

DASC is used for identifying batches and classifying samples into different batches in a high dimensional gene expression dataset. The batch information can be further used as a covariate in conjunction with other variables of interest among standard bioinformatics analysis like differential expression analysis.

2.1 Citation info

If you use DASC for your analysis, please cite our paper as here below.

    @article{Yi2018Detecting,
      title={Detecting hidden batch factors through data-adaptive adjustment for biological effects},
      author={Yi, H. and Raman, A. T. and Zhang, H. and Allen, G. I. and Liu, Z.},
      journal={Bioinformatics},
      volume={34},
      number={7},
      pages={1141},
      year={2018},
}

3 Quick Example

library(DASC)
data("esGolub")
samples <- c(20,21,28,30)
dat <- exprs(esGolub)[1:100,samples]
pdat <- pData(esGolub)[samples,]

## use nrun = 50 or more for better convergence of results
res <- DASC(edata = dat, pdata = pdat, factor = pdat$Cell, 
                        method = 'ama', type = 3, lambda = 1, 
                        rank = 2:3, nrun = 5, annotation='esGolub Dataset')
res
$`2`
$`2`$consensus
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    1
[2,]    0    1    1    0
[3,]    0    1    1    0
[4,]    1    0    0    1

$`2`$class
[1] 1 2 2 1
Levels: 1 2

$`2`$dispersion
[1] 1


$`3`
$`3`$consensus
     [,1] [,2] [,3] [,4]
[1,]  1.0  0.0  0.0  0.6
[2,]  0.0  1.0  0.4  0.0
[3,]  0.0  0.4  1.0  0.0
[4,]  0.6  0.0  0.0  1.0

$`3`$class
[1] 1 2 3 1
Levels: 1 2 3

$`3`$dispersion
[1] 0.76

4 Setting up the data

The first step in using DASC package is to properly format the data. For example, in case of gene expression data, it should be a matrix with features (genes, transcripts) in the rows and samples in the columns. DASC then requires the information for the variable of interest to model the gene expression data effectively.Variable of interest could be a genotype or treatment information.

4.1 Stanford RNA-Seq Dataset

Below is an example of Stanford gene expression dataset (Chen et. al. PNAS, 2015; Gilad et. al. F1000 Research, 2015). It is a filtered raw counts dataset which was published by Gilad et al. F1000 Research. 30% of genes with the lowest expression & mitochondrial genes were removed (Gilad et al.F1000 Research).

## libraries
set.seed(99999)
library(DESeq2)
library(ggplot2)
library(pcaExplorer)

## dataset
rawCounts <- stanfordData$rawCounts
metadata <- stanfordData$metadata
## Using a smaller dataset
idx <- which(metadata$tissue %in% c("adipose", "adrenal", "sigmoid"))
rawCounts <- rawCounts[,idx]
metadata <- metadata[idx,]
head(rawCounts)
        adipose (h) adrenal (h) sigmoid (h) adipose (m) adrenal (m)
STAG2          1430        4707        4392        3223        8235
STAG1           835        2362        1687        2750        2732
GOSR2           142         891          97        1599        1430
C1orf43        1856        9591        2611         706         498
ART5              1           4           0           0           0
ART1              0           0           0           0           1
        sigmoid (m)
STAG2         10435
STAG1          2833
GOSR2           887
C1orf43         753
ART5              0
ART1              0
head(metadata)
                setname                 seqBatch species  tissue
adipose (h) adipose (h) D87PMJN1:253:D2GUAACXX:8   human adipose
adrenal (h) adrenal (h) D87PMJN1:253:D2GUAACXX:8   human adrenal
sigmoid (h) sigmoid (h) D87PMJN1:253:D2GUAACXX:8   human sigmoid
adipose (m) adipose (m) D4LHBFN1:276:C2HKJACXX:4   mouse adipose
adrenal (m) adrenal (m) D4LHBFN1:276:C2HKJACXX:4   mouse adrenal
sigmoid (m) sigmoid (m) D4LHBFN1:276:C2HKJACXX:4   mouse sigmoid

5 Batch detection using PCA Analysis

## Normalizing the dataset using DESeq2
dds <- DESeqDataSetFromMatrix(rawCounts, metadata, design = ~ species+tissue)
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
lognormalizedCounts <- log2(dat + 1)
## PCA plot using 
rld.dds <- rlog(dds)
pcaplot(rld.dds, intgroup=c("tissue","species"), ntop=1000, pcX=1, pcY=2)

In the PCA plot, PC1 shows the differences between the species. PC2 shows the differences between the species i.e. samples clustering based on tissues.

6 Batch detection using DASC

res <- DASC(edata = dat, pdata = metadata, factor = metadata$tissue,
                method = 'ama', type = 3, lambda = 1, rank = 2:3, nrun = 10,
                annotation = 'Stanford Dataset')
## Consensus plot
annotation <- data.frame(Batch = res$`2`$class, Tissue = as.character(metadata$tissue)) 
consensusmap(res$`2`$consensus, annCol = annotation, main = "rank = 2")

## Batches -- dataset has 6 batches
sample.clust <- data.frame(sample.name = colnames(lognormalizedCounts), 
                            clust = as.vector(res$`2`$class), 
                            batch = metadata$seqBatch)
ggplot(data = sample.clust, aes(x=c(1:6), y=clust, color=factor(clust))) + 
    geom_point(size = 4) + xlab("Sample Number") + ylab("Cluster Number")

Based on the above plots, we observe that the dataset has 2 batches. This can further be compared with the sequencing platform or metadata$seqBatch. The results suggest that differences in platform led to batch effects. Batch number can be used as another covariate, when differential expression analyses using DESeq2,edgeR or limma are performed.

7 Analysis on entire Stanford Dataset

## not running this part of the code for building package
## Using entire dataset
rawCounts <- stanfordData$rawCounts
metadata <- stanfordData$metadata
dds <- DESeqDataSetFromMatrix(rawCounts, metadata, design = ~ species+tissue)
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
lognormalizedCounts <- log2(dat + 1)

## PCA Plot
rld.dds <- rlog(dds)
pcaplot(rld.dds, intgroup=c("tissue","species"), ntop = nrow(rld.dds), 
        pcX = 1, pcY = 2)

## Running DASC
res <- DASC(edata = dat, pdata = metadata, factor = metadata$tissue,
                method = 'ama', type = 3, lambda = 1, rank = 2:10,
                nrun = 100, annotation = 'Stanford Dataset')

## Consensus plot
annotation <- data.frame(Batch = res$`10`$class, Tissue = metadata$tissue)
consensusmap(res$`10`$consensus, annCol = annotation, main = "rank = 10")

## Clustering samples based on batches
sample.clust <- data.frame(sample.name = colnames(lognormalizedCounts), 
                            clust = as.vector(res$`10`$class), 
                            batch = metadata$seqBatch)
ggplot(data = sample.clust, aes(x=c(1:26), y=clust, color=factor(clust))) + 
    geom_point(size = 4) + xlab("Sample Number") + ylab("Cluster Number")

Based on the PCA plot, we see the PC1 captures the differences based on species. Based on consensusmap plot and Cophenetic & dispersion values, there are 10 batches in the dataset. Our results show that the batches are not only due to platform but due other reason like differences in the date and the time of library preparation and sequencing of the samples. Another important observation, is that at rank equal to 2, we observe entire dataset to cluster based on speicies.

8 Session Info

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] C/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] RColorBrewer_1.1-2          pcaExplorer_2.6.0          
 [3] ggplot2_2.2.1               DESeq2_1.20.0              
 [5] SummarizedExperiment_1.10.1 DelayedArray_0.6.0         
 [7] BiocParallel_1.14.1         matrixStats_0.53.1         
 [9] GenomicRanges_1.32.3        GenomeInfoDb_1.16.0        
[11] IRanges_2.14.10             S4Vectors_0.18.2           
[13] DASC_1.1.0                  cvxclustr_1.1.1            
[15] igraph_1.2.1                Matrix_1.2-14              
[17] NMF_0.21.0                  cluster_2.0.7-1            
[19] rngtools_1.3.1              pkgmaker_0.27              
[21] registry_0.5                Biobase_2.40.0             
[23] BiocGenerics_0.26.0         knitr_1.20                 
[25] BiocStyle_2.8.2            

loaded via a namespace (and not attached):
  [1] colorspace_1.3-2       rprojroot_1.3-2        htmlTable_1.12        
  [4] XVector_0.20.0         base64enc_0.1-3        d3heatmap_0.6.1.2     
  [7] rstudioapi_0.7         topGO_2.32.0           DT_0.4                
 [10] ggrepel_0.8.0          bit64_0.9-7            AnnotationDbi_1.42.1  
 [13] codetools_0.2-15       splines_3.5.0          doParallel_1.0.11     
 [16] geneplotter_1.58.0     Formula_1.2-3          gridBase_0.4-7        
 [19] annotate_1.58.0        GO.db_3.6.0            png_0.1-7             
 [22] pheatmap_1.0.10        graph_1.58.0           shinydashboard_0.7.0  
 [25] shiny_1.1.0            compiler_3.5.0         httr_1.3.1            
 [28] GOstats_2.46.0         backports_1.1.2        assertthat_0.2.0      
 [31] lazyeval_0.2.1         limma_3.36.1           later_0.7.2           
 [34] acepack_1.4.1          htmltools_0.3.6        prettyunits_1.0.2     
 [37] tools_3.5.0            Category_2.46.0        glue_1.2.0            
 [40] gtable_0.2.0           GenomeInfoDbData_1.1.0 reshape2_1.4.3        
 [43] Rcpp_0.12.17           iterators_1.0.9        crosstalk_1.0.0       
 [46] xfun_0.1               stringr_1.3.1          mime_0.5              
 [49] XML_3.98-1.11          shinyAce_0.3.1         zlibbioc_1.26.0       
 [52] scales_0.5.0           shinyBS_0.61           promises_1.0.1        
 [55] RBGL_1.56.0            SparseM_1.77           yaml_2.1.19           
 [58] memoise_1.1.0          gridExtra_2.3          biomaRt_2.36.1        
 [61] rpart_4.1-13           latticeExtra_0.6-28    stringi_1.2.2         
 [64] RSQLite_2.1.1          genefilter_1.62.0      foreach_1.4.4         
 [67] checkmate_1.8.5        bibtex_0.4.2           rlang_0.2.1           
 [70] pkgconfig_2.0.1        bitops_1.0-6           evaluate_0.10.1       
 [73] lattice_0.20-35        purrr_0.2.5            labeling_0.3          
 [76] htmlwidgets_1.2        bit_1.1-14             AnnotationForge_1.22.0
 [79] GSEABase_1.42.0        plyr_1.8.4             magrittr_1.5          
 [82] bookdown_0.7           R6_2.2.2               Hmisc_4.1-1           
 [85] DBI_1.0.0              pillar_1.2.3           foreign_0.8-70        
 [88] withr_2.1.2            survival_2.42-3        RCurl_1.95-4.10       
 [91] nnet_7.3-12            tibble_1.4.2           rmarkdown_1.9         
 [94] progress_1.1.2         locfit_1.5-9.1         grid_3.5.0            
 [97] data.table_1.11.4      Rgraphviz_2.24.0       blob_1.1.1            
[100] threejs_0.3.1          digest_0.6.15          xtable_1.8-2          
[103] tidyr_0.8.1            httpuv_1.4.3           munsell_0.4.3