Introduction to scImpute

Wei Vivian Li, Jingyi Jessica Li

2018-06-08

The emerging single cell RNA sequencing (scRNA-seq) technologies enable the investigation of transcriptomic landscape at single-cell resolution. However, scRNA-seq analysis is complicated by the excess of zero or near zero counts in the data, which are the so-called dropouts due to low amounts of mRNA within each individual cell. Consequently, downstream analysis of scRNA-seq woule be severely biased if the dropout events are not properly corrected. scImpute is developed to accurately and efficiently impute the dropout values in scRNA-seq data.

scImpute can be applied to raw data count before the users perform downstream analyses such as

Quick start

scImpute can be easily incorporated into existing pipeline of scRNA-seq analysis. Its only input is the raw count matrix with rows representing genes and columns representing cells. It will output an imputed count matrix with the same dimension. In the simplest case, the imputation task can be done with one single function scimpute:

scimpute(# full path to raw count matrix
         count_path = system.file("extdata", "raw_count.csv", package = "scImpute"), 
         infile = "csv",           # format of input file
         outfile = "csv",          # format of output file
         out_dir = "./",           # full path to output directory
         labeled = FALSE,          # cell type labels not available
         drop_thre = 0.5,          # threshold set on dropout probability
         Kcluster = 2,             # 2 cell subpopulations
         ncores = 10)              # number of cores used in parallel computation

This function returns the column indices of outlier cells, and creates a new file scImpute_count.csv in out_dir to store the imputed count matrix.

Step-by-step description

The input file can be a .csv file, .txt file, or .rds file. In all cases, the first column should give the gene names and the first row should give the cell names. We use the example files in the package as illustration. If the raw counts are stored in a .csv file, and we also hope to output the imputed matrix into a .csv file, then specify this information with

# full path of the input file
count_path = system.file("extdata", "raw_count.csv", package = "scImpute")
infile = "csv"
outfile = "csv"

Similarly, If the raw counts are stored in a .txt file, and we also hope to output the imputed matrix into a .txt file, then specify this information with

# full path of the input file
count_path = system.file("extdata", "raw_count.txt", package = "scImpute")
infile = "txt"
outfile = "txt"

Next, we need to set up the directory to store all the temporary and final outputs:

# a '/' sign is necessary at the end of the path
out_dir = "~/output/"

We highly recommend using parallel computing with scImpute, which will significantly reduce the computation time. Suppose we would like to use 20 cores, then we can run the scImpute function with ncores = 20.

scImpute has two statistical parameters. The first parameter is Kcluster, which determines the number of initial clusters to help identify candidate neighbors of each cell. The imputation results does not heavily rely on the choice of Kcluster, since scImpute uses a model-based method to select similar cells in a later stage. Kcluster can be specified based on the number of known cell types and users’ biological expertise, and it may also be learned by clustering the raw data and inspecting the clustering results. The second parameter is drop_thre. Only the values that have dropout probability larger than drop_thre are imputed by scImpute. A default threshold drop_thre = 0.5 is sufficient for most scRNA-seq data.

Now to get the imputed matrix, all we need is the main scimpute function

Kcluster = 2
drop_thre = 0.5
ncores = 10
scimpute(count_path, infile, outfile, out_dir, labeled = FALSE,  drop_thre, Kcluster, ncores)  

If outfile = "csv", this function will create a new file scimpute_count.csv in out_dir to store the imputed count matrix; if outfile = "txt", this function will create a new file scimpute_count.txt in out_dir.

Note that the order of parameters matters in R functions, so we suggest using the format in Quick start to specify parameters and avoid mistakes. If the users would like to apply scImpute on data coming from homogeneous cells, this can be achieved by setting Kcluster = 1 and labeled = FALSE.

Apply scImpute with cell type information

Sometimes users may have the cell type (or subpopulation) information of the single cells and scimpute can take advantage of this information to impute among each cell type. To do this, we need a character vector labels specifying the cell type of each column in the raw count matrix. In other words, the length of labels equals the number of cells and the order of elements in labels should match the order of columns in the raw count matrix. Then we just need to specify labeled = TRUE in scimpute (default is FALSE) and specify the labels argument. Kcluster is not used when labeled = TRUE.

labels = readRDS(system.file("extdata", "labels.rds", package = "scImpute"))
labels[1:5]
> [1] "c1" "c1" "c1" "c2" "c2"

scimpute(count_path, 
         infile = "csv", 
         outfile = "csv", 
         out_dir = out_dir,
         labeled = TRUE, 
         drop_thre = 0.5,
         labels = labels, 
         ncores = 10)

Apply scImpute to TPM values

We strongly suggest using scImpute on count matrices. However, if only TPM values are available, users can apply scImpute with gene lengths supplied. scImpute will use the gene lengths (sum of exon lengths) to scale the data , which ensures a good fitting of the mixture models. In this case, users need to specify type = "TPM" (type = "count" by default), and supply a vector genelen of gene lengths. The order of genes in genelen should match the order in the expression matrix. For example:

> genelen[1:3]
ENSMUSG00000021252 ENSMUSG00000007777 ENSMUSG00000024442
              4235                998               2404

scimpute(count_path, 
         infile = "csv", 
         outfile = "csv", 
         out_dir = out_dir,
         type = "TPM"
         genelen = genelen,
         drop_thre = 0.5,
         ncores = 10)

How to save computation time with scImpute

scImpute benefits from parallel computation, and each processor does not require heavy memory cost. scimpute completes computation in seconds when applied to a dataset with 10,000 genes and 100 cells, running with 10 cores. The memory requirement for this data set is around 2G. The running time mostly depends on

When the number of cells is extremely large, a filtering step on the cells can save the computation time of scImpute.