Emerging single-cell RNA sequencing (scRNA-seq) technologies enable the investigation of the transcriptomic landscape at single-cell resolution. However, scRNA-seq analysis is complicated by an excess of zero or near-zero counts in the data, the so-called dropouts, which arise from the low amounts of mRNA within each individual cell. Consequently, downstream analysis of scRNA-seq data would be severely biased if the dropout events are not properly corrected. scImpute is developed to accurately and efficiently impute the dropout values in scRNA-seq data.
scImpute can be applied to raw count data before users perform downstream analyses.
scImpute can be easily incorporated into an existing scRNA-seq analysis pipeline. Its only input is the raw count matrix, with rows representing genes and columns representing cells. It outputs an imputed count matrix with the same dimensions. In the simplest case, the imputation task can be done with a single function, scimpute:
scimpute(# full path to raw count matrix
count_path = system.file("extdata", "raw_count.csv", package = "scImpute"),
infile = "csv", # format of input file
outfile = "csv", # format of output file
out_dir = "./", # full path to output directory
labeled = FALSE, # cell type labels not available
drop_thre = 0.5, # threshold set on dropout probability
Kcluster = 2, # 2 cell subpopulations
ncores = 10) # number of cores used in parallel computation
This function returns the column indices of outlier cells, and creates a new file scimpute_count.csv in out_dir to store the imputed count matrix.
The input file can be a .csv file, a .txt file, or an .rds file. In all cases, the first column should give the gene names and the first row should give the cell names. We use the example files in the package as an illustration. If the raw counts are stored in a .csv file, and we also want to write the imputed matrix to a .csv file, then specify this information with
# full path of the input file
count_path = system.file("extdata", "raw_count.csv", package = "scImpute")
infile = "csv"
outfile = "csv"
Similarly, if the raw counts are stored in a .txt file, and we also want to write the imputed matrix to a .txt file, then specify this information with
# full path of the input file
count_path = system.file("extdata", "raw_count.txt", package = "scImpute")
infile = "txt"
outfile = "txt"
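To make the expected layout concrete, here is a minimal base-R sketch that writes a toy count matrix with gene names in the first column and cell names in the first row, then reads it back. The matrix values and the file name toy_count.csv are made up for illustration, and the sketch assumes the file is read with the first column used as row names.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns); values are made up.
counts <- matrix(c(0, 3, 12,
                   5, 0, 0,
                   1, 8, 2,
                   0, 0, 7),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# Hypothetical file location, used only for this illustration.
toy_path <- file.path(tempdir(), "toy_count.csv")

# write.csv puts the gene names in the first column and the cell names in the first row.
write.csv(counts, toy_path, quote = FALSE)

# Read the file back with the first column as row names to confirm the layout.
check <- as.matrix(read.csv(toy_path, row.names = 1))
```

A path like toy_path could then be passed as count_path with infile = "csv".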
Next, we need to set up the directory to store all the temporary and final outputs:
# a '/' sign is necessary at the end of the path
out_dir = "~/output/"
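Since the trailing '/' is easy to forget, the following sketch shows one way to normalize the path and make sure the directory exists before running the imputation. The directory name scimpute_output is hypothetical, and the sketch uses base R only.

```r
# Hypothetical output location under the session's temporary directory.
out_dir <- file.path(tempdir(), "scimpute_output")

# Append the trailing '/' if it is missing.
if (!grepl("/$", out_dir)) out_dir <- paste0(out_dir, "/")

# Create the directory (and any missing parents) if it does not exist yet.
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
```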
We highly recommend using parallel computing with scImpute, which significantly reduces the computation time. For example, to use 20 cores, run the scimpute function with ncores = 20.
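The number of cores that is actually available differs between machines. As a rough sketch, ncores could be derived from the hardware with the parallel package that ships with R; leaving one core free and capping the value at 20 are assumptions for illustration, not recommendations from the package.

```r
library(parallel)  # ships with base R

# detectCores() may return NA on some platforms, so fall back to 1.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free and cap the value at 20 (the number used in the text).
ncores <- max(1L, min(20L, available - 1L))
```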
scImpute has two statistical parameters. The first parameter is Kcluster, which determines the number of initial clusters used to help identify candidate neighbors of each cell. The imputation results do not rely heavily on the choice of Kcluster, since scImpute uses a model-based method to select similar cells in a later stage. Kcluster can be specified based on the number of known cell types and users' biological expertise, and it may also be learned by clustering the raw data and inspecting the clustering results. The second parameter is drop_thre. Only the values that have a dropout probability larger than drop_thre are imputed by scImpute. The default threshold drop_thre = 0.5 is sufficient for most scRNA-seq data.
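One possible way to explore Kcluster by clustering the raw data is sketched below. It uses simulated counts, a log transformation, and average-linkage hierarchical clustering from base R; all of these are assumptions chosen for illustration and are not the procedure scImpute uses internally.

```r
set.seed(1)

# Simulated counts (not real data): 50 genes x 20 cells, with two groups of
# 10 cells that differ in the mean expression of the first 25 genes.
group_mean <- rep(c(2, 20), each = 10)
counts <- rbind(
  matrix(rpois(25 * 20, lambda = rep(group_mean, each = 25)), nrow = 25),
  matrix(rpois(25 * 20, lambda = 5), nrow = 25)
)
colnames(counts) <- paste0("cell", 1:20)

# Cluster cells on log-transformed counts; dist() works on rows, hence t().
log_counts <- log10(counts + 1)
tree <- hclust(dist(t(log_counts)), method = "average")

# Inspect the cluster sizes for a few candidate values of Kcluster.
for (k in 2:4) print(table(cutree(tree, k = k)))

# Cluster assignment of each cell for one candidate value.
clusters_k2 <- cutree(tree, k = 2)
```

Plotting the tree with plot(tree) is another way to inspect the result before settling on a value.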
Now, to get the imputed matrix, all we need is the main scimpute function:
Kcluster = 2
drop_thre = 0.5
ncores = 10
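# name every argument so that the call does not depend on argument order
scimpute(count_path = count_path, infile = infile, outfile = outfile,
out_dir = out_dir, labeled = FALSE, drop_thre = drop_thre,
Kcluster = Kcluster, ncores = ncores)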
If outfile = "csv", this function will create a new file scimpute_count.csv in out_dir to store the imputed count matrix; if outfile = "txt", this function will create a new file scimpute_count.txt in out_dir.
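After the run, the imputed matrix can be loaded back into R for downstream analysis. The helpers imputed_path and read_imputed below are hypothetical and not part of scImpute; they only combine out_dir with the file names described above. The sketch assumes that the .txt output is whitespace-delimited with a header row.

```r
# Hypothetical helper: build the path of the imputed matrix from out_dir
# (which ends in '/') and the chosen output format.
imputed_path <- function(out_dir, outfile = c("csv", "txt")) {
  outfile <- match.arg(outfile)
  paste0(out_dir, "scimpute_count.", outfile)
}

# Hypothetical helper: read the imputed matrix as a genes x cells matrix.
read_imputed <- function(out_dir, outfile = c("csv", "txt")) {
  outfile <- match.arg(outfile)
  path <- imputed_path(out_dir, outfile)
  if (outfile == "csv") {
    as.matrix(read.csv(path, row.names = 1))
  } else {
    # Assumption: whitespace-delimited text with a header row.
    as.matrix(read.table(path, header = TRUE, row.names = 1))
  }
}

# Path that would be read for the settings used in this section.
imputed_path("~/output/", "csv")
```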
Note that the order of unnamed arguments matters in R functions, so we suggest naming each argument, as in Quick start, to avoid mistakes. To apply scImpute to data from homogeneous cells, set Kcluster = 1 and labeled = FALSE.
scImpute with cell type information
Sometimes users may have the cell type (or subpopulation) information of the single cells, and scimpute can take advantage of this information to impute within each cell type. To do this, we need a character vector labels specifying the cell type of each column in the raw count matrix. In other words, the length of labels equals the number of cells, and the order of elements in labels should match the order of columns in the raw count matrix. Then we just need to specify labeled = TRUE in scimpute (default is FALSE) and specify the labels argument. Kcluster is not used when labeled = TRUE.
labels = readRDS(system.file("extdata", "labels.rds", package = "scImpute"))
labels[1:5]
> [1] "c1" "c1" "c1" "c2" "c2"
scimpute(count_path,
infile = "csv",
outfile = "csv",
out_dir = out_dir,
labeled = TRUE,
drop_thre = 0.5,
labels = labels,
ncores = 10)
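Because labels must follow the column order of the count matrix, it can help to build the vector by matching cell names and to check it before the call. The sketch below uses a made-up count matrix and a hypothetical metadata table meta; neither is shipped with the package.

```r
# Toy inputs (made up): a 3 genes x 5 cells count matrix and a metadata
# table that lists the same cells in a different order.
counts <- matrix(rpois(15, lambda = 3), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:5)))
meta <- data.frame(cell = paste0("cell", c(3, 1, 5, 2, 4)),
                   type = c("c1", "c1", "c2", "c1", "c2"),
                   stringsAsFactors = FALSE)

# Reorder the annotations so that they follow the columns of the count matrix.
labels <- meta$type[match(colnames(counts), meta$cell)]

# One label per cell, no cell left unannotated, and a character vector.
stopifnot(length(labels) == ncol(counts), !anyNA(labels), is.character(labels))
```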
Applying scImpute to TPM values
We strongly suggest using scImpute on count matrices. However, if only TPM values are available, users can apply scImpute with gene lengths supplied. scImpute will use the gene lengths (sum of exon lengths) to scale the data, which ensures a good fit of the mixture models. In this case, users need to specify type = "TPM" (type = "count" by default) and supply a vector genelen of gene lengths. The order of genes in genelen should match the order in the expression matrix. For example:
> genelen[1:3]
ENSMUSG00000021252 ENSMUSG00000007777 ENSMUSG00000024442
4235 998 2404
scimpute(count_path,
infile = "csv",
outfile = "csv",
out_dir = out_dir,
type = "TPM",
genelen = genelen,
drop_thre = 0.5,
ncores = 10)
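A mismatch between genelen and the rows of the expression matrix would go unnoticed, so one option is to reorder the lengths by gene name before the call. In this sketch the TPM values are made up and len_table is a hypothetical named vector; only the three gene IDs and lengths come from the example above.

```r
# Toy TPM-like matrix (made-up values) with the gene IDs shown above as row names.
tpm <- matrix(c(10.5, 0, 3.2,
                0, 7.1, 1.4),
              nrow = 3,
              dimnames = list(c("ENSMUSG00000021252", "ENSMUSG00000007777",
                                "ENSMUSG00000024442"),
                              c("cell1", "cell2")))

# Gene lengths as a named vector, listed in a different order than the matrix rows.
len_table <- c(ENSMUSG00000024442 = 2404,
               ENSMUSG00000021252 = 4235,
               ENSMUSG00000007777 = 998)

# Reorder by the matrix row names so that genelen[i] belongs to row i.
genelen <- len_table[rownames(tpm)]

# Fail early if any gene in the matrix has no length.
stopifnot(!anyNA(genelen), identical(names(genelen), rownames(tpm)))
```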
scImpute
scImpute benefits from parallel computation, and each processor does not require much memory. scimpute completes computation in seconds when applied to a dataset with 10,000 genes and 100 cells, running with 10 cores. The memory requirement for this dataset is around 2 GB. The running time depends largely on several factors, including the number of cores used (ncores). When the number of cells is extremely large, a filtering step on the cells can reduce the computation time of scImpute.
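The text does not say which filter to use. As one possible sketch, cells with very few detected genes could be removed before imputation. The simulated data and the cutoff of 20 detected genes are assumptions for illustration only, and a suitable criterion will depend on the data set.

```r
set.seed(2)

# Simulated counts (not real data): 100 genes x 30 cells, where the last
# 5 cells have very low capture and therefore few detected genes.
lambda <- rep(c(2, 0.05), times = c(25, 5))
counts <- matrix(rpois(100 * 30, lambda = rep(lambda, each = 100)), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("cell", 1:30)))

# Number of genes with a nonzero count in each cell.
detected <- colSums(counts > 0)

# Hypothetical cutoff: keep cells with at least 20 detected genes.
min_genes <- 20
filtered <- counts[, detected >= min_genes, drop = FALSE]

# Dimensions of the matrix that would be written out and passed to scimpute.
dim(filtered)
```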