scTDA is an object oriented python library for topological data analysis of high-throughput single-cell RNA-seq data. It includes tools for the preprocessing, analysis, and exploration of single-cell RNA-seq data, based on topological representations produced by the Mapper algorithm. This tutorial illustrates some of the basic features of scTDA using the human preimplantation dataset of Petropoulos S, et al., "Single-Cell RNA-Seq Reveals Lineage and X Chromosome Dynamics in Human Preimplantation Embryos", Cell 165: 1012 - 1026 (2016), as a case study. The scTDA code contains more advanced options and commands that are not described in this tutorial. For more detailed documentation, please use the ipython command ?
.
import scTDA
For convenience, scTDA includes a class for preparing, normalizing, and filtering data. It takes as input one or more files with read or UMI counts. In each file, rows correspond to genes and tab-separated columns correspond to cells, except for the first column, which contains gene IDs. Files can be all from a same time point or from multiple time points. In this tutorial we consider 1,529 cells from 88 human preimplantation embryos sampled at 5 different timepoints. The dataset consists of 89 files corresponding to embryonic days 3 (13 files), 4 (16 files), 5 (24 files), 6 (18 files) and 7 (18 files). The file `data.txt` contains a summary of the file names, timepoints, and number of cells in each file.
files = []
cells = []
libs = []
days = []
with open('data.txt', 'r') as f:
for line in f:
sp = line[:-1].split('\t')
files.append(sp[0])
cells.append(int(sp[1]))
libs.append(sp[2])
days.append(int(sp[3]))
The class scTDA.Preprocess
allows to read, normalize, filter, and organize the data, shaping it in the appropriate form for downstream analyses. An instance is initialized by providing a list of files containing the read or UMI counts, the corresponding timepoints and library ID's, and the number of cells in each file. If spike-in's were used in the experiment, they can be included in the analysis by specifying their common identifier with the option spike=
,
p = scTDA.Preprocess(files, days, libs, cells, spike='ERCC-')
The method scTDA.Preprocess.show_statistics()
can be used to show some simple statistics of the data:
p.show_statistics()
From the first plot we observe that some cells have a low ratio between spike-in reads and average spike-in reads in the library, presumably due to low sequencing depth. In addition, some cells have low ratio of spike-in reads and uniquely mapped reads. This plot is useful for filtering cells out based on spike-in reads. The second plot shows the distribution of transcripts per gene. The third plot shows the fraction of cells with undetected expression of each gene (blue) and spike-in's (yellow) as a function of their average number of counts across all cells. This plot is useful for modelling the technical noise and identifying genes that have a high biological variability and low probability of drop-out. The fourth plot displays the distribution of the total number of transcripts per cell. We observe a large dispersion in this distribution, which we will reduce by subsampling counts. Finally, the last plot displays the distribution of the cell complexity across the dataset, defined as the number of genes expressed in each cell.
We can use these plots to decide the filetring and normalization strategy. In this particular example, we subsample reads using the method scTDA.Preprocess.subsample()
, so that all cells have the same total number of reads:
p.subsample(17989)
Here we have subsampled to the minimum number of reads per cell in the dataset. If we had subsampled to a larger number, cells that have a lower number of reads would have been removed from downstream analysis.
We can select for cells based on various criteria using the method scTDA.Preprocess.select_cells()
,
p.select_cells(filterYlow=0.001, filterXlow=0.01, filterXhigh=3.0)
The parameters filterXlow
and filterXhigh
set respectively lower and upper bounds in the ratio between spike-in reads and the average number of spike-in reads in the library. The parameters filterYlow
and filterYhigh
set respectively lower and upper bounds in the ratio between spike-in reads and uniquely mapped reads. Out of the 1,529 cells, 1,415 pass all the above filters. The cells that pass all the constraints are shown in red in the plot and are stored in the boolean array scTDA.Preprocess.which_samples
.
We can display again the statistics, overlying in red the corresponding plots for the set of cells that pass all the constraints:
p.show_statistics()
We can model the dependence of the dropout rate on the average gene expression by fitting a sigmoid function. Then assign a z-score to each gene by fitting the residuals with a normal distribution, where the standard deviation is estimated from the lower 16th percentile of the data. This procedure is implemented the method scTDA.Preprocess.fit_sigmoid()
,
p.fit_sigmoid()
When subsampling has been applied (as in this particular example), scTDA.Preprocess.fit_sigmoid()
makes use of the subsampled data to fit the sigmoid. Alternatively, the sigmoid can be fitted to the spike-in's by using the option to_spikes=True
.
We can use the above sigmoidal fit to identify genes that have a low probability of drop-out and that have a higher variability than expected from purely technical noise. To that end, we make use of the method scTDA.Preprocess.select_genes()
,
p.select_genes(avg_counts=2.0, min_z=3.0)
In this particular example we have selected genes that have an average of 2 or more counts across the cells, and a minimum z-score of +3.0 with respect to the sigmoidal fit. The genes that pass the constrints are represented in red, and are stored in the variable scTDA.Preprocess.which_genes
. Their names can be recovered thorugh the following command,
print p.genes[p.which_genes]
Once we are satisfied with the selection of cells, genes, and subsampling, we can use the command scTDA.Preprocess.save()
to generate the neccessary files for downstream analyses based on our selection,
p.save('Embryo')
This command generates three tab-separated files, named `Embryo.all.tsv`, `Embryo.no_subsampling.tsv`, and `Embryo.mapper.tsv`, where the rows correspond to selected cells. The first column of `Embryo.all.tsv` and `Embryo.no_subsampling.tsv` contains a unique identifier of the cell, the second column contains the sampling timepoint, the third column contains the library ID, and the remaining columns contain $\textrm{log}_2(1+\textrm{TPM})$ expression values for all the genes. `Embryo.all.tsv` contains subsampled data, whereas `Embryo.no_subsampling.tsv` constains non-subsampled data. `Embryo.mapper.tsv` contains $\textrm{log}_2(1+\textrm{TPM})$ expression values for the genes that were selected using the scTDA.Preprocess.select_genes()
.
The file `Embryo.mapper.tsv` can be used to produce a topological representation of the dataset. The class scTDA.TopologicalRepresentation
provides an interface to the python module sakmapper, which contains an implementation of the Mapper algorithm. In this example, we generate an instance of the Mapper algorithm using the first two PCA components of the data,
t = scTDA.TopologicalRepresentation('Embryo', lens='pca', metric='euclidean')
We use the method scTDA.TopologicalRepresentation.save()
to generate a Mapper representation of the data based in this PCA projection, using 25 x 25 bins with an average 40% overlap,
t.save('Embryo_pca_25_0.40', 25, 0.40);
The topological representation is stored in GEXF format in the file `Embryo_pca_25_0.40.gexf`, and is accompanied by a JSON file `Embryo_pca_25_0.40.gexf.json`, which specifies the cells that are associated to each node of the graph. We can visualize the topological representation using the class scTDA.UnrootedGraph
,
c = scTDA.UnrootedGraph('Embryo_pca_25_0.40', 'Embryo.no_subsampling.tsv', groups=False)
c.draw('PDGFRA');
where we have colored the nodes of the topological representation by the expression levels of PDGFRA.
An alternative and more convenient way of generating the topological representation is using the software Ayasdi. scTDA provides a method to parse a topological representation generated from the table `Embryo.mapper.tsv` using an existing Ayasdi session,
scTDA.ParseAyasdiGraph('Embryo_mds','1480173499692','-1429704228027513618','username','password')
This command produces the files `Embryo_mds.gexf`, `Embryo_mds.json` and `Embryo_mds.groups.json`. '1480173499692'
and '-1429704228027513618'
are the identifiers of the Ayasdi session that we are importing into scTDA. They can be obtained from the Ayasdi web interface as indicated in the following screenshot,
scTDA provides two classes for the analysis of single cell RNA-seq expression data: scTDA.UnrootedGraph
and scTDA.RootedGraph
, respectively for non-longitudinal and longitudinal single cell RNA-seq data. scTDA.RootedGraph
is an inherited class from scTDA.UnrootedGraph
, so all methods of scTDA.UnrootedGraph
are also included in scTDA.RootedGraph
, in addition to methods that are specific of scTDA.RootedGraph
. In this tutorial we make use of scTDA.RootedGraph
, since we are dealing with longitudinal data. The class is initialized with the files produced in the previous steps,
c = scTDA.RootedGraph('Embryo_mds', 'Embryo.no_subsampling.tsv', posgl=True)
When using topological representations generated through sakmapper, the option groups=False
should be specified when initializing the classes scTDA.RootedGraph
or scTDA.UnRootedGraph
. The argument posgl=True
forces scTDA to use a pre-comuputed layout for visualization of the topological representation. To compute the visualization layout of a topological representation for the first time initialize the class using the argument posgl=False
.
The class scTDA.RootedGraph
includes several methods for the analysis and visualization of data. We can display some general statistics using the method scTDA.RootedGraph.show_statistics()
,
c.show_statistics()
The first plot displays the distribution of the number of cells per node in the topological representation. The second plot presents the distribution of the number of common cells between nodes that share an edge in the topological representation. The third plot shows the distribution of the number of nodes that contain the same cell. The last plot shows the distribution of transcripts in $\textrm{log}_2(1+\textrm{TPM})$.
The method SCTDA.RootedGraph.draw()
can be used to plot the topological representation colored by the expression of a given gene:
c.draw('DPPA5');
Node sizes are proportional to the number of cells in the node. The method SCTDA.RootedGraph.draw()
also allows to map a gene or list of genes to different color channels:
c.draw(['HTR3E', 'CDX1']);
SCTDA.RootedGraph.draw()
can map expression up to four different color channels (red, green, blue, and orange). It can also map lists of genes to a channel:
c.draw([['NTSR2', 'PCDHGA2', 'HTR3E'], 'CDX1']);
scTDA
accepts some special keywords as gene names. The keyword '_CDR'
represents cell complexity (that is, the number of genes whose expression is detected in a cell):
c.draw('_CDR');
Note that the topological representation in this example was built using subsampled data, so that the cell complexity did not affect the structure of the topological representation. The complexity computed by _CDR
uses non-subsampled data in this example.
The correlation between sampling time point and cell complexity can be computed using the method scTDA.RootedGraph.plot_CDR_correlation()
,
c.plot_CDR_correlation()
scTDA.RootedGraph.plot_CDR_correlation()
returns the two parameters of the linear fit, Pearson's $r$, the $p$-value and the standard error.
The keyword 'timepoint'
can be used to color acccording to the sampling timepoint,
c.draw('timepoint');
A remarkable feature of the topological representation, based just on expression data, is that it reproduces correctly the differentiation time course, as observed in the figure.
Similarly, the keyword 'timepoint_'
can be used to plot the density of cells that belong to a given time point,
c.draw('timepoint_4');
scTDA determines a root node in the topological representation by maximizing the correlation between sampling time and the graph distance function. The root node corresponds to the less differentiated state. We can color the topological representation according to the distance to the root node using the keyword '_dist_root'
,
c.draw('_dist_root');
Correlation between sampling time point and distance to root node can be computed using the method scTDA.RootedGraph.plot_rootlane_correlation()
,
c.plot_rootlane_correlation()
scTDA.RootedGraph.plot_rootlane_correlation()
returns the two parameters of the linear fit, Pearson's $r$, the $p$-value, and the standard error.
The keyword 'lib_'
can be used to localize a given library within the topological representation,
c.draw('lib_E6_17');
This can be useful to check for batch effects. Similarly, the keyword 'ID_'
can be used to identify a given cell within the topological representation,
c.draw('ID_D7_E7_19_7');
The method scTDA.RootedGraph.draw_expr_timeline()
can be used to compute the expression of a gene or list of genes as a function of the pseudo-time inferred from the distance to root function,
c.draw_expr_timeline('DNMT3B');
We can compute the skeleton of the differentiation tree using the method scTDA.RootedGraph.draw_skeleton()
. This method represents the skeleton colored according to a gene or list of genes, where each node in the skeleton corresponds to a set of connected nodes at the same distance of the root node in the condensed topological representation. Node sizes are proportional to the number of cells in the node, whereas edge sizes are proportional to the number of edges in the condensed topological representation connecting each pair of group of nodes,
c.draw_skeleton('timepoint', markpath=True);
scTDA.RootedGraph.draw_skeleton()
correctly reproduces in this example the segregation of the inner cell mass (red path) from the trophectoderm (gray path). The expression of a gene or gene list as a function of the pseudo-time across the inner cell mass path can be obtained using the command,
c.draw_expr_timeline('DNMT3B', path=True);
There are several quantities associated to each gene that allow to identify genes that are specific of particular cell sub-populations or time points. The first quantity that we may consider is the gene connectivity in the topological representation. This is defined as,
\begin{equation} S(g) = \frac{N-1}{N}\sum_{i,j} p_i(g) w_{ij} p_j(g) \end{equation}where the sum runs over all nodes in the topological representation, $w$ is the adjacency matrix of the topological representation, $N$ the total number of nodes and $p_i(g)$ the average expression of gene $g$ on node $i$, normalized as a probablity, namely
\begin{equation} \sum_i p_i = 1 \end{equation}Gene connectivity allows to identify genes whose expression appears to be highly localized in the topological representation. The connectivity of a gene can be computed using the method scTDA.RootedGraph.connectivity()
,
c.connectivity('PDGFRA')
The statistical significance of a particular gene connectivity score can be assessed by means of a permutation test, where cell ID's are randomly permuted. This is implemented in the method scTDA.RootedGraph.connectivity_pvalue()
,
c.connectivity_pvalue('PDGFRA', n=1000)
where n
specifies the number of permutations. The idea is that genes with a statistically significant connectivity, like the above example, are biologically relevant at a particular stage or for a particular cell sub-population,
c.draw('PDGFRA');
To the contrary, genes with non-significant gene connectivity in the topological representation are unspecific of any stage or cell sub-population,
c.connectivity_pvalue('ACTA2', n=1000)
c.draw('ACTA2');
Two other interesting quantities are the centroid of a gene with respect to the root node and its dispersion. These are respectively defined as
\begin{equation} C(g)=\sum_i d_i p_i(g) \end{equation}\begin{equation} D(g) = \sqrt{\sum_i (d_i - C(g))^2p_i(g)} \end{equation}where $d_i$ is the distance of node $i$ to the root node in the topological representation.
The centroid and dispersion of a gene or list of genes can be computed using the method scTDA.RootedGraph.centroid()
,
c.centroid('PDGFRA')
For convenience, scTDA.RootedGraph.centroid()
expresses the centroid and dispersion in units of pseudo-time.
There are few other useful methods implemented in scTDA. scTDA.RootedGraph.expr()
returns the number of cells on which expression of a gene or list of genes is detected,
c.expr('PDGFRA')
scTDA.RootedGraph.delta()
returns the mean, minimum and maximum expression values of a gene or list of genes,
c.delta('PDGFRA')
These methods can be restricted to specific groups of nodes. Groups of nodes are specified in the file name.groups.json
and stored in the python dictionary scTDA.RootedGraph.dicgroups
. For instance, in this example we have defined two groups of nodes in the file Embryo_mds.groups.json
. These can be accessed through the command
c.dicgroups.keys()
We can restrict the methods scTDA.RootedGraph.expr()
and scTDA.RootedGraph.delta()
to any of these groups by using the argument group=
. For instance,
c.expr('PDGFRA', group='group_1')
c.delta('PDGFRA', group='group_1')
The method scTDA.RootedGraph.save()
creates a file name.genes.tsv
that contains a tab separated table with all the above quantities for all the genes in the dataset, including Bejamini-Holchberg adjusted p-values for the gene connectivity. This method allows to restrict to genes that are expressed in more than filtercells
cells, and whose maximum expression value is above filterexp
in $\textrm{log}_2(1+\textrm{TPM})$ units. In addition, it allows to annotate genes according to one or more lists of genes.
In this example we have downloaded from EMBL-EBI QuickGO two tables listing the members of the gene ontologies RNA splicing and Poly-(A) RNA binding. We can parse the genes in those tables into a dictionary that will be used by the command scTDA.RootedGraph.save()
to annotate the genes:
splic = []
rna = []
f = open('GO0008380_RNA_spliciing.tsv', 'r')
for nb, line in enumerate(f):
if nb > 0:
sp = line.split('\t')
splic.append(sp[2])
f.close()
splic = list(set(splic))
f = open('GO0044822_polyA_RNA_binding.tsv', 'r')
for nb, line in enumerate(f):
if nb > 0:
sp = line.split('\t')
rna.append(sp[2])
f.close()
rna = list(set(rna))
annotations = {'Splicing': splic, 'RNA_binding': rna}
We use the method scTDA.RootedGraph.save()
to create the file `Embryo_mds.genes.tsv`, where we only consider genes that are expressed in more than 4 cells and the statistical significance of the gene connectivity for each gene is assessed using 1,000 permutations,
c.save(n=1000, filtercells=4, filterexp=0.0, annotation=annotations);
scTDA.RootedGraph.save()
produces two plots. The first plot displays the gene connectivity against the number of cells with non-zero expression for each gene. Genes with statistically significant gene connectivity (after Benjamini-Hochberg adjustement of the false discovery rate) are displayed in red. The second plot displays the number of cells with non-zero expression against the centroid and dispersion, for genes with statistically significant connectivity.
!head Embryo_mds.genes.tsv
scTDA.compare_results()
can be used to perform comparisons between the results of two analysis based on the centroid and dispersion of genes with significant gene conectivity. This is useful when evaluating the consistency of the results across different parameter choices, technical, or biological replicates, or between different experiments. In this example, we compare the results of two scTDA analyses of the same data differing in the metric that was used to build the topological representation. The representation in the analysis Embryo_mds
was built using correlation distance, whereas the representation in the analysis Embryo_mds_euclid
was built using Euclidean distance:
scTDA.compare_results('Embryo_mds', 'Embryo_mds_euclid');
We can define transient subpopulations in an unsupervised way by clustering significantly localized low-dispersion genes according to their centroid in the topological representation. The optimal number of clusters can be determined using Davies-Bouldin criterion. This procedure is implemented in the method scTDA.RootedGraph.cellular_subpopulations()
,
g = c.cellular_subpopulations(3.2)
In this example we have considered significant genes with a dispersion smaller than 3.2. scTDA.RootedGraph.cellular_subpopulations()
produces two plots. The first plots shows Davies-Bouldin index as a funcion of the number of clusters. The second plot displays the distribution of low-dispersion genes according to their centroid and dispersion, and colors the different clusters. scTDA.RootedGraph.cellular_subpopulations()
returns the lists of genes in each cluster,
c.draw([g[0], g[1], g[2]]);
An alternative approach to identify transient cellular sub-populations is to use colocalization patterns. These can be computed by clustering the Jensen-Shannon distance matrix associated to significantly-localized low-dispersion genes,
\begin{equation} J(g,h) = \left[\frac12\sum_i\left(p_i(g)\textrm{log }p_i(g)+p_i(h)\textrm{log }p_i(h)-(p_i(g)+p_i(h))\textrm{log }\frac{p_i(g)+p_i(h)}{2}\right)\right]^{\frac12} \end{equation}To that end, we use the command:
g = c.cellular_subpopulations(3.15, method='js', clus_thres=0.7)
c.draw([g[0],g[1],g[2]]);