Contents

library(BiocStyle)
library(knitr)
library(ComplexHeatmap)

1 Introduction

NMF (nonnegative matrix factorization) is a matrix decomposition method. A description of the algorithm and it’s implementation can be found e.g. in (Lee and Seung 1999). In 2003, Brunet et al. applied NMF to gene expression data (Brunet et al. 2003). In 2010, NMF, an R package implementing several NMF solvers was published (Gaujoux and Seoighe 2010). NMF basically solves the problem as illustrated in the following figure (Image taken from https://en.wikipedia.org/wiki/Non-negative_matrix_factorization):

NMF

Here, \(V\) is an input matrix with dimensions \(n \times m\). It is decomposed into two matrices \(W\) of dimension \(n \times l\) and \(H\) of dimension \(l \times m\), which when multiplied approximate the original matrix \(V\). \(l\) is a free parameter in NMF, it is called the factorization rank. If we call the columns of \(W\) , then \(l\) corresponds to the number of signatures. The decomposition thus leads to a reduction in complexity if \(l < n\), i.e. if the number of signatures is smaller than the number of features, as indicated in the above figure.

In 2015, Mejia-Roa et al. introduced an implementation of an NMF-solver in CUDA, which lead to significant reduction of computation times by making use of massive parallelisation on GPUs (Mejia-Roa et al. 2015). Other implementations of NMF-solvers on GPUs exist.

It is the pupose of the package Bratwurst described here to provide wrapper functions in R to these NMF-solvers in CUDA. Massive parallelisation not only leads to faster algorithms, but also makes the benefits of NMF accessible to much bigger matrices. Furthermore, functions for preprocessing, estimation of the optimal factorization rank and post-hoc feature selection are provided.

2 The Bratwurst package

The main feature of the package Bratwurst is an S4 object called nmf.exp. It is derived from SummarizedExperiment, has containers for a data matrix, column annotation data and row annotation data and inherits SummarizedExperiment’s accessor functions colData and rowData. The matrix to be stored in this data structure is the matrix \(V\) as described above, corresponding to the input matrix for the NMF-solver. nmf.exp furthermore has containers for the matrices \(W\) and \(H\) which are results of the decomposition. As NMF algorithms have to be run iteratively, an instance of the class nmf.exp can store large lists of matrices, corresponding to the results of different iteration steps. Accessor functions to all different containers are provided.

A crucial step in data analysis with NMF is the determination of the optimal factorization rank, i.e. the number of columns of the matrix \(W\) or equivalently the number of rows of the matrix \(H\). No consensus method for an automatic evaluation of the optimal factorization rank has been found to date. Instead, the decomposition is usually performed iteratively over a range of possible factorization ranks and different quality measures are computed for every tested factorization ranks. Many quality measures have been proposed:

The package Bratwurst provides functions to compute all

3 Example: leukemia data

Preparations

library(Bratwurst)
library(NMF)

Load the example data

data(leukemia)
samples <- "leukemia"

This data was initially generated by the following commands:

data.path  <- file.path(getwd(), "data")
matrix.file <- list.files(data.path, "*data.txt", full.names = T)
rowAnno.file <- list.files(data.path, "micro.*anno.*txt", full.names = T)
rowAnno.bed <- list.files(data.path, ".bed", full.names = T)
colAnno.file <- list.files(data.path, "sample.*anno.*txt", full.names = T)
# Read files to summarizedExperiment
leukemia.nmf.exp <- nmfExperimentFromFile(matrix.file = matrix.file,
                                          rowAnno.file = rowAnno.file,
                                          colData.file = colAnno.file)
save(leukemia.nmf.exp, file = file.path(data.path, "leukemia.rda"))

Now we are ready to start an NMF analysis.

3.1 NMF analysis

3.1.1 Call wrapper function

The wrapper function for the NMF solvers in the Bratwurst package is runNmfGpu. It is called as follows:

k.max <- 4
outer.iter <- 10
inner.iter <- 10^4

# leukemia.nmf.exp<- runNmfGpu(nmf.exp = leukemia.nmf.exp,
#                              k.max = k.max,
#                              outer.iter = outer.iter,
#                              inner.iter = inner.iter,
#                              tmp.path = "/tmp/tmp_leukemia")
leukemia.nmf.exp<- runNmfGpuPyCuda(nmf.exp = leukemia.nmf.exp,
                                   k.max = k.max,
                                   outer.iter = outer.iter,
                                   inner.iter = inner.iter,
                                   tmp.path = "/tmp/tmp_leukemia",
                                   cpu = TRUE)
## [1] "2017-10-07 21:28:54 CEST"
## Factorization rank:  2 
## [1] "2017-10-07 21:29:07 CEST"
## Factorization rank:  3 
## [1] "2017-10-07 21:29:23 CEST"
## Factorization rank:  4
# leukemia.nmf.exp<- runNmfGpuPyCuda(nmf.exp = leukemia.nmf.exp,
#                                    k.max = k.max,
#                                    outer.iter = outer.iter,
#                                    inner.iter = inner.iter,
#                                    tmp.path = "/tmp/tmp_leukemia",
#                                    cpu = TRUE)
# leukemia.nmf.exp<- runNmfGpuPyCuda(nmf.exp = leukemia.nmf.exp,
#                                    k.max = k.max,
#                                    outer.iter = outer.iter,
#                                    inner.iter = inner.iter,
#                                    tmp.path = "/tmp/tmp_leukemia",
#                                    binary.file.transfer = TRUE)

Depending on the choice of parameters (dimensions of the input matrix, number of iterations), this step may take some time. Note that the algorithm updates the user about the progress in the iterations.

Several getter functions are available to access the data in the updated nmf.exp object:

3.1.2 HMatrixList

Returns a list of matrices H for a specific factorization rank k. There are as many entries in this list as there were iterations in the outer iteration. Of course the number of rows of the matrix H corresponds to the chosen factorization rank.

tmp.object <- HMatrixList(leukemia.nmf.exp, k = 2)
class(tmp.object)
## [1] "list"
length(tmp.object)
## [1] 10
class(tmp.object[[1]])
## [1] "matrix"
dim(tmp.object[[1]])
## [1]  2 38
kable(tmp.object[[1]][, c(1:5)])
X19769 X23953 X28373 X9335 X9692
0.0011390 0.0006481 0.0016158 0.0009238 0.0021593
0.0105159 0.0096408 0.0104468 0.0094084 0.0082162

If no value for k is supplied, the function returns a list of lists, one for every iterated factorization rank.

tmp.object <- HMatrixList(leukemia.nmf.exp)
class(tmp.object)
## [1] "list"
length(tmp.object)
## [1] 3
class(tmp.object[[1]])
## [1] "list"
length(tmp.object[[1]])
## [1] 10

3.1.3 HMatrix

Returns the matrix H for the optimal decomposition (i.e. the one with the minimal residual) for a specific factorization rank k. As in the previous paragraph, the number of rows of the matrix H corresponds to the chosen factorization rank.

tmp.object <- HMatrix(leukemia.nmf.exp, k = 2)
class(tmp.object)
## [1] "matrix"
dim(tmp.object)
## [1]  2 38
kable(tmp.object[, c(1:5)])
X19769 X23953 X28373 X9335 X9692
0.0095044 0.0087104 0.0094348 0.0085019 0.0074251
0.0012930 0.0007470 0.0018399 0.0010526 0.0024340

If no value for k is supplied, the function returns a list of optimal matrices, one for every iterated factorization rank.

H.list <- HMatrix(leukemia.nmf.exp)
class(H.list)
## [1] "list"
length(H.list)
## [1] 3
kable(H.list[[1]][, c(1:5)])
X19769 X23953 X28373 X9335 X9692
0.0095044 0.0087104 0.0094348 0.0085019 0.0074251
0.0012930 0.0007470 0.0018399 0.0010526 0.0024340

3.1.4 WMatrixList

Returns a list of matrices W for a specific factorization rank k. There are as many entries in this list as there were iterations in the outer iteration. Of course the number of columns of the matrix W corresponds to the chosen factorization rank.

tmp.object <- WMatrixList(leukemia.nmf.exp, k = 2)
class(tmp.object)
## [1] "list"
length(tmp.object)
## [1] 10
class(tmp.object[[1]])
## [1] "matrix"
dim(tmp.object[[1]])
## [1] 4951    2
kable(as.data.frame(tmp.object[[1]][c(1:5), ]))
V1 V2
18709.635 23831.166
1419.627 5031.364
22651.731 26325.280
2795.331 3784.764
4268.198 5256.990

If no value for k is supplied, the function returns a list of lists, one for every iterated factorization rank.

3.1.5 WMatrix

Returns the matrix W for the optimal decomposition (i.e. the one with the minimal residual) for a specific factorization rank k. As in the previous paragraph, the number of columns of the matrix W corresponds to the chosen factorization rank.

tmp.object <- WMatrix(leukemia.nmf.exp, k = 2)
class(tmp.object)
## [1] "matrix"
dim(tmp.object)
## [1] 4951    2
kable(as.data.frame(tmp.object[c(1:5), ]))
V1 V2
26334.001 16702.881
5564.502 1267.781
29096.680 20209.017
4182.737 2495.348
5809.047 3810.115

If no value for k is supplied, the function returns a list of optimal matrices, one for every iterated factorization rank.

W.list <- WMatrix(leukemia.nmf.exp)
class(W.list)
## [1] "list"
length(W.list)
## [1] 3
kable(as.data.frame(W.list[[1]][c(1:5), ]))
V1 V2
26334.001 16702.881
5564.502 1267.781
29096.680 20209.017
4182.737 2495.348
5809.047 3810.115

3.1.6 FrobError

Returns a data frame with as many columns as there are iterated factorization ranks and as many rows as there are iterations per factorization rank.

kable(FrobError(leukemia.nmf.exp))
2 3 4
0.5583462 0.5139669 0.4650006
0.5583469 0.5139767 0.4644593
0.5583471 0.5139647 0.4647121
0.5583458 0.5139682 0.4946795
0.5583463 0.5139662 0.4644852
0.5583480 0.5139661 0.4655613
0.5583408 0.5139789 0.4646555
0.5583459 0.5139693 0.4645018
0.5583456 0.5139705 0.4646260
0.5583466 0.5139621 0.4645738

3.2 Determine the optimal factorization rank

In NMF, Several methods have been described to assess the optimal factorization rank. The Bratwurst packages implements some of them. They are computed by applying custom functions which subsequently update the data structure of type nmf.exp.

3.2.1 Get Frobenius error.

The most important information about the many iterated decompositions is the norm of the residual. In NMF this is often called the Frobenius error, as the Frobenius norm may be used.

leukemia.nmf.exp <- computeFrobErrorStats(leukemia.nmf.exp)

3.2.2 Generate Alexandrov Criterion plot

In (Alexandrov et al. 2013) an approach is described in which a modified silhouette criterion is used to estimate the stability across iteration steps for one fixed factorization rank k.

leukemia.nmf.exp <- computeSilhoutteWidth(leukemia.nmf.exp)

3.2.3 Cophenetic correlation coefficient plot

leukemia.nmf.exp <- computeCopheneticCoeff(leukemia.nmf.exp)

3.2.4 Compute amari type distance

leukemia.nmf.exp <- computeAmariDistances(leukemia.nmf.exp)

After having executed all these functions, the values of the computed measures can be accessed with OptKStats:

kable(OptKStats(leukemia.nmf.exp))
k min mean sd cv sumSilWidth meanSilWidth copheneticCoeff meanAmariDist
2 2 0.5583408 0.5583459 0.0000019 0.0000035 19.99999 0.9999996 0.9999996 0.0000001
3 3 0.5139621 0.5139690 0.0000053 0.0000102 29.99976 0.9999922 0.9618570 0.0000026
4 4 0.4644593 0.4677255 0.0094765 0.0202608 37.79392 0.9448479 0.9616292 0.0203077

These quality measures can be displayed together:

3.2.5 Generate plots to estimate optimal k

gg.optK <- plotKStats(leukemia.nmf.exp)
gg.optK

3.2.6 Generate ranked error plot.

It may also be useful to inspect the Frobenius error after ranking. This may give an estimation of the convergence in the parameter space of initial conditions.

gg.rankedFrobError <- plotRankedFrobErrors(leukemia.nmf.exp)
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing
gg.rankedFrobError

3.3 Visualize the matrix H (exposures)

The matrices H may be visualized as heatmaps. We can define a meta information object and annotate meta data:

entity.colVector <- c("red", "blue")
names(entity.colVector) <- c("ALL", "AML")
subtype.colVector <- c("orange", "darkgreen", "blue")
names(subtype.colVector) <- c("B-cell", "T-cell", "-")
anno_col <- list(V2 = entity.colVector,
                 V3 = subtype.colVector)
heat.anno <- HeatmapAnnotation(df = colData(leukemia.nmf.exp)[, c(2:3)],
                               col = anno_col)

And now display the matrices H with meta data annotation:

# sapply(1:length(HMatrix(leukemia.nmf.exp)), function(i) {
#   current_k <- as.numeric(names(HMatrix(leukemia.nmf.exp))[i])
#   h.heatmap <- Heatmap(HMatrix(leukemia.nmf.exp, k = current_k),
#                        clustering_distance_columns = "pearson",
#                        heatmap_legend_param = list(color_bar = "continuous"),
#                        show_column_names = F, cluster_rows = F,
#                        top_annotation = heat.anno)
#   draw(h.heatmap)
# })

Bratwurst provides a plotting function to display the matrices H with meta data annotation:

# Plot Heatmaps for H over all k
lapply(seq(2, k.max), function(k) {
  plotHMatrix(leukemia.nmf.exp, k)
})
## [[1]]

## 
## [[2]]

## 
## [[3]]

3.4 Feature selection

3.4.1 Row K-means to determine signature specific features

### Find representative regions.
# Get W for best K
leukemia.nmf.exp <- setOptK(leukemia.nmf.exp, 4)
OptK(leukemia.nmf.exp)
## [1] 4
signature.names <- getSignatureNames(leukemia.nmf.exp, OptK(leukemia.nmf.exp))
signature.names
## [1] "ALL T-cell"                  "AML -"                      
## [3] "ALL B-cell"                  "ALL B-cell 0.64\nAML - 0.36"
FeatureStats(leukemia.nmf.exp)
## DataFrame with 0 rows and 0 columns
leukemia.nmf.exp <- computeFeatureStats(leukemia.nmf.exp)
FeatureStats(leukemia.nmf.exp)
## DataFrame with 4951 rows and 7 columns
##          cluster deltaCenters  deltaMean explainedVar    oddsVar   coefVar
##      <character>    <numeric>  <numeric>    <numeric>  <numeric> <numeric>
## 1           0001   -0.5568324 -39756.459    0.9432945 0.06011428 0.9243573
## 2           1011    0.4206972   2846.402    0.9703385 0.03056815 0.4096196
## 3           0110   -0.4122321 -23795.892    0.9299844 0.07528684 0.6030567
## 4           0001   -0.5134451  -5482.167    0.9771967 0.02333546 0.8098756
## 5           0001   -0.2973338  -3753.836    0.7113544 0.40576905 0.4284450
## ...          ...          ...        ...          ...        ...       ...
## 4947        0001   -0.3058005  -1453.227    0.8623299  0.1596490 0.4226175
## 4948        1010    0.3650308  24496.418    0.8006735  0.2489485 0.4966201
## 4949        0001   -0.4366316 -11306.444    0.8783798  0.1384597 0.6588259
## 4950        0011   -0.2368439  -1002.578    0.7968063  0.2550101 0.3740944
## 4951        1011    0.5186336  14705.280    0.8040131  0.2437608 0.6446432
##        meanSil
##      <numeric>
## 1    0.6132485
## 2    0.6470219
## 3    0.7309322
## 4    0.6561406
## 5    0.3315914
## ...        ...
## 4947 0.4933222
## 4948 0.5758699
## 4949 0.5102926
## 4950 0.5100298
## 4951 0.4178666
# You might want to add additional selection features
# such as entropy or absolute delta 
# Entropy
leukemia.nmf.exp <- computeEntropy4OptK(leukemia.nmf.exp)
FeatureStats(leukemia.nmf.exp)
## DataFrame with 4951 rows and 8 columns
##          cluster deltaCenters  deltaMean explainedVar    oddsVar   coefVar
##      <character>    <numeric>  <numeric>    <numeric>  <numeric> <numeric>
## 1           0001   -0.5568324 -39756.459    0.9432945 0.06011428 0.9243573
## 2           1011    0.4206972   2846.402    0.9703385 0.03056815 0.4096196
## 3           0110   -0.4122321 -23795.892    0.9299844 0.07528684 0.6030567
## 4           0001   -0.5134451  -5482.167    0.9771967 0.02333546 0.8098756
## 5           0001   -0.2973338  -3753.836    0.7113544 0.40576905 0.4284450
## ...          ...          ...        ...          ...        ...       ...
## 4947        0001   -0.3058005  -1453.227    0.8623299  0.1596490 0.4226175
## 4948        1010    0.3650308  24496.418    0.8006735  0.2489485 0.4966201
## 4949        0001   -0.4366316 -11306.444    0.8783798  0.1384597 0.6588259
## 4950        0011   -0.2368439  -1002.578    0.7968063  0.2550101 0.3740944
## 4951        1011    0.5186336  14705.280    0.8040131  0.2437608 0.6446432
##        meanSil    entropy
##      <numeric>  <numeric>
## 1    0.6132485 0.40276173
## 2    0.6470219 0.10712160
## 3    0.7309322 0.19958809
## 4    0.6561406 0.30440023
## 5    0.3315914 0.09671888
## ...        ...        ...
## 4947 0.4933222 0.08987241
## 4948 0.5758699 0.14880447
## 4949 0.5102926 0.21256989
## 4950 0.5100298 0.07251025
## 4951 0.4178666 0.26891164
leukemia.nmf.exp <- computeAbsDelta4OptK(leukemia.nmf.exp)
FeatureStats(leukemia.nmf.exp)
## DataFrame with 4951 rows and 12 columns
##          cluster deltaCenters  deltaMean explainedVar    oddsVar   coefVar
##      <character>    <numeric>  <numeric>    <numeric>  <numeric> <numeric>
## 1           0001   -0.5568324 -39756.459    0.9432945 0.06011428 0.9243573
## 2           1011    0.4206972   2846.402    0.9703385 0.03056815 0.4096196
## 3           0110   -0.4122321 -23795.892    0.9299844 0.07528684 0.6030567
## 4           0001   -0.5134451  -5482.167    0.9771967 0.02333546 0.8098756
## 5           0001   -0.2973338  -3753.836    0.7113544 0.40576905 0.4284450
## ...          ...          ...        ...          ...        ...       ...
## 4947        0001   -0.3058005  -1453.227    0.8623299  0.1596490 0.4226175
## 4948        1010    0.3650308  24496.418    0.8006735  0.2489485 0.4966201
## 4949        0001   -0.4366316 -11306.444    0.8783798  0.1384597 0.6588259
## 4950        0011   -0.2368439  -1002.578    0.7968063  0.2550101 0.3740944
## 4951        1011    0.5186336  14705.280    0.8040131  0.2437608 0.6446432
##        meanSil    entropy absDelta.V1 absDelta.V2 absDelta.V3 absDelta.V4
##      <numeric>  <numeric>   <numeric>   <numeric>   <numeric>   <numeric>
## 1    0.6132485 0.40276173  -72805.279  -57678.004  -59976.596   16026.291
## 2    0.6470219 0.10712160   -6493.359  -11453.221   -6051.212   -4736.680
## 3    0.7309322 0.19958809  -75538.848  -36105.793  -13055.144  -68805.656
## 4    0.6561406 0.30440023  -10269.604   -8853.642   -9520.157    1416.533
## 5    0.3315914 0.09671888  -13908.431   -9539.127  -11315.591   -4080.044
## ...        ...        ...         ...         ...         ...         ...
## 4947 0.4933222 0.08987241   -4336.471   -4835.015   -3773.614   -1408.580
## 4948 0.5758699 0.14880447  -38368.238  -98896.148  -34813.841  -72271.604
## 4949 0.5102926 0.21256989  -26646.255  -24264.488  -19356.206    -809.429
## 4950 0.5100298 0.07251025   -4785.826   -4465.965   -1772.201   -3469.279
## 4951 0.4178666 0.26891164  -32096.117  -49318.458  -19053.072   -8574.508

3.4.2 Feature visualization

# Plot all possible signature combinations
plotSignatureFeatures(leukemia.nmf.exp)
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing

## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

# Plot only signature combinations
plotSignatureFeatures(leukemia.nmf.exp, sig.combs = F)
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

# Try to display selected features on W matrix
sig.id <- "1000"
m <- WMatrix(leukemia.nmf.exp, k = OptK(leukemia.nmf.exp))[
  FeatureStats(leukemia.nmf.exp)[, 1] == sig.id, ]
m <- m[order(m[, 1]), ]
c <- getColorMap(m)
#m <- t(apply(m, 1, function(r) (r - mean(r))/sd(r)))

Heatmap(m, col = c, cluster_rows = F, cluster_columns = F)

References

Alexandrov, LB, S Nik-Zainal, DC Wedge, SA Aparicio, S Behjati, AV Biankin, GR Bignell, et al. 2013. “Signatures of Mutational Processes in Cancer.” Nature. Nature Publishing Group.

Brunet, Jean-Philippe, Pablo Tamayo, Todd R. Golub, and Jill P. Mesirov. 2003. “Metagenes and Molecular Pattern Discovery Using Matrix Factorization.” PNAS. PNAS.

Gaujoux, Renaud, and Cathal Seoighe. 2010. “A Flexible R Package for Nonnegative Matrix Factorization.” BMC Bioinformatics. BMC.

Lee, Daniel D., and Sebastian Seung. 1999. “Learning the Parts of Objects by Non-Negative Matrix Factorization.” Nature. Nature Publishing Group.

Mejia-Roa, Edgardo, Daniel Tabas-Madrid, Javier Setoain, Carlos Garcia, Francisco Tirado, and Alberto Pascual-Montano. 2015. “NMF-MGPU: Non-Negative Matrix Factorization on Multi-GPU Systems.” BMC Bioinformatics. BMC.