RcppPlanc is an R package initiated for wrapping the C++ library PLANC (planc GitHub, S. Eswar, 2021), where fast algorithms for non-negative matrix factorization (NMF) are implemented. Based on that, we also mimic the optimization and implemented integrative NMF (iNMF) (liger GitHub, J. D. Welch, 2019).
NMF problem is stated with the objective below and it aims at factorizing a given non-negative X matrix into two low-rank non-negative matrices W and H so that WH ≈ X. Denote the dimensionality of X to be m features by n sample points, then W will have the size of m × k and H is of k × n. k is the inner dimension of the factorization and should be less than m and n.
The implementation currently supports the factorization of both dense
and sparse (dgCMatrix
) matrix. The following code chunk
factorizes a randomly initialized non-negative dense matrix.
The output result res
is a list object with entries
res$W
and res$H
for the output matrices as
described above. Note that here matrix res$H
is transposed
(i.e. n × k) for the
sake of efficient implementation. res$objErr
is the
objective error computed for the result.
There are other variant algorithms for NMF that can be specified with
argument algo=
and also symmetric NMF approach implemented
in symNMF()
. Please see ?nmf
or
?symNMF
for more detailed documentation of the
functions.
iNMF jointly factorizes multiple datasets that share the same set of
features. It is originally designed for integrating single-cell
biological data, such as transcriptomics and other modalities. As its
objective function stated as below, it factorizes given datasets Ei (m × ni)
into W (m × k), Vi (m × k) and Hi (k × ni),
where W is shared across all
datasets and V and H are dataset specific. λ is regularization parameter, m is the number of shared features
(e.g. gene expression), ni is the number
of sample points (e.g. cells) of each dataset and d is the number of datasets, It is
expected that (W + Vi)Hi ≈ Ei.
Please see ?inmf
for more detailed explanation.
We prepared some example datasets that are down-sampled from public available study (Hyun Min Kang, 2018). They can be loaded with the following code chunk. These datasets are presented as sparse matrices.
To run iNMF with the datasets in hand, simply create a list of them
and call the function inmf()
.
The returned result is also a list object, which contains the following entries:
res$H
, a list of dataset specific H matrices, each of size ni × k.
Similarly, they are also transposed.res$V
, a list of dataset specific V matrices, each of size m × k.res$W
, the W
matrix, m × k.res$objErr
, the objective error value calculated for
the result.We also implemented the other variant of iNMF in
onlineINMF()
(C. Gao, 2021) and
uinmf()
(A.R. Kriebel,
2022). Please see their help pages for the detail.
HDF5 files have been widely adpoted for storing large-scale datasets on hard drive. Relevant examples can be found such as the H5AD files used by Python AnnData package that store annotated data, and the output file of 10X CellRanger Count which stores pre-processed single-cell expression matrix and metadata.
We support using either dense matrices stored in 2-D datasets or CSC
(compressed sparse column) sparse matrices stored with sets of three 1-D
array datasets. Currently, we don’t provide interfaces for interactively
exploring the data in an HDF5, but this can be easily achieved with
other packages, such as hdf5r. To use
datasets stored in HDF5 files, we provide argument list constructors
H5Mat()
and H5SpMat()
, respectively for dense
and sparse matrix, to organize the necessary information for retrieving
data.
We prepared example HDF5 files with identical information as the
package data ctrl.sparse
and stim.sparse
, in
both dense and sparse forms. The argument lists for them can be
constructed as shown below:
ctrl.denseH5FilePath <- system.file("extdata/ctrl_dense.h5", package = "RcppPlanc")
ctrl.h5dense <- H5Mat(filename = ctrl.denseH5FilePath, dataPath = "data")
ctrl.h5dense
#> Argument list for constructing HDF5 dense matrix
#> filename: /tmp/Rtmpgcqs9I/Rinst9ae1028d78a/RcppPlanc/extdata/ctrl_dense.h5
#> data path: data
stim.denseH5FilePath <- system.file("extdata/stim_dense.h5", package = "RcppPlanc")
stim.h5dense <- H5Mat(filename = stim.denseH5FilePath, dataPath = "data")
stim.h5dense
#> Argument list for constructing HDF5 dense matrix
#> filename: /tmp/Rtmpgcqs9I/Rinst9ae1028d78a/RcppPlanc/extdata/stim_dense.h5
#> data path: data
ctrl.sparseH5FilePath <- system.file("extdata/ctrl_sparse.h5", package = "RcppPlanc")
ctrl.h5sparse <- H5SpMat(filename = ctrl.sparseH5FilePath,
valuePath = "scaleDataSparse/data", rowindPath = "scaleDataSparse/indices", colptrPath = "scaleDataSparse/indptr",
nrow = nrow(ctrl.sparse), ncol = ncol(ctrl.sparse))
ctrl.h5sparse
#> Argument list for constructing HDF5 CSC sparse matrix
#> filename: /tmp/Rtmpgcqs9I/Rinst9ae1028d78a/RcppPlanc/extdata/ctrl_sparse.h5
#> value path: scaleDataSparse/data
#> rowind path: scaleDataSparse/indices
#> colptr path: scaleDataSparse/indptr
#> dimension: 173 x 300
stim.sparseH5FilePath <- system.file("extdata/stim_sparse.h5", package = "RcppPlanc")
stim.h5sparse <- H5SpMat(filename = stim.sparseH5FilePath,
valuePath = "scaleDataSparse/data", rowindPath = "scaleDataSparse/indices", colptrPath = "scaleDataSparse/indptr",
nrow = nrow(stim.sparse), ncol = ncol(stim.sparse))
stim.h5sparse
#> Argument list for constructing HDF5 CSC sparse matrix
#> filename: /tmp/Rtmpgcqs9I/Rinst9ae1028d78a/RcppPlanc/extdata/stim_sparse.h5
#> value path: scaleDataSparse/data
#> rowind path: scaleDataSparse/indices
#> colptr path: scaleDataSparse/indptr
#> dimension: 173 x 300
With the argument list constructed, we can simply run any iNMF algorithm with having them in a list, which is the same as what we did with in-memory matrix objects.
Note that it is recommended to apply
onlineINMF()
with HDF5 files due to the nature of the HALS algorithm it adopts.inmf()
anduinmf()
use ANLS updating approach which requires subsetting on rows of the data and could be very slow. Although this is still supported. Optimization might be updated in future versions.
Here, the result is also a similar list object but with two
additional entries res$A
and res$B
. These are
dataset specific matrices storing the algorithm update information but
not the primary result of iNMF. These are needed for performing other
scenarios of online iNMF. Please see ?onlineINMF
for more
explanation.
Note that the argument minibatchSize
is set to
50
only for demonstrating this minimal example of this
vignette. This is a parameter specifying the chunk size of each updates
in the iterative algorithm, which obviously should be a smaller number
than the number of sample points in provided datasets. Users testing
with real life large-scale data should ignore it or determine it basing
on the specification (e.g. RAM) of the machines.