As the name suggests, this repository implements functions to perform a PCA on a gene-by-cell expression matrix, returning low-dimensional coordinates for each cell that can be used for efficient downstream analyses such as clustering and visualization. The code was originally derived from the scran and batchelor R packages and has been factored out into a separate C++ library for easier re-use.
Given a tatami::Matrix, the scran_pca::simple_pca() function will compute the PCA to obtain a low-dimensional representation of the cells:
#include "scran_pca/scran_pca.hpp"
const tatami::Matrix<double, int>& mat = some_data_source();
// Take the top 20 PCs:
scran_pca::SimplePcaOptions opt;
opt.rank = 20;
auto res = scran_pca::simple_pca(mat, opt);
res.components; // rows are PCs, columns are cells.
res.rotation; // rows are genes, columns correspond to PCs.
res.variance_explained; // one per PC, in decreasing order.
res.total_variance; // total variance in the dataset.
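Dividing each entry of variance_explained by total_variance gives the proportion of the total variance captured by each PC, which is often easier to interpret than the raw values. The following is a minimal sketch, assuming that the per-PC variances have been copied into a std::vector<double> (the container type actually returned by scran_pca may differ). The proportion_explained() helper is hypothetical and not part of the library.

```cpp
#include <vector>

// Hypothetical helper, not part of scran_pca: converts per-PC variances
// into proportions of the total variance in the dataset.
// Assumes the variances were copied into a std::vector<double>.
std::vector<double> proportion_explained(const std::vector<double>& variance_explained, double total_variance) {
    std::vector<double> props;
    props.reserve(variance_explained.size());
    for (double v : variance_explained) {
        // Guard against division by zero for a dataset with no variance.
        props.push_back(total_variance > 0 ? v / total_variance : 0.0);
    }
    return props;
}
```

Because only the top PCs are retained, the proportions will generally sum to less than 1.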
Advanced users can fiddle with more of the options:
opt.scale = true;
opt.num_threads = 4;
opt.realize_matrix = false;
auto res2 = scran_pca::simple_pca(mat, opt);
In the presence of multiple blocks, we can perform the PCA on the residuals after regressing out the blocking factor. This ensures that the inter-block differences do not contribute to the first few PCs, instead favoring the representation of intra-block variation.
std::vector<int> blocks = some_blocks();
scran_pca::BlockedPcaOptions bopt;
bopt.rank = 10; // taking the top 10 PCs this time.
auto bres = scran_pca::blocked_pca(mat, blocks.data(), bopt);
bres.components; // rows are PCs, columns are cells.
bres.center; // rows are blocks, columns are genes.
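To illustrate what "regressing out the blocking factor" means when the block is the only factor, the residual for each cell is simply its expression value minus the mean of its block, computed separately for each gene. The sketch below shows this for a single gene. It is only an illustration under that assumption, not the scran_pca implementation (which may, for example, weight blocks differently), and block_residuals() is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration, not the scran_pca implementation.
// 'values' holds one gene's expression across cells.
// 'block' holds each cell's block ID, assumed to lie in [0, num_blocks).
std::vector<double> block_residuals(const std::vector<double>& values, const std::vector<int>& block, int num_blocks) {
    std::vector<double> sums(num_blocks, 0.0);
    std::vector<int> counts(num_blocks, 0);

    // Accumulate the per-block sum and number of cells.
    for (std::size_t c = 0; c < values.size(); ++c) {
        sums[block[c]] += values[c];
        ++counts[block[c]];
    }

    // Subtract the mean of each cell's own block.
    std::vector<double> residuals(values.size());
    for (std::size_t c = 0; c < values.size(); ++c) {
        int b = block[c];
        residuals[c] = values[c] - sums[b] / counts[b];
    }
    return residuals;
}
```

After this step, every block has a mean of zero for the gene, so a constant shift between blocks no longer contributes any variance.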
The components derived from the residuals will only be free of inter-block differences under certain conditions (equal population composition with a consistent shift between blocks). If this is not the case, more sophisticated batch correction methods are required. If those methods accept a low-dimensional representation for the cells as input, we can use scran_pca::blocked_pca() to obtain an appropriate matrix that focuses on intra-block variation without making assumptions about the inter-block differences:
bopt.components_from_residuals = false;
auto bres2 = scran_pca::blocked_pca(mat, blocks.data(), bopt);
Check out the reference documentation for more details.
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
include(FetchContent)
FetchContent_Declare(
scran_pca
GIT_REPOSITORY https://github.com/libscran/scran_pca
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(scran_pca)
Then you can link to scran_pca to make the headers available during compilation:
# For executables:
target_link_libraries(myexe libscran::scran_pca)
# For libraries:
target_link_libraries(mylib INTERFACE libscran::scran_pca)
# Alternatively, to use an already-installed copy of the library:
find_package(libscran_scran_pca CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE libscran::scran_pca)
To install the library, use:
mkdir build && cd build
cmake .. -DSCRAN_PCA_TESTS=OFF
cmake --build . --target install
By default, this will use FetchContent to fetch all external dependencies. If you want to install them manually, use -DSCRAN_PCA_FETCH_EXTERN=OFF. See the tags in extern/CMakeLists.txt to find compatible versions of each dependency.
If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This requires the external dependencies listed in extern/CMakeLists.txt, which also need to be made available during compilation.