Support for multiple matrices and improving construction of TileDB objects #53

Merged 35 commits on Nov 28, 2024

@jkanche jkanche commented Nov 22, 2024

This PR introduces major improvements to matrix handling, storage, and performance, including support for multiple matrices in H5AD/AnnData workflows and optimizations for ingestion and querying.

Support for multiple matrices:

  • Both build_cellarrdataset and CellArrDataset now support multiple matrices. During ingestion, a TileDB group called "assays" is created to store all matrices, along with group-level metadata.

This may introduce breaking changes, depending on how these classes are used with their default parameters. Previously, the TileDB files were built with a single MatrixOptions:

dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=MatrixOptions(matrix_name="counts", dtype=np.int16),
    num_threads=2,
)

Now you may provide a list of matrix options, one for each layer in the files:

dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=[
        MatrixOptions(matrix_name="counts", dtype=np.int16),
        MatrixOptions(matrix_name="log-norm", dtype=np.float32),
    ],
    num_threads=2,
)

Querying follows a similar structure:

cd = CellArrDataset(
    dataset_path=tempdir,
    assay_tiledb_group="assays",
    assay_uri=["counts", "log-norm"]
)

assay_uri is relative to assay_tiledb_group. For backwards compatibility, assay_tiledb_group can be an empty string.
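
As a rough illustration of the relative-path rule above, the sketch below shows one way the matrix locations could be resolved. The helper name resolve_assay_uris and the simple "/" joining are assumptions made for illustration only; this is not cellarr's actual implementation.

```python
from typing import Dict, List, Union


def resolve_assay_uris(
    dataset_path: str,
    assay_tiledb_group: str,
    assay_uri: Union[str, List[str]],
) -> Dict[str, str]:
    """Hypothetical helper: map each matrix name to its full location.

    Assumes plain "/"-separated paths; not taken from the cellarr source.
    """
    # Accept a single name or a list of names.
    names = [assay_uri] if isinstance(assay_uri, str) else list(assay_uri)

    base = dataset_path.rstrip("/")
    # An empty group means the matrices sit directly under dataset_path,
    # which corresponds to the backwards-compatible layout.
    if assay_tiledb_group:
        base = f"{base}/{assay_tiledb_group.strip('/')}"

    return {name: f"{base}/{name}" for name in names}


new_layout = resolve_assay_uris("/tmp/ds", "assays", ["counts", "log-norm"])
old_layout = resolve_assay_uris("/tmp/ds", "", "counts")
```

With a group name, each matrix resolves under the group directory; with an empty string, it resolves directly under the dataset path.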

  • Parallelized ingestion:
    The build process now uses num_threads to ingest matrices concurrently. Two new columns in the sample metadata, cellarr_sample_start_index and cellarr_sample_end_index, track sample offsets, improving matrix processing.

    • Note: The process pool uses the spawn method on UNIX systems, which may affect usage on Windows machines.
  • TileDB query condition fixes:
    Fixed a few issues with fill values represented as bytes (this seems to be common when ascii is used as the column type), and with filtering operations on TileDB dataframes in general.

  • Index remapping:
    Improved remapping of indices from sliced TileDB arrays for both dense and sparse matrices. This is not a user facing function but an internal slicing operation.

  • Get a sample:
    Added a method to access all cells for a particular sample. You can provide either an index or a sample id.

sample_1_slice = cd.get_cells_for_sample(0)

  • Other updates to documentation, tutorials, the README, and additional tests.
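
To make the role of cellarr_sample_start_index and cellarr_sample_end_index more concrete, here is a minimal sketch of how per-sample offsets could be derived from cell counts and then used to look up all cells for a sample, by position or by id. The function names, the sample ids, and the choice of an inclusive end index are all assumptions for illustration; the PR does not state how cellarr computes or interprets these columns.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple, Union


def add_sample_offsets(cell_counts: Sequence[int]) -> List[Tuple[int, int]]:
    """Hypothetical: (start, end) cell index per sample, end inclusive (assumed)."""
    ends = list(accumulate(cell_counts))  # running total of cells
    starts = [0] + ends[:-1]              # each sample starts where the last ended
    return [(start, end - 1) for start, end in zip(starts, ends)]


def cells_for_sample(
    sample: Union[int, str],
    sample_ids: Sequence[str],
    offsets: Sequence[Tuple[int, int]],
) -> range:
    """Hypothetical: resolve a sample (position or id) to its cell indices."""
    idx = list(sample_ids).index(sample) if isinstance(sample, str) else sample
    start, end = offsets[idx]
    return range(start, end + 1)


sample_ids = ["sample_1", "sample_2", "sample_3"]  # made-up ids
offsets = add_sample_offsets([3, 2, 4])            # made-up cell counts
first_sample_cells = cells_for_sample(0, sample_ids, offsets)
second_sample_cells = cells_for_sample("sample_2", sample_ids, offsets)
```

Storing the offsets once at build time means a per-sample lookup becomes a contiguous slice along the cell dimension rather than a scan of the cell metadata.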

@jkanche jkanche self-assigned this Nov 22, 2024
@jkanche jkanche changed the title from "EOD" to "refactor the build functionality" Nov 23, 2024
@jkanche jkanche marked this pull request as ready for review November 25, 2024 22:35
@jkanche jkanche requested review from tony-kuo and keviny2 November 25, 2024 22:35

jkanche commented Nov 26, 2024

@tony-kuo and @keviny2 I'll update and merge this tomorrow, but it would be good to get another set of eyes here.

Review thread on src/cellarr/CellArrDataset.py (outdated, resolved)

jkanche commented Nov 28, 2024

This PR has dependencies in #61 and #60. Will merge this PR first to keep it small and avoid merge conflicts later.

@jkanche jkanche changed the title from "refactor the build functionality" to "Support for multiple matrices and improving construction of TileDB objects" Nov 28, 2024
@jkanche jkanche merged commit 48de52c into master Nov 28, 2024
6 checks passed
@jkanche jkanche deleted the refactor-layers branch November 28, 2024 20:23