read recruitment: memory scaling behavior for many targets

Hi,
as discussed, here are the proposed enhancements:

`locityper v1.2.0 @ 2025-10-07 19:33:08`

parameterization:
`locityper recruit --input HIFI_READS --seqs-all TARGETS --distinct --output OUTPUT --minimizer 21 15 --chunk-size 500 --match-len 10000 --threads 12`

```
...
Collected 311391945 minimizers across 43380 loci and 43380 sequences
...
Cgroup mem limit exceeded ...
# fails with ~350 GB of available memory
```
The HiFi reads are a single SMRT cell dataset (Revio). Other runs with the same or similar input finish successfully, which points to the number of target sequences as the root cause.

Suggested enhancements:

1. Not a fix, but a heads-up for users: mention the scaling behavior in the docs / CLI help. If it is confirmed that the number of target sequences is the problem, recommend a maximum number of targets per target file, so that users know right away how to split the input (divide and conquer).
2. Desired solution: implement chunking for processing target sequences.
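
Until either of these lands, the divide-and-conquer workaround from point 1 could look roughly like the sketch below. It assumes that TARGETS is an uncompressed FASTA file with one record per target. The function names, the output file naming and the batch size are made up for illustration and are not part of locityper. I have not verified that recruiting against several smaller target files gives the same result as a single run with `--distinct`, so treat this as a starting point only.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each FASTA record as a list of lines (header line first)."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield record


def batch_records(records: Iterable[List[str]], max_records: int) -> Iterator[List[List[str]]]:
    """Group records into batches of at most max_records records."""
    if max_records < 1:
        raise ValueError("max_records must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, max_records))
        if not batch:
            return
        yield batch


def split_fasta(fasta_path: str, out_dir: str, max_records: int) -> List[Path]:
    """Write the targets to numbered FASTA files (hypothetical naming scheme)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    with open(fasta_path) as handle:
        batches = batch_records(read_fasta_records(handle), max_records)
        for index, batch in enumerate(batches):
            part_path = out / f"targets.part{index:04d}.fa"
            with open(part_path, "w") as part:
                for record in batch:
                    part.writelines(record)
            written.append(part_path)
    return written
```

Each resulting file could then be passed to a separate `locityper recruit` run in place of TARGETS. A sensible value for `max_records` is exactly what the recommendation in point 1 would need to provide.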

Best,
Peter
