Skip to content

Commit

Permalink
Merge branch 'jdaw/demux-without-classification' into 'master'
Browse files Browse the repository at this point in the history
[demux] support demux without classification

Closes DOR-371

See merge request machine-learning/dorado!647
  • Loading branch information
tijyojwad committed Oct 12, 2023
2 parents 1d2c4c3 + 1ef74df commit eeef757
Show file tree
Hide file tree
Showing 4 changed files with 49 additions and 14 deletions.
11 changes: 9 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -161,11 +161,18 @@ By default, Dorado is set up to trim the barcode from the reads. To disable trim

The default heuristic for double-ended barcodes is to look for them on either end of the read. This results in a higher classification rate but can also result in a higher false positive count. To address this, `dorado basecaller` also provides a `--barcode-both-ends` option to force double-ended barcodes to be detected on both ends before classification. This will reduce false positives dramatically, but also lower overall classification rates.

The output from `dorado basecaller` can be demultiplexed into per-barcode BAMs using `samtools split`. e.g.
The output from `dorado basecaller` can be demultiplexed into per-barcode BAMs using `dorado demux`. e.g.

```
$ samtools split -u <output-dir>/unclassified.bam -f "<output-dir>/<prefix>_%!.bam" <input-bam>
$ dorado demux -o <output-dir> --no-classify <input-bam>
```
This will output a BAM file per barcode in the `output-dir`.

The barcode information is reflected in the BAM `RG` header too. Therefore demuxing is also possible through `samtools split`. e.g.
```
$ samtools split -u <output-dir>/unclassified.bam -f "<output-dir>/<prefix>_%!.bam" <input-bam>
``
However, `samtools split` uses the full `RG` string as the filename suffix, which can result in very long file names. We recommend using `dorado demux` to split barcoded BAMs.
#### Classifying existing datasets
Expand Down
36 changes: 25 additions & 11 deletions dorado/cli/demux.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -44,9 +44,14 @@ int demuxer(int argc, char* argv[]) {
.nargs(argparse::nargs_pattern::any);
parser.add_argument("--output-dir").help("Output folder for demultiplexed reads.").required();
parser.add_argument("--kit-name")
.help("Barcoding kit name. Choose from: " +
dorado::barcode_kits::barcode_kits_list_str() + ".")
.required();
.help("Barcoding kit name. Cannot be used with --no-classify. Choose "
"from: " +
dorado::barcode_kits::barcode_kits_list_str() + ".");
parser.add_argument("--no-classify")
.help("Skip barcode classification. Only demux based on existing classification in "
"reads. Cannot be used with --kit-name.")
.default_value(false)
.implicit_value(true);
parser.add_argument("-t", "--threads")
.help("Combined number of threads for barcoding and output generation. Default uses "
"all available threads.")
Expand Down Expand Up @@ -83,6 +88,12 @@ int demuxer(int argc, char* argv[]) {
std::exit(1);
}

if ((parser.is_used("--no-classify") && parser.is_used("--kit-name")) ||
(!parser.is_used("--no-classify") && !parser.is_used("--kit-name"))) {
spdlog::error("Please specify either --no-classify or --kit-name to use the demux tool.");
std::exit(1);
}

if (parser.get<bool>("--verbose")) {
utils::SetDebugLogging();
}
Expand Down Expand Up @@ -122,13 +133,16 @@ int demuxer(int argc, char* argv[]) {
PipelineDescriptor pipeline_desc;
auto demux_writer = pipeline_desc.add_node<BarcodeDemuxerNode>(
{}, output_dir, demux_writer_threads, 0, parser.get<bool>("--emit-fastq"));
std::vector<std::string> kit_names;
if (auto names = parser.present<std::vector<std::string>>("--kit-name")) {
kit_names = std::move(*names);

if (parser.is_used("--kit-name")) {
std::vector<std::string> kit_names;
if (auto names = parser.present<std::vector<std::string>>("--kit-name")) {
kit_names = std::move(*names);
}
auto demux = pipeline_desc.add_node<BarcodeClassifierNode>(
{demux_writer}, demux_threads, kit_names, parser.get<bool>("--barcode-both-ends"),
parser.get<bool>("--no-trim"));
}
auto demux = pipeline_desc.add_node<BarcodeClassifierNode>(
{demux_writer}, demux_threads, kit_names, parser.get<bool>("--barcode-both-ends"),
parser.get<bool>("--no-trim"));

// Create the Pipeline from our description.
std::vector<dorado::stats::StatsReporter> stats_reporters;
Expand All @@ -154,7 +168,7 @@ int demuxer(int argc, char* argv[]) {
kStatsPeriod, stats_reporters, stats_callables);
// End stats counting setup.

spdlog::info("> starting barcoding");
spdlog::info("> starting barcode demuxing");
reader.read(*pipeline, max_reads);

// Wait for the pipeline to complete. When it does, we collect
Expand All @@ -166,7 +180,7 @@ int demuxer(int argc, char* argv[]) {
tracker.update_progress_bar(final_stats);
tracker.summarize();

spdlog::info("> finished barcoding");
spdlog::info("> finished barcode demuxing");

return 0;
}
Expand Down
6 changes: 5 additions & 1 deletion dorado/read_pipeline/BarcodeDemuxerNode.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,11 @@ void BarcodeDemuxerNode::worker_thread() {
int BarcodeDemuxerNode::write(bam1_t* const record) {
assert(m_header);
// Fetch the barcode name.
std::string bc(bam_aux2Z(bam_aux_get(record, "BC")));
std::string bc = "unclassified";
auto bam_tag = bam_aux_get(record, "BC");
if (bam_tag) {
bc = std::string(bam_aux2Z(bam_tag));
}
// Check of existence of file for that barcode.
auto res = m_files.find(bc);
htsFile* file = nullptr;
Expand Down
10 changes: 10 additions & 0 deletions tests/test_simple_basecaller_execution.sh
Original file line number Diff line number Diff line change
Expand Up @@ -211,4 +211,14 @@ if [[ $num_read_groups -ne "0" ]]; then
exit 1
fi

# Test demux only on a pre-classified BAM file
$dorado_bin demux --no-classify --output-dir "$output_dir/demux_only_test/" $output_dir/read_group_test.bam
for bam in $output_dir/demux_only_test/SQK-RBK114-96_barcode01.bam $output_dir/demux_only_test/SQK-RBK114-96_barcode04.bam $output_dir/demux_only_test/unclassified.bam ; do
if [ ! -f $bam ]; then
echo "Missing expected bam file $bam"
exit 1
fi
done


rm -rf $output_dir

0 comments on commit eeef757

Please sign in to comment.