
Carafe throws an unhelpful exception when the input FASTA file has an unexpected suffix (e.g., '.fas') #5

@mriffle

Bug Report: NullPointerException when --db has unrecognized extension

Summary

generate_spectral_library throws a NullPointerException on pWriter.close() when the file passed to --db has an extension other than .fa, .fasta, .tsv, .txt, or .csv. The underlying issue is that the extension check silently falls through, leaving all_peptide_forms empty, and the parquet-writing loop never initializes the writer before trying to close it.

Environment

  • File: src/main/java/ai/AIGear.java
  • Method: generate_spectral_library(String model_dir)
  • Affected lines: 1696-1709 (extension dispatch) and 1719-1804 (parquet writer loop)

Reproduction

  1. Run the AI/training + library-generation pipeline.
  2. Pass a valid FASTA file to --db whose name ends in .fas (a common FASTA extension). The same failure occurs with any other unrecognized extension, such as .faa, .pep, or .parquet.
  3. Training completes normally; library generation then aborts with the stack trace below.

Observed output

Model training => Message: > === Improvement Summary (Median) ===
... (training succeeds) ...
Time used for model training: 4.69 min
Generating peptide forms: 0
Use parquet format ...
Error running Carafe: Cannot invoke "org.apache.parquet.hadoop.ParquetWriter.close()" because "pWriter" is null
java.lang.NullPointerException: Cannot invoke "org.apache.parquet.hadoop.ParquetWriter.close()" because "pWriter" is null
    at main.java.ai.AIGear.generate_spectral_library(AIGear.java:1803)
    at main.java.ai.AIGear.main(AIGear.java:1215)
    at main.java.gui.CarafeLauncher.launchCLI(CarafeLauncher.java:66)
    at main.java.gui.CarafeLauncher.main(CarafeLauncher.java:38)

Note that the log shows Generating peptide forms: 0, yet neither branch's diagnostic message appears: there is no "Protein sequences:... total unique peptide sequences:..." line from DBGear.protein_digest, and no "The input for spectral library generation is a peptide forms table:" line from the TSV/CSV branch. This confirms that both branches were skipped.

Root cause

In AIGear.java:1696-1709:

if (this.db.toLowerCase().endsWith(".fa") || this.db.toLowerCase().endsWith(".fasta")) {
    searchedPeptides = dbGear.protein_digest(this.db);
    all_peptide_forms = searchedPeptides.parallelStream()
            .map(PeptideUtils::calcPeptideIsoforms)
            .flatMap(List::stream).sorted(comparator_peptide_mass_for_peptide_from_min2max).collect(toList());
} else if (this.db.toLowerCase().endsWith(".tsv") || this.db.toLowerCase().endsWith(".txt") || this.db.toLowerCase().endsWith(".csv")) {
    ...
    SkylineIO.load_skyline_precursor_table(this.db, sep, all_peptide_forms, precursor_charge_list);
}
// no else — unrecognized extensions fall through silently

When this.db ends in .fas (or any other unrecognized extension), neither branch runs, no error is raised, and all_peptide_forms remains empty.

Then in the parquet-writing block at AIGear.java:1719-1804:

ParquetWriter<GenericRecord> pWriter = null;
...
boolean file_is_closed = false;
while (i_peptide <= all_peptide_forms.size()) {           // 0 <= 0 — enters once
    for (int i = 0; i < this.n_peptides_per_batch; i++) {
        if (i_peptide >= all_peptide_forms.size()) {      // 0 >= 0 — true
            finished = true;
            break;                                        // exits before pWriter is constructed
        }
        ...
        if (i == 0) {
            ...
            pWriter = AvroParquetWriter.<GenericRecord>builder(localOutputFile)...build();
            file_is_closed = false;
        }
        ...
    }
    if (finished) break;
}
if (!file_is_closed) {
    pWriter.close();   // line 1803 — NPE: pWriter was never initialized
}

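With an empty input list, the inner loop breaks on its first iteration, before the if (i == 0) block that constructs the writer is reached, so pWriter is never assigned. Because file_is_closed is still false, the final if (!file_is_closed) check passes and pWriter.close() dereferences null.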
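The control flow above can be modeled in isolation. The following is a minimal sketch, not the actual AIGear code: Writer is a hypothetical stand-in for ParquetWriter, the class and method names are invented for illustration, and the placement of the i_peptide increment is an assumption because that part of the loop is elided in the excerpt.

```java
import java.util.List;

/** Stand-alone model of the batching loop; all names here are illustrative. */
class EmptyInputRepro {

    /** Hypothetical stand-in for org.apache.parquet.hadoop.ParquetWriter. */
    static class Writer {
        void close() { }
    }

    /** Mirrors the quoted control flow and returns the writer as it stands after the loop. */
    static Writer runLoop(List<String> allPeptideForms, int nPeptidesPerBatch) {
        Writer pWriter = null;
        boolean finished = false;
        int iPeptide = 0;
        while (iPeptide <= allPeptideForms.size()) {
            for (int i = 0; i < nPeptidesPerBatch; i++) {
                if (iPeptide >= allPeptideForms.size()) {
                    finished = true;
                    break; // with an empty list this fires on the very first pass
                }
                if (i == 0) {
                    pWriter = new Writer(); // only reached when at least one peptide form exists
                }
                iPeptide++; // assumed: the real loop advances the index in the elided code
            }
            if (finished) break;
        }
        return pWriter;
    }

    public static void main(String[] args) {
        System.out.println(runLoop(List.of(), 1000) == null);          // prints true
        System.out.println(runLoop(List.of("PEPTIDE"), 1000) == null); // prints false
    }
}
```

With an empty list the method returns null, which is the state in which the real code goes on to call pWriter.close().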

Suggested fixes

Two issues, both worth fixing:

1. Recognize more FASTA extensions. At minimum add .fas. Common variants in the wild include .fas, .faa, .fna, and .pep. Suggested change at AIGear.java:1696:

String dbLower = this.db.toLowerCase();
if (dbLower.endsWith(".fa") || dbLower.endsWith(".fasta") || dbLower.endsWith(".fas")) {
    ...
}

2. Fail fast on unrecognized input. Add a final else that throws (or logs and exits) with a clear message naming the file and listing supported extensions. This prevents the silent fall-through that produces the misleading NPE further downstream.

Independently, the parquet-writer block at lines 1802-1804 should guard against pWriter == null (i.e., empty input) and either skip the close or throw a descriptive error. This is defense in depth in case any other path ever produces an empty all_peptide_forms.
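One way the suggestions above might look, as a rough sketch rather than a patch against AIGear.java: the class DbInputCheck, the enum DbKind, and the helpers classify and closeIfOpen are all hypothetical names, and the extension lists only reflect the ones discussed in this report.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Locale;

/** Illustrative helpers only; none of these names exist in Carafe. */
class DbInputCheck {

    enum DbKind { FASTA, PEPTIDE_TABLE }

    // .fas is the addition proposed in this report; the other entries are the current ones.
    static final List<String> FASTA_EXT = List.of(".fa", ".fasta", ".fas");
    static final List<String> TABLE_EXT = List.of(".tsv", ".txt", ".csv");

    /** Classifies the --db path by extension and throws on anything unrecognized. */
    static DbKind classify(String db) {
        String lower = db.toLowerCase(Locale.ROOT);
        if (FASTA_EXT.stream().anyMatch(lower::endsWith)) {
            return DbKind.FASTA;
        }
        if (TABLE_EXT.stream().anyMatch(lower::endsWith)) {
            return DbKind.PEPTIDE_TABLE;
        }
        throw new IllegalArgumentException(
                "Unrecognized --db file extension: " + db
                + " (supported: " + FASTA_EXT + " and " + TABLE_EXT + ")");
    }

    /** Closes the writer if one was ever opened; returns false when there was nothing to close. */
    static boolean closeIfOpen(Closeable writer) {
        if (writer == null) {
            return false; // empty input: no writer was constructed
        }
        try {
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(classify("proteins.fas")); // prints FASTA
        System.out.println(closeIfOpen(null));        // prints false
        try {
            classify("proteins.faa");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Calling something like classify before training starts would surface a bad --db argument immediately instead of after a long training run, and closeIfOpen accepts any Closeable, which ParquetWriter implements.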

Workaround for users

Rename or symlink the FASTA so it ends in .fasta or .fa before passing it to --db.

Impact

Any user whose FASTA file uses the .fas extension (a standard, widely used FASTA suffix) hits this after a successful, and potentially long, training run, with an error message that gives no hint that the --db argument is the actual problem.

Labels: bug
