Bug Report: NullPointerException when --db has unrecognized extension
Summary
generate_spectral_library throws a NullPointerException on pWriter.close() when the file passed to --db has an extension other than .fa, .fasta, .tsv, .txt, or .csv. The underlying issue is that the extension check silently falls through, leaving all_peptide_forms empty, and the parquet-writing loop never initializes the writer before trying to close it.
Environment
- File: src/main/java/ai/AIGear.java
- Method: generate_spectral_library(String model_dir)
- Affected lines: 1696-1709 (extension dispatch) and 1719-1804 (parquet writer loop)
Reproduction
- Run the AI/training + library-generation pipeline.
- Pass a valid FASTA file to --db whose name ends in .fas (a common FASTA extension; the same issue applies to .faa, .pep, .parquet, etc.).
- Training completes normally; library generation then aborts with the stack trace below.
Observed output
Model training => Message: > === Improvement Summary (Median) ===
... (training succeeds) ...
Time used for model training: 4.69 min
Generating peptide forms: 0
Use parquet format ...
Error running Carafe: Cannot invoke "org.apache.parquet.hadoop.ParquetWriter.close()" because "pWriter" is null
java.lang.NullPointerException: Cannot invoke "org.apache.parquet.hadoop.ParquetWriter.close()" because "pWriter" is null
at main.java.ai.AIGear.generate_spectral_library(AIGear.java:1803)
at main.java.ai.AIGear.main(AIGear.java:1215)
at main.java.gui.CarafeLauncher.launchCLI(CarafeLauncher.java:66)
at main.java.gui.CarafeLauncher.main(CarafeLauncher.java:38)
Note the log shows Generating peptide forms: 0, but neither branch's diagnostic message is emitted (no Protein sequences:... total unique peptide sequences:... from DBGear.protein_digest, and no The input for spectral library generation is a peptide forms table: from the TSV/CSV branch). That confirms both branches were skipped.
Root cause
In AIGear.java:1696-1709:
if (this.db.toLowerCase().endsWith(".fa") || this.db.toLowerCase().endsWith(".fasta")) {
searchedPeptides = dbGear.protein_digest(this.db);
all_peptide_forms = searchedPeptides.parallelStream()
.map(PeptideUtils::calcPeptideIsoforms)
.flatMap(List::stream).sorted(comparator_peptide_mass_for_peptide_from_min2max).collect(toList());
} else if (this.db.toLowerCase().endsWith(".tsv") || this.db.toLowerCase().endsWith(".txt") || this.db.toLowerCase().endsWith(".csv")) {
...
SkylineIO.load_skyline_precursor_table(this.db, sep, all_peptide_forms, precursor_charge_list);
}
// no else — unrecognized extensions fall through silently
When this.db ends in .fas (or any other unrecognized extension), neither branch runs, no error is raised, and all_peptide_forms remains empty.
Then in the parquet-writing block at AIGear.java:1719-1804:
ParquetWriter<GenericRecord> pWriter = null;
...
boolean file_is_closed = false;
while (i_peptide <= all_peptide_forms.size()) { // 0 <= 0 — enters once
for (int i = 0; i < this.n_peptides_per_batch; i++) {
if (i_peptide >= all_peptide_forms.size()) { // 0 >= 0 — true
finished = true;
break; // exits before pWriter is constructed
}
...
if (i == 0) {
...
pWriter = AvroParquetWriter.<GenericRecord>builder(localOutputFile)...build();
file_is_closed = false;
}
...
}
if (finished) break;
}
if (!file_is_closed) {
pWriter.close(); // line 1803 — NPE: pWriter was never initialized
}
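With an empty input list, the inner loop breaks on its first iteration, before execution reaches the if (i == 0) block that constructs the writer, so pWriter is never assigned. Because file_is_closed is still false, the final if (!file_is_closed) then calls close() on a null reference.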
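The control flow can be checked in isolation. The sketch below is a minimal stand-in, not the code from AIGear.java: the class name, FakeWriter, and runLoop are hypothetical, the elided loop body is reduced to a counter increment, and the batch-end close logic is left out. It only illustrates that an empty list leaves the writer reference null when the loop exits.

```java
import java.util.List;

final class EmptyInputLoopDemo {
    /** Hypothetical stand-in for the Parquet writer; it holds no resources. */
    static final class FakeWriter {
    }

    /**
     * Mirrors the loop structure quoted above and returns the writer
     * reference as it stands when the loop exits.
     */
    static FakeWriter runLoop(List<String> allPeptideForms, int nPeptidesPerBatch) {
        FakeWriter pWriter = null;
        int iPeptide = 0;
        boolean finished = false;
        while (iPeptide <= allPeptideForms.size()) {
            for (int i = 0; i < nPeptidesPerBatch; i++) {
                if (iPeptide >= allPeptideForms.size()) {
                    finished = true;
                    break; // with empty input this fires on the very first pass
                }
                if (i == 0) {
                    pWriter = new FakeWriter(); // not reached for empty input
                }
                iPeptide++; // placeholder for the elided per-peptide work
            }
            if (finished) {
                break;
            }
        }
        return pWriter;
    }

    public static void main(String[] args) {
        boolean nullForEmpty = runLoop(List.of(), 4) == null;
        boolean nullForTwo = runLoop(List.of("PEPTIDEK", "SAMPLER"), 4) == null;
        System.out.println("empty input, writer is null: " + nullForEmpty);
        System.out.println("two peptides, writer is null: " + nullForTwo);
    }
}
```

Under these assumptions, only the empty-input call returns null, which matches the Generating peptide forms: 0 line in the log followed by the NPE at the close call.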
Suggested fixes
Two issues, both worth fixing:
1. Recognize more FASTA extensions. At minimum add .fas. Common variants in the wild include .fas, .faa, .fna, and .pep. Suggested change at AIGear.java:1696:
String dbLower = this.db.toLowerCase();
if (dbLower.endsWith(".fa") || dbLower.endsWith(".fasta") || dbLower.endsWith(".fas")) {
...
}
2. Fail fast on unrecognized input. Add a final else that throws (or logs and exits) with a clear message naming the file and listing the supported extensions. This prevents the silent fall-through that produces the misleading NPE further downstream. Independently, the parquet-writer block at lines 1802-1804 should guard against pWriter == null (i.e., empty input) and either skip the close or throw a descriptive error. That guard is defense in depth in case any other path ever produces an empty all_peptide_forms.
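Both suggestions can be sketched together in a small standalone helper. This is an illustration under assumptions, not a patch against AIGear.java: DbInputCheck, Kind, classify, and closeIfOpen are hypothetical names, the FASTA list contains only .fa, .fasta, and the proposed .fas, and the exception type and message wording are placeholders. closeIfOpen takes a java.io.Closeable so the sketch runs without Parquet on the classpath.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.Locale;

final class DbInputCheck {
    enum Kind { FASTA, PEPTIDE_TABLE }

    // Assumed lists: .fa, .fasta and the table suffixes are the ones the report
    // says are accepted today; .fas is the proposed addition.
    static final List<String> FASTA_EXTENSIONS = List.of(".fa", ".fasta", ".fas");
    static final List<String> TABLE_EXTENSIONS = List.of(".tsv", ".txt", ".csv");

    /** Classifies the --db path by suffix, or throws instead of falling through. */
    static Kind classify(String dbPath) {
        String lower = dbPath.toLowerCase(Locale.ROOT);
        if (FASTA_EXTENSIONS.stream().anyMatch(lower::endsWith)) {
            return Kind.FASTA;
        }
        if (TABLE_EXTENSIONS.stream().anyMatch(lower::endsWith)) {
            return Kind.PEPTIDE_TABLE;
        }
        throw new IllegalArgumentException("Unsupported --db file '" + dbPath
                + "'. Supported extensions: " + FASTA_EXTENSIONS + " (FASTA), "
                + TABLE_EXTENSIONS + " (peptide forms table).");
    }

    /** Closes the writer only if one was actually constructed. */
    static void closeIfOpen(Closeable writer) throws IOException {
        if (writer != null) {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(classify("proteome.fas"));
        System.out.println(classify("peptide_forms.tsv"));
        closeIfOpen(null); // no-op rather than a NullPointerException
        try {
            classify("library.parquet");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running the check before model training starts, rather than inside generate_spectral_library, would also surface the problem before the long training run; whether that fits the existing argument handling is a question for the maintainers.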
Workaround for users
Rename or symlink the FASTA so it ends in .fasta or .fa before passing it to --db.
Impact
Any user whose FASTA file uses the .fas extension (a standard, widely-used FASTA suffix) hits this after a successful — and potentially long — training run, with an error message that gives no hint that the --db argument is the actual problem.