feat(FTA-192): export gene Nomenclature History fields (etymology + gene change events) #81

Open
ianlongden wants to merge 8 commits into main from FTA-192-part2

@ianlongden (Contributor) commented Jun 11, 2026

Summary

Full gene Nomenclature History export (FTA-192 and sub-task FTA-193). This branch now carries both parts of the work.

Part 1 — Etymology (FTA-192)

  • etymology (G27) → GeneDTO.note_dtos with note type gene_etymology, via the existing prop-to-note pipeline (FBrf/PMID evidence attached).

Part 2 — Gene change events (FTA-193)

Maps two FlyBase gene featureprops to GeneChangeEventSlotAnnotationDTOs:

  • gene_nomenclature_comment (G41) → one gene change event per comment, the comment carried as an inner gene_nomenclature_note NoteDTO in the event's note_dtos.
  • identity_source (G28b) → value split into two symbols: symbol_renamed_to (new, first) and symbol_renamed_from (old, second). Values not yielding exactly two symbols are skipped and logged.

Both events use event_type_name='rename' (confirmed by Steven on FTA-193 — G41 is a note specifically about renames). event_status_name and current_version were made optional in agr_curation_schema PR #326, so they are left None and dropped from the export (FlyBase has no event status or versioning data to import).
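
As a rough illustration of the shape described above, the sketch below models a change event with only gene_identifier and event_type_name set, and drops unset fields on export. The class and field names follow this description, but the simplified NoteDTO, the field types, the to_export_dict() helper, and the example identifier are assumptions made for illustration, not the code in src/agr_datatypes.py.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class NoteDTO:
    # Simplified, hypothetical note shape; the real DTO carries more fields.
    note_type_name: str
    free_text: str
    evidence_curies: List[str] = field(default_factory=list)


@dataclass
class GeneChangeEventSlotAnnotationDTO:
    gene_identifier: str
    event_type_name: str
    symbol_renamed_from: Optional[str] = None
    symbol_renamed_to: Optional[str] = None
    event_status_name: Optional[str] = None  # optional per schema PR #326
    current_version: Optional[str] = None    # optional per schema PR #326; type assumed
    note_dtos: List[NoteDTO] = field(default_factory=list)


def to_export_dict(event):
    """Hypothetical helper: drop top-level None and empty-list fields."""
    return {k: v for k, v in asdict(event).items() if v not in (None, [])}


# A G41-style event: the comment travels as an inner gene_nomenclature_note.
event = GeneChangeEventSlotAnnotationDTO(
    gene_identifier="FB:FBgn0000001",  # placeholder identifier
    event_type_name="rename",
    note_dtos=[NoteDTO("gene_nomenclature_note", "Example comment.")],
)
exported = to_export_dict(event)
```

Here exported keeps gene_identifier, event_type_name, and note_dtos, while event_status_name and current_version never reach the output.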

Changes

  • src/agr_datatypes.py: new ChangeEventSlotAnnotationDTO base + GeneChangeEventSlotAnnotationDTO; gene_change_event_dtos slot added to GeneDTO. Only event_type_name (and gene_identifier) are required.
  • src/gene_handler.py: etymology added to gene_prop_to_note_mapping; new map_gene_change_events() called from map_fb_data_to_alliance(). Reuses convert_prop_to_note(), lookup_pub_curies(), and props_by_type (no new query).
  • src/curation_tsv.py + src/AGR_data_retrieval_curation_gene.py: new write_gene_change_events_tsv() companion TSV — emits gene_curation_*_gene_change_events.tsv (one row per change event: event_type, symbol_renamed_from/to, note, pipe-joined evidence; honors ADD_OBSOLETE=NO). Etymology already reaches the notes TSV as a gene_etymology note, so no separate writer was needed for it.
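
The companion TSV could look roughly like the sketch below: one row per change event, evidence pipe-joined, obsolete genes skipped unless requested. It works on plain dicts rather than the real DTOs, and the column names, the "obsolete" key, the add_obsolete parameter, and the choice of where evidence is read from are all assumptions for illustration, not the actual write_gene_change_events_tsv() in src/curation_tsv.py.

```python
import csv
import io


def write_gene_change_events_tsv(genes, out, add_obsolete=False):
    """Write one TSV row per gene change event and return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_identifier", "event_type", "symbol_renamed_from",
                     "symbol_renamed_to", "note", "evidence"])
    row_count = 0
    for gene in genes:
        # Assumed flag: skip obsolete genes unless asked, mirroring ADD_OBSOLETE=NO.
        if gene.get("obsolete") and not add_obsolete:
            continue
        for event in gene.get("gene_change_event_dtos", []):
            notes = event.get("note_dtos", [])
            note_text = notes[0]["free_text"] if notes else ""
            # Assumption: evidence sits on the inner note when there is one,
            # otherwise on the event itself.
            source = notes[0] if notes else event
            evidence = "|".join(source.get("evidence_curies", []))
            writer.writerow([
                gene["gene_identifier"],
                event["event_type_name"],
                event.get("symbol_renamed_from", ""),
                event.get("symbol_renamed_to", ""),
                note_text,
                evidence,
            ])
            row_count += 1
    return row_count


# Usage with an in-memory buffer and placeholder identifiers.
example_genes = [{
    "gene_identifier": "FB:FBgn0000001",
    "obsolete": False,
    "gene_change_event_dtos": [{
        "event_type_name": "rename",
        "symbol_renamed_from": "B",
        "symbol_renamed_to": "Bar",
        "evidence_curies": ["FB:FBrf0000001", "PMID:1"],
    }],
}]
buffer = io.StringIO()
rows_written = write_gene_change_events_tsv(example_genes, buffer)
```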

Schema dependency

Validates against a LINKML_VERSION that includes both PR #323 (introduces GeneChangeEventSlotAnnotation) and PR #326 (makes event_status_name/current_version optional).

Test plan

  • python -m py_compile passes for both files.
  • Ran AGR_data_retrieval_curation_gene.py in test mode (-t); output validates against the agr_curation_schema main schema (no tagged release contains gene_change_event_dtos yet; see #323/#326). Validated file: gene_curation_fb_2026_01_reporting.json → VALID for agr schema.

🤖 Generated with Claude Code

ianlongden and others added 4 commits June 1, 2026 13:50
Map the FlyBase 'etymology' featureprop (G27) to GeneDTO.note_dtos with
note type 'gene_etymology' via the existing prop-to-note pipeline, which
attaches FBrf/PMID evidence curies. Part of the Nomenclature History export
(FTA-192); the gene_nomenclature_comment and identity_source mappings remain
staged in FTA-193 pending agr_curation_schema PR #323.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add GeneChangeEventSlotAnnotationDTO (and ChangeEventSlotAnnotationDTO
base) plus a gene_change_event_dtos slot on GeneDTO, and a
map_gene_change_events() handler method that maps two FlyBase gene
featureprops to Alliance gene change events:

- gene_nomenclature_comment (G41): one change event per comment, the
  comment carried as an inner gene_nomenclature_note NoteDTO.
- identity_source (G28b): value split into symbol_renamed_to (new) and
  symbol_renamed_from (old), skipping malformed values defensively.

Reuses convert_prop_to_note(), lookup_pub_curies() and props_by_type
(no new query). The schema's required event_type_name/event_status_name/
current_version fields and the identity_source delimiter use TODO-flagged
placeholders pending confirmation (see FTA-193 comment).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add GeneChangeEventSlotAnnotationDTO (and ChangeEventSlotAnnotationDTO
base) plus a gene_change_event_dtos slot on GeneDTO, and a
map_gene_change_events() handler method that maps two FlyBase gene
featureprops to Alliance gene change events:

- gene_nomenclature_comment (G41): one change event per comment, the
  comment carried as an inner gene_nomenclature_note NoteDTO.
- identity_source (G28b): value split into symbol_renamed_to (new) and
  symbol_renamed_from (old), skipping malformed values defensively.

Both events use event_type_name 'rename' (confirmed by Steven on
FTA-193). event_status_name and current_version were made optional in
agr_curation_schema PR #326, so they are left None and omitted from the
export (FlyBase has no event status or versioning data to import).
Reuses convert_prop_to_note(), lookup_pub_curies() and props_by_type
(no new query).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings in the etymology mapping (part 1, commit 26c8874). Conflicts in
gene_handler.py and agr_datatypes.py resolved in favour of the corrected
gene-change-events code on this branch (event_type='rename', event_status_name
and current_version omitted per schema PR #326), discarding the superseded
placeholder version from FTA-192's commit 8ac0cfa.
@ianlongden changed the title from "feat(FTA-193): map gene nomenclature history props to gene change events" to "feat(FTA-192): export gene Nomenclature History fields (etymology + gene change events)" Jun 11, 2026
@ianlongden (Contributor, Author) commented

Data dump started; will merge if the data looks okay.

ianlongden and others added 4 commits June 11, 2026 14:56
map_gene_gcrp_xrefs() looked up self.fb_agr_db_dict['UniProt/GCRP'], but
that key was commented out of fb_agr_db_dict in aa1fbfc so UniProt/GCRP
xrefs are excluded from the general xref export. Any gene with a
UniProt/GCRP dbxref therefore raised KeyError. Hardcode the prefix as
'UniProtKB' (the value previously held in the dict) so only the
gcrp_cross_reference_dto is populated, matching the intent of aa1fbfc.
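
A minimal sketch of the failure and the fix, assuming a toy stand-in for fb_agr_db_dict (the 'FlyBase' entry and both function names are hypothetical; only the missing 'UniProt/GCRP' key and the 'UniProtKB' prefix come from this commit):

```python
# Hypothetical stand-in for fb_agr_db_dict; the real dict has other entries.
fb_agr_db_dict = {
    "FlyBase": "FB",
    # "UniProt/GCRP": "UniProtKB",  # key no longer present
}

GCRP_PREFIX = "UniProtKB"  # hardcoded: the value the dict used to hold


def gcrp_curie_via_dict(accession):
    """Old lookup: raises KeyError now that the key is gone."""
    return f"{fb_agr_db_dict['UniProt/GCRP']}:{accession}"


def gcrp_curie(accession):
    """Fixed lookup: uses the hardcoded prefix instead of the dict."""
    return f"{GCRP_PREFIX}:{accession}"
```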

Pre-existing main regression, unrelated to the FTA-192 Nomenclature
History work on this branch; committed here to unblock test-mode export.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add write_gene_change_events_tsv() to curation_tsv and call it from the
gene retrieval script, producing gene_curation_*_gene_change_events.tsv
alongside the JSON and existing TSVs. One row per gene_change_event_dto:
rename events fill symbol_renamed_from/to, nomenclature comment events
fill the note column from the inner note's free_text, evidence curies are
pipe-joined. Honors ADD_OBSOLETE=NO like the other writers.

Etymology already reaches the notes TSV (it is a gene_etymology note in
note_dtos), so only gene_change_event_dtos needed a new writer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 'identity_source' featureprop value carries a 'Source for identity of: '
prefix before the two symbols (e.g. 'Source for identity of: Bar B'). The
parser split the whole string on whitespace, so every value yielded >2 tokens,
failed the len==2 guard, and the rename event was silently skipped. Strip the
prefix before splitting so symbol_renamed_to/from are populated (recovers ~12k
clean two-symbol values DB-wide; merge/multi-token values still skipped+logged).
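
The corrected parsing can be sketched as below. The prefix string and the two-token guard come from this commit message; the function name, return shape, and logger are assumptions, and the real handler may differ.

```python
import logging

log = logging.getLogger(__name__)

IDENTITY_SOURCE_PREFIX = "Source for identity of: "


def parse_identity_source(value):
    """Return (symbol_renamed_to, symbol_renamed_from), or None if malformed."""
    text = value.strip()
    # Strip the fixed prefix first so it does not inflate the token count.
    if text.startswith(IDENTITY_SOURCE_PREFIX):
        text = text[len(IDENTITY_SOURCE_PREFIX):]
    tokens = text.split()
    if len(tokens) != 2:
        # Merges and multi-token values are skipped and logged.
        log.warning("Skipping identity_source value: %r", value)
        return None
    new_symbol, old_symbol = tokens  # new symbol comes first, old second
    return new_symbol, old_symbol


renamed = parse_identity_source("Source for identity of: Bar B")
```

With the example value from the commit message, renamed holds the new symbol Bar first and the old symbol B second; a value with three or more symbols returns None.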

Also add an 'evidence' column to the notes reporting TSV (pipe-joined
evidence_curies) so curators can check the FBrf behind each note, per Steven's
review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Multi-token "identity_source" props (3+ tokens, e.g. gene merges or values
with embedded provenance sentences) fail the two-symbol check and are skipped
during gene change event mapping. Previously these were only written to the
warning log, so curators could not review them.

Capture each skipped prop (FBid, raw value, token count, internal/obsolete
flags) on the GeneHandler and write a companion
..._skipped_identity_source.tsv. The writer intentionally does not honor
ADD_OBSOLETE=NO so obsolete/internal cases are included and flagged, letting
curators check whether bad-syntax values sit on obsolete FBgns.
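
A possible shape for the captured records and their writer, shown only as a sketch: the record class, its field names, and the column headers are hypothetical, while the captured attributes (FBid, raw value, token count, internal/obsolete flags) and the lack of an obsolete filter follow the description above.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SkippedIdentitySource:
    # Hypothetical record for one skipped identity_source prop.
    fb_id: str
    raw_value: str
    token_count: int
    is_internal: bool
    is_obsolete: bool


def write_skipped_identity_source_tsv(skipped, out):
    """Write every skipped prop, including obsolete and internal ones."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["fb_id", "raw_value", "token_count", "internal", "obsolete"])
    for item in skipped:
        # No ADD_OBSOLETE filter here: flagged rows are kept for curator review.
        writer.writerow([item.fb_id, item.raw_value, item.token_count,
                         item.is_internal, item.is_obsolete])


# Usage with a placeholder FBgn and an in-memory buffer.
skipped_buffer = io.StringIO()
write_skipped_identity_source_tsv(
    [SkippedIdentitySource("FBgn0000001", "Source for identity of: a b c",
                           3, False, True)],
    skipped_buffer,
)
```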

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>