feat(FTA-192): export gene Nomenclature History fields (etymology + gene change events) #81
Open
ianlongden wants to merge 8 commits into
Map the FlyBase 'etymology' featureprop (G27) to GeneDTO.note_dtos with note type 'gene_etymology' via the existing prop-to-note pipeline, which attaches FBrf/PMID evidence curies. Part of the Nomenclature History export (FTA-192); the gene_nomenclature_comment and identity_source mappings remain staged in FTA-193 pending agr_curation_schema PR #323. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add GeneChangeEventSlotAnnotationDTO (and ChangeEventSlotAnnotationDTO base) plus a gene_change_event_dtos slot on GeneDTO, and a map_gene_change_events() handler method that maps two FlyBase gene featureprops to Alliance gene change events:
- gene_nomenclature_comment (G41): one change event per comment, the comment carried as an inner gene_nomenclature_note NoteDTO.
- identity_source (G28b): value split into symbol_renamed_to (new) and symbol_renamed_from (old), skipping malformed values defensively.
Reuses convert_prop_to_note(), lookup_pub_curies() and props_by_type (no new query). The schema's required event_type_name/event_status_name/current_version fields and the identity_source delimiter use TODO-flagged placeholders pending confirmation (see FTA-193 comment).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add GeneChangeEventSlotAnnotationDTO (and ChangeEventSlotAnnotationDTO base) plus a gene_change_event_dtos slot on GeneDTO, and a map_gene_change_events() handler method that maps two FlyBase gene featureprops to Alliance gene change events:
- gene_nomenclature_comment (G41): one change event per comment, the comment carried as an inner gene_nomenclature_note NoteDTO.
- identity_source (G28b): value split into symbol_renamed_to (new) and symbol_renamed_from (old), skipping malformed values defensively.
Both events use event_type_name 'rename' (confirmed by Steven on FTA-193). event_status_name and current_version were made optional in agr_curation_schema PR #326, so they are left None and omitted from the export (FlyBase has no event status or versioning data to import). Reuses convert_prop_to_note(), lookup_pub_curies() and props_by_type (no new query).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
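For reviewers who want a concrete picture of "left None and omitted from the export", here is a minimal sketch. It is not the code in src/agr_datatypes.py: the class name `GeneChangeEventSketch`, the helper `export_dict()`, and the example identifier are hypothetical, and only the field names are taken from the commit message above.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class GeneChangeEventSketch:
    """Hypothetical stand-in for GeneChangeEventSlotAnnotationDTO."""
    gene_identifier: str                       # required
    event_type_name: str = "rename"            # required; 'rename' for both props
    symbol_renamed_to: Optional[str] = None    # new symbol
    symbol_renamed_from: Optional[str] = None  # old symbol
    event_status_name: Optional[str] = None    # optional per schema PR #326
    current_version: Optional[str] = None      # optional per schema PR #326
    note_dtos: List[dict] = field(default_factory=list)


def export_dict(event: GeneChangeEventSketch) -> dict:
    """Return the event as a dict, dropping every field that is None."""
    return {key: value for key, value in asdict(event).items() if value is not None}


# Example identifier is made up for illustration.
example = GeneChangeEventSketch(
    gene_identifier="FB:FBgn0000001",
    symbol_renamed_to="Bar",
    symbol_renamed_from="B",
)
exported = export_dict(example)
print(sorted(exported))
```

Filtering at export time keeps the in-memory object uniform (every event has the same attributes) while the exported record only carries fields FlyBase can actually populate.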
Brings in the etymology mapping (part 1, commit 26c8874). Conflicts in gene_handler.py and agr_datatypes.py resolved in favour of the corrected gene-change-events code on this branch (event_type='rename', event_status_name and current_version omitted per schema PR #326), discarding the superseded placeholder version from FTA-192's commit 8ac0cfa.
Data dump started; will merge if the data looks okay.
map_gene_gcrp_xrefs() looked up self.fb_agr_db_dict['UniProt/GCRP'], but that key was commented out of fb_agr_db_dict in aa1fbfc so UniProt/GCRP xrefs are excluded from the general xref export. Any gene with a UniProt/GCRP dbxref therefore raised KeyError. Hardcode the prefix as 'UniProtKB' (the value previously held in the dict) so only the gcrp_cross_reference_dto is populated, matching the intent of aa1fbfc. Pre-existing main regression, unrelated to the FTA-192 Nomenclature History work on this branch; committed here to unblock test-mode export. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add write_gene_change_events_tsv() to curation_tsv and call it from the gene retrieval script, producing gene_curation_*_gene_change_events.tsv alongside the JSON and existing TSVs. One row per gene_change_event_dto: rename events fill symbol_renamed_from/to, nomenclature comment events fill the note column from the inner note's free_text, evidence curies are pipe-joined. Honors ADD_OBSOLETE=NO like the other writers. Etymology already reaches the notes TSV (it is a gene_etymology note in note_dtos), so only gene_change_event_dtos needed a new writer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
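The row layout described above can be sketched as follows. This is an illustration under assumptions, not the writer in src/curation_tsv.py: the function name `write_change_events_tsv`, the dict-based event shape, the `gene_identifier` column, the choice to read `evidence_curies` from the event itself, and all identifiers in the sample data are hypothetical. The ADD_OBSOLETE filter is left out.

```python
import csv
import io

# Column order is an assumption; the commit only names the columns.
COLUMNS = ["gene_identifier", "event_type", "symbol_renamed_from",
           "symbol_renamed_to", "note", "evidence"]


def write_change_events_tsv(events, handle):
    """Write one tab-separated row per gene change event."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for event in events:
        notes = event.get("note_dtos") or []
        # Nomenclature-comment events carry their text in the inner note.
        note_text = notes[0].get("free_text", "") if notes else ""
        writer.writerow([
            event.get("gene_identifier", ""),
            event.get("event_type_name", ""),
            event.get("symbol_renamed_from") or "",
            event.get("symbol_renamed_to") or "",
            note_text,
            "|".join(event.get("evidence_curies") or []),  # pipe-joined
        ])


# Made-up sample data: one rename event and one nomenclature-comment event.
example_events = [
    {"gene_identifier": "FB:FBgn0000001", "event_type_name": "rename",
     "symbol_renamed_from": "B", "symbol_renamed_to": "Bar",
     "evidence_curies": ["FB:FBrf0000001", "PMID:1"]},
    {"gene_identifier": "FB:FBgn0000002", "event_type_name": "rename",
     "note_dtos": [{"free_text": "Renamed for consistency."}],
     "evidence_curies": []},
]
buffer = io.StringIO()
write_change_events_tsv(example_events, buffer)
print(buffer.getvalue())
```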
The 'identity_source' featureprop value carries a 'Source for identity of: ' prefix before the two symbols (e.g. 'Source for identity of: Bar B'). The parser split the whole string on whitespace, so every value yielded >2 tokens, failed the len==2 guard, and the rename event was silently skipped. Strip the prefix before splitting so symbol_renamed_to/from are populated (recovers ~12k clean two-symbol values DB-wide; merge/multi-token values still skipped+logged). Also add an 'evidence' column to the notes reporting TSV (pipe-joined evidence_curies) so curators can check the FBrf behind each note, per Steven's review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
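The parsing fix can be sketched as below. The function name `parse_identity_source` and its return shape are hypothetical; the prefix string, the whitespace split, the exactly-two-tokens guard, and the order (new symbol first, old symbol second) follow the description in this PR.

```python
from typing import Optional, Tuple

IDENTITY_SOURCE_PREFIX = "Source for identity of: "


def parse_identity_source(value: str) -> Optional[Tuple[str, str]]:
    """Return (symbol_renamed_to, symbol_renamed_from), or None if malformed."""
    text = value.strip()
    # Strip the fixed prefix first; otherwise its words count as tokens
    # and every value fails the two-token guard.
    if text.startswith(IDENTITY_SOURCE_PREFIX):
        text = text[len(IDENTITY_SOURCE_PREFIX):]
    tokens = text.split()
    if len(tokens) != 2:
        return None  # merge / multi-token values are skipped by the caller
    renamed_to, renamed_from = tokens  # new symbol first, old symbol second
    return renamed_to, renamed_from


print(parse_identity_source("Source for identity of: Bar B"))
print(parse_identity_source("Source for identity of: a b c"))
```

Without the prefix strip, "Source for identity of: Bar B" splits into six tokens, which is why the original guard rejected every value.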
Multi-token "identity_source" props (3+ tokens, e.g. gene merges or values with embedded provenance sentences) fail the two-symbol check and are skipped during gene change event mapping. Previously these were only written to the warning log, so curators could not review them. Capture each skipped prop (FBid, raw value, token count, internal/obsolete flags) on the GeneHandler and write a companion ..._skipped_identity_source.tsv. The writer intentionally does not honor ADD_OBSOLETE=NO so obsolete/internal cases are included and flagged, letting curators check whether bad-syntax values sit on obsolete FBgns. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
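A rough sketch of the capture step, assuming each prop arrives as a dict. The record class `SkippedIdentitySource`, the function `collect_skipped()`, the dict keys, and the sample FBgn ids are all hypothetical; only the captured fields (FBid, raw value, token count, internal/obsolete flags) come from the commit message.

```python
from dataclasses import dataclass
from typing import List

IDENTITY_SOURCE_PREFIX = "Source for identity of: "


@dataclass
class SkippedIdentitySource:
    """One identity_source prop that failed the two-symbol check."""
    fb_id: str
    raw_value: str
    token_count: int
    is_internal: bool
    is_obsolete: bool


def collect_skipped(props: List[dict]) -> List[SkippedIdentitySource]:
    """Record every prop whose value does not yield exactly two symbols."""
    skipped = []
    for prop in props:
        text = prop["value"].strip()
        if text.startswith(IDENTITY_SOURCE_PREFIX):
            text = text[len(IDENTITY_SOURCE_PREFIX):]
        tokens = text.split()
        if len(tokens) != 2:
            # Obsolete/internal genes are kept and flagged, not filtered out.
            skipped.append(SkippedIdentitySource(
                fb_id=prop["fb_id"],
                raw_value=prop["value"],
                token_count=len(tokens),
                is_internal=prop.get("is_internal", False),
                is_obsolete=prop.get("is_obsolete", False),
            ))
    return skipped


# Made-up sample props for illustration.
sample_props = [
    {"fb_id": "FBgn0000001", "value": "Source for identity of: Bar B"},
    {"fb_id": "FBgn0000002", "value": "Source for identity of: a b c",
     "is_obsolete": True},
]
skipped_props = collect_skipped(sample_props)
print(skipped_props)
```

Keeping the raw value alongside the token count lets a curator see exactly what was stored, rather than a normalized form.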
Summary
Full gene Nomenclature History export (FTA-192 and sub-task FTA-193). This branch now carries both parts of the work.
Part 1: Etymology (FTA-192)
- `etymology` (G27) → `GeneDTO.note_dtos` with note type `gene_etymology`, via the existing prop-to-note pipeline (FBrf/PMID evidence attached).

Part 2: Gene change events (FTA-193)
Maps two FlyBase gene featureprops to `GeneChangeEventSlotAnnotationDTO`s:
- `gene_nomenclature_comment` (G41) → one gene change event per comment, the comment carried as an inner `gene_nomenclature_note` `NoteDTO` in the event's `note_dtos`.
- `identity_source` (G28b) → value split into two symbols: `symbol_renamed_to` (new, first) and `symbol_renamed_from` (old, second). Values not yielding exactly two symbols are skipped and logged.

Both events use `event_type_name='rename'` (confirmed by Steven on FTA-193; G41 is a note specifically about renames). `event_status_name` and `current_version` were made optional in agr_curation_schema PR #326, so they are left `None` and dropped from the export (FlyBase has no event status or versioning data to import).

Changes
- `src/agr_datatypes.py`: new `ChangeEventSlotAnnotationDTO` base + `GeneChangeEventSlotAnnotationDTO`; `gene_change_event_dtos` slot added to `GeneDTO`. Only `event_type_name` (and `gene_identifier`) are required.
- `src/gene_handler.py`: `etymology` added to `gene_prop_to_note_mapping`; new `map_gene_change_events()` called from `map_fb_data_to_alliance()`. Reuses `convert_prop_to_note()`, `lookup_pub_curies()`, and `props_by_type` (no new query).
- `src/curation_tsv.py` + `src/AGR_data_retrieval_curation_gene.py`: new `write_gene_change_events_tsv()` companion TSV; emits `gene_curation_*_gene_change_events.tsv` (one row per change event: `event_type`, `symbol_renamed_from/to`, `note`, pipe-joined `evidence`; honors `ADD_OBSOLETE=NO`). Etymology already reaches the notes TSV as a `gene_etymology` note, so no separate writer was needed for it.

Schema dependency
Validates against a `LINKML_VERSION` that includes both PR #323 (introduces `GeneChangeEventSlotAnnotation`) and PR #326 (makes `event_status_name`/`current_version` optional).

Test plan
- `python -m py_compile` passes for both files.
- `AGR_data_retrieval_curation_gene.py` in test mode (`-t`); output validates against the `agr_curation_schema` `main` schema (no tagged release contains `gene_change_event_dtos` yet; see #323/#326). Validated file: `gene_curation_fb_2026_01_reporting.json` → VALID for agr schema.

🤖 Generated with Claude Code