This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Question: Do we expect many participant IDs to be associated with more than one harmonized_diagnosis? #913

Open
jaclyn-taroni opened this issue Jan 16, 2021 · 2 comments
Labels
data question Further information is requested

Comments

@jaclyn-taroni
Member

What data file(s) does this issue pertain to?

pbta-histologies.tsv

What release are you using?

release-v18-20201123

Put your question or report your issue here.

For context, I was reviewing #912 and there was a question about an increase in the number of unique Kids_First_Participant_ID, harmonized_diagnosis pairs.

I looked into how often a Kids_First_Participant_ID is associated with multiple harmonized_diagnosis values with the following:

library(tidyverse)

# Read in v18 pbta-histologies.tsv explicitly
v18_df <- read_tsv("data/release-v18-20201123/pbta-histologies.tsv")

# Find participant IDs associated with multiple harmonized dx values
pt_multiple_dx <- v18_df %>% 
  # Restrict to solid tissue tumor samples
  filter(sample_type == "Tumor", 
         composition == "Solid Tissue") %>% 
  # Filter to unique participant-harmonized dx pairs
  select(Kids_First_Participant_ID, 
         harmonized_diagnosis) %>% 
  distinct()  %>%
  # Find participant IDs when more than one harmonized dx exists
  count(Kids_First_Participant_ID) %>%
  filter(n > 1) %>%
  pull(Kids_First_Participant_ID)

# Create a TSV file with all ID and disease type info in these cases
v18_df %>% 
  filter(Kids_First_Participant_ID %in% pt_multiple_dx, 
         sample_type == "Tumor", 
         composition == "Solid Tissue") %>% 
  select(Kids_First_Participant_ID,
         Kids_First_Biospecimen_ID,
         sample_id,
         pathology_diagnosis,
         integrated_diagnosis,
         harmonized_diagnosis,
         molecular_subtype,
         tumor_descriptor) %>% 
  arrange(Kids_First_Participant_ID) %>%
  write_tsv("participants_with_multiple_harmonized_dx.txt")

There are 24 instances of an individual Kids_First_Participant_ID being associated with multiple harmonized_diagnosis values. Here's the file produced using the above: participants_with_multiple_harmonized_dx.txt
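
For anyone who wants to sanity-check the counting logic without the release file, here is a minimal sketch on toy data. The participant IDs and diagnosis strings are made up for illustration and are not from pbta-histologies.tsv; the column names match the ones used above. It also collapses the distinct diagnoses per participant into one string, which can make the list easier to scan.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the filtered tumor rows (all values are hypothetical)
toy_df <- tribble(
  ~Kids_First_Participant_ID, ~harmonized_diagnosis,
  "PT_A", "Dx one",
  "PT_A", "Dx two",
  "PT_B", "Dx one",
  "PT_B", "Dx one",
  "PT_C", "Dx three"
)

# One row per participant: number of distinct harmonized dx values
# and the values themselves, collapsed into a single string
dx_per_participant <- toy_df %>%
  distinct(Kids_First_Participant_ID, harmonized_diagnosis) %>%
  group_by(Kids_First_Participant_ID) %>%
  summarize(
    n_dx = n(),
    dx_values = paste(sort(harmonized_diagnosis), collapse = "; "),
    .groups = "drop"
  )

# Participants with more than one distinct harmonized dx
multi_dx <- dx_per_participant %>%
  filter(n_dx > 1)

# Number of such participants
n_multi <- nrow(multi_dx)
```

In this toy table only PT_A qualifies, because the two PT_B rows share the same diagnosis and collapse to one pair after distinct().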

This number seems high to me. I wonder whether some of it stems from #735 (which is now closed), or from the HGAT and LGAT subtyping being somewhat in flux, including whether and how the two overlap. Either way, I wanted to make a note of it!

@jaclyn-taroni added the data and question labels on Jan 16, 2021
@jaclyn-taroni changed the title from "Question: Do we expect many participant IDs to be associated with more than one harmonized_dx?" to "Question: Do we expect many participant IDs to be associated with more than one harmonized_diagnosis?" on Jan 16, 2021
@jharenza
Collaborator

Hi @jaclyn-taroni - this list looks as expected to me.

Some of these discrepancies arise through subtyping: BS_EE73VE7V is missing the K28 mutation, BS_1J2WQ08M (ependymoma) is perhaps missing the YAP1 fusion, BS_HZNKSQ17 does not have a matched RNA-Seq sample for medulloblastoma subtyping, and BS_8T7DZV2F (medulloblastoma) may be discrepant with the consensus subtypes.

The rest seem to be typical: many are LGG initial tumors that progressed to HGG, and PT_00G007DM is an ETMR that progressed but did not have the C19MC alteration (interesting). PT_25Z2NX27 (7316-355, 7316-1462) may be a data input inconsistency: both diagnoses are fibromas, but the diagnosis is entered as free text, so it came out differently in the harmonized diagnosis. We can go back and check those pathology reports.

BS_N8WXTFN4 should be checked (or an eye kept out during #819) for the BRAF V600E mutation in case that was missed. I also noticed a few other "benign" specimens that were separate from the tumors (cortical tubers) for some patients.
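
As a rough way to separate the two kinds of cases described here (a diagnosis that changes between tumor events versus one that differs within the same event, e.g. through subtyping), one could count distinct tumor_descriptor values alongside distinct harmonized_diagnosis values. This is only a triage sketch on toy data: the IDs, diagnosis strings, and tumor_descriptor values below are hypothetical, and the `possible_explanation` labels are an illustrative heuristic, not a rule the project uses.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the filtered tumor rows (all values are hypothetical)
toy_df <- tribble(
  ~Kids_First_Participant_ID, ~harmonized_diagnosis, ~tumor_descriptor,
  "PT_A", "Dx one",   "Initial",
  "PT_A", "Dx two",   "Progressive",
  "PT_B", "Dx one",   "Initial",
  "PT_B", "Dx three", "Initial"
)

descriptor_summary <- toy_df %>%
  group_by(Kids_First_Participant_ID) %>%
  summarize(
    n_dx = n_distinct(harmonized_diagnosis),
    n_descriptor = n_distinct(tumor_descriptor),
    .groups = "drop"
  ) %>%
  # Keep only participants with more than one harmonized dx
  filter(n_dx > 1) %>%
  # Heuristic label: does the dx differ across tumor events or within one?
  mutate(possible_explanation = if_else(
    n_descriptor > 1,
    "differs across tumor events",
    "differs within one tumor event"
  ))
```

Rows flagged as differing within one tumor event would be the first candidates for a closer look at subtyping or data entry; the label is a starting point for manual review, not a conclusion.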

I think some of these could be interesting cases to discuss in the paper!

@jaclyn-taroni
Member Author

Great, glad I asked 😄 thanks for digging in
