RNA-seq only generic genes #76

Open
ajlee21 opened this issue Apr 13, 2021 · 1 comment

ajlee21 commented Apr 13, 2021

When we compared the gene percentiles generated by SOPHIE against those from the manually curated dataset here, we noticed a group of genes that SOPHIE identified as generic but that were not found to be generic in the manually curated dataset.
In this case, SOPHIE was trained on the recount2 (RNA-seq) compendium, while the manually curated dataset was generated on an array platform.

See #75 for details:

  • Overall, it appears that lower expression values in the template (which correspond to the RNA-seq only generic genes) are changed more than higher values.
  • I believe this is likely due to the VAE compression, possibly the ReLU activation function and/or the Gaussian constraint in the loss function.
  • Why does this compression not affect genes with higher expression values (RNA-seq/array generic genes) as much? Probably because these values are closer to the mean expression of the compendium, so the compression mostly affects genes at the tails of the distribution.
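
One way to check the first point is to bin genes by their template expression and look at the average change after simulation in each bin. The sketch below is only illustrative: `shift_by_expression_bin` is a hypothetical helper, the data are synthetic, and the "simulation" is a toy shrinkage toward an arbitrary reference level, not the VAE itself.

```python
import numpy as np

def shift_by_expression_bin(template, simulated, n_bins=4):
    """Mean (simulated - template) within quantile bins of template expression."""
    template = np.asarray(template, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    # Interior quantile edges split genes into n_bins groups of roughly equal size.
    edges = np.quantile(template, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_ids = np.digitize(template, edges)  # 0 = lowest bin, n_bins - 1 = highest
    diff = simulated - template
    return np.array([diff[bin_ids == b].mean() for b in range(n_bins)])

# Synthetic per-gene mean expression (assumption: skewed, RNA-seq-like values).
rng = np.random.default_rng(0)
template = rng.lognormal(mean=2.0, sigma=1.0, size=1000)

# Toy stand-in for compression: pull every value halfway toward a fixed,
# arbitrarily chosen reference level. This is NOT the SOPHIE VAE.
reference = np.quantile(template, 0.9)
simulated = template + 0.5 * (reference - template)

shifts = shift_by_expression_bin(template, simulated)
print(shifts)  # one mean shift per bin, lowest-expression bin first
```

With real data, `template` and `simulated` would be the per-gene mean expression in the template experiment and in the simulated experiments; a large positive value in the lowest bin would match the boost described above.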

Why is this compression not seen in the array data?

  • In the array training compendium, the expression of the RNA-seq only genes and the RNA-seq/array genes was similar.
  • Overall, the variance in array expression is lower than in RNA-seq, so there isn't as much compression needed.

So genes with low expression in the real experiment get a boost after going through the VAE (simulated experiment), which allows them to be detected as DE.

Possible solutions to consider:

  • We don't want lowly expressed genes to get artificially detected as frequently DE.
  • Vary the parameters of the activation function and the weighting of the KL term in the loss function. This will need to be addressed in the future, probably not in this manuscript.
  • Rescale RNA-seq data in some way
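
To make the KL-weighting option concrete, here is a minimal NumPy sketch of a VAE loss with an adjustable weight on the KL term. It assumes a squared-error reconstruction term and a diagonal Gaussian encoder with a standard normal prior; the function name `weighted_vae_loss` and the parameter `kl_weight` are hypothetical and are not taken from the SOPHIE code.

```python
import numpy as np

def weighted_vae_loss(x, x_recon, mu, log_var, kl_weight=1.0):
    """Reconstruction error plus kl_weight times the KL term, averaged over samples.

    x, x_recon: (n_samples, n_genes) input and reconstruction.
    mu, log_var: (n_samples, n_latent) encoder mean and log-variance.
    """
    x, x_recon = np.asarray(x, dtype=float), np.asarray(x_recon, dtype=float)
    mu, log_var = np.asarray(mu, dtype=float), np.asarray(log_var, dtype=float)
    # Assumption: squared-error reconstruction, summed over genes.
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # KL divergence between N(mu, exp(log_var)) and N(0, I), summed over latent dims.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return float(np.mean(recon + kl_weight * kl))

x = np.zeros((2, 3))
x_recon = np.ones((2, 3))
mu = np.ones((2, 2))
log_var = np.zeros((2, 2))

full = weighted_vae_loss(x, x_recon, mu, log_var, kl_weight=1.0)
half = weighted_vae_loss(x, x_recon, mu, log_var, kl_weight=0.5)
print(full, half)
```

Lowering `kl_weight` relaxes the Gaussian constraint on the latent space; whether that reduces the boost for lowly expressed genes would still need to be tested.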

ajlee21 commented Apr 26, 2021

Some analyses were performed here: https://github.com/greenelab/generic-expression-patterns/tree/master/explore_RNAseq_only_generic_genes

It looks like the VAE is artificially boosting lowly expressed genes in RNA-seq data, which allows them to be detected as DE. We think this boosting isn't seen as much in array data because array data has lower variance than RNA-seq. Further tests would be needed to examine the effect of the data type (array vs RNA-seq).
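
The rescaling idea raised in the issue could be explored with something like the sketch below, which log-transforms counts and then scales each gene to the 0-1 range. This is one possible choice, not a method confirmed in this thread; `log_minmax_scale` is a hypothetical helper and the input matrix is made up.

```python
import numpy as np

def log_minmax_scale(counts):
    """log2(x + 1) transform, then scale each gene (column) to [0, 1]."""
    counts = np.asarray(counts, dtype=float)
    logged = np.log2(counts + 1.0)
    lo = logged.min(axis=0)
    span = logged.max(axis=0) - lo
    span[span == 0] = 1.0  # constant genes map to 0 instead of dividing by zero
    return (logged - lo) / span

# Rows are samples, columns are genes (toy values).
counts = np.array([[0.0, 10.0],
                   [3.0, 10.0],
                   [15.0, 10.0]])
scaled = log_minmax_scale(counts)
print(scaled)
```

The log step narrows the gap between lowly and highly expressed genes before the data reach the VAE, which is the property one would want to compare against the array compendium.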
