Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added a column 'asm_accession' to the sequence_collection_l1 table #78

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

waterflow80
Copy link
Collaborator

@waterflow80 waterflow80 commented Feb 5, 2024

Description

Currently, the API doesn't store the assembly accession value that corresponds to each inserted seqcol object. Which is an important meta data that can be useful.
Fixes #13

Solution design

We propose to add a new attribute asm_accession in the SeqColLevelOneEntity class (and hence a new column in the sequence_collections_l1 table).

So the resulting table will be the following (we ignored the display of the jsonLevelOneObject):
Screenshot from 2024-02-05 17-27-08
And of course, the asm_accession will not be returned in level 1 nor in level 2. (we may have a further discussion about this)

Now the API will check for the existence of an asm_accession before proceeding with the ingestion process, this will help avoid all the extra work of downloading data, constructing the objects and then checking.

However, we should consider checking the case where not all seqcol objects that correspond to a specific asm_accession are inserted in the database, in that case we should insert them. (I think this case is very much less likely to happen)

Further discussion

I think we can discuss the naming of asm_accession, because it might make sense to call it insdc_accession ?

@tcezard
Copy link
Member

tcezard commented Feb 6, 2024

I have not looked a the code yet but I have a few comments on the solution you chose.

  • Adding a new columns means that we will essentially force every sequence collection to have an insdc_accession or leave the column null. We might want in the future to add sequence collection that do not have an INSDC accession. We might also want in the future to ingest a sequence collection with an existing INSDC accession and a different naming convention.
  • The issue of restarting to complete a partial ingestion should be dealt with a transaction. This is something @nitin-ebi has dealt with in EVA-3494 Ingest assembly in batches contig-alias#125. we should discuss it in our next call.

@waterflow80
Copy link
Collaborator Author

I have not looked a the code yet but I have a few comments on the solution you chose.

* Adding a new columns means that we will essentially force every sequence collection to have an `insdc_accession` or leave the column `null`. We might want in the future to add sequence collection that do not have an INSDC accession. We might also want in the future to ingest a sequence collection with an existing INSDC accession and a different naming convention.

* The issue of restarting to complete a partial ingestion should be dealt with a transaction. This is something @nitin-ebi has dealt with in [EVA-3494 Ingest assembly in batches contig-alias#125](https://github.com/EBIvariation/contig-alias/pull/125). we should discuss it in our next call.

That makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding a 'database' for assembly accessions to map saved seqCol objects
2 participants