Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All aditional sequences removed in seed-to-alignment step #36

Open
lbleicher opened this issue Mar 20, 2023 · 8 comments
Open

All aditional sequences removed in seed-to-alignment step #36

lbleicher opened this issue Mar 20, 2023 · 8 comments

Comments

@lbleicher
Copy link

Hi, I've tried to run seed-to-alignment in another computer and this time it ran to the finish, but all sequences were removed (they get flagged Keep=false) and the final file (06_alignment.fasta) only has what was originally in my list of sequences. Am I doing something wrong with my input file? It has five sequences, two from Mus Musculus (which has the two paralogs), one from Homo sapiens (which lost one of them), one from Danio Rerio and one from Saccoglossus kowalevskii (as the outgroup). I'm sending the input file.

mq_lb_test.csv

@lbleicher
Copy link
Author

So, I've been trying different configurations for the seed file and perhaps I am not building it properly.

From the example (LY86/LY96) it seems I shouldn't need to add a sequence as an outgroup, so in one of our attempts we made a similar file - if I want to reconstruct ancestrals for a pair of paralogs derived from a duplication that happened during early chordate evolution, I should get a basal chordate species that already has the two paralogs and a more derived one which also has both, and that will define the scope, is that correct?

In my case, one of the paralogs is missing in humans. Does that mean I should not include the other human sequence or is it ok as long as if I identify it?

Also, looking at the manual I noticed that even though aliases would be optional they should help the program identify which are the paralogs. So I built a new file with only species containing the two paralogs, and to each of them I added all aliases for each of them on Uniprot (also including gene names). This time I do get a final alignment that has more stuff than what was already on the seed, but it seems they only managed to retrieve sequences from one of the two paralogs, but not the other.

Here's what I have in my seed file:

species,name,aliases,sequence,accession
Amblyraja radiata,TTR,TTR;ATT;Transthyretin;ATTR;Prealbumin;TBPA;PALB,MFRALLLVALALSASCTSVPHHHGDHDTRCPILIKVLDALKGTPAANVQVEMLKRNEDKTWQSINTGVTGRNGELHNLTSEEHLAVGLYKLHFETGAYWSNAGQAHFHECANVAFRVASLTSDHYTIAVLLSPYSYSTTGIVMNPHET,XP_032875375.1
Amblyraja radiata,HIU,HIUase;UraH;5-hydroxyisourate hydrolase;HIU hydrolase;HIUHase;Transthyretin-related protein,MTAPRLQHIQDHLVPRKCNSWAVSMSLSPLTTHALNTAEGIPAARLPVSVHRLDNEEWKEMAKGVTNSDGRCPGLLTSQTFIPGIYKIKFATDKYWQSLERASFYPYIEVVFNISDASQKYHIALLLNPFSYTTYRGS,XP_032891577.1
Rhincodon typus,TTR,TTR;ATT;Transthyretin;ATTR;Prealbumin;TBPA;PALB,MRFTLILIVCLAAFYKGETVDHGGSDTKCPINIKVLNAVNGTPAANLRLQIYRQNADLSWQQLNTGITAINGEAHNLTTEADFIPGMYKVHFNTADYWRNLGYTPFHECVNVLWFSDPFPWVLTSSQEAPSDWFTNCEVSEFLVVSNS,XP_020365676.2
Rhincodon typus,HIU,HIUase;UraH;5-hydroxyisourate hydrolase;HIU hydrolase;HIUHase;Transthyretin-related protein,MNSLEIARTAEAGVRVDTVWSWRKTTGYVGLDQGMDIDRAVITKKANTLDLNDNLKAATKLKTGNSLHRFSAVTYSNSPDHKMSQSPLTTHILNTAQGIPAAHVAVSVVRLEDKAWREIATGITNSDGRCPGLLTSETFVPGIYKMKFAVDEYWQSLKMASFYPYIEVVFNISDACQKYHIALLLSPFSYSTYRGS,XP_048461878.1
Protopterus annectens,TTR,TTR;ATT;Transthyretin;ATTR;Prealbumin;TBPA;PALB,MAKALYLLFLAGVVLLSDGAPVDHGTAEKSCPLEVKILDAIRGIPAAGMEVHVYKKAADDSWSEVATGKTTRDGEIHNLLNDENFLEGEYKVTFGTKSFWEGLNLNPFHLDHDVTFKAHTGGHQHYTIAVLLSPYSITTTAVVSSVQT,XP_043921383.1
Protopterus annectens,HIU,HIUase;UraH;5-hydroxyisourate hydrolase;HIU hydrolase;HIUHase;Transthyretin-related protein,MSSLRLSHIHHHIASDKNSQRMADGTSSRLTTHVLNTALGIPASNLALTVSRLEVVGNKEHWEQLTGRCTNSDGRCPGLLTNDHFIDATYKVRFETGDYWKEQGITSFYPYVEVVFTIKDPSQKYHIPLLLSPFSYTTYRGS,XP_043937680.1
Mus musculus,HIU,HIUase;UraH;5-hydroxyisourate hydrolase;HIU hydrolase;HIUHase;Transthyretin-related protein,MATESSPLTTHVLDTASGLPAQGLCLRLSRLEAPCQQWMELRTSYTNLDGRCPGLLTPSQIKPGTYKLFFDTERYWKERGQESFYPYVEVVFTITKETQKFHVPLLLSPWSYTTYRGS,Q9CRB3
Mus musculus,TTR,TTR;ATT;Transthyretin;ATTR;Prealbumin;TBPA;PALB,GPAGAGESKCPLMVKVLDAVRGSPAVDVAVKVFKKTSEGSWEPFASGKTAESGELHGLTTDEKFVEGVYRVELDTKSYWKTLGISPFHEFADVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN,P07309

@lbleicher
Copy link
Author

Update: it seems the issue was caused by the fact one of the aliases for HIUase is part of one of the alias for TTRs (HIUases were also known as Transthyretin-related proteins), and that would cause it to drop it. Removing that alias seems to have solved the problem, I now got an alignment with sequences for both paralogs.

@Kona-O
Copy link
Contributor

Kona-O commented Mar 22, 2023

Hi! I just wanted to let you know that Mike is out of town until tomorrow. I'm glad you were able to figure out where things were going wrong in your seed file. I hope the rest of the steps go more smoothly. This is really helpful feedback for us to describe what to do and what not to do when constructing your seed dataset, as well as what kinds of checks topiary could potentially do to help the user from having too similar of aliases. Thank you!

@lbleicher
Copy link
Author

Thanks, Kona, after fixing that detail it seems to be working fine with various input changes.

As for defining scope, am I right in not including an outgroup in the initial dataset?

@Kona-O
Copy link
Contributor

Kona-O commented Mar 22, 2023

That depends on what ancestors you are interested in. If you want to reconstruct the deepest ancestor of the family encompassing both HIUases and TTRs, then I think it would be best to include an outgroup species (with sequence and aliases) with the "key_species" column set to True. Otherwise, topiary might have a hard time finding and deciding to keep this kind of outgroup sequence. If you are reconstructing the deepest ancestor of either HIUases or TTRs then your current seed setup would be fine.

Also, to make things move faster, you can set the "key_species" column to True for only the most evolutionary divergent species for each paralog (example: both paralogs from Amblyraja radiata, both paralogs from Mus musculus, and the outgroup species sequence). It is ideal to choose species with well-annotated genomes, but that's not always an option. Then set all other special sequences you want to include in your project as "key_species" = False.
The sequences with "key_species" = True will be used in the initial BLAST step, which by default searches through sequences 2 taxonomic ranks deeper than the rank set by your key_species. This is in hopes of finding deeper ancestors if they are out there. Then topiary will download and BLAST all of the resulting initial sequence hits against the full proteome available for each of the key_species. The ones set to false will not be used for these steps to minimize workload but will be maintained throughout your project.

@harmsm
Copy link
Contributor

harmsm commented Mar 23, 2023

@lbleicher : Glad you got it to work! Did topiary give you any warning about the alias overlap? In many cases, it will do so; I'd like it to do so all of the time if possible.

@lbleicher
Copy link
Author

It did! I probably missed it before because it came early in the output - perhaps it would be interesting to write the warnings as the last output text? Perhaps I had the same issue in other attempts but didn't notice it.

@harmsm
Copy link
Contributor

harmsm commented Mar 24, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants