Confused about + and - stranded nodes in GFA #25
Comments
Indeed, for each node in the graph, we can choose either its positive or its negative strand as the "reference" representation in the GFA file. In the example you provided, it would be better to have all of them be positive for the sake of readability. However, the algorithm sometimes chooses the negative one for internal reasons. For example, it is sometimes handy to select as the reference the strand that is lexicographically smaller or has the smaller hash value: this helps keep track of the different copies of the node in different directions. Hope this helps.
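To illustrate the "lexicographically smaller strand" idea, here is a minimal Python sketch. It is not TwoPaCo's actual code; the function names `reverse_complement` and `canonical` are made up for this example, and it assumes plain A/C/G/T sequences.

```python
# Hypothetical helper names; this only illustrates the idea, not TwoPaCo internals.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Pick the lexicographically smaller of seq and its reverse complement.

    Returns (stored_sequence, sign), where sign is "+" if seq itself is the
    stored strand and "-" if seq is the reverse complement of the stored one.
    """
    rc = reverse_complement(seq)
    if seq <= rc:
        return seq, "+"
    return rc, "-"


# Both strands of the same node map to one stored sequence,
# and the sign records which direction was actually observed.
print(canonical("CAAA"))
print(canonical("TTTG"))
```

With a rule like this, a node seen as `TTTG` in the input is stored as `CAAA` and referenced with `-`, which is why a path can consist entirely of `-` nodes.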
Similar to Sibelia(Z), we want to compare the similarity of nodes: we want to extend on both sides of a target node and compare the sequences of the context. We know about the C++ API around the twopaco graph; however, we are not familiar with C++, hence we looked for a
Honestly, I think the format best suited to your purpose is the junction list format. SibeliaZ basically works using some form of it, and it is the simplest possible (at least from my point of view) representation of the graph.
Hi @iminkin, I took a close look at this; however, I'm pretty confused about how exactly we could use it to find shared regions between multiple genomes. Let's take a really simple example:
When aligned:
Then I created a graph using:
When we get the junction paths:
We see they differ at indices 8 through 11, which is the 3-mer difference. So can we then say that as long as two sequences have the same array of junctions, they are the same sequence (like in the bold above)?
Yes, the whole point of the compacted graph is to compress identical substrings of length at least k into integers. Those integers are much easier to index and compare: if you want to see if two substrings are equal, just check if they consist of the same junctions.
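As a rough sketch of that comparison in Python: the junction IDs below are made up for illustration (they are not taken from real TwoPaCo output), and the helper names `differing_indices` and `same_junctions` are hypothetical. It assumes each sequence has already been turned into a list of junction IDs in genome order.

```python
def differing_indices(a, b):
    """Indices at which two equal-length junction-ID arrays disagree."""
    if len(a) != len(b):
        raise ValueError("this simple sketch only compares equal-length arrays")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def same_junctions(a, b, start, end):
    """True if both arrays contain the same junction IDs in [start, end)."""
    return a[start:end] == b[start:end]


# Made-up junction arrays for two sequences that differ in one short region.
seq1 = [1, 2, 3, 4, 5, 6, 7, 8, 20, 21, 22, 23, 12, 13]
seq2 = [1, 2, 3, 4, 5, 6, 7, 8, 30, 31, 32, 33, 12, 13]

print(differing_indices(seq1, seq2))     # positions where the junctions differ
print(same_junctions(seq1, seq2, 0, 8))  # shared region before the difference
```

Comparing small integers this way avoids comparing the underlying nucleotide strings directly; arrays of unequal length (for example after an insertion) would need a proper alignment of the junction arrays instead.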
Let's use this very simple FASTA:
Then we construct the graph:
./twopaco -k 15 -f 16 test.fa -o graph
and convert it to GFA:
graphdump -k 15 -f gfa2 -s test.fa graph > graph.gfa
When we look at the paths we have:
We can only reconstruct the sequence from the GFA by taking the reverse complement of `-` nodes. When we look at the paths, all nodes are on the same strand (i.e. all `-` or all `+`); for example, all `24` nodes are `-`. So why weren't these just all recorded as `+`?
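For reference, the reconstruction described above can be sketched in Python as follows. This is only an illustration: the function names, the segment IDs and sequences, and the `overlap` parameter are assumptions made for the example, and the correct overlap between consecutive segments depends on how the GFA file was written.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def spell_path(segments, path, overlap=0):
    """Spell the sequence of a path through oriented segments.

    segments: dict mapping segment ID -> sequence as stored in the GFA
    path:     list of (segment ID, sign) pairs, sign being "+" or "-"
    overlap:  number of bases shared by consecutive segments (assumed constant)
    """
    pieces = []
    for i, (seg_id, sign) in enumerate(path):
        seq = segments[seg_id]
        if sign == "-":
            # A "-" node is read as the reverse complement of the stored strand.
            seq = reverse_complement(seq)
        # Skip the bases already contributed by the previous segment.
        pieces.append(seq if i == 0 else seq[overlap:])
    return "".join(pieces)


# Made-up segments: node "2" is stored on the opposite strand.
segments = {"1": "ACGT", "2": "AAAC"}
path = [("1", "+"), ("2", "-")]
print(spell_path(segments, path, overlap=2))
```

Whether a node is stored as `+` or `-` does not change the spelled sequence, as long as the sign in the path is honoured when reading the segment.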