Ways to visualize diploid tree sequences that aren't sequences of trees #2068
Replies: 30 comments 14 replies
-
Excerpted from an email I sent to Peter just now: Well, yes, the notes do already state that the first generation is retained but is not part of the sample, and that that is necessary for recapitation. But I guess I had not really thought about / realized that first-generation individuals that are not ancestral to the sample would (a) also be present (of course), and (b) would still be there even after recapitation, but without any simulated histories (again, of course, but I never thought about it), and so (c) would need to be excluded or otherwise dealt with during subsequent analysis to avoid getting confused. That’s not an issue that I remember ever coming up, but I can see it being surprising/confusing for people. It’s one of those things where, if the tree sequences were visualized in some way, it would be immediately obvious, but when just working with them as a black-box object in Python it is easy not to realize or to think through the implications. I know tskit currently provides some visualization tools that show a given tree graphically. I think I’m thinking of something a little different. I’d maybe like to see a visualization tool that is more just about the times and relationships of individuals, nodes, and samples, without worrying about the particular tree structures that connect them through time. This representation would let me see things like:
I’m imagining a depiction that is a sort of summary of the entire tree sequence, rather than a picture of a specific tree. The y-axis would be time, with the more distant past at the top I suppose (that’s what would be intuitive to me, anyway). The x-axis is meaningless and used only for visual separation of elements; ideally x positions would be chosen to minimize the number of line crossings in the drawn graph, but that’s a frill. Each node would be shown as a small dot, and each individual would be a (gray) circle that encloses two dots (for diploids); nodes that do not have a recorded individual associated with them would be shown as standalone dots not enclosed by a “individual circle”. Lines would connect a later-time dot to an earlier-time dot if any portion of the genome was inherited from that earlier-time dot (without passing through an intervening dot), so it would be more of a directed graph than a tree, and would represent an overlay of all of the tree sequences onto a single diagram. Samples would be shown as red dots, while non-samples ancestral to the sample would be black dots, and nodes not ancestral to the sample would be white dots. I’ve attached a sort of mock-up of how I imagine a model like Jean’s (a SLiM haploid clonal model – i.e., diploids where every second genome is empty and has no ancestral connections – with occasional horizontal gene transfer) would look before and after recapitation; I don’t know that it makes any actual sense as a diagram of a real tree sequence, but it illustrates what I mean, I hope. Possibly the lines of the graph could be colored to show something useful, too – perhaps the weighting of that line across all of the tree sequences (how common of a path of inheritance it is for the sample), or perhaps how many mutations occurred along all of the tree-sequences branches represented by it, or something of that sort (perhaps things like this could be options chosen by the user). With a visualization tool like this, I could write a python script that would read in a .trees file from SLiM, recapitate, change the sample in some way, simplify, overlay mutations, etc., and I could call the visualization function at each step of the process to confirm that things look the way I intend them to look. I think this would help me immensely, since I am a very visual thinker, and I bet many users would love it. (Addendum from a later email:) Expanding upon this idea a bit more: first-generation individuals/nodes should also use a distinctive color, and remembered individuals should be designated in some orthogonal fashion so they are also visible (bigger dots?). :-> I have attached illustrations here: |
Beta Was this translation helpful? Give feedback.
-
@jeromekelleher and @molpopgen, comments would be welcome. I'm quite excited about this idea. @petrelharp wrote back with a caveat:
I agree that for typical "real" models with thousands or millions of individuals these diagrams would be very large and might not be terribly useful (although I'm not sure they would never be useful; I'd like to see what the diagram for, say, a 10-population stepping-stone model with a soft selective sweep would look like :->). But I think people could still find them very useful for developing their Python tree-seq workflow, by simulating with a downscaled model so the diagrams aren't too big, writing out their Python steps and visualizing them to make sure everything in that pipeline is working as intended, and then upscaling their model to production size (and commenting out the visualizations). I also think it would be very useful for generating pictures to use in the documentation, for explaining concepts like the retention of first-generation ancestors, simplification, and recapitation. |
Beta Was this translation helpful? Give feedback.
-
Well, there is import msprime
ts = msprime.simulate(5, recombination_rate=0.3, random_seed=2)
print(ts.draw_text())
ts.draw_svg("tmp.svg") This is the text (might get mangled by some browsers):
and SVG: I guess this isn't what you're looking for though? |
Beta Was this translation helpful? Give feedback.
-
Right; what I'm looking for is a diagram that combines/summarizes all of those tree diagrams into a single diagram. As Peter wrote, more like a pedigree, except that (a) only the nodes present in the tree sequence are shown, so it is not a full pedigree in the genealogical sense, and (b) lines are shown between every pair of nodes that are connected by a branch in any tree in the tree sequence, not just between parents/offspring. And the other big thing I'm looking for is color-coding and other visualization of things like which nodes are samples, which nodes are first-generation ancestors, which nodes are remembered individuals, which pairs of nodes are grouped together to form diploid individuals in the individuals table, etc., which is not visible in those existing visualizations. So this is a much higher-level view of what is going on than is provided by |
Beta Was this translation helpful? Give feedback.
-
I don't have a concrete technical suggestion, but this seems like something that a Python package capable of drawing a network might handle? |
Beta Was this translation helpful? Give feedback.
-
Yes; if there were a package that handled choosing x-positions so as to minimize line-crossing that would be particularly useful. The rest of it is probably pretty easy to do oneself, really. |
Beta Was this translation helpful? Give feedback.
-
Python (and R) have a seemingly limitless supply of options. I've been (casually) interested in this same issue, so I'll be looking for something soon. |
Beta Was this translation helpful? Give feedback.
-
This gets pretty close, but isn't perfect depending on how picky one is, but it took some effort to figure out. It needs a lot of the graphviz stack on your system, including the C runtime library, as import msprime
import pandas as pd
import networkx
from collections import namedtuple
Datumz = namedtuple('Datumz', ['parent', 'child'])
ts = msprime.simulate(5, recombination_rate=6, random_seed=666)
D = []
for t in ts.trees():
for n in t.nodes():
for c in t.get_children(n):
D.append(Datumz(n, c))
df = pd.DataFrame(D, columns=Datumz._fields)
df['time'] = ts.tables.nodes.time[df.parent]
G = networkx.from_pandas_edgelist(
df, 'parent', 'child', create_using=networkx.DiGraph())
A = networkx.nx_agraph.to_agraph(G)
times = df.groupby(['time'])
all_samples = [i for i in ts.samples()]
for n, g in times:
up = [i for i in g.parent.unique() if i not in all_samples]
A.add_subgraph(up, rank='same')
A.add_subgraph(all_samples, rank='same')
A.draw('A.png', prog='dot') |
Beta Was this translation helpful? Give feedback.
-
To extend this to remembered individuals, etc., you need to manipulate the horizontal groupings, too, add color, etc.. |
Beta Was this translation helpful? Give feedback.
-
This will change the border color and shape of nodes 0 and 1. The UI semantics are the same as the low-level program for i in [0, 1]:
n = A.get_node(i)
n.attr['color'] = 'red'
n.attr['shape'] = 'box' |
Beta Was this translation helpful? Give feedback.
-
Seems likely that everything can be done entirely in I used |
Beta Was this translation helpful? Give feedback.
-
This version respects the temporal order of the root nodes on each tree. import msprime
import pandas as pd
import networkx
from collections import namedtuple
Datumz = namedtuple('Datumz', ['parent', 'child'])
ts = msprime.simulate(5, recombination_rate=6, random_seed=666)
D = []
for t in ts.trees():
for n in t.nodes():
for c in t.get_children(n):
D.append(Datumz(n, c))
df = pd.DataFrame(D, columns=Datumz._fields)
df['time'] = ts.tables.nodes.time[df.parent]
df.sort_values(['time'], inplace=True, ascending=False)
G = networkx.from_pandas_edgelist(
df, 'parent', 'child', create_using=networkx.DiGraph())
A = networkx.nx_agraph.to_agraph(G)
# Make a subgraph of roots, where our A->B edges
# are A is older than B.
roots = []
for t in ts.trees():
for r in t.roots:
if r not in roots:
roots.append(r)
roots = sorted(roots, key=lambda x: -ts.tables.nodes.time[x])
pg = A.add_subgraph([], level="same", name="roots")
for i in range(1, len(roots)):
pg.add_edge(roots[i-1], roots[i], style="invis")
A.add_subgraph([i for i in ts.samples()], rank='same', name="samples")
A.draw('A.png', prog='dot') |
Beta Was this translation helpful? Give feedback.
-
wow, love it! |
Beta Was this translation helpful? Give feedback.
-
With the root order preserved, at this point it should be relatively straightforward to create subgraphs representing actual samples taken at |
Beta Was this translation helpful? Give feedback.
-
Ping everyone here - please have a look at my pull request with figures. Let me know if you can't build the docs locally (if you have sphinx set up, you just need to run |
Beta Was this translation helpful? Give feedback.
-
I've got lots of feedback from @bhaller - thanks! This is addressed by tskit-dev/pyslim#70; closing this now. |
Beta Was this translation helpful? Give feedback.
-
This should not be closed, I don't think! For me, this thread has been about making some kind of visualization tool that can display the user's tree sequence in a particular manner. The utility of that for the doc was just one angle; as I wrote originally, "I’d maybe like to see a visualization tool that is more just about the times and relationships of individuals, nodes, and samples, without worrying about the particular tree structures that connect them through time." The docs have had pretty pictures added now (and other improvements), which is great, but the need for this visualization tool remains. |
Beta Was this translation helpful? Give feedback.
-
GitHub doesn't seem to offer me the option of reopening this issue. I could just open a new issue and copy/paste my original comment over, but there has been lots of further good discussion and work from @molpopgen and others, so it would be better if we could leave this issue open (and perhaps change its title)...? |
Beta Was this translation helpful? Give feedback.
-
Sorry! Re-opened! The issue should be renamed, though, if it's still open, and we should get some specific proposals for what to implement... |
Beta Was this translation helpful? Give feedback.
-
@hyanwong I gather you've been doing work on tree sequence visualization lately, so maybe this issue will interest you! :-> |
Beta Was this translation helpful? Give feedback.
-
@bhaller suggests moving this to tskit; I'm going to change the title somewhat and move it. |
Beta Was this translation helpful? Give feedback.
-
@bhaller Yes, it does. Recently I've been entirely working on trees (and animating SVGs) but I completely agree with you about the need for a pedigree-style viz. I tried this a few years ago with some of the dynamic graph viz tools, in an attempt to get something exactly like the examples you posted above, but never found the right graphic toolkit, which allow you to specify a Y position for each node but then works out the best X positioning so as to avoid line crossing. It's not clear to me that @molpopgen 's graphviz example allows continuous Y positions - merely a pre-defined stacking order. This was the barrier I hit too. It seemed strange to me that no-one had come up with a decent heuristic algorithm for the case of a DAG with nodes that have user-specified-floating-point-y, algorithmically-chosen-x. I think we should probably ask some network-viz people? I don't know any off-hand, though. The problem seems well characterised to me. |
Beta Was this translation helpful? Give feedback.
-
I don't know any such people either. I'd be tempted to find a semi-optimal solution through some process like simulated annealing, but it might be a solved problem, which would obviously be better! |
Beta Was this translation helpful? Give feedback.
-
Right, there are only "levels". As with all things graphviz, I find the first two steps tricky, the third very hard, and the next impossible. |
Beta Was this translation helpful? Give feedback.
-
Also see tskit-dev/msprime#1851 |
Beta Was this translation helpful? Give feedback.
-
By putting a few other pieces I've been working on recently, I think we can use the graphviz "dot" algorithm, called within networkx, to get quite a nice ARG visualisation, and it looks much nice with recombination nodes. Here I plot recombination nodes in red, and use the functions import msprime
import networkx as nx
import numpy as np
import tskit
ts = msprime.simulate(10, record_full_arg=True, random_seed=1, recombination_rate=1.6)
ts_arg = convert_to_single_rec_node(ts)
# I think we are assuming that the nodes are listed in time order in this script
G = to_networkx_graph(ts_arg)
import collections
nodes_at_time = collections.defaultdict(list)
colour_map = []
for nd in G.nodes(data=True):
nodes_at_time[nd[1]["time"]].append(nd[0])
colour_map.append("red" if nd[1]["flags"] & msprime.NODE_IS_RECOMB else "blue")
# Turn into graphviz
A = nx.nx_agraph.to_agraph(G)
# First cluster all nodes at the same times (probably mostly samples)
for t, nodes in nodes_at_time.items():
if len(nodes) > 1:
A.add_subgraph(nodes, level="same", name=f"cluster_t{t}")
# We could also cluster nodes from a single individual together here
A.layout(prog="dot")
pos = {n: [float(x) for x in A.get_node(n).attr["pos"].split(",")] for n in G.nodes()}
nx.draw_networkx(G, pos, node_color=colour_map, with_labels=True) That plot uses the rank time for the Y axis. We can then use the graphviz -> networkx code in tskit-dev/msprime#1851 (comment) to change the Y positions:
It's annoying that pushing the nodes to the correct times makes them overlap. |
Beta Was this translation helpful? Give feedback.
-
For posterity (i.e. so I don't lose it), here's the code and an example from some of the stuff I've been posting on slack: import itertools
from IPython.display import display, SVG
import msprime
import networkx as nx
import numpy as np
import pandas as pd
import tskit
msprime.NODE_IS_RECOMB = 2 # Use the "reserved" space in the bitfield, as we will eventually port this to tskit
msprime.NODE_IS_SOMETIMES_UNARY = int(2**24)
def convert_to_single_rec_node(ts):
"""
Should be unnecessary when https://github.com/tskit-dev/msprime/issues/1942 is fixed
"""
tables = ts.dump_tables()
tables.nodes.clear()
node_mapping = np.zeros(ts.num_nodes, dtype=tables.edges.child.dtype)
node_id = 0
while node_id < ts.num_nodes:
node_mapping[node_id] = tables.nodes.num_rows
node = ts.node(node_id)
if (node.flags & msprime.NODE_IS_RE_EVENT):
node_id += 1
assert node_id < ts.num_nodes
assert ts.node(node_id).flags & msprime.NODE_IS_RE_EVENT
assert ts.node(node_id).time == node.time
node_mapping[node_id] = tables.nodes.num_rows
node.flags &= ~msprime.NODE_IS_RE_EVENT # Unset the NODE_IS_RE_EVENT bit
node.flags |= msprime.NODE_IS_RECOMB
tables.nodes.append(node)
node_id += 1
tables.edges.parent = node_mapping[tables.edges.parent]
tables.edges.child = node_mapping[tables.edges.child]
tables.mutations.node = node_mapping[tables.mutations.node]
tables.sort()
return tables.tree_sequence()
def flag_unary_nodes(ts):
"""
Return a tree sequence in which any nodes which are sometimes unary nodes in the TS (i.e.
there is a tree in which they node has only one child) is flagged with msprime.NODE_IS_SOMETIMES_UNARY
"""
# This is very inefficient - find a proper incremental appraoch
n = []
tables = ts.dump_tables()
for tree in ts.trees():
for u in tree.nodes():
if tree.num_children(u) == 1:
n.append(u)
flags = tables.nodes.flags
flags &= np.full_like(flags, ~msprime.NODE_IS_SOMETIMES_UNARY) # unset this bit
flags[n] |= msprime.NODE_IS_SOMETIMES_UNARY
tables.nodes.flags = flags
return tables.tree_sequence()
def to_networkx_graph(ts):
edges=ts.tables.edges
G = nx.from_pandas_edgelist(
pd.DataFrame({'source': edges.parent, 'target': edges.child, 'left': edges.left, 'right': edges.right}),
edge_attr=True,
create_using=nx.MultiDiGraph
)
nx.set_node_attributes(G, {n.id: {'flags':n.flags, 'time': n.time, 'labels': "foo"} for n in ts.nodes()})
return G
def add_individuals_to_coalescence_nodes(ts):
tables = ts.dump_tables()
nodes_individual = tables.nodes.individual
for node in ts.nodes():
if (node.flags & msprime.NODE_IS_RECOMB) or (node.flags & msprime.NODE_IS_CA_EVENT):
continue
if node.individual == tskit.NULL:
nodes_individual[node.id] = tables.individuals.add_row()
tables.nodes.individual = nodes_individual
return tables.tree_sequence()
def draw_with_curved_multi_edges(G, pos, colour_map, curve_scale=1):
"""
networkx matlib plots show all edges as straight lines by default.
This function curves the edges if there is more than one edge with identical parent/child values
At the moment curve_scale is a hack to adjust the fact that we don't know the measurement scale on
the x or the y axis.
"""
nx.draw_networkx_nodes(G, pos, node_color=colour_map)
nx.draw_networkx_labels(G, pos)
for (start, end), edge_iter in itertools.groupby(G.edges()):
# identical node/neighbour pairs placed together according to docs
edges = list(edge_iter)
dist = ((pos[start][0] - pos[end][0]) ** 2 + (pos[start][1] - pos[end][1]) ** 2) ** 0.5
# FIXME - dist won't be accurate if X and Y axes are on different scales
curvature = curve_scale/dist
# FIXME - calculate curve_scale from the plot dimensions, rather than requiring it
max_edge_index = len(edges) - 1
for i, e in enumerate(edges):
curve_prop = (2 * i / max_edge_index - 1) if max_edge_index > 0 else 0
nx.draw_networkx_edges(
G,
pos,
edgelist=[e],
connectionstyle=f'arc3, rad = {curve_prop * curvature}')
# SIMULATE
ts = msprime.sim_ancestry(8, sequence_length=1e4, population_size=1e4, record_full_arg=True, random_seed=12, recombination_rate=1e-8)
ts_arg = convert_to_single_rec_node(ts)
ts_arg = add_individuals_to_coalescence_nodes(ts_arg) # optional - can help if simplifying away rRE and CA nodes
ts_arg = flag_unary_nodes(ts_arg) # So we can plot nodes which are partially coalescent in blue-green
# PLOT WITH RANK TIMES
import collections
import itertools
import matplotlib.pyplot as plt
G = to_networkx_graph(ts_arg)
nodes_at_time = collections.defaultdict(list)
colour_map = []
for nd in G.nodes(data=True):
nodes_at_time[nd[1]["time"]].append(nd[0])
colour = "lightgreen"
if nd[1]["flags"] & msprime.NODE_IS_RECOMB:
colour = "red"
elif nd[1]["flags"] & msprime.NODE_IS_CA_EVENT:
colour = "cyan"
elif nd[1]["flags"] & msprime.NODE_IS_SOMETIMES_UNARY:
colour = "lightseagreen"
colour_map.append(colour)
# Turn into graphviz
A = nx.nx_agraph.to_agraph(G)
# First cluster all nodes at the same times (probably mostly samples)
for t, nodes in nodes_at_time.items():
if len(nodes) > 1:
A.add_subgraph(nodes, level="same", name=f"cluster_t{t}")
# We could also cluster nodes from a single individual together here
# Get the positions from graphviz
A.layout(prog="dot")
pos = {n: [float(x) for x in A.get_node(n).attr["pos"].split(",")] for n in G.nodes()}
fig = plt.figure(1, figsize=(10, 15))
draw_with_curved_multi_edges(G, pos, colour_map, 20)
plt.show()
|
Beta Was this translation helpful? Give feedback.
-
(I just converted this to a discussion, since we want to keep it open indefinitely) |
Beta Was this translation helpful? Give feedback.
-
Since this discussion is still open, I thought that I'd add another visualization to the mix. These figures are made with D3.js force layout and are interactive within a Jupyter Notebook or a web browser. I remade @hyanwong's ARG (above) using two line types: "ortho"(1) and "line"(2). As an optional add-on, the line of rectangles at the bottom of each figure corresponds with the different trees and their respective breakpoints. You can hover over each rectangle and it will highlight the tree within the ARG(3). The node positioning from the force layout method does not inherently limit edge crossover, but because the figure is interactive, you can drag the nodes around to better positions. Depending on the application, that may or may not be an issue. I'm still working on this method to make it more usable across a wide range of tree sequences, such as the "ortho" line pathing for polytomies needs to be improved. If there's a feature that you think would be particularly useful to add, let me know. I'd be interested in incorporating this into tskit (or tsviz), if others find it helpful. |
Beta Was this translation helpful? Give feedback.
-
Also see #2970 for the genetic genealogy embedded in the pedigree. |
Beta Was this translation helpful? Give feedback.
-
In particular, we need pictures of the "first-generation" individuals that trip people up.
Edit: see below; this issue has evolved from the initial intention.
Beta Was this translation helpful? Give feedback.
All reactions