-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spatial equivalent of GNN #7
Comments
Hi @hyanwong, finally I found the time to give your code a go! In To begin with, I used the slendr tutorial library(slendr)
library(dplyr)
# Define world and regions
map <- world(
xrange = c(-15, 60), # min-max longitude
yrange = c(20, 65), # min-max latitude
crs = "EPSG:3035" # coordinate reference system (CRS) for West Eurasia
)
africa <- region(
"Africa", map,
polygon = list(c(-18, 20), c(40, 20), c(30, 33),
c(20, 32), c(10, 35), c(-8, 35))
)
europe <- region(
"Europe", map,
polygon = list(
c(-8, 35), c(-5, 36), c(10, 38), c(20, 35), c(25, 35),
c(33, 45), c(20, 58), c(-5, 60), c(-15, 50)
)
)
anatolia <- region(
"Anatolia", map,
polygon = list(c(28, 35), c(40, 35), c(42, 40),
c(30, 43), c(27, 40), c(25, 38))
)
# Define populations
afr <- population( # African ancestral population
"AFR", parent = "ancestor", time = 52000, N = 3000,
map = map, polygon = africa
)
ooa <- population( # population of the first migrants out of Africa
"OOA", parent = afr, time = 51000, N = 500, remove = 25000,
center = c(33, 30), radius = 400e3
) %>%
move(
trajectory = list(c(40, 30), c(50, 30), c(60, 40)),
start = 50000, end = 40000, snapshots = 20
)
ehg <- population( # Eastern hunter-gatherers
"EHG", parent = ooa, time = 28000, N = 1000, remove = 6000,
polygon = list(
c(26, 55), c(38, 53), c(48, 53), c(60, 53),
c(60, 60), c(48, 63), c(38, 63), c(26, 60))
)
eur <- population( # European population
name = "EUR", parent = ehg, time = 25000, N = 2000,
polygon = europe
)
ana <- population( # Anatolian farmers
name = "ANA", time = 28000, N = 3000, parent = ooa, remove = 4000,
center = c(34, 38), radius = 500e3, polygon = anatolia
) %>%
expand_range( # expand the range by 2.500 km
by = 2500e3, start = 10000, end = 7000,
polygon = join(europe, anatolia), snapshots = 20
)
yam <- population( # Yamnaya steppe population
name = "YAM", time = 7000, N = 500, parent = ehg, remove = 2500,
polygon = list(c(26, 50), c(38, 49), c(48, 50),
c(48, 56), c(38, 59), c(26, 56))
) %>% move(trajectory = list(c(15, 50)), start = 5000, end = 3000, snapshots = 10)
# Define gene flow events
gf <- list(
gene_flow(from = ana, to = yam, rate = 0.5, start = 6500, end = 6400, overlap = FALSE),
gene_flow(from = ana, to = eur, rate = 0.5, start = 8000, end = 6000),
gene_flow(from = yam, to = eur, rate = 0.75, start = 4000, end = 3000)
)
# Compile model
model <- compile_model(
populations = list(afr, ooa, ehg, eur, ana, yam), # populations defined above
gene_flow = gf,
generation_time = 30,
resolution = 10e3, # resolution in meters per pixel
competition = 130e3, mating = 100e3, # spatial interaction parameters
dispersal = 70e3, # how far can offspring end up from their parents
)
# Run model in SLiM
slim(model, sequence_length = 10e6, recombination_rate = 1e-8,
save_locations = TRUE, method = "batch", random_seed = 314159) After slim has been run, we need to save the model coordinates ts <- ts_load(model, file=file.path(model$path, "output_slim.trees"))
cnames = c("pedigree_id", "location_x", "location_y")
individuals <- ts_data(ts) %>% dplyr::distinct(ind_id, .keep_all = TRUE) %>% dplyr::mutate(location_x = as.vector(sf::st_coordinates(location)[, 1]), location_y = as.vector(sf::st_coordinates(location)[, 2])) %>% dplyr::select(pedigree_id, location_x, location_y) %>% as.data.frame
write.table(individuals[, cnames], sep="\t", row.names=FALSE, file="individuals.metadata.tsv") Standing in the simulation output directory, we load tree sequences import tskit
import pandas as pd
ts = tskit.load("output_slim.trees")
# Dump tables
tables = ts.dump_tables()
metadata = pd.read_table("individuals.metadata.tsv")
metadata.set_index(['pedigree_id'], inplace=True)
schema_dict = tables.individuals.metadata_schema.asdict()
tables.individuals.clear()
from collections import OrderedDict
schema_dict["properties"]["location_x"] = OrderedDict({'binaryFormat': 'f', 'description': 'x coordinate (longitude) in EPSG:3035', 'index': 8, 'type': 'number', 'default': 0.0})
schema_dict["properties"]["location_y"] = OrderedDict({'binaryFormat': 'f', 'description': 'y coordinate (latitude) in EPSG:3035', 'index': 9, 'type': 'number', 'default': 0.0})
tables.individuals.metadata_schema = tskit.MetadataSchema(schema_dict)
for row in list(ts.tables.individuals):
md = row.metadata
d = metadata.loc[md['pedigree_id']].to_dict()
md.update(**d)
row = row.replace(metadata=md)
tables.individuals.append(row)
# Make new tree sequence
ts = tables.tree_sequence() Next we calculate gnn coordinates and plot using geoviews import geoviews as gv
gv.extension('bokeh')
import geoviews.tile_sources as gts
from geoviews import opts
# Needed to project EPSG:3035 coordinates correctly
from cartopy import crs
import numpy as np
from tqdm import tqdm
np.random.seed(79)
# Select a subset of the 10000 samples; otherwise takes 1-2h
# to calculate gnn
samplesidx = sorted(np.random.choice(np.arange(ts.num_samples), 50, replace=False))
d = pd.DataFrame(0, index=np.arange(len(samplesidx)), columns=['id', 'pedigree_id', 'x', 'y', 'u', 'v'])
d['id'] = samplesidx
d.set_index('id', inplace=True)
for u in samplesidx:
md = ts.individual(ts.node(u).individual).metadata
d.loc[u]['x'] = md["location_x"]
d.loc[u]['y'] = md["location_y"]
d.loc[u]['pedigree_id'] = md["pedigree_id"]
d['color'] = 'blue'
# Calculate GNN coordinates
for tree in tqdm(ts.trees()):
for u in samplesidx:
x = 0
y = 0
n = 0
focal_node = tree.parent(u)
for s in tree.samples(focal_node):
if s != u:
loc_x = ts.individual(ts.node(s).individual).metadata["location_x"]
loc_y = ts.individual(ts.node(s).individual).metadata["location_y"]
n += 1
x += d.loc[u]['x'] - loc_x
y += d.loc[u]['y'] - loc_y
d.loc[u]['u'] += x * tree.span / n
d.loc[u]['v'] += y * tree.span / n
# Average coordinate of GNN
d['u'] /= ts.sequence_length
d['v'] /= ts.sequence_length
d.reset_index(inplace=True)
# Make gnn data frame that holds the gnn coordinates
gnn = pd.DataFrame(-1, index=np.arange(len(samplesidx)), columns=['id', 'pedigree_id', 'x', 'y', 'u', 'v'])
gnn['id'] = d['id'] + ts.num_samples
gnn['x'] = d['x'].values + d['u'].values
gnn['y'] = d['y'].values + d['v'].values
gnn['color'] = 'green'
# Make geoviews points object of samples and gnns
p = pd.concat([d, gnn])
neighbours = pd.DataFrame(dict(sample_id=d['id'], gnn_id=gnn['id']))
points = gv.Points(p, ['x', 'y'], crs=crs.epsg(3035))
tiles = gv.tile_sources.Wikipedia
nodes = gv.Nodes(points, ['x', 'y', 'id'], ['pedigree_id', 'color'])
graph = gv.Graph((neighbours, nodes), ['sample_id', 'gnn_id'], [])
(tiles * graph).opts(opts.Graph(node_size=8, width=800, height=800, directed=True, arrowhead_length=0.01, node_color='color', edge_line_alpha=0.5, node_alpha=.5, edge_line_width=2)) There must be some glaring error still as one of the gnns is in Madagascar when all should be either Northern Africa or Europe (see attached image)... |
Very nice! There is I think a minor bug - I think that this bit
should be
|
Thanks @petrelharp that seems to do the trick; at least the Madagascan gnns have migrated northward :) |
Very nice! And, hm, the fact that that one sample had GNN in Madagascar tells us that something was different about that sample - the error was worse for samples with lots of nearest neighbors, i.e., samples that don't coalesce into the tree very recently. So - that sample seems to be an outlier? Perhaps that could be communicated somehow? Any ideas? |
Hey @percyfal, thanks for sharing the link over in the slendr repo to your results here In case someone decides to follow up on your slendr code, I would suggest to run the SLiM simulation simply as:
without the Also, for a simple simulation like this, the tree sequence can be loaded simply by I've been thinking about spatial GNN things lately, so I'll be following this issue closely! Cool stuff! |
I discussed with @percyfal some ideas of how to plot a GNN equivalent when we don't have a set of defined subpopulations. One possibility is to look at the nearest neighbours and to summarize their x/y locations, perhaps using an average location and some sort of variance.
Here's a SLiM script we could use for testing (try running it in the SLiM GUI: needs the
world_map_540x217.png
file from the SLiM docs):And here's a rough, non-efficient, way of plotting out the spatial GNN (lots could be improved, and I think there's a bug, but putting it down here for reference):
The text was updated successfully, but these errors were encountered: