Add utility function to compute single-graph semantic similarity #41

Open
bschilder opened this issue Aug 23, 2024 · 4 comments

bschilder commented Aug 23, 2024

A very common thing users might want to do is to compute the semantic similarity between nodes in a graph and then store that data back in the edges of the graph (to use as edge weights later).

I've created the following function to automate this.

#' Add semantic similarity
#'
#' First computes semantic similarity between all pairs of nodes in a graph.
#' Then adds the continuous similarity score as an edge attribute.
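#' @param graph A graph object compatible with both igraph and tidygraph
#'  (e.g. a \code{tbl_graph}).
#' @param sim_fun The similarity function to use.
#'  Default is \link[igraph]{similarity}.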
#' @param sim_col Name of the new edge attribute to store the similarity score.
#' @param ... Additional arguments passed to the similarity function
#'  (\code{sim_fun}).
#' @returns Graph object with similarity added as a new edge attribute.
#' @export
#' @examples
#' filename <- system.file("extdata", "eds_marfan_kg.tar.gz", package = "monarchr")
#' g <- file_engine(filename) |>
#'           fetch_nodes(query_ids = "MONDO:0007525") |>
#'           expand(predicates = "biolink:has_phenotype",
#'                  categories = "biolink:PhenotypicFeature")|>
#'           expand(categories = "biolink:Gene")
#' g <- graph_semsim(g)
#' edges(g)$similarity
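graph_semsim <- function(graph,
					    sim_fun = igraph::similarity,
					    sim_col = "similarity",
					    ...) {
	# Silence R CMD check notes about non-standard evaluation.
	from <- to <- NULL
	message("Computing pairwise node similarity.")
	# X is a square node-by-node similarity matrix, in vertex order.
	X <- sim_fun(graph, ...)
	rownames(X) <- colnames(X) <- igraph::V(graph)$name
	# `from` and `to` are integer node indices, so they index X directly.
	graph <- graph |>
		tidygraph::activate(edges) |>
		dplyr::mutate(!!sim_col := purrr::map2_dbl(from, to, ~ X[.y, .x]))
	return(graph)
}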
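
To make it concrete what ends up on the edges, here is a minimal sketch on a toy graph rather than a Monarch KG. The node names and edges are made up for illustration. With the default `igraph::similarity`, the score is the Jaccard index of the two endpoints' neighbour sets. The sketch also uses matrix indexing (`X[cbind(to, from)]`) as a vectorised alternative to `purrr::map2_dbl`; that is a suggestion, not what the function above does.

```r
# Toy undirected graph (hypothetical data): a triangle A-B-C plus a
# pendant node D attached to C.
g <- tidygraph::tbl_graph(
  nodes = data.frame(name = c("A", "B", "C", "D")),
  edges = data.frame(from = c(1L, 2L, 1L, 3L), to = c(2L, 3L, 3L, 4L)),
  directed = FALSE
)

# Node-by-node similarity matrix; the default method is Jaccard.
X <- igraph::similarity(g)

# Look up one score per edge with a two-column index matrix
# (row = `to`, column = `from`), which avoids a per-edge loop.
g2 <- g |>
  tidygraph::activate(edges) |>
  dplyr::mutate(similarity = X[cbind(to, from)])

# Pull the edge table back out to inspect the new attribute.
edge_tbl <- g2 |>
  tidygraph::activate(edges) |>
  tidygraph::as_tibble()
```

Because the score only reflects shared neighbours in the local graph, two directly connected nodes can still score 0 when they have no neighbours in common.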

bschilder commented Aug 23, 2024

Of course, this won't yield the "true" semantic similarity between the nodes unless your graph is sufficiently large and complete to accurately characterise the relationships between them. Still, it can come in handy.


bschilder commented Aug 23, 2024

monarch_semsim does something a bit like this, but instead compares two graphs to each other.
Is it possible to query the semantic similarity API for a single graph instead (so we can get accurate similarity scores between all edge-connected nodes within the graph)? It might be a nice complement to my graph_semsim function, which only considers the local graph instance.

bschilder changed the title from "Add utility function to compute semantic similarity" to "Add utility function to compute single-graph semantic similarity" on Aug 23, 2024

oneilsh commented Aug 27, 2024

The Monarch semsim API does take two sets of node IDs, and computes the best match from each in set A to those in set B, and vice-versa (essentially designed to support https://monarchinitiative.org/explore#phenotype-explorer).

I suppose for a single graph we could just pass the nodes as both set A and set B, but the functionality giving just the best match means that each node will just be reported to match itself (I think). Perhaps this is a feature request for the API - all-vs-all semantic similarity queries. Could be intensive though given the O(n^2) nature of the result. Tagging @kevinschaper
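
A small sketch of the two points above, using a made-up similarity matrix rather than the real API (the IDs and scores are hypothetical, and it assumes a node's similarity to itself is the maximum). It shows why a best-match query with set A equal to set B would return each node itself, and how many distinct pairs an all-vs-all query has to cover.

```r
# Hypothetical node IDs and a made-up symmetric similarity matrix.
ids <- c("n1", "n2", "n3", "n4")
S <- matrix(
  c(1.0, 0.6, 0.2, 0.1,
    0.6, 1.0, 0.5, 0.3,
    0.2, 0.5, 1.0, 0.4,
    0.1, 0.3, 0.4, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(ids, ids)
)

# Best match per node when A == B: the diagonal wins, so every node
# is matched to itself.
best_self <- ids[apply(S, 1, which.max)]

# Masking the diagonal gives the best match among the *other* nodes.
S_noself <- S
diag(S_noself) <- NA
best_other <- ids[apply(S_noself, 1, which.max)]

# All-vs-all needs one score per unordered pair: n * (n - 1) / 2.
n_pairs <- choose(length(ids), 2)
```

The pair count grows quadratically, so a graph with 10,000 nodes already implies roughly 50 million scores.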

bschilder commented

> The Monarch semsim API does take two sets of node IDs, and computes the best match from each in set A to those in set B, and vice-versa (essentially designed to support https://monarchinitiative.org/explore#phenotype-explorer).
>
> I suppose for a single graph we could just pass the nodes as both set A and set B, but the functionality giving just the best match means that each node will just be reported to match itself (I think). Perhaps this is a feature request for the API - all-vs-all semantic similarity queries. Could be intensive though given the O(n^2) nature of the result. Tagging @kevinschaper

Yeah, instead of returning only the single most similar node, I'd want to return each node's similarity to every other node. I agree that for large graphs this would be a massive computation. Perhaps it could be precomputed and stored as a separate database (version-controlled and regenerated for each KG release). Is this something that would be feasible, @kevinschaper?
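
As a rough illustration of the precomputed-table idea, here is a minimal sketch in base R. Everything in it is an assumption for illustration: the IDs, the scores, the `sim_to_pairs()` helper, the release tag, and the file naming are all hypothetical, not an existing Monarch resource. It flattens a symmetric similarity matrix to one row per unordered pair and writes it to a file named after the KG release.

```r
# Hypothetical node IDs and a made-up symmetric similarity matrix.
ids <- c("n1", "n2", "n3", "n4")
S <- matrix(
  c(1.0, 0.6, 0.2, 0.1,
    0.6, 1.0, 0.5, 0.3,
    0.2, 0.5, 1.0, 0.4,
    0.1, 0.3, 0.4, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(ids, ids)
)

# Hypothetical helper: keep only the upper triangle, since the matrix
# is symmetric and the diagonal is uninformative.
sim_to_pairs <- function(S) {
  idx <- which(upper.tri(S), arr.ind = TRUE)
  data.frame(
    subject = rownames(S)[idx[, "row"]],
    object  = colnames(S)[idx[, "col"]],
    score   = S[idx],
    stringsAsFactors = FALSE
  )
}

pairs <- sim_to_pairs(S)

# Write one table per KG release (the tag below is a placeholder).
kg_version <- "example-release"
path <- file.path(tempdir(), paste0("semsim_", kg_version, ".tsv"))
write.table(pairs, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Reading it back gives the cached scores without recomputing them.
cached <- read.table(path, sep = "\t", header = TRUE,
                     quote = "", comment.char = "",
                     stringsAsFactors = FALSE)
```

Storing only the upper triangle roughly halves the table size, at the cost of having to check both orderings of a pair when looking up an edge.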
