BUG in construct_graph #418

Open
kamurani opened this issue Feb 18, 2025 · 2 comments · May be fixed by #419

Comments

@kamurani
Contributor

I have found quite a severe bug in construct_graph when using the add_distance_threshold edge construction function.

Essentially, the graph contains spurious edges between nodes that are farther apart than the provided threshold distance in Angstroms.

I am opening this issue now, before I have diagnosed the cause, and I intend to track down why this is happening.

This is potentially a very serious bug, as using threshold_distance edges is a common protein graph construction technique, and it may leave people unaware of the true structure of their data.

Version

  • MacOSX
  • Python 3.12.4
import graphein 
graphein.__version__
# '1.7.7'

Steps to reproduce bug

Download the AlphaFold predicted structure for the UniProt ID below and save it as P04629.pdb in the working directory.

uniprot_id = "P04629"

from functools import partial
from graphein.protein import construct_graph
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.edges.distance import add_distance_threshold

u, v = 'A:MET:1', 'A:LEU:526'

new_edge_funcs = {
    "edge_construction_functions": [
        partial(add_distance_threshold, long_interaction_threshold=5, threshold=10.0)
    ]
}
config = ProteinGraphConfig(**new_edge_funcs)
g_dist_all = construct_graph(
    path=f"{uniprot_id}.pdb",
    config=config,
)

g_dist_all.get_edge_data(u, v)
# {'kind': {'distance_threshold'}, 'distance': 48.41152358823357}

Note that this edge appears even though the node-node distance (about 48.4 Å) is well over the specified threshold of 10 Å!
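To gauge how many edges are affected, one option is to scan every edge and flag those whose stored distance exceeds the threshold. This is a minimal sketch, assuming each edge carries a `distance` attribute as in the output above. The helper name `find_overlong_edges` is hypothetical and not part of graphein, and the toy edge list (including the made-up node `A:ALA:7`) only stands in for the real graph.

```python
from typing import Any, Dict, Iterable, List, Tuple

Edge = Tuple[str, str, Dict[str, Any]]


def find_overlong_edges(
    edges: Iterable[Edge], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return (u, v, distance) for every edge whose stored distance exceeds threshold.

    Hypothetical helper for illustration; not a graphein function.
    """
    return [
        (u, v, data["distance"])
        for u, v, data in edges
        if data.get("distance", 0.0) > threshold
    ]


# Toy edge list standing in for g_dist_all.edges(data=True).
toy_edges = [
    ("A:MET:1", "A:LEU:526", {"kind": {"distance_threshold"}, "distance": 48.41152358823357}),
    ("A:MET:1", "A:ALA:7", {"kind": {"distance_threshold"}, "distance": 8.2}),
]

bad = find_overlong_edges(toy_edges, threshold=10.0)
print(bad)
```

On the real graph this would be called as `find_overlong_edges(g_dist_all.edges(data=True), threshold=10.0)`, since a NetworkX graph yields `(u, v, data)` tuples from `edges(data=True)`.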

@a-r-j
Owner

a-r-j commented Feb 18, 2025

Hi @kamurani thanks for flagging. I could reproduce the error. I think I've tracked down the source.

The issue is the indexing here:

n1 = G.graph["pdb_df"].loc[a1, "node_id"]

For some reason the dataframe index is shifted. I believe it happens in this function, likely at this sort operation:

protein_df = sort_dataframe(protein_df)

I suspect I didn't catch this when switching from biopandas to CPDB for parsing.
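The suspected mechanism can be illustrated with a toy dataframe. This is only a sketch of how a sort can leave index labels out of step with row positions, assuming the sort behaves like pandas `sort_values`; it is not graphein's actual code, and the column values are made up.

```python
import pandas as pd

# Toy stand-in for the parsed structure; the rows are deliberately out of order.
df = pd.DataFrame(
    {
        "node_id": ["A:LEU:3", "A:MET:1", "A:ALA:2"],
        "residue_number": [3, 1, 2],
    }
)

# Sorting reorders the rows but keeps the old index labels (1, 2, 0).
sorted_df = df.sort_values("residue_number")

# Indices taken from a distance matrix are positional: position 0 is the
# first row of sorted_df, i.e. A:MET:1.
position = 0
by_position = sorted_df.iloc[position]["node_id"]  # row at position 0
by_label = sorted_df.loc[position, "node_id"]      # row whose index label is 0

print(list(sorted_df.index))
print(by_position, by_label)
```

With a stale index, `.loc` returns a different residue than the one the positional index refers to, which would attach a correctly computed distance to the wrong pair of nodes.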

I see two options: (1) add an index reset in process_dataframe, or (2) change the indexing in the edges function to use iloc rather than loc. I prefer the first option. Would you be able to try this and make a PR?
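Both options can be sketched on a toy dataframe. This assumes the stale index comes from a pandas sort; the data is made up, and where exactly the reset would go inside process_dataframe is not shown here.

```python
import pandas as pd

# Toy dataframe whose index labels (1, 2, 0) no longer match row positions.
df = pd.DataFrame(
    {"node_id": ["A:LEU:3", "A:MET:1", "A:ALA:2"], "residue_number": [3, 1, 2]}
).sort_values("residue_number")

# Option 1: reset the index after sorting so labels match positions again,
# which keeps label-based .loc lookups valid.
reset_df = df.reset_index(drop=True)
option_1 = [reset_df.loc[i, "node_id"] for i in range(len(reset_df))]

# Option 2: leave the index alone and look rows up by position with .iloc.
node_col = df.columns.get_loc("node_id")
option_2 = [df.iloc[i, node_col] for i in range(len(df))]

print(option_1)
print(option_2)
```

Resetting the index once protects every later `.loc` lookup on the dataframe, whereas the `iloc` change only fixes the call sites that are updated.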

@kamurani
Contributor Author

I will have a chance to work on it today; I'll have a go and submit a PR.

2 participants