Effect of Hydrogens and Kekulization on pKa Prediction #5

Open
arazthexd opened this issue Jun 12, 2024 · 1 comment

Comments

@arazthexd

Congratulations on the great publication!

I was trying out your model and code for a project of mine. I wanted a rough estimate of the ratios of the most common protomers of a molecule, and I planned to derive it from the predicted pKa value of each atom. The difficulty is molecules with more than one ionizable atom, each of which can be in a different protonation state. While trying out QupKake I made some observations that made me doubt whether this is possible with it, so I wanted to share them, hear your thoughts, and ask whether there is a way to do this.
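To make the goal concrete, here is a minimal sketch of the kind of estimate I have in mind. It applies the Henderson-Hasselbalch relation to each site and multiplies the per-site fractions, which assumes that the sites titrate independently of each other (exactly the assumption the observations below call into question). The function names and site labels are my own, and the pKa values are rounded from the first eprosartan run reported further down.

```python
from itertools import product


def protonated_fraction(pka: float, ph: float) -> float:
    """Henderson-Hasselbalch: fraction of a site that still carries its proton at this pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def protomer_populations(site_pkas: dict[str, float], ph: float) -> dict[tuple[str, ...], float]:
    """Population of every protonation pattern, ASSUMING independent sites.

    Each key is the tuple of site names that are protonated in that pattern.
    """
    names = list(site_pkas)
    populations = {}
    for pattern in product([True, False], repeat=len(names)):
        weight = 1.0
        for name, is_protonated in zip(names, pattern):
            f = protonated_fraction(site_pkas[name], ph)
            weight *= f if is_protonated else 1.0 - f
        key = tuple(n for n, p in zip(names, pattern) if p)
        populations[key] = weight
    return populations


# Rounded predictions from the first run in this issue; the labels are illustrative only.
# For the basic site, the pKa refers to its protonated (conjugate acid) form.
sites = {"N5": 6.28, "O17-H": 3.75, "O28-H": 3.87}
pops = protomer_populations(sites, ph=7.4)
for protonated_sites, population in sorted(pops.items(), key=lambda kv: -kv[1]):
    print(protonated_sites, f"{population:.4f}")
```

The populations sum to one by construction, but they are only as good as the independence assumption: if the pKa of one site shifts when a neighbouring site changes state, a single pKa per atom is not enough.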

What raised the doubt is that the predictions differ depending on the protonation state of the input and on the SMILES format (canonical or kekulized). Here is an example.

  1. Consider the kekulized SMILES for eprosartan: 'CCCCC1=NC=C(\C=C(/CC2=CC=CS2)C(O)=O)N1CC1=CC=C(C=C1)C(O)=O'
    When provided with this SMILES, the output looks like this:
    basic:
    idx=5: pka=6.281378, basic
    idx=17: pka=6.023213, basic (?!)
    acidic:
    idx=17: pka=3.745246, acidic
    idx=28: pka=3.870438, acidic

In the results above, everything looks reasonable except the basic pKa of atom 17, which should be much lower.

  1. If the same molecule is provided without kekulization ('CCCCc1ncc(/C=C(\Cc2cccs2)C(=O)O)n1Cc1ccc(C(=O)O)cc1'), the result looks as follows:
    basic:
    idx=5: pka=6.265716, basic
    idx=18: pka=6.035862, basic (?!)
    idx=27: pka=6.107231, basic (?!)
    acidic:
    idx=18: pka=3.744408, acidic
    idx=27: pka=3.866692, acidic

The pKa prediction module seems to deviate very little from the previous results, but I wonder why another carboxylic acid is enumerated as basic when the input SMILES changes. I also wanted to ask why you think the model predicts such high basic pKa values for carboxylic acids. I would be grateful to read your comments.
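The comparison between the two runs can be sketched as follows. The snippet assumes that the reported atom indices follow the order in which heavy atoms are written in the SMILES string, which is consistent with indices 17/28 and 18/27 all landing on oxygen atoms. The regex tokenizer is a rough stand-in for a real SMILES parser, and the index correspondence `kek_to_aro` is worked out by hand: the kekulized string writes each acid as `C(O)=O` and the other string as `C(=O)O`, so the hydroxyl oxygen moves by one position within each acid group.

```python
import re

# Matches heavy-atom tokens only: bracket atoms, Cl/Br, then one-letter organic-subset atoms.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def atom_tokens(smiles: str) -> list[str]:
    """Heavy-atom tokens in writing order (list position = assumed atom index)."""
    return ATOM_TOKEN.findall(smiles)


kekulized = r"CCCCC1=NC=C(\C=C(/CC2=CC=CS2)C(O)=O)N1CC1=CC=C(C=C1)C(O)=O"
aromatic = r"CCCCc1ncc(/C=C(\Cc2cccs2)C(=O)O)n1Cc1ccc(C(=O)O)cc1"
kek_atoms, aro_atoms = atom_tokens(kekulized), atom_tokens(aromatic)

# Hand-derived correspondence between the two atom orderings (an assumption of this sketch).
kek_to_aro = {5: 5, 17: 18, 28: 27}

# Predictions copied from the two runs above, keyed by (kind, atom index).
kek_pred = {("basic", 5): 6.281378, ("basic", 17): 6.023213,
            ("acidic", 17): 3.745246, ("acidic", 28): 3.870438}
aro_pred = {("basic", 5): 6.265716, ("basic", 18): 6.035862, ("basic", 27): 6.107231,
            ("acidic", 18): 3.744408, ("acidic", 27): 3.866692}

rows = []
for kind in ("basic", "acidic"):
    for k_idx, a_idx in kek_to_aro.items():
        k_val = kek_pred.get((kind, k_idx))
        a_val = aro_pred.get((kind, a_idx))
        if k_val is None and a_val is None:
            continue  # site not reported in either run
        delta = None if None in (k_val, a_val) else a_val - k_val
        rows.append((kind, k_idx, a_idx, kek_atoms[k_idx], k_val, a_val, delta))

for kind, k_idx, a_idx, symbol, k_val, a_val, delta in rows:
    shown = "reported in one run only" if delta is None else f"delta={delta:+.6f}"
    print(f"{kind:6s} {symbol} kekulized idx={k_idx:2d} aromatic idx={a_idx:2d} {shown}")
```

Under that correspondence, every value reported in both runs agrees to within 0.02 pKa units, and the only real difference is the extra basic site on the second carboxylic acid in the non-kekulized run. That points at the site enumeration step rather than the pKa prediction itself.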

  1. Now let's consider the same kekulized SMILES, but with one of the carboxylic acids already ionized: 'CCCCC1=NC=C(\C=C(/CC2=CC=CS2)C(O)=O)N1CC1=CC=C(C=C1)C([O-])=O'
    Here is the result:
    basic:
    idx=5: pka=6.218657, basic
    idx=17: pka=5.955811, basic (?!)
    idx=28: pka=4.008273, basic
    acidic:
    idx=17: pka=3.568614, acidic

The prediction for atom 28 makes a lot of sense and is close to its predicted acidic pKa in the first results. What I found interesting was the drop in the acidic pKa of atom 17: I expected a rise because of the net charge of the molecule. Perhaps such a molecule is outside the model's applicability domain, since I didn't see any already-ionized molecules in the training data, but I'm not sure whether this is the case. If it is, it might be reasonable to neutralize already-ionized inputs before prediction.

Another thing that caught my eye was that the predictions also differed depending on whether the SMILES had explicit or implicit hydrogens, which again shouldn't matter, I think.

@arazthexd

Another example of the effect of protonation state, and of a weird prediction, is this molecule:
image

When given the neutral form of the molecule, the pKa of the carboxylic acid group is predicted to be 3.96457, but when the amine group is protonated and positively charged in the input, the pKa of the carboxylic acid group is predicted to be 5.50159.

As I mentioned, it's understandable that the model was not trained on such data, but this seems to be an interesting trend that I'm seeing in almost all examples, and I would like to discuss why it behaves this way.
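One way to frame why this trend looks odd is the thermodynamic cycle that links the four microscopic pKa values of a molecule with one acidic site and one basic site. The sketch below uses only the two values quoted above and assumes that both are microscopic pKa values of the same carboxylic acid group; the function name is hypothetical.

```python
def implied_base_shift(pka_acid_base_neutral: float, pka_acid_base_protonated: float) -> float:
    """Shift in the amine pKa implied by the acid pKa values, from closing the cycle.

    Going from the fully protonated form to the fully deprotonated form must cost
    the same along both routes:
        pKa(acid | amine H+) + pKa(amine | acid ionized)
            = pKa(amine | acid neutral) + pKa(acid | amine neutral)
    Rearranged:
        pKa(amine | acid ionized) - pKa(amine | acid neutral)
            = pKa(acid | amine neutral) - pKa(acid | amine H+)
    """
    return pka_acid_base_neutral - pka_acid_base_protonated


# Values reported above: 3.96457 (neutral amine) and 5.50159 (protonated amine).
shift = implied_base_shift(3.96457, 5.50159)
print(f"{shift:+.3f}")  # prints -1.537
```

Taken at face value, the two predictions would imply that the amine becomes about 1.5 pKa units less basic once the carboxylic acid is ionized. A nearby negative charge would usually be expected to do the opposite, which is another hint that charged inputs fall outside what the model handles consistently.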
