Questions about active learning with relative label problem #13

phonglam3103 · 2024-05-31T21:56:51Z

Dear Dr. Walters and community,

I appreciate the detailed tutorial on developing an active learning classifier. However, I have some concerns about the feasibility of using a purely computational oracle, such as docking or binding free energy (BFE) calculations, in prospective drug discovery campaigns where no experimental data are available in advance.

Challenge of Labeling Virtual Actives
In various papers I've read, such as the Deep Docking workflow from [10.1021/acscentsci.0c00229], researchers typically label the top 1% of docking scores as virtual actives and the remainder as virtual inactives. While this approach is straightforward, it can complicate the model and sampling process.
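As a minimal sketch of the relative labeling rule described above (not code from the cited paper), the snippet below labels the best-scoring 1% of a docked batch as virtual actives. The function name `label_top_fraction`, the simulated Gaussian scores, and the convention that lower docking scores are better are all assumptions made for illustration.

```python
import random

def label_top_fraction(scores, fraction=0.01):
    """Label the best-scoring `fraction` of compounds as virtual actives (1).

    Assumes lower (more negative) docking scores are better.
    Returns (labels, cutoff); any ties at the cutoff are all labeled active.
    """
    n_active = max(1, round(len(scores) * fraction))
    cutoff = sorted(scores)[n_active - 1]
    labels = [1 if s <= cutoff else 0 for s in scores]
    return labels, cutoff

# Simulated docking scores for a first batch of 100K compounds (hypothetical data).
rng = random.Random(0)
scores = [rng.gauss(-7.0, 1.0) for _ in range(100_000)]

labels, cutoff = label_top_fraction(scores)
print(f"virtual actives: {sum(labels)}, cutoff score: {cutoff:.2f}")
```

Note that the cutoff is a property of the batch rather than of any single compound, which is the root of the relabeling concern raised in the following paragraphs.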

One issue arises when, in the first iteration, 100K compounds are docked and 1K are labeled as virtual actives. In subsequent iterations, we predict the labels of the remaining compounds and likely select the next 100K compounds with the highest probability of being active for further docking. This process can lead to a situation where the initial 1K compounds may change their labels from active (1) to inactive (0), introducing bias and potentially compromising the model’s integrity.
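The active-to-inactive flip can be reproduced with a small simulation. This is only a sketch under stated assumptions: the batch sizes are scaled down to 10K, the scores are drawn from Gaussians, and the greedily selected second batch is assumed to score better on average than the random first batch. The helper `top_fraction_cutoff` is a hypothetical name.

```python
import random

def top_fraction_cutoff(scores, fraction=0.01):
    """Score of the worst compound still inside the top `fraction` (lower is better)."""
    n_active = max(1, round(len(scores) * fraction))
    return sorted(scores)[n_active - 1]

rng = random.Random(1)
batch1 = [rng.gauss(-7.0, 1.0) for _ in range(10_000)]  # random first batch
batch2 = [rng.gauss(-8.0, 1.0) for _ in range(10_000)]  # assumed enriched greedy batch

cutoff1 = top_fraction_cutoff(batch1)            # top 1% of iteration 1 only
cutoff2 = top_fraction_cutoff(batch1 + batch2)   # top 1% of the pooled training set

actives_before = [s for s in batch1 if s <= cutoff1]
still_active = [s for s in actives_before if s <= cutoff2]
flipped = len(actives_before) - len(still_active)

print(f"cutoff moved from {cutoff1:.2f} to {cutoff2:.2f}")
print(f"{flipped} of {len(actives_before)} initial actives are now labeled inactive")
```

In this toy setting the pooled cutoff becomes stricter, so most of the first-iteration actives lose their label once the enriched batch is added.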

Another challenge occurs with uncertainty sampling. In this method, the least confident compounds are included in the next iteration, expanding the training set from 100K to 200K compounds. This change impacts the initial percentile calculation, possibly leading to the reclassification of compounds from inactive to active, which can disrupt the model and its predictions.
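The sketch below illustrates both halves of this paragraph under simplifying assumptions: `least_confident` picks the compounds whose predicted probability of being active is closest to 0.5, and a simulated second batch that scores slightly worse on average loosens the pooled top-1% cutoff, promoting some first-iteration inactives to active. Function names, batch sizes, and score distributions are hypothetical.

```python
import random

def least_confident(probabilities, k):
    """Indices of the k compounds whose predicted P(active) is closest to 0.5."""
    order = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return order[:k]

def top_fraction_cutoff(scores, fraction=0.01):
    """Score of the worst compound still inside the top `fraction` (lower is better)."""
    n_active = max(1, round(len(scores) * fraction))
    return sorted(scores)[n_active - 1]

# Uncertainty sampling on a handful of made-up predicted probabilities.
picked = least_confident([0.02, 0.48, 0.91, 0.55, 0.10], k=2)
print(picked)  # [1, 3]

rng = random.Random(3)
batch1 = [rng.gauss(-7.0, 1.0) for _ in range(10_000)]  # first batch
batch2 = [rng.gauss(-6.0, 1.0) for _ in range(10_000)]  # assumed weaker uncertain batch

cutoff1 = top_fraction_cutoff(batch1)
cutoff2 = top_fraction_cutoff(batch1 + batch2)

# Compounds that were inactive under cutoff1 but fall inside the pooled top 1%.
promoted = sum(1 for s in batch1 if cutoff1 < s <= cutoff2)
print(f"cutoff moved from {cutoff1:.2f} to {cutoff2:.2f}; {promoted} inactives became active")
```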

Given these challenges, I am seeking insights or suggestions on how to manage the dynamic changes in compound labels across iterations to maintain model fairness and accuracy. Specifically:

1. How can we handle dynamic label changes to ensure the model remains unbiased and accurate?
2. What are the best practices for managing expanding training sets in uncertainty sampling without affecting initial label assignments?

I hope my explanation is clear. I look forward to your thoughts and suggestions on these issues.

Best regards,
Phong.

Answered by PatWalters

PatWalters · 2024-06-03T15:07:52Z

Maintainer

Hi Phong,

Thanks for your detailed question. For virtual screening, I always use a regression model, so I don't have to worry about the threshold shifts you describe. However, even with a classification model, I don't think the issue you mention will be a factor. Relabeling specific instances shouldn't be a problem if a model is retrained at each active learning iteration.

Best,

Pat
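To make the regression suggestion concrete, here is a toy sketch of an active learning loop that stores raw oracle scores and retrains from scratch at each iteration, so no threshold or relabeling is involved. It is not the tutorial's implementation: the one-descriptor library, the noisy linear `oracle` standing in for docking, the hand-written `fit_line` least-squares model, and the greedy acquisition rule are all assumptions chosen to keep the example dependency-free.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(2)

# Hypothetical toy library: a single descriptor per compound.
library = [rng.uniform(0.0, 1.0) for _ in range(5_000)]

def oracle(x):
    """Stand-in for docking: lower score is better, with added noise."""
    return -5.0 * x + rng.gauss(0.0, 0.5)

batch_size = 200
unscored = set(range(len(library)))
scored = {}  # compound index -> raw oracle score; never thresholded or relabeled

batch = rng.sample(sorted(unscored), batch_size)  # iteration 0: random batch
for iteration in range(3):
    for i in batch:
        scored[i] = oracle(library[i])
    unscored.difference_update(batch)

    # Retrain from scratch on every raw score collected so far.
    xs = [library[i] for i in scored]
    ys = [scored[i] for i in scored]
    slope, intercept = fit_line(xs, ys)

    # Greedy acquisition: compounds with the lowest predicted score come next.
    ranked = sorted(unscored, key=lambda i: slope * library[i] + intercept)
    batch = ranked[:batch_size]

    print(f"iteration {iteration}: {len(scored)} scored, slope = {slope:.2f}")
```

Because the targets are the scores themselves, adding a new batch changes what the model learns but never changes the training value attached to an earlier compound.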
